Feature Engineering

1 Introduction

  • Feature engineering:
    • finding/creating new datasets and attributes
    • improving current features
    • representing features appropriately

2 Transformations

  • Handle skew
  • Center (subtract the mean from each value)
  • Scale (divide the centered values by the standard deviation; see the sketch just below)
  • Results in numeric variables with mean 0 and standard deviation 1
  • Ultimately you're solving a bunch of simultaneous equations, and the scale of the variables can matter
  • Other transformations can be chosen using the Box-Cox procedure
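  • A minimal sketch of centering and scaling in R (df and x are hypothetical names, not columns from the property data):
df <- data.frame(x = c(10, 20, 30, 40, 50))
centered <- df$x - mean(df$x)                #center: subtract the mean
scaled <- centered / sd(df$x)                #scale: divide by the standard deviation
all.equal(as.numeric(scale(df$x)), scaled)   #scale() does both steps in one call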

3 Box-Cox

  • Essentially produces a value, lambda, which is used as a power (exponent) for the original variable (the transform is written out at the end of this section)
  • If lambda is 0, take the log instead
with(ppr_train, BoxCoxTrans(price))
  • So, nicely enough, it supports my decision to choose a log-transform
  • If we look at the counts from daily_data earlier, it suggests 0.4, which is close to a square-root transform (lambda = 0.5)
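  • For reference, the transform itself is the standard Box-Cox definition:
\[
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\
  \log(y) & \text{if } \lambda = 0
\end{cases}
\]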

4 Categorical data

  • Often many choices in how to represent
  • How many levels should one allow?
  • Do these categories have any hierarchical structure?
  • Should all levels of a categorical variable be included? (yes for trees, no for linear models; see the sketch below)
  • Some categories may define the unit of analysis (e.g. customer_id, order_id, time)
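  • A tiny illustration of the "all levels" point, using a hypothetical three-level factor rather than the property data:
f <- factor(c("a", "b", "c", "a"))
model.matrix(~ f)       #intercept plus two dummy columns; the reference level "a" is dropped
model.matrix(~ f - 1)   #one column per level; redundant alongside an intercept, but fine for trees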

5 Quick model

library(dplyr)
library(caret)
#drop identifiers and the untransformed outcome before modelling
ppr_train3 <- select(ppr_train2, -postal_code, 
                     -address, 
                     -price, -date_of_sale)
#sample to speed up iteration (see the next section)
ppr_train_s <- sample_frac(ppr_train3, size=0.1)
ppr_test_samp <- sample_frac(ppr_train3, 
                             size=0.05)
#repeated 10-fold cross-validation
train_ctrl <- trainControl(method="repeatedcv", 
                           repeats=5)
ppr_glmnet <- train(log_price~.,
                    data=ppr_train_s,
                    method="glmnet",
                    trControl=train_ctrl)
preds <- predict(ppr_glmnet, newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds <- preds

6 Sampling

  • The first rule of practical model-building is to sample judiciously
  • This is entirely a device to speed up iteration, and can often be a necessity forced upon the project by a lack of resources
#back-transform from the log10 scale to the original price scale
ggplot(ppr_test_samp, aes(x=10^log_price, y=10^glmnet_preds))+geom_point()

predsvobs_glmnet.png

  • The disparity between the two sets of data (individual vs commercial) is very, very apparent
  • Do we care about the extremely large examples?

7 Modelling Objectives

  • We need to be very clear with what we are trying to accomplish
  • With this data, I was trying to assess where I should buy a house, and how I should value it
  • For that purpose, the huge (€1m-plus) sales add absolutely no value to me, because I can't possibly afford them

8 Recontextualising the Problem

ppr_train_s2 <- filter(ppr_train_s, 
                       log_price<=log(1e6, base=10)) %>% 
    select(-is_big)
ppr_glmnet2 <- train(log_price~.,
                     data=ppr_train_s2,
                     method="glmnet",
                     trControl=train_ctrl)
preds2 <- predict(ppr_glmnet2,
                  newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds2 <- preds2
  • Interestingly enough, the model containing the large prices does better (one way to compare them is sketched below)
  • I don't really have a good explanation for this
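  • One way to compare the two fits is held-out RMSE on the sampled test set (a sketch; it assumes both prediction columns are on the log10 scale, as above):
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
with(ppr_test_samp, rmse(log_price, glmnet_preds))    #model fit on all sales
with(ppr_test_samp, rmse(log_price, glmnet_preds2))   #model fit excluding the 1m-plus sales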

9 Feature Engineering

  • The first step is to examine what data we have
  • Are there any sources of data which could be used better?
  • The obvious idea here is to geocode the addresses
  • This will give us exact location, and allow us to examine more CSO data

10 Python Geocoding Script

  • This is the sad truth about data science:
    • you'll be using multiple tools in order to get your work done
  • In this case, I actually didn't write the script myself
  • I got the script from Shane Lynn, who had geocoded the data up to 2016
  • I used all my free credits from Google Cloud Platform to get this updated to 2018

11 Extra Variables from geocoding

library(readr)
#read the geocoded file first...
ppr_gc <- read_csv(
  "~/Dropbox/PPR/ppr_geocoded_till_oct2018.csv")
#...then split it in two, because github doesn't like large files
ppr_gc1 <- filter(ppr_gc, year<=2016)
ppr_gc2 <- filter(ppr_gc, year>2016)
write_csv(ppr_gc1, path="ppr_gecoded1.csv")
write_csv(ppr_gc2, path="ppr_gecoded2.csv")
names(ppr_gc)[12:24]
date, formatted_address, accuracy, latitude, longitude, postcode, type,
geo_county, electoral_district, electoral_district_id, region, small_area, month
  • To be fair, this isn't everything, but it's a lot more than I had earlier
  • This is the key point of feature engineering:
    • You are much more likely to be successful by adding new datasets rather than new features
    • This is worth remembering in future research

12 Geocoded Data

  • Note that we have the electoral districts and the small areas encoded in the file
  • This provides us with a link into the CSO census data
  • This data is all stored in shapefiles, which also encode the boundaries and area of each unit (electoral district/small area/constituency)
  • What do we get from this?

13 Shape Files!

library(sp)
library(rgdal)
shp <- readOGR("~/Dropbox/PPR/electoral_divisions_gps.shp")
#df stored in this slot
names(shp@data[,13:length(shp@data)]) 
MALE2011, FEMALE2011, TOTAL2011, PPOCC2011, UNOCC2011, VACANT2011,
HS2011, PCVAC2011, LAND_AREA, TOTAL_AREA, CREATEDATE
  • So this is really, really useful
  • Note that HS2011 is the number of households
  • This provides a denominator for the sales of houses
  • We also have vacancies and land area, which we can use to calculate density and other attributes (see the sketch below)
  • Note that despite the names, these are 2016 figures
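  • For example, a couple of derived attributes from the shapefile's attribute table (a sketch using the column names above; check the units of LAND_AREA before interpreting the density):
ed_attrs <- shp@data                                                 #the attribute data frame
ed_attrs$household_density <- ed_attrs$HS2011 / ed_attrs$LAND_AREA   #households per unit of land area
ed_attrs$vacancy_rate <- ed_attrs$VACANT2011 / ed_attrs$HS2011       #share of the housing stock vacant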

14 Pobal Data

  • Pobal are an agency that investigates deprivation across Ireland
  • They produce a dataset which has lots of useful variables on a small area/electoral district area
  • Full details can be found here
  • The dataset itself can be found here

15 Pobal Data Details

  • Note that there is a data dictionary
  • These are always worth looking for; if you can't find one, it's well worth documenting what you find in a dataset yourself, especially if you'll ever need it again
  • You'll need to join your datasets together, obviously (a join sketch is at the end of this section)
  • When that's done, you'll gain the following variables:
pobal <- read_csv("~/Dropbox/PPR/pobal_deprivation_index_ed.csv")
names(pobal)[25:45]
share, POPCHG06, POPCHG11, POPCHG16, AGEDEP06, AGEDEP11, AGEDEP16,
LONEPA06, LONEPA11, LONEPA16, EDLOW_06, EDLOW_11, EDLOW_16,
EDHIGH06, EDHIGH11, EDHIGH16, HLPROF06, HLPROF11, HLPROF16,
LSKILL06, LSKILL11
  • Note that the HP variables are indices of absolute and relative deprivation, and I believe that they are the result of a PCA
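  • A sketch of the join itself (the key columns here are assumptions; check the data dictionary for the actual electoral-district identifier in the Pobal file):
library(dplyr)
#join the geocoded sales onto the Pobal data by electoral district
#(ID06 is assumed to be the Pobal ED identifier; verify against the data dictionary)
ppr_pobal <- left_join(ppr_gc,
                       pobal,
                       by=c("electoral_district_id"="ID06"))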

16 Reassessing the Situation

  • So, sometimes the best feature engineering you can do is find other data :)
  • However, we now have a dataset where we can run interesting variable treatments (i.e. we have lots of continuous variables)

17 Transformations

  • So the best thing to do is to examine the data again, as we did before
sapply(ppr_pobal, n_distinct) %>%
    as.data.frame() %>%
    rownames_to_column() %>%
    arrange(desc(`.`))
variable                  n_distinct
geometry 298501
rowid 298501
input_string 263158
longitude 180897
latitude 177022
postcode 113999
price 18119
small_area 18019
electoral_district_id 3352
ID06 3322
electoral_district 3066
ED_Name 3035
sale_date 2337
TOTPOP16 1751
popdep_T 1751
TOTPOP11 1739
TOTPOP06 1704
TOTPOP02 1673
popdep_L 1304
popdep_M 902
OHOUSE16 596
OHOUSE11 585
EDHIGH16 547
HLPROF16 529
popdep_H 523
HLPROF11 518
EDHIGH11 505
OHOUSE06 500
HLPROF06 497
EDHIGH06 493
PRRENT16 460
LONEPA11 458
PRRENT11 456
LONEPA06 453
LONEPA16 437
EDLOW_06 430
UNEMPM11 419
EDLOW_11 391
PRRENT06 383
LSKILL06 379
HP2016abs 360
HP2011rel 360
LSKILL11 355
HP2016rel 352
HP2006abs 351
HP2006rel 351
LSKILL16 351
HP2011abs 350
EDLOW_16 350
UNEMPM16 347
UNEMPF11 321
type 320
LARENT16 307
UNEMPF16 300
AGEDEP06 295
AGEDEP11 294
AGEDEP16 294
LARENT11 286
LARENT06 284
UNEMPM06 271
UNEMPF06 232
share_H 60
POPCHG11 60
share_T 46
share 45
share_M 44
share_L 42
POPCHG06 39
NUTS4 35
geo_county 34
ppr_county 26
PEROOM11 19
PEROOM16 17
POPCHG16 15
month 12
NUTS3 9
region 8
year 7
PEROOM06 7
description_of_property 6
NUTS2 3
NUTS1 2
  • We can see that the top few variables have far too many levels, and that the bottom few will be problematic as they have only 1-5 levels [1]

18 Missing Data

sapply(ppr_pobal, function(x) 
  sum(is.na(x))/length(x)) %>%
    as.data.frame() %>%
    rownames_to_column() %>%
    dplyr::arrange(desc(`.`)) %>%
  mutate_if(is.numeric, round, 4) %>%
  head(n=5)
  variable  prop_missing
1 postcode 0.5491
2 ID06 0.0012
3 ED_Name 0.0012
4 NUTS4 0.0012
5 NUTS3 0.0012
  • As we can see, it's mostly postcode that's missing
  • Here, it's probably a better idea to just drop it

19 BoxCox parameters

num_vars <- as.data.frame(ppr_pobal) %>%
    na.omit() %>%
    select_if(is.numeric)

bc2 <- lapply(num_vars, 
              BoxCoxTrans, 
              na.rm=TRUE)
lambdas <- sapply(bc2, function (x) x[["lambda"]])
head(lambdas)

year 2
price 0.1
latitude 2
longitude NA
month 0.8
ID06 0.1
  • So we have a bunch of potential transformations to apply
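  • caret's BoxCoxTrans objects have a predict() method, so each estimated transform can be applied directly; for example, for price:
price_bc <- predict(bc2[["price"]], num_vars$price)
head(price_bc)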

20 PCA

  • Principal Components Analysis (PCA) is a data reduction method
  • It essentially examines the covariance between the variables, and returns linear combinations of them (components) ordered by how much variance they explain
  • The first component explains the most variance
  • And each extra component is uncorrelated with the previous
  • This makes it much easier to model the variables
  • However, it can make them much harder to interpret

21 Practical PCA

  • Keeping all components just gives us uncorrelated predictors (which makes the model easier to interpret)
  • However, we normally want to describe as much variance as possible with as few variables as possible
  • This necessitates choosing K components to keep
  • How do we choose K?

22 This is a very very hard problem

  • Because it's unsupervised
  • If we had an outcome metric, we could cross-validate
  • This is much harder without a response variable
  • If we do have an outcome variable, then we would need to repeat the selection of K for each fold (to avoid bias)
  • There are lots of methods, many of which don't agree

23 One method to rule them all

#scale the Pobal variables, then run PCA on them
pobal_vars <- dplyr::select(num_vars, 26:64) %>%
    scale(center=TRUE, scale=TRUE) %>%
    as.data.frame()
pca <- princomp(~., data=pobal_vars)

24 Screeplot

screeplot(pca, type="lines")

scree1.png

25 How to use the scree plot

  • Keep the components until the line levels off (the cumulative variance explained, sketched after this list, gives the same information numerically)
  • That's about five components here
  • You should also look at the variable loadings, described below
  • If you want approximately 100 pages of this, consult my thesis
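  • A sketch of the numeric version, using the pca object fitted above:
#proportion of variance explained by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained)[1:10], 3)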

26 Loadings

loadings <- pca$loadings
loadings[,1:6] %>%
  round(3) %>%
  head()
  Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
POPCHG06 0.004 0.012 0.338 0.37 0.245 0.092
POPCHG11 -0.006 -0.02 -0.336 -0.442 -0.23 -0.252
POPCHG16 -0.007 -0.114 -0.229 -0.138 -0.068 -0.222
AGEDEP06 0.024 0.266 0.14 -0.181 -0.016 0.293
AGEDEP11 0.004 0.289 0.053 -0.215 -0.076 0.287
AGEDEP16 -0.012 0.288 0.018 -0.188 -0.094 0.255
  • Looking at the loadings allows you to name the components, which makes them easier to interpret and model with

27 Y-aware Variable treatment

  • Recently, a wonderful package for R has become available
  • vtreat is a package for y-aware variable selection
  • It does a whole host of variable selection and preparation techniques, and can significantly improve outcomes of modelling processes
  • However, it is y-aware, which means that it must be placed inside the cross-validation process
  • This can add significant compute requirements to your project
  • Even if you don't use it, the paper is a great source of references and ideas for variable treatment

28 Impact Coding

  • Terminology developed by the authors of the vtreat package (explanation available here)
  • Essentially, each level is encoded by its (smoothed) difference from the grand mean of the outcome (a rough sketch follows this list)
  • In many ways, this is what a Bayesian hierarchical model will do, but this approach is much faster
  • You must use a distinct set of data to prepare this coding, otherwise you will overfit
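  • A rough, hand-rolled illustration of the idea (this is not vtreat's exact formula; impact_train and the smoothing weight m are made up for the sketch):
library(dplyr)
#hold out a split purely for estimating the encoding (see the last bullet above)
impact_train <- as.data.frame(ppr_pobal) %>%
    sample_frac(size=0.1)
grand_mean <- mean(log(impact_train$price))
m <- 20   #smoothing weight: levels with few sales get pulled towards the grand mean
impact_codes <- impact_train %>%
    group_by(electoral_district_id) %>%
    summarise(n=n(), level_mean=mean(log(price))) %>%
    mutate(impact=(n * level_mean + m * grand_mean) / (n + m) - grand_mean)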

29 Informative Missingness

  • Sometimes, the fact that a variable is missing is itself informative
  • A good example is income:
    • higher income people are less likely to report their income
    • this introduces a bias based on whether or not this variable is missing
  • We can create an indicator, variable_is_missing, which allows us to include this in a regression model (sketched below)
  • Note that trees and other methods do not have as much trouble with these variables, so vtreat may be less useful there
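  • A sketch for the postcode column from earlier (whether or not you keep the postcode itself, the indicator can stay):
library(dplyr)
#flag missing postcodes so the missingness itself can enter the model
ppr_pobal <- mutate(ppr_pobal,
                    postcode_is_missing=as.integer(is.na(postcode)))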

30 Example: Electoral Districts

  • There are 3k electoral districts in the data
ed_count  <- ppr_pobal %>% 
  as.data.frame() %>% 
  group_by(electoral_district_id) %>% 
    summarise(count=n(), median=median(price))
ggplot(ed_count, aes(x=count))+geom_density()

ed_count.jpg

31 Modelling without Vtreat

  • Most of these have very few observations
#otherwise the below errors out
ppr_pobal_s <- sample_frac(ppr_pobal, size=0.1) 
##takes a loooooonnnnng time
lm_ed1 <- lm(log(price)~electoral_district_id, 
             data=ppr_pobal_s)
  • We get an R^2 of .49 (adjusted for the number of parameters)
  • This is pretty good, but we need to estimate a lot of parameters
  • Can we do better with vtreat?

32 Vtreat Approach

require(vtreat)
#not a perfectly independent sample
##don't do it like this
ppr_train_s2 <- sample_frac(ppr_pobal, 
                            size=0.05) 
treat <- designTreatmentsN(ppr_train_s2, 
                           varlist=c("electoral_district_id"), 
                           outcomename = "price")
train_prep <- prepare(treat, 
                      dframe=ppr_pobal_s)
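  • The fitting step for glmnet_ed1 (used in the next section) isn't shown; presumably it was something along these lines (a sketch; the setdiff() guards against the outcome column being carried through by prepare()):
library(glmnet)
#fit an elastic-net path on the prepared columns, with log price as the outcome
x_train <- as.matrix(train_prep[, setdiff(names(train_prep), "price"), drop=FALSE])
glmnet_ed1 <- glmnet(x=x_train, y=log(ppr_pobal_s$price))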

33 Vtreat In practice

ppr_pobal_test <- sample_frac(ppr_pobal, size=0.1)
test_vtreat <- prepare(treat, ppr_pobal_test)
glmnet_preds <- predict(glmnet_ed1, newx=as.matrix(test_vtreat), 
                        type="response")
predsvobs <- data.frame(preds=exp(glmnet_preds), 
                        obs=ppr_pobal_test$price)
ggplot(predsvobs, aes(x=obs, y=preds.s0))+geom_point()
ggplot(predsvobs, aes(x=obs, y=preds.s60))+geom_point()
with(predsvobs, cor(preds.s60, obs))

  • We get about the same correlation as before
  • But the joy of vtreat is that we can easily apply the same approach to multiple variables, and it will handle missings and encodings for us automatically

34 Vtreat on multiple variables

#treat every column except the outcome
tf <- designTreatmentsN(ppr_train_s2, 
                        varlist=setdiff(names(ppr_train_s2), "price"),
                        outcomename="price")
ppr_p_t <- prepare(tf,
                   dframe=ppr_pobal_s)
ppr_glmnet_full <- cv.glmnet(x=as.matrix(ppr_p_t),
                             y=log(ppr_pobal_s$price))
test_vtreat_full <- prepare(tf, ppr_pobal_test)

35 Vtreat Results

  • When we use all of the variables, we get pretty good results

obsvpred.png

  • The predictions are pretty accurate, for most of the data

36 Predictions vs Observed

log_resid.png

37 Is this model accurate?

  • The most likely residual is approximately 1,000, which is very, very good
  • In fact, it's probably too good
  • Examine the variables and residuals found from this model
  • Discuss potential issues which may be flattering this model

38 Conclusions

  • Sometimes, the best way to engineer features is to get more data
  • Scaling and centering are good default transformations for numeric variables
  • If you have lots of correlated variables, PCA can be a good choice
  • If you want to transform effectively, then use Box-Cox transformations
  • vtreat can automate lots of common transformations, and if you can spare the data, you should definitely use it

Footnotes:

[1] This is particularly an issue when using cross-validation.

Author: Richie Morrisroe

Created: 2020-04-29 Wed 11:45
