# Feature Engineering

## 1 Introduction

• Feature engineering:
• finding/creating new datasets and attributes
• improving current features
• representing features appropriately

## 2 Transformations

• Handle skew
• Center (subtract the mean from each value)
• Scale (divide the centered values by the standard deviation)
• Results in numeric variables with mean 0 and standard deviation 1
• Ultimately you're solving a bunch of simultaneous equations, and the scale of the variables can matter
• Other transformations can be chosen using the Box-Cox procedure
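To make centering and scaling concrete, here is a minimal Python sketch (synthetic numbers, stdlib only) showing that the standardised values end up with mean 0 and standard deviation 1:

```python
from statistics import mean, stdev

def standardize(xs):
    """Center by subtracting the mean, then scale by the standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

prices = [120_000, 250_000, 310_000, 475_000, 890_000]
z = standardize(prices)
print(mean(z))   # ~0 (up to floating point)
print(stdev(z))  # ~1
```

In R the same thing is `scale(x, center=TRUE, scale=TRUE)`.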

## 3 Box-Cox

• Essentially produces a value, lambda, which is used as a power for the original variable
• If the value is 0, take the log
```
with(ppr_train, BoxCoxTrans(price))
```
• So, nicely enough, it supports my decision to choose a log-transform
• If we look at the counts from daily_data earlier, it suggests 0.4, which indicates something close to a square-root transform
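The transform family itself is simple to state; a Python sketch (the lambda = 0 case reduces to the log, and lambda near 0.5 behaves like a square root):

```python
import math

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam, with log(x) as the lam = 0 limit."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

print(box_cox(100.0, 0))    # log(100)
print(box_cox(100.0, 0.5))  # (sqrt(100) - 1) / 0.5 = 18.0
```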

## 4 Categorical data

• Often many choices in how to represent
• How many levels should one allow?
• Do these categories have any hierarchical structure?
• Should all levels of a categorical variable be included? (yes for trees, no for linear models)
• Some categories may define the unit of analysis (i.e. customer_id, order_id, time)
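The "should all levels be included?" question is really about the reference level; a Python sketch of dummy coding (hypothetical helper, not from the lecture code):

```python
def dummy_encode(values, drop_first=True):
    """One-hot encode a categorical column. drop_first removes the reference
    level, which linear models need to avoid a redundant (collinear) column;
    tree methods can keep every level."""
    levels = sorted(set(values))
    keep = levels[1:] if drop_first else levels
    return [[1 if v == lvl else 0 for lvl in keep] for v in values]

counties = ["Dublin", "Cork", "Dublin", "Galway"]
print(dummy_encode(counties))  # Cork becomes the reference level
```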

## 5 Quick model

```
ppr_train3 <- select(ppr_train2, -postal_code,
                     -price, -date_of_sale)
ppr_train_s <- sample_frac(ppr_train3, size=0.1)
ppr_test_samp <- sample_frac(ppr_train3,
                             size=0.05)
train_ctrl <- trainControl(method="repeatedcv",
                           repeats=5)
ppr_glmnet <- train(log_price~.,
                    data=ppr_train_s,
                    method="glmnet",
                    trControl=train_ctrl)
preds <- predict(ppr_glmnet, newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds <- preds
```

## 6 Sampling

• The first rule of practical model-building is to sample judiciously
• This is entirely a device to speed up iteration, and can often be a necessity forced upon the project by a lack of resources
```
ggplot(ppr_test_samp, aes(x=10^log_price, y=10^glmnet_preds)) +
  geom_point()
```
• The disparity between the two sets of data (individual vs commercial) is very, very apparent
• Do we care about the extremely large examples?

## 7 Modelling Objectives

• We need to be very clear with what we are trying to accomplish
• With this data, I was trying to assess where I should buy a house, and how I should value it
• For that purpose, the huge (€1mn-plus) sales add absolutely no value to me, because I can't possibly afford them

## 8 Recontextualising the Problem

```
ppr_train_s2 <- filter(ppr_train_s,
                       log_price<=log(1e6, base=10)) %>%
  select(-is_big)
ppr_glmnet2 <- train(log_price~.,
                     data=ppr_train_s2,
                     method="glmnet",
                     trControl=train_ctrl)
preds2 <- predict(ppr_glmnet2,
                  newdata=ppr_test_samp)
ppr_test_samp$glmnet_preds2 <- preds2
```
• Interestingly enough, the model containing the large prices does better
• I don't really have a good explanation for this

## 9 Feature Engineering

• The first step is to examine what data we have
• Are there any sources of data which could be used better?
• The obvious idea here is to geocode the addresses
• This will give us exact location, and allow us to examine more CSO data

## 10 Python Geocoding Script

• You'll be using multiple tools in order to get your work done
• In this case, I actually didn't write the script myself
• I got the script from Shane Lynn, who had geocoded the data up to 2016
• I used all my free credits from Google Cloud Platform to get this updated to 2018

## 11 Extra Variables from geocoding

```
ppr_gc <- read_csv(
  "~/Dropbox/PPR/ppr_geocoded_till_oct2018.csv")
ppr_gc1 <- filter(ppr_gc, year<=2016)
ppr_gc2 <- filter(ppr_gc, year>2016)
#github doesn't like large files
write_csv(ppr_gc1, path="ppr_gecoded1.csv")
write_csv(ppr_gc2, path="ppr_gecoded2.csv")
names(ppr_gc)[12:24]
```
```
x date accuracy latitude longitude postcode type geo_county
electoral_district electoral_district_id region small_area month
```
• To be fair, this isn't everything, but it's a lot more than I had earlier
• This is the key point of feature engineering:
• You are much more likely to be successful by adding new datasets rather than new features
• This is worth remembering in future research

## 12 Geocoded Data

• Note that we have the electoral districts, and the small areas encoded into the file
• This provides us with a link into the CSO census data
• This data is all stored in shapefiles, which also encode the area and size of each unit (electoral district/small area/constituency)
• What do we get from this?

## 13 Shape Files!

```
library(sp)
library(rgdal)
#shp: the census shapefile, read earlier with readOGR()
#the attribute data frame is stored in the @data slot
names(shp@data[,13:length(shp@data)])
```
```
x MALE2011 FEMALE2011 TOTAL2011 PPOCC2011 UNOCC2011 VACANT2011
HS2011 PCVAC2011 LAND_AREA TOTAL_AREA CREATEDATE
```
• So this is really, really useful
• Note that HS2011 is the number of households
• This provides a denominator for the sales of houses
• We also have vacancies and land area, which we can use to calculate density and other attributes
• Note that despite the names, these are 2016 figures
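As a sketch of the kinds of derived features these counts allow (hypothetical function and made-up numbers; HS2011 is the household count, VACANT2011 the vacancies, LAND_AREA the area):

```python
def derived_area_features(sales, households, vacant, land_area_km2):
    """Ratios available once census counts are joined onto the sales data:
    households give a denominator for sales, land area gives density."""
    return {
        "sales_per_household": sales / households,
        "vacancy_rate": vacant / households,
        "households_per_km2": households / land_area_km2,
    }

print(derived_area_features(sales=40, households=800, vacant=64, land_area_km2=2.5))
```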

## 14 Pobal Data

• Pobal are an agency that investigates deprivation across Ireland
• They produce a dataset which has lots of useful variables on a small area/electoral district area
• Full details can be found here
• The dataset itself can be found here

## 15 Pobal Data Details

• Note that there is a data dictionary
• These are always worth looking for, and if you can't find one, it's well worth actually documenting what you find in a dataset, especially if you'll ever need it again
• You'll need to join your datasets together, obviously
• When that's done, you'll gain the following variables:
```
pobal <- read_csv("~/Dropbox/PPR/pobal_deprivation_index_ed.csv")
names(pobal)[25:45]
```
```
x share POPCHG06 POPCHG11 POPCHG16 AGEDEP06 AGEDEP11 AGEDEP16
LONEPA06 LONEPA11 LONEPA16 EDLOW_06 EDLOW_11 EDLOW_16
EDHIGH06 EDHIGH11 EDHIGH16 HLPROF06 HLPROF11 HLPROF16
LSKILL06 LSKILL11
```
• Note that the HP variables are indexes of absolute and relative deprivation, and I believe that they are the result of PCA

## 16 Reassessing the Situation

• So, sometimes the best feature engineering you can do is find other data :)
• However, we now have a dataset where we can run interesting variable treatments (i.e. we have lots of continuous variables)

## 17 Transformations

• So the best thing to do is to examine the data again, as we did before
```
sapply(ppr_pobal, n_distinct) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(desc(`.`))
```
```
geometry 298501   rowid 298501   input_string 263158   longitude 180897
latitude 177022   postcode 113999   price 18119   small_area 18019
electoral_district_id 3352   ID06 3322   electoral_district 3066   ED_Name 3035
sale_date 2337   TOTPOP16 1751   popdep_T 1751   TOTPOP11 1739
TOTPOP06 1704   TOTPOP02 1673   popdep_L 1304   popdep_M 902
OHOUSE16 596   OHOUSE11 585   EDHIGH16 547   HLPROF16 529
popdep_H 523   HLPROF11 518   EDHIGH11 505   OHOUSE06 500
HLPROF06 497   EDHIGH06 493   PRRENT16 460   LONEPA11 458
PRRENT11 456   LONEPA06 453   LONEPA16 437   EDLOW_06 430
UNEMPM11 419   EDLOW_11 391   PRRENT06 383   LSKILL06 379
HP2016abs 360   HP2011rel 360   LSKILL11 355   HP2016rel 352
HP2006abs 351   HP2006rel 351   LSKILL16 351   HP2011abs 350
EDLOW_16 350   UNEMPM16 347   UNEMPF11 321   type 320
LARENT16 307   UNEMPF16 300   AGEDEP06 295   AGEDEP11 294
AGEDEP16 294   LARENT11 286   LARENT06 284   UNEMPM06 271
UNEMPF06 232   share_H 60   POPCHG11 60   share_T 46
share 45   share_M 44   share_L 42   POPCHG06 39
NUTS4 35   geo_county 34   ppr_county 26   PEROOM11 19
PEROOM16 17   POPCHG16 15   month 12   NUTS3 9
region 8   year 7   PEROOM06 7   description_of_property 6
NUTS2 3   NUTS1 2
```
• We can see that the top few variables have far too many levels, and that the bottom few variables will be problematic as they have only 1–5 levels¹

## 18 Missing Data

```
sapply(ppr_pobal, function(x)
         sum(is.na(x))/length(x)) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  dplyr::arrange(desc(`.`)) %>%
  mutate_if(is.numeric, round, 4)
```
```
   rowname      .
1 postcode 0.5491
2     ID06 0.0012
3  ED_Name 0.0012
4    NUTS4 0.0012
5    NUTS3 0.0012
```
• As we can see, it's mostly postcode that's missing
• Here, it's probably a better idea to just drop it
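The "drop it if it's mostly missing" rule is easy to mechanise; a Python sketch (hypothetical helper; None stands in for NA):

```python
def high_missing_columns(table, threshold=0.5):
    """Return the columns whose fraction of missing (None) entries exceeds
    the threshold; these are candidates to drop, like postcode here."""
    result = {}
    for col, vals in table.items():
        frac = sum(v is None for v in vals) / len(vals)
        if frac > threshold:
            result[col] = frac
    return result

table = {"postcode": [None, "D08", None, None],
         "price": [100_000, 200_000, 300_000, 400_000]}
print(high_missing_columns(table))  # only postcode exceeds the threshold
```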

## 19 BoxCox parameters

```
num_vars <- as.data.frame(ppr_pobal) %>%
  na.omit() %>%
  select_if(is.numeric)

bc2 <- lapply(num_vars,
              BoxCoxTrans,
              na.rm=TRUE)
lambdas <- sapply(bc2, function(x) x[["lambda"]])
```
```
year 2   price 0.1   latitude 2   longitude NA   month 0.8   ID06 0.1
```
• So we have a bunch of potential transformations to apply

## 20 PCA

• Principal Components Analysis (PCA) is a data reduction method
• It essentially examines the covariances between the variables, and returns linear combinations of them (components), ordered by how much of the total variance each explains
• The first component explains most of the variance
• And each extra component is uncorrelated with the previous
• This makes it much easier to model the variables
• However, it can make them much harder to interpret
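To make the mechanics concrete, here is a small Python sketch of PCA for two variables, computing the leading eigenpair of the 2x2 covariance matrix in closed form (illustrative only; real use goes through princomp/prcomp or an SVD):

```python
import math
from statistics import mean

def pca_2d(xs, ys):
    """First principal component of two variables: center them, form the 2x2
    covariance matrix, and take its leading eigenvalue and eigenvector
    (closed form; assumes the off-diagonal covariance is nonzero)."""
    n = len(xs)
    xc = [x - mean(xs) for x in xs]
    yc = [y - mean(ys) for y in ys]
    sxx = sum(a * a for a in xc) / (n - 1)
    syy = sum(b * b for b in yc) / (n - 1)
    sxy = sum(a * b for a, b in zip(xc, yc)) / (n - 1)
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)  # leading eigenvalue
    vx, vy = sxy, lam1 - sxx                      # its eigenvector
    norm = math.hypot(vx, vy)
    return lam1, (vx / norm, vy / norm)

# two strongly correlated variables: the first component points along y ≈ x
lam1, pc1 = pca_2d([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(lam1, pc1)
```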

## 21 Practical PCA

• Keeping all components just gives us uncorrelated predictors (which helps estimation, but doesn't reduce the dimension)
• However, we normally want to describe as much variance as possible with as few variables as possible
• This necessitates choosing K components to keep
• How do we choose K?
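One simple (if crude) recipe for K is the cumulative-variance rule; a Python sketch with made-up eigenvalues:

```python
def choose_k(eigenvalues, target=0.9):
    """Keep the smallest K components whose cumulative share of total
    variance reaches the target (90% here). One rule among many."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev
        if running / total >= target:
            return k
    return len(eigenvalues)

# a typical scree pattern: variance concentrated in the first components
print(choose_k([5.0, 2.5, 1.0, 0.8, 0.4, 0.3]))  # → 4
```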

## 22 This is a very very hard problem

• Because it's unsupervised
• If we had an outcome metric, we could cross-validate
• This is much harder without a response variable
• If we do have an outcome variable, then we would need to repeat the selection of K for each fold (to avoid bias)
• There are lots of methods, many of which don't agree

## 23 One method to rule them all

```
pobal_vars <- dplyr::select(num_vars, 26:64) %>%
  lapply(function(x) scale(x,
                           center=TRUE,
                           scale=TRUE)) %>%
  as.data.frame()
pca <- princomp(~., data=pobal_vars)
```
```

## 24 Screeplot

```
screeplot(pca, type="lines")
```

## 25 How to use the scree plot

• Keep the components until the line levels off
• If you want approximately 100 pages of this, consult my thesis

## 26 Loadings

```
loadings <- pca$loadings
round(loadings[, 1:6], 3)
```
```
         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
POPCHG06  0.004  0.012  0.338  0.370  0.245  0.092
POPCHG11 -0.006 -0.020 -0.336 -0.442 -0.230 -0.252
POPCHG16 -0.007 -0.114 -0.229 -0.138 -0.068 -0.222
AGEDEP06  0.024  0.266  0.140 -0.181 -0.016  0.293
AGEDEP11  0.004  0.289  0.053 -0.215 -0.076  0.287
AGEDEP16 -0.012  0.288  0.018 -0.188 -0.094  0.255
```
• Inspecting the loadings allows you to name the components, which makes them more comfortable to model with

## 27 Y-aware Variable treatment

• Recently, a wonderful package for R has become available
• vtreat is a package for y-aware variable treatment and selection
• It applies a whole host of variable selection and preparation techniques, and can significantly improve the outcomes of a modelling process
• However, it is y-aware, which means it must sit inside the cross-validation process
• Even if you don't use it, the accompanying paper is a great source of references and ideas for variable treatment

## 28 Impact Coding

• Terminology developed by the authors of the vtreat package (explanation available here)
• Essentially the variables are smoothed towards the grand average
• In many ways, this is what a Bayesian hierarchical model will do, but this approach is much faster
• You must use a distinct set of data to prepare this coding, otherwise you will overfit
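The smoothing idea can be sketched in a few lines of Python (a simplified version of the idea; vtreat's actual estimator differs in detail, and the pseudo-count m here is a hypothetical tuning choice):

```python
from statistics import mean

def impact_code(pairs, m=20):
    """Smoothed mean encoding: each level's mean outcome is pulled towards
    the grand mean, weighted by the level's count n against a pseudo-count m.
    Rare levels end up close to the grand mean; common levels keep their own."""
    grand = mean(y for _, y in pairs)
    by_level = {}
    for level, y in pairs:
        by_level.setdefault(level, []).append(y)
    return {level: (len(ys) * mean(ys) + m * grand) / (len(ys) + m)
            for level, ys in by_level.items()}

# a rare level ("B": one observation at 20) is shrunk nearly to the grand mean
codes = impact_code([("A", 10.0)] * 50 + [("B", 20.0)])
print(codes)
```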

## 29 Informative Missingness

• Sometimes, the fact that a variable is missing is itself informative
• A good example is income:
• higher income people are less likely to report their income
• this introduces a bias based on whether or not this variable is missing
• We can create an is-missing indicator variable, which allows us to include this information in a regression model
• Note that trees and other methods do not have as much trouble with these variables, so vtreat may be less useful there
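A minimal Python sketch of the indicator trick (hypothetical helper; None stands in for a missing value):

```python
def add_missing_indicator(values, fill=0.0):
    """Fill missing values and record an is_missing flag alongside, so a
    regression can estimate a separate effect for the missingness itself."""
    filled = [fill if v is None else v for v in values]
    is_missing = [1 if v is None else 0 for v in values]
    return filled, is_missing

incomes = [52_000, None, 38_000, None]
print(add_missing_indicator(incomes))
```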

## 30 Example: Electoral Districts

• There are 3k electoral districts in the data
```
ed_count <- ppr_pobal %>%
  as.data.frame() %>%
  group_by(electoral_district_id) %>%
  summarise(count=n(), median=median(price))
ggplot(ed_count, aes(x=count))+geom_density()
```

## 31 Modelling without Vtreat

• Most of these have very few observations
```
#otherwise the below errors out
ppr_pobal_s <- sample_frac(ppr_pobal, size=0.1)
##takes a loooooonnnnng time
lm_ed1 <- lm(log(price)~electoral_district_id,
             data=ppr_pobal_s)
```
• We get an R^2 of .49 (adjusted for the number of parameters)
• This is pretty good, but we need to estimate a lot of parameters
• Can we do better with vtreat?

## 32 Vtreat Approach

```
require(vtreat)
#not a perfectly independent sample
##don't do it like this
ppr_train_s2 <- sample_frac(ppr_pobal,
                            size=0.05)
treat <- designTreatmentsN(ppr_train_s2,
                           varlist=c("electoral_district_id"),
                           outcomename = "price")
train_prep <- prepare(treat,
                      dframe=ppr_pobal_s)
```

## 33 Vtreat In practice

```
ppr_pobal_test <- sample_frac(ppr_pobal, size=0.1)
test_vtreat <- prepare(treat, ppr_pobal_test)
#glmnet_ed1: the cv.glmnet model fit on the prepared training data (fit not shown)
glmnet_preds <- predict(glmnet_ed1, newx=as.matrix(test_vtreat),
                        type="response")
predsvobs <- data.frame(preds=exp(glmnet_preds),
                        obs=ppr_pobal_test$price)
ggplot(predsvobs, aes(x=obs, y=preds.s0))+geom_point()
ggplot(predsvobs, aes(x=obs, y=preds.s60))+geom_point()
with(predsvobs, cor(preds.s60, obs))
```
• We get roughly the same correlation as before
• But the joy of vtreat is that we can easily apply the same approach to multiple variables, and it will handle missings and encodings for us automatically

## 34 Vtreat on multiple variables

```
tf <- designTreatmentsN(ppr_train_s2,
                        varlist=setdiff(names(ppr_train_s2),
                                        "price"),
                        outcomename = "price")
ppr_p_t <- prepare(tf,
                   dframe=ppr_pobal_s)
ppr_glmnet_full <- cv.glmnet(x=as.matrix(ppr_p_t),
                             y=log(ppr_pobal_s$price))
test_vtreat_full <- prepare(tf, ppr_pobal_test)
```

## 35 Vtreat Results

• When we use all of the variables, we get pretty good results
• The predictions are pretty accurate for most of the data

## 36 Predictions vs Observed

## 37 Is this model accurate?

• The most likely residual is approx 1000, which is very very good
• In fact, it's probably too good
• Examine the variables and residuals found from this model
• Discuss potential issues which may be flattering this model

## 38 Conclusions

• Sometimes, the best way to engineer features is to get more data
• Scaling and centering are good default transformations for numeric variables
• If you have lots of correlated variables, PCA can be a good choice
• If you want to transform effectively, then use Box-Cox transformations
• vtreat can automate lots of common transformations, and if you can spare the data, you should definitely use it

## Footnotes:

¹ Particularly an issue when using cross-validation.

Created: 2020-04-29 Wed 11:45
