Property Analysis

1 Introduction

It is a truth universally acknowledged that a married man in possession of a good fortune must be in want of a house.

I've taken the liberty of updating good old Jane's statement for more contemporary times. This article talks about my process of looking at and understanding property prices in Ireland.

Some background for my non-Irish readers:

Ireland was mostly poor from the 1930's to the 1990's
Banks were very hesitant to lend money to anyone
And the interest rates were very high

This all changed during the Celtic Tiger period, when a flood of cheap euro-denominated money was dropped in by German and French helicopters ¹. This lead (predictably) to a massive property bubble, which was sadly interrupted by the advent of financial contagion in 2007. Over the next few years, the economy fell apart, we were dragged into a bailout program, and house prices collapsed.

A few years after that, they started to recover, and right now, we have a massive housing crisis in that people can't afford houses, which puts more stress on the rental market, increasing prices.

Given that I don't trust the property-porn engaged in by the media and government, I decided to take matters into my own hands, and consult the property price register (PPR for future reference) and figure out whether or not I should buy or continue to slum it in a hell of short-term lets and fear ².

1.1 Property Price Register

So, the property price register was established in the depths of the economic crisis ³, and so has a number of problems. The most obvious is that we only have data from 2010. Given data from the boom, it would be a much easier task to estimate tops and bottoms, but we can only work with what we have. The second problem with the data is that Ireland doesn't have postcodes ⁴, which makes localisation of the properties a challenge.

Nonetheless, the nice people at the PPR have collated the data for us, so there is that. They helpfully provide the data in CSV format, and have a document which contains all of the property data, which can be found here ⁵

So, on with the show! ⁶

1.1.1 Data Cleaning

In data analysis, much as in life, cleaning is important. Unfortunately, while keeping ones house and body cleaning rarely takes up all of one's time, data cleaning, can, and often does take up almost all of the time. So, enjoying (or at least not hating) this process, is an important career skill.

Anyways, so the first thing we do is look at the data, to assess what needs to be done.

dat  <- readr::read_csv("PPR_2020.csv")
  ## dat <- readxl::read_excel("PPR-ALL.xlsx", sheet="PPR-ALL")
  names(dat)

Date of Sale (dd/mm/yyyy)

Address

Postal Code

County

Price ()

Not Full Market Price

VAT Exclusive

Description of Property

Property Size Description

OK, so we have date (nicely, they have specified the format), address, postal code, county.

I have some problems with these names. The spaces will cause problems, and capitalisation is so passe. I'll fix these problems with some code.

normalise_names <- function(df) {
    nms <- names(df)
    normed <- iconv(
        tolower(
            gsub("([[:space:]]|[[:punct:]])+", "_",
                 x=nms)),
        "latin1", "ASCII", sub="")
    drop_usc <- gsub("([a-z_])_*$", "\\1",
                     x=normed)
    names(df) <- drop_usc
    df
}

This function does a few things:

it replaces spaces or punctuation with underscores⁷
it then converts all text to lowercase
Then we exert cultural imperialism and export to ascii⁹
finally, we replace multiple underscores with one (for tidyness, mostly)

dat2 <- normalise_names(dat)
names(dat2)

date_of_sale_dd_mm_yyyy

address

postal_code

county

price

not_full_market_price

vat_exclusive

description_of_property

property_size_description

Much better. Little things like this add up to a better, more productive life.

Next, we need to remove the commas from the input doc (which have become hopelessly mangled). Again, we abuse iconv, this time on a particular column. My life would be much easier if I took the time to learn how to handle encodings ¹⁰

fix_price <- function(x) {
    nopunct <- gsub(",|", "", x=x)
    nums <- iconv(nopunct, "latin1", "ASCII", sub="")
}
dat2$price <- with(dat2, fix_price(price))

These preliminaries accomplished, we can move towards actually looking at the data

prop_df <- dat2 %>%
    mutate(date_of_sale=lubridate::dmy(date_of_sale_dd_mm_yyyy),
           not_full_market_price=as.factor(not_full_market_price),
           postal_code=as.factor(fct_explicit_na(postal_code)),
           vat_exclusive=as.factor(vat_exclusive),
           county=as.factor(county),
           price=as.numeric(price))
prop_df2 <- mutate(prop_df,
                   year=lubridate::year(date_of_sale),
                   month=lubridate::month(date_of_sale),
                   day=lubridate::day(date_of_sale))

prop_sample <- sample_frac(prop_df2, size=0.1)

We also take a ten percent sample, just to speed up the analysis ¹¹

1.1.2 Wasting Time with Plots

So, when we do the simplest plot ever (a one-way summary of price), it looks pretty weird. Most of the density is in the one line near zero, and there's a long tail that goes up to 100mn. Who are these people buying houses for many millions of euros? More importantly, how can I become one of them?

The first thing to do when you find something weird like this is to take a look at the observations and see if they are different ¹². Let's sort the data by price and look at the addresses.

arrange(prop_df, desc(price)) %>%
    select(address, county, date_of_sale, price) %>%
    head()

address	county	date_of_sale	price
Blocks A and C, The Point Campus	Dublin	2019-12-23	142256576.05
COMMERCIAL PROPERTY AND, 423 RESI UNITS AT CLANCY QUAY, ISLANDBRIDGE	Dublin	2018-06-22	139165000
266 Apts & 18 Commercial Units, Heuston South Quarter, St. John's Road West	Dublin	2019-07-19	129500000
APARTMENT 1 - 197 NEPTUNE BLOCK, HONEYPARK, DUN LAOGHAIRE	Dublin	2020-02-27	114440820
189 APARTMENTS AT ELMFIELD, BALLYOGAN, DUBLIN 18	Dublin	2020-02-27	99762972.37
Apartments 1-164 Fairwary Court, & 1-50 Fairway Hall, Cualanor Upper Glenageary Road	Dublin	2019-05-27	95242291

OK, so the Cannon house thing must be more than one apartment. Otherwise things have gone even madder than I thought. Also Apt 204 the Alliance, and 182 the Elysian ¹³. Entire apartment blocks for 76-86 million makes more sense, however.

arrange(prop_df, desc(price)) %>%
    select(address, county, date_of_sale, price) %>%
    tail()

address	county	date_of_sale	price
9 Greenfields, Lanesboro Road, Roscommon	Roscommon	2015-05-22	5179
LAVALLY, BALLINTOGHER	Sligo	2014-05-01	5177
65 ST JOSEPH'S PARK, NENAGH, CO TIPPERARY	Tipperary	2018-09-04	5080
Loghnabradden, Fintown, Co. Donegal	Donegal	2012-01-11	5079
CLOGHAN, GLENCOLMCILLE, DONEGAL	Donegal	2014-07-18	5079
COULAGHARD, EYERIES, BEARA	Cork	2019-03-11	5030.53

there must be something really wrong with these houses.

It does tell us something about the tails¹⁴ of the crash, in that property was worth so little in Roscommon in 2015 that this happened. It may also be an error, as I don't know what kind of checking goes into this data.

However, it's interesting to note that there weren't that many sales. Even at 60k it's only just over 5% of the housing stock ¹⁵. That's not a huge amount, but given the lack of data for the boom, it's very difficult to make those judgements.

So the overall trend is pretty clear. The chart above shows median prices. For those of you who aren't stats nerds, the median is the middle value. If you laid out all the prices in a line, the median would be the middle of the line. This means that it's a reasonable summary of the distribution.

Really, you should look at the distribution itself, but aggregations such as the median are useful for plots and telling people about.

The mean, which is the sum divided by the length of the set of numbers is much higher. This is pretty normal in most sets of data because huge values at the top or bottom have more impact on the mean. Again, the same trend is pretty visible. The mean price in Dublin is 400k, which assuming a 10% deposit means that a household would need a combined income of approximately 102k.

This puts houses in Dublin far out of reach of ordinary earners, which is perhaps not surprising given the incredible inflation in asset prices we've seen since the financial crisis in 2008.

Note that I've filtered out the properties sold for more than 2mn, as these are mostly apartment blocks and groups of houses. These probably shouldn't be looked at for this analysis ¹⁶.

Looking closer at Dublin (and breaking down by postal code), we can see some more interesting patterns.

Prices are definitely trending up, but there are some large outliers across the board.

Like this former house owned by an executive in Dublin 20, for example. That place definitely doesn't reflect the average experience in that area. In general, I'm beginning to think that there are multiple housing markets in Ireland, not one. This makes sense as one can only compete against others in the same price range. This would suggest that I should regard the house buying denizens of South Dublin to be some kind of alien species. I do enjoy when a data-driven theory agrees with my priors.

Long periods of not much activity, followed by sudden spikes.
We should probably dig into some of the spikes to figure out what they mean.
It's possible that they may be one overall transaction recorded as multiples (for instance, when an entire apartment block is sold to one developer, and it's recorded unit by unit).

1.1.3 Conclusions

So, above we've reviewed the property price register, and done a bunch of data cleaning, but not really accomplished that much overall. Part Two will cover some other interesting datasets ¹⁷, while future parts will cover geocoding the data and looking at spatial patterns, as well as some bayesian modelling with Stan.

Hopefully someone makes it thus far, at least ;).

Footnotes:

In fact, it was done via banks, but it doesn't really have the same ring to it now, does it?

actually I've been in the same house for many years, but I likes a good simile.

by the Greens actually, so at least I don't feel my vote for them was entirely wasted

⁴

we do now, but they are not on any of the addresses

⁵

they also provide commercial data (which I just noticed), but I am resisting the urge to become distracted

⁶

for very nerdy, boring values of show

⁷

You could use any character, but I like underscores. It's totally pointless though, so really we should just standardise on one ⁸

⁸

clearly underscores

⁹

this whole lack of understanding of encodings is deeply embarassing

¹⁰

look at how well I'm resisting distraction!

¹¹

turns out sampling is pretty cool, who knew?

¹²

or you could just log it and get one with building a CNN I guess

¹³

that dude finally made his money back, after opening really expensive apartments just as the crash hit

¹⁴

geddit?

¹⁵

assuming 1mn households. I should probably check this with the CSO

¹⁶

yes, this is really subjective, like almost everything in data analysis. The advantage here is that I provide code so people can change that however they want

¹⁷

it's possible that if you've read this article before you'll note that this was once one article. I came to the conclusion that people might be more inclined to read my article if they weren't more than 10,000 words long.