An Analytics project utilizing housing market data sourced from the Dutch website Funda.
Nathalia V. M. de Oliveira - April, 2024.
This project analyzes the Dutch housing market based on property listings available on the Funda.nl website. The objective is:
The process involves data extraction via web scraping, data transformation for cleaning and preparation, and exploratory analysis through interactive visualizations in Power BI. The result is a set of dashboards offering detailed insights into geographic distribution, property characteristics, and price and size statistics by province and city.
If you are NOT interested in the technical part, you can go straight to the Results.
Funda is a Dutch residential platform established since more than 20 years ago. According to their website, Funda is the largest platform connecting supply and demand in the real estate market of the Netherlands, with over 4 million unique visitors per month and hosting around 97% of the Dutch housing market.
According Funda’s Terms and Conditions, scraping its website is only allowed for personal use.
The data were scraped from Funda using FundaScraper module for Python. Before running it, there are several different sets of arguments that can generate diverse searches. For this project, the following arguments have been used:
area=provincie
want_to=buy
find_past=False
page_start=1
n_pages = 50
raw_data=True
Although searching by province may return, for instance, 12,000 results, Funda only makes a maximum of 666 pages accessible, as shown below. This equates to 9,990 entries, as each page contains 15 listings.
You can have a preview of this search by province results here.
Then, every 50 pages (or 750 entries), the script scrapes and records the data into a CSV regarding each province.
The pictures show there are no next pages.
The data were collected on the 10th of April, 2024, which means that the dataset for this case study includes all the housing property listings available for sale during that period.
The raw scraped dataset contains the following columns:
The raw data look like expected, a little mess.
You can also take a look at the whole table. It’s possible to see that it is going to take a lot of work.
And so, the following provinces CSV files were generated. The number corresponds to the quantity of listings returned:
Drenthe - 1912
Groningen - 1946
Flevoland - 2161
Friesland - 2280
Zeeland - 3033
Limburg - 4095
Utrecht - 4146
Overijssel - 4305
Gelderland - 7410
Zuid Holland - 9990
Noord Brabant - 10004
Noord Holland - 10004
For some reason, Noord Holland and Noord Brabant have more than 9990 listings. I assumed they could be duplicates and would be treated accordingly during the processing step.
The raw dataset then starts with 61,286 entries and 1.1 GB in size.
On the transformation step, when running the data processing script, several processes were applied to the data so that could be better suited for analytics.
At first, the script loads raw data from each province CSV file,
and creates them specified folders in a general chunk folder, in which the processed data will be written after the cleansing.
The processing step ended up heavy and a bit slow. There was a need to adopt a measure to maintain the already processed data and avoid losses due to frequent interruptions in execution caused by unexpected occurrences. In order to breaking down the dataset into smaller and more manageable pieces, the cleaning and writing process for each province was then partitioned into chunks of 300 instances and executed in a loop.
Process all perfomed using Pandas.
'photo', 'descrip', 'log_id', 'insulation', 'ownership'
and other that can be seen at Data processing script.'226 m²'
to number 226
.'€ 1.190.000 k.k.'
to number 1190000
.
1960-1970
Na 2020
Voor 1906
2025
to YYYY
. D
F
A+
A+++
A+++
A++++
C
A 0,78
na
C
to A, B, C, D, E, F, G and agregating all A greater than A+ into A+
.
Cv-ketel, gashaard en gedeeltelijke vloerverwarming
Cv-ketel en gedeeltelijke vloerverwarming
Gehele vloerverwarming, warmte terugwininstallatie en warmtepomp
Gaskachels
Gehele vloerverwarming en warmtepomp
Cv-ketel en pelletkachel
Open haard, gedeeltelijke vloerverwarming en warmtepomp
Cv-ketel
Blokverwarming
to
boiler
boiler
heat pump
gas heaters
heat pump
boiler
heat pump
boiler
block heating
Soort garageVrijstaande houten garageCapaciteit2 auto'sVoorzieningenElektrische deur en elektra
Soort garageAangebouwde stenen garageCapaciteit3 auto'sVoorzieningenElektraIsolatieGeen isolatie
Soort parkeergelegenheidOp eigen terrein en openbaar parkeren
Soort garageCarport, garage met carport en vrijstaande stenen garageCapaciteit4 auto'sVoorzieningenElektrische deur, vliering, elektra, verwarming en stromend waterIsolatieDakisolatie en muurisolatie
to 'yes' or 'no'
.
House type (house_type), House ID (house_id) : Extracts from the URLs. For exemple
https://www.funda.nl/detail/koop/utrecht/huis-jeanne-d-arcdreef-11/89977384/
Finally, a third script was used to concatenate the chunks of each province into a file per province and then, concatenates all the province files into a single CSV file.
The final processed file contains 53,931 entries and 18,7 MB in size.
The exploratory data analysis, as well as the visualizations, were built using Power BI, in order to understand the data distribution, patterns, relationships and generate insights. It comprises the phases bellow:
The processed data CSV file is loaded, and techniques such as type adjustment, column splitting and renaming, and creating columns with value ranges are applied in the Power Query editor.
For outlier detection, the Z-score method was applied on the property prices and prices per m². Results greater than 3 were considered outliers, which in the dataset reflect prices from €1,900,000 and €4,299.06 per m². Then,
At this stage, a new data source has been also incorporated: a table with some of the largest and main cities in the Netherlands along with their respective populations, which will be used in related analyses. Thus, the “Funda” and “LargestCities” tables received a relationship through the common column “city”.
A few measures were defined using DAX (Data Analysis Expressions), in a separated table, to support the analyses:
The results are presented through interactive dashboards in Power BI, offering a detailed analysis of the geographic distribution of properties, common features, price and size statistics, and comparisons between provinces and cities.
On the first dashboard, it’s possible to see how the properties are distributed within the country. The number and percentage of listings in each province show their contributions to the housing market supply of the country.
They are 53,929 listings over 2030 cities.
Zuid-Holland, Noord-Holland, and Noord-Brabant are the provinces with the highest number of property listings for sale, accounting for approximately 50% of the country’s total supply. The standard map visual from Power BI uses coordinates as the geolocation indicators.
The second dashboard displays the occurrence of some features among the properties.
Minimum value: €79,000
Maximum value: €1,900,000
Mean: €529,783.60
Median: €450,000
Standard deviation: €280,298.67
Interquartile range: €294,000
1st quartile: €345,000
3rd quartile: €639,000
The same metrics from the previous boxplot presents now a comparison between the provinces:
Column chart: With €5.67K and €5.04K, Noord-Holland and Utrecht have the highest average prices per square meter among the provinces, while Limburg and Groningen have an average of around €3K.
The scatter chart highlights the relationship between the average size of properties and the average price per square meter across different provinces.
Noord-Holland and Utrecht are disputed markets with high prices per m² and smaller property sizes;
Zuid-Holland offers a balance with moderate property prices and sizes;
Zeeland, Groningen, Drenthe, and Limburg provide more affordable options with larger property sizes, making them attractive for buyers seeking more space at lower costs.;
Provinces like Gelderland, Noord-Brabant, Friesland, and Overijssel provide a good balance of size and price, catering to a wide range of buyers.
On the Heat Map, the price per m² values are depicted by color, which allows us to have a general sense of the intensity of the value throughout the country, provinces, cities, and neighborhoods. Some Power BI maps have a point limit for rendering. In the case of the customized visual used in this representation (by Weiwei Cui), the limit is 30,000 points, which can be selected randomly or according to the data input (there is no clear documentation of it). This makes the visualizations more consistent when filtered by provinces or cities.
Now the scatter chart represents, in three dimensions, the main cities of the Netherlands with their average prices per m², average sizes, and population densities.
Amsterdam, standing alone, represents the lowest value for money in terms of price per m² and space; while at the other extreme, also standing alone, is Venlo, with the second lowest average prices and the largest average space among the 32 cities.
A dashed line in the background shows the increasing trend between the variables price per m² and size.
In the last dashboard, there is a bubble map with a size scaler relative to the property price. With a more open zoom, it is possible to see the country with province labels divided by colors, and also use them for filtering. Once filtered to a smaller granularity, such as by city, the bubbles will become more visible, and it is possible to identify the cheapest to most expensive properties.
Both through the zoom and the table next to it, it is also possible to navigate through the neighborhoods, select a property, and open the URL referring to it, which appears in the bottom left corner, as shown in the figure.
Possible limitations of this project:
In the Report’s video it’s possible to better see interactive visualizations.
The Funda Scraper allows the scraping of past data, meaning a history of listings with properties that have also been sold, not just those currently on the market. With this data, it would be possible to perform a temporal analysis of the market, extract patterns, and make predictions, which would be a very interesting extension for this project.