Nathalia V. M. de Oliveira

Logo

An Analytics project utilizing housing market data sourced from the Dutch website Funda.

View the Project on GitHub nathmota/funda-analysis

Funda Property Listings Data Analysis

Nathalia V. M. de Oliveira - April, 2024.

Analysis of the Dutch Housing Market

This project analyzes the Dutch housing market based on property listings available on the Funda.nl website. The objective is:

  1. Provide an overview of the housing market in the Netherlands through online property listed on the Funda.nl website;
  2. Generate insights into the most valuable or cost-effective locations among provinces and cities;
  3. Examine the profile of houses and the distribution of their most common features;
  4. Identify factors that may contribute to the variation in property prices.

The process involves data extraction via web scraping, data transformation for cleaning and preparation, and exploratory analysis through interactive visualizations in Power BI. The result is a set of dashboards offering detailed insights into geographic distribution, property characteristics, and price and size statistics by province and city.

If you are NOT interested in the technical part, you can go straight to the Results.

Data Source

Funda is a Dutch residential platform established since more than 20 years ago. According to their website, Funda is the largest platform connecting supply and demand in the real estate market of the Netherlands, with over 4 million unique visitors per month and hosting around 97% of the Dutch housing market.

According Funda’s Terms and Conditions, scraping its website is only allowed for personal use.

Extraction - Collecting the Data

Webscraping script

The data were scraped from Funda using FundaScraper module for Python. Before running it, there are several different sets of arguments that can generate diverse searches. For this project, the following arguments have been used:

area=provincie
want_to=buy
find_past=False
page_start=1
n_pages = 50
raw_data=True

Although searching by province may return, for instance, 12,000 results, Funda only makes a maximum of 666 pages accessible, as shown below. This equates to 9,990 entries, as each page contains 15 listings.

You can have a preview of this search by province results here.

Then, every 50 pages (or 750 entries), the script scrapes and records the data into a CSV regarding each province.

Search results

The pictures show there are no next pages.

Search results

Data Overview

The data were collected on the 10th of April, 2024, which means that the dataset for this case study includes all the housing property listings available for sale during that period.

The raw scraped dataset contains the following columns:

  1. url
  2. price
  3. address
  4. description
  5. listed_since
  6. zip_code
  7. size
  8. year_built
  9. living_area
  10. kind_of_house
  11. building_type
  12. num_of_rooms
  13. num_of_bathrooms
  14. layout
  15. energy_label
  16. insulation
  17. heating
  18. ownership
  19. exteriors
  20. parking
  21. neighborhood_name
  22. date_list
  23. date_sold
  24. term
  25. price_sold
  26. last_ask_price
  27. last_ask_price_m2
  28. city

The raw data look like expected, a little mess.

raw data

You can also take a look at the whole table. It’s possible to see that it is going to take a lot of work.

And so, the following provinces CSV files were generated. The number corresponds to the quantity of listings returned:

Drenthe - 1912 
Groningen - 1946
Flevoland - 2161
Friesland - 2280
Zeeland - 3033
Limburg - 4095
Utrecht - 4146
Overijssel - 4305
Gelderland - 7410
Zuid Holland - 9990
Noord Brabant - 10004
Noord Holland - 10004

For some reason, Noord Holland and Noord Brabant have more than 9990 listings. I assumed they could be duplicates and would be treated accordingly during the processing step.

The raw dataset then starts with 61,286 entries and 1.1 GB in size.

Transformation - Processing the Data

On the transformation step, when running the data processing script, several processes were applied to the data so that could be better suited for analytics.

At first, the script loads raw data from each province CSV file,

Applied steps

and creates them specified folders in a general chunk folder, in which the processed data will be written after the cleansing.

Applied steps

Using chunks

The processing step ended up heavy and a bit slow. There was a need to adopt a measure to maintain the already processed data and avoid losses due to frequent interruptions in execution caused by unexpected occurrences. In order to breaking down the dataset into smaller and more manageable pieces, the cleaning and writing process for each province was then partitioned into chunks of 300 instances and executed in a loop.

Transformation part I

Process all perfomed using Pandas.

General Data Cleansing:

Specific features cleansing:

where house_type = ‘huis’ and house_id = ‘89977384’.

Concatenating chunks and files

Finally, a third script was used to concatenate the chunks of each province into a file per province and then, concatenates all the province files into a single CSV file.

Applied steps

The final processed file contains 53,931 entries and 18,7 MB in size.

Exploratory Data Analysis (EDA) and Visualization

The exploratory data analysis, as well as the visualizations, were built using Power BI, in order to understand the data distribution, patterns, relationships and generate insights. It comprises the phases bellow:

Transformation part 2 - Power BI

The processed data CSV file is loaded, and techniques such as type adjustment, column splitting and renaming, and creating columns with value ranges are applied in the Power Query editor.

Applied steps

Outliers

For outlier detection, the Z-score method was applied on the property prices and prices per m². Results greater than 3 were considered outliers, which in the dataset reflect prices from €1,900,000 and €4,299.06 per m². Then,

Modeling

At this stage, a new data source has been also incorporated: a table with some of the largest and main cities in the Netherlands along with their respective populations, which will be used in related analyses. Thus, the “Funda” and “LargestCities” tables received a relationship through the common column “city”.

model

Measures

A few measures were defined using DAX (Data Analysis Expressions), in a separated table, to support the analyses:

measures

Visualization and Results

The results are presented through interactive dashboards in Power BI, offering a detailed analysis of the geographic distribution of properties, common features, price and size statistics, and comparisons between provinces and cities.

1. Geo Distribution

On the first dashboard, it’s possible to see how the properties are distributed within the country. The number and percentage of listings in each province show their contributions to the housing market supply of the country.

pg1

They are 53,929 listings over 2030 cities.

Zuid-Holland, Noord-Holland, and Noord-Brabant are the provinces with the highest number of property listings for sale, accounting for approximately 50% of the country’s total supply. The standard map visual from Power BI uses coordinates as the geolocation indicators.

2. Property Overview

pg2

The second dashboard displays the occurrence of some features among the properties.

3. Property Rooms

pg3

4. Size & Price Stats

pg4

5. Age & Size per m² Stats

pg5

6. Provinces Price & Price per m² Average

The same metrics from the previous boxplot presents now a comparison between the provinces:

pg6

7. Provinces Price per m² x Size Average

pg7

The scatter chart highlights the relationship between the average size of properties and the average price per square meter across different provinces.

8. Price per m² Heat Map

On the Heat Map, the price per m² values are depicted by color, which allows us to have a general sense of the intensity of the value throughout the country, provinces, cities, and neighborhoods. Some Power BI maps have a point limit for rendering. In the case of the customized visual used in this representation (by Weiwei Cui), the limit is 30,000 points, which can be selected randomly or according to the data input (there is no clear documentation of it). This makes the visualizations more consistent when filtered by provinces or cities.

pg8

pg8

9. Major Cities by Population

pg9

Now the scatter chart represents, in three dimensions, the main cities of the Netherlands with their average prices per m², average sizes, and population densities.

10. Prices Map Navigation

In the last dashboard, there is a bubble map with a size scaler relative to the property price. With a more open zoom, it is possible to see the country with province labels divided by colors, and also use them for filtering. Once filtered to a smaller granularity, such as by city, the bubbles will become more visible, and it is possible to identify the cheapest to most expensive properties.

pg10

Both through the zoom and the table next to it, it is also possible to navigate through the neighborhoods, select a property, and open the URL referring to it, which appears in the bottom left corner, as shown in the figure.

pg10

Challenges and Limitations

Possible limitations of this project:

Report’s Video

In the Report’s video it’s possible to better see interactive visualizations.

Possible extensions:

The Funda Scraper allows the scraping of past data, meaning a history of listings with properties that have also been sold, not just those currently on the market. With this data, it would be possible to perform a temporal analysis of the market, extract patterns, and make predictions, which would be a very interesting extension for this project.