I like web scraping. I find it amazing that with a bit of code you can collect almost any data from anywhere on the web and make use of it. I am also interested in real estate, particularly in the real estate market of my hometown Hamburg, Germany. To shed some light on this market I decided to collect some data and play with it.

In this post I will run you through a basic web scraper in Python. I am using the libraries requests and BeautifulSoup. With requests I download the HTML document that contains the information I am after, and with BeautifulSoup I extract that information.

Strategy

To get all information on flats for rent in Hamburg I first open the website Immoscout and enter my search criteria, e.g. flats for rent in the Hamburg area.

That leads to the first results page with matching classifieds. There you can see that certain data, like the classified's ID and the address, is already shown.

I will call this data “meta” data.

At the bottom of the results page you see a drop-down menu with page numbers. One page on Immoscout shows a maximum of 20 classifieds, so I need to browse through all available pages to scrape all classifieds.

I use the page numbers and the meta data to build a list of all available classifieds, so I can later scrape detailed information for each of them.

When you now click on one of the offers, you can see that the URL looks like this: https://www.immobilienscout24.de/expose/90724026, plus a bunch of parameters you can ignore for now.

That number at the end of the URL is the classified's ID (or expose ID).

That’s perfect, because now I can easily loop over the meta data, which contains lots of these IDs, build URLs from them and scrape additional information.
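In code, building such a URL from an ID is plain string concatenation. A tiny sketch, using the ID from the example URL above purely for illustration:

base_url = 'https://www.immobilienscout24.de/expose/'
expose_id = '90724026'          # example ID taken from the URL above
print(base_url + expose_id)     # https://www.immobilienscout24.de/expose/90724026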

Setup Code

First I import all the libraries that I need:

from bs4 import BeautifulSoup
import time
import requests
import pandas as pd
import re
import datetime
import numpy as np
import multiprocessing
from pebble import ProcessPool
from concurrent.futures import TimeoutError



Then I define some functions that do the heavy lifting.


def get_soup(url, echo=False):
    
    if echo:
        print('scraping:', url)
    
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, "lxml")
    return soup 



Ok, so this first function uses requests to download the HTML document behind a given URL and parses it so BeautifulSoup can apply its magic to it.
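A quick sanity check might look like this (using the example expose URL from above; the printed title of course depends on whatever the live page returns):

soup = get_soup('https://www.immobilienscout24.de/expose/90724026', echo=True)
print(soup.title.getText())   # prints the page title of that classified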

def parallelize_function_timeout(urls, func):
    # run func on every URL in a pool of worker processes,
    # aborting any single call that takes longer than 90 seconds
    with ProcessPool() as pool:
        future = pool.map(func, urls, timeout=90)
        iterator = future.result()
        result = pd.DataFrame()
        while True:
            try:
                # collect the next worker's DataFrame
                result = result.append(next(iterator))
            except StopIteration:
                # all URLs have been processed
                break
            except TimeoutError as error:
                # this URL took too long and is skipped
                print("scraping took longer than %d seconds" % error.args[1])
        return result
		
		
		

The function parallelize_function_timeout takes a list of inputs (URLs) and a function, and runs that function in parallel using multiprocessing.

I use multiprocessing here because scraping large numbers of URLs is a heavily I/O-bound task. It is not that computationally expensive; the bottleneck is waiting for network responses. To speed things up I scrape with multiple processes simultaneously.

So what's going on in this function?

First, I open a pool of subprocesses and run the function on the list of URLs with future = pool.map(func, urls, timeout=90), specifying a timeout duration in seconds. The results from all subprocesses are appended into result and returned from parallelize_function_timeout. Once the iterator is exhausted (StopIteration), the loop ends; if the function takes too long for one URL, it times out and that URL is skipped.
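To see the timeout behaviour in isolation, here is a minimal, self-contained sketch with a toy function in place of the scraper (slow_square and the 2-second timeout are made up for the demo):

import time
from pebble import ProcessPool
from concurrent.futures import TimeoutError

def slow_square(x):
    # pretend one input is slow, like a hanging HTTP request
    time.sleep(5 if x == 3 else 0.1)
    return x * x

if __name__ == '__main__':
    with ProcessPool() as pool:
        future = pool.map(slow_square, range(5), timeout=2)
        iterator = future.result()
        while True:
            try:
                print(next(iterator))
            except StopIteration:
                break
            except TimeoutError as error:
                print("took longer than %d seconds, skipping" % error.args[1])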

We are now done with the process management and can finally take care of the information retrieval.


def scrape_meta_chunk(url, return_soup=False):
    
    soup = get_soup(url)
    try:
        
        data = pd.DataFrame(
            [
            [location['data-result-id'],
             location.getText().split(', ')[-1],
             location.getText().split(', ')[-2],
             location.getText().split(', ')[-3] if len(location.getText().split(', ')) == 3 else '',
             str(datetime.datetime.now())]
            for location in
            soup.findAll("button", {"class": "button-link link-internal result-list-entry__map-link", "title": "Auf der Karte anzeigen"})], columns=['id','city_county', 'city_quarter', 'street', 'scraped_ts']
        )
        
    except Exception as err:
        print(url, err)
        data = pd.DataFrame()

    if return_soup:
        return data, soup
    else:
        return data 



This function extracts what I earlier called meta data. It downloads a search results page from Immoscout and extracts the IDs and addresses of all 20 listed results.

For that I use a Python list comprehension to iterate through occurrences of the HTML tag <button> with the class button-link link-internal result-list-entry__map-link and the title Auf der Karte anzeigen. From each occurrence it retrieves the information we are after and stores it in a neat pandas DataFrame.
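To make the extraction concrete, here is a minimal sketch against a hand-written snippet of what such a button roughly looks like (the markup and values are illustrative, taken from one of the scraped rows shown at the end of this post):

sample_html = ('<button class="button-link link-internal result-list-entry__map-link" '
               'title="Auf der Karte anzeigen" data-result-id="110713974">'
               'Glückel-von-Hameln-Straße 2, Altona-Nord, Hamburg</button>')

button = BeautifulSoup(sample_html, "lxml").find(
    "button",
    {"class": "button-link link-internal result-list-entry__map-link",
     "title": "Auf der Karte anzeigen"})

parts = button.getText().split(', ')
print(button['data-result-id'], parts[-1], parts[-2], parts[-3])
# 110713974 Hamburg Altona-Nord Glückel-von-Hameln-Straße 2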


def scrape_details_chunk(realty):
        
    expose_id = []
    title = []
    realty_type = []
    floor = []
    square_m = []
    storage_m = []
    num_rooms = []
    num_bedrooms = []
    num_baths = []
    num_carparks = []
    year_built = []
    last_refurb = []
    quality = []
    heating_type = []
    fuel_type = []
    energy_consumption = []
    energy_class = []
    net_rent = []
    gross_rent = []
    
            
    url = 'https://www.immobilienscout24.de/expose/' + realty
    
    soup = get_soup(url)
    expose_id.append(realty)
    title.append(soup.find('title').getText())
    if soup.find('dd', {'class':"is24qa-typ grid-item three-fifths"}):
        realty_type.append(soup.find('dd', {'class':"is24qa-typ grid-item three-fifths"}).getText().replace(' ', ''))
    else:
        realty_type.append(0)
    if soup.find('dd', {'class':"is24qa-etage grid-item three-fifths"}):
        floor.append(soup.find('dd', {'class':"is24qa-etage grid-item three-fifths"}).getText())
    else:
        floor.append(0)
    if soup.find('dd', {'class':"is24qa-wohnflaeche-ca grid-item three-fifths"}):
        square_m.append(soup.find('dd', {'class':"is24qa-wohnflaeche-ca grid-item three-fifths"}).getText().replace(' ', '').replace('m²', ''))
    else:
        square_m.append(0)
    if soup.find('dd', {'class':"is24qa-nutzflaeche-ca grid-item three-fifths"}):
        storage_m.append(soup.find('dd', {'class':"is24qa-nutzflaeche-ca grid-item three-fifths"}).getText().replace(' ', '').replace('m²', ''))
    else:
        storage_m.append(0)
    if soup.find('dd', {'class':"is24qa-zimmer grid-item three-fifths"}):
        num_rooms.append(soup.find('dd', {'class':"is24qa-zimmer grid-item three-fifths"}).getText().replace(' ', ''))
    else:
         num_rooms.append(0)
    if soup.find('dd', {'class':"is24qa-schlafzimmer grid-item three-fifths"}):
        num_bedrooms.append(soup.find('dd', {'class':"is24qa-schlafzimmer grid-item three-fifths"}).getText().replace(' ', ''))
    else:
        num_bedrooms.append(0)
    if soup.find('dd', {'class':"is24qa-badezimmer grid-item three-fifths"}):
        num_baths.append(soup.find('dd', {'class':"is24qa-badezimmer grid-item three-fifths"}).getText().replace(' ', ''))
    else:
        num_baths.append(0)
    if soup.find('dd', {'class':"is24qa-garage-stellplatz grid-item three-fifths"}):
        num_carparks.append(re.sub('[^0-9]', '', soup.find('dd', {'class':"is24qa-garage-stellplatz grid-item three-fifths"}).getText()))
    else:
        num_carparks.append(0)
    if soup.find('div', {'class':"is24qa-kaltmiete is24-value font-semibold"}):
        net_rent.append(re.sub('[^0-9]', '', soup.find('div', {'class':"is24qa-kaltmiete is24-value font-semibold"}).getText()))
    else:
        net_rent.append(0)
    if soup.find('dd', {'class':"is24qa-baujahr grid-item three-fifths"}):
        year_built.append(soup.find('dd', {'class':"is24qa-baujahr grid-item three-fifths"}).getText())
    else:
        year_built.append(0)
    if soup.find('dd', {'class':"is24qa-modernisierung-sanierung grid-item three-fifths"}):
        last_refurb.append(soup.find('dd', {'class':"is24qa-modernisierung-sanierung grid-item three-fifths"}).getText())
    else:
        last_refurb.append(0)
    if soup.find('dd', {'class':"is24qa-qualitaet-der-ausstattung grid-item three-fifths"}):
        quality.append(soup.find('dd', {'class':"is24qa-qualitaet-der-ausstattung grid-item three-fifths"}).getText().strip())
    else:
        quality.append(0)
    if soup.find('dd', {'class':"is24qa-heizungsart grid-item three-fifths"}):
        heating_type.append(soup.find('dd', {'class':"is24qa-heizungsart grid-item three-fifths"}).getText().strip())
    else:
        heating_type.append(0)
    if soup.find('dd', {'class':"is24qa-wesentliche-energietraeger grid-item three-fifths"}):
        fuel_type.append(soup.find('dd', {'class':"is24qa-wesentliche-energietraeger grid-item three-fifths"}).getText().strip())
    else:
        fuel_type.append(0)
    if soup.find('dd', {'class':"is24qa-endenergiebedarf grid-item three-fifths"}):
        energy_consumption.append(soup.find('dd', {'class':"is24qa-endenergiebedarf grid-item three-fifths"}).getText().strip())
    else:
        energy_consumption.append(0)
    if soup.find('dd', {'class':"is24qa-energieeffizienzklasse grid-item three-fifths"}):
        energy_class.append(soup.find('dd', {'class':"is24qa-energieeffizienzklasse grid-item three-fifths"}).getText().strip())
    else:
        energy_class.append(0)
    if soup.find('dd', {'class':"is24qa-gesamtmiete grid-item three-fifths font-bold"}):
        gross_rent.append(soup.find('dd', {'class':"is24qa-gesamtmiete grid-item three-fifths font-bold"}).getText().strip().replace(' €',''))
    else:
        gross_rent.append(0)
    results = pd.DataFrame({
    'id': expose_id,
    'title': title,
    'realty_type': realty_type,
    'floor': floor,
    'square_m': square_m,
    'storage_m': storage_m,
    'num_rooms': num_rooms,
    'num_bedrooms': num_bedrooms,
    'num_baths': num_baths,
    'num_carparks': num_carparks,
    'year_built': year_built,
    'last_refurb': last_refurb,
    'quality': quality,
    'heating_type': heating_type,
    'fuel_type': fuel_type,
    'energy_consumption': energy_consumption,
    'energy_class': energy_class,
    'net_rent': net_rent,
    'gross_rent': gross_rent,
    'scraped_ts': str(datetime.datetime.now())
    })
    
    return results



The function scrape_details_chunk is responsible for scraping the detailed information from each flat's details page. This function comes into action when we loop through the classified IDs using URLs like https://www.immobilienscout24.de/expose/90724026.

What it does in particular is this: first it initiates empty lists for each piece of information we want to scrape, like the classified's title, square meters, number of rooms, you get the idea. Then it creates the URL for a specific flat by concatenating the base URL https://www.immobilienscout24.de/expose/ with the classified's ID from the meta data. It uses BeautifulSoup to parse and extract the data we are after and combines it into a DataFrame.
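The same find-or-default lookup repeats for every field, so it could also be factored into a small helper. Here is a sketch of that pattern (the extract_dd helper is hypothetical and not part of the code above):

def extract_dd(soup, css_class, default=0):
    # return the text of the <dd> element with the given classes, or a default
    tag = soup.find('dd', {'class': css_class})
    return tag.getText().strip() if tag else default

# e.g. the number of rooms:
# num_rooms.append(extract_dd(soup, "is24qa-zimmer grid-item three-fifths"))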

That's the setup. Now we can start scraping.

The Scraping Process

So far I have explained how the scraping will be done and which functions do the actual work. Now we need to put all of that to use.

First, we scrape the meta data from the first results page of our real estate search by running scrape_meta_chunk on https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Umkreissuche/Hamburg/-/1840/2621814/-/-/30.



url = 'https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Umkreissuche/Hamburg/-/1840/2621814/-/-/30' 

realty_meta_df, soup = scrape_meta_chunk(url, return_soup=True) 

num_pages = len(soup.find_all('option')) 

print('scraped', len(realty_meta_df), 'realties on first page') 
print('found', num_pages, 'pages to scrape') 


> scraped 20 realties on first page
> found 101 pages to scrape 

With num_pages = len(soup.find_all('option')) we extract the number of result pages for this search, so we can loop over each page later. Now that we know how many result pages there are (101 in this case), we can loop over them and extract the meta data from each of them.
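This count works because the page-number drop-down is rendered as a <select> with one <option> per results page; it assumes no other <option> tags appear on the page. A toy illustration with made-up markup:

pager_html = '<select><option>1</option><option>2</option><option>3</option></select>'
print(len(BeautifulSoup(pager_html, 'lxml').find_all('option')))   # 3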



realty_meta_urls = ['https://www.immobilienscout24.de/Suche/S-T/P-' + str(page) + '/Wohnung-Miete/Umkreissuche/Hamburg/-/1840/2621814/-/-/30' for page in range(2, num_pages + 1)]
realty_meta_df = realty_meta_df.append(parallelize_function_timeout(realty_meta_urls, scrape_meta_chunk))
print('Done')
 
    
      
> Done 

Here we create a new URL for each page number we found. This gives us the URLs of the remaining 100 result pages to scrape meta data from (the first page was already scraped above). We pass this list and the function scrape_meta_chunk to parallelize_function_timeout to scrape these URLs on multiple cores. This gives us a DataFrame containing meta data for all found classifieds, including their IDs. With that we are able to scrape each detail page as follows.



realty_details = pd.DataFrame(
columns=[
    'expose_id',
    'title',
    'realty_type',
    'floor',
    'square_m',
    'storage_m',
    'num_rooms',
    'num_bedrooms',
    'num_baths',
    'num_carparks',
    'year_built',
    'last_refurb',
    'quality',
    'heating_type',
    'fuel_type',
    'energy_consumption',
    'energy_class',
    'net_rent',
    'gross_rent' ])
	
	

realties = list(set(list(realty_meta_df.id))) 

print('scraping', len(realties), 'realties...') 
realty_details_df = parallelize_function_timeout(realties, scrape_details_chunk)
realty_details_df = realty_details_df.drop_duplicates() 
realty_details_df = realty_details_df.reset_index(drop=True) 

print('scraped', len(realty_details_df), 'realties.') 



> scraping 2020 realties...
> scraped 2020 realties. 

Here I am creating a new DataFrame with a column for each data dimension I am scraping. Then I pass the list of classified IDs to scrape_details_chunk via parallelize_function_timeout. This takes a while and finally returns a clean DataFrame containing all the desired detail data. At last I merge the detail data with the meta data to end up with a single DataFrame and print the first few rows of the scraped data.

That’s it. We are done.


realty_meta_df['id'] = realty_meta_df['id'].astype(int) 
realty_details_df['id'] = realty_details_df['id'].astype(int) 

final_data = realty_meta_df.drop('scraped_ts',axis = 1).merge(realty_details_df, on='id', suffixes = ['_meta', '_details']) 


 
final_data.head() 
id city_county city_quarter street title realty_type floor square_m storage_m num_rooms ... year_built last_refurb quality heating_type fuel_type energy_consumption energy_class net_rent gross_rent scraped_ts
0 110713974 Hamburg Altona-Nord Glückel-von-Hameln-Straße 2 neue Wohnung - neues Glück Etagenwohnung 0 75,14 0 2 ... 2019 0 0 0 0 0 0 1150 1.370,66 2019-05-14 13:30:01.252678
1 110714010 Hamburg Altona-Nord Susanne-von-Paczensky-Straße 9 Grandiose 4-Zimmerwohung! ***Erstbezug*** Etagenwohnung 0 99,39 0 4 ... 2019 0 0 0 0 0 0 1440 1.828,13 2019-05-14 13:29:57.649693
2 110247048 Hamburg Altona-Nord Susanne-von-Paczensky-Straße 7 Traumhafte Penthousewohnung! ***Erstbezug*** Etagenwohnung 0 166,02 0 5 ... 2019 0 0 0 0 0 0 2500 3.090,15 2019-05-14 13:29:32.219968
3 110247042 Hamburg Altona-Nord Susanne-von-Paczensky-Straße 11 Geniale 3-Zimmerwohnung! ***Erstbezug*** Etagenwohnung 0 97,65 0 3 ... 2019 0 0 0 0 0 0 1470 1.854,57 2019-05-14 13:29:44.507045
4 110714047 Hamburg Altona-Nord Eva-Rühmkorf-Straße 8 ***Hervorragende 3-Zimmerwohnung*** Erstbezug! Etagenwohnung 0 86,04 0 3 ... 2019 0 0 0 0 0 0 1300 1.648,85 2019-05-14 13:29:47.011108