Parallel web scraping in Python - immobilienscout24
I like web scraping. I find it amazing that with a bit of code one can collect data from almost anywhere on the web and make use of it. I am also interested in real estate, particularly in the real estate market of my hometown Hamburg, Germany. To shed some light on this market I decided to collect some data and play with it.
In this post I will run you through a basic web scraper in Python. I am using the libraries requests and BeautifulSoup. With requests I will download the HTML document that contains the information I am after. With BeautifulSoup I will extract that information.
Strategy
In order to get all information on real estate for rent in Hamburg I first go to the website Immoscout and enter my search criteria, e.g. flats for rent in the Hamburg area.
That leads us to the first results page with matching classifieds. There you can see that certain data, like the classified's ID and address, is already shown.
I will call this data “meta” data.
At the bottom of the results page there is a drop-down menu with page numbers. One page on Immoscout shows a maximum of 20 classifieds, so I need to browse through all available pages to scrape all classifieds.
I use the page information and the meta data to build a list of all available classifieds, so I can later scrape detailed information for each of them.
When you click on one of the offers, you can see that the URL looks like this: https://www.immobilienscout24.de/expose/90724026
plus a bunch of parameters you can ignore for now.
The number at the end of the URL is the classified's ID (or expose ID).
That's perfect, because now I can easily loop over the meta-data, which contains lots of these IDs, build URLs from them and scrape additional information.
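As a tiny sketch of that strategy (the expose ID below is just the example from above, and the search URL is the one used later in this post), both the result-page URLs and the detail-page URLs can be built by simple string concatenation:
# Minimal sketch of the URL patterns used in this post.
# P-<n> selects the result page; /expose/<id> points to a single classified.
page = 2
expose_id = '90724026'  # example ID from above, may no longer be online
search_page_url = 'https://www.immobilienscout24.de/Suche/S-T/P-' + str(page) + '/Wohnung-Miete/Umkreissuche/Hamburg/-/1840/2621814/-/-/30'
detail_url = 'https://www.immobilienscout24.de/expose/' + expose_id
print(search_page_url)
print(detail_url)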
Setup Code
First I import all the libraries that I need:
from bs4 import BeautifulSoup
import time
import requests
import pandas as pd
import re
import datetime
import numpy as np
import multiprocessing
from pebble import ProcessPool
from concurrent.futures import TimeoutError
Then I define some functions that do the heavy lifting.
def get_soup(url, echo=False):
    if echo:
        print('scraping:', url)
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, "lxml")
    return soup
Ok, so this first function uses requests to download the HTML document behind a given URL and parses it so BeautifulSoup can apply its magic to it.
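For illustration, here is how get_soup would typically be called; the URL is the example expose from above (whether it still points to a live classified is anyone's guess), and pulling the page title is just one thing you might do with the parsed document:
# Hedged usage sketch for get_soup.
soup = get_soup('https://www.immobilienscout24.de/expose/90724026', echo=True)
print(soup.find('title').getText())  # the classified's page title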
def parallelize_function_timeout(urls, func):
    with ProcessPool() as pool:
        future = pool.map(func, urls, timeout=90)
        iterator = future.result()
        result = pd.DataFrame()
        while True:
            try:
                result = result.append(next(iterator))
            except StopIteration:
                break
            except TimeoutError as error:
                print("scraping took longer than %d seconds" % error.args[1])
    return result
The parallelize_function_timeout function takes a list of inputs (URLs) and a function, and runs that function in parallel using multiprocessing.
I use multiprocessing here because scraping a large number of URLs is an I/O-bound task: it is not particularly computationally expensive, the bottleneck is waiting for the network. To speed it up I scrape with multiple processes simultaneously.
So what's going on in this function?
First, I start a pool of subprocesses and run our function on the list of URLs with future = pool.map(func, urls, timeout=90), specifying a timeout duration in seconds.
The results of all subprocesses running our function are appended into result and returned from the parallelize_function_timeout function. When the iterator is exhausted, a StopIteration ends the loop; if scraping a single URL takes longer than the timeout, a TimeoutError is raised and that URL is skipped.
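Since the workload is I/O-bound, a thread pool would do the job as well. Below is a minimal sketch of that alternative using concurrent.futures' ThreadPoolExecutor, purely for comparison (the rest of the post sticks with pebble's ProcessPool); note that a plain thread pool cannot forcefully stop a worker that hangs, so there is no per-URL timeout here. The function name and max_workers parameter are my own.
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def parallelize_function_threads(urls, func, max_workers=8):
    # Same idea as parallelize_function_timeout, but with threads instead of processes.
    # Unlike pebble's ProcessPool, a thread cannot be killed if one URL hangs.
    result = pd.DataFrame()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(func, urls):
            result = result.append(chunk)
    return result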
We are now done with the process management and can finally take care of the information retrieval.
def scrape_meta_chunk(url, return_soup=False):
    soup = get_soup(url)
    try:
        data = pd.DataFrame(
            [
                [location['data-result-id'],
                 location.getText().split(', ')[-1],
                 location.getText().split(', ')[-2],
                 location.getText().split(', ')[-3] if len(location.getText().split(', ')) == 3 else '',
                 str(datetime.datetime.now())]
                for location in
                soup.findAll("button", {"class": "button-link link-internal result-list-entry__map-link", "title": "Auf der Karte anzeigen"})
            ],
            columns=['id', 'city_county', 'city_quarter', 'street', 'scraped_ts']
        )
    except Exception as err:
        print(url, err)
        data = pd.DataFrame()
    if return_soup:
        return data, soup
    else:
        return data
This function extracts what I earlier called meta-data. It downloads a search results page from Immoscout and extracts the IDs and addresses of all (up to 20) listed results.
For that I use a Python list comprehension to iterate over all occurrences of the HTML tag <button> with the class button-link link-internal result-list-entry__map-link and the title Auf der Karte anzeigen.
From each occurrence it retrieves the information we are after and stores it in a neat pandas DataFrame.
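To make that concrete, here is a toy example with a simplified, made-up button snippet (real result pages carry more attributes, but the data-result-id attribute and the comma-separated address text are what the comprehension relies on; the ID and address are taken from the sample output at the end of this post):
# Hedged illustration with simplified, made-up markup.
from bs4 import BeautifulSoup
html = ('<button class="button-link link-internal result-list-entry__map-link" '
        'title="Auf der Karte anzeigen" data-result-id="110713974">'
        'Glückel-von-Hameln-Straße 2, Altona-Nord, Hamburg</button>')
toy_soup = BeautifulSoup(html, "lxml")
button = toy_soup.find("button", {"title": "Auf der Karte anzeigen"})
print(button['data-result-id'])         # 110713974
print(button.getText().split(', '))     # ['Glückel-von-Hameln-Straße 2', 'Altona-Nord', 'Hamburg']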
def scrape_details_chunk(realty):
    # One list per field to be extracted from the detail page.
    expose_id = []
    title = []
    realty_type = []
    floor = []
    square_m = []
    storage_m = []
    num_rooms = []
    num_bedrooms = []
    num_baths = []
    num_carparks = []
    year_built = []
    last_refurb = []
    quality = []
    heating_type = []
    fuel_type = []
    energy_consumption = []
    energy_class = []
    net_rent = []
    gross_rent = []
    # Build the expose URL from the classified's ID and download the page.
    url = 'https://www.immobilienscout24.de/expose/' + realty
    soup = get_soup(url)
    expose_id.append(realty)
    title.append(soup.find('title').getText())
    # Each field is looked up by its CSS class; missing fields default to 0.
    if soup.find('dd', {'class':"is24qa-typ grid-item three-fifths"}):
        realty_type.append(soup.find('dd', {'class':"is24qa-typ grid-item three-fifths"}).getText().replace(' ', ''))
    else:
        realty_type.append(0)
    if soup.find('dd', {'class':"is24qa-etage grid-item three-fifths"}):
        floor.append(soup.find('dd', {'class':"is24qa-etage grid-item three-fifths"}).getText())
    else:
        floor.append(0)
    if soup.find('dd', {'class':"is24qa-wohnflaeche-ca grid-item three-fifths"}):
        square_m.append(soup.find('dd', {'class':"is24qa-wohnflaeche-ca grid-item three-fifths"}).getText().replace(' ', '').replace('m²', ''))
    else:
        square_m.append(0)
    if soup.find('dd', {'class':"is24qa-nutzflaeche-ca grid-item three-fifths"}):
        storage_m.append(soup.find('dd', {'class':"is24qa-nutzflaeche-ca grid-item three-fifths"}).getText().replace(' ', '').replace('m²', ''))
    else:
        storage_m.append(0)
    if soup.find('dd', {'class':"is24qa-zimmer grid-item three-fifths"}):
        num_rooms.append(soup.find('dd', {'class':"is24qa-zimmer grid-item three-fifths"}).getText().replace(' ', ''))
    else:
        num_rooms.append(0)
    if soup.find('dd', {'class':"is24qa-schlafzimmer grid-item three-fifths"}):
        num_bedrooms.append(soup.find('dd', {'class':"is24qa-schlafzimmer grid-item three-fifths"}).getText().replace(' ', ''))
    else:
        num_bedrooms.append(0)
    if soup.find('dd', {'class':"is24qa-badezimmer grid-item three-fifths"}):
        num_baths.append(soup.find('dd', {'class':"is24qa-badezimmer grid-item three-fifths"}).getText().replace(' ', ''))
    else:
        num_baths.append(0)
    if soup.find('dd', {'class':"is24qa-garage-stellplatz grid-item three-fifths"}):
        num_carparks.append(re.sub('[^0-9]', '', soup.find('dd', {'class':"is24qa-garage-stellplatz grid-item three-fifths"}).getText()))
    else:
        num_carparks.append(0)
    if soup.find('div', {'class':"is24qa-kaltmiete is24-value font-semibold"}):
        net_rent.append(re.sub('[^0-9]', '', soup.find('div', {'class':"is24qa-kaltmiete is24-value font-semibold"}).getText()))
    else:
        net_rent.append(0)
    if soup.find('dd', {'class':"is24qa-baujahr grid-item three-fifths"}):
        year_built.append(soup.find('dd', {'class':"is24qa-baujahr grid-item three-fifths"}).getText())
    else:
        year_built.append(0)
    if soup.find('dd', {'class':"is24qa-modernisierung-sanierung grid-item three-fifths"}):
        last_refurb.append(soup.find('dd', {'class':"is24qa-modernisierung-sanierung grid-item three-fifths"}).getText())
    else:
        last_refurb.append(0)
    if soup.find('dd', {'class':"is24qa-qualitaet-der-ausstattung grid-item three-fifths"}):
        quality.append(soup.find('dd', {'class':"is24qa-qualitaet-der-ausstattung grid-item three-fifths"}).getText().strip())
    else:
        quality.append(0)
    if soup.find('dd', {'class':"is24qa-heizungsart grid-item three-fifths"}):
        heating_type.append(soup.find('dd', {'class':"is24qa-heizungsart grid-item three-fifths"}).getText().strip())
    else:
        heating_type.append(0)
    if soup.find('dd', {'class':"is24qa-wesentliche-energietraeger grid-item three-fifths"}):
        fuel_type.append(soup.find('dd', {'class':"is24qa-wesentliche-energietraeger grid-item three-fifths"}).getText().strip())
    else:
        fuel_type.append(0)
    if soup.find('dd', {'class':"is24qa-endenergiebedarf grid-item three-fifths"}):
        energy_consumption.append(soup.find('dd', {'class':"is24qa-endenergiebedarf grid-item three-fifths"}).getText().strip())
    else:
        energy_consumption.append(0)
    if soup.find('dd', {'class':"is24qa-energieeffizienzklasse grid-item three-fifths"}):
        energy_class.append(soup.find('dd', {'class':"is24qa-energieeffizienzklasse grid-item three-fifths"}).getText().strip())
    else:
        energy_class.append(0)
    if soup.find('dd', {'class':"is24qa-gesamtmiete grid-item three-fifths font-bold"}):
        gross_rent.append(soup.find('dd', {'class':"is24qa-gesamtmiete grid-item three-fifths font-bold"}).getText().strip().replace(' €',''))
    else:
        gross_rent.append(0)
    # Combine all fields into a one-row DataFrame.
    results = pd.DataFrame({
        'id': expose_id,
        'title': title,
        'realty_type': realty_type,
        'floor': floor,
        'square_m': square_m,
        'storage_m': storage_m,
        'num_rooms': num_rooms,
        'num_bedrooms': num_bedrooms,
        'num_baths': num_baths,
        'num_carparks': num_carparks,
        'year_built': year_built,
        'last_refurb': last_refurb,
        'quality': quality,
        'heating_type': heating_type,
        'fuel_type': fuel_type,
        'energy_consumption': energy_consumption,
        'energy_class': energy_class,
        'net_rent': net_rent,
        'gross_rent': gross_rent,
        'scraped_ts': str(datetime.datetime.now())
    })
    return results
The function scrape_details_chunk is responsible for scraping the detailed information from each flat's detail page. This function comes into action when we loop through each classified's ID using URLs like https://www.immobilienscout24.de/expose/90724026.
What it does in particular is this: First it initializes empty lists for each piece of information we want to scrape, like the classified's title, square meters, number of rooms, you get the idea. Then it creates the URL for a specific flat by concatenating the base URL https://www.immobilienscout24.de/expose/ with the classified's ID from the meta-data. It uses BeautifulSoup to parse the page, extracts the data we are after and combines it into a DataFrame.
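As a quick sanity check, the function can be run on a single ID, again the example expose ID from above (which may no longer be online); the column selection is just one way to peek at the result:
# Hedged single-ID usage sketch.
single_flat = scrape_details_chunk('90724026')
print(single_flat[['id', 'title', 'square_m', 'num_rooms', 'net_rent']])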
That's the setup. Now we can start scraping.
The Scraping Process
So far I have explained how the scraping will be done and which functions I came up with to do the actual scraping. Now we need to put all of that to use.
First, we scrape the meta-data from the first result page of our real estate search. So we use the function scrape_meta_chunk on https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Umkreissuche/Hamburg/-/1840/2621814/-/-/30.
url = 'https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Umkreissuche/Hamburg/-/1840/2621814/-/-/30'
realty_meta_df, soup = scrape_meta_chunk(url, return_soup=True)
num_pages = len(soup.find_all('option'))
print('scraped', len(realty_meta_df), 'realties on first page')
print('found', num_pages, 'pages to scrape')
> scraped 20 realties on first page
> found 101 pages to scrape
With num_pages = len(soup.find_all('option')) we extract the number of result pages for this search so we can loop over each page later.
Now that we know how many pages there are with our search results (101 in this case) we can start looping over them and extract the meta-data from each of them.
realty_meta_urls = ['https://www.immobilienscout24.de/Suche/S-T/P-' + str(page) + '/Wohnung-Miete/Umkreissuche/Hamburg/-/1840/2621814/-/-/30' for page in range(2, num_pages + 1)]
realty_meta_df = realty_meta_df.append(parallelize_function_timeout(realty_meta_urls, scrape_meta_chunk))
print('Done')
> Done
Here we create a new URL for each remaining page number, i.e. pages 2 through 101, since we already scraped page 1. This gives us a list of 100 pages to scrape meta-data from.
We pass this list (realty_meta_urls) together with the function scrape_meta_chunk to parallelize_function_timeout to scrape these URLs on multiple cores.
This gives us a DataFrame containing meta-data for all found classifieds including their IDs.
With that we are able to scrape each detail page as follows.
realty_details = pd.DataFrame(
    columns=[
        'expose_id',
        'title',
        'realty_type',
        'floor',
        'square_m',
        'storage_m',
        'num_rooms',
        'num_bedrooms',
        'num_baths',
        'num_carparks',
        'year_built',
        'last_refurb',
        'quality',
        'heating_type',
        'fuel_type',
        'energy_consumption',
        'energy_class',
        'net_rent',
        'gross_rent'])
realties = list(set(list(realty_meta_df.id)))
print('scraping', len(realties), 'realties...')
realty_details_df = parallelize_function_timeout(realties, scrape_details_chunk)
realty_details_df = realty_details_df.drop_duplicates()
realty_details_df = realty_details_df.reset_index(drop=True)
print('scraped', len(realty_details_df), 'realties.')
> scraping 2020 realties...
> scraped 2020 realties.
Here I am creating a new DataFrame with a column for each data dimension I am scraping. Then I pass a list of classified IDs to scrape_details_chunk via parallelize_function_timeout. This takes a while and finally returns a clean DataFrame containing all the desired detail-data.
Finally, I merge the detail-data with the meta-data to end up with a single DataFrame and print the first few rows of the scraped data.
That’s it. We are done.
realty_meta_df['id'] = realty_meta_df['id'].astype(int)
realty_details_df['id'] = realty_details_df['id'].astype(int)
final_data = realty_meta_df.drop('scraped_ts',axis = 1).merge(realty_details_df, on='id', suffixes = ['_meta', '_details'])
final_data.head()
 | id | city_county | city_quarter | street | title | realty_type | floor | square_m | storage_m | num_rooms | ... | year_built | last_refurb | quality | heating_type | fuel_type | energy_consumption | energy_class | net_rent | gross_rent | scraped_ts
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 110713974 | Hamburg | Altona-Nord | Glückel-von-Hameln-Straße 2 | neue Wohnung - neues Glück | Etagenwohnung | 0 | 75,14 | 0 | 2 | ... | 2019 | 0 | 0 | 0 | 0 | 0 | 0 | 1150 | 1.370,66 | 2019-05-14 13:30:01.252678 |
1 | 110714010 | Hamburg | Altona-Nord | Susanne-von-Paczensky-Straße 9 | Grandiose 4-Zimmerwohung! ***Erstbezug*** | Etagenwohnung | 0 | 99,39 | 0 | 4 | ... | 2019 | 0 | 0 | 0 | 0 | 0 | 0 | 1440 | 1.828,13 | 2019-05-14 13:29:57.649693 |
2 | 110247048 | Hamburg | Altona-Nord | Susanne-von-Paczensky-Straße 7 | Traumhafte Penthousewohnung! ***Erstbezug*** | Etagenwohnung | 0 | 166,02 | 0 | 5 | ... | 2019 | 0 | 0 | 0 | 0 | 0 | 0 | 2500 | 3.090,15 | 2019-05-14 13:29:32.219968 |
3 | 110247042 | Hamburg | Altona-Nord | Susanne-von-Paczensky-Straße 11 | Geniale 3-Zimmerwohnung! ***Erstbezug*** | Etagenwohnung | 0 | 97,65 | 0 | 3 | ... | 2019 | 0 | 0 | 0 | 0 | 0 | 0 | 1470 | 1.854,57 | 2019-05-14 13:29:44.507045 |
4 | 110714047 | Hamburg | Altona-Nord | Eva-Rühmkorf-Straße 8 | ***Hervorragende 3-Zimmerwohnung*** Erstbezug! | Etagenwohnung | 0 | 86,04 | 0 | 3 | ... | 2019 | 0 | 0 | 0 | 0 | 0 | 0 | 1300 | 1.648,85 | 2019-05-14 13:29:47.011108 |