
Python Web Scraping Tutorial For Beginners

Introduction to web scraping

In an ideal world you wouldn’t need to scrape the web, but the world is far from ideal: websites are far from well structured, and in order to gather the data we need it’s likely you’ll have to scrape. I currently find myself having to extract information from the web almost daily, and to do so I scrape that data around 80% of the time rather than getting it from an API. If you can get the information from an API then that is the preferred solution: you’ll usually get the data much faster, and you won’t piss off the website you’re scraping.

What is web scraping?

Web scraping is the process of collecting unstructured data from the HTML, JSON or XML available on a webpage and structuring it into your own data model, e.g. a spreadsheet or database.

This tutorial makes some assumptions: firstly that you have Python installed on either Windows or Mac, and secondly that you know how to install libraries.

Disclaimer

When scraping data you should try to be as friendly as possible: don’t overload the site with requests and accidentally DDoS it, and identify yourself as a crawler.
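
For example, a minimal sketch of a “friendly” request might look like the following. The bot name and contact address in the User-Agent are placeholders (swap in your own details), and the one-second pause is just an illustrative, conservative delay.

import time
import requests

# Identify your crawler with a custom User-Agent (placeholder details below)
headers = {'User-Agent': 'MyScraperBot/0.1 (contact: you@example.com)'}

urls = [
    'https://en.wikipedia.org/wiki/List_of_towns_in_England',
]

for url in urls:
    r = requests.get(url, headers=headers)
    time.sleep(1)  # pause between requests so you don't overload the server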

Libraries needed to scrape the web in python

There are many ways to do web scraping, but to write your first crawler we’re going to keep it simple. Firstly you’ll need a library that has the ability to load a webpage. For this we’re going to use requests. It’s very versatile and simple to use, and you’ll keep using it on your Python journey to implement much more complex features.

The second is being able to find and extract information within the loaded webpage, and for that we’re going to use BeautifulSoup. It gives you a huge array of methods to use on the BeautifulSoup object created from your requests response.

  1. Requests – pip install requests
  2. BeautifulSoup – pip install beautifulsoup4
  3. Selenium – pip install selenium (listed for completeness – not used in this tutorial)

Step 1 – Loading a web page with requests

Right, so firstly I’m going to set all my examples to a Wikipedia page which lists all of the towns in England in a series of tables.

import requests

# Request the Wikipedia page and store the response in r
r = requests.get('https://en.wikipedia.org/wiki/List_of_towns_in_England')

print(r)              # the response object itself
print(r.status_code)  # e.g. 200 on success
print(r.headers)      # the headers returned by the server

After importing requests you store the response in an object (here r) so that you can interact with it.

There are a couple of things to inspect here: firstly the status code of the requested URL, and secondly the header information returned from the server.
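
A common pattern – just a sketch, not part of the tutorial script – is to check the status code before doing anything else with the response:

import requests

r = requests.get('https://en.wikipedia.org/wiki/List_of_towns_in_England')

if r.status_code == 200:
    print('Page loaded OK')
    print(r.headers.get('Content-Type'))  # e.g. 'text/html; charset=UTF-8'
else:
    print('Request failed with status', r.status_code)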

So, having been able to request a web page, we need to interact with the response via BeautifulSoup.

Step 2 – Create BeautifulSoup Object

Within Python idioms this is referred to as creating the soup, and many programmers use the variable “soup” to store all the information that can be found within the page.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/List_of_towns_in_England')

# Parse the raw HTML of the response into a BeautifulSoup object
soup = BeautifulSoup(r.content, 'html.parser')

print(soup)

As we’re going to be iterating over information on the page itself, when you create your soup you pass in the content attribute of the response object, along with the name of the parser you want BeautifulSoup to use ('html.parser' here).

Run the code above and you’ll see a large amount of HTML output; this is everything the server responded with when making that request.
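
If that raw dump is hard to read, BeautifulSoup’s prettify() method returns the same HTML indented so the page structure is easier to follow. Continuing from the soup object above:

# Print an indented version of the HTML (just the first 1,000 characters here)
print(soup.prettify()[:1000])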

Step 3 – Extract HTML elements

With BeautifulSoup we can extract the HTML elements that match our query.

Extracting the title

title = soup.title
title_stripped = soup.title.text

print(title)
print(title_stripped)

Certain elements can be accessed directly from the soup with attribute access, e.g. “.title” – see the first example. This extracts the whole title element, tags included, to print:

<title>List of towns in England – Wikipedia</title>

Most of the time you don’t need these <title> tags, so in order to strip the element down you can append .text, which returns just the text within the title.

Using this method you’re only able to capture the first instance of that element on the page. If the page had two titles (which it shouldn’t) you’d only be able to collect information about the first one.
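
Under the hood, soup.title is shorthand for soup.find('title'), which always returns the first match; if you want every match you need find_all, which returns a list. A quick illustration, continuing from the soup object above:

first_paragraph = soup.find('p')     # the first <p> element only
all_paragraphs = soup.find_all('p')  # a list of every <p> element

print(first_paragraph.text)
print(len(all_paragraphs))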

Extracting links on an HTML page

When it comes to gathering links you need to use the find_all method.

Add this to your code and test it out.

links = soup.find_all('a')

Printing links outputs one long list of information; in order to be more selective you need to iterate through each link one at a time. We can use a for loop to go through each link on the page and, for each link in links, print out just the hyperlink address and the hyperlink text.

for link in links:
    print(link.get('href'))

And for the anchor text of each link

for link in links:
    print(link.text)
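
You can also combine the two in a single loop. Note that some <a> tags won’t have an href attribute, in which case .get('href') returns None; the sketch below simply skips those:

for link in links:
    href = link.get('href')
    if href:  # skip anchors without an href attribute
        print(href, '->', link.text)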

Test yourself

Ok so that was a very simple introduction to selecting elements. Try it for yourself and try to also gather all of the:

  • H1
  • H2


Answers
h1 = soup.find('h1').text
h2 = soup.find_all('h2')

for value in h2:
    print(value.text)

So hopefully you’ve understood a little about opening pages and selecting data from the page.

Step 4 – Extracting data from HTML tables

You will have seen the tables on this page listing all the towns in England, and for most people wanting to scrape this page, I’d imagine this would be their target data source.

You can do this using the libraries listed so far; however, I want to introduce you to a library called pandas. In short, pandas creates virtual spreadsheets (DataFrames) which can store information, perfect for storing an HTML table.

I’m a massive advocate of pandas so will try and use it wherever possible.
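
If you haven’t used pandas before, the basic pattern we’ll rely on is: build a list of rows, pass it to pd.DataFrame along with column names, then write the result out with to_csv. Here’s a tiny sketch with a couple of sample rows (the file name is just an example):

import pandas as pd

# Each inner list is one row of the "virtual spreadsheet"
rows = [
    ['Abingdon-on-Thames', 'Vale of White Horse', 'Oxfordshire'],
    ['Accrington', 'Hyndburn', 'Lancashire'],
]

df = pd.DataFrame(data=rows, columns=['Town', 'District', 'County'])
df.to_csv('example_towns.csv', index=False)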

Objective: Extract all data on towns in England

In this example all of the tables have the same structure, which is really nice. In this case we’re going to put the table-finding part of our web scraper into a function.

Step 1) Find all the HTML table elements

def find_tables(soup):
    # Collect every <table> element on the page
    tables = soup.find_all('table')

Step 2) Create the structure of the tables

def find_tables(soup):
    tables = soup.find_all('table')
    # Goes through each found table and finds all the rows, excluding the headers which we'll add after
    count = 0
    for tbl in tables:
        records = []
        for tr in tbl.find_all("tr")[1:]:
            trs = tr.find_all("td")
            record = []
            record.append(trs[0].text)
            record.append(trs[1].text)
            record.append(trs[2].text)
            record.append(trs[0].a["href"])  # link from the first cell (assumes it contains an <a>)
            records.append(record)

We start from the second row (the [1:] slice) because this way we exclude the headings.

We then build a list of lists, one inner list per row of the table, which we pass into our DataFrame to create the table.

Step 3) Add the headers and export the table to CSV

import pandas as pd

def find_tables(soup):
    tables = soup.find_all('table')
    # Goes through each found table and finds all the rows, excluding the headers which we'll add after
    count = 0
    for tbl in tables:
        records = []
        for tr in tbl.find_all("tr")[1:]:
            trs = tr.find_all("td")
            record = []
            record.append(trs[0].text)
            record.append(trs[1].text)
            record.append(trs[2].text)
            record.append(trs[0].a["href"])  # link from the first cell (assumes it contains an <a>)
            records.append(record)
        # Finds the headings and puts them into a list to pass to the columns of the dataframe, matching them with a slice
        header = tbl.find("tr").text
        headings = header.split('\n')
        headings.insert(4, 'Link')
        df = pd.DataFrame(data=records, columns=headings[1:5], index=None)
        # Only write the header row to the CSV for the first table, then append the rest without headers
        if count == 0:
            df.to_csv('table_export.csv', mode='a', index=None)
        else:
            df.to_csv('table_export.csv', mode='a', index=None, header=None)
        count += 1

I’ve created a separate list for the column names, which we pass to the DataFrame using the columns argument. I only want headers in my CSV in the first row, so with some simple flow control, after the first iteration I tell pandas to skip the headers when I subsequently append to my CSV.

There you have it: you’re now able to export loads of data from the page. If you’re looking to go further, such as scraping the links within the table, then keep an eye out for my crawling tutorial where I’ll cover queuing and other crawl-based topics.

Final script

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get('https://en.wikipedia.org/wiki/List_of_towns_in_England')
soup = BeautifulSoup(r.content,'html.parser')

title = soup.title
title_stripped = soup.title.text
links = soup.find_all('a')
h1 = soup.find('h1').text
h2 = soup.find_all('h2')

def find_tables(soup):
    tables = soup.find_all('table')
    # Goes through each found table and finds all the rows, excluding the headers which we'll add after
    count = 0
    for tbl in tables:
        records = []
        for tr in tbl.find_all("tr")[1:]:
            trs = tr.find_all("td")
            record = []
            record.append(trs[0].text)
            record.append(trs[1].text)
            record.append(trs[2].text)
            record.append(trs[0].a["href"])  # link from the first cell (assumes it contains an <a>)
            records.append(record)
        # Finds the headings and puts them into a list to pass to the columns of the dataframe, matching them with a slice
        header = tbl.find("tr").text
        headings = header.split('\n')
        headings.insert(4, 'Link')
        df = pd.DataFrame(data=records, columns=headings[1:5], index=None)
        # Only write the header row to the CSV for the first table, then append the rest without headers
        if count == 0:
            df.to_csv('table_export.csv', mode='a', index=None)
        else:
            df.to_csv('table_export.csv', mode='a', index=None, header=None)
        count += 1

find_tables(soup)


I hope this gave you some good insight into the power of python and web scraping. Keep tuned for more tutorials!

Will Cecil: Digital Marketer, Python Tinkerer & Tech Enthusiast. Follow me: Website / Twitter / Github