Python Web Scraping

Introduction to Web Scraping

Welcome to the wide world of web scraping! Web scraping is used by many fields to collect data not easily available in other formats. You could be a journalist, working on a new story, or a data scientist extracting a new dataset. Web scraping is a useful tool even for just a casual programmer, if you need to check your latest homework assignments on your university page and have them emailed to you. Whatever your motivation, we hope you are ready to learn!

In this chapter, we will cover the following topics:

Introducing the field of web scraping
Explaining the legal challenges
Explaining Python 3 setup
Performing background research on our target website
Progressively building our own advanced web crawler
Using non-standard libraries to help scrape the Web

Is web scraping legal?

Web scraping, and what is legally permissible when web scraping, are still being established despite numerous rulings over the past two decades. If the scraped data is being used for personal and private use, and within fair use of copyright laws, there is usually no problem. However, if the data is going to be republished, if the scraping is aggressive enough to take down the site, or if the content is copyrighted and the scraper violates the terms of service, then there are several legal precedents to note.

In Feist Publications, Inc. v. Rural Telephone Service Co., the United States Supreme Court decided scraping and republishing facts, such as telephone listings, are allowed. A similar case in Australia, Telstra Corporation Limited v. Phone Directories Company Pty Ltd, demonstrated that only data with an identifiable author can be copyrighted. Another scraped content case in the United States, evaluating the reuse of Associated Press stories for an aggregated news product, was ruled a violation of copyright in Associated Press v. Meltwater. A European Union case in Denmark, ofir.dk vs home.dk, concluded that regular crawling and deep linking is permissible.

There have also been several cases in which companies have charged the plaintiff with aggressive scraping and attempted to stop the scraping via a legal order. The most recent case, QVC v. Resultly, ruled that, unless the scraping resulted in private property damage, it could not be considered intentional harm, despite the crawler activity leading to some site stability issues.

These cases suggest that, when the scraped data constitutes public facts (such as business locations and telephone listings), it can be republished following fair use rules. However, if the data is original (such as opinions and reviews or private user data), it most likely cannot be republished for copyright reasons. In any case, when you are scraping data from a website, remember you are their guest and need to behave politely; otherwise, they may ban your IP address or proceed with legal action. This means you should make download requests at a reasonable rate and define a user agent to identify your crawler. You should also take measures to review the Terms of Service of the site and ensure the data you are taking is not considered private or copyrighted.

If you have doubts or questions, it may be worthwhile to consult a media lawyer regarding the precedents in your area of residence.

You can read more about these legal cases at the following sites:

Feist Publications Inc. v. Rural Telephone Service Co.
(http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=499&invol=340)
Telstra Corporation Limited v. Phone Directories Company Pvt Ltd
(http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html)

Associated Press v.Meltwater
(http://www.nysd.uscourts.gov/cases/show.php?db=special&id=279)

ofir.dk vs home.dk
(http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf)

QVC v. Resultly
(https://www.paed.uscourts.gov/documents/opinions/16D0129P.pdf)

Background research

Before diving into crawling a website, we should develop an understanding about the scale and structure of our target website. The website itself can help us via the robots.txt and Sitemap files, and there are also external tools available to provide further details such as Google Search and WHOIS.

Checking robots.txt

Most websites define a robots.txt file to let crawlers know of any restrictions when crawling their website. These restrictions are just a suggestion but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling to minimize the chance of being blocked, and to discover clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following code is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:

# section 1 
User-agent: BadCrawler 
Disallow: / 

# section 2 
User-agent: * 
Crawl-delay: 5 
Disallow: /trap 

# section 3 
Sitemap: http://example.webscraping.com/sitemap.xml

In section 1, the robots.txt file asks a crawler with user agent BadCrawler not to crawl their website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.

Section 2 specifies a crawl delay of 5 seconds between download requests for all user-agents, which should be respected to avoid overloading their server(s). There is also a /trap link to try to block malicious crawlers who follow disallowed links. If you visit this link, the server will block your IP for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.

Section 3 defines a Sitemap file, which will be examined in the next section.

Examining the Sitemap

Sitemap files are provided bywebsites to help crawlers locate their updated content without needing to crawl every web page. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html. Many web publishing platforms have the ability to generate a sitemap automatically. Here is the content of the Sitemap file located in the listed robots.txt file:

<?xml version="1.0" encoding="UTF-8"?> 
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> 
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url> 
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url> 
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url> 
  ... 
</urlset>

This sitemap provides links to all the web pages, which will be used in the next section to build our first crawler. Sitemap files provide an efficient way to crawl a website, but need to be treated carefully because they can be missing, out-of-date, or incomplete.

Estimating the size of a website

The size of the target website will affect how we crawl it. If the website is just a few hundred URLs, such as our example website, efficiency is not important. However, if the website has over a million web pages, downloading each sequentially would take months. This problem is addressed later in Chapter 4 , Concurrent Downloading, on distributed downloading.

A quick way to estimate the size of a website is to check the results of Google's crawler, which has quite likely already crawled the website we are interested in. We can access this information through a Google search with the site keyword to filter the results to our domain. An interface to this and other advanced search parameters are available at http://www.google.com/advanced_search.

Here are the site search results for our example website when searching Google for site:example.webscraping.com:

As we can see, Google currently estimates more than 200 web pages (this result may vary), which is around the website size. For larger websites, Google's estimates may be less accurate.

We can filter these results to certain parts of the website by adding a URL path to the domain. Here are the results for site:example.webscraping.com/view, which restricts the site search to the country web pages:

Again, your results may vary in size; however, this additional filter is useful because ideally you only want to crawl the part of a website containing useful data rather than every page.

Identifying the technology used by a website

The type of technology used to build a websitewill affect how we crawl it. A useful tool to check the kind of technologies a website is built with is the module detectem, which requires Python 3.5+ and Docker. If you don't already have Docker installed, follow the instructions for your operating system at https://www.docker.com/products/overview. Once Docker is installed, you can run the following commands.

docker pull scrapinghub/splash
pip install detectem

This will pull the latest Docker image from ScrapingHub and install the package via pip. It is recommended to use a Python virtual environment (https://docs.python.org/3/library/venv.html) or a Conda environment (https://conda.io/docs/using/envs.html) and to check the project's ReadMe page (https://github.com/spectresearch/detectem) for any updates or changes.

Why use environments?
Imagine if your project was developed with an earlier version of a library such as detectem, and then, in a later version, detectem introduced some backwards-incompatible changes that break your project. However, different projects you are working on would like to use the newer version. If your project uses the system-installed detectem, it is eventually going to break when libraries are updated to support other projects.
Ian Bicking's virtualenv provides a clever hack to this problem by copying the system Python executable and its dependencies into a local directory to create an isolated Python environment. This allows a project to install specific versions of Python libraries locally and independently of the wider system. You can even utilize different versions of Python in different virtual environments. Further details are available in the documentation at https://virtualenv.pypa.io. Conda environments offer similar functionality using the Anaconda Python path.

The detectem module uses a series of requests and responses to detect technologies used by the website, based on a series of extensible modules. It uses Splash (https://github.com/scrapinghub/splash), a scriptable browser developed by ScrapingHub (https://scrapinghub.com/). To run the module, simply use the det command:

       $ det http://example.webscraping.com
       [('jquery', '1.11.0')]

We can see the example website uses a common JavaScript library, so its content is likely embedded in the HTML and should be relatively straightforward to scrape.

Detectem is still fairly young and aims to eventually have Python parity to Wappalyzer (https://github.com/AliasIO/Wappalyzer), a Node.js-based project supporting parsing of many different backends as well as ad networks, JavaScript libraries, and server setups. You can also run Wappalyzer via Docker. To first download the Docker image, run:

$ docker pull wappalyzer/cli

Then, you can run the script from the Docker instance:

$ docker run wappalyzer/cli http://example.webscraping.com

The output is a bit hard to read, but if we copy and paste it into a JSON linter, we can see the many different libraries and technologies detected:

{'applications': 
[{'categories': ['Javascript Frameworks'],
     'confidence': '100',
     'icon': 'Modernizr.png',
     'name': 'Modernizr',
     'version': ''},
 {'categories': ['Web Servers'],
     'confidence': '100',
     'icon': 'Nginx.svg',
     'name': 'Nginx',
     'version': ''},
 {'categories': ['Web Frameworks'],
     'confidence': '100',
     'icon': 'Twitter Bootstrap.png',
     'name': 'Twitter Bootstrap',
     'version': ''},
 {'categories': ['Web Frameworks'],
     'confidence': '100',
     'icon': 'Web2py.png',
     'name': 'Web2py',
     'version': ''},
 {'categories': ['Javascript Frameworks'],
     'confidence': '100',
     'icon': 'jQuery.svg',
     'name': 'jQuery',
     'version': ''},
 {'categories': ['Javascript Frameworks'],
     'confidence': '100',
     'icon': 'jQuery UI.svg',
     'name': 'jQuery UI',
     'version': '1.10.3'},
 {'categories': ['Programming Languages'],
     'confidence': '100',
     'icon': 'Python.png',
     'name': 'Python',
     'version': ''}],
 'originalUrl': 'http://example.webscraping.com',
 'url': 'http://example.webscraping.com'}

Here, we can see that Python and the web2py frameworks were detected with very high confidence. We can also see that the frontend CSS framework Twitter Bootstrap is used. Wappalyzer also detected Modernizer.js and the use of Nginx as the backend server. Because the site is only using JQuery and Modernizer, it is unlikely the entire page is loaded by JavaScript. If the website was instead built with AngularJS or React, then its content would likely be loaded dynamically. Or, if the website used ASP.NET, it would be necessary to use sessions and form submissions to crawl web pages. Working with these more difficult cases will be covered later in Chapter 5, Dynamic Content and Chapter 6, Interacting with Forms.

Finding the owner of a website

For some websites it may matter to us who the owner is. For example, if the owner is known to block web crawlers then it would be wise to be more conservative in our download rate. To find who owns a website we can use the WHOIS protocol to see who is the registered owner of the domain name. A Python wrapper to this protocol, documented at https://pypi.python.org/pypi/python-whois, can be installed via pip:

   pip install python-whois

Here is the most informative part of the WHOIS response when querying the appspot.com domain with this module:

   >>> import whois
   >>> print(whois.whois('appspot.com'))
    {
      ...
      "name_servers": [
        "NS1.GOOGLE.COM", 
        "NS2.GOOGLE.COM", 
        "NS3.GOOGLE.COM", 
        "NS4.GOOGLE.COM", 
        "ns4.google.com", 
        "ns2.google.com", 
        "ns1.google.com", 
        "ns3.google.com"
      ], 
      "org": "Google Inc.", 
      "emails": [
        "[email protected]", 
        "[email protected]"
      ]
    }

We can see here that this domain is owned by Google, which is correct; this domain is for the Google App Engine service. Google often blocks web crawlers despite being fundamentally a web crawling business themselves. We would need to be careful when crawling this domain because Google often blocks IPs that quickly scrape their services; and you, or someone you live or work with, might need to use Google services. I have experienced being asked to enter captchas to use Google services for short periods, even after running only simple search crawlers on Google domains.

Crawling your first website

In order to scrape a website, we first need to download its web pages containing the data of interest, a process known as crawling. There are a number of approaches that can be used to crawl a website, and the appropriate choice will depend on the structure of the target website. This chapter will explore how to download web pages safely, and then introduce the following three common approaches to crawling a website:

Crawling a sitemap
Iterating each page using database IDs
Following web page links

We have so far used the terms scraping and crawling interchangeably, but let's take a moment to define the similarities and differences in these two approaches.

Scraping versus crawling

Depending on the information you are after and the site content and structure, you may need to either build a web scraper or a website crawler. What is the difference?

A web scraper is usually built to target a particular website or sites and to garner specific information on those sites. A web scraper is built to access these specific pages and will need to be modified if the site changes or if the information location on the site is changed. For example, you might want to build a web scraper to check the daily specials at your favorite local restaurant, and to do so you would scrape the part of their site where they regularly update that information.

In contrast, a web crawler is usually built in a generic way; targeting either websites from a series of top-level domains or for the entire web. Crawlers can be built to gather more specific information, but are usually used to crawl the web, picking up small and generic bits of information from many different sites or pages and following links to other pages.

In addition to crawlers and scrapers, we will also cover web spiders in Chapter 8, Scrapy. Spiders can be used for crawling a specific set of sites or for broader crawls across many sites or even the Internet.

Generally, we will use specific terms to reflect our use cases; as you develop your web scraping, you may notice distinctions in technologies, libraries, and packages you may want to use. In these cases, your knowledge of the differences in these terms will help you select an appropriate package or technology based on the terminology used (such as, is it only for scraping? Is it also for spiders?).

Downloading a web page

To scrape web pages, we first need to download them. Here is a simple Python script that uses Python's urllib module to download a URL:

import urllib.request
def download(url): 
    return urllib.request.urlopen(url).read()

When a URL is passed, this function will download the web page and return the HTML. The problem with this snippet is that, when downloading the web page, we might encounter errors that are beyond our control; for example, the requested page may no longer exist. In these cases, urllib will raise an exception and exit the script. To be safer, here is a more robust version to catch these exceptions:

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
    return html

Now, when a download or URL error is encountered, the exception is caught and the function returns None.

Throughout this book, we will assume you are creating files with code that is presented without prompts (like the code above). When you see code that begins with a Python prompt >>> or and IPython prompt In [1]:, you will need to either enter that into the main file you have been using, or save the file and import those functions and classes into your Python interpreter. If you run into any issues, please take a look at the code in the book repository at https://github.com/kjam/wswp.

Retrying downloads

Often, the errors encountered when downloading are temporary; an example is when the web server is overloaded and returns a 503 Service Unavailable error. For these errors, we can retry the download after a short time because the server problem may now be resolved. However, we do not want to retry downloading for all errors. If the server returns 404 Not Found, then the web page does not currently exist and the same request is unlikely to produce a different result.

The full list of possible HTTP errors is defined by the Internet Engineering Task Force, and is available for viewing at https://tools.ietf.org/html/rfc7231#section-6. In this document, we can see that 4xx errors occur when there is something wrong with our request and 5xx errors occur when there is something wrong with the server. So, we will ensure our download function only retries the 5xx errors. Here is the updated version to support this:

def download(url, num_retries=2): 
    print('Downloading:', url)
    try: 
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e: 
        print('Download error:', e.reason)
        html = None 
        if num_retries > 0: 
                 if hasattr(e, 'code') and 500 <= e.code < 600: 
                # recursively retry 5xx HTTP errors 
                return download(url, num_retries - 1) 
    return html

Now, when a download error is encountered with a 5xx code, the download error is retried by recursively calling itself. The function now also takes an additional argument for the number of times the download can be retried, which is set to two times by default. We limit the number of times we attempt to download a web page because the server error may not recover. To test this functionality we can try downloading http://httpstat.us/500, which returns the 500 error code:

    >>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error

As expected, the download function now tries downloading the web page, and then, on receiving the 500 error, it retries the download twice before giving up.

Setting a user agent

By default, urllib will download content with the Python-urllib/3.x user agent, where 3.x is the environment's current version of Python. It would be preferable to use an identifiable user agent in case problems occur with our web crawler. Also, some websites block this default user agent, perhaps after they have experienced a poorly made Python web crawler overloading their server. For example, http://www.meetup.com/ currently returns a 403 Forbidden when requesting the page with urllib's default user agent.

To download sites reliably, we will need to have control over setting the user agent. Here is an updated version of our download function with the default user agent set to 'wswp' (which stands forWeb Scraping with Python):

def download(url, user_agent='wswp', num_retries=2): 
    print('Downloading:', url) 
    request = urllib.request.Request(url) 
    request.add_header('User-agent', user_agent)
    try: 
        html = urllib.request.urlopen(request).read() 
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None 
        if num_retries > 0: 
            if hasattr(e, 'code') and 500 <= e.code < 600: 
                # recursively retry 5xx HTTP errors 
                return download(url, num_retries - 1) 
    return html

If you now try meetup.com, you will see valid HTML. Our download function can now be reused in later code to catch errors, retry the site when possible, and set the user agent.

Sitemap crawler

For our first simple crawler, we will use the sitemap discovered in the example website's robots.txt to download all the web pages. To parse the sitemap, we will use a simple regular expression to extract URLs within the <loc> tags.

We will need to update our code to handle encoding conversions as our current download function simply returns bytes. Note that a more robust parsing approach called CSS selectors will be introduced in the next chapter. Here is our first example crawler:

import re

def download(url, user_agent='wswp', num_retries=2, charset='utf-8'): 
    print('Downloading:', url) 
    request = urllib.request.Request(url) 
    request.add_header('User-agent', user_agent)
    try: 
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None 
        if num_retries > 0: 
            if hasattr(e, 'code') and 500 <= e.code < 600: 
            # recursively retry 5xx HTTP errors 
            return download(url, num_retries - 1) 
    return html

def crawl_sitemap(url): 
    # download the sitemap file 
    sitemap = download(url) 
    # extract the sitemap links 
    links = re.findall('<loc>(.*?)</loc>', sitemap) 
    # download each link 
    for link in links: 
        html = download(link) 
        # scrape html here 
        # ...

Now, we can run the sitemap crawler to download all countries from the example website:

    >>> crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/view/Afghanistan-1
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Albania-3
...

As shown in our download method above, we had to update the character encoding to utilize regular expressions with the website response. The Python read method on the response will return bytes, and the re module expects a string. Our code depends on the website maintainer to include the proper character encoding in the response headers. If the character encoding header is not returned, we default to UTF-8 and hope for the best. Of course, this decoding will throw an error if either the header encoding returned is incorrect or if the encoding is not set and also not UTF-8. There are some more complex ways to guess encoding (see: https://pypi.python.org/pypi/chardet), which are fairly easy to implement.

For now, the Sitemap crawler works as expected. But as discussed earlier, Sitemap files often cannot be relied on to provide links to every web page. In the next section, another simple crawler will be introduced that does not depend on the Sitemap file.

If you don't want to continue the crawl at any time you can hit Ctrl + C or cmd + C to exit the Python interpreter or program execution.

ID iteration crawler

In this section, we will take advantage of a weakness in the website structure to easily access all the content. Here are the URLs of some sample countries:

We can see that the URLs only differ in the final section of the URL path, with the country name (known as a slug) and ID. It is a common practice to include a slug in the URL to help with search engine optimization. Quite often, the web server will ignore the slug and only use the ID to match relevant records in the database. Let's check whether this works with our example website by removing the slug and checking the page http://example.webscraping.com/view/1:

The web page still loads! This is useful to know because now we can ignore the slug and simply utilize database IDs to download all the countries. Here is an example code snippet that takes advantage of this trick:

import itertools 

def crawl_site(url):
    for page in itertools.count(1): 
        pg_url = '{}{}'.format(url, page) 
        html = download(pg_url) 
        if html is None: 
            break 
        # success - can scrape the result

Now we can use the function by passing in the base URL:

>>> crawl_site('http://example.webscraping.com/view/-')
Downloading: http://example.webscraping.com/view/-1
Downloading: http://example.webscraping.com/view/-2
Downloading: http://example.webscraping.com/view/-3
Downloading: http://example.webscraping.com/view/-4
[...]

Here, we iterate the ID until we encounter a download error, which we assume means our scraper has reached the last country. A weakness in this implementation is that some records may have been deleted, leaving gaps in the database IDs. Then, when one of these gaps is reached, the crawler will immediately exit. Here is an improved version of the code that allows a number of consecutive download errors before exiting:

def crawl_site(url, max_errors=5):
    for page in itertools.count(1): 
        pg_url = '{}{}'.format(url, page) 
        html = download(pg_url) 
        if html is None: 
            num_errors += 1
            if num_errors == max_errors:
                # max errors reached, exit loop
                break
        else:
            num_errors = 0
            # success - can scrape the result

The crawler in the preceding code now needs to encounter five consecutive download errors to stop iteration, which decreases the risk of stopping iteration prematurely when some records have been deleted or hidden.

Iterating the IDs is a convenient approach to crawling a website, but is similar to the sitemap approach in that it will not always be available. For example, some websites will check whether the slug is found in the URL and if not return a 404 Not Found error. Also, other websites use large nonsequential or nonnumeric IDs, so iterating is not practical. For example, Amazon uses ISBNs, as the ID for the available books, that have at least ten digits. Using an ID iteration for ISBNs would require testing billions of possible combinations, which is certainly not the most efficient approach to scraping the website content.

As you've been following along, you might have noticed some download errors with the message TOO MANY REQUESTS . Don't worry about them at the moment; we will cover more about handling these types of error in the Advanced Features section of this chapter.

Link crawlers

So far, we have implemented two simple crawlers that take advantage of the structure of our sample website to download all published countries. These techniques should be used when available, because they minimize the number of web pages to download. However, for other websites, we need to make our crawler act more like a typical user and follow links to reach the interesting content.

We could simply download the entire website by following every link. However, this would likely download many web pages we don't need. For example, to scrape user account details from an online forum, only account pages need to be downloaded and not discussion threads. The link crawler we use in this chapter will use regular expressions to determine which web pages it should download. Here is an initial version of the code:

import re 

def link_crawler(start_url, link_regex): 
    """ Crawl from the given start URL following links matched by link_regex 
    """ 
    crawl_queue = [start_url] 
    while crawl_queue: 
        url = crawl_queue.pop() 
        html = download(url) 
        if html is not None:
            continue
        # filter for links matching our regular expression 
        for link in get_links(html): 
            if re.match(link_regex, link): 
                crawl_queue.append(link) 

def get_links(html): 
    """ Return a list of links from html 
    """ 
    # a regular expression to extract all links from the webpage 
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE) 
    # list of all links from the webpage 
    return webpage_regex.findall(html)

To run this code, simply call the link_crawler function with the URL of the website you want to crawl and a regular expression to match links you want to follow. For the example website, we want to crawl the index with the list of countries and the countries themselves.

We know from looking at the site that the index links follow this format:

The country web pages follow this format:

So a simple regular expression to match both types of web page is /(index|view)/. What happens when the crawler is run with these inputs? You receive the following download error:

>>> link_crawler('http://example.webscraping.com', '/(index|view)/') 
Downloading: http://example.webscraping.com 
Downloading: /index/1 
Traceback (most recent call last): 
  ... 
ValueError: unknown url type: /index/1

Regular expressions are great tools for extracting information from strings, and I recommend every programmer learn how to read and write a few of them. That said, they tend to be quite brittle and easily break. We'll cover more advanced ways to extract links and identify their pages as we advance through the book.

The problem with downloading /index/1 is that it only includes the path of the web page and leaves out the protocol and server, which is known as a relative link. Relative links work when browsing because the web browser knows which web page you are currently viewing and takes the steps necessary to resolve the link. However, urllib doesn't have this context. To help urllib locate the web page, we need to convert this link into an absolute link, which includes all the details to locate the web page. As might be expected, Python includes a module in urllib to do just this, called parse. Here is an improved version of link_crawler that uses the urljoin method to create the absolute links:

from urllib.parse import urljoin

def link_crawler(start_url, link_regex): 
    """ Crawl from the given start URL following links matched by link_regex 
    """ 
    crawl_queue = [start_url] 
    while crawl_queue: 
        url = crawl_queue.pop() 
        html = download(url) 
        if not html:
            continue
        for link in get_links(html): 
            if re.match(link_regex, link): 
                abs_link = urljoin(start_url, link) 
                crawl_queue.append(abs_link)

When this example is run, you can see it downloads the matching web pages; however, it keeps downloading the same locations over and over. The reason for this behavior is that these locations have links to each other. For example, Australia links to Antarctica and Antarctica links back to Australia, so the crawler will continue to queue the URLs and never reach the end of the queue. To prevent re-crawling the same links, we need to keep track of what's already been crawled. The following updated version of link_crawler stores the URLs seen before, to avoid downloading duplicates:

def link_crawler(start_url, link_regex): 
    crawl_queue = [start_url] 
    # keep track which URL's have seen before 
    seen = set(crawl_queue) 
    while crawl_queue: 
        url = crawl_queue.pop() 
        html = download(url)
        if not html:
            continue 
        for link in get_links(html): 
            # check if link matches expected regex 
            if re.match(link_regex, link): 
                abs_link = urljoin(start_url, link) 
                # check if have already seen this link 
                if abs_link not in seen: 
                    seen.add(abs_link) 
                    crawl_queue.append(abs_link)

When this script is run, it will crawl the locations and then stop as expected. We finally have a working link crawler!

Advanced features

Now, let's add some features to make our link crawler more useful for crawling other websites.

Parsing robots.txt

First, we need to interpret robots.txt to avoid downloading blocked URLs. Python urllib comes with the robotparser module, which makes this straightforward, as follows:

    >>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True

The robotparser module loads a robots.txt file and then provides a can_fetch()function, which tells you whether a particular user agent is allowed to access a web page or not. Here, when the user agent is set to 'BadCrawler', the robotparser module says that this web page can not be fetched, as we saw in the definition in the example site's robots.txt.

To integrate robotparser into the link crawler, we first want to create a new function to return the robotparser object:

def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp

We need to reliably set the robots_url; we can do so by passing an extra keyword argument to our function. We can also set a default value catch in case the user does not pass the variable. Assuming the crawl will start at the root of the site, we can simply add robots.txt to the end of the URL. We also need to define the user_agent:

def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
    ...
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)

Finally, we add the parser check in the crawl loop:

... 
while crawl_queue: 
    url = crawl_queue.pop() 
    # check url passes robots.txt restrictions 
    if rp.can_fetch(user_agent, url):
         html = download(url, user_agent=user_agent) 
         ... 
    else: 
        print('Blocked by robots.txt:', url)

We can test our advanced link crawler and its use of robotparser by using the bad user agent string.

>>> link_crawler('http://example.webscraping.com', '/(index|view)/', user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com

Supporting proxies

Sometimes it's necessary to access a website through a proxy. For example, Hulu is blocked in many countries outside the United States as are some videos on YouTube. Supporting proxies with urllib is not as easy as it could be. We will cover requests for a more user-friendly Python HTTP module that can also handle proxies later in this chapter. Here's how to support a proxy with urllib:

proxy = 'http://myproxy.net:1234' # example string 
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener) 
# now requests via urllib.request will be handled via proxy

Here is an updated version of the download function to integrate this:

def download(url, user_agent='wswp', num_retries=2, charset='utf-8', proxy=None): 
    print('Downloading:', url) 
    request = urllib.request.Request(url) 
    request.add_header('User-agent', user_agent)
    try: 
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None 
        if num_retries > 0: 
            if hasattr(e, 'code') and 500 <= e.code < 600: 
            # recursively retry 5xx HTTP errors 
            return download(url, num_retries - 1) 
    return html

The current urllib module does not support https proxies by default (Python 3.5). This may change with future versions of Python, so check the latest documentation. Alternatively, you can use the documentation's recommended recipe (https://code.activestate.com/recipes/456195/) or keep reading to learn how to use the requests library.

Throttling downloads

If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:

from urllib.parse import urlparse
import time

class Throttle: 
    """Add a delay between downloads to the same domain 
    """ 
    def __init__(self, delay): 
        # amount of delay between downloads for each domain 
        self.delay = delay 
        # timestamp of when a domain was last accessed 
        self.domains = {} 

    def wait(self, url): 
        domain = urlparse(url).netloc 
        last_accessed = self.domains.get(domain) 

        if self.delay > 0 and last_accessed is not None: 
            sleep_secs = self.delay - (time.time() - last_accessed) 
            if sleep_secs > 0: 
                # domain has been accessed recently 
                # so need to sleep 
                time.sleep(sleep_secs) 
        # update the last accessed time 
        self.domains[domain] = time.time()

This Throttle class keeps track of when each domain was last accessed and will sleep if the time since the last access is shorter than the specified delay. We can add throttling to the crawler by calling throttle before every download:

throttle = Throttle(delay) 
... 
throttle.wait(url) 
html = download(url, user_agent=user_agent, num_retries=num_retries, 
                proxy=proxy, charset=charset)

Avoiding spider traps

Currently, our crawler will follow any link it hasn't seen before. However, some websites dynamically generate their content and can have an infinite number of web pages. For example, if the website has an online calendar with links provided for the next month and year, then the next month will also have links to the next month, and so on for however long the widget is set (this can be a LONG time). The site may offer the same functionality with simple pagination navigation, essentially paginating over empty search result pages until the maximum pagination is reached. This situation is known as a spider trap.

A simple way to avoid getting stuck in a spider trap is to track how many links have been followed to reach the current web page, which we will refer to as depth. Then, when a maximum depth is reached, the crawler does not add links from that web page to the queue. To implement maximum depth, we will change the seen variable, which currently tracks visited web pages, into a dictionary to also record the depth the links were found at:

def link_crawler(..., max_depth=4): 
    seen = {} 
    ... 
    if rp.can_fetch(user_agent, url): 
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        ...
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen: 
                    seen[abs_link] = depth + 1 
                    crawl_queue.append(abs_link)

Now, with this feature, we can be confident the crawl will complete eventually. To disable this feature, max_depth can be set to a negative number so the current depth will never be equal to it.

Final version

The full source code for this advanced link crawler can be downloaded at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler.py. Each of the sections in this chapter has matching code in the repository at https://github.com/kjam/wswp. To easily follow along, feel free to fork the repository and use it to compare and test your own code.

To test the link crawler, let's try setting the user agent to BadCrawler, which, as we saw earlier in this chapter, was blocked by robots.txt. As expected, the crawl is blocked and finishes immediately:

    >>> start_url = 'http://example.webscraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/

Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:

    >>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com//index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1

As expected, the crawl stopped after downloading the first page of countries.

Using the requests library

Although we have built a fairly advanced parser using only urllib, the majority of scrapers written in Python today utilize the requests library to manage complex HTTP requests. What started as a small library to help wrap urllib features in something "human-readable" is now a very large project with hundreds of contributors. Some of the features available include built-in handling of encoding, important updates to SSL and security, as well as easy handling of POST requests, JSON, cookies, and proxies.

Throughout most of this book, we will utilize the requests library for its simplicity and ease of use, and because it has become the de facto standard for most web scraping.

To install requests, simply use pip:

pip install requests

For an in-depth overview of all features, you should read the documentation at http://python-requests.org or browse the source code at https://github.com/kennethreitz/requests.

To compare differences using the two libraries, I've also built the advanced link crawler so that it can use requests. You can see the code at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler_using_requests.py. The main download function shows the key differences. The requests version is as follows:

def download(url, user_agent='wswp', num_retries=2, proxies=None):
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    except requests.exceptions.RequestException as e:
        print('Download error:', e.reason)
        html = None

One notable difference is the ease of use of having status_code as an available attribute for each request. Additionally, we no longer need to test for character encoding, as the text attribute on our Response object does so automatically. In the rare case of an non-resolvable URL or timeout, they are all handled by RequestException so it makes for an easy catch statement. Proxy handling is also taken care of by simply passing a dictionary of proxies (that is {'http': 'http://myproxy.net:1234', 'https': 'https://myproxy.net:1234'}).

We will continue to compare and use both libraries, so that you are familiar with them depending on your needs and use case. I strongly recommend using requests whenever you are handling more complex websites, or need to handle important humanizing methods such as using cookies or sessions. We will talk more about these methods in Chapter 6, Interacting with Forms.

Gerry Aug 26, 2017

Finally a book that covers more than just the basics of webscraping. Packt needs better proof readers though. Language errors.

Amazon Verified review

Anonymous Feb 17, 2018

I would not recommend this book for any beginners in Python Web Scraping. Why? The website example they use in the book HAS NOT BEEN maintained and the code used in the book to reference the example website DOES NOT MATCH. I also found multiple complaints on the Internet from others. You will be so frustrated figuring out if you typed the code wrong, where in fact, the website links of the actual site don't match what's typed in the book. I'm glad I have some prior programming experience where I can fix some of the issues I experienced on the fly, but this takes additional time and testing. Overall, the book does go in depth and I think will be good for those with prior Python Web Scraping experience.