Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, microservices, Docker, and AWS


Getting Started with Scraping

In this chapter, we will cover the following topics:

  • Setting up a Python development environment
  • Scraping Python.org with Requests and Beautiful Soup
  • Scraping Python.org with urllib3 and Beautiful Soup
  • Scraping Python.org with Scrapy
  • Scraping Python.org with Selenium and PhantomJS

Introduction

The amount of data available on the web is consistently growing both in quantity and in form. Businesses require this data to make decisions, particularly with the explosive growth of machine learning tools which require large amounts of data for training. Much of this data is available via Application Programming Interfaces, but at the same time a lot of valuable data is still only available through the process of web scraping.

This chapter will focus on several fundamentals of setting up a scraping environment and performing basic requests for data with several tools of the trade. Python is the programming language of choice for this book, as it is for many who build systems to perform scraping. It is an easy-to-use language with a very rich ecosystem of tools for many tasks. If you program in other languages, you will find it easy to pick up, and you may never go back!

Setting up a Python development environment

If you have not used Python before, it is important to have a working development environment. The recipes in this book are all in Python; a few are interactive examples, but most are implemented as scripts to be run by the Python interpreter. This recipe shows you how to set up an isolated development environment with virtualenv and manage project dependencies with pip. We will also get the code for the book and install it into the Python virtual environment.

Getting ready

We will be using Python 3.x exclusively, specifically 3.6.1 in my case. While Mac and Linux normally have Python 2 installed, Windows systems do not, so it is likely that Python 3 will need to be installed in either case. You can find Python installers at www.python.org.

You can check Python's version with python --version

pip comes installed with Python 3.x, so we will omit instructions on its installation. Additionally, all command line examples in this book are run on a Mac. For Linux users the commands should be identical. On Windows, there are alternate commands (like dir instead of ls), but these alternatives will not be covered.

How to do it...

We will be installing a number of packages with pip. These packages are installed into a Python environment. Because there can often be version conflicts with other packages, a good practice for following along with the recipes in this book is to create a new virtual Python environment, where the packages we use are guaranteed to work properly.
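
If you are ever unsure which Python environment a script is actually running in, the interpreter itself can tell you. The following is a small sketch (not part of the book's code) that prints the interpreter path and the environment prefix; when run inside an activated virtual environment, both paths point into that environment's folder:

# check_env.py - a quick sketch for confirming which Python environment is active
import sys

print("Interpreter:", sys.executable)  # the python binary currently in use
print("Prefix:", sys.prefix)           # the root folder of the active environment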

Virtual Python environments are managed with the virtualenv tool. This can be installed with the following command:

~ $ pip install virtualenv
Collecting virtualenv
Using cached virtualenv-15.1.0-py2.py3-none-any.whl
Installing collected packages: virtualenv
Successfully installed virtualenv-15.1.0

Now we can use virtualenv. But before that, let's briefly look at pip. This command installs Python packages from PyPI, a package repository with tens of thousands of packages. We just used the install subcommand of pip, which ensures a package is installed. We can also see all currently installed packages with pip list:

~ $ pip list
alabaster (0.7.9)
amqp (1.4.9)
anaconda-client (1.6.0)
anaconda-navigator (1.5.3)
anaconda-project (0.4.1)
aniso8601 (1.3.0)

I've truncated to the first few lines as there are quite a few. For me there are 222 packages installed.

Packages can also be uninstalled using pip uninstall followed by the package name. I'll leave it to you to give it a try.

Now back to virtualenv. Using virtualenv is very simple. Let's use it to create an environment and install the code from GitHub. Let's walk through the steps:

  1. Create a directory to represent the project and enter the directory.
~ $ mkdir pywscb
~ $ cd pywscb
  2. Initialize a virtual environment folder named env:
pywscb $ virtualenv env
Using base prefix '/Users/michaelheydt/anaconda'
New python executable in /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/python => /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/../lib/libpython3.6m.dylib => /Users/michaelheydt/pywscb/env/lib/libpython3.6m.dylib
Installing setuptools, pip, wheel...done.
  3. This creates an env folder. Let's take a look at what was installed:
pywscb $ ls -la env
total 8
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 .
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 ..
drwxr-xr-x 16 michaelheydt staff 544 Jan 18 15:38 bin
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 include
drwxr-xr-x 4 michaelheydt staff 136 Jan 18 15:38 lib
-rw-r--r-- 1 michaelheydt staff 60 Jan 18 15:38 pip-selfcheck.json
  4. Now we activate the virtual environment. This command uses the contents of the env folder to configure Python. After this, all Python activities are relative to this virtual environment:
pywscb $ source env/bin/activate
(env) pywscb $
  5. We can check that python is indeed using this virtual environment with the following command:
(env) pywscb $ which python
/Users/michaelheydt/pywscb/env/bin/python

With our virtual environment created, let's clone the book's sample code and take a look at its structure.

(env) pywscb $ git clone https://github.com/PacktBooks/PythonWebScrapingCookbook.git
Cloning into 'PythonWebScrapingCookbook'...
remote: Counting objects: 420, done.
remote: Compressing objects: 100% (316/316), done.
remote: Total 420 (delta 164), reused 344 (delta 88), pack-reused 0
Receiving objects: 100% (420/420), 1.15 MiB | 250.00 KiB/s, done.
Resolving deltas: 100% (164/164), done.
Checking connectivity... done.

This created a PythonWebScrapingCookbook directory.

(env) pywscb $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 PythonWebScrapingCookbook
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 env

Let's change into it and examine the content.

(env) PythonWebScrapingCookbook $ ls -l
total 0
drwxr-xr-x 15 michaelheydt staff 510 Jan 18 16:21 py
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 www

There are two directories. Most of the Python code is in the py directory. www contains some web content that we will use from time to time, served with a local web server. Let's look at the contents of the py directory:

(env) py $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 01
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 03
drwxr-xr-x 21 michaelheydt staff 714 Jan 18 16:21 04
drwxr-xr-x 10 michaelheydt staff 340 Jan 18 16:21 05
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 06
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 07
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 08
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 09
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 10
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 11
drwxr-xr-x 8 michaelheydt staff 272 Jan 18 16:21 modules

Code for each chapter is in the numbered folder matching the chapter (there is no code for chapter 2 as it is all interactive Python).

Note that there is a modules folder. Some of the recipes throughout the book use code in those modules. Make sure that your Python path points to this folder. On Mac and Linux you can set this in your .bash_profile file (on Windows, use the environment variables dialog):

export PYTHONPATH="/users/michaelheydt/dropbox/packt/books/pywebscrcookbook/code/py/modules"
export PYTHONPATH
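
To confirm that the modules folder is actually visible to Python, a quick check like the following can help. This is just an illustrative sketch; the path is an example and should point at wherever you cloned the book's code:

# verify_modules_path.py - an illustrative check that the book's modules folder is importable
import os
import sys

# Example path only - adjust it to your own clone of the repository.
modules_dir = os.path.expanduser("~/pywscb/PythonWebScrapingCookbook/py/modules")

print("On sys.path?", modules_dir in sys.path)
if modules_dir not in sys.path:
    sys.path.append(modules_dir)  # make it importable for the current session only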

The contents of each folder generally follow a numbering scheme matching the sequence of the recipes in the chapter. The following are the contents of the chapter 6 folder:

(env) py $ ls -la 06
total 96
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 .
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:26 ..
-rw-r--r-- 1 michaelheydt staff 902 Jan 18 16:21 01_scrapy_retry.py
-rw-r--r-- 1 michaelheydt staff 656 Jan 18 16:21 02_scrapy_redirects.py
-rw-r--r-- 1 michaelheydt staff 1129 Jan 18 16:21 03_scrapy_pagination.py
-rw-r--r-- 1 michaelheydt staff 488 Jan 18 16:21 04_press_and_wait.py
-rw-r--r-- 1 michaelheydt staff 580 Jan 18 16:21 05_allowed_domains.py
-rw-r--r-- 1 michaelheydt staff 826 Jan 18 16:21 06_scrapy_continuous.py
-rw-r--r-- 1 michaelheydt staff 704 Jan 18 16:21 07_scrape_continuous_twitter.py
-rw-r--r-- 1 michaelheydt staff 1409 Jan 18 16:21 08_limit_depth.py
-rw-r--r-- 1 michaelheydt staff 526 Jan 18 16:21 09_limit_length.py
-rw-r--r-- 1 michaelheydt staff 1537 Jan 18 16:21 10_forms_auth.py
-rw-r--r-- 1 michaelheydt staff 597 Jan 18 16:21 11_file_cache.py
-rw-r--r-- 1 michaelheydt staff 1279 Jan 18 16:21 12_parse_differently_based_on_rules.py

In the recipes I'll state that we'll be using the script in <chapter directory>/<recipe filename>.

Congratulations, you've now got a Python environment configured with the book's code!

Now, just to be complete, if you want to get out of the Python virtual environment, you can exit using the following command:

(env) py $ deactivate
py $

Checking which python again, we can see it has switched back:

py $ which python
/Users/michaelheydt/anaconda/bin/python

I won't be using the virtual environment for the rest of the book. When you see command prompts, they will be either of the form "<directory> $" or simply "$".

Now let's move on to doing some scraping.

Scraping Python.org with Requests and Beautiful Soup

In this recipe we will install Requests and Beautiful Soup and scrape some content from www.python.org. We'll install both of the libraries and get some basic familiarity with them. We'll come back to them both in subsequent chapters and dive deeper into each.

Getting ready...

In this recipe, we will scrape the upcoming Python events from https://www.python.org/events/python-events/. The following is an example of the Python.org events page (it changes frequently, so your experience will differ):

We will need to ensure that Requests and Beautiful Soup are installed. We can do that with the following:

pywscb $ pip install requests
Downloading/unpacking requests
Downloading requests-2.18.4-py2.py3-none-any.whl (88kB): 88kB downloaded
Downloading/unpacking certifi>=2017.4.17 (from requests)
Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB): 151kB downloaded
Downloading/unpacking idna>=2.5,<2.7 (from requests)
Downloading idna-2.6-py2.py3-none-any.whl (56kB): 56kB downloaded
Downloading/unpacking chardet>=3.0.2,<3.1.0 (from requests)
Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded
Downloading/unpacking urllib3>=1.21.1,<1.23 (from requests)
Downloading urllib3-1.22-py2.py3-none-any.whl (132kB): 132kB downloaded
Installing collected packages: requests, certifi, idna, chardet, urllib3
Successfully installed requests certifi idna chardet urllib3
Cleaning up...
pywscb $ pip install bs4
Downloading/unpacking bs4
Downloading bs4-0.0.1.tar.gz
Running setup.py (path:/Users/michaelheydt/pywscb/env/build/bs4/setup.py) egg_info for package bs4

How to do it...

Now let's go and learn to scrape a couple of events. For this recipe, we will start by using interactive Python (IPython).

  1. Start it with the ipython command:

$ ipython
Python 3.6.1 |Anaconda custom (x86_64)| (default, Mar 22 2017, 19:25:17)
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]:
  2. Next we import Requests:
In [1]: import requests
  3. We now use requests to make a GET HTTP request for the following URL, https://www.python.org/events/python-events/:
In [2]: url = 'https://www.python.org/events/python-events/'
In [3]: req = requests.get(url)
  4. That downloaded the page content, which is stored in our req object. We can retrieve the content using the .text property. The following prints the first 200 characters:
In [4]: req.text[:200]
Out[4]: '<!doctype html>\n<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->\n<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->\n<!--[if IE 8]> <h'

We now have the raw HTML of the page. We can now use Beautiful Soup to parse the HTML and retrieve the event data.

  5. First import Beautiful Soup:
In [5]: from bs4 import BeautifulSoup
  6. Now we create a BeautifulSoup object and pass it the HTML:
In [6]: soup = BeautifulSoup(req.text, 'lxml')
  7. Now we tell Beautiful Soup to find the main <ul> tag for the recent events, and then to get all the <li> tags below it:
In [7]: events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
  8. And finally we can loop through each of the <li> elements, extracting the event details, and print each to the console:
In [13]: for event in events:
    ...:     event_details = dict()
    ...:     event_details['name'] = event.find('h3').find("a").text
    ...:     event_details['location'] = event.find('span', {'class': 'event-location'}).text
    ...:     event_details['time'] = event.find('time').text
    ...:     print(event_details)
    ...:
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. 2018'}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. 2018'}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. 2018'}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. 2018'}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. 2018'}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. 2018'}

This entire example is available in the 01/01_events_with_requests.py script file. The following is its content, which pulls together everything we just did, step by step:

import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)

    soup = BeautifulSoup(req.text, 'lxml')

    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

You can run this using the following command from the terminal:

$ python 01_events_with_requests.py
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. 2018'}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. 2018'}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. 2018'}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. 2018'}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. 2018'}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. 2018'}

How it works...

We will dive into the details of both Requests and Beautiful Soup in the next chapter, but for now let's just summarize a few key points about how this works. The following are the important points about Requests:

  • Requests is used to execute HTTP requests. We used it to make an HTTP GET request to the URL for the events page.
  • The response object returned by Requests holds the results of the request. This is not only the page content, but also many other items about the result, such as the HTTP status code and headers (see the short sketch after this list).
  • Requests is used only to get the page; it does not do any parsing.
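
The following is a minimal sketch (using the same events URL) of inspecting a few of those extra items on the object returned by Requests:

# inspect_response.py - a sketch of what the Requests result carries besides the page text
import requests

req = requests.get('https://www.python.org/events/python-events/')

print(req.status_code)              # the HTTP status code, for example 200
print(req.headers['content-type'])  # response headers behave like a dictionary
print(req.encoding)                 # the text encoding Requests inferred
print(len(req.text))                # the page content itself, as text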

We use Beautiful Soup to do the parsing of the HTML and also the finding of content within the HTML.

To understand how this worked, note that the page's HTML for the Upcoming Events section starts with a <ul> tag whose class attribute contains list-recent-events.

We used the power of Beautiful Soup to:

  • Find the <ul> element representing the section, which is done by looking for a <ul> with a class attribute that has a value of list-recent-events.
  • From that object, we find all the <li> elements.

Each of these <li> tags represents a different event. We iterate over each of them, making a dictionary from the event data found in the child HTML tags (a CSS-selector variant of this extraction is sketched after this list):

  • The name is extracted from the <a> tag that is a child of the <h3> tag
  • The location is the text content of the <span> with a class of event-location
  • The time is extracted from the text of the <time> tag
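
Beautiful Soup also supports CSS selectors through select() and select_one(). The following sketch performs the same extraction using selectors rather than find()/findAll(); it is equivalent in spirit, not the book's own script, and the selectors will need adjusting if the page markup changes:

# events_with_css_selectors.py - the same extraction using Beautiful Soup CSS selectors
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.python.org/events/python-events/')
soup = BeautifulSoup(req.text, 'lxml')

for event in soup.select('ul.list-recent-events li'):
    print({
        'name': event.select_one('h3 a').text,
        'location': event.select_one('span.event-location').text,
        'time': event.select_one('time').text,
    })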

Scraping Python.org with urllib3 and Beautiful Soup

In this recipe, we swap out the use of Requests for another library, urllib3. This is another common library for retrieving data from URLs and for other functions involving URLs, such as parsing the parts of the actual URL and handling various encodings.

Getting ready...

This recipe requires urllib3 to be installed, so install it with pip:

$ pip install urllib3
Collecting urllib3
Using cached urllib3-1.22-py2.py3-none-any.whl
Installing collected packages: urllib3
Successfully installed urllib3-1.22

How to do it...

The recipe is implemented in 01/02_events_with_urllib3.py. The code is the following:

import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)

    soup = BeautifulSoup(res.data, 'html.parser')

    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

Then run it with the Python interpreter. You will get identical output to the previous recipe.

How it works

The only difference in this recipe is how we fetch the resource:

req = urllib3.PoolManager()
res = req.request('GET', url)

Unlike Requests, urllib3 doesn't apply header encoding automatically. The reason the code snippet works in the preceding example is that Beautiful Soup handles the encoding beautifully. But you should keep in mind that encoding is an important part of scraping. If you decide to use your own framework or other libraries, make sure encoding is handled well.
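
If you do want to handle the decoding yourself rather than leaving it to Beautiful Soup, the sketch below reads the charset from the Content-Type header and decodes the raw bytes explicitly. The utf-8 fallback is an assumption for illustration, not something urllib3 guarantees:

# decode_urllib3_response.py - a sketch of decoding a urllib3 response explicitly
import urllib3

http = urllib3.PoolManager()
res = http.request('GET', 'https://www.python.org/events/python-events/')

# Try to pull the charset out of the Content-Type header, falling back to utf-8.
content_type = res.headers.get('Content-Type', '')
charset = 'utf-8'
if 'charset=' in content_type:
    charset = content_type.split('charset=')[-1].split(';')[0].strip()

html = res.data.decode(charset, errors='replace')
print(html[:200])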

There's more...

Requests and urllib3 are very similar in terms of capabilities. It is generally recommended to use Requests when it comes to making HTTP requests. The following code example illustrates a few advanced features:

import json
import requests

# builds on top of urllib3's connection pooling
# a session reuses the same TCP connection if
# requests are made to the same host
# see https://en.wikipedia.org/wiki/HTTP_persistent_connection for details
session = requests.Session()

# You may pass in custom cookies
r = session.get('http://httpbin.org/get', cookies={'my-cookie': 'browser'})
print(r.text)
# '{"cookies": {"my-cookie": "browser"}}'

# Streaming is another nifty feature
# From http://docs.python-requests.org/en/master/user/advanced/#streaming-requests
# copyright belongs to requests.org
r = requests.get('http://httpbin.org/stream/20', stream=True)

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))

Scraping Python.org with Scrapy

Scrapy is a very popular open source Python scraping framework for extracting data. It was originally designed only for scraping, but it has also evolved into a powerful web crawling solution.

In our previous recipes, we used Requests and urllib3 to fetch data and Beautiful Soup to extract it. Scrapy offers all of these capabilities along with many other built-in modules and extensions. It is also our tool of choice when it comes to scraping with Python.

Scrapy offers a number of powerful features that are worth mentioning:

  • Built-in extensions to make HTTP requests and handle compression, authentication, caching, user agents, and HTTP headers
  • Built-in support for selecting and extracting data with selector languages such as CSS and XPath, as well as support for using regular expressions to select content and links (a small selector sketch follows this list)
  • Encoding support to deal with languages and non-standard encoding declarations
  • Flexible APIs to reuse and write custom middleware and pipelines, which provide a clean and easy way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in storage such as file systems, S3, databases, and others
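
To give a feel for the selector support mentioned above, the following standalone sketch uses Scrapy's Selector class directly on a small HTML fragment. The fragment is made up purely for illustration; it just mimics the structure of the events listing:

# scrapy_selector_sketch.py - CSS and XPath selection with Scrapy's Selector class
from scrapy.selector import Selector

# A made-up fragment that mimics the structure of the events page.
html = '''
<ul class="list-recent-events">
  <li><h3 class="event-title"><a href="#">PyCon Example 2018</a></h3>
      <p><span class="event-location">Example City</span> <time>01 Jan. 2018</time></p></li>
</ul>
'''

sel = Selector(text=html)

# The same element selected two ways: CSS and XPath.
print(sel.css('ul.list-recent-events li h3 a::text').extract_first())
print(sel.xpath('//ul[contains(@class, "list-recent-events")]/li/h3/a/text()').extract_first())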

Getting ready...

There are several means of creating a scraper with Scrapy. One is a programmatic pattern, where we create the crawler and spider in our code. It is also possible to configure a Scrapy project from templates or generators and then run the scraper from the command line using the scrapy command. This book will follow the programmatic pattern, as it keeps the code for each recipe contained in a single file. This will help when we are putting together specific, targeted recipes with Scrapy.

This isn't necessarily a better way of running a Scrapy scraper than using command-line execution; it is simply a design decision for this book. Ultimately, this book is not about Scrapy (there are other books on just Scrapy), but more of an exposition on the various things you may need to do when scraping, culminating in the creation of a functional scraper as a service in the cloud.

How to do it...

The script for this recipe is 01/03_events_with_scrapy.py. The following is the code:

import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({ 'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()

    for event in spider.found_events: print(event)

The following runs the script and shows the output:

~ $ python 03_events_with_scrapy.py
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. '}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. '}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. '}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. '}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. '}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. '}
{'name': 'PyCon Pakistan', 'location': 'Lahore, Pakistan', 'time': '16 Dec. – 17 Dec. '}
{'name': 'PyCon Indonesia 2017', 'location': 'Surabaya, Indonesia', 'time': '09 Dec. – 10 Dec. '}

The same result, but with another tool. Let's take a quick look at how this works.

How it works

We will get into more details about Scrapy in later chapters, but let's just go through this code quickly to get a feel for how it accomplishes this scrape. Everything in Scrapy revolves around creating a spider. Spiders crawl through pages on the Internet based upon rules that we provide. This spider only processes one single page, so it's not really much of a spider. But it shows the pattern we will use throughout the later Scrapy examples.

The spider is created with a class definition that derives from one of the Scrapy spider classes. Ours derives from the scrapy.Spider class.

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://www.python.org/events/python-events/',]

Every spider is given a name, and also one or more start_urls which tell it where to start the crawling.

This spider has a field to store all the events that we find:

    found_events = []

The spider then has a method named parse, which will be called for every page the spider collects.

def parse(self, response):
    for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
        event_details = dict()
        event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
        event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
        event_details['time'] = event.xpath('p/time/text()').extract_first()
        self.found_events.append(event_details)

The implementation of this method uses an XPath selection to get the events from the page (XPath is the built-in means of navigating HTML in Scrapy). It then builds the event_details dictionary object, similarly to the other examples, and adds it to the found_events list.

The remaining code does the programmatic execution of the Scrapy crawler.

    process = CrawlerProcess({ 'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()

It starts with the creation of a CrawlerProcess which does the actual crawling and a lot of other tasks. We pass it a LOG_LEVEL of ERROR to prevent the voluminous Scrapy output. Change this to DEBUG and re-run it to see the difference.

Next, we tell the crawler process to use our spider implementation. We get the actual spider object from that crawler so that we can retrieve the items when the crawl is complete. And then we kick off the whole thing by calling process.start().

When the crawl is completed we can then iterate and print out the items that were found.

    for event in spider.found_events: print(event)

This example really didn't touch any of the power of Scrapy. We will look at more of its advanced features later in the book.

Scraping Python.org with Selenium and PhantomJS

This recipe will introduce Selenium and PhantomJS, two frameworks that are very different from those in the previous recipes. In fact, Selenium and PhantomJS are often used in functional/acceptance testing. We want to demonstrate these tools as they offer unique benefits from the scraping perspective. Several capabilities that we will look at later in the book include the ability to fill out forms, press buttons, and wait for dynamic JavaScript to be downloaded and executed.

Selenium itself is a programming language neutral framework. It offers a number of programming language bindings, such as Python, Java, C#, and PHP (amongst others). The framework also provides many components that focus on testing. Three commonly used components are:

  • IDE for recording and replaying tests
  • WebDriver, which actually launches a web browser (such as Firefox, Chrome, or Internet Explorer) by sending commands to the selected browser and returning the results
  • A grid server, which executes tests with a web browser on a remote server and can run multiple test cases in parallel (a brief remote-driver sketch follows this list)
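
As a rough illustration of the grid component, the Python bindings can drive a browser running on a remote Selenium server through webdriver.Remote. The command_executor URL below is only a placeholder; point it at your own grid hub or standalone server:

# remote_webdriver_sketch.py - a sketch of driving a browser via a Selenium server/grid
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

driver = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',  # placeholder grid/server URL
    desired_capabilities=DesiredCapabilities.FIREFOX)

driver.get('https://www.python.org/events/python-events/')
print(driver.title)
driver.quit()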

Getting ready

First we need to install Selenium. We do this with our trusty pip:

~ $ pip install selenium
Collecting selenium
Downloading selenium-3.8.1-py2.py3-none-any.whl (942kB)
100% |████████████████████████████████| 952kB 236kB/s
Installing collected packages: selenium
Successfully installed selenium-3.8.1

This installs the Selenium Client Driver for Python (the language bindings). You can find more information on it at https://github.com/SeleniumHQ/selenium/blob/master/py/docs/source/index.rst if you want to explore it further.

For this recipe we also need to have the driver for Firefox in the directory (it's named geckodriver). This file is operating system specific. I've included the file for Mac in the folder. To get other versions, visit https://github.com/mozilla/geckodriver/releases.

Still, when running this sample you may get the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver'

If you do, put the geckodriver file somewhere on your system's PATH, or add the 01 folder to your path. Oh, and you will need to have Firefox installed.
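
If you would rather not modify your PATH, the Selenium 3 bindings also accept the driver location directly. The path below is only an example:

# firefox_with_driver_path.py - pointing Selenium at geckodriver explicitly
from selenium import webdriver

# Example location only - use wherever you placed the geckodriver binary.
driver = webdriver.Firefox(executable_path='/path/to/geckodriver')
driver.get('https://www.python.org/events/python-events/')
print(driver.title)
driver.quit()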

Finally, it is required to have PhantomJS installed. You can download and find installation instructions at: http://phantomjs.org/

How to do it...

The script for this recipe is 01/04_events_with_selenium.py.

  1. The following is the code:
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.Firefox()
    driver.get(url)

    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)

    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
  2. And run the script with Python. You will see familiar output:
~ $ python 04_events_with_selenium.py
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan.'}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan.'}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb.'}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb.'}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb.'}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb.'}

During this process, Firefox will pop up and open the page. We have reused the previous recipe and adapted it to use Selenium.

The Window Popped up by Firefox

How it works

The primary difference in this recipe is the following code:

driver = webdriver.Firefox()
driver.get(url)

This gets the Firefox driver and uses it to get the content of the specified URL. This works by starting Firefox and automating it to go to the page, after which Firefox returns the page content to our app. This is why Firefox popped up. The other difference is that, to find things, we need to call find_element_by_xpath to search the resulting HTML.
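
Since the introduction mentioned waiting for dynamic JavaScript, it is worth sketching how that typically looks with Selenium's explicit waits. This is not part of the book's script, just an illustration of the API:

# selenium_explicit_wait_sketch.py - waiting for dynamically loaded content
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://www.python.org/events/python-events/')

# Wait up to 10 seconds for the events list to be present before touching it.
events_list = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, '//ul[contains(@class, "list-recent-events")]')))

print(len(events_list.find_elements_by_xpath('li')), 'events found')
driver.close()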

There's more...

PhantomJS, in many ways, is very similar to Selenium. It has fast and native support for various web standards, with features such as DOM handling, CSS selector, JSON, Canvas, and SVG. It is often used in web testing, page automation, screen capturing, and network monitoring.

There is one key difference between Selenium and PhantomJS: PhantomJS is headless and uses WebKit. As we saw, Selenium opens and automates a browser. This is not very good if we are in a continuous integration or testing environment where the browser is not installed, or where we don't want thousands of browser windows or tabs being opened. Being headless makes this faster and more efficient.

The example for PhantomJS is in the 01/05_events_with_phantomjs.py file. There is a single one-line change:

driver = webdriver.PhantomJS('phantomjs')

Running the script results in similar output to the Selenium / Firefox example, but without a browser popping up, and it also takes less time to complete.
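
If you prefer to stay with Firefox but still avoid a visible window, newer Firefox builds can also run headless through Selenium. The following is a hedged sketch assuming a Firefox version with headless support; firefox_options was the Selenium 3.x keyword for passing these options:

# headless_firefox_sketch.py - running the Firefox driver without a visible window
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('-headless')  # ask Firefox to run without opening a window

driver = webdriver.Firefox(firefox_options=options)
driver.get('https://www.python.org/events/python-events/')
print(driver.title)
driver.quit()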


Key benefits

  • Hands-on recipes for advancing your web scraping skills to expert level
  • One-stop solution guide to address complex and challenging web scraping tasks using Python
  • Understand web page structures and collect data from a website with ease

Description

Python Web Scraping Cookbook is a solution-focused book that will teach you techniques to develop high-performance scrapers and deal with crawlers, sitemaps, forms automation, Ajax-based sites, caches, and more. You'll explore a number of real-world scenarios where every part of the development/product life cycle will be fully covered. You will not only develop the skills needed to design and develop reliable, performant data flows, but also deploy your codebase to AWS. If you are involved in software engineering, product development, or data mining (or are interested in building data-driven products), you will find this book useful as each recipe has a clear purpose and objective. Right from extracting data from websites to writing a sophisticated web crawler, the book's independent recipes will be a godsend. This book covers Python libraries such as Requests and Beautiful Soup. You will learn about crawling, web spidering, working with Ajax websites, paginated items, and more. You will also learn to tackle problems such as 403 errors, working with proxies, scraping images, and LXML. By the end of this book, you will be able to scrape websites more efficiently and be able to deploy and operate your scraper in the cloud.

Who is this book for?

This book is ideal for Python programmers, web administrators, security professionals, and anyone who wants to perform web analytics. Familiarity with Python and basic understanding of web scraping will be useful to make the best of this book.

What you will learn

  • Use a variety of tools to scrape any website and data, including BeautifulSoup, Scrapy, Selenium and many more
  • Master expression languages, such as XPath and CSS, and regular expressions to extract web data
  • Deal with scraping traps such as hidden form fields, throttling, pagination, and different status codes
  • Build robust scraping pipelines with SQS and RabbitMQ
  • Scrape assets such as image media and learn what to do when a scraper fails to run
  • Explore ETL techniques for building a customized crawler and parser, and for converting structured and unstructured data from websites
  • Deploy and run your scraper as a service in AWS Elastic Container Service

Product Details

Publication date: Feb 09, 2018
Length: 364 pages
Edition: 1st
Language: English
ISBN-13: 9781787286634




Table of Contents

12 Chapters
Getting Started with Scraping
Data Acquisition and Extraction
Processing Data
Working with Images, Audio, and other Assets
Scraping - Code of Conduct
Scraping Challenges and Solutions
Text Wrangling and Analysis
Searching, Mining and Visualizing Data
Creating a Simple Data API
Creating Scraper Microservices with Docker
Making the Scraper as a Service Real
Other Books You May Enjoy

Customer reviews

Rating distribution: 2.3 out of 5 (3 ratings)
5 star: 33.3%
4 star: 0%
3 star: 0%
2 star: 0%
1 star: 66.7%

Tonya Oliver, Mar 26, 2018 (5 stars, Amazon verified review):
It's probably worth the read. I don't like the fact that Amazon is forcing me to write this review with no less than 18 words. It's too bad, because the book isn't being reviewed here, it's Amazon. How's that for 18 words, Amazon?

John Ewers, Apr 17, 2018 (1 star, Amazon verified review):
I bought the book trying to learn how I could download a few tables from the web into Python. I had two issues that I needed help with: 1) sites that I need data from require passwords, and 2) sites have JavaScript that needs to run before I can grab data. After 4 hours, I got absolutely nothing out of this book and went to YouTube / Stack Overflow (with these tools, I figured out my problem in less time than I spent with this book). The book starts off by going over a few details on many different scraping libraries. There isn't enough detail to do anything useful with web scraping; you just become aware of the existence of these libraries. The second two-thirds of the book focuses exclusively on Scrapy. This appears to be a good resource for crawling (finding new websites to go onto); however, it is not so good for scraping known sites (certainly not for beginner / intermediate Python users). If you want to go crawling, this may be a good book for you. I was stunned that reading the HTML behind the sites you want to scrape was barely mentioned. This is a key element of any "how to" you can find on YouTube, and without a lot of HTML experience, one of the more challenging parts of scraping. One of my biggest issues was with passwords. The book only offered half a page on this, with an extremely simple example. The solution did not work on any of the three sites I tried it on. Also, I could not find one mention of what to do with JavaScript. Overall, a useless book for me.

Patrick Klein, Jan 08, 2022 (1 star, Amazon verified review):
This "book" feels like a collection of Stack Overflow answers to very basic topics, with the added disadvantage that it's harder to navigate and you can't just copy-paste. I'm sending this one back.
