Crawling the web
Given the hyperlinked nature of web pages, starting from a known page and following links to other pages is a very important tool in your arsenal when scraping the web.
To do so, we will crawl a site looking for a short phrase and print any paragraph that contains it. We will search only pages that belong to a single site, for example, only URLs starting with www.somesite.com. We won't follow links to external sites.
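The following is a minimal sketch of that idea, not the recipe's prepared script. It assumes the requests and Beautiful Soup libraries, and the www.somesite.com start URL and the search phrase are placeholders:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

START_URL = "http://www.somesite.com"  # placeholder starting page
PHRASE = "python"                       # placeholder phrase to search for


def crawl(start_url, phrase):
    base_netloc = urlparse(start_url).netloc
    to_visit = [start_url]
    visited = set()

    while to_visit:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")

        # Print any paragraph that contains the phrase
        for paragraph in soup.find_all("p"):
            if phrase.lower() in paragraph.get_text().lower():
                print(f"{url}: {paragraph.get_text().strip()}")

        # Queue only links that stay on the same site
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).netloc == base_netloc and next_url not in visited:
                to_visit.append(next_url)


if __name__ == "__main__":
    crawl(START_URL, PHRASE)

The key detail is the netloc comparison: any link whose host differs from the starting page's host is simply never queued, which is how the crawl stays within a single site.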
Getting ready
This recipe builds on the concepts introduced so far: it involves downloading and parsing pages to find links to other pages, and then continuing to download those.
When crawling the web, remember to set limits on how much you download. It's very easy to crawl too many pages; as anyone who has browsed Wikipedia can confirm, the internet is potentially limitless.
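As a sketch of one way to apply that advice (the MAX_PAGES value, the pages_to_crawl name, and the link_extractor callback are assumptions for illustration, not part of the recipe):

# Cap the crawl at a fixed number of pages so it cannot run away
MAX_PAGES = 50  # illustrative limit


def pages_to_crawl(start_url, link_extractor, max_pages=MAX_PAGES):
    """Yield URLs to download, stopping once max_pages have been visited.

    link_extractor(url) should return an iterable of URLs found on that page.
    """
    to_visit = [start_url]
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        yield url
        to_visit.extend(link_extractor(url))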
We'll use a prepared example, available in the GitHub repo at https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/tree/master...