Building Your First Web Scraping Application
The internet, and the World Wide Web (WWW) in particular, is probably the most prominent source of information today. Most of that information is retrievable through HTTP, which was originally invented to share pages of hypertext (hence the name HyperText Transfer Protocol) and which started the WWW.
An HTTP request happens each time we open a web page, so the process should be familiar to almost everyone. But we can also perform these operations programmatically to retrieve and process information automatically. Python has an HTTP client in its standard library, but the fantastic requests module makes obtaining web pages very easy. In this chapter, we will see how.
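To make the idea of requesting a page programmatically concrete, here is a minimal sketch that uses only the HTTP client from the standard library. The function name `fetch_page` and its `timeout` default are illustrative assumptions, not part of any recipe in this chapter. The trailing comment shows roughly what the same operation looks like with the third-party requests module, assuming it is installed.

```python
import urllib.request


def fetch_page(url, timeout=10):
    """Download a URL and return a (status_code, text) tuple.

    Hypothetical helper for illustration only.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
        return response.status, body


# With the requests module installed, the equivalent is roughly:
#     import requests
#     response = requests.get(url, timeout=10)
#     response.status_code, response.text
```

Calling `fetch_page` with the URL of any reachable page should return the HTTP status code and the page body as a string. Note that `urlopen` raises an exception for error responses such as 404, while `requests.get` returns the response and leaves the status check to the caller.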
We'll cover the following recipes:
- Downloading web pages
- Parsing HTML
- Crawling the web
- Subscribing to feeds
- Accessing web APIs
- Interacting with forms
- Using Selenium for advanced interaction
- Accessing password...