Reading HTML tables
You can use pandas to read HTML tables from websites. This makes it easy to ingest tables such as those found on Wikipedia or other websites.
In this recipe, we will scrape tables from the Wikipedia entry for The Beatles discography. In particular, we want the studio albums table shown in the following image, as it appeared on Wikipedia in 2019:
Wikipedia table for studio albums
How to do it...
- Use the read_html function to load all of the tables from https://en.wikipedia.org/wiki/The_Beatles_discography:
>>> import pandas as pd
>>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
>>> dfs = pd.read_html(url)
>>> len(dfs)
51
- Inspect the first DataFrame:
>>> dfs[0]
   The Beatles discography  The Beatles discography.1
0              The Beat...                The Beat...
1              Studio a...                         23
2              Live albums                          5
3              Compilat...                         53
4              Video al...                        ...
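Because the live Wikipedia page changes over time, the same steps can be tried offline. The following is a minimal sketch rather than part of the original recipe: the HTML string, its captions, and its album rows are made-up placeholders, and it assumes that an HTML parser that pandas can use (such as lxml) is installed. It shows that read_html returns one DataFrame per table, and that the match parameter can narrow the result to tables containing a given piece of text:

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature page with two tables, standing in for a real web page.
html = """
<html><body>
<table>
  <caption>Summary</caption>
  <tr><th>Kind</th><th>Count</th></tr>
  <tr><td>Studio albums</td><td>2</td></tr>
  <tr><td>Live albums</td><td>1</td></tr>
</table>
<table>
  <caption>List of studio albums</caption>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>First album</td><td>1963</td></tr>
  <tr><td>Second album</td><td>1964</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> element it finds.
dfs = pd.read_html(StringIO(html))
print(len(dfs))  # 2

# match keeps only the tables that contain text matching the string or regex.
albums = pd.read_html(StringIO(html), match="List of studio albums")
print(albums[0])
```

Wrapping the markup in StringIO lets read_html treat it like a file, so no network access is needed. On a real page with dozens of tables, passing match is usually quicker than inspecting every DataFrame in the list by hand.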