Collecting data from various data sources
There are three major ways to collect data. It is crucial to keep in mind that data doesn't always arrive as well-formatted tables:
- Obtaining structured tabulated data directly: For example, the Federal Reserve (https://www.federalreserve.gov/data.htm) releases well-structured and well-documented data in various formats, including CSV, so that pandas can read the file into a DataFrame format.
- Requesting data from an API: For example, the Google Maps API (https://developers.google.com/maps/documentation) allows developers to request data at a rate capped according to the pricing plan. The returned format is usually JSON or XML.
- Building a dataset from scratch: For example, social scientists often perform surveys and collect participants' answers to build proprietary data.
Let's look at some examples involving these three approaches. You will use the UCI Machine Learning Repository, the Google Maps API, and the USC President's Office website as data sources, respectively.
Reading data directly from files
Reading data from local files, or from remote files through a URL, usually requires a good source of publicly accessible data archives. For example, the University of California, Irvine maintains a well-known data repository for machine learning. We will be reading the heart disease dataset with pandas. You may obtain the file from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/; in case the following code fails, the latest URL will be updated in the book's official GitHub repository. From the available datasets, we are using the processed.hungarian.data file. You need to upload the file to the same folder where the notebook resides.
The following code snippet reads the data and displays the first several rows of the dataset:
import pandas as pd

df = pd.read_csv("processed.hungarian.data",
                 sep=",",
                 names=["age", "sex", "cp", "trestbps",
                        "chol", "fbs", "restecg", "thalach",
                        "exang", "oldpeak", "slope", "ca",
                        "thal", "num"])
df.head()
This displays the first five rows of the dataset as a DataFrame.
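If you prefer not to download the file manually, pandas can also read the file directly from its URL. This dataset also appears to use the ? character to mark missing values, which you may want to convert to NaN while loading. The following is a minimal sketch, assuming the UCI URL above is still live:

import pandas as pd

# Read the file directly from the UCI repository (URL assumed to be live)
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.hungarian.data")
df = pd.read_csv(url,
                 sep=",",
                 names=["age", "sex", "cp", "trestbps",
                        "chol", "fbs", "restecg", "thalach",
                        "exang", "oldpeak", "slope", "ca",
                        "thal", "num"],
                 na_values="?")   # assumption: '?' marks missing values here
df.head()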
In the following section, you will learn how to obtain data from an API.
Obtaining data from an API
In plain English, an Application Programming Interface (API) defines a set of protocols and agreements between applications or parts of applications. You send requests to an API and receive data in JSON or another format specified in the API documentation. You can then extract the data you want.
Note
When working with an API, you need to follow the guidelines and restrictions regarding API usage. Improper usage of an API may result in the suspension of your account or even legal issues.
Let's take the Google Maps Places API as an example. The Places API (https://developers.google.com/places/web-service/intro) is one of the many Google Maps APIs that Google offers. Developers can use HTTP requests to obtain information about geographic locations, the opening hours of establishments, and the types of establishments, such as schools, government offices, and police stations.
A note on using external APIs
Like many APIs, the Google Maps Places API requires you to create an account on its platform, the Google Cloud Platform. Creating an account is free, but a credit card is still required for some of the services it provides. Pay attention to your usage so that you are not charged unexpectedly.
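Because the API key is tied to your billing account, it is good practice not to hard-code it in a notebook that you might share or commit to a public repository. One common approach is to read it from an environment variable. The following is a minimal sketch, assuming you have exported a variable named GOOGLE_MAPS_API_KEY in your shell (the variable name is just an example):

import os

# Read the key from an environment variable instead of hard-coding it
# (GOOGLE_MAPS_API_KEY is a hypothetical name; use whatever you exported)
API_KEY = os.environ.get("GOOGLE_MAPS_API_KEY")
if API_KEY is None:
    raise RuntimeError("Please set the GOOGLE_MAPS_API_KEY environment variable")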
After obtaining and activating the API credentials, the developer can build standard HTTP requests to query the endpoints. For example, the textsearch endpoint is used to query places based on text. Here, you will use the API to query information about libraries in Culver City, Los Angeles:
- First, let's import the necessary libraries:
import requests
import json
- Initialize the API key and the endpoint. We need to replace API_KEY with a real API key to make the code work:
API_KEY = "Your API key goes here"
TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json?"
query = "Culver City Library"
- Obtain the response and parse the returned data into JSON format. Let's examine it:
response = requests.get(TEXT_SEARCH_URL + 'query=' + query + '&key=' + API_KEY)
json_object = response.json()
print(json_object)
The following is a one-result response. When a query matches more than one place, the results field will contain multiple entries, which you can index as a normal Python list object:
{'html_attributions': [], 'results': [{'formatted_address': '4975 Overland Ave, Culver City, CA 90230, United States', 'geometry': {'location': {'lat': 34.0075635, 'lng': -118.3969651}, 'viewport': {'northeast': {'lat': 34.00909257989272, 'lng': -118.3955611701073}, 'southwest': {'lat': 34.00639292010727, 'lng': -118.3982608298927}}}, 'icon': 'https://maps.gstatic.com/mapfiles/place_api/icons/civic_building-71.png', 'id': 'ccdd10b4f04fb117909897264c78ace0fa45c771', 'name': 'Culver City Julian Dixon Library', 'opening_hours': {'open_now': True}, 'photos': [{'height': 3024, 'html_attributions': ['<a href="https://maps.google.com/maps/contrib/102344423129359752463">Khaled Alabed</a>'], 'photo_reference': 'CmRaAAAANT4Td01h1tkI7dTn35vAkZhx_-mg3PjgKvjHiyh80M5UlI3wVw1cer4vkOksYR68NM9aw33ZPYGQzzXTE8bkOwQYuSChXAWlJUtz8atPhmRht4hP4dwFgqfbJULmG5f1EhAfWlF_cpLz76sD_81fns1OGhT4KU-zWTbuNY54_4_XozE02pLNWw', 'width': 4032}], 'place_id': 'ChIJrUqREx-6woARFrQdyscOZ-8', 'plus_code': {'compound_code': '2J53+26 Culver City, California', 'global_code': '85632J53+26'}, 'rating': 4.2, 'reference': 'ChIJrUqREx-6woARFrQdyscOZ-8', 'types': ['library', 'point_of_interest', 'establishment'], 'user_ratings_total': 49}], 'status': 'OK'}
The address and name of the library can be obtained as follows:
print(json_object["results"][0]["formatted_address"])
print(json_object["results"][0]["name"])
The result reads as follows:
4975 Overland Ave, Culver City, CA 90230, United States
Culver City Julian Dixon Library
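If a query matches several places, you can iterate over the results list instead of indexing a single entry. The following sketch also passes the query and key through the params argument of requests.get, which takes care of URL encoding; it reuses the API_KEY and TEXT_SEARCH_URL variables defined earlier, and the broader query string is only an illustrative example:

# A broader query that may return several places (hypothetical example query)
params = {"query": "library in Culver City", "key": API_KEY}
response = requests.get(TEXT_SEARCH_URL, params=params)
json_object = response.json()

# Iterate over every entry in the results list
for place in json_object["results"]:
    print(place["name"], "-", place["formatted_address"])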
Information
An API can be especially helpful for data augmentation. For example, if you have a list of addresses that are corrupted or mislabeled, using the Google Maps API may help you correct the wrong entries.
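As a sketch of this idea, the following loop queries the textsearch endpoint for each entry in a small, hypothetical list of messy addresses and records the formatted_address returned by the API. It reuses the API_KEY and TEXT_SEARCH_URL variables from the previous example and simply stores None when no match is found:

# Hypothetical list of corrupted or incomplete addresses
messy_addresses = ["4975 Overland Ave Culver Cty", "9770 Culver Blvd, Culver"]

cleaned_addresses = []
for address in messy_addresses:
    params = {"query": address, "key": API_KEY}
    response = requests.get(TEXT_SEARCH_URL, params=params)
    results = response.json().get("results", [])
    if results:
        # Take the formatted address of the top match
        cleaned_addresses.append(results[0]["formatted_address"])
    else:
        cleaned_addresses.append(None)  # no match found

print(cleaned_addresses)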
Obtaining data from scratch
There are instances where you would need to build your own dataset from scratch.
One way of building data is to crawl and parse the internet. A lot of resources on the web are public and free to use. Google's spiders crawl the web relentlessly, 24/7, to keep its search results up to date. You can write your own code to gather information online instead of opening a web browser and doing it manually.
Conducting surveys and obtaining feedback, whether explicitly or implicitly, is another way to obtain proprietary data. Companies such as Google and Amazon gather tons of data through user profiling. Such data forms the core of their dominance in advertising and e-commerce. We won't be covering this method, however.
Legal issues of crawling
Note that in some cases, web crawling is highly controversial. Before crawling a website, do check its user agreement. Some websites explicitly forbid web crawling. Even if a website allows crawling, intensive requests may dramatically slow it down and prevent it from serving other users normally. Respecting a website's policy is not only a courtesy; in many cases, it is also a legal obligation.
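One simple programmatic courtesy check is to consult a site's robots.txt file before crawling it. The following is a minimal sketch using Python's built-in urllib.robotparser module; it assumes the site publishes a robots.txt file at the usual location:

from urllib import robotparser

# Ask the site's robots.txt whether generic crawlers may fetch this page
rp = robotparser.RobotFileParser()
rp.set_url("http://departmentsdirectory.usc.edu/robots.txt")
rp.read()

allowed = rp.can_fetch("*", "http://departmentsdirectory.usc.edu/pres_off.html")
print("Crawling allowed:", allowed)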
Here is a simple example that uses regular expressions to obtain all the phone numbers from the web page of the President's Office at the University of Southern California: http://departmentsdirectory.usc.edu/pres_off.html:
- First, let's import the necessary libraries. re is Python's built-in regular expression library, and requests is an HTTP client that enables communication over the HTTP protocol:
import re
import requests
- If you look at the web page, you will notice that there is a pattern within the phone numbers. All the phone numbers start with three digits, followed by a hyphen and then four digits. Our objective now is to compile such a pattern:
pattern = re.compile(r"\d{3}-\d{4}")
- The next step is to send a GET request and obtain the response:
response = requests.get("http://departmentsdirectory.usc.edu/pres_off.html")
- The text attribute of response holds the page source as a long string, which can be fed to the findall method (a slightly more complete version of this script is sketched after the output below):
pattern.findall(response.text)
The results contain all the phone numbers on the web page:
['740-2111', '821-1342', '740-2111', '740-2111', '740-2111', '740-2111', '740-2111', '740-2111', '740-9749', '740-2505', '740-6942', '821-1340', '821-6292']
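Putting the pieces together, the following is a minimal end-to-end sketch of the crawl. It checks the HTTP status code before parsing, identifies itself with a simple User-Agent header, and deduplicates the matches with a set; the header string is only an illustrative example:

import re
import requests

URL = "http://departmentsdirectory.usc.edu/pres_off.html"
# A simple, illustrative User-Agent so the site knows who is requesting the page
headers = {"User-Agent": "data-collection-example/0.1"}

response = requests.get(URL, headers=headers)
if response.status_code == 200:
    pattern = re.compile(r"\d{3}-\d{4}")
    phone_numbers = sorted(set(pattern.findall(response.text)))
    print(phone_numbers)
else:
    print("Request failed with status code:", response.status_code)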
In this section, we introduced three different ways of collecting data: reading tabulated data from files provided by others, obtaining data from APIs, and building data from scratch. In the rest of the book, we will focus on the first option and mainly use data collected from the UCI Machine Learning Repository. In most cases, API data and scraped data are eventually integrated into tabulated datasets for production usage.