Plotting data from a DataFrame
As with many mathematical problems, one of the first steps in formulating a strategy is to find some way to visualize the problem and all the information it contains. For data-based problems, this usually means producing a plot of the data and visually inspecting it for trends, patterns, and underlying structure. Since this is such a common operation, pandas provides a quick and simple interface for plotting data in various forms directly from a Series or DataFrame, using Matplotlib under the hood by default.
In this recipe, we will learn how to plot data directly from a DataFrame or Series to understand the underlying trends and structure.
Getting ready
For this recipe, we will need the pandas library imported as pd, the NumPy library imported as np, the Matplotlib pyplot module imported as plt, and a default random number generator instance created using the following commands:
from numpy.random import default_rng
rng = default_rng(12345)
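For convenience, the full set of imports described above can be collected into a single preamble. This is only a sketch of one possible setup; the aliases are the ones named in this recipe, and the seed is the one used above.

```python
import matplotlib.pyplot as plt  # pyplot interface, imported as plt
import numpy as np               # NumPy, imported as np
import pandas as pd              # pandas, imported as pd
from numpy.random import default_rng

# Seeded generator so that the random data is reproducible between runs.
rng = default_rng(12345)
```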
How to do it...
Follow these steps to create a simple DataFrame using random data and produce plots of the data it contains:
- Create a sample DataFrame using random data:
diffs = rng.standard_normal(size=100)
walk = diffs.cumsum()
df = pd.DataFrame({
    "diffs": diffs,
    "walk": walk
})
- Next, we have to create a blank figure with two subplots ready for plotting:
fig, (ax1, ax2) = plt.subplots(1, 2, tight_layout=True)
- We have to plot the walk column as a standard line graph. This can be done by using the plot method on the Series (column) object without specifying the kind of plot. We will force the plotting onto ax1 by passing the ax=ax1 keyword argument:
df["walk"].plot(ax=ax1, title="Random walk", color="k")
ax1.set_xlabel("Index")
ax1.set_ylabel("Value")
- Now, we have to plot a histogram of the diffs column by passing the kind="hist" keyword argument to the plot method:
df["diffs"].plot(kind="hist", ax=ax2,
    title="Histogram of diffs", color="k", alpha=0.6)
ax2.set_xlabel("Difference")
The resulting plots are shown here:
Figure 6.1 – Plot of the walk value and a histogram of differences from a DataFrame
Here, we can see that the histogram of differences approximates a standard normal distribution (mean 0 and variance 1). The random walk plot shows the cumulative sum of the differences and oscillates (fairly symmetrically) above and below 0.
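The steps above can be gathered into a single script, sketched here as one way to run the recipe from start to finish. The np.allclose check is an addition for illustration and is not part of the original recipe; it simply confirms that the walk column is the running total of the diffs column.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)

# Step 1: random differences and their cumulative sum (the random walk).
diffs = rng.standard_normal(size=100)
df = pd.DataFrame({"diffs": diffs, "walk": diffs.cumsum()})

# Illustrative sanity check (not in the recipe): walk is the running total of diffs.
walk_matches = np.allclose(df["walk"], df["diffs"].cumsum())

# Step 2: one figure with two side-by-side subplots.
fig, (ax1, ax2) = plt.subplots(1, 2, tight_layout=True)

# Step 3: line plot of the walk column on the first axes.
df["walk"].plot(ax=ax1, title="Random walk", color="k")
ax1.set_xlabel("Index")
ax1.set_ylabel("Value")

# Step 4: histogram of the diffs column on the second axes.
df["diffs"].plot(kind="hist", ax=ax2,
                 title="Histogram of diffs", color="k", alpha=0.6)
ax2.set_xlabel("Difference")
```

To view the result, call plt.show() in an interactive session, or write the figure to disk with fig.savefig and a filename of your choice.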
How it works...
The plot method on a Series (or a DataFrame) is a quick way to plot the data it contains against the row index. The kind keyword argument controls the type of plot that is produced, with a line plot being the default. There are lots of options for the plotting type, including bar for a vertical bar chart, barh for a horizontal bar chart, hist for a histogram (as seen in this recipe), box for a box plot, and scatter for a scatter plot (available only on a DataFrame, since it needs columns for both x and y). There are several other keyword arguments to customize the plot that is produced. In this recipe, we also provided the title keyword argument to add a title to each subplot.
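As a brief illustration of the kind argument, the following sketch draws a box plot from a Series and a scatter plot from a DataFrame, using the same random data as the recipe. The names ax_box, ax_scatter, and series_scatter_error are chosen here for illustration only. The final part shows that pandas rejects kind="scatter" on a Series, because a scatter plot needs two columns.

```python
import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)
diffs = rng.standard_normal(size=100)
df = pd.DataFrame({"diffs": diffs, "walk": diffs.cumsum()})

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, tight_layout=True)

# A single column is enough for a box plot, so this works on a Series.
df["diffs"].plot(kind="box", ax=ax_box, title="Box plot of diffs")

# A scatter plot needs x and y columns, so it is called on the DataFrame.
df.plot(kind="scatter", x="diffs", y="walk", ax=ax_scatter,
        title="walk against diffs")

# Requesting a scatter plot from a Series raises a ValueError.
series_scatter_error = None
try:
    df["diffs"].plot(kind="scatter")
except ValueError as err:
    series_scatter_error = str(err)
```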
Since we wanted to put both plots on the same figure side by side using subplots that we had already created, we used the ax keyword argument to pass in the respective axes handles to the plotting routine. Even if you let the plot method construct a figure, you may still need to call the plt.show routine to display it, depending on your environment (for example, when running a script rather than working in an interactive session).
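When no ax argument is given, the plot method chooses or creates the axes itself and returns them, so the figure can still be customized afterward. The following is a minimal sketch of that pattern, assuming it is run as a standalone script; the variable names are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)
walk = pd.Series(rng.standard_normal(size=100).cumsum(), name="walk")

# No ax argument: pandas supplies the axes and returns them.
ax = walk.plot(title="Random walk", color="k")
fig = ax.figure  # the Matplotlib figure that owns the returned axes

# The returned axes can be adjusted just like the ones created by plt.subplots.
ax.set_xlabel("Index")
ax.set_ylabel("Value")

# In a script, this call is what actually opens the figure window.
plt.show()
```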
There’s more...
We can produce several common types of plots using the pandas interface. In addition to the line plot and histogram used in this recipe, these include scatter plots, bar plots (horizontal and vertical), area plots, pie charts, and box plots. The plot method also accepts various keyword arguments to customize the appearance of the plot.
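As a rough sketch of two of these plot types, the following example draws an area plot and a pie chart from the same random data. The derived quantities (absolute step sizes and counts of upward and downward steps) are introduced here purely for illustration. Absolute values are used for the area plot because stacked area plots expect values of a single sign.

```python
import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)
diffs = rng.standard_normal(size=100)
df = pd.DataFrame({"diffs": diffs, "walk": diffs.cumsum()})

fig, (ax_area, ax_pie) = plt.subplots(1, 2, figsize=(8, 3), tight_layout=True)

# Area plot of the step sizes; alpha and color customize the appearance.
df["diffs"].abs().plot(kind="area", ax=ax_area, color="k", alpha=0.4,
                       title="Step sizes")

# Count the upward and downward steps, then show the proportions as a pie chart.
direction_counts = (
    (df["diffs"] > 0).map({True: "up", False: "down"}).value_counts()
)
direction_counts.plot(kind="pie", ax=ax_pie, autopct="%1.0f%%",
                      title="Step direction")
```

The autopct argument is passed through to Matplotlib and labels each wedge with its percentage.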