Summarizing a DataFrame
In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.
In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.
How to do it...
- Read in the movie dataset, and examine the basic descriptive properties,
.shape
,.size
, and.ndim
, along with running thelen
function:>>> movies = pd.read_csv("data/movie.csv") >>> movies.shape (4916, 28) >>> movies.size 137648 >>> movies.ndim 2 >>> len(movies) 4916
- The
.count
method shows the number of non-missing values for each column. It is an aggregation method as it summarizes every column in a single value. The output is a Series that has the original column names as its index:>>> movies.count() color 4897 director_name 4814 num_critic_for_reviews 4867 duration 4901 director_facebook_likes 4814 ... title_year 4810 actor_2_facebook_likes 4903 imdb_score 4916 aspect_ratio 4590 movie_facebook_likes 4916 Length: 28, dtype: int64
- The other methods that compute summary statistics,
.min
,.max
,.mean
,.median
, and.std
, return Series that have the column names of the numeric columns in the index and their aggregations as the values:>>> movies.min() num_critic_for_reviews 1.00 duration 7.00 director_facebook_likes 0.00 actor_3_facebook_likes 0.00 actor_1_facebook_likes 0.00 ... title_year 1916.00 actor_2_facebook_likes 0.00 imdb_score 1.60 aspect_ratio 1.18 movie_facebook_likes 0.00 Length: 16, dtype: float64
- The
.describe
method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistics names as its index. I like to transpose the results using.T
as I can usually fit more information on the screen that way:>>> movies.describe().T count mean ... 75% max num_criti... 4867.0 137.988905 ... 191.00 813.0 duration 4901.0 107.090798 ... 118.00 511.0 director_... 4814.0 691.014541 ... 189.75 23000.0 actor_3_f... 4893.0 631.276313 ... 633.00 23000.0 actor_1_f... 4909.0 6494.488491 ... 11000.00 640000.0 ... ... ... ... ... ... title_year 4810.0 2002.447609 ... 2011.00 2016.0 actor_2_f... 4903.0 1621.923516 ... 912.00 137000.0 imdb_score 4916.0 6.437429 ... 7.20 9.5 aspect_ratio 4590.0 2.222349 ... 2.35 16.0 movie_fac... 4916.0 7348.294142 ... 2000.00 349000.0
- It is possible to specify exact quantiles in the
.describe
method using thepercentiles
parameter:>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T count mean ... 99% max num_criti... 4867.0 137.988905 ... 546.68 813.0 duration 4901.0 107.090798 ... 189.00 511.0 director_... 4814.0 691.014541 ... 16000.00 23000.0 actor_3_f... 4893.0 631.276313 ... 11000.00 23000.0 actor_1_f... 4909.0 6494.488491 ... 44920.00 640000.0 ... ... ... ... ... ... title_year 4810.0 2002.447609 ... 2016.00 2016.0 actor_2_f... 4903.0 1621.923516 ... 17000.00 137000.0 imdb_score 4916.0 6.437429 ... 8.50 9.5 aspect_ratio 4590.0 2.222349 ... 4.00 16.0 movie_fac... 4916.0 7348.294142 ... 93850.00 349000.0
How it works...
Step 1 gives basic information on the size of the dataset. The .shape
attribute returns a tuple with the number of rows and columns. The .size
attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim
attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len
function, it returns the number of rows.
The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.
If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. This method ignores string columns by default.
Note that numeric columns have missing values but have a result returned by .describe
. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna
parameter to False
. This will cause pandas to return NaN
for all these aggregation methods if there exists at least a single missing value.
The .describe
method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles
parameter. See the Developing a data analysis routine recipe for more on the .describe
method.
There's more...
To see how the .skipna
parameter affects the outcome, we can set its value to False
and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:
>>> movies.min(skipna=False)
num_critic_for_reviews NaN
duration NaN
director_facebook_likes NaN
actor_3_facebook_likes NaN
actor_1_facebook_likes NaN
...
title_year NaN
actor_2_facebook_likes NaN
imdb_score 1.6
aspect_ratio NaN
movie_facebook_likes 0.0
Length: 16, dtype: float64