Selecting multiple DataFrame columns
We can select a single column by passing the column name to the index operator of a DataFrame. This was covered in the Selecting a column recipe in Chapter 1, Pandas Foundations. It is often necessary to focus on a subset of the current working dataset, which is accomplished by selecting multiple columns.
In this recipe, all the actor and director columns will be selected from the movie dataset.
How to do it...
- Read in the movie dataset, and pass in a list of the desired columns to the indexing operator:
>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
...     [
...         "actor_1_name",
...         "actor_2_name",
...         "actor_3_name",
...         "director_name",
...     ]
... ]
>>> movie_actor_director.head()
   actor_1_name  actor_2_name  actor_3_name  director_name
0   CCH Pounder   Joel Dav...     Wes Studi    James Ca...
1   Johnny Depp   Orlando ...   Jack Dav...    Gore Ver...
2   Christop...   Rory Kin...   Stephani...     Sam Mendes
3     Tom Hardy   Christia...   Joseph G...    Christop...
4   Doug Walker    Rob Walker           NaN    Doug Walker
- There are instances when one column of a DataFrame needs to be selected. Using the index operation can return either a Series or a DataFrame. If we pass in a list with a single item, we will get back a DataFrame. If we pass in just a string with the column name, we will get a Series back:
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>
- We can also use .loc to pull out a column by name. Because this index operation requires that we pass in a row selector first, we will use a colon (:) to indicate a slice that selects all of the rows. This can also return either a DataFrame or a Series:
>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>
How it works...
The DataFrame index operator is flexible and accepts several kinds of objects. If a string is passed, it returns a one-dimensional Series. If a list is passed, it returns a DataFrame containing the listed columns in the order given.
Step 2 shows how to select a single column as a DataFrame and as a Series. Usually, a single column is selected with a string, resulting in a Series. When a DataFrame is desired, put the column name in a single-element list.
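The difference between the two forms can be sketched on a small stand-in DataFrame. The values below are made up for illustration and are not taken from data/movie.csv; only the column names mirror the recipe.

```python
import pandas as pd

# Toy stand-in for the movie dataset (values are invented for illustration).
movies = pd.DataFrame(
    {
        "director_name": ["Director A", "Director B"],
        "actor_1_name": ["Actor A", "Actor B"],
        "actor_2_name": ["Actor C", "Actor D"],
    }
)

# A string selects one column and returns a one-dimensional Series.
as_series = movies["director_name"]

# A single-element list returns a two-dimensional DataFrame with one column.
as_frame = movies[["director_name"]]

# A longer list returns the columns in the order of the list,
# not in the order they appear in the original DataFrame.
subset = movies[["actor_2_name", "director_name"]]

print(as_series.ndim, as_frame.ndim)  # 1 2
print(as_frame.shape)                 # (2, 1)
print(list(subset.columns))           # ['actor_2_name', 'director_name']
```

Checking ndim or shape is a quick way to confirm which of the two objects a selection produced.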
Step 3 shows how to use the loc attribute to pull out a Series or a DataFrame.
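As a rough check that .loc with a full row slice gives the same result as the index operator, the following sketch compares the two on a toy DataFrame. The data is invented for illustration; only the director_name column name comes from the recipe.

```python
import pandas as pd

# Toy stand-in for the movie dataset (values are invented for illustration).
movies = pd.DataFrame(
    {
        "director_name": ["Director A", "Director B"],
        "actor_1_name": ["Actor A", "Actor B"],
    }
)

# The colon selects every row; the second selector picks the column(s).
series_from_loc = movies.loc[:, "director_name"]
frame_from_loc = movies.loc[:, ["director_name"]]

# Both forms match what the plain index operator returns.
print(series_from_loc.equals(movies["director_name"]))   # True
print(frame_from_loc.equals(movies[["director_name"]]))  # True
```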
There's more...
Passing a long list inside the indexing operator might cause readability issues. To help with this, you may save all your column names to a list variable first. The following code achieves the same result as step 1:
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movie_actor_director = movies[cols]
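When the list is long, it may also be possible to build it from the DataFrame's own column labels instead of typing each name. The sketch below is one way to do that, under the assumption that the wanted columns all start with "actor" or "director" and end with "_name". The toy DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the movie dataset (values are invented for illustration).
movies = pd.DataFrame(
    {
        "movie_title": ["Movie A", "Movie B"],
        "director_name": ["Director A", "Director B"],
        "actor_1_name": ["Actor A", "Actor B"],
        "actor_1_facebook_likes": [100, 200],
        "actor_2_name": ["Actor C", "Actor D"],
    }
)

# Assumption: the naming pattern below identifies the columns we want.
# str.startswith accepts a tuple of prefixes.
cols = [
    col
    for col in movies.columns
    if col.startswith(("actor", "director")) and col.endswith("_name")
]

movie_actor_director = movies[cols]
print(cols)  # ['director_name', 'actor_1_name', 'actor_2_name']
```

A list built this way follows the column order of the DataFrame, so reorder it explicitly if a different order is needed.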
One of the most common exceptions raised when working with pandas is KeyError. This error is mainly due to mistyping a column or index name. The same error is raised whenever a multiple-column selection is attempted without the use of a list:
>>> movies[
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
Traceback (most recent call last):
...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
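Both causes of the KeyError can be reproduced on a toy DataFrame. The sketch below uses invented data and a deliberately misspelled name; checking the requested names against movies.columns before selecting is just one possible safeguard, not a pandas requirement.

```python
import pandas as pd

# Toy stand-in for the movie dataset (values are invented for illustration).
movies = pd.DataFrame(
    {
        "director_name": ["Director A", "Director B"],
        "actor_1_name": ["Actor A", "Actor B"],
    }
)

# The second name is deliberately misspelled.
wanted = ["actor_1_name", "director_nmae"]

# Find the requested names that are not column labels.
missing = [col for col in wanted if col not in movies.columns]
print(missing)  # ['director_nmae']

# Without the inner list, the names form a tuple, which pandas
# treats as a single (nonexistent) column label.
raised = False
try:
    movies["actor_1_name", "director_name"]
except KeyError as err:
    raised = True
    print("KeyError:", err)
```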