You're reading from Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Product type Paperback

Published in Feb 2020

Publisher Packt

ISBN-13 9781839213106

Length 626 pages

Edition 2nd Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Authors (2):

Theodore Petrou

Matthew Harrison

Selecting multiple DataFrame columns

We can select a single column by passing the column name to the index operator of a DataFrame. This was covered in the Selecting a column recipe in Chapter 1, Pandas Foundations. It is often necessary to focus on a subset of the current working dataset, which is accomplished by selecting multiple columns.

In this recipe, all the actor and director columns will be selected from the movie dataset.

How to do it...

1. Read in the movie dataset, and pass in a list of the desired columns to the indexing operator:

```
>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
...     [
...         "actor_1_name",
...         "actor_2_name",
...         "actor_3_name",
...         "director_name",
...     ]
... ]
>>> movie_actor_director.head()
  actor_1_name actor_2_name actor_3_name director_name
0  CCH Pounder  Joel Dav...    Wes Studi  James Ca...
1  Johnny Depp  Orlando ...  Jack Dav...  Gore Ver...
2  Christop...  Rory Kin...  Stephani...   Sam Mendes
3    Tom Hardy  Christia...  Joseph G...  Christop...
4  Doug Walker   Rob Walker          NaN  Doug Walker
```

2. There are instances when only one column of a DataFrame needs to be selected. The indexing operator can return either a Series or a DataFrame. If we pass in a list with a single item, we get back a DataFrame. If we pass in just a string with the column name, we get a Series back:
```
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>
```
3. We can also use .loc to pull out a column by name. Because this index operation requires that we pass in a row selector first, we use a colon (:) to indicate a slice that selects all of the rows. This can also return either a DataFrame or a Series:
```
>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>
```
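
As a minimal sketch of the Series-versus-DataFrame distinction, the following uses a small made-up frame so that it runs without data/movie.csv. The name `toy` and all of its values are hypothetical; only the column names come from the recipe.

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the recipe.
toy = pd.DataFrame(
    {
        "actor_1_name": ["Actor A", "Actor B"],
        "director_name": ["Director X", "Director Y"],
    }
)

# A plain string returns a one-dimensional Series.
as_series = toy["director_name"]
# A one-element list returns a two-dimensional DataFrame with one column.
as_frame = toy[["director_name"]]

print(as_series.shape)  # (2,)
print(as_frame.shape)   # (2, 1)

# .loc follows the same rule for its column selector.
loc_series = toy.loc[:, "director_name"]
loc_frame = toy.loc[:, ["director_name"]]
```

Comparing the shapes is a quick way to confirm which of the two types a selection produced.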

How it works...

The DataFrame index operator is flexible and accepts several kinds of objects. If a string is passed, it returns a one-dimensional Series. If a list is passed, it returns a DataFrame containing the listed columns in the specified order.
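
The point about column order can be illustrated with a short sketch. It assumes a made-up frame (`toy` and its values are hypothetical, not taken from the movie dataset) and selects two columns in a different order from how they are stored:

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the recipe.
toy = pd.DataFrame(
    {
        "actor_1_name": ["Actor A", "Actor B"],
        "actor_2_name": ["Actor C", "Actor D"],
        "director_name": ["Director X", "Director Y"],
    }
)

# The result follows the order of the list, not the order in the original frame.
reordered = toy[["director_name", "actor_1_name"]]
print(list(reordered.columns))  # ['director_name', 'actor_1_name']
```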

Step 2 shows how to select a single column as a DataFrame and as a Series. Usually, a single column is selected with a string, resulting in a Series. When a DataFrame is desired, put the column name in a single-element list.

Step 3 shows how to use the loc attribute to pull out a Series or a DataFrame.

There's more...

Passing a long list inside the indexing operator might cause readability issues. To help with this, you may save all your column names to a list variable first. The following code achieves the same result as step 1:

```
>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movie_actor_director = movies[cols]
```

One of the most common exceptions raised when working with pandas is KeyError. It is usually caused by a mistyped column or index name. The same error is raised when multiple columns are selected without wrapping the names in a list, because pandas then treats the comma-separated names as a single tuple key:

```
>>> movies[
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
Traceback (most recent call last):
  ...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
```
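
One possible way to guard against both causes of KeyError is sketched below. It assumes a made-up frame (`toy`, `wanted`, and the mistyped name `directr_name` are hypothetical) and checks the requested names against the frame's columns before selecting:

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the recipe.
toy = pd.DataFrame(
    {
        "actor_1_name": ["Actor A"],
        "director_name": ["Director X"],
    }
)

# Comma-separated strings inside the brackets form one tuple key,
# which pandas looks up as a single column label and fails to find.
raised = False
try:
    toy["actor_1_name", "director_name"]
except KeyError:
    raised = True

# Checking names up front shows exactly which ones are missing.
wanted = ["actor_1_name", "directr_name"]  # second name is deliberately mistyped
missing = [col for col in wanted if col not in toy.columns]
print(missing)  # ['directr_name']
```

Listing the missing names before selecting tends to be easier to read than the raw traceback, especially with long column lists.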
