You're reading from Pandas Cookbook Recipes for Scientific Computing, Time Series Analysis and Data Visualization using Python

Product type Paperback

Published in Oct 2017

Publisher Packt

ISBN-13 9781784393878

Length 532 pages

Edition 1st Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Author (1):

Theodore Petrou


Creating and deleting columns

During a data analysis, it is extremely likely that you will need to create new columns to represent new variables. Commonly, these new columns will be created from previous columns already in the dataset. Pandas has a few different ways to add new columns to a DataFrame.

Getting ready

In this recipe, we create new columns in the movie dataset by using assignment and then delete columns with the drop method.

How to do it...

The simplest way to create a new column is to assign it a scalar value. Place the name of the new column as a string into the indexing operator. Let's create the has_seen column in the movie dataset to indicate whether or not we have seen the movie. We will assign zero for every value. By default, new columns are appended to the end:

>>> import pandas as pd
>>> movie = pd.read_csv('data/movie.csv')
>>> movie['has_seen'] = 0

There are several columns that contain data on the number of Facebook likes. Let's add up all the actor and director Facebook likes and assign them to the actor_director_facebook_likes column:

>>> movie['actor_director_facebook_likes'] =  \
        (movie['actor_1_facebook_likes'] + 
         movie['actor_2_facebook_likes'] + 
         movie['actor_3_facebook_likes'] + 
         movie['director_facebook_likes'])

From the Calling Series method recipe in this chapter, we know that this dataset contains missing values. When numeric columns are added to one another with the plus operator, as in the preceding step, pandas does not treat missing values as zero: if any of the values in a row is missing, the total for that row is missing as well. Let's check how many missing values there are in our new column and fill them with 0:

>>> movie['actor_director_facebook_likes'].isnull().sum()
122
>>> movie['actor_director_facebook_likes'] = \
    movie['actor_director_facebook_likes'].fillna(0)

There is another column in the dataset named cast_total_facebook_likes. It would be interesting to see what percentage of this column comes from our newly created column, actor_director_facebook_likes. Before we create our percentage column, let's do some basic data validation. Let's ensure that cast_total_facebook_likes is greater than or equal to actor_director_facebook_likes:

>>> movie['is_cast_likes_more'] = \
         (movie['cast_total_facebook_likes'] >=             
          movie['actor_director_facebook_likes'])

is_cast_likes_more is now a column of boolean values. We can check whether all the values of this column are True with the all Series method:

>>> movie['is_cast_likes_more'].all()
False

It turns out that there is at least one movie with more actor_director_facebook_likes than cast_total_facebook_likes. It could be that director Facebook likes are not part of the cast total likes. Let's backtrack and delete the actor_director_facebook_likes column:

>>> movie = movie.drop('actor_director_facebook_likes',
                       axis='columns')

Let's recreate a column of just the total actor likes:

>>> movie['actor_total_facebook_likes'] = \
         (movie['actor_1_facebook_likes'] + 
          movie['actor_2_facebook_likes'] + 
          movie['actor_3_facebook_likes'])

>>> movie['actor_total_facebook_likes'] = \
         movie['actor_total_facebook_likes'].fillna(0)

Check again whether all the values in cast_total_facebook_likes are greater than or equal to those in actor_total_facebook_likes:

>>> movie['is_cast_likes_more'] = \
         (movie['cast_total_facebook_likes'] >= 
          movie['actor_total_facebook_likes'])
    
>>> movie['is_cast_likes_more'].all()
True

Finally, let's calculate the proportion of cast_total_facebook_likes that comes from actor_total_facebook_likes:

>>> movie['pct_actor_cast_like'] = \
         (movie['actor_total_facebook_likes'] / 
          movie['cast_total_facebook_likes'])

Let's validate that the min and max of this column fall between 0 and 1:

>>> (movie['pct_actor_cast_like'].min(), 
     movie['pct_actor_cast_like'].max())
(0.0, 1.0)

We can then output this column as a Series. First, we set the index to the movie title so that each value can be properly identified:

>>> movie.set_index('movie_title')['pct_actor_cast_like'].head()
movie_title
Avatar                                        0.577369
Pirates of the Caribbean: At World's End      0.951396
Spectre                                       0.987521
The Dark Knight Rises                         0.683783
Star Wars: Episode VII - The Force Awakens    0.000000
Name: pct_actor_cast_like, dtype: float64

How it works...

Many pandas operations are flexible, and column creation is one of them. This recipe assigns both a scalar value, as seen in step 1, and a Series, as seen in step 2, to create a new column.
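The two kinds of assignment can be sketched on a small DataFrame. The data and the names df and bonus below are hypothetical stand-ins, not part of the movie dataset. The sketch assumes the usual pandas behavior that a scalar is broadcast to every row, while a Series is aligned on the index before it is stored:

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
df = pd.DataFrame({'title': ['A', 'B', 'C'],
                   'likes': [10, 20, 30]})

# A scalar value is broadcast to every row of the new column
df['has_seen'] = 0

# A Series is matched to the DataFrame by index label, not by position;
# rows with no matching label (here, index 1) become missing
bonus = pd.Series([5, 7], index=[2, 0])
df['bonus'] = bonus

print(df)
```

Both new columns land at the end of the DataFrame, in the order in which they were assigned.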

Step 2 adds four different Series together with the plus operator. Step 3 uses method chaining to find and fill missing values. Step 4 uses the greater than or equal comparison operator to return a boolean Series, which is then evaluated with the all method in step 5 to check whether every single value is True or not.
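The following is a minimal sketch of these steps on made-up numbers. The column names and values are illustrative assumptions, not taken from the movie dataset. It also contrasts the plus operator, which yields a missing result whenever any operand is missing, with the add method and its fill_value parameter, which only yields a missing result when both operands are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
likes = pd.DataFrame({'actor_1': [100.0, 50.0, np.nan],
                      'actor_2': [10.0, np.nan, np.nan],
                      'cast_total': [120.0, 60.0, 0.0]})

# The plus operator propagates missing values: rows 1 and 2 are NaN
plus_total = likes['actor_1'] + likes['actor_2']

# Count the missing totals, then replace them with 0
n_missing = plus_total.isnull().sum()
filled_total = plus_total.fillna(0)

# The add method with fill_value=0 treats a single missing operand as 0,
# but keeps the result missing when both operands are missing (row 2)
add_total = likes['actor_1'].add(likes['actor_2'], fill_value=0)

# A comparison returns a boolean Series; all() reduces it to one value
is_more = likes['cast_total'] >= filled_total
all_more = is_more.all()

print(n_missing, all_more)
```

Which of the two approaches is appropriate depends on whether a partially missing row should count as a usable total or as unknown.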

The drop method accepts the name of the row or column to delete. It defaults to dropping rows by their index labels. To drop columns, you must set the axis parameter to either 1 or the string 'columns'. The default value for axis is 0, which is equivalent to the string 'index'.
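As a small illustration of the axis parameter, the sketch below drops a column and a row from a hypothetical three-column DataFrame. The columns keyword shown as an alternative is not used in this recipe; it is assumed to be available, as it is in recent pandas versions:

```python
import pandas as pd

# Hypothetical toy data with labeled rows
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]},
                  index=['x', 'y'])

# Drop a column by setting axis to 'columns' (equivalent to axis=1)
no_b = df.drop('b', axis='columns')

# Alternative spelling with the columns keyword (assumed available)
no_b_alt = df.drop(columns='b')

# With the default axis of 0, drop looks for the label in the index
no_x = df.drop('x')

# drop returns a new DataFrame; df itself is unchanged
print(df.shape, no_b.shape, no_x.shape)
```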

Steps 7 and 8 redo the work of steps 2 through 5 without the director_facebook_likes column. Step 9 finally calculates the column we have wanted since step 4. Step 10 validates that its values are between 0 and 1.
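The division and the range check can be sketched as follows. The data is hypothetical, and the row with a zero cast total is included purely to show that dividing zero by zero produces a missing value, which min and max skip by default. Whether the movie dataset contains such rows is not stated in this recipe:

```python
import pandas as pd

# Hypothetical toy data; the second row has no likes at all
df = pd.DataFrame({'actor_total': [50.0, 0.0, 30.0],
                   'cast_total': [100.0, 0.0, 30.0]})

# Element-wise division; 0.0 / 0.0 results in NaN rather than an error
df['pct'] = df['actor_total'] / df['cast_total']

# min and max ignore missing values by default
low, high = df['pct'].min(), df['pct'].max()

# Count undefined ratios separately, since min and max hide them
n_undefined = df['pct'].isnull().sum()

# Check every defined value directly instead of only the extremes
in_range = df['pct'].dropna().between(0, 1).all()

print(low, high, n_undefined, in_range)
```

Counting the missing values alongside the minimum and maximum makes it clear how many rows the range check did not cover.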

There's more...

It is possible to insert a new column into a specific place in a DataFrame besides the end with the insert method. The insert method takes the integer position of the new column as its first argument, the name of the new column as its second, and the values as its third. You will need to use the get_loc Index method to find the integer location of the column name.

The insert method modifies the calling DataFrame in-place, so there won't be an assignment statement. The profit of each movie may be calculated by subtracting budget from gross and inserting it directly after gross with the following:

>>> profit_index = movie.columns.get_loc('gross') + 1
>>> profit_index
9

>>> movie.insert(loc=profit_index,
                 column='profit',
                 value=movie['gross'] - movie['budget'])
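The same pattern can be tried on a small DataFrame without loading the movie dataset. The data below is hypothetical; only the get_loc and insert calls mirror the recipe:

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
df = pd.DataFrame({'title': ['A', 'B'],
                   'gross': [300, 150],
                   'budget': [100, 200]})

# Integer position directly after the gross column
profit_index = df.columns.get_loc('gross') + 1

# insert modifies df in place and returns None
df.insert(loc=profit_index,
          column='profit',
          value=df['gross'] - df['budget'])

print(df.columns.tolist())
```

Note that insert raises a ValueError if a column with the same name already exists, unless allow_duplicates=True is passed, so running the insert call a second time fails.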

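An alternative to deleting columns with the drop method is to use the del statement, which removes the column from the DataFrame in place. The column must still exist, so the following assumes that actor_director_facebook_likes was not already dropped as in step 6: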

>>> del movie['actor_director_facebook_likes']
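The difference between the two approaches can be sketched on hypothetical data: drop returns a new DataFrame and leaves the original untouched, whereas del mutates the original and raises a KeyError when the column is not present:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# drop returns a new DataFrame; df still has all three columns
smaller = df.drop('b', axis='columns')

# del removes the column from df itself
del df['c']

# Deleting a column that no longer exists raises a KeyError
try:
    del df['c']
    raised = False
except KeyError:
    raised = True

print(df.columns.tolist(), smaller.columns.tolist(), raised)
```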
