Selecting columns with methods
Although column selection is usually done with the indexing operator, there are some DataFrame methods that facilitate their selection in an alternative manner. The .select_dtypes
and .filter
methods are two useful methods to do this.
If you want to select by type, you need to be familiar with pandas data types. The Understanding data types recipe in Chapter 1, Pandas Foundations, explains the types.
How to do it...
- Read in the movie dataset. Shorten the column names for display. Use the
.get_dtype_counts
method to output the number of columns with each specific data type:>>> movies = pd.read_csv("data/movie.csv") >>> def shorten(col): ... return ( ... str(col) ... .replace("facebook_likes", "fb") ... .replace("_for_reviews", "") ... ) >>> movies = movies.rename(columns=shorten) >>> movies.dtypes.value_counts() float64 13 int64 3 object 12 dtype: int64
- Use the
.select_dtypes
method to select only the integer columns:>>> movies.select_dtypes(include="int").head() num_voted_users cast_total_fb movie_fb 0 886204 4834 33000 1 471220 48350 0 2 275868 11700 85000 3 1144337 106759 164000 4 8 143 0
- If you would like to select all the numeric columns, you may pass the string
number
to theinclude
parameter:>>> movies.select_dtypes(include="number").head() num_critics duration ... aspect_ratio movie_fb 0 723.0 178.0 ... 1.78 33000 1 302.0 169.0 ... 2.35 0 2 602.0 148.0 ... 2.35 85000 3 813.0 164.0 ... 2.35 164000 4 NaN NaN ... NaN 0
- If we wanted integer and string columns we could do the following:
>>> movies.select_dtypes(include=["int", "object"]).head() color direc/_name ... conte/ating movie_fb 0 Color James Cameron ... PG-13 33000 1 Color Gore Verbinski ... PG-13 0 2 Color Sam Mendes ... PG-13 85000 3 Color Christopher Nolan ... PG-13 164000 4 NaN Doug Walker ... NaN 0
- To exclude only floating-point columns, do the following:
>>> movies.select_dtypes(exclude="float").head() color director_name ... content_rating movie_fb 0 Color James Ca... ... PG-13 33000 1 Color Gore Ver... ... PG-13 0 2 Color Sam Mendes ... PG-13 85000 3 Color Christop... ... PG-13 164000 4 NaN Doug Walker ... NaN 0
- An alternative method to select columns is with the
.filter
method. This method is flexible and searches column names (or index labels) based on which parameter is used. Here, we use thelike
parameter to search for all the Facebook columns or the names that contain the exact string,fb
. Thelike
parameter is checking for substrings in column names:>>> movies.filter(like="fb").head() director_fb actor_3_fb ... actor_2_fb movie_fb 0 0.0 855.0 ... 936.0 33000 1 563.0 1000.0 ... 5000.0 0 2 0.0 161.0 ... 393.0 85000 3 22000.0 23000.0 ... 23000.0 164000 4 131.0 NaN ... 12.0 0
- The
.filter
method has more tricks (or parameters) up its sleeve. If you use theitems
parameters, you can pass in a list of column names:>>> cols = [ ... "actor_1_name", ... "actor_2_name", ... "actor_3_name", ... "director_name", ... ] >>> movies.filter(items=cols).head() actor_1_name ... director_name 0 CCH Pounder ... James Cameron 1 Johnny Depp ... Gore Verbinski 2 Christoph Waltz ... Sam Mendes 3 Tom Hardy ... Christopher Nolan 4 Doug Walker ... Doug Walker
- The
.filter
method allows columns to be searched with regular expressions using theregex
parameter. Here, we search for all columns that have a digit somewhere in their name:>>> movies.filter(regex=r"\d").head() actor_3_fb actor_2_name ... actor_3_name actor_2_fb 0 855.0 Joel Dav... ... Wes Studi 936.0 1 1000.0 Orlando ... ... Jack Dav... 5000.0 2 161.0 Rory Kin... ... Stephani... 393.0 3 23000.0 Christia... ... Joseph G... 23000.0 4 NaN Rob Walker ... NaN 12.0
How it works...
Step 1 lists the frequencies of all the different data types. Alternatively, you may use the .dtypes
attribute to get the exact data type for each column. The .select_dtypes
method accepts either a list or single data type in its include
or exclude
parameters and returns a DataFrame with columns of just those given data types (or not those types if excluding columns). The list values may be either the string name of the data type or the actual Python object.
The .filter
method selects columns by only inspecting the column names and not the actual data values. It has three mutually exclusive parameters: items
, like
, and regex
, only one of which can be used at a time.
The like
parameter takes a string and attempts to find all the column names that contain that exact string somewhere in the name. To gain more flexibility, you may use the regex
parameter instead to select column names through a regular expression. This particular regular expression, r'\d'
, represents all digits from zero to nine and matches any string with at least a single digit in it.
The filter method comes with another parameter, items
, which takes a list of exact column names. This is nearly an exact duplication of the index operation, except that a KeyError
will not be raised if one of the strings does not match a column name. For instance, movies.filter(items=['actor_1_name', 'asdf'])
runs without error and returns a single column DataFrame.
There's more...
One confusing aspect of .select_dtypes
is its flexibility to take both strings and Python objects. The following list should clarify all the possible ways to select the many different column data types. There is no standard or preferred method of referring to data types in pandas, so it's good to be aware of both ways:
np.number
,'number'
– Selects both integers and floats regardless of sizenp.float64
,np.float_
,float
,'float64'
,'float_'
,'float'
– Selects only 64-bit floatsnp.float16
,np.float32
,np.float128
,'float16'
,'float32'
,'float128'
– Respectively selects exactly 16, 32, and 128-bit floatsnp.floating
,'floating'
– Selects all floats regardless of sizenp.int0
,np.int64
,np.int_
,int
,'int0'
,'int64'
,'int_'
,'int'
– Selects only 64-bit integersnp.int8
,np.int16
,np.int32
,'int8'
,'int16'
,'int32'
– Respectively selects exactly 8, 16, and 32-bit integersnp.integer
,'integer'
– Selects all integers regardless of size'Int64'
– Selects nullable integer; no NumPy equivalentnp.object
,'object'
,'O'
– Select all object data typesnp.datetime64
,'datetime64'
,'datetime'
– All datetimes are 64 bitsnp.timedelta64
,'timedelta64'
,'timedelta'
– All timedeltas are 64 bitspd.Categorical
,'category'
– Unique to pandas; no NumPy equivalent
Because all integers and floats default to 64 bits, you may select them by using the string 'int'
or 'float'
as you can see from the preceding bullet list. If you want to select all integers and floats regardless of their specific size, use the string 'number'
.