Pandas Library Notes
This is a summary of the most useful pandas features and is based on some of the techniques discussed in Ted Petrou’s Minimally Sufficient Pandas article.
Selecting a column of data
For this example, imagine that we have a DataFrame df with columns name, age, gender, home address and count. To select the age column as a Series, use df['age'].
The bracket notation will work even in these scenarios:
- spaces are present in the column name, for example, home address: use df['home address']
- the column name matches a DataFrame method name, say count: use df['count'] (see the sketch after this list)
- the column name is stored in a variable, for example, col = 'age': use df[col]
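As a quick illustration of the second scenario, here is a minimal sketch on a small hand-made DataFrame (not the one above) showing why bracket notation is needed when a column name clashes with a DataFrame method.
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob'],
                   'count': [2, 5]})

df['count']  # the 'count' column, as intended
df.count     # the DataFrame.count method, not the column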
Selecting rows and columns using labels (loc) or integer location (iloc)
For the following examples we are going to use a CSV file of 500 made-up US addresses downloaded from https://www.briandunning.com/sample-data/.
In : df = pd.read_csv('us-500.csv')
In : df.columns
Out:
Index(['first_name', 'last_name', 'company_name', 'address', 'city', 'county',
'state', 'zip', 'phone1', 'phone2', 'email', 'web'],
dtype='object')
Selecting using labels
Let us first set the label of each row to be the respective content in the last_name column.
In : df.set_index('last_name', inplace=True)
To select using labels, use the DataFrame loc indexer. For example, to select the row with label Venere, use df.loc['Venere']. This will return a Series, as follows.
In : df.loc['Venere']
Out:
first_name Art
company_name Chemel, James L Cpa
address 8 W Cerritos Ave #54
city Bridgeport
county Gloucester
state NJ
zip 8014
phone1 856-636-8749
phone2 856-264-4130
email art@venere.org
web http://www.chemeljameslcpa.com
Name: Venere, dtype: object
To return a DataFrame instead, use the list syntax: df.loc[['Venere']].
In : df.loc[['Venere']]
Out:
first_name company_name address city \
last_name
Venere Art Chemel, James L Cpa 8 W Cerritos Ave #54 Bridgeport
county state zip phone1 phone2 email \
last_name
Venere Gloucester NJ 8014 856-636-8749 856-264-4130 art@venere.org
web
last_name
Venere http://www.chemeljameslcpa.com
To select the rows with labels Venere and Foller and return only first_name, company_name, city and state, use df.loc[['Venere','Foller'],['first_name','company_name','city','state']].
In : df.loc[['Venere','Foller'],['first_name','company_name','city','state']]
Out:
first_name company_name city state
last_name
Venere Art Chemel, James L Cpa Bridgeport NJ
Foller Donette Printing Dimensions Hamilton OH
We can also use slicing, i.e. the a:b syntax, to retrieve only the rows and columns that fall between the specified labels. Note that, unlike integer slicing, label slicing is inclusive of both endpoints. For example, to retrieve all records from Venere to Foller inclusive, together with only their contact information, use df.loc['Venere':'Foller','phone1':'email'].
In : df.loc['Venere':'Foller','phone1':'email']
Out:
phone1 phone2 email
last_name
Venere 856-636-8749 856-264-4130 art@venere.org
Paprocki 907-385-4412 907-921-2010 lpaprocki@hotmail.com
Foller 513-570-1893 513-549-4561 donette.foller@cox.net
Selecting using conditions
To return all records that match some specified criteria, say all people from Washington state, WA, and return only specific properties, for example, their first name and phone number, use a boolean Series with loc, as follows.
In : df.loc[df['state']=='WA', ['first_name','phone1']]
Out:
first_name phone1
last_name
Bartolet Glen 206-697-5796
Loader Alex 253-660-7821
Hellickson Dottie 206-540-6076
Engelberg Johnna 425-986-7573
Heintzman Samira 206-311-4137
Pagliuca Laurel 509-695-5199
Hoopengardner Ettie 509-755-5393
Biddy Jani 206-711-6498
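Multiple criteria can also be combined into a single boolean Series using the & (and) and | (or) operators; each condition must be wrapped in parentheses. As a sketch, the following would narrow the selection above to a single city ('Seattle' is a hypothetical value that may not appear in this sample data set).
# People from Washington state whose city is Seattle.
df.loc[(df['state'] == 'WA') & (df['city'] == 'Seattle'), ['first_name', 'phone1']]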
Selecting using row/column numbers
The DataFrame iloc indexer should be used to select rows and columns based on their current numeric position within the table. For instance, to select the first row, use df.iloc[[0]]; to select the last row, use df.iloc[[-1]].
In : df.iloc[[0]]
Out:
first_name company_name address city \
last_name
Butt James Benton, John B Jr 6649 N Blue Gum St New Orleans
county state zip phone1 phone2 email \
last_name
Butt Orleans LA 70116 504-621-8927 504-845-1427 jbutt@gmail.com
web
last_name
Butt http://www.bentonjohnbjr.com
In : df.iloc[[-1]]
Out:
first_name company_name address city \
last_name
Motley Chauncey Affiliated With Travelodge 63 E Aurora Dr Orlando
county state zip phone1 phone2 \
last_name
Motley Orange FL 32804 407-413-4842 407-557-8857
email web
last_name
Motley chauncey_motley@aol.com http://www.affiliatedwithtravelodge.com
You can also use the list syntax to choose particular rows and columns. For example, to select the rows at the odd positions 1, 3 and 5 and return the columns at the even positions 2, 4 and 6, we would use df.iloc[[1,3,5],[2,4,6]].
In : df.iloc[[1,3,5],[2,4,6]]
Out:
address county zip
last_name
Darakjy 4 B Blue Ridge Blvd Livingston 48116
Paprocki 639 Main St Anchorage 99501
Morasca 3 Mcauley Dr Ashland 44805
Slicing is also possible with the iloc indexer. For example, retrieving the first five records together with their contact information, found in columns 8 to 10, is done as follows. Keep in mind that indexing is zero-based, so we need to specify 7:10. Also note that the stop value is 10 because, unlike with loc, the last element is excluded, and we need positions up to 9.
In : df.iloc[:5,7:10]
Out:
phone1 phone2 email
last_name
Butt 504-621-8927 504-845-1427 jbutt@gmail.com
Darakjy 810-292-9388 810-374-9840 josephine_darakjy@darakjy.org
Venere 856-636-8749 856-264-4130 art@venere.org
Paprocki 907-385-4412 907-921-2010 lpaprocki@hotmail.com
Foller 513-570-1893 513-549-4561 donette.foller@cox.net
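To see the difference in endpoint handling between loc and iloc in isolation, consider this minimal sketch on a small hand-made Series.
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

s.loc['a':'b']  # label slicing includes both endpoints: returns a and b
s.iloc[0:2]     # integer slicing excludes the stop: also returns a and b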
Loading delimited text files
Whenever you are loading delimited text files, always use the pandas read_csv function. Just take care to specify the delimiter if it is not a comma. For a comma-delimited file, df = pd.read_csv('filename.csv') and df = pd.read_csv('filename.csv', delimiter=',') are equivalent.
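For instance, a tab-delimited file (the filename here is hypothetical) would be loaded as follows; note that delimiter is an alias of the sep parameter.
import pandas as pd

# Hypothetical tab-delimited file.
df = pd.read_csv('filename.tsv', delimiter='\t')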
Checking for NA entries
To determine which entries in a DataFrame are NaN, None or NaT, use the isna DataFrame method.
In the following example we first make a new DataFrame using the first five records of the original 500-entry DataFrame (reloading the CSV first, so that the default integer index is used). Then we set a couple of entries in the zip field to NaT. Finally, we use the loc indexer to conditionally filter records using the isna method.
In : df = pd.read_csv('us-500.csv')
In : mini_df = df.loc[:4,:].copy()
In : mini_df.loc[:,'zip']
Out:
0 70116
1 48116
2 8014
3 99501
4 45011
Name: zip, dtype: int64
In : mini_df.loc[[0,2], 'zip'] = pd.NaT
In : mini_df.loc[:,'zip']
Out:
0 NaT
1 48116
2 NaT
3 99501
4 45011
In : mini_df.loc[mini_df.loc[:,'zip'].isna(), ['first_name', 'last_name', 'zip']]
Out:
first_name last_name zip
0 James Butt NaT
2 Art Venere NaT
If we instead want to view those entries which are not NA, we can either use the notna method or the inversion operator ~.
In : mini_df.loc[~mini_df.loc[:,'zip'].isna(), ['first_name', 'last_name', 'zip']]
Out:
first_name last_name zip
1 Josephine Darakjy 48116
3 Lenna Paprocki 99501
4 Donette Foller 45011
In : mini_df.loc[mini_df.loc[:,'zip'].notna(), ['first_name', 'last_name', 'zip']]
Out:
first_name last_name zip
1 Josephine Darakjy 48116
3 Lenna Paprocki 99501
4 Donette Foller 45011
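Since summing a boolean Series counts True entries as 1, isna also gives a quick way to count missing values, as in this short sketch reusing mini_df.
mini_df.isna().sum()         # number of NA entries per column
mini_df['zip'].isna().sum()  # NA entries in zip only (2 in this example)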
Performing arithmetic or comparison operations
Always use the Python arithmetic and comparison operators, unless you need to control details such as the axis of alignment. The following is an example where the pandas DataFrame arithmetic and comparison methods have to be used instead.
In : product_cost_split = pd.DataFrame(np.array([[0.4, 0.3, 0.3],
...: [0.3, 0.5, 0.2],
...: [0.3, 0.3, 0.4]]),
...: columns=['design','material','production'],
...: index=['product_a','product_b','product_c'])
...:
In : product_cost_split
Out:
design material production
product_a 0.4 0.3 0.3
product_b 0.3 0.5 0.2
product_c 0.3 0.3 0.4
In : product_cost = pd.Series({'product_a': 320, 'product_b':474, 'product_c':281})
In : product_cost
Out:
product_a 320
product_b 474
product_c 281
dtype: int64
If we want to compute the exact cost of design, material and production for each product, we need to multiply each product row by the corresponding total cost value in the Series. Using the Python multiplication operator gives a wrong and rather unexpected result. This happens because the * operator aligns the index of the Series with the columns of the DataFrame; since none of the labels match, every entry becomes NaN.
In : product_cost_split * product_cost
Out:
design material product_a product_b product_c production
product_a NaN NaN NaN NaN NaN NaN
product_b NaN NaN NaN NaN NaN NaN
product_c NaN NaN NaN NaN NaN NaN
To get the correct results, we need to use the pandas DataFrame mul method and specify the axis, so that pandas aligns the index of the Series with the index of the DataFrame.
In : product_cost_split.mul(product_cost, axis='index')
Out:
design material production
product_a 128.0 96.0 96.0
product_b 142.2 237.0 94.8
product_c 84.3 84.3 112.4
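The same alignment rules apply to the comparison methods. For instance, to flag which cost fractions exceed a per-product threshold held in a Series, the Python > operator would misalign in exactly the same way, so we would use the gt method with axis='index'. A sketch, with a hypothetical thresholds Series:
thresholds = pd.Series({'product_a': 0.35, 'product_b': 0.25, 'product_c': 0.30})

# True wherever a cost fraction exceeds its product's threshold.
product_cost_split.gt(thresholds, axis='index')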
Computing sum, min, max, and abs
The pandas DataFrame methods sum, min, max and abs should always be used to compute these values.
Although Python has built-in functions of the same name, they are much slower than the DataFrame methods. The reason for this is that the pandas methods execute compiled C code, while the Python built-ins loop over the elements one at a time.
The following example compares the performance of min and sum on a 100,000-element Series. The pandas methods are much faster: min is over 7 times faster than the Python built-in, while sum is about 5 times faster.
In : s = pd.Series(np.random.rand(10**5))
In : %timeit s.min()
866 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In : %timeit min(s)
6.74 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In : %timeit s.sum()
937 µs ± 55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In : %timeit sum(s)
4.71 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Aggregating using groupby
To summarise a DataFrame, for example, to count how many items of each type are present in a data set, or to compute the minimum, maximum and median price of products in a data set, we can use the groupby and agg functions as follows.
df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})
If you need to aggregate multiple columns and compute various statistics for each, extend the dictionary passed to agg as follows.
df.groupby('grouping column').agg({'aggregating column 1': ['aggregating function 1', 'aggregating function 2'],
'aggregating column 2': ['aggregating function 1', 'aggregating function 2'],
'aggregating column 3': 'aggregating function'})
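As a concrete illustration of the multi-column form, here is a minimal sketch on a small hand-made product data set (not the addresses file used below).
import pandas as pd

products = pd.DataFrame({'type': ['chair', 'chair', 'table', 'table', 'table'],
                         'price': [40, 55, 120, 90, 150],
                         'in_stock': [10, 3, 7, 2, 5]})

# Min, max and median price per product type, plus the total stock.
products.groupby('type').agg({'price': ['min', 'max', 'median'],
                              'in_stock': 'sum'})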
For the following example we again use the CSV file of 500 made-up US addresses from https://www.briandunning.com/sample-data/. Let us say we want to determine the top five states by number of people present in the data set. To do this we group by state and use the count aggregating function, sorting in descending order and presenting only the top five entries.
In : ppl = pd.read_csv('us-500.csv')
In : ppl.groupby('state').agg({'state':'count'}).sort_values('state', ascending=False).head(5)
Out:
state
state
CA 72
NJ 52
NY 46
TX 32
PA 29
Working with a MultiIndex
Certain groupby and aggregation operations will result in what is called a MultiIndex. For example, using the 500 made-up US addresses data set, we might want to group by state and city so as to count the number of individuals residing in each city. This can be done as follows.
In : ppl_city_counts = ppl.groupby(['state','city']).agg({'city':'count'})
In : ppl_city_counts.head(10)
Out:
city
state city
AK Anchorage 4
Fairbanks 2
AR Little Rock 1
AZ Mesa 1
Peoria 1
Phoenix 5
Scottsdale 2
CA Anaheim 1
Bellflower 1
Berkeley 1
As you can see, the index is now a MultiIndex composed of state and city. While this is easy to read, it requires different syntax for subset selections. For instance, to view the number of residents in the cities of the AK and AZ states only, we can use pd.IndexSlice, as follows.
In : ppl_city_counts.loc[pd.IndexSlice[['AK','AZ'], :], :]
Out:
city
state city
AK Anchorage 4
Fairbanks 2
AZ Mesa 1
Peoria 1
Phoenix 5
Scottsdale 2
Another example could be selecting all Philadelphia residents, as follows.
In : ppl_city_counts.loc[pd.IndexSlice[:, ['Philadelphia']], :]
Out:
city
state city
PA Philadelphia 8
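If you only need one label from a single level, the xs method (a standard pandas method, though not used in the original article) is a more compact alternative.
# All rows whose city level equals Philadelphia; xs drops that level.
ppl_city_counts.xs('Philadelphia', level='city')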
If you find this cumbersome to work with, you can always flatten the index and rename the columns to get a normal table layout with a single-level index.
In : ppl_city_counts.columns = ['residents count']
In : ppl_city_counts.reset_index(inplace=True)
In : ppl_city_counts.head(5)
Out:
state city residents count
0 AK Anchorage 4
1 AK Fairbanks 2
2 AR Little Rock 1
3 AZ Mesa 1
4 AZ Peoria 1
Then, filtering on the single-level index DataFrame to get the same results as in the previous two filtering examples would be done like so.
In : ppl_city_counts.loc[ppl_city_counts.state.isin(['AK','AZ'])]
Out:
state city residents count
0 AK Anchorage 4
1 AK Fairbanks 2
3 AZ Mesa 1
4 AZ Peoria 1
5 AZ Phoenix 5
6 AZ Scottsdale 2
In : ppl_city_counts.loc[ppl_city_counts.city.isin(['Philadelphia'])]
Out:
state city residents count
294 PA Philadelphia 8
Pivot tables and cross tabulations (contingency tables)
While there is some equivalence between groupby, pivot_table and pandas.crosstab for certain queries, it is best to stick to groupby, except for the following two scenarios (see the sketch after this list).
- If you want to generate a table to compare values across columns, use pivot_table, because its output is easier to compare. On the other hand, if you want to further process the data, use groupby.
- If you want to show the frequency between two variables, say gender counts in a university across races, use the pandas.crosstab function. Furthermore, you can easily output relative frequencies by setting the parameter normalize='columns'.
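As a sketch of the two scenarios, using small hand-made data sets (the column names and values are hypothetical):
import pandas as pd

# Scenario 1: compare values across columns with pivot_table.
sales = pd.DataFrame({'region': ['N', 'N', 'S', 'S'],
                      'product': ['x', 'y', 'x', 'y'],
                      'amount': [10, 20, 30, 40]})
sales.pivot_table(index='region', columns='product',
                  values='amount', aggfunc='sum')

# Scenario 2: frequency between two variables with crosstab.
students = pd.DataFrame({'gender': ['F', 'M', 'F', 'F', 'M', 'M'],
                         'race': ['A', 'A', 'B', 'B', 'B', 'A']})
pd.crosstab(students['gender'], students['race'])                        # counts
pd.crosstab(students['gender'], students['race'], normalize='columns')  # relative frequencies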