Introduction to Pandas#
The pandas package in Python provides a large collection of functions for importing, merging and wrangling data of many different types. This can range from creating data objects from a text file on your computer to scraping text from websites and saving it in particular data structures.
In this section of the course, we will learn about some basic functionality of the pandas package for importing and wrangling data in a reproducible and ‘tidy’ way. Then we will pick up further on this at the end of the course to learn about more complex ways of handling and working with external data sources.
There are many resources available to help you, with full documentation at the pandas repository. If you want to learn more advanced ways of using pandas, this is a great place to start.
Importing Pandas#
As with the numpy package, it is customary to import the pandas package using a shortened name, pd.
import pandas as pd
In its simplest form, the pandas package contains functions that can create and manipulate data objects that are enhanced relative to NumPy arrays. Specifically, there are three main types of data structure in pandas that we need to familiarise ourselves with: Series, DataFrames and Indexes. We may have external files that we want to read into Python in order to study or analyse later, or we may create our own objects from scratch.
Pandas Series Object#
Let us begin by looking at the Series object, a way of storing vectors of data. We supply a vector of values that we wish to store in our object as follows and observe the contents of the created object.
data = pd.Series([0.1,0.2,0.3,0.4])
data
0 0.1
1 0.2
2 0.3
3 0.4
dtype: float64
As in numpy, data in pandas objects can be accessed using the square bracket notation.
data[2:4]
2 0.3
3 0.4
dtype: float64
You can see that this object has both a vector of values (which can be accessed using data.values) and an index (accessed using data.index).
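For example, we can look at these two components of the series we created above:
print(data.values)   # the underlying NumPy array of values
print(data.index)    # the index, here a RangeIndex running from 0 to 3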
Although it may seem very similar to the numpy array, the pandas Series object is much more flexible. In particular, we may want to use non-integers for the index, such as strings. In that case we can construct the Series from a dictionary, associating individual index labels with specific values. For example, let us consider some made-up fruit sales at a shop in St Andrews.
fruits = {'Apples': 1537,
          'Bananas': 2541,
          'Kiwis': 527,
          'Peaches': 894}
fruit_sales = pd.Series(fruits)
fruit_sales
Apples 1537
Bananas 2541
Kiwis 527
Peaches 894
dtype: int64
Once we have created our data series out of the original dictionary, we can access the value we want by specifying the corresponding key in our dictionary.
fruit_sales['Peaches']
894
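Because the index is now made up of strings, we can also slice this series by label. A small example: unlike positional slicing, label-based slicing includes the end point, so the following returns Bananas, Kiwis and Peaches.
fruit_sales['Bananas':'Peaches']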
We could also create the data series using the following code, specifying both the data vector and its corresponding keynames:
fruit_sales = pd.Series([1537,2541,527,894],index=['Apples','Bananas','Kiwis','Peaches'])
fruit_sales
Apples 1537
Bananas 2541
Kiwis 527
Peaches 894
dtype: int64
Pandas DataFrame Object#
A central concept in statistical data analysis is the dataframe object. We often represent data as observations, where for each observation (stored as a row) multiple variables (stored as columns) were recorded. The idea behind a dataframe is to create an object that is suited to managing this kind of data. In the pandas package, the DataFrame function creates an object of type dataframe. We can create a dataframe from the fruit sales Series object generated above.
dat=pd.DataFrame(fruit_sales)
print(dat.info())
print(dat.head())
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Apples to Peaches
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 4 non-null int64
dtypes: int64(1)
memory usage: 236.0+ bytes
None
0
Apples 1537
Bananas 2541
Kiwis 527
Peaches 894
Once you have imported the data you want, you can study the resulting object to determine its properties. There is both the info function, which provides a variety of useful details on the structure of the object, and the head function, which prints out the first five rows of the dataframe.
Creating DataFrames From Scratch#
We can also create a dataframe from scratch and input our own data. For example, let us assume we have measured people's heights as well as noting their ages, using None where the data are missing.
df = pd.DataFrame({'Height': [1.7, 1.8, 1.5, 1.6, 1.9],
                   'Age': [37, 28, None, 19, 21]})
df.info()
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Height 5 non-null float64
1 Age 4 non-null float64
dtypes: float64(2)
memory usage: 208.0 bytes
|   | Height | Age  |
|---|--------|------|
| 0 | 1.7    | 37.0 |
| 1 | 1.8    | 28.0 |
| 2 | 1.5    | NaN  |
| 3 | 1.6    | 19.0 |
| 4 | 1.9    | 21.0 |
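Since missing values appear as NaN in the resulting dataframe, a quick way of checking how much is missing is to count the NaNs in each column; a small sketch using the df just created:
df.isna().sum()   # counts of missing values per column (Height: 0, Age: 1)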
Importing Data From External Files#
We have now seen how to create DataFrame objects from data we are inputting, but what if we have data in a spreadsheet or other file that we want to read into Python? pandas has a range of read_* functions to read in data of different formats. We will focus here on the read_csv function in particular, but if you have .xls, .html, .txt, etc. files, you may need to use one of the other functions.
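For illustration, reading some of these other formats might look like the sketch below. The file names here are hypothetical, and read_excel may need an extra package (such as openpyxl) installed to read modern Excel files.
sales = pd.read_excel('dat/sales.xlsx', sheet_name=0)      # first sheet of an Excel workbook
notes = pd.read_table('dat/notes.txt', sep='\t')           # tab-delimited text file
tables = pd.read_html('https://example.com/table.html')    # list of DataFrames scraped from an HTML page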
We will start with some data downloaded from gapminder relating to GDP per capita for various countries across a range of years. We need to specify the path to the file (here assumed to be in the current working directory) and use the read function to import it. Once we have done that, we can study the first few rows of the dataframe.
# Data from https://www.gapminder.org/data/
income=pd.read_csv('dat/indicator_gapminder_gdp_per_capita_ppp-Data.csv')
income.head()
| | GDP per capita | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abkhazia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Afghanistan | 472.053500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 785.127571 | 804.717458 | 874.0 | 887.914578 | 983.652314 | 984.805841 | 1154.859365 | 1214.613653 | 1261.354184 | 1349.696941 |
2 | Akrotiri and Dhekelia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Albania | 601.215222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4855.210024 | 5115.252837 | 5369.0 | 5652.049321 | 5958.021197 | 6365.530359 | 6550.896164 | 6746.445312 | 6914.267317 | 6969.306283 |
4 | Algeria | 766.253664 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5576.851564 | 5790.967692 | 6011.0 | 6022.270940 | 6133.782763 | 6162.719840 | 6173.729741 | 6300.648214 | 6354.640523 | 6419.127829 |
5 rows × 214 columns
These data report GDP across years (columns) for different countries (rows). We could import these as a series with the country as the index and the GDP as the values, but in this case we will assume the column of countries is of wider interest as a variable in its own right. If you look at the data, you may notice that there is a large number of missing values, represented by NaN. If we explore the dtypes of the dataframe, we see that the GDPs are classified as floats, whilst country is an object (string).
income.dtypes
GDP per capita object
1800 float64
1801 float64
1802 float64
1803 float64
...
2008 float64
2009 float64
2010 float64
2011 float64
2012 float64
Length: 214, dtype: object
Before we go any further, notice that the first column is labelled incorrectly as GDP per capita (rather than country), so let us rename the first column using the .rename() function.
income=income.rename(columns={"GDP per capita": "country"})
income.head()
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abkhazia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Afghanistan | 472.053500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 785.127571 | 804.717458 | 874.0 | 887.914578 | 983.652314 | 984.805841 | 1154.859365 | 1214.613653 | 1261.354184 | 1349.696941 |
2 | Akrotiri and Dhekelia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Albania | 601.215222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4855.210024 | 5115.252837 | 5369.0 | 5652.049321 | 5958.021197 | 6365.530359 | 6550.896164 | 6746.445312 | 6914.267317 | 6969.306283 |
4 | Algeria | 766.253664 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5576.851564 | 5790.967692 | 6011.0 | 6022.270940 | 6133.782763 | 6162.719840 | 6173.729741 | 6300.648214 | 6354.640523 | 6419.127829 |
5 rows × 214 columns
Data wrangling#
Data wrangling is the process by which we modify the data into a structure or format relevant to how we wish to use it. For example, we may want to extract certain entries or variables from the data, reshape the data into a more appropriate format, or check for and deal with missing values. We can slice specific column(s) based on their names; rows and/or columns by their index names using loc; rows and/or columns by their array positions using iloc; or rows based on particular Boolean operations. Some care is needed when using loc or iloc: if the slice is saved into a new object and the original full dataframe is then edited, the change can sometimes propagate to the sliced data automatically. This can be avoided by using .copy(). Below, if changes are made to income, they will not automatically change the sliced dataframes unless the slicing is rerun.
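As a minimal sketch of why .copy() matters (assuming, as above, that income has a column '1800' and an integer row index):
subset = income.loc[0:4, ['country', '1800']].copy()   # an explicit copy, decoupled from income
subset['1800'] = subset['1800'].fillna(0)               # edits the copy only
print(income['1800'].head())                            # the original NaN values are unchanged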
income['1800'].head()
0 NaN
1 472.053500
2 NaN
3 601.215222
4 766.253664
Name: 1800, dtype: float64
income[['1800','1805']].head()
| | 1800 | 1805 |
|---|---|---|
0 | NaN | NaN |
1 | 472.053500 | NaN |
2 | NaN | NaN |
3 | 601.215222 | NaN |
4 | 766.253664 | NaN |
income.loc[11:20, '1800':'1805'].copy()
| | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 |
|---|---|---|---|---|---|---|
11 | 378.403254 | NaN | NaN | NaN | NaN | NaN |
12 | 575.855054 | NaN | NaN | NaN | NaN | NaN |
13 | 671.478209 | NaN | NaN | NaN | NaN | NaN |
14 | 1434.507479 | 1440.072847 | 1445.659806 | 1451.268441 | 1456.898835 | 1462.551074 |
15 | 477.152815 | NaN | NaN | NaN | NaN | NaN |
16 | 1409.188697 | NaN | NaN | NaN | NaN | NaN |
17 | 808.991793 | NaN | NaN | NaN | NaN | NaN |
18 | 608.794297 | 608.635943 | 608.477630 | 608.319358 | 608.161127 | 608.002938 |
19 | 1017.813988 | NaN | NaN | NaN | NaN | NaN |
20 | 510.549704 | NaN | NaN | NaN | NaN | NaN |
income.iloc[11:21,:5].copy()
| | country | 1800 | 1801 | 1802 | 1803 |
|---|---|---|---|---|---|
11 | Armenia | 378.403254 | NaN | NaN | NaN |
12 | Aruba | 575.855054 | NaN | NaN | NaN |
13 | Australia | 671.478209 | NaN | NaN | NaN |
14 | Austria | 1434.507479 | 1440.072847 | 1445.659806 | 1451.268441 |
15 | Azerbaijan | 477.152815 | NaN | NaN | NaN |
16 | Bahamas | 1409.188697 | NaN | NaN | NaN |
17 | Bahrain | 808.991793 | NaN | NaN | NaN |
18 | Bangladesh | 608.794297 | 608.635943 | 608.477630 | 608.319358 |
19 | Barbados | 1017.813988 | NaN | NaN | NaN |
20 | Belarus | 510.549704 | NaN | NaN | NaN |
income[income['country']=='Australia']
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13 | Australia | 671.478209 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 31634.242428 | 32098.50615 | 32798.0 | 33204.155734 | 34196.812421 | 34444.333064 | 34389.09927 | 34884.763757 | 35253.938423 | 36064.737277 |
1 rows × 214 columns
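Boolean conditions can also be combined. As a sketch, the following selects the countries whose recorded GDP in 1800 exceeded 1000, combining two element-wise conditions with & (each condition must be wrapped in parentheses):
rich_1800 = income[(income['1800'].notna()) & (income['1800'] > 1000)]
rich_1800[['country', '1800']].head()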
Reshaping - long ↔ wide format#
Wide to long format#
It may not be convenient to have all the years as additional columns (often called 'wide' format). Instead, we would perhaps rather have a single column of GDP values, but with the country replicated for each year ('long' format). To do this in Python we use the melt function, specifying:
- id_vars: the column name(s) to maintain/replicate in each row
- value_vars: the column(s) that will be collapsed down to a single column
- var_name: the column name of the new variable constructed from the previous column names
- value_name: the column name for the values of the collapsed data
yearlist=list(income.columns)[1:]
income_long=income.melt(id_vars='country',value_vars=yearlist,var_name='Year',value_name='GDP')
income_long.head()
| | country | Year | GDP |
|---|---|---|---|
0 | Abkhazia | 1800 | NaN |
1 | Afghanistan | 1800 | 472.053500 |
2 | Akrotiri and Dhekelia | 1800 | NaN |
3 | Albania | 1800 | 601.215222 |
4 | Algeria | 1800 | 766.253664 |
Long to wide format#
To go from long to wide format, i.e. the reverse of this, we can use the pivot() function, specifying:
- columns: the column to use to make the new dataframe's columns
- values: the column(s) to use for populating the new dataframe's values
- index: the column to use to make the new dataframe's index
income_wide=pd.pivot(income_long, index='country', columns='Year', values='GDP')
income_wide.head()
Year | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | ... | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | |||||||||||||||||||||
Abkhazia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Afghanistan | 472.053500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 785.127571 | 804.717458 | 874.0 | 887.914578 | 983.652314 | 984.805841 | 1154.859365 | 1214.613653 | 1261.354184 | 1349.696941 |
Akrotiri and Dhekelia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Albania | 601.215222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4855.210024 | 5115.252837 | 5369.0 | 5652.049321 | 5958.021197 | 6365.530359 | 6550.896164 | 6746.445312 | 6914.267317 | 6969.306283 |
Algeria | 766.253664 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5576.851564 | 5790.967692 | 6011.0 | 6022.270940 | 6133.782763 | 6162.719840 | 6173.729741 | 6300.648214 | 6354.640523 | 6419.127829 |
5 rows × 213 columns
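Note that pivot() has moved country into the index of income_wide. If we would rather have country as an ordinary column again, one option (a small sketch) is to use reset_index():
income_wide_reset = income_wide.reset_index()   # 'country' becomes a regular column again
income_wide_reset.head()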
Combining Datasets: concat, merge and join#
A common task when working with data is the need to combine multiple data structures together. This could be when we have new observations of existing variables (extra rows), or new variables measured on existing objects (extra columns), for example. The pandas package allows data to be combined in these ways. Understanding the properties of the data we are combining is very important to ensure this is done correctly.

In pandas, there are multiple functions for combining dataframes, each with its own uses and syntax. We will study examples of each below, starting by importing a similar dataset on total populations for each country and converting it to long format.
popn=pd.read_csv('dat/population_total.csv')
yearlist=list(popn.columns)[1:]
popn_long=popn.melt(id_vars='country',value_vars=yearlist,var_name='Year',value_name='Popn')
popn_long.head()
| | country | Year | Popn |
|---|---|---|---|
0 | Afghanistan | 1799 | 3.28M |
1 | Angola | 1799 | 1.57M |
2 | Albania | 1799 | 400k |
3 | Andorra | 1799 | 2650 |
4 | United Arab Emirates | 1799 | 40.2k |
The population data start in the same year (1800) but go as far as 2100, rather than 2012. This can cause problems when merging the data if we are not careful.
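Before combining the two dataframes, it can be worth checking exactly which year columns they share; a small sketch using Python sets:
extra_years = set(popn.columns) - set(income.columns)   # columns only present in the population data
print(len(extra_years))
print(sorted(extra_years)[:5])   # a few examples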
The first function we will use to combine the data is the concat function. Here we can concatenate the two dataframes, that is, stack them on top of each other with corresponding columns aligned. As we have noticed, there are some columns in the population dataframe that do not appear in the GDP data, so we could stack all of the GDP data with only the corresponding columns from the population dataframe (this is fiddly if they do not follow the same order). An easier way is to use the join argument: an inner join means only including columns that appear in both dataframes.
pd.concat([income, popn.loc[:, :"2012"]])
# or
pd.concat([income, popn], join="inner")
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abkhazia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Afghanistan | 472.0535 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 785.127571 | 804.717458 | 874.0 | 887.914578 | 983.652314 | 984.805841 | 1154.859365 | 1214.613653 | 1261.354184 | 1349.696941 |
2 | Akrotiri and Dhekelia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Albania | 601.215222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4855.210024 | 5115.252837 | 5369.0 | 5652.049321 | 5958.021197 | 6365.530359 | 6550.896164 | 6746.445312 | 6914.267317 | 6969.306283 |
4 | Algeria | 766.253664 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5576.851564 | 5790.967692 | 6011.0 | 6022.27094 | 6133.782763 | 6162.71984 | 6173.729741 | 6300.648214 | 6354.640523 | 6419.127829 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
192 | Samoa | 47.3k | 47.3k | 47.3k | 47.3k | 47.3k | 47.3k | 47.2k | 47.2k | 47.2k | ... | 177k | 179k | 180k | 181k | 182k | 183k | 185k | 186k | 187k | 189k |
193 | Yemen | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | ... | 19M | 19.5M | 20.1M | 20.7M | 21.3M | 21.9M | 22.5M | 23.2M | 23.8M | 24.5M |
194 | South Africa | 1.45M | 1.46M | 1.46M | 1.47M | 1.47M | 1.48M | 1.49M | 1.49M | 1.5M | ... | 46.7M | 47.3M | 47.9M | 48.5M | 49.1M | 49.8M | 50.5M | 51.2M | 52M | 52.8M |
195 | Zambia | 758k | 770k | 782k | 794k | 806k | 818k | 831k | 843k | 856k | ... | 11.3M | 11.6M | 11.9M | 12.2M | 12.5M | 12.8M | 13.2M | 13.6M | 14M | 14.5M |
196 | Zimbabwe | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | ... | 12M | 12M | 12.1M | 12.2M | 12.3M | 12.4M | 12.5M | 12.7M | 12.9M | 13.1M |
457 rows × 213 columns
If we did not select the matching columns, the default is to use an 'outer' join, which stacks the dataframes on top of each other but fills the entries with NaN wherever a column does not exist in one of them. In the case of our dataframes, the GDP does not exist from 2013 to 2100, so all of these elements will be NaNs (top right of the new dataset below).
pd.concat([income, popn], join="outer")
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abkhazia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Afghanistan | 472.0535 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Akrotiri and Dhekelia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Albania | 601.215222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Algeria | 766.253664 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
192 | Samoa | 47.3k | 47.3k | 47.3k | 47.3k | 47.3k | 47.3k | 47.2k | 47.2k | 47.2k | ... | 315k | 314k | 314k | 314k | 313k | 313k | 312k | 312k | 311k | 310k |
193 | Yemen | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | 2.59M | ... | 54.5M | 54.4M | 54.3M | 54.1M | 54M | 53.8M | 53.7M | 53.5M | 53.4M | 53.2M |
194 | South Africa | 1.45M | 1.46M | 1.46M | 1.47M | 1.47M | 1.48M | 1.49M | 1.49M | 1.5M | ... | 79.8M | 79.8M | 79.7M | 79.7M | 79.6M | 79.5M | 79.5M | 79.4M | 79.3M | 79.2M |
195 | Zambia | 758k | 770k | 782k | 794k | 806k | 818k | 831k | 843k | 856k | ... | 74.4M | 75.2M | 76M | 76.8M | 77.6M | 78.4M | 79.2M | 80M | 80.8M | 81.5M |
196 | Zimbabwe | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | 1.09M | ... | 30.8M | 30.9M | 30.9M | 30.9M | 30.9M | 30.9M | 31M | 31M | 31M | 31M |
457 rows × 303 columns
Another way of combining datasets is to merge the two together where they share some common columns. For example, these data have the same countries and years. Firstly, let us merge just the wide-format dataframes on country. This will match the countries in each of the dataframes and produce a single dataframe with the remaining columns joined together. Where there are other columns in the dataframes with the same name, these will be suffixed with _x, _y, etc. by default, or by the suffixes provided. Any rows from either dataset that do not match the other dataset are removed.
datwide_merge=pd.merge(income,popn,on='country',suffixes=('_GDP', '_popn'))
datwide_merge.head()
| | country | 1800_GDP | 1801_GDP | 1802_GDP | 1803_GDP | 1804_GDP | 1805_GDP | 1806_GDP | 1807_GDP | 1808_GDP | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 472.053500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 76.6M | 76.4M | 76.3M | 76.1M | 76M | 75.8M | 75.6M | 75.4M | 75.2M | 74.9M |
1 | Albania | 601.215222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.33M | 1.3M | 1.27M | 1.25M | 1.22M | 1.19M | 1.17M | 1.14M | 1.11M | 1.09M |
2 | Algeria | 766.253664 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 70.4M | 70.5M | 70.5M | 70.6M | 70.7M | 70.7M | 70.7M | 70.7M | 70.7M | 70.7M |
3 | Andorra | 1260.123256 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 63k | 62.9k | 62.9k | 62.8k | 62.7k | 62.7k | 62.6k | 62.5k | 62.5k | 62.4k |
4 | Angola | 359.932582 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 168M | 170M | 172M | 175M | 177M | 179M | 182M | 184M | 186M | 188M |
5 rows × 515 columns
See here that the years up to 2012 are repeated and have suffixes, whereas subsequent years only appear in the population dataframe, and so these do not have any suffix.
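This removal of non-matching rows happens because merge() performs an 'inner' join by default. If we wanted to keep countries that appear in only one of the dataframes, we could pass the how argument instead (a sketch; how can also be 'left' or 'right'):
datwide_outer = pd.merge(income, popn, on='country', how='outer',
                         suffixes=('_GDP', '_popn'))   # keeps non-matching countries, filled with NaN
print(datwide_outer.shape)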
With the long-format data, we can merge on both country and year. In this case there are no further replicated columns, so no suffixes are required (though they can be given if desired). The resulting dataframe will have a single country column and a single Year column, with additional columns for GDP and population bound onto them.
dat_merge=pd.merge(income_long,popn_long,on=['country','Year'])
dat_merge.head()
| | country | Year | GDP | Popn |
|---|---|---|---|---|
0 | Afghanistan | 1800 | 472.053500 | 3.28M |
1 | Albania | 1800 | 601.215222 | 402k |
2 | Algeria | 1800 | 766.253664 | 2.51M |
3 | Andorra | 1800 | 1260.123256 | 2650 |
4 | Angola | 1800 | 359.932582 | 1.57M |
Another situation is when the two dataframes contain the same information but the columns have different names in each dataframe. In that case, merging as above will not work; instead, we can specify the names of the columns to be matched in each dataframe using left_on and right_on.
dat_merge=pd.merge(income_long,popn_long,left_on=['country','Year'],right_on=['country','Year'])
dat_merge.head()
| | country | Year | GDP | Popn |
|---|---|---|---|---|
0 | Afghanistan | 1800 | 472.053500 | 3.28M |
1 | Albania | 1800 | 601.215222 | 402k |
2 | Algeria | 1800 | 766.253664 | 2.51M |
3 | Andorra | 1800 | 1260.123256 | 2650 |
4 | Angola | 1800 | 359.932582 | 1.57M |
Finally, there is the join function, used in a similar way to numpy's append, but joining on the index. It produces a very similar dataframe to the one above, although join keeps all rows of the left dataframe by default, so countries with no population data appear with NaN values.
income_long.join(popn_long.set_index(['country','Year']),on=['country','Year'])
| | country | Year | GDP | Popn |
|---|---|---|---|---|
0 | Abkhazia | 1800 | NaN | NaN |
1 | Afghanistan | 1800 | 472.053500 | 3.28M |
2 | Akrotiri and Dhekelia | 1800 | NaN | NaN |
3 | Albania | 1800 | 601.215222 | 402k |
4 | Algeria | 1800 | 766.253664 | 2.51M |
... | ... | ... | ... | ... |
55375 | Yemen, Rep. | 2012 | 2043.787776 | NaN |
55376 | Yugoslavia | 2012 | NaN | NaN |
55377 | Zambia | 2012 | 1550.923859 | 14.5M |
55378 | Zimbabwe | 2012 | 545.344601 | 13.1M |
55379 | Åland | 2012 | NaN | NaN |
55380 rows × 4 columns
Data Aggregation and Summarisation#
Now that we have the data imported, reshaped and combined, and know how to index it, our final task is to calculate some useful summaries of the data. You will already have learnt how to calculate useful summaries of numpy arrays using np.sum(), for example. The pandas package has similar functions, which are called as methods directly on the dataframe itself:
print(income.sum(numeric_only=True).head())
print(income.mean(numeric_only=True).head())
1800 164489.577467
1801 36543.812847
1802 36718.280093
1803 36751.917126
1804 36770.763981
dtype: float64
1800 751.093961
1801 1178.832672
1802 1184.460648
1803 1185.545714
1804 1186.153677
dtype: float64
Note that as the first column, country, is a string variable, calculating its sum or mean does not make sense. Using the numeric_only=True argument calculates summaries only on columns that are numeric. By default, these summary functions aggregate over the rows, that is, a single summary is provided for each column. We can specify an alternative axis to compute summaries over using the axis argument. For example, if we want to calculate the average for each country across all years, we would use:
print(income.mean(axis='columns',numeric_only=True))
0 NaN
1 817.075676
2 NaN
3 3212.062887
4 4375.433233
...
255 1566.370726
256 NaN
257 1320.193141
258 620.935603
259 NaN
Length: 260, dtype: float64
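The row numbers in this output are not very informative on their own. One option (a small sketch) is to set country as the index first, so that the row means are labelled by country name:
income.set_index('country').mean(axis='columns').head()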
There is also a really helpful function, describe(), that provides many useful summaries of a dataframe in a single output. This function does not require us to specify numeric_only; it selects the numeric variables automatically.
income.describe()
| | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | ... | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 219.000000 | 31.000000 | 31.000000 | 31.000000 | 31.000000 | 31.000000 | 31.000000 | 31.000000 | 31.000000 | 31.000000 | ... | 217.000000 | 217.000000 | 230.000000 | 216.000000 | 213.000000 | 210.000000 | 210.000000 | 210.000000 | 210.000000 | 191.000000 |
mean | 751.093961 | 1178.832672 | 1184.460648 | 1185.545714 | 1186.153677 | 1189.816980 | 1193.944536 | 1193.470663 | 1182.443690 | 1184.676918 | ... | 12117.045916 | 12613.662437 | 13470.188318 | 13792.388794 | 14218.956165 | 14246.914126 | 13613.980391 | 13948.423690 | 14216.294100 | 12482.299676 |
std | 363.341906 | 522.892265 | 526.138489 | 525.795645 | 527.457253 | 529.937857 | 533.391361 | 534.956049 | 512.663576 | 517.505712 | ... | 13887.534772 | 14427.012694 | 15269.288053 | 15705.158194 | 16152.693584 | 15963.714599 | 15018.479845 | 15565.955318 | 15981.327997 | 14351.997403 |
min | 340.000000 | 562.738123 | 562.591749 | 562.445412 | 562.299114 | 562.152854 | 562.006632 | 561.860447 | 561.714301 | 561.568193 | ... | 299.490250 | 314.518974 | 330.000000 | 338.270005 | 348.969090 | 359.658528 | 359.080223 | 373.632392 | 387.705891 | 403.164594 |
25% | 501.320914 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | ... | 2152.689867 | 2250.897567 | 2438.507760 | 2415.720812 | 2524.534357 | 2584.742113 | 2620.289388 | 2764.223704 | 2988.593507 | 2437.784461 |
50% | 667.709189 | 1056.469106 | 1057.879973 | 1059.292724 | 1060.707362 | 1062.123889 | 1063.542308 | 1064.962621 | 1066.384831 | 1067.808940 | ... | 6500.126735 | 6877.986057 | 7183.000000 | 7335.729723 | 7508.367883 | 7826.267141 | 7671.274511 | 8069.778733 | 8014.654838 | 6975.140224 |
75% | 882.536501 | 1417.076555 | 1422.761772 | 1428.469822 | 1456.569099 | 1438.853749 | 1457.920399 | 1447.689538 | 1474.611622 | 1479.159408 | ... | 19067.164078 | 19287.908016 | 20320.750000 | 21188.972251 | 21990.961428 | 22473.689659 | 20910.631598 | 20854.775489 | 20942.498188 | 16649.509264 |
max | 2716.870226 | 2736.975065 | 2740.533133 | 2744.095826 | 2747.663151 | 2751.235113 | 2754.811718 | 2758.392974 | 2761.978884 | 2765.569457 | ... | 77249.852411 | 78810.335959 | 81825.080975 | 88389.525727 | 95877.649215 | 92607.243436 | 82377.950189 | 89000.290551 | 90039.049626 | 91492.645269 |
8 rows × 213 columns
Finally, if we only want the summary for a particular year (column), we can slice the dataframe first before calculating the summary.
income['1800'].mean()
751.093961034262
Summary#
This is a brief introduction to some of the capabilities of the pandas package and the general tasks we can use it for. There are many other capabilities that we do not discuss here, some of which will be covered in the final two weeks of the course. More advanced concepts are left for you to explore!