Introduction to Pandas#

The pandas package in Python provides a large collection of functions for importing, merging and wrangling data of many different types. This can be as simple as creating data objects from a text file on your computer to scraping text from websites and saving to particular data structures.

In this section of the course, we will learn about some basic functionality of the pandas package for importing and wrangling data in a reproducible and ‘tidy’ way. Then we will pick up further on this at the end of the course to learn about more complex ways of handling and working with external data sources.

There are many resources available to help you, with full documentation at the pandas repository. If you want to learn more advanced ways of using pandas, this is a great place to start.

Importing Pandas#

As with the numpy package, it is customary to import the pandas package using a shortened name, pd.

import pandas as pd

In the simplest form, the pandas package contains functions that can create and manipulate enhanced data objects relative to NumPy. Specifically, there are three main types of data sctructure in pandas that we need to familiarise ourself with. Series, DataFrames and Indexes. We may have an external files that we want to read into Python in order to later study or analyse or create our own objects from scratch.

Pandas Series Object#

Let us begin by looking at the Series object, a way of storing vectors of data. We supply a vector of values that we wish to store in our object as follows and observe the contents of the created object.

data = pd.Series([0.1,0.2,0.3,0.4])
data

  0.1
  0.2
  0.3
  0.4
dtype: float64

As in numpy, data in pandas objects can be accessed using the square pbracket notation

data[2:4]

2    0.3
3    0.4
dtype: float64

You can see that this object has both a vector of values, (which can be accessed using data.values), but also an index (accessed using data.index). Although it may seem very similar to the numpy array, the pandas Series object is much more flexible than the former. In particular, we may want to use non-integers for the index, such as strings. In that case we can construct data dictionaries, allowing the association of the individual indices to specific values. For example, let us consider some made up fruit sales at a shop in St Andrews.

fruits={'Apples':1537,
        'Bananas': 2541,
        'Kiwis': 527,
        'Peaches':894}

fruit_sales = pd.Series(fruits)
fruit_sales

Apples     1537
Bananas    2541
Kiwis       527
Peaches     894
dtype: int64

Once we have created our data series out of the original dictionary, we can acccess the value we want by specifying the corresponding key in our dictionary.

fruit_sales['Peaches']

We could also create the data series using the following code, specifying both the data vector and its corresponding keynames:

fruit_sales = pd.Series([1537,2541,527,894],index=['Apples','Bananas','Kiwis','Peaches'])
fruit_sales

Apples     1537
Bananas    2541
Kiwis       527
Peaches     894
dtype: int64

Pandas DataFrame Object#

A central concept in statististical data analysis is the dataframe object. We often represent the data as observations where for each observation (stored as rows), multiple variables (storeed as columns) were recorded. The idea behind a dataframe is to create an object that is suited to managing this kind of data. In the pandas package, there is the DataFrame function that creates an object of type dataframe. We can create a dataframe of the fruit sales Series object generated above.

dat=pd.DataFrame(fruit_sales)
print(dat.info())
print(dat.head())

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Apples to Peaches
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       4 non-null      int64
dtypes: int64(1)
memory usage: 236.0+ bytes
None
            0
Apples   1537
Bananas  2541
Kiwis     527
Peaches   894

Once you have imported the data you want, you can study the resulting object to determine its properties. There is both the info function, which provides a variety of useful details on the structure of the object, and the head function that prints out the first five rows of the dataframe.

Creating DataFrames From Scratch#

We can also create a dataframe from scratch and input out own data. For example, let us assume we have measured people’s height as well as noting their age, using None in the case where the data are missing.

df = pd.DataFrame({'Height':[1.7,1.8,1.5,1.6,1.9],
                  'Age':[37,28,None,19,21]}) 
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Height  5 non-null      float64
 1   Age     4 non-null      float64
dtypes: float64(2)
memory usage: 208.0 bytes

	Height	Age
0	1.7	37.0
1	1.8	28.0
2	1.5	NaN
3	1.6	19.0
4	1.9	21.0

Importing Data From External Files#

We have now seen how to create DataFrame objects from data we are inputting, but what if we have data in a spreadsheet or other file that we want to read into Python? pandas has a range of read_* functions to read in data of different formats. We will focus here on the read_csv function in particular, but if you have .xls, html, .txt etc. files, you may need to use one of the other functions. We will start with some data downloaded from gapminder relating to GDP per capita of various different countries and acorss years. We need to specify the path to the folder (here assumed to be in the current working directory) and use the read function to import it. Once we have done that, we can study the first few rowys of the dataframe.

# Data from https://www.gapminder.org/data/
income=pd.read_csv('dat/indicator_gapminder_gdp_per_capita_ppp-Data.csv')
income.head()

	GDP per capita	1800	1801	1802	1803	1804	1805	1806	1807	1808	...	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
0	Abkhazia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	472.053500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	785.127571	804.717458	874.0	887.914578	983.652314	984.805841	1154.859365	1214.613653	1261.354184	1349.696941
2	Akrotiri and Dhekelia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	601.215222	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	4855.210024	5115.252837	5369.0	5652.049321	5958.021197	6365.530359	6550.896164	6746.445312	6914.267317	6969.306283
4	Algeria	766.253664	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	5576.851564	5790.967692	6011.0	6022.270940	6133.782763	6162.719840	6173.729741	6300.648214	6354.640523	6419.127829

5 rows × 214 columns

These data report various country’s GDP, across years (columns) for different countries (rows). We could import these as a series with the country being the index and the GDP the values, but in this case we will assume the column of countries is of wider interest as a variable in its own right. If you look at the data, you may notice that there is a large number of missing values represented by NaN. It we explore the dtypes of the dataframe, we see that the GDPs are classified as floats, whilst country is an object (string).

income.dtypes

GDP per capita     object
            float64
            float64
            float64
            float64
                   ...   
            float64
            float64
            float64
            float64
            float64
Length: 214, dtype: object

Before we go any further, notice that the first column is labelled incorrectly as GDP per capita (rather than country), so let us rename the first column using the .rename()function.

income=income.rename(columns={"GDP per capita": "country"})
income.head()

	country	1800	1801	1802	1803	1804	1805	1806	1807	1808	...	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
0	Abkhazia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	472.053500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	785.127571	804.717458	874.0	887.914578	983.652314	984.805841	1154.859365	1214.613653	1261.354184	1349.696941
2	Akrotiri and Dhekelia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	601.215222	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	4855.210024	5115.252837	5369.0	5652.049321	5958.021197	6365.530359	6550.896164	6746.445312	6914.267317	6969.306283
4	Algeria	766.253664	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	5576.851564	5790.967692	6011.0	6022.270940	6133.782763	6162.719840	6173.729741	6300.648214	6354.640523	6419.127829

5 rows × 214 columns

Data wrangling#

Data wrangling is a process by which we modify the data to a structure or format relevant to how we wish to utilise it. For example, we may want to extract certain entries or variables from the data, reshape the data to a more appropriate format, or check for and deal with missing values. We can slice specific column(s) based on their names; rows and/or columns specifying their loc (the index names of the rows and/or columns); rows and/or columns based on their array indices using iloc; or based on particular Boolean operations. Some care needs to be taken when using loc or iloc as if the dataframe from this function is saved into a new object and the orginal full dataframe is edited, it can sometimes change the sliced data automatically. This can be avoided by using .copy(). Below, if changes are made to income above, then this will not utomatically change the sliced dataframe unless the slicing is rerun.

income['1800'].head()

         NaN
  472.053500
         NaN
  601.215222
  766.253664
Name: 1800, dtype: float64

income[['1800','1805']].head()

	1800	1805
0	NaN	NaN
1	472.053500	NaN
2	NaN	NaN
3	601.215222	NaN
4	766.253664	NaN

income.loc[11:20, '1800':'1805'].copy()

	1800	1801	1802	1803	1804	1805
11	378.403254	NaN	NaN	NaN	NaN	NaN
12	575.855054	NaN	NaN	NaN	NaN	NaN
13	671.478209	NaN	NaN	NaN	NaN	NaN
14	1434.507479	1440.072847	1445.659806	1451.268441	1456.898835	1462.551074
15	477.152815	NaN	NaN	NaN	NaN	NaN
16	1409.188697	NaN	NaN	NaN	NaN	NaN
17	808.991793	NaN	NaN	NaN	NaN	NaN
18	608.794297	608.635943	608.477630	608.319358	608.161127	608.002938
19	1017.813988	NaN	NaN	NaN	NaN	NaN
20	510.549704	NaN	NaN	NaN	NaN	NaN

income.iloc[11:21,:5].copy()

	country	1800	1801	1802	1803
11	Armenia	378.403254	NaN	NaN	NaN
12	Aruba	575.855054	NaN	NaN	NaN
13	Australia	671.478209	NaN	NaN	NaN
14	Austria	1434.507479	1440.072847	1445.659806	1451.268441
15	Azerbaijan	477.152815	NaN	NaN	NaN
16	Bahamas	1409.188697	NaN	NaN	NaN
17	Bahrain	808.991793	NaN	NaN	NaN
18	Bangladesh	608.794297	608.635943	608.477630	608.319358
19	Barbados	1017.813988	NaN	NaN	NaN
20	Belarus	510.549704	NaN	NaN	NaN

income[income['country']=='Australia']

	country	1800	1801	1802	1803	1804	1805	1806	1807	1808	...	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
13	Australia	671.478209	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	31634.242428	32098.50615	32798.0	33204.155734	34196.812421	34444.333064	34389.09927	34884.763757	35253.938423	36064.737277

1 rows × 214 columns

Reshaping - long ↔ wide format#

Wide to long format#

It may not be convenient to have all the years as additional columns (often called ‘wide’ format). Instead, we would perhaps rather have a single column of GDP values, but with replicates of the country for each year (‘long’ format). To do this in Python we use the melt function, specifying:

id_vars: the column name(s) in each data to maintain/replicate
value_vars: the column(s) that will be collapsed down to a single column
var_names: the column name of the new variable constructed from the previous column names
value_name: the column name for the values of the collapsed data

yearlist=list(income.columns)[1:]
income_long=income.melt(id_vars='country',value_vars=yearlist,var_name='Year',value_name='GDP')
income_long.head()

	country	Year	GDP
0	Abkhazia	1800	NaN
1	Afghanistan	1800	472.053500
2	Akrotiri and Dhekelia	1800	NaN
3	Albania	1800	601.215222
4	Algeria	1800	766.253664

Long to wide format#

To go from long to wide format, i.e, the reverse of this, we can use the pivot() function.

columns: Column to use to make new frame’s columns
values: Column(s) to use for populating new frame’s values
index: Column to use to make new dataframe’s index

income_wide=pd.pivot(income_long, index='country', columns='Year', values='GDP')
income_wide.head()

Year	1800	1801	1802	1803	1804	1805	1806	1807	1808	1809	...	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
country
Abkhazia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Afghanistan	472.053500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	785.127571	804.717458	874.0	887.914578	983.652314	984.805841	1154.859365	1214.613653	1261.354184	1349.696941
Akrotiri and Dhekelia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Albania	601.215222	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	4855.210024	5115.252837	5369.0	5652.049321	5958.021197	6365.530359	6550.896164	6746.445312	6914.267317	6969.306283
Algeria	766.253664	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	5576.851564	5790.967692	6011.0	6022.270940	6133.782763	6162.719840	6173.729741	6300.648214	6354.640523	6419.127829

5 rows × 213 columns

Combining Datasets: `concat`, `merge` and `join`#

A common task when working with data is the need to combine multiple data structures together. This could be when we have new observations of existing variables (extra rows), or new variables measured on existing objects (extra columns), for example. The pandas package allows the combination of data in this way. Understanding the properties of the data we are combining is very important to ensure this is done in the correct way.

In pandas, there are multiple functions for combining dataframes. Each has its uses and syntax. We will study examples of each below, starting by importing a similar dataset on total populations for each country and converting to long format.

popn=pd.read_csv('dat/population_total.csv')
yearlist=list(popn.columns)[1:]
popn_long=popn.melt(id_vars='country',value_vars=yearlist,var_name='Year',value_name='Popn')
popn_long.head()

	country	Year	Popn
0	Afghanistan	1799	3.28M
1	Angola	1799	1.57M
2	Albania	1799	400k
3	Andorra	1799	2650
4	United Arab Emirates	1799	40.2k

The population data start in the same year (1800) but goes as far as 2100, rather than 2012. This can cause problems when merging the data if we are not careful.

The first function we will use to combine the data is the concat function. Here we can concatenate the two dataframes, that is stack the corresponding columns on top of each other. As we have noticed there are some columns in the population dataframe that are not replicated in the GDP data, we could stack all of the GDP data with only the corresponding columns from the population dataframe (this is fiddly if they do not follow the same order). An easier way is to use the join argument. Inner joining means only including data that are replicated across both dataframes.

pd.concat(([income,popn.loc[:,:"2012"]]))
#or
pd.concat(([income,popn]),join="inner")

	country	1800	1801	1802	1803	1804	1805	1806	1807	1808	...	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
0	Abkhazia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	472.0535	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	785.127571	804.717458	874.0	887.914578	983.652314	984.805841	1154.859365	1214.613653	1261.354184	1349.696941
2	Akrotiri and Dhekelia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	601.215222	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	4855.210024	5115.252837	5369.0	5652.049321	5958.021197	6365.530359	6550.896164	6746.445312	6914.267317	6969.306283
4	Algeria	766.253664	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	5576.851564	5790.967692	6011.0	6022.27094	6133.782763	6162.71984	6173.729741	6300.648214	6354.640523	6419.127829
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
192	Samoa	47.3k	47.3k	47.3k	47.3k	47.3k	47.3k	47.2k	47.2k	47.2k	...	177k	179k	180k	181k	182k	183k	185k	186k	187k	189k
193	Yemen	2.59M	2.59M	2.59M	2.59M	2.59M	2.59M	2.59M	2.59M	2.59M	...	19M	19.5M	20.1M	20.7M	21.3M	21.9M	22.5M	23.2M	23.8M	24.5M
194	South Africa	1.45M	1.46M	1.46M	1.47M	1.47M	1.48M	1.49M	1.49M	1.5M	...	46.7M	47.3M	47.9M	48.5M	49.1M	49.8M	50.5M	51.2M	52M	52.8M
195	Zambia	758k	770k	782k	794k	806k	818k	831k	843k	856k	...	11.3M	11.6M	11.9M	12.2M	12.5M	12.8M	13.2M	13.6M	14M	14.5M
196	Zimbabwe	1.09M	1.09M	1.09M	1.09M	1.09M	1.09M	1.09M	1.09M	1.09M	...	12M	12M	12.1M	12.2M	12.3M	12.4M	12.5M	12.7M	12.9M	13.1M

457 rows × 213 columns

If we did not select the matching rows, the default is to use an ‘outer’ join, which stacks the columns on top of each other, but filling the entries with NaN where the column does not exist. In the case of our dataframes, the GDP does not exist from 2013 to 2100, so all these elements will be NaNs (top right of the new dataset below).

pd.concat(([income,popn]),join="outer")

	country	1800	1801	1802	1803	1804	1805	1806	1807	1808	...	2091	2092	2093	2094	2095	2096	2097	2098	2099	2100
0	Abkhazia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	472.0535	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Akrotiri and Dhekelia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	601.215222	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Algeria	766.253664	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
192	Samoa	47.3k	47.3k	47.3k	47.3k	47.3k	47.3k	47.2k	47.2k	47.2k	...	315k	314k	314k	314k	313k	313k	312k	312k	311k	310k
193	Yemen	2.59M	2.59M	2.59M	2.59M	2.59M	2.59M	2.59M	2.59M	2.59M	...	54.5M	54.4M	54.3M	54.1M	54M	53.8M	53.7M	53.5M	53.4M	53.2M
194	South Africa	1.45M	1.46M	1.46M	1.47M	1.47M	1.48M	1.49M	1.49M	1.5M	...	79.8M	79.8M	79.7M	79.7M	79.6M	79.5M	79.5M	79.4M	79.3M	79.2M
195	Zambia	758k	770k	782k	794k	806k	818k	831k	843k	856k	...	74.4M	75.2M	76M	76.8M	77.6M	78.4M	79.2M	80M	80.8M	81.5M
196	Zimbabwe	1.09M	1.09M	1.09M	1.09M	1.09M	1.09M	1.09M	1.09M	1.09M	...	30.8M	30.9M	30.9M	30.9M	30.9M	30.9M	31M	31M	31M	31M

457 rows × 303 columns

Another type of combining datasets is to merge the two together where they share some common columns. For example, these data have the same countries and years. Firstly, let us join just the wide format dataframes on country. This will match the countries in each of the dataframes and produce a single dataframe with the remaining columns stuck together. Where there are other columns in the dataframes with the same name, these will be indexed by _x,_y etc. by default, or by the suffixes provided. Any rows from either data set that do not match the other data set are removed.

datwide_merge=pd.merge(income,popn,on='country',suffixes=('_GDP', '_popn'))
datwide_merge.head()

	country	1800_GDP	1801_GDP	1802_GDP	1803_GDP	1804_GDP	1805_GDP	1806_GDP	1807_GDP	1808_GDP	...	2091	2092	2093	2094	2095	2096	2097	2098	2099	2100
0	Afghanistan	472.053500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	76.6M	76.4M	76.3M	76.1M	76M	75.8M	75.6M	75.4M	75.2M	74.9M
1	Albania	601.215222	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	1.33M	1.3M	1.27M	1.25M	1.22M	1.19M	1.17M	1.14M	1.11M	1.09M
2	Algeria	766.253664	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	70.4M	70.5M	70.5M	70.6M	70.7M	70.7M	70.7M	70.7M	70.7M	70.7M
3	Andorra	1260.123256	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	63k	62.9k	62.9k	62.8k	62.7k	62.7k	62.6k	62.5k	62.5k	62.4k
4	Angola	359.932582	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	168M	170M	172M	175M	177M	179M	182M	184M	186M	188M

5 rows × 515 columns

See here that the years up to 2012 are repeated and have suffixes, whereas subsequent years only appear in the population dataframe, and so these do not have any suffix.

With the long format data, we can merge the data on both country and year. In this case there are no further replicated columns so no suffixes are required, but these can be given if required. The resulting dataframe will have a single country and year with additional columns for GDP and populations bound onto them.

dat_merge=pd.merge(income_long,popn_long,on=['country','Year'])
dat_merge.head()

	country	Year	GDP	Popn
0	Afghanistan	1800	472.053500	3.28M
1	Albania	1800	601.215222	402k
2	Algeria	1800	766.253664	2.51M
3	Andorra	1800	1260.123256	2650
4	Angola	1800	359.932582	1.57M

Another option is if the two dataframes contain the same columns but these are called different names in each dataframe. In that case, merging as above will not work. In this case, there is an option for specifying what the names of the columns are that should be matched.

dat_merge=pd.merge(income_long,popn_long,left_on=['country','Year'],right_on=['country','Year'])
dat_merge.head()

	country	Year	GDP	Popn
0	Afghanistan	1800	472.053500	3.28M
1	Albania	1800	601.215222	402k
2	Algeria	1800	766.253664	2.51M
3	Andorra	1800	1260.123256	2650
4	Angola	1800	359.932582	1.57M

Finally, there is the join function, used in a similar way to numpy append but this joins on index. It should produce an identical dataframe to that above.

income_long.join(popn_long.set_index(['country','Year']),on=['country','Year'])

	country	Year	GDP	Popn
0	Abkhazia	1800	NaN	NaN
1	Afghanistan	1800	472.053500	3.28M
2	Akrotiri and Dhekelia	1800	NaN	NaN
3	Albania	1800	601.215222	402k
4	Algeria	1800	766.253664	2.51M
...	...	...	...	...
55375	Yemen, Rep.	2012	2043.787776	NaN
55376	Yugoslavia	2012	NaN	NaN
55377	Zambia	2012	1550.923859	14.5M
55378	Zimbabwe	2012	545.344601	13.1M
55379	Åland	2012	NaN	NaN

55380 rows × 4 columns

Data Aggregation and Summarisation#

Now that we have the data imported, reshaped, combined and know how to index it, our final task is to calculate some useful summaries of the data. You will already have learnt how to calculate useful summaries of numpy arrays using np.sum(), for example. The pandas package has all of these similar functions that can be called using either pd.sum(df) or directly from the dataframe itself using:

print(income.sum(numeric_only=True).head())

print(income.mean(numeric_only=True).head())

  164489.577467
   36543.812847
   36718.280093
   36751.917126
   36770.763981
dtype: float64
   751.093961
  1178.832672
  1184.460648
  1185.545714
  1186.153677
dtype: float64

Note that as the first column, country, is a string variable, calculating the sum or mean of this does not make sense. Using the numeric_oly=True argument only calculates summaries on columns that are numeric. By default, these summary functions will calculate the aggregated summary over the rows, that is a single summary is provided for each column. We can specify alternative axes to compute summaries over using the axis argument. For example, if we want to calculate the average by country across all years, we would use:

print(income.mean(axis='columns',numeric_only=True))

            NaN
     817.075676
            NaN
    3212.062887
    4375.433233
          ...     
  1566.370726
          NaN
  1320.193141
   620.935603
          NaN
Length: 260, dtype: float64

There is also a really helpful function describe() that provides many useful summaries of a dataframe in a single output. This function does not require us to specify the numeric only variables and does this automatically.

income.describe()

	1800	1801	1802	1803	1804	1805	1806	1807	1808	1809	...	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
count	219.000000	31.000000	31.000000	31.000000	31.000000	31.000000	31.000000	31.000000	31.000000	31.000000	...	217.000000	217.000000	230.000000	216.000000	213.000000	210.000000	210.000000	210.000000	210.000000	191.000000
mean	751.093961	1178.832672	1184.460648	1185.545714	1186.153677	1189.816980	1193.944536	1193.470663	1182.443690	1184.676918	...	12117.045916	12613.662437	13470.188318	13792.388794	14218.956165	14246.914126	13613.980391	13948.423690	14216.294100	12482.299676
std	363.341906	522.892265	526.138489	525.795645	527.457253	529.937857	533.391361	534.956049	512.663576	517.505712	...	13887.534772	14427.012694	15269.288053	15705.158194	16152.693584	15963.714599	15018.479845	15565.955318	15981.327997	14351.997403
min	340.000000	562.738123	562.591749	562.445412	562.299114	562.152854	562.006632	561.860447	561.714301	561.568193	...	299.490250	314.518974	330.000000	338.270005	348.969090	359.658528	359.080223	373.632392	387.705891	403.164594
25%	501.320914	800.000000	800.000000	800.000000	800.000000	800.000000	800.000000	800.000000	800.000000	800.000000	...	2152.689867	2250.897567	2438.507760	2415.720812	2524.534357	2584.742113	2620.289388	2764.223704	2988.593507	2437.784461
50%	667.709189	1056.469106	1057.879973	1059.292724	1060.707362	1062.123889	1063.542308	1064.962621	1066.384831	1067.808940	...	6500.126735	6877.986057	7183.000000	7335.729723	7508.367883	7826.267141	7671.274511	8069.778733	8014.654838	6975.140224
75%	882.536501	1417.076555	1422.761772	1428.469822	1456.569099	1438.853749	1457.920399	1447.689538	1474.611622	1479.159408	...	19067.164078	19287.908016	20320.750000	21188.972251	21990.961428	22473.689659	20910.631598	20854.775489	20942.498188	16649.509264
max	2716.870226	2736.975065	2740.533133	2744.095826	2747.663151	2751.235113	2754.811718	2758.392974	2761.978884	2765.569457	...	77249.852411	78810.335959	81825.080975	88389.525727	95877.649215	92607.243436	82377.950189	89000.290551	90039.049626	91492.645269

8 rows × 213 columns

Finally, if we only want the summary for a particular year (cloumn), we can slice the dataframe first before calculating the summary.

income['1800'].mean()

751.093961034262

Summary#

This is a brief intro to some of the capabilities of the pandas package and the general tasks we can use it for. There are many other capabilities that we do not discuss here, some of which will be covered in the final two weeks of the course. More advanced concepts are left for you to explore!

MT3510 Introduction to Mathematical Computing

Introduction to Pandas

Contents

Introduction to Pandas#

Importing Pandas#

Pandas Series Object#

Pandas DataFrame Object#

Creating DataFrames From Scratch#

Importing Data From External Files#

Data wrangling#

Reshaping - long ↔ wide format#

Wide to long format#

Long to wide format#

Combining Datasets: `concat`, `merge` and `join`#

Data Aggregation and Summarisation#

Summary#

MT3510 Introduction to Mathematical Computing

Introduction to Pandas

Contents

Introduction to Pandas#

Importing Pandas#

Pandas Series Object#

Pandas DataFrame Object#

Creating DataFrames From Scratch#

Importing Data From External Files#

Data wrangling#

Reshaping - long ↔ wide format#

Wide to long format#

Long to wide format#

Combining Datasets: concat, merge and join#

Data Aggregation and Summarisation#

Summary#

Combining Datasets: `concat`, `merge` and `join`#