Introduction to Pandas#

The pandas package in Python provides a large collection of functions for importing, merging and wrangling data of many different types. This can be as simple as creating data objects from a text file on your computer to scraping text from websites and saving to particular data structures.

In this section of the course, we will learn about some basic functionality of the pandas package for importing and wrangling data in a reproducible and ‘tidy’ way. Then we will pick up further on this at the end of the course to learn about more complex ways of handling and working with external data sources.

There are many resources available to help you, with full documentation at the pandas repository. If you want to learn more advanced ways of using pandas, this is a great place to start.

Importing Pandas#

As with the numpy package, it is customary to import the pandas package using a shortened name, pd.

import pandas as pd

In the simplest form, the pandas package contains functions that can create and manipulate enhanced data objects relative to NumPy. Specifically, there are three main types of data sctructure in pandas that we need to familiarise ourself with. Series, DataFrames and Indexes. We may have an external files that we want to read into Python in order to later study or analyse or create our own objects from scratch.

Pandas Series Object#

Let us begin by looking at the Series object, a way of storing vectors of data. We supply a vector of values that we wish to store in our object as follows and observe the contents of the created object.

data = pd.Series([0.1,0.2,0.3,0.4])
data
0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

As in numpy, data in pandas objects can be accessed using the square pbracket notation

data[2:4]
2    0.3
3    0.4
dtype: float64

You can see that this object has both a vector of values, (which can be accessed using data.values), but also an index (accessed using data.index). Although it may seem very similar to the numpy array, the pandas Series object is much more flexible than the former. In particular, we may want to use non-integers for the index, such as strings. In that case we can construct data dictionaries, allowing the association of the individual indices to specific values. For example, let us consider some made up fruit sales at a shop in St Andrews.

fruits={'Apples':1537,
        'Bananas': 2541,
        'Kiwis': 527,
        'Peaches':894}

fruit_sales = pd.Series(fruits)
fruit_sales
Apples     1537
Bananas    2541
Kiwis       527
Peaches     894
dtype: int64

Once we have created our data series out of the original dictionary, we can acccess the value we want by specifying the corresponding key in our dictionary.

fruit_sales['Peaches']
894

We could also create the data series using the following code, specifying both the data vector and its corresponding keynames:

fruit_sales = pd.Series([1537,2541,527,894],index=['Apples','Bananas','Kiwis','Peaches'])
fruit_sales
Apples     1537
Bananas    2541
Kiwis       527
Peaches     894
dtype: int64

Pandas DataFrame Object#

A central concept in statististical data analysis is the dataframe object. We often represent the data as observations where for each observation (stored as rows), multiple variables (storeed as columns) were recorded. The idea behind a dataframe is to create an object that is suited to managing this kind of data. In the pandas package, there is the DataFrame function that creates an object of type dataframe. We can create a dataframe of the fruit sales Series object generated above.

dat=pd.DataFrame(fruit_sales)
print(dat.info())
print(dat.head())
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Apples to Peaches
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       4 non-null      int64
dtypes: int64(1)
memory usage: 236.0+ bytes
None
            0
Apples   1537
Bananas  2541
Kiwis     527
Peaches   894

Once you have imported the data you want, you can study the resulting object to determine its properties. There is both the info function, which provides a variety of useful details on the structure of the object, and the head function that prints out the first five rows of the dataframe.

Creating DataFrames From Scratch#

We can also create a dataframe from scratch and input out own data. For example, let us assume we have measured people’s height as well as noting their age, using None in the case where the data are missing.

df = pd.DataFrame({'Height':[1.7,1.8,1.5,1.6,1.9],
                  'Age':[37,28,None,19,21]}) 
df.info()
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Height  5 non-null      float64
 1   Age     4 non-null      float64
dtypes: float64(2)
memory usage: 208.0 bytes
Height Age
0 1.7 37.0
1 1.8 28.0
2 1.5 NaN
3 1.6 19.0
4 1.9 21.0

Importing Data From External Files#

We have now seen how to create DataFrame objects from data we are inputting, but what if we have data in a spreadsheet or other file that we want to read into Python? pandas has a range of read_* functions to read in data of different formats. We will focus here on the read_csv function in particular, but if you have .xls, html, .txt etc. files, you may need to use one of the other functions. We will start with some data downloaded from gapminder relating to GDP per capita of various different countries and acorss years. We need to specify the path to the folder (here assumed to be in the current working directory) and use the read function to import it. Once we have done that, we can study the first few rowys of the dataframe.

# Data from https://www.gapminder.org/data/
income=pd.read_csv('dat/indicator_gapminder_gdp_per_capita_ppp-Data.csv')
income.head()
GDP per capita 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
0 Abkhazia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan 472.053500 NaN NaN NaN NaN NaN NaN NaN NaN ... 785.127571 804.717458 874.0 887.914578 983.652314 984.805841 1154.859365 1214.613653 1261.354184 1349.696941
2 Akrotiri and Dhekelia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania 601.215222 NaN NaN NaN NaN NaN NaN NaN NaN ... 4855.210024 5115.252837 5369.0 5652.049321 5958.021197 6365.530359 6550.896164 6746.445312 6914.267317 6969.306283
4 Algeria 766.253664 NaN NaN NaN NaN NaN NaN NaN NaN ... 5576.851564 5790.967692 6011.0 6022.270940 6133.782763 6162.719840 6173.729741 6300.648214 6354.640523 6419.127829

5 rows × 214 columns

These data report various country’s GDP, across years (columns) for different countries (rows). We could import these as a series with the country being the index and the GDP the values, but in this case we will assume the column of countries is of wider interest as a variable in its own right. If you look at the data, you may notice that there is a large number of missing values represented by NaN. It we explore the dtypes of the dataframe, we see that the GDPs are classified as floats, whilst country is an object (string).

income.dtypes
GDP per capita     object
1800              float64
1801              float64
1802              float64
1803              float64
                   ...   
2008              float64
2009              float64
2010              float64
2011              float64
2012              float64
Length: 214, dtype: object

Before we go any further, notice that the first column is labelled incorrectly as GDP per capita (rather than country), so let us rename the first column using the .rename()function.

income=income.rename(columns={"GDP per capita": "country"})
income.head()
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
0 Abkhazia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan 472.053500 NaN NaN NaN NaN NaN NaN NaN NaN ... 785.127571 804.717458 874.0 887.914578 983.652314 984.805841 1154.859365 1214.613653 1261.354184 1349.696941
2 Akrotiri and Dhekelia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania 601.215222 NaN NaN NaN NaN NaN NaN NaN NaN ... 4855.210024 5115.252837 5369.0 5652.049321 5958.021197 6365.530359 6550.896164 6746.445312 6914.267317 6969.306283
4 Algeria 766.253664 NaN NaN NaN NaN NaN NaN NaN NaN ... 5576.851564 5790.967692 6011.0 6022.270940 6133.782763 6162.719840 6173.729741 6300.648214 6354.640523 6419.127829

5 rows × 214 columns

Data wrangling#

Data wrangling is a process by which we modify the data to a structure or format relevant to how we wish to utilise it. For example, we may want to extract certain entries or variables from the data, reshape the data to a more appropriate format, or check for and deal with missing values. We can slice specific column(s) based on their names; rows and/or columns specifying their loc (the index names of the rows and/or columns); rows and/or columns based on their array indices using iloc; or based on particular Boolean operations. Some care needs to be taken when using loc or iloc as if the dataframe from this function is saved into a new object and the orginal full dataframe is edited, it can sometimes change the sliced data automatically. This can be avoided by using .copy(). Below, if changes are made to income above, then this will not utomatically change the sliced dataframe unless the slicing is rerun.

income['1800'].head()
0           NaN
1    472.053500
2           NaN
3    601.215222
4    766.253664
Name: 1800, dtype: float64
income[['1800','1805']].head()
1800 1805
0 NaN NaN
1 472.053500 NaN
2 NaN NaN
3 601.215222 NaN
4 766.253664 NaN
income.loc[11:20, '1800':'1805'].copy()
1800 1801 1802 1803 1804 1805
11 378.403254 NaN NaN NaN NaN NaN
12 575.855054 NaN NaN NaN NaN NaN
13 671.478209 NaN NaN NaN NaN NaN
14 1434.507479 1440.072847 1445.659806 1451.268441 1456.898835 1462.551074
15 477.152815 NaN NaN NaN NaN NaN
16 1409.188697 NaN NaN NaN NaN NaN
17 808.991793 NaN NaN NaN NaN NaN
18 608.794297 608.635943 608.477630 608.319358 608.161127 608.002938
19 1017.813988 NaN NaN NaN NaN NaN
20 510.549704 NaN NaN NaN NaN NaN
income.iloc[11:21,:5].copy()
country 1800 1801 1802 1803
11 Armenia 378.403254 NaN NaN NaN
12 Aruba 575.855054 NaN NaN NaN
13 Australia 671.478209 NaN NaN NaN
14 Austria 1434.507479 1440.072847 1445.659806 1451.268441
15 Azerbaijan 477.152815 NaN NaN NaN
16 Bahamas 1409.188697 NaN NaN NaN
17 Bahrain 808.991793 NaN NaN NaN
18 Bangladesh 608.794297 608.635943 608.477630 608.319358
19 Barbados 1017.813988 NaN NaN NaN
20 Belarus 510.549704 NaN NaN NaN
income[income['country']=='Australia']
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
13 Australia 671.478209 NaN NaN NaN NaN NaN NaN NaN NaN ... 31634.242428 32098.50615 32798.0 33204.155734 34196.812421 34444.333064 34389.09927 34884.763757 35253.938423 36064.737277

1 rows × 214 columns

Reshaping - long ↔ wide format#

Wide to long format#

It may not be convenient to have all the years as additional columns (often called ‘wide’ format). Instead, we would perhaps rather have a single column of GDP values, but with replicates of the country for each year (‘long’ format). To do this in Python we use the melt function, specifying:

  • id_vars: the column name(s) in each data to maintain/replicate

  • value_vars: the column(s) that will be collapsed down to a single column

  • var_names: the column name of the new variable constructed from the previous column names

  • value_name: the column name for the values of the collapsed data

yearlist=list(income.columns)[1:]
income_long=income.melt(id_vars='country',value_vars=yearlist,var_name='Year',value_name='GDP')
income_long.head()
country Year GDP
0 Abkhazia 1800 NaN
1 Afghanistan 1800 472.053500
2 Akrotiri and Dhekelia 1800 NaN
3 Albania 1800 601.215222
4 Algeria 1800 766.253664

Long to wide format#

To go from long to wide format, i.e, the reverse of this, we can use the pivot() function.

  • columns: Column to use to make new frame’s columns

  • values: Column(s) to use for populating new frame’s values

  • index: Column to use to make new dataframe’s index

income_wide=pd.pivot(income_long, index='country', columns='Year', values='GDP')
income_wide.head()
Year 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
country
Abkhazia NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Afghanistan 472.053500 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 785.127571 804.717458 874.0 887.914578 983.652314 984.805841 1154.859365 1214.613653 1261.354184 1349.696941
Akrotiri and Dhekelia NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Albania 601.215222 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4855.210024 5115.252837 5369.0 5652.049321 5958.021197 6365.530359 6550.896164 6746.445312 6914.267317 6969.306283
Algeria 766.253664 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 5576.851564 5790.967692 6011.0 6022.270940 6133.782763 6162.719840 6173.729741 6300.648214 6354.640523 6419.127829

5 rows × 213 columns

Combining Datasets: concat, merge and join#

A common task when working with data is the need to combine multiple data structures together. This could be when we have new observations of existing variables (extra rows), or new variables measured on existing objects (extra columns), for example. The pandas package allows the combination of data in this way. Understanding the properties of the data we are combining is very important to ensure this is done in the correct way.

In pandas, there are multiple functions for combining dataframes. Each has its uses and syntax. We will study examples of each below, starting by importing a similar dataset on total populations for each country and converting to long format.

popn=pd.read_csv('dat/population_total.csv')
yearlist=list(popn.columns)[1:]
popn_long=popn.melt(id_vars='country',value_vars=yearlist,var_name='Year',value_name='Popn')
popn_long.head()
country Year Popn
0 Afghanistan 1799 3.28M
1 Angola 1799 1.57M
2 Albania 1799 400k
3 Andorra 1799 2650
4 United Arab Emirates 1799 40.2k

The population data start in the same year (1800) but goes as far as 2100, rather than 2012. This can cause problems when merging the data if we are not careful.

The first function we will use to combine the data is the concat function. Here we can concatenate the two dataframes, that is stack the corresponding columns on top of each other. As we have noticed there are some columns in the population dataframe that are not replicated in the GDP data, we could stack all of the GDP data with only the corresponding columns from the population dataframe (this is fiddly if they do not follow the same order). An easier way is to use the join argument. Inner joining means only including data that are replicated across both dataframes.

pd.concat(([income,popn.loc[:,:"2012"]]))
#or
pd.concat(([income,popn]),join="inner")
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
0 Abkhazia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan 472.0535 NaN NaN NaN NaN NaN NaN NaN NaN ... 785.127571 804.717458 874.0 887.914578 983.652314 984.805841 1154.859365 1214.613653 1261.354184 1349.696941
2 Akrotiri and Dhekelia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania 601.215222 NaN NaN NaN NaN NaN NaN NaN NaN ... 4855.210024 5115.252837 5369.0 5652.049321 5958.021197 6365.530359 6550.896164 6746.445312 6914.267317 6969.306283
4 Algeria 766.253664 NaN NaN NaN NaN NaN NaN NaN NaN ... 5576.851564 5790.967692 6011.0 6022.27094 6133.782763 6162.71984 6173.729741 6300.648214 6354.640523 6419.127829
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
192 Samoa 47.3k 47.3k 47.3k 47.3k 47.3k 47.3k 47.2k 47.2k 47.2k ... 177k 179k 180k 181k 182k 183k 185k 186k 187k 189k
193 Yemen 2.59M 2.59M 2.59M 2.59M 2.59M 2.59M 2.59M 2.59M 2.59M ... 19M 19.5M 20.1M 20.7M 21.3M 21.9M 22.5M 23.2M 23.8M 24.5M
194 South Africa 1.45M 1.46M 1.46M 1.47M 1.47M 1.48M 1.49M 1.49M 1.5M ... 46.7M 47.3M 47.9M 48.5M 49.1M 49.8M 50.5M 51.2M 52M 52.8M
195 Zambia 758k 770k 782k 794k 806k 818k 831k 843k 856k ... 11.3M 11.6M 11.9M 12.2M 12.5M 12.8M 13.2M 13.6M 14M 14.5M
196 Zimbabwe 1.09M 1.09M 1.09M 1.09M 1.09M 1.09M 1.09M 1.09M 1.09M ... 12M 12M 12.1M 12.2M 12.3M 12.4M 12.5M 12.7M 12.9M 13.1M

457 rows × 213 columns

If we did not select the matching rows, the default is to use an ‘outer’ join, which stacks the columns on top of each other, but filling the entries with NaN where the column does not exist. In the case of our dataframes, the GDP does not exist from 2013 to 2100, so all these elements will be NaNs (top right of the new dataset below).

pd.concat(([income,popn]),join="outer")
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
0 Abkhazia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan 472.0535 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Akrotiri and Dhekelia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania 601.215222 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Algeria 766.253664 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
192 Samoa 47.3k 47.3k 47.3k 47.3k 47.3k 47.3k 47.2k 47.2k 47.2k ... 315k 314k 314k 314k 313k 313k 312k 312k 311k 310k
193 Yemen 2.59M 2.59M 2.59M 2.59M 2.59M 2.59M 2.59M 2.59M 2.59M ... 54.5M 54.4M 54.3M 54.1M 54M 53.8M 53.7M 53.5M 53.4M 53.2M
194 South Africa 1.45M 1.46M 1.46M 1.47M 1.47M 1.48M 1.49M 1.49M 1.5M ... 79.8M 79.8M 79.7M 79.7M 79.6M 79.5M 79.5M 79.4M 79.3M 79.2M
195 Zambia 758k 770k 782k 794k 806k 818k 831k 843k 856k ... 74.4M 75.2M 76M 76.8M 77.6M 78.4M 79.2M 80M 80.8M 81.5M
196 Zimbabwe 1.09M 1.09M 1.09M 1.09M 1.09M 1.09M 1.09M 1.09M 1.09M ... 30.8M 30.9M 30.9M 30.9M 30.9M 30.9M 31M 31M 31M 31M

457 rows × 303 columns

Another type of combining datasets is to merge the two together where they share some common columns. For example, these data have the same countries and years. Firstly, let us join just the wide format dataframes on country. This will match the countries in each of the dataframes and produce a single dataframe with the remaining columns stuck together. Where there are other columns in the dataframes with the same name, these will be indexed by _x,_y etc. by default, or by the suffixes provided. Any rows from either data set that do not match the other data set are removed.

datwide_merge=pd.merge(income,popn,on='country',suffixes=('_GDP', '_popn'))
datwide_merge.head()
country 1800_GDP 1801_GDP 1802_GDP 1803_GDP 1804_GDP 1805_GDP 1806_GDP 1807_GDP 1808_GDP ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
0 Afghanistan 472.053500 NaN NaN NaN NaN NaN NaN NaN NaN ... 76.6M 76.4M 76.3M 76.1M 76M 75.8M 75.6M 75.4M 75.2M 74.9M
1 Albania 601.215222 NaN NaN NaN NaN NaN NaN NaN NaN ... 1.33M 1.3M 1.27M 1.25M 1.22M 1.19M 1.17M 1.14M 1.11M 1.09M
2 Algeria 766.253664 NaN NaN NaN NaN NaN NaN NaN NaN ... 70.4M 70.5M 70.5M 70.6M 70.7M 70.7M 70.7M 70.7M 70.7M 70.7M
3 Andorra 1260.123256 NaN NaN NaN NaN NaN NaN NaN NaN ... 63k 62.9k 62.9k 62.8k 62.7k 62.7k 62.6k 62.5k 62.5k 62.4k
4 Angola 359.932582 NaN NaN NaN NaN NaN NaN NaN NaN ... 168M 170M 172M 175M 177M 179M 182M 184M 186M 188M

5 rows × 515 columns

See here that the years up to 2012 are repeated and have suffixes, whereas subsequent years only appear in the population dataframe, and so these do not have any suffix.

With the long format data, we can merge the data on both country and year. In this case there are no further replicated columns so no suffixes are required, but these can be given if required. The resulting dataframe will have a single country and year with additional columns for GDP and populations bound onto them.

dat_merge=pd.merge(income_long,popn_long,on=['country','Year'])
dat_merge.head()
country Year GDP Popn
0 Afghanistan 1800 472.053500 3.28M
1 Albania 1800 601.215222 402k
2 Algeria 1800 766.253664 2.51M
3 Andorra 1800 1260.123256 2650
4 Angola 1800 359.932582 1.57M

Another option is if the two dataframes contain the same columns but these are called different names in each dataframe. In that case, merging as above will not work. In this case, there is an option for specifying what the names of the columns are that should be matched.

dat_merge=pd.merge(income_long,popn_long,left_on=['country','Year'],right_on=['country','Year'])
dat_merge.head()
country Year GDP Popn
0 Afghanistan 1800 472.053500 3.28M
1 Albania 1800 601.215222 402k
2 Algeria 1800 766.253664 2.51M
3 Andorra 1800 1260.123256 2650
4 Angola 1800 359.932582 1.57M

Finally, there is the join function, used in a similar way to numpy append but this joins on index. It should produce an identical dataframe to that above.

income_long.join(popn_long.set_index(['country','Year']),on=['country','Year'])
country Year GDP Popn
0 Abkhazia 1800 NaN NaN
1 Afghanistan 1800 472.053500 3.28M
2 Akrotiri and Dhekelia 1800 NaN NaN
3 Albania 1800 601.215222 402k
4 Algeria 1800 766.253664 2.51M
... ... ... ... ...
55375 Yemen, Rep. 2012 2043.787776 NaN
55376 Yugoslavia 2012 NaN NaN
55377 Zambia 2012 1550.923859 14.5M
55378 Zimbabwe 2012 545.344601 13.1M
55379 Åland 2012 NaN NaN

55380 rows × 4 columns

Data Aggregation and Summarisation#

Now that we have the data imported, reshaped, combined and know how to index it, our final task is to calculate some useful summaries of the data. You will already have learnt how to calculate useful summaries of numpy arrays using np.sum(), for example. The pandas package has all of these similar functions that can be called using either pd.sum(df) or directly from the dataframe itself using:

print(income.sum(numeric_only=True).head())

print(income.mean(numeric_only=True).head())
1800    164489.577467
1801     36543.812847
1802     36718.280093
1803     36751.917126
1804     36770.763981
dtype: float64
1800     751.093961
1801    1178.832672
1802    1184.460648
1803    1185.545714
1804    1186.153677
dtype: float64

Note that as the first column, country, is a string variable, calculating the sum or mean of this does not make sense. Using the numeric_oly=True argument only calculates summaries on columns that are numeric. By default, these summary functions will calculate the aggregated summary over the rows, that is a single summary is provided for each column. We can specify alternative axes to compute summaries over using the axis argument. For example, if we want to calculate the average by country across all years, we would use:

print(income.mean(axis='columns',numeric_only=True))
0              NaN
1       817.075676
2              NaN
3      3212.062887
4      4375.433233
          ...     
255    1566.370726
256            NaN
257    1320.193141
258     620.935603
259            NaN
Length: 260, dtype: float64

There is also a really helpful function describe() that provides many useful summaries of a dataframe in a single output. This function does not require us to specify the numeric only variables and does this automatically.

income.describe()
1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ... 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
count 219.000000 31.000000 31.000000 31.000000 31.000000 31.000000 31.000000 31.000000 31.000000 31.000000 ... 217.000000 217.000000 230.000000 216.000000 213.000000 210.000000 210.000000 210.000000 210.000000 191.000000
mean 751.093961 1178.832672 1184.460648 1185.545714 1186.153677 1189.816980 1193.944536 1193.470663 1182.443690 1184.676918 ... 12117.045916 12613.662437 13470.188318 13792.388794 14218.956165 14246.914126 13613.980391 13948.423690 14216.294100 12482.299676
std 363.341906 522.892265 526.138489 525.795645 527.457253 529.937857 533.391361 534.956049 512.663576 517.505712 ... 13887.534772 14427.012694 15269.288053 15705.158194 16152.693584 15963.714599 15018.479845 15565.955318 15981.327997 14351.997403
min 340.000000 562.738123 562.591749 562.445412 562.299114 562.152854 562.006632 561.860447 561.714301 561.568193 ... 299.490250 314.518974 330.000000 338.270005 348.969090 359.658528 359.080223 373.632392 387.705891 403.164594
25% 501.320914 800.000000 800.000000 800.000000 800.000000 800.000000 800.000000 800.000000 800.000000 800.000000 ... 2152.689867 2250.897567 2438.507760 2415.720812 2524.534357 2584.742113 2620.289388 2764.223704 2988.593507 2437.784461
50% 667.709189 1056.469106 1057.879973 1059.292724 1060.707362 1062.123889 1063.542308 1064.962621 1066.384831 1067.808940 ... 6500.126735 6877.986057 7183.000000 7335.729723 7508.367883 7826.267141 7671.274511 8069.778733 8014.654838 6975.140224
75% 882.536501 1417.076555 1422.761772 1428.469822 1456.569099 1438.853749 1457.920399 1447.689538 1474.611622 1479.159408 ... 19067.164078 19287.908016 20320.750000 21188.972251 21990.961428 22473.689659 20910.631598 20854.775489 20942.498188 16649.509264
max 2716.870226 2736.975065 2740.533133 2744.095826 2747.663151 2751.235113 2754.811718 2758.392974 2761.978884 2765.569457 ... 77249.852411 78810.335959 81825.080975 88389.525727 95877.649215 92607.243436 82377.950189 89000.290551 90039.049626 91492.645269

8 rows × 213 columns

Finally, if we only want the summary for a particular year (cloumn), we can slice the dataframe first before calculating the summary.

income['1800'].mean()
751.093961034262

Summary#

This is a brief intro to some of the capabilities of the pandas package and the general tasks we can use it for. There are many other capabilities that we do not discuss here, some of which will be covered in the final two weeks of the course. More advanced concepts are left for you to explore!