Exploring data using Pandas¶
Our first task in this week’s lesson is to learn how to read and explore data files in Python. We will focus on using pandas which is an open-source package for data analysis in Python. Pandas is an excellent toolkit for working with real world data that often have a tabular structure (rows and columns).
We will first get familiar with pandas data structures: DataFrame and Series:
- Pandas DataFrame (a 2-dimensional data structure) is used for storing and mainpulating table-like data (data with rows and columns) in Python. You can think of pandas DataFrame as a programmable spreadsheet.
- Pandas Series (a 1-dimensional data structure) is used for storing and manipulating an sequence of values. Pandas Series is kind of like a list, but more clever. One row or one column in a Pandas DataFrame is actually a Pandas Series.
These Pandas structures incorporate a number of things we’ve already encountered, such as indices, data stored in a collection, and data types. Let’s have another look at the Pandas data structures below with some additional annotation.
As you can see, both DataFrames and Series in pandas have an index that can be used to select values, but they also have column labels to identify columns in DataFrames. In the lesson this week we’ll use many of these features to explore real-world data and learn some useful data analysis procedures.
For a comprehensive overview of pandas data structures you can have a look at Chapter 5 in Wes MacKinney’s book Python for Data Analysis (2nd Edition, 2017) and Pandas online documentation about data structures.
Note
Pandas is a “high-level” package, which means that it makes use of several other packages, such as NumPy, in the background. There are several ways in which data can be read from a file in Python, and this year we have decided to focus primarily on pandas because it is easy-to-use, efficient and intuitive. If you are curoius about other approaches for interacting with data files, you can find lesson materials from previous years about reading data using NumPy or built-in Python functions.
Input data: weather statistics¶
Our input data is a text file containing weather observations from Kumpula, Helsinki, Finland retrieved from NOAA*:
- File name: Kumpula-June-2016-w-metadata.txt
- The file is available in the binder and CSC notebook instances, under L5 folder
- The data file contains observed daily mean, minimum, and maximum temperatures from June 2016 recorded from the Kumpula weather observation station in Helsinki.
- There are 30 rows of data in this sample data set.
- The data has been derived from a data file of daily temperature measurments downloaded from NOAA.
*US National Oceanographic and Atmospheric Administration’s National Centers for Environmental Information climate database
Reading a data file with Pandas¶
Now we’re ready to read in our temperature data file. First, we need to import the Pandas module. It is customary to import pandas as pd
:
[1]:
import pandas as pd
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
Next, we’ll read the input data file, and store the contents of that file into a variable called data
Using the pandas.read_csv()
function:
[2]:
# Read the file using pandas
data = pd.read_csv('Kumpula-June-2016-w-metadata.txt')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
Reading data from different formats
pandas.read_csv()
is a general function for reading data files separated by commas, spaces, or other common separators. Here we only provided one argument (the filepath) to the pd.read_csv()
method. For a full list of available parameters, please refer to pandas documentation for pandas.read_csv, or run help(pd.read_csv)
.
Pandas has several different functions for parsing input data from different formats. There is, for example, a separate function for reading Excel files read_excel
. Another useful function is read_pickle
for reading data stored in the Python pickle format. Check out pandas documentation about input and output functions and Chapter 6 in
MacKinney (2017): Python for Data Analysis for more details about reading data.
If all goes as planned, you should now have a new variable data
in memory that contains the input data. You can check the the contents of this variable by calling data
or print(data)
:
[3]:
print(data)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-dbd883db58b7> in <module>()
----> 1 print(data)
NameError: name 'data' is not defined
This looks OK, but there are some strange values present such as NaN
, and the first lines of the dataframe look a bit weird..
NaN
stands for “not a number”, and might indicate some problem with reading in the contents of the file. Plus, we expected about 30 lines of data, but the index values go up to 36 when we print the contents of the data
variable. Looks like we need to investigate this further.
As we can observe, there are some metadata at the top of the file giving basic information about its contents and source. This isn’t data we want to process, so we need to skip over that part of the file when we load it.
Here are the 8 first rows of data in the text file (note that the 8th row is blank):
# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki
# for June 1-30, 2016
# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
# Data processing: Extracted temperatures from raw data file, converted to
# comma-separated format
#
# David Whipp - 02.10.2017
Fortunately, that’s easy to do in pandas, we just need to add the skiprows
parameter when we read the file, listing the number of rows to skip (8 in this case).
Let’s try reading the datafile again, and this time defining the skiprows
parameter.
[4]:
data = pd.read_csv('Kumpula-June-2016-w-metadata.txt', skiprows=8)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
Let’s now print the dataframe and see what changed:
[5]:
print(data)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-dbd883db58b7> in <module>()
----> 1 print(data)
NameError: name 'data' is not defined
After reading in the data, it is always good to check that everything went well (like we just did with the print-statement above). The challenge can also be that large datafiles might not nicely print on screen using the print()
-function so it might be better to look at only the top 5-10 lines of the file rather than loading the entire thing.
We can use pandas.DataFrame.head
to quickly check the contents of the dataframe. This method returns the first n rows for the dataframe. By default, it returns 5 first rows of the DataFrame:
[6]:
data.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-6-304fa4ce4ebd> in <module>()
----> 1 data.head()
NameError: name 'data' is not defined
We can also check the last rows of the data using data.tail()
:
[7]:
data.tail()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-0b73fe40d24a> in <module>()
----> 1 data.tail()
NameError: name 'data' is not defined
Note that Pandas that DataFrames have labelled axes (rows and columns). In our sample data, the rows labeled with an index value (0
to 29
), and columns labelled YEARMODA
, TEMP
, MAX
, and MIN
. Later on, we will learn how to use these labels for selecting and updating subsets of the data.
Let’s also confirm the data type of our data variable:
[8]:
type(data)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-8-b5f01a7c0d9a> in <module>()
----> 1 type(data)
NameError: name 'data' is not defined
No surprises here, our data variable is a Pandas DataFrame ;).
Note
We can use IPython magic commands to figure out what variables we have in memory. IPython magic command %who
will display names of those variables that you have defined during this session. Magic command %whose
prints out more information about these variables.
[9]:
# Display variable names:
%who
Interactive namespace is empty.
[10]:
# Display variable name, type and info
%whos
Interactive namespace is empty.
DataFrame properties¶
We now have some basic Python skills and the ability to read in data from a file for processing. A normal first step when you load new data is to explore the dataset a bit to understand how the data is structured, and what kind of values are stored in there.
Let’s start by checking the size of our data frame. We can use the len()
function similarly as with lists to check how many rows we have:**
[11]:
# Print number of rows using len()-function
print(len(data))
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-11-21e0eb75ee61> in <module>()
1 # Print number of rows using len()-function
----> 2 print(len(data))
NameError: name 'data' is not defined
We can also get a quick sense of the size of the dataset using the shape
attribute.
[12]:
# Print dataframe shape
print(data.shape)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-12-f0ce8541e3ce> in <module>()
1 # Print dataframe shape
----> 2 print(data.shape)
NameError: name 'data' is not defined
Here we see that our dataset has 30 rows, 4 columns, just as we saw above when printing out the whole DataFrame.
We can also check the column names we have in our DataFrame. We already saw the column names when we checked the 5 first rows using data.head()
, but often it is useful to access only the column names. You can call data.columns
(returns the an index object) or data.columns.values
(returns a list of column values) to check the column names:
[13]:
#Print column values
print(data.columns.values)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-13-04d859114cf0> in <module>()
1 #Print column values
----> 2 print(data.columns.values)
NameError: name 'data' is not defined
We can also find information about the row identifiers using the index
attribute:
[14]:
#Print index
print(data.index)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-14-5ddd1cfe314c> in <module>()
1 #Print index
----> 2 print(data.index)
NameError: name 'data' is not defined
Here we see how the data is indexed, starting at 0, ending at 30, and with an increment of 1 between each value. This is basically the same way in which Python lists are indexed, however, pandas allows also other ways of identifying the rows. DataFrame indices could, for example, be character strings, or date objects (you will learn more about re-setting the index later). Eventually, the “length” of the DataFrame (the number of rows) is actually the length of the index:
[15]:
#Check length of the index
len(data.index)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-15-f64fbfea1985> in <module>()
1 #Check length of the index
----> 2 len(data.index)
NameError: name 'data' is not defined
What about the data types of each column in our dataFrame? We can check the data type of all the columns at once using pandas.DataFrame.dtypes
:
[16]:
# Print data types
print(data.dtypes)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-16-db9b01e52bf4> in <module>()
1 # Print data types
----> 2 print(data.dtypes)
NameError: name 'data' is not defined
Here we see that YEARMODA
is an integer value (with 64-bit precision; int64), while the other values are all decimal values with 64-bit precision (float64).
Selecting columns¶
In pandas, we select columns based on the column values (columns names). The basic syntax is dataframe[value]
, where value can be a single column name, or a list of column names. Let’s start by selecting two columns, 'YEARMODA'
and 'TEMP'
:
[17]:
selection = data[['YEARMODA','TEMP']]
print(selection)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
Let’s also check the data type of this selection:
[18]:
type(selection)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-18-221f8e6237eb> in <module>()
----> 1 type(selection)
NameError: name 'selection' is not defined
The subset is still a pandas DataFrame, and we are able to use all the methods related to a pandas DataFrame also with this subset. For example, we can check the shape:
[19]:
selection.shape
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-19-937bde88912d> in <module>()
----> 1 selection.shape
NameError: name 'selection' is not defined
We can also access a single column of the data based on the column name:
[20]:
data['TEMP']
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-20-cf7d4568b949> in <module>()
----> 1 data['TEMP']
NameError: name 'data' is not defined
What about the type of the column itself?
[21]:
# Check datatype of the column
type(data['TEMP'])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-21-4d3312a9365e> in <module>()
1 # Check datatype of the column
----> 2 type(data['TEMP'])
NameError: name 'data' is not defined
Each column (and each row) in a pandas data frame is actually a pandas Series - a 1 dimensional data structure!
Note
You can also retreive a column using a different syntax:
data.TEMP
This syntax works only if the column name is a valid name for a Python variable (e.g. the column name should not contain whitespace). The syntax data["column"]
works for all kinds of column names, so we recommend using this approach :)
[22]:
type(data.TEMP)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-22-624cdce2d39a> in <module>()
----> 1 type(data.TEMP)
NameError: name 'data' is not defined
Descriptive statistics¶
Pandas DataFrames and Series contain useful methods for getting summary statistics. Available methods include mean()
, median()
, min()
, max()
, and std()
(the standard deviation).
We could, for example, check the mean temperature in our input data. We check the mean for a single column (Series):
[23]:
# Check mean value of a column
data['TEMP'].mean()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-23-5109ed6a0477> in <module>()
1 # Check mean value of a column
----> 2 data['TEMP'].mean()
NameError: name 'data' is not defined
and for all columns (in the DataFrame):
[24]:
# Check mean value for all columns
data.mean()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-24-80156cf27b9a> in <module>()
1 # Check mean value for all columns
----> 2 data.mean()
NameError: name 'data' is not defined
For an overview of the basic statistics for all attributes in the data, we can use the describe()
method:
[25]:
# Get descriptive statistics
data.describe()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-25-39a8f25af9ce> in <module>()
1 # Get descriptive statistics
----> 2 data.describe()
NameError: name 'data' is not defined
Extra: Very basic plots¶
Pandas comes with a handful of plotting methods, which all rely on the plotting library matplotlib.
For very basic plots, we don’t need to import matplotlib separately, but we need to run a magic command %matplotlib inline in order to make the plots visible in our notebook:
[26]:
%matplotlib inline
We can already achieve very simple plots using the DataFrame.plot
-method. Let’s plot all the columns that contain values related to temperatures:
[27]:
data[["TEMP", "MAX", "MIN"]].plot()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-27-bc9ee740c18a> in <module>()
----> 1 data[["TEMP", "MAX", "MIN"]].plot()
NameError: name 'data' is not defined
We can could also at the distribution of temperature values using a boxplot visualization:
Now, you might want to start modifying the plots by adding axis labels, color settings and other formatting. For anythin more advanced, we should import matplotlib (import matplotlib.pyplot as plt
). We will learn more about matplotlib and plotting data during week 7 :)
Extra: From lists to pandas objects¶
Most often we create pandas objects by reading in data from an external source, such as a text file. Here, we will briefly see how you can create pandas objects from Python lists. If you have long lists of numbers, for instance, creating a Pandas Series will allow you to interact with these values more efficiently in terms of computing time.
[28]:
# Create Pandas Series from a list
number_series = pd.Series([ 4, 5, 6, 7.0])
print(number_series)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
Note that Pandas is smart about the conversion, detecting a single floating point value (7.0
) and assigning all values in the Series the data type float64.
If needed, you can also set a custom index when creating the object:
[29]:
number_series = pd.Series([ 4, 5, 6, 7.0], index=['a','b','c','d'])
print(number_series)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
[30]:
type(number_series)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-30-b4139431d30b> in <module>()
----> 1 type(number_series)
NameError: name 'number_series' is not defined
How about combining several lists as a DataFrame? Let’s take a subset of the lists we used in Exercise 3, problem 3 and see how we could combine those as a pandas DataFrame:
[31]:
# Station names
stations = ['Hanko Russarö', 'Heinola Asemantaus', 'Helsinki Kaisaniemi', 'Helsinki Malmi airfield']
# Latitude coordinates of Weather stations
lats = [59.77, 61.2, 60.18, 60.25]
# Longitude coordinates of Weather stations
lons = [22.95, 26.05, 24.94, 25.05]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
Often we indeed create pandas DataFrames by reading in data (e.g. using pd.read_csv(filename.csv)
), but sometimes your might also combine lists into a DataFrame inside the script using the pandas.DataFrame
constructor. Here, we are using a Python dictionary {"column_1": list_1, "column_2": list_2, ...}
to indicate the structure of our data.
[32]:
new_data = pd.DataFrame(data = {"station_name" : stations, "lat" : lats, "lon" : lons})
new_data
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
[33]:
type(new_data)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-33-78c31de88419> in <module>()
----> 1 type(new_data)
NameError: name 'new_data' is not defined
Often, you might start working with an empty data frame in stead of existing lists:
[34]:
df = pd.DataFrame()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/python/3.8.0/lib/python3.8/codeop.py in __call__(self, source, filename, symbol)
131
132 def __call__(self, source, filename, symbol):
--> 133 codeob = compile(source, filename, symbol, self.flags, 1)
134 for feature in _features:
135 if codeob.co_flags & feature.compiler_flag:
TypeError: required field "type_ignores" missing from Module
[35]:
print(df)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-35-2a851eda2e88> in <module>()
----> 1 print(df)
NameError: name 'df' is not defined
Check more details about available paramenters and methods from the pandas.DataFrame documentation.
That’s it! Next, we will have a look at basic operations for data analysis in Pandas.