Exploring data using Pandas¶
Our first task in this week’s lesson is to learn how to read and explore data files using Pandas.
While exploring the weather data, we will get familiar with Pandas data structures: Series and DataFrame. Pandas DataFrame (a 2-dimensional data structure) is used for storing and mainpulating table-like data (data with rows and columns) in Python. You can think of Pandas DataFrame as an attribute table (an excel-like spreadsheet, but much better!). Pandas Series (a 1-dimensional data structure) is used for storing and manipulating an sequence of values. Pandas Series is kind of like a list, but more clever. One row or one column in a Pandas DataFrame is actually a Pandas Series. For a comprehensive overview of pandas data structures, you can have a look at Wes MacKinney’s book Python for Data Analysis (2nd Edition, 2017) and Pandas online documentation about data structures.
Reading data files using Pandas will make life a bit easier compared to the traditional Python way of reading data files. If you’re curious about that, you can check out some of the lesson materials from previous years about reading data in the Pythonic way.
Preparations for this lesson (working environment and input data)¶
- Start a jupyter lab instance (using the links at the top of this page) and open this lesson in an interactive mode.
- Have a look at the input data which is a text file containing weather observations from Kumpula
- The data file name is Kumpula-June-2016-w-metadata.txt
- The file is available in the binder and CSC notebook instances, under L5 folder
- The data file contains observed daily mean, minimum, and maximum temperatures from June 2016 recorded from the Kumpula weather observation station in Helsinki.
- The data has been derived from a data file of daily temperature measurments downloaded from the US National Oceanographic and Atmospheric Administration’s National Centers for Environmental Information climate database.
- There are something like 30 lines of data in the data file.
Reading a data file with Pandas¶
Now we’re ready to read in our temperature data file. First, we need to import the Pandas module.
[8]:
import pandas as pd
That’s it, Pandas is ready to use now.
Notice we imported the Pandas module with the name pd
.
Now we’ll read the file data into a variable called ``dataFrame``. Using the pandas.read_csv
-function.
[9]:
# Read the file using pandas
dataFrame = pd.read_csv('Kumpula-June-2016-w-metadata.txt')
pd.read_csv()
is a general function for reading data files separated by commas, spaces, or other common separators. For a full list of parameters for this function, please refer to pandas documentation for pandas.read_csv.
Here we use the function simply by giving the filename as an input parameter. If all goes as planned, you should now have a new variable defined as dataFrame
in memory that contains the data file’s contents. You can check the the contents of this variable by typing
[10]:
print(dataFrame)
# Data file contents: Daily temperatures (mean min \
0 # for June 1-30 2016
1 # Data source: https://www.ncdc.noaa.gov/cdo-w... NaN
2 # Data processing: Extracted temperatures from... converted to
3 # comma-separated format NaN
4 # NaN
5 # David Whipp - 02.10.2017 NaN
6 YEARMODA TEMP
7 20160601 65.5
8 20160602 65.8
9 20160603 68.4
10 20160604 57.5
11 20160605 51.4
12 20160606 52.2
13 20160607 56.9
14 20160608 54.2
15 20160609 49.4
16 20160610 49.5
17 20160611 54.0
18 20160612 55.4
19 20160613 58.3
20 20160614 59.7
21 20160615 63.4
22 20160616 57.8
23 20160617 60.4
24 20160618 57.3
25 20160619 56.3
26 20160620 59.3
27 20160621 62.6
28 20160622 61.7
29 20160623 60.9
30 20160624 61.1
31 20160625 65.7
32 20160626 69.6
33 20160627 60.7
34 20160628 65.4
35 20160629 65.8
36 20160630 65.7
max) for Kumpula Helsinki
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 MAX MIN
7 73.6 54.7
8 80.8 55.0
9 77.9 55.6
10 70.9 47.3
11 58.3 43.2
12 59.7 42.8
13 65.1 45.9
14 60.4 47.5
15 54.1 45.7
16 55.9 43.0
17 62.1 41.7
18 64.2 46.0
19 68.2 47.3
20 67.8 47.8
21 70.3 49.3
22 67.5 55.6
23 70.7 55.9
24 62.8 54.0
25 59.2 54.1
26 69.1 52.2
27 71.4 50.4
28 70.2 55.4
29 67.1 54.9
30 68.9 56.7
31 75.4 57.9
32 77.7 60.3
33 70.0 57.6
34 73.0 55.8
35 73.2 59.7
36 72.7 59.2
This looks OK, but there are some strange values present such as NaN
, and the first lines of the dataframe look a bit weird..
NaN
stands for “not a number”, and might indicate some problem with reading in the contents of the file. Plus, we expected about 30 lines of data, but the index values go up to 36 when we print the contents of the dataFrame
variable. Looks like we need to investigate this further.
As we can observe, there are some metadata at the top of the file giving basic information about its contents and source. This isn’t data we want to process, so we need to skip over that part of the file when we load it.
Here are the 8 first rows of data in the text file (note that the 8th row is blank):
# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki
# for June 1-30, 2016
# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
# Data processing: Extracted temperatures from raw data file, converted to
# comma-separated format
#
# David Whipp - 02.10.2017
Fortunately, that’s easy to do in Pandas, we just need to add the ``skiprows`` parameter when we read the file, listing the number of rows to skip (8 in this case).
Let’s try reading the datafile again using Pandas, and this time defining the skiprows
parameter.
[12]:
dataFrame = pd.read_csv('Kumpula-June-2016-w-metadata.txt', skiprows=8)
Let’s now print the rows and see what changed.
[13]:
print(dataFrame)
YEARMODA TEMP MAX MIN
0 20160601 65.5 73.6 54.7
1 20160602 65.8 80.8 55.0
2 20160603 68.4 77.9 55.6
3 20160604 57.5 70.9 47.3
4 20160605 51.4 58.3 43.2
5 20160606 52.2 59.7 42.8
6 20160607 56.9 65.1 45.9
7 20160608 54.2 60.4 47.5
8 20160609 49.4 54.1 45.7
9 20160610 49.5 55.9 43.0
10 20160611 54.0 62.1 41.7
11 20160612 55.4 64.2 46.0
12 20160613 58.3 68.2 47.3
13 20160614 59.7 67.8 47.8
14 20160615 63.4 70.3 49.3
15 20160616 57.8 67.5 55.6
16 20160617 60.4 70.7 55.9
17 20160618 57.3 62.8 54.0
18 20160619 56.3 59.2 54.1
19 20160620 59.3 69.1 52.2
20 20160621 62.6 71.4 50.4
21 20160622 61.7 70.2 55.4
22 20160623 60.9 67.1 54.9
23 20160624 61.1 68.9 56.7
24 20160625 65.7 75.4 57.9
25 20160626 69.6 77.7 60.3
26 20160627 60.7 70.0 57.6
27 20160628 65.4 73.0 55.8
28 20160629 65.8 73.2 59.7
29 20160630 65.7 72.7 59.2
That looks more like it.
So, what happened? Why did the dataframe look weird before we skipped the first rows?
Well, the file data was read into a Pandas DataFrame, which is a two-dimensional structure used for storing table-like data. A pandas dataframe contains a collection of columns, which can all be a different value type (string, float, int, boolean, etc.).
In our first attempt to read data from the text file, the first rows containing metadata did not belong to the actual data array, and that’s why the output looked weird.
By default, the ``read_csv``-function infers column names from the first line of the file. For other options, see parameter header
in the read_csv documentation.
What would happen if we skipped 9 rows? (try it out!)
[ ]:
In fact, one of the neat things in Pandas is that that DataFrames have labelled axes (rows and columns). For our example, we have the rows labeled with an index value (0
to 29
), and columns labelled YEARMODA
, TEMP
, MAX
, and MIN
. This is nice because we can easily use these labels to divide up our data and make interacting with it easier as you’ll see later in the lesson.
The example above, when trying to read a datafile with some header text (the metadata in this case), is very common. Reading data into Pandas is pretty easy, but it helps to have a sense of what the datafile looks like before you try to read it.
After reading in the data, it is always good to check that everything went well (for example as we did with the print-statement above). The challenge can also be that large datafiles might not nicely print on screen using the print()
-function so it might be better to look at only the top 5-10 lines of the file rather than loading the entire thing.
We can use ``pandas.DataFrame.head`` -function to quickly check the contents of the dataframe. This function returns the first n rows for the dataframe. By default, it gives 5 first rows.
[14]:
dataFrame.head()
[14]:
YEARMODA | TEMP | MAX | MIN | |
---|---|---|---|---|
0 | 20160601 | 65.5 | 73.6 | 54.7 |
1 | 20160602 | 65.8 | 80.8 | 55.0 |
2 | 20160603 | 68.4 | 77.9 | 55.6 |
3 | 20160604 | 57.5 | 70.9 | 47.3 |
4 | 20160605 | 51.4 | 58.3 | 43.2 |
Now we can move on to exploring our data.
Exploring our dataset¶
We now have some basic Python skills and the ability to read in data from a file for processing. A normal first step when you load new data is to explore the dataset a bit to understand how the data is structured, and what kind of values are stored in there.
Let’s start by looking at the different columns we have in our DataFrame. We can find this in the columns
attribute that is part of the DataFrame object (columns is an in-built attribute in the DataFrame data type).
[15]:
#Print columns
print(dataFrame.columns)
Index(['YEARMODA', 'TEMP', 'MAX', 'MIN'], dtype='object')
Here we see the names of the different columns in the datafile, as one might expect.
We can also find information about the rows in the datafile using the ``index`` attribute.
[16]:
#Print index
print(dataFrame.index)
RangeIndex(start=0, stop=30, step=1)
Here we see how the data is indexed, starting at 0, ending at 30, and with an increment of 1 between each value. This is basically the same way in which Python lists are indexed, but it suggests that maybe there are other ways to identify the rows in data using Pandas.
Again, we’ll see a bit more about this later.
For now, it is also useful to point out that if you want to just know how many rows you have, you can use the ``len()`` function.
[17]:
# Print number of rows using len()-function
# print(len(dataFrame.index))
print(len(dataFrame))
30
We can also get a quick sense of the size of the dataset using the ``shape`` attribute.
[18]:
# Print dataframe shape
print(dataFrame.shape)
(30, 4)
Here we see that our dataset has 30 rows, 4 columns, just as we saw above.
Now let’s consider the types of data we have in our DataFrame. First, let’s see what type of data the DataFrame is.
[19]:
# check data type of the dataFrame variable
type(dataFrame)
[19]:
pandas.core.frame.DataFrame
No surprises here, our Pandas DataFrame is a Pandas DataFrame ;).
NOTE: We can also use IPython magic to figure out what kind of variable we have in memory. IPython magic command %who
will display names of those variables that you have defined during this session. Magic command %whose
prints out more information about these variables:
[20]:
# Display variable names:
#%who
# Display variable name, type and info
%whos
Variable Type Data/Info
----------------------------------
dataFrame DataFrame YEARMODA TEMP MAX <...>0160630 65.7 72.7 59.2
pd module <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
So, what about the data types of each column in our dataFrame? Finding the types of data in the columns of the DataFrame is easy:
[21]:
# Print data types
print(dataFrame.dtypes)
YEARMODA int64
TEMP float64
MAX float64
MIN float64
dtype: object
The dtypes
attribute holds the data types for each column, nice. Here we see that YEARMODA
is an integer value (with 64-bit precision; int64), while the other values are all decimal values with 64-bit precision (float64).
We can also select a single column of the data using the column name:
[22]:
#Print out a single column
print(dataFrame['TEMP'])
0 65.5
1 65.8
2 68.4
3 57.5
4 51.4
5 52.2
6 56.9
7 54.2
8 49.4
9 49.5
10 54.0
11 55.4
12 58.3
13 59.7
14 63.4
15 57.8
16 60.4
17 57.3
18 56.3
19 59.3
20 62.6
21 61.7
22 60.9
23 61.1
24 65.7
25 69.6
26 60.7
27 65.4
28 65.8
29 65.7
Name: TEMP, dtype: float64
Furthermore, printing out its values shows not only the values, but also their data type. What about the type of the column itself?
[23]:
# Check datatype of that column
type(dataFrame['TEMP'])
[23]:
pandas.core.series.Series
Interesting.
So rather than seeing a DataFrame type or float64, a selected column from a DataFrame is called a Series in Pandas. Here, we will briefly see how you can create a Pandas Series from a Python list (and further convert it into a Pandas DataFrame..).
If you have long lists of numbers, for instance, creating a Pandas Series will allow you to interact with these values more efficiently in terms of computing time.
[24]:
# Create Pandas Series from a list
myList = [1, 2, 3, 4, 5, 6, 7.0]
mySeries = pd.Series(myList)
print(mySeries)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
As you can see, myList
is converted to a Pandas Series using the ps.Series()
function. Also, note that Pandas is smart about the conversion, detecting a single floating point value (7.0
) and assigning all values in the Series the data type float64.
In turn, we could convert this series back into a pandas dataframe by initiating a new pandas.DataFrame:
[25]:
# Check data type of the input
print(type(mySeries))
# Create a pandas dataframe
myDataFrame = pd.DataFrame(mySeries)
#Check the data type
print(type(myDataFrame))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
[26]:
print(myDataFrame)
0
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
So, now we have a pandas dataframe with one column of data, and an associated index. By default, the column name (label) has been set to ‘0’. We can still give our column a new name:
[27]:
# Rename the first column
myDataFrame.columns = ["number"]
print(myDataFrame)
number
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
Let’s continue with our original DataFrame dataFrame
and columns stored in there.
Just like DataFrames, Pandas Series have a set of attributes they know about themselves and methods they can use to make calculations using the Series data.
Useful methods include mean()
, median()
, min()
, max()
, and std()
(the standard deviation).
[28]:
# Check mean value of a column
dataFrame['TEMP'].mean()
[28]:
59.73
Here, we don’t even need to store dataFrame['TEMP']
as a separate series in order to find the mean value using the mean()
method
One useful function to get an overview of the basic statistics for all attributes in your DataFrame is the describe()
function.
[29]:
# Get descriptive statistics
dataFrame.describe()
[29]:
YEARMODA | TEMP | MAX | MIN | |
---|---|---|---|---|
count | 3.000000e+01 | 30.000000 | 30.000000 | 30.000000 |
mean | 2.016062e+07 | 59.730000 | 67.940000 | 51.750000 |
std | 8.803408e+00 | 5.475472 | 6.651761 | 5.634484 |
min | 2.016060e+07 | 49.400000 | 54.100000 | 41.700000 |
25% | 2.016061e+07 | 56.450000 | 63.150000 | 47.300000 |
50% | 2.016062e+07 | 60.050000 | 69.000000 | 54.050000 |
75% | 2.016062e+07 | 64.900000 | 72.375000 | 55.750000 |
max | 2.016063e+07 | 69.600000 | 80.800000 | 60.300000 |
Finally, there are occasions where you’ll need to convert data stored within a Series to another data type.
If you’re planning to print a large number of value to the screen, for instance, it might be helpful to have those values as character strings.
Data type conversions is most easily done using the astype()
method.
[30]:
#Convert temperature values to string
print(dataFrame['TEMP'].astype(str))
0 65.5
1 65.8
2 68.4
3 57.5
4 51.4
5 52.2
6 56.9
7 54.2
8 49.4
9 49.5
10 54.0
11 55.4
12 58.3
13 59.7
14 63.4
15 57.8
16 60.4
17 57.3
18 56.3
19 59.3
20 62.6
21 61.7
22 60.9
23 61.1
24 65.7
25 69.6
26 60.7
27 65.4
28 65.8
29 65.7
Name: TEMP, dtype: object
In this case, the object data type indicates the temperature values are stored as character strings. A more obvious case is converting to integer values.
[31]:
# Check the output
print(dataFrame['TEMP'].astype(int))
0 65
1 65
2 68
3 57
4 51
5 52
6 56
7 54
8 49
9 49
10 54
11 55
12 58
13 59
14 63
15 57
16 60
17 57
18 56
19 59
20 62
21 61
22 60
23 61
24 65
25 69
26 60
27 65
28 65
29 65
Name: TEMP, dtype: int32
Here you can clearly see the temperature values are now whole numbers.
NOTE:
Be careful with type conversions from floating point values to integers. The conversion simply drops the stuff to the right of the decimal point, so all values are rounded down to the nearest whole number. For example, 99.99 will be rounded to 99 as an integer. This can be dangerous in some cases.
Hence, it might be good to round the values before converting them to integers. Chaining the round and type conversion functions solves this issue as the .round(0).astype(int)
command first rounds the values with zero decimals and then converts those values into integers.
[32]:
# Integer, truncated:
print("Truncated values:")
print(dataFrame['TEMP'].astype(int).head())
# Add empty line:
print("\n")
print("Rounded values:")
#Integer, rounded:
print(dataFrame['TEMP'].round(0).astype(int).head())
Truncated values:
0 65
1 65
2 68
3 57
4 51
Name: TEMP, dtype: int32
Rounded values:
0 66
1 66
2 68
3 58
4 51
Name: TEMP, dtype: int32
That’s it! Next, we will have a look at basic operations for data analysis in Pandas.