Advanced data processing with Pandas
This week, we will continue developing our skills in using Pandas to analyze climate data. The aim of this lesson is to learn different functions for manipulating the data and doing simple analyses. In the end, our goal is to detect weather anomalies (stormy winds) in Helsinki during August 2017.
Reading the data
Notice that this time, we will read the actual data obtained from NOAA without making any modifications to it ourselves. The columns in the data are separated with a varying number of spaces (fixed width). The first rows and columns of the data look like the following:
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ...
029740 99999 201708040000 114 6 *** *** BKN * * * 25.0 03 ** ** ** ** ** ** ** 2 58 56 1005.6 ...
Because the data is separated with a varying number of spaces, we need to tell Pandas how to read it. Following the documentation of the ``read_csv()`` function, we can control the delimiter with the ``sep`` parameter: passing ``sep='\s+'`` separates the columns at any run of whitespace. Our data also includes No Data values marked with a varying number of ``*`` characters. We need to take those into account as well when reading the data, and we can tell Pandas to consider them as NaN by specifying ``na_values=['*', '**', '***', '****', '*****', '******']``.
- Let’s start by reading the data with pandas:
[1]:
import pandas as pd
fp = "data/6591337447542dat_sample.txt"
# Read data using varying amount of spaces as separator and specifying * characters as NoData values
data = pd.read_csv(fp, sep='\s+', na_values=['*', '**', '***', '****', '*****', '******'])
Exploring data and renaming columns
- Let’s see how the data looks by printing the first five rows with the ``head()`` function:
[2]:
print(data.head())
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M ... SLP \
0 29740 99999 201708040000 114 6 NaN NaN BKN NaN NaN ... 1005.6
1 29740 99999 201708040020 100 6 NaN 75.0 NaN NaN NaN ... NaN
2 29740 99999 201708040050 100 5 NaN 60.0 NaN NaN NaN ... NaN
3 29740 99999 201708040100 123 8 NaN 63.0 OVC NaN NaN ... 1004.7
4 29740 99999 201708040120 110 7 NaN 70.0 NaN NaN NaN ... NaN
ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
0 NaN 999.2 NaN NaN NaN NaN NaN NaN 0.0
1 29.68 NaN NaN NaN NaN NaN NaN NaN NaN
2 29.65 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN 998.4 NaN NaN NaN NaN NaN NaN 0.0
4 29.65 NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 33 columns]
Okay, so we can see that the data was successfully read into the DataFrame, and the asterisk (*) characters were indeed converted into NaN values.
- Let’s continue and check what columns we have.
[3]:
data.columns
[3]:
Index(['USAF', 'WBAN', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'CLG', 'SKC', 'L',
'M', 'H', 'VSB', 'MW', 'MW.1', 'MW.2', 'MW.3', 'AW', 'AW.1', 'AW.2',
'AW.3', 'W', 'TEMP', 'DEWP', 'SLP', 'ALT', 'STP', 'MAX', 'MIN', 'PCP01',
'PCP06', 'PCP24', 'PCPXX', 'SD'],
dtype='object')
As we can see, there are quite a few columns, but we are not interested in using all of them.
- Let’s select only the columns that might be useful for detecting exceptional weather conditions, i.e. ``YR--MODAHRMN``, ``DIR``, ``SPD``, ``GUS``, ``TEMP``, ``MAX``, and ``MIN``.
[4]:
# Specify a list of columns that will be selected from the DataFrame
select_cols = ['YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN']
# Do the selection
data = data[select_cols]
- Let’s see what our data looks like now by printing the last five rows and the data types.
[5]:
# Show last five rows
print(data.tail())
# Check the data types
print("\nData-types:\n")
print(data.dtypes)
YR--MODAHRMN DIR SPD GUS TEMP MAX MIN
67 201708042220 180 11 NaN 61 NaN NaN
68 201708042250 190 8 NaN 59 NaN NaN
69 201708042300 200 9 11.0 60 NaN NaN
70 201708042320 190 8 NaN 59 NaN NaN
71 201708042350 190 8 NaN 59 NaN NaN
Data-types:
YR--MODAHRMN int64
DIR int64
SPD int64
GUS float64
TEMP int64
MAX float64
MIN float64
dtype: object
The column names that we have are somewhat awkward. Let’s change them into more intuitive ones. This can be done easily with the ``rename()`` function.
We can define the new column names using a specific Python data type called a dictionary, in which we determine ``key: value`` pairs: the original column name (the one that will be replaced) is the key, and the new column name is the value.
- Let’s change the ``YR--MODAHRMN`` column into ``TIME``, ``SPD`` into ``SPEED``, and ``GUS`` into ``GUST``:
[6]:
# Create the dictionary with old and new names
name_conversion_dict = {'YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST'}
# Let's see what they look like and what is the type
print(name_conversion_dict)
print(type(name_conversion_dict))
{'YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST'}
<class 'dict'>
From above we can see that we have successfully created a dictionary of type ``dict``.
- Now we can change the column names by passing that dictionary to the ``columns`` parameter of the ``rename()`` function.
[7]:
# Rename the columns
data = data.rename(columns=name_conversion_dict)
# Print the new columns
print(data.columns)
Index(['TIME', 'DIR', 'SPEED', 'GUST', 'TEMP', 'MAX', 'MIN'], dtype='object')
Perfect, now our column names are easier to understand and use.
- Let’s check some basic statistics to understand our data better.
[8]:
print(data.describe())
TIME DIR SPEED GUST TEMP MAX \
count 7.200000e+01 72.000000 72.000000 20.000000 72.000000 2.000000
mean 2.017080e+11 229.555556 11.527778 17.700000 61.513889 66.500000
std 6.973834e+02 215.759248 3.756580 5.068998 3.175580 3.535534
min 2.017080e+11 80.000000 5.000000 11.000000 58.000000 64.000000
25% 2.017080e+11 117.750000 9.000000 13.000000 59.000000 65.250000
50% 2.017080e+11 200.000000 11.000000 16.000000 61.000000 66.500000
75% 2.017080e+11 220.000000 15.000000 22.250000 64.000000 67.750000
max 2.017080e+11 990.000000 20.000000 29.000000 69.000000 69.000000
MIN
count 2.000000
mean 57.000000
std 1.414214
min 56.000000
25% 56.500000
50% 57.000000
75% 57.500000
max 58.000000
Okay, so from here we can see that there is a varying number of observations per column (see the ``count`` information). For example, the ``SPEED`` and ``TEMP`` columns have 72 observations, whereas ``GUST`` has only 20 observations and ``MAX`` and ``MIN`` have only 2. From this we can already guess that the ``MAX`` and ``MIN`` attributes are most probably not going to be useful for us. However, ``GUST`` might be.
- Let’s explore our data further by checking its first 30 rows.
[9]:
print(data.head(30))
TIME DIR SPEED GUST TEMP MAX MIN
0 201708040000 114 6 NaN 58 NaN NaN
1 201708040020 100 6 NaN 59 NaN NaN
2 201708040050 100 5 NaN 59 NaN NaN
3 201708040100 123 8 NaN 59 NaN NaN
4 201708040120 110 7 NaN 59 NaN NaN
5 201708040150 100 6 NaN 61 NaN NaN
6 201708040200 138 10 13.0 59 NaN NaN
7 201708040220 120 10 NaN 59 NaN NaN
8 201708040250 100 9 NaN 59 NaN NaN
9 201708040300 108 9 12.0 59 NaN NaN
10 201708040320 90 8 NaN 59 NaN NaN
11 201708040350 80 9 NaN 59 NaN NaN
12 201708040400 102 11 15.0 58 NaN NaN
13 201708040420 80 10 NaN 59 NaN NaN
14 201708040450 80 10 NaN 59 NaN NaN
15 201708040500 119 12 17.0 58 NaN NaN
16 201708040520 990 11 NaN 59 NaN NaN
17 201708040550 100 13 NaN 59 NaN NaN
18 201708040600 121 16 23.0 58 64.0 56.0
19 201708040620 110 15 NaN 59 NaN NaN
20 201708040650 100 15 NaN 59 NaN NaN
21 201708040700 119 14 22.0 58 NaN NaN
22 201708040720 990 14 NaN 59 NaN NaN
23 201708040750 100 13 NaN 59 NaN NaN
24 201708040800 125 10 15.0 58 NaN NaN
25 201708040820 990 9 NaN 59 NaN NaN
26 201708040850 100 7 NaN 59 NaN NaN
27 201708040900 107 8 NaN 59 NaN NaN
28 201708040920 990 7 NaN 59 NaN NaN
29 201708040950 990 6 NaN 61 NaN NaN
Okay, so from here we can actually see that the ``GUST`` column contains information only at an hourly interval. That might be useful! Let’s keep this in mind.
TAKE HOME MESSAGE: Whenever starting a data analysis with a new dataset, it is highly useful to explore the data by calculating basic statistics from it (+ visualizing it, which we will learn later).
Iterating over rows and using self-made functions in Pandas
Let’s do the “SAME THING” as so many times before and convert our Fahrenheit temperatures into Celsius (sorry if we seem to lack imagination =) ).
This time, however, we will use a function of our own making to do the conversion.
- Let’s first define the function:
[10]:
def fahrToCelsius(temp_fahrenheit):
    """
    Function to convert Fahrenheit temperature into Celsius.

    Parameters
    ----------
    temp_fahrenheit: int | float
        Input temperature in Fahrenheit (should be a number)

    Returns
    -------
    Temperature in Celsius (float)
    """
    # Convert the Fahrenheit into Celsius and return it
    converted_temp = (temp_fahrenheit - 32) / 1.8
    return converted_temp
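As a quick sanity check (an extra step, not part of the original workflow), we can test the function with two well-known reference points:

# 32 °F is the freezing point of water and 212 °F its boiling point
print(fahrToCelsius(32))   # should print 0.0
print(fahrToCelsius(212))  # should print 100.0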
Let’s do the conversion by iterating over our data row by row and updating a column called ``Celsius`` that we will create.
We can iterate over the rows of a Pandas DataFrame with the ``iterrows()`` function. When iterating over the rows of our DataFrame, it is worth understanding that Pandas keeps track of the ``index`` value as well. Hence, the contents of a single row include not only the values but also the index of that row.
- Let’s see how it works. Here, we will use the Python command **``break``** to stop the iteration right after the first loop. This can be quite useful, as we don’t want to fill our console by printing all the values and indices in our DataFrame; we just want to see whether the iteration works as we expect:
[11]:
# Iterate over the rows
for idx, row in data.iterrows():
    # Print the index value
    print('Index:', idx)
    # Print the row
    print(row)
    # Stop iteration with break command
    break

# Let's see what is the type of our row
print("\nRow type:\n", type(row))
Index: 0
TIME 2.017080e+11
DIR 1.140000e+02
SPEED 6.000000e+00
GUST NaN
TEMP 5.800000e+01
MAX NaN
MIN NaN
Name: 0, dtype: float64
Row type:
<class 'pandas.core.series.Series'>
Okay, so here we can see that the ``idx`` variable indeed contains the index value of the first row (0) and the ``row`` variable contains all the data from that row, stored as a ``pd.Series``.
- Let’s now create an empty column for the Celsius temperatures and fill it with values computed by our function. Here is the whole procedure:
[12]:
# Create an empty column for the DataFrame where the values will be stored
col_name = 'Celsius'
data[col_name] = None
# Iterate over the rows
for idx, row in data.iterrows():
    # Convert the Fahrenheit to Celsius
    celsius = fahrToCelsius(row['TEMP'])
    # Update the 'Celsius' column with the converted value using .loc that we learned last week
    data.loc[idx, col_name] = celsius
- Let’s see what we have now.
[13]:
print(data.head())
TIME DIR SPEED GUST TEMP MAX MIN Celsius
0 201708040000 114 6 NaN 58 NaN NaN 14.4444
1 201708040020 100 6 NaN 59 NaN NaN 15
2 201708040050 100 5 NaN 59 NaN NaN 15
3 201708040100 123 8 NaN 59 NaN NaN 15
4 201708040120 110 7 NaN 59 NaN NaN 15
Great! Now we have converted our temperatures into Celsius by using a function that we created ourselves. Knowing how to use your own functions in Pandas can be really useful when doing your own analyses. There is also another, more powerful way of applying functions in Pandas that takes advantage of the **``apply()``** function, but we will learn that later.
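Just as a preview of that approach, here is a minimal sketch of how the same conversion could be condensed into a single line with ``apply()`` (we’ll stick to the loop in this lesson):

# apply() calls fahrToCelsius once for every value in the TEMP column
data['Celsius'] = data['TEMP'].apply(fahrToCelsius)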
- Finally, let’s convert the wind speeds into meters per second (m/s), as those units are more familiar to us in Finland. This can be done with the formula ``m/s = mph x 0.44704``:
[14]:
# Convert speeds from miles per hour to meters per second
data['SPEED'] = data['SPEED']*0.44704
data['GUST'] = data['GUST']*0.44704
# Print the first 5 values
print(data.head())
TIME DIR SPEED GUST TEMP MAX MIN Celsius
0 201708040000 114 2.68224 NaN 58 NaN NaN 14.4444
1 201708040020 100 2.68224 NaN 59 NaN NaN 15
2 201708040050 100 2.23520 NaN 59 NaN NaN 15
3 201708040100 123 3.57632 NaN 59 NaN NaN 15
4 201708040120 110 3.12928 NaN 59 NaN NaN 15
String manipulation in Pandas
[17]:
print(data.head(20))
TIME DIR SPEED GUST TEMP MAX MIN Celsius
0 201708040000 114 2.68224 NaN 58 NaN NaN 14.4444
1 201708040020 100 2.68224 NaN 59 NaN NaN 15
2 201708040050 100 2.23520 NaN 59 NaN NaN 15
3 201708040100 123 3.57632 NaN 59 NaN NaN 15
4 201708040120 110 3.12928 NaN 59 NaN NaN 15
5 201708040150 100 2.68224 NaN 61 NaN NaN 16.1111
6 201708040200 138 4.47040 5.81152 59 NaN NaN 15
7 201708040220 120 4.47040 NaN 59 NaN NaN 15
8 201708040250 100 4.02336 NaN 59 NaN NaN 15
9 201708040300 108 4.02336 5.36448 59 NaN NaN 15
10 201708040320 90 3.57632 NaN 59 NaN NaN 15
11 201708040350 80 4.02336 NaN 59 NaN NaN 15
12 201708040400 102 4.91744 6.70560 58 NaN NaN 14.4444
13 201708040420 80 4.47040 NaN 59 NaN NaN 15
14 201708040450 80 4.47040 NaN 59 NaN NaN 15
15 201708040500 119 5.36448 7.59968 58 NaN NaN 14.4444
16 201708040520 990 4.91744 NaN 59 NaN NaN 15
17 201708040550 100 5.81152 NaN 59 NaN NaN 15
18 201708040600 121 7.15264 10.28192 58 64.0 56.0 14.4444
19 201708040620 110 6.70560 NaN 59 NaN NaN 15
When looking at the data more carefully, we can see something interesting: ``GUST`` seems to be measured only once an hour, whereas ``SPEED`` (wind speed) and our temperatures seem to be measured approximately every 20 minutes (at minutes XX:00, XX:20 and XX:50).
That might be a problem, as we may not be able to compare e.g. the average wind speeds and the gust speeds with each other, since they are measured at different intervals. This kind of mismatch between the sampling rates of measurements is actually quite typical when working with real data.
We can solve this problem by aggregating the wind speeds into hourly level data so that the attributes become comparable. First we need to be able to group the values by hour. This can be done e.g. by slicing the date + hour from the ``TIME`` column (i.e. removing the minutes from the end of the value).
Doing this requires two steps:
1. Convert the ``TIME`` column from ``int`` into ``str`` datatype.
2. Include only the numbers up to hourly accuracy (excluding the minutes) by slicing the text.
Note: There are also more advanced functions in Pandas for time series manipulation that utilize the **``datetime``** datatype and the **``resample()``** function, but we won’t cover those here. You can read the Pandas docs if you are interested.
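For the curious, here is a minimal sketch of that datetime-based approach (the column selection is our own assumption to keep the example numeric; we won’t use this in the lesson):

# Parse TIME (e.g. 201708040000) into datetimes and resample to hourly means
data['DATE'] = pd.to_datetime(data['TIME'].astype(str), format='%Y%m%d%H%M')
hourly = data.set_index('DATE')[['SPEED', 'GUST', 'TEMP']].resample('1H').mean()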
- Let’s convert the time into a string and check that the data type changes:
[21]:
# Convert to string
data['TIME_str'] = data['TIME'].astype(str)
# Check data types
print("Data type of the column:")
print(data['TIME_str'].dtypes)
print("\nData type of the first value in column:")
print(type(data.loc[0, 'TIME_str']))
Data type of the column:
object
Data type of the first value in column:
<class 'str'>
Okay, it seems that now we indeed have the time as ``str`` datatype as well.
- Now we can slice the values to an hourly level by including only the first 10 characters of the text (i.e. excluding the minute-level information).
[22]:
# Slice the string
data['TIME_dh'] = data['TIME_str'].str.slice(start=0, stop=10)
# Let's see what we have
print(data.head())
TIME DIR SPEED GUST TEMP MAX MIN Celsius TIME_str \
0 201708040000 114 2.68224 NaN 58 NaN NaN 14.4444 201708040000
1 201708040020 100 2.68224 NaN 59 NaN NaN 15 201708040020
2 201708040050 100 2.23520 NaN 59 NaN NaN 15 201708040050
3 201708040100 123 3.57632 NaN 59 NaN NaN 15 201708040100
4 201708040120 110 3.12928 NaN 59 NaN NaN 15 201708040120
TIME_dh
0 2017080400
1 2017080400
2 2017080400
3 2017080401
4 2017080401
Nice! Now we have information about time on an hourly basis including the date as well.
Note: All the typical ``str`` functionalities can be applied to a Series of text data with the syntax ``data['mySeries'].str.<functionToUse>()``.
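For example, the slicing we just did could also be written with plain indexing through the accessor, and other string methods work the same way (an illustration only; we won’t need these below):

# Equivalent to .str.slice(start=0, stop=10)
print(data['TIME_str'].str[:10].head())
# Length of each time string as an integer Series
print(data['TIME_str'].str.len().head())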
- Let’s also slice only the hour of the day (excluding information about the date) and convert it back to integer (we will be using this information later):
[23]:
# Slice the string to parse the hour from 'TIME_str' column
data['TIME_h'] = data['TIME_str'].str.slice(start=8, stop=10)
# Convert the hour text into integer format
data['TIME_h'] = data['TIME_h'].astype(int)
print(data.head())
TIME DIR SPEED GUST TEMP MAX MIN Celsius TIME_str \
0 201708040000 114 2.68224 NaN 58 NaN NaN 14.4444 201708040000
1 201708040020 100 2.68224 NaN 59 NaN NaN 15 201708040020
2 201708040050 100 2.23520 NaN 59 NaN NaN 15 201708040050
3 201708040100 123 3.57632 NaN 59 NaN NaN 15 201708040100
4 201708040120 110 3.12928 NaN 59 NaN NaN 15 201708040120
TIME_dh TIME_h
0 2017080400 0
1 2017080400 0
2 2017080400 0
3 2017080401 1
4 2017080401 1
Wunderbar, now we also have a separate column for the hour of the day!
Aggregating data in Pandas by grouping
Next we want to calculate the average temperatures, wind speeds, etc. on an hourly basis to enable us to compare all of them to each other.
This can be done by aggregating the data, i.e.:
- Grouping the data based on the hourly values
- Iterating over those groups and calculating the average values of our attributes
- Inserting those values into a new DataFrame where we store the aggregated data
- Let’s first create a new empty DataFrame where we will store our aggregated data
[24]:
# Create a new empty DataFrame
aggr_data = pd.DataFrame()
- Let’s then group our data based on the ``TIME_dh`` attribute, which contains the information about the date + hour.
[25]:
# Group the data
grouped = data.groupby('TIME_dh')
- Let’s see what we have now.
[26]:
# What is the type?
print("Type:\n", type(grouped))
# How many?
print("Length:\n", len(grouped))
Type:
<class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
Length:
24
Okay, interesting. Now we have a new object of type ``DataFrameGroupBy``. And it seems that we have 24 individual groups in our data, i.e. one group for each hour of the day.
As you might have noticed earlier, the first hour in our data is ``2017080400`` (midnight on the 4th of August 2017).
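If you want to peek at the grouping keys directly, the ``DataFrameGroupBy`` object exposes them through its ``groups`` attribute; a small sketch:

# The group keys, i.e. the unique date + hour values used for the grouping
print(list(grouped.groups.keys())[:5])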
- Let’s now see what we have in our ``grouped`` variable, e.g. for the first hour ``2017080400``. We can get the values of that hour from the ``DataFrameGroupBy`` object with the ``get_group()`` function.
[29]:
# Specify the time of the first hour (as text)
time1 = '2017080400'
# Select the group
group1 = grouped.get_group(time1)
# Let's see what we have
print(group1)
TIME DIR SPEED GUST TEMP MAX MIN Celsius TIME_str \
0 201708040000 114 2.68224 NaN 58 NaN NaN 14.4444 201708040000
1 201708040020 100 2.68224 NaN 59 NaN NaN 15 201708040020
2 201708040050 100 2.23520 NaN 59 NaN NaN 15 201708040050
TIME_dh TIME_h
0 2017080400 0
1 2017080400 0
2 2017080400 0
Ahaa! As we can see, a single group contains a DataFrame with values only for that specific hour. This is really useful, because now we can calculate e.g. the average values of all the weather measurements (+ hour) that we have (you can use any of the statistical functions that we have already seen, e.g. mean, std, min, max, median, etc.).
We can do that by using the ``mean()`` function that we already used during Lesson 5.
- Let’s calculate the mean of the following attributes all at one go: ``DIR``, ``SPEED``, ``GUST``, ``TEMP``, and ``Celsius`` (we also include ``TIME_h`` in the calculation so that the hour is carried along).
[31]:
# Specify the columns that will be part of the calculation
mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP', 'Celsius', 'TIME_h']
# Calculate the mean values all at one go
mean_values = group1[mean_cols].mean()
# Let's see what we have
print(mean_values)
DIR 104.666667
SPEED 2.533227
GUST NaN
TEMP 58.666667
Celsius 14.814815
TIME_h 0.000000
dtype: float64
Nice, now we have averaged our data, and e.g. the mean Celsius temperature seems to be about right when compared to the original values above. As you saw from this example, it is possible to do calculations on multiple columns at the same time efficiently with Pandas.
Notice that we still have information about the hour but not about the date, which is at the moment stored in the ``time1`` variable. We can insert that datetime information into our ``mean_values`` Series so that the date is also associated with our data.
[32]:
# Add the time information into the pandas.Series
mean_values['TIME_dh'] = time1
# Let's see what we have
print(mean_values)
DIR 104.667
SPEED 2.53323
GUST NaN
TEMP 58.6667
Celsius 14.8148
TIME_h 0
TIME_dh 2017080400
dtype: object
Perfect! Now we have also time information there.
The last thing to do is to add these mean values into the DataFrame that we created earlier. That can be done with the ``append()`` function in a quite similar manner as with Python lists. In Pandas, however, the insertion is not done in place (unlike when appending to Python lists), so we need to assign the result back to ``aggr_data`` (using the ``=`` sign). We also need to specify that we ignore the index values of our original DataFrame (i.e. the indices of ``mean_values``).
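(A word of warning: in recent Pandas versions, ``DataFrame.append()`` has been removed. If you are running Pandas 2.x, an equivalent, sketched under that assumption, uses ``pd.concat``:)

# Pandas 2.x equivalent of aggr_data.append(mean_values, ignore_index=True):
# turn the Series into a one-row DataFrame and concatenate it
aggr_data = pd.concat([aggr_data, mean_values.to_frame().T], ignore_index=True)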
[33]:
# Add the values into our aggr_data DataFrame that we created in the beginning
aggr_data = aggr_data.append(mean_values, ignore_index=True)
# Let's see what we have
print(aggr_data)
Celsius DIR GUST SPEED TEMP TIME_dh TIME_h
0 14.814815 104.666667 NaN 2.533227 58.666667 2017080400 0.0
As we can see, we now have a single row in our new DataFrame where the data has been aggregated to hourly mean values. Next we could continue and insert the average values of the other hours in a similar manner, but of course that is not something we want to do manually (it would require repeating the same steps far too many times). Luckily, we can iterate over all the groups in our data and do these steps for each of them using a ``for`` loop.
When iterating over the groups in our ``DataFrameGroupBy`` object, it is important to understand that a single group actually contains not only the values but also information about the ``key`` that was used for the grouping. Hence, when iterating over the data we need to assign the key and the values to separate variables.
- Let’s see how we can iterate over the groups and print the key and the data of a single group (again using ``break`` so we only see what is happening).
[36]:
# Iterate over groups
for key, group in grouped:
    # Print key and group
    print("Key:\n", key)
    print("\nGroup:\n", group)
    # Stop iteration with break command
    break
Key:
2017080400
Group:
TIME DIR SPEED GUST TEMP MAX MIN Celsius TIME_str \
0 201708040000 114 2.68224 NaN 58 NaN NaN 14.4444 201708040000
1 201708040020 100 2.68224 NaN 59 NaN NaN 15 201708040020
2 201708040050 100 2.23520 NaN 59 NaN NaN 15 201708040050
TIME_dh TIME_h
0 2017080400 0
1 2017080400 0
2 2017080400 0
Okay, so from here we can see that the ``key`` contains the value ``2017080400``, which is the same as the values in the ``TIME_dh`` column, meaning that we did indeed group the values based on that column.
- Let’s see how we can create a DataFrame where we calculate the mean values of all those weather attributes that we are interested in. We will repeat some of the earlier steps so that you can see the full picture and better understand what is happening.
[38]:
# Create an empty DataFrame for the aggregated values
aggr_data = pd.DataFrame()
# The columns that we want to aggregate
mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP', 'Celsius', 'TIME_h']
# Iterate over the groups
for key, group in grouped:
    # Aggregate the data
    mean_values = group[mean_cols].mean()
    # Add the 'key' (i.e. the date + time information) into the aggregated values
    mean_values['TIME_dh'] = key
    # Append the aggregated values into the DataFrame
    aggr_data = aggr_data.append(mean_values, ignore_index=True)
- Let’s see what we have now:
[39]:
print(aggr_data)
Celsius DIR GUST SPEED TEMP TIME_dh TIME_h
0 14.814815 104.666667 NaN 2.533227 58.666667 2017080400 0.0
1 15.370370 111.000000 NaN 3.129280 59.666667 2017080401 1.0
2 15.000000 119.333333 5.81152 4.321387 59.000000 2017080402 2.0
3 15.000000 92.666667 5.36448 3.874347 59.000000 2017080403 3.0
4 14.814815 87.333333 6.70560 4.619413 58.666667 2017080404 4.0
5 14.814815 403.000000 7.59968 5.364480 58.666667 2017080405 5.0
6 14.814815 110.333333 10.28192 6.854613 58.666667 2017080406 6.0
7 14.814815 403.000000 9.83488 6.109547 58.666667 2017080407 7.0
8 14.814815 405.000000 6.70560 3.874347 58.666667 2017080408 8.0
9 15.370370 695.666667 NaN 3.129280 59.666667 2017080409 9.0
10 16.481481 225.000000 5.81152 4.768427 61.666667 2017080410 10.0
11 17.777778 241.666667 8.49376 5.513493 64.000000 2017080411 11.0
12 18.888889 228.333333 6.70560 5.960533 66.000000 2017080412 12.0
13 19.629630 229.666667 8.94080 7.152640 67.333333 2017080413 13.0
14 20.185185 228.666667 12.96416 8.940800 68.333333 2017080414 14.0
15 19.074074 218.333333 10.72896 7.450667 66.333333 2017080415 15.0
16 18.703704 214.666667 10.28192 7.152640 65.666667 2017080416 16.0
17 17.592593 209.666667 8.94080 7.003627 63.666667 2017080417 17.0
18 16.851852 211.333333 10.28192 5.662507 62.333333 2017080418 18.0
19 16.111111 203.000000 5.81152 4.023360 61.000000 2017080419 19.0
20 15.925926 198.000000 5.36448 4.023360 60.666667 2017080420 20.0
21 15.925926 186.666667 NaN 3.874347 60.666667 2017080421 21.0
22 15.555556 189.000000 6.70560 4.619413 60.000000 2017080422 22.0
23 15.185185 193.333333 4.91744 3.725333 59.333333 2017080423 23.0
Awesome! Now we have aggregated our data based on hourly averages, and we have a new DataFrame called ``aggr_data`` where all those aggregated values are stored.
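As a side note, the same hourly aggregation can be written much more compactly without an explicit loop; a minimal sketch, assuming the same ``data`` and ``mean_cols`` as above:

# Ensure Celsius is numeric (we created it as an object column earlier)
data['Celsius'] = data['Celsius'].astype(float)
# Aggregate all groups at one go: the mean of the selected columns per TIME_dh
aggr_data_alt = data.groupby('TIME_dh')[mean_cols].mean().reset_index()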
Finding outliers from the data
Finally, we are ready to do some real data analytics and check whether we can find any outliers in our data suggesting a storm (meaning strong winds, in this case).
Here, we define a wind speed value as an outlier if it is more than two standard deviations above the average wind speed (column ``SPEED``).
- Let’s first find out the standard deviation and the mean of the wind speed.
[40]:
# Calculate standard deviation and average wind speed
std_wind = aggr_data['SPEED'].std()
avg_wind = aggr_data['SPEED'].mean()
print('Std:', std_wind)
print('Mean:', avg_wind)
Std: 1.6405694308360985
Mean: 5.153377777777777
Okay, so the standard deviation of the wind speed is approximately 1.6 meters per second, and the average wind speed is approximately 5.2 m/s.
- Hence, the threshold for a wind speed to be considered an outlier by our criteria is:
[41]:
# Calculate the upper threshold for an outlier
upper_threshold = avg_wind + (std_wind*2)
print('Upper threshold for outlier:', upper_threshold)
Upper threshold for outlier: 8.434516639449974
- Let’s finally create a column called ``Outlier``, which we set to ``True`` if the wind speed is an outlier and ``False`` if it is not. We do this again by iterating over the rows.
[42]:
# Create an empty column for outlier info
aggr_data['Outlier'] = None
# Iterate over rows
for idx, row in aggr_data.iterrows():
    # Update the 'Outlier' column with True if the wind speed is higher than our threshold value
    if row['SPEED'] > upper_threshold:
        aggr_data.loc[idx, 'Outlier'] = True
    else:
        aggr_data.loc[idx, 'Outlier'] = False
# Let's see what we have
print(aggr_data)
Celsius DIR GUST SPEED TEMP TIME_dh TIME_h \
0 14.814815 104.666667 NaN 2.533227 58.666667 2017080400 0.0
1 15.370370 111.000000 NaN 3.129280 59.666667 2017080401 1.0
2 15.000000 119.333333 5.81152 4.321387 59.000000 2017080402 2.0
3 15.000000 92.666667 5.36448 3.874347 59.000000 2017080403 3.0
4 14.814815 87.333333 6.70560 4.619413 58.666667 2017080404 4.0
5 14.814815 403.000000 7.59968 5.364480 58.666667 2017080405 5.0
6 14.814815 110.333333 10.28192 6.854613 58.666667 2017080406 6.0
7 14.814815 403.000000 9.83488 6.109547 58.666667 2017080407 7.0
8 14.814815 405.000000 6.70560 3.874347 58.666667 2017080408 8.0
9 15.370370 695.666667 NaN 3.129280 59.666667 2017080409 9.0
10 16.481481 225.000000 5.81152 4.768427 61.666667 2017080410 10.0
11 17.777778 241.666667 8.49376 5.513493 64.000000 2017080411 11.0
12 18.888889 228.333333 6.70560 5.960533 66.000000 2017080412 12.0
13 19.629630 229.666667 8.94080 7.152640 67.333333 2017080413 13.0
14 20.185185 228.666667 12.96416 8.940800 68.333333 2017080414 14.0
15 19.074074 218.333333 10.72896 7.450667 66.333333 2017080415 15.0
16 18.703704 214.666667 10.28192 7.152640 65.666667 2017080416 16.0
17 17.592593 209.666667 8.94080 7.003627 63.666667 2017080417 17.0
18 16.851852 211.333333 10.28192 5.662507 62.333333 2017080418 18.0
19 16.111111 203.000000 5.81152 4.023360 61.000000 2017080419 19.0
20 15.925926 198.000000 5.36448 4.023360 60.666667 2017080420 20.0
21 15.925926 186.666667 NaN 3.874347 60.666667 2017080421 21.0
22 15.555556 189.000000 6.70560 4.619413 60.000000 2017080422 22.0
23 15.185185 193.333333 4.91744 3.725333 59.333333 2017080423 23.0
Outlier
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
Okay, now we have mostly ``False`` values in our ``Outlier`` column, but there seems to be one ``True`` as well!
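As a side note, the whole loop above could be replaced by a single vectorized comparison; a minimal sketch:

# Boolean Series: True where the hourly wind speed exceeds the threshold
aggr_data['Outlier'] = aggr_data['SPEED'] > upper_threshold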
- Let’s select the rows with a potential storm:
[44]:
# Select rows that were determined as outliers
storm = aggr_data.loc[aggr_data['Outlier'] == True]
print(storm)
Celsius DIR GUST SPEED TEMP TIME_dh TIME_h \
14 20.185185 228.666667 12.96416 8.9408 68.333333 2017080414 14.0
Outlier
14 True
Okay, so indeed there was one outlier in our data, but the wind during that hour wasn’t that strong, as the average speed was only approximately 9 m/s. This is not too surprising, as we were only looking at data from a single day.
Repeating the data analysis with larger dataset
Let’s continue by executing all the steps that we have written so far and use them to explore outlier winds over the whole month of August 2017.
For this purpose, we change the input file to ``6591337447542dat_August.txt``, which follows the same format as the sample file.
- Here we will repeat all the steps that we did earlier in one code block so that you can see the full picture:
[49]:
# Filepath
fp = "data/6591337447542dat_August.txt"
# Read data using varying amount of spaces as separator and specifying * characters as NoData values
data = pd.read_csv(fp, sep='\s+', na_values=['*', '**', '***', '****', '*****', '******'])
# Select only specific columns
select_cols = ['YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN']
data = data[select_cols]
# Rename the columns
name_conversion_dict = {'YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST'}
data = data.rename(columns=name_conversion_dict)
# Create column
col_name = 'Celsius'
data[col_name] = None
# Iterate over rows and convert temperatures from Fahrenheit to Celsius
for idx, row in data.iterrows():
    celsius = fahrToCelsius(row['TEMP'])
    data.loc[idx, col_name] = celsius
# Convert wind speeds from miles per hour to meters per second
data['SPEED'] = data['SPEED']*0.44704
data['GUST'] = data['GUST']*0.44704
# Convert TIME to string and parse date and hour info from the time
data['TIME_str'] = data['TIME'].astype(str)
data['TIME_dh'] = data['TIME_str'].str.slice(start=0, stop=10)
data['TIME_h'] = data['TIME_str'].str.slice(start=8, stop=10)
data['TIME_h'] = data['TIME_h'].astype(int)
# Create an empty DataFrame for the aggregated data
aggr_data = pd.DataFrame()
# Specify the columns which will be used in calculation
mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP', 'Celsius', 'TIME_h']
# Group the values by hour
grouped = data.groupby('TIME_dh')
# Iterate over groups and update the aggregated DataFrame
for key, group in grouped:
    # Calculate the mean values
    mean_values = group[mean_cols].mean()
    # Add the time to the Series
    mean_values['TIME_dh'] = key
    # Add the aggregated values into the DataFrame
    aggr_data = aggr_data.append(mean_values, ignore_index=True)
# Calculate the outlier threshold for the new dataset
std_wind = aggr_data['SPEED'].std()
avg_wind = aggr_data['SPEED'].mean()
upper_threshold = avg_wind + (std_wind*2)
# Detect the outliers
aggr_data['Outlier'] = None
for idx, row in aggr_data.iterrows():
    if row['SPEED'] > upper_threshold:
        aggr_data.loc[idx, 'Outlier'] = True
    else:
        aggr_data.loc[idx, 'Outlier'] = False
# Select days with strong winds
storm = aggr_data.loc[aggr_data['Outlier'] == True]
After running the analysis with our new dataset, let’s explore what we have.
- Let’s start by checking whether the average and standard deviation of the wind speed differ from the previous ones:
[50]:
# Windspeed statistics
print('Std:', std_wind)
print('Mean:', avg_wind)
Std: 2.1405899770297245
Mean: 4.1990832704402505
Okay, so they are indeed different now! With the larger dataset, the average wind speed is 4.2 m/s (compared to 5.2 m/s previously).
- Let’s see what we now have in our ``storm`` variable.
[51]:
print(storm)
Celsius DIR GUST SPEED TEMP TIME_dh \
10 22.777778 210.666667 12.51712 9.089813 73.000000 2017080110
11 22.777778 212.000000 11.62304 8.940800 73.000000 2017080111
12 22.407407 205.666667 12.51712 9.089813 72.333333 2017080112
86 20.185185 228.666667 12.96416 8.940800 68.333333 2017080414
104 19.814815 204.333333 11.17600 8.791787 67.666667 2017080508
132 16.296296 237.666667 13.85824 9.387840 61.333333 2017080612
230 21.666667 217.000000 12.51712 8.642773 71.000000 2017081014
280 19.074074 700.666667 26.82240 8.791787 66.333333 2017081216
301 20.555556 210.000000 NaN 9.611360 69.000000 2017081313
302 19.444444 200.000000 NaN 8.493760 67.000000 2017081314
444 22.037037 195.666667 10.72896 8.493760 71.666667 2017081914
445 20.925926 204.666667 12.51712 8.940800 69.666667 2017081915
559 14.814815 328.666667 13.41120 8.493760 58.666667 2017082409
560 15.925926 329.333333 13.85824 8.493760 60.666667 2017082410
563 16.296296 329.666667 13.41120 9.238827 61.333333 2017082413
564 15.185185 550.000000 NaN 8.493760 59.333333 2017082414
686 17.222222 214.000000 13.41120 9.089813 63.000000 2017082916
687 17.037037 210.666667 11.62304 8.791787 62.666667 2017082917
704 18.518519 203.666667 8.04672 8.493760 65.333333 2017083010
705 18.888889 218.333333 13.41120 8.940800 66.000000 2017083011
706 17.962963 215.666667 14.52880 10.579947 64.333333 2017083012
707 17.962963 217.666667 12.07008 9.089813 64.333333 2017083013
TIME_h Outlier
10 10.0 True
11 11.0 True
12 12.0 True
86 14.0 True
104 8.0 True
132 12.0 True
230 14.0 True
280 16.0 True
301 13.0 True
302 14.0 True
444 14.0 True
445 15.0 True
559 9.0 True
560 10.0 True
563 13.0 True
564 14.0 True
686 16.0 True
687 17.0 True
704 10.0 True
705 11.0 True
706 12.0 True
707 13.0 True
Okay, interesting! Now we can see the days and hours when it was stormy in August 2017. It seems that the storms usually occurred during the daytime. Let’s check if this is the case.
We can easily count how many stormy observations there are for each hour of the day using the **``value_counts()``** function, which counts how many times each distinct value occurs in a column (it works best for categorical data).
- Let’s see the counts for the different hours of the day:
[52]:
# Print the occurences of storm by hour
print(storm['TIME_h'].value_counts())
14.0 5
13.0 3
12.0 3
10.0 3
16.0 2
11.0 2
17.0 1
9.0 1
15.0 1
8.0 1
Name: TIME_h, dtype: int64
Okay, this is interesting. It seems that it was most often stormy at 14:00 GMT (i.e. 16:00 Finnish time). Notice that there haven’t been any strong winds during the night, which is also interesting. However, as The Weather Guys explain, it is actually not that surprising =).
The average wind speed may not be the best measure for finding extreme weather conditions; gust speed is usually a better measure for that purpose.
- Let’s see what the strongest gust winds in our dataset were by sorting the values with the ``sort_values()`` function.
[53]:
# Sort values in descending order
gust_sort = storm.sort_values(by='GUST', ascending=False)
# Let's see what we have
print(gust_sort)
Celsius DIR GUST SPEED TEMP TIME_dh \
280 19.074074 700.666667 26.82240 8.791787 66.333333 2017081216
706 17.962963 215.666667 14.52880 10.579947 64.333333 2017083012
132 16.296296 237.666667 13.85824 9.387840 61.333333 2017080612
560 15.925926 329.333333 13.85824 8.493760 60.666667 2017082410
559 14.814815 328.666667 13.41120 8.493760 58.666667 2017082409
705 18.888889 218.333333 13.41120 8.940800 66.000000 2017083011
686 17.222222 214.000000 13.41120 9.089813 63.000000 2017082916
563 16.296296 329.666667 13.41120 9.238827 61.333333 2017082413
86 20.185185 228.666667 12.96416 8.940800 68.333333 2017080414
10 22.777778 210.666667 12.51712 9.089813 73.000000 2017080110
445 20.925926 204.666667 12.51712 8.940800 69.666667 2017081915
230 21.666667 217.000000 12.51712 8.642773 71.000000 2017081014
12 22.407407 205.666667 12.51712 9.089813 72.333333 2017080112
707 17.962963 217.666667 12.07008 9.089813 64.333333 2017083013
11 22.777778 212.000000 11.62304 8.940800 73.000000 2017080111
687 17.037037 210.666667 11.62304 8.791787 62.666667 2017082917
104 19.814815 204.333333 11.17600 8.791787 67.666667 2017080508
444 22.037037 195.666667 10.72896 8.493760 71.666667 2017081914
704 18.518519 203.666667 8.04672 8.493760 65.333333 2017083010
301 20.555556 210.000000 NaN 9.611360 69.000000 2017081313
302 19.444444 200.000000 NaN 8.493760 67.000000 2017081314
564 15.185185 550.000000 NaN 8.493760 59.333333 2017082414
TIME_h Outlier
280 16.0 True
706 12.0 True
132 12.0 True
560 10.0 True
559 9.0 True
705 11.0 True
686 16.0 True
563 13.0 True
86 14.0 True
10 10.0 True
445 15.0 True
230 14.0 True
12 12.0 True
707 13.0 True
11 11.0 True
687 17.0 True
104 8.0 True
444 14.0 True
704 10.0 True
301 13.0 True
302 14.0 True
564 14.0 True
Interesting! There was one hour with a quite extraordinary gust in our data, on the 12th of August 2017. Indeed, that was a big storm in Helsinki called **Kiira** that caused major damage in different parts of the city. (Source: `YLE <https://yle.fi/uutiset/osasto/news/saturday_night_storm_downs_trees_cuts_electricity_in_the_south/9773250>`__; Photo: Markku Sipi)
TAKE HOME MESSAGE: As we have seen here, we can already conduct fairly interesting data analysis with Pandas, which provides various useful functionalities that are fairly straightforward and easy to use. Similar approaches can be used for many different kinds of datasets.