Advanced data processing with Pandas¶
In this week, we will continue developing our skills using Pandas to analyze climate data. The aim of this lesson is to learn different functions to manipulate with the data and do simple analyses. In the end, our goal is to detect weather anomalies (stormy winds) in Helsinki, during August 2017.
Downloading and reading the data¶
Notice that this time, we will read the actual data obtained from NOAA without any modifications to the actual data by us.
Start by downloading the data file 6591337447542dat_sample.txt
from this link.
The first rows of the data looks like following:
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
029740 99999 201708040000 114 6 *** *** BKN * * * 25.0 03 ** ** ** ** ** ** ** 2 58 56 1005.6 ***** 999.2 *** *** ***** ***** ***** ***** 0
029740 99999 201708040020 100 6 *** 75 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 57 ****** 29.68 ****** *** *** ***** ***** ***** ***** **
029740 99999 201708040050 100 5 *** 60 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 57 ****** 29.65 ****** *** *** ***** ***** ***** ***** **
029740 99999 201708040100 123 8 *** 63 OVC * * * 10.0 ** ** ** ** 23 ** ** ** * 59 58 1004.7 ***** 998.4 *** *** ***** ***** ***** ***** 0
029740 99999 201708040120 110 7 *** 70 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 59 ****** 29.65 ****** *** *** ***** ***** ***** ***** **
Notice from above that our data is separated with varying amount of spaces (fixed width).
Note
Write the codes of this lesson into a separate script called weather_analysis.py
because we will re-use the codes we write here again later.
Let’s start by importing pandas and specifying the filepath to the file that we want to read.
As the data was separated with varying amount of spaces, we need to tell Pandas to read it like that
with sep
parameter that says following things about it:
Hence, we can separate the columns by varying number spaces of spaces with sep='\s+'
-parameter.
Our data also included No Data values with varying number of *
-characters. Hence, we need to take also those
into account when reading the data. We can tell Pandas to consider those characters as NaNs by specifying na_values=['*', '**', '***', '****', '*****', '******']
.
In [1]: data = pd.read_csv(fp, sep='\s+', na_values=['*', '**', '***', '****', '*****', '******'])
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-1-4c97493f75d7> in <module>()
----> 1 data = pd.read_csv(fp, sep='\s+', na_values=['*', '**', '***', '****', '*****', '******'])
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
684 )
685
--> 686 return _read(filepath_or_buffer, kwds)
687
688
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
450
451 # Create the parser.
--> 452 parser = TextFileReader(fp_or_buf, **kwds)
453
454 if chunksize or iterator:
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
934 self.options["has_index_names"] = kwds["has_index_names"]
935
--> 936 self._make_engine(self.engine)
937
938 def close(self):
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1166 def _make_engine(self, engine="c"):
1167 if engine == "c":
-> 1168 self._engine = CParserWrapper(self.f, **self.options)
1169 else:
1170 if engine == "python":
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1996 kwds["usecols"] = self.usecols
1997
-> 1998 self._reader = parsers.TextReader(src, **kwds)
1999 self.unnamed_cols = self._reader.unnamed_cols
2000
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()
FileNotFoundError: [Errno 2] No such file or directory: '/home/travis/build/geo-python/site/data/L6/6591337447542dat_August.txt'
Exploring data and renaming columns¶
Let’s see how the data looks by printing the first five rows with head()
function
In [2]: data.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-304fa4ce4ebd> in <module>()
----> 1 data.head()
NameError: name 'data' is not defined
Let’s continue and check what columns do we have.
In [3]: data.columns
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-c3d483a1c074> in <module>()
----> 1 data.columns
NameError: name 'data' is not defined
Okey there are quite many columns and we are not interested to use all of them. Let’s select only columns that might be used to detect unexceptional weather conditions, i.e. YR–MODAHRMN, DIR, SPD, GUS, TEMP, MAX, and MIN.
In [4]: select_cols = ['YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN']
In [5]: data = data[select_cols]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-e9def1dd97a5> in <module>()
----> 1 data = data[select_cols]
NameError: name 'data' is not defined
Let’s see what our data looks like now by printing last 5 rows and the datatypes.
In [6]: data.tail() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-6-0b73fe40d24a> in <module>() ----> 1 data.tail() NameError: name 'data' is not defined In [7]: data.dtypes --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-7-6226a73926db> in <module>() ----> 1 data.dtypes NameError: name 'data' is not defined
The column names that we have are somewhat ackward. Let’s change them into more intuitive. This can be done easily with rename()
-function.
We can define the new column names by using a specific data type in Python called dictionary where we can determine the original column name (the one which will be replaced), and the new column name.
Let’s change YR--MODAHRMN
column into TIME
, SPD
into SPEED
, and GUS
into GUST
In [8]: name_conversion_dict = {'YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST'} In [9]: print(name_conversion_dict) {'YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST'} In [10]: type(name_conversion_dict) Out[10]: dict
Now we can change the column names by passing that dictionary into parameter columns
in rename()
-function.
In [11]: data = data.rename(columns=name_conversion_dict) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-11-50fd1adf81d7> in <module>() ----> 1 data = data.rename(columns=name_conversion_dict) NameError: name 'data' is not defined In [12]: data.columns --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-12-c3d483a1c074> in <module>() ----> 1 data.columns NameError: name 'data' is not defined
Perfect, now our column names are more easy to understand and use. Let’s check some basic statistics to understand our data better.
In [13]: data.describe()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-13-2bb0b18689d4> in <module>()
----> 1 data.describe()
NameError: name 'data' is not defined
Okey so from here we can see that there are varying number of observations per column (see the count
-information). For example SPD
and TEMP
column has 72 observations
whereas GUS
has only 20 observations and MAX
and MIN
has only 2 observations.
From here we can already guess that MAX` and MIN
attributes are most probably not going to be useful for us .
However, GUS
might be.
Let’s explore further our data by checking the first 30 rows of it.
In [14]: data.head(30)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-14-87f58bd614f4> in <module>()
----> 1 data.head(30)
NameError: name 'data' is not defined
Okey, so from here we can actually see that the GUST
column contains information only on an hourly level. That might be useful! Let’s keep this in mind.
Iterating rows and using self-made functions in Pandas¶
Let’s do the same thing as many times before and convert our Fahrenheit temperatures into Celsius. In this time, however, we will use our self-made function to do the conversion.
Here I provide you the function that you can copy and paste into your own script.
def fahrToCelsius(temp_fahrenheit):
"""
Function to convert Fahrenheit temperature into Celsius.
Parameters
----------
temp_fahrenheit: int | float
Input temperature in Fahrenheit (should be a number)
"""
# Convert the Fahrenheit into Celsius and return it
converted_temp = (temp_fahrenheit - 32) / 1.8
return converted_temp
Let’s do the conversion by iterating our data line by line and updating a column called CELSIUS
that we will create.
We can iterate over the rows of Pandas DataFrame by using iterrows()
-function.
When iterating over the rows in our DataFrame
it is noteworthy to understand that the Pandas actually keeps track on the index
value as well.
Hence, the contents of a single row actually contains not only the values, but also the index
of that row.
Let’s see how it works. Here, we will use a specific Python command called break
can be used to stop the iteration right after the first loop.
This can be useful as we don’t want to fill our console by printing all the values and indices in our DataFrame, but to just see if the function works as we want.
In [15]: for idx, row in data.iterrows(): ....: print('Index:', idx) ....: print(row) ....: break ....: --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-15-9d168aad9fa1> in <module>() ----> 1 for idx, row in data.iterrows(): 2 print('Index:', idx) 3 print(row) 4 break NameError: name 'data' is not defined In [16]: type(row) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-16-baa11afa6e20> in <module>() ----> 1 type(row) NameError: name 'row' is not defined
Okey, so here we can see that the idx
variable indeed contains the index value at position 0 (the first row) and the row
variable contains all the data from that given row stored as a pd.Series
.
Let’s now create an empty column for the Celsius temperatures and update the values into that column by using our function. Here is the whole procedure:
# Create an empty column for the data
col_name = 'Celsius'
data[col_name] = None
# Iterate ove rows
for idx, row in data.iterrows():
# Convert the Fahrenheit temperature of the row into Celsius
celsius = fahrToCelsius(row['TEMP'])
# Add that value into 'Celsius' column using the index of the row
data.loc[idx, col_name] = celsius
Let’s see what we have now.
In [17]: data.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-17-304fa4ce4ebd> in <module>()
----> 1 data.head()
NameError: name 'data' is not defined
Great! Now we have converted our temperatures into Celsius by using the function that we created ourselves. Knowing how to use your own function in Pandas can be really useful when doing your own analyses. There is also another more powerful way of using functions in Pandas, see [1].
Let’s also convert the wind speeds into meters per second values (m/s) as they are more familiar to us in Finland. This can be done with a formula m/s = mph x 0.44704
In [18]: data['SPEED'] = data['SPEED']*0.44704 --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-18-2d7e24d8ead3> in <module>() ----> 1 data['SPEED'] = data['SPEED']*0.44704 NameError: name 'data' is not defined In [19]: data['GUST'] = data['GUST']*0.44704 --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-19-6b8133f245d3> in <module>() ----> 1 data['GUST'] = data['GUST']*0.44704 NameError: name 'data' is not defined
Let’s see the result by printing the first 30 rows.
String manipulation in Pandas¶
In [20]: data.head(30)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-20-87f58bd614f4> in <module>()
----> 1 data.head(30)
NameError: name 'data' is not defined
When looking the data more carefully, we can see something interesting:
GUST
seems to be measured only once an hour, whereas SPD
(wind speed), and our temperatures seem to be measured approximately every 20 minutes (at minutes XX:00, XX:20 and XX:50).
That might be a problem as we might not be able to compare e.g. the average wind speeds and the speeds during the gust together as they are measured with different intervals. This kind of mismatch between sampling rates of measurements is actually quite typical when working with real data.
How we can solve this kind of problem is to aggregate the wind speeds into hourly level data as well so the attributes become comparable.
First we need to be able to group the values by hour. This can be done e.g. by slicing the date+hour time from the TIME
column (i.e. removing the minutes from the end of the value)
- Doing this requires two steps:
- Convert the
TIME
column fromint
intostr
datatype. - Include only numbers up to hourly accuracy (exclude minutes) by slicing texts
- Convert the
Note
There are also more advanced functions in Pandas to do time series manipulations by utilizing datetime
datatype and resample()
-function, but we won’t cover those here. Read more information about creating datetime index and aggregating data by time with resampling from here if you are interested.
Let’s convert the time into string. And check that the data type changes.
In [21]: data['TIME_str'] = data['TIME'].astype(str) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-21-87915f869435> in <module>() ----> 1 data['TIME_str'] = data['TIME'].astype(str) NameError: name 'data' is not defined In [22]: data.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-22-304fa4ce4ebd> in <module>() ----> 1 data.head() NameError: name 'data' is not defined In [23]: data['TIME_str'].dtypes --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-23-847e17b5b89c> in <module>() ----> 1 data['TIME_str'].dtypes NameError: name 'data' is not defined In [24]: type(data.loc[0, 'TIME_str']) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-24-7ff5b0a26668> in <module>() ----> 1 type(data.loc[0, 'TIME_str']) NameError: name 'data' is not defined
Okey it seems that now we indeed have the TIME
as str
datatype as well.
Now we can slice them into hourly level by including only 10 first characters from the text (i.e. excluding the minute-level information).
In [25]: data['TIME_dh'] = data['TIME_str'].str.slice(start=0, stop=10) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-25-0cd841112142> in <module>() ----> 1 data['TIME_dh'] = data['TIME_str'].str.slice(start=0, stop=10) NameError: name 'data' is not defined In [26]: data.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-26-304fa4ce4ebd> in <module>() ----> 1 data.head() NameError: name 'data' is not defined
Nice! Now we have information about time on an hourly basis including the date as well.
Note
Notice that all the typical str
functionalities can be applied to Series of text data with syntax data['mySeries'].str.<functionToUse>()
.
Let’s also slice only the hour of the day (excluding information about the date) and convert it back to integer (we will be using this information later)
In [27]: data['TIME_h'] = data['TIME_str'].str.slice(start=8, stop=10) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-27-7da4908562a0> in <module>() ----> 1 data['TIME_h'] = data['TIME_str'].str.slice(start=8, stop=10) NameError: name 'data' is not defined In [28]: data['TIME_h'] = data['TIME_h'].astype(int) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-28-7eabaa9a4448> in <module>() ----> 1 data['TIME_h'] = data['TIME_h'].astype(int) NameError: name 'data' is not defined In [29]: data.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-29-304fa4ce4ebd> in <module>() ----> 1 data.head() NameError: name 'data' is not defined
Wunderbar, now we have also a separate column for only the hour of the day.
Aggregating data in Pandas by grouping¶
Next we want to calculate the average temperatures, wind speeds, etc. on an hourly basis to enable us to compare all of them to each other.
This can be done by aggregating the data, i.e.:
- grouping the data based on hourly values
- Iterating over those groups and calculating the average values of our attributes
- Inserting those values into a new DataFrame where we store the aggregated data
Let’s first create a new empty DataFrame where we will store our aggregated data
In [30]: aggr_data = pd.DataFrame()
Let’s then group our data based on TIME_h
attribute that contains the information about the date + hour.
In [31]: grouped = data.groupby('TIME_dh')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-31-ba38166e7393> in <module>()
----> 1 grouped = data.groupby('TIME_dh')
NameError: name 'data' is not defined
Let’s see what we have now.
In [32]: type(grouped) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-32-994b097c9b35> in <module>() ----> 1 type(grouped) NameError: name 'grouped' is not defined In [33]: len(grouped) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-33-608daaa3b355> in <module>() ----> 1 len(grouped) NameError: name 'grouped' is not defined
Okey, interesting. Now we have a new object with type DataFrameGroupBy
. And it seems that we have 24 individual groups in our data, i.e. one group for each hour of the day.
Let’s see what we can do with this grouped
-variable.
As you might have noticed earlier, the first hour in hour data is 2017080400
(midnight at 4th of August in 2017).
Let’s now see what we have on hour grouped
variable e.g. on the first hour 2017080400
.
We can get the values of that hour from DataFrameGroupBy
-object with get_group()
-function.
In [34]: time1 = '2017080400' In [35]: group1 = grouped.get_group(time1) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-35-cc7df68a5c89> in <module>() ----> 1 group1 = grouped.get_group(time1) NameError: name 'grouped' is not defined In [36]: group1 --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-36-21de662443e2> in <module>() ----> 1 group1 NameError: name 'group1' is not defined
Ahaa! As we can see, a single group contains a DataFrame with values only for that specific hour. This is really useful, because now we can calculate e.g. the average values for all weather measurements (+ hour) that we have (you can use any of the statistical functions that we have seen already, e.g. mean, std, min, max, median, etc.).
We can do that by using the mean()
-function that we already used during the Lesson 5.
Let’s calculate the mean for following attributes: DIR
, SPEED
, GUST
, TEMP
, and Celsius
.
In [37]: mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP', 'Celsius', 'TIME_h'] In [38]: mean_values = group1[mean_cols].mean() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-38-dadda1ae5813> in <module>() ----> 1 mean_values = group1[mean_cols].mean() NameError: name 'group1' is not defined In [39]: mean_values --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-39-a955082639c8> in <module>() ----> 1 mean_values NameError: name 'mean_values' is not defined
Nice, now we have averaged our data and e.g. the mean Celsius temperature seems to be about right when comparing to the original values above. Notice that we still have information about the hour but not about the date which is at the moment stored in time1
variable.
We can insert that datetime-information into our mean_values
Series so that we have the date information also associated with our data.
In [40]: mean_values['TIME_dh'] = time1 --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-40-4251b8d4ad9a> in <module>() ----> 1 mean_values['TIME_dh'] = time1 NameError: name 'mean_values' is not defined In [41]: mean_values --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-41-a955082639c8> in <module>() ----> 1 mean_values NameError: name 'mean_values' is not defined
Perfect! Now we have also time information there. The last thing to do is to add these mean values into our DataFrame that we created.
That can be done with append()
-function in a quite similar manner as with Python lists.
In Pandas the data insertion is not done inplace (as when appending to Python lists) so we need to specify that we are updating the aggr_data (using the =
sign)
We also need to specify that we ignore the index values of our original DataFrame (i.e. the indices of mean_values
).
In [42]: aggr_data = aggr_data.append(mean_values, ignore_index=True) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-42-9a6bd50cf05b> in <module>() ----> 1 aggr_data = aggr_data.append(mean_values, ignore_index=True) NameError: name 'mean_values' is not defined In [43]: aggr_data Out[43]: Empty DataFrame Columns: [] Index: []
Now we have a single row in our new DataFrame where we have aggregated the data based on hourly mean values.
Next we could continue doing and insert the average values from other hours in a similar manner but, of course, that is not
something that we want to do manually (would require repeating these same steps too many times).
Luckily, we can actually iterate over all the groups that we have in our data and do these steps using a for -loop
.
When iterating over the groups in our DataFrameGroupBy
object
it is important to understand that a single group in our DataFrameGroupBy
actually contains not only the actual values, but also information about the key
that was used to do the grouping.
Hence, when iterating over the data we need to assign the key
and the values into separate variables.
Let’s see how we can iterate over the groups and print the key and the data from a single group (again using break
to only see what is happening).
In [44]: for key, group in grouped:
....: print(key)
....: print(group)
....: break
....:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-44-8005ac9a13c3> in <module>()
----> 1 for key, group in grouped:
2 print(key)
3 print(group)
4 break
NameError: name 'grouped' is not defined
Okey so from here we can see that the key
contains the value 2017080400
that is the same
as the values in TIME_dh
column. Meaning that we, indeed, grouped the values based on that column.
Let’s see how we can create a DataFrame where we calculate the mean values for all those weather attributes that we were interested in. I will repeate slightly the earlier steps so that you can see and better understand what is happening.
# Create an empty DataFrame for the aggregated values
aggr_data = pd.DataFrame()
# The columns that we want to aggregate
mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP', 'Celsius', 'TIME_h']
# Iterate over the groups
for key, group in grouped:
# Aggregate the data
mean_values = group[mean_cols].mean()
# Add the ´key´ (i.e. the date+time information) into the aggregated values
mean_values['TIME_dh'] = key
# Append the aggregated values into the DataFrame
aggr_data = aggr_data.append(mean_values, ignore_index=True)
Let’s see what we have now.
In [45]: aggr_data
Out[45]:
Empty DataFrame
Columns: []
Index: []
Great! Now we have aggregated our data based on daily averages and we have a new DataFrame called aggr_data
where all those aggregated values are stored.
Finding outliers from the data¶
Finally, we are ready to see and find out if there are any outliers in our data suggesting to have a storm (meaning strong winds in this case).
We define an outlier if the wind speed is 2 times the standard deviation higher than the average wind speed (column SPEED
).
Let’s first find out what is the standard deviation and the mean of the Wind speed.
In [46]: std_wind = aggr_data['SPEED'].std() --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2888 try: -> 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'SPEED' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-46-0e133e816ffd> in <module>() ----> 1 std_wind = aggr_data['SPEED'].std() ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key) 2900 if self.columns.nlevels > 1: 2901 return self._getitem_multilevel(key) -> 2902 indexer = self.columns.get_loc(key) 2903 if is_integer(indexer): 2904 indexer = [indexer] ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: -> 2891 raise KeyError(key) from err 2892 2893 if tolerance is not None: KeyError: 'SPEED' In [47]: avg_wind = aggr_data['SPEED'].mean() --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2888 try: -> 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'SPEED' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-47-1ff1df82ae4d> in <module>() ----> 1 avg_wind = aggr_data['SPEED'].mean() ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key) 2900 if self.columns.nlevels > 1: 2901 return self._getitem_multilevel(key) -> 2902 indexer = self.columns.get_loc(key) 2903 if is_integer(indexer): 2904 indexer = [indexer] ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: -> 2891 raise KeyError(key) from err 2892 2893 if tolerance is not None: KeyError: 'SPEED' In [48]: print('Std:', std_wind) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-48-1e00832665f0> in <module>() ----> 1 print('Std:', std_wind) NameError: name 'std_wind' is not defined In [49]: print('Mean:', avg_wind) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-49-40e827dd07b9> in <module>() ----> 1 print('Mean:', avg_wind) NameError: name 'avg_wind' is not defined
Okey, so the variance in the windspeed tend to be approximately 1.6 meters per second, and the wind speed is approximately 5.2 m/s. Hence, the threshold for a wind speed to be an outlier with our criteria is:
In [50]: upper_threshold = avg_wind + (std_wind*2) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-50-94e9b9316b9c> in <module>() ----> 1 upper_threshold = avg_wind + (std_wind*2) NameError: name 'avg_wind' is not defined In [51]: print('Upper threshold for outlier:', upper_threshold) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-51-fbb0de72dd4d> in <module>() ----> 1 print('Upper threshold for outlier:', upper_threshold) NameError: name 'upper_threshold' is not defined
Let’s finally create a column called Outlier
which we update with True
value if the windspeed is an outlier and False
if it is not.
We do this again by iterating over the rows.
# Create an empty column for outlier info
aggr_data['Outlier'] = None
# Iterate over rows
for idx, row in aggr_data.iterrows():
# Update the 'Outlier' column with True if the wind speed is higher than our threshold value
if row['SPEED'] > upper_threshold :
aggr_data.loc[idx, 'Outlier'] = True
else:
aggr_data.loc[idx, 'Outlier'] = False
print(aggr_data)
Let’s see what we have now.
In [52]: print(aggr_data)
Empty DataFrame
Columns: [Outlier]
Index: []
Okey now we have at least many False values in our Outlier
-column.
Let’s select the rows with potential storm and see if we have any potential storms in our data.
In [53]: storm = aggr_data.ix[aggr_data['Outlier'] == True] --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-53-901c55890cb5> in <module>() ----> 1 storm = aggr_data.ix[aggr_data['Outlier'] == True] ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name) 5128 if self._info_axis._can_hold_identifiers_and_holds_name(name): 5129 return self[name] -> 5130 return object.__getattribute__(self, name) 5131 5132 def __setattr__(self, name: str, value) -> None: AttributeError: 'DataFrame' object has no attribute 'ix' In [54]: print(storm) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-54-d9f3d8e2365a> in <module>() ----> 1 print(storm) NameError: name 'storm' is not defined
Okey, so it seems that there was one outlier in our data but the wind during that time wasn’t that strong as the average speed was only 9 m/s. This is not too strange as we were only looking at data from a single day.
Repeating the data analysis with larger dataset¶
Let’s continue by executing the script that we have written this far and use it to explore outlier winds based on whole month of August 2017.
For this purpose you should change the input file to be 6591337447542dat_August.txt
that you can download from here.
Note
Notice that if you haven’t written your codes into a script, you can take advantage of the History
-tab in Spyder where the history of all your codes should be written from this session (you can copy / paste from there).
Change the input data for your script to be the whole month of August 2017 and run the same codes again.
After running the code again with more data, let’s see what were the mean and std wind speeds of our data.
In [55]: std_wind = aggr_data['SPEED'].std() --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2888 try: -> 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'SPEED' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-55-0e133e816ffd> in <module>() ----> 1 std_wind = aggr_data['SPEED'].std() ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key) 2900 if self.columns.nlevels > 1: 2901 return self._getitem_multilevel(key) -> 2902 indexer = self.columns.get_loc(key) 2903 if is_integer(indexer): 2904 indexer = [indexer] ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: -> 2891 raise KeyError(key) from err 2892 2893 if tolerance is not None: KeyError: 'SPEED' In [56]: avg_wind = aggr_data['SPEED'].mean() --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2888 try: -> 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'SPEED' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-56-1ff1df82ae4d> in <module>() ----> 1 avg_wind = aggr_data['SPEED'].mean() ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key) 2900 if self.columns.nlevels > 1: 2901 return self._getitem_multilevel(key) -> 2902 indexer = self.columns.get_loc(key) 2903 if is_integer(indexer): 2904 indexer = [indexer] ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: -> 2891 raise KeyError(key) from err 2892 2893 if tolerance is not None: KeyError: 'SPEED' In [57]: print('Std:', std_wind) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-57-1e00832665f0> in <module>() ----> 1 print('Std:', std_wind) NameError: name 'std_wind' is not defined In [58]: print('Mean:', avg_wind) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-58-40e827dd07b9> in <module>() ----> 1 print('Mean:', avg_wind) NameError: name 'avg_wind' is not defined
Okey so they are indeed different now as we have more data: e.g. the average wind speed was 5.2 m/s, whereas it is now only 4.2. Let’s see what we have now in our storm
variable.
In [59]: storm
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-59-0a43648a34b9> in <module>()
----> 1 storm
NameError: name 'storm' is not defined
Okey, interesting! Now we can see the the days and hours when it has been stormy in August 2017.
It seems that the storms have usually been during the day time. Let’s check if this is the case.
We can easily count how many stormy observations for different hour of the day there has been by
using a value_counts()
-function that calculates how many observations per certain value there are
in a certain column (works best for categorigal data).
Let’s see the counts for different hours of the day
In [60]: print(storm['TIME_h'].value_counts())
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-60-1face81c7aee> in <module>()
----> 1 print(storm['TIME_h'].value_counts())
NameError: name 'storm' is not defined
Okey, this is interesting. It seems that most often it has been stormy at 14:00 GMT (i.e. 16:00 at Finnish time). Notice, that there haven’t been any strong winds during the night, which is also interesting. However, as the The weather guys explains us, it is not that surprising actually =).
The average wind speed may not be the perfect measure to find extreme weather conditions. Gust might usually be a better measure for that purpose. Let’s see what were the strongest gust winds in our dataset by sorting the values.
In [61]: gust_sort = storm.sort_values(by='GUST', ascending=False) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-61-c1c81225c3b7> in <module>() ----> 1 gust_sort = storm.sort_values(by='GUST', ascending=False) NameError: name 'storm' is not defined In [62]: gust_sort --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-62-30ef7f94ef9a> in <module>() ----> 1 gust_sort NameError: name 'gust_sort' is not defined
Interesting! There was one hour with quite extraordinary gust wind in our data happening at 12th of August in 2017. Indeed, that was a big storm in Helsinki called Kiira that caused major damage in different parts of the city.
[1] | Below you can find information how to use functions in Pandas with an alternative way. |
Hint
Hint: Using iterrows()
-function is not the most efficient way of using your self-made functions. In Pandas, there is a function called apply()
that takes advantage of the power of numpy when looping, and is hence much faster which can give a lot of speed benefit when you have millions of rows to iterate over.
Below I show how to do the similar thing by using our own function with apply()
.
I will make a copy of our original DataFrame so this does not affect our original data.
Before using this approach, we need to modify our function a bit to get things working.
First, we need to have a parameter called row
that is used to pass the data from row into our function
(this is something specific to apply()
-function in Pandas) and then add paramaters for passing the information about the column name that contains the temperatures in Fahrenheit,
and the column name where the coverted temperatures will be updated (i.e. the Celsius temperatures).
Hence, in the end, you can see that this is a bit more generic function to use (i.e. the columns to use in the calculation are not “hard-coded”).
def fahrToCelsius(row, src_col, target_col):
"""
A generic function to convert Fahrenheit temperature into Celsius.
Parameters
----------
row: pd.Series
Input row containing the data for specific index in the DataFrame
src_col : str
Name of the source column for the calculation. I.e. the name of the column where Fahrenheits are stored.
target_col : str
Name of the target column where Celsius will be stored.
"""
# Convert the Fahrenheit into Celsius and update the target column value
row[target_col] = (row[src_col]- 32) / 1.8
return row
Take a copy of the data.
In [63]: data2 = data.copy()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-63-c5881cec65d0> in <module>()
----> 1 data2 = data.copy()
NameError: name 'data' is not defined
Apply our new function and update the values into a new column called Celsius2
In [64]: data2 = data2.apply(fahrToCelsius, src_col='TEMP', target_col='Celsius2', axis=1)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2888 try:
-> 2889 return self._engine.get_loc(casted_key)
2890 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'TEMP'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-64-468c0343a19e> in <module>()
----> 1 data2 = data2.apply(fahrToCelsius, src_col='TEMP', target_col='Celsius2', axis=1)
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7543 kwds=kwds,
7544 )
-> 7545 return op.get_result()
7546
7547 def applymap(self, func) -> "DataFrame":
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
253
254 def apply_standard(self):
--> 255 results, res_index = self.apply_series_generator()
256
257 # wrap results
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
282 for i, v in enumerate(series_gen):
283 # ignore SettingWithCopy here in case the user mutates
--> 284 results[i] = self.f(v)
285 if isinstance(results[i], ABCSeries):
286 # If we have a view on v, we need to make a copy because
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/apply.py in f(x)
107
108 def f(x):
--> 109 return func(x, *args, **kwds)
110
111 else:
<ipython-input-63-0199b9821261> in fahrToCelsius(row, src_col, target_col)
1 def fahrToCelsius(row, src_col, target_col):
----> 2 row[target_col] = (row[src_col]- 32) / 1.8
3 return row
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
880
881 elif key_is_scalar:
--> 882 return self._get_value(key)
883
884 if (
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
989
990 # Similar to Index.get_value, but we do not fall back to positional
--> 991 loc = self.index.get_loc(label)
992 return self.index._get_values_for_loc(self, loc, label)
993
~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2889 return self._engine.get_loc(casted_key)
2890 except KeyError as err:
-> 2891 raise KeyError(key) from err
2892
2893 if tolerance is not None:
KeyError: 'TEMP'
As you can see here, we use the apply()
function and as the first parameter
we pass the name of the function that we want to use with the apply()
, and then we pass the names of the source column and the target column.
Lastly, it is important to add as a last parameter axis=1
that tells for the function to apply the calculations vertically (row by row) instead of horizontally (would move from column to another).
See the results.
In [65]: data2.head()
Out[65]:
Time Value
0 20170101 1
1 20170101 2
2 20170102 3
3 20170104 3
4 20170104 8
Indeed it seems that our function worked because the values in Celsius
and Celsius2
columns are the same.
With this approach it is extremely easy to reuse our function and pass the results into another new colum e.g.
In [66]: data2 = data2.apply(fahrToCelsius, src_col='TEMP', target_col='Celsius3', axis=1) --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2888 try: -> 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'TEMP' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-66-b70a0a501690> in <module>() ----> 1 data2 = data2.apply(fahrToCelsius, src_col='TEMP', target_col='Celsius3', axis=1) ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds) 7543 kwds=kwds, 7544 ) -> 7545 return op.get_result() 7546 7547 def applymap(self, func) -> "DataFrame": ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self) 178 return self.apply_raw() 179 --> 180 return self.apply_standard() 181 182 def apply_empty_result(self): ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self) 253 254 def apply_standard(self): --> 255 results, res_index = self.apply_series_generator() 256 257 # wrap results ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self) 282 for i, v in enumerate(series_gen): 283 # ignore SettingWithCopy here in case the user mutates --> 284 results[i] = self.f(v) 285 if isinstance(results[i], ABCSeries): 286 # If we have a view on v, we need to make a copy because ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/apply.py in f(x) 107 108 def f(x): --> 109 return func(x, *args, **kwds) 110 111 else: <ipython-input-63-0199b9821261> in fahrToCelsius(row, src_col, target_col) 1 def fahrToCelsius(row, src_col, target_col): ----> 2 row[target_col] = (row[src_col]- 32) / 1.8 3 return row ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key) 880 881 elif key_is_scalar: --> 882 return self._get_value(key) 883 884 if ( ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/series.py in _get_value(self, label, takeable) 989 990 # Similar to Index.get_value, but we do not fall back to positional --> 991 loc = self.index.get_loc(label) 992 return self.index._get_values_for_loc(self, loc, label) 993 ~/virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 2889 return self._engine.get_loc(casted_key) 2890 except KeyError as err: -> 2891 raise KeyError(key) from err 2892 2893 if tolerance is not None: KeyError: 'TEMP' In [67]: data2.head() Out[67]: Time Value 0 20170101 1 1 20170101 2 2 20170102 3 3 20170104 3 4 20170104 8
Now we just added another column called Celsius3
just by changing the value of the target_col
-parameter.
This is a good and efficient approach to use in many cases, and hence highly recommended (although it is a bit harder to understand).