Advanced data processing with Pandas¶
This week we will continue developing our skills in using Pandas to analyze climate data. The aim of this lesson is to learn different functions for manipulating the data and doing simple analyses. Our end goal is to detect weather anomalies (stormy winds) in Helsinki during August 2017.
Downloading and reading the data¶
Notice that this time we will read the actual data obtained from NOAA, without any modifications made by us.
Start by downloading the data file 6591337447542dat_sample.txt
from this link.
The first rows of the data look like the following:
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
029740 99999 201708040000 114 6 *** *** BKN * * * 25.0 03 ** ** ** ** ** ** ** 2 58 56 1005.6 ***** 999.2 *** *** ***** ***** ***** ***** 0
029740 99999 201708040020 100 6 *** 75 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 57 ****** 29.68 ****** *** *** ***** ***** ***** ***** **
029740 99999 201708040050 100 5 *** 60 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 57 ****** 29.65 ****** *** *** ***** ***** ***** ***** **
029740 99999 201708040100 123 8 *** 63 OVC * * * 10.0 ** ** ** ** 23 ** ** ** * 59 58 1004.7 ***** 998.4 *** *** ***** ***** ***** ***** 0
029740 99999 201708040120 110 7 *** 70 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 59 ****** 29.65 ****** *** *** ***** ***** ***** ***** **
Notice from above that the columns in our data are separated by a varying number of spaces (fixed width).
Note
Write the code for this lesson into a separate script called weather_analysis.py,
because we will re-use it later.
Let’s start by importing pandas and specifying the filepath to the file that we want to read.
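Assuming the data file has been saved into the same folder as the script, the start of weather_analysis.py could look like the sketch below (the relative filepath is an assumption; adjust it if you saved the file elsewhere):

```python
import pandas as pd

# Filepath to the NOAA data file downloaded above
# (assumes the file sits next to this script; adjust the path otherwise)
fp = "6591337447542dat_sample.txt"
```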
As the data is separated by a varying number of spaces, we need to tell Pandas how to split the columns using the sep
parameter. The regular expression \s+
matches one or more whitespace characters, so we can separate the columns with the sep='\s+'
parameter.
Our data also encodes No Data values as a varying number of *
characters, so we need to take those into account as well.
We can tell Pandas to treat them as NaN by specifying na_values=['*', '**', '***', '****', '*****', '******']
.
In [1]: data = pd.read_csv(fp, sep='\s+', na_values=['*', '**', '***', '****', '*****', '******'])
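To see concretely what sep='\s+' and na_values do, here is a small self-contained sketch that parses a made-up two-row sample in the same style as the NOAA file (the values are hypothetical, not taken from the real data):

```python
import io
import pandas as pd

# Two hypothetical rows mimicking the NOAA layout: columns separated by
# runs of spaces, and a missing value marked with a run of '*'
sample = io.StringIO(
    "USAF   WBAN  DIR SPD TEMP\n"
    "029740 99999 114   6   58\n"
    "029740 99999 ***   5   59\n"
)

# r'\s+' matches one or more whitespace characters, and the listed
# '*'-strings are converted to NaN while reading
df = pd.read_csv(sample, sep=r'\s+',
                 na_values=['*', '**', '***', '****', '*****', '******'])
```

After reading, the second row's DIR value is NaN instead of the string `***`.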
Exploring data and renaming columns¶
Let’s see how the data looks by printing the first five rows with the head()
function:
In [2]: data.head()
Out[2]:
index USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L ... \
0 216 29740 99999 201708040000 114.0 6.0 NaN NaN BKN NaN ...
1 217 29740 99999 201708040020 100.0 6.0 NaN 75.0 NaN NaN ...
2 218 29740 99999 201708040050 100.0 5.0 NaN 60.0 NaN NaN ...
3 219 29740 99999 201708040100 123.0 8.0 NaN 63.0 OVC NaN ...
4 220 29740 99999 201708040120 110.0 7.0 NaN 70.0 NaN NaN ...
SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
0 1005.6 NaN 999.2 NaN NaN NaN NaN NaN NaN 0.0
1 NaN 29.68 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 29.65 NaN NaN NaN NaN NaN NaN NaN NaN
3 1004.7 NaN 998.4 NaN NaN NaN NaN NaN NaN 0.0
4 NaN 29.65 NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 34 columns]
Let’s continue and check what columns we have.
In [3]: data.columns
Out[3]:
Index(['index', 'USAF', 'WBAN', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'CLG',
'SKC', 'L', 'M', 'H', 'VSB', 'MW', 'MW.1', 'MW.2', 'MW.3', 'AW', 'AW.1',
'AW.2', 'AW.3', 'W', 'TEMP', 'DEWP', 'SLP', 'ALT', 'STP', 'MAX', 'MIN',
'PCP01', 'PCP06', 'PCP24', 'PCPXX', 'SD'],
dtype='object')
Okay, there are quite a few columns, and we are not interested in using all of them. Let’s select only the columns that might be useful for detecting exceptional weather conditions, i.e. YR--MODAHRMN, DIR, SPD, GUS, TEMP, MAX, and MIN.
In [4]: select_cols = ['YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN']
In [5]: data = data[select_cols]
Let’s see what our data looks like now by printing the last five rows and the data types.
In [6]: data.tail()
Out[6]:
YR--MODAHRMN DIR SPD GUS TEMP MAX MIN
67 201708042220 180.0 11.0 NaN 61.0 NaN NaN
68 201708042250 190.0 8.0 NaN 59.0 NaN NaN
69 201708042300 200.0 9.0 11.0 60.0 NaN NaN
70 201708042320 190.0 8.0 NaN 59.0 NaN NaN
71 201708042350 190.0 8.0 NaN 59.0 NaN NaN
In [7]: data.dtypes
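One thing worth knowing when inspecting the data types: pandas stores integer columns that contain NaN as float64, because NaN is a floating-point value. This is why integer-looking values such as DIR and SPD are printed as 114.0 and 6.0 above. A minimal illustration:

```python
import pandas as pd

# A column of integers with one missing value is upcast to float64,
# since NaN cannot be represented in a plain integer column
s = pd.Series([114, 6, None])
# s.dtype -> float64
```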