Advanced data processing with NumPy¶
This week we will continue developing our skills using NumPy to process real data. In the lesson this week we are using daily weather observations from the Detroit Willow Run Airport station, near Ypsilanti, Michigan in the U.S. This data file is in the same format as the one you’ll work with in this week’s exercise.
Getting started¶
Previewing the data file¶
To start, let’s have a look at the top few lines of our data file.
STATION ELEVATION LATITUDE LONGITUDE DATE PRCP TAVG TMAX TMIN
----------------- ---------- ---------- ---------- -------- -------- -------- -------- --------
GHCND:USW00014853 236.8 42.23333 -83.53333 20070101 0.03 -9999 50 32
GHCND:USW00014853 236.8 42.23333 -83.53333 20070102 0.00 -9999 43 26
GHCND:USW00014853 236.8 42.23333 -83.53333 20070103 0.00 -9999 46 30
So, we have 9 columns in the file, with values separated by variable numbers of spaces. Another thing to notice is the -9999 values, which indicate missing data. Aside from the station identifier, the columns contain numerical values, and those will be our focus for today’s lesson. If you prefer, you’re welcome to open the data file in the JupyterLab editor to explore it further.
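If you’d like, you can also preview the file directly in Python. Here is a minimal sketch that assumes the data file is saved as 1495566.txt in the same directory as this notebook (the file name we use when loading the data below):
# Print the first few lines of the data file to check its structure
with open('1495566.txt') as data_file:
    for line_number in range(5):
        print(data_file.readline().rstrip())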
Importing NumPy and loading the data¶
Now let’s import NumPy and read our data into a variable called data.
[1]:
import numpy as np
Now we can read the data file. Note that we only want the DATE, TAVG, TMAX, and TMIN values, which we can select using the usecols parameter. Don’t forget about skipping the header!
[2]:
fp = '1495566.txt'
data = np.genfromtxt(fp, skip_header=2, usecols=(4, 6, 7, 8))
Now let’s see what we’re dealing with…
[3]:
print(data)
[[ 2.00701010e+07 -9.99900000e+03 5.00000000e+01 3.20000000e+01]
[ 2.00701020e+07 -9.99900000e+03 4.30000000e+01 2.60000000e+01]
[ 2.00701030e+07 -9.99900000e+03 4.60000000e+01 3.00000000e+01]
...,
[ 2.01712290e+07 -9.99900000e+03 2.00000000e+01 1.10000000e+01]
[ 2.01712300e+07 -9.99900000e+03 1.70000000e+01 4.00000000e+00]
[ 2.01712310e+07 -9.99900000e+03 1.70000000e+01 -3.00000000e+00]]
Converting the missing data to nan values¶
Our first processing task will be to convert the missing data values to np.nan values, which will make it easier to use masks later in our data processing. We can go ahead and do this to all of the data to start, following a similar approach to how we used masks in the last lesson.
First, we need to identify a useful test for identifying missing data. We could convert the data to integer values everywhere and test for values equal to -9999, but there’s an even easier option. Since we know all of our data are dates or temperatures, we can simply look for numbers below -9998 to identify missing data.
[4]:
data_mask = (data < -9998)
data[data_mask] = np.nan
[5]:
print(data)
[[ 2.00701010e+07 nan 5.00000000e+01 3.20000000e+01]
[ 2.00701020e+07 nan 4.30000000e+01 2.60000000e+01]
[ 2.00701030e+07 nan 4.60000000e+01 3.00000000e+01]
...,
[ 2.01712290e+07 nan 2.00000000e+01 1.10000000e+01]
[ 2.01712300e+07 nan 1.70000000e+01 4.00000000e+00]
[ 2.01712310e+07 nan 1.70000000e+01 -3.00000000e+00]]
Now we see the expected nan values in place of the -9999 values. Note that a nan value in NumPy is referred to as np.nan.
Splitting the data into column arrays¶
As in the last lesson, we can now separate our data into different arrays with meaningful names. For example, we can create a date array using the first column of data. Remember that we’ll do this using array slicing.
[6]:
date = data[:,0]
[7]:
print(date)
[ 20070101. 20070102. 20070103. ..., 20171229. 20171230. 20171231.]
Looks good, let’s do the others.
[8]:
tavg = data[:,1]
tmax = data[:,2]
tmin = data[:,3]
Examining our data¶
Now that we have loaded the data, let’s have a look at what we’re dealing with.
Basic observations¶
We can start by looking at some basic information, such as the starting and ending dates for the observations.
[9]:
print(date.min())
20070101.0
[10]:
print(date.max())
20171231.0
OK, so we can see our observations start on January 1, 2007 and continue until December 31, 2017.
Checking for missing/bad data¶
Let’s make sure that we don’t have any missing dates by looking for any nan values in the date array. To do this, we’ll need a new NumPy function we’ve not yet used called np.count_nonzero(). np.count_nonzero() does exactly what it says: it counts all values in an array that are not equal to zero and returns the total. False values in mask arrays, for example, are considered equal to zero in the count. Thus, if we pass np.count_nonzero() a mask array of the finite values in date, it will give us the number of dates with useful data.
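As a quick standalone illustration (a small sketch that does not use the lesson data), np.count_nonzero() simply counts the True values in a Boolean array:
# True counts as nonzero (1) and False counts as zero (0), so this prints 3
small_mask = np.array([True, False, True, True])
print(np.count_nonzero(small_mask))
Now let’s see how this works with our date array.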
[11]:
date_mask = np.isfinite(date)
[12]:
print("Number of days:", np.count_nonzero(date_mask))
Number of days: 4016
OK, so we have 4016 dates in the data we’re using, but how do we know whether we’re missing any dates? One way to check is to take the opposite of the mask array values by using the ~ character, the logical not operator. How does it work? Let’s see.
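Here is a small standalone sketch (again, not using the lesson data) of what ~ does to a Boolean array:
# The ~ operator flips True to False and False to True
small_mask = np.array([True, False, True])
print(~small_mask)
Applying the same idea to np.isfinite(date) gives us a mask of the missing dates.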
[13]:
missing_date_mask = ~np.isfinite(date)
[14]:
print("Number of missing days:", np.count_nonzero(missing_date_mask))
Number of missing days: 0
OK, so no missing days, which is not really a big surprise. Good to know at least. How about the other arrays, such as tavg?
[15]:
tavg_mask = np.isfinite(tavg)
print("Number of average temps:", np.count_nonzero(tavg_mask))
Number of average temps: 0
Oh no! So we don’t have any average temperature observations. Looks like we’ll have to use another array for exploring the temperature data.
Calculations within a date range¶
One common thing we’d like to do is calculate something about our data over a fixed time range, for example, one year. We’ve seen this idea previously, but just to remind you we can do this by masking our data.
Removing missing data¶
First, let’s check for any missing data in the tmax array.
[16]:
missing_tmax_mask = ~np.isfinite(tmax)
print("Number of missing tmax values:", np.count_nonzero(missing_tmax_mask))
Number of missing tmax values: 7
OK, so we need to remove the missing values first, which we can do as we’ve done before: make a mask array, then remove the bad values using the mask.
[17]:
tmax_mask = np.isfinite(tmax)
[18]:
tmax_clean = tmax[tmax_mask]
We can also remove the corresponding dates by using the same mask (tmax_mask) on our date array.
[19]:
date_clean = date[tmax_mask]
Average annual temperature, method 1¶
OK, now let’s use a range of dates to find the average maximum temperature for one year. In this case, let’s go for 2010.
[20]:
tmax_2010 = tmax[(date >= 20100101) & (date <= 20101231)]
[21]:
tmax_2010.mean()
[21]:
60.802739726027397
OK, so this works, but we’ll see another more flexible approach in just a moment.
Average monthly temperatures¶
So one possibility is for us to use numerical ranges to calculate things like average annual temperatures. An alternative is to split the dates into separate arrays, one each for the year, month, and day in the data. To do this, we first need to convert the date_clean array to character string values rather than float64 values. Let’s do that.
[22]:
date_clean.astype(str)
[22]:
array(['20070101.0', '20070102.0', '20070103.0', ..., '20171229.0',
'20171230.0', '20171231.0'],
dtype='<U32')
Well, that’s not quite what we want. Ideally, we want a date in the form YYYYMMDD where YYYY is the year, MM is the month, and DD is the day. Looks like we need to convert the dates to integers first.
[23]:
date_clean_int = date_clean.astype(int)
[24]:
date_clean_str = date_clean_int.astype(str)
[25]:
print(date_clean_str)
['20070101' '20070102' '20070103' ..., '20171229' '20171230' '20171231']
That’s better.
Splitting dates into separate year, month, and day¶
Now we can go about splitting our dates into separate arrays for the year, month, and day.
Splitting a single date string¶
As a reminder, character strings can be sliced using index values just like arrays.
[26]:
date_fmt = "YYYYMMDD"
[27]:
date_fmt[0:4]
[27]:
'YYYY'
So by taking the first 4 characters from the string YYYYMMDD we get YYYY. We can do the same for the month, for example.
[28]:
date_fmt[4:6]
[28]:
'MM'
Splitting dates in a NumPy array¶
With this in mind, we can now attempt the same thing with our NumPy array date_clean_str.
[29]:
year = [datenow[0:4] for datenow in date_clean_str]
[30]:
year = np.array(year)
[31]:
print(year)
['2007' '2007' '2007' ..., '2017' '2017' '2017']
OK, so we have the year now, but what happened with that [datenow[0:4] for datenow in date_clean_str] stuff? We’ve never seen that before. You’re right, this is new, and it is an example of a list comprehension in Python. Basically, it is a shorthand approach for looping over all of the values and doing something to them. In our case, we take the first 4 characters of each value in the date_clean_str array.
To be more clear,
[datenow[0:4] for datenow in date_clean_str]
is equivalent to doing
year = []
for datenow in date_clean_str:
    year.append(datenow[0:4])
Since this produces a regular Python list called year, we then need to use the np.array() function to convert that list to a NumPy array. Let’s do the same for the months and days.
[32]:
month = [datenow[4:6] for datenow in date_clean_str]
month = np.array(month)
[33]:
day = [datenow[6:8] for datenow in date_clean_str]
day = np.array(day)
Finding averages in a date range¶
Now that we have separate arrays for the year, month, and day we can do some fun stuff like calculate the monthly average temperature for a given year.
Finding the average monthly max temperature¶
Let’s take 2010 again as our example and find the average maximum temperatures for each month in 2010. For this, we can use a for loop.
[34]:
means_2010 = np.zeros(12)
index = 0
[35]:
for month_now in np.unique(month):
    means_2010[index] = tmax_clean[(month == month_now) & (year == '2010')].mean()
    index = index + 1
[36]:
print(means_2010)
[ 29.77419355 32.46428571 52.51612903 65.76666667 71.96774194
80.23333333 87.5483871 85. 74.36666667 65.5483871
51.83333333 30.80645161]
OK, so let’s break this down as there were a few things that happened here that might be unfamiliar.
- We created an array of zeros with 12 locations to store the monthly average temperatures.
- We used an index variable called index to define where to store each monthly average temperature. We could have done this differently (an alternative using enumerate() is sketched below), but there is a good reason to use index here: it is more flexible if we want to consider more than one year of data, as you’ll see in this week’s exercise.
- We then used a for loop to go through each unique month (found using np.unique()).
- For each month we selected the values meeting the condition month == month_now, limited them to 2010 using the year == '2010' condition, and took the mean of tmax_clean for those days.
- We added 1 to index to ensure the next mean value goes into the correct location in the means_2010 array.
This might seem a bit complicated, but it is rather powerful as well. We’ve efficiently calculated mean monthly max temperatures in only a few lines of code!
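As an aside, one equivalent way to write the same loop is to let Python’s enumerate() function supply the counter instead of updating index by hand. This is only a sketch of an alternative, using the same month, year, and tmax_clean arrays defined above:
# Same monthly means for 2010, with enumerate() supplying the index
means_2010_alt = np.zeros(12)
for index, month_now in enumerate(np.unique(month)):
    means_2010_alt[index] = tmax_clean[(month == month_now) & (year == '2010')].mean()
Both versions produce the same result; keeping a separate index variable, as in the lesson code, just makes the bookkeeping explicit.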
Using functions on NumPy array data¶
As you might have noticed above, the means_2010 array has temperatures in Fahrenheit, and it might be easier for us to work with those values in Celsius. Fortunately, we can easily convert NumPy array data from Fahrenheit to Celsius using a familiar function.
[37]:
def fahrToCelsius(temp):
    """Convert a temperature (or array of temperatures) from degrees Fahrenheit to degrees Celsius."""
    return (temp - 32.0) / 1.8
Now, let’s go about that conversion.
[38]:
means_2010_celsius = fahrToCelsius(means_2010)
[39]:
print(means_2010_celsius)
[ -1.23655914 0.25793651 11.39784946 18.75925926 22.20430108
26.7962963 30.86021505 29.44444444 23.53703704 18.63799283
11.01851852 -0.66308244]
Finally, something easy to do with NumPy! All we have to do here is pass a NumPy array to our function and we get back the converted temperature values.
Computing monthly average temperatures for a range of years¶
One thing we’ll encounter in this week’s exercise is the need to calculate monthly average temperatures for a number of years, such as the average temperature in February from 2010 to 2015. We can do that with the tools we have, but in a slightly different way than calculating the monthly average for each of those years one by one. First, let’s make a few masks, starting with the month.
[40]:
month_mask = month == '02'
Nothing crazy here, just ensuring the month is February, or '02'. Now for the years.
[41]:
year_mask = (year.astype(int) >= 2010) & (year.astype(int) <= 2015)
Again, nothing too scary. We do need to use .astype(int) here to be able to check the range of years, but that is not so bad. Now let’s find the average maximum temperature for February between 2010 and 2015.
[42]:
mean_2010_2015 = tmax_clean[month_mask & year_mask].mean()
[43]:
print(mean_2010_2015)
32.3869047619
Looks about right, but perhaps we can check the value in Celsius to see whether that makes sense for winter in the northern hemisphere.
[44]:
print(fahrToCelsius(mean_2010_2015))
0.214947089947
Right around 0°C, looks good!
Calculating temperature anomalies¶
One thing we can do now that we have a mean maximum temperature for February in 2010-2015 is to compare one of those years to that average for the time period. This would give us a temperature anomaly, the difference between a given year’s value and the mean over the period. Let’s give that a shot.
[45]:
anomaly = means_2010[1] - mean_2010_2015
[46]:
print(anomaly)
0.077380952381
As we can see, February of 2010 was only 0.08° Fahrenheit warmer than the average February between 2010 and 2015. Pretty typical February, it would seem. We’ll explore exactly this kind of thing in more detail in this week’s exercise.