Advanced data processing with NumPy¶
This week we will continue developing our skills using NumPy to process real data. In the lesson this week we are using daily weather observations from the Detroit Willow Run Airport station, near Ypsilanti, Michigan in the U.S. This data file is in the same format as the one you’ll work with in this week’s exercise.
Previewing the data file¶
To start, let’s have a look at the top few lines of our data file.
STATION           ELEVATION  LATITUDE   LONGITUDE  DATE     PRCP     TAVG     TMAX     TMIN
----------------- ---------- ---------- ---------- -------- -------- -------- -------- --------
GHCND:USW00014853 236.8      42.23333   -83.53333  20070101 0.03     -9999    50       32
GHCND:USW00014853 236.8      42.23333   -83.53333  20070102 0.00     -9999    43       26
GHCND:USW00014853 236.8      42.23333   -83.53333  20070103 0.00     -9999    46       30
So, we have 9 columns in the file, with values separated by variable numbers of spaces. Another thing to notice is the -9999 values, which indicate missing data. Otherwise, the columns contain numerical values, and those will be our focus for today’s lesson. If you prefer, you’re welcome to open the data file in the JupyterLab editor to explore it further.
Importing NumPy and loading the data¶
Now let’s import NumPy and read our data into a variable called data.
import numpy as np
Now we can read the data file. Note that we only want the DATE, TAVG, TMAX, and TMIN values, which we can select using the usecols parameter. Don’t forget about skipping the two header lines!
fp = '1495566.txt'
data = np.genfromtxt(fp, skip_header=2, usecols=(4, 6, 7, 8))
Now let’s see what we’re dealing with…
[[ 2.00701010e+07 -9.99900000e+03  5.00000000e+01  3.20000000e+01]
 [ 2.00701020e+07 -9.99900000e+03  4.30000000e+01  2.60000000e+01]
 [ 2.00701030e+07 -9.99900000e+03  4.60000000e+01  3.00000000e+01]
 ...
 [ 2.01712290e+07 -9.99900000e+03  2.00000000e+01  1.10000000e+01]
 [ 2.01712300e+07 -9.99900000e+03  1.70000000e+01  4.00000000e+00]
 [ 2.01712310e+07 -9.99900000e+03  1.70000000e+01 -3.00000000e+00]]
Converting the missing data to nan values¶
Our first processing task will be to convert the missing data values to
np.nan values, which will make it easier to use masks later in our data processing. We can go ahead and do this to all of the data to start, following a similar approach to how we used masks in the last lesson.
First, we need a useful test for identifying missing data. We could convert the data to integer values everywhere and test for values equal to -9999, but there’s an even easier option. Since we know all of our data are dates or temperatures, we can simply look for numbers below -9998 to identify missing data.
data_mask = (data < -9998)
data[data_mask] = np.nan
[[ 2.00701010e+07            nan  5.00000000e+01  3.20000000e+01]
 [ 2.00701020e+07            nan  4.30000000e+01  2.60000000e+01]
 [ 2.00701030e+07            nan  4.60000000e+01  3.00000000e+01]
 ...
 [ 2.01712290e+07            nan  2.00000000e+01  1.10000000e+01]
 [ 2.01712300e+07            nan  1.70000000e+01  4.00000000e+00]
 [ 2.01712310e+07            nan  1.70000000e+01 -3.00000000e+00]]
Now we see the expected nan values in place of the -9999 values. Note that a nan value in NumPy is referred to as np.nan, which stands for “not a number”.
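As a quick aside (not part of the lesson data itself), nan values behave a bit unusually, which is why we’ll rely on np.isfinite() for our masks later on:

```python
import numpy as np

# nan never compares equal to anything, including itself
print(np.nan == np.nan)     # False

# np.isfinite() is the reliable way to detect usable values
print(np.isfinite(np.nan))  # False
print(np.isfinite(42.0))    # True
```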
Splitting the data into column arrays¶
Like the last lesson, we can now separate our data into different arrays with meaningful names. For example, we can create a
date array using the first column of
data. Remember that we’ll do this using array slicing.
date = data[:,0]
[ 20070101. 20070102. 20070103. ..., 20171229. 20171230. 20171231.]
Looks good, let’s do the others.
tavg = data[:,1]
tmax = data[:,2]
tmin = data[:,3]
Examining our data¶
Now that we have loaded the data, let’s have a look at what we’re dealing with.
We can start by looking at some basic information, such as the starting and ending dates for the observations.
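One way to find those dates (a sketch, using a tiny hypothetical stand-in for the date column, since the corresponding code cell is not shown here) is to take the minimum and maximum of the date array:

```python
import numpy as np

# A tiny hypothetical stand-in for the date column loaded above
date = np.array([20070101., 20070102., 20171230., 20171231.])

# The earliest and latest observation dates
print("First observation:", int(date.min()))
print("Last observation:", int(date.max()))
```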
OK, so we can see our observations start on January 1, 2007 and continue until December 31, 2017.
Checking for missing/bad data¶
Let’s make sure that we don’t have any missing dates by looking for any nan values in the date array. To do this, we’ll need a new NumPy function we’ve not yet used called np.count_nonzero(). np.count_nonzero() does exactly what it says: it counts all of the values in an array that are not equal to zero and returns the total. False values in mask arrays, for example, are considered equal to zero in the count. Thus, if we make a mask array of the finite values in an array like date, np.count_nonzero() will give us the number of dates with useful data. Let’s see how it works.
date_mask = np.isfinite(date)
print("Number of days:", np.count_nonzero(date_mask))
Number of days: 4016
OK, so we have 4016 dates in the data we’re using, but how do we know whether we’re missing any dates? One way to check is to take the opposite of the mask array values by using the
~ character, the logical not operator. How does it work? Let’s see.
missing_date_mask = ~np.isfinite(date)
print("Number of missing days:", np.count_nonzero(missing_date_mask))
Number of missing days: 0
OK, so no missing days, which is not really a big surprise. Good to know at least. How about for the other arrays, like tavg?
tavg_mask = np.isfinite(tavg)
print("Number of average temps:", np.count_nonzero(tavg_mask))
Number of average temps: 0
Oh no! So we don’t have any average temperature observations. Looks like we’ll have to use another array for exploring the temperature data.
Calculations within a date range¶
One common thing we’d like to do is calculate something about our data over a fixed time range, for example, one year. We’ve seen this idea previously, but just to remind you we can do this by masking our data.
Removing missing data¶
First, let’s check for any missing data in the tmax array.
missing_tmax_mask = ~np.isfinite(tmax)
print("Number of missing tmax values:", np.count_nonzero(missing_tmax_mask))
Number of missing tmax values: 7
OK, so we need to remove the missing values first, which we can do as we’ve done before: Make a mask array, then remove bad values using the mask.
tmax_mask = np.isfinite(tmax)
tmax_clean = tmax[tmax_mask]
We can also remove the corresponding dates by using the same mask (tmax_mask) on our date array.
date_clean = date[tmax_mask]
Average annual temperature, method 1¶
OK, now let’s use a range of dates to find the average maximum temperature for one year. In this case, let’s go for 2010.
tmax_2010 = tmax[(date >= 20100101) & (date <= 20101231)]
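To get the actual average from this selection, we can take the mean of the masked values. Since tmax still contains nan values, it’s safer to apply the same date range to the cleaned arrays. A minimal sketch with tiny hypothetical stand-in data (the real arrays come from the file above):

```python
import numpy as np

# Hypothetical stand-ins for the cleaned arrays built earlier
date_clean = np.array([20091231., 20100101., 20100615., 20101231., 20110101.])
tmax_clean = np.array([30., 28., 85., 33., 25.])

# Select only the 2010 observations and take their mean
year_2010_mask = (date_clean >= 20100101) & (date_clean <= 20101231)
tmax_2010_mean = tmax_clean[year_2010_mask].mean()
print(tmax_2010_mean)  # mean of 28, 85, and 33
```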
OK, so this works, but we’ll see another more flexible approach in just a moment.
Average monthly temperatures¶
So one possibility is for us to use numerical ranges to calculate things like average annual temperatures. An alternative is to split the dates into separate arrays, one each for the year, month, and day in the data. To do this, we first need to convert the date_clean array to character string values rather than float64 values. Let’s do that.
date_clean.astype(str)
array(['20070101.0', '20070102.0', '20070103.0', ..., '20171229.0', '20171230.0', '20171231.0'], dtype='<U32')
Well, that’s not quite what we want. Ideally, we want dates in the form YYYYMMDD, where YYYY is the year, MM is the month, and DD is the day. Looks like we need to convert the dates to integers first.
date_clean_int = date_clean.astype(int)
date_clean_str = date_clean_int.astype(str)
['20070101' '20070102' '20070103' ..., '20171229' '20171230' '20171231']
Splitting dates into separate year, month, and day¶
Now we can go about splitting our dates into separate arrays for the year, month, and day.
Splitting a single date string¶
As a reminder, it is first useful to recall that character strings can be sliced using index values just like arrays.
date_fmt = "YYYYMMDD"
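For example, slicing this format string works just like slicing an array:

```python
date_fmt = "YYYYMMDD"

year_part = date_fmt[0:4]   # first four characters
month_part = date_fmt[4:6]  # fifth and sixth characters
day_part = date_fmt[6:8]    # last two characters
print(year_part, month_part, day_part)  # YYYY MM DD
```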
So by taking the first 4 characters from the string YYYYMMDD we get YYYY. We can do the same for the month, for example.
Splitting dates in a NumPy array¶
With this in mind, we can now attempt the same with our NumPy array date_clean_str.
year = [datenow[0:4] for datenow in date_clean_str]
year = np.array(year)
['2007' '2007' '2007' ..., '2017' '2017' '2017']
OK, so we have the year now, but what happened with that [datenow[0:4] for datenow in date_clean_str] stuff? We’ve never seen that before. You’re right, this is new; it’s an example of a list comprehension in Python. Basically, it is a shorthand approach for looping over all of the values and doing something to them. In our case, we take the first 4 characters of each value in the date_clean_str array.
To be more clear,
[datenow[0:4] for datenow in date_clean_str]
is equivalent to doing
year = []
for datenow in date_clean_str:
    year.append(datenow[0:4])
Since this produces a new list called year, we then need to use the np.array() function to convert that list to a NumPy array. Let’s do the same for the months and days.
month = [datenow[4:6] for datenow in date_clean_str]
month = np.array(month)
day = [datenow[6:8] for datenow in date_clean_str]
day = np.array(day)
Finding averages in a date range¶
Now that we have separate arrays for the year, month, and day we can do some fun stuff like calculate the monthly average temperature for a given year.
Finding the average monthly max temperature¶
Let’s take 2010 again as our example and find the average max temperature for each month in 2010. For this, we can use a for loop.
means_2010 = np.zeros(12)
index = 0

for month_now in np.unique(month):
    means_2010[index] = tmax_clean[(month == month_now) & (year == '2010')].mean()
    index = index + 1
[ 29.77419355  32.46428571  52.51612903  65.76666667  71.96774194  80.23333333
  87.5483871   85.          74.36666667  65.5483871   51.83333333  30.80645161]
OK, so let’s break this down as there were a few things that happened here that might be unfamiliar.
- We created an array of zeros with 12 locations to store the monthly average temperatures.
- We used an index variable called index to define where to store each monthly average temperature. We could have done this differently, but there is a good reason to use index here: it is more flexible if we wanted to consider more than one year of data, as you’ll see in this week’s exercise.
- We then used a for loop to go through each unique month, found using np.unique(month).
- For each month we selected the max temperatures using the condition month == month_now and took their mean with tmax_clean.mean(). We also limited the months to those in 2010 by using the year == '2010' condition.
- We added one to the index value after each month to ensure the next mean goes into the correct location in the means_2010 array.
This might seem a bit complicated, but it is rather powerful as well. We’ve efficiently calculated mean monthly max temperatures in only a few lines of code!
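As a side note, the same pattern can also be written more compactly with a list comprehension. Here is a sketch using tiny hypothetical stand-in arrays rather than the full dataset:

```python
import numpy as np

# Hypothetical stand-ins for the arrays built earlier in the lesson
year = np.array(['2010', '2010', '2010', '2011'])
month = np.array(['01', '01', '02', '01'])
tmax_clean = np.array([30., 34., 40., 20.])

# One mean per unique month, restricted to observations from 2010
means = np.array([tmax_clean[(month == m) & (year == '2010')].mean()
                  for m in np.unique(month)])
print(means)  # [32. 40.]
```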
Using functions on NumPy array data¶
As you might have noticed above, the means_2010 array has temperatures in Fahrenheit, but they might be easier for us to work with in Celsius. Fortunately, we can easily convert NumPy array data from Fahrenheit to Celsius using a familiar function.
def fahrToCelsius(temp):
    return (temp - 32.0) / 1.8
Now, let’s go about that conversion.
means_2010_celsius = fahrToCelsius(means_2010)
[ -1.23655914   0.25793651  11.39784946  18.75925926  22.20430108  26.7962963
  30.86021505  29.44444444  23.53703704  18.63799283  11.01851852  -0.66308244]
Finally, something easy to do with NumPy! All we have to do here is plug a NumPy array into our function and we get back the converted temperature values.
Computing monthly average temperatures for a range of years¶
One thing we’ll encounter in this week’s exercise is a need to calculate monthly average temperatures for a number of years. For example, the average temperature in February from 2010-2015. We can do that with the tools we have, but in a slightly different way than calculating the monthly average temperature for each month in those years one by one. First, let’s make a few masks, starting with the month.
month_mask = month == '02'
Nothing crazy here, just ensuring the month is February, or
'02'. Now for the years.
year_mask = (year.astype(int) >= 2010) & (year.astype(int) <= 2015)
Again, nothing too scary here. We do need to use the
.astype(int) here to be able to check the range of years, but that is not so bad. Now let’s find that monthly average temperature for February between 2010 and 2015.
mean_2010_2015 = tmax_clean[month_mask & year_mask].mean()
Looks about right, but perhaps we can check the value in Celsius to see whether that makes sense for winter in the northern hemisphere.
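That check might look like the following sketch. The exact mean isn’t printed above, so the input value here is hypothetical, chosen only to be near the February means we computed:

```python
# fahrToCelsius as defined earlier in the lesson
def fahrToCelsius(temp):
    return (temp - 32.0) / 1.8

mean_2010_2015 = 32.4  # hypothetical value standing in for the computed mean
print(fahrToCelsius(mean_2010_2015))  # a bit above 0 degrees Celsius
```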
Right around 0°C, looks good!
Calculating temperature anomalies¶
One thing we can do now that we have a mean temperature for February in 2010-2015 is to compare one of those years to the average max temperature in the time period. This would give us a temperature anomaly, the difference between the mean value and a given year. Let’s give that a shot.
anomaly = means_2010 - mean_2010_2015
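To make the February figure concrete, here is a sketch with hypothetical numbers (February is index 1 in the 12-month means array; only three months are shown for brevity):

```python
import numpy as np

# Hypothetical stand-ins: monthly means for 2010 and a Feb 2010-2015 baseline
means_2010 = np.array([29.77, 32.50, 52.52])  # Jan, Feb, Mar only
mean_2010_2015 = 32.42

# Anomaly: difference between each monthly mean and the baseline
anomaly = means_2010 - mean_2010_2015
print(anomaly[1])  # February 2010 was slightly warmer than the baseline
```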
As we can see, February of 2010 was only about 0.08°F warmer than the average February between 2010 and 2015. A pretty typical February, it would seem. We’ll explore exactly this kind of thing in more detail in this week’s exercise.