Processing data with NumPy

Now you should know the basics of the data structures in NumPy and how to explore your data using tools provided by NumPy. Next, we continue with some of the basic data operations that are regularly needed when doing data analysis.

Let’s first import NumPy, read the same data as before, and split it into column arrays to have a clean start.

[1]:
import numpy as np
[2]:
fp = '../Kumpula-June-2016-w-metadata.txt'
data = np.genfromtxt(fp, skip_header=9, delimiter=',')
[3]:
date = data[:, 0]
temp = data[:, 1]
temp_max = data[:, 2]
temp_min = data[:, 3]

Calculating with NumPy arrays

One of the most common things to do in NumPy is to create new arrays based on calculations involving other arrays (columns).

Creating arrays

Arrays can be created in several ways. A common approach is to create an array of zeros with the same length as other existing arrays. This can be thought of as a blank space for calculations.

[4]:
diff = np.zeros(len(date))
[5]:
print(diff)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]

So, what just happened? We created a new array of zeros using the NumPy zeros() function, which takes the size of the array as a parameter. In our case, we’ve given the size as the length of the date array, i.e., len(date).
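
As an aside (these helpers are not used with our lesson data, but they work the same way as zeros()), NumPy also offers ones() and full() for creating pre-filled arrays:

```python
import numpy as np

# Three common ways to create a pre-filled array of a given size
blank = np.zeros(5)           # five zeros
ones = np.ones(5)             # five ones
nodata = np.full(5, -9999.0)  # five copies of a chosen fill value

print(blank)   # [0. 0. 0. 0. 0.]
print(ones)    # [1. 1. 1. 1. 1.]
print(nodata)  # [-9999. -9999. -9999. -9999. -9999.]
```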

Calculating values using other arrays

We can now use our new diff array to calculate something useful, such as the difference between the temp_max and temp_min values for each row in our data. How do we do that? It’s easy.

[6]:
diff = temp_max - temp_min
[7]:
print(diff)
[18.9 25.8 22.3 23.6 15.1 16.9 19.2 12.9  8.4 12.9 20.4 18.2 20.9 20.
 21.  11.9 14.8  8.8  5.1 16.9 21.  14.8 12.2 12.2 17.5 17.4 12.4 17.2
 13.5 13.5]

We simply subtract temp_min from temp_max.
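
The subtraction is applied element by element: the first value of temp_min is subtracted from the first value of temp_max, and so on down the arrays. A minimal sketch with made-up values:

```python
import numpy as np

highs = np.array([20.0, 22.5, 19.0])
lows = np.array([10.0, 12.5, 11.0])

# Element-wise subtraction pairs up values by position
print(highs - lows)  # [10. 10.  8.]
```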

In fact, we don’t even need to create the array first. Let’s consider another example of calculating the difference between the daily mean temperature and the minimum temperature. We can calculate that simply as follows.

[8]:
diff_min = temp - temp_min
[9]:
print(diff_min)
[10.8 10.8 12.8 10.2  8.2  9.4 11.   6.7  3.7  6.5 12.3  9.4 11.  11.9
 14.1  2.2  4.5  3.3  2.2  7.1 12.2  6.3  6.   4.4  7.8  9.3  3.1  9.6
  6.1  6.5]

When we subtract one NumPy array from another, NumPy is smart enough to automatically create a new array to store the output. We can confirm this by checking the type of the diff_min array.

[10]:
type(diff_min)
[10]:
numpy.ndarray
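
In addition to type(), the shape and dtype attributes of an array tell us its dimensions and the kind of values it stores — a brief sketch with a small made-up array:

```python
import numpy as np

arr = np.array([10.8, 12.8, 10.2])

print(type(arr))  # <class 'numpy.ndarray'>
print(arr.shape)  # (3,) - one dimension with three values
print(arr.dtype)  # float64
```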

As one final example, let’s consider converting temperatures in Fahrenheit to Celsius. We can store the results as temp_celsius.

[11]:
temp_celsius = (temp - 32) / (9/5)
[12]:
print(temp_celsius)
[18.61111111 18.77777778 20.22222222 14.16666667 10.77777778 11.22222222
 13.83333333 12.33333333  9.66666667  9.72222222 12.22222222 13.
 14.61111111 15.38888889 17.44444444 14.33333333 15.77777778 14.05555556
 13.5        15.16666667 17.         16.5        16.05555556 16.16666667
 18.72222222 20.88888889 15.94444444 18.55555556 18.77777778 18.72222222]

Again, since we use a NumPy ndarray in the calculation, an ndarray is output.

Filtering data

Another common task is to find a subset of the data that matches some criterion. For example, we might want to create an array called w_temps that contains “warm” temperatures, those above 15°C. We can do that as follows.

[13]:
w_temps = temp_celsius[temp_celsius > 15.0]
[14]:
print(w_temps)
[18.61111111 18.77777778 20.22222222 15.38888889 17.44444444 15.77777778
 15.16666667 17.         16.5        16.05555556 16.16666667 18.72222222
 20.88888889 15.94444444 18.55555556 18.77777778 18.72222222]

Here, we see only the temperatures above 15°C, as expected.

It is also possible to combine multiple criteria at the same time. Here, we select temperatures above 15 degrees that were recorded in the second half of June in 2016 (i.e. date >= 20160615). Combining multiple criteria can be done with the & operator (logical AND) or the | operator (logical OR). Note that it is important to enclose each clause in parentheses () because of operator precedence.

[15]:
w_temps2 = temp_celsius[(temp_celsius > 15.0) & (date >= 20160615)]
[16]:
print(w_temps2)
[17.44444444 15.77777778 15.16666667 17.         16.5        16.05555556
 16.16666667 18.72222222 20.88888889 15.94444444 18.55555556 18.77777778
 18.72222222]

With two constraints on the data, that temp_celsius must be greater than 15°C and the date must be on or after June 15, 2016 (i.e., 20160615), we get a smaller subset of the original data.
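
The | operator works the same way. As a hypothetical sketch with made-up temperatures, we could select days that were either quite warm or quite cool:

```python
import numpy as np

temps = np.array([14.0, 21.0, 9.0, 16.0, 11.0])

# Logical OR: keep values above 20 degrees or below 10 degrees
extremes = temps[(temps > 20.0) | (temps < 10.0)]
print(extremes)  # [21.  9.]
```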

Using data masks

The filtering examples above are nice, but what if we want to identify the dates with temperatures above 15°C and keep only those dates in our other data columns, such as date, temp, etc.? How can we do that?

In order to do that, we will need to use a mask array. A mask array is basically a boolean (True/False) array that can be used to take a subset of data from other arrays. Let’s consider our example of warm temperatures once again. Rather than extracting w_temps directly, we can start by identifying the values in temp_celsius where the value is above 15°C (True) or less than or equal to 15°C (False). The logic is quite similar to before.

[17]:
w_temps_mask = temp_celsius > 15.0
[18]:
print(w_temps_mask)
[ True  True  True False False False False False False False False False
 False  True  True False  True False False  True  True  True  True  True
  True  True  True  True  True  True]

Now we see an array of True and False values with the same size as temp_celsius. This array shows us whether the condition we stated is True or False at each index.

Now, if we wanted to know the dates when the temperature was above 15°C, we can simply take the values from the date array using the mask we just created.

[19]:
w_temp_dates = date[w_temps_mask]
[20]:
print(w_temp_dates)
[20160601. 20160602. 20160603. 20160614. 20160615. 20160617. 20160620.
 20160621. 20160622. 20160623. 20160624. 20160625. 20160626. 20160627.
 20160628. 20160629. 20160630.]

Cool, right? Now we see only the subset of dates that match the condition of having a temperature above 15°C, and the lengths of w_temps and w_temp_dates are the same, meaning we know both the date that the temperature exceeded 15°C and the temperature itself.

[21]:
len(w_temps)
[21]:
17
[22]:
len(w_temp_dates)
[22]:
17
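
Because True counts as 1 in arithmetic, a mask can also be summed to count matching values without extracting them — a small sketch with made-up data:

```python
import numpy as np

temps = np.array([16.0, 12.0, 18.0, 14.0, 20.0])
mask = temps > 15.0

print(mask)        # [ True False  True False  True]
print(mask.sum())  # 3 values above 15 degrees
```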

Removing missing/bad data

In some cases, a data file might contain missing values or values that cannot be read. These may be replaced by nan values when you look at your data. nan stands for “not a number”, and we often want to remove such values before doing calculations.

Let’s consider a case where we have an array bad_data that is full of zeros, has the same size as date and the other arrays from our data file, and the first 5 rows have nan values.

[23]:
bad_data = np.zeros(len(date))
[24]:
bad_data[:5] = np.nan
[25]:
print(bad_data)
[nan nan nan nan nan  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

You can see the problem clearly.

If we wanted to include only the values in the date column that correspond to locations in bad_data where we do not have a nan value, we can use the isfinite() function in NumPy. isfinite() checks whether a value is defined (i.e., is not nan or infinite (inf)). Let’s make a mask with bad_data.

[26]:
bad_data_mask = np.isfinite(bad_data)
[27]:
print(bad_data_mask)
[False False False False False  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True]

We see the expected results. If we now want to include only the dates with good data, we can use the mask as we did before.

[28]:
good_dates = date[bad_data_mask]
[29]:
print(good_dates)
[20160606. 20160607. 20160608. 20160609. 20160610. 20160611. 20160612.
 20160613. 20160614. 20160615. 20160616. 20160617. 20160618. 20160619.
 20160620. 20160621. 20160622. 20160623. 20160624. 20160625. 20160626.
 20160627. 20160628. 20160629. 20160630.]
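
A related function is np.isnan(), which flags the nan values directly; inverting it with the ~ operator gives the same result as isfinite() when nan values are the only problem (a brief sketch):

```python
import numpy as np

values = np.array([np.nan, np.nan, 1.5, 2.0, np.nan])

print(np.isnan(values))           # [ True  True False False  True]
print(values[~np.isnan(values)])  # [1.5 2. ]
```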

Rounding and finding unique values

It is possible to round values easily using the round() method for NumPy arrays. Let’s round our temperatures in Celsius to zero decimal places.

[30]:
temp_celsius_rounded = temp_celsius.round(0)
[31]:
print(temp_celsius_rounded)
[19. 19. 20. 14. 11. 11. 14. 12. 10. 10. 12. 13. 15. 15. 17. 14. 16. 14.
 13. 15. 17. 16. 16. 16. 19. 21. 16. 19. 19. 19.]

Finding unique values

We can find unique values in an array using the unique() function.

[32]:
unique = np.unique(temp_celsius_rounded)
[33]:
print(unique)
[10. 11. 12. 13. 14. 15. 16. 17. 19. 20. 21.]

Now we do not see any repeated values in our rounded temperatures.
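
unique() can also report how often each value occurs via its return_counts parameter — a short sketch with made-up rounded temperatures:

```python
import numpy as np

rounded = np.array([19.0, 19.0, 20.0, 14.0, 19.0, 14.0])

# return_counts=True also returns the number of occurrences of each value
values, counts = np.unique(rounded, return_counts=True)
print(values)  # [14. 19. 20.]
print(counts)  # [2 3 1]
```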

Saving our data to a file

Finally we can save our modified data to a file for future use. We’ll need to do a few steps to get there, however.

Re-creating our 2D data array

As you have seen, we have mostly worked with single columns after reading in our data. We can recreate a 2D data structure by stacking these columns back together.

For example, let’s put together our date, temp, and temp_celsius columns in a new data array called new_data. We can start by stacking the data together using the vstack() function.

[34]:
new_data = np.vstack((date, temp, temp_celsius))
[35]:
print(new_data)
[[2.01606010e+07 2.01606020e+07 2.01606030e+07 2.01606040e+07
  2.01606050e+07 2.01606060e+07 2.01606070e+07 2.01606080e+07
  2.01606090e+07 2.01606100e+07 2.01606110e+07 2.01606120e+07
  2.01606130e+07 2.01606140e+07 2.01606150e+07 2.01606160e+07
  2.01606170e+07 2.01606180e+07 2.01606190e+07 2.01606200e+07
  2.01606210e+07 2.01606220e+07 2.01606230e+07 2.01606240e+07
  2.01606250e+07 2.01606260e+07 2.01606270e+07 2.01606280e+07
  2.01606290e+07 2.01606300e+07]
 [6.55000000e+01 6.58000000e+01 6.84000000e+01 5.75000000e+01
  5.14000000e+01 5.22000000e+01 5.69000000e+01 5.42000000e+01
  4.94000000e+01 4.95000000e+01 5.40000000e+01 5.54000000e+01
  5.83000000e+01 5.97000000e+01 6.34000000e+01 5.78000000e+01
  6.04000000e+01 5.73000000e+01 5.63000000e+01 5.93000000e+01
  6.26000000e+01 6.17000000e+01 6.09000000e+01 6.11000000e+01
  6.57000000e+01 6.96000000e+01 6.07000000e+01 6.54000000e+01
  6.58000000e+01 6.57000000e+01]
 [1.86111111e+01 1.87777778e+01 2.02222222e+01 1.41666667e+01
  1.07777778e+01 1.12222222e+01 1.38333333e+01 1.23333333e+01
  9.66666667e+00 9.72222222e+00 1.22222222e+01 1.30000000e+01
  1.46111111e+01 1.53888889e+01 1.74444444e+01 1.43333333e+01
  1.57777778e+01 1.40555556e+01 1.35000000e+01 1.51666667e+01
  1.70000000e+01 1.65000000e+01 1.60555556e+01 1.61666667e+01
  1.87222222e+01 2.08888889e+01 1.59444444e+01 1.85555556e+01
  1.87777778e+01 1.87222222e+01]]

Now we have our data back in a single array, but something isn’t quite right. The columns and rows need to be flipped. We can do this using the transpose() function.

[36]:
new_data = np.transpose(new_data)
[37]:
print(new_data)
[[2.01606010e+07 6.55000000e+01 1.86111111e+01]
 [2.01606020e+07 6.58000000e+01 1.87777778e+01]
 [2.01606030e+07 6.84000000e+01 2.02222222e+01]
 [2.01606040e+07 5.75000000e+01 1.41666667e+01]
 [2.01606050e+07 5.14000000e+01 1.07777778e+01]
 [2.01606060e+07 5.22000000e+01 1.12222222e+01]
 [2.01606070e+07 5.69000000e+01 1.38333333e+01]
 [2.01606080e+07 5.42000000e+01 1.23333333e+01]
 [2.01606090e+07 4.94000000e+01 9.66666667e+00]
 [2.01606100e+07 4.95000000e+01 9.72222222e+00]
 [2.01606110e+07 5.40000000e+01 1.22222222e+01]
 [2.01606120e+07 5.54000000e+01 1.30000000e+01]
 [2.01606130e+07 5.83000000e+01 1.46111111e+01]
 [2.01606140e+07 5.97000000e+01 1.53888889e+01]
 [2.01606150e+07 6.34000000e+01 1.74444444e+01]
 [2.01606160e+07 5.78000000e+01 1.43333333e+01]
 [2.01606170e+07 6.04000000e+01 1.57777778e+01]
 [2.01606180e+07 5.73000000e+01 1.40555556e+01]
 [2.01606190e+07 5.63000000e+01 1.35000000e+01]
 [2.01606200e+07 5.93000000e+01 1.51666667e+01]
 [2.01606210e+07 6.26000000e+01 1.70000000e+01]
 [2.01606220e+07 6.17000000e+01 1.65000000e+01]
 [2.01606230e+07 6.09000000e+01 1.60555556e+01]
 [2.01606240e+07 6.11000000e+01 1.61666667e+01]
 [2.01606250e+07 6.57000000e+01 1.87222222e+01]
 [2.01606260e+07 6.96000000e+01 2.08888889e+01]
 [2.01606270e+07 6.07000000e+01 1.59444444e+01]
 [2.01606280e+07 6.54000000e+01 1.85555556e+01]
 [2.01606290e+07 6.58000000e+01 1.87777778e+01]
 [2.01606300e+07 6.57000000e+01 1.87222222e+01]]

That’s better!
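
As a side note, NumPy arrays also have a .T attribute that is shorthand for transpose(), so the stack-and-flip could be written more compactly — a sketch with a small array:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# .T transposes a 2D array, swapping rows and columns
stacked = np.vstack((a, b)).T
print(stacked.shape)  # (3, 2): three rows, two columns
```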

Saving our data

With the data in the correct format, we can now save it to a file using the savetxt() function. Let’s save our data to a file called converted_temps.csv, where the .csv indicates the data values are separated by commas (comma-separated values).

[38]:
np.savetxt('converted_temps.csv', new_data, delimiter=',')

Cool. We have now saved the array new_data to the file converted_temps.csv with commas between the values (using the delimiter=',' parameter).
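
By default, savetxt() writes values in scientific notation, as we saw when printing new_data. The fmt parameter controls the number format and the header parameter adds a comment line, which can make the file easier to read. A sketch with a small made-up array and a hypothetical file name:

```python
import numpy as np

data = np.array([[20160601.0, 65.5, 18.61],
                 [20160602.0, 65.8, 18.78]])

# fmt='%.2f' writes plain decimals with two digits;
# the header line is prefixed with '# ' by default
np.savetxt('formatted_temps.csv', data, delimiter=',',
           fmt='%.2f', header='date,temp_f,temp_c')
```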