Processing data with Pandas¶

Now you should know the basics of the data structures in Pandas and how to explore your data using some tools that is provided by Pandas. Next we continue to explore some of the basic data operations that are regularly needed when doing data analysis.

Let’s first read the same data as before into Pandas to have a clean start.

In [1] dataFrame = pd.read_csv('Kumpula-June-2016-w-metadata.txt', sep=',', skiprows=8)

Calculating with DataFrames¶

One of the most common things to do in Pandas is to create new columns based on calculations between different variables (columns).

Creating a new column into our DataFrame is easy by specifying the name of the column and giving it some default value (in this case decimal number 0.0).

In [1]: dataFrame['DIFF'] = 0.0

In [2]: print(dataFrame)
    YEARMODA  TEMP   MAX   MIN  DIFF
 20160601  65.5  73.6  54.7   0.0
 20160602  65.8  80.8  55.0   0.0
 20160603  68.4  77.9  55.6   0.0
 20160604  57.5  70.9  47.3   0.0
 20160605  51.4  58.3  43.2   0.0
 20160606  52.2  59.7  42.8   0.0
 20160607  56.9  65.1  45.9   0.0
 20160608  54.2  60.4  47.5   0.0
 20160609  49.4  54.1  45.7   0.0
 20160610  49.5  55.9  43.0   0.0
20160611  54.0  62.1  41.7   0.0
20160612  55.4  64.2  46.0   0.0
20160613  58.3  68.2  47.3   0.0
20160614  59.7  67.8  47.8   0.0
20160615  63.4  70.3  49.3   0.0
20160616  57.8  67.5  55.6   0.0
20160617  60.4  70.7  55.9   0.0
20160618  57.3  62.8  54.0   0.0
20160619  56.3  59.2  54.1   0.0
20160620  59.3  69.1  52.2   0.0
20160621  62.6  71.4  50.4   0.0
20160622  61.7  70.2  55.4   0.0
20160623  60.9  67.1  54.9   0.0
20160624  61.1  68.9  56.7   0.0
20160625  65.7  75.4  57.9   0.0
20160626  69.6  77.7  60.3   0.0
20160627  60.7  70.0  57.6   0.0
20160628  65.4  73.0  55.8   0.0
20160629  65.8  73.2  59.7   0.0
20160630  65.7  72.7  59.2   0.0

Let’s check the datatype of our new column

In [3]: dataFrame['DIFF'].dtypes
Out[3]: dtype('float64')

Okey, so we see that Pandas created a new column and recognized automatically that the data type is float as we passed a 0.0 value to it.

We can also easily do calculations inside our DataFrame. Let’s update the column DIFF by calculating the difference between MAX and MIN columns to get an idea how much the temperatures have been varying during different days. Calculations can be done with following syntax where we first specify the column that we want to update (i.e. DIFF) and then do the actual calculation using the columns that we have in our dataFrameFrame.

In [4]: dataFrame['DIFF'] = dataFrame['MAX'] - dataFrame['MIN']

In [5]: print(dataFrame)
    YEARMODA  TEMP   MAX   MIN  DIFF
 20160601  65.5  73.6  54.7  18.9
 20160602  65.8  80.8  55.0  25.8
 20160603  68.4  77.9  55.6  22.3
 20160604  57.5  70.9  47.3  23.6
 20160605  51.4  58.3  43.2  15.1
 20160606  52.2  59.7  42.8  16.9
 20160607  56.9  65.1  45.9  19.2
 20160608  54.2  60.4  47.5  12.9
 20160609  49.4  54.1  45.7   8.4
 20160610  49.5  55.9  43.0  12.9
20160611  54.0  62.1  41.7  20.4
20160612  55.4  64.2  46.0  18.2
20160613  58.3  68.2  47.3  20.9
20160614  59.7  67.8  47.8  20.0
20160615  63.4  70.3  49.3  21.0
20160616  57.8  67.5  55.6  11.9
20160617  60.4  70.7  55.9  14.8
20160618  57.3  62.8  54.0   8.8
20160619  56.3  59.2  54.1   5.1
20160620  59.3  69.1  52.2  16.9
20160621  62.6  71.4  50.4  21.0
20160622  61.7  70.2  55.4  14.8
20160623  60.9  67.1  54.9  12.2
20160624  61.1  68.9  56.7  12.2
20160625  65.7  75.4  57.9  17.5
20160626  69.6  77.7  60.3  17.4
20160627  60.7  70.0  57.6  12.4
20160628  65.4  73.0  55.8  17.2
20160629  65.8  73.2  59.7  13.5
20160630  65.7  72.7  59.2  13.5

Okey so now we can see that the calculations were inserted into the DIFF column as planned. Notice that you can do calculations directly without first creating the column. Let’s test this by calculating the difference between minimum temperature (MIN) and the mean temperature of the day (TEMP).

In [6]: dataFrame['DIFF_Min'] = dataFrame['TEMP'] - dataFrame['MIN']

In [7]: print(dataFrame)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min
 20160601  65.5  73.6  54.7  18.9      10.8
 20160602  65.8  80.8  55.0  25.8      10.8
 20160603  68.4  77.9  55.6  22.3      12.8
 20160604  57.5  70.9  47.3  23.6      10.2
 20160605  51.4  58.3  43.2  15.1       8.2
 20160606  52.2  59.7  42.8  16.9       9.4
 20160607  56.9  65.1  45.9  19.2      11.0
 20160608  54.2  60.4  47.5  12.9       6.7
 20160609  49.4  54.1  45.7   8.4       3.7
 20160610  49.5  55.9  43.0  12.9       6.5
20160611  54.0  62.1  41.7  20.4      12.3
20160612  55.4  64.2  46.0  18.2       9.4
20160613  58.3  68.2  47.3  20.9      11.0
20160614  59.7  67.8  47.8  20.0      11.9
20160615  63.4  70.3  49.3  21.0      14.1
20160616  57.8  67.5  55.6  11.9       2.2
20160617  60.4  70.7  55.9  14.8       4.5
20160618  57.3  62.8  54.0   8.8       3.3
20160619  56.3  59.2  54.1   5.1       2.2
20160620  59.3  69.1  52.2  16.9       7.1
20160621  62.6  71.4  50.4  21.0      12.2
20160622  61.7  70.2  55.4  14.8       6.3
20160623  60.9  67.1  54.9  12.2       6.0
20160624  61.1  68.9  56.7  12.2       4.4
20160625  65.7  75.4  57.9  17.5       7.8
20160626  69.6  77.7  60.3  17.4       9.3
20160627  60.7  70.0  57.6  12.4       3.1
20160628  65.4  73.0  55.8  17.2       9.6
20160629  65.8  73.2  59.7  13.5       6.1
20160630  65.7  72.7  59.2  13.5       6.5

As you can see, now we created directly a new column with the calculation. In a similar manner, you can do calculations using as many columns as you need and using any kind of math algebra (e.g. subtracttion, addition, multiplication, division, exponentiation, etc.).

We can for example convert the Fahrenheit temperatures in TEMP column into Celsius using the formula that we have seen already many times:

In [8]: dataFrame['TEMP_Celsius'] = (dataFrame['TEMP'] - 32) / (9/5)

In [9]: print(dataFrame)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
 20160601  65.5  73.6  54.7  18.9      10.8     18.611111
 20160602  65.8  80.8  55.0  25.8      10.8     18.777778
 20160603  68.4  77.9  55.6  22.3      12.8     20.222222
 20160604  57.5  70.9  47.3  23.6      10.2     14.166667
 20160605  51.4  58.3  43.2  15.1       8.2     10.777778
 20160606  52.2  59.7  42.8  16.9       9.4     11.222222
 20160607  56.9  65.1  45.9  19.2      11.0     13.833333
 20160608  54.2  60.4  47.5  12.9       6.7     12.333333
 20160609  49.4  54.1  45.7   8.4       3.7      9.666667
 20160610  49.5  55.9  43.0  12.9       6.5      9.722222
20160611  54.0  62.1  41.7  20.4      12.3     12.222222
20160612  55.4  64.2  46.0  18.2       9.4     13.000000
20160613  58.3  68.2  47.3  20.9      11.0     14.611111
20160614  59.7  67.8  47.8  20.0      11.9     15.388889
20160615  63.4  70.3  49.3  21.0      14.1     17.444444
20160616  57.8  67.5  55.6  11.9       2.2     14.333333
20160617  60.4  70.7  55.9  14.8       4.5     15.777778
20160618  57.3  62.8  54.0   8.8       3.3     14.055556
20160619  56.3  59.2  54.1   5.1       2.2     13.500000
20160620  59.3  69.1  52.2  16.9       7.1     15.166667
20160621  62.6  71.4  50.4  21.0      12.2     17.000000
20160622  61.7  70.2  55.4  14.8       6.3     16.500000
20160623  60.9  67.1  54.9  12.2       6.0     16.055556
20160624  61.1  68.9  56.7  12.2       4.4     16.166667
20160625  65.7  75.4  57.9  17.5       7.8     18.722222
20160626  69.6  77.7  60.3  17.4       9.3     20.888889
20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
20160630  65.7  72.7  59.2  13.5       6.5     18.722222

Selecting data using indices¶

One quite common procedure in programming that you want to select only specific rows from your data and possibly apply some operations into those rows only. In Pandas there are different ways of doing this.

One common way of selecting only specific rows from your DataFrame is done via index slicing to extract part of the DataFrame.

Let’s select the first five rows and assign them to a variable called rows5.

In [10]: rows5 = dataFrame[0:5]

In [11]: print(rows5)
   YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
0  20160601  65.5  73.6  54.7  18.9      10.8     18.611111
1  20160602  65.8  80.8  55.0  25.8      10.8     18.777778
2  20160603  68.4  77.9  55.6  22.3      12.8     20.222222
3  20160604  57.5  70.9  47.3  23.6      10.2     14.166667
4  20160605  51.4  58.3  43.2  15.1       8.2     10.777778

As you can see, slicing is done in a similar manner as with normal Python lists, i.e. you specify index range you want to select inside the square brackets selection = dataFrame[start_index:stop_index].

You can also select an individual row from specific position using .loc[] indexing. Here we select all the data values from row 8.

In [12]: row8 = dataFrame.loc[8]

In [13]: print(row8)
YEARMODA        2.016061e+07
TEMP            4.940000e+01
MAX             5.410000e+01
MIN             4.570000e+01
DIFF            8.400000e+00
DIFF_Min        3.700000e+00
TEMP_Celsius    9.666667e+00
Name: 8, dtype: float64

Attention

Based on what we know about Python lists, what would you expect to see output from the following code?

In [13]: print(dataFrame.loc[-1])

Select your answer from the poll options at https://geo-python.github.io/poll/.

.loc[] indexing returns the values from that position as a pd.Series where the indices are actually the column names of those variables. Hence, you can access the value of an individual column by referring to its index using following format (both should work):

In [14]: print(row8['TEMP'])
49.4

In [15]: print(row8.YEARMODA)
20160609.0

It is also possible to select multiple rows simultaniously. Here, we select only temperature values (TEMP) between indices of 5-10:

In [16]: temps_5to10 = dataFrame.loc[5:10, 'TEMP']

In [17]: print(temps_5to10)
5     52.2
6     56.9
7     54.2
8     49.4
9     49.5
10    54.0
Name: TEMP, dtype: float64

It is also possible to select multiple columns using those same indices. Here, we select TEMP and the TEMP_Celsius columns by passing them inside a list (.loc[start_index:stop_index, list_of_columns]):

In [18]: temps_5to10 = dataFrame.loc[5:10, ['TEMP', 'TEMP_Celsius']]

In [19]: print(temps_5to10)
    TEMP  TEMP_Celsius
5   52.2     11.222222
6   56.9     13.833333
7   54.2     12.333333
8   49.4      9.666667
9   49.5      9.722222
10  54.0     12.222222

Of course, you can also get all values from those columns. This, can be done by simply referring to the dataFrame and inserting a list of columns inside the square brackets that you want to include.

In [20]: temps_only = dataFrame[['TEMP', 'TEMP_Celsius']]

In [21]: print(temps_only)
    TEMP  TEMP_Celsius
 65.5     18.611111
 65.8     18.777778
 68.4     20.222222
 57.5     14.166667
 51.4     10.777778
 52.2     11.222222
 56.9     13.833333
 54.2     12.333333
 49.4      9.666667
 49.5      9.722222
54.0     12.222222
55.4     13.000000
58.3     14.611111
59.7     15.388889
63.4     17.444444
57.8     14.333333
60.4     15.777778
57.3     14.055556
56.3     13.500000
59.3     15.166667
62.6     17.000000
61.7     16.500000
60.9     16.055556
61.1     16.166667
65.7     18.722222
69.6     20.888889
60.7     15.944444
65.4     18.555556
65.8     18.777778
65.7     18.722222

Filtering and updating data¶

One really useful feature in Pandas is the ability to easily filter and select rows based on certain criteria using .ix[] indexing. The following example shows how to select rows when the Celsius temperature has been higher than 15 degrees into variable w_temps (warm temperatures).

In [22]: w_temps = dataFrame.ix[dataFrame['TEMP_Celsius'] > 15]

In [23]: print(w_temps)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
 20160601  65.5  73.6  54.7  18.9      10.8     18.611111
 20160602  65.8  80.8  55.0  25.8      10.8     18.777778
 20160603  68.4  77.9  55.6  22.3      12.8     20.222222
20160614  59.7  67.8  47.8  20.0      11.9     15.388889
20160615  63.4  70.3  49.3  21.0      14.1     17.444444
20160617  60.4  70.7  55.9  14.8       4.5     15.777778
20160620  59.3  69.1  52.2  16.9       7.1     15.166667
20160621  62.6  71.4  50.4  21.0      12.2     17.000000
20160622  61.7  70.2  55.4  14.8       6.3     16.500000
20160623  60.9  67.1  54.9  12.2       6.0     16.055556
20160624  61.1  68.9  56.7  12.2       4.4     16.166667
20160625  65.7  75.4  57.9  17.5       7.8     18.722222
20160626  69.6  77.7  60.3  17.4       9.3     20.888889
20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
20160630  65.7  72.7  59.2  13.5       6.5     18.722222

It is also possible to combine multiple criteria at the same time. Here, we select temperatures above 15 degrees that were recorded on the second half of June in 2016 (i.e. YEARMODA >= 20160615). Combining multiple criteria can be done with & operator (AND) or | operator (OR). Notice, that it is often useful to separate the different clauses inside the parentheses ().

In [24]: w_temps2 = dataFrame.ix[(dataFrame['TEMP_Celsius'] > 15) & (dataFrame['YEARMODA'] >= 20160615)]

In [25]: print(w_temps2)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
20160615  63.4  70.3  49.3  21.0      14.1     17.444444
20160617  60.4  70.7  55.9  14.8       4.5     15.777778
20160620  59.3  69.1  52.2  16.9       7.1     15.166667
20160621  62.6  71.4  50.4  21.0      12.2     17.000000
20160622  61.7  70.2  55.4  14.8       6.3     16.500000
20160623  60.9  67.1  54.9  12.2       6.0     16.055556
20160624  61.1  68.9  56.7  12.2       4.4     16.166667
20160625  65.7  75.4  57.9  17.5       7.8     18.722222
20160626  69.6  77.7  60.3  17.4       9.3     20.888889
20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
20160630  65.7  72.7  59.2  13.5       6.5     18.722222

Now we have a subset of our DataFrame with only rows where the TEMP_Celsius is above 15 and the dates in YEARMODA column start from 15th of June.

Notice, that the index values (numbers on the left) are still showing the positions from the original DataFrame. It is possible to reset the index using reset_index() function that might be useful in some cases to be able to slice the data in a similar manner as above. By default the reset_index() would make a new column called index to keep track on the previous index which might be useful in some cases but here not, so we can omit that by passing parameter drop=True.

In [26]: w_temps2 = w_temps2.reset_index(drop=True)

In [27]: print(w_temps2)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
 20160615  63.4  70.3  49.3  21.0      14.1     17.444444
 20160617  60.4  70.7  55.9  14.8       4.5     15.777778
 20160620  59.3  69.1  52.2  16.9       7.1     15.166667
 20160621  62.6  71.4  50.4  21.0      12.2     17.000000
 20160622  61.7  70.2  55.4  14.8       6.3     16.500000
 20160623  60.9  67.1  54.9  12.2       6.0     16.055556
 20160624  61.1  68.9  56.7  12.2       4.4     16.166667
 20160625  65.7  75.4  57.9  17.5       7.8     18.722222
 20160626  69.6  77.7  60.3  17.4       9.3     20.888889
 20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
20160630  65.7  72.7  59.2  13.5       6.5     18.722222

As can be seen, now the index values goes from 0 to 12.

Dealing with missing data¶

Next we update the first five values of TEMP_Celsius in our w_temps2 DataFrame to be NaN (not-a-number). This can be done by utilizing the loc[] indexing.

In [28]: w_temps2.loc[:4, 'TEMP_Celsius'] = None

In [29]: print(w_temps2)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
 20160615  63.4  70.3  49.3  21.0      14.1           NaN
 20160617  60.4  70.7  55.9  14.8       4.5           NaN
 20160620  59.3  69.1  52.2  16.9       7.1           NaN
 20160621  62.6  71.4  50.4  21.0      12.2           NaN
 20160622  61.7  70.2  55.4  14.8       6.3           NaN
 20160623  60.9  67.1  54.9  12.2       6.0     16.055556
 20160624  61.1  68.9  56.7  12.2       4.4     16.166667
 20160625  65.7  75.4  57.9  17.5       7.8     18.722222
 20160626  69.6  77.7  60.3  17.4       9.3     20.888889
 20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
20160630  65.7  72.7  59.2  13.5       6.5     18.722222

Now we can see that we have some missing data in our DataFrame.

Note

Notice here that you don’t necessary need to specify the starting index if you select data starting from the beginning of the file (at index 0), hence you can leave it empty as in the example above.

Having missing data in your datafile is really common situation and typically you want to deal with it somehow. Common procedures to deal with NaN is to either remove them from the DataFrame or fill them with some value. In Pandas both of these options are really easy to do.

Let’s first see how we can remove the NoData values (i.e. clean the data) using dropna() function. Inside the function you can pass with subset parameter a list of column(s) from which the NaN values should be searched from.

In [30]: w_temps_clean = w_temps2.dropna(subset=['TEMP_Celsius'])

In [31]: print(w_temps_clean)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
 20160623  60.9  67.1  54.9  12.2       6.0     16.055556
 20160624  61.1  68.9  56.7  12.2       4.4     16.166667
 20160625  65.7  75.4  57.9  17.5       7.8     18.722222
 20160626  69.6  77.7  60.3  17.4       9.3     20.888889
 20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
20160630  65.7  72.7  59.2  13.5       6.5     18.722222

As you can see, as a result we now have a DataFrame without the NoData values.

Other option is to fill the NoData with some value using fillna() -function. Here we fill it with value 0.

In [32]: w_temps_na_filled = w_temps2.fillna(0)

In [33]: print(w_temps_na_filled)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
 20160615  63.4  70.3  49.3  21.0      14.1      0.000000
 20160617  60.4  70.7  55.9  14.8       4.5      0.000000
 20160620  59.3  69.1  52.2  16.9       7.1      0.000000
 20160621  62.6  71.4  50.4  21.0      12.2      0.000000
 20160622  61.7  70.2  55.4  14.8       6.3      0.000000
 20160623  60.9  67.1  54.9  12.2       6.0     16.055556
 20160624  61.1  68.9  56.7  12.2       4.4     16.166667
 20160625  65.7  75.4  57.9  17.5       7.8     18.722222
 20160626  69.6  77.7  60.3  17.4       9.3     20.888889
 20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
20160630  65.7  72.7  59.2  13.5       6.5     18.722222

As a result we now have a DataFrame where NoData values are filled with value 0.00000.

Warning

In many cases filling the data with a specific value might be dangerous because you end up modifying the actual data that might affect the results of your analysis. For example in the example above we would have dramatically changed the mean Celsius temperature because the 0 values are dramatically decreasing the average temperature of the month. Hence, use filling carefully.

Sorting data¶

Quite often it is useful to be able to sort your data (descending/ascending) based on values in some column This can be easily done with Pandas using sort_values(by='YourColumnName') -function.

Let’s first sort the values on ascending order based on the TEMP column:

In [34]: sorted_temp_a = dataFrame.sort_values(by='TEMP')

In [35]: print(sorted_temp_a)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
 20160609  49.4  54.1  45.7   8.4       3.7      9.666667
 20160610  49.5  55.9  43.0  12.9       6.5      9.722222
 20160605  51.4  58.3  43.2  15.1       8.2     10.777778
 20160606  52.2  59.7  42.8  16.9       9.4     11.222222
20160611  54.0  62.1  41.7  20.4      12.3     12.222222
 20160608  54.2  60.4  47.5  12.9       6.7     12.333333
20160612  55.4  64.2  46.0  18.2       9.4     13.000000
20160619  56.3  59.2  54.1   5.1       2.2     13.500000
 20160607  56.9  65.1  45.9  19.2      11.0     13.833333
20160618  57.3  62.8  54.0   8.8       3.3     14.055556
 20160604  57.5  70.9  47.3  23.6      10.2     14.166667
20160616  57.8  67.5  55.6  11.9       2.2     14.333333
20160613  58.3  68.2  47.3  20.9      11.0     14.611111
20160620  59.3  69.1  52.2  16.9       7.1     15.166667
20160614  59.7  67.8  47.8  20.0      11.9     15.388889
20160617  60.4  70.7  55.9  14.8       4.5     15.777778
20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160623  60.9  67.1  54.9  12.2       6.0     16.055556
20160624  61.1  68.9  56.7  12.2       4.4     16.166667
20160622  61.7  70.2  55.4  14.8       6.3     16.500000
20160621  62.6  71.4  50.4  21.0      12.2     17.000000
20160615  63.4  70.3  49.3  21.0      14.1     17.444444
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
 20160601  65.5  73.6  54.7  18.9      10.8     18.611111
20160625  65.7  75.4  57.9  17.5       7.8     18.722222
20160630  65.7  72.7  59.2  13.5       6.5     18.722222
 20160602  65.8  80.8  55.0  25.8      10.8     18.777778
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
 20160603  68.4  77.9  55.6  22.3      12.8     20.222222
20160626  69.6  77.7  60.3  17.4       9.3     20.888889

Of course, it is also possible to sort them in descending order with ascending=False parameter:

In [36]: sorted_temp_d = dataFrame.sort_values(by='TEMP', ascending=False)

In [37]: print(sorted_temp_d)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius
20160626  69.6  77.7  60.3  17.4       9.3     20.888889
 20160603  68.4  77.9  55.6  22.3      12.8     20.222222
 20160602  65.8  80.8  55.0  25.8      10.8     18.777778
20160629  65.8  73.2  59.7  13.5       6.1     18.777778
20160630  65.7  72.7  59.2  13.5       6.5     18.722222
20160625  65.7  75.4  57.9  17.5       7.8     18.722222
 20160601  65.5  73.6  54.7  18.9      10.8     18.611111
20160628  65.4  73.0  55.8  17.2       9.6     18.555556
20160615  63.4  70.3  49.3  21.0      14.1     17.444444
20160621  62.6  71.4  50.4  21.0      12.2     17.000000
20160622  61.7  70.2  55.4  14.8       6.3     16.500000
20160624  61.1  68.9  56.7  12.2       4.4     16.166667
20160623  60.9  67.1  54.9  12.2       6.0     16.055556
20160627  60.7  70.0  57.6  12.4       3.1     15.944444
20160617  60.4  70.7  55.9  14.8       4.5     15.777778
20160614  59.7  67.8  47.8  20.0      11.9     15.388889
20160620  59.3  69.1  52.2  16.9       7.1     15.166667
20160613  58.3  68.2  47.3  20.9      11.0     14.611111
20160616  57.8  67.5  55.6  11.9       2.2     14.333333
 20160604  57.5  70.9  47.3  23.6      10.2     14.166667
20160618  57.3  62.8  54.0   8.8       3.3     14.055556
 20160607  56.9  65.1  45.9  19.2      11.0     13.833333
20160619  56.3  59.2  54.1   5.1       2.2     13.500000
20160612  55.4  64.2  46.0  18.2       9.4     13.000000
 20160608  54.2  60.4  47.5  12.9       6.7     12.333333
20160611  54.0  62.1  41.7  20.4      12.3     12.222222
 20160606  52.2  59.7  42.8  16.9       9.4     11.222222
 20160605  51.4  58.3  43.2  15.1       8.2     10.777778
 20160610  49.5  55.9  43.0  12.9       6.5      9.722222
 20160609  49.4  54.1  45.7   8.4       3.7      9.666667

Rounding and finding unique values¶

It is possible to round values easily by using round() function. Here we round the Celsius temperatures with 0-decimals

In [38]: dataFrame['Celsius_rounded'] = dataFrame['TEMP_Celsius'].round(0)

In [39]: print(dataFrame)
    YEARMODA  TEMP   MAX   MIN  DIFF  DIFF_Min  TEMP_Celsius  Celsius_rounded
 20160601  65.5  73.6  54.7  18.9      10.8     18.611111             19.0
 20160602  65.8  80.8  55.0  25.8      10.8     18.777778             19.0
 20160603  68.4  77.9  55.6  22.3      12.8     20.222222             20.0
 20160604  57.5  70.9  47.3  23.6      10.2     14.166667             14.0
 20160605  51.4  58.3  43.2  15.1       8.2     10.777778             11.0
 20160606  52.2  59.7  42.8  16.9       9.4     11.222222             11.0
 20160607  56.9  65.1  45.9  19.2      11.0     13.833333             14.0
 20160608  54.2  60.4  47.5  12.9       6.7     12.333333             12.0
 20160609  49.4  54.1  45.7   8.4       3.7      9.666667             10.0
 20160610  49.5  55.9  43.0  12.9       6.5      9.722222             10.0
20160611  54.0  62.1  41.7  20.4      12.3     12.222222             12.0
20160612  55.4  64.2  46.0  18.2       9.4     13.000000             13.0
20160613  58.3  68.2  47.3  20.9      11.0     14.611111             15.0
20160614  59.7  67.8  47.8  20.0      11.9     15.388889             15.0
20160615  63.4  70.3  49.3  21.0      14.1     17.444444             17.0
20160616  57.8  67.5  55.6  11.9       2.2     14.333333             14.0
20160617  60.4  70.7  55.9  14.8       4.5     15.777778             16.0
20160618  57.3  62.8  54.0   8.8       3.3     14.055556             14.0
20160619  56.3  59.2  54.1   5.1       2.2     13.500000             13.0
20160620  59.3  69.1  52.2  16.9       7.1     15.166667             15.0
20160621  62.6  71.4  50.4  21.0      12.2     17.000000             17.0
20160622  61.7  70.2  55.4  14.8       6.3     16.500000             16.0
20160623  60.9  67.1  54.9  12.2       6.0     16.055556             16.0
20160624  61.1  68.9  56.7  12.2       4.4     16.166667             16.0
20160625  65.7  75.4  57.9  17.5       7.8     18.722222             19.0
20160626  69.6  77.7  60.3  17.4       9.3     20.888889             21.0
20160627  60.7  70.0  57.6  12.4       3.1     15.944444             16.0
20160628  65.4  73.0  55.8  17.2       9.6     18.555556             19.0
20160629  65.8  73.2  59.7  13.5       6.1     18.777778             19.0
20160630  65.7  72.7  59.2  13.5       6.5     18.722222             19.0

Now we have rounded our Celsius temperatures. Sometimes it is useful to extract the unique values that you have in your column. We can do that by using unique_values() -function:

In [40]: unique = dataFrame['Celsius_rounded'].unique()

In [41]: unique
Out[41]: array([ 19.,  20.,  14.,  11.,  12.,  10.,  13.,  15.,  17.,  16.,  21.])

As a result we get an array of unique values in that column.

Note

Sometimes if you have a long list of unique values, you don’t necessary see all the unique values directly as IPython hides them. It is, however, possible to see all those values by printing them as a list:

In [42]: print(list(unique))
[19.0, 20.0, 14.0, 11.0, 12.0, 10.0, 13.0, 15.0, 17.0, 16.0, 21.0]

How many days with unique mean temperature did we have in June 2016? We can check that!

In [43]: uniq_temp_days = len(unique)

In [44]: print("There were", uniq_temp_days, "days with unique mean temperatures in June 2016.")
There were 11 days with unique mean temperatures in June 2016.

Writing data¶

Lastly, it is of course important to be able to write the data that you have analyzed into your computer. This is really handy in Pandas as it supports many different data formats by default (see more info here). The most typical output format by far is CSV file. Function to_csv() can be used to easily save your data in CSV format.

Let’s first save the data from our data DataFrame into a file called Kumpula_temp_results_June_2016.csv .

In [45]: output_fp = "Kumpula_temps_June_2016.csv"

In [46]: dataFrame.to_csv(output_fp, sep=',')

Nice, now we have the data from our DataFrame saved to a file:

../../_images/pandas_save_file_result_1.PNG

As you can see, the first value in the datafile contains now the index value of the rows. There are also quite many decimals present in the new columns that we created. Let’s deal with these and save the temperature values from w_temps DataFrame without the index and with only 1 decimal in the floating point numbers.

In [47]: output_fp2 = "Kumpula_temps_above15_June_2016.csv"

In [48]: dataFrame.to_csv(output_fp2, sep=',', index=False, float_format="%.1f")

Omitting the index can be with index=False parameter. Specifying how many decimals should be written can be done with float_fomat -parameter where text %.1f defines Pandas to use 1 decimals in all columns when writing the data to a file (changing the value 1 to 2 would write 2 decimals etc.)

../../_images/pandas_save_file_result_2.PNG

As a results you have a “cleaner” output file without the index column, and with only 1 decimal for floating point numbers.

Hint

It is quite common that people (especially non-programmers) want you to deliver data in MS Excel format. Saving DataFrame into Excel is also straightforward in Pandas. First, you need to initialize a specific ExcelWriter object, and then you specify the filename and the spreadsheet name where you want to save the DataFrame. Optionally, you can also omit the index and specify the float formatting as in our earlier examples:

In [49]: excel_output_fp = "Kumpula_temps_June_2016.xlsx"

In [50]: writer = pd.ExcelWriter(excel_output_fp)

In [51]: dataFrame.to_excel(writer, sheet_name="Kumpula_temperatures", index=False, float_format="%.1f")

As a result you have the DataFrame in Excel format:

../../_images/pandas_save_excel_result_1.PNG