Processing data with Pandas¶
Now you should know the basics of the data structures in Pandas and how to explore your data using some tools that is provided by Pandas. Next we continue to explore some of the basic data operations that are regularly needed when doing data analysis.
Let’s first read the same data as before into Pandas to have a clean start.
In [1] dataFrame = pd.read_csv('Kumpula-June-2016-w-metadata.txt', sep=',', skiprows=8)
Calculating with DataFrames¶
One of the most common things to do in Pandas is to create new columns based on calculations between different variables (columns).
Creating a new column into our DataFrame is easy by specifying the name of the column and giving it some default value (in this case decimal number 0.0).
In [1]: dataFrame['DIFF'] = 0.0
In [2]: print(dataFrame)
YEARMODA TEMP MAX MIN DIFF
0 20160601 65.5 73.6 54.7 0.0
1 20160602 65.8 80.8 55.0 0.0
2 20160603 68.4 77.9 55.6 0.0
3 20160604 57.5 70.9 47.3 0.0
4 20160605 51.4 58.3 43.2 0.0
5 20160606 52.2 59.7 42.8 0.0
6 20160607 56.9 65.1 45.9 0.0
7 20160608 54.2 60.4 47.5 0.0
8 20160609 49.4 54.1 45.7 0.0
9 20160610 49.5 55.9 43.0 0.0
10 20160611 54.0 62.1 41.7 0.0
11 20160612 55.4 64.2 46.0 0.0
12 20160613 58.3 68.2 47.3 0.0
13 20160614 59.7 67.8 47.8 0.0
14 20160615 63.4 70.3 49.3 0.0
15 20160616 57.8 67.5 55.6 0.0
16 20160617 60.4 70.7 55.9 0.0
17 20160618 57.3 62.8 54.0 0.0
18 20160619 56.3 59.2 54.1 0.0
19 20160620 59.3 69.1 52.2 0.0
20 20160621 62.6 71.4 50.4 0.0
21 20160622 61.7 70.2 55.4 0.0
22 20160623 60.9 67.1 54.9 0.0
23 20160624 61.1 68.9 56.7 0.0
24 20160625 65.7 75.4 57.9 0.0
25 20160626 69.6 77.7 60.3 0.0
26 20160627 60.7 70.0 57.6 0.0
27 20160628 65.4 73.0 55.8 0.0
28 20160629 65.8 73.2 59.7 0.0
29 20160630 65.7 72.7 59.2 0.0
Let’s check the datatype of our new column
In [3]: dataFrame['DIFF'].dtypes
Out[3]: dtype('float64')
Okey, so we see that Pandas created a new column and recognized automatically that the data type is float as we passed a 0.0 value to it.
We can also easily do calculations inside our DataFrame. Let’s update the column DIFF
by calculating the difference between MAX
and MIN
columns to get an idea how much the temperatures have
been varying during different days. Calculations can be done with following syntax where we first specify the column that we want to update (i.e. DIFF
) and then do the actual calculation
using the columns that we have in our dataFrameFrame.
In [4]: dataFrame['DIFF'] = dataFrame['MAX'] - dataFrame['MIN']
In [5]: print(dataFrame)
YEARMODA TEMP MAX MIN DIFF
0 20160601 65.5 73.6 54.7 18.9
1 20160602 65.8 80.8 55.0 25.8
2 20160603 68.4 77.9 55.6 22.3
3 20160604 57.5 70.9 47.3 23.6
4 20160605 51.4 58.3 43.2 15.1
5 20160606 52.2 59.7 42.8 16.9
6 20160607 56.9 65.1 45.9 19.2
7 20160608 54.2 60.4 47.5 12.9
8 20160609 49.4 54.1 45.7 8.4
9 20160610 49.5 55.9 43.0 12.9
10 20160611 54.0 62.1 41.7 20.4
11 20160612 55.4 64.2 46.0 18.2
12 20160613 58.3 68.2 47.3 20.9
13 20160614 59.7 67.8 47.8 20.0
14 20160615 63.4 70.3 49.3 21.0
15 20160616 57.8 67.5 55.6 11.9
16 20160617 60.4 70.7 55.9 14.8
17 20160618 57.3 62.8 54.0 8.8
18 20160619 56.3 59.2 54.1 5.1
19 20160620 59.3 69.1 52.2 16.9
20 20160621 62.6 71.4 50.4 21.0
21 20160622 61.7 70.2 55.4 14.8
22 20160623 60.9 67.1 54.9 12.2
23 20160624 61.1 68.9 56.7 12.2
24 20160625 65.7 75.4 57.9 17.5
25 20160626 69.6 77.7 60.3 17.4
26 20160627 60.7 70.0 57.6 12.4
27 20160628 65.4 73.0 55.8 17.2
28 20160629 65.8 73.2 59.7 13.5
29 20160630 65.7 72.7 59.2 13.5
Okey so now we can see that the calculations were inserted into the DIFF
column as planned. Notice that you can do calculations directly without first creating the column. Let’s test
this by calculating the difference between minimum temperature (MIN
) and the mean temperature of the day (TEMP
).
In [6]: dataFrame['DIFF_Min'] = dataFrame['TEMP'] - dataFrame['MIN']
In [7]: print(dataFrame)
YEARMODA TEMP MAX MIN DIFF DIFF_Min
0 20160601 65.5 73.6 54.7 18.9 10.8
1 20160602 65.8 80.8 55.0 25.8 10.8
2 20160603 68.4 77.9 55.6 22.3 12.8
3 20160604 57.5 70.9 47.3 23.6 10.2
4 20160605 51.4 58.3 43.2 15.1 8.2
5 20160606 52.2 59.7 42.8 16.9 9.4
6 20160607 56.9 65.1 45.9 19.2 11.0
7 20160608 54.2 60.4 47.5 12.9 6.7
8 20160609 49.4 54.1 45.7 8.4 3.7
9 20160610 49.5 55.9 43.0 12.9 6.5
10 20160611 54.0 62.1 41.7 20.4 12.3
11 20160612 55.4 64.2 46.0 18.2 9.4
12 20160613 58.3 68.2 47.3 20.9 11.0
13 20160614 59.7 67.8 47.8 20.0 11.9
14 20160615 63.4 70.3 49.3 21.0 14.1
15 20160616 57.8 67.5 55.6 11.9 2.2
16 20160617 60.4 70.7 55.9 14.8 4.5
17 20160618 57.3 62.8 54.0 8.8 3.3
18 20160619 56.3 59.2 54.1 5.1 2.2
19 20160620 59.3 69.1 52.2 16.9 7.1
20 20160621 62.6 71.4 50.4 21.0 12.2
21 20160622 61.7 70.2 55.4 14.8 6.3
22 20160623 60.9 67.1 54.9 12.2 6.0
23 20160624 61.1 68.9 56.7 12.2 4.4
24 20160625 65.7 75.4 57.9 17.5 7.8
25 20160626 69.6 77.7 60.3 17.4 9.3
26 20160627 60.7 70.0 57.6 12.4 3.1
27 20160628 65.4 73.0 55.8 17.2 9.6
28 20160629 65.8 73.2 59.7 13.5 6.1
29 20160630 65.7 72.7 59.2 13.5 6.5
As you can see, now we created directly a new column with the calculation. In a similar manner, you can do calculations using as many columns as you need and using any kind of math algebra (e.g. subtracttion, addition, multiplication, division, exponentiation, etc.).
We can for example convert the Fahrenheit temperatures in TEMP
column into Celsius using the formula that we have seen already many times:
In [8]: dataFrame['TEMP_Celsius'] = (dataFrame['TEMP'] - 32) / (9/5)
In [9]: print(dataFrame)
YEARMODA TEMP MAX MIN DIFF DIFF_Min TEMP_Celsius
0 20160601 65.5 73.6 54.7 18.9 10.8 18.611111
1 20160602 65.8 80.8 55.0 25.8 10.8 18.777778
2 20160603 68.4 77.9 55.6 22.3 12.8 20.222222
3 20160604 57.5 70.9 47.3 23.6 10.2 14.166667
4 20160605 51.4 58.3 43.2 15.1 8.2 10.777778
5 20160606 52.2 59.7 42.8 16.9 9.4 11.222222
6 20160607 56.9 65.1 45.9 19.2 11.0 13.833333
7 20160608 54.2 60.4 47.5 12.9 6.7 12.333333
8 20160609 49.4 54.1 45.7 8.4 3.7 9.666667
9 20160610 49.5 55.9 43.0 12.9 6.5 9.722222
10 20160611 54.0 62.1 41.7 20.4 12.3 12.222222
11 20160612 55.4 64.2 46.0 18.2 9.4 13.000000
12 20160613 58.3 68.2 47.3 20.9 11.0 14.611111
13 20160614 59.7 67.8 47.8 20.0 11.9 15.388889
14 20160615 63.4 70.3 49.3 21.0 14.1 17.444444
15 20160616 57.8 67.5 55.6 11.9 2.2 14.333333
16 20160617 60.4 70.7 55.9 14.8 4.5 15.777778
17 20160618 57.3 62.8 54.0 8.8 3.3 14.055556
18 20160619 56.3 59.2 54.1 5.1 2.2 13.500000
19 20160620 59.3 69.1 52.2 16.9 7.1 15.166667
20 20160621 62.6 71.4 50.4 21.0 12.2 17.000000
21 20160622 61.7 70.2 55.4 14.8 6.3 16.500000
22 20160623 60.9 67.1 54.9 12.2 6.0 16.055556
23 20160624 61.1 68.9 56.7 12.2 4.4 16.166667
24 20160625 65.7 75.4 57.9 17.5 7.8 18.722222
25 20160626 69.6 77.7 60.3 17.4 9.3 20.888889
26 20160627 60.7 70.0 57.6 12.4 3.1 15.944444
27 20160628 65.4 73.0 55.8 17.2 9.6 18.555556
28 20160629 65.8 73.2 59.7 13.5 6.1 18.777778
29 20160630 65.7 72.7 59.2 13.5 6.5 18.722222
Selecting data using indices¶
One quite common procedure in programming that you want to select only specific rows from your data and possibly apply some operations into those rows only. In Pandas there are different ways of doing this.
One common way of selecting only specific rows from your DataFrame is done via index slicing to extract part of the DataFrame.
Let’s select the first five rows and assign them to a variable called rows5
.
In [10]: rows5 = dataFrame[0:5]
In [11]: print(rows5)
YEARMODA TEMP MAX MIN DIFF DIFF_Min TEMP_Celsius
0 20160601 65.5 73.6 54.7 18.9 10.8 18.611111
1 20160602 65.8 80.8 55.0 25.8 10.8 18.777778
2 20160603 68.4 77.9 55.6 22.3 12.8 20.222222
3 20160604 57.5 70.9 47.3 23.6 10.2 14.166667
4 20160605 51.4 58.3 43.2 15.1 8.2 10.777778
As you can see, slicing is done in a similar manner as with normal Python lists, i.e. you specify index range you want to select inside the square brackets
selection = dataFrame[start_index:stop_index]
.
You can also select an individual row from specific position using .loc[]
indexing. Here we select all the data values from row 8.
In [12]: row8 = dataFrame.loc[8]
In [13]: print(row8)
YEARMODA 2.016061e+07
TEMP 4.940000e+01
MAX 5.410000e+01
MIN 4.570000e+01
DIFF 8.400000e+00
DIFF_Min 3.700000e+00
TEMP_Celsius 9.666667e+00
Name: 8, dtype: float64
Attention
Based on what we know about Python lists, what would you expect to see output from the following code?
In [13]: print(dataFrame.loc[-1])
Select your answer from the poll options at https://geo-python.github.io/poll/.
.loc[]
indexing returns the values from that position as a pd.Series
where the indices are actually the column names of those variables. Hence, you can access the value of an individual column by referring to its index using following format (both should work):
In [14]: print(row8['TEMP'])
49.4
In [15]: print(row8.YEARMODA)