Data processing with Pandas, part 2¶
This week we will continue developing our skills using Pandas to process real data.
Motivation¶
Source: https://weather.com/news/climate/news/2019-05-20-april-2019-global-temperatures-nasa-noaa
April 2019 was the second warmest April on record globally, and the warmest on record at 13 weather stations in Finland. In this lesson, we will use our data manipulation and analysis skills to analyze weather data from Finland, and investigate the claim that April 2019 was the warmest on record across Finland.
Along the way we will cover a number of useful techniques in pandas including:
- renaming columns
- iterating data frame rows and applying functions
- data aggregation
- repeating the analysis task for several input files
Input data¶
In the lesson this week we are using weather observation data from Finland downloaded from NOAA. You will be working with data from either 15 or 4 different weather observation stations, depending on your environment.
Downloading the data¶
The first step for today’s lesson is to get the data. Which data files you download will depend on the platform you’re using for working through the lesson today.
CSC Notebooks users¶
If you’re working on the CSC Notebooks platform, you can download the data by opening a new terminal window in Jupyter Lab by going to File -> New -> Terminal in the Jupyter Lab menu bar. Once the terminal is open, you will need to navigate to the directory for Lesson 6 by typing
cd notebooks/notebooks/L6/
Once in the correct directory, you can confirm this by typing
ls
You should see something like the following output:
advanced-data-processing-with-pandas.ipynb errors.ipynb img
debugging.ipynb gcp-assertions.ipynb
If so, you’re in the correct directory and you can download the data files by typing
wget https://davewhipp.github.io/data/Finland-weather-data-CSC.tar.gz
After the download completes, you can extract the data files by typing
tar zxvf Finland-weather-data-CSC.tar.gz
At this stage you should have a new directory called data that contains the data for this week’s lesson. You can confirm this by typing
ls data
You should see something like the following:
029440.txt 029720.txt 3505doc.txt 6367598020644stn.txt
029700.txt 029740.txt 6367598020644inv.txt
Now you should be all set to proceed with the lesson!
Users with Jupyter on their personal computers¶
If you’re working on your own computer, you can download the data by opening a new terminal window in Jupyter Lab by going to File -> New -> Terminal in the Jupyter Lab menu bar. Once the terminal is open, you will need to navigate to the directory for Lesson 6 by typing
cd path/to/L6/
where path/to/ should be replaced with the directories needed to locate the Lesson 6 materials on your computer. Once in the correct directory, you can confirm this by typing
ls
You should see something like the following output:
advanced-data-processing-with-pandas.ipynb errors.ipynb img
debugging.ipynb gcp-assertions.ipynb
If so, you’re in the correct directory and you can download the data files by typing
wget https://davewhipp.github.io/data/Finland-weather-data-full.tar.gz
After the download completes, you can extract the data files by typing
tar zxvf Finland-weather-data-full.tar.gz
At this stage you should have a new directory called data that contains the data for this week’s lesson. You can confirm this by typing
ls data
You should see something like the following:
028360.txt 029070.txt 029440.txt 029740.txt 6367598020644inv.txt
028690.txt 029110.txt 029500.txt 029810.txt 6367598020644stn.txt
028750.txt 029170.txt 029700.txt 029820.txt
028970.txt 029350.txt 029720.txt 3505doc.txt
Now you should be all set to proceed with the lesson!
Binder users¶
It is not recommended to complete this lesson using Binder.
About the data¶
As part of the download there are a number of files that describe the weather data. These metadata files include:
- A list of stations*: data/6367598020644stn.txt
- Details about weather observations at each station: data/6367598020644inv.txt
- A data description (i.e., column names): data/3505doc.txt
*Note that the list of stations is for all 15 stations, even if you’re working with only the 4 stations on the CSC Notebooks platform.
The input data for this week are separated with a varying number of spaces (i.e., fixed width). The first lines and columns of the data look like the following:
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
029440 99999 190601010600 090 7 *** *** OVC * * * 0.0 ** ** ** ** ** ** ** ** * 27 **** 1011.0 ***** ****** *** *** ***** ***** ***** ***** **
029440 99999 190601011300 *** 0 *** *** OVC * * * 0.0 ** ** ** ** ** ** ** ** * 27 **** 1015.5 ***** ****** *** *** ***** ***** ***** ***** **
029440 99999 190601012000 *** 0 *** *** OVC * * * 0.0 ** ** ** ** ** ** ** ** * 25 **** 1016.2 ***** ****** *** *** ***** ***** ***** ***** **
029440 99999 190601020600 *** 0 *** *** CLR * * * 0.0 ** ** ** ** ** ** ** ** * 26 **** 1016.2 ***** ****** *** *** ***** ***** ***** ***** **
We will develop our analysis workflow using data for a single station. Then, we will repeat the same process for all the stations.
Reading the data¶
In order to get started, let’s import pandas:
[1]:
import pandas as pd
At this point, you can already have a quick look at the input file 029440.txt for Tampere Pirkkala and how it is structured. We can notice at least two things we need to consider when reading in the data:
NOTE: Input data structure
- Delimiter: The data are separated with a varying number of spaces. If you check the documentation for the read_csv() method, you can see that there are two ways of handling this: we can use either the sep parameter with sep='\s+', or delim_whitespace=True, but not both. In this case, we prefer to use delim_whitespace.
- No data values: No data values in the NOAA data are coded with a varying number of * characters. We can tell pandas to consider those characters as NaNs by specifying na_values=['*', '**', '***', '****', '*****', '******'].
[2]:
fp = r"data/029440.txt"
# Read data using varying amount of spaces as separator and specifying * characters as NoData values
data = pd.read_csv(fp, delim_whitespace=True, na_values=['*', '**', '***', '****', '*****', '******'])
C:\HYapp\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3058: DtypeWarning: Columns (29,30,31) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
- Let’s see how the data looks by printing the first five rows with the head() function:
[3]:
print(data.head())
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M ... SLP \
0 29440 99999 190601010600 90.0 7.0 NaN NaN OVC NaN NaN ... 1011.0
1 29440 99999 190601011300 NaN 0.0 NaN NaN OVC NaN NaN ... 1015.5
2 29440 99999 190601012000 NaN 0.0 NaN NaN OVC NaN NaN ... 1016.2
3 29440 99999 190601020600 NaN 0.0 NaN NaN CLR NaN NaN ... 1016.2
4 29440 99999 190601021300 270.0 7.0 NaN NaN OVC NaN NaN ... 1015.6
ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 33 columns]
All seems OK. However, we won’t need all of the 33 columns for detecting warm temperatures in April. We can check all the column names by running data.columns:
[4]:
data.columns
[4]:
Index(['USAF', 'WBAN', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'CLG', 'SKC', 'L',
'M', 'H', 'VSB', 'MW', 'MW.1', 'MW.2', 'MW.3', 'AW', 'AW.1', 'AW.2',
'AW.3', 'W', 'TEMP', 'DEWP', 'SLP', 'ALT', 'STP', 'MAX', 'MIN', 'PCP01',
'PCP06', 'PCP24', 'PCPXX', 'SD'],
dtype='object')
A description for all these columns is available in the metadata file data/3505doc.txt.
Let’s read in the data one more time. This time, we will read in only some of the columns using the usecols parameter. Let’s read in the columns that might be useful for our analysis, or that at least contain values that are meaningful to us, including the station name, timestamp, and data about wind and temperature: 'USAF', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'TEMP', 'MAX', 'MIN'
[5]:
data = pd.read_csv(fp, delim_whitespace=True, usecols=['USAF','YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN'], na_values=['*', '**', '***', '****', '*****', '******'])
data.head()
[5]:
USAF | YR--MODAHRMN | DIR | SPD | GUS | TEMP | MAX | MIN | |
---|---|---|---|---|---|---|---|---|
0 | 29440 | 190601010600 | 90.0 | 7.0 | NaN | 27.0 | NaN | NaN |
1 | 29440 | 190601011300 | NaN | 0.0 | NaN | 27.0 | NaN | NaN |
2 | 29440 | 190601012000 | NaN | 0.0 | NaN | 25.0 | NaN | NaN |
3 | 29440 | 190601020600 | NaN | 0.0 | NaN | 26.0 | NaN | NaN |
4 | 29440 | 190601021300 | 270.0 | 7.0 | NaN | 27.0 | NaN | NaN |
Okay, so we can see that the data was successfully read into the DataFrame, and we also managed to convert the asterisk (*) characters into NaN values.
Renaming columns¶
Check again the column names in our DataFrame:
[6]:
data.columns
[6]:
Index(['USAF', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'TEMP', 'MAX', 'MIN'], dtype='object')
The column names that we have are somewhat awkward. Let’s change them into more intuitive ones. This can be done easily using the rename() method and a dictionary that lists the old and new column names.
Dictionaries
A dictionary is a specific data structure in Python for storing key-value pairs. During this course, we will use dictionaries mainly when renaming columns in a pandas DataFrame, but dictionaries are useful for many different purposes! For more information about Python dictionaries, check out this tutorial.
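For example, here is a minimal dictionary sketch (the station name comes from this lesson’s input data):

# A dictionary maps keys to values
station_names = {"029440": "Tampere Pirkkala"}

# Values are accessed by key
print(station_names["029440"])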
We can define the new column names using a dictionary where we determine “key: value” pairs, in which the original column name (the one that will be replaced) is the key, and the new column name is the value.
- Let’s change the YR--MODAHRMN column into TIME, SPD into SPEED, and GUS into GUST:
[7]:
# Create the dictionary with old and new names
new_names = {'YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST'}
# Let's see what they look like and what is the type
print(new_names)
print(type(new_names))
{'SPD': 'SPEED', 'YR--MODAHRMN': 'TIME', 'GUS': 'GUST'}
<class 'dict'>
From above we can see that we have successfully created a dictionary that is of type dict.
- Now we can change the column names by passing that dictionary to the columns parameter of the rename() function:
[8]:
# Rename the columns
data = data.rename(columns=new_names)
# Print the new columns
print(data.columns)
Index(['USAF', 'TIME', 'DIR', 'SPEED', 'GUST', 'TEMP', 'MAX', 'MIN'], dtype='object')
Perfect, now our column names are easier to understand and use.
TASK: Renaming columns
The temperature values are again in Fahrenheit. As you might guess, we will soon convert these temperatures into Celsius. In order to avoid confusion between the columns, rename the column TEMP to TEMP_F. Also, rename USAF to STATION_NUMBER.
[9]:
# Create the dictionary with old and new names
new_names = {'USAF':'STATION_NUMBER', 'TEMP': 'TEMP_F'}
# Rename the columns
data = data.rename(columns=new_names)
# Check the output
data.head()
[9]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | |
---|---|---|---|---|---|---|---|---|
0 | 29440 | 190601010600 | 90.0 | 7.0 | NaN | 27.0 | NaN | NaN |
1 | 29440 | 190601011300 | NaN | 0.0 | NaN | 27.0 | NaN | NaN |
2 | 29440 | 190601012000 | NaN | 0.0 | NaN | 25.0 | NaN | NaN |
3 | 29440 | 190601020600 | NaN | 0.0 | NaN | 26.0 | NaN | NaN |
4 | 29440 | 190601021300 | 270.0 | 7.0 | NaN | 27.0 | NaN | NaN |
Check dataframe properties¶
As we learned last week, it’s always a good idea to check basic properties of the input data before proceeding with data analysis. Let’s check:
- How many rows and columns we have:
[10]:
data.shape
[10]:
(757983, 8)
- Top and bottom rows:
[11]:
data.head()
[11]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | |
---|---|---|---|---|---|---|---|---|
0 | 29440 | 190601010600 | 90.0 | 7.0 | NaN | 27.0 | NaN | NaN |
1 | 29440 | 190601011300 | NaN | 0.0 | NaN | 27.0 | NaN | NaN |
2 | 29440 | 190601012000 | NaN | 0.0 | NaN | 25.0 | NaN | NaN |
3 | 29440 | 190601020600 | NaN | 0.0 | NaN | 26.0 | NaN | NaN |
4 | 29440 | 190601021300 | 270.0 | 7.0 | NaN | 27.0 | NaN | NaN |
[12]:
data.tail()
[12]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | |
---|---|---|---|---|---|---|---|---|
757978 | 29440 | 201910012220 | 130.0 | 3.0 | NaN | 39.0 | NaN | NaN |
757979 | 29440 | 201910012250 | 110.0 | 3.0 | NaN | 37.0 | NaN | NaN |
757980 | 29440 | 201910012300 | 100.0 | 2.0 | NaN | 38.0 | NaN | NaN |
757981 | 29440 | 201910012320 | 100.0 | 3.0 | NaN | 37.0 | NaN | NaN |
757982 | 29440 | 201910012350 | 110.0 | 3.0 | NaN | 37.0 | NaN | NaN |
- Data types of the columns:
[13]:
data.dtypes
[13]:
STATION_NUMBER int64
TIME int64
DIR float64
SPEED float64
GUST float64
TEMP_F float64
MAX float64
MIN float64
dtype: object
- Descriptive statistics:
[14]:
print(data.describe())
STATION_NUMBER TIME DIR SPEED \
count 757983.0 7.579830e+05 699256.000000 750143.000000
mean 29440.0 1.999974e+11 233.499846 6.742641
std 0.0 1.629544e+09 209.707258 4.296191
min 29440.0 1.906010e+11 10.000000 0.000000
25% 29440.0 1.989083e+11 130.000000 3.000000
50% 29440.0 2.004042e+11 200.000000 7.000000
75% 29440.0 2.012050e+11 270.000000 9.000000
max 29440.0 2.019100e+11 990.000000 61.000000
GUST TEMP_F MAX MIN
count 19906.000000 754862.000000 23869.000000 23268.000000
mean 20.147996 40.409778 45.373539 35.783737
std 7.415138 17.898715 18.242679 17.195427
min 11.000000 -33.000000 -26.000000 -32.000000
25% 14.000000 29.000000 32.000000 26.000000
50% 18.000000 39.000000 44.000000 36.000000
75% 26.000000 54.000000 60.000000 49.000000
max 108.000000 91.000000 91.000000 81.000000
Here we can see that there is a varying number of observations per column (see the count information), because some of the columns have missing values.
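If you want to check the number of missing values per column directly, one quick option (a minimal sketch) is to combine the isna() and sum() methods:

# Count the number of NaN values in each column
data.isna().sum()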
Using your own functions in pandas¶
Now it’s again time to convert temperatures from Fahrenheit to Celsius! Yes, we have already done this many times before, but this time we will learn how to apply self-made functions to data in a pandas DataFrame. In short, our task is to define a function for the temperature conversion, and to apply this function to the Fahrenheit value on each row of the DataFrame. The output Celsius values should be stored in a new column called TEMP_C.
Knowing how to use your own functions in pandas can be really useful when doing your own analyses. Here, we will introduce two different approaches for using functions in pandas. First, we will see how to apply the function row-by-row using a for-loop and the DataFrame.iterrows() method, and then we will learn how to apply the function to all rows at once using DataFrame.apply.
For both of these approaches, we first need to define our temperature conversion function from Fahrenheit to Celsius:
[15]:
def fahr_to_celsius(temp_fahrenheit):
    """
    Function to convert Fahrenheit temperature into Celsius.

    Parameters
    ----------
    temp_fahrenheit: int | float
        Input temperature in Fahrenheit (should be a number)

    Returns
    -------
    Temperature in Celsius (float)
    """
    # Convert the Fahrenheit temperature into Celsius and return it
    converted_temp = (temp_fahrenheit - 32) / 1.8
    return converted_temp
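Before using the function on our DataFrame, we can quickly test it with a single value; the freezing point of water (32 °F) should convert to 0 °C:

# Quick sanity check for the conversion function
print(fahr_to_celsius(32))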
Note: with such a simple example, we could use the function directly on a column in the DataFrame in order to convert the values:
[16]:
data["TEMP_C"] = fahr_to_celsius(data["TEMP_F"])
data.head()
[16]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | TEMP_C | |
---|---|---|---|---|---|---|---|---|---|
0 | 29440 | 190601010600 | 90.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.777778 |
1 | 29440 | 190601011300 | NaN | 0.0 | NaN | 27.0 | NaN | NaN | -2.777778 |
2 | 29440 | 190601012000 | NaN | 0.0 | NaN | 25.0 | NaN | NaN | -3.888889 |
3 | 29440 | 190601020600 | NaN | 0.0 | NaN | 26.0 | NaN | NaN | -3.333333 |
4 | 29440 | 190601021300 | 270.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.777778 |
If we want to do something more complicated, we need to know how to apply the function row-by-row.
Iterating over rows¶
We can iterate over the rows of a pandas DataFrame using the iterrows() method and apply the function one row at a time.
When iterating over the rows in our DataFrame, it is important to understand that pandas keeps track of the index value as well. Hence, the contents of a single row include not only the values, but also the index of that row (each row is a pandas Series!).
- Let’s see how iterrows() works by printing out the TEMP_F value on each row using a for-loop:
[17]:
# Iterate over the rows
for idx, row in data.iterrows():
    # Print the index value
    print('Index:', idx)

    # Print the temperature value on that row
    print('Temp F:', row["TEMP_F"], "\n")

    break
Index: 0
Temp F: 27.0
break
When developing a for-loop, you don’t always need to go through the whole loop if you just want to test things out. The break statement terminates the current loop, and here we used it to stop after the first iteration so that we could check out the values on the first row. With large data, you might not want to print out thousands of values to the screen!
We can see that the idx variable indeed contains the index value at position 0 (the first row) and the row variable contains all the data from that given row stored as a pandas Series.
- Let’s now create an empty column TEMP_C for the Celsius temperatures and update the values in that column using the fahr_to_celsius function we defined earlier:
[18]:
# Create an empty column for the DataFrame where the values will be stored
new_column = "TEMP_C"
data[new_column] = None
# Iterate over the rows
for idx, row in data.iterrows():
    # Convert the Fahrenheit to Celsius
    celsius = fahr_to_celsius(row['TEMP_F'])

    # Update the value of the TEMP_C column with the converted value
    data.at[idx, new_column] = celsius
Reminder: .at or .loc?
Here, you could also use data.loc[idx, new_column] = celsius to achieve the same result.
If you only need to access a single value in a DataFrame, DataFrame.at is faster compared to DataFrame.loc, which is designed for accessing groups of rows and columns.
Check out more examples of using .at and .loc from the lesson 5 materials.
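As a minimal sketch of the difference, both of these lines access the same single value (the first TEMP_F value in our data):

# .at is optimized for accessing a single value
print(data.at[0, "TEMP_F"])

# .loc can do the same, but is designed for accessing groups of rows and columns
print(data.loc[0, "TEMP_F"])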
- Let’s see what we have now.
[19]:
data.head(10)
[19]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | TEMP_C | |
---|---|---|---|---|---|---|---|---|---|
0 | 29440 | 190601010600 | 90.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.77778 |
1 | 29440 | 190601011300 | NaN | 0.0 | NaN | 27.0 | NaN | NaN | -2.77778 |
2 | 29440 | 190601012000 | NaN | 0.0 | NaN | 25.0 | NaN | NaN | -3.88889 |
3 | 29440 | 190601020600 | NaN | 0.0 | NaN | 26.0 | NaN | NaN | -3.33333 |
4 | 29440 | 190601021300 | 270.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.77778 |
5 | 29440 | 190601022000 | NaN | 0.0 | NaN | 27.0 | NaN | NaN | -2.77778 |
6 | 29440 | 190601030600 | 270.0 | 7.0 | NaN | 26.0 | NaN | NaN | -3.33333 |
7 | 29440 | 190601031300 | 270.0 | 7.0 | NaN | 25.0 | NaN | NaN | -3.88889 |
8 | 29440 | 190601032000 | 270.0 | 7.0 | NaN | 24.0 | NaN | NaN | -4.44444 |
9 | 29440 | 190601040600 | NaN | 0.0 | NaN | 18.0 | NaN | NaN | -7.77778 |
Great! Now we have converted our temperatures into Celsius by using our self-made function.
Applying a function¶
Pandas DataFrames and Series also have a dedicated method .apply()
for applying functions on columns (or rows!). When using .apply()
, we pass the function name (without parenthesis!) as an argument:
[20]:
data["TEMP_C"] = data["TEMP_F"].apply(fahr_to_celsius)
data.head()
[20]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | TEMP_C | |
---|---|---|---|---|---|---|---|---|---|
0 | 29440 | 190601010600 | 90.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.777778 |
1 | 29440 | 190601011300 | NaN | 0.0 | NaN | 27.0 | NaN | NaN | -2.777778 |
2 | 29440 | 190601012000 | NaN | 0.0 | NaN | 25.0 | NaN | NaN | -3.888889 |
3 | 29440 | 190601020600 | NaN | 0.0 | NaN | 26.0 | NaN | NaN | -3.333333 |
4 | 29440 | 190601021300 | 270.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.777778 |
Note: pay attention to which column you are applying the function on! Running the code data.apply(fahr_to_celsius) would not give an error, but the results would not make much sense.
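.apply() also works on several columns at once, in which case the function is applied to each selected column separately. Here is a minimal sketch (the result is only printed, not stored):

# Apply the conversion to several temperature columns at once
print(data[["TEMP_F", "MIN", "MAX"]].apply(fahr_to_celsius).head())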
Should I use .iterrows() or .apply()?
We are teaching the .iterrows() method because it helps to understand the structure of a DataFrame and the process of looping through DataFrame rows. However, using .apply() is often more efficient in terms of execution time.
At this point, the most important thing is that you understand what happens when you are modifying the values in a pandas DataFrame. When doing the course exercises, either of these approaches is OK!
Parsing dates¶
We will eventually want to group our data based on month in order to see if April temperatures in 2019 were higher than average. Currently, the date and time information is stored in the column TIME:
YR--MODAHRMN = YEAR-MONTH-DAY-HOUR-MINUTE IN GREENWICH MEAN TIME (GMT)
Let’s have a closer look at the date and time information we have by checking the values in that column, and their data type:
[21]:
data["TIME"].head(10)
[21]:
0 190601010600
1 190601011300
2 190601012000
3 190601020600
4 190601021300
5 190601022000
6 190601030600
7 190601031300
8 190601032000
9 190601040600
Name: TIME, dtype: int64
[22]:
data["TIME"].tail(10)
[22]:
757973 201910012050
757974 201910012100
757975 201910012120
757976 201910012150
757977 201910012200
757978 201910012220
757979 201910012250
757980 201910012300
757981 201910012320
757982 201910012350
Name: TIME, dtype: int64
The TIME column contains several observations per day (and even several observations per hour). The timestamp for the first observation is 190601010600, i.e. from the 1st of January 1906 (way back!), and the timestamp for the latest observation is 201910012350 (from last week, at the time of writing!).
[23]:
data["TIME"].dtypes
[23]:
dtype('int64')
And the information is stored as integer values.
There are several different options for proceeding from here. The bottom line is that we want to aggregate the data on a monthly level, and in order to do so we need to “label” each row of data based on the month when the record was observed. In practice, we could create a new column (or an index) that contains information about the month (including the year, but excluding days, hours, and minutes).
String slicing¶
One approach would be to convert the date and time information into character strings and “cut” the needed information from the string objects. If we look at the latest timestamp in the data (201910012350), you can see that there is a systematic pattern YEAR-MONTH-DAY-HOUR-MINUTE. The first four characters represent the year, and the first six characters the year + month!
[24]:
date = "201910012350"
date[0:6]
[24]:
'201910'
Doing this in pandas requires two steps:
1. Convert the TIME column from int into str datatype.
2. Slice the correct range of characters from the character string using pandas.Series.str.slice()
- Let’s convert the time into a string:
[25]:
# Convert to string
data['TIME_STR'] = data['TIME'].astype(str)
[26]:
# Slice the string
data['YEAR_MONTH'] = data['TIME_STR'].str.slice(start=0, stop=6)
# Let's see what we have
print(data.head())
STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN \
0 29440 190601010600 90.0 7.0 NaN 27.0 NaN NaN
1 29440 190601011300 NaN 0.0 NaN 27.0 NaN NaN
2 29440 190601012000 NaN 0.0 NaN 25.0 NaN NaN
3 29440 190601020600 NaN 0.0 NaN 26.0 NaN NaN
4 29440 190601021300 270.0 7.0 NaN 27.0 NaN NaN
TEMP_C TIME_STR YEAR_MONTH
0 -2.777778 190601010600 190601
1 -2.777778 190601011300 190601
2 -3.888889 190601012000 190601
3 -3.333333 190601020600 190601
4 -2.777778 190601021300 190601
Nice! Now we have “labeled” the rows based on the year and month of each observation. However, let’s have a look at a more clever way of dealing with dates and times.
datetime¶
Pandas datetime
In pandas, we can convert dates and times into a new data type, datetime, using the pandas.to_datetime function. First, it is important to understand the structure of the input data in order to avoid erroneous conversions, and that’s why we first learned string slicing before introducing the datetime functionalities.
Here is one example of how to convert the TIME_STR
-column in our data set to datetime:
# Convert to datetime
data["DATE"] = pd.to_datetime(data["TIME_STR"])
If needed, you can use the format parameter to define the expected datetime format according to the strftime(format) syntax, together with exact=False, for example like this:
# Convert to datetime
data["YEAR_MONTH"] = pd.to_datetime(data["TIME_STR"], format='%Y%m', exact=False)
In this example, exact=False drops out the days, hours, and minutes, because they are not included in the specified format.
[27]:
# Convert to datetime
data["DATE"] = pd.to_datetime(data["TIME_STR"])
data["DATE"].head()
[27]:
0 1906-01-01 06:00:00
1 1906-01-01 13:00:00
2 1906-01-01 20:00:00
3 1906-01-02 06:00:00
4 1906-01-02 13:00:00
Name: DATE, dtype: datetime64[ns]
Note: in this case, the data type of the values is datetime.
Pandas Series datetime properties
There are several methods available for accessing information about the properties of datetime values. Read more from the pandas documentation about datetime properties.
Now, we can extract different time units based on the datetime-column using the pandas.Series.dt accessor:
[28]:
data['DATE'].dt.year
[28]:
0 1906
1 1906
2 1906
3 1906
4 1906
5 1906
6 1906
7 1906
8 1906
9 1906
10 1906
11 1906
12 1906
13 1906
14 1906
15 1906
16 1906
17 1906
18 1906
19 1906
20 1906
21 1906
22 1906
23 1906
24 1906
25 1906
26 1906
27 1906
28 1906
29 1906
...
757953 2019
757954 2019
757955 2019
757956 2019
757957 2019
757958 2019
757959 2019
757960 2019
757961 2019
757962 2019
757963 2019
757964 2019
757965 2019
757966 2019
757967 2019
757968 2019
757969 2019
757970 2019
757971 2019
757972 2019
757973 2019
757974 2019
757975 2019
757976 2019
757977 2019
757978 2019
757979 2019
757980 2019
757981 2019
757982 2019
Name: DATE, Length: 757983, dtype: int64
We can also combine the datetime functionalities with other methods from pandas. For example, we can check the number of unique years in our input data:
[29]:
data['DATE'].dt.year.nunique()
[29]:
51
TASK:
- Create two new columns, YEAR and MONTH, based on the DATE column
[30]:
data['YEAR'] = data['DATE'].dt.year
data['MONTH'] = data['DATE'].dt.month
data.head()
[30]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | TEMP_C | TIME_STR | YEAR_MONTH | DATE | YEAR | MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 29440 | 190601010600 | 90.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.777778 | 190601010600 | 190601 | 1906-01-01 06:00:00 | 1906 | 1 |
1 | 29440 | 190601011300 | NaN | 0.0 | NaN | 27.0 | NaN | NaN | -2.777778 | 190601011300 | 190601 | 1906-01-01 13:00:00 | 1906 | 1 |
2 | 29440 | 190601012000 | NaN | 0.0 | NaN | 25.0 | NaN | NaN | -3.888889 | 190601012000 | 190601 | 1906-01-01 20:00:00 | 1906 | 1 |
3 | 29440 | 190601020600 | NaN | 0.0 | NaN | 26.0 | NaN | NaN | -3.333333 | 190601020600 | 190601 | 1906-01-02 06:00:00 | 1906 | 1 |
4 | 29440 | 190601021300 | 270.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.777778 | 190601021300 | 190601 | 1906-01-02 13:00:00 | 1906 | 1 |
Aggregating data in Pandas by grouping¶
Here, we will learn how to use pandas.DataFrame.groupby which is a handy method for compressing large amounts of data and computing statistics for subgroups.
Our practical task is to calculate the average temperatures for each month.
This can be done by aggregating the data, i.e.:
- grouping the data based on year and month
- calculating the average for each month (each group), either using a for-loop or directly from the grouped object
- storing those values into a new DataFrame monthly_data
Before we start grouping the data, let’s check once more what our input data looks like:
[31]:
print("number of rows:", len(data))
data.head()
number of rows: 757983
[31]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | TEMP_C | TIME_STR | YEAR_MONTH | DATE | YEAR | MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 29440 | 190601010600 | 90.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.777778 | 190601010600 | 190601 | 1906-01-01 06:00:00 | 1906 | 1 |
1 | 29440 | 190601011300 | NaN | 0.0 | NaN | 27.0 | NaN | NaN | -2.777778 | 190601011300 | 190601 | 1906-01-01 13:00:00 | 1906 | 1 |
2 | 29440 | 190601012000 | NaN | 0.0 | NaN | 25.0 | NaN | NaN | -3.888889 | 190601012000 | 190601 | 1906-01-01 20:00:00 | 1906 | 1 |
3 | 29440 | 190601020600 | NaN | 0.0 | NaN | 26.0 | NaN | NaN | -3.333333 | 190601020600 | 190601 | 1906-01-02 06:00:00 | 1906 | 1 |
4 | 29440 | 190601021300 | 270.0 | 7.0 | NaN | 27.0 | NaN | NaN | -2.777778 | 190601021300 | 190601 | 1906-01-02 13:00:00 | 1906 | 1 |
We have quite a few rows of weather data, and several observations per day. Our goal is to create an aggregated DataFrame with only one row per month!
Let’s group our data based on unique year and month combinations:
[32]:
grouped = data.groupby(["YEAR", "MONTH"])
NOTE:
Here you could also group the data based on the YEAR_MONTH column to achieve the same result:
# Group the data
grouped = data.groupby('YEAR_MONTH')
- Let’s see what we have now.
[33]:
# What is the type?
print("Type:\n", type(grouped))
# How many?
print("Length:\n", len(grouped))
Type:
<class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
Length:
601
Okay, interesting. Now we have a new object of type DataFrameGroupBy.
TASK:
Think: what does the number of groups (length of the grouped object) tell us?
[34]:
# Answer: the length of the grouped object should be the same as
data["YEAR_MONTH"].nunique()
# in other words, the number of groups is the number of unique year and month combinations in our data
[34]:
601
There are several methods we can use for extracting information from the grouped data. See the pandas documentation for GroupBy objects for a comprehensive overview.
Checking group names:
[35]:
# Check the "names" of each group (uncomment the next row if you want to print out all the keys)
#grouped.groups.keys()
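If you want to peek at only a few group names without printing them all, one option (a minimal sketch) is to convert the keys into a list and slice it:

# Print the first five group keys only
print(list(grouped.groups.keys())[:5])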
Accessing data for one group:
- Let’s check the contents of the group representing April 2019 (the name of that group is (2019, 4), as we grouped the data based on the datetime columns YEAR and MONTH). We can get the values for that month from the grouped object using the get_group() method:
[36]:
# Specify the year and month of interest
month = (2019, 4)
# Select the group
group1 = grouped.get_group(month)
# Let's see what we have
print(group1)
STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN \
745098 29440 201904010000 280.0 6.0 NaN 30.0 NaN NaN
745099 29440 201904010020 280.0 7.0 NaN 32.0 NaN NaN
745100 29440 201904010050 280.0 6.0 NaN 30.0 NaN NaN
745101 29440 201904010100 280.0 7.0 NaN 30.0 NaN NaN
745102 29440 201904010120 280.0 6.0 NaN 30.0 NaN NaN
745103 29440 201904010150 300.0 5.0 NaN 30.0 NaN NaN
745104 29440 201904010200 290.0 6.0 NaN 30.0 NaN NaN
745105 29440 201904010220 290.0 6.0 NaN 30.0 NaN NaN
745106 29440 201904010250 290.0 6.0 NaN 28.0 NaN NaN
745107 29440 201904010300 290.0 5.0 NaN 29.0 NaN NaN
745108 29440 201904010320 280.0 6.0 NaN 30.0 NaN NaN
745109 29440 201904010350 280.0 6.0 NaN 28.0 NaN NaN
745110 29440 201904010400 280.0 5.0 NaN 29.0 NaN NaN
745111 29440 201904010420 290.0 6.0 NaN 28.0 NaN NaN
745112 29440 201904010450 270.0 7.0 NaN 30.0 NaN NaN
745113 29440 201904010500 270.0 6.0 NaN 30.0 NaN NaN
745114 29440 201904010520 280.0 8.0 NaN 30.0 NaN NaN
745115 29440 201904010550 280.0 7.0 NaN 30.0 NaN NaN
745116 29440 201904010600 290.0 6.0 NaN 31.0 35.0 29.0
745117 29440 201904010620 280.0 8.0 NaN 32.0 NaN NaN
745118 29440 201904010650 290.0 10.0 NaN 32.0 NaN NaN
745119 29440 201904010700 300.0 7.0 11.0 33.0 NaN NaN
745120 29440 201904010720 990.0 10.0 NaN 34.0 NaN NaN
745121 29440 201904010750 290.0 9.0 NaN 34.0 NaN NaN
745122 29440 201904010800 290.0 7.0 11.0 35.0 NaN NaN
745123 29440 201904010820 290.0 9.0 NaN 36.0 NaN NaN
745124 29440 201904010850 250.0 13.0 NaN 37.0 NaN NaN
745125 29440 201904010900 260.0 9.0 15.0 37.0 NaN NaN
745126 29440 201904010920 250.0 14.0 NaN 37.0 NaN NaN
745127 29440 201904010950 240.0 15.0 NaN 39.0 NaN NaN
... ... ... ... ... ... ... ... ...
747231 29440 201904301400 260.0 6.0 NaN 57.0 NaN NaN
747232 29440 201904301420 990.0 6.0 NaN 57.0 NaN NaN
747233 29440 201904301450 990.0 7.0 NaN 59.0 NaN NaN
747234 29440 201904301500 250.0 8.0 11.0 58.0 NaN NaN
747235 29440 201904301520 240.0 9.0 NaN 59.0 NaN NaN
747236 29440 201904301550 990.0 5.0 NaN 57.0 NaN NaN
747237 29440 201904301600 360.0 0.0 NaN 58.0 NaN NaN
747238 29440 201904301620 990.0 3.0 NaN 57.0 NaN NaN
747239 29440 201904301650 190.0 3.0 NaN 57.0 NaN NaN
747240 29440 201904301700 180.0 2.0 NaN 57.0 NaN NaN
747241 29440 201904301720 190.0 2.0 NaN 55.0 NaN NaN
747242 29440 201904301750 190.0 2.0 NaN 54.0 NaN NaN
747243 29440 201904301800 360.0 0.0 NaN 51.0 58.0 42.0
747244 29440 201904301820 300.0 6.0 NaN 52.0 NaN NaN
747245 29440 201904301850 280.0 3.0 NaN 50.0 NaN NaN
747246 29440 201904301900 270.0 5.0 NaN 47.0 NaN NaN
747247 29440 201904301920 230.0 2.0 NaN 46.0 NaN NaN
747248 29440 201904301950 990.0 1.0 NaN 41.0 NaN NaN
747249 29440 201904302000 360.0 0.0 NaN 42.0 NaN NaN
747250 29440 201904302020 990.0 1.0 NaN 41.0 NaN NaN
747251 29440 201904302050 NaN 0.0 NaN 39.0 NaN NaN
747252 29440 201904302100 360.0 0.0 NaN 39.0 NaN NaN
747253 29440 201904302120 990.0 1.0 NaN 37.0 NaN NaN
747254 29440 201904302150 NaN 0.0 NaN 37.0 NaN NaN
747255 29440 201904302200 360.0 0.0 NaN 36.0 NaN NaN
747256 29440 201904302220 990.0 1.0 NaN 36.0 NaN NaN
747257 29440 201904302250 990.0 1.0 NaN 36.0 NaN NaN
747258 29440 201904302300 360.0 0.0 NaN 36.0 NaN NaN
747259 29440 201904302320 990.0 1.0 NaN 34.0 NaN NaN
747260 29440 201904302350 190.0 3.0 NaN 36.0 NaN NaN
TEMP_C TIME_STR YEAR_MONTH DATE YEAR MONTH
745098 -1.111111 201904010000 201904 2019-04-01 00:00:00 2019 4
745099 0.000000 201904010020 201904 2019-04-01 00:20:00 2019 4
745100 -1.111111 201904010050 201904 2019-04-01 00:50:00 2019 4
745101 -1.111111 201904010100 201904 2019-04-01 01:00:00 2019 4
745102 -1.111111 201904010120 201904 2019-04-01 01:20:00 2019 4
745103 -1.111111 201904010150 201904 2019-04-01 01:50:00 2019 4
745104 -1.111111 201904010200 201904 2019-04-01 02:00:00 2019 4
745105 -1.111111 201904010220 201904 2019-04-01 02:20:00 2019 4
745106 -2.222222 201904010250 201904 2019-04-01 02:50:00 2019 4
745107 -1.666667 201904010300 201904 2019-04-01 03:00:00 2019 4
745108 -1.111111 201904010320 201904 2019-04-01 03:20:00 2019 4
745109 -2.222222 201904010350 201904 2019-04-01 03:50:00 2019 4
745110 -1.666667 201904010400 201904 2019-04-01 04:00:00 2019 4
745111 -2.222222 201904010420 201904 2019-04-01 04:20:00 2019 4
745112 -1.111111 201904010450 201904 2019-04-01 04:50:00 2019 4
745113 -1.111111 201904010500 201904 2019-04-01 05:00:00 2019 4
745114 -1.111111 201904010520 201904 2019-04-01 05:20:00 2019 4
745115 -1.111111 201904010550 201904 2019-04-01 05:50:00 2019 4
745116 -0.555556 201904010600 201904 2019-04-01 06:00:00 2019 4
745117 0.000000 201904010620 201904 2019-04-01 06:20:00 2019 4
745118 0.000000 201904010650 201904 2019-04-01 06:50:00 2019 4
745119 0.555556 201904010700 201904 2019-04-01 07:00:00 2019 4
745120 1.111111 201904010720 201904 2019-04-01 07:20:00 2019 4
745121 1.111111 201904010750 201904 2019-04-01 07:50:00 2019 4
745122 1.666667 201904010800 201904 2019-04-01 08:00:00 2019 4
745123 2.222222 201904010820 201904 2019-04-01 08:20:00 2019 4
745124 2.777778 201904010850 201904 2019-04-01 08:50:00 2019 4
745125 2.777778 201904010900 201904 2019-04-01 09:00:00 2019 4
745126 2.777778 201904010920 201904 2019-04-01 09:20:00 2019 4
745127 3.888889 201904010950 201904 2019-04-01 09:50:00 2019 4
... ... ... ... ... ... ...
747231 13.888889 201904301400 201904 2019-04-30 14:00:00 2019 4
747232 13.888889 201904301420 201904 2019-04-30 14:20:00 2019 4
747233 15.000000 201904301450 201904 2019-04-30 14:50:00 2019 4
747234 14.444444 201904301500 201904 2019-04-30 15:00:00 2019 4
747235 15.000000 201904301520 201904 2019-04-30 15:20:00 2019 4
747236 13.888889 201904301550 201904 2019-04-30 15:50:00 2019 4
747237 14.444444 201904301600 201904 2019-04-30 16:00:00 2019 4
747238 13.888889 201904301620 201904 2019-04-30 16:20:00 2019 4
747239 13.888889 201904301650 201904 2019-04-30 16:50:00 2019 4
747240 13.888889 201904301700 201904 2019-04-30 17:00:00 2019 4
747241 12.777778 201904301720 201904 2019-04-30 17:20:00 2019 4
747242 12.222222 201904301750 201904 2019-04-30 17:50:00 2019 4
747243 10.555556 201904301800 201904 2019-04-30 18:00:00 2019 4
747244 11.111111 201904301820 201904 2019-04-30 18:20:00 2019 4
747245 10.000000 201904301850 201904 2019-04-30 18:50:00 2019 4
747246 8.333333 201904301900 201904 2019-04-30 19:00:00 2019 4
747247 7.777778 201904301920 201904 2019-04-30 19:20:00 2019 4
747248 5.000000 201904301950 201904 2019-04-30 19:50:00 2019 4
747249 5.555556 201904302000 201904 2019-04-30 20:00:00 2019 4
747250 5.000000 201904302020 201904 2019-04-30 20:20:00 2019 4
747251 3.888889 201904302050 201904 2019-04-30 20:50:00 2019 4
747252 3.888889 201904302100 201904 2019-04-30 21:00:00 2019 4
747253 2.777778 201904302120 201904 2019-04-30 21:20:00 2019 4
747254 2.777778 201904302150 201904 2019-04-30 21:50:00 2019 4
747255 2.222222 201904302200 201904 2019-04-30 22:00:00 2019 4
747256 2.222222 201904302220 201904 2019-04-30 22:20:00 2019 4
747257 2.222222 201904302250 201904 2019-04-30 22:50:00 2019 4
747258 2.222222 201904302300 201904 2019-04-30 23:00:00 2019 4
747259 1.111111 201904302320 201904 2019-04-30 23:20:00 2019 4
747260 2.222222 201904302350 201904 2019-04-30 23:50:00 2019 4
[2163 rows x 14 columns]
Ahaa! As we can see, a single group contains a DataFrame with values only for that specific month. Let’s check the data type of this group:
[37]:
type(group1)
[37]:
pandas.core.frame.DataFrame
So, one group is a pandas DataFrame! This is really useful, because we can now use all the familiar DataFrame methods for calculating statistics, etc., for this specific group. We can, for example, calculate the average values for all variables using the statistical functions that we have seen already (e.g. mean, std, min, max, median, etc.).
We can do that using the mean() function that we already used during Lesson 5.
- Let’s calculate the mean for the following attributes all at once: DIR, SPEED, GUST, TEMP_F, TEMP_C, and MONTH:
[38]:
# Specify the columns that will be part of the calculation
mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP_F', 'TEMP_C', 'MONTH']
# Calculate the mean values all at one go
mean_values = group1[mean_cols].mean()
# Let's see what we have
print(mean_values)
DIR 309.035306
SPEED 5.932188
GUST 15.868217
TEMP_F 42.472030
TEMP_C 5.817794
MONTH 4.000000
dtype: float64
Here we saw how you can access data from a single group. For getting information about all groups (all months) we can use a for-loop or methods available in the grouped object.
For-loops and grouped objects:
When iterating over the groups in our DataFrameGroupBy object, it is important to understand that a single group contains not only the actual values, but also information about the key that was used for the grouping. Hence, when iterating over the data we need to assign the key and the values into separate variables.
- Let’s see how we can iterate over the groups and print the key and the data from a single group (again using break to see what is happening):
[39]:
# Iterate over groups
for key, group in grouped:
    # Print key and group
    print("Key:\n", key)
    print("\nFirst rows of data in this group:\n", group.head())

    # Stop iteration with break command
    break
Key:
(1906, 1)
First rows of data in this group:
STATION_NUMBER TIME DIR SPEED GUST TEMP_F MAX MIN \
0 29440 190601010600 90.0 7.0 NaN 27.0 NaN NaN
1 29440 190601011300 NaN 0.0 NaN 27.0 NaN NaN
2 29440 190601012000 NaN 0.0 NaN 25.0 NaN NaN
3 29440 190601020600 NaN 0.0 NaN 26.0 NaN NaN
4 29440 190601021300 270.0 7.0 NaN 27.0 NaN NaN
TEMP_C TIME_STR YEAR_MONTH DATE YEAR MONTH
0 -2.777778 190601010600 190601 1906-01-01 06:00:00 1906 1
1 -2.777778 190601011300 190601 1906-01-01 13:00:00 1906 1
2 -3.888889 190601012000 190601 1906-01-01 20:00:00 1906 1
3 -3.333333 190601020600 190601 1906-01-02 06:00:00 1906 1
4 -2.777778 190601021300 190601 1906-01-02 13:00:00 1906 1
Okay, so from here we can see that the key contains the name of the group (year, month).
- Let’s see how we can create a DataFrame where we calculate the mean values for all those weather attributes that we were interested in. We will repeat some of the earlier steps so that you can see and better understand what is happening.
[40]:
# Create an empty DataFrame for the aggregated values
monthly_data = pd.DataFrame()
# The columns that we want to aggregate
mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP_F', 'TEMP_C', "MONTH"]
# Iterate over the groups
for key, group in grouped:
    # Calculate mean
    mean_values = group[mean_cols].mean()

    # Add the key (i.e. the year and month) into the aggregated values
    mean_values['YEAR_MONTH'] = key

    # Append the aggregated values into the DataFrame
    monthly_data = monthly_data.append(mean_values, ignore_index=True)
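Note: the append() method used above works in the pandas version used for this lesson, but it has since been deprecated and removed (in pandas 2.0). If you are using a recent pandas version, a sketch of an alternative approach is to collect the aggregated rows into a list and build the DataFrame at the end:

# Alternative for newer pandas versions: collect the rows into a list
mean_rows = []

for key, group in grouped:
    mean_values = group[mean_cols].mean()
    mean_values['YEAR_MONTH'] = key
    mean_rows.append(mean_values)

# Build the DataFrame from the list of aggregated rows
monthly_data = pd.DataFrame(mean_rows)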
- Let’s see what we have now:
[41]:
print(monthly_data)
DIR GUST MONTH SPEED TEMP_C TEMP_F YEAR_MONTH
0 218.181818 NaN 1.0 13.204301 -3.596177 25.526882 (1906, 1)
1 178.095238 NaN 2.0 13.142857 -3.445767 25.797619 (1906, 2)
2 232.043011 NaN 3.0 15.021505 -5.107527 22.806452 (1906, 3)
3 232.045455 NaN 4.0 13.811111 3.790123 38.822222 (1906, 4)
4 192.820513 NaN 5.0 10.333333 13.070490 55.526882 (1906, 5)
5 234.222222 NaN 6.0 12.922222 16.246914 61.244444 (1906, 6)
6 226.923077 NaN 7.0 10.827957 18.805257 65.849462 (1906, 7)
7 251.627907 NaN 8.0 11.623656 13.590203 56.462366 (1906, 8)
8 236.986301 NaN 9.0 9.988889 8.611111 47.500000 (1906, 9)
9 199.397590 NaN 10.0 12.365591 4.360812 39.849462 (1906, 10)
10 199.166667 NaN 11.0 14.211111 0.296296 32.533333 (1906, 11)
11 198.636364 NaN 12.0 15.516129 -4.086022 24.645161 (1906, 12)
12 190.930233 NaN 1.0 16.344086 -9.904421 14.172043 (1907, 1)
13 223.466667 NaN 2.0 15.108434 -5.535475 22.036145 (1907, 2)
14 238.051948 NaN 3.0 10.763441 -1.344086 29.580645 (1907, 3)
15 201.038961 NaN 4.0 10.433333 2.283951 36.111111 (1907, 4)
16 244.823529 NaN 5.0 12.913978 7.048984 44.688172 (1907, 5)
17 202.674419 NaN 6.0 11.633333 14.308642 57.755556 (1907, 6)
18 215.238095 NaN 7.0 10.473118 16.923536 62.462366 (1907, 7)
19 221.139241 NaN 8.0 9.645161 12.640382 54.752688 (1907, 8)
20 258.953488 NaN 9.0 16.466667 9.185185 48.533333 (1907, 9)
21 192.682927 NaN 10.0 10.408602 8.482676 47.268817 (1907, 10)
22 204.222222 NaN 11.0 15.744444 0.641975 33.155556 (1907, 11)
23 166.052632 NaN 12.0 11.000000 -11.992754 10.413043 (1907, 12)
24 221.547619 NaN 1.0 11.376344 -6.284349 20.688172 (1908, 1)
25 190.864198 NaN 2.0 10.546512 -5.245478 22.558140 (1908, 2)
26 169.315068 NaN 3.0 6.537634 -4.342891 24.182796 (1908, 3)
27 195.333333 NaN 4.0 6.222222 2.672840 36.811111 (1908, 4)
28 259.718310 NaN 5.0 6.709677 8.052569 46.494624 (1908, 5)
29 249.651163 NaN 6.0 20.011111 14.283951 57.711111 (1908, 6)
.. ... ... ... ... ... ... ...
571 340.875980 15.628492 5.0 6.872531 8.409816 47.137668 (2017, 5)
572 379.561190 17.014451 6.0 6.904539 12.733344 54.920019 (2017, 6)
573 359.269933 15.245136 7.0 6.318477 14.878319 58.780974 (2017, 7)
574 331.270588 15.627178 8.0 6.489526 14.526758 58.148165 (2017, 8)
575 304.676641 14.788660 9.0 5.399907 10.070834 50.127501 (2017, 9)
576 277.448502 16.598854 10.0 6.535569 3.963063 39.133514 (2017, 10)
577 215.835272 16.026846 11.0 7.165504 1.957475 35.523456 (2017, 11)
578 237.338195 15.927536 12.0 7.908435 -0.497221 31.105002 (2017, 12)
579 269.553613 17.351792 1.0 6.898274 -3.848810 25.072142 (2018, 1)
580 224.710831 14.937500 2.0 5.870871 -9.752473 14.445549 (2018, 2)
581 292.265435 15.392857 3.0 6.066968 -6.356461 20.558371 (2018, 3)
582 350.141509 15.216981 4.0 5.704087 3.862159 38.951887 (2018, 4)
583 422.810552 14.819095 5.0 5.387764 14.625079 58.325143 (2018, 5)
584 445.827233 18.480337 6.0 7.056325 14.357283 57.843109 (2018, 6)
585 408.614487 15.902778 7.0 5.607442 20.695197 69.251354 (2018, 7)
586 396.655422 16.764179 8.0 6.515248 16.888990 62.400182 (2018, 8)
587 290.531856 20.581481 9.0 7.879109 12.253005 54.055409 (2018, 9)
588 256.174656 17.902685 10.0 7.452370 5.609417 42.096950 (2018, 10)
589 303.893491 16.097473 11.0 6.950848 2.547624 36.585723 (2018, 11)
590 281.960321 14.872807 12.0 6.531746 -2.959591 26.672736 (2018, 12)
591 279.460600 17.300813 1.0 5.835289 -7.841929 17.884529 (2019, 1)
592 252.858576 17.908511 2.0 9.232639 -1.682650 28.971230 (2019, 2)
593 281.606481 17.462687 3.0 7.907805 -1.397007 29.485388 (2019, 3)
594 309.035306 15.868217 4.0 5.932188 5.817794 42.472030 (2019, 4)
595 367.198506 17.553719 5.0 7.738381 8.869752 47.965553 (2019, 5)
596 370.992008 17.251852 6.0 8.138490 16.524111 61.743400 (2019, 6)
597 294.433641 15.034722 7.0 5.785714 16.427753 61.569955 (2019, 7)
598 320.335766 15.751678 8.0 6.769447 15.888138 60.598649 (2019, 8)
599 306.491058 15.173285 9.0 6.363594 9.976743 49.958137 (2019, 9)
600 239.577465 17.470588 10.0 10.169014 5.985915 42.774648 (2019, 10)
[601 rows x 7 columns]
Awesome! Now we have aggregated our data and we have a new DataFrame called monthly_data with mean values for each month in the data set.
Mean for all groups at once
We can also achieve the same result by computing the mean of all columns for all groups in the grouped object:
[42]:
grouped.mean()
[42]:
STATION_NUMBER | TIME | DIR | SPEED | GUST | TEMP_F | MAX | MIN | TEMP_C | ||
---|---|---|---|---|---|---|---|---|---|---|
YEAR | MONTH | |||||||||
1906 | 1 | 29440.0 | 1.906012e+11 | 218.181818 | 13.204301 | NaN | 25.526882 | NaN | NaN | -3.596177 |
2 | 29440.0 | 1.906021e+11 | 178.095238 | 13.142857 | NaN | 25.797619 | NaN | NaN | -3.445767 | |
3 | 29440.0 | 1.906032e+11 | 232.043011 | 15.021505 | NaN | 22.806452 | NaN | NaN | -5.107527 | |
4 | 29440.0 | 1.906042e+11 | 232.045455 | 13.811111 | NaN | 38.822222 | NaN | NaN | 3.790123 | |
5 | 29440.0 | 1.906052e+11 | 192.820513 | 10.333333 | NaN | 55.526882 | NaN | NaN | 13.070490 | |
6 | 29440.0 | 1.906062e+11 | 234.222222 | 12.922222 | NaN | 61.244444 | NaN | NaN | 16.246914 | |
7 | 29440.0 | 1.906072e+11 | 226.923077 | 10.827957 | NaN | 65.849462 | NaN | NaN | 18.805257 | |
8 | 29440.0 | 1.906082e+11 | 251.627907 | 11.623656 | NaN | 56.462366 | NaN | NaN | 13.590203 | |
9 | 29440.0 | 1.906092e+11 | 236.986301 | 9.988889 | NaN | 47.500000 | NaN | NaN | 8.611111 | |
10 | 29440.0 | 1.906102e+11 | 199.397590 | 12.365591 | NaN | 39.849462 | NaN | NaN | 4.360812 | |
11 | 29440.0 | 1.906112e+11 | 199.166667 | 14.211111 | NaN | 32.533333 | NaN | NaN | 0.296296 | |
12 | 29440.0 | 1.906122e+11 | 198.636364 | 15.516129 | NaN | 24.645161 | NaN | NaN | -4.086022 | |
1907 | 1 | 29440.0 | 1.907012e+11 | 190.930233 | 16.344086 | NaN | 14.172043 | NaN | NaN | -9.904421 |
2 | 29440.0 | 1.907021e+11 | 223.466667 | 15.108434 | NaN | 22.036145 | NaN | NaN | -5.535475 | |
3 | 29440.0 | 1.907032e+11 | 238.051948 | 10.763441 | NaN | 29.580645 | NaN | NaN | -1.344086 | |
4 | 29440.0 | 1.907042e+11 | 201.038961 | 10.433333 | NaN | 36.111111 | NaN | NaN | 2.283951 | |
5 | 29440.0 | 1.907052e+11 | 244.823529 | 12.913978 | NaN | 44.688172 | NaN | NaN | 7.048984 | |
6 | 29440.0 | 1.907062e+11 | 202.674419 | 11.633333 | NaN | 57.755556 | NaN | NaN | 14.308642 | |
7 | 29440.0 | 1.907072e+11 | 215.238095 | 10.473118 | NaN | 62.462366 | NaN | NaN | 16.923536 | |
8 | 29440.0 | 1.907082e+11 | 221.139241 | 9.645161 | NaN | 54.752688 | NaN | NaN | 12.640382 | |
9 | 29440.0 | 1.907092e+11 | 258.953488 | 16.466667 | NaN | 48.533333 | NaN | NaN | 9.185185 | |
10 | 29440.0 | 1.907102e+11 | 192.682927 | 10.408602 | NaN | 47.268817 | NaN | NaN | 8.482676 | |
11 | 29440.0 | 1.907112e+11 | 204.222222 | 15.744444 | NaN | 33.155556 | NaN | NaN | 0.641975 | |
12 | 29440.0 | 1.907122e+11 | 166.052632 | 11.000000 | NaN | 10.413043 | NaN | NaN | -11.992754 | |
1908 | 1 | 29440.0 | 1.908012e+11 | 221.547619 | 11.376344 | NaN | 20.688172 | NaN | NaN | -6.284349 |
2 | 29440.0 | 1.908022e+11 | 190.864198 | 10.546512 | NaN | 22.558140 | NaN | NaN | -5.245478 | |
3 | 29440.0 | 1.908032e+11 | 169.315068 | 6.537634 | NaN | 24.182796 | NaN | NaN | -4.342891 | |
4 | 29440.0 | 1.908042e+11 | 195.333333 | 6.222222 | NaN | 36.811111 | NaN | NaN | 2.672840 | |
5 | 29440.0 | 1.908052e+11 | 259.718310 | 6.709677 | NaN | 46.494624 | NaN | NaN | 8.052569 | |
6 | 29440.0 | 1.908062e+11 | 249.651163 | 20.011111 | NaN | 57.711111 | NaN | NaN | 14.283951 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2017 | 5 | 29440.0 | 2.017052e+11 | 340.875980 | 6.872531 | 15.628492 | 47.137668 | 53.580645 | 40.387097 | 8.409816 |
6 | 29440.0 | 2.017062e+11 | 379.561190 | 6.904539 | 17.014451 | 54.920019 | 60.566667 | 49.516667 | 12.733344 | |
7 | 29440.0 | 2.017072e+11 | 359.269933 | 6.318477 | 15.245136 | 58.780974 | 64.345455 | 53.581818 | 14.878319 | |
8 | 29440.0 | 2.017082e+11 | 331.270588 | 6.489526 | 15.627178 | 58.148165 | 62.229508 | 53.590164 | 14.526758 | |
9 | 29440.0 | 2.017092e+11 | 304.676641 | 5.399907 | 14.788660 | 50.127501 | 53.300000 | 46.600000 | 10.070834 | |
10 | 29440.0 | 2.017102e+11 | 277.448502 | 6.535569 | 16.598854 | 39.133514 | 41.918033 | 36.098361 | 3.963063 | |
11 | 29440.0 | 2.017112e+11 | 215.835272 | 7.165504 | 16.026846 | 35.523456 | 37.200000 | 33.368421 | 1.957475 | |
12 | 29440.0 | 2.017122e+11 | 237.338195 | 7.908435 | 15.927536 | 31.105002 | 33.016393 | 28.758065 | -0.497221 | |
2018 | 1 | 29440.0 | 2.018012e+11 | 269.553613 | 6.898274 | 17.351792 | 25.072142 | 26.816667 | 21.637931 | -3.848810 |
2 | 29440.0 | 2.018021e+11 | 224.710831 | 5.870871 | 14.937500 | 14.445549 | 18.196429 | 10.196429 | -9.752473 | |
3 | 29440.0 | 2.018032e+11 | 292.265435 | 6.066968 | 15.392857 | 20.558371 | 25.491525 | 12.423729 | -6.356461 | |
4 | 29440.0 | 2.018042e+11 | 350.141509 | 5.704087 | 15.216981 | 38.951887 | 44.175439 | 32.844828 | 3.862159 | |
5 | 29440.0 | 2.018052e+11 | 422.810552 | 5.387764 | 14.819095 | 58.325143 | 63.512195 | 48.658537 | 14.625079 | |
6 | 29440.0 | 2.018062e+11 | 445.827233 | 7.056325 | 18.480337 | 57.843109 | 63.754717 | 51.000000 | 14.357283 | |
7 | 29440.0 | 2.018072e+11 | 408.614487 | 5.607442 | 15.902778 | 69.251354 | 75.262295 | 63.688525 | 20.695197 | |
8 | 29440.0 | 2.018082e+11 | 396.655422 | 6.515248 | 16.764179 | 62.400182 | 67.833333 | 57.350000 | 16.888990 | |
9 | 29440.0 | 2.018092e+11 | 290.531856 | 7.879109 | 20.581481 | 54.055409 | 57.833333 | 49.648148 | 12.253005 | |
10 | 29440.0 | 2.018102e+11 | 256.174656 | 7.452370 | 17.902685 | 42.096950 | 45.229508 | 37.672131 | 5.609417 | |
11 | 29440.0 | 2.018111e+11 | 303.893491 | 6.950848 | 16.097473 | 36.585723 | 38.510638 | 34.085106 | 2.547624 | |
12 | 29440.0 | 2.018122e+11 | 281.960321 | 6.531746 | 14.872807 | 26.672736 | 27.901961 | 23.173077 | -2.959591 | |
2019 | 1 | 29440.0 | 2.019012e+11 | 279.460600 | 5.835289 | 17.300813 | 17.884529 | 21.474576 | 12.700000 | -7.841929 |
2 | 29440.0 | 2.019021e+11 | 252.858576 | 9.232639 | 17.908511 | 28.971230 | 32.672727 | 24.545455 | -1.682650 | |
3 | 29440.0 | 2.019032e+11 | 281.606481 | 7.907805 | 17.462687 | 29.485388 | 32.870968 | 24.870968 | -1.397007 | |
4 | 29440.0 | 2.019042e+11 | 309.035306 | 5.932188 | 15.868217 | 42.472030 | 48.810345 | 35.847458 | 5.817794 | |
5 | 29440.0 | 2.019052e+11 | 367.198506 | 7.738381 | 17.553719 | 47.965553 | 52.907407 | 43.462963 | 8.869752 | |
6 | 29440.0 | 2.019062e+11 | 370.992008 | 8.138490 | 17.251852 | 61.743400 | 67.316667 | 55.600000 | 16.524111 | |
7 | 29440.0 | 2.019072e+11 | 294.433641 | 5.785714 | 15.034722 | 61.569955 | 67.774194 | 55.903226 | 16.427753 | |
8 | 29440.0 | 2.019082e+11 | 320.335766 | 6.769447 | 15.751678 | 60.598649 | 65.935484 | 55.016129 | 15.888138 | |
9 | 29440.0 | 2.019092e+11 | 306.491058 | 6.363594 | 15.173285 | 49.958137 | 53.766667 | 45.350000 | 9.976743 | |
10 | 29440.0 | 2.019100e+11 | 239.577465 | 10.169014 | 17.470588 | 42.774648 | 48.500000 | 41.000000 | 5.985915 |
601 rows × 9 columns
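Note: in the pandas version used here, non-numeric columns (such as TIME_STR and DATE) are silently dropped when computing the mean. In recent pandas versions you may need to request this behavior explicitly using the numeric_only parameter:

# Compute the mean of the numeric columns only (required in newer pandas versions)
grouped.mean(numeric_only=True)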
Detecting warm months¶
Now that we have aggregated our data on a monthly level, all we need to do is check which years had the warmest April temperatures. A simple approach is to select all Aprils from the data, group the data, and check which group(s) have the highest mean value:
- select all records that are from April (regardless of the year):
[43]:
aprils = data[data["MONTH"]==4]
- take a subset of columns that might contain interesting information:
[44]:
aprils = aprils[['STATION_NUMBER','TEMP_F', 'TEMP_C','YEAR', 'MONTH']]
- group by year and month:
[45]:
grouped = aprils.groupby(by=["YEAR", "MONTH"])
- calculate mean for each group:
[46]:
monthly_mean = grouped.mean()
monthly_mean.head()
[46]:
STATION_NUMBER | TEMP_F | TEMP_C | ||
---|---|---|---|---|
YEAR | MONTH | |||
1906 | 4 | 29440 | 38.822222 | 3.790123 |
1907 | 4 | 29440 | 36.111111 | 2.283951 |
1908 | 4 | 29440 | 36.811111 | 2.672840 |
1909 | 4 | 29440 | 31.977778 | -0.012346 |
1910 | 4 | 29440 | 39.833333 | 4.351852 |
- check the highest temperature values (sort the DataFrame in descending order):
[47]:
monthly_mean.sort_values(by="TEMP_C", ascending=False).head(10)
[47]:
STATION_NUMBER | TEMP_F | TEMP_C | ||
---|---|---|---|---|
YEAR | MONTH | |||
2019 | 4 | 29440 | 42.472030 | 5.817794 |
1990 | 4 | 29440 | 41.918084 | 5.510047 |
1989 | 4 | 29440 | 41.369647 | 5.205360 |
2011 | 4 | 29440 | 41.290730 | 5.161517 |
2004 | 4 | 29440 | 41.249676 | 5.138709 |
2002 | 4 | 29440 | 41.132353 | 5.073529 |
1983 | 4 | 29440 | 41.016183 | 5.008991 |
2008 | 4 | 29440 | 40.962343 | 4.979079 |
2000 | 4 | 29440 | 40.777778 | 4.876543 |
1999 | 4 | 29440 | 40.695291 | 4.830717 |
How did April 2019 rank at the Tampere Pirkkala observation station 🌡️?
Repeating the data analysis with larger dataset¶
Finally, let’s repeat the data analysis steps above for all the available data we have (!!). First, confirm the path to the folder where all the input data are located. The idea is that we will repeat the analysis process for each input file using a (rather long) for-loop! Here we have all the main analysis steps with some additional output info, all in one long code cell:
[48]:
# Read selected columns of data using varying amount of spaces as separator and specifying * characters as NoData values
data = pd.read_csv(fp, delim_whitespace=True, usecols=['USAF','YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN'], na_values=['*', '**', '***', '****', '*****', '******'])
# Rename the columns
new_names = {'USAF':'STATION_NUMBER','YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST', 'TEMP':'TEMP_F'}
data = data.rename(columns=new_names)
#Print info about the current input file:
print("STATION NUMBER:", data.at[0,"STATION_NUMBER"])
print("NUMBER OF OBSERVATIONS:", len(data))
# Create column
col_name = 'TEMP_C'
data[col_name] = None
# Convert temperatures from Fahrenheit to Celsius
data["TEMP_C"] = data["TEMP_F"].apply(fahr_to_celsius)
# Convert TIME to string
data['TIME_STR'] = data['TIME'].astype(str)
# Parse year and month
data['MONTH'] = data['TIME_STR'].str.slice(start=4, stop=6).astype(int)
data['YEAR'] = data['TIME_STR'].str.slice(start=0, stop=4).astype(int)
# Extract observations for the months of April
aprils = data[data['MONTH']==4]
# Take a subset of columns
aprils = aprils[['STATION_NUMBER','TEMP_F', 'TEMP_C', 'YEAR', 'MONTH']]
# Group by year and month
grouped = aprils.groupby(by=["YEAR", "MONTH"])
# Get mean values for each group
monthly_mean = grouped.mean()
# Print info
print(monthly_mean.sort_values(by="TEMP_C", ascending=False).head(5))
print("\n")
STATION NUMBER: 29440
NUMBER OF OBSERVATIONS: 757983
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
2019 4 29440 42.472030 5.817794
1990 4 29440 41.918084 5.510047
1989 4 29440 41.369647 5.205360
2011 4 29440 41.290730 5.161517
2004 4 29440 41.249676 5.138709
We will use the glob() function from the module glob to list our input files.
[49]:
import glob
[50]:
file_list = glob.glob(r'data/0*txt')
Note
Note that we’re using the * character as a wildcard, so any file that starts with data/0 and ends with txt will be added to the list of files we will iterate over. We specifically use data/0 as the starting part of the file names to avoid having our metadata files included in the list!
[51]:
print("Number of files in the list", len(file_list))
print(file_list)
Number of files in the list 15
['data\\028360.txt', 'data\\028690.txt', 'data\\028750.txt', 'data\\028970.txt', 'data\\029070.txt', 'data\\029110.txt', 'data\\029170.txt', 'data\\029350.txt', 'data\\029440.txt', 'data\\029500.txt', 'data\\029700.txt', 'data\\029720.txt', 'data\\029740.txt', 'data\\029810.txt', 'data\\029820.txt']
Now, you should have all the relevant file names in a list, and we can loop over the list using a for-loop:
[52]:
for fp in file_list:
    print(fp)
data\028360.txt
data\028690.txt
data\028750.txt
data\028970.txt
data\029070.txt
data\029110.txt
data\029170.txt
data\029350.txt
data\029440.txt
data\029500.txt
data\029700.txt
data\029720.txt
data\029740.txt
data\029810.txt
data\029820.txt
[53]:
# Repeat the analysis steps for each input file:
for fp in file_list:

    # Read selected columns of data using varying amount of spaces as separator and specifying * characters as NoData values
    data = pd.read_csv(fp, delim_whitespace=True, usecols=['USAF','YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN'], na_values=['*', '**', '***', '****', '*****', '******'])

    # Rename the columns
    new_names = {'USAF':'STATION_NUMBER','YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST', 'TEMP':'TEMP_F'}
    data = data.rename(columns=new_names)

    # Print info about the current input file:
    print("STATION NUMBER:", data.at[0,"STATION_NUMBER"])
    print("NUMBER OF OBSERVATIONS:", len(data))

    # Create column
    col_name = 'TEMP_C'
    data[col_name] = None

    # Convert temperatures from Fahrenheit to Celsius
    data["TEMP_C"] = data["TEMP_F"].apply(fahr_to_celsius)

    # Convert TIME to string
    data['TIME_STR'] = data['TIME'].astype(str)

    # Parse year and month
    data['MONTH'] = data['TIME_STR'].str.slice(start=4, stop=6).astype(int)
    data['YEAR'] = data['TIME_STR'].str.slice(start=0, stop=4).astype(int)

    # Extract observations for the months of April
    aprils = data[data['MONTH']==4]

    # Take a subset of columns
    aprils = aprils[['STATION_NUMBER','TEMP_F', 'TEMP_C', 'YEAR', 'MONTH']]

    # Group by year and month
    grouped = aprils.groupby(by=["YEAR", "MONTH"])

    # Get mean values for each group
    monthly_mean = grouped.mean()

    # Print info
    print(monthly_mean.sort_values(by="TEMP_C", ascending=False).head(5))
    print("\n")
print("\n")
STATION NUMBER: 28360
NUMBER OF OBSERVATIONS: 193825
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
1937 4 28360 38.738095 3.743386
2011 4 28360 36.699571 2.610873
1921 4 28360 36.622222 2.567901
2002 4 28360 36.500000 2.500000
2019 4 28360 34.979138 1.655076
STATION NUMBER: 28690
NUMBER OF OBSERVATIONS: 542788
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
2011 4 28690 35.430640 1.905911
2019 4 28690 35.215114 1.786174
2016 4 28690 35.031103 1.683946
1989 4 28690 34.612766 1.451537
2002 4 28690 34.279855 1.266586
STATION NUMBER: 28750
NUMBER OF OBSERVATIONS: 474562
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
1989 4 28750 39.008403 3.893557
1983 4 28750 38.758475 3.754708
2019 4 28750 38.651599 3.695333
2002 4 28750 38.270419 3.483566
1994 4 28750 38.145833 3.414352
STATION NUMBER: 28970
NUMBER OF OBSERVATIONS: 555740
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
1921 4 28970 41.688889 5.382716
1999 4 28970 39.073600 3.929778
2019 4 28970 38.706456 3.725809
1989 4 28970 38.362869 3.534927
2011 4 28970 38.094172 3.385651
STATION NUMBER: 29070
NUMBER OF OBSERVATIONS: 83567
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
2014 4 29070 35.437326 1.909626
2015 4 29070 34.437209 1.354005
2004 4 29070 34.347032 1.303907
2016 4 29070 34.303199 1.279555
2008 4 29070 34.241667 1.245370
STATION NUMBER: 29110
NUMBER OF OBSERVATIONS: 483784
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
1921 4 29110 42.166667 5.648148
2004 4 29110 41.682699 5.379277
1989 4 29110 41.420168 5.233427
1937 4 29110 40.671429 4.817460
2019 4 29110 40.636300 4.797945
STATION NUMBER: 29170
NUMBER OF OBSERVATIONS: 561097
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
1937 4 29170 43.289157 6.271754
2019 4 29170 40.666820 4.814900
2011 4 29170 40.015962 4.453312
2001 4 29170 39.713228 4.285126
1906 4 29170 39.688889 4.271605
STATION NUMBER: 29350
NUMBER OF OBSERVATIONS: 559667
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
1921 4 29350 45.144444 7.302469
1925 4 29350 40.777778 4.876543
2011 4 29350 40.670108 4.816727
2019 4 29350 40.585002 4.769446
2001 4 29350 39.662827 4.257126
STATION NUMBER: 29440
NUMBER OF OBSERVATIONS: 757983
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
2019 4 29440 42.472030 5.817794
1990 4 29440 41.918084 5.510047
1989 4 29440 41.369647 5.205360
2011 4 29440 41.290730 5.161517
2004 4 29440 41.249676 5.138709
STATION NUMBER: 29500
NUMBER OF OBSERVATIONS: 103105
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
2019 4 29500 41.639777 5.355432
2008 4 29500 40.838936 4.910520
2014 4 29500 40.226415 4.570231
2016 4 29500 39.176634 3.987019
2011 4 29500 38.647826 3.693237
STATION NUMBER: 29700
NUMBER OF OBSERVATIONS: 473881
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
1921 4 29700 42.811111 6.006173
2000 4 29700 42.375587 5.764215
1990 4 29700 42.054167 5.585648
2019 4 29700 41.548747 5.304859
2004 4 29700 41.493392 5.274107
STATION NUMBER: 29720
NUMBER OF OBSERVATIONS: 843688
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
2019 4 29720 43.558414 6.421341
1990 4 29720 43.313576 6.285320
2000 4 29720 42.663169 5.923983
2008 4 29720 42.349642 5.749801
2004 4 29720 41.903492 5.501940
STATION NUMBER: 29740
NUMBER OF OBSERVATIONS: 931767
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
2000 4 29740 43.479793 6.377663
2019 4 29740 43.464070 6.368928
1990 4 29740 43.375078 6.319488
2008 4 29740 43.341429 6.300794
2011 4 29740 42.750702 5.972612
STATION NUMBER: 29810
NUMBER OF OBSERVATIONS: 199330
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
1990 4 29810 41.157895 5.087719
2019 4 29810 40.783032 4.879462
2014 4 29810 40.058036 4.476687
2008 4 29810 40.044881 4.469378
2016 4 29810 39.270308 4.039060
STATION NUMBER: 29820
NUMBER OF OBSERVATIONS: 198334
STATION_NUMBER TEMP_F TEMP_C
YEAR MONTH
2019 4 29820 41.182197 5.101221
1990 4 29820 41.144681 5.080378
2014 4 29820 40.497908 4.721060
2008 4 29820 39.941423 4.411901
1913 4 29820 39.622222 4.234568
How about now, how did April 2019 rank across different stations?
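If you would rather answer this question programmatically than by reading the printed tables, one option (a sketch that could be added at the end of the analysis loop above) is to rank the April mean temperatures at each station and check the position of April 2019:

# Rank the Aprils from warmest (1.0) to coldest at the current station
april_ranks = monthly_mean["TEMP_C"].rank(ascending=False)

# Print the rank of April 2019 (the index is a (YEAR, MONTH) pair)
print("Rank of April 2019:", april_ranks.loc[(2019, 4)])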