Data processing with Pandas, part 2

This week we will continue developing our skills using Pandas to process real data.

Motivation

Figure: Finland, April 2019. Source: https://weather.com/news/climate/news/2019-05-20-april-2019-global-temperatures-nasa-noaa

April 2019 was the second warmest April on record globally, and the warmest on record at 13 weather stations in Finland. In this lesson, we will use our data manipulation and analysis skills to analyze weather data from Finland, and investigate the claim that April 2019 was the warmest on record across Finland.

Along the way we will cover a number of useful techniques in pandas including:

  • renaming columns
  • iterating data frame rows and applying functions
  • data aggregation
  • repeating the analysis task for several input files

Input data

In the lesson this week we are using weather observation data from Finland downloaded from NOAA. You will be working with data from either 15 or 4 different weather observation stations in Finland, depending on your environment.

Downloading the data

The first step for today’s lesson is to get the data. Which data files you download will depend on the platform you’re using for working through the lesson today.

CSC Notebooks users

If you’re working on the CSC Notebooks platform, you can download the data by opening a new terminal window in Jupyter Lab by going to File -> New -> Terminal in the Jupyter Lab menu bar. Once the terminal is open, you will need to navigate to the directory for Lesson 6 by typing

cd notebooks/notebooks/L6/

Once there, you can confirm you are in the correct directory by typing

ls

You should see something like the following output:

advanced-data-processing-with-pandas.ipynb errors.ipynb                               img
debugging.ipynb                            gcp-assertions.ipynb

If so, you’re in the correct directory and you can download the data files by typing

wget https://davewhipp.github.io/data/Finland-weather-data-CSC.tar.gz

After the download completes, you can extract the data files by typing

tar zxvf Finland-weather-data-CSC.tar.gz

At this stage you should have a new directory called data that contains the data for this week’s lesson. You can confirm this by typing

ls data

You should see something like the following:

029440.txt           029720.txt           3505doc.txt          6367598020644stn.txt
029700.txt           029740.txt           6367598020644inv.txt

Now you should be all set to proceed with the lesson!

Users with Jupyter on their personal computers

If you’re working on your own computer, you can download the data by opening a new terminal window in Jupyter Lab by going to File -> New -> Terminal in the Jupyter Lab menu bar. Once the terminal is open, you will need to navigate to the directory for Lesson 6 by typing

cd path/to/L6/

where path/to/ should be replaced with the directories needed to locate the Lesson 6 materials on your computer. Once there, you can confirm you are in the correct directory by typing

ls

You should see something like the following output:

advanced-data-processing-with-pandas.ipynb errors.ipynb                               img
debugging.ipynb                            gcp-assertions.ipynb

If so, you’re in the correct directory and you can download the data files by typing

wget https://davewhipp.github.io/data/Finland-weather-data-full.tar.gz

After the download completes, you can extract the data files by typing

tar zxvf Finland-weather-data-full.tar.gz

At this stage you should have a new directory called data that contains the data for this week’s lesson. You can confirm this by typing

ls data

You should see something like the following:

028360.txt           029070.txt           029440.txt           029740.txt  6367598020644inv.txt
028690.txt           029110.txt           029500.txt           029810.txt  6367598020644stn.txt
028750.txt           029170.txt           029700.txt           029820.txt
028970.txt           029350.txt           029720.txt           3505doc.txt

Now you should be all set to proceed with the lesson!

Binder users

It is not recommended to complete this lesson using Binder.

About the data

As part of the download there are a number of files that describe the weather data. These metadata files include 3505doc.txt, 6367598020644stn.txt, and 6367598020644inv.txt.

Note that the station list covers all 15 stations, even if you’re working with only the 4 stations on the CSC Notebooks platform.

The input data for this week are separated with a varying number of spaces (i.e., fixed width). The first lines and columns of the data look like the following:

  USAF  WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H  VSB MW MW MW MW AW AW AW AW W TEMP DEWP    SLP   ALT    STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
029440 99999 190601010600 090   7 *** *** OVC * * *  0.0 ** ** ** ** ** ** ** ** *   27 **** 1011.0 ***** ****** *** *** ***** ***** ***** ***** **
029440 99999 190601011300 ***   0 *** *** OVC * * *  0.0 ** ** ** ** ** ** ** ** *   27 **** 1015.5 ***** ****** *** *** ***** ***** ***** ***** **
029440 99999 190601012000 ***   0 *** *** OVC * * *  0.0 ** ** ** ** ** ** ** ** *   25 **** 1016.2 ***** ****** *** *** ***** ***** ***** ***** **
029440 99999 190601020600 ***   0 *** *** CLR * * *  0.0 ** ** ** ** ** ** ** ** *   26 **** 1016.2 ***** ****** *** *** ***** ***** ***** ***** **

We will develop our analysis workflow using data for a single station. Then, we will repeat the same process for all the stations.

Reading the data

In order to get started, let’s import pandas:

[1]:
import pandas as pd

At this point, you can already have a quick look at the input file 029440.txt for Tampere Pirkkala and how it is structured. We can notice at least two things we need to consider when reading in the data:

NOTE: Input data structure

  • Delimiter: The data are separated with a varying number of spaces. If you check the documentation for the read_csv() method, you can see that there are two ways of handling this: we can use either the sep parameter with sep='\s+', or delim_whitespace=True, but not both. In this case, we prefer delim_whitespace.
  • No Data values: No data values in the NOAA data are coded with a varying number of * characters. We can tell pandas to consider those characters as NaNs by specifying na_values=['*', '**', '***', '****', '*****', '******'].
[2]:
fp = r"data/029440.txt"

# Read data using varying amount of spaces as separator and specifying * characters as NoData values
data = pd.read_csv(fp, delim_whitespace=True, na_values=['*', '**', '***', '****', '*****', '******'])
C:\HYapp\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3058: DtypeWarning: Columns (29,30,31) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
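
The DtypeWarning above simply tells us that a few of the columns contain mixed data types; as the message itself suggests, it could be silenced by specifying the dtype option or low_memory=False, and it does not affect our analysis here. For reference, here is a minimal sketch of the sep-based alternative mentioned in the note above (we will stick with delim_whitespace in this lesson):

# Equivalent read using a regular expression separator instead of delim_whitespace
data = pd.read_csv(fp, sep=r'\s+', na_values=['*', '**', '***', '****', '*****', '******'])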
  • Let’s see how the data looks by printing the first five rows with the head() function
[3]:
print(data.head())
    USAF   WBAN  YR--MODAHRMN    DIR  SPD  GUS  CLG  SKC   L   M ...     SLP  \
0  29440  99999  190601010600   90.0  7.0  NaN  NaN  OVC NaN NaN ...  1011.0
1  29440  99999  190601011300    NaN  0.0  NaN  NaN  OVC NaN NaN ...  1015.5
2  29440  99999  190601012000    NaN  0.0  NaN  NaN  OVC NaN NaN ...  1016.2
3  29440  99999  190601020600    NaN  0.0  NaN  NaN  CLR NaN NaN ...  1016.2
4  29440  99999  190601021300  270.0  7.0  NaN  NaN  OVC NaN NaN ...  1015.6

   ALT  STP  MAX  MIN  PCP01  PCP06  PCP24  PCPXX  SD
0  NaN  NaN  NaN  NaN    NaN    NaN    NaN    NaN NaN
1  NaN  NaN  NaN  NaN    NaN    NaN    NaN    NaN NaN
2  NaN  NaN  NaN  NaN    NaN    NaN    NaN    NaN NaN
3  NaN  NaN  NaN  NaN    NaN    NaN    NaN    NaN NaN
4  NaN  NaN  NaN  NaN    NaN    NaN    NaN    NaN NaN

[5 rows x 33 columns]

All seems ok. However, we won’t be needing all of the 33 columns for detecting warm temperatures in April. We can check all column names by running data.columns:

[4]:
data.columns
[4]:
Index(['USAF', 'WBAN', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'CLG', 'SKC', 'L',
       'M', 'H', 'VSB', 'MW', 'MW.1', 'MW.2', 'MW.3', 'AW', 'AW.1', 'AW.2',
       'AW.3', 'W', 'TEMP', 'DEWP', 'SLP', 'ALT', 'STP', 'MAX', 'MIN', 'PCP01',
       'PCP06', 'PCP24', 'PCPXX', 'SD'],
      dtype='object')

A description for all these columns is available in the metadata file data/3505doc.txt.

Let’s read in the data one more time. This time, we will read in only some of the columns using the usecols parameter: the ones that are useful for our analysis, or at least contain values meaningful to us, including the station name, timestamp, and data about wind and temperature: 'USAF', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'TEMP', 'MAX', 'MIN'

[5]:
data = pd.read_csv(fp, delim_whitespace=True, usecols=['USAF','YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN'], na_values=['*', '**', '***', '****', '*****', '******'])
data.head()
[5]:
    USAF  YR--MODAHRMN    DIR  SPD  GUS  TEMP  MAX  MIN
0  29440  190601010600   90.0  7.0  NaN  27.0  NaN  NaN
1  29440  190601011300    NaN  0.0  NaN  27.0  NaN  NaN
2  29440  190601012000    NaN  0.0  NaN  25.0  NaN  NaN
3  29440  190601020600    NaN  0.0  NaN  26.0  NaN  NaN
4  29440  190601021300  270.0  7.0  NaN  27.0  NaN  NaN

Okay, so we can see that the data was successfully read into the DataFrame, and the asterisk (*) characters were indeed converted into NaN values.

Renaming columns

Check again the column names in our DataFrame:

[6]:
data.columns
[6]:
Index(['USAF', 'YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'TEMP', 'MAX', 'MIN'], dtype='object')

The column names that we have are somewhat awkward. Let’s change them into more intuitive ones. This can be done easily using the rename() method and a dictionary that lists old and new column names.

Dictionaries

A dictionary is a specific data structure in Python for storing key-value pairs. During this course, we will use dictionaries mainly when renaming columns in a pandas DataFrame, but dictionaries are useful for many different purposes! For more information about Python dictionaries, check out this tutorial.
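
Here is a minimal generic sketch of how a dictionary works (the names and values are made up for illustration):

# A dictionary stores key-value pairs; values are looked up by their key
fruit_colors = {'apple': 'red', 'banana': 'yellow'}
print(fruit_colors['apple'])  # prints: red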

We can define the new column names using a dictionary of key-value pairs, in which the original column name (the one that will be replaced) is the key, and the new column name is the value.

Let’s change:

  • the YR--MODAHRMN column into TIME,
  • SPD into SPEED, and
  • GUS into GUST
[7]:
# Create the dictionary with old and new names
new_names = {'YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST'}

# Let's see what they look like and what is the type
print(new_names)
print(type(new_names))
{'SPD': 'SPEED', 'YR--MODAHRMN': 'TIME', 'GUS': 'GUST'}
<class 'dict'>

From above we can see that we have successfully created a dictionary that is of type dict.

  • Now we can change the column names by passing that dictionary to the columns parameter of the rename() function.
[8]:
# Rename the columns
data = data.rename(columns=new_names)

# Print the new columns
print(data.columns)
Index(['USAF', 'TIME', 'DIR', 'SPEED', 'GUST', 'TEMP', 'MAX', 'MIN'], dtype='object')

Perfect, now our column names are easier to understand and use.

TASK: Renaming columns

The temperature values are again in Fahrenheit. As you might guess, we will soon convert these temperatures to Celsius. In order to avoid confusion between the columns, rename the column TEMP to TEMP_F. Also, let’s rename USAF as STATION_NUMBER.

[9]:
# Create the dictionary with old and new names
new_names = {'USAF':'STATION_NUMBER', 'TEMP': 'TEMP_F'}

# Rename the columns
data = data.rename(columns=new_names)

# Check the output
data.head()
[9]:
   STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN

Check dataframe properties

As we learned last week, it’s always a good idea to check basic properties of the input data before proceeding with data analysis. Let’s check:

  • How many rows and columns we have:
[10]:
data.shape
[10]:
(757983, 8)
  • Top and bottom rows:
[11]:
data.head()
[11]:
   STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN
[12]:
data.tail()
[12]:
        STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN
757978           29440  201910012220  130.0    3.0   NaN    39.0  NaN  NaN
757979           29440  201910012250  110.0    3.0   NaN    37.0  NaN  NaN
757980           29440  201910012300  100.0    2.0   NaN    38.0  NaN  NaN
757981           29440  201910012320  100.0    3.0   NaN    37.0  NaN  NaN
757982           29440  201910012350  110.0    3.0   NaN    37.0  NaN  NaN
  • Data types of the columns:
[13]:
data.dtypes
[13]:
STATION_NUMBER      int64
TIME                int64
DIR               float64
SPEED             float64
GUST              float64
TEMP_F            float64
MAX               float64
MIN               float64
dtype: object
  • Descriptive statistics:
[14]:
print(data.describe())
       STATION_NUMBER          TIME            DIR          SPEED  \
count        757983.0  7.579830e+05  699256.000000  750143.000000
mean          29440.0  1.999974e+11     233.499846       6.742641
std               0.0  1.629544e+09     209.707258       4.296191
min           29440.0  1.906010e+11      10.000000       0.000000
25%           29440.0  1.989083e+11     130.000000       3.000000
50%           29440.0  2.004042e+11     200.000000       7.000000
75%           29440.0  2.012050e+11     270.000000       9.000000
max           29440.0  2.019100e+11     990.000000      61.000000

               GUST         TEMP_F           MAX           MIN
count  19906.000000  754862.000000  23869.000000  23268.000000
mean      20.147996      40.409778     45.373539     35.783737
std        7.415138      17.898715     18.242679     17.195427
min       11.000000     -33.000000    -26.000000    -32.000000
25%       14.000000      29.000000     32.000000     26.000000
50%       18.000000      39.000000     44.000000     36.000000
75%       26.000000      54.000000     60.000000     49.000000
max      108.000000      91.000000     91.000000     81.000000

Here we can see that there is a varying number of observations per column (see the count information), because some of the columns have missing values.
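
If you want to see exactly how many values are missing in each column, one quick sketch is to combine isna() with sum():

# Count the number of NaN values in each column
print(data.isna().sum())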

Using your own functions in pandas

Now it’s again time to convert temperatures from Fahrenheit to Celsius! Yes, we have done this many times before, but this time we will learn how to apply our own functions to data in a pandas DataFrame. In short, our task is to define a function for the temperature conversion, and to apply that function to the Fahrenheit value on each row of the DataFrame. The output Celsius values should be stored in a new column called TEMP_C.

Knowing how to use your own functions in pandas can be really useful when doing your own analyses. Here, we will introduce two different approaches. First, we will see how we can apply the function row-by-row using a for-loop and the DataFrame.iterrows() method, and then we will learn how to apply the function to all rows at once using DataFrame.apply().

For both of these approaches, we first need to define our temperature conversion function from Fahrenheit to Celsius:

[15]:
def fahr_to_celsius(temp_fahrenheit):
    """
    Function to convert Fahrenheit temperature into Celsius.

    Parameters
    ----------

    temp_fahrenheit: int | float
        Input temperature in Fahrenheit (should be a number)

    Returns
    -------

    Temperature in Celsius (float)
    """

    # Convert the Fahrenheit into Celsius and return it
    converted_temp = (temp_fahrenheit - 32) / 1.8
    return converted_temp

Note: with such a simple function, we could use it directly on a column of the DataFrame in order to convert the values:

[16]:
data["TEMP_C"] = fahr_to_celsius(data["TEMP_F"])
data.head()
[16]:
   STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN    TEMP_C
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN -2.777778
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN -2.777778
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN -3.888889
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN -3.333333
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN -2.777778

If we want to do something more complicated, we need to know how to apply the function row-by-row.

Iterating over rows

We can iterate over the rows of a pandas DataFrame using the iterrows() method and apply our function one row at a time.

When iterating over the rows in our DataFrame, it is important to understand that pandas also keeps track of the index value. Hence, iterating yields not only the row values, but also the index of each row (and the row itself is a pandas Series!).

  • Let’s see how iterrows() works by printing out the TEMP_F value on each row using a for-loop:
[17]:
# Iterate over the rows
for idx, row in data.iterrows():
    # Print the index value
    print('Index:', idx)

    # Print the row
    print('Temp F:', row["TEMP_F"], "\n")

    break
Index: 0
Temp F: 27.0


When developing a for-loop, you don’t always need to go through the whole loop if you just want to test things out. The break statement in Python terminates the current loop, and we used it here just to check the values on the first row. With a large dataset, you might not want to print out thousands of values to the screen!

We can see that the idx variable indeed contains the index value at position 0 (the first row) and the row variable contains all the data from that given row stored as a pandas Series.

  • Let’s now create an empty column TEMP_C for the Celsius temperatures and update the values into that column using the fahr_to_celsius function we defined earlier:
[18]:
# Create an empty column for the DataFrame where the values will be stored
new_column = "TEMP_C"
data[new_column] = None

# Iterate over the rows
for idx, row in data.iterrows():
    # Convert the Fahrenheit to Celsius
    celsius = fahr_to_celsius(row['TEMP_F'])

    # Update the value of 'Celsius' column with the converted value
    data.at[idx, new_column] = celsius

Reminder: .at or .loc?

Here, you could also use data.loc[idx, new_column] = celsius to achieve the same result.

If you only need to access a single value in a DataFrame, DataFrame.at is faster compared to DataFrame.loc which is designed for accessing groups of rows and columns.

Check out more examples for using .at and .loc from lesson 5 materials.
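
As a quick illustration of the difference, here is a minimal sketch using the columns of our DataFrame:

# .at accesses a single value by row label and column name
first_temp = data.at[0, 'TEMP_F']

# .loc can access a group of rows and columns at once
first_rows = data.loc[0:4, ['TEMP_F', 'TEMP_C']]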

  • Let’s see what we have now.
[19]:
data.head(10)
[19]:
   STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN    TEMP_C
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN  -2.77778
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN  -2.77778
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN  -3.88889
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN  -3.33333
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN  -2.77778
5           29440  190601022000    NaN    0.0   NaN    27.0  NaN  NaN  -2.77778
6           29440  190601030600  270.0    7.0   NaN    26.0  NaN  NaN  -3.33333
7           29440  190601031300  270.0    7.0   NaN    25.0  NaN  NaN  -3.88889
8           29440  190601032000  270.0    7.0   NaN    24.0  NaN  NaN  -4.44444
9           29440  190601040600    NaN    0.0   NaN    18.0  NaN  NaN  -7.77778

Great! Now we have converted our temperatures into Celsius by using our self-made function.

Applying a function

Pandas DataFrames and Series also have a dedicated method, .apply(), for applying functions on columns (or rows!). When using .apply(), we pass the function name (without parentheses!) as an argument:

[20]:
data["TEMP_C"] = data["TEMP_F"].apply(fahr_to_celsius)
data.head()
[20]:
   STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN    TEMP_C
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN -2.777778
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN -2.777778
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN -3.888889
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN -3.333333
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN -2.777778

Note: pay attention to which column you are applying the function on! Running data.apply(fahr_to_celsius) would not give an error, but the results would not make much sense.

Should I use .iterrows() or .apply()?

We are teaching the .iterrows() method because it helps in understanding the structure of a DataFrame and the process of looping through DataFrame rows. However, using .apply() is often more efficient in terms of execution time.

At this point, the most important thing is that you understand what happens when you are modifying the values in a pandas DataFrame. When doing the course exercises, either of these approaches is ok!
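
.apply() becomes especially handy when a calculation needs several columns at once: passing axis=1 applies the function row-by-row, giving the function the whole row as a pandas Series. Here is a minimal sketch (the temp_range helper is our own invention for illustration, and we don’t store the result in our DataFrame):

def temp_range(row):
    # Difference between the maximum and minimum temperature on this row
    return row['MAX'] - row['MIN']

# axis=1 passes each row (a pandas Series) to the function
temp_ranges = data.apply(temp_range, axis=1)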

Parsing dates

We will eventually want to group our data based on month in order to see if April temperatures in 2019 were higher than average. Currently, the date and time information is stored in the column TIME:

YR--MODAHRMN = YEAR-MONTH-DAY-HOUR-MINUTE IN GREENWICH MEAN TIME (GMT)

Let’s have a closer look at the date and time information we have by checking the values in that column, and their data type:

[21]:
data["TIME"].head(10)
[21]:
0    190601010600
1    190601011300
2    190601012000
3    190601020600
4    190601021300
5    190601022000
6    190601030600
7    190601031300
8    190601032000
9    190601040600
Name: TIME, dtype: int64
[22]:
data["TIME"].tail(10)
[22]:
757973    201910012050
757974    201910012100
757975    201910012120
757976    201910012150
757977    201910012200
757978    201910012220
757979    201910012250
757980    201910012300
757981    201910012320
757982    201910012350
Name: TIME, dtype: int64

The TIME column contains several observations per day (and even several observations per hour). The timestamp of the first observation is 190601010600, i.e. from the 1st of January 1906 (way back!), and the timestamp of the latest observation is 201910012350 (only a week old at the time of writing!)

[23]:
data["TIME"].dtypes
[23]:
dtype('int64')

And the information is stored as integer values.

There are several options for proceeding from here. The bottom line is that we want to aggregate the data on a monthly level, and to do so we need to “label” each row of data based on the month when the record was observed. In practice, we could create a new column (or an index) which contains information about the month (including the year, but excluding days, hours, and minutes).

String slicing

One approach would be to convert the date and time information into character strings and “cut” the needed information from the string objects. If we look at the latest timestamp in the data (201910012350), we can see that there is a systematic pattern: YEAR-MONTH-DAY-HOUR-MINUTE. The first four characters represent the year, and the first six characters are year + month!

[24]:
date = "201910012350"
date[0:6]
[24]:
'201910'

Doing this in pandas requires two steps:

  1. Convert the TIME column from int into str datatype.
  2. Slice the correct range of characters from the character string using pandas.Series.str.slice().

  • Let’s convert the time into a string and check that the data type changes:
[25]:
# Convert to string
data['TIME_STR'] = data['TIME'].astype(str)
[26]:
# Slice the string
data['YEAR_MONTH'] = data['TIME_STR'].str.slice(start=0, stop=6)

# Let's see what we have
print(data.head())
   STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN  \
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN

     TEMP_C      TIME_STR YEAR_MONTH
0 -2.777778  190601010600     190601
1 -2.777778  190601011300     190601
2 -3.888889  190601012000     190601
3 -3.333333  190601020600     190601
4 -2.777778  190601021300     190601

Nice! Now we have “labeled” the rows based on the year and month when each record was observed. However, let’s have a look at a more clever way of dealing with dates and times.

datetime

Pandas datetime

In pandas, we can convert dates and times into a new data type datetime using pandas.to_datetime function. First, it is important to understand the structure of the input data in order to avoid erroneous conversions, and that’s why we first learned string slicing before introducing the datetime functionalities.

Here is one example of how to convert the TIME_STR column in our dataset to datetime:

# Convert to datetime
data["DATE"] = pd.to_datetime(data["TIME_STR"])

If needed, you can use the format parameter to define the expected datetime format according to the strftime(format) syntax, together with exact=False, for example like this:

# Convert to datetime
data["YEAR_MONTH"] = pd.to_datetime(data["TIME_STR"], format='%Y%m', exact=False)

In this example, exact=False drops out days, hours and minutes, because they are not included in the specified formatting.

[27]:
# Convert to datetime
data["DATE"] = pd.to_datetime(data["TIME_STR"])
data["DATE"].head()
[27]:
0   1906-01-01 06:00:00
1   1906-01-01 13:00:00
2   1906-01-01 20:00:00
3   1906-01-02 06:00:00
4   1906-01-02 13:00:00
Name: DATE, dtype: datetime64[ns]

Note: in this case, the data type of the values is datetime.

Pandas Series datetime properties

There are several methods available for accessing information about the properties of datetime values. Read more from the pandas documentation about datetime properties.

Now, we can extract different time units based on the datetime-column using the pandas.Series.dt accessor:

[28]:
data['DATE'].dt.year
[28]:
0         1906
1         1906
2         1906
3         1906
4         1906
5         1906
6         1906
7         1906
8         1906
9         1906
10        1906
11        1906
12        1906
13        1906
14        1906
15        1906
16        1906
17        1906
18        1906
19        1906
20        1906
21        1906
22        1906
23        1906
24        1906
25        1906
26        1906
27        1906
28        1906
29        1906
          ...
757953    2019
757954    2019
757955    2019
757956    2019
757957    2019
757958    2019
757959    2019
757960    2019
757961    2019
757962    2019
757963    2019
757964    2019
757965    2019
757966    2019
757967    2019
757968    2019
757969    2019
757970    2019
757971    2019
757972    2019
757973    2019
757974    2019
757975    2019
757976    2019
757977    2019
757978    2019
757979    2019
757980    2019
757981    2019
757982    2019
Name: DATE, Length: 757983, dtype: int64

We can also combine the datetime functionalities with other methods from pandas. For example, we can check the number of unique years in our input data:

[29]:
data['DATE'].dt.year.nunique()
[29]:
51

TASK:

  • Create two new columns: YEAR and MONTH based on the date column
[30]:
data['YEAR'] = data['DATE'].dt.year
data['MONTH'] = data['DATE'].dt.month

data.head()
[30]:
   STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN    TEMP_C      TIME_STR YEAR_MONTH                DATE  YEAR  MONTH
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN -2.777778  190601010600     190601 1906-01-01 06:00:00  1906      1
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN -2.777778  190601011300     190601 1906-01-01 13:00:00  1906      1
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN -3.888889  190601012000     190601 1906-01-01 20:00:00  1906      1
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN -3.333333  190601020600     190601 1906-01-02 06:00:00  1906      1
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN -2.777778  190601021300     190601 1906-01-02 13:00:00  1906      1

Aggregating data in Pandas by grouping

Here, we will learn how to use pandas.DataFrame.groupby which is a handy method for compressing large amounts of data and computing statistics for subgroups.

Our practical task is to calculate the average temperatures for each month.

This can be done by aggregating the data, i.e.:

  1. grouping the data based on year and month
  2. calculating the average for each month (each group), either by using a for-loop or directly from the grouped object
  3. storing those values into a new DataFrame called monthly_data

Before we start grouping the data, let’s once more check what our input data looks like:

[31]:
print("number of rows:", len(data))
data.head()
number of rows: 757983
[31]:
   STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN    TEMP_C      TIME_STR YEAR_MONTH                DATE  YEAR  MONTH
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN -2.777778  190601010600     190601 1906-01-01 06:00:00  1906      1
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN -2.777778  190601011300     190601 1906-01-01 13:00:00  1906      1
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN -3.888889  190601012000     190601 1906-01-01 20:00:00  1906      1
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN -3.333333  190601020600     190601 1906-01-02 06:00:00  1906      1
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN -2.777778  190601021300     190601 1906-01-02 13:00:00  1906      1

We have quite a few rows of weather data, and several observations per day. Our goal is to create an aggregated DataFrame that has only one row per month!

Let’s group our data based on unique year and month combinations:

[32]:
grouped = data.groupby(["YEAR", "MONTH"])

NOTE:

Here you could also group the data based on the YEAR_MONTH column to achieve the same result:

# Group the data
grouped = data.groupby('YEAR_MONTH')
  • Let’s see what we have now.
[33]:
# What is the type?
print("Type:\n", type(grouped))

# How many?
print("Length:\n", len(grouped))
Type:
 <class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
Length:
 601

Okay, interesting. Now we have a new object of type DataFrameGroupBy.

TASK:

Think: what does the number of groups (length of the grouped object) tell us?

[34]:
# Answer: the length of the grouped object should be the same as
data["YEAR_MONTH"].nunique()

# in other words, the number of groups is the number of unique year and month combinations in our data
[34]:
601

There are several methods we can use for extracting information from the grouped data. See documentation for Pandas GroupBy objects for a comprehensive overview.

Checking group names:

[35]:
# Check the "names" of each group (uncomment the next row if you want to print out all the keys)
#grouped.groups.keys()
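
Another handy method is size(), which tells us how many observations (rows) each group contains. A quick sketch:

# Number of observations in each (year, month) group
print(grouped.size().head())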

Accessing data for one group:

  • Let’s check the contents of the group representing April 2019 (the name of that group is (2019, 4), as we grouped the data based on the YEAR and MONTH columns). We can get the values for that month from the grouped object using the get_group() method:
[36]:
# Specify the year and month of interest
month = (2019, 4)

# Select the group
group1 = grouped.get_group(month)

# Let's see what we have
print(group1)
        STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F   MAX   MIN  \
745098           29440  201904010000  280.0    6.0   NaN    30.0   NaN   NaN
745099           29440  201904010020  280.0    7.0   NaN    32.0   NaN   NaN
745100           29440  201904010050  280.0    6.0   NaN    30.0   NaN   NaN
745101           29440  201904010100  280.0    7.0   NaN    30.0   NaN   NaN
745102           29440  201904010120  280.0    6.0   NaN    30.0   NaN   NaN
745103           29440  201904010150  300.0    5.0   NaN    30.0   NaN   NaN
745104           29440  201904010200  290.0    6.0   NaN    30.0   NaN   NaN
745105           29440  201904010220  290.0    6.0   NaN    30.0   NaN   NaN
745106           29440  201904010250  290.0    6.0   NaN    28.0   NaN   NaN
745107           29440  201904010300  290.0    5.0   NaN    29.0   NaN   NaN
745108           29440  201904010320  280.0    6.0   NaN    30.0   NaN   NaN
745109           29440  201904010350  280.0    6.0   NaN    28.0   NaN   NaN
745110           29440  201904010400  280.0    5.0   NaN    29.0   NaN   NaN
745111           29440  201904010420  290.0    6.0   NaN    28.0   NaN   NaN
745112           29440  201904010450  270.0    7.0   NaN    30.0   NaN   NaN
745113           29440  201904010500  270.0    6.0   NaN    30.0   NaN   NaN
745114           29440  201904010520  280.0    8.0   NaN    30.0   NaN   NaN
745115           29440  201904010550  280.0    7.0   NaN    30.0   NaN   NaN
745116           29440  201904010600  290.0    6.0   NaN    31.0  35.0  29.0
745117           29440  201904010620  280.0    8.0   NaN    32.0   NaN   NaN
745118           29440  201904010650  290.0   10.0   NaN    32.0   NaN   NaN
745119           29440  201904010700  300.0    7.0  11.0    33.0   NaN   NaN
745120           29440  201904010720  990.0   10.0   NaN    34.0   NaN   NaN
745121           29440  201904010750  290.0    9.0   NaN    34.0   NaN   NaN
745122           29440  201904010800  290.0    7.0  11.0    35.0   NaN   NaN
745123           29440  201904010820  290.0    9.0   NaN    36.0   NaN   NaN
745124           29440  201904010850  250.0   13.0   NaN    37.0   NaN   NaN
745125           29440  201904010900  260.0    9.0  15.0    37.0   NaN   NaN
745126           29440  201904010920  250.0   14.0   NaN    37.0   NaN   NaN
745127           29440  201904010950  240.0   15.0   NaN    39.0   NaN   NaN
...                ...           ...    ...    ...   ...     ...   ...   ...
747231           29440  201904301400  260.0    6.0   NaN    57.0   NaN   NaN
747232           29440  201904301420  990.0    6.0   NaN    57.0   NaN   NaN
747233           29440  201904301450  990.0    7.0   NaN    59.0   NaN   NaN
747234           29440  201904301500  250.0    8.0  11.0    58.0   NaN   NaN
747235           29440  201904301520  240.0    9.0   NaN    59.0   NaN   NaN
747236           29440  201904301550  990.0    5.0   NaN    57.0   NaN   NaN
747237           29440  201904301600  360.0    0.0   NaN    58.0   NaN   NaN
747238           29440  201904301620  990.0    3.0   NaN    57.0   NaN   NaN
747239           29440  201904301650  190.0    3.0   NaN    57.0   NaN   NaN
747240           29440  201904301700  180.0    2.0   NaN    57.0   NaN   NaN
747241           29440  201904301720  190.0    2.0   NaN    55.0   NaN   NaN
747242           29440  201904301750  190.0    2.0   NaN    54.0   NaN   NaN
747243           29440  201904301800  360.0    0.0   NaN    51.0  58.0  42.0
747244           29440  201904301820  300.0    6.0   NaN    52.0   NaN   NaN
747245           29440  201904301850  280.0    3.0   NaN    50.0   NaN   NaN
747246           29440  201904301900  270.0    5.0   NaN    47.0   NaN   NaN
747247           29440  201904301920  230.0    2.0   NaN    46.0   NaN   NaN
747248           29440  201904301950  990.0    1.0   NaN    41.0   NaN   NaN
747249           29440  201904302000  360.0    0.0   NaN    42.0   NaN   NaN
747250           29440  201904302020  990.0    1.0   NaN    41.0   NaN   NaN
747251           29440  201904302050    NaN    0.0   NaN    39.0   NaN   NaN
747252           29440  201904302100  360.0    0.0   NaN    39.0   NaN   NaN
747253           29440  201904302120  990.0    1.0   NaN    37.0   NaN   NaN
747254           29440  201904302150    NaN    0.0   NaN    37.0   NaN   NaN
747255           29440  201904302200  360.0    0.0   NaN    36.0   NaN   NaN
747256           29440  201904302220  990.0    1.0   NaN    36.0   NaN   NaN
747257           29440  201904302250  990.0    1.0   NaN    36.0   NaN   NaN
747258           29440  201904302300  360.0    0.0   NaN    36.0   NaN   NaN
747259           29440  201904302320  990.0    1.0   NaN    34.0   NaN   NaN
747260           29440  201904302350  190.0    3.0   NaN    36.0   NaN   NaN

           TEMP_C      TIME_STR YEAR_MONTH                DATE  YEAR  MONTH
745098  -1.111111  201904010000     201904 2019-04-01 00:00:00  2019      4
745099   0.000000  201904010020     201904 2019-04-01 00:20:00  2019      4
745100  -1.111111  201904010050     201904 2019-04-01 00:50:00  2019      4
745101  -1.111111  201904010100     201904 2019-04-01 01:00:00  2019      4
745102  -1.111111  201904010120     201904 2019-04-01 01:20:00  2019      4
745103  -1.111111  201904010150     201904 2019-04-01 01:50:00  2019      4
745104  -1.111111  201904010200     201904 2019-04-01 02:00:00  2019      4
745105  -1.111111  201904010220     201904 2019-04-01 02:20:00  2019      4
745106  -2.222222  201904010250     201904 2019-04-01 02:50:00  2019      4
745107  -1.666667  201904010300     201904 2019-04-01 03:00:00  2019      4
745108  -1.111111  201904010320     201904 2019-04-01 03:20:00  2019      4
745109  -2.222222  201904010350     201904 2019-04-01 03:50:00  2019      4
745110  -1.666667  201904010400     201904 2019-04-01 04:00:00  2019      4
745111  -2.222222  201904010420     201904 2019-04-01 04:20:00  2019      4
745112  -1.111111  201904010450     201904 2019-04-01 04:50:00  2019      4
745113  -1.111111  201904010500     201904 2019-04-01 05:00:00  2019      4
745114  -1.111111  201904010520     201904 2019-04-01 05:20:00  2019      4
745115  -1.111111  201904010550     201904 2019-04-01 05:50:00  2019      4
745116  -0.555556  201904010600     201904 2019-04-01 06:00:00  2019      4
745117   0.000000  201904010620     201904 2019-04-01 06:20:00  2019      4
745118   0.000000  201904010650     201904 2019-04-01 06:50:00  2019      4
745119   0.555556  201904010700     201904 2019-04-01 07:00:00  2019      4
745120   1.111111  201904010720     201904 2019-04-01 07:20:00  2019      4
745121   1.111111  201904010750     201904 2019-04-01 07:50:00  2019      4
745122   1.666667  201904010800     201904 2019-04-01 08:00:00  2019      4
745123   2.222222  201904010820     201904 2019-04-01 08:20:00  2019      4
745124   2.777778  201904010850     201904 2019-04-01 08:50:00  2019      4
745125   2.777778  201904010900     201904 2019-04-01 09:00:00  2019      4
745126   2.777778  201904010920     201904 2019-04-01 09:20:00  2019      4
745127   3.888889  201904010950     201904 2019-04-01 09:50:00  2019      4
...           ...           ...        ...                 ...   ...    ...
747231  13.888889  201904301400     201904 2019-04-30 14:00:00  2019      4
747232  13.888889  201904301420     201904 2019-04-30 14:20:00  2019      4
747233  15.000000  201904301450     201904 2019-04-30 14:50:00  2019      4
747234  14.444444  201904301500     201904 2019-04-30 15:00:00  2019      4
747235  15.000000  201904301520     201904 2019-04-30 15:20:00  2019      4
747236  13.888889  201904301550     201904 2019-04-30 15:50:00  2019      4
747237  14.444444  201904301600     201904 2019-04-30 16:00:00  2019      4
747238  13.888889  201904301620     201904 2019-04-30 16:20:00  2019      4
747239  13.888889  201904301650     201904 2019-04-30 16:50:00  2019      4
747240  13.888889  201904301700     201904 2019-04-30 17:00:00  2019      4
747241  12.777778  201904301720     201904 2019-04-30 17:20:00  2019      4
747242  12.222222  201904301750     201904 2019-04-30 17:50:00  2019      4
747243  10.555556  201904301800     201904 2019-04-30 18:00:00  2019      4
747244  11.111111  201904301820     201904 2019-04-30 18:20:00  2019      4
747245  10.000000  201904301850     201904 2019-04-30 18:50:00  2019      4
747246   8.333333  201904301900     201904 2019-04-30 19:00:00  2019      4
747247   7.777778  201904301920     201904 2019-04-30 19:20:00  2019      4
747248   5.000000  201904301950     201904 2019-04-30 19:50:00  2019      4
747249   5.555556  201904302000     201904 2019-04-30 20:00:00  2019      4
747250   5.000000  201904302020     201904 2019-04-30 20:20:00  2019      4
747251   3.888889  201904302050     201904 2019-04-30 20:50:00  2019      4
747252   3.888889  201904302100     201904 2019-04-30 21:00:00  2019      4
747253   2.777778  201904302120     201904 2019-04-30 21:20:00  2019      4
747254   2.777778  201904302150     201904 2019-04-30 21:50:00  2019      4
747255   2.222222  201904302200     201904 2019-04-30 22:00:00  2019      4
747256   2.222222  201904302220     201904 2019-04-30 22:20:00  2019      4
747257   2.222222  201904302250     201904 2019-04-30 22:50:00  2019      4
747258   2.222222  201904302300     201904 2019-04-30 23:00:00  2019      4
747259   1.111111  201904302320     201904 2019-04-30 23:20:00  2019      4
747260   2.222222  201904302350     201904 2019-04-30 23:50:00  2019      4

[2163 rows x 14 columns]

Ahaa! As we can see, a single group contains a DataFrame with values only for that specific month. Let’s check the data type of this group:

[37]:
type(group1)
[37]:
pandas.core.frame.DataFrame

So, one group is a pandas DataFrame! This is really useful, because we can now use all the familiar DataFrame methods for calculating statistics, etc., for this specific group. We can, for example, calculate the average values for all variables using the statistical functions that we have seen already (e.g. mean, std, min, max, median, etc.).

We can do that by using the mean() function that we already used during Lesson 5.

Let’s calculate the mean for the following attributes all at once:

  • DIR
  • SPEED
  • GUST
  • TEMP_F
  • TEMP_C
  • MONTH
[38]:
# Specify the columns that will be part of the calculation
mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP_F', 'TEMP_C', 'MONTH']

# Calculate the mean values all at one go
mean_values = group1[mean_cols].mean()

# Let's see what we have
print(mean_values)
DIR       309.035306
SPEED       5.932188
GUST       15.868217
TEMP_F     42.472030
TEMP_C      5.817794
MONTH       4.000000
dtype: float64

Here we saw how you can access data from a single group. For getting information about all groups (all months), we can use a for-loop or the methods available in the grouped object.

For-loops and grouped objects:

When iterating over the groups in our DataFrameGroupBy object, it is important to understand that iteration yields not only the actual values, but also the key that was used for the grouping. Hence, when iterating over the data we need to assign the key and the group values into separate variables.

  • Let’s see how we can iterate over the groups and print the key and the data from a single group (again using break to stop after the first one):
[39]:
# Iterate over groups
for key, group in grouped:
    # Print key and group
    print("Key:\n", key)
    print("\nFirst rows of data in this group:\n", group.head())

    # Stop iteration with break command
    break
Key:
 (1906, 1)

First rows of data in this group:
    STATION_NUMBER          TIME    DIR  SPEED  GUST  TEMP_F  MAX  MIN  \
0           29440  190601010600   90.0    7.0   NaN    27.0  NaN  NaN
1           29440  190601011300    NaN    0.0   NaN    27.0  NaN  NaN
2           29440  190601012000    NaN    0.0   NaN    25.0  NaN  NaN
3           29440  190601020600    NaN    0.0   NaN    26.0  NaN  NaN
4           29440  190601021300  270.0    7.0   NaN    27.0  NaN  NaN

     TEMP_C      TIME_STR YEAR_MONTH                DATE  YEAR  MONTH
0 -2.777778  190601010600     190601 1906-01-01 06:00:00  1906      1
1 -2.777778  190601011300     190601 1906-01-01 13:00:00  1906      1
2 -3.888889  190601012000     190601 1906-01-01 20:00:00  1906      1
3 -3.333333  190601020600     190601 1906-01-02 06:00:00  1906      1
4 -2.777778  190601021300     190601 1906-01-02 13:00:00  1906      1

Okay, so from here we can see that the key contains the name of the group (year, month).

  • Let’s see how we can create a DataFrame where we calculate the mean values for all the weather attributes that we are interested in. We will repeat some of the earlier steps so that you can see and better understand what is happening.
[40]:
# Create an empty DataFrame for the aggregated values
monthly_data = pd.DataFrame()

# The columns that we want to aggregate
mean_cols = ['DIR', 'SPEED', 'GUST', 'TEMP_F', 'TEMP_C', "MONTH"]

# Iterate over the groups
for key, group in grouped:

   # Calculate mean
   mean_values = group[mean_cols].mean()

   # Add the 'key' (i.e. the year and month information) into the aggregated values
   mean_values['YEAR_MONTH'] = key

   # Append the aggregated values into the DataFrame
   monthly_data = monthly_data.append(mean_values, ignore_index=True)
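
Side note: in newer versions of pandas the DataFrame.append() method has been deprecated (and later removed). If the code above does not work in your environment, a sketch of an equivalent approach is to collect the aggregated rows into a list and build the DataFrame once at the end (the column order may differ slightly):

# Collect the aggregated values into a list of Series
aggregated_rows = []
for key, group in grouped:
    mean_values = group[mean_cols].mean()
    mean_values['YEAR_MONTH'] = key
    aggregated_rows.append(mean_values)

# Build the DataFrame in one go
monthly_data = pd.DataFrame(aggregated_rows).reset_index(drop=True)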
  • Let’s see what we have now:
[41]:
print(monthly_data)
            DIR       GUST  MONTH      SPEED     TEMP_C     TEMP_F  YEAR_MONTH
0    218.181818        NaN    1.0  13.204301  -3.596177  25.526882   (1906, 1)
1    178.095238        NaN    2.0  13.142857  -3.445767  25.797619   (1906, 2)
2    232.043011        NaN    3.0  15.021505  -5.107527  22.806452   (1906, 3)
3    232.045455        NaN    4.0  13.811111   3.790123  38.822222   (1906, 4)
4    192.820513        NaN    5.0  10.333333  13.070490  55.526882   (1906, 5)
5    234.222222        NaN    6.0  12.922222  16.246914  61.244444   (1906, 6)
6    226.923077        NaN    7.0  10.827957  18.805257  65.849462   (1906, 7)
7    251.627907        NaN    8.0  11.623656  13.590203  56.462366   (1906, 8)
8    236.986301        NaN    9.0   9.988889   8.611111  47.500000   (1906, 9)
9    199.397590        NaN   10.0  12.365591   4.360812  39.849462  (1906, 10)
10   199.166667        NaN   11.0  14.211111   0.296296  32.533333  (1906, 11)
11   198.636364        NaN   12.0  15.516129  -4.086022  24.645161  (1906, 12)
12   190.930233        NaN    1.0  16.344086  -9.904421  14.172043   (1907, 1)
13   223.466667        NaN    2.0  15.108434  -5.535475  22.036145   (1907, 2)
14   238.051948        NaN    3.0  10.763441  -1.344086  29.580645   (1907, 3)
15   201.038961        NaN    4.0  10.433333   2.283951  36.111111   (1907, 4)
16   244.823529        NaN    5.0  12.913978   7.048984  44.688172   (1907, 5)
17   202.674419        NaN    6.0  11.633333  14.308642  57.755556   (1907, 6)
18   215.238095        NaN    7.0  10.473118  16.923536  62.462366   (1907, 7)
19   221.139241        NaN    8.0   9.645161  12.640382  54.752688   (1907, 8)
20   258.953488        NaN    9.0  16.466667   9.185185  48.533333   (1907, 9)
21   192.682927        NaN   10.0  10.408602   8.482676  47.268817  (1907, 10)
22   204.222222        NaN   11.0  15.744444   0.641975  33.155556  (1907, 11)
23   166.052632        NaN   12.0  11.000000 -11.992754  10.413043  (1907, 12)
24   221.547619        NaN    1.0  11.376344  -6.284349  20.688172   (1908, 1)
25   190.864198        NaN    2.0  10.546512  -5.245478  22.558140   (1908, 2)
26   169.315068        NaN    3.0   6.537634  -4.342891  24.182796   (1908, 3)
27   195.333333        NaN    4.0   6.222222   2.672840  36.811111   (1908, 4)
28   259.718310        NaN    5.0   6.709677   8.052569  46.494624   (1908, 5)
29   249.651163        NaN    6.0  20.011111  14.283951  57.711111   (1908, 6)
..          ...        ...    ...        ...        ...        ...         ...
571  340.875980  15.628492    5.0   6.872531   8.409816  47.137668   (2017, 5)
572  379.561190  17.014451    6.0   6.904539  12.733344  54.920019   (2017, 6)
573  359.269933  15.245136    7.0   6.318477  14.878319  58.780974   (2017, 7)
574  331.270588  15.627178    8.0   6.489526  14.526758  58.148165   (2017, 8)
575  304.676641  14.788660    9.0   5.399907  10.070834  50.127501   (2017, 9)
576  277.448502  16.598854   10.0   6.535569   3.963063  39.133514  (2017, 10)
577  215.835272  16.026846   11.0   7.165504   1.957475  35.523456  (2017, 11)
578  237.338195  15.927536   12.0   7.908435  -0.497221  31.105002  (2017, 12)
579  269.553613  17.351792    1.0   6.898274  -3.848810  25.072142   (2018, 1)
580  224.710831  14.937500    2.0   5.870871  -9.752473  14.445549   (2018, 2)
581  292.265435  15.392857    3.0   6.066968  -6.356461  20.558371   (2018, 3)
582  350.141509  15.216981    4.0   5.704087   3.862159  38.951887   (2018, 4)
583  422.810552  14.819095    5.0   5.387764  14.625079  58.325143   (2018, 5)
584  445.827233  18.480337    6.0   7.056325  14.357283  57.843109   (2018, 6)
585  408.614487  15.902778    7.0   5.607442  20.695197  69.251354   (2018, 7)
586  396.655422  16.764179    8.0   6.515248  16.888990  62.400182   (2018, 8)
587  290.531856  20.581481    9.0   7.879109  12.253005  54.055409   (2018, 9)
588  256.174656  17.902685   10.0   7.452370   5.609417  42.096950  (2018, 10)
589  303.893491  16.097473   11.0   6.950848   2.547624  36.585723  (2018, 11)
590  281.960321  14.872807   12.0   6.531746  -2.959591  26.672736  (2018, 12)
591  279.460600  17.300813    1.0   5.835289  -7.841929  17.884529   (2019, 1)
592  252.858576  17.908511    2.0   9.232639  -1.682650  28.971230   (2019, 2)
593  281.606481  17.462687    3.0   7.907805  -1.397007  29.485388   (2019, 3)
594  309.035306  15.868217    4.0   5.932188   5.817794  42.472030   (2019, 4)
595  367.198506  17.553719    5.0   7.738381   8.869752  47.965553   (2019, 5)
596  370.992008  17.251852    6.0   8.138490  16.524111  61.743400   (2019, 6)
597  294.433641  15.034722    7.0   5.785714  16.427753  61.569955   (2019, 7)
598  320.335766  15.751678    8.0   6.769447  15.888138  60.598649   (2019, 8)
599  306.491058  15.173285    9.0   6.363594   9.976743  49.958137   (2019, 9)
600  239.577465  17.470588   10.0  10.169014   5.985915  42.774648  (2019, 10)

[601 rows x 7 columns]

Awesome! Now we have aggregated our data and we have a new DataFrame called monthly_data where we have mean values for each month in the data set.

Mean for all groups at once

We can also achieve the same result by computing the mean of all columns for all groups in the grouped object:

[42]:
grouped.mean()
[42]:
            STATION_NUMBER          TIME         DIR      SPEED       GUST     TEMP_F        MAX        MIN     TEMP_C
YEAR MONTH
1906 1             29440.0  1.906012e+11  218.181818  13.204301        NaN  25.526882        NaN        NaN  -3.596177
     2             29440.0  1.906021e+11  178.095238  13.142857        NaN  25.797619        NaN        NaN  -3.445767
     3             29440.0  1.906032e+11  232.043011  15.021505        NaN  22.806452        NaN        NaN  -5.107527
     4             29440.0  1.906042e+11  232.045455  13.811111        NaN  38.822222        NaN        NaN   3.790123
     5             29440.0  1.906052e+11  192.820513  10.333333        NaN  55.526882        NaN        NaN  13.070490
     6             29440.0  1.906062e+11  234.222222  12.922222        NaN  61.244444        NaN        NaN  16.246914
     7             29440.0  1.906072e+11  226.923077  10.827957        NaN  65.849462        NaN        NaN  18.805257
     8             29440.0  1.906082e+11  251.627907  11.623656        NaN  56.462366        NaN        NaN  13.590203
     9             29440.0  1.906092e+11  236.986301   9.988889        NaN  47.500000        NaN        NaN   8.611111
     10            29440.0  1.906102e+11  199.397590  12.365591        NaN  39.849462        NaN        NaN   4.360812
     11            29440.0  1.906112e+11  199.166667  14.211111        NaN  32.533333        NaN        NaN   0.296296
     12            29440.0  1.906122e+11  198.636364  15.516129        NaN  24.645161        NaN        NaN  -4.086022
1907 1             29440.0  1.907012e+11  190.930233  16.344086        NaN  14.172043        NaN        NaN  -9.904421
     2             29440.0  1.907021e+11  223.466667  15.108434        NaN  22.036145        NaN        NaN  -5.535475
     3             29440.0  1.907032e+11  238.051948  10.763441        NaN  29.580645        NaN        NaN  -1.344086
     4             29440.0  1.907042e+11  201.038961  10.433333        NaN  36.111111        NaN        NaN   2.283951
     5             29440.0  1.907052e+11  244.823529  12.913978        NaN  44.688172        NaN        NaN   7.048984
     6             29440.0  1.907062e+11  202.674419  11.633333        NaN  57.755556        NaN        NaN  14.308642
     7             29440.0  1.907072e+11  215.238095  10.473118        NaN  62.462366        NaN        NaN  16.923536
     8             29440.0  1.907082e+11  221.139241   9.645161        NaN  54.752688        NaN        NaN  12.640382
     9             29440.0  1.907092e+11  258.953488  16.466667        NaN  48.533333        NaN        NaN   9.185185
     10            29440.0  1.907102e+11  192.682927  10.408602        NaN  47.268817        NaN        NaN   8.482676
     11            29440.0  1.907112e+11  204.222222  15.744444        NaN  33.155556        NaN        NaN   0.641975
     12            29440.0  1.907122e+11  166.052632  11.000000        NaN  10.413043        NaN        NaN -11.992754
1908 1             29440.0  1.908012e+11  221.547619  11.376344        NaN  20.688172        NaN        NaN  -6.284349
     2             29440.0  1.908022e+11  190.864198  10.546512        NaN  22.558140        NaN        NaN  -5.245478
     3             29440.0  1.908032e+11  169.315068   6.537634        NaN  24.182796        NaN        NaN  -4.342891
     4             29440.0  1.908042e+11  195.333333   6.222222        NaN  36.811111        NaN        NaN   2.672840
     5             29440.0  1.908052e+11  259.718310   6.709677        NaN  46.494624        NaN        NaN   8.052569
     6             29440.0  1.908062e+11  249.651163  20.011111        NaN  57.711111        NaN        NaN  14.283951
...                    ...           ...         ...        ...        ...        ...        ...        ...        ...
2017 5             29440.0  2.017052e+11  340.875980   6.872531  15.628492  47.137668  53.580645  40.387097   8.409816
     6             29440.0  2.017062e+11  379.561190   6.904539  17.014451  54.920019  60.566667  49.516667  12.733344
     7             29440.0  2.017072e+11  359.269933   6.318477  15.245136  58.780974  64.345455  53.581818  14.878319
     8             29440.0  2.017082e+11  331.270588   6.489526  15.627178  58.148165  62.229508  53.590164  14.526758
     9             29440.0  2.017092e+11  304.676641   5.399907  14.788660  50.127501  53.300000  46.600000  10.070834
     10            29440.0  2.017102e+11  277.448502   6.535569  16.598854  39.133514  41.918033  36.098361   3.963063
     11            29440.0  2.017112e+11  215.835272   7.165504  16.026846  35.523456  37.200000  33.368421   1.957475
     12            29440.0  2.017122e+11  237.338195   7.908435  15.927536  31.105002  33.016393  28.758065  -0.497221
2018 1             29440.0  2.018012e+11  269.553613   6.898274  17.351792  25.072142  26.816667  21.637931  -3.848810
     2             29440.0  2.018021e+11  224.710831   5.870871  14.937500  14.445549  18.196429  10.196429  -9.752473
     3             29440.0  2.018032e+11  292.265435   6.066968  15.392857  20.558371  25.491525  12.423729  -6.356461
     4             29440.0  2.018042e+11  350.141509   5.704087  15.216981  38.951887  44.175439  32.844828   3.862159
     5             29440.0  2.018052e+11  422.810552   5.387764  14.819095  58.325143  63.512195  48.658537  14.625079
     6             29440.0  2.018062e+11  445.827233   7.056325  18.480337  57.843109  63.754717  51.000000  14.357283
     7             29440.0  2.018072e+11  408.614487   5.607442  15.902778  69.251354  75.262295  63.688525  20.695197
     8             29440.0  2.018082e+11  396.655422   6.515248  16.764179  62.400182  67.833333  57.350000  16.888990
     9             29440.0  2.018092e+11  290.531856   7.879109  20.581481  54.055409  57.833333  49.648148  12.253005
     10            29440.0  2.018102e+11  256.174656   7.452370  17.902685  42.096950  45.229508  37.672131   5.609417
     11            29440.0  2.018111e+11  303.893491   6.950848  16.097473  36.585723  38.510638  34.085106   2.547624
     12            29440.0  2.018122e+11  281.960321   6.531746  14.872807  26.672736  27.901961  23.173077  -2.959591
2019 1             29440.0  2.019012e+11  279.460600   5.835289  17.300813  17.884529  21.474576  12.700000  -7.841929
     2             29440.0  2.019021e+11  252.858576   9.232639  17.908511  28.971230  32.672727  24.545455  -1.682650
     3             29440.0  2.019032e+11  281.606481   7.907805  17.462687  29.485388  32.870968  24.870968  -1.397007
     4             29440.0  2.019042e+11  309.035306   5.932188  15.868217  42.472030  48.810345  35.847458   5.817794
     5             29440.0  2.019052e+11  367.198506   7.738381  17.553719  47.965553  52.907407  43.462963   8.869752
     6             29440.0  2.019062e+11  370.992008   8.138490  17.251852  61.743400  67.316667  55.600000  16.524111
     7             29440.0  2.019072e+11  294.433641   5.785714  15.034722  61.569955  67.774194  55.903226  16.427753
     8             29440.0  2.019082e+11  320.335766   6.769447  15.751678  60.598649  65.935484  55.016129  15.888138
     9             29440.0  2.019092e+11  306.491058   6.363594  15.173285  49.958137  53.766667  45.350000   9.976743
     10            29440.0  2.019100e+11  239.577465  10.169014  17.470588  42.774648  48.500000  41.000000   5.985915
601 rows × 9 columns
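
If you only want the means of selected columns, and prefer to have the group keys back as regular columns rather than as an index, a minimal sketch would be:

# Compute means for selected columns only and turn the group keys back into columns
monthly_means = grouped[['DIR', 'SPEED', 'GUST', 'TEMP_F', 'TEMP_C']].mean().reset_index()
print(monthly_means.head())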

Detecting warm months

Now we have aggregated our data on a monthly level, and all we need to do is check which years had the warmest April temperatures. A simple approach is to select all Aprils from the data, group them, and check which group(s) have the highest mean value:

  • select all records that are from April (regardless of the year):
[43]:
aprils = data[data["MONTH"]==4]
  • take a subset of columns that might contain interesting information:
[44]:
aprils = aprils[['STATION_NUMBER','TEMP_F', 'TEMP_C','YEAR', 'MONTH']]
  • group by year and month:
[45]:
grouped = aprils.groupby(by=["YEAR", "MONTH"])
  • calculate mean for each group:
[46]:
monthly_mean = grouped.mean()
monthly_mean.head()
[46]:
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1906 4               29440  38.822222  3.790123
1907 4               29440  36.111111  2.283951
1908 4               29440  36.811111  2.672840
1909 4               29440  31.977778 -0.012346
1910 4               29440  39.833333  4.351852
  • check the highest temperature values (sort the DataFrame in descending order):
[47]:
monthly_mean.sort_values(by="TEMP_C", ascending=False).head(10)
[47]:
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2019 4               29440  42.472030  5.817794
1990 4               29440  41.918084  5.510047
1989 4               29440  41.369647  5.205360
2011 4               29440  41.290730  5.161517
2004 4               29440  41.249676  5.138709
2002 4               29440  41.132353  5.073529
1983 4               29440  41.016183  5.008991
2008 4               29440  40.962343  4.979079
2000 4               29440  40.777778  4.876543
1999 4               29440  40.695291  4.830717

How did April 2019 rank at the Tampere Pirkkala observation station 🌡️?
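
One way to answer that programmatically is to rank the April mean temperatures, as a quick sketch:

# Rank the Aprils from warmest (1) to coldest
april_ranks = monthly_mean['TEMP_C'].rank(ascending=False)
print(april_ranks.loc[(2019, 4)])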

Repeating the data analysis with a larger dataset

Finally, let’s repeat the data analysis steps above for all the available data we have (!). First, confirm the path to the folder where all the input data are located. The idea is that we will repeat the analysis process for each input file using a (rather long) for-loop! Here we have all the main analysis steps with some additional output info, all in one long code cell:

[48]:
# Read selected columns of data using a varying number of spaces as separator and specifying * characters as NoData values
data = pd.read_csv(fp, delim_whitespace=True, usecols=['USAF','YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN'], na_values=['*', '**', '***', '****', '*****', '******'])

# Rename the columns
new_names = {'USAF':'STATION_NUMBER','YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST', 'TEMP':'TEMP_F'}
data = data.rename(columns=new_names)

# Print info about the current input file:
print("STATION NUMBER:", data.at[0,"STATION_NUMBER"])
print("NUMBER OF OBSERVATIONS:", len(data))

# Create a new column for the converted temperatures
col_name = 'TEMP_C'
data[col_name] = None

# Convert temperatures from Fahrenheit to Celsius
data["TEMP_C"] = data["TEMP_F"].apply(fahr_to_celsius)

# Convert TIME to string
data['TIME_STR'] = data['TIME'].astype(str)

# Parse year and month from the time string (format YYYYMMDDHHMM)
data['MONTH'] = data['TIME_STR'].str.slice(start=4, stop=6).astype(int)
data['YEAR'] = data['TIME_STR'].str.slice(start=0, stop=4).astype(int)

# Extract observations for the month of April
aprils = data[data['MONTH']==4]

# Take a subset of columns
aprils = aprils[['STATION_NUMBER','TEMP_F', 'TEMP_C', 'YEAR', 'MONTH']]

# Group by year and month
grouped = aprils.groupby(by=["YEAR", "MONTH"])

# Get mean values for each group
monthly_mean = grouped.mean()

# Print info
print(monthly_mean.sort_values(by="TEMP_C", ascending=False).head(5))
print("\n")
STATION NUMBER: 29440
NUMBER OF OBSERVATIONS: 757983
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2019 4               29440  42.472030  5.817794
1990 4               29440  41.918084  5.510047
1989 4               29440  41.369647  5.205360
2011 4               29440  41.290730  5.161517
2004 4               29440  41.249676  5.138709


We will use the glob() function from the glob module to list our input files.

[49]:
import glob
[50]:
file_list = glob.glob(r'data/0*txt')

Note

Note that we’re using the * character as a wildcard, so any file that starts with data/0 and ends with txt will be added to the list of files we will iterate over. We specifically use data/0 as the starting part of the file names to avoid having our metadata files included in the list!

[51]:
print("Number of files in the list", len(file_list))
print(file_list)
Number of files in the list 15
['data\\028360.txt', 'data\\028690.txt', 'data\\028750.txt', 'data\\028970.txt', 'data\\029070.txt', 'data\\029110.txt', 'data\\029170.txt', 'data\\029350.txt', 'data\\029440.txt', 'data\\029500.txt', 'data\\029700.txt', 'data\\029720.txt', 'data\\029740.txt', 'data\\029810.txt', 'data\\029820.txt']
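A small caveat: glob.glob() returns the matching file names in arbitrary order (here they happen to print alphabetically). If you want the stations processed in a predictable order, one simple option, sketched below, is to wrap the call in sorted():

# Sort the file list so the stations are always processed in the same order
file_list = sorted(glob.glob(r'data/0*txt'))

This does not change the analysis results, only the order in which the per-station summaries are printed.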

Now you should have all the relevant file names in a list, and we can loop over the list using a for loop:

[52]:
for fp in file_list:
    print(fp)
data\028360.txt
data\028690.txt
data\028750.txt
data\028970.txt
data\029070.txt
data\029110.txt
data\029170.txt
data\029350.txt
data\029440.txt
data\029500.txt
data\029700.txt
data\029720.txt
data\029740.txt
data\029810.txt
data\029820.txt
[53]:
# Repeat the analysis steps for each input file:
for fp in file_list:

    # Read selected columns of the data using a varying number of spaces as the separator and specifying * characters as NoData values
    data = pd.read_csv(fp, delim_whitespace=True, usecols=['USAF','YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN'], na_values=['*', '**', '***', '****', '*****', '******'])

    # Rename the columns
    new_names = {'USAF':'STATION_NUMBER','YR--MODAHRMN': 'TIME', 'SPD': 'SPEED', 'GUS': 'GUST', 'TEMP':'TEMP_F'}
    data = data.rename(columns=new_names)

    # Print info about the current input file:
    print("STATION NUMBER:", data.at[0,"STATION_NUMBER"])
    print("NUMBER OF OBSERVATIONS:", len(data))

    # Create a new column for the converted temperatures
    col_name = 'TEMP_C'
    data[col_name] = None

    # Convert temperatures from Fahrenheit to Celsius
    data["TEMP_C"] = data["TEMP_F"].apply(fahr_to_celsius)

    # Convert TIME to string
    data['TIME_STR'] = data['TIME'].astype(str)

    # Parse year and month from the time string (format YYYYMMDDHHMM)
    data['MONTH'] = data['TIME_STR'].str.slice(start=4, stop=6).astype(int)
    data['YEAR'] = data['TIME_STR'].str.slice(start=0, stop=4).astype(int)

    # Extract observations for the month of April
    aprils = data[data['MONTH']==4]

    # Take a subset of columns
    aprils = aprils[['STATION_NUMBER','TEMP_F', 'TEMP_C', 'YEAR', 'MONTH']]

    # Group by year and month
    grouped = aprils.groupby(by=["YEAR", "MONTH"])

    # Get mean values for each group
    monthly_mean = grouped.mean()

    # Print info
    print(monthly_mean.sort_values(by="TEMP_C", ascending=False).head(5))
    print("\n")
STATION NUMBER: 28360
NUMBER OF OBSERVATIONS: 193825
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1937 4               28360  38.738095  3.743386
2011 4               28360  36.699571  2.610873
1921 4               28360  36.622222  2.567901
2002 4               28360  36.500000  2.500000
2019 4               28360  34.979138  1.655076


STATION NUMBER: 28690
NUMBER OF OBSERVATIONS: 542788
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2011 4               28690  35.430640  1.905911
2019 4               28690  35.215114  1.786174
2016 4               28690  35.031103  1.683946
1989 4               28690  34.612766  1.451537
2002 4               28690  34.279855  1.266586


STATION NUMBER: 28750
NUMBER OF OBSERVATIONS: 474562
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1989 4               28750  39.008403  3.893557
1983 4               28750  38.758475  3.754708
2019 4               28750  38.651599  3.695333
2002 4               28750  38.270419  3.483566
1994 4               28750  38.145833  3.414352


STATION NUMBER: 28970
NUMBER OF OBSERVATIONS: 555740
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1921 4               28970  41.688889  5.382716
1999 4               28970  39.073600  3.929778
2019 4               28970  38.706456  3.725809
1989 4               28970  38.362869  3.534927
2011 4               28970  38.094172  3.385651


STATION NUMBER: 29070
NUMBER OF OBSERVATIONS: 83567
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2014 4               29070  35.437326  1.909626
2015 4               29070  34.437209  1.354005
2004 4               29070  34.347032  1.303907
2016 4               29070  34.303199  1.279555
2008 4               29070  34.241667  1.245370


STATION NUMBER: 29110
NUMBER OF OBSERVATIONS: 483784
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1921 4               29110  42.166667  5.648148
2004 4               29110  41.682699  5.379277
1989 4               29110  41.420168  5.233427
1937 4               29110  40.671429  4.817460
2019 4               29110  40.636300  4.797945


STATION NUMBER: 29170
NUMBER OF OBSERVATIONS: 561097
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1937 4               29170  43.289157  6.271754
2019 4               29170  40.666820  4.814900
2011 4               29170  40.015962  4.453312
2001 4               29170  39.713228  4.285126
1906 4               29170  39.688889  4.271605


STATION NUMBER: 29350
NUMBER OF OBSERVATIONS: 559667
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1921 4               29350  45.144444  7.302469
1925 4               29350  40.777778  4.876543
2011 4               29350  40.670108  4.816727
2019 4               29350  40.585002  4.769446
2001 4               29350  39.662827  4.257126


STATION NUMBER: 29440
NUMBER OF OBSERVATIONS: 757983
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2019 4               29440  42.472030  5.817794
1990 4               29440  41.918084  5.510047
1989 4               29440  41.369647  5.205360
2011 4               29440  41.290730  5.161517
2004 4               29440  41.249676  5.138709


STATION NUMBER: 29500
NUMBER OF OBSERVATIONS: 103105
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2019 4               29500  41.639777  5.355432
2008 4               29500  40.838936  4.910520
2014 4               29500  40.226415  4.570231
2016 4               29500  39.176634  3.987019
2011 4               29500  38.647826  3.693237


STATION NUMBER: 29700
NUMBER OF OBSERVATIONS: 473881
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1921 4               29700  42.811111  6.006173
2000 4               29700  42.375587  5.764215
1990 4               29700  42.054167  5.585648
2019 4               29700  41.548747  5.304859
2004 4               29700  41.493392  5.274107


STATION NUMBER: 29720
NUMBER OF OBSERVATIONS: 843688
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2019 4               29720  43.558414  6.421341
1990 4               29720  43.313576  6.285320
2000 4               29720  42.663169  5.923983
2008 4               29720  42.349642  5.749801
2004 4               29720  41.903492  5.501940


STATION NUMBER: 29740
NUMBER OF OBSERVATIONS: 931767
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2000 4               29740  43.479793  6.377663
2019 4               29740  43.464070  6.368928
1990 4               29740  43.375078  6.319488
2008 4               29740  43.341429  6.300794
2011 4               29740  42.750702  5.972612


STATION NUMBER: 29810
NUMBER OF OBSERVATIONS: 199330
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
1990 4               29810  41.157895  5.087719
2019 4               29810  40.783032  4.879462
2014 4               29810  40.058036  4.476687
2008 4               29810  40.044881  4.469378
2016 4               29810  39.270308  4.039060


STATION NUMBER: 29820
NUMBER OF OBSERVATIONS: 198334
            STATION_NUMBER     TEMP_F    TEMP_C
YEAR MONTH
2019 4               29820  41.182197  5.101221
1990 4               29820  41.144681  5.080378
2014 4               29820  40.497908  4.721060
2008 4               29820  39.941423  4.411901
1913 4               29820  39.622222  4.234568


How about now: how did April 2019 rank across the different stations?
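Scanning fifteen separate printouts is tedious, so if you would like a single overview table, one option is to collect each station's April 2019 rank into a list of dictionaries and build a data frame from it at the end. The sketch below is one possible approach (not part of the original lesson); it reuses file_list, pd, and fahr_to_celsius from above, and the names results, means, and summary are ours:

results = []

for fp in file_list:
    # Read only the columns needed for the ranking
    data = pd.read_csv(fp, delim_whitespace=True,
                       usecols=['USAF', 'YR--MODAHRMN', 'TEMP'],
                       na_values=['*', '**', '***', '****', '*****', '******'])
    data = data.rename(columns={'USAF': 'STATION_NUMBER',
                                'YR--MODAHRMN': 'TIME',
                                'TEMP': 'TEMP_F'})

    # Parse year and month from the time string and convert the temperatures
    time_str = data['TIME'].astype(str)
    data['YEAR'] = time_str.str.slice(start=0, stop=4).astype(int)
    data['MONTH'] = time_str.str.slice(start=4, stop=6).astype(int)
    data['TEMP_C'] = data['TEMP_F'].apply(fahr_to_celsius)

    # Mean April temperature per year, warmest first
    aprils = data[data['MONTH'] == 4]
    means = aprils.groupby('YEAR')['TEMP_C'].mean().sort_values(ascending=False)

    # Position of 2019 in the sorted ranking (1-based); this assumes every
    # station has April 2019 observations, as in the printouts above
    rank_2019 = means.index.get_loc(2019) + 1
    results.append({'STATION_NUMBER': data.at[0, 'STATION_NUMBER'],
                    'APRIL_2019_RANK': rank_2019,
                    'YEARS_ON_RECORD': len(means)})

summary = pd.DataFrame(results).sort_values(by='APRIL_2019_RANK')
print(summary)

Sorting by rank puts the stations where April 2019 was the warmest on record at the top of the table, which makes the station-by-station comparison much easier to read.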