Hints for Exercise 7

Converting a column of date-strings into datetime format

In some cases Pandas cannot automatically parse the date information in a column when you read the data with the read_csv() function. In these cases, you need to parse the date information afterwards.

Let’s see an example with the following data:

DATE, Value
"201401", 1
"201402", 2
"201403", 3
"201404", 5

Let’s try to read the data and parse the date.

In [1]: data = pd.read_csv(fp, sep=',', parse_dates=['DATE'])

In [2]: data.head()
Out[2]: 
     DATE   Value
0  201401       1
1  201402       2
2  201403       3
3  201404       5

In [3]: data.dtypes
Out[3]: 
DATE      object
 Value     int64
dtype: object

As we can see above, Pandas was not able to convert the DATE column into datetime format. This is because the DATE values do not include the day of the month, only the year and the month.

We can tell Pandas to read the DATE information with a custom format, such as the one we have here, using the pd.to_datetime() function. Its format parameter specifies how the dates are represented in the data. Consider the following example:

In [4]: data['datetime'] = pd.to_datetime(data['DATE'], format='%Y%m')

In [5]: data.head()
Out[5]: 
     DATE   Value   datetime
0  201401       1 2014-01-01
1  201402       2 2014-02-01
2  201403       3 2014-03-01
3  201404       5 2014-04-01

In [6]: data.dtypes
Out[6]: 
DATE                object
 Value               int64
datetime    datetime64[ns]
dtype: object

Great, now we have the data in datetime format!
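
The session above can also be tried without a data file. The following is a minimal sketch, assuming the example data is held in a string (csv_text is a hypothetical stand-in for the file at fp) and that DATE is read as text before it is parsed:

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the example data (stands in for the file at fp)
csv_text = '''DATE, Value
"201401", 1
"201402", 2
"201403", 3
"201404", 5
'''

# Read DATE as text so the values are kept exactly as written;
# skipinitialspace drops the blank that follows each comma
data = pd.read_csv(io.StringIO(csv_text), dtype={'DATE': str}, skipinitialspace=True)

# Parse the year-month strings; the day defaults to the first of the month
data['datetime'] = pd.to_datetime(data['DATE'], format='%Y%m')

print(data.dtypes)
```

Reading DATE as text first is a choice made for this sketch; it keeps the parsing step explicit rather than relying on what read_csv() manages to infer.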

Creating an empty DataFrame with a datetime index

For Problem 2 in this exercise you are asked to calculate average seasonal temperatures for each year in our data file. The easiest way to do this is to create an empty DataFrame to store the seasonal temperatures, with one temperature for each year and season. Thus, the DataFrame should have a column for each season and the date as the index. To do this, we first need to create a variable to store the dates for the index, and then create the DataFrame using that index.

Let’s consider an example for my world, where there are two seasons: coldSeason and warmSeason. For each season, I want to list the number of times I wore a jacket, with data from the past 4 years. I can start by making a variable with 1 date for each of the past 4 years using the Pandas pd.date_range() function.

In [7]: timeIndex = pd.date_range('2014', '2017', freq='AS')

In [8]: print(timeIndex)
DatetimeIndex(['2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01'], dtype='datetime64[ns]', freq='AS-JAN')

As you can see, we now have a variable timeIndex in the Pandas datetime format with dates for January 1 of the past 4 years. The starting and ending years are clear, and freq='AS' indicates the frequency of dates between the listed starting and ending times. In this case, AS refers to annual values (1 time per year) at the start of the year. Note that newer Pandas releases (2.2 and later) deprecate the 'AS' alias in favor of 'YS', which has the same meaning.

With the timeIndex variable, we can now create our empty DataFrame to store the seasonal jacket numbers using the Pandas pd.DataFrame() function.

In [9]: seasonData = pd.DataFrame(index=timeIndex, columns=['coldSeason', 'warmSeason'])

In [10]: print(seasonData)
           coldSeason warmSeason
2014-01-01        NaN        NaN
2015-01-01        NaN        NaN
2016-01-01        NaN        NaN
2017-01-01        NaN        NaN

Now we have our empty DataFrame where I can fill in the number of times I needed a jacket in each season using the date index!
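
Filling in the empty DataFrame could look roughly like the sketch below. The jacket counts are made up for illustration, the columns are created as floats (dtype=float) for convenience, and 'YS' is used as the equivalent of 'AS' so that the sketch also runs on recent Pandas versions:

```python
import pandas as pd

# One date per year; 'YS' (year start) is used here as the equivalent of 'AS'
timeIndex = pd.date_range('2014', '2017', freq='YS')
seasonData = pd.DataFrame(index=timeIndex, columns=['coldSeason', 'warmSeason'], dtype=float)

# Made-up jacket counts per year: (coldSeason, warmSeason)
jacketCounts = {2014: (80, 12), 2015: (75, 9), 2016: (90, 15), 2017: (85, 11)}

for year, (cold, warm) in jacketCounts.items():
    # Build the label that matches the index entry for this year (January 1)
    rowLabel = pd.Timestamp(year=year, month=1, day=1)
    seasonData.loc[rowLabel, 'coldSeason'] = cold
    seasonData.loc[rowLabel, 'warmSeason'] = warm

print(seasonData)
```

Building the row label as a Timestamp for January 1 mirrors how the index was created, so each assignment lands in an existing row instead of adding a new one.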

Slicing up the seasons

The other main task in Problem 2 is to group the values from the different months into seasonal average values. There are several ways to do this, but one nice way is to use a for loop over each year of data and fill in the seasonal values for that year. For each year and season, you want to identify the slice of dates that corresponds to that season, calculate its mean, and then store the result in the corresponding location in the new DataFrame created in the previous hint.

For the for loop itself, it may be easiest to start with the second full year of data (1953), since we do not have temperatures for December of 1951. If you loop over the years from 1953 to 2016, you can then easily calculate the average temperature for each season. For the winter, you can use year - 1 to find the temperature for December, assuming year is your variable for the current year in your for loop. This approach can also be used in Problems 3 and 4.

In this week’s lesson we saw how to select a range of dates, but we did not cover how to take the mean value of the slice and store it. Because a slice of a DataFrame is still a DataFrame object, we can simply use the .mean() method to calculate the mean of that slice.

meanValue = dataFrame['2016-12':'2017-02']['TEMP'].mean()

This would assign the mean value of the TEMP field between December 2016 and February 2017 to the variable meanValue. To store the output value, we can use the DataFrame.loc[] indexer. Note the square brackets: .loc is not called like a function. For example:

dataFrame.loc[year, 'coldSeason'] = 5

This would store the value 5 in the column coldSeason at index year of dataFrame. That’s a tricky sentence, but hopefully the idea is clear :).
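
Putting the slicing, the mean and the .loc[] assignment together, the loop could be sketched as follows. The data here is synthetic (TEMP is simply the month number for 2014-2017, not the exercise data), and the column names winter and summer are assumptions made for this illustration:

```python
import pandas as pd

# Synthetic monthly data (assumption): TEMP is simply the month number, 2014-2017
months = pd.date_range('2014-01-01', '2017-12-01', freq='MS')
dataFrame = pd.DataFrame({'TEMP': [float(m) for m in months.month]}, index=months)

# Empty DataFrame for the results: only years that have a preceding December
years = range(2015, 2018)
timeIndex = pd.to_datetime([f'{year}-01-01' for year in years])
seasonData = pd.DataFrame(index=timeIndex, columns=['winter', 'summer'], dtype=float)

for year in years:
    rowLabel = pd.Timestamp(year=year, month=1, day=1)
    # Winter: December of the previous year through February of this year
    winter = dataFrame.loc[f'{year - 1}-12':f'{year}-02', 'TEMP'].mean()
    # Summer: June through August of this year
    summer = dataFrame.loc[f'{year}-06':f'{year}-08', 'TEMP'].mean()
    seasonData.loc[rowLabel, 'winter'] = winter
    seasonData.loc[rowLabel, 'summer'] = summer

print(seasonData)
```

With this synthetic data every winter averages to 5.0 (months 12, 1 and 2) and every summer to 7.0 (months 6, 7 and 8), which makes it easy to check that the slices cover the intended months.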

Labels and legends

In the plot for Problem 2 you’re asked to include a line legend for each subplot. To do this, you need to do two things:

  1. You need to add a label value when you create the plot using the plt.plot() function. This is as easy as adding a parameter that says label='some text' when you call plt.plot().
  2. You’ll need to display the line legend, which can be done by calling plt.legend() for each subplot.
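
The two steps above can be sketched as follows. The temperature values are made up, and the Agg backend is selected only so that the sketch also runs on a machine without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for this sketch
import matplotlib.pyplot as plt

years = [2014, 2015, 2016, 2017]
winterTemps = [-4.1, -2.7, -5.3, -3.0]  # made-up values
summerTemps = [16.2, 17.0, 15.8, 16.9]  # made-up values

fig = plt.figure()

# First subplot: give the line a label, then show the legend for this subplot
plt.subplot(2, 1, 1)
plt.plot(years, winterTemps, label='Winter')
plt.legend()

# Second subplot: same two steps
plt.subplot(2, 1, 2)
plt.plot(years, summerTemps, label='Summer')
plt.legend()

# Collect the legend text of each subplot to confirm both legends exist
legendTexts = [ax.get_legend().get_texts()[0].get_text() for ax in fig.axes]
print(legendTexts)
```

Because plt.legend() acts on the currently active subplot, it has to be called once after each plt.subplot() call rather than once at the end.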

Saving multiple plots into a directory

In Problems 3 and 4 the aim is to create 65 individual plots and save them on your computer. In this kind of situation, the smartest thing to do is to use a for loop and, at the end of each iteration, save the image into a folder that you have specified. There are some useful tricks related to saving files and generating good file names automatically.

A good approach when saving multiple files into a folder is to define a separate variable that stores only the directory path. Then, in every iteration, you combine this directory path with the file name. This can be done with the os.path.join() function, which is part of Python's built-in os module.

Consider the following example:

In [11]: import os

In [12]: myfolder = r"C:\MyUserName\Temp_visualizations"

In [13]: for i in range(5):
   ....:     filename = "My_File_" + str(i) + ".png"
   ....:     filepath = os.path.join(myfolder, filename)
   ....:     print(filepath)
   ....: 
C:\MyUserName\Temp_visualizations\My_File_0.png
C:\MyUserName\Temp_visualizations\My_File_1.png
C:\MyUserName\Temp_visualizations\My_File_2.png
C:\MyUserName\Temp_visualizations\My_File_3.png
C:\MyUserName\Temp_visualizations\My_File_4.png

Here, we created a folder path and a unique filename, and then combined them into a full filepath that can be used to save a plot into that location on your computer.
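
Combined with plotting, the same pattern could look like the sketch below. The output folder (a Temp_visualizations folder inside the system temp directory), the plotted values and the zero-padded file names are all assumptions made for illustration:

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for this sketch
import matplotlib.pyplot as plt

# Hypothetical output folder inside the system temp directory
myfolder = os.path.join(tempfile.gettempdir(), 'Temp_visualizations')
os.makedirs(myfolder, exist_ok=True)

savedPaths = []
for i in range(3):
    plt.figure()
    plt.plot([0, 1, 2], [i, i + 1, i + 2], label='Plot ' + str(i))
    plt.legend()

    # A zero-padded number keeps the files in order when they are sorted by name
    filename = 'My_File_' + str(i).zfill(3) + '.png'
    filepath = os.path.join(myfolder, filename)
    plt.savefig(filepath)
    plt.close()  # release the figure before the next iteration
    savedPaths.append(filepath)

print(savedPaths)
```

Closing each figure after saving it matters when the loop produces dozens of plots, since every open figure stays in memory until it is closed.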

Creating an animation from multiple images

In Problems 3 and 4 the aim was to save multiple images into a predefined folder. An optional task was to create an animation out of those figures. Animating the figures in Problems 3 and 4 is a fairly straightforward task in Python. All you need to do is install a module called imageio and run a couple of lines of code, shown below.

First, however, you need to install the imageio module.

Installing the module can be done by running the following command from the command prompt / terminal with admin rights:

$ conda install -c conda-forge imageio

Note

If everything works, you should not see any errors on the screen. If you do receive an error, the most typical cause is that you did not have admin rights when trying to install the module. In that case, open the command prompt with admin rights (Command prompt –> right click –> Run as administrator).

When you have imageio installed, you should be able to import it in Spyder:

In [14]: import imageio

Creating the animation

The following commands should produce a nice GIF animation out of your plots. The idea is to list all the files in the folder where you saved the plots using the glob function, and then pass that file list to the imageio.mimsave() function. The following example shows how to do that.

First we list all the files in the folder that have the .png file format using glob. The * wildcard character tells the computer that the name of the file can be anything. The .png after the star means that the filename must end with .png, so files in any other format are excluded. Finally, we write the animation to disk.

import glob
import imageio

# Find all files in the given folder that have the .png file format
search_criteria = r"C:\MyUserName\Temp_visualizations\*.png"

# Execute the glob function that returns a list of filepaths.
# glob does not guarantee any order, so sort the list to keep the frames in sequence.
figure_paths = sorted(glob.glob(search_criteria))

# Save the animation to disk with 0.48-second frame durations
output_gif_path = r"C:\MyUserName\Temp_animation.gif"
imageio.mimsave(output_gif_path, [imageio.imread(fp) for fp in figure_paths], duration=0.48, subrectangles=True)

With these lines of code you should be able to create a nice animation out of your plots!
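
The file-listing step can be tried on its own, without imageio. The sketch below assumes a temporary folder with a few empty stand-in files (all names are hypothetical) and shows that the *.png pattern skips other file types:

```python
import glob
import os
import tempfile

# Hypothetical folder with a few empty stand-in files
folder = tempfile.mkdtemp()
for name in ['plot_002.png', 'plot_000.png', 'plot_001.png', 'notes.txt']:
    open(os.path.join(folder, name), 'w').close()

# The * matches any file name; only names ending in .png are returned
search_criteria = os.path.join(folder, '*.png')

# glob gives no ordering guarantee, so sort the paths by name
figure_paths = sorted(glob.glob(search_criteria))

file_names = [os.path.basename(fp) for fp in figure_paths]
print(file_names)
```

This prints ['plot_000.png', 'plot_001.png', 'plot_002.png']: the text file is excluded and the frames are in name order, which is the order an animation would play them in.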