Hints for Exercise 7

Converting a column of date-strings into datetime format

In some cases Pandas cannot automatically parse the date information in a column when you read the data with the read_csv() function. In these cases, you need to parse the dates afterwards.

Let’s look at an example with the following data:

DATE, Value
"201401", 1
"201402", 2
"201403", 3
"201404", 5

Let’s try to read the data and parse the date.

In [1]: data = pd.read_csv(fp, sep=',', parse_dates=['DATE'])

In [2]: data.head()

In [3]: data.dtypes

Checking data.dtypes after this shows that Pandas was not able to convert the datatype of DATE into datetime format. This is because the DATE values contain only the year and month, not the day of the month.

We can tell Pandas how to read the DATE information with the pd.to_datetime() function, whose format parameter specifies the custom format in which the dates are represented in the data. Consider the following example:

In [4]: data['datetime'] =  pd.to_datetime(data['DATE'], format='%Y%m')

In [5]: data.head()

In [6]: data.dtypes

Great, now we have the data in datetime format!
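
Because the example file itself is not included here, the conversion above can also be tried with a small self-contained sketch. It assumes the example data is held in a string (csv_text, a name made up for this sketch) rather than in a file, and it reads DATE as text so that the format string can be applied to it:

```python
import io

import pandas as pd

# The example data from above, stored in a string instead of a file (assumption)
csv_text = '''DATE, Value
"201401", 1
"201402", 2
"201403", 3
"201404", 5
'''

# skipinitialspace=True drops the space after each comma;
# dtype={'DATE': str} keeps the dates as text instead of integers
data = pd.read_csv(io.StringIO(csv_text), sep=',', skipinitialspace=True,
                   dtype={'DATE': str})

# Parse the year-month strings; the day defaults to the first of the month
data['datetime'] = pd.to_datetime(data['DATE'], format='%Y%m')

print(data.head())
print(data.dtypes)
```

The printed dtypes should list the new datetime column with a datetime64 type, while DATE itself stays as text.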

Creating an empty DataFrame with a datetime index

For Problem 2 in this exercise you are asked to calculate average seasonal temperatures for each year in our data file. The easiest way to do this is to create an empty DataFrame to store the seasonal temperatures, with one temperature for each year and season. Thus, the DataFrame should have a column for each season and the date as an index. In order to do this, we first need to create a variable to store the dates for the index, and then create the DataFrame using that index.

Let’s consider an example for my world, where there are two seasons: coldSeason and warmSeason. For each season, I want to list the number of times I wore a jacket, with data from the past 4 years. I can start by making a variable with one date for each of the past 4 years using the Pandas pd.date_range() function.

In [7]: timeIndex = pd.date_range('2014', '2017', freq='AS')

In [8]: print(timeIndex)
DatetimeIndex(['2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01'], dtype='datetime64[ns]', freq='AS-JAN')

As you can see, we now have a variable timeIndex in the Pandas datetime format with dates for January 1 of the past 4 years. The starting and ending years are clear, and freq='AS' indicates the frequency of dates between the listed starting and ending times. In this case, AS refers to annual values (one per year) at the start of the year.

With the timeIndex variable, we can now create our empty DataFrame to store the seasonal jacket numbers using the Pandas pd.DataFrame() function.

In [9]: seasonData = pd.DataFrame(index=timeIndex, columns=['coldSeason', 'warmSeason'])

In [10]: print(seasonData)
           coldSeason warmSeason
2014-01-01        NaN        NaN
2015-01-01        NaN        NaN
2016-01-01        NaN        NaN
2017-01-01        NaN        NaN

Now we have our empty DataFrame where I can fill in the number of times I needed a jacket in each season using the date index!

Slicing up the seasons

The other main task in Problem 2 is to group values from the different months into seasonal average values. There are several ways to do this, but one nice way is to use a for loop over each year of data and fill in the seasonal values for that year. For each year and season, you want to identify the slice of dates that corresponds to that season, calculate their mean, and then store the result in the corresponding location in the new DataFrame created in the previous hint.

For the for loop itself, it may be easiest to start with the second full year of data (1953), since we do not have temperatures for December of 1951. If you loop over the years 1953-2016, you can then easily calculate the average temperature for each season. For the winter, you can use year - 1 to find the temperature for the preceding December, assuming year is your variable for the current year in your for loop. This approach can also be used in Problems 3 and 4.

In this week’s lesson we saw how to select a range of dates, but we did not cover how to take the mean value of the slice and store it. Because a slice of a DataFrame is still a DataFrame object, we can simply select a column from it and use the .mean() method to calculate the mean of that slice.

meanValue = dataFrame['2016-12':'2017-02']['TEMP'].mean()

This would assign the mean value of the TEMP field between December 2016 and February 2017 to the variable meanValue. To store the output value, we can use the DataFrame.loc indexer, which is used with square brackets rather than called like a function. For example:

dataFrame.loc[year, 'coldSeason'] = 5

This would store the value 5 in the column coldSeason at index year of dataFrame. That’s a tricky sentence, but hopefully the idea is clear :).
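
Putting the slicing, the mean and the .loc assignment together, the loop could look roughly like the sketch below. The monthly TEMP values here are made-up numbers, only the winter season is handled, and the choice to use pd.Timestamp(year, 1, 1) as the row label is an assumption made for this sketch so that the label matches the datetime index exactly:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data: one made-up TEMP value per month, 2014-2017
months = pd.date_range('2014-01-01', '2017-12-01', freq='MS')
dataFrame = pd.DataFrame({'TEMP': np.arange(len(months), dtype=float)},
                         index=months)

# Empty result table with one row per year;
# 'YS' (year start) is an alias with the same meaning as 'AS'
timeIndex = pd.date_range('2015', '2017', freq='YS')
seasonData = pd.DataFrame(index=timeIndex, columns=['winter'], dtype=float)

for year in range(2015, 2018):
    # Winter runs from December of the previous year to February of this year
    start = str(year - 1) + '-12'
    end = str(year) + '-02'
    winterMean = dataFrame.loc[start:end, 'TEMP'].mean()

    # Store the result in the row for January 1 of the current year
    seasonData.loc[pd.Timestamp(year, 1, 1), 'winter'] = winterMean

print(seasonData)
```

The loop starts from 2015 rather than 2014 for the same reason as in the exercise: the made-up data has no December value before the first year.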

Labels and legends

In the plot for Problem 2 you’re asked to include a line legend for each subplot. To do this, you need to do two things:

  1. You need to add a label value when you create the plot using the plt.plot() function. This is as easy as adding a parameter that says label='some text' when you call plt.plot().
  2. You’ll need to display the line legend, which can be done by calling plt.legend() for each subplot.
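
The two steps above can be sketched as follows. The jacket counts are made-up values and the two-row subplot layout is only an assumption for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical data: jacket days per season for four years
years = [2014, 2015, 2016, 2017]
cold = [40, 35, 42, 38]
warm = [5, 8, 3, 6]

# First subplot: label the line, then draw the legend for this subplot
ax1 = plt.subplot(2, 1, 1)
plt.plot(years, cold, label='coldSeason')
plt.legend()

# Second subplot: the same two steps again
ax2 = plt.subplot(2, 1, 2)
plt.plot(years, warm, label='warmSeason')
plt.legend()
```

Note that plt.legend() only affects the subplot that is currently active, which is why it is called once after each plt.plot().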

Saving multiple plots into a directory

In Problems 3 and 4 the aim is to create 65 individual plots and save them onto your computer. In this kind of situation, the smartest thing to do is to use a for loop and, at the end of each iteration, save the image into a folder that you have specified. There are some useful tricks related to saving files and generating good file names automatically.

A good approach when saving multiple files into a folder is to define a separate variable where you store only the directory path. Then, in every iteration of the loop, you combine this directory path and the file name. This can be done with the os.path.join() function, which is part of the built-in Python module os.

Consider the following example:

In [11]: import os

In [12]: myfolder = r"C:\MyUserName\Temp_visualizations"

In [13]: for i in range(5):
   ....:     filename = "My_File_" + str(i) + ".png"
   ....:     filepath = os.path.join(myfolder, filename)
   ....:     print(filepath)
   ....: 
C:\MyUserName\Temp_visualizations/My_File_0.png
C:\MyUserName\Temp_visualizations/My_File_1.png
C:\MyUserName\Temp_visualizations/My_File_2.png
C:\MyUserName\Temp_visualizations/My_File_3.png
C:\MyUserName\Temp_visualizations/My_File_4.png

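Here, we created a folder path and a unique filename, and then combined them into a full file path that could be used to save a plot into that location on your computer. Note that os.path.join() inserts the path separator of the operating system the code runs on: the output above uses /, whereas on Windows the separator would be \.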
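
To connect this to plotting, here is a minimal sketch of saving one figure per loop iteration with plt.savefig(). The temporary output folder, the file name pattern and the plotted numbers are all assumptions made for this sketch; in the exercise you would use your own folder and data:

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Stand-in for your own output folder (a temporary directory, for illustration)
outputFolder = tempfile.mkdtemp()

savedPaths = []
for year in range(2014, 2017):
    # Draw a new figure with made-up values for this year
    plt.figure()
    plt.plot([1, 2, 3], [year, year + 1, year + 2])
    plt.title("Year " + str(year))

    # Build the full file path and save the figure there
    filename = "My_Plot_" + str(year) + ".png"
    filepath = os.path.join(outputFolder, filename)
    plt.savefig(filepath)

    # Close the figure so that figures do not pile up in memory
    plt.close()
    savedPaths.append(filepath)

print(savedPaths)
```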

Creating an animation from multiple images

In Problems 3 and 4 the aim was to save multiple images into a predefined folder. An optional task was to create an animation out of those figures. Animating the figures in Problems 3 and 4 is a fairly straightforward task in Python. All you need to do is install a module called imageio and run the couple of lines of code shown below.

But first, you need to install the imageio module.

This can be done by running the following command from the command prompt / terminal with admin rights:

$ conda install -c conda-forge imageio

Note

If everything works fine, you should not see any errors on the screen. If you do receive an error, the most typical cause is that you did not have admin rights when trying to install the module. In that case, you should open the command prompt with admin rights (Command prompt -> right click -> Run as administrator).

When you have imageio installed, you should be able to import it in Spyder:

In [14]: import imageio

Creating the animation

The following commands should produce a nice GIF animation out of your plots. The idea is to list all the files in the folder where you saved the plots using the glob function, read those files, and then pass the images to the imageio function imageio.mimsave(). The following example shows how to do that.

First we use glob to list all the files in the folder that have the .png file format. The * wildcard character tells the computer that the name of the file can be anything, and the .png after the star means that the filename should end with .png. Files with any other file format are excluded. Finally, we write the animation to disk.

import glob
import imageio

# Find all files in the given folder that have the .png file format
search_criteria = r"C:\MyUserName\Temp_visualizations\*.png"

# Execute the glob function that returns a list of filepaths.
# glob does not guarantee any order, so sort the paths to keep the frames in sequence
figure_paths = sorted(glob.glob(search_criteria))

# Save the animation to disk with a duration of 0.48 seconds per frame
output_gif_path = r"C:\MyUserName\Temp_animation.gif"
imageio.mimsave(output_gif_path, [imageio.imread(fp) for fp in figure_paths], duration=0.48, subrectangles=True)

With these lines of code you should be able to create a nice animation out of your plots!
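
If you want to check how the * wildcard filters files before building the animation, the following sketch may help. It uses only the standard library and creates a temporary folder with empty placeholder files whose names are invented for this example:

```python
import glob
import os
import tempfile

# Temporary folder with two .png placeholders and one other file (all made up)
folder = tempfile.mkdtemp()
for name in ["plot_1953.png", "plot_1952.png", "notes.txt"]:
    with open(os.path.join(folder, name), "w") as f:
        f.write("")

# Only names ending with .png match; sorting gives a predictable frame order
search_criteria = os.path.join(folder, "*.png")
figure_paths = sorted(glob.glob(search_criteria))

file_names = [os.path.basename(p) for p in figure_paths]
print(file_names)
```

Sorting works well here because the years in the file names all have four digits; file names with numbers of different lengths would need zero padding to sort in numeric order.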