Reading data from a file¶
The instructions here are for reading files in the traditional Python way. We do not recommend doing this, as it is more challenging and often more difficult to understand. That said, the info is here if you’re curious. We’re assuming below that you have downloaded a copy of the Kumpula weather data file from the Exploring data using Pandas lesson. If not, do that before proceeding.
Reading an entire file at once¶
There are several ways in which data can be read from a file in Python, but some are more common (or better in some cases) than others. Here we will focus on a common way to read files that can be easily used for many types of data files.
We will begin by reading an entire file into a Python list. To start we need to open the file for reading by typing the following into the Spyder editor:
with open('Kumpula-June-2016-w-metadata.txt', 'r') as infile: <commands to read file...>
Here we are using the
open()
function in combination with thewith
statement in Python. I suppose an explanation is required.- The general format used for opening files in Python is
open(<filename>, <mode>)
, wherefilename
is the name of the file andmode
is either"r"
for reading a file or"w"
for writing to a file. In our case, we open our file ('Kumpula-June-2016-w-metadata.txt'
) to be read ('r'
). - In addition, we are using the
with
statement. What this does is open our file and assign access to the file to a variable (infile
). Thus, using the variableinfile
we can access the file contents anywhere within the indented block of code beneath thewith
statement. For instance, we will see how to read the file in the next point. - The main advantage of using the
with
statement is that normally you need to manually close file access in Python (using thefile.close()
method), but when using thewith
statement the file is automatically closed at the end of the indented block. This ensures you don’t forget to close it yourself. Closing files is important because sometimes the final changes made to a file will not be written until the file is closed, for example.
- The general format used for opening files in Python is
With our file open, we can now proceed to read the file.
#!/usr/bin/env python3 '''Reads the contents of a file at once. Usage: ./readall.py Author: David Whipp - 2.10.2017 ''' # Read entire data file with open('Kumpula-June-2016-w-metadata.txt', 'r') as infile: data = infile.read()
So what we have done here is to use the
file.read()
method to read in the entire file as one long character string. What does that mean? Well, this means that we now have a variabledata
that contains the entire contents of our data file (Kumpula-June-2016-w-metadata.txt
). Thus, if we save the script above asreadall.py
and run it in Spyder, we should see the following output to the IPython console when we print out the contents ofdata
:In [1]: print(data) # Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki # for June 1-30, 2016 # Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND # Data processing: Extracted temperatures from raw data file, converted to # comma-separated format # # David Whipp - 02.10.2017 YEARMODA,TEMP,MAX,MIN 20160601,65.5,73.6,54.7 20160602,65.8,80.8,55.0 20160603,68.4,77.9,55.6 20160604,57.5,70.9,47.3 20160605,51.4,58.3,43.2 20160606,52.2,59.7,42.8 20160607,56.9,65.1,45.9 20160608,54.2,60.4,47.5 20160609,49.4,54.1,45.7 20160610,49.5,55.9,43.0 20160611,54.0,62.1,41.7 20160612,55.4,64.2,46.0 20160613,58.3,68.2,47.3 20160614,59.7,67.8,47.8 20160615,63.4,70.3,49.3 20160616,57.8,67.5,55.6 20160617,60.4,70.7,55.9 20160618,57.3,62.8,54.0 20160619,56.3,59.2,54.1 20160620,59.3,69.1,52.2 20160621,62.6,71.4,50.4 20160622,61.7,70.2,55.4 20160623,60.9,67.1,54.9 20160624,61.1,68.9,56.7 20160625,65.7,75.4,57.9 20160626,69.6,77.7,60.3 20160627,60.7,70.0,57.6 20160628,65.4,73.0,55.8 20160629,65.8,73.2,59.7 20160630,65.7,72.7,59.2
No surprises here, this looks like the contents of the
Kumpula-June-2016-w-metadata.txt
data file. If you want to confirm, you’re welcome to open that file in the Spyder editor. Note that you may have to set Files of type to be “All files (*)” in the Open file window to see the data files.As mentioned,
file.read()
is a method for file objects that reads all data in as a single (potentially very long) character string. You can confirm this using thetype()
function.In [2]: type(data) Out[2]: str
Obviously, it is nice to read the entire file at once, but this may be a problem for very large data files that may not fit in memory on the computer.
To convert our character string
data
into a more usable format in which each line is a separate value in a Python list, we can use thestr.splitlines()
method. Thus, we can create a listdatalist
that contains each line of the file as follows:#!/usr/bin/env python3 '''Reads the contents of a file at once. Usage: ./readall.py Author: David Whipp - 2.10.2017 ''' # Read entire data file with open('Kumpula-June-2016-w-metadata.txt', 'r') as infile: data = infile.read() dataList = data.splitlines()
Now each line of the data file will be a character string in the list
dataList
. We can confirm this by running the example above and printing out the contents of dataLits, which should output the following to the IPython console:In [3]: print(dataList) ['# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki', '# for June 1-30, 2016', '# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND', '# Data processing: Extracted temperatures from raw data file, converted to', '# comma-separated format', '#', '# David Whipp - 02.10.2017', '', 'YEARMODA,TEMP,MAX,MIN', '20160601,65.5,73.6,54.7', '20160602,65.8,80.8,55.0', '20160603,68.4,77.9,55.6', '20160604,57.5,70.9,47.3', '20160605,51.4,58.3,43.2', '20160606,52.2,59.7,42.8', '20160607,56.9,65.1,45.9', '20160608,54.2,60.4,47.5', '20160609,49.4,54.1,45.7', '20160610,49.5,55.9,43.0', '20160611,54.0,62.1,41.7', '20160612,55.4,64.2,46.0', '20160613,58.3,68.2,47.3', '20160614,59.7,67.8,47.8', '20160615,63.4,70.3,49.3', '20160616,57.8,67.5,55.6', '20160617,60.4,70.7,55.9', '20160618,57.3,62.8,54.0', '20160619,56.3,59.2,54.1', '20160620,59.3,69.1,52.2', '20160621,62.6,71.4,50.4', '20160622,61.7,70.2,55.4', '20160623,60.9,67.1,54.9', '20160624,61.1,68.9,56.7', '20160625,65.7,75.4,57.9', '20160626,69.6,77.7,60.3', '20160627,60.7,70.0,57.6', '20160628,65.4,73.0,55.8', '20160629,65.8,73.2,59.7', '20160630,65.7,72.7,59.2']
We are now ready to start interacting with our file data.
Dealing with headers of known length¶
In many cases, the header in a data file will occupy the top few lines the file and we can simply skip over the header by not storing header data.
We currently have a Python list dataList
that contains our data file contents.
A common task in Python is to separate the values on each line into separate Python lists that can be manipuated independently.
Below, we will create a set of 4 Python lists, one for each column in our data file, and fill them with the values from the lines of our file.
We will first need to create our empty lists for storing the data file values. We can do this by creating empty lists beneath the indented block for reading the file.
#!/usr/bin/env python3 '''Reads the contents of a file at once. Usage: ./readall.py Author: David Whipp - 2.10.2017 ''' # Read entire data file with open('Kumpula-June-2016-w-metadata.txt', 'r') as infile: data = infile.read() dataList = data.splitlines() # Create empty lists to store file data date = [] meanTemp = [] maxTemp = [] minTemp = []
Note: These empty lists are not indented as part of the file reading block.
With the empty lists created, we now need to go through each line of the file, separate the values on each line, and add them to the lists we’ve created. We can do this using the
str.split()
method and afor
loop. Don’t forget, we want to skip over the header.#!/usr/bin/env python3 '''Reads the contents of a file at once. Usage: ./readall.py Author: David Whipp - 2.10.2017 ''' # Read entire data file with open('Kumpula-June-2016-w-metadata.txt', 'r') as infile: data = infile.read() dataList = data.splitlines() # Create empty lists to store file data date = [] meanTemp = [] maxTemp = [] minTemp = [] # Loop over lines in file, append to lists headerLines = 8 for line in range(len(dataList)): if line > headerLines: splitLine = dataList[line].split(',') date.append(splitLine[0]) meanTemp.append(splitLine[1]) maxTemp.append(splitLine[2]) minTemp.append(splitLine[3])
So, what happened?
- First, we have used a
for
loop to go over each value in the listdataList
, assigning each line to the variableline
in the loop. - Second, we have used an
if
statement to only deal with lines below the headers (index 9 and up). - Third, we have created a new variable
splitline
that is itself a Python list. In this case,line.split(',')
separates all of the values in the line at each comma (,
) and stores the split values in a list (splitline
). You can see this list for the final line in the data file by typingprint(splitline)
in the IPython console. - Lastly, since each of the four values in each line of the data file have been separated, we can add the values to the lists we’ve created earlier using the
list.append()
method. In this case, we append the corresponding values in the listsplitline
by using their index values. This may seem complicated, but if you look at the code line by line, we’re not really doing too many new things here.
- First, we have used a
Headers of a known number of lines - Alternative approach¶
Let’s start by editing the
readall.py
script we created above to read the other data file (Kumpula-June-2016-w-metadata.txt
) and saving the modified file asheadread.py
.#!/usr/bin/env python3 '''Reads the contents of a file at once. Usage: ./headread.py Author: David Whipp - 2.10.2017 ''' # Read entire data file with open('Kumpula-June-2016-w-metadata.txt', 'r') as infile: data = infile.read() dataList = data.splitlines() # Create empty lists to store file data date = [] meanTemp = [] maxTemp = [] minTemp = [] # Loop over lines in file, append to lists for line in range(9,len(dataList)): splitLine = dataList[line].split(',') date.append(splitLine[0]) meanTemp.append(splitLine[1]) maxTemp.append(splitLine[2]) minTemp.append(splitLine[3])
- So this looks almost exactly the same as before, but we’re starting the range at
9
, rather and0
. This means we don’t need to have theif
statement to only append below the header.
- So this looks almost exactly the same as before, but we’re starting the range at
More options¶
If you’d like to see a few other options for reading files the Pythonic way, you can also check out the materials from the 2016 version of this course.