This file covers the process of importing excel and csv files into a pandas dataframe. Note: the methods for importing excel and csv files is almost identical. The major difference is in the method used. This notebook serves as a tutorial for both.
Importing Excel (xlsx):
The function used is read_excel.
Importing comma separated values (csv):
The function used is read_csv.
Lets start by importing pandas and os. We will be using pandas to create a dataframe from our data, and os to get file paths.
In [6]:
import pandas as pd
import os
Now lets create a variable, filePath
, that is a string containing the full path to the file we want to import. The code below looks in the current working directory for the file given a file name input by the user. This isn't necessary, and is just included for convienence. Alternatively, user can input a full path into the filePath
variable.
In [7]:
dirPath = os.path.realpath('.')
fileName = 'assets/coolingExample.xlsx'
filePath = os.path.join(dirPath, fileName)
In [8]:
df = pd.read_excel(filePath,header=0)
df.head()
Out[8]:
Our data is now accessible by a key value. The keys are the column headers in the dataframe. In this example case, those are 'Time (s) - Dev1/ai0' and 'Temperature - Dev1/ai0'. For example, lets access the data in the first column.
In [9]:
df[df.columns[0]]
Out[9]:
What would happen if we tried to access the data with an invalid key, say 1
for example? Lets try it to find out.
Note: I enclose this code in a try: except:
statement in order to prevent a huge error from being generated.
In [10]:
try:
df[1]
except KeyError:
print("KeyError: 1 - not a valid key")
So lets say you have a large dataframe with unknown columns. There is a simple way to index them without having prior knowledge of what the dataframe columns are. Namely, the columns
method in pandas.
In [11]:
cols = df.columns
for col in cols:
print(df[col])
Now that we have the data easily accessible in python, lets look at how to plot it. Pandas
allows you to use matplotlib to plot, however it is done using methods built into pandas.
Although the methods to create an manipulate plots are built into Pandas
, we will still have to import matplotlib to save and show the plots.
In [12]:
import matplotlib.pyplot as plt
In order to demonstrate the plotting capabilities of pandas arrays, lets use the example data that we imported earlier. The data frame contains only the two columns that were in the file; temperature and time. Because of this simplicity, we can trust pandas to properly interpret the first column as time and the second column as th measurement (temperature). Thus we can plot with the simple command.
df.plot()
In [13]:
plt.figure(1)
ax = df.plot()
plt.show()
While this simplification is nice, it is generally better to specify what data you want to plot. Particularly if you are automating the plotting of a large set of dataframes. To do this, specify the x
and y
arrays in your dataframe as you would in a standard matplotlib
plot call, however since this plotting function is a method of the dataframe, you need only specify the column.
I.e.
In [14]:
plt.figure(2)
ax = df.plot(cols[0],cols[1])
plt.show()
Now that we have the basics down, lets spice up the plot a little bit.
In [15]:
plt.figure(3)
ax = df.plot(cols[0],cols[1])
ax.set_title('This is a Title')
ax.set_ylabel('Temperature (deg F)')
ax.grid()
plt.show()
One thing you probably noticed in these plots is that the time axis isn't all that useful. It would be better to change the timestamps to a more useful form like seconds since start. Lets go through the process of making that conversion.
First, lets see what the timestamp currently looks like.
In [16]:
df[cols[0]][0]
Out[16]:
Good news! Since python interpreted the date as a datetime object, we can use datetime object methods to determine the time in seconds. The one caveat is that we can only determine a time difference, not an absolute time. For more on this, read this stackoverflow question.
The first thing we have to do is convert these datetime.time
objects into datetime.datetime
objects using datetime.combine
Note: importing datetime is a little weird.. datetime
is both a module and a class.
In [17]:
from datetime import datetime, date
In [18]:
startTime = df[cols[0]][0]
timeArray = []
for i in range(0,len(df[cols[0]])):
timeArray.append((datetime.combine(date.today(), df[cols[0]][i]) - datetime.combine(date.today(), startTime)).total_seconds())
Note: There is probably a better way of doing this (i.e. without a loop, but I'm tired and can't think of anything right now)
In [19]:
plt.figure(4)
plt.plot(timeArray, df[cols[1]], 'b')
plt.title('This is a graph with a better time axis')
plt.ylabel('Temperature (deg F)')
plt.xlabel('Time (s)')
plt.grid()
plt.show()