Import Excel or CSV To Pandas

This file covers the process of importing excel and csv files into a pandas dataframe. Note: the methods for importing excel and csv files is almost identical. The major difference is in the method used. This notebook serves as a tutorial for both.

Importing Excel (xlsx):
The function used is read_excel.
Importing comma separated values (csv):
The function used is read_csv.

Step 1

Lets start by importing pandas and os. We will be using pandas to create a dataframe from our data, and os to get file paths.



In [6]:

    
import pandas as pd
import os

Step 2

Now lets create a variable, filePath, that is a string containing the full path to the file we want to import. The code below looks in the current working directory for the file given a file name input by the user. This isn't necessary, and is just included for convienence. Alternatively, user can input a full path into the filePath variable.



In [7]:

    
dirPath = os.path.realpath('.')
fileName = 'assets/coolingExample.xlsx'
filePath = os.path.join(dirPath, fileName)

Step 3

Great! Now lets read the data into a dataframe called df. This will allow our data to be accessible by the string in the header.



In [8]:

    
df = pd.read_excel(filePath,header=0)
df.head()









    Out[8]:






  
    
      
      Time - Dev2/ai0
      Temperature - Dev2/ai0
    
  
  
    
      0
      11:17:30
      85.4
    
    
      1
      11:17:30
      85.6
    
    
      2
      11:17:30
      84.9
    
    
      3
      11:17:30
      85.8
    
    
      4
      11:17:30
      85.2

Our data is now accessible by a key value. The keys are the column headers in the dataframe. In this example case, those are 'Time (s) - Dev1/ai0' and 'Temperature - Dev1/ai0'. For example, lets access the data in the first column.



In [9]:

    
df[df.columns[0]]









    Out[9]:





0       11:17:30
1       11:17:30
2       11:17:30
3       11:17:30
4       11:17:30
5       11:17:30
6       11:17:30
7       11:17:30
8       11:17:30
9       11:17:30
10      11:17:31
11      11:17:31
12      11:17:31
13      11:17:31
14      11:17:31
15      11:17:31
16      11:17:31
17      11:17:31
18      11:17:31
19      11:17:31
20      11:17:32
21      11:17:32
22      11:17:32
23      11:17:32
24      11:17:32
25      11:17:32
26      11:17:32
27      11:17:32
28      11:17:32
29      11:17:32
          ...   
2439    11:21:33
2440    11:21:34
2441    11:21:34
2442    11:21:34
2443    11:21:34
2444    11:21:34
2445    11:21:34
2446    11:21:34
2447    11:21:34
2448    11:21:34
2449    11:21:34
2450    11:21:35
2451    11:21:35
2452    11:21:35
2453    11:21:35
2454    11:21:35
2455    11:21:35
2456    11:21:35
2457    11:21:35
2458    11:21:35
2459    11:21:35
2460    11:21:36
2461    11:21:36
2462    11:21:36
2463    11:21:36
2464    11:21:36
2465    11:21:36
2466    11:21:36
2467    11:21:36
2468    11:21:36
Name: Time - Dev2/ai0, dtype: object

What would happen if we tried to access the data with an invalid key, say 1 for example? Lets try it to find out.

Note: I enclose this code in a try: except: statement in order to prevent a huge error from being generated.



In [10]:

    
try:
    df[1]
except KeyError:
    print("KeyError: 1 - not a valid key")









    



KeyError: 1 - not a valid key

So lets say you have a large dataframe with unknown columns. There is a simple way to index them without having prior knowledge of what the dataframe columns are. Namely, the columns method in pandas.



In [11]:

    
cols = df.columns
for col in cols:
    print(df[col])









    



0       11:17:30
1       11:17:30
2       11:17:30
3       11:17:30
4       11:17:30
5       11:17:30
6       11:17:30
7       11:17:30
8       11:17:30
9       11:17:30
10      11:17:31
11      11:17:31
12      11:17:31
13      11:17:31
14      11:17:31
15      11:17:31
16      11:17:31
17      11:17:31
18      11:17:31
19      11:17:31
20      11:17:32
21      11:17:32
22      11:17:32
23      11:17:32
24      11:17:32
25      11:17:32
26      11:17:32
27      11:17:32
28      11:17:32
29      11:17:32
          ...   
2439    11:21:33
2440    11:21:34
2441    11:21:34
2442    11:21:34
2443    11:21:34
2444    11:21:34
2445    11:21:34
2446    11:21:34
2447    11:21:34
2448    11:21:34
2449    11:21:34
2450    11:21:35
2451    11:21:35
2452    11:21:35
2453    11:21:35
2454    11:21:35
2455    11:21:35
2456    11:21:35
2457    11:21:35
2458    11:21:35
2459    11:21:35
2460    11:21:36
2461    11:21:36
2462    11:21:36
2463    11:21:36
2464    11:21:36
2465    11:21:36
2466    11:21:36
2467    11:21:36
2468    11:21:36
Name: Time - Dev2/ai0, dtype: object
0       85.4
1       85.6
2       84.9
3       85.8
4       85.2
5       85.1
6       86.1
7       85.1
8       85.0
9       85.8
10      85.0
11      85.6
12      85.1
13      85.2
14      85.1
15      85.1
16      85.8
17      85.1
18      85.6
19      85.1
20      86.1
21      86.4
22      85.8
23      86.6
24      86.1
25      85.8
26      85.9
27      86.1
28      85.5
29      85.8
        ... 
2439     4.2
2440     3.1
2441     3.8
2442     5.1
2443     4.4
2444     4.3
2445     4.7
2446     4.3
2447     4.4
2448     4.4
2449     4.4
2450     4.0
2451     2.7
2452     4.6
2453     4.8
2454     3.5
2455     4.2
2456     3.2
2457     3.7
2458     3.8
2459     3.5
2460     3.4
2461     3.9
2462     3.4
2463     4.0
2464     4.1
2465     3.5
2466     3.5
2467     3.1
2468     3.9
Name: Temperature - Dev2/ai0, dtype: float64

Data Manipulation (Plots)

Now that we have the data easily accessible in python, lets look at how to plot it. Pandas allows you to use matplotlib to plot, however it is done using methods built into pandas.

Although the methods to create an manipulate plots are built into Pandas, we will still have to import matplotlib to save and show the plots.



In [12]:

    
import matplotlib.pyplot as plt

In order to demonstrate the plotting capabilities of pandas arrays, lets use the example data that we imported earlier. The data frame contains only the two columns that were in the file; temperature and time. Because of this simplicity, we can trust pandas to properly interpret the first column as time and the second column as th measurement (temperature). Thus we can plot with the simple command.

df.plot()



In [13]:

    
plt.figure(1)
ax = df.plot()
plt.show()









    





<matplotlib.figure.Figure at 0x9ef1eb8>

While this simplification is nice, it is generally better to specify what data you want to plot. Particularly if you are automating the plotting of a large set of dataframes. To do this, specify the x and y arrays in your dataframe as you would in a standard matplotlib plot call, however since this plotting function is a method of the dataframe, you need only specify the column.

I.e.



In [14]:

    
plt.figure(2)
ax = df.plot(cols[0],cols[1])
plt.show()









    





<matplotlib.figure.Figure at 0xa53d128>

Now that we have the basics down, lets spice up the plot a little bit.



In [15]:

    
plt.figure(3)
ax = df.plot(cols[0],cols[1])
ax.set_title('This is a Title')
ax.set_ylabel('Temperature (deg F)')
ax.grid()
plt.show()









    





<matplotlib.figure.Figure at 0x9778be0>

Data Manipulation (Timestamps)

One thing you probably noticed in these plots is that the time axis isn't all that useful. It would be better to change the timestamps to a more useful form like seconds since start. Lets go through the process of making that conversion.

First, lets see what the timestamp currently looks like.



In [16]:

    
df[cols[0]][0]









    Out[16]:





datetime.time(11, 17, 30)

Good news! Since python interpreted the date as a datetime object, we can use datetime object methods to determine the time in seconds. The one caveat is that we can only determine a time difference, not an absolute time. For more on this, read this stackoverflow question.

The first thing we have to do is convert these datetime.time objects into datetime.datetime objects using datetime.combine

Note: importing datetime is a little weird.. datetime is both a module and a class.



In [17]:

    
from datetime import datetime, date



In [18]:

    
startTime = df[cols[0]][0]
timeArray = []
for i in range(0,len(df[cols[0]])):
    timeArray.append((datetime.combine(date.today(), df[cols[0]][i]) - datetime.combine(date.today(), startTime)).total_seconds())

Note: There is probably a better way of doing this (i.e. without a loop, but I'm tired and can't think of anything right now)



In [19]:

    
plt.figure(4)
plt.plot(timeArray, df[cols[1]], 'b')
plt.title('This is a graph with a better time axis')
plt.ylabel('Temperature (deg F)')
plt.xlabel('Time (s)')
plt.grid()
plt.show()

	Time - Dev2/ai0	Temperature - Dev2/ai0
0	11:17:30	85.4
1	11:17:30	85.6
2	11:17:30	84.9
3	11:17:30	85.8
4	11:17:30	85.2