This week we will:
Before we start, let's first import the pandas package. The syntax is typically:
import nameOfPackage
However, whenever you import a package and call a function from that package, you will have to write explicitly the name of the package. So I'm gonna rename the pandas package to something shorter, like pd.
import pandas as pd
The "as" syntax will allow me to refer to the pandas package as pd from now on.
In [44]:
import pandas as pd
Let's read in some data. Our file is an .xls file and it has 2 sheets. I don't know how to do that, so I will google "pandas read in excel file multiple sheets" and see what happens.
First useful result came up was: http://stackoverflow.com/questions/26521266/using-pandas-to-pd-read-excel-for-multiple-worksheets-of-the-same-workbook so I'm gonna do what the upvoted did.
In [45]:
xls = pd.ExcelFile(r'C:\Users\jenng\Documents\texaspse-blog\media\f16-scientific-python\week2\myExcelData.xls')
temp_table = xls.parse('Temperature')
liquid_flow_table = xls.parse('Liquid Flow')
Let's take a look at the "tables" we read in. By the way, in Python pandas terms, the data structure of the data we read in is called a DataFrame. This is the reason why when you search for answers online, a lot of people will refer to their variable names as df
.
In [46]:
temp_table
Out[46]:
That's beautiful. Now pandas has a nice function called head() that summarizes our data. Right now our data only has 2 columns and maybe 10+ rows. But head() is going to be super useful to have a giant table of 1000 rows x 1000 columns. This is how you use it.
In [47]:
temp_table.head()
Out[47]:
If you search up the documentation, you will see we can pass in an argument of how many rows we want to show. Let's see if we can show the first 7 rows.
In [48]:
temp_table.head(7)
Out[48]:
That works! Let's now doing some fun stuff with pandas. I want to rename the two columns in liquid_table_temp to "Primary" and "Ultrasonic". Let's take a look at the current column names first. As always, I googled "list of column names" and found this Stackoverflow answer.
In [49]:
list(liquid_flow_table.columns.values)
Out[49]:
Cool. Let's rename it now. Again, I googled "pandas rename column" and found http://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
In [50]:
liquid_flow_table = liquid_flow_table.rename(columns={'Primary Std, Avg mL/s': 'Primary',
'Ultrasonic, mL/S': 'Ultrasonic',})
Let's check it now:
In [51]:
liquid_flow_table.head() # Great! It worked like we wanted.
Out[51]:
Let's say I, for some reasons, want to add the primary + ultrasonic together and create a new column called mySum. Again, googled for "pandas add column together" and found http://stackoverflow.com/questions/22342285/summing-two-columns-in-a-pandas-dataframe
In [52]:
liquid_flow_table['mySum'] = liquid_flow_table.Primary + liquid_flow_table.Ultrasonic
liquid_flow_table.head() #check my table to see if it did add
Out[52]:
In [53]:
liquid_flow_table['myMultiplication'] = liquid_flow_table['Primary'] * liquid_flow_table['Ultrasonic']
liquid_flow_table.head() #check my table to see if it did multiply
Out[53]:
In [54]:
liquid_flow_table['myPower'] = liquid_flow_table.Primary ** liquid_flow_table.Ultrasonic
liquid_flow_table.head() #check my table to see if it did multiply
Out[54]:
Let's say now I want to create a new column that is the cumulative sum of Ultrasonic
.
In [55]:
liquid_flow_table['CumulativeSum'] = liquid_flow_table.Ultrasonic.cumsum()
liquid_flow_table
Out[55]:
Now let's say I want to divide this CumulativeSum by the maximum of that Cumulative Sum column. I'm going to call the new column something arbitrary, "myDivision".
Again, I don't know how to find the max of the CumulativeSum column so I googled "pandas max value in column". Found this http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html
In [56]:
liquid_flow_table['CumulativeSum'].max() #looks right
Out[56]:
In [57]:
liquid_flow_table['myDivision'] = liquid_flow_table['CumulativeSum']/liquid_flow_table['CumulativeSum'].max()
liquid_flow_table.head()
Out[57]:
Now let's say I want to sort the table based on the Primary
column from smallest to largest.
In [58]:
liquid_flow_table.sort('Primary', ascending=True, inplace=True)
Apparently the sort() we are trying to use is deprecreted, I don't know what that quite mean but I'm gonna do what it says and use sort_values() instead
In [59]:
liquid_flow_table.sort_values(by='Primary', ascending=True, inplace=True)
liquid_flow_table.head()
Out[59]:
Now let's do some filtering. I want my table to only shows Primary values that are greater than 40. Let's first take a look at the original table before we move on.
In [60]:
liquid_flow_table
Out[60]:
In [61]:
liquid_flow_table.loc[liquid_flow_table['Ultrasonic'] > 40].head()
Out[61]:
That stackoverflow answer brought up a really nice option. That is, how can I select data where only certain values is what I want. For example, I want the table to only show rows where my Primary is either 39.61 or 60.64.
In [66]:
liquid_flow_table.loc[liquid_flow_table['Primary'].isin(39.62,60.64)].head()
Not working so I searched "pandas isin()" and found http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html
Seems like they put their values in [] so I'm gonna do that and see if it fixes the problem.
In [65]:
liquid_flow_table.loc[liquid_flow_table['Primary'].isin([39.62,60.64])].head()
Out[65]:
Great! That sounds like most of the things I want to do. Now I want to try somethinga litle bit different and actually plot them out. I'm going back to my temp_table and plot the K-type and T-Type. Let's take a look at the temp_table again before we start.
In [69]:
temp_table
Out[69]:
Now I frankly don't know where to start so I googled "how to plot with python". First result came up was http://matplotlib.org/users/pyplot_tutorial.html. I'm just gonna plug in what the first box says.
In [70]:
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
Now what if I don't want the figure to come up every time? Is there a way for Jupyter notebook to display the images within the notebook I'm working on?
Yes there is.
In [71]:
%matplotlib inline
Now I'm going to run that again and see what happens.
In [72]:
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
Perfect! Let's try now to plot our data.
In [74]:
# no need for the import statement again because it's already imported
plt.plot(temp_table['K-type TC, °C'])
plt.ylabel('some numbers')
plt.show()
But now I also want the other column in the same plot! No problem. Just add another line.
In [75]:
plt.plot(temp_table['K-type TC, °C'])
plt.plot(temp_table['T-type TC, °C'])
plt.ylabel('some numbers')
plt.show()
Awesome! Now I want to add in the title for the x-axis, y-axis, figure, label, and legend.
In [81]:
plt.plot(temp_table['K-type TC, °C'])
plt.plot(temp_table['T-type TC, °C'])
plt.ylabel('Temperature, °C')
plt.xlabel('Index')
plt.title('Temperature vs. Index')
plt.legend()
plt.show()
Sounds great! Now let's go a little bit further and decorate our plots. With some googling, "matplotlib change line styles" gave me http://matplotlib.org/examples/lines_bars_and_markers/line_styles_reference.html
In [93]:
plt.plot(temp_table['K-type TC, °C'],'-.',color='r',linewidth=3)
plt.plot(temp_table['T-type TC, °C'],'o-',color='c')
plt.ylabel('Temperature, °C')
plt.xlabel('Index')
plt.title('Temperature vs. Index')
plt.legend()
plt.show()
Awesome. Let's say we have to plot the same thing over and over again for 10 different columns. That's going to be very annoying to do in Excel. So let's see if we can write up some function and "automate" this for us. I'm going to copy&paste all the code above and put it in a function.
In [94]:
def plotMyData(x,y):
plt.plot(x, y,'-.',color='r',linewidth=3)
plt.xlabel(x.name)
plt.ylabel(y.name)
plt.title(y.name + " vs. " + x.name)
plt.legend()
plt.show()
Now I'm going plot several columns in the liquid_flow_table (because there are many columns). Again, let's take a look at the table again.
In [96]:
liquid_flow_table.head()
Out[96]:
In [99]:
plotMyData(liquid_flow_table.Primary,liquid_flow_table.mySum)
In [100]:
plotMyData(liquid_flow_table.Primary,liquid_flow_table.myMultiplication)
In [101]:
plotMyData(liquid_flow_table.Primary,liquid_flow_table.myPower)
In [102]:
plotMyData(liquid_flow_table.Primary,liquid_flow_table.CumulativeSum)b
In [103]:
plotMyData(liquid_flow_table.Primary,liquid_flow_table.myDivision)
So 5 different plots, all in 5 lines! Imagine doing this in Excel! And in fact, programming is so awesome you can in fact not even have to copy&paste and do all that work.
In [104]:
liquid_table_columns_list = list(liquid_flow_table.columns.values)
liquid_table_columns_list
Out[104]:
In [106]:
for column in liquid_table_columns_list:
plotMyData(liquid_flow_table.Primary, liquid_flow_table[column])b
So if you're not still sold over this, I don't know what will :)
If we have time, let's open our GASISData. If you open it in excel, you can see that it's a huge table. About 19220 rows and maybe 60 columns. Let's read the file in and see how pandas handle it.
I'm going to use something that's specific to Jupyter Notebook to time how long it reads in the data.
In [110]:
%%timeit
xls = pd.read_csv(r'C:\Users\jenng\Documents\texaspse-blog\media\f16-scientific-python\week2\GASISData.csv')
So took about 664 ms in total to read that in. That's pretty cool!