This workshop is licensed under a Creative Commons Attribution 4.0 International License.
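A for loop executes commands once for each value in a collection, so you do not need separate variables such as pressure_001, pressure_002, etc.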
In [1]:
for number in [2, 3, 5]:
    print(number)
This for loop is equivalent to:
In [2]:
print(2)
print(3)
print(5)
The first line of the for loop must end with a colon, and the body must be indented.
Python uses indentation rather than {} or begin/end to show nesting.
In [3]:
# The body below is not indented, so Python raises an IndentationError.
for number in [2, 3, 5]:
print(number)
In [ ]:
firstName="Jon"
lastName="Smith"
In [ ]:
for number in [2, 3, 5]:
    print(number)
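A for loop is made up of a collection, a loop variable, and a body.
The collection, [2, 3, 5], is what the loop is being run on.
The body, print(number), specifies what to do for each value in the collection.
The loop variable, number, is what changes for each iteration of the loop.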
In [ ]:
for kitten in [2, 3, 5]:
    print(kitten)
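Loop variable names are arbitrary, as the kitten cell suggests. As a minimal sketch, the loop below labels the collection, the loop variable, and the body; the names `values` and `seen` are hypothetical and chosen only for illustration.

```python
# Collection: the values the loop runs over.
values = [2, 3, 5]
seen = []

# Loop variable: `value` takes each element of the collection in turn.
for value in values:
    # Body: the indented lines run once per element.
    seen.append(value * 10)

# This line is not indented, so it runs once, after the loop finishes.
print(seen)
```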
In [4]:
# Total length of the strings in the list: ["red", "green", "blue"] => 12
# Desired output: 12
total = 0
for word in ["red", "green", "blue"]:
    total = total + len(word)
print(total)
In [5]:
# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
# Desired output: [3, 5, 4]
lengths = []
for word in ["red", "green", "blue"]:
    lengths.append(len(word))
print(lengths)
In [6]:
# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
# Desired output: "redgreenblue"
words = ["red", "green", "blue"]
result = ''
for word in words:
    result = result + word
print(result)
In [7]:
# Create acronym: ["red", "green", "blue"] => "RGB"
# Desired output: "RGB"
words = ["red", "green", "blue"]
acronym = ''
for word in words:
    acronym = acronym + word[0].capitalize()
print(acronym)
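The four cells above all follow the same accumulator pattern: start with an empty value, then update it inside the loop. For comparison, here is a sketch of the same results written with Python built-ins (`sum`, a list comprehension, and `str.join`). It is an alternative phrasing, not a replacement for understanding the loops.

```python
words = ["red", "green", "blue"]

# Total length of all words.
total = sum(len(word) for word in words)

# List of word lengths.
lengths = [len(word) for word in words]

# All words concatenated.
result = "".join(words)

# Acronym built from the upper-cased first letters.
acronym = "".join(word[0].upper() for word in words)

print(total, lengths, result, acronym)
```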
Use if statements to control whether or not a block of code is executed.
An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
Its structure is similar to a for statement: the first line opens with if and ends with a colon, and the body is indented.
In [8]:
mass = 3.54
if mass > 3.0:
    print(mass, 'is large')
mass = 2.07
if mass > 3.0:
    print(mass, 'is large')
In [9]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
In [10]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')
In [11]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 9.0:
        print(m, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')
In [12]:
grade = 85
# Conditions are tested in order, so 85 matches the first branch and this prints 'grade is C'.
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')
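Because the branches of a conditional are tested in order and only the first match runs, the cell above reports C for a grade of 85. A minimal sketch of one way to get the intended result is to test the most restrictive condition first. The function name `letter_grade` and the fallback string 'below C' are assumptions added for illustration.

```python
def letter_grade(grade):
    """Return a letter for a numeric grade, testing the highest threshold first."""
    if grade >= 90:
        return 'A'
    elif grade >= 80:
        return 'B'
    elif grade >= 70:
        return 'C'
    # Assumed fallback; the original cell prints nothing in this case.
    return 'below C'

print(letter_grade(85))
```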
In [13]:
velocity = 10.0
if velocity > 20.0:
    print('moving too fast')
else:
    print('adjusting velocity')
    velocity = 50.0
In [14]:
import csv
primes = [2, 3, 5]
with open('output.csv', 'w', newline='') as outFile:
    # Create the writer once, then write one row per prime.
    writer = csv.writer(outFile)
    for prime in primes:
        squared = prime ** 2
        cubed = prime ** 3
        row = [prime, squared, cubed]
        writer.writerow(row)
In [15]:
with open('output.csv', 'r') as dataFile:
    data = csv.reader(dataFile)
    for row in data:
        print(row)
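One detail worth noting is that csv.reader returns every cell as a string, even when numbers were written. The sketch below repeats the write-then-read round trip, using an in-memory io.StringIO buffer in place of output.csv (an assumption made so the example leaves no file behind), and converts the cells back to integers.

```python
import csv
import io

primes = [2, 3, 5]

# An in-memory text buffer stands in for a file on disk.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
for prime in primes:
    writer.writerow([prime, prime ** 2, prime ** 3])

# Rewind to the start, then read the rows back.
buffer.seek(0)
raw_rows = list(csv.reader(buffer))

# Each cell is a string, so convert explicitly before doing arithmetic.
rows = [[int(cell) for cell in row] for row in raw_rows]
print(raw_rows[0])
print(rows)
```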
Load the Pandas library with import pandas.
Read a comma-separated values (CSV) data file with pandas.read_csv.
In [16]:
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)
The output uses \ to show wrapped lines when it is too wide to fit the screen.
Use index_col to specify that a column's values should be used as row headings.
Pass the name of the column to read_csv as its index_col parameter to do this.
In [17]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)
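To see what index_col changes, the sketch below reads the same small table twice. The table contents and the column names (`gdp_1952`, `gdp_1957`) are hypothetical, and an io.StringIO buffer stands in for a file in the data directory.

```python
import io
import pandas as pd

# Hypothetical two-row table standing in for a data file.
csv_text = "country,gdp_1952,gdp_1957\nAlpha,100.0,110.0\nBeta,200.0,220.0\n"

# Without index_col, rows are labelled 0, 1, ... and 'country' is an ordinary column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the 'country' values become the row labels.
country_index = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(list(default_index.index))
print(list(country_index.index))
print(country_index.loc['Beta', 'gdp_1957'])
```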
In [18]:
data.info()
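The result is a DataFrame with two rows named 'Australia' and 'New Zealand'.
The DataFrame.columns variable stores information about the dataframe's columns.
Note that this is data, not a method, like math.pi, so do not use () to try to call it.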
In [19]:
print(data.columns)
In [20]:
print(data.T)
Use DataFrame.describe to get summary statistics about data.
DataFrame.describe() gets the summary statistics of only the columns that have numerical data.
All other columns are ignored, unless you use the argument include='all'.
In [21]:
print(data.describe())
As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files.
Use help to get information on how to use to_csv.
In order to write the DataFrame americas to a file called processed.csv, execute the following command:
americas.to_csv('processed.csv')
In [22]:
data.T.to_csv('oceania_transposed.csv')
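A quick way to check what to_csv wrote is to read the result back. The sketch below does this with a small made-up table and an in-memory buffer instead of a file on disk; `table` and its values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical small table; the values are made up for illustration.
table = pd.DataFrame(
    {'gdp_1952': [100.0, 200.0], 'gdp_1957': [110.0, 220.0]},
    index=pd.Index(['Alpha', 'Beta'], name='country'),
)

# Write the transposed table to an in-memory buffer instead of a file.
buffer = io.StringIO()
table.T.to_csv(buffer)

# Read it back, using the first column as the row labels again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

After the transpose, the former column names are the row labels and the countries are the columns, which is what the restored table shows.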
A [DataFrame][pandas-dataframe] is a collection of [Series][pandas-series]. The DataFrame is the way Pandas represents a table, and Series is the data structure Pandas uses to represent a column.
Pandas is built on top of the [NumPy][numpy] library, which in practice means that most of the methods defined for NumPy arrays apply to Pandas Series/DataFrames.
What makes Pandas so attractive is its powerful interface for accessing individual records of the table, its proper handling of missing values, and its relational-database operations between DataFrames.
To access a value at the position [i,j] of a DataFrame, we have two options, depending on
what i means.
Remember that a DataFrame provides an index as a way to identify the rows of the table;
a row, then, has a position inside the table as well as a label, which
uniquely identifies its entry in the DataFrame.
Use DataFrame.iloc[..., ...] to select values by their (entry) position.
In [23]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.iloc[0, 0])
Use DataFrame.loc[..., ...] to select values by their (entry) label.
In [24]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])
In [25]:
albania = data.loc["Albania"]
print(albania)
print(albania.describe())
In [26]:
gdp1952 = data["gdpPercap_1952"]
print(gdp1952)
print(gdp1952.max())
print(gdp1952.idxmax())
In [27]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])
In the above code, we discover that slicing using loc is inclusive at both
ends, which differs from slicing using iloc, where slicing indicates
everything up to but not including the final index.
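The difference between the two slicing rules is easy to check on a small table. The sketch below uses a hypothetical four-row DataFrame (labels and values are made up) and compares the shapes returned by a label slice and a position slice.

```python
import pandas as pd

# Hypothetical table; labels and values are made up for illustration.
table = pd.DataFrame(
    {'y1952': [1, 2, 3, 4], 'y1957': [5, 6, 7, 8], 'y1962': [9, 10, 11, 12]},
    index=['A', 'B', 'C', 'D'],
)

# Label slices include both end labels: rows B and C, columns y1952 and y1957.
by_label = table.loc['B':'C', 'y1952':'y1957']

# Position slices exclude the end position: row 1 only, column 0 only.
by_position = table.iloc[1:2, 0:1]

print(by_label.shape)
print(by_position.shape)
```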
import pandas as pd
df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
Write an expression to find the:
In [ ]:
# Exercise 3
print(df.loc['Serbia', 'gdpPercap_2007'])
print(df['gdpPercap_1982'])
print(df.loc['Denmark',:])
print(df.loc[:,'gdpPercap_1985':])
In [32]:
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, '\n', data.min())
Use glob.glob to find sets of files whose names match a pattern.
The most common patterns are * meaning "match zero or more characters" and ? meaning "match exactly one character".
Python contains the glob library to provide pattern matching functionality.
The glob library contains a function also called glob to match file patterns.
For example, glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
In [33]:
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
In [34]:
print('all PDB files:', glob.glob('*.pdb'))
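The two wildcards behave differently when file names vary in length. The sketch below creates a few empty files in a temporary directory (the file names are hypothetical) and compares what `*` and `?` match. Results are sorted because glob.glob does not guarantee an order.

```python
import glob
import os
import tempfile

# Create a few empty files in a temporary directory (names are hypothetical).
with tempfile.TemporaryDirectory() as folder:
    for name in ['run1.csv', 'run2.csv', 'run10.csv', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()

    # * matches zero or more characters, so run10.csv is included.
    star = sorted(os.path.basename(path)
                  for path in glob.glob(os.path.join(folder, 'run*.csv')))

    # ? matches exactly one character, so run10.csv is excluded.
    question = sorted(os.path.basename(path)
                      for path in glob.glob(os.path.join(folder, 'run?.csv')))

print(star)
print(question)
```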
In [35]:
csvs = []
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename, index_col='country')
    csvs.append(data)
#print(csvs)
dataAll = pd.concat(csvs)
print(dataAll)
matplotlib is the most widely used scientific plotting library in Python.
We commonly use a sub-library called matplotlib.pyplot.
In [36]:
%matplotlib inline
import matplotlib.pyplot as plt
In [37]:
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]
plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')
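We can also plot data directly from a Pandas dataframe.
This implicitly uses matplotlib.pyplot.
Before plotting, we convert the column headings from a string to integer data type, since they represent numerical values.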
In [38]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
#print(data)
# Extract the year by removing the 'gdpPercap_' prefix from each column name
years = data.columns.str.replace('gdpPercap_', '')
# Convert year values to integers, saving results back to dataframe
data.columns = years.astype(int)
#print(data)
#print(data.loc['Australia'])
plt.plot(data.loc['Australia'])
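When pulling the year out of a column name such as gdpPercap_1952, keep in mind that str.strip treats its argument as a set of characters to remove from both ends, not as a prefix. It happens to work when the remainder is all digits, but not in general. The sketch below uses a hypothetical column name, 'gdpPercap_pre1950', to show the difference; str.removeprefix requires Python 3.9 or later.

```python
# 'gdpPercap_pre1950' is a hypothetical column name used only to show the difference.
name = 'gdpPercap_pre1950'

# strip() removes any leading or trailing characters found in the given set,
# so the 'p', 'r', and 'e' of 'pre' are removed as well.
stripped = name.strip('gdpPercap_')

# removeprefix() (Python 3.9+) removes the exact prefix once.
prefix_removed = name.removeprefix('gdpPercap_')

print(stripped)
print(prefix_removed)
```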
DataFrame.plot plots with the rows as the X axis.
In [39]:
plt.plot(data.T)
plt.ylabel('GDP per capita')
plt.xlabel('Year')
plt.title('GDP per capita for Oceania (1950-2007)')
In [40]:
plt.style.use('ggplot')
plt.bar(list(data.columns), data.loc['Australia'])
plt.ylabel('GDP per capita')
In [41]:
years = data.columns
gdp_australia = data.loc['Australia']
plt.plot(years, gdp_australia, 'g--')
Often when plotting multiple datasets on the same figure it is desirable to have a legend describing the data.
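This can be done in matplotlib in two stages:
- Provide a label for each dataset in the figure:
plt.plot(years, gdp_australia, label='Australia')
plt.plot(years, gdp_nz, label='New Zealand')
- Instruct matplotlib to create the legend:
plt.legend()
By default matplotlib will attempt to place the legend in a suitable position. To choose a position yourself, use the loc= argument, e.g. to place the legend in the upper left corner of the plot, specify loc='upper left'.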
In [42]:
# Select two countries' worth of data.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']
# Plot with differently-colored lines.
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')
# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')
In [43]:
plt.scatter(gdp_australia, gdp_nz)
If you are satisfied with the plot you see, you may want to save it to a file, perhaps to include it in a publication. There is a function in the matplotlib.pyplot module that accomplishes this: savefig. Calling this function, e.g. with
plt.savefig('my_figure.png')
will save the current figure to the file my_figure.png. The file format will automatically be deduced from the file name extension (other formats are pdf, ps, eps and svg).
Note that functions in plt refer to a global figure variable, and after a figure has been displayed to the screen (e.g. with plt.show) matplotlib will make this variable refer to a new empty figure. Therefore, make sure you call plt.savefig before the plot is displayed to the screen, otherwise you may find a file with an empty plot.
When using dataframes, data is often generated and plotted to screen in one line, and plt.savefig seems not to be a possible approach. One possibility to save the figure to file is then to
- save a reference to the current figure in a local variable (with plt.gcf)
- call the savefig method on that variable.
In [44]:
plt.scatter(gdp_australia, gdp_nz)
plt.savefig('my_fig.png')
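The plt.gcf approach described above can be sketched as follows. The plotted values are made up, the output goes to a temporary directory rather than the working directory, and the 'Agg' backend is selected only so the sketch runs without a display; in a notebook using %matplotlib inline that line would normally be omitted.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
plt.plot([0, 1, 2, 3], [0, 100, 200, 300])

# Keep a reference to the current figure, then save through that reference.
fig = plt.gcf()
out_path = os.path.join(tempfile.mkdtemp(), 'my_figure.png')
fig.savefig(out_path)

saved = os.path.exists(out_path)
print(saved)
```

Holding on to `fig` means the figure can still be saved after other plotting calls have moved the global "current figure" elsewhere.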
In [29]:
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)
# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)
In [30]:
mask = subset > 10000
print(subset[mask])
In [31]:
print(subset[subset > 10000].describe())
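Selecting with a Boolean mask keeps the shape of the table: values where the mask is False are replaced by NaN, and methods such as count and describe skip those NaN entries. The sketch below shows this on a hypothetical two-row table; the labels and values are made up.

```python
import pandas as pd

# Hypothetical values, chosen so that some fall below the threshold.
table = pd.DataFrame(
    {'y1962': [8000.0, 12000.0], 'y1967': [9500.0, 15000.0]},
    index=['A', 'B'],
)

# A DataFrame of True/False values with the same shape as table.
mask = table > 10000

# Values where the mask is False become NaN.
selected = table[mask]

print(selected)
# count() skips NaN, so each column reports one remaining value.
print(selected.count())
```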
Workshop materials are derived from work that is Copyright © Software Carpentry.