pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
More information about pandas: http://pandas.pydata.org/pandas-docs/stable/
https://pandas.pydata.org/pandas-docs/stable/10min.html
https://pandas.pydata.org/pandas-docs/stable/cookbook.html
https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python#gs.=B6Dr74
http://xarray.pydata.org/en/stable/
https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
- 1-D: Series; e.g.
- Solar planets: [Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune]
- Set of astronomical objects and when they were observed:
  [[NGC1952, 2012-05-01],
   [NGC224, 2013-01-23],
   [NGC5194, 2014-02-13]]
- 2-D: DataFrame; e.g. (more business oriented):
- 3 months of sales information for 3 fictitious companies:
  sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
           {'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': 215},
           {'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
In [ ]:
# Import libraries
import pandas as pd
import numpy as np
Series is a one-dimensional labeled array capable of holding any data type.
The axis labels are collectively referred to as the index.
This is the basic idea of how to create a Series:
s = pd.Series(data, index=index)
where data can be, for example, a Python list or dict, a NumPy ndarray, or a scalar value (all demonstrated below).
In [ ]:
solar_planets = ['Mercury','Venus','Earth','Mars','Jupiter','Saturn','Uranus','Neptune']
In [ ]:
splanets = pd.Series(solar_planets)
In [ ]:
# Tips and tricks
# To access the docstring for quick reference on syntax, prepend ?:
#?pd.Series()
In [ ]:
splanets
In [ ]:
splanets.index
In [ ]:
s1 = pd.Series(np.random.randn(5))
In [ ]:
s1
In [ ]:
s1.index
In [ ]:
s2 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [ ]:
s2
In [ ]:
s2.index
If data is a scalar value, an index must be provided
The value will be repeated to match the length of index
In [ ]:
s3 = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
In [ ]:
s3
In [ ]:
s3.index
In [ ]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
In [ ]:
sd = pd.Series(d)
In [ ]:
sd
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types (see also the 3-dimensional Panel, now deprecated).
You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
It is generally the most commonly used pandas object.
Like Series, DataFrame accepts many different kinds of input, for example a list of dicts, a dict of Series, a dict of lists, or a NumPy ndarray (all demonstrated below):
In [ ]:
sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
In [ ]:
df = pd.DataFrame(sales)
In [ ]:
df
In [ ]:
df.info()
In [ ]:
df.index
In [ ]:
df = df.set_index('account')
In [ ]:
df
In [ ]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
In [ ]:
df = pd.DataFrame(d)
In [ ]:
df
In [ ]:
df.info()
In [ ]:
pd.DataFrame(d, index=['d', 'b', 'a'])
In [ ]:
df.index
In [ ]:
df.columns
In [ ]:
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
In [ ]:
pd.DataFrame(d)
In [ ]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
In [ ]:
data = np.random.random_sample((5, 5))
In [ ]:
data
In [ ]:
df = pd.DataFrame(data)
In [ ]:
df
In [ ]:
# Add index
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
In [ ]:
df
In [ ]:
# Add column names
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'], columns=['ra', 'dec', 'z_phot', 'z_true', 'imag'])
In [ ]:
df
In [ ]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
In [ ]:
pd.DataFrame(data2)
In [ ]:
pd.DataFrame(data2, index=['first', 'second'])
In [ ]:
pd.DataFrame(data2, columns=['a', 'b'])
In [ ]:
# A few galaxies with some properties: id, ra, dec, magi
galaxies = [
    {'id': 1, 'ra': 4.5, 'dec': -55.6, 'magi': 21.3},
    {'id': 3, 'ra': 23.5, 'dec': 23.6, 'magi': 23.3},
    {'id': 25, 'ra': 22.5, 'dec': -0.3, 'magi': 20.8},
    {'id': 17, 'ra': 33.5, 'dec': 15.6, 'magi': 24.3}
]
In [ ]:
# %load -r 1-19 solutions/06_01_pandas.py
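One possible minimal sketch (the reference solution is loaded above from solutions/06_01_pandas.py): build the DataFrame from the list of dicts and set the galaxy id as the index.
In [ ]:
# Sketch: index the galaxy catalog by its id column
df_galaxies = pd.DataFrame(galaxies).set_index('id')
df_galaxies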
Most of the time a study starts from an input file containing the data, rather than from a Python list or dictionary.
Here we present three different data sources and how to read them: two file formats (CSV and FITS) and a database connection.
Advanced: the amount of data to handle keeps growing (the Big Data era) and files can be huge. This is why we strongly recommend always programming by chunks, as in the sketch below and the full example later in this section (sometimes it is mandatory, even though it is not always straightforward to implement).
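As a quick preview (the complete worked example appears further down), reading a CSV by chunks looks like this; the file name and chunk size here are placeholders:
In [ ]:
# Sketch: iterate over a large CSV in fixed-size chunks
# 'big_catalog.csv' is a hypothetical file name
n_rows = 0
for chunk in pd.read_csv('big_catalog.csv', chunksize=100000):
    n_rows += len(chunk)  # per-chunk processing goes here
print(n_rows)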
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
SELECT unique_gal_id, ra_gal, dec_gal, z_cgal, z_cgal_v, lmhalo,
       (mr_gal - 0.8 * (atan(1.5 * z_cgal) - 0.1489)) AS abs_mag,
       gr_gal AS color,
       (des_asahi_full_i_true - 0.8 * (atan(1.5 * z_cgal) - 0.1489)) AS app_mag
FROM micecatv2_0_view
TABLESAMPLE (BUCKET 1 OUT OF 512)
In [ ]:
filename = '../resources/galaxy_sample.csv'
In [ ]:
!head -30 ../resources/galaxy_sample.csv
In [ ]:
filename_bz2 = '../resources/galaxy_sample.csv.bz2'
In [ ]:
!head ../resources/galaxy_sample.csv.bz2
In [ ]:
# Field index name (known a priori from the header or the file description)
unique_gal_id_field = 'unique_gal_id'
In [ ]:
galaxy_sample = pd.read_csv(filename, sep=',', index_col=unique_gal_id_field, comment='#', na_values=r'\N')
In [ ]:
galaxy_sample.head()
In [ ]:
galaxy_sample.tail()
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
In [ ]:
galaxy_sample.describe()
In [ ]:
galaxy_sample.info()
In [ ]:
galaxy_sample_bz2 = pd.read_csv(filename_bz2, sep=',', index_col=unique_gal_id_field, comment='#', na_values=r'\N')
In [ ]:
galaxy_sample_bz2.head()
pandas does not read FITS files directly, so some conversion is necessary.
We have found two different approaches: astropy.table (shown here) and fitsio (shown in the "Caveats and technicalities" section below).
It is not easy to read FITS by chunks (see the "Caveats and technicalities" section below for a fitsio-based approach).
Note: we strongly recommend using csv.bz2 instead!
In [ ]:
from astropy.table import Table
FITS file created using the same query as the CSV file:
In [ ]:
filename = '../resources/galaxy_sample.fits'
In [ ]:
#?Table.read()
In [ ]:
data = Table.read(filename)
In [ ]:
type(data)
In [ ]:
df = data.to_pandas()
In [ ]:
df.head()
In [ ]:
df = df.set_index('unique_gal_id')
In [ ]:
df.head()
In [ ]:
df.shape
In [ ]:
df.values.dtype
In [ ]:
df.info()
In [ ]:
# For PostgreSQL access
from sqlalchemy.engine import create_engine
# Text wrapping
import textwrap
In [ ]:
# Database configuration parameters
#db_url = '{scheme}://{user}:{password}@{host}/{database}'
db_url = 'sqlite:///../resources/pandas.sqlite'
sql_sample = textwrap.dedent("""\
    SELECT *
    FROM micecatv1
    WHERE ABS(ra_mag - ra) > 0.05
    """)
index_col = 'id'
In [ ]:
# Create database connection
engine = create_engine(db_url)
df = pd.read_sql(sql_sample, engine, index_col=index_col)
In [ ]:
df.head()
In [ ]:
outfile = '../resources/micecatv1_sample1.csv'
In [ ]:
with open(outfile, 'w') as f_out:
    df.to_csv(f_out,
              columns=['ra', 'dec', 'ra_mag', 'dec_mag'],
              index=True,
              header=True)
In [ ]:
filename = '../resources/galaxy_sample.csv'
outfile = '../resources/galaxy_sample_some_columns.csv'
In [ ]:
# chunk size
gal_chunk = 100000
In [ ]:
# Field index name (known a priori from the header or the file description)
unique_gal_id_field = 'unique_gal_id'
- Open the file with a with statement
- Create a reader object using the read_csv method with the chunksize parameter
- Loop over the chunks using enumerate in order to also have the chunk number
In [ ]:
with open(filename, 'r') as galaxy_fd, open(outfile, 'w') as f_out:
    galaxy_sample_reader = pd.read_csv(
        galaxy_fd,
        sep=',',
        index_col=unique_gal_id_field,
        comment='#',
        na_values=r'\N',
        chunksize=gal_chunk
    )
    for chunk, block in enumerate(galaxy_sample_reader):
        print(chunk)
        # Write the header only once, for the first chunk
        block.to_csv(f_out,
                     columns=['ra_gal', 'dec_gal', 'z_cgal_v'],
                     index=True,
                     header=(chunk == 0),
                     mode='a')
In [ ]:
block.head()
In [ ]:
block.tail(3)
In [ ]:
# DataFrame plot method
%matplotlib inline
import matplotlib.pyplot as plt
block['lmhalo'].plot.hist(bins=100, logy = True)
plt.show()
| Operation | Syntax | Result |
|---|---|---|
| Select column | df[column_label] | Series |
| Select row by index | df.loc[index] | Series |
| Select row by integer location | df.iloc[pos] | Series |
| Slice rows | df[5:10] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |
In [ ]:
# Same dataframe as before
filename='../resources/galaxy_sample.csv.bz2'
galaxy_sample = pd.read_csv(filename, sep=',', index_col=unique_gal_id_field, comment='#', na_values=r'\N')
In [ ]:
galaxy_sample.head()
In [ ]:
galaxy_sample['ra_gal'].head()
In [ ]:
type(galaxy_sample['dec_gal'])
In [ ]:
galaxy_sample[['ra_gal','dec_gal','lmhalo']].head()
In [ ]:
galaxy_sample.loc[28581888]
In [ ]:
type(galaxy_sample.loc[28581888])
In [ ]:
galaxy_sample.iloc[0]
In [ ]:
type(galaxy_sample.iloc[0])
In [ ]:
galaxy_sample.iloc[3:7]
In [ ]:
galaxy_sample[3:7]
In [ ]:
type(galaxy_sample.iloc[3:7])
The boolean operators are: | for or, & for and, and ~ for not. Conditions must be grouped using parentheses.
In [ ]:
# Boolean vector
(galaxy_sample['ra_gal'] < 45).tail()
In [ ]:
type(galaxy_sample['ra_gal'] < 45)
In [ ]:
galaxy_sample[galaxy_sample['ra_gal'] < 45].head()
In [ ]:
# redshift shell
galaxy_sample[(galaxy_sample.z_cgal <= 0.2) | (galaxy_sample.z_cgal >= 1.0)].head()
In [ ]:
galaxy_sample[(galaxy_sample.z_cgal <= 1.0) & (galaxy_sample.index.isin([5670656,13615360,3231232]))]
In [ ]:
galaxy_sample[(galaxy_sample['ra_gal'] < 1.) & (galaxy_sample['dec_gal'] < 1.)][['ra_gal','dec_gal']].head()
We are also going to introduce how to add a column, plus some other concepts.
We split the galaxies into two different populations, ellipticals and spirals, depending on their color and absolute magnitude:
if color - 0.29 + 0.03 * abs_mag < 0 then Spiral
else Elliptical
How many galaxies are elliptical and how many are spiral?
Elliptical galaxies do not have any disk (and therefore disk_length = 0).
The disk_length for spiral galaxies follows a normal distribution with mean = 0 and sigma = 0.15 (in arcsec), taking the absolute value. In addition, the minimum disk_length for a spiral galaxy is 1.e-3.
In [ ]:
galaxy_sample.tail(10)
In [ ]:
# Splitting the galaxies
# Boolean mask
has_disk_mask = (galaxy_sample['color'] - 0.29 + 0.03 * galaxy_sample['abs_mag'] < 0)
In [ ]:
has_disk_mask.tail(10)
In [ ]:
print(len(has_disk_mask))
print(type(has_disk_mask))
In [ ]:
# Counting how many spirals
n_spiral = has_disk_mask.sum()
In [ ]:
# Counting how many ellipticals
n_elliptical = (~has_disk_mask).sum()
In [ ]:
galaxy_sample[has_disk_mask].count()
In [ ]:
galaxy_sample[has_disk_mask]['hubble_type'] = 'Spiral'
In [ ]:
# It did not add any column! The chained indexing assigned to a temporary copy!
galaxy_sample.tail(10)
In [ ]:
# This is the proper way of doing it if one wants to add another column
galaxy_sample.loc[has_disk_mask, 'hubble_type'] = 'Spiral'
In [ ]:
galaxy_sample.loc[~has_disk_mask, 'hubble_type'] = 'Elliptical'
In [ ]:
galaxy_sample.tail(10)
In [ ]:
# We can use the numpy where method to do the same:
galaxy_sample['color_type'] = np.where(has_disk_mask, 'Blue', 'Red')
In [ ]:
galaxy_sample.tail(10)
In [ ]:
# The proper way would be to use a boolean field
galaxy_sample['has_disk'] = has_disk_mask
In [ ]:
galaxy_sample.tail(10)
In [ ]:
galaxy_sample.loc[~has_disk_mask, 'disk_length'] = 0.
In [ ]:
galaxy_sample.loc[has_disk_mask, 'disk_length'] = np.fabs(
    np.random.normal(0., scale=0.15, size=n_spiral)
)
In [ ]:
galaxy_sample.tail(10)
In [ ]:
# Minimum value for disk_length for spirals
dl_min = 1.e-3
In [ ]:
disk_too_small_mask = has_disk_mask & (galaxy_sample['disk_length'] < dl_min)
In [ ]:
disk_too_small_mask.sum()
In [ ]:
galaxy_sample.loc[disk_too_small_mask, 'disk_length'].head()
In [ ]:
galaxy_sample.loc[disk_too_small_mask, 'disk_length'] = dl_min
In [ ]:
galaxy_sample.loc[disk_too_small_mask, 'disk_length'].head()
In [ ]:
galaxy_sample.tail(10)
What are the mean value and the standard deviation of the disk_length for spiral galaxies? (Tip: use the .mean() and .std() methods.)
Estimate the bulge_length for elliptical galaxies. The bulge_length depends on the absolute magnitude in the following way:
bulge_length = exp(-1.145 - 0.269 * (abs_mag - 23.))
How many galaxies have bulge_length > 1.0?
In our model the maximum bulge_length for an elliptical galaxy is 0.5 arcsec.
What are the mean value and the standard deviation of the bulge_length for elliptical galaxies? And for ellipticals with absolute magnitude brighter than -20?
In [ ]:
# %load -r 20-102 solutions/06_01_pandas.py
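A minimal sketch of the bulge_length part, following the same masking pattern used above for disk_length (the reference solution is in solutions/06_01_pandas.py; setting 0 for spirals is an assumption, not part of the stated model):
In [ ]:
# Sketch: bulge_length from the model formula, capped at the 0.5 arcsec maximum
bl_max = 0.5
galaxy_sample.loc[has_disk_mask, 'bulge_length'] = 0.  # assumed: spirals get no bulge here
galaxy_sample.loc[~has_disk_mask, 'bulge_length'] = np.exp(
    -1.145 - 0.269 * (galaxy_sample.loc[~has_disk_mask, 'abs_mag'] - 23.)
)
print((galaxy_sample['bulge_length'] > 1.0).sum())  # count before capping
galaxy_sample.loc[galaxy_sample['bulge_length'] > bl_max, 'bulge_length'] = bl_max
galaxy_sample.loc[~has_disk_mask, 'bulge_length'].agg(['mean', 'std'])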
https://pandas.pydata.org/pandas-docs/stable/merging.html
pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
concat method:
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
In [ ]:
df1 = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2', 'A3'],
     'B': ['B0', 'B1', 'B2', 'B3'],
     'C': ['C0', 'C1', 'C2', 'C3'],
     'D': ['D0', 'D1', 'D2', 'D3']},
    index=[0, 1, 2, 3]
)
In [ ]:
df2 = pd.DataFrame(
    {'A': ['A4', 'A5', 'A6', 'A7'],
     'B': ['B4', 'B5', 'B6', 'B7'],
     'C': ['C4', 'C5', 'C6', 'C7'],
     'D': ['D4', 'D5', 'D6', 'D7']},
    index=[4, 5, 6, 7]
)
In [ ]:
df3 = pd.DataFrame(
    {'A': ['A8', 'A9', 'A10', 'A11'],
     'B': ['B8', 'B9', 'B10', 'B11'],
     'C': ['C8', 'C9', 'C10', 'C11'],
     'D': ['D8', 'D9', 'D10', 'D11']},
    index=[8, 9, 10, 11]
)
In [ ]:
frames = [df1, df2, df3]
In [ ]:
result = pd.concat(frames)
result
In [ ]:
# Multiindex
result = pd.concat(frames, keys=['x', 'y','z'])
In [ ]:
result
In [ ]:
result.index
In [ ]:
result.loc['y']
In [ ]:
df4 = pd.DataFrame(
    {'B': ['B2', 'B3', 'B6', 'B7'],
     'D': ['D2', 'D3', 'D6', 'D7'],
     'F': ['F2', 'F3', 'F6', 'F7']},
    index=[2, 3, 6, 7]
)
In [ ]:
df4
In [ ]:
df1
In [ ]:
result = pd.concat([df1, df4])
result
In [ ]:
result = pd.concat([df1, df4], axis=1)
result
In [ ]:
result = pd.concat([df1, df4], axis=1, join='inner')
result
In [ ]:
df1
In [ ]:
df2
In [ ]:
# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# pd.concat achieves the same result (shown below)
result = df1.append(df2)
result
In [ ]:
df1
In [ ]:
df4
In [ ]:
result = df1.append(df4)
result
In [ ]:
result = pd.concat([df1,df4])
result
In [ ]:
result = pd.concat([df1,df4], ignore_index=True)
result
In [ ]:
result = df1.append(df4, ignore_index = True)
result
In [ ]:
df1
In [ ]:
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
s1
In [ ]:
result = pd.concat([s1,df1])
result
In [ ]:
result = pd.concat([df1,s1], axis = 1)
result
In [ ]:
s2 = pd.Series(['_0', '_1', '_2', '_3'])
In [ ]:
result = pd.concat([df1,s2,s2,s2], axis = 1)
result
In this exercise we will use the concat method and show a basic example of a MultiIndex.
Given a subset of a few galaxies with the properties ['halo_id', 'gal_id', 'ra', 'dec', 'z', 'abs_mag'], create a random catalog with 50 times more galaxies than the subset, keeping the properties of the galaxies but placing them randomly in the first octant of the sky.
The index of each galaxy is given by the tuple (halo_id, gal_id).
In [ ]:
data = [
    # halo_id, gal_id, ra, dec, z, abs_mag
    [1, 1, 21.5, 30.1, 0.21, -21.2],
    [1, 2, 21.6, 29.0, 0.21, -18.3],
    [1, 3, 21.4, 30.0, 0.21, -18.5],
    [2, 1, 45.0, 45.0, 0.42, -20.4],
    [3, 1, 25.0, 33.1, 0.61, -21.2],
    [3, 2, 25.1, 33.2, 0.61, -20.3]
]
In [ ]:
# %load -r 103-145 solutions/06_01_pandas.py
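A minimal sketch of one possible approach (the reference solution is in solutions/06_01_pandas.py). It reads "randomly in the first octant" as uniform in ra and dec over [0, 90] degrees; a proper uniform-on-sphere draw would sample sin(dec) instead:
In [ ]:
# Sketch: 50 randomized copies of the subset, stacked with concat and keys
columns = ['halo_id', 'gal_id', 'ra', 'dec', 'z', 'abs_mag']
subset = pd.DataFrame(data, columns=columns).set_index(['halo_id', 'gal_id'])
copies = []
for i in range(50):
    copy = subset.copy()
    copy['ra'] = np.random.uniform(0., 90., size=len(copy))
    copy['dec'] = np.random.uniform(0., 90., size=len(copy))
    copies.append(copy)
random_catalog = pd.concat(copies, keys=range(50))
random_catalog.head()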
pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL. These methods perform significantly better (in some cases well over an order of magnitude better) than other open source implementations (like base::merge.data.frame in R). The reason for this is careful algorithmic design and the internal layout of the data in DataFrame.
See the cookbook for some advanced strategies
Users who are familiar with SQL but new to pandas might be interested in a comparison with SQL
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False)
Goal: build a DataFrame by merging two different DataFrames with complementary information, through the relation given by a third DataFrame.
In [ ]:
star_filename = '../resources/df_star.ssv'
spectra_filename = '../resources/df_spectra.ssv'
starid_specid_filename = '../resources/df_starid_specid.ssv'
In [ ]:
df_spectra = pd.read_csv(spectra_filename, index_col=['spec_id', 'band'], sep = ' ')
df_spectra.head(41)
In [ ]:
df_starid_specid = pd.read_csv(starid_specid_filename, sep=' ')
df_starid_specid.head(5)
In [ ]:
# The file is somewhat corrupted (it contains duplicated entries), so we open it without defining any index
df_star = pd.read_csv(star_filename, sep=' ')
df_star.head(10)
In [ ]:
df_star[(df_star['sdss_star_id'] == 1237653665258930303) & (df_star['filter'] == 'NB455')]
In [ ]:
# Drop duplicates:
df_star.drop_duplicates(subset=['sdss_star_id', 'filter'], inplace=True)
In [ ]:
df_star[(df_star['sdss_star_id'] == 1237653665258930303) & (df_star['filter'] == 'NB455')]
In [ ]:
df_starid_specid.head(5)
In [ ]:
df_spectra.reset_index(inplace = True)
df_spectra.head()
In [ ]:
df_spectra.rename(columns={'band': 'filter'}, inplace = True)
df_spectra.head()
In [ ]:
df_starid_specid.rename(columns={'objID':'sdss_star_id', 'specObjID':'spec_id'}, inplace = True)
df_starid_specid.head()
In [ ]:
df_star_merged = pd.merge(df_star, df_starid_specid, on='sdss_star_id')
In [ ]:
df_star_merged.head()
In [ ]:
df_star_merged = pd.merge(df_star_merged, df_spectra, on=['spec_id','filter'])
In [ ]:
df_star_merged.head(40)
In [ ]:
df_star_merged.set_index(['sdss_star_id', 'filter'], inplace = True)
In [ ]:
df_star_merged.head()
In [ ]:
# Each element has been observed in how many bands?
count_bands = df_star_merged.groupby(level=0)['flux'].count()
In [ ]:
count_bands.head(20)
In [ ]:
df_star_merged.groupby(level=1)['flux_err'].mean().head(10)
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html
Data cleaning:
In [ ]:
# e.g.: the decimal value 0.1 cannot be represented exactly as a base 2 fraction
(0.1 + 0.2) == 0.3
In [ ]:
(0.1 + 0.2) - 0.3
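When comparing floats, the usual remedy is a tolerance-based comparison; a minimal sketch using NumPy:
In [ ]:
# Compare with a tolerance instead of exact equality
np.isclose(0.1 + 0.2, 0.3)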
In [ ]:
import fitsio
In [ ]:
filename = '../resources/galaxy_sample.fits'
In [ ]:
fits=fitsio.FITS(filename)
In [ ]:
data = fits[1]
In [ ]:
# Number of rows
data.get_nrows()
In [ ]:
# chunk size
gal_chunk = 300000
In [ ]:
# e.g. to create the ranges!
import math
niter = int(math.ceil(data.get_nrows() / float(gal_chunk)))
for i in range(niter):
    s = i * gal_chunk
    f = min((i + 1) * gal_chunk, data.get_nrows())
    chunk = data[s:f]
    print(i)
    print(type(chunk))
    print(chunk.dtype)
    df_chunk = pd.DataFrame(chunk)
    print(type(df_chunk))
    print(df_chunk.dtypes)
    df_chunk = df_chunk.set_index('unique_gal_id')
    print(df_chunk.head())
Some scipy functions do not accept pandas DataFrames as arguments, and therefore it is useful to use the values attribute, which is the NumPy representation of the NDFrame.
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say, if the dtypes (even of numeric types) are mixed, the one that accommodates all of them will be chosen. Use this with care if you are not dealing with homogeneous blocks.
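A minimal illustration of the upcasting, with made-up column names:
In [ ]:
# Mixed int and float columns upcast to float64; adding a string column upcasts to object
df_mix = pd.DataFrame({'i': [1, 2], 'x': [0.5, 1.5]})
print(df_mix.values.dtype)
df_mix['name'] = ['a', 'b']
print(df_mix.values.dtype)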
In [ ]:
bad_filename = '../resources/steps.flagship.dat'
In [ ]:
df_bad = pd.read_csv(bad_filename)
df_bad.head()
In [ ]:
df_bad = pd.read_csv(bad_filename, sep=' ')
The file uses runs of spaces as separators; it can be cleaned up in the shell first:
cat steps.flagship.dat | tr -s " " | sed 's/^ *//g' > steps.flagship.ssv
In [ ]:
filename = '../resources/steps.flagship.ssv'
In [ ]:
columns = ['step_num', 'r_min', 'r_max', 'r_med', 'a_med', 'z_med']
In [ ]:
df = pd.read_csv(filename, sep=' ', header=None, names=columns, index_col='step_num')
In [ ]:
df.head()