Reading Data

  • Python has a large number of different ways to read data from external files.
  • Python supports almost any type of file you can think of, from simple text files to complex binary formats.
  • In this class we are going to mainly use the pakages Astropy and Pandas to load extrnal files.
  • Both of these packages create a python object called a Table.
  • Tables are very useful, since there are lots of built-in methods that allow us to easily manipulate the data.

In [ ]:
import os

The AstroPy package - QTable


In [ ]:
import numpy as np

from astropy.table import QTable

In [ ]:
os.listdir()

In [ ]:
planet_table = QTable.read('Planets.csv', format='ascii.csv')

In [ ]:
planet_table

In [ ]:
print(planet_table)

Renaming columns


In [ ]:
planet_table.rename_column('col2', 'ecc')
print(planet_table)

In [ ]:
planet_table['Name']

In [ ]:
planet_table['Name'][0]

Sorting


In [ ]:
planet_table.sort(['ecc'])

In [ ]:
planet_table

Masking


In [ ]:
planet_table.sort(['a'])    # re-sort our table

In [ ]:
mask1 = np.where(planet_table['a'] > 5)

mask1

In [ ]:
planet_table[mask1]

In [ ]:
mask2 = ((planet_table['a'] > 5) &
         (planet_table['ecc'] < 0.05))

planet_table[mask2]

Adding a column to the Table


In [ ]:
perihelion = planet_table['a'] * (1.0 - planet_table['ecc'])

In [ ]:
perihelion

In [ ]:
planet_table['Peri'] = perihelion

In [ ]:
planet_table

Saving a table


In [ ]:
planet_table.write('NewPlanets.csv', format='ascii.csv')

In [ ]:
os.listdir()

The Pandas package - DataFrame


In [ ]:
import pandas as pd

In [ ]:
planet_table2 = pd.read_csv('Planets.csv')

In [ ]:
planet_table2

In [ ]:
print(planet_table2)

Renaming columns


In [ ]:
planet_table2.rename(columns={'Unnamed: 2': 'ecc'}, inplace=True)

planet_table2

In [ ]:
planet_table2['Name']

In [ ]:
planet_table['Name'][0]

Sorting


In [ ]:
planet_table2.sort_values(['ecc'])

In [ ]:
planet_table2

In [ ]:
planet_table2.sort_values(['ecc'], ascending=False)

Masking


In [ ]:
mask3 = planet_table['a'] > 5

mask3

In [ ]:
planet_table2[mask3]

In [ ]:
mask4 = ((planet_table2['a'] > 5) &
         (planet_table2['ecc'] < 0.05))

planet_table2[mask4]

Adding a column to the Table


In [ ]:
perihelion = planet_table2['a'] * (1.0 - planet_table2['ecc'])

In [ ]:
perihelion

In [ ]:
planet_table2['Peri'] = perihelion

In [ ]:
planet_table2

Saving a table


In [ ]:
planet_table2.to_csv('NewPlanets2.csv', index=False)

In [ ]:
os.listdir()

QTables vs. DataFrames

  • As you can see, the astropy QTable and the pandas DataFrame are very similar.
  • There are some important differences that we will discover this quarter.
  • Astronomers use both packages, depending on the situation.
  • Pandas is the dominate packages outside astronomy.

Part I - Advantage Pandas

Pandas is really good for working with dates!


In [ ]:
import datetime

In [ ]:
doctor_table = pd.read_csv('Doctor.csv')

In [ ]:
doctor_table

In [ ]:
doctor_table.sort_values(['BirthDate'])

In [ ]:
doctor_table['BirthDate'] = pd.to_datetime(doctor_table['BirthDate'])

In [ ]:
doctor_table.sort_values(['BirthDate'])

In [ ]:
today = datetime.date.today()

today

In [ ]:
age = today - doctor_table['BirthDate']

In [ ]:
age

In [ ]:
doctor_table['AgeToday'] = age / np.timedelta64(1, 'Y')

In [ ]:
doctor_table

In [ ]:
doctor_table.describe()

Messy Data

  • Pandas is a good choice when working with messy data files.
  • In the "real world" all data is messy.
  • For example, here is the contents of the file Mess.csv:
####################################################### # # Col 1 - Name # Col 2 - Size (km) # ####################################################### "Sample 1",10 "",23 , "Another Sample",

This is not going to end well ...


In [ ]:
messy_table = pd.read_csv('Mess.csv')

Skip the header


In [ ]:
messy_table = pd.read_csv('Mess.csv', skiprows = 6)

messy_table

NaN = Not_A_Number, python's null value

Column names are messed up

Option 1 - Turn off the header


In [ ]:
messy_table = pd.read_csv('Mess.csv', skiprows = 6, header= None)

messy_table

Option 2 - Add the column names


In [ ]:
cols = ["Name", "Size"]

messy_table = pd.read_csv('Mess.csv', skiprows = 6, names = cols)

messy_table

Deal with the missing data with fillna()


In [ ]:
messy_table['Name'].fillna("unknown", inplace=True)
messy_table['Size'].fillna(999.0, inplace=True)

messy_table