Reading Data

Python has a large number of different ways to read data from external files.
Python supports almost any type of file you can think of, from simple text files to complex binary formats.
In this class we are going to mainly use the pakages Astropy and Pandas to load extrnal files.
Both of these packages create a python object called a Table.
Tables are very useful, since there are lots of built-in methods that allow us to easily manipulate the data.



In [ ]:

    
import os

The `AstroPy` package - `QTable`



In [ ]:

    
import numpy as np

from astropy.table import QTable



In [ ]:

    
os.listdir()



In [ ]:

    
planet_table = QTable.read('Planets.csv', format='ascii.csv')



In [ ]:

    
planet_table



In [ ]:

    
print(planet_table)

Renaming columns



In [ ]:

    
planet_table.rename_column('col2', 'ecc')
print(planet_table)



In [ ]:

    
planet_table['Name']



In [ ]:

    
planet_table['Name'][0]

Sorting



In [ ]:

    
planet_table.sort(['ecc'])



In [ ]:

    
planet_table

Masking



In [ ]:

    
planet_table.sort(['a'])    # re-sort our table



In [ ]:

    
mask1 = np.where(planet_table['a'] > 5)

mask1



In [ ]:

    
planet_table[mask1]



In [ ]:

    
mask2 = ((planet_table['a'] > 5) &
         (planet_table['ecc'] < 0.05))

planet_table[mask2]

Adding a column to the Table



In [ ]:

    
perihelion = planet_table['a'] * (1.0 - planet_table['ecc'])



In [ ]:

    
perihelion



In [ ]:

    
planet_table['Peri'] = perihelion



In [ ]:

    
planet_table

Saving a table



In [ ]:

    
planet_table.write('NewPlanets.csv', format='ascii.csv')



In [ ]:

    
os.listdir()

The `Pandas` package - `DataFrame`



In [ ]:

    
import pandas as pd



In [ ]:

    
planet_table2 = pd.read_csv('Planets.csv')



In [ ]:

    
planet_table2



In [ ]:

    
print(planet_table2)

Renaming columns



In [ ]:

    
planet_table2.rename(columns={'Unnamed: 2': 'ecc'}, inplace=True)

planet_table2



In [ ]:

    
planet_table2['Name']



In [ ]:

    
planet_table['Name'][0]

Sorting



In [ ]:

    
planet_table2.sort_values(['ecc'])



In [ ]:

    
planet_table2



In [ ]:

    
planet_table2.sort_values(['ecc'], ascending=False)

Masking



In [ ]:

    
mask3 = planet_table['a'] > 5

mask3



In [ ]:

    
planet_table2[mask3]



In [ ]:

    
mask4 = ((planet_table2['a'] > 5) &
         (planet_table2['ecc'] < 0.05))

planet_table2[mask4]

Adding a column to the Table



In [ ]:

    
perihelion = planet_table2['a'] * (1.0 - planet_table2['ecc'])



In [ ]:

    
perihelion



In [ ]:

    
planet_table2['Peri'] = perihelion



In [ ]:

    
planet_table2

Saving a table



In [ ]:

    
planet_table2.to_csv('NewPlanets2.csv', index=False)



In [ ]:

    
os.listdir()

QTables vs. DataFrames

As you can see, the astropy QTable and the pandas DataFrame are very similar.
There are some important differences that we will discover this quarter.
Astronomers use both packages, depending on the situation.
Pandas is the dominate packages outside astronomy.

Part I - Advantage Pandas

`Pandas` is really good for working with dates!



In [ ]:

    
import datetime



In [ ]:

    
doctor_table = pd.read_csv('Doctor.csv')



In [ ]:

    
doctor_table



In [ ]:

    
doctor_table.sort_values(['BirthDate'])



In [ ]:

    
doctor_table['BirthDate'] = pd.to_datetime(doctor_table['BirthDate'])



In [ ]:

    
doctor_table.sort_values(['BirthDate'])



In [ ]:

    
today = datetime.date.today()

today



In [ ]:

    
age = today - doctor_table['BirthDate']



In [ ]:

    
age



In [ ]:

    
doctor_table['AgeToday'] = age / np.timedelta64(1, 'Y')



In [ ]:

    
doctor_table



In [ ]:

    
doctor_table.describe()

Messy Data

Pandas is a good choice when working with messy data files.
In the "real world" all data is messy.
For example, here is the contents of the file Mess.csv:

####################################################### # # Col 1 - Name # Col 2 - Size (km) # ####################################################### "Sample 1",10 "",23 , "Another Sample",

This is not going to end well ...



In [ ]:

    
messy_table = pd.read_csv('Mess.csv')

Skip the header



In [ ]:

    
messy_table = pd.read_csv('Mess.csv', skiprows = 6)

messy_table

NaN = Not_A_Number, python's null value

Column names are messed up

Option 1 - Turn off the header



In [ ]:

    
messy_table = pd.read_csv('Mess.csv', skiprows = 6, header= None)

messy_table

Option 2 - Add the column names



In [ ]:

    
cols = ["Name", "Size"]

messy_table = pd.read_csv('Mess.csv', skiprows = 6, names = cols)

messy_table

Deal with the missing data with `fillna()`



In [ ]:

    
messy_table['Name'].fillna("unknown", inplace=True)
messy_table['Size'].fillna(999.0, inplace=True)

messy_table

Reading Data

The AstroPy package - QTable

Renaming columns

Sorting

Masking

Adding a column to the Table

Saving a table

The Pandas package - DataFrame

Renaming columns

Sorting

Masking

Adding a column to the Table

Saving a table

QTables vs. DataFrames

Part I - Advantage Pandas

Pandas is really good for working with dates!

Messy Data

This is not going to end well ...

Skip the header

Column names are messed up

Option 1 - Turn off the header

Option 2 - Add the column names

Deal with the missing data with fillna()

The `AstroPy` package - `QTable`

The `Pandas` package - `DataFrame`

`Pandas` is really good for working with dates!

Deal with the missing data with `fillna()`