This is the notebook for the Python pandas DataFrame course

The idea of this notebook is to show the power of working with pandas DataFrames

Motivation

We usually work with tabular data

We should not handle it with bash commands such as for, split, grep, awk, etc.

pandas is a very good tool for handling this kind of data.

Welcome to Pandas!

Definition of pandas:

Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

More information about pandas: http://pandas.pydata.org/pandas-docs/stable/

Contents of the course:

  • I/O:
    • Reading: CSV, FITS, SQL
    • Writing: CSV
    • Advanced example: Reading and writing CSV files by chunks
  • Selecting and slicing:
    • loc & iloc
    • Advanced example: Estimate a galaxy property for a subset of galaxies using boolean conditions
    • Exercise 2: Estimate another galaxy property
  • More functions:
    • Loop over a dataframe (itertuples and iterrows)
    • Sort
    • Sample
    • Reshape: pivot, stack, unstack, etc.

Know your data

It is very important to know your data (perfectly): structure, data types, index, relations, etc. (see Pau's talk for a much better explanation ;)

Dimensionality:

- 1-D: Series; e.g.
    - Solar planets: [Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune]
    - Set of astronomical objects and when they were observed:
        [[NGC1952, 2012-05-01],
         [NGC224, 2013-01-23],
         [NGC5194, 2014-02-13]]
- 2-D: DataFrame; e.g. (more business-oriented):
    - 3 months of sales information for 3 fictitious companies:
        sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
                 {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215},
                 {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': 95 }]
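
For instance, the set of observations above fits naturally in a Series whose index is the object name (a minimal sketch; in the notebook, pandas is imported a few cells below):

In [ ]:
import pandas as pd

# A 1-D Series: observation dates as values, object names as the index
observations = pd.Series(['2012-05-01', '2013-01-23', '2014-02-13'],
                         index=['NGC1952', 'NGC224', 'NGC5194'])
observations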

Index

  • It is the value (~key) we use as a reference for each element. (Note: It does not have to be unique)

  • Most datasets contain at least one index


In [ ]:
# Import libraries
import pandas as pd
import numpy as np
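
Quick illustration: an index does not have to be unique, and selecting a repeated label returns all matching entries (a minimal sketch):

In [ ]:
s_dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
s_dup['a']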

Series definition

Series is a one-dimensional labeled array capable of holding any data type

The axis labels are collectively referred to as the index

This is the basic way to create a Series:

s = pd.Series(data, index=index)

where data can be:

  • a list
  • an ndarray
  • a Python dictionary
  • a scalar value (in this case an index must be provided)

Create a Series from a list

If no index is passed, one will be created having values [0, ..., len(data) - 1]


In [ ]:
solar_planets = ['Mercury','Venus','Earth','Mars','Jupiter','Saturn','Uranus','Neptune']

In [ ]:
splanets = pd.Series(solar_planets)

In [ ]:
# Tips and tricks
# To access the docstring for a quick syntax reference, prefix the name with ?:
#?pd.Series()

In [ ]:
splanets

In [ ]:
splanets.index

Create a Series from a NumPy array

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1]

  • Without an index:

In [ ]:
s1 = pd.Series(np.random.randn(5))

In [ ]:
s1

In [ ]:
s1.index
  • With an index:

In [ ]:
s2 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [ ]:
s2

In [ ]:
s2.index
  • From a scalar value

If data is a scalar value, an index must be provided

The value will be repeated to match the length of index


In [ ]:
s3 = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

In [ ]:
s3

In [ ]:
s3.index

Create a Series from a Python dictionary


In [ ]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [ ]:
sd = pd.Series(d)

In [ ]:
sd

DataFrame definition

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. (A 3-dimensional Panel also existed, but it has since been deprecated and removed from pandas.)

You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

It is generally the most commonly used pandas object.

Like Series, DataFrame accepts many different kinds of input:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame

From a list of dictionaries


In [ ]:
sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': 95 }]

In [ ]:
df = pd.DataFrame(sales)

In [ ]:
df

In [ ]:
df.info()

In [ ]:
df.index

In [ ]:
df = df.set_index('account')

In [ ]:
df

From dict of Series or dicts


In [ ]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [ ]:
df = pd.DataFrame(d)

In [ ]:
df

In [ ]:
df.info()

In [ ]:
pd.DataFrame(d, index=['d', 'b', 'a'])

In [ ]:
df.index

In [ ]:
df.columns

From dict of ndarrays / lists

The ndarrays must all be the same length.

If an index is passed, it must clearly also be the same length as the arrays.

If no index is passed, the result will be range(n), where n is the array length.


In [ ]:
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

In [ ]:
pd.DataFrame(d)

In [ ]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

From a 2-D numpy array

If an index is passed, it must be the same length as the number of rows, and column labels must match the number of columns.

If no index is passed, the result will be range(n), where n is the number of rows. (Structured and record arrays are also accepted; see the fitsio example in the caveats section below.)


In [ ]:
data = np.random.random_sample((5, 5))

In [ ]:
data

In [ ]:
df = pd.DataFrame(data)

In [ ]:
df

In [ ]:
# Add index
df = pd.DataFrame(data,index = ['a','b','c','d','e'])

In [ ]:
df

In [ ]:
# Add column names
df = pd.DataFrame(data, index = ['a','b','c','d','e'], columns = ['ra', 'dec','z_phot','z_true','imag'])

In [ ]:
df

From a list of dicts


In [ ]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [ ]:
pd.DataFrame(data2)

In [ ]:
pd.DataFrame(data2, index=['first', 'second'])

In [ ]:
pd.DataFrame(data2, columns=['a', 'b'])

Exercise 1: Selecting pandas structure

Given a few galaxies with some properties ['id', 'ra', 'dec', 'magi'], choose which pandas structure to use and its index:


In [ ]:
#Few galaxies with some properties: id, ra, dec, magi
galaxies = [
    {'id' : 1, 'ra' : 4.5, 'dec' : -55.6, 'magi' : 21.3},
    {'id' : 3, 'ra' : 23.5, 'dec' : 23.6, 'magi' : 23.3},
    {'id' : 25, 'ra' : 22.5, 'dec' : -0.3, 'magi' : 20.8},
    {'id' : 17, 'ra' : 33.5, 'dec' : 15.6, 'magi' : 24.3}   
]

In [ ]:
# %load -r 1-19 solutions/06_01_pandas.py
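
One possible approach (a sketch, not the contents of the solution file): each galaxy has several properties, so a DataFrame indexed by 'id' is a natural choice:

In [ ]:
df_galaxies = pd.DataFrame(galaxies).set_index('id')
df_galaxies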

I/O

Reading from different sources into a DataFrame

  • Most of the time, a study starts with an input file containing the data rather than a Python list or dictionary.

  • Here we present three different data sources and how to read them: two file formats (CSV and FITS) and a database connection.

  • Advanced: the amount of data to handle keeps growing (the Big Data era), so files can be huge. This is why we strongly recommend always programming by chunks (sometimes it is mandatory, and it is not always straightforward to implement).

Reading the full catalog at once (if the file is not very large)

  • CSV file created using the following query (1341.csv.bz2):

    SELECT unique_gal_id, ra_gal, dec_gal, z_cgal, z_cgal_v, lmhalo,
           (mr_gal - 0.8 * (atan(1.5 * z_cgal) - 0.1489)) AS abs_mag,
           gr_gal AS color,
           (des_asahi_full_i_true - 0.8 * (atan(1.5 * z_cgal) - 0.1489)) AS app_mag
    FROM micecatv2_0_view TABLESAMPLE (BUCKET 1 OUT OF 512)

In [ ]:
filename = '../resources/galaxy_sample.csv'

In [ ]:
!head -30 ../resources/galaxy_sample.csv
  • CSV.BZ2 (less storage, slower when reading because of decompression)

In [ ]:
filename_bz2 = '../resources/galaxy_sample.csv.bz2'

In [ ]:
!head ../resources/galaxy_sample.csv.bz2

Reading the full catalog at once (if the file is not very large)


In [ ]:
# Field index name (known a priori from the header or the file description)
unique_gal_id_field = 'unique_gal_id'

In [ ]:
galaxy_sample = pd.read_csv(filename, sep=',', index_col = unique_gal_id_field, comment='#', na_values = '\\N')

In [ ]:
galaxy_sample.head()

In [ ]:
galaxy_sample.tail()
  • DataFrame.describe:

Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.


In [ ]:
galaxy_sample.describe()

In [ ]:
galaxy_sample.info()

In [ ]:
galaxy_sample_bz2 = pd.read_csv(filename_bz2, sep=',', index_col = unique_gal_id_field, comment='#', na_values = r'\N')

In [ ]:
galaxy_sample_bz2.head()

FITS file:

  • pandas does not read FITS files directly, so some conversion is necessary

  • We have found 2 different approaches:

    • the Table class from astropy (or pyfits)
    • fitsio (see "Caveats and technicalities" section below)
  • Not easy to read by chunks (see also "Caveats and technicalities" section below)

  • Note: we strongly recommend using CSV.BZ2!

Using astropy (or pyfits)

  • This method does not support reading by chunks, so you have to read the whole file at once

In [ ]:
from astropy.table import Table

FITS file created using the same query as the CSV file:


In [ ]:
filename = '../resources/galaxy_sample.fits'

In [ ]:
#?Table.read()

In [ ]:
data = Table.read(filename)

In [ ]:
type(data)

In [ ]:
df = data.to_pandas()

In [ ]:
df.head()

In [ ]:
df = df.set_index('unique_gal_id')

In [ ]:
df.head()

In [ ]:
df.shape

In [ ]:
df.values.dtype

In [ ]:
df.info()
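
Reading from a SQL database

  • pandas can read the result of a SQL query straight into a DataFrame with read_sql, given a database connection (here, a SQLAlchemy engine pointing at a local SQLite file).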

In [ ]:
# For PostgreSQL access
from sqlalchemy.engine import create_engine
# Text wrapping
import textwrap

In [ ]:
# Database configuration parameters
#db_url = '{scheme}://{user}:{password}@{host}/{database}'
db_url = 'sqlite:///../resources/pandas.sqlite'

sql_sample = textwrap.dedent("""\
SELECT *
FROM micecatv1
WHERE ABS(ra_mag-ra) > 0.05
""")

index_col = 'id'

In [ ]:
# Create database connection
engine = create_engine(db_url)
df = pd.read_sql(sql_sample, engine, index_col=index_col)

In [ ]:
df.head()

Write to a CSV file:


In [ ]:
outfile = '../resources/micecatv1_sample1.csv'

In [ ]:
with open(outfile, 'w') as f_out:
    df.to_csv(f_out,
              columns = ['ra', 'dec','ra_mag','dec_mag'],
              index=True,
              header=True
             )

Advanced example: Reading and writing by chunks


In [ ]:
filename = '../resources/galaxy_sample.csv'
outfile = '../resources/galaxy_sample_some_columns.csv'

In [ ]:
# chunk size
gal_chunk = 100000

In [ ]:
# Field index name (known a priori from the header or the file description)
unique_gal_id_field = 'unique_gal_id'
  • Opening the input and output files with the with statement

  • Creating a reader object with the read_csv method and its chunksize argument

  • Looping over the chunks with enumerate in order to also have the chunk number


In [ ]:
with open(filename, 'r') as galaxy_fd, open(outfile, 'w') as f_out:
    galaxy_sample_reader = pd.read_csv(
        galaxy_fd, 
        sep=',', 
        index_col = unique_gal_id_field, 
        comment='#', 
        na_values = '\\N', 
        chunksize=gal_chunk
    )
    for chunk, block in enumerate(galaxy_sample_reader):
        print(chunk)
        # Write the header for the first chunk only (header=True when chunk == 0)
        block.to_csv(f_out, 
            columns = ['ra_gal','dec_gal','z_cgal_v'], 
            index=True, 
            header= chunk==0, 
            mode='a'
       )

In [ ]:
block.head()

In [ ]:
block.tail(3)
  • DataFrame plot method (just for curiosity!)

In [ ]:
# DataFrame plot method
%matplotlib inline
import matplotlib.pyplot as plt
block['lmhalo'].plot.hist(bins=100, logy = True)
plt.show()

SELECTING AND SLICING

  • The idea of this section is to show how to slice and get and set subsets of pandas objects

  • The basics of indexing are as follows:

    Operation                        Syntax             Result
    -------------------------------  -----------------  ---------
    Select column                    df[column label]   Series
    Select row by index              df.loc[index]      Series
    Select row by integer location   df.iloc[pos]       Series
    Slice rows                       df[5:10]           DataFrame
    Select rows by boolean vector    df[bool_vec]       DataFrame

In [ ]:
# Same dataframe as before
filename='../resources/galaxy_sample.csv.bz2'
galaxy_sample = pd.read_csv(filename, sep=',', index_col = unique_gal_id_field, comment='#', na_values = r'\N')

In [ ]:
galaxy_sample.head()
  • Select a column

In [ ]:
galaxy_sample['ra_gal'].head()

In [ ]:
type(galaxy_sample['dec_gal'])

In [ ]:
galaxy_sample[['ra_gal','dec_gal','lmhalo']].head()
  • Select a row by index

In [ ]:
galaxy_sample.loc[28581888]

In [ ]:
type(galaxy_sample.loc[28581888])
  • Select a row by integer location

In [ ]:
galaxy_sample.iloc[0]

In [ ]:
type(galaxy_sample.iloc[0])
  • Slice rows

In [ ]:
galaxy_sample.iloc[3:7]

In [ ]:
galaxy_sample[3:7]

In [ ]:
type(galaxy_sample.iloc[3:7])
  • Select rows by boolean vector:

The operators are: | for or, & for and, and ~ for not. The conditions must be grouped using parentheses.


In [ ]:
# Boolean vector
(galaxy_sample['ra_gal'] < 45).tail()

In [ ]:
type(galaxy_sample['ra_gal'] < 45)

In [ ]:
galaxy_sample[galaxy_sample['ra_gal'] < 45].head()

In [ ]:
# redshift shell
galaxy_sample[(galaxy_sample.z_cgal <= 0.2) | (galaxy_sample.z_cgal >= 1.0)].head()

In [ ]:
galaxy_sample[(galaxy_sample.z_cgal <= 1.0) & (galaxy_sample.index.isin([5670656,13615360,3231232]))]

In [ ]:
galaxy_sample[(galaxy_sample['ra_gal'] < 1.) & (galaxy_sample['dec_gal'] < 1.)][['ra_gal','dec_gal']].head()

Recap:

  • loc works on labels in the index.
  • iloc works on the positions in the index (so it only takes integers).
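
A minimal contrast (a sketch) with a string index, where the difference is easiest to see:

In [ ]:
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.loc['b'])   # label-based    -> 20
print(s.iloc[1])    # position-based -> 20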

Advanced example: estimate the size of the disk (disk_length) for a set of galaxies

  • In this exercise we are going to use some of the previous examples.
  • Also we are going to introduce how to add a column and some other concepts

    • We split the galaxies into two different populations, Ellipticals and Spirals, depending on their color and absolute magnitude:

         if color - 0.29 + 0.03 * abs_mag < 0 then Spiral
         else Elliptical
    • How many galaxies are ellipticals and how many are spirals?

    • Elliptical galaxies do not have any disk (and therefore disk_length = 0).

    • The disk_length for spiral galaxies follows the absolute value of a normal distribution with mean = 0 and sigma = 0.15 (in arcsec). In addition, the minimum disk_length for a spiral galaxy is 1.e-4 (dl_min below).


In [ ]:
galaxy_sample.tail(10)

In [ ]:
# Splitting the galaxies
# Boolean mask
has_disk_mask = (galaxy_sample['color']-0.29+0.03*galaxy_sample['abs_mag'] < 0)

In [ ]:
has_disk_mask.tail(10)

In [ ]:
print (len(has_disk_mask))
print (type(has_disk_mask))

In [ ]:
# Counting how many spirals
n_spiral = has_disk_mask.sum()

In [ ]:
# Counting how many ellipticals (note the parentheses: negate the mask first, then sum)
n_elliptical = (~has_disk_mask).sum()

In [ ]:
galaxy_sample[has_disk_mask].count()

In [ ]:
galaxy_sample[has_disk_mask]['hubble_type'] = 'Spiral'

In [ ]:
# It did not add any column! The assignment went to a temporary copy, not the original dataframe!
galaxy_sample.tail(10)

In [ ]:
# This is the proper way of doing it if one wants to add another column
galaxy_sample.loc[has_disk_mask, 'hubble_type'] = 'Spiral'

In [ ]:
galaxy_sample.loc[~has_disk_mask, 'hubble_type'] = 'Elliptical'

In [ ]:
galaxy_sample.tail(10)

In [ ]:
# We can use the numpy where method to do the same:
galaxy_sample['color_type'] = np.where(has_disk_mask, 'Blue', 'Red')

In [ ]:
galaxy_sample.tail(10)

In [ ]:
# The proper way would be to use a boolean field
galaxy_sample['has_disk'] = has_disk_mask

In [ ]:
galaxy_sample.tail(10)

In [ ]:
galaxy_sample.loc[~has_disk_mask, 'disk_length'] = 0.

In [ ]:
galaxy_sample.loc[has_disk_mask, 'disk_length'] = np.fabs(
                np.random.normal(
                    0., scale=0.15, size=n_spiral
                )
            )

IN GENERAL, DO NOT LOOP OVER A PANDAS DATAFRAME!


In [ ]:
galaxy_sample.tail(10)

In [ ]:
# Minimum value for disk_length for spirals
dl_min = 1.e-4

In [ ]:
disk_too_small_mask = has_disk_mask & (galaxy_sample['disk_length'] < dl_min)

In [ ]:
disk_too_small_mask.sum()

In [ ]:
galaxy_sample.loc[disk_too_small_mask, 'disk_length'].head()

In [ ]:
galaxy_sample.loc[disk_too_small_mask, 'disk_length'] = dl_min

In [ ]:
galaxy_sample.loc[disk_too_small_mask, 'disk_length'].head()

In [ ]:
galaxy_sample.tail(10)

Exercise 2: Estimate another galaxy property

  • What is the mean value and the standard deviation of the disk_length for spiral galaxies? (Tip: use the .mean() and .std() methods)

  • Estimate the bulge_length for elliptical galaxies. The bulge_length depends on the absolute magnitude in the following way:

    bulge_length = exp(-1.145 - 0.269 * (abs_mag - 23.))

  • How many galaxies have bulge_length > 1.0?

  • In our model the maximum bulge_length for an elliptical galaxy is 0.5 arcsec.

  • What is the mean value and the standard deviation of the bulge_length for elliptical galaxies? And for ellipticals with absolute magnitude brighter than -20?


In [ ]:
# %load -r 20-102 solutions/06_01_pandas.py
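
A possible sketch of the main steps (illustrative, not the contents of the solution file):

In [ ]:
# Mean and standard deviation of disk_length for spirals
print(galaxy_sample.loc[has_disk_mask, 'disk_length'].mean())
print(galaxy_sample.loc[has_disk_mask, 'disk_length'].std())

# bulge_length for ellipticals, from the absolute magnitude
galaxy_sample.loc[~has_disk_mask, 'bulge_length'] = np.exp(
    -1.145 - 0.269 * (galaxy_sample.loc[~has_disk_mask, 'abs_mag'] - 23.))

# How many galaxies have bulge_length > 1.0?
print((galaxy_sample['bulge_length'] > 1.0).sum())

# Cap bulge_length at the model maximum of 0.5 arcsec
too_large_mask = ~has_disk_mask & (galaxy_sample['bulge_length'] > 0.5)
galaxy_sample.loc[too_large_mask, 'bulge_length'] = 0.5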

Merge, join, and concatenate

https://pandas.pydata.org/pandas-docs/stable/merging.html

  • pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

  • concat method:

    pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
            keys=None, levels=None, names=None, verify_integrity=False,
            copy=True)

In [ ]:
df1 = pd.DataFrame(
    {'A': ['A0', 'A1', 'A2', 'A3'],
     'B': ['B0', 'B1', 'B2', 'B3'],
     'C': ['C0', 'C1', 'C2', 'C3'],
     'D': ['D0', 'D1', 'D2', 'D3']},
    index=[0, 1, 2, 3]
)

In [ ]:
df2 = pd.DataFrame(
    {'A': ['A4', 'A5', 'A6', 'A7'],
     'B': ['B4', 'B5', 'B6', 'B7'],
     'C': ['C4', 'C5', 'C6', 'C7'],
     'D': ['D4', 'D5', 'D6', 'D7']},
    index=[4, 5, 6, 7]
)

In [ ]:
df3 = pd.DataFrame(
    {'A': ['A8', 'A9', 'A10', 'A11'],
     'B': ['B8', 'B9', 'B10', 'B11'],
     'C': ['C8', 'C9', 'C10', 'C11'],
     'D': ['D8', 'D9', 'D10', 'D11']},
    index=[8, 9, 10, 11]
)

In [ ]:
frames = [df1, df2, df3]

In [ ]:
result = pd.concat(frames)
result

In [ ]:
# Multiindex
result = pd.concat(frames, keys=['x', 'y','z'])

In [ ]:
result

In [ ]:
result.index

In [ ]:
result.loc['y']

In [ ]:
df4 = pd.DataFrame(
    {'B': ['B2', 'B3', 'B6', 'B7'],
     'D': ['D2', 'D3', 'D6', 'D7'],
     'F': ['F2', 'F3', 'F6', 'F7']},
    index=[2, 3, 6, 7]
)

In [ ]:
df4

In [ ]:
df1

In [ ]:
result = pd.concat([df1, df4])
result

In [ ]:
result = pd.concat([df1, df4], axis=1)
result

In [ ]:
result = pd.concat([df1, df4], axis=1, join='inner')
result
  • Using the append method (note: DataFrame.append was later deprecated and has been removed from pandas in favor of pd.concat):

In [ ]:
df1

In [ ]:
df2

In [ ]:
result = df1.append(df2)
result

In [ ]:
df1

In [ ]:
df4

In [ ]:
result = df1.append(df4)
result

In [ ]:
result = pd.concat([df1,df4])
result
  • Note: Unlike the list.append method, which appends in place and returns nothing, DataFrame.append does not modify df1; it returns a new copy with the other frame appended.

In [ ]:
result = pd.concat([df1,df4], ignore_index=True)
result
  • ignore_index is also a valid argument to DataFrame.append:

In [ ]:
result = df1.append(df4, ignore_index = True)
result
  • Mixing dimensions

In [ ]:
df1

In [ ]:
s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
s1

In [ ]:
result = pd.concat([s1,df1])
result

In [ ]:
result = pd.concat([df1,s1], axis = 1)
result

In [ ]:
s2 = pd.Series(['_0', '_1', '_2', '_3'])

In [ ]:
result = pd.concat([df1,s2,s2,s2], axis = 1)
result

Exercise 3: Generate a random catalog using concat method

  • In this exercise we will use the concat method and show a basic example of a MultiIndex.

  • Given a subset of a few galaxies with the following properties ['halo_id', 'gal_id', 'ra', 'dec', 'z', 'abs_mag'], create a random catalog with 50 times more galaxies than the subset, keeping the properties of the galaxies but placing them randomly in the first octant of the sky (see the sketch after the data cell below).

  • The index of each galaxy is given by the tuple [halo_id, gal_id]


In [ ]:
data =  [
    # halo_id, gal_id, ra, dec, z, abs_mag
    [1, 1, 21.5, 30.1, 0.21, -21.2],
    [1, 2, 21.6, 29.0, 0.21, -18.3],
    [1, 3, 21.4, 30.0, 0.21, -18.5],
    [2, 1, 45.0, 45.0, 0.42, -20.4],
    [3, 1, 25.0, 33.1, 0.61, -21.2],
    [3, 2, 25.1, 33.2, 0.61, -20.3]
]

In [ ]:
# %load -r 103-145 solutions/06_01_pandas.py
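
One possible sketch (illustrative, not the contents of the solution file): replicate the subset with pd.concat and keys (which builds the MultiIndex), then redraw the positions uniformly in the first octant (0 <= ra, dec < 90):

In [ ]:
columns = ['halo_id', 'gal_id', 'ra', 'dec', 'z', 'abs_mag']
df_gal = pd.DataFrame(data, columns=columns).set_index(['halo_id', 'gal_id'])

n_copies = 50
random_cat = pd.concat([df_gal] * n_copies, keys=range(n_copies))

# Place every copy at a random position in the first octant of the sky
random_cat['ra'] = np.random.uniform(0., 90., size=len(random_cat))
random_cat['dec'] = np.random.uniform(0., 90., size=len(random_cat))
random_cat.head()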

Merge method: Database-style DataFrame joining/merging:

  • pandas has full-featured, high-performance in-memory join operations, idiomatically very similar to relational databases like SQL. These methods perform significantly better (in some cases well over an order of magnitude better) than other open-source implementations (like base::merge.data.frame in R). The reason for this is careful algorithmic design and the internal layout of the data in DataFrame.

  • See the cookbook for some advanced strategies

  • Users who are familiar with SQL but new to pandas might be interested in a comparison with SQL

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False)

Example: Merging dataframes using the merge method (thanks Nadia!)

  • Goal: build a dataframe by merging two dataframes with complementary information, through the relation given by a third dataframe.

    • df_star contains information on star magnitudes per sdss_star_id and per filter:
      • ['sdss_star_id', 'filter', 'expected_mag', 'expected_mag_err']
      • Note: the file is "somehow" corrupted and entries are duplicated several times
      • Unique entries are characterized by sdss_star_id and filter
    • df_spectra contains information on star flux per band (== filter) and per spec_id (!= sdss_star_id):
      • ['spec_id', 'band', 'flux', 'flux_err']
      • Unique entries are characterized by spec_id and band
    • df_starid_specid makes the correspondence between sdss_star_id (== objID) and spec_id (== specObjID):
      • ['objID', 'specObjID']
      • Unique entries are characterized by objID

In [ ]:
star_filename = '../resources/df_star.ssv'
spectra_filename = '../resources/df_spectra.ssv'
starid_specid_filename = '../resources/df_starid_specid.ssv'

In [ ]:
df_spectra = pd.read_csv(spectra_filename, index_col=['spec_id', 'band'], sep = ' ')
df_spectra.head(41)

In [ ]:
df_starid_specid = pd.read_csv(starid_specid_filename, sep=' ')
df_starid_specid.head(5)

In [ ]:
# Given that the file is somehow corrupted we open it without defining any index
df_star = pd.read_csv(star_filename, sep=' ')
df_star.head(10)

In [ ]:
df_star[(df_star['sdss_star_id'] == 1237653665258930303) & (df_star['filter'] == 'NB455')]

In [ ]:
# Drop duplicates:
df_star.drop_duplicates(subset = ['sdss_star_id', 'filter'], inplace= True)

In [ ]:
df_star[(df_star['sdss_star_id'] == 1237653665258930303) & (df_star['filter'] == 'NB455')]

In [ ]:
df_starid_specid.head(5)
  • We are going to reset the index and rename the columns in order to use the "on" argument:

In [ ]:
df_spectra.reset_index(inplace = True)
df_spectra.head()

In [ ]:
df_spectra.rename(columns={'band': 'filter'}, inplace = True)
df_spectra.head()

In [ ]:
df_starid_specid.rename(columns={'objID':'sdss_star_id', 'specObjID':'spec_id'}, inplace = True)
df_starid_specid.head()
  • Now we have everything ready to make the JOINs

In [ ]:
df_star_merged = pd.merge(df_star, df_starid_specid, on='sdss_star_id')

In [ ]:
df_star_merged.head()

In [ ]:
df_star_merged = pd.merge(df_star_merged, df_spectra, on=['spec_id','filter'])

In [ ]:
df_star_merged.head(40)

In [ ]:
df_star_merged.set_index(['sdss_star_id', 'filter'], inplace = True)

In [ ]:
df_star_merged.head()

In [ ]:
# Each element has been observed in how many bands?
count_bands = df_star_merged.groupby(level=0)['flux'].count()

In [ ]:
count_bands.head(20)

In [ ]:
df_star_merged.groupby(level=1)['flux_err'].mean().head(10)

Some general ideas to take home:

  • Do not loop a dataframe!

  • Try to work by chunks; create functions that work with chunks

  • Work with standard formats and "already implemented" functions

Caveats and technicalities

Floating point limitations:

  • Be careful with exact comparisons!

In [ ]:
# e.g.: the decimal value 0.1 cannot be represented exactly as a base 2 fraction
(0.1 + 0.2) == 0.3

In [ ]:
(0.1 + 0.2) - 0.3
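
A safer way is to compare within a tolerance, e.g. with numpy:

In [ ]:
import numpy as np
# True: the difference is below the default absolute/relative tolerances
np.isclose(0.1 + 0.2, 0.3)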

FITS files

  • fitsio

  • And working by chunks


In [ ]:
import fitsio

In [ ]:
filename = '../resources/galaxy_sample.fits'

In [ ]:
fits=fitsio.FITS(filename)

In [ ]:
data = fits[1]

In [ ]:
# Number of rows
data.get_nrows()

In [ ]:
# chunk size
gal_chunk = 300000

In [ ]:
# e.g. to create the chunk ranges
import math
niter = int(math.ceil(data.get_nrows() / float(gal_chunk)))

for i in range(niter):
    s = i*gal_chunk
    f = min((i+1)*gal_chunk, data.get_nrows())
    chunk = data[s:f]
    print (i)
    print (type(chunk))
    print (chunk.dtype)
    df_chunk = pd.DataFrame(chunk)
    print (type(df_chunk))
    print (df_chunk.dtypes)
    df_chunk = df_chunk.set_index('unique_gal_id')
    print (df_chunk.head())

.values DataFrame attribute

  • Some scipy functions do not accept pandas dataframes as arguments, so it is useful to use the values attribute, which is the NumPy representation of the NDFrame (in recent pandas versions, to_numpy() is the recommended method)

  • The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say, if the dtypes (even of numeric types) are mixed, the one that accommodates all of them will be chosen. Use this with care if you are not dealing with homogeneous blocks.
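
A small demonstration of the implicit upcasting (a sketch with made-up columns):

In [ ]:
df_mix = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})
print(df_mix.dtypes)        # a is int64, b is float64
print(df_mix.values.dtype)  # float64: the dtype that accommodates both columns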

Malformed input example:

  • a .dat file

  • Look at the file using e.g. the head bash command

  • Note that fields are separated by more than one space and, if you run tail on the file, by a varying number of spaces

In [ ]:
bad_filename = '../resources/steps.flagship.dat'

In [ ]:
df_bad = pd.read_csv(bad_filename)
df_bad.head()

In [ ]:
df_bad =  pd.read_csv(bad_filename, sep = ' ')
  • It is necessary to "modify" the file to convert it into a standard CSV file, e.g.:
cat steps.flagship.dat | tr -s " " | sed 's/^ *//g' > steps.flagship.ssv
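
Alternatively, in recent pandas versions read_csv can treat any run of whitespace as a single separator, so the original .dat file can be read without preprocessing:

In [ ]:
df = pd.read_csv(bad_filename, sep=r'\s+', header=None,
                 names=['step_num', 'r_min', 'r_max', 'r_med', 'a_med', 'z_med'],
                 index_col='step_num')
df.head()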

In [ ]:
filename = '../resources/steps.flagship.ssv'

In [ ]:
columns = ['step_num', 'r_min', 'r_max', 'r_med', 'a_med', 'z_med']

In [ ]:
df = pd.read_csv(filename, sep = ' ', header = None, names = columns, index_col = 'step_num')

In [ ]:
df.head()