In [1]:
# Importing modules
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_context("notebook")
import matplotlib
matplotlib.rc("text", usetex=False)
In [2]:
np.random.seed(1)
np.random.random(10)
Out[2]:
In [3]:
series_1 = pd.Series(np.random.random(10))
series_1
Out[3]:
One can get a NumPy array from the Series by typing:
In [4]:
series_1.values
Out[4]:
One can also get the indices of each element by typing:
In [5]:
series_1.index.values
Out[5]:
One can also have a custom set of indices:
In [6]:
# Alternatively, one could build this list from the string module:
# import string
# alphabet = list(string.ascii_lowercase[:10])
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
alphabet
Out[6]:
In [7]:
series_2 = pd.Series(np.random.random(len(alphabet)), index=alphabet)
series_2
Out[7]:
One can select only a subsample of the Series:
In [8]:
series_1[[0, 1, 2]]
Out[8]:
In [9]:
series_1[[1,3,4]]
Out[9]:
In [10]:
series_2[['a','d','j']]
Out[10]:
You can also perform numerical operations on a Series
In [11]:
series_1**2
Out[11]:
In [12]:
series_1[1]**2
Out[12]:
Or select only the values within some range, e.g. between some value 'x' and 0.8
In [13]:
x = 0.5
series_1[(series_1 >= x) & (series_1 < 0.8)]
Out[13]:
You can apply functions to a Series, and save the result as a new Series
In [14]:
import sys

def exponentials(arr, basis=10.):
    """
    Uses the array `arr` as the exponents for `basis`

    Parameters
    ----------
    arr: numpy array, list, pandas Series; shape (N,)
        array to be used as exponents of `basis`
    basis: int or float, optional (default = 10)
        number used as the basis

    Returns
    -------
    exp_arr: numpy array, pandas Series, or list; shape (N,)
        array of values for `basis`**`arr`
    """
    if isinstance(arr, list):
        exp_arr = [basis**x for x in arr]
        return exp_arr
    elif isinstance(arr, (np.ndarray, pd.Series)):
        exp_arr = basis**arr
        return exp_arr
    else:
        cmd = ">>>> `arr` is not a list, numpy array, or pandas Series"
        cmd += "\n>>>> Please give the correct type of object"
        print(cmd)
        sys.exit(1)
In [15]:
exponentials(series_1[(series_1 >= x) & (series_1 > 0.6)]).values
Out[15]:
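For simple element-wise transformations like this one, Series.apply offers a one-line alternative; here is a minimal sketch reproducing exponentials with the default basis of 10:
In [ ]:
# Apply a lambda to every element (same result as exponentials() with basis 10)
series_1[(series_1 >= x) & (series_1 > 0.6)].apply(lambda val: 10.**val)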
You can also create a Series using a dictionary (we talked about these in Week 4)
In [16]:
labels_arr = ['foo', 'bar', 'baz']
data_arr = [100, 200, 300]
dict_1 = dict(zip(labels_arr, data_arr))
dict_1
Out[16]:
In [17]:
series_3 = pd.Series(dict_1)
series_3
Out[17]:
One of the most useful features of pandas is that it can handle missing data quite easily:
In [18]:
index = ['foo', 'bar', 'baz', 'qux']
series_4 = pd.Series(dict_1, index=index)
series_4
Out[18]:
In [19]:
pd.isnull(series_4)
Out[19]:
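If you would rather discard the missing entries, dropna removes them; a short sketch:
In [ ]:
# Remove the missing entry ('qux') from the Series
series_4.dropna()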
In [20]:
series_3
Out[20]:
In [21]:
series_3 + series_4
Out[21]:
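The addition produces NaN for 'qux', since that label is missing from series_3 and holds NaN in series_4. One way to get a numeric result anyway is to fill the missing values afterwards, e.g. with fillna (the replacement value 0 is just an illustrative choice):
In [ ]:
# Replace the NaN produced by the addition with 0 (illustrative value only)
(series_3 + series_4).fillna(0)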
So a Series is powerful, but the DataFrame is probably what gets used the most.
A DataFrame is a "tabular data structure" containing an ordered collection of columns and rows. Each column can have a different data type.
Row and column operations are treated roughly symmetrically. One can obtain a DataFrame from an ordinary dictionary, or by reading a file with columns and rows.
Creating a DataFrame
In [22]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
'year' : [2012, 2013, 2014, 2014, 2015],
'popu' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1
Out[22]:
This DataFrame has 5 rows and 3 columns named "popu", "state", and "year".
The way to access a DataFrame is quite similar to that of accessing a Series.
To access a column, one writes the name of the column, as in the following examples:
In [23]:
df_1['popu']
Out[23]:
In [24]:
df_1.popu
Out[24]:
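As mentioned above, each column keeps its own data type, which you can inspect with the dtypes attribute:
In [ ]:
# Data type of each column: 'state' holds strings (object), 'year' integers, 'popu' floats
df_1.dtypes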
One can also handle missing data with DataFrames. As with Series, columns that are not present in the data are filled with NaN:
In [25]:
df_2 = pd.DataFrame(data_1, columns=['year', 'state', 'popu', 'unempl'])
df_2
Out[25]:
In [26]:
df_2['state']
Out[26]:
One can retrieve rows by position using iloc:
In [27]:
df_2.iloc[1:4]
Out[27]:
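The label-based counterpart is .loc. With the default integer index, the labels coincide with the positions, but note that a .loc slice includes the end label:
In [ ]:
# Label-based row selection: rows with labels 1 through 3 (end label included)
df_2.loc[1:3]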
Editing a DataFrame is quite easy to do. One can assign a Series to a column of the DataFrame, and pandas aligns it on the index. If one assigns a list or an array instead, its length must match the length of the DataFrame.
In [28]:
unempl = pd.Series([1.0, 2.0, 10.], index=[1,3,5])
unempl
Out[28]:
In [29]:
df_2['unempl'] = unempl
df_2
Out[29]:
In [30]:
df_2.unempl.isnull()
Out[30]:
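Those missing values can be replaced, for instance with fillna; a short sketch using 0.0 as an illustrative replacement:
In [ ]:
# Replace the missing unemployment values with 0.0 (illustrative only)
df_2['unempl'].fillna(0.0)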
You can also transpose a DataFrame, i.e. swap its rows and columns
In [31]:
df_2.T
Out[31]:
Now, let's say you want to show only the 'year' and 'unempl' columns. You can do it by:
In [32]:
df_2
Out[32]:
In [33]:
df_2[['year', 'unempl']]
Out[33]:
Let's say you only need a subset of the table you have, and you want to drop a column from the DataFrame. You can do that by using the 'drop' option:
In [34]:
df_2
Out[34]:
In [35]:
df_3 = df_2.drop('unempl', axis=1)
df_3
Out[35]:
In [36]:
df_2
Out[36]:
You can also drop certain rows:
In [37]:
df_2
Out[37]:
In [38]:
df_4 = df_2.drop([1,2])
df_4
Out[38]:
Look at this carefully! The DataFrame preserved the same indices as df_2.
If you want to reset the indices, you can do that by:
In [39]:
df_4.reset_index(inplace=True)
df_4
Out[39]:
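Notice that reset_index kept the old index as a new column named 'index'. Passing drop=True discards it instead (we will use this option later with the Gaia data); a minimal sketch starting again from df_2:
In [ ]:
# Same row removal, but the old index is discarded rather than kept as a column
df_2.drop([1, 2]).reset_index(drop=True)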
Pandas is great at reading data tables, CSV files, and other kinds of documents. For the remainder of this notebook, we will be using the Gaia DR2 catalogue.
In [40]:
# Path to online file
url_path = 'http://cdn.gea.esac.esa.int/Gaia/gdr2/gaia_source/csv/GaiaSource_1000172165251650944_1000424567594791808.csv.gz'
# Converting data to DataFrame
gaia_df = pd.read_csv(url_path, compression='gzip')
In [41]:
gaia_df.head()
Out[41]:
You can get the shape of the "gaia_df" DataFrame by typing:
In [42]:
gaia_df.shape
Out[42]:
That means there are 14209 rows and 94 columns.
To get an array of the columns available, one could write:
In [43]:
# np.sort returns a sorted copy of the column names
# (calling .sort() on `.values` would modify the index in place)
np.sort(gaia_df.columns.values)
Out[43]:
Let's say you only want a DataFrame with the columns 'ra', 'dec', 'l', and 'b'.
You do this by using the loc option for the DataFrame:
In [44]:
gaia_df_2 = gaia_df.loc[:,['ra','dec','l','b']]
# Displaying the first 15 lines
gaia_df_2.head(15)
Out[44]:
This selects all of the rows, and only the selected columns in the list.
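loc also accepts a row selection together with the column list; a short sketch (the label slice 0:4 is inclusive of the end label):
In [ ]:
# Rows with labels 0 through 4 (inclusive) and only the 'ra' and 'dec' columns
gaia_df.loc[0:4, ['ra', 'dec']]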
You can also select only a subsample of the rows, as in the following example. Let's say I just want a random subsample of 10% of the sources in the Gaia DR2 catalogue. I can do that by:
In [45]:
import random
random.sample
Out[45]:
In [46]:
# Randomly selecting rows from the DataFrame
# Number of rows to keep (10% of the total)
nrows = len(gaia_df_2) // 10
# Randomly selecting `nrows` rows from `gaia_df_2`
gaia_df_3 = gaia_df_2.sample(nrows)
gaia_df_3.shape
Out[46]:
I'm resetting the indices of this DataFrame
In [47]:
gaia_df_3.reset_index(inplace=True, drop=True)
gaia_df_3
Out[47]:
You can produce plots directly from the DataFrame
In [48]:
title_txt = 'Right Ascension and Declination for Gaia'
gaia_df_3.plot('ra','dec', # Columns to plot
kind='scatter', # Kind of plot. In this case, it's `scatter`
label='Gaia', # Label of the points
title=title_txt, # Title of the figure
color='#4c72b0', # Color of the points
figsize=(12,8)) # Size of the figure
Out[48]:
Or even Scatterplot Matrices:
In [49]:
sns.pairplot(gaia_df_3, plot_kws={'color': '#4c72b0'}, diag_kws={'color': '#4c72b0'})
Out[49]:
In [50]:
sns.jointplot(x='l', y='b', data=gaia_df_3, color='#3c8f40')
Out[50]:
In [51]:
ra_arr = gaia_df.ra.values
dec_arr = gaia_df.dec.values
In [52]:
# Just showing the first 25 elements
np.column_stack((ra_arr, dec_arr))[0:25]
Out[52]:
In [53]:
## Numpy way of finding the stars that meet the criteria
ra_min, ra_max = (102, 104)
dec_min, dec_max = (56.4, 56.7)
# RA criteria
ra_idx = np.where((ra_arr >= ra_min) & (ra_arr <= ra_max))[0]
# Dec criteria
dec_idx = np.where((dec_arr >= dec_min) & (dec_arr <= dec_max))[0]
# Finding intersecting indices that meet both criteria
radec_idx = np.intersect1d(ra_idx, dec_idx)
# Selecting the values from only those indices
ra_new = ra_arr[radec_idx]
dec_new = dec_arr[radec_idx]
# Printing out ra and dec for corresponding indices
print(np.column_stack((ra_new, dec_new)))
This is rather convoluted and long, and one can easily make a mistake by losing track of which arrays one is using!
In Pandas, this is much easier!
In [54]:
gaia_df_4 = gaia_df.loc[(
(gaia_df.ra >= ra_min) & (gaia_df.ra <= ra_max) &
(gaia_df.dec >= dec_min) & (gaia_df.dec <= dec_max))]
gaia_df_4[['ra','dec']]
Out[54]:
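The same selection can also be written with DataFrame.query, which takes the condition as a string and can reference local variables with the '@' prefix; a minimal sketch:
In [ ]:
# Equivalent selection with query(); `@` refers to local Python variables
gaia_df.query('ra >= @ra_min and ra <= @ra_max and dec >= @dec_min and dec <= @dec_max')[['ra', 'dec']]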
Pandas is great for handling data, especially comma-delimited or space-separated data. Pandas is also compatible with many other packages, like seaborn, astropy, NumPy, etc.
We will have another lecture on Pandas that will cover much more advanced aspects of Pandas. Make sure you keep checking the schedule!