Week 08 - Introduction to Pandas

Today's Agenda

  • Pandas: Introduction
    • Series
    • DataFrames
    • Indexing, Selecting, Filtering
    • Drop columns
    • Handling missing Data

In [1]:
# Importing modules
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_context("notebook")
import matplotlib
matplotlib.rc("text", usetex=False)

Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels. One can use any NumPy data type to assign to the Series

Creating a Series:


In [2]:
np.random.seed(1)

np.random.random(10)


Out[2]:
array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,
       1.46755891e-01, 9.23385948e-02, 1.86260211e-01, 3.45560727e-01,
       3.96767474e-01, 5.38816734e-01])

In [3]:
series_1 = pd.Series(np.random.random(10))
series_1


Out[3]:
0    0.419195
1    0.685220
2    0.204452
3    0.878117
4    0.027388
5    0.670468
6    0.417305
7    0.558690
8    0.140387
9    0.198101
dtype: float64

One can get a NumPy array from the Series, by typing:


In [4]:
series_1.values


Out[4]:
array([0.41919451, 0.6852195 , 0.20445225, 0.87811744, 0.02738759,
       0.67046751, 0.4173048 , 0.55868983, 0.14038694, 0.19810149])

Reindexing

One can also get the indices of each element, by typing:


In [5]:
series_1.index.values


Out[5]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

One can also have a custom set of indices:


In [6]:
# import string
# alphabet = string.lowercase
# alphabet = np.array([x for x in alphabet])[0:10]
# alphabet

alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
alphabet


Out[6]:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [7]:
series_2 = pd.Series(np.random.random(len(alphabet)), index=alphabet)
series_2


Out[7]:
a    0.800745
b    0.968262
c    0.313424
d    0.692323
e    0.876389
f    0.894607
g    0.085044
h    0.039055
i    0.169830
j    0.878143
dtype: float64

One can select only a subsample of the Series


In [8]:
series_1[[0, 1, 2]]


Out[8]:
0    0.419195
1    0.685220
2    0.204452
dtype: float64

In [9]:
series_1[[1,3,4]]


Out[9]:
1    0.685220
3    0.878117
4    0.027388
dtype: float64

In [10]:
series_2[['a','d','j']]


Out[10]:
a    0.800745
d    0.692323
j    0.878143
dtype: float64

Arithmetic and function Mapping

You can also perform numerical expressions


In [11]:
series_1**2


Out[11]:
0    0.175724
1    0.469526
2    0.041801
3    0.771090
4    0.000750
5    0.449527
6    0.174143
7    0.312134
8    0.019708
9    0.039244
dtype: float64

In [12]:
series_1[1]**2


Out[12]:
0.4695257637239847

Or find values greater than some value 'x'


In [13]:
x = 0.5
series_1[(series_1 >= x) & (series_1 < 0.8)]


Out[13]:
1    0.685220
5    0.670468
7    0.558690
dtype: float64

You can apply functions to a column, and save it as a new Series


In [14]:
import sys
def exponentials(arr, basis=10.):
    """
    Uses the array `arr` as the exponents for `basis`
    
    Parameters
    ----------
    arr: numpy array, list, pandas Series; shape (N,)
        array to be used as exponents of `basis`
    
    power: int or float, optional (default = 10)
        number used as the basis
    
    Returns
    -------
    exp_arr: numpy array or list, shape (N,)
        array of values for `basis`**`arr`
    """
    if isinstance(arr, list):
        exp_arr = [basis**x for x in arr]
        return exp_arr        
    elif isinstance(arr, np.ndarray) or isinstance(arr, pd.core.series.Series):
        exp_arr = basis**arr
        return exp_arr
    else:
        cmd = ">>>> `arr` is not a list nor a numpy array"
        cmd +="\n>>>> Please give the correct type of object"
        print(cmd)
        sys.exit(1)

In [15]:
exponentials(series_1[(series_1 >= x) & (series_1 > 0.6)]).values


Out[15]:
array([4.84417139, 7.55296438, 4.68238921])

You can also create a Series using a dictionary (we talked about these on Week 4)


In [16]:
labels_arr = ['foo', 'bar', 'baz']
data_arr   = [100, 200, 300]
dict_1     = dict(zip(labels_arr, data_arr))
dict_1


Out[16]:
{'foo': 100, 'bar': 200, 'baz': 300}

In [17]:
series_3 = pd.Series(dict_1)
series_3


Out[17]:
foo    100
bar    200
baz    300
dtype: int64

Handling Missing Data

One of the most useful features of pandas is that it can handle missing data quite easily:


In [18]:
index = ['foo', 'bar', 'baz', 'qux']
series_4 = pd.Series(dict_1, index=index)
series_4


Out[18]:
foo    100.0
bar    200.0
baz    300.0
qux      NaN
dtype: float64

In [19]:
pd.isnull(series_4)


Out[19]:
foo    False
bar    False
baz    False
qux     True
dtype: bool

In [20]:
series_3


Out[20]:
foo    100
bar    200
baz    300
dtype: int64

In [21]:
series_3 + series_4


Out[21]:
bar    400.0
baz    600.0
foo    200.0
qux      NaN
dtype: float64

So using a Series is powerful, but DataFrames are probably what gets used the most since it represents a tabular data structure containing an ordered collection of columns and rows.

DataFrames

A DataFrame is a "tabular data structure" containing an ordered collection of columns. Each column can a have a different data type.

Row and column operations are treated roughly symmetrically. One can obtain a DataFrame from a normal dictionary, or by reading a file with columns and rows.

Creating a DataFrame


In [22]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'popu' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1


Out[22]:
state year popu
0 VA 2012 5.0
1 VA 2013 5.1
2 VA 2014 5.2
3 MD 2014 4.0
4 MD 2015 4.1

This DataFrame has 4 rows and 3 columns by the name "pop", "state", and "year".

The way to access a DataFrame is quite similar to that of accessing a Series.
To access a column, one writes the name of the column, as in the following example:


In [23]:
df_1['popu']


Out[23]:
0    5.0
1    5.1
2    5.2
3    4.0
4    4.1
Name: popu, dtype: float64

In [24]:
df_1.popu


Out[24]:
0    5.0
1    5.1
2    5.2
3    4.0
4    4.1
Name: popu, dtype: float64

One can also handle missing data with DataFrames. Like Series, columns that are not present in the data are NaNs:


In [25]:
df_2 = pd.DataFrame(data_1, columns=['year', 'state', 'popu', 'unempl'])
df_2


Out[25]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 NaN
2 2014 VA 5.2 NaN
3 2014 MD 4.0 NaN
4 2015 MD 4.1 NaN

In [26]:
df_2['state']


Out[26]:
0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object

One can retrieve a row by:


In [27]:
df_2.iloc[1:4]


Out[27]:
year state popu unempl
1 2013 VA 5.1 NaN
2 2014 VA 5.2 NaN
3 2014 MD 4.0 NaN

Editing a DataFrame is quite easy to do. One can assign a Series to a column of the DataFrame. If the Series is a list or an array, the length must match the DataFrame.


In [28]:
unempl = pd.Series([1.0, 2.0, 10.], index=[1,3,5])
unempl


Out[28]:
1     1.0
3     2.0
5    10.0
dtype: float64

In [29]:
df_2['unempl'] = unempl
df_2


Out[29]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 1.0
2 2014 VA 5.2 NaN
3 2014 MD 4.0 2.0
4 2015 MD 4.1 NaN

In [30]:
df_2.unempl.isnull()


Out[30]:
0     True
1    False
2     True
3    False
4     True
Name: unempl, dtype: bool

You can also transpose a DataFrame, i.e. switch rows by columns, and columns by rows


In [31]:
df_2.T


Out[31]:
0 1 2 3 4
year 2012 2013 2014 2014 2015
state VA VA VA MD MD
popu 5 5.1 5.2 4 4.1
unempl NaN 1 NaN 2 NaN

Now, let's say you want to show only the 'year' and 'popu' columns. You can do it by:


In [32]:
df_2


Out[32]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 1.0
2 2014 VA 5.2 NaN
3 2014 MD 4.0 2.0
4 2015 MD 4.1 NaN

In [33]:
df_2[['year', 'unempl']]


Out[33]:
year unempl
0 2012 NaN
1 2013 1.0
2 2014 NaN
3 2014 2.0
4 2015 NaN

Dropping Entries

Let's say you only need a subsample of the table that you have, and you need to drop a column from the DataFrame. You can do that by using the 'drop' option:


In [34]:
df_2


Out[34]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 1.0
2 2014 VA 5.2 NaN
3 2014 MD 4.0 2.0
4 2015 MD 4.1 NaN

In [35]:
df_3 = df_2.drop('unempl', axis=1)
df_3

df_2.drop('unempl', axis=1)


Out[35]:
year state popu
0 2012 VA 5.0
1 2013 VA 5.1
2 2014 VA 5.2
3 2014 MD 4.0
4 2015 MD 4.1

In [36]:
df_2


Out[36]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 1.0
2 2014 VA 5.2 NaN
3 2014 MD 4.0 2.0
4 2015 MD 4.1 NaN

You can also drop certain rows:


In [37]:
df_2


Out[37]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 1.0
2 2014 VA 5.2 NaN
3 2014 MD 4.0 2.0
4 2015 MD 4.1 NaN

In [38]:
df_4 = df_2.drop([1,2])
df_4


Out[38]:
year state popu unempl
0 2012 VA 5.0 NaN
3 2014 MD 4.0 2.0
4 2015 MD 4.1 NaN

Look at this carefully! The DataFrame preserved the same indices as for df_2.

If you can to reset the indices, you can do that by:


In [39]:
df_4.reset_index(inplace=True)
df_4


Out[39]:
index year state popu unempl
0 0 2012 VA 5.0 NaN
1 3 2014 MD 4.0 2.0
2 4 2015 MD 4.1 NaN

Gaia Dataset

Pandas is great at reading Data tables and CSV files, and other kinds of documents. For the remainder of this notebook, we will be using the Gaia's DR1 catalogue.


In [40]:
# Path to online file
url_path = 'http://cdn.gea.esac.esa.int/Gaia/gdr2/gaia_source/csv/GaiaSource_1000172165251650944_1000424567594791808.csv.gz'

# Converting data to DataFrame
gaia_df = pd.read_csv(url_path, compression='gzip')

In [41]:
gaia_df.head()


Out[41]:
solution_id designation source_id random_index ref_epoch ra ra_error dec dec_error parallax ... e_bp_min_rp_val e_bp_min_rp_percentile_lower e_bp_min_rp_percentile_upper flame_flags radius_val radius_percentile_lower radius_percentile_upper lum_val lum_percentile_lower lum_percentile_upper
0 1635721458409799680 Gaia DR2 1000225938242805248 1000225938242805248 1197051105 2015.5 103.447529 0.041099 56.022025 0.045175 0.582790 ... 0.0595 0.0080 0.1351 200111.0 1.024730 1.017359 1.038814 1.075774 0.801798 1.349751
1 1635721458409799680 Gaia DR2 1000383512003001728 1000383512003001728 598525552 2015.5 105.187856 0.016978 56.267982 0.016904 1.385686 ... 0.2430 0.0830 0.4030 200111.0 1.388711 1.311143 1.453106 1.937890 1.852440 2.023341
2 1635721458409799680 Gaia DR2 1000274106300491264 1000274106300491264 299262776 2015.5 103.424758 0.464608 56.450903 0.582490 0.314035 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1635721458409799680 Gaia DR2 1000396156385741312 1000396156385741312 1148557518 2015.5 105.049751 0.838232 56.508777 0.744511 1.939951 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1635721458409799680 Gaia DR2 1000250024419296000 1000250024419296000 574278759 2015.5 103.352525 0.023159 56.395144 0.022836 0.747108 ... 0.2870 0.1196 0.4051 200111.0 1.507958 1.435618 1.540208 2.427377 2.152597 2.702158

5 rows × 94 columns

Shape, Columns and Rows

You can get the shape of the "gaia_df" DataFrame by typing:


In [42]:
gaia_df.shape


Out[42]:
(14209, 94)

That means there are 14209 rows and 94 columns.

To get an array of the columns available, one could write:


In [43]:
gaia_df.columns.values.sort()
gaia_df.columns.values


Out[43]:
array(['a_g_percentile_lower', 'a_g_percentile_upper', 'a_g_val',
       'astrometric_chi2_al', 'astrometric_excess_noise',
       'astrometric_excess_noise_sig', 'astrometric_gof_al',
       'astrometric_matched_observations', 'astrometric_n_bad_obs_al',
       'astrometric_n_good_obs_al', 'astrometric_n_obs_ac',
       'astrometric_n_obs_al', 'astrometric_params_solved',
       'astrometric_primary_flag', 'astrometric_pseudo_colour',
       'astrometric_pseudo_colour_error', 'astrometric_sigma5d_max',
       'astrometric_weight_al', 'b', 'bp_g', 'bp_rp', 'dec', 'dec_error',
       'dec_parallax_corr', 'dec_pmdec_corr', 'dec_pmra_corr',
       'designation', 'duplicated_source', 'e_bp_min_rp_percentile_lower',
       'e_bp_min_rp_percentile_upper', 'e_bp_min_rp_val', 'ecl_lat',
       'ecl_lon', 'flame_flags', 'frame_rotator_object_type', 'g_rp', 'l',
       'lum_percentile_lower', 'lum_percentile_upper', 'lum_val',
       'matched_observations', 'mean_varpi_factor_al', 'parallax',
       'parallax_error', 'parallax_over_error', 'parallax_pmdec_corr',
       'parallax_pmra_corr', 'phot_bp_mean_flux',
       'phot_bp_mean_flux_error', 'phot_bp_mean_flux_over_error',
       'phot_bp_mean_mag', 'phot_bp_n_obs', 'phot_bp_rp_excess_factor',
       'phot_g_mean_flux', 'phot_g_mean_flux_error',
       'phot_g_mean_flux_over_error', 'phot_g_mean_mag', 'phot_g_n_obs',
       'phot_proc_mode', 'phot_rp_mean_flux', 'phot_rp_mean_flux_error',
       'phot_rp_mean_flux_over_error', 'phot_rp_mean_mag',
       'phot_rp_n_obs', 'phot_variable_flag', 'pmdec', 'pmdec_error',
       'pmra', 'pmra_error', 'pmra_pmdec_corr', 'priam_flags', 'ra',
       'ra_dec_corr', 'ra_error', 'ra_parallax_corr', 'ra_pmdec_corr',
       'ra_pmra_corr', 'radial_velocity', 'radial_velocity_error',
       'radius_percentile_lower', 'radius_percentile_upper', 'radius_val',
       'random_index', 'ref_epoch', 'rv_nb_transits', 'rv_template_fe_h',
       'rv_template_logg', 'rv_template_teff', 'solution_id', 'source_id',
       'teff_percentile_lower', 'teff_percentile_upper', 'teff_val',
       'visibility_periods_used'], dtype=object)

Let's say you only want a DataFrame with the the colums:

  • ra (right ascension)
  • dec (declination)
  • l (galactic longitude)
  • b (galactic latitude)

You do this by using the loc option for the DataFrame:


In [44]:
gaia_df_2 = gaia_df.loc[:,['ra','dec','l','b']]

# Displaying the first 15 lines
gaia_df_2.head(15)


Out[44]:
ra dec l b
0 103.447529 56.022025 160.163475 22.533932
1 105.187856 56.267982 160.174346 23.534087
2 103.424758 56.450903 159.712110 22.635989
3 105.049751 56.508777 159.899324 23.518554
4 103.352525 56.395144 159.758838 22.582657
5 101.929791 55.973333 159.959619 21.705035
6 101.853926 56.129320 159.785705 21.709303
7 105.128850 56.285081 160.147553 23.506483
8 103.396330 56.714410 159.432230 22.690559
9 101.780437 55.945333 159.962507 21.616907
10 103.500366 56.844629 159.312220 22.779748
11 105.649481 56.632527 159.854367 23.868954
12 103.189617 56.815154 159.294467 22.607972
13 103.530144 55.988175 160.212164 22.569395
14 105.938862 56.588687 159.941603 24.013615

This selects all of the rows, and only the selected columns in the list.

You can also select only a subsample of the rows as well, as in the following example. Let's say I just want a random subsample of 10% of the galaxies in the Gaia DR1 catalogue. I can do that by:


In [45]:
import random
random.sample


Out[45]:
<bound method Random.sample of <random.Random object at 0x7fb45d83e618>>

In [46]:
# Decission indices to select from DataFrame
import random

# Number of rows
nrows = len(gaia_df_2)

# Randomly selecting `nrows` from `gaia_df_2`
gaia_df_3 = gaia_df_2.sample(nrows)

gaia_df_3.shape


Out[46]:
(14209, 4)

I'm re-normalizing the indices of this DataFrame


In [47]:
gaia_df_3.reset_index(inplace=True, drop=True)
gaia_df_3


Out[47]:
ra dec l b
0 102.023218 56.433992 159.500110 21.886331
1 105.272250 56.502381 159.938521 23.636142
2 105.087569 56.445058 159.972272 23.523299
3 103.359287 56.193791 159.970039 22.532502
4 103.240760 56.630222 159.495644 22.585832
5 103.222515 56.549474 159.577022 22.554616
6 103.361986 56.583872 159.563251 22.637842
7 103.775851 56.168097 160.063471 22.749245
8 101.836603 56.119531 159.792805 21.697240
9 102.678893 56.936834 159.086470 22.371832
10 105.667572 56.619706 159.870549 23.875620
11 102.236439 55.945081 160.041579 21.861470
12 102.433467 56.518693 159.480950 22.127565
13 105.105611 56.486566 159.931001 23.543060
14 105.362118 56.534916 159.917022 23.692076
15 102.368405 56.455920 159.535245 22.075601
16 103.676978 56.307551 159.901923 22.732851
17 102.504898 57.212852 158.771444 22.355833
18 101.804305 55.852613 160.062130 21.602850
19 102.119267 56.169582 159.789621 21.862448
20 102.071402 56.730196 159.201815 21.995346
21 102.774558 56.830031 159.212977 22.393163
22 102.137264 56.350726 159.605436 21.923322
23 103.921963 56.655100 159.575937 22.953848
24 102.932901 56.335646 159.753100 22.342795
25 105.385988 56.643913 159.804840 23.730851
26 105.542361 56.377615 160.109832 23.751272
27 102.647940 56.268813 159.775683 22.172426
28 103.711527 56.349395 159.863591 22.762315
29 103.597854 56.158071 160.045628 22.650994
... ... ... ... ...
14179 102.489214 56.081397 159.943479 22.035450
14180 102.147983 56.711074 159.234323 22.030321
14181 103.716786 56.159878 160.062696 22.715352
14182 105.252391 56.393553 160.050906 23.599250
14183 102.596930 56.310072 159.724415 22.156631
14184 105.268657 56.207321 160.250440 23.562900
14185 102.044837 56.476426 159.459891 21.909801
14186 102.284147 56.917347 159.042894 22.159330
14187 102.602635 57.186376 158.814544 22.399603
14188 102.825593 56.441695 159.625292 22.314552
14189 101.982163 56.314866 159.616222 21.830699
14190 102.244534 56.641037 159.322869 22.061669
14191 102.608435 57.217545 158.783009 22.411067
14192 102.743681 57.158952 158.865450 22.465819
14193 102.777462 56.444267 159.614761 22.289662
14194 105.891445 56.646945 159.872951 24.001706
14195 102.788402 56.867407 159.176279 22.410572
14196 102.019136 56.186270 159.755308 21.813762
14197 103.381823 56.792650 159.348148 22.703505
14198 102.970067 56.429717 159.661250 22.388156
14199 101.940300 55.874246 160.063572 21.682161
14200 102.951998 56.286436 159.807412 22.339588
14201 103.637537 56.235559 159.970962 22.692768
14202 102.089903 55.881491 160.082031 21.764630
14203 103.215704 56.428451 159.702186 22.518617
14204 102.642189 56.958484 159.058057 22.358449
14205 102.233121 57.066272 158.880057 22.173822
14206 104.919145 56.263134 160.139589 23.388225
14207 103.295677 56.738090 159.391665 22.643558
14208 101.923237 56.098917 159.828998 21.737590

14209 rows × 4 columns

You can produce plots directly from the DataFrame


In [48]:
title_txt = 'Right Ascension and Declination for Gaia'

gaia_df_3.plot('ra','dec',       # Columns to plot
               kind='scatter',   # Kind of plot. In this case, it's `scatter`
               label='Gaia',     # Label of the points
               title=title_txt,  # Title of the figure
               color='#4c72b0',  # Color of the points
               figsize=(12,8))  # Size of the fiure


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a16cd80f0>

Or even Scatterplot Matrices:


In [49]:
sns.pairplot(gaia_df_3, plot_kws={'color': '#4c72b0'}, diag_kws={'color': '#4c72b0'})


Out[49]:
<seaborn.axisgrid.PairGrid at 0x1a1760b438>

In [50]:
sns.jointplot(gaia_df_3['l'], gaia_df_3['b'], color='#3c8f40')


Out[50]:
<seaborn.axisgrid.JointGrid at 0x1a193999e8>

Indexing, Selecting, Filtering Data

Now I want to filter the data based on ra and dec:

I want to select all the stars within:

  • 45 < RA < 50
  • 5 < Dec < 10

Normally, you would could do in numpy using the np.where function, like in the following example:


In [51]:
ra_arr = gaia_df.ra.values
dec_arr = gaia_df.dec.values

In [52]:
# Just showing the first 25 elements
np.column_stack((ra_arr, dec_arr))[0:25]


Out[52]:
array([[103.44752895,  56.02202543],
       [105.18785594,  56.2679821 ],
       [103.42475813,  56.45090293],
       [105.04975071,  56.50877738],
       [103.35252488,  56.39514381],
       [101.92979073,  55.97333308],
       [101.85392576,  56.12931976],
       [105.12884963,  56.28508092],
       [103.39632957,  56.7144103 ],
       [101.78043734,  55.94533326],
       [103.50036565,  56.84462941],
       [105.64948082,  56.63252739],
       [103.18961712,  56.81515376],
       [103.53014423,  55.98817459],
       [105.93886175,  56.58868695],
       [102.24453393,  56.64103702],
       [102.02432422,  56.0158414 ],
       [103.0848673 ,  56.25264172],
       [102.30172541,  56.6658301 ],
       [103.28806439,  56.2536194 ],
       [103.05853189,  56.72134178],
       [102.0578448 ,  56.39003547],
       [103.71357369,  56.62167297],
       [103.37050966,  56.11391562],
       [103.44554963,  56.29543043]])

In [53]:
## Numpy way of finding the stars that meet the criteria

ra_min, ra_max = (102, 104)
dec_min, dec_max = (56.4, 56.7)

# RA critera
ra_idx = np.where((ra_arr >= ra_min) & (ra_arr <= ra_max))[0]

# Dec criteria
dec_idx = np.where((dec_arr >= dec_min) & (dec_arr <= dec_max))[0]

# Finding `intersecting' indices that meet both criteria
radec_idx = np.intersect1d(ra_idx, dec_idx)

# Selecting the values from only those indices
ra_new = ra_arr[radec_idx]
dec_new = dec_arr[radec_idx]

# Printing out ra and dec for corresponding indices
print(np.column_stack((ra_new, dec_new)))


[[103.42475813  56.45090293]
 [102.24453393  56.64103702]
 [102.30172541  56.6658301 ]
 ...
 [103.81978156  56.62006303]
 [103.31712396  56.61945699]
 [103.57468884  56.4318757 ]]

This is rather convoluted and long, and one can easily make a mistake if s/he doesn't keep track of which arrays s/he is using!

In Pandas, this is much easier!!


In [54]:
gaia_df_4 = gaia_df.loc[(
                (gaia_df.ra >= ra_min) & (gaia_df.ra <= ra_max) &
                (gaia_df.dec >= dec_min) & (gaia_df.dec <= dec_max))]
gaia_df_4[['ra','dec']]


Out[54]:
astrometric_excess_noise_sig astrometric_matched_observations
2 103.424758 56.450903
15 102.244534 56.641037
18 102.301725 56.665830
22 103.713574 56.621673
32 103.462653 56.532017
35 103.360486 56.413851
36 103.676326 56.623116
39 103.085718 56.447884
40 103.783947 56.521326
46 102.206913 56.640966
54 103.726192 56.673523
57 102.069819 56.405446
63 102.920433 56.430698
64 102.975403 56.593006
69 103.870145 56.684652
73 103.201988 56.538079
74 102.256910 56.592725
78 103.819070 56.577742
79 103.797238 56.403185
81 102.342651 56.599794
84 103.661779 56.492773
89 103.637186 56.622849
92 102.241080 56.684968
98 102.843411 56.658409
103 102.898776 56.470195
105 103.480558 56.617093
110 103.236574 56.509793
112 103.633321 56.650588
113 103.776568 56.622350
114 102.260166 56.659839
... ... ...
14066 102.337580 56.565822
14068 103.495447 56.455223
14069 103.408320 56.628237
14071 103.436448 56.564892
14075 102.350965 56.622774
14078 103.267697 56.473831
14080 102.227288 56.665774
14081 102.962817 56.594834
14088 103.021430 56.620846
14090 103.863479 56.469286
14104 102.954900 56.594641
14114 103.442994 56.647008
14119 102.902805 56.627175
14130 102.974802 56.451924
14135 103.838059 56.442466
14138 102.619619 56.504853
14139 103.687508 56.489265
14141 103.495952 56.632908
14142 103.412026 56.620438
14143 102.714416 56.661466
14161 102.844313 56.595309
14167 103.724074 56.430243
14168 103.211444 56.645173
14173 102.584448 56.539904
14180 103.682558 56.585142
14195 102.307805 56.625449
14200 103.733646 56.568726
14201 103.819782 56.620063
14204 103.317124 56.619457
14207 103.574689 56.431876

3156 rows × 2 columns

Future of Pandas

Pandas is a great for handling data, especially comma-delimited or space-separated data. Pandas is also compatible with many other packages, like seaborn, astropy, NumPy, etc.

We will have another lecture on Pandas that will cover much more advanced aspects of Pandas. Make sure you keep checking the schedule!