Import these first (I auto-import them every time):
In [4]:
#! cat /Users/gully/.ipython/profile_default/startup/start.ipy
In [5]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
In [6]:
import os
In [ ]:
for i in range(16):
    fn = 'http://cdn.gea.esac.esa.int/Gaia/tgas_source/csv/TgasSource_000-000-{:03d}.csv.gz'.format(i)
    executable = 'wget ' + fn
    print(executable)
    # os.system(executable)  # Uncomment to actually download
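If wget isn't available on your system, a pure-Python sketch using only the standard library does the same thing:
In [ ]:
from urllib.request import urlretrieve

for i in range(16):
    url = 'http://cdn.gea.esac.esa.int/Gaia/tgas_source/csv/TgasSource_000-000-{:03d}.csv.gz'.format(i)
    # urlretrieve(url, url.split('/')[-1])  # Uncomment to actually download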
In [5]:
#! mv Tgas* ../data
In [6]:
#! gzip -d ../data/Tgas*
In [7]:
! ls ../data/Tgas*
Compare to a file from the full Gaia source catalog (downloaded in the previous notebook, or manually):
In [8]:
#! wget http://cdn.gea.esac.esa.int/Gaia/gaia_source/csv/GaiaSource_000-000-000.csv.gz
#! mv GaiaSource_000-000-000.csv.gz ../data/
In [9]:
! ls ../data/GaiaSource*
In [10]:
import pandas as pd
In [11]:
%time t000 = pd.read_csv('../data/TgasSource_000-000-000.csv')
In [12]:
%time g000 = pd.read_csv('../data/GaiaSource_000-000-000.csv')
In [13]:
set(t000.columns) - set(g000.columns)
Out[13]:
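That set difference lists the columns unique to TGAS. As a quick sanity check, the reverse difference shows whether the full catalog has any columns that TGAS lacks:
In [ ]:
set(g000.columns) - set(t000.columns)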
TGAS is just the subset of sources with parallaxes and proper motions available; the full Gaia source catalog has only positions and magnitudes, but for over a billion sources:
#GaiaDR1 details: 1 billion stars w/ position + magnitude; 2 million stars w/ pos + mag + parallax + proper motion; 3194 variable stars; 2152 quasars
— ESA Science (@esascience) September 14, 2016
In [14]:
len(t000), len(g000)
Out[14]:
In [15]:
p_i = t000.parallax == t000.parallax  # NaN != NaN, so this keeps only rows with a measured parallax
tp000 = t000[p_i]
p_i.sum()
Out[15]:
In [16]:
p_i = g000.parallax == g000.parallax  # same NaN-dropping trick for the full Gaia source file
gp000 = g000[p_i]
p_i.sum()
Out[16]:
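The parallax == parallax comparison works because NaN != NaN. If you prefer something more self-documenting, pandas' built-in null check gives the same mask (a sketch for the TGAS file):
In [ ]:
p_i = t000.parallax.notnull()  # True wherever parallax is not NaN
tp000 = t000[p_i]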
In [17]:
sns.set_color_codes()
For a single file, TGAS covers much more sky area than the full catalog does. The individual files are capped at roughly 40 MB.
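A quick check of the on-disk sizes (a sketch; the gunzipped .csv files will be larger than the compressed downloads):
In [ ]:
(os.path.getsize('../data/TgasSource_000-000-000.csv') / 1e6,
 os.path.getsize('../data/GaiaSource_000-000-000.csv') / 1e6)  # sizes in MB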
In [18]:
plt.plot(tp000.ra[0:2], tp000.dec[0:2], 'b.', label='TGAS')  # plot two opaque points first so the legend markers aren't faint
plt.plot(gp000.ra[0:2], gp000.dec[0:2], 'r.', label='Gaia Source')
plt.plot(tp000.ra.values, tp000.dec.values, 'b.', alpha=0.1)
plt.plot(gp000.ra.values, gp000.dec.values, 'r.', alpha=0.1)
plt.legend(loc='lower left')
Out[18]:
In [19]:
df_list = []
This takes a non-negligible amount of RAM, but should be fine on a modern laptop.
In [20]:
for i in range(16):
    df_list.append(pd.read_csv('../data/TgasSource_000-000-{:03d}.csv'.format(i)))
In [21]:
tt = pd.concat(df_list, ignore_index=True)
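To put that RAM usage in numbers, pandas can report the concatenated table's in-memory footprint (deep=True also counts the string columns):
In [ ]:
tt.memory_usage(deep=True).sum() / 1e9  # total size in GB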
In [22]:
t000.shape
Out[22]:
In [24]:
tt.shape
Out[24]:
In [25]:
len(tt.source_id.unique())
Out[25]:
So: 2.05+ million sources, each with 59 "features", or columns of metadata.
In [26]:
plt.plot(tt.parallax, tt.parallax_error, '.', alpha=0.005)
plt.xscale('log')
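Note that the log-scaled x axis silently drops sources with non-positive parallaxes, which are legitimate measurement outcomes given the noise. A quick check of how big that fraction is:
In [ ]:
(tt.parallax <= 0).sum() / len(tt)  # fraction of sources with non-positive parallax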
In [27]:
bins = np.arange(-50, 200, 3)
sns.distplot(tt.parallax, bins=bins, kde=False)
plt.yscale('log')
In [28]:
sns.distplot(tt.parallax_error)
Out[28]:
In [30]:
bins = np.arange(0, 160, 2)
sns.distplot(tt.astrometric_n_obs_ac, bins=bins, kde=False, label='n_obs_ac')
sns.distplot(tt.astrometric_n_bad_obs_ac, bins=bins, kde=False, label='n_bad_obs_ac')
sns.distplot(tt.astrometric_n_good_obs_ac, bins=bins, kde=False, label='n_good_obs_ac')
plt.legend()
Out[30]:
In [31]:
sns.distplot(tt.phot_g_mean_mag)
Out[31]:
In [32]:
bins = np.arange(0, 40, 1)
sns.distplot(tt.matched_observations, bins=bins, kde=False)
Out[32]:
In [33]:
tt.iloc[0]
Out[33]:
In [34]:
gi = tt.astrometric_delta_q == tt.astrometric_delta_q  # NaN != NaN, so this keeps rows where delta_q is defined
bins = np.arange(0, 500, 5)
sns.distplot(tt.astrometric_delta_q[gi], bins=bins, kde=False)
plt.yscale('log')
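As with the parallax filter above, it's worth knowing what fraction of TGAS sources have astrometric_delta_q defined at all:
In [ ]:
gi.sum(), len(gi)  # sources with a finite astrometric_delta_q vs. total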
In [35]:
tt.phot_variable_flag.unique()
Out[35]:
In [36]:
vi = tt.phot_variable_flag == 'VARIABLE'
In [37]:
vi.sum(), len(vi)
Out[37]:
Only one variable star in the entire TGAS sample, which is about what you'd expect.
In [38]:
tt[vi]
Out[38]: