This notebook demonstrates exploring the sample information for the 1000 Genomes dataset stored in BigQuery. Specifically you will:
%%sql
statement to write and execute SQL statements within the notebookRelated Links:
NOTE:
First let's take a look at the 1000 Genomes sample dataset table.
In [1]:
import gcp.bigquery as bq
samples_table = bq.Table('genomics-public-data:1000_genomes.sample_info')
samples_table.schema
Out[1]:
We can see in the table schema that a number of different annotations exist for each genomic sample.
To get a feel for the dataset, let's see how the samples are distributed across populations and super populations.
In [2]:
%%sql --module pops
SELECT
population,
population_description,
super_population,
super_population_description,
COUNT(population) AS population_count,
FROM
$samples_table
GROUP BY
population,
population_description,
super_population,
super_population_description
In [3]:
bq.Query(pops, samples_table=samples_table).to_dataframe()
Out[3]:
In order to further analyze our query results locally within the IPython notebook, let's convert the result set into a Pandas dataframe:
In [4]:
pops_df = bq.Query(pops, samples_table=samples_table).to_dataframe()
pops_df[:5]
Out[4]:
Pandas dataframes have a dataframe.plot()
method that allows us to quickly render a number of standard visualizations. Let's draw a bar chart of the per-population sample counts:
In [5]:
import seaborn as sns
pops_df.plot(kind='bar', y='population_count')
Out[5]:
Now that we have a dataframe object, we can compute arbitrary rollups and aggregations locally. For example, let's aggregate the count of samples from the population level to the super population level. We'll do this via the dataframe.groupby
operation, which is generally of the form:
dataframe.groupby("column name").aggregation_func()
In [6]:
superpop_df = pops_df.groupby('super_population').sum()
superpop_df.plot(kind='bar')
Out[6]:
We see that the distribution of genome samples across population and super population is relatively uniform, the exception being the AFR (African) super population.
Let's explore a few quantitative attributes of the genomic samples. We'll keep our super population sample classification, but also add some metrics around the extent to which given samples were sequenced, for exomic and low coverage regions. We'll further annotate our samples with a tag for the laboratory/center that produced performed the sequencing.
In [7]:
%%sql --module metrics
select
super_population
, total_lc_sequence -- lc=low coverage
, total_exome_sequence
, Main_Project_E_Centers
from $samples_table
where
total_exome_sequence is not null
and total_lc_sequence is not null
and Main_Project_E_Centers is not null
and Main_Project_E_Centers != 'BCM,BGI' -- remove this single outlier
Again, we can convert these results to a Pandas dataframe for local analysis and visualization
In [8]:
df = bq.Query(metrics, samples_table=samples_table).to_dataframe()
df[:10]
Out[8]:
To get a feel for the quantitative sample attributes, let's see how their values are distributed among the dataset samples by plotting histograms. Note that a histogram of any dataframe attribute can be easily rendered with the pattern dataframe.attribute_name.hist()
In [9]:
df.hist(alpha=0.5, bins=30)
Out[9]:
What does the joint distribution of these two traits look like?
In [10]:
g = sns.jointplot('total_exome_sequence', 'total_lc_sequence', df)
Only a very slight positive correlation. Let's further annotate our scatter chart by rendering each mark with a color according to its super population assignment.
In [11]:
g = sns.lmplot('total_exome_sequence', 'total_lc_sequence', hue='super_population', data=df, fit_reg=False)
Now, let's take the same plot as above, by facet our results based upon the genomic sequencing center that produced it to look for inter-center variability in the dataset.
In [12]:
g = sns.lmplot('total_exome_sequence', 'total_lc_sequence', hue='super_population', col='Main_Project_E_Centers', col_wrap=2, data=df, fit_reg=False)
The WUSGC (the Genome Institute at Washington University) shows a small outlier cluster that is distinct relative to the other centers. The BCM (Baylor College of Medicine) facet appears the least variable within the exome sequencing dimension.
Are there any super population trends here? We can facet our data a second time, this time by super population to dig deeper.
In [13]:
g = sns.lmplot('total_exome_sequence', 'total_lc_sequence', hue='super_population', col='super_population', row='Main_Project_E_Centers', data=df, fit_reg=False)
With the data now faceted by super population as well as genomic center, the variability of the Broad institute data becomes more apparent. Furthermore, the outlier cluster noted earlier in the WUGSC facet appears to be primarily constituted by the EAS and AMR super populations, with no representation from the SAS super population.