In [ ]:
!pip install solvebio
!pip install plotly
In [3]:
import solvebio
import numpy as np
from plotly.graph_objs import Data, Layout, XAxis, YAxis, Figure, Box
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# Initialize Plot.ly offline mode
init_notebook_mode(connected=True)
You'll need your SolveBio API key to run this notebook. You can get your API key from your Security Settings.
In [4]:
solvebio.login(api_key="Your API key here")
In this demo, we will use SolveBio's Python package together with Plot.ly and numpy to quickly analyze and visualize patients and their characteristics from The Cancer Genome Atlas (TCGA) project.
We will use the TCGA Patient Information dataset on SolveBio.
Since we're conducting this analysis by cancer type, we first need to pull out all the possible values for cancer type (the cancer_abbreviation field) in this dataset. Then, we want to retrieve all the ages at first diagnosis for each cancer type. Below, we use "nested facets" to do both in a single SolveBio query:
In [6]:
# Retrieve the TCGA Patient Information dataset
tcga = solvebio.Dataset.get_by_full_path('solvebio:public:/TCGA/1.2.0-2015-02-11/PatientInformation')
# Filter out values where the age is not available
include_ages = ~ solvebio.Filter(age_at_initial_pathologic_diagnosis='[Not Available]')
# Retrieve each cancer type (via terms facets)
# and the list of ages for each type (through a nested terms facet).
facets = {
    'cancer_abbreviation': {
        'limit': 1000,  # Use a large number to get all available cancer types
        'facets': {
            # Add a nested facet to get the ages for each cancer type
            'age_at_initial_pathologic_diagnosis': {
                'limit': 1000
            }
        }
    }
}
results = tcga.query(filters=include_ages).facets(**facets)
# Convert the results into a format usable by Plot.ly
# (a list of ages for each cancer type).
cancer_and_age = []
for cancer_type, count, sub_facets in results['cancer_abbreviation']:
    # The ages are represented by tuples (age, count). To get a nice
    # box plot below, expand out the ages for each occurrence.
    ages = []
    for age, age_count in sub_facets['age_at_initial_pathologic_diagnosis']:
        ages += [int(age)] * age_count
    cancer_and_age.append({'cancer_type': cancer_type, 'ages': ages})
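To make the shape of the nested facet results concrete, the expansion step can be tried offline on hand-written data. This is only a sketch: the `mock_results` dictionary is made up to mimic the (value, count, sub_facets) tuples that the loop above unpacks, and `expand_ages` is a hypothetical helper name, not part of the SolveBio package.

```python
# Made-up stand-in for the nested facet results (NOT real TCGA data).
# Each entry mimics the (value, count, sub_facets) tuples unpacked above.
mock_results = {
    'cancer_abbreviation': [
        ('AAA', 3, {'age_at_initial_pathologic_diagnosis': [('40', 2), ('55', 1)]}),
        ('BBB', 2, {'age_at_initial_pathologic_diagnosis': [('30', 1), ('31', 1)]}),
    ]
}


def expand_ages(results):
    """Turn nested (age, count) facets into one flat list of ages per cancer type."""
    expanded = []
    for cancer_type, _total, sub_facets in results['cancer_abbreviation']:
        ages = []
        for age, n_patients in sub_facets['age_at_initial_pathologic_diagnosis']:
            # Repeat each age once per patient so a box plot sees every patient.
            ages.extend([int(age)] * n_patients)
        expanded.append({'cancer_type': cancer_type, 'ages': ages})
    return expanded


mock_cancer_and_age = expand_ages(mock_results)
print(mock_cancer_and_age[0])  # {'cancer_type': 'AAA', 'ages': [40, 40, 55]}
```

Expanding each age by its count matters because a box plot computes its quartiles from individual observations, not from (age, count) pairs.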
Now that we have the age at diagnosis for every TCGA patient with a recorded age, grouped by cancer type, let's sort the cancer types by median age with numpy and visualize the data with Plot.ly.
In [7]:
cancer_and_age = sorted(cancer_and_age, key=lambda x: np.median(x['ages']))
data = Data([
    Box(y=cancer['ages'], name=cancer['cancer_type'])
    for cancer in cancer_and_age
])
layout = Layout(
    title='Age of Diagnosis for TCGA Patients by Cancer Type',
    xaxis=XAxis(title='Cancer Type'),
    yaxis=YAxis(title='Age of Diagnosis')
)
fig = Figure(data=data, layout=layout)
iplot(fig)
The results are as we expect, based on the distinct epidemiology of each cancer. For example, we know that testicular germ cell tumors are most common in men between the ages of 15 and 35. This is a pretty simple analysis, but there's a lot of data in SolveBio's TCGA datasets that is ripe for analysis.
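A claim like this can also be checked numerically rather than by eye. The sketch below computes the quartiles for each cancer type with `np.percentile` and sorts by median, using the same list-of-dicts layout as `cancer_and_age`. The `summarize_ages` helper and the sample ages are hypothetical, chosen only for illustration, and Plot.ly's box trace may use a slightly different interpolation rule for its quartiles than numpy's default.

```python
import numpy as np


def summarize_ages(cancer_and_age):
    """Return (cancer_type, q1, median, q3) tuples sorted by median age."""
    rows = []
    for entry in cancer_and_age:
        q1, median, q3 = np.percentile(entry['ages'], [25, 50, 75])
        rows.append((entry['cancer_type'], float(q1), float(median), float(q3)))
    return sorted(rows, key=lambda row: row[2])


# Illustrative ages only (NOT real TCGA values).
sample = [
    {'cancer_type': 'OLDER', 'ages': [60, 62, 64, 66, 68]},
    {'cancer_type': 'YOUNGER', 'ages': [20, 25, 30, 35, 40]},
]

for cancer_type, q1, median, q3 in summarize_ages(sample):
    print(cancer_type, q1, median, q3)
```

Passing the real `cancer_and_age` list instead of `sample` would give one row per cancer type, in the same order as the boxes in the plot.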