BioVida: Domain Unification and Data Management


This tutorial will cover the facilities BioVida offers to:

  • integrate image data with other kinds of biomedical data

  • manage cached resources.


Domain Unification

While primarily focused on image data, BioVida also provides interfaces to other types of biomedical data, namely medical diagnostics and genomics data. This section shows how the data from one or several image interfaces can be unified into a single DataFrame, complete with data from these additional sources.

We can start by collecting some data.


In [1]:
from biovida.images import OpeniInterface
opi = OpeniInterface()
opi.search(query='lung cancer')
pull_df1 = opi.pull()


Using Theano backend.
Results Found: 33,925.

Number of Records to Download: 100 (chunk size: 30 records).



Let's also get some data from the Cancer Imaging Archive.


In [2]:
from biovida.images import CancerImageInterface
cii = CancerImageInterface(api_key=YOUR_API_KEY_HERE)
cii.search(cancer_type='lung')
pull_df2 = cii.pull(collections_limit=1)  # only download the first collection/study




Next, we can import the tool we will use to unify the data.


In [3]:
from biovida.unification import unify_against_images

In [4]:
unified_df = unify_against_images(instances=[opi, cii])
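To make the idea of unification concrete, here is a minimal conceptual sketch in plain Python. It is not BioVida's implementation: the function `unify_records`, the record dictionaries, the `'other_source'` label and the lookup table are all hypothetical stand-ins, with the disease values borrowed from the output shown later in this tutorial. It only illustrates the general pattern of stacking records from several sources and annotating each with information from an external lookup.

```python
# Hypothetical records standing in for rows pulled from two image sources.
openi_records = [{"disease": "lung cancer", "source_api": "openi"}]
other_records = [{"disease": "lung cancer", "source_api": "other_source"}]

# Hypothetical lookup standing in for an external disease database.
disease_info = {
    "lung cancer": {
        "disease_family": ("respiratory system cancer",),
        "disease_synonym": ("lung neoplasm",),
    }
}


def unify_records(record_sets, lookup):
    """Stack records from every source and add any matching lookup fields."""
    unified = []
    for records in record_sets:
        for record in records:
            # Records with an unknown disease are kept, just without extra fields.
            extra = lookup.get(record["disease"], {})
            unified.append({**record, **extra})
    return unified


unified = unify_records([openi_records, other_records], disease_info)
for row in unified:
    print(row["source_api"], row["disease_family"])
```

Every unified record keeps its original fields, so the source of each row remains identifiable after the merge.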







In [5]:
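import numpy as np

def simplify_df(df):
    """This function simplifies dataframes
    for the purposes of this tutorial."""
    data_frame = df.copy()
    # Mask the local file paths so the output is not machine-specific.
    for c in ('source_images_path', 'cached_images_path'):
        data_frame[c] = 'path_to_image'
    # Show missing values as empty strings (np.NaN was removed in NumPy 2.0).
    return data_frame.replace({np.nan: ''})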

To close out this section, we can take a quick look at the resultant DataFrame.


In [6]:
simplify_df(unified_df)[85:90]


Out[6]:
age article_type cached_images_path disease image_caption image_id image_id_short modality_best_guess pull_time query sex source_api source_images_path disease_family disease_synonym disease_definition known_associated_symptoms mentioned_symptoms known_associated_genes
85 research_article path_to_image lung cancer (a) Heat Map representing the activated signal... 6 Computed Tomography (CT): chest 2017-04-10 06:50:26.175164 {'specialties': None, 'exclusions': ['graphics... male openi path_to_image (respiratory system cancer,) (lung neoplasm,) A respiratory system cancer that is located in... (abdominal obesity, abdominal pain, abnormal r...
86 research_article path_to_image lung cancer Efficacy of LQ on tumor volume, tumor weight (... 4 Computed Tomography (CT): chest 2017-04-10 06:50:26.175164 {'specialties': None, 'exclusions': ['graphics... male openi path_to_image (respiratory system cancer,) (lung neoplasm,) A respiratory system cancer that is located in... (abdominal obesity, abdominal pain, abnormal r... (weight loss,)
87 research_article path_to_image lung cancer Survivin-VISA vector selectively expressed luc... 2 Computed Tomography (CT): chest 2017-04-10 06:50:26.175164 {'specialties': None, 'exclusions': ['graphics... male openi path_to_image (respiratory system cancer,) (lung neoplasm,) A respiratory system cancer that is located in... (abdominal obesity, abdominal pain, abnormal r...
88 research_article path_to_image lung cancer; ptosis Methylation status of SOX30 in lung cancer cel... 1 Computed Tomography (CT): chest 2017-04-10 06:50:26.175164 {'specialties': None, 'exclusions': ['graphics... male openi path_to_image
89 research_article path_to_image large cell carcinoma; lung adenocarcinoma; lun... CD56+CD16+ NK cell infiltration extent in diff... 2 Computed Tomography (CT): chest 2017-04-10 06:50:26.175164 {'specialties': None, 'exclusions': ['graphics... male openi path_to_image

Note: the 'mentioned_symptoms' column lists those symptoms known to be associated with the disease that were also mentioned in the article.
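As a rough illustration of what such a column represents, the sketch below filters a tuple of known symptoms down to those that occur in a piece of text. This is an assumption about the general idea, not BioVida's actual matching logic; the function `find_mentioned_symptoms` and the sample text are hypothetical.

```python
def find_mentioned_symptoms(known_symptoms, text):
    """Return the known symptoms that occur in `text`, ignoring case."""
    if not isinstance(text, str):
        # Missing text (e.g. None or NaN) cannot mention anything.
        return tuple()
    lowered = text.lower()
    return tuple(s for s in known_symptoms if s.lower() in lowered)


known = ("abdominal obesity", "abdominal pain", "weight loss")
abstract = "Treated mice showed reduced tumor volume and less weight loss."
mentioned = find_mentioned_symptoms(known, abstract)
print(mentioned)
```

Plain substring matching like this is deliberately simple; it would also match a symptom name embedded in a longer word, so a real implementation would likely need something stricter.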


Data Management

This section is intended to provide a brief overview of the ways in which data downloaded with BioVida can be removed from your computer.

1. The simplest way to delete BioVida data is to manually delete the biovida_cache folder, or some portion of the files (e.g., images) contained within it. Both OpeniInterface and CancerImageInterface check for deleted files each time they are instantiated.

2. While the first approach is straightforward, it is neither elegant nor precise. For situations that require more finesse, we can employ the image_delete tool.


In [7]:
from biovida.images import image_delete

Next, we simply define a function that tells image_delete which rows to delete.


In [8]:
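def my_delete_rule(row):
    """Flag rows whose abstract mentions 'proteins' for deletion."""
    # Rows with a missing (non-string) abstract are skipped.
    if isinstance(row['abstract'], str) and 'proteins' in row['abstract'].lower():
        return True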
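Because deletion cannot be undone, it can be worth checking a rule against a few sample rows before handing it to image_delete. The sketch below does this with plain dictionaries as stand-ins for rows; the sample abstracts are invented for illustration, and the rule is written to always return an explicit True or False, which is a stylistic variation on the rule above rather than a requirement stated by BioVida.

```python
def my_delete_rule(row):
    """Return True for rows whose abstract mentions 'proteins'."""
    abstract = row["abstract"]
    return isinstance(abstract, str) and "proteins" in abstract.lower()


# Hypothetical rows; only the 'abstract' field matters to this rule.
sample_rows = [
    {"abstract": "Surface Proteins were quantified by mass spectrometry."},
    {"abstract": "Chest CT of a 64-year-old patient."},
    {"abstract": None},
]

# Positions of the sample rows that the rule would mark for deletion.
flagged = [i for i, row in enumerate(sample_rows) if my_delete_rule(row)]
print(flagged)
```

Only the first sample row is flagged: the match is case-insensitive, and the row with a missing abstract is skipped rather than raising an error.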

In this example, we'll use the instance of OpeniInterface created above.


In [9]:
deleted_rows = image_delete(opi, delete_rule=my_delete_rule, only_recent=True)


This action cannot be undone.
Do you wish to continue (y/n)?y

Deleting...

Indices of Deleted Rows:

records_db cache_records_db
0 88 9047
1 89 9048
2 90 9049
3 91 9050
4 92 9051
5 93 9052
6 94 9053
7 95 9054
8 96 9055
9 97 9056
10 98 9057
11 99 9058

This deletes not only the matching rows but also any images associated with them. Therefore, as a precaution, you are asked to confirm the action before it is performed.

Warning:

The default behavior of image_delete is to delete any rows for which your 'delete_rule' returns True, including those in cache_records_db which were not downloaded in the most recent pull(). The only_recent parameter can be used to limit deletion to data obtained in the most recent pull, as shown above.
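For comparison, the first, manual approach can also be scripted. The following is a minimal sketch using only the Python standard library; the helper `remove_cache` and the 'images' subfolder name are hypothetical, and the demonstration runs against a throwaway temporary directory rather than a real cache. To use it in practice, you would pass the path of your own biovida_cache folder.

```python
import shutil
import tempfile
from pathlib import Path


def remove_cache(cache_dir, subfolder=None):
    """Delete cache_dir, or only cache_dir/subfolder; return True if removed."""
    target = Path(cache_dir) / subfolder if subfolder else Path(cache_dir)
    if not target.is_dir():
        # Nothing to delete, e.g. the folder was already removed.
        return False
    shutil.rmtree(target)
    return True


# Demonstration on a throwaway directory, not on a real cache.
demo_root = Path(tempfile.mkdtemp())
demo_cache = demo_root / "biovida_cache"
(demo_cache / "images").mkdir(parents=True)
(demo_cache / "images" / "example.png").write_bytes(b"")

removed_images = remove_cache(demo_cache, subfolder="images")  # images only
cache_still_exists = demo_cache.is_dir()
removed_all = remove_cache(demo_cache)  # the whole cache folder
cache_gone = not demo_cache.exists()
shutil.rmtree(demo_root)  # clean up the temporary directory
```

Unlike image_delete, this removes files without asking for confirmation, so the path should be double-checked before it is run.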


Conclusion

In this tutorial, we have reviewed how to unify images obtained with BioVida, both with each other and against external biomedical databases. Additionally, we have explored methods for deleting downloaded data.