BioVida: Open-i

Open-i is an open access biomedical search engine provided by the US National Institutes of Health. The service grants programmatic access to more than 1.2 million images through a RESTful web API. BioVida provides an easy-to-use Python interface for this web API, located in the images subpackage.

In [1]:
from biovida.images import OpeniInterface

Using Theano backend.

In [2]:
opi = OpeniInterface()

We start by creating an instance of the class. All BioVida interfaces accept at least two parameters: verbose and cache_path. The first determines whether or not the class prints additional updates as it works. The second is the location where data will be stored (or cached) on your computer. If left to its default, data will be cached in a directory entitled biovida_cache in your home directory. For most use cases, this should suffice.
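
As a rough illustration of the default described above, the sketch below resolves a cache location the way the text describes it. The helper name resolve_cache_path is hypothetical and is not part of BioVida; how BioVida treats a user-supplied path internally is an assumption here.

```python
import os


def resolve_cache_path(cache_path=None):
    """Return the directory to cache data in (hypothetical helper, not BioVida code)."""
    if cache_path is None:
        # Default described in the text: 'biovida_cache' inside the home directory.
        return os.path.join(os.path.expanduser("~"), "biovida_cache")
    # Assumption: a user-supplied path is used as given, after expanding '~'.
    return os.path.abspath(os.path.expanduser(cache_path))


print(resolve_cache_path())
```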


To search the Open-i database, we can use the OpeniInterface's search method. To explore valid values that can be passed to search, we can use options().

In [3]:
opi.options()

  - 'article_type'
  - 'collection'
  - 'exclusions'
  - 'fields'
  - 'image_type'
  - 'rankby'
  - 'specialties'
  - 'subset'
  - 'video'

The code above enumerates all of the parameters, apart from a specific query string, that can be passed to search(). Additionally, options() can be used to investigate the valid values for any one of these parameters.

In [4]:
opi.options('collection')

  - 'history_of_medicine'
  - 'indiana_u_xray'
  - 'medpix'
  - 'pubmed'
  - 'usc_anatomy'

In [5]:
opi.options('image_type')

  - 'ct'
  - 'graphic'
  - 'microscopy'
  - 'mri'
  - 'pet'
  - 'photograph'
  - 'ultrasound'
  - 'x_ray'
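
The values listed above can be used to sanity-check search arguments before sending a request. The following is only a sketch: check_search_args and VALID_VALUES are hypothetical names, the value sets are copied from the two listings above, and nothing here reflects how BioVida validates its own input.

```python
# Values copied from the 'collection' and 'image_type' listings above.
VALID_VALUES = {
    "collection": {"history_of_medicine", "indiana_u_xray", "medpix",
                   "pubmed", "usc_anatomy"},
    "image_type": {"ct", "graphic", "microscopy", "mri", "pet",
                   "photograph", "ultrasound", "x_ray"},
}


def check_search_args(**kwargs):
    """Raise ValueError if an argument uses a value not listed above."""
    for param, value in kwargs.items():
        if param not in VALID_VALUES:
            raise ValueError("unknown parameter: {0}".format(param))
        # Accept either a single value or a list/tuple of values.
        values = value if isinstance(value, (list, tuple)) else [value]
        invalid = [v for v in values if v not in VALID_VALUES[param]]
        if invalid:
            raise ValueError("invalid {0} value(s): {1}".format(param, invalid))
    return True


print(check_search_args(image_type=("x_ray", "ct"), collection="pubmed"))
```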

Let's go ahead and perform a search for X-ray and CT images of 'lung cancer' from the PubMed collection/database.

In [6]:
opi.search(query='lung cancer', image_type=('x_ray', 'ct'), collection='pubmed')

Results Found: 8,531.

Downloading Data

Now that we've defined a search, we can easily download some, or all, of the results found. For the sake of expediency, let's limit the number of results we download to the first 1500.

In [7]:
pull_df = opi.pull(download_limit=1500)

Number of Records to Download: 1,500 (chunk size: 30 records).

The text information associated with the images is referred to as 'records', which are downloaded in 'chunks' of no more than 30 at a time.
Images, unlike records, are downloaded 'one by one'. However, pull() will check the cache before downloading an image, in an effort to reduce redundant downloads.
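
To make the two ideas above concrete, here is a minimal sketch of chunked record retrieval and a cache check before an image download. The function names and file names are hypothetical; this is not BioVida's implementation, only an illustration of the behaviour described.

```python
import os
import tempfile


def chunk_indices(n_records, chunk_size=30):
    """Return (start, stop) index pairs covering n_records in chunks."""
    return [(start, min(start + chunk_size, n_records))
            for start in range(0, n_records, chunk_size)]


def needs_download(image_name, cache_dir):
    """Return True only if the image is not already in the cache directory."""
    return not os.path.isfile(os.path.join(cache_dir, image_name))


# 1,500 records in chunks of 30 gives 50 requests.
chunks = chunk_indices(1500)
print(len(chunks), chunks[0], chunks[-1])

with tempfile.TemporaryDirectory() as cache_dir:
    # Pretend img_1.png was cached during an earlier session.
    open(os.path.join(cache_dir, "img_1.png"), "wb").close()
    to_fetch = [name for name in ("img_1.png", "img_2.png")
                if needs_download(name, cache_dir)]
    print(to_fetch)
```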

The dataframe generated by pull() can be viewed using either opi.records_db, or the pull_df used above to capture the output of pull(). Both will be identical. We can also view an abbreviated dataframe, opi.records_db_short, which has several (typically unneeded) columns removed.

In [8]:
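import numpy as np


def simplify_df(df):
    """This function simplifies dataframes
    for the purposes of this tutorial."""
    data_frame = df.copy()
    # Hide the real cache location in the displayed output.
    data_frame['cached_images_path'] = '/path/to/image'
    # np.nan replaces the np.NaN alias, which was removed in NumPy 2.0.
    return data_frame[0:5].replace({np.nan: ''})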

In [9]:
simplify_df(opi.records_db)

mesh_major mesh_minor problems abstract affiliate article_type authors cc_license doc_source image_caption ... image_problems_from_text imaging_modality_from_text parsed_abstract sex modality_full image_id_short query pull_time cached_images_path download_success
0 lung cancer Lung cancer is one of the leading causes of ca... Department of Nuclear Medicine and Molecular I... other Purandare NC, Rangarajan V byncsa PMC (A-D) Nodal disease. Right upper paratracheal ... ... (arrows, grids) Computed Tomography (CT): chest male Computed Tomography (CT) 7 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:24:37.098065 /path/to/image True
1 nsclc Non-small cell lung cancer (NSCLC) accounts fo... Cardiopulmonary Department, Sant'Andrea Hospit... research_article Pezzuto A, Piraino A, Mariotta S by PMC Case 2: (A) CT scan showing the lung after sur... ... Computed Tomography (CT): chest female Computed Tomography (CT) 2 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:24:37.098065 /path/to/image True
2 (Adrenal Insufficiency/diagnosis*/drug therapy... (Acute Disease, Adrenal Cortex Hormones/therap... hypoxia; tachycardia Background: Adrenal crisis after surgical pro... Department of Orthopedic Surgery, Osaka Medica... research_article Naka N, Takenaka S, Nanno K, Moriguchi Y, Chun... by PMC CT scan showing bilateral adrenal enlargement ... ... (asterisks,) Computed Tomography (CT): chest {'background': 'Adrenal crisis after surgical ... male Computed Tomography (CT) 1 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:24:37.098065 /path/to/image True
3 small cell carcinoma of lung Subcutaneous swelling as first clinical presen... Department of Medicine. case_report Kumar S, Gupta A, Diwan SK, Bhake A PMC Computerized tomography of the chest showing p... ... Computed Tomography (CT): chest male Computed Tomography (CT) 3 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:24:37.098065 /path/to/image True
4 (Echocardiography/methods*, Lung/ultrasonograp... (Aged, Humans, Incidental Findings, Male) lung tumor We present images of a rare case where a prima... Department of Clinical Physiology and Nuclear ... research_article Dencker M, Cronberg C, Damm S, Valind S, Wadbo M by PMC Display of CT-image. The arrow indicates the n... ... (arrows,) Computed Tomography (CT): chest male Computed Tomography (CT) 2 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:24:37.098065 /path/to/image True

5 rows × 38 columns

This dataframe provides a lot of rich data, which is valuable independently of the images that have also been downloaded.

For instance, it is possible to quickly generate some descriptive statistics about our newly created 'lung cancer' dataset.

In [10]:
opi.records_db['age'].describe()

count    966.000000
mean      53.337267
std       21.999990
min        1.000000
25%       47.000000
50%       59.000000
75%       68.000000
max       91.400000
Name: age, dtype: float64

In [11]:
opi.records_db['sex'].value_counts(normalize=True)

male      0.722359
female    0.277641
Name: sex, dtype: float64

The age and sex columns are generated by analyzing the raw text provided by Open-i. This extraction is reasonably accurate, but mistakes are certainly possible.
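
To give a feel for why text-derived fields like these can be wrong, the sketch below pulls an age and a sex out of free text with simple patterns. It is an illustrative assumption, not the method BioVida uses; guess_age, guess_sex, and the term lists are hypothetical.

```python
import re

# Matches phrases such as '48-year-old', '63 year old' or '2.5 years old'.
AGE_PATTERN = re.compile(r"(\d{1,3}(?:\.\d+)?)[\s-]*years?[\s-]*old", re.IGNORECASE)
SEX_TERMS = {"male": ("male", "man", "boy"), "female": ("female", "woman", "girl")}


def guess_age(text):
    """Return the first age mentioned in text as a float, or None."""
    match = AGE_PATTERN.search(text)
    return float(match.group(1)) if match else None


def guess_sex(text):
    """Return 'male' or 'female' if exactly one sex is mentioned, else None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = {sex for sex, terms in SEX_TERMS.items() if words & set(terms)}
    return hits.pop() if len(hits) == 1 else None


caption = "CT scan of a 48-year-old man with a mass in the right upper lobe."
print(guess_age(caption), guess_sex(caption))
```

Text that mentions both sexes, or no age at all, yields None here, which is one way gaps and errors can enter such columns.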

It should also be mentioned that opi.records_db only contains data for the most recent search() and pull(). Conversely, cache_records_db provides a more complete account of all images in the cache, e.g., those obtained several sessions ago. Additionally, unlike opi.records_db, cache_records_db can contain duplicate rows. However, this is only allowed to occur if the queries that generated the rows are different.
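
The duplicate rule described above can be pictured as deduplication on an (image, query) pair rather than on the image alone. The sketch below assumes rows are plain dictionaries with hypothetical 'image_id' and 'query' keys, and uses strings for queries; it is not BioVida's code.

```python
def merge_into_cache(cache_rows, new_rows):
    """Append new_rows to cache_rows, skipping (image, query) pairs already present."""
    seen = {(row["image_id"], row["query"]) for row in cache_rows}
    merged = list(cache_rows)
    for row in new_rows:
        key = (row["image_id"], row["query"])
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged


cache = [{"image_id": "img_a", "query": "lung cancer"}]
new = [
    {"image_id": "img_a", "query": "lung cancer"},  # same image, same query: skipped
    {"image_id": "img_a", "query": "chest ct"},     # same image, new query: kept
]
merged = merge_into_cache(cache, new)
print(len(merged))
```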


Now that we've explored obtaining and reviewing data, we can finally turn our attention to images themselves.

In [12]:
from utils import show_image
%matplotlib inline

Note: utils is a small script with some helpful functions located in the base of this directory.

Using the show_image() function imported above, we can now look at one of the images we pulled in the step above.

In [13]:
# show_image(opi.records_db['cached_images_path'].iloc[156])

Let's also look at the age and sex of this subject.

In [15]:
age_sex = opi.records_db['age'].iloc[156], opi.records_db['sex'].iloc[156]
print("age: {0}, sex: {1}.".format(*age_sex))

age: 48.0, sex: male.

We can also easily check their diagnosis:

In [16]:
opi.records_db['diagnosis'].iloc[156]

'carcinoma; neurofibromatosis'

Please be advised that for collections other than 'MedPix'*, such as PubMed, diagnosis information is obtained by analyzing the text associated with the image. Errors are possible.

*MedPix explicitly provides diagnosis information, so it can be assumed to be accurate.

Automated Cleaning of Image Data (Experimental)

While the data may look OK so far, if we look more closely we will likely find several problems with the images we have downloaded.

In [17]:
# show_image(opi.records_db['cached_images_path'].iloc[100])

In [18]:
# show_image(opi.records_db['cached_images_path'].iloc[10])

The images above contain several clear problems. They both contain arrows, and the latter is actually a 'grid' of images. These are liable to confuse any model we attempt to train to detect disease. We could manually go through and remove these images or, alternatively, we can use the experimental OpeniImageProcessing class to try to eliminate these images from our dataset automatically.

In [19]:
from biovida.images import OpeniImageProcessing

We initialize this class using our OpeniInterface instance. By default, it will extract the records_db DataFrame. Do note, however, that we can force it to extract the cache_records_db DataFrame by setting db_to_extract to 'cache_records_db'.

In [20]:
ip = OpeniImageProcessing(opi)

OpeniImageProcessing will automatically download a model for a Convolutional Neural Network (convnet) which has been trained to detect these kinds of problems. If you are unfamiliar with these kinds of models, you can read more about them here.

The OpeniImageProcessing class tries to detect problems in the images both by analyzing the text associated with each image and by feeding the image through the convnet mentioned above. However, by default the OpeniImageProcessing class will only use predictions gleaned from this model if it has been explicitly trained on images from that kind of imaging modality.

We can easily check the modalities for which the model has been trained:

In [21]:

['ct', 'mri', 'x_ray']

Luckily, we're working with X-rays and CTs.
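
The default behaviour described above, where convnet predictions are only trusted for modalities the model was trained on, might be sketched as follows. The function name and return structure are hypothetical; only the modality list and the column names come from this tutorial's output.

```python
# Modalities reported in the output above.
TRAINED_MODALITIES = {"ct", "mri", "x_ray"}


def problems_for_image(modality, text_problems, model_predictions):
    """Combine text-derived problems with model predictions.

    Model predictions are kept only when the image's modality is one
    the model has been trained on (assumed behaviour, see lead-in).
    """
    use_model = modality in TRAINED_MODALITIES
    return {
        "image_problems_from_text": tuple(text_problems),
        "visual_image_problems": list(model_predictions) if use_model else None,
    }


ct_result = problems_for_image("ct", ("arrows",), [("arrows", 0.99), ("valid_image", 0.01)])
us_result = problems_for_image("ultrasound", (), [("valid_image", 0.7), ("text", 0.3)])
print(ct_result["visual_image_problems"])
print(us_result["visual_image_problems"])
```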

Now we're ready to analyze our images.

In [22]:
analysis_df =

In [23]:

mesh_major mesh_minor problems abstract affiliate article_type authors cc_license detailed_query_url doc_source ... grayscale medpix_logo_bounding_box hbar hborder vborder upper_crop lower_crop visual_image_problems invalid_image invalid_image_reasons
0 lung cancer Lung cancer is one of the leading causes of ca... Department of Nuclear Medicine and Molecular I... other Purandare NC, Rangarajan V byncsa PMC ... False [(grids, 0.864566), (arrows, 0.00571598), (tex... True (grayscale, image_problems_from_text, visual_i...
1 nsclc Non-small cell lung cancer (NSCLC) accounts fo... Cardiopulmonary Department, Sant'Andrea Hospit... research_article Pezzuto A, Piraino A, Mariotta S by PMC ... True [(valid_image, 0.858156), (text, 0.100862), (a... False
2 (Adrenal Insufficiency/diagnosis*/drug therapy... (Acute Disease, Adrenal Cortex Hormones/therap... hypoxia; tachycardia Background: Adrenal crisis after surgical pro... Department of Orthopedic Surgery, Osaka Medica... research_article Naka N, Takenaka S, Nanno K, Moriguchi Y, Chun... by PMC ... True 327 327 [(valid_image, 0.914723), (arrows, 0.0369132),... True (image_problems_from_text,)
3 small cell carcinoma of lung Subcutaneous swelling as first clinical presen... Department of Medicine. case_report Kumar S, Gupta A, Diwan SK, Bhake A PMC ... False 348 (4, 351) (191, 491) 4 348 [(arrows, 0.998476), (grids, 0.000149499), (va... True (grayscale, visual_image_problems)
4 (Echocardiography/methods*, Lung/ultrasonograp... (Aged, Humans, Incidental Findings, Male) lung tumor We present images of a rare case where a prima... Department of Clinical Physiology and Nuclear ... research_article Dencker M, Cronberg C, Damm S, Valind S, Wadbo M by PMC ... False 294 (96, 487) 294 [(grids, 0.993127), (arrows, 8.44916e-06), (va... True (grayscale, image_problems_from_text, visual_i...

5 rows × 65 columns

This will generate several new columns:

  • 'grayscale': whether or not the image is grayscale.
  • 'medpix_logo_bounding_box': images from the MedPix collection typically contain the organization's logo in the top right corner. Had we passed the class images from MedPix, it would have tried to 'draw' a bounding box around the logo's precise location (enabling it to be cropped out of the image).
  • 'hbar': this denotes a 'horizontal bar' that is sometimes found at the bottom of images. If present, this column reports its height in pixels.
  • 'hborder': this column provides an account of 'horizontal borders' on either side of the image.
  • 'vborder': this column provides an account of 'vertical borders' on the top and bottom of the image.
  • 'upper_crop': this is the location that has been selected to crop the top of the image. This decision is made by considering the 'medpix_logo_bounding_box' and 'vborder' columns.
  • 'lower_crop': this is the location that has been selected to crop the bottom of the image. This decision is made by considering the 'hbar' and 'hborder' columns.
  • 'visual_image_problems': this column contains the output of the convnet model, with the number following each label representing the probability that the image belongs to that category.
  • 'invalid_image': this is a decision as to whether or not the image is invalid, e.g., has an arrow. This decision is made using the 'grayscale' and 'visual_image_problems' columns as well as the text associated with the image ('image_problems_from_text').
  • 'invalid_image_reasons': in cases where the 'invalid_image' column is True, this column provides an account of why that decision was made.
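
One plausible reading of how the last two columns relate to the others is sketched below. The rule (flag non-grayscale images, images with text-derived problems, and images whose most probable convnet label is not 'valid_image') is an assumption that happens to agree with the rows shown above; judge_image is a hypothetical name and this is not BioVida's actual logic.

```python
def judge_image(grayscale, text_problems, visual_predictions):
    """Return (invalid_image, invalid_image_reasons) for one image."""
    reasons = []
    if not grayscale:
        reasons.append("grayscale")
    if text_problems:
        reasons.append("image_problems_from_text")
    # Treat the most probable convnet label as the model's verdict.
    top_label = max(visual_predictions, key=lambda pair: pair[1])[0]
    if top_label != "valid_image":
        reasons.append("visual_image_problems")
    return bool(reasons), tuple(reasons)


# Inputs taken from rows 1, 2 and 3 of the output above (predictions truncated as shown).
row_1 = judge_image(True, (), [("valid_image", 0.858156), ("text", 0.100862)])
row_2 = judge_image(True, ("asterisks",), [("valid_image", 0.914723), ("arrows", 0.0369132)])
row_3 = judge_image(False, (), [("arrows", 0.998476), ("grids", 0.000149499)])
print(row_1, row_2, row_3, sep="\n")
```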

We can use this analysis to construct a new dataframe, with 'invalid_images' removed and the remaining images cropped in such a way that problematic features are removed.
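
Cropping with the upper_crop and lower_crop values amounts to keeping a band of pixel rows. The sketch below uses a nested list in place of a real image; crop_rows is a hypothetical helper, and treating lower_crop as an exclusive bound is an assumption.

```python
def crop_rows(image, upper_crop=None, lower_crop=None):
    """Keep the pixel rows from upper_crop (inclusive) to lower_crop (exclusive).

    A missing bound means 'do not crop on that side'.
    """
    top = 0 if upper_crop is None else upper_crop
    bottom = len(image) if lower_crop is None else lower_crop
    return image[top:bottom]


# A toy 6-row 'image'; each row is labelled by its index.
toy_image = [[row_index] * 4 for row_index in range(6)]
cropped = crop_rows(toy_image, upper_crop=1, lower_crop=5)
print([row[0] for row in cropped])
```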

In [24]:

This 'cleaned' set should have fewer instances of problematic images.

Here's a random image from this new set:

In [25]:
# show_image(ip.image_dataframe_cleaned['cleaned_image'].iloc[180])

With time, the machinery used to detect these kinds of problems, particularly the convolutional neural network, will be improved. However, at the current time, this class is still considered to be very experimental.

Train, Validation and Test

Now that we've explored data harvesting, we can turn our attention to the final step before modeling: dividing data into training, validation and/or test sets.

Let's use images from the Indiana University Chest X-Ray collection* ('indiana_u_xray'). This set of images has been assembled 'by hand', and thus does not require complicated image cleaning procedures.
*License; images have not been modified.

In [26]:
opi.search(collection='indiana_u_xray')

Results Found: 7,470.

Let's go ahead and download this entire collection.
Please be advised that this will take some time, so feel free to adjust download_limit to suit your needs.

In [27]:
pull_df2 = opi.pull(download_limit=None)

Number of Records to Download: 7,470 (chunk size: 30 records).

Let's quickly inspect this newly downloaded data.

In [28]:
simplify_df(opi.records_db)

mesh_major mesh_minor problems abstract affiliate article_type authors cc_license doc_source image_caption ... image_problems_from_text imaging_modality_from_text parsed_abstract sex modality_full image_id_short query pull_time cached_images_path download_success
0 (Calcified Granuloma/lung/upper lobe/right,) calcified granuloma Comparison: Chest radiographs XXXX. Indicatio... Indiana University radiology_report Kohli MD, Rosenman M byncnd CXR PA and lateral chest x-XXXX XXXX. ... Computed Tomography (CT): chest {'impression': 'No acute cardiopulmonary proce... male X-Ray 1 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:30:34.796070 /path/to/image True
1 (Calcified Granuloma/lung/upper lobe/right,) calcified granuloma Comparison: Chest radiographs XXXX. Indicatio... Indiana University radiology_report Kohli MD, Rosenman M byncnd CXR PA and lateral chest x-XXXX XXXX. ... Computed Tomography (CT): chest {'impression': 'No acute cardiopulmonary proce... male X-Ray 2 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:30:34.796070 /path/to/image True
2 (normal,) normal Comparison: None. Indication: Positive TB tes... Indiana University radiology_report Kohli MD, Rosenman M byncnd CXR Xray Chest PA and Lateral ... Computed Tomography (CT): chest {'impression': 'Normal chest x-XXXX.', 'compar... X-Ray 1 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:30:34.796070 /path/to/image True
3 (normal,) normal Comparison: None. Indication: Positive TB tes... Indiana University radiology_report Kohli MD, Rosenman M byncnd CXR Xray Chest PA and Lateral ... Computed Tomography (CT): chest {'impression': 'Normal chest x-XXXX.', 'compar... X-Ray 2 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:30:34.796070 /path/to/image True
4 (Markings/lung/bilateral/interstitial/diffuse/... markings; fibrosis Comparison: None. Indication: dyspnea, subjec... Indiana University radiology_report Kohli MD, Rosenman M byncnd CXR CHEST 2V FRONTAL/LATERAL XXXX, XXXX XXXX PM ... Computed Tomography (CT): chest {'impression': 'Diffuse fibrosis. No visible f... X-Ray 1 {'subset': None, 'rankby': None, 'collection':... 2017-04-10 15:30:34.796070 /path/to/image True

5 rows × 37 columns

We can easily select a subset of these ~7000 images and divide them into training and test sets for some machine learning model using the image_divvy() tool.

In [29]:
from biovida.images import image_divvy

Let's imagine we're interested in building a model capable of distinguishing between 'normal' chest x-rays and those with signs of problematic calcium deposits, a disease formally known as 'calcinosis'.

We can define a rule to construct such a training and test set using a 'divvy_rule'. This rule will tell image_divvy() how to 'divvy up' the images in the cache. More specifically, our rule will tell image_divvy() how to categorize images in the cache.

In [30]:
def my_divvy_rule(row):
    if isinstance(row['diagnosis'], str):
        if 'normal' in row['diagnosis']:
            return 'normal'  # though this could be anything, e.g., 'super cool normal images'.
        elif 'calcinosis' in row['diagnosis']:
            return 'calcinosis'
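
To see what such a rule returns, it can be applied to a few hand-made rows. The rows below are illustrative stand-ins for rows of the records dataframe, and skipping rows for which the rule returns None is an assumption about how they are treated.

```python
def my_divvy_rule(row):
    # Same rule as in the cell above, repeated so this sketch runs on its own.
    if isinstance(row['diagnosis'], str):
        if 'normal' in row['diagnosis']:
            return 'normal'
        elif 'calcinosis' in row['diagnosis']:
            return 'calcinosis'


# Hand-made rows (illustrative values only).
sample_rows = [
    {'diagnosis': 'normal'},
    {'diagnosis': 'calcinosis; granuloma'},
    {'diagnosis': 'fibrosis'},
    {'diagnosis': None},
]

categories = {}
for row in sample_rows:
    label = my_divvy_rule(row)
    if label is not None:  # assumption: uncategorized rows are skipped
        categories.setdefault(label, []).append(row['diagnosis'])

print(categories)
```

Note that the rule tests for substrings, so a diagnosis such as 'abnormal' would also be labelled 'normal'; a stricter rule could compare whole terms instead.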

Now that image_divvy() knows how we would like it to categorize the data we've downloaded, we can also pass it a dictionary specifying how to 'split' the data into training and testing sets. In this example, we'll use a standard 80% train, 20% test split and ask the function to return numpy arrays (ndarrays) as output.

In [31]:
train_test = image_divvy(instance=opi,
                         divvy_rule=my_divvy_rule,  # the rule defined above
                         train_val_test_dict={'train': 0.8, 'test': 0.2})


- 'train':
  - 'calcinosis'
  - 'normal'
- 'test':
  - 'calcinosis'
  - 'normal'

Before signing off, image_divvy() printed the structure of the nested dictionary it returned.
We can use this information to unpack the arrays nested within this data structure:

In [32]:
train_ca, test_ca = train_test['train']['calcinosis'], train_test['test']['calcinosis']
train_norm, test_norm = train_test['train']['normal'], train_test['test']['normal']

Now that our data has been neatly unpacked, we can look at the number of samples the procedure generated.

In [33]:
# Normal
print("Train:", len(train_norm), "|", "Test:", len(test_norm))

Train: 2156 | Test: 540

In [34]:
# Calcinosis
print("Train:", len(train_ca), "|", "Test:", len(test_ca))

Train: 446 | Test: 112
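
The counts above are consistent with a plain 80/20 cut of each category (2,156 + 540 = 2,696 normal images and 446 + 112 = 558 calcinosis images). The sketch below reproduces that arithmetic; split_items is a hypothetical helper, and the shuffling and rounding choices are assumptions rather than a description of image_divvy().

```python
import random


def split_items(items, fractions, seed=0):
    """Shuffle items and cut them into named groups according to fractions."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    groups, start = {}, 0
    names = list(fractions)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last group takes whatever remains so that no item is dropped.
            stop = len(shuffled)
        else:
            stop = start + int(len(shuffled) * fractions[name])
        groups[name] = shuffled[start:stop]
        start = stop
    return groups


normal = split_items(range(2696), {"train": 0.8, "test": 0.2})
calcinosis = split_items(range(558), {"train": 0.8, "test": 0.2})
print(len(normal["train"]), len(normal["test"]))          # 2156 540
print(len(calcinosis["train"]), len(calcinosis["test"]))  # 446 112
```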

Using the show_image() tool we imported above, we can take a quick look at an image from each category.

In [35]:
# Normal
# show_image(train_norm[99])

In [36]:
# Calcinosis
# show_image(train_ca[104])


Here we've explored how BioVida can be used to easily obtain and process data from the Open-i database.

In the next tutorial, we'll see how BioVida can be used to gain access to a database with orders of magnitude more images.