Combine the candidates.csv and annotations.csv files

For some reason, the candidates and annotations files were never merged. The only additional information in the annotations file is the nodule size (diameter_mm), which could be useful in our models. Sizes are only given for class 1 nodules, and not every class 1 nodule has an annotation. The annotations are also in a different order from the candidates, and the nodule centers differ slightly between the two files. I've asked on the LUNA16 mailing list which set of coordinates is more accurate and am still waiting for a response.

This script goes through both files and tries to match each annotation with the closest candidate. It then merges the information and writes it to a new candidates_with_annotations.csv file.


In [1]:
## Create new candidates file

In [3]:
import pandas as pd
import numpy as np

In [4]:
DATA_DIR = "/Volumes/data/tonyr/dicom/LUNA16/"
cand_path = 'CSVFILES/candidates_V2.csv'
annotations_path = 'CSVFILES/annotations.csv'

In [5]:
dfAnnotations = pd.read_csv(DATA_DIR+annotations_path).reset_index()
dfAnnotations = dfAnnotations.rename(columns={'index': 'candidate'})
dfCandidates = pd.read_csv(DATA_DIR+cand_path).reset_index()
dfCandidates = dfCandidates.rename(columns={'index': 'candidate'})

In [5]:
dfCandidates['diameter_mm'] = np.nan  # Set a new column and fill with NaN until we know the true diameter of the candidate

In [6]:
dfClass1 = dfCandidates[dfCandidates['class'] == 1].copy(deep=True)  # Get only the class 1 (they are the only ones that are labeled)

In [6]:
dfCandidates.shape


Out[6]:
(754975, 6)

Append nodule size to candidates

For each annotation, find the class 1 candidate in the same series whose ROI center is closest to the annotated center. Then update that candidate with the nodule size and the center coordinates listed in the annotations file.


In [7]:
seriesuid = dfClass1['seriesuid'].unique()  # Get the unique series names (subjects)

for seriesNum in seriesuid:
    
    # Get the annotation and class 1 candidate indices for this series
    candAnnotations = dfAnnotations[dfAnnotations['seriesuid']==seriesNum]['candidate'].values
    candCandidates = dfClass1[dfClass1['seriesuid'] == seriesNum]['candidate'].values

    # Now loop through annotations to find closest candidate
    diameterArray = []

    for ia in candAnnotations: # Loop through the annotation indices for this seriesuid

        annotatePoint = dfAnnotations[dfAnnotations['candidate']==ia][['coordX', 'coordY', 'coordZ']].values

        closestDist = np.inf  # No match yet; any real distance is smaller

        for ic in candCandidates: # Loop through the candidate indices for this seriesuid

            candidatePoint = dfCandidates[dfCandidates['candidate']==ic][['coordX', 'coordY', 'coordZ']].values

            dist = np.linalg.norm(annotatePoint - candidatePoint)  # Find euclidean distance between points

            if dist < closestDist:  # If this distance is closer then update array
                closest = [ia, ic, 
                           dfAnnotations[dfAnnotations['candidate']==ia]['diameter_mm'].values[0],
                           dfAnnotations[dfAnnotations['candidate']==ia]['coordX'].values[0],
                           dfAnnotations[dfAnnotations['candidate']==ia]['coordY'].values[0],
                           dfAnnotations[dfAnnotations['candidate']==ia]['coordZ'].values[0]]
                closestDist = dist  # Update with new closest distance      

        diameterArray.append(closest)  
       
    # Update dfClass1 to include the annotated size of the nodule (diameter_mm)
    # and the annotated center. DataFrame.set_value was removed in pandas 1.0,
    # so assign through .loc, which also creates the new columns on first use.
    for row in diameterArray:
        dfClass1.loc[row[1], 'diameter_mm'] = row[2]
        dfClass1.loc[row[1], 'coordX_annotated'] = row[3]
        dfClass1.loc[row[1], 'coordY_annotated'] = row[4]
        dfClass1.loc[row[1], 'coordZ_annotated'] = row[5]
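
The nested loops above filter the full dataframes once per annotation/candidate pair, which gets slow. As a rough sketch only (not the code this notebook ran), the same nearest-candidate matching can be done per series with NumPy broadcasting. The function name match_annotations and the toy frames below are made up for illustration; the column names follow the LUNA16 CSV files.

```python
import numpy as np
import pandas as pd


def match_annotations(annotations, candidates):
    """Attach each annotation to the nearest candidate in the same series.

    Hypothetical helper. Both frames need 'seriesuid', 'coordX', 'coordY'
    and 'coordZ'; annotations also needs 'diameter_mm'. Returns a copy of
    candidates with diameter_mm and coord*_annotated columns (NaN if unmatched).
    """
    coords = ['coordX', 'coordY', 'coordZ']
    out = candidates.copy()
    for col in ['diameter_mm'] + [c + '_annotated' for c in coords]:
        out[col] = np.nan

    for uid, ann in annotations.groupby('seriesuid'):
        cand = candidates[candidates['seriesuid'] == uid]
        if cand.empty:
            continue  # nothing to attach this series' annotations to
        # Pairwise differences, shape (n_annotations, n_candidates, 3)
        diff = ann[coords].to_numpy()[:, None, :] - cand[coords].to_numpy()[None, :, :]
        dist = np.linalg.norm(diff, axis=2)        # Euclidean distances
        nearest = cand.index[dist.argmin(axis=1)]  # closest candidate per annotation
        out.loc[nearest, 'diameter_mm'] = ann['diameter_mm'].to_numpy()
        for c in coords:
            out.loc[nearest, c + '_annotated'] = ann[c].to_numpy()
    return out


# Toy data (made up): two series, four candidates, three annotations
toy_annotations = pd.DataFrame({
    'seriesuid': ['a', 'a', 'b'],
    'coordX': [0.1, 10.2, 5.0], 'coordY': [0.0, 0.0, 5.0], 'coordZ': [0.0, 0.0, 5.0],
    'diameter_mm': [4.0, 6.0, 8.0],
})
toy_candidates = pd.DataFrame({
    'seriesuid': ['a', 'a', 'a', 'b'],
    'coordX': [0.0, 10.0, 50.0, 5.1], 'coordY': [0.0, 0.0, 0.0, 5.0], 'coordZ': [0.0, 0.0, 0.0, 5.0],
}, index=[10, 11, 12, 13])

matched = match_annotations(toy_annotations, toy_candidates)
print(matched[['seriesuid', 'diameter_mm', 'coordX_annotated']])
```

As in the loop version, if two annotations land on the same candidate the later one wins, and there is no upper limit on the matching distance, so a far-away annotation is still attached to its nearest candidate.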

Not all candidates were annotated

It looks like none of the class 0 candidates were annotated. 389 of the 1,557 class 1 nodules are also missing annotations.
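
A count like this can be obtained by checking which rows still have NaN in diameter_mm. The snippet below is only a sketch of that check on a small stand-in frame: its ten diameter values are copied from the first ten rows of dfClass1 displayed in Out[8], and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for dfClass1: diameter_mm of the first ten class 1 rows in Out[8]
sample = pd.DataFrame({
    'class': 1,
    'diameter_mm': [4.224708, 5.651471, 5.786348, np.nan, 18.545150,
                    np.nan, np.nan, 18.208570, np.nan, 8.143262],
})

n_total = len(sample)
n_missing = int(sample['diameter_mm'].isna().sum())  # rows with no matched annotation
n_annotated = n_total - n_missing

print(f"{n_missing} of {n_total} class 1 rows are missing annotations")
```

Running the same two lines against the full dfClass1 instead of the stand-in should give the notebook-wide count.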


In [8]:
dfClass1.iloc[:10,:]


Out[8]:
      candidate  seriesuid                                               coordX       coordY       coordZ  class  diameter_mm  coordX_annotated  coordY_annotated  coordZ_annotated
 436        436  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...   104.083933  -211.755826  -227.017987      1     4.224708        103.783651       -211.925149       -227.121250
1009       1009  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  -128.982091  -175.176790  -298.510192      1     5.651471       -128.699421       -175.319272       -298.387506
2053       2053  1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793...    69.974375  -141.066875   876.777280      1     5.786348         69.639017       -140.944586        876.374496
3633       3633  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...     1.790000   166.340000  -408.880000      1          NaN               NaN               NaN               NaN
3707       3707  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...     1.859783   172.221534  -405.366447      1    18.545150          2.441547        172.464881       -405.493732
3748       3748  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...    95.927241   143.074256  -425.000000      1          NaN               NaN               NaN               NaN
3842       3842  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...    89.320000   190.840000  -516.820000      1          NaN               NaN               NaN               NaN
3866       3866  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...    90.794891   148.860497  -426.786049      1    18.208570         90.931713        149.027266       -426.544715
3870       3870  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...    88.690908   150.310589  -434.000000      1          NaN               NaN               NaN               NaN
3901       3901  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...   -23.851577   192.982264  -391.433808      1     8.143262        -24.013824        192.102405       -391.081276
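
The output shows how the candidate centers and the annotated centers differ slightly. To put a number on that, here is a small sketch that computes the Euclidean offset for three of the matched rows above (candidates 436, 1009 and 3707). The values are copied from the output; treating the result as millimetres assumes the coordinates are world coordinates in mm, like diameter_mm.

```python
import numpy as np

# Candidate centers (coordX, coordY, coordZ) copied from Out[8]
candidate_xyz = np.array([
    [104.083933, -211.755826, -227.017987],   # candidate 436
    [-128.982091, -175.176790, -298.510192],  # candidate 1009
    [1.859783, 172.221534, -405.366447],      # candidate 3707
])

# Matching annotated centers (coord*_annotated) for the same rows
annotated_xyz = np.array([
    [103.783651, -211.925149, -227.121250],
    [-128.699421, -175.319272, -298.387506],
    [2.441547, 172.464881, -405.493732],
])

# Euclidean distance between the two centers, one value per row
offsets = np.linalg.norm(candidate_xyz - annotated_xyz, axis=1)
print(offsets)
```

For these three rows the offsets all come out below 1, which is small compared with the annotated diameters of 4 to 18 mm.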

In [9]:
del dfCandidates['diameter_mm']

In [10]:
dfOut = dfCandidates.join(dfClass1[['candidate', 'diameter_mm', 'coordX_annotated', 'coordY_annotated', 'coordZ_annotated']], on='candidate', rsuffix='_r')
del dfOut['candidate_r']
del dfOut['candidate']
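
A note on the join above: with on='candidate', DataFrame.join matches the left frame's candidate column against the right frame's index, which works here because dfClass1 kept the original row index. The toy example below (made-up frames and values) sketches that behaviour and shows a left merge as one possible alternative that avoids the extra candidate_r column.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: four candidates, two of them class 1
all_cands = pd.DataFrame({'candidate': [0, 1, 2, 3], 'class': [0, 1, 0, 1]})
class1 = pd.DataFrame({'candidate': [1, 3], 'diameter_mm': [4.2, np.nan]},
                      index=[1, 3])  # index equals the candidate id

# Same pattern as the notebook: left column 'candidate' vs. right index
joined = all_cands.join(class1[['candidate', 'diameter_mm']],
                        on='candidate', rsuffix='_r')
del joined['candidate_r']  # duplicate key column from the right frame

# Alternative: a left merge on the shared column, no suffix clean-up needed
merged = all_cands.merge(class1[['candidate', 'diameter_mm']],
                         on='candidate', how='left')

print(joined)
print(merged)
```

Both results keep all four candidates and leave diameter_mm as NaN for rows that have no class 1 match.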

In [11]:
dfOut.to_csv('candidates_with_annotations.csv', index=False)
