Combine the candidates.csv and annotations.csv files

For some reason, the candidates and annotations files were never merged. The additional information in the annotations file is the nodule size (diameter_mm), which could be useful in our models. Sizes are included only for class 1 nodules, and not every class 1 nodule has an annotation. The annotations are also in a different order from the candidates, and the nodule centers differ slightly between the two files. I've asked on the LUNA16 mailing list which set of coordinates is more accurate and am still waiting for a response.

This script goes through both files and tries to match each annotation to the correct candidate. It then merges the information and writes it to a new candidates_with_annotations.csv file.



In [1]:

    
## Create new candidates file



In [3]:

    
import pandas as pd
import numpy as np



In [4]:

    
DATA_DIR = "/Volumes/data/tonyr/dicom/LUNA16/"
cand_path = 'CSVFILES/candidates_V2.csv'
annotations_path = 'CSVFILES/annotations.csv'



In [5]:

    
dfAnnotations = pd.read_csv(DATA_DIR+annotations_path).reset_index()
dfAnnotations = dfAnnotations.rename(columns={'index': 'candidate'})
dfCandidates = pd.read_csv(DATA_DIR+cand_path).reset_index()
dfCandidates = dfCandidates.rename(columns={'index': 'candidate'})



In [5]:

    
dfCandidates['diameter_mm'] = np.nan  # Set a new column and fill with NaN until we know the true diameter of the candidate



In [6]:

    
dfClass1 = dfCandidates[dfCandidates['class'] == 1].copy(deep=True)  # Get only the class 1 (they are the only ones that are labeled)



In [6]:

    
dfCandidates.shape



    Out[6]:

(754975, 6)

Append nodule size to candidates

For each annotation, find the closest class 1 candidate in the same seriesuid, measured by Euclidean distance between the annotated center and the ROI center listed in the candidates file. Then update that candidate with the nodule size and center coordinates listed in the annotations file.



In [7]:

    
seriesuid = dfClass1['seriesuid'].unique()  # Get the unique series names (subjects)

for seriesNum in seriesuid:
    
    # Get the annotation indices and class 1 candidate indices for this seriesuid
    candAnnotations = dfAnnotations[dfAnnotations['seriesuid']==seriesNum]['candidate'].values
    candCandidates = dfClass1[dfClass1['seriesuid'] == seriesNum]['candidate'].values

    # Now loop through annotations to find closest candidate
    diameterArray = []

    for ia in candAnnotations: # Loop through the annotation indices for this seriesuid

        annotatePoint = dfAnnotations[dfAnnotations['candidate']==ia][['coordX', 'coordY', 'coordZ']].values

        closestDist = 10000

        for ic in candCandidates: # Loop through the candidate indices for this seriesuid

            candidatePoint = dfCandidates[dfCandidates['candidate']==ic][['coordX', 'coordY', 'coordZ']].values

            dist = np.linalg.norm(annotatePoint - candidatePoint)  # Find euclidean distance between points

            if dist < closestDist:  # If this candidate is closer, record it as the best match so far
                closest = [ia, ic, 
                           dfAnnotations[dfAnnotations['candidate']==ia]['diameter_mm'].values[0],
                           dfAnnotations[dfAnnotations['candidate']==ia]['coordX'].values[0],
                           dfAnnotations[dfAnnotations['candidate']==ia]['coordY'].values[0],
                           dfAnnotations[dfAnnotations['candidate']==ia]['coordZ'].values[0]]
                closestDist = dist  # Update with new closest distance      

        diameterArray.append(closest)  
       
    # Update dfClass1 with the annotated nodule size (diameter_mm) and annotated center.
    # row = [annotation index, candidate index, diameter_mm, coordX, coordY, coordZ]
    for row in diameterArray:
        dfClass1.loc[row[1], 'diameter_mm'] = row[2]
        dfClass1.loc[row[1], 'coordX_annotated'] = row[3]
        dfClass1.loc[row[1], 'coordY_annotated'] = row[4]
        dfClass1.loc[row[1], 'coordZ_annotated'] = row[5]
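
The nested loop above re-filters the dataframes for every annotation/candidate pair, which gets slow. As a rough sketch of the same nearest-candidate idea, the distances for one seriesuid can be computed in a single NumPy broadcast. The helper name `match_annotations`, the output column names, and the toy data are all hypothetical; this is not the notebook's code, only an illustration under the assumption that both frames carry seriesuid, coordX, coordY and coordZ columns.

```python
import numpy as np
import pandas as pd

def match_annotations(df_ann, df_cand):
    """Hypothetical helper: for each annotation, find the nearest candidate
    (Euclidean distance) within the same seriesuid."""
    coords = ['coordX', 'coordY', 'coordZ']
    rows = []
    for uid, ann in df_ann.groupby('seriesuid'):
        cand = df_cand[df_cand['seriesuid'] == uid]
        if cand.empty:
            continue  # no candidate to match in this series
        # Pairwise differences, shape (n_annotations, n_candidates, 3)
        diff = ann[coords].to_numpy()[:, None, :] - cand[coords].to_numpy()[None, :, :]
        dist = np.linalg.norm(diff, axis=2)   # shape (n_annotations, n_candidates)
        nearest = dist.argmin(axis=1)         # position of closest candidate per annotation
        for i, j in enumerate(nearest):
            rows.append({'candidate_index': cand.index[j],
                         'diameter_mm': ann['diameter_mm'].iloc[i],
                         'distance': dist[i, j]})
    return pd.DataFrame(rows)

# Toy data (made up for illustration, not LUNA16 values)
toy_cand = pd.DataFrame({
    'seriesuid': ['a', 'a', 'b'],
    'coordX': [0.0, 10.0, 5.0],
    'coordY': [0.0, 10.0, 5.0],
    'coordZ': [0.0, 10.0, 5.0],
}, index=[100, 101, 102])
toy_ann = pd.DataFrame({
    'seriesuid': ['a', 'b'],
    'coordX': [9.5, 5.2],
    'coordY': [10.1, 5.0],
    'coordZ': [10.0, 4.9],
    'diameter_mm': [4.0, 6.5],
})

matches = match_annotations(toy_ann, toy_cand)
print(matches)
```

Keeping the distance in the result makes it possible to inspect or threshold suspicious matches afterwards, which the original loop does not do.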

Not all candidates were annotated

It looks like none of the class 0 candidates were annotated. 389 of the 1,557 class 1 nodules are also missing annotations.
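
A count like this can be checked by counting the NaN entries in diameter_mm. The snippet below is a minimal sketch on a made-up stand-in for dfClass1 (the frame `toy_class1` and its values are hypothetical); on the real data the same two lines would be applied to dfClass1 directly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfClass1: two annotated and two unannotated candidates
toy_class1 = pd.DataFrame({
    'class': [1, 1, 1, 1],
    'diameter_mm': [4.2, np.nan, 18.5, np.nan],
})

n_missing = int(toy_class1['diameter_mm'].isna().sum())  # candidates with no matched annotation
n_total = len(toy_class1)
print(f"{n_missing} of {n_total} class 1 candidates have no annotated diameter")
```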



In [8]:

    
dfClass1.iloc[:10,:]



    Out[8]:

      candidate                                          seriesuid       coordX       coordY       coordZ  class  diameter_mm  coordX_annotated  coordY_annotated  coordZ_annotated
 436        436  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...   104.083933  -211.755826  -227.017987      1     4.224708        103.783651       -211.925149       -227.121250
1009       1009  1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...  -128.982091  -175.176790  -298.510192      1     5.651471       -128.699421       -175.319272       -298.387506
2053       2053  1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793...    69.974375  -141.066875   876.777280      1     5.786348         69.639017       -140.944586        876.374496
3633       3633  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...     1.790000   166.340000  -408.880000      1          NaN               NaN               NaN               NaN
3707       3707  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...     1.859783   172.221534  -405.366447      1    18.545150          2.441547        172.464881       -405.493732
3748       3748  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...    95.927241   143.074256  -425.000000      1          NaN               NaN               NaN               NaN
3842       3842  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...    89.320000   190.840000  -516.820000      1          NaN               NaN               NaN               NaN
3866       3866  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...    90.794891   148.860497  -426.786049      1    18.208570         90.931713        149.027266       -426.544715
3870       3870  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...    88.690908   150.310589  -434.000000      1          NaN               NaN               NaN               NaN
3901       3901  1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...   -23.851577   192.982264  -391.433808      1     8.143262        -24.013824        192.102405       -391.081276
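
To get a feel for how far the annotated centers sit from the candidate centers, the offset can be computed per row. The sketch below copies the first three annotated rows from the output above into a small frame; the frame name `sample` and the `offset` column are hypothetical additions for illustration only.

```python
import numpy as np
import pandas as pd

# First three annotated rows, copied from the dfClass1 output
sample = pd.DataFrame({
    'coordX': [104.083933, -128.982091, 69.974375],
    'coordY': [-211.755826, -175.176790, -141.066875],
    'coordZ': [-227.017987, -298.510192, 876.777280],
    'coordX_annotated': [103.783651, -128.699421, 69.639017],
    'coordY_annotated': [-211.925149, -175.319272, -140.944586],
    'coordZ_annotated': [-227.121250, -298.387506, 876.374496],
}, index=[436, 1009, 2053])

cand_xyz = sample[['coordX', 'coordY', 'coordZ']].to_numpy()
ann_xyz = sample[['coordX_annotated', 'coordY_annotated', 'coordZ_annotated']].to_numpy()

# Euclidean distance between the candidate center and the annotated center, per row
sample['offset'] = np.linalg.norm(cand_xyz - ann_xyz, axis=1)
print(sample['offset'])
```

For these three rows the offsets all come out below one coordinate unit, which is consistent with the centers being "slightly different" rather than mismatched.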



In [9]:

    
del dfCandidates['diameter_mm']



In [10]:

    
dfOut = dfCandidates.join(dfClass1[['candidate', 'diameter_mm', 'coordX_annotated', 'coordY_annotated', 'coordZ_annotated']], on='candidate', rsuffix='_r')
del dfOut['candidate_r']
del dfOut['candidate']
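
The join above relies on `DataFrame.join(..., on='candidate')` matching the caller's candidate column against the index of the other frame, which is why the duplicated candidate column from dfClass1 comes back with the `_r` suffix and is then deleted. A toy sketch of that behavior (the frames `left` and `right` are made up for illustration):

```python
import pandas as pd

# 'left' plays the role of dfCandidates, 'right' the role of the dfClass1 subset.
left = pd.DataFrame({'candidate': [0, 1, 2], 'class': [0, 1, 0]})
right = pd.DataFrame({'candidate': [1], 'diameter_mm': [4.2]}, index=[1])

# left['candidate'] is matched against right's index; rows without a match get NaN.
joined = left.join(right, on='candidate', rsuffix='_r')
print(joined)
```

Because this is a left join, every row of the left frame is kept, so the output has the same number of rows as the candidates file.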



In [11]:

    
dfOut.to_csv('candidates_with_annotations.csv', index=False)


