As a first step, I will try to find duplicates in the data that I am currently loading. The question is whether these files are real duplicates or whether something went wrong with the file naming.
In [49]:
# Imports
import re
import numpy as np
import pandas as pd
import brainbox as bb
from collections import Counter
In [5]:
# Grab image files
in_path = '/data1/abide/Out/Remote/some_failed/out'
metric = 'stability_maps'
file_dict = bb.fileOps.grab_files(in_path, '.nii.gz', sub=metric)
# Grab pheno data
pheno_path = '/home/surchs/Project/abide/pheno/pheno_full.csv'
pheno = pd.read_csv(pheno_path)
In [34]:
# Get the real sub ids from the file names
names = file_dict['sub_name']
sub_ids = np.array([np.int64(re.search(r'(?<=\d{2})\d{5}', sub_id).group()) for sub_id in names])
In [77]:
# Find subject IDs that occur more than once
dupl = [item for item, count in Counter(sub_ids).items() if count > 1]
out_list = []
for val in dupl:
    # Collect all file names that share this duplicated subject ID
    c = [names[x] for x in np.where(sub_ids == val)[0]]
    out_list.append(c)
out_list
out_list
Out[77]:
Ok, so the duplicates are all several sessions of the same subject. Makes a lot of sense. How do I deal with this?
Well, I will just pick the first scan in these cases.
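A minimal sketch of this first-scan selection, using made-up subject IDs and file names for illustration (the real IDs come from the regex extraction above):

```python
import numpy as np

# Hypothetical subject IDs extracted from file names, with repeats
sub_ids = np.array([51234, 51234, 51250, 51301, 51301, 51301])
names = ['0051234_session_1', '0051234_session_2', '0051250_session_1',
         '0051301_session_1', '0051301_session_2', '0051301_session_3']

# np.unique with return_index gives the index of the first occurrence
# of each subject ID, i.e. the first scan per subject
_, first_idx = np.unique(sub_ids, return_index=True)
keep_names = [names[i] for i in sorted(first_idx)]
print(keep_names)
```

This keeps exactly one file per subject, namely the one that appears first in the file list.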