Clean up data

Sometimes, unwanted data needs to be deleted.

Every screenshot was checked manually, and the unwanted ones were moved to the wrong/ directory.

This notebook iterates through the wrong/ directory and removes the corresponding rows from the CSV file.



In [25]:

    
%ls -lh ../data/csv









    



total 9.8M
-rw-rw-r-- 1 im9uri im9uri 5.9M Aug 21 15:53 733bbfef.csv
-rw-rw-r-- 1 im9uri im9uri 4.0M Aug  8 21:58 97802012.csv
drwxrwxr-x 2 im9uri im9uri 4.0K Aug  8 20:02 final/
drwxrwxr-x 2 im9uri im9uri 4.0K Aug  8 20:01 preprocess/



In [26]:

    
import pandas as pd
import os



In [27]:

    
parent_path = os.path.dirname(os.getcwd())

csv_file = '97802012'
csv_file_name = csv_file + '.csv'
csv_dir_path = os.path.join(parent_path, 'data', 'csv')
csv_file_path = os.path.join(csv_dir_path, csv_file_name)

img_dir_path =  os.path.join(parent_path, 'data', 'img', 'raw')
img_output_dir_path = os.path.join(img_dir_path, csv_file)
img_wrong_dir_path = os.path.join(parent_path, 'data', 'img', 'wrong')



In [28]:

    
df = pd.read_csv(csv_file_path, header=0)
old_rows_count = df.shape[0]
print("%d rows" % df.shape[0])
df.head(3)









    



39342 rows






    Out[28]:







  
    
      
                                   img  wheel-axis  clutch  brake  gas  paddle-left  paddle-right  wheel-button-left-1  wheel-button-left-2  wheel-button-left-3  ...  shifter-button-2  shifter-button-3  shifter-button-4  gear-1  gear-2  gear-3  gear-4  gear-5  gear-6  gear-R
0  97802012_2017_08_08_20_23_27_84.jpg           0       0      0    0            0             0                    0                    0                    0  ...                 0                 0                 0       0       0       0       0       0       0       0
1  97802012_2017_08_08_20_23_27_94.jpg           0       0      0    0            0             0                    0                    0                    0  ...                 0                 0                 0       0       0       0       0       0       0       0
2  97802012_2017_08_08_20_23_28_02.jpg           0       0      0    0            0             0                    0                    0                    0  ...                 0                 0                 0       0       0       0       0       0       0       0

3 rows × 30 columns

Get the list of wrong images



In [29]:

    
wrong_list = os.listdir(img_wrong_dir_path)



In [30]:

    
wrong_list = [x for x in wrong_list if csv_file in x]
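
The filter above keeps any file name that contains the session id anywhere in it. A stricter variant is sketched below, assuming that image names always start with the session id followed by an underscore and end in .jpg, as the names in the CSV suggest. The helper name filter_session_images and the second and third example file names are made up for illustration.

```python
def filter_session_images(file_names, session_id, extension=".jpg"):
    """Keep only names shaped like '<session_id>_....jpg' (assumed naming scheme)."""
    prefix = session_id + "_"
    return sorted(
        name for name in file_names
        if name.startswith(prefix) and name.lower().endswith(extension)
    )


# The first name comes from the notebook output; the other two are invented.
example_names = [
    "97802012_2017_08_08_20_23_27_84.jpg",
    "733bbfef_2017_08_21_15_00_00_01.jpg",
    "notes_97802012.txt",
]
filtered = filter_session_images(example_names, "97802012")
print(filtered)
```

With the substring test, the third name would have been kept as well, since it contains the session id.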



In [31]:

    
len(wrong_list)









    Out[31]:





132

Get the index of each wrong image



In [32]:

    
def get_index(i):
    return df[df['img'] == i].index.tolist()[0]



In [33]:

    
wrong_list_index = [get_index(i) for i in wrong_list]
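
The per-file lookup above scans the whole img column once for every wrong image and raises an IndexError if a file name has no row in the CSV. One possible alternative, sketched here on a toy DataFrame, uses Series.isin to find all matching rows in a single pass and reports the names that were not found. The helper name find_wrong_rows and the toy data are assumptions, not part of the original notebook.

```python
import pandas as pd


def find_wrong_rows(frame, wrong_names, column="img"):
    """Return (index labels of matching rows, names with no matching row)."""
    wrong_set = set(wrong_names)
    mask = frame[column].isin(list(wrong_set))
    matched_index = frame.index[mask].tolist()
    # Names from the wrong/ directory that never appear in the CSV.
    missing = sorted(wrong_set - set(frame.loc[mask, column]))
    return matched_index, missing


# Toy data standing in for the real CSV.
toy = pd.DataFrame({"img": ["a.jpg", "b.jpg", "c.jpg"], "gas": [0, 1, 0]})
matched, missing = find_wrong_rows(toy, ["b.jpg", "z.jpg"])
print(matched, missing)

# Dropping by index label works the same way as in the notebook.
cleaned = toy.drop(matched)
print(len(cleaned))
```

Reporting the missing names separately makes it easier to see why a later row-count check might fail.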

Remove the rows and save the modified CSV file



In [34]:

    
df = df.drop(df.index[wrong_list_index])



In [35]:

    
df.shape[0]









    Out[35]:





39210



In [36]:

    
assert df.shape[0] + len(wrong_list) == old_rows_count



In [37]:

    
df.to_csv(csv_file_path, index=False)
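
The cell above overwrites the original CSV in place, so the removed rows cannot be recovered afterwards. A minimal sketch of a more cautious save follows: it copies the existing file to a backup first, writes the new data to a temporary file, and then swaps it into place. The helper name save_csv_with_backup and the .bak and .tmp suffixes are assumptions chosen for illustration.

```python
import os
import shutil
import tempfile

import pandas as pd


def save_csv_with_backup(frame, path):
    """Write frame to path, keeping the previous file as '<path>.bak'."""
    backup_path = path + ".bak"
    if os.path.exists(path):
        shutil.copy2(path, backup_path)  # preserve the rows about to be dropped
    tmp_path = path + ".tmp"
    frame.to_csv(tmp_path, index=False)
    os.replace(tmp_path, path)  # swap the finished file into place
    return backup_path


# Demonstration in a throwaway directory with toy data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    pd.DataFrame({"img": ["a.jpg", "b.jpg"], "gas": [0, 1]}).to_csv(demo_path, index=False)

    cleaned = pd.DataFrame({"img": ["a.jpg"], "gas": [0]})
    backup = save_csv_with_backup(cleaned, demo_path)

    rows_after = len(pd.read_csv(demo_path))
    rows_in_backup = len(pd.read_csv(backup))

print(rows_after, rows_in_backup)
```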

Verify that the file was saved correctly



In [38]:

    
df = pd.read_csv(csv_file_path, header=0)
print("%d rows" % df.shape[0])
df.head(3)









    



39210 rows






    Out[38]:







  
    
      
                                   img  wheel-axis  clutch  brake  gas  paddle-left  paddle-right  wheel-button-left-1  wheel-button-left-2  wheel-button-left-3  ...  shifter-button-2  shifter-button-3  shifter-button-4  gear-1  gear-2  gear-3  gear-4  gear-5  gear-6  gear-R
0  97802012_2017_08_08_20_23_27_84.jpg           0       0      0    0            0             0                    0                    0                    0  ...                 0                 0                 0       0       0       0       0       0       0       0
1  97802012_2017_08_08_20_23_27_94.jpg           0       0      0    0            0             0                    0                    0                    0  ...                 0                 0                 0       0       0       0       0       0       0       0
2  97802012_2017_08_08_20_23_28_02.jpg           0       0      0    0            0             0                    0                    0                    0  ...                 0                 0                 0       0       0       0       0       0       0       0

3 rows × 30 columns




Move the remaining image files to their output directory, to indicate that the images have been reviewed and the "wrong" ones removed.



In [39]:

    
if not os.path.exists(img_output_dir_path):
    os.makedirs(img_output_dir_path)



In [40]:

    
for f in df['img']:
    old_path = os.path.join(img_dir_path, f)
    new_path = os.path.join(img_output_dir_path, f)
    os.rename(old_path, new_path)
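
The loop above stops with an error at the first image that is missing from the raw directory, for example when the cell is run a second time after some files have already been moved. A more tolerant variant is sketched below, assuming that missing files should be skipped and reported instead of aborting the run. The helper name move_images and the demo file names are hypothetical.

```python
import os
import shutil
import tempfile


def move_images(file_names, src_dir, dst_dir):
    """Move each named file from src_dir to dst_dir; return (moved, missing)."""
    os.makedirs(dst_dir, exist_ok=True)
    moved, missing = [], []
    for name in file_names:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.move(src, os.path.join(dst_dir, name))
            moved.append(name)
        else:
            missing.append(name)  # already moved, or never captured
    return moved, missing


# Demonstration in a throwaway directory; the output directory sits inside
# the raw directory, mirroring the layout used in the notebook.
with tempfile.TemporaryDirectory() as raw_dir:
    for name in ("a.jpg", "b.jpg"):
        open(os.path.join(raw_dir, name), "w").close()
    session_dir = os.path.join(raw_dir, "session")
    moved, missing = move_images(["a.jpg", "b.jpg", "c.jpg"], raw_dir, session_dir)
    remaining = sorted(os.listdir(session_dir))

print(moved, missing, remaining)
```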

Delete the wrong images from the wrong/ directory



In [41]:

    
for f in wrong_list:
    remove_file_path = os.path.join(img_wrong_dir_path, f)
    os.remove(remove_file_path)


