Clean up data

Sometimes, unwanted data needs to be deleted.

Each screenshot was checked manually, and the unwanted ones were moved to the wrong/ directory.

This notebook iterates through the wrong/ directory and removes the corresponding rows from the CSV file.


In [25]:
%ls -lh ../data/csv


total 9.8M
-rw-rw-r-- 1 im9uri im9uri 5.9M Aug 21 15:53 733bbfef.csv
-rw-rw-r-- 1 im9uri im9uri 4.0M Aug  8 21:58 97802012.csv
drwxrwxr-x 2 im9uri im9uri 4.0K Aug  8 20:02 final/
drwxrwxr-x 2 im9uri im9uri 4.0K Aug  8 20:01 preprocess/

In [26]:
import pandas as pd
import os

In [27]:
parent_path = os.path.dirname(os.getcwd())

csv_file = '97802012'
csv_file_name = csv_file + '.csv'
csv_dir_path = os.path.join(parent_path, 'data', 'csv')
csv_file_path = os.path.join(csv_dir_path, csv_file_name)

img_dir_path = os.path.join(parent_path, 'data', 'img', 'raw')
img_output_dir_path = os.path.join(img_dir_path, csv_file)
img_wrong_dir_path = os.path.join(parent_path, 'data', 'img', 'wrong')

In [28]:
df = pd.read_csv(csv_file_path, header=0)
old_rows_count = df.shape[0]
print("%d rows" % df.shape[0])
df.head(3)


39342 rows
Out[28]:
img wheel-axis clutch brake gas paddle-left paddle-right wheel-button-left-1 wheel-button-left-2 wheel-button-left-3 ... shifter-button-2 shifter-button-3 shifter-button-4 gear-1 gear-2 gear-3 gear-4 gear-5 gear-6 gear-R
0 97802012_2017_08_08_20_23_27_84.jpg 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 97802012_2017_08_08_20_23_27_94.jpg 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 97802012_2017_08_08_20_23_28_02.jpg 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 30 columns

Get the list of wrong images


In [29]:
wrong_list = os.listdir(img_wrong_dir_path)

In [30]:
# Keep only the images that belong to this recording (file names contain the CSV id)
wrong_list = [x for x in wrong_list if csv_file in x]

In [31]:
len(wrong_list)


Out[31]:
132

Get the index of each wrong image


In [32]:
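def get_index(i):
    # Index label of the first row whose 'img' value equals the file name i.
    # Raises IndexError if the file name does not appear in the CSV.
    return df[df['img'] == i].index.tolist()[0]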

In [33]:
wrong_list_index = [get_index(i) for i in wrong_list]
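The per-file lookup above scans the whole frame once for every wrong image and fails on the first file name that has no row. As an alternative, the same selection can be sketched with a single boolean mask. This is a minimal sketch, assuming the `img` column holds file names as in the table above; the helper name `split_wrong_rows` and the toy data are hypothetical and not part of this notebook.

```python
import pandas as pd


def split_wrong_rows(df, wrong_names):
    """Return (kept_df, missing_names) for a list of wrong image names.

    Hypothetical helper: rows whose 'img' value is in wrong_names are dropped;
    names that never appear in the 'img' column are returned separately.
    """
    mask = df['img'].isin(wrong_names)
    present = set(df.loc[mask, 'img'])
    missing = [name for name in wrong_names if name not in present]
    return df[~mask], missing


# Toy data standing in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    'img': ['a_01.jpg', 'a_02.jpg', 'a_03.jpg'],
    'brake': [0, 1, 0],
})
kept, missing = split_wrong_rows(toy, ['a_02.jpg', 'a_99.jpg'])
print(kept['img'].tolist())
print(missing)
```

Returning the unmatched names makes it possible to inspect files in wrong/ that have no row in the CSV before anything is deleted.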

Remove the rows, and save the modified csv file


In [34]:
# get_index returns index labels, so they can be passed to drop directly
df = df.drop(wrong_list_index)

In [35]:
df.shape[0]


Out[35]:
39210

In [36]:
assert df.shape[0] + len(wrong_list) == old_rows_count

In [37]:
df.to_csv(csv_file_path, index=False)
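The cell above overwrites the original CSV in place, so the removed rows cannot be recovered from that file afterwards. One possible precaution is to copy the existing file before writing. The following is a minimal sketch under stated assumptions: the helper name `save_with_backup` and the `.bak` suffix are hypothetical, and the demonstration runs in a temporary directory rather than on the real data.

```python
import os
import shutil
import tempfile

import pandas as pd


def save_with_backup(df, path):
    """Copy the existing file to path + '.bak', then overwrite path.

    Hypothetical helper; the '.bak' suffix is an assumption.
    """
    backup_path = path + '.bak'
    if os.path.exists(path):
        shutil.copy2(path, backup_path)
    df.to_csv(path, index=False)
    return backup_path


# Demonstration in a throwaway directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    pd.DataFrame({'img': ['a.jpg', 'b.jpg']}).to_csv(demo_path, index=False)
    backup = save_with_backup(pd.DataFrame({'img': ['a.jpg']}), demo_path)
    rows_after = len(pd.read_csv(demo_path))
    rows_backup = len(pd.read_csv(backup))
    print(rows_after, rows_backup)
```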

Verify that the file was saved correctly


In [38]:
df = pd.read_csv(csv_file_path, header=0)
print("%d rows" % df.shape[0])
df.head(3)


39210 rows
Out[38]:
img wheel-axis clutch brake gas paddle-left paddle-right wheel-button-left-1 wheel-button-left-2 wheel-button-left-3 ... shifter-button-2 shifter-button-3 shifter-button-4 gear-1 gear-2 gear-3 gear-4 gear-5 gear-6 gear-R
0 97802012_2017_08_08_20_23_27_84.jpg 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 97802012_2017_08_08_20_23_27_94.jpg 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 97802012_2017_08_08_20_23_28_02.jpg 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 30 columns


Move the remaining image files to their output directory. This marks the images as reviewed, with the "wrong" ones removed.


In [39]:
# Create the output directory if it does not exist yet
os.makedirs(img_output_dir_path, exist_ok=True)

In [40]:
for f in df['img']:
    old_path = os.path.join(img_dir_path, f)
    new_path = os.path.join(img_output_dir_path, f)
    os.rename(old_path, new_path)
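`os.rename` raises `FileNotFoundError` when a source file is missing, which can happen if the cell above is re-run after some files have already been moved. A more forgiving variant might skip missing files and report them. This is a minimal sketch, not the method used in this notebook; the helper name `move_images` and the demo file names are hypothetical.

```python
import os
import shutil
import tempfile


def move_images(names, src_dir, dst_dir):
    """Move each named file from src_dir to dst_dir; return names not found.

    Hypothetical helper: missing source files are skipped, not raised.
    """
    os.makedirs(dst_dir, exist_ok=True)
    not_found = []
    for name in names:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.move(src, os.path.join(dst_dir, name))
        else:
            not_found.append(name)
    return not_found


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.path.join(tmp, 'raw')
    out_dir = os.path.join(raw_dir, 'demo')
    os.makedirs(raw_dir)
    open(os.path.join(raw_dir, 'x.jpg'), 'w').close()
    skipped = move_images(['x.jpg', 'y.jpg'], raw_dir, out_dir)
    moved = sorted(os.listdir(out_dir))
    print(moved, skipped)
```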

Delete the wrong images from the wrong/ directory


In [41]:
for f in wrong_list:
    remove_file_path = os.path.join(img_wrong_dir_path, f)
    os.remove(remove_file_path)
