Author(s): kozyr@google.com, bfoo@google.com
In this notebook, we gather exploratory data from our training set to do feature engineering and model tuning. Before running this notebook, make sure that you've completed the setup TODO below.
In the spirit of learning to walk before learning to run, we'll write this notebook in a more basic style than you'll see in a professional setting.
TODO for you: In Screen terminal 1 (if you haven't started Screen in the VM yet, first type
screen and press Ctrl+a c to create a window; switch to window 1 with Ctrl+a 1),
go to the VM shell, create folders to store your training and debugging images, and then
copy a small sample of training images from Cloud Storage:
mkdir -p ~/data/training_small
gsutil -m cp gs://$BUCKET/catimages/training_images/000*.png ~/data/training_small/
gsutil -m cp gs://$BUCKET/catimages/training_images/001*.png ~/data/training_small/
mkdir -p ~/data/debugging_small
gsutil -m cp gs://$BUCKET/catimages/training_images/002*.png ~/data/debugging_small/
echo "done!"
Note that we only take images whose filenames start with those ID prefixes, to keep the total number we copy over to under 3,000 images.
In [0]:
# Enter your username:
YOUR_GMAIL_ACCOUNT = '******' # Whatever is before @gmail.com in your email address
In [0]:
# Libraries for this section:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import pandas as pd
import cv2
import warnings
warnings.filterwarnings('ignore')
In [0]:
# Grab the filenames:
TRAINING_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/training_small/')
files = os.listdir(TRAINING_DIR) # Grab all the files in the VM images directory
print(files[0:5]) # Let's see some filenames
In [0]:
def show_pictures(filelist, dir, img_rows=2, img_cols=3, figsize=(20, 10)):
  """Display the first few images.

  Args:
    filelist: list of filenames to pull from
    dir: directory where the files are stored
    img_rows: number of rows of images to display
    img_cols: number of columns of images to display
    figsize: sizing for inline plots

  Returns:
    None
  """
  plt.close('all')
  fig = plt.figure(figsize=figsize)
  for i in range(img_rows * img_cols):
    fig.add_subplot(img_rows, img_cols, i + 1)
    img = mpimg.imread(os.path.join(dir, filelist[i]))
    plt.imshow(img)
  plt.show()
In [0]:
show_pictures(files, TRAINING_DIR)
Check out the colors at rapidtables.com/web/color/RGB_Color, but don't forget to flip the order of the channels to BGR, since that's the order OpenCV uses.
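As a quick reminder of what that channel flip means (a minimal NumPy sketch, not part of the notebook's pipeline): OpenCV stores channels in BGR order, so reversing the last axis converts to the RGB order that matplotlib and most color charts expect.

```python
import numpy as np

# A single "pixel" as OpenCV would store it: pure red, in BGR order
bgr_pixel = np.array([[[0, 0, 255]]], dtype=np.uint8)  # shape (1, 1, 3)

# Reversing the channel axis converts BGR -> RGB
rgb_pixel = bgr_pixel[:, :, ::-1]

print(rgb_pixel[0, 0])  # red now comes first: [255, 0, 0]
```

This slicing trick is what `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)` does for a 3-channel image.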
In [0]:
# What does the actual image matrix look like? There are three channels:
img = cv2.imread(os.path.join(TRAINING_DIR, files[0]))
print('\n***Colors in the middle of the first image***\n')
print('Blue channel:')
print(img[63:67,63:67,0])
print('Green channel:')
print(img[63:67,63:67,1])
print('Red channel:')
print(img[63:67,63:67,2])
In [0]:
def show_bgr(filelist, dir, img_rows=2, img_cols=3, figsize=(20, 10)):
  """Make histograms of the pixel color matrices of the first few images.

  Args:
    filelist: list of filenames to pull from
    dir: directory where the files are stored
    img_rows: number of rows of images to display
    img_cols: number of columns of images to display
    figsize: sizing for inline plots

  Returns:
    None
  """
  plt.close('all')
  fig = plt.figure(figsize=figsize)
  colors = ('b', 'g', 'r')
  for i in range(img_rows * img_cols):
    fig.add_subplot(img_rows, img_cols, i + 1)
    img = cv2.imread(os.path.join(dir, filelist[i]))  # use the arguments, not globals
    for c, col in enumerate(colors):
      histr = cv2.calcHist([img], [c], None, [256], [0, 256])
      plt.plot(histr, color=col)
      plt.xlim([0, 256])
      plt.ylim([0, 500])
  plt.show()
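For a single channel, `cv2.calcHist([img], [c], None, [256], [0, 256])` is just a 256-bin histogram over pixel values 0–255. A minimal NumPy sketch of the same computation, on a toy channel rather than a real image:

```python
import numpy as np

# A toy 4x4 single-channel "image"
channel = np.array([[0,   0,   10,  10],
                    [10,  255, 255, 255],
                    [0,   10,  10,  0],
                    [255, 0,   0,   0]], dtype=np.uint8)

# NumPy equivalent of cv2.calcHist for one channel with 256 bins over [0, 256)
counts, _ = np.histogram(channel, bins=256, range=(0, 256))

print(counts[0], counts[10], counts[255])  # pixels with value 0, 10, 255
```

Every bin count sums to the total pixel count, which is a handy sanity check when comparing against `calcHist` output.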
In [0]:
show_bgr(files, TRAINING_DIR)
In [0]:
# Pull in the blue channel for each image, reshape to a vector, count unique values:
unique_colors = []
landscape = []
for f in files:
  img = np.array(cv2.imread(os.path.join(TRAINING_DIR, f)))[:, :, 0]
  # Guess whether landscape is more likely than portrait by comparing the
  # number of nonzero pixels in the 3rd column vs. the 3rd row:
  landscape_likely = (np.count_nonzero(img[:, 2]) > np.count_nonzero(img[2, :])) * 1
  # Count the number of unique blue values:
  col_count = len(set(img.ravel()))
  # Append to the lists:
  unique_colors.append(col_count)
  landscape.append(landscape_likely)

unique_colors = pd.DataFrame({'files': files, 'unique_colors': unique_colors,
                              'landscape': landscape})
unique_colors = unique_colors.sort_values(by=['unique_colors'])
print(unique_colors[0:10])
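The `landscape_likely` heuristic can be sanity-checked on tiny synthetic "images" (a sketch, assuming as above that off-aspect images are zero-padded to a square; the helper name is ours):

```python
import numpy as np

def landscape_likely(img):
    # More nonzero pixels down the 3rd column than across the 3rd row
    # suggests blank (zero-padded) rows near the top, i.e. a landscape shot.
    return int(np.count_nonzero(img[:, 2]) > np.count_nonzero(img[2, :]))

# Landscape: content fills a horizontal band, zero padding above and below
wide = np.zeros((10, 10), dtype=np.uint8)
wide[3:7, :] = 200

# Portrait: content fills a vertical band, zero padding left and right
tall = np.zeros((10, 10), dtype=np.uint8)
tall[:, 3:7] = 200

print(landscape_likely(wide), landscape_likely(tall))  # 1 0
```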
In [0]:
# Plot the pictures with the lowest diversity of unique color values:
suspicious = unique_colors['files'].tolist()
show_pictures(suspicious, TRAINING_DIR, 1)
In [0]:
def get_label(filename):
  """Split out the label from the filename of the image, where we stored it.

  Args:
    filename: filename string.

  Returns:
    label: an integer 1 or 0
  """
  split_filename = filename.split('_')
  label = int(split_filename[-1].split('.')[0])
  return label

# Example:
get_label('12550_0.1574_1.png')
In [0]:
df = unique_colors.copy()  # copy so we don't mutate unique_colors
df['label'] = df['files'].apply(get_label)
df['landscape_likely'] = df['landscape']
df = df.drop(['landscape', 'unique_colors'], axis=1)
df[:10]
In [0]:
def general_img_features(band):
  """Define a set of features that we can look at for each color band.

  Args:
    band: array for one of the blue, green, or red channels

  Returns:
    features: unique colors, nonzero count, mean, standard deviation,
      min, and max of the channel's pixel values
  """
  return [len(set(band.ravel())), np.count_nonzero(band),
          np.mean(band), np.std(band),
          band.min(), band.max()]

def concat_all_band_features(file, dir):
  """Extract features from a single image.

  Args:
    file: single image filename
    dir: directory where the files are stored

  Returns:
    features: descriptive statistics for pixels
  """
  img = cv2.imread(os.path.join(dir, file))
  features = []
  blue = np.float32(img[:, :, 0])
  green = np.float32(img[:, :, 1])
  red = np.float32(img[:, :, 2])
  features.extend(general_img_features(blue))   # indices 0-5
  features.extend(general_img_features(green))  # indices 6-11
  features.extend(general_img_features(red))    # indices 12-17
  return features
In [0]:
# Let's see an example:
print(files[0] + '\n')
example = concat_all_band_features(files[0], TRAINING_DIR)
print(example)
In [0]:
# Apply it to our dataframe:
feature_names = ['blue_unique', 'blue_nonzero', 'blue_mean', 'blue_sd', 'blue_min', 'blue_max',
'green_unique', 'green_nonzero', 'green_mean', 'green_sd', 'green_min', 'green_max',
'red_unique', 'red_nonzero', 'red_mean', 'red_sd', 'red_min', 'red_max']
# Compute a series holding all band features as lists
band_features_series = df['files'].apply(lambda x: concat_all_band_features(x, TRAINING_DIR))
# Loop through the lists and distribute them across new columns in the dataframe
for i in range(len(feature_names)):
  df[feature_names[i]] = band_features_series.apply(lambda x: x[i])
df[:10]
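The loop above scans the series once per feature. An equivalent one-pass alternative (a sketch on toy data, not part of the notebook's pipeline) expands the lists into columns in a single `DataFrame` call:

```python
import pandas as pd

# A toy stand-in for band_features_series: two "images", three features each
band_features_series = pd.Series([[1, 2, 3], [4, 5, 6]])
feature_names = ['f0', 'f1', 'f2']

# Expand each per-image feature list into its own column in one shot
expanded = pd.DataFrame(band_features_series.tolist(),
                        columns=feature_names,
                        index=band_features_series.index)

print(expanded)
```

With the real data, the resulting frame can be joined back onto `df` with `pd.concat([df, expanded], axis=1)`.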
In [0]:
# Are these features good for finding cats?
# Let's look at some basic correlations.
df.corr().round(2)
These coarse features look pretty bad individually. Most of this is because they capture absolute pixel values, and photo lighting can vary significantly from shot to shot, so we end up with a lot of noise.
Are there better feature detectors we can consider? Why yes, there are! Several common features involve finding corners in pictures and looking for pixel gradients (differences in pixel values between neighboring pixels in different directions).
The following snippet runs code to visualize Harris corner detection for a few sample images. The threshold determines how strong a signal we need before declaring that a pixel corresponds to a corner (high pixel gradients in all directions).
Note that because a Harris corner detector returns another image map with values corresponding to the likelihood of a corner at that pixel, it can also be fed into general_img_features() to extract additional features. What do you notice about corners on cat images?
In [0]:
THRESHOLD = 0.05

def show_harris(filelist, dir, band=0, img_rows=4, img_cols=4, figsize=(20, 10)):
  """Display Harris corner detection for the first few images.

  Args:
    filelist: list of filenames to pull from
    dir: directory where the files are stored
    band: 0 = 'blue', 1 = 'green', 2 = 'red'
    img_rows: number of rows of images to display
    img_cols: number of columns of images to display
    figsize: sizing for inline plots

  Returns:
    None
  """
  plt.close('all')
  fig = plt.figure(figsize=figsize)

  def plot_bands(src, band_img, position):
    fig.add_subplot(img_rows, img_cols, position + 1)
    dst = cv2.cornerHarris(np.float32(band_img), 2, 3, 0.04)
    dst = cv2.dilate(dst, None)  # dilation makes the marks a little bigger
    # Threshold for an optimal value; it may vary depending on the image.
    new_img = src.copy()
    new_img[dst > THRESHOLD * dst.max()] = [0, 0, 255]
    # Note: OpenCV reverses the red-green-blue channels compared to matplotlib,
    # so we have to flip the image before showing it.
    plt.imshow(cv2.cvtColor(new_img, cv2.COLOR_BGR2RGB))

  for i in range(img_rows * img_cols):
    img = cv2.imread(os.path.join(dir, filelist[i]))
    plot_bands(img, img[:, :, band], i)
  plt.show()
In [0]:
show_harris(files, TRAINING_DIR)
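Under the hood, `cv2.cornerHarris` computes, for each pixel, a response from the local structure tensor M of windowed gradient products: R = det(M) - k·trace(M)², which is large and positive only where gradients are strong in both directions. A minimal NumPy sketch of that idea (the helper names `box_sum` and `harris_response` are ours, not OpenCV's, and this omits OpenCV's Sobel filtering and border handling):

```python
import numpy as np

def box_sum(a, r=1):
    """Sum each (2r+1)x(2r+1) neighborhood by shifting a padded copy."""
    p = np.pad(a, r, mode='edge')
    out = np.zeros_like(a, dtype=float)
    h, w = a.shape
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + h, dx:dx + w]
    return out

def harris_response(img, k=0.04):
    """Per-pixel Harris corner response from the windowed structure tensor."""
    Iy, Ix = np.gradient(img.astype(float))   # gradients along rows, cols
    Ixx = box_sum(Ix * Ix)                    # windowed tensor entries
    Iyy = box_sum(Iy * Iy)
    Ixy = box_sum(Ix * Iy)
    det = Ixx * Iyy - Ixy * Ixy
    trace = Ixx + Iyy
    return det - k * trace * trace

# A bright square whose top-left corner sits at pixel (10, 10)
img = np.zeros((20, 20))
img[10:, 10:] = 1.0

R = harris_response(img)
print(R[10, 10] > 0)   # corner: strong gradients in both directions
print(R[15, 10] < 0)   # edge: gradient in only one direction
print(R[5, 5] == 0)    # flat region: no gradients at all
```

Because the response map R has the same shape as the image, it can be fed straight into `general_img_features()` to turn "corneriness" into additional per-image features, as the text above suggests.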