Exploring the Training Set

Author(s): kozyr@google.com, bfoo@google.com

In this notebook, we gather exploratory data from our training set to do feature engineering and model tuning. Before running this notebook, make sure that:

  • You have already run steps 2 and 3 to collect and split your data into training, validation, and test.
  • Your training data is in a Google storage folder such as gs://[your-bucket]/[dataprep-dir]/training_images/

In the spirit of learning to walk before learning to run, we'll write this notebook in a more basic style than you'll see in a professional setting.

Setup

TODO for you: In Screen terminal 1 (if you haven't started Screen in the VM yet, first type screen and then Ctrl+a c to create a window), switch to the VM shell with Ctrl+a 1, create folders to store your training and debugging images, and then copy a small sample of training images from Cloud Storage:

mkdir -p ~/data/training_small
gsutil -m cp gs://$BUCKET/catimages/training_images/000*.png ~/data/training_small/
gsutil -m cp gs://$BUCKET/catimages/training_images/001*.png ~/data/training_small/
mkdir -p ~/data/debugging_small
gsutil -m cp gs://$BUCKET/catimages/training_images/002*.png ~/data/debugging_small/
echo "done!"

Note that we only copy images whose IDs start with 000, 001, or 002 to keep the total number we copy over to the VM under 3,000 images.


In [0]:
# Enter your username:
YOUR_GMAIL_ACCOUNT = '******' # Whatever is before @gmail.com in your email address

In [0]:
# Libraries for this section:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import pandas as pd
import cv2
import warnings
warnings.filterwarnings('ignore')

In [0]:
# Grab the filenames:
TRAINING_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/training_small/')
files = os.listdir(TRAINING_DIR)  # Grab all the files in the VM images directory
print(files[0:5])  # Let's see some filenames
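One caveat: os.listdir returns entries in arbitrary order and will include any non-image files sitting in the directory. A defensive variant is sketched below on a made-up file list, assuming the training images are all PNGs:

```python
# Keep only .png files and sort for a reproducible order (sketch):
def list_png_files(filenames):
  return sorted(f for f in filenames if f.endswith('.png'))

example = ['0001_0.25_1.png', 'notes.txt', '0000_0.50_0.png']
print(list_png_files(example))  # ['0000_0.50_0.png', '0001_0.25_1.png']
```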

Eyes on the data!


In [0]:
def show_pictures(filelist, dir, img_rows=2, img_cols=3, figsize=(20, 10)):
  """Display the first few images.

  Args:
    filelist: list of filenames to pull from
    dir: directory where the files are stored
    img_rows: number of rows of images to display
    img_cols: number of columns of images to display
    figsize: sizing for inline plots

  Returns:
    None
  """
  plt.close('all')
  fig = plt.figure(figsize=figsize)

  for i in range(img_rows * img_cols):
    a = fig.add_subplot(img_rows, img_cols, i + 1)
    img = mpimg.imread(os.path.join(dir, filelist[i]))
    plt.imshow(img)
  plt.show()

In [0]:
show_pictures(files, TRAINING_DIR)

Check out the colors at rapidtables.com/web/color/RGB_Color, but don't forget that OpenCV stores channels in BGR order, so flip the order of the RGB values you look up.
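The BGR-vs-RGB flip is just a reversal of the channel axis, which NumPy slicing handles without any OpenCV call. A minimal sketch on a one-pixel "image":

```python
import numpy as np

# A 1x1 pure-red pixel as OpenCV stores it: (B, G, R) = (0, 0, 255)
bgr = np.array([[[0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis converts BGR to RGB (what matplotlib expects):
rgb = bgr[:, :, ::-1]
print(rgb[0, 0])  # [255   0   0]
```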


In [0]:
# What does the actual image matrix look like?  There are three channels:
img = cv2.imread(os.path.join(TRAINING_DIR, files[0]))
print('\n***Colors in the middle of the first image***\n')
print('Blue channel:')
print(img[63:67,63:67,0])
print('Green channel:')
print(img[63:67,63:67,1])
print('Red channel:')
print(img[63:67,63:67,2])

In [0]:
def show_bgr(filelist, dir, img_rows=2, img_cols=3, figsize=(20, 10)):
  """Make histograms of the pixel color matrices of first few images.

  Args:
    filelist: list of filenames to pull from
    dir: directory where the files are stored
    img_rows: number of rows of images to display
    img_cols: number of columns of images to display
    figsize: sizing for inline plots

  Returns:
    None
  """
  plt.close('all')
  fig = plt.figure(figsize=figsize)
  color = ('b','g','r')

  for i in range(img_rows * img_cols):
    a = fig.add_subplot(img_rows, img_cols, i + 1)
    img = cv2.imread(os.path.join(dir, filelist[i]))
    for c, col in enumerate(color):
      histr = cv2.calcHist([img], [c], None, [256], [0, 256])
      plt.plot(histr, color=col)
      plt.xlim([0, 256])
      plt.ylim([0, 500])
  plt.show()

In [0]:
show_bgr(files, TRAINING_DIR)
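If you want to double-check what cv2.calcHist is computing for one channel, the same counts can be reproduced with NumPy alone; a sketch on a tiny synthetic channel:

```python
import numpy as np

# A fake 8-bit channel: four pixels with values 0, 0, 128, 255
band = np.array([[0, 0], [128, 255]], dtype=np.uint8)

# np.histogram with 256 bins over [0, 256) matches what
# cv2.calcHist([img], [c], None, [256], [0, 256]) returns for that channel:
counts, _ = np.histogram(band, bins=256, range=(0, 256))
print(counts[0], counts[128], counts[255])  # 2 1 1
```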

Do some sanity checks

For example:

  • Do we have blank images?
  • Do we have images with very few colors?

In [0]:
# Pull in blue channel for each image, reshape to vector, count unique values:
unique_colors = []
landscape = []
for f in files:
  img = cv2.imread(os.path.join(TRAINING_DIR, f))[:,:,0]
  # Determine if landscape is more likely than portrait by comparing
  # the count of nonzero values in the 3rd column vs the 3rd row:
  landscape_likely = (np.count_nonzero(img[:,2]) > np.count_nonzero(img[2,:])) * 1
  # Count number of unique blue values:
  col_count = len(set(img.ravel()))
  # Append to array:
  unique_colors.append(col_count)
  landscape.append(landscape_likely)
    
unique_colors = pd.DataFrame({'files': files, 'unique_colors': unique_colors,
                              'landscape': landscape})
unique_colors = unique_colors.sort_values(by=['unique_colors'])
print(unique_colors[0:10])
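A truly blank channel has exactly one unique value, so a hard filter on unique_colors catches those images outright; a sketch on hypothetical rows (the filenames and counts are made up):

```python
import pandas as pd

# Hypothetical rows: a blank image has exactly one unique channel value
uc = pd.DataFrame({'files': ['a.png', 'b.png', 'c.png'],
                   'unique_colors': [1, 4, 200],
                   'landscape': [0, 1, 1]})

blank = uc[uc['unique_colors'] == 1]['files'].tolist()
print(blank)  # ['a.png']
```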

In [0]:
# Plot the pictures with the lowest diversity of unique color values:
suspicious = unique_colors['files'].tolist()
show_pictures(suspicious, TRAINING_DIR, 1)

Get labels

Extract labels from the filename and create a pretty dataframe for analysis.


In [0]:
def get_label(filename):
  """
  Split out the label from the filename of the image, where we stored it.
  Args:
    filename: filename string.
  Returns:
    label: an integer 1 or 0
  """
  split_filename = filename.split('_')
  label = int(split_filename[-1].split('.')[0])
  return label

# Example:
get_label('12550_0.1574_1.png')
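Since the label parser trusts the filename format, it's worth verifying that every parsed label is actually 0 or 1 before building the DataFrame; a self-contained sketch (the parsing logic is reproduced here, and the filenames are made up):

```python
def get_label(filename):
  # Label is the integer between the last '_' and the extension:
  return int(filename.split('_')[-1].split('.')[0])

filenames = ['12550_0.1574_1.png', '00017_0.9000_0.png']
labels = [get_label(f) for f in filenames]
assert all(l in (0, 1) for l in labels), 'unexpected label value'
print(labels)  # [1, 0]
```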

Create DataFrame


In [0]:
df = unique_colors.copy()
df['label'] = df['files'].apply(get_label)
df['landscape_likely'] = df['landscape']
df = df.drop(['landscape', 'unique_colors'], axis=1)
df[:10]
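With labels in hand, a quick class-balance check tells you whether plain accuracy would be a misleading metric later on; a sketch on a hypothetical label column (in the notebook this would be df['label']):

```python
import pandas as pd

# Hypothetical labels standing in for df['label']
labels = pd.Series([1, 0, 1, 1, 0, 1])
print(labels.value_counts(normalize=True))
# Roughly: 1 -> 0.667, 0 -> 0.333
```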

Basic Feature Engineering

Below, we show an example of a very simple set of features that can be derived from an image. This function pulls the unique-value count, nonzero count, mean, standard deviation, min, and max of pixel values in one image band (blue, green, or red).


In [0]:
def general_img_features(band):
  """
  Define a set of features that we can look at for each color band
  Args:
    band: array which is one of blue, green, or red
  Returns:
    features: unique colors, nonzero count, mean, standard deviation,
              min, and max of the channel's pixel values
  """
  return [len(set(band.ravel())), np.count_nonzero(band),
          np.mean(band), np.std(band),
          band.min(), band.max()]

def concat_all_band_features(file, dir):
  """
  Extract features from a single image.
  Args:
    file: single image filename
    dir: directory where the files are stored
  Returns:
    features: descriptive statistics for pixels in each band
  """
  img = cv2.imread(os.path.join(dir, file))
  features = []
  blue = np.float32(img[:,:,0])
  green = np.float32(img[:,:,1])
  red = np.float32(img[:,:,2])
  features.extend(general_img_features(blue))  # indices 0-5
  features.extend(general_img_features(green))  # indices 6-11
  features.extend(general_img_features(red))  # indices 12-17
  return features

In [0]:
# Let's see an example:
print(files[0] + '\n')
example = concat_all_band_features(files[0], TRAINING_DIR)
print(example)
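To sanity-check the feature extractor, you can run general_img_features on a tiny band whose statistics are easy to compute by hand (the function is reproduced here so the sketch is self-contained):

```python
import numpy as np

def general_img_features(band):
  return [len(set(band.ravel())), np.count_nonzero(band),
          np.mean(band), np.std(band),
          band.min(), band.max()]

band = np.array([[0., 1.], [2., 3.]])
# 4 unique values, 3 nonzero pixels, mean 1.5, std ~1.118, min 0, max 3
features = general_img_features(band)
print(features)
```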

In [0]:
# Apply it to our dataframe:
feature_names = ['blue_unique', 'blue_nonzero', 'blue_mean', 'blue_sd', 'blue_min', 'blue_max',
                 'green_unique', 'green_nonzero', 'green_mean', 'green_sd', 'green_min', 'green_max',
                 'red_unique', 'red_nonzero', 'red_mean', 'red_sd', 'red_min', 'red_max']

# Compute a series holding all band features as lists
band_features_series = df['files'].apply(lambda x: concat_all_band_features(x, TRAINING_DIR))

# Loop through lists and distribute them across new columns in the dataframe
for i in range(len(feature_names)):
  df[feature_names[i]] = band_features_series.apply(lambda x: x[i])
df[:10]

In [0]:
# Are these features good for finding cats?
# Let's look at some basic correlations.
df.corr().round(2)

These coarse features look pretty bad individually. Most of this is because they capture absolute pixel values, and photo lighting can vary significantly from shot to shot, so what we end up with is mostly noise.

Are there some better feature detectors we can consider? Why yes, there are! Several common features involve finding corners in pictures, and looking for pixel gradients (differences in pixel values between neighboring pixels in different directions).
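The pixel-gradient idea can be sketched without OpenCV: np.gradient gives per-direction differences, and their magnitude lights up at edges. A minimal example on a synthetic image with a vertical edge:

```python
import numpy as np

# 4x4 image with a vertical edge: left half 0, right half 255
img = np.zeros((4, 4), dtype=np.float32)
img[:, 2:] = 255.0

# Per-direction gradients (axis 0 = rows/y, axis 1 = cols/x), then magnitude:
gy, gx = np.gradient(img)
magnitude = np.hypot(gx, gy)

# The gradient magnitude is large only around the edge columns:
print(magnitude[0])  # e.g. [  0.  127.5 127.5   0. ]
```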

Harris Corner Detector

The following snippet visualizes Harris corner detection for a few sample images. The threshold sets how strong a response must be before we mark a pixel as a corner (high pixel gradients in all directions).

Note that because a Harris corner detector returns another image map with values corresponding to the likelihood of a corner at that pixel, it can also be fed into general_img_features() to extract additional features. What do you notice about corners on cat images?


In [0]:
THRESHOLD = 0.05

def show_harris(filelist, dir, band=0, img_rows=4, img_cols=4, figsize=(20, 10)):
  """
  Display Harris corner detection for the first few images.
  Args:
    filelist: list of filenames to pull from
    dir: directory where the files are stored
    band: 0 = 'blue', 1 = 'green', 2 = 'red'
    img_rows: number of rows of images to display
    img_cols: number of columns of images to display
    figsize: sizing for inline plots
  Returns:
    None
  """
  plt.close('all')
  fig = plt.figure(figsize=figsize)

  def plot_bands(src, band_img):
    a = fig.add_subplot(img_rows, img_cols, i + 1)
    dst = cv2.cornerHarris(band_img, 2, 3, 0.04)
    dst = cv2.dilate(dst, None)  # dilation makes the marks a little bigger

    # Threshold for an optimal value; it may vary depending on the image.
    new_img = src.copy()
    new_img[dst > THRESHOLD * dst.max()] = [0, 0, 255]
    # Note: openCV reverses the red-green-blue channels compared to matplotlib,
    # so we have to convert the image before showing it
    plt.imshow(cv2.cvtColor(new_img, cv2.COLOR_BGR2RGB))

  for i in range(img_rows * img_cols):
    img = cv2.imread(os.path.join(dir, filelist[i]))
    plot_bands(img, img[:,:,band])

  plt.show()

In [0]:
show_harris(files, TRAINING_DIR)