In this notebook, we'll take the basic data set, use the `ibmseti` Python package to convert each data file into a spectrogram, then save the spectrograms as `.png` files.
We'll also split the data set into a training set and a test set and create a handful of zip files for each class. This dovetails into the next tutorial, where we will train a custom Watson Visual Recognition classifier (using the zip files of PNGs) and measure its performance with the test (cross-validation) set.
You may want to adapt this script to use the primary data set.
In [1]:
from __future__ import division
import cStringIO
import glob
import json
import requests
import ibmseti
import os
import zipfile
import numpy as np
import matplotlib.pyplot as plt
In [2]:
#Making a local folder to put my data.
mydatafolder = os.environ['PWD'] + '/' + 'my_team_name_data_folder'
if not os.path.exists(mydatafolder):
    os.makedirs(mydatafolder)
In [5]:
!ls -alrht $mydatafolder
In [6]:
outputpng_folder = mydatafolder + '/png'
if not os.path.exists(outputpng_folder):
    os.makedirs(outputpng_folder)
In [8]:
#Use `ibmseti`, or other methods, to draw the spectrograms
def draw_spectrogram(data):

    aca = ibmseti.compamp.SimCompamp(data)
    spec = aca.get_spectrogram()

    # Instead of using the SimCompamp.get_spectrogram method, you could
    # perform your own signal processing here before you create the spectrogram.
    #
    # SimCompamp.get_spectrogram is relatively simple. Here's the code to reproduce it:
    #
    # header, raw_data = r.content.split('\n', 1)
    # complex_data = np.frombuffer(raw_data, dtype='i1').astype(np.float32).view(np.complex64)
    # shape = (int(32*8), int(6144/8))
    # spec = np.abs( np.fft.fftshift( np.fft.fft( complex_data.reshape(*shape) ), 1) )**2
    #
    # But instead of the line above, can you manipulate `complex_data` with signal processing
    # techniques in the time domain (windowing? de-chirp?), or manipulate the output of the
    # np.fft.fft process in a way that improves the signal to noise (Welch periodogram, subtract a noise model)?
    #
    # Example: apply a Hanning window
    # complex_data = complex_data.reshape(*shape)
    # complex_data = complex_data * np.hanning(complex_data.shape[1])
    # spec = np.abs( np.fft.fftshift( np.fft.fft( complex_data ), 1) )**2
    #
    # Alternatively:
    # If you're using ibmseti 1.0.5 or greater, you can define a signal processing function,
    # which will be passed the 2D complex-valued time-series numpy array. Your processing function should
    # return a 2D numpy array -- though it doesn't need to be complex-valued or even the same size.
    # The SimCompamp.get_spectrogram function will treat the output of your signal processing function
    # in the same way it treats the raw 2D complex-valued time-series data:
    # the Fourier transform of each row of the 2D array will be calculated
    # and then squared to produce the spectrogram.
    #
    # def mySignalProcessing(compData):
    #     return compData * np.hanning(compData.shape[1])
    #
    # aca.sigProc(mySignalProcessing)
    # spec = aca.get_spectrogram()
    #
    # You can define more sophisticated signal processing inside your function.

    fig, ax = plt.subplots(figsize=(10, 5))

    # Do different color mappings affect Watson's classification accuracy?
    # ax.imshow(np.log(spec), aspect=0.5*float(spec.shape[1])/spec.shape[0], cmap='hot')
    # ax.imshow(np.log(spec), aspect=0.5*float(spec.shape[1])/spec.shape[0], cmap='gray')
    # ax.imshow(np.log(spec), aspect=0.5*float(spec.shape[1])/spec.shape[0], cmap='Greys')

    # If you're going to plot the log, make sure there are no values less than or equal to zero
    spec_pos_min = spec[spec > 0].min()
    spec[spec <= 0] = spec_pos_min

    ax.imshow(np.log(spec), aspect=0.5*float(spec.shape[1])/spec.shape[0], cmap='gray')

    return fig, aca.header()
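The spectrogram recipe in the comments above can be sanity-checked with plain NumPy, without `ibmseti` or real data. This is a minimal sketch, assuming the (32*8, 6144/8) shape from the comments; a synthetic complex tone should produce a bright vertical line (the same peak column in every row):

```python
import numpy as np

def spectrogram_from_complex(complex_data, nrows=32*8, ncols=6144//8, window=True):
    """Reshape a 1-D complex time series into (nrows, ncols) rows and compute
    a power spectrogram, optionally applying a Hann window to each row."""
    reshaped = complex_data.reshape(nrows, ncols)
    if window:
        reshaped = reshaped * np.hanning(ncols)
    return np.abs(np.fft.fftshift(np.fft.fft(reshaped), axes=1)) ** 2

# Synthetic example: a pure tone at 0.1 cycles/sample
n = (32 * 8) * (6144 // 8)
t = np.arange(n)
tone = np.exp(2j * np.pi * 0.1 * t).astype(np.complex64)
spec = spectrogram_from_complex(tone)
print(spec.shape)  # (256, 768)
```

Because every row is just a phase-shifted copy of the same tone, the magnitude spectrum (and hence the brightest column) is identical across rows.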
In [9]:
## We're going to use Spark to distribute the job of creating the PNGs on the executor nodes
myzipfilepath = os.path.join(mydatafolder,'basic4.zip')
zz = zipfile.ZipFile(myzipfilepath)
filenames = filter(lambda x: x.endswith('.dat'), zz.namelist()) #this filters out the top-level folder in the zip file, which is a separate entry in the namelist
rdd = sc.parallelize(filenames, 8) #2 executors are available on free-tier IBM Spark clusters. If you have access to an Enterprise cluster, which has 30 executors, you should parallelize to 120 partitions
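`sc.parallelize` slices the list into roughly equal partitions, one unit of work per partition. If you want to see how the slicing falls out without a Spark cluster, here is a pure-Python sketch of that partitioning (an approximation of Spark's behavior, not its exact code):

```python
def split_into_partitions(items, num_partitions):
    """Slice a list into num_partitions roughly equal contiguous chunks,
    spreading any remainder across the partitions."""
    n = len(items)
    return [items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
            for i in range(num_partitions)]

# e.g. 140 .dat files over 8 partitions
filenames = ['file_%03d.dat' % i for i in range(140)]
parts = split_into_partitions(filenames, 8)
print([len(p) for p in parts])  # [17, 18, 17, 18, 17, 18, 17, 18]
```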
In [10]:
def extract_data(row):
    zzz = zipfile.ZipFile(myzipfilepath)
    return (row, zzz.open(row).read())  # ZipFile.open only accepts mode 'r', not 'rb'

rdd = rdd.map(extract_data)
In [11]:
def convert_to_spectrogram_and_save(row):
    name = os.path.basename(row[0])
    fig, header = draw_spectrogram(row[1])
    png_file = name + '.png'
    fig.savefig(outputpng_folder + '/' + png_file)
    plt.close(fig)
    return (name, header, png_file)
In [12]:
rdd = rdd.map(convert_to_spectrogram_and_save)
In [13]:
results = rdd.collect() #This took about 70s on an Enterprise cluster. It will take longer on your free-tier.
In [14]:
results[0]
Out[14]:
Using the basic file list, we'll create training and test sets for each signal class. Then we'll archive the `.png` files into a handful of `.zip` files. (We need the `.zip` files to be smaller than 100 MB because there is a limit on the size of batches of data that can be uploaded to Watson Visual Recognition when training a classifier.)
In [2]:
# Grab the basic file list in order to
# organize the data into classes
ff = open('public_list_basic_v2_26may_2017.csv')
uuids_classes_as_list = ff.read().split('\n')[1:-1]  #slice off the first line (header) and last line (empty)

def row_to_json(row):
    uuid, sigclass = row.split(',')
    return {'uuid': uuid, 'signal_classification': sigclass}

uuids_classes_as_list = map(lambda row: row_to_json(row), uuids_classes_as_list)
print "found {} files".format(len(uuids_classes_as_list))

uuids_group_by_class = {}
for item in uuids_classes_as_list:
    uuids_group_by_class.setdefault(item['signal_classification'], []).append(item)
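The manual `split(',')` above works because the basic CSV has no quoted fields. If you'd rather not rely on that, the stdlib `csv` module does the same parsing more robustly. A sketch on an in-memory sample (the two-column layout mirrors the real file, but the uuid values here are made up; Python 3 syntax):

```python
import csv
import io

# A made-up stand-in for public_list_basic_v2_26may_2017.csv
sample = "UUID,SIGNAL_CLASSIFICATION\nabc-123,noise\ndef-456,narrowband\nghi-789,noise\n"

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header line
rows = [{'uuid': uuid, 'signal_classification': sigclass} for uuid, sigclass in reader]

# Same grouping pattern as above
by_class = {}
for item in rows:
    by_class.setdefault(item['signal_classification'], []).append(item)
print(sorted(by_class.keys()))  # ['narrowband', 'noise']
```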
In [16]:
#At first, use just 20 percent and 10 percent. This will be useful
#as you prototype. Then you can come back here and increase these
#percentages as needed.
training_percentage = 0.20
test_percentage = 0.10
assert training_percentage + test_percentage <= 1.0

training_set_group_by_class = {}
test_set_group_by_class = {}

for k, v in uuids_group_by_class.iteritems():
    total = len(v)
    training_size = int(total * training_percentage)
    test_size = int(total * test_percentage)
    training_set = v[:training_size]
    test_set = v[-1*test_size:]
    training_set_group_by_class[k] = training_set
    test_set_group_by_class[k] = test_set
    print '{}: training set size: {}'.format(k, len(training_set))
    print '{}: test set size: {}'.format(k, len(test_set))
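To see what the slicing does, here is the same arithmetic on a toy class of 1000 files: the first 20% becomes training, the last 10% becomes test, and because the percentages sum to at most 1.0 the two slices cannot overlap (Python 3 syntax):

```python
# Toy stand-in for one class's file list
v = ['file_%04d' % i for i in range(1000)]

training_percentage = 0.20
test_percentage = 0.10

training_size = int(len(v) * training_percentage)  # 200
test_size = int(len(v) * test_percentage)          # 100

training_set = v[:training_size]   # first 200 items
test_set = v[-test_size:]          # last 100 items

print(len(training_set), len(test_set))     # 200 100
print(set(training_set) & set(test_set))    # set()
```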
In [17]:
training_set_group_by_class['noise'][0]
Out[17]:
In [18]:
fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]
In [19]:
zipfilefolder = mydatafolder + '/zipfiles'
if not os.path.exists(zipfilefolder):
    os.makedirs(zipfilefolder)
In [20]:
max_zip_file_size_in_mb = 25
In [21]:
#Create the Zip files containing the training PNG files
#Note that this limits output files to be less than <max_zip_file_size_in_mb> MB because Watson VR has a limit on the
#size of input files that can be sent in single HTTP calls to train a custom classifier
for k, v in training_set_group_by_class.iteritems():

    fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]  #yes, files are <uuid>.dat.png :/

    count = 1
    for fn in fnames:
        archive_name = '{}/classification_{}_{}.zip'.format(zipfilefolder, count, k)

        if os.path.exists(archive_name):
            zz = zipfile.ZipFile(archive_name, mode='a')
        else:
            print 'creating new archive', archive_name
            zz = zipfile.ZipFile(archive_name, mode='w')

        zz.write(fn)
        zz.close()

        #if archive_name exceeds <max_zip_file_size_in_mb> MB, increase count to start a new one
        if os.path.getsize(archive_name) > max_zip_file_size_in_mb * 1024 ** 2:
            count += 1
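The rollover logic can be checked without writing real archives. This sketch mimics the loop above with in-memory sizes only; note that, as in the loop, the size check happens after a file is written, so an archive can overshoot the cap by one file (Python 3 syntax):

```python
def assign_to_archives(file_sizes_mb, max_mb=25):
    """Mimic the size-based rollover: append each file to the current archive,
    then start a new archive once the current one exceeds the cap."""
    archives = [[]]
    current_mb = 0.0
    for size in file_sizes_mb:
        archives[-1].append(size)
        current_mb += size
        if current_mb > max_mb:   # checked after the write, matching the loop above
            archives.append([])
            current_mb = 0.0
    if archives[-1] == []:        # drop a trailing empty archive
        archives.pop()
    return archives

# 30 PNGs of ~3 MB each: rollover after the 9th file (27 MB > 25 MB)
archives = assign_to_archives([3.0] * 30, max_mb=25)
print([len(a) for a in archives])  # [9, 9, 9, 3]
```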
In [22]:
# Create the Zip files containing the test PNG files using the following naming convention:
# testset_<NUMBER>_<CLASS>.zip (The next notebook example using Watson will break if a
# different naming convention is used.) Refer to
# https://www.ibm.com/watson/developercloud/visual-recognition/api/v3/#classify_an_image
# for ZIP size and content limitations:
# "The max number of images in a .zip file is limited to 20, and limited to 5 MB."
for k, v in test_set_group_by_class.iteritems():

    fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]  #yes, files are <uuid>.dat.png :/

    count = 1
    img_count = 0
    for fn in fnames:
        archive_name = '{}/testset_{}_{}.zip'.format(zipfilefolder, count, k)

        if os.path.exists(archive_name):
            zz = zipfile.ZipFile(archive_name, mode='a')
        else:
            print 'creating new archive', archive_name
            zz = zipfile.ZipFile(archive_name, mode='w')

        zz.write(fn)
        zz.close()
        img_count += 1

        #if archive_name exceeds 5 MB or contains 20 images,
        #increase count to start a new one
        if (os.path.getsize(archive_name) >= 4.7 * 1024 ** 2) or img_count == 20:
            count += 1
            img_count = 0
In [23]:
!ls -alrth $mydatafolder/zipfiles
If you've been running this on an IBM DSX Spark cluster and you wish to move your data from the local filespace, the easiest and fastest way is to push these PNG files to an IBM Object Storage account. An Object Storage instance was created for you when you signed up for DSX.
You do NOT need to do this if you're going on to the next notebook where you use Watson to classify your images from this Spark cluster. That notebook will read the data from the local file space.
Copy your credentials from the Service Credentials tab of your Object Storage service (click View Credentials) and paste them below.
In [ ]:
import swiftclient.client as swiftclient

credentials = {
    'auth_uri': '',
    'global_account_auth_uri': '',
    'username': 'xx',
    'password': 'xx',
    'auth_url': 'https://identity.open.softlayer.com',
    'project': 'xx',
    'projectId': 'xx',
    'region': 'dallas',
    'userId': 'xx',
    'domain_id': 'xx',
    'domain_name': 'xx',
    'tenantId': 'xx'
}
In [ ]:
conn_seti_data = swiftclient.Connection(
    key=credentials['password'],
    authurl=credentials['auth_url'] + "/v3",
    auth_version='3',
    os_options={
        "project_id": credentials['projectId'],
        "user_id": credentials['userId'],
        "region_name": credentials['region']})
In [ ]:
myObjectStorageContainer = 'seti_pngs'
someFile = os.path.join(zipfilefolder, 'classification_1_narrowband.zip')
etag = conn_seti_data.put_object(myObjectStorageContainer, someFile, open(someFile,'rb').read())