In this script we are going to sample a dataset of words, but we'll sample carefully so as not to wipe out the subsets with low occurrence. In other words, we'll perform a stratified sampling.
In our case we'll use the word length as the feature whose statistics we want to preserve, so we will perform a 1% sampling for each word length. And we'll raise the sampling ratio for the long words (which are typically rare), so as to preserve them.
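Before working on the real dataset, here is a minimal sketch of how Spark's sampleByKey performs this kind of per-class sampling. The tiny in-memory RDD and the fractions dict are purely illustrative, not part of the pipeline below:
# Toy illustration of sampleByKey (made-up words, illustrative fractions)
toy = sc.parallelize( ["de", "la", "que", "hidalgo", "Dulcinea"] )
toyByLength = toy.map( lambda w : (len(w), w) )             # key each word by its length
toyFractions = { 2 : 0.5, 3 : 0.5, 7 : 1.0, 8 : 1.0 }       # per-key sampling probability
toyByLength.sampleByKey( False, toyFractions ).collect()    # sample without replacement
The rest of the notebook applies exactly this idea to the full word list, computing the per-key fractions from the observed class sizes.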
In [1]:
# We start reading the file from HDFS into an RDD
words = sc.textFile( "hdfs:///user/{0}/data/quijote-words.txt".format(sc.sparkUser()) )
total_words = words.count()
In [2]:
# Now we compute the subclasses. This is done by creating a (k,v) RDD, mapping each word using its length as the key
wordsByLength = words.map( lambda x : (len(x),x) )
In [3]:
# Let's take a peek to see what we've got
wordsByLength.take( 10 )
Out[3]:
In [4]:
# Ok, check class cardinality
# Count the number of words having each length
from operator import add
lengthStats = words.map( lambda x : (len(x),1) ).reduceByKey( add ).sortByKey( False )
In [5]:
# Get those figures back to the driver program, since we need to work with them
stats = lengthStats.collect()
stats
Out[5]:
As we suspected, long words are exceedingly rare, while short words are fairly common (there is a bogus class of length 0, probably containing a couple of empty strings that slipped through our processing).
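If we wanted to get rid of that bogus class, a simple filter before keying by length would do; this is just a sketch and not part of the pipeline used below:
# Hypothetical cleanup: drop empty strings before computing word lengths
words_clean = words.filter( lambda x : len(x) > 0 )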
Out of curiosity, let's find the champions in word length
In [6]:
wordsByLength.filter( lambda x : x[0]>16 ).sortByKey(False).collect()
Out[6]:
In [7]:
import numpy as np
In [8]:
# Now we convert our collected stats into a NumPy array
statsM = np.array( stats )
In [9]:
# How many words in total?
total = statsM[:,1].sum()
total
Out[9]:
In [10]:
# And what are the probabilities?
from __future__ import division # ensure we're using floating point division
fraction = statsM[:,1]/total
In [11]:
# Stack the fractions as a new column of our array
statsM2 = np.c_[statsM,fraction]
In [12]:
# See what we've got. To avoid that ugly scientific notation, we change NumPy presentation options
np.set_printoptions(suppress=True)
statsM2
Out[12]:
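As a quick consistency check, the per-class fractions should sum to (approximately) 1, since every word falls into exactly one length class:
# Sanity check: the per-class probabilities add up to 1 (up to floating point error)
fraction.sum()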
In [13]:
# Ok, as a general rule we want 1% of data for each category
sample_fractions = np.ones( fraction.shape )*0.01
In [15]:
# but: underrepresented categories (fewer than 100 instances) we sample at 10%
# and very rare categories (fewer than 20 instances) we keep in full
sample_fractions[ statsM2[:,1] < 100 ] = 0.1
sample_fractions[ statsM2[:,1] < 20 ] = 1
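For reference, the same thresholding can be written as a single vectorized expression with np.where; this is an equivalent sketch (the variable name sample_fractions_alt is just for illustration), not what the cell above does step by step:
# Equivalent one-liner: 1 for fewer than 20 instances, 0.1 for fewer than 100, 0.01 otherwise
counts = statsM2[:,1]
sample_fractions_alt = np.where( counts < 20, 1.0, np.where( counts < 100, 0.1, 0.01 ) )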
In [16]:
# Check those fractions
sample_fractions
Out[16]:
In [17]:
# Construct the dict "key:fraction" for the stratified sampling
s = dict( zip(map(int,statsM[:,0]),sample_fractions) )
s
Out[17]:
In [18]:
# Sample!
wordsSampledbyLength = wordsByLength.sampleByKey(False,s)
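Note that sampleByKey also accepts an optional seed argument; passing one makes the sample reproducible across runs. A hypothetical variant (the seed value and the name wordsSampledSeeded are arbitrary):
# Reproducible variant of the same sampling (seed chosen arbitrarily)
wordsSampledSeeded = wordsByLength.sampleByKey( False, s, seed=42 )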
In [19]:
# See how many we've got overall
sampled = wordsSampledbyLength.count()
sampled_fraction = sampled/total_words
sampled_fraction
Out[19]:
In [20]:
# Check the number of sampled words per category
lengthStatsSampled = wordsSampledbyLength.mapValues( lambda x : 1 ).reduceByKey( add ).sortByKey( False )
lengthStatsSampled.collect()
Out[20]:
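As a rough sanity check, these counts can be compared against the expected sample size for each class (class size times its sampling fraction). Since sampleByKey is probabilistic, the actual figures will differ slightly; a sketch on the driver side, reusing statsM2 and sample_fractions:
# Expected number of sampled words per length class, computed on the driver
expected = statsM2[:,1] * sample_fractions
np.c_[ statsM2[:,0], expected ]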
In [21]:
# An example of what we have
wordsSampledbyLength.take(10)
Out[21]: