Note: This notebook assumes Python 3 as the running interpreter
Textual data stored in the database must be properly represented in the chosen feature space, i.e. the Vector Space Model (VSM).
To this end, we build a processing pipeline (sklearn.pipeline.Pipeline) that consists of two processing steps: term counting (sklearn.feature_extraction.text.CountVectorizer), followed by the tfidf representation (sklearn.feature_extraction.text.TfidfTransformer).
All the required logic is embedded in the coherence.load_coherence_dataset function.
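As a rough illustration of such a two-step pipeline, the sketch below chains the two scikit-learn classes named above on a tiny corpus. The documents and the step names ("counts", "tfidf") are hypothetical, and how coherence.load_coherence_dataset wires the pipeline internally is an assumption; the parameter values are the ones this notebook describes.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy, already pre-processed documents (hypothetical; the real data comes from the DB).
docs = [
    "read file buffer close",
    "open file read line",
    "close socket buffer flush",
]

vsm_pipeline = Pipeline([
    # Step 1: raw term counts; the text is assumed to be lowercased and filtered already.
    ("counts", CountVectorizer(lowercase=False, stop_words=None)),
    # Step 2: re-weight the counts into a tf-idf representation.
    ("tfidf", TfidfTransformer(sublinear_tf=True, use_idf=True, norm="l2")),
])

X_toy = vsm_pipeline.fit_transform(docs)  # sparse matrix, one row per document
print(X_toy.shape)
```

On this toy corpus the result has 3 rows (documents) and 8 columns (one per distinct term).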
The actual formula used for tf-idf (under the hood) is
$$tf \cdot (idf + 1) = tf + tf \cdot idf$$
The effect is that terms with zero $idf$, i.e. those occurring in all documents of the collection,
will not be entirely ignored.
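The effect can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the plain definition $idf = \ln(N / df)$ without any smoothing; the helper name `idf` and the numbers are hypothetical.

```python
import math

def idf(n_docs, doc_freq):
    """Plain inverse document frequency: ln(N / df) (assumed definition)."""
    return math.log(n_docs / doc_freq)

n_docs = 4
tf = 3  # raw count of the term in one document

# A term found in every document has idf = ln(4 / 4) = 0.
idf_common = idf(n_docs, doc_freq=4)
# A term found in a single document has a larger idf.
idf_rare = idf(n_docs, doc_freq=1)

plain_common = tf * idf_common          # vanishes: the term would be ignored
shifted_common = tf * (idf_common + 1)  # equals tf: the term is kept
shifted_rare = tf * (idf_rare + 1)      # still weighted above the common term
```

With the shifted formula the common term keeps a weight equal to its raw $tf$, while the rare term is still ranked above it.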
Moreover, the specific formulas used to compute $tf$ and $idf$ depend on the parameters passed to the Pipeline object.
The parameters used are:
sublinear_tf=True ==> $tf$ is replaced with $1 + \ln(tf)$. This formulation should be less sensitive to (numerical) outliers, which could be the case in our dataset, where only a few terms are involved in the calculation.
use_idf=True ==> actually computes tfidf for features (instead of just tf).
norm='l2' ==> $L_2$ normalisation is applied to the data, i.e. the resulting vectors are length-normalised.
lowercase=False ==> Avoid converting the data to lowercase (it is already lowercase in the DB).
stop_words=None ==> No stopword list is passed, as the raw textual data in the DB has already been processed by a "standard" IR-indexing pipeline (i.e. lowercasing + stop-word filtering + Porter stemming).
The cut-off parameter (i.e. min_df) could be considered for inclusion, in case we want to filter out terms that occur in fewer than a minimum number of documents.
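To make the effect of sublinear_tf and norm='l2' concrete, here is a small pure-Python sketch. The helper names and the counts are hypothetical; the code only mirrors the two formulas quoted above and is not the library's implementation.

```python
import math

def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def l2_normalise(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

raw_counts = [1, 2, 50, 0]  # hypothetical term counts for one document
damped = [sublinear_tf(c) for c in raw_counts]
unit = l2_normalise(damped)
```

The outlier count of 50 is fifty times the smallest non-zero count before dampening, but less than five times afterwards, and the normalised vector has length 1 whatever the document size.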
In [1]:
from coherence import load_coherence_dataset
In [2]:
coherence_ds = load_coherence_dataset()
In [3]:
print(coherence_ds.DESCR)
In [4]:
X, y = coherence_ds.data, coherence_ds.target
In [5]:
print(X.shape)
print(y.shape)
In [6]:
## Statistics on the dataset for each project
for project in coherence_ds.projects['names']:
    info = coherence_ds.projects[project]
    print('Project: {0} {1}'.format(info['name'], info['version']), end=' '*4)
    print('Positive={0}; Negative={1}'.format(info['positive_examples'],
                                              info['negative_examples']))
In [7]:
coherence_ds.target_counts
Out[7]:
Since we are in a feature space consisting of 5642 dimensions, we need to reduce the dimensionality in order to plot the data.
In the next sections, the data are reduced to two and three dimensions with PCA, Isomap and t-SNE.
To visualise the data, 2D and 3D scatter plots will be created using the matplotlib library.
Moreover, an interactive (2D) scatter plot using Bokeh will also be included, to easily inspect the resulting data points.
In [8]:
import matplotlib.pyplot as plt
import matplotlib
In [9]:
%matplotlib inline
In [10]:
pos_offset, _ = coherence_ds.target_counts
class_labels = coherence_ds.target_names
In [11]:
## 2D Scatter plot with matplotlib
def scatter_2D(data, offset, labels):
    """Generate and show a 2D scatter plot of input data.
    Input data are assumed to be separated in two classes.

    Parameters
    ----------
    data : numpy array
        Array of data to plot (at least two columns)
    offset : integer
        Index separating the two classes
    labels : tuple
        Names of the two considered classes
    """
    matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
    fig = plt.figure()
    cl1_label, cl2_label = labels
    # Rows before `offset` belong to the first class, the rest to the second
    plt.scatter(data[0:offset,0], data[0:offset,1], c='green',
                marker='o', label=cl1_label, s=60)
    plt.scatter(data[offset:,0], data[offset:,1], c='red',
                marker='*', label=cl2_label, s=60)
    plt.legend()
    plt.show()
In [12]:
## 3D Scatter plot with matplotlib
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

def scatter_3D(data, offset, labels):
    """Generate and show a 3D scatter plot of input data.
    Input data are assumed to be separated in two classes.

    Parameters
    ----------
    data : numpy array
        Array of data to plot (at least three columns)
    offset : integer
        Index separating the two classes
    labels : tuple
        Names of the two considered classes
    """
    matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    cl1_label, cl2_label = labels
    # The third column (index 2) provides the z coordinate
    ax.scatter(data[0:offset,0], data[0:offset,1], data[0:offset,2],
               c='green', marker='o', label=cl1_label, s=60)
    ax.scatter(data[offset:,0], data[offset:,1], data[offset:,2],
               c='red', marker='*', label=cl2_label, s=60)
    plt.legend()
    plt.show()
PCA is a linear dimensionality reduction technique based on the SVD (Singular Value Decomposition) of the data. The technique aims at keeping only the most significant singular vectors to project the data to a lower dimensional space.
The time complexity of the current (underlying) implementation is $O(n^3)$, assuming $n \propto$ n_samples $\propto$ n_features.
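The idea of projecting onto the most significant singular vectors can be sketched directly with NumPy. This is an illustrative sketch, not the scikit-learn implementation; the function name `pca_via_svd` and the random toy data are hypothetical.

```python
import numpy as np

def pca_via_svd(data, n_components):
    """Project `data` (n_samples x n_features) onto its top principal axes."""
    centred = data - data.mean(axis=0)  # PCA works on mean-centred data
    # Thin SVD: centred = U @ diag(S) @ Vt, singular values in decreasing order
    U, S, Vt = np.linalg.svd(centred, full_matrices=False)
    components = Vt[:n_components]      # most significant right singular vectors
    projected = centred @ components.T  # coordinates in the reduced space
    explained_variance = (S[:n_components] ** 2) / (data.shape[0] - 1)
    return projected, explained_variance

rng = np.random.default_rng(0)
# Hypothetical data: five features with decreasing spread
toy = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
toy_2d, variance = pca_via_svd(toy, n_components=2)
```

Each retained axis explains no more variance than the previous one, which is why the first two or three components are the natural choice for plotting.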
In [13]:
from sklearn.decomposition import PCA
In [14]:
X_dense = X.toarray()  # plain ndarray rather than the discouraged np.matrix
pca = PCA(n_components=2).fit(X_dense)
X_PCA2 = pca.transform(X_dense)
In [15]:
scatter_2D(X_PCA2, pos_offset, class_labels)
In [16]:
pca = PCA(n_components=3).fit(X_dense)
X_PCA3 = pca.transform(X_dense)
In [17]:
scatter_3D(X_PCA3, pos_offset, class_labels)
One of the earliest approaches to manifold learning is the Isomap algorithm (short for Isometric Mapping).
Isomap can be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA.
Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points.
Isomap can be performed with the object sklearn.manifold.Isomap.
Please see the Isomap documentation for further details.
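To show what "maintaining geodesic distances" means, the following is a simplified sketch of the three classical Isomap steps: a nearest-neighbour graph, shortest paths along it, and classical MDS on the resulting distances. It is not the scikit-learn implementation; the function name `isomap_sketch` and the toy arc are hypothetical, and the sketch assumes distinct points and a connected neighbourhood graph.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap_sketch(data, n_neighbors, n_components):
    n = data.shape[0]
    dist = cdist(data, data)  # pairwise Euclidean distances
    # 1. k-nearest-neighbour graph (0 means "no edge" for csgraph)
    graph = np.zeros_like(dist)
    nearest = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]  # skip self at column 0
    rows = np.repeat(np.arange(n), n_neighbors)
    graph[rows, nearest.ravel()] = dist[rows, nearest.ravel()]
    graph = np.maximum(graph, graph.T)  # make the graph symmetric
    # 2. Geodesic distances approximated by shortest paths along the graph
    geodesic = shortest_path(graph, method="D", directed=False)
    # 3. Classical MDS on the geodesic distances
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (geodesic ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

# Toy data: 30 points along a half circle of radius 1
t = np.linspace(0.0, np.pi, 30)
arc = np.column_stack([np.cos(t), np.sin(t)])
embedding = isomap_sketch(arc, n_neighbors=2, n_components=1)
```

The two ends of the arc are 2 apart in a straight line, but their embedded coordinates end up roughly $\pi$ apart, i.e. the distance measured along the curve.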
In [18]:
from sklearn.manifold import Isomap
from scipy.stats.mstats import mquantiles
# Parameter settings
k = 10 # number of nearest neighbors to consider
d = 2 # dimensionality
X_Isomap_2d = Isomap(n_neighbors=k, n_components=d, eigen_solver='auto').fit_transform(X.toarray())
In [19]:
scatter_2D(X_Isomap_2d, pos_offset, class_labels)
In [20]:
d = 3 # 3 dimensions
X_Isomap_3d = Isomap(n_neighbors=k, n_components=d, eigen_solver='auto').fit_transform(X.toarray())
In [21]:
scatter_3D(X_Isomap_3d, pos_offset, class_labels)
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
The t-SNE technique converts similarities between data points to joint probabilities, and tries to minimise the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
The (main) drawback of t-SNE is that its cost function is not convex, so different initialisations can lead to different results.
Please see the TSNE documentation for further details.
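The cost that t-SNE minimises can be written down in a few lines. The sketch below is a deliberate simplification: it uses a single fixed Gaussian bandwidth `sigma` instead of the per-point bandwidths derived from the perplexity, and it only evaluates the cost for two candidate embeddings without optimising anything. All function names and the toy data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(points):
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def joint_p(high_dim, sigma=1.0):
    """Gaussian joint probabilities with one fixed bandwidth (a simplification)."""
    affinities = np.exp(-pairwise_sq_dists(high_dim) / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)
    return affinities / affinities.sum()

def joint_q(low_dim):
    """Student-t (one degree of freedom) joint probabilities of the embedding."""
    affinities = 1.0 / (1.0 + pairwise_sq_dists(low_dim))
    np.fill_diagonal(affinities, 0.0)
    return affinities / affinities.sum()

def kl_divergence(p, q, eps=1e-12):
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

rng = np.random.default_rng(0)
# Two well-separated clusters in 5 dimensions
high = np.vstack([rng.normal(0, 0.3, (10, 5)), rng.normal(4, 0.3, (10, 5))])
good = high[:, :2]              # embedding that keeps the two clusters apart
bad = rng.normal(size=(20, 2))  # random embedding that ignores the structure

P = joint_p(high)
kl_good = kl_divergence(P, joint_q(good))
kl_bad = kl_divergence(P, joint_q(bad))
```

The embedding that preserves the cluster structure yields the lower divergence; t-SNE searches for such an embedding by gradient descent, which is where the dependence on the initialisation comes from.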
In [22]:
from sklearn.manifold import TSNE
For high-dimensional sparse data it is helpful to first reduce the data to 50 dimensions with TruncatedSVD and then perform t-SNE. This will usually improve the visualization.
In [23]:
from sklearn.decomposition import TruncatedSVD
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)
In [24]:
X_tsne_2d = TSNE(n_components=2, perplexity=40, random_state=0).fit_transform(X_reduced)  # fixed seed for reproducible plots
In [25]:
scatter_2D(X_tsne_2d, pos_offset, class_labels)
In [26]:
X_tsne_3d = TSNE(n_components=3, perplexity=40, random_state=0).fit_transform(X_reduced)  # fixed seed for reproducible plots
In [27]:
scatter_3D(X_tsne_3d, pos_offset, class_labels)
In [28]:
from bokeh.plotting import output_notebook
output_notebook()
In [29]:
from bokeh.plotting import figure, show

def create_interactive_scatter(data, offset, labels):
    """Create an interactive Scatter plot for
    data separated in two classes.

    Parameters
    ----------
    data : numpy array
        Array of data to plot (at least two columns)
    offset : integer
        Index separating the two classes
    labels : tuple
        Names of the two considered classes

    Returns
    -------
    p : `bokeh.plotting.figure`
        The Bokeh plotting figure to show
    """
    # Tool names and `legend_label` follow recent Bokeh releases;
    # older releases used 'resize', 'previewsave' and `legend=` instead.
    TOOLS = "crosshair,pan,wheel_zoom,box_zoom,reset,tap,save,box_select,poly_select,lasso_select"
    p = figure(tools=TOOLS)
    cl1_label, cl2_label = labels
    p.scatter(data[0:offset,0], data[0:offset,1], marker="square",
              color="green", size=4, legend_label=cl1_label)
    p.scatter(data[offset:,0], data[offset:,1], marker="circle",
              color="red", size=4, legend_label=cl2_label)
    return p
In [30]:
p = create_interactive_scatter(X_PCA2, pos_offset, class_labels)
show(p)
In [31]:
p = create_interactive_scatter(X_Isomap_2d, pos_offset, class_labels)
show(p)
In [32]:
p = create_interactive_scatter(X_tsne_2d, pos_offset, class_labels)
show(p)