In [1]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

# given some content...
content = ["How to format my hard disk", " Hard disk format problems "]

X = vectorizer.fit_transform(content)
feature_names = vectorizer.get_feature_names()

print("Feature names: {}".format(feature_names))
print(X.toarray().transpose())


Feature names: ['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]

Description of the output: since the matrix is printed transposed, each row corresponds to a word from the feature list and each column to a sentence from `content`; the values are occurrence counts (here all 0 or 1). Sentence 1 (content[0]) contains every word except "problems".
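To make the bag-of-words idea concrete, here is a minimal sketch of what CountVectorizer does under the hood, using only the standard library (the regex approximates CountVectorizer's default tokenizer, which keeps words of two or more characters; `bag_of_words` is a hypothetical helper, not part of scikit-learn):

```python
from collections import Counter
import re

def bag_of_words(docs):
    # tokenize: lowercase, keep words of 2+ characters (like CountVectorizer's default)
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # the vocabulary is the sorted set of all tokens seen across the corpus
    vocab = sorted(set(w for toks in tokenized for w in toks))
    # one count vector per document, columns ordered by the vocabulary
    return vocab, [[Counter(toks)[w] for w in vocab] for toks in tokenized]

vocab, vectors = bag_of_words(["How to format my hard disk",
                               " Hard disk format problems "])
print(vocab)    # ['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
print(vectors)  # [[1, 1, 1, 1, 1, 0, 1], [1, 1, 1, 0, 0, 1, 0]]
```

These are the same columns CountVectorizer produced above, just before the transpose.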


In [17]:
posts = [
    "This is a toy post about machine learning. Actually, it contains not much interesting stuff.",
    "Imaging databases provide storage capabilities.",
    "Most imaging databases save images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. Imaging databases store data.",
]

# Create a training set
vectorizer = CountVectorizer(min_df=1)
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: {}, #features: {}".format(num_samples, num_features))
print(vectorizer.get_feature_names())

# create a new post
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])

# a naive similarity measure (using the full dense array of each post vector)
import scipy as sp
import scipy.linalg  # import the submodule explicitly so sp.linalg is available
def dist_raw(v1, v2):
    delta = v1 - v2
    # Euclidean (L2) norm: the straight-line length of the difference vector
    return sp.linalg.norm(delta.toarray())

# Find distances among all posts
import sys

def find_distances(vectorizer, new_post, posts, dist_func=dist_raw):
    X_train = vectorizer.fit_transform(posts)
    new_post_vec = vectorizer.transform([new_post])
    num_samples, num_features = X_train.shape
    
    print("----------------------------------------")
    print("#samples: {}, #features: {}".format(num_samples, num_features))
    print(vectorizer.get_feature_names())
    print("----------------------------------------")
    
    best_dist = sys.maxsize
    best_i = None

    for i in range(0, num_samples):
        post = posts[i]
        if post == new_post:
            continue
        post_vec = X_train.getrow(i)
        
        d = dist_func(post_vec, new_post_vec)

        print("- Post %i with dist=%.2f: %s" % (i, d, post))

        if d < best_dist:
            best_dist = d
            best_i = i

    print("Best post is %i with dist=%.2f" % (best_i, best_dist))

find_distances(vectorizer, new_post, posts)

# explore the vectors for posts 3 & 4, since they contain the same words (post 4 is post 3 repeated)
print("\nVectors for what should be similar sentences:")
print(X_train.getrow(3).toarray())
print(X_train.getrow(4).toarray())


#samples: 5, #features: 25
['about', 'actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine', 'most', 'much', 'not', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'this', 'toy']
----------------------------------------
#samples: 5, #features: 25
['about', 'actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine', 'most', 'much', 'not', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'this', 'toy']
----------------------------------------
- Post 0 with dist=4.00: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
- Post 1 with dist=1.73: Imaging databases provide storage capabilities.
- Post 2 with dist=2.00: Most imaging databases save images permanently.
- Post 3 with dist=1.41: Imaging databases store data.
- Post 4 with dist=5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=1.41

Vectors for what should be similar sentences:
[[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]
[[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]
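This is exactly the problem the next cell fixes: post 4 is post 3 repeated three times, so its count vector is three times as long even though the content is identical. A small sketch with hypothetical stand-in vectors shows why raw Euclidean distance penalizes repetition while unit-length normalization does not:

```python
import numpy as np

# hypothetical small vectors mirroring posts 3 & 4: the second is the first tripled
v3 = np.array([0, 0, 1, 1, 1, 0])
v4 = 3 * v3

# raw Euclidean distance grows with document length, despite identical content
raw_dist = np.linalg.norm(v4 - v3)

# after scaling both to unit length, the distance collapses to (nearly) zero
norm_dist = np.linalg.norm(v4 / np.linalg.norm(v4) - v3 / np.linalg.norm(v3))
print(raw_dist)   # 2 * sqrt(3), about 3.46
print(norm_dist)  # 0.0 (up to floating point)
```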

In [18]:
# Normalize the vectors, and try again
def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

find_distances(vectorizer, new_post, posts, dist_func=dist_norm)


----------------------------------------
#samples: 5, #features: 25
['about', 'actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine', 'most', 'much', 'not', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'this', 'toy']
----------------------------------------
- Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
- Post 1 with dist=0.86: Imaging databases provide storage capabilities.
- Post 2 with dist=0.92: Most imaging databases save images permanently.
- Post 3 with dist=0.77: Imaging databases store data.
- Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77
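Note that posts 3 and 4 now tie at the same distance, as they should. The normalized Euclidean distance used here is closely related to cosine similarity: for unit vectors, the squared distance equals 2 * (1 - cosine similarity). A quick check with two arbitrary (hypothetical) vectors:

```python
import numpy as np

# two arbitrary stand-in count vectors
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 0.5, 1.0])

# normalize both to unit length, as dist_norm does
a_u = a / np.linalg.norm(a)
b_u = b / np.linalg.norm(b)

cos_sim = a_u @ b_u
dist_sq = np.linalg.norm(a_u - b_u) ** 2
print(dist_sq, 2 * (1 - cos_sim))  # the two quantities coincide
```

So ranking posts by dist_norm is equivalent to ranking them by cosine similarity.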

In [19]:
# using stop words, i.e. removing "noise" (words that carry little information)
# use the built-in English stop word list (a custom list of words also works)
vectorizer = CountVectorizer(min_df=1, stop_words='english')
print("Some of our stop words: {}".format(", ".join(sorted(vectorizer.get_stop_words())[0:20])))

# construct a new training set
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: {}, #features: {}".format(num_samples, num_features))
print(vectorizer.get_feature_names())


Some of our stop words: a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst
#samples: 5, #features: 18
['actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'learning', 'machine', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'toy']
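As mentioned in the comment above, stop_words also accepts an explicit list instead of the built-in 'english' set. A small sketch with a hypothetical domain-specific stop-word list:

```python
from sklearn.feature_extraction.text import CountVectorizer

# a hypothetical custom stop-word list instead of the built-in 'english' set
vectorizer = CountVectorizer(min_df=1, stop_words=['imaging', 'databases'])
X = vectorizer.fit_transform(["Imaging databases store data."])

# sorted(vocabulary_) gives the feature names across scikit-learn versions
print(sorted(vectorizer.vocabulary_))  # ['data', 'store']
```

This is handy when certain words are so common in a particular corpus that they stop being informative, even though they are not general English stop words.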

In [20]:
# Using NLTK for stemming (reducing words to their word stem)
import nltk.stem as ns
s = ns.SnowballStemmer('english')
print(s.stem("graphics"))
print(s.stem("imaging"))
print(s.stem("image"))
print(s.stem("imagination"))
print(s.stem("imagine"))


graphic
imag
imag
imagin
imagin

In [21]:
# stem our posts before vectorizing
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
find_distances(vectorizer, new_post, posts, dist_func=dist_norm)


----------------------------------------
#samples: 5, #features: 17
['actual', 'capabl', 'contain', 'data', 'databas', 'imag', 'interest', 'learn', 'machin', 'perman', 'post', 'provid', 'save', 'storag', 'store', 'stuff', 'toy']
----------------------------------------
- Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
- Post 1 with dist=0.86: Imaging databases provide storage capabilities.
- Post 2 with dist=0.63: Most imaging databases save images permanently.
- Post 3 with dist=0.77: Imaging databases store data.
- Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist=0.63
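The next cell switches from raw counts to TF-IDF weights via TfidfVectorizer. The idea: weight each term by its frequency in the document (TF) times a penalty for appearing in many documents (IDF). A minimal, unsmoothed sketch of the formula (scikit-learn's actual implementation adds smoothing and L2 normalization, so its numbers will differ):

```python
import math

def tfidf(term, doc, corpus):
    # term frequency: the share of this document's tokens that are this term
    tf = doc.count(term) / len(doc)
    # inverse document frequency: terms in fewer documents get a larger boost
    num_docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / num_docs_with_term)
    return tf * idf

corpus = [["a"], ["a", "b"], ["a", "b", "c"]]
print(tfidf("a", ["a"], corpus))            # 0.0 -- 'a' appears in every document
print(tfidf("b", ["a", "b"], corpus))       # log(3/2) / 2
print(tfidf("c", ["a", "b", "c"], corpus))  # log(3) / 3
```

A term that occurs in every document gets weight 0, which is exactly why TF-IDF downplays corpus-wide filler words.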

In [36]:
# Finally, cluster a real data set
# Using the 20 Newsgroups data (see also: http://mlcomp.org/datasets/379)

import sklearn.datasets
all_data = sklearn.datasets.fetch_20newsgroups(subset='all')
#print("Files: {}".format(len(all_data.filenames)))
#print("Target Names: {}".format(", ".join(all_data.target_names)))

# set up some training data
groups = [
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware',
    'comp.sys.mac.hardware',
    'comp.windows.x',
    'sci.space',
]
train_data = sklearn.datasets.fetch_20newsgroups(subset='train', categories=groups)
test_data = sklearn.datasets.fetch_20newsgroups(subset='test', categories=groups)

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk.stem

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):

    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))    

vectorizer = StemmedTfidfVectorizer(min_df=10, max_df=0.5, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(train_data.data)

num_samples, num_features = vectorized.shape
print("#samples: {}\n#features: {}".format(num_samples, num_features))


# NOW, do the clustering
from sklearn.cluster import KMeans
km = KMeans(n_clusters=50, init='random', n_init=1, verbose=1, random_state=3)
km.fit(vectorized)

print(km.labels_)
print(km.labels_.shape)


#samples: 3529
#features: 4712
Initialization complete
Iteration  0, inertia 5899.560
Iteration  1, inertia 3218.298
Iteration  2, inertia 3184.333
Iteration  3, inertia 3164.867
Iteration  4, inertia 3152.004
Iteration  5, inertia 3143.111
Iteration  6, inertia 3136.256
Iteration  7, inertia 3129.325
Iteration  8, inertia 3124.567
Iteration  9, inertia 3121.900
Iteration 10, inertia 3120.210
Iteration 11, inertia 3118.627
Iteration 12, inertia 3117.363
Iteration 13, inertia 3116.811
Iteration 14, inertia 3116.588
Iteration 15, inertia 3116.417
Iteration 16, inertia 3115.760
Iteration 17, inertia 3115.374
Iteration 18, inertia 3115.155
Iteration 19, inertia 3114.949
Iteration 20, inertia 3114.515
Iteration 21, inertia 3113.937
Iteration 22, inertia 3113.720
Iteration 23, inertia 3113.548
Iteration 24, inertia 3113.475
Iteration 25, inertia 3113.447
Converged at iteration 25
[38 17 47 ..., 41 14 16]
(3529,)
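With 50 clusters over 3,529 posts, it's worth checking how evenly the posts are spread. np.bincount gives the cluster sizes directly from the labels array; a sketch with a hypothetical small labels array standing in for km.labels_ above:

```python
import numpy as np

# hypothetical labels array standing in for km.labels_
labels = np.array([2, 0, 2, 1, 2, 0])

sizes = np.bincount(labels)  # posts per cluster, indexed by cluster label
print(sizes)                 # [2 1 3]
print(sizes.argmax())        # 2 -- the label of the largest cluster
```

Running the same two lines on km.labels_ shows whether K-means produced a few giant clusters or a reasonably balanced partition.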

In [50]:
# Test out on a new mailing list post.
new_post = """Disk drive problems. Hi, I have a problem with my hard disk.
After 1 year it is working only sporadically now.
I tried to format it, but now it doesn't boot any more.
Any ideas? Thanks.
"""

# vectorize the new post and predict its cluster
new_post_vec = vectorizer.transform([new_post])
new_post_label = km.predict(new_post_vec)[0]
print("New post label: {}".format(new_post_label))

# find the indices of all posts assigned to the same cluster
similar_indices = (km.labels_ == new_post_label).nonzero()[0]

# build a list of similar posts
similar = []
for i in similar_indices:
    dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
    similar.append((dist, train_data.data[i]))
similar = sorted(similar)
print("Found {} similar posts.".format(len(similar)))

a = similar[0]                        # the closest post
b = similar[int(len(similar) / 10)]   # a post from the closest 10%
c = similar[int(len(similar) / 2)]    # the median post

# note: the printed "similarity score" is a Euclidean distance, so lower means more similar
fmt = "\n-- similarity score: {} --\n{}\n--------------------------\n"
print(fmt.format(a[0], a[1]))
print(fmt.format(b[0], b[1]))
print(fmt.format(c[0], c[1]))


New post label: 7
Found 166 similar posts.

-- similarity score: 1.0378441731334074 --
From: Thomas Dachsel <GERTHD@mvs.sas.com>
Subject: BOOT PROBLEM with IDE controller
Nntp-Posting-Host: sdcmvs.mvs.sas.com
Organization: SAS Institute Inc.
Lines: 25

Hi,
I've got a Multi I/O card (IDE controller + serial/parallel
interface) and two floppy drives (5 1/4, 3 1/2) and a
Quantum ProDrive 80AT connected to it.
I was able to format the hard disk, but I could not boot from
it. I can boot from drive A: (which disk drive does not matter)
but if I remove the disk from drive A and press the reset switch,
the LED of drive A: continues to glow, and the hard disk is
not accessed at all.
I guess this must be a problem of either the Multi I/o card
or floppy disk drive settings (jumper configuration?)
Does someone have any hint what could be the reason for it.
Please reply by email to GERTHD@MVS.SAS.COM
Thanks,
Thomas
+-------------------------------------------------------------------+
| Thomas Dachsel                                                    |
| Internet: GERTHD@MVS.SAS.COM                                      |
| Fidonet:  Thomas_Dachsel@camel.fido.de (2:247/40)                 |
| Subnet:   dachsel@rnivh.rni.sub.org (UUCP in Germany, now active) |
| Phone:    +49 6221 4150 (work), +49 6203 12274 (home)             |
| Fax:      +49 6221 415101                                         |
| Snail:    SAS Institute GmbH, P.O.Box 105307, D-W-6900 Heidelberg |
| Tagline:  One bad sector can ruin a whole day...                  |
+-------------------------------------------------------------------+

--------------------------


-- similarity score: 1.1716497460538475 --
From: badry@cs.UAlberta.CA (Badry Jason Theodore)
Subject: Chaining IDE drives
Summary: Trouble with Master/Slave drives
Nntp-Posting-Host: cab009.cs.ualberta.ca
Organization: University Of Alberta, Edmonton Canada
Lines: 16

Hi.  I am trying to set up a Conner 3184 and a Quantum 80AT drive.  I have
the conner set to the master, and the quantum set to the slave (doesn't work
the other way around).  I am able to access both drives if I boot from a 
floppy, but the drives will not boot themselves.  I am running MSDOS 6, and
have the Conner partitioned as Primary Dos, and is formatted with system
files.  I have tried all different types of setups, and even changed IDE
controller cards.  If I boot from a floppy, everything works great (except
the booting part :)).  The system doesn't report an error message or anything,
just hangs there.  Does anyone have any suggestions, or has somebody else
run into a similar problem?  I was thinking that I might have to update the bios
on one of the drives (is this possible?).  Any suggestions/answers would be
greatly appreciated.  Please reply to:

	Jason Badry
	badry@cs.ualberta.ca


--------------------------


-- similarity score: 1.3118266609870635 --
From: delman@mipg.upenn.edu (Delman Lee)
Subject: Tandberg 3600 + Future Domain TMC-1660 + Seagate ST-21M problem??
Distribution: comp
Organization: University of Pennsylvania, USA.
Lines: 37
Nntp-Posting-Host: mipgsun.mipg.upenn.edu

I am trying to get my system to work with a Tandberg 3600 + Future
Domain TMC-1660 + Seagate ST-21M MFM controller. 

The system boots up if the Tandberg is disconnected from the system,
and of course no SCSI devices found (I have no other SCSI devices).

The system boots up if the Seagate MFM controller is removed from the
system. The Future Domain card reports finding the Tandberg 3660 on
the SCSI bus. The system then of course stops booting because my MFM
hard disks can't be found.

The system hangs if all three (Tandberg, Future Domain TMC-1660 &
Seagate MFM controller) are in the system. 

Looks like there is some conflict between the Seagate and Future
Domain card. But the funny thing is that it only hangs if the Tandberg
is connected.

I have checked that there are no conflict in BIOS addresses, IRQ & I/O
port. Have I missed anything?

I am lost here. Any suggestions are most welcomed. Thanks in advance.

Delman.



--
______________________________________________________________________

  Delman Lee                                 Tel.: +1-215-662-6780
  Medical Image Processing Group,            Fax.: +1-215-898-9145
  University of Pennsylvania,
  4/F Blockley Hall, 418 Service Drive,                         
  Philadelphia, PA 19104-6021,
  U.S.A..                            Internet: delman@mipg.upenn.edu
______________________________________________________________________

--------------------------

