In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first sentence',
    'This is the second sentence',
    'Something that has nothing to do with the others',
    'This is the final sentence'
]
In [ ]:
vectorizer = TfidfVectorizer(use_idf=True)  # use_idf=True is already the default, shown here for emphasis
vectors = vectorizer.fit_transform(corpus)
In [38]:
vectors.shape
Out[38]:
(4, 15)
In [48]:
import pandas as pd
# One row per vocabulary term, one column per sentence
df = pd.DataFrame(index=vectorizer.get_feature_names_out())
for i, doc in enumerate(vectors):
    df[f"sentence {i+1}"] = doc.T.todense()
df
Out[48]:
TF-IDF limitations: two documents on the same subject, but written with different words for the same ideas (synonyms), will end up not being related at all, because TF-IDF can only match exact tokens.
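A quick sketch of that limitation (the sentence pair here is made up for illustration): two sentences that mean nearly the same thing but share no content words get a cosine similarity of exactly zero.
In [ ]:
from sklearn.metrics.pairwise import cosine_similarity

# Same meaning, no shared vocabulary (after English stop words are removed)
pair = ['The car sped down the road', 'An automobile raced along the street']
pair_vectors = TfidfVectorizer(stop_words='english').fit_transform(pair)
# 0.0: TF-IDF only matches exact tokens, so it sees no relation at all
cosine_similarity(pair_vectors[0], pair_vectors[1])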
Latent Semantic Analysis (LSA) compresses TF-IDF word scores into a smaller set of "topic vectors". This enables semantic search, which can query documents based on their meaning or find similar documents. Topic vectors can be added and subtracted to learn new things, and vectors pointing in a similar direction end up having similar meanings!
LSA is unsupervised
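A minimal LSA sketch on the vectors above: scikit-learn's TruncatedSVD is the usual way to run LSA on a sparse TF-IDF matrix, and the choice of 2 topic dimensions here is arbitrary for this toy corpus.
In [ ]:
from sklearn.decomposition import TruncatedSVD

# Compress the 15-dimensional TF-IDF vectors down to 2 topic dimensions
svd = TruncatedSVD(n_components=2)
topic_vectors = svd.fit_transform(vectors)
# Sentences 1, 2 and 4 should land close together in topic space,
# while sentence 3 (the unrelated one) should end up far from them
topic_vectors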
Linear Discriminant Analysis (LDA) creates a single topic vector per document. Just compute the centroid of all the TF-IDF vectors for each class in a classification problem (e.g. spam vs. non-spam) and draw the line between the centroids. How far along that line a TF-IDF vector falls tells you which class it belongs to.
LDA is a supervised algorithm that needs labels for training.
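That centroid idea can be sketched by hand before reaching for scikit-learn. This mirrors the X and y used in the cells below; the classify helper is just for illustration.
In [ ]:
import numpy as np

X = vectors.toarray()
y = np.array([10, 10, 0, 10])  # sentence 3 is the odd one out

# One centroid per class, and the line running between them
centroid_a = X[y == 10].mean(axis=0)
centroid_b = X[y == 0].mean(axis=0)
line = centroid_a - centroid_b
midpoint = (centroid_a + centroid_b) / 2

def classify(vec):
    # Which side of the midpoint the vector projects onto decides the class
    return 10 if (vec - midpoint) @ line > 0 else 0

classify(X[2])  # sentence 3 -> 0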
Latent Dirichlet Allocation (LDiA) can create multiple topic vectors per document: each document gets a mixture of topic weights rather than a single score.
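A minimal LDiA sketch with scikit-learn's LatentDirichletAllocation. Note that LDiA is normally fit on raw word counts rather than TF-IDF weights, so a CountVectorizer is used here; 2 topics and the fixed random_state are arbitrary choices.
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDiA models raw counts, not TF-IDF weights
counts = CountVectorizer().fit_transform(corpus)
ldia = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is one document's mixture over the 2 topics
ldia.fit_transform(counts)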
In [82]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
X = vectors.toarray()
y = [10, 10, 0, 10]  # class labels: sentence 3 (the unrelated one) gets its own class
clf = LDA()
clf.fit(X, y)
Out[82]:
In [83]:
clf.predict([[0] * 15])  # an all-zeros 15-dimensional TF-IDF vector
Out[83]:
In [85]:
clf.predict([df.iloc[:, 2].values])  # the TF-IDF vector for sentence 3
Out[85]:
In [86]:
df.iloc[:,2].values
Out[86]: