http://v.youku.com/v_show/id_XMzA3OTA5MjUy.html
This talk by Jean-Baptiste Michel and Erez Lieberman Aiden is phenomenal. The associated article is also well worth checking out: Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331, 176–182.
Try the Google Books Ngram data: https://books.google.com/ngrams/
Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will consist mostly of zeros, which is why we call them sparse.
The bag-of-words model (词袋), as used in information retrieval, assumes that a text's word order, grammar, and syntax can be ignored: the text is treated simply as a collection of words, and each word occurs independently of whether any other word occurs. Equivalently, the author is assumed to pick each word at any position without being influenced by the preceding sentences. Although this assumption simplifies natural language, it makes texts easy to model.
In some situations, however, the assumption is unreasonable. In personalized news recommendation, for example, a bag-of-words model runs into trouble: if a user is interested in the phrase "南京醉酒驾车事故" (a drunk-driving accident in Nanjing), the model ignores order and syntax and treats the user as being interested in "南京" (Nanjing), "醉酒" (drunk), "驾车" (driving), and "事故" (accident) separately, so it may recommend news related to "南京", "公交车" (bus), and "事故", which is clearly not what the user wants.
One remedy is to extract whole phrases (for example with the SCPCD method), or to use a higher-order (order 2 or above) statistical language model such as bigrams or trigrams to preserve word order, which amounts to a bag of bigrams or a bag of trigrams; this mitigates the problem to some extent. In short, whether the bag-of-words model is appropriate depends on the application: it should not be used when word order, grammar, or syntax cannot be ignored.
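As a minimal sketch of the order-loss point (using scikit-learn's CountVectorizer, which is introduced properly in the cells below, and two made-up example sentences), documents containing the same words in a different order get identical bag-of-words vectors:
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the dog bites the man', 'the man bites the dog']  # same words, different order
count = CountVectorizer()
bag = count.fit_transform(docs)
print(count.get_feature_names())  # ['bites', 'dog', 'man', 'the']
print(bag.toarray())              # both rows are [1 1 1 2]: word order is lost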
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.
In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
There are various schemes for determining the value that each entry in the matrix should take; one common scheme is tf-idf. Document-term matrices are widely used in natural language processing.
D1 = "I like databases"
D2 = "I hate databases"
|    | I | like | hate | databases |
|----|---|------|------|-----------|
| D1 | 1 | 1    | 0    | 1         |
| D2 | 1 | 0    | 1    | 1         |
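As a quick sketch (not part of the original example), the same matrix can be reproduced with scikit-learn's CountVectorizer; the default token pattern drops single-character tokens, so it is relaxed here to keep "I". The cells below then apply the same tools to a different set of example sentences.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ['I like databases', 'I hate databases']
count = CountVectorizer(token_pattern=r'(?u)\b\w+\b')  # keep one-letter tokens such as 'i'
dtm = count.fit_transform(docs)
pd.DataFrame(dtm.toarray(), columns=count.get_feature_names(), index=['D1', 'D2'])
# columns come out alphabetically: databases, hate, i, like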
In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
'The sun is shining',
'The weather is sweet',
'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
In [6]:
' '.join(dir(count))
Out[6]:
In [8]:
count.get_feature_names()
Out[8]:
In [2]:
print(count.vocabulary_)
In [5]:
type(bag)
Out[5]:
In [3]:
print(bag.toarray())
In [12]:
import pandas as pd
pd.DataFrame(bag.toarray(), columns = count.get_feature_names())
Out[12]:
The sequence of items in the bag-of-words model that we just created is also called the 1-gram or unigram model: each item in the vocabulary is a single word. More generally, an n-gram is a contiguous sequence of n items; for example, the 2-grams of "the sun is shining" are "the sun", "sun is", and "is shining". The choice of the number n in the n-gram model depends on the particular application.
The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter.
While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer instance with ngram_range=(2, 2).
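As a minimal illustration, assuming the docs array from the cell above is still defined:
bigram_count = CountVectorizer(ngram_range=(2, 2))
bigram_bag = bigram_count.fit_transform(docs)
print(bigram_count.get_feature_names())
# ['and the', 'is shining', 'is sweet', 'shining and', 'sun is', 'the sun', 'the weather', 'weather is']
print(bigram_bag.toarray())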
Scikit-learn implements yet another transformer, the TfidfTransformer, that takes the raw term frequencies from CountVectorizer as input and transforms them into tf-idfs:
In [15]:
np.set_printoptions(precision=2)
In [16]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
In [17]:
bag = tfidf.fit_transform(count.fit_transform(docs))
pd.DataFrame(bag.toarray(), columns = count.get_feature_names())
Out[17]:
In [18]:
# tf-idf value of a single term ("is" in the third document)
tf_is = 2                              # "is" occurs twice in the third document
n_docs = 3                             # total number of documents
idf_is = np.log((n_docs+1) / (3+1))    # "is" appears in all 3 documents, so df = 3
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)
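For reference, with smooth_idf=True scikit-learn uses the smoothed inverse document frequency, and the cell above simply plugs in tf("is", d3) = 2, df("is") = 3, and n_d = 3:
$$\mathrm{idf}(t) = \ln\frac{1 + n_d}{1 + \mathrm{df}(t)}, \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \bigl(\mathrm{idf}(t) + 1\bigr)$$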
In [19]:
# raw (unnormalized) tf-idf values of the terms in the last document
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf
Out[19]:
In [20]:
# tf-idf values after l2 normalization
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf
Out[20]:
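This manual division is what norm='l2' does inside TfidfTransformer, which is why the normalized values match the earlier output; in formula form:
$$v_{\text{norm}} = \frac{v}{\lVert v\rVert_2} = \frac{v}{\sqrt{\sum_i v_i^2}}$$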
In [1]:
with open('/Users/chengjun/github/cjc2016/data/gov_reports1954-2016.txt', 'r') as f:
reports = f.readlines()
In [2]:
len(reports)
Out[2]:
In [3]:
print reports[32][:1000]
In [4]:
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import sys
import numpy as np
from collections import defaultdict
import statsmodels.api as sm
from wordcloud import WordCloud
import jieba
import matplotlib
import gensim
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei'] # set the default font so Chinese characters display correctly
matplotlib.rc("savefig", dpi=400)
In [6]:
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode
seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search-engine mode
print(", ".join(seg_list))
In [7]:
filename = '/Users/chengjun/github/cjc2016/data/stopwords.txt'
stopwords = {}
with open(filename, 'r') as f:
    for line in f:
        word = line.rstrip().decode('utf-8')  # decode once so all keys are unicode
        if word:
            stopwords[word] = 1
In [8]:
adding_stopwords = [u'我们', u'要', u'地', u'有', u'这', u'人',
u'发展',u'建设',u'加强',u'继续',u'对',u'等',u'推进',u'工作',u'增加']
for s in adding_stopwords: stopwords[s]=10
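A side note, not from the original text: the stopwords dictionary built above can be used to filter tokens ourselves, but jieba's keyword extraction also has its own mechanism. If you want extract_tags below to honor a stop-word list as well, you can point jieba.analyse at a file containing one stop word per line (here assuming the same stopwords.txt):
import jieba.analyse
jieba.analyse.set_stop_words('/Users/chengjun/github/cjc2016/data/stopwords.txt')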
In [14]:
import jieba.analyse
txt = reports[-1]
tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)  # tf-idf based keyword extraction
In [262]:
print u"、".join([i[0] for i in tf[:50]])
In [267]:
plt.hist([i[1] for i in tf])
plt.show()
In [264]:
tr = jieba.analyse.textrank(txt, topK=200, withWeight=True)  # TextRank: PageRank over a word co-occurrence graph
print u"、".join([i[0] for i in tr[:50]])
In [268]:
plt.hist([i[1] for i in tr])
plt.show()
In [75]:
import pandas as pd
def keywords(index):
    txt = reports[-index]
    tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)
    tr = jieba.analyse.textrank(txt, topK=200, withWeight=True)
    tfdata = pd.DataFrame(tf, columns=['word', 'tfidf'])
    trdata = pd.DataFrame(tr, columns=['word', 'textrank'])
    worddata = pd.merge(tfdata, trdata, on='word')
    plt.plot(worddata.tfidf, worddata.textrank, linestyle='', marker='.')
    for i in range(len(worddata.word)):
        plt.text(worddata.tfidf[i], worddata.textrank[i], worddata.word[i],
                 fontsize=worddata.textrank[i]*15, color='red', rotation=0)
    plt.title(txt[:4])  # the first four characters of each report are its year
    plt.xlabel('Tf-Idf')
    plt.ylabel('TextRank')
    plt.show()
In [80]:
keywords(1)
In [269]:
keywords(2)
In [270]:
keywords(3)
In [59]:
def wordcloudplot(txt, year):
    wordcloud = WordCloud(font_path='/Users/chengjun/github/cjc2016/data/msyh.ttf').generate(txt)
    # Open a plot of the generated image.
    plt.imshow(wordcloud)
    plt.title(year)
    plt.axis("off")
    #plt.show()
In [326]:
txt = reports[-1]
tfidf200 = jieba.analyse.extract_tags(txt, topK=200, withWeight=False)
seg_list = jieba.cut(txt, cut_all=False)
seg_list = [i for i in seg_list if i in tfidf200]
txt200 = ' '.join(seg_list)
wordcloudplot(txt200, txt[:4])