Homework 6


张艺馨

15210130100

Topic Model Analysis with GraphLab


In [2]:
import graphlab

In [3]:
import graphlab as gl
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')

In [5]:
traindata_path = "/Users/zhangyixin/Desktop/cjc2016-gh-pages/labeledTrainData.tsv"
testdata_path = "/Users/zhangyixin/Desktop/cjc2016-gh-pages/testData.tsv"

In [6]:
import graphlab as gl
graphlab.product_key.set_product_key('7D9A-5351-5A47-786A-941D-38C6-2885-46EA')
train_data = gl.SFrame.read_csv(traindata_path,header=True, 
                                delimiter='\t',quote_char='"', 
                                column_type_hints = {'id':str, 
                                                     'sentiment' : int, 
                                                     'review':str } )
train_data['1grams features'] = gl.text_analytics.count_ngrams(
    train_data['review'],1)
train_data['2grams features'] = gl.text_analytics.count_ngrams(
    train_data['review'],2)
cls = gl.classifier.create(train_data, target='sentiment', 
                           features=['1grams features','2grams features'])


2016-05-20 19:36:42,323 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlab_server_1463744197.log
This non-commercial license of GraphLab Create is assigned to 421901797@qq.com and will expire on May 20, 2017. For commercial licensing options, visit https://dato.com/buy/.
Finished parsing file /Users/zhangyixin/Desktop/cjc2016-gh-pages/labeledTrainData.tsv
Parsing completed. Parsed 100 lines in 0.671221 secs.
Finished parsing file /Users/zhangyixin/Desktop/cjc2016-gh-pages/labeledTrainData.tsv
Parsing completed. Parsed 25000 lines in 0.998492 secs.
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 23731
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1454610
Number of coefficients    : 1454611
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000042  | 2.939134     | 0.999241          | 0.885737            |
| 2         | 5        | 1.000000  | 4.227698     | 0.999916          | 0.885737            |
| 3         | 6        | 1.000000  | 5.045617     | 0.999958          | 0.887313            |
| 4         | 7        | 1.000000  | 5.873717     | 0.999958          | 0.887313            |
| 5         | 8        | 1.000000  | 6.900162     | 1.000000          | 0.888101            |
| 6         | 9        | 1.000000  | 7.991087     | 1.000000          | 0.889677            |
| 7         | 10       | 1.000000  | 8.856936     | 1.000000          | 0.888889            |
| 8         | 11       | 1.000000  | 9.712011     | 1.000000          | 0.889677            |
| 9         | 12       | 1.000000  | 10.574125    | 1.000000          | 0.886525            |
| 10        | 13       | 1.000000  | 11.552290    | 1.000000          | 0.886525            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
SVM:
--------------------------------------------------------
Number of examples          : 23731
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1454610
Number of coefficients    : 1454611
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000042  | 2.083471     | 0.999241          | 0.885737            |
| 2         | 5        | 1.000000  | 3.627069     | 0.999958          | 0.884949            |
| 3         | 6        | 1.000000  | 4.512867     | 0.999958          | 0.884949            |
| 4         | 7        | 1.000000  | 5.422483     | 0.999958          | 0.884949            |
| 5         | 8        | 1.000000  | 6.627776     | 0.499136          | 0.484634            |
| 6         | 10       | 1.000000  | 8.427533     | 0.999916          | 0.884949            |
| 7         | 11       | 1.000000  | 9.446657     | 0.999958          | 0.884949            |
| 8         | 12       | 1.000000  | 10.390037    | 1.000000          | 0.884949            |
| 9         | 13       | 1.000000  | 11.312677    | 0.999874          | 0.885737            |
| 10        | 16       | 5.000000  | 13.328470    | 0.999874          | 0.884949            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.886525
PROGRESS: SVMClassifier                   : 0.884949
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.

In [7]:
movies_reviews_data = gl.SFrame.read_csv(traindata_path,header=True, delimiter='\t',quote_char='"', 
                                         column_type_hints = {'id':str, 'sentiment' : str, 'review':str } )


Finished parsing file /Users/zhangyixin/Desktop/cjc2016-gh-pages/labeledTrainData.tsv
Parsing completed. Parsed 100 lines in 0.896575 secs.
Finished parsing file /Users/zhangyixin/Desktop/cjc2016-gh-pages/labeledTrainData.tsv
Parsing completed. Parsed 25000 lines in 1.35816 secs.

In [8]:
movies_reviews_data.show()



In [9]:
movies_reviews_data['1grams features'] = gl.text_analytics.count_ngrams(movies_reviews_data ['review'],1)
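
The 1-gram features are per-review word counts. The block below is a minimal pure-Python sketch of that idea, not GraphLab's implementation: `count_ngrams_sketch` is a hypothetical helper, and the tokenization (lowercase, split on whitespace) is an assumption that may differ from what `gl.text_analytics.count_ngrams` does.

```python
from collections import Counter


def count_ngrams_sketch(text, n=1):
    """Count word n-grams in one document (hypothetical helper, not GraphLab code)."""
    # Assumed tokenization: lowercase, then split on whitespace.
    tokens = text.lower().split()
    # Slide a window of n tokens across the document.
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return dict(Counter(ngrams))


review = "a good movie and a good cast"
unigrams = count_ngrams_sketch(review, 1)
bigrams = count_ngrams_sketch(review, 2)
print(unigrams)
print(bigrams)
```

Each review becomes a dictionary from n-gram to count, which is why the number of unpacked features grows so quickly once 2-grams are added.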

In [10]:
movies_reviews_data.show(['review','1grams features'])


2016-05-20 19:39:29,626 [WARNING] graphlab.data_structures.sframe, 4920: Column selection for SFrame.show is deprecated. To show only certain columns, use the sf[['column1', 'column2']] syntax or construct a new SFrame with the desired columns.

In [11]:
train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)

In [12]:
model_1 = gl.classifier.create(train_set, target='sentiment', features=['1grams features'])


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 19067
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 79455
Number of coefficients    : 79456
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000052  | 0.777821     | 0.950700          | 0.862069            |
| 2         | 5        | 1.000000  | 1.363362     | 0.976714          | 0.872906            |
| 3         | 6        | 1.000000  | 1.708860     | 0.992815          | 0.889655            |
| 4         | 7        | 1.000000  | 1.971431     | 0.994913          | 0.885714            |
| 5         | 8        | 1.000000  | 2.235037     | 0.989668          | 0.845320            |
| 6         | 9        | 1.000000  | 2.557363     | 0.998164          | 0.885714            |
| 10        | 13       | 1.000000  | 3.667386     | 0.999528          | 0.879803            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
SVM:
--------------------------------------------------------
Number of examples          : 19067
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 79455
Number of coefficients    : 79456
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000052  | 0.449710     | 0.950700          | 0.862069            |
| 2         | 5        | 1.000000  | 1.005930     | 0.979913          | 0.873892            |
| 3         | 6        | 1.000000  | 1.424681     | 0.991084          | 0.881773            |
| 4         | 7        | 1.000000  | 1.764174     | 0.994441          | 0.883744            |
| 5         | 8        | 1.000000  | 2.127350     | 0.997063          | 0.876847            |
| 6         | 9        | 1.000000  | 2.483057     | 0.998479          | 0.878818            |
| 10        | 13       | 1.000000  | 3.542768     | 0.999685          | 0.876847            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.879803
PROGRESS: SVMClassifier                   : 0.876847
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.

In [13]:
result1 = model_1.evaluate(test_set)

In [14]:
def print_statistics(result):
    print "*" * 30
    print "Accuracy        : ", result["accuracy"]
    print "Confusion Matrix: \n", result["confusion_matrix"]
print_statistics(result1)


******************************
Accuracy        :  0.865595770638
Confusion Matrix: 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        1        |  462  |
|      1       |        0        |  199  |
|      1       |        1        |  2194 |
|      0       |        0        |  2063 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
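
The reported accuracy can be recomputed from the confusion matrix. The sketch below copies the four counts from the table above and also derives precision and recall for the positive class; the dictionary layout and variable names are illustrative assumptions, not part of the GraphLab result object.

```python
# Counts copied from the confusion matrix above: (target_label, predicted_label) -> count.
confusion = {(0, 1): 462, (1, 0): 199, (1, 1): 2194, (0, 0): 2063}

total = sum(confusion.values())
correct = sum(count for (target, predicted), count in confusion.items()
              if target == predicted)
accuracy = correct / float(total)

# Precision and recall for the positive class (label 1).
true_pos = confusion[(1, 1)]
precision = true_pos / float(true_pos + confusion[(0, 1)])
recall = true_pos / float(true_pos + confusion[(1, 0)])

print("accuracy  = %.6f" % accuracy)
print("precision = %.6f" % precision)
print("recall    = %.6f" % recall)
```

The model makes more false positives (462) than false negatives (199), so recall on positive reviews is higher than precision.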


In [15]:
movies_reviews_data['2grams features'] = gl.text_analytics.count_ngrams(movies_reviews_data['review'],2)

In [16]:
train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)
model_2 = gl.classifier.create(train_set, target='sentiment', features=['1grams features','2grams features'])
result2 = model_2.evaluate(test_set)
print_statistics(result2)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 19051
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1247038
Number of coefficients    : 1247039
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000052  | 1.520618     | 0.999475          | 0.871969            |
| 2         | 5        | 1.000000  | 2.730721     | 0.999948          | 0.871969            |
| 3         | 6        | 1.000000  | 3.489339     | 1.000000          | 0.871969            |
| 4         | 7        | 1.000000  | 4.244271     | 1.000000          | 0.872939            |
| 5         | 8        | 1.000000  | 5.009327     | 1.000000          | 0.872939            |
| 6         | 9        | 1.000000  | 5.754803     | 1.000000          | 0.873909            |
| 10        | 13       | 1.000000  | 8.867764     | 1.000000          | 0.872939            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
SVM:
--------------------------------------------------------
Number of examples          : 19051
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1247038
Number of coefficients    : 1247039
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000052  | 1.181267     | 0.999475          | 0.871969            |
| 2         | 5        | 1.000000  | 2.339975     | 1.000000          | 0.871969            |
| 3         | 6        | 1.000000  | 2.978354     | 1.000000          | 0.871969            |
| 4         | 7        | 1.000000  | 3.580178     | 0.000105          | 0.129971            |
| 5         | 9        | 1.000000  | 4.544126     | 1.000000          | 0.871969            |
| 6         | 10       | 1.000000  | 5.172173     | 1.000000          | 0.871969            |
| 10        | 31       | 0.069632  | 13.825946    | 1.000000          | 0.871969            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.872939
PROGRESS: SVMClassifier                   : 0.871969
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
******************************
Accuracy        :  0.88003253355
Confusion Matrix: 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        1        |  376  |
|      0       |        0        |  2149 |
|      1       |        1        |  2179 |
|      1       |        0        |  214  |
+--------------+-----------------+-------+
[4 rows x 3 columns]


In [17]:
traindata_path = "/Users/zhangyixin/Desktop/cjc2016-gh-pages/labeledTrainData.tsv"
testdata_path = "/Users/zhangyixin/Desktop/cjc2016-gh-pages/testData.tsv"
#creating classifier using all 25,000 reviews
train_data = gl.SFrame.read_csv(traindata_path,header=True, delimiter='\t',quote_char='"', 
                                column_type_hints = {'id':str, 'sentiment' : int, 'review':str } )
train_data['1grams features'] = gl.text_analytics.count_ngrams(train_data['review'],1)
train_data['2grams features'] = gl.text_analytics.count_ngrams(train_data['review'],2)

cls = gl.classifier.create(train_data, target='sentiment', features=['1grams features','2grams features'])
#creating the test dataset
test_data = gl.SFrame.read_csv(testdata_path,header=True, delimiter='\t',quote_char='"', 
                               column_type_hints = {'id':str, 'review':str } )
test_data['1grams features'] = gl.text_analytics.count_ngrams(test_data['review'],1)
test_data['2grams features'] = gl.text_analytics.count_ngrams(test_data['review'],2)

#predicting the sentiment of each review in the test dataset
test_data['sentiment'] = cls.classify(test_data)['class'].astype(int)

#saving the prediction to a CSV for submission
test_data[['id','sentiment']].save("/Users/zhangyixin/Desktop/cjc2016-gh-pages/predictions.csv", format="csv")


Finished parsing file /Users/zhangyixin/Desktop/cjc2016-gh-pages/labeledTrainData.tsv
Parsing completed. Parsed 100 lines in 0.63047 secs.
Finished parsing file /Users/zhangyixin/Desktop/cjc2016-gh-pages/labeledTrainData.tsv
Parsing completed. Parsed 25000 lines in 1.05699 secs.
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 23820
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1462856
Number of coefficients    : 1462857
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000042  | 1.540691     | 0.999244          | 0.882203            |
| 2         | 5        | 1.000000  | 3.079095     | 0.999916          | 0.881356            |
| 3         | 6        | 1.000000  | 3.891543     | 0.999958          | 0.881356            |
| 4         | 7        | 1.000000  | 4.669646     | 0.999958          | 0.881356            |
| 5         | 8        | 1.000000  | 5.452494     | 1.000000          | 0.882203            |
| 6         | 9        | 1.000000  | 6.543843     | 1.000000          | 0.881356            |
| 10        | 13       | 1.000000  | 10.251021    | 1.000000          | 0.881356            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.
WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
SVM:
--------------------------------------------------------
Number of examples          : 23820
Number of classes           : 2
Number of feature columns   : 2
Number of unpacked features : 1462856
Number of coefficients    : 1462857
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 3        | 0.000042  | 1.653151     | 0.999244          | 0.882203            |
| 2         | 5        | 1.000000  | 2.796425     | 0.999958          | 0.879661            |
| 3         | 6        | 1.000000  | 3.551182     | 0.999958          | 0.879661            |
| 4         | 7        | 1.000000  | 4.279049     | 0.999958          | 0.879661            |
| 5         | 8        | 1.000000  | 5.080197     | 0.174811          | 0.363559            |
| 6         | 10       | 1.000000  | 6.405918     | 1.000000          | 0.882203            |
| 10        | 26       | 0.005336  | 15.607498    | 1.000000          | 0.881356            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.881356
PROGRESS: SVMClassifier                   : 0.881356
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
Finished parsing file /Users/zhangyixin/Desktop/cjc2016-gh-pages/testData.tsv
Parsing completed. Parsed 100 lines in 0.671379 secs.
Finished parsing file /Users/zhangyixin/Desktop/cjc2016-gh-pages/testData.tsv
Parsing completed. Parsed 25000 lines in 0.986929 secs.

In [18]:
%matplotlib inline
from __future__ import print_function
from wordcloud import WordCloud
from gensim import corpora, models, similarities,  matutils
import matplotlib.pyplot as plt
import numpy as np

In [19]:
corpus = corpora.BleiCorpus('/Users/zhangyixin/Desktop/cjc2016-gh-pages/ap/ap.dat', '/Users/zhangyixin/Desktop/cjc2016-gh-pages/ap/vocab.txt')

In [20]:
' '.join(dir(corpus))


Out[20]:
'__class__ __delattr__ __dict__ __doc__ __format__ __getattribute__ __getitem__ __hash__ __init__ __iter__ __len__ __module__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _adapt_by_suffix _load_specials _save_specials _smart_save docbyoffset fname id2word index length line2doc load save save_corpus serialize'

In [21]:
corpus.id2word.items()[:3]


Out[21]:
[(0, u'i'), (1, u'new'), (2, u'percent')]

In [22]:
NUM_TOPICS = 100

In [25]:
model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=None)


2016-05-20 19:45:38,474 [WARNING] gensim.models.ldamodel, 617: too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy

In [26]:
' '.join(dir(model))


Out[26]:
'__class__ __delattr__ __dict__ __doc__ __format__ __getattribute__ __getitem__ __hash__ __init__ __module__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _adapt_by_suffix _apply _load_specials _save_specials _smart_save alpha bound chunksize clear decay dispatcher distributed do_estep do_mstep eta eval_every expElogbeta gamma_threshold get_document_topics get_topic_terms id2word inference init_dir_prior iterations load log_perplexity minimum_probability num_terms num_topics num_updates numworkers offset optimize_alpha optimize_eta passes print_topic print_topics save show_topic show_topics state sync_state top_topics update update_alpha update_eta update_every'

In [27]:
document_topics = [model[c] for c in corpus]

In [28]:
document_topics[2]


Out[28]:
[(11, 0.015588013505054232),
 (12, 0.12001077540650702),
 (23, 0.13437639081004069),
 (32, 0.045015029291970717),
 (33, 0.026628675280810588),
 (36, 0.24629330936241586),
 (39, 0.15115425480361758),
 (44, 0.039951115879590528),
 (48, 0.022577554861062794),
 (53, 0.021563411020602535),
 (57, 0.081237470073098728),
 (69, 0.021359626249450606),
 (71, 0.044532690445549479)]
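
Each entry of `document_topics` is a sparse list of (topic id, probability) pairs. A minimal sketch, using the pairs printed above, of how to pick the dominant topic of a document and check how much probability mass the listed topics cover (the variable names are illustrative):

```python
# (topic_id, probability) pairs copied from the output above.
doc_topics = [
    (11, 0.015588013505054232), (12, 0.12001077540650702),
    (23, 0.13437639081004069), (32, 0.045015029291970717),
    (33, 0.026628675280810588), (36, 0.24629330936241586),
    (39, 0.15115425480361758), (44, 0.039951115879590528),
    (48, 0.022577554861062794), (53, 0.021563411020602535),
    (57, 0.081237470073098728), (69, 0.021359626249450606),
    (71, 0.044532690445549479),
]

# The dominant topic is the pair with the largest probability.
dominant_topic, dominant_prob = max(doc_topics, key=lambda pair: pair[1])

# Probability mass covered by the topics that were reported.
covered = sum(prob for _, prob in doc_topics)

print("dominant topic: %d (p = %.4f)" % (dominant_topic, dominant_prob))
print("covered mass  : %.4f" % covered)
```

The covered mass is slightly below 1 because topics whose probability falls under the model's `minimum_probability` threshold are left out of the list.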

In [29]:
model.show_topic(0, 10)


Out[29]:
[(u'court', 0.014528102507076909),
 (u'm', 0.010065075988315506),
 (u'thompson', 0.0095841366179563788),
 (u'kennedy', 0.0089711026468543634),
 (u'graham', 0.0083330687620746333),
 (u'year', 0.0080327258412096324),
 (u'angeles', 0.0073669031082667534),
 (u'los', 0.0069006999284592814),
 (u'million', 0.0067608270353513248),
 (u'president', 0.0065583439011857227)]

In [30]:
model.show_topic(99, 10)


Out[30]:
[(u'symphony', 0.0093464777066745684),
 (u'talks', 0.0065817066230704751),
 (u'i', 0.0045228045851080255),
 (u'orchestras', 0.0044892947801232676),
 (u'orchestra', 0.0044810035462180351),
 (u'inquiry', 0.0042804555326826973),
 (u'bush', 0.0042315079280181717),
 (u'neighboring', 0.0041763896803668028),
 (u'two', 0.004095495419954783),
 (u'mediator', 0.00370846371533688)]

In [31]:
words = model.show_topic(0, 5)
words


Out[31]:
[(u'court', 0.014528102507076909),
 (u'm', 0.010065075988315506),
 (u'thompson', 0.0095841366179563788),
 (u'kennedy', 0.0089711026468543634),
 (u'graham', 0.0083330687620746333)]

In [32]:
model.show_topics(4)


Out[32]:
[(65,
  u'0.019*rings + 0.012*handling + 0.011*percent + 0.010*approval + 0.009*disapproved + 0.008*uss + 0.008*monsignor + 0.006*rating + 0.006*poll + 0.005*york'),
 (74,
  u'0.014*thursday + 0.010*seoul + 0.008*government + 0.008*engineering + 0.007*executives + 0.006*resigned + 0.006*group + 0.006*strikers + 0.006*company + 0.005*two'),
 (96,
  u'0.008*west + 0.005*government + 0.005*gull + 0.005*united + 0.004*threats + 0.004*east + 0.004*rights + 0.004*new + 0.004*german + 0.004*human'),
 (53,
  u'0.028*virus + 0.022*aids + 0.015*infected + 0.007*number + 0.006*princess + 0.006*government + 0.005*britain + 0.005*years + 0.005*computer + 0.004*parties')]

In [33]:
for f, w in words[:10]:
    print(f)


court
m
thompson
kennedy
graham

In [34]:
words = model.show_topic(0, 10)
for (f, w) in words:
    print(w)


0.0145281025071
0.0100650759883
0.00958413661796
0.00897110264685
0.00833306876207
0.00803272584121
0.00736690310827
0.00690069992846
0.00676082703535
0.00655834390119

In [35]:
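for f, w in words:
    print(f + '\t' + str(w))


court	0.0145281025071
m	0.0100650759883
thompson	0.00958413661796
kennedy	0.00897110264685
graham	0.00833306876207
year	0.00803272584121
angeles	0.00736690310827
los	0.00690069992846
million	0.00676082703535
president	0.00655834390119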

In [36]:
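# Write the top 10 terms of every topic, with weights normalized to sum to 1 within each topic.
# The file is opened once in 'w' mode so that re-running the cell does not append duplicate rows.
with open('/Users/zhangyixin/Desktop/cjc2016-gh-pages/data/topics_term_weight.txt', 'w') as output:
    for ti in range(model.num_topics):
        words = model.show_topic(ti, 10)
        tf = sum(w for f, w in words)
        for f, w in words:
            line = str(ti) + '\t' + f + '\t' + str(w/tf)
            output.write(line + '\n')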
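
The normalization step divides each term weight by the sum of the weights of the terms shown for that topic. A small sketch using the top five terms of topic 0 from Out[31]; because the cell above uses the top ten terms, the normalized values here differ from those written to the file.

```python
# Top five (term, weight) pairs of topic 0, copied from Out[31].
top_words = [
    ("court", 0.014528102507076909),
    ("m", 0.010065075988315506),
    ("thompson", 0.0095841366179563788),
    ("kennedy", 0.0089711026468543634),
    ("graham", 0.0083330687620746333),
]

topic_id = 0
total_weight = sum(weight for _, weight in top_words)

# Normalize so that the weights of the listed terms sum to 1.
normalized = [(term, weight / total_weight) for term, weight in top_words]

# Same tab-separated layout as the file: topic id, term, normalized weight.
rows = ["%d\t%s\t%s" % (topic_id, term, share) for term, share in normalized]
for row in rows:
    print(row)
```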

In [37]:
topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
weight = topics.sum(1)
max_topic = weight.argmax()
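
`matutils.corpus2dense` returns a matrix with one row per topic and one column per document, so `topics.sum(1)` adds up each topic's weight over all documents and `argmax` picks the most prominent topic. The same computation in plain Python, on hypothetical data for three documents and three topics:

```python
# Hypothetical sparse document-topic lists: (topic_id, probability) pairs.
doc_topic_lists = [
    [(0, 0.7), (2, 0.3)],
    [(1, 0.5), (2, 0.5)],
    [(2, 0.9), (0, 0.1)],
]
num_topics = 3

# Accumulate each topic's total weight over all documents
# (the equivalent of summing the dense matrix along its document axis).
weight = [0.0] * num_topics
for doc in doc_topic_lists:
    for topic_id, prob in doc:
        weight[topic_id] += prob

# Index of the topic with the largest total weight (the argmax).
max_topic = max(range(num_topics), key=lambda t: weight[t])

print(weight)
print(max_topic)
```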

In [38]:
words = model.show_topic(max_topic, 64)
words = np.array(words).T
words[1]


Out[38]:
array([u'0.0168854634895', u'0.0147924007086', u'0.0139823754491',
       u'0.0118704117955', u'0.00845953615682', u'0.00549778326825',
       u'0.00537064913284', u'0.00449777462432', u'0.00410563036729',
       u'0.00403523819245', u'0.00385354182863', u'0.00365833995995',
       u'0.00334539906402', u'0.00317553485567', u'0.00311505981057',
       u'0.00305190463274', u'0.00288884771858', u'0.00278664063064',
       u'0.00275176934014', u'0.00269486372796', u'0.00268299680329',
       u'0.00264395662103', u'0.00264373001381', u'0.00261603957545',
       u'0.00256441698076', u'0.00252773190374', u'0.00251078816282',
       u'0.00248833327269', u'0.00239719526349', u'0.00238095384806',
       u'0.00237420717464', u'0.00236892632469', u'0.00233266315108',
       u'0.00232531305468', u'0.0023125492369', u'0.00229472491592',
       u'0.00228880824425', u'0.00227361074819', u'0.0022592005262',
       u'0.00218685172225', u'0.00218336515051', u'0.00217851112227',
       u'0.00215560227839', u'0.00214947953054', u'0.00212669409476',
       u'0.00211786368672', u'0.00209626588227', u'0.00209016966966',
       u'0.00208843471184', u'0.00204923711202', u'0.0020333456854',
       u'0.0020225134884', u'0.00200659545705', u'0.00197680591201',
       u'0.00196993719919', u'0.00196526241573', u'0.00195588341191',
       u'0.00194557420471', u'0.0019223175105', u'0.00190613848864',
       u'0.00190386150585', u'0.00190231639668', u'0.00189074179146',
       u'0.00188545063498'], 
      dtype='<U16')

In [39]:
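words = model.show_topic(max_topic, 64)
words = np.array(words).T
# np.array stored the weights as strings, so convert them back to floats
# and scale the small probabilities up by 1e7.
words_freq = [float(i) * 10000000 for i in words[1]]
words = zip(words[0], words_freq)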

In [40]:
wordcloud = WordCloud().generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()



In [41]:
num_topics_used = [len(model[doc]) for doc in corpus]

fig,ax = plt.subplots()
ax.hist(num_topics_used, np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
fig.tight_layout()
#fig.savefig('Figure_04_01.png')
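
The histogram is built from the number of topics reported for each document. A sketch of the tally behind it, using `collections.Counter` on hypothetical topic lists for four documents:

```python
from collections import Counter

# Hypothetical sparse topic lists for four documents: (topic_id, probability) pairs.
doc_topic_lists = [
    [(3, 0.6), (7, 0.4)],
    [(1, 1.0)],
    [(0, 0.5), (3, 0.3), (9, 0.2)],
    [(2, 0.8), (5, 0.2)],
]

# Number of topics reported for each document.
num_topics_used = [len(doc) for doc in doc_topic_lists]

# How many documents use 1 topic, 2 topics, and so on.
histogram = Counter(num_topics_used)
for n_topics in sorted(histogram):
    print("%d topic(s): %d document(s)" % (n_topics, histogram[n_topics]))
```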



In [42]:
ALPHA = 1.0
model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=ALPHA)

num_topics_used1 = [len(model1[doc]) for doc in corpus]


2016-05-20 19:47:46,856 [WARNING] gensim.models.ldamodel, 617: too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy

In [43]:
fig,ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
# The coordinates below were fit by trial and error to look good
plt.text(9, 223, r'default alpha')
plt.text(26, 156, 'alpha=1.0')
fig.tight_layout()



In [44]:
with open('/Users/zhangyixin/Desktop/cjc2016-gh-pages/ap/ap.txt', 'r') as f:
    dat = f.readlines()

In [45]:
dat[:6]


Out[45]:
['<DOC>\n',
 '<DOCNO> AP881218-0003 </DOCNO>\n',
 '<TEXT>\n',
 " A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered on the floor crying ``Jesus save us! God save us!'' Friends and family said the boy apparently was troubled by his grandmother's death and the divorce of his parents and had been tormented by classmates. Nicholas' grandfather, Clarence Elliott Sr., said Saturday that the boy's parents separated about four years ago and his maternal grandmother, Channey Williams, died last year after a long illness. The grandfather also said his grandson was fascinated with guns. ``The boy was always talking about guns,'' he said. ``He knew a lot about them. He knew all the names of them _ none of those little guns like a .32 or a .22 or nothing like that. He liked the big ones.'' The slain teacher was identified as Karen H. Farley, 40. The wounded teacher, 37-year-old Sam Marino, was in serious condition Saturday with gunshot wounds in the shoulder. Police said the boy also shot at a third teacher, Susan Allen, 31, as she fled from the room where Marino was shot. He then shot Marino again before running to a third classroom where a Bible class was meeting. 
The youngster shot the glass out of a locked door before opening fire, police spokesman Lewis Thurston said. When the youth's pistol jammed, he was tackled by teacher Maurice Matteson, 24, and other students, Thurston said. ``Once you see what went on in there, it's a miracle that we didn't have more people killed,'' Police Chief Charles R. Wall said. Police didn't have a motive, Detective Tom Zucaro said, but believe the boy's primary target was not a teacher but a classmate. Officers found what appeared to be three Molotov cocktails in the boy's locker and confiscated the gun and several spent shell casings. Fourteen rounds were fired before the gun jammed, Thurston said. The gun, which the boy carried to school in his knapsack, was purchased by an adult at the youngster's request, Thurston said, adding that authorities have interviewed the adult, whose name is being withheld pending an investigation by the federal Bureau of Alcohol, Tobacco and Firearms. The shootings occurred in a complex of four portable classrooms for junior and senior high school students outside the main building of the 4-year-old school. The school has 500 students in kindergarten through 12th grade. Police said they were trying to reconstruct the sequence of events and had not resolved who was shot first. The body of Ms. Farley was found about an hour after the shootings behind a classroom door.\n",
 ' </TEXT>\n',
 '</DOC>\n']

In [46]:
dat[4].strip()[0]


Out[46]:
'<'

In [47]:
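docs = []
for line in dat[:100]:
    stripped = line.strip()
    # Keep body text only: skip blank lines and tag lines such as <DOC>, <TEXT>.
    if stripped and not stripped.startswith('<'):
        docs.append(line)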

In [48]:
def clean_doc(doc):
    """Strip punctuation and quote marks (. , `` " _ ' !) from a document."""
    for mark in ('.', ',', '``', '"', '_', "'", '!'):
        doc = doc.replace(mark, '')
    return doc
docs = [clean_doc(doc) for doc in docs]

In [49]:
# Lowercase each document and split it into whitespace-separated tokens.
texts = [doc.lower().split() for doc in docs]

In [50]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [51]:
' '.join(stop)


Out[51]:
u'i me my myself we our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom this that these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don should now d ll m o re ve y ain aren couldn didn doesn hadn hasn haven isn ma mightn mustn needn shan shouldn wasn weren won wouldn'

In [52]:
# Treat 'said' as an extra stop word: it is pervasive in newswire text.
stop.append('said')

In [53]:
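from collections import defaultdict

# Count how often each token occurs across the whole corpus.
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Drop stop words and tokens that occur only once in the corpus.
stop_set = set(stop)  # set lookup is faster than scanning the list
texts = [[token for token in text if frequency[token] > 1 and token not in stop_set]
         for text in texts]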
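
The filtering step above can be checked on a toy corpus. This is a minimal sketch, assuming made-up documents and a made-up stop list (`toy_texts`, `toy_stop`, `toy_frequency` and `toy_filtered` are hypothetical names, not part of the notebook). It uses `collections.Counter` where the notebook uses `defaultdict(int)`; both count tokens the same way.

```python
from collections import Counter

# Hypothetical toy data, for illustration only.
toy_texts = [
    ['police', 'said', 'the', 'student', 'fired'],
    ['the', 'student', 'was', 'arrested', 'police', 'said'],
]
toy_stop = {'the', 'was', 'said'}

# Count every token across the whole corpus.
toy_frequency = Counter(token for text in toy_texts for token in text)

# Keep tokens that occur more than once in the corpus and are not stop words.
toy_filtered = [
    [token for token in text if toy_frequency[token] > 1 and token not in toy_stop]
    for text in toy_texts
]
print(toy_filtered)
```

Note that the threshold applies to corpus-wide counts, so a token seen once in each of two documents is kept.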

In [54]:
docs[8]


Out[54]:
' Here is a summary of developments in forest and brush fires in Western states:\n'

In [55]:
' '.join(texts[9])


Out[55]:
'stirbois 2 man extreme-right national front party le pen died saturday automobile police 43 stirbois political meeting friday city dreux miles west paris traveling toward capital car ran police stirbois national front member party since born paris law headed business stirbois several extreme-right political joining national front 1977 percent vote local elections west paris highest vote percentage candidate year half later deputy dreux stirbois deputy national 1986 lost elections last national front founded le pen frances government death priority first years presidential elections le pen percent vote national front could'

In [56]:
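from gensim import corpora, models  # harmless if already imported earlier
# Map tokens to integer ids, then turn each document into (token_id, count) pairs.
dictionary = corpora.Dictionary(texts)
lda_corpus = [dictionary.doc2bow(text) for text in texts]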
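
To make the bag-of-words format concrete, here is a pure-Python sketch of the idea behind `doc2bow`: each document becomes a list of `(token_id, count)` pairs. The helpers `build_vocab` and `to_bow` are hypothetical and are not gensim's implementation; in particular, the integer ids gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(text, vocab):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())

# Hypothetical toy data, for illustration only.
toy_texts = [['national', 'front', 'party'], ['front', 'vote', 'front']]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(text, toy_vocab) for text in toy_texts]
print(toy_vocab)
print(toy_corpus)
```

Word order is discarded in this representation; only the per-document counts are passed on to the topic model.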

In [57]:
lda_model = models.ldamodel.LdaModel(
    lda_corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha=None)


2016-05-20 19:49:59,599 [WARNING] gensim.models.ldamodel, 617: too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy

In [58]:
import pyLDAvis.gensim

ap_data = pyLDAvis.gensim.prepare(lda_model, lda_corpus, dictionary)


//anaconda/lib/python2.7/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:102: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -0.202072339705 and the largest is 0.677878230145.
  RuntimeWarning

In [59]:
pyLDAvis.enable_notebook()
pyLDAvis.display(ap_data)


Out[59]:

In [60]:
pyLDAvis.save_html(ap_data, '/Users/zhangyixin/Desktop/cjc2016-gh-pages/vis/ap_ldavis.html')
