A potentially useful feature is a comparison between the topic distribution of each sentence in an article and the topic distribution of the article itself. The question is: do highlighted sentences contain words that are more or less associated with the topic of the article than sentences that are not highlighted?

This type of analysis requires a topic model such as Latent Dirichlet Allocation (LDA). Here, I use LDA (from the gensim library) to generate topics for the corpus of scraped articles and calculate a topic vector for each article. Then, when calculating features for the sentences in the dataset, I can apply the LDA model to each sentence to generate its topic vector and compute the cosine similarity between that vector and the topic vector of the article the sentence belongs to.
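
As a minimal sketch of the similarity score described above, assuming topic vectors are plain lists of equal length. The function name `cosine_similarity` and the example vectors are illustrative and do not come from the notebook:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical 3-topic vectors for an article and one of its sentences
article_vec = [0.7, 0.2, 0.1]
sentence_vec = [0.6, 0.3, 0.1]
similarity = cosine_similarity(article_vec, sentence_vec)
```

A score near 1 would suggest that the sentence has roughly the same topic mix as its article, while a score near 0 would suggest that it draws on different topics.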

Initial text processing

In [102]:
all_texts_processed = []

n = 0
for text in set_tr['text']:
    # combine sentences
    txt = ' '.join(text)
    # remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    txt2 = re.sub(u'\u2014','',txt)
    txt3 = txt2.translate(translator)
    # split text into words
    tokens = word_tokenizer.tokenize(txt3.lower())
    # remove stop words
    nostop_tokens = [i for i in tokens if i not in all_stopw]
    # stem words
    stemmed = [p_stemmer.stem(i) for i in nostop_tokens]
    # append to processed texts
    all_texts_processed.append( stemmed )
    if n == 0:
        # uncomment to inspect the first processed article
        # print(txt)
        # print(tokens)
        # print(nostop_tokens)
        pass
    n += 1
    # if n == 5:
    #     break
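
The cell above relies on names defined earlier in the notebook (`word_tokenizer`, `p_stemmer`, `all_stopw`). To show the same sequence of steps in a form that runs on its own, here is a simplified sketch that uses only the standard library. The tiny stopword set and the regex tokenizer are stand-ins, and stemming is left out, so it will not reproduce the notebook's tokens exactly:

```python
import re
import string

TOY_STOPWORDS = {"the", "a", "is", "of", "and"}  # tiny illustrative set
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(sentences, stopwords=TOY_STOPWORDS):
    """Join sentences, strip punctuation, lowercase, tokenize, drop stopwords."""
    text = " ".join(sentences)
    text = text.replace("\u2014", "")      # drop em dashes, as in the cell above
    text = text.translate(_PUNCT_TABLE)    # drop ASCII punctuation
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]

tokens = preprocess(["The design of the app is simple.", "Design matters!"])
```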

In [103]:
# use a with-block so the file is closed (and fully written) after the dump
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_processedtexts','wb') as flda_processedtexts:
    pickle.dump(all_texts_processed, flda_processedtexts)
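
A pickled object can later be restored with `pickle.load`. The following round-trip sketch uses a temporary file and a made-up token list rather than the notebook's real path and data:

```python
import os
import pickle
import tempfile

# Made-up stand-in for all_texts_processed
example_texts = [["design", "use", "app"], ["peopl", "time", "work"]]

# Write to a temporary location (hypothetical path, not the notebook's)
tmp_path = os.path.join(tempfile.mkdtemp(), "lda_processedtexts_demo")
with open(tmp_path, "wb") as f_out:
    pickle.dump(example_texts, f_out)

# The file is closed when the with-block exits, so it is safe to read back
with open(tmp_path, "rb") as f_in:
    restored_texts = pickle.load(f_in)
```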

In [104]:
# Make document-term matrix
dictionary = corpora.Dictionary(all_texts_processed)
# Convert to bag-of-words
corpus = [dictionary.doc2bow(text) for text in all_texts_processed]

[(0, 1), (1, 2), (2, 40), (3, 11), (4, 14), (5, 19), (6, 1), (7, 5), (8, 9), (9, 8), (10, 1), (11, 13), (12, 29), (13, 50), (14, 12), (15, 1), (16, 1), (17, 2), (18, 2), (19, 5), (20, 11), (21, 5), (22, 3), (23, 10), (24, 1), (25, 2), (26, 18), (27, 1), (28, 1), (29, 4), (30, 1), (31, 4), (32, 2), (33, 4), (34, 1), (35, 1), (36, 1), (37, 1), (38, 7), (39, 23), (40, 13), (41, 5), (42, 3), (43, 30), (44, 4), (45, 4), (46, 4), (47, 8), (48, 1), (49, 44), (50, 2), (51, 1), (52, 5), (53, 1), (54, 9), (55, 1), (56, 4), (57, 1), (58, 3), (59, 16), (60, 4), (61, 1), (62, 1), (63, 2), (64, 9), (65, 1), (66, 17), (67, 23), (68, 10), (69, 3), (70, 1), (71, 1), (72, 7), (73, 1), (74, 1), (75, 2), (76, 5), (77, 1), (78, 8), (79, 1), (80, 1), (81, 7), (82, 2), (83, 2), (84, 2), (85, 12), (86, 2), (87, 2), (88, 1), (89, 1), (90, 2), (91, 24), (92, 11), (93, 3), (94, 6), (95, 1), (96, 4), (97, 1), (98, 12), (99, 12), (100, 4), (101, 1), (102, 8), (103, 3), (104, 2), (105, 3), (106, 2), (107, 18), (108, 7), (109, 2), (110, 1), (111, 1), (112, 5), (113, 9), (114, 9), (115, 8), (116, 7), (117, 2), (118, 7), (119, 1), (120, 5), (121, 2), (122, 3), (123, 1), (124, 3), (125, 1), (126, 1), (127, 2), (128, 1), (129, 3), (130, 1), (131, 1), (132, 1), (133, 7), (134, 9), (135, 5), (136, 3), (137, 2), (138, 1), (139, 6), (140, 1), (141, 1), (142, 2), (143, 3), (144, 1), (145, 20), (146, 27), (147, 12), (148, 1), (149, 1), (150, 9), (151, 1), (152, 2), (153, 2), (154, 1), (155, 2), (156, 1), (157, 11), (158, 1), (159, 1), (160, 9), (161, 4), (162, 6), (163, 2), (164, 8), (165, 5), (166, 1), (167, 3), (168, 1), (169, 2), (170, 6), (171, 4), (172, 2), (173, 2), (174, 1), (175, 38), (176, 3), (177, 4), (178, 6), (179, 1), (180, 2), (181, 6), (182, 10), (183, 7), (184, 10), (185, 1), (186, 1), (187, 9), (188, 3), (189, 2), (190, 4), (191, 3), (192, 7), (193, 5), (194, 2), (195, 2), (196, 2), (197, 1), (198, 6), (199, 2), (200, 30), (201, 2), (202, 24), (203, 6), (204, 4), (205, 3), (206, 28), 
(207, 13), (208, 1), (209, 11), (210, 2), (211, 1), (212, 2), (213, 9), (214, 3), (215, 2), (216, 1), (217, 2), (218, 23), (219, 3), (220, 2), (221, 1), (222, 8), (223, 5), (224, 2), (225, 1), (226, 8), (227, 1), (228, 1), (229, 1), (230, 2), (231, 2), (232, 1), (233, 1), (234, 1), (235, 2), (236, 2), (237, 1), (238, 2), (239, 5), (240, 2), (241, 1), (242, 7), (243, 2), (244, 1), (245, 1), (246, 9), (247, 2), (248, 1), (249, 2), (250, 2), (251, 5), (252, 1), (253, 3), (254, 1), (255, 2), (256, 7), (257, 10), (258, 4), (259, 1), (260, 4), (261, 4), (262, 2), (263, 5), (264, 1), (265, 2), (266, 22), (267, 1), (268, 1), (269, 1), (270, 1), (271, 10), (272, 4), (273, 2), (274, 8), (275, 13), (276, 4), (277, 1), (278, 1), (279, 6), (280, 2), (281, 2), (282, 4), (283, 1), (284, 2), (285, 2), (286, 2), (287, 4), (288, 1), (289, 1), (290, 2), (291, 5), (292, 1), (293, 1), (294, 1), (295, 3), (296, 4), (297, 1), (298, 1), (299, 2), (300, 1), (301, 1), (302, 2), (303, 1), (304, 1), (305, 1), (306, 4), (307, 1), (308, 2), (309, 5), (310, 10), (311, 1), (312, 1), (313, 1), (314, 3), (315, 6), (316, 2), (317, 1), (318, 4), (319, 1), (320, 2), (321, 20), (322, 1), (323, 3), (324, 1), (325, 4), (326, 4), (327, 1), (328, 1), (329, 1), (330, 4), (331, 1), (332, 1), (333, 2), (334, 1), (335, 1), (336, 4), (337, 1), (338, 7), (339, 6), (340, 5), (341, 1), (342, 5), (343, 4), (344, 8), (345, 1), (346, 1), (347, 2), (348, 2), (349, 1), (350, 2), (351, 1), (352, 8), (353, 1), (354, 1), (355, 1), (356, 1), (357, 4), (358, 1), (359, 4), (360, 5), (361, 11), (362, 2), (363, 1), (364, 9), (365, 1), (366, 1), (367, 15), (368, 4), (369, 3), (370, 2), (371, 3), (372, 2), (373, 13), (374, 13), (375, 2), (376, 1), (377, 1), (378, 1), (379, 2), (380, 10), (381, 1), (382, 1), (383, 3), (384, 6), (385, 1), (386, 1), (387, 1), (388, 1), (389, 1), (390, 4), (391, 1), (392, 1), (393, 1), (394, 1), (395, 1), (396, 4), (397, 7), (398, 1), (399, 1), (400, 7), (401, 4), (402, 1), (403, 1), (404, 1), (405, 
12), (406, 2), (407, 2), (408, 1), (409, 1), (410, 7), (411, 4), (412, 1), (413, 1), (414, 15), (415, 1), (416, 1), (417, 2), (418, 1), (419, 3), (420, 2), (421, 2), (422, 2), (423, 3), (424, 6), (425, 2), (426, 2), (427, 1), (428, 1), (429, 20), (430, 1), (431, 1), (432, 1), (433, 2), (434, 2), (435, 1), (436, 1), (437, 3), (438, 1), (439, 5), (440, 1), (441, 3), (442, 1), (443, 1), (444, 1), (445, 1), (446, 1), (447, 1), (448, 1), (449, 5), (450, 1), (451, 1), (452, 3), (453, 1), (454, 4), (455, 1), (456, 1), (457, 1), (458, 1), (459, 2), (460, 1), (461, 2), (462, 3), (463, 1), (464, 1), (465, 3), (466, 2), (467, 1), (468, 2), (469, 3), (470, 1), (471, 1), (472, 2), (473, 1), (474, 1), (475, 1), (476, 1), (477, 1), (478, 3), (479, 2), (480, 5), (481, 10), (482, 1), (483, 1), (484, 2), (485, 1), (486, 3), (487, 1), (488, 1), (489, 5), (490, 3), (491, 3), (492, 1), (493, 1), (494, 1), (495, 2), (496, 2), (497, 1), (498, 1), (499, 2), (500, 5), (501, 7), (502, 7), (503, 1), (504, 1), (505, 3), (506, 1), (507, 1), (508, 4), (509, 6), (510, 4), (511, 11), (512, 1), (513, 6), (514, 1), (515, 9), (516, 6), (517, 2), (518, 5), (519, 1), (520, 9), (521, 4), (522, 4), (523, 5), (524, 8), (525, 1), (526, 1), (527, 4), (528, 6), (529, 6), (530, 1), (531, 4), (532, 2), (533, 2), (534, 1), (535, 7), (536, 2), (537, 4), (538, 1), (539, 1), (540, 2), (541, 3), (542, 2), (543, 2), (544, 2), (545, 4), (546, 4), (547, 4), (548, 2), (549, 2), (550, 2), (551, 1), (552, 2), (553, 1), (554, 2), (555, 4), (556, 1), (557, 3), (558, 2), (559, 2), (560, 1), (561, 11), (562, 1), (563, 1), (564, 1), (565, 2), (566, 1), (567, 1), (568, 2), (569, 1), (570, 1), (571, 1), (572, 2), (573, 1), (574, 2), (575, 1), (576, 4), (577, 4), (578, 1), (579, 1), (580, 12), (581, 1), (582, 1), (583, 1), (584, 1), (585, 1), (586, 4), (587, 1), (588, 9), (589, 10), (590, 2), (591, 1), (592, 1), (593, 1), (594, 1), (595, 2), (596, 5), (597, 1), (598, 1), (599, 2), (600, 1), (601, 1), (602, 4), (603, 2), (604, 
1), (605, 1), (606, 1), (607, 2), (608, 6), (609, 1), (610, 1), (611, 1), (612, 4), (613, 1), (614, 2), (615, 1), (616, 1), (617, 1), (618, 1), (619, 16), (620, 8), (621, 1), (622, 2), (623, 6), (624, 2), (625, 3), (626, 1), (627, 2), (628, 2), (629, 1), (630, 5), (631, 1), (632, 1), (633, 1), (634, 1), (635, 1), (636, 2), (637, 2), (638, 2), (639, 2), (640, 1), (641, 9), (642, 3), (643, 1), (644, 1), (645, 1), (646, 5), (647, 1), (648, 1), (649, 1), (650, 1), (651, 3), (652, 1), (653, 4), (654, 1), (655, 4), (656, 2), (657, 11), (658, 8), (659, 1), (660, 1), (661, 2), (662, 1), (663, 1), (664, 1), (665, 2), (666, 2), (667, 1), (668, 2), (669, 1), (670, 1), (671, 1), (672, 9), (673, 5), (674, 1), (675, 2), (676, 1), (677, 2), (678, 1), (679, 1), (680, 1), (681, 1), (682, 4), (683, 8), (684, 1), (685, 1), (686, 1), (687, 1), (688, 1), (689, 1), (690, 4), (691, 4), (692, 3), (693, 1), (694, 1), (695, 2), (696, 1), (697, 1), (698, 2), (699, 1), (700, 2), (701, 1), (702, 1), (703, 1), (704, 1), (705, 1), (706, 4), (707, 1), (708, 2), (709, 3), (710, 1), (711, 1), (712, 2), (713, 2), (714, 1), (715, 1), (716, 1), (717, 1), (718, 1), (719, 1), (720, 4), (721, 1), (722, 1), (723, 1), (724, 4), (725, 1), (726, 1), (727, 1), (728, 3), (729, 1), (730, 1), (731, 3), (732, 1), (733, 3), (734, 1), (735, 4), (736, 2), (737, 2), (738, 1), (739, 3), (740, 4), (741, 1), (742, 2), (743, 1), (744, 2), (745, 1), (746, 2), (747, 1), (748, 1), (749, 2), (750, 2), (751, 1), (752, 1), (753, 1), (754, 1), (755, 1), (756, 1), (757, 1), (758, 3), (759, 3), (760, 1), (761, 1), (762, 2), (763, 1), (764, 2), (765, 1), (766, 1), (767, 3), (768, 1), (769, 1), (770, 2), (771, 1), (772, 5), (773, 3), (774, 1), (775, 2), (776, 1), (777, 3), (778, 3), (779, 10), (780, 4), (781, 1), (782, 3), (783, 8), (784, 1), (785, 1), (786, 1), (787, 1), (788, 1), (789, 1), (790, 2), (791, 1), (792, 1), (793, 1), (794, 1), (795, 1), (796, 1), (797, 1), (798, 1), (799, 2), (800, 1), (801, 2), (802, 1), (803, 7), 
(804, 2), (805, 1), (806, 4), (807, 9), (808, 5), (809, 2), (810, 3), (811, 1), (812, 2), (813, 1), (814, 8), (815, 1), (816, 2), (817, 2), (818, 1), (819, 4), (820, 3), (821, 2), (822, 3), (823, 1), (824, 1), (825, 10), (826, 1), (827, 1), (828, 1), (829, 1), (830, 2), (831, 1), (832, 1), (833, 1), (834, 2), (835, 3), (836, 1), (837, 1), (838, 1), (839, 5), (840, 7), (841, 3), (842, 1), (843, 1), (844, 1), (845, 1), (846, 1), (847, 2), (848, 2), (849, 2), (850, 1), (851, 1), (852, 1), (853, 1), (854, 4), (855, 1), (856, 1), (857, 1), (858, 1), (859, 1), (860, 3), (861, 3), (862, 1), (863, 1), (864, 1), (865, 5), (866, 1), (867, 1), (868, 3), (869, 1), (870, 1), (871, 1), (872, 1), (873, 2), (874, 1), (875, 1), (876, 1), (877, 1), (878, 1), (879, 1), (880, 1), (881, 1), (882, 2), (883, 4), (884, 2), (885, 4), (886, 1), (887, 1), (888, 2), (889, 1), (890, 1), (891, 2), (892, 1), (893, 1), (894, 1), (895, 1), (896, 1), (897, 1), (898, 1), (899, 1), (900, 1), (901, 1), (902, 1), (903, 1), (904, 3), (905, 2), (906, 1), (907, 1), (908, 3), (909, 2), (910, 2), (911, 1), (912, 1), (913, 2), (914, 1), (915, 4), (916, 1), (917, 1), (918, 1), (919, 3), (920, 1), (921, 6), (922, 1), (923, 1), (924, 1), (925, 2), (926, 2), (927, 1), (928, 3), (929, 1), (930, 2), (931, 1), (932, 1), (933, 1), (934, 1), (935, 4), (936, 2), (937, 1), (938, 2), (939, 1), (940, 1), (941, 2), (942, 1), (943, 2), (944, 2), (945, 1), (946, 4), (947, 1), (948, 3), (949, 1), (950, 2), (951, 1), (952, 1), (953, 1), (954, 1), (955, 2), (956, 2), (957, 1), (958, 2), (959, 3), (960, 1), (961, 1), (962, 1), (963, 2), (964, 4), (965, 2), (966, 3), (967, 1), (968, 1), (969, 1), (970, 3), (971, 1), (972, 1), (973, 3), (974, 1), (975, 1), (976, 1), (977, 1), (978, 2), (979, 1), (980, 3), (981, 2), (982, 1), (983, 1), (984, 1), (985, 3), (986, 3), (987, 1), (988, 1), (989, 1), (990, 1), (991, 2), (992, 1), (993, 1), (994, 1), (995, 1), (996, 1), (997, 1), (998, 1), (999, 1), (1000, 1), (1001, 1), (1002, 1), 
(1003, 3), (1004, 1), (1005, 6), (1006, 1), (1007, 6), (1008, 2), (1009, 1), (1010, 1), (1011, 2), (1012, 1), (1013, 3), (1014, 2), (1015, 2), (1016, 1), (1017, 1), (1018, 5), (1019, 1), (1020, 1), (1021, 1), (1022, 1), (1023, 1), (1024, 1), (1025, 1), (1026, 1), (1027, 2), (1028, 1), (1029, 1), (1030, 1), (1031, 1), (1032, 1), (1033, 2), (1034, 1), (1035, 1), (1036, 1), (1037, 1), (1038, 1), (1039, 1), (1040, 2), (1041, 1), (1042, 1), (1043, 2), (1044, 11), (1045, 5), (1046, 1), (1047, 1), (1048, 2), (1049, 2), (1050, 1), (1051, 1), (1052, 4), (1053, 3), (1054, 1), (1055, 6), (1056, 1), (1057, 10), (1058, 1), (1059, 1), (1060, 1), (1061, 1), (1062, 1), (1063, 1), (1064, 1), (1065, 2), (1066, 1), (1067, 1), (1068, 1), (1069, 1), (1070, 1), (1071, 1), (1072, 1), (1073, 1), (1074, 1), (1075, 1), (1076, 6), (1077, 1), (1078, 1), (1079, 1), (1080, 1), (1081, 1), (1082, 1), (1083, 3), (1084, 1), (1085, 1), (1086, 3), (1087, 1), (1088, 1), (1089, 1), (1090, 1), (1091, 1), (1092, 3), (1093, 1), (1094, 3), (1095, 1), (1096, 1), (1097, 1), (1098, 2), (1099, 5), (1100, 5), (1101, 5), (1102, 1), (1103, 1), (1104, 1), (1105, 1), (1106, 1), (1107, 3), (1108, 7), (1109, 1), (1110, 1), (1111, 1), (1112, 1), (1113, 1), (1114, 1), (1115, 2), (1116, 2), (1117, 2), (1118, 1), (1119, 1), (1120, 1), (1121, 1), (1122, 1), (1123, 1), (1124, 1), (1125, 3), (1126, 1), (1127, 1), (1128, 2), (1129, 1), (1130, 1), (1131, 1), (1132, 2), (1133, 2), (1134, 1), (1135, 1), (1136, 2), (1137, 2), (1138, 2), (1139, 2), (1140, 1), (1141, 1), (1142, 1), (1143, 1), (1144, 2), (1145, 2), (1146, 1), (1147, 1), (1148, 1), (1149, 1), (1150, 1), (1151, 4), (1152, 1), (1153, 1), (1154, 1), (1155, 1), (1156, 1), (1157, 1), (1158, 1), (1159, 1), (1160, 2), (1161, 1), (1162, 2), (1163, 1), (1164, 1), (1165, 1), (1166, 1), (1167, 1), (1168, 1), (1169, 2), (1170, 2), (1171, 4), (1172, 3), (1173, 1), (1174, 1), (1175, 1), (1176, 2), (1177, 2), (1178, 1), (1179, 1), (1180, 1), (1181, 3), (1182, 1), (1183, 1), (1184, 
1), (1185, 1), (1186, 1), (1187, 1), (1188, 1), (1189, 2), (1190, 1), (1191, 1), (1192, 4), (1193, 1), (1194, 1), (1195, 1), (1196, 2), (1197, 1), (1198, 1), (1199, 1), (1200, 1), (1201, 1), (1202, 1), (1203, 1), (1204, 1), (1205, 1), (1206, 2), (1207, 2), (1208, 1), (1209, 1), (1210, 1), (1211, 1), (1212, 1), (1213, 2), (1214, 1), (1215, 3), (1216, 1), (1217, 1), (1218, 9), (1219, 1), (1220, 1), (1221, 1), (1222, 1), (1223, 1), (1224, 3), (1225, 1), (1226, 1), (1227, 1), (1228, 1), (1229, 1), (1230, 1), (1231, 1), (1232, 1), (1233, 1), (1234, 1), (1235, 1), (1236, 1), (1237, 1), (1238, 3), (1239, 1), (1240, 1), (1241, 1), (1242, 1), (1243, 1), (1244, 1), (1245, 1), (1246, 1), (1247, 1), (1248, 1), (1249, 1), (1250, 1), (1251, 1), (1252, 1), (1253, 1), (1254, 1), (1255, 2), (1256, 1), (1257, 2), (1258, 2), (1259, 1), (1260, 1), (1261, 1), (1262, 1), (1263, 1), (1264, 1), (1265, 1), (1266, 2), (1267, 2), (1268, 1), (1269, 1), (1270, 1), (1271, 2), (1272, 1), (1273, 1), (1274, 1), (1275, 1), (1276, 1), (1277, 1), (1278, 1), (1279, 1), (1280, 1), (1281, 1), (1282, 1), (1283, 1), (1284, 1), (1285, 1), (1286, 1), (1287, 1), (1288, 1), (1289, 1), (1290, 1), (1291, 1), (1292, 1), (1293, 2), (1294, 1), (1295, 1), (1296, 1), (1297, 1), (1298, 1), (1299, 1), (1300, 1), (1301, 1), (1302, 1), (1303, 8), (1304, 1), (1305, 1), (1306, 1), (1307, 4), (1308, 2), (1309, 3), (1310, 1), (1311, 2), (1312, 1), (1313, 1), (1314, 1), (1315, 4), (1316, 1), (1317, 3), (1318, 1), (1319, 1), (1320, 1), (1321, 1), (1322, 1), (1323, 3), (1324, 3), (1325, 2), (1326, 1), (1327, 1), (1328, 1), (1329, 1), (1330, 1), (1331, 1), (1332, 1), (1333, 1), (1334, 1), (1335, 1), (1336, 1), (1337, 1), (1338, 1), (1339, 3), (1340, 1), (1341, 1), (1342, 1), (1343, 1), (1344, 1), (1345, 1), (1346, 1), (1347, 1), (1348, 3), (1349, 1), (1350, 1), (1351, 1), (1352, 1), (1353, 1), (1354, 1), (1355, 1), (1356, 1), (1357, 1), (1358, 1), (1359, 1), (1360, 1), (1361, 1), (1362, 1), (1363, 1), (1364, 1), (1365, 1), 
(1366, 1), (1367, 5), (1368, 1), (1369, 1), (1370, 1), (1371, 1), (1372, 1), (1373, 1), (1374, 1), (1375, 1), (1376, 1), (1377, 1), (1378, 1), (1379, 1), (1380, 1), (1381, 1), (1382, 1), (1383, 1), (1384, 1), (1385, 2), (1386, 1), (1387, 1), (1388, 2), (1389, 1), (1390, 1), (1391, 1), (1392, 1), (1393, 2), (1394, 1), (1395, 1), (1396, 1), (1397, 1), (1398, 1), (1399, 1), (1400, 1), (1401, 1), (1402, 1), (1403, 1), (1404, 1), (1405, 1), (1406, 1), (1407, 1), (1408, 1), (1409, 1), (1410, 1), (1411, 1), (1412, 1), (1413, 1), (1414, 1), (1415, 1), (1416, 1), (1417, 1), (1418, 1), (1419, 1), (1420, 1), (1421, 1), (1422, 1), (1423, 1), (1424, 1), (1425, 1), (1426, 1), (1427, 1), (1428, 2), (1429, 1), (1430, 1), (1431, 1), (1432, 1), (1433, 1), (1434, 1), (1435, 1), (1436, 1), (1437, 1), (1438, 1), (1439, 1), (1440, 1), (1441, 1), (1442, 1), (1443, 1), (1444, 1), (1445, 2), (1446, 1), (1447, 1), (1448, 3), (1449, 1), (1450, 1), (1451, 2), (1452, 1), (1453, 1), (1454, 1), (1455, 1), (1456, 1), (1457, 1), (1458, 1), (1459, 1), (1460, 1), (1461, 1), (1462, 3), (1463, 3), (1464, 1), (1465, 1), (1466, 1), (1467, 1), (1468, 1), (1469, 1), (1470, 1), (1471, 1), (1472, 1), (1473, 1), (1474, 1), (1475, 1), (1476, 1), (1477, 1), (1478, 1)]
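
The list above has the shape of a bag-of-words vector: pairs of (token id, count). To make that shape concrete, here is a pure-Python imitation. It is not gensim's implementation, and the helper name `toy_doc2bow` and its id assignment scheme are assumptions made for illustration:

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Imitate the (token_id, count) output shape; assigns new ids as needed."""
    for tok in tokens:
        if tok not in token2id:
            token2id[tok] = len(token2id)
    counts = Counter(token2id[tok] for tok in tokens)
    return sorted(counts.items())

token2id = {}
bow = toy_doc2bow(["design", "use", "design", "app"], token2id)
# bow == [(0, 2), (1, 1), (2, 1)]: "design" (id 0) occurs twice
```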

# save dictionary and corpus
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_dictionary','wb') as flda_dictionary:
    pickle.dump(dictionary, flda_dictionary)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_corpus','wb') as flda_corpus:
    pickle.dump(corpus, flda_corpus)

# Run LDA
# choose 10 topics, 1 pass for initial try and time it

tic = timeit.default_timer()
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=1)
toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')
61.63819993999641 seconds elapsed
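
The tic/toc timing pattern recurs throughout this notebook. One way to avoid repeating it is a small context manager; the name `timed` is hypothetical and the workload below is a toy stand-in for an LDA run:

```python
import timeit
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    tic = timeit.default_timer()
    try:
        yield
    finally:
        toc = timeit.default_timer()
        print(f"{label}: {toc - tic:.2f} seconds elapsed")

with timed("toy workload"):
    total = sum(i * i for i in range(1000))
```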

print(ldamodel.print_topics(num_topics=3, num_words=3))

[(7, '0.006*"like" + 0.006*"peopl" + 0.006*"time"'), (3, '0.006*"one" + 0.006*"like" + 0.006*"work"'), (0, '0.007*"design" + 0.006*"use" + 0.006*"one"')]

In [85]:
# # Run LDA
# # choose 100 topics, 20 passes

# tic = timeit.default_timer()
# ldamodel4 = gensim.models.ldamodel.LdaModel(corpus, num_topics=100, id2word = dictionary, passes=20)
# toc = timeit.default_timer()
# print(str(toc - tic) + ' seconds elapsed')
# # current: with old dictionary/corpus without excluding nltk stopwords

In [86]:
# requires ldamodel4 from the training cell above (currently commented out)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_100topic20pass','wb') as flda_100topic20pass:
    pickle.dump(ldamodel4, flda_100topic20pass)

In [90]:
# Compare topic outputs of LDA models 1-4
print(ldamodel.print_topics( num_topics=10, num_words=10))
# print(ldamodel2.print_topics(num_topics=10, num_words=10))
# print(ldamodel3.print_topics(num_topics=10, num_words=10))
# print(ldamodel4.print_topics(num_topics=10, num_words=10))

[(0, '0.009*"design" + 0.008*"it’" + 0.007*"use" + 0.007*"can" + 0.006*"like" + 0.006*"get" + 0.006*"will" + 0.005*"make" + 0.005*"work" + 0.005*"time"'), (1, '0.006*"one" + 0.006*"work" + 0.006*"can" + 0.005*"will" + 0.005*"peopl" + 0.004*"it’" + 0.004*"make" + 0.004*"use" + 0.004*"just" + 0.004*"de"'), (2, '0.016*"que" + 0.014*"de" + 0.011*"fuck" + 0.011*"o" + 0.011*"e" + 0.006*"não" + 0.006*"é" + 0.005*"um" + 0.004*"para" + 0.004*"can"'), (3, '0.007*"product" + 0.007*"peopl" + 0.007*"get" + 0.006*"design" + 0.006*"will" + 0.005*"time" + 0.005*"one" + 0.005*"just" + 0.005*"can" + 0.005*"like"'), (4, '0.007*"peopl" + 0.006*"one" + 0.006*"can" + 0.005*"will" + 0.005*"make" + 0.005*"work" + 0.005*"time" + 0.004*"new" + 0.004*"thing" + 0.004*"like"'), (5, '0.008*"will" + 0.007*"thing" + 0.007*"one" + 0.007*"can" + 0.006*"it’" + 0.006*"peopl" + 0.005*"get" + 0.005*"time" + 0.005*"don’t" + 0.005*"make"'), (6, '0.009*"make" + 0.009*"work" + 0.008*"can" + 0.007*"one" + 0.007*"time" + 0.007*"want" + 0.006*"peopl" + 0.006*"it’" + 0.006*"will" + 0.006*"get"'), (7, '0.010*"time" + 0.007*"can" + 0.006*"one" + 0.005*"get" + 0.005*"like" + 0.005*"it’" + 0.004*"peopl" + 0.004*"make" + 0.004*"thing" + 0.004*"day"'), (8, '0.010*"like" + 0.007*"it’" + 0.006*"just" + 0.006*"time" + 0.005*"can" + 0.005*"will" + 0.005*"peopl" + 0.005*"get" + 0.005*"thing" + 0.005*"one"'), (9, '0.008*"peopl" + 0.007*"can" + 0.006*"like" + 0.006*"one" + 0.005*"time" + 0.005*"work" + 0.005*"will" + 0.005*"just" + 0.004*"get" + 0.004*"don’t"')]

Combine nltk and stop_words lists of stopwords -- moved to top

In [100]:
# from nltk.corpus import stopwords
# stopw_en = stopwords.words('english')
# print(stopw_en)
# print(stop_en)
# print(len(stopw_en))
# print(len(stop_en))
# all_stopw = set(stopw_en) | set(stop_en)
# print(len(all_stopw))

In [110]:
# Run LDA
# choose 10 topics, 20 passes after removing more stopwords

tic = timeit.default_timer()
ldamodel5 = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=20)
toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')

765.3832683630026 seconds elapsed

In [114]:
print(ldamodel.print_topics( num_topics=10, num_words=5))

[(0, '0.007*"design" + 0.006*"use" + 0.006*"one" + 0.006*"like" + 0.005*"app"'), (1, '0.006*"it’" + 0.006*"work" + 0.006*"peopl" + 0.006*"time" + 0.005*"make"'), (2, '0.008*"like" + 0.007*"time" + 0.007*"it’" + 0.006*"use" + 0.006*"get"'), (3, '0.006*"one" + 0.006*"like" + 0.006*"work" + 0.005*"use" + 0.005*"make"'), (4, '0.010*"design" + 0.007*"like" + 0.005*"it’" + 0.005*"time" + 0.005*"peopl"'), (5, '0.009*"peopl" + 0.008*"time" + 0.007*"it’" + 0.007*"like" + 0.006*"get"'), (6, '0.007*"peopl" + 0.005*"like" + 0.005*"it’" + 0.005*"one" + 0.005*"make"'), (7, '0.006*"like" + 0.006*"peopl" + 0.006*"time" + 0.006*"one" + 0.005*"make"'), (8, '0.005*"one" + 0.004*"it’" + 0.004*"time" + 0.004*"que" + 0.003*"peopl"'), (9, '0.007*"one" + 0.007*"want" + 0.006*"work" + 0.006*"get" + 0.006*"make"')]
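
To quantify how much these topics overlap, one option is to count how many topics each top word appears in. The sketch below parses three of the topic strings printed above (topics 0, 3, and 7); the helper name `words_in_topic` is illustrative:

```python
import re
from collections import Counter

# Topic strings copied from the output above (topics 0, 3 and 7)
topics = [
    (0, '0.007*"design" + 0.006*"use" + 0.006*"one" + 0.006*"like" + 0.005*"app"'),
    (3, '0.006*"one" + 0.006*"like" + 0.006*"work" + 0.005*"use" + 0.005*"make"'),
    (7, '0.006*"like" + 0.006*"peopl" + 0.006*"time" + 0.006*"one" + 0.005*"make"'),
]

def words_in_topic(topic_string):
    """Extract the quoted words from a print_topics-style string."""
    return re.findall(r'"([^"]+)"', topic_string)

# Count the number of topics in which each word appears
topic_counts = Counter()
for _, topic_string in topics:
    topic_counts.update(set(words_in_topic(topic_string)))

# Words present in every one of the sampled topics
shared = sorted(w for w, n in topic_counts.items() if n == len(topics))
```

For these three topics, `shared` contains "like" and "one", which illustrates how little the top words distinguish one topic from another.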

Generate a "common word" list to ignore in LDA

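Several words (for example "like", "time", and "peopl") appear in most of the topics above, so they do little to distinguish one topic from another. These words should be treated as stopwords and ignored. To do this, build a list of all words that appear in at least 60% of the articles.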

In [151]:
# flatten the list of token lists and collect the unique words
all_texts_flattened = [item for sublist in all_texts_processed for item in sublist]
flatten_uniq = set(all_texts_flattened)

# convert each text to a set once so that membership tests are fast
all_texts_sets = [set(text) for text in all_texts_processed]

commonwords = []  # words appearing in at least 60% of texts
wordlist = []     # all remaining words
tic = timeit.default_timer()
for word in flatten_uniq:
    # number of texts that contain this word
    n = sum(1 for text in all_texts_sets if word in text)
    frac = n / len(all_texts_processed)
    # assumed intent: common words go to commonwords, all others to wordlist
    if frac >= 0.6:
        commonwords.append(word)
    else:
        wordlist.append(word)

toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')
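
The same split can also be computed in a single pass over the documents by counting document frequencies. This is a sketch on toy data; the function name `split_by_doc_frequency` and the example texts are made up for illustration:

```python
from collections import Counter

def split_by_doc_frequency(texts, threshold=0.6):
    """Split the vocabulary into words at or above the threshold and the rest."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text))  # count each word once per document
    n_docs = len(texts)
    common = sorted(w for w, n in doc_freq.items() if n / n_docs >= threshold)
    rest = sorted(w for w, n in doc_freq.items() if n / n_docs < threshold)
    return common, rest

toy_texts = [
    ["like", "time", "design"],
    ["like", "peopl", "time"],
    ["like", "food"],
    ["music", "game"],
    ["like", "time", "work"],
]
common, rest = split_by_doc_frequency(toy_texts)
```

Here "like" appears in 4 of 5 texts and "time" in 3 of 5, so both meet the 60% threshold. gensim's `Dictionary.filter_extremes` (via its `no_above` parameter) offers related filtering at the dictionary level; whether it matches the intended behaviour here would need checking.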


In [227]:
commonwords_2 = [i.strip('”“’‘') for i in commonwords]

In [164]:
# all_stopw2 = set(all_stopw_stem) | set(commonwords)
# print(len(all_stopw2))

# all_stopw_stem = [p_stemmer.stem(i) for i in all_stopw]
# print(all_stopw)
# print(all_stopw_stem)

# all_stopw2 = set(all_stopw_stem) | set(commonwords)
# print(len(all_stopw2))

In [245]:
# use with-blocks so each file is closed (and fully written) after the dump
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/wordlist','wb') as fwordlist:
    pickle.dump(wordlist, fwordlist)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/commonwords','wb') as fcommonwords:
    pickle.dump(commonwords, fcommonwords)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/commonwords2','wb') as fcommonwords2:
    pickle.dump(commonwords_2, fcommonwords2)

In [160]:
# print(tmp)

In [230]:
# REDO LDA with common words removed

all_texts_processed_new = []

tic = timeit.default_timer()

n = 0
for text in set_tr['text']:
    # combine sentences
    txt = ' '.join(text)
    translator = str.maketrans('', '', string.punctuation)
    txt2 = re.sub(u'\u2014','',txt) # remove em dashes
    txt3 = re.sub(r'\d+', '', txt2) # remove digits
    txt4 = txt3.translate(translator) # remove punctuation
    # split text into words
    tokens = word_tokenizer.tokenize(txt4.lower())
    # strip single and double quotes from ends of words
    tokens_strip = [i.strip('”“’‘') for i in tokens]
    # keep only english words
    tokens_en = [i for i in tokens_strip if i in en_words]
    # remove nltk/stop_word stop words
    nostop_tokens = [i for i in tokens_en if i not in all_stopw]
    # strip single and double quotes from ends of words
    nostop_strip = [i.strip('”“’‘') for i in nostop_tokens]
    # stem words
    stemmed = [p_stemmer.stem(i) for i in nostop_strip]
    # strip single and double quotes from ends of words
    stemmed_strip = [i.strip('”“’‘') for i in stemmed]
    # stem words a second time; stemming is not guaranteed to be idempotent,
    # so this pass can shorten some stems further
    stemmed2 = [p_stemmer.stem(i) for i in stemmed_strip]
    # strip single and double quotes from ends of words
    stemmed2_strip = [i.strip('”“’‘') for i in stemmed2]
    # remove common words post-stemming
    stemmed_nocommon = [i for i in stemmed2_strip if i not in commonwords_2]
    # append to processed texts
    all_texts_processed_new.append( stemmed_nocommon )
    if n == 0:
        # uncomment to inspect the first processed article
        # print(txt)
        # print(tokens)
        # print(nostop_tokens)
        pass
    n += 1
    # if n == 5:
    #     break

toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')

126.21320975600975 seconds elapsed

Save all_texts_processed

In [231]:
# flda_processedtexts_new = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_processedtexts_new','wb')
# pickle.dump(all_texts_processed_new, flda_processedtexts_new)
# # above: without filtering for english words

with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_processedtexts_new2','wb') as flda_processedtexts_new2:
    pickle.dump(all_texts_processed_new, flda_processedtexts_new2)
# above: with filtering for english words

Make document-term matrix

In [232]:
# dictionary_new = corpora.Dictionary(all_texts_processed_new)
# # Convert to bag-of-words
# corpus_new = [dictionary_new.doc2bow(text) for text in all_texts_processed_new]
# print(corpus_new[0])
# # above: without filtering for english words

# Make document-term matrix
dictionary_new2 = corpora.Dictionary(all_texts_processed_new)
# Convert to bag-of-words
corpus_new2 = [dictionary_new2.doc2bow(text) for text in all_texts_processed_new]
# above: with filtering for english words

Save new dictionary and corpus

In [233]:
# flda_dictionary_new = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_dictionary_new','wb')
# flda_corpus_new = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_corpus_new','wb')
# pickle.dump(dictionary_new, flda_dictionary_new)
# pickle.dump(corpus_new, flda_corpus_new)
# # above: without filtering for english words

with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_dictionary_new2','wb') as flda_dictionary_new2:
    pickle.dump(dictionary_new2, flda_dictionary_new2)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_corpus_new2','wb') as flda_corpus_new2:
    pickle.dump(corpus_new2, flda_corpus_new2)
# above: with filtering for english words


In [234]:
# # choose 10 topics, 20 passes
# tic = timeit.default_timer()
# ldamodel_new = gensim.models.ldamodel.LdaModel(corpus_new, num_topics=10, id2word = dictionary_new, passes=20)
# toc = timeit.default_timer()
# print(str(toc - tic) + ' seconds elapsed')
# # current: with new dictionary/corpus excluding more stopwords and common words
# # # above: without filtering for english words

# choose 10 topics, 20 passes
tic = timeit.default_timer()
ldamodel_new = gensim.models.ldamodel.LdaModel(corpus_new2, num_topics=10, id2word = dictionary_new2, passes=20)
toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')
# current: with new dictionary/corpus excluding more stopwords and common words
# above: with filtering for english words

535.3515410920081 seconds elapsed

In [235]:
# Save LDA model
# # flda_10topic20pass_new = open('/Users/clarencecheng/Dropbox/~Insight/skimr/lda_10topic20pass_new','wb')
# # pickle.dump(ldamodel_new, flda_10topic20pass_new)

# flda_10topic20pass_new2 = open('/Users/clarencecheng/Dropbox/~Insight/skimr/lda_10topic20pass_new2','wb')
# pickle.dump(ldamodel_new, flda_10topic20pass_new2)
# # # above: without filtering for english words

with open('/Users/clarencecheng/Dropbox/~Insight/skimr/lda_10topic20pass_new2b','wb') as flda_10topic20pass_new2b:
    pickle.dump(ldamodel_new, flda_10topic20pass_new2b)
# above: with filtering for english words

Inspect topic output of new LDA model

In [240]:
print(ldamodel_new.print_topics( num_topics=10, num_words=5))

[(0, '0.011*"compani" + 0.010*"product" + 0.008*"busi" + 0.007*"build" + 0.007*"team"'), (1, '0.030*"design" + 0.010*"user" + 0.007*"code" + 0.006*"web" + 0.006*"color"'), (2, '0.013*"write" + 0.013*"read" + 0.008*"learn" + 0.008*"love" + 0.007*"life"'), (3, '0.034*"via" + 0.028*"music" + 0.025*"game" + 0.020*"univ" + 0.019*"data"'), (4, '0.007*"us" + 0.006*"learn" + 0.006*"world" + 0.005*"system" + 0.005*"human"'), (5, '0.153*"de" + 0.103*"e" + 0.044*"um" + 0.040*"da" + 0.039*"para"'), (6, '0.055*"white" + 0.033*"black" + 0.016*"photo" + 0.016*"hou" + 0.016*"presid"'), (7, '0.016*"life" + 0.008*"success" + 0.007*"feel" + 0.007*"becom" + 0.006*"learn"'), (8, '0.009*"food" + 0.007*"home" + 0.006*"eat" + 0.006*"hou" + 0.006*"live"'), (9, '0.007*"trump" + 0.007*"us" + 0.007*"said" + 0.005*"men" + 0.005*"never"')]

Define a function to convert topic vector to numeric vector

In [238]:
def lda_to_vec(lda_input):
    num_topics = 10
    vec = [0]*num_topics
    for i in lda_input:
        col = i[0]
        val = i[1]
        vec[col] = val
    return vec
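
The conversion is needed because applying a gensim LDA model to a document returns a sparse list of (topic id, probability) pairs, and topics below the model's minimum probability are left out. The sketch below shows the same idea with the number of topics passed in as a parameter; the name `lda_to_dense` and the input values are hypothetical:

```python
def lda_to_dense(lda_output, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    vec = [0.0] * num_topics
    for topic_id, prob in lda_output:
        vec[topic_id] = prob
    return vec

# Hypothetical sparse output for a 5-topic model
sparse = [(1, 0.62), (4, 0.35)]
dense = lda_to_dense(sparse, num_topics=5)
# dense == [0.0, 0.62, 0.0, 0.0, 0.35]
```

Fixed-length vectors of this kind can be compared directly, for example with a cosine similarity.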

Calculate document vectors

In [239]:
all_lda_vecs = []

n = 0
for i in corpus_new2:
    doc_lda = ldamodel_new[i]
    vec_lda = lda_to_vec(doc_lda)
    # collect the dense topic vector for this document
    all_lda_vecs.append(vec_lda)
    n += 1
    # if n <= 20:
    #     print(doc_lda)


In [243]:
# Save all_lda_vecs
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/all_lda_vecs','wb') as fall_lda_vecs:
    pickle.dump(all_lda_vecs, fall_lda_vecs)