@ Digital Textualities of South Asia: A Research Symposium
Department of Asian Studies, University of British Columbia
4 March 2016
A. Sean Pue, Michigan State University
pue@msu.edu
@seanpue
Github: seanpue
Talk Repository: http://github.com/seanpue/dtsa2016
In [1]:
from IPython.display import IFrame
In [2]:
import sys
sys.path.append('./graphparser/')
import graphparser as gp
import pandas as pd
import networkx as nx
import logging,sys,codecs,re,csv
In [3]:
pd.set_option("display.max_rows",25)
In [4]:
pd.read_csv('data/miraji_nazmen.csv', encoding='utf-16', index_col=0)
Out[4]:
Metrical units are not necessarily syllables
The meters allow for certain flexibilities
Urdu meters are described, following Persian (Farsi) and earlier Arabic prosody, as instances of particular patterns (a system dating back to al-Khalil of Basra, b. 718 CE). For example, the opening verse of Ghalib's divan, in Urdu script, roman transliteration, and Devanagari (a transliteration sketch follows the verse):
نقش فریادی ہے کس کی شوخی تحریر کا
کاغذی ہے پیرہن ہر پیکر تصویر کا
naqsh faryaadii hai kis kii sho;xii-e ta;hriir kaa
kaa;gazii hai pairahan har paikar-e ta.sviir kaa
नक़्श फ़रयादी है किस की शोख़ी-ए तहरीर का
काग़ज़ी है पैरहन हर पैकर-ए तस्वीर का
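The roman transliteration above encodes the Urdu letters in plain ASCII (';x' for the خ of sho;xii, ';h' for the ح of ta;hriir, '.s' for the ص of ta.sviir), and it is this form that the analysis below works with. As a sketch, not run in the original talk, the graphparser module loaded later in this notebook can map such a line back to script; this assumes the urdu.yaml settings target Urdu script, as the Nastaliq-rendered word clouds further below suggest.
In [ ]:
# Sketch (assumption: graphparser's urdu.yaml settings convert this roman
# transliteration scheme into Urdu script, as the word clouds below suggest).
import sys
sys.path.append('./graphparser/')
from graphparser import GraphParser

urdup = GraphParser('graphparser/settings/urdu.yaml')
line = 'naqsh faryaadii hai kis kii sho;xii-e ta;hriir kaa'
print(urdup.parse(line).output)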
What is topic modeling?
In [5]:
import pydot

# Schematic diagram: two topics generating the words of a document.
dot_object = pydot.Dot(graph_name="main_graph", rankdir="LR", labelloc='b',
                       labeljust='r', ranksep=1)
topic1 = pydot.Node(name='topic1', texlbl=r'topic1', label='Topic #1', shape='square')
dot_object.add_node(topic1)
topic2 = pydot.Node(name='topic2', texlbl=r'topic2', label='Topic #2', shape='square')
dot_object.add_node(topic2)
#topic3 = pydot.Node(name='topic3', texlbl=r'topic3', label='عاشق', shape='square', fontname="Jameel Noori Nastaleeq")
#dot_object.add_node(topic3)
plate_document = pydot.Cluster(graph_name='plate_document', label='Document', fontsize=24)
word1= pydot.Node(name='word', texlbl=r'\word', label='Word')
plate_document.add_node(word1)
word2= pydot.Node(name='word2', texlbl=r'\word', label='Word')
plate_document.add_node(word2)
word3= pydot.Node(name='word3', texlbl=r'\word', label='Word')
plate_document.add_node(word3)
# add plate k to graph
dot_object.add_subgraph(plate_document)
dot_object.add_edge(pydot.Edge(topic1, word1))
dot_object.add_edge(pydot.Edge(topic1, word2))
dot_object.add_edge(pydot.Edge(topic2, word3))
#dot_object.add_edge(pydot.Edge(node_theta, node_z))
#dot_object.add_edge(pydot.Edge(node_z, node_w))
#dot_object.add_edge(pydot.Edge(node_w, node_beta, dir='back'))
#dot_object.add_edge(pydot.Edge(node_beta, node_eta, dir='back'))
dot_object.write('graph.dotfile', format='raw', prog='dot')
Out[5]:
In [6]:
dot_object.write_png('topic_model.png', prog='dot')
from IPython.display import Image
#Image('topic_model.png')
In [7]:
from gensim import corpora, models, similarities
import collections,operator,sys,numpy,pandas
from jinja2 import Template
sys.path.append('graphparser/')
from graphparser import GraphParser
urdup = GraphParser('graphparser/settings/urdu.yaml')
# Load the lemmatized verse documents from the Ghalib concordance.
with open('ghalib-concordance/output/lemma_documents.txt', 'r') as f:
    text = f.read()
verses = text.split('\n')
verses_orig = [urdup.parse(v).output for v in verses]
assert len(verses) == 1461
tokens = []
for v in verses:
    tokens += v.split(' ')
# Stoplist of high-frequency function words, pronouns, light verbs, and so on.
stoplist = ['honaa', '', 'karnaa',
            'kaa', 'se', 'me;n', 'nah', 'vuh', 'kih', 'ko', 'jaanaa', 'kii', 'nahii;n', 'mai;n', 'kyaa', 'meraa', 'jo', 'ham',
            'bhii', 'to', 'kahnaa', 'yih', 'aanaa', 'ne', 'teraa', 'dekhnaa', 'aur', 'par', 'denaa', ';gaalib', 'ko))ii', 'kyuu;n',
            'hii', 'pah', 'bah', 'gar', 'rahnaa', 'tuu', 'phir', 'apnaa', 'har', 'ay', 'ik', 'kis', 'tum', 'kuchh',
            'agar', 'ek', 'asad', 'ab', 'chaahiye', 'puuchhnaa', 'yuu;n', 'hamaaraa',
            'mauj', 'yaa;n', 'nikalnaa', 'yaa', 'milnaa', 'liye', 'yak', "jaan'naa", 'achchhaa', 'haa))e', 'vaa;n', 'tak', 'paanaa',
            'magar', 'taa', 'pa;rnaa', 'khe;nchnaa', 'kabhii', 'lekin', 'u;thnaa', 'varnah', 'chalnaa',
            'phir', 'lenaa', 'denaa', 'kahaa;n', 'sar', 'jab', "go", "ban'naa", "ya((nii", "vuhii", "aap", "saknaa", "kisii", "yihii",
            'jitnaa', 'saa', 'pahle', 'lagnaa', 'vale', 'mat', 'sahii', 'kam',
            'bahut', 'aisaa', 'qadar', 'aage', 'abhii', 'az', 'ba;gair', 'kyuu;nkar', 'buraa',
            'hanuuz', 'baar']
# Treat all infinitives (forms ending in -naa) as stopwords, keeping the noun tamanna.
verbs = [w for w in set(tokens) if w.endswith('naa') and w != 'tamanna']
stoplist += verbs
In [8]:
# Remove stopwords, then drop words that occur only once in the whole corpus.
texts = [[word for word in verse.lower().split() if word not in stoplist] for verse in verses]
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
# Map the remaining transliterated tokens through graphparser's urdu.yaml settings
# (apparently into Urdu script, which is how the word clouds below are rendered).
texts = [[urdup.parse(word).output for word in text] for text in texts]
# Build the gensim dictionary and the bag-of-words corpus: doc2bow turns each
# verse into a list of (token_id, count) pairs.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
In [9]:
def gen_model(num_topics=15, passes=10, iterations=250, chunksize=10, workers=5):
    """Train an LDA model on the verse corpus with gensim's multicore implementation."""
    model = models.LdaMulticore(corpus, id2word=dictionary, num_topics=num_topics,
                                eval_every=10, passes=passes, iterations=iterations,
                                chunksize=chunksize, workers=workers)
    return model

model = gen_model()
What is a topic?
Usually, a probability distribution over the words of the vocabulary
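To make this concrete, here is a sketch on invented toy documents, not the Ghalib corpus used below: for each topic, gensim learns a probability for every word in the dictionary.
In [ ]:
# Toy sketch: two topics learned from four tiny "documents".
from gensim import corpora, models

toy_docs = [['rose', 'nightingale', 'garden', 'rose'],
            ['nightingale', 'rose', 'thorn'],
            ['wine', 'cup', 'tavern', 'wine'],
            ['tavern', 'cup', 'intoxication']]
toy_dict = corpora.Dictionary(toy_docs)
toy_corpus = [toy_dict.doc2bow(d) for d in toy_docs]
toy_lda = models.LdaModel(toy_corpus, id2word=toy_dict, num_topics=2, passes=50)

for t in range(toy_lda.num_topics):
    # show_topic returns (word, probability) pairs; each topic's probabilities
    # over the whole vocabulary sum to 1.
    print('Topic #', t + 1, toy_lda.show_topic(t, topn=4))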
Example: 15 topics from Ghalib's Divan
In [11]:
def get_verses():
    """For each topic, return the verses (in script) ranked by that topic's weight."""
    global model
    global corpus
    # Infer the topic proportions of every verse and build a verses-by-topics matrix.
    text_topics = [model[x] for x in corpus]
    da = numpy.zeros((len(text_topics), model.num_topics))
    for i, v in enumerate(text_topics):
        for topic, value in v:
            da[i, topic] = value
    df = pandas.DataFrame(da)  # probably a way to compress the above
    verses_out = {}
    for i in range(model.num_topics):
        verses = []
        # Sort the verses by their weight in topic i, descending.
        for x in df.sort_values(by=i, ascending=False)[i].index:
            v = df[i][x]
            if v > 0:
                verses.append(verses_orig[x])
        verses_out['topic_' + str(i)] = verses
    return verses_out

num_words = 20
data = {'topic_words': [model.show_topic(i, topn=num_words) for i in range(model.num_topics)],
        'topic_verses': get_verses()}
In [12]:
for x in range(model.num_topics):
    print('Topic #', x + 1)
    for w in data['topic_words'][x]:
        print(w)
Alternative Visualization as Interactive Word Clouds using d3.js
In [13]:
clouds_template='''
<!DOCTYPE html>
<meta charset="utf-8">
<head>
<script type="text/javascript" src="d3/d3.js"></script>
<script type="text/javascript" src="d3-cloud/d3.layout.cloud.js"></script>
<script type="application/json" id="data">
{{topic_words_json}}
</script>
</head>
<body>
<div id="models" style="width:50%;float:left">
</div>
<div id="texts" style="width:50%;float:left">
</div>
<script>
var fill = d3.scale.category20();
var word_data;
function make_cloud(cloud, id){
    // cloud is a list of [word, probability] pairs from model.show_topic().
    words = cloud.map(function(d){
        return {text: d[0], size: d[1] * 2000};
    }).sort(function(a, b){
        return b.size - a.size; // largest words first
    });
    word_data = words;
    d3.layout.cloud().size([800, 800])
        .words(words)
        .padding(1)
        .rotate(function() { return 0; }) // or: ~~(Math.random() * 2) * 90;
        .font("Impact")
        .fontSize(function(d) { return d.size; })
        .on("end", draw)
        .start();
    function show_text(id){
        // Show the first ten verses associated with the clicked topic.
        d3.select("div#texts").selectAll('p').remove();
        for (i = 0; i < 10; i++){ // or: i < topic_verses[id].length
            d3.select("div#texts").append("p").style("font-family", "Jameel Noori Nastaleeq").style("font-size", "16px").text(topic_verses[id][i]).append("br");
        }
    }
    function draw(words) {
        d3.select("div#models").append("svg")
            .attr("width", 400)
            .attr("height", 400)
            .attr("id", id)
            .on("click", function(d) { show_text(this.id); })
            .append("g")
            .attr("transform", "translate(400,400)")
            .selectAll("text")
            .data(words)
            .enter().append("text")
            .style("font-size", function(d) { return d.size + "px"; })
            .style("font-family", "Jameel Noori Nastaleeq")
            .style("fill", function(d, i) { return 0; }) // or: fill(i);
            .attr("text-anchor", "middle")
            .attr("transform", function(d) {
                return "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")";
            })
            .text(function(d) { return d.text; });
    }
}
var num_topics = {{num_topics}};
var json_data = JSON.parse(document.getElementById('data').innerHTML);
topic_words = json_data['topic_words'];
topic_verses = json_data['topic_verses'];
for (i = 0; i < num_topics; i++) {
    id = "topic_" + i;
    make_cloud(topic_words[i], id);
}
</script>
</body>
</html>
'''
from IPython.display import IFrame
import os
import json

num_words = 100
count = 0
last_fn = None

def serve_html(s, w, h):
    """Write HTML to a temporary file and display it in an IFrame."""
    global count
    count += 1
    fn = '__tmp' + str(os.getpid()) + '_' + str(count) + '.html'
    global last_fn
    last_fn = fn
    with open(fn, 'w') as f:
        f.write(s)
    return IFrame('files/' + fn, w, h)
def gen_clouds():
    """Render the topic words and top verses as an interactive d3.js word-cloud page."""
    global model
    num_words = 100
    data = {'topic_words': [model.show_topic(i, topn=num_words) for i in range(model.num_topics)],
            'topic_verses': get_verses()}
    topic_words_json = json.dumps(data)
    s = Template(clouds_template).render(num_topics=model.num_topics, topic_words_json=topic_words_json)
    with open('word-cloud.html', 'w') as f:
        f.write(s)
    # return serve_html(s, 1200, 800)

gen_clouds()
IFrame('word-cloud.html', width=1200, height=800)
Out[13]:
The road of fresh themes is not closed
The gate of poetry is open until Doomsday
-Valī Dakkanī (1667-1707)
maẓmūn āfrīnī: the creation of themes
the beloved is a hunter
the beloved lies in wait for the prey
the hunter slaughters the prey
the hunter makes the prey into a kabob
the beloved is the prey
Perhaps as a Resource Description Framework (RDF) triple? (An rdflib sketch follows the diagrams below.)
subject -> predicate -> object
In [14]:
import pydot
dot_object = pydot.Dot(graph_name="main_graph", rankdir="LR", labelloc='b',
                       labeljust='r', ranksep=1)
node1 = pydot.Node(name='node1', texlbl=r'topic1', label='Subject', shape='square')
dot_object.add_node(node1)
node2 = pydot.Node(name='node2', texlbl=r'topic2', label='Object', shape='square')
dot_object.add_node(node2)
dot_object.add_edge(pydot.Edge(node1, node2,label="Predicate"))
#dot_object.write('graph.dotfile', format='raw', prog='dot')
dot_object.write_png('basic_triple.png', prog='dot')
from IPython.display import Image
#Image('basic_triple.png')
In [15]:
#import pydot
dot_object = pydot.Dot(graph_name="main_graph", rankdir="LR", labelloc='b',
                       labeljust='r', ranksep=1)
node1 = pydot.Node(name='node1', texlbl=r'topic1', label='Beloved', shape='square')
dot_object.add_node(node1)
node2 = pydot.Node(name='node2', texlbl=r'topic2', label='Lover', shape='square')
dot_object.add_node(node2)
node3 = pydot.Node(name='node3', texlbl=r'topic3', label='Cruelty', shape='square')
dot_object.add_node(node3)
dot_object.add_edge(pydot.Edge(node1, node2,label="hunts"))
dot_object.add_edge(pydot.Edge(node1, node3,label="exhibits"))
dot_object.add_edge(pydot.Edge(node2, node1,label="loves"))
dot_object.add_edge(pydot.Edge(node2, node3,label="suffers"))
#dot_object.write('graph.dotfile', format='raw', prog='dot')
dot_object.write_png('example_triple1.png', prog='dot')
from IPython.display import Image
#Image('example_triple1.png')
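The relations drawn above could also be recorded as machine-readable RDF triples. Here is a sketch using rdflib, which was not part of the original talk; the example.org namespace and the predicate names are invented for illustration.
In [ ]:
# Sketch: the beloved/lover/cruelty relations above as RDF triples in rdflib.
# The example.org namespace and predicate names are hypothetical.
from rdflib import Graph, Namespace

GHAZAL = Namespace('http://example.org/ghazal/')
g = Graph()
g.add((GHAZAL.beloved, GHAZAL.hunts, GHAZAL.lover))
g.add((GHAZAL.beloved, GHAZAL.exhibits, GHAZAL.cruelty))
g.add((GHAZAL.lover, GHAZAL.loves, GHAZAL.beloved))
g.add((GHAZAL.lover, GHAZAL.suffers, GHAZAL.cruelty))
# Serialize as Turtle (returns a str in rdflib >= 6, bytes in older versions).
print(g.serialize(format='turtle'))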
Thanks!
Sean Pue
pue@msu.edu
@seanpue
In [ ]: