@ Digital Textualities of South Asia: A Research Symposium
Department of Asian Studies, University of British Columbia
4 March 2016
A. Sean Pue, Michigan State University
pue@msu.edu
@seanpue
Github: seanpue
Talk Repository: http://github.com/seanpue/dtsa2016
In [1]:
from IPython.display import IFrame
In [2]:
import sys
sys.path.append('./graphparser/')
import graphparser as gp
import pandas as pd
import networkx as nx
import logging,sys,codecs,re,csv
In [3]:
pd.set_option("display.max_rows",25)
In [4]:
pd.read_csv('data/miraji_nazmen.csv', encoding='utf-16', index_col=0)
Out[4]:
Metrical units are not necessarily syllables
The meters allow for certain flexibilities
Urdu meters are described, following Persian (Farsi) and earlier Arabic prosody, as instances of particular patterns (a system dating back to al-Khalil of Basra, b. 718 CE). For example, the opening verse of Ghalib's divan, in Urdu script, roman transliteration, and Devanagari (a transliteration sketch follows the verse):
نقش فریادی ہے کس کی شوخی تحریر کا
کاغذی ہے پیرہن ہر پیکر تصویر کا
naqsh faryaadii hai kis kii sho;xii-e ta;hriir kaa
kaa;gazii hai pairahan har paikar-e ta.sviir kaa
नक़्श फ़रयादी है किस की शोख़ी-ए तहरीर का
काग़ज़ी है पैरहन हर पैकर-ए तस्वीर का
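The roman transliteration above encodes the Urdu letters in plain ASCII (';x' for the خ of sho;xii, ';h' for the ح of ta;hriir, '.s' for the ص of ta.sviir), and it is this form that the analysis below works with. As a sketch, not run in the original talk, the graphparser module loaded later in this notebook can map such a line back to script; this assumes the urdu.yaml settings target Urdu script, as the Nastaliq-rendered word clouds further below suggest.
In [ ]:
# Sketch (assumption: graphparser's urdu.yaml settings convert this roman
# transliteration scheme into Urdu script, as the word clouds below suggest).
import sys
sys.path.append('./graphparser/')
from graphparser import GraphParser

urdup = GraphParser('graphparser/settings/urdu.yaml')
line = 'naqsh faryaadii hai kis kii sho;xii-e ta;hriir kaa'
print(urdup.parse(line).output)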
What is topic modeling?
In [5]:
import pydot

# Schematic diagram: two topics generating the words of a document.
dot_object = pydot.Dot(graph_name="main_graph", rankdir="LR", labelloc='b',
                       labeljust='r', ranksep=1)
topic1 = pydot.Node(name='topic1', texlbl=r'topic1', label='Topic #1', shape='square')
dot_object.add_node(topic1)
topic2 = pydot.Node(name='topic2', texlbl=r'topic2', label='Topic #2', shape='square')
dot_object.add_node(topic2)
#topic3 = pydot.Node(name='topic3', texlbl=r'topic3', label='عاشق', shape='square', fontname="Jameel Noori Nastaleeq")
#dot_object.add_node(topic3)
plate_document = pydot.Cluster(graph_name='plate_document', label='Document', fontsize=24)
word1= pydot.Node(name='word', texlbl=r'\word', label='Word')
plate_document.add_node(word1)
word2= pydot.Node(name='word2', texlbl=r'\word', label='Word')
plate_document.add_node(word2)
word3= pydot.Node(name='word3', texlbl=r'\word', label='Word')
plate_document.add_node(word3)
# add plate k to graph
dot_object.add_subgraph(plate_document)
dot_object.add_edge(pydot.Edge(topic1, word1))
dot_object.add_edge(pydot.Edge(topic1, word2))
dot_object.add_edge(pydot.Edge(topic2, word3))
#dot_object.add_edge(pydot.Edge(node_theta, node_z))
#dot_object.add_edge(pydot.Edge(node_z, node_w))
#dot_object.add_edge(pydot.Edge(node_w, node_beta, dir='back'))
#dot_object.add_edge(pydot.Edge(node_beta, node_eta, dir='back'))
dot_object.write('graph.dotfile', format='raw', prog='dot')
Out[5]:
In [6]:
dot_object.write_png('topic_model.png', prog='dot')
from IPython.display import Image
#Image('topic_model.png')
In [7]:
from gensim import corpora, models, similarities
import collections,operator,sys,numpy,pandas
from jinja2 import Template
sys.path.append('graphparser/')
from graphparser import GraphParser
urdup = GraphParser('graphparser/settings/urdu.yaml')
# Load the lemmatized verse documents from the Ghalib concordance.
with open('ghalib-concordance/output/lemma_documents.txt', 'r') as f:
    text = f.read()
verses = text.split('\n')
verses_orig = [urdup.parse(v).output for v in verses]
assert len(verses) == 1461
tokens = []
for v in verses:
    tokens += v.split(' ')
# Stoplist of high-frequency function words, pronouns, light verbs, and so on.
stoplist = ['honaa', '', 'karnaa',
            'kaa', 'se', 'me;n', 'nah', 'vuh', 'kih', 'ko', 'jaanaa', 'kii', 'nahii;n', 'mai;n', 'kyaa', 'meraa', 'jo', 'ham',
            'bhii', 'to', 'kahnaa', 'yih', 'aanaa', 'ne', 'teraa', 'dekhnaa', 'aur', 'par', 'denaa', ';gaalib', 'ko))ii', 'kyuu;n',
            'hii', 'pah', 'bah', 'gar', 'rahnaa', 'tuu', 'phir', 'apnaa', 'har', 'ay', 'ik', 'kis', 'tum', 'kuchh',
            'agar', 'ek', 'asad', 'ab', 'chaahiye', 'puuchhnaa', 'yuu;n', 'hamaaraa',
            'mauj', 'yaa;n', 'nikalnaa', 'yaa', 'milnaa', 'liye', 'yak', "jaan'naa", 'achchhaa', 'haa))e', 'vaa;n', 'tak', 'paanaa',
            'magar', 'taa', 'pa;rnaa', 'khe;nchnaa', 'kabhii', 'lekin', 'u;thnaa', 'varnah', 'chalnaa',
            'phir', 'lenaa', 'denaa', 'kahaa;n', 'sar', 'jab', "go", "ban'naa", "ya((nii", "vuhii", "aap", "saknaa", "kisii", "yihii",
            'jitnaa', 'saa', 'pahle', 'lagnaa', 'vale', 'mat', 'sahii', 'kam',
            'bahut', 'aisaa', 'qadar', 'aage', 'abhii', 'az', 'ba;gair', 'kyuu;nkar', 'buraa',
            'hanuuz', 'baar']
# Treat all infinitives (forms ending in -naa) as stopwords, keeping the noun tamanna.
verbs = [w for w in set(tokens) if w.endswith('naa') and w != 'tamanna']
stoplist += verbs
In [8]:
# Remove stopwords, then drop words that occur only once in the whole corpus.
texts = [[word for word in verse.lower().split() if word not in stoplist] for verse in verses]
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
# Map the remaining transliterated tokens through graphparser's urdu.yaml settings
# (apparently into Urdu script, which is how the word clouds below are rendered).
texts = [[urdup.parse(word).output for word in text] for text in texts]
# Build the gensim dictionary and the bag-of-words corpus: doc2bow turns each
# verse into a list of (token_id, count) pairs.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
In [9]:
def gen_model(num_topics=15, passes=10, iterations=250, chunksize=10, workers=5):
    """Train an LDA model on the verse corpus with gensim's multicore implementation."""
    model = models.LdaMulticore(corpus, id2word=dictionary, num_topics=num_topics,
                                eval_every=10, passes=passes, iterations=iterations,
                                chunksize=chunksize, workers=workers)
    return model

model = gen_model()
What is a topic?
Usually, a probability distribution over the words of the vocabulary
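To make this concrete, here is a sketch on invented toy documents, not the Ghalib corpus used below: for each topic, gensim learns a probability for every word in the dictionary.
In [ ]:
# Toy sketch: two topics learned from four tiny "documents".
from gensim import corpora, models

toy_docs = [['rose', 'nightingale', 'garden', 'rose'],
            ['nightingale', 'rose', 'thorn'],
            ['wine', 'cup', 'tavern', 'wine'],
            ['tavern', 'cup', 'intoxication']]
toy_dict = corpora.Dictionary(toy_docs)
toy_corpus = [toy_dict.doc2bow(d) for d in toy_docs]
toy_lda = models.LdaModel(toy_corpus, id2word=toy_dict, num_topics=2, passes=50)

for t in range(toy_lda.num_topics):
    # show_topic returns (word, probability) pairs; each topic's probabilities
    # over the whole vocabulary sum to 1.
    print('Topic #', t + 1, toy_lda.show_topic(t, topn=4))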
Example: 15 topics from Ghalib's Divan
In [11]:
def get_verses():
    """For each topic, return the verses (in script) ranked by that topic's weight."""
    global model
    global corpus
    # Infer the topic proportions of every verse and build a verses-by-topics matrix.
    text_topics = [model[x] for x in corpus]
    da = numpy.zeros((len(text_topics), model.num_topics))
    for i, v in enumerate(text_topics):
        for topic, value in v:
            da[i, topic] = value
    df = pandas.DataFrame(da)  # probably a way to compress the above
    verses_out = {}
    for i in range(model.num_topics):
        verses = []
        # Sort the verses by their weight in topic i, descending.
        for x in df.sort_values(by=i, ascending=False)[i].index:
            v = df[i][x]
            if v > 0:
                verses.append(verses_orig[x])
        verses_out['topic_' + str(i)] = verses
    return verses_out

num_words = 20
data = {'topic_words': [model.show_topic(i, topn=num_words) for i in range(model.num_topics)],
        'topic_verses': get_verses()}
In [12]:
for x in range(model.num_topics):
    print('Topic #', x + 1)
    for w in data['topic_words'][x]:
        print(w)
Alternative Visualization as Interactive Word Clouds using d3.js
In [13]:
clouds_template='''
<!DOCTYPE html>
<meta charset="utf-8">
<head>
<script type="text/javascript" src="d3/d3.js"></script>
<script type="text/javascript" src="d3-cloud/d3.layout.cloud.js"></script>
<script type="application/json" id="data">
{{topic_words_json}}
</script>
</head>
<body>
<div id="models" style="width:50%;float:left">
</div>
<div id="texts" style="width:50%;float:left">
</div>
<script>
var fill = d3.scale.category20();
var word_data;
function make_cloud(cloud, id){
    // cloud is a list of [word, probability] pairs from model.show_topic().
    words = cloud.map(function(d){
        return {text: d[0], size: d[1] * 2000};
    }).sort(function(a, b){
        return b.size - a.size; // largest words first
    });
    word_data = words;
    d3.layout.cloud().size([800, 800])
        .words(words)
        .padding(1)
        .rotate(function() { return 0; }) // or: ~~(Math.random() * 2) * 90;
        .font("Impact")
        .fontSize(function(d) { return d.size; })
        .on("end", draw)
        .start();
    function show_text(id){
        // Show the first ten verses associated with the clicked topic.
        d3.select("div#texts").selectAll('p').remove();
        for (i = 0; i < 10; i++){ // or: i < topic_verses[id].length
            d3.select("div#texts").append("p").style("font-family", "Jameel Noori Nastaleeq").style("font-size", "16px").text(topic_verses[id][i]).append("br");
        }
    }
    function draw(words) {
        d3.select("div#models").append("svg")
            .attr("width", 400)
            .attr("height", 400)
            .attr("id", id)
            .on("click", function(d) { show_text(this.id); })
            .append("g")
            .attr("transform", "translate(400,400)")
            .selectAll("text")
            .data(words)
            .enter().append("text")
            .style("font-size", function(d) { return d.size + "px"; })
            .style("font-family", "Jameel Noori Nastaleeq")
            .style("fill", function(d, i) { return 0; }) // or: fill(i);
            .attr("text-anchor", "middle")
            .attr("transform", function(d) {
                return "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")";
            })
            .text(function(d) { return d.text; });
    }
}
var num_topics = {{num_topics}};
var json_data = JSON.parse(document.getElementById('data').innerHTML);
topic_words = json_data['topic_words'];
topic_verses = json_data['topic_verses'];
for (i = 0; i < num_topics; i++) {
    id = "topic_" + i;
    make_cloud(topic_words[i], id);
}
</script>
</body>
</html>
'''
from IPython.display import IFrame
import os
import json

num_words = 100
count = 0
last_fn = None

def serve_html(s, w, h):
    """Write HTML to a temporary file and display it in an IFrame."""
    global count
    count += 1
    fn = '__tmp' + str(os.getpid()) + '_' + str(count) + '.html'
    global last_fn
    last_fn = fn
    with open(fn, 'w') as f:
        f.write(s)
    return IFrame('files/' + fn, w, h)
def gen_clouds():
    """Render the topic words and top verses as an interactive d3.js word-cloud page."""
    global model
    num_words = 100
    data = {'topic_words': [model.show_topic(i, topn=num_words) for i in range(model.num_topics)],
            'topic_verses': get_verses()}
    topic_words_json = json.dumps(data)
    s = Template(clouds_template).render(num_topics=model.num_topics, topic_words_json=topic_words_json)
    with open('word-cloud.html', 'w') as f:
        f.write(s)
    # return serve_html(s, 1200, 800)

gen_clouds()
IFrame('word-cloud.html', width=1200, height=800)
Out[13]:
The road of fresh themes is not closed
The gate of poetry is open until Doomsday
-Valī Dakkanī (1667-1707)
maẓmūn āfrīnī: the creation of themes
the beloved is a hunter
the beloved lies in wait for the prey
the hunter slaughters the prey
the hunter makes the prey into a kabob
the beloved is the prey
Perhaps as a Resource Description Framework (RDF) triple? (An rdflib sketch follows the diagrams below.)
subject -> predicate -> object
In [14]:
import pydot
dot_object = pydot.Dot(graph_name="main_graph", rankdir="LR", labelloc='b',
                       labeljust='r', ranksep=1)
node1 = pydot.Node(name='node1', texlbl=r'topic1', label='Subject', shape='square')
dot_object.add_node(node1)
node2 = pydot.Node(name='node2', texlbl=r'topic2', label='Object', shape='square')
dot_object.add_node(node2)
dot_object.add_edge(pydot.Edge(node1, node2,label="Predicate"))
#dot_object.write('graph.dotfile', format='raw', prog='dot')
dot_object.write_png('basic_triple.png', prog='dot')
from IPython.display import Image
#Image('basic_triple.png')
In [15]:
#import pydot
dot_object = pydot.Dot(graph_name="main_graph", rankdir="LR", labelloc='b',
                       labeljust='r', ranksep=1)
node1 = pydot.Node(name='node1', texlbl=r'topic1', label='Beloved', shape='square')
dot_object.add_node(node1)
node2 = pydot.Node(name='node2', texlbl=r'topic2', label='Lover', shape='square')
dot_object.add_node(node2)
node3 = pydot.Node(name='node3', texlbl=r'topic3', label='Cruelty', shape='square')
dot_object.add_node(node3)
dot_object.add_edge(pydot.Edge(node1, node2,label="hunts"))
dot_object.add_edge(pydot.Edge(node1, node3,label="exhibits"))
dot_object.add_edge(pydot.Edge(node2, node1,label="loves"))
dot_object.add_edge(pydot.Edge(node2, node3,label="suffers"))
#dot_object.write('graph.dotfile', format='raw', prog='dot')
dot_object.write_png('example_triple1.png', prog='dot')
from IPython.display import Image
#Image('example_triple1.png')
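The relations drawn above could also be recorded as machine-readable RDF triples. Here is a sketch using rdflib, which was not part of the original talk; the example.org namespace and the predicate names are invented for illustration.
In [ ]:
# Sketch: the beloved/lover/cruelty relations above as RDF triples in rdflib.
# The example.org namespace and predicate names are hypothetical.
from rdflib import Graph, Namespace

GHAZAL = Namespace('http://example.org/ghazal/')
g = Graph()
g.add((GHAZAL.beloved, GHAZAL.hunts, GHAZAL.lover))
g.add((GHAZAL.beloved, GHAZAL.exhibits, GHAZAL.cruelty))
g.add((GHAZAL.lover, GHAZAL.loves, GHAZAL.beloved))
g.add((GHAZAL.lover, GHAZAL.suffers, GHAZAL.cruelty))
# Serialize as Turtle (returns a str in rdflib >= 6, bytes in older versions).
print(g.serialize(format='turtle'))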
Thanks!
Sean Pue
pue@msu.edu
@seanpue
In [ ]: