Textual Encoding of Hindi-Urdu Poetry for Data-Rich Literary Analysis

@ Digital Textualities of South Asia: A Research Symposium

Department of Asian Studies, University of British Columbia

4 March 2016

A. Sean Pue, Michigan State University

pue@msu.edu

@seanpue

http://seanpue.com

Github: seanpue

Talk Repository: http://github.com/seanpue/dtsa2016


In [1]:
from IPython.display import IFrame

Hindi/Urdu

हिन्दी

  • left-to-right devanagari script preferred
  • more tatsam (from Sanskrit) words

اردو

  • right-to-left nastaliq script preferred
  • more Perso-Arabic words

"different literary styles based on the same linguistic subdialect" (Masica 1991)

Research Question #1

How to best analyze and encode texts in both scripts?

Challenges

Disambiguating between words

आम aam as عام or آم

کیا as किया kiyaa or क्या kyaa
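The ambiguity runs in both directions, which a small lookup table can make concrete. The following is only a sketch: the pairs come from the two examples above, and the names `PAIRS`, `build_indexes`, and `ambiguous` are hypothetical, not part of the project's code.

```python
from collections import defaultdict

# (Urdu spelling, Devanagari spelling) pairs taken from the examples above.
PAIRS = [
    ("عام", "आम"),
    ("آم", "आम"),
    ("کیا", "किया"),
    ("کیا", "क्या"),
]


def build_indexes(pairs):
    """Index the pairs in both directions, script to script."""
    urdu_to_hindi = defaultdict(set)
    hindi_to_urdu = defaultdict(set)
    for urdu, hindi in pairs:
        urdu_to_hindi[urdu].add(hindi)
        hindi_to_urdu[hindi].add(urdu)
    return urdu_to_hindi, hindi_to_urdu


def ambiguous(index):
    """Return only the spellings that map to more than one candidate."""
    return {key: sorted(values) for key, values in index.items() if len(values) > 1}


urdu_to_hindi, hindi_to_urdu = build_indexes(PAIRS)
print(ambiguous(hindi_to_urdu))  # Devanagari forms with several Urdu spellings
print(ambiguous(urdu_to_hindi))  # Urdu forms with several Devanagari readings
```

Neither script alone is enough to recover the other, which is one reason to store an explicit transliteration alongside each token.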

Challenges

Certain types of analysis require additional information:

  • morphology
  • grammatical markers, such as the iẓāfat (kitāb-e dil)
  • compound-word boundaries

Background Project

Desertful of Roses by Frances W. Pritchett

http://www.columbia.edu/itc/mealac/pritchett/00ghalib/

Hindi/Urdu Text and IPA from Transliteration

  • roman tokens parsed into devanagari/nastaliq versions

  • requires looking before and after for particular combinations

  • involves both tokens and classes of tokens, e.g. consonant, vowel, etc.

  • largely, though not entirely, accurate

  • now using a lexer/parser
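As a rough illustration of why the conversion needs context and token classes, here is a minimal sketch of a roman-to-Devanagari converter. It is not the graphparser used in this talk: the token tables are a tiny hypothetical subset, and conjuncts, nasalization, and most letters are ignored. The one contextual rule it shows is that a vowel is written as a dependent sign after a consonant and as an independent letter elsewhere.

```python
# Hypothetical, deliberately tiny token tables (assumption: not the real settings).
CONSONANTS = {"k": "क", "kh": "ख", "b": "ब", "bh": "भ", "ph": "फ",
              "m": "म", "s": "स", "r": "र", "l": "ल", "n": "न"}
# Each vowel maps to (independent form, dependent form after a consonant).
VOWELS = {"a": ("अ", ""), "aa": ("आ", "ा"), "i": ("इ", "ि"),
          "ii": ("ई", "ी"), "u": ("उ", "ु"), "uu": ("ऊ", "ू")}


def tokenize(word):
    """Split a roman word into known tokens, preferring the longest match."""
    tokens = []
    i = 0
    while i < len(word):
        for size in (2, 1):
            piece = word[i:i + size]
            if piece in CONSONANTS or piece in VOWELS:
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError("unknown token at position %d in %r" % (i, word))
    return tokens


def to_devanagari(word):
    """Convert tokens, looking back at the class of the previous token."""
    output = []
    previous_class = None
    for token in tokenize(word):
        if token in CONSONANTS:
            output.append(CONSONANTS[token])
            previous_class = "consonant"
        else:
            independent, dependent = VOWELS[token]
            output.append(dependent if previous_class == "consonant" else independent)
            previous_class = "vowel"
    return "".join(output)


for word in ["aam", "bas", "phir", "bhuul"]:
    print(word, to_devanagari(word))
```

A full parser also has to look ahead, for example to decide when two consonants form a conjunct, which is where a lexer/parser design becomes more manageable than ad hoc rules.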

Workflow

  • Have texts transcribed into Unicode
  • Convert those files into spreadsheet tables
    • easy to manipulate by an editor or programmatically
    • very clean
  • Attach transliteration, lemma information to the words
  • Analyze as a DataFrame
  • Reconstitute as TEI if necessary

In [2]:
import sys
sys.path.append('./graphparser/')
import graphparser as gp
import pandas as pd
import networkx as nx
import logging,sys,codecs,re,csv

Data File Structure


In [3]:
pd.set_option("display.max_rows",25)

In [4]:
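# DataFrame.from_csv has been removed from recent pandas; read_csv with index_col=0 is the equivalent
pd.read_csv('data/miraji_nazmen.csv', index_col=0, encoding='utf-16')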


Out[4]:
type transliteration urdu notes
0 TITLE NaN چل چلاؤ NaN
1 LINE NaN بس دیکھا اور پھر بھول گئے، NaN
2 TOKEN bas بس NaN
3 TOKEN dekhaa دیکھا NaN
4 TOKEN aur اور NaN
5 TOKEN phir پھر NaN
6 TOKEN bhuul بھول NaN
7 TOKEN ga))e گئے NaN
8 TOKEN , ، NaN
9 LINE NaN جب حُسن نگاہوں میں آیا NaN
10 TOKEN jab جب NaN
11 TOKEN ;husn حُسن NaN
... ... ... ... ...
21963 TOKEN ik اک NaN
21964 TOKEN ke کے NaN
21965 TOKEN pahluu پہلو NaN
21966 TOKEN me;n میں NaN
21967 TOKEN ;xaak خاک NaN
21968 TOKEN aaluudah آلودہ NaN
21969 TOKEN aagahii آگہی NaN
21970 TOKEN hai ہے NaN
21971 TOKEN -- ۔ NaN
21972 LINE NaN ۔۔۔۔۔۔۔۔ NaN
21973 TOKEN ---------------- ۔۔۔۔۔۔۔۔ NaN
21974 LINE NaN NaN NaN

21975 rows × 4 columns
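Because each TOKEN row follows the LINE row it belongs to, lines can be rebuilt from the flat table with a single pass. The sketch below uses only the standard library and the first rows shown above; the function name `lines_with_tokens` is hypothetical.

```python
# (type, transliteration, urdu) rows copied from the start of the table above.
ROWS = [
    ("TITLE", None, "چل چلاؤ"),
    ("LINE", None, "بس دیکھا اور پھر بھول گئے،"),
    ("TOKEN", "bas", "بس"),
    ("TOKEN", "dekhaa", "دیکھا"),
    ("TOKEN", "aur", "اور"),
    ("TOKEN", "phir", "پھر"),
    ("TOKEN", "bhuul", "بھول"),
    ("TOKEN", "ga))e", "گئے"),
    ("TOKEN", ",", "،"),
]


def lines_with_tokens(rows):
    """Attach each TOKEN row to the most recent LINE row."""
    lines = []
    for row_type, transliteration, urdu in rows:
        if row_type == "LINE":
            lines.append({"urdu": urdu, "tokens": []})
        elif row_type == "TOKEN" and lines:
            lines[-1]["tokens"].append(transliteration)
    return lines


for line in lines_with_tokens(ROWS):
    print(line["urdu"])
    print(" ".join(line["tokens"]))
```

The same grouping could be done on the DataFrame itself; the plain-Python version is shown only to make the row order convention explicit.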

Why digital analysis?

Motivated by the strong and recurrent discourse about ‘sound’ in modern Hindi/Urdu poetry

Hindi/Urdu as a language involves:

  • Perso-Arabic vocabulary and forms (ghazal, masnavi, etc.)
  • Indic (“Hindi”) vocabulary and forms
  • Relation of meter and forms to literary community

Possibilities of providing experiential or graphical “proof” to prose assertions

Urdu Meters

  • The meters are quantitative (not qualitative), based on length rather than stress
  • Metrical units involve “short” and “long” vowels
  • Metrical units are not necessarily syllables

    • E.g. Raaj  = - (raa j) [where = is long, - is short]
  • Flexibilities

    • Long vowels can be shortened at the end of words
    • Metrical units can span words
    • There are particular word-based anomalies/flexibilities
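The basic rules above, before any flexibilities, can be sketched as a naive scanner. This is an illustration under simplifying assumptions, not the scanner developed for this project: the vowel and digraph tables are hypothetical and partial, and none of the flexibilities (shortened final vowels, units spanning words, word-specific anomalies) are handled.

```python
# Hypothetical, partial tables for a simplified roman transliteration.
LONG_VOWELS = {"aa", "ii", "uu", "e", "o", "ai", "au"}
SHORT_VOWELS = {"a", "i", "u"}
DIGRAPHS = {"sh", "kh", "gh", "ch", "bh", "ph", "th", "dh"}


def tokenize(word):
    """Split a word into vowel and consonant tokens (two-letter tokens first)."""
    tokens = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in LONG_VOWELS or pair in DIGRAPHS:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def is_vowel(token):
    return token in LONG_VOWELS or token in SHORT_VOWELS


def scan(word):
    """Return the metrical units of one word: '=' is long, '-' is short."""
    tokens = tokenize(word)
    units = []
    i = 0
    while i < len(tokens):
        j = i
        if not is_vowel(tokens[j]):
            j += 1
            if j == len(tokens) or not is_vowel(tokens[j]):
                units.append("-")  # a consonant with no vowel is its own short unit
                i = j
                continue
        if tokens[j] in LONG_VOWELS:
            units.append("=")
            i = j + 1
            continue
        # Short vowel: long only if closed by a consonant that does not
        # begin the next unit.
        k = j + 1
        closed = (k < len(tokens) and not is_vowel(tokens[k])
                  and (k + 1 == len(tokens) or not is_vowel(tokens[k + 1])))
        if closed:
            units.append("=")
            i = k + 1
        else:
            units.append("-")
            i = j + 1
    return "".join(units)


for word in ["raaj", "dil", "kitaab"]:
    print(word, scan(word))
```

Note how `raaj` is one syllable but two metrical units, which is the sense in which units are not necessarily syllables.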

Urdu Prosody

Urdu descriptions of meter derive from Persian (Farsi) and, earlier, Arabic prosody, a system that dates back to al-Khalil of Basra (born c. 718 CE)

  • Describe metrical feet using text where certain vowels are “moving” or “silent,” e.g.
    • fāʿilātun = - = = فاعلاتن
    • fāʿilun = - = فاعلن
    • faʿūlun - = = فعولن
  • Meters named using primary metrical “wheels” and different sorts of modifications to them
  • Meter is referred to as a baḥr (“ocean”)

  • Meter: = - = = / = - = = / = - = = / = - =

نقش فریادی ہے کس کی شوخی تحریر کا

کاغذی ہے پیرہن ہر پیکر تصویر کا

naqsh faryaadii hai kis kii sho;xii-e ta;hriir kaa

kaa;gazii hai pairahan har paikar-e ta.sviir kaa

नक़्श फ़रयादी है किस की शोख़ी-ए तहरीर का

काग़ज़ी है पैरहन हर पैकर-ए तस्वीर का
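The meter of this verse can be written as a sequence of named feet. The sketch below composes the pattern from the feet listed earlier; the ASCII keys are hypothetical stand-ins for the transliterated foot names, and `meter_pattern` is an illustrative helper rather than project code.

```python
# Feet from the list above, keyed by ASCII stand-ins for their names
# (= is long, - is short).
FEET = {
    "faa'ilaatun": "=-==",
    "faa'ilun": "=-=",
    "fa'uulun": "-==",
}


def meter_pattern(foot_names):
    """Join the patterns of the named feet, separating feet with ' / '."""
    return " / ".join(" ".join(FEET[name]) for name in foot_names)


# Three feet of one kind followed by a shorter final foot.
ghalib_meter = ["faa'ilaatun"] * 3 + ["faa'ilun"]
print(meter_pattern(ghalib_meter))
```

Storing meters as lists of feet rather than raw strings keeps the traditional names available for display and comparison.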

Computational Problem

How to computationally scan Hindi/Urdu poetry in a scalable and effective way?

What is topic modeling?


In [5]:
import pydot

dot_object = pydot.Dot(graph_name="main_graph",rankdir="LR", labelloc='b', 
                       labeljust='r', ranksep=1)

topic1 = pydot.Node(name='topic1', texlbl=r'topic1', label='Topic #1', shape='square')
dot_object.add_node(topic1)
topic2 = pydot.Node(name='topic2', texlbl=r'topic2', label='Topic #2', shape='square')
dot_object.add_node(topic2)
#topic3 = pydot.Node(name='topic3', texlbl=r'topic3', label='عاشق', shape='square', fontname="Jameel Noori Nastaleeq")
#dot_object.add_node(topic3)

plate_document = pydot.Cluster(graph_name='plate_document', label='Document', fontsize=24)

word1= pydot.Node(name='word', texlbl=r'\word', label='Word')
plate_document.add_node(word1)
word2= pydot.Node(name='word2', texlbl=r'\word', label='Word')
plate_document.add_node(word2)
word3= pydot.Node(name='word3', texlbl=r'\word', label='Word')
plate_document.add_node(word3)


# add plate k to graph
dot_object.add_subgraph(plate_document)


dot_object.add_edge(pydot.Edge(topic1, word1))
dot_object.add_edge(pydot.Edge(topic1, word2))
dot_object.add_edge(pydot.Edge(topic2, word3))
#dot_object.add_edge(pydot.Edge(node_theta, node_z))
#dot_object.add_edge(pydot.Edge(node_z, node_w))
#dot_object.add_edge(pydot.Edge(node_w, node_beta, dir='back'))
#dot_object.add_edge(pydot.Edge(node_beta, node_eta, dir='back'))
dot_object.write('graph.dotfile', format='raw', prog='dot')


Out[5]:
True

In [6]:
dot_object.write_png('topic_model.png', prog='dot')
from IPython.display import Image
#Image('topic_model.png')


In [7]:
from gensim import corpora, models, similarities
import collections,operator,sys,numpy,pandas
from jinja2 import Template


sys.path.append('graphparser/')
from graphparser import GraphParser
urdup = GraphParser('graphparser/settings/urdu.yaml')

with open('ghalib-concordance/output/lemma_documents.txt','r') as f:
    text = f.read()

verses = text.split('\n')
verses_orig=[urdup.parse(v).output for v in verses]
assert(len(verses)==1461)
tokens=[]

for v in verses:
    tokens+= v.split(' ')

# Function words and very frequent lemmas to exclude from the model
# (defined once, outside the loop above)
stoplist=['honaa','','karnaa',
'kaa','se','me;n','nah','vuh','kih','ko','jaanaa','kii','nahii;n','mai;n','kyaa','meraa','jo','ham',
'bhii','to','kahnaa','yih','aanaa','ne','teraa','dekhnaa','aur','par','denaa',';gaalib','ko))ii','kyuu;n',
'hii','pah','bah','gar','rahnaa','tuu','phir','apnaa','har','ay','ik','kis','tum','kuchh',
'agar','ek','asad','ab','chaahiye','puuchhnaa','yuu;n','hamaaraa',
'mauj','yaa;n','nikalnaa','yaa','milnaa','liye','yak',"jaan'naa",'achchhaa','haa))e','vaa;n','tak','paanaa',
'magar','taa','pa;rnaa','khe;nchnaa','kabhii','lekin','u;thnaa','varnah','chalnaa',
'phir','lenaa','denaa','kahaa;n','sar','jab',"go","ban'naa","ya((nii","vuhii","aap","saknaa","kisii","yihii",
'jitnaa','saa','pahle','lagnaa','vale','mat','sahii','kam',
'bahut','aisaa','qadar','aage','abhii','az','ba;gair','kyuu;nkar','buraa',
'hanuuz','baar']

# Treat every lemma ending in -naa as a verb infinitive, except the noun tamannaa
verbs=[w for w in set(tokens) if w.endswith('naa') and w!='tamannaa']

stoplist+=verbs

In [8]:
texts = [[word for word in verse.lower().split() if word not in stoplist] for verse in verses]

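# Count each token once instead of calling list.count for every word
all_tokens = [word for text in texts for word in text]
token_counts = collections.Counter(all_tokens)
tokens_once = set(word for word, n in token_counts.items() if n == 1)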

texts = [[word for word in text if word not in tokens_once] for text in texts]
texts = [[urdup.parse(word).output for word in text] for text in texts]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

In [9]:
def gen_model(num_topics=15, passes=10,iterations=250,chunksize=10,workers=5):
    # Train an LDA model on the bag-of-words corpus; chunksize is now passed through
    model = models.LdaMulticore(corpus, id2word=dictionary, num_topics=num_topics,
                                eval_every=10, passes=passes, iterations=iterations,
                                chunksize=chunksize, workers=workers)
    return model
model=gen_model()

What is a topic?

usually a probability distribution over the words in the vocabulary
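A toy example may help fix the idea. The numbers and the three-word vocabulary below are invented for illustration and do not come from the model in this talk; `top_words` is a hypothetical helper.

```python
import math

# An invented topic over a three-word vocabulary (illustrative values only).
TOPIC = {"dil": 0.5, "gul": 0.3, "raah": 0.2}


def top_words(topic, n):
    """Return the n most probable words of a topic with their probabilities."""
    return sorted(topic.items(), key=lambda item: item[1], reverse=True)[:n]


# A probability distribution assigns every word a weight, and the weights sum to 1.
assert math.isclose(sum(TOPIC.values()), 1.0)
print(top_words(TOPIC, 2))
```

In a real model the distribution covers the whole vocabulary, so a printed list of the top 20 words shows only the largest weights and does not sum to 1.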

Example: 15 topics from Ghalib's Divan


In [11]:
def get_verses():
    global model
    global corpus
    text_topics = [ model [x] for x in corpus ]
    da = numpy.zeros((len(text_topics),model.num_topics))
    for i, v in enumerate(text_topics):
        for topic, value in v:
            da[i,topic] = value
    df = pandas.DataFrame(da) # probably a way to compress the above
    verses_out = {}

    for i in range (model.num_topics):
        verses = []
        for x in df.sort_values(by=i, ascending=False)[i].index:
            v = df[i][x]
            if (v > 0):
                verses.append(verses_orig[x])

        verses_out['topic_'+str(i)]=verses
    return verses_out


num_words = 20
data = {'topic_words': [model.show_topic(i,topn=num_words) for i in range(model.num_topics)],
        'topic_verses': get_verses()}



In [12]:
for x in range(model.num_topics):
    print('Topic #',x+1)
    for w in data['topic_words'][x]:print(w)


Topic # 1
('دل', 0.022736092999611903)
('پانو', 0.021338669050970229)
('عشق', 0.012930060118660762)
('طرح', 0.010336583438885157)
('چشم', 0.0092438153010541979)
('سایہ', 0.0088866184981056893)
('حسرت', 0.008492627787419044)
('ناز', 0.0082843407414475139)
('جلوہ', 0.00798499554293787)
('لذّت', 0.0079605382647480027)
('برق', 0.0079605381751552632)
('وفا', 0.0073417809554842308)
('نالہ', 0.0066580176002507428)
('یار', 0.0065524500020972109)
('زنجیر', 0.0065242641033103135)
('ستم', 0.006475336314957207)
('ذوق', 0.0062236936802671054)
('یاد', 0.0061957259363934003)
('گریہ', 0.0057341318381815476)
('گرم', 0.0053361851137337249)
Topic # 2
('وفا', 0.027794057874284701)
('دل', 0.015887778292937818)
('گل', 0.015444968917307504)
('عشق', 0.014884167145019986)
('خیال', 0.012919727517173437)
('گویا', 0.010789902417500673)
('آنکھ', 0.010726919487730352)
('سلامت', 0.010460352269209879)
('نالہ', 0.0092749655295324904)
('عمر', 0.0087477229181823733)
('نفس', 0.0074443804013880899)
('دن', 0.0074165423140923042)
('کار', 0.0068450084240849075)
('بزم', 0.0060547208605579187)
('راہ', 0.0058154651453025796)
('کہیں', 0.0056137951544426015)
('چمن', 0.0056137947676755352)
('ہوا', 0.0056137942516706184)
('رقیب', 0.0056003226234377706)
('بہار', 0.0056003218672578958)
Topic # 3
('غم', 0.025622882861400017)
('جہان', 0.015718407092265411)
('گل', 0.013610407205971758)
('دل', 0.011745912613602949)
('گرمی', 0.011644661808423126)
('بیان', 0.0083596056625632985)
('راہ', 0.0082244838592983489)
('غافل', 0.0082095997911397073)
('رگ', 0.008146321476258914)
('سو', 0.0076659700242010581)
('خاک', 0.0074486982964239975)
('خون', 0.007186153543220968)
('خوش', 0.0062020711518205587)
('عمر', 0.0062020707824539632)
('نظر', 0.006202070553182986)
('دن', 0.0062020700699835021)
('ہمہ', 0.0062020689060788975)
('مزہ', 0.0061871853064659789)
('کلام', 0.0061871849896173673)
('موت', 0.0061871846532556632)
Topic # 4
('دل', 0.05041971355100127)
('قطرہ', 0.014105851692029613)
('خاک', 0.013860880822899111)
('گل', 0.012030816699981)
('آئینہ', 0.01193775150892171)
('نگاہ', 0.011180978747599182)
('کون', 0.010401118911069608)
('خون', 0.0099258752917956682)
('ہوا', 0.0098849487312760356)
('آہ', 0.0092792933423818565)
('شب', 0.0091894315079985509)
('زلف', 0.0090362074186908659)
('چشم', 0.0083219939757781074)
('آج', 0.0081444219428388163)
('ناز', 0.0078713409958350612)
('قدح', 0.0078599430320199538)
('خس', 0.0077641744945999127)
('کہیں', 0.0076395252552082628)
('نالہ', 0.0069599851734542693)
('ذوق', 0.0067968489654221306)
Topic # 5
('گھر', 0.023309608208798009)
('در', 0.015445184471060641)
('بعد', 0.014751576634164999)
('دل', 0.014176421357360006)
('جان', 0.01284000836617206)
('انداز', 0.011871010786165788)
('عشق', 0.01048696082487849)
('بلا', 0.0092560269404682299)
('ہوا', 0.0085203284156983063)
('راز', 0.0081341437008652755)
('کاش', 0.0079410219980854801)
('دربان', 0.0079110778830626515)
('ہاتھ', 0.0068108158217238994)
('داغ', 0.0067713445028595711)
('نگاہ', 0.0064308175260683609)
('بن', 0.0060758271711368978)
('تمام', 0.0060629877366885207)
('تماشا', 0.0060629876700849499)
('جام', 0.0060629876495635034)
('بے خودی', 0.0060629873705233201)
Topic # 6
('دم', 0.018511877775816418)
('دیوار', 0.018501584788383811)
('گل', 0.015694841611775052)
('در', 0.014758044623465407)
('دل', 0.012471541357193704)
('غم', 0.011528889320038578)
('دنیا', 0.011298046721284115)
('یاد', 0.011142780332343297)
('ہی', 0.010261956472086491)
('شمع', 0.0099197072025559458)
('شب', 0.0098031784506216087)
('خون', 0.0094034348301845812)
('نالہ', 0.0092192701467490069)
('مژگان', 0.0085910426682053748)
('کب', 0.0085910426341529594)
('گناہ', 0.0082778173040333278)
('دوست', 0.0080288084437923225)
('ظلم', 0.0075673074664049447)
('دن', 0.0075249425916284222)
('غیر', 0.0072300809676694401)
Topic # 7
('ناز', 0.013831197600701119)
('نام', 0.012358216119504313)
('حسن', 0.010437720206937826)
('جز', 0.010096258319940656)
('جہان', 0.0085917947684954823)
('اہل', 0.0083715117663382185)
('شکوہ', 0.0073336021527504293)
('دل', 0.006845037700095296)
('حسرت', 0.006399387544355508)
('ضعف', 0.0063840670549961251)
('خم', 0.0063840665364749338)
('ذرّہ', 0.006384066062915284)
('محبّت', 0.0063840660121438257)
('اشارہ', 0.0063840658224433283)
('وقت', 0.0063840657293934141)
('دریا', 0.0063840654967053994)
('دن', 0.0063840653597772275)
('ادھر', 0.0063840652642626833)
('کرم', 0.0063840650885111823)
('گل', 0.0063840637522560957)
Topic # 8
('ہاتھ', 0.017464573864138803)
('جنون', 0.016148039359724153)
('کام', 0.014785650519624828)
('گھر', 0.013058746083672504)
('دل', 0.012959888665430203)
('تکلف', 0.011189757063899946)
('در', 0.011142879759635646)
('شوق', 0.010714592097478683)
('قیامت', 0.0094918521266664068)
('ناز', 0.0094349574495925879)
('بھلا', 0.0094349563198053744)
('شب', 0.0089835260416242547)
('نگاہ', 0.0087468348897242563)
('پا', 0.0082325515007794121)
('نقاب', 0.0075838106498020133)
('فرصت', 0.0075700872307403249)
('ہر چند', 0.0075700862120902952)
('معلوم', 0.0075700861643793731)
('نظّارہ', 0.007554729498508328)
('مدّعی', 0.0075547286671288271)
Topic # 9
('دل', 0.04023760413812881)
('بات', 0.0290498680139974)
('زبان', 0.014175927378541491)
('رب', 0.012766356618331181)
('خطّ', 0.012653697642522785)
('پا', 0.012015991952424046)
('منہ', 0.010939552003186673)
('بستر', 0.010026296306401363)
('زخم', 0.01000880560962638)
('خیال', 0.009749446692715269)
('گل', 0.0080297570562640787)
('فروغ', 0.0071758487113930359)
('قیامت', 0.0071453348915123801)
('دامن', 0.0070236495853165659)
('عاشق', 0.0059848651815830791)
('دن', 0.0057863184084896817)
('شوق', 0.0057758655263455564)
('خواب', 0.0057758652886920683)
('نگاہ', 0.0057741077481102817)
('جنون', 0.0057728805526259764)
Topic # 10
('دل', 0.025234331607359804)
('آرزو', 0.012392034532753757)
('دشمن', 0.012039097870045435)
('غیر', 0.011345663241902455)
('بہار', 0.0095990972797924696)
('رنگ', 0.0095419402499592257)
('چراغ', 0.0093744371057173474)
('ساغر', 0.0093023425998495675)
('ناز', 0.0091395733149186136)
('نالہ', 0.0088223793998582958)
('روز', 0.0078408628480522453)
('گل', 0.0071706656334045256)
('قاتل', 0.0063346677960341808)
('گردن', 0.0063346670284926271)
('ماہ', 0.0063346670030499062)
('پا', 0.0063346670021226558)
('رحم', 0.0063346666707661346)
('خواب', 0.0062690639134423956)
('آدمی', 0.006143254446050757)
('عرض', 0.0061359584979101796)
Topic # 11
('خدا', 0.022604420886035096)
('تماشا', 0.02079219734180111)
('رشک', 0.017830815671307194)
('رنگ', 0.01652430901960247)
('مے', 0.01290609783534594)
('دن', 0.011586375330216288)
('دل', 0.010868587184112738)
('آئینہ', 0.0091665953095151163)
('گل', 0.0085652010976996745)
('جان', 0.0083015641262586624)
('چشم', 0.0081814234435345495)
('نام', 0.008164283941088344)
('جلوہ', 0.0076527299373483756)
('صورت', 0.0070092114011424175)
('تیغ', 0.0068934563189608075)
('دوست', 0.0067509651279544294)
('خون', 0.0064947639236183439)
('رات', 0.0064510235995774694)
('نالہ', 0.006280324677179986)
('در', 0.0062029696992730084)
Topic # 12
('دل', 0.031966974796712844)
('قسمت', 0.020302171053087327)
('نالہ', 0.012269387791534058)
('آج', 0.011268772021738009)
('لوگ', 0.010262042894324081)
('جگر', 0.0087730887332348947)
('تن', 0.0086397365602927916)
('عالم', 0.008606833470376973)
('مژگان', 0.0085520820147755392)
('مجنون', 0.008424301839214058)
('رب', 0.0082359153362187732)
('عاشق', 0.0078863409302109676)
('اہل', 0.0069400076498063361)
('منہ', 0.0069400073125718458)
('کافر', 0.006940006904005867)
('غم', 0.0069400048121882165)
('گریبان', 0.0069101864884623119)
('در', 0.0067716357986007426)
('فتنہ', 0.006682466605414673)
('یار', 0.0063330537463705903)
Topic # 13
('دل', 0.044682794220116506)
('یار', 0.017269596485425774)
('نگاہ', 0.016811598989820632)
('دریا', 0.01601464066523995)
('بزم', 0.014610066036932027)
('آج', 0.014342786034121551)
('غیر', 0.013025534455991895)
('شوق', 0.011756663925285285)
('زخم', 0.010594936002021497)
('دیدار', 0.010455565660356955)
('ساقی', 0.01020230476486233)
('ناز', 0.009451871587602103)
('گل', 0.0094181230095751675)
('یاد', 0.0093673534421823672)
('مے', 0.0090358113233373581)
('نمک', 0.008842119737162665)
('خون', 0.00871323289443021)
('حسن', 0.0086611199410975576)
('حسرت', 0.0082215502421713899)
('جلوہ', 0.0078400305210077608)
Topic # 14
('دل', 0.030190721345750366)
('راہ', 0.022890896044669626)
('غم', 0.01804021414847249)
('نظر', 0.015682256651815833)
('گل', 0.015401974283406351)
('خیال', 0.01494062261072036)
('جگر', 0.010398141979110556)
('طاقت', 0.0097110408298777904)
('حاصل', 0.0096219132456647066)
('دام', 0.0083070577842033998)
('دیدہ', 0.008300210856018661)
('جی', 0.0082991463717669717)
('ذوق', 0.0069503367797766535)
('بسکہ', 0.0069320282924249651)
('جنون', 0.0069258061110748616)
('جوش', 0.0069111012480901363)
('مدّعا', 0.0068222625217949509)
('عشق', 0.0068210194631880029)
('باغ', 0.0068114707484331341)
('عالم', 0.006755594461197543)
Topic # 15
('دل', 0.025652129437047443)
('نگاہ', 0.01558303533237777)
('آخر', 0.014847281479119991)
('جگر', 0.014460563980758932)
('مے', 0.011926641669493778)
('آب', 0.011336452217444748)
('جان', 0.010505940830025564)
('زخم', 0.010321444489810459)
('بات', 0.0097904790255635005)
('سینہ', 0.0093121895624051054)
('پردہ', 0.0093060425620432544)
('شراب', 0.0083128960434106156)
('دیوار', 0.0082811655093723326)
('ناز', 0.0081671035440689412)
('بزم', 0.0066836998151170576)
('دست', 0.0065697110312131655)
('روزن', 0.0065190581193779566)
('درد', 0.0064334954765363772)
('خو', 0.0063915047783639897)
('عشق', 0.0055962096973080876)

Alternative Visualization as Interactive Word Clouds using d3.js


In [13]:
clouds_template='''
<!DOCTYPE html>
<meta charset="utf-8">
<head>
<script type="text/javascript" src="d3/d3.js"></script>
<script type="text/javascript" src="d3-cloud/d3.layout.cloud.js"></script>
<script type="application/json" id="data">

{{topic_words_json}}

</script>


</head>

<body>
<div id="models" style="width:50%;float:left">
</div>
<div id="texts" style="width:50%;float:left">
</div>

<script>

var fill = d3.scale.category20();

var word_data;

function make_cloud(cloud,id){
    
    
    words = cloud.map(function(d){
        return {text:d[0],size:d[1]*2000}
      }).sort(function(a,b){
        return b.size - a.size;
      });
    
    word_data = words;
      
    d3.layout.cloud().size([800, 800])
      .words(words)
      .padding(1)
      .rotate(function() { return 0})//~~(Math.random() * 2) * 90; })
      .font("Impact")
      .fontSize(function(d) { return d.size; })
      .on("end", draw)
      .start();
    
    function show_text(id){
    
        d3.select("div#texts").selectAll('p').remove();
        for (i=0; i<10;i++){//topic_verses[id].length; i++){
            d3.select("div#texts").append("p").style("font-family", "Jameel Noori Nastaleeq").style("font-size","16px").text(topic_verses[id][i]).append("br");
        }
        
 
    }
    
    
    function draw(words) {
      d3.select("div#models").append("svg")
          .attr("width", 400)
          .attr("height", 400)
        .attr("id",id)
        .on("click",function(d) {show_text(this.id) } )
        .append("g")
          .attr("transform", "translate(400,400)")
        .selectAll("text")
          .data(words)
        .enter().append("text")
          .style("font-size", function(d) { return d.size + "px"; })
          .style("font-family", "Jameel Noori Nastaleeq")
          .style("fill", function(d, i) { return 0;})//fill(i); })
          .attr("text-anchor", "middle")
          .attr("transform", function(d) {
            return "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")";
          })
          .text(function(d) { return d.text; });
    }
    
    
}


var num_topics = {{num_topics}};

var json_data = JSON.parse(document.getElementById('data').innerHTML);
topic_words = json_data['topic_words'];
topic_verses = json_data['topic_verses'];
for (i=0;i<num_topics;i++) {
  id = "topic_"+i;
  make_cloud(topic_words[i], id);
  
}
</script>
</body>
</html>
'''

from IPython.display import IFrame
import os
import json
num_words = 100
    
count=0
last_fn = None
def serve_html(s,w,h):
    import os
    global count
    count+=1
    fn= '__tmp'+str(os.getpid())+'_'+str(count)+'.html'
    global last_fn
    last_fn = fn
    with open(fn,'w') as f:
        f.write(s)
    return IFrame('files/'+fn,w,h)

def gen_clouds():
    global model
    num_words = 100
    data = {'topic_words': [model.show_topic(i,topn=num_words) for i in range(model.num_topics)],
            'topic_verses': get_verses()}
    topic_words_json = json.dumps(data)
    s=Template(clouds_template).render(num_topics=model.num_topics,topic_words_json = topic_words_json)

    with open('word-cloud.html',"w") as f:
        f.write(s)
#    IFrame('word-cloud.html',width=1200,height=800)
    #return(serve_html(s,1200,800))

gen_clouds()
IFrame('word-cloud.html',width=1200,height=800)
#IFrame


Out[13]:

Why topic model?

information extraction

show changes over time or text

compare one text/corpus with another

get access to texts in different ways

How does this model of the topic or theme align with Urdu-based rhetorical understandings?

more specifically, how does it compare to the idea of the maẓmūn (theme)

راہ مضمون تازہ بند نہیں
تا قیامت کھلا ہے باب سخن

The road of fresh themes is not closed

The gate of poetry is open until Doomsday

-Valī Dakkanī (1667-1707)

maẓmūn āfrīnī: creation of themes

the beloved is a hunter

the beloved lies in wait for the prey

the hunter slaughters the prey

the hunter makes the prey into a kabob

the beloved is the prey

Perhaps as a Resource Description Framework (RDF) triple?

subject -> predicate -> object
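One way to experiment with this idea before committing to a full RDF toolchain is to hold the triples as plain tuples. The sketch below is only that, a sketch: the `Triple` type and `objects_of` helper are hypothetical, and the statements are examples of the beloved/lover theme rather than an encoded dataset.

```python
from collections import namedtuple

# A triple is just (subject, predicate, object).
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Example statements for the beloved/lover theme (illustrative only).
THEME = [
    Triple("Beloved", "hunts", "Lover"),
    Triple("Beloved", "exhibits", "Cruelty"),
    Triple("Lover", "loves", "Beloved"),
    Triple("Lover", "suffers", "Cruelty"),
]


def objects_of(triples, subject, predicate=None):
    """List the objects linked to a subject, optionally through one predicate."""
    return [t.object for t in triples
            if t.subject == subject and (predicate is None or t.predicate == predicate)]


print(objects_of(THEME, "Beloved"))
print(objects_of(THEME, "Lover", "loves"))
```

Because each statement is a separate triple, new extensions of a theme can be added without restructuring what is already recorded.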


In [14]:
import pydot

dot_object = pydot.Dot(graph_name="main_graph",rankdir="LR", labelloc='b', 
                       labeljust='r', ranksep=1)

node1 = pydot.Node(name='node1', texlbl=r'topic1', label='Subject', shape='square')
dot_object.add_node(node1)
node2 = pydot.Node(name='node2', texlbl=r'topic2', label='Object', shape='square')
dot_object.add_node(node2)
dot_object.add_edge(pydot.Edge(node1, node2,label="Predicate"))
#dot_object.write('graph.dotfile', format='raw', prog='dot')
dot_object.write_png('basic_triple.png', prog='dot')
from IPython.display import Image
#Image('basic_triple.png')


In [15]:
#import pydot

dot_object = pydot.Dot(graph_name="main_graph",rankdir="LR", labelloc='b', 
                       labeljust='r', ranksep=1)

node1 = pydot.Node(name='node1', texlbl=r'topic1', label='Beloved', shape='square')
dot_object.add_node(node1)
node2 = pydot.Node(name='node2', texlbl=r'topic2', label='Lover', shape='square')
dot_object.add_node(node2)
node3 = pydot.Node(name='node3', texlbl=r'topic3', label='Cruelty', shape='square')
dot_object.add_node(node3)
dot_object.add_edge(pydot.Edge(node1, node2,label="hunts"))
dot_object.add_edge(pydot.Edge(node1, node3,label="exhibits"))
dot_object.add_edge(pydot.Edge(node2, node1,label="loves"))
dot_object.add_edge(pydot.Edge(node2, node3,label="suffers"))
#dot_object.write('graph.dotfile', format='raw', prog='dot')
dot_object.write_png('example_triple1.png', prog='dot')
from IPython.display import Image
#Image('example_triple1.png')

Thanks!

Sean Pue

pue@msu.edu

@seanpue

