LOCUS Presentation: Topic Modeling Urdu Poetry

A. Sean Pue, Michigan State University

@seanpue, pue@msu.edu, seanpue.com

Code and presentation available on GitHub:

http://seanpue.github.com/LOCUS

Note: The repository uses several submodules.

Focus today is on the genre of the ghazal.

  • Consists of two-line couplets
  • Very popular, but each poem offers little text to work with

Idea of the Mazmun

  • Indigenous idea of a topic/theme
  • May be representable as an RDF triple, e.g. "the beloved","is","a rose"
  • Ghazals consist of complicated extensions of known themes
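
To make the triple idea concrete, here is a minimal sketch in plain Python. The `Triple` type and the `themes_about` helper are hypothetical names invented for illustration, and the second triple is a made-up example rather than one taken from the presentation; a real project might use an RDF library instead.

```python
from collections import namedtuple

# Hypothetical container: one mazmun as a (subject, predicate, object) triple.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

themes = [
    Triple("the beloved", "is", "a rose"),        # example from the slide
    Triple("the lover", "is", "a nightingale"),   # made-up illustration
]

def themes_about(subject, triples):
    """Return every triple whose subject matches the given string."""
    return [t for t in triples if t.subject == subject]

matches = themes_about("the beloved", themes)
print(matches[0].object)
```

Storing themes this way would let a couplet be tagged with the triples it extends, though how far actual ghazals fit this shape is an open question.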

Current Data Set: Divan-e Ghalib

  • Ghalib is a very important poet in Urdu
  • I am currently engaged in a concordance project, so I am creating lemmatized texts

Goal: Use Python, not R, and generate word clouds in Urdu

  • Python because I am lazy
  • Use IPython Notebook (reproducible web-based notebook interface)

  • Word Clouds: Problems in Urdu
  • The usual R "wordcloud" package does not work
  • Interested in the possibility of cloud-based textual exploration
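
A word cloud starts from word counts, so a rough sketch of that step may help. It assumes the lemmatized text is already split into one list of tokens per couplet. The tokens below are made-up stand-ins, not lines from the Divan, and the variable names are illustrative only.

```python
from collections import Counter

# Made-up lemmatized tokens, one list per couplet (not text from the Divan).
couplets = [
    ["دل", "عشق", "گل"],
    ["دل", "گل"],
    ["دل", "شمع"],
]

# Count every token across all couplets; Python 3 strings are Unicode,
# so Urdu tokens need no special handling here.
counts = Counter(token for couplet in couplets for token in couplet)

# The two most frequent lemmas with their counts.
top = counts.most_common(2)
print(top)
```

The counting itself is script-agnostic; the difficulties with Urdu show up at the rendering stage.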

Current Experiment: Using gensim (topic modeling module); adapted jasondavies/d3-cloud

  • Gensim is "topic modeling for humans"; easy to use
  • d3.js is a JavaScript graphics library
  • jasondavies/d3-cloud is a brilliant piece of code to generate word clouds in a web browser
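
One way the two pieces might connect is sketched below: topic weights are rescaled to font sizes and serialized as JSON for the browser. The `topic` list is hypothetical data shaped like the (word, weight) pairs a gensim topic model reports. The `to_cloud_words` helper, the linear scaling, the pixel range, and the `text`/`size` keys are all assumptions; the keys only work if the page's d3-cloud accessors are set up to read them.

```python
import json

# Hypothetical (word, weight) pairs, shaped like a topic's top words.
topic = [("دل", 0.30), ("گل", 0.15), ("عشق", 0.05)]

def to_cloud_words(pairs, min_size=12, max_size=60):
    """Scale weights linearly so the heaviest word gets max_size pixels."""
    top = max(weight for _, weight in pairs)
    span = max_size - min_size
    return [
        {"text": word, "size": round(min_size + span * weight / top)}
        for word, weight in pairs
    ]

words = to_cloud_words(topic)

# ensure_ascii=False keeps the Urdu readable instead of \u escapes.
payload = json.dumps(words, ensure_ascii=False)
print(payload)
```

Linear scaling is an arbitrary choice here; a square-root or logarithmic scale may give a more balanced cloud when a few words dominate a topic.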
