[Data, the Humanist's New Best Friend](index.ipynb)
*Course Description*

Course Supervisor: Juan Luis Suárez
E-mail: jsuarez@uwo.ca
Instructor: Javier de la Rosa
E-mail: jdelaro@uwo.ca
Office: AHB 1R14
Office Hours: Tuesdays 2:30pm-4:30pm
Meets: Winter 2015, Tuesdays 4:30pm-6:00pm, Wednesdays 4:30pm-6:00pm
Room: UC 207

About

This course is a hands-on, pragmatic introduction to computer tools and to the theoretical aspects of the new uses of data by humanists of different disciplines. It will also serve as an introduction to the techniques and methods used today to make sense of data from a Humanities point of view.

In that sense, Data, the Humanist's New Best Friend is divided into three blocks (plus one extra block that covers a programming review):

  • Data Mining, explaining the past and predicting the future by means of data analysis.
  • Text Analysis, producing valuable information from text sources.
  • Network Science, understanding complex structures by analyzing the relationships among their entities.

Justification

We find computers and software in almost any field of study, from the STEM disciplines (Science, Technology, Engineering and Mathematics) to Education and the Arts. In the Humanities, for example, there is even a whole sub-discipline (or a discipline in itself, depending on who you ask) that tries to find answers to new and old questions by digital means. This approach, referred to as Digital Humanities, usually borrows methods from the sciences to analyze its own data and produce interpretations and conclusions.

Data is the new currency: it is easy to produce and easy to collect, although analysis remains one of the most complex steps in any analytical workflow. This tendency to focus on data as the Holy Grail that solves everything, however arguable, is producing important advances in the theoretical framework of Data Science, which are later translated into methods and tools mature enough to be extrapolated to other fields. One of those fields is the Digital Humanities.

Being able to manipulate, manage, and get value out of data is not only an in-demand skill for researchers, but also a future-proof recipe for digital humanists in a field that is becoming familiar with the fast-paced introduction of innovative products for managing data.

Unfortunately, in the Humanities, data may come in a variety of formats: from mere CSV files or tables, to books, blog posts, tweets, or even network data from social or literary networks. Obviously, critical thinking and content-related courses are completely necessary, but if we do not teach courses on new tools and methods, our next generation of students will graduate with limited knowledge of what can be done and how. It is not only about tools, software, or applications; rather, it is about giving students the skills they need to adapt to the changing environment of research. We do not know whether Python will be widely used in 5 or 10 years, but we have to prepare our students to go beyond the next trend and apply their knowledge to whatever the newest and coolest tool turns out to be.

This course focuses on three main areas that I consider the most important: data mining, text analysis, and network science. This choice is not arbitrary; it is based on two criteria:

  • The most common topics at conferences such as Digital Humanities (the biggest conference in the field worldwide), the Canadian Society for Digital Humanities conference, or the Summer Institute initiative.
  • The skills most in demand in Digital Humanities job postings.

Philosophy of Teaching

I have a strong background in Computer Science. When I first enrolled in Computer Engineering I was very disappointed, because I did not see a computer until three months into my program. All my colleagues felt the same way. And when we finally sat in front of a computer, we were limited to doing only what the teacher wanted us to do. My first years in college were deeply frustrating for someone who wanted to be a roboticist but was studying only mathematics, physics, and more mathematics. After finishing my Bachelor’s and Master’s degrees, I decided that if I ever had the chance to teach a computer science related course, it would be a learning-by-doing course.

After several years, I came to understand the point of all the mathematical knowledge they had taught us, but even so, I found that it is not the best way to teach computer science. People need to participate and to experience things through trial and error. For this reason I believe the teacher must play the role of a "facilitator" of learning, not the role of an "expert" who only transmits information in one direction.

Personally, I believe that one successful way to reach students is by “speaking” their own language. Keeping myself up to date with the trends and technologies they use means students may feel more comfortable and confident in class. Making the content appealing as well as challenging is also a plus, helping to catch students' interest and attention.

A positive attitude is a must-have for a teacher, and passion for what he or she teaches is just as important. The process of teaching, interestingly, is in itself an amazing method of learning for an educator. For that reason, there must be an appropriate educational environment in which students can freely share and express their ideas. The teacher is there to encourage the students' pursuit of knowledge, but is also responsible for the instruction that comes out of these exchanges. Small daily activities are intended to achieve this, promoting pair or group work whenever possible.

Methodology

As stated previously, the content is divided into 1+3 blocks: a programming review to get hands-on with Python and the IPython Notebook, including setup and in-class activities; data mining and analysis, with activities and one assignment; text analysis and processing, with activities and one assignment; and graph and network analysis, also with activities and one assignment.

The course will employ the format of micro-lectures, since students in classes using this format have been shown to outperform those in traditional lecture classes; they also report being happier and more engaged. Class time is split between Tuesdays and Wednesdays. Officially, Wednesdays consist of one class hour plus one lab hour, but since class time is lab time too, there will be no change of room: we will meet in a computer lab throughout.

All micro-lectures will be delivered in the form of interactive, downloadable, and executable IPython Notebooks, made available on the course website. Students must come to class prepared, as they are expected to take an active part in the lecture and activities. After the teacher explains concepts from the recommended readings, and in order to ensure active participation, students will be assigned in-class activities to solve, either in small groups or in pairs, within a short period of time. While students work on their activities, the teacher will help resolve doubts and will finally walk the class through the answer.

Moreover, after each of the main blocks of content, students will complete an assignment that puts into practice the concepts learned in class. Overall, this is a highly interactive course that will allow students to actively learn by doing.

The use of electronic devices is highly recommended, as long as they can be used to run and experiment with the examples and activities from the micro-lectures.

Classes and Participation (15%)

Notebooks with the content, related exercises, and the readings will be available to students on the course website. The readings in the course calendar are suggested to aid the student; they are not required to understand the lectures (although they are good research material). Every day, to promote active learning, small activities will be proposed for students to solve, working in pairs or small groups. For this reason the room should be a laboratory, as some students may not have a laptop.

At the end of each block, an assignment covering as many concepts from the lectures as possible will be proposed. Students will then have more than one week to complete it, individually or in pairs.

IPython Notebooks

Goal: Deliver the content in an interactive way that adapts to the needs and interests of the students, keeping them engaged.

In order to make the lectures more practical, the notebooks are executable pieces of code and text that run in the browser, so students can interact with them. Notebooks might be long. The idea behind long Notebooks is to cover as much content as possible, so that, depending on the interests and needs of the students, lectures can go deeper in one direction or another. In this way, students become more engaged with the lectures while actively participating in class by voicing their interests.

Activities

Goal: Guarantee that students understand the lectures.

Activities will be proposed daily, usually several times a day. They are conceived to put the concepts of the lecture into practice as they are explained, and will be evaluated only as participation. After students are given time to solve them, all activities will be explained to the class by the teacher or by volunteers. The last activity of the day may be left as homework, to be solved by the next class day.

Assignments (45%)

There will be three different assignments based on the three main practical blocks of content (excluding the review), each of which will cover a real-world case using Humanities data. All assignments must be written as IPython Notebooks. Each assignment has 5 parts, evaluated as follows: getting and cleaning the data (3%), summarizing the data and extracting relevant information (3%), visualizing the information (3%), stating the conclusions (3%), and presentation, correctness, functionality, and adequacy of code (3%). All assignments will be marked from 0 (not submitted) to 100.

  • Data Mining (15%). This assignment aims to validate the student's acquired knowledge of data wrangling: converting or mapping data from one raw form into another format that allows more convenient consumption, with the help of the libraries and methods explained in class. It will also include data visualization, data aggregation, and statistical measures. The assignment will include a Notebook template that students must fill in; the template will contain some excerpts of code already written and others to be completed. The dataset to be used will also be included, although students may use their own datasets if they wish. A proposal for this assignment will be accessible as the Notebook Assignment 1.
  • Text Analysis (15%). In this assignment the student is expected to extract information from, and run different analyses on, the text of Moby-Dick; or, The Whale, by Herman Melville, from Project Gutenberg, in order to reach conclusions without reading the book. Other texts, web sources, or corpora can be used by the student after teacher approval. The template for the assignment will include guidelines and function prototypes for the student to compute measures such as frequency distributions, word and letter counts, richness, average sentence and paragraph length, entity extraction, and grammatical structures. Visualizations are also required. Conclusions for this assignment need to address the topic of the book and the author's writing style. The proposal will be available as the Notebook Assignment 2.
  • Network Science (15%). Unlike the previous assignments, the dataset is not given. Instead, the student is expected to use a graph of his or her choice from the Koblenz University repositories. The graph can also be created by the student, although in both cases a minimum of 100 nodes and 300 relationships is mandatory. Smaller but denser graphs can be used upon approval by the teacher. The template for this assignment will be made available as the Notebook Assignment 3. It will include some guidance on the expected analyses to run (degree, density, diameter, betweenness, average shortest path, closeness, clustering, modularity, and PageRank), but this time there will be no prewritten code, and almost everything will need to be written by the student.

Final Project (40%)

For the final project, the student must choose a problem, phenomenon, dataset, or topic of interest to them, and use at least two of the three blocks to write a Notebook about it. Therefore, all final projects must include at least one of these pairs of contents: Data Mining and Text Analysis, Text Analysis and Network Science, or Data Mining and Network Science. Ideally, projects should include aspects of all the main blocks, since Data Mining introduces several transversal concepts.

  • Proposal (5%).

    • One page, single-spaced, Times New Roman or Georgia, 12pt (1%).
    • Extensive description of the blocks of content chosen and reasons (1%).
    • Description of the topic and motivations (1%).
    • Description of the intended approach and methodology (1%).
    • Tentative bibliography (1%).
  • Notebook (25%)

    • Description of the problem to solve and motivations (2%).
    • Analysis of the problem and possible ways to approach it (3%).
    • Methodology and theoretical framework (5%).
    • Description of the approach or solution and tools used (10%).
    • Conclusions (5%).
  • Oral presentation (10%)

    • Clarity of the explanations and tools used (5%).
    • Support material (3%).
    • Conclusions and questions (2%).

There is no minimum length for the Notebook, as long as the project covers all the aspects above. A small bibliography is mandatory (APA, MLA, etc., but consistent). The deadline is April 8th. This falls after the oral presentations, so students have the chance to improve their work based on the comments and feedback received during the presentation. A template Notebook Project will be provided.

The Final Project will be marked from 0 (not submitted) to 100.

Evaluation

  • Participation: 15% (including exercises sent at the end of the lectures)
  • Assignments: 45% (3 assignments, 15% each)
  • Final Project: 40% (5% Proposal, 25% Notebook, 10% Oral Presentation)

Course Plan

Block 0: What is This Course About?

This class will present the course, its methodology, and the IPython Notebook environment.

Block 1: How to Think Like a Computer Scientist [Review]

This class will cover the basic syntax for building Python programs.

This class will introduce basic programming abstractions called functions.

This class completes the basics of the Python language.

Block 2: Data Mining

This class will introduce new data types and show how to read and write from and to a file.

  • Goals:
    • Handle basic file input and output operations.
    • Understand the need for more complex data types.
  • Topics:
    • Getting data into and out of Python
    • Objects
    • NumPy
    • Arrays and Matrices
    • SciPy
  • Recommended readings:
  • Activity example:

    • The data in populations.txt describes the populations of hares and lynxes (and carrots) in northern Canada over 20 years. Compute and print, based on the data in populations.txt (a sketch of a solution follows this list):
      1. The mean and standard deviation of the populations of each species for the years in the period.
      2. The year in which each species had its largest population.
      3. The species with the largest population for each year. (Hint: argsort & fancy indexing of np.array(['H', 'L', 'C'])).
      4. The years in which any of the populations is above 50000. (Hint: comparisons and np.any).
      5. The top 2 years for each species when they had the lowest populations. (Hint: argsort, fancy indexing).

    ... all without for-loops.
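
A minimal sketch of a solution is shown below; it assumes populations.txt holds four whitespace-separated columns (year, hare, lynx, carrot).

```python
import numpy as np

# Load the table; columns assumed to be year, hare, lynx, carrot.
data = np.loadtxt('populations.txt')
year, populations = data[:, 0], data[:, 1:]
species = np.array(['H', 'L', 'C'])

# 1. Mean and standard deviation of each species over the period.
print(populations.mean(axis=0))
print(populations.std(axis=0))

# 2. Year in which each species had its largest population.
print(year[populations.argmax(axis=0)])

# 3. Species with the largest population in each year (fancy indexing).
print(species[populations.argmax(axis=1)])

# 4. Years in which any population is above 50000.
print(year[np.any(populations > 50000, axis=1)])

# 5. For each species, the top 2 years with the lowest populations.
print(year[populations.argsort(axis=0)[:2]])
```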

This class will cover common tasks for cleaning, filtering and grouping data.

  • Goal:
    • Be able to clean messy datasets and filter out what's important for the researcher.
  • Topics:
    • Pandas
    • Cleaning data
    • Summary statistics
    • Indexing
    • Merging, joining
    • Group by
  • Recommended readings:
  • Activity example:
    • Given the arts data frame, do the following (a sketch of a solution follows this list):
      1. Clean the dates so you only see numbers.
      2. Get the average execution year per artist.
      3. Get the average execution year per category.
      4. Get the number of artworks per artist. Which artist is the most prolific?
      5. Get the number of artworks per category. Which category has the highest number?
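
A sketch of a solution; the file name and the column names (artist, category, date) are assumptions, since the actual arts data frame is handed out in class.

```python
import pandas as pd

# Stand-in for the arts data frame provided in class (names assumed).
arts = pd.read_csv('arts.csv')

# 1. Clean the dates so only a four-digit year remains.
arts['year'] = (arts['date'].astype(str)
                            .str.extract(r'(\d{4})', expand=False)
                            .astype(float))

# 2. Average execution year per artist.
print(arts.groupby('artist')['year'].mean())

# 3. Average execution year per category.
print(arts.groupby('category')['year'].mean())

# 4. Number of artworks per artist; idxmax() names the most prolific one.
per_artist = arts.groupby('artist').size()
print(per_artist, per_artist.idxmax())

# 5. Number of artworks per category, and the largest category.
per_category = arts.groupby('category').size()
print(per_category, per_category.idxmax())
```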

This class will introduce data summarization and visualization methods.

Class 08

This class is about estimating the relationships among variables.

  • Goals:
    • Distinguish between correlation and causality.
    • Understand when there is correlation in the data.
  • Topics:
    • Statistical modeling
    • statsmodels
    • Correlation
    • Regression
    • Distributions
  • Recommended readings:
  • Activity example:
    • Given the arts data frame, try the following (a sketch follows this list):
      1. Is there any correlation between the periods of production and the number of artworks per artist? If so, what kind?
      2. What kind of distribution do the execution years follow?
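
One possible way to approach both questions, under the same column-name assumptions as in the previous activity:

```python
import statsmodels.api as sm

# Per-artist production period (last year minus first) and artwork count.
per_artist = arts.groupby('artist')['year'].agg(['min', 'max', 'size']).dropna()
per_artist['period'] = per_artist['max'] - per_artist['min']

# 1. Pearson correlation between production period and number of artworks,
#    plus a simple linear regression to inspect the relationship.
print(per_artist['period'].corr(per_artist['size']))
ols = sm.OLS(per_artist['size'], sm.add_constant(per_artist['period'])).fit()
print(ols.summary())

# 2. Plot the distribution of execution years to judge its shape
#    (roughly normal, skewed, bimodal, ...).
arts['year'].hist(bins=30)
```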

Class 09

This class will introduce the idea of systems that can learn from data and make predictions from it.
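
As a taste of what this looks like in code, here is a toy sketch using scikit-learn (an assumption; the lecture's actual toolkit is not fixed here):

```python
# Learn from one part of a dataset, predict on examples never seen before.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out data
```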

Class 10

  • Work on Assignment 1.

Block 3: Text Analysis

This class will cover the basic principles of Natural Language Processing.

  • Goal:
    • Understand how natural language is processed by machines, and the parts involved.
  • Topics:
    • NLTK
    • Tokenization
    • Concordance
    • Co-Occurrence and similarity
    • Word and phrase frequencies
    • Dispersion plots
    • TextBlob
  • Deadlines:
    • Assignment 1.
  • Recommended readings:
  • Activity example:
    • Create a function, most_common(text, n), that receives a list of words (or a Text) and a number n, and returns the n most common words. For example, most_common(moby_dick, 5) should return the 5 most common words: [',', 'the', '.', 'of', 'and']. A sketch follows.
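
A minimal sketch using collections.Counter (an nltk.FreqDist works equally well, since in NLTK 3 it is a subclass of Counter):

```python
from collections import Counter

def most_common(text, n):
    """Return the n most frequent tokens in a list of words or an nltk.Text."""
    return [word for word, count in Counter(text).most_common(n)]

# Expected: most_common(moby_dick, 5) -> [',', 'the', '.', 'of', 'and']
```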

This class will cover different ways to access and create corpora data.

  • Goal:
    • Collect and transform data into text ready to be analyzed.
  • Topics:
    • Corpora
    • Conditional Frequency Distributions
    • Sources of data
    • Language detection
    • Machine translation
  • Recommended readings:
  • Activity example:
    • Write a program that loads the feed of the Spanish Blog in Digital Humanities, gets the first 10 entries using feedparser, and, for each entry, returns the following information in English and without stopwords (Hint: take a look at the stopwords in NLTK under nltk.corpus.stopwords.words('spanish'); a sketch follows this list):
      • Title
      • Number of sentences
      • Number of words
      • Number of unique words (vocabulary)
      • Number of hapaxes
      • Top 10 most frequent words
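
A sketch of the skeleton of such a program; FEED_URL is a hypothetical placeholder for the blog's actual feed, and the translation into English is left out (TextBlob's translate() is one option), so the statistics below are computed on the Spanish text.

```python
import re

import feedparser
import nltk
from nltk.corpus import stopwords

FEED_URL = 'http://example.com/feed'  # hypothetical placeholder
spanish_stopwords = set(stopwords.words('spanish'))

feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:10]:
    # Crude HTML stripping of the entry summary.
    text = re.sub(r'<[^>]+>', ' ', entry.get('summary', ''))
    sentences = nltk.sent_tokenize(text, language='spanish')
    words = [w.lower() for w in nltk.word_tokenize(text, language='spanish')
             if w.isalpha() and w.lower() not in spanish_stopwords]
    freqs = nltk.FreqDist(words)
    print(entry.title)
    print('Sentences:', len(sentences))
    print('Words:', len(words))
    print('Vocabulary:', len(set(words)))
    print('Hapaxes:', len(freqs.hapaxes()))
    print('Top 10:', freqs.most_common(10))
```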

Class 13

This class will present some of the concepts behind modern statistical language applications and grammar analysis.

  • Goal:
    • Understand grammatical structures in order to carry out stylometric analyses.
  • Topics:
    • Regular expressions
    • Word inflection and lemmatization
    • Parsing
    • n-grams
    • Part-of-speech Tagging
  • Recommended readings:
  • Activity examples:
    • Write a program to classify contexts involving the word must according to the tag of the following word. Can this be used to discriminate between the epistemic and deontic uses of must?
    • Generate some statistics for tagged data in the Brown Corpus to answer the following questions (a sketch follows this list):
      • What proportion of word types are always assigned the same part-of-speech tag?
      • How many words are ambiguous, in the sense that they appear with at least two tags?
      • What percentage of word tokens involve these ambiguous words?
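
A sketch for the Brown Corpus statistics; restricting it to the 'news' category is an assumption, and any slice of the corpus would do.

```python
import nltk
from nltk.corpus import brown

tagged = brown.tagged_words(categories='news')

# Map each word type to the distribution of tags it appears with.
cfd = nltk.ConditionalFreqDist((word.lower(), tag) for word, tag in tagged)

types = cfd.conditions()
unambiguous = [w for w in types if len(cfd[w]) == 1]
ambiguous = set(w for w in types if len(cfd[w]) >= 2)

print(len(unambiguous) / len(types))   # proportion of single-tag word types
print(len(ambiguous))                  # number of ambiguous word types

tokens = [w.lower() for w, _ in tagged]
print(100 * sum(w in ambiguous for w in tokens) / len(tokens))  # % ambiguous tokens
```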

Class 15

This class will focus on measuring and extracting relevant information from texts.

Class 16

This class will present ways of classifying text and documents.

  • Goal:
    • Classify text and documents.
  • Topics:
    • Sentiment Analysis
    • Classifiers
    • Generative Writing
  • Recommended readings:
  • Activity example:
    • Word features can be very useful for performing document classification, since the words that appear in a document give a strong indication of its semantic content. However, many words occur very infrequently, and some of the most informative words in a document may never have occurred in our training data. One solution is to make use of a lexicon, which describes how different words relate to one another. Using the WordNet lexicon, augment the movie review document classifier presented in the readings to use features that generalize the words appearing in a document, making it more likely that they will match words found in the training data. A sketch of one generalization strategy follows.
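
A sketch of one possible generalization strategy (an assumption, not the only reasonable one): replace each word with the name of the first hypernym of its first synset, so that near-synonyms such as 'movie' and 'film' can map to the same feature.

```python
from nltk.corpus import wordnet as wn

def generalize(word):
    """Back off from a word to its first hypernym, when WordNet has one."""
    synsets = wn.synsets(word)
    if synsets and synsets[0].hypernyms():
        return synsets[0].hypernyms()[0].name()
    return word

def document_features(document):
    # Boolean features keyed on generalized words, to be plugged into the
    # movie review classifier in place of plain word features.
    return {'has({})'.format(generalize(w)): True for w in set(document)}
```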

Class 17

  • Work on Assignment 2

Block 4: Network Science

This class will introduce the basic concepts of networks and network analysis.

This class will present the idea of relevance in a network.

  • Goal:
    • Identify key entities in a network based on different criteria.
  • Topics:
    • Centrality
    • Degree
    • Betweenness
    • Closeness
    • Eigenvector
    • Current flow betweenness
    • Ego networks
  • Recommended readings:
  • Activity example:
    • Write a function, centrality_scatter(cent1, cent2), that receives two centrality dictionaries and plots each node label as a point, using each dictionary as one of the axes. Add a linear best-fit trend line, axis labels, and a title. A sketch follows.
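
A sketch of one possible implementation, assuming both dictionaries map the same node labels to scores (e.g. the output of networkx.degree_centrality):

```python
import matplotlib.pyplot as plt
import numpy as np

def centrality_scatter(cent1, cent2):
    nodes = sorted(cent1)
    x = np.array([cent1[n] for n in nodes])
    y = np.array([cent2[n] for n in nodes])
    fig, ax = plt.subplots()
    # Draw each node label at its pair of centrality scores.
    for label, xi, yi in zip(nodes, x, y):
        ax.text(xi, yi, str(label), ha='center', va='center')
    # Linear best-fit trend line.
    slope, intercept = np.polyfit(x, y, 1)
    ax.plot(x, slope * x + intercept, 'r--')
    ax.set_xlim(x.min(), x.max())
    ax.set_ylim(y.min(), y.max())
    ax.set_xlabel('Centrality 1')
    ax.set_ylabel('Centrality 2')
    ax.set_title('Centrality scatter plot')
    plt.show()
```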

Class 20

This class will focus on understanding network structures.

  • Goal:
    • Understand how entities behave differently when together.
  • Topics:
    • Random vs Scale Free
    • Small Worlds
    • Network Dynamics
    • Social Network Analysis
    • Modularity and Community Structure
  • Recommended readings:
  • Activity example:
    • Using the Twitter or Facebook examples seen in class, extract your own network and determine whether it follows the small-world principle. Then calculate the number of communities. A sketch follows.
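
A sketch of the small-world check on a stand-in graph; replace G with the network you extract, and note that the community step assumes the python-louvain package (imported as community).

```python
import community  # python-louvain (an assumption)
import networkx as nx

G = nx.karate_club_graph()  # stand-in for your extracted network

# Small-world check: a small world shows a similar average shortest path
# but much higher clustering than a random graph of the same size.
random_graph = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges())
print(nx.average_clustering(G), nx.average_clustering(random_graph))
if nx.is_connected(G) and nx.is_connected(random_graph):
    print(nx.average_shortest_path_length(G),
          nx.average_shortest_path_length(random_graph))

# Number of communities found by Louvain modularity optimization.
partition = community.best_partition(G)
print(len(set(partition.values())))
```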

Class 21

This class will present effective ways of modeling real data for persistence.

Class 22

This class will focus on building networks from plays and analyzing other forms of content in networks.

  • Goal:
    • Extract and analyze hidden networks from texts.
  • Topics:
    • Network Content Analysis
    • Plot Analysis
  • Deadlines:
    • Final Project proposal.
  • Recommended readings:
  • Activity example:
    • Pick a title from Project Gutenberg, one that is not a monologue, and extract its plot network. Then run network analysis measures on it. A sketch follows.
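
A sketch of one way to do this for a play: connect characters who speak in the same scene. The chosen title (Hamlet, as an example pick) and the speaker pattern are illustrative assumptions; every Gutenberg transcription formats speakers differently, so the regular expression must be tailored to the text.

```python
import itertools
import re
import urllib.request

import networkx as nx

URL = 'http://www.gutenberg.org/cache/epub/1524/pg1524.txt'  # example pick
text = urllib.request.urlopen(URL).read().decode('utf-8')

G = nx.Graph()
for scene in re.split(r'SCENE [IVXL]+\.', text):
    # Assume speaker names appear capitalized at the start of a line,
    # followed by a period (adjust the pattern to the actual text).
    speakers = set(re.findall(r'^([A-Z][A-Za-z]+)\.', scene, re.M))
    for a, b in itertools.combinations(sorted(speakers), 2):
        G.add_edge(a, b)

# Run the usual measures on the extracted network.
print(nx.density(G))
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:5])
```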

Class 23

  • Work on Assignment 3.

Block 5: Final Projects

Class 24

  • Work on Final Projects.
  • Deadlines:
    • Assignment 3.

Class 25 and 26

  • Final Project presentations