Course Supervisor: Juan Luis Suárez
E-mail: jsuarez@uwo.ca
Instructor: Javier de la Rosa
E-mail: jdelaro@uwo.ca
Office: AHB 1R14
Office Hours: Tuesdays 2:30pm-4:30pm
Meets: Winter 2015, Tuesdays 4:30pm-6:00pm, Wednesdays 4:30pm-6:00pm
Room: UC 207
This course is a hands-on, pragmatic introduction to computer tools and to the theoretical aspects of the new uses of data by humanists of different disciplines. It also serves as an introduction to the techniques and methods used today to make sense of data from a Humanities point of view.
In that sense, Data, the Humanist's New Best Friend is divided into three blocks (plus one extra block that covers a programming review): data mining, text analysis, and network science.
We find computers and software in almost every field of study, from the STEM disciplines (Science, Technology, Engineering, and Mathematics) to Education and the Arts. In the Humanities, for example, there is even a whole sub-discipline (or a discipline in itself, depending on who you ask) that tries to find answers to new and old questions by digital means. This approach, referred to as the Digital Humanities, usually borrows methods from the sciences to analyze its own data and produce interpretations and conclusions.
Data is the new currency: it is easy to produce and easy to collect, although analysis remains one of the most complex steps in any analytical workflow. This tendency to focus on data as the Holy Grail that solves everything, arguable as it is, is producing important advances in the theoretical framework of Data Science, which are later translated into methods and tools mature enough to be extrapolated to other fields. One of those fields is the Digital Humanities.
Being able to manipulate, manage, and get value out of data is not only an in-demand skill for researchers, but also a future-proof recipe for digital humanists in a field that is becoming accustomed to the fast-paced introduction of innovative products for managing data.
Unfortunately, in the Humanities, data may come in a variety of formats: from mere CSV files or tables, to books, blog posts, tweets, or even network data from social or literary networks. Obviously, critical thinking and content-related courses remain completely necessary, but if we do not teach courses on new tools and methods, our next generation of students will graduate with limited knowledge of what can be done and how. It is not only about tools, software, or applications; rather, it is about giving students the skills they need to adapt to the changing environment of research. We do not know if Python will be widely used in 5 or 10 years, but we have to prepare our students to go beyond the next trend and apply their knowledge to the newest and coolest tools.
This course focuses on the three main areas that I consider most important: data mining, text analysis, and network science. This choice is not arbitrary; it is based on two principles:
I have a strong background in Computer Science. When I first enrolled in Computer Engineering, I was very disappointed because I did not see a computer until three months into my program. All my colleagues felt the same way. And when we finally were in front of a computer, we were limited to doing only what the teacher wanted us to do. My first years in college were deeply frustrating for someone who wanted to be a roboticist but studied only mathematics, physics, and more mathematics. After finishing my Bachelor's and Master's degrees, I decided that, if I ever had the chance to teach a computer science related course, it would be a learning-by-doing course.
After several years, I understood the point of acquiring all the mathematical knowledge they had taught us, but even so, I found that it is not the best way to teach computer science. People need to participate and to experience things through trial and error. For this reason, I believe the teacher must play the role of a "facilitator" of learning, and not the role of the "expert" who only delivers information in one direction. I am sure that teaching is an amazing and practical way of learning for teachers, and, therefore, there must be a proper environment in class for students to freely share and express their ideas.
Personally, I believe that one successful way to reach students is by "speaking" their own language. Keeping myself up to date with the trends and technologies they use may make students feel more comfortable and confident in class. Making the content more appealing, as well as challenging, is also a plus that helps catch students' interest and attention.
A positive attitude is a must-have for a teacher, and passion for what he or she teaches is just as important. The process of teaching, interestingly, is in itself an amazing method of learning for an educator. For that reason, there must be an appropriate educational environment for students to freely share and express their ideas. The teacher is there to encourage students' pursuit of knowledge, but is also responsible for the instruction that comes from these exchanges. Small daily activities are intended to achieve this, promoting pair or group work whenever possible.
As stated previously, the content is divided into 1+3 blocks: a programming review to get hands on Python and the IPython Notebook, including setup and in-class activities; data mining and analysis, with activities and one assignment; text analysis and processing, with activities and one assignment; and graph and network analysis, also with activities and one assignment.
The course will employ the micro-lecture format, as students in classes using this format have been shown to outperform those in traditional lecture classes; they also report being happier and more engaged. Class time is split between Tuesdays and Wednesdays. Officially, Wednesdays consist of one class hour plus one lab hour, but since class time is lab time too, there will be no change of room: we will occupy a computer lab.
All micro-lectures will be delivered in the form of interactive, downloadable, and executable IPython Notebooks, made available on the course website. Students must come to class prepared, as they are expected to take an active part in the lectures and activities. After the teacher explains the concepts from the recommended readings, and in order to ensure active participation in class, students will be assigned activities to solve during class, either in small groups or in pairs, in a short period of time. While students work on their activities, the teacher will help with doubts and will finally show the answer to the class.
Moreover, after each of the main blocks of content, students will complete an assignment to put the concepts learned in class into practice. Overall, this is a highly interactive course that allows students to actively learn by doing.
The use of electronic devices is highly recommended, as long as they can be used to run and experiment with the examples and activities from the micro-lectures.
Notebooks with the content, related exercises, and readings will be available to students on the course website. The readings in the course calendar are suggested to aid the student, but they are not required to understand the lectures (although they are good research material). Every day, to promote active learning, small activities will be proposed, in which students are expected to work in pairs or small groups. For this reason, the room should be a laboratory, as some students may not have a laptop.
After each block is finished, an assignment covering as many concepts from the lectures as possible will be proposed. Students will then have more than one week to complete it, individually or in pairs.
Goal: Deliver the content in an interactive way, adapted to the needs and interests of the students, while keeping them engaged.
In order to make the lectures more practical, the Notebooks are executable documents that run in the browser, so students can interact with them. Notebooks might be long: the idea behind long Notebooks is to cover as much content as possible so that, depending on the interests and needs of the students, lectures can go deeper in one direction or another. In this way, students become more engaged with the lectures while actively participating in class by voicing their interests.
Goal: Guarantee that students understand the lectures.
Activities will be proposed daily, usually several times a day. These activities are conceived to put the concepts of the lecture into practice as they are explained, and they will be evaluated only as participation. After students are given time to solve them, all activities will be explained to the class by the teacher or by volunteers. The last activity of the day may be left as homework, to be solved by the next class day.
There will be three different assignments based on the three main practical blocks of content (excluding the review), each covering a real-world case using Humanities data. All assignments must be written as IPython Notebooks. Each assignment has five parts, evaluated as follows: getting and cleaning the data (3%); summarizing the data and extracting relevant information (3%); visualizing the information (3%); stating the conclusions (3%); and presentation, correctness, functionality, and adequacy of code (3%). All assignments will be marked from 0 (not submitted) to 100.
For the final project, each student must choose a problem, phenomenon, dataset, or topic of interest to him or her, and use at least two of the three blocks to write a Notebook about it. Therefore, all final projects must include at least one of these pairs of contents: Data Mining and Text Analysis, Text Analysis and Network Science, or Data Mining and Network Science. Ideally, projects should include aspects of all the main blocks, since Data Mining introduces several transversal concepts.
Proposal (5%)
Notebook (25%)
Oral presentation (10%)
There is no minimum length for the Notebook, as long as the project covers all the required aspects. A small bibliography is mandatory (APA, MLA, etc., as long as it is consistent). The deadline is April 8th. This is after the oral presentations, so students have the chance to improve their work based on the comments and feedback received during the presentation. A template Project Notebook will be provided.
The Final Project will be marked from 0 (not submitted) to 100.
This class will present the course, the methodology, and the IPython Notebook environment.
Activity example: write a "Hello World!" Python program.
This class will cover the basic syntax for building Python programs.
This class will introduce basic programming abstractions called functions.
Activity example: write a function grade(number) that receives a number between 0.0 and 100.0 and returns the proper letter grade following the Smith College numerology. For values outside the range, just print "N/A". For example, grade(88.75) returns B+.
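A minimal sketch of one possible solution; the cutoffs below are an assumption based on a standard letter-grade scale and should be checked against the official Smith College table:

```python
def grade(number):
    """Return the letter grade for a score between 0.0 and 100.0."""
    if not 0.0 <= number <= 100.0:
        print("N/A")
        return None
    # Assumed cutoffs (standard U.S. scale); adjust to the official table.
    cutoffs = [(93, 'A'), (90, 'A-'), (87, 'B+'), (83, 'B'), (80, 'B-'),
               (77, 'C+'), (73, 'C'), (70, 'C-'), (67, 'D+'), (63, 'D'),
               (60, 'D-'), (0, 'E')]
    for lowest, letter in cutoffs:
        if number >= lowest:
            return letter

grade(88.75)  # 'B+'
```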
This class completes the basic knowledge of Python.
Activity example: write a function freq_dict(string) that counts the number of occurrences of each letter in the string string and returns a dictionary with letters as keys and their frequencies as values. For example, calling freq_dict("Mississippi") must return {'M': 1, 's': 4, 'p': 2, 'i': 4}.
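A minimal sketch of one possible solution, using a plain dictionary:

```python
def freq_dict(string):
    """Count the occurrences of each character in `string`."""
    frequencies = {}
    for letter in string:
        frequencies[letter] = frequencies.get(letter, 0) + 1
    return frequencies

freq_dict("Mississippi")  # {'M': 1, 'i': 4, 's': 4, 'p': 2}
```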
This class will introduce new data types and show how to read and write from and to a file.
Activity example: populations.txt describes the populations of hares and lynxes (and carrots) in northern Canada during 20 years. Compute and print, based on the data in populations.txt, a series of statistics about these populations (Hints: np.array(['H', 'L', 'C']), np.any, argsort, fancy indexing), all without for-loops.
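A sketch of the kind of for-loop-free NumPy code the activity calls for, assuming populations.txt holds four whitespace-separated columns (year, hare, lynx, and carrot populations); the threshold below is hypothetical:

```python
import numpy as np

data = np.loadtxt('populations.txt')
year, populations = data[:, 0], data[:, 1:]
species = np.array(['H', 'L', 'C'])

print(populations.mean(axis=0))               # mean population of each species
print(species[populations.argmax(axis=1)])    # dominant species each year (fancy indexing)
print(year[populations[:, 0].argsort()[:2]])  # two lowest years for hares (argsort)
threshold = 50000                             # hypothetical threshold, for illustration
print(year[np.any(populations > threshold, axis=1)])  # years any population exceeds it (np.any)
```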
This class will cover common tasks for cleaning, filtering and grouping data.
Activity example: using the arts data frame, complete a series of exercises on cleaning, filtering, and grouping the data.
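A minimal sketch of the kind of operations such an activity involves; the file name and column names are hypothetical, for illustration only:

```python
import pandas as pd

arts = pd.read_csv('arts.csv')          # hypothetical source file

arts = arts.dropna()                    # cleaning: drop rows with missing values
modern = arts[arts['year'] > 1900]      # filtering: keep works created after 1900
print(modern.groupby('author').size())  # grouping: number of works per author
```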
This class will introduce data summarization and visualization methods.
Activity example: using the arts data frame, try a series of summarization and visualization exercises.
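As an illustration, summarizing and plotting a hypothetical year column of the arts data frame might look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

arts = pd.read_csv('arts.csv')  # hypothetical source file

print(arts['year'].describe())  # summary statistics for one variable
arts['year'].hist()             # distribution of creation years
plt.xlabel('Year')
plt.ylabel('Number of works')
plt.show()
```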
This class is about estimating the relationships among variables.
Activity example: using the arts data frame, try to estimate relationships among some of its variables.
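A sketch of a simple linear relationship estimate between two hypothetical numeric columns:

```python
import numpy as np
import pandas as pd

arts = pd.read_csv('arts.csv').dropna()  # hypothetical source file

print(arts[['year', 'price']].corr())    # correlation between the two variables
slope, intercept = np.polyfit(arts['year'], arts['price'], 1)  # least-squares fit
print('price ~ %.2f * year + %.2f' % (slope, intercept))
```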
This class will introduce the idea of systems that can learn from data and make predictions from it.
Activity example: using the arts data frame, try to train a simple model that learns from the data and makes predictions.
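A sketch of the learn-and-predict workflow with scikit-learn; the file name, feature columns, and label column are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

arts = pd.read_csv('arts.csv').dropna()  # hypothetical source file

X = arts[['year', 'width', 'height']]    # hypothetical numeric features
y = arts['school']                       # hypothetical label to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier().fit(X_train, y_train)
print(model.score(X_test, y_test))       # accuracy on held-out works
```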
This class will cover the basic principles of Natural Language Processing.
Activity example: write a function most_common(text, n) that receives a list of words or an NLTK Text and a number, and returns the n most common words. For example, most_common(moby_dick, 5) should return the 5 most common words: [',', 'the', '.', 'of', 'and'].
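A minimal sketch using NLTK's FreqDist, assuming the NLTK book corpora are installed (there, Moby Dick is loaded as text1):

```python
from nltk import FreqDist
from nltk.book import text1  # Moby Dick as an NLTK Text

def most_common(text, n):
    """Return the n most frequent tokens in a list of words or a Text."""
    return [word for word, count in FreqDist(text).most_common(n)]

most_common(text1, 5)  # [',', 'the', '.', 'of', 'and']
```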
This class will cover different ways to access and create corpus data.
Activity example: process a collection of feeds with feedparser and, for each one, return several pieces of information in English and without stopwords (Hint: take a look at the stopwords available in NLTK under nltk.corpus.stopwords.words('spanish')).
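A minimal sketch of the feed-processing part; the feed URL is hypothetical, and it assumes the NLTK stopwords corpus is installed:

```python
import feedparser
from nltk.corpus import stopwords

feed = feedparser.parse('http://example.com/news/rss.xml')  # hypothetical feed
english_stopwords = set(stopwords.words('english'))

for entry in feed.entries:
    words = entry.title.lower().split()
    content = [w for w in words if w not in english_stopwords]
    print(content)  # title words with stopwords removed
```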
This class will present some of the concepts behind modern statistical language applications and grammar analysis.
This class will focus on measuring and extracting relevant information from texts.
This class will present ways of classifying text and documents.
This class will introduce the basic concepts of networks and network analysis.
Activity example: write two functions, max_degree(graph) and min_degree(graph), that take a graph and return the maximum and minimum degree of the graph.
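A minimal sketch using NetworkX, exercised on its built-in karate club graph:

```python
import networkx as nx

def max_degree(graph):
    """Return the maximum node degree in the graph."""
    return max(dict(graph.degree()).values())

def min_degree(graph):
    """Return the minimum node degree in the graph."""
    return min(dict(graph.degree()).values())

graph = nx.karate_club_graph()
print(max_degree(graph), min_degree(graph))  # 17 1
```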
This class will present the idea of relevance in a network.
Activity example: write a function centrality_scatter(cent1, cent2) that receives two centrality dictionaries and plots each node label as a point, using each dictionary as one of the axes. Add a linear best-fit trend, plus axis and title labels.
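A sketch of one way to implement it, using Matplotlib for the plot and NumPy's polyfit for the best-fit line:

```python
import matplotlib.pyplot as plt
import numpy as np

def centrality_scatter(cent1, cent2):
    """Scatter two centrality dictionaries against each other."""
    nodes = sorted(cent1)                   # assume both dicts share the same nodes
    x = np.array([cent1[n] for n in nodes])
    y = np.array([cent2[n] for n in nodes])
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    for node, xi, yi in zip(nodes, x, y):   # label each point with its node
        ax.annotate(str(node), (xi, yi))
    slope, intercept = np.polyfit(x, y, 1)  # linear best-fit trend
    ax.plot(x, slope * x + intercept)
    ax.set_xlabel('Centrality 1')
    ax.set_ylabel('Centrality 2')
    ax.set_title('Centrality comparison')
    plt.show()
```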
This class will focus on understanding network structures.
This class will present effective ways of modeling real data for persistence.
This class will focus on building networks from plays and analyzing other forms of content in networks.