Course Supervisor: Juan Luis Suárez
E-mail: jsuarez@uwo.ca
Instructor: Javier de la Rosa
E-mail: jdelaro@uwo.ca
Office: AHB 1R14
Office Hours: Tuesdays 2:30pm-4:30pm
Meets: Winter 2015, Tuesdays 4:30pm-6:00pm, Wednesdays 4:30pm-6:00pm
Room: UC 207
This course is a hands-on, pragmatic introduction to computer tools and to the theoretical aspects of the new uses of data by humanists of different disciplines. It also serves as an introduction to the techniques and methods used today to make sense of data from a Humanities point of view.
In that sense, Data, the Humanist's New Best Friend is divided into three blocks (plus one extra block that covers a programming review): data mining, text analysis, and network science.
We find computers and software in almost every field of study, from the STEM disciplines (Science, Technology, Engineering, and Mathematics) to Education and the Arts. In the Humanities, for example, there is even a whole sub-discipline (or a discipline in itself, depending on who you ask) that tries to find answers to new and old questions by digital means. This approach, referred to as the Digital Humanities, usually borrows methods from the sciences to analyze its own data and produce interpretations and conclusions.
Data is the new currency: it is easy to produce and easy to collect, although analysis remains one of the most complex steps in any analytical workflow. This tendency to focus on data as the Holy Grail that solves everything, arguable as it is, is producing important advances in the theoretical framework of Data Science, which are later translated into methods and tools mature enough to be extrapolated to other fields. One of those fields is the Digital Humanities.
Being able to manipulate, manage, and get value out of data is not only an in-demand skill for researchers, but also a future-proof recipe for digital humanists in a field that is becoming accustomed to the fast-paced introduction of innovative products for managing data.
Unfortunately, in the Humanities, data may come in a variety of formats: from mere CSV files or tables, to books, blog posts, tweets, or even network data from social or literary networks. Obviously, critical thinking and content-related courses remain completely necessary, but if we do not teach courses on new tools and methods, our next generation of students will graduate with limited knowledge of what can be done and how. It is not only about tools, software, or applications; rather, it is about giving students the skills they need to adapt to the changing environment of research. We do not know if Python will be widely used in 5 or 10 years, but we have to prepare our students to go beyond the next trend and apply their knowledge to the newest and coolest tools.
This course focuses on the three main areas that I consider most important: data mining, text analysis, and network science. This choice is not arbitrary; it is based on two principles:
I have a strong background in Computer Science. When I first enrolled in Computer Engineering, I was very disappointed because I did not see a computer until three months into my program. All my colleagues felt the same way. And when we finally were in front of a computer, we were limited to doing only what the teacher wanted us to do. My first years in college were deeply frustrating for someone who wanted to be a roboticist but studied only mathematics, physics, and more mathematics. After finishing my Bachelor's and Master's degrees, I decided that, if I ever had the chance to teach a computer science related course, it would be a learning-by-doing course.
After several years, I understood the point of acquiring all the mathematical knowledge they had taught us, but even so, I found that it is not the best way to teach computer science. People need to participate and to experience things through trial and error. For this reason, I believe the teacher must play the role of a "facilitator" of learning, and not the role of the "expert" who only delivers information in one direction. I am sure that teaching is an amazing and practical way of learning for teachers, and, therefore, there must be a proper environment in class for students to freely share and express their ideas.
Personally, I believe that one successful way to reach students is by "speaking" their own language. Keeping myself up to date with the trends and technologies they use may make students feel more comfortable and confident in class. Making the content more appealing, as well as challenging, is also a plus that helps catch students' interest and attention.
A positive attitude is a must-have for a teacher, and passion for what he or she teaches is just as important. The process of teaching, interestingly, is in itself an amazing method of learning for an educator. For that reason, there must be an appropriate educational environment for students to freely share and express their ideas. The teacher is there to encourage students' pursuit of knowledge, but is also responsible for the instruction that comes from these exchanges. Small daily activities are intended to achieve this, promoting pair or group work whenever possible.
As stated previously, the content is divided into 1+3 blocks: a programming review to get hands on Python and the IPython Notebook, including setup and in-class activities; data mining and analysis, with activities and one assignment; text analysis and processing, with activities and one assignment; and graph and network analysis, also with activities and one assignment.
The course will employ the micro-lecture format, as students in classes using this format have been shown to outperform those in traditional lecture classes; they also report being happier and more engaged. Class time is split between Tuesdays and Wednesdays. Officially, Wednesdays consist of one class hour plus one lab hour, but since class time is lab time too, there will be no change of room: we will occupy a computer lab.
All micro-lectures will be delivered in the form of interactive, downloadable, and executable IPython Notebooks, made available on the course website. Students must come to class prepared, as they are expected to take an active part in the lectures and activities. After the teacher explains the concepts from the recommended readings, and in order to ensure active participation in class, students will be assigned activities to solve during class, either in small groups or in pairs, in a short period of time. While students work on their activities, the teacher will help with doubts and will finally show the answer to the class.
Moreover, after each of the main blocks of content, students will complete an assignment to put the concepts learned in class into practice. Overall, this is a highly interactive course that allows students to actively learn by doing.
The use of electronic devices is highly recommended, as long as they can be used to run and experiment with the examples and activities from the micro-lectures.
Notebooks with the content, related exercises, and readings will be available to students on the course website. The readings in the course calendar are suggested to aid the student, but they are not required to understand the lectures (although they are good research material). Every day, to promote active learning, small activities will be proposed, in which students are expected to work in pairs or small groups. For this reason, the room should be a laboratory, as some students may not have a laptop.
After each block is finished, an assignment covering as many concepts from the lectures as possible will be proposed. Students will then have more than one week to complete it, individually or in pairs.
Goal: Deliver the content in an interactive way, adapted to the needs and interests of the students, while keeping them engaged.
In order to make the lectures more practical, the Notebooks are executable documents that run in the browser, so students can interact with them. Notebooks might be long: the idea behind long Notebooks is to cover as much content as possible so that, depending on the interests and needs of the students, lectures can go deeper in one direction or another. In this way, students become more engaged with the lectures while actively participating in class by voicing their interests.
Goal: Guarantee that students understand the lectures.
Activities will be proposed daily, usually several times a day. These activities are conceived to put the concepts of the lecture into practice as they are explained, and they will be evaluated only as participation. After students are given time to solve them, all activities will be explained to the class by the teacher or by volunteers. The last activity of the day may be left as homework, to be solved by the next class day.
There will be three different assignments based on the three main practical blocks of content (excluding the review), each covering a real-world case using Humanities data. All assignments must be written as IPython Notebooks. Each assignment has five parts, evaluated as follows: getting and cleaning the data (3%); summarizing the data and extracting relevant information (3%); visualizing the information (3%); stating the conclusions (3%); and presentation, correctness, functionality, and adequacy of code (3%). All assignments will be marked from 0 (not submitted) to 100.
For the final project, each student must choose a problem, phenomenon, dataset, or topic of interest to him or her, and use at least two of the three blocks to write a Notebook about it. Therefore, all final projects must include at least one of these pairs of contents: Data Mining and Text Analysis, Text Analysis and Network Science, or Data Mining and Network Science. Ideally, projects should include aspects of all the main blocks, since Data Mining introduces several transversal concepts.
Proposal (5%)
Notebook (25%)
Oral presentation (10%)
There is no minimum length for the Notebook, as long as the project covers all the required aspects. A small bibliography is mandatory (APA, MLA, etc., as long as it is consistent). The deadline is April 8th. This is after the oral presentations, so students have the chance to improve their work based on the comments and feedback received during the presentation. A template Project Notebook will be provided.
The Final Project will be marked from 0 (not submitted) to 100.
This class will present the course, the methodology, and the IPython Notebook environment.
Activity example: write a "Hello World!" Python program.
This class will cover the basic syntax for building Python programs.
This class will introduce basic programming abstractions called functions.
Activity example: write a function grade(number) that receives a number between 0.0 and 100.0 and returns the proper letter grade following the Smith College numerology. For values outside the range, just print "N/A". For example, grade(88.75) returns B+.
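A minimal sketch of one possible solution; the cutoffs below are an assumption based on a standard letter-grade scale and should be checked against the official Smith College table:

```python
def grade(number):
    """Return the letter grade for a score between 0.0 and 100.0."""
    if not 0.0 <= number <= 100.0:
        print("N/A")
        return None
    # Assumed cutoffs (standard U.S. scale); adjust to the official table.
    cutoffs = [(93, 'A'), (90, 'A-'), (87, 'B+'), (83, 'B'), (80, 'B-'),
               (77, 'C+'), (73, 'C'), (70, 'C-'), (67, 'D+'), (63, 'D'),
               (60, 'D-'), (0, 'E')]
    for lowest, letter in cutoffs:
        if number >= lowest:
            return letter

grade(88.75)  # 'B+'
```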
This class completes the basic knowledge of Python.
Activity example: write a function freq_dict(string) that counts the number of occurrences of each letter in the string string and returns a dictionary with letters as keys and their frequencies as values. For example, calling freq_dict("Mississippi") must return {'M': 1, 's': 4, 'p': 2, 'i': 4}.
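A minimal sketch of one possible solution, using a plain dictionary:

```python
def freq_dict(string):
    """Count the occurrences of each character in `string`."""
    frequencies = {}
    for letter in string:
        frequencies[letter] = frequencies.get(letter, 0) + 1
    return frequencies

freq_dict("Mississippi")  # {'M': 1, 'i': 4, 's': 4, 'p': 2}
```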
This class will introduce new data types and show how to read and write from and to a file.
Activity example: populations.txt describes the populations of hares and lynxes (and carrots) in northern Canada during 20 years. Compute and print, based on the data in populations.txt, a series of statistics about these populations (Hints: np.array(['H', 'L', 'C']), np.any, argsort, fancy indexing), all without for-loops.
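A sketch of the kind of for-loop-free NumPy code the activity calls for, assuming populations.txt holds four whitespace-separated columns (year, hare, lynx, and carrot populations); the threshold below is hypothetical:

```python
import numpy as np

data = np.loadtxt('populations.txt')
year, populations = data[:, 0], data[:, 1:]
species = np.array(['H', 'L', 'C'])

print(populations.mean(axis=0))               # mean population of each species
print(species[populations.argmax(axis=1)])    # dominant species each year (fancy indexing)
print(year[populations[:, 0].argsort()[:2]])  # two lowest years for hares (argsort)
threshold = 50000                             # hypothetical threshold, for illustration
print(year[np.any(populations > threshold, axis=1)])  # years any population exceeds it (np.any)
```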
This class will cover common tasks for cleaning, filtering and grouping data.
Activity example: using the arts data frame, complete a series of exercises on cleaning, filtering, and grouping the data.
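A minimal sketch of the kind of operations such an activity involves; the file name and column names are hypothetical, for illustration only:

```python
import pandas as pd

arts = pd.read_csv('arts.csv')          # hypothetical source file

arts = arts.dropna()                    # cleaning: drop rows with missing values
modern = arts[arts['year'] > 1900]      # filtering: keep works created after 1900
print(modern.groupby('author').size())  # grouping: number of works per author
```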
This class will introduce data summarization and visualization methods.
Activity example: using the arts data frame, try a series of summarization and visualization exercises.
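As an illustration, summarizing and plotting a hypothetical year column of the arts data frame might look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

arts = pd.read_csv('arts.csv')  # hypothetical source file

print(arts['year'].describe())  # summary statistics for one variable
arts['year'].hist()             # distribution of creation years
plt.xlabel('Year')
plt.ylabel('Number of works')
plt.show()
```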
This class is about estimating the relationships among variables.
Activity example: using the arts data frame, try to estimate relationships among some of its variables.
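A sketch of a simple linear relationship estimate between two hypothetical numeric columns:

```python
import numpy as np
import pandas as pd

arts = pd.read_csv('arts.csv').dropna()  # hypothetical source file

print(arts[['year', 'price']].corr())    # correlation between the two variables
slope, intercept = np.polyfit(arts['year'], arts['price'], 1)  # least-squares fit
print('price ~ %.2f * year + %.2f' % (slope, intercept))
```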
This class will introduce the idea of systems that can learn from data and make predictions from it.
Activity example: using the arts data frame, try to train a simple model that learns from the data and makes predictions.
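A sketch of the learn-and-predict workflow with scikit-learn; the file name, feature columns, and label column are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

arts = pd.read_csv('arts.csv').dropna()  # hypothetical source file

X = arts[['year', 'width', 'height']]    # hypothetical numeric features
y = arts['school']                       # hypothetical label to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier().fit(X_train, y_train)
print(model.score(X_test, y_test))       # accuracy on held-out works
```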
This class will cover the basic principles of Natural Language Processing.
Activity example: write a function most_common(text, n) that receives a list of words or an NLTK Text and a number, and returns the n most common words. For example, most_common(moby_dick, 5) should return the 5 most common words: [',', 'the', '.', 'of', 'and'].
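A minimal sketch using NLTK's FreqDist, assuming the NLTK book corpora are installed (there, Moby Dick is loaded as text1):

```python
from nltk import FreqDist
from nltk.book import text1  # Moby Dick as an NLTK Text

def most_common(text, n):
    """Return the n most frequent tokens in a list of words or a Text."""
    return [word for word, count in FreqDist(text).most_common(n)]

most_common(text1, 5)  # [',', 'the', '.', 'of', 'and']
```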
This class will cover different ways to access and create corpus data.
Activity example: process a collection of feeds with feedparser and, for each one, return several pieces of information in English and without stopwords (Hint: take a look at the stopwords available in NLTK under nltk.corpus.stopwords.words('spanish')).
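A minimal sketch of the feed-processing part; the feed URL is hypothetical, and it assumes the NLTK stopwords corpus is installed:

```python
import feedparser
from nltk.corpus import stopwords

feed = feedparser.parse('http://example.com/news/rss.xml')  # hypothetical feed
english_stopwords = set(stopwords.words('english'))

for entry in feed.entries:
    words = entry.title.lower().split()
    content = [w for w in words if w not in english_stopwords]
    print(content)  # title words with stopwords removed
```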
This class will present some of the concepts behind modern statistical language applications and grammar analysis.
This class will focus on measuring and extracting relevant information from texts.
This class will present ways of classifying text and documents.
This class will introduce the basic concepts of networks and network analysis.
Activity example: write two functions, max_degree(graph) and min_degree(graph), that take a graph and return the maximum and minimum degree of the graph.
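A minimal sketch using NetworkX, exercised on its built-in karate club graph:

```python
import networkx as nx

def max_degree(graph):
    """Return the maximum node degree in the graph."""
    return max(dict(graph.degree()).values())

def min_degree(graph):
    """Return the minimum node degree in the graph."""
    return min(dict(graph.degree()).values())

graph = nx.karate_club_graph()
print(max_degree(graph), min_degree(graph))  # 17 1
```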
This class will present the idea of relevance in a network.
Activity example: write a function centrality_scatter(cent1, cent2) that receives two centrality dictionaries and plots each node label as a point, using each dictionary as one of the axes. Add a linear best-fit trend, plus axis and title labels.
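A sketch of one way to implement it, using Matplotlib for the plot and NumPy's polyfit for the best-fit line:

```python
import matplotlib.pyplot as plt
import numpy as np

def centrality_scatter(cent1, cent2):
    """Scatter two centrality dictionaries against each other."""
    nodes = sorted(cent1)                   # assume both dicts share the same nodes
    x = np.array([cent1[n] for n in nodes])
    y = np.array([cent2[n] for n in nodes])
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    for node, xi, yi in zip(nodes, x, y):   # label each point with its node
        ax.annotate(str(node), (xi, yi))
    slope, intercept = np.polyfit(x, y, 1)  # linear best-fit trend
    ax.plot(x, slope * x + intercept)
    ax.set_xlabel('Centrality 1')
    ax.set_ylabel('Centrality 2')
    ax.set_title('Centrality comparison')
    plt.show()
```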
This class will focus on understanding network structures.
This class will present effective ways of modeling real data for persistence.
This class will focus on building networks from plays and analyzing other forms of content in networks.