Introduction to Computational Tools and Python


This notebook introduces students to popular computational tools used in the Digital Humanities and Social Sciences and the research possibilities they create. It then provides an abbreviated introduction to Python, focusing on analysis, in preparation for the Twitter data analysis on Day 2.

Estimated Time: 180 minutes


Topics Covered:

  • Tools available for researchers interested in computational analyses
  • Common applications for these tools in research
  • How these tools fit into the research workflow
  • Python fundamentals

Parts:


Tools

Online Point-And-Click Tools

Common Programming Languages for Research and Visualization

Qualitative Data Analysis

Geospatial Analysis


Common Tasks

Data Management

  • Tabular Data
  • Numerical Data
  • Text Data
  • Storing Data
  • Archiving Data
  • Annotating Data

Qualitative Analysis

  • Annotating and aggregating non-numerical data
  • Visualizing non-numerical data

Quantitative Analysis

  • Counting (words, observations, events, locations, etc.)
  • Traditional statistical analyses

Model Building and Machine Learning

  • Classification
  • Regression
  • Clustering
  • Topic Modeling
  • Sentiment Analysis
  • Text Generation

Linguistic Analysis and Natural Language Processing (NLP)

  • Vocabulary
  • Chunking
  • Dependency Parsing
  • Named Entity Recognition (NER)
  • Semantic Distance

Visualization

  • Graphs and Charts
  • Network Analysis
  • Geospatial Analysis

Pedagogy

  • Digital Editions
  • Much of Visualization

Introduction to Python

Variables

  • Variables are names for values.
  • In Python the = symbol assigns the value on the right to the name on the left.
  • The variable is created when a value is assigned to it.
  • Here's Python code that assigns an age to a variable age and a name in quotation marks to a variable first_name.

In [ ]:
age = 42
first_name = 'Ahmed'
  • Variable names:
    • cannot start with a digit
    • cannot contain spaces, quotation marks, or other punctuation
    • may contain an underscore (typically used to separate words in long variable names)
  • Underscores at the start like __alistairs_real_age have a special meaning so we won't do that until we understand the convention.
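As a small sketch of these naming rules (the names below are made up for illustration and are not used elsewhere in this notebook), the first three assignments are valid. The invalid ones are left as comments because Python refuses to run a cell that contains them.

```python
# Valid names: letters, digits (but not as the first character), and underscores.
birth_year = 1990
city2 = 'Oakland'
favorite_color_name = 'green'

# Invalid names, kept as comments because each one is a SyntaxError:
# 2nd_city = 'Oakland'      # starts with a digit
# birth year = 1990         # contains a space
# birth-year = 1990         # contains punctuation

print(birth_year, city2, favorite_color_name)
```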

Use print to display values.

  • Python has a built-in function called print that prints things as text.
  • Call the function (i.e., tell Python to run it) by using its name.
  • Provide values to the function (e.g., things to print) in parentheses.

In [ ]:
print(first_name, 'is', age, 'years old')
  • print automatically puts a single space between items to separate them.
  • And wraps around to a new line at the end.
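If the default spacing or line break is not what you want, print accepts two optional settings, sep and end. The sketch below repeats the variables from above so that it runs on its own. These settings are shown only for illustration and are not needed for the rest of the notebook.

```python
first_name = 'Ahmed'
age = 42

# Default behaviour: one space between items, a new line at the end.
print(first_name, 'is', age, 'years old')

# sep changes what is placed between items.
print(first_name, age, sep=' - ')

# end changes what is placed after the last item (here: nothing).
print('no new line here', end='')
print(' ...so this continues on the same line')
```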

Variables persist between cells.

  • Variables defined in one cell exist in all following cells.
  • Notebook cells are just a way to organize a program: as far as Python is concerned, all of the source code is one long set of instructions.

Variables must be created before they are used.

  • If a variable doesn't exist yet, or if the name has been mis-spelled, Python reports an error.

In [ ]:
print(last_name)
  • The last line of an error message is usually the most informative.
  • We will look at error messages in detail later.

Python is case-sensitive.

  • Python thinks that upper- and lower-case letters are different, so Name and name are different variables.
  • Again, there are conventions around using upper-case letters at the start of variable names so we will use lower-case letters for now.
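A quick sketch of what case-sensitivity means in practice. The capitalized name is used here only to show the difference; we will keep to lower-case names elsewhere.

```python
name = 'Ahmed'
Name = 'Walsh'   # a separate variable, because of the capital N

print(name)
print(Name)
print(name == Name)   # compares the two values
```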

Use meaningful variable names.

  • Python doesn't care what you call variables as long as they obey the rules (alphanumeric characters and the underscore).

In [ ]:
flabadab = 42
ewr_422_yY = 'Ahmed'
print(ewr_422_yY, 'is', flabadab, 'years old')
  • Use meaningful variable names to help other people understand what the program does.
  • The most important "other person" is your future self.

Variables can be used in calculations.

  • We can use variables in calculations just as if they were values.
    • Remember, we assigned 42 to age a few lines ago.

In [ ]:
age = age + 3
print('Age in three years:', age)

Challenge 1: Making and Printing Variables

Make 3 variables:

  • name (with your full name)
  • city (where you were born) and
  • year (when you were born).

Print these three variables so that the output reads: [your name] was born in [city] in [year].


In [ ]:


Data Types

Every value has a type.

  • Every value in a program has a specific type.
  • Integer (int): counting numbers like 3 or -512.
  • Floating point number (float): fractional numbers like 3.14159 or -2.5.
    • Integers are used to count, floats are used to measure.
  • Character string (usually just called "string", str): text.
    • Written in either single quotes or double quotes (as long as they match).
    • The quotation marks aren't printed when the string is displayed.
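A short sketch of the quoting rule, using made-up variable names. Single and double quotes produce the same kind of value, and having both styles available makes it easy to include an apostrophe in a string.

```python
single = 'text'
double = "text"
print(single == double)   # the quote style is not part of the value

# Double quotes on the outside let us use an apostrophe inside.
apostrophe = "it's"
print(apostrophe)
```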

Use the built-in function type to find the type of a value.

  • Use the built-in function type to find out what type a value has.
  • Works on variables as well.
    • But remember: the value has the type --- the variable is just a label.
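To illustrate that the type belongs to the value rather than to the variable, the sketch below reuses one made-up name, label, for two values of different types.

```python
label = 52
print(type(label))   # the value 52 is an int

label = 'fifty-two'
print(type(label))   # the same name now refers to a str value
```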

In [ ]:
print(type(52))

In [ ]:
pi = 3.14159
print(type(pi))

In [ ]:
fitness = 'average'
print(type(fitness))

Types control what operations can be done on values.

  • A value's type determines what the program can do to it.

In [ ]:
print(5 - 3)

In [ ]:
print('hello' - 'h')

Strings can be added and multiplied.

  • "Adding" character strings concatenates them.

In [ ]:
full_name = 'Ahmed' + ' ' + 'Walsh'
print(full_name)
  • Multiplying a character string by an integer replicates it.
    • Since multiplication is just repeated addition.

In [ ]:
separator = '=' * 10
print(separator)

Strings have a length (but numbers don't).

  • The built-in function len counts the number of characters in a string.

In [ ]:
print(len(full_name))
  • But numbers don't have a length (not even zero).

In [ ]:
print(len(52))

Must convert numbers to strings or vice versa when operating on them.

  • Cannot add numbers and strings.

In [ ]:
print(1 + '2')
  • Not allowed because it's ambiguous: should 1 + '2' be 3 or '12'?
  • Use the name of a type as a function to convert a value to that type.

In [ ]:
print(1 + int('2'))
print(str(1) + '2')

Can mix integers and floats freely in operations.

  • Integers and floating-point numbers can be mixed in arithmetic.
    • Python automatically converts integers to floats as needed.

In [ ]:
print('half is', 1 / 2.0)
print('three squared is', 3.0 ** 2)
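One more sketch of mixing integers and floats, assuming Python 3 (where / always gives a float). The variable names are made up for illustration.

```python
whole = 3
fraction = 0.5

total = whole + fraction      # the int is converted to a float first
print(total, type(total))

print(7 / 2)    # true division gives a float, even for two integers
print(7 // 2)   # floor division of two integers gives an integer
```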

Variables only change value when something is assigned to them.

  • If we make one cell in a spreadsheet depend on another, and update the latter, the former updates automatically.
  • This does not happen in programming languages.

In [ ]:
first = 1
second = 5 * first
first = 2
print('first is', first, 'and second is', second)
  • The computer reads the value of first when doing the multiplication, creates a new value, and assigns it to second.
  • After that, second does not remember where it came from.
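If we do want second to reflect the new value of first, we have to redo the calculation ourselves. A minimal sketch, repeating the cell above with one extra assignment:

```python
first = 1
second = 5 * first
first = 2

# second is still 5 at this point; recalculate it explicitly.
second = 5 * first
print('first is', first, 'and second is', second)
```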

Challenge 2: Making and Coercing Variables

  1. Make a variable year and assign it the year you were born.
  2. Coerce that variable to a float, and assign it to a new variable year_float
  3. Coerce year_float to a string, and assign it to a new variable year_string
  4. Someone in your class says they were born in 1997. Really? Really. Find out what your age difference is, using only year_string.

In [ ]:


Strings

We can do things with strings

  • We've already seen some operations that can be done with strings.

In [ ]:
first_name = "Johan"
last_name = "Gambolputty"
full_name = first_name + last_name
print(full_name)
  • Remember that computers don't understand context.

In [ ]:
full_name = first_name + " " + last_name
print(full_name)

Strings are made up of sub-strings

  • You can think of strings as a sequence of smaller strings or characters.
  • We can access a piece of that sequence using [].

In [ ]:
full_name[1]

Gotcha: Python (like many other languages) starts counting from 0.


In [ ]:
full_name[0]

In [ ]:
full_name[4]

You can slice strings using [ : ]

  • If you want a range (or "slice") of a sequence, you get everything from the first index up to, but not including, the second index:

In [ ]:
full_name[0:4]

In [ ]:
full_name[0:5]
  • You can see some of the logic for this when we consider implicit indices.

In [ ]:
full_name[:5]

In [ ]:
full_name[5:]
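One way to see the logic, sketched with a separate made-up variable so that full_name is left untouched: because the second index is excluded, cutting a string at any position gives two pieces that fit back together exactly, and the length of a slice is the second index minus the first.

```python
word = 'Gambolputty'
middle = 6

left = word[:middle]    # everything before index 6
right = word[middle:]   # everything from index 6 onward

print(left)
print(right)
print(left + right == word)   # the two pieces rebuild the original
print(len(word[2:7]))         # 7 - 2 characters
```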

Strings Have Methods

  • There are other operations defined on string data. These are called string methods.
  • IPython lets you do tab-completion after a dot ('.') to see what methods an object (i.e., a defined variable) has to offer. Try it now!

In [ ]:
str.
  • Let's look at the upper method. What does it do? Let's take a look at the documentation. IPython lets us do this with a question mark ('?') before or after an object (again, a defined variable).

In [ ]:
str.upper?

So we can use it to convert a string to upper case.


In [ ]:
full_name.upper()

You have to use the parentheses at the end because upper is a method of the string class, and the parentheses are what tell Python to call it.

Don't forget: simply calling the method does not change the original variable. To keep the result, you must reassign the variable:


In [ ]:
print(full_name)

In [ ]:
full_name = full_name.upper()
print(full_name)

For what it's worth, you don't need a variable to use the upper() method; you can call it on the string itself.


In [ ]:
"Johan Gambolputty".upper()

What do you think should happen when you take upper of an int? What about a string representation of an int?

Challenge 3: Write your name

  1. Make two string variables, one with your first name and one with your last name.
  2. Concatenate both strings to form your full name and assign it to a variable.
  3. Assign a new variable that has your full name in all upper case.
  4. Slice that string to get your first name again.

In [ ]:

Challenge 4: Exploring string methods

In our next meeting we will be looking at Twitter activity from the March4Trump event in Berkeley on March 4, 2017. Below is the most retweeted tweet during the March4Trump event. Use this tweet to explore the methods above.


In [ ]:
tweet = 'RT @JasonBelich: #March4Trump #berkeley elderly man pepper sprayed by #antifa https://t.co/5z3O6UZuhL'

Using this tweet, try seeing what the following string methods do:

* `split`
* `join`
* `replace`
* `strip`
* `find`

In [ ]:


Lists

A list is an ordered, indexable collection of data. Let's say you're doing a study on the following countries:

country:

"Afghanistan"
"Canada"
"Sierra Leone"
"Denmark"
"Japan"

You could put that data into a list:

  • contain data in square brackets [...],
  • each value is separated by a comma ,.

In [ ]:
country_list = ["Afghanistan", "Canada", "Sierra Leone", "Denmark", "Japan"]
type(country_list)
  • Use len to find out how many values are in a list.

In [ ]:
len(country_list)

Use an item’s index to fetch it from a list.

  • Each value in a list is stored in a particular location.
  • Locations are numbered from 0 rather than 1.
  • Use the location’s index in square brackets to access the value it contains.

In [ ]:
print('the first item is:', country_list[0])
print('the fourth item is:', country_list[3])
  • Lists can be indexed from the back using a negative index.

In [ ]:
print(country_list[-1])
print(country_list[-2])
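A negative index counts back from the end, so -1 is shorthand for "the length of the list minus one". The sketch below uses a small made-up list so that country_list is left as it is.

```python
cities = ['Berkeley', 'Oakland', 'Richmond']

print(cities[-1])                 # last item
print(cities[len(cities) - 1])    # the same item, written the long way
print(cities[-1] == cities[len(cities) - 1])
```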

"Slice" a list using [ : ]

  • Just as with strings, we can get multiple items from a list using slicing.
  • Note that the first index is included, while the second is excluded.

In [ ]:
print(country_list[1:4])
  • Leave an index blank to get everything from the beginning / end

In [ ]:
print(country_list[:4])

In [ ]:
print(country_list[2:])

Lists’ values can be replaced by assigning to specific indices.


In [ ]:
country_list[0] = "Iran"
print('Country List is now:', country_list)
  • This makes lists different from strings.
  • You cannot change the characters in a string after it has been created.
    • Immutable: cannot be changed after creation.
    • In contrast, lists are mutable: they can be modified in place.

In [ ]:
mystring = "Donut"
mystring[0] = 'C'
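Because a string cannot be changed in place, the usual approach is to build a new string instead. A minimal sketch, using made-up variable names, that combines a new first letter with a slice of the old string:

```python
snack = "Donut"

# Build a new string: 'C' followed by everything after the first character.
new_snack = 'C' + snack[1:]

print(snack)       # the original is unchanged
print(new_snack)
```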

Lists have Methods

  • Just like strings have methods, lists do too.
    • Remember that a method is like a function, but tied to a particular object.
    • Use object_name.method_name to call methods.
    • IPython lets us do tab completion after a dot ('.') to see what an object has to offer.

In [ ]:
country_list.
  • If you want to append items to the end of a list, use the append method.

In [ ]:
country_list.append("United States")
print(country_list)

Use del to remove items from a list entirely.

  • del list_name[index] removes an item from a list and shortens the list.
  • Not a function or a method, but a statement in the language.

In [ ]:
print("original list was:", country_list)
del country_list[3]
print("the list is now:", country_list)

Lists may contain values of different types.

  • A single list may contain numbers, strings, and anything else.

In [ ]:
complex_list = ['life', 42, 'the universe', [1,2,3]]
print(complex_list)
  • Notice that we put a list inside of a list, which can itself be indexed. The same could be done for a string.

In [ ]:
print(complex_list[3])

print(complex_list[3][0])

The empty list contains no values.

  • Use [] on its own to represent a list that doesn't contain any values.
    • "The zero of lists."
  • Helpful as a starting point for collecting values (which we will see in the next episode.)
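As a small sketch of using the empty list as a starting point (the list name visited is made up for illustration), we can begin with [] and add values one at a time with append:

```python
visited = []
print(len(visited))   # an empty list has length 0

visited.append('Denmark')
visited.append('Japan')

print(visited)
print(len(visited))
```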

Indexing beyond the end of the collection is an error.

  • Python reports an IndexError if we attempt to access a value that doesn't exist.
    • This is a kind of runtime error.
    • Cannot be detected as the code is parsed because the index might be calculated based on data.

In [ ]:
print(country_list[99])
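The sketch below shows an index that is calculated from data, which is why Python can only detect the problem while the program runs. It uses try and except, which this notebook does not cover, purely so that the cell finishes without stopping at the error. The list and variable names are made up for illustration.

```python
places = ['Iran', 'Canada', 'Sierra Leone']

# The index depends on the data: the length of a word.
position = len('Denmark')

try:
    print(places[position])
except IndexError as error:
    # Reached because position is larger than the last valid index.
    print('IndexError:', error)
    print('the last valid index is', len(places) - 1)
```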

Challenge 5: Exploring Lists

What does the following program print?


In [ ]:
hashtags = ['#March4Trump',
 '#Fascism',
 '#TwitterIsFascist',
 '#majority',
 '#CouldntEvenStopDeVos',
 '#IsTrumpCompromised',
 '#Berkeley',
 '#NotMyPresident',
 '#mondaymotivation',
 '#BlueLivesMatter',
 '#Action4Trump',
 '#impeachtrump',
 '#Periscope',
 '#march',
 '#TrumpRussia',
 '#obamagate',
 '#Resist',
 '#sedition',
 '#NeverTrump',
 '#maga']

print(hashtags[::2])
print()
print(hashtags[::-1])

How long is the hashtags list?


In [ ]:

Use the .index() method to find out what the index number is for #Resist:


In [ ]:

Read the help file (or the Python documentation) for join(), a string method.


In [ ]:
str.join?

Using the join method, concatenate all the values in hashtags into one long string:


In [ ]:

Using the string replace method and the list index method, print 'Never Trump' without the '#'


In [ ]: