Homework #4

This notebook is due on Friday, November 18th, 2016 at 11:59 p.m. Please make sure to get started early, and come by the instructors' office hours if you have any questions. Office hours and locations can be found in the course syllabus. IMPORTANT: While it's fine if you talk to other people in class about this homework - and in fact we encourage it! - you are responsible for creating the solutions for this homework on your own, and each student must submit their own homework assignment.

FOR THIS HOMEWORK: In addition to the correctness of your answers, we will be grading you on:

  1. The quality of your code
  2. The correctness of your code
  3. Whether your code runs.

To that end:

  1. Code quality: make sure that you use functions whenever possible, use descriptive variable names, and use comments to explain what your code does as well as function properties (including what arguments they take, what they do, and what they return).
  2. Whether your code runs: prior to submitting your homework assignment, re-run the entire notebook and test it. Go to the "Kernel" menu, select "Restart", and then click "clear all outputs and restart." Then, go to the "Cell" menu and choose "Run all" to ensure that your code produces the correct results. We will take off points for code that does not work correctly when we run it!

Your name

Put your name here!


Section 1: The 1D Schelling model

Schelling's model for happiness: Recall that in a 1D line of stars and zeros (to use Schelling's terminology), an element is "happy" if at least half of its neighbors (defined as the four elements to the left and four elements to the right) are like it, and "unhappy" otherwise. For those near the end of the line the rule is that, of the four neighbors on the side toward the center plus the one, two or three outboard neighbors, at least half must be like oneself.
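As a starting point, the happiness rule above could be sketched as follows. This is only one possible reading of the rule, not the required solution: the function name is_happy, the parameter reach, and the example list are all illustrative assumptions, and the line is assumed to be a Python list of zeros and ones.

```python
def is_happy(line, index, reach=4):
    """Return True if the element at position `index` is happy.

    line  : list of 0s and 1s (assumed representation of the line)
    index : position of the element being checked
    reach : how many neighbors to look at on each side (four in the rule above)
    """
    # Slicing clips automatically at the ends of the list, so elements
    # near an end simply have fewer outboard neighbors.
    left_neighbors = line[max(0, index - reach):index]
    right_neighbors = line[index + 1:index + 1 + reach]
    neighbors = left_neighbors + right_neighbors

    # Count the neighbors that are the same type as this element.
    like_me = sum(1 for neighbor in neighbors if neighbor == line[index])

    # "At least half" is written without division to avoid rounding issues.
    return 2 * like_me >= len(neighbors)


example_line = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0]
print([is_happy(example_line, i) for i in range(len(example_line))])
```

Note that this sketch only tests happiness; deciding where an unhappy element moves is a separate step.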

Your assignment is to implement the Schelling model exactly as described in the in-class project using zeros and ones to indicate the two types of elements. As with Schelling's original paper, you will play twice through the entire list, moving each element in a row (possibly moving a given element multiple times if it goes to the right). Print out the state of the list at each step so that you can see how the solution evolves.

Break up your code into functions whenever possible, and add comments to each function and to the rest of the code to make sure that we know what's going on!


In [ ]:
# put your code here

Section 2: Wrapping up our Twitter analysis

In this part of the homework, we're going to extend our analysis of the tweets of the presidential candidates (well, by the time you're actually coding this up, the president-to-be and their defeated opponent). We're going to do a few things:

  1. Clean the data more comprehensively.
  2. Examine the candidates' tweeting styles in a more in-depth fashion.
  3. See how often the candidates refer to each other and, when they do, whether the reference is positive, negative, or neutral.

The cells that are immediately below this download the various files needed for this project, and then load the two files of tweets into huge strings named clinton_tweets and trump_tweets.


In [ ]:
# download the files
import urllib.request

files=['negative.txt','positive.txt']
path='http://www.unc.edu/~ncaren/haphazard/'
for file_name in files:
    urllib.request.urlretrieve(path+file_name,file_name)
    
files=['HillaryClinton_tweets.txt','realDonaldTrump_tweets.txt']
path='https://raw.githubusercontent.com/bwoshea/CMSE201_datasets/master/pres_tweets/'
for file_name in files:
    urllib.request.urlretrieve(path+file_name,file_name)

In [ ]:
'''
Now we open up the files.  Note that the 'encoding="utf8"' portion
is to take care of the fact that some Windows machines have a hard 
time reading the text files generated on Mac and Linux computers.
'''
# the "with" statement makes sure that each file is closed after it is read
with open("HillaryClinton_tweets.txt", encoding="utf8") as tweet_file:
    clinton_tweets = tweet_file.read()
with open("realDonaldTrump_tweets.txt", encoding="utf8") as tweet_file:
    trump_tweets = tweet_file.read()

Cleaning the data

We want to do a more comprehensive job cleaning the data than we did in class, while still keeping the steps we used there. In particular, in addition to making all of the words lowercase and removing punctuation, we want to remove hashtags (words starting with a pound sign, #), websites (words starting with "http"), and any empty strings (an empty string looks like this: '').
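The filtering rules above might look something like the following loop. It is a minimal sketch on a made-up sentence, not the function you are asked to write, and it assumes that stripping punctuation from the two ends of each word is good enough; the approach from class may handle punctuation differently. The names raw_words and kept_words are illustrative.

```python
import string

# A made-up example string (not the test tweet), split on single spaces.
raw_words = "Go VOTE today!! #election  http://example.com ".split(' ')

kept_words = []
for word in raw_words:
    # Skip hashtags and websites before touching the punctuation,
    # because stripping punctuation would remove the leading '#'.
    if word.startswith('#') or word.startswith('http'):
        continue
    # Lowercase the word and strip punctuation from both ends.
    word = word.lower().strip(string.punctuation)
    # Skip empty strings, which appear wherever there were extra spaces.
    if word != '':
        kept_words.append(word)

print(kept_words)
```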

In the space below, you are given a test tweet that has capitalizations, punctuation, and hashtags, and that will have empty strings when split into words. Write a function that takes a tweet as an argument, cleans it, and returns a list of words. Demonstrate that this function works by using it to clean the test tweet and print out the returned words.


In [ ]:
test_tweet = " Here is My test TWEET!!!?! #whoah #so_much_election #cmserocks   http://whoahdude.com "

words = test_tweet.split(' ')

print("the words in the uncleaned test tweet are:", words)

# put your code here!

A more comprehensive examination of the candidates' Twitter styles

Now that we've figured out how to clean the tweets, we're going to do a more in-depth analysis of the candidates' writing styles (or, more accurately, the styles of the candidates and their campaign staff - not all of the tweets come from the candidates themselves). In particular, we want to determine:

  1. What is the distribution of word lengths that each candidate uses? The average length of words used is one way to estimate the sophistication of writing - longer words may suggest more complex thoughts.
  2. Which candidate has used a larger vocabulary in their tweets? In other words, which candidate uses more distinct or unique words, and thus uses less repetition of individual words? As with word length, a larger vocabulary may suggest more complex thoughts.

Hint: Consider using a dictionary to help address the second question! (And see the tutorial we have provided on dictionaries.)
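To illustrate the hint, here is a small sketch of how a dictionary can count distinct words. The word list is made up, and the variable names are illustrative; with real data you would start from your cleaned list of words instead.

```python
# A made-up list of already-cleaned words.
sample_words = ['we', 'will', 'win', 'and', 'we', 'will', 'vote']

# Each key is a distinct word; each value is how many times it appeared.
word_counts = {}
for word in sample_words:
    word_counts[word] = word_counts.get(word, 0) + 1

# The number of keys is the number of distinct words (the vocabulary size).
vocabulary_size = len(word_counts)

# The average word length is computed over all words, including repeats.
average_length = sum(len(word) for word in sample_words) / len(sample_words)

print("distinct words:", vocabulary_size)
print("average word length:", average_length)
```

When comparing the two candidates, keep in mind that the total number of words each one has tweeted may differ, so you may want to compare the number of distinct words relative to the total.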

In the cell below, summarize what you've learned about the candidates' tweeting styles. Use the cell below that (and any additional cells you need) to include the code, figures, etc. that you needed to determine your answer.

Put your summary here!


In [ ]:
# put your code and figures here.  Add additional cells if necessary!

How do the candidates talk about each other?

We're now going to examine how the candidates have talked about each other. Go through their lists of tweets and find all of the instances where they refer to the other candidate, and determine:

  1. Whether they typically refer to their opponent using positive words, negative words, both positive and negative words, or neither.
  2. The most common positive and negative words that they use about their opponent. To find these, keep track of all of the positive and negative words that they use to refer to their opponent.

Hint: What words do the candidates use to refer to each other? By their first names, last names, Twitter handles, or something else?
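One way to organize this analysis is sketched below. Everything in it is a stand-in: the tiny word sets replace the contents of positive.txt and negative.txt, 'opponent' is a hypothetical placeholder for whatever names you find the candidates actually use, and the function name classify_mention is an illustrative choice. Each tweet is assumed to be a list of cleaned words.

```python
from collections import Counter

# Stand-in word sets; the real ones would come from positive.txt and negative.txt.
positive_words = {'great', 'strong', 'win'}
negative_words = {'weak', 'wrong', 'fail'}
# Hypothetical placeholder for the ways a candidate names their opponent.
opponent_names = {'opponent'}


def classify_mention(words):
    """Return None if the opponent is not mentioned in the list of words.

    Otherwise return a tuple: (label, positive words found, negative words found),
    where label is 'positive', 'negative', 'both', or 'neither'.
    """
    if not opponent_names.intersection(words):
        return None
    found_positive = [word for word in words if word in positive_words]
    found_negative = [word for word in words if word in negative_words]
    if found_positive and found_negative:
        label = 'both'
    elif found_positive:
        label = 'positive'
    elif found_negative:
        label = 'negative'
    else:
        label = 'neither'
    return label, found_positive, found_negative


# Made-up tweets, already cleaned and split into words.
sample_tweets = [
    ['opponent', 'is', 'weak', 'and', 'wrong'],
    ['we', 'will', 'win'],
    ['opponent', 'was', 'wrong', 'again'],
]

# Tally the negative words used in tweets that mention the opponent.
negative_counts = Counter()
for tweet_words in sample_tweets:
    result = classify_mention(tweet_words)
    if result is not None:
        negative_counts.update(result[2])

print(negative_counts.most_common(1))
```

A Counter is used here for the tally only as a convenience; a plain dictionary would work just as well.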

In the cell below, summarize what you've learned about how the candidates talk about each other.

Put your summary here!


In [ ]:
# put your code and figures here.  Add additional cells if necessary!

A different way of visualizing the data

Finally, we're going to visualize the positive and negative words that the candidates use to describe each other using word clouds. In a word cloud, the more a specific word appears in some source of textual data, the larger and bolder the word appears in the cloud. While this is not a perfectly representative way of visualizing data, it gives a good sense of what words are commonly used. So, here's how we will proceed:

First, install some software that can be used to generate word clouds by typing:

!pip install wordcloud

in a code cell, and wait for the package to install, which may take a minute or two. Make sure to include the exclamation point. You only have to do this once per computer!

Then, generate separate lists of the positive and negative words that each candidate uses to refer to their opponent (making sure to keep the duplicate words in the list - you need this for the word cloud!). You'll take each list, convert it into a string, clean up the string, and then make it into a word cloud. This is kind of a pain, so we're including example code below. Note that we are only showing you how to generate a simple word cloud - to make a more complicated one, look at the documentation.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

# imports the word cloud module
from wordcloud import WordCloud

# we assume you have a list called candidate_word_list_positive,
# which we then convert into a string
candidate_positive_words = str(candidate_word_list_positive)

# take that string and clean out all of the list-y things (quotes, commas, and brackets)
# by removing them; the spaces that separate the words are left in place.
candidate_positive_words = candidate_positive_words.replace('\'','').replace(',','').replace('\"','').replace('[','').replace(']','')

# now we make the word cloud, using only the 60 most common words and keeping the
# font size relatively small.
wordcloud = WordCloud(background_color="white",max_words=60, max_font_size=40).generate(candidate_positive_words)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In [ ]:
# Put your code here!

Section 3: Feedback (required!)


In [ ]:
from IPython.display import HTML
HTML(
"""
<iframe
    src="https://goo.gl/forms/8HNgy9DipNjx0Xe42?embedded=true" 
    width="80%" 
    height="1200px" 
    frameborder="0" 
    marginheight="0" 
    marginwidth="0">
    Loading...
</iframe>
"""
)

Congratulations, you're done!

How to submit this assignment

Log into the course Desire2Learn website (d2l.msu.edu) and go to the "Homework assignments" folder. There will be a dropbox labeled "Homework 4". Upload this notebook there.