This notebook is due on Friday, November 18th, 2016 at 11:59 p.m. Please make sure to get started early, and come by the instructors' office hours if you have any questions. Office hours and locations can be found in the course syllabus. IMPORTANT: While it's fine if you talk to other people in class about this homework - and in fact we encourage it! - you are responsible for creating the solutions for this homework on your own, and each student must submit their own homework assignment.
FOR THIS HOMEWORK: In addition to the correctness of your answers, we will be grading you on:
To that end:
Put your name here!
Schelling's model for happiness: Recall that in a 1D line of stars and zeros (to use Schelling's terminology), an element is "happy" if at least half of its neighbors (defined as the four elements to the left and four elements to the right) are like it, and "unhappy" otherwise. For those near the end of the line the rule is that, of the four neighbors on the side toward the center plus the one, two or three outboard neighbors, at least half must be like oneself.
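As a rough illustration of the happiness rule above (not the required solution), the check for a single element might be sketched as follows. The function name `is_happy` and the `reach` parameter are hypothetical, and the sketch assumes that an element at the very end of the line simply uses whichever neighbors exist.

```python
def is_happy(line, i, reach=4):
    """Return True if the element at index i is happy.

    Neighbors are up to `reach` elements on each side; near the ends
    of the line only the neighbors that actually exist are counted
    (an assumption about how to treat the edges).
    """
    lo = max(0, i - reach)
    hi = min(len(line), i + reach + 1)
    # everything in the window except the element itself
    neighbors = line[lo:i] + line[i + 1:hi]
    if not neighbors:
        return True  # a lone element has nobody to be unhappy about
    like = sum(1 for n in neighbors if n == line[i])
    # "at least half" of the neighbors must match
    return 2 * like >= len(neighbors)


sample_line = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0]
print([is_happy(sample_line, i) for i in range(len(sample_line))])
```

Comparing `2 * like` against the neighbor count avoids any floating-point division when testing "at least half".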
Your assignment is to implement the Schelling model exactly as described in the in-class project using zeros and ones to indicate the two types of elements. As with Schelling's original paper, you will play twice through the entire list, moving each element in a row (possibly moving a given element multiple times if it goes to the right). Print out the state of the list at each step so that you can see how the solution evolves.
Break up your code into functions whenever possible, and add comments to each function and to the rest of the code to make sure that we know what's going on!
In [ ]:
# put your code here
In this part of the homework, we're going to extend our analysis of the tweets of the presidential candidates (well, by the time you're actually coding this up, the president-to-be and their defeated opponent). We're going to do a few things:
The cells immediately below download the various files needed for this project, and then load the two files of tweets into huge strings named clinton_tweets and trump_tweets.
In [ ]:
# download the files
import urllib.request
files=['negative.txt','positive.txt']
path='http://www.unc.edu/~ncaren/haphazard/'
for file_name in files:
    urllib.request.urlretrieve(path+file_name,file_name)
files=['HillaryClinton_tweets.txt','realDonaldTrump_tweets.txt']
path='https://raw.githubusercontent.com/bwoshea/CMSE201_datasets/master/pres_tweets/'
for file_name in files:
    urllib.request.urlretrieve(path+file_name,file_name)
In [ ]:
'''
Now we open up the files. Note that the 'encoding="utf8"' portion
is to take care of the fact that some Windows machines have a hard
time reading the text files generated on Mac and Linux computers.
'''
clinton_tweets = open("HillaryClinton_tweets.txt",encoding="utf8").read()
trump_tweets = open("realDonaldTrump_tweets.txt",encoding="utf8").read()
We want to clean the data more thoroughly than we did in class. As before, we want to make all of the words lower case and remove punctuation. In addition, we want to remove words corresponding to hash tags (words starting with a pound sign, #) or websites (words starting with "http"), and any empty strings (an empty string looks like this: '').
In the space below, you are given a test tweet that has capitalizations, punctuation, hash tags, and that will have empty strings when split into words. Write a function that takes a tweet as an argument, cleans it, and returns a list of words. Demonstrate that this function works by using it to clean the test tweet and print out the returned words.
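One possible shape for such a function is sketched below, purely as an illustration. The name `clean_tweet` and the sample string are hypothetical, and the sketch assumes that "removing punctuation" means deleting every character in `string.punctuation`. Hash tags and links are filtered before punctuation is removed, because `#`, `:` and `/` are themselves punctuation characters.

```python
import string

# translation table that deletes every ASCII punctuation character
_REMOVE_PUNCT = str.maketrans('', '', string.punctuation)


def clean_tweet(tweet):
    """Return a list of cleaned, lower-case words from one tweet."""
    words = []
    for word in tweet.lower().split():
        # skip hash tags and web links before stripping punctuation
        if word.startswith('#') or word.startswith('http'):
            continue
        word = word.translate(_REMOVE_PUNCT)
        if word:  # drop anything that became an empty string
            words.append(word)
    return words


sample = "  Hello, WORLD!!  #tag http://example.com  "
print(clean_tweet(sample))
```

Calling `split()` with no argument already discards the empty strings produced by repeated spaces; the final `if word` check catches words that consisted only of punctuation.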
In [ ]:
test_tweet = " Here is My test TWEET!!!?! #whoah #so_much_election #cmserocks http://whoahdude.com "
words = test_tweet.split(' ')
print("the words in the uncleaned test tweet are:", words)
# put your code here!
Now that we've figured out how to clean the tweets, we're going to do a more in-depth analysis of the candidates' writing styles (or, more accurately, the styles of the candidates and their campaign staff - not all of the tweets come from the candidates themselves). In particular, we want to determine:
Hint: Consider using a dictionary to help address the second question! (And see the tutorial we have provided on dictionaries.)
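One common way to use a dictionary with text data is to count how often each word appears. Whether that is exactly what the question calls for is an assumption here, and the function name `count_words` and the sample list are hypothetical.

```python
def count_words(words):
    """Return a dictionary mapping each word to the number of times it appears."""
    counts = {}
    for word in words:
        # get() returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts


sample_words = ['vote', 'today', 'vote', 'now', 'vote']
counts = count_words(sample_words)

# sort the (word, count) pairs from most to least frequent
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(top[:2])
```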
In the cell below, summarize what you've learned about the candidates' tweeting styles. Use the cell below that (and any additional cells you need) to include the code, figures, etc. that you needed to determine your answer.
Put your summary here!
In [ ]:
# put your code and figures here. Add additional cells if necessary!
We're now going to examine how the candidates have talked about each other. Go through their list of tweets and find all of the instances where they refer to the other candidate, and determine:
Hint: What words do the candidates use to refer to each other? By their first names, last names, Twitter handles, or something else?
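Once you have decided which words count as a reference to the other candidate, picking out the matching tweets could look something like the sketch below. The helper `mentions_any`, the placeholder names, and the sample tweets are all hypothetical, and the sketch assumes you already have the tweets as a list of separate strings.

```python
def mentions_any(tweet, names):
    """Return True if the tweet contains any of the given names (case-insensitive)."""
    text = tweet.lower()
    return any(name.lower() in text for name in names)


# placeholder data: substitute the real names, handles, and tweets
names = ['opponent', '@opponent_handle']
sample_tweets = [
    "My Opponent is wrong about this.",
    "Great rally today, thank you!",
    "Watch @opponent_handle dodge the question.",
]

mentions = [t for t in sample_tweets if mentions_any(t, names)]
print(len(mentions), "of", len(sample_tweets), "tweets mention the other candidate")
```

This is a simple substring match, so a short name will also match inside longer words; matching on cleaned, whole words is a stricter alternative.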
In the cell below, summarize what you've learned about how the candidates talk about each other.
Put your summary here!
In [ ]:
# put your code and figures here. Add additional cells if necessary!
Finally, we're going to visualize the positive and negative words that the candidates use to describe each other using word clouds. In a word cloud, the more a specific word appears in some source of textual data, the larger and bolder the word appears in the cloud. While this is not a perfectly representative way of visualizing data, it gives a good sense of what words are commonly used. So, here's how we will proceed:
First, install some software that can be used to generate word clouds by typing:
!pip install wordcloud
in a code cell and wait for the code to install, which may take a minute or two. Make sure to include the exclamation point. You only have to do this once per computer!
Then, generate separate lists of the positive and negative words that each candidate uses to refer to their opponent (making sure to keep the duplicate words in the list - you need this for the word cloud!). You'll take each list, convert it into a string, clean up the string, and then make it into a word cloud. This is kind of a pain, so we're including example code below. Note that we are only showing you how to generate a simple word cloud - to make a more complicated one, look at the documentation.
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
# imports the word cloud module
from wordcloud import WordCloud
# we assume you have a list called candidate_word_list_positive,
# which we then convert into a string
candidate_positive_words = str(candidate_word_list_positive)
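# take that string and clean out all of the list-y things (quotes, commas, brackets) by deleting them.
# the spaces that str() put between the list items are left in place and separate the words.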
candidate_positive_words = candidate_positive_words.replace('\'','').replace(',','').replace('\"','').replace('[','').replace(']','')
# now we make the word cloud, using only the 60 most common words and keeping the
# font size relatively small.
wordcloud = WordCloud(background_color="white",max_words=60, max_font_size=40).generate(candidate_positive_words)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
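The example above assumes that a list called candidate_word_list_positive already exists. A minimal sketch of building such a list, keeping duplicates, is shown here. The function `sentiment_words`, the tiny lexicon, and the sample words are hypothetical stand-ins for the downloaded word lists and the cleaned tweets. The sketch also shows `' '.join(...)` as an alternative to converting the list with `str()` and deleting the extra characters.

```python
def sentiment_words(words, lexicon):
    """Return every word that appears in the lexicon, duplicates included."""
    return [w for w in words if w in lexicon]


# hypothetical stand-ins for the real lexicon and cleaned tweet words
positive_lexicon = {'great', 'good', 'strong'}
tweet_words = ['great', 'day', 'great', 'team', 'strong']

candidate_word_list_positive = sentiment_words(tweet_words, positive_lexicon)

# join the list directly into one space-separated string
candidate_positive_words = ' '.join(candidate_word_list_positive)
print(candidate_positive_words)
```

Storing the lexicon as a set keeps each membership test fast, which matters when checking every word of every tweet.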
In [ ]:
# Put your code here!
In [ ]:
from IPython.display import HTML
HTML(
"""
<iframe
src="https://goo.gl/forms/8HNgy9DipNjx0Xe42?embedded=true"
width="80%"
height="1200px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
"""
)