Homework #4 - SOLUTIONS

This notebook is due on Friday, November 18th, 2016 at 11:59 p.m. Please make sure to get started early, and come by the instructors' office hours if you have any questions. Office hours and locations can be found in the course syllabus. IMPORTANT: While it's fine if you talk to other people in class about this homework - and in fact we encourage it! - you are responsible for creating the solutions for this homework on your own, and each student must submit their own homework assignment.

FOR THIS HOMEWORK: In addition to the correctness of your answers, we will be grading you on:

  1. The quality of your code
  2. The correctness of your code
  3. Whether your code runs

To that end:

  1. Code quality: make sure that you use functions whenever possible, use descriptive variable names, and use comments to explain what your code does as well as function properties (including what arguments they take, what they do, and what they return).
  2. Whether your code runs: prior to submitting your homework assignment, re-run the entire notebook and test it. Go to the "Kernel" menu, select "Restart", and then click "clear all outputs and restart." Then, go to the "Cell" menu and choose "Run all" to ensure that your code produces the correct results. We will take off points for code that does not work correctly when we run it!

Your name

SOLUTIONS


Section 1: The 1D Schelling model

Schelling's model for happiness: Recall that in a 1D line of stars and zeros (to use Schelling's terminology), an element is "happy" if at least half of its neighbors (defined as the four elements to the left and the four elements to the right) are like it, and "unhappy" otherwise. For those near the end of the line, the rule is that, of the four neighbors on the side toward the center plus the one, two, or three outboard neighbors, at least half must be like oneself.

Your assignment is to implement the Schelling model exactly as described in the in-class project, using zeros and ones to indicate the two types of elements. As in Schelling's original paper, you will make two passes through the entire list, moving each unhappy element in turn (a given element may be moved more than once if it moves to the right, since it will then be encountered again later in the pass). Print out the state of the list at each step so that you can see how the solution evolves.

Break up your code into functions whenever possible, and add comments to each function and to the rest of the code to make sure that we know what's going on!


In [ ]:
# put your code here

import random
import math

def initialize_list(array_size=32, randseed=8675309):
    '''
    This function optionally takes in an array size and random seed
    and returns the initial neighborhood that we're going to start 
    from - a list of zeros and ones.  If no arguments are given, it
    defaults to the values specified.
    '''

    random.seed(randseed)
    
    initial_list = []

    for i in range(array_size):
        initial_list.append(random.randint(0,1))

    return initial_list

def is_happy(my_list, my_value, my_index):
    '''
    This function assumes that my_list has a value (my_value)
    popped out of it already, and checks to see if my_value
    would be happy in my_list at index my_index.  It returns
    'True' if happy and 'False' if unhappy under those circumstances.
    '''

    # do some error-checking (is the index within the allowed range?)
    if my_index < 0 or my_index > len(my_list):
        print("you've made an indexing error!", my_index)
        
    start = my_index-4 # start 4 to the left
    end = my_index+4   # covers the 4 right-hand neighbors - my_value was popped out, so my_list[my_index] is already a neighbor (and range() excludes end)
    
    # if the starting value is out of bounds, fix it
    if start < 0:
        start = 0
    
    # if the ending value is out of bounds, fix it.  note that we want to go to 
    # len(list), not len(list)-1, because range() goes to 1 before the end of 
    # the range!
    if end > len(my_list):
        end = len(my_list)

    # keep track of the neighbors that are like me
    neighbors_like_me = 0
    
    # keep track of total neighbors
    total_neighbors = 0
    
    # loop over the specified range
    for i in range(start,end):
        if my_list[i] == my_value:  # if this neighbor is like me, keep track of that
            neighbors_like_me += 1
        total_neighbors+=1  # also keep track of total neighbors
    
    # happy if at least half are like me, unhappy otherwise
    # note: it's *at least* half because we're not double-counting our
    # own value
    if neighbors_like_me/total_neighbors >= 0.5:
        return True
    else:
        return False


def where_to_move(my_list, my_value, my_index):
    '''
    Given a neighborhood (my_list), a value (my_value), and the index
    that it started at (my_index), figure out where to move my_value
    so that it's happy.  This assumes that my_value is unhappy where it 
    is, by the way!  This function then returns the index where my_value
    should move to in order to be happy.
    '''

    # this block of code steps to the left to see where (if anywhere) it's 
    # happy to the left.  If it continues to be unhappy, it'll stop when it
    # is about to step off the end of the list.
    left_index=my_index-1
    left_happy=False
    
    while left_happy==False and left_index >= 0:
        left_happy = is_happy(my_list,my_value,left_index)
        if left_happy==False:
            left_index -= 1
    
    # as above, but to the right.
    right_index=my_index+1
    right_happy=False

    while right_happy==False and right_index < len(my_list):
        right_happy = is_happy(my_list,my_value,right_index)
        if right_happy==False:
            right_index += 1

    # now we figure out where the new index should be!
    
    if left_index < 0 and right_index < len(my_list):
        # can't be happy to the left; only possible answer is right_index
        new_index = right_index
        
    elif left_index >= 0 and right_index >= len(my_list):  
        # can't be happy to the right; only possible answer is left_index
        new_index = left_index
        
    elif left_index >= 0 and right_index < len(my_list): 
        # we're within bounds, so now check to see which side is closer.
        # if they're the same we move it to the left.  (This was never specified
        # by Schelling, so we have to make a choice on that.)
        if math.fabs(left_index-my_index) > math.fabs(right_index-my_index):
            new_index = right_index
        else:
            new_index = left_index        
    else:
        # this should only ever happen if something goes horribly wrong
        # (no happy spot exists in either direction).
        print("something has gone wrong in where_to_move!")
        new_index = my_index  # fall back to leaving the element where it started
    
    return new_index

def neighborhood_print(neighborhood, note=''):
    '''
    This is a convenience function to take our neighborhood list,
    make a string of stars and zeros out of it, and print the string
    plus optional text at the end.  It's not necessary but it looks pretty.  
    '''
    
    neighborstring=''

    for i in range(len(neighborhood)):
        if neighborhood[i] > 0:
            neighborstring += '*'
        else:
            neighborstring += '0'
    
    # make sure optional text is a string
    if type(note)!=str:
        note = str(note)
    
    # add an extra space to make it look nice!
    if note != '':
        note = ' ' + note
        
    neighborstring += note
    
    print(neighborstring)
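
# Optional quick sanity check with hypothetical test values: pop an element
# out of a small hand-made list (just as the main loop below does) and ask
# whether it would be happy at a couple of candidate positions.
test_list = [0, 0, 0, 0, 1, 1, 1, 1]
test_value = test_list.pop(0)   # remove the leading zero
print("happy at index 0?", is_happy(test_list, test_value, 0))
print("happy at index 7?", is_happy(test_list, test_value, 7))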

In [ ]:
# initialize list to defaults
neighborhood = initialize_list(array_size=32)

neighborhood_print(neighborhood, 'initial state')

# do 2 loops over the list
for i in range(2):

    this_index = 0 

    # step through the neighborhood once
    while this_index < len(neighborhood):
        
        this_val = neighborhood.pop(this_index)

        if is_happy(neighborhood,this_val,this_index):
            # if we're happy where we are, don't change anything!
            neighborhood.insert(this_index,this_val)
            
        else:
            # we're unhappy; we need to figure out where to move and then move.
            new_index = where_to_move(neighborhood,this_val,this_index)
            neighborhood.insert(new_index,this_val)

        neighborhood_print(neighborhood, this_index)

        # increment this_index or we'll never stop looping
        this_index += 1

# print out the final state, just to see what it's like.
neighborhood_print(neighborhood, 'final state!')

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(neighborhood,'ro')
plt.xlim(-2,len(neighborhood)+1)
plt.ylim(-0.1,1.1)
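
# Optional axis labels to make the final-state plot easier to read.
plt.xlabel("position in neighborhood")
plt.ylabel("element type (0 or 1)")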

Section 2: Wrapping up our Twitter analysis

In this part of the homework, we're going to extend our analysis of the tweets of the presidential candidates (well, by the time you're actually coding this up, the president-to-be and their defeated opponent). We're going to do a few things:

  1. Clean the data more comprehensively.
  2. Examine the candidates' tweeting styles in a more in-depth fashion.
  3. See how often the candidates refer to each other and, when they do, whether those references are positive, negative, or neutral.

The cells that are immediately below this download the various files needed for this project, and then load the two files of tweets into huge strings named clinton_tweets and trump_tweets.


In [ ]:
# download the files
import urllib.request

files=['negative.txt','positive.txt']
path='http://www.unc.edu/~ncaren/haphazard/'
for file_name in files:
    urllib.request.urlretrieve(path+file_name,file_name)
    
files=['HillaryClinton_tweets.txt','realDonaldTrump_tweets.txt']
path='https://raw.githubusercontent.com/bwoshea/CMSE201_datasets/master/pres_tweets/'
for file_name in files:
    urllib.request.urlretrieve(path+file_name,file_name)

In [ ]:
'''
Now we open up the files.  Note that the 'encoding="utf8"' portion
is to take care of the fact that some Windows machines have a hard 
time reading the text files generated on Mac and Linux computers.
'''
clinton_tweets = open("HillaryClinton_tweets.txt",encoding="utf8").read()
trump_tweets = open("realDonaldTrump_tweets.txt",encoding="utf8").read()

Cleaning the data

We want to do a more comprehensive job of cleaning the data than we did in class (while still doing everything we did there). In particular, in addition to making all of the words lower case and removing punctuation, we want to remove words corresponding to hashtags (words starting with a pound sign, #) or websites (words starting with "http"), as well as any empty strings (an empty string looks like this: '').

In the space below, you are given a test tweet that contains capital letters, punctuation, hashtags, a website link, and extra spaces (which produce empty strings when the tweet is split into words). Write a function that takes a tweet as an argument, cleans it, and returns a list of words. Demonstrate that this function works by using it to clean the test tweet and print out the returned words.


In [ ]:
test_tweet = " Here is My test TWEET!!!?! #whoah #so_much_election #cmserocks   http://whoahdude.com "

words = test_tweet.split(' ')

print("the words in the uncleaned test tweet are:", words)

# put your code here!
from string import punctuation

def tweet_cleaner(tweet_to_clean):
    '''
    Tweet cleaner.  Takes in a string that is a single messy tweet and returns
    a list of words that has been lowercased and has had hashtags, website
    links, punctuation, and empty strings removed.  The order in which this
    happens is very important!
    '''

    # make everything lowercase
    lowercase = tweet_to_clean.lower()
    
    # split into a list (still has hashtags and punctuation, though)
    uncleaned_words = lowercase.split()

    # empty list - fill with words that aren't hash tags
    no_hashtags = []
    
    # loop through words.  If a word starts with a '#' it's a hashtag, and if it
    # starts with 'http' it's a website link; append everything else to no_hashtags.
    for word in uncleaned_words:
        if word[0] != '#' and word[0:4] != 'http':
            no_hashtags.append(word)
         
    # empty list - fill with words that have been cleaned of punctuation
    no_punctuation = []

    # loop through words, clean each word of punctuation, and then append
    # it to the no_punctuation list (skipping anything that ends up empty,
    # e.g. a "word" that was nothing but punctuation).
    for word in no_hashtags:
        for p in punctuation:
            word = word.replace(p,'')
        if word != '':
            no_punctuation.append(word)
        
    # at this point our dirty, dirty string has been turned into a 
    # clean, clean list!
    return no_punctuation
    
list_of_words = tweet_cleaner(test_tweet)

print("cleaned test tweet:", list_of_words)

A more comprehensive examination of the candidates' Twitter styles

Now that we've figured out how to clean the tweets, we're going to do a more in-depth analysis of the candidates' writing styles (or, more accurately, the styles of the candidates and their campaign staff - not all of the tweets come from the candidates themselves). In particular, we want to determine:

  1. What is the distribution of word lengths that each candidate uses? The average length of words used is one way to estimate the sophistication of writing - longer words may suggest more complex thoughts.
  2. Which candidate has used a larger vocabulary in their tweets? In other words, which candidate uses more distinct or unique words, and thus uses less repetition of individual words? As with word length, a larger vocabulary may suggest more complex thoughts.

Hint: Consider using a dictionary to help address the second question! (And see the tutorial we have provided on dictionaries.)

In the cell below, summarize what you've learned about the candidates' tweeting styles. Use the cell below that (and any additional cells you need) to include the code, figures, etc. that you needed to determine your answer.

ANSWER: The distribution of word lengths is basically the same for the two candidates. When you examine the number of distinct words, the Clinton campaign uses about 10% more distinct words than the Trump campaign.


In [ ]:
# put your code and figures here.  Add additional cells if necessary!

def analyze_candidate_tweets(tweets):
    '''
    Analyzes candidate tweets to determine (1) the distribution of word lengths
    and (2) the vocabulary used.  Returns two dictionaries: one mapping each
    word length to the number of words of that length, and one mapping each
    distinct word to the number of times it is used.
    '''
    
    tweets_list = tweets.split('\n')

    # make dictionaries
    word_lengths = {}
    word_dictionary = {}
    
    # loop over tweets
    for tweet in tweets_list:
        
        # for each tweet, make a cleaned list of words
        word_list = tweet_cleaner(tweet)

        # loop over words in the tweet
        for word in word_list:
            # how long is the word?
            wordsize = len(word)
            
            # increment the word count if it's in the dictionary already;
            # add it if not.
            if word in word_dictionary:
                word_dictionary[word] += 1
            else:
                word_dictionary[word] = 1
            
            # increment the number of words at that size if it's already in the 
            # dictionary; add it if not.
            if wordsize in word_lengths:
                word_lengths[wordsize] += 1
            else:
                word_lengths[wordsize] = 1
            
    # return our two dictionaries    
    return word_lengths, word_dictionary 

# get clinton info
clinton_word_lengths, clinton_word_dictionary = analyze_candidate_tweets(clinton_tweets)

print("Number of Clinton words: ", len(clinton_word_dictionary))

# get trump info
trump_word_lengths, trump_word_dictionary = analyze_candidate_tweets(trump_tweets)

print("Number of Trump words: ", len(trump_word_dictionary))

# sort by word length so that the lines are drawn from left to right
trump_lengths_sorted = sorted(trump_word_lengths.items())
clinton_lengths_sorted = sorted(clinton_word_lengths.items())

plt.plot([l for l, c in trump_lengths_sorted], [c for l, c in trump_lengths_sorted], 'r-')
plt.plot([l for l, c in clinton_lengths_sorted], [c for l, c in clinton_lengths_sorted], 'b-')
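
# Optional: label the plot and summarize each distribution with an average word
# length, using a small helper defined here (sum of length*count over total count).
plt.xlabel("word length (characters)")
plt.ylabel("number of words")
plt.legend(["Trump", "Clinton"])

def average_word_length(length_dict):
    '''Returns the average word length implied by a {length: count} dictionary.'''
    total_letters = sum(length * count for length, count in length_dict.items())
    total_words = sum(length_dict.values())
    return total_letters / total_words

print("Average Clinton word length:", average_word_length(clinton_word_lengths))
print("Average Trump word length:", average_word_length(trump_word_lengths))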

How do the candidates talk about each other?

We're now going to examine how the candidates have talked about each other. Go through their list of tweets and find all of the incidences where they refer to the other candidate, and determine:

  1. If they typically refer to their opponent using positive words, negative words, both positive and negative words, or neither?
  2. Keep track of all of the positive and negative words that they use to refer to their opponent. What are the most common positive and negative words that they use about their opponent?

Hint: What words do the candidates use to refer to each other? By their first names, last names, Twitter handles, or something else?

In the cell below, summarize what you've learned about how the candidates talk about each other.

ANSWER:

Clinton refers to Trump about 25% of the time; the largest fraction of mentions is "neither positive nor negative", followed by "positive" and then "negative". The most common positive words that Clinton uses when referring to Trump are help, just, will, and great. The most common negative words are racist, lie, against, unqualified, and lying.

Trump refers to Clinton about 15% of the time; the largest fraction of mentions is "both positive and negative", followed by "negative" and then "neither". The most common positive words that Trump uses when referring to Clinton are even, will, just, deal, and wow. The most common negative words are crooked (almost 200 times!), lie, bad, corrupt, and wrong.


In [ ]:
# put your code and figures here.  Add additional cells if necessary!

# get list of positive words (previously downloaded)
pos_sent = open("positive.txt").read()
positive_words=pos_sent.split('\n')

# get list of negative words (previously downloaded)
neg_sent = open("negative.txt").read()
negative_words=neg_sent.split('\n')

# guess at the names clinton and trump use to refer to each other
# (note that 'hilary' is included to catch misspelled mentions)
clinton_names = ['hillaryclinton', 'hillary', 'hilary', 'clinton']
trump_names = ['realdonaldtrump','donald','trump']

def candidate_referral(tweets, opponent_names, positive_words, negative_words):
    '''
    This function examines a candidate's tweets to see how often they refer to
    their opponent, whether those mentions are positive or negative, and which
    positive and negative words they use.
    
    Function takes in a list of tweets from a candidate, the names that they use to refer to
    their opponent, and lists of positive and negative words.
    
    Outputs a dictionary of info about the tweets (total tweets, mentions, pos. mentions, neg. mentions, etc.),
    dictionaries of the positive and negative words used, and lists of EVERY positive and negative
    word used (even if they are repeated).  The lists are used for word clouds later.
    '''
    
    # split up our tweets
    tweets_list = tweets.split('\n')
    
    # dictionaries of positive and negative words we'll use
    pos_words_used = {}
    neg_words_used = {}
    
    # lists to keep track of positive and negative words used
    pos_words_list = []
    neg_words_list = []
    
    # dictionary to keep track of tweet statistics
    tweet_info_dict = {'positive':0, 'negative':0, 'both':0, 'neither':0, 
                       'total_tweets':0, 'opponent_mentions':0}
    
    # loop over tweets
    for tweet in tweets_list:

        # is the opponent mentioned in this tweet?
        is_opponent_mentioned = 0
        
        # clean the tweet and return list of words
        word_list = tweet_cleaner(tweet)

        # loop over words in word list
        for word in word_list:

            # check to see if the opponent is mentioned
            if word in opponent_names:
                is_opponent_mentioned += 1
        
        # if the opponent is mentioned, now let's figure some stuff out.
        if is_opponent_mentioned > 0:

            # keep track of mentions
            tweet_info_dict['opponent_mentions'] += 1

            negative = 0
            positive = 0
            
            # loop over words in tweet again
            for word in word_list:

                # if the word is on our list of positive words, add that up, 
                # and keep track of the word in both the dict and list
                if word in positive_words:
                    positive += 1
                    pos_words_list.append(word)
                    if word in pos_words_used:
                        pos_words_used[word] += 1
                    else:
                        pos_words_used[word] = 1
                        
                # do the same for negative words (as above)
                if word in negative_words:
                    negative += 1
                    neg_words_list.append(word)
                    if word in neg_words_used:
                        neg_words_used[word] += 1
                    else:
                        neg_words_used[word] = 1

            # logic to figure out if it's a positive, negative, both pos/neg, or neither mention
            if positive > 0 and negative == 0:
                tweet_info_dict['positive'] += 1

            if positive == 0 and negative > 0:
                tweet_info_dict['negative'] += 1

            if positive == 0 and negative == 0:
                tweet_info_dict['neither'] += 1
                
            if positive > 0 and negative > 0:
                tweet_info_dict['both'] += 1

        # no matter what, increment the total number of tweets
        tweet_info_dict['total_tweets'] += 1

    # return all the info.
    return tweet_info_dict, pos_words_used, neg_words_used, pos_words_list, neg_words_list
    
# collect info about clinton mentions
clinton_info_dict, clinton_poswords, clinton_negwords, clinton_poswords_list, clinton_negwords_list = candidate_referral(clinton_tweets, trump_names, 
                                                                           positive_words, negative_words)

# collect info about trump mentions
trump_info_dict, trump_poswords, trump_negwords, trump_poswords_list, trump_negwords_list = candidate_referral(trump_tweets, clinton_names, 
                                                                           positive_words, negative_words)
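
# Optional: back up the percentages quoted in the answer above by computing
# the fraction of each candidate's tweets that mention their opponent.
print("Fraction of Clinton tweets mentioning Trump:",
      clinton_info_dict['opponent_mentions'] / clinton_info_dict['total_tweets'])
print("Fraction of Trump tweets mentioning Clinton:",
      trump_info_dict['opponent_mentions'] / trump_info_dict['total_tweets'])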

In [ ]:
# print out some info; I'm going to extract words by hand.
print("CLINTON POSITIVE WORDS:", clinton_poswords,"\n\n")
print("CLINTON NEGATIVE WORDS:", clinton_negwords)

In [ ]:
# print out some info; I'm going to extract words by hand.
print("TRUMP POSITIVE WORDS:", trump_poswords,"\n\n")
print("TRUMP NEGATIVE WORDS:", trump_negwords)

In [ ]:
# pie chart for Clinton referrals to Trump

print("\n\n******** CLINTON (referrals to Trump) ********\n\n")
print(clinton_info_dict)

# The slices will be ordered and plotted counter-clockwise.
labels = 'positive', 'both', 'negative', 'neither'
sizes = [clinton_info_dict['positive'], clinton_info_dict['both'],
         clinton_info_dict['negative'], clinton_info_dict['neither']]
colors = ['yellowgreen', 'yellow','red',  'lightcyan']
explode = (0.1, 0, 0.1, 0)  

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90);
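
# Optional: draw the pie as a true circle and add a title.
plt.axis('equal')
plt.title("Clinton tweets that mention Trump")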

In [ ]:
# pie chart for Trump referrals to Clinton

print("\n\n******** TRUMP (referrals to Clinton) ********\n\n")
print(trump_info_dict)

# The slices will be ordered and plotted counter-clockwise.
labels = 'positive', 'both', 'negative', 'neither'
sizes = [trump_info_dict['positive'], trump_info_dict['both'],
         trump_info_dict['negative'], trump_info_dict['neither']]
colors = ['yellowgreen', 'yellow','red',  'lightcyan']
explode = (0.1, 0, 0.1, 0)  

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90);
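
# Optional: draw the pie as a true circle and add a title.
plt.axis('equal')
plt.title("Trump tweets that mention Clinton")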

A different way of visualizing the data

Finally, we're going to visualize the positive and negative words that the candidates use to describe each other using word clouds. In a word cloud, the more a specific word appears in some source of textual data, the larger and bolder the word appears in the cloud. While this is not a perfectly representative way of visualizing data, it gives a good sense of what words are commonly used. So, here's how we will proceed:

First, install some software that can be used to generate word clouds by typing:

!pip install wordcloud

in a code cell and wait for the package to install, which may take a minute or two. Make sure to include the exclamation point. You only have to do this once per computer!

Then, generate separate lists of the positive and negative words that each candidate uses to refer to their opponent (making sure to keep the duplicate words in the list - you need this for the word cloud!). You'll take each list, convert it into a string, clean up the string, and then make it into a word cloud. This is kind of a pain, so we're including example code below. Note that we are only showing you how to generate a simple word cloud - to make a more complicated one, look at the documentation.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

# imports the word cloud module
from wordcloud import WordCloud

# we assume you have a list called candidate_word_list_positive,
# which we then convert into a string
candidate_positive_words = str(candidate_word_list_positive)

# take that string and clean out all of the list-y things by replacing them with blank spaces.
candidate_positive_words = candidate_positive_words.replace('\'','').replace(',','').replace('\"','').replace('[','').replace(']','')

# now we make the word cloud, using only the 60 most common words and keeping the
# font size relatively small.
wordcloud = WordCloud(background_color="white",max_words=60, max_font_size=40).generate(candidate_positive_words)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In [ ]:
# positive words clinton uses in mentions of trump
from wordcloud import WordCloud

clinton_positive_words = str(clinton_poswords_list)
clinton_positive_words = clinton_positive_words.replace('\'','').replace(',','').replace('\"','').replace('[','').replace(']','')

wordcloud = WordCloud(background_color="white",max_words=60, max_font_size=40).generate(clinton_positive_words)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In [ ]:
# negative words clinton uses in mentions of trump
clinton_negative_words = str(clinton_negwords_list)
clinton_negative_words = clinton_negative_words.replace('\'','').replace(',','').replace('\"','').replace('[','').replace(']','')

wordcloud = WordCloud(background_color="white",max_words=60, max_font_size=40).generate(clinton_negative_words)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In [ ]:
# positive words trump uses in mentions of clinton
trump_positive_words = str(trump_poswords_list)
trump_positive_words = trump_positive_words.replace('\'','').replace(',','').replace('\"','').replace('[','').replace(']','')

wordcloud = WordCloud(background_color="white",max_words=60, max_font_size=40).generate(trump_positive_words)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In [ ]:
# negative words trump uses in mentions of clinton
trump_negative_words = str(trump_negwords_list)
trump_negative_words = trump_negative_words.replace('\'','').replace(',','').replace('\"','').replace('[','').replace(']','')

wordcloud = WordCloud(background_color="white",max_words=60, max_font_size=40).generate(trump_negative_words)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Section 3: Feedback (required!)


In [ ]:
from IPython.display import HTML
HTML(
"""
<iframe
    src="https://goo.gl/forms/8HNgy9DipNjx0Xe42?embedded=true" 
    width="80%" 
    height="1200px" 
    frameborder="0" 
    marginheight="0" 
    marginwidth="0">
    Loading...
</iframe>
"""
)

Congratulations, you're done!

How to submit this assignment

Log into the course Desire2Learn website (d2l.msu.edu) and go to the "Homework assignments" folder. There will be a dropbox labeled "Homework 4". Upload this notebook there.