Problem 1

Write a function that takes a list of 0s and 1s and produces the corresponding integer. The equation for converting a list $L = [l_1, l_2, ..., l_n]$ of 0's and 1's to binary is $\sum_i l_i*2^i$. What is the integer representation of [1, 0, 0, 0, 1, 1, 0, 1]?



In [1]:

    
def to_binary(x):
    the_sum = 0
    # enumerate returns pairs of values from `x`
    # as well as the index of each value
    for index, value in enumerate(x):
        the_sum += value * 2**index
    
    return the_sum



In [2]:

    
my_list = [1, 1]
to_binary(my_list)









    Out[2]:





3



In [3]:

    
my_list = [1, 0, 0, 0, 1, 1, 0, 1]
to_binary(my_list)









    Out[3]:





177

One note - there are actually 2 possible solutions to this problem, depending on which value of [1, 0, 0, 0, 1, 1, 0, 1] is treated as the least-significant bit (LSB). The solution above treats the left-most bit as the LSB (i.e. the bit that gets multiplied by $2^0=1$). How would you rewrite the function to treat the right-most bit as the LSB?

Problem 2

Read data/alice_in_wonderland.txt into memory. How many characters does it contain? How does this compare to its size on disk?
Print out the unique non-ASCII characters in Alice in Wonderland (hint: non-ASCII means that the number of bytes used is greater than 1).
Write the first 10,000 characters of Alice in Wonderland as text and as a pickle. What are the sizes of each file on disk?



In [3]:

    
import os

with open('data/alice_in_wonderland.txt', 'r') as file:
    alice = file.read()

# how many characters are in Alice?
print('number of characters is {}'.format(len(alice)))

# how large is the file on disk?
print('number of bytes on disk is {}'.format(os.path.getsize('data/alice_in_wonderland.txt')))









    



number of characters is 163817
number of bytes on disk is 173595

So this tells us that there are non-ASCII characters (characters that use more than 1 byte) in the file



In [5]:

    
# non-ASCI characters are characters that use more
# than 1 byte to represent the character
non_ascii = []
for character in alice:
    # convert character to Unicode bytes and check how many bytes there are
    if len(bytes(character, 'UTF-8')) > 1:
        non_ascii.append(character)

# convert list to set to get only the unique characters
print('unique non-ASCII characters:', set(non_ascii))









    



unique non-ASCII characters: {'‘', '’', '\ufeff', '“', '”'}



In [8]:

    
import pickle

# open a file in write mode ('w') to write plain text
with open('data/alice_partial.txt', 'w') as file:
    file.write(alice[:10000])

# open a file in write-binary ('wb') mode to write pickle protocol
with open('data/alice_partial.pickle', 'wb') as file:
    pickle.dump(alice[:10000], file)

print('size of plain text file: {}'.format(os.path.getsize('data/alice_partial.txt')))
print('size of pickled file: {}'.format(os.path.getsize('data/alice_partial.pickle')))









    



size of plain text file: 10182
size of pickled file: 10192

Problem 3

Iterating over good_movies, print the name of the movies that Ben Affleck stars in.
Find the total number of Oscar nominations for 2016 movies in the dataset.



In [12]:

    
import json

# use the `json` library to read json-structured plain text into Python objects
with open('data/good_movies.json', 'r') as file:
    good_movies = json.loads(file.read())



In [14]:

    
# iterate over the movies, checking the list of stars for each
for movie in good_movies:
    if 'Ben Affleck' in movie['stars']:
        print(movie['title'])









    



Argo
Gone Girl



In [16]:

    
# iterate over the movies, tallying the Oscars for movies in 2016
nominations_2016 = 0
for movie in good_movies:
    if movie['year'] == 2016:
        nominations_2016 += movie['oscar_nominations']

print(nominations_2016)

Problem 4

Create a NumPy array with 100,000 random integers between 1 and 100. Then, write two functions (in pure Python, not using built-in NumPy functions):

Compute the average
Compute the standard deviation
Create weight vector of 100,000 elements (the sum of the elements is 1). Compute the weighted average of your first vector with these weights.



In [18]:

    
import numpy as np

rand_array = np.random.randint(1, high=100, size=100000)



In [19]:

    
def my_average(x):
    the_sum = 0
    for el in x:
        the_sum += el
    
    return the_sum / len(x)

def my_stdev(x):
    the_sum = 0
    the_avg = my_average(x)
    for xi in x:
        the_sum += (xi - the_avg) ** 2
    return np.sqrt(the_sum / len(x))

def my_weighted_average(x, weights):
    the_sum = 0
    for el, weight in zip(x, weights):
        the_sum += el * weight
    
    return the_sum



In [20]:

    
print('average:', my_average(rand_array))
print('standard deviation:', my_stdev(rand_array))









    



average: 49.9322
standard deviation: 28.5287448578

A weight vector needs to sum to 1. So we'll create a vector of random numbers between 0 and 1 and normalize it (divide by its sum) so that it sums to 1.



In [23]:

    
rand_weights = np.random.random(size=100000)
rand_weights /= np.sum(rand_weights)



In [25]:

    
print('weighted average:',  my_weighted_average(rand_array, rand_weights))









    



weighted average: 49.9482673521



In [ ]: