Import libraries
In [1]:
import numpy as np
from io import BytesIO
import matplotlib
import matplotlib.pyplot as plt
import random
from mpl_toolkits.mplot3d import Axes3D
from bs4 import BeautifulSoup
import urllib.request
%matplotlib inline
Open the file $\mathtt{dataset}$_$\mathtt{HW0.txt}$, containing birth biometrics as well as maternal data for a number of U.S. births, and inspect the CSV formatting of the data. Load the data, without the column headers, into a NumPy array.
Do some preliminary explorations of the data by printing out the dimensions as well as the first three rows of the array. Finally, for each column, print out the range of the values.
Prettify your output: add some text and formatting so that your outputs are readable (e.g. "36x4" is less readable than "array dimensions: 36x4").
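The steps above (dimensions, first three rows, per-column range) can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: the in-memory CSV, its header names, and its values are made up so the block runs without `dataset_HW0.txt`, and the column labels follow the column order used in the cells below.

```python
import io
import numpy as np

# Hypothetical stand-in for dataset_HW0.txt: one header row, then numeric rows.
sample_csv = io.StringIO(
    "birth_weight,femur_length,mother_age\n"
    "2.9,1.8,22\n"
    "3.4,2.1,35\n"
    "3.1,1.9,16\n"
    "2.7,1.7,41\n"
)
column_names = ["birth weight", "femur length", "maternal age"]  # assumed labels

# skiprows=1 drops the header line so only numeric rows are parsed.
data = np.loadtxt(sample_csv, delimiter=",", skiprows=1)

n_rows, n_cols = data.shape
print(f"array dimensions: {n_rows}x{n_cols}")
print("first three rows:")
print(data[:3])

# The range of each column is its minimum and maximum along axis 0.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
for name, lo, hi in zip(column_names, col_min, col_max):
    print(f"{name}: min = {lo:.2f}, max = {hi:.2f}")
```

Printing the minimum and maximum answers the "range of values" question directly; a histogram additionally shows how the values are distributed within that range.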
In [2]:
#create a variable for the file dataset_HW0.txt
fname = 'dataset_HW0.txt'
In [3]:
#fname
In [4]:
# Option 1: Open the file and load the data into the numpy array; skip the headers
with open(fname) as f:
    # drop comment lines lazily, then skip the header row while parsing
    lines = (line for line in f if not line.startswith('#'))
    data = np.loadtxt(lines, delimiter=',', skiprows=1)
In [5]:
# What is the shape of the data
data.shape
Out[5]:
In [6]:
#Option 2: Open the file and load the data into the numpy array; skip the headers
data = np.loadtxt('dataset_HW0.txt', delimiter=',', skiprows=1)
data.shape
Out[6]:
In [7]:
# print the first 3 rows of the data
data[0:3]
Out[7]:
In [8]:
#data[:,0]
In [9]:
# show the range of values for birth weight
fig = plt.figure()
axes = fig.add_subplot(111)
plt.xlabel("birth weight")
axes.hist(data[:,0])
Out[9]:
In [10]:
# show the range of values for the femur length
fig = plt.figure()
axes = fig.add_subplot(111)
plt.xlabel("femur length")
axes.hist(data[:,1])
Out[10]:
Compute the mean birth weight and mean femur length for the entire dataset. Now, we want to split the birth data into three groups based on the mother's age (17 and under, 18 to 34, and 35 to 50, matching the masks in the code below).
For each maternal age group, compute the mean birth weight and mean femur length.
Prettify your output.
Compare the group means with each other and with the overall mean. What can you conclude?
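One way to compute and print the group means is with boolean masks, sketched below. The four data rows are made up for illustration, and the dictionary labels are hypothetical names, not part of the assignment.

```python
import numpy as np

# Made-up rows for illustration: birth weight, femur length, maternal age.
data = np.array([
    [3.0, 2.0, 16.0],
    [3.4, 2.2, 25.0],
    [3.6, 2.4, 30.0],
    [3.2, 2.0, 40.0],
])
age = data[:, 2]

# Each mask is a plain boolean array with one entry per row.
groups = {
    "age <= 17": age <= 17,
    "age 18-34": (age >= 18) & (age <= 34),
    "age 35-50": (age >= 35) & (age <= 50),
}

# Overall means of the first two columns.
overall = data[:, :2].mean(axis=0)
print(f"{'overall':<11}: mean birth weight = {overall[0]:.3f}, "
      f"mean femur length = {overall[1]:.3f}")

# Per-group means: select rows with the mask, then average column-wise.
group_means = {}
for label, mask in groups.items():
    group_means[label] = data[mask, :2].mean(axis=0)
    bw, fl = group_means[label]
    print(f"{label:<11}: mean birth weight = {bw:.3f}, mean femur length = {fl:.3f}")
```

Keeping the masks in a dictionary lets the same loop print every group in a consistent format, which makes the comparison with the overall mean easier to read.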
In [11]:
#calculate the overall means
birth_weight_mean = data[:,0].mean()
birth_weight_mean
Out[11]:
In [12]:
#calculate the overall mean for Femur Length
femur_length_mean = data[:,1].mean()
femur_length_mean
Out[12]:
In [13]:
# Capture the birth weight
birth_weight = data[:,0]
#Capture the Femur length
femur_length = data[:,1]
# Capture the maternal age
maternal_age = data[:,2]
maternal_age.shape
# Create indexes for the different maternal age groups
#group_1
group_1 = maternal_age <= 17
#group_2
group_2 = (maternal_age >= 18) & (maternal_age <= 34)
#group_3
group_3 = (maternal_age >= 35) & (maternal_age <= 50)
In [14]:
bw_g1 = data[:, 0][group_1]
age0_17 = data[:, 2][group_1]
bw_g1.mean()
Out[14]:
In [15]:
fl_g1 = data[:, 1][group_1]
fl_g1.mean()
Out[15]:
In [16]:
bw_g2 = data[:, 0][group_2]
age18_34 = data[:, 2][group_2]
bw_g2.mean()
Out[16]:
In [17]:
fl_g2 = data[:, 1][group_2]
fl_g2.mean()
Out[17]:
In [18]:
bw_g3 = data[:, 0][group_3]
age35_50 = data[:, 2][group_3]
bw_g3.mean()
Out[18]:
In [19]:
fl_g3 = data[:, 1][group_3]
fl_g3.mean()
Out[19]:
In [20]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for c, m in [('r', 'o')]:
    ax.scatter(bw_g1, fl_g1, age0_17, edgecolor=c, facecolors=(0,0,0,0), marker=m, s=40)
for c, m in [('b', 's')]:
    ax.scatter(bw_g2, fl_g2, age18_34, edgecolor=c, facecolors=(0,0,0,0), marker=m, s=40)
for c, m in [('g', '^')]:
    ax.scatter(bw_g3, fl_g3, age35_50, edgecolor=c, facecolors=(0,0,0,0), marker=m, s=40)
fig.suptitle('3D Data Visualization', fontsize=14, fontweight='bold')
ax.set_title('Birth Weight vs Femur Length vs Maternal Age Plot')
ax.set_xlabel('birth_weight')
ax.set_ylabel('femur_length')
ax.set_zlabel('maternal_age')
plt.show()
In [21]:
plt.scatter(maternal_age,birth_weight, color='r', marker='o')
plt.xlabel("maternal age")
plt.ylabel("birth weight")
plt.show()
In [22]:
plt.scatter(maternal_age,femur_length, color='b', marker='s')
plt.xlabel("maternal age")
plt.ylabel("femur length")
plt.show()
In [23]:
plt.scatter(birth_weight,femur_length, color='g', marker='^')
plt.xlabel("birth weight")
plt.ylabel("femur length")
plt.show()
Open and load the page (Kafka's The Metamorphosis) at
$\mathtt{http://www.gutenberg.org/files/5200/5200-h/5200-h.htm}$
into a BeautifulSoup object.
The object we obtain is a parse tree (a data structure representing all tags and the relationships between tags) of the html file. To concretely visualize this object, print out the first 1000 characters of a representation of the parse tree using the $\mathtt{prettify()}$ function.
In [24]:
# load the file into a beautifulsoup object
page = urllib.request.urlopen("http://www.gutenberg.org/files/5200/5200-h/5200-h.htm").read()
In [25]:
# prettify the data read from the url and print the first 1000 characters
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify()[0:1000])
Explore the nested data structure you obtain in Part (a) by printing out the following:
Make your output readable.
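Navigating the parse tree can be tried offline on a small document first. The sketch below assumes a made-up HTML string in place of the downloaded page; the tag names mirror the ones inspected in the cells that follow.

```python
from bs4 import BeautifulSoup, Tag

# Tiny made-up page standing in for the downloaded HTML.
html = """
<html>
  <head>
    <title>Sample Title</title>
    <meta charset="utf-8"/>
  </head>
  <body>
    <pre>preformatted text</pre>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute access returns the first matching descendant tag.
print("title string:", soup.head.title.string)

# .children yields tags and whitespace-only strings; keep only the tags.
head_tags = [child.name for child in soup.head.children if isinstance(child, Tag)]
print("tags inside head:", head_tags)

print("pre string:", soup.body.pre.string)
print("first p string:", soup.body.p.string)
```

Note that `.string` returns `None` for a tag with more than one child, which is why the head tag itself is inspected through its children rather than through `.string`.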
In [26]:
# print the content of the head tag
soup.head
Out[26]:
In [27]:
# print the title tag inside the head tag
soup.head.title
Out[27]:
In [28]:
# print each child of the head tag
for child in soup.head.children:
    print(child)
Out[28]:
In [29]:
# print the string inside the title tag
soup.head.title.string
Out[29]:
In [30]:
# print the string inside the pre-formatted text (pre) tag
print(soup.body.pre.string)
In [31]:
# print the string inside first paragraph (p) tag
print(soup.body.p.string)
Now we want to extract the text of The Metamorphosis and do some simple analysis. Beautiful Soup provides a way to extract all text from a webpage via the $\mathtt{get}$_$\mathtt{text()}$ function.
Print the first and last 1000 characters of the text returned by $\mathtt{get}$_$\mathtt{text()}$. Is this the content of the novella? Where is the content of The Metamorphosis stored in the BeautifulSoup object?
In [32]:
print(soup.get_text()[:1000])
print(soup.get_text()[-1000:])
In [33]:
# collect the text of every paragraph (p) tag into one string
p = soup.find_all('p')
combined_text = ''
for node in p:
    combined_text += "".join(node.find_all(string=True))
print(combined_text[0:1000])
Count the number of words in The Metamorphosis. Compute the average word length and plot a histogram of word lengths.
You'll need to adjust the number of bins for each histogram.
Hint: You'll need to pre-process the text in order to obtain the correct word/sentence length and count.
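The pre-processing hinted at above can be sketched with a regular expression instead of a chain of `replace` calls. The sample sentence is made up, and the definition of a word (a run of letters, optionally with internal apostrophes) is an assumption; a different definition gives different counts.

```python
import re

# Made-up sample text; the real input would be the extracted paragraph text.
sample_text = (
    'The door opened slowly; "Who is there?" he asked. '
    "Nobody answered, so he didn't move."
)

# Assumption: a word is a run of letters, optionally with internal apostrophes.
words = re.findall(r"[a-z]+(?:'[a-z]+)*", sample_text.lower())
word_lengths = [len(w) for w in words]

word_count = len(words)
avg_word_length = sum(word_lengths) / word_count
print("The total number of words:", word_count)
print(f"The average word length is: {avg_word_length:.2f}")
```

Matching what counts as a word, rather than deleting a fixed set of punctuation marks, also handles characters that were not anticipated, such as parentheses or dashes.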
In [35]:
word_list = combined_text.lower().replace(':','').replace('.','').replace(',', '').replace('"','').replace('!','').replace('?','').replace(';','').split()
#print(word_list[0:100])
word_length = [len(n) for n in word_list]
print(word_length[0:100])
total_word_length = sum(word_length)
print("The total word length: ", total_word_length)
wordcount = len(word_list)
print("The total number of words: ", wordcount)
avg_word_length = total_word_length / wordcount
print("The average word length is: ", avg_word_length)
# count how often each unique word occurs (disabled)
# wordcount = {}
# for word in word_list:
#     if word not in wordcount:
#         wordcount[word] = 1
#     else:
#         wordcount[word] += 1
# for k,v in wordcount.items():
#     print (len(k), v)
In [40]:
# Print the histogram for the word lengths
fig = plt.figure()
axes = fig.add_subplot(111)
plt.xlabel("Word Lengths")
plt.ylabel("Count")
#axes.hist(word_length)
# bin edges run to max + 1 so the longest word length gets its own bin
plt.hist(word_length, bins=np.arange(min(word_length), max(word_length) + 2, 1))
Out[40]:
In this problem we practice generating data by setting up a simulation of a simple phenomenon, a queue.
Suppose we're interested in simulating a queue that forms in front of a small Bank of America branch with one teller, where the customers arrive one at a time.
We want to study the queue length and customer waiting time.
Assume that gaps between consecutive arrivals are uniformly distributed over the interval of 1 to 20 minutes (i.e. any two times between 1 minute and 20 minutes are equally likely).
Assume that the service times are uniform over the interval of 5 to 15 minutes.
Generate the arrival and service times for 100 customers, using the $\mathtt{uniform()}$ function from the $\mathtt{random}$ library.
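A minimal sketch of the generation step follows, assuming the clock starts at 0 so that each arrival time is the running sum of the gaps. The seed and the variable names are illustrative choices, not part of the assignment.

```python
import random

random.seed(0)  # fixed seed only so the sketch is repeatable
n_customers = 100

# Gaps between consecutive arrivals, uniform on [1, 20] minutes.
arrival_gaps = [random.uniform(1, 20) for _ in range(n_customers)]
# Service times, uniform on [5, 15] minutes.
service_times = [random.uniform(5, 15) for _ in range(n_customers)]

# Assumption: the clock starts at 0, so arrival times are cumulative sums of gaps.
arrival_times = []
clock = 0.0
for gap in arrival_gaps:
    clock += gap
    arrival_times.append(clock)

print(f"generated {len(arrival_times)} arrivals; "
      f"last arrival at {arrival_times[-1]:.1f} minutes")
```

Storing the gaps and the cumulative arrival times separately keeps the raw random draws available for checking against the stated intervals.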
In [ ]: