This notebook contains the exercise where we apply what we have learned to real data.
We will be looking at a recent Atlantic Magazine piece: "My President Was Black".
In [1]:
header = """
My President Was Black
A history of the first African American White House—and of what came next
By Ta-Nehisi Coates
Photograph by Ian Allen
"""
In [2]:
?repr
In [3]:
print(repr(header))
Now that we can see all the elements in our header, notice that each part we need to extract sits between two \n characters, so let's split the header using those \n as the separator. You can check the split() method documentation here.
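As a quick illustration of how split() behaves with '\n' as the separator (a minimal sketch, not part of the exercise data):
example = 'first\nsecond\nthird'
print(example.split('\n'))   # ['first', 'second', 'third']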
In [4]:
header_list = header.split('\n')
In [5]:
print(header_list)
We use the strip() method to remove leading and trailing whitespace; check the documentation here.
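For example (a minimal sketch):
print('   hello world   '.strip())   # prints: hello world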
In [6]:
#Removing extra white space in each element of our list
for i in range(len(header_list)):
    header_list[i] = header_list[i].strip()
In [7]:
print(header_list)
You can leave the empty strings in, but if you want to remove them you can follow the next lines.
To remove all the elements that match a condition, we will use filter() and a lambda function.
I used this because I first tried:
for element in header_list:
    if element == '':
        header_list.remove(element)
but I had trouble: it wasn't deleting all of the '' elements. You can see the behaviour in this simple example (a fix is sketched right after it).
In [8]:
##EXAMPLE of remove not doing what I want.
c = ['a', 'a', 'a', 'a', 'a', '', '', '', '']
for element in c:
    if element == '':
        c.remove(element)
print(c)
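The loop misses elements because c.remove() shifts the remaining items to the left while the loop's internal index keeps advancing, so the element right after each removed one gets skipped. Iterating over a copy avoids this; a minimal sketch:
c = ['a', 'a', 'a', 'a', 'a', '', '', '', '']
for element in c[:]:      # c[:] is a shallow copy, so removing from c is safe
    if element == '':
        c.remove(element)
print(c)                  # ['a', 'a', 'a', 'a', 'a']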
In [9]:
header_list = list(filter(lambda item: item!='' , header_list))
In [10]:
print(header_list)
Now that our data is clean and neat, let's extract what we were assigned to get. Restating the exercise statement, we need:
In [11]:
title = header_list[0]
intro = header_list[1]
Let's print them!
In [12]:
print('The title is: ', title)
print('The intro is: ', intro)
Another way of formatting print statements is the following. Remember that \n means "go to the next line".
In [13]:
print('The title is: {}. \nThe introduction is: {}.'.format(title, intro))
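If you are on Python 3.6 or newer, an f-string is yet another option (a minimal sketch):
print(f'The title is: {title}. \nThe introduction is: {intro}.')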
In [14]:
author = header_list[2].strip('By ')
photographer = header_list[3].strip('Photograph by ')
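A small caveat: strip('By ') removes any leading or trailing characters drawn from the set {'B', 'y', ' '}, not the literal prefix, so it only works here because these names don't start or end with those characters. On Python 3.9+ a safer sketch is removeprefix():
author = header_list[2].removeprefix('By ')
photographer = header_list[3].removeprefix('Photograph by ')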
In [15]:
print('The author is: {}. \nThe photographer is: {}.'.format(author, photographer))
In [16]:
print('Title : ', title)
print('Introduction: ', intro)
print('Author : ', author)
print('Photographer: ', photographer)
In [17]:
first_paragraph = """
In the waning days of President Barack Obama’s administration, he and his wife, Michelle, hosted a farewell party, the full import of which no one could then grasp. It was late October, Friday the 21st, and the president had spent many of the previous weeks, as he would spend the two subsequent weeks, campaigning for the Democratic presidential nominee, Hillary Clinton. Things were looking up. Polls in the crucial states of Virginia and Pennsylvania showed Clinton with solid advantages. The formidable GOP strongholds of Georgia and Texas were said to be under threat. The moment seemed to buoy Obama. He had been light on his feet in these last few weeks, cracking jokes at the expense of Republican opponents and laughing off hecklers. At a rally in Orlando on October 28, he greeted a student who would be introducing him by dancing toward her and then noting that the song playing over the loudspeakers—the Gap Band’s “Outstanding”—was older than she was. “This is classic!” he said. Then he flashed the smile that had launched America’s first black presidency, and started dancing again. Three months still remained before Inauguration Day, but staffers had already begun to count down the days. They did this with a mix of pride and longing—like college seniors in early May. They had no sense of the world they were graduating into. None of us did.
"""
In [18]:
repr(first_paragraph)
Out[18]:
In [19]:
paragraph_list = first_paragraph.split()
Let's take a look at the split paragraph.
In [20]:
print(paragraph_list)
As you can notice, the punctuation doesn't affect the word count because the symbols are attached to the previous word. However, the em dash '—' links two words, and that will affect the count, so let's replace each em dash with a space; then splitting by whitespace will do what we want.
If you don't have the em dash symbol on your keyboard, you can generate it with (Ctrl+Shift+u)+2014.
In [21]:
#We want to keep first_paragraph without changes, so we create a copy and work with that.
revised_paragraph = first_paragraph
In [22]:
for element in revised_paragraph:
    if element == '—':
        revised_paragraph = revised_paragraph.replace(element, ' ')
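Since str.replace() already replaces every occurrence, the loop above can be collapsed into a single call (a minimal sketch):
revised_paragraph = first_paragraph.replace('—', ' ')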
In [23]:
print(revised_paragraph)
Now our revised_paragraph is the same as the first paragraph but without the em dashes. Let's split it and then count the words.
In [24]:
words_list = revised_paragraph.split()
To count the words we just need to know the length of our list. To do that we use len()
In [25]:
words = len(words_list)
In [26]:
print('The amount of words in first_paragraph is: ', words)
To find the number of sentences we just need to split the initial paragraph using '.' as the delimiter, but what happens when we have a '?'? If there is one we should take it into account. We can replace each '?' with a '.' just for the counting, and after that we are good to go.
In [27]:
#We want to keep first_paragraph without changes, so we create a copy, in this case replacing ? by .
#and work with that copy.
sentence_paragraph = first_paragraph.replace('?', '.')
sentence_paragraph = sentence_paragraph.replace('\n', '')
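An alternative sketch with the re module splits on any sentence-ending punctuation in one step (the name sentence_list_re is just for illustration; it would also handle '!' if the text had any, and it still leaves '' elements to filter out, just like the split below):
import re
sentence_list_re = re.split(r'[.?!]', first_paragraph.replace('\n', ''))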
In [28]:
#Split the paragraph into sentences
sentence_list = sentence_paragraph.split('.')
In [29]:
print(sentence_list)
In [30]:
#Let's remove the '' elements
sentence_list = list(filter(lambda item: item!='' , sentence_list))
In [31]:
print(sentence_list)
Now our sentence_list just contains the sentences; let's use len() to count them.
In [32]:
sentences = len(sentence_list)
In [33]:
print('The amount of sentences in first_paragraph is: ', sentences)
If you read the paragraph you might have noticed that in some parts the word Obama appears as "Obama's". We want to count that case too; therefore, we will look for the string 'Obama' in each word of words_list. If the string is in the word, we will add 1 to the variable obama_count (initialized to 0).
In [34]:
obama_count = 0
for word in words_list:
    if 'Obama' in word:
        obama_count += 1
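The same loop can be written as a one-line generator expression (a minimal sketch):
obama_count = sum(1 for word in words_list if 'Obama' in word)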
In [35]:
print(obama_count)
In [36]:
print('The word Obama in first_paragraph appears {} times.'.format(obama_count))
This sounds like a tricky question: why would lower-casing affect the word count? In part one we were asked to count the words and we did. Lower-casing the whole text won't affect that count; removing all types of punctuation, however, might. For example, think of cases like "Obama's": the initial count treats it as one word, even though "Obama's" can stand for "Obama is", which is two words.
So, let's remove punctuation and count again.
Lower-casing can be done in one line, because strings have a method called lower() that will do it for us.
In [37]:
#Lower case the whole paragraph
lower_paragraph = first_paragraph.lower()
In [38]:
print(lower_paragraph)
To approach this exercise we will use some of the string constants available in Python. Thankfully, there is one that contains all the punctuation (string.punctuation). Once we have this string, we will loop over its characters, and every time we find one of them in our paragraph we will replace it with a space (' ').
In [39]:
#First we import the string constants available in python.
import string
In [40]:
#Let's print the string punctuation.
print(string.punctuation)
In [41]:
#Loop over the characters of string.punctuation and replace the ones that appear
#in our lower_paragraph with a space.
for character in string.punctuation:
    lower_paragraph = lower_paragraph.replace(character, ' ')
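An alternative sketch uses str.translate() with a table built by str.maketrans(), which maps every punctuation character to a space in a single pass (starting again from the original paragraph):
table = str.maketrans({character: ' ' for character in string.punctuation})
lower_paragraph = first_paragraph.lower().translate(table)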
In [42]:
print(lower_paragraph)
It seems that our paragraph uses curly quotation marks, single and double, rather than straight ones. Those are not in our string.punctuation constant, and neither is the em dash. Let's create another constant string with all the remaining characters we want to remove and take them out. We might also want to add the '\n' symbol.
You probably don't have curly quotation marks on your keyboard, but you can create them using unicode:
(Ctrl+Shift+u)+2018
(Ctrl+Shift+u)+2019
(Ctrl+Shift+u)+201c
(Ctrl+Shift+u)+201d
(Ctrl+Shift+u)+2014
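If typing those sequences is inconvenient, the same characters can also be written with Python unicode escapes; this sketch is equivalent to the string defined in the next cell:
more_punct = '\u2019\u2018\u201c\u201d\u2014\n'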
In [43]:
more_punct = '’‘“”—\n'
In [44]:
for character in more_punct:
    lower_paragraph = lower_paragraph.replace(character, ' ')
In [45]:
print(lower_paragraph)
Now that we have removed all kinds of punctuation, let's create our list and count its elements to get the number of words.
In [46]:
no_punctuation_list = lower_paragraph.split()
words_no_punctuation = len(no_punctuation_list)
In [47]:
print('The amount of words in the paragraph with no punctuation is: ', words_no_punctuation)
In [48]:
print('The amount of words in first_paragraph is: ', words)
print('The amount of sentences in first_paragraph is: ', sentences)
print('The word Obama in first_paragraph appears {} times.'.format(obama_count))
print('The amount of words in the paragraph with no punctuation is: ', words_no_punctuation)
Since this is the advanced exercise we will try to use some cool stuff so we learn more. You might have changed the name of the file when you copied it to your working folder, or its location might not be the same as mine. To have that freedom, we will use the input() function to specify the path (location) of the file we want to open.
For example, in my case it will be data/article_part_one.txt
In [49]:
name = input('Enter file name with its path location: ')
In [50]:
with open(name, 'r') as file:
    article = file.read()
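Depending on your platform's default encoding, the curly quotes and em dashes in the article may not decode correctly. Assuming the file is saved as UTF-8, passing the encoding explicitly is a safer sketch:
with open(name, 'r', encoding='utf-8') as file:
    article = file.read()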
In [51]:
print(article)
In [52]:
#Let's create a string with all the punctuation we want to remove to count words.
all_punct = string.punctuation + more_punct
In [53]:
all_punct
Out[53]:
Now we replace all the punctuation characters with a space. We work on a copy so the original article stays intact.
In [54]:
#We will modify article_no_punct but we want to keep article intact.
article_no_punct = article
In [55]:
for char in all_punct:
    if char in article_no_punct:
        article_no_punct = article_no_punct.replace(char, ' ')
In [56]:
article_no_punct
Out[56]:
Now we split and count using len().
In [57]:
words_list = article_no_punct.split()
In [58]:
words_total = len(words_list)
In [59]:
print('The total amount of words is: {}'.format(words_total))
We can do the counting easily with a dictionary: it lets us keep track of how many times each word appears. We will also use the get() method to do the counting.
In [60]:
count = {}
for word in words_list:
    count[word] = count.get(word, 0) + 1
If we print the dictionary we should see every word as a key, with its value indicating how many times it appears.
In [61]:
print(count)
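The standard library has a shortcut for exactly this pattern, collections.Counter; this sketch builds the same mapping in one line:
from collections import Counter
count = Counter(words_list)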
In [62]:
#We will modify article_sentences but we want to keep article intact.
article_sentences = article
In [63]:
article_sentences = article_sentences.replace('\n','.')
In [64]:
sentences_article_list = article_sentences.split('.')
In [65]:
print(sentences_article_list)
Let's omit the '' elements and the "sentences" that are actually just a couple of characters left over from an abbreviation, for example '— F' or '"'. These elements have the particularity that their length is at most 3, so let's filter on that.
In [66]:
list_clean = list(filter(lambda item: len(item)>3 , sentences_article_list))
In [67]:
print(list_clean)
In [68]:
sentence_total = len(list_clean)
In [69]:
print('The total amount of sentences is: {}'.format(sentence_total))
Using our words_list, which is in order of appearance in the text, we will make two different lists: the words that follow 'white' and the ones that follow 'black'. Once we have those lists we will build a dictionary counting the repetitions of each word, and finally we will look for the maximum value and its corresponding key.
We will make all the words lower case so we can just look for the lower-case words and not worry about capitalization.
Then we will use two new things to complete the task, enumerate() and list comprehensions; you can read about the latter here in section 5.1.4.
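As a quick illustration of both tools together (a minimal sketch): enumerate() pairs each element with its index, and the list comprehension keeps only the indices whose element matches.
letters = ['a', 'b', 'a', 'c']
print([i for i, x in enumerate(letters) if x == 'a'])   # [0, 2]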
In [70]:
words_lower = []
for word in words_list:
    words_lower.append(word.lower())
We want to know where in the list the word 'white' appears and where the word 'black' appears. We look for those indices as follows.
In [71]:
indx_white = [i for i, x in enumerate(words_lower) if x == "white"]
indx_black = [i for i, x in enumerate(words_lower) if x == "black"]
In [72]:
print('idx white:\n',indx_white)
print('idx black:\n',indx_black)
Let's look for the word that follows each repetition of white/black and save them into lists.
In [73]:
#Looking for the words that follow white
lst_white = []
for i in indx_white:
    lst_white.append(words_lower[i+1])
In [74]:
#Looking for the words that follow black
lst_black = []
for i in indx_black:
    lst_black.append(words_lower[i+1])
In [75]:
print('Words that follow white:\n', lst_white)
print('Words that follow black:\n', lst_black)
Let's count for each list the repetitions of each word using dictionaries and the get method.
In [76]:
follows_white = {}
for word in lst_white:
    follows_white[word] = follows_white.get(word, 0) + 1
In [77]:
print(follows_white)
In [78]:
follows_black = {}
for word in lst_black:
    follows_black[word] = follows_black.get(word, 0) + 1
In [79]:
print(follows_black)
Let's get the word in each dictionary that has the biggest value.
In [80]:
most_follow_white = max(follows_white, key=follows_white.get)
most_follow_black = max(follows_black, key=follows_black.get)
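If we had built the counts with collections.Counter instead, most_common(1) would give the same winners (a sketch; it returns a list of (word, count) pairs):
from collections import Counter
most_follow_white = Counter(lst_white).most_common(1)[0][0]
most_follow_black = Counter(lst_black).most_common(1)[0][0]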
In [81]:
print("The most used word that follows 'white' is: ",most_follow_white)
print("The most used word that follows 'black' is: ",most_follow_black)