Refine the Data



In [32]:

    
import pandas as pd



In [33]:

    
df = pd.read_csv('data_tau.csv')



In [34]:

    
df.head()









    Out[34]:






  
    
      
     title                                              date
0    An Exploration of R, Yelp, and the Search for ...  5 points by Rogerh91 6 hours ago  | discuss
1    Deep Advances in Generative Modeling               7 points by gwulfs 15 hours ago  | 1 comment
2    Spark Pipelines: Elegant Yet Powerful              3 points by aouyang1 9 hours ago  | discuss
3    Shit VCs Say                                       3 points by Argentum01 10 hours ago  | discuss
4    Python, Machine Learning, and Language Wars        4 points by pmigdal 17 hours ago  | discuss

To work out how old each post is, in days, we will use the following rule:

If the string contains "hours", we count it as 1 day.
If the string contains "day", we take the number that precedes it.

To apply this rule, we need to be able to pick these words and digits out of a string. For that we will use regular expressions.

Introduction to Regular Expression (Regex)

A regular expression (regex) is a pattern, written with literal characters and special symbols, that describes the text to select from a string.

Refer to the following links for an interactive playground



In [35]:

    
import re



In [36]:

    
test_string = "Hello world, welcome to 2016."



In [37]:

    
# re.search scans the string and returns a match object for the first occurrence
# of the pattern, or None if there is no match. Here the pattern is plain literal text.
a = re.search('Hello world, welcome to 2016',test_string)



In [38]:

    
a









    Out[38]:





<_sre.SRE_Match object; span=(0, 28), match='Hello world, welcome to 2016'>



In [39]:

    
a.group()









    Out[39]:





'Hello world, welcome to 2016'
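The match object carries more than the matched text. As a small sketch (reusing the same test string, with `welcome` chosen here only for illustration), the position of the match can be read back and used to slice the original string:

```python
import re

test_string = "Hello world, welcome to 2016."
m = re.search('welcome', test_string)

print(m.group())            # the matched text
print(m.span())             # (start, end) offsets of the match
print(m.start(), m.end())   # the same offsets, individually

# Slicing the original string with the span gives back the matched text
print(test_string[m.start():m.end()])
```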



In [40]:

    
# '.' matches any single character (except a newline), so we get the first character
a = re.search('.',test_string)
a.group()









    Out[40]:





'H'



In [41]:

    
# '.*' matches zero or more of any character, so it runs to the end of the line
a = re.search('.*',test_string)
a.group()









    Out[41]:





'Hello world, welcome to 2016.'



In [42]:

    
a = re.search('Hello',test_string)
print(a)









    



<_sre.SRE_Match object; span=(0, 5), match='Hello'>

Some basic symbols

?

The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".

*

The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

+

The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
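The three symbols can be checked directly in Python. This is a minimal sketch: `matching` is a hypothetical helper written for this example, and it uses `re.fullmatch` so that a word only counts when the whole word fits the pattern.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional_u = matching('colou?r', ['color', 'colour', 'colouur'])
zero_or_more_b = matching('ab*c', ['ac', 'abc', 'abbbc', 'adc'])
one_or_more_b = matching('ab+c', ['ac', 'abc', 'abbbc', 'adc'])

print(optional_u)      # ['color', 'colour']
print(zero_or_more_b)  # ['ac', 'abc', 'abbbc']
print(one_or_more_b)   # ['abc', 'abbbc']
```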



In [43]:

    
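# \w matches one word character (letter, digit or underscore); '.' then matches any character.
# A raw string (r'...') keeps Python from treating the backslash as an escape.
a = re.search(r'\w.',test_string)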
print(a)









    



<_sre.SRE_Match object; span=(0, 2), match='He'>



In [44]:

    
# \w* matches a run of zero or more word characters, so it stops at the first space
a = re.search(r'\w*',test_string)
print(a)









    



<_sre.SRE_Match object; span=(0, 5), match='Hello'>
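One thing to watch with `*` is that "zero or more" includes zero, so the pattern can succeed without matching any text. The sketch below uses a made-up string that starts with punctuation to show the difference between `\w*` and `\w+`, and uses `re.findall` to collect every word rather than only the first.

```python
import re

text = ", hello"

# \w* is satisfied by an empty match at position 0, before the comma
m_star = re.search(r'\w*', text)
# \w+ needs at least one word character, so it skips ahead to "hello"
m_plus = re.search(r'\w+', text)

print(repr(m_star.group()), m_star.span())  # '' (0, 0)
print(repr(m_plus.group()), m_plus.span())  # 'hello' (2, 7)

# re.findall returns all non-overlapping matches as a list
words = re.findall(r'\w+', "Hello world, welcome to 2016.")
print(words)  # ['Hello', 'world', 'welcome', 'to', '2016']
```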

Exercises



In [45]:

    
string = '''In 2016, we are learning Text Analytics in Data Science 101
            by scraping http://datatau.com'''



In [46]:

    
string = "In 2016, we are learning Text Analytics in Data Science 101 by scraping http://datatau.com"

Write a regex to pick the number 2016 from the string above.



In [ ]:

Write a regex to pick the URL (of the form http://xyz.com) from the string above.



In [ ]:

Let's extract the number of days from the date string



In [47]:

    
df.head()









    Out[47]:






  
    
      
     title                                              date
0    An Exploration of R, Yelp, and the Search for ...  5 points by Rogerh91 6 hours ago  | discuss
1    Deep Advances in Generative Modeling               7 points by gwulfs 15 hours ago  | 1 comment
2    Spark Pipelines: Elegant Yet Powerful              3 points by aouyang1 9 hours ago  | discuss
3    Shit VCs Say                                       3 points by Argentum01 10 hours ago  | discuss
4    Python, Machine Learning, and Language Wars        4 points by pmigdal 17 hours ago  | discuss



In [48]:

    
df.tail()









    Out[48]:






  
    
      
     title                                              date
175  Getting Started with Statistics for Data Science   3 points by nickhould 35 days ago  | discuss
176  Rodeo 1.3 - Tab-completion for docstrings          3 points by glamp 35 days ago  | discuss
177  Teaching D3.js - links                             3 points by pmigdal 35 days ago  | discuss
178  Parallel scikit-learn on YARN                      5 points by stijntonk 39 days ago  | discuss
179  Meetup: Free Live Webinar on Prescriptive Anal...  2 points by ann928 32 days ago  | discuss



In [49]:

    
date_string = df['date'][0]



In [50]:

    
print(date_string)









    



5 points by Rogerh91 6 hours ago  | discuss



In [51]:

    
re.search('hours',date_string)









    Out[51]:





<_sre.SRE_Match object; span=(23, 28), match='hours'>



In [52]:

    
date_string = df['date'][50]



In [53]:

    
print(date_string)









    



4 points by lefish 7 days ago  | discuss



In [54]:

    
# If hours is not there, we don't get any match
re.search('hours',date_string)
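Because `re.search` returns `None` when nothing matches, calling `.group()` on the result without checking would raise an `AttributeError`. A minimal sketch of the check, using the same date string as above:

```python
import re

date_string = "4 points by lefish 7 days ago  | discuss"

match = re.search('hours', date_string)
print(match)  # None

# Test for None before using the match object
if match is None:
    found_hours = False
else:
    found_hours = True

print(found_hours)  # False
```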



In [55]:

    
# Let us match the digits preceding the word "day" (\d+ means one or more digits)
day_search = re.search(r'\d+ day',date_string)
day_search









    Out[55]:





<_sre.SRE_Match object; span=(19, 24), match='7 day'>



In [56]:

    
days_string = day_search.group(0)
days_string









    Out[56]:





'7 day'



In [57]:

    
days = days_string.split(' ')[0] 
days









    Out[57]:





'7'
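Splitting the matched text works, but a capture group can pull the digits out in one step. The sketch below is an alternative to the split, not a change to the cells above: the parentheses mark the part of the pattern to keep, `group(1)` returns just that part, and `int()` turns it into a number.

```python
import re

date_string = "4 points by lefish 7 days ago  | discuss"

# Parentheses capture only the digits; " day" must still follow them
day_search = re.search(r'(\d+) day', date_string)

print(day_search.group(0))  # whole match: '7 day'
print(day_search.group(1))  # captured digits: '7'

days = int(day_search.group(1))
print(days)  # 7
```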



In [58]:

    
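def return_reg_ex_days(row):
    # Posts that are only hours old are counted as 1 day
    if re.search('hours', row['date']) is not None:
        return 1
    # Otherwise capture the digits that precede "day" or "days"
    day_search = re.search(r'(\d+) day', row['date'])
    if day_search is None:
        # Neither "hours" nor "<n> day(s)" was found, so the age is unknown
        return None
    # Return an integer so the column has one consistent type
    return int(day_search.group(1))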



In [59]:

    
# Now we apply this function to each row in the dataframe
df['days'] = df.apply(return_reg_ex_days,axis=1)
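The same rule can also be written with pandas string methods instead of a row-by-row `apply`. This is only a sketch under assumptions: `sample` is a two-row stand-in built here so the example runs without `data_tau.csv`, and every row is assumed to contain either "hours" or a number followed by "day".

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe
sample = pd.DataFrame({'date': [
    '5 points by Rogerh91 6 hours ago  | discuss',
    '4 points by lefish 7 days ago  | discuss',
]})

# Digits preceding "day"; rows without them come back as NaN
extracted = sample['date'].str.extract(r'(\d+) day', expand=False)
days = pd.to_numeric(extracted)

# Rows that mention "hours" are counted as 1 day
is_hours = sample['date'].str.contains('hours')
sample['days'] = days.mask(is_hours, 1).astype(int)

print(sample['days'].tolist())  # [1, 7]
```

The final `astype(int)` would fail on a row that matches neither case, which is one way to notice date strings in an unexpected format.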



In [60]:

    
df.head()









    Out[60]:






  
    
      
     title                                              date                                            days
0    An Exploration of R, Yelp, and the Search for ...  5 points by Rogerh91 6 hours ago  | discuss     1
1    Deep Advances in Generative Modeling               7 points by gwulfs 15 hours ago  | 1 comment    1
2    Spark Pipelines: Elegant Yet Powerful              3 points by aouyang1 9 hours ago  | discuss     1
3    Shit VCs Say                                       3 points by Argentum01 10 hours ago  | discuss  1
4    Python, Machine Learning, and Language Wars        4 points by pmigdal 17 hours ago  | discuss     1



In [61]:

    
df.tail()









    Out[61]:






  
    
      
     title                                              date                                          days
175  Getting Started with Statistics for Data Science   3 points by nickhould 35 days ago  | discuss  35
176  Rodeo 1.3 - Tab-completion for docstrings          3 points by glamp 35 days ago  | discuss      35
177  Teaching D3.js - links                             3 points by pmigdal 35 days ago  | discuss    35
178  Parallel scikit-learn on YARN                      5 points by stijntonk 39 days ago  | discuss  39
179  Meetup: Free Live Webinar on Prescriptive Anal...  2 points by ann928 32 days ago  | discuss     32



In [62]:

    
# Let us save the dataframe to a CSV file
df.to_csv('data_tau_days.csv', index=False)
