Refine the Data


In [32]:
import pandas as pd

In [33]:
df = pd.read_csv('data_tau.csv')

In [34]:
df.head()


Out[34]:
   title                                              date
0  An Exploration of R, Yelp, and the Search for ...  5 points by Rogerh91 6 hours ago | discuss
1  Deep Advances in Generative Modeling               7 points by gwulfs 15 hours ago | 1 comment
2  Spark Pipelines: Elegant Yet Powerful              3 points by aouyang1 9 hours ago | discuss
3  Shit VCs Say                                       3 points by Argentum01 10 hours ago | discuss
4  Python, Machine Learning, and Language Wars        4 points by pmigdal 17 hours ago | discuss

To work out how many days old each post is, we will use the following algorithm:

  • If the string contains "hours", we count the post as 1 day old.
  • If the string contains "day", we take the number that precedes it.

To apply this algorithm, we need to be able to pick out these words and digits from a string. For that we will use regular expressions.

Introduction to Regular Expressions (Regex)

A regular expression is a pattern, written with ordinary characters and special symbols, that describes the text we want to select from a string.



In [35]:
import re

In [36]:
test_string = "Hello world, welcome to 2016."

In [37]:
# re.search scans the string and returns a match object for the first occurrence
# of the pattern (or None if there is no match). Here the pattern is the literal text itself.
a = re.search('Hello world, welcome to 2016',test_string)

In [38]:
a


Out[38]:
<_sre.SRE_Match object; span=(0, 28), match='Hello world, welcome to 2016'>

In [39]:
a.group()


Out[39]:
'Hello world, welcome to 2016'

In [40]:
# '.' matches any single character (except a newline), so the first match is the first character
a = re.search('.',test_string)
a.group()


Out[40]:
'H'

In [41]:
# '.*' matches zero or more of any character, so it matches the whole string
a = re.search('.*',test_string)
a.group()


Out[41]:
'Hello world, welcome to 2016.'

In [42]:
a = re.search('Hello',test_string)
print(a)


<_sre.SRE_Match object; span=(0, 5), match='Hello'>

Some basic symbols

?

The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".

*

The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

+

The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
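The three quantifiers can be checked against the examples quoted above. This is a small sketch that uses `re.fullmatch`, which succeeds only when the entire string fits the pattern; the extra candidate "colouur" is added here purely for illustration.

```python
import re

# Each pattern is paired with candidate strings taken from the descriptions above.
examples = {
    r'colou?r': ['color', 'colour', 'colouur'],
    r'ab*c': ['ac', 'abc', 'abbbc'],
    r'ab+c': ['ac', 'abc', 'abbbc'],
}

results = {}
for pattern, words in examples.items():
    # re.fullmatch returns a match object only if the whole word fits the pattern
    results[pattern] = [w for w in words if re.fullmatch(pattern, w)]
    print(pattern, '->', results[pattern])
```

Only `ab+c` rejects "ac", because `+` requires at least one "b", while `*` and `?` both allow none.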


In [43]:
# \w matches one word character (letter, digit, or underscore); '.' matches any one character
a = re.search(r'\w.',test_string)
print(a)


<_sre.SRE_Match object; span=(0, 2), match='He'>

In [44]:
# \w* matches zero or more word characters, so the match stops at the first space
a = re.search(r'\w*',test_string)
print(a)


<_sre.SRE_Match object; span=(0, 5), match='Hello'>
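The difference between the last result and the earlier `.*` search comes from what each symbol accepts: `\w` takes only letters, digits, and underscores, while `.` also takes spaces and punctuation. The sketch below compares them on the same test string. The patterns are written as raw strings (`r'...'`) so that Python passes the backslash to the regex engine unchanged.

```python
import re

test_string = "Hello world, welcome to 2016."

# \w* stops at the first character that is not a letter, digit, or underscore
word = re.search(r'\w*', test_string).group()

# .* keeps going, because . accepts spaces and punctuation as well
everything = re.search(r'.*', test_string).group()

# Literal characters and classes can be mixed: two words separated by one space
two_words = re.search(r'\w+ \w+', test_string).group()

print(word)
print(everything)
print(two_words)
```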

Exercises


In [45]:
string = '''In 2016, we are learning Text Analytics in Data Science 101
            by scraping http://datatau.com'''

In [46]:
string = "In 2016, we are learning Text Analytics in Data Science 101 by scraping http://datatau.com"

Write a regex to pick the number 2016 from the string above.


In [ ]:

Write a regex to pick the url link (http://xyz.com) from the string above


In [ ]:

Let's get the date from our string


In [47]:
df.head()


Out[47]:
   title                                              date
0  An Exploration of R, Yelp, and the Search for ...  5 points by Rogerh91 6 hours ago | discuss
1  Deep Advances in Generative Modeling               7 points by gwulfs 15 hours ago | 1 comment
2  Spark Pipelines: Elegant Yet Powerful              3 points by aouyang1 9 hours ago | discuss
3  Shit VCs Say                                       3 points by Argentum01 10 hours ago | discuss
4  Python, Machine Learning, and Language Wars        4 points by pmigdal 17 hours ago | discuss

In [48]:
df.tail()


Out[48]:
     title                                              date
175  Getting Started with Statistics for Data Science   3 points by nickhould 35 days ago | discuss
176  Rodeo 1.3 - Tab-completion for docstrings          3 points by glamp 35 days ago | discuss
177  Teaching D3.js - links                             3 points by pmigdal 35 days ago | discuss
178  Parallel scikit-learn on YARN                      5 points by stijntonk 39 days ago | discuss
179  Meetup: Free Live Webinar on Prescriptive Anal...  2 points by ann928 32 days ago | discuss

In [49]:
date_string = df['date'][0]

In [50]:
print(date_string)


5 points by Rogerh91 6 hours ago  | discuss

In [51]:
re.search('hours',date_string)


Out[51]:
<_sre.SRE_Match object; span=(23, 28), match='hours'>

In [52]:
date_string = df['date'][50]

In [53]:
print(date_string)


4 points by lefish 7 days ago  | discuss

In [54]:
# If "hours" is not in the string, re.search returns None, so the cell shows no output
re.search('hours',date_string)
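Because a failed search returns `None` rather than an empty match, the result should be tested before calling `.group()` on it. A minimal sketch of that check, using the two date strings printed above (with the spacing normalized):

```python
import re

samples = [
    "5 points by Rogerh91 6 hours ago | discuss",
    "4 points by lefish 7 days ago | discuss",
]

found = []
for text in samples:
    match = re.search('hours', text)
    if match is None:
        # No match: calling match.group() here would raise AttributeError
        found.append(False)
        print('no match in:', text)
    else:
        found.append(True)
        print('matched', match.group(), 'at', match.span())
```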

In [55]:
# Match the digits that precede the text " day"
day_search = re.search(r'\d+ day',date_string)
day_search


Out[55]:
<_sre.SRE_Match object; span=(19, 24), match='7 day'>

In [56]:
days_string = day_search.group(0)
days_string


Out[56]:
'7 day'

In [57]:
days = days_string.split(' ')[0] 
days


Out[57]:
'7'
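An alternative to splitting the matched text is to wrap the digits in parentheses. This creates a capture group, and `group(1)` returns just that part. A small sketch, assuming a string of the same shape as the one printed above:

```python
import re

date_string = "4 points by lefish 7 days ago | discuss"

# (\d+) captures only the digits; " day" must follow but is not captured
day_search = re.search(r'(\d+) day', date_string)

days_text = day_search.group(1)   # the captured digits, still a string
days = int(days_text)             # convert to an integer for arithmetic

print(repr(days_text), days)
```

Note that the leading "4" in "4 points" is not picked up, because it is not followed by " day".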

In [58]:
def return_reg_ex_days(row):
    """Return the post's age in days, parsed from the text in row['date']."""
    if re.search('hours', row['date']) is not None:
        # Anything posted hours ago counts as 1 day
        days = 1
    else:
        # Otherwise take the number in front of "day" and convert it to an integer
        day_search = re.search(r'\d+ day', row['date'])
        days = int(day_search.group(0).split(' ')[0])
    return days
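The function above assumes that every row contains either "hours" or a number followed by "day". If some other wording appeared, the second search would return `None` and the call to `.group` would raise an `AttributeError`. The sketch below is a more defensive, string-level variant. The helper name `days_from_text` is hypothetical, and the third sample string (with "minutes") is invented to show the uncovered case; it is not taken from the data.

```python
import re

def days_from_text(text):
    """Hypothetical helper: apply the hours/days rules to one string."""
    if re.search('hours', text) is not None:
        return 1                      # anything hours old counts as 1 day
    day_search = re.search(r'(\d+) day', text)
    if day_search is None:
        return None                   # neither rule applies
    return int(day_search.group(1))

samples = [
    "5 points by Rogerh91 6 hours ago | discuss",
    "3 points by nickhould 35 days ago | discuss",
    "2 points by someone 40 minutes ago | discuss",   # invented example
]

ages = [days_from_text(s) for s in samples]
print(ages)
```

Returning `None` for unmatched text keeps the decision about such rows (drop them, or treat them as less than a day) separate from the parsing itself.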

In [59]:
# Now we apply this function to each row of the dataframe
df['days'] = df.apply(return_reg_ex_days,axis=1)

In [60]:
df.head()


Out[60]:
   title                                              date                                           days
0  An Exploration of R, Yelp, and the Search for ...  5 points by Rogerh91 6 hours ago | discuss     1
1  Deep Advances in Generative Modeling               7 points by gwulfs 15 hours ago | 1 comment    1
2  Spark Pipelines: Elegant Yet Powerful              3 points by aouyang1 9 hours ago | discuss     1
3  Shit VCs Say                                       3 points by Argentum01 10 hours ago | discuss  1
4  Python, Machine Learning, and Language Wars        4 points by pmigdal 17 hours ago | discuss     1

In [61]:
df.tail()


Out[61]:
     title                                              date                                         days
175  Getting Started with Statistics for Data Science   3 points by nickhould 35 days ago | discuss  35
176  Rodeo 1.3 - Tab-completion for docstrings          3 points by glamp 35 days ago | discuss      35
177  Teaching D3.js - links                             3 points by pmigdal 35 days ago | discuss    35
178  Parallel scikit-learn on YARN                      5 points by stijntonk 39 days ago | discuss  39
179  Meetup: Free Live Webinar on Prescriptive Anal...  2 points by ann928 32 days ago | discuss     32

In [62]:
# Let us save the dataframe to a CSV file
df.to_csv('data_tau_days.csv', index=False)