It is often estimated that around 70% of the data available to a business is unstructured. The first step is collating unstructured data from different sources such as open-ended feedback, phone calls, email support, online chat, and social media networks like Twitter, LinkedIn and Facebook. Assembling this data and applying text mining and machine learning techniques provides valuable opportunities for organizations to improve the customer experience. Several libraries can extract text content from the formats discussed above; one of the most convenient is 'textract' (open source, MIT license), which offers a single, simple interface for many formats, including PDF, DOCX, HTML, EML, and common audio files. Note that, as of now, the package supports Linux and macOS but not Windows.
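As a quick illustration, a single textract call handles format detection and extraction; the sketch below is a minimal usage example with a hypothetical file path, not part of the original notebook.
In [ ]:
import textract

# textract.process() returns the extracted content as raw bytes, whatever the input format
raw = textract.process('./Datasets/annual_report.pdf')   # hypothetical path
doc_text = raw.decode('utf-8')
print(doc_text[:300])                                     # preview the first 300 characters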
In [59]:
import pandas as pd
import numpy as np
import tweepy
from tweepy.streaming import StreamListener   # base class for stream listeners (not used below)
from tweepy import OAuthHandler                # OAuth 1a authentication handler
from tweepy import Stream                      # streaming client (not used below)
In [44]:
access_token = "8397390582---------------------------------"
access_token_secret = "dr5L3QHHkIls6Rbffz-------------------"
consumer_key = "U1eVHGzL-----------------"
consumer_secret = "qATe7kb41zRAz------------------------------------"
# Authenticate with the Twitter API using the OAuth 1a credentials above
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
In [119]:
# Search the recent-tweets endpoint for tweets mentioning either term
fetched_tweets = api.search(q='Bitcoin OR ethereum', result_type='recent', lang='en', count=10)
print("Number of tweets: ", len(fetched_tweets))
In [125]:
for tweet in fetched_tweets:
    print('Tweet AUTHOR: ', tweet.author.name)
    print('Tweet ID: ', tweet.id)
    print('Tweet Text: ', tweet.text, '\n')
There are many other ways to collect data, for example from PDF documents, voice recordings, and so on; a small voice-to-text sketch follows.
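For voice recordings, for instance, the SpeechRecognition package can produce a transcript; the sketch below assumes a hypothetical WAV file and uses recognize_google, which requires an internet connection.
In [ ]:
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile('./Datasets/support_call.wav') as source:   # hypothetical recording
    audio = recognizer.record(source)                          # read the whole file into memory
print(recognizer.recognize_google(audio))                      # transcribe via the Google Web Speech API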
NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
Bad news: NLTK does not support Persian.
Good news: hazm is a similar library for Persian language processing.
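To illustrate the parallel APIs, the sketch below tokenizes one English and one Persian sentence; the sample sentences are made up, and it assumes the NLTK 'punkt' model and the hazm package are installed.
In [ ]:
from nltk.tokenize import word_tokenize                      # needs nltk.download('punkt')
from hazm import Normalizer, word_tokenize as fa_tokenize    # Persian counterparts

# English tokenization with NLTK
print(word_tokenize("Text mining turns raw documents into features."))

# Persian normalization and tokenization with hazm
normalizer = Normalizer()
print(fa_tokenize(normalizer.normalize("متن‌کاوی اسناد خام را به ویژگی تبدیل می‌کند")))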
In [103]:
import nltk
nltk.download('stopwords')   # fetch only the stopwords corpus used below
Out[103]:
True
In [60]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]
Out[60]:
In [61]:
len(stopwords.words())   # total number of stopwords across all languages bundled with NLTK
Out[61]:
In [62]:
import hazm
hazm.stopwords_list()
Out[62]:
In [63]:
len(hazm.stopwords_list())
Out[63]:
In [104]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
In [105]:
text = pd.read_csv('./Datasets/text_mining.csv').drop(['document_id', 'direction'], axis=1)
text2 = text[(text['category_id'] == 1) | (text['category_id'] == 8)]   # keep just two categories for a binary example
In [106]:
text[text['category_id'] == 1].head()
Out[106]:
In [68]:
text[text['category_id'] == 8].head()
Out[68]:
In [69]:
climate_con = text[text['category_id'] == 1].iloc[0:5]   # first five documents of category 1
politics = text[text['category_id'] == 8].iloc[0:5]      # first five documents of category 8
In [70]:
countvectorizer = CountVectorizer()
In [71]:
# Concatenate the five documents of each category into a single string
cli = ' '.join(climate_con.text.values)
pol = ' '.join(politics.text.values)
In [72]:
content = [cli, pol]
In [73]:
countvectorizer.fit(content)
Out[73]:
In [74]:
doc_vec = countvectorizer.transform(content)
In [75]:
# Rows are vocabulary terms, columns are the two concatenated documents
df = pd.DataFrame(doc_vec.toarray().transpose(), index=countvectorizer.get_feature_names_out())
In [76]:
df.sort_values(0, ascending=False)
Out[76]:
In [77]:
df.sort_values(1, ascending=False)
Out[77]:
In [78]:
tfidf = TfidfVectorizer()
In [79]:
tfidf_vec = tfidf.fit_transform(content)
In [80]:
# Same layout as df above, but with TF-IDF weights instead of raw counts
df2 = pd.DataFrame(tfidf_vec.toarray().transpose(), index=tfidf.get_feature_names_out())
In [81]:
df2.sort_values(0, ascending=False)
Out[81]:
In [82]:
df2.sort_values(1, ascending=False)
Out[82]:
In [83]:
tfidf.vocabulary_
Out[83]:
In [84]:
from nltk.stem import SnowballStemmer
In [85]:
stemmer = SnowballStemmer('english')
In [86]:
stemmer.stem("impressive")
Out[86]:
In [87]:
stemmer.stem("impressness")
Out[87]:
In [88]:
from hazm import Stemmer
In [89]:
stem2 = Stemmer()
In [90]:
stem2.stem('کتاب ها')   # 'کتاب ها' means "books"
Out[90]:
In [91]:
stem2.stem('کتابهایش')   # 'کتابهایش' means "his/her books"
Out[91]:
In [92]:
stem2.stem('کتاب هایم')   # 'کتاب هایم' means "my books"
Out[92]:
Bayes' theorem gives:
$$ P(y | x_1,x_2,...,x_n) = \frac{P(y) P(x_1,x_2,...,x_n|y)}{P(x_1,x_2,...,x_n)} $$
and the naive conditional-independence assumption:
$$ P(x_i | y,x_1,x_2,...x_{i-1},x_{i+1},...,x_n) = P(x_i|y) $$
leads to:
$$ P(y| x_1,x_2,...,x_n) = \frac{P(y) \prod_{i=1}^n P(x_i|y)}{P(x_1,x_2,...,x_n)} $$
Since the denominator is constant for a given input, if the purpose is only to classify:
$$ \hat{y} = \arg\max_y P(y)\prod_{i=1}^n P(x_i|y) $$
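The cells below fit a GaussianNB on dense TF-IDF arrays; for count or TF-IDF features, MultinomialNB is the more common choice. The sketch below shows that alternative, not the notebook's own pipeline, and assumes the same text2 DataFrame.
In [ ]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a TF-IDF vectorizer on the two-category corpus and train a multinomial model,
# which implements the argmax rule above with per-class term distributions
vec = TfidfVectorizer()
X = vec.fit_transform(text2.text)    # sparse matrix works directly, no .toarray() needed
nb = MultinomialNB().fit(X, text2.category_id)
print(nb.predict(X[:5]))             # predicted category_id for the first five documents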
In [107]:
from sklearn.naive_bayes import GaussianNB
In [108]:
model = GaussianNB()
In [113]:
# Features: dense TF-IDF vectors built with the vocabulary learned from 'content'
model.fit(tfidf.transform(text2.text).toarray(), text2.category_id)
Out[113]:
In [123]:
model.var_.shape   # per-class feature variances (named sigma_ in older scikit-learn)
Out[123]:
In [101]:
from sklearn.metrics import classification_report
# Note: this report is computed on the same documents the model was trained on
print(classification_report(text2.category_id, model.predict(tfidf.transform(text2.text).toarray())))
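A report computed on the training data is optimistic; a held-out split gives a fairer estimate. The sketch below is an extension with arbitrary test_size and random_state, not part of the original notebook.
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# Hold out 25% of the documents for evaluation
train_text, test_text, y_train, y_test = train_test_split(
    text2.text, text2.category_id, test_size=0.25, random_state=0)

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_text).toarray()   # learn the vocabulary on training data only
X_test = vec.transform(test_text).toarray()

clf = GaussianNB().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))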