In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(min_df=3, max_df=0.95, sublinear_tf=True)
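Before fitting on the real data, it may help to see what these TfidfVectorizer settings do. Below is a small sketch on a toy corpus (min_df is relaxed to 1 because the corpus is tiny; the cell above keeps min_df=3 for the full data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, loosely modeled on the question titles in this notebook
corpus = [
    "optimization packages for R",
    "vector operations in R",
    "R quantile function explained",
    "R vector optimization",
]

# min_df=1 because the corpus is tiny; max_df=0.95 drops terms appearing
# in more than 95% of documents; sublinear_tf replaces tf with 1 + log(tf)
demo = TfidfVectorizer(min_df=1, max_df=0.95, sublinear_tf=True)
X = demo.fit_transform(corpus)

# 4 documents; "R" is dropped by the default tokenizer (single character),
# so 9 distinct terms remain
print(X.shape)
print(sorted(demo.vocabulary_))
```

Note that the default token pattern only keeps tokens of two or more word characters, which is why the tag "R" vanishes from the vocabulary here.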

In [2]:
experts_count = pd.read_pickle('./input/experts_count.pkl')
experts_count = experts_count.fillna('none')

In [3]:
experts_count.head()


Out[3]:
Id Title QuestionBody CodeBody Tag ExpertId Count Label
0 95007 Explain the quantile() function in R I've been mystified by the R quantile functio... none math statistics 79513.0 1 218
1 255697 Is there an R package for learning a Dirichlet... I'm looking for a an package which can be u... R R math statistics bayesian dirichlet 23263.0 1 91
2 359438 Optimization packages for R Does anyone know of any optimization packages... none mathematical-optimization 3201.0 1 15
3 439526 Thinking in Vectors with R I know that R works most efficiently with vec... st p1 p2 st<-NULL p1<-NULL p2<-NU... vector 37751.0 53 121
4 445059 Vectorize my thinking: Vector Operations in R So earlier I answered my own question on thin... for (j in my.data$item[my.data$fixed==0]) { #... vector 54904.0 1 163

In [4]:
experts_count.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 87404 entries, 0 to 87403
Data columns (total 8 columns):
Id              87404 non-null int64
Title           87404 non-null object
QuestionBody    87404 non-null object
CodeBody        87404 non-null object
Tag             87404 non-null object
ExpertId        87404 non-null float64
Count           87404 non-null int64
Label           87404 non-null int64
dtypes: float64(1), int64(3), object(4)
memory usage: 6.0+ MB

Title

Start from here: build TF-IDF features from the question titles.


In [7]:
Y = experts_count.Label
X_title = experts_count.Title
print (type(Y),type(X_title))


(<class 'pandas.core.series.Series'>, <class 'pandas.core.series.Series'>)

In [8]:
X_title = tfv.fit_transform(list(X_title))

In [11]:
X_title


Out[11]:
<87404x7196 sparse matrix of type '<type 'numpy.float64'>'
	with 672997 stored elements in Compressed Sparse Row format>

In [12]:
# for fun
# sklearn.cross_validation was deprecated in 0.18; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_title, Y, test_size=0.10)

In [14]:
lr = LogisticRegression(C=2)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[14]:
0.082256034778629444

Now, begin!

When Count > 10


In [16]:
# Filter on the sliced frame so the boolean mask aligns with its index
head = experts_count[:80000]
X1_train = head[head.Count > 10]

In [18]:
X1_title = tfv.fit_transform(list(X1_train['Title'])
                             + list(experts_count[80000:]['Title']))

In [19]:
print (type(X1_title))
print (X1_title.shape)


<class 'scipy.sparse.csr.csr_matrix'>
(73639, 6385)

In [21]:
X_train = X1_title[:X1_train.shape[0]]
Y_train = X1_train['Label']
X_test = X1_title[X1_train.shape[0]:]
Y_test = experts_count['Label'][80000:]

In [22]:
# for C=2
lr = LogisticRegression(C=2)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[22]:
0.1053484602917342

We can also tune the regularization parameter C. Let's see what happens!
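Rather than trying C values one cell at a time, a cross-validated grid search can pick C automatically. The sketch below runs on synthetic sparse features so it is self-contained; in the notebook, the real X_train and Y_train from the cell above would be passed instead:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic sparse features and labels standing in for the real
# TF-IDF matrix and Label column, so this cell runs on its own
rng = np.random.RandomState(0)
X = sparse_random(200, 50, density=0.1, format="csr", random_state=0)
y = rng.randint(0, 3, 200)

# 3-fold cross-validated search over candidate C values
grid = GridSearchCV(LogisticRegression(), {"C": [1, 2, 3, 5]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

On random labels the best score will hover around chance; with the real features, best_params_ would answer the "C = ?" question in the conclusion directly.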


In [23]:
# for C=3
lr = LogisticRegression(C=3)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[23]:
0.10075634792004322

Next, see what happens when the threshold is raised to experts_count.Count > 20.
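The threshold-and-C sweep described above can be sketched as a loop. The DataFrame here is a synthetic stand-in for experts_count so the sketch runs on its own (the split point 300 stands in for 80000, and min_df=1 for the tiny vocabulary); with the real data, the same function applies unchanged:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for experts_count; in the notebook, use the
# DataFrame loaded from the pickle instead
rng = np.random.RandomState(0)
n = 400
experts_count = pd.DataFrame({
    "Title": ["question about %s in R" % w
              for w in rng.choice(["vectors", "quantile", "packages", "loops"], n)],
    "Count": rng.randint(1, 60, n),
    "Label": rng.randint(0, 4, n),
})
split = 300  # the notebook splits at 80000

def title_accuracy(threshold, C):
    # Keep only frequent-enough training questions, vectorize train
    # and held-out titles together, then slice the matrix by position
    head = experts_count[:split]
    train = head[head.Count > threshold]
    tfv = TfidfVectorizer(min_df=1, sublinear_tf=True)
    X_all = tfv.fit_transform(
        list(train["Title"]) + list(experts_count[split:]["Title"]))
    X_train, X_test = X_all[:len(train)], X_all[len(train):]
    lr = LogisticRegression(C=C)
    lr.fit(X_train, train["Label"])
    return accuracy_score(experts_count[split:]["Label"], lr.predict(X_test))

for threshold in (10, 20, 30):
    for C in (2, 3):
        print(threshold, C, title_accuracy(threshold, C))
```

On random labels the scores are near chance; on the real data, the loop fills in the "Count > ?, C = ?" table in one pass.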

Conclusion for title: from the results above, Count > ? with C = ? gives the best result, which is 0.???

Tags

Start from here: build TF-IDF features from the question tags.


In [25]:
# Filter on the sliced frame so the boolean mask aligns with its index
head = experts_count[:80000]
X1_train = head[head.Count > 10]

In [26]:
X1_tag = tfv.fit_transform(list(X1_train['Tag'])
                           + list(experts_count[80000:]['Tag']))

In [27]:
print (type(X1_tag))
print (X1_tag.shape)


<class 'scipy.sparse.csr.csr_matrix'>
(73639, 1983)

In [28]:
X_train = X1_tag[:X1_train.shape[0]]
Y_train = X1_train['Label']
X_test = X1_tag[X1_train.shape[0]:]
Y_test = experts_count['Label'][80000:]

In [29]:
# for C=2
lr = LogisticRegression(C=2)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[29]:
0.11655861696380335

Next, see what happens when the threshold is raised to experts_count.Count > 20.


In [30]:
# Filter on the sliced frame so the boolean mask aligns with its index
head = experts_count[:80000]
X2_train = head[head.Count > 20]

In [32]:
X2_tag = tfv.fit_transform(list(X2_train['Tag'])
                           + list(experts_count[80000:]['Tag']))

print (type(X2_tag))
print (X2_tag.shape)


<class 'scipy.sparse.csr.csr_matrix'>
(69302, 1918)

In [37]:
X_train = X2_tag[:X2_train.shape[0]]
Y_train = X2_train['Label']
X_test = X2_tag[X2_train.shape[0]:]
Y_test = experts_count['Label'][80000:]

In [38]:
# for C=2
lr = LogisticRegression(C=2)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[38]:
0.1149378714208536

We can also tune the regularization parameter C. Let's see what happens!


In [39]:
# for C=3
lr = LogisticRegression(C=3)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
print ("The test accuracy is : %r" % accuracy_score(Y_test, y))


The test accuracy is : 0.11358725013506213

And more...


In [ ]:
head = experts_count[:80000]
X3_train = head[head.Count > 30]

In [ ]:
head = experts_count[:80000]
X4_train = head[head.Count > 40]

Conclusion for tag: from the results above, Count > ? with C = ? gives the best result, which is 0.???

Try more models, such as:


In [ ]:
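A sketch of two alternative models on TF-IDF features, multinomial naive Bayes and a linear SVM, using synthetic titles so the cell is self-contained; in the notebook, the real X_train, Y_train, X_test, and Y_test from the cells above would be reused instead:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Synthetic titles and labels standing in for the real data
rng = np.random.RandomState(0)
words = np.array(["vector", "quantile", "ggplot", "dplyr", "loop", "apply"])
titles = [" ".join(rng.choice(words, 3)) for _ in range(200)]
labels = rng.randint(0, 3, 200)

tfv = TfidfVectorizer(min_df=1, sublinear_tf=True)
X = tfv.fit_transform(titles)
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = labels[:150], labels[150:]

# Both models accept sparse input; MultinomialNB suits non-negative
# TF-IDF counts, LinearSVC is a common strong baseline for text
for model in (MultinomialNB(), LinearSVC(C=2)):
    model.fit(X_train, Y_train)
    acc = accuracy_score(Y_test, model.predict(X_test))
    print(type(model).__name__, acc)
```

On random labels both scores sit near chance; the point is only the interface, which is identical to the LogisticRegression cells above, so either model can be dropped into the same sweep.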