In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(min_df=3, max_df=0.95, sublinear_tf=True)
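Before fitting on the real data, it may help to see what these TfidfVectorizer settings do. Below is a small sketch on a toy corpus (min_df is relaxed to 1 because the corpus is tiny; the cell above keeps min_df=3 for the full data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, loosely modeled on the question titles in this notebook
corpus = [
    "optimization packages for R",
    "vector operations in R",
    "R quantile function explained",
    "R vector optimization",
]

# min_df=1 because the corpus is tiny; max_df=0.95 drops terms appearing
# in more than 95% of documents; sublinear_tf replaces tf with 1 + log(tf)
demo = TfidfVectorizer(min_df=1, max_df=0.95, sublinear_tf=True)
X = demo.fit_transform(corpus)

# 4 documents; "R" is dropped by the default tokenizer (single character),
# so 9 distinct terms remain
print(X.shape)
print(sorted(demo.vocabulary_))
```

Note that the default token pattern only keeps tokens of two or more word characters, which is why the tag "R" vanishes from the vocabulary here.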

In [2]:
experts_count = pd.read_pickle('./input/experts_count.pkl')
experts_count = experts_count.fillna('none')

In [3]:
experts_count.head()


Out[3]:
Id Title QuestionBody CodeBody Tag ExpertId Count Label
0 95007 Explain the quantile() function in R I've been mystified by the R quantile functio... none math statistics 79513.0 1 218
1 255697 Is there an R package for learning a Dirichlet... I'm looking for a an package which can be u... R R math statistics bayesian dirichlet 23263.0 1 91
2 359438 Optimization packages for R Does anyone know of any optimization packages... none mathematical-optimization 3201.0 1 15
3 439526 Thinking in Vectors with R I know that R works most efficiently with vec... st p1 p2 st<-NULL p1<-NULL p2<-NU... vector 37751.0 53 121
4 445059 Vectorize my thinking: Vector Operations in R So earlier I answered my own question on thin... for (j in my.data$item[my.data$fixed==0]) { #... vector 54904.0 1 163

In [4]:
experts_count.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 87404 entries, 0 to 87403
Data columns (total 8 columns):
Id              87404 non-null int64
Title           87404 non-null object
QuestionBody    87404 non-null object
CodeBody        87404 non-null object
Tag             87404 non-null object
ExpertId        87404 non-null float64
Count           87404 non-null int64
Label           87404 non-null int64
dtypes: float64(1), int64(3), object(4)
memory usage: 6.0+ MB

Title

Start from here: build TF-IDF features from the question titles.


In [7]:
Y = experts_count.Label
X_title = experts_count.Title
print (type(Y),type(X_title))


(<class 'pandas.core.series.Series'>, <class 'pandas.core.series.Series'>)

In [8]:
X_title = tfv.fit_transform(list(X_title))

In [11]:
X_title


Out[11]:
<87404x7196 sparse matrix of type '<type 'numpy.float64'>'
	with 672997 stored elements in Compressed Sparse Row format>

In [12]:
# for fun
# sklearn.cross_validation was deprecated in 0.18; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_title, Y, test_size=0.10)

In [14]:
lr = LogisticRegression(C=2)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[14]:
0.082256034778629444

Now, begin!

When Count > 10


In [16]:
# Filter on the sliced frame so the boolean mask aligns with its index
head = experts_count[:80000]
X1_train = head[head.Count > 10]

In [18]:
X1_title = tfv.fit_transform(list(X1_train['Title'])
                             + list(experts_count[80000:]['Title']))

In [19]:
print (type(X1_title))
print (X1_title.shape)


<class 'scipy.sparse.csr.csr_matrix'>
(73639, 6385)

In [21]:
X_train = X1_title[:X1_train.shape[0]]
Y_train = X1_train['Label']
X_test = X1_title[X1_train.shape[0]:]
Y_test = experts_count['Label'][80000:]

In [22]:
# for C=2
lr = LogisticRegression(C=2)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[22]:
0.1053484602917342

We can also tune the regularization parameter C. Let's see what happens!
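Rather than trying C values one cell at a time, a cross-validated grid search can pick C automatically. The sketch below runs on synthetic sparse features so it is self-contained; in the notebook, the real X_train and Y_train from the cell above would be passed instead:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic sparse features and labels standing in for the real
# TF-IDF matrix and Label column, so this cell runs on its own
rng = np.random.RandomState(0)
X = sparse_random(200, 50, density=0.1, format="csr", random_state=0)
y = rng.randint(0, 3, 200)

# 3-fold cross-validated search over candidate C values
grid = GridSearchCV(LogisticRegression(), {"C": [1, 2, 3, 5]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

On random labels the best score will hover around chance; with the real features, best_params_ would answer the "C = ?" question in the conclusion directly.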


In [23]:
# for C=3
lr = LogisticRegression(C=3)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[23]:
0.10075634792004322

Next, see what happens when the threshold is raised to experts_count.Count > 20.
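The threshold-and-C sweep described above can be sketched as a loop. The DataFrame here is a synthetic stand-in for experts_count so the sketch runs on its own (the split point 300 stands in for 80000, and min_df=1 for the tiny vocabulary); with the real data, the same function applies unchanged:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for experts_count; in the notebook, use the
# DataFrame loaded from the pickle instead
rng = np.random.RandomState(0)
n = 400
experts_count = pd.DataFrame({
    "Title": ["question about %s in R" % w
              for w in rng.choice(["vectors", "quantile", "packages", "loops"], n)],
    "Count": rng.randint(1, 60, n),
    "Label": rng.randint(0, 4, n),
})
split = 300  # the notebook splits at 80000

def title_accuracy(threshold, C):
    # Keep only frequent-enough training questions, vectorize train
    # and held-out titles together, then slice the matrix by position
    head = experts_count[:split]
    train = head[head.Count > threshold]
    tfv = TfidfVectorizer(min_df=1, sublinear_tf=True)
    X_all = tfv.fit_transform(
        list(train["Title"]) + list(experts_count[split:]["Title"]))
    X_train, X_test = X_all[:len(train)], X_all[len(train):]
    lr = LogisticRegression(C=C)
    lr.fit(X_train, train["Label"])
    return accuracy_score(experts_count[split:]["Label"], lr.predict(X_test))

for threshold in (10, 20, 30):
    for C in (2, 3):
        print(threshold, C, title_accuracy(threshold, C))
```

On random labels the scores are near chance; on the real data, the loop fills in the "Count > ?, C = ?" table in one pass.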

Conclusion for title: from the results above, Count > ? with C = ? gives the best result, which is 0.???

Tags

Start from here: build TF-IDF features from the question tags.


In [25]:
# Filter on the sliced frame so the boolean mask aligns with its index
head = experts_count[:80000]
X1_train = head[head.Count > 10]

In [26]:
X1_tag = tfv.fit_transform(list(X1_train['Tag'])
                           + list(experts_count[80000:]['Tag']))

In [27]:
print (type(X1_tag))
print (X1_tag.shape)


<class 'scipy.sparse.csr.csr_matrix'>
(73639, 1983)

In [28]:
X_train = X1_tag[:X1_train.shape[0]]
Y_train = X1_train['Label']
X_test = X1_tag[X1_train.shape[0]:]
Y_test = experts_count['Label'][80000:]

In [29]:
# for C=2
lr = LogisticRegression(C=2)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[29]:
0.11655861696380335

Next, see what happens when the threshold is raised to experts_count.Count > 20.


In [30]:
# Filter on the sliced frame so the boolean mask aligns with its index
head = experts_count[:80000]
X2_train = head[head.Count > 20]

In [32]:
X2_tag = tfv.fit_transform(list(X2_train['Tag'])
                           + list(experts_count[80000:]['Tag']))

print (type(X2_tag))
print (X2_tag.shape)


<class 'scipy.sparse.csr.csr_matrix'>
(69302, 1918)

In [37]:
X_train = X2_tag[:X2_train.shape[0]]
Y_train = X2_train['Label']
X_test = X2_tag[X2_train.shape[0]:]
Y_test = experts_count['Label'][80000:]

In [38]:
# for C=2
lr = LogisticRegression(C=2)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
accuracy_score(Y_test, y)


Out[38]:
0.1149378714208536

We can also tune the regularization parameter C. Let's see what happens!


In [39]:
# for C=3
lr = LogisticRegression(C=3)
lr.fit(X_train, Y_train)
y = lr.predict(X_test)
print ("The test accuracy is : %r" % accuracy_score(Y_test, y))


The test accuracy is : 0.11358725013506213

And more...


In [ ]:
head = experts_count[:80000]
X3_train = head[head.Count > 30]

In [ ]:
head = experts_count[:80000]
X4_train = head[head.Count > 40]

Conclusion for tag: from the results above, Count > ? with C = ? gives the best result, which is 0.???

Try more models, such as:


In [ ]:
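A sketch of two alternative models on TF-IDF features, multinomial naive Bayes and a linear SVM, using synthetic titles so the cell is self-contained; in the notebook, the real X_train, Y_train, X_test, and Y_test from the cells above would be reused instead:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Synthetic titles and labels standing in for the real data
rng = np.random.RandomState(0)
words = np.array(["vector", "quantile", "ggplot", "dplyr", "loop", "apply"])
titles = [" ".join(rng.choice(words, 3)) for _ in range(200)]
labels = rng.randint(0, 3, 200)

tfv = TfidfVectorizer(min_df=1, sublinear_tf=True)
X = tfv.fit_transform(titles)
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = labels[:150], labels[150:]

# Both models accept sparse input; MultinomialNB suits non-negative
# TF-IDF counts, LinearSVC is a common strong baseline for text
for model in (MultinomialNB(), LinearSVC(C=2)):
    model.fit(X_train, Y_train)
    acc = accuracy_score(Y_test, model.predict(X_test))
    print(type(model).__name__, acc)
```

On random labels both scores sit near chance; the point is only the interface, which is identical to the LogisticRegression cells above, so either model can be dropped into the same sweep.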