Do some statistics on the tags

Let's take a quick look at the Stackoverflow R tags and see if we can get any insights into the R ecosystem.



In [54]:

    
%matplotlib inline

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from subprocess import check_output
print(check_output(["ls", "./"]).decode("utf8"))

from scipy.sparse import csr_matrix

import seaborn as sns

%config InlineBackend.figure_format = 'retina'









    



Answers.csv
Data statistics.ipynb
Questions.csv
Tags.csv



In [55]:

    
tags = pd.read_csv("./Tags.csv")



In [56]:

    
tags.head()









    Out[56]:






  
    
      
      Id
      Tag
    
  
  
    
      0
      77434
      vector
    
    
      1
      79709
      memory
    
    
      2
      79709
      function
    
    
      3
      79709
      global-variables
    
    
      4
      79709
      side-effects



In [57]:

    
tags.shape









    Out[57]:





(241546, 2)



In [58]:

    
# Let's check out the most popular tags:
tags["Tag"].value_counts().head(10).plot(kind="barh")









    Out[58]:





<matplotlib.axes._subplots.AxesSubplot at 0x10f5e860>



In [59]:

    
# And the number of tags per question:
(tags.groupby("Id")["Tag"].count().value_counts().plot(kind = "bar"))









    Out[59]:





<matplotlib.axes._subplots.AxesSubplot at 0x1104cbe0>

The majority of tagged questions have one or two tags with four tags max.

I'm curious about the relationships and connections between different tags so for now we'll limit our scope to only looking at questions with 4 tags and only look at the 1000 most popular tags.



In [60]:

    
tag_counts = tags.groupby("Id")["Tag"].count()
many_tags = tag_counts[tag_counts > 3].index
popular_tags = tags.Tag.value_counts().iloc[:1000].index

tags = tags[tags["Id"].isin(many_tags)] #getting questions with 4 tags
tags = tags[tags["Tag"].isin(popular_tags)] #getting only top 1000 tags



In [61]:

    
tags.shape









    Out[61]:





(37903, 2)



In [62]:

    
tags.head(20)









    Out[62]:






  
    
      
      Id
      Tag
    
  
  
    
      1
      79709
      memory
    
    
      2
      79709
      function
    
    
      3
      79709
      global-variables
    
    
      10
      255697
      math
    
    
      11
      255697
      statistics
    
    
      12
      255697
      bayesian
    
    
      32
      713878
      matrix
    
    
      33
      713878
      linear-algebra
    
    
      34
      713878
      sparse-matrix
    
    
      63
      952275
      regex
    
    
      65
      952275
      stringr
    
    
      82
      1110363
      parsing
    
    
      83
      1110363
      command-line
    
    
      90
      1133172
      osx
    
    
      91
      1133172
      terminal
    
    
      92
      1133172
      plot
    
    
      93
      1133172
      command
    
    
      94
      1136709
      java
    
    
      143
      1195826
      dataframe
    
    
      144
      1195826
      r-factor

Creating a Bag of Tags:

Now I am going to create a bag of tags and do some kind of dimensionality reduction on it. To do this I'll basically have to spread the tags and create one column for each tag. Using pd.pivot works but it's very memory-intensive. Instead I'll take advantage of the sparsity and use scipy sparse matrices the create the bag of tags. This sparse bag idea was inspired by dune_dweller's script.



In [69]:

    
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline



In [70]:

    
#let's integer encode the id's and tags:
tag_encoder = LabelEncoder()
question_encoder = LabelEncoder()

tags["Tag"] = tag_encoder.fit_transform(tags["Tag"])
tags["Id"] = question_encoder.fit_transform(tags["Id"])



In [71]:

    
tags.head()



In [76]:

    
tag_num = np.max(tags["Tag"]) + 1



In [80]:

    
print (tag_num)



In [81]:

    
id_num = np.max(tags["Id"]) + 1



In [82]:

    
print (id_num)

!@!@!@!|



In [83]:

    
X = csr_matrix((np.ones(tags.shape[0]), (tags.Id, tags.Tag)))
X.shape #one row for each question, one column for each tag









    Out[83]:





(10900, 998)



In [84]:

    
tags.shape









    Out[84]:





(37903, 2)

Dimensionality Reduction using SVD:

Now we will project our bags of words matrix into a 3 dimensional subspace that captures as much of the variance as possible. Hopefully this will help us better understand the connections between the tags.



In [86]:

    
model = TruncatedSVD(n_components=3)
model.fit(X)









    Out[86]:





TruncatedSVD(algorithm='randomized', n_components=3, n_iter=5,
       random_state=None, tol=0.0)



In [87]:

    
two_components = pd.DataFrame(model.transform(X),\
                              columns=["one", "two", "three"])
two_components.plot(x = "one", y = "two",kind = "scatter",\
                    title = "2D PCA projection components 1 and 2")









    Out[87]:





<matplotlib.axes._subplots.AxesSubplot at 0xcc234a8>



In [89]:

    
two_components.plot(x = "two", y = "three", kind = "scatter", \
                    title = "2D PCA projection - components 2 and 3")









    Out[89]:





<matplotlib.axes._subplots.AxesSubplot at 0xc7b4550>



In [90]:

    
tagz = popular_tags[:20]

tag_ids = tag_encoder.transform(tagz)
n = len(tag_ids)
print (n)



In [91]:

    
X_new = csr_matrix((np.ones(n), (pd.Series(range(n)), tag_ids)),\
                   shape = (n, 998))



In [92]:

    
proj = pd.DataFrame(model.transform(X_new)[:,:2], index=tagz, \
                    columns = ["one", "two"])
proj["tag"] = proj.index



In [94]:

    
from ggplot import * #ggplot!
plt = (ggplot(proj, aes(x = "one", y = "two", label = "tag")) +
            geom_point() +
            geom_text())
plt.show()



In [95]:

    
sm_proj = proj[proj["one"] < 0.2][proj["two"] < 0.2]
plt = (ggplot(sm_proj, aes(x = "one", y = "two", label = "tag")) +
            geom_point() +
            geom_text() +
            xlim(0, 0.1) +
            ylim(-0.1, 0.2))
plt.show()









    



C:\Users\Administrator\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  if __name__ == '__main__':



In [ ]:

	Id	Tag
1	0	493
2	0	278
3	0	309
10	1	480
11	1	855

	Id	Tag
0	77434	vector
1	79709	memory
2	79709	function
3	79709	global-variables
4	79709	side-effects