In [1]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


In [2]:
quality_prediction_and_page_views <- read.table("../results/sql_queries/entity_views_and_aggregated_revisions/entity_views_and_aggregated_revisions_and_quality_scoring_prediction_converted_20140701.tsv", header=FALSE, sep="\t")

In [3]:
colnames(quality_prediction_and_page_views) <- c('entity_id','number_of_revisions', 'page_views', 'prediction', 'ordinal_score')

In [4]:
head(quality_prediction_and_page_views)


entity_id  number_of_revisions  page_views  prediction  ordinal_score
Q1000999                    33         736  E                       1
Q10015364                   11           1  E                       1
Q10018576                    9          21  E                       1
Q10020348                   24          11  E                       1
Q10020832                   11          12  E                       1
Q10028220                   13           6  E                       1

In [5]:
cor(quality_prediction_and_page_views$page_views,quality_prediction_and_page_views$ordinal_score, method="spearman")


0.168980299425223
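The Spearman coefficient above is a rank correlation: it is the Pearson correlation computed on the ranks of the two columns, with tied values given their average rank. The sketch below illustrates that equivalence on a small toy sample; the vectors are made-up values for illustration and are not rows of the dataset.

```r
# Toy data (hypothetical values, not taken from the dataset above).
page_views    <- c(736, 1, 21, 11, 12, 6, 4934, 77136)
ordinal_score <- c(1, 1, 1, 1, 1, 1, 5, 5)

# Spearman's rho, computed the same way as in the cell above.
rho_spearman <- cor(page_views, ordinal_score, method = "spearman")

# Pearson correlation of the ranks; rank() averages ties by default.
rho_ranks <- cor(rank(page_views), rank(ordinal_score), method = "pearson")

print(c(spearman = rho_spearman, pearson_on_ranks = rho_ranks))
```

Because only ranks enter the computation, the coefficient is insensitive to how extreme the largest page-view counts are, which matters for a column as skewed as `page_views`.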

In [6]:
quality_prediction_and_page_views_model <- lm(quality_prediction_and_page_views$page_views ~ quality_prediction_and_page_views$ordinal_score)

In [7]:
summary(quality_prediction_and_page_views_model)


Call:
lm(formula = quality_prediction_and_page_views$page_views ~ quality_prediction_and_page_views$ordinal_score)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.936e+05 -5.303e+03 -5.201e+03 -4.388e+03  1.253e+10 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                       -92988       4024  -23.11
quality_prediction_and_page_views$ordinal_score    98295       2818   34.89
                                                Pr(>|t|)    
(Intercept)                                       <2e-16 ***
quality_prediction_and_page_views$ordinal_score   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6646000 on 14863161 degrees of freedom
Multiple R-squared:  8.188e-05,	Adjusted R-squared:  8.181e-05 
F-statistic:  1217 on 1 and 14863161 DF,  p-value: < 2.2e-16
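The model above is specified with `$` references inside the formula, which works but yields long coefficient names and makes `predict()` with new data awkward. A minimal sketch of the equivalent `data =` form follows, on a toy data frame whose values are invented for illustration; only the column names are borrowed from the notebook.

```r
# Toy data frame reusing the notebook's column names (values are made up).
toy <- data.frame(
  page_views    = c(736, 1, 21, 11, 4934, 6444, 77136),
  ordinal_score = c(1, 1, 1, 1, 5, 5, 5)
)

# Form used in the notebook: columns referenced with `$` inside the formula.
fit_dollar <- lm(toy$page_views ~ toy$ordinal_score)

# Equivalent form: bare column names plus a `data` argument.
fit_data <- lm(page_views ~ ordinal_score, data = toy)

# Both fits estimate the same intercept and slope.
same <- all.equal(unname(coef(fit_dollar)), unname(coef(fit_data)))
print(same)

# The `data =` form predicts cleanly for unseen scores.
pred <- predict(fit_data, newdata = data.frame(ordinal_score = c(2, 3)))
print(pred)
```

Applied to the full table, the same call would be `lm(page_views ~ ordinal_score, data = quality_prediction_and_page_views)`; the estimates should match the summary above, only the labels change.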

Class A Items with the Fewest Page Views


In [8]:
class_a_quality_prediction_and_page_views <- filter(quality_prediction_and_page_views, prediction=="A")

In [9]:
sorted_ascend_class_a_quality_prediction_and_page_views <- dplyr::arrange(class_a_quality_prediction_and_page_views, page_views)

In [10]:
head(sorted_ascend_class_a_quality_prediction_and_page_views, n=10)


entity_id  number_of_revisions  page_views  prediction  ordinal_score
Q50039                     324        4934  A                       5
Q3577919                   265        6444  A                       5
Q3152321                   687       77136  A                       5
Q273461                   1176      241736  A                       5
Q42                       1227     1526156  A                       5
Q2513                      375     2480599  A                       5
Q7416                      467     6327804  A                       5
Q991                       710     6707373  A                       5
Q5592                      661     7190367  A                       5
Q153                       604     7390506  A                       5

Class E Items with the Most Page Views


In [11]:
class_e_quality_prediction_and_page_views <- filter(quality_prediction_and_page_views, prediction=="E")

In [12]:
sorted_desc_class_e_quality_prediction_and_page_views <- dplyr::arrange(class_e_quality_prediction_and_page_views, desc(page_views))

In [13]:
head(sorted_desc_class_e_quality_prediction_and_page_views, n=10)


entity_id  number_of_revisions  page_views  prediction  ordinal_score
Q1868372                    45  2056080224  E                       1
Q156376                     83  2046132338  E                       1
Q183718                     64  2045831558  E                       1
Q2638147                    25  2045739408  E                       1
Q7315186                    34  2045708263  E                       1
Q219523                     73  2045690113  E                       1
Q1002972                    30  2045659200  E                       1
Q372827                     36  2045652543  E                       1
Q4299858                    22  2045645431  E                       1
Q6883832                    25  2045602031  E                       1
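The two rankings above follow the same filter, sort, take-the-top pattern, each spread over three cells and two intermediate variables. One possible way to express each ranking as a single pipeline is sketched below; the toy data frame and its entity IDs are hypothetical, and only the column names come from the notebook.

```r
library(dplyr)

# Toy table with hypothetical entity IDs and made-up page views.
toy <- data.frame(
  entity_id  = c("Q1", "Q2", "Q3", "Q4", "Q5"),
  page_views = c(500, 20, 9000, 3, 70),
  prediction = c("A", "E", "A", "E", "A"),
  stringsAsFactors = FALSE
)

# Class A items with the fewest page views (ascending sort).
least_viewed_a <- toy %>%
  filter(prediction == "A") %>%
  arrange(page_views) %>%
  head(n = 2)

# Class E items with the most page views (descending sort).
most_viewed_e <- toy %>%
  filter(prediction == "E") %>%
  arrange(desc(page_views)) %>%
  head(n = 2)

print(least_viewed_a)
print(most_viewed_e)
```

On the full table, which has roughly 11.8 million class E rows, the pipeline also avoids keeping a fully sorted copy of the class E subset bound to a variable.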

In [14]:
nrow(class_e_quality_prediction_and_page_views)


11775438

In [15]:
nrow(filter(quality_prediction_and_page_views, prediction=="D"))


1880999

In [16]:
nrow(filter(quality_prediction_and_page_views, prediction=="C"))


1184954

In [17]:
nrow(filter(quality_prediction_and_page_views, prediction=="B"))


21735

In [18]:
nrow(filter(quality_prediction_and_page_views, prediction=="A"))


37
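The five class counts above sum to 14,863,163, which agrees with the regression summary (14,863,161 residual degrees of freedom plus two estimated coefficients), so every row carries one of the five predictions. The same tally can be produced in one call instead of five separate filters; the sketch below uses `dplyr::count()` on a toy column of made-up labels.

```r
library(dplyr)

# Toy prediction column (made-up labels, not the real class distribution).
toy <- data.frame(
  prediction = c("E", "E", "E", "D", "D", "C", "B", "A"),
  stringsAsFactors = FALSE
)

# One row per class, with the number of items in column `n`.
class_counts <- count(toy, prediction)
print(class_counts)

# Base R alternative that returns a named table of counts.
print(table(toy$prediction))
```

On the real data this would be `count(quality_prediction_and_page_views, prediction)`, which scans the table once rather than once per class.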
