Statistics for Entity Page Views


In [1]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


In [2]:
entity_views <- read.table("../results/sql_queries/entity_views.tsv", header=FALSE, sep="\t")

In [3]:
colnames(entity_views) <- c('entity_id','page_views')

All entities


In [4]:
summary(entity_views)


   entity_id          page_views       
 P1     :       1   Min.   :0.000e+00  
 P10    :       1   1st Qu.:1.300e+01  
 P100   :       1   Median :1.360e+02  
 P1000  :       1   Mean   :3.006e+04  
 P10000 :       1   3rd Qu.:9.970e+02  
 P1001  :       1   Max.   :1.253e+10  
 (Other):22250015                      

In [5]:
nrow(entity_views)


22250021

In [6]:
sd(entity_views$page_views)


6426968.81680796

In [7]:
# Entities with 0 views give log2(0) = -Inf, which hist() drops as non-finite
hist(log2(entity_views$page_views), xlab="log2(Page Views)", main="Distribution of Page Views")
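
The histogram is drawn on a log2 scale, and entities with zero views map to -Inf, which `hist()` leaves out. The sketch below, on made-up toy values rather than the real `entity_views` data, counts how many points are dropped and shows one possible alternative, `log2(x + 1)`, which keeps zero-view entities in the first bin. The names `page_views_toy`, `n_dropped`, and `log_views_shifted` are illustrative assumptions, not part of the original notebook.

```r
# Toy values standing in for entity_views$page_views (made up for illustration)
page_views_toy <- c(0, 0, 1, 13, 136, 997, 30060)

# log2(0) is -Inf; hist() removes non-finite values before binning
log_views <- log2(page_views_toy)
n_dropped <- sum(!is.finite(log_views))

# Possible alternative: shift by one so zero-view entries stay in the plot
log_views_shifted <- log2(page_views_toy + 1)

# plot = FALSE returns the bin counts without drawing anything
h <- hist(log_views_shifted, plot = FALSE)
```

Whether to shift by one or to report the zero-view entities separately is a presentation choice; the shift slightly distorts the lowest bins but keeps every entity visible.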



In [8]:
sorted_descending_entity_values_by_page_views <- dplyr::arrange(entity_views, desc(page_views))

In [23]:
head(sorted_descending_entity_values_by_page_views, 50)


entity_id page_views
Q5296 12530369761
P373 6531371917
Q5 5668008721
P18 5304100266
P856 5143708396
Q6581097 3273952711
P570 3230549347
P31 3153325528
P345 3064724376
P19 2851053904
P1559 2571114971
P166 2545366245
P20 2497532842
P569 2412197025
P27 2342197161
Q30 2277746226
P106 2267267026
Q36578 2229315598
P136 2176414324
P1477 2155338581
Q54919 2148531382
Q37312 2142913121
Q423048 2136131564
Q193563 2130725560
Q2597810 2128920607
Q2494649 2114531894
Q33999 2108672678
Q17299517 2105487660
Q623578 2097991400
Q355 2093900731
Q750403 2084693498
Q1967876 2084215818
Q477675 2080785713
Q866 2079749157
Q150248 2068796814
Q918 2063217449
Q14005 2063120071
Q209330 2060928966
Q1868372 2056080224
Q565 2052996261
Q4584301 2052339927
Q105584 2049926923
Q40629 2049755644
Q31165 2048330818
Q1048694 2048095025
Q384060 2047954248
Q829984 2047695596
Q356 2047631596
Q171186 2047579662
Q10726338 2047546936

In [10]:
sorted_ascending_entity_values_by_page_views <- dplyr::arrange(entity_views, page_views)

In [24]:
head(sorted_ascending_entity_values_by_page_views, 50)


entity_id page_views
Q21925466 0
Q24915980 0
Q20124533 0
Q26037902 0
Q17706894 0
Q26933104 0
Q24850595 0
Q20286478 0
Q25562429 0
Q22218454 0
Q25842977 0
Q25839019 0
Q23938244 0
Q29556556 0
Q24575047 0
Q14561742 0
Q26754197 0
Q14952870 0
Q14998207 0
Q24232641 0
Q17790770 0
Q27900014 0
Q12267516 0
Q15018106 0
Q22183288 0
Q22874588 0
Q25764700 0
Q17789143 0
Q4366157 0
Q28846902 0
Q21748102 0
Q22646352 0
Q26077593 0
Q6920428 0
Q25672981 0
Q6918381 0
Q22386660 0
Q15003615 0
Q14512351 0
Q17714436 0
Q9862468 0
Q22385934 0
Q17693328 0
Q17801523 0
Q25776060 0
Q29560335 0
Q20301729 0
Q17758589 0
Q25898249 0
Q29400684 0

Sample of 50 entities with 0 views


In [26]:
zero_views <- filter(sorted_ascending_entity_values_by_page_views, page_views==0)

In [27]:
# sample() on a data frame draws columns, not rows, so index by sampled row numbers instead
zero_views_sampled <- zero_views[sample(nrow(zero_views), 50), ]
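
Sampling rows from a data frame needs row indices, because `length()` of a data frame is its number of columns and `sample()` therefore draws from the columns. A minimal sketch on a made-up toy frame follows; `zero_views_toy`, `row_idx`, and the seed are illustrative assumptions.

```r
# Toy data frame with two columns (ids and values are made up)
zero_views_toy <- data.frame(
  entity_id = paste0("Q", 1:200),
  page_views = 0
)

# A data frame's length is its column count, so sample(df, 50) would try
# to draw 50 of 2 columns and fail when sampling without replacement
n_cols <- length(zero_views_toy)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
row_idx <- sample(nrow(zero_views_toy), 50)      # 50 distinct row numbers
zero_views_sampled_toy <- zero_views_toy[row_idx, ]
```

The result keeps both columns and contains 50 distinct rows, which is the shape a random sample of zero-view entities should have.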

Entities that do not have page views


In [12]:
entities_with_no_page_views <- subset(entity_views, page_views == 0)

In [13]:
nrow(entities_with_no_page_views)


1037758

Share of entities with no page views among all entities


In [14]:
nrow(entities_with_no_page_views)/nrow(entity_views)


0.0466407649682668

Entities that have at most 100 page views


In [15]:
entities_with_less_than_100_page_views <- subset(entity_views, page_views <= 100)

In [16]:
nrow(entities_with_less_than_100_page_views)


10385721

Share of entities with at most 100 page views among all entities


In [17]:
nrow(entities_with_less_than_100_page_views)/nrow(entity_views)


0.466773536977785
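
Both shares above are the fraction of rows at or below a threshold, which can also be written as the mean of a logical vector. The sketch below uses a made-up five-row frame; `views_toy` and `share_at_most` are hypothetical names, not part of the original notebook.

```r
# Toy stand-in for entity_views (ids and counts are made up)
views_toy <- data.frame(
  entity_id = c("Q1", "Q2", "Q3", "Q4", "Q5"),
  page_views = c(0, 50, 100, 101, 5000)
)

# Hypothetical helper: fraction of entities with page_views <= threshold.
# The mean of a logical vector is the proportion of TRUE values.
share_at_most <- function(df, threshold) {
  mean(df$page_views <= threshold)
}

share_zero <- share_at_most(views_toy, 0)    # one of five rows
share_100 <- share_at_most(views_toy, 100)   # three of five rows

# Same quantity computed the way the notebook does it
share_100_subset <- nrow(subset(views_toy, page_views <= 100)) / nrow(views_toy)
```

Note that the comparison is inclusive (`<=`), matching the `subset()` call used for the 100-view threshold.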

Male versus Female Bias

"Male" Item Usage


In [18]:
male_item_pages <- filter(sorted_descending_entity_values_by_page_views, entity_id=="Q6581097")

In [19]:
head(male_item_pages)


entity_id page_views
Q6581097 3273952711

"Female" Item Usage


In [20]:
female_item_pages <-filter(sorted_descending_entity_values_by_page_views, entity_id=="Q6581072")

In [21]:
head(female_item_pages)


entity_id page_views
Q6581072 1027466361

In [22]:
# The view counts live in the page_views column; there is no column named n
female_item_pages$page_views/male_item_pages$page_views
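
As a self-contained illustration of the female-to-male comparison, the sketch below rebuilds the two rows shown above in a small data frame and divides their `page_views` values. It assumes the column names `entity_id` and `page_views` set earlier in the notebook; `gender_items` and `female_to_male` are illustrative names.

```r
# The two rows reported above for the "male" and "female" items
gender_items <- data.frame(
  entity_id = c("Q6581097", "Q6581072"),
  page_views = c(3273952711, 1027466361),
  stringsAsFactors = FALSE
)

# Look each item up by id and take its page view count
male_views <- gender_items$page_views[gender_items$entity_id == "Q6581097"]
female_views <- gender_items$page_views[gender_items$entity_id == "Q6581072"]

# Ratio of views on pages using the "female" item to those using the "male" item
female_to_male <- female_views / male_views
```

A ratio below 1 means pages using the "female" item received fewer views in total than pages using the "male" item.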


