Statistics for Entity Page Views


In [1]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


In [ ]:
entity_views <- read.table("../results/sql_queries/entity_views.tsv", header=FALSE, sep="\t")

In [3]:
colnames(entity_views) <- c('entity_id','page_views')

All entities


In [4]:
summary(entity_views)


   entity_id          page_views       
 P1     :       1   Min.   :0.000e+00  
 P10    :       1   1st Qu.:1.300e+01  
 P100   :       1   Median :1.360e+02  
 P1000  :       1   Mean   :3.006e+04  
 P10000 :       1   3rd Qu.:9.970e+02  
 P1001  :       1   Max.   :1.253e+10  
 (Other):22250015                      

In [5]:
nrow(entity_views)


22250021

In [6]:
sd(entity_views$page_views)


6426968.81680796

In [7]:
hist(log2(entity_views$page_views), xlab="log2 of Page Views", main="Distribution of Page Views")
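One caveat with the histogram call above: log2(0) is -Inf, and hist() keeps only finite values, so entities with zero page views are silently left out of the plot. The following is a minimal sketch of one possible workaround, shifting the counts by one before taking the log. The toy_views data frame and its values are illustrative assumptions, not the notebook's data.

```r
# Toy stand-in for entity_views (hypothetical values, for illustration only).
toy_views <- data.frame(
  entity_id  = c("Q1", "Q2", "Q3", "Q4", "Q5"),
  page_views = c(0, 0, 13, 136, 997)
)

# log2(0) is -Inf; hist() discards non-finite values, so these rows vanish.
log_views <- log2(toy_views$page_views)
n_dropped <- sum(!is.finite(log_views))

# Shifting by one keeps zero-view entities in the plot, since log2(0 + 1) is 0.
shifted <- log2(toy_views$page_views + 1)
h <- hist(shifted, plot = FALSE)  # plot = FALSE returns the bin counts only
```

Whether the shift is appropriate depends on how the zero-view entities should be presented; reporting them separately, as the notebook does further down, is an equally reasonable choice.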



In [8]:
sorted_descending_entity_values_by_page_views = dplyr::arrange(entity_views, desc(page_views))

In [9]:
head(sorted_descending_entity_values_by_page_views, 25)


entity_id   page_views
Q5296       12530369761
P373        6531371917
Q5          5668008721
P18         5304100266
P856        5143708396
Q6581097    3273952711
P570        3230549347
P31         3153325528
P345        3064724376
P19         2851053904
P1559       2571114971
P166        2545366245
P20         2497532842
P569        2412197025
P27         2342197161
Q30         2277746226
P106        2267267026
Q36578      2229315598
P136        2176414324
P1477       2155338581
Q54919      2148531382
Q37312      2142913121
Q423048     2136131564
Q193563     2130725560
Q2597810    2128920607

In [10]:
sorted_ascending_entity_values_by_page_views = dplyr::arrange(entity_views, page_views)

In [11]:
head(sorted_ascending_entity_values_by_page_views, 25)


entity_id   page_views
Q21925466   0
Q24915980   0
Q20124533   0
Q26037902   0
Q17706894   0
Q26933104   0
Q24850595   0
Q20286478   0
Q25562429   0
Q22218454   0
Q25842977   0
Q25839019   0
Q23938244   0
Q29556556   0
Q24575047   0
Q14561742   0
Q26754197   0
Q14952870   0
Q14998207   0
Q24232641   0
Q17790770   0
Q27900014   0
Q12267516   0
Q15018106   0
Q22183288   0

Entities that do not have page views


In [12]:
entities_with_no_page_views <- subset(entity_views, page_views == 0)

In [13]:
nrow(entities_with_no_page_views)


1037758

Share of entities with no page views among all entities


In [14]:
nrow(entities_with_no_page_views)/nrow(entity_views)


0.0466407649682668

Entities that have 100 or fewer page views


In [15]:
entities_with_less_than_100_page_views <- subset(entity_views, page_views <= 100)

In [16]:
nrow(entities_with_less_than_100_page_views)


10385721

Share of entities with 100 or fewer page views among all entities


In [17]:
nrow(entities_with_less_than_100_page_views)/nrow(entity_views)


0.466773536977785
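As a hedged aside, both shares can also be computed without building intermediate subsets, because the mean of a logical vector is the fraction of TRUE values. The sketch below uses a small made-up data frame (toy_views is a hypothetical name) rather than the real entity_views table.

```r
# Toy stand-in for entity_views (hypothetical values, for illustration only).
toy_views <- data.frame(
  entity_id  = c("Q1", "Q2", "Q3", "Q4", "Q5"),
  page_views = c(0, 5, 100, 101, 2000)
)

# The mean of a logical vector is the proportion of TRUE entries.
share_zero        <- mean(toy_views$page_views == 0)    # 1 of 5 rows
share_at_most_100 <- mean(toy_views$page_views <= 100)  # 3 of 5 rows
```

Applied to entity_views, the same two expressions should reproduce the ratios reported in the cells above, since they count the same rows.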

Male versus Female Bias

"Male" Item Usage


In [18]:
male_item_pages <- filter(sorted_descending_entity_values_by_page_views, entity_id=="Q6581097")

In [19]:
head(male_item_pages)


entity_id   page_views
Q6581097    3273952711

"Female" Item Usage


In [20]:
female_item_pages <- filter(sorted_descending_entity_values_by_page_views, entity_id=="Q6581072")

In [21]:
head(female_item_pages)


entity_id   page_views
Q6581072    1027466361

In [22]:
female_item_pages$page_views/male_item_pages$page_views
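The female-to-male ratio can also be checked directly from the two page-view counts printed in the outputs above. This is only a sketch: the variable names are illustrative and not part of the notebook, and the literals are copied from those outputs.

```r
# Page-view counts copied from the outputs for Q6581072 and Q6581097.
female_views <- 1027466361
male_views   <- 3273952711

# Ratio of "female" item page views to "male" item page views.
ratio <- female_views / male_views
```

Both counts exceed the 32-bit integer range, so R stores them as doubles, which is sufficient for a plain ratio like this one.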


