Use the `eland` library to generate search metrics based on data in an Elasticsearch index. All indices and transforms should have already been created and run before this notebook can be used. We operate on the two main indices: events in `ecs-search-metrics` and the post-transform query-level metrics in `ecs-search-metrics_transform_queryid`.
In [1]:
%matplotlib inline
import eland as el
import numpy as np
In [2]:
ES_URL = 'http://localhost:9200/'
In [3]:
df = el.read_es(ES_URL, 'ecs-search-metrics')
In [4]:
df.dtypes
Out[4]:
In [5]:
print(df.info_es())
In [6]:
df.head()
Out[6]:
What is the distribution of ranks of results clicked on?
In [7]:
df['SearchMetrics.click.result.rank'].describe()
Out[7]:
In [8]:
df['SearchMetrics.click.result.rank'].hist()
Out[8]:
How many users are in the dataset?
In [9]:
df['source.user.id'].nunique()
Out[9]:
How many of each event type?
In [10]:
df['event.action'].value_counts()
Out[10]:
Split the dataset into three dataframes based on action type.
In [11]:
df_queries = df[df['event.action'] == 'SearchMetrics.query']
df_pages = df[df['event.action'] == 'SearchMetrics.page']
df_clicks = df[df['event.action'] == 'SearchMetrics.click']
What is the distribution of search result sizes in query events?
In [12]:
df_queries[['SearchMetrics.results.size']].hist(figsize=[10,5], bins=10)
Out[12]:
In [13]:
df_tf_query = el.read_es(ES_URL, 'ecs-search-metrics_transform_queryid')
In [14]:
df_tf_query.head()
Out[14]:
What are the distributions of the numeric fields?
In [15]:
df_tf_query.select_dtypes(include=[np.number])\
.drop(['query_event.SearchMetrics.query.page'], axis=1)\
.hist(figsize=[17,11], bins=10)
Out[15]:
We're going to start with some basic query metrics [1] (no session metrics):
^ When computing the metrics marked with ^, we exclude queries with no results to avoid conflating these measures with zero result rate.
† When computing the metrics marked with †, we exclude queries with no clicks to avoid conflating these measures with abandonment rate.
[1] F. Radlinski, M. Kurup, T. Joachims. How Does Clickthrough Data Reflect Retrieval Quality? CIKM '08, 2008.
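To make the two exclusion rules concrete, here is a minimal sketch on made-up per-query records. The field names `results`, `clicks`, and `best_rank` are hypothetical and are not the index schema; the point is only to show how a mean shifts depending on which queries are left out.

```python
from statistics import mean

# Hypothetical per-query records (not the real index fields):
# number of results, number of clicks, and the best (lowest) clicked rank,
# or None when nothing was clicked.
queries = [
    {"results": 10, "clicks": 2, "best_rank": 1},
    {"results": 10, "clicks": 0, "best_rank": None},
    {"results": 0, "clicks": 0, "best_rank": None},
    {"results": 5, "clicks": 1, "best_rank": 4},
]

# ^ rule: drop queries that returned nothing.
with_results = [q for q in queries if q["results"] > 0]
# dagger rule: additionally drop queries that were never clicked.
with_clicks = [q for q in with_results if q["clicks"] > 0]

# Mean clicks over all queries vs. over queries with results only.
clicks_all = mean(q["clicks"] for q in queries)
clicks_with_results = mean(q["clicks"] for q in with_results)

# A rank-based measure is only defined for queries that were clicked.
mean_best_reciprocal_rank = mean(1 / q["best_rank"] for q in with_clicks)
```

On this toy data the zero-result query pulls the all-queries mean down, which is the conflation the ^ rule is meant to avoid.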
Given the above definitions, build the datasets that the metrics will be computed from.
In [16]:
# queries that have no results
df_tf_query_without_results = df_tf_query[df_tf_query['query_event.SearchMetrics.results.size'] == 0]
# queries that have results
df_tf_query_with_results = df_tf_query[df_tf_query['query_event.SearchMetrics.results.size'] > 0]
# queries that have results but no clicks
df_tf_query_without_clicks = df_tf_query_with_results[df_tf_query_with_results['metrics.clicks.count'] == 0]
# queries that have results and clicks
df_tf_query_with_clicks = df_tf_query_with_results[df_tf_query_with_results['metrics.clicks.count'] > 0]
Provide basic counts for all datasets.
In [17]:
num_queries = df_tf_query.shape[0]
num_queries_without_results = df_tf_query_without_results.shape[0]
num_queries_with_results = df_tf_query_with_results.shape[0]
num_queries_without_clicks = df_tf_query_without_clicks.shape[0]
num_queries_with_clicks = df_tf_query_with_clicks.shape[0]
In [18]:
zero_result_rate = num_queries_without_results / num_queries * 100
print(f"Zero result rate: {round(zero_result_rate, 2)}%")
In [19]:
abandonment_rate = num_queries_without_clicks / num_queries * 100
print(f"Abandonment rate: {round(abandonment_rate, 2)}%")
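The two rates above are plain ratios of counts. As a hedged illustration with made-up counts, the sketch below computes them through a small hypothetical helper, `rate`, and also shows how the abandonment figure changes if the denominator is restricted to queries with results. Which denominator is appropriate depends on whether abandonment rate is treated as one of the ^ metrics.

```python
def rate(part, whole):
    """Return part / whole as a percentage, or 0.0 when whole is 0."""
    return 100.0 * part / whole if whole else 0.0

# Made-up counts, for illustration only.
num_queries = 200
num_queries_without_results = 20
num_queries_with_results = num_queries - num_queries_without_results
num_queries_without_clicks = 45  # counted among queries that had results

zero_result_rate = rate(num_queries_without_results, num_queries)

# Denominator: all queries.
abandonment_over_all = rate(num_queries_without_clicks, num_queries)
# Denominator: only queries that returned results.
abandonment_over_with_results = rate(num_queries_without_clicks,
                                     num_queries_with_results)
```

The helper also guards against an empty dataset, where a bare division would raise `ZeroDivisionError`.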
In [20]:
mean_clicks_per_query = df_tf_query_with_results['metrics.clicks.count'].mean()
print(f"Clicks per Query: {round(mean_clicks_per_query, 2)}")
In [21]:
num_queries_with_clicks_at_3 = df_tf_query_with_clicks[df_tf_query_with_clicks['metrics.clicks.exist_at_3'] == True].shape[0]
ctr_at_3 = num_queries_with_clicks_at_3 / num_queries_with_clicks
print(f"CTR@3: {round(ctr_at_3, 2)}")
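As a rough sketch of the idea behind CTR@3, the snippet below assumes 1-based ranks and assumes that `exist_at_3` means "at least one click landed at rank 3 or better". The transform's actual definition is not shown in this notebook, and the helper `clicked_within` and the sample data are hypothetical.

```python
def clicked_within(ranks, k):
    """True if any clicked rank is at position k or better (1-based)."""
    return any(r <= k for r in ranks)

# Hypothetical clicked ranks for four queries that each had at least one click.
clicks_by_query = [[1, 5], [4], [3], [7, 9]]

# Count the queries with a click in the top 3, then divide by clicked queries.
hits_at_3 = sum(clicked_within(ranks, 3) for ranks in clicks_by_query)
ctr_at_3 = hits_at_3 / len(clicks_by_query)
```

Note that the result is a fraction between 0 and 1, unlike the two rates earlier, which are printed as percentages.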
In [22]:
max_reciprocal_rank = df_tf_query_with_clicks['metrics.clicks.max_reciprocal_rank'].mean()
print(f"Max Reciprocal Rank: {round(max_reciprocal_rank, 2)}")
In [23]:
mean_mean_reciprocal_rank = df_tf_query_with_clicks['metrics.clicks.mean_reciprocal_rank'].mean()
print(f"Mean Per-Query Mean Reciprocal Rank: {round(mean_mean_reciprocal_rank, 2)}")
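The two reciprocal-rank figures differ only in how clicks are aggregated within a query before averaging across queries. The sketch below illustrates this on made-up click ranks, assuming ranks are 1-based; the helper names are hypothetical and the transform's own computation may differ.

```python
from statistics import mean

def max_reciprocal_rank(ranks):
    """Reciprocal rank of the best (lowest-ranked) click in one query."""
    return max(1 / r for r in ranks)

def mean_reciprocal_rank(ranks):
    """Average reciprocal rank over all clicks in one query."""
    return mean(1 / r for r in ranks)

# Hypothetical clicked ranks for three queries (1-based, assumed).
clicks_by_query = [[1, 4], [2], [4, 4]]

per_query_max = [max_reciprocal_rank(r) for r in clicks_by_query]
per_query_mean = [mean_reciprocal_rank(r) for r in clicks_by_query]

# Dataset-level figures: average the per-query values across queries.
mean_max_rr = mean(per_query_max)
mean_mean_rr = mean(per_query_mean)
```

For a query with a single click the two per-query values coincide; they diverge only when a query has several clicks at different ranks.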
In [24]:
time_to_first_click = df_tf_query_with_clicks['metrics.clicks.time_to_first_click'].mean()
print(f"Time to First Click: {round(time_to_first_click / 1000, 2)} seconds")
In [25]:
time_to_last_click = df_tf_query_with_clicks['metrics.clicks.time_to_last_click'].mean()
print(f"Time to Last Click: {round(time_to_last_click / 1000, 2)} seconds")
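The division by 1000 above suggests the stored values are in milliseconds. A minimal sketch of how such per-query values could be derived follows, assuming both times are measured from the query event and that timestamps are millisecond integers. The helper `click_times_ms` and the sample timestamps are hypothetical.

```python
def click_times_ms(query_ts_ms, click_ts_ms):
    """Return (time_to_first_click, time_to_last_click) in milliseconds."""
    deltas = [t - query_ts_ms for t in click_ts_ms]
    return min(deltas), max(deltas)

# One hypothetical query at t=1000 ms with three clicks (in arbitrary order).
first_ms, last_ms = click_times_ms(1_000, [4_500, 2_500, 9_000])

# Convert to seconds for display, as the cells above do.
first_s = first_ms / 1000
last_s = last_ms / 1000
```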