Explaining Bombora topic interest score datasets.
As a matter of clarification: topic surge, as a product, is generated from topic interest models. In technical discussions we'll refer to both the product and the models by the latter term, which provides a conceptually more meaningful mapping. Bombora's user-facing topic interest scores (the origin of the data you currently have) are currently generated by a 3rd-party topic interest model and service. As mentioned previously, we are also developing an internal topic interest model.
In general, along the topic interest pipeline there are three primary datasets consumed by end users (ordered from raw to aggregate):
- Firehose (FH): the raw content consumption data, at event-level resolution. Size is thousands of GBs/week; only a handful of clients actually consume this data. The Bombora Data Science team refers to this as the raw event data.
- All Domains All Topics (ADAT): aggregate topic interest scores on keys of interest in the Firehose data. Size is tens of GBs/week. The Bombora Data Science team refers to this as the topic interest score data.
- Master Surge (MS): a filtering and transformation of the ADAT dataset that keeps only those topic keys whose scores meet the surge score criteria (explained below). Size is GBs/week. The Bombora Data Science team refers to this as the surging topic interest score data.
While the dataset naming conventions might be a little confusing, the simple explanation is that the topic interest model ingests Firehose data and outputs both ADAT and Master Surge.
As you're interested in the aggregate topic interest scores, we'll only consider ADAT and Master Surge. While similar, each has its own schema. To understand them better, we consider representative topic interest result files for both ADAT and Master Surge, output from the current topic surge batch jobs for the week starting 2015-07-19:
In [22]:
!ls -lh ../../data/topic-interest-score/
Note the file Input_AllDomainsAllTopics_20150719-reduced.csv is tagged with reduced to indicate it's not the complete record set. This is due to the 2 GB file limitation of git LFS; the original compressed Input_AllDomainsAllTopics_20150719.csv.gz file weighed in at 2.62 GB.
This file was generated via:
head -n 166434659 Input_AllDomainsAllTopics_20150719.csv > Input_AllDomainsAllTopics_20150719-reduced.csv
To get an idea of record count, count the number of lines in both files:
In [2]:
!gzip -dc ../../data/topic-interest-score/Output_MasterSurgeFile_20150719.csv.gz | wc -l
In [19]:
!gzip -dc ../../data/topic-interest-score/Input_AllDomainsAllTopics_20150719-reduced.csv.gz | wc -l
As we're interested in understanding the data schema, we'll consider a smaller (not statistically significant) sample of each file.
In [23]:
path_to_data = '../../data/topic-interest-score/'
data_files = !ls {path_to_data}
data_files
Out[23]:
In [24]:
n = 10000

# Build each sample by decompressing and taking the first n records.
cl_cmd_args = 'gzip -cd {path_to_data}{data_file} | {cmd} -n {n} >> {data_file_out}'

for data_file in data_files:
    # Note: str.strip('.csv.gz') strips *characters*, not the suffix, and can
    # mangle filenames; drop the extension explicitly instead.
    data_file_out = data_file.replace('.csv.gz', '') + '-sample.csv'
    print('rm -f {data_file_out}'.format(data_file_out=data_file_out))
    !rm -f {data_file_out}
    print('touch {data_file_out}'.format(data_file_out=data_file_out))
    !touch {data_file_out}
    final_cl_cmd = cl_cmd_args.format(cmd='head', n=n,
                                      path_to_data=path_to_data,
                                      data_file=data_file,
                                      data_file_out=data_file_out)
    print(final_cl_cmd)
    !{final_cl_cmd}
The ADAT file contains topic interest scores at both global and metro resolutions; these are model aggregate values produced on the (domain, topic) and (domain, topic, metro) keys, respectively. The schema of the data is:
Company Name, Domain, Size, Industry, Category, Topic, Composite Score, Bucket Code, Metro Area, Metro Composite Score, Metro Bucket Code, Domain Origin Country
Note that in the schema above:
- Composite Score is the topic interest score for the (domain, topic) key.
- Metro Composite Score is the topic interest score for the (domain, topic, metro) key.
Additionally, note that the ADAT file stores topic interest scores in a denormalized / flattened schema, as shown below:
In [25]:
! head -n 15 Input_AllDomainsAllTopics_20150719-reduced-sample.csv
In [26]:
! tail -n 15 Input_AllDomainsAllTopics_20150719-reduced-sample.csv
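To make the schema concrete, here's a minimal sketch that loads the ADAT sample into pandas. The snake_case column names are our own mapping of the documented schema, and we assume the file carries no header row, consistent with the raw output above:
In [ ]:
import pandas as pd

# Our own snake_case names for the documented ADAT schema (an assumption;
# the raw sample file itself carries no header row).
adat_cols = ['company_name', 'domain', 'size', 'industry', 'category',
             'topic', 'composite_score', 'bucket_code', 'metro_area',
             'metro_composite_score', 'metro_bucket_code',
             'domain_origin_country']

adat_df = pd.read_csv('Input_AllDomainsAllTopics_20150719-reduced-sample.csv',
                      names=adat_cols, header=None)
adat_df.head()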
For end users who only wish to consider the surging topics, i.e., the (domain, topic) and (domain, topic, metro) keys whose topic interest scores meet the surge criterion (score > 50), we filter the ADAT dataset down to scores greater than 50.
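Before moving on, here's a minimal sketch of that score filter alone, using the adat_df frame loaded above (the actual Master Surge job also reshapes the metro fields, as described next):
In [ ]:
# A sketch of the surge criterion only: keep rows whose composite score
# exceeds the threshold of 50 noted above.
surging_df = adat_df[adat_df['composite_score'] > 50]
surging_df.head()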
In producing this filtered result, instead of leaving the schema intact, the 3rd party also performs a transformation of the topic interest score representation. The schema initially looks the same:
Company Name, Domain, Size, Industry, Category, Topic, Composite Score,
however, the metro resolution scores are now collapsed into an array (of sorts), unique to each (domain, topic) key. Each metro name and score is formatted as metro name[metro score], and each line can contain multiple such results, delimited by |:
metro_1[metro_1 score]|metro_2[metro_2 score]|metro_3[metro_3 score],
and finally, again, ending with the domain origin country. Collectively, a record looks like:
Company Name, Domain, Size, Industry, Category, Topic, Composite Score, metro_1[metro_1 score]|metro_2[metro_2 score]|metro_3[metro_3 score], Domain Origin Country
Example output, below:
In [17]:
! head -n 15 Output_MasterSurgeFile_20150719-sample.csv
In [18]:
! tail -n 15 Output_MasterSurgeFile_20150719-sample.csv
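The collapsed metro field isn't directly consumable by a plain CSV parse, so here's a minimal sketch of a parser for it. The function name and regex are our own, and we assume integer metro scores in the metro name[metro score] format described above:
In [ ]:
import re

# Matches one 'metro name[metro score]' entry; entries are '|'-delimited.
METRO_RE = re.compile(r'(?P<metro>[^|\[]+)\[(?P<score>\d+)\]')

def parse_metro_scores(field):
    """Return a list of (metro_name, score) tuples from a raw metro field."""
    return [(m.group('metro').strip(), int(m.group('score')))
            for m in METRO_RE.finditer(field)]

# Hypothetical example field, shaped like the Master Surge records above:
parse_metro_scores('New York[78]|Chicago[64]|San Francisco[55]')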