Bombora Topic Interest Datasets

Explaining Bombora topic interest score datasets.

0. Surge vs Interest?

As a matter of clarification, topic surge as a product is generated from topic interest models. In technical discussions, we'll refer to both the product and the models by the latter term, which provides a conceptually more meaningful mapping. Bombora's user-facing topic interest scores (and the origin of the data you currently have) are currently generated by a 3rd-party topic interest model and service. As mentioned previously, we are also developing an internal topic interest model.

1. End User Datasets (Overview)

In general, down the topic interest line, there are three primary datasets consumed by end users (ordered from raw to aggregate):

  1. Firehose (FH): the raw content consumption data, at event-level resolution. Size is thousands of GBs/week; only a handful of clients actually consume this data. The Bombora Data Science team refers to this as the raw event data.

  2. All Domains All Topics (ADAT): an aggregate topic interest score on keys of interest in the Firehose data. Size is tens of GBs/week. The Bombora Data Science team refers to this as the topic interest score data.

  3. Master Surge (MS): a filtering and transformation of the ADAT dataset that keeps only those topic keys whose scores meet some surge score criteria (explained below). Size is GBs/week. The Bombora Data Science team refers to this as the surging topic interest score data.

While the dataset naming convention might be a little confusing, the simple explanation is that the topic interest model ingests Firehose data and outputs both the ADAT and Master Surge datasets.

2. End User Dataset (Details)

As you're interested in the aggregate topic interest score, we'll only consider ADAT and Master Surge. While similar, each has its own schema. To understand them better, we consider representative topic interest result files for both ADAT and Master Surge, output from the current topic surge batch jobs for the week starting 2015-07-19:


In [22]:
!ls -lh ../../data/topic-interest-score/


total 3930616
-rw-r--r--  1 nehalecky  staff   1.7G Jul 14 12:30 Input_AllDomainsAllTopics_20150719-reduced.csv.gz
-rw-------  1 nehalecky  staff   173M Aug 10  2015 Output_MasterSurgeFile_20150719.csv.gz

Note that the file Input_AllDomainsAllTopics_20150719-reduced.csv is tagged with reduced to indicate it's not the complete record set. This is due to the 2GB file limitation of git LFS. The original compressed Input_AllDomainsAllTopics_20150719.csv.gz file weighed in at 2.62 GB.

This file was generated via:

head -n 166434659 Input_AllDomainsAllTopics_20150719.csv > Input_AllDomainsAllTopics_20150719-reduced.csv

To get an idea of record count, count the number of lines in both files:


In [2]:
!gzip -dc ../../data/topic-interest-score/Output_MasterSurgeFile_20150719.csv.gz | wc -l


 14521301

In [19]:
!gzip -dc ../../data/topic-interest-score/Input_AllDomainsAllTopics_20150719-reduced.csv.gz | wc -l


 166434659

As we're interested in understanding the data schema, we'll consider a small (not statistically significant) sample of both files.


In [23]:
path_to_data = '../../data/topic-interest-score/'
data_files = !ls {path_to_data}
data_files


Out[23]:
['Input_AllDomainsAllTopics_20150719-reduced.csv.gz',
 'Output_MasterSurgeFile_20150719.csv.gz']

In [24]:
n = 10000  # number of leading lines to sample from each file
# Decompress to stdout and keep only the first n lines.
cl_cmd_args = 'gzip -cd {path_to_data}{data_file} | {cmd} -n {n} >> {data_file_out}'
for data_file in data_files:
    # Drop the '.csv.gz' suffix explicitly; str.strip('.csv.gz') would strip
    # any of those characters from both ends rather than the suffix itself.
    data_file_out = data_file[:-len('.csv.gz')] + '-sample.csv'
    # Start from an empty sample file on every run.
    print('rm -f {data_file_out}'.format(data_file_out=data_file_out))
    !rm -f {data_file_out}
    print('touch {data_file_out}'.format(data_file_out=data_file_out))
    !touch {data_file_out}
    final_cl_cmd = cl_cmd_args.format(cmd='head', n=n,
                                      path_to_data=path_to_data,
                                      data_file=data_file,
                                      data_file_out=data_file_out)
    print(final_cl_cmd)
    !{final_cl_cmd}


rm -f Input_AllDomainsAllTopics_20150719-reduced-sample.csv
touch Input_AllDomainsAllTopics_20150719-reduced-sample.csv
gzip -cd ../../data/topic-interest-score/Input_AllDomainsAllTopics_20150719-reduced.csv.gz | head -n 10000 >> Input_AllDomainsAllTopics_20150719-reduced-sample.csv
gzip: error writing to output: Broken pipe
gzip: ../../data/topic-interest-score/Input_AllDomainsAllTopics_20150719-reduced.csv.gz: uncompress failed
rm -f Output_MasterSurgeFile_20150719-sample.csv
touch Output_MasterSurgeFile_20150719-sample.csv
gzip -cd ../../data/topic-interest-score/Output_MasterSurgeFile_20150719.csv.gz | head -n 10000 >> Output_MasterSurgeFile_20150719-sample.csv
gzip: error writing to output: Broken pipe
gzip: ../../data/topic-interest-score/Output_MasterSurgeFile_20150719.csv.gz: uncompress failed
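
The gzip "Broken pipe" messages above are expected: head exits after reading n lines and closes the pipe while gzip is still writing, so the sample files themselves are complete. If the noise is unwanted, the same sampling can be done in Python. The following is a minimal sketch; the function name sample_head is our own and not part of any existing tooling.

```python
import gzip
from itertools import islice


def sample_head(gz_path, out_path, n=10000):
    """Write the first n lines of a gzip-compressed file to out_path.

    Returns the number of lines actually written.
    """
    # Binary mode avoids making any assumption about the text encoding.
    # Decompression is lazy, so reading stops after the first n lines.
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        count = 0
        for line in islice(src, n):
            dst.write(line)
            count += 1
    return count
```

Inside the loop above, this could be called as sample_head(path_to_data + data_file, data_file_out, n) in place of the shell pipeline.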

ADAT

The ADAT file contains topic interest scores at both global and metro resolutions. These are model aggregate values produced at both the (domain, topic) and (domain, topic, metro) keys.

The schema of the data is:

Company Name, Domain, Size, Industry, Category, Topic, Composite Score, Bucket Code, Metro Area, Metro Composite Score, Metro Bucket Code, Domain Origin Country

Note that in the schema above, the:

  • Composite Score is the topic interest score from the (domain, topic) key.
  • Metro Composite Score is the topic interest score from the (domain, topic, metro) key.

Additionally, we note that the topic interest scores in the ADAT file are stored in a denormalized / flattened schema, as shown below:


In [25]:
! head -n 15 Input_AllDomainsAllTopics_20150719-reduced-sample.csv


,0-lightower.net,,,accounting,recurring journal entries,53,3131,greater boston area,72,2132,United States
,0-lightower.net,,,accounting,recurring journal entries,53,3131,,45,3421,United States
,0-lightower.net,,,accounting,accounting journals entries,49,3421,,52,3141,United States
,0-lightower.net,,,accounting,intercompany accounting,48,3232,,58,3121,United States
,0-lightower.net,,,accounting,intercompany accounting,48,3232,greater boston area,25,3232,United States
,0-lightower.net,,,accounting,accounting,46,3232,greater boston area,44,3232,United States
,0-lightower.net,,,accounting,accounting,46,3232,,16,8021,United States
,0-lightower.net,,,accounting,activity-based costing (abc),25,8042,,0,0,United States
,0-lightower.net,,,accounting,multi-currency accounting,25,8042,,0,0,United States
,0-lightower.net,,,accounting,credit derivatives,5,8032,,49,3421,United States
,0-lightower.net,,,ad tech,sell side platform (ssp),37,3422,greater boston area,46,3142,United States
,0-lightower.net,,,ad tech,online quoting software,15,8032,,0,0,United States
,0-lightower.net,,,ad tech,marketing email,10,8041,,0,0,United States
,0-lightower.net,,,ad tech,web beacons,10,8042,,0,0,United States
,0-lightower.net,,,ad tech,marketing tools,5,8032,greater boston area,25,3232,United States

In [26]:
! tail -n 15 Input_AllDomainsAllTopics_20150719-reduced-sample.csv


,1stopsale.com,,,it management,it industry,45,3411,greater new york city area,45,3411,United States
,1stopsale.com,,,it management,innovation,25,3411,,0,0,United States
,1stopsale.com,,,it management,it spending,25,3421,,0,0,United States
,1stopsale.com,,,it management,software compliance,25,8042,,0,0,United States
,1stopsale.com,,,it management,it portfolio management,14,8042,,0,0,United States
,1stopsale.com,,,it management,it careers,11,8031,,0,0,United States
,1stopsale.com,,,it management,sustainability,10,8041,,0,0,United States
,1stopsale.com,,,labor relations,collective bargaining,43,3421,"orange county, california area",58,3121,United States
,1stopsale.com,,,labor relations,collective bargaining,43,3421,greater new york city area,30,8041,United States
,1stopsale.com,,,labor relations,labor relations,11,8031,,0,0,United States
,1stopsale.com,,,labor relations,unions,11,8031,,0,0,United States
,1stopsale.com,,,leadership & strategy,management fundamentals,68,2421,greater new york city area,70,2411,United States
,1stopsale.com,,,leadership & strategy,human resource management,10,8041,,0,0,United States
,1stopsale.com,,,leadership & strategy,seniority,10,8041,,0,0,United States
,1stopsale.com,,,leadership & strategy,workforce management,10,8041,,0,0,United States
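
Rows like those above can be read with Python's csv module, which also handles quoted metro names that contain commas (e.g., "orange county, california area"). This is a minimal sketch: the AdatRow field identifiers are our own renderings of the schema column names, and treating both scores as integers is an assumption based on the sample.

```python
import csv
import io
from collections import namedtuple

# Field order follows the ADAT schema; the identifiers are our own.
AdatRow = namedtuple('AdatRow', [
    'company_name', 'domain', 'size', 'industry', 'category', 'topic',
    'composite_score', 'bucket_code', 'metro_area',
    'metro_composite_score', 'metro_bucket_code', 'domain_origin_country',
])


def parse_adat(lines):
    """Yield AdatRow records from an iterable of ADAT CSV lines."""
    for fields in csv.reader(lines):
        if not fields:  # skip blank lines
            continue
        row = AdatRow(*fields)
        # Assumption: scores are integers, as in the sample rows.
        # Bucket codes are left as strings.
        yield row._replace(
            composite_score=int(row.composite_score),
            metro_composite_score=int(row.metro_composite_score),
        )


# Two rows copied from the sample output above.
sample = io.StringIO(
    ',1stopsale.com,,,labor relations,collective bargaining,43,3421,'
    '"orange county, california area",58,3121,United States\n'
    ',1stopsale.com,,,labor relations,unions,11,8031,,0,0,United States\n'
)
rows = list(parse_adat(sample))
print(rows[0].metro_area, rows[0].metro_composite_score)
```

Rows without a metro (an empty Metro Area field) parse to an empty string in metro_area.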

Master Surge

Filter

For end users who only wish to consider surging topics, we filter the ADAT dataset to keep only the (domain, topic) and (domain, topic, metro) keys whose topic interest score meets the surge criteria (i.e., a score > 50).

Transform

In producing this filtered result, instead of leaving the schema intact, the 3rd party also transforms the representation of the topic interest score(s). The schema is initially the same:

Company Name, Domain, Size, Industry, Category, Topic, Composite Score,

however, the metro resolution scores are now collapsed into an array (of sorts), unique to each (domain, topic) key. Each metro name and score is formatted as metro name[metro score], and each line can contain multiple results, joined together like:

metro_1[metro_1 score]|metro_2[metro_2 score]|metro_3[metro_3 score],

and finally, ending again with the domain origin country. Collectively, this looks like:

Company Name, Domain, Size, Industry, Category, Topic, Composite Score, metro_1[metro_1 score]|metro_2[metro_2 score]|metro_3[metro_3 score], Domain Origin Country
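
To make the collapse concrete, here is a rough sketch of how flattened ADAT rows could be grouped into this layout. It is not the 3rd party's implementation: the function name collapse_metros is hypothetical, the sketch keeps every metro row it sees, and it applies no surge filter, since the exact rules used upstream are not documented here.

```python
from collections import OrderedDict


def collapse_metros(adat_rows):
    """Collapse flattened ADAT rows into Master Surge-style rows.

    adat_rows: iterable of 12-field sequences in ADAT schema order.
    Returns a list of 9-field lists, one per (domain, topic) key.
    """
    grouped = OrderedDict()
    for row in adat_rows:
        (company, domain, size, industry, category, topic, score,
         _bucket, metro, metro_score, _metro_bucket, country) = row
        key = (domain, topic)
        if key not in grouped:
            grouped[key] = {
                'head': [company, domain, size, industry,
                         category, topic, score],
                'metros': [],
                'country': country,
            }
        # Each metro becomes 'metro name[metro score]'; the name may be empty.
        grouped[key]['metros'].append('{}[{}]'.format(metro, metro_score))
    return [g['head'] + ['|'.join(g['metros']), g['country']]
            for g in grouped.values()]


# Two ADAT rows that share the same (domain, topic) key.
adat_rows = [
    ['', '0-lightower.net', '', '', 'accounting', 'recurring journal entries',
     '53', '3131', 'greater boston area', '72', '2132', 'United States'],
    ['', '0-lightower.net', '', '', 'accounting', 'recurring journal entries',
     '53', '3131', '', '45', '3421', 'United States'],
]
collapsed = collapse_metros(adat_rows)
print(collapsed[0][7])  # greater boston area[72]|[45]
```

Note that a row with no metro name collapses to an entry such as [45], with an empty name before the bracket.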

Example output, below:


In [17]:
! head -n 15 Output_MasterSurgeFile_20150719-sample.csv


,0-lightower.net,,,analytics & reporting,workforce analytics,83,greater boston area[77]|[77],United States
,0-lightower.net,,,benefits,workers' compensation,72,greater boston area[77],United States
,0-lightower.net,,,business finance,business loans,25,greater boston area[25],United States
,0-lightower.net,,,business finance,equipment and vehicle financing,25,greater boston area[25],United States
,0-lightower.net,,,crm,email management software,75,greater boston area[75]|[77],United States
,0-lightower.net,,,desktop,desktop environment,25,greater boston area[25],United States
,0-lightower.net,,,device connectivity,wifi,25,greater boston area[25],United States
,0-lightower.net,,,ecommerce,payment processing,25,greater boston area[25],United States
,0-lightower.net,,,email marketing,email list management,75,greater boston area[75]|[77],United States
,0-lightower.net,,,health tech,computerized physician order entry (cpoe),25,[74],United States
,0-lightower.net,,,hr tech,corporate portals,25,greater boston area[25],United States
,0-lightower.net,,,it management,email management,75,greater boston area[75]|[77],United States
,0-lightower.net,,,leadership & strategy,workforce management,80,greater boston area[77]|[43],United States
,0-lightower.net,,,operating system,operating system,76,greater boston area[82]|greater chicago area[52],United States
,0-lightower.net,,,operating system,linux,25,greater boston area[25],United States

In [18]:
! tail -n 15 Output_MasterSurgeFile_20150719-sample.csv


pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,search marketing,internet search,70,greater boston area[79],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,smartphone,windows phone,92,greater boston area[94],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,social,reddit,89,greater boston area[88],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,staff administration,staff management,75,greater boston area[83],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,staff administration,time management,75,greater boston area[75],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,staff departure,layoff,86,greater boston area[94],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,telecommunications,phone call processing,70,greater boston area[70],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,trading & investing,initial public offering (ipo),87,greater boston area[81],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,trading & investing,currency futures,83,greater boston area[83],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,trading & investing,market volatility index (vix),83,greater boston area[83],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,trading & investing,treasury notes,83,greater boston area[83],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,trading & investing,foreign-exchange options,80,greater boston area[77],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,trading & investing,options,80,greater boston area[83],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,trading & investing,bonds,70,greater boston area[77],United States
pja advertising + marketing,agencypja.com,51-200 employees,marketing and advertising,training & development,career development,84,greater boston area[79],United States
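
The collapsed metro field in these rows can be split back into (metro, score) pairs. The following is a minimal sketch; the helper name parse_metro_scores and the regular expression are our own, and the sketch assumes every entry ends in an integer score in square brackets, as in the sample.

```python
import csv
import re

# One entry looks like 'metro name[score]'; the metro name may be empty.
METRO_ENTRY = re.compile(r'^(?P<metro>.*)\[(?P<score>\d+)\]$')


def parse_metro_scores(field):
    """Split 'metro_1[s1]|metro_2[s2]' into a list of (metro, score) tuples."""
    if not field:
        return []
    pairs = []
    for entry in field.split('|'):
        match = METRO_ENTRY.match(entry)
        if match is None:
            raise ValueError('unexpected metro entry: {!r}'.format(entry))
        pairs.append((match.group('metro'), int(match.group('score'))))
    return pairs


# A row copied from the head output above; the metro field is column 8.
line = (',0-lightower.net,,,operating system,operating system,76,'
        'greater boston area[82]|greater chicago area[52],United States')
fields = next(csv.reader([line]))
metros = parse_metro_scores(fields[7])
print(metros)
# [('greater boston area', 82), ('greater chicago area', 52)]
```

Entries with no metro name, such as [77], come back with an empty string as the metro.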
