NESTA - looking for areas of growth to invest in and support.
Research freely available online, to provoke discussion.
Research questions
What are "digital companies"? (main focus of talk)
What do they look like?
What drives their innovation/growth?
Why?
Standard classifications of businesses don't work.
They are used to measure economic output, which doesn't work for digital companies.
SIC - Standard Industrial Classification
731 SICs, self-classified.
(From a question) Self-classification carries no incentive for accuracy, in fact the opposite: changing your classification over time to reflect a changing business strategy just adds paperwork.
e.g.
77220: renting of video tapes and disks
01440: raising of camels and camelids
82990: other business support service activities (10%)
20% not classified
3 million companies in Companies House
Almost a million are unclassified or improperly classified.
This presentation / research did not attempt to classify these unclassified companies.
Challenge
Mapping is necessarily imprecise.
Data-driven methods can be richer, more informative, more up to date.
Linked datasets
Online activity
Trade activity
Trademarks / Patents
News/events
Financials
...
Approach
Classify by:
Sector (their vertical)
Product type
Client type (B2B, B2C, government)
Sales process (franchise, subscription)
e.g. you might be an Oil and Gas company that produces software
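A minimal sketch of how the four classification dimensions could be held as one record per company. The class name, field names and the sample values are assumptions for illustration, not taken from the NESTA research.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompanyProfile:
    """One label per dimension, instead of a single SIC code."""
    sector: str          # the company's vertical
    product_type: str    # what it actually sells
    client_type: str     # B2B, B2C or government
    sales_process: str   # e.g. franchise, subscription


# Hypothetical example: an oil and gas company that produces software.
profile = CompanyProfile(
    sector="Oil and Gas",
    product_type="Software",
    client_type="B2B",
    sales_process="Subscription",
)
print(profile)
```

Keeping the dimensions separate means the sector can be "Oil and Gas" while the product is "Software", which a single code cannot express.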
Tech stack
scrapy / pandas / scikit-learn
Getting training set
Some public companies are pre-classified.
Expert panels for authoritative labels.
Crowd sourcing
!!AI sounds like Amazon Mechanical Turk
Use qualification tests, send each task to many humans, and take the majority vote.
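A minimal sketch of the majority-vote step, assuming each company has already been labelled by several workers who passed the qualification test. The function name, the agreement threshold and the sample votes are assumptions.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the most common label, or None if agreement is too low."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Require a strict majority by default, so ties stay unlabelled.
    if count / len(votes) > min_agreement:
        return label
    return None


# Hypothetical answers from three workers for one company.
votes = ["Software", "Software", "Consultancy"]
print(majority_label(votes))
```

Companies where the vote returns None could be sent out again or passed to the expert panel.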
Feature engineering
Multiple sources
Free text (news)
Structured (patent filings)
Cleaning
Malformed HTML
Stripping JavaScript
lxml, beautifulsoup (prefer lxml, more robust)
!!AI beautifulsoup4 defaults to using lxml if installed.
goose for article extraction
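A minimal sketch of the cleaning step: pull visible text out of a malformed page while dropping JavaScript and CSS. It uses the standard library's html.parser as a stand-in so it runs without extra installs; the talk recommended lxml for this. The class name, function name and sample page are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Unclosed <p>, <body> and <html> tags are tolerated.
page = ("<html><body><p>We build software"
        "<script>var x = '<p>ad</p>';</script>"
        "<p>for oil and gas")
print(visible_text(page))
```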
Tokenising and calculating TF-IDF weights
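A minimal sketch of tokenising and TF-IDF weighting, written with the standard library so the arithmetic is visible. It uses the textbook form (term frequency times log of inverse document frequency); scikit-learn's TfidfVectorizer, which fits the stack above, adds smoothing and normalisation that this sketch omits. Function names and sample texts are assumptions.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf = count / document length, idf = log(N / document frequency).
    """
    tokenised = [tokenise(doc) for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical snippets of company website text.
docs = ["software for oil and gas", "software for retail banks"]
weights = tf_idf(docs)
print(weights[0])
```

Terms that appear in every document (here "software" and "for") get weight zero, so the distinguishing words carry the signal.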
Modelling
pandas to build up feature sets
linear SVMs: linear models are fast with thousands of features
multi-class classifiers for sector, product, client, sales process
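A minimal sketch of the modelling step with scikit-learn: one linear SVM per classification dimension, each fed TF-IDF features. The training texts, labels and variable names are invented for illustration, and only two of the four dimensions are shown; the actual feature sets and model settings used in the research are not given in these notes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training set: website text plus a label per dimension.
texts = [
    "flow measurement software for oil and gas operators",
    "subscription accounting software for small businesses",
    "franchise coffee shops serving customers on the high street",
    "offshore drilling equipment hire for energy companies",
]
labels = {
    "sector": ["Oil and Energy", "Finance", "Food and Drink", "Oil and Energy"],
    "client": ["Businesses", "Businesses", "Consumers", "Businesses"],
}

# Train one independent text classifier per dimension.
models = {}
for dimension, y in labels.items():
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(texts, y)
    models[dimension] = model

new_company = ["metering software for gas pipelines"]
prediction = {dim: m.predict(new_company)[0] for dim, m in models.items()}
print(prediction)
```

Training the dimensions separately keeps each problem a plain multi-class task, and lets a company be an energy-sector firm and a software producer at the same time.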
Example
Kelton
Official SIC classification: 82990 (other)
Their classification
Sector: Oil and Energy
Product: Software Company
Client: Businesses
Sales process: Projects
Based in: Aberdeen
Challenges
Company structure
Subsidiaries, trading partners: who is actually trading, where, and in what?