Sponsors

  • NESTA - looking for areas of growth to invest in and support.
  • Research freely available online, to provoke discussion.

Research questions

  • What are "digital companies"? (Main focus of the talk.)
  • What do they look like?
  • What drives their innovation/growth?

Why?

  • Standard classifications of businesses don't work.
  • They are used to measure economic output, but that doesn't work for digital companies.

SIC - Standard Industrial Classification

  • 731 SICs, self-classified.
    • (from a question) Self-classification gives no incentive for accuracy; if anything, the opposite. Updating your classification over time to reflect a changing business strategy just adds paperwork.
  • e.g.
    • 77220: renting of video tapes and disks
    • 01440: raising of camels and camelids
  • 82990: other business support service activities (10%)
  • 20% not classified
  • 3 million companies in Companies House
  • Almost a million are unclassified or improperly classified.
  • This presentation / research did not attempt to classify these unclassified companies.

Challenge

  • Mapping is necessarily imprecise.
  • Data-driven methods can be richer, more informative, more up to date.

Linked datasets

  • Online activity
  • Trade activity
  • Trademarks / Patents
  • News/events
  • Financials
  • ...

Approach

  • Classify by:
    • Sector (their vertical)
    • Product type
    • Client type (B2B, B2C, government)
    • Sales process (franchise, subscription)
  • e.g. you might be an Oil and Gas company that produces software
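
A minimal sketch of how the four axes above could be held as one record per company, instead of a single SIC code. The class name, field names and the client/sales values in the example are assumptions made for illustration; the talk did not show a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompanyProfile:
    """One label per axis, rather than one code for the whole company."""
    sector: str         # the company's vertical
    product_type: str   # what it actually sells
    client_type: str    # e.g. "B2B", "B2C" or "Government"
    sales_process: str  # e.g. "Franchise" or "Subscription"


# The example from the notes: an oil and gas company that produces software.
profile = CompanyProfile(
    sector="Oil and Gas",
    product_type="Software",
    client_type="B2B",             # assumed for illustration
    sales_process="Subscription",  # assumed for illustration
)
print(profile)
```

Keeping the axes separate means the sector and the product type never have to compete for a single slot, which is the problem the SIC code has.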

Tech stack

  • scrapy / pandas / scikit-learn

Getting training set

  • Some public companies are pre-classified.
  • Expert panels for authoritative labels.
  • Crowd sourcing
    • !!AI sounds like Amazon Mechanical Turk
    • use qualification tests, and send tasks to many humans and take majority vote.
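
The majority-vote step can be sketched as below. The function name, the agreement threshold and the sample votes are all hypothetical; the talk did not say how ties or weak majorities were handled, so this sketch simply returns None when no label wins more than half the votes.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the most common label, or None if its share of the
    votes does not exceed min_agreement (or there are no votes)."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Invented votes from three workers per company.
crowd_votes = {
    "company_a": ["Software", "Software", "Consulting"],
    "company_b": ["Retail", "Software", "Consulting"],
}
labels = {name: majority_label(votes) for name, votes in crowd_votes.items()}
print(labels)  # {'company_a': 'Software', 'company_b': None}
```

Companies that come back as None could be sent to more workers or to the expert panel rather than entering the training set with a noisy label.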

Feature engineering

  • Multiple sources
    • Free text (news)
    • Structured (patent filings)
  • Cleaning
    • Malformed HTML
    • Stripping JavaScript
    • lxml, beautifulsoup (prefer lxml, more robust)
      • !!AI beautifulsoup4 defaults to using lxml if installed.
    • goose for article extraction
  • Tokenising and calculating TF-IDF weights
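
A rough sketch of the cleaning, tokenising and TF-IDF steps above. To stay dependency-free it swaps in the standard library's html.parser for lxml/beautifulsoup and computes a plain tf * log(N / df) weight by hand; the talk's own pipeline presumably used the libraries listed in the tech stack, and the sample pages are invented.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenise(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document."""
    tokenised = [tokenise(doc) for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented pages for illustration.
pages = [
    "<html><body><p>Offshore drilling software</p><script>var x = 1;</script></body></html>",
    "<html><body><p>Online fashion retail software</p></body></html>",
]
texts = [html_to_text(page) for page in pages]
weights = tfidf(texts)
print(weights[0])
```

A term that appears in every document ("software" here) gets weight zero, while terms specific to one company ("drilling") keep a positive weight, which is what makes them useful as classification features.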

Modelling

  • pandas to build up feature sets
  • linear SVMs; linear models are fast even with thousands of features
  • multi-class classifiers for sector, product, client, sales process
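
One way the "one classifier per axis" idea could look with scikit-learn: a TF-IDF vectoriser feeding a linear SVM, fitted separately for each axis. The training texts and labels are invented toy data, only two of the four axes are shown, and the talk did not show its actual pipeline or hyperparameters, so treat this as a sketch under those assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
texts = [
    "offshore drilling analytics software for energy operators",
    "pipeline monitoring software for oil and gas businesses",
    "online fashion store selling clothes to shoppers",
    "monthly subscription box of clothes for consumers",
]
labels = {
    "sector": ["Oil and Energy", "Oil and Energy", "Retail", "Retail"],
    "client": ["Businesses", "Businesses", "Consumers", "Consumers"],
}

# One independent classifier per axis, each with its own TF-IDF vocabulary.
models = {
    axis: make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, y)
    for axis, y in labels.items()
}

new_text = ["drilling software for energy businesses"]
prediction = {axis: model.predict(new_text)[0] for axis, model in models.items()}
print(prediction)
```

LinearSVC handles more than two classes with a one-vs-rest scheme, so the same code applies when an axis has many possible labels.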

Example

  • Kelton
    • Official SIC classification: 82990 (other)
    • Their classification
      • Sector: Oil and Energy
      • Product: Software Company
      • Client: Businesses
      • Sales process: Projects
      • Based in: Aberdeen

Challenges

  • Company structure
    • Subsidiaries, trading partners, who is actually trading and where and what.
