Hypotheses

  • Cleaner features will improve accuracy & robustness
  • Including the body of the email will improve accuracy
  • Extracting meaning from text will lead to higher quality features

In [27]:
# Load data
import pandas as pd
with open('./data_files/8lWZYw-u-yNbGBkC4B--ip77K1oVwwyZTHKLeD7rm7k.csv') as data_file:
    df = pd.read_csv(data_file)
df.head()


Out[27]:
  | Subject | Id | ConversationId | Importance | SentDateTime | ProcessedBody | RawBody | RawContentType | CcRecipients | Sender | ToRecipients | FolderId
0 | Notification of Approval Change to Application... | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | Normal | 2017-03-03T00:47:23Z | An approver has made changes to the state of y... | <html>\r\n<head>\r\n<meta http-equiv="Content-... | HTML | aadonboardingapprove@microsoft.com | aad@microsoft.com | dkershaw@microsoft.com;dmitry.pugachev@microso... | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
1 | Application ownership request for ApplicationI... | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | Normal | 2017-03-02T21:00:50Z | An application ownership request has been crea... | <html>\r\n<head>\r\n<meta http-equiv="Content-... | HTML | dkershaw@microsoft.com;dmitry.pugachev@microso... | aad@microsoft.com | aadonboardingapprove@microsoft.com | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
2 | Broken OAuth 2 experience for Android developers | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | Normal | 2017-03-02T19:24:12Z | Hi Danny I’ve been trying to build thishtt... | <html>\r\n<head>\r\n<meta http-equiv="Content-... | HTML | noreply@github.com;ddiaz@microsoft.com | johnau@microsoft.com | dastrock@microsoft.com | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
3 | Establishing a process and SLA for answering d... | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | Normal | 2017-03-02T14:07:33Z | Meanwhile here is a small tool enabling us to ... | <html>\r\n<head>\r\n<meta http-equiv="Content-... | HTML | NaN | jean-marc.prieur@microsoft.com | eduardk@microsoft.com;skwan@microsoft.com;vitt... | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
4 | Establishing a process and SLA for answering d... | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... | Normal | 2017-03-02T04:08:20Z | You might have to be part of the FDR group to ... | <html>\r\n<head>\r\n<meta http-equiv="Content-... | HTML | NaN | skwan@microsoft.com | jean-marc.prieur@microsoft.com;vittorib@micros... | AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...

Constructing intelligent features


  • Use Google Cloud Natural Language APIs to start
  • Entity Recognition might be powerful, especially with salience data
  • Syntax analysis to get nouns & perform lemmatization

In [28]:
# Remove messages that are missing a Subject or a body
print(df.shape)
df = df.dropna(subset=['Subject'])
df = df.dropna(subset=['RawBody'])
print(df.shape)


(10397, 12)
(10385, 12)

In [29]:
# Sample the data set to decrease number of records
df = df.sample(frac=0.33, random_state=42)
print(df.shape)


(3427, 12)

In [30]:
print(df['RawBody'][0])


<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta content="text/html; charset=us-ascii">
</head>
<body>
An approver has made changes to the state of your application ownership request. The following are the details
<br>
<br>
RequestId: a8d743a1-12c9-444b-9f76-e9c7fdbf8277 <br>
ApplicationId: 00000003-0000-0000-c000-000000000000 <br>
Environment : Blackforest <br>
State: Approved <br>
Last Updated By: nanedev@microsoft.com <br>
ApproverComments: Approved. 3/2/17 -nanedev <br>
<br>
Please view the application ownership request at the following url https://aadonboardingsite.cloudapp.net/RequestApplicationOwnership/Details?RequestId=a8d743a1-12c9-444b-9f76-e9c7fdbf8277
<br>
<br>
The application can be viewed at the following url https://aadonboardingsite.cloudapp.net/ViewApplications/Details?applicationId=00000003-0000-0000-c000-000000000000&amp;environment=Blackforest
<br>
<br>
Note that AAD onboarding site is monitored for access by SAW users. If you are a SAW user, please make sure to access this site from your SAW machine.
</body>
</html>
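
If the markup ever needed to be stripped locally before feature extraction, a standard-library sketch might look like the following. This is an assumption about a possible cleaning step, not something the notebook does: the data set already has a ProcessedBody column, and the API call in the next cell is sent the raw HTML. `TextExtractor` and `html_to_text` are hypothetical names.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring all tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Collapse runs of whitespace (including \r\n) into single spaces
    return ' '.join(' '.join(parser.chunks).split())

sample = '<html><body>State: Approved <br>\r\nEnvironment : Blackforest <br></body></html>'
print(html_to_text(sample))
```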


In [24]:
# Post a single body text to the Entity Recognition API
# I estimate running this on a corpus of 10K documents would cost about $50
import os
import json
import requests
# Read the API key from the environment rather than hard-coding it in the notebook
# (GOOGLE_API_KEY is an assumed variable name)
params = {'key': os.environ['GOOGLE_API_KEY']}
data = {
    'encodingType': 'UTF8',
    'document': {
        'type': df['RawContentType'][0],
        'content': df['RawBody'][0],
    }
}
r = requests.post('https://language.googleapis.com/v1/documents:analyzeEntities', params=params, json=data)
print(json.dumps(r.json(), sort_keys=True, indent=2, separators=(',', ': ')))


{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 169,
            "content": "changes"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "changes",
      "salience": 0.1815384,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 198,
            "content": "application ownership request"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "application ownership request",
      "salience": 0.15448295,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 151,
            "content": "approver"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "approver",
      "salience": 0.14197437,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 184,
            "content": "state"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "state",
      "salience": 0.14028053,
      "type": "LOCATION"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 1029,
            "content": "SAW user"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "SAW user",
      "salience": 0.04807233,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 251,
            "content": "details"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "details",
      "salience": 0.034719877,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 233,
            "content": "following"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "following",
      "salience": 0.034034744,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 483,
            "content": "ApproverComments"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "ApproverComments",
      "salience": 0.02164433,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 272,
            "content": "RequestId"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "RequestId",
      "salience": 0.02104913,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 416,
            "content": "State"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "State",
      "salience": 0.020842455,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 555,
            "content": "application ownership request"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "application ownership request",
      "salience": 0.019584557,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 398,
            "content": "Blackforest"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "Blackforest",
      "salience": 0.018065916,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 384,
            "content": "Environment"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "Environment",
      "salience": 0.018065916,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 792,
            "content": "https://aadonboardingsite.cloudapp.net/ViewApplications/Details?applicationId=00000003-0000-0000-c000-000000000000&amp;environment=Blackforest"
          },
          "type": "PROPER"
        },
        {
          "text": {
            "beginOffset": 788,
            "content": "url"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "https://aadonboardingsite.cloudapp.net/ViewApplications/Details?applicationId=00000003-0000-0000-c000-000000000000&environment=Blackforest",
      "salience": 0.017957626,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 606,
            "content": "https://aadonboardingsite.cloudapp.net/RequestApplicationOwnership/Details?RequestId=a8d743a1-12c9-444b-9f76-e9c7fdbf8277"
          },
          "type": "PROPER"
        },
        {
          "text": {
            "beginOffset": 602,
            "content": "url"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "https://aadonboardingsite.cloudapp.net/RequestApplicationOwnership/Details?RequestId=a8d743a1-12c9-444b-9f76-e9c7fdbf8277",
      "salience": 0.017957626,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 326,
            "content": "ApplicationId"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "ApplicationId",
      "salience": 0.015409823,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 745,
            "content": "application"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "application",
      "salience": 0.0148884235,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 962,
            "content": "onboarding site"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "onboarding site",
      "salience": 0.013239309,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 1009,
            "content": "users"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "users",
      "salience": 0.013239309,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 462,
            "content": "@microsoft.com"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "@microsoft.com",
      "salience": 0.011577254,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 995,
            "content": "access"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "access",
      "salience": 0.0114711495,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 1071,
            "content": "site"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "site",
      "salience": 0.009654219,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 1090,
            "content": "machine"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "machine",
      "salience": 0.008363323,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 1005,
            "content": "SAW"
          },
          "type": "PROPER"
        },
        {
          "text": {
            "beginOffset": 1086,
            "content": "SAW"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/01b82r",
        "wikipedia_url": "http://en.wikipedia.org/wiki/Saw"
      },
      "name": "SAW",
      "salience": 0.008035766,
      "type": "WORK_OF_ART"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 958,
            "content": "AAD"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/02cxr0",
        "wikipedia_url": "http://en.wikipedia.org/wiki/Clostridium_difficile_colitis"
      },
      "name": "AAD",
      "salience": 0.0038506673,
      "type": "ORGANIZATION"
    }
  ],
  "language": "en"
}
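
The response above lists some entities more than once ('state' and 'State', and 'application ownership request' twice), and the lowest-salience entries include questionable links (SAW as a work of art, AAD as a medical article). One possible way to turn such a response into features is sketched below: sum salience per lower-cased name and drop entities under a threshold. The `min_salience` value of 0.01 is an arbitrary assumption and `entities_to_features` is a hypothetical helper.

```python
def entities_to_features(response, min_salience=0.01):
    """Map lower-cased entity name -> summed salience, skipping weak entities."""
    features = {}
    for entity in response.get('entities', []):
        salience = entity.get('salience', 0.0)
        if salience < min_salience:
            continue
        name = entity['name'].lower()
        # Accumulate instead of overwriting when a name occurs more than once
        features[name] = features.get(name, 0.0) + salience
    return features

# Abbreviated copy of four entities from the response above
response = {'entities': [
    {'name': 'state', 'salience': 0.14028053, 'type': 'LOCATION'},
    {'name': 'State', 'salience': 0.020842455, 'type': 'OTHER'},
    {'name': 'SAW', 'salience': 0.008035766, 'type': 'WORK_OF_ART'},
    {'name': 'AAD', 'salience': 0.0038506673, 'type': 'ORGANIZATION'},
]}
features = entities_to_features(response)
print(sorted(features))
```

A threshold like this would also discard domain terms such as SAW and AAD, so it trades vocabulary size against possibly useful signal.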

In [39]:
import os
import requests

# Read the API key from the environment rather than hard-coding it in the notebook
# (GOOGLE_API_KEY is an assumed variable name)
params = {'key': os.environ['GOOGLE_API_KEY']}
url = 'https://language.googleapis.com/v1/documents:analyzeEntities'

feature_matrix = pd.DataFrame()
for index, row in df.iterrows():

    # Perform entity recognition on document
    data = {
        'encodingType': 'UTF8',
        'document': {
            'type': row['RawContentType'],
            'content': row['RawBody'],
        }
    }
    r = requests.post(url, params=params, json=data)

    # Populate feature matrix with entities as columns;
    # skip documents whose response has no 'entities' key (e.g. an API error)
    try:
        for entity in r.json()['entities']:
            try:
                feature_matrix.at[index, entity['name'].lower()] = entity['salience']
            except KeyError:
                continue
    except KeyError:
        continue

feature_matrix.head()


Out[39]:
skype subject system msa danny strockis @saravana kumar dastrock@microsoft.com adfrei@microsoft.com shalages@microsoft.com cc ... msoffhelp@live.com blocklist https://excel.officeapps.live.com/x/_layouts/resources/oauthcallback.htm wac chris brown ardian daron skeptor otherdoamin/callback.html https://excel.officeapps.live.com/x/_layouts/resources/1033/wefgallery.htm samedomain/callback.htm
8964 0.6039 0.000166 0.000543 0.021836 0.019095 0.0178 0.016940 0.015969 0.001099 0.000374 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2926 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1806 NaN 0.008246 NaN NaN 0.005078 NaN 0.007919 NaN NaN 0.007938 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8114 NaN 0.000063 0.000085 0.006795 0.003791 NaN 0.012747 0.010935 NaN 0.000155 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3442 NaN 0.000901 NaN 0.055578 NaN NaN NaN NaN NaN 0.002320 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 42798 columns


In [43]:
# TODO: Need to train with fixed vocabulary, otherwise runtime feature construction won't work correctly
# TODO: Try to limit number of rows
print(len(feature_matrix.columns.values))


42798
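
The fixed-vocabulary TODO could be approached as in the minimal sketch below: derive the column order once from the training documents, then map any later document onto those same columns and ignore entities that were never seen. The helper names and the toy feature dicts are assumptions for illustration; scikit-learn's `DictVectorizer` offers comparable fit/transform behavior.

```python
def build_vocabulary(feature_dicts):
    """Assign a fixed column index to every entity name seen in training."""
    names = sorted(set(name for d in feature_dicts for name in d))
    return {name: i for i, name in enumerate(names)}

def vectorize(feature_dict, vocabulary):
    """Project one document onto the fixed vocabulary; unseen names are dropped."""
    row = [0.0] * len(vocabulary)
    for name, salience in feature_dict.items():
        column = vocabulary.get(name)
        if column is not None:
            row[column] = salience
    return row

# Toy training documents (entity name -> salience), made up for illustration
train = [{'approver': 0.14, 'state': 0.16}, {'skype': 0.60}]
vocab = build_vocabulary(train)
print(vectorize({'skype': 0.2, 'unseen entity': 0.5}, vocab))
```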

Strategies for reducing # of columns in feature matrix

  • Add more stop words
  • Remove email addresses
  • Remove URLs
  • Lemmatization
  • Remove numbers, special characters, and repeated-character sequences like 'aaaaa'
  • Perform manual tokenization to get column names, and inspect types of cols created
  • ...
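
Several of the strategies above can be prototyped as a filter over column names. The sketch below is one rough take. The regular expressions are assumptions and deliberately aggressive (any name containing '@' or '/' is dropped), and the sample names are a mix of columns from the output above and made-up noise.

```python
import re

NOISE_PATTERNS = [
    re.compile(r'@'),             # email addresses and @-mentions
    re.compile(r'https?://|/'),   # URLs and path-like fragments
    re.compile(r'^[\W\d_]+$'),    # names made only of digits or special characters
    re.compile(r'(.)\1{3,}'),     # one character repeated 4+ times, e.g. 'aaaaa'
]

def is_noise(column_name):
    """True if the column name matches any of the noise patterns."""
    return any(p.search(column_name) for p in NOISE_PATTERNS)

columns = ['skype', 'dastrock@microsoft.com', '@saravana kumar',
           'samedomain/callback.htm', '12345', 'aaaaa', 'blocklist']
kept = [c for c in columns if not is_noise(c)]
print(kept)
```

Applied to the real matrix, the same predicate could select the columns to keep before the conversion to a sparse matrix.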

In [60]:
# TODO: Is there some form of TF/IDF to be done here?
# Drop rows not in feature matrix
# (.loc replaces .ix, which newer pandas versions no longer provide)
df = df.loc[feature_matrix.index]
print(df.shape)

# Fill NaNs with zeros
feature_matrix = feature_matrix.fillna(value=0.0)

# Convert to sparse matrix
from scipy.sparse import csr_matrix
feature_matrix_numpy = csr_matrix(feature_matrix.values)


(3389, 12)
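
On the TF/IDF question: one option would be to treat salience as the term weight and multiply it by an inverse document frequency, so entities that occur in nearly every message (such as 'cc' or 'subject') are down-weighted. The sketch below uses the plain log(N / document frequency) form. Whether this helps here is untested, and the toy documents are made up.

```python
import math

def idf_weights(feature_dicts):
    """Compute log(N / document frequency) for every entity name."""
    n_docs = len(feature_dicts)
    doc_freq = {}
    for d in feature_dicts:
        for name in d:
            doc_freq[name] = doc_freq.get(name, 0) + 1
    return {name: math.log(float(n_docs) / count)
            for name, count in doc_freq.items()}

def apply_idf(feature_dict, idf):
    """Scale each salience by the entity's IDF weight (0 for unknown names)."""
    return {name: s * idf.get(name, 0.0) for name, s in feature_dict.items()}

# Toy documents (entity name -> salience), made up for illustration
docs = [{'cc': 0.1, 'skype': 0.6}, {'cc': 0.2}, {'cc': 0.3, 'msa': 0.05}]
idf = idf_weights(docs)
weighted = apply_idf(docs[0], idf)
print(weighted)
```

In this toy corpus 'cc' appears in every document, so its weight becomes zero, while 'skype' is scaled up by log(3).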

Train model & evaluate accuracies



In [61]:
# Split into test and training data sets
from sklearn.model_selection import train_test_split
labels_train, labels_test, features_train, features_test = train_test_split(df['FolderId'], feature_matrix_numpy, test_size=0.20, random_state=42)
print(labels_train.shape)
print(labels_test.shape)
print(features_train.shape)
print(features_test.shape)


(2711,)
(678,)
(2711, 42798)
(678, 42798)

In [62]:
# Train a default Logistic Regression model, with no tuning
from sklearn.linear_model import LogisticRegression
default_lgr_model = LogisticRegression().fit(features_train, labels_train)

In [64]:
# Evaluate default Logistic Regression model on test data
default_lgr_predictions = default_lgr_model.predict(features_test)
from sklearn import metrics
print(metrics.accuracy_score(labels_test, default_lgr_predictions))


0.323008849558
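
An accuracy of about 0.32 is hard to judge without a reference point, because the notebook does not show how many folders there are or how unevenly messages are spread across them. A simple reference is the accuracy of always predicting the most common training label. The sketch below uses made-up folder labels, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return float(hits) / len(test_labels)

# Made-up folder labels for illustration
train = ['inbox', 'inbox', 'archive', 'inbox', 'team']
test = ['inbox', 'archive', 'inbox', 'team']
print(majority_baseline_accuracy(train, test))
```

In the notebook the same function could be called with `labels_train` and `labels_test`.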

Conclusions

  • Need to go back through and examine the types of features being used
  • Need to add people features
