The MIT License (MIT)
Copyright (c) 2016, Lumen Novus Incorporated d/b/a SharePoint Experience
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
In [1]:
%matplotlib inline
import requests
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import spacy
from gensim import corpora, models, similarities
from lxml import etree
from bs4 import BeautifulSoup
from IPython.core.display import display, HTML
import re
In [2]:
pd.set_option('display.max_colwidth', -1)
plt.rcdefaults()
nlp = spacy.English()
Time to define which SharePoint Online site will be used and the credentials for signing into it. Since SharePoint Online uses SAML-based authentication, a template is created for the security token request.
You'll need to update the four variables here to match your SharePoint Online environment. If you're not using SharePoint Online, you'll need to change the code from the following cell all the way down to where sp_request is defined.
In [3]:
# Your SharePoint Online domain root goes here, including the / at the end
endpoint = 'https://????.sharepoint.com/'
# Your SharePoint Online user goes here (the user you want to use for authenticating)
username = '????@????.onmicrosoft.com'
# And your password. Imagine that.
password = 'pass@word1'
# The URL to the root of the site collection you want to use goes here, including the / at the end
the_site = 'sites/contoso/Employee/ITWeb/Information%20Technology/'
api_base = '{}{}_api/web/'.format(endpoint, the_site)
request_headers = {
    'Accept': 'application/json',
    'odata': 'verbose',
}
saml_request_template = '<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:a="http://www.w3.org/2005/08/addressing" xmlns:u="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"><s:Header><a:Action s:mustUnderstand="1">http://schemas.xmlsoap.org/ws/2005/02/trust/RST/Issue</a:Action><a:ReplyTo><a:Address>http://www.w3.org/2005/08/addressing/anonymous</a:Address></a:ReplyTo><a:To s:mustUnderstand="1">https://login.microsoftonline.com/extSTS.srf</a:To><o:Security s:mustUnderstand="1" xmlns:o="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"><o:UsernameToken><o:Username>{username}</o:Username><o:Password>{password}</o:Password></o:UsernameToken></o:Security></s:Header><s:Body><t:RequestSecurityToken xmlns:t="http://schemas.xmlsoap.org/ws/2005/02/trust"><wsp:AppliesTo xmlns:wsp="http://schemas.xmlsoap.org/ws/2004/09/policy"><a:EndpointReference><a:Address>{endpoint}</a:Address></a:EndpointReference></wsp:AppliesTo><t:KeyType>http://schemas.xmlsoap.org/ws/2005/05/identity/NoProofKey</t:KeyType><t:RequestType>http://schemas.xmlsoap.org/ws/2005/02/trust/Issue</t:RequestType><t:TokenType>urn:oasis:names:tc:SAML:1.0:assertion</t:TokenType></t:RequestSecurityToken></s:Body></s:Envelope>'
saml_request_body = saml_request_template.format(
endpoint=endpoint, username=username, password=password)
<s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:a="http://www.w3.org/2005/08/addressing" xmlns:u="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
  <s:Header>
    <a:Action s:mustUnderstand="1">http://schemas.xmlsoap.org/ws/2005/02/trust/RST/Issue</a:Action>
    <a:ReplyTo>
      <a:Address>http://www.w3.org/2005/08/addressing/anonymous</a:Address>
    </a:ReplyTo>
    <a:To s:mustUnderstand="1">https://login.microsoftonline.com/extSTS.srf</a:To>
    <o:Security s:mustUnderstand="1" xmlns:o="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
      <o:UsernameToken>
        <o:Username>????@????.onmicrosoft.com</o:Username>
        <o:Password>pass@word1</o:Password>
      </o:UsernameToken>
    </o:Security>
  </s:Header>
  <s:Body>
    <t:RequestSecurityToken xmlns:t="http://schemas.xmlsoap.org/ws/2005/02/trust">
      <wsp:AppliesTo xmlns:wsp="http://schemas.xmlsoap.org/ws/2004/09/policy">
        <a:EndpointReference>
          <a:Address>https://????.sharepoint.com/</a:Address>
        </a:EndpointReference>
      </wsp:AppliesTo>
      <t:KeyType>http://schemas.xmlsoap.org/ws/2005/05/identity/NoProofKey</t:KeyType>
      <t:RequestType>http://schemas.xmlsoap.org/ws/2005/02/trust/Issue</t:RequestType>
      <t:TokenType>urn:oasis:names:tc:SAML:1.0:assertion</t:TokenType>
    </t:RequestSecurityToken>
  </s:Body>
</s:Envelope>
An XML document can actually combine a bunch of different grammars, each one identified by a namespace. A tag with a prefix, like <SharePoint:AwesomeSauce>, means that the AwesomeSauce tag comes from the SharePoint namespace.
These namespaces may be useful, so they'll be stored in a Python dictionary for use later.
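As a quick, self-contained sketch of that idea (the SharePoint namespace URI below is made up purely for illustration), here is how lxml represents and queries a namespaced tag:
In [ ]:
# Hypothetical namespace URI, just to illustrate how a prefix maps to a namespace
doc = etree.XML(
    '<SharePoint:AwesomeSauce xmlns:SharePoint="http://example.com/sharepoint">'
    'yum</SharePoint:AwesomeSauce>')
# lxml stores the tag in "Clark notation": the namespace URI in braces, then the local name
print(doc.tag)  # {http://example.com/sharepoint}AwesomeSauce
# XPath queries need the prefix-to-URI mapping passed in explicitly
print(doc.xpath('/sp:AwesomeSauce/text()',
                namespaces={'sp': 'http://example.com/sharepoint'}))  # ['yum']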
In [4]:
nsmap = {
    'S': 'http://www.w3.org/2003/05/soap-envelope',
    'wsse': 'http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd',
    'wsu': 'http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd',
    'wsa': 'http://www.w3.org/2005/08/addressing',
    'wst': 'http://schemas.xmlsoap.org/ws/2005/02/trust'
}
In [5]:
def saml_request(body):
    # SharePoint Online REST API - Get Request Security Token
    # POST https://login.microsoftonline.com/extSTS.srf
    try:
        response = requests.post(
            url="https://login.microsoftonline.com/extSTS.srf",
            headers={
                "Accept": "application/json",
            },
            data=body
        )
        return response
    except requests.exceptions.RequestException:
        print('HTTP Request failed')
In [6]:
s = saml_request(saml_request_body)
saml_response = etree.XML(s.content)
saml_token = saml_response.xpath(
'//wsse:BinarySecurityToken[@Id="Compact0"]', namespaces=nsmap)[0].text
saml_token
Out[6]:
Now that the token is returned, it's time to exchange it with the SharePoint Online site used in this example. It's kinda like logging in, except the SharePoint site isn't given the login and password; it's given the ticket that the earlier request returned to us. It will respond with two cookies (yum!) that will be used again (and again) for any future requests to the SharePoint Online site.
In [7]:
a = requests.post(
url="{}_forms/default.aspx?wa=wsignin1.0".format(endpoint),
data=saml_token
)
In [8]:
authcookies = {
    'FedAuth': a.cookies['FedAuth'],
    'rtFa': a.cookies['rtFa']
}
In [9]:
authcookies
Out[9]:
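Those two cookies get passed explicitly on every request below. As an aside, a requests.Session could carry them instead so they ride along automatically; here's a minimal sketch of that alternative (not used in the rest of this notebook):
In [ ]:
# Alternative sketch (not used below): park the auth cookies on a Session so every
# later call to SharePoint Online sends them without passing cookies= each time
sp_session = requests.Session()
sp_session.cookies.update(authcookies)
# e.g. sp_session.get(api_base + 'siteusers', headers=request_headers)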
In [10]:
def sp_request(path_from_base, fields_to_keep, index_on_id=True):
    # Call the SharePoint REST API with the auth cookies and return the
    # requested fields as a pandas DataFrame
    r = requests.get(
        url="{}{}".format(api_base, path_from_base),
        cookies=authcookies,
        headers=request_headers
    )
    j = r.json()
    v = j['value']
    df = pd.DataFrame(v)
    df = df[fields_to_keep]
    if index_on_id:
        df.set_index('Id', inplace=True)
    return df
In [11]:
df_users = sp_request('siteusers', ['Id', 'Email', 'Title'])
In [12]:
df_users
Out[12]:
In [13]:
df_lists_all = sp_request(
'lists', ['Title', 'Description', 'ItemCount', 'Id', 'Hidden'])
df_lists = df_lists_all[df_lists_all.Hidden == False]
In [14]:
# Top 5 lists
df_lists.sort_values(by='ItemCount', ascending=False).head()
Out[14]:
In [15]:
discussion_id = df_lists[df_lists.Title == 'Discussions List'].index.values[0]
tmp_discussions = sp_request(
path_from_base="lists(guid'{}')/items".format(discussion_id),
fields_to_keep=[
'Title', 'Body', 'AuthorId', 'Id', 'ParentItemID', 'BestAnswerId'],
index_on_id=False
)
In [16]:
tmp_discussions
Out[16]:
Natural Language Processing works better when you're processing a single natural language (in this case, English) rather than text mixed with markup, so the makeplain function uses the Python module BeautifulSoup to strip out all the HTML. A couple of string replacements also remove some stray whitespace characters (zero-width spaces and non-breaking spaces) to help clean up the text.
In [17]:
def makeplain(html):
    # Strip the HTML tags, then remove zero-width spaces and non-breaking spaces
    text = BeautifulSoup(html, 'html.parser').get_text()
    text = text.replace(u'\u200b', '')
    text = text.replace(u'\xa0', '')
    return text
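As a quick check on a made-up discussion body (the HTML below is invented for illustration), makeplain boils the markup down to plain text:
In [ ]:
# A made-up discussion body, just to show the effect of makeplain
sample = '<div><p>Has anyone upgraded to <strong>SharePoint 2016</strong> yet?&nbsp;\u200b</p></div>'
print(makeplain(sample))  # Has anyone upgraded to SharePoint 2016 yet?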
In [18]:
df_discussions = tmp_discussions.merge(
df_users, left_on=tmp_discussions.AuthorId, right_on=df_users.index.values, suffixes=['_disc', '_user'])
df_discussions.set_index('Id', inplace=True)
df_discussions['Body_Text'] = df_discussions.apply(
lambda row: makeplain(row['Body']), axis=1)
df_discussions = df_discussions[['Title_user', 'Title_disc', 'Body_Text', 'Body', 'AuthorId', 'ParentItemID', 'BestAnswerId',
'Email']]
Here, the discussions are grouped by author, and a graph is plotted to show how many posts each user has made and how many posts they've marked as an answer, a metric that may show how much they're truly contributing to the community rather than "leeching" from it.
In [19]:
%matplotlib inline
plt.rcParams.update({
'figure.facecolor': 'white',
'font.size': '24',
'axes.grid': 'true',
})
grouped = df_discussions.groupby(['Title_user']).count().sort_values(
    by='Title_disc', ascending=False)
grouped[['Title_disc', 'BestAnswerId']].plot(
kind='barh', title='Frequent Posters', figsize=(18, 12), colormap='rainbow')
plt.legend(['Posts by User', 'Other Answers Marked by User'])
Out[19]:
In [20]:
best_answer_ids = df_discussions.BestAnswerId.dropna().values
best_answers = df_discussions[df_discussions.index.isin(best_answer_ids)].Body.values
for answer in best_answers:
    display(HTML(answer))
In [21]:
df_discussions.sort_index(inplace=True)
discussion_bodies = df_discussions.Body_Text.values
discussion_bodies[:5]
Out[21]:
Some words don't really contribute to the ability to discern important words or topics. These words, like the, of, altogether, or whereby, are called stopwords. spaCy has its own list of them, and a few bits of punctuation and partial contractions are added to that list. Finally, texts is set to a list of lists of words, one list for each of the original discussions.
In [22]:
my_stop_list = set('the of is - . , ? ! \'s ca n\'t'.split())
stoplist = my_stop_list.union(spacy.en.STOPWORDS)
texts = [
    [word.lower_ for word in nlp(discussion)
     if word.lower_ not in stoplist]
    for discussion in discussion_bodies
]
frequency is set as a dictionary mapping each word to the number of times it appears across all the discussions. Words that appear only once in the whole corpus are then filtered out, and the result is assigned back to texts. Since this is still a list (one entry per discussion) of lists (of the "relevant" words that appear more than once in the entire corpus), it's easily assigned as a new column in the df_discussions DataFrame.
In [23]:
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
df_discussions['relevant_words'] = texts
In [24]:
i = 0
for text in texts:
    i += 1
    print(i, text)
In [25]:
dictionary = corpora.Dictionary(texts)
dictionary.save('contoso.dict')
In [26]:
print(dictionary.token2id)
Sounds like a new charity, but in this case the gensim library is used to convert the corpus into a representation called bag of words (BoW) and save it to disk. There are a few file formats for serializing such a corpus, and the Matrix Market format is one of the most common, which could be helpful if this corpus ever needs to be processed in its numeric form by another library.
In [27]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('contoso.mm', corpus)
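To make the bag-of-words structure concrete, here's a tiny toy example (the token ids are assigned by gensim, so the exact numbers may differ):
In [ ]:
# Toy example: each document becomes a list of (token id, count) pairs
toy_texts = [['sharepoint', 'upgrade', 'sharepoint'], ['upgrade', 'workflow']]
toy_dictionary = corpora.Dictionary(toy_texts)
print(toy_dictionary.token2id)  # e.g. {'sharepoint': 0, 'upgrade': 1, 'workflow': 2}
print([toy_dictionary.doc2bow(t) for t in toy_texts])  # e.g. [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]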
Now, a model, or another form of numeric representation of the corpus, is trained based on a technique called Term Frequency / Inverse Document Frequency.
Just because a word appears often in the entire corpus doesn't mean it can be used to classify individual discussions. The stopword list won't catch industry-specific terms, for example. If a word or phrase is used a lot in only a few posts, that word or phrase could be relevant when determining the topic of those discussions. TF/IDF is a great way to identify truly relevant words or phrases. In this example, only single words are considered.
In [28]:
tfidf = models.TfidfModel(corpus, normalize=True)
The corpus is now transformed to match the trained TF/IDF model. Before, the numeric representation was the plain bag of words (how many times each word appears in a discussion); after the transformation, each word's weight is that count scaled down by how many individual discussions across the whole corpus use the word. The actual math is considerably more complex than that:
$$idf(t,D) = \log{\frac{N}{|\{d \in D : t \in d\}|}}$$
where $idf(t,D)$ is the inverse document frequency of the term $t$ in our set of discussions $D$, $N$ is the total number of discussions in the corpus, and the denominator counts the discussions that contain $t$.
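As a quick sanity check with made-up numbers: if the corpus held 20 discussions and a term appeared in only 2 of them, then
$$idf(t,D) = \log{\frac{20}{2}} = \log{10}$$
while a term found in all 20 discussions would get $\log{1} = 0$, so words that show up everywhere contribute nothing to telling discussions apart. (The base of the logarithm varies by implementation and only rescales the weights.)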
In [29]:
corpus_tfidf = tfidf[corpus]
The numeric representation of the corpus is transformed again using another method called Latent Semantic Indexing, which uses some clever maths to discover words that tend to appear together frequently, suggesting they contribute to a shared overall topic. In this case, the number of topics is limited to 5.
In [30]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
corpus_lsi = lsi[corpus_tfidf]
topics = lsi.show_topics()
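Each entry of corpus_lsi is one discussion expressed as a list of (topic id, weight) pairs, which is what the best-topic selection a couple of cells below relies on. Here's a quick peek (the weights in the comment are invented for illustration):
In [ ]:
# Each discussion projects onto the 5 LSI topics as (topic id, weight) pairs,
# e.g. [(0, 0.41), (1, -0.12), (2, 0.07), ...]
for doc_topics in list(corpus_lsi)[:3]:
    print(doc_topics)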
In [31]:
topic_words = []
for topic in topics:
    # Strip the numeric weights from each topic string, keeping only the words
    topic = re.split('[^A-Za-z]', topic)
    topic = [token for token in topic if token != '']
    topic_words.append(topic)
    print(topic)
topic_labels = ['Topic A', 'Topic B', 'Topic C', 'Topic D', 'Topic E']
Each discussion's "topic words" are stored with the discussion in the df_discussions dataframe, in case they're useful later. They're sorted by their relative contribution to the overall topic. The topic names and IDs are then calculated and stored in the df_discussions dataframe.
In [32]:
best_topics = []
best_topic_labels = []
lsi_topics = [lsi for lsi in corpus_lsi]
for lsi in lsi_topics:
    if lsi:
        # Sort this discussion's (topic id, weight) pairs by weight and keep the strongest
        lsi.sort(key=lambda tup: tup[1], reverse=True)
        topic_id = lsi[0][0]
        best_topics.append(topic_id)
        best_topic_labels.append(topic_labels[topic_id])
    else:
        best_topics.append('unknown')
        best_topic_labels.append('unknown')
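The cells above don't show the assignment back onto the DataFrame, but since best_topics and best_topic_labels line up row-for-row with df_discussions (both were built from the same sorted order), it would presumably look something like this (the column names are my assumption):
In [ ]:
# Not shown in the original cells; the column names here are assumptions
df_discussions['best_topic_id'] = best_topics
df_discussions['best_topic_label'] = best_topic_labels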
In [33]:
df_discussions
Out[33]:
In [34]:
disc_np = []
disc_op = []
for entry in discussion_bodies:
    doc = nlp(entry)
    # Noun phrases (and the head word each one hangs off of) from spaCy's parser
    nounphrases_and_head = [[np.orth_, np.root.head.orth_] for np in doc.noun_chunks]
    nounphrases = [np.orth_ for np in doc.noun_chunks]
    # Named entities, keeping only organizations and people
    entities = list(doc.ents)
    orgs_and_people = [entity.orth_ for entity in entities if entity.label_ in ['ORG', 'PERSON']]
    disc_np.append(', '.join(nounphrases))
    disc_op.append(', '.join(orgs_and_people))
df_discussions['noun_phrases'] = disc_np
df_discussions['orgs_and_people'] = disc_op
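To see what those two spaCy features pull out, here's a made-up sentence run through the same calls (the exact chunks and labels depend on the model, so treat the commented output as illustrative):
In [ ]:
# Made-up sentence, just to show what noun chunks and entity labels look like
toy_doc = nlp('Jane Smith asked whether Contoso will migrate the intranet to SharePoint Online.')
print([np.orth_ for np in toy_doc.noun_chunks])
# e.g. ['Jane Smith', 'Contoso', 'the intranet', 'SharePoint Online']
print([(ent.orth_, ent.label_) for ent in toy_doc.ents])
# e.g. [('Jane Smith', 'PERSON'), ('Contoso', 'ORG'), ('SharePoint Online', 'ORG')]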
In [35]:
df_discussions
Out[35]:
In [36]:
df_discussions.to_csv('contoso.csv')
Using the list of AuthorId values from the df_discussions DataFrame, the active_users variable is set to a filtered copy of the original df_users DataFrame. The inverse of that list of authors is used to populate inactive_users with another filtered copy of df_users, and users without an email address are dropped from that inactive list.
In [37]:
active_user_ids = set(df_discussions.AuthorId.values)
active_users = df_users[df_users.index.isin(active_user_ids)]
inactive_users = df_users[~df_users.index.isin(active_user_ids)]
inactive_users = inactive_users[~(inactive_users.Email == "")]
display(HTML('<h2>Active Users</h2>'))
display(HTML(active_users.to_html()))
display(HTML('<h2>Inactive Users</h2>'))
display(HTML(inactive_users.to_html()))
In [38]:
%matplotlib inline
from wordcloud import WordCloud
wctext = ' '.join([word for sublist in texts for word in sublist])
# take relative word frequencies into account, lower max_font_size
wc = WordCloud(background_color="white", max_font_size=40)
wc.generate(wctext)
plt.figure(figsize=(12,12))
plt.imshow(wc)
plt.axis("off")
plt.show()