In [1]:
from pymongo import MongoClient
from tqdm import tqdm
In [2]:
client = MongoClient()
db = client['rf_test']
col_entries = db['entries']
col_inst = db['inst']
col_inds = db['inds']
col_nouns = db['nouns']
In [3]:
def list_entries():
    # Distinct values of the "type" field across all entries
    return list(col_entries.distinct("type"))
In [8]:
entries = Out[4]
In [12]:
def count_entry():
    entry_count = []
    for entry in entries:
        # count_documents replaces Cursor.count(), which PyMongo 4 removed
        c = col_entries.count_documents({"type": entry})
        print("{}|{}".format(c, entry))
        entry_count.append((c, entry))
    return entry_count
In [14]:
entry_count = Out[13]
In [17]:
sorted(entry_count, key=lambda x: x[0], reverse=True)
Out[17]:
Among the entry types in this data, usernames seem interesting.
First, we want to see which usernames we have in the database. To find all of them, we can do:
In [3]:
def find_all_usernames():
    return list(col_entries.find({"type": "Username"}))
Since there are about 50k usernames, let's just look at one of them:
In [4]:
find_all_usernames()[30]
Out[4]:
Note the 'id' field. Since it corresponds to the authors field of an instance ('attributes.authors'), we can use it to count a username's contributions: for each username, we can count or list all of the instances associated with it.
In [5]:
def find_inst_by_username(userid):
    return list(col_inst.find({"attributes.authors": userid}))
In [6]:
find_inst_by_username("LTRvjC")
Out[6]:
We could also go through the data and see which usernames "have the most to say".
In [7]:
def username_ranking():
    usernames_ranks = []
    # Only the first 100 usernames; counting all of them this way is too slow
    usernames = find_all_usernames()[:100]
    for u in tqdm(usernames):
        uid = u['id']
        name = u['name']
        count = col_inst.count_documents({"attributes.authors": uid})
        usernames_ranks.append((uid, name, count))
    # Ascending order, so the most active user is the last element
    return sorted(usernames_ranks, key=lambda x: x[2])
On my computer, this runs at about 3 usernames per second, so going through the complete dataset and counting every username would take around 5 hours. So I only ran it on the first 100 usernames.
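One way to avoid a query per username would be to walk the instance collection once and tally authors as they appear. The sketch below is only an illustration, not what was run above: `count_authors` is a hypothetical helper, the sample documents are made up, and it assumes 'attributes.authors' holds a list of username ids (as the queries above suggest).

```python
from collections import Counter


def count_authors(instances):
    """Tally how many instances each author id appears in, in a single pass.

    `instances` is any iterable of instance documents. It is assumed (not
    confirmed by the data) that `attributes.authors` is a list of username ids.
    """
    counts = Counter()
    for inst in instances:
        authors = inst.get("attributes", {}).get("authors") or []
        if isinstance(authors, str):
            # Tolerate a single id stored as a plain string
            authors = [authors]
        # Use a set so an id listed twice in one instance is counted once
        counts.update(set(authors))
    return counts


# Made-up documents standing in for the contents of col_inst
sample_instances = [
    {"attributes": {"authors": ["u1", "u2"]}},
    {"attributes": {"authors": ["u1"]}},
    {"attributes": {}},  # instance with no author information
]

author_counts = count_authors(sample_instances)
print(author_counts.most_common(1))
```

Against the real collection, the same function could be fed `col_inst.find({}, {"attributes.authors": 1})`, which fetches only the author field. Whether this beats the per-username query depends on the size of the collection and on whether 'attributes.authors' is indexed.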
In [8]:
u_ranks = username_ranking()
For example, in this small subset, the user connected to the most instances is:
In [11]:
u_ranks[-1]
Out[11]:
In [12]:
def get_inst_for_user(uid):
    return list(col_inst.find({"attributes.authors": uid}))
In [15]:
[x['attributes']['indicator'] for x in get_inst_for_user('KKYGPH')]
Out[15]:
In [16]:
from nltk.tag import pos_tag
In [17]:
def find_nouns(sentence):
    words = sentence.split()
    tagged = pos_tag(words)
    # 'NNP' is the Penn Treebank tag for a singular proper noun
    return [w for w, t in tagged if t == 'NNP']
Here, we will look at everything the user has said, and count all the proper nouns.
In [22]:
def words_from_user(uid):
    insts = get_inst_for_user(uid)
    sentences = [x['fragment'] for x in insts]
    words = {}
    for s in sentences:
        # Count every proper noun found in this fragment
        for n in find_nouns(s):
            words[n] = words.get(n, 0) + 1
    return words
In [26]:
sorted(words_from_user('KKYGPH').items(), key=lambda x: x[1], reverse=True)
Out[26]:
Hmm, apparently this user talks about Dridex more than anything else. This approach still needs a lot of improvement. One big improvement would be to categorize the nouns, which would give us some context for what the user is talking about.
Furthermore, we can compare users and group them by what they talk about. This brings us back to our instance database. For example, let's look at the Dridex malware:
In [46]:
def people_and_instance(indicator):
    c = list(col_inst.find({"attributes.indicator": indicator}))
    authors = []
    authors_info = {}
    for entry in c:
        try:
            author = entry['attributes']['authors']
        except KeyError:
            # Instances without an authors field are recorded as None
            author = None
        authors.append(author)
    for a in tqdm(authors):
        if a:
            # Only the first listed author of each instance is looked up
            en = col_entries.find_one({"id": a[0]})
            if en:
                authors_info[en['id']] = {"name": en['name']}
    return authors_info
In [48]:
talks_of_dridex = Out[29]
In [32]:
len(Out[29]) #There are 468 entries about Dridex, sweet !!
Out[32]:
In [47]:
people_and_instance('Dridex')
Out[47]:
Once we have a list of people who talk about a topic, we can iterate through the list of IDs, find their sentences, and count the nouns in those. This way, we could see who is "really interested" in the topic.
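As a rough sketch of that idea, one possible measure of interest is the share of a user's proper-noun mentions that refer to the topic. Both the metric and the helper names (`topic_share`, `rank_by_interest`) are assumptions made for illustration, and the noun counts are made up; in the notebook they would come from words_from_user(uid).

```python
def topic_share(noun_counts, topic):
    """Fraction of a user's proper-noun mentions that are `topic` (0.0 if none)."""
    total = sum(noun_counts.values())
    if total == 0:
        return 0.0
    return noun_counts.get(topic, 0) / total


def rank_by_interest(counts_by_user, topic):
    """Sort (uid, share) pairs from most to least focused on `topic`."""
    shares = [(uid, topic_share(c, topic)) for uid, c in counts_by_user.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Made-up noun counts keyed by hypothetical user ids
counts_by_user = {
    "userA": {"Dridex": 8, "Windows": 2},
    "userB": {"Dridex": 1, "Apple": 9},
    "userC": {},
}

ranking = rank_by_interest(counts_by_user, "Dridex")
print(ranking)
```

A share treats a user with 8 of 10 mentions the same as one with 80 of 100, so the raw count may be worth keeping alongside it.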
A different way of looking at usernames, especially Twitter usernames, is through relationships. That is, we take a Twitter handle and also look at the accounts it follows. We suspect that a Twitter user follows subjects that are important to them. In this way, we can find other accounts to put in our crawl list. Furthermore, we can look at a group of people and see whether they all talk about the same topic.
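A minimal sketch of how crawl-list candidates might be picked from follow lists, assuming the lists have already been collected somehow (fetching them from Twitter is not covered here). `common_follows`, the `min_share` threshold, and all handles below are hypothetical.

```python
from collections import Counter


def common_follows(following, min_share=0.5):
    """Accounts followed by at least `min_share` of the group, most shared first.

    `following` maps each handle in the group to the set of handles it follows.
    """
    if not following:
        return []
    tally = Counter()
    for followed in following.values():
        tally.update(set(followed))
    threshold = min_share * len(following)
    # Skip accounts that are already part of the group being examined
    return [
        (account, n)
        for account, n in tally.most_common()
        if n >= threshold and account not in following
    ]


# Made-up follow lists for three hypothetical handles
following = {
    "alice": {"malware_news", "bob", "vendor_x"},
    "bob": {"malware_news", "alice"},
    "carol": {"malware_news", "vendor_x", "cooking"},
}

candidates = common_follows(following)
print(candidates)
```

Accounts that most of the group follows are the natural candidates for the crawl list; the 0.5 threshold is arbitrary and would need tuning on real data.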