Now let's try an experiment with a real dataset. The dataset was taken from the MacForums forum, and I judged for myself which posts are off-topic.


In [1]:
from glob import glob
import os.path
import re

from otdet.detector import OOTDetector
from otdet.feature_extraction import CountVectorizerWrapper

In [2]:
files = glob('datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/*.txt')

In [3]:
def post_num(filename):
    # Extract the post number from names like 'post-12.oot.txt'
    m = re.search(r'-(\d+)\.', filename)
    return int(m.group(1))

In [4]:
files.sort(key=post_num)
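
The numeric sort key matters because a plain lexicographic sort would place post-10 before post-2. Here is a minimal, self-contained illustration; the three file names are made up for the example and `post_num` is repeated so the snippet runs on its own:

```python
import re

def post_num(filename):
    # Same idea as the helper above: pull out the digits between '-' and '.'
    m = re.search(r'-(\d+)\.', filename)
    return int(m.group(1))

# Hypothetical file names, just for illustration
names = ['post-10.oot.txt', 'post-2.txt', 'post-1.txt']

lexicographic = sorted(names)           # compares character by character
numeric = sorted(names, key=post_num)   # compares the extracted integers

print(lexicographic)  # ['post-1.txt', 'post-10.oot.txt', 'post-2.txt']
print(numeric)        # ['post-1.txt', 'post-2.txt', 'post-10.oot.txt']
```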

In [5]:
files


Out[5]:
['datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-0.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-1.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-2.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-3.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-4.oot.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-5.oot.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-6.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-7.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-8.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-9.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-10.oot.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-11.oot.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-12.oot.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-13.txt',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-14.txt']

In [6]:
documents = []
for file_ in files:
    with open(file_) as f:
        documents.append(f.read())

In [7]:
documents


Out[7]:
['As from what I have been told, Flash has existed for around 20 years. It is theorized to soon be obsolete and will be replaced by HTML5 . Also, Apple adopted the best decision in not supporting Flash. \n\nNot supporting Flash- What does that really mean? Surely you can download Adobe Flash into your Mac OS. You need Flash to watch YouTube in addition to many other websites. I do not quite understand this. So I guess Macs can still play Flash but does not have it readily available in a newly purchased Mac. Which simply means that Apple does not fancy Flash, yet allow users the permission to install it. Is that right? \n\nAnother thing. Tablets and iPads from what I have heard do not support Flash and you CANNOT download Flash on it. Yet, how can an iPad view YouTube if it does not have Flash?\n',
 "Flash has never been included in any OS (Windows, Linux or OS X) by default. You've always had to download it separately. Chrome includes a version of Flash which means you don't have to download it..\n\nApple has been pushing HTML5 for a while due in part for the fact that it sin't as resource intensive as Flash is and is more flexible.\n\nEither way, its your machine, you want Flash to watch videos on sites that don't support HTML5 (YouTube DOES support it), then download Flash and enjoy.\n\niOS devices have never included Flash. Additionally, Adobe has stopped working on the mobile version of Flash as well. The YouTube client and many video sites on the web detect the fact that the viewer is an iOS device and actually use MP4 video files instead of Flash.\n",
 "Oh yeh, that's right. I always had to download and install Flash on my Windows computer. \n\nI misunderstood the concept before and thought that Apple does not allow installation of Flash period. I took the word 'support' in the sense of 'compatibility' as opposed to 'condone'\n\nOh I see. So if we watch YouTube with our phone, YouTube detects that we are using a phone so it streams MP4. In contrast, when we watch YouTube with our Mac, it streams via the Flash that we have manually installed. I wonder why YouTube did not make it universal and stream MP4 for when we are watching with our Mac as well....\n",
 "No OS vendor is going to explicitly deny installation of any piece of software. Take Java for example, OS X used to include it for a while and now they don't. Again, if you need it, there is an easy way of getting it..\n\nDevices like iPhone/iPod Touch/Android and so on support a set of video/audio formats..MP4, VP8, MP3, OGG, WAV and so on. So if you stream data in those formats the devices will handle it no problem.\n\nOn the other hand computers (based on the OS) can support some combination of those formats, and others have to be installed, thus the original use of Flash to play videos and now the use of HTML5 which doesn't require downloading anything more a recent browser..\n",
 'Out of the topic but I feel sorry for the IT students who have to learn Java. Appears it will become obsolete in the near future.\n',
 'Unlikely...\n\n\n\nAndroid uses Java as its language..\n',
 "You're right, quite unlikely. \n\nSo why doesn't YouTube, the most popular video streaming site, stream through HTML5 instead of Flash? After all, you said that Flash require less resources and is more flexible.\n",
 'https://www.youtube.com/html5\n',
 "Oh, it does support HTML5. The reason for my confusion is I had to download Flash in order to watch YouTube via Firefox.\n\nIf it's not too personal, what do you do for a living? Are you in the IT industry? Just out of curiosity\n",
 "HTML5 isn't enabled by default on YouTube, so if you visit with Firefox you'll be prompted to download Flash, but if you'd enabled HTML5 it would've been fine. Additionally, if you'd used Chrome you'd never have been prompted at all..\n\nHehe, why, does it show?  Yes, I'm a technical software engineering manager, which basically means that I'm a SW manager who codes.. Interestingly, currently working on porting to Android to Windows..but that's a very long off-topic discussion..\n",
 "I want to understand something. There are so many people on here, such as yourself, answering questions on an hourly basis. How does that work? I mean, do you randomly check this website everyday and look for questions to answer. I'm sure you're quite busy with your Android/Windows project\n",
 'One of the things of doing the kind of work that I do is the need for distractions. If I were to just sit and code for 8 hours straight, my brain would leak out of my ears..\n\nSo every 15-20 mins or so, apart from taking a break from the screen, walking around, checking on my team and so on I get on here to see if there are any posts I can reply to and then get back to work..\n',
 "Most people go on Facebook, play online games or walk to another cubicle and talk to a friend to take a break at work. \nYou go onto Mac-Forum [which is a good thing for us ]. But, that's like Mike Tyson hitting the punching bag after a title fight...to take a break. That's ferocious computer passion.\n",
 "Part of my job is creating videos for YouTube. When I encode a video I use an mp4 format which is a very universally compatible format for just about any device. Years back it was all about Flash and that is what I used. \n\nYouTube use to accept what ever you sent it and if your computer did not run flash you got an error message telling you to go download it. Now YouTube will upload your video and re-encode it into several formats for maximum compatibility across all platforms. \n\nFlash has proven over and over to have security issues which is why several years back they did not support it. If I recall correctly, Steve Jobs got into a tiff with Adobe and would not allow Flash on an Apple device legitimately. (Correct me if I am remembering this wrong.) He was also a big supporter of the HTML5 format.\n\nI don't add Flash to any of my machines anymore. I run both windows and mac platforms now and I haven't seen a reason to have it. \n\nAnyway, I too am another of the legions who sit in front of her computer all day and take breaks by visiting forums. This one has become my favorite because of the tons of information and all the nice people.\n\nLisa\n",
 "Lisa, you are remembering correctly about Steve Jobs not wanting Flash on the iPhone until Adobe addressed the security issues and more importantly the performance issues. Adobe didn't care very much for that and figured they had leverage, but had to close down their mobile Flash work, so who was right all along? \n\n@simonvee, I do the other things as well but sometimes I end up in a mode where I have to build code and that can take some time which gives me a breather to peruse the forums..\n"]

In [8]:
d = dict(zip(files, documents))

In [9]:
d


Out[9]:
{'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-2.txt': "Oh yeh, that's right. I always had to download and install Flash on my Windows computer. \n\nI misunderstood the concept before and thought that Apple does not allow installation of Flash period. I took the word 'support' in the sense of 'compatibility' as opposed to 'condone'\n\nOh I see. So if we watch YouTube with our phone, YouTube detects that we are using a phone so it streams MP4. In contrast, when we watch YouTube with our Mac, it streams via the Flash that we have manually installed. I wonder why YouTube did not make it universal and stream MP4 for when we are watching with our Mac as well....\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-10.oot.txt': "I want to understand something. There are so many people on here, such as yourself, answering questions on an hourly basis. How does that work? I mean, do you randomly check this website everyday and look for questions to answer. I'm sure you're quite busy with your Android/Windows project\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-1.txt': "Flash has never been included in any OS (Windows, Linux or OS X) by default. You've always had to download it separately. Chrome includes a version of Flash which means you don't have to download it..\n\nApple has been pushing HTML5 for a while due in part for the fact that it sin't as resource intensive as Flash is and is more flexible.\n\nEither way, its your machine, you want Flash to watch videos on sites that don't support HTML5 (YouTube DOES support it), then download Flash and enjoy.\n\niOS devices have never included Flash. Additionally, Adobe has stopped working on the mobile version of Flash as well. The YouTube client and many video sites on the web detect the fact that the viewer is an iOS device and actually use MP4 video files instead of Flash.\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-12.oot.txt': "Most people go on Facebook, play online games or walk to another cubicle and talk to a friend to take a break at work. \nYou go onto Mac-Forum [which is a good thing for us ]. But, that's like Mike Tyson hitting the punching bag after a title fight...to take a break. That's ferocious computer passion.\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-13.txt': "Part of my job is creating videos for YouTube. When I encode a video I use an mp4 format which is a very universally compatible format for just about any device. Years back it was all about Flash and that is what I used. \n\nYouTube use to accept what ever you sent it and if your computer did not run flash you got an error message telling you to go download it. Now YouTube will upload your video and re-encode it into several formats for maximum compatibility across all platforms. \n\nFlash has proven over and over to have security issues which is why several years back they did not support it. If I recall correctly, Steve Jobs got into a tiff with Adobe and would not allow Flash on an Apple device legitimately. (Correct me if I am remembering this wrong.) He was also a big supporter of the HTML5 format.\n\nI don't add Flash to any of my machines anymore. I run both windows and mac platforms now and I haven't seen a reason to have it. \n\nAnyway, I too am another of the legions who sit in front of her computer all day and take breaks by visiting forums. This one has become my favorite because of the tons of information and all the nice people.\n\nLisa\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-11.oot.txt': 'One of the things of doing the kind of work that I do is the need for distractions. If I were to just sit and code for 8 hours straight, my brain would leak out of my ears..\n\nSo every 15-20 mins or so, apart from taking a break from the screen, walking around, checking on my team and so on I get on here to see if there are any posts I can reply to and then get back to work..\n',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-5.oot.txt': 'Unlikely...\n\n\n\nAndroid uses Java as its language..\n',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-14.txt': "Lisa, you are remembering correctly about Steve Jobs not wanting Flash on the iPhone until Adobe addressed the security issues and more importantly the performance issues. Adobe didn't care very much for that and figured they had leverage, but had to close down their mobile Flash work, so who was right all along? \n\n@simonvee, I do the other things as well but sometimes I end up in a mode where I have to build code and that can take some time which gives me a breather to peruse the forums..\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-6.txt': "You're right, quite unlikely. \n\nSo why doesn't YouTube, the most popular video streaming site, stream through HTML5 instead of Flash? After all, you said that Flash require less resources and is more flexible.\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-0.txt': 'As from what I have been told, Flash has existed for around 20 years. It is theorized to soon be obsolete and will be replaced by HTML5 . Also, Apple adopted the best decision in not supporting Flash. \n\nNot supporting Flash- What does that really mean? Surely you can download Adobe Flash into your Mac OS. You need Flash to watch YouTube in addition to many other websites. I do not quite understand this. So I guess Macs can still play Flash but does not have it readily available in a newly purchased Mac. Which simply means that Apple does not fancy Flash, yet allow users the permission to install it. Is that right? \n\nAnother thing. Tablets and iPads from what I have heard do not support Flash and you CANNOT download Flash on it. Yet, how can an iPad view YouTube if it does not have Flash?\n',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-4.oot.txt': 'Out of the topic but I feel sorry for the IT students who have to learn Java. Appears it will become obsolete in the near future.\n',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-3.txt': "No OS vendor is going to explicitly deny installation of any piece of software. Take Java for example, OS X used to include it for a while and now they don't. Again, if you need it, there is an easy way of getting it..\n\nDevices like iPhone/iPod Touch/Android and so on support a set of video/audio formats..MP4, VP8, MP3, OGG, WAV and so on. So if you stream data in those formats the devices will handle it no problem.\n\nOn the other hand computers (based on the OS) can support some combination of those formats, and others have to be installed, thus the original use of Flash to play videos and now the use of HTML5 which doesn't require downloading anything more a recent browser..\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-7.txt': 'https://www.youtube.com/html5\n',
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-9.txt': "HTML5 isn't enabled by default on YouTube, so if you visit with Firefox you'll be prompted to download Flash, but if you'd enabled HTML5 it would've been fine. Additionally, if you'd used Chrome you'd never have been prompted at all..\n\nHehe, why, does it show?  Yes, I'm a technical software engineering manager, which basically means that I'm a SW manager who codes.. Interestingly, currently working on porting to Android to Windows..but that's a very long off-topic discussion..\n",
 'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-8.txt': "Oh, it does support HTML5. The reason for my confusion is I had to download Flash in order to watch YouTube via Firefox.\n\nIf it's not too personal, what do you do for a living? Are you in the IT industry? Just out of curiosity\n"}

In [10]:
extractor = CountVectorizerWrapper(input='content', stop_words='english')
detector = OOTDetector(extractor=extractor)

So, here I use the method that performed best, txt_comp_dist, together with the Euclidean distance, because the earlier experiments showed that Euclidean does best when the thread is small, around 10 posts. This thread is fairly small too, with only 15 posts.
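
To make the idea concrete, here is a rough sketch of what a "text complement distance" could look like. This is my assumption based on the method name only: each post is compared against the concatenation of all the other posts, using word-count vectors. It is not taken from the otdet source, and the helper names, the tiny stop-word list, and the toy posts are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical; not the list otdet uses)
STOP_WORDS = {'the', 'a', 'is', 'to', 'and', 'of', 'on', 'in', 'it', 'i', 'my'}

def count_vector(text):
    """Bag-of-words counts for one text, lowercased, stop words dropped."""
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def euclidean(u, v):
    """Euclidean distance between two sparse count vectors."""
    vocab = set(u) | set(v)
    # Counter returns 0 for words it has not seen
    return math.sqrt(sum((u[w] - v[w]) ** 2 for w in vocab))

def complement_distances(docs):
    """Distance from each document to the concatenation of all the others."""
    distances = []
    for i, doc in enumerate(docs):
        rest = ' '.join(docs[:i] + docs[i + 1:])
        distances.append(euclidean(count_vector(doc), count_vector(rest)))
    return distances

# Made-up posts: two about Flash video, one about something else
toy_posts = [
    'Flash video on the Mac',
    'Flash video and HTML5 video',
    'Java on Android',
]
toy_distances = complement_distances(toy_posts)
print(toy_distances)
```

In this toy thread the third post shares no words with the rest, so it ends up with the largest distance, which is the behaviour we hope for from an off-topic post.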


In [11]:
distances = detector.txt_comp_dist(documents, metric='euclidean')

In [12]:
distances


Out[12]:
array([ 34.62657939,  35.31288717,  42.46174749,  46.72258554,
        50.98038839,  50.90186637,  46.97871859,  50.22947342,
        47.57099957,  47.19110086,  50.10987927,  50.5074252 ,
        50.3487835 ,  39.0256326 ,  47.57099957])

In [13]:
ranked = sorted(zip(distances, files), reverse=True)

In [14]:
ranked


Out[14]:
[(50.980388386123543,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-4.oot.txt'),
 (50.90186637049765,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-5.oot.txt'),
 (50.507425196697561,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-11.oot.txt'),
 (50.34878350069642,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-12.oot.txt'),
 (50.229473419497438,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-7.txt'),
 (50.109879265470198,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-10.oot.txt'),
 (47.570999569065187,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-8.txt'),
 (47.570999569065187,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-14.txt'),
 (47.191100855987671,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-9.txt'),
 (46.97871858618538,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-6.txt'),
 (46.722585544894663,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-3.txt'),
 (42.461747491124292,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-2.txt'),
 (39.025632602175712,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-13.txt'),
 (35.312887166019152,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-1.txt'),
 (34.62657938636157,
  'datasets/macforums/311794__Whats-the-Deal-with-Apple-and-Flash/post-0.txt')]

So the results turn out to be quite good: the four highest-ranked posts are all ones I marked as off-topic. Now let's see what the top 5 posts actually look like...


In [15]:
for i, (_, filename) in enumerate(ranked[:5]):
    print('#{}'.format(i+1))
    print(d[filename])


#1
Out of the topic but I feel sorry for the IT students who have to learn Java. Appears it will become obsolete in the near future.

#2
Unlikely...



Android uses Java as its language..

#3
One of the things of doing the kind of work that I do is the need for distractions. If I were to just sit and code for 8 hours straight, my brain would leak out of my ears..

So every 15-20 mins or so, apart from taking a break from the screen, walking around, checking on my team and so on I get on here to see if there are any posts I can reply to and then get back to work..

#4
Most people go on Facebook, play online games or walk to another cubicle and talk to a friend to take a break at work. 
You go onto Mac-Forum [which is a good thing for us ]. But, that's like Mike Tyson hitting the punching bag after a title fight...to take a break. That's ferocious computer passion.

#5
https://www.youtube.com/html5

Now let's look at the precision and recall.


In [16]:
import numpy as np
import pandas as pd

In [17]:
data = np.zeros((2, 3))
# The set of off-topic posts does not depend on the cutoff, so compute it once
all_oot = [filename for _, filename in ranked if filename.endswith('oot.txt')]
for i, t in enumerate([1, 3, 5]):
    # Off-topic posts among the t highest-ranked ones
    top_oot = [filename for _, filename in ranked[:t] if filename.endswith('oot.txt')]
    precision = len(top_oot) / t
    recall = len(top_oot) / len(all_oot)
    data[:, i] = [precision, recall]

In [18]:
res = pd.DataFrame(data, index=['precision', 'recall'], columns=['top 1', 'top 3', 'top 5'])

In [19]:
res


Out[19]:
           top 1  top 3  top 5
precision    1.0    1.0    0.8
recall       0.2    0.6    0.8
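
As a sanity check, the same numbers can be reproduced without otdet, NumPy, or pandas. The sketch below hard-codes the off-topic labels as read off the ranking in Out[14] (True for files ending in `.oot.txt`); the function name `precision_recall_at_k` is my own and not part of any library.

```python
def precision_recall_at_k(ranked_labels, k):
    """ranked_labels: booleans ordered from largest distance to smallest,
    where True means the post was marked off-topic."""
    hits = sum(ranked_labels[:k])   # off-topic posts among the top k
    total = sum(ranked_labels)      # all off-topic posts in the thread
    return hits / k, hits / total

# Labels in ranked order: posts 4, 5, 11, 12 (off-topic), 7 (on-topic),
# 10 (off-topic), then the remaining nine on-topic posts
labels = [True, True, True, True, False, True] + [False] * 9

for k in (1, 3, 5):
    p, r = precision_recall_at_k(labels, k)
    print('top {}: precision={:.1f} recall={:.1f}'.format(k, p, r))
# top 1: precision=1.0 recall=0.2
# top 3: precision=1.0 recall=0.6
# top 5: precision=0.8 recall=0.8
```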
