Interview Paper v1.1

**Max Wu**

*2017-02-24*

  • [Max]: Updated the bs4 call to remove a warning after the environment migration.

*2015-08-22*

  • [Max]: Changed TOP_NUM to 2000 as a trial.

In [1]:
import os, json
import re, string
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import nltk
from bs4 import BeautifulSoup

TOP_NUM = 2000  # number of vocabulary words used as features

In [3]:
# Set up the training set
path_g = 'training-data/group-ppl-0822-orig.json'
with open(path_g, 'r') as f:
    rec_g = [json.loads(line) for line in f]
'Total number of g_json:' + str(rec_g[0]['total'])

tks_g = [[issue['fields']['project']['key'], issue['fields']['summary'], issue['fields']['description']] for issue in rec_g[0]['issues']]

frame_g = DataFrame(tks_g, columns=['prj', 'sum', 'desc'])

print(frame_g.index)
frame_g.head()


RangeIndex(start=0, stop=220, step=1)
Out[3]:
   prj    sum                                                desc
0  TONTC  (CIG-401)mcast traffic can not pass ONT when c...  Configure mcast service on ONT bridge port, VL...
1  TONTC  (CIG-397)L3 mcast configuration is missed on C...  configure mcast service on ONT L3 port.\r\nOMC...
2  TONTC  (CIG-396)DS L2 unicast traffic can not pass ONT    Configure 7 data service and 1 mcast service o...
3  TONTC  (CIG-395) product code for R4.1.52.002 is wrong    We received R4.1.52.002 on 14th August. \r\nAf...
4  TONTC  (CIG-389) CIG does not report '1 BOOT' after r...  Calling Reboot RPC via TR69 to reboot CIG.\r\n...
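
The cell above reads a search-result record with a `total` count and an `issues` list, then pulls three fields out of each issue. As a rough sketch of that extraction on an in-memory record, the snippet below uses only the standard library. The sample record and the `extract_rows` helper are hypothetical, and the handling of a null description is an assumption about the export rather than something the notebook shows.

```python
import json

# Hypothetical sample, shaped like the fields the notebook reads; not real ticket data.
sample_line = json.dumps({
    "total": 2,
    "issues": [
        {"fields": {"project": {"key": "AAA"},
                    "summary": "first ticket",
                    "description": "Configure service on port."}},
        {"fields": {"project": {"key": "BBB"},
                    "summary": "second ticket",
                    "description": None}},  # assumed: a description may be null
    ],
})


def extract_rows(raw_line):
    """Return [project key, summary, description] rows from one search-result line."""
    record = json.loads(raw_line)
    return [
        [issue["fields"]["project"]["key"],
         issue["fields"]["summary"],
         issue["fields"]["description"] or ""]  # map a null description to ""
        for issue in record["issues"]
    ]


rows = extract_rows(sample_line)
print(len(rows), rows[1])
```

The resulting rows could then be passed to `DataFrame(rows, columns=['prj', 'sum', 'desc'])` in the same way as `tks_g`.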

In [6]:
# Utilities
def normalized_words(article_text):
    """Strip markup and punctuation, lowercase, and drop stop words and pure digits."""
    # Requires the NLTK stop-word corpus: nltk.download('stopwords')
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = []
    oneline = article_text.replace('\n', ' ')
    soup = BeautifulSoup(oneline.strip(), "lxml")
    cleaned = soup.get_text()
    for token in cleaned.split():
        word = regex.sub('', token).strip().lower()
        # Skip tokens that were only punctuation, stop words, and pure numbers
        if not word or word in stop_words or word.isdigit():
            continue
        words.append(word)
    return words

tokens = normalized_words(frame_g['desc'][2])
#tokens

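# Return a list of the distinct words that appear across all items
def collect_all_words(items):
    all_words = []
    for item in items:
        for w in item:
            all_words.append(w)
    return list(set(all_words))

def identify_top_words(all_words):
    freq_dist = nltk.FreqDist(w.lower() for w in all_words)
    # all_words is already de-duplicated, so every count is 1 and this keeps
    # up to TOP_NUM distinct words in key order, not the most frequent ones.
    return list(freq_dist.keys())[:TOP_NUM]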

# Select the top words
desc_list = [normalized_words(desc) for desc in frame_g['desc']]
top_words = identify_top_words(collect_all_words(desc_list))

# At most TOP_NUM words
len(top_words)

# Map a description to a dict of boolean features, one "w_<word>" entry per top word
def features(desc, top_words):
    word_s = set(normalized_words(desc))
    feats = {}
    for wd in top_words:
        feats["w_%s" % wd] = (wd in word_s)
    return feats

len(features(frame_g['desc'][1], top_words))


Out[6]:
2000
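
Because `collect_all_words` returns a de-duplicated list, the frequency distribution in `identify_top_words` sees each word exactly once. If the intent was to keep the most frequent words, one possible variant counts tokens before de-duplicating. The sketch below is not the notebook's method: it swaps in `collections.Counter` from the standard library, and the `top_words_by_frequency` name and the sample token lists are hypothetical.

```python
from collections import Counter


def top_words_by_frequency(token_lists, top_num):
    """Return up to top_num words, most frequent first, counted over all token lists."""
    counts = Counter(w.lower() for tokens in token_lists for w in tokens)
    return [word for word, _ in counts.most_common(top_num)]


# Made-up token lists for illustration only.
docs = [["ont", "reboot", "ont"], ["ont", "traffic"], ["traffic", "lost"]]
print(top_words_by_frequency(docs, 2))  # ['ont', 'traffic']
```

In the notebook this would be called as `top_words_by_frequency(desc_list, TOP_NUM)`. Doing so would change the selected vocabulary, so the accuracy and feature listings recorded in later cells would no longer apply.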

In [7]:
# Prepare the test set
path_m = 'test-data/Binbin-0722-orig.json'
with open(path_m, 'r') as f:
    rec_m = [json.loads(line) for line in f]
'Total number of m_json:' + str(rec_m[0]['total'])

tks_m = [[issue['fields']['project']['key'], issue['fields']['summary'], issue['fields']['description']] for issue in rec_m[0]['issues']]

frame_m = DataFrame(tks_m)

print(frame_m.index)
frame_m.head()


RangeIndex(start=0, stop=117, step=1)
Out[7]:
   0      1                                                  2
0  TONTT  The traffic will be lost even the pbit match t...  h5. Summary:\r\nThe traffic will be lost even ...
1  TONTT  MVR: The ME information in GIA include 3 MVR v...  h5. Summary:\r\nMVR: The ME information in GIA...
2  TONTT  ONT will transfer all the IGMP report packet t...  h5. Summary:\r\n ONT will transfer all th...
3  TONTC  (CIG-370)IPv6 echo request attack protect didn...  h5. Summary:\r\nONT version:R4.1.51.28\r\n\r\n...
4  TONTC  (CIG-369)The internet LED is always off.           h5. Summary:\r\nThe internet LED is always off...
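
A Naive Bayes classifier can only predict project keys that occur in its training data, so it can help to compare the label sets of the two frames before training. The sketch below assumes the labels are available as plain lists of project keys (for example from `frame_g['prj']` and `frame_m[0]`). The `label_report` helper and the sample labels are hypothetical, and whether any test key is actually missing from the training set cannot be read off the outputs shown here.

```python
from collections import Counter


def label_report(train_labels, test_labels):
    """Count labels per set and list test labels that never occur in training."""
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    unseen = sorted(set(test_counts) - set(train_counts))
    return train_counts, test_counts, unseen


# Made-up labels for illustration only.
train_labels = ["AAA", "AAA", "BBB", "CCC"]
test_labels = ["AAA", "BBB", "DDD", "DDD"]

train_counts, test_counts, unseen = label_report(train_labels, test_labels)
print(unseen)  # labels the classifier could never predict
```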

In [8]:
# Train and test
def get_training_set(tickets):
    """Build (feature dict, project key) pairs; used for both the training and test sets."""
    return [(features(ticket[2], top_words), ticket[0]) for ticket in tickets]

training_set = get_training_set(tks_g)
test_set = get_training_set(tks_m)

classifier = nltk.NaiveBayesClassifier.train(training_set)

print('train on %d instances, test on %d instances' % (len(training_set), len(test_set)))
classifier.show_most_informative_features(20)

# Test the classifier
print("Accuracy = %f" % nltk.classify.accuracy(classifier, test_set))

catalogs = [classifier.classify(features(frame_m[2][idx], top_words)) for idx in frame_m.index]
frame_m['guess'] = catalogs
frame_m['check'] = frame_m[0] == frame_m['guess']
frame_m


train on 220 instances, test on 117 instances
Most Informative Features
                  w_summ = True              PXA : PREM   =     61.3 : 1.0
                w_ticket = True              PXA : PREM   =     61.3 : 1.0
                w_video1 = True              PXA : PREM   =     61.3 : 1.0
                 w_occur = True              PXA : PREM   =     61.3 : 1.0
                w_entity = True              PXA : PREM   =     61.3 : 1.0
                w_58to58 = True              PXA : PREM   =     61.3 : 1.0
               w_managed = True              PXA : PREM   =     61.3 : 1.0
         w_preconditions = False           TONTC : PREM   =     58.5 : 1.0
                 w_setup = False           TONTC : PREM   =     53.0 : 1.0
              w_returned = True            SXACC : PREM   =     46.0 : 1.0
         w_configuration = False           SXACC : PREM   =     46.0 : 1.0
            w_mvrprofile = True              PXA : PREM   =     36.8 : 1.0
               w_message = True              PXA : PREM   =     36.8 : 1.0
                   w_mvr = True              PXA : PREM   =     36.8 : 1.0
          w_ec4f82135af9 = True              PXA : PREM   =     36.8 : 1.0
                  w_mask = True              PXA : PREM   =     36.8 : 1.0
                    w_t0 = True               EX : PREM   =     35.8 : 1.0
                  w_cigg = True               EX : PREM   =     35.8 : 1.0
                  w_ersn = True               EX : PREM   =     35.8 : 1.0
               w_13206mb = True               EX : PREM   =     35.8 : 1.0
Accuracy = 0.760684
Out[8]:
0 1 2 guess check
0 TONTT The traffic will be lost even the pbit match t... h5. Summary:\r\nThe traffic will be lost even ... PREM False
1 TONTT MVR: The ME information in GIA include 3 MVR v... h5. Summary:\r\nMVR: The ME information in GIA... PREM False
2 TONTT ONT will transfer all the IGMP report packet t... h5. Summary:\r\n ONT will transfer all th... PREM False
3 TONTC (CIG-370)IPv6 echo request attack protect didn... h5. Summary:\r\nONT version:R4.1.51.28\r\n\r\n... TONTC True
4 TONTC (CIG-369)The internet LED is always off. h5. Summary:\r\nThe internet LED is always off... TONTC True
5 TONTC (CIG-363)CIG missed the requirement R7.070, so... CIG missed the requirement R7.070, some enhanc... PREM False
6 TONTC (CIG-357)IPv6: ONT don's support the IPv6 echo... h5. Summary:\r\nIPv6: ONT don's support the IP... PREM False
7 TONTC (CIG-356)The default WAN address mode is SLAAC... h5. Summary:\r\nThe default WAN address mode i... PREM False
8 TONTC (CIG-350)ONT has the default value for the LAN... h5. Summary:\r\nONT has the default value for ... PREM False
9 TONTC (CIG-349)Login GUI failed after disable the du... h5. Summary:\r\nLogin GUI failed after disable... TONTC True
10 TONTC (CIG-348)DHCPv6 solicit packet has two elapsed... h5. Summary:\r\nDHCPv6 solicit packet has two ... PREM False
11 TONTC (CIG-347)ONT stop to traceroute the unreachabl... h5. Summary:\r\nONT stop to traceroute the unr... TONTC True
12 TONTC (CIG-346)The first hop of ipv6 traceroute is n... h5. Summary:\r\nThe first hop of ipv6 tracerou... TONTC True
13 TONTC (CIG-344) IPv6 traffic lost in upstream h5. Summary:\r\nIPv6 traffic lost in upstream\... TONTC True
14 TONTC (CIG-343) Send the upstream ipv6 traffic will ... h5. Summary:\r\nSend the upstream ipv6 traffic... TONTC True
15 TONTC (CIG-340) ONT GUI can't show the link-local ad... h5. Summary:\r\nONT GUI can't show the link-lo... PREM False
16 TONTC (CIG-337)ONT will send the dhcpv6 advertise an... h5. Summary:\r\nONT will send the dhcpv6 adver... PREM False
17 TONTC (CIG-336) ONT crash due to application dead wh... h5. Summary:\r\nONT crash due to application d... TONTC True
18 TONTC (CIG-339)ONT WAN can generate two ipv6 global ... h5. Summary:\r\nONT WAN can generate two ipv6 ... PREM False
19 TONTC (CIG-338)ONT can't handle the dhcpv6 solict pa... h5. Summary:\r\nONT can't handle the dhcpv6 so... PREM False
20 TONTC (CIG-309)ONT UVCM state is not maintained afte... h5. Summary:\r\nONT reboot always reboot twice... PREM False
21 TONTC (CIG-272)ONT rolling reset after upgrade to R4... h5. Summary:\r\nONT rolling reset after upgrad... PREM False
22 TONTC The L3 DS traffic will stop for almost 2mins ... h5. Summary:\r\nThe DS traffic will stop for a... PREM False
23 TONTC DNS-SRV:ONT can't switchover after the invite ... h5. Summary:\r\nDNS-SRV:ONT can't switchover a... PREM False
24 TONTC Access the Web GUI through WAN successfully ev... h5. Summary:\r\nAccess the Web GUI through WAN... PREM False
25 TONTC ONT can't change the pbit of inner vlan h5. Summary:\r\nONT can't change the pbit of i... PREM False
26 TONTC VOIP:Just Provision sip service to one pots po... h5. Summary:\r\nVOIP:Just Provision sip servic... PREM False
27 TONTC VOIP:Provision sip service to two pots port, i... h5. Summary:\r\nVOIP:Provision sip service to ... PREM False
28 TONTC Voice:ONT reboot twice when load the Voice XML... h5. Summary:\r\nONT reboot twice when load the... PREM False
29 TONTC (CIG-180)IPTV: The L3 IPTV didn't work h5. Summary:\r\nIPTV: The L3 IPTV didn't work ... PREM False
... ... ... ... ... ...
87 PREM [844E-1] [RG-WAN] ONT doesn't flood the video ... h5. Summary:\r\nONT doesn't flood the video tr... PREM True
88 PREM [844E-1] [RG-WAN] There is no IGMP event log f... h5. Summary:\r\nThere is no IGMP event log fro... PREM True
89 PREM [844E-1] [RG-WAN] There is no inform when the ... h5. Summary:\r\nThere is no inform when the va... PREM True
90 PREM [844E-1] [RG-WAN] The timer of IGMP snooping e... h5. Summary:\r\nThe timer of IGMP snooping ent... PREM True
91 PREM [844E-1][RG-WAN] IGMPv1 didn't work h5. Summary:\r\nIGMPv1 didn't work\r\n\r\n h5.... PREM True
92 PREM [844E-1][RG-WAN] IGMPV1 mode still has the fas... h5. Summary:\r\n IGMPV1 mode still has the fas... PREM True
93 PREM [844E-1] [TR069] TR069 status can't refresh af... h5. Summary:\r\nTR069 status can't refresh aft... PREM True
94 PREM [844E-1] [GUI] The information of WAN-protocol... h5. Summary:\r\nThe information of WAN-protoco... PREM True
95 PREM [844E-1] [RG-WAN] ONT restore default , found ... h5. Summary:\r\nONT restore default , found th... PREM True
96 PREM [844E-1] [TR069] TR069 can't access the ACS se... h5. Summary:\r\nTR069 can't access the ACS ser... PREM True
97 PREM RG service can't get the ip address on 11.0.20... h5. Summary:\r\nRG service can't active on 11.... PREM True
98 PREM [844E-1] [Wireless] ONT crash due to Chttpd cr... h5. Summary:\r\n ONT crash due to wlanmgr miss... PREM True
99 PREM [844E-1] [GUI] ONT output some unnecessary inf... h5. Summary:\r\nONT output some unnecessary l... PREM True
100 PREM [844E-1] [GUI-Support] MDM is locked by chttpd... h5. Summary:\r\nSend the IGMP report packets t... PREM True
101 PREM [844E-1] [GUI-Status] The DNS information is n... h5. Summary:\r\nI just set the primary DNS ser... PREM True
102 PREM [844E-1] [RG-WAN] ONT crash due to chttpd process h5. Summary:\r\n ONT crash due to chttpd proc... PREM True
103 PREM [844E-1] [RG-WAN] The counters of downstream P... 1、Issue summary:\r\nThe counters of downstream... PREM True
104 PREM [844E-1] [GUI-Status] There is no information ... 1、Issue summary:\r\n There is no informati... PREM True
105 PREM [844E-1] [RG-WAN] Send the L3 data traffic wil... 1、Issue summary:\r\nSend the L3 data traffic w... PREM True
106 PREM ONT activate failed on 10.8.0.5 against the E7... h5. Summary:\r\nONT activate failed on 10.8.0.... PREM True
107 PREM 2.4G WPS failed h5. Summary:\r\n2.4G WPS failed\r\n h5. On sce... PREM True
108 PREM The unit of Wi-Fi speed are not correct in GU... h5. Summary:\r\nThe unit of Wi-Fi speed are n... PREM True
109 PREM The description of GUI tips is not correct h5. Summary:\r\nThe description of GUI tips is... PREM True
110 PREM ONT didn't forward the IGMP report packets to E7 h5. Summary:\r\nONT didn't forward the IGMP re... PREM True
111 PREM The WAN service can be activated only when you... h5. Summary:\r\nThe WAN service can be activat... PREM True
112 PREM Send the L3 Data traffic will make the CPU in ... h5. Summary:\r\nSend the L3 Data traffic will ... PREM True
113 PREM 【GUI】:can't login GUI after sending the L3 dat... h5. Summary:\r\ncan't login GUI after sending ... PREM True
114 PREM 720GX with build 10.4.0.52 connect to E7 can d... h5. Summary:\r\n\r\nh5. On Scene Investigator:... PREM True
115 PREM 744GE with manufacture build 15.2.4.12 fail to... h5. Summary:\r\n 744GE with manufacture bui... PREM True
116 CONN [CCFP][844E] CC didn't update the X_000631_IGM... CCFP didn't update the .X_000631_IGMP.CacheTab... PREM False

117 rows × 5 columns
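
The single accuracy figure hides how the errors are distributed: in the rows shown above, every PREM ticket is guessed correctly, while most TONTC tickets are also guessed as PREM. A per-project breakdown makes this visible. The sketch below is one way to compute it with the standard library; the `accuracy_by_label` helper and the sample labels are hypothetical.

```python
from collections import defaultdict


def accuracy_by_label(true_labels, guesses):
    """Return {label: (correct, total, accuracy)} grouped by the true label."""
    tally = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for truth, guess in zip(true_labels, guesses):
        tally[truth][1] += 1
        if truth == guess:
            tally[truth][0] += 1
    return {label: (c, n, c / float(n)) for label, (c, n) in tally.items()}


# Made-up labels for illustration only.
true_labels = ["AAA", "AAA", "AAA", "BBB", "BBB"]
guesses = ["AAA", "BBB", "BBB", "BBB", "BBB"]

result = accuracy_by_label(true_labels, guesses)
for label, (correct, total, acc) in sorted(result.items()):
    print("%s: %d/%d = %.2f" % (label, correct, total, acc))
```

With the notebook's frame, the call would be `accuracy_by_label(frame_m[0], frame_m['guess'])`.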

