First, I need to start with an article dictionary.


In [1]:
from testdataextractor.testdataextractor.extractor import Extractor
ext = Extractor('../test_data/1957284403.ofs.gold.xml')
article = ext.extract(verbose=True)


50  comments parsed.
190  sentences parsed.
140  links parsed.
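
Judging from the DataFrame built below, article['sentences'] appears to map each sentence id to a dict of its fields. A sketch of the assumed shape (values abridged; the exact value types depend on the extractor):

article['sentences'] == {
    's0':   {'text': 'BT and Vodafone among telecoms companies...',
             'links': ['s66', 's101', ...]},               # article sentence, no comment id
    's100': {'comment': 'c17',
             'text': 'Best way to keep your information secure ?',
             'links': ['s28', 's167', 's49']},             # sentence from comment c17
    ...
}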

I then need to put the data in a format that I can query.

Maybe a way to do this is via pandas? I want it to look like so:

sentence       comment                                links
sentence id    comment id, if it is from a comment    list of sentences it's linked to

In [2]:
import pandas as pd

frame_art = pd.DataFrame.from_dict(article['sentences'], orient='index')

In [3]:
frame_art


Out[3]:
comment text links
s0 NaN BT and Vodafone among telecoms companies passi... [s66, s101, s82, s162, s118, s114, s113, s160,...
s1 NaN Some of the world 's leading telecoms firms , ... NaN
s10 NaN It gives top secret codenames for each firm , ... [s57]
s100 c17 Best way to keep your information secure ? [s28, s167, s49]
s101 c17 Use Vodafone . [s0, s47, s0]
s102 c17 You wo n't be in contact with anybody . [s47, s115]
s103 c18 [Snapshackle ] Agreed , 4G? ? [s99]
s104 c18 3G would be nice , or even a fucking signal . NaN
s105 c19 [ OurPlanet ] That really made me laugh in a s... NaN
s106 c19 Thanks VaughnParadisThe best antidote to this ... [s56]
s107 c19 Fear only feeds these imbeciles egos . [s99]
s108 c20 [ Malkatrinho] that 's a lovely shade of green . NaN
s109 c21 [jonbryce ] This is the former Cable & Wireles... [s99, s99]
s11 NaN The other firms include Global Crossing ( " Pi... NaN
s110 c21 Lots of ISPs use them for their transatlantic ... NaN
s111 c21 Those that do n't use one of the other ones li... [s99, s165, s140]
s112 c21 Level 3 is probably the biggest . NaN
s113 c22 [IronCurtain ] The Governments and the Corpora... [s0, s16]
s114 c22 cui bonno ? [s0]
s115 c22 you ? [s102]
s116 c22 me ? NaN
s117 c22 Freedom ? NaN
s118 c22 Liberty ? [s0]
s119 c22 not likely . [s131, s133, s140, s127, s126, s129]
s12 NaN The companies refused to comment on any specif... NaN
s120 c23 [enfrance ] If apathy is the order of the day ... [s52]
s121 c23 Nah , I despair . NaN
s122 c23 And the likelyhood of anyone boycotting either... [s52]
s123 c23 How about a petition of some sort ? NaN
s124 c23 Its late and I 'm too tired and hot to think o... [s189]
... ... ... ...
s72 c9 We need a revolution , and we need it now , to... [s52, s66]
s73 c10 [dennis79] All of this is certainly not done t... [s78, s37, s80]
s74 c10 Spying upon politics , economics , NGOs etc . [s37]
s75 c10 I 'm also pretty confident that a number of in... [s79]
s76 c10 States have always set up people , used people... [s0]
s77 c10 All this is far easier when in the possession ... [s0, s48]
s78 c11 [SpecialRX] I suspect you 're right . [s9, s73]
s79 c11 Have a recommend on me . [s75]
s8 NaN The paper said it had seen a copy of an intern... NaN
s80 c12 [ Malkatrinho] It has nothing to do with &quot... [s73]
s81 c12 It 's about keeping an eye on the general popu... NaN
s82 c13 [timetorememberagain ] A spokeswoman for Veriz... [s0, s93, s2, s96]
s83 c13 Verizon also complies with the law in every co... [s22, s2, s55, s3]
s84 c13 It sounds like they 're doing everything just ... [s96, s2, s15, s0, s2, s22]
s85 c13 So the government demands access to my ( and e... [s92, s47, s2, s15, s54]
s86 c13 Sheer doublespeak and deception of the highest... [s55, s47]
s87 c13 Recall Parliament now ! [s3]
s88 c13 Demonstrate now ! NaN
s89 c14 [AGrumpyGit ] Democracy does indeed work that ... NaN
s9 NaN The document identified for the first time whi... [s78, s176]
s90 c15 [timetorememberagain ] It should have been deb... [s98]
s91 c15 Quite so . [s27]
s92 c15 Those among us who might previously have argue... [s85]
s93 c15 With luck we might even stage mass demonstrati... [s82, s150]
s94 c15 Organise ! NaN
s95 c15 Resist ! NaN
s96 c16 [SamSSSS] I do n't see why the existence of wa... [s84, s82, s15]
s97 c16 It should have been debated in parliament . NaN
s98 c16 If the majority were in favour , the access wo... [s90]
s99 c17 [VaughnParadis ] I 'd be concerned about Vodaf... [s109, s107, s109, s103, s49, s111, s0, s52]

190 rows × 3 columns
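
With sentence ids as the index, spot-checks become one-liners; a quick sketch against the frame above:

frame_art.loc['s101', 'links']               # links out of sentence s101
frame_art[frame_art['comment'] == 'c17']     # every sentence belonging to comment c17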

Excellent! Now get the sentences with the most links.


In [45]:
def calc_row_len(row):
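    # 'links' is NaN (a float) when a sentence has no outgoing links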
    if isinstance(row['links'], list):
        return len(row['links'])
    else:
        return 0
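# Count each sentence's outgoing links, then keep the 11 most-linked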
frame_num_links = frame_art.apply(calc_row_len, axis=1)
frame_with_lengths = pd.concat([frame_art, frame_num_links], axis=1)

top_sentences = frame_with_lengths.sort_values(by=0, axis=0, ascending=False)[:11]
top_sentences.columns = ['text', 'comment', 'links', 'link length']
print top_sentences.ix[:, ['links', 'link length']]

print '\nCHUNKED SENTENCES'
for s in top_sentences['text']:
    print s[:100]

#print "These are the most linked sentences in the corpus."
#print "Sentences\n", top_sentences['text']
#print "Links they have\n", top_sentences['links']
#print "Number of links they have Links they have\n", top_sentences[0]


                                                  links  link length
s0    [s66, s101, s82, s162, s118, s114, s113, s160,...           17
s175  [s181, s178, s20, s176, s182, s180, s20, s177,...           10
s52   [s152, s122, s150, s155, s149, s72, s146, s99,...           10
s99        [s109, s107, s109, s103, s49, s111, s0, s52]            8
s57                  [s49, s0, s65, s62, s62, s64, s10]            7
s66                   [s67, s0, s69, s0, s71, s72, s71]            7
s2                  [s84, s85, s70, s69, s82, s83, s84]            7
s84                         [s96, s2, s15, s0, s2, s22]            6
s119               [s131, s133, s140, s127, s126, s129]            6
s85                            [s92, s47, s2, s15, s54]            5
s48                         [s62, s70, s178, s127, s77]            5

CHUNKED SENTENCES
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-f68cedd7224d> in <module>()
     15 print '\nCHUNKED SENTENCES'
     16 for s in top_sentences['text']:
---> 17     print s[:100]
     18 
     19 #print "These are the most linked sentences in the corpus."

TypeError: 'float' object has no attribute '__getitem__'
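
The TypeError is a column-labelling slip rather than a data problem: frame_with_lengths keeps frame_art's column order (comment, text, links) plus the new count column, so the rename above swaps 'text' and 'comment'. The mislabelled 'text' column actually holds comment ids, which are NaN (floats) for article sentences, and a NaN can't be sliced. A minimal fix, keeping the original column order:

top_sentences.columns = ['comment', 'text', 'links', 'link length']
for s in top_sentences['text']:
    print s[:100]

The counting helper could also collapse to a one-liner in the same spirit:

frame_num_links = frame_art['links'].apply(lambda l: len(l) if isinstance(l, list) else 0)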

In [ ]: