First I need to start with an article dictionary:



In [1]:

    
from testdataextractor.testdataextractor.extractor import Extractor
ext = Extractor('../test_data/1957284403.ofs.gold.xml')
article = ext.extract(verbose=True)









    



50  comments parsed.
190  sentences parsed.
140  links parsed.

I then need to put the data into a format that I can query.

Maybe pandas is a way to do this? I want one row per sentence, looking like so:

sentence	comment	links
sentence id	comment id, if it is from a comment	list of sentences it is linked to



In [2]:

    
import pandas as pd

frame_art = pd.DataFrame.from_dict(article['sentences'], orient='index')
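To check what `from_dict(..., orient='index')` does with this kind of data, here is a minimal sketch on a made-up dictionary. I am assuming `article['sentences']` maps each sentence id to a dict with a `text` key and optional `comment` and `links` keys; the ids and texts here are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for article['sentences']; ids and texts are made up.
sentences = {
    's0': {'text': 'An article sentence .', 'links': ['s1', 's2']},
    's1': {'comment': 'c0', 'text': 'A comment sentence .', 'links': ['s0']},
    's2': {'comment': 'c0', 'text': 'Another comment sentence .'},
}

# orient='index' makes each outer key a row label and each inner key a column.
frame = pd.DataFrame.from_dict(sentences, orient='index')

# Keys missing from an inner dict show up as NaN in that row.
print(frame[['comment', 'text', 'links']])
```

The NaN entries matter later: a sentence without links holds a float NaN in `links`, not an empty list.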



In [3]:

    
frame_art









    Out[3]:






  
    
      
	comment	text	links
s0	NaN	BT and Vodafone among telecoms companies passi...	[s66, s101, s82, s162, s118, s114, s113, s160,...
s1	NaN	Some of the world 's leading telecoms firms , ...	NaN
s10	NaN	It gives top secret codenames for each firm , ...	[s57]
s100	c17	Best way to keep your information secure ?	[s28, s167, s49]
s101	c17	Use Vodafone .	[s0, s47, s0]
s102	c17	You wo n't be in contact with anybody .	[s47, s115]
s103	c18	[Snapshackle ] Agreed , 4G? ?	[s99]
s104	c18	3G would be nice , or even a fucking signal .	NaN
s105	c19	[ OurPlanet ] That really made me laugh in a s...	NaN
s106	c19	Thanks VaughnParadisThe best antidote to this ...	[s56]
s107	c19	Fear only feeds these imbeciles egos .	[s99]
s108	c20	[ Malkatrinho] that 's a lovely shade of green .	NaN
s109	c21	[jonbryce ] This is the former Cable & Wireles...	[s99, s99]
s11	NaN	The other firms include Global Crossing ( " Pi...	NaN
s110	c21	Lots of ISPs use them for their transatlantic ...	NaN
s111	c21	Those that do n't use one of the other ones li...	[s99, s165, s140]
s112	c21	Level 3 is probably the biggest .	NaN
s113	c22	[IronCurtain ] The Governments and the Corpora...	[s0, s16]
s114	c22	cui bonno ?	[s0]
s115	c22	you ?	[s102]
s116	c22	me ?	NaN
s117	c22	Freedom ?	NaN
s118	c22	Liberty ?	[s0]
s119	c22	not likely .	[s131, s133, s140, s127, s126, s129]
s12	NaN	The companies refused to comment on any specif...	NaN
s120	c23	[enfrance ] If apathy is the order of the day ...	[s52]
s121	c23	Nah , I despair .	NaN
s122	c23	And the likelyhood of anyone boycotting either...	[s52]
s123	c23	How about a petition of some sort ?	NaN
s124	c23	Its late and I 'm too tired and hot to think o...	[s189]
...	...	...	...
s72	c9	We need a revolution , and we need it now , to...	[s52, s66]
s73	c10	[dennis79] All of this is certainly not done t...	[s78, s37, s80]
s74	c10	Spying upon politics , economics , NGOs etc .	[s37]
s75	c10	I 'm also pretty confident that a number of in...	[s79]
s76	c10	States have always set up people , used people...	[s0]
s77	c10	All this is far easier when in the possession ...	[s0, s48]
s78	c11	[SpecialRX] I suspect you 're right .	[s9, s73]
s79	c11	Have a recommend on me .	[s75]
s8	NaN	The paper said it had seen a copy of an intern...	NaN
s80	c12	[ Malkatrinho] It has nothing to do with &quot...	[s73]
s81	c12	It 's about keeping an eye on the general popu...	NaN
s82	c13	[timetorememberagain ] A spokeswoman for Veriz...	[s0, s93, s2, s96]
s83	c13	Verizon also complies with the law in every co...	[s22, s2, s55, s3]
s84	c13	It sounds like they 're doing everything just ...	[s96, s2, s15, s0, s2, s22]
s85	c13	So the government demands access to my ( and e...	[s92, s47, s2, s15, s54]
s86	c13	Sheer doublespeak and deception of the highest...	[s55, s47]
s87	c13	Recall Parliament now !	[s3]
s88	c13	Demonstrate now !	NaN
s89	c14	[AGrumpyGit ] Democracy does indeed work that ...	NaN
s9	NaN	The document identified for the first time whi...	[s78, s176]
s90	c15	[timetorememberagain ] It should have been deb...	[s98]
s91	c15	Quite so .	[s27]
s92	c15	Those among us who might previously have argue...	[s85]
s93	c15	With luck we might even stage mass demonstrati...	[s82, s150]
s94	c15	Organise !	NaN
s95	c15	Resist !	NaN
s96	c16	[SamSSSS] I do n't see why the existence of wa...	[s84, s82, s15]
s97	c16	It should have been debated in parliament .	NaN
s98	c16	If the majority were in favour , the access wo...	[s90]
s99	c17	[VaughnParadis ] I 'd be concerned about Vodaf...	[s109, s107, s109, s103, s49, s111, s0, s52]

190 rows × 3 columns
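With the sentences in a frame like this, the queries I have in mind are mostly boolean masks. A small sketch on made-up rows with the same three columns (the ids and texts are hypothetical, not taken from the corpus):

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like frame_art: comment, text, links.
frame = pd.DataFrame(
    {
        'comment': [np.nan, 'c0', 'c0', 'c1'],
        'text': ['Article sentence .', 'First reply .', 'More reply .', 'Other reply .'],
        'links': [['s1', 's3'], ['s0'], np.nan, ['s0', 's0']],
    },
    index=['s0', 's1', 's2', 's3'],
)

# Article sentences have no comment id.
article_rows = frame[frame['comment'].isnull()]

# All sentences that belong to one comment.
c0_rows = frame[frame['comment'] == 'c0']

# Sentences that link to s0; rows without links hold NaN, so check the type first.
links_to_s0 = frame[frame['links'].apply(lambda l: isinstance(l, list) and 's0' in l)]

print(list(links_to_s0.index))
```

The `isinstance` check is there because `'s0' in NaN` would raise a TypeError on the rows that have no links.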

Excellent! Now get the sentences with the largest number of links.



In [45]:

    
def calc_row_len(row):
    if 'list' in str(type(row['links'])):
        return len(row['links']) 
    else:
        return 0
frame_num_links = frame_art.apply(
    (lambda row: calc_row_len(row)), axis=1
)
frame_with_lengths = pd.concat([frame_art, frame_num_links], axis=1)

top_sentences = frame_with_lengths.sort_values(by=0, axis=0, ascending=False)[:11]
top_sentences.columns = ['text', 'comment', 'links', 'link length']
print top_sentences.ix[:, ['links', 'link length']]

print '\nCHUNKED SENTENCES'
for s in top_sentences['text']:
    print s[:100]

#print "These are the most linked sentences in the corpus."
#print "Sentences\n", top_sentences['text']
#print "Links they have\n", top_sentences['links']
#print "Number of links they have Links they have\n", top_sentences[0]









    



                                                  links  link length
s0    [s66, s101, s82, s162, s118, s114, s113, s160,...           17
s175  [s181, s178, s20, s176, s182, s180, s20, s177,...           10
s52   [s152, s122, s150, s155, s149, s72, s146, s99,...           10
s99        [s109, s107, s109, s103, s49, s111, s0, s52]            8
s57                  [s49, s0, s65, s62, s62, s64, s10]            7
s66                   [s67, s0, s69, s0, s71, s72, s71]            7
s2                  [s84, s85, s70, s69, s82, s83, s84]            7
s84                         [s96, s2, s15, s0, s2, s22]            6
s119               [s131, s133, s140, s127, s126, s129]            6
s85                            [s92, s47, s2, s15, s54]            5
s48                         [s62, s70, s178, s127, s77]            5

CHUNKED SENTENCES






    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-f68cedd7224d> in <module>()
     15 print '\nCHUNKED SENTENCES'
     16 for s in top_sentences['text']:
---> 17     print s[:100]
     18 
     19 #print "These are the most linked sentences in the corpus."

TypeError: 'float' object has no attribute '__getitem__'
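The TypeError comes from the column relabelling, not from the data. `frame_art` has its columns in the order comment, text, links, so after the `concat` the columns are comment, text, links, 0. Assigning `['text', 'comment', 'links', 'link length']` swaps the first two labels, which means `top_sentences['text']` actually holds the comment ids. For an article sentence such as s0 that value is NaN, a float, and slicing it fails on the first iteration.

A sketch of a version that avoids the relabelling altogether, shown on made-up rows (the ids and texts are hypothetical). It adds the count as a named column, sorts by that name, and uses plain column selection instead of `.ix`, which newer pandas versions no longer provide. It is written with `print()` so it also runs on Python 3.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same column order as frame_art: comment, text, links.
frame_art = pd.DataFrame(
    {
        'comment': [np.nan, 'c0', 'c0'],
        'text': ['Article sentence .', 'A reply .', 'Another reply .'],
        'links': [['s1', 's2', 's1'], ['s0'], np.nan],
    },
    index=['s0', 's1', 's2'],
)


def count_links(links):
    # Sentences without links hold NaN (a float), not an empty list.
    return len(links) if isinstance(links, list) else 0


# Add the count under its own name so the existing labels stay attached
# to the right data.
frame_with_lengths = frame_art.copy()
frame_with_lengths['link length'] = frame_art['links'].apply(count_links)

top_sentences = frame_with_lengths.sort_values(by='link length', ascending=False)[:11]
print(top_sentences[['links', 'link length']])

print('\nCHUNKED SENTENCES')
for s in top_sentences['text']:
    print(s[:100])
```

Note that the count includes repeated links (s0 above links to s1 twice and still counts 3), which matches how `len(row['links'])` behaves in the original cell.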


