In [42]:
! gsutil ls gs://pyspark-workshop/so-posts
gs://pyspark-workshop/so-posts/Posts.xml-aa
gs://pyspark-workshop/so-posts/Posts.xml-ab
gs://pyspark-workshop/so-posts/Posts.xml-ac
gs://pyspark-workshop/so-posts/Posts.xml-ad
gs://pyspark-workshop/so-posts/Posts.xml-ae
gs://pyspark-workshop/so-posts/Posts.xml-af
gs://pyspark-workshop/so-posts/Posts.xml-ag
gs://pyspark-workshop/so-posts/Posts.xml-ah
gs://pyspark-workshop/so-posts/Posts.xml-ai
gs://pyspark-workshop/so-posts/Posts.xml-aj
gs://pyspark-workshop/so-posts/Posts.xml-ak
gs://pyspark-workshop/so-posts/Posts.xml-al
gs://pyspark-workshop/so-posts/Posts.xml-am
gs://pyspark-workshop/so-posts/Posts.xml-an
gs://pyspark-workshop/so-posts/Posts.xml-ao
gs://pyspark-workshop/so-posts/Posts.xml-ap
gs://pyspark-workshop/so-posts/Posts.xml-aq
gs://pyspark-workshop/so-posts/Posts.xml-ar
gs://pyspark-workshop/so-posts/Posts.xml-as
gs://pyspark-workshop/so-posts/Posts.xml-at
gs://pyspark-workshop/so-posts/Posts.xml-au
gs://pyspark-workshop/so-posts/Posts.xml-av
gs://pyspark-workshop/so-posts/Posts.xml-aw
gs://pyspark-workshop/so-posts/Posts.xml-ax
gs://pyspark-workshop/so-posts/Posts.xml-ay
gs://pyspark-workshop/so-posts/Posts.xml-az
gs://pyspark-workshop/so-posts/Posts.xml-ba
gs://pyspark-workshop/so-posts/Posts.xml-bb
gs://pyspark-workshop/so-posts/Posts.xml-bc
gs://pyspark-workshop/so-posts/Posts.xml-bd
gs://pyspark-workshop/so-posts/Posts.xml-be
gs://pyspark-workshop/so-posts/Posts.xml-bf
gs://pyspark-workshop/so-posts/Posts.xml-bg
gs://pyspark-workshop/so-posts/Posts.xml-bh
gs://pyspark-workshop/so-posts/Posts.xml-bi
gs://pyspark-workshop/so-posts/Posts.xml-bj
gs://pyspark-workshop/so-posts/Posts.xml-bk
gs://pyspark-workshop/so-posts/Posts.xml-bl
gs://pyspark-workshop/so-posts/Posts.xml-bm
gs://pyspark-workshop/so-posts/Posts.xml-bn
gs://pyspark-workshop/so-posts/Posts.xml-bo
gs://pyspark-workshop/so-posts/Posts.xml-bp
gs://pyspark-workshop/so-posts/Posts.xml-bq
gs://pyspark-workshop/so-posts/Posts.xml-br
gs://pyspark-workshop/so-posts/Posts.xml-bs
gs://pyspark-workshop/so-posts/Posts.xml-bt
gs://pyspark-workshop/so-posts/Posts.xml-bu
gs://pyspark-workshop/so-posts/Posts.xml-bv
gs://pyspark-workshop/so-posts/Posts.xml-bw
gs://pyspark-workshop/so-posts/Posts.xml-bx
gs://pyspark-workshop/so-posts/Posts.xml-by
gs://pyspark-workshop/so-posts/Posts.xml-bz
gs://pyspark-workshop/so-posts/Posts.xml-ca
gs://pyspark-workshop/so-posts/Posts.xml-cb
gs://pyspark-workshop/so-posts/Posts.xml-cc
gs://pyspark-workshop/so-posts/Posts.xml-cd
gs://pyspark-workshop/so-posts/Posts.xml-ce
gs://pyspark-workshop/so-posts/Posts.xml-cf
gs://pyspark-workshop/so-posts/Posts.xml-cg
gs://pyspark-workshop/so-posts/Posts.xml-ch
gs://pyspark-workshop/so-posts/Posts.xml-ci
gs://pyspark-workshop/so-posts/Posts.xml-cj
gs://pyspark-workshop/so-posts/Posts.xml-ck
gs://pyspark-workshop/so-posts/Posts.xml-cl
gs://pyspark-workshop/so-posts/Posts.xml-cm
gs://pyspark-workshop/so-posts/Posts.xml-cn
gs://pyspark-workshop/so-posts/Posts.xml-co
gs://pyspark-workshop/so-posts/Posts.xml-cp
gs://pyspark-workshop/so-posts/Posts.xml-cq
gs://pyspark-workshop/so-posts/Posts.xml-cr
gs://pyspark-workshop/so-posts/Posts.xml-cs
gs://pyspark-workshop/so-posts/Posts.xml-ct
gs://pyspark-workshop/so-posts/Posts.xml-cu
gs://pyspark-workshop/so-posts/Posts.xml-cv
gs://pyspark-workshop/so-posts/Posts.xml-cw
gs://pyspark-workshop/so-posts/Posts.xml-cx
gs://pyspark-workshop/so-posts/Posts.xml-cy
gs://pyspark-workshop/so-posts/Posts.xml-cz
gs://pyspark-workshop/so-posts/Posts.xml-da
gs://pyspark-workshop/so-posts/Posts.xml-db
gs://pyspark-workshop/so-posts/Posts.xml-dc
gs://pyspark-workshop/so-posts/Posts.xml-dd
gs://pyspark-workshop/so-posts/Posts.xml-de
gs://pyspark-workshop/so-posts/Posts.xml-df
gs://pyspark-workshop/so-posts/Posts.xml-dg
gs://pyspark-workshop/so-posts/Posts.xml-dh
gs://pyspark-workshop/so-posts/Posts.xml-di
gs://pyspark-workshop/so-posts/Posts.xml-dj
gs://pyspark-workshop/so-posts/Posts.xml-dk
gs://pyspark-workshop/so-posts/Posts.xml-dl
gs://pyspark-workshop/so-posts/Posts.xml-dm
gs://pyspark-workshop/so-posts/Posts.xml-dn
gs://pyspark-workshop/so-posts/Posts.xml-do
gs://pyspark-workshop/so-posts/Posts.xml-dp
gs://pyspark-workshop/so-posts/Posts.xml-dq
gs://pyspark-workshop/so-posts/Posts.xml-dr
gs://pyspark-workshop/so-posts/Posts.xml-ds
gs://pyspark-workshop/so-posts/Posts.xml-dt
gs://pyspark-workshop/so-posts/Posts.xml-du
gs://pyspark-workshop/so-posts/Posts.xml-dv
gs://pyspark-workshop/so-posts/Posts.xml-dw
gs://pyspark-workshop/so-posts/Posts.xml-dx
gs://pyspark-workshop/so-posts/Posts.xml-dy
gs://pyspark-workshop/so-posts/Posts.xml-dz
gs://pyspark-workshop/so-posts/Posts.xml-ea
gs://pyspark-workshop/so-posts/Posts.xml-eb
gs://pyspark-workshop/so-posts/Posts.xml-ec
gs://pyspark-workshop/so-posts/Posts.xml-ed
gs://pyspark-workshop/so-posts/Posts.xml-ee
gs://pyspark-workshop/so-posts/Posts.xml-ef
gs://pyspark-workshop/so-posts/Posts.xml-eg
gs://pyspark-workshop/so-posts/Posts.xml-eh
gs://pyspark-workshop/so-posts/Posts.xml-ei
gs://pyspark-workshop/so-posts/Posts.xml-ej
gs://pyspark-workshop/so-posts/Posts.xml-ek
gs://pyspark-workshop/so-posts/Posts.xml-el
gs://pyspark-workshop/so-posts/Posts.xml-em
gs://pyspark-workshop/so-posts/Posts.xml-en
gs://pyspark-workshop/so-posts/Posts.xml-eo
gs://pyspark-workshop/so-posts/Posts.xml-ep
gs://pyspark-workshop/so-posts/Posts.xml-eq
gs://pyspark-workshop/so-posts/Posts.xml-er
gs://pyspark-workshop/so-posts/Posts.xml-es
gs://pyspark-workshop/so-posts/Posts.xml-et
gs://pyspark-workshop/so-posts/Posts.xml-eu
gs://pyspark-workshop/so-posts/Posts.xml-ev
gs://pyspark-workshop/so-posts/Posts.xml-ew
gs://pyspark-workshop/so-posts/Posts.xml-ex
gs://pyspark-workshop/so-posts/Posts.xml-ey
gs://pyspark-workshop/so-posts/Posts.xml-ez
gs://pyspark-workshop/so-posts/Posts.xml-fa
gs://pyspark-workshop/so-posts/Posts.xml-fb
gs://pyspark-workshop/so-posts/Posts.xml-fc
gs://pyspark-workshop/so-posts/Posts.xml-fd
gs://pyspark-workshop/so-posts/Posts.xml-fe
gs://pyspark-workshop/so-posts/Posts.xml-ff
gs://pyspark-workshop/so-posts/Posts.xml-fg
gs://pyspark-workshop/so-posts/Posts.xml-fh
gs://pyspark-workshop/so-posts/Posts.xml-fi
gs://pyspark-workshop/so-posts/Posts.xml-fj
gs://pyspark-workshop/so-posts/Posts.xml-fk
gs://pyspark-workshop/so-posts/Posts.xml-fl
gs://pyspark-workshop/so-posts/Posts.xml-fm
gs://pyspark-workshop/so-posts/Posts.xml-fn
gs://pyspark-workshop/so-posts/Posts.xml-fo
gs://pyspark-workshop/so-posts/Posts.xml-fp
gs://pyspark-workshop/so-posts/Posts.xml-fq
gs://pyspark-workshop/so-posts/Posts.xml-fr
gs://pyspark-workshop/so-posts/Posts.xml-fs
gs://pyspark-workshop/so-posts/Posts.xml-ft
gs://pyspark-workshop/so-posts/Posts.xml-fu
gs://pyspark-workshop/so-posts/Posts.xml-fv
gs://pyspark-workshop/so-posts/Posts.xml-fw
gs://pyspark-workshop/so-posts/Posts.xml-fx
gs://pyspark-workshop/so-posts/Posts.xml-fy
gs://pyspark-workshop/so-posts/Posts.xml-fz
gs://pyspark-workshop/so-posts/Posts.xml-ga
gs://pyspark-workshop/so-posts/Posts.xml-gb
gs://pyspark-workshop/so-posts/Posts.xml-gc
gs://pyspark-workshop/so-posts/Posts.xml-gd
gs://pyspark-workshop/so-posts/Posts.xml-ge
gs://pyspark-workshop/so-posts/Posts.xml-gf
gs://pyspark-workshop/so-posts/Posts.xml-gg
gs://pyspark-workshop/so-posts/Posts.xml-gh
gs://pyspark-workshop/so-posts/Posts.xml-gi
gs://pyspark-workshop/so-posts/Posts.xml-gj
gs://pyspark-workshop/so-posts/Posts.xml-gk
gs://pyspark-workshop/so-posts/Posts.xml-gl
gs://pyspark-workshop/so-posts/Posts.xml-gm
gs://pyspark-workshop/so-posts/Posts.xml-gn
gs://pyspark-workshop/so-posts/Posts.xml-go
gs://pyspark-workshop/so-posts/Posts.xml-gp
gs://pyspark-workshop/so-posts/Posts.xml-gq
gs://pyspark-workshop/so-posts/Posts.xml-gr
gs://pyspark-workshop/so-posts/Posts.xml-gs
gs://pyspark-workshop/so-posts/Posts.xml-gt
gs://pyspark-workshop/so-posts/Posts.xml-gu
gs://pyspark-workshop/so-posts/Posts.xml-gv
gs://pyspark-workshop/so-posts/Posts.xml-gw
gs://pyspark-workshop/so-posts/Posts.xml-gx
gs://pyspark-workshop/so-posts/Posts.xml-gy
gs://pyspark-workshop/so-posts/Posts.xml-gz
gs://pyspark-workshop/so-posts/Posts.xml-ha
gs://pyspark-workshop/so-posts/Posts.xml-hb
gs://pyspark-workshop/so-posts/Posts.xml-hc
gs://pyspark-workshop/so-posts/Posts.xml-hd
gs://pyspark-workshop/so-posts/Posts.xml-he
gs://pyspark-workshop/so-posts/Posts.xml-hf
gs://pyspark-workshop/so-posts/Posts.xml-hg
gs://pyspark-workshop/so-posts/Posts.xml-hh
gs://pyspark-workshop/so-posts/Posts.xml-hi
gs://pyspark-workshop/so-posts/Posts.xml-hj
gs://pyspark-workshop/so-posts/Posts.xml-hk
gs://pyspark-workshop/so-posts/Posts.xml-hl
gs://pyspark-workshop/so-posts/Posts.xml-hm
gs://pyspark-workshop/so-posts/Posts.xml-hn
gs://pyspark-workshop/so-posts/Posts.xml-ho
gs://pyspark-workshop/so-posts/Posts.xml-hp
gs://pyspark-workshop/so-posts/Posts.xml-hq
gs://pyspark-workshop/so-posts/Posts.xml-hr
gs://pyspark-workshop/so-posts/Posts.xml-hs
gs://pyspark-workshop/so-posts/Posts.xml-ht
gs://pyspark-workshop/so-posts/Posts.xml-hu
gs://pyspark-workshop/so-posts/Posts.xml-hv
gs://pyspark-workshop/so-posts/Posts.xml-hw
gs://pyspark-workshop/so-posts/Posts.xml-hx
gs://pyspark-workshop/so-posts/Posts.xml-hy
gs://pyspark-workshop/so-posts/Posts.xml-hz
gs://pyspark-workshop/so-posts/Posts.xml-ia
gs://pyspark-workshop/so-posts/Posts.xml-ib
gs://pyspark-workshop/so-posts/Posts.xml-ic
gs://pyspark-workshop/so-posts/Posts.xml-id
gs://pyspark-workshop/so-posts/Posts.xml-ie
gs://pyspark-workshop/so-posts/Posts.xml-if
gs://pyspark-workshop/so-posts/Posts.xml-ig
gs://pyspark-workshop/so-posts/Posts.xml-ih
gs://pyspark-workshop/so-posts/Posts.xml-ii
gs://pyspark-workshop/so-posts/Posts.xml-ij
gs://pyspark-workshop/so-posts/Posts.xml-ik
gs://pyspark-workshop/so-posts/Posts.xml-il
gs://pyspark-workshop/so-posts/Posts.xml-im
gs://pyspark-workshop/so-posts/Posts.xml-in
gs://pyspark-workshop/so-posts/Posts.xml-io
gs://pyspark-workshop/so-posts/Posts.xml-ip
gs://pyspark-workshop/so-posts/Posts.xml-iq
gs://pyspark-workshop/so-posts/Posts.xml-ir
gs://pyspark-workshop/so-posts/Posts.xml-is
gs://pyspark-workshop/so-posts/Posts.xml-it
gs://pyspark-workshop/so-posts/Posts.xml-iu
gs://pyspark-workshop/so-posts/Posts.xml-iv
gs://pyspark-workshop/so-posts/Posts.xml-iw
gs://pyspark-workshop/so-posts/Posts.xml-ix
gs://pyspark-workshop/so-posts/Posts.xml-iy
gs://pyspark-workshop/so-posts/Posts.xml-iz
gs://pyspark-workshop/so-posts/Posts.xml-ja
gs://pyspark-workshop/so-posts/Posts.xml-jb
gs://pyspark-workshop/so-posts/Posts.xml-jc
gs://pyspark-workshop/so-posts/Posts.xml-jd
gs://pyspark-workshop/so-posts/Posts.xml-je
gs://pyspark-workshop/so-posts/Posts.xml-jf
gs://pyspark-workshop/so-posts/Posts.xml-jg
gs://pyspark-workshop/so-posts/Posts.xml-jh
gs://pyspark-workshop/so-posts/Posts.xml-ji
gs://pyspark-workshop/so-posts/Posts.xml-jj
gs://pyspark-workshop/so-posts/Posts.xml-jk
gs://pyspark-workshop/so-posts/Posts.xml-jl
gs://pyspark-workshop/so-posts/Posts.xml-jm
gs://pyspark-workshop/so-posts/Posts.xml-jn
gs://pyspark-workshop/so-posts/Posts.xml-jo
gs://pyspark-workshop/so-posts/Posts.xml-jp
gs://pyspark-workshop/so-posts/Posts.xml-jq
gs://pyspark-workshop/so-posts/Posts.xml-jr
gs://pyspark-workshop/so-posts/Posts.xml-js
gs://pyspark-workshop/so-posts/Posts.xml-jt
gs://pyspark-workshop/so-posts/Posts.xml-ju
gs://pyspark-workshop/so-posts/Posts.xml-jv
gs://pyspark-workshop/so-posts/Posts.xml-jw
gs://pyspark-workshop/so-posts/Posts.xml-jx
gs://pyspark-workshop/so-posts/Posts.xml-jy
gs://pyspark-workshop/so-posts/Posts.xml-jz
gs://pyspark-workshop/so-posts/Posts.xml-ka
gs://pyspark-workshop/so-posts/Posts.xml-kb
gs://pyspark-workshop/so-posts/Posts.xml-kc
gs://pyspark-workshop/so-posts/Posts.xml-kd
gs://pyspark-workshop/so-posts/Posts.xml-ke
gs://pyspark-workshop/so-posts/Posts.xml-kf
gs://pyspark-workshop/so-posts/Posts.xml-kg
gs://pyspark-workshop/so-posts/Posts.xml-kh
gs://pyspark-workshop/so-posts/Posts.xml-ki
gs://pyspark-workshop/so-posts/Posts.xml-kj
gs://pyspark-workshop/so-posts/Posts.xml-kk
gs://pyspark-workshop/so-posts/Posts.xml-kl
gs://pyspark-workshop/so-posts/Posts.xml-km
gs://pyspark-workshop/so-posts/Posts.xml-kn
gs://pyspark-workshop/so-posts/Posts.xml-ko
gs://pyspark-workshop/so-posts/Posts.xml-kp
gs://pyspark-workshop/so-posts/Posts.xml-kq
gs://pyspark-workshop/so-posts/Posts.xml-kr
gs://pyspark-workshop/so-posts/Posts.xml-ks
gs://pyspark-workshop/so-posts/Posts.xml-kt
gs://pyspark-workshop/so-posts/Posts.xml-ku
gs://pyspark-workshop/so-posts/Posts.xml-kv
gs://pyspark-workshop/so-posts/Posts.xml-kw
gs://pyspark-workshop/so-posts/Posts.xml-kx
gs://pyspark-workshop/so-posts/Posts.xml-ky
gs://pyspark-workshop/so-posts/Posts.xml-kz
gs://pyspark-workshop/so-posts/Posts.xml-la
gs://pyspark-workshop/so-posts/Posts.xml-lb
gs://pyspark-workshop/so-posts/Posts.xml-lc
gs://pyspark-workshop/so-posts/Posts.xml-ld
gs://pyspark-workshop/so-posts/Posts.xml-le
gs://pyspark-workshop/so-posts/Posts.xml-lf
gs://pyspark-workshop/so-posts/Posts.xml-lg
gs://pyspark-workshop/so-posts/Posts.xml-lh
gs://pyspark-workshop/so-posts/Posts.xml-li
gs://pyspark-workshop/so-posts/Posts.xml-lj
gs://pyspark-workshop/so-posts/Posts.xml-lk
gs://pyspark-workshop/so-posts/Posts.xml-ll
gs://pyspark-workshop/so-posts/Posts.xml-lm
gs://pyspark-workshop/so-posts/Posts.xml-ln
gs://pyspark-workshop/so-posts/Posts.xml-lo
gs://pyspark-workshop/so-posts/Posts.xml-lp
gs://pyspark-workshop/so-posts/Posts.xml-lq
gs://pyspark-workshop/so-posts/Posts.xml-lr
gs://pyspark-workshop/so-posts/Posts.xml-ls
gs://pyspark-workshop/so-posts/Posts.xml-lt
gs://pyspark-workshop/so-posts/Posts.xml-lu
gs://pyspark-workshop/so-posts/Posts.xml-lv
gs://pyspark-workshop/so-posts/Posts.xml-lw
gs://pyspark-workshop/so-posts/Posts.xml-lx
gs://pyspark-workshop/so-posts/Posts.xml-ly
gs://pyspark-workshop/so-posts/Posts.xml-lz
gs://pyspark-workshop/so-posts/Posts.xml-ma
gs://pyspark-workshop/so-posts/Posts.xml-mb
gs://pyspark-workshop/so-posts/Posts.xml-mc
gs://pyspark-workshop/so-posts/Posts.xml-md
gs://pyspark-workshop/so-posts/Posts.xml-me
gs://pyspark-workshop/so-posts/Posts.xml-mf
gs://pyspark-workshop/so-posts/Posts.xml-mg
gs://pyspark-workshop/so-posts/Posts.xml-mh
gs://pyspark-workshop/so-posts/Posts.xml-mi
gs://pyspark-workshop/so-posts/Posts.xml-mj
gs://pyspark-workshop/so-posts/Posts.xml-mk
In [2]:
lines = sc.textFile("gs://pyspark-workshop/so-posts/*")
In [33]:
# or a smaller piece of them
lines = sc.textFile("gs://pyspark-workshop/so-posts/Posts.xml-*a")
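The `Posts.xml-*a` pattern keeps only the chunks whose name ends in `a`. As a rough local illustration, the sketch below uses Python's `fnmatch` as a stand-in for the Hadoop-style glob matching that `sc.textFile` applies (an assumption made for demonstration; the two matchers are not identical in general), and `names` is just a sample of the listing above:

```python
from fnmatch import fnmatch

# A sample of the chunk names from the listing above (not the full set).
names = ["Posts.xml-aa", "Posts.xml-ab", "Posts.xml-az",
         "Posts.xml-ba", "Posts.xml-bb", "Posts.xml-ma", "Posts.xml-mk"]

# fnmatch is only a local stand-in for the glob matching done on the cluster.
selected = [n for n in names if fnmatch(n, "Posts.xml-*a")]
print(selected)
```

Applied to the full listing, the pattern selects `Posts.xml-aa`, `Posts.xml-ba`, and so on up to `Posts.xml-ma`: 13 of the 323 chunks.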
In [34]:
lines.take(5)
Out[34]:
['<?xml version="1.0" encoding="utf-8"?>',
'<posts>',
' <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="421" ViewCount="28370" Body="&lt;p&gt;I want to use a track-bar to change a form\'s opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to build it, I get this error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA; &lt;p&gt;Cannot implicitly convert type \'decimal\' to \'double\'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried making &lt;code&gt;trans&lt;/code&gt; a &lt;code&gt;double&lt;/code&gt;, but then the control doesn\'t work. This code has worked fine for me in VB.NET in the past. &lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="5455605" LastEditorDisplayName="Rich B" LastEditDate="2015-12-23T21:34:28.557" LastActivityDate="2016-07-17T20:33:18.217" Title="When setting a form\'s opacity should I use a decimal or double?" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;type-conversion&gt;&lt;decimal&gt;&lt;opacity&gt;" AnswerCount="13" CommentCount="3" FavoriteCount="33" CommunityOwnedDate="2012-10-31T16:42:47.213" />',
' <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="189" ViewCount="13715" Body="&lt;p&gt;I have an absolutely positioned &lt;code&gt;div&lt;/code&gt; containing several children, one of which is a relatively positioned &lt;code&gt;div&lt;/code&gt;. When I use a &lt;strong&gt;percentage-based width&lt;/strong&gt; on the child &lt;code&gt;div&lt;/code&gt;, it collapses to \'0\' width on &lt;a href=&quot;http://en.wikipedia.org/wiki/Internet_Explorer_7&quot;&gt;Internet&amp;nbsp;Explorer&amp;nbsp;7&lt;/a&gt;, but not on Firefox or Safari.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;If I use &lt;strong&gt;pixel width&lt;/strong&gt;, it works. If the parent is relatively positioned, the percentage width on the child works.&lt;/p&gt;&#xA;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Is there something I\'m missing here?&lt;/li&gt;&#xA;&lt;li&gt;Is there an easy fix for this besides the &lt;em&gt;pixel-based width&lt;/em&gt; on the&#xA;child?&lt;/li&gt;&#xA;&lt;li&gt;Is there an area of the CSS specification that covers this?&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;" OwnerUserId="9" LastEditorUserId="63550" LastEditorDisplayName="Rich B" LastEditDate="2016-03-19T06:05:48.487" LastActivityDate="2016-03-19T06:10:52.170" Title="Percentage width child element in absolutely positioned parent on Internet Explorer 7" Tags="&lt;html&gt;&lt;css&gt;&lt;css3&gt;&lt;internet-explorer-7&gt;" AnswerCount="5" CommentCount="0" FavoriteCount="7" />',
' <row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="305" Body="&lt;p&gt;An explicit cast to double isn\'t necessary.&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;double trans = (double)trackBar1.Value / 5000.0;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;Identifying the constant as &lt;code&gt;5000.0&lt;/code&gt; (or as &lt;code&gt;5000d&lt;/code&gt;) is sufficient:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;double trans = trackBar1.Value / 5000.0;&#xA;double trans = trackBar1.Value / 5000d;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;" OwnerUserId="9" LastEditorUserId="967315" LastEditDate="2012-10-14T11:50:16.703" LastActivityDate="2012-10-14T11:50:16.703" CommentCount="1" />']
In [5]:
rows = lines.filter(lambda x: x.lstrip().startswith('<row'))
In [9]:
import xml.etree.ElementTree as ET
In [10]:
parsed = lines.map(lambda x: x.lstrip()).filter(lambda x: x.startswith('<row')).map(lambda x: ET.fromstring(x))
In [12]:
from pprint import pprint
pprint(parsed.take(2))
[<Element 'row' at 0x7f58e69ed868>, <Element 'row' at 0x7f58e69ed818>]
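Each `<row ... />` line is a complete, self-closing XML element, which is why it can be handed to `ET.fromstring` on its own, while the XML declaration and the `<posts>` wrapper lines are filtered out first. The following is a minimal local sketch of that per-line step without Spark; the `line` value is a shortened, made-up row in the same shape as the dump, not a real post:

```python
import xml.etree.ElementTree as ET

# A shortened, made-up row in the same shape as the dump (not a real post).
line = ' <row Id="1" PostTypeId="1" Score="5" Tags="&lt;python&gt;&lt;spark&gt;" />'

# Same steps as the RDD pipeline: strip indentation, keep only <row> lines, parse.
stripped = line.lstrip()
is_row = stripped.startswith('<row')
elem = ET.fromstring(stripped)

print(is_row)
print(elem.tag)
print(elem.attrib['Tags'])   # entities such as &lt; are decoded by the parser
print(elem.attrib['Score'])  # attribute values are always strings
```

Because every attribute value comes back as a string, numeric fields such as `Score` need an explicit `int(...)` before any arithmetic.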
In [49]:
pprint(parsed.map(lambda x: x.attrib).take(3))
[{'AcceptedAnswerId': '7',
'AnswerCount': '13',
'Body': "<p>I want to use a track-bar to change a form's opacity.</p>\n"
'\n'
'<p>This is my code:</p>\n'
'\n'
'<pre><code>decimal trans = trackBar1.Value / 5000;\n'
'this.Opacity = trans;\n'
'</code></pre>\n'
'\n'
'<p>When I try to build it, I get this error:</p>\n'
'\n'
'<blockquote>\n'
" <p>Cannot implicitly convert type 'decimal' to 'double'.</p>\n"
'</blockquote>\n'
'\n'
'<p>I tried making <code>trans</code> a <code>double</code>, but '
"then the control doesn't work. This code has worked fine for me in "
'VB.NET in the past. </p>\n',
'CommentCount': '3',
'CommunityOwnedDate': '2012-10-31T16:42:47.213',
'CreationDate': '2008-07-31T21:42:52.667',
'FavoriteCount': '33',
'Id': '4',
'LastActivityDate': '2016-07-17T20:33:18.217',
'LastEditDate': '2015-12-23T21:34:28.557',
'LastEditorDisplayName': 'Rich B',
'LastEditorUserId': '5455605',
'OwnerUserId': '8',
'PostTypeId': '1',
'Score': '421',
'Tags': '<c#><winforms><type-conversion><decimal><opacity>',
'Title': "When setting a form's opacity should I use a decimal or double?",
'ViewCount': '28370'},
{'AcceptedAnswerId': '31',
'AnswerCount': '5',
'Body': '<p>I have an absolutely positioned <code>div</code> containing '
'several children, one of which is a relatively positioned '
'<code>div</code>. When I use a <strong>percentage-based '
"width</strong> on the child <code>div</code>, it collapses to '0' "
'width on <a '
'href="http://en.wikipedia.org/wiki/Internet_Explorer_7">Internet&nbsp;Explorer&nbsp;7</a>, '
'but not on Firefox or Safari.</p>\n'
'\n'
'<p>If I use <strong>pixel width</strong>, it works. If the parent '
'is relatively positioned, the percentage width on the child '
'works.</p>\n'
'\n'
'<ol>\n'
"<li>Is there something I'm missing here?</li>\n"
'<li>Is there an easy fix for this besides the <em>pixel-based '
'width</em> on the\n'
'child?</li>\n'
'<li>Is there an area of the CSS specification that covers '
'this?</li>\n'
'</ol>\n',
'CommentCount': '0',
'CreationDate': '2008-07-31T22:08:08.620',
'FavoriteCount': '7',
'Id': '6',
'LastActivityDate': '2016-03-19T06:10:52.170',
'LastEditDate': '2016-03-19T06:05:48.487',
'LastEditorDisplayName': 'Rich B',
'LastEditorUserId': '63550',
'OwnerUserId': '9',
'PostTypeId': '1',
'Score': '189',
'Tags': '<html><css><css3><internet-explorer-7>',
'Title': 'Percentage width child element in absolutely positioned parent on '
'Internet Explorer 7',
'ViewCount': '13715'},
{'Body': "<p>An explicit cast to double isn't necessary.</p>\n"
'\n'
'<pre><code>double trans = (double)trackBar1.Value / 5000.0;\n'
'</code></pre>\n'
'\n'
'<p>Identifying the constant as <code>5000.0</code> (or as '
'<code>5000d</code>) is sufficient:</p>\n'
'\n'
'<pre><code>double trans = trackBar1.Value / 5000.0;\n'
'double trans = trackBar1.Value / 5000d;\n'
'</code></pre>\n',
'CommentCount': '1',
'CreationDate': '2008-07-31T22:17:57.883',
'Id': '7',
'LastActivityDate': '2012-10-14T11:50:16.703',
'LastEditDate': '2012-10-14T11:50:16.703',
'LastEditorUserId': '967315',
'OwnerUserId': '9',
'ParentId': '4',
'PostTypeId': '2',
'Score': '305'}]
In [36]:
def parse_tags(x):
    return x[1:-1].split("><")

tags = parsed.map(lambda x: parse_tags(x.attrib['Tags']) if 'Tags' in x.attrib else [])
tags.take(5)
Out[36]:
[['c#', 'winforms', 'type-conversion', 'decimal', 'opacity'],
['html', 'css', 'css3', 'internet-explorer-7'],
[],
['c#', '.net', 'datetime'],
['c#', 'datetime', 'datediff', 'relative-time-span']]
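`parse_tags` relies on the `Tags` attribute always having the shape `<tag1><tag2>...`: it drops the outer `<` and `>` and splits on the `><` between tags. A small local check of that behaviour, including one edge case worth knowing about (the empty-string input is a hypothetical case, not something observed in the output above):

```python
def parse_tags(x):
    # Drop the leading '<' and trailing '>', then split on the '><' between tags.
    return x[1:-1].split("><")

several = parse_tags("<c#><winforms><type-conversion>")
single = parse_tags("<python>")
empty = parse_tags("")  # hypothetical: an empty Tags attribute

print(several)
print(single)
print(empty)
```

An empty string yields `['']` rather than `[]`, so if empty `Tags` attributes could occur, a guard such as `if x.attrib.get('Tags')` would be a safer condition than `'Tags' in x.attrib`.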
In [40]:
counts = tags.flatMap(lambda x: x).groupBy(lambda x: x).map(lambda x: (x[0], len(x[1])))
In [41]:
counts.sortBy(lambda x: x[1], ascending=False).take(10)
Out[41]:
[('javascript', 1206322),
('java', 1128803),
('c#', 997465),
('php', 969707),
('android', 887010),
('jquery', 770009),
('python', 624016),
('html', 572556),
('c++', 467183),
('ios', 457172)]
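The counting step flattens the per-post tag lists into one stream of tags, groups identical tags together, and takes the size of each group. The same logic can be sketched locally with `collections.Counter`; the `tag_lists` value below reuses the five lists from `Out[36]` purely as sample input:

```python
from collections import Counter
from itertools import chain

# The five tag lists shown in Out[36], used here as sample input.
tag_lists = [
    ['c#', 'winforms', 'type-conversion', 'decimal', 'opacity'],
    ['html', 'css', 'css3', 'internet-explorer-7'],
    [],
    ['c#', '.net', 'datetime'],
    ['c#', 'datetime', 'datediff', 'relative-time-span'],
]

# chain.from_iterable plays the role of flatMap; Counter does the group-and-count.
tag_counts = Counter(chain.from_iterable(tag_lists))
top_two = tag_counts.most_common(2)
print(top_two)
```

On an RDD, an alternative to `groupBy` followed by `len` is `tags.flatMap(lambda x: x).map(lambda t: (t, 1)).reduceByKey(lambda a, b: a + b)`. It should give the same counts, and `reduceByKey` sums partial counts within each partition before the shuffle instead of moving every single occurrence of a tag across the network.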
Puzzles:
In [50]:
# if you hate xml (you do), then save it as json on hdfs!
import json
parsed.map(lambda x: json.dumps(x.attrib)).saveAsTextFile("posts.jsons")
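Since `x.attrib` is a plain dictionary of strings, `json.dumps` turns each post into a single line of JSON, which gives one record per line in the saved text files. A minimal local round trip, using a made-up row rather than real data:

```python
import json
import xml.etree.ElementTree as ET

# Made-up row for illustration only.
elem = ET.fromstring('<row Id="1" PostTypeId="2" ParentId="4" Score="3" />')

json_line = json.dumps(elem.attrib)  # one JSON object per output line
record = json.loads(json_line)       # what a later job would read back

print(json_line)
print(record["Score"], type(record["Score"]).__name__)
```

Note that `saveAsTextFile("posts.jsons")` creates a directory of `part-*` files rather than a single file, and it fails if that path already exists. Reading the data back would be a matter of `sc.textFile("posts.jsons").map(json.loads)`, with all values still strings.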
Content source: szymonm/pyspark-dataproc-workshop