Let's read the data


In [42]:
! gsutil ls gs://pyspark-workshop/so-posts


gs://pyspark-workshop/so-posts/Posts.xml-aa
gs://pyspark-workshop/so-posts/Posts.xml-ab
gs://pyspark-workshop/so-posts/Posts.xml-ac
gs://pyspark-workshop/so-posts/Posts.xml-ad
gs://pyspark-workshop/so-posts/Posts.xml-ae
gs://pyspark-workshop/so-posts/Posts.xml-af
gs://pyspark-workshop/so-posts/Posts.xml-ag
gs://pyspark-workshop/so-posts/Posts.xml-ah
gs://pyspark-workshop/so-posts/Posts.xml-ai
gs://pyspark-workshop/so-posts/Posts.xml-aj
gs://pyspark-workshop/so-posts/Posts.xml-ak
gs://pyspark-workshop/so-posts/Posts.xml-al
gs://pyspark-workshop/so-posts/Posts.xml-am
gs://pyspark-workshop/so-posts/Posts.xml-an
gs://pyspark-workshop/so-posts/Posts.xml-ao
gs://pyspark-workshop/so-posts/Posts.xml-ap
gs://pyspark-workshop/so-posts/Posts.xml-aq
gs://pyspark-workshop/so-posts/Posts.xml-ar
gs://pyspark-workshop/so-posts/Posts.xml-as
gs://pyspark-workshop/so-posts/Posts.xml-at
gs://pyspark-workshop/so-posts/Posts.xml-au
gs://pyspark-workshop/so-posts/Posts.xml-av
gs://pyspark-workshop/so-posts/Posts.xml-aw
gs://pyspark-workshop/so-posts/Posts.xml-ax
gs://pyspark-workshop/so-posts/Posts.xml-ay
gs://pyspark-workshop/so-posts/Posts.xml-az
gs://pyspark-workshop/so-posts/Posts.xml-ba
gs://pyspark-workshop/so-posts/Posts.xml-bb
gs://pyspark-workshop/so-posts/Posts.xml-bc
gs://pyspark-workshop/so-posts/Posts.xml-bd
gs://pyspark-workshop/so-posts/Posts.xml-be
gs://pyspark-workshop/so-posts/Posts.xml-bf
gs://pyspark-workshop/so-posts/Posts.xml-bg
gs://pyspark-workshop/so-posts/Posts.xml-bh
gs://pyspark-workshop/so-posts/Posts.xml-bi
gs://pyspark-workshop/so-posts/Posts.xml-bj
gs://pyspark-workshop/so-posts/Posts.xml-bk
gs://pyspark-workshop/so-posts/Posts.xml-bl
gs://pyspark-workshop/so-posts/Posts.xml-bm
gs://pyspark-workshop/so-posts/Posts.xml-bn
gs://pyspark-workshop/so-posts/Posts.xml-bo
gs://pyspark-workshop/so-posts/Posts.xml-bp
gs://pyspark-workshop/so-posts/Posts.xml-bq
gs://pyspark-workshop/so-posts/Posts.xml-br
gs://pyspark-workshop/so-posts/Posts.xml-bs
gs://pyspark-workshop/so-posts/Posts.xml-bt
gs://pyspark-workshop/so-posts/Posts.xml-bu
gs://pyspark-workshop/so-posts/Posts.xml-bv
gs://pyspark-workshop/so-posts/Posts.xml-bw
gs://pyspark-workshop/so-posts/Posts.xml-bx
gs://pyspark-workshop/so-posts/Posts.xml-by
gs://pyspark-workshop/so-posts/Posts.xml-bz
gs://pyspark-workshop/so-posts/Posts.xml-ca
gs://pyspark-workshop/so-posts/Posts.xml-cb
gs://pyspark-workshop/so-posts/Posts.xml-cc
gs://pyspark-workshop/so-posts/Posts.xml-cd
gs://pyspark-workshop/so-posts/Posts.xml-ce
gs://pyspark-workshop/so-posts/Posts.xml-cf
gs://pyspark-workshop/so-posts/Posts.xml-cg
gs://pyspark-workshop/so-posts/Posts.xml-ch
gs://pyspark-workshop/so-posts/Posts.xml-ci
gs://pyspark-workshop/so-posts/Posts.xml-cj
gs://pyspark-workshop/so-posts/Posts.xml-ck
gs://pyspark-workshop/so-posts/Posts.xml-cl
gs://pyspark-workshop/so-posts/Posts.xml-cm
gs://pyspark-workshop/so-posts/Posts.xml-cn
gs://pyspark-workshop/so-posts/Posts.xml-co
gs://pyspark-workshop/so-posts/Posts.xml-cp
gs://pyspark-workshop/so-posts/Posts.xml-cq
gs://pyspark-workshop/so-posts/Posts.xml-cr
gs://pyspark-workshop/so-posts/Posts.xml-cs
gs://pyspark-workshop/so-posts/Posts.xml-ct
gs://pyspark-workshop/so-posts/Posts.xml-cu
gs://pyspark-workshop/so-posts/Posts.xml-cv
gs://pyspark-workshop/so-posts/Posts.xml-cw
gs://pyspark-workshop/so-posts/Posts.xml-cx
gs://pyspark-workshop/so-posts/Posts.xml-cy
gs://pyspark-workshop/so-posts/Posts.xml-cz
gs://pyspark-workshop/so-posts/Posts.xml-da
gs://pyspark-workshop/so-posts/Posts.xml-db
gs://pyspark-workshop/so-posts/Posts.xml-dc
gs://pyspark-workshop/so-posts/Posts.xml-dd
gs://pyspark-workshop/so-posts/Posts.xml-de
gs://pyspark-workshop/so-posts/Posts.xml-df
gs://pyspark-workshop/so-posts/Posts.xml-dg
gs://pyspark-workshop/so-posts/Posts.xml-dh
gs://pyspark-workshop/so-posts/Posts.xml-di
gs://pyspark-workshop/so-posts/Posts.xml-dj
gs://pyspark-workshop/so-posts/Posts.xml-dk
gs://pyspark-workshop/so-posts/Posts.xml-dl
gs://pyspark-workshop/so-posts/Posts.xml-dm
gs://pyspark-workshop/so-posts/Posts.xml-dn
gs://pyspark-workshop/so-posts/Posts.xml-do
gs://pyspark-workshop/so-posts/Posts.xml-dp
gs://pyspark-workshop/so-posts/Posts.xml-dq
gs://pyspark-workshop/so-posts/Posts.xml-dr
gs://pyspark-workshop/so-posts/Posts.xml-ds
gs://pyspark-workshop/so-posts/Posts.xml-dt
gs://pyspark-workshop/so-posts/Posts.xml-du
gs://pyspark-workshop/so-posts/Posts.xml-dv
gs://pyspark-workshop/so-posts/Posts.xml-dw
gs://pyspark-workshop/so-posts/Posts.xml-dx
gs://pyspark-workshop/so-posts/Posts.xml-dy
gs://pyspark-workshop/so-posts/Posts.xml-dz
gs://pyspark-workshop/so-posts/Posts.xml-ea
gs://pyspark-workshop/so-posts/Posts.xml-eb
gs://pyspark-workshop/so-posts/Posts.xml-ec
gs://pyspark-workshop/so-posts/Posts.xml-ed
gs://pyspark-workshop/so-posts/Posts.xml-ee
gs://pyspark-workshop/so-posts/Posts.xml-ef
gs://pyspark-workshop/so-posts/Posts.xml-eg
gs://pyspark-workshop/so-posts/Posts.xml-eh
gs://pyspark-workshop/so-posts/Posts.xml-ei
gs://pyspark-workshop/so-posts/Posts.xml-ej
gs://pyspark-workshop/so-posts/Posts.xml-ek
gs://pyspark-workshop/so-posts/Posts.xml-el
gs://pyspark-workshop/so-posts/Posts.xml-em
gs://pyspark-workshop/so-posts/Posts.xml-en
gs://pyspark-workshop/so-posts/Posts.xml-eo
gs://pyspark-workshop/so-posts/Posts.xml-ep
gs://pyspark-workshop/so-posts/Posts.xml-eq
gs://pyspark-workshop/so-posts/Posts.xml-er
gs://pyspark-workshop/so-posts/Posts.xml-es
gs://pyspark-workshop/so-posts/Posts.xml-et
gs://pyspark-workshop/so-posts/Posts.xml-eu
gs://pyspark-workshop/so-posts/Posts.xml-ev
gs://pyspark-workshop/so-posts/Posts.xml-ew
gs://pyspark-workshop/so-posts/Posts.xml-ex
gs://pyspark-workshop/so-posts/Posts.xml-ey
gs://pyspark-workshop/so-posts/Posts.xml-ez
gs://pyspark-workshop/so-posts/Posts.xml-fa
gs://pyspark-workshop/so-posts/Posts.xml-fb
gs://pyspark-workshop/so-posts/Posts.xml-fc
gs://pyspark-workshop/so-posts/Posts.xml-fd
gs://pyspark-workshop/so-posts/Posts.xml-fe
gs://pyspark-workshop/so-posts/Posts.xml-ff
gs://pyspark-workshop/so-posts/Posts.xml-fg
gs://pyspark-workshop/so-posts/Posts.xml-fh
gs://pyspark-workshop/so-posts/Posts.xml-fi
gs://pyspark-workshop/so-posts/Posts.xml-fj
gs://pyspark-workshop/so-posts/Posts.xml-fk
gs://pyspark-workshop/so-posts/Posts.xml-fl
gs://pyspark-workshop/so-posts/Posts.xml-fm
gs://pyspark-workshop/so-posts/Posts.xml-fn
gs://pyspark-workshop/so-posts/Posts.xml-fo
gs://pyspark-workshop/so-posts/Posts.xml-fp
gs://pyspark-workshop/so-posts/Posts.xml-fq
gs://pyspark-workshop/so-posts/Posts.xml-fr
gs://pyspark-workshop/so-posts/Posts.xml-fs
gs://pyspark-workshop/so-posts/Posts.xml-ft
gs://pyspark-workshop/so-posts/Posts.xml-fu
gs://pyspark-workshop/so-posts/Posts.xml-fv
gs://pyspark-workshop/so-posts/Posts.xml-fw
gs://pyspark-workshop/so-posts/Posts.xml-fx
gs://pyspark-workshop/so-posts/Posts.xml-fy
gs://pyspark-workshop/so-posts/Posts.xml-fz
gs://pyspark-workshop/so-posts/Posts.xml-ga
gs://pyspark-workshop/so-posts/Posts.xml-gb
gs://pyspark-workshop/so-posts/Posts.xml-gc
gs://pyspark-workshop/so-posts/Posts.xml-gd
gs://pyspark-workshop/so-posts/Posts.xml-ge
gs://pyspark-workshop/so-posts/Posts.xml-gf
gs://pyspark-workshop/so-posts/Posts.xml-gg
gs://pyspark-workshop/so-posts/Posts.xml-gh
gs://pyspark-workshop/so-posts/Posts.xml-gi
gs://pyspark-workshop/so-posts/Posts.xml-gj
gs://pyspark-workshop/so-posts/Posts.xml-gk
gs://pyspark-workshop/so-posts/Posts.xml-gl
gs://pyspark-workshop/so-posts/Posts.xml-gm
gs://pyspark-workshop/so-posts/Posts.xml-gn
gs://pyspark-workshop/so-posts/Posts.xml-go
gs://pyspark-workshop/so-posts/Posts.xml-gp
gs://pyspark-workshop/so-posts/Posts.xml-gq
gs://pyspark-workshop/so-posts/Posts.xml-gr
gs://pyspark-workshop/so-posts/Posts.xml-gs
gs://pyspark-workshop/so-posts/Posts.xml-gt
gs://pyspark-workshop/so-posts/Posts.xml-gu
gs://pyspark-workshop/so-posts/Posts.xml-gv
gs://pyspark-workshop/so-posts/Posts.xml-gw
gs://pyspark-workshop/so-posts/Posts.xml-gx
gs://pyspark-workshop/so-posts/Posts.xml-gy
gs://pyspark-workshop/so-posts/Posts.xml-gz
gs://pyspark-workshop/so-posts/Posts.xml-ha
gs://pyspark-workshop/so-posts/Posts.xml-hb
gs://pyspark-workshop/so-posts/Posts.xml-hc
gs://pyspark-workshop/so-posts/Posts.xml-hd
gs://pyspark-workshop/so-posts/Posts.xml-he
gs://pyspark-workshop/so-posts/Posts.xml-hf
gs://pyspark-workshop/so-posts/Posts.xml-hg
gs://pyspark-workshop/so-posts/Posts.xml-hh
gs://pyspark-workshop/so-posts/Posts.xml-hi
gs://pyspark-workshop/so-posts/Posts.xml-hj
gs://pyspark-workshop/so-posts/Posts.xml-hk
gs://pyspark-workshop/so-posts/Posts.xml-hl
gs://pyspark-workshop/so-posts/Posts.xml-hm
gs://pyspark-workshop/so-posts/Posts.xml-hn
gs://pyspark-workshop/so-posts/Posts.xml-ho
gs://pyspark-workshop/so-posts/Posts.xml-hp
gs://pyspark-workshop/so-posts/Posts.xml-hq
gs://pyspark-workshop/so-posts/Posts.xml-hr
gs://pyspark-workshop/so-posts/Posts.xml-hs
gs://pyspark-workshop/so-posts/Posts.xml-ht
gs://pyspark-workshop/so-posts/Posts.xml-hu
gs://pyspark-workshop/so-posts/Posts.xml-hv
gs://pyspark-workshop/so-posts/Posts.xml-hw
gs://pyspark-workshop/so-posts/Posts.xml-hx
gs://pyspark-workshop/so-posts/Posts.xml-hy
gs://pyspark-workshop/so-posts/Posts.xml-hz
gs://pyspark-workshop/so-posts/Posts.xml-ia
gs://pyspark-workshop/so-posts/Posts.xml-ib
gs://pyspark-workshop/so-posts/Posts.xml-ic
gs://pyspark-workshop/so-posts/Posts.xml-id
gs://pyspark-workshop/so-posts/Posts.xml-ie
gs://pyspark-workshop/so-posts/Posts.xml-if
gs://pyspark-workshop/so-posts/Posts.xml-ig
gs://pyspark-workshop/so-posts/Posts.xml-ih
gs://pyspark-workshop/so-posts/Posts.xml-ii
gs://pyspark-workshop/so-posts/Posts.xml-ij
gs://pyspark-workshop/so-posts/Posts.xml-ik
gs://pyspark-workshop/so-posts/Posts.xml-il
gs://pyspark-workshop/so-posts/Posts.xml-im
gs://pyspark-workshop/so-posts/Posts.xml-in
gs://pyspark-workshop/so-posts/Posts.xml-io
gs://pyspark-workshop/so-posts/Posts.xml-ip
gs://pyspark-workshop/so-posts/Posts.xml-iq
gs://pyspark-workshop/so-posts/Posts.xml-ir
gs://pyspark-workshop/so-posts/Posts.xml-is
gs://pyspark-workshop/so-posts/Posts.xml-it
gs://pyspark-workshop/so-posts/Posts.xml-iu
gs://pyspark-workshop/so-posts/Posts.xml-iv
gs://pyspark-workshop/so-posts/Posts.xml-iw
gs://pyspark-workshop/so-posts/Posts.xml-ix
gs://pyspark-workshop/so-posts/Posts.xml-iy
gs://pyspark-workshop/so-posts/Posts.xml-iz
gs://pyspark-workshop/so-posts/Posts.xml-ja
gs://pyspark-workshop/so-posts/Posts.xml-jb
gs://pyspark-workshop/so-posts/Posts.xml-jc
gs://pyspark-workshop/so-posts/Posts.xml-jd
gs://pyspark-workshop/so-posts/Posts.xml-je
gs://pyspark-workshop/so-posts/Posts.xml-jf
gs://pyspark-workshop/so-posts/Posts.xml-jg
gs://pyspark-workshop/so-posts/Posts.xml-jh
gs://pyspark-workshop/so-posts/Posts.xml-ji
gs://pyspark-workshop/so-posts/Posts.xml-jj
gs://pyspark-workshop/so-posts/Posts.xml-jk
gs://pyspark-workshop/so-posts/Posts.xml-jl
gs://pyspark-workshop/so-posts/Posts.xml-jm
gs://pyspark-workshop/so-posts/Posts.xml-jn
gs://pyspark-workshop/so-posts/Posts.xml-jo
gs://pyspark-workshop/so-posts/Posts.xml-jp
gs://pyspark-workshop/so-posts/Posts.xml-jq
gs://pyspark-workshop/so-posts/Posts.xml-jr
gs://pyspark-workshop/so-posts/Posts.xml-js
gs://pyspark-workshop/so-posts/Posts.xml-jt
gs://pyspark-workshop/so-posts/Posts.xml-ju
gs://pyspark-workshop/so-posts/Posts.xml-jv
gs://pyspark-workshop/so-posts/Posts.xml-jw
gs://pyspark-workshop/so-posts/Posts.xml-jx
gs://pyspark-workshop/so-posts/Posts.xml-jy
gs://pyspark-workshop/so-posts/Posts.xml-jz
gs://pyspark-workshop/so-posts/Posts.xml-ka
gs://pyspark-workshop/so-posts/Posts.xml-kb
gs://pyspark-workshop/so-posts/Posts.xml-kc
gs://pyspark-workshop/so-posts/Posts.xml-kd
gs://pyspark-workshop/so-posts/Posts.xml-ke
gs://pyspark-workshop/so-posts/Posts.xml-kf
gs://pyspark-workshop/so-posts/Posts.xml-kg
gs://pyspark-workshop/so-posts/Posts.xml-kh
gs://pyspark-workshop/so-posts/Posts.xml-ki
gs://pyspark-workshop/so-posts/Posts.xml-kj
gs://pyspark-workshop/so-posts/Posts.xml-kk
gs://pyspark-workshop/so-posts/Posts.xml-kl
gs://pyspark-workshop/so-posts/Posts.xml-km
gs://pyspark-workshop/so-posts/Posts.xml-kn
gs://pyspark-workshop/so-posts/Posts.xml-ko
gs://pyspark-workshop/so-posts/Posts.xml-kp
gs://pyspark-workshop/so-posts/Posts.xml-kq
gs://pyspark-workshop/so-posts/Posts.xml-kr
gs://pyspark-workshop/so-posts/Posts.xml-ks
gs://pyspark-workshop/so-posts/Posts.xml-kt
gs://pyspark-workshop/so-posts/Posts.xml-ku
gs://pyspark-workshop/so-posts/Posts.xml-kv
gs://pyspark-workshop/so-posts/Posts.xml-kw
gs://pyspark-workshop/so-posts/Posts.xml-kx
gs://pyspark-workshop/so-posts/Posts.xml-ky
gs://pyspark-workshop/so-posts/Posts.xml-kz
gs://pyspark-workshop/so-posts/Posts.xml-la
gs://pyspark-workshop/so-posts/Posts.xml-lb
gs://pyspark-workshop/so-posts/Posts.xml-lc
gs://pyspark-workshop/so-posts/Posts.xml-ld
gs://pyspark-workshop/so-posts/Posts.xml-le
gs://pyspark-workshop/so-posts/Posts.xml-lf
gs://pyspark-workshop/so-posts/Posts.xml-lg
gs://pyspark-workshop/so-posts/Posts.xml-lh
gs://pyspark-workshop/so-posts/Posts.xml-li
gs://pyspark-workshop/so-posts/Posts.xml-lj
gs://pyspark-workshop/so-posts/Posts.xml-lk
gs://pyspark-workshop/so-posts/Posts.xml-ll
gs://pyspark-workshop/so-posts/Posts.xml-lm
gs://pyspark-workshop/so-posts/Posts.xml-ln
gs://pyspark-workshop/so-posts/Posts.xml-lo
gs://pyspark-workshop/so-posts/Posts.xml-lp
gs://pyspark-workshop/so-posts/Posts.xml-lq
gs://pyspark-workshop/so-posts/Posts.xml-lr
gs://pyspark-workshop/so-posts/Posts.xml-ls
gs://pyspark-workshop/so-posts/Posts.xml-lt
gs://pyspark-workshop/so-posts/Posts.xml-lu
gs://pyspark-workshop/so-posts/Posts.xml-lv
gs://pyspark-workshop/so-posts/Posts.xml-lw
gs://pyspark-workshop/so-posts/Posts.xml-lx
gs://pyspark-workshop/so-posts/Posts.xml-ly
gs://pyspark-workshop/so-posts/Posts.xml-lz
gs://pyspark-workshop/so-posts/Posts.xml-ma
gs://pyspark-workshop/so-posts/Posts.xml-mb
gs://pyspark-workshop/so-posts/Posts.xml-mc
gs://pyspark-workshop/so-posts/Posts.xml-md
gs://pyspark-workshop/so-posts/Posts.xml-me
gs://pyspark-workshop/so-posts/Posts.xml-mf
gs://pyspark-workshop/so-posts/Posts.xml-mg
gs://pyspark-workshop/so-posts/Posts.xml-mh
gs://pyspark-workshop/so-posts/Posts.xml-mi
gs://pyspark-workshop/so-posts/Posts.xml-mj
gs://pyspark-workshop/so-posts/Posts.xml-mk

In [2]:
lines = sc.textFile("gs://pyspark-workshop/so-posts/*")

In [33]:
# or only a smaller subset of the files
lines = sc.textFile("gs://pyspark-workshop/so-posts/Posts.xml-*a")
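To see how much smaller that subset is, here is a minimal local sketch. It rebuilds the file suffixes from the listing above (aa through mk) and applies the same pattern with Python's fnmatch. Using fnmatch as a stand-in for the glob matching done by textFile is an assumption that should hold for a simple `*` pattern like this one.

```python
from fnmatch import fnmatchcase
from itertools import product
from string import ascii_lowercase

# Rebuild the suffixes seen in the gsutil listing: aa, ab, ..., mk.
suffixes = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
suffixes = suffixes[: suffixes.index("mk") + 1]
names = ["Posts.xml-" + suffix for suffix in suffixes]

# Keep only the names that the pattern Posts.xml-*a would select.
subset = [name for name in names if fnmatchcase(name, "Posts.xml-*a")]

print(len(names), "files in total,", len(subset), "match the pattern")
```

So the pattern picks one file per first letter (aa, ba, ..., ma), which is handy for quick experiments.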

Let's check what's inside these files...


In [34]:
lines.take(5)


Out[34]:
['<?xml version="1.0" encoding="utf-8"?>',
 '<posts>',
 '  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="421" ViewCount="28370" Body="&lt;p&gt;I want to use a track-bar to change a form\'s opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to build it, I get this error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;  &lt;p&gt;Cannot implicitly convert type \'decimal\' to \'double\'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried making &lt;code&gt;trans&lt;/code&gt; a &lt;code&gt;double&lt;/code&gt;, but then the control doesn\'t work. This code has worked fine for me in VB.NET in the past. &lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="5455605" LastEditorDisplayName="Rich B" LastEditDate="2015-12-23T21:34:28.557" LastActivityDate="2016-07-17T20:33:18.217" Title="When setting a form\'s opacity should I use a decimal or double?" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;type-conversion&gt;&lt;decimal&gt;&lt;opacity&gt;" AnswerCount="13" CommentCount="3" FavoriteCount="33" CommunityOwnedDate="2012-10-31T16:42:47.213" />',
 '  <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="189" ViewCount="13715" Body="&lt;p&gt;I have an absolutely positioned &lt;code&gt;div&lt;/code&gt; containing several children, one of which is a relatively positioned &lt;code&gt;div&lt;/code&gt;. When I use a &lt;strong&gt;percentage-based width&lt;/strong&gt; on the child &lt;code&gt;div&lt;/code&gt;, it collapses to \'0\' width on &lt;a href=&quot;http://en.wikipedia.org/wiki/Internet_Explorer_7&quot;&gt;Internet&amp;nbsp;Explorer&amp;nbsp;7&lt;/a&gt;, but not on Firefox or Safari.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;If I use &lt;strong&gt;pixel width&lt;/strong&gt;, it works. If the parent is relatively positioned, the percentage width on the child works.&lt;/p&gt;&#xA;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Is there something I\'m missing here?&lt;/li&gt;&#xA;&lt;li&gt;Is there an easy fix for this besides the &lt;em&gt;pixel-based width&lt;/em&gt; on the&#xA;child?&lt;/li&gt;&#xA;&lt;li&gt;Is there an area of the CSS specification that covers this?&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;" OwnerUserId="9" LastEditorUserId="63550" LastEditorDisplayName="Rich B" LastEditDate="2016-03-19T06:05:48.487" LastActivityDate="2016-03-19T06:10:52.170" Title="Percentage width child element in absolutely positioned parent on Internet Explorer 7" Tags="&lt;html&gt;&lt;css&gt;&lt;css3&gt;&lt;internet-explorer-7&gt;" AnswerCount="5" CommentCount="0" FavoriteCount="7" />',
 '  <row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="305" Body="&lt;p&gt;An explicit cast to double isn\'t necessary.&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;double trans = (double)trackBar1.Value / 5000.0;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;Identifying the constant as &lt;code&gt;5000.0&lt;/code&gt; (or as &lt;code&gt;5000d&lt;/code&gt;) is sufficient:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;double trans = trackBar1.Value / 5000.0;&#xA;double trans = trackBar1.Value / 5000d;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;" OwnerUserId="9" LastEditorUserId="967315" LastEditDate="2012-10-14T11:50:16.703" LastActivityDate="2012-10-14T11:50:16.703" CommentCount="1" />']

Keep only the lines that are actual post rows (this drops the XML declaration and the <posts> wrapper):


In [5]:
rows = lines.filter(lambda x: x.lstrip().startswith('<row'))
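The same predicate can be tried locally on a few lines shaped like the output above. This is only a sketch: the row bodies are shortened, the closing `</posts>` line is assumed to exist at the end of the dump, and `is_row` and `rows_local` are hypothetical names.

```python
# Sample lines shaped like the file contents shown above (attributes shortened).
sample_lines = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<posts>',
    '  <row Id="4" PostTypeId="1" Score="421" />',
    '  <row Id="7" PostTypeId="2" ParentId="4" Score="305" />',
    '</posts>',  # assumed closing tag, not visible in the first five lines
]

def is_row(line):
    # Rows are indented, so strip leading whitespace before testing the prefix.
    return line.lstrip().startswith('<row')

rows_local = [line for line in sample_lines if is_row(line)]
print(len(rows_local), "of", len(sample_lines), "lines are rows")
```

Without the lstrip() call, the indented rows would not start with `<row` and the filter would drop everything.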

Let's parse this mess...


In [9]:
import xml.etree.ElementTree as ET

In [10]:
parsed = (lines
          .map(lambda x: x.lstrip())
          .filter(lambda x: x.startswith('<row'))
          .map(ET.fromstring))

In [12]:
from pprint import pprint
pprint(parsed.take(2))


[<Element 'row' at 0x7f58e69ed868>, <Element 'row' at 0x7f58e69ed818>]

Better:


In [49]:
pprint(parsed.map(lambda x: x.attrib).take(3))


[{'AcceptedAnswerId': '7',
  'AnswerCount': '13',
  'Body': "<p>I want to use a track-bar to change a form's opacity.</p>\n"
          '\n'
          '<p>This is my code:</p>\n'
          '\n'
          '<pre><code>decimal trans = trackBar1.Value / 5000;\n'
          'this.Opacity = trans;\n'
          '</code></pre>\n'
          '\n'
          '<p>When I try to build it, I get this error:</p>\n'
          '\n'
          '<blockquote>\n'
          "  <p>Cannot implicitly convert type 'decimal' to 'double'.</p>\n"
          '</blockquote>\n'
          '\n'
          '<p>I tried making <code>trans</code> a <code>double</code>, but '
          "then the control doesn't work. This code has worked fine for me in "
          'VB.NET in the past. </p>\n',
  'CommentCount': '3',
  'CommunityOwnedDate': '2012-10-31T16:42:47.213',
  'CreationDate': '2008-07-31T21:42:52.667',
  'FavoriteCount': '33',
  'Id': '4',
  'LastActivityDate': '2016-07-17T20:33:18.217',
  'LastEditDate': '2015-12-23T21:34:28.557',
  'LastEditorDisplayName': 'Rich B',
  'LastEditorUserId': '5455605',
  'OwnerUserId': '8',
  'PostTypeId': '1',
  'Score': '421',
  'Tags': '<c#><winforms><type-conversion><decimal><opacity>',
  'Title': "When setting a form's opacity should I use a decimal or double?",
  'ViewCount': '28370'},
 {'AcceptedAnswerId': '31',
  'AnswerCount': '5',
  'Body': '<p>I have an absolutely positioned <code>div</code> containing '
          'several children, one of which is a relatively positioned '
          '<code>div</code>. When I use a <strong>percentage-based '
          "width</strong> on the child <code>div</code>, it collapses to '0' "
          'width on <a '
          'href="http://en.wikipedia.org/wiki/Internet_Explorer_7">Internet&nbsp;Explorer&nbsp;7</a>, '
          'but not on Firefox or Safari.</p>\n'
          '\n'
          '<p>If I use <strong>pixel width</strong>, it works. If the parent '
          'is relatively positioned, the percentage width on the child '
          'works.</p>\n'
          '\n'
          '<ol>\n'
          "<li>Is there something I'm missing here?</li>\n"
          '<li>Is there an easy fix for this besides the <em>pixel-based '
          'width</em> on the\n'
          'child?</li>\n'
          '<li>Is there an area of the CSS specification that covers '
          'this?</li>\n'
          '</ol>\n',
  'CommentCount': '0',
  'CreationDate': '2008-07-31T22:08:08.620',
  'FavoriteCount': '7',
  'Id': '6',
  'LastActivityDate': '2016-03-19T06:10:52.170',
  'LastEditDate': '2016-03-19T06:05:48.487',
  'LastEditorDisplayName': 'Rich B',
  'LastEditorUserId': '63550',
  'OwnerUserId': '9',
  'PostTypeId': '1',
  'Score': '189',
  'Tags': '<html><css><css3><internet-explorer-7>',
  'Title': 'Percentage width child element in absolutely positioned parent on '
           'Internet Explorer 7',
  'ViewCount': '13715'},
 {'Body': "<p>An explicit cast to double isn't necessary.</p>\n"
          '\n'
          '<pre><code>double trans = (double)trackBar1.Value / 5000.0;\n'
          '</code></pre>\n'
          '\n'
          '<p>Identifying the constant as <code>5000.0</code> (or as '
          '<code>5000d</code>) is sufficient:</p>\n'
          '\n'
          '<pre><code>double trans = trackBar1.Value / 5000.0;\n'
          'double trans = trackBar1.Value / 5000d;\n'
          '</code></pre>\n',
  'CommentCount': '1',
  'CreationDate': '2008-07-31T22:17:57.883',
  'Id': '7',
  'LastActivityDate': '2012-10-14T11:50:16.703',
  'LastEditDate': '2012-10-14T11:50:16.703',
  'LastEditorUserId': '967315',
  'OwnerUserId': '9',
  'ParentId': '4',
  'PostTypeId': '2',
  'Score': '305'}]
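Note that every attribute value comes back as a string. For the computations further down it may help to convert a few of them. Below is a minimal local sketch: `to_record` and `INT_FIELDS` are hypothetical names, the choice of fields is arbitrary, and the date format is assumed from the rows shown above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical selection of attributes that look like integers in the sample.
INT_FIELDS = ("Id", "PostTypeId", "ParentId", "Score", "AnswerCount")

def to_record(line):
    # Parse one <row .../> line and convert a few attributes to Python types.
    record = dict(ET.fromstring(line.strip()).attrib)
    for name in INT_FIELDS:
        if name in record:
            record[name] = int(record[name])
    if "CreationDate" in record:
        # Assumes every date looks like 2008-07-31T22:17:57.883
        record["CreationDate"] = datetime.strptime(
            record["CreationDate"], "%Y-%m-%dT%H:%M:%S.%f")
    return record

line = ('  <row Id="7" PostTypeId="2" ParentId="4" '
        'CreationDate="2008-07-31T22:17:57.883" Score="305" '
        'Body="&lt;p&gt;An explicit cast&lt;/p&gt;&#xA;" />')
record = to_record(line)
print(record["Id"], record["CreationDate"].year, repr(record["Body"]))
```

The parser also decodes the XML entities, which is why Body contains plain HTML and real newlines in the output above.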

Let's compute tag counts!


In [36]:
def parse_tags(x):
    return x[1:-1].split("><")

tags = parsed.map(lambda x: parse_tags(x.attrib['Tags']) if 'Tags' in x.attrib else [])
tags.take(5)


Out[36]:
[['c#', 'winforms', 'type-conversion', 'decimal', 'opacity'],
 ['html', 'css', 'css3', 'internet-explorer-7'],
 [],
 ['c#', '.net', 'datetime'],
 ['c#', 'datetime', 'datediff', 'relative-time-span']]
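A quick local check of parse_tags shows one edge case worth knowing: an empty string yields a list with one empty tag, not an empty list. Whether the dump contains rows with an empty Tags attribute is not known here, so `parse_tags_safe` below is only a hypothetical, defensive variant.

```python
def parse_tags(x):
    # Same function as in the cell above: drop the outer < > and split on "><".
    return x[1:-1].split("><")

def parse_tags_safe(x):
    # Hypothetical variant: treat an empty or missing value as "no tags".
    if not x:
        return []
    return x[1:-1].split("><")

print(parse_tags("<c#><.net><datetime>"))
print(parse_tags(""))
print(parse_tags_safe(""))
```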

In [40]:
counts = tags.flatMap(lambda x: x).groupBy(lambda x: x).map(lambda x: (x[0], len(x[1])))

Taking long? Go to http://cluster-1-m:8088 (if you're using the default cluster name) and explore it.

Do you know flatMap? The statement above uses it to flatten the per-post tag lists into a single RDD of tags before grouping.


In [41]:
counts.sortBy(lambda x: x[1], ascending=False).take(10)


Out[41]:
[('javascript', 1206322),
 ('java', 1128803),
 ('c#', 997465),
 ('php', 969707),
 ('android', 887010),
 ('jquery', 770009),
 ('python', 624016),
 ('html', 572556),
 ('c++', 467183),
 ('ios', 457172)]
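What the counting pipeline computes can be sketched locally with plain Python on the five tag lists shown earlier. This is an illustration of the logic only, not a replacement for the Spark job. `tag_lists`, `flat` and `local_counts` are hypothetical names.

```python
from collections import Counter
from itertools import chain

# The five tag lists from the tags.take(5) output above.
tag_lists = [
    ['c#', 'winforms', 'type-conversion', 'decimal', 'opacity'],
    ['html', 'css', 'css3', 'internet-explorer-7'],
    [],
    ['c#', '.net', 'datetime'],
    ['c#', 'datetime', 'datediff', 'relative-time-span'],
]

# flatMap(lambda x: x) turns a collection of lists into one collection of tags.
flat = list(chain.from_iterable(tag_lists))

# groupBy + len counts the occurrences of each tag; Counter does the same here.
local_counts = Counter(flat)
top = local_counts.most_common(2)
print(top)

# On the cluster, a commonly used alternative to groupBy is reduceByKey, e.g.
#   tags.flatMap(lambda x: x).map(lambda t: (t, 1)).reduceByKey(lambda a, b: a + b)
# which adds up partial counts instead of collecting every occurrence per tag.
```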

Shout if you're the first one here! Congrats!

Puzzles:

  1. Can you compute how many times someone asked about Python this month (you can count only posts with the python tag)?
  2. Can you measure Python's monthly popularity over the last year? Can you plot it?
  3. Can you do the same but only for main posts (questions)?
  4. (*) Can you find the question that has the most posts attached?
    1. Do the same, but rank by the total score of the subposts.
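As a possible starting point for the monthly puzzles, here are two small helpers that work on the attribute dictionaries shown earlier. Both `month_key` and `has_tag` are hypothetical names, and the date format is assumed from the sample rows. This is a sketch to build on, not a full solution.

```python
from datetime import datetime

def month_key(creation_date):
    # Turn "2008-07-31T21:42:52.667" into a (year, month) pair for grouping.
    parsed = datetime.strptime(creation_date, "%Y-%m-%dT%H:%M:%S.%f")
    return (parsed.year, parsed.month)

def has_tag(attrib, tag):
    # True if the post's Tags attribute contains exactly this tag.
    tags = attrib.get("Tags", "")
    return tag in tags[1:-1].split("><")

attrib = {"CreationDate": "2008-07-31T21:42:52.667", "Tags": "<c#><winforms>"}
print(month_key(attrib["CreationDate"]), has_tag(attrib, "c#"))
```

Comparing whole tags, not substrings, avoids counting a tag like python-3.x as python.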

In [50]:
# if you hate xml (you do), then save it as json on hdfs!
import json
parsed.map(lambda x: json.dumps(x.attrib)).saveAsTextFile("posts.jsons")
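A small local sketch of why this format is convenient: json.dumps escapes newlines, so each post stays on a single line of the text output and can be restored with json.loads. The `attrib` dictionary below is a shortened, hypothetical example.

```python
import json

# Shortened example of one post's attributes; Body contains a real newline.
attrib = {"Id": "7", "PostTypeId": "2", "Body": "<p>An explicit cast</p>\n"}

line = json.dumps(attrib)      # one JSON document, newline escaped as \n
restored = json.loads(line)    # back to the same dictionary

print(line)

# On the cluster the saved files could be read back along these lines:
#   posts = sc.textFile("posts.jsons").map(json.loads)
```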