In [2]:
df = sqlContext.read.json("posts.jsons")

Can you explain why this takes so long?
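
A hint, hedged: when no schema is given, `read.json` has to go through the input to infer the column names and types before it returns, which is a likely reason for the wait on a large file. A minimal sketch of skipping that pass by supplying the schema up front follows. The column names are copied from the `printSchema()` output in this notebook; the names `POST_COLUMNS` and `read_posts_as_strings` are made up for illustration.

```python
# Column names copied from the printSchema() output of this notebook.
POST_COLUMNS = [
    "AcceptedAnswerId", "AnswerCount", "Body", "ClosedDate", "CommentCount",
    "CommunityOwnedDate", "CreationDate", "FavoriteCount", "Id",
    "LastActivityDate", "LastEditDate", "LastEditorDisplayName",
    "LastEditorUserId", "OwnerDisplayName", "OwnerUserId", "ParentId",
    "PostTypeId", "Score", "Tags", "Title", "ViewCount",
]


def read_posts_as_strings(sql_context, path):
    """Read a JSON-lines file with an explicit all-string schema (hypothetical helper)."""
    # Imported lazily so the column list can be inspected without Spark installed.
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType(
        [StructField(name, StringType(), True) for name in POST_COLUMNS]
    )
    # With a schema supplied, Spark does not need an inference pass over the data.
    return sql_context.read.schema(schema).json(path)
```

Usage would look like `df = read_posts_as_strings(sqlContext, "posts.jsons")`; whether it is noticeably faster depends on the file size and the cluster.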


In [3]:
df.printSchema()


root
 |-- AcceptedAnswerId: string (nullable = true)
 |-- AnswerCount: string (nullable = true)
 |-- Body: string (nullable = true)
 |-- ClosedDate: string (nullable = true)
 |-- CommentCount: string (nullable = true)
 |-- CommunityOwnedDate: string (nullable = true)
 |-- CreationDate: string (nullable = true)
 |-- FavoriteCount: string (nullable = true)
 |-- Id: string (nullable = true)
 |-- LastActivityDate: string (nullable = true)
 |-- LastEditDate: string (nullable = true)
 |-- LastEditorDisplayName: string (nullable = true)
 |-- LastEditorUserId: string (nullable = true)
 |-- OwnerDisplayName: string (nullable = true)
 |-- OwnerUserId: string (nullable = true)
 |-- ParentId: string (nullable = true)
 |-- PostTypeId: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Tags: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- ViewCount: string (nullable = true)


In [4]:
# Would you like to feel at home? Pull a small sample into pandas.
df.limit(100).toPandas()


Out[4]:
AcceptedAnswerId AnswerCount Body ClosedDate CommentCount CommunityOwnedDate CreationDate FavoriteCount Id LastActivityDate ... LastEditorDisplayName LastEditorUserId OwnerDisplayName OwnerUserId ParentId PostTypeId Score Tags Title ViewCount
0 None None <p>You can paint graphics into your own off-sc... None 4 None 2009-03-26T18:04:48.793 None 686911 2009-03-26T18:04:48.793 ... None None Tom Hawtin - tackline 4725 686822 2 2 None None None
1 None None <p>If you don't link against it, you are in th... None 3 None 2009-03-26T18:04:58.603 None 686913 2009-03-26T23:44:46.150 ... X-Istence 13986 X-Istence 13986 686857 2 2 None None None
2 None None <p>With GPL I think the only way out is to wra... None 3 None 2009-03-26T18:05:05.657 None 686914 2009-03-27T14:42:09.613 ... None 29639 None 29639 686857 2 0 None None None
3 None None <pre><code>// A Gorgon class - For the love of... None 0 2009-03-26T18:05:09.430 2009-03-26T18:05:09.430 None 686915 2009-04-01T17:34:20.113 ... Trey Jackson 6148 itsmatt 7862 184618 2 16 None None None
4 None None <p>Axis2 does not support Spring web services ... None 0 None 2009-03-26T18:05:16.450 None 686916 2009-03-26T18:05:16.450 ... None None None 28946 216779 2 1 None None None
5 None None <p>If you can, I'd recommend writing to the Ev... None 1 None 2009-03-26T18:05:43.627 None 686918 2009-03-26T18:05:43.627 ... None None Paul Stevens 83260 680200 2 4 None None None
6 None None <p>You get your browser's language for your bu... None 2 None 2009-03-26T18:05:45.763 None 686919 2010-01-21T19:22:49.687 ... None 63550 Seb 76583 686905 2 11 None None None
7 None None <pre><code>Do\n\n love();\n\n honor();\n\n che... None 0 2009-03-26T18:05:49.727 2009-03-26T18:05:49.757 None 686920 2009-03-26T18:13:29.660 ... Rich B 5640 Jordan 79856 686216 2 2 None None None
8 686992 3 <p>Usually not calling Dispose indicates a pos... None 1 None 2009-03-26T18:06:09.583 0 686923 2009-03-27T19:00:54.797 ... None 14074 None 14074 None 1 2 <c#><unit-testing> How to make sure that Dispose method was calle... 1651
9 686935 5 <p>This is a snippet from my <code>index.php</... None 1 None 2009-03-26T18:06:09.990 None 686924 2009-05-05T03:31:08.060 ... Tomalak 18771 None 55794 None 1 0 <php><jquery><ajax> To use PHP templates or jQuery AJAX calls? 1456
10 None None <p>I think the common concensus would be use a... None 5 None 2009-03-26T18:06:21.160 None 686925 2009-03-26T18:06:21.160 ... None None Josh Stodola 54420 686896 2 1 None None None
11 None None <p>You can use the usual <code>acquire; try { ... None 0 None 2009-03-26T18:06:36.963 None 686926 2009-03-26T20:03:02.970 ... mmyers 13531 Tom Hawtin - tackline 4725 686897 2 10 None None None
12 None None <p>I return a small class called ResultSet fro... None 3 None 2009-03-26T18:06:40.160 None 686927 2009-03-26T18:06:40.160 ... None None Moose 19032 686896 2 5 None None None
13 686957 3 <p>Is there any documentation on exactly how e... None 0 None 2009-03-26T18:06:54.793 None 686929 2011-11-17T18:04:26.563 ... None 496830 mek 4653 None 1 2 <c#><multithreading> How expensive is the IsInvokeRequired 476
14 None None <p>You should detect the <a href="http://msdn.... None 0 None 2009-03-26T18:07:02.313 None 686930 2009-03-26T18:07:02.313 ... None None Jeff Yates 23234 683132 2 2 None None None
15 687622 3 <p>If curly brackets ('{' and '}') are used in... None 2 None 2009-03-26T18:07:07.773 3 686931 2009-04-06T12:22:11.780 ... None None Sydius 43496 None 1 7 <syntax><lua> Are curly brackets used in Lua? 6067
16 None None <p>I'm not sure if you can use it <em>from</em... None 0 None 2009-03-26T18:07:19.440 None 686932 2009-03-26T18:07:19.440 ... None None SCdF 1666 686611 2 1 None None None
17 None 3 <p>I'm using Perl's File::Find module to scan ... None 1 None 2009-03-26T18:08:25.273 1 686934 2009-03-27T12:58:08.490 ... brian d foy 2766176 None 34903 None 1 2 <perl><symlink><dangling><file-find> Why doesn't File::Find handle my dangling syml... 2059
18 None None <p>You don't want to do this. AJAX is nice and... None 0 None 2009-03-26T18:08:28.377 None 686935 2009-03-26T18:08:28.377 ... None None Paolo Bergantino 16417 686924 2 8 None None None
19 None None <p>list/ditionary constructor (i.e. table type... None 3 None 2009-03-26T18:08:32.200 None 686936 2009-03-27T10:10:14.620 ... Brian R. Bondy 3153 Brian R. Bondy 3153 686931 2 4 None None None
20 None None <p>Simple answer - whenever your SQL is not al... None 0 None 2009-03-26T18:08:33.650 None 686937 2015-07-17T15:27:22.333 ... None 31641 le dorfier 31641 686724 2 11 None None None
21 None None <pre><code>// Java-ish pseudocode\npublic Happ... None 3 2009-03-26T18:08:50.110 2009-03-26T18:08:50.123 None 686938 2009-03-26T18:08:50.123 ... None None jiggy 78743 686216 2 9 None None None
22 None None <p>Thanks for that last tip Sam. I had a dialo... None 0 None 2009-03-26T18:08:56.363 None 686940 2009-03-26T18:08:56.363 ... None None Echilon 30512 211971 2 2 None None None
23 None None <p>The question is what is worse:</p>\n\n<ul>\... None 11 None 2009-03-26T18:08:58.547 None 686941 2015-07-22T15:03:10.807 ... Grauenwolf 102937 Grauenwolf 5274 686724 2 118 None None None
24 None None <p>Maybe this could help: <a href="http://www.... None 0 None 2009-03-26T18:09:02.963 None 686942 2009-03-26T18:09:02.963 ... None None Seb 76583 686894 2 0 None None None
25 None None <p>InvokeRequired's cost depends on which clas... None 0 None 2009-03-26T18:09:16.113 None 686943 2009-03-26T18:09:16.113 ... None None Reed Copsey 65358 686929 2 2 None None None
26 687038 3 <p>I'm trying to dig into the XML for a VBPROJ... None 0 None 2009-03-26T18:09:19.547 1 686944 2011-10-04T22:39:14.753 ... None None Danny G 76302 None 1 5 <.net><asp.net><visual-studio><visual-studio-2... VBPROJ / CSPROJ 2941
27 None None <p>You can go in to your application propertie... None 0 None 2009-03-26T18:09:21.653 None 686945 2009-03-26T18:19:14.983 ... windfinder 45581 windfinder 45581 686893 2 3 None None None
28 None 1 <p>Is there a way to get SSIS (&lt;cough>hatei... None 4 None 2009-03-26T18:09:36.627 None 686947 2009-03-26T18:20:06.417 ... None None None 15751 None 1 0 <sql><ssis> SSIS - Suggest types from destination 423
29 None None <p>I believe the basic install (from the SubVe... None 0 None 2009-03-26T18:10:06.517 None 686948 2009-03-26T18:10:06.517 ... None None Frank 18196 686892 2 0 None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
70 687008 1 <pre><code>//window is an HWND\nLPWSTR path = ... None 0 None 2009-03-26T18:25:40.770 None 687001 2009-03-27T07:47:47.893 ... Kevin Montrose 80572 Kevin Montrose 80572 None 1 2 <c><winapi> QueryFullProcessImageName failing with valid H... 3100
71 None None <p>Assuming the select ID from where the names... None 0 None 2009-03-26T18:25:44.407 None 687002 2009-03-26T18:25:44.407 ... None None Seb 76583 686965 2 0 None None None
72 687080 1 <p>I have a problem to solve that I think will... 2014-05-22T02:23:36.087 3 None 2009-03-26T18:25:47.697 2 687003 2014-05-21T20:12:50.593 ... Daniel Lew 321731 Simon Gibbs 13935 None 1 18 <open-source> Paying open source project members for bug fix... 569
73 None None <p>I'd start by decoupling the data table righ... None 1 None 2009-03-26T18:26:13.077 None 687005 2009-03-26T18:26:13.077 ... None None Josh 26160 686960 2 0 None None None
74 None None <p>Unfortunately, you must manually clean up f... None 2 None 2009-03-26T18:26:32.687 None 687006 2015-05-18T15:51:11.230 ... None 18706 Luke Francl 17965 686852 2 63 None None None
75 None None <p>Consider implementing MVP (model view prese... None 0 None 2009-03-26T18:26:45.573 None 687007 2009-03-26T18:26:45.573 ... None None Hrvoje 1407 686960 2 0 None None None
76 None None <p><strong>Main problem:</strong> </p>\n\n<p>F... None 3 None 2009-03-26T18:27:22.263 None 687008 2009-03-26T19:13:32.433 ... Brian R. Bondy 3153 Brian R. Bondy 3153 687001 2 5 None None None
77 None None <p>It is normally provided by the browser and ... None 2 None 2009-03-26T18:27:40.267 None 687009 2010-01-21T19:22:00.750 ... None 63550 ChristopheD 81179 686905 2 21 None None None
78 None None <p>I think you are looking for something like ... None 0 None 2009-03-26T18:27:53.247 None 687010 2009-03-26T18:27:53.247 ... None None Kevin Crowell 81193 686980 2 2 None None None
79 None None <p>Use mod rewrite .htaccess\nThis should pres... None 0 None 2009-03-26T18:27:57.690 None 687011 2009-03-26T18:27:57.690 ... None None AB Kolan 64497 686855 2 0 None None None
80 None None <p>Joshua Bloch has proposed adding a mechanis... None 2 None 2009-03-26T18:28:15.163 None 687013 2009-03-26T18:28:15.163 ... None None Kjetil &#216;degaard 74185 686897 2 4 None None None
81 687023 6 <p>I am creating temporary files from a bash s... None 0 None 2009-03-26T18:28:33.743 17 687014 2016-02-09T16:32:03.473 ... None 2907 skinp 2907 None 1 43 <linux><bash><temporary-files> Removing created temp files in unexpected bash... 26708
82 None None <p>Have you tried logging your view's width an... None 0 None 2009-03-26T18:28:41.467 None 687015 2009-03-26T18:28:41.467 ... None None Wim Haanstra 69313 686910 2 1 None None None
83 None None <p>If all you want is authorization, start wit... None 0 None 2009-03-26T18:29:27.063 None 687016 2009-03-26T18:29:27.063 ... None None Frank Farmer 82769 686185 2 1 None None None
84 None None <p>The output window itself is truncating your... None 1 None 2009-03-26T18:29:38.640 None 687017 2009-03-26T18:29:38.640 ... None None Clyde 945 686996 2 7 None None None
85 1339268 3 <p>Oracle pads values in char columns so if I ... None 0 None 2009-03-26T18:30:31.617 None 687018 2015-03-18T05:56:56.460 ... Toto 972463 Toto 26699 None 1 2 <java><oracle><jdbc> How to use Oracle jdbc driver fixedString prop... 2554
86 None None <p>Are you talking about in SQL Server Managem... None 0 None 2009-03-26T18:30:51.007 None 687020 2009-03-26T18:30:51.007 ... None None Chris 6744 686996 2 5 None None None
87 None None <p>Question: Are you certain you have Java 6 ... None 0 None 2009-03-26T18:31:38.307 None 687022 2009-03-26T18:31:38.307 ... None None Steve L. 63013 686893 2 1 None None None
88 None None <p>You could set a "<a href="http://tldp.org/L... None 2 None 2009-03-26T18:32:01.427 None 687023 2009-03-26T18:46:57.270 ... Paul Tomblin 3333 Paul Tomblin 3333 687014 2 33 None None None
89 None None <blockquote>\n <p>Creating an instance of the... None 0 None 2009-03-26T18:32:27.743 None 687024 2009-03-26T18:32:27.743 ... None None Jason Baker 2147 685253 2 0 None None None
90 None None <p>The data is all there, but management studi... None 0 None 2009-03-26T18:33:21.437 None 687026 2009-03-26T18:33:21.437 ... None None Jim 681 686996 2 0 None None None
91 None None <p>You don't have to bother removing those tmp... None 5 None 2009-03-26T18:33:45.900 None 687028 2009-03-26T18:33:45.900 ... None None Mykola Golubyev 54684 687014 2 -2 None None None
92 None None <p>Here's some code that'll convert an integer... None 10 None 2009-03-26T18:34:11.843 None 687029 2009-07-29T15:54:17.183 ... Erik 356 Erik 16942 686847 2 9 None None None
93 None None <p>If you use LoadControl() to load the contro... None 0 None 2009-03-26T18:34:13.963 None 687030 2009-03-26T18:34:13.963 ... None None AndreasN 24821 686726 2 3 None None None
94 None None <p>maybe this <a href="http://stackoverflow.co... None 0 None 2009-03-26T18:34:18.300 None 687031 2009-03-26T18:34:18.300 ... None None JSC 37311 686753 2 0 None None None
95 None None <p>You should only use CHAR for strings that w... None 3 None 2009-03-26T18:34:47.923 None 687032 2009-03-26T18:34:47.923 ... None None Kevin Crowell 81193 687018 2 1 None None None
96 None None <p>You want to use the <a href="http://www.gnu... None 5 None 2009-03-26T18:34:49.343 None 687033 2016-02-09T16:32:03.473 ... Brian Campbell -1 Brian Campbell 69755 687014 2 13 None None None
97 687042 6 <p>I really want to use hashsets in my program... None 0 None 2009-03-26T18:35:10.187 4 687034 2010-11-09T12:47:48.533 ... None None Brian 18192 None 1 18 <c#><.net><visual-studio-2005> Using HashSet in C# 2.0, compatible with 3.5 17131
98 None None <p>You haven't mentioned only returning the JS... None 1 None 2009-03-26T18:35:55.847 None 687035 2009-03-26T18:35:55.847 ... None None Troy 5047 686753 2 3 None None None
99 None None <p>I ended up using that code for IE and faili... None 0 None 2009-03-26T18:35:59.403 None 687036 2009-03-26T18:35:59.403 ... None None craigmoliver 12252 679067 2 0 None None None

100 rows × 21 columns


In [5]:
# If not, the Spark way gives roughly the same view.
df.show()


+----------------+-----------+--------------------+----------+------------+------------------+--------------------+-------------+------+--------------------+--------------------+---------------------+----------------+--------------------+-----------+--------+----------+-----+--------------------+--------------------+---------+
|AcceptedAnswerId|AnswerCount|                Body|ClosedDate|CommentCount|CommunityOwnedDate|        CreationDate|FavoriteCount|    Id|    LastActivityDate|        LastEditDate|LastEditorDisplayName|LastEditorUserId|    OwnerDisplayName|OwnerUserId|ParentId|PostTypeId|Score|                Tags|               Title|ViewCount|
+----------------+-----------+--------------------+----------+------------+------------------+--------------------+-------------+------+--------------------+--------------------+---------------------+----------------+--------------------+-----------+--------+----------+-----+--------------------+--------------------+---------+
|            null|       null|<p>I would recomm...|      null|           2|              null|2008-10-29T13:12:...|         null|246737|2008-10-29T13:12:...|                null|                 null|            null|             cciotti|      16834|  246710|         2|    2|                null|                null|     null|
|            null|       null|<p>You should cre...|      null|           0|              null|2008-10-29T13:12:...|         null|246738|2008-10-29T13:12:...|                null|                 null|            null|               Skubs|      25609|  246651|         2|    4|                null|                null|     null|
|            null|       null|<p><code>Singleto...|      null|           0|              null|2008-10-29T13:13:...|         null|246740|2008-10-29T13:13:...|                null|                 null|            null|             Justice|      12349|  246710|         2|    6|                null|                null|     null|
|            null|       null|<p>In the simple ...|      null|           0|              null|2008-10-29T13:13:...|         null|246743|2008-10-29T13:13:...|                null|                 null|            null|       David Schmitt|       4918|  179582|         2|    0|                null|                null|     null|
|          246759|          3|<p>Is there some ...|      null|           0|              null|2008-10-29T13:13:...|         null|246744|2008-10-29T23:18:...|2008-10-29T13:18:...|          Miguel Ping|           22992|         Miguel Ping|      22992|    null|         1|    2|<java><performanc...|    Java File Cursor|     1782|
|            null|       null|<p>Normalization ...|      null|           0|              null|2008-10-29T13:13:...|         null|246745|2013-10-15T09:21:...|2013-10-15T09:21:...|                 null|            5409|                 Rik|       5409|  246701|         2|    7|                null|                null|     null|
|            null|       null|<p>It is intended...|      null|           3|              null|2008-10-29T13:14:...|         null|246746|2008-10-29T13:23:...|2008-10-29T13:23:...|                chris|           18907|               chris|      18907|  246701|         2|    5|                null|                null|     null|
|            null|       null|<p>I've done exac...|      null|           1|              null|2008-10-29T13:14:...|         null|246747|2013-02-26T20:05:...|2013-02-26T20:05:...|                 null|            3655|            Aardvark|       3655|  246518|         2|    1|                null|                null|     null|
|            null|       null|<p>One way to fin...|      null|           0|              null|2008-10-29T13:14:...|         null|246748|2008-10-29T13:14:...|                null|                 null|            null|            This Mat|      14045|  246289|         2|    1|                null|                null|     null|
|            null|       null|<p>MSDN has a ser...|      null|           0|              null|2008-10-29T13:14:...|         null|246749|2008-10-30T00:46:...|2008-10-30T00:46:...|                  aku|            1196|                 aku|       1196|  246732|         2|    4|                null|                null|     null|
|            null|       null|<p>If you are jus...|      null|           6|              null|2008-10-29T13:14:...|         null|246750|2008-10-29T13:14:...|                null|                 null|            null|          tvanfosson|      12950|  246710|         2|   27|                null|                null|     null|
|            null|       null|<p>Normalization ...|      null|           2|              null|2008-10-29T13:14:...|         null|246751|2008-10-29T14:58:...|2008-10-29T14:58:...|                olavk|            7488|               olavk|       7488|  246701|         2|  124|                null|                null|     null|
|            null|       null|<p>The .aspx file...|      null|           0|              null|2008-10-29T13:15:...|         null|246752|2008-10-29T13:15:...|                null|                 null|            null|                null|      26888|  244755|         2|    1|                null|                null|     null|
|            null|       null|<p>You could use ...|      null|           8|              null|2008-10-29T13:15:...|         null|246753|2011-10-03T14:28:...|2011-10-03T14:28:...|             Re0sless|          180368|            Re0sless|       2098|  246623|         2|   46|                null|                null|     null|
|            null|       null|<p>check this fro...|      null|           0|              null|2008-10-29T13:15:...|         null|246754|2008-10-29T13:15:...|                null|                 null|            null|Joachim Kerschbaumer|      20227|  246636|         2|    3|                null|                null|     null|
|            null|       null|<p>A full tutoria...|      null|           1|              null|2008-10-29T13:16:...|         null|246756|2008-10-29T13:16:...|                null|                 null|            null|             Gamecat|      18061|  246723|         2|    2|                null|                null|     null|
|            null|       null|<p>There is a Cod...|      null|           0|              null|2008-10-29T13:17:...|         null|246757|2008-10-29T13:17:...|                null|                 null|            null|      GalacticCowboy|      29638|  246520|         2|    2|                null|                null|     null|
|            null|       null|<p>Maybe java.io....|      null|           1|              null|2008-10-29T13:17:...|         null|246758|2008-10-29T13:17:...|                null|                 null|            null|        johanneslink|      32352|  246744|         2|    2|                null|                null|     null|
|            null|       null|<p>Have you taken...|      null|           1|              null|2008-10-29T13:18:...|         null|246759|2008-10-29T13:18:...|                null|                 null|            null|                null|      17640|  246744|         2|    4|                null|                null|     null|
|          246796|          3|<p>In an UML comp...|      null|           0|              null|2008-10-29T13:18:...|         null|246760|2009-10-21T15:01:...|2009-10-21T15:01:...|                 null|            5640|               Bernd|       null|    null|         1|    4|               <uml>|How do I tag a da...|     6540|
+----------------+-----------+--------------------+----------+------------+------------------+--------------------+-------------+------+--------------------+--------------------+---------------------+----------------+--------------------+-----------+--------+----------+-----+--------------------+--------------------+---------+
only showing top 20 rows


In [6]:
df.registerTempTable("posts")
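
A side note on the call above: `registerTempTable` has been deprecated since Spark 2.0 in favor of `createOrReplaceTempView`. A small sketch that works with either version is shown here; `register_posts` is a hypothetical helper name, not part of the Spark API.

```python
def register_posts(df, name="posts"):
    """Register df for SQL queries, preferring the newer API when it exists."""
    if hasattr(df, "createOrReplaceTempView"):
        # Spark 2.0 and later.
        df.createOrReplaceTempView(name)
    else:
        # Older Spark 1.x DataFrames only have registerTempTable.
        df.registerTempTable(name)
    return name
```

Calling `register_posts(df)` has the same effect as the cell above, so the SQL query that follows does not change.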

In [8]:
sqlContext.sql("""
    SELECT Title, cast(ViewCount as int) FROM posts
    ORDER BY cast(ViewCount as int) DESC
    LIMIT 100
""").show(truncate=False)


+--------------------------------------------------------------------+---------+
|Title                                                               |ViewCount|
+--------------------------------------------------------------------+---------+
|How to undo last commit(s) in Git?                                  |3671106  |
|Converting String to Int in Java?                                   |3546204  |
|How to check if one string contains another substring in JavaScript?|3220360  |
|How to redirect to another page using jQuery?                       |3141591  |
|How to delete a Git branch both locally and remotely?               |3032622  |
|How to check if a checkbox is checked in jQuery?                    |2531445  |
|What is the best comment in source code you have ever encountered?  |2422977  |
|How to revert Git repository to a previous commit?                  |2413566  |
|Horizontally center a div in a div                                  |2403840  |
|How to remove a particular element from an array in JavaScript?     |2376031  |
|For-each over an array in JavaScript?                               |2358268  |
|Generating random integers in a specific range                      |2295175  |
|How to declare an array                                             |2264597  |
|How to create an HTML button that acts like a link?                 |2221859  |
|UPDATE from SELECT using SQL Server                                 |2183056  |
|How can I get query string values in JavaScript?                    |2152548  |
|Get selected value in dropdown list using JavaScript?               |2090650  |
|Iterate through a HashMap                                           |2088230  |
|How to checkout a remote Git branch?                                |2080782  |
|Check if string contains specific words?                            |2074962  |
+--------------------------------------------------------------------+---------+
only showing top 20 rows
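
The same query can also be written with the DataFrame API instead of SQL. This is only a sketch of one way to do it; `top_viewed` is a hypothetical helper, and it assumes `df` is the DataFrame loaded above with `Title` and `ViewCount` columns.

```python
def top_viewed(df, n=100):
    """Return the n rows with the highest ViewCount, cast from string to int."""
    # ViewCount is stored as a string, so cast it before sorting.
    views = df["ViewCount"].cast("int")
    return (
        df.orderBy(views.desc())
          .select("Title", views.alias("ViewCount"))
          .limit(n)
    )
```

`top_viewed(df).show(truncate=False)` should then print the same kind of table as the SQL cell.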

Puzzles

  1. List the titles of the most viewed posts that have the python tag (you may want to look at pyspark.sql.functions).
  2. Convert the appropriate string columns into ints and dates.
  3. Can you find the question with the most posts attached to it? Then do the same, but rank questions by the total score of their subposts.
  4. What else can you find in this data?
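
If you get stuck on puzzles 1 and 2, here is a possible starting point, not the only solution. It relies on what the outputs above show: `Tags` is a single string such as `<c#><unit-testing>`, and dates look like `2009-03-26T18:04:48.793`. All helper and constant names are hypothetical, and which columns count as ints or dates is a judgment call (the id columns are left as strings here).

```python
# Columns that look numeric or date-like in the outputs above (an assumption).
INT_COLUMNS = ["AnswerCount", "CommentCount", "FavoriteCount", "Score", "ViewCount"]
DATE_COLUMNS = ["ClosedDate", "CommunityOwnedDate", "CreationDate",
                "LastActivityDate", "LastEditDate"]


def tag_pattern(tag):
    """Build a SQL LIKE pattern that matches one tag inside a '<a><b>' string."""
    return "%<" + tag + ">%"


def with_typed_columns(df):
    """Puzzle 2: cast count columns to int and date columns to timestamp."""
    for name in INT_COLUMNS:
        df = df.withColumn(name, df[name].cast("int"))
    for name in DATE_COLUMNS:
        df = df.withColumn(name, df[name].cast("timestamp"))
    return df


def most_viewed_with_tag(df, tag, n=20):
    """Puzzle 1: titles of the n most viewed posts carrying the given tag."""
    views = df["ViewCount"].cast("int")
    return (
        df.where(df["Tags"].like(tag_pattern(tag)))
          .orderBy(views.desc())
          .select("Title", views.alias("ViewCount"))
          .limit(n)
    )
```

For example, `most_viewed_with_tag(df, "python").show(truncate=False)`. Note that `_` and `%` have a special meaning in a LIKE pattern, so a tag containing them would need escaping.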

In [ ]: