In [1]:
import pandas as pd

In [3]:
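# Raw string, so the backslashes in the Windows path are not treated as escapes
file = r'.\data\reddit\ethereum\slim_sorted_comments.csv'
# Read lazily in 1000-row chunks, indexed by comment ID
reader = pd.read_csv(file, chunksize=1000, header=0, index_col='commentId')
for df in reader:
    # keep only the first chunk and leave the rest unread
    break

reader.close()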
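
Since only the first chunk is kept, the same rows can also be obtained without a loop. The following is a minimal sketch, not the notebook's own method: `csv_text` is a hypothetical in-memory stand-in for the real file, and `chunksize=2` / `nrows=2` replace the 1000 used above so the example stays small.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for slim_sorted_comments.csv
csv_text = (
    "commentId,author,score\n"
    "t1_ceht9kr,qaezel,1\n"
    "t1_cei0u4u,vbenes,3\n"
    "t1_celr9kl,vbuterin,8\n"
)

# Option 1: keep the chunked reader, but pull the first chunk with next()
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, index_col="commentId")
first_chunk = next(reader)
reader.close()

# Option 2: ask read_csv for the first rows directly
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=2, index_col="commentId")

print(first_chunk.equals(first_rows))
```

The `nrows` form is shorter when only a preview is needed; the chunked reader is the better fit if later chunks will be processed too.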

In [ ]:
## Looking at the comment data

In [5]:
df.head(n=10)


Out[5]:
commentId   author            body                                               created_utc   postId     score
t1_ceht9kr  qaezel            The links are out!                                 1.388923e+09  t3_1ucwto      1
t1_cei0u4u  vbenes            Mind: blown! Contracts (their Turing complete...   1.388950e+09  t3_1ucwto      3
t1_celpxzb  salwilliam        "is decentralized Bitcoin-Ethereum exchange po...  1.389313e+09  t3_1ucwto      1
t1_celr9kl  vbuterin          Almost. It's pointless to have an Ethereum con...  1.389317e+09  t3_1ucwto      8
t1_cembxgf  salwilliam        http://4.bp.blogspot.com/-dFGIcV-hH5w/Uoz0JKRL...  1.389381e+09  t3_1ucwto     -4
t1_cemsiby  chem_deth         Your site is down, buddy. Get some Cloudflare ...  1.389423e+09  t3_1ucwto      3
t1_cemt8dp  coiv              I don't think many people know what "Turing co...  1.389426e+09  t3_1ucwto     24
t1_cemtctw  free593           This looks really good. I like how the concept...  1.389427e+09  t3_1ucwto      6
t1_cemxn5d  needsTimeMachine  The ethereum site is down, so do you happen to...  1.389453e+09  t3_1ucwto      4
t1_cemzmit  standardcrypto    Already slashdotted. This is going to be huge :)   1.389459e+09  t3_1ucwto      4
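
The `created_utc` values in the output above are shown in scientific notation, which hides the actual times. A minimal sketch of converting them follows, assuming the column holds Unix epoch seconds in UTC (the column name suggests this, but the source does not state it). `sample` is a hypothetical stand-in for `df`, using two of the rounded values displayed above, and `created` is an illustrative column name.

```python
import pandas as pd

# Hypothetical stand-in for `df`; values are the rounded ones displayed above
sample = pd.DataFrame(
    {"created_utc": [1.388923e9, 1.389459e9]},
    index=pd.Index(["t1_ceht9kr", "t1_cemzmit"], name="commentId"),
)

# Assumption: created_utc is Unix epoch seconds in UTC
sample["created"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)

print(sample["created"])
```

Both sample timestamps fall in January 2014.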

In [ ]:
## Finding deleted authors

In [8]:
# Comments whose author is shown as '[deleted]'
df[df['author'] == '[deleted]'].head(n=10)


Out[8]:
commentId   author     body                                               created_utc   postId     score
t1_cewocxg  [deleted]  Hey Vik! I'd love to get involved with your ...    1.390460e+09  t3_1ucwto      1
t1_cexos6j  [deleted]  [deleted]                                          1.390571e+09  t3_1ucwto      3
t1_cezwnv3  [deleted]  [deleted]                                          1.390806e+09  t3_1ucwto      1
t1_cfa3sgi  [deleted]  [deleted]                                          1.391855e+09  t3_1ucwto      6
t1_cenl9ca  [deleted]  The problem with CPU mining is that it is only...  1.389519e+09  t3_1uzumc      1
t1_cenqlt6  [deleted]  A 5-20x speedup is still dangerous. With that ...  1.389547e+09  t3_1uzumc      1
t1_ceowmzu  [deleted]  [deleted]                                          1.389662e+09  t3_1v5i4l      2
t1_cepztgx  [deleted]  What conditions should be met to switch to tru...  1.389770e+09  t3_1v78zi      1
t1_ceq456q  [deleted]  > There is currently a lot of reasearch/discus...  1.389794e+09  t3_1v78zi      1
t1_ceq5pnu  [deleted]  Seems so.                                          1.389799e+09  t3_1v78zi      0
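
The output above mixes two cases: comments where only the author is `[deleted]` and the text survives, and comments where the body is `[deleted]` as well. A minimal sketch of separating them with boolean masks follows. `sample` is a hypothetical stand-in for `df` built from three rows shown in this notebook, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for `df`, using rows shown in the outputs above
sample = pd.DataFrame(
    {
        "author": ["qaezel", "[deleted]", "[deleted]"],
        "body": ["The links are out!", "[deleted]", "Seems so."],
    },
    index=pd.Index(["t1_ceht9kr", "t1_cexos6j", "t1_ceq5pnu"], name="commentId"),
)

author_deleted = sample["author"] == "[deleted]"
body_deleted = sample["body"] == "[deleted]"

# Author and text both gone
fully_deleted = sample[author_deleted & body_deleted]
# Author gone, text still readable
text_survives = sample[author_deleted & ~body_deleted]

print(len(fully_deleted), len(text_survives))
```

Keeping the two groups apart matters for the text-length analysis later, since a fully deleted comment contributes only the 9 characters of the placeholder.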

In [ ]:
## Grouping authors by total score (highest and lowest)

In [18]:
# Total score per author, highest first
df.groupby('author')['score'].sum().reset_index().sort_values('score', ascending=False).head(n=10)


Out[18]:
     author          score
216  vbuterin          191
59   [deleted]          93
56   Ursium             92
167  nucleo_io          78
147  malefizer          73
180  que23              72
110  ericcart           58
22   ItsAConspiracy     44
229  zerox102           41
101  ddink7             39

In [21]:
# Total score per author, lowest first
df.groupby('author')['score'].sum().reset_index().sort_values('score', ascending=True).head(n=10)


Out[21]:
     author              score
188  salwilliam             -3
66   ambrozy007             -1
32   MaxK                   -1
70   antanst                 0
201  stop_runs               0
91   coin-table              0
13   Dartanan                0
211  twisthype               0
52   SyncoBeat               1
51   Symphonic_Rainboom      1
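
The highest and lowest totals can also be taken from a single grouped Series, without sorting the whole frame twice. This is a sketch of that alternative using `Series.nlargest` and `Series.nsmallest`, not the approach used in the cells above. `sample` is a hypothetical stand-in for `df`; its scores are taken from rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for `df`, using authors and scores shown earlier
sample = pd.DataFrame(
    {
        "author": ["salwilliam", "salwilliam", "vbuterin", "qaezel"],
        "score": [1, -4, 8, 1],
    }
)

# One pass over the data: total score per author
totals = sample.groupby("author")["score"].sum()

top = totals.nlargest(2)      # highest totals first
bottom = totals.nsmallest(2)  # lowest totals first

print(top)
print(bottom)
```

Unlike the cells above, the result keeps `author` as the index instead of a column.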

In [ ]:
## Grouping authors by number of comments

In [19]:
# Number of comments per author (count of non-null postId values), most active first
df.groupby('author')['postId'].count().reset_index().sort_values('postId', ascending=False).head(n=10)


Out[19]:
     author          postId
216  vbuterin            57
56   Ursium              53
59   [deleted]           48
180  que23               47
167  nucleo_io           45
110  ericcart            37
147  malefizer           34
229  zerox102            28
22   ItsAConspiracy      20
19   Haposhi             18

In [ ]:
## Grouping authors by total comment length

In [28]:
# Total characters per author: concatenate each author's comments, then take the length
df.groupby('author')['body'].sum().map(len).reset_index().sort_values('body', ascending=False).head(n=10)


Out[28]:
     author           body
216  vbuterin        26156
167  nucleo_io       13325
63   aaron-lebo      12099
56   Ursium          11219
180  que23           10230
147  malefizer        9409
110  ericcart         7750
59   [deleted]        7321
22   ItsAConspiracy   7229
30   Maegfaer         6954
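
The three per-author measures computed in this notebook (total score, comment count, total text length) can be gathered into one table with named aggregation. A minimal sketch follows, assuming pandas 0.25 or later. `sample` is a hypothetical stand-in for `df` built from rows shown earlier, and `body_len`, `total_score`, `n_comments`, and `total_chars` are illustrative names that do not appear in the original.

```python
import pandas as pd

# Hypothetical stand-in for `df`, using rows shown in the outputs above
sample = pd.DataFrame(
    {
        "author": ["qaezel", "standardcrypto", "[deleted]", "[deleted]"],
        "body": [
            "The links are out!",
            "Already slashdotted. This is going to be huge :)",
            "[deleted]",
            "Seems so.",
        ],
        "score": [1, 4, 3, 0],
    }
)

summary = (
    sample.assign(body_len=sample["body"].str.len())  # characters per comment
    .groupby("author")
    .agg(
        total_score=("score", "sum"),
        n_comments=("body", "count"),
        total_chars=("body_len", "sum"),
    )
    .sort_values("total_score", ascending=False)
)

print(summary)
```

Measuring each comment with `.str.len()` before grouping avoids building one long concatenated string per author, which is what the `sum().map(len)` approach does.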
