# Data Cleaning: Twitter Data

wangchengjun@nju.edu.cn

## Data cleaning

• Remove malformed rows
• Split fields into columns correctly
• Extract the content to be analyzed
• Preprocess large datasets line by line or in chunks

# 1. Draw a sample of tweets to experiment with



In [23]:

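# Note: you may need to change the paths below
with open('/Users/chengjun/百度云同步盘/Writing/OWS/ows-raw.txt', 'rb') as bigfile:
    chunkSize = 1000000
    # readlines(sizehint) returns whole lines totalling roughly chunkSize bytes
    chunk = bigfile.readlines(chunkSize)
print len(chunk)
with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample.txt", 'w') as f:
    for i in chunk:
        f.write(i)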




2752
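
The cell above reads only the first chunk of the raw file. To preprocess a whole file chunk by chunk, the same `readlines(sizehint)` call can be repeated until it returns an empty list. Below is a minimal sketch written in Python 3 syntax (the notebook cells themselves are Python 2). The helper name `iter_chunks` and the temporary demo file are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile

def iter_chunks(path, chunk_size=1000000):
    """Yield lists of whole lines, each list totalling roughly chunk_size bytes."""
    with open(path, 'rb') as f:
        while True:
            part = f.readlines(chunk_size)
            if not part:  # an empty list means end of file
                break
            yield part

# Demo on a small temporary file (hypothetical data, not the OWS file)
tmp = tempfile.NamedTemporaryFile(delete=False, suffix='.txt')
tmp.write(b''.join(b'line %d\n' % i for i in range(1000)))
tmp.close()

total_lines = 0
n_chunks = 0
for part in iter_chunks(tmp.name, chunk_size=2000):
    n_chunks += 1
    total_lines += len(part)  # replace with real per-chunk processing

os.remove(tmp.name)
print(total_lines, n_chunks)
```

Because each chunk is processed and then discarded, memory use stays bounded by the chunk size rather than by the file size.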



# 2. Repair rows broken across lines



In [13]:

with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample.txt", 'rb') as f:
    lines = f.readlines()




In [14]:

# Total number of lines
len(lines)




Out[14]:

2752




In [15]:

# Inspect the first line
lines[0]




Out[15]:

'"Twitter ID",Text,"Profile Image URL",Day,Hour,Minute,"Created At",Geo,"From User","From User ID",Language,"To User","To User ID",Source\n'



# Problem: the first line holds the variable names

## 2. How do we get each variable name?



In [16]:

varNames = lines[0].replace('\n', '').split(',')
varNames




Out[16]:

['"Twitter ID"',
'Text',
'"Profile Image URL"',
'Day',
'Hour',
'Minute',
'"Created At"',
'Geo',
'"From User"',
'"From User ID"',
'Language',
'"To User"',
'"To User ID"',
'Source']




In [17]:

len(varNames)




Out[17]:

14
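
Splitting the header on commas leaves the double quotes inside some names (for example `'"Twitter ID"'`). As a possible alternative, the standard `csv` module strips the quotes while parsing. The sketch below uses Python 3 syntax and copies the header line from the output above; the variable name `var_names` is introduced here for illustration only.

```python
import csv

header = ('"Twitter ID",Text,"Profile Image URL",Day,Hour,Minute,"Created At",Geo,'
          '"From User","From User ID",Language,"To User","To User ID",Source\n')

# csv.reader accepts any iterable of lines; next() returns the first parsed row
var_names = next(csv.reader([header]))

print(var_names)
print(len(var_names), var_names.index('From User'))
```

The position of `From User` found this way is the column index used later when counting tweets per user.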




In [26]:

lines[1344]




Out[26]:

'121818600490283009,"RT @chachiTHEgr8: RT @TheNewDeal: First they ignore you, then they laugh at you, then they fight you, then you win. - Gandhi #OccupyWallStreet #OWS #p2",http://a0.twimg.com/profile_images/326662126/Photo_233_normal.jpg,2011-10-06,5,26,"2011-10-06 05:26:32",N;,k_l_h_j,382233343,en,,0,"&lt;a href=&quot;http://twitter.com/#!/download/iphone&quot; rel=&quot;nofollow&quot;&gt;Twitter for iPhone&lt;/a&gt;"\n'



# How do we handle erroneous line breaks?



In [30]:

with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample_clean.txt", 'w') as f:
    right_line = '' # the record being assembled; starts as an empty string
    blocks = [] # records judged complete are appended to blocks
    for line in lines:
        right_line += line.replace('\n', ' ')
        line_length = len(right_line.split(','))
        # heuristic: a complete record has at least 14 comma-separated fields
        if line_length >= 14:
            blocks.append(right_line)
            right_line = ''
    for i in blocks:
        f.write(i + '\n')




In [20]:

len(blocks)




Out[20]:

2627




In [27]:

blocks[1344]




Out[27]:

'121818879105310720,"RT @Min_Reyes: RT @The99Percenters: New video to go viral. From We Are Change\rhttp://t.co/6Ff718jk Listen to the guy begging... #ows #cdnpoli",http://a3.twimg.com/sticky/default_profile_images/default_profile_0_normal.png,2011-10-06,5,27,"2011-10-06 05:27:38",N;,MiyazakiMegu,260948518,en,,0,"&lt;a href=&quot;http://www.tweetdeck.com&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;"'




In [32]:

with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample_clean4.txt", 'w') as f:
    right_line = '' # the record being assembled; starts as an empty string
    blocks = [] # records judged complete are appended to blocks
    for line in lines:
        # also replace carriage returns, which otherwise stay inside the tweet text
        right_line += line.replace('\n', ' ').replace('\r', ' ')
        line_length = len(right_line.split(','))
        if line_length >= 14:
            blocks.append(right_line)
            right_line = ''
    for i in blocks:
        f.write(i + '\n')




In [33]:

blocks[1344]




Out[33]:

'121818879105310720,"RT @Min_Reyes: RT @The99Percenters: New video to go viral. From We Are Change http://t.co/6Ff718jk Listen to the guy begging... #ows #cdnpoli",http://a3.twimg.com/sticky/default_profile_images/default_profile_0_normal.png,2011-10-06,5,27,"2011-10-06 05:27:38",N;,MiyazakiMegu,260948518,en,,0,"&lt;a href=&quot;http://www.tweetdeck.com&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;" '
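
The merging logic used in the two cells above can be isolated and tried on a toy example. The sketch below is written in Python 3 syntax; the function name `merge_broken_lines`, its `n_fields` parameter, and the three-column toy data are assumptions made for illustration. Note that the comma count is only a heuristic, since commas inside the tweet text are counted as well.

```python
def merge_broken_lines(lines, n_fields):
    """Join consecutive lines until the buffer has at least n_fields comma-separated parts."""
    records = []
    buffer = ''
    for line in lines:
        buffer += line.replace('\n', ' ').replace('\r', ' ')
        if len(buffer.split(',')) >= n_fields:
            records.append(buffer)
            buffer = ''
    return records

toy_lines = [
    'id,text,user\n',
    '1,"first half\n',       # record broken by a stray line break
    'second half",alice\n',  # continuation of the previous line
    '2,"intact tweet",bob\n',
]

records = merge_broken_lines(toy_lines, n_fields=3)
for r in records:
    print(repr(r))
```

The four physical lines collapse into three records, with the broken tweet rejoined into a single row.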



# 3. Read the data and split it into columns correctly



In [35]:

# Note: you may need to change the path below
with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample_clean.txt", 'rb') as f:
    chunk = f.readlines()




In [36]:

len(chunk)




Out[36]:

2627




In [37]:

chunk[:3]




Out[37]:

['"Twitter ID",Text,"Profile Image URL",Day,Hour,Minute,"Created At",Geo,"From User","From User ID",Language,"To User","To User ID",Source\n',
'121813144174727168,"RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!",http://a2.twimg.com/profile_images/1539375713/Twitter_normal.jpg,2011-10-06,5,4,"2011-10-06 05:04:51",N;,Anonops_Cop,401240477,en,,0,"&lt;a href=&quot;http://twitter.com/&quot;&gt;web&lt;/a&gt;"\n',
'121813146137657344,"@jamiekilstein @allisonkilkenny Interesting interview (never aired, wonder why??) by Fox with #ows protester http://t.co/Fte55Kh7",http://a2.twimg.com/profile_images/1574715503/Kate6_normal.jpg,2011-10-06,5,4,"2011-10-06 05:04:51",N;,KittyHybrid,34532053,en,jamiekilstein,2149053,"&lt;a href=&quot;http://twitter.com/&quot;&gt;web&lt;/a&gt;"\n']




In [38]:

import csv

lines = csv.reader((line.replace('\x00','') for line in chunk[1:]), delimiter=',', quotechar='"')
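
To see why `csv.reader` is used here instead of a plain `split(',')`, consider a row whose text field contains a comma and a stray NUL character. The sketch below (Python 3 syntax, with made-up rows that are not from the OWS data) compares the two approaches.

```python
import csv

raw_rows = [
    '1,"Hello, world\x00",alice\n',  # comma inside quotes plus a stray NUL character
    '2,"No comma here",bob\n',
]

# Naive splitting breaks the quoted text field in two
naive = [row.rstrip('\n').split(',') for row in raw_rows]

# csv.reader respects the quote character; NUL characters are removed first
parsed = list(csv.reader((row.replace('\x00', '') for row in raw_rows),
                         delimiter=',', quotechar='"'))

print(len(naive[0]), len(parsed[0]))
print(parsed[0])
```

The naive split yields four fields for the first row, while the parsed row has the expected three.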



# 4. Count tweets

### Distribution of the number of users by number of tweets posted



In [39]:

from collections import defaultdict
data_dict = defaultdict(int)
line_num = 0
lines = csv.reader((line.replace('\x00','') for line in chunk[1:]), delimiter=',', quotechar='"')
for i in lines:
    line_num +=1
    data_dict[i[8]] +=1 # i[8] is the "From User" column




In [41]:

data_dict.items()[:5]




Out[41]:

[('Mike_EH_52', 1),
('neeirish', 1),
('heroofcamelot', 1),
('agentdarkapple', 1),
('KliphScurlock', 1)]




In [72]:

print line_num




2626




In [48]:

%matplotlib inline
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import matplotlib
#matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei'] # set the default font
matplotlib.rc("savefig", dpi=100)
font = FontProperties(fname=r'/Users/chengjun/github/cjc2016/data/msyh.ttf', size=14) # Note: change this path to your own font file




In [49]:

plt.hist(data_dict.values())
#plt.yscale('log')
#plt.xscale('log')
plt.xlabel(u'Number of tweets', fontproperties=font)
plt.ylabel(u'Number of users', fontproperties=font)
plt.show()







In [50]:

tweet_dict = defaultdict(int)
for i in data_dict.values():
    tweet_dict[i] += 1  # number of tweets -> number of users with that many tweets

plt.loglog(tweet_dict.keys(), tweet_dict.values(), 'ro',linewidth=2)
plt.xlabel(u'Number of tweets', fontproperties=font)
plt.ylabel(u'Number of users', fontproperties=font)
plt.show()
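
The two counting steps above (tweets per user, then users per tweet count) can also be expressed with `collections.Counter`. This is a sketch in Python 3 syntax using a made-up list of user names; the variable names `tweets_per_user` and `users_per_count` are assumptions for illustration and play the roles of `data_dict` and `tweet_dict`.

```python
from collections import Counter

# Hypothetical "From User" values, one entry per tweet
users = ['alice', 'bob', 'alice', 'carol', 'alice', 'bob']

tweets_per_user = Counter(users)                     # user -> number of tweets
users_per_count = Counter(tweets_per_user.values())  # number of tweets -> number of users

print(tweets_per_user.most_common(2))
print(sorted(users_per_count.items()))
```

`most_common` additionally gives the most active users directly, which a plain `defaultdict` does not.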







In [51]:

import numpy as np
import statsmodels.api as sm

def powerPlot(d_value, d_freq, color, marker):
    d_freq = [i + 1 for i in d_freq]  # add one so that empty bins do not give log(0)
    d_prob = [float(i)/sum(d_freq) for i in d_freq]
    #d_rank = ss.rankdata(d_value).astype(int)
    x = np.log(d_value)
    y = np.log(d_prob)
    xx = sm.add_constant(x, prepend=True)
    res = sm.OLS(y,xx).fit()  # OLS fit of log(probability) on log(value)
    constant,beta = res.params
    r2 = res.rsquared
    plt.plot(d_value, d_prob, linestyle = '',  color = color, marker = marker)
    plt.plot(d_value, np.exp(constant+x*beta),"red")
    plt.xscale('log'); plt.yscale('log')
    plt.text(max(d_value)/2,max(d_prob)/10,
             r'$\beta$ = ' + str(round(beta,2)) +'\n' + r'$R^2$ = ' + str(round(r2, 2)))




In [52]:

histo, bin_edges = np.histogram(data_dict.values(), 15)
bin_center = 0.5*(bin_edges[1:] + bin_edges[:-1])
powerPlot(bin_center,histo, 'r', 'o')
#lg=plt.legend(labels = [u'Tweets', u'Fit'], loc=3, fontsize=20)
plt.ylabel(u'Probability', fontproperties=font)
plt.xlabel(u'Number of tweets', fontproperties=font)
plt.show()







In [53]:

import statsmodels.api as sm
from collections import defaultdict
import numpy as np

def powerPlot(data):
    d = sorted(data, reverse = True )
    d_table = defaultdict(int)
    for k in d:
        d_table[k] += 1
    d_value = sorted(d_table)
    # look up the frequencies before shifting the values, so each value keeps its own count
    d_freq = [d_table[i]+1 for i in d_value]
    d_value = [i+1 for i in d_value]
    d_prob = [float(i)/sum(d_freq) for i in d_freq]
    #d_rank = ss.rankdata(d_value).astype(int)
    x = np.log(d_value)
    y = np.log(d_prob)
    xx = sm.add_constant(x, prepend=True)
    res = sm.OLS(y,xx).fit()  # OLS fit of log(probability) on log(value)
    constant,beta = res.params
    r2 = res.rsquared
    plt.plot(d_value, d_prob, 'ro')
    plt.plot(d_value, np.exp(constant+x*beta),"red")
    plt.xscale('log'); plt.yscale('log')
    plt.text(max(d_value)/2,max(d_prob)/5,
             'Beta = ' + str(round(beta,2)) +'\n' + 'R squared = ' + str(round(r2, 2)))
    plt.title('Distribution')
    plt.ylabel('P(K)')
    plt.xlabel('K')
    plt.show()




In [54]:

powerPlot(data_dict.values())
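
The fit inside `powerPlot` is an ordinary least squares regression of log probability on log value. The sketch below reproduces that regression in plain Python 3, as a stand-in for `sm.OLS` and without the plotting, and checks it on synthetic data that follows a power law with exponent -2 exactly. The function name `fit_power_law` and the synthetic data are assumptions made for illustration.

```python
import math

def fit_power_law(values, probs):
    """OLS fit of log(prob) = constant + beta * log(value); returns (constant, beta, r2)."""
    x = [math.log(v) for v in values]
    y = [math.log(p) for p in probs]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    constant = mean_y - beta * mean_x
    ss_res = sum((yi - (constant + beta * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return constant, beta, 1 - ss_res / ss_tot

# Synthetic probabilities proportional to k^(-2) (hypothetical data)
ks = list(range(1, 51))
weights = [k ** -2.0 for k in ks]
probs = [w / sum(weights) for w in weights]

constant, beta, r2 = fit_power_law(ks, probs)
print(round(beta, 2), round(r2, 2))
```

On real tweet counts the points scatter around the fitted line, so the estimated exponent and the R squared value should be read as descriptive summaries rather than as a formal test of a power law.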






# 5. Clean the tweet text



In [60]:

tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!!
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com
http://ccc.nju.edu.cn RT !!HELP!!!!'''




In [61]:

import re

import twitter_text



# Install twitter_text

## pip install twitter-text-py



In [62]:

import re
import twitter_text

def extract_rt_user(tweet):
    rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
    try:
        rt_user_name = rt_patterns.findall(tweet)[0][1].strip(' @')
        return rt_user_name
    except IndexError:
        # no RT/via pattern found, so the tweet is not a retweet
        return None




In [63]:

def extract_tweet_text(tweet, at_names, urls):
    for i in at_names:
        tweet = tweet.replace(i, '')
    for j in urls:
        tweet = tweet.replace(j, '')
    marks = ['RT @', '@', '&quot;', '#', '\n', '\t', '  ']
    for k in marks:
        tweet = tweet.replace(k, '')
    return tweet




In [64]:

ex = twitter_text.Extractor(tweet)
at_names = ex.extract_mentioned_screen_names()
urls = ex.extract_urls()
hashtags = ex.extract_hashtags()
rt_user = extract_rt_user(tweet)
tweet_text = extract_tweet_text(tweet, at_names, urls)

print at_names, urls, hashtags, rt_user,'-------->', tweet_text




[u'AnonKitsu', u'chengjun', u'mili'] [u'http://computational-communication.com', u'http://ccc.nju.edu.cn'] [u'OCCUPYWALLSTREET', u'OWS', u'OCCUPYNY'] AnonKitsu --------> : ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! OCCUPYWALLSTREET OWS OCCUPYNY PLEASE RT !!HELP!!!!
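
If `twitter_text` is not available, a rough approximation of the three extraction calls can be written with the standard `re` module. The sketch below uses Python 3 syntax; the function name `extract_entities` and its simple patterns are assumptions for illustration and do not reproduce the exact matching rules of `twitter_text.Extractor`.

```python
import re

def extract_entities(tweet):
    """Rough regex stand-in for the mention, URL, and hashtag extraction."""
    urls = re.findall(r'https?://\S+', tweet)
    mentions = re.findall(r'@(\w+)', tweet)
    hashtags = re.findall(r'#(\w+)', tweet)
    return mentions, urls, hashtags

sample = ('RT @AnonKitsu: ALERT! #OCCUPYWALLSTREET #OWS PLEASE @chengjun '
          'http://ccc.nju.edu.cn RT !!HELP!!')

mentions, urls, hashtags = extract_entities(sample)
print(mentions, urls, hashtags)
```

These patterns are deliberately simple: a URL containing `#` or `@` would also produce a spurious hashtag or mention, which is one reason to prefer a dedicated library for real data.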




In [65]:

import csv

lines = csv.reader((line.replace('\x00','') for line in chunk[1:]), delimiter=',', quotechar='"')
tweets = [i[1] for i in lines]




In [66]:

for tweet in tweets[:5]:
    ex = twitter_text.Extractor(tweet)
    at_names = ex.extract_mentioned_screen_names()
    urls = ex.extract_urls()
    hashtags = ex.extract_hashtags()
    rt_user = extract_rt_user(tweet)
    tweet_text = extract_tweet_text(tweet, at_names, urls)

    print at_names, urls, hashtags, rt_user,
    print tweet_text




[u'AnonKitsu'] [] [u'OCCUPYWALLSTREET', u'OWS', u'OCCUPYNY'] AnonKitsu : ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! OCCUPYWALLSTREET OWS OCCUPYNY PLEASE RT !!HELP!!!!
[u'jamiekilstein', u'allisonkilkenny'] [u'http://t.co/Fte55Kh7'] [u'ows'] None Interesting interview (never aired, wonder why??) by Fox with ows protester
[u'Seductivpancake'] [] [u'ows'] None  Right! Those guys have a victory condition: regime change. ows doesn't seem to have a goal I can figure out.
[u'bembel'] [u'http://j.mp/rhHavq'] [u'OccupyWallStreet', u'OWS'] bembel  Occupy Wall Street als linke Antwort auf die Tea Party?OccupyWallStreet OWS
[] [] [u'ows'] None ows White shirt= Brown shirt.




In [77]:

import csv

lines = csv.reader((line.replace('\x00','') for line in chunk[1:]), delimiter=',', quotechar='"')
tweet_user_data = [(i[1], i[8]) for i in lines]
for tweet,user in tweet_user_data:
    rt_user = extract_rt_user(tweet)
    if rt_user:
        print rt_user, ',', user




rt_user1, user1

rt_user2, user3

rt_user2, user4

...
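
The printed pairs can be read as directed edges from the retweeted user to the retweeting user. As a possible next step, the sketch below (Python 3 syntax) counts how often each edge occurs. The `pairs` list is made-up data standing in for `tweet_user_data`, and the name `edges` is an assumption for illustration.

```python
import re
from collections import Counter

RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def extract_rt_user(tweet):
    """Return the screen name following the first RT/via marker, or None."""
    found = RT_PATTERN.findall(tweet)
    return found[0][1].strip(' @') if found else None

# Hypothetical (tweet, user) pairs standing in for tweet_user_data
pairs = [
    ('RT @alice: occupy everything #ows', 'bob'),
    ('RT @alice: occupy everything #ows', 'carol'),
    ('just a plain tweet #ows', 'dave'),
    ('great point via @bob', 'alice'),
]

edges = Counter()
for tweet, user in pairs:
    rt_user = extract_rt_user(tweet)
    if rt_user:
        edges[(rt_user, user)] += 1  # edge from the retweeted user to the retweeter

print(sorted(edges.items()))
```

The resulting weighted edge list can be loaded into a network analysis library to study the retweet network.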

# Further reading


