This article uses the second approach.
You first need the application's credentials: app_id and app_secret.
In [1]:
# Load Python packages
import requests
import datetime
import time
import pandas as pd
Step 1 - obtain the application's credentials (app_id, app_secret)
Step 2 - enter the id of the fan page to analyze
[Tutorial] How to apply for and create a Facebook APP ID
In [2]:
# Fan page id
page_id = "appledaily.tw"
# Application credentials
app_id = ""
app_secret = ""
# CSV file of posts scraped in the previous article
post_path = 'post/'+page_id+'_post.csv'
access_token = app_id + "|" + app_secret
This article follows on from the previous one, FB fan page analysis and report - scraping posts.
The previous article scraped every post on the fan page; the goal here is to scrape all the comments
under those posts. That requires each post's id, which is why the CSV file from the previous article is needed.
The work is done by four functions:
request_until_succeed
makes sure each request eventually completes
getFacebookCommentFeedData
fetches each comment's fields (message, like_count, created_time, comments, from, ...)
processFacebookComment
takes the raw data returned by getFacebookCommentFeedData and structures it
scrapeFacebookPageFeedComments
the main routine
In [3]:
# Check the response: 200 means success; on failure, wait five seconds and retry
def request_until_succeed(url):
    success = False
    while success is False:
        try:
            req = requests.get(url)
            if req.status_code == 200:
                success = True
            elif req.status_code == 400:
                return None
            else:
                # Any other status code: wait before retrying
                time.sleep(5)
        except Exception as e:
            print(e)
            time.sleep(5)
            print("Error for URL %s: %s" % (url, datetime.datetime.now()))
            print("Retrying.")
    return req
url = base + node + fields + parameters
base : sets the Facebook Graph API version, here v2.6
node : which post's comments to fetch, built from the status id
fields : which kinds of data to retrieve
parameters : the access token and how many items to fetch per request (num_comments)
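As a concrete illustration, here is how those four pieces combine into a single request URL. The status id `123_456` and the token `APP_ID|APP_SECRET` are made-up placeholder values, not real credentials:

```python
# Sketch of the URL assembly described above, with placeholder values:
# "123_456" is a hypothetical status id and the token is a dummy string.
base = "https://graph.facebook.com/v2.6"
node = "/%s/comments" % "123_456"
fields = "?fields=id,message,like_count,created_time,comments,from,attachment"
parameters = "&order=chronological&limit=%s&access_token=%s" % (100, "APP_ID|APP_SECRET")
url = base + node + fields + parameters
print(url)
```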
In [4]:
def getFacebookCommentFeedData(status_id, access_token, num_comments):
    base = "https://graph.facebook.com/v2.6"
    node = "/%s/comments" % status_id
    fields = "?fields=id,message,like_count,created_time,comments,from,attachment"
    parameters = "&order=chronological&limit=%s&access_token=%s" % \
        (num_comments, access_token)
    url = base + node + fields + parameters
    # Fetch the data
    data = request_until_succeed(url)
    if data is None:
        return None
    else:
        return data.json()
In [5]:
def processFacebookComment(comment, status_id, parent_id=''):
    # Check whether each field is present and handle missing values
    comment_id = comment['id']
    comment_author = comment['from']['name']
    if 'message' not in comment:
        comment_message = ''
    else:
        comment_message = comment['message']
    if 'like_count' not in comment:
        comment_likes = 0
    else:
        comment_likes = comment['like_count']
    if 'attachment' in comment:
        attach_tag = "[[%s]]" % comment['attachment']['type'].upper()
        if comment_message == '':
            comment_message = attach_tag
        else:
            comment_message = comment_message + " " + attach_tag
    comment_published = datetime.datetime.strptime(comment['created_time'],
                                                   '%Y-%m-%dT%H:%M:%S+0000')
    # Shift to the local time zone (TW, UTC+8)
    comment_published = comment_published + datetime.timedelta(hours=8)
    comment_published = comment_published.strftime('%Y-%m-%d %H:%M:%S')
    # Return the data as a tuple
    return (comment_id, status_id, parent_id, comment_message, comment_author,
            comment_published, comment_likes)
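The timestamp handling above can be checked in isolation. This small sketch applies the same parse-and-shift logic to a made-up created_time string:

```python
import datetime

# Same parse-and-shift logic as in processFacebookComment,
# applied to a made-up created_time value.
raw = "2017-01-01T23:30:00+0000"
published = datetime.datetime.strptime(raw, '%Y-%m-%dT%H:%M:%S+0000')
published = published + datetime.timedelta(hours=8)   # UTC -> TW (UTC+8)
local = published.strftime('%Y-%m-%d %H:%M:%S')
print(local)  # 2017-01-02 07:30:00
```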
The main routine works like this: for each post, a while loop pages through all of its comments,
and for each comment another while loop pages through the replies to that comment, so there are two nested while loops in total.
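The paging pattern used inside both of those loops can be sketched with a fake fetch function. Here `fake_fetch` and its two pages are invented stand-ins for the Graph API responses, which carry a `paging.next` URL until the last page:

```python
# Minimal sketch of following 'paging.next' until it disappears.
# fake_fetch simulates a two-page Graph API response (made-up data).
def fake_fetch(url):
    pages = {
        "page1": {"data": [1, 2], "paging": {"next": "page2"}},
        "page2": {"data": [3]},  # no 'next' key -> last page
    }
    return pages[url]

results = []
resp = fake_fetch("page1")
has_next_page = True
while has_next_page:
    results.extend(resp["data"])
    if 'paging' in resp and 'next' in resp['paging']:
        resp = fake_fetch(resp['paging']['next'])
    else:
        has_next_page = False
print(results)  # [1, 2, 3]
```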
In [6]:
def scrapeFacebookPageFeedComments(page_id, access_token, post_path):
    # all_comments holds the results; the first tuple is the header row
    all_comments = [("comment_id", "status_id", "parent_id", "comment_message",
                     "comment_author", "comment_published", "comment_likes")]
    num_processed = 0   # count of comments processed
    scrape_starttime = datetime.datetime.now()
    print("Scraping %s Comments From Posts: %s\n" % (page_id, scrape_starttime))
    post_df = pd.read_csv(post_path)
    for status_id in post_df['status_id']:
        has_next_page = True
        comments = getFacebookCommentFeedData(status_id, access_token, 100)
        while has_next_page and comments is not None:
            for comment in comments['data']:
                all_comments.append(processFacebookComment(comment, status_id))
                if 'comments' in comment:
                    has_next_subpage = True
                    subcomments = getFacebookCommentFeedData(comment['id'], access_token, 100)
                    while has_next_subpage and subcomments is not None:
                        for subcomment in subcomments['data']:
                            all_comments.append(processFacebookComment(
                                subcomment,
                                status_id,
                                comment['id']))
                            num_processed += 1
                            if num_processed % 1000 == 0:
                                print("%s Comments Processed: %s" %
                                      (num_processed,
                                       datetime.datetime.now()))
                        if 'paging' in subcomments:
                            if 'next' in subcomments['paging']:
                                data = request_until_succeed(subcomments['paging']['next'])
                                if data is not None:
                                    subcomments = data.json()
                                else:
                                    subcomments = None
                            else:
                                has_next_subpage = False
                        else:
                            has_next_subpage = False
                num_processed += 1
                if num_processed % 1000 == 0:
                    print("%s Comments Processed: %s" %
                          (num_processed, datetime.datetime.now()))
            if 'paging' in comments:
                if 'next' in comments['paging']:
                    data = request_until_succeed(comments['paging']['next'])
                    if data is not None:
                        comments = data.json()
                    else:
                        comments = None
                else:
                    has_next_page = False
            else:
                has_next_page = False
    print("\nDone!\n%s Comments Processed in %s" %
          (num_processed, datetime.datetime.now() - scrape_starttime))
    return all_comments
The full run produced 690,628 comments (about 106 MB) and took more than ten hours.
all_comments[0] holds the column names
all_comments[1:] holds the structured rows
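The header-row-plus-data-rows layout feeds straight into a DataFrame, as the next cell does. The same pattern on toy data (the two sample rows below are made up):

```python
import pandas as pd

# The first tuple is the header, the rest are data rows.
rows = [("comment_id", "comment_likes"),
        ("c1", 3),
        ("c2", 0)]
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.shape)  # (2, 2)
```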
In [8]:
all_comments = scrapeFacebookPageFeedComments(page_id, access_token, post_path)
df = pd.DataFrame(all_comments[1:], columns=all_comments[0])
In [9]:
path = 'comment/'+page_id+'_comment.csv'
df.to_csv(path, index=False, encoding='utf8')