How to scrape Facebook fan page data (comments)?

Basically, fan page data is fetched through the Facebook Graph API, but using the Graph API also requires credentials. There are two ways to get them:
The first is to obtain an Access Token.
The second is to create a Facebook App and use that app's ID and secret as the credentials.
The difference is that the first method is time-limited: the Access Token expires and must be refreshed periodically before it can be used again.

This article uses the second method,
so the first step is to obtain the app's credentials: app_id and app_secret.


In [1]:
# Load the Python packages
import requests
import datetime
import time
import pandas as pd

Step 1 - obtain the app's credentials (app_id, app_secret)
Step 2 - enter the id of the fan page you want to analyze
[Tutorial] How to apply for and create a Facebook App ID


In [2]:
# Fan page id
page_id = "appledaily.tw"

# App credentials
app_id = ""
app_secret = ""

# CSV file of posts scraped in the previous article
post_path = 'post/'+page_id+'_post.csv'

access_token = app_id + "|" + app_secret


This article follows on from the previous one, fb fan page analysis and report - scraping (posts).
The previous article scraped every post on the fan page; the goal here is to scrape all the
comments under each post. That requires each post's id, so the previous article's output file is needed.

Basically, four functions do the work:

request_until_succeed
ensures each request eventually completes

getFacebookCommentFeedData
fetches each comment's data (message, like_count, created_time, comments, from, ...)

processFacebookComment
takes the data returned by getFacebookCommentFeedData and structures it

scrapeFacebookPageFeedComments
the main routine


In [3]:
# Check the response status: 200 is OK; on 400 give up; otherwise retry after five seconds
def request_until_succeed(url):
    success = False
    while success is False:
        try:
            req = requests.get(url)
            if req.status_code == 200:
                success = True
            elif req.status_code == 400:
                return None
            else:
                time.sleep(5)
        except Exception as e:
            print(e)
            time.sleep(5)
            print("Error for URL %s: %s" % (url, datetime.datetime.now()))
            print("Retrying.")
    return req
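The retry pattern above can be exercised without touching the network. Below is a minimal sketch with a stubbed fetcher; `fetch_with_retry` and the status sequence are hypothetical, for illustration only:

```python
import time

def fetch_with_retry(fetch):
    """Call fetch() until it returns 200; give up on a 400, like request_until_succeed."""
    while True:
        status = fetch()
        if status == 200:
            return "ok"
        if status == 400:
            return None
        time.sleep(0)  # the real scraper sleeps 5 seconds between retries

# Stub that fails once with a 500, then succeeds
responses = iter([500, 200])
print(fetch_with_retry(lambda: next(responses)))  # → ok
```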

url = base + node + fields + parameters
base : sets the Facebook Graph API version, here v2.6
node : which post's comments to fetch, set via status_id
fields : the kinds of data you want returned
parameters : the credentials and how many items to fetch per request (num_comments)
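Assembled with placeholder values (POST_ID and the dummy token below are stand-ins, not real credentials), the final URL looks like this:

```python
# Same assembly as in getFacebookCommentFeedData, with dummy values filled in
base = "https://graph.facebook.com/v2.6"
node = "/%s/comments" % "POST_ID"
fields = "?fields=id,message,like_count,created_time,comments,from,attachment"
parameters = "&order=chronological&limit=%s&access_token=%s" % (100, "APP_ID|APP_SECRET")

url = base + node + fields + parameters
print(url)
```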


In [4]:
def getFacebookCommentFeedData(status_id, access_token, num_comments):
    
    base = "https://graph.facebook.com/v2.6"
    node = "/%s/comments" % status_id
    fields = "?fields=id,message,like_count,created_time,comments,from,attachment"
    parameters = "&order=chronological&limit=%s&access_token=%s" % \
            (num_comments, access_token)
    url = base + node + fields + parameters

    # Fetch the data
    data = request_until_succeed(url)
    if data is None:
        return None
    else:
        return data.json()
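The JSON this call returns has roughly the following shape (the values below are made up for illustration, with most fields trimmed): `data` holds the comments on the current page, and `paging.next`, when present, is the URL of the next page:

```python
# Illustrative response payload (fabricated values, trimmed fields)
sample_response = {
    "data": [
        {"id": "1234_5678", "message": "first comment", "like_count": 2},
    ],
    "paging": {"next": "https://graph.facebook.com/v2.6/..."},
}

# The pagination check the main routine performs on each page
has_next_page = "paging" in sample_response and "next" in sample_response["paging"]
print(has_next_page)  # → True
```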

In [5]:
def processFacebookComment(comment, status_id, parent_id = ''):
    
    # Check whether each field is present and fill in defaults
    comment_id = comment['id']
    comment_author = comment['from']['name']
    if 'message' not in comment:
        comment_message = ''
    else:
        comment_message = comment['message']
    if 'like_count' not in comment:
        comment_likes = 0 
    else:
        comment_likes = comment['like_count']
    
    if 'attachment' in comment:
        attach_tag = "[[%s]]" % comment['attachment']['type'].upper()
        if comment_message == '':
            comment_message = attach_tag
        else:
            comment_message = comment_message + " " + attach_tag

    comment_published = datetime.datetime.strptime(comment['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
    # Shift to the local time zone (Taiwan, UTC+8)
    comment_published = comment_published + datetime.timedelta(hours=8)
    comment_published = comment_published.strftime('%Y-%m-%d %H:%M:%S')
    
    # Return the data as a tuple
    return (comment_id, status_id, parent_id, comment_message, comment_author,
            comment_published, comment_likes)
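The defaulting and time-zone logic above can be traced on a standalone sample; the comment dict below is hypothetical and omits the `message` field on purpose:

```python
import datetime

# Hypothetical comment payload with no 'message' field
comment = {
    "id": "1234_5678",
    "from": {"name": "Alice"},
    "like_count": 3,
    "created_time": "2017-06-01T04:30:00+0000",
}

# Missing fields fall back to defaults, as in processFacebookComment
comment_message = comment["message"] if "message" in comment else ""
comment_likes = comment["like_count"] if "like_count" in comment else 0

# Parse the UTC timestamp and shift it to Taiwan time (UTC+8)
published = datetime.datetime.strptime(comment["created_time"], "%Y-%m-%dT%H:%M:%S+0000")
published = (published + datetime.timedelta(hours=8)).strftime("%Y-%m-%d %H:%M:%S")

print(comment_message, comment_likes, published)
```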

The main routine works like this: for each post, a while loop pages through all of its comments,
and for each comment another while loop pages through the replies to that comment, so there are two nested while loops.


In [6]:
def scrapeFacebookPageFeedComments(page_id, access_token, post_path):
    
    # all_comments is the list that stores the results; the column names go in first
    all_comments = [("comment_id", "status_id", "parent_id", "comment_message",
        "comment_author", "comment_published", "comment_likes")]

    num_processed = 0   # counts how many comments have been processed
    scrape_starttime = datetime.datetime.now()

    print("Scraping %s Comments From Posts: %s\n" % (page_id, scrape_starttime))
    
    post_df = pd.read_csv(post_path)

    for status_id in post_df['status_id']:
        has_next_page = True

        comments = getFacebookCommentFeedData(status_id, access_token, 100)

        while has_next_page and comments is not None:
            for comment in comments['data']:
                
                all_comments.append(processFacebookComment(comment, status_id))
                if 'comments' in comment:
                    has_next_subpage = True

                    subcomments = getFacebookCommentFeedData(comment['id'], access_token, 100)

                    while has_next_subpage and subcomments is not None:
                        for subcomment in subcomments['data']:
                            all_comments.append(processFacebookComment(
                                    subcomment,
                                    status_id,
                                    comment['id']))

                            num_processed += 1
                            if num_processed % 1000 == 0:
                                print("%s Comments Processed: %s" %
                                      (num_processed,
                                       datetime.datetime.now()))

                        if 'paging' in subcomments:
                            if 'next' in subcomments['paging']:
                                data = request_until_succeed(subcomments['paging']['next'])
                                if data is not None:
                                    subcomments = data.json()
                                else:
                                    subcomments = None
                            else:
                                has_next_subpage = False
                        else:
                            has_next_subpage = False
                            
                num_processed += 1
                if num_processed % 1000 == 0:
                    print("%s Comments Processed: %s" %
                          (num_processed, datetime.datetime.now()))

            if 'paging' in comments:
                if 'next' in comments['paging']:
                    data = request_until_succeed(comments['paging']['next'])
                    if data is not None:
                        comments = data.json()
                    else:
                        comments = None
                else:
                    has_next_page = False
            else:
                has_next_page = False
                
    print("\nDone!\n%s Comments Processed in %s" %
          (num_processed, datetime.datetime.now() - scrape_starttime))
    return all_comments

The full run yields 690,628 rows (about 106 MB) and takes a dozen or so hours.
all_comments[0] holds the column names
all_comments[1:] holds the structured data
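The DataFrame construction in the next cell works on exactly this shape; a toy list with one made-up row shows the split between header and data:

```python
import pandas as pd

# Toy result list in the same shape: row 0 is the header, the rest are data rows
all_comments = [
    ("comment_id", "status_id", "parent_id", "comment_message",
     "comment_author", "comment_published", "comment_likes"),
    ("1234_5678", "1234", "", "hello", "Alice", "2017-06-01 12:30:00", 3),
]

df = pd.DataFrame(all_comments[1:], columns=all_comments[0])
print(df.shape)  # → (1, 7)
```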


In [8]:
all_comments = scrapeFacebookPageFeedComments(page_id, access_token, post_path)
df = pd.DataFrame(all_comments[1:], columns=all_comments[0])

In [9]:
path = 'comment/'+page_id+'_comment.csv'
df.to_csv(path,index=False,encoding='utf8')

