How to scrape Facebook fan page data (posts)?

Fan page data is retrieved through the Facebook Graph API, but using the Graph API requires credentials, and there are two ways to obtain them:
The first is to get an Access Token.
The second is to create a Facebook App and use that app's ID and secret as the credentials.
The difference is that the first method is time-limited: the Access Token expires and must be
refreshed periodically before it can be used again.

This article uses the second method,
so the first step is to obtain the app's credentials: app_id and app_secret.


In [1]:
# Load Python packages
import requests
import datetime
import time
import pandas as pd

Step 1 - Obtain the app credentials (app_id, app_secret)
Step 2 - Enter the id of the fan page to analyze (page_id)
[Tutorial] How to apply for and create a Facebook App ID


In [2]:
# id of the fan page to analyze
page_id = "appledaily.tw"

app_id = ""
app_secret = ""

access_token = app_id + "|" + app_secret

The basic idea is to send requests to the Facebook Graph API to retrieve data.
A request is simply a URL; depending on how you build it (which fields you ask for), it returns the data you need.

When scraping a large fan page, however, sending too many requests can trigger errors.
The fix is simple: wrap the request in a while loop, and whenever an error occurs, sleep for 5 seconds and then resend the request.

The scraper is built from five functions:

request_until_succeed
makes sure a request eventually succeeds

getFacebookPageFeedData
fetches each post's fields (message, link, created_time, type, name, id, ...)

getReactionsForStatus
fetches the post's reaction counts (like, angry, sad, ...)

processFacebookPageFeedStatus
takes the raw data from getFacebookPageFeedData and structures it

scrapeFacebookPageFeedStatus
is the main routine


In [3]:
# Check that the response is OK (status 200); otherwise wait five seconds and retry
def request_until_succeed(url):
    success = False
    while success is False:
        try:
            req = requests.get(url)
            if req.status_code == 200:
                success = True
            else:
                # A non-200 response also backs off five seconds before retrying
                time.sleep(5)
                print("Error for URL %s: %s" % (url, datetime.datetime.now()))
                print("Retrying.")
        except Exception as e:
            print(e)
            time.sleep(5)
            print("Error for URL %s: %s" % (url, datetime.datetime.now()))
            print("Retrying.")

    return req

url = base + node + fields + parameters
base : sets the Facebook Graph API version, here v2.6
node : which fan page's posts to analyze, set via page_id
fields : the kinds of data you want to retrieve
parameters : the access credentials and how many posts to fetch per request (num_statuses)
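As a quick sketch of what the assembled request URL looks like, with placeholder credentials and a shortened field list (not a real token):

```python
# Placeholder values, just to show the shape of the final URL
base = "https://graph.facebook.com/v2.6"
node = "/%s/posts" % "appledaily.tw"
fields = "/?fields=message,link,created_time"
parameters = "&limit=%s&access_token=%s" % (100, "APP_ID|APP_SECRET")
url = base + node + fields + parameters
print(url)
# https://graph.facebook.com/v2.6/appledaily.tw/posts/?fields=message,link,created_time&limit=100&access_token=APP_ID|APP_SECRET
```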


In [4]:
# Retrieve the Facebook page feed data
def getFacebookPageFeedData(page_id, access_token, num_statuses):

    # Construct the URL string; see http://stackoverflow.com/a/37239851 for
    # Reactions parameters
    base = "https://graph.facebook.com/v2.6"
    node = "/%s/posts" % page_id 
    fields = "/?fields=message,link,created_time,type,name,id," + \
            "comments.limit(0).summary(true),shares,reactions" + \
            ".limit(0).summary(true)"
    parameters = "&limit=%s&access_token=%s" % (num_statuses, access_token)
    url = base + node + fields + parameters

    # Fetch the data
    data = request_until_succeed(url).json()
    return data

In [5]:
# Get the post's reaction counts: like, love, wow, haha, sad, angry
def getReactionsForStatus(status_id, access_token):

    # See http://stackoverflow.com/a/37239851 for Reactions parameters;
    # Reactions are only accessible at a single-post endpoint

    base = "https://graph.facebook.com/v2.6"
    node = "/%s" % status_id
    reactions = "/?fields=" \
            "reactions.type(LIKE).limit(0).summary(total_count).as(like)" \
            ",reactions.type(LOVE).limit(0).summary(total_count).as(love)" \
            ",reactions.type(WOW).limit(0).summary(total_count).as(wow)" \
            ",reactions.type(HAHA).limit(0).summary(total_count).as(haha)" \
            ",reactions.type(SAD).limit(0).summary(total_count).as(sad)" \
            ",reactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
    parameters = "&access_token=%s" % access_token
    url = base + node + reactions + parameters

    # Fetch the data
    data = request_until_succeed(url).json()
    return data

Build status_link, a link back to the original post on Facebook.
status_published = status_published + datetime.timedelta(hours=8) shifts the timestamp to the local time zone (TW is UTC+8).
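The time-zone conversion on its own, using a sample created_time string in the format the Graph API returns:

```python
import datetime

# Parse the API's UTC timestamp, then shift it to Taiwan time (UTC+8)
ts = '2017-03-14T10:20:02+0000'
utc = datetime.datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S+0000')
local = utc + datetime.timedelta(hours=8)
print(local.strftime('%Y-%m-%d %H:%M:%S'))  # 2017-03-14 18:20:02
```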


In [6]:
def processFacebookPageFeedStatus(status, access_token):

    # Make sure the optional fields exist before reading them
    status_id = status['id']
    status_type = status['type']
    if 'message' not in status.keys():
        status_message = ''
    else:
        status_message = status['message']
    if 'name' not in status.keys():
        link_name = ''
    else:
        link_name = status['name']
    link = status_id.split('_')
    
    # This link points back to the original post on Facebook
    status_link = 'https://www.facebook.com/'+link[0]+'/posts/'+link[1]

    status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
    # Shift to the local time zone (TW is UTC+8)
    status_published = status_published + datetime.timedelta(hours=8)
    status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') 
    
    # Make sure the count fields exist before reading them
    if 'reactions' not in status:
        num_reactions = 0
    else:
        num_reactions = status['reactions']['summary']['total_count']
    if 'comments' not in status:
        num_comments = 0
    else:
        num_comments = status['comments']['summary']['total_count']
    if 'shares' not in status:
        num_shares = 0
    else:
        num_shares = status['shares']['count']

    def get_num_total_reactions(reaction_type, reactions):
        if reaction_type not in reactions:
            return 0
        else:
            return reactions[reaction_type]['summary']['total_count']
    
    # Get the post's reaction counts: like, love, wow, haha, sad, angry
    reactions = getReactionsForStatus(status_id, access_token)
    
    num_loves = get_num_total_reactions('love', reactions)
    num_wows = get_num_total_reactions('wow', reactions)
    num_hahas = get_num_total_reactions('haha', reactions)
    num_sads = get_num_total_reactions('sad', reactions)
    num_angrys = get_num_total_reactions('angry', reactions)
    num_likes = get_num_total_reactions('like', reactions)

    # Return the data as a tuple
    return (status_id, status_message, link_name, status_type, status_link,
            status_published, num_reactions, num_comments, num_shares,
            num_likes, num_loves, num_wows, num_hahas, num_sads, num_angrys)

Suppose a fan page has 250 posts.
The first call to getFacebookPageFeedData builds a URL, which request_until_succeed sends
to obtain the first dictionary.
That dictionary has two keys: one is data (holding 100 posts),
the other is next (the URL for the next 100 posts; sending it returns another dictionary with the same two keys, data and next).
1st request -> data: posts 1-100, next: URL for the next 100 posts
2nd request -> data: posts 101-200, next: URL for the next 100 posts
3rd request -> data: posts 201-250, next: none (there is no next 100)
Three requests in total.

Because Facebook returns at most 100 posts per request, a page with more than 100 posts
comes with a next URL; sending that URL fetches the next 100 posts, and has_next_page
decides whether to keep going.
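A mock of the response shape this loop relies on (illustrative only, not real API data):

```python
# First page: 'data' holds up to 100 posts, 'paging'['next'] is the URL for the next page
statuses = {
    "data": [{"id": "PAGEID_POSTID1"}, {"id": "PAGEID_POSTID2"}],
    "paging": {"next": "https://graph.facebook.com/v2.6/..."},
}
# Last page: no next URL, so the loop stops
last_page = {"data": [{"id": "PAGEID_POSTID3"}]}

has_next_page = "paging" in statuses         # True -> fetch statuses['paging']['next']
print(has_next_page, "paging" in last_page)  # True False
```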

num_processed counts how many posts have been processed; the elapsed time is printed every 100 posts.

Finally the result is written to a csv file for the later chapters to analyze and model.


In [7]:
def scrapeFacebookPageFeedStatus(page_id, access_token):
    # all_statuses stores the results; the column names go in first
    all_statuses = [('status_id', 'status_message', 'link_name', 'status_type', 'status_link',
            'status_published', 'num_reactions', 'num_comments', 'num_shares',
            'num_likes', 'num_loves', 'num_wows', 'num_hahas', 'num_sads', 'num_angrys')]
    
    has_next_page = True 
    num_processed = 0   # count of posts processed
    scrape_starttime = datetime.datetime.now()

    print("Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime))

    statuses = getFacebookPageFeedData(page_id, access_token, 100)

    while has_next_page:
        for status in statuses['data']:

            # Only append the structured data to all_statuses if the post has reactions
            if 'reactions' in status:
                all_statuses.append(processFacebookPageFeedStatus(status,access_token))

            # Track progress: print the time every 100 posts
            num_processed += 1
            if num_processed % 100 == 0:
                print("%s Statuses Processed: %s" % (num_processed, datetime.datetime.now()))

        # Every 100 posts there is a next URL for the following 100;
        # keep following it until there is no next page
        if 'paging' in statuses and 'next' in statuses['paging']:
            statuses = request_until_succeed(statuses['paging']['next']).json()
        else:
            has_next_page = False

    print("\nDone!\n%s Statuses Processed in %s" % \
        (num_processed, datetime.datetime.now() - scrape_starttime))
    
    return all_statuses

In [8]:
all_statuses = scrapeFacebookPageFeedStatus(page_id, access_token)


Scraping appledaily.tw Facebook Page: 2017-03-14 18:24:27.344058

100 Statuses Processed: 2017-03-14 18:24:52.341175
200 Statuses Processed: 2017-03-14 18:25:17.114291
300 Statuses Processed: 2017-03-14 18:25:42.147270
400 Statuses Processed: 2017-03-14 18:26:08.442417
500 Statuses Processed: 2017-03-14 18:26:34.240028
600 Statuses Processed: 2017-03-14 18:26:59.547042
700 Statuses Processed: 2017-03-14 18:27:24.895273
800 Statuses Processed: 2017-03-14 18:27:49.752256
900 Statuses Processed: 2017-03-14 18:28:19.682193
1000 Statuses Processed: 2017-03-14 18:28:45.043433
1100 Statuses Processed: 2017-03-14 18:29:10.427917
1200 Statuses Processed: 2017-03-14 18:29:35.867298
1300 Statuses Processed: 2017-03-14 18:30:01.426084
1400 Statuses Processed: 2017-03-14 18:30:27.023855
1500 Statuses Processed: 2017-03-14 18:30:52.605395
1600 Statuses Processed: 2017-03-14 18:31:19.059738
1700 Statuses Processed: 2017-03-14 18:31:44.696886
1800 Statuses Processed: 2017-03-14 18:32:10.995391
1900 Statuses Processed: 2017-03-14 18:32:36.193756
2000 Statuses Processed: 2017-03-14 18:33:01.537011
2100 Statuses Processed: 2017-03-14 18:33:28.878967
2200 Statuses Processed: 2017-03-14 18:33:55.079628
2300 Statuses Processed: 2017-03-14 18:34:23.744874
2400 Statuses Processed: 2017-03-14 18:34:48.970759
2500 Statuses Processed: 2017-03-14 18:35:14.022596
2600 Statuses Processed: 2017-03-14 18:35:39.631498
2700 Statuses Processed: 2017-03-14 18:36:05.411284
2800 Statuses Processed: 2017-03-14 18:36:31.221738
2900 Statuses Processed: 2017-03-14 18:36:57.033939
3000 Statuses Processed: 2017-03-14 18:37:22.369105
3100 Statuses Processed: 2017-03-14 18:37:48.160849
3200 Statuses Processed: 2017-03-14 18:38:14.836224
3300 Statuses Processed: 2017-03-14 18:38:40.125286
3400 Statuses Processed: 2017-03-14 18:39:04.902551
3500 Statuses Processed: 2017-03-14 18:39:30.268135
3600 Statuses Processed: 2017-03-14 18:39:56.086407
3700 Statuses Processed: 2017-03-14 18:40:21.084072
3800 Statuses Processed: 2017-03-14 18:40:46.285839
3900 Statuses Processed: 2017-03-14 18:41:11.102542
4000 Statuses Processed: 2017-03-14 18:41:36.079854
4100 Statuses Processed: 2017-03-14 18:42:01.454794
4200 Statuses Processed: 2017-03-14 18:42:26.706651
4300 Statuses Processed: 2017-03-14 18:42:51.357438
4400 Statuses Processed: 2017-03-14 18:43:16.144430
4500 Statuses Processed: 2017-03-14 18:43:41.703740
4600 Statuses Processed: 2017-03-14 18:44:06.435289
4700 Statuses Processed: 2017-03-14 18:44:30.746588
4800 Statuses Processed: 2017-03-14 18:44:55.695510
4900 Statuses Processed: 2017-03-14 18:45:19.992582
5000 Statuses Processed: 2017-03-14 18:45:44.352221
5100 Statuses Processed: 2017-03-14 18:46:08.782931
5200 Statuses Processed: 2017-03-14 18:46:33.469404

Done!
5234 Statuses Processed in 0:22:14.865004

Scraping the 5234 posts took about 22 minutes; save the result as a csv and hand it to the next chapter for analysis.
all_statuses[0] holds the column names
all_statuses[1:] holds the structured rows


In [9]:
df = pd.DataFrame(all_statuses[1:], columns=all_statuses[0])

In [10]:
df.head()


Out[10]:
status_id status_message link_name status_type status_link status_published num_reactions num_comments num_shares num_likes num_loves num_wows num_hahas num_sads num_angrys
0 232633627068_10155689734022069 加油!放寬心才能走出來\n \n#金曲男星 #蛋堡 #邱振熙 蛋堡 Soft Lipa 【壹週刊】​金曲男星進精神療養院 曾入圍歌王 link https://www.facebook.com/232633627068/posts/10... 2017-03-14 18:20:02 275 3 0 240 13 14 3 4 1
1 232633627068_10155689752042069 #最新 趕快清查把該抓的抓起來!\n \n相關→ 自殺副局長12年前與晶鑽搭上線 多次提供開... 【晶鑽弊案】北市高官也涉貪 建管處前主秘遭搜索約談 link https://www.facebook.com/232633627068/posts/10... 2017-03-14 17:59:25 157 8 0 141 3 2 7 0 4
2 232633627068_10155689484782069 #慎入 這就跟把雞排放進我嘴裡又不讓我咬一樣呀...... #宅編\n  \n完整 #動新聞... 【大咬片】馴獸師把頭放進鱷魚嘴 被咬得血流滿面 video https://www.facebook.com/232633627068/posts/10... 2017-03-14 17:50:00 269 24 4 210 4 29 24 2 0
3 232633627068_10155689727032069 距離周末前往台中還有...好久 #隨編\n \n#正妹 #紅豆餅妹 #朝聖啦 #蕭卉君 \n... 清新紅豆餅妹藏逆天「胸器」!網友揪朝聖啦 link https://www.facebook.com/232633627068/posts/10... 2017-03-14 17:40:00 2904 109 144 2802 38 44 18 1 1
4 232633627068_10155689539617069 Betty批「這種人根本不配當攝影師,很沒道德」\n  \n【完整 #動新聞】大尺度女模控無... 大尺度女模控無良攝影師 外流露點走光照 video https://www.facebook.com/232633627068/posts/10... 2017-03-14 17:30:00 595 18 7 496 8 21 4 2 64

In [11]:
path = 'post/'+page_id+'_post.csv'
df.to_csv(path,index=False,encoding='utf8')
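A minimal round-trip sketch of the same export, using a tiny frame and a placeholder file name (demo_post.csv) instead of the real path:

```python
import pandas as pd

# Write a small frame with to_csv the same way, then read it back
demo = pd.DataFrame([('id1', 5), ('id2', 7)], columns=['status_id', 'num_likes'])
demo.to_csv('demo_post.csv', index=False, encoding='utf8')
restored = pd.read_csv('demo_post.csv')
print(restored.shape)  # (2, 2)
```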

In [ ]: