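Scraping a Single Article's Information

  1. You may run into an "Are you over 18?" confirmation page
  2. Parse the structure of an article under ptt.cc/bbs
  3. Scrape the article
  4. Scrape the comments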

URL https://www.ptt.cc/bbs/Gossiping/M.1537847530.A.E12.html

BACKUP https://afuntw.github.io/Test-Crawling-Website/pages/ptt/M.1537847530.A.E12.html


In [1]:
import requests
import re
import json

from bs4 import BeautifulSoup, NavigableString
from pprint import pprint

In [2]:
ARTICLE_URL = 'https://www.ptt.cc/bbs/Gossiping/M.1537847530.A.E12.html'

Bypassing the age check with cookies

Inspect Developer Tools > Network > request headers
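The request headers shown in the developer tools carry all cookies on a single `Cookie:` line. As a rough sketch, the standard library can turn such a line into the dict that `requests` accepts through its `cookies=` argument. The header value below is a made-up example and `cookie_header_to_dict` is a hypothetical helper name; only the `over18=1` pair is the one the age check is assumed to depend on.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header_value):
    """Convert a raw Cookie header value into a plain {name: value} dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical value copied from the browser's request headers.
raw_header = 'over18=1; theme=dark'
parsed_cookies = cookie_header_to_dict(raw_header)
print(parsed_cookies)
```

The resulting dict can be passed as-is to `requests.get(url, cookies=parsed_cookies)`.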


In [3]:
resp = requests.get(ARTICLE_URL)
if resp.status_code == 200:
    print(resp.text)


<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-print.css" media="print">




	</head>
    <body>
		
<div class="bbs-screen bbs-content">
    <div class="over18-notice">
        <p>本網站已依網站內容分級規定處理</p>

        <p>警告︰您即將進入之看板內容需滿十八歲方可瀏覽。</p>

        <p>若您尚未年滿十八歲,請點選離開。若您已滿十八歲,亦不可將本區之內容派發、傳閱、出售、出租、交給或借予年齡未滿18歲的人士瀏覽,或將本網站內容向該人士出示、播放或放映。</p>
    </div>
</div>

<div class="bbs-screen bbs-content center clear">
    <form action="/ask/over18" method="post">
        <input type="hidden" name="from" value="/bbs/Gossiping/M.1537847530.A.E12.html">
        <div class="over18-button-container">
            <button class="btn-big" type="submit" name="yes" value="yes">我同意,我已年滿十八歲<br><small>進入</small></button>
        </div>
        <div class="over18-button-container">
            <button class="btn-big" type="submit" name="no" value="no">未滿十八歲或不同意本條款<br><small>離開</small></button>
        </div>
    </form>
</div>

		

<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-32365737-1', {
    cookieDomain: 'ptt.cc',
    legacyCookieDomain: 'ptt.cc'
  });
  ga('send', 'pageview');
</script>


		
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/bbs/v2.25/bbs.js"></script>

    </body>
</html>
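Note that the age-check page above was returned with status code 200, so the status code alone cannot tell it apart from a real article. A minimal sketch of a check, assuming the `over18-notice` class and the `/ask/over18` form action seen in the output above remain the markers of that page; `is_age_check_page` is a hypothetical helper name:

```python
def is_age_check_page(html):
    """Return True if the HTML looks like PTT's over-18 confirmation page."""
    return 'over18-notice' in html or 'action="/ask/over18"' in html


# Tiny stand-ins for the two kinds of responses.
gate_html = '<div class="over18-notice"><p>...</p></div>'
article_html = '<div id="main-content" class="bbs-screen bbs-content"></div>'

print(is_age_check_page(gate_html))     # True
print(is_age_check_page(article_html))  # False
```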


In [4]:
cookies = {'over18': '1'}
resp = requests.get(ARTICLE_URL, cookies=cookies)
if resp.status_code == 200:
    print(resp.text)


<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>[問卦] 中央與北大併校 - 看板 Gossiping - 批踢踢實業坊</title>
<meta name="robots" content="all">
<meta name="keywords" content="Ptt BBS 批踢踢">
<meta name="description" content="如題啊,最近陽明跟交大併校吵的很兇,中央都變成台聯大邊緣人了。
為什麼不讓中央跟台北大學併校呢?
中央缺法商剛好北大有,
中央的理工北大沒有,兩校剛好互補,
而且地理位置也不遠,有沒有人想過讓台北大學跟中央合併呢?
">
<meta property="og:site_name" content="Ptt 批踢踢實業坊">
<meta property="og:title" content="[問卦] 中央與北大併校">
<meta property="og:description" content="如題啊,最近陽明跟交大併校吵的很兇,中央都變成台聯大邊緣人了。
為什麼不讓中央跟台北大學併校呢?
中央缺法商剛好北大有,
中央的理工北大沒有,兩校剛好互補,
而且地理位置也不遠,有沒有人想過讓台北大學跟中央合併呢?
">
<link rel="canonical" href="https://www.ptt.cc/bbs/Gossiping/M.1537847530.A.E12.html">

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-print.css" media="print">




	</head>
    <body>
		
<div id="fb-root"></div>
<script>(function(d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (d.getElementById(id)) return;
js = d.createElement(s); js.id = id;
js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script>

<div id="topbar-container">
	<div id="topbar" class="bbs-content">
		<a id="logo" href="/bbs/">批踢踢實業坊</a>
		<span>&rsaquo;</span>
		<a class="board" href="/bbs/Gossiping/index.html"><span class="board-label">看板 </span>Gossiping</a>
		<a class="right small" href="/about.html">關於我們</a>
		<a class="right small" href="/contact.html">聯絡資訊</a>
	</div>
</div>
<div id="navigation-container">
	<div id="navigation" class="bbs-content">
		<a class="board" href="/bbs/Gossiping/index.html">返回看板</a>
		<div class="bar"></div>
		<div class="share">
			<span>分享</span>
			<div class="fb-like" data-send="false" data-layout="button_count" data-width="90" data-show-faces="false" data-href="http://www.ptt.cc/bbs/Gossiping/M.1537847530.A.E12.html"></div>

			<div class="g-plusone" data-size="medium"></div>
<script type="text/javascript">
window.___gcfg = {lang: 'zh-TW'};
(function() {
var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);
})();
</script>

		</div>
	</div>
</div>
<div id="main-container">
    <div id="main-content" class="bbs-screen bbs-content"><div class="article-metaline"><span class="article-meta-tag">作者</span><span class="article-meta-value">R101 (索尼大法好)</span></div><div class="article-metaline-right"><span class="article-meta-tag">看板</span><span class="article-meta-value">Gossiping</span></div><div class="article-metaline"><span class="article-meta-tag">標題</span><span class="article-meta-value">[問卦] 中央與北大併校</span></div><div class="article-metaline"><span class="article-meta-tag">時間</span><span class="article-meta-value">Tue Sep 25 11:52:08 2018</span></div>
如題啊,最近陽明跟交大併校吵的很兇,中央都變成台聯大邊緣人了。
為什麼不讓中央跟台北大學併校呢?
中央缺法商剛好北大有,
中央的理工北大沒有,兩校剛好互補,
而且地理位置也不遠,有沒有人想過讓台北大學跟中央合併呢?
有沒有八卦?

--
<span class="f2">※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 140.115.197.252</span>
<span class="f2">※ 文章網址: </span><a href="https://www.ptt.cc/bbs/Gossiping/M.1537847530.A.E12.html" target="_blank" rel="nofollow"><span class="f2">https://www.ptt.cc/bbs/Gossiping/M.1537847530.A.E12.html</span></a>
<div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">bobobola</span><span class="f3 push-content">: 中央不缺商阿</span><span class="push-ipdatetime">     42.75.76.1 09/25 11:54
</span></div><div class="push"><span class="f1 hl push-tag">→ </span><span class="f3 hl push-userid">nikewang</span><span class="f3 push-content">: 北中和中興合併不就好了</span><span class="push-ipdatetime">121.157.204.247 09/25 11:54
</span></div><div class="push"><span class="f1 hl push-tag">→ </span><span class="f3 hl push-userid">nikewang</span><span class="f3 push-content">: 北大</span><span class="push-ipdatetime">121.157.204.247 09/25 11:54
</span></div><div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">qqq1234</span><span class="f3 push-content">: 北大好不容易才脫離中興獨立 怎可能去併</span><span class="push-ipdatetime">   117.56.55.46 09/25 11:59
</span></div><div class="push"><span class="f1 hl push-tag">→ </span><span class="f3 hl push-userid">Lakland</span><span class="f3 push-content">: 北大跟北科合作一陣子了,中央找體大吧</span><span class="push-ipdatetime">   114.24.29.42 09/25 11:59
</span></div><div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">atlaswhz</span><span class="f3 push-content">: 中央找體大和警大組成桃聯大好了</span><span class="push-ipdatetime">   1.34.181.133 09/25 12:07
</span></div><div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">sooppp</span><span class="f3 push-content">: 體大的聽的懂中央上課在教什麼嗎?</span><span class="push-ipdatetime">223.140.169.234 09/25 12:15
</span></div><div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">homepark</span><span class="f3 push-content">: 中央缺醫學喇</span><span class="push-ipdatetime">223.137.74.137 09/25 12:16
</span></div><div class="push"><span class="f1 hl push-tag">→ </span><span class="f3 hl push-userid">lee457088</span><span class="f3 push-content">: 197.252是哪棟</span><span class="push-ipdatetime">140.115.216.209 09/25 12:18
</span></div>不知道
<div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">mecca</span><span class="f3 push-content">: 當年有文法商理工醫農學院 現在洗洗睡吧</span><span class="push-ipdatetime">210.64.134.103 09/25 12:40
</span></div><span class="f2">※ 編輯: R101 (140.115.130.200), 09/25/2018 15:41:01
</span></div>
    
    <div id="article-polling" data-pollurl="/poll/Gossiping/M.1537847530.A.E12.html?cacheKey=2100-35156991&amp;offset=1735&amp;offset-sig=9c9b4e00053e6d701be35d84593a0332e903e91f" data-longpollurl="/v1/longpoll?id=dda406fae0d0b13b8584a5c9c23b2d3f71497976" data-offset="1735"></div>
    

    
<div class="bbs-screen bbs-footer-message">本網站已依台灣網站內容分級規定處理。此區域為限制級,未滿十八歲者不得瀏覽。</div>

</div>

		

<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-32365737-1', {
    cookieDomain: 'ptt.cc',
    legacyCookieDomain: 'ptt.cc'
  });
  ga('send', 'pageview');
</script>


		
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/bbs/v2.25/bbs.js"></script>

    </body>
</html>


In [5]:
soup = BeautifulSoup(resp.text, 'lxml')

Scraping the article

  • Author id
  • Author nickname
  • Article title
  • Publish time
  • Article content
  • Poster's IP
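The author id and nickname arrive in one string of the form `id (nickname)`. The sketch below shows one way to split it with a single regular expression; `split_author` and `AUTHOR_PATTERN` are hypothetical names, and the pattern assumes the id itself contains neither whitespace nor `(`.

```python
import re

# id = everything up to the first space or "(", nickname = text inside the outer parentheses
AUTHOR_PATTERN = re.compile(r'^(?P<id>[^\s(]+)\s*(?:\((?P<nickname>.*)\))?\s*$')


def split_author(meta_value):
    """Split 'id (nickname)' into (id, nickname); nickname is '' when absent."""
    text = meta_value.strip()
    match = AUTHOR_PATTERN.match(text)
    if not match:
        return text, ''
    return match.group('id'), match.group('nickname') or ''


print(split_author('R101 (索尼大法好)'))
```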

In [6]:
article = {
    'author_id': '',
    'author_nickname': '',
    'title': '',
    'timestamp': '',
    'contents': '',
    'ip': ''
}
article_body = soup.find(id='main-content')

# article header
article_head = article_body.find_all('div', class_='article-metaline')
for metaline in article_head:
    meta_tag = metaline.find(class_='article-meta-tag').text
    meta_value = metaline.find(class_='article-meta-value').text
    if meta_tag == '作者':
        # raw string so that \( and \) are not treated as invalid string escapes
        compile_nickname = re.compile(r'\((.*)\)').search(meta_value)
        article['author_id'] = meta_value.split('(')[0].strip(' ')
        article['author_nickname'] = compile_nickname.group(1) if compile_nickname else ''
    elif meta_tag == '標題':
        article['title'] = meta_value
    elif meta_tag == '時間':
        article['timestamp'] = meta_value

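# article content: keep only the bare text nodes directly under #main-content
contents = [expr for expr in article_body.contents if isinstance(expr, NavigableString)]
# note: this also removes the line breaks inside each text node
contents = [re.sub('\n', '', expr) for expr in contents]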
contents = [i for i in contents if i]
contents = '\n'.join(contents)
article['contents'] = contents

# article publish ip (taken from the first .f2 span)
article_ip = article_body.find(class_='f2').text
compile_ip = re.compile(r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}').search(article_ip)
article['ip'] = compile_ip.group(0) if compile_ip else ''

pprint(article)


{'author_id': 'R101',
 'author_nickname': '索尼大法好',
 'contents': '如題啊,最近陽明跟交大併校吵的很兇,中央都變成台聯大邊緣人了。為什麼不讓中央跟台北大學併校呢?中央缺法商剛好北大有,中央的理工北大沒有,兩校剛好互補,而且地理位置也不遠,有沒有人想過讓台北大學跟中央合併呢?有沒有八卦?--\n'
             '不知道',
 'ip': '140.115.197.252',
 'timestamp': 'Tue Sep 25 11:52:08 2018',
 'title': '[問卦] 中央與北大併校'}
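The `timestamp` field above is still a plain string. A minimal sketch of turning it into a `datetime`, assuming the format always matches the one shown here and that the process runs with an English (C) locale so that `%a` and `%b` recognise `Tue` and `Sep`; `parse_article_time` is a hypothetical helper name:

```python
from datetime import datetime


def parse_article_time(text):
    """Parse a PTT article time such as 'Tue Sep 25 11:52:08 2018'."""
    return datetime.strptime(text.strip(), '%a %b %d %H:%M:%S %Y')


published = parse_article_time('Tue Sep 25 11:52:08 2018')
print(published.isoformat())  # 2018-09-25T11:52:08
```

The resulting value carries no time zone information; the page itself does not state one.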

Scraping the comments

  • Push / boo tag
  • Commenter id
  • Comment content
  • Commenter IP
  • Comment time
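The commenter IP and the comment time share a single `push-ipdatetime` span. As a sketch, both can be pulled out with one regular expression; `split_ipdatetime` and `IPDATETIME_PATTERN` are hypothetical names, and the `MM/DD HH:MM` time layout is assumed from the sample page.

```python
import re

# Optional dotted-quad IP, then a time such as "09/25 11:54".
IPDATETIME_PATTERN = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3})?\s*(?P<time>\d{2}/\d{2} \d{2}:\d{2})'
)


def split_ipdatetime(text):
    """Return (ip, time) from a push-ipdatetime string; ip is '' when missing."""
    match = IPDATETIME_PATTERN.search(text)
    if not match:
        return '', text.strip()
    return match.group('ip') or '', match.group('time')


print(split_ipdatetime('     42.75.76.1 09/25 11:54\n'))
```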

In [7]:
comments = []
for comment in article_body.find_all('div', class_='push'):
    tag = comment.find(class_='push-tag').text
    guest_id = comment.find(class_='push-userid').text
    guest_content = comment.find(class_='push-content').text
    guest_ipdatetime = comment.find(class_='push-ipdatetime').text
    compile_ip = re.compile(r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}').search(guest_ipdatetime)
    guest_ip = compile_ip.group(0) if compile_ip else ''
    # plain str.replace: guest_ip is literal text, not a regex pattern
    guest_timestamp = guest_ipdatetime.replace(guest_ip, '').strip()
    comments.append({
        'tag': tag,
        'id': guest_id,
        'content': guest_content,
        'ip': guest_ip,
        'timestamp': guest_timestamp
    })
pprint(comments)


[{'content': ': 中央不缺商阿',
  'id': 'bobobola',
  'ip': '42.75.76.1',
  'tag': '推 ',
  'timestamp': '09/25 11:54'},
 {'content': ': 北中和中興合併不就好了',
  'id': 'nikewang',
  'ip': '121.157.204.247',
  'tag': '→ ',
  'timestamp': '09/25 11:54'},
 {'content': ': 北大',
  'id': 'nikewang',
  'ip': '121.157.204.247',
  'tag': '→ ',
  'timestamp': '09/25 11:54'},
 {'content': ': 北大好不容易才脫離中興獨立 怎可能去併',
  'id': 'qqq1234',
  'ip': '117.56.55.46',
  'tag': '推 ',
  'timestamp': '09/25 11:59'},
 {'content': ': 北大跟北科合作一陣子了,中央找體大吧',
  'id': 'Lakland',
  'ip': '114.24.29.42',
  'tag': '→ ',
  'timestamp': '09/25 11:59'},
 {'content': ': 中央找體大和警大組成桃聯大好了',
  'id': 'atlaswhz',
  'ip': '1.34.181.133',
  'tag': '推 ',
  'timestamp': '09/25 12:07'},
 {'content': ': 體大的聽的懂中央上課在教什麼嗎?',
  'id': 'sooppp',
  'ip': '223.140.169.234',
  'tag': '推 ',
  'timestamp': '09/25 12:15'},
 {'content': ': 中央缺醫學喇',
  'id': 'homepark',
  'ip': '223.137.74.137',
  'tag': '推 ',
  'timestamp': '09/25 12:16'},
 {'content': ': 197.252是哪棟',
  'id': 'lee457088',
  'ip': '140.115.216.209',
  'tag': '→ ',
  'timestamp': '09/25 12:18'},
 {'content': ': 當年有文法商理工醫農學院 現在洗洗睡吧',
  'id': 'mecca',
  'ip': '210.64.134.103',
  'tag': '推 ',
  'timestamp': '09/25 12:40'}]
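As the output shows, every `tag` keeps a trailing space and every `content` still starts with `: `. A small post-processing sketch that strips both and tallies the tags; `clean_comment` is a hypothetical helper, and the three sample records are copied (without the ip and timestamp keys) from the output above.

```python
from collections import Counter

sample_comments = [
    {'tag': '推 ', 'id': 'bobobola', 'content': ': 中央不缺商阿'},
    {'tag': '→ ', 'id': 'nikewang', 'content': ': 北大'},
    {'tag': '推 ', 'id': 'qqq1234', 'content': ': 北大好不容易才脫離中興獨立 怎可能去併'},
]


def clean_comment(comment):
    """Return a copy with the tag stripped and the leading ':' removed from content."""
    cleaned = dict(comment)
    cleaned['tag'] = comment['tag'].strip()
    content = comment['content']
    if content.startswith(':'):
        content = content[1:]
    cleaned['content'] = content.strip()
    return cleaned


cleaned_comments = [clean_comment(c) for c in sample_comments]
tag_counts = Counter(c['tag'] for c in cleaned_comments)
print(cleaned_comments[0])
print(tag_counts)
```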

Saving the data as a JSON file


In [8]:
article['comments'] = comments
data = [article]
with open('M.1537847530.A.E12.json', 'w+', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
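The file name above is typed by hand; it could also be derived from the article URL. The sketch below does that and then reads the file back to confirm that the non-ASCII text survives the round trip. `article_filename` is a hypothetical helper, the sample record is a cut-down stand-in for `data`, and a temporary directory is used only to keep the example self-contained.

```python
import json
import os
import posixpath
import tempfile
from urllib.parse import urlparse


def article_filename(url):
    """Turn '.../M.1537847530.A.E12.html' into 'M.1537847530.A.E12.json'."""
    name = posixpath.basename(urlparse(url).path)
    stem, _ext = posixpath.splitext(name)
    return stem + '.json'


url = 'https://www.ptt.cc/bbs/Gossiping/M.1537847530.A.E12.html'
sample = [{'title': '[問卦] 中央與北大併校', 'comments': []}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, article_filename(url))
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f, indent=2, ensure_ascii=False)
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)

print(article_filename(url))
print(loaded == sample)
```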