In [4]:
import urllib2
from bs4 import BeautifulSoup
Beautiful Soup is a Python library designed for quick turnaround projects like screen scraping. Three features make it powerful:
- It provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree, so it takes very little code to dissect a document and extract what you need.
- It automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so you rarely have to think about encodings.
- It sits on top of popular Python parsers like lxml and html5lib, letting you try different parsing strategies or trade speed for flexibility.
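Before reading files or live pages, here is a minimal, self-contained sketch (the inline HTML snippet is invented purely for illustration) of the basic workflow: build a soup from a string, then pull out a tag, its text, and an attribute.

from bs4 import BeautifulSoup

demo_html = '<html><head><title>Demo</title></head>' \
            '<body><p class="title"><a href="http://example.com" id="link1">A link</a></p></body></html>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')
print demo_soup.title.string   # Demo
print demo_soup.p['class']     # ['title']
print demo_soup.a['href']      # http://example.com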
In [25]:
url = 'file:///Users/chengjun/GitHub/cjc2016/data/test.html'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, 'html.parser')
soup
Out[25]:
In [26]:
print(soup.prettify())
In [72]:
for tag in soup.find_all(True):
    print(tag.name)
In [58]:
soup('head') # or soup.head
Out[58]:
In [59]:
soup('body') # or soup.body
Out[59]:
In [29]:
soup('title') # or soup.title
Out[29]:
In [60]:
soup('p')
Out[60]:
In [62]:
soup.p
Out[62]:
In [30]:
soup.title.name
Out[30]:
In [31]:
soup.title.string
Out[31]:
In [48]:
soup.title.text
Out[48]:
In [32]:
soup.title.parent.name
Out[32]:
In [33]:
soup.p
Out[33]:
In [34]:
soup.p['class']
Out[34]:
In [50]:
soup.find_all('p', {'class': 'title'})
Out[50]:
In [78]:
soup.find_all('p', class_= 'title')
Out[78]:
In [49]:
soup.find_all('p', {'class': 'story'})
Out[49]:
In [57]:
soup.find_all('p', {'class': 'story'})[0].find_all('a')
Out[57]:
In [35]:
soup.a
Out[35]:
In [79]:
soup('a')
Out[79]:
In [37]:
soup.find(id="link3")
Out[37]:
In [36]:
soup.find_all('a')
Out[36]:
In [80]:
soup.find_all('a', {'class': 'sister'}) # compare with soup.find_all('a')
Out[80]:
In [81]:
soup.find_all('a', {'class': 'sister'})[0]
Out[81]:
In [44]:
soup.find_all('a', {'class': 'sister'})[0].text
Out[44]:
In [46]:
soup.find_all('a', {'class': 'sister'})[0]['href']
Out[46]:
In [47]:
soup.find_all('a', {'class': 'sister'})[0]['id']
Out[47]:
In [71]:
soup.find_all(["a", "b"])
Out[71]:
In [38]:
print(soup.get_text())
In [102]:
url = "http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&\
mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd"
content = urllib2.urlopen(url).read() # fetch the HTML text of the page
soup = BeautifulSoup(content, 'html.parser')
print soup.title.text
print soup.find('div', {'class': 'rich_media_meta_list'}).find(id = 'post-date').text
print soup.find('div', {'class': 'rich_media_content'}).get_text()
In [2]:
from IPython.display import display_html, HTML
HTML('<iframe src=http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX width=1000 height=500></iframe>')
# the webpage we would like to crawl
Out[2]:
In [5]:
page_num = 0
url = "http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX" % page_num
content = urllib2.urlopen(url).read() # fetch the HTML text of the page
soup = BeautifulSoup(content, "lxml")
articles = soup.find_all('tr')
In [7]:
print articles[0]
In [8]:
print articles[1]
In [9]:
len(articles[1:])
Out[9]:
http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=PX
In [20]:
for t in articles[1].find_all('td'): print t
In [21]:
td = articles[1].find_all('td')
In [23]:
print td[0]
In [29]:
print td[0].text
In [30]:
print td[0].text.strip()
In [31]:
print td[0].a['href']
In [24]:
print td[1]
In [25]:
print td[2]
In [26]:
print td[3]
In [27]:
print td[4]
In [11]:
records = []
for i in articles[1:]:
    td = i.find_all('td')
    title = td[0].text.strip()
    title_url = td[0].a['href']
    author = td[1].text
    author_url = td[1].a['href']
    views = td[2].text
    replies = td[3].text
    date = td[4]['title']
    record = title + '\t' + title_url + '\t' + author + '\t' + author_url + '\t' + views + '\t' + replies + '\t' + date
    records.append(record)
In [16]:
print records[2]
In [85]:
def crawler(page_num, file_name):
    try:
        # fetch the listing page for this page number
        url = "http://bbs.tianya.cn/list.jsp?item=free&nextid=%d&order=8&k=PX" % page_num
        content = urllib2.urlopen(url).read() # fetch the HTML text of the page
        soup = BeautifulSoup(content, "lxml")
        articles = soup.find_all('tr')
        # write down the info of each thread
        for i in articles[1:]:
            td = i.find_all('td')
            title = td[0].text.strip()
            title_url = td[0].a['href']
            author = td[1].text
            author_url = td[1].a['href']
            views = td[2].text
            replies = td[3].text
            date = td[4]['title']
            record = title + '\t' + title_url + '\t' + author + '\t' + \
                author_url + '\t' + views + '\t' + replies + '\t' + date
            with open(file_name, 'a') as p:  # Note: append mode, run only once!
                p.write(record.encode('utf-8') + "\n")  # encode as utf-8 to avoid encoding errors
    except Exception, e:
        print e
        pass
In [97]:
# crawl all pages
for page_num in range(10):
    print(page_num)
    crawler(page_num, '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_list.txt')
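When looping over many list pages like this, it is gentler on the server to pause between requests. A small variant of the loop above, a sketch that reuses the random pause the post crawler further below also uses (same placeholder output path as above):

import random
import time

for page_num in range(10):
    print(page_num)
    crawler(page_num, '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_list.txt')
    time.sleep(1 + random.random())  # wait 1-2 seconds between list pages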
In [304]:
import pandas as pd
df = pd.read_csv('/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_list.txt', sep = "\t", header=None)
df[:2]
Out[304]:
In [305]:
len(df)
Out[305]:
In [306]:
df=df.rename(columns = {0:'title', 1:'link', 2:'author',3:'author_page', 4:'click', 5:'reply', 6:'time'})
df[:2]
Out[306]:
In [307]:
len(df.link)
Out[307]:
In [309]:
df.author_page[:5]
Out[309]:
In [408]:
user_info
Out[408]:
In [413]:
# user_info = soup.find('div', {'class': 'userinfo'})('p')
# user_infos = [i.get_text()[4:] for i in user_info]
def author_crawler(url, file_name):
    try:
        content = urllib2.urlopen(url).read() # fetch the HTML text of the page
        soup = BeautifulSoup(content, "lxml")
        link_info = soup.find_all('div', {'class': 'link-box'})
        followed_num, fans_num = [i.a.text for i in link_info]
        try:
            activity = soup.find_all('span', {'class': 'subtitle'})
            post_num, reply_num = [j.text[2:] for i in activity[:1] for j in i('a')]
        except:
            post_num, reply_num = '1', '0'  # keep these as strings so the join below works
        record = '\t'.join([url, followed_num, fans_num, post_num, reply_num])
        with open(file_name, 'a') as p:  # Note: append mode, run only once!
            p.write(record.encode('utf-8') + "\n")  # encode as utf-8 to avoid encoding errors
    except Exception, e:
        print e, url
        record = '\t'.join([url, 'na', 'na', 'na', 'na'])
        with open(file_name, 'a') as p:  # Note: append mode, run only once!
            p.write(record.encode('utf-8') + "\n")  # encode as utf-8 to avoid encoding errors
        pass
In [414]:
for k, url in enumerate(df.author_page):
    if k % 10 == 0:
        print k
    author_crawler(url, '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_author_info.txt')
In [357]:
url = df.author_page[1]
content = urllib2.urlopen(url).read() # fetch the HTML text of the page
soup1 = BeautifulSoup(content, "lxml")
In [359]:
user_info = soup1.find('div', {'class': 'userinfo'})('p')
area, nid, freq_use, last_login_time, reg_time = [i.get_text()[4:] for i in user_info]
print area, nid, freq_use, last_login_time, reg_time
link_info = soup1.find_all('div', {'class': 'link-box'})
followed_num, fans_num = [i.a.text for i in link_info]
print followed_num, fans_num
In [393]:
activity = soup1.find_all('span', {'class': 'subtitle'})
post_num, reply_num = [j.text[2:] for i in activity[:1] for j in i('a')]
print post_num, reply_num
In [386]:
print activity[2]
In [370]:
link_info = soup1.find_all('div', {'class': 'link-box'})
followed_num, fans_num = [i.a.text for i in link_info]
print followed_num, fans_num
In [369]:
link_info[0].a.text
Out[369]:
http://www.tianya.cn/50499450/follow
We can also crawl each user's following list and follower list.
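A hedged sketch of that idea: fetch a user's /follow page (the URL pattern shown above) and collect the profile links on it. The link filter below, anchors whose href is a bare numeric tianya.cn profile path, is an assumption about the page layout, not something verified in this notebook.

import re
import urllib2
from bs4 import BeautifulSoup

def crawl_follow_list(user_url):
    # user_url like 'http://www.tianya.cn/50499450'; '/follow' follows the URL shown above
    content = urllib2.urlopen(user_url + '/follow').read()
    follow_soup = BeautifulSoup(content, 'lxml')
    followed = []
    for a in follow_soup.find_all('a', href=True):
        # assumed filter: profile links look like http://www.tianya.cn/<digits>
        if re.match(r'^http://www\.tianya\.cn/\d+$', a['href']):
            followed.append((a.text.strip(), a['href']))
    return followed

# e.g. crawl_follow_list('http://www.tianya.cn/50499450')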
In [13]:
df.link[2]
Out[13]:
In [15]:
url = 'http://bbs.tianya.cn' + df.link[2]
url
Out[15]:
In [20]:
from IPython.display import display_html, HTML
HTML('<iframe src=http://bbs.tianya.cn/post-free-2848797-1.shtml width=1000 height=500></iframe>')
# the webpage we would like to crawl
Out[20]:
In [18]:
post = urllib2.urlopen(url).read() # fetch the HTML text of the page
post_soup = BeautifulSoup(post, "lxml")
#articles = soup.find_all('tr')
In [123]:
print (post_soup.prettify())[:1000]
In [36]:
pa = post_soup.find_all('div', {'class': 'atl-item'})
len(pa)
Out[36]:
In [38]:
print pa[0]
In [39]:
print pa[1]
In [40]:
print pa[89]
作者:柠檬在追逐 时间:2012-10-28 21:33:55
@lice5 2012-10-28 20:37:17
作为宁波人 还是说一句:革命尚未成功 同志仍需努力
-----------------------------
对 现在说成功还太乐观,就怕说一套做一套
作者:lice5 时间:2012-10-28 20:37:17
作为宁波人 还是说一句:革命尚未成功 同志仍需努力
4 /post-free-4242156-1.shtml 2014-04-09 15:55:35 61943225 野渡自渡人 @Y雷政府34楼2014-04-0422:30:34 野渡君雄文!支持是必须的。 ----------------------------- @清坪过客16楼2014-04-0804:09:48 绝对的权力导致绝对的腐败! ----------------------------- @T大漠鱼T35楼2014-04-0810:17:27 @周丕东@普欣@拾月霜寒2012@小摸包@姚文嚼字@四號@凌宸@乔志峰@野渡自渡人@曾兵2010@缠绕夜色@曾颖@风青扬请关注
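The replies printed above quote earlier posts with '@' mentions followed by a dashed separator. A hedged sketch for pulling the mentioned user names out of a reply's text; the character class for what counts as a user name is an assumption:

import re

def extract_mentions(text):
    # hypothetical helper: grab the names following '@', allowing CJK characters,
    # ASCII letters, digits and underscores (an assumption about Tianya ids)
    return re.findall(u'@([\u4e00-\u9fa5A-Za-z0-9_]+)', text)

# e.g. extract_mentions(pa[89].find('div', {'class': 'bbs-content'}).text)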
In [118]:
print pa[0].find('div', {'class': 'bbs-content'}).text.strip()
In [119]:
print pa[87].find('div', {'class': 'bbs-content'}).text.strip()
In [104]:
pa[1].a
Out[104]:
In [113]:
print pa[0].find('a', class_ = 'reportme a-link')
In [115]:
print pa[0].find('a', class_ = 'reportme a-link')['replytime']
In [114]:
print pa[0].find('a', class_ = 'reportme a-link')['author']
In [122]:
for i in pa[:10]:
    p_info = i.find('a', class_ = 'reportme a-link')
    p_time = p_info['replytime']
    p_author_id = p_info['authorid']
    p_author_name = p_info['author']
    p_content = i.find('div', {'class': 'bbs-content'}).text.strip()
    p_content = p_content.replace('\t', '')
    print p_time, '--->', p_author_id, '--->', p_author_name, '--->', p_content, '\n'
http://bbs.tianya.cn/post-free-2848797-1.shtml
http://bbs.tianya.cn/post-free-2848797-2.shtml
http://bbs.tianya.cn/post-free-2848797-3.shtml
In [126]:
post_soup.find('div', {'class': 'atl-pages'})#['onsubmit']
Out[126]:
In [137]:
post_pages = post_soup.find('div', {'class': 'atl-pages'})
post_pages = post_pages.form['onsubmit'].split(',')[-1].split(')')[0]
post_pages
Out[137]:
In [144]:
url = 'http://bbs.tianya.cn' + df.link[2]
url_base = '-'.join(url.split('-')[:-1]) + '-%d.shtml'
url_base
Out[144]:
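Putting the two pieces together, a quick sanity check, a sketch that assumes post_pages (from In [137]) and url_base (from In [144]) are both still in scope, that enumerates every page URL of this post:

# list every page URL for this post; post_pages is a string here, so cast it
for page_num in range(1, int(post_pages) + 1):
    print url_base % page_num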
In [415]:
def parsePage(pa):
    records = []
    for i in pa:
        p_info = i.find('a', class_ = 'reportme a-link')
        p_time = p_info['replytime']
        p_author_id = p_info['authorid']
        p_author_name = p_info['author']
        p_content = i.find('div', {'class': 'bbs-content'}).text.strip()
        p_content = p_content.replace('\t', '').replace('\n', '')  # .replace(' ', '')
        record = p_time + '\t' + p_author_id + '\t' + p_author_name + '\t' + p_content
        records.append(record)
    return records

import sys

def flushPrint(s):
    # overwrite the current line to show progress without flooding the output
    sys.stdout.write('\r')
    sys.stdout.write('%s' % s)
    sys.stdout.flush()
In [246]:
url_1 = 'http://bbs.tianya.cn' + df.link[10]
content = urllib2.urlopen(url_1).read() # fetch the HTML text of the page
post_soup = BeautifulSoup(content, "lxml")
pa = post_soup.find_all('div', {'class': 'atl-item'})
b = post_soup.find('div', class_= 'atl-pages')
b
Out[246]:
In [247]:
url_1 = 'http://bbs.tianya.cn' + df.link[0]
content = urllib2.urlopen(url_1).read() # fetch the HTML text of the page
post_soup = BeautifulSoup(content, "lxml")
pa = post_soup.find_all('div', {'class': 'atl-item'})
a = post_soup.find('div', {'class': 'atl-pages'})
a
Out[247]:
In [251]:
a.form
Out[251]:
In [254]:
if b.form:
    print 'true'
else:
    print 'false'
In [416]:
import random
import time
def crawler(url, file_name):
    try:
        # fetch the first page of the post
        url_1 = 'http://bbs.tianya.cn' + url
        content = urllib2.urlopen(url_1).read() # fetch the HTML text of the page
        post_soup = BeautifulSoup(content, "lxml")
        # how many pages does this post have?
        post_form = post_soup.find('div', {'class': 'atl-pages'})
        if post_form.form:
            post_pages = post_form.form['onsubmit'].split(',')[-1].split(')')[0]
            post_pages = int(post_pages)
            url_base = '-'.join(url_1.split('-')[:-1]) + '-%d.shtml'
        else:
            post_pages = 1
        # parse the first page
        pa = post_soup.find_all('div', {'class': 'atl-item'})
        records = parsePage(pa)
        with open(file_name, 'a') as p:  # Note: append mode, run only once!
            for record in records:
                p.write('1' + '\t' + url + '\t' + record.encode('utf-8') + "\n")
        # parse the 2nd and later pages
        if post_pages > 1:
            for page_num in range(2, post_pages + 1):
                time.sleep(random.random())
                flushPrint(page_num)
                url2 = url_base % page_num
                content = urllib2.urlopen(url2).read() # fetch the HTML text of the page
                post_soup = BeautifulSoup(content, "lxml")
                pa = post_soup.find_all('div', {'class': 'atl-item'})
                records = parsePage(pa)
                with open(file_name, 'a') as p:  # Note: append mode, run only once!
                    for record in records:
                        p.write(str(page_num) + '\t' + url + '\t' + record.encode('utf-8') + "\n")
        else:
            pass
    except Exception, e:
        print e
        pass
In [182]:
url = 'http://bbs.tianya.cn' + df.link[2]
file_name = '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_test.txt'
crawler(url, file_name)
In [417]:
for k, link in enumerate(df.link):
    flushPrint(link)
    if k % 10 == 0:
        print 'This is post number: ' + str(k)
    file_name = '/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_network.txt'
    crawler(link, file_name)
In [418]:
dtt = []
with open('/Users/chengjun/github/cjc2016/data/tianya_bbs_threads_network.txt', 'r') as f:
    for line in f:
        pnum, link, time, author_id, author, content = line.replace('\n', '').split('\t')
        dtt.append([pnum, link, time, author_id, author, content])
len(dtt)
Out[418]:
In [419]:
dt = pd.DataFrame(dtt)
dt[:5]
Out[419]:
In [420]:
dt=dt.rename(columns = {0:'page_num', 1:'link', 2:'time', 3:'author',4:'author_name', 5:'reply'})
dt[:5]
Out[420]:
In [421]:
dt.reply[:100]
Out[421]:
http://search.tianya.cn/bbs?q=PX reports 18,459 items in total.
In [14]:
18459/50
Out[14]:
In practice the listing runs out at page 10 (http://bbs.tianya.cn/list.jsp?item=free&order=1&nextid=9&k=PX). It turns out that was only the Tianya BBS board; there are various other sections as well, such as Tianya Focus (http://focus.tianya.cn/), and so on.