4.2 Trying Out Scraping with Third-Party Packages

Requests http://docs.python-requests.org/
Beautiful Soup http://www.crummy.com/software/BeautifulSoup/



In [1]:

    
import requests
import bs4

Fetching a Web Page with Requests



In [2]:

    
# Fetch a page from gihyo.jp with Requests
import requests
r = requests.get('http://gihyo.jp/lifestyle/clip/01/everyday-cat')
r.status_code # get the HTTP status code









    Out[2]:





200
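A status code of 200 means the request succeeded, but a script that runs unattended usually also wants a timeout and an explicit failure on error statuses. The following is a minimal sketch of that idea, not part of the original notebook: `fetch_text` and its `get` parameter are hypothetical names, and `get` exists only so the function can be exercised without network access.

```python
import requests


def fetch_text(url, timeout=10.0, get=requests.get):
    """Return the decoded body of `url`, raising on HTTP error statuses.

    `fetch_text` and the `get` parameter are hypothetical helpers for this
    sketch; by default the real requests.get is used.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_text('http://gihyo.jp/lifestyle/clip/01/everyday-cat')` inside a `try` block that catches `requests.RequestException` would cover connection errors, timeouts, and error statuses in one place.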



In [3]:

    
r.text[:50] # get the first 50 characters









    Out[3]:





'<!DOCTYPE html>\r\n<html xmlns="http://www.w3.org/19'

Getting More Out of Requests

connpass API reference https://connpass.com/about/api/



In [4]:

    
# Fetch an API response in JSON format
r = requests.get('https://connpass.com/api/v1/event/?keyword=python')
data = r.json() # decode the JSON body into Python objects
for event in data['events']:
    print(event['title'])









    



人工知能のコードをハックする会 #1
[秋葉原] 詳解ディープラーニング 輪読&勉強会(1章+2章+keras超入門)
Pythonで作る初心者のためのニューラルネットワーク実装
Pythonで作る初心者のためのニューラルネットワーク実装
Python札幌 プログラム初学者向けハンズオン　2017年 #2 　懇親会
BPStudy#120〜小さなチーム、大きな仕事。開発・運営が一体となったチーム運営とは
[秋葉原] 自然言語処理と深層学習の勉強会 (第五回 分散表現/系列変換モデル)
Excel ベイズ入門 #Last：ベイズ決定・線形回帰モデル
データビジュアライゼーション講習 - P/L（損益計算書）の可視化・分析 8/22
機械学習 名古屋 分科会 #5
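The query string in the cell above is written directly into the URL. As a sketch of an alternative, the same URL can be built from a dict, and the title extraction can be isolated in a small function. The request below is only prepared, never sent; `event_titles` and the `sample` data are hypothetical and merely mimic the shape of the response used above.

```python
import requests

# Build (but do not send) the same request, passing the query as a dict
request = requests.Request(
    'GET', 'https://connpass.com/api/v1/event/', params={'keyword': 'python'}
)
prepared = request.prepare()
print(prepared.url)  # https://connpass.com/api/v1/event/?keyword=python


def event_titles(data):
    """Return the titles in a decoded response, tolerating a missing 'events' key."""
    return [event['title'] for event in data.get('events', [])]


# Hypothetical sample shaped like the decoded JSON used in the loop above
sample = {'events': [{'title': 'Example event A'}, {'title': 'Example event B'}]}
print(event_titles(sample))  # ['Example event A', 'Example event B']
```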



In [5]:

    
# Requests provides a function for each HTTP method
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)
r = requests.put('http://httpbin.org/put', data=payload)
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')
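The `data=` argument used above sends the dict as a form-encoded body. Requests also accepts `json=`, which serializes the dict as JSON instead. The sketch below prepares both requests without sending them so that the difference can be inspected offline; it is an illustration added here, not something the original cells run.

```python
import json

import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Prepare the requests without sending them, so the bodies can be inspected
form = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()
as_json = requests.Request('POST', 'http://httpbin.org/post', json=payload).prepare()

print(form.headers['Content-Type'])     # application/x-www-form-urlencoded
print(form.body)                        # key1=value1&key2=value2
print(as_json.headers['Content-Type'])  # application/json
print(json.loads(as_json.body))         # the original dict, decoded again
```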



In [6]:

    
# Convenient features of Requests: build the query string from a dict
r = requests.get('http://httpbin.org/get', params=payload)
r.url









    Out[6]:





'http://httpbin.org/get?key1=value1&key2=value2'
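Passing `params` as a dict also takes care of URL encoding. The sketch below uses hypothetical parameter names (`q`, `tag`) with a value containing a space and a key with two values, prepares the request without sending it, and decodes the query string again to confirm that nothing was lost.

```python
from urllib.parse import parse_qs, urlsplit

import requests

# Hypothetical parameters: a value containing a space and a key with two values
params = {'q': 'python requests', 'tag': ['a', 'b']}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=params).prepare()
print(prepared.url)

# Decoding the query string again recovers the original values
decoded = parse_qs(urlsplit(prepared.url).query)
print(decoded)
```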



In [7]:

    
r = requests.get('https://httpbin.org/basic-auth/user/passwd', auth=('user', 'passwd'))
r.status_code









    Out[7]:





200
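To see what `auth=('user', 'passwd')` adds to a request, the request can be prepared without being sent and its headers inspected. This is a small offline sketch for illustration; it shows that HTTP Basic authentication only base64-encodes the credentials, which is why it should be combined with HTTPS as in the cell above.

```python
import base64

import requests

prepared = requests.Request(
    'GET', 'https://httpbin.org/basic-auth/user/passwd', auth=('user', 'passwd')
).prepare()
header = prepared.headers['Authorization']
print(header)

# The value after "Basic " is only base64-encoded, not encrypted
scheme, token = header.split(' ', 1)
print(base64.b64decode(token).decode('ascii'))  # user:passwd
```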

httpbin(1): HTTP Client Testing Service https://httpbin.org/

Parsing a Web Page with Beautiful Soup 4



In [8]:

    
# Fetch the "技評ねこ部通信" page and parse it with Beautiful Soup 4
import requests
from bs4 import BeautifulSoup
r = requests.get('http://gihyo.jp/lifestyle/clip/01/everyday-cat')
soup = BeautifulSoup(r.content, 'html.parser')
title = soup.title # get the title tag
type(title) # the object is a Tag









    Out[8]:





bs4.element.Tag



In [9]:

    
print(title) # show the whole title tag
print(title.text) # get only the text inside the title tag









    



<title>技評ねこ部通信｜gihyo.jp … 技術評論社</title>
技評ねこ部通信｜gihyo.jp … 技術評論社
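Besides `.text`, a Tag offers `get_text()` and `.string`, which behave slightly differently. The sketch below uses a small hypothetical HTML string in place of the downloaded page so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html = ('<html><head><title> Sample page </title></head>'
        '<body><p>Hello <b>world</b></p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

print(repr(soup.title.text))                  # ' Sample page '
print(repr(soup.title.get_text(strip=True)))  # 'Sample page'
print(soup.p.get_text())                      # Hello world
print(soup.p.string)                          # None: <p> has more than one child
```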



In [10]:

    
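# Get the data for a single "技評ねこ部通信" entry
div = soup.find('div', class_='readingContent01')
li = div.find('li') # get the first li tag inside the div tag
print(li.a['href']) # get the href attribute of the a tag inside the li tag
print(li.a.text) # get the text inside the a tag
li.a.text.split(maxsplit=1) # split the string into date and title with str.split()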









    



http://gihyo.jp/lifestyle/clip/01/everyday-cat/201708/04
2017年8月4日　甘えん坊なはちべい






    Out[10]:





['2017年8月4日', '甘えん坊なはちべい']
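The split works because `str.split()` without a separator splits on any Unicode whitespace, including the full-width space (U+3000) between the date and the title. `maxsplit=1` matters when the title itself contains such a space, as one of the link texts on this page does. A minimal standard-library illustration:

```python
# Link text in the same format as above; \u3000 is the full-width space
text = '2017年7月24日\u3000風通る公園のしろこ\u3000三たび'

print('\u3000'.isspace())  # True: the full-width space counts as whitespace

date, title = text.split(maxsplit=1)
print(date)   # 2017年7月24日
print(title)  # the whole title, including its inner full-width space

print(len(text.split()))  # 3: without maxsplit the title is cut in two
```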



In [11]:

    
# Get the data for all "技評ねこ部通信" entries
div = soup.find('div', class_='readingContent01')
for li in div.find_all('li'): # get every li tag inside the div tag
    url = li.a['href']
    date, text = li.a.text.split(maxsplit=1)
    print('{},{},{}'.format(date, text, url))









    



2017年8月4日,甘えん坊なはちべい,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201708/04
2017年8月3日,ケーブルカー駅のしろくろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201708/03
2017年8月2日,ビビりな丸子,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201708/02
2017年8月1日,河原の公園のしろくろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201708/01
2017年7月31日,技評ねこ部の投稿コーナー！,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/31
2017年7月28日,しろこの青春,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/28
2017年7月27日,クリーニング屋のしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/27
2017年7月26日,見返りしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/26
2017年7月25日,2匹のしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/25
2017年7月24日,風通る公園のしろこ　三たび,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/24
2017年7月21日,塀の上のしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/21
2017年7月20日,マダムしろこのご近所しろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/20
2017年7月19日,風通る公園のしろこ　再び,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/19
2017年7月18日,マダムしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/18
2017年7月14日,ドヤ顔のしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/14
2017年7月13日,風通る公園のしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/13
2017年7月12日,学校のしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/12
2017年7月11日,住宅街のしろこ　夜の部,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/11
2017年7月10日,住宅街のしろこ,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/10
2017年7月7日,植え込みのしろこ２,http://gihyo.jp/lifestyle/clip/01/everyday-cat/201707/07
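Joining fields with commas by hand breaks as soon as a title contains a comma, and the loop above assumes that the div and every link exist. The following sketch shows one possible hardening using the standard `csv` module. The HTML is a hypothetical stand-in shaped like the list parsed above, with ASCII spaces and example.com URLs chosen only for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup shaped like the list parsed above
html = '''
<div class="readingContent01"><ul>
<li><a href="http://example.com/01">2017-08-04 Title one</a></li>
<li><a href="http://example.com/02">2017-08-03 Title, with a comma</a></li>
<li>No link here</li>
</ul></div>
'''
soup = BeautifulSoup(html, 'html.parser')

rows = []
div = soup.find('div', class_='readingContent01')
if div is not None:  # find() returns None when nothing matches
    for li in div.find_all('li'):
        if li.a is None:  # skip list items without a link
            continue
        date, text = li.a.text.split(maxsplit=1)
        rows.append([date, text, li.a['href']])

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)  # fields containing commas are quoted
print(buffer.getvalue())
```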

Getting More Out of Beautiful Soup 4



In [12]:

    
# Get information about a tag
div = soup.find('div', class_='readingContent01')
type(div) # the object is a Tag









    Out[12]:





bs4.element.Tag



In [13]:

    
div.name









    Out[13]:





'div'



In [14]:

    
div['class']









    Out[14]:





['readingContent01']



In [15]:

    
div.attrs # get all attributes as a dict









    Out[15]:





{'class': ['readingContent01']}
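Two details of attribute access are easy to miss: `class` is returned as a list because it can hold several values, and subscripting a missing attribute raises `KeyError`, whereas `Tag.get()` returns `None`. The sketch below demonstrates both on a hypothetical snippet of HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<div class="box wide" id="main"><a href="/next">next</a><a>no href</a></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

print(div['class'])          # ['box', 'wide']: class is multi-valued
print(div['id'])             # main: other attributes are plain strings
links = div.find_all('a')
print(links[0].get('href'))  # /next
print(links[1].get('href'))  # None instead of a KeyError
```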



In [16]:

    
# Various ways to search
a_tags = soup.find_all('a') # search by tag name
len(a_tags)









    Out[16]:





131



In [17]:

    
import re
for tag in soup.find_all(re.compile('^b')): # search with a regular expression (tag names starting with b)
    print(tag.name)









    



base
body
br
br
br
br
br



In [18]:

    
for tag in soup.find_all(['html', 'title']): # search with a list of tag names
    print(tag.name)









    



html
title
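In addition to strings, regular expressions, and lists, `find_all()` accepts a function as a filter and a `limit` argument. The sketch below shows these on a hypothetical HTML string; `has_no_id` is an illustrative name, not something from the original notebook.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<body><h1>Top</h1><h2>Sub</h2><p id="a">one</p><p>two</p><b>bold</b></body>'
soup = BeautifulSoup(html, 'html.parser')

# Regular expression: matches the heading tags h1 to h6
print([tag.name for tag in soup.find_all(re.compile(r'^h[1-6]$'))])  # ['h1', 'h2']


# Function: called with each Tag, keeps those for which it returns True
def has_no_id(tag):
    return tag.name == 'p' and not tag.has_attr('id')


print([tag.text for tag in soup.find_all(has_no_id)])  # ['two']

# limit stops the search after the given number of matches
print(len(soup.find_all(['p', 'b'], limit=2)))  # 2
```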



In [19]:

    
# Specify attributes with keyword arguments
tag = soup.find(id='categoryNavigation') # search by id attribute
tag.name, tag.attrs









    Out[19]:





('div', {'class': ['headCategoryNavigation'], 'id': 'categoryNavigation'})



In [20]:

    
tags = soup.find_all(id=True) # find every tag that has an id attribute
len(tags)









    Out[20]:





57



In [21]:

    
div = soup.find('div', class_='readingContent01') # class is a reserved word in Python, so write class_
div.attrs









    Out[21]:





{'class': ['readingContent01']}



In [22]:

    
div = soup.find('div', {'class': 'readingContent01'}) # attributes can also be given as a dict
div.attrs









    Out[22]:





{'class': ['readingContent01']}
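The dict form is the only option for attribute names that are not valid Python identifiers, such as `data-*` attributes. The sketch below passes the dict through the `attrs` keyword on a hypothetical HTML string; the attribute name `data-kind` and the list contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = ('<ul><li data-kind="cat">Shiroko</li>'
        '<li data-kind="dog">Pochi</li><li>none</li></ul>')
soup = BeautifulSoup(html, 'html.parser')

# data-kind is not a valid Python identifier, so it cannot be a keyword argument
cats = soup.find_all('li', attrs={'data-kind': 'cat'})
print([li.text for li in cats])  # ['Shiroko']

# True matches any tag that has the attribute at all
tagged = soup.find_all(attrs={'data-kind': True})
print(len(tagged))  # 2
```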



In [23]:

    
# Search with CSS selectors
soup.select('title') # select by tag name









    Out[23]:





[<title>技評ねこ部通信｜gihyo.jp … 技術評論社</title>]



In [24]:

    
a_tags = soup.select('body a') # a tags anywhere under the body tag
len(a_tags)









    Out[24]:





131



In [25]:

    
a_tags = soup.select('p > a') # a tags that are direct children of a p tag
len(a_tags)









    Out[25]:





9



In [26]:

    
soup.select('body > a') # no a tag is a direct child of the body tag









    Out[26]:





[]



In [27]:

    
div = soup.select('.readingContent01') # select by class (select() returns a list)
div = soup.select('div.readingContent01')
div = soup.select('#categoryNavigation') # select by id
div = soup.select('div#categoryNavigation')
a_tag = soup.select_one('div > a') # returns only the first a tag that is a direct child of a div tag
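The difference between the descendant selector (`div a`), the child selector (`div > a`), and `select_one()` is easier to see on a small document. The sketch below uses a hypothetical HTML string rather than the gihyo.jp page, so the ids, classes, and links are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '''
<div id="nav" class="menu"><a href="/a">A</a><p><a href="/b">B</a></p></div>
<div class="menu"><a href="/c">C</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

print([a['href'] for a in soup.select('div a')])    # ['/a', '/b', '/c']
print([a['href'] for a in soup.select('div > a')])  # ['/a', '/c']
print(soup.select_one('div > a')['href'])           # /a (first match only)
print(len(soup.select('.menu')), len(soup.select('#nav')))  # 2 1
print(soup.select_one('table'))                     # None when nothing matches
```

Because `select()` always returns a list and `select_one()` may return `None`, code that uses the result should check for an empty list or `None` before accessing attributes.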