Hour of Code

Scraping the data you need with Python

Related packages

Install Python 3: open a terminal and run the following (macOS)



In [ ]:

    
!brew install python3

For the slideshow feature used in this session, install Jupyter



In [ ]:

    
!pip3 install jupyter

Third-party packages used in this session

BeautifulSoup
requests



In [ ]:

    
!pip3 install beautifulsoup4



In [ ]:

    
!pip3 install requests

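This was covered in the computer networking course.
All we are doing now is writing a client that takes the place of the browser
and requests information from the server so that we can analyze it.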
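The client/server exchange described above can be tried without touching the internet. The following is a minimal sketch using only the standard library. `HelloHandler` and `fetch_from_local_server` are hypothetical names made up for this illustration, and the tiny local server merely stands in for a real website.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical server: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><p>hello from the server</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # The client side: this is the role a browser normally plays.
        connection = HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        status = response.status
        page = response.read().decode("utf-8")
        connection.close()
        return status, page
    finally:
        server.shutdown()
        server.server_close()


status, page = fetch_from_local_server()
print(status)
print(page)
```

The client only needs the host, the port, and the path. Everything else (status code, headers, body) comes back from the server.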




Fetching information

GET
POST
Downloading images or files

Returned information

HTML
JSON
XML
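Each of the three response formats listed above needs a different parser. As a rough sketch, the snippet below reads one small sample of each using only the standard library. The sample strings and the `TitleParser` class are assumptions made up for this illustration, not data from a real server.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up sample payloads, one per format.
json_text = '{"school": "NCYU", "departments": ["CSIE"]}'
xml_text = "<school><name>NCYU</name><dept>CSIE</dept></school>"
html_text = "<html><head><title>NCYU</title></head><body></body></html>"

# JSON maps directly onto Python dicts and lists.
json_school = json.loads(json_text)["school"]

# XML is parsed into a tree of elements.
xml_dept = ET.fromstring(xml_text).find("dept").text


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# HTML is handled with an event-based parser here.
parser = TitleParser()
parser.feed(html_text)
html_title = parser.title.strip()

print(json_school, xml_dept, html_title)
```

Writing an HTML parser by hand gets tedious quickly, which is the reason a library such as BeautifulSoup is convenient for HTML.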

Requests

Document



In [ ]:

    
import requests

HTTP GET request



In [ ]:

    
response = requests.get("http://www.google.com.tw/search")
print(response)

Get the URL and the response body



In [ ]:

    
print(response.url)
print(response.text)

Send an HTTP GET request with parameters



In [ ]:

    
response = requests.get("https://www.google.com.tw/search", params={"q":"嘉義大學"})



In [ ]:

    
response.url
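The URL shown by `response.url` contains the parameters as a percent-encoded query string. The sketch below approximates that encoding step with the standard library's `urllib.parse`. It is an illustration of the idea, not the exact code path inside Requests, and the variable names are chosen only for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://www.google.com.tw/search"
params = {"q": "嘉義大學"}

# urlencode percent-encodes the UTF-8 bytes of non-ASCII values.
query_string = urlencode(params)
full_url = base_url + "?" + query_string
print(full_url)

# Decoding the query string gives back the original parameters.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

Passing a dict through `params` avoids doing this encoding by hand, which is easy to get wrong with non-ASCII text.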



In [ ]:

    
response.text

HTTP POST request



In [ ]:

    
response = requests.post("http://httpbin.org/post")
print(response)



In [ ]:

    
print(response.text)

Send an HTTP POST request with parameters



In [ ]:

    
response = requests.post("http://httpbin.org/post", params={"p":"TEST"})



In [ ]:

    
print( response.text )
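With a POST request, it matters whether the parameters travel in the URL or in the request body. In Requests, `params` puts them in the query string, while `data` sends them as a form-encoded body. The sketch below compares the two without sending anything, by preparing the requests locally. The variable names `with_params` and `with_data` are chosen only for this illustration.

```python
import requests

url = "http://httpbin.org/post"

# prepare() builds the request locally; nothing is sent over the network.
with_params = requests.Request("POST", url, params={"p": "TEST"}).prepare()
with_data = requests.Request("POST", url, data={"p": "TEST"}).prepare()

# params: the value is appended to the URL and the body stays empty.
print(with_params.url)
print(with_params.body)

# data: the URL is unchanged and the value is form-encoded into the body.
print(with_data.url)
print(with_data.body)
print(with_data.headers["Content-Type"])
```

When the server expects form fields, as a typical HTML form submission does, `data` is usually the argument to reach for.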

Downloading raw files (images, music...)



In [ ]:

    
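# stream=True defers downloading the body until it is read chunk by chunk
response = requests.get("https://upload.wikimedia.org/wikipedia/commons/8/84/HTML.svg", stream=True)
response.raise_for_status()  # stop on an HTTP error instead of saving an error page
chunk_size = 1024  # bytes per chunk

# Write the file in binary mode, one chunk at a time
with open("./img.svg", "wb") as file:
    for chunk in response.iter_content(chunk_size):
        file.write(chunk)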

BeautifulSoup

Document

HTML



In [ ]:

    
from bs4 import BeautifulSoup



In [ ]:

    
html = """
<html>
 <head>
  <title>
    The Link Test
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    HaHa
   </b>
  </p>
  <p class="links">
   <a class="link" href="http://www.google.com" id="link1">Google</a>
   <br>
   <a class="link" href="http://www.ncyu.edu.tw" id="link2">NCYU</a>
   <br>
   <a class="link" href="http://www.ncyu.edu.tw/csie" id="link3">CSIE</a>
  </p>
  <p class="links">
   ...
  </p>
 
</body></html>
"""



In [ ]:

    
soup = BeautifulSoup(html, "html.parser")



In [ ]:

    
soup.title.string



In [ ]:

    
soup.p



In [ ]:

    
soup.find_all('a')



In [ ]:

    
soup.find_all('a', id="link1")



In [ ]:

    
soup.find_all('a', id=True)



In [ ]:

    
soup.find_all('a', class_='link')



In [ ]:

    
soup.find_all('a', attrs={"class":"link"})
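The `find_all` calls above return tag objects, and in practice the next step is usually to pull the text and attributes out of them. The following is a small sketch under that assumption. The shortened `html` string and the `links` list are made up for this illustration.

```python
from bs4 import BeautifulSoup

# A shortened, made-up version of the sample page.
html = """
<p class="links">
 <a class="link" href="http://www.google.com" id="link1">Google</a>
 <a class="link" href="http://www.ncyu.edu.tw" id="link2">NCYU</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (text, href, id) for every matching link.
links = []
for a in soup.find_all("a", class_="link"):
    # a["href"] raises KeyError if the attribute is missing; a.get() returns None.
    links.append((a.get_text(strip=True), a["href"], a.get("id")))

for text, href, link_id in links:
    print(text, href, link_id)

# find() returns only the first match, or None when nothing matches.
first = soup.find("a", id="link1")
print(first["href"])
```

Using `a.get("href")` instead of `a["href"]` is the safer choice on real pages, where some anchors have no `href` at all.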