Machine Learning -- 0 -- Scraping

Class notes on machine learning

Juan David Velásquez Henao
jdvelasq@unal.edu.co
Universidad Nacional de Colombia, Sede Medellín
Facultad de Minas
Medellín, Colombia

License
Readme

Software used

This is an interactive document written as a Jupyter notebook. It presents a tutorial on extracting, transforming, visualizing, and loading data with Python in the context of data science. Jupyter notebooks combine code, text, graphics, and equations in a single document. The code in this notebook can be run on Linux and OS X.

Click here for detailed instructions on how to install Jupyter on Windows and Mac OS X.

Click here to view the latest version of this document in nbviewer.

Download the latest version of this document to your hard drive; then upload it and run it online in Try Jupyter!

Contents

Bibliography

The Python Tutorial by the Python Software Foundation

IPython in Depth at GitHub

IPython wiki at GitHub

Web Scraping

Example 1


In [6]:
# import the libraries
import urllib.request
import urllib.parse
# import urllib2  # Python 2 only; urllib.request replaces it in Python 3

In [7]:
google = urllib.request.urlopen('http://google.com') 
google = google.read()
print(google[:200])


b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es-419"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x'

In [11]:
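url = 'http://google.com?q='
url_with_query = url + urllib.parse.quote('python web scraping')
web_search = urllib.request.urlopen(url_with_query)
web_search = web_search.read()
print(web_search[:400])
# Note: the query is sent to the root URL rather than to /search, so the
# response shown below is Google's home page, not a list of search results.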


b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es-419"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script>(function(){window.google={kEI:\'xkK4WJSRG8yymwGYzZ24Dw\',kEXPI:\'18167,1351828,1351903,1352240,1352623,1352995,3700284,37'
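The cell above builds the query string by hand with `urllib.parse.quote`. A minimal sketch of an alternative is `urllib.parse.urlencode`, which escapes and joins several parameters in one step. The `/search` path and the `hl` parameter are assumptions made for illustration (the notebook queries the root URL), and no request is sent.

```python
from urllib.parse import urlencode, quote

# Hypothetical endpoint and parameters, used only to illustrate URL building.
base_url = 'http://google.com/search'
params = {'q': 'python web scraping', 'hl': 'es'}

# urlencode escapes each value (spaces become '+') and joins pairs with '&'.
url_with_query = base_url + '?' + urlencode(params)
print(url_with_query)

# quote, as used in the notebook, escapes spaces as '%20' instead.
print(quote('python web scraping'))
```

Both encodings of the space character are accepted in query strings; `urlencode` is mainly convenient when there is more than one parameter.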

In [12]:
import requests
google = requests.get('http://google.com') 
print(google.status_code)
print(google.content[:200])
print(google.headers)
print(google.cookies.items())


200
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es-419"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x'
{'Date': 'Thu, 02 Mar 2017 16:06:18 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'Content-Length': '4579', 'X-XSS-Protection': '1; mode=block', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': 'NID=98=s6YBZCZgKy3lIWCYYXCzoa0rxmueIipaGfKsAuvBly3OKk7Sq0ncGjmi7yzZv_uvQGj0zVxvYTUTNFbAitFkjahnfytwxL3T7xW12Kf-W5g0zB8yvKQPUHaeFOkAwMxl; expires=Fri, 01-Sep-2017 16:06:18 GMT; path=/; domain=.google.com.co; HttpOnly'}
[('NID', '98=s6YBZCZgKy3lIWCYYXCzoa0rxmueIipaGfKsAuvBly3OKk7Sq0ncGjmi7yzZv_uvQGj0zVxvYTUTNFbAitFkjahnfytwxL3T7xW12Kf-W5g0zB8yvKQPUHaeFOkAwMxl')]
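The `Content-Type` header printed above declares `charset=ISO-8859-1`, while `google.content` holds raw bytes. A minimal sketch of decoding such bytes with the declared charset, using only the standard library. The header value is copied from the output above; the byte string is a made-up sample, not the real response.

```python
from email.message import Message

# Header value copied from the response headers shown above.
content_type = 'text/html; charset=ISO-8859-1'

# Made-up sample body; 'ó' is the single byte 0xF3 in ISO-8859-1.
body = b'<title>Informaci\xf3n</title>'

# Message parses the header parameters for us.
msg = Message()
msg['Content-Type'] = content_type
charset = msg.get_content_charset()  # lower-cased charset name

text = body.decode(charset)
print(charset, text)
```

With requests, `google.text` performs a similar decoding using the encoding stored in `google.encoding`.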

Example 2


In [40]:
from urllib.request import urlopen
from bs4 import BeautifulSoup 
import requests

In [56]:
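html = urlopen("http://www.allitebooks.com")
# name the parser explicitly so the result does not depend on the system
bsObj = BeautifulSoup(html.read(), "lxml")
# find_all is the current name of findAll
titles = bsObj.find_all("a", {'rel': 'bookmark'})
for x in titles:
    print(x)
    print(' ')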


<a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">
<img alt="Code Generation with Roslyn" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Code-Generation-with-Roslyn.jpg" width="351"/> </a>
 
<a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">Code Generation with Roslyn</a>
 
<a href="http://www.allitebooks.com/beginning-power-bi-2nd-edition/" rel="bookmark">
<img alt="Beginning Power BI" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Beginning-Power-BI.jpg" width="350"/> </a>
 
<a href="http://www.allitebooks.com/beginning-power-bi-2nd-edition/" rel="bookmark">Beginning Power BI, 2nd Edition</a>
 
<a href="http://www.allitebooks.com/cisco-lan-switching-configuration-handbook-2nd-edition/" rel="bookmark">
<img alt="Cisco LAN Switching Configuration Handbook, 2nd Edition" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Cisco-LAN-Switching-Configuration-Handbook-2nd-Edition.jpg" width="392"/> </a>
 
<a href="http://www.allitebooks.com/cisco-lan-switching-configuration-handbook-2nd-edition/" rel="bookmark">Cisco LAN Switching Configuration Handbook, 2nd Edition</a>
 
<a href="http://www.allitebooks.com/data-visualisation-with-r/" rel="bookmark">
<img alt="Data Visualisation with R" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Data-Visualisation-with-R.jpg" width="332"/> </a>
 
<a href="http://www.allitebooks.com/data-visualisation-with-r/" rel="bookmark">Data Visualisation with R</a>
 
<a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">
<img alt="Pro MongoDB Development" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Pro-MongoDB-Development.jpg" width="350"/> </a>
 
<a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">Pro MongoDB Development</a>
 
<a href="http://www.allitebooks.com/html5-for-flash-developers/" rel="bookmark">
<img alt="HTML5 for Flash Developers" class="attachment-post-thumbnail wp-post-image" height="493" src="http://www.allitebooks.com/wp-content/uploads/2017/02/HTML5-for-Flash-Developers-400x493.jpg" width="400"/> </a>
 
<a href="http://www.allitebooks.com/html5-for-flash-developers/" rel="bookmark">HTML5 for Flash Developers</a>
 
<a href="http://www.allitebooks.com/microsoft-windows-server-2012-administration-instant-reference/" rel="bookmark">
<img alt="Microsoft Windows Server 2012 Administration Instant Reference" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Microsoft-Windows-Server-2012-Administration-Instant-Reference.jpg" width="332"/> </a>
 
<a href="http://www.allitebooks.com/microsoft-windows-server-2012-administration-instant-reference/" rel="bookmark">Microsoft Windows Server 2012 Administration Instant Reference</a>
 
<a href="http://www.allitebooks.com/learning-concurrent-programming-in-scala-2nd-edition/" rel="bookmark">
<img alt="Learning Concurrent Programming in Scala, 2nd Edition" class="attachment-post-thumbnail wp-post-image" height="475" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Learning-Concurrent-Programming-in-Scala-2nd-Edition-400x475.jpg" width="400"/> </a>
 
<a href="http://www.allitebooks.com/learning-concurrent-programming-in-scala-2nd-edition/" rel="bookmark">Learning Concurrent Programming in Scala, 2nd Edition</a>
 
<a href="http://www.allitebooks.com/ebay-commerce-cookbook/" rel="bookmark">
<img alt="eBay Commerce Cookbook" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/eBay-Commerce-Cookbook.jpg" width="389"/> </a>
 
<a href="http://www.allitebooks.com/ebay-commerce-cookbook/" rel="bookmark">eBay Commerce Cookbook</a>
 
<a href="http://www.allitebooks.com/beginning-c-2008-objects/" rel="bookmark">
<img alt="Beginning C 2008 Objects" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Beginning-C-2008-Objects.jpg" width="378"/> </a>
 
<a href="http://www.allitebooks.com/beginning-c-2008-objects/" rel="bookmark">Beginning C# 2008 Objects</a>
 
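Each book appears twice in the listing above: once as a thumbnail link whose text is empty, and once as a text link. A sketch, assuming only the text links are wanted, that keeps the title and URL of each book. The HTML is a shortened copy of two entries from the listing (image paths replaced by a placeholder), and `html.parser` is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Shortened copy of two entries from the listing above.
sample = '''
<a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">
<img alt="Code Generation with Roslyn" src="cover.jpg"/> </a>
<a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">Code Generation with Roslyn</a>
<a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">
<img alt="Pro MongoDB Development" src="cover.jpg"/> </a>
<a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">Pro MongoDB Development</a>
'''

soup = BeautifulSoup(sample, 'html.parser')

books = []
for a in soup.find_all('a', {'rel': 'bookmark'}):
    title = a.get_text(strip=True)
    if title:  # thumbnail links contain only an <img>, so their text is empty
        books.append((title, a['href']))

for title, href in books:
    print(title, '->', href)
```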

In [25]:
page = requests.get('http://www.allitebooks.com')
# name the parser explicitly so the result does not depend on the system
bs = BeautifulSoup(page.content, "lxml")
print(bs.title)


<title>All IT eBooks - Free IT eBooks Download</title>

In [29]:
print(bs.find_all('a'))


[<a href="/" title="All IT eBooks">All IT eBooks</a>, <a title="All Categories">Categories</a>, <a href="http://www.allitebooks.com/web-development/">Web Development</a>, <a href="http://www.allitebooks.com/web-development/asp-net/">ASP.NET</a>, <a href="http://www.allitebooks.com/web-development/cms/">CMS</a>, <a href="http://www.allitebooks.com/web-development/html-html5-css/">HTML, HTML5 &amp; CSS</a>, <a href="http://www.allitebooks.com/web-development/javascript/">JavaScript</a>, <a href="http://www.allitebooks.com/web-development/jsp/">JSP</a>, <a href="http://www.allitebooks.com/web-development/php/">PHP</a>, <a href="http://www.allitebooks.com/web-development/python/">Python</a>, <a href="http://www.allitebooks.com/web-development/ruby/">Ruby</a>, <a href="http://www.allitebooks.com/web-development/rails/">Rails</a>, <a href="http://www.allitebooks.com/web-development/xml/">XML</a>, <a href="http://www.allitebooks.com/web-development/services-apis/">Services &amp; APIs</a>, <a href="http://www.allitebooks.com/web-development/other-web-development/">Other</a>, <a href="http://www.allitebooks.com/programming/">Programming</a>, <a href="http://www.allitebooks.com/programming/c/">C &amp; C++</a>, <a href="http://www.allitebooks.com/programming/c-programming/">C#</a>, <a href="http://www.allitebooks.com/programming/delphi/">Delphi</a>, <a href="http://www.allitebooks.com/programming/java/">Java</a>, <a href="http://www.allitebooks.com/programming/net/">.NET</a>, <a href="http://www.allitebooks.com/programming/objective-c/">Objective-C</a>, <a href="http://www.allitebooks.com/programming/opencl/">OpenCL</a>, <a href="http://www.allitebooks.com/programming/perl/">Perl</a>, <a href="http://www.allitebooks.com/programming/powershell/">PowerShell</a>, <a href="http://www.allitebooks.com/programming/scala/">Scala</a>, <a href="http://www.allitebooks.com/programming/swift/">Swift</a>, <a href="http://www.allitebooks.com/programming/visual-basic/">Visual Basic</a>, <a 
href="http://www.allitebooks.com/datebases/">Datebases</a>, <a href="http://www.allitebooks.com/datebases/big-data/">Big Data</a>, <a href="http://www.allitebooks.com/datebases/data-analysis/">Data Analysis</a>, <a href="http://www.allitebooks.com/datebases/mongodb/">MongoDB</a>, <a href="http://www.allitebooks.com/datebases/mysql/">MySQL</a>, <a href="http://www.allitebooks.com/datebases/nosql/">NoSQL</a>, <a href="http://www.allitebooks.com/datebases/postgresql/">PostgreSQL</a>, <a href="http://www.allitebooks.com/datebases/oracle/">Oracle</a>, <a href="http://www.allitebooks.com/datebases/sql/">SQL</a>, <a href="http://www.allitebooks.com/game-programming/">Game Programming</a>, <a href="http://www.allitebooks.com/graphics-design/">Graphics &amp; Design</a>, <a href="http://www.allitebooks.com/graphics-design/3d-max/">3D MAX</a>, <a href="http://www.allitebooks.com/graphics-design/cad/">CAD</a>, <a href="http://www.allitebooks.com/graphics-design/coreldraw/">Coreldraw</a>, <a href="http://www.allitebooks.com/graphics-design/dreamweaver/">Dreamweaver</a>, <a href="http://www.allitebooks.com/graphics-design/flash/">Flash</a>, <a href="http://www.allitebooks.com/graphics-design/illustrator/">Illustrator</a>, <a href="http://www.allitebooks.com/graphics-design/maya/">Maya</a>, <a href="http://www.allitebooks.com/graphics-design/photoshop/">Photoshop</a>, <a href="http://www.allitebooks.com/graphics-design/premiere/">Premiere</a>, <a href="http://www.allitebooks.com/operating-systems/">Operating Systems</a>, <a href="http://www.allitebooks.com/operating-systems/windows/">Windows</a>, <a href="http://www.allitebooks.com/operating-systems/linux-unix/">Linux &amp; Unix</a>, <a href="http://www.allitebooks.com/operating-systems/macintosh/">Macintosh</a>, <a href="http://www.allitebooks.com/operating-systems/android/">Android</a>, <a href="http://www.allitebooks.com/operating-systems/ios/">iOS</a>, <a 
href="http://www.allitebooks.com/operating-systems/windows-phone/">Windows Phone</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/">Networking &amp; Cloud Computing</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/cloud-computing/">Cloud Computing</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/network-administration/">Network Administration</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/network-security/">Network Security</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/networks-protocols-apis/">Networks, Protocols &amp; APIs</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/wireless-networks/">Wireless Networks</a>, <a href="http://www.allitebooks.com/administration/">Administration</a>, <a href="http://www.allitebooks.com/administration/cloud-virtualization/">Cloud &amp; Virtualization</a>, <a href="http://www.allitebooks.com/administration/infrastructure/">Infrastructure</a>, <a href="http://www.allitebooks.com/administration/mail-servers/">Mail Servers</a>, <a href="http://www.allitebooks.com/administration/microsoft-platform/">Microsoft Platform</a>, <a href="http://www.allitebooks.com/administration/monitoring/">Monitoring</a>, <a href="http://www.allitebooks.com/administration/task-automation/">Task Automation</a>, <a href="http://www.allitebooks.com/administration/web-servers/">Web Servers</a>, <a href="http://www.allitebooks.com/administration/other/">Other</a>, <a href="http://www.allitebooks.com/computers-technology/">Computers &amp; Technology</a>, <a href="http://www.allitebooks.com/computers-technology/computer-science/">Computer Science</a>, <a href="http://www.allitebooks.com/certification/">Certification</a>, <a href="http://www.allitebooks.com/enterprise/">Enterprise</a>, <a href="http://www.allitebooks.com/enterprise/business-applications/">Business Applications</a>, <a 
href="http://www.allitebooks.com/enterprise/communications/">Communications</a>, <a href="http://www.allitebooks.com/enterprise/erp-crm/">ERP &amp; CRM</a>, <a href="http://www.allitebooks.com/marketing-seo/">Marketing &amp; SEO</a>, <a href="http://www.allitebooks.com/hardware/">Hardware &amp; DIY</a>, <a href="http://www.allitebooks.com/security/">Security</a>, <a href="http://www.allitebooks.com/software/">Software</a>, <a href="http://www.allitebooks.com/software/mac/">Mac</a>, <a href="http://www.allitebooks.com/software/office/">Office</a>, <a href="http://www.allitebooks.com/software/windows-pc/">Windows &amp; PC</a>, <a href="http://www.allitebooks.com/web-development/">Web Development</a>, <a href="http://www.allitebooks.com/programming/">Programming</a>, <a href="http://www.allitebooks.com/datebases/">Datebases</a>, <a href="http://www.allitebooks.com/graphics-design/">Graphics &amp; Design</a>, <a href="http://www.allitebooks.com/operating-systems/">Operating Systems</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/">Networking &amp; Cloud Computing</a>, <a href="http://www.allitebooks.com/administration/">Administration</a>, <a href="http://www.allitebooks.com/certification/">Certification</a>, <a href="http://www.allitebooks.com/computers-technology/">Computers &amp; Technology</a>, <a href="http://www.allitebooks.com/enterprise/">Enterprise</a>, <a href="http://www.allitebooks.com/game-programming/">Game Programming</a>, <a href="http://www.allitebooks.com/hardware/">Hardware &amp; DIY</a>, <a href="http://www.allitebooks.com/marketing-seo/">Marketing &amp; SEO</a>, <a href="http://www.allitebooks.com/security/">Security</a>, <a href="http://www.allitebooks.com/software/">Software</a>, <a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">
<img alt="Code Generation with Roslyn" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Code-Generation-with-Roslyn.jpg" width="351"/> </a>, <a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">Code Generation with Roslyn</a>, <a href="http://www.allitebooks.com/author/nick-harrison/" rel="tag">Nick Harrison</a>, <a href="http://www.allitebooks.com/beginning-power-bi-2nd-edition/" rel="bookmark">
<img alt="Beginning Power BI" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Beginning-Power-BI.jpg" width="350"/> </a>, <a href="http://www.allitebooks.com/beginning-power-bi-2nd-edition/" rel="bookmark">Beginning Power BI, 2nd Edition</a>, <a href="http://www.allitebooks.com/author/dan-clark/" rel="tag">Dan Clark</a>, <a href="http://www.allitebooks.com/cisco-lan-switching-configuration-handbook-2nd-edition/" rel="bookmark">
<img alt="Cisco LAN Switching Configuration Handbook, 2nd Edition" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Cisco-LAN-Switching-Configuration-Handbook-2nd-Edition.jpg" width="392"/> </a>, <a href="http://www.allitebooks.com/cisco-lan-switching-configuration-handbook-2nd-edition/" rel="bookmark">Cisco LAN Switching Configuration Handbook, 2nd Edition</a>, <a href="http://www.allitebooks.com/author/david-hucaby/" rel="tag">David Hucaby</a>, <a href="http://www.allitebooks.com/author/david-jansen/" rel="tag">David Jansen</a>, <a href="http://www.allitebooks.com/author/steve-mcquerry/" rel="tag">Steve McQuerry</a>, <a href="http://www.allitebooks.com/data-visualisation-with-r/" rel="bookmark">
<img alt="Data Visualisation with R" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Data-Visualisation-with-R.jpg" width="332"/> </a>, <a href="http://www.allitebooks.com/data-visualisation-with-r/" rel="bookmark">Data Visualisation with R</a>, <a href="http://www.allitebooks.com/author/thomas-rahlf/" rel="tag">Thomas Rahlf</a>, <a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">
<img alt="Pro MongoDB Development" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Pro-MongoDB-Development.jpg" width="350"/> </a>, <a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">Pro MongoDB Development</a>, <a href="http://www.allitebooks.com/author/deepak-vohra/" rel="tag">Deepak Vohra</a>, <a href="http://www.allitebooks.com/html5-for-flash-developers/" rel="bookmark">
<img alt="HTML5 for Flash Developers" class="attachment-post-thumbnail wp-post-image" height="493" src="http://www.allitebooks.com/wp-content/uploads/2017/02/HTML5-for-Flash-Developers-400x493.jpg" width="400"/> </a>, <a href="http://www.allitebooks.com/html5-for-flash-developers/" rel="bookmark">HTML5 for Flash Developers</a>, <a href="http://www.allitebooks.com/author/matt-fisher/" rel="tag">Matt Fisher</a>, <a href="http://www.allitebooks.com/microsoft-windows-server-2012-administration-instant-reference/" rel="bookmark">
<img alt="Microsoft Windows Server 2012 Administration Instant Reference" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Microsoft-Windows-Server-2012-Administration-Instant-Reference.jpg" width="332"/> </a>, <a href="http://www.allitebooks.com/microsoft-windows-server-2012-administration-instant-reference/" rel="bookmark">Microsoft Windows Server 2012 Administration Instant Reference</a>, <a href="http://www.allitebooks.com/author/chris-henley/" rel="tag">Chris Henley</a>, <a href="http://www.allitebooks.com/author/matthew-hester/" rel="tag">Matthew Hester</a>, <a href="http://www.allitebooks.com/learning-concurrent-programming-in-scala-2nd-edition/" rel="bookmark">
<img alt="Learning Concurrent Programming in Scala, 2nd Edition" class="attachment-post-thumbnail wp-post-image" height="475" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Learning-Concurrent-Programming-in-Scala-2nd-Edition-400x475.jpg" width="400"/> </a>, <a href="http://www.allitebooks.com/learning-concurrent-programming-in-scala-2nd-edition/" rel="bookmark">Learning Concurrent Programming in Scala, 2nd Edition</a>, <a href="http://www.allitebooks.com/author/aleksandar-prokopec/" rel="tag">Aleksandar Prokopec</a>, <a href="http://www.allitebooks.com/ebay-commerce-cookbook/" rel="bookmark">
<img alt="eBay Commerce Cookbook" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/eBay-Commerce-Cookbook.jpg" width="389"/> </a>, <a href="http://www.allitebooks.com/ebay-commerce-cookbook/" rel="bookmark">eBay Commerce Cookbook</a>, <a href="http://www.allitebooks.com/author/chuck-hudson/" rel="tag">Chuck Hudson</a>, <a href="http://www.allitebooks.com/beginning-c-2008-objects/" rel="bookmark">
<img alt="Beginning C 2008 Objects" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Beginning-C-2008-Objects.jpg" width="378"/> </a>, <a href="http://www.allitebooks.com/beginning-c-2008-objects/" rel="bookmark">Beginning C# 2008 Objects</a>, <a href="http://www.allitebooks.com/author/grant-palmer/" rel="tag">Grant Palmer</a>, <a href="http://www.allitebooks.com/author/william-barker/" rel="tag">William Barker</a>, <a href="http://www.allitebooks.com/page/2/" title="2">2</a>, <a href="http://www.allitebooks.com/page/3/" title="3">3</a>, <a href="http://www.allitebooks.com/page/4/" title="4">4</a>, <a href="http://www.allitebooks.com/page/5/" title="5">5</a>, <a href="http://www.allitebooks.com/page/692/" title="Last Page →">692</a>]
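The list above mixes absolute links, a relative link (`href="/"`) and an anchor with no `href` at all (`All Categories`). A sketch of collecting absolute URLs from such anchors with `urllib.parse.urljoin`. The snippet reuses three anchors from the output, and `base_url` is the page address used in the notebook.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://www.allitebooks.com'

# Three anchors copied from the output above.
sample = '''
<a href="/" title="All IT eBooks">All IT eBooks</a>
<a title="All Categories">Categories</a>
<a href="http://www.allitebooks.com/web-development/">Web Development</a>
'''

soup = BeautifulSoup(sample, 'html.parser')

links = []
for a in soup.find_all('a'):
    href = a.get('href')  # None when the attribute is missing
    if href is not None:
        links.append(urljoin(base_url, href))

print(links)
```

Using `a.get('href')` rather than `a['href']` avoids a `KeyError` on anchors that lack the attribute.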

In [17]:
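print(bs.find_all('p'))
# Note: the output below appears to come from an earlier run (cell 17) in which
# bs held the page http://www.enoughproject.org/take_action, not allitebooks.com.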


[<p class="rteright"><input name="organization_KEY" type="hidden" value="51595"/> <input name="chapter_KEY" type="hidden" value="314"/> <input name="email_trigger_KEYS" type="hidden" value="28321"/> <input name="object" type="hidden" value="supporter"/> <input name="Receive_Email" type="hidden" value="1"/> <input name="link" type="hidden" value="groups"/> <input name="linkKey" type="hidden" value="152258"/> <input name="redirect" type="hidden" value="http://www.enoughproject.org/eloqua/thank-you-signing"/> <input class="salsainput" id="Email_4" name="Email" placeholder="email address" title="enter value" type="text" value=""/> <input type="Submit" value="Sign up"/></p>, <p>Thank you for your committment to ending genocide and mass atrocities. We will be updating this page shortly with additional actions.</p>, <p style="padding:0"><a href="http://www.facebook.com/enoughproj"><img name="Facebook" src="http://www.enoughproject.org/files/icons/facebook.png" style="margin:0 10px" title="Facebook"/></a><a href="http://www.twitter.com/enoughproject"><img name="Twitter" src="http://www.enoughproject.org/files/icons/twitter.png" style="margin:0 10px" title="Twitter"/></a><a href="http://www.youtube.com/EnoughProject"><img name="YouTube" src="http://www.enoughproject.org/files/icons/youtube.png" style="margin:0 10px" title="YouTube"/></a><a href="http://www.flickr.com/photos/enoughproject/"><img name="Flickr" src="http://www.enoughproject.org/files/icons/flickr.png" style="margin:0 10px" title="Flickr"/></a><a href="http://instagram.com/enoughproject"><img name="Instagram" src="http://www.enoughproject.org/files/icons/instagram.png" style="margin:0 10px" title="Instagram"/></a></p>, <p><strong>Enough Project</strong><br/>
1420 K St. NW, Suite 200, Washington, DC 20005<br/>
Phone: (<span style="color: rgb(38, 50, 56); font-family: arial, sans-serif; line-height: 16px;">202) 580-7690</span></p>]

In [30]:
# direct children of <head>, including whitespace strings and comments
header_children = list(bs.head.children)
print(header_children)


['\n', <meta charset="utf-8"/>, '\n', <title>All IT eBooks - Free IT eBooks Download</title>, '\n', <link href="http://gmpg.org/xfn/11" rel="profile"/>, '\n', <link href="http://www.allitebooks.com/xmlrpc.php" rel="pingback"/>, '\n', <meta content="width=device-width, initial-scale=1.0" name="viewport"/>, '\n', ' This site is optimized with the Yoast WordPress SEO plugin v2.1.1 - https://yoast.com/wordpress/plugins/seo/ ', '\n', <meta content="Free IT eBooks Download" name="description"/>, '\n', <link href="http://www.allitebooks.com" rel="canonical"/>, '\n', <link href="http://www.allitebooks.com/page/2/" rel="next"/>, '\n', <script type="application/ld+json">{"@context":"http:\/\/schema.org","@type":"WebSite","url":"http:\/\/www.allitebooks.com\/","name":"All IT eBooks","potentialAction":{"@type":"SearchAction","target":"http:\/\/www.allitebooks.com\/?s={search_term}","query-input":"required name=search_term"}}</script>, '\n', ' / Yoast WordPress SEO plugin. ', '\n', <link href="http://www.allitebooks.com/feed/" rel="alternate" title="All IT eBooks » Feed" type="application/rss+xml"/>, '\n', <link href="http://www.allitebooks.com/comments/feed/" rel="alternate" title="All IT eBooks » Comments Feed" type="application/rss+xml"/>, '\n', <link href="http://www.allitebooks.com/wp-content/plugins/wp-to-twitter/css/twitter-feed.css?ver=4.1.1" id="wpt-twitter-feed-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/css/bootstrap.css?ver=4.1.1" id="bootstrap-style-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/css/font-awesome.min.css?ver=4.1.1" id="fontawesome-style-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/style.css?ver=4.1.1" id="classPlus-style-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <link 
href="http://www.allitebooks.com/wp-content/themes/allitebooks/css/custom.css.php?ver=4.1.1" id="custom-css-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <script src="http://www.allitebooks.com/wp-includes/js/jquery/jquery.js?ver=1.11.1" type="text/javascript"></script>, '\n', <script src="http://www.allitebooks.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1" type="text/javascript"></script>, '\n', <script src="http://www.allitebooks.com/wp-content/themes/allitebooks/js/superfish.js?ver=4.1.1" type="text/javascript"></script>, '\n', <script src="http://www.allitebooks.com/wp-content/themes/allitebooks/js/bootstrap.min.js?ver=4.1.1" type="text/javascript"></script>, '\n', <script src="http://www.allitebooks.com/wp-content/themes/allitebooks/js/jquery.autosize.js?ver=4.1.1" type="text/javascript"></script>, '\n', <script type="text/javascript">
	window._wp_rp_static_base_url = 'https://wprp.zemanta.com/static/';
	window._wp_rp_wp_ajax_url = "http://www.allitebooks.com/wp-admin/admin-ajax.php";
	window._wp_rp_plugin_version = '3.5.4';
	window._wp_rp_post_id = '26687';
	window._wp_rp_num_rel_posts = '4';
	window._wp_rp_thumbnails = true;
	window._wp_rp_post_title = 'Code+Generation+with+Roslyn';
	window._wp_rp_post_tags = ['.net', 'c+%26amp%3B+c%2B%2B', 'system', 'write', 'busi', 'learn', 'comput', 'code', 'innov', 'gener', 'softwar', 'logic', 'tree', 'tabl', 'data', 'design', 'book'];
	window._wp_rp_promoted_content = true;
</script>, '\n', <script async="" src="https://wprp.zemanta.com/static/js/loader.js?version=3.5.4" type="text/javascript"></script>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/images/favicon.ico" id="site-favicon" rel="shortcut icon" type="image/x-icon"/>, ' ', ' Mobile Specific Meta ', '\n', <meta content="yes" name="apple-mobile-web-app-capable"/>, '\n', <meta content="black" name="apple-mobile-web-app-status-bar-style"/>, '\n']
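The list above contains `'\n'` strings and comment text alongside the tags, because `.children` yields every child node. A sketch of keeping only the tags; the HTML is a made-up fragment modeled on the head shown above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment modeled on the <head> printed above.
sample = '''<html><head>
<meta charset="utf-8"/>
<title>All IT eBooks - Free IT eBooks Download</title>
<!-- Mobile Specific Meta -->
</head><body></body></html>'''

soup = BeautifulSoup(sample, 'html.parser')

# .children yields tags, whitespace strings and comments alike;
# keeping only Tag instances drops the newline strings and the comment.
head_tags = [c for c in soup.head.children if isinstance(c, Tag)]
print([t.name for t in head_tags])
```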

In [31]:
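navigation_bar = bs.find(id="globalNavigation")
for d in navigation_bar.descendants:
    print(d)
# Note: find() returns None when no element matches. The allitebooks.com page
# has no element with this id, which causes the AttributeError shown below.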


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-569213ddd6da> in <module>()
      1 navigation_bar = bs.find(id="globalNavigation")
----> 2 for d in navigation_bar.descendants:
      3     print(d)

AttributeError: 'NoneType' object has no attribute 'descendants'
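The error above occurs because `find()` returns `None` when nothing matches. A minimal sketch of guarding against that case. The helper name `descendants_of` and the HTML fragment are hypothetical, introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up page fragment: it has a menu but no 'globalNavigation' id.
sample = '<div id="menu"><ul><li>About</li><li>Blog</li></ul></div>'
soup = BeautifulSoup(sample, 'html.parser')


def descendants_of(soup, element_id):
    """Return the descendants of the element with the given id, or [] if absent."""
    element = soup.find(id=element_id)
    if element is None:  # find() signals "no match" with None
        return []
    return list(element.descendants)


print(descendants_of(soup, 'globalNavigation'))
print(len(descendants_of(soup, 'menu')))
```

Returning an empty list keeps a later `for` loop valid whether or not the element exists.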

In [33]:
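for s in d.previous_siblings:
    print(s)
# Note: d appears to be left over from an earlier run against the
# enoughproject.org page, where the lookup succeeded; in a fresh session it is undefined.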


<ul>
<li id="navAbout"><a href="/about" title="About"><span></span>About</a></li>
<li id="navBlog"><a href="/blog" title="Blog"><span></span>Blog</a></li>
<li id="navConflicts"><a href="/conflicts" title="Conflicts"><span></span>Conflicts</a></li>
<li id="navReports"><a href="/reports" title="Reports"><span></span>Reports</a></li>
<li id="navTakeAction"><a class="selected" href="/take_action" title="Take Action"><span></span>Take Action</a></li>
<!--<li id="navShop"><a href="/shop" title="Shop"><span></span>Shop</a></li>-->
<li id="navDonate"><a href="/donate" title="Donate"><span></span>Donate</a></li>
</ul>



In [35]:
from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.allitebooks.com')
bs = BeautifulSoup(page.content, "lxml")
# The selector below matches nothing on this page, so the loop body never runs.
ta_divs = bs.find_all("div", class_="views-row")
print(len(ta_divs))
for ta in ta_divs:
    title = ta.h2
    link = ta.a
    about = ta.find_all('p')
    print(title, link, about)


0

In [36]:
from lxml import html

In [37]:
page = html.parse('http://www.enoughproject.org/take_action')
root = page.getroot()
ta_divs = root.cssselect('div.views-row')
# Note: cssselect() needs the optional cssselect package (pip install cssselect).


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/Users/jdvelasq/anaconda/lib/python3.6/site-packages/lxml/cssselect.py in <module>()
     12 try:
---> 13     import cssselect as external_cssselect
     14 except ImportError:

ModuleNotFoundError: No module named 'cssselect'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-37-0b427fa23b52> in <module>()
      1 page = html.parse('http://www.enoughproject.org/take_action')
      2 root = page.getroot()
----> 3 ta_divs = root.cssselect('div.views-row')

/Users/jdvelasq/anaconda/lib/python3.6/site-packages/lxml/html/__init__.py in cssselect(self, expr, translator)
    430         """
    431         # Do the import here to make the dependency optional.
--> 432         from lxml.cssselect import CSSSelector
    433         return CSSSelector(expr, translator=translator)(self)
    434 

/Users/jdvelasq/anaconda/lib/python3.6/site-packages/lxml/cssselect.py in <module>()
     14 except ImportError:
     15     raise ImportError(
---> 16         'cssselect does not seem to be installed. '
     17         'See http://packages.python.org/cssselect/')
     18 

ImportError: cssselect does not seem to be installed. See http://packages.python.org/cssselect/
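The traceback above is raised because `cssselect` is an optional dependency of lxml. Installing it is one fix. A sketch of an alternative is XPath, which lxml supports without extra packages. The HTML below is a made-up fragment that mimics the class name queried above; it is not the real page.

```python
from lxml import html

# Made-up document that mimics the class name queried above.
sample = '''<html><body>
<div class="views-row views-row-1"><h2><a href="/a">First action</a></h2></div>
<div class="other"><h2>Ignored</h2></div>
<div class="views-row"><h2><a href="/b">Second action</a></h2></div>
</body></html>'''

root = html.fromstring(sample)

# XPath counterpart of the CSS selector 'div.views-row': match the class as a
# whole word inside the space-separated class attribute.
xpath = '//div[contains(concat(" ", normalize-space(@class), " "), " views-row ")]'
ta_divs = root.xpath(xpath)

titles = [div.findtext('.//a') for div in ta_divs]
print(len(ta_divs), titles)
```

The `concat`/`normalize-space` idiom avoids false matches on class names that merely contain `views-row` as a substring.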

In [38]:
# print is a function in Python 3
print(ta_divs)
