Software utilizado
Este es un documento interactivo escrito como un notebook de Jupyter, en el cual se presenta un tutorial sobre la extracción, transformación, visualización y carga de datos usando Python en el contexto de la ciencia de los datos. Los notebooks de Jupyter permiten incoporar simultáneamente código, texto, gráficos y ecuaciones. El código presentado en este notebook puede ejecutarse en los sistemas operativos Linux y OS X.
Haga click aquí para obtener instrucciones detalladas sobre como instalar Jupyter en Windows y Mac OS X.
Haga clic aquí para ver la última versión de este documento en nbviewer.
Descargue la última versión de este documento a su disco duro; luego, carguelo y ejecutelo en línea en Try Jupyter!
Bibliografía.
The Python Tutorial by Python Software Fundation
IPython in deep at GitHub
IPython wiki at GitHub
In [6]:
# importa las librerias
import urllib.request
import urllib.parse
# import urllib2
In [7]:
google = urllib.request.urlopen('http://google.com')
google = google.read()
print(google[:200])
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es-419"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x'
In [11]:
url = 'http://google.com?q='
url_with_query = url + urllib.parse.quote('python web scraping')
web_search = urllib.request.urlopen(url_with_query)
web_search = web_search.read()
print(web_search[:400])
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es-419"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script>(function(){window.google={kEI:\'xkK4WJSRG8yymwGYzZ24Dw\',kEXPI:\'18167,1351828,1351903,1352240,1352623,1352995,3700284,37'
In [12]:
import requests
google = requests.get('http://google.com')
print(google.status_code)
print(google.content[:200])
print(google.headers)
print(google.cookies.items())
200
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es-419"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x'
{'Date': 'Thu, 02 Mar 2017 16:06:18 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'Content-Length': '4579', 'X-XSS-Protection': '1; mode=block', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': 'NID=98=s6YBZCZgKy3lIWCYYXCzoa0rxmueIipaGfKsAuvBly3OKk7Sq0ncGjmi7yzZv_uvQGj0zVxvYTUTNFbAitFkjahnfytwxL3T7xW12Kf-W5g0zB8yvKQPUHaeFOkAwMxl; expires=Fri, 01-Sep-2017 16:06:18 GMT; path=/; domain=.google.com.co; HttpOnly'}
[('NID', '98=s6YBZCZgKy3lIWCYYXCzoa0rxmueIipaGfKsAuvBly3OKk7Sq0ncGjmi7yzZv_uvQGj0zVxvYTUTNFbAitFkjahnfytwxL3T7xW12Kf-W5g0zB8yvKQPUHaeFOkAwMxl')]
In [40]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
In [56]:
html = urlopen("http://www.allitebooks.com")
bsObj = BeautifulSoup(html.read())
titles = bsObj.findAll("a", {'rel':'bookmark'})
for x in titles:
print(x)
print(' ')
<a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">
<img alt="Code Generation with Roslyn" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Code-Generation-with-Roslyn.jpg" width="351"/> </a>
<a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">Code Generation with Roslyn</a>
<a href="http://www.allitebooks.com/beginning-power-bi-2nd-edition/" rel="bookmark">
<img alt="Beginning Power BI" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Beginning-Power-BI.jpg" width="350"/> </a>
<a href="http://www.allitebooks.com/beginning-power-bi-2nd-edition/" rel="bookmark">Beginning Power BI, 2nd Edition</a>
<a href="http://www.allitebooks.com/cisco-lan-switching-configuration-handbook-2nd-edition/" rel="bookmark">
<img alt="Cisco LAN Switching Configuration Handbook, 2nd Edition" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Cisco-LAN-Switching-Configuration-Handbook-2nd-Edition.jpg" width="392"/> </a>
<a href="http://www.allitebooks.com/cisco-lan-switching-configuration-handbook-2nd-edition/" rel="bookmark">Cisco LAN Switching Configuration Handbook, 2nd Edition</a>
<a href="http://www.allitebooks.com/data-visualisation-with-r/" rel="bookmark">
<img alt="Data Visualisation with R" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Data-Visualisation-with-R.jpg" width="332"/> </a>
<a href="http://www.allitebooks.com/data-visualisation-with-r/" rel="bookmark">Data Visualisation with R</a>
<a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">
<img alt="Pro MongoDB Development" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Pro-MongoDB-Development.jpg" width="350"/> </a>
<a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">Pro MongoDB Development</a>
<a href="http://www.allitebooks.com/html5-for-flash-developers/" rel="bookmark">
<img alt="HTML5 for Flash Developers" class="attachment-post-thumbnail wp-post-image" height="493" src="http://www.allitebooks.com/wp-content/uploads/2017/02/HTML5-for-Flash-Developers-400x493.jpg" width="400"/> </a>
<a href="http://www.allitebooks.com/html5-for-flash-developers/" rel="bookmark">HTML5 for Flash Developers</a>
<a href="http://www.allitebooks.com/microsoft-windows-server-2012-administration-instant-reference/" rel="bookmark">
<img alt="Microsoft Windows Server 2012 Administration Instant Reference" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Microsoft-Windows-Server-2012-Administration-Instant-Reference.jpg" width="332"/> </a>
<a href="http://www.allitebooks.com/microsoft-windows-server-2012-administration-instant-reference/" rel="bookmark">Microsoft Windows Server 2012 Administration Instant Reference</a>
<a href="http://www.allitebooks.com/learning-concurrent-programming-in-scala-2nd-edition/" rel="bookmark">
<img alt="Learning Concurrent Programming in Scala, 2nd Edition" class="attachment-post-thumbnail wp-post-image" height="475" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Learning-Concurrent-Programming-in-Scala-2nd-Edition-400x475.jpg" width="400"/> </a>
<a href="http://www.allitebooks.com/learning-concurrent-programming-in-scala-2nd-edition/" rel="bookmark">Learning Concurrent Programming in Scala, 2nd Edition</a>
<a href="http://www.allitebooks.com/ebay-commerce-cookbook/" rel="bookmark">
<img alt="eBay Commerce Cookbook" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/eBay-Commerce-Cookbook.jpg" width="389"/> </a>
<a href="http://www.allitebooks.com/ebay-commerce-cookbook/" rel="bookmark">eBay Commerce Cookbook</a>
<a href="http://www.allitebooks.com/beginning-c-2008-objects/" rel="bookmark">
<img alt="Beginning C 2008 Objects" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Beginning-C-2008-Objects.jpg" width="378"/> </a>
<a href="http://www.allitebooks.com/beginning-c-2008-objects/" rel="bookmark">Beginning C# 2008 Objects</a>
/Users/jdvelasq/anaconda/lib/python3.6/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 193 of the file /Users/jdvelasq/anaconda/lib/python3.6/runpy.py. To get rid of this warning, change code that looks like this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [25]:
page = requests.get('http://www.allitebooks.com')
bs = BeautifulSoup(page.content)
print(bs.title)
<title>All IT eBooks - Free IT eBooks Download</title>
/Users/jdvelasq/anaconda/lib/python3.6/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 193 of the file /Users/jdvelasq/anaconda/lib/python3.6/runpy.py. To get rid of this warning, change code that looks like this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
In [29]:
print(bs.find_all('a'))
[<a href="/" title="All IT eBooks">All IT eBooks</a>, <a title="All Categories">Categories</a>, <a href="http://www.allitebooks.com/web-development/">Web Development</a>, <a href="http://www.allitebooks.com/web-development/asp-net/">ASP.NET</a>, <a href="http://www.allitebooks.com/web-development/cms/">CMS</a>, <a href="http://www.allitebooks.com/web-development/html-html5-css/">HTML, HTML5 & CSS</a>, <a href="http://www.allitebooks.com/web-development/javascript/">JavaScript</a>, <a href="http://www.allitebooks.com/web-development/jsp/">JSP</a>, <a href="http://www.allitebooks.com/web-development/php/">PHP</a>, <a href="http://www.allitebooks.com/web-development/python/">Python</a>, <a href="http://www.allitebooks.com/web-development/ruby/">Ruby</a>, <a href="http://www.allitebooks.com/web-development/rails/">Rails</a>, <a href="http://www.allitebooks.com/web-development/xml/">XML</a>, <a href="http://www.allitebooks.com/web-development/services-apis/">Services & APIs</a>, <a href="http://www.allitebooks.com/web-development/other-web-development/">Other</a>, <a href="http://www.allitebooks.com/programming/">Programming</a>, <a href="http://www.allitebooks.com/programming/c/">C & C++</a>, <a href="http://www.allitebooks.com/programming/c-programming/">C#</a>, <a href="http://www.allitebooks.com/programming/delphi/">Delphi</a>, <a href="http://www.allitebooks.com/programming/java/">Java</a>, <a href="http://www.allitebooks.com/programming/net/">.NET</a>, <a href="http://www.allitebooks.com/programming/objective-c/">Objective-C</a>, <a href="http://www.allitebooks.com/programming/opencl/">OpenCL</a>, <a href="http://www.allitebooks.com/programming/perl/">Perl</a>, <a href="http://www.allitebooks.com/programming/powershell/">PowerShell</a>, <a href="http://www.allitebooks.com/programming/scala/">Scala</a>, <a href="http://www.allitebooks.com/programming/swift/">Swift</a>, <a href="http://www.allitebooks.com/programming/visual-basic/">Visual Basic</a>, <a href="http://www.allitebooks.com/datebases/">Datebases</a>, <a href="http://www.allitebooks.com/datebases/big-data/">Big Data</a>, <a href="http://www.allitebooks.com/datebases/data-analysis/">Data Analysis</a>, <a href="http://www.allitebooks.com/datebases/mongodb/">MongoDB</a>, <a href="http://www.allitebooks.com/datebases/mysql/">MySQL</a>, <a href="http://www.allitebooks.com/datebases/nosql/">NoSQL</a>, <a href="http://www.allitebooks.com/datebases/postgresql/">PostgreSQL</a>, <a href="http://www.allitebooks.com/datebases/oracle/">Oracle</a>, <a href="http://www.allitebooks.com/datebases/sql/">SQL</a>, <a href="http://www.allitebooks.com/game-programming/">Game Programming</a>, <a href="http://www.allitebooks.com/graphics-design/">Graphics & Design</a>, <a href="http://www.allitebooks.com/graphics-design/3d-max/">3D MAX</a>, <a href="http://www.allitebooks.com/graphics-design/cad/">CAD</a>, <a href="http://www.allitebooks.com/graphics-design/coreldraw/">Coreldraw</a>, <a href="http://www.allitebooks.com/graphics-design/dreamweaver/">Dreamweaver</a>, <a href="http://www.allitebooks.com/graphics-design/flash/">Flash</a>, <a href="http://www.allitebooks.com/graphics-design/illustrator/">Illustrator</a>, <a href="http://www.allitebooks.com/graphics-design/maya/">Maya</a>, <a href="http://www.allitebooks.com/graphics-design/photoshop/">Photoshop</a>, <a href="http://www.allitebooks.com/graphics-design/premiere/">Premiere</a>, <a href="http://www.allitebooks.com/operating-systems/">Operating Systems</a>, <a href="http://www.allitebooks.com/operating-systems/windows/">Windows</a>, <a href="http://www.allitebooks.com/operating-systems/linux-unix/">Linux & Unix</a>, <a href="http://www.allitebooks.com/operating-systems/macintosh/">Macintosh</a>, <a href="http://www.allitebooks.com/operating-systems/android/">Android</a>, <a href="http://www.allitebooks.com/operating-systems/ios/">iOS</a>, <a href="http://www.allitebooks.com/operating-systems/windows-phone/">Windows Phone</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/">Networking & Cloud Computing</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/cloud-computing/">Cloud Computing</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/network-administration/">Network Administration</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/network-security/">Network Security</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/networks-protocols-apis/">Networks, Protocols & APIs</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/wireless-networks/">Wireless Networks</a>, <a href="http://www.allitebooks.com/administration/">Administration</a>, <a href="http://www.allitebooks.com/administration/cloud-virtualization/">Cloud & Virtualization</a>, <a href="http://www.allitebooks.com/administration/infrastructure/">Infrastructure</a>, <a href="http://www.allitebooks.com/administration/mail-servers/">Mail Servers</a>, <a href="http://www.allitebooks.com/administration/microsoft-platform/">Microsoft Platform</a>, <a href="http://www.allitebooks.com/administration/monitoring/">Monitoring</a>, <a href="http://www.allitebooks.com/administration/task-automation/">Task Automation</a>, <a href="http://www.allitebooks.com/administration/web-servers/">Web Servers</a>, <a href="http://www.allitebooks.com/administration/other/">Other</a>, <a href="http://www.allitebooks.com/computers-technology/">Computers & Technology</a>, <a href="http://www.allitebooks.com/computers-technology/computer-science/">Computer Science</a>, <a href="http://www.allitebooks.com/certification/">Certification</a>, <a href="http://www.allitebooks.com/enterprise/">Enterprise</a>, <a href="http://www.allitebooks.com/enterprise/business-applications/">Business Applications</a>, <a href="http://www.allitebooks.com/enterprise/communications/">Communications</a>, <a href="http://www.allitebooks.com/enterprise/erp-crm/">ERP & CRM</a>, <a href="http://www.allitebooks.com/marketing-seo/">Marketing & SEO</a>, <a href="http://www.allitebooks.com/hardware/">Hardware & DIY</a>, <a href="http://www.allitebooks.com/security/">Security</a>, <a href="http://www.allitebooks.com/software/">Software</a>, <a href="http://www.allitebooks.com/software/mac/">Mac</a>, <a href="http://www.allitebooks.com/software/office/">Office</a>, <a href="http://www.allitebooks.com/software/windows-pc/">Windows & PC</a>, <a href="http://www.allitebooks.com/web-development/">Web Development</a>, <a href="http://www.allitebooks.com/programming/">Programming</a>, <a href="http://www.allitebooks.com/datebases/">Datebases</a>, <a href="http://www.allitebooks.com/graphics-design/">Graphics & Design</a>, <a href="http://www.allitebooks.com/operating-systems/">Operating Systems</a>, <a href="http://www.allitebooks.com/networking-cloud-computing/">Networking & Cloud Computing</a>, <a href="http://www.allitebooks.com/administration/">Administration</a>, <a href="http://www.allitebooks.com/certification/">Certification</a>, <a href="http://www.allitebooks.com/computers-technology/">Computers & Technology</a>, <a href="http://www.allitebooks.com/enterprise/">Enterprise</a>, <a href="http://www.allitebooks.com/game-programming/">Game Programming</a>, <a href="http://www.allitebooks.com/hardware/">Hardware & DIY</a>, <a href="http://www.allitebooks.com/marketing-seo/">Marketing & SEO</a>, <a href="http://www.allitebooks.com/security/">Security</a>, <a href="http://www.allitebooks.com/software/">Software</a>, <a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">
<img alt="Code Generation with Roslyn" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Code-Generation-with-Roslyn.jpg" width="351"/> </a>, <a href="http://www.allitebooks.com/code-generation-with-roslyn/" rel="bookmark">Code Generation with Roslyn</a>, <a href="http://www.allitebooks.com/author/nick-harrison/" rel="tag">Nick Harrison</a>, <a href="http://www.allitebooks.com/beginning-power-bi-2nd-edition/" rel="bookmark">
<img alt="Beginning Power BI" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Beginning-Power-BI.jpg" width="350"/> </a>, <a href="http://www.allitebooks.com/beginning-power-bi-2nd-edition/" rel="bookmark">Beginning Power BI, 2nd Edition</a>, <a href="http://www.allitebooks.com/author/dan-clark/" rel="tag">Dan Clark</a>, <a href="http://www.allitebooks.com/cisco-lan-switching-configuration-handbook-2nd-edition/" rel="bookmark">
<img alt="Cisco LAN Switching Configuration Handbook, 2nd Edition" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Cisco-LAN-Switching-Configuration-Handbook-2nd-Edition.jpg" width="392"/> </a>, <a href="http://www.allitebooks.com/cisco-lan-switching-configuration-handbook-2nd-edition/" rel="bookmark">Cisco LAN Switching Configuration Handbook, 2nd Edition</a>, <a href="http://www.allitebooks.com/author/david-hucaby/" rel="tag">David Hucaby</a>, <a href="http://www.allitebooks.com/author/david-jansen/" rel="tag">David Jansen</a>, <a href="http://www.allitebooks.com/author/steve-mcquerry/" rel="tag">Steve McQuerry</a>, <a href="http://www.allitebooks.com/data-visualisation-with-r/" rel="bookmark">
<img alt="Data Visualisation with R" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Data-Visualisation-with-R.jpg" width="332"/> </a>, <a href="http://www.allitebooks.com/data-visualisation-with-r/" rel="bookmark">Data Visualisation with R</a>, <a href="http://www.allitebooks.com/author/thomas-rahlf/" rel="tag">Thomas Rahlf</a>, <a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">
<img alt="Pro MongoDB Development" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/03/Pro-MongoDB-Development.jpg" width="350"/> </a>, <a href="http://www.allitebooks.com/pro-mongodb-development/" rel="bookmark">Pro MongoDB Development</a>, <a href="http://www.allitebooks.com/author/deepak-vohra/" rel="tag">Deepak Vohra</a>, <a href="http://www.allitebooks.com/html5-for-flash-developers/" rel="bookmark">
<img alt="HTML5 for Flash Developers" class="attachment-post-thumbnail wp-post-image" height="493" src="http://www.allitebooks.com/wp-content/uploads/2017/02/HTML5-for-Flash-Developers-400x493.jpg" width="400"/> </a>, <a href="http://www.allitebooks.com/html5-for-flash-developers/" rel="bookmark">HTML5 for Flash Developers</a>, <a href="http://www.allitebooks.com/author/matt-fisher/" rel="tag">Matt Fisher</a>, <a href="http://www.allitebooks.com/microsoft-windows-server-2012-administration-instant-reference/" rel="bookmark">
<img alt="Microsoft Windows Server 2012 Administration Instant Reference" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Microsoft-Windows-Server-2012-Administration-Instant-Reference.jpg" width="332"/> </a>, <a href="http://www.allitebooks.com/microsoft-windows-server-2012-administration-instant-reference/" rel="bookmark">Microsoft Windows Server 2012 Administration Instant Reference</a>, <a href="http://www.allitebooks.com/author/chris-henley/" rel="tag">Chris Henley</a>, <a href="http://www.allitebooks.com/author/matthew-hester/" rel="tag">Matthew Hester</a>, <a href="http://www.allitebooks.com/learning-concurrent-programming-in-scala-2nd-edition/" rel="bookmark">
<img alt="Learning Concurrent Programming in Scala, 2nd Edition" class="attachment-post-thumbnail wp-post-image" height="475" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Learning-Concurrent-Programming-in-Scala-2nd-Edition-400x475.jpg" width="400"/> </a>, <a href="http://www.allitebooks.com/learning-concurrent-programming-in-scala-2nd-edition/" rel="bookmark">Learning Concurrent Programming in Scala, 2nd Edition</a>, <a href="http://www.allitebooks.com/author/aleksandar-prokopec/" rel="tag">Aleksandar Prokopec</a>, <a href="http://www.allitebooks.com/ebay-commerce-cookbook/" rel="bookmark">
<img alt="eBay Commerce Cookbook" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/eBay-Commerce-Cookbook.jpg" width="389"/> </a>, <a href="http://www.allitebooks.com/ebay-commerce-cookbook/" rel="bookmark">eBay Commerce Cookbook</a>, <a href="http://www.allitebooks.com/author/chuck-hudson/" rel="tag">Chuck Hudson</a>, <a href="http://www.allitebooks.com/beginning-c-2008-objects/" rel="bookmark">
<img alt="Beginning C 2008 Objects" class="attachment-post-thumbnail wp-post-image" height="499" src="http://www.allitebooks.com/wp-content/uploads/2017/02/Beginning-C-2008-Objects.jpg" width="378"/> </a>, <a href="http://www.allitebooks.com/beginning-c-2008-objects/" rel="bookmark">Beginning C# 2008 Objects</a>, <a href="http://www.allitebooks.com/author/grant-palmer/" rel="tag">Grant Palmer</a>, <a href="http://www.allitebooks.com/author/william-barker/" rel="tag">William Barker</a>, <a href="http://www.allitebooks.com/page/2/" title="2">2</a>, <a href="http://www.allitebooks.com/page/3/" title="3">3</a>, <a href="http://www.allitebooks.com/page/4/" title="4">4</a>, <a href="http://www.allitebooks.com/page/5/" title="5">5</a>, <a href="http://www.allitebooks.com/page/692/" title="Last Page →">692</a>]
In [17]:
print(bs.find_all('p'))
[<p class="rteright"><input name="organization_KEY" type="hidden" value="51595"/> <input name="chapter_KEY" type="hidden" value="314"/> <input name="email_trigger_KEYS" type="hidden" value="28321"/> <input name="object" type="hidden" value="supporter"/> <input name="Receive_Email" type="hidden" value="1"/> <input name="link" type="hidden" value="groups"/> <input name="linkKey" type="hidden" value="152258"/> <input name="redirect" type="hidden" value="http://www.enoughproject.org/eloqua/thank-you-signing"/> <input class="salsainput" id="Email_4" name="Email" placeholder="email address" title="enter value" type="text" value=""/> <input type="Submit" value="Sign up"/></p>, <p>Thank you for your committment to ending genocide and mass atrocities. We will be updating this page shortly with additional actions.</p>, <p style="padding:0"><a href="http://www.facebook.com/enoughproj"><img name="Facebook" src="http://www.enoughproject.org/files/icons/facebook.png" style="margin:0 10px" title="Facebook"/></a><a href="http://www.twitter.com/enoughproject"><img name="Twitter" src="http://www.enoughproject.org/files/icons/twitter.png" style="margin:0 10px" title="Twitter"/></a><a href="http://www.youtube.com/EnoughProject"><img name="YouTube" src="http://www.enoughproject.org/files/icons/youtube.png" style="margin:0 10px" title="YouTube"/></a><a href="http://www.flickr.com/photos/enoughproject/"><img name="Flickr" src="http://www.enoughproject.org/files/icons/flickr.png" style="margin:0 10px" title="Flickr"/></a><a href="http://instagram.com/enoughproject"><img name="Instagram" src="http://www.enoughproject.org/files/icons/instagram.png" style="margin:0 10px" title="Instagram"/></a></p>, <p><strong>Enough Project</strong><br/>
1420 K St. NW, Suite 200, Washington, DC 20005<br/>
Phone: (<span style="color: rgb(38, 50, 56); font-family: arial, sans-serif; line-height: 16px;">202) 580-7690</span></p>]
In [30]:
header_children = [c for c in bs.head.children]
print(header_children)
['\n', <meta charset="utf-8"/>, '\n', <title>All IT eBooks - Free IT eBooks Download</title>, '\n', <link href="http://gmpg.org/xfn/11" rel="profile"/>, '\n', <link href="http://www.allitebooks.com/xmlrpc.php" rel="pingback"/>, '\n', <meta content="width=device-width, initial-scale=1.0" name="viewport"/>, '\n', ' This site is optimized with the Yoast WordPress SEO plugin v2.1.1 - https://yoast.com/wordpress/plugins/seo/ ', '\n', <meta content="Free IT eBooks Download" name="description"/>, '\n', <link href="http://www.allitebooks.com" rel="canonical"/>, '\n', <link href="http://www.allitebooks.com/page/2/" rel="next"/>, '\n', <script type="application/ld+json">{"@context":"http:\/\/schema.org","@type":"WebSite","url":"http:\/\/www.allitebooks.com\/","name":"All IT eBooks","potentialAction":{"@type":"SearchAction","target":"http:\/\/www.allitebooks.com\/?s={search_term}","query-input":"required name=search_term"}}</script>, '\n', ' / Yoast WordPress SEO plugin. ', '\n', <link href="http://www.allitebooks.com/feed/" rel="alternate" title="All IT eBooks » Feed" type="application/rss+xml"/>, '\n', <link href="http://www.allitebooks.com/comments/feed/" rel="alternate" title="All IT eBooks » Comments Feed" type="application/rss+xml"/>, '\n', <link href="http://www.allitebooks.com/wp-content/plugins/wp-to-twitter/css/twitter-feed.css?ver=4.1.1" id="wpt-twitter-feed-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/css/bootstrap.css?ver=4.1.1" id="bootstrap-style-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/css/font-awesome.min.css?ver=4.1.1" id="fontawesome-style-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/style.css?ver=4.1.1" id="classPlus-style-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/css/custom.css.php?ver=4.1.1" id="custom-css-css" media="all" rel="stylesheet" type="text/css"/>, '\n', <script src="http://www.allitebooks.com/wp-includes/js/jquery/jquery.js?ver=1.11.1" type="text/javascript"></script>, '\n', <script src="http://www.allitebooks.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1" type="text/javascript"></script>, '\n', <script src="http://www.allitebooks.com/wp-content/themes/allitebooks/js/superfish.js?ver=4.1.1" type="text/javascript"></script>, '\n', <script src="http://www.allitebooks.com/wp-content/themes/allitebooks/js/bootstrap.min.js?ver=4.1.1" type="text/javascript"></script>, '\n', <script src="http://www.allitebooks.com/wp-content/themes/allitebooks/js/jquery.autosize.js?ver=4.1.1" type="text/javascript"></script>, '\n', <script type="text/javascript">
window._wp_rp_static_base_url = 'https://wprp.zemanta.com/static/';
window._wp_rp_wp_ajax_url = "http://www.allitebooks.com/wp-admin/admin-ajax.php";
window._wp_rp_plugin_version = '3.5.4';
window._wp_rp_post_id = '26687';
window._wp_rp_num_rel_posts = '4';
window._wp_rp_thumbnails = true;
window._wp_rp_post_title = 'Code+Generation+with+Roslyn';
window._wp_rp_post_tags = ['.net', 'c+%26amp%3B+c%2B%2B', 'system', 'write', 'busi', 'learn', 'comput', 'code', 'innov', 'gener', 'softwar', 'logic', 'tree', 'tabl', 'data', 'design', 'book'];
window._wp_rp_promoted_content = true;
</script>, '\n', <script async="" src="https://wprp.zemanta.com/static/js/loader.js?version=3.5.4" type="text/javascript"></script>, '\n', <link href="http://www.allitebooks.com/wp-content/themes/allitebooks/images/favicon.ico" id="site-favicon" rel="shortcut icon" type="image/x-icon"/>, ' ', ' Mobile Specific Meta ', '\n', <meta content="yes" name="apple-mobile-web-app-capable"/>, '\n', <meta content="black" name="apple-mobile-web-app-status-bar-style"/>, '\n']
In [31]:
navigation_bar = bs.find(id="globalNavigation")
for d in navigation_bar.descendants:
print(d)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-31-569213ddd6da> in <module>()
1 navigation_bar = bs.find(id="globalNavigation")
----> 2 for d in navigation_bar.descendants:
3 print(d)
AttributeError: 'NoneType' object has no attribute 'descendants'
In [33]:
for s in d.previous_siblings:
print(s)
<ul>
<li id="navAbout"><a href="/about" title="About"><span></span>About</a></li>
<li id="navBlog"><a href="/blog" title="Blog"><span></span>Blog</a></li>
<li id="navConflicts"><a href="/conflicts" title="Conflicts"><span></span>Conflicts</a></li>
<li id="navReports"><a href="/reports" title="Reports"><span></span>Reports</a></li>
<li id="navTakeAction"><a class="selected" href="/take_action" title="Take Action"><span></span>Take Action</a></li>
<!--<li id="navShop"><a href="/shop" title="Shop"><span></span>Shop</a></li>-->
<li id="navDonate"><a href="/donate" title="Donate"><span></span>Donate</a></li>
</ul>
In [35]:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.allitebooks.com')
bs = BeautifulSoup(page.content)
ta_divs = bs.find_all("div", class_="views-row")
print(len(ta_divs))
for ta in ta_divs: title = ta.h2
link = ta.a
about = ta.find_all('p')
print(title, link, about)
0
<h2><a href="">Please Check Back Soon For Our Latest Actions!</a></h2> <a href="">Please Check Back Soon For Our Latest Actions!</a> [<p>Thank you for your committment to ending genocide and mass atrocities. We will be updating this page shortly with additional actions.</p>]
/Users/jdvelasq/anaconda/lib/python3.6/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 193 of the file /Users/jdvelasq/anaconda/lib/python3.6/runpy.py. To get rid of this warning, change code that looks like this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
In [ ]:
In [36]:
from lxml import html
In [37]:
page = html.parse('http://www.enoughproject.org/take_action')
root = page.getroot()
ta_divs = root.cssselect('div.views-row')
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
/Users/jdvelasq/anaconda/lib/python3.6/site-packages/lxml/cssselect.py in <module>()
12 try:
---> 13 import cssselect as external_cssselect
14 except ImportError:
ModuleNotFoundError: No module named 'cssselect'
During handling of the above exception, another exception occurred:
ImportError Traceback (most recent call last)
<ipython-input-37-0b427fa23b52> in <module>()
1 page = html.parse('http://www.enoughproject.org/take_action')
2 root = page.getroot()
----> 3 ta_divs = root.cssselect('div.views-row')
/Users/jdvelasq/anaconda/lib/python3.6/site-packages/lxml/html/__init__.py in cssselect(self, expr, translator)
430 """
431 # Do the import here to make the dependency optional.
--> 432 from lxml.cssselect import CSSSelector
433 return CSSSelector(expr, translator=translator)(self)
434
/Users/jdvelasq/anaconda/lib/python3.6/site-packages/lxml/cssselect.py in <module>()
14 except ImportError:
15 raise ImportError(
---> 16 'cssselect does not seem to be installed. '
17 'See http://packages.python.org/cssselect/')
18
ImportError: cssselect does not seem to be installed. See http://packages.python.org/cssselect/
In [38]:
print ta_divs
File "<ipython-input-38-4cc9c52e6889>", line 1
print ta_divs
^
SyntaxError: Missing parentheses in call to 'print'
In [ ]:
In [ ]:
Content source: jdvelasq/machine-learning
Similar notebooks: