In [10]:
import slate
import urllib
import re

In [3]:
furl = urllib.urlopen('http://www.nature.com/ismej/journal/v10/n1/pdf/ismej2015100a.pdf')
with open( '/tmp/asdfasdfasdf.pdf', 'wb' ) as ftempfile :   # binary mode : a PDF is not text
    ftempfile.write( furl.read() )
with open( '/tmp/asdfasdfasdf.pdf', 'rb' ) as f :
    doc = slate.PDF(f)

PDF is garbage

In this example, we are looking for a link to some source code :

http://prodege.jgi-psf.org//downloads/src

However, in the PDF, the URL is line wrapped, so the src is lost.


In [20]:
urlre = re.compile( r'(?P<url>https?://[^\s]+)' )

for page in doc :
    print urlre.findall( page )


[]
['http://prodege.jgi-psf.org//downloads/', 'http://prodege.jgi-psf.org,']
[]
['http://img.jgi.', 'http://www.nature.com/ismej)']

PDF is garbage, continued

If we remove line breaks to fix URLs that have been wrapped, we discover that the visible line breaks in the document do not correspond to actual line breaks in the represented text. The result is random garbage.


In [19]:
urlre = re.compile( r'(?P<url>https?://[^\s]+)' )

for page in doc :
    print urlre.findall( page.replace('\n','') )


[]
['http://prodege.jgi-psf.org//downloads/availablerun', 'http://prodege.jgi-psf.org,which']
[]
['http://img.jgi.Cell', 'http://creativecommons.org/licenses/by/4.0/the', 'http://www.nature.com/ismej)The']

Nope.
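
To see why neither approach can win, here is a small sketch on a made-up string (the text and the example.org URLs are invented for illustration, not taken from the paper). A newline inside a wrapped URL and a newline right after a complete URL look identical to the regex, so keeping the newlines truncates one URL and stripping them glues the next word onto both.

```python
import re

URL_RE = re.compile(r'https?://\S+')

# Synthetic stand-in for extracted PDF text (an assumption for illustration):
# the first URL is wrapped after "downloads/", the second ends at a line end.
text = ('Source: http://example.org/downloads/\n'
        'src\n'
        'See also http://example.org\n'
        'which hosts the docs.')

def find_urls(s):
    """Return every http(s) URL-looking token in s."""
    return URL_RE.findall(s)

as_extracted = find_urls(text)                     # newlines kept
joined = find_urls(text.replace('\n', ''))         # newlines stripped

print(as_extracted)   # first URL loses its wrapped tail
print(joined)         # both URLs pick up the following word
```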

At this point, the author elects to flip a table.

Let's try looking at the HTML version. I'll swipe some code from Dive Into Python here, because finding URLs in an HTML document is what is known as a "Solved Problem."


In [1]:
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):                              
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):                     
        href = [v for k, v in attrs if k=='href']  
        if href:
            self.urls.extend(href)
            
def get_urls_from(url):
    url_list = []
    import urllib
    usock = urllib.urlopen(url)
    parser = URLLister()
    parser.feed(usock.read())         
    usock.close()      
    parser.close()                    
    # keep only links that look absolute (http, ftp or www prefixes)
    url_list.extend(item for item in parser.urls
                    if item.startswith(('http', 'ftp', 'www')))
    return url_list
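
A note for anyone running this on Python 3: sgmllib was removed there, so the class above will not import. The following is a rough equivalent built on the standard library's html.parser, offered as a sketch rather than a drop-in replacement. The names LinkCollector and extract_links are made up for this example, and it takes HTML text directly instead of fetching a URL so it can be tried offline.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                self.urls.append(value)

def extract_links(html_text):
    """Return absolute-looking links found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return [u for u in parser.urls if u.startswith(('http', 'ftp', 'www'))]

sample = ('<p><a href="http://example.org/a?x=1&amp;y=2">a</a> '
          '<a href="/local">b</a></p>')
links = extract_links(sample)
print(links)
```

html.parser replaces entity references in attribute values, so an href written with &amp; in the page source comes back with a plain &.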

Here are all the URLs in the document...


In [3]:
urls = get_urls_from('http://www.nature.com/ismej/journal/v10/n1/full/ismej2015100a.html')
urls


Out[3]:
['http://www.isme-microbes.org/',
 'http://mts-isme.nature.com/cgi-bin/main.plex',
 'http://ad.doubleclick.net/N285/jump/ismej.nature.com/;abr=!NN2;type=sc;artid=ismej2015100a;issue=1;pos=top;subjmeta=631;subjmeta=208;subjmeta=212;techmeta=45;techmeta=23;sz=728x90;tile=1;ord=123456789?',
 'http://prodege.jgi-psf.org//downloads/src',
 'http://prodege.jgi-psf.org',
 'http://dx.doi.org/10.1089/cmb.2012.0021',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=22506599&amp;dopt=Abstract',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000303811800001&amp;DestApp=WOS_CPL',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC38XmsFOmt7k%3D&pissn=1751-7362&pyear=2016&md5=186ecb2f73d6216debcdc0dc83436dfe',
 'http://dx.doi.org/10.1186/1471-2105-10-421',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=20003500&amp;dopt=Abstract',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BD1MXhsF2gu7jP&pissn=1751-7362&pyear=2016&md5=38e25da61da7c88bff514b4f0c502751',
 'http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf',
 'http://dx.doi.org/10.1073/pnas.1001665107',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=20668244&amp;dopt=Abstract',
 'http://dx.doi.org/10.1371/journal.pgen.1004596',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=25210772&amp;dopt=Abstract',
 'http://dx.doi.org/10.1038/ismej.2014.183',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000351204900007&amp;DestApp=WOS_CPL',
 'http://dx.doi.org/10.1093/bioinformatics/btq564',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=20966005&amp;dopt=Abstract',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000284430900009&amp;DestApp=WOS_CPL',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC3cXhsVKjtLrJ&pissn=1751-7362&pyear=2016&md5=df5069e170f7d82c809224797936b465',
 'http://dx.doi.org/10.1186/1471-2105-11-119',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=20211023&amp;dopt=Abstract',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC3cXjt12htbs%3D&pissn=1751-7362&pyear=2016&md5=79208a238042d850ec93c9090fb7ca7b',
 'http://www.nature.com/doifinder/10.1038/nmeth0411-311',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=21451520&amp;dopt=Abstract',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000288940300017&amp;DestApp=WOS_CPL',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC3MXjvFOnsLc%3D&pissn=1751-7362&pyear=2016&md5=0923d7644613aeaa5964bcdf4f2e0602',
 'http://dx.doi.org/10.1093/nar/gkt963',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=24165883&amp;dopt=Abstract',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000331139800083&amp;DestApp=WOS_CPL',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC2cXos1Sr&pissn=1751-7362&pyear=2016&md5=5aff5401128f97793cadd2d1f6bf04da',
 'http://dx.doi.org/10.1186/1944-3277-10-8',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=26203331&amp;dopt=Abstract',
 'http://www.nature.com/doifinder/10.1038/nbt.2939',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=24997787&amp;dopt=Abstract',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000346455900043&amp;DestApp=WOS_CPL',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC2cXhtFSlsLjI&pissn=1751-7362&pyear=2016&md5=2ae07546c7270b896f47e9b8c0c26666',
 'http://www.nature.com/doifinder/10.1038/nature12352',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=23851394&amp;dopt=Abstract',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000322157900031&amp;DestApp=WOS_CPL',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC3sXhtFShurnE&pissn=1751-7362&pyear=2016&md5=20244df8b818beea04a0c4b4e850d953',
 'http://dx.doi.org/10.1371/journal.pone.0017288',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=21408061&amp;dopt=Abstract',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC3MXjslSmtr4%3D&pissn=1751-7362&pyear=2016&md5=24d49ca4672d97ff8afae936a58d5824',
 'http://dx.doi.org/10.1126/science.1247023',
 'http://dx.doi.org/10.1073/pnas.1304246110',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=23801761&amp;dopt=Abstract',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC3sXhtVWntbzN&pissn=1751-7362&pyear=2016&md5=78c9402b2462717ffd402008f725ddf4',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000262637600007&amp;DestApp=WOS_CPL',
 'http://dx.doi.org/10.1371/journal.pone.0026161',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=22028825&amp;dopt=Abstract',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC3MXhsVGltLbI&pissn=1751-7362&pyear=2016&md5=d8bfe0c1f4ff00d13a9f7015209804bc',
 'http://creativecommons.org/licenses/by/4.0/',
 'http://creativecommons.org/licenses/by/4.0/',
 'http://mts-isme.nature.com/',
 'http://mse.force.com/nature',
 'http://www.isme-microbes.org/',
 'http://www.isme-microbes.org/',
 'http://www.readcube.com/articles/10.1038/ismej.2015.100',
 'https://s100.copyright.com/AppDispatchServlet?publisherName=NPG&publication=The+ISME+Journal&title=ProDeGe: a computational protocol for fully automated decontamination of genomes&author=Kristin Tennessen, Evan Andersen, Scott Clingenpeel, Christian Rinke, Derek S Lundberg et al.&contentID=10.1038/ismej.2015.100&publicationDate=06/09/2015&volumeNum=10&issueNum=1&cc=by',
 'https://s100.copyright.com/AppDispatchServlet?publisherName=NPGR&publication=The+ISME+Journal&title=ProDeGe: a computational protocol for fully automated decontamination of genomes&author=Kristin Tennessen, Evan Andersen, Scott Clingenpeel, Christian Rinke, Derek S Lundberg et al.&contentID=10.1038/ismej.2015.100&publicationDate=06/09/2015&volumeNum=10&issueNum=1&numPages=4&pageNumbers=pp269-272&orderBeanReset=true',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&term=Tennessen+K',
 'http://www.nature.com/naturejobs/science/jobs/542489-talents-wanted-programme-faculty-search-for-the-school-of-environment-beijing-normal-university',
 'http://www.nature.com/naturejobs/science/jobs/546269-research-engineer-research-scientist-in-renewable-energy',
 'http://www.nature.com/natureevents/science/events/41001-BIO_Europe_Spring_2016',
 'http://www.nature.com/natureevents/science/events/41003-ChinaBio_Partnering_Forum_2016',
 'http://ad.doubleclick.net/N285/jump/ismej.nature.com/;abr=!NN2;type=sc;artid=ismej2015100a;issue=1;pos=right;subjmeta=631;subjmeta=208;subjmeta=212;techmeta=45;techmeta=23;sz=160x600;tile=2;ord=123456789?',
 'http://publicationethics.org/',
 'http://www.isme-microbes.org/',
 'http://www.natureasia.com/']

Bleh. That is mostly links in the references, ads and navigation cruft from the journal's content mismanagement system. Because their system is heinously ad hoc, there is no base URL. So, we're forced to use an ad hoc exclusion list.


In [4]:
excluded = [ 'http://www.nature.com',
             'http://dx.doi.org',
             'http://www.ncbi.nlm.nih.gov',
             'http://creativecommons.org',
             'https://s100.copyright.com',
             'http://mts-isme.nature.com',
             'http://www.isme-microbes.org',
             'http://ad.doubleclick.net',
             'http://mse.force.com',
             'http://links.isiglobalnet2.com',
             'http://www.readcube.com',
             'http://chemport.cas.org',
             'http://publicationethics.org/',
             'http://www.natureasia.com/'
           ]

def novel_url( url ) :
    for excluded_url in excluded :
        if url.startswith( excluded_url ) :
            return False
    return True

filter( novel_url, urls )


Out[4]:
['http://prodege.jgi-psf.org//downloads/src',
 'http://prodege.jgi-psf.org',
 'http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf']

Much better. Now, let's see if these exist...


In [5]:
import requests

for url in filter( novel_url, urls ) :
    request = requests.get( url )
    if request.status_code == 200:
        print 'Good : ', url
    else:
        print 'Fail : ', url


Good :  http://prodege.jgi-psf.org//downloads/src
Good :  http://prodege.jgi-psf.org
Fail :  http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf
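
The check above calls requests.get with no timeout, so a host that never answers can stall it, and a connection error will raise instead of printing Fail. Here is a more defensive variant sketched with the Python 3 standard library. The names fetch_status and link_report are hypothetical, the 10 second timeout is an arbitrary choice, and the demonstration uses canned status codes so that it does not touch the network.

```python
import http.client
import urllib.error
import urllib.request

def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no usable response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, malformed URL or garbled reply.
        return None

def link_report(urls, fetch=fetch_status):
    """Pair each URL with 'Good' (status 200) or 'Fail' (anything else)."""
    return [('Good' if fetch(url) == 200 else 'Fail', url) for url in urls]

# Offline demonstration: canned.get stands in for fetch_status.
canned = {'http://ok.example/': 200,
          'http://gone.example/': 404,
          'http://slow.example/': None}
report = link_report(sorted(canned), fetch=canned.get)
for label, url in report:
    print(label, ':', url)
```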

Looks like this will work, though we'll need to make a hand-curated list of excluded URLs. Otherwise, the counts of dead links could be badly skewed by any issues within the journal's content mismanagement system, ad servers and other irrelevant crud.

Walking through Zotero

Let's try walking through the publications in a Zotero library...


In [8]:
from pyzotero import zotero
api_key    = open( 'zotero_api_key.txt' ).read().strip()
library_id = open( 'zotero_api_userID.txt' ).read().strip()
library_type = 'group'
group_id = '405341' # microBE.net group ID

zot = zotero.Zotero(group_id, library_type, api_key)
items = zot.top(limit=5)
# we've retrieved the latest five top-level items in our library
# print each item's key and title
for item in items:
    print item['data']['key'], ':', item['data']['title']


QC9BAHIK : ProDeGe: a computational protocol for fully automated decontamination of genomes
E7S5UR96 : In search of non-photosynthetic Cyanobacteria
T9GDRBT5 : Evidence-based recommendations on storing and handling specimens for analyses of insect microbiota
BJJUJW48 : Cautionary tale of using 16S rRNA gene sequence similarity values in identification of human-associated bacterial species
QD3JS59Z : ConStrains identifies microbial strains in metagenomic datasets

So far so good. Let's have a look at the url attribute...


In [47]:
for item in items:
    print item['data']['key'], ':', item['data']['url']


QC9BAHIK : http://www.nature.com/ismej/journal/v10/n1/full/ismej2015100a.html
E7S5UR96 : http://espace.library.uq.edu.au/view/UQ:368958
T9GDRBT5 : https://peerj.com/articles/1190
BJJUJW48 : 
QD3JS59Z : http://www.nature.com/nbt/journal/v33/n10/full/nbt.3319.html

Well, it looks like not all resources have URLs. Let's try looping over some of these and extracting links...


In [9]:
for item in items:
    paper_url = item['data']['url']
    if paper_url.startswith( 'http' ) :
        link_urls = get_urls_from( paper_url )
        print item['data']['key']
        for url in filter( novel_url, link_urls ) :
            print '    ', url


QC9BAHIK
     http://prodege.jgi-psf.org//downloads/src
     http://prodege.jgi-psf.org
     http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf
E7S5UR96
     http://www.uq.edu.au/
     http://www.uq.edu.au/
     http://www.uq.edu.au/contacts/
     http://www.uq.edu.au/study/
     http://www.uq.edu.au/maps/
     http://www.uq.edu.au/news/
     http://www.uq.edu.au/events/
     http://www.library.uq.edu.au/
     http://my.uq.edu.au/
     http://ezproxy.library.uq.edu.au/login?url=http://dx.doi.org/10.14264/uql.2015.855
     http://espace.library.uq.edu.au/list/author/Soo%2C+Rochelle+Melissa/
     http://espace.library.uq.edu.au/list/?cat=quick_filter&search_keys%5Bcore_70%5D=School of Chemistry and Molecular Biosciences
     http://espace.library.uq.edu.au/list/subject/452051/
     http://espace.library.uq.edu.au/list/subject/452105/
     http://espace.library.uq.edu.au/list/?cat=quick_filter&search_keys%5B0%5D=Melainabacteria
     http://espace.library.uq.edu.au/list/?cat=quick_filter&search_keys%5B0%5D=Cyanobacteria
     http://espace.library.uq.edu.au/list/?cat=quick_filter&search_keys%5B0%5D=Metabolism
     http://scholar.google.com/scholar?q=intitle:"In search of non-photosynthetic Cyanobacteria"
     http://www.uq.edu.au/
     http://www.uq.edu.au/ipswich/
     http://www.uq.edu.au/gatton/
     http://www.uq.edu.au/about/herston-campus
     http://www.uq.edu.au/maps/
     http://www.universitiesaustralia.edu.au/
     http://www.universitas21.com/
     http://www.edx.org/
     http://www.go8.edu.au/
     http://www.uq.edu.au/terms-of-use/
     http://www.uq.edu.au/rti/
     http://www.library.uq.edu.au/feedback/add
     http://www.uq.edu.au/about/cricos-link
     http://www.uq.edu.au/omc/media
     http://www.pf.uq.edu.au/emerg.html
     https://www.facebook.com/uniofqld
     http://twitter.com/uqnewsonline
     http://www.flickr.com/photos/uqnews/sets/
     http://instagram.com/uniofqld
     https://www.youtube.com/universityqueensland
     http://vimeo.com/uq
     http://www.uq.edu.au/itunes/
     http://www.linkedin.com/edu/school?id=10238
     http://www.alumni.uq.edu.au/giving
     http://www.uq.edu.au/departments/
     http://www.uq.edu.au/uqjobs/
     http://www.uq.edu.au/contacts/
     http://www.uq.edu.au/services/
     http://www.uq.edu.au/uqanswers/
     http://fez.library.uq.edu.au/
T9GDRBT5
     https://peerj.com/blog/
     http://www.mendeley.com/import/?doi=10.7717/peerj.1190
     http://twitter.com/share?url&#x3D;https&#x25;3A&#x25;2F&#x25;2Fpeerj.com&#x25;2Farticles&#x25;2F1190&#x25;2F&via&#x3D;thePeerJ&text&#x3D;Storage&#x25;20methods&#x25;20and&#x25;20insect&#x25;20microbiota&related&#x3D;
     http://www.facebook.com/sharer.php?u&#x3D;https&#x25;3A&#x25;2F&#x25;2Fpeerj.com&#x25;2Farticles&#x25;2F1190&#x25;2F
     https://plus.google.com/share?url&#x3D;https&#x25;3A&#x25;2F&#x25;2Fpeerj.com&#x25;2Farticles&#x25;2F1190&#x25;2F
     http://twitter.com/share?url&#x3D;https&#x25;3A&#x25;2F&#x25;2Fpeerj.com&#x25;2Farticles&#x25;2F1190&#x25;2F&via&#x3D;thePeerJ&text&#x3D;Storage&#x25;20methods&#x25;20and&#x25;20insect&#x25;20microbiota&related&#x3D;
     http://www.facebook.com/sharer.php?u&#x3D;https&#x25;3A&#x25;2F&#x25;2Fpeerj.com&#x25;2Farticles&#x25;2F1190&#x25;2F
     https://plus.google.com/share?url&#x3D;https&#x25;3A&#x25;2F&#x25;2Fpeerj.com&#x25;2Farticles&#x25;2F1190&#x25;2F
     https://doi.org/10.7717/peerj.1190
     https://doi.org/10.7717/peerj.1190
     https://doi.org/10.1146%2Fannurev.ento.49.061802.123416
     https://doi.org/10.1111%2F1574-6976.12025
     https://doi.org/10.1146%2Fannurev-ento-010814-020822
     https://doi.org/10.1038%2Fnrmicro3382
     https://doi.org/10.1016%2F0305-1978%2893%2990012-G
     https://doi.org/10.1046%2Fj.1365-294x.1999.00795.x
     https://doi.org/10.1111%2Fj.1570-7458.2006.00451.x
     https://doi.org/10.1071%2FIS12067
     https://doi.org/10.1371%2Fjournal.pone.0061218
     https://doi.org/10.1371%2Fjournal.pone.0086995
     https://doi.org/10.1111%2Fmec.12209
     https://doi.org/10.1371%2Fjournal.pone.0079061
     https://doi.org/10.1603%2F0022-2585-41.3.340
     https://doi.org/10.1111%2Fmec.12611
     https://doi.org/10.1111%2Fj.1365-294X.2012.05752.x
     https://doi.org/10.1111%2Fj.1574-6968.2010.01965.x
     https://doi.org/10.1371%2Fjournal.pone.0070460
     https://scholar.google.com/scholar_lookup?title=Tissue%20storage%20and%20primer%20selection%20influence%20pyrosequencing-based%20inferences%20of%20diversity%20and%20community%20composition%20of%20endolichenic%20and%20endophytic%20fungi&author=U%E2%80%99Ren&publication_year=2014
     https://doi.org/10.1186%2F1471-2180-14-103
     https://doi.org/10.1073%2Fpnas.1319284111
     https://doi.org/10.1128%2FAEM.01886-10
     https://doi.org/10.1371%2Fjournal.pone.0086995
     https://doi.org/10.1111%2Fmec.12611
     https://doi.org/10.1111%2F1755-0998.12331
     https://doi.org/10.1111%2Fj.1574-6968.2010.01965.x
     https://doi.org/10.1371%2Fjournal.pone.0070460
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/fig-1-2x.jpg
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/fig-1-full.png
     https://doi.org/10.7717/peerj.1190/fig-1
     https://doi.org/10.1371%2Fjournal.pone.0086995
     https://doi.org/10.1002%2Fmbo3.216
     https://doi.org/10.1071%2FIS12067
     https://doi.org/10.1007%2Fs13127-010-0012-4
     https://doi.org/10.1111%2Fele.12282
     https://doi.org/10.1038%2Fismej.2012.8
     https://doi.org/10.1038%2Fnmeth.2604
     https://doi.org/10.1098%2Frspb.2014.1988
     https://doi.org/10.1128%2FAEM.00062-07
     https://doi.org/10.1038%2Fismej.2011.139
     https://scholar.google.com/scholar_lookup?title=R:%20a%20language%20and%20environment%20for%20statistical%20computing&author=&publication_year=2013
     http://CRAN.R-project.org/package=vegan
     http://had.co.nz/ggplot2/book
     https://doi.org/10.1111%2Fj.1442-9993.2001.01070.pp.x
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/fig-2-2x.jpg
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/fig-2-full.png
     https://doi.org/10.7717/peerj.1190/fig-2
     https://doi.org/10.1111%2Fj.1365-294X.2012.05752.x
     https://doi.org/10.1371%2Fjournal.pone.0061218
     https://doi.org/10.1128%2FAEM.01226-14
     https://doi.org/10.1111%2Fj.1574-6968.2010.01965.x
     https://scholar.google.com/scholar_lookup?title=Tissue%20storage%20and%20primer%20selection%20influence%20pyrosequencing-based%20inferences%20of%20diversity%20and%20community%20composition%20of%20endolichenic%20and%20endophytic%20fungi&author=U%E2%80%99Ren&publication_year=2014
     https://doi.org/10.1186%2F1471-2180-14-103
     https://doi.org/10.1073%2Fpnas.1319284111
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/fig-3-2x.jpg
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/fig-3-full.png
     https://doi.org/10.7717/peerj.1190/fig-3
     https://doi.org/10.7717/peerj.1190/table-1
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/fig-4-2x.jpg
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/fig-4-full.png
     https://doi.org/10.7717/peerj.1190/fig-4
     https://doi.org/10.1073%2Fpnas.1405838111
     https://doi.org/10.1016%2F0020-1790%2885%2990020-4
     https://doi.org/10.1073%2Fpnas.0807920105
     https://doi.org/10.1371%2Fjournal.pone.0061218
     https://doi.org/10.1371%2Fjournal.pone.0086995
     https://doi.org/10.1111%2Fmec.12611
     https://doi.org/10.7717/peerj.1190/supp-1
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/FigS1.pdf
     https://doi.org/10.7717/peerj.1190/supp-2
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/FigS2.pdf
     https://doi.org/10.7717/peerj.1190/supp-3
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/FigS3.pdf
     https://doi.org/10.7717/peerj.1190/supp-4
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/Table_S1.xlsx
     https://doi.org/10.7717/peerj.1190/supp-5
     https://dfzljdn9uc3pi.cloudfront.net/2015/1190/1/Table_S2.docx
     https://doi.org/10.1111%2Fj.1442-9993.2001.01070.pp.x
     https://doi.org/10.1111%2Fele.12282
     https://doi.org/10.1603%2F0022-2585-41.3.340
     https://doi.org/10.1038%2Fismej.2012.8
     https://doi.org/10.1038%2Fnrmicro3382
     https://doi.org/10.1111%2Fj.1365-294X.2012.05752.x
     https://doi.org/10.1146%2Fannurev.ento.49.061802.123416
     https://doi.org/10.1186%2F1471-2180-14-103
     https://doi.org/10.1146%2Fannurev-ento-010814-020822
     https://doi.org/10.1038%2Fnmeth.2604
     https://doi.org/10.1111%2F1574-6976.12025
     https://doi.org/10.1371%2Fjournal.pone.0079061
     https://doi.org/10.1073%2Fpnas.0807920105
     https://doi.org/10.1073%2Fpnas.1319284111
     https://doi.org/10.1046%2Fj.1365-294x.1999.00795.x
     https://doi.org/10.1371%2Fjournal.pone.0086995
     https://doi.org/10.1371%2Fjournal.pone.0061218
     https://doi.org/10.1111%2Fmec.12209
     https://doi.org/10.1073%2Fpnas.1405838111
     https://doi.org/10.1111%2Fj.1574-6968.2010.01965.x
     https://doi.org/10.1111%2Fj.1570-7458.2006.00451.x
     https://doi.org/10.1038%2Fismej.2011.139
     https://doi.org/10.1071%2FIS12067
     https://doi.org/10.1007%2Fs13127-010-0012-4
     http://CRAN.R-project.org/package=vegan
     http://CRAN.R-project.org/package=vegan
     https://doi.org/10.1016%2F0305-1978%2893%2990012-G
     https://doi.org/10.1098%2Frspb.2014.1988
     https://scholar.google.com/scholar_lookup?title=R:%20a%20language%20and%20environment%20for%20statistical%20computing&author=&publication_year=2013
     https://doi.org/10.1128%2FAEM.01886-10
     https://doi.org/10.1371%2Fjournal.pone.0070460
     https://doi.org/10.1002%2Fmbo3.216
     https://doi.org/10.1111%2Fmec.12611
     https://doi.org/10.1016%2F0020-1790%2885%2990020-4
     https://scholar.google.com/scholar_lookup?title=Tissue%20storage%20and%20primer%20selection%20influence%20pyrosequencing-based%20inferences%20of%20diversity%20and%20community%20composition%20of%20endolichenic%20and%20endophytic%20fungi&author=U%E2%80%99Ren&publication_year=2014
     https://doi.org/10.1128%2FAEM.00062-07
     http://had.co.nz/ggplot2/book
     http://had.co.nz/ggplot2/book
     https://doi.org/10.1111%2F1755-0998.12331
     https://doi.org/10.1128%2FAEM.01226-14
     http://www.mendeley.com/import/?doi=10.7717/peerj.1190
     https://www.facebook.com/
     http://www.lib.noaa.gov/noaa_research.xml
     http://www.microbiomedigest.com/
     http://www.tobinhammer.com/publications-and-twitter-feed.html
     https://m.facebook.com/
     http://m.facebook.com
     http://m.facebook.com/
     http://www.facebook.com/Gertruda
     http://www.traackr.com/
     http://apps.webofknowledge.com.proxy2.library.illinois.edu/full_record.do
     http://apps.webofknowledge.com/Search.do
     http://apps.webofknowledge.com/full_record.do
     http://plus.url.google.com/url
     http://2015.maintenance.academicanalytics.com/PersonQuadrants/PersonQuadrants
     http://adobe.com/apollo
     http://apps.webofknowledge.com.ezproxy2.library.arizona.edu/summary.do
     http://apps.webofknowledge.com/summary.do
     http://feedly.com/i/category/Open%20Access
     http://l.facebook.com/l.php
     http://scholar.glgoo.org/scholar
     http://scholar.google.com.sci-hub.io/
     http://scholar.google.com.secure.sci-hub.io/scholar
     http://search.aol.com/aol/search
     http://sfx.kcl.ac.uk/kings
     http://sfx.unimi.it/unimi
     http://sfxhosted.exlibrisgroup.com/emu
     http://www.ask.com/web
     http://www.sciencedirect.com/science/article/pii/030519789390012G
     http://www.scopus.com/record/display.uri
     http://www.scopus.com/results/citedbyresults.url
     http://www.sogou.com/
     https://blu182.mail.live.com/
     https://dx.doi.org/10.7717/peerj.1190/supp-3
     https://exchange.ou.edu/owa/redir.aspx
     https://l.facebook.com/l.php
     https://login.ezproxy.lib.utexas.edu/connect
     https://outlook.caltech.edu/owa/redir.aspx
     https://peerj.freshdesk.com/helpdesk/tickets/14030
     https://plus.google.com/
     https://plus.url.google.com/url
     https://scholar-google-com-au.ezproxy2.library.usyd.edu.au/
     https://scholar-google-com.ezproxy.library.wisc.edu/
     https://scholar-google-com.proxy.lib.fsu.edu/scholar_lookup
     https://scholar-google-com.proxy2.library.illinois.edu
     https://squirrel.science.ru.nl/src/read_body.php
     https://weboutlook.du.edu/owa/redir.aspx
     https://www.facebook.com
     https://twitter.com/share
     https://peerj.com/blog/
     http://twitter.com/thePeerJ/
     http://facebook.com/thePeerJ/
     https://plus.google.com/+Peerj
     http://www.linkedin.com/company/peerj
     http://www.pinterest.com/thepeerj/boards/
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-9-7223665263f8> in <module>()
      2     paper_url = item['data']['url']
      3     if paper_url.startswith( 'http' ) :
----> 4         link_urls = get_urls_from( paper_url )
      5         print item['data']['key']
      6         for url in filter( novel_url, link_urls ) :

<ipython-input-1-699ee3a39b8b> in get_urls_from(url)
     14     url_list = []
     15     import urllib
---> 16     usock = urllib.urlopen(url)
     17     parser = URLLister()
     18     parser.feed(usock.read())

/usr/lib/python2.7/urllib.pyc in urlopen(url, data, proxies, context)
     85         opener = _urlopener
     86     if data is None:
---> 87         return opener.open(url)
     88     else:
     89         return opener.open(url, data)

/usr/lib/python2.7/urllib.pyc in open(self, fullurl, data)
    211         try:
    212             if data is None:
--> 213                 return getattr(self, name)(url)
    214             else:
    215                 return getattr(self, name)(url, data)

/usr/lib/python2.7/urllib.pyc in open_http(self, url, data)
    349         for args in self.addheaders: h.putheader(*args)
    350         h.endheaders(data)
--> 351         errcode, errmsg, headers = h.getreply()
    352         fp = h.getfile()
    353         if errcode == -1:

/usr/lib/python2.7/httplib.pyc in getreply(self, buffering)
   1200         try:
   1201             if not buffering:
-> 1202                 response = self._conn.getresponse()
   1203             else:
   1204                 #only add this keyword if non-default for compatibility

/usr/lib/python2.7/httplib.pyc in getresponse(self, buffering)
   1125 
   1126         try:
-> 1127             response.begin()
   1128             assert response.will_close != _UNKNOWN
   1129             self.__state = _CS_IDLE

/usr/lib/python2.7/httplib.pyc in begin(self)
    451         # read until we get a non-100 response
    452         while True:
--> 453             version, status, reason = self._read_status()
    454             if status != CONTINUE:
    455                 break

/usr/lib/python2.7/httplib.pyc in _read_status(self)
    407     def _read_status(self):
    408         # Initialize with Simple-Response defaults
--> 409         line = self.fp.readline(_MAXLINE + 1)
    410         if len(line) > _MAXLINE:
    411             raise LineTooLong("header line")

/usr/lib/python2.7/socket.pyc in readline(self, size)
    478             while True:
    479                 try:
--> 480                     data = self._sock.recv(self._rbufsize)
    481                 except error, e:
    482                     if e.args[0] == EINTR:

IOError: [Errno socket error] timed out

Clearly, we need to expand the excluded URL list. And we need to match domains, not URLs.


In [22]:
excluded = [ 'nature.com',
             'doi.org',
             'ncbi.nlm.nih.gov',
             'creativecommons.org',
             'copyright.com',
             'isme-microbes.org',
             'doubleclick.net',
             'force.com',
             'isiglobalnet2.com',
             'readcube.com',
             'cas.org',
             'publicationethics.org',
             'natureasia.com',
             'uq.edu.au',
             'edx.org',
             'facebook.com',
             'instagram.com',
             'youtube.com',
             'flickr.com',
             'twitter.com',
             'go8.edu.au',
             'google.com',
             'vimeo.com',
             'peerj.com',
             'mendeley.com',
             'cloudfront.net',
             'webofknowledge.com',
             'sciencedirect.com',
             'aol.com',
             'pinterest.com',
             'scopus.com',
             'live.com',
             'exlibrisgroup.com',
             'usyd.edu.au',
             'academicanalytics.com',
             'microbiomedigest.com',
             'ask.com',
             'sogou.com',
             'ou.edu',
             'du.edu',
             'ru.nl',
             'freshdesk.com',
             'caltech.edu',
             'traackr.com',
             'adobe.com',
             'linkedin.com',
             'feedly.com',
             'google.co.uk',
             'glgoo.org',
             'library.wisc.edu',
             'lib.fsu.edu',
             'library.illinois.edu',
             'exchange.ou.edu',
             'lib.noaa.gov',
             'innocentive.com',
             'sfx.kcl.ac.uk',
             'sfx.unimi.it',
             'lib.utexas.edu',
             'orcid.org',
           ]

def novel_url( url ) :
    for excluded_url in excluded :
        if excluded_url in url :
            return False
    return True

This excluded list is getting sloppy as the author slowly lapses into a vegetative state, but we'll push on anyway.
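
Part of the sloppiness is that a substring test also fires when an excluded domain shows up anywhere in the URL, for instance in a query string. One possible tightening, sketched for Python 3, is to parse out the host and compare it against the list by suffix. The names EXCLUDED_DOMAINS, host_is_excluded and novel_url_by_host are invented for this example, and the domain set is a shortened stand-in for the full list.

```python
from urllib.parse import urlparse

# Abbreviated, illustrative subset of the exclusion list.
EXCLUDED_DOMAINS = {'nature.com', 'doi.org', 'ask.com'}

def host_is_excluded(url, excluded=EXCLUDED_DOMAINS):
    """True if the URL's host is an excluded domain or a subdomain of one."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == domain or host.endswith('.' + domain)
               for domain in excluded)

def novel_url_by_host(url):
    """Inverse of host_is_excluded, usable as a filter predicate."""
    return not host_is_excluded(url)

candidates = ['http://www.nature.com/ismej',
              'https://dx.doi.org/10.7717/peerj.1190/supp-3',
              'http://example.org/?ref=doi.org',
              'http://prodege.jgi-psf.org//downloads/src']
kept = [u for u in candidates if novel_url_by_host(u)]
print(kept)
```

The trade-off is that proxied hosts, where the real domain is buried in the middle of the host name, no longer match by accident and need their own entries.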


In [23]:
for item in items:
    paper_url = item['data']['url']
    if paper_url.startswith( 'http' ) :
        try :
            link_urls = get_urls_from( paper_url )
            print item['data']['key']
            for url in list(set(filter( novel_url, link_urls ))) :
                print '    ', url
        except IOError :
            print item['data']['key'], 'FAILED'


QC9BAHIK
     http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf
     http://prodege.jgi-psf.org
     http://prodege.jgi-psf.org//downloads/src
E7S5UR96 FAILED
T9GDRBT5
     http://had.co.nz/ggplot2/book
     http://CRAN.R-project.org/package=vegan
     http://www.tobinhammer.com/publications-and-twitter-feed.html
QD3JS59Z
     https://bitbucket.org/luo-chengwei/constrains
     http://hmpdacc.org/resources/tools_protocols.php
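
Passing the links through set() removes the duplicates but also scrambles their order, which is why the QC9BAHIK links come out in a different order than before. If the order in the page matters, a small helper like this one keeps the first occurrence of each link (unique_in_order is a made-up name, not something from the code above).

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each, in order."""
    seen = set()
    ordered = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered

links = ['http://had.co.nz/ggplot2/book',
         'http://CRAN.R-project.org/package=vegan',
         'http://had.co.nz/ggplot2/book',
         'http://CRAN.R-project.org/package=vegan']
deduped = unique_in_order(links)
print(deduped)
```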

Some journals aggressively ban and throttle IPs, so this process gets slow and awful, but it works. Let's check these for dead links...


In [25]:
for item in items:
    paper_url = item['data']['url']
    if paper_url.startswith( 'http' ) :
        try :
            link_urls = get_urls_from( paper_url )
            print item['data']['key']
            for url in list(set(filter( novel_url, link_urls ))) :
                request = requests.get( url )
                if request.status_code == 200:
                    print '   Good : ', url
                else:
                    print '   Fail : ', url
        except IOError :
            print item['data']['key'], 'FAILED'


QC9BAHIK
   Fail :  http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf
   Good :  http://prodege.jgi-psf.org
   Good :  http://prodege.jgi-psf.org//downloads/src
E7S5UR96 FAILED
T9GDRBT5
   Fail :  http://had.co.nz/ggplot2/book
   Good :  http://CRAN.R-project.org/package=vegan
   Good :  http://www.tobinhammer.com/publications-and-twitter-feed.html
QD3JS59Z
   Good :  https://bitbucket.org/luo-chengwei/constrains
   Good :  http://hmpdacc.org/resources/tools_protocols.php
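
Since some journals throttle or ban IPs that hit them too quickly, one option is to space out requests to the same host. This is only a sketch for Python 3: HostPacer is a hypothetical helper, the 5 second interval is a guess rather than any publisher's documented limit, and the clock and sleep functions are injectable so the logic can be exercised without actually waiting.

```python
import time
from urllib.parse import urlparse

class HostPacer:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_seen = {}   # host -> time of the most recent request

    def wait(self, url):
        """Sleep until the URL's host may be contacted; return seconds slept."""
        host = (urlparse(url).hostname or '').lower()
        now = self._clock()
        delay = 0.0
        if host in self._last_seen:
            delay = max(0.0, self._last_seen[host] + self.min_interval - now)
        if delay:
            self._sleep(delay)
        # Record when the request is actually allowed to go out.
        self._last_seen[host] = now + delay
        return delay
```

In the loop above, calling pacer.wait( url ) just before each request would be the natural place for it.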

I guess that'll do for a proof of concept.

