In [1]:
import feedparser

Loading content


In [2]:
# load from url
d = feedparser.parse('http://www.oschina.net/news/rss')
d['feed']['title']


Out[2]:
'开源中国社区最新新闻'

In [3]:
# load from local file
d = feedparser.parse('oschina_news_rss.xml')
d['feed']['title']


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-2fe05a3acff5> in <module>()
      1 # load from local file
      2 d = feedparser.parse('oschina_news_rss.xml')
----> 3 d['feed']['title']

/usr/local/lib/python3.4/dist-packages/feedparser.py in __getitem__(self, key)
    355             elif dict.__contains__(self, realkey):
    356                 return dict.__getitem__(self, realkey)
--> 357         return dict.__getitem__(self, key)
    358 
    359     def __contains__(self, key):

KeyError: 'title'
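
One plausible reading of the KeyError above, assuming oschina_news_rss.xml was not present in the working directory: when feedparser.parse cannot open its argument as a URL or a file, it appears to fall back to treating the string itself as feed data, which yields an empty feed with no title. The sketch below reads the file explicitly so that a missing file fails loudly. The name read_feed_bytes is a hypothetical helper, not part of feedparser.

```python
from pathlib import Path


def read_feed_bytes(path):
    """Return the raw bytes of a local feed file.

    Hypothetical helper: raises FileNotFoundError instead of letting a
    missing file be silently parsed as an empty feed.
    """
    feed_path = Path(path)
    if not feed_path.is_file():
        raise FileNotFoundError("feed file not found: {}".format(feed_path.resolve()))
    return feed_path.read_bytes()
```

The returned bytes can then be handed to the parser, as in feedparser.parse(read_feed_bytes('oschina_news_rss.xml')). Reading d.feed.get('title') instead of d['feed']['title'] also avoids the exception when the title is absent.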

In [4]:
rawdata = """<rss version="2.0">
<channel>
<title>开源中国社区最新新闻</title>
</channel>
</rss>"""
d = feedparser.parse(rawdata)
d['feed']['title']


Out[4]:
'开源中国社区最新新闻'

Common RSS elements

Getting channel elements


In [5]:
# rss20 example
rss20="""<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Sample Feed</title>
<description>For documentation &lt;em&gt;only&lt;/em&gt;</description>
<link>http://example.org/</link>
<pubDate>Sat, 07 Sep 2002 00:00:01 GMT</pubDate>
<!-- other elements omitted from this example -->
<item>
<title>First entry title</title>
<link>http://example.org/entry/3</link>
<description>Watch out for &lt;span style="background-image:
url(javascript:window.location='http://example.org/')"&gt;nasty
tricks&lt;/span&gt;</description>
<pubDate>Thu, 05 Sep 2002 00:00:01 GMT</pubDate>
<guid>http://example.org/entry/3</guid>
<!-- other elements omitted from this example -->
</item>
</channel>
</rss>"""

In [6]:
# Common elements of an RSS feed are title, link, description, publication date (pubDate), and entry ID (guid).
d = feedparser.parse(rss20)
# d.feed holds the channel elements
print(d.feed.title)
print(d.feed.link)
print(d.feed.description)
print(d.feed.published)
print(d.feed.published_parsed)


Sample Feed
http://example.org/
For documentation <em>only</em>
Sat, 07 Sep 2002 00:00:01 GMT
time.struct_time(tm_year=2002, tm_mon=9, tm_mday=7, tm_hour=0, tm_min=0, tm_sec=1, tm_wday=5, tm_yday=250, tm_isdst=0)
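
The published_parsed value is a time.struct_time, which is awkward for date arithmetic. Below is a minimal sketch of converting it to a timezone-aware datetime, assuming (as the feedparser documentation describes) that the parsed dates are normalized to UTC. The helper name struct_time_to_datetime is made up for this example, and the struct_time literal is a stand-in copied from the output above so the snippet runs without a feed.

```python
import time
from datetime import datetime, timezone


def struct_time_to_datetime(parsed):
    """Convert a UTC time.struct_time into a timezone-aware datetime."""
    # The first six fields are year, month, day, hour, minute, second.
    return datetime(*parsed[:6], tzinfo=timezone.utc)


# Stand-in for d.feed.published_parsed, copied from the output above.
published_parsed = time.struct_time((2002, 9, 7, 0, 0, 1, 5, 250, 0))
published = struct_time_to_datetime(published_parsed)
print(published.isoformat())
```

With a real feed, the same call would be struct_time_to_datetime(d.feed.published_parsed).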

Getting item elements


In [7]:
# entries is a list of items, in the same order as they appear in the RSS file
print(d.entries[0].title)
print(d.entries[0].link)
print(d.entries[0].description)
print(d.entries[0].published)
print(d.entries[0].published_parsed)
print(d.entries[0].id)


First entry title
http://example.org/entry/3
Watch out for <span style="background-image:
url(javascript:window.location='http://example.org/')">nasty
tricks</span>
Thu, 05 Sep 2002 00:00:01 GMT
time.struct_time(tm_year=2002, tm_mon=9, tm_mday=5, tm_hour=0, tm_min=0, tm_sec=1, tm_wday=3, tm_yday=248, tm_isdst=0)
http://example.org/entry/3
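
Since entries keeps the order of the file, a feed that is not already sorted by date has to be reordered by the caller. The following is a small sketch of sorting by published_parsed. It uses plain dicts as stand-ins for d.entries so it runs on its own; newest_first is a hypothetical helper, and the second entry is invented for illustration.

```python
import time


def newest_first(entries):
    """Return the entries that have a parsed date, sorted newest first."""
    dated = [e for e in entries if e.get("published_parsed")]
    # struct_time compares like a tuple, so it can serve directly as a sort key.
    return sorted(dated, key=lambda e: e["published_parsed"], reverse=True)


# Plain dicts standing in for d.entries; the second one is made up.
entries = [
    {"title": "First entry title",
     "published_parsed": time.struct_time((2002, 9, 5, 0, 0, 1, 3, 248, 0))},
    {"title": "Hypothetical later entry",
     "published_parsed": time.struct_time((2002, 9, 6, 0, 0, 1, 4, 249, 0))},
]
for entry in newest_first(entries):
    print(entry["title"])
```

Entries without a parsed date are dropped here rather than guessed at; whether that is the right policy depends on the feed.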

Common Atom elements

Atom feeds generally contain more information than RSS feeds (because more elements are required), but the most commonly used elements are still title, link, subtitle/description, various dates, and ID.


In [8]:
atom10 = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xml:base="http://example.org/"
xml:lang="en">
<title type="text">Sample Feed</title>
<subtitle type="html">
For documentation &lt;em&gt;only&lt;/em&gt;
</subtitle>
<link rel="alternate" href="/"/>
<link rel="self"
type="application/atom+xml"
href="http://www.example.org/atom10.xml"/>
<rights type="html">
&lt;p>Copyright 2005, Mark Pilgrim&lt;/p>&lt;
</rights>
<id>tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml</id>
<generator
uri="http://example.org/generator/"
version="4.0">
Sample Toolkit
</generator>
<updated>2005-11-09T11:56:34Z</updated>
<entry>
<title>First entry title</title>
<link rel="alternate"
href="/entry/3"/>
<link rel="related"
type="text/html"
href="http://search.example.com/"/>
<link rel="via"
type="text/html"
href="http://toby.example.com/examples/atom10"/>
<link rel="enclosure"
type="video/mpeg4"
href="http://www.example.com/movie.mp4"
length="42301"/>
<id>tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml:3</id>
<published>2005-11-09T00:23:47Z</published>
<updated>2005-11-09T11:56:34Z</updated>
<summary type="text/plain" mode="escaped">Watch out for nasty tricks</summary>
<content type="application/xhtml+xml" mode="xml"
xml:base="http://example.org/entry/3" xml:lang="en-US">
<div xmlns="http://www.w3.org/1999/xhtml">Watch out for
<span style="background: url(javascript:window.location='http://example.org/')">
nasty tricks</span></div>
</content>
</entry>
</feed>"""

In [9]:
d = feedparser.parse(atom10)
print(d.feed.title)
print(d.feed.link)
print(d.feed.subtitle)
print(d.feed.updated)
print(d.feed.updated_parsed)
print(d.feed.id)


Sample Feed
http://example.org/
For documentation <em>only</em>
2005-11-09T11:56:34Z
time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)
tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml

In [10]:
print(d.entries[0].title)
print(d.entries[0].link)
print(d.entries[0].id)
print(d.entries[0].published)
print(d.entries[0].published_parsed)
print(d.entries[0].updated)
print(d.entries[0].updated_parsed)
print(d.entries[0].summary)
print(d.entries[0].content)


First entry title
http://example.org/entry/3
tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml:3
2005-11-09T00:23:47Z
time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=0, tm_min=23, tm_sec=47, tm_wday=2, tm_yday=313, tm_isdst=0)
2005-11-09T11:56:34Z
time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)
Watch out for nasty tricks
[{'type': 'application/xhtml+xml', 'language': 'en-US', 'value': 'Watch out for\n<span style="background: url(javascript:window.location=\'http://example.org/\')">\nnasty tricks</span>', 'base': 'http://example.org/entry/3'}]

Note: the parsed summary and content are not the same as they appear in the original feed. The original elements contained dangerous HTML (HyperText Markup Language) markup, which was sanitized. See the sanitization section of the feedparser documentation for details.

Because Atom entries can have more than one content element, d.entries[0].content is a list of dictionaries. Each dictionary contains metadata about a single content element. The two most important values in the dictionary are the content type, in d.entries[0].content[0].type, and the actual content value, in d.entries[0].content[0].value.
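
Because content is a list, code that wants a single body has to choose one item. One possible way to do that by content type is sketched below. pick_content and its default preference order are assumptions made for this example, and the list of plain dicts stands in for d.entries[0].content.

```python
def pick_content(contents, preferred_types=("text/html", "application/xhtml+xml", "text/plain")):
    """Return the value of the first content item matching a preferred type."""
    for wanted in preferred_types:
        for item in contents:
            if item.get("type") == wanted:
                return item.get("value")
    # Fall back to the first item, or None when the list is empty.
    return contents[0].get("value") if contents else None


# Plain dicts standing in for d.entries[0].content.
contents = [
    {"type": "text/plain", "value": "Watch out for nasty tricks"},
    {"type": "application/xhtml+xml", "value": "Watch out for <span>nasty tricks</span>"},
]
print(pick_content(contents))
```

The helper only relies on get, so it should work the same way on the dictionaries feedparser returns.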

You can get this level of detail on other Atom elements too.

Getting Detailed Information on Atom Elements

Several Atom elements share the Atom content model: title, subtitle, rights, summary, and of course content. (Atom 0.3 also had an info element which shared this content model.) Universal Feed Parser captures all relevant metadata about these elements, most importantly the content type and the value itself.


In [11]:
print(d.feed.title_detail)
print(d.feed.subtitle_detail)
print(d.feed.rights_detail)
print(d.entries[0].title_detail)
print(d.entries[0].summary_detail)
print(len(d.entries[0].content))
print(d.entries[0].content[0])


{'type': 'text/plain', 'language': 'en', 'value': 'Sample Feed', 'base': 'http://example.org/'}
{'type': 'text/html', 'language': 'en', 'value': 'For documentation <em>only</em>', 'base': 'http://example.org/'}
{'type': 'text/html', 'language': 'en', 'value': '<p>Copyright 2005, Mark Pilgrim</p><', 'base': 'http://example.org/'}
{'type': 'text/plain', 'language': 'en', 'value': 'First entry title', 'base': 'http://example.org/'}
{'type': 'text/plain', 'language': 'en', 'value': 'Watch out for nasty tricks', 'base': 'http://example.org/'}
1
{'type': 'application/xhtml+xml', 'language': 'en-US', 'value': 'Watch out for\n<span style="background: url(javascript:window.location=\'http://example.org/\')">\nnasty tricks</span>', 'base': 'http://example.org/entry/3'}
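
The type field matters when a value is displayed. A text/plain value must be escaped before it is placed in an HTML page, while a text/html value is already markup. The sketch below illustrates the distinction; detail_to_html is a hypothetical helper, and the dicts are stand-ins shaped like the *_detail output above (the third one is invented).

```python
import html


def detail_to_html(detail):
    """Return a *_detail value as a string suitable for an HTML page."""
    if detail.get("type") == "text/plain":
        # Plain text is escaped so that characters such as < and & show literally.
        return html.escape(detail.get("value", ""))
    # text/html and application/xhtml+xml values are already markup.
    return detail.get("value", "")


# Stand-ins shaped like d.feed.title_detail and d.feed.subtitle_detail.
title_detail = {"type": "text/plain", "value": "Sample Feed"}
subtitle_detail = {"type": "text/html", "value": "For documentation <em>only</em>"}
tricky_detail = {"type": "text/plain", "value": "a < b & c"}  # invented example

for detail in (title_detail, subtitle_detail, tricky_detail):
    print(detail_to_html(detail))
```

This only handles escaping. Markup values from untrusted feeds should still be sanitized before they are shown.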