Given an HTML document, extract and clean up the main body text and title.
This is a Python port of a Ruby port of arc90's Readability project.
It's easy to install using pip. Just run:
$ pip install readability-lxml
Alternatively, you can install it with conda:
$ conda install -c conda-forge readability-lxml
>>> import requests
>>> from readability import Document
>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
'Example Domain'
>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n domain in examples without prior coordination or asking for permission.</p>
\n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
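As the output above shows, `summary()` returns HTML rather than plain text. If only the text is needed, one option is to strip the tags afterwards. The following is a minimal sketch using only the standard library; `html_to_text` and `_TextCollector` are hypothetical helper names for illustration and are not part of readability-lxml.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not run
        # together, then collapse all whitespace runs into single spaces.
        # Trade-off: inline markup such as "<b>foo</b>bar" becomes "foo bar".
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (illustrative only)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With the `doc` object from the example above, this could be called as `html_to_text(doc.summary())`.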
- 0.8.2 Added article author(s) (thanks @mattblaha)
- 0.8.1 Fixed processing of non-ASCII HTML via regexps.
- 0.8 Replaced XHTML output with HTML5 output in summary() call.
- 0.7.1 Added support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
- 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
- 0.4 Added Videos loading and allowed more images per paragraph
- 0.3 Added Document.encoding, positive_keywords and negative_keywords
This code is licensed under the Apache License 2.0.
- Latest readability.js
- Ruby port by starrhorne and iterationlabs
- Python port by gfxmonk
- Decruft effort <https://web.archive.org/web/20110214150709/https://www.minvolai.com/blog/decruft-arc90s-readability-in-python/> to move to lxml
- "BR to P" fix from readability.js which improves quality for smaller texts
- GitHub users' contributions.