justext

Heuristic-based boilerplate removal tool

Why this rank: strong adoption, release freshness, healthy release cadence

Description

.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/


jusText
=======
.. image:: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml/badge.svg
  :target: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml

Program jusText is a tool for removing boilerplate content, such as navigation
links, headers, and footers from HTML pages. It is
`designed <doc/algorithm.rst>`_ to preserve mainly text containing full
sentences and it is therefore well suited for creating linguistic resources
such as Web corpora. You can
`try it online <http://nlp.fi.muni.cz/projects/justext/>`_.

This is a fork of original (currently unmaintained) code of jusText_ hosted
on Google Code.

Adaptations of the algorithm to other languages:

- `C++ <https://github.com/endredy/jusText>`_
- `Go <https://github.com/JalfResi/justext>`_
- `Java <https://github.com/wizenoze/justext-java>`_

Some libraries using jusText:

- `chirp <https://github.com/9b/chirp>`_
- `lazynlp <https://github.com/chiphuyen/lazynlp>`_
- `off-topic-memento-toolkit <https://github.com/oduwsdl/off-topic-memento-toolkit>`_
- `pears <https://github.com/PeARSearch/PeARS-orchard>`_
- `readability calculator <https://github.com/joaopalotti/readability_calculator>`_
- `sky <https://github.com/kootenpv/sky>`_

Some currently (Jan 2020) maintained alternatives:

- `dragnet <https://github.com/dragnet-org/dragnet>`_
- `html2text <https://github.com/Alir3z4/html2text>`_
- `inscriptis <https://github.com/weblyzard/inscriptis>`_
- `newspaper <https://github.com/codelucas/newspaper>`_
- `python-readability <https://github.com/buriy/python-readability>`_
- `trafilatura <https://github.com/adbar/trafilatura>`_


Installation
------------
Make sure you have Python_ 2.7+/3.5+ and
`pip <https://pip.pypa.io/en/stable/>`_
(`Windows <http://docs.python-guide.org/en/latest/starting/install/win/>`_,
`Linux <http://docs.python-guide.org/en/latest/starting/install/linux/>`_)
installed. Run simply:

.. code-block:: bash

  $ [sudo] pip install justext


Dependencies
------------
::

  lxml (version depends on your Python version)


Usage
-----
.. code-block:: bash

  $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
  $ python -m justext -s English -o plain_text.txt english_page.html
  $ python -m justext --help # for more info


Python API
----------
.. code-block:: python

  import requests
  import justext

  # Download the page and classify its paragraphs with the English stop-list.
  response = requests.get("http://planet.python.org/")
  paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

  # Print only the paragraphs that were not classified as boilerplate.
  for paragraph in paragraphs:
      if not paragraph.is_boilerplate:
          print(paragraph.text)


Testing
-------
Run tests via

.. code-block:: bash

  $ py.test-2.7 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8 && py.test-3.9


Acknowledgements
----------------
.. _`Natural Language Processing Centre`: http://nlp.fi.muni.cz/en/nlpc
.. _`Masaryk University in Brno`: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _`Lexical Computing Ltd.`: http://lexicalcomputing.com/
.. _`PhD research`: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

This software has been developed at the `Natural Language Processing Centre`_ of
`Masaryk University in Brno`_ with financial support from PRESEMT_ and
`Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.


.. :changelog:

Changelog for jusText
=====================

3.0.2 (2025-02-25)
------------------
- *BUG FIX:* Handle urllib imports in Python 2 and 3 correctly `#51 <https://github.com/miso-belica/jusText/pull/51>`_.

3.0.1 (2024-05-09)
------------------
- *BUG FIX:* Fix issue with new version of lxml `#48 <https://github.com/miso-belica/jusText/pull/48>`_.

3.0.0 (2021-10-21)
------------------
- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.4 and below.
- *BUG FIX:* Don't join words separated only by ``<br>`` tag.
- *BUG FIX:* List available stop-lists alphabetically.

2.2.0 (2016-03-06)
------------------
- *INCOMPATIBLE CHANGE:* Stop words are case insensitive.
- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.2.
- *BUG FIX:* Preserve new lines from original text in paragraphs.

2.1.1 (2014-05-27)
------------------
- *BUG FIX:* Function ``decode_html`` now respects parameter ``errors`` when falling back to ``default_encoding`` `#9 <https://github.com/miso-belica/jusText/issues/9>`_.

2.1.0 (2014-01-25)
------------------
- *FEATURE:* Added XPath selector to the paragraphs. XPath selector is also available in detailed output as ``xpath`` attribute of ``<p>`` tag `#5 <https://github.com/miso-belica/jusText/pull/5>`_.

2.0.0 (2013-08-26)
------------------
- *FEATURE:* Added pluggable DOM preprocessor.
- *FEATURE:* Added support for Python 3.2+.
- *INCOMPATIBLE CHANGE:* Paragraphs are instances of ``justext.paragraph.Paragraph``.
- *INCOMPATIBLE CHANGE:* Script 'justext' removed in favour of command ``python -m justext``.
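
The 2.2.0 entry above makes stop-word matching case insensitive, and the description says the tool mainly keeps text made of full sentences. The following is only a toy illustration of that idea, not jusText's actual implementation: it measures the share of stop words in a paragraph while ignoring case. The names ``TOY_STOPLIST``, ``stopword_density``, and ``looks_like_boilerplate``, as well as the ``0.3`` threshold, are hypothetical and chosen just for this sketch.

```python
import re

# Hypothetical mini stop-list for illustration only;
# jusText ships its own per-language stop-lists.
TOY_STOPLIST = frozenset({"the", "a", "of", "and", "is", "to", "in", "it"})


def stopword_density(text, stoplist=TOY_STOPLIST):
    """Return the fraction of words in `text` that are stop words.

    Matching is case insensitive: both the words and the stop-list
    are lowercased before comparison.
    """
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    lowered_stoplist = {word.lower() for word in stoplist}
    hits = sum(1 for word in words if word.lower() in lowered_stoplist)
    return hits / len(words)


def looks_like_boilerplate(text, min_density=0.3):
    """Toy rule (an assumption of this sketch): a paragraph with few
    stop words is more likely a menu or link list than a sentence."""
    return stopword_density(text) < min_density


if __name__ == "__main__":
    for sample in ("Home Products Contact Login", "The cat is in the hat"):
        print(sample, "->", looks_like_boilerplate(sample))
```

A navigation-style string such as ``Home Products Contact Login`` contains no stop words and is flagged by this rule, while an ordinary sentence is kept. A real classifier would combine several signals rather than rely on a single threshold.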

Release History

.. list-table::
   :header-rows: 1

   * - Version
     - Changes
     - Urgency
     - Date
   * - 3.0.2
     - Imported from PyPI (3.0.2)
     - Low
     - 4/21/2026
   * - v3.0.2
     - What's Changed

       - Handle urllib imports in Python 2 and 3 correctly by @miso-belica in https://github.com/miso-belica/jusText/pull/51

       **Full Changelog**: https://github.com/miso-belica/jusText/compare/v3.0.1...v3.0.2
     - Low
     - 2/25/2025
   * - v3.0.1
     - What's Changed

       - Fix issue with new version of lxml by @miso-belica in https://github.com/miso-belica/jusText/pull/48

       **Full Changelog**: https://github.com/miso-belica/jusText/compare/v3.0.0...v3.0.1
     - Low
     - 5/9/2024
   * - v3.0.0
     - Highlights

       - **INCOMPATIBLE CHANGE:** Drop support for Python 3.4 and below by @miso-belica in https://github.com/miso-belica/jusText/pull/38
       - **FEATURE** More efficient code (maybe even 2.5x speedup) by @adbar in https://github.com/miso-belica/jusText/pull/41
       - **FIX** Fix ``cgi.escape`` error in Python 3.8+ by @garaud in https://github.com/miso-belica/jusText/pull/37
       - **FIX:** Don't join words separated only by ``<br>`` tag
       - **FIX:** List available stop-lists alphabetically

       What...
     - Low
     - 10/21/2021
