trafilatura
Python & command-line tool to gather text and metadata on the Web: crawling, scraping, extraction; output as CSV, JSON, HTML, MD, TXT, XML.
Description
# Trafilatura: Discover and Extract Text Data on the Web

<br/>
<img alt="Trafilatura Logo" src="https://raw.githubusercontent.com/adbar/trafilatura/master/docs/trafilatura-logo.png" align="center" width="60%"/>
<br/>

[PyPI package](https://pypi.python.org/pypi/trafilatura) · [Documentation](http://trafilatura.readthedocs.org/en/latest/?badge=latest) · [Code coverage](https://codecov.io/gh/adbar/trafilatura) · [Downloads](https://pepy.tech/project/trafilatura) · [Reference paper (ACL 2021)](https://aclanthology.org/2021.acl-demo.15/)

<br/>
<img alt="Demo as GIF image" src="https://raw.githubusercontent.com/adbar/trafilatura/master/docs/trafilatura-demo.gif" align="center" width="80%"/>
<br/>

## Introduction

Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data**. It includes all the discovery and text-processing components needed for **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims to stay **handy and modular**: no database is required, and the output can be converted to commonly used formats.

Going from bulk HTML to its essential parts alleviates many problems related to text quality: **focusing on the actual content** avoids the noise caused by recurring elements such as headers and footers, while selected metadata helps **make sense of the data**. The extractor strikes a balance between limiting noise (precision) and including all valid parts (recall). It is **robust and reasonably fast**.

Trafilatura is [widely used](https://trafilatura.readthedocs.io/en/latest/used-by.html) and integrated into [thousands of projects](https://github.com/adbar/trafilatura/network/dependents) by companies like HuggingFace, IBM, and Microsoft Research, as well as institutions like the Allen Institute, Stanford, the Tokyo Institute of Technology, and the University of Munich.
### Features

- Advanced web crawling and text discovery:
  - Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
  - Smart crawling and URL management (filtering and deduplication)
- Parallel processing of online and offline input:
  - Live URLs, efficient and polite processing of download queues
  - Previously downloaded HTML files and parsed HTML trees
- Robust and configurable extraction of key elements:
  - Main text (common patterns and generic algorithms like jusText and readability)
  - Metadata (title, author, date, site name, categories and tags)
  - Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
  - Optional elements: comments, links, images, tables
- Multiple output formats:
  - TXT and Markdown
  - CSV
  - JSON
  - HTML, XML and [XML-TEI](https://tei-c.org/)
- Optional add-ons:
  - Language detection on extracted content
  - Speed optimizations
- Actively maintained with support from the open-source community:
  - Regular updates, feature additions, and optimizations
  - Comprehensive documentation

### Evaluation and alternatives

Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, demonstrating its efficiency and accuracy in extracting web content. For more information, see the [benchmark section](https://trafilatura.readthedocs.io/en/latest/evaluation.html), and see the [evaluation readme](https://github.com/adbar/trafilatura/blob/master/tests/README.rst) to run the evaluation with the latest data and packages.
#### Other evaluations

- Most efficient open-source library in *ScrapingHub*'s [article extraction benchmark](https://github.com/scrapinghub/article-extraction-benchmark)
- Best overall tool according to [Bien choisir son outil d'extraction de contenu à partir du Web](https://hal.archives-ouvertes.fr/hal-02768510v3/document) (Lejeune & Barbaresi 2020)
- Best single tool by ROUGE-LSum Mean F1 Page Scores in [An Empirical Comparison of Web Content Extraction Algorithms](https://webis.de/downloads/publications/papers/bevendorff_2023b.pdf) (Bevendorff et al. 2023)

## Usage and documentation

[Getting started with Trafilatura](https://trafilatura.readthedocs.io/en/latest/quickstart.html) is straightforward. For more information and detailed guides, visit [Trafilatura's documentation](https://trafilatura.readthedocs.io/):

- [Installation](https://t
Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 2.0.0 | Imported from PyPI (2.0.0) | Low | 4/21/2026 |
| v2.0.0 | Breaking changes: - Python 3.6 and 3.7 deprecated (#709) - `bare_extraction()`: - now returns an instance of the `Document` class by default - `as_dict` deprecation warning → use `.as_dict()` method on return value (#730) - `bare_extraction()` and `extract()`: `no_fallback` deprecation warning → use `fast` instead (#730) - downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724) - deprecated graphical user interface now removed (#713) - extraction: | Low | 12/3/2024 |
| v1.12.2 | - downloads: add support for SOCKS proxies with @gremid (#682) - extraction fix: ValueError in table spans (#685) - spider: `prune_xpath` parameter added by @felipehertzer (#684) - spider: relax strict parameter for link extraction (#687) - sitemaps: `max_sitemaps` parameter added by @felipehertzer (#690) - maintenance: make compression libraries optional (#691) - metadata: review and lint code (#694) | Low | 9/10/2024 |
| v1.12.1 | Navigation: - spider: restrict search to sections containing URL path (#673) - crawler: add parameter class and types, **breaking change** for undocumented functions (#675) - maintenance: simplify link discovery and extend tests (#674) - CLI: review code, add types and tests (#677) Bugfixes: - fix `AttributeError` in element deletion (#668) - fix `MemoryError` in table header columns (#665) Docs: - docs: fix variable name for extract_metadata in quickstart by @jpigla in #678 | Low | 8/20/2024 |
| v1.12.0 | Breaking change: - enforce fixed list of output formats, deprecate `-out` on the CLI (#647) Faster, more accurate extraction: - review link and structure checks (#653) - improve justext fallback (#652) - baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646) - review XPaths for undesirable content (#645) Bugfixes and maintenance: - CLI fix: markdown format should trigger `include_formatting` (#649) - images fix: use a length threshold on src attribute | Low | 7/30/2024 |
| v1.11.0 | Breaking change: - metadata now skipped by default (#613), to trigger inclusion in all output formats: - `with_metadata=True` (Python) - `--with-metadata` (CLI) Extraction: - add HTML as output format (#614) - better and faster baseline extraction (#619) - better handling of HTML/XML elements (#628) - XPath rules added with @felipehertzer (#540) - fix: avoid faulty readability_lxml content (#635) Evaluation: - new scripts and data with @LydiaKoerber (#606, #615) - additio | Low | 6/27/2024 |
| v1.10.0 | Breaking changes: - raise errors on deprecated CLI and function arguments (#581) - regroup classes and functions linked to deduplication (#582): `trafilatura.hashing` → `trafilatura.deduplication` Extraction: - port of is_probably_readerable from readability.js by @zirkelc in #587 - Markdown table fixes by @naktinis in #601 - fix list spacing in TXT output (#598) - CLI fixes: file processing options, mtime, and tests (#605) - CLI fix: read standard input as binary (#607) Downloa | Low | 5/30/2024 |
| v1.9.0 | Extraction: - add markdown as explicit output (#550) - improve recall preset (#571) - speedup for readability-lxml (#547) - add global options object for extraction and use it in CLI (#552) - fix: better encoding detection (#548) - recall: fix for lists inside tables with @mikhainin (#534) - add symbol to preserve vertical spacing in Markdown (#499) - fix: table cell separators in non-XML output (#563) - slightly better accuracy and execution speed overall Metadata: - add file creat | Low | 5/2/2024 |
| v1.8.1 | Maintenance: - Pin LXML to prevent broken dependency (#535) Extraction: - Improve extraction accuracy for major news outlets (#530) - Fix formatting by correcting order of element generation and space handling with @dlwh (#528) - Fix: prevent tail insertion before children in nested elements by @knit-bee (#536) | Low | 4/3/2024 |
| v1.8.0 | Extraction: - Better precision by @felipehertzer (#509, #520) - Code formatting in TXT/Markdown output added (#498) - Improved CSV output (#496) - LXML: compile XPath expressions (#504) - Overall speedup about +5% Downloads and Navigation: - More robust scans with `is_live_page()` (#501) - Better sitemap start and safeguards (#503, #506) - Fix for headers in response object (#513) Maintenance: - License changed to Apache 2.0 - `Response` class: convenience functions added (#497) | Low | 3/20/2024 |
| v1.7.0 | Extraction: - improved `html2txt()` function (#483) Downloads: - add advanced `fetch_response()` function → pending deprecation for `fetch_url(decode=False)` Maintenance: - support for LXML v5+ (#484 by @knit-bee, #485) - update [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.7.0) | Low | 1/25/2024 |
| v1.6.4 | Maintenance: - MacOS: fix setup, update htmldate and add tests (#460) - drop invalid XML element attributes with @vbarbaresi in #462 - remove cyclic imports (#458) Navigation: - introduce `MAX_REDIRECTS` config setting and fix urllib3 redirect handling by @vbarbaresi in #461 - improve feed detection (#457) Documentation: - enhancements to documentation and testing with @Maddesea in #456 | Low | 1/8/2024 |
| v1.6.3 | Extraction: - preserve space in certain elements with @idoshamun (#429) - optional list of XPaths to prune by @HeLehm (#414) Metadata: - more precise date extraction (see [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.6.0)) - new `htmldate` extensive search parameter in config (#434) - changes in URLs: normalization, trackers removed (see [courlan](https://github.com/adbar/courlan/releases/tag/v0.9.5)) Navigation: - reviewed code for feeds (#443) - new config option: e | Low | 11/29/2023 |
| v1.6.2 | Extraction: - more lenient HTML parsing (#370) - improved code block support with @idoshamun (#372, #401) - conversion of relative links to absolute by @feltcat (#377) - remove use of signal from core functions (#384) Metadata: - JSON-LD fix for sitenames by @felipehertzer (#383) Command-line interface: - more robust batch processing (#381) - added `--probe` option to CLI to check for extractable content (#378, #392) Maintenance: - simplified code (#408) - support for Python 3. | Low | 9/6/2023 |
| v1.6.1 | Extraction: - minor fixes: tables in figures (#301), headings (#354) and lists (#318) Metadata: - simplify and fully test JSON parsing code, with @felipehertzer (#352, #368) - authors, JSON and unicode fixes by @felipehertzer in #365 - fix for authors without `additionalName` by @awwitecki in #363 Navigation: - reviewed link processing in feeds and sitemaps (#340, #350) - more robust spider (#359) - updated underlying courlan package (#360) Full Changelog: https://github.com/adba | Low | 6/15/2023 |
| v1.6.0 | Extraction: - new content hashes and default file names (#314) - fix deprecation warning with @sdondley in #321 - fix for metadata image by @andremacola in #328 - fix potential unicode issue in third-party extraction with @Korben00 in #331 - review logging levels (#347) Command-line interface: - more efficient sitemap processing (#326) - more efficient downloads (#338) - fix for single URL processing (#324) and URL blacklisting (#339) Navigation - additional safety check on domain | Low | 5/11/2023 |
| v1.5.0 | Extraction: - fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303) - pagetype and image urls added to metadata by @andremacola (#282, #310) - add as_dict method to Document class with @edkrueger in #306 - XML output fix with @knit-bee in #315 - various smaller fixes: lists (#309), XPaths, metadata hardening Navigation: - transfer URL management to courlan.UrlStore (#232, #312) - fixes for spider module Maintenance: - simp | Low | 3/30/2023 |
| v1.4.1 | Extraction: - extraction bugs fixed (#263, #266), more robust HTML doctype parsing - XML output improvements by @knit-bee (#273, #274) - adjust thresholds for link density in paragraphs Metadata: - improved title and sitename detection (#284) - faster author, categories, domain name, and tags extraction - fixes to author emoji regexes by @felipehertzer (#269) Command-line interface: - review argument consistency and add deprecation warnings (#261) Setup: - make download timeout | Low | 1/19/2023 |
| v1.4.0 | Impact on extraction and output format: - better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258) - XML: preserve list type as attribute (#229) - XML TEI: better conformity with @knit-bee (#238, #242, #253, #254) - faster text cleaning and shorter code (#237 with @deedy5, #245) - metadata: add language when detector is activated (#224) - metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235) - TXT: change markdown formatt | Low | 10/18/2022 |
| v1.3.0 | - fast and robust `html2txt()` function added (#221) - more robust parsing (#228) - fixed bugs in metadata extraction, with @felipehertzer in #213 & #226 - extraction about 10-20% faster, slightly better recall - partial fixes for memory leaks (#216) - docs extended and updated (#217, #225) - prepared deprecation of old `process_record()` function - more stable processing with updated dependencies **Full Changelog**: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0 | Low | 7/29/2022 |
| v1.2.2 | - more efficient rules for extraction - metadata: further attributes used (with @felipehertzer) - better baseline extraction - issues fixed: #202, #204, #205 - evaluation updated Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.1...v1.2.2 | Low | 5/18/2022 |
| v1.2.1 | ## What's Changed - ``--precision`` and ``--recall`` arguments added to the CLI - better text cleaning: paywalls and comments - improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188 - further bugs fixed: #189, #192 (with @felipehertzer), #200 - efficiency: faster module loading and improved RAM footprint **Full Changelog**: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1 | Low | 5/2/2022 |
| v1.2.0 | - efficiency: replaced module readability-lxml by trimmed fork - bugs fixed: (#179, #180, #183, #184) - improved baseline extraction - cleaner metadata (with @felipehertzer) **Full Changelog**: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0 | Low | 3/7/2022 |
| v1.1.0 | - encodings: better detection, output NFC-normalized Unicode - maintenance and performance: more efficient code - bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others) - prepare compatibility with upcoming Python 3.11 - changed default settings - extended documentation **Full Changelog**: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0 | Low | 2/21/2022 |
| v1.0.0 | - compress HTML backup files & seamlessly open .gz files - support JSON web feeds - graphical user interface integrated into main package - faster downloads: reviewed backoff, compressed data - optional modules: downloads with `pycurl`, language identification with `py3langid` - bugs fixed (#111, #125, #132, #136, #140) - minor optimizations and fixes by @vbarbaresi in [#124](https://github.com/adbar/trafilatura/pull/124) & [#130](https://github.com/adbar/trafilatura/pull/130) - fixed arr | Low | 11/30/2021 |
| v0.9.3 | - better, faster encoding detection: replaced chardet with charset_normalizer - faster execution: updated justext to 3.0 - better extraction of sub-elements in tables (#78, #90) - more robust web feed parsing - further defined precision- and recall-oriented settings - license extraction in footers (#118) **Full Changelog**: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3 | Low | 10/21/2021 |
| v0.9.2 | - first precision- and recall-oriented presets defined - improvements in authorship extraction (thanks @felipehertzer) - requesting TXT output with formatting now results in Markdown format - bugs fixed: notably extraction robustness and consistency (#109, #111, #113) - setting for cookies in request headers (thanks @muellermartin) - better date extraction thanks to htmldate update | Low | 10/6/2021 |
| v0.9.1 | - improved author extraction (thanks @felipehertzer!) - bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ... - docs updated and extended - CLI: option names normalized (heed deprecation warnings), new option `explore` | Low | 8/2/2021 |
| v0.9.0 | - focused crawling functions including politeness rules - more efficient multi-threaded downloads + use as Python functions - documentation extended - bugs fixed: extraction and URL handling - removed support for Python 3.4 | Low | 6/15/2021 |
| v0.8.2 | - better handling of formatting, links and images, title type as attribute in XML formats - more robust sitemaps and feeds processing - more accurate extraction - further consolidation: code simplified and bugs fixed | Low | 4/21/2021 |
| v0.8.1 | - extraction trade-off: slightly better recall - code robustness: requests, configuration and navigation - bugfixes: image data extraction | Low | 3/11/2021 |
| v0.8.0 | - improved link discovery and handling - fixes in metadata extraction, feeds and sitemaps processing - breaking change: the `extract` function now reads target format from `output_format` argument only - new extraction option: preserve links, CLI options re-ordered - more opportunistic backup extraction | Low | 2/19/2021 |
| v0.7.0 | - customizable configuration file to parametrize extraction and downloads - better handling of feeds and sitemaps - additional CLI options: cryptographic hash for file name, use Internet Archive as backup - more precise extraction - faster downloads: `requests` replaced with bare `urllib3` and custom decoding - consolidation: bug fixes and improvements, many thanks to the issue reporters! | Low | 1/4/2021 |
| v0.6.1 | - added `bare_extraction` function returning Python variables - improved link discovery in feeds and sitemaps - option to preserve image info - fixes (many thanks to bug reporters!) | Low | 12/2/2020 |
| v0.6.0 | - link discovery in sitemaps - compatibility with Python 3.9 - extraction coverage improved - deduplication now optional - bug fixes | Low | 11/6/2020 |
| v0.5.2 | - optional language detector changed: `langid` → `pycld3` - helper function `bare_extraction()` - optional deduplication off by default - better URL handling (`courlan`), more complete metadata - code consolidation (cleaner and shorter) | Low | 9/22/2020 |
| v0.5.1 | - extended and more convenient command-line options - output in JSON format - bug fixes | Low | 7/15/2020 |
| v0.5.0 | - faster and more robust text and metadata extraction - more efficient batch processing (parallel processing, URL queues) - support for ATOM/RSS feeds - complete command-line tool with corresponding options | Low | 6/2/2020 |
| v0.4.1 | - better metadata extraction and integration (XML & XML-TEI) - more efficient processing - output directory as CLI-option | Low | 4/24/2020 |
| v0.1.0 | First release used in production and meant to be archived on Zenodo for reproducibility and citability. | Low | 9/25/2019 |
