freshcrate

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.

Description

# Trafilatura: Discover and Extract Text Data on the Web

<img alt="Trafilatura Logo" src="https://raw.githubusercontent.com/adbar/trafilatura/master/docs/trafilatura-logo.png" align="center" width="60%"/>

[![Python package](https://img.shields.io/pypi/v/trafilatura.svg)](https://pypi.python.org/pypi/trafilatura)
[![Python versions](https://img.shields.io/pypi/pyversions/trafilatura.svg)](https://pypi.python.org/pypi/trafilatura)
[![Documentation Status](https://readthedocs.org/projects/trafilatura/badge/?version=latest)](http://trafilatura.readthedocs.org/en/latest/?badge=latest)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/trafilatura.svg)](https://codecov.io/gh/adbar/trafilatura)
[![Downloads](https://static.pepy.tech/badge/trafilatura/month)](https://pepy.tech/project/trafilatura)
[![Reference DOI: 10.18653/v1/2021.acl-demo.15](https://img.shields.io/badge/DOI-10.18653%2Fv1%2F2021.acl--demo.15-blue)](https://aclanthology.org/2021.acl-demo.15/)

<img alt="Demo as GIF image" src="https://raw.githubusercontent.com/adbar/trafilatura/master/docs/trafilatura-demo.gif" align="center" width="80%"/>

## Introduction

Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data**. It includes all the necessary discovery and text processing components to perform **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims to stay **handy and modular**: no database is required, and the output can be converted to commonly used formats.

Going from HTML bulk to essential parts can alleviate many problems related to text quality: by **focusing on the actual content**, by **avoiding the noise** caused by recurring elements such as headers and footers, and by **making sense of the data and metadata** with selected information.
The extractor strikes a balance between limiting noise (precision) and including all valid parts (recall). It is **robust and reasonably fast**.

Trafilatura is [widely used](https://trafilatura.readthedocs.io/en/latest/used-by.html) and integrated into [thousands of projects](https://github.com/adbar/trafilatura/network/dependents) by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like the Allen Institute, Stanford, the Tokyo Institute of Technology, and the University of Munich.

### Features

- Advanced web crawling and text discovery:
  - Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
  - Smart crawling and URL management (filtering and deduplication)
- Parallel processing of online and offline input:
  - Live URLs, efficient and polite processing of download queues
  - Previously downloaded HTML files and parsed HTML trees
- Robust and configurable extraction of key elements:
  - Main text (common patterns and generic algorithms like jusText and readability)
  - Metadata (title, author, date, site name, categories and tags)
  - Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
  - Optional elements: comments, links, images, tables
- Multiple output formats:
  - TXT and Markdown
  - CSV
  - JSON
  - HTML, XML and [XML-TEI](https://tei-c.org/)
- Optional add-ons:
  - Language detection on extracted content
  - Speed optimizations
- Actively maintained with support from the open-source community:
  - Regular updates, feature additions, and optimizations
  - Comprehensive documentation

### Evaluation and alternatives

Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, showcasing its efficiency and accuracy in extracting web content. The extractor tries to strike a balance between limiting noise and including all valid parts.
For more information see the [benchmark section](https://trafilatura.readthedocs.io/en/latest/evaluation.html) and the [evaluation readme](https://github.com/adbar/trafilatura/blob/master/tests/README.rst) to run the evaluation with the latest data and packages.

#### Other evaluations:

- Most efficient open-source library in *ScrapingHub*'s [article extraction benchmark](https://github.com/scrapinghub/article-extraction-benchmark)
- Best overall tool according to [Bien choisir son outil d'extraction de contenu à partir du Web](https://hal.archives-ouvertes.fr/hal-02768510v3/document) (Lejeune & Barbaresi 2020)
- Best single tool by ROUGE-LSum Mean F1 Page Scores in [An Empirical Comparison of Web Content Extraction Algorithms](https://webis.de/downloads/publications/papers/bevendorff_2023b.pdf) (Bevendorff et al. 2023)

## Usage and documentation

[Getting started with Trafilatura](https://trafilatura.readthedocs.io/en/latest/quickstart.html) is straightforward. For more information and detailed guides, visit [Trafilatura's documentation](https://trafilatura.readthedocs.io/): - [Installation](https://t

Release History

2.0.0 (4/21/2026, urgency: Low)
- Imported from PyPI (2.0.0)

v2.0.0 (12/3/2024, urgency: Low)
Breaking changes:
- Python 3.6 and 3.7 deprecated (#709)
- `bare_extraction()`: now returns an instance of the `Document` class by default; `as_dict` deprecation warning → use the `.as_dict()` method on the return value (#730)
- `bare_extraction()` and `extract()`: `no_fallback` deprecation warning → use `fast` instead (#730)
- downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)
- deprecated graphical user interface now removed (#713)
- extraction:
v1.12.2 (9/10/2024, urgency: Low)
- downloads: add support for SOCKS proxies with @gremid (#682)
- extraction fix: ValueError in table spans (#685)
- spider: `prune_xpath` parameter added by @felipehertzer (#684)
- spider: relax strict parameter for link extraction (#687)
- sitemaps: `max_sitemaps` parameter added by @felipehertzer (#690)
- maintenance: make compression libraries optional (#691)
- metadata: review and lint code (#694)

v1.12.1 (8/20/2024, urgency: Low)
Navigation:
- spider: restrict search to sections containing URL path (#673)
- crawler: add parameter class and types, **breaking change** for undocumented functions (#675)
- maintenance: simplify link discovery and extend tests (#674)
- CLI: review code, add types and tests (#677)
Bugfixes:
- fix `AttributeError` in element deletion (#668)
- fix `MemoryError` in table header columns (#665)
Docs:
- fix variable name for extract_metadata in quickstart by @jpigla in #678

v1.12.0 (7/30/2024, urgency: Low)
Breaking change:
- enforce fixed list of output formats, deprecate `-out` on the CLI (#647)
Faster, more accurate extraction:
- review link and structure checks (#653)
- improve justext fallback (#652)
- baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646)
- review XPaths for undesirable content (#645)
Bugfixes and maintenance:
- CLI fix: markdown format should trigger `include_formatting` (#649)
- images fix: use a length threshold on src attribute

v1.11.0 (6/27/2024, urgency: Low)
Breaking change:
- metadata now skipped by default (#613); to trigger inclusion in all output formats: `with_metadata=True` (Python), `--with-metadata` (CLI)
Extraction:
- add HTML as output format (#614)
- better and faster baseline extraction (#619)
- better handling of HTML/XML elements (#628)
- XPath rules added with @felipehertzer (#540)
- fix: avoid faulty readability_lxml content (#635)
Evaluation:
- new scripts and data with @LydiaKoerber (#606, #615)
- additio
v1.10.0 (5/30/2024, urgency: Low)
Breaking changes:
- raise errors on deprecated CLI and function arguments (#581)
- regroup classes and functions linked to deduplication (#582): ``trafilatura.hashing`` → ``trafilatura.deduplication``
Extraction:
- port of is_probably_readerable from readability.js by @zirkelc in #587
- Markdown table fixes by @naktinis in #601
- fix list spacing in TXT output (#598)
- CLI fixes: file processing options, mtime, and tests (#605)
- CLI fix: read standard input as binary (#607)
Downloa

v1.9.0 (5/2/2024, urgency: Low)
Extraction:
- add markdown as explicit output (#550)
- improve recall preset (#571)
- speedup for readability-lxml (#547)
- add global options object for extraction and use it in CLI (#552)
- fix: better encoding detection (#548)
- recall: fix for lists inside tables with @mikhainin (#534)
- add symbol to preserve vertical spacing in Markdown (#499)
- fix: table cell separators in non-XML output (#563)
- slightly better accuracy and execution speed overall
Metadata:
- add file creat

v1.8.1 (4/3/2024, urgency: Low)
Maintenance:
- Pin LXML to prevent broken dependency (#535)
Extraction:
- Improve extraction accuracy for major news outlets (#530)
- Fix formatting by correcting order of element generation and space handling with @dlwh (#528)
- Fix: prevent tail insertion before children in nested elements by @knit-bee (#536)

v1.8.0 (3/20/2024, urgency: Low)
Extraction:
- Better precision by @felipehertzer (#509, #520)
- Code formatting in TXT/Markdown output added (#498)
- Improved CSV output (#496)
- LXML: compile XPath expressions (#504)
- Overall speedup of about +5%
Downloads and Navigation:
- More robust scans with `is_live_page()` (#501)
- Better sitemap start and safeguards (#503, #506)
- Fix for headers in response object (#513)
Maintenance:
- License changed to Apache 2.0
- `Response` class: convenience functions added (#497)

v1.7.0 (1/25/2024, urgency: Low)
Extraction:
- improved `html2txt()` function (#483)
Downloads:
- add advanced `fetch_response()` function → pending deprecation for `fetch_url(decode=False)`
Maintenance:
- support for LXML v5+ (#484 by @knit-bee, #485)
- update [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.7.0)

v1.6.4 (1/8/2024, urgency: Low)
Maintenance:
- MacOS: fix setup, update htmldate and add tests (#460)
- drop invalid XML element attributes with @vbarbaresi in #462
- remove cyclic imports (#458)
Navigation:
- introduce `MAX_REDIRECTS` config setting and fix urllib3 redirect handling by @vbarbaresi in #461
- improve feed detection (#457)
Documentation:
- enhancements to documentation and testing with @Maddesea in #456
v1.6.3 (11/29/2023, urgency: Low)
Extraction:
- preserve space in certain elements with @idoshamun (#429)
- optional list of xPaths to prune by @HeLehm (#414)
Metadata:
- more precise date extraction (see [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.6.0))
- new `htmldate` extensive search parameter in config (#434)
- changes in URLs: normalization, trackers removed (see [courlan](https://github.com/adbar/courlan/releases/tag/v0.9.5))
Navigation:
- reviewed code for feeds (#443)
- new config option: e

v1.6.2 (9/6/2023, urgency: Low)
Extraction:
- more lenient HTML parsing (#370)
- improved code block support with @idoshamun (#372, #401)
- conversion of relative links to absolute by @feltcat (#377)
- remove use of signal from core functions (#384)
Metadata:
- JSON-LD fix for sitenames by @felipehertzer (#383)
Command-line interface:
- more robust batch processing (#381)
- added `--probe` option to CLI to check for extractable content (#378, #392)
Maintenance:
- simplified code (#408)
- support for Python 3.

v1.6.1 (6/15/2023, urgency: Low)
Extraction:
- minor fixes: tables in figures (#301), headings (#354) and lists (#318)
Metadata:
- simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
- authors, JSON and unicode fixes by @felipehertzer in #365
- fix for authors without `additionalName` by @awwitecki in #363
Navigation:
- reviewed link processing in feeds and sitemaps (#340, #350)
- more robust spider (#359)
- updated underlying courlan package (#360)
Full Changelog: https://github.com/adba

v1.6.0 (5/11/2023, urgency: Low)
Extraction:
- new content hashes and default file names (#314)
- fix deprecation warning with @sdondley in #321
- fix for metadata image by @andremacola in #328
- fix potential unicode issue in third-party extraction with @Korben00 in #331
- review logging levels (#347)
Command-line interface:
- more efficient sitemap processing (#326)
- more efficient downloads (#338)
- fix for single URL processing (#324) and URL blacklisting (#339)
Navigation:
- additional safety check on domain

v1.5.0 (3/30/2023, urgency: Low)
Extraction:
- fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
- pagetype and image urls added to metadata by @andremacola (#282, #310)
- add as_dict method to Document class with @edkrueger in #306
- XML output fix with @knit-bee in #315
- various smaller fixes: lists (#309), XPaths, metadata hardening
Navigation:
- transfer URL management to courlan.UrlStore (#232, #312)
- fixes for spider module
Maintenance:
- simp

v1.4.1 (1/19/2023, urgency: Low)
Extraction:
- extraction bugs fixed (#263, #266), more robust HTML doctype parsing
- XML output improvements by @knit-bee (#273, #274)
- adjust thresholds for link density in paragraphs
Metadata:
- improved title and sitename detection (#284)
- faster author, categories, domain name, and tags extraction
- fixes to author emoji regexes by @felipehertzer (#269)
Command-line interface:
- review argument consistency and add deprecation warnings (#261)
Setup:
- make download timeout

v1.4.0 (10/18/2022, urgency: Low)
Impact on extraction and output format:
- better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
- XML: preserve list type as attribute (#229)
- XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
- faster text cleaning and shorter code (#237 with @deedy5, #245)
- metadata: add language when detector is activated (#224)
- metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
- TXT: change markdown formatt

v1.3.0 (7/29/2022, urgency: Low)
- fast and robust `html2txt()` function added (#221)
- more robust parsing (#228)
- fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
- extraction about 10-20% faster, slightly better recall
- partial fixes for memory leaks (#216)
- docs extended and updated (#217, #225)
- prepared deprecation of old `process_record()` function
- more stable processing with updated dependencies
**Full Changelog**: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0
v1.2.2 (5/18/2022, urgency: Low)
- more efficient rules for extraction
- metadata: further attributes used (with @felipehertzer)
- better baseline extraction
- issues fixed: #202, #204, #205
- evaluation updated
Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.1...v1.2.2

v1.2.1 (5/2/2022, urgency: Low)
- ``--precision`` and ``--recall`` arguments added to the CLI
- better text cleaning: paywalls and comments
- improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
- further bugs fixed: #189, #192 (with @felipehertzer), #200
- efficiency: faster module loading and improved RAM footprint
**Full Changelog**: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1

v1.2.0 (3/7/2022, urgency: Low)
- efficiency: replaced module readability-lxml by trimmed fork
- bugs fixed (#179, #180, #183, #184)
- improved baseline extraction
- cleaner metadata (with @felipehertzer)
**Full Changelog**: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0

v1.1.0 (2/21/2022, urgency: Low)
- encodings: better detection, output NFC-normalized Unicode
- maintenance and performance: more efficient code
- bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
- prepare compatibility with upcoming Python 3.11
- changed default settings
- extended documentation
**Full Changelog**: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0

v1.0.0 (11/30/2021, urgency: Low)
- compress HTML backup files & seamlessly open .gz files
- support JSON web feeds
- graphical user interface integrated into main package
- faster downloads: reviewed backoff, compressed data
- optional modules: downloads with `pycurl`, language identification with `py3langid`
- bugs fixed (#111, #125, #132, #136, #140)
- minor optimizations and fixes by @vbarbaresi in [#124](https://github.com/adbar/trafilatura/pull/124) & [#130](https://github.com/adbar/trafilatura/pull/130)
- fixed arr

v0.9.3 (10/21/2021, urgency: Low)
- better, faster encoding detection: replaced chardet with charset_normalizer
- faster execution: updated justext to 3.0
- better extraction of sub-elements in tables (#78, #90)
- more robust web feed parsing
- further defined precision- and recall-oriented settings
- license extraction in footers (#118)
**Full Changelog**: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3

v0.9.2 (10/6/2021, urgency: Low)
- first precision- and recall-oriented presets defined
- improvements in authorship extraction (thanks @felipehertzer)
- requesting TXT output with formatting now results in Markdown format
- bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
- setting for cookies in request headers (thanks @muellermartin)
- better date extraction thanks to htmldate update

v0.9.1 (8/2/2021, urgency: Low)
- improved author extraction (thanks @felipehertzer!)
- bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
- docs updated and extended
- CLI: option names normalized (heed deprecation warnings), new option `explore`

v0.9.0 (6/15/2021, urgency: Low)
- focused crawling functions including politeness rules
- more efficient multi-threaded downloads + use as Python functions
- documentation extended
- bugs fixed: extraction and URL handling
- removed support for Python 3.4

v0.8.2 (4/21/2021, urgency: Low)
- better handling of formatting, links and images, title type as attribute in XML formats
- more robust sitemaps and feeds processing
- more accurate extraction
- further consolidation: code simplified and bugs fixed

v0.8.1 (3/11/2021, urgency: Low)
- extraction trade-off: slightly better recall
- code robustness: requests, configuration and navigation
- bugfixes: image data extraction

v0.8.0 (2/19/2021, urgency: Low)
- improved link discovery and handling
- fixes in metadata extraction, feeds and sitemaps processing
- breaking change: the `extract` function now reads the target format from the `output_format` argument only
- new extraction option: preserve links; CLI options re-ordered
- more opportunistic backup extraction

v0.7.0 (1/4/2021, urgency: Low)
- customizable configuration file to parametrize extraction and downloads
- better handling of feeds and sitemaps
- additional CLI options: cryptographic hash for file name, use Internet Archive as backup
- more precise extraction
- faster downloads: `requests` replaced with bare `urllib3` and custom decoding
- consolidation: bug fixes and improvements, many thanks to the issue reporters!
v0.6.1 (12/2/2020, urgency: Low)
- added `bare_extraction` function returning Python variables
- improved link discovery in feeds and sitemaps
- option to preserve image info
- fixes (many thanks to bug reporters!)

v0.6.0 (11/6/2020, urgency: Low)
- link discovery in sitemaps
- compatibility with Python 3.9
- extraction coverage improved
- deduplication now optional
- bug fixes

v0.5.2 (9/22/2020, urgency: Low)
- optional language detector changed: `langid` → `pycld3`
- helper function `bare_extraction()`
- optional deduplication off by default
- better URL handling (`courlan`), more complete metadata
- code consolidation (cleaner and shorter)

v0.5.1 (7/15/2020, urgency: Low)
- extended and more convenient command-line options
- output in JSON format
- bug fixes

v0.5.0 (6/2/2020, urgency: Low)
- faster and more robust text and metadata extraction
- more efficient batch processing (parallel processing, URL queues)
- support for ATOM/RSS feeds
- complete command-line tool with corresponding options

v0.4.1 (4/24/2020, urgency: Low)
- better metadata extraction and integration (XML & XML-TEI)
- more efficient processing
- output directory as CLI option

v0.1.0 (9/25/2019, urgency: Low)
First release used in production and meant to be archived on Zenodo for reproducibility and citability.


Similar Packages

- azure-storage-blob — Microsoft Azure Blob Storage Client Library for Python (azure-template_0.1.0b6187637)
- azure-storage-file-share — Microsoft Azure Azure File Share Storage Client Library for Python (azure-template_0.1.0b6187637)
- mirakuru — Process executor (not only) for tests. (3.0.2)
- opentelemetry-instrumentation-qdrant — OpenTelemetry Qdrant instrumentation (0.60.0)
- django-modelcluster — Django extension to allow working with 'clusters' of models as a single unit, independently of the database (6.4.1)