# trafilatura

> Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.

- **URL**: https://www.freshcrate.ai/projects/trafilatura
- **Author**: pypi
- **Category**: Databases
- **Latest version**: `2.0.0` (2026-04-21)
- **License**: Apache 2.0
- **Source**: https://github.com/adbar/trafilatura
- **Homepage**: https://pypi.org/project/trafilatura/
- **Language**: Python
- **GitHub**: 5,758 stars, 358 forks
- **Registry**: pypi (`trafilatura`)
- **Tags**: `corpus`, `html2text`, `natural-language-processing`, `news-crawler`, `pypi`, `scraper`, `tei-xml`, `text-extraction`, `webscraping`

## Description

# Trafilatura: Discover and Extract Text Data on the Web

<br/>

<img alt="Trafilatura Logo" src="https://raw.githubusercontent.com/adbar/trafilatura/master/docs/trafilatura-logo.png" align="center" width="60%"/>

<br/>

[![Python package](https://img.shields.io/pypi/v/trafilatura.svg)](https://pypi.python.org/pypi/trafilatura)
[![Python versions](https://img.shields.io/pypi/pyversions/trafilatura.svg)](https://pypi.python.org/pypi/trafilatura)
[![Documentation Status](https://readthedocs.org/projects/trafilatura/badge/?version=latest)](http://trafilatura.readthedocs.org/en/latest/?badge=latest)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/trafilatura.svg)](https://codecov.io/gh/adbar/trafilatura)
[![Downloads](https://static.pepy.tech/badge/trafilatura/month)](https://pepy.tech/project/trafilatura)
[![Reference DOI: 10.18653/v1/2021.acl-demo.15](https://img.shields.io/badge/DOI-10.18653%2Fv1%2F2021.acl--demo.15-blue)](https://aclanthology.org/2021.acl-demo.15/)

<br/>

<img alt="Demo as GIF image" src="https://raw.githubusercontent.com/adbar/trafilatura/master/docs/trafilatura-demo.gif" align="center" width="80%"/>

<br/>


## Introduction

Trafilatura is a cutting-edge **Python package and command-line tool**
designed to **gather text on the Web and simplify the process of turning
raw HTML into structured, meaningful data**. It includes all necessary
discovery and text processing components to perform **web crawling,
downloads, scraping, and extraction** of main texts, metadata and
comments. It aims to stay **handy and modular**: no database is
required, and the output can be converted to commonly used formats.

Going from HTML bulk to essential parts can alleviate many problems
related to text quality, by **focusing on the actual content**,
**avoiding the noise** caused by recurring elements like headers and footers
and by **making sense of the data and metadata** with selected information.
The extractor strikes a balance between limiting noise (precision) and
including all valid parts (recall). It is **robust and reasonably fast**.
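
To make the precision/recall trade-off concrete, the sketch below scores an extraction against a reference text at the token level. It is only an illustration using the standard library: the function name `token_scores` and the sample strings are made up for this example, and this is not necessarily how the project's own benchmarks compute their scores.

```python
from collections import Counter


def token_scores(extracted: str, reference: str) -> dict:
    """Token-level precision, recall and F1 of an extraction against a reference."""
    extracted_tokens = Counter(extracted.lower().split())
    reference_tokens = Counter(reference.lower().split())
    # Multiset intersection: tokens present in both texts, counted with multiplicity.
    overlap = sum((extracted_tokens & reference_tokens).values())
    n_extracted = sum(extracted_tokens.values())
    n_reference = sum(reference_tokens.values())
    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical texts: the reference is the "true" main content of a page.
REFERENCE = "the quick brown fox jumps over the lazy dog"
# Navigation words were kept: nothing is missing, but noise lowers precision.
NOISY = "home about contact the quick brown fox jumps over the lazy dog"
# Only part of the text was kept: no noise, but recall drops.
TRUNCATED = "the quick brown fox"

for name, text in [("noisy", NOISY), ("truncated", TRUNCATED)]:
    print(name, token_scores(text, REFERENCE))
```

An extractor tuned only for precision tends toward the truncated case, one tuned only for recall toward the noisy case; a combined score such as F1 rewards staying between the two.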

Trafilatura is [widely used](https://trafilatura.readthedocs.io/en/latest/used-by.html)
and integrated into [thousands of projects](https://github.com/adbar/trafilatura/network/dependents)
by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like
the Allen Institute, Stanford, the Tokyo Institute of Technology, and
the University of Munich.


### Features

- Advanced web crawling and text discovery:
   - Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
   - Smart crawling and URL management (filtering and deduplication)

- Parallel processing of online and offline input:
   - Live URLs, efficient and polite processing of download queues
   - Previously downloaded HTML files and parsed HTML trees

- Robust and configurable extraction of key elements:
   - Main text (common patterns and generic algorithms like jusText and readability)
   - Metadata (title, author, date, site name, categories and tags)
   - Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
   - Optional elements: comments, links, images, tables

- Multiple output formats:
   - TXT and Markdown
   - CSV
   - JSON
   - HTML, XML and [XML-TEI](https://tei-c.org/)

- Optional add-ons:
   - Language detection on extracted content
   - Speed optimizations

- Actively maintained with support from the open-source community:
   - Regular updates, feature additions, and optimizations
   - Comprehensive documentation
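
As a rough illustration of the extraction and output options listed above, the sketch below runs `extract()` on an HTML string held in memory and requests two of the output formats. It assumes trafilatura 2.x is installed; the sample HTML and the helper name `extract_formats` are invented for this example, and the exact text returned depends on the library version and settings.

```python
import json

# Hypothetical sample page; any downloaded or stored HTML string works the same way.
SAMPLE_HTML = """<html><head><title>Sample article</title></head>
<body>
<header><a href="/">Home</a> <a href="/about">About</a></header>
<article>
<h1>Sample article</h1>
<p>Trafilatura is meant to keep the main text of a page and to leave out
recurring elements such as navigation bars, headers and footers.</p>
<p>This second paragraph only exists to give the extractor enough running
text to work with, since very short documents are harder to classify.</p>
</article>
<footer>Copyright notice and other boilerplate</footer>
</body></html>"""


def extract_formats(html: str) -> dict:
    """Return the main text as Markdown and as a parsed JSON record."""
    # Imported inside the function so this module loads even where
    # trafilatura is not installed.
    from trafilatura import extract

    markdown = extract(html, output_format="markdown")
    as_json = extract(html, output_format="json", with_metadata=True)
    # extract() returns None when no usable content was found.
    record = json.loads(as_json) if as_json is not None else None
    return {"markdown": markdown, "json": record}
```

Calling `extract_formats(SAMPLE_HTML)` yields a dictionary with the Markdown text and the decoded JSON record; for live pages, the HTML string would typically come from `trafilatura.fetch_url(url)` instead of a constant.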


### Evaluation and alternatives

Trafilatura consistently outperforms other open-source libraries in text
extraction benchmarks, showcasing its efficiency and accuracy in
extracting web content. The extractor tries to strike a balance between
limiting noise and including all valid parts.

For more information see the [benchmark section](https://trafilatura.readthedocs.io/en/latest/evaluation.html)
and the [evaluation readme](https://github.com/adbar/trafilatura/blob/master/tests/README.rst)
to run the evaluation with the latest data and packages.


#### Other evaluations:

- Most efficient open-source library in *ScrapingHub*'s [article extraction benchmark](https://github.com/scrapinghub/article-extraction-benchmark)
- Best overall tool according to [Bien choisir son outil d'extraction de contenu à partir du Web](https://hal.archives-ouvertes.fr/hal-02768510v3/document)
  (Lejeune & Barbaresi 2020)
- Best single tool by ROUGE-LSum Mean F1 Page Scores in [An Empirical Comparison of Web Content Extraction Algorithms](https://webis.de/downloads/publications/papers/bevendorff_2023b.pdf)
  (Bevendorff et al. 2023)


## Usage and documentation

[Getting started with Trafilatura](https://trafilatura.readthedocs.io/en/latest/quickstart.html)
is straightforward. For more information and detailed guides, visit
[Trafilatura's documentation](https://trafilatura.readthedocs.io/):

- [Installation](https://t

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `2.0.0` | 2026-04-21 | Low | Imported from PyPI (2.0.0) |
| `v2.0.0` | 2024-12-03 | Low | Breaking changes:<br/>- Python 3.6 and 3.7 deprecated (#709)<br/>- `bare_extraction()`: now returns an instance of the `Document` class by default; `as_dict` deprecation warning → use `.as_dict()` method on return value (#730)<br/>- `bare_extraction()` and `extract()`: `no_fallback` deprecation warning → use `fast` instead (#730)<br/>- downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)<br/>- deprecated graphical user interface now removed (#713)<br/>- extraction: |
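
The breaking changes listed for `v2.0.0` mostly come down to renamed arguments and a new return type. The sketch below shows what calling code might look like after the change, based only on the changelog entries above. It assumes trafilatura 2.x is installed; `summarize` and `SAMPLE_HTML` are hypothetical names used for illustration.

```python
from typing import Optional

# Hypothetical input; in practice this would be a downloaded or stored page.
SAMPLE_HTML = """<html><head><title>Release notes</title></head>
<body><article>
<h1>Release notes</h1>
<p>This page stands in for a real document. It contains a couple of
sentences so that the extractor has some running text to process.</p>
<p>A second paragraph follows, again only as filler for the example,
and it does not carry any particular meaning of its own.</p>
</article></body></html>"""


def summarize(html: str) -> Optional[dict]:
    """Extract a page with the 2.0-style arguments and return a plain dict."""
    # Imported inside the function so this module loads even where
    # trafilatura is not installed.
    from trafilatura import bare_extraction

    # 2.0: `fast=True` replaces the deprecated `no_fallback=True`.
    document = bare_extraction(html, fast=True)
    if document is None:
        return None
    # 2.0: a Document instance is returned by default; call .as_dict()
    # on it instead of passing the deprecated `as_dict` argument.
    return document.as_dict()
```

Code that previously passed `as_dict=True` or `no_fallback=True` should, per the changelog, keep working for now with a deprecation warning, so the migration can be done incrementally.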

## Citation

- HTML: https://www.freshcrate.ai/projects/trafilatura
- Markdown: https://www.freshcrate.ai/projects/trafilatura.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/trafilatura/deps

_Generated by freshcrate.ai. Indexes pypi releases for AI-agent ecosystem packages._
