waybackpy
Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily.
Description
<!-- markdownlint-disable MD033 MD041 -->
<div align="center">

<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>

<h3>A Python package & CLI tool that interfaces with the Wayback Machine API</h3>

</div>

<p align="center">
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ATests"><img alt="Unit Tests" src="https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg"></a>
<a href="https://codecov.io/gh/akamhy/waybackpy"><img alt="codecov" src="https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg"></a>
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://app.codacy.com/gh/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade_Settings"><img alt="Codacy Badge" src="https://api.codacy.com/project/badge/Grade/6d777d8509f642ac89a20715bb3a6193"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
</p>

---

# <img src="https://github.githubassets.com/images/icons/emoji/unicode/2b50.png" width="30"></img> Introduction

Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs. The Wayback Machine has 3 client-side APIs.

- SavePageNow or Save API
- CDX Server API
- Availability API

These three APIs can be accessed through waybackpy, either by importing it in a Python file/module or from the command-line interface.

## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f3d7.png" width="20"></img> Installation

**Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)), from [PyPI](https://pypi.org/) (recommended)**:

```bash
pip install waybackpy
```

**Using [conda](https://en.wikipedia.org/wiki/Conda_(package_manager)), from [conda-forge](https://anaconda.org/conda-forge/waybackpy) (recommended)**:

See also the [waybackpy feedstock](https://github.com/conda-forge/waybackpy-feedstock); its maintainers are [@rafaelrdealmeida](https://github.com/rafaelrdealmeida/), [@labriunesp](https://github.com/labriunesp/) and [@akamhy](https://github.com/akamhy/).

```bash
conda install -c conda-forge waybackpy
```

**Install directly from [this git repository](https://github.com/akamhy/waybackpy) (NOT recommended)**:

```bash
pip install git+https://github.com/akamhy/waybackpy.git
```

## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f433.png" width="20"></img> Docker Image

Docker Hub: [hub.docker.com/r/secsi/waybackpy](https://hub.docker.com/r/secsi/waybackpy)

The Docker image is automatically updated on every release by [Regularly and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI). RAUDI is a tool by [SecSI](https://secsi.io), an Italian cybersecurity startup.

## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f680.png" width="20"></img> Usage

### As a Python package

#### Save API aka SavePageNow

```python
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
```

#### CDX API aka CDXServerAPI

```python
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
```

##### oldest

```python
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
```

##### newest

```python
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newe
```

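
The snapshot attributes shown for `oldest` map directly onto the space-separated fields of the printed CDX line. The following is a minimal offline sketch of that mapping, not waybackpy's own implementation: `CdxLine` and `parse_cdx_line` are hypothetical names, and the field names `digest` and `length` for the last two columns are assumptions based on the usual CDX field order. It makes no network requests.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CdxLine:
    """Hypothetical container for one CDX line (not a waybackpy class)."""

    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str  # assumed name for the sixth field
    length: str  # assumed name for the seventh field

    @property
    def datetime_timestamp(self) -> datetime:
        # The timestamp is a 14-digit string: YYYYMMDDhhmmss.
        return datetime.strptime(self.timestamp, "%Y%m%d%H%M%S")

    @property
    def archive_url(self) -> str:
        # Archive URLs follow the pattern shown in the examples above.
        return f"https://web.archive.org/web/{self.timestamp}/{self.original}"


def parse_cdx_line(line: str) -> CdxLine:
    """Split a space-separated CDX line into its seven fields."""
    fields = line.strip().split(" ")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return CdxLine(*fields)


line = (
    "com,google)/ 19981111184551 http://google.com:80/ text/html 200 "
    "HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381"
)
snapshot = parse_cdx_line(line)
print(snapshot.archive_url)
print(snapshot.datetime_timestamp)
```

All fields are kept as strings, as in the REPL output above, where `statuscode` is `'200'` rather than an integer.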
Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 3.0.6 | Imported from PyPI (3.0.6) | Low | 4/21/2026 |
| 3.0.5 | **What's Changed** - Undo the drop of Python 3.6 support by @akamhy in https://github.com/akamhy/waybackpy/pull/163 **Full Changelog**: https://github.com/akamhy/waybackpy/compare/3.0.4...3.0.5 [](https://sourceforge.net/projects/waybackpy/files/3.0.5/v3.0.5.zip/download) | Low | 2/18/2022 |
| 3.0.4 | **What's Changed** - Move metadata from `__init__.py` into `setup.cfg` by @eggplants in https://github.com/akamhy/waybackpy/pull/153 - Add sort param support in the CDX API class by @akamhy in https://github.com/akamhy/waybackpy/pull/156 - Add sort, use_pagination and closest by @akamhy in https://github.com/akamhy/waybackpy/pull/158 - CDX-based oldest, newest and near by @akamhy in https://github.com/akamhy/waybackpy/pull/159 **Full Changelog**: https://github.com/akamhy/waybackpy/compare/3.0.3. | Low | 2/18/2022 |
| 3.0.3 | **What's Changed** - Dropped Python 3.4 to 3.6, both inclusive. - Catch 429 and 509 status codes for the Save Page Now API. - Increased the default CDX limit from 5000 to 25000 records per API call. - Added type hints. - The package now closes sessions explicitly. - Removed useless code. - Added docstrings. **New Contributors** - @eggplants made their first contribution in https://github.com/akamhy/waybackpy/pull/124 - @deepsource-autofix made their first contribution in https://github. | Low | 2/9/2022 |
| 3.0.2 | Nothing changed with respect to the previous version; this release was created for conda-forge. Replaced the non-ASCII character figlet with an ASCII character figlet; see https://github.com/conda-forge/staged-recipes/pull/17643 [](https://sourceforge.net/projects/waybackpy/files/3.0.2/v3.0.2.zip/download) | Low | 1/25/2022 |
| 3.0.1 | **What's Changed** - Escape '.' before 'archive.org' by @akamhy in https://github.com/akamhy/waybackpy/pull/112 - Update setup.py by @rafaelrdealmeida in https://github.com/akamhy/waybackpy/pull/114 - Do not use f-strings in setup.py by @akamhy in https://github.com/akamhy/waybackpy/pull/115 **New Contributors** - @rafaelrdealmeida made their first contribution in https://github.com/akamhy/waybackpy/pull/114 See also https://github.com/conda-forge/staged-recipes/pull/17634 and https:// | Low | 1/25/2022 |
| 3.0.0 | **What's Changed** - The 3 different APIs now have 3 different classes: WaybackMachineCDXServerAPI, WaybackMachineSaveAPI and WaybackMachineAvailabilityAPI. - The CLI now supports the CDX API. - The old Url class will continue to be supported, so existing code will not break. - Get is now deprecated; trying to add tasks meant for urllib was a bad idea. **Full Changelog**: https://github.com/akamhy/waybackpy/compare/2.4.4...3.0.0 | Low | 1/18/2022 |
| 2.4.4 | - When the response code is 509, raise an error with an explanation (based on the actual error message contained in the response HTML). - Fix typo [](https://sourceforge.net/projects/waybackpy/files/2.4.4/v2.4.4.zip/download) | Low | 9/3/2021 |
| 2.4.3 | - Fix redirect issues with HTTP and HTTPS redirection - More stable archiving [](https://sourceforge.net/projects/waybackpy/files/2.4.3/v2.4.3.zip/download) | Low | 4/2/2021 |
| 2.4.2 | - Added the CLI argument --file; if it is not used with known URLs, waybackpy will not save the output URLs to a file. - Added the cached_save flag on the waybackpy URL object; the flag is true if the returned saved archive is older than 3 minutes, else false. - BUG FIX: the CLI --json argument returned a JSON-loaded Python dict instead of valid JSON. This is now fixed. [](https://sourceforge.net/projects/waybackpy/files/2.4.2/v2.4.2.zip/ | Low | 1/24/2021 |
| 2.4.1 | - Change str repr of cdxsnapshot to cdx line - Support unix ts as an arg in near - Don't fetch more pages if >=2 pages are empty, Pagination API - Don't use pagination API if total pages <= 2 - The Cdx method get() now gets the last fetched archive by default [](https://sourceforge.net/projects/waybackpy/files/2.4.1/v2.4.1.zip/download) | Low | 1/12/2021 |
| 2.4.0 | - Cdx API now fully supported [](https://sourceforge.net/projects/waybackpy/files/2.4.0/v2.4.0.zip/download) | Low | 1/10/2021 |
| 2.3.3 | - Added support for querying the CDX Pagination API - The Cdx class is publicly available for use in third-party code. - Some methods of Url now use the CDX Pagination API [](https://sourceforge.net/projects/waybackpy/files/2.3.3/v2.3.3.zip/download) | Low | 1/4/2021 |
| 2.3.2 | - Better error messages for CLI users. - FIXED BUG: removed code from `__init__` that was fetching the availability API without instruction. [](https://sourceforge.net/projects/waybackpy/files/2.3.2/v2.3.2.zip/download) | Low | 1/2/2021 |
| 2.3.1 | - Fixed bug: `Url.__init__()` was making unnecessary requests to the availability checking API. [](https://sourceforge.net/projects/waybackpy/files/2.3.1/v2.3.1.zip/download) | Low | 1/1/2021 |
| 2.3.0 | - Now using the requests package instead of urllib.request. The requests package is better at handling unusual redirects and other issues. - Now using threading for checking live URLs. - Improved code quality and formatting. - And now we also have a new cool logo. - Docs are no longer hosted on readthedocs, but at https://akamhy.github.io/waybackpy/ [](https://sourceforge.net/projects/waybackpy/files/2.3.0/v2.3.0.zip/download) | Low | 12/13/2020 |
| 2.2.0 | Changes: - Added `archive_url` and `--archive_url` in the wrapper and CLI respectively. This is just an alias for the `newest` method. - The return types of archive URLs are no longer strings but instances of the Url class. - Added `JSON` and `--json` in the wrapper and CLI respectively. Used to read the API response of the availability API. - The `len()` method on Url objects now returns the age of the archive. [](https://sourceforge.net/projects/waybackpy/files/2.1.9/v2.1.9.zip/download) | Low | 10/2/2020 |
| 2.1.8 | 1) New feature - known urls list 2) Updated Readme [](https://sourceforge.net/projects/waybackpy/files/2.1.8/v2.1.8.zip/download) | Low | 10/2/2020 |
| 2.1.7 | New regex added for parsing the archive URL. | Low | 8/9/2020 |
| 2.1.6 | - fix issues with cli | Low | 7/24/2020 |
| 2.1.5 | - minor bug fixes | Low | 7/24/2020 |
| 2.1.4 | - removed duplicate method which should improve the error handling | Low | 7/23/2020 |
| 2.1.3 | - Support CLI - Code refactoring - bug fixes - better exceptions | Low | 7/22/2020 |
| 2.1.2 | - Minor bug fixes. - Updated index.rst - 2 new tests introduced | Low | 7/20/2020 |
| 2.1.1 | - Minor bug fixes - Example replit links changed to my account. | Low | 7/19/2020 |
| 2.1.0 | - Updates for recent API changes - Updated documentation | Low | 7/19/2020 |
| 2.0.2 | Release 2.0.2 | Low | 7/18/2020 |
| 2.0.1 | No Time out for final save() try. | Low | 7/18/2020 |
| 2.0.0 | OOP based | Low | 7/18/2020 |
| v1.6 | Release v1.6 | Low | 5/7/2020 |
| v1.4 | Release v1.4 | Low | 5/5/2020 |
| v1.3 | Release v1.3 | Low | 5/5/2020 |
| v1.2 | Support for get(); fix bug with near() | Low | 5/5/2020 |
| v1.1 | First release of waybackpy ! | Low | 5/4/2020 |
