# recordlinkage

> A record linkage toolkit for linking and deduplication

- **URL**: https://www.freshcrate.ai/projects/recordlinkage
- **Author**: pypi
- **Category**: Developer Tools
- **Latest version**: `0.16` (2026-04-21)
- **License**: BSD-3-Clause
- **Source**: https://github.com/J535D165/recordlinkage
- **Homepage**: https://pypi.org/project/recordlinkage/
- **Language**: Python
- **GitHub**: 1,046 stars, 152 forks
- **Registry**: pypi (`recordlinkage`)
- **Tags**: `pypi`

## Description

<div align="center">
  <img src="https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg"><br>
</div>

# RecordLinkage: powerful and modular Python record linkage toolkit

[![Pypi Version](https://badge.fury.io/py/recordlinkage.svg)](https://pypi.python.org/pypi/recordlinkage/)
[![Github Actions CI Status](https://github.com/J535D165/recordlinkage/workflows/tests/badge.svg?branch=master)](https://github.com/J535D165/recordlinkage/actions)
[![Code Coverage](https://codecov.io/gh/J535D165/recordlinkage/branch/master/graph/badge.svg)](https://codecov.io/gh/J535D165/recordlinkage)
[![Documentation Status](https://readthedocs.org/projects/recordlinkage/badge/?version=latest)](https://recordlinkage.readthedocs.io/en/latest/?badge=latest)
[![Zenodo DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3559042.svg)](https://doi.org/10.5281/zenodo.3559042)

**RecordLinkage** is a powerful and modular record linkage toolkit to
link records in or between data sources. The toolkit provides most of
the tools needed for record linkage and deduplication. The package
contains indexing methods, functions to compare records, and classifiers.
It is developed for research and for linking small or medium-sized
files.

This project is inspired by the [Freely Extensible Biomedical Record
Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project, which
is a great project. In contrast with FEBRL, the recordlinkage project
uses [pandas](http://pandas.pydata.org/) and
[numpy](http://www.numpy.org/) for data handling and computations. The
use of *pandas*, a flexible and powerful data analysis and manipulation
library for Python, makes the record linkage process much easier and
faster. The extensive *pandas* library can be used to integrate your
record linkage directly into existing data manipulation projects.

One of the aims of this project is to make an easily extensible record
linkage framework. It is easy to include your own indexing algorithms,
comparison/similarity measures and classifiers.

## Basic linking example

Import the `recordlinkage` module with all important tools for record
linkage and import the data manipulation framework **pandas**.

``` python
import recordlinkage
import pandas
```

Load your data into pandas DataFrames.

``` python
df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
df_b = pandas.DataFrame(YOUR_SECOND_DATASET)
```

Comparing all records can be computationally intensive. Therefore, we
make a set of candidate links with one of the built-in indexing techniques
like **blocking**. In this example, only pairs of records that agree on
the surname are returned.

``` python
indexer = recordlinkage.Index()
indexer.block('surname')
candidate_links = indexer.index(df_a, df_b)
```
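
To illustrate why blocking reduces the work, here is a minimal plain-Python sketch of the idea. It does not use the `recordlinkage` API. The helper `block_pairs` and the toy records are hypothetical and exist only for this illustration.

``` python
from collections import defaultdict
from itertools import product


def block_pairs(records_a, records_b, key):
    """Return (id_a, id_b) pairs whose records share the same value for `key`."""
    # Group the ids of the second dataset by their blocking key value.
    blocks_b = defaultdict(list)
    for id_b, rec in records_b.items():
        value = rec.get(key)
        if value is not None:
            blocks_b[value].append(id_b)

    # Pair every record of the first dataset with the matching block only.
    pairs = []
    for id_a, rec in records_a.items():
        value = rec.get(key)
        if value is None:
            continue  # records without a key value produce no candidates
        for id_b in blocks_b.get(value, []):
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical toy data, keyed by record id.
records_a = {
    "a1": {"surname": "smith", "given_name": "anna"},
    "a2": {"surname": "jones", "given_name": "ben"},
    "a3": {"surname": "smith", "given_name": "carl"},
}
records_b = {
    "b1": {"surname": "smith", "given_name": "ana"},
    "b2": {"surname": "brown", "given_name": "dina"},
    "b3": {"surname": "jones", "given_name": "benjamin"},
}

candidate_pairs = block_pairs(records_a, records_b, "surname")
full_pairs = list(product(records_a, records_b))

print(len(full_pairs), "pairs without blocking")
print(len(candidate_pairs), "pairs after blocking on surname")
```

The trade-off is that true matches with a misspelled surname never become candidates, which is one reason to consider other indexing methods as well.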

For each candidate link, compare the records with one of the comparison
or similarity algorithms in the Compare class.

``` python
c = recordlinkage.Compare()

c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)

# The comparison vectors
feature_vectors = c.compute(candidate_links, df_a, df_b)
```
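
As a rough illustration of what a comparison vector contains, the following plain-Python sketch compares a single pair of records. It is not the library's implementation. `difflib.SequenceMatcher` stands in for Jaro-Winkler, the decay formula in `gauss_sim` is an assumed form that may differ from the library's, and all function names and records are hypothetical.

``` python
from difflib import SequenceMatcher


def exact_sim(a, b):
    """Return 1.0 when both values are equal, else 0.0."""
    return 1.0 if a == b else 0.0


def string_sim(a, b, threshold):
    """Return 1.0 if the string similarity reaches the threshold, else 0.0.

    Uses difflib's ratio as a stand-in measure; this is NOT Jaro-Winkler.
    """
    ratio = SequenceMatcher(None, a, b).ratio()
    return 1.0 if ratio >= threshold else 0.0


def gauss_sim(a, b, offset, scale, missing_value=0.0):
    """Assumed Gaussian-style decay: 1.0 within `offset`, 0.5 at offset + scale."""
    if a is None or b is None:
        return missing_value  # fall back when either value is missing
    distance = abs(a - b)
    if distance <= offset:
        return 1.0
    return 2.0 ** (-(((distance - offset) / scale) ** 2))


def compare_pair(rec_a, rec_b):
    """Build one comparison vector (as a dict) for a single candidate pair."""
    return {
        "name": string_sim(rec_a["name"], rec_b["name"], threshold=0.85),
        "sex": exact_sim(rec_a["sex"], rec_b["sex"]),
        "income": gauss_sim(
            rec_a["income"], rec_b["income"],
            offset=3, scale=3, missing_value=0.5,
        ),
    }


# Hypothetical records for one candidate pair.
rec_a = {"name": "jonathan", "sex": "m", "income": 30}
rec_b = {"name": "jonathon", "sex": "m", "income": 36}

vector = compare_pair(rec_a, rec_b)
print(vector)
```

Each candidate pair yields one such vector, and the classifier works on these vectors rather than on the raw records.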

Classify the candidate links into matching or distinct pairs based on
their comparison result with one of the [classification
algorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html).
The following code classifies candidate pairs with a Logistic Regression
classifier. This (supervised machine learning) algorithm requires
training data.

``` python
logrg = recordlinkage.LogisticRegressionClassifier()
logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS)

logrg.predict(feature_vectors)
```

The following code shows the classification of candidate pairs with the
Expectation-Conditional Maximisation (ECM) algorithm. This variant of
the Expectation-Maximisation algorithm doesn't require training data
(unsupervised machine learning).

``` python
ecm = recordlinkage.ECMClassifier()
ecm.fit_predict(feature_vectors)
```
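
To give a feel for how an unsupervised classifier can work without training data, here is a minimal sketch of plain Expectation-Maximisation on binary comparison vectors, in the spirit of the Fellegi-Sunter model. It is a simplified relative of ECM, not the algorithm behind `ECMClassifier`. The function name, the starting values and the toy vectors are assumptions made for this illustration.

``` python
def em_match_probabilities(vectors, n_iter=50, p_init=0.1, m_init=0.9, u_init=0.1):
    """Estimate P(match) per binary comparison vector with plain EM.

    p    : assumed share of matches among the candidate pairs
    m[k] : P(feature k agrees | match)
    u[k] : P(feature k agrees | non-match)
    """
    n_features = len(vectors[0])
    p = p_init
    m = [m_init] * n_features
    u = [u_init] * n_features
    eps = 1e-6  # keeps probabilities away from exactly 0 and 1
    weights = []

    for _ in range(n_iter):
        # E-step: posterior match probability of every vector.
        weights = []
        for v in vectors:
            like_m, like_u = p, 1.0 - p
            for k, x in enumerate(v):
                like_m *= m[k] if x else 1.0 - m[k]
                like_u *= u[k] if x else 1.0 - u[k]
            weights.append(like_m / (like_m + like_u))

        # M-step: re-estimate p, m and u from the soft assignments.
        total_m = sum(weights)
        total_u = len(vectors) - total_m
        p = min(max(total_m / len(vectors), eps), 1.0 - eps)
        for k in range(n_features):
            agree_m = sum(w for w, v in zip(weights, vectors) if v[k])
            agree_u = sum(1.0 - w for w, v in zip(weights, vectors) if v[k])
            m[k] = min(max(agree_m / max(total_m, eps), eps), 1.0 - eps)
            u[k] = min(max(agree_u / max(total_u, eps), eps), 1.0 - eps)

    return weights


# Hypothetical binary comparison vectors (1 = agree, 0 = disagree).
vectors = (
    [(1, 1, 1)] * 4 + [(1, 0, 1)] * 1
    + [(0, 0, 0)] * 10 + [(0, 1, 0)] * 3 + [(1, 0, 0)] * 2
)

match_probabilities = em_match_probabilities(vectors)
predicted = [w > 0.5 for w in match_probabilities]
print(sum(predicted), "of", len(vectors), "pairs classified as matches")
```

The result depends on the starting values, so a sketch like this is only a starting point for understanding the approach.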

## Main Features

The main features of this Python record linkage toolkit are:

-   Clean and standardise data with easy to use tools
-   Make pairs of records with smart indexing methods such as
    **blocking** and **sorted neighbourhood indexing**
-   Compare records with a large number of comparison and similarity
    measures for different types of variables such as strings, numbers
    and dates.
-   Several classification algorithms, both supervised and
    unsupervised.
-   Common record linkage evaluation tools
-   Several built-in datasets.
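
As an example of what the evaluation step in the list above typically involves, the following plain-Python sketch computes precision, recall and F-score from sets of record pairs. It does not use the toolkit's own evaluation functions. The helper `evaluate_links` and the pair ids are hypothetical.

``` python
def evaluate_links(true_links, predicted_links):
    """Return precision, recall and F-score for predicted record pairs."""
    true_links = set(true_links)
    predicted_links = set(predicted_links)

    true_pos = len(true_links & predicted_links)   # correctly linked pairs
    false_pos = len(predicted_links - true_links)  # linked, but not a match
    false_neg = len(true_links - predicted_links)  # matches that were missed

    predicted_total = true_pos + false_pos
    actual_total = true_pos + false_neg
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / actual_total if actual_total else 0.0
    denominator = precision + recall
    fscore = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "fscore": fscore}


# Hypothetical known matches and classifier output.
true_links = {("a1", "b1"), ("a2", "b3"), ("a4", "b5")}
predicted_links = {("a1", "b1"), ("a2", "b3"), ("a3", "b1"), ("a5", "b9")}

scores = evaluate_links(true_links, predicted_links)
print(scores)
```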

## Documentation

The most recent documentation and API reference can be found at
[recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/).
The documentation provides some basic usage examples.

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `0.16` | 2026-04-21 | Low | Imported from PyPI (0.16) |
| `v0.16` | 2023-07-20 | Low | A new release of `recordlinkage` after a long time (too long, I'm sorry). This release bumps the minor version to 0.16. This version supports `pandas` 2 and `pandas` 1. It doesn't contain any structural changes or improvements to the API.   ## What's Changed * Fix typo by @havardox in https://github.com/J535D165/recordlinkage/pull/184 * Fix usage examples by @martinhohoff in https://github.com/J535D165/recordlinkage/pull/190 * Fix links by @andyjessen in https://github.com/J535D165/recordli |

## Citation

- HTML: https://www.freshcrate.ai/projects/recordlinkage
- Markdown: https://www.freshcrate.ai/projects/recordlinkage.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/recordlinkage/deps

_Generated by freshcrate.ai. Indexes pypi releases for AI-agent ecosystem packages._
