Description
Description

<div align="center"> <img src="https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg"><br> </div>

# RecordLinkage: powerful and modular Python record linkage toolkit

[PyPI](https://pypi.python.org/pypi/recordlinkage/) [CI](https://github.com/J535D165/recordlinkage/actions) [Coverage](https://codecov.io/gh/J535D165/recordlinkage) [Docs](https://recordlinkage.readthedocs.io/en/latest/?badge=latest) [DOI](https://doi.org/10.5281/zenodo.3559042)

**RecordLinkage** is a powerful and modular record linkage toolkit to link records within or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package contains indexing methods, functions to compare records, and classifiers. It is developed for research and for the linking of small or medium-sized files.

This project is inspired by the [Freely Extensible Biomedical Record Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project. In contrast to FEBRL, the recordlinkage project uses [pandas](http://pandas.pydata.org/) and [numpy](http://www.numpy.org/) for data handling and computations. The use of *pandas*, a flexible and powerful data analysis and manipulation library for Python, makes the record linkage process much easier and faster. The extensive *pandas* library can be used to integrate your record linkage directly into existing data manipulation projects.

One of the aims of this project is to provide an easily extensible record linkage framework. It is easy to include your own indexing algorithms, comparison/similarity measures and classifiers.

## Basic linking example

Import the `recordlinkage` module with all important tools for record linkage, and import the data manipulation framework **pandas**.

``` python
import recordlinkage
import pandas
```

Load your data into pandas DataFrames.
``` python
df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
df_b = pandas.DataFrame(YOUR_SECOND_DATASET)
```

Comparing all record pairs can be computationally intensive. Therefore, we make a set of candidate links with one of the built-in indexing techniques, such as **blocking**. In this example, only pairs of records that agree on the surname are returned.

``` python
indexer = recordlinkage.Index()
indexer.block('surname')
candidate_links = indexer.index(df_a, df_b)
```

For each candidate link, compare the records with one of the comparison or similarity algorithms in the Compare class.

``` python
c = recordlinkage.Compare()

c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)

# The comparison vectors
feature_vectors = c.compute(candidate_links, df_a, df_b)
```

Classify the candidate links into matching or distinct pairs based on their comparison result with one of the [classification algorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html). The following code classifies candidate pairs with a Logistic Regression classifier. This (supervised machine learning) algorithm requires training data.

``` python
logrg = recordlinkage.LogisticRegressionClassifier()
logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS)

logrg.predict(feature_vectors)
```

The following code shows the classification of candidate pairs with the Expectation-Conditional Maximisation (ECM) algorithm. This variant of the Expectation-Maximisation algorithm does not require training data (unsupervised machine learning).
``` python
ecm = recordlinkage.ECMClassifier()
ecm.fit_predict(feature_vectors)
```

## Main Features

The main features of this Python record linkage toolkit are:

- Clean and standardise data with easy-to-use tools
- Make pairs of records with smart indexing methods such as **blocking** and **sorted neighbourhood indexing**
- Compare records with a large number of comparison and similarity measures for different types of variables such as strings, numbers and dates
- Several classification algorithms, both supervised and unsupervised
- Common record linkage evaluation tools
- Several built-in datasets

## Documentation

The most recent documentation and API reference can be found at [recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/). The documentation provides some basic usage examples like [ded
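To build intuition for what blocking does before reaching for the library, here is a minimal pure-pandas sketch. It is not the toolkit's implementation, only the underlying idea: blocking on a key is conceptually an inner join, so only pairs that share the key survive. The toy data below is hypothetical.

``` python
import pandas

# Hypothetical toy data: three records in df_a, two in df_b.
df_a = pandas.DataFrame({"surname": ["smith", "jones", "smith"]})
df_b = pandas.DataFrame({"surname": ["smith", "brown"]})

# Blocking is conceptually an inner join on the blocking key:
# each record in df_a is paired only with records in df_b
# that have the same surname.
pairs = (
    df_a.reset_index()
    .merge(df_b.reset_index(), on="surname", suffixes=("_a", "_b"))
)

# Candidate links as (row in df_a, row in df_b) tuples.
candidate_links = list(zip(pairs["index_a"], pairs["index_b"]))
print(candidate_links)  # → [(0, 0), (2, 0)]
```

Of the 3 × 2 = 6 possible pairs, only the two that agree on `surname` remain, which is why blocking makes large comparisons tractable. The library's `indexer.block('surname')` shown above produces the same kind of candidate set as a pandas `MultiIndex`.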
Release History
| Version | Changes | Urgency | Date |
|---|---|---|---|
| 0.16 | Imported from PyPI (0.16) | Low | 4/21/2026 |
| v0.16 | A new release of `recordlinkage` after a long time (too long, I'm sorry). This release bumps the minor version to 0.16. This version supports `pandas` 2 and `pandas` 1. It doesn't contain any structural changes or improvements to the API. ## What's Changed * Fix typo by @havardox in https://github.com/J535D165/recordlinkage/pull/184 * Fix usage examples by @martinhohoff in https://github.com/J535D165/recordlinkage/pull/190 * Fix links by @andyjessen in https://github.com/J535D165/recordli | Low | 7/20/2023 |
| v0.15 | - Remove deprecated recordlinkage classes ([#173](https://github.com/J535D165/recordlinkage/pull/173)) - Bump min Python version to 3.6, ideally 3.8+ ([#171](https://github.com/J535D165/recordlinkage/pull/171)) - Bump min pandas version to >=1 - Resolve deprecation warnings for numpy and pandas - Happy lint, sort imports, format code with yapf - Remove unnecessary np.sort in SNI algorithm ([#141](https://github.com/J535D165/recordlinkage/pull/141)) - Fix bug for cosine and qgram string com | Low | 4/19/2022 |
| v0.14 | - Drop Python 2.7 and Python 3.4 support. ([\#91](https://github.com/J535D165/recordlinkage/pull/91)) - Upgrade minimal pandas version to 0.23. - Simplify the use of all cpus in parallel mode. ([\#102](https://github.com/J535D165/recordlinkage/pull/102)) - Store large example datasets in user home folder or use environment variable. Before, example datasets were stored in the package. (see issue [\#42](https://github.com/J535D165/recordlinkage/issues/42)) ([\#92](https://github.com/J535D165/r | Low | 12/1/2019 |
| v0.13.2 | Fix distribution problem. | Low | 3/27/2019 |
| v0.13 | Release v0.13 | Low | 3/15/2019 |
| v0.11.2 | - Minor installation improvement. Exclude unwanted files | Low | 1/4/2018 |
| v0.11.1 | - Fix installation issue. Submodule 'preprocessing' was not added to the source distribution. | Low | 1/4/2018 |
| v0.11.0 | - The submodule 'standardise' is renamed to 'preprocessing'. The submodule 'standardise' will be deprecated in a future version. - Deprecation errors were not visible to many users. In this version, the errors are more visible. - Improved and new logs for indexing, comparing and classification. - Faster comparing of string variables. Thanks Joel Becker. - Changes make it possible to pickle Compare and Index objects. This makes it easier to run code in parallel. | Low | 1/4/2018 |
| v0.10.1 | - print statement in the geo compare algorithm removed. - String, numeric and geo compare functions now raise directly when an incorrect algorithm name is passed. - Fix unit test that failed on Python 2.7. | Low | 12/28/2017 |
| v0.10.0 | - A new compare API. The new Compare class no longer takes the datasets and pairs as arguments. The actual computation is now performed when calling `.compute(PAIRS, DF1, DF2)`. The documentation is updated as well, but still needs improvement. - Two new string similarity measures are added: Smith Waterman (smith_waterman) and Longest Common Substring (lcs). Thanks to Joel Becker and Jillian Anderson from the Networks Lab of the University of Waterloo. - Added and/or upda | Low | 12/28/2017 |
| v0.9.0 | - A new index API. The new index API is no longer a single class (`recordlinkage.Pairs(...)`) with all the functionality in it. The new API is based on Tensorflow and FEBRL. With the new structure, it is easier to parallelise the record linkage process. In future releases, this will be implemented natively. [See the reference page for more information and migrating.](http://recordlinkage.readthedocs.io/en/latest/ref-index.html) - Significant speed improvement of the Sorted Neigh | Low | 12/28/2017 |
| v0.8.1 | - Issues solved with rendering docs on ReadTheDocs. Still not clear what is going on with the `autodoc_mock_imports` in the sphinx conf.py file. Maybe a bug in sphinx. - Move six to dependencies. - The reference part of the docs is split into separate subsections. This makes the reference more readable. - The landing page of the docs is slightly changed. | Low | 1/27/2017 |
| v0.8.0 | - Add additional arguments to the function that downloads and loads the krebsregister data. The argument `missing_values` is used to fill missing values. Default: nothing is done. The argument `shuffle` is used to shuffle the records. Default is True. - Remove the last traces of the old package name. The new package name is 'Python Record Linkage Toolkit'. - Better error messages when only matches or only non-matches are passed to train the classifier. - Add AirSpeedVelocity | Low | 1/23/2017 |
| v0.7.2 | Release v0.7.2 | Low | 11/9/2016 |
| v0.7.1 | Release v0.7.1 | Low | 11/9/2016 |
| v0.6.0 | This version includes the following updates: - Reformatting the code such that it follows PEP8. - Add Travis-CI and codecov support. - Switch to distributing wheels. - Fix bugs with deprecated pandas functions. `__sub__` is no longer used for computing the difference of Index objects. It is now replaced by `INDEX.difference(OTHER_INDEX)`. - Exclude pairs with NaN's on the index-key in Q-gram indexing. - Add tests for krebsregister dataset. - Fix Python3 bug on krebsregister dataset. - Improve u | Low | 10/12/2016 |
| v0.5.0 | - Batch comparing added. Significant speed improvement. - rldatasets are now included in the package itself. - Added an experimental gender imputation tool. - Blocking and SNI skip missing values. - No longer a need for different index names. - FEBRL datasets included. - Unit tests for indexing and comparing improved. - Documentation updated. | Low | 9/9/2016 |
| v0.4.0 | - Fixes a serious bug with deduplication (thanks to https://github.com/dserban). - Fixes undesired behaviour for sorted neighbourhood indexing with missing values. - Add new datasets to the package like Febrl datasets - Move Krebsregister dataset to this package. - Improve and add some tests - Various documentation updates | Low | 8/20/2016 |
| v0.3.1 | Release v0.3.1 | Low | 6/15/2016 |
| v0.3 | This version contains a lot of changes to the API. Hopefully, no large API changes are needed from now on. - Total restructure of compare functions (the end of the API changes is near). - Compare method `numerical` is now named `numeric` and `fuzzy` is now named `string`. - Add haversine formula to compare geographical records. - Use numexpr for computing numeric comparisons. - Add step, linear and squared comparing. - Add eye index method. - Improve, update and add new tests. - Rem | Low | 6/11/2016 |
| v0.2 | - Full Python3 support. - Update the parameters of the Logistic Regression Classifier manually. In the literature, this is often referred to as _deterministic record linkage_. - Expectation/Conditional Maximisation algorithm completely rewritten. The performance of the algorithm is much better now. The algorithm is still experimental. - New string comparison metrics: Q-gram string comparing and Cosine string comparing. - New indexing algorithm: Q-gram indexing. - Several internal tests. - Updated docu | Low | 5/28/2016 |
| v0.1.2 | This version adds or changes the following: - Arguments in compare functions renamed. - Remove exact comparing of dataframes and add efficiency tricks for exact comparing. - Update documentation about comparing, classifying and evaluation. | Low | 4/23/2016 |
| v0.1.1 | This update includes: - Updated documentation about indexing, comparing and classification - Improved performance for some indexing methods - Random indexing now returns the exact number of record pairs - Arguments renamed in comparing functions | Low | 4/17/2016 |
| v0.1.0 | This is the first big release of the record linkage package. See the [documentation](http://recordlinkage.readthedocs.org/en/latest/) for information about the available functions. The framework needs to be extended with more functions, but there is a stable, easily extendable framework to build on. More information on how to do that is coming. | Low | 4/11/2016 |
