Secure Transaction Service (III): Fuzzy Matching on personal data with Machine Learning

One of the main issues in systems in charge of managing hundreds of thousands or even millions of personal records is the diversity of sources of information and the multiple levels of quality, validation and formats in which the information can be inserted into the databases. The main problem generated by a defective validation and a uniform application of quality measures when the clients/consumers of the system handle personal data is the appearance of hundreds or thousands of duplicates records that apparently represent different people, companies or whatever but actually are the same one.

The effects of having duplicate records in your system are not good. Transactions are attributed to wrong people, balances are wrong claim management can turn into a nightmare, compliance work is simply ineffective facing threats as money laundry and fraud. We have a better way to recognize the people, companies and any entity we work with to improve our procedures and business quality. Not mentioning the obligations and regulations regarding compliance and Anti Money Laundry measures.

To solve this problem we find different possibilities and concepts. Fuzzy Matching techniques should help us to apply some screening to data, detect the possible duplicates by finding possible matches.

A Bit of Theory

One way to apply Fuzzy Matching is usually based on string similarity measures like Jaro-Winkler or the Levenshtein distance measure. The obvious problem here is that the amount of calculations necessary grow quadratic. Every entry has to be compared with every other entry in the dataset, in our case this means calculating one of these measures 663.000^2 times. This can be done faster using TF-IDF, N-Grams, and sparse matrix multiplication.


TF-IDF is a method to generate features from text by multiplying the frequency of a word in a document (Term Frequency, or TF) by the importance (the Inverse Document Frequency or IDF) of the same term in an entire corpus. This last term weights less important words (e.g. the, it, and etc) down, and words that don’t occur frequently up. IDF is calculated as:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

We can consider a document containing 100 words in which the word cat appears 3 times. The term frequency (TF) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.

TF-IDF is very useful in text classification and text clustering. It is used to transform documents into numeric vectors, that can easily be compared.


While the terms in TF-IDF are usually words, this is not a necessity. In our case using words as terms wouldn’t help us much, as most company names only contain one or two words. This is why we will use n-grams: sequences of N contiguous items, in this case characters.

Cosine Similarity

To calculate the similarity between two vectors of TF-IDF values the Cosine Similarity is usually used. The cosine similarity can be seen as a normalized dot product. For a good explanation see: this site. We can theoretically calculate the cosine similarity of all items in our dataset with all other items in scikit-learn by using the cosine_similarity function, however the Data Scientists at ING found out this has some disadvantages:

  • The sklearn version does a lot of type checking and error handling.
  • The sklearn version calculates and stores all similarities in one go, while we are only interested in the most similar ones. Therefore it uses a lot more memory than necessary.

To optimize for these disadvantages ING created their own library which stores only the top N highest matches in each row, and only the similarities above an (optional) threshold.

There are many other types of algorithms. Please pay attention to Soundex and Levenshtein:

The Context

Well, it’s been a short intro to something basic: the existence of several options and strategies for the detection of similarities in literals.

This is useful to us. The Secure Transaction Service manages hundreds of thousands of records coming from heterogeneous sources. Older databases still operate receiving new data that generates new duplicate cases. However, the existence of duplicates must be mitigated to minimum to have the STS data as clean and available as possible.
We have three different timings of handling duplicates depending on the time and the way the data are being acquired:

  • Import: Migration of data from heterogeneous systems and databases.
  • Real-Time: Entry of new data, providing a prediction to the API consumers (web, mobile apps, for instance) about a probable insertion of a duplicate.
  • Background Cleaning: Supervision of migrated records and new data to detect the insertion of possible duplicates. This requires as well the re-arrange of existing links to existing linked information (e.g. financial transactions).

The Use Case

We’ll consider in our case a given source of data. The main factors of this data source are:

  • You could see in our last post the concepts derived of the applied standards. The PII (Personally Identifiable Information) is not usually present and it has to be composed somehow from the data structures in the source.
  • The source database was receiving information from several apps and the validation … well, some times it’s better not to talk about some topics. Come on! validate formats, regular expressions, try to normalize your data entries! Empty columns that are key for identification, personal remarks .. in the postcode column!!! and many more. An authentic nightmare for a data analyst.

The Target

We have to feed the STS database (remember, it’s distributed, it’s NoSQL) with a list of PIIs composed from the sources with no duplicates and the minimum loss of information.

The Tools

Use Apache Spark. That’s all.

The Process

We have defined several phases in a process:

  • Get the source data and transformation to processable entities.
  • Create the combinations of candidates for detection of duplicates.
  • Get a set of test data.
  • Training a Machine Learning Model and test the model against the test data.
  • Apply the trained & tested model to all the source records.
  • Generate the list of duplicates.
  • Merge the duplicates and obtain the final list.
  • Load the STS database.

And first of all, do not be scared of Machine Learning because:

Contrary to popular belief, Machine Learning is not a magical box of magic, nor is it the reason for $30bn in VC funding. At its core, machine learning is just a thing-labeler, taking your description of something and telling you what label it should get. Which sounds much less interesting than what you read on Hacker News.

Cassie Kozyrkov (2018)

1-Get the source data and transformation to processable entities

First, get the data from the source. A Materialized View (MV) can be good to simplify the query and you can bet with your co-workers about who builds the longest SQL query to feed the MV. It can take some time to design the query you’ll really need. A lot of join operations will needed. The values and IDs will be present or not almost in a random way. Who said this would be easy?

But in the end, once you have your MV, it’s just a matter of:

def getSourceData(): DataFrame ={
val mvQuery = s"(SELECT * FROM yourmaterializedview) AS TEMP", mvQuery, sqlServerSourceConnectionProperties)

Easy right? maybe you want to apply additional filters and so on. What you’ll want for sure is to normalize the content in the DataFrame you’ve obtained from that query. Because, for sure, unexpected null values, bad design in relationships and incorrect data types and formats will be found. For sure!

2-Create the combinations of candidates for detection of duplicates

Secondly, create all the possible combinations. In our case we have pre-arranged the source data structure to an entity we have called Person. We have to set here the conditions for comparison in order to not get a cartesian product. Well, that’s a possibility but it can take a long.

The usage of functions to find similarities is interesting. Spark does not include too many functions for this purpose (only two) and we’ll need to import a library

libraryDependencies += "info.debatty" % "java-string-similarity" % "1.1.0"

There are some other options like Spark-Stringmetric. It’s just a matter of testing different options and choosing the best one for your purposes.

In the function you can see how the combinations are made by combining soundex and Levenshtein for the family name given this is the column with more false similarities by far. Levenshtein is applied to the first name as well while we set two discreet values whose equality marks clear candidates as duplicates, date of birth (DOB) and the postcode. The combination of everything is used to calculate a vector of features for the involved columns.

The vector contains the evaluation of the combination for possible duplicates. For instance:

[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0] Well, the distance between two records is 0. It’s 100% a duplicate

[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0] Everything is extremely distant. This is clearly a different record.

[0.0,0.0,0.025654442,0.0,0.012552,0.0,0.25,0.0] There are some distance in some columns that we have to evaluate.

Obviously we are not showing the personal data in the other columns but you can see in the picture below some examples from the column with the vectors of features for each pair or persons.

3-Get a set of test data

We have now to generate a set of data to train and test the Machine Learning model. To do this we use the Dataset composed by two lists of Persons and the Vector with the scores for each combination of Persons.

The function generateTrainingData() uses the collections of Persons to automatically generate a final decision/label from the features in the Vector for each combination or Persons. We can do this manually but it is a tedious and boring task. The final goal is to get a set of data big enough to train the model and verify the predictions from the training phase in the test phase against a set of pre-validated data.

In this picture we can see how the vectors with the features have been evaluated and a label is assigned. In the capture any value is greater than the distance set in the evaluation. That’s what the label in the capture is always positive (1.0). It is not always the case though.

I did not mention but it is important you can use a data file or a set of files with some real data to use as training/test datasets. As we’ll see later these files or file will be used also for the build test phase.

4-Training a Machine Learning Model and test the model against the test data

This is the beauty and versatility of Spark. The Spark ML support (remember: is amazing, it contains many many algorithms and the implementation makes them extremely easy to use and tweak with hyperparameters. We have used here a model based on Logistic Regression but obviously other options are valid as well. you can find here an explanation of how to use hyperparameters (Hyperparameters by definition are input parameters which are necessarily required by an algorithm to learn from data.) in this model with Spark.

The code snippet with the ModelTrainer class is really clear. You can follow the step in the class to see the phases. We get some basic indicators from the tested model to be evaluated in the test phase (build time). The tuning of the parameters is very important to adapt to your analysis and data. We have used a multiclass logistic regression model with elastic net regularization.

5-Apply the trained & tested model to all the source records.

Get the model, get the set of combinations (all comparable pairs for persons with their features) and apply bthe model (the function transform in Spark)

6-Generate the list of duplicates

What we need her to work is a list of IDs with the groups of the records with their matches (the list of IDs of other records that have been predicted/detected as duplicates). We need this list to be able to merge the duplicates into a single record into the new storage.

The final result is something like this:

In the capture you can see that each ID has at least one duplicate..that is itself! But some others have several related IDs in the same row. Those are duplicates. The first column shows the number of duplicates for each record (again, 1 means it is not duplicated)

7-Merge the duplicates and obtain the final list

The we get the definitive list of persons and we merge the set of duplicates (already merged into a singe record) and the list of persons no related to any duplicate.

8-Load data to the STS database

One could say this is the easy part but it’s not. We have to build here the data structures linked to each Person in the STS database to form PIIs. The problem is, the data in the source are pretty defective. Many records have not the right information, the uniformity and general quality of the source data usually is low. We have to be careful here and design a good construction of the new data structures in our destination to not replicate the same defects in the new data repository.


The only logical concern about testing in a Machine Learning component is about data. We naturally want to test and the training of the model and we want to be sure that everything is tested the best as possible.

scenario("apply the model to data with all possible combinations in data") {
val (trainedModel, metadata) = etl.trainAndGetTrainedModel()
val prediction = etl.applyModelToAllCombinations(trainedModel.get, allCombinationsDS)
assert(metadata.get("areaUnderROC") > 0.85)
assert(metadata.get("maxFMeasure") > 0.9)
assert(metadata.get("accuracy") > 0.9)

Not a big deal… but beware! we need a subset of real data to test the model in the test phase to obtain something serious. This is a problem because the obvious restrictions about the personal data. We cannot simply push the test files to git (besides, they can be relly big files). So, there are several options here:

  • Get real-like data. Not real information about real people but with no anonymization or obfuscation of values.
  • A much better option, save your test datasets in a security vault in cloud, protected by a strong keys. The delivery pipeline will extract and decrypt the test datasets for us during the build time test phase and for every time the component train the model. This keeps the test dataset apart from any public exposition/access.

The FTS Delivery Platform supports 100% the AI components by providing transparent injection of test datasets in build and run time.


  • Most of people thinks that Machine Learning techniques are difficult. Well, they are not when we use amazing frameworks and tools like Apache Spark.
  • On the opposite, the most difficult part in this kind of tasks is to have a good understanding of the data and the data types, get the knowledge about its properties and defects and prepare the data for their processing.
  • Then, the processing of predictions and groups of data is not a big deal and many options are available (Clustering, Graphs). Again, Spark provides all you need to perform these tasks.
  • The load of data to the destination storage is not straight forward. The construction and population of a new data model from the old one can be shocking. Be ready to make decisions about the massive number of defect you’ll find in the source data.
  • A new paradigm of programming is coming up based on the structure of data rather than strongly defined requirements. Data will be the requirements. As development engineers we have to be ready for this paradigm. It’s really easy with tools like Spark.

Thanks a million if you have reached the end of this long long post and do not hesitate of sending your questions.

Author: Jesus de Diego

Software Architect and Team Lead

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.