Loft47 connects with many of the other systems you use in your business every day. Because we hold accounting records, it is important that we do our best not to duplicate contacts between systems. To do that, we’ve developed a sophisticated contact matching algorithm.
When attempting to match contacts, we first look at the contact's email address. An email address shared by a contact in each system produces an exact match between the two systems. If a contact record has an email address, an exact match on that email address is required for the contacts to be considered matched. If a contact has no email address, or if more than one contact has the same email address, we apply our matching algorithm.
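The email-first rule above can be sketched as a small decision function. This is only an illustration of the rule as described, not Loft47's actual implementation: the function names, the dictionary shape of a contact, and the choice to trim and lowercase email addresses before comparing them are all assumptions.

```python
def normalize_email(email):
    # Assumption: emails are compared after trimming and lowercasing.
    return (email or "").strip().lower()

def email_decision(contact, candidates):
    """Return a (decision, pool) pair for one contact.

    decision is "exact" (one email match), "none" (email present but no
    candidate shares it), or "fuzzy" (fall back to the matching algorithm).
    """
    email = normalize_email(contact.get("email"))
    if not email:
        # No email on the record: every candidate goes to fuzzy matching.
        return "fuzzy", list(candidates)
    same_email = [c for c in candidates
                  if normalize_email(c.get("email")) == email]
    if len(same_email) == 1:
        return "exact", same_email
    if len(same_email) > 1:
        # Several contacts share the email: disambiguate with fuzzy matching.
        return "fuzzy", same_email
    return "none", []

# Hypothetical contacts, for illustration only.
others = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
    {"name": "A. Lee", "email": "ann@example.com"},
]
print(email_decision({"email": "Bob@example.com "}, others)[0])  # exact
print(email_decision({"email": "ann@example.com"}, others)[0])   # fuzzy
print(email_decision({"email": ""}, others)[0])                  # fuzzy
```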
Loft47's contact matching algorithm calculates the similarity between two contacts by splitting each contact's text into sequences of three characters (trigrams) and counting how many trigrams the two contacts have in common. Two records are considered a match when their similarity score is 60% or more.
Matching Algorithm Process
Preprocessing
The input strings are preprocessed to remove any leading or trailing spaces, and are converted to lowercase for case-insensitive comparison.
The preprocessed strings are then divided into trigrams. Trigrams are generated by taking each consecutive set of three characters from the string, including spaces if present. For example, the trigrams of the string "example" would be "exa", "xam", "amp", "mpl", "ple".
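As a minimal sketch of the preprocessing and trigram steps, assuming plain Python strings as input (the function names are illustrative, not part of Loft47):

```python
def preprocess(text):
    # Trim leading/trailing spaces and lowercase for case-insensitive comparison.
    return text.strip().lower()

def trigrams(text):
    # Every consecutive run of three characters, spaces included.
    cleaned = preprocess(text)
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]

print(trigrams("  Example "))  # ['exa', 'xam', 'amp', 'mpl', 'ple']
```

The article does not say how strings shorter than three characters are handled; in this sketch they simply produce no trigrams.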
Trigram Matching
The trigrams of the two input strings are compared to find the common trigrams. The number of common trigrams is counted, and this count is used as a measure of similarity.
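Finding the common trigrams amounts to a set intersection. The sketch below assumes each string's trigrams are treated as a set, so a trigram that repeats within one string is counted once; the helper name is illustrative.

```python
def trigram_set(text):
    # Same preprocessing as described above, collected into a set.
    cleaned = text.strip().lower()
    return {cleaned[i:i + 3] for i in range(len(cleaned) - 2)}

common = trigram_set("example") & trigram_set("samples")
print(sorted(common))  # ['amp', 'mpl', 'ple']
print(len(common))     # 3
```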
Similarity Calculation
The similarity between the two strings is calculated using the Jaccard similarity coefficient, which is defined as the ratio of the number of common trigrams to the total number of unique trigrams across both strings. A trigram that appears in both strings is counted only once in that total. The formula for calculating Jaccard similarity is:
Jaccard similarity = (Number of common trigrams) / (Total number of unique trigrams across both strings)
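In code, the coefficient is the size of the intersection divided by the size of the union. This is a minimal sketch over Python sets; returning 0.0 when both sets are empty is an assumption, since the article does not cover that case.

```python
def jaccard(a, b):
    # a and b are sets of trigrams.
    union = a | b
    if not union:
        # Assumption: two empty inputs are treated as having no similarity.
        return 0.0
    return len(a & b) / len(union)

# One shared trigram out of three unique trigrams overall.
print(jaccard({"abc", "bcd"}, {"bcd", "cde"}))  # one third
```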
Scoring
The Jaccard similarity coefficient is then multiplied by 100 to get a similarity score as a percentage, with 100% indicating a perfect match and 0% indicating no similarity.
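Converting the coefficient to a percentage and applying the 60% match threshold mentioned earlier might look like the following sketch. The constant and function names are illustrative.

```python
MATCH_THRESHOLD = 60.0  # percent, from the 60% rule described in this article

def similarity_score(coefficient):
    # Scale a Jaccard coefficient (0.0 to 1.0) to a percentage.
    return coefficient * 100

def is_match(coefficient, threshold=MATCH_THRESHOLD):
    return similarity_score(coefficient) >= threshold

print(is_match(0.75))  # True
print(is_match(0.25))  # False
```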
Example of Matching Algorithm
Here's an example of how two words, "example" and "samples", are scored using the trigram algorithm:
Preprocessing
The input words are converted to lowercase and trimmed of leading/trailing spaces: "example" and "samples".
Trigrams are generated for both words:
For "example": "exa", "xam", "amp", "mpl", "ple"
For "samples": "sam", "amp", "mpl", "ple", "les"
Trigram Matching
Common trigrams between the two words are: "amp", "mpl", "ple".
Similarity Calculation
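The number of common trigrams is 3, and the total number of unique trigrams across both words is 7 ("exa", "xam", "amp", "mpl", "ple", "sam", "les"). So, the Jaccard similarity coefficient is 3/7 ≈ 0.43, or about 43%.
Scoring
The similarity score between "example" and "samples" is about 43% using the trigram similarity algorithm. Because this is below the 60% threshold, the two words would not be considered a match.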
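The four steps can be combined into one function and tried on other strings. The sketch below is an illustration under the assumptions already stated in this article (trim, lowercase, trigrams with spaces included, Jaccard similarity, 60% threshold). The contact names are hypothetical, and the article does not specify which contact fields Loft47 actually compares.

```python
def trigram_set(text):
    cleaned = text.strip().lower()
    return {cleaned[i:i + 3] for i in range(len(cleaned) - 2)}

def similarity(a, b):
    # Jaccard similarity of the two trigram sets, as a percentage.
    ta, tb = trigram_set(a), trigram_set(b)
    union = ta | tb
    if not union:
        return 0.0  # assumption: nothing to compare means no similarity
    return 100 * len(ta & tb) / len(union)

def is_match(a, b, threshold=60.0):
    return similarity(a, b) >= threshold

pairs = [
    ("John Smith", "Jon Smith"),
    ("Jonathan Smith", "Jonathan Smith Jr"),
]
for a, b in pairs:
    print(a, "|", b, "|", round(similarity(a, b), 1), "|", is_match(a, b))
# John Smith | Jon Smith | 50.0 | False
# Jonathan Smith | Jonathan Smith Jr | 80.0 | True
```

Note that a single dropped letter in a short name removes several trigrams at once, which is why "John Smith" and "Jon Smith" fall below the threshold in this sketch.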