Business Name Matching in Real Contact API: The Why, How and Where

Businesses often collect a person name, a business name, or both at lead capture or account sign-up. For lead verification, customers want Real Contact API to use every available name when matching so that there are no false positives.

The challenge is that person name matching is quite different from business name matching: the names are structured differently, and common business terms (everything from Corporation to Pizzeria) can easily produce false positives.

Trestle has recently made a significant investment in Real Contact API, which now accepts business names as input and matches them against the phone and address data in Trestle’s authoritative database, enabling accurate lead verification. Our effort focused on business name matching, which is the more complex task. This article walks through our approach and the nuances we encountered along the way.

How It’s Different from Person Name Matching

Business name matching presents unique challenges that differ from person name matching. For person names, each component (first name and last name) holds equal significance. In contrast, business names include elements that vary in importance for establishing identity.

Key Differences in Elements:

In the name Debi’s Deli and Delivery, “Debi’s” is the distinctive identifier of the business, whereas “Deli and Delivery” are generic terms common across the industry. These terms, although descriptive, are not critical for determining uniqueness. In other words, “Debi’s Deli and Delivery” differs from “Mason’s Deli and Delivery.”

Handling of Common Words:

Business names often contain words like “for,” “the,” and “and.” These are generally disregarded in matching because they do not make a business distinctive. For example, “Debi’s Deli and Delivery” and “Costa Restaurant and Delivery” both include “and” and “Delivery,” but these shared words are not considered when determining a match. Similarly, “Restaurant” and “Deli” are business category keywords, so they are not considered in the matching either.

Focus on Unique Identifiers:

Effective business name matching requires emphasizing elements that uniquely identify a business rather than generic descriptors. This approach ensures matches are based on meaningful similarities and provides more relevant results.

This selective emphasis on significant elements makes business name matching a complex and nuanced task, highlighting the importance of focusing on unique identifiers to ensure accuracy.

Truth Set Making

Challenges in Business Name Matching:

MarketVision Research | Market Vision
Laurén’s Deli | Lauren Food and Delivery
TracFone Wireless | SafeLink Wireless

We encounter a variety of challenges when matching business names, each requiring careful consideration to enhance accuracy:

  1. Concatenation Issues: For example, “MarketVision” versus “Market Vision” might lead to false negatives due to differences in word spacing.
  2. Special Characters and Possessives: Variations like “Laurén’s Deli” versus “Lauren Food and Delivery” could result in false negatives because of special characters and possessive forms.
  3. Common Industry Terms: Names like “TracFone Wireless” and “SafeLink Wireless” could trigger false positives due to the common industry term “Wireless.”
  4. Insignificant Words: Words such as “and” are typically removed from consideration as they don’t contribute to a business’s unique identity.
  5. First Words as Business Terms: The significance of the first word being a common business term needs careful evaluation to avoid incorrect matching outcomes.

Building an Effective Truth Set:

To address these challenges, including a diverse range of real-life scenarios (both for the input name and the actual business name available with Trestle) in the truth set is crucial. This diversity ensures that the algorithm is well-equipped to handle the complexities of real-world business name matching, improving both reliability and accuracy.

To refine the accuracy of our business name-matching algorithm, we conducted tests against three distinct truth sets, each containing thousands of business names. Constructing each set was a meticulous and labor-intensive process, crucial to capturing a wide array of real-life scenarios our algorithm might encounter.

Importance of Diverse Truth Sets:

  1. Coverage of Variabilities: Each truth set was carefully curated to include a variety of naming conventions and issues, such as concatenation, special characters, and common industry terms. This ensured that our testing covered a broad spectrum of discrepancies found in real-world data.
  2. Scenario-Specific Testing: By incorporating specific scenarios like “MarketVision” versus “Market Vision,” and “Laurén’s Deli” versus “Lauren Food and Delivery,” we were able to fine-tune the algorithm’s ability to differentiate and correctly match based on nuanced differences. This helped minimize false negatives where the algorithm fails to recognize matches due to minor variations.
  3. Mitigation of Common Pitfalls: It was crucial to include common terms and how they are treated (e.g., ignoring “and”). This testing helped avoid false positives, where the algorithm erroneously matches entities based on non-distinctive terms.

Challenges in Set Construction:

The construction of each truth set was a rigorous process, involving detailed analysis and selection of test cases that accurately represent the diverse business naming scenarios. This process required not only technical insight but also a deep understanding of business nomenclature across various industries, ensuring comprehensive coverage for all possible scenarios.

Outcome and Improvements:

The comprehensive testing against these three truth sets has been instrumental in enhancing the reliability and effectiveness of our business name-matching algorithm. By continuously testing and refining the algorithm with real-world data, we have significantly improved its precision and reduced the likelihood of erroneous matches.

The Algorithm 

Let us walk through the algorithm using the examples above and see what results it produces.

Parsing and Normalizing the Name

This is a three-step process.

  1. Replacing each diacritic character with its base letter (e.g., ‘ø’ is replaced with ‘o’)
  2. Replacing each special character with a single whitespace (e.g., quotes are replaced with whitespace)
  3. Lowercasing all letters
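
The three steps above could be sketched roughly as follows. This is a minimal illustration, not Trestle's production code: the function name `normalize_name` and the `EXTRA_FOLDS` table are assumptions made for the example. One detail worth noting is that letters such as ‘ø’ have no Unicode decomposition, so a small explicit mapping is needed in addition to standard Unicode normalization.

```python
import re
import unicodedata

# Letters such as 'ø' do not decompose into "base letter + accent", so they
# need an explicit mapping. This table is a small illustrative assumption,
# not a complete list.
EXTRA_FOLDS = str.maketrans(
    {"ø": "o", "Ø": "O", "ł": "l", "Ł": "L", "đ": "d", "Đ": "D", "ß": "ss"}
)


def normalize_name(name: str) -> str:
    """Hypothetical helper: fold diacritics, drop special characters, lowercase."""
    # Step 1: replace diacritic characters with their base letters.
    folded = name.translate(EXTRA_FOLDS)
    decomposed = unicodedata.normalize("NFKD", folded)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Step 2: replace special characters (quotes, ampersands, periods, ...)
    # with whitespace.
    spaced = re.sub(r"[^\w\s]|_", " ", stripped)

    # Step 3: lowercase everything; repeated whitespace is collapsed as well.
    return " ".join(spaced.lower().split())
```

With this sketch, a possessive such as “Laurén’s Deli” ends up as the three words “lauren”, “s” and “deli”, which is why a stray “s” shows up in the later examples.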

Removing Common Business Terms

In this step, we take care of the common business terms. Words that specify business categories, such as Investors, Telecom, Services, and Market, and words like “a” and “the” are removed. If left in place, these words can cause problems at the later matching stage because they contribute nothing to the uniqueness of the business name. However, they can still serve a purpose in matching, which is discussed later in the “Future Improvements Planned” section.

Note: in some cases, like “Research America,” every word is a common term, yet the first word is what identifies the business. Because the first word is rarely a generic term, we exempt it from the common word removal process. We added quite a few exceptions and rules like this as we iterated over our truth sets to improve the algorithm.

After parsing and removing common words, the examples would look like: 

MarketVision | Market Vision
Lauren s | Lauren
TracFone | SafeLink
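
A rough sketch of the common-term removal, including the first-word exemption, might look like this. The `COMMON_TERMS` set below is a tiny illustrative subset chosen for these examples; the actual list is not published in this post, and the function name `remove_common_terms` is hypothetical. The input is assumed to be already normalized (lowercased, special characters removed).

```python
# Illustrative subset only; the real list of common terms is much larger
# and is not published in this post.
COMMON_TERMS = {
    "a", "an", "the", "and", "for", "of",
    "deli", "delivery", "food", "restaurant", "research", "wireless",
    "market", "services", "telecom", "investors", "corporation", "america",
}


def remove_common_terms(normalized_name: str) -> str:
    """Drop common terms from an already-normalized name, keeping the first word."""
    words = normalized_name.split()
    if not words:
        return ""

    # The first word is exempt: it is rarely generic, and in names like
    # "research america" it is the only thing identifying the business.
    first, rest = words[0], words[1:]
    kept = [first] + [word for word in rest if word not in COMMON_TERMS]
    return " ".join(kept)
```

Note how “market vision” keeps “market” even though it is on the list, purely because it is the first word.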

Name Matching Stage

This is also performed in a few stages:

First, we see if the names to be matched are exact matches. If they are, then a perfect score is provided.

If it is not an exact match, we check for a matching word in the given names. If there is, we provide a “partial name match” score.

In the last stage, we try to tackle the real-world problems that come up when a user supplies a business name:

  1. The user might include extra characters such as quotes (e.g., the “Lauren s” case above).
  2. Words might be concatenated, as in the case of “MarketVision.”
  3. The name might be given in plural or singular form (e.g., “Domino” and “Dominos”).

To tackle this, we create affixations of the words we have. We take both names and build a list of candidate words for each: every possible concatenation of its words, plus every word suffixed with “s” and “es.” We then check whether the two lists have any word in common. If they do, we give the pair an “affixation word match” score.

Now, let’s see how our algorithm matched these words:

MarketVision | Market Vision | Affixation word match
Lauren s | Lauren | Exact match
TracFone | SafeLink | Not Matched
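
The staged matching can be sketched as below. Several things here are assumptions made for illustration: the function names, the use of text labels instead of numeric scores (the post does not give the score values), the choice to concatenate only adjacent words, and the decision to drop single-character leftovers such as a stray possessive “s”. The post does not say how that stray “s” is handled, but dropping it gives results consistent with the table above.

```python
def tokens(cleaned_name: str) -> list:
    """Split a cleaned name into words, dropping single-character leftovers.

    Dropping one-letter tokens (e.g. the "s" left by a possessive) is an
    assumption made for this sketch.
    """
    return [word for word in cleaned_name.lower().split() if len(word) > 1]


def affixations(words: list) -> set:
    """Build candidate words: concatenations of adjacent words, plus "s"/"es" forms."""
    variants = set()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            variants.add("".join(words[start:end]))
    # Add simple plural forms of every candidate.
    for variant in list(variants):
        variants.add(variant + "s")
        variants.add(variant + "es")
    return variants


def match_business_names(name_a: str, name_b: str) -> str:
    """Return a label describing the strongest stage at which the names match."""
    words_a, words_b = tokens(name_a), tokens(name_b)
    if not words_a or not words_b:
        return "not matched"
    if words_a == words_b:                           # stage 1: exact match
        return "exact match"
    if set(words_a) & set(words_b):                  # stage 2: shared word
        return "partial name match"
    if affixations(words_a) & affixations(words_b):  # stage 3: affixations
        return "affixation word match"
    return "not matched"
```

Checking the stages in this order matters: the affixation lists contain the original words too, so running that stage first would hide the distinction between a partial match and an affixation match.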

Accuracy Metrics

Our business name-matching algorithm has demonstrated substantial progress through a structured approach of iterative refinements and testing across multiple truth sets. This meticulous process has culminated in an accuracy rate of 97.55%.

Progression Through Truth Sets

  1. Initial Testing:
    • The algorithm’s first version showed promising results, achieving an initial accuracy of 91%. This phase highlighted the algorithm’s fundamental strengths but also identified areas requiring enhancements, specifically in handling common business terms and recognizing name variants.
  2. Second Iteration:
    • After incorporating feedback from the initial tests, several fixes were implemented to refine the algorithm’s capability to differentiate and accurately match business names, particularly improving its handling of concatenated words and common terms. The accuracy in this second round of testing increased to 96%, a clear indication of improvement based on targeted enhancements.
  3. Final Refinement:
    • Continuing from the success of the second iteration, we made further modifications to address more nuanced challenges. The third and final truth set was enriched with even more diverse scenarios, testing the algorithm under a broader range of conditions and leading to a peak accuracy of 97.55%.

Analysis of Remaining Challenges

Despite the high accuracy, 2.45% of the cases resulted in incorrect predictions, which can be broken down into:

  1. False Positives (65% of Errors):
    • The predominant issue leading to false positives was the occurrence of new common words in business names that were not previously cataloged. These instances were primarily due to our list of common terms not covering some newer or less frequent words, which led to incorrect matches when unrecognized as common.
    • Solution: To address this, we plan to continually update and expand our common words list, which should help reduce false positives further by ensuring that non-distinctive terms are correctly identified and excluded from contributing to a match.
  2. False Negatives (35% of Errors):
    • False negatives were mainly caused by the abbreviation of certain words in business names, which the algorithm failed to recognize as equivalent to their complete forms. This issue affected the algorithm’s ability to detect matches when only partial or abbreviated names were used.
    • Solution: Enhancing the algorithm’s ability to recognize and equate abbreviations with their complete forms is critical for further development.
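
To put the breakdown in absolute terms: 65% of the 2.45% error rate is roughly 1.59% of all cases (false positives), and 35% is roughly 0.86% (false negatives). The sketch below shows one way such figures could be computed from a labeled truth set. The `evaluate` function, the row format, and the toy data are all hypothetical; the deliberately naive matcher is included only to show how shared common terms produce false positives.

```python
def evaluate(truth_set, matcher):
    """Score a matcher against rows of (input_name, reference_name, expected_match)."""
    false_pos = false_neg = correct = 0
    for input_name, reference_name, expected in truth_set:
        predicted = matcher(input_name, reference_name)
        if predicted == expected:
            correct += 1
        elif predicted:       # predicted a match where there should be none
            false_pos += 1
        else:                 # missed a match that should have been found
            false_neg += 1

    total = correct + false_pos + false_neg
    errors = false_pos + false_neg
    return {
        "accuracy": correct / total if total else 0.0,
        "false_positive_share": false_pos / errors if errors else 0.0,
        "false_negative_share": false_neg / errors if errors else 0.0,
    }


def naive_shared_word(name_a, name_b):
    """Deliberately naive matcher: any shared word counts as a match."""
    return bool(set(name_a.lower().split()) & set(name_b.lower().split()))


# Toy rows for illustration only; these are not taken from Trestle's truth sets.
TOY_TRUTH_SET = [
    ("Debi's Deli", "Debi's Deli and Delivery", True),
    ("TracFone Wireless", "SafeLink Wireless", False),
    ("Mason's Deli", "Debi's Deli", False),
    ("MarketVision", "Market Vision", True),
    ("Acme Plumbing", "Zenith Roofing", False),
]

report = evaluate(TOY_TRUTH_SET, naive_shared_word)
```

On the toy rows, the naive matcher is fooled by “Wireless” and “Deli” (false positives) and misses the concatenated “MarketVision” (false negative), which mirrors the two error classes described above.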

Scale and Latency

Alongside these improvements and the iterative rule building, we also focused on scale and latency. The goal was a p99 latency of no more than 5 ms, meaning that 99% of business name matching queries should complete in under 5 ms. We verified that we are well within that limit. In addition, we confirmed that the service can scale out as load increases.
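
For readers who want to check a p99 figure on their own matcher, a simple measurement could look like the sketch below. It is only an illustration of what “p99” means: the helper name `p99_latency_ms` is hypothetical, and timing a function in-process like this does not capture network or service overhead.

```python
import statistics
import time


def p99_latency_ms(func, inputs, repeats=5):
    """Time func over every input tuple and return the 99th-percentile latency in ms."""
    samples = []
    for _ in range(repeats):
        for args in inputs:
            start = time.perf_counter()
            func(*args)
            samples.append((time.perf_counter() - start) * 1000.0)

    # quantiles(..., n=100) returns 99 cut points; the last one is the 99th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[98]
```

A call such as `p99_latency_ms(my_matcher, name_pairs)` (with `my_matcher` and `name_pairs` supplied by the reader) returns a single number that can be compared against the 5 ms target.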

Conclusion

The continuous improvements in our business name-matching algorithm, guided by detailed analysis and feedback from extensive testing, have significantly enhanced its precision and reliability. By addressing specific issues, such as the treatment of common terms and abbreviations, we expect to achieve even higher accuracy in future iterations, making the algorithm increasingly robust in handling the complexities of business name matching.

How Customers Can Take Advantage 

As part of our commitment to enhancing the Real Contact API and providing a comprehensive lead verification solution, we’re excited to introduce the Business Name Matching feature. This API can now cater to our customers’ needs, whether they have a business name or both a business and individual person name for lead or account registration.

How It Works:

  1. Input Flexibility: You can enter a business name alone or both a business name and a person’s name, along with contact details such as a phone number, email address, or physical address. The additional parameter, business.name, is covered in the Real Contact API documentation.
  2. Matching Process: Our system will attempt to match the provided business name to the corresponding contact details associated with the phone, email, or address in Trestle’s database. If a match is found with either the business name or the person’s name, we will confirm it as a ‘match’ (phone to name, email to name, or address to name match response attribute).
  3. Comprehensive Verification: Even if only one element (the business name or the person’s name) matches the contact details, it will still be considered a successful match. 
  4. Non-Matching Scenario: If neither the business name nor the person’s name matches the provided contact details, the result will be a “no match.”
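
As a rough illustration of passing the new input, the sketch below builds a query string that includes business.name. Only the business.name parameter comes from this post. The base URL is a placeholder, and the other parameter names (`phone`, `name`) and the helper name are assumptions, so please check the Real Contact API documentation for the real endpoint, authentication, and field names.

```python
from urllib.parse import urlencode

# Placeholder only; this is NOT the real endpoint. See the API documentation.
BASE_URL = "https://api.example.com/real_contact"


def build_real_contact_query(phone, business_name=None, person_name=None):
    """Build a request URL carrying a business name, a person name, or both."""
    params = {"phone": phone}              # "phone" is an assumed parameter name
    if person_name:
        params["name"] = person_name       # "name" is an assumed parameter name
    if business_name:
        params["business.name"] = business_name  # parameter named in this post
    return f"{BASE_URL}?{urlencode(params)}"


url = build_real_contact_query(
    phone="2065550100",
    business_name="Debi's Deli and Delivery",
    person_name="Debi Smith",
)
```

Either name can be omitted, which mirrors the input flexibility described in step 1.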

Future Improvements Planned

A few improvements are being planned:

  1. Making use of common words: Instead of completely eliminating Common Business Terms, we can use them by categorizing them into specific broad categories, which can have some impact on the overall score. The amount of impact on the score needs to be figured out after testing.
  2. Abbreviations: A possible improvement is to identify the abbreviation, especially in cases where only a few words are abbreviated. In these cases, we might be able to use the business term.
  3. Using a mix of other algorithms that check the actual difference between words: Sometimes, a misspelling can lead to a mismatch between names that would have matched if not for one or two misspelled words. This can be handled through string distance algorithms like Jaro-Winkler.
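
To give a feel for the third idea, here is a compact sketch of the standard Jaro-Winkler similarity in plain Python. It is not part of the current algorithm described in this post, and any threshold used to decide that two words are “close enough” would be an assumption that needs tuning against a truth set.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(i + window + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for strings sharing a prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

A misspelling with two swapped letters scores close to 1.0, while unrelated names such as “tracfone” and “safelink” score much lower, which is the property that would make this useful as an extra matching stage.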

Do you receive business names as part of the lead form sign-ups or account registrations? We would love for you to test the Real Contact API with these additional inputs and let us know if you have any feedback via support@trestleiq.com.

This blog post was written by Pratyaksh Sahu, Software Engineer at Trestle.