Trestle’s Data Pipeline and Ingestion Overview
As we covered in the previous posts, one of the differentiators that has helped Trestle build identity products with the highest coverage and accuracy has been our data ingestion pipeline and infrastructure. There’s a few reasons for it:
Our Identity Graph consists of 300MM+ records with 100+ data attributes per identity that are sourced across multiple data vendors. This quickly makes it a Big Data problem and brings all the challenges along with it.
We do not deal in any regulated data like SSNs or government ID. This means we do not have a unique field for all the data we get, hence the merge between different providers becomes even more complex. Here is an example of it:
Address: 123 Main. St.
Address-1: 456 Wayne St.
Address-1 (historical): 123 Main. St.
Should these records be merged? Do these phone numbers match? Which address is the most current one? And should we believe the date of birth provided by Provider-2 even though we do not have redundant backing testimony from any other source? This is an easy example, but this gives a sense of the issues we think about during the merge/match process. Being overly restrictive in match logic and we have too many duplicates. Being too laxed and we will have incorrect data eventually. To add further complexity and to ensure the accuracy bar we have at Trestle, we only ingest any data if we have enough proof points for every data point added. This means, at a minimum, we need at least one additional proof point from a separate data source to ingest any data.
On the infrastructure side, think about the O(n*m) problem here. Essentially, it is the time taken or the operations needed for the merge of a data provider. Meaning for every address and name pair or phone and name pair we get from Data Provider-2, we have to search the entire Data Provider-1 dataset to see if they match. If so, then we merge appropriately to get the new merged/appended record. We have to repeat this with multiple data providers we source different attributes from. And all of this has to be done quickly because the goal is to bring the latest data as soon as possible to our production instance.
At Trestle, we strive to refresh our data once every month, which means we must complete each data build in under a week.
You put all of this together, and it gives a sense of the complexity involved and why being able to do it at scale builds a differentiator like no other. The good thing is the team at Trestle are veterans in this space and have dealt with this complexity before. They know exactly the right architecture and technology stack to be implemented. The process looks something as follows:
Most of the process above is self-explanatory, but there are a few details that we want to highlight.
All the different kinds of validations we run at every step to ensure there is no data loss and the ingestion pipeline is working as expected. One of the big things we do here is to build dummy datasets mimicking data provider formats with their data nuances. We do that so we can run the necessary tests before starting the ingestion pipelines with the entire datasets. It also becomes very difficult to catch any issues with the entire dataset. A lot of time and effort is spent building and testing with the dummy data sets, so we do not run into issues in Production or after a week running through the entire pipeline.
Data will skew to where the number of linkages is in thousands or even millions. For example, a business like Starbucks, with thousands of locations, creates a single business with hundreds of thousands of linkages. The match and merging logic with such a high number of linkages can really slow the systems down. Thus, being able to handle them appropriately becomes crucial for the overall stability of the system, as well as to ensure the ingestions do not get bogged down dealing with them. Different scenarios require different ways of handling the skew issues. It ranges from identifying incorrect data like a single person having 100+ addresses, so we just don’t insert that person or limit the person-address links to 1000.
As with any Big Data processing problem, expect to deal with concurrency issues that can result in data loss, dangling linkage, or even worse, entirely incorrect data linkages being created. It goes back to having appropriate aggregate statistics checks, necessary data validations and just experience to know potential threading issues and how to preempt them from the onset.
Cost optimizations become critical for any business dealing with such large-scale data. To give a sense of the scale, here are some of our machine configurations involved:
Redis: 3.5 TB in-memory Redis database with 60 shards
Open Search: 8 r6gd.8xlarge.search instances
EC2: 10 c6a.8X large instances
I am not even adding the S3 storage, Kinesis and other support infrastructure needed for every ingestion we run on a monthly basis. A lot of this can be solved by expanding the ingestion timeframe or by throwing money at the problem. Being able to optimize the cost versus the time taken is where experience and the grind comes into the picture. We can get the best of both worlds.
Appropriate logging and checkpointing mechanisms are crucial, so at each step, there is a robust debugging and retrying mechanism to ensure that we do not have to restart the entire step/process and can just pick from where it got interrupted.
The biggest lesson that we learned at Ekata from building this and what we are living at Trestle as well is that a lot of this doesn’t happen overnight. It is hard work but we are trying to improve every day. We are constantly finding things on how to better match/merge, how to optimize the costs, how to avoid expensive erroneous linkages, and how to eke out milliseconds to improve the ingestion times. The good news is we take care of this stuff so that you do not have to.
There is a lot more that I am missing to cover in this post, but hopefully, this starts to give a sense of what we have created that will enable innovation from Trestle in the coming months.