In our last engineering blog post, Building Trestle’s Architecture, we alluded to a few other key activities that heavily impacted our approach to infrastructure. Quite possibly the most important activity among those is data sourcing. Our customers rely on Trestle’s identity data APIs to provide high quality data quickly and without interruption. Over many years, we’ve developed a framework for how to find, vet, ingest and provide data in a way that meets our customers’ requirements.
Before we dive into the framework, let’s pause and think about those customer requirements (because if you’re here to understand why this impacts our infrastructure, this is why):
“[for Trestle to provide] High quality data quickly and without interruption.”
At first glance, all these words together seem to make sense. But if we think about this a bit harder, what does “high quality” mean? Or “quickly”, for that matter? Does a two second round-trip latency work? What about a minute? Let’s be more specific.
There are two components to this: coverage and accuracy. An example of a coverage question would be, “Of all the phone numbers in the United States, for what percentage can you return data?” This metric only tells half the story; the other half is accuracy: “Of all the phone numbers for which you returned data, what percentage of that data is correct?” Accuracy is much more difficult to gauge and is approached on a case-by-case basis with customers through data evaluations of their target demographic.
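To make those two definitions concrete, here’s a minimal sketch of how you might score a provider’s coverage and accuracy against a labeled sample. The phone numbers, record shapes, and function names are illustrative, not Trestle’s actual evaluation tooling:

```python
# Hypothetical provider evaluation: results maps each queried phone number
# to the returned record, or None if the provider had no data for it.

def coverage(results: dict) -> float:
    """Fraction of queried numbers for which the provider returned data."""
    return sum(1 for r in results.values() if r is not None) / len(results)

def accuracy(results: dict, truth: dict) -> float:
    """Of the numbers with returned data, the fraction matching ground truth."""
    returned = {num: r for num, r in results.items() if r is not None}
    if not returned:
        return 0.0
    correct = sum(1 for num, r in returned.items() if r == truth.get(num))
    return correct / len(returned)

sample_results = {
    "+12065550100": {"name": "Ada Lovelace"},
    "+12065550101": None,                      # no data returned: hurts coverage
    "+12065550102": {"name": "Wrong Name"},    # data returned, but incorrect
}
sample_truth = {
    "+12065550100": {"name": "Ada Lovelace"},
    "+12065550102": {"name": "Grace Hopper"},
}

print(coverage(sample_results))                  # 2 of 3 numbers returned data
print(accuracy(sample_results, sample_truth))    # 1 of 2 returned records correct
```

Note that a provider can score high on one metric and poorly on the other, which is exactly why we measure both.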
When measuring the latency of APIs, metrics generally take the form of “P50” or “P75”: the round-trip latency at or below which 50% (P50) or 75% (P75) of API calls complete. Since Trestle was born from the fraud and risk industry, where latencies really matter, our latency “bar” is a P99. Depending on the product, we generally aim for a P99 between 200 and 500 ms.
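A quick sketch of how percentile latency is computed (using the simple nearest-rank method; the sample latencies are made up for illustration):

```python
import math

def percentile(latencies_ms: list, p: float) -> float:
    """Nearest-rank percentile: the latency at or below which p% of calls complete."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative round-trip latencies in milliseconds for ten API calls.
latencies = [95, 120, 180, 210, 250, 300, 420, 480, 900, 1500]

print(percentile(latencies, 50))  # 250: half of all calls finish within 250 ms
print(percentile(latencies, 99))  # 1500: the slowest calls dominate the P99
```

This is why a P99 target is so much stricter than a P50: one slow outlier barely moves the median but sets the P99 entirely.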
The “without interruption” part is mainly covered by availability: Trestle provides a 99.95% availability SLA. Looking beyond infrastructure uptime, the data business carries other risks, like regulation, that can impact the availability of data. What if a data source can no longer provide that data because the way they’ve been collecting it has been outlawed?
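For a sense of scale, a back-of-the-envelope calculation (our own illustration, not part of the SLA text) shows what a 99.95% availability commitment leaves as a downtime budget over a 30-day month:

```python
# How much downtime does a 99.95% availability SLA allow per 30-day month?
sla = 0.9995
minutes_per_month = 30 * 24 * 60              # 43,200 minutes in 30 days
downtime_budget = (1 - sla) * minutes_per_month

print(round(downtime_budget, 1))              # about 21.6 minutes per month
```

That’s roughly 21 minutes a month, total, across every failure mode, which is why data-source risk has to be managed just as deliberately as server uptime.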
Despite meeting these requirements, if we stopped here, we’d be missing a key requirement: the data needs to be valuable. If we throw garbage data in, our customers will get garbage data out, which brings us back to our repeatable framework for data sourcing that separates us from our competitors when it comes to coverage and accuracy:
We go through this framework for every single data provider, a process that takes between 6 and 12 months. Despite the lengthy time we spend determining whether a data provider fits our requirements, we’ve found that it works; we’ve kept a core group of providers that continue to provide high quality data extremely quickly and without interruption (that phrase is going to keep coming up). This isn’t to say we’re asleep at the wheel; we’ve got 10+ new data providers that have been going through this process for months now and are doing exceedingly well. This excites us and it should excite you — it means new and valuable data signals that fit into our aggressive roadmap.
One thing that’s easy to take for granted is that not every data provider does what we do. Many have less rigorous requirements that allow them to onboard the newest, hottest source. Sure, that can make them some good revenue in the short term, but we’ve seen time and time again that without assessing key areas like Data Provenance, that data’s time can be limited.
This is all to say that there’s a lot of work that goes on behind the scenes of ethical data sourcing, and it’s something we’re incredibly proud of. A common question we’ll receive from potential customers new to the data sourcing arena is, “Why don’t we just go out and buy this data ourselves?” Just as you’d use AWS, Azure, or GCP to perform the undifferentiated heavy lifting of running your servers, we perform the same role in your data strategy. Data sourcing is the core focus of our business, and we’re exceptionally good at it.
Here’s a great example: like a public cloud provider, we build redundancy into our systems, creating provider “waterfalls” to maximize coverage and prevent signal loss should any one source become inaccessible for any reason. This redundancy is a crucial component of our services that our enterprise customers have come to expect. Another example is our linkage process. You’d hope that when you merge files together, unique identities would resolve easily, but that’s almost never the case. The merging and linking process requires expertise in data processing and data science at a massive scale to unlock these insights for customers. Both redundancy and the considerable engineering time spent on merging and linking increase costs, but they are necessary to provide high quality data quickly and without interruption.
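The waterfall idea can be sketched in a few lines: try providers in priority order and fall through on failure or an empty result. The provider names, lookup interface, and record shape below are our illustration, not Trestle’s actual production code:

```python
# A minimal provider "waterfall" sketch: each provider is a callable that
# returns a record dict, returns None (no coverage), or raises (unavailable).

def waterfall(providers: list, phone: str):
    """Return the first record any provider can supply, or None."""
    for lookup in providers:
        try:
            record = lookup(phone)
        except Exception:
            continue          # provider unavailable: fall through to the next
        if record:
            return record     # first provider with data wins
    return None               # no provider had data for this number

def provider_a(phone):
    raise TimeoutError("provider A is down")   # simulated outage

def provider_b(phone):
    return None                                # no coverage for this number

def provider_c(phone):
    return {"phone": phone, "name": "Ada Lovelace"}

print(waterfall([provider_a, provider_b, provider_c], "+12065550100"))
```

Even with one provider down and another lacking coverage, the query still resolves, which is the point of the redundancy.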
All of this should give you a good sense of our data sourcing philosophy. At the end of the day, our goal is to do all the heavy lifting, so we can be the one provider you evaluate, negotiate with, and hold accountable when it comes to your identity data strategy.
Unconvinced? Thinking of sourcing data on your own? Here’s what to look for:
- Footprint: What is the depth and breadth of coverage provided by the source and how does it compare to the market and/or your business needs?
- Data Efficacy: How does the data add business value? This should be quantifiable and discrete. We’ve seen too many customers believe a specific dataset will be incredibly useful, only to realize later that the value isn’t what they’d thought.
- Response Times: What are the response time requirements? What sources can be file based and what needs to be called in real time? What’s the financial tradeoff you’re going to make? (Generally, real-time callouts will be significantly more expensive than file-based sources.)
- Privacy & Security: What are your privacy and security requirements? Do your customers require specific certifications? Your approach here should be thought through carefully and appropriate measures taken relative to the sensitivity of the data.
We hope this helps your own data sourcing efforts. If you have any questions or would like to learn more about Trestle’s data sourcing methodology, reach out to us at email@example.com or contact us.