One of the most fundamental parts of building a new company is planning the architecture that will eventually support your products. Many teams aim for a minimum viable product (MVP) and then build on it iteratively, adding components as the need arises. As most of them will say, this strategy is often the only way forward when you don’t know who the customers will be or what their needs are. It allows for flexibility in the short term, but in the long term it creates technical debt and turns migrations into herculean efforts. Fortunately, when we created Trestle, we knew a few things from experience:
- Who our customers were and would be;
- What our customers cared about; and
- What kind of architecture was optimal for maintainable, scalable and performant APIs.
All that was left to do was build that architecture and launch the product on a tight timeline. Here’s how we did it:
As a lean team at Trestle, we knew this would be a challenge. But having built these products before, our executive team was battle-hardened: it knew the pitfalls to avoid and how to plan for the long term. At the outset, we established four key principles for how our architecture needed to be built and operated:
- We would be cloud-native and multi-cloud, giving us the maximum amount of flexibility to avoid lock-in and provide the lowest possible latencies to our customers.
- We would use a serverless architecture for rapid scaling, consistent performance, and provisioned concurrency, and to avoid as much engineering overhead as possible.
- We needed a highly customizable API gateway to support a-la-carte API packages, thereby increasing margin and providing a better developer experience.
- Our database engine needed to specialize in text-based search.
Let’s dig deeper into those requirements and why we needed them.
Cloud Native & Multi-Cloud
You’ve probably heard of the great fire of 2016. Well, maybe not. That’s when our previous company’s colocation data center almost burnt to a crisp due to a faulty A/C unit. Since then, we’ve learned our lesson and are leaving the undifferentiated heavy lifting of running a data center to the folks who specialize in it. Another huge benefit of utilizing the public cloud – besides a significantly lower risk of all your data being wiped due to fire – is hugely improved latencies. For our customers, this is extremely important because of their need for real-time data as they talk and interact with customers. Finally, we wanted to take a multi-cloud approach to avoid vendor lock-in and leverage what specific vendors do extremely well where others may fall short.
Serverless Architecture
This one is fairly self-explanatory. Using serverless, specifically AWS Lambda, allows us to auto-scale rapidly without a human in the loop. Some of our customers will scale up to 3,000-4,000 QPS and expect the same latency they were getting at 2 QPS. From what we learned running microservice and Kubernetes-based architectures, this is best accomplished when auto-scaling is seamless and requires zero human intervention. Another great Lambda feature is Provisioned Concurrency, which solved our cold-start problem and lets us absorb quick traffic spikes without impacting latency. Last but probably most important, with such a small team we had very little tolerance for the engineering overhead of supporting these systems. A serverless option was the best way for our team to avoid costly hours spent managing VMs.
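For a rough sense of how much concurrency a traffic level like that implies, here is a minimal sketch based on Little's law (in-flight requests = requests per second × average request duration), which is also how Lambda concurrency is usually estimated. The 50 ms average latency and the 20% headroom are illustrative assumptions, not measured Trestle figures, and `required_concurrency` is a hypothetical helper.

```python
import math


def required_concurrency(qps: float, avg_latency_ms: float, headroom_pct: int = 20) -> int:
    """Estimate how many concurrent executions are needed to serve `qps`.

    Little's law: in-flight requests = arrival rate * average time per request.
    `headroom_pct` adds a safety margin on top of the steady-state figure.
    All inputs are assumptions supplied by the caller.
    """
    if qps < 0 or avg_latency_ms < 0 or headroom_pct < 0:
        raise ValueError("inputs must be non-negative")
    in_flight = qps * avg_latency_ms / 1000  # steady-state concurrent requests
    return math.ceil(in_flight * (100 + headroom_pct) / 100)


# Assumed 50 ms average request duration at the low and high ends of traffic.
low = required_concurrency(2, 50)      # 1
high = required_concurrency(4000, 50)  # 240
```

A number like this is only a starting point for sizing provisioned concurrency; real traffic is bursty, so the headroom would need tuning against observed spikes.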
Highly Customizable API Gateway
This was a buy scenario, not build. We had rolled our own API gateways before, and this time we needed something ready to go right out of the gate. After an extensive review of AWS’s API Gateway, Kong, MuleSoft, and Apigee, we went with Apigee, given its capabilities and its access to great big data tools in the GCP ecosystem like BigQuery. Apigee is not only feature-rich when it comes to providing easy API accessibility for developers, but it also provides forward-looking (for us) monetization methods that will be necessary for our go-to-market approach for a-la-carte API packages. That said, finding support for getting started with Apigee was surprisingly difficult. We were used to accidentally paying for cloud services before we realized charges were being incurred, but with Apigee, we had the opposite problem: the process to move into production and pay for the service was obscure and not well-documented.
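To make the a-la-carte idea concrete, here is a minimal sketch of the kind of entitlement check a gateway performs before routing a request: each API key is tied to a package, and a call is allowed only if the requested product is in that package. This is not Apigee's API or configuration format; the class names, product names, and API key below are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ApiPackage:
    """A named bundle of API products a customer has purchased (hypothetical)."""
    name: str
    products: frozenset[str]


class EntitlementTable:
    """Maps each API key to the package it was issued under."""

    def __init__(self) -> None:
        self._by_key: dict[str, ApiPackage] = {}

    def issue_key(self, api_key: str, package: ApiPackage) -> None:
        self._by_key[api_key] = package

    def is_allowed(self, api_key: str, product: str) -> bool:
        # Unknown keys are rejected; known keys only reach purchased products.
        package = self._by_key.get(api_key)
        return package is not None and product in package.products


# Hypothetical package containing a single product.
starter = ApiPackage("starter", frozenset({"phone-lookup"}))
table = EntitlementTable()
table.issue_key("key-123", starter)

table.is_allowed("key-123", "phone-lookup")    # True
table.is_allowed("key-123", "address-lookup")  # False
```

Keeping this check at the gateway means the backend services never need to know which packages exist, which is part of the appeal of buying a gateway rather than building one.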
Text-Based Search Database
Although it was a known entity from our previous work, an equally important requirement was a database engine that specializes in text-based search: Elasticsearch. The scale of data we process and store is enormous, so a database hyper-specialized in text-based search was critical given our need for low latencies. We decided to go with AWS’s managed Elasticsearch offering, Amazon OpenSearch Service, allowing us to focus on the logic for merging and searching data rather than on scaling, monitoring, and troubleshooting the infrastructure. We continue to work there, constantly searching for 5-10 millisecond improvements wherever we can to improve product delivery for our customers.
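Engines in the Elasticsearch family are built on Lucene, whose core data structure is the inverted index: a map from each term to the documents that contain it, so a query touches only the matching postings instead of scanning every record. The toy version below is a sketch of that idea only, not how we query OpenSearch; the class and the sample records are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: term -> set of document ids containing that term."""

    def __init__(self) -> None:
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> set[int]:
        """Return ids of documents that contain every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


# Made-up sample records.
index = InvertedIndex()
index.add(1, "Jane Doe, 123 Main Street, Seattle")
index.add(2, "John Doe, 45 Pine Street, Portland")
index.add(3, "Jane Roe, 9 Main Avenue, Seattle")

index.search("jane seattle")  # {1, 3}
index.search("main street")   # {1}
```

A real engine adds relevance scoring, fuzzy matching, and sharding on top of this structure, which is the part a managed service takes off a small team's plate.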
Based on the above requirements, here’s what our architecture looked like:
These architectural bones cover most of our customer-facing infrastructure when coupled with the software development life cycle practices we had already been following: separate dev, staging, and production environments, and a CI/CD (continuous integration/continuous delivery) framework.
A few additional considerations impacted every decision we made in our architecture, and we will cover them in a follow-up blog post. Stay tuned!
Header Image: Victorgrigas, CC BY-SA 3.0, via Wikimedia Commons