Towards a Scalable Architecture for API Usage Collection and Reporting

Overview

Trestle, an API-based SaaS company with thousands of customers, needs precise and efficient usage tracking: it is critical for billing accuracy, customer trust, and compliance with service agreements. We offer 12 distinct API products, each integrated via AWS API Gateway and backed by its own AWS Lambda function. Historically, we collected this data through AWS’s GetUsage API. While straightforward, that approach posed significant scalability and efficiency challenges as our customer base grew. In this blog, we’ll explore how we redesigned our usage tracking architecture, cutting report generation times from 30 minutes to under five seconds and significantly improving response times for user-facing usage charts.

The Challenge

Each client interaction with our APIs was authenticated and logged via API keys issued through our developer portal, which relies on AWS API Gateway for key management. To report usage, our system needed to track the number of requests made by each API key. Initially, this was accomplished by periodically querying the AWS GetUsage API, which provides aggregated usage data based on the API key and usage plan. Specifically, the GetUsage method was invoked for each API key across all usage plans, as follows:

  • Request Composition: A typical GetUsage request requires the apiKeyId and usagePlanId, along with the desired date range for which data is requested.
  • Response Time: Each GetUsage call took 200 to 500 milliseconds to complete. Worse, the API enforces a strict rate limit, so we could not simply parallelize these calls.
  • Aggregation Overhead: Given thousands of API keys and multiple usage plans, compiling a comprehensive usage report could involve hundreds to thousands of individual API calls. Collectively, this process took upwards of 30 minutes—a significant delay that hindered timely billing and customer service.

This method’s primary drawbacks were its linear scalability issues and high latency, which became increasingly untenable as the number of API keys grew into the thousands.
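For illustration, the old polling loop can be sketched roughly as follows. This is a simplified sketch, not our exact implementation: the client is assumed to behave like a boto3 API Gateway client, whose get_usage response maps each API key ID to per-day [used, remaining] pairs, and all IDs are hypothetical.

```python
from datetime import date

def collect_usage_via_polling(client, key_ids, plan_ids, start: date, end: date):
    """Sum daily request counts per API key by polling GetUsage.

    `client` is assumed to behave like a boto3 API Gateway client:
    get_usage() returns {"items": {keyId: [[used, remaining], ...]}}.
    """
    totals = {key_id: 0 for key_id in key_ids}
    for plan_id in plan_ids:
        for key_id in key_ids:
            # One network round trip (200-500 ms each) per key/plan pair,
            # and the calls are rate-limited, so this loop is effectively serial.
            resp = client.get_usage(
                usagePlanId=plan_id,
                keyId=key_id,
                startDate=start.isoformat(),
                endDate=end.isoformat(),
            )
            for daily in resp.get("items", {}).get(key_id, []):
                totals[key_id] += daily[0]  # first element is the "used" count
    return totals
```

With thousands of keys and multiple plans, the key_ids × plan_ids product is exactly why this loop took upwards of 30 minutes.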

Design

Logging-based Data Collection

To eliminate the bottleneck, we shifted from API-based polling to log-based data collection. For each API stage in AWS API Gateway, custom access logs were enabled with the following format:

{
 "apiKeyId": "$context.identity.apiKeyId",
 "statusCode": "$context.status",
 ...
}

apiKeyId : the AWS-generated identifier for an API key.

statusCode : the HTTP status code returned for the API call by the integration (our Lambda functions). This lets us track 2xx/4xx/5xx responses separately.
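A collector consuming these logs only needs to count lines per (apiKeyId, statusCode) pair. A minimal sketch, assuming one JSON object per log line:

```python
import json
from collections import Counter

def tally_log_lines(lines):
    """Aggregate raw access-log JSON lines into (apiKeyId, statusCode) counts."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        counts[(entry["apiKeyId"], entry["statusCode"])] += 1
    return counts
```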

PostgreSQL for Data Storage

Initial DB Designs

The first approach was to store each API request as a distinct row in the usage_count table, capturing details like apiKeyId, productId, statusCode, planType and timestamp. This straightforward schema ensured detailed tracking but was not scalable, as it was projected to accumulate tens of millions of rows monthly, creating substantial overhead in maintenance and performance.

To improve scalability, we revised the schema to aggregate requests by hour at insertion time, considerably lowering the row count. In this revised schema, data was aggregated by apiKeyId, productId, statusCode, planType, and hourOfDay, storing the count of requests per hour.
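The hourly rollup can be sketched as follows; the event field names here are illustrative, not our exact schema:

```python
from collections import Counter

def aggregate_hourly(events):
    """Collapse per-request events into per-hour rows for the revised schema.

    Each event is a dict with apiKeyId, productId, statusCode, planType,
    and an ISO-style timestamp like "2024-03-15T09:42:10Z" (names assumed).
    """
    rows = Counter()
    for e in events:
        hour = e["timestamp"][:13]  # truncate "YYYY-MM-DDTHH:MM:SSZ" to the hour
        key = (e["apiKeyId"], e["productId"], e["statusCode"], e["planType"], hour)
        rows[key] += 1
    return rows
```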

While this approach mitigated some scalability issues by reducing the volume of data (as compared to the initial per-request storage), it still involved managing millions of rows and complex queries for aggregation, impacting the database’s performance and flexibility.

JSONB-Based Storage

To address these challenges, we transitioned to using PostgreSQL’s JSONB data type for the table. This schema stores multiple aggregated metrics (hourly and daily) within a single database row for each unique combination of apiKeyId, productId, statusCode, and planType. The JSONB structure organizes data efficiently, facilitating fast and flexible querying capabilities:

Table: usage_count
Columns:
- apiKeyId: string
- productId: int
- statusCode: string
- planType: string
- countMap: JSONB



countMap JSONB Structure:

{
  "hour": {
    "yyyymmddHH": ci1,
    "yyyymmddHH": ci2,
    "yyyymmddHH": cj1
  },
  "day": {
    "yyyymmdd": ci1+ci2+...+ci24,
    "yyyymmdd": cj1+...+cj24
  }
}
  • The hour object stores usage counts per hour, with keys in the format "yyyymmddHH" and values giving the count for that specific hour.
  • The day object stores aggregated usage counts per day, with keys in the format "yyyymmdd" and values giving the sum of the hourly counts for that day.
  • The lowest level of granularity we needed was “hour.” Because this structure is modular, it would be easy to collect data at a finer granularity if required, such as “30M” (30 minutes) or “15M” (15 minutes).
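Folding hourly counts into this structure is straightforward; a minimal sketch, assuming the hour keys are already formatted as "yyyymmddHH":

```python
def build_count_map(hourly_counts):
    """Fold per-hour counts into the countMap shape: hourly buckets
    plus day-level rollups derived from them.

    `hourly_counts` maps "yyyymmddHH" -> count.
    """
    count_map = {"hour": dict(hourly_counts), "day": {}}
    for hour_key, count in hourly_counts.items():
        day_key = hour_key[:8]  # "yyyymmddHH" -> "yyyymmdd"
        count_map["day"][day_key] = count_map["day"].get(day_key, 0) + count
    return count_map
```

Precomputing the day rollup at write time is what lets daily reports skip summing 24 hourly buckets per day at query time.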
Advantages:
  • Efficient Storage and Fast Querying: JSONB reduces the storage footprint and simplifies data management by aggregating extensive data into fewer rows with structured, index-supported JSON fields, making queries substantially faster.
  • Rich Querying Capabilities: JSONB offers robust and flexible data retrieval and manipulation directly within the JSON structure. jsonb_object_agg and jsonb_each_text are two of the built-in functions we use extensively.

Based on our internal experiments, two years’ worth of usage data for 1,000 API keys requires only about 200 MB of storage with this schema.

Data Collection with CRON

A cron job periodically collects the usage data per apiKeyId and statusCode from the custom access logs in AWS API Gateway for each stage. It processes the access logs, extracts the necessary information, and updates the usage_count table accordingly. To handle failures gracefully, it derives the time range for data collection from the last successful cron run, then reads the access logs in slices to efficiently handle large volumes of data.

If the cron has not completed successfully for the past “x” hours, a custom metric is pushed to CloudWatch, triggering an alarm that alerts the on-call engineer.
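The window-slicing step can be sketched as follows; the slice size here is an assumption, not our production value:

```python
from datetime import datetime, timedelta

def collection_slices(last_success: datetime, now: datetime, slice_minutes: int = 60):
    """Split the window since the last successful run into fixed slices.

    Re-deriving the window from the tracking table means a failed run's
    range is retried automatically on the next invocation.
    """
    slices = []
    start = last_success
    step = timedelta(minutes=slice_minutes)
    while start < now:
        end = min(start + step, now)
        slices.append((start, end))
        start = end
    return slices
```

Each slice is then fetched and aggregated independently, bounding memory use no matter how far behind the cron has fallen.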

Cron Tracking

Another table for tracking the above cron was created, which helps monitor and manage the cron’s execution. It tracks each successful cron run by storing information such as the start time, end time, and status. This table serves multiple purposes, including determining the time range for collecting usage data, monitoring and alerting on cron failures, and facilitating troubleshooting and auditing. Using this table’s information, we can identify missed or failed runs, trigger alerts for prompt action, and analyze the cron execution history. The schema for this table is:

Table: usage_cron_tracking
Columns:
- api_stage: string (the API stage data was collected from)
- status_code: string (HTTP status code)
- results_count: int (total count collected for this api_stage and status_code)
- start_time: timestamp
- end_time: timestamp
- status: string (SUCCESS or FAILURE)

Data Access APIs

The final component is a set of data access APIs that query this table. Using PostgreSQL’s native SQL capabilities, especially its JSONB functions and operators, we can quickly aggregate and retrieve usage metrics. For example, to aggregate usage counts per API key and product over a given date range:

SELECT apiKeyId, productId, SUM(cast(value AS numeric)) AS totalCount
FROM usage_count, jsonb_each_text(countMap->'day') AS kv(key, value)
WHERE key BETWEEN :startTime AND :endTime
GROUP BY apiKeyId, productId;
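For readers unfamiliar with jsonb_each_text, here is the same aggregation expressed in plain Python over in-memory rows, purely to illustrate what the query computes:

```python
def total_daily_counts(rows, start_day: str, end_day: str):
    """Sum each row's "day" buckets over an inclusive yyyymmdd range,
    grouped by (apiKeyId, productId) -- the Python analogue of
    expanding countMap->'day' with jsonb_each_text and GROUP BY.
    """
    totals = {}
    for row in rows:
        group = (row["apiKeyId"], row["productId"])
        for day, count in row["countMap"]["day"].items():
            # String comparison is safe here because yyyymmdd sorts lexically.
            if start_day <= day <= end_day:
                totals[group] = totals.get(group, 0) + count
    return totals
```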

Conclusion

By leveraging AWS API Gateway custom logs, PostgreSQL’s JSONB data type, and optimized SQL queries, we transformed our API usage data collection and reporting process. This not only slashed data retrieval times by more than 100x but also improved our system’s scalability and reliability, proving crucial for handling our growing user base effectively. This architecture provides a robust template for SaaS providers looking to optimize API management and billing operations.

This blog post was written by Deepak Kumar Ganesan, Software Engineer at Trestle.