From Blind Spots to Clear Insights: The Evolution of Observability Tools and Practices at Greenlight

Greenlight Engineering
10 min read · Jun 26, 2023


Greenlight® Financial Technology, Inc. is a fintech company on a mission to help parents raise financially-smart kids by providing tools to teach children about spending wisely, saving, earning, and investing.

As part of these tools, parents can provide debit cards to their children, allowing them to spend their hard-earned cash anywhere debit cards are accepted. With that comes a huge responsibility for Greenlight, because if Greenlight’s systems are down, children may be unable to spend their money, potentially stranding them in a car without gas, or holding up the line at their favorite coffee shop. It is important to both Greenlight and the families we serve that our system is available when they need it the most.

Greenlight has experienced substantial growth in recent years. We have transitioned from a system that ran on AWS Fargate to a microservice-style architecture running on Kubernetes across multiple cloud provider regions. You can read more about that journey here: https://greenlight-engineering.medium.com/journey-to-microservices-28721a43aa35. With this rapid growth, we quickly ran into observability challenges in our system. Great observability is the first step toward detecting problems in production, and ultimately preventing them, which provides the best product experience for our families.

Usefulness of Exception Tracking

At first, Greenlight was able to observe our system simply by the types of exceptions the code produced. We set up an exception tracking system that would alert our on-call engineers when there were new exceptions in production, and the engineers worked to quickly resolve them. We added this exception tracking to our backend components running on AWS Fargate and to our native mobile applications for iOS and Android. Exceptions reported to this system were quickly attributed to individual users, making it very easy for our engineers to determine the impact on our families.

This setup worked for us for quite a long time, but as we transitioned off of AWS Fargate into a multi-service distributed system, and as the number of families we served grew, the usefulness of an observability system based on exception tracking started to diminish. Any intermittent failure in the system, caused by things like database query timeouts or connection issues between services, would immediately alert an on-call engineer. This led to high pager fatigue and a terrible on-call experience, with developers woken up multiple times in the middle of the night for insignificant errors.

What we needed was the ability not only to see when there were problems, but to estimate the real-time impact of those problems. For example, a system handling hundreds of API requests per second with one or two intermittent exceptions per hour is not something we want to wake our engineers up for in the middle of the night; but if a system is handling only one or two API requests and all of them are failing, we want to know as soon as possible. Essentially, we wanted to alert our on-call engineers when there was a high percentage of exceptions relative to the amount of work that system was doing.
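To make the idea concrete, here is a minimal sketch (in TypeScript, which most of our backend services use) of the kind of decision we wanted our alerting to make. The thresholds and the function are illustrative, not actual Greenlight alerting code:

// Illustrative only: decide whether to page based on the error *rate* relative
// to traffic, rather than on the presence of any exception at all.

interface WindowStats {
  requests: number; // requests handled during the evaluation window
  errors: number;   // requests that ended in an exception
}

// Hypothetical thresholds chosen for the example.
const MIN_REQUESTS = 10;     // below this, treat isolated failures as noise...
const MAX_ERROR_RATE = 0.05; // ...and page when more than 5% of requests fail

function shouldPage({ requests, errors }: WindowStats): boolean {
  if (requests === 0) return false;
  const errorRate = errors / requests;
  // Page on a high error rate, or when a low-traffic service is failing every call.
  return errorRate > MAX_ERROR_RATE && (requests >= MIN_REQUESTS || errors === requests);
}

// A busy service with a couple of stray exceptions stays quiet:
console.log(shouldPage({ requests: 1_000_000, errors: 2 })); // false
// A service that handled only two requests, both of which failed, pages immediately:
console.log(shouldPage({ requests: 2, errors: 2 })); // true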

Application Performance Monitoring

At this point in our architecture, almost all of the work the Greenlight systems did was triggered via HTTP or gRPC calls into our services. We had background tasks and queues to schedule work asynchronously, but those ultimately got converted into HTTP requests and sent to the system (a leftover from running on AWS Fargate). We knew there were several great options for easily observing the success rate of our APIs, and we ultimately settled on Datadog as our observability vendor.

Setting up our services to report their data to Datadog was quite easy, and it immediately started providing immense insight into how our system was working. With Datadog’s APM product we were able to see exactly how many API calls we were handling correctly, and it allowed us to set up monitors around error rates and response latency.
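As a rough illustration of what this unlocked, an APM-based error-rate monitor can be expressed as a ratio of errored traces to total traces. The query below is a hedged sketch: the service name, span name, and threshold are hypothetical, and the exact metric names depend on how a service is instrumented:

// Illustrative Datadog metric-monitor query (not an actual Greenlight monitor).
// It compares errored traced requests to total traced requests over 5 minutes
// and triggers when more than 5% of them fail.
const errorRateMonitorQuery =
  "sum(last_5m):" +
  "sum:trace.express.request.errors{env:production,service:card-api}.as_count() / " +
  "sum:trace.express.request.hits{env:production,service:card-api}.as_count() > 0.05";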

The Datadog platform also provided us with incredible observability through APM traces, allowing us to see the path an API call took through our system, the database queries it ran, and its interactions with external systems. Our engineers quickly adopted Datadog, and as we became more and more comfortable with what it provided, we started turning off the older alerts based on exception tracking.

Managing Monitors at Scale

We were now at a point in our Journey to Microservices where we had around two dozen services in production, managed by about a dozen teams. We had converted all of these services to Datadog, and we could quickly detect, identify, and respond to problems affecting our families.

At Greenlight, we have multiple internal environments that we use to test our product before releasing it to our customers. Catching problems in these internal environments quickly is important for reducing the cycle time to address them before they impact the parents and children who rely on Greenlight. To do that, teams started creating monitors for our internal environments that could detect problems the same way we detect them in production. By doing that, however, we introduced a multiplicative problem into our stack.

You see, all of the monitors we had created in Datadog up to this point had been created through the Datadog console. Teams would simply create a monitor, configure the metrics to alert on, and set the channels to notify when the alert triggered. This quickly became unwieldy once multiple environments entered the equation.

So instead, we started to move much of our monitoring into Terraform. This allowed us to reuse the same monitoring configuration across all environments, and it allowed teams to peer review changes before they were applied to an environment. We had already been using Terraform for our other AWS infrastructure components, so we were very familiar with its tooling.

However, we quickly realized that this pattern wasn’t going to continue to scale. Teams had started to develop Terraform modules that let them reuse monitors across multiple services without rewriting the same query over and over, but they ran into problems as the flexibility, and therefore the complexity, of these modules grew over time. Maintaining these common modules became unwieldy, as even the smallest tweak to the syntax could break every monitor that leveraged them.

Furthermore, every change to a monitor was deployed to every environment at the same time. Testing a configuration change for a monitor in a lower environment could mean losing the ability to detect problems in production. We needed a more robust solution.

Also during this time, we quickly expanded from running about two dozen services in production to running close to fifty. The company was rapidly developing new features, and was building new services to handle those capabilities. With these new services came new interaction patterns with other systems. Many of them did more than serve API requests: they produced and consumed messages from their own queues, processed Change Data Capture events from Kafka, or ran scheduled jobs through Kubernetes CronJobs. Gone were the days when a simple HTTP status code could indicate whether a service was working.

To accommodate all of these new patterns, Greenlight’s SRE team looked to roll out Service Level Objectives (SLOs) as the foundation of our observability and monitoring. Instead of basing all of our alerts on HTTP status codes and latency, we could describe them in a format agnostic to the mechanism being monitored. There are other benefits to adopting SLOs for observability, which could be a post in and of itself, but for now, know that this involved a significant change to how we handled monitors in our stack.
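For context on what “burn rate” means here (a standard SLO concept rather than anything Greenlight-specific), the sketch below shows the arithmetic: how quickly an SLO’s error budget is being consumed, regardless of whether the underlying “events” are HTTP requests, queue messages, CDC records, or CronJob runs. The target and example numbers are illustrative:

// A sketch of the burn-rate arithmetic behind SLO-based alerting.

const sloTarget = 0.999;           // e.g. 99.9% of events should succeed
const errorBudget = 1 - sloTarget; // so 0.1% of events are allowed to fail

// Burn rate = observed error rate / allowed error rate.
// A burn rate of 1 means the budget lasts exactly one SLO window; a burn rate
// of 14.4 sustained for 1 hour consumes ~2% of a 30-day error budget.
function burnRate(badEvents: number, totalEvents: number): number {
  if (totalEvents === 0) return 0;
  return (badEvents / totalEvents) / errorBudget;
}

// The same calculation works for HTTP requests, Kafka/CDC events, queue
// messages, or scheduled job executions.
console.log(burnRate(6, 1000)); // 6 — burning the budget 6x faster than allowed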

Introducing 🚸 Crosswalk

One of the biggest issues with our Terraform approach was that we had no way to tie a change to a service to a change in its monitoring. We wanted to be able to introduce monitoring for a new feature at the same time we deployed that feature to production for the first time. We also wanted to roll the monitoring for that feature out to lower environments while we were testing it.

In order to do that, we built our own tool for managing SLOs, Monitors, and Dashboards, which we call Crosswalk. Crosswalk allows our engineers to define SLOs and burn-rate monitors for those SLOs as code, right next to the code we use to develop the functionality. As most of our backend engineers are familiar with the Node.js/TypeScript ecosystem, we wrote this tool to be as easy as possible for them to use.

Crosswalk is simply a Node.js library, which we hope to eventually provide to the open-source community. At a high level, we leverage TypeScript classes to define our Monitors and SLOs.
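Since Crosswalk has not been open-sourced yet, the snippet below is only a hypothetical sketch of what class-based definitions of this kind can look like; the class names and fields are ours for illustration, not Crosswalk’s actual API:

// Hypothetical shapes for SLO and burn-rate monitor definitions.
// Crosswalk's real classes and fields may differ.

export class Slo {
  constructor(
    public readonly name: string,
    public readonly target: number,          // e.g. 0.999 for 99.9%
    public readonly timeWindow: string,      // e.g. "30d"
    public readonly goodEventsQuery: string, // Datadog query counting successful events
    public readonly totalEventsQuery: string // Datadog query counting all events
  ) {}
}

export class BurnRateMonitor {
  constructor(
    public readonly slo: Slo,
    public readonly longWindow: string,        // e.g. "1h"
    public readonly shortWindow: string,       // e.g. "5m"
    public readonly burnRateThreshold: number, // e.g. 14.4
    public readonly priority: 1 | 2 | 3 | 4 | 5 = 2
  ) {}
}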

This allowed us to use class composition and inheritance to build out a suite of pre-defined SLOs, Monitors, and Dashboards, enabling our engineers to leverage battle-tested, robust configurations for our tech stacks. Using those monitors is as simple as writing code such as the following.
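Continuing the hypothetical sketch above (the module path, class names, and metric queries are all illustrative, not Crosswalk’s actual API), a shared, pre-defined SLO can be extended and instantiated by a service team in just a few lines:

// Hypothetical: a library-provided base class that builds a standard HTTP
// availability SLO from a service name, plus the service's own definitions.
import { Slo, BurnRateMonitor } from "./crosswalk-sketch"; // illustrative module path

class HttpAvailabilitySlo extends Slo {
  constructor(service: string, target = 0.999) {
    super(
      `${service} HTTP availability`,
      target,
      "30d",
      // "Good" events: traced requests that did not error (queries are illustrative).
      `sum:trace.express.request.hits{service:${service}}.as_count() - ` +
        `sum:trace.express.request.errors{service:${service}}.as_count()`,
      `sum:trace.express.request.hits{service:${service}}.as_count()`
    );
  }
}

// In a service's observability/ folder, using the shared definitions is one line each.
export const availability = new HttpAvailabilitySlo("card-transactions");
export const fastBurn = new BurnRateMonitor(availability, "1h", "5m", 14.4);
export const slowBurn = new BurnRateMonitor(availability, "6h", "30m", 6, 3);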

Once all of this is defined, it is committed to an observability/ folder in the service’s repository (we use one repo per service at Greenlight). When that service is built by our Continuous Integration (CI) pipeline, the version is tagged to the commit. We then hooked up Crosswalk to our deployment system, enabling the changes introduced in that version to be automatically synchronized with what is in Datadog.

By placing observability configuration alongside the code it observes, we improved the development process for all of our engineers. Changes introduced into backend services can correspond directly with changes to SLO, Monitor, and Dashboard configurations. As those changes roll out to each of our environments, so does the configuration for that service’s observability. This also lets engineers validate their monitoring before a feature launches in production, giving them more confidence that they’ll know if there is a problem.

Crosswalk uses tags on both the Monitors and SLOs to synchronize what is defined in the repo with what exists in Datadog. This lets us run these synchronizations without managing an external “state” file the way Terraform does, and therefore lets us reproduce the configuration across environments more accurately (a rough sketch of this tag-based approach follows the screenshot below). Crosswalk also enforces other policies introduced by the SRE team, such as monitor tagging, alert priorities, and conventions around monitor titles and descriptions. Standardizing on alert priorities and tagging enables us to accurately measure our response to these alerts, and to measure how our tools and policies are impacting the overall experience for our on-call engineers. Finally, because Crosswalk knows all of the Datadog resources defined for a service, it can automatically link related resources together, allowing on-call engineers to identify problems more quickly. For example, here’s a screenshot of one of the monitors created by Crosswalk, automatically providing links to APM and to the relevant dashboards for that SLO:

[Screenshot: a Datadog monitor that alerts on the burn rate of an SLO, with links that direct the on-call engineer to APM trace views and related dashboards for the SLO]
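As promised above, here is a rough sketch of what tag-based reconciliation against Datadog can look like. The tag names (managed-by:crosswalk, service, env) are hypothetical conventions for illustration; the endpoint is Datadog’s public GET /api/v1/monitor API, and the closing comment describes the general approach rather than Crosswalk’s actual implementation:

// Illustrative tag-based lookup: the tags on the monitors in Datadog act as the
// "state", so no external state file is required.

interface DatadogMonitor {
  id: number;
  name: string;
  query: string;
  tags: string[];
}

async function listManagedMonitors(service: string, env: string): Promise<DatadogMonitor[]> {
  // Hypothetical tag convention identifying monitors owned by this tool.
  const monitorTags = encodeURIComponent(`managed-by:crosswalk,service:${service},env:${env}`);
  const res = await fetch(`https://api.datadoghq.com/api/v1/monitor?monitor_tags=${monitorTags}`, {
    headers: {
      "DD-API-KEY": process.env.DD_API_KEY ?? "",
      "DD-APPLICATION-KEY": process.env.DD_APP_KEY ?? "",
    },
  });
  if (!res.ok) throw new Error(`Datadog API returned ${res.status}`);
  return (await res.json()) as DatadogMonitor[];
}

// Reconciliation is then a diff between the monitors defined in the repo and the
// tagged monitors returned here: create what's missing, update what has drifted,
// and delete orphans that are tagged but no longer defined in code.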

We are excited to share that we have successfully implemented Crosswalk across all of our backend services. This has enabled us to transform our Datadog configuration from a disconnected set of monitors into a cohesive ecosystem of around a thousand rich SLOs, monitors, and dashboards. Additionally, our engineers are now able to operate more efficiently with reduced overhead. We are thrilled about the possibilities this opens up for us and the potential for even greater achievements in the future.

With this seamless integration of Crosswalk into our SDLC, our engineers can now swiftly identify and address potential issues long before new features are deployed to production. This proactive approach not only minimizes customer disruption in our production environment, but also shortens our time to detect and time to resolve the issues that inevitably do arise. We are thrilled about the impact Crosswalk has had on our workflow, allowing us to deliver high-quality products with confidence and facilitating a smooth and reliable experience for our customers.

Summary

Observability is critical to keeping our product healthy and reliable for our customers. Strong observability practices enable us to respond to issues in production faster and ensure that our customers can access their money when they need it. We’ve gone through several iterations of our observability stack over the years, ranging from simple exception tracking all the way to managing hundreds of SLOs across dozens of services. We’ve built tools specific to our use cases and made it easier than ever to ensure we have strong visibility into our systems.

I am excited to share with you the lessons we learned along the way, and look forward to releasing Crosswalk to the community in the near future. If you have any questions or want to learn more about our observability practices, feel free to reach out on LinkedIn.

Brandon Tuttle, Staff Site Reliability Engineer at Greenlight

https://www.linkedin.com/in/brandon-tuttle-435964a9
