Journey to Microservices
Greenlight is a fintech company on a mission to help parents raise financially-smart kids by providing tools to teach children about spending wisely, saving, earning, and investing.
Like many startups, Greenlight started our technical journey with a single service. This service, very secretly named server-api, was, as astute readers might have guessed, both a server and an API. It was Greenlight’s first API, and it came in with a bang. One developer, just before midnight in July 2015, created the first commit to the server-api repository with 270 files and over 43,472 lines of code. With that, Greenlight had created its first service.
Greenlight launched its first commercial product in 2017. All of the functionality behind our native mobile and web applications was housed in the single server-api service. As an early-stage startup, we had no data center of our own, so we relied heavily on AWS for infrastructure and supporting services. At the time, the entire Greenlight engineering team was fewer than 10 people.
As we continued to evolve our DevOps processes — without a DevOps team — we were able to make changes to one service and reap the rewards immediately. Product owners felt they could tackle anything the market could throw at them — with only a handful of engineers.
High growth and lots of change
Just like kids grow up and experience new challenges on the path to adulthood, startups face their own set of challenges when they grow quickly. This happened to Greenlight on our technical journey. We started out registering just a few hundred families a day, which quickly became a few thousand a day. More registrations meant a need for more engineers. We started expanding the size of the engineering team — and with each new hire, it became more apparent that our old ways of working no longer cut it.
At first, hiring new engineers was smooth sailing. We could have a new engineer shadow another engineer and learn how work was done. Everyone had the same simple development environment and committed code to the same repo. The team built code and deployed it with the same configuration. Everyone was in every meeting, and all information was disseminated to all parties. Then, before we reached 20 engineers, we started running into challenges with our established engineering norms.
We began to figure out how we would work with multiple teams. Everyone still knew how most everything in the system worked, could develop against it locally, and could operate it. We were growing fast enough that it made sense to invest in support tooling, so a new service was added to enable our Customer Service department to operate against the application data. Instead of using server-api as an API, the support application directly leveraged the underlying data store of server-api. This new service was called commander.
While server-api was a backend, commander was written as a React web app. This could have been an opportunity to introduce our first microservice, but that moment came later. Instead, there were now multiple applications operating on the same data, and for the first time we started to stumble into the complexity of distributed systems, with our initial battles over data caching and distributed transactions. This was really the dawn of systems architecture at Greenlight.
At this point, a few engineers worked on the commander application, deployed on AWS Fargate, and most of our full-stack engineers worked on the server-api service, deployed on AWS Elastic Beanstalk. There were no true backend engineers at that time; everyone worked up and down the tech stack and left to right across the features delivered in the application. Once we had enough engineers to create a few product teams, merge conflicts became far more frequent. Smaller teams needed to review the pull requests related to their feature work, but a merge from one team would start to overlap with merges from other teams working in similar areas of the codebase. With more engineers and more feature teams working on overlapping constructs, development cycle times ballooned.
In addition to merge problems, testing became much more complicated. Initially, the lines between unit, integration, and acceptance testing were blurry. Since there was really only one deployment unit, almost everything was written as an integration or acceptance test. Data would get defined in a database, and a full suite of tests would run on each merge to main. As more and more features were added to server-api, the number of tests, and the time it took to run them, started to balloon. Some of the tests also had external dependencies and could not be parallelized, so they ran one at a time to avoid false negatives. This led to certain tests getting skipped or bypassed to merge changes more quickly into shared environments, resulting in more instability in a rapidly growing codebase.
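To make that concrete, here is a rough sketch in a Jest-style TypeScript test (hypothetical code, not our actual test suite): when an external dependency is hidden behind an interface and stubbed, the test touches no shared infrastructure and can safely run in parallel; only the tests that exercise real external systems need to run serially.

```typescript
// Hypothetical example, not Greenlight's actual code: a unit test that stubs its
// external dependency can run in parallel with every other test.

interface PaymentGateway {
  charge(accountId: string, cents: number): Promise<{ ok: boolean }>;
}

// Business logic under test, with the external dependency injected.
async function chargeAllowance(
  gateway: PaymentGateway,
  accountId: string,
  cents: number
): Promise<boolean> {
  if (cents <= 0) return false;
  const result = await gateway.charge(accountId, cents);
  return result.ok;
}

describe('chargeAllowance (unit test, safe to run in parallel)', () => {
  it('charges through the gateway for a positive amount', async () => {
    const calls: Array<[string, number]> = [];
    // The gateway is a stub, so this test touches no shared infrastructure.
    const gateway: PaymentGateway = {
      charge: async (accountId, cents) => {
        calls.push([accountId, cents]);
        return { ok: true };
      },
    };
    expect(await chargeAllowance(gateway, 'acct-123', 500)).toBe(true);
    expect(calls).toEqual([['acct-123', 500]]);
  });

  it('rejects a non-positive amount without calling the gateway', async () => {
    const calls: Array<[string, number]> = [];
    const gateway: PaymentGateway = {
      charge: async (...args) => {
        calls.push(args);
        return { ok: true };
      },
    };
    expect(await chargeAllowance(gateway, 'acct-123', 0)).toBe(false);
    expect(calls).toHaveLength(0);
  });
});
```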
It was about this time that Greenlight crossed a large milestone for the number of families on the platform, and we signed a large partnership. It became obvious that our feature teams were struggling to keep pace with an expanding Product team and accelerating demand for feature development. A few engineers had worked at companies that used a microservice architecture, and momentum started to swell behind the idea that microservices were the tool to dig our way out of this feature delivery quagmire.
The dawn of microservices
At this point in our journey, server-api was deployed on AWS Elastic Beanstalk and commander was deployed on AWS ECS Fargate. For those not familiar, Elastic Beanstalk is a service that lets you pick the language you want to write code in, and AWS templates out the infrastructure, networking, and operational monitoring for your service. This was an easy on-ramp into AWS, and it's where our cloud journey started. As we started to scale, challenges arose with this service and our choice of a deployment runtime. This led us to explore AWS ECS Fargate for our commander service, but we met some challenges there as well. We needed to decide where our coming fleet of microservices would live as we hired more engineers and leaned into a separation of teams and microservices.
Challenges with DevOps
One challenge was that the deployment controls for AWS Elastic Beanstalk were not responsive enough for our needs. After directing the service to deploy new code, it could take tens of minutes for the deployment to cascade across the instances. When a release included potentially breaking changes to a large SQL database, that lag meant a lot of instability. The process was difficult to automate, so deployments ended up being manual activities performed with surgical gloves.
Another challenge with AWS Elastic Beanstalk was that the runtime environment was pinned to the Elastic Beanstalk configuration rather than to the runtime you provided within a container. The AMI supplied by the Elastic Beanstalk service was fixed, and our code had to conform to it. Initially, this didn't seem like a big deal, but over time it became a serious limitation, tying us to progressively older versions of Node.js in our tech stack.
AWS ECS Fargate did offer a more cloud-native approach to running applications. Task configuration allowed us to provide a container, some infrastructure configuration, and some resource constraints. We were forced to take more ownership of decisions around underlying compute resources, networking paths, and load balancing that we didn't have to make with Elastic Beanstalk, but the new deployment context did solve some of the DevOps challenges we had with Elastic Beanstalk. Based on this initial success, we decided to make ECS Fargate the target deployment context for our first new service, gift-of-greenlight.
As we started to define the Fargate task definitions for the new gift-of-greenlight service, it became clear that our Jenkins CI/CD system was not suited to the task at hand. The process of defining and configuring ECS Fargate task definitions was very different from the processes we had built for deployments into Elastic Beanstalk. We found it easy to get started, but much harder to build any kind of consistency across multiple services. Additionally, the Fargate task definition didn't (at that time) have an easy mechanism for separating common elements from environment-specific elements, so we ended up creating separate task definitions for each environment. These task definitions were nearly impossible for us to validate locally, so there was a lot of guess-and-check during deployments into lower environments, and a lot of hope and prayer before deployments into higher environments.
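To illustrate the kind of separation we were missing, here is a hypothetical TypeScript sketch (not our actual tooling; the fields are a trimmed-down subset of the real ECS task definition schema, and the values and URLs are placeholders) that renders per-environment task definitions from a shared base plus a small environment overlay, instead of hand-maintaining a full file per environment:

```typescript
// Illustrative sketch: merge a shared base with a small per-environment overlay
// to produce each environment's task definition.

interface EnvVar {
  name: string;
  value: string;
}

interface TaskDefinition {
  family: string;
  cpu: string;
  memory: string;
  containerDefinitions: Array<{ name: string; image: string; environment: EnvVar[] }>;
}

interface EnvOverlay {
  cpu?: string;
  memory?: string;
  extraEnv?: EnvVar[];
}

const base: TaskDefinition = {
  family: 'gift-of-greenlight',
  cpu: '256',
  memory: '512',
  containerDefinitions: [
    {
      name: 'app',
      image: 'gift-of-greenlight:latest', // the image tag is substituted at deploy time
      environment: [{ name: 'LOG_LEVEL', value: 'info' }],
    },
  ],
};

// Environment-specific knobs stay small and explicit (values are hypothetical).
const overlays: Record<string, EnvOverlay> = {
  staging: { extraEnv: [{ name: 'API_BASE_URL', value: 'https://api.staging.example.com' }] },
  production: {
    cpu: '1024',
    memory: '2048',
    extraEnv: [{ name: 'API_BASE_URL', value: 'https://api.example.com' }],
  },
};

// Merge the base with an overlay to produce the task definition for one environment.
function render(env: string): TaskDefinition {
  const overlay: EnvOverlay = overlays[env] ?? {};
  return {
    ...base,
    cpu: overlay.cpu ?? base.cpu,
    memory: overlay.memory ?? base.memory,
    containerDefinitions: base.containerDefinitions.map((c) => ({
      ...c,
      environment: [...c.environment, ...(overlay.extraEnv ?? [])],
    })),
  };
}

console.log(JSON.stringify(render('production'), null, 2));
```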
Infrastructure was another complication. Before this point, we had relied heavily on the Elastic Beanstalk service to coordinate CloudFormation stacks and create the necessary infrastructure resources. With ECS Fargate, we now had to figure out how to coordinate and manage infrastructure changes ourselves. Initially, this was done with the tried-and-true click-ops method through the AWS console, but with more engineers, and even just a single microservice, the process quickly became unmanageable. To solve this problem, we started working towards an infrastructure-as-code (IaC) model. Much more on our IaC approach in later posts, but the moral of the story is that an ECS Fargate deployment was tightly bound to specific environments, required a lot of infrastructure knowledge from engineers, and proved to be a fragile deployment context within our DevOps implementation.
Challenges with testing
Another major challenge with ECS Fargate as a microservice platform was that each service was intended to be an isolated deployment. The premise was that all traffic would, by default, come through a load balancer from outside the VPC and land on the service within the VPC. This worked great for North/South (N/S) traffic, but it was a challenge for traffic between services within a cluster, otherwise known as East/West (E/W) traffic. This networking constraint was a product of the deployment context and was very difficult to validate in testing. As we introduced more E/W communication paths between services, we introduced more failure points that were very environment-specific.
The E/W communication path introduced other challenges as well. We now had deployment dependencies on specific versions of API contracts, plus the added complexity of service discovery, and given the network routing and configuration complexity mentioned above, this was no small challenge. These environment-specific issues added risk to our testing because many parts of a deployment could not be fully validated locally. This forced us to really focus on and evolve our development and testing practices, a journey we are still traveling today.
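As a rough illustration of what an E/W call ends up depending on, here is a hypothetical TypeScript client wrapper (assuming a runtime with a global fetch, such as Node 18+; the service URL, environment variable, and header name are illustrative, not Greenlight's real ones) that resolves the peer's address from environment configuration and pins the contract version it was built against:

```typescript
// Hypothetical E/W client sketch: the peer's address comes from environment
// configuration, and the expected contract version is sent explicitly so
// mismatches fail loudly instead of silently.
const GIFT_SERVICE_URL =
  process.env.GIFT_SERVICE_URL ?? 'http://gift-of-greenlight.internal:8080';
const GIFT_API_VERSION = '2'; // the contract version this caller was built and tested against

export async function getGift(giftId: string): Promise<unknown> {
  const res = await fetch(`${GIFT_SERVICE_URL}/gifts/${encodeURIComponent(giftId)}`, {
    headers: { Accept: 'application/json', 'X-Api-Version': GIFT_API_VERSION },
  });
  if (!res.ok) {
    // Surfacing the resolved URL makes environment-specific routing failures obvious.
    throw new Error(`gift-of-greenlight returned ${res.status} from ${GIFT_SERVICE_URL}`);
  }
  return res.json();
}
```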
Challenges with scaling
The same slow rollouts that complicated Elastic Beanstalk releases also complicated scaling. It could take tens of minutes to bring new instances online, and if a service experienced a sudden spike in traffic, there was a good chance it could not handle all of the requests until Elastic Beanstalk had time to catch up to the demand. In many cases, this led us to pre-warm instances and over-provision resources to account for known upcoming spikes in load.
Sometimes we could anticipate load spikes, such as scheduled jobs that ran at regular intervals. When we were a smaller engineering organization, everyone knew when the scheduled jobs would occur and knew to avoid that processing window: you didn't deploy at those times, you didn't set up other jobs to run then, and we tuned our scale-up and scale-down schedules in Elastic Beanstalk to be ready for the spikes. As we grew larger, not everyone intrinsically knew every schedule of every job on every team, and collisions inevitably occurred. This created resource contention on the platform, which hurt the performance of the mobile application that depended on the APIs hosted by the service. Our first response to this scaling challenge was to split the deployment footprint of the server-api service: the codebase was configured to run in an online mode for live API requests from the mobile application and an offline mode for scheduled job work. This isolated the traffic and the risk from spiky workloads, but it still didn't solve Elastic Beanstalk's slow capacity scaling or its deployment configuration challenges.
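In spirit, the split looked something like the following minimal TypeScript sketch (not the actual server-api code; the flag name and port are hypothetical): the same codebase boots as an API server or as a job runner depending on a deployment-time flag, so batch work no longer competes with live mobile traffic for the same instances.

```typescript
// Minimal sketch of an online/offline mode split for a single codebase.
import http from 'node:http';

const mode = process.env.SERVER_MODE ?? 'online'; // set per deployment footprint

function startOnline(): void {
  http
    .createServer((req, res) => {
      // ...route live API requests from the mobile and web apps...
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ ok: true }));
    })
    .listen(8080, () => console.log('online mode: serving API traffic on :8080'));
}

function startOffline(): void {
  // ...in reality this would be a proper scheduler; an interval stands in for it here...
  setInterval(() => console.log('offline mode: running scheduled job batch'), 60_000);
}

if (mode === 'offline') {
  startOffline();
} else {
  startOnline();
}
```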
Another scaling challenge for us was unexpected spikes in traffic. When the engineering organization was smaller, most information across the organization was commonly known. As the engineering team grew, it became increasingly difficult to stay in lockstep with other teams in the company. We were no longer small enough to know about everything going on at any given time.
One day, we had an unexpected spike in registrations. Since the bulk of our application functionality was consolidated in a single service, the engineers rotated on-call duties for the entire service. As registration traffic crossed the scaling thresholds defined for server-api, our operational alerts started paging on-call engineers. We were ready with our alerts and action plan, but our response was constrained by the less-than-ideal scaling model of Elastic Beanstalk. In short, we experienced challenges, and this led to change.
What we decided to do
At this point, we started a major step change in the evolution of Greenlight Engineering. Each of these concepts will probably become the subject of another post, but for now, let me summarize some of the major areas of focus.
- Standardize on Kubernetes
- Define patterns for CI/CD pipelines for our Kubernetes deployments
- Define patterns for rollouts and scaling on Kubernetes
- Define a Kubernetes-native API gateway for N/S traffic into our cluster
- Define a pattern for E/W communication between services in the cluster
- Build tooling to enable development teams to operate effectively with microservices
We realized that cross-team communication was hard, and that we needed to enable much smaller and more autonomous deployment units. The overriding objective driving changes in engineering became to allow teams to code, build, deploy, and operate services independently.
What we learned
As we worked through these decisions and reflected on our challenges, we arrived at a few ground truths that we used to rebase Greenlight Engineering. These became the start of patterns, best practices, and architectural tenets that still guide what we do to this day. Here are some of the big lessons we learned during this period:
- Kubernetes is for real, and it unlocks a ton of tooling from the broader k8s community
- The success of Kubernetes and the patterns used can provide guidance for how to think about and design a distributed system. It is a little meta to build an application with the same patterns that are used to orchestrate the deployment units of your application, but good engineering is good engineering.
- Sizing and scoping microservices is actually pretty hard, so spend time doing it. We leaned heavily into Domain-Driven Design (DDD) and resource modeling to help us drive consistency in how we defined the boundaries and API products for new services.
- Where and when to take dependencies is possibly the most important concept in a microservice architecture. Spend time on modeling service boundaries and be mindful of the dependencies between services. High cohesion within a service and low coupling between services is easy to state as a goal, but hard to measure and enforce as you grow. Build patterns and guidance around service dependencies for distributed teams to help your organization steer clear of the dreaded distributed monolith.
- The optimal engineering solution is not always the right decision. Tradeoff analysis is hard, and most of the time it boils down to making sure you have the right weighting on your preferences. Since this is the most subjective part, it is no wonder it is the hardest part of the process. We have a great story on our journey into gRPC, the benefits we gained, the challenges we faced, and the iteration of decisions around how we think about E/W traffic between microservices.
- Use GitOps to control the world. We started with deployments, but we have since moved on to controlling many different processes with a GitOps model. Feature toggles are one of the more interesting problems we have solved with a GitOps workflow (see the sketch after this list). This pattern has really helped us mature controls around sensitive changes and will serve as a great jump point for a later discussion.
- Every tool you add to your platform has an implementation and an operational cost, above the usage cost. Don’t underestimate these costs and be very intentional when dropping new technology components into your stack, especially the ones you will operate. We have had a long discussion chain around big transformative tools like Service Mesh and Kafka. You have to figure out how and when to re-evaluate decisions because when you are growing fast, the context in which you make decisions also changes very quickly.
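For the feature toggle point above, here is a hypothetical TypeScript sketch of the GitOps-style flow: toggle state lives in a version-controlled file that is synced onto the service (for example, as a mounted config file), so every change is reviewed and has history. The path, file shape, and toggle names are illustrative, not our real ones.

```typescript
// Hypothetical GitOps-style feature toggle read from a file synced out of Git.
import { readFileSync } from 'node:fs';

interface ToggleFile {
  [toggleName: string]: { enabled: boolean; allowList?: string[] };
}

const TOGGLE_PATH = process.env.TOGGLE_PATH ?? '/etc/config/feature-toggles.json';

export function isEnabled(toggle: string, userId?: string): boolean {
  // Re-read on every check for simplicity; a real implementation would cache and watch the file.
  const toggles: ToggleFile = JSON.parse(readFileSync(TOGGLE_PATH, 'utf8'));
  const entry = toggles[toggle];
  if (!entry || !entry.enabled) return false;
  if (entry.allowList) return userId !== undefined && entry.allowList.includes(userId);
  return true;
}

// Usage: guard a sensitive change behind a reviewed, auditable toggle.
if (isEnabled('new-registration-flow', 'user-42')) {
  // ...new code path...
}
```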
It is amazing to see how much transformation has happened in such a short amount of time, and how many challenges we have successfully overcome at Greenlight. We look forward to sharing more about our journey, in the hopes it can help you in yours.
- Chris Byrd, Principal Engineer @ Greenlight https://www.linkedin.com/in/cdbyrd