2018-11-20

How to select an API Gateway for (micro)services

TL;DR: Pick an API gateway that goes best with your tech stack, provided it meets the minimum feature/security/maturity threshold.

If you’re starting to implement the API Gateway pattern, in which a single gateway sits in front of a collection of services and routes the requests to them, you have a whole zoo of solutions to choose from. Based on a quick search this morning, you can pick from:

Gateways as libraries, for example Spring Cloud Gateway and Netflix Zuul 2 for JVM, Ocelot for .NET;
From the cloud providers: AWS Application Load Balancer, AWS API Gateway, Azure Application Gateway, GCP Cloud Endpoints;
Envoy, either by itself, or as Ambassador for Kubernetes, or as part of Istio;
Platform solutions like Kubernetes Nginx ingress controller;
Other open source gateways, like KrakenD or Kong (both of which come in open source and enterprise versions);
Other commercial ones, like Cloverleaf.

It’s impossible to give a detailed feature comparison for all of them, but after some experience of selecting and running an API gateway in production, I feel that I’m at least slightly qualified to give advice on how to pick one.

Features - What the gateway needs to do

This gateway can route only to physical addresses.

Before picking a gateway, let’s go over what’s expected from one.

Routing and service discovery

Routing is the most basic feature of an API gateway; without it, there is no point in having one. At a minimum, a gateway should be able to route HTTP requests to other URLs.

Many gateways also come with some form of service discovery, where back-end services register themselves with a registry, and the gateway gets the list of routable services from that registry. In my experience, unless you maintain hundreds or more distinct services, built-in service discovery is not that helpful, because sometimes you may want to keep a service inaccessible through the gateway but still discoverable by other services. (This assumes that the gateway is only used by callers outside your system, like human users or other external callers.) In that case, you’d want the services to opt into the routing explicitly.

On top of that, the platform on which you run the services may already have a built-in rough service discovery capability, e.g. DNS on Kubernetes. It’s also not that hard to roll your own crude “service discovery” with DNS naming conventions.
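As a rough illustration of explicit opt-in routing combined with a DNS naming convention, here is a minimal Python sketch. The route table, the service names, and the `default` namespace are made up for the example; the host name pattern follows the usual Kubernetes convention of `<service>.<namespace>.svc.cluster.local`, which assumes the default cluster domain.

```python
from typing import Optional

# Hypothetical opt-in route table: path prefix -> service name.
# Services that are not listed stay reachable internally, but not through the gateway.
ROUTES = {
    "/portal": "portal-service",
    "/api/orders": "orders-service",
}


def service_host(service: str, namespace: str = "default") -> str:
    """Build a host name from the naming convention (assumes the default cluster domain)."""
    return f"{service}.{namespace}.svc.cluster.local"


def resolve(path: str) -> Optional[str]:
    """Return the upstream URL for a request path, or None if no route matches."""
    # Check longer prefixes first so that the most specific route wins.
    for prefix in sorted(ROUTES, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return f"http://{service_host(ROUTES[prefix])}{path}"
    return None
```

A service that never appears in `ROUTES` is simply not routable from the outside, which is the opt-in behaviour described above.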

Authentication and authorization

An API gateway on the lookout for predators.

Whether this is a requirement will depend on your specific authentication and authorization architecture. However, having a gateway that does not support auth in any form will definitely limit your architectural choices. Fortunately, almost all of them support auth in some way; as of writing this, Azure Application Gateway is the only major exception.

Here, the gateways split into three camps.

Gateways-as-libraries, such as Ocelot or Spring Cloud Gateway, can integrate with their respective frameworks — ASP.NET Core and Spring Framework. This makes it possible to let the gateway fully handle auth using any method supported by the framework (and usually there are a lot of choices) with minimal additional work.

You will also get the choice of letting the same framework handle user sessions in the gateway, which again can be set up with minimal effort.

If you choose not to use JWT (JSON Web Tokens), you will have to invent your own way of passing the user information from the gateway to the back-end services. This was the choice my team made a couple of years ago due to the capabilities of our OAuth 2 provider that we did not have a say in selecting. Today, I’d suggest sticking with JWT, which is widely supported. Depending on your web framework, you may also have to come up with a way of having the back-end services extract JWT/other user info and stick it into the service’s security context.
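To make the "extract the JWT and get the user info out of it" step a bit more concrete, here is a minimal sketch of verifying an HS256-signed token with only the Python standard library. It assumes a shared secret between the issuer and the service, and it deliberately skips expiry (`exp`) and audience checks; the function names are my own. In a real service you would most likely rely on a maintained JWT library or on your framework's support instead.

```python
import base64
import hashlib
import hmac
import json


def b64url_decode(segment: str) -> bytes:
    # JWT segments drop the base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Issue a token; included only so that the example is self-contained."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    """Return the claims if the signature is valid, otherwise raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        signature = b64url_decode(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    # Pin the algorithm instead of trusting whatever the token header asks for.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    # Constant-time comparison of the expected and the presented signature.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))
```

The returned claims (for example `sub`) are what the back-end service would then place into its own security context.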

Delegating gateways: some gateways either require or give an option to authorize each request by making a sub-request to an authorization service, which you’re expected to provide. Kubernetes Nginx ingress controller, Envoy, and AWS API Gateway are in this category. Generally, the gateway would expect a simple pass/fail answer from such service and not anything fancy like a redirect.

Each of these gateways has its own requirements:

AWS API Gateway can delegate authorization sub-requests only to an AWS Lambda or to AWS Cognito. That’s fine if you already have both feet in the AWS serverless world, but not great otherwise.
Kubernetes Nginx ingress controller wraps Nginx auth functionality in Kubernetes annotations. In this setup, Nginx makes an HTTP sub-request to a service that’s expected to return 2xx or 401/403.
Envoy expects a gRPC or HTTP service that handles a specific request/response schema. It passes the whole request context (location, headers, etc.) to the delegated service. Envoy can also verify JWTs.

In all of these cases, the gateway’s primary role is routing, and you’re expected to bring your own components for authentication, sessions, and propagating user information to the back-end services.
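For a sense of how small the delegated authorization service can be, here is a sketch of just its decision logic in Python: given the headers of the sub-request and the originally requested path, it answers with a status code in the pass/fail style described above (2xx to allow, 401 or 403 to deny). The in-memory session table and the token values are hypothetical, and how the original path reaches the service depends on the gateway.

```python
from typing import Mapping

# Hypothetical session store: bearer token -> path prefixes the user may access.
SESSIONS = {
    "token-alice": {"/portal", "/api/orders"},
    "token-bob": {"/portal"},
}


def authorize(headers: Mapping[str, str], original_path: str) -> int:
    """Return the status the gateway should see: 200 to allow, 401 or 403 to deny."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or token not in SESSIONS:
        return 401  # no credentials, or credentials we do not recognise
    for prefix in SESSIONS[token]:
        if original_path == prefix or original_path.startswith(prefix + "/"):
            return 200
    return 403  # authenticated, but not allowed to reach this path
```

Everything else (logging users in, creating the sessions, passing user information on to the back-end services) still has to live somewhere, which is the "bring your own components" part.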

Standalone gateways with auth provide opinionated authentication/authorization support, usually limited to JWT (JSON Web Tokens):

KrakenD works exclusively with JWT. It can verify, sign, and revoke them, but it cannot issue them. You’d need to bring your own service that would log in users and issue tokens.
Standalone Envoy can also verify JWT tokens, but it cannot handle the revocation.
Istio, which comes with a built-in Envoy, supports only JWT and has streamlined integrations with several sources. Istio docs don’t mention support for token revocation. The price of Istio is that you have to install, manage, and use it; Istio includes a lot more than the gateway. This works well if you’re already planning to use Istio, but is not great otherwise.
Kong provides the most authentication options among the standalone gateways; some of the options are available only in the enterprise version. Kong also provides its own user repository. The possible downside with Kong is that you’d have to set up and manage Postgres or Cassandra. This is not a big deal if you already use either of these, but not optimal if your organization has deep expertise in other database types.
Google Cloud Endpoints supports JWT, and also has streamlined integrations with various sources. But, as with AWS, you’d need to be on GCP to take advantage of this gateway.

Overall, by now you should start seeing a subtle theme: if you’re using a gateway that integrates very well with your tech stack, you’ll be able to set up more features faster, and it will be easier to maintain. Later in this post, I will make it a lot less subtle.

Routing to static websites

You may be interested in deploying website front-ends as static sites behind the gateway, e.g. in S3 buckets or in containers that have only Nginx and the website files. This has a few benefits:

All API URLs in your front-end code will be in the form of "/some/path" instead of envHostname + "/some/path".
The gateway can block unauthorized users from loading a front-end page.
Potentially simpler local development experience.
Not having to deal with CORS.

If you do that, you’ll run into an interesting routing conflict. Many front-end frameworks such as React and Vue have their own routing that rewrites the contents of the browser URL bar to make it look like the user has moved to another page. The new URL might not correspond to a real resource on the back end, and when the user presses Refresh in the browser, your system should be prepared to deal with it.

For example, if the user presses the refresh button and the browser tries to load /portal/settings/app.bundle.js, your system has to know that it should go to the portal-service and load /app.bundle.js instead of /settings/app.bundle.js.

You have several options to handle this:

Have the gateway rewrite the URLs before forwarding the request to the static website service. This requires a gateway that can match and rewrite URLs based on a regex prefix. Some gateways can do that (Spring Cloud Gateway, Envoy), while others can’t (AWS API Gateway). Alternatively, if your gateway supports custom filters, you could write one to deal with this problem.
Let the static site service deal with it, by either setting up the appropriate rules in Nginx or wrapping the static site in a framework of your choice (e.g. Spring) and letting it handle the routing issues. The latter option will add slight complications all over the place and will require more compute resources.
Don’t host static sites behind the gateway; instead, put them on a CDN, or somewhere else. You’ll need to deal with CORS (generally not a big problem), and you may have to control authorization separately.
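The first option, rewriting the URL in the gateway, could look roughly like the Python sketch below. It assumes that every asset of a static site lives at the site root and that anything without a file extension is a client-side route that should receive the app shell; the site table and the `index.html` name are assumptions made for the example.

```python
from typing import Tuple

# Hypothetical mapping: first path segment -> static site service.
STATIC_SITES = {"portal": "portal-service"}


def rewrite(path: str) -> Tuple[str, str]:
    """Map an incoming path to (service, path to request from that service)."""
    segments = [s for s in path.split("/") if s]
    if not segments or segments[0] not in STATIC_SITES:
        raise LookupError(f"no static site for {path!r}")
    service = STATIC_SITES[segments[0]]
    last = segments[-1]
    if len(segments) > 1 and "." in last:
        # Looks like a file: serve it from the site root and ignore
        # any client-side route segments in between.
        return service, "/" + last
    # Otherwise it is a client-side route: serve the single-page app shell.
    return service, "/index.html"
```

With this rule, a refresh on /portal/settings still loads the application, and /portal/settings/app.bundle.js resolves to /app.bundle.js on the portal-service.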

Protocols and TLS offloading

Generally, all gateways can handle HTTP/1.1 and TLS offloading, and by now the majority can handle incoming HTTP/2 connections. It would be surprising if your top gateway choice could not do these.

When it comes to protocols, the main differentiators are:

Whether the gateway can handle WebSockets: most can, but some can’t (AWS API Gateway, Netflix Zuul 1). This may be either critical for you, or completely irrelevant. However, if you pick a gateway that doesn’t support WebSockets and it turns out that you need them, you will spend a lot of effort on workarounds (that happened to us).
Whether the gateway can use HTTP/2 to communicate with the back-end services. For most people, it will not be critical, but you do get better performance with HTTP/2, so it’s definitely a nice feature to have.

With TLS, your biggest problem will be not whether the gateway can handle it, but how you’re going to store, distribute, and rotate the certificates securely. Certificate renewal can be painful, especially if you do it infrequently, and I suggest awarding bonus points to gateways that support easy certificate provisioning/rotation.

For example, Kubernetes Nginx ingress controller can be set up to renew the certificates from Let’s Encrypt automatically. On AWS, using a certificate from the AWS Certificate Manager is quite painless.

But on Azure, as of writing this, you cannot just link a certificate from the Key Vault to the Application Gateway.

Alternatively, you don’t have to do TLS offloading on the gateway. Instead, you could place another piece of infrastructure in front of the gateway (e.g. AWS Elastic Load Balancer) and have it offload TLS. This is the setup that we went with, and I think we got the best of both worlds: easy TLS certificate management and other features from AWS ELB, and an API gateway that fit our service ecosystem like a glove.

Other filters

Broadly, filtering is the stuff that the gateway can do between receiving the incoming request and forwarding it to the back-end service, or between receiving and forwarding the response.

In addition to the features mentioned in previous sections, filters can provide extra protection (rate limiting, header sanitization), fault-tolerance (retries), or other useful features (gzip compression, redirects). I’d classify these in the same category as heated car seats in Canadian winter: they’re handy, but if they’re not present, there are other ways to deal with these problems. That said, depending on your scale and architecture, some of these may turn out to be very useful to have specifically in the gateway, so I suggest thinking about that in advance.
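As an example of what such a filter does under the hood, here is a minimal token-bucket rate limiter in Python. This is a generic technique, not the implementation used by any particular gateway; the class name and parameters are my own, and the clock is injectable only to make the behaviour easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A gateway filter would typically keep one bucket per client or per route and answer with HTTP 429 when `allow()` returns False.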

Logging and request tracing

All gateways have some request logging capability; some can add tracing information to the requests for easier request correlation in the logs, e.g. tracing in Envoy. Tracing can be very useful when the request might be handled by multiple services, though as with most other filters, there are other solutions for it.
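If your gateway does not add tracing information, a small filter can do the basic version: make sure that every request carries an ID that all services write to their logs. The sketch below assumes the header is called `X-Request-Id`, a common convention; use whatever name your logging setup expects.

```python
import uuid
from typing import Dict, Mapping

TRACE_HEADER = "X-Request-Id"  # assumed header name


def with_request_id(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers that carries a request ID, keeping an existing one."""
    out = dict(headers)
    # HTTP header names are case-insensitive, so compare them in lower case.
    if not any(name.lower() == TRACE_HEADER.lower() for name in out):
        out[TRACE_HEADER] = str(uuid.uuid4())
    return out
```

Whether to keep an ID supplied by an external caller, as this sketch does, or to always overwrite it at the edge is a design choice that depends on how much you trust the callers.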

For me, the main differentiator is the timeliness of the logs. A gateway that logs to stdout or a file can have its logs shipped to a log aggregator like Splunk or SumoLogic in under a second. AWS API Gateway logs can show up in AWS CloudWatch after as much as a minute, and log aggregators often poll the S3 bucket with logs less often than that.

This difference may not seem like much, but which one would you rather have when you’re trying to troubleshoot a live issue?

Developer friendliness

Once you have a gateway, someone will have to feed it and take it out for walks. Let’s think through that ahead of time.

He may be cute, but he's a lot more work than you expect.

Organizational fit

Let’s imagine you just got a gateway. Now, someone will have to:

Deploy and manage it.
Deploy and maintain the configuration, including the routes.
Handle unforeseen problems and requirements.
Run multiple environments (production, test environments, local development environments).
Promote the configuration changes between the environments.
Upgrade and apply security patches for the gateway, the host server, and, if you’re using one, the container.
Have a disaster recovery plan in case everything goes wrong.

All of this has to be done efficiently, and while keeping production errors to a minimum.

Also, consider how the gateway will fit into your organization. If the gateway is entirely different from the rest of your service ecosystem, and if the process to configure it is complicated, you will end up with one or a handful of people who will be “gateway people”, and every time the gateway needs to be touched, one of them will have to be involved.

This can lead to more handoffs, slower deliveries, avoiding the gateway (or worse, working around it), and key-person risk.

If configuring the gateway is easy, and if the gateway looks and feels like just another service in your system, then any experienced developer will be able to work on it.

For this reason, I recommend the following:

Pick a gateway that most of your developers will be comfortable with.
Take an infrastructure-as-code approach to the gateway deployment and configuration, keeping the configuration in source control and deploying both the gateway and the configuration automatically.
Make it easy to set up and manage the routes.
Pick a gateway that’s easy to patch and upgrade; ideally, one that is distributed through a repository like Docker Hub, npm, or Maven.

Managing the routes

Broadly, there are two ways to make the route configuration easy to maintain:

Keep all the route configuration in a central repository and deploy it automatically, or
Whenever a new service is created, update the route on the gateway, either through a REST call or through service discovery.

At this point, every gateway supports one or the other or both. But whatever the approach, keep in mind that you may want to occasionally repoint the route from one service to another without changing the service names, or have a route pointed to something outside your service ecosystem.

At HOOPP, we’ve set up the route configuration in our regular configuration git repository, along with the rest of the gateway config. Any developer can submit a pull request with a change, and many have. The pull request has to pass a CI build that lints the config files, and be approved by someone on the Core Platform team (this usually happens within 5-60 min). Then the config is automatically applied to test environments, and, if those start up successfully, to production. The gateway configuration is updated maybe every couple of weeks; aside from a few initial hiccups, it has not caused any production issues. But if an issue does happen, git has the full revision history, and any change is easy to revert.

This is not the only way to do it, and probably not the best, but it was easy to set up and results in the least amount of work for everyone involved.
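To give an idea of what a lint step for route configuration might check, here is a small Python sketch. It is not our actual CI job; it assumes the routes have already been loaded into a dictionary of path prefix to target URL, and the three rules are just examples.

```python
from typing import Dict, List
from urllib.parse import urlparse


def lint_routes(routes: Dict[str, str]) -> List[str]:
    """Return a list of readable problems; an empty list means the config passes."""
    problems = []
    for prefix, target in routes.items():
        if not prefix.startswith("/"):
            problems.append(f"{prefix!r}: path prefix must start with '/'")
        if len(prefix) > 1 and prefix.endswith("/"):
            problems.append(f"{prefix!r}: path prefix must not end with '/'")
        parsed = urlparse(target)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            problems.append(f"{prefix!r}: target {target!r} is not an http(s) URL")
    return problems
```

The CI build would fail the pull request whenever the returned list is not empty, so the reviewer only sees changes that are at least well-formed.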

Local development experience

Something that’s easily overlooked: if you’d like the local development environment to resemble production as closely as possible, then you’d need to find a way to run the gateway on the developers’ workstations, including providing a route map that’s appropriate for local development, i.e. one that points to localhost routes.
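One way to get a local route map without maintaining it by hand is to derive it from the production one. The Python sketch below assumes that each service has an agreed local port; the service names and port numbers are invented for the example.

```python
from typing import Dict
from urllib.parse import urlsplit, urlunsplit

# Hypothetical convention: each service listens on a fixed port on the workstation.
LOCAL_PORTS = {"portal-service": 8081, "orders-service": 8082}


def localize(routes: Dict[str, str]) -> Dict[str, str]:
    """Point every known service at localhost; leave unknown targets untouched."""
    local = {}
    for prefix, target in routes.items():
        parts = urlsplit(target)
        port = LOCAL_PORTS.get(parts.hostname)
        if port is None:
            # Not a service we run locally (e.g. an external URL): keep as-is.
            local[prefix] = target
        else:
            local[prefix] = urlunsplit(
                ("http", f"localhost:{port}", parts.path, parts.query, parts.fragment)
            )
    return local
```

Developers who prefer not to run every service locally could override individual entries to point at a shared test environment instead.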

At HOOPP, we found that some developers hated the idea of running the gateway locally, and some thought it was essential. You may need to accommodate both, but if you have to pick, I’d pick the run-the-gateway-locally side, and make the local startup and config management as fast and as painless as possible.

Gateways-as-libraries have an unfair advantage here. Since you already know how to start and manage configuration for your other services, and gateway-as-library is just another service, you will have an inherent head start when working with them.

Other considerations

There are other criteria when picking a gateway, but most of them tend not to be big differentiators for the major gateway solutions:

Documentation: is it thin on information, or is it extensive and detailed?
Answer availability: is it easy to search for answers on Stack Overflow, Google, GitHub, or the vendor’s forum?
Maturity: more mature products tend to be more secure and more battle-tested.
Adoption: is it popular? Winners typically attract more resources for improvements, documentation, security patches, plugins, etc.
Security: what’s the vendor’s security posture? Have they had security audits, is there a bug bounty program, etc.? If you’re not sure how to evaluate that, I suggest involving someone with information security expertise.

Making the choice

Unless you’re prepared to have a team dedicated to the gateway, you’re almost certainly better off picking one that (1) is easy for you to automate and integrate into your systems, and that (2) your developers can support with minimal additional training.

Once you apply all the constraints from the preceding 2,800 words of the post, you will find that gateway-as-library in your organization’s language of choice is likely the front-runner, provided that it exists and satisfies the requirements in this post.

In the absence of that, the top choice would be the gateway that can be easily deployed on your existing infrastructure, like KrakenD or Envoy on a container-based infrastructure. This usually gives enough options that you don’t have to venture farther afield to pick another gateway.