The Journey to Service Mesh. Part 2: How We Met Linkerd
This series is about the practical use of a service mesh at ANNA Money. In the previous part, we talked about our baby steps: why we chose Netramesh instead of Istio, and the benefits and drawbacks of that architectural decision. This part covers our hands-on evaluation of three service mesh solutions. In the end, we find out who wins the local mesh battle.
As the number of services grew, we faced more and more of the problems described in the first part. At some point, we realized that we needed something more than traces in Jaeger.
We collected our needs into a shortlist of requirements for the target service mesh solution:
- mTLS to secure internal traffic
- service-to-service authorization rules to prevent unauthorized access inside the internal perimeter
- blue-green deployments via traffic splitting, to give developers the ability to make canary releases
- robust load balancing, because the default implementation of the Service concept in K8s (GKE) uses the round-robin algorithm
- observability via distributed tracing, golden metrics, and a dedicated UI
- resource consumption as low as possible (we covered why this is crucial in the previous part)
- a pretty UI that has enough features and is fast and responsive
- easy configuration to eliminate complexity
Netramesh, which we were using, did not fit most of these bullets, so we started looking around.
Kuma
The first solution we tried was Kuma. Kuma's unique feature is multi-mesh: support for multiple isolated meshes under a single control plane. It is a really cool concept if you have many teams or business domains and want to isolate them from each other (a mesh per team/domain). Everything else is similar to any other service mesh solution.
We installed Kuma with its bundled tool named kumactl. The initial installation raised no questions, but we strongly recommend installing via the Helm chart and running the control plane on a separate node pool.
Strangely, the Helm chart for Kuma does not support tolerations. We will come back to Helm chart quality in the next part.
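Tolerations matter here because a dedicated node pool is normally tainted so that ordinary workloads avoid it, and the control plane pods must tolerate that taint to land there. Below is a minimal sketch of the standard Kubernetes pattern we expected the chart to render; the pool name, labels, and image tag are all illustrative:

```yaml
# Taint the dedicated pool first, e.g.:
#   kubectl taint nodes -l cloud.google.com/gke-nodepool=mesh-pool dedicated=mesh:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuma-control-plane              # illustrative name
  namespace: kuma-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kuma-control-plane
  template:
    metadata:
      labels:
        app: kuma-control-plane
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: mesh-pool   # pin to the dedicated GKE pool
      tolerations:
        - key: dedicated                # matches the taint above
          operator: Equal
          value: mesh
          effect: NoSchedule
      containers:
        - name: control-plane
          image: kumahq/kuma-cp:1.0.0   # illustrative image/tag
          args: ["run"]
```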
Kuma has a pretty UI with wizards for creating Kubernetes resources, which is helpful for beginners. Most of the features we tried worked as expected.
Now let's move on to the problems we had. The first and biggest one was high resource consumption; we already explained in the previous part why this issue matters to us.
The second problem: unfortunately, like Istio, Kuma struggles with distributed tracing and requires modifying applications to propagate tracing headers.
Strangely enough, none of the service mesh solutions have Netramesh's transparent distributed tracing, so this was not an argument against Kuma specifically.
The next thing that matters is mTLS, and when we turned it on, we broke our test environment, because nginx-ingress has no integration with Kuma. In fact, Kuma forces you to use its affiliated product, Kong, as the ingress. That is not what we expected, since we have no plans to change our ingress controller.
And without mTLS, some features, such as traffic permissions, do not work at all.
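For context, here is roughly what enabling mTLS and restricting traffic looks like in Kuma. This is a sketch based on Kuma's Mesh and TrafficPermission resources; the rule name and service tags are illustrative:

```yaml
apiVersion: kuma.io/v1alpha1
kind: Mesh
metadata:
  name: default
spec:
  mtls:
    enabledBackend: ca-1
    backends:
      - name: ca-1
        type: builtin                   # Kuma issues and rotates certificates itself
---
apiVersion: kuma.io/v1alpha1
kind: TrafficPermission
mesh: default
metadata:
  name: api-to-backend                  # illustrative rule name
spec:
  sources:
    - match:
        kuma.io/service: api_anna_svc_80        # illustrative service tags
  destinations:
    - match:
        kuma.io/service: backend_anna_svc_80
```

Without the mtls section in the Mesh resource, the TrafficPermission rules are not enforced.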
So, it was an interesting experience with a new product, but Kuma did not fit our needs and architecture.
Istio
Despite our previous failed attempt to adopt Istio as a service mesh solution, we had to give it another chance, because it is the most popular mesh at the moment.
To our disappointment, it had not become any better. All the good old problems were still present: high resource consumption, a complex and slow UI, and the requirement to modify applications to propagate traces across services.
It also has a similar problem with mTLS enforcement. Istio does not require Kong, but it does require its own ingress and a specific configuration for it. And like Kuma, it breaks an environment that runs the Nginx controller.
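To illustrate the scale of the switch: in Istio, strict mTLS for the whole mesh is a single resource, and turning it on is exactly what rejects plain-text connections from pods without sidecars, such as an nginx-ingress deployment outside the mesh. This sketch uses Istio's PeerAuthentication API:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system     # in the root namespace the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT              # non-mTLS (sidecar-less) clients are rejected
```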
So, our decision for Istio was NO.
Linkerd
Linkerd positions itself as a light and simple solution without unnecessary complexity. It does not try to replace any component of your infrastructure with itself; instead, it reuses existing components and concepts.
The installation process with the linkerd CLI is very easy, but we recommend using the Helm chart for production.
Let's start with the mTLS implementation. In contrast to Istio and Kuma, traffic encryption is enabled by default, and it does not break any part of the existing infrastructure. After installation, the Nginx ingress controller worked as expected, which felt like magic.
While migrating to Linkerd, some connections between services remained unencrypted, because we rolled the mesh out step by step.
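Step by step here means opting namespaces into the mesh one at a time via Linkerd's proxy injection annotation; a minimal sketch (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments                   # illustrative namespace
  annotations:
    linkerd.io/inject: enabled     # pods created here get the sidecar proxy
```

Traffic between two meshed pods is encrypted automatically, but connections to or from pods that have not been injected yet stay in plain text.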
Also, keep in mind that this kind of implementation requires controlling traffic flow with network policies or another tool.
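For example, a plain Kubernetes NetworkPolicy can restrict who reaches a service while the mesh itself has no authorization rules yet; a sketch with illustrative names and labels:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-api-only     # illustrative name
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: backend                 # protect the backend pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: api            # only pods from namespaces with this label
```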
Service-to-service authorization rules are promised in future Linkerd releases, so we were OK with waiting.
As Linkerd is ultra-light, it does not support canary deployments out of the box, but it has a core feature named Traffic Split. To make a canary deployment, you write an SMI TrafficSplit resource for Kubernetes and apply it. To automate blue-green deployments, Flagger is the recommended tool.
In fact, you can build any tooling on top of the TrafficSplit API, but for us, Flagger had enough features.
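A hand-written split that sends 10% of traffic to a canary looks roughly like this (Linkerd consumes the SMI split.smi-spec.io API; the service names are illustrative):

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: backend-split              # illustrative names throughout
  namespace: payments
spec:
  service: backend                 # the apex Service that clients call
  backends:
    - service: backend-stable      # current version keeps 90% of traffic
      weight: 900m
    - service: backend-canary      # canary receives 10%
      weight: 100m
```

Flagger automates exactly this: it creates the canary workload, shifts the weights gradually, and rolls back on bad metrics.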
We were very excited by the real resource consumption of the Linkerd components. You can read more about benchmarking in the Linkerd blog. And it is true: Linkerd is very efficient in this respect, and we saw no difference in resource usage between it and the Netramesh we used before.
You could say that a control plane with additional components requires a separate node pool, which costs money, but there is no free lunch.
The last thing we want to mention is that the UI (and the CLI tool) is very intuitive, light, and responsive. It becomes a little slower when a namespace contains many resources (deployments/pods/etc.), so our advice is to keep namespaces small. We chose a namespace-per-business-domain strategy.
Drawbacks
We should say that Linkerd (like the other mesh solutions) requires application modification to support distributed tracing between services. We lost our traces in Jaeger right after adoption. That was sad.
Linkerd has only one load balancing strategy (though it is EWMA, which is cool), does not support circuit breaking, and offers limited load balancing customization overall.
Comparison Table
We made a table to compare these solutions from our perspective:

| | Kuma | Istio | Linkerd |
| --- | --- | --- | --- |
| Resource consumption | High | High | Low |
| mTLS with our nginx-ingress | Breaks it (pushes Kong) | Breaks it (wants its own ingress) | Works out of the box |
| Transparent distributed tracing | No (app changes needed) | No (app changes needed) | No (app changes needed) |
| Service-to-service authorization | Traffic permissions (mTLS required) | Yes | Planned in future releases |
| Traffic split / canary | Yes | Yes | SMI TrafficSplit + Flagger |
| UI | Pretty, wizards for beginners | Complex and slow | Fast, light, responsive |
Let’s summarize the second part
We did R&D on three different service mesh solutions: Kuma, Istio, and Linkerd.
We chose Linkerd because it fits our needs best. Maybe for your workload, architecture, or scale, another solution would be a better fit.
We lost traces in Jaeger and faced some non-trivial issues during the migration. We sorted them out, and in the next part we will share our experience of making adoption easier.