The Journey to Service Mesh. Part 1: Eliminate the Leader
This series is about the practical usage of Service Mesh at ANNA Money. In this part, we cover our early attempts to adopt the concept.
When the number of microservices grows, new problems appear. The first one is the number of hops from service to service (no matter whether horizontal or vertical). Each hop is a potential point of failure for the whole request chain, and that failure affects customers.
The second one is the loss of transparency: it becomes hard to understand what is going on in the system, for example, when a bug or error appears or latency suddenly increases.
The third is about security. Usually, requests between services (let's call them internal requests) are trusted by default. Once one service is compromised, an attacker can call any other service over the internal API, which is usually far less protected than the external one.
Istio is a widespread Service Mesh implementation that can solve all of these issues, so we looked at it first. It offers traffic management, observability, and security.
Do you know that feeling when you look at a landing page and think the tool is a silver bullet?
We thought exactly that at the time. Just install some pods, add annotations, maybe a few configs, and it should work, right?
Unfortunately, no. We quickly ran into three major problems.
First problem: resource consumption
Our test cluster may be small in terms of compute resources (CPU/memory), but it is much bigger in terms of the number of Kubernetes objects: we pack our test environments very densely to save money.
When we launched Istio, we realized that it costs more than twice as much as any of our test environments. Surprise, huh? I am not saying that all these features should come with no overhead, but it was too much for us. Maybe it was not even about money but about psychology.
Ok, we decided to limit service mesh usage to the stage and production environments and leave the test environments alone to reduce costs.
Second problem: complicated and slow UI
Istio does not ship a UI of its own. Instead, it provides Grafana dashboards and the Kiali UI. I want to note that Istio itself, and Kiali in particular, have a far-from-intuitive and rather complicated installation and configuration process.
The UI was slow. Sometimes we did not get any information at all and observed a lot of timeout errors. Probably we did not allocate enough resources, which brings us back to the first problem. Another issue is that the UI is unintuitive and complicated to use.
Ok, maybe we just needed to spend more time on tuning and learning. I think we would have gone that way, but we hit one more problem.
Third problem: distributed tracing does not work
How could that be? That was the question we asked when we installed and set it up in the stage environment. In fact, Istio does generate traces, but without header propagation at the application layer, they are just disconnected pieces. It was useless.
We did not want to instrument or modify applications, and that was the last straw. We decided that we needed another solution because distributed tracing was critical for us.
Netramesh to the rescue
We started looking around and found a talk by Aleksandr Lukyanchenko from Avito (https://www.youtube.com/watch?v=NY9bYfyACyk, in Russian) about their problems with Istio's overhead and resource consumption at scale. To solve them, they created and open-sourced their own service mesh implementation: Netramesh.
The killer feature of Netramesh is the absence of a control plane and a minimal feature set (sometimes less is better). This means minimal overhead and virtually unlimited scalability. Most importantly, it could solve our distributed tracing problem without any modification of applications: we already propagated the X-Request-ID header (it could be any other header that is unique per request), and that was enough.
Make it usable
Netramesh has no injector, so we have to add the proxy and init containers directly to our deployments; a rough sketch of what that looks like is below. It is not rocket science, but there are a few things to keep in mind.
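The sketch only shows the shape of the change: an init container that redirects pod traffic to the proxy, plus the proxy sidecar running next to the application. The container names, images, and redirect details are placeholders for illustration, not the actual Netramesh manifests.

# A rough manual-injection sketch; names and images are placeholders
spec:
  initContainers:
    - name: netra-init
      image: netra-init:placeholder            # redirects pod traffic to the proxy (iptables)
      securityContext:
        capabilities:
          add: ["NET_ADMIN"]                    # required to change iptables rules
  containers:
    - name: netra-proxy
      image: netra:placeholder                  # the sidecar proxy that captures traffic and emits spans
    - name: app
      image: our-application:latest             # the application itself, unchanged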
First, a lifecycle preStop hook is required for the proxy container, and it should be longer than the application's hook:
# we add 5 more seconds for the proxy container
lifecycle:
  preStop:
    exec:
      command: ["sleep", "{{ (sleep | int) + 5 }}"]
This is required because the application should be gracefully stopped before the proxy container stops; otherwise, requests still passing through the proxy will be rejected. Note that the pod's terminationGracePeriodSeconds must be long enough to cover these sleeps.
Second, if your application starts too fast, it hits the same problem during the startup phase: the application may be ready before the proxy is. So we added a sleep to the application launch command as well (see the sketch below).
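Here is a minimal sketch of that startup delay, assuming a generic application container; the five-second value, image, and launch command are purely illustrative:

containers:
  - name: app
    image: our-application:latest
    command: ["/bin/sh", "-c"]
    # give the proxy time to start before the application begins serving and making requests
    args: ["sleep 5 && exec ./run-app"]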
Third, we discovered that operation names in Jaeger are unusable without additional processing.
As you can see in the screenshot, there are many similar operations (i.e., paths), and their count can grow indefinitely. As a result, it is hard to get aggregated results for paths that contain identifiers.
Also, various system paths (for example, healthchecks) showed up among the operations. We wanted to cut them off because they were just noise.
We added the ability to exclude some paths from tracing via the NETRA_HTTP_TRACING_IGNORED_PATHS environment variable of the proxy:
env:
  - name: NETRA_HTTP_TRACING_IGNORED_PATHS
    value: "/api/healthz,..."
Also, we added the ability to merge paths into a single operation by regexp replacement. For example, we know that an identifier is 24 characters long, so we can use the following regexp pattern via another proxy environment variable:
env:
  - name: NETRA_HTTP_OPERATIONS_UNION_REGEXPS
    value: "regexp1;/[a-zA-Z0-9]{24};..."
Nowadays, the OpenTelemetry Collector can solve similar problems. We will touch on it in the following parts.
Then we redeployed our applications one by one, and voila, we got beautiful traces:
Let’s summarize the first part
We decided not to use Istio because:
- it is a complicated tool
- it has a significant overhead for us
- it requires application instrumentation
We chose Netramesh and achieved:
- distributed traces that improve observability
- low overhead
- easy explicit configuration
We have lived with Netramesh for a year, and some time ago, we decided we wanted a full-featured service mesh with mTLS, traffic splitting, and robust load balancing. See you in the next part, where we cover the next steps in our journey.