In the following article we will explore the capacity of the ELK stack to get the most from a microservices approach to software development. Microservices bring plenty of opportunity for a more efficient approach to creating software products, along with some drawbacks and potential risks. When properly used, ELK has the potential to reduce these risks and help you deliver quality in an efficient way.
We have no idea which came first, the chicken or the egg, but we know that the Monolith ignited the changes that we observe every day in IT. The Monolith: the big, the scary, the unchangeable, the breakable, the Return on Investment (ROI) eater. And then, out of nowhere, Microservices arrived. Those small pieces of software conquered the Monolith, split it into pieces and made changing software easy again. But you, reader, know best that requirements always change. Code becomes obsolete and hard to maintain. You may have a feeling that a new danger for your ROI is waiting just over the horizon. What may it be? How do you deal with it?
Why use a Microservices Approach?
Getting into microservices seems easy at first glance. When you need to create shiny new features for your clients, developers create small pieces of software from scratch. Developers love new software: it is clean and manageable, with no dependencies and no legacy stuff. They can use new technologies and learn while doing their job, so their morale is high. You are happy that features are created quickly, developers are happy, but where is the catch?
The catch is that size matters. Creating tens, hundreds or thousands of Microservices creates a huge network of connections and dependencies between them (Microservices hell). Before Microservices, a Monolith crash was obvious: none of your clients could use it anymore. Nowadays failures may be intermittent and hard to find. Are you sure that your software product meets the Service Level Agreement (SLA)? Are you sure that all payment transactions end with success? Are you sure that your clients love your product and will not give up before your goals are met?
What can you do about it?
You may say that answers to those questions can easily be found using tools such as Google Analytics. Sure, it’s a great tool, but due to its nature, it does not know much about your infrastructure, servers, ISP connection, or your servers’ CPU and RAM usage. All of those resources affect your clients’ experience. Some companies may also have policies ensuring that their data does not leave their servers. So how can you deal with that?
The ELK Stack may help you with that. It is a group of three tools, ElasticSearch, Logstash and Kibana, which were created to help monitor your systems’ infrastructure and business goals. It’s a single point of information where you can find all the data that you need.
Theory of the ELK Stack
ElasticSearch is a NoSQL document database, well suited to storing time-related data. It stores events containing data relevant to your needs: logs from production, operating system metrics, or even metrics of your CI/CD process. ElasticSearch implements many aggregation functions that enable fast information retrieval over long periods of time (for example: what was the most popular browser in your solution last year, and should you still support Internet Explorer?).
Logstash is a piece of software that can play multiple roles. It was designed to be a log-file parser, because text parsing may be a resource-intensive process. It can also be a point of aggregation for all of your services. It can tag events from your system exactly as you need and store them in ElasticSearch or in other types of services or databases.
Lastly, Kibana is a web application that reads data from your ElasticSearch and displays it for you. It’s an easy-to-use, modern-looking tool for visualizing your system’s behavior and health.
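The browser question above maps naturally onto a terms aggregation restricted to a time range. The following is a minimal sketch of such a search body, built as a plain Python dictionary. The field names `@timestamp` and `user_agent.name` are assumptions about how the events were indexed, and the function name is made up for illustration; your own mapping may differ.

```python
import json


def top_browsers_query(field="user_agent.name", top_n=5):
    """Build a search body that counts events per browser over the last year.

    The field names are assumptions; adjust them to your own index mapping.
    """
    return {
        "size": 0,  # return only aggregation results, no individual documents
        "query": {
            "range": {"@timestamp": {"gte": "now-1y/d", "lte": "now/d"}}
        },
        "aggs": {
            "browsers": {"terms": {"field": field, "size": top_n}}
        },
    }


body = top_browsers_query()
print(json.dumps(body, indent=2))
```

A body like this would typically be sent to the `_search` endpoint of the index holding the access events, or pasted into the Kibana developer console.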
My experience with the ELK Stack
Some time ago I was working at a company that delivered products with a certain SLA. The system evolved from the Monolith into a distributed services network, and then started to change into Microservices. It was monitored by a dedicated team who used the logs that developers left in the code. That model worked fine, but sometimes problems were indicated not by the logs, but by clients calling, angry that they could not use our product. It made me wonder how that was possible.
I spoke with the dedicated monitoring team, the Ops team and the Networking team, and none of them saw any issues in their data. Looking for help, I found the ELK Stack and, after some negotiations with the manager and the Ops and Monitoring teams, we deployed it. We started with what I think is the most important part, monitoring the input and output of our system: the access.log files.
This magic file is created by a web server and contains a lot of information about your system. For example, what endpoint was called, by whom, was the response successful or not, how big was the payload, how long did it take your system to process it, or even what browser your user used. A lot of useful data. Using that data we discovered some bad things. Some services had nearly 10% of error responses that we weren’t aware of. There were also periods when requests took a lot of time that we didn’t know about either. It took some drilling to fix that, but this example showed us that such a simple thing as an access.log file may improve your SLA and user experience.
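To make the idea concrete, here is a small sketch of how an error rate per endpoint can be derived from access.log lines. It assumes the widely used common log format and treats only 5xx responses as errors; both are assumptions, and the sample lines and function names are invented for illustration. In a real deployment Logstash or Beats would do the parsing and ElasticSearch the counting.

```python
import re
from collections import defaultdict

# Matches the start of a common-log-format line (assumed format).
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one access.log line into an event dict, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event


def error_rate_by_path(lines):
    """Return the share of 5xx responses for every endpoint seen in the lines."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        event = parse_line(line)
        if event is None:
            continue  # skip lines that are not in the expected format
        totals[event["path"]] += 1
        if event["status"] >= 500:
            errors[event["path"]] += 1
    return {path: errors[path] / totals[path] for path in totals}


# Invented sample lines, for illustration only.
SAMPLE = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /api/orders HTTP/1.1" 200 512',
    '203.0.113.8 - - [10/Oct/2023:13:55:37 +0000] "GET /api/orders HTTP/1.1" 500 87',
    '203.0.113.9 - - [10/Oct/2023:13:55:38 +0000] "POST /api/pay HTTP/1.1" 201 45',
    'not an access log line',
]

rates = error_rate_by_path(SAMPLE)
print(rates)
```

With the sample above, half of the `/api/orders` calls fail while `/api/pay` has no errors, which is exactly the kind of signal that application logs alone may not show.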
What’s next for the ELK Stack?
The ELK Stack becomes more powerful the more data you feed it. There are countless scenarios that you can use it for. These are some examples:
– Find out what percentage of clients who put at least one item in the shopping cart actually finished checkout. If that percentage is small, it may indicate an overly complicated process or payment-provider problems.
– Measuring ROI of new features. Knowing the cost of a feature allows you to monitor its ROI in real time. It may be that it’s not used enough to pay off its investment and it may be cheaper to remove it.
– The most expensive code in production is the one you’re not using. You could write an event per API-call in ElasticSearch and measure usage (or use APM – more on that below). Maybe some of your Microservices are not used anymore. Now you have proof that you can safely decommission them.
– ElasticSearch provides a module that can predict trend changes and detect anomalies. You can use it as a source for decision-making about scaling, demand, marketing or other needs.
– You can gather historical data and compare it. For example: imagine you invested in optimization of software. Now you can observe how much it paid off.
– ElasticSearch is a powerful search engine. If it has enough data in it, you can find all related things using one search query. Even your clients can search for important information more quickly.
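As an illustration of the first scenario in the list, the cart-to-checkout percentage is a simple ratio once every relevant action is stored as an event. The sketch below computes it in plain Python over a handful of invented events. The field names `client_id` and `action`, and the action values, are hypothetical; in practice the same numbers would come from an ElasticSearch aggregation over the real events.

```python
def checkout_conversion(events):
    """Share of clients with a cart event who also have a finished checkout."""
    added = {e["client_id"] for e in events if e["action"] == "cart_add"}
    finished = {e["client_id"] for e in events if e["action"] == "checkout_done"}
    if not added:
        return 0.0  # nobody used the cart, so there is nothing to convert
    return len(added & finished) / len(added)


# Invented events, for illustration only.
events = [
    {"client_id": "a", "action": "cart_add"},
    {"client_id": "a", "action": "checkout_done"},
    {"client_id": "b", "action": "cart_add"},
    {"client_id": "c", "action": "cart_add"},
    {"client_id": "c", "action": "cart_add"},
    {"client_id": "d", "action": "page_view"},
]

conversion = checkout_conversion(events)
print(f"{conversion:.0%} of clients with a cart finished checkout")
```

Counting distinct clients rather than raw events keeps a client who adds several items from skewing the ratio.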
The ELK stack may also help you during migration from the Monolith to Microservices. You could begin the migration by documenting the features in the system. Then, each use of a feature may be sent to ElasticSearch via its REST API. Those metrics help find hot spots in the system and help to decide what needs to be migrated first. Libraries such as Metrics.NET.ElasticSearch will help you with that kind of task.
The ELK stack is great at aggregating logs, metrics and other information from many sources in one place. It ships with small services called Beats that will do all the work of reading and structuring requests for ElasticSearch for you. Just remember to add custom fields to each event (using the Logstash mutate plugin or Beats configuration, for example) to indicate environment, machine, client and service name. A single Kibana dashboard can then be used to display the same kind of information for multiple environments.
The newest and most exciting part of the ELK Stack is APM – Application Performance Monitoring. If you have Elastic Cloud or ElasticSearch on-premises, you can use it to monitor the internals of your applications with just a few lines of code. Elastic.co delivers ready-to-use libraries for languages such as Java, .NET, Node.js, Python, Ruby or even Go. It can deliver information on performance issues to developers, as well as how services communicate with each other in production, and how they are interconnected.
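A feature-usage event sent over REST can be as small as a single JSON document posted to an index. The sketch below only builds the HTTP request with the Python standard library, so it runs without a cluster. The base URL, the index name `feature-usage`, the document fields and the function name are all assumptions made for illustration, and a real cluster will usually also require authentication.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_usage_request(base_url, index, feature, user_id):
    """Build a POST request that would index one feature-usage event."""
    doc = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "feature": feature,   # hypothetical field: name of the documented feature
        "user_id": user_id,   # hypothetical field: who triggered it
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_usage_request(
    "http://localhost:9200", "feature-usage", "export-pdf", "user-42"
)
print(request.get_method(), request.full_url)

# Sending is left commented out so the sketch runs without a cluster:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```

In the Monolith, a call like this would sit at the entry point of each documented feature, ideally behind a queue or a background worker so that a slow ElasticSearch never slows down the feature itself.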
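When an application writes events itself instead of going through Beats or Logstash, the same context fields can be attached in code. This is a minimal sketch of that idea; the field names follow the four items mentioned above, and the function and sample values are hypothetical.

```python
import socket


def enrich(event, environment, service, client):
    """Return a copy of the event with the shared context fields added."""
    enriched = dict(event)  # copy, so the caller's event is not modified
    enriched.update({
        "environment": environment,
        "service": service,
        "client": client,
        "machine": socket.gethostname(),
    })
    return enriched


raw_event = {"message": "payment accepted", "duration_ms": 87}
tagged = enrich(raw_event, environment="staging", service="payments", client="acme")
print(tagged)
```

Keeping the field names identical across all services is what allows one Kibana dashboard to be filtered by environment or service instead of being duplicated per environment.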
With all these features pointed out, you might get the impression that those tools will solve all of your problems! As always, there’s a catch. Investing in the ELK Stack may become pricey. You might wonder how free, open-source tools can strain your wallet. I’ll try to explain and give you some ideas on how to deal with that.
The first obvious reason why the ELK Stack price is high is hosting. Regardless of whether it is cloud-hosted or hosted on-premises, it will consume resources. ElasticSearch is pretty smart about data and how it’s stored. In general, new data isn’t immediately written to disk storage. Depending on configuration, new data is kept in RAM to be accessible faster. It is especially useful for dashboards showing real-time data.
Speaking of memory, the ELK Stack is infamous for its high usage, which is caused by optimizations. To make queries faster, some indices may be kept in RAM. In multi-node clusters, RAM is also used to aggregate data from multiple nodes, join it and create the requested response.
If you’re planning to ingest a lot of data, then parse or pre-process it, you may need additional resources. In general, Logstash works better when it has access to multiple threads. Those threads then process events in batches of hundreds or thousands, which may be a resource-intensive task.
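The batching idea itself is easy to sketch. The code below is not how Logstash is implemented; it is only an illustration, under that assumption, of splitting a stream of events into fixed-size batches and serializing each batch in the newline-delimited JSON format accepted by the ElasticSearch bulk API. The index name and the events are invented.

```python
import json
from itertools import islice


def batches(events, size):
    """Yield lists of at most `size` events from any iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_body(batch, index):
    """Serialize a batch as bulk-API NDJSON: one action line, then one document line."""
    lines = []
    for event in batch:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Invented event stream, for illustration only.
events = ({"message": f"event {i}"} for i in range(250))
sizes = [len(batch) for batch in batches(events, 100)]
print(sizes)
```

Larger batches mean fewer round trips to ElasticSearch but more memory held per worker, which is one reason ingest-heavy setups need the additional resources mentioned above.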
The maintenance and configuration costs should also be taken into consideration. When testing the ELK Stack, you may deploy a single-node setup or even put all of those tools on a single server or VM to minimize configuration. When using it in production, you have to think about your data safety and high availability: multiple nodes, a single server per application, backups, secure communication, and secure access to specific kinds of data. All of this requires knowledge and time to configure and maintain.
When choosing a solution, you may consider other options for data storage or visualisation. For storing metrics you could consider InfluxDB. It’s a time-series database that was designed to deal with time-based data, with fairly large ingest rates in mind. It’s scalable, and easy to install and use.
If you’re not interested in all the features that Kibana has in its extended version, then consider using Grafana. Like Kibana, it’s a data visualization tool, but unlike Kibana, it’s also able to display data from multiple sources at the same time (ElasticSearch, InfluxDB, Prometheus or even direct SQL queries). You can use the cloud-hosted version, or deploy it yourself using the install package or a Docker image, or even install it on a Raspberry Pi.
Microservices have changed the way software is designed nowadays. They are cool, small, easy and cheap to make. Unfortunately, they also have drawbacks, such as increased testing effort and overall system complexity. Tools like the ELK Stack help you manage that complexity, harness the potential of Microservices and bring other benefits to your company. The ELK Stack seems to be an all-in-one, single-source-of-truth solution that deals with tasks like infrastructure monitoring, log aggregation, usage statistics, and business process performance data gathering. This single point of information is invaluable when you want to deliver quality. Just give it a try!