Ruin The assorted ramblings of Brendan Tobolaski

Elk in Production

Elk is a software stack for processing and searching logs. Elk consists of Elasticsearch, Logstash and Kibana. Logstash is a log processing tool. It supports a variety of inputs, log processing functions and outputs. One of its outputs is typically Elasticsearch. Elasticsearch is a full text search engine based on Lucene. Elasticsearch makes your logs easily searchable and Kibana is a web interface for searching and making dashboards out of the data stored in logstash.

Logstash is written in Ruby and targets the jRuby interpreter. In general, it doesnʼt matter other than the configuration file is written in ruby. As previously mentioned, Logstash supports a large number of inputs such as a file, heroku, Kafka and many more. There is probably an input for however you are currently logging. At Signal Vine, we are currently using the lumberjack, Rabbitmq and syslog inputs.

The easiest way to get started with collecting logs is the syslog input. Many components of your current infrastructure probably already send logs over syslog. Its easy to configure your chosen syslog daemon to forward logs over tcp. Its probably just as easy to use udp but that isnʼt a good idea as then logs will be dropped even under normal operating conditions. You should be careful to keep the logstash server up and running when collecting syslog messages over tcp. Iʼve seen some strange behavior when the Logstash server is unavailable. Basically, some processes use a large amount of cpu while not doing anything. One downside to using the syslog input is that the timestamps arenʼt very precise, they are only second precision. This is only a problem because Elasticsearch doesnʼt preserve the insert order. What this means is that when you are viewing log messages from within the same second, they likely wonʼt be in the same order that they occurred. Most of the time this isnʼt an issue but it is something to keep in mind.

Lumberjack is a bit harder to explain. Lumberjack is a custom binary format with encryption built in. It was designed for use with lumberjack, which was a Go service that reads from files on the server and ships them to a Logstash server. It has recently been acquired by Elastic and renamed to Logstash-forwarder. While you could run the Logstash service on every server to read from log files and ship them to a central instance to finish processing them, the Logstash service has a decent amount of overhead. It runs on the jvm and as such uses a decent chunk of memory. Logstash-forwarder is very efficient and you shouldnʼt notice its overhead even on the smallest of servers. It even has encryption built in. Unfortunately, this does make it harder to get setup. It took me a few tries to get it right but it has been incredibly stable since then. In fact, since I set it up, I havenʼt had to touch it at all. It simply works. It even has handled resuming after downtime on the logstash server without any sort of intervention.

At Signal Vine, we use Rabbitmq for the internal communication medium between components of our applicaiton. Weʼve also chosen to ship our application level logs over Rabbitmq. Logstash has a built in input for processing Rabbitmq messages. This works out really well if you send json to Logstash. Logstash can take the log messages from Rabbitmq and directly insert them into Elasticsearch. This allows you add any custom fields that you want to query on. This then becomes a very effective way to query your own logs. It does help if you have a few standard fields that logs must contain. This allows you to query across all of your application logs using very similar queries.

Once Logstash has received the log messages, it can processes them with various types of filters. These filters include: adding geolocation data to web server logs, parsing the a date field included in the log message, the ability to process custom log formats and even run ruby code on the message. These filters are useful for coercing whatever sort of log message you get into a queryable format for later consumption. The available functions make this process very easy. Given that the configuration file is made in ruby, youʼre able to selectively apply these filters based on any scheme that you desire. It might be useful to apply one filter to all of the logs that come from a specific input. You can also target these filters based on the contents of the log message.

The notable exception to the processing ease is grok. Grok allows you to process custom log formats using regex which, can be a difficult processes. Luckily Logstash ships with a number of complete patterns and partial patterns that you can reuse for your logging format. There is also a great tool for building your grok filter based on an example of a message. Its important to add a unique tag if the parsing fails and you should leave the default _grokparsefailure as the combination of these two tags allows you to easily query log messages that failed to parse and then locate the specific filter that failed.

Once the messages have been fully processed, Logstash can send these messages to a variety of systems for further processing or storage. You can send the logs to s3 for permanent storage, Send an email with a particularly troubling log message, forward the data to another logstash instance or store them in Elasticsearch. I think Elasticsearch is nearly universally used. It is a very effective way to query logging data. At Signal Vine, we also use the Riemann output to process logs that we may want to be alerted about. I find that this process works really well as Iʼm able to use all of Riemannʼs powerful stream processing tools to filter the notifications that the engineering team receives. I would like to forward the full log message to Riemann but there happens to be a bug with a pending resolution preventing this.

After all of the information about Logstash, the information about Elasticsearch is rather dull. It serves as an easily queryable data store. It supports full text search and it includes Luceneʼs powerful querying language to make short work of finding relevant log messages. Elasticsearch allows you to scale horizontally based on the number of shards that you choose when configuring it. This is very useful for if the amount of logs that you want to keep exceeds the capacity of a single server or if your write volume is greater than a single server can handle. Elasticsearch also allows you to specify the number of number of replicas to keep which will allow you to lose some servers without losing any of your data. Just keep in mind that Elasticsearch has some catastrophic failure modes. Even with the ability to shard the data across many servers, most likely it wonʼt be long before you run out of storage for your logs. That is where curator comes in. Curator allows you to specify how much logging data to keep. You can either choose how long to keep logging data or the total amount of disk space to use.

Once your logs are stored in Elasticsearch, you need a way to easily query them. This is where the final piece of the Elk stack comes in, Kibana. Iʼm currently using Kibana 3 as I havenʼt yet put in the time to understand Kibana 4. Iʼve tried it out but, I wasnʼt a fan of it as it seems very unfamiliar. I found the changes to be dizzying. Iʼm sure that there are quite a number of benefits but, as I havenʼt yet researched them, I donʼt know what they are. With that being said, Kibana is a great way to look at your log data. It exposes all of the powerful querying functionality of Elasticsearch for doing ad hoc querying of your logs as well as creating great dashboards from the data contained in your logs. When you look through the documentation, there are several great examples of this.

The Elk stack is not without its flaws but, it gives you a powerful set of tools for processing log data. Having centralized storage for logs is an invaluable tool for any IT team to have. Having a single location where you can research issues happening happening on your infrastructure can greatly speed up the time between detection and the fix. It might even help you find errors that you may not have noticed without it. The Elk stack is the best tooling in this area that Iʼve seen and you should think about using it as well.