Ruin
The assorted ramblings of Brendan Tobolaski

The Party of Fear

It is extremely unfortunate that the United States has developed into a two party system. It's even more unfortunate that one of those parties is unable to field respectable candidates. Going into Tuesday's debate, the two leading candidates, Trump and Cruz, are both literal fascists. The Republicans' two appeals: America isn't safe, and Make America great again.

It seems that the Republican party can be characterized by a desire to have the biggest and best military force in the world so that we can stamp out any possible threat (by bombing those fuckers into the ground). Have we learned nothing? Nothing from Vietnam and both Gulf wars? The lesson should be apparent: we can use our military to kill people, but we can't control them. In fact, our belligerent attitude is making us significantly less safe. How many people will join ISIS after we kill their family members as collateral damage?

Unfortunately, the damage may already be done by Trump. Even if he fails to capture the Republican nomination, he has already made open bigotry acceptable. A year ago, I would have expected provably false, racist slander to eliminate a political candidate from any election; instead, it has only propelled his campaign. I, unfortunately, know people that have a hatred of Mexican immigrants. Trump's comments have made this sort of sentiment into something that can be voiced openly.

Condemning an entire race wasn't enough for Trump; all Muslims are in the crosshairs as well. Not only would he prevent the US from taking in the abysmally small number of Syrian refugees that Obama has committed to, he would also prevent any citizen who happens to be a Muslim from returning to the United States. This is wrong. It is against everything that this country was built upon, and every sane citizen should find the idea repulsive. It is already having repercussions: violence against Muslims is up. It is inciting the true American terrorists, white people.

In addition, Trump advocated for committing war crimes during the debate. He would like to target the families of ISIS members. This is flat-out sickening. Under no conditions should we ever consider doing this, and those who preach it should be nowhere near running this country.

None of these things comes from a position of strength. The primary strategies of the Republican party appear to be the creation of fear and nostalgia. Their strategy requires all of us to live in a state of fear, a fear that they alone can resolve. That is not the world I live in, and it shouldn't be yours either. Most egregiously, they're also exploiting the widespread hatred of Mexicans and the hatred and fear of Muslims to further solidify their following. I want no part in this, and you shouldn't want any part in it either.

Back to Basics

I read Ben Brooks's most recent thoughts on WordPress and it led me down an important train of thought that has culminated in what you see now: my return to using WordPress. My initial reaction to reading Ben's post was denial. Why does it matter if I have a complicated CMS setup for my writing? So what if I want to spend my time writing my own CMS just to run Ruin? It doesn't matter, does it? That is when it hit me: it does matter.

For the longest time, I've wanted to write my own CMS. I don't have a particularly good reason for why, other than that I enjoy writing and developing software and I have some ideas that I want to try out. All of that is fine, but it isn't the reason I have this site. I have this site because I intend to write on it. Looking at what I've managed to get out this year makes me sad. Compared to previous years, my output has dropped considerably. Some of it is simply dropping the linked-list style posts; I don't think those are particularly useful to people, so I've stopped doing them.

There are also large gaps where I apparently stopped writing altogether. Each one of these is a time when I was going to finish my CMS, so I stopped writing until it was “done”. That point never actually arrived though. On multiple occasions, I've spent weeks writing my new blogging platform only to realize that it will be a very, very long time before it's complete. On most of these occasions, I did have something that would be workable, but it was missing features that I would call essential. At these points, I'd revert to my previous CMS, Jekyll, and continue my writing. I was never quite satisfied, so I would quickly return to tinkering with making my own.

It took reading Ben Brooks's post for me to step back far enough to evaluate the situation. This cycle is deadly to my writing. Furthermore, I've long had more projects that I wish to explore than I have time for. Building blogging software is nowhere near the top of that list. It also isn't the reason that I have this site. I have the site as a place to publish my writing, not as a place to fiddle with different CMSes.

So, I'm doing exactly as he suggests: I'm using WordPress and relying on the things that the community has created to fulfill all of my functionality desires. It took all of an hour to have all of the functionality that I wanted. Now it's just a matter of making it look the way that I want. Of course, I have to write my theme in PHP, which I don't like, but I can just start from _s. It's a small price to pay to be able to concentrate on writing and building the tools I want instead of a CMS. I just need to remember that.

Replicating Jepsen Results

If you aren't aware, Kyle Kingsbury has a great series of posts testing whether databases live up to their claims. It's an invaluable resource, as many of the databases he has tested don't live up to their stated goals. That being said, some of the posts are getting quite old at this point, so it's possible that the developers have since fixed the issues that caused those failures. Luckily, Kyle's Jepsen project is open source and you're free to try to replicate his results.

This does take some setup though. You'll need 5 database servers. It's easiest to use Debian Jessie for this, as that is what Kyle uses and therefore all of the tests that he's written work against it. You do need to replace systemd with SysV init before the tests will be able to run. You also need a machine to run Jepsen on; you shouldn't try to reuse one of the database servers for this, as the tests cut off network access to some of the database servers at certain points. For the easiest testing process, you'll want the database servers to be named n1-n5. They all need to be resolvable from the other database servers and from the server running the tests. The server running the tests also needs to be able to ssh into all of the database servers with the same username and password (or ssh key) and have sudo access on them. Those hosts must also exist in the known hosts file in the non-hashed format before Jepsen can execute a test. I'm not sure what default username and password Jepsen uses, but you can easily change the values that each test uses. Finally, the server running the tests needs JDK 8 and Leiningen.
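
As a rough sketch, the prep looks something like this (the addresses, package names and paths here are assumptions; adjust for your environment and Debian release):

# Make n1-n5 resolvable everywhere, e.g. via /etc/hosts on every machine:
#   10.0.0.11 n1
#   10.0.0.12 n2
#   ... and so on through n5

# On the machine that runs the tests (Java 8 may come from jessie-backports):
sudo apt-get install -y openjdk-8-jdk leiningen

# Jepsen wants non-hashed known_hosts entries for each database node:
ssh-keyscan n1 n2 n3 n4 n5 >> ~/.ssh/known_hosts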

That is quite a bit, isn't it? I thought so, and given the wonderful tooling we have for replicating these sorts of environments, I was sure someone had created a way to spin up a set of servers on AWS to run whichever tests you would like. I wasn't able to locate one, which likely just means my search skills were lacking. Since I couldn't find one, I made one using Terraform. jepsen-lab is relatively simple, but it takes care of all of the previously stated requirements. It sets up all of the servers, configures them as required and, once that process is complete, outputs the IP address that you're able to ssh into. It does leave a couple of steps for you to complete on your own: you need to clone the Jepsen repo and you'll need to modify the test configuration with the username and password. The former is because I don't know which revision you may want to use and the latter is because that step depends on which tests you choose to run. For more information on how to use jepsen-lab, see the readme in the repository.
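
In practice the workflow ends up looking roughly like this (the ssh user and exact steps are from memory, so treat the readme as authoritative):

terraform apply                  # provisions n1-n5 and the control node
ssh admin@<ip from the output>   # official Debian images typically use the admin user
git clone https://github.com/aphyr/jepsen.git
# then edit the chosen test's username/password to match the environment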

After getting everything set up, it's just a matter of running lein test from the correct directory and verifying the results. You can also make whatever modifications you like and see whether they change the results of the tests. In future installments, I'll discuss the particular tests that I've tried to replicate, the modifications that I've made and the results that I've gotten.
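
For example, to replicate one of the existing suites (the directory name is illustrative; each database under test has its own subproject in the Jepsen repo):

cd jepsen/etcd     # or whichever database's tests you want to run
lein test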

fpm

For many developers, the way that they deploy is by checking out a specific revision from version control. While some people consider that to be a bit of an anti-pattern, I think it is a fine way to deploy applications written in dynamic languages. In theory, you could do the same thing for compiled languages; it just doesn't work well in practice, since it would require you to compile your application on every server during the deploy. While this is possible, it's very inefficient and time consuming. A much better way is to build your application once and then distribute the resulting artifacts. The way that I've chosen to do this is by building native packages, specifically debs.

Generating these debs isn't very difficult. It took me quite a bit of research to figure out what needed to be there (Debian's packaging guide and Clemens Lee's package building HowTo were both hugely helpful). Once you figure that out, it's just a matter of creating the correct directory structure and running it through dpkg-deb. Alright then, how do you make a similar rpm? Time to do some more research, huh?
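
For reference, the hand-rolled deb approach boils down to staging a directory with a DEBIAN/control file and handing it to dpkg-deb; a minimal sketch with made-up package details:

mkdir -p mytool_1.0.0/DEBIAN mytool_1.0.0/usr/local/bin
cp mytool mytool_1.0.0/usr/local/bin/
cat > mytool_1.0.0/DEBIAN/control <<'EOF'
Package: mytool
Version: 1.0.0
Architecture: amd64
Maintainer: you@example.com
Description: Example package built by hand
EOF
dpkg-deb --build mytool_1.0.0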

Why should any of this be required? Surely many other people have figured out what is needed, and one of them must have documented that knowledge somehow. The answer to both, of course, is yes. There's an awesome tool called fpm that creates packages of many different types from many different sources. Of course, it can package up files and directories into debs and rpms.

I've known about fpm for quite some time. In fact, I knew about it before I started building debs by hand. As I mentioned, it's not terribly difficult to use dpkg-deb to produce a deb. I also don't really like that fpm is written in Ruby. While I think Ruby is a fine language, getting it installed with everything that is needed to build native gem extensions is a pain, and a cost I didn't want to pay for a simple CLI tool. It also requires a bit more setup than that to fully utilize: the rpm output requires the rpmbuild command to be installed, and I'm sure some of the other outputs require similar commands to be available. I'd love to see a similar tool compiled into a static binary, but I've long since given up on ever producing that tool myself.

As I alluded to earlier, what prompted me to start using fpm was generating rpms. I've since realized that I shouldn't have dragged my feet on it for so long. Instead of figuring out everything that is required to generate an rpm, I just used fpm: fpm -s dir -t rpm -v $VERSION -d libgmp10 ~/.local/bin=/usr/local/bin/. Of course, I can simply swap rpm for deb to generate a deb instead. This ignores many of the fancier things that fpm can do: you can easily make native packages from gems, Python modules, and CPAN modules (to name a few). It also supports some more “exotic” formats such as self-extracting scripts and OS X packages. I've converted many of my deb-building scripts to use fpm and I'll be using fpm for all of my packaging needs going forward.
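
For completeness, here are the two variants side by side (the package name and paths are placeholders; the rpm output needs rpmbuild installed):

fpm -s dir -t deb -n mytool -v 1.0.0 ~/.local/bin=/usr/local/bin/
fpm -s dir -t rpm -n mytool -v 1.0.0 -d libgmp10 ~/.local/bin=/usr/local/bin/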

Disabling Analytics

I've been quite pleased with how this site has been going. It's been growing slowly over time; for the past few months, 10-20% month over month. I think that's pretty good but, obviously, it means my traffic levels are basically internet background radiation. Given that is the current state of the site, Ben Brooks's article, Death to Analytics, really struck a chord with me. While I enjoy seeing my traffic grow, it doesn't provide me any benefit. Clearly the ever-growing traffic hasn't been motivating me to write more. In fact, it's probably a detriment.

Since I know which articles people come to my site to read, I'm inclined to write additional things along those lines. Unfortunately, over 60% of people come here for the various tutorials that I've written. While I like that I've written these and I'm glad that people are benefiting from them, I don't really want to keep writing them. I write them when I come across something that I have a hard time doing and when I think I have some knowledge that would be helpful to pass along. They aren't the reason that I write on this site. Feeling pressure to write more of them just keeps me from writing on this site at all, and that makes me feel bad.

It also doesn't matter how many people are visiting my site. While I have enjoyed seeing the number of visitors increase, I don't find people simply visiting particularly pleasing. Many of the people that have happened upon this little site of mine probably weren't particularly pleased either. Knowing how many times this has occurred isn't something that I should care about and, if I really consider it, I don't care. What I really care about is making an impact on you. Of course, analytics can't tell me that; only you can. I really appreciate it when someone takes the time to start a discussion about one of my articles or lets me know that they enjoy my site. It really made my day when one of you decided to send me some money to support the site. I'd love to have more of these things.

So, I've removed the analytics from here. I'm going to do what I should have been doing all along: writing about the things that interest me. I'd love to know your thoughts, so please let me know in whatever way you prefer. If you happen to love what I do here, consider supporting me in some way.

Otto

Two weeks ago, at their first ever HashiConf, HashiCorp announced two new tools, Otto and Nomad. Both of these are great new tools, but I'm going to concentrate on the former as I'm more interested in it. For the first time, HashiCorp is rethinking one of their current products, Vagrant.

I use Vagrant every day; it's immensely useful. It's a great way to set up isolated and consistent development environments. All of that comes with a cost though: setting up Vagrant is quite a bit of work. You have to figure out how to provision the environment. Vagrant has all of the standard choices built in, so you can pick your favorite, but that requires you to have some existing knowledge. You could provision your environment using shell scripts, but that quickly gets painful. As this is a fairly large pain point, a variety of tools have sprung up in an attempt to fix it, such as Puphpet and Rove. For a while, I was really excited by this sort of thing. I almost built a community with Nathan LeClaire for hosting Vagrant environments after he came up with the idea. Things didn't work out, as it was a really busy time for both of us. After quite a bit of thinking, I'm glad that we didn't. It just wasn't the right way to move things forward.

The other big problem with Vagrant is moving your app into production. You put in a lot of work to build your development environment, but there's a pretty good chance that you'll need to put in a bunch more work to prepare your production environment. The quick setup that you do for your development environment will not be sufficient for production. In a lot of ways, Vagrant seems to work better if you're working back from your production environment. Being able to replicate your exact production environment has lots of benefits, but if you don't already have an existing set of scripts, roles, or modules, then using Vagrant is going to take a lot of setup to get going.

That's where Otto comes in. Otto is HashiCorp rethinking how you should set up development environments. It can automatically detect the type of application that you're developing and build an appropriate development environment for it. Of course, this leverages Vagrant under the covers; it just skips the laborious setup process. The other big thing that Otto provides is a way to move your application into production.
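
The day-to-day flow, as I understand it from the documentation, is just a handful of commands (exact behavior will no doubt shift as Otto matures):

otto compile   # inspect the app, detect its type and generate the environment config
otto dev       # bring up the local development environment (Vagrant under the covers)
otto infra     # stand up the shared infrastructure (currently AWS only)
otto build     # build a deployable artifact of the application
otto deploy    # push that artifact onto the infrastructure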

I think Otto is the answer to a question that I've been pondering for quite some time: how should a small group of developers create and deploy an application? There aren't a whole lot of good options for small teams. Recently there has been an explosion of tools for simplifying application management, but they all seem to be focused on much larger teams and applications. Things like Mesos, Kubernetes and Docker are all great, but they require quite a bit of knowledge to run. For a small team, they're all too much of a knowledge investment to be useful. Deploying to plain servers also requires too much knowledge to keep things running and secure. The only good option here is Heroku, but that isn't without its downsides: Heroku is extremely expensive for what it provides and it ties you to a proprietary platform.

Otto really fills this need. When it comes time to move your app into production, Otto will create industry standard infrastructure for you. This is very important, as it means many hard-earned lessons are handled for you automatically. I've felt that things like Chef roles and Puppet modules presented a similar opportunity, but both have fallen well short of that goal. This lets developers get back to what they do best: improving the application.

As with most of HashiCorp's products, Otto is launching with a small set of functionality. The two main limitations are that the application types Otto supports are rather limited and that only one type of deployment is available. Of course, both of these things will improve over time, but they've kept me from being able to start using Otto. These days, I'm spending most of my time working in Clojure, and Otto doesn't currently support JVM applications. The current version only supports Docker, Go, PHP, Node, and Ruby, but those cover a large swath of developers. Otto will also only deploy to AWS, which I don't use due to the relatively high cost. I really want to use Otto, but its features aren't quite enough for me yet.

Otto is an important evolution of a crucial DevOps tool, Vagrant. It makes huge strides forward for many of the current use cases. It removes the biggest pain point of getting Vagrant up and running. It also fills a crucial need by providing a good way to move applications into production. I’m looking forward to using Otto in the future.

Apple Music

Apple Music seems to be rather polarizing. Quite a number of people have been fairly disappointed in it. From what I've read, your opinion of it will be largely determined by what you were using before Apple Music. If you currently have a large number of songs in iTunes, then you're unlikely to like Apple Music. On the other hand, if you're currently using a music streaming service like Rdio or Spotify, there is a lot to like. I happen to have been a long-time customer of Rdio.

The most obvious advantage to Apple Music is its deep integration into iOS. This is definitely an unfair advantage for Apple. In the past few years, Apple has introduced APIs that let 3rd party audio apps integrate more deeply into iOS, but it's still not quite parity. My car's audio system can connect over Bluetooth, and in my previous usage of Rdio, it would quite frequently fail to start playing when I got into my car. That has not happened once with Apple Music. Apple Music is also currently the only native music app on the Apple Watch; that, of course, will be changing with watchOS 2. Then there is the Siri integration: you can ask Siri to play one of your playlists, an artist or even a song and it will start playing in Apple Music. It seems unlikely that this particular functionality will ever appear for 3rd party apps, although the opening up of Spotlight to search 3rd party apps in iOS 9 does make that scenario seem plausible.

The initial setup for Apple Music is a little bit wonky. The interface doesn't make it clear when you've selected a sufficient number of genres, so you have to figure out that you've selected enough and then hit next. The artist selection step seems a bit more straightforward. The initial artists that Apple Music suggested weren't really my taste, but after selecting the couple that I did like and hitting “More Artists” a few times, they got better. There is a limit to the number of artists that you can select: as you select additional artists, the screen fills up with their bubbles, and when you hit the More Artists button again, it simply replaces the artists that you didn't select with new ones. This places a hard cap on the number of artists you can choose. Additionally, when you have quite a few artists selected, the interface is extremely slow to scroll. It is clearly optimized for selecting a small number of artists.

The initial playlist suggestions were OK. They were pretty much exactly what I asked for; they were all related to the selections that I made. It was a mix of deep cuts from my favorite bands with a smattering of genre-focused playlists. These days, I find the playlist suggestions to be a bit better, but I don't often listen to them. I use Apple Music in almost exactly the same way that I used Rdio, which is mostly picking my favorite songs and downloading them for offline use. However, I have found a few songs that I like by listening to the suggestions.

Managing songs is a bit cumbersome on what must be the primary device, the iPhone (purely because there are way more iPhones than Macs or iPads). Apple has hidden most of the actions that you can take behind a pop-up menu. Not only is this an obnoxiously long list, most of the things that you might want to do with a song are hidden in there. Strangely, the “heart” is not available in that menu; as far as I can tell, it is only available from the now playing screen. The options that are there are somewhat confusing: you can add a song to your music, you can make it available offline and you can add it to a playlist. Does making it available offline add it to your music? Does adding it to a playlist add it to your music? It's not remotely clear. There is a similar set of actions for albums, but the heart is far easier to get to.

I mostly avoid all of that complexity. I simply mark the songs that I like as loved. Then, I have a smart playlist that contains all of my loved songs, and I have it set to be available offline. Using it this way is mostly automatic: I hit the heart to add a song to my playlist and then it is downloaded for offline use on my iPhone.

I can't exactly call Apple Music a runaway success. I find it to be better than any of the other options, but that isn't saying much. I guess it works for my very limited use case.

cd to Source

On my computer, I have all of my source code in a single directory in my home directory, ~/src. I think this is fairly common, both in concept and in location. Usually this means that I can navigate to a different project with a simple cd ../<project name>, but sometimes that doesn't work. This is usually a problem when I'm somewhere in a project's source tree other than the root directory. That then requires me to either keep track of how deep in the project I am or to use cd ~/src/<project name>. The latter is what I've typically been using, but I'm a software developer and so I like to spend hours removing even these tiny little inefficiencies. That is a bit of an exaggeration in this case, but I'm sure I spent far longer on this than it will ever save me.

So, for the simple part, here's the function that handles changing directories.

cds() {
    cd ~/src/$1
}

Now, that works just fine, but it is slightly inefficient. The command is shorter, but without completion you have to type out the full directory name, which ends up being more typing than a tab-completed cd for any project name longer than about four characters, which is most of mine. The solution is to add tab completion. Luckily, this is really easy in zsh. All together it looks like this:

cds() {
    cd ~/src/$1
}

compctl -/ -W ~/src cds

Pretty simple, huh? You just have to throw that in one of zsh's startup files, such as ~/.zshrc.
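
Once that's loaded, using it looks like this (the project name is just an example):

cds ru<TAB>    # completes against the directories under ~/src
cds ruin       # equivalent to: cd ~/src/ruin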

Capturing Influxdb Send Errors in Riemann

At Signal Vine we recently upgraded from InfluxDB v0.8.8 to v0.9. Unfortunately, we experienced a number of issues related to the migration. One of them was that InfluxDB experienced a problem that caused it to stop accepting data. Since this happened late on a Friday night, no one noticed the lack of data until Monday. This means that we lost all of the data from the weekend. Luckily for us, our app is not heavily utilized over the weekend, but I decided that we needed to be able to detect these sorts of issues in the future.

It turns out that an exception was being emitted each time Riemann failed to send to InfluxDB. I decided to go with a simple (try ... (catch ...)), which is probably not the ideal way to handle this. There is *exception-stream*, but I'm not sure how it is used and I was unable to find an example demonstrating it. This is what I came up with:

(ns riemann.config
  (:require [riemann.time :as time]))

(def influx (batch 100 1/10
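                   ;; Throttled notifiers: email me at most once every 15 minutes, log at most once a minute.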
                   (let [throttled-alert (throttle 1 900 tell-brendan)
                         throttled-log (throttle 1 60 (fn log [e] (warn "influxdb-send-exception" (str e))))]
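                     ;; Queue sends asynchronously so a slow or unavailable InfluxDB can't back up Riemann's streams.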
                     (async-queue! :agg {:queue-size 1000
                                         :core-pool-size 1
                                         :max-pool-size 4
                                         :keep-alive-time 60000}
                                   (let [send-influx (influxdb {:host "influxdb.example.com"
                                                                :version :0.9
                                                                :scheme "http"
                                                                :port "8086"
                                                                :db "metrics"
                                                                :username "riemann"
                                                                :password "password"
                                                                :tag-fields #{:host :environment}})]
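                                      ;; batch hands this a vector of events; on a failed send, emit a throttled alert event and log line.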
                                     (fn influx-sending [event]
                                       (try
                                         (send-influx event)
                                         (catch Exception e
                                           (throttled-alert {:host "riemann.example.com"
                                                             :service "influxdb send error"
                                                             :state "fatal"
                                                             :metric 1
                                                             :tags []
                                                             :ttl 60
                                                             :description (str e)
                                                             :time (time/unix-time-real)})
                                           (throttled-log e)))))))))

Obviously that is a bit more complex than just catching the errors, so I'll dig into it. First off, I'm batching events together before sending them to InfluxDB. This helps reduce Riemann's CPU load. Then I define the two alert functions that will be used later on. tell-brendan is a function that sends an email to me; I only want to get one of these every 15 minutes, as it is likely that I would see the alert and start working on the problem immediately. However, I do want to see whether sending metrics to InfluxDB is still failing, so I have Riemann log a failure notification every minute. These are both defined here so that the throttle applies to all of the branches later on.

Next up is another performance option. I've set up an async queue so that sending metrics to InfluxDB doesn't block Riemann's incoming event stream. I've had sending to InfluxDB cause Riemann to back up to the point where events were expiring before Riemann had processed them. Sending them to InfluxDB asynchronously fixes this: it doesn't matter how long it takes events to reach InfluxDB, since the rest of Riemann's processing depends only on Riemann. Since moving to InfluxDB v0.9 and implementing the async queue, the 99.9th percentile stream latency for Riemann has dropped from 50-100ms to 2.5ms.

Next up, I define the InfluxDB client. There isn't much to see here; the only mildly interesting thing is the :tag-fields value. At Signal Vine, all of our events are tagged with an environment. The #{:host :environment} sends both the host value from the event and the environment as InfluxDB tags. This makes it easier to query exactly what you want from InfluxDB.

Now for the main attraction, the influx-sending function. While Riemann tends to hide this in its internal functions, Riemann's streams are simply made up of functions that take a single argument, the event. It's just that Riemann's built-in functions return a function that satisfies this and then calls all of the children that you have defined. Since we don't have any children, we simply need to construct a function that takes the event; well, in this case it actually takes a vector of events, thanks to batch. As previously mentioned, we use a (try ... (catch ...)) to handle any InfluxDB errors. So we simply try to send the events to InfluxDB. If that throws any exception, we catch it and pass the exception to our notification functions. I've chosen to construct a new event, as the passed-in events have little to do with the exception.

I'm quite fond of this approach, but it does have its limitations. One of the big ones is that it will generate an alert for even momentary issues in InfluxDB; if you happen to restart InfluxDB, you will get an alert. I don't really mind this, but it is something to keep in mind. It also discards any events which fail to send: in that same scenario, when we restart InfluxDB, we lose any events that Riemann tries to send while it is down. It would be much better to pause the sending process for some period of time and then attempt to resend the events. The main reason that I'm not currently doing this is that I'm not really sure how to make that happen.

Elk in Production

Elk is a software stack for processing and searching logs. It consists of Elasticsearch, Logstash and Kibana. Logstash is a log processing tool: it supports a variety of inputs, log processing filters and outputs, and one of its outputs is typically Elasticsearch. Elasticsearch is a full text search engine based on Lucene; it makes your logs easily searchable. Kibana is a web interface for searching and making dashboards out of the data stored in Elasticsearch.

Logstash is written in Ruby and runs on the JRuby interpreter. In general, this doesn't matter much; you interact with it through its configuration file, which uses its own simple declarative syntax. As previously mentioned, Logstash supports a large number of inputs such as files, Heroku, Kafka and many more. There is probably an input for however you are currently logging. At Signal Vine, we are currently using the lumberjack, RabbitMQ and syslog inputs.

The easiest way to get started with collecting logs is the syslog input. Many components of your current infrastructure probably already send logs over syslog, and it's easy to configure your chosen syslog daemon to forward logs over TCP. It's probably just as easy to use UDP, but that isn't a good idea, as logs will then be dropped even under normal operating conditions. You should be careful to keep the Logstash server up and running when collecting syslog messages over TCP; I've seen some strange behavior when the Logstash server is unavailable. Basically, some processes use a large amount of CPU while not doing anything. One downside to using the syslog input is that the timestamps aren't very precise; they only have second precision. This is only a problem because Elasticsearch doesn't preserve the insert order. What this means is that when you are viewing log messages from within the same second, they likely won't be in the same order that they occurred. Most of the time this isn't an issue, but it is something to keep in mind.

Lumberjack is a bit harder to explain. Lumberjack is a custom binary protocol with encryption built in. It was designed for the lumberjack tool, a Go service that reads from log files on a server and ships them to a Logstash server; it has since been acquired by Elastic and renamed Logstash-forwarder. While you could run the Logstash service on every server to read log files and ship them to a central instance for processing, the Logstash service has a decent amount of overhead: it runs on the JVM and as such uses a decent chunk of memory. Logstash-forwarder is very efficient and you shouldn't notice its overhead even on the smallest of servers. It even has encryption built in. Unfortunately, that does make it harder to get set up. It took me a few tries to get it right, but it has been incredibly stable since then. In fact, since I set it up, I haven't had to touch it at all. It simply works. It has even handled resuming after downtime on the Logstash server without any sort of intervention.

At Signal Vine, we use RabbitMQ as the internal communication medium between components of our application. We've also chosen to ship our application-level logs over RabbitMQ. Logstash has a built-in input for processing RabbitMQ messages. This works out really well if you send JSON to Logstash: Logstash can take the log messages from RabbitMQ and insert them directly into Elasticsearch, and you can add any custom fields that you want to query on. This then becomes a very effective way to query your own logs. It does help to have a few standard fields that all logs must contain, as that lets you query across all of your application logs using very similar queries.

Once Logstash has received the log messages, it can process them with various types of filters. These include adding geolocation data to web server logs, parsing a date field included in the log message, processing custom log formats and even running Ruby code on the message. These filters are useful for coercing whatever sort of log message you get into a queryable format for later consumption, and the available filters make this process very easy. The configuration supports conditionals, so you're able to selectively apply these filters based on any scheme that you desire: it might be useful to apply one filter to all of the logs that come from a specific input, or you can target filters based on the contents of the log message.

The notable exception to this ease of processing is grok. Grok allows you to process custom log formats using regexes, which can be a difficult process. Luckily, Logstash ships with a number of complete and partial patterns that you can reuse for your logging format. There is also a great tool for building your grok filter from an example message. It's important to add a unique tag when parsing fails, and you should also leave the default _grokparsefailure tag in place; the combination of the two lets you easily query for log messages that failed to parse and then locate the specific filter that failed.

Once the messages have been fully processed, Logstash can send them to a variety of systems for further processing or storage. You can send the logs to S3 for permanent storage, send an email with a particularly troubling log message, forward the data to another Logstash instance or store them in Elasticsearch. I think Elasticsearch is nearly universally used; it is a very effective way to query logging data. At Signal Vine, we also use the Riemann output to process logs that we may want to be alerted about. This works really well, as I'm able to use all of Riemann's powerful stream processing tools to filter the notifications that the engineering team receives. I would like to forward the full log message to Riemann, but there happens to be a bug, with a pending resolution, preventing this.

After all of the information about Logstash, the information about Elasticsearch is rather dull. It serves as an easily queryable data store: it supports full text search and includes Lucene's powerful query language to make short work of finding relevant log messages. Elasticsearch lets you scale horizontally based on the number of shards that you choose when configuring it. This is very useful if the amount of logs that you want to keep exceeds the capacity of a single server or if your write volume is greater than a single server can handle. Elasticsearch also allows you to specify the number of replicas to keep, which allows you to lose some servers without losing any of your data. Just keep in mind that Elasticsearch has some catastrophic failure modes. Even with the ability to shard the data across many servers, it most likely won't be long before you run out of storage for your logs. That is where curator comes in: curator allows you to specify how much logging data to keep, either by how long to keep it or by the total amount of disk space to use.
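
Elasticsearch's HTTP API also makes ad hoc queries easy to script; for example, a Lucene-style query against the standard daily logstash-* indices is a single call (the index pattern and field name depend on how your logs are structured):

curl -s 'http://localhost:9200/logstash-*/_search?q=level:ERROR&size=10&pretty'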

Once your logs are stored in Elasticsearch, you need a way to query them easily. This is where the final piece of the Elk stack comes in: Kibana. I'm currently using Kibana 3, as I haven't yet put in the time to understand Kibana 4. I've tried it out, but I wasn't a fan; it feels very unfamiliar and I found the changes dizzying. I'm sure there are quite a number of benefits but, as I haven't yet researched them, I don't know what they are. With that being said, Kibana is a great way to look at your log data. It exposes all of Elasticsearch's powerful querying functionality for ad hoc queries of your logs as well as for creating great dashboards from the data contained in them. When you look through the documentation, there are several great examples of this.

The Elk stack is not without its flaws, but it gives you a powerful set of tools for processing log data. Centralized log storage is an invaluable tool for any IT team to have. Having a single location where you can research issues happening on your infrastructure can greatly shorten the time between detection and the fix. It might even help you find errors that you would not have noticed otherwise. The Elk stack is the best tooling in this area that I've seen, and you should think about using it as well.