Catching errors posting to influxdb from riemann

At Signal Vine we recently upgraded from InfluxDB v0.8.8 to v0.9. Unfortunately, we experienced a number of issues related to the migration. One of them was that InfluxDB experienced a problem that caused it to stop accepting data. Since this happened late on a Friday night, no one noticed the lack of data until Monday. This means that we lost all of the data from the Weekend. Luckily for us, our app is not heavily utilized over the weekend but, I decided that we needed to be able to detect any of these sorts of issues in the future.

It turns out that an exception was being omitted each time Riemann failed to send to InfluxDB. I decided to go with a simple (try ... (catch ...)) which is probably not the ideal way to handle this. There is *exception-stream* but I not sure how it is used and I was unable to find an example demonstrating its use. This is what I came up with:

(ns riemann.config
  (:require [riemann.time :as time]))

(def influx (batch 100 1/10
                   (let [throttled-alert (throttle 1 900 tell-brendan)
                         throttled-log (throttle 1 60 (fn log [e] (warn "influxdb-send-exception" (str e))))]
                     (async-queue! :agg {:queue-size 1000
                                         :core-pool-size 1
                                         :max-pool-size 4
                                         :keep-alive-time 60000}
                                   (let [send-influx (influxdb {:host "influxdb.example.com"
                                                                :version :0.9
                                                                :scheme "http"
                                                                :port "8086"
                                                                :db "metrics"
                                                                :username "riemann"
                                                                :password "password"
                                                                :tag-fields #{:host :environment}})]
                                     (fn influx-sending [event]
                                       (try
                                         (send-influx event)
                                         (catch Exception e
                                           (throttled-alert {:host "riemann.example.com"
                                                             :service "influxdb send error"
                                                             :state "fatal"
                                                             :metric 1
                                                             :tags []
                                                             :ttl 60
                                                             :description (str e)
                                                             :time (time/unix-time-real)})
                                           (throttled-log e)))))))))

Obviously, that is a bit more complex then just catching the errors, so Iʼll dig into it. First off, Iʼm batching events together before sending them to InfluxDB. This helps to reduce the cpu load of Riemann. Then I define the two alert function that will be used later on. tell-brendan is a function that sends an email to me. I only want to get one of these every 15 minutes as it is likely that I would see the alert and start working on the problem immediately. However, I do want to see if sending metrics to Riemann is still failing so, I have Riemann log a failure notification every minute. These are both defined here so that the throttle applies to all of the branches later on.

Next up is another performance option. Iʼve set up an async queue so that sending metrics to InfluxDB doesnʼt block Riemannʼs incoming event stream. Iʼve had sending to InfluxDB cause Riemann to back up to the point where events were expiring before Riemann was processing them. Sending them to InfluxDB asynchronously fixes this. It doesnʼt matter how long it takes events to be sent to InfluxDB, all of Riemannʼs processing only depends on Riemann. Since moving to InfluxDB v0.9 and implementing the async queue, the 99.9% stream latency for Riemann has dropped from 50-100ms to 2.5ms.

Next up, I define the Riemann client. There isnʼt much to see here. The only mildly interesting thing is the :tag-fields value. At Signal Vine, all of our events are tagged with an environment. The #{:host :environment} sends both the host value from the event and the environment as InfluxDB tags. This makes it easier to query exactly what you want from Riemann.

Now for the main attraction, the index-sending function. While Riemann tends to hide this in its internal function, Riemannʼs streams are simply made up of functions that take a single argument, the event. Its just that Riemannʼs built in functions return the function that satisfies this and then calls all of the children that you have defined. Since we donʼt have any children, we simply need to construct a function that takes the event. Well in this case it actually takes a vector of events. As previously mentioned, we use a (try ... (catch ... )) to handle any InfluxDB errors. So we simply try to send the event to Riemann. If that throws any exception, we catch it and pass the exception to our notification functions. Iʼve chosen to construct a new event as the passed in events have little to do with the exception.

Iʼm quite fond of this approach but, it does have its limitations. One of the big ones is that this will generate an alert for even momentary issues in InfluxDB. If you happen to restart InfluxDB, you will get an alert. I donʼt really mind this but it is something to keep in mind. It also discards any events which fail to send. In that same scenario, when we restart InfluxDB, we will lose any events that Riemann tries to send to InfluxDB. It would be much better if we would pause the sending process for some period of time and then attempt to resend the events. the main reason that Iʼm not currently doing this is that Iʼm not really sure how to make that happen.