Sunday, October 21, 2012

The Long Road to Logstash

I'm a Splunk addict. I use it almost every day, primarily for problem investigation. So when we started blowing past our daily indexing limit every single day at the start of this semester, I knew I was in trouble. After being locked out of searches for the second time, I started looking for alternatives. I found 3 serious candidates:
I found things I didn't like about all 3 of them, but Logstash was by far the most flexible. After a lot of confusion and frustration, I finally have it at a point where it is useful. What follows are the things I wish I had known before undertaking this project. This post more or less assumes you have had some introduction to Logstash; the recent presentation from PuppetConf 2012 is quite good: http://www.youtube.com/watch?v=RuUFnog29M4

Logstash

  • "Use the grep filter to determine if something exists. Use the grok filter only if you know it will be successful." - This is a big problem with complex log format, like dovecot. I spent many hours trying to write a grok filter that would match every possible Dovecot log line. It's futile. Use the grep filter to grep for something like 'interestingfield=".*"' and add a tag to indicate the field's presence, then grok for 'interestingfield=%{QUOTEDSTRING}' on just that tag. Grok failures are bad. They add the _grokparsefailure tag, and they seemed to contribute to the next problem I ran into, watchdog thread timeouts.
  • Logstash, by default, applies a 2-second timeout to all filter operations. If it hits that timeout, it will kill itself. I was probably hitting this because I was trying to run Logstash, Elasticsearch and Kibana all on the same underpowered development VM, but I think part of the problem was that I was running lots of grok filters that were failing. The recommended workaround for the "watchdog timeout" is to run Logstash under something that automatically restarts it. On RHEL6-based distros (and Debian-based systems) you probably have upstart; on RHEL5-based distros you can use inittab. There's a good upstart entry in the new Logstash cookbook (http://cookbook.logstash.net/recipes/using-upstart/). For inittab, use a wrapper script that waits a few seconds before launching Logstash, so the old Logstash process can give up the TCP ports it was listening on (a sample wrapper follows the grep/grok sketch below).
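
Here is a rough sketch of the grep-then-grok approach from the first bullet, in the 1.1.x-era config syntax I was using. The "dovecot" type, the "@message" field name, and "interestingfield" are placeholders, so check the filter docs for your Logstash version before copying any of it.

    filter {
      # Cheap existence check: tag the event if the field shows up at all,
      # but don't drop events that lack it.
      grep {
        type    => "dovecot"
        match   => [ "@message", "interestingfield=\".*\"" ]
        drop    => false
        add_tag => [ "has_interestingfield" ]
      }
      # Only run the expensive, failure-prone grok on events we already
      # know contain the field.
      grok {
        type    => "dovecot"
        tags    => [ "has_interestingfield" ]
        pattern => "interestingfield=%{QUOTEDSTRING:interestingfield}"
      }
    }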

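And for the inittab route, a minimal wrapper might look something like this. The paths, the jar name, and the 5-second sleep are assumptions; the point is just to give the dying process time to release its ports before respawning.

    #!/bin/sh
    # /usr/local/bin/logstash-wrapper.sh (hypothetical path)
    # Give the old Logstash process a moment to release its TCP ports
    # before the replacement tries to bind them.
    sleep 5
    exec java -jar /opt/logstash/logstash-monolithic.jar agent -f /etc/logstash/logstash.conf

with an inittab entry along the lines of:

    ls1:2345:respawn:/usr/local/bin/logstash-wrapper.sh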

Elasticsearch

If you want to be able to search your logs with Kibana, logstash needs to output to Elasticsearch. Unless you already have experience deploying Elasticsearch, you will probably spend more time learning about Elasticsearch than Logstash. I was an Elasticsearch newbie, so some of these may seem like common sense to people familiar with Elasticsearch.
  • There are 2 plugins I would consider essential for running Elasticsearch: head (management) and bigdesk (monitoring). Paramedic is also extremely useful for monitoring multiple cluster nodes. If you decide to use the "elasticsearch_river" output, you will also need a river plugin. Unlike other plugins, you *must* restart Elasticsearch after installing a river plugin (at least with the rabbitmq one). Install commands are sketched after this list. A good list of plugins: http://www.elasticsearch.org/guide/reference/modules/plugins.html
  • Elasticsearch is a Java program, so you will need to tune the JVM to your machine; bin/elasticsearch.in.sh is a good place to start. I'd personally recommend setting ES_HEAP_SIZE and telling Elasticsearch to lock that memory. You will almost certainly need to increase the open-files limit: on RHEL it seems to default to 1024, while most recommendations for Elasticsearch are around 300k or 600k. (See the tuning sketch after this list.)
  • There are 3 "outputs" in Logstash for Elasticsearch. The "elasticsearch" output runs an embedded Elasticsearch node and can either store data itself or join an existing Elasticsearch cluster and send the data there. I couldn't get "elasticsearch_http" to work with bulk indexing. I'm currently using "elasticsearch_river", which sends events from Logstash to an AMQP (RabbitMQ) queue that Elasticsearch then indexes from (example output config below).
  • You need to configure the caching. I went with setting the cache type to "soft", which lets JVM garbage collection expire cache entries (elasticsearch.yml snippet below). For more info, see: http://blog.sematext.com/2012/05/17/elasticsearch-cache-usage/
  • Compress and optimize old indices once they are no longer being written to (curl examples below).
  • Use mappings and the options of the special _source field to limit which fields Elasticsearch stores and how they are typed. (http://www.elasticsearch.org/guide/reference/mapping/ and http://www.elasticsearch.org/guide/reference/mapping/source-field.html)
  • Best practice: make a template that gets applied to all logstash indices; the template is also a convenient place to put the mapping, _source, cache, and compression settings above (a sketch closes out this post). I started with http://untergeek.com/2012/09/20/using-templates-to-improve-elasticsearch-caching-with-logstash/
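
Plugin installation, roughly as I remember doing it. The install directory and the river plugin version are assumptions; check each plugin's README for the string that matches your Elasticsearch release.

    cd /usr/share/elasticsearch          # or wherever Elasticsearch lives
    bin/plugin -install mobz/elasticsearch-head
    bin/plugin -install lukas-vlcek/bigdesk
    bin/plugin -install karmi/elasticsearch-paramedic
    # River plugins need a version that matches your Elasticsearch release,
    # and unlike the others they require a restart to take effect.
    bin/plugin -install elasticsearch/elasticsearch-river-rabbitmq/1.4.0
    service elasticsearch restart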
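
The JVM and open-files tuning mentioned above, as a sketch. The heap size, user name, and limits here are placeholders to size for your own machine, not recommendations.

    # Environment for bin/elasticsearch.in.sh (or wherever you export it):
    export ES_HEAP_SIZE=4g              # size this to your machine

    # elasticsearch.yml -- lock the heap in RAM so it can't be swapped out:
    bootstrap.mlockall: true

    # /etc/security/limits.conf -- raise the limits for the user running ES:
    elasticsearch  -  nofile   300000
    elasticsearch  -  memlock  unlimited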
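
For reference, the shape of my "elasticsearch_river" output. The host names and credentials are made up, and the option names are from the 1.1.x-era docs, so double-check them against your Logstash version.

    output {
      elasticsearch_river {
        # Node the river-creation API call goes to:
        es_host       => "es01.example.com"
        # RabbitMQ broker that both Logstash and the river connect to:
        rabbitmq_host => "rabbitmq01.example.com"
        user          => "logstash"
        password      => "secret"
        exchange      => "elasticsearch"
        queue         => "elasticsearch"
      }
    }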
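
The cache setting I ended up with is the field cache type discussed in the sematext post. The key below is the 0.19/0.20-era name and may differ in later releases.

    # elasticsearch.yml
    # "soft" references let JVM garbage collection evict cache entries
    # under memory pressure instead of holding them until an OOM.
    index.cache.field.type: soft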
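
For compressing and optimizing old indices, something along these lines works. The index name, date, and segment count are examples, and store compression needs Elasticsearch 0.19.5 or newer, if I remember right.

    # Enable store compression on an index that's no longer being written to:
    curl -XPUT 'http://localhost:9200/logstash-2012.10.20/_settings' -d '
    { "index.store.compress.stored": true }'

    # Then squash it down to a handful of segments (rewriting them compressed):
    curl -XPOST 'http://localhost:9200/logstash-2012.10.20/_optimize?max_num_segments=2'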
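
Finally, a stripped-down version of the kind of index template the untergeek post describes, pulling the mapping, _source, cache, and compression pieces together. Everything in it is illustrative rather than a drop-in config.

    curl -XPUT 'http://localhost:9200/_template/logstash' -d '
    {
      "template" : "logstash-*",
      "settings" : {
        "index.cache.field.type"      : "soft",
        "index.store.compress.stored" : true
      },
      "mappings" : {
        "_default_" : {
          "_source"    : { "compress" : true },
          "properties" : {
            "@message"   : { "type" : "string" },
            "@timestamp" : { "type" : "date" }
          }
        }
      }
    }'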