Ruin The assorted ramblings of Brendan Tobolaski

Ansible in Production

For quite a while, if you wanted any sort of consistency in your servers, you needed a configuration management tool. This has really started to change over the last year and a half. Weʼve seen a proliferation of tools aimed at making running a datacenter easier than managing a single machine. These are things like Apache Mesos, Kubernetes, or CoreOS. All of these tools are based on a similar idea: you tell the system that you want to run x instances of a service with certain constraints and it figures out how to run them. Of course, they differ quite a bit in the details, but at a broad level this is what they do. While I find that idea hugely compelling, Iʼve decided to forgo these systems for now. This is mostly because I ran a deployment of one of them and had it fail miserably at unpredictable times. We are also not at a scale at which handling distinct servers is difficult.

With that roundabout introduction out of the way, my configuration management tool of choice is Ansible. I find describing Ansible a bit difficult as it is a tool with a few different use cases. In some ways, it operates like a distributed scripting language. You can define a script that will change the current master database, point all of your applications at the new database, restart the old master and then point all of your applications back at the original database instance. It is equally useful in a more traditional view of configuration management, where Ansible installs and configures individual servers in their specific roles.

Ansible has two rather broad modes of operation. One of them is ansible-pull. In this mode, each server pulls down your configuration scripts and then applies them to itself. This is somewhat similar to traditional configuration management tools like Chef or Puppet. This mode doesnʼt appear to be used very often and that is probably a good thing, since both Chef and Puppet are far superior in this mode of operation. The typical mode of operation for Ansible is push. There is a control server (this could be your machine or a server somewhere) that initiates the Ansible run. The control server then connects over ssh to all of the servers that are part of the current playbook. Each step in the playbook is then applied sequentially, with all of the servers specified for that step receiving the commands in parallel. There are knobs that you can turn to control how many servers receive the commands in parallel. This mode of operation leads to a number of really cool bits of functionality. For instance, when youʼre provisioning a new web server, youʼre able to immediately add it to the load balancer, which is something that Chef and Puppet are unable to do.[1]
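To make the push model a little more concrete, here is a minimal sketch of a two-play playbook that provisions web servers and then updates the load balancer in the same run. The group names, the deploy user, the nginx and haproxy packages, and the template path are all placeholders I made up for illustration; the template is assumed to list every host in the webservers group. The serial keyword (together with the --forks command-line option) is the kind of knob that limits how many servers are touched at once.

```yaml
# provision_web.yml: hypothetical example, every name below is a placeholder
- hosts: webservers
  remote_user: deploy
  become: yes
  serial: 2                 # run this play on at most two web servers at a time
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
        update_cache: yes

- hosts: loadbalancers      # runs only after the web server play has finished
  remote_user: deploy
  become: yes
  tasks:
    - name: Regenerate the backend list from the inventory
      template:
        src: templates/haproxy.cfg.j2   # assumed to loop over the webservers group
        dest: /etc/haproxy/haproxy.cfg
      notify: reload haproxy
  handlers:
    - name: reload haproxy
      service:
        name: haproxy
        state: reloaded
```

Because both plays live in one playbook, the new web server is in rotation as soon as the run finishes, rather than on the load balancerʼs next scheduled configuration run.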

The terminology that is used to describe Ansibleʼs configuration is a bit strange. There are inventories, playbooks, plays, roles, tasks, host vars, group vars and modules. I think that inventories, host vars and group vars are self-explanatory, so that just leaves the others. I feel like the name playbooks was inspired a little too much by sports, but it actually happens to work quite well for describing their function. A playbook is the script that you run. This can be many different things: it could be as broad as a single script that creates your entire infrastructure, it might be like the example of provisioning a new web server, or it could be a multistep deployment script. Playbooks can include other playbooks or plays, and they contain the information on which servers the plays should be applied to. Plays are blocks in the playbook. Each play must have a user and a list of hosts that the play should run on. Plays can include custom variables, roles, and applications of modules. Modules are the basic commands of Ansible. These are things like installing a package on the server or copying a configuration file to the server. Tasks are applications of these modules.
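The terms are easier to keep apart when you see them in one file. The following is only an annotated sketch; the group names, the deploy user, the postgres role, and the file paths are hypothetical.

```yaml
# site.yml: a playbook is a list of plays (all names here are hypothetical)
- hosts: dbservers            # play 1: the inventory group it runs on
  remote_user: deploy         # the user Ansible connects as
  become: yes
  vars:
    postgres_port: 5432       # a custom variable scoped to this play
  roles:
    - postgres                # a role applied to every host in the play
  tasks:
    - name: Install the backup tool   # a task ...
      apt:                            # ... is one application of a module
        name: rsync
        state: present

- hosts: webservers           # play 2: a different set of hosts
  remote_user: deploy
  become: yes
  tasks:
    - name: Copy the application config
      copy:
        src: files/app.conf
        dest: /etc/app.conf
```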

A role is a set of related module applications. You might have a role for MySQL or Postgres, but given that roles are the only way to share functionality between multiple playbooks (you can include other playbooks as well, but that is limiting in some crucial ways), you end up using them for things that the name "role" doesnʼt really describe. I have a database migrations role. We have a few different infrastructure configurations for our app, but the migrations are always applied in the same way, so those steps were pulled out into a self-contained role. While semantically this doesnʼt make much sense, itʼs the only way to pull this common code out, and being able to reuse it between different playbooks is extremely valuable. While the power of Ansible comes from being able to group roles and modules in playbooks to automate important processes, youʼre also able to execute the modules directly. You could have patched all of your servers for Shellshock by simply running ansible -i inventory all -m apt -a 'update_cache=yes name=bash state=latest'.
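As an illustration of a role that is really just shared steps, here is roughly what such a migrations role could look like. This is a sketch under assumptions, not our actual role: the role name db_migrations, the ./bin/migrate command, the app_dir variable, and the appservers group are invented for the example.

```yaml
# roles/db_migrations/tasks/main.yml (hypothetical role)
---
- name: Run pending database migrations
  command: ./bin/migrate up       # placeholder for whatever migration tool you use
  args:
    chdir: "{{ app_dir }}"
  run_once: true                  # migrate once, not once per host in the play

# deploy.yml: any playbook can now reuse the same steps
---
- hosts: appservers
  remote_user: deploy
  roles:
    - role: db_migrations
      app_dir: /srv/app           # role parameter, passed per playbook
```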

Unlike other configuration management tools, Ansible is extremely easy to get started with. It helps that there is no server to set up, unlike Chef[2] and Puppet. You can simply install Ansible on your machine and get started automating servers. Itʼs also easy to get started writing playbooks and roles because they are written in YAML. YAML is very easy to read and write and there isnʼt very much syntax to pick up. That being said, playbooks have some expected keys and each of the modules takes a set of arguments. Other than a few core modules, like apt, copy, template and service, I still have to look up the arguments every time that I use them. I really recommend Dash for referring to the documentation; itʼs great and it will greatly speed up your development time with Ansible. Roles have a rather complex (compared to the rest of Ansible) directory structure, but once you use it a couple of times, it will really click. That is enough to get up and running with Ansible, but there are a few more advanced options that you wonʼt have seen, and to get everything you should probably read the entirety of the Ansible docs except for the modules documentation. Itʼs really worth the effort: the documentation is quite succinct and will greatly assist you in writing your Ansible scripts.

Once you have that basic knowledge, you can start writing scripts for everything. If you do anything over ssh, you can write a playbook for it. In fact, you probably should: it will be much faster and much less error-prone than doing it by hand (if you do it more than once). The separation between what should be a role and what should be in the playbook becomes fairly clear early on. If the task that you are doing requires applying a template, using handlers, or copying a file from the Ansible control server, then it should be a role. This is due to how roles bundle these things together; it makes much more sense after youʼve been using Ansible for a while. As I pointed out previously, you should also put shared script steps into roles whether or not the name "role" is semantically correct. In my case, I have the steps necessary to run database migrations in an Ansible role.
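Here is a small sketch of why templates and handlers push you toward a role: the role directory keeps the tasks, the template, and the handler next to each other. The nginx role and file names below are hypothetical; the tasks/, templates/ and handlers/ subdirectories are the standard role layout.

```yaml
# roles/nginx/tasks/main.yml (hypothetical role)
---
- name: Install nginx
  apt:
    name: nginx
    state: present

- name: Render the site configuration
  template:
    src: site.conf.j2             # looked up in roles/nginx/templates/
    dest: /etc/nginx/conf.d/site.conf
  notify: restart nginx           # only fires when the rendered file changed

# roles/nginx/handlers/main.yml
---
- name: restart nginx
  service:
    name: nginx
    state: restarted
```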

In my experience, the best way to build up your Ansible scripts is by simply doing everything with them. If you need to deploy a database server, write a playbook and role for doing just that. When you inevitably find something that could be done better, add it to the role or playbook and then reapply it to all of those nodes. This way, you donʼt need to recall every step that you took when setting up the next one. This is the way that most of our playbooks and roles have been developed. For example, at the time that I was deploying Logstash, I had no idea that the disk scheduler should be disabled (set to noop) on SSD-based machines. Iʼve since added that step to the relevant roles. This is much the same as any other configuration management tool, youʼre able to distill

Iʼve found that Ansible is extremely good at automating complex multi-server tasks. This includes things like deployments of multiple application services. Our deployments at Signal Vine are run using Ansible. They are quite complex: they involve reconfiguring 4 types of servers and applying migrations to 3 different data stores. Ansible has been handling all of this beautifully. Iʼve also written playbooks for all of the complicated operational processes. Thereʼs one for changing database settings and restarting the database with its dependent services. With another one, Iʼm able to change the Postgres master and then set up replication on the demoted master. After writing a couple of these, it becomes clear how valuable they are.

Now, what would an automation tool be without having some prebuilt patterns available? Ansible Galaxy fulfills this need. Unfortunately, I donʼt have a lot of experience with it, as there wasnʼt that much there a year ago[3] and Iʼm not eager to rip out working code to replace it with roles that are untested on our servers. I really like the idea of Ansible Galaxy and I really appreciate that all of the roles have a rating associated with them. It really helps you narrow down which roles you should audit. I feel that the usefulness of these roles is slightly hampered by not being able to run a role multiple times in a play by passing it different variables. This is a feature that is due to arrive in Ansible 2, but Ansible 2 hasnʼt shipped yet. In some cases this can be mitigated by the roleʼs author. If they turn a variable that someone might want to set multiple times into an array, then the role can properly handle this situation. In other cases, this isnʼt possible.
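The array workaround can be sketched like this. Instead of applying a role once per site, a role author accepts a list and loops over it inside the role. The vhost role, the vhost_sites variable, and the template name are hypothetical and not taken from any real Galaxy role.

```yaml
# playbook: the role is applied once, with a list instead of a single value
---
- hosts: webservers
  remote_user: deploy
  become: yes
  roles:
    - role: vhost
      vhost_sites:
        - { name: example.com, port: 80 }
        - { name: api.example.com, port: 8080 }

# roles/vhost/tasks/main.yml: the role loops over the list itself
---
- name: Render one config file per site
  template:
    src: vhost.conf.j2
    dest: "/etc/nginx/conf.d/{{ item.name }}.conf"
  with_items: "{{ vhost_sites }}"
```

This only works when the per-instance difference fits into a single looped task; a role whose whole task list needs to repeat cannot be rescued this way.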

This gets into what I think is Ansibleʼs biggest sore spot: reuse and composability. While Ansible Galaxy is nice, it has nowhere near the utility of Chefʼs Supermarket or Puppetʼs Forge. I donʼt think that Ansible as it stands today will ever have anything like that. Ansible is intentionally not a programming language. While I see some advantages to that for onboarding new users, I really feel like it hampers your ability to abstract things. Certainly users can go too far with abstractions, but limiting them so much is also painful. One of the things that I wish I could do is loop over a set of tasks, but thatʼs not possible. In a similar vein, Ansible has role dependencies. This is very helpful, but it doesnʼt help you in all cases. If you have multiple roles in a single play that depend on the same role with different variables set, Ansible will not run all of them.

In the past, Iʼve used Ansible to build Docker images. While this is entirely possible[4], it is not a pleasant experience. At first this seems like a good idea: you can use the same dependable scripts to deploy a particular service into any environment, whether that is a single monolithic server or a container. The reality is that these are very different environments and you probably donʼt want to install your app in the same way on both. You will end up filling your roles with various conditionals to handle running in a container versus being directly installed on a server. This ends up being extremely unwieldy. It also doesnʼt work with Dockerʼs expected mode of operation. Each step you specify in the Dockerfile builds up a cached image layer. Then, when you change one of the steps, Docker will use the cached layers up to the point where you modified it and only run the remaining steps. When you are using Ansible to run the provisioning, all of it happens in a single step. So to change a single thing, youʼll need to completely rebuild the container, and Ansible isnʼt well optimized for working in this way, so your container builds will take a significant amount of time. I would guess it will take somewhere between 1 and 10 minutes to build a container. This isnʼt horrible, but it is enough to be annoying. Instead you should use Docker and Ansible as they were intended. Use Dockerʼs toolchain to make container build artifacts and then use Ansible to deploy those to the required servers.

Iʼve been disappointed with the testing situation in Ansible. As far as I can tell, this is the only coverage that testing has gotten. Itʼs really not enough for me. I need a little more hand-holding to really understand the picture that they are trying to paint. I havenʼt yet tried testing the way that theyʼve suggested; I just canʼt see that working well. Iʼve defaulted to mostly manual testing, which is a drag. I do have a staging environment that I can apply changes to before running them against our production environment. In a similar vein, itʼs entirely possible to write a playbook that fails when run in check mode (--check) but works without a hitch when applied. I do understand how this could happen, but itʼs very annoying that Ansible doesnʼt notice it and issue a warning. I do like that Ansible includes --syntax-check for checking whether the specified playbookʼs syntax is valid. Unfortunately, it doesnʼt check whether the variables are defined before they are used.
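One common way to end up with a playbook that applies cleanly but fails under --check looks like the sketch below. It relies on my understanding that command tasks are skipped in check mode, so the registered result has no stdout; the host group and file path are made up.

```yaml
# check_mode_trap.yml: hypothetical playbook
- hosts: appservers
  remote_user: deploy
  tasks:
    - name: Read the deployed version
      command: cat /srv/app/VERSION
      register: deployed
      # Skipped under --check, so "deployed" carries no stdout key.

    - name: Announce the version
      debug:
        msg: "Running {{ deployed.stdout }}"
      # Fine in a normal run; under --check the template hits an undefined value.
```

Marking the read-only task so that it still runs in check mode (always_run in older releases, check_mode: no in newer ones) is one way around this particular case, but you have to notice the problem first. A syntax check passes either way.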

Another area that Iʼm not satisfied with is how you expand the set of people who use your Ansible scripts. I think it is fairly reasonable to expect anyone handling operations in your company to be able to install and run Ansible from the command line. I donʼt think that approach extends to everyone in your company, and it isnʼt clear how this can be done easily. It also becomes difficult to see when things are happening or when they have happened in the past. Ansible has a commercial product for this, Tower. Tower is an option, but itʼs pricey and it may not be exactly what youʼre looking for. You could also set up Ansible tasks in your CI server, but then your CI server needs ssh access to all of your servers with sufficient access rights to do those tasks. That would mean an attacker could change the software running in your production environment if they were able to compromise your CI system. That isnʼt something that I would feel comfortable with.

All that being said, I think Ansible is a great product. If you arenʼt currently using a configuration management tool, I highly recommend that you check out Ansible. If you already have one and youʼre satisfied with it, you should probably keep using it, but you may still find Ansible useful. Itʼs very useful for multi-server scripting of the type that you might do during a deploy. Ansible is a good, dependable and efficient tool, and Iʼm happy to use it.

  1. Yes, itʼs possible to make that happen with both Chef and Puppet, but it involves extra steps. First you provision the new server, which registers it with the configuration management server. Then, on their next configuration run, the load balancers add the new server into the rotation. With Ansible, it can be available immediately.

  2. Chef does have chef-solo, but itʼs not what you should use if you want to use Chef. Chef with a server is much better.

  3. Thatʼs when I started writing our Ansible scripts. Iʼve taken another look at it and it seems fairly decent now.

  4. And easy if you know what you are doing. Basically you need to set up your Ansible playbook as if it were being used for ansible-pull and then you need to install Ansible in the container, add your playbooks/roles, and finally invoke Ansible within the container.