Previously, in Docker from Development to Production, I wrote about how to leverage Docker and Consul to quickly move from a development environment to a production environment. There is a critical piece missing from that story: how to deal with the distribution of secrets in any environment, particularly one that makes use of service discovery. Luckily I’m not the first to encounter this problem; the developers at Hashicorp have been working on a solution. Over the last several months at my local Docker meetup this issue has come up in our discussions more times than I can count. Typically someone will bring up Hashicorp’s Vault as a potential solution, at which point the conversation turns to a series of unanswered questions, since no one seems to have any experience deploying Vault. So, given the need to solve this issue, and the fact that I’ve designed a system requiring Vault or something similar, I decided it was time to investigate. While I won’t repeat the Vault user’s guide here, I’ll dive into some of the features, talk about how I’m using Vault in my system, and provide links to code along the way to help you see it working in a PoC-style system.
Before we get started let’s clearly define the problem. I have secrets in the form of credentials, passwords, and maybe PKI materials that I want to share with my code. Storing these items in a configuration management or revision control system is wrong because I need finer-grained access control than those tools allow.
In an ideal world only the service that needs the access would ever be given credentials, those credentials would rotate on a regular basis, and they would never be shared even with the same service running in a different process. While I’m dreaming I’ll add that an intrusion detection system would be monitoring all processes – as well as network traffic etc. – and upon detecting any compromise would revoke the credentials issued to that process and kill the process. We won’t get all the way there in this article but we’ll see how close we can get and glimpse what the additional work would be to solve this problem.
An Overview of Vault and Some Close Friends
Vault is an open source project from Hashicorp (other projects include Vagrant and Consul to name a few) that stores secrets, provides audit logging for all actions, and allows secrets to be maintained via leasing and revocation processes. Vault also unifies access to databases, AWS IAM, and a growing number of other – what Hashicorp calls – secret backends. What this means is you can request that Vault create a new credential for your database, and Vault will create the credential and maintain its lifespan through leasing and revocation processes.
Authentication with Vault is achieved via any of several methods supported by Vault’s auth or credential backends. The primary method is via token, but LDAP, username/password, MFA, and App ID are also supported. A token is basically a shared secret and is the primary means of identification and authorization for all Vault APIs. Every token has an associated lease time, so it will eventually expire unless renewed. You can get a token via any of the other auth backends. Typically for an application this means using App ID.
The App ID auth backend allows any service that knows two arbitrary predefined strings – called app ID and user ID – to get a token. Vault does not generate these strings for you, nor does Vault define how they should be shared between systems. It is recommended, however, that at least one of them be delivered to a participating machine using a process out of band from the typical CM system.
Vault can use a variety of backend data stores to keep your encrypted data. These include officially supported stores like Consul, file, and in-memory stores, but there are also community-supported backends like DynamoDB, etcd, ZooKeeper, S3, and PostgreSQL. The biggest differentiator among storage backends is their varied support for high availability configurations. Consul does support HA configurations and is the backend Hashicorp recommends. I’ll leave it as an exercise for the reader to determine the suitability of that recommendation to their particular situation, but in our case we will use Vault with Consul since we are already supporting Consul for service discovery.
Operationally Vault can be configured in a highly available active/standby server architecture. Any unsealed Vault server can accept requests but they are always passed to and processed by the active master. Any unsealed standby can take over as master in the event the current master fails. All HA functionality is available when using an HA storage backend, such as Consul.
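To make the HA setup concrete, here is a minimal sketch of a Vault server configuration file using Consul as the storage backend. The address, path, and TLS settings are assumptions for a local PoC; in particular, `tls_disable = 1` should never survive into production.

```hcl
# Keep Vault's encrypted data in the local Consul agent.
# Consul is an HA-capable backend, so any unsealed server
# configured this way can participate in active/standby failover.
backend "consul" {
  address = "127.0.0.1:8500"
  path    = "vault"
}

# Listener for client requests. TLS is disabled only to
# simplify the PoC; production deployments need certificates here.
listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1
}
```

Each Vault server in the cluster gets the same configuration and registers itself through Consul, which is how standbys discover and defer to the active master.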
In addition to Vault and Consul, Hashicorp has developed an interesting and sometimes useful tool called consul-template. I personally first came across consul-template back in 2014 while configuring some load balancers based on current services available in Consul. Consul-template works great for these types of scenarios, and for those unfamiliar with this tool it will basically allow you to fill in template values based on changes in Consul. Last year Hashicorp added support in consul-template for Vault. This extends the templating functionality to Vault secrets.
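As a sketch of what that Vault support looks like, here is a consul-template template that renders database credentials into a properties file. The mount point `postgresql` and the role name `readonly` are hypothetical; they stand in for whatever secret backend path your Vault setup exposes.

```
{{ with secret "postgresql/creds/readonly" }}
db.username={{ .Data.username }}
db.password={{ .Data.password }}
{{ end }}
```

When the lease on the rendered secret approaches expiry, consul-template fetches a fresh credential and re-renders the file, just as it re-renders on Consul service changes.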
I’d first like to point out that Vault is a relatively deep program. What I mean by that is it is very flexible allowing you to implement security policies customized for your organization. So keep in mind that none of the policy shown here has been reviewed by security professionals, and some configuration is purposefully open to ease the implementation of this PoC. In other words, don’t try this in production.
A First Pass Solution
My first thoughts on using Vault were focused on a particular subset of the secret sharing problem, namely distribution of database access credentials to Spring-based Java and Groovy applications using connection pools. In Java, connection pools are typically implemented as a proxy that intercepts calls to the standard database connection factory and, instead of creating a new connection, provides a cached connection that has already been created. The connection factory, on the other hand, knows how to create new connections, including owning the database credentials. In Spring both of these objects typically live in the same context as our application and can be managed by the application. Given this, what happens when a lease expires on a credential in use by the app? At the very least we’d need to provide new credentials to the connection factory. What happens to connections established with now-expired credentials is probably database dependent, so to be safe we’d likely want to remove any cached connections with expiring credentials from our connection pool, or risk a mass drop in connections at some point. Things get a bit more complicated if we move into a Java EE environment, where our connection factory is likely owned and managed by the application server rather than the application. So even though I could likely implement an additional proxy to intercept calls to the connection pool in a Spring app, and work with the life-cycles of the connection pool and factory to create a solution that automatically gets credentials from Vault and responds to lease renewals and revocations, it sounds really complicated and does nothing to solve the same problem if my target app is written in Python, Go, Elixir, or some other inferior language like Perl or whatever it is people do on Windows.
Before I spent a lot of time thinking about how to integrate Vault with my Spring apps, it occurred to me that I could simply write a configuration file with the Vault credentials, load those into my app, and just restart each time I get new credentials. But I dismissed the thought almost immediately as far too simple. Several things happened that changed my mind. First, I discovered that consul-template had added support for Vault almost a year ago. Second, the size of the Java-only solution was growing rapidly, and I wasn’t certain that a solution there wouldn’t require completely replacing the connection factory each time the credentials updated. This could easily lead to memory leaks, since the connection factory is typically a singleton with a life-cycle matching that of the application, so any bugs in those Java libraries that improperly hold references probably haven’t been discovered. Third, I was explaining the situation to a friend, and the act of saying it out loud made me realize that it was way overly complicated and possibly just a little too clever, and he agreed. Clever almost never makes sense when you get paged at 2 AM. Fourth, while either solution introduces new dependencies into the system as a whole, this dependency would not be directly on the critical processing path for incoming requests.
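The restart-on-rotation approach can be wired up entirely in consul-template configuration. The sketch below assumes hypothetical file paths, a service named `hello`, and a System V style `service` command; adapt all of these to your environment.

```hcl
# Render fresh credentials from Vault into the app's config file,
# then bounce the service so it picks them up. The template file
# itself contains the Vault secret lookup.
template {
  source      = "/etc/consul-template/db.properties.ctmpl"
  destination = "/opt/hello/conf/db.properties"
  command     = "service hello restart"
}
```

The `command` runs every time the destination file changes, which with a Vault-backed template means every time the lease forces a credential rotation. That single line is what replaces all the connection-factory gymnastics described above.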
This solution is not without some drawbacks. First, care needs to be taken that the services cycle at different times so that enough capacity is maintained to respond to the service load. Second, the service should be removed from load balancers in a way that allows in-flight requests to complete prior to shutdown if that is critical to the upstream systems, unless you can afford this in your error budget. Third, any code whose typical usage involves regular restarts runs a higher risk of accumulating latent memory leaks that go unnoticed, typically until usage changes with newly released features. Not that I’ve seen this on several occasions. So as always, invest in good monitoring practices, and then look at what the monitoring is telling you. Better yet, set up some automatic ticketing based on your monitoring to tell you to look at what it is telling you. Or just wait for the alarm/page, but memory leaks almost never make sense when you get paged at 2 AM. You’re probably picking up on a pattern here; it’s all part of one of my general rules of software engineering, which states: “Almost nothing makes sense when you get paged at 2 AM, except rage. Plan accordingly.”
The Proof of Concept
In Docker from Development to Production we used AWS to show a PoC. This time I wanted to do something a little different. Instead we’ll be using Vagrant to build out a local set of virtual machines with a software defined network to demonstrate Vault. The same concepts can be applied to AWS and CloudFormation. This approach, while obviously limited to demos, does have a few advantages for development. First, CloudFormation is an all or nothing approach: it rolls back changes on failure. This is actually a great feature, but it can slow iterations just a bit. Salt, on the other hand, will detect and skip most states it has already applied, allowing me to iterate more quickly locally since I’m only applying changes. CloudFormation will actually apply changes in a transactional way as well, but it still feels a bit slower to me. I can also bring up individual machines defined in my Vagrantfile, whereas again CloudFormation is an all or nothing transaction. But really my biggest reason for doing this is that I’m guessing most people haven’t seen Vagrant used in this way, and it’s one more tool to add to your tool belt. While most people with exposure to Vagrant will have realized that a Vagrantfile is actually a Ruby program, you will really see that here. Though this is not the focus of this article, you should take a look at the interaction between the Vagrantfile and the surface.yaml file.
You can see details of how to bring up the environment in the README on the poc-vault-server Github project page. I’ll try to leave most details there, as this is meant to be a living project that may change after this article is published. So in this section I’ll focus on the higher level concepts captured in the PoC.
Vault and Entropy in Production
While Vault is definitely a service, it is important to note that it needs to be treated as a low entropy service in much the same way that you would treat your production database systems. The data Vault stores is critical to the operation of other services. This means treating the Consul data volumes in much the same way as a database data volume. However, it also means developing a policy around how Vault unseal keys are stored, as well as how many unsealed Vault servers must be on standby at any given time. When I say policy here I don’t mean a document. While that certainly might be one element, it should not be the entire effort. If I were running Vault in a production context I would set up ticketing and alarming based on monitoring thresholds around the number of unsealed Vault servers. I would also run regular “Vault sealing exercises” where a server is sealed and pages are sent to key holders. This would let you measure response times, as well as ensure enough keys are actually present to complete an unseal under controlled circumstances, ultimately acting as the driver for policy and process improvement.
Vault Initialization and Unsealing Keys
There are basically two activities we could be referring to by initialization: either the act of creating a new Vault store, or the act of adding configuration to an existing Vault store. For simplicity let’s agree to call the act of creating a new Vault store initialization. In this PoC I’ve decided to treat initialization as a manual process. The reasoning is that I would prefer to keep the unseal keys out-of-band, even if that means written on a piece of paper tucked into an engineer’s wallet. Obviously low-tech is not the only way to keep things out-of-band, but it is certainly cost effective and easy to implement for a PoC. Another possibility would be to use a group of machines, each responsible for one unseal key. These machines would be locked down via a firewall allowing no incoming network connections in order to reduce the attack surface as much as possible. They would have to poll or watch a message queue in order to respond to events requiring them to provide unseal keys to Vault servers, and they would heartbeat in order to provide telemetry data for monitoring systems. However, the complexity of such a system grows significantly the more secure it is required to be, and it likely still requires plenty of manual intervention.
Vault Setup, Chickens, and Eggs
By Vault setup I mean the creation of configuration, dynamic secret generation policies, security policy, roles, app IDs, etc. in the existing Vault store. The landscape here is filled with “chicken vs egg” type problems. While many of the policies, role configurations, and even the rules regarding how app IDs are generated could be placed in a configuration management system, there are secrets that should not be. There are two examples in our PoC. First, the master database credentials that Vault uses to access Postgres and generate on-the-fly credentials for the hello world service. Second, the salt value that is used as part of the generation of App and User IDs to establish a trust relationship with Vault’s client machines. The first situation might be solved by automating setup of databases and making the credential push to Vault part of that setup. I did not use that approach in the initial version of this PoC, but it seems like a good idea to try out. The second issue is a bit harder to solve because we have an ongoing need to deploy the salt to new nodes as they are brought online. I’m still looking for a good out-of-band approach for the placement of this salt on all systems. One idea on the AWS platform is to grant access rights to all new servers created in the VPC so that they can load a well-known S3 object containing the salt value. I believe this would meet the requirement of being out of band with CM – probably CloudFormation in this case – but easily accessible by systems that need it. Another option that keeps the salt value out of CM would be to build that value into the base VM image. This assumes that the image is not accessible by someone who might want to attack the system. This technique could work with a VMware-based environment as well as other virtual infrastructure environments that allow use of custom machine images or templates.
AppID and UserID
I chose to use the App ID authentication mechanism in Vault to authenticate machines because it appears to be simpler than using TLS. With TLS I’d have to figure out another way to securely deliver unique materials to each machine, whereas with App ID I’m just delivering the same hopefully secure salt value to all machines. To actually generate my App IDs I use the SHA-256 hash of the service name plus my salt value. Similarly, for User IDs I compute the SHA-256 hash of the local machine’s IP address plus my salt value. On the Vault server I can pre-populate known User IDs with access to a particular App ID, since I know the salt value and the address range of clients. Based on my limited understanding of cryptographic hashing, the time to guess the salt value from the hash for one particular machine grows approximately exponentially in L, where L is the length of the salt value. Also, assuming you know H and X in H=h(X,S) – where X is the IP, S is the salt, and H is the hash value for that machine – you might find an S′ such that the computed value of h collides with H, but it would not be the original S, so you still wouldn’t be able to generate another H for an X of your choosing. Basically, what all that means is that as long as S is a long random value you should be relatively safe, as long as you keep it secret. However, definitely read up on SHA-2 and whatever recent attacks might be developing against that algorithm before basing your policy on it.
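The ID derivation described above is simple enough to sketch in a few lines. This is a minimal illustration, not the PoC’s actual code; the salt value and function names are hypothetical, and the salt must be a long random secret delivered out of band.

```python
import hashlib

# Hypothetical placeholder; in practice this is a long random secret
# distributed out of band to every participating machine.
SALT = "replace-with-a-long-random-secret"

def app_id(service_name: str, salt: str = SALT) -> str:
    """App ID: SHA-256 of the service name concatenated with the salt."""
    return hashlib.sha256((service_name + salt).encode("utf-8")).hexdigest()

def user_id(ip_address: str, salt: str = SALT) -> str:
    """User ID: SHA-256 of the machine's IP address concatenated with the salt."""
    return hashlib.sha256((ip_address + salt).encode("utf-8")).hexdigest()
```

Because both the Vault server and the client machines can derive the same values from names and addresses they already know, the server can pre-map User IDs to App IDs without the IDs themselves ever crossing the wire outside the login call.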
I’ve been working on learning as much as I can about how Vault will behave operationally, but I’ve still got more questions. I’m going to list those that I currently have for my reference and hopefully address them in smaller blog entries as I find answers.
- How does consul-template respond when Vault is sealed? Will it kill the service?
- What is the best way to handle initialization of Vault? Particularly key distribution for unsealing? Should it be automated?
- How do we securely distribute salt values or other secret information only to trusted machines or processes prior to building the trust relationship with Vault?
The PoC demonstrates that Vault can be used to provide credentials needed for a service. Those credentials are even automatically generated custom to a particular instance of a service (on a per machine basis in the PoC). The credentials even get rotated based on the lease time given by Vault. So we covered quite a bit of the ground set out in our vision.
So what about automatic revocation by intrusion detection systems? I think that should be possible and possibly quite simple. Vault allows a secret lease to be programmatically revoked, so it is only a matter of wiring revoke actions to the appropriate alarms in the IDS. I’ll leave this as future work most likely for someone that completely understands the operational profile of an IDS.
There are some remaining challenges. In particular, establishing the initial trust relationship with Vault requires keeping some form of secret outside of Vault and CM. This is definitely a challenging problem, but there seem to be some possible solutions depending on the environment in which the system is deployed and the level of security required. Vault sealing and initialization are also challenging topics. While it would be great to have an automated solution to this issue, it may be better suited to being solved via policy and process, with a healthy dose of monitoring to keep things up and running.
Overall my impression is that Vault solves a lot of problems, but security is still hard. I’ll definitely be using Vault moving forward in the systems I’m designing because it makes a lot of these problems easier. I’m hoping I’ll be smart enough to figure out solutions to at least some of the remaining issues, or get some feedback from the community that will help.