Secure File Exchange for Mortals

TL;DR
Crypt-Keeper is a simple and scalable web service for keeping file encryption credentials out-of-band to the file being protected and exchanged. This is different than many contemporary file exchange services in that the storage mechanism does not have to be trusted to keep your data secret. Crypt-Keeper leverages AWS S3 for storage in order to take advantage of nearly infinite storage capacity, ease of management, availability, and of course huge scalability.

In my last post, Distributing Secrets, I talked about some of the problems associated with internal distribution of secrets across a deployed architecture. Unfortunately this is not the only area where we run into issues with distributing secrets. Many of us also need to share data with customers, and in many of those cases the data is sensitive and should be protected. Of course there are very well known methods for keeping data secure in transit, for example using SSL, but this does not protect the data at rest. In the next section we’ll go over the problem, and some existing partial solutions.

The Problem

There are multiple existing solutions for sharing files. Dropbox, Google Drive, S3, Google Cloud Storage, Azure Storage, FTP or SFTP are all possibilities among many others. So, why do we need yet another solution? Let’s first look at some of the other options, then we can see why we need something new to augment the available feature set. We’ll see that each option suffers from some combination of the following problems: P1) locality of compute resources in the architecture, P2) the need to trust the infrastructure to not snoop on encryption keys, or P3) lack of infrastructure or service to maintain encrypted files or shared key materials.

Off the shelf solutions

Dropbox is primarily focused on syncing user data across machines, and user to user sharing, but more recently they have an enterprise offering. Of course there is programmatic access and this would certainly be a possible solution to file exchange save a few issues. First, Dropbox does not offer compute services, so my processing infrastructure would still need to pull a file from the non-local network storage before completing its task, AKA P1 above. Second, I need to trust Dropbox that they really do encrypt my data and pinky promise not to use those keys to snoop on me, P2. Third, I could encrypt everything before it goes in Dropbox, but then I still need to maintain those key materials somewhere, so P3.

Google Drive is fairly similar to Dropbox, but Google does offer some compute services. I don’t believe Google Drive is being marketed at enterprise scale at this point, but I’ve seen some enterprising small businesses attempt to use it for sharing files internally. Hypothetically I could use Google’s Compute Engine, so that problem is solved. I’m not judging those that have ventured out to alternative cloud providers, but honestly when you say cloud I, and probably 90% of the rest of the industry, immediately think AWS with Azure as a close second. I don’t think it’s just me as this Gartner report shows similar results. So, the reality is I’m probably not using Compute Engine. Either way, I still need to either trust Google or encrypt my own docs. So, Google Drive suffers from P2 and P3. Note that by my reading of the terms of service Google is not even pretending not to snoop on your data.

S3 is more of a building block than either Dropbox, or Google Drive. You basically get a RESTful interface to abstract storage containers with a few different selections of SLA, and a very long track record of meeting or beating those SLAs at what I think is a great price. Basically, if you are in a situation where S3 can work for your data needs you will probably do the math and find it is the best option. Now if you should find yourself in a position where S3 doesn’t work for your needs and you can’t point to a legal requirement or fiduciary duty that tells you why then you need to question the direction of your business. If you’ve done the math and concluded that it is cheaper to build your own datacenter, then you are wrong. Seriously, you forgot to include something in your math. That said S3 has the same problem as previous systems. I either trust Amazon not to use my keys, or I encrypt before transferring documents. Here, again S3 has problems P2 and P3.

Google Cloud Storage is the Google equivalent to Amazon S3, part of Google’s Cloud Platform. They offer several different SLAs, and an API to the service. I do not personally have any experience using GCS, but I do know there are several enterprises that do use the Google platform. GCS suffers from P2 and P3 just as the other services mentioned here.

Azure storage is a slightly more complicated model than S3 but does offer competitive options. I had the opportunity to chat with a senior engineer on the Azure storage team some months back and I have to say I’m impressed with the technology and the direction. That said, Microsoft is quickly recovering from, but still affected by, years of very poor management. I also have ZERO experience with production on Azure and I can’t imagine at the moment when that would change. To my knowledge Azure storage suffers from the same issue as the other systems reviewed here in that I either need to trust Azure or encrypt my own docs. Starting to see a pattern? Azure suffers from P2 and P3.

Roll your own solutions

I’m sure there are other options, but for most situations if I’m exchanging documents with a customer my service will be based on some form of FTP, SFTP, or I’m going to skip the intermediary and use S3 since for many companies that is the ultimate destination for data anyway. Given that I’ll try to explain why I think FTP and SFTP are not the right solutions.

Let’s get specific about what I mean by FTP. FTP has been around for a long time, so there are plenty of variations to choose from. RFC 114 was introduced in 1971, well before the TCP/IP stack was even the standard for the Internet. It is only of historical note today, as it has been replaced by the RFC 765/959 draft and standard. RFC 765 and RFC 959 update the protocol for use on TCP/IP. RFC 1579 recommends passive mode (the PASV command, already defined in RFC 959) for firewall-friendly operation. RFC 4217 defines the use of FTP over TLS, but note this was not an official standard until October 2005. So basically when I say FTP I mean any and all of these standards including what is typically called FTP, FTPS, FTP(ES), FTP over SSL, or FTP with TLS. There might be other names but we are basically looking at a control pipe and data pipe either of which may or may not be secured. That’s FTP. Obviously I can deploy my FTP service in a way that solves P1, but nothing about FTP addresses P2 or P3 in any way.

Now, SFTP refers only to SSH File Transfer Protocol. Despite the similarity of the name this is not FTP. It is a subprotocol of the SSH or Secure SHell protocol. It offers the same level of transport layer security that SSH does, and that is all. We end up in a similar situation with SFTP as we did with FTP. We can easily solve the P1 problem depending on how we deploy services, but there is no built-in solution to either P2 or P3.

Why none of the above really work

Now that we’ve established what the players are, I’ll tell you I would not use either FTPS or SFTP. The reasons are simple. First, they suffer from the same issue as the third party services mentioned above in terms of P3. They can secure the transport of the data, but don’t really help with the encrypted storage P2 problem. Second, I don’t think they are easily secured, scaled or managed. For example, it is possible to configure an FTP server so that it allows non-secure transmission of files based on choices the client makes. Third, I’m already going to be running HTTPS somewhere if I support any websites or services. Since I’m already putting effort into securing, scaling and managing HTTPS then I’m going to use it. Now since I’m using HTTPS anyway I might as well use the service someone else has already created for storage and just move the whole thing to S3. Problem solved, except that remember no matter what protocol I use I still had issues with storing the file unencrypted and managing keys, or what I’ve been calling P2 and P3. Enter Crypt-Keeper.

Crypt-Keeper solves the problem of storing a file at rest using a secure encryption key that I can easily share with any other Crypt-Keeper user on my own trusted infrastructure. It does this leveraging the security and scalability of S3, but without completely trusting the S3 or AWS platform.

The Solution – Crypt-Keeper

Crypt-Keeper is a reference implementation of the Secure Document Service (SDS) implemented as a Python client, and Django web application. The Python client exposes a Python native library for simple programmatic access to Crypt-Keeper, and has a command-line interface for scripting simple file transfers. The Django web application features a management interface for the service, and a simple RESTful service interface available to use with any client that can effectively use HTTP protocols.

The SDS protocol is so simple you can execute the three necessary use cases using a client like curl, or Paw. To upload a file you POST metadata about the file to the SDS file upload API endpoint. You get back information including a document ID, encryption key, encryption type, and a signed S3 URL to PUT the file. If you use the Python client it will automatically encrypt and stream the file to S3. You can throw out the encryption key when you are done, the download endpoint will retrieve your key when needed as long as you have the document ID. Next, you can use the SDS file share API endpoint to assign access privileges to another Crypt-Keeper user. Finally, you or any users that have access rights and the document ID can download the file using the SDS file download endpoint to retrieve document metadata, encryption information, and a temporary signed S3 download URL. Of course using the Crypt-Keeper Python client is even easier since it grabs all the information needed from the SDS API endpoints, applies the correct encryption to your file, and uploads or downloads from S3.
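
The three use cases above can be sketched with a toy, in-memory stand-in for the service. This is not the Crypt-Keeper code or its real API: the class name `ToySDS`, the method names, the field names, the placeholder URLs, and the rule that only the owner may share are all assumptions made for illustration. The point it tries to show is that the service hands out keys and access decisions while never touching the file contents.

```python
import secrets
import uuid


class ToySDS:
    """Hypothetical in-memory model of an SDS-style service (illustration only)."""

    def __init__(self):
        # document ID -> key, owner, readers and caller-supplied metadata
        self._docs = {}

    def upload(self, user, metadata):
        """Register a document and return what a client needs to encrypt and PUT it."""
        doc_id = str(uuid.uuid4())
        key = secrets.token_bytes(32)  # the key stays with the service, not with storage
        self._docs[doc_id] = {
            "key": key,
            "owner": user,
            "readers": {user},
            "meta": dict(metadata),
        }
        return {
            "document_id": doc_id,
            "key": key,
            "encryption_type": "AES-256",  # placeholder label, assumed
            # Stands in for a signed S3 PUT URL; not a real endpoint.
            "upload_url": "https://storage.example.invalid/put/" + doc_id,
        }

    def share(self, owner, doc_id, other_user):
        """Grant another user read access (owner-only is an assumption)."""
        doc = self._docs[doc_id]
        if doc["owner"] != owner:
            raise PermissionError("only the owner can share this document")
        doc["readers"].add(other_user)

    def download(self, user, doc_id):
        """Return key material and a download location to users with access."""
        doc = self._docs[doc_id]
        if user not in doc["readers"]:
            raise PermissionError("user has no access to this document")
        return {
            "document_id": doc_id,
            "key": doc["key"],
            "metadata": doc["meta"],
            # Stands in for a temporary signed S3 GET URL; not a real endpoint.
            "download_url": "https://storage.example.invalid/get/" + doc_id,
        }
```

A client would encrypt the file with the returned key before sending it to the upload URL, so the storage side only ever sees ciphertext.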

One of the really great things about the SDS protocol is it is nearly infinitely scalable. There are two limitations to scale: the storage system which is basically S3 with S3’s limitations, and the datastore for SDS. In Crypt-Keeper we use a traditional SQL based backend like MySQL or Postgres. However, if the datastore were replaced with Dynamo you could scale much higher than for a traditional DB. So, why didn’t I use Dynamo in Crypt-Keeper? I want to be able to use Crypt-Keeper on less expensive infrastructure, I want to be able to run the service outside of AWS, and I wanted to get the project to a workable state with 2 to 3 months of my spare time at about 5 hours a week.

So that is Crypt-Keeper. I think it is an elegant solution to a difficult problem, with quite a bit of potential. In the next section I’ll talk a bit about the limitations and future plans.

Limitations and Future Plans

Currently, there are a few known limitations to Crypt-Keeper. Files larger than 5 gigabytes are not supported as we are not using multipart upload. Amazon recommends using multipart uploads for any file larger than 100 megabytes, so this is a change that I’d like to work in soon to support larger file sizes. Initial analysis on this problem indicates that the changes to Crypt-Keeper will be backwards compatible with the initial release of Crypt-Keeper. I delayed the implementation because I want to get the code out in the wild sooner, and just don’t have the time right now to invest. This will likely be one of the first additional features I’ll add in the near future, depending on bug priority and demand for larger file sizes.
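
To make the multipart idea concrete, here is a minimal sketch of how a client might split a file into parts before uploading. It is not part of Crypt-Keeper; `plan_parts` and its default part size are hypothetical. The two limits it relies on are S3's documented ones: every part except the last must be at least 5 MiB, and an upload may have at most 10,000 parts.

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB   # S3 minimum size for every part except the last
MAX_PARTS = 10_000   # S3 maximum number of parts in one multipart upload


def plan_parts(file_size, part_size=100 * MIB):
    """Return a list of (part_number, offset, length) tuples covering the file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if part_size < MIN_PART:
        raise ValueError("part_size is below the S3 minimum")
    # Grow the part size if the requested one would need more than MAX_PARTS parts.
    smallest_allowed = -(-file_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Each tuple maps to one ranged read of the (already encrypted) file and one part upload, which is also what makes retrying a single failed part cheap.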

There are some known bugs in the client. The command line interface does not report failures in a smooth fashion at present. Additional testing is needed here. Running nosetests with coverage on the client package currently reports 98% coverage, but as always numbers don’t always tell the full story. For example, there are currently no tests on the command parser, and nosetests completely ignores that file in coverage reports. Perhaps this is user error, but sadly I don’t think it is. I’ve turned on the cover inclusive options etc., but with no change in output. My experience with coverage tools in other languages is that they report inclusively by default, and exclude by configuration. Long story short I didn’t really discover until recently that I had not implemented any tests on the command line processor, and there are definitely issues in that module. No excuses here, I was looking at total coverage and feeling cocky in the high nineties, even experienced engineers get to learn something every once in a while.

Discovering a failing in my tool chain, or at least my usage of the test coverage data, led me to look deeper into the service test coverage. At this point the only aspects of the service that have zero test coverage are the management commands that run to configure the service but are not part of the critical path. As it stands today the service is at about 75% coverage, but I’m a bit more confident in that number than with the client as I’ve spent quite a bit more time discovering bugs and coding tests on the service than on the client. Also, while the client tests are unit tests only, the tests on Django set up a test SQLite DB and actually perform end to end integration minus a real production DB. So, while 75% is less than 98%, that 75% is focused on the critical path of the service. Obviously all these numbers are subject to change as time moves on, so checking the Github repo and running the coverage reports for yourself are the best ways to determine current status.

Another thing I’d like to add is an easy to deploy Docker container. I’ve obviously (see previous blog posts) done some work with Docker before, but given the sensitive nature of this program I want to be sure that any official container is secure. I still have more analysis to do on the reference Vagrant machine that is part of the distribution of Crypt-Keeper. The goal with the Vagrant install is to PoC what a secure production environment might look like. I look forward to community feedback on those aspects.

Conclusion

I had a great time creating Crypt-Keeper. I hope it is useful. I’d love feedback. What do you think? Does Crypt-Keeper solve the problem? Send PRs!

Distributing Secrets

Previously in Docker from Development to Production I wrote about how to leverage Docker and Consul to quickly move from a development environment to a production environment. There is a critical piece missing from that story in how to deal with distribution of secrets in any environment, particularly one which makes use of service discovery. Luckily I’m not the first to encounter this problem, and the developers at Hashicorp have been working on a solution. Over the last several months at my local Docker meetup there have been more mentions of this issue during our discussions than I can count. Typically someone will bring up Hashicorp’s Vault as a potential solution, at which point the conversation turns to a series of unanswered questions since no one seems to have any experience deploying Vault. So, given the need to solve this issue, and the fact that I’ve designed a system requiring Vault or something similar, I decided it was time to investigate. While I won’t repeat the Vault user guide here I’ll dive into some of the features, talk about how I’m using Vault in my system, and provide links to code along the way to help you see it working in a PoC type system.

The Problem

Before we get started let’s clearly define the problem. I have secrets in the form of credentials, passwords, and maybe PKI materials that I want to share with my code. Storing these items in a configuration management or revision control system is wrong because I need finer grain access control than those tools allow.

Vision

In an ideal world only the service that needs the access would ever be given credentials, those credentials would rotate on a regular basis, and they would never be shared even with the same service running in a different process. While I’m dreaming I’ll add that an intrusion detection system would be monitoring all processes – as well as network traffic etc. – and upon detecting any compromise would revoke the credentials issued to that process and kill the process. We won’t get all the way there in this article but we’ll see how close we can get and glimpse what the additional work would be to solve this problem.

An Overview of Vault and Some Close Friends

Vault is an open source project from Hashicorp (other projects include Vagrant and Consul to name a few) that stores secrets, provides audit logging for all actions, and allows secrets to be maintained via leasing and revocation processes. Vault also unifies access to databases, AWS IAM, and a growing number of other (what Hashicorp call) secret backends. What this means is you can request that Vault create a new credential for your database and Vault will create the credential and maintain its lifespan through leasing and revocation processes.

Authentication with Vault is achieved via any of several methods supported by Vault’s auth or credential backends. The primary method is via token, but LDAP, username/password, MFA, and App ID are also supported. A token is basically a shared secret and is the primary means of identification and authorization to all Vault APIs. Every token has an associated lease time, so it will eventually expire unless renewed. You can get a token via any of the other auth backends. Typically for an application this means using App ID.

The App ID auth backend allows any service that knows two arbitrary predefined strings – called app id and user id – to get a token. Vault does not generate these strings for you, nor does Vault define how these strings should be shared between systems. It is recommended however that at least one of these is delivered to a participating machine using a process out of band from the typical CM system.
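
As a rough sketch of what that exchange looks like from a client, the helper below builds the login request and pulls the token out of the reply. The function names are hypothetical, and the path `/v1/auth/app-id/login`, the `app_id` and `user_id` fields, and the `auth.client_token` response field are as I understand the Vault HTTP API for this backend, so check them against the documentation for your Vault version. Nothing is sent over the network here.

```python
import json
import urllib.request


def build_app_id_login(vault_addr, app_id, user_id):
    """Build (but do not send) a login request for the App ID auth backend."""
    body = json.dumps({"app_id": app_id, "user_id": user_id}).encode("utf-8")
    return urllib.request.Request(
        vault_addr.rstrip("/") + "/v1/auth/app-id/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_from_response(payload):
    """Extract the client token from a JSON login response body."""
    return json.loads(payload)["auth"]["client_token"]
```

Sending the request with `urllib.request.urlopen` and passing the response body to `token_from_response` would yield the token, which later calls present in the `X-Vault-Token` header.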

Vault can use a variety of backend data stores to keep your encrypted data. These include supported stores like Consul, file, and in memory stores, but there are also community supported backends like DynamoDB, etcd, ZooKeeper, S3, and PostgreSQL. The biggest differentiator among storage backends is the varied support for high availability configurations. Consul does support HA configurations and is the recommended configuration by Hashicorp. I’ll leave as an exercise for the reader to determine the suitability of that recommendation to their particular situation, but in our case we will use Vault with Consul since we are already supporting Consul for service discovery.

Operationally Vault can be configured in a highly available active/standby server architecture. Any unsealed Vault server can accept requests but they are always passed to and processed by the active master. Any unsealed standby can take over as master in the event the current master fails. All HA functionality is available when using an HA storage backend, such as Consul.

In addition to Vault and Consul, Hashicorp has developed an interesting and sometimes useful tool called consul-template. I personally first came across consul-template back in 2014 while configuring some load balancers based on current services available in Consul. Consul-template works great for these types of scenarios, and for those unfamiliar with this tool it will basically allow you to fill in template values based on changes in Consul. Last year Hashicorp added support in consul-template for Vault. This extends the templating functionality to Vault secrets.

In Use

I’d first like to point out that Vault is a relatively deep program. What I mean by that is it is very flexible allowing you to implement security policies customized for your organization. So keep in mind that none of the policy shown here has been reviewed by security professionals, and some configuration is purposefully open to ease the implementation of this PoC. In other words, don’t try this in production.

A First Pass Solution

My first thoughts on using Vault were focused on a particular subset of the secret sharing problem, namely distribution of database access credentials to Spring-based Java and Groovy applications using connection pools. In Java, connection pools are typically implemented as a Proxy that intercepts calls to the standard database connection factory and instead of creating a new connection a cached connection that has already been created is provided. The connection factory on the other hand knows how to create new connections including owning the database credentials. In Spring both of these objects are typically living in the same context as our application and can be managed by the application. So, given this what happens when a lease expires on a credential in use by the app? At the very least we’d need to provide new credentials to the connection factory. It is probably database dependent as to what happens when the credentials used to establish a connection expire. To be safe we’d likely want to remove any cached connections with expiring credentials from our connection pool, or risk a mass drop in connections at some point. Things get a bit more complicated if we move into a Java EE environment. Here our connection factory is likely owned and managed by the application server rather than the application. So even though I could likely implement an additional proxy to intercept calls to the connection pool in a Spring app and work with the life-cycles of the connection pool and factory to create a solution that could handle automatically getting credentials from Vault and responding to lease renewals and revocations, it sounds really complicated and does nothing to solve the same problem if my target app is written in Python, Go, Elixir, or some other inferior language like Perl or whatever it is people do on Windows.

Solution 2.0

Before I spent a lot of time thinking about how to integrate Vault with my Spring apps it had occurred to me that I could simply write a configuration file with the Vault credentials, load those into my app, and just restart each time I get new credentials. But I dismissed the thought almost immediately as far too simple. Several things happened that changed my mind. First, I discovered that consul-template had added support for Vault almost a year ago. Second, the size of the Java only solution was growing rapidly and I wasn’t certain that a solution there wouldn’t require completely replacing the connection factory each time the credentials updated. This could easily lead to memory leaks since the connection factory is typically a singleton with a life-cycle matching that of the application, so any bugs anywhere in those Java libraries that improperly hold references probably haven’t been discovered. Third, I was explaining the situation to a friend and the act of saying it out loud made me realize that it was way overly complicated and possibly just a little too clever, and he agreed. Clever almost never makes sense when you get paged at 2 AM. Fourth, while either solution introduces new dependencies into the system as a whole, this dependency would not be directly on the critical processing path for incoming requests.
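
The restart-on-change idea is small enough to sketch. The example assumes the rendered configuration is a JSON document with `username` and `password` fields, which is my own choice of format and not something the PoC prescribes, and the function names are hypothetical. In practice consul-template can run a command after it renders a file, so a check like this would usually live in whatever wraps the service.

```python
import hashlib
import json


def parse_credentials(raw):
    """Read database credentials from the rendered config bytes (assumed JSON)."""
    data = json.loads(raw.decode("utf-8"))
    return data["username"], data["password"]


def fingerprint(raw):
    """Stable digest of the rendered file, used only to detect changes."""
    return hashlib.sha256(raw).hexdigest()


def credentials_changed(raw, last_fingerprint):
    """True when the rendered file differs from the one the service started with."""
    return fingerprint(raw) != last_fingerprint
```

The service records `fingerprint(raw)` at startup; when `credentials_changed` later returns `True`, the wrapper drains and restarts the process so it picks up the new credentials.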

This solution is not without some drawbacks. First, care needs to be taken that the services cycle at different times so that capacity is maintained to respond to the service load. Second, the service should be removed from load balancers in a way that allows in flight requests to complete prior to shutdown if that is critical to the upstream systems, unless you can afford this in your error budget. Third, any code that has a typical usage scenario involving regular restarts does run a higher risk of accumulating latent issues with memory leaks that go unnoticed typically until usage changes with newly released features. Not that I’ve seen this on several occasions. So as always invest in good monitoring practices, and then look at what the monitoring is telling you, or better yet set up some automatic ticketing based on your monitoring to tell you to look at what it is telling you. Or just wait for the alarm/page, but memory leaks almost never make sense when you get paged at 2 AM. You’re probably picking up on a pattern here. It’s all part of one of my general rules of software engineering which states “Almost nothing makes sense when you get paged at 2 AM, except rage, plan accordingly.”
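
For the first drawback, one simple way to keep instances from cycling together is to spread their restarts evenly across a window, for example the credential lease duration. This is only a sketch of that scheduling idea under my own assumptions; `restart_offsets` is a hypothetical helper and the PoC does not necessarily do it this way.

```python
def restart_offsets(instance_count, window_seconds):
    """Return one restart offset (in seconds) per instance, evenly spaced."""
    if instance_count <= 0 or window_seconds <= 0:
        raise ValueError("instance_count and window_seconds must be positive")
    step = window_seconds / instance_count
    # Instance i restarts i * step seconds into the window, so no two overlap.
    return [round(i * step) for i in range(instance_count)]
```

With four instances and a one hour window, `restart_offsets(4, 3600)` gives `[0, 900, 1800, 2700]`, so at most one instance is out of rotation at a time as long as a restart takes less than fifteen minutes.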

The Proof of Concept

In Docker from Development to Production we used AWS to show a PoC. This time I wanted to do something a little different. Instead we’ll be using Vagrant to build out a local set of virtual machines with a software defined network to demonstrate Vault. The same concepts can be applied to AWS and CloudFormation. This approach, while it is obviously limited to demos, does have a few advantages for development. First, CloudFormation is an all or nothing approach: it rolls back changes on failure. This is actually a great feature but it can slow iterations just a bit. Salt on the other hand will detect and skip most states it has already applied, allowing me to iterate quicker locally, since I’m only applying changes. CloudFormation will actually apply changes in a transactional way as well but it still feels a bit slower to me. I can also bring up individual machines defined in my Vagrantfile, whereas again CloudFormation is an all or nothing transaction. But really my biggest reason for doing this is I’m guessing most people haven’t really seen Vagrant used in this way, and it’s one more tool to add to your tool belt. While most with exposure to Vagrant will have realized that a Vagrantfile is actually a Ruby program, you will really see that here. Though this is not the focus of this article you should take a look at the interaction between the Vagrantfile and the surface.yaml file.

You can see details of how to bring up the environment in the README on the poc-vault-server Github project page. I’ll try to leave most details there as this is meant to be a living project that may change after this article was written. So in this section I’ll focus on the higher level concepts captured in the PoC.

Vault and Entropy in Production

While Vault is definitely a service it is important to note that it needs to be treated as a low entropy service in much the same way that you would your production database systems. The data Vault stores is critical to the operation of other services. This means treating the Consul data volumes in much the same way as a database data volume. However, it also means developing a policy around how Vault unseal keys are stored, as well as how many unsealed Vault servers must be on standby at any given time. When I say policy here I don’t mean a document. While that certainly might be one element it should not be the entire effort. If I were running Vault in a production context I would set up ticketing and alarming based on monitoring thresholds around number of unsealed Vault servers. I would also run regular “Vault sealing exercises” where a server is sealed and pages are sent to key holders. This would allow you to determine response times as well as ensure enough keys are actually present to complete an unseal under a controlled circumstance, ultimately acting as the driver for policy and process improvement.
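
A threshold check of that kind could look roughly like the following. It assumes each entry is the parsed JSON from a server's seal status endpoint, where a boolean `sealed` field reports the state, or `None` when the server could not be reached. The function names and the ticket versus page split are my own hypothetical policy, not something Vault provides.

```python
def unsealed_count(statuses):
    """Count servers that answered and reported sealed == False."""
    return sum(1 for s in statuses if s is not None and not s["sealed"])


def check_threshold(statuses, minimum_unsealed):
    """Map the current unsealed count to an action: 'ok', 'ticket' or 'page'."""
    count = unsealed_count(statuses)
    if count < minimum_unsealed:
        return "page"    # below the floor: wake up the key holders
    if count == minimum_unsealed:
        return "ticket"  # at the floor: no spare capacity left
    return "ok"
```

Running the same check during a sealing exercise gives a measurable start time for the key holders' response.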

Vault Initialization and Unsealing Keys

There are basically two activities we could be referring to by initialization. Either the act of creating a new Vault store or the act of adding configuration to an existing Vault store. For simplicity let’s agree to call the act of creating a new Vault store initialization. In this PoC I’ve decided to treat initialization as a manual process. The reasoning is that I would prefer to keep the unseal keys out-of-band even if that means written on a piece of paper tucked into an engineer’s wallet. Obviously low-tech is not the only way to keep things out-of-band, but it is certainly cost effective and easy to implement for a PoC. Another possibility would be to use a group of machines each responsible for one unseal key. These machines would be locked down via a firewall allowing no incoming network connections in order to reduce attack surface as much as possible. They would have to poll or watch a message queue in order to respond to events requiring them to provide unsealing keys to Vault servers, and they would heartbeat in order to provide telemetry data for monitoring systems. However, the complexity of such a system grows significantly the more secure it is required to be, and likely still requires plenty of manual intervention.

Vault Setup, Chickens, and Eggs

By Vault setup I mean the creation of configuration, dynamic secret generation policies, security policy, roles, app IDs, etc. in the existing Vault store. The landscape here is filled with “chicken vs egg” type problems. While many of the policies, role configurations, and even rules regarding how app IDs are generated could be placed in a configuration management system, there are secrets that should not be. There are two examples in our PoC. First, the master database credentials that Vault uses to access Postgres and generate on the fly credentials for the hello world service. Second, the salt value that is used as part of the generation of App and User ID to establish a trust relationship with Vault’s client machines. The first situation might be solved by automating setup of databases and making credential push to Vault part of that setup. I did not use that approach in the initial version of this PoC but it seems like a good idea to try out. The second issue is a bit harder to solve because we have an ongoing need to deploy the salt to new nodes as they are brought online. I’m still looking for a good out-of-band approach for the placement of this salt on all systems. One idea on the AWS platform is to grant access rights to all new servers created in the VPC so that they can load a well known S3 object containing the salt value. I believe this would meet the requirement of being out of band with CM (probably CloudFormation in this case) but easily accessible by systems that need it. Another option that keeps the salt value out of CM would be to build that value into the base VM image. This assumes that the image is not accessible by someone that might want to attack the system. This technique could work with a VMware based environment as well as other virtual infrastructure environments that allow use of custom machine images or templates.

AppID and UserID

I chose to use the App ID authentication mechanism in Vault to authenticate machines because it appears to be simpler than using TLS. With TLS I’d have to figure out another way to securely deliver unique materials to the machine, whereas with App ID I’m just delivering the same hopefully secure salt value to all machines. To actually generate my App IDs I use the SHA-256 hash of the service name plus my salt value. Similarly for User IDs I compute the SHA-256 hash of the local machine’s IP address plus my salt value. On the Vault server I can pre-populate known User IDs with access to a particular App ID since I know the salt value and the address range of clients. Based on my limited understanding of cryptographic hashing the time to guess the salt value based on knowing the hash for one particular machine is approximately exponential in L where L is the length of the salt value. Also, assuming you know H and X in H=h(X,S) (where X is the IP, S is the salt, and H is the hash value for that machine) you might find an S such that the value of h computed collides with H but is not the original S used. This means you still wouldn’t be able to generate another H for an X of your choosing. Basically what all that means is as long as S is a long random value you should be relatively safe as long as you keep it secret. However, definitely read up on SHA-2 and whatever recent attacks might be developing for that algorithm before basing your policy on it.
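
The ID scheme described above can be sketched in a few lines. Treat the details as assumptions: the exact concatenation (value followed directly by the salt, UTF-8 encoded, hex digest) and all of the function names are my own choices here, and the PoC repository is the place to check what the real code does.

```python
import hashlib
import ipaddress


def derive_id(value, salt):
    """SHA-256 of the value followed by the shared salt, as a hex string."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


def app_id(service_name, salt):
    """App ID: derived from the service name."""
    return derive_id(service_name, salt)


def user_id(ip_address, salt):
    """User ID: derived from the client machine's IP address."""
    return derive_id(ip_address, salt)


def prepopulate_user_ids(network, salt):
    """Server side: compute the User ID for every host address in a known range."""
    return {
        str(ip): user_id(str(ip), salt)
        for ip in ipaddress.ip_network(network).hosts()
    }
```

Because both sides derive the IDs from the same inputs, the Vault server can register the User IDs for a whole subnet ahead of time, and a new machine only needs the salt to compute its own.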

Conclusion

The PoC demonstrates that Vault can be used to provide credentials needed for a service. Those credentials are even automatically generated custom to a particular instance of a service (on a per machine basis in the PoC). The credentials even get rotated based on the lease time given by Vault. So we covered quite a bit of the ground set out in our vision.

So what about automatic revocation by intrusion detection systems? I think that should be possible and possibly quite simple. Vault allows a secret lease to be programmatically revoked, so it is only a matter of wiring revoke actions to the appropriate alarms in the IDS. I’ll leave this as future work most likely for someone that completely understands the operational profile of an IDS.
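
As a rough sketch of that wiring, an IDS alarm hook could call Vault’s HTTP API to revoke a lease. The function names and the example lease ID are hypothetical, and the sketch assumes `VAULT_ADDR` and `VAULT_TOKEN` are set for a token that is allowed to revoke leases. The `sys/leases/revoke` endpoint is the one documented for recent Vault releases; older versions exposed revocation under a different path, so check the documentation for the version you run.

```shell
#!/usr/bin/env bash
# Hypothetical IDS hook: revoke one Vault lease over the HTTP API.

# revoke_payload LEASE_ID: print the JSON request body.
# Assumes the lease ID contains no double quotes or backslashes.
revoke_payload() {
  printf '{"lease_id": "%s"}' "$1"
}

# revoke_lease LEASE_ID: ask Vault to revoke the lease.
# Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
revoke_lease() {
  curl --fail --silent --show-error \
    --request PUT \
    --header "X-Vault-Token: ${VAULT_TOKEN}" \
    --data "$(revoke_payload "$1")" \
    "${VAULT_ADDR}/v1/sys/leases/revoke"
}

# Example (hypothetical lease ID):
#   revoke_lease "postgresql/creds/readonly/abc123"
```

The IDS would still need a way to map an alarm to the lease IDs issued to the affected machine, which this sketch does not address.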

There are some remaining challenges. In particular, establishing the initial trust relationship with Vault requires keeping some form of secret outside of Vault and CM. This is definitely a challenging problem, but there seem to be some possible solutions, depending on the environment in which the system is deployed and the level of security required. Vault sealing and initialization are also challenging topics. While it would be great to have an automated solution here, these may be better solved via policy and process, with a healthy dose of monitoring to keep things up and running.

Overall my impression is that Vault solves a lot of problems, but security is still hard. I’ll definitely be using Vault moving forward in the systems I’m designing because it makes much of the problem easier. I’m hoping I’ll be smart enough to figure out solutions to at least some of the remaining issues, or get some feedback from the community that will help.

Docker 1.11 and Otto 0.2.0

While prepping for an upcoming presentation I recreated the development environment described in my previous post on using Docker in development and production environments. Much to my surprise, the creation failed while trying to start the PostgreSQL Docker container, with the error “shim error: invalid argument”. This appears to be related to running Docker 1.11 on Linux kernels prior to 3.10. By default Otto uses the “hashicorp/precise64” box, which has Linux kernel 3.2.0. So fail.

I haven’t had a chance to completely determine what should be changed where in order to prevent this type of issue. However, as a quick workaround for the particular config in my article, you can run otto compile, then edit the Vagrantfile at “.otto/compiled/dev/Vagrantfile” and replace the box definition with “ubuntu/trusty64”. Trusty ships a Linux kernel newer than 3.10, so this eliminates the issue.
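
The manual edit can also be scripted. The sketch below assumes the compiled Vagrantfile contains the literal box name hashicorp/precise64; `swap_box` is a hypothetical helper name, and sed keeps a `.bak` copy of the original file next to it.

```shell
#!/usr/bin/env bash
# swap_box VAGRANTFILE: replace the precise64 box with trusty64 in place.
# Assumption: the box name appears literally as hashicorp/precise64.
swap_box() {
  sed -i.bak 's|hashicorp/precise64|ubuntu/trusty64|g' "$1"
}

# Typical use, run from the project directory:
#   otto compile
#   swap_box .otto/compiled/dev/Vagrantfile
#   otto dev
```

Running otto compile again will likely regenerate the Vagrantfile, so the helper would need to be re-applied after each compile.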

Docker from Development to Production

I like to do things – or at least those things I don’t really enjoy doing – fast. This is why I really appreciate enabling technologies that help me do things quickly. For example, I love Spring Boot because I can build all the boring parts of a service or web app quickly. Or at least I can, assuming I’ve got my development environment set up correctly. It’s never that simple though. The new guy that just started will need three days to set up his dev environment, so that will keep him from building anything for at least part of this sprint. Depending on how complicated the environment was when he started, and how far the environment has drifted from the docs our team wrote, maybe it takes even longer. Of course, then deploying to QA and Prod is kind of slow, and any issues mean calling the dev team. Oh, and then there is that 3 AM page because the production environment on server node 3 doesn’t quite match the other nodes.

This probably sounds familiar to a lot of developers. Just replace Spring Boot with the killer tech you prefer that makes your life easier for that brief part of your day when you actually write code instead of the ever expanding set of responsibilities that might even be called something like DevOps.

Purpose

The purpose of this article is to introduce an opinionated proof-of-concept infrastructure, architecture, and tool chain that makes the transition from zero to production deployment as fast as possible. In order to reach the widest possible audience, this infrastructure will be based on AWS. I believe if you have recently created your AWS account you can deploy the infrastructure for free, and if not it’s around $3 per day to play around with, assuming you use micro instances. The emphasis here is on speed of transition from dev to prod, so when I say opinionated I do not mean my opinion is this is the best solution for you or your problem – that depends a lot on your specific problem. I will go into more detail on the opinions expressed in this solution throughout this article. At the time of this writing, the actual infrastructure implementation is lacking certain obvious properties needed for use in a real production environment – for example there is a single Consul server node. So keep in mind this is a proof of concept, which still requires effort before using in production settings.

The Problem

I can develop a service very quickly. For example, I can generate the skeleton of a Spring Boot service using Spring Initializr, and implement even a data service very quickly. Any gains I make in development time can be lost in the deployment process. Production deployment can be arduous and slow depending on how the rest of my organization works.

Despite adoption of configuration management tools – like SaltStack – there is often drift in production environments. This can result from plastering configuration management over an existing environment of unknown state, or from a lack of operational discipline, with individual changes made outside configuration management. Micro-services make the situation more challenging because rather than a handful of apps with competing dependency graphs, I now have potentially dozens of services each with conflicting dependencies.

Even outside of production there are potential roadblocks. What if I need a development environment that is exposed to the public internet to test integration of my service with another cloud based service?

Docker solves many of the issues of environment drift and dependency management by packaging everything together. However, to be honest, if my organization – or yours – was having a hard time getting me the resources to deploy my micro-service, I doubt introducing Docker will help with timelines in the near term. In the long run, it may be an easy sell to convince an infrastructure operations organization to stop deploying custom dependencies for each language supported, and instead create a homogeneous environment of Docker hosts where they can concentrate their efforts on standardized configuration, log management, monitoring, analytics, alarming, and automated deployment processes. This is especially so if that environment features automatic service discovery that almost completely eliminates configuration from the deployment process.

A Solution

In short, it doesn’t matter if you are trying to overcome the politics or technical debt of a broken organization, or you are bootstrapping the next Instagram. If you want to deploy micro-services this article presents a way to use force-multiplying tools to turn a credit card number into a production scalable, micro-service infrastructure as quickly as possible.

AWS Infrastructure

The diagram above shows an abstraction of the AWS infrastructure built below. The elastic load balancer distributes service requests across the auto-scaling group of EC2 instances, which are configured as EC2 Container Service (ECS) container instances running our Dockerized service. Each ECS instance includes a Consul agent and Registrator to handle service discovery and automatic registration. Database services are provided by an RDS instance. The Consul server is deployed to a standard EC2 instance – this will expand to several instances in the production version. All of this is deployed in a virtual private cloud (VPC) and can scale across multiple availability zones.

Prerequisites

git
brew – or a package manager of your choice. Install with:
```
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
```
or check the homepage for updated instructions.
VirtualBox – or some other VM provider supported by Vagrant/Otto. For example, install with:
```
brew tap caskroom/cask && brew install Caskroom/cask/virtualbox
```
An AWS account.

Setting up your Development Environment

For our development environment we’ll be using Otto. Otto is a tool that builds management of dependencies and environment setup on top of several other tools including Vagrant, Terraform, Consul, and Packer. As of the time of this writing, it is very early in development so potential for using in prod deployment is limited – especially for Java. However, for the dev environment we get some extra freebies not included by just using Vagrant – namely service discovery and automatic deployments of dependencies.

We’ll begin by installing otto:
```
brew install otto
```
otto will automatically install its dependencies on an as-needed basis.
Next we’ll create or clone the project we want to work with. For the purposes of this demo, I’ve created a Spring Boot based micro-services hello world project that connects to a PostgreSQL database and a Redis instance in order to complete its work. We use Spring JPA connecting to PostgreSQL to track the number of calls for each user, and we use Redis to track the number of calls in a session. The goal of this contrived hello world is simply to have multiple service dependencies to better reflect a real world service.
To clone the existing project execute:
```
mkdir -p ~/source
cd ~/source
git clone https://github.com/mauricecarey/micro-services-hello-world-sb.git
cd micro-services-hello-world-sb
```
We need to define our dependencies using Otto’s Appfiles. You can see the Appfiles I’ve defined for postgres and redis on GitHub. We just need to include these as dependencies in the next step.

We need to define an Appfile config for our service in order to declare the needed dependencies.

```
cat <<EOF > ~/source/micro-services-hello-world-sb/Appfile
application {
  name = "micro-services-hello-world-sb"
  type = "java"

  dependency {
    source = "github.com/mauricecarey/otto-examples/micro-services-hello-world-sb/postgres"
  }
  dependency {
    source = "github.com/mauricecarey/otto-examples/micro-services-hello-world-sb/redis"
  }
}
EOF
```

This file provides a name for our application, explicitly sets the type to java, and declares the dependencies.

We will compile the environment setup, start the dev environment, and test our service.
Execute the following to compile and start the otto environment:

```
otto compile
otto dev
```

If you don’t have Vagrant installed, otto will ask you to install it when running otto dev. Keep in mind a lot is happening in this step, including a potential Vagrant install, a potential Vagrant box download, downloads for Docker images, and installing dev tools. If you halt the environment when finished – versus destroy – restarts will only take a few seconds. Once the environment is finished building you can log in with:

```
otto dev ssh
```

Now we can build and run the application:

```
mvn package
mvn spring-boot:run
```

Open a new terminal then:

```
cd ~/source/micro-services-hello-world-sb
otto dev ssh
curl -i localhost:8080/health
```

You should see HTTP headers for the response plus JSON similar to:

```
{
  "status": "UP",
  "diskSpace": {
    "status": "UP",
    "total": 499099262976,
    "free": 314164527104,
    "threshold": 10485760
  },
  "redis": {
    "status": "UP",
    "version": "3.0.7"
  },
  "db": {
    "status": "UP",
    "database": "PostgreSQL",
    "hello": 1
  }
}
```

This means the service is healthy. Now in the same terminal hit the service:

```
curl -i localhost:8080/greeting
```

You should see:

```
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Application-Context: application
x-auth-token: 2ec567a0-4697-4b1c-a82b-fd99b021e87b
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Mon, 29 Feb 2016 22:54:03 GMT

{"id":1,"sessionCount":1,"count":1,"content":"Hello, World!"}
```

Now you can hit the service again with the given token:

```
curl -i -H "x-auth-token: 2ec567a0-4697-4b1c-a82b-fd99b021e87b" localhost:8080/greeting
```

You should see:

```
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Application-Context: application
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Mon, 29 Feb 2016 22:57:44 GMT

{"id":1,"sessionCount":2,"count":2,"content":"Hello, World!"}
```

I’ll leave it as an exercise to check that session count resets to zero only after a session timeout, new session, or redis restart. You can provide a name as well. For example, curl -i localhost:8080/greeting?name=test.

Now we need to install and configure the AWS tools to complete our development environment setup:
```
sudo apt-get install -y python-pip
sudo pip install awscli
aws configure
```

Dockerizing the Service and Distributing the Image

Since we used otto to build our dev environment, we already have Docker installed, and we can jump right in to building Docker images for our service.

First we need to define a Dockerfile to build a Docker image for the service. Open a new terminal or exit the otto environment and paste the following:

```
cat <<EOF > ~/source/micro-services-hello-world-sb/Dockerfile
FROM mmcarey/ubuntu-java:latest
MAINTAINER "maurice@mauricecarey.com"

WORKDIR /app

ADD target/microservice-hello-world.jar /app/microservice-hello-world.jar

EXPOSE 8080
CMD ["/usr/bin/java", "-jar", "/app/microservice-hello-world.jar"]
EOF
```

Now we copy the fat jar for our micro-service to target/microservice-hello-world.jar.

```
cp ~/source/micro-services-hello-world-sb/target/microservice-hello-world-0.0.1-SNAPSHOT.jar \
 ~/source/micro-services-hello-world-sb/target/microservice-hello-world.jar
```

Back in our Otto dev environment, we can build our Docker image with:
```
docker build --tag hello-world .
```
We can now run the container with:
```
docker run -d -p 8080:8080 --dns=$(dig +short consul.service.consul) --name hello-world-run hello-world
```
This sets the DNS resolver for the container to the Consul instance running in our dev environment. We map port 8080 of the dev environment to port 8080 of the container. Our container will be named hello-world-run for convenience.
We can test the connection to our container using curl:
```
curl -i localhost:8080/health
```
We can check out the logs from our service with:
```
docker logs hello-world-run
```

Next we will build the image for AWS, create an AWS ECR repository, and push the Docker image to our new repo. We do this using a script built for that purpose.

```
git clone https://github.com/mauricecarey/docker-scripts.git
export AWS_ACCOUNT_NUM=<YOUR ACCOUNT NUMBER>
# Export the region so the later aws commands that use $AWS_REGION can see it.
export AWS_REGION=us-east-1
REPO_NAME=hello-world IMAGE_VERSION=0.0.1 DOCKER_FILE=Dockerfile \
    ./docker-scripts/docker-aws-build.sh
```

At this point we have built and tested a Dockerized version of our service locally. We have pushed the Docker image to AWS. As you will see later, because we are using Consul for service discovery we will not make any changes to the Docker image, or have to add any additional configuration to deploy on AWS.

Building an Infrastructure on AWS

To set up the AWS infrastructure, we will use CloudFormation to create a stack.

Check out aws-templates or just grab the full-stack.json file.

```
cd ~/source/micro-services-hello-world-sb
git clone https://github.com/mauricecarey/aws-templates.git
```

Go to the CloudFormation console in AWS in your browser and create a new stack using the ~/source/micro-services-hello-world-sb/aws-templates/full-stack.json file as a template.
Most of the default parameters should work, but make sure you set the database name (hello), user (hello), and password (Passw0rd) as they are set in the application properties of the app. Note: using the default micro instances, this template creates four micro EC2 instances, one micro RDS instance, and an ELB. At the time of this writing the cost was roughly $80 per month, not including data transfer. You should follow the estimated cost link at the top of the CloudFormation wizard’s final page before clicking create to estimate your actual costs.

Once the stack is up and running, switch to the outputs tab. Here you will find some useful parameters for the remainder of this article. You should set the following environment variables using the name of the ECS cluster and the ELB URL:

```
export AWS_ECS_CLUSTER_NAME=<ECS Cluster Name>
export AWS_ELB_NAME=<Elastic Load Balancer Name>
export AWS_ELB_URL=<Elastic Load Balancer URL>
```

The load balancer URL will look like STACKNAME-EcsLoadBalan-AAAAAAAAAAAAA-XXXXXXXXX.REGION.elb.amazonaws.com. The STACKNAME-EcsLoadBalan-AAAAAAAAAAAAA portion is the value to use for AWS_ELB_NAME above.
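
If you would rather not copy the name by hand, it can be derived from the URL with shell parameter expansion. This is a sketch that assumes the DNS name follows the pattern shown above; `elb_name_from_url` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# elb_name_from_url URL: print the load balancer name embedded in its DNS name.
# Assumption: URL looks like NAME-XXXXXXXXX.REGION.elb.amazonaws.com (no scheme).
elb_name_from_url() {
  elb_host="${1%%.*}"             # keep only the first DNS label
  printf '%s\n' "${elb_host%-*}"  # drop the trailing -XXXXXXXXX suffix
}

# Typical use once AWS_ELB_URL is set:
#   export AWS_ELB_NAME="$(elb_name_from_url "$AWS_ELB_URL")"
```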

Deploying to AWS

With our CloudFormation stack running, and all our AWS CLI tools installed in our development environment, it is now a simple process to define the services we want to run on the ECS cluster.

We can register our task definitions for both redis and our hello world micro-service with ECS:

```
aws --region $AWS_REGION ecs register-task-definition --cli-input-json file://aws-templates/redis-task.json
aws --region $AWS_REGION ecs register-task-definition \
    --cli-input-json file://aws-templates/micro-services-hello-world-sb-task.json
```

Now that we have our task definitions registered with ECS we can start the services:

```
aws --region $AWS_REGION ecs create-service --cluster $AWS_ECS_CLUSTER_NAME --service-name redis-service \
    --task-definition redis --desired-count 1
aws --region $AWS_REGION ecs create-service --cluster $AWS_ECS_CLUSTER_NAME \
    --service-name micro-services-hello-world-service \
    --task-definition micro-services-hello-world-sb --desired-count 3 --role ecsServiceRole \
    --load-balancers loadBalancerName=$AWS_ELB_NAME,containerName=hello-world,containerPort=8080
```

Check that the service is up:
```
curl -i http://$AWS_ELB_URL/health
```
Call the service:
```
curl -i http://$AWS_ELB_URL/greeting
```
You can perform the same calls we did locally to confirm the service is working properly.

Next Steps

Basically, in this section I’d like to try to answer questions about what is ready for production use and what is not. I’ll also mention what I’m continuing to work on.

Otto

Otto is definitely not ready for production use in creating production environments at this point, but of course we didn’t try to use this feature here. That opinion is primarily based on my experience trying to use it for Java. For other languages, assuming it is actually capable of creating a production environment, you would need to evaluate that environment with respect to your standards and needs.

The portion of Otto we used in this article is not on the traditional production critical path. As such, evaluation for adoption is a bit different. If we are talking about experienced developers they should be able to pick it up quickly to respond to any issues they might encounter on their machine. Otto utilizes Vagrant for much of the heavy lifting on the development environment and as such has been very stable in this regard. There are currently enough advantages to the dependency definitions and automatic service discovery setup to convince me to adopt Otto now for development use. I’m currently working on picking up Go as quickly as possible so I can help fix any issues I encounter.

Docker

Docker is based on Linux containers, which have been around for a while now. Take a look at who’s using Docker in production today. Then forget that – mostly. Appeal to authority is my least favorite argument for adopting software or technology. You’ll obviously have to decide for yourself if you are ready to commit to using Docker, but if you are working in a micro-service architecture you need to consider what happens with downstream dependencies – including libraries or language VMs – for those services as they mature.

A few reasons I’m sold on Docker include:

Standardized deployment packaging across environments from development to production.
The ability to have service dependencies update at a different pace while sharing deployment environments.
Potential for higher equipment utilization leading to reduced costs.

A few reasons not to use Docker:

Fear of change.
You hate yourself.

AWS Infrastructure Template

The AWS infrastructure I used here still needs some work. As I mentioned previously, it’s not ready for production. Here are some reasons why:

Not truly multi-AZ,
Not multi-region – I have not started considering this at this point,
Logging is not fully configured to make best use of AWS – ideally should use something like fluentd,
Needs a security audit – I’m not an expert in this area,
Needs additional configuration and redundancy for ELB to ensure availability,
Monitoring and alerting need to be defined for the infrastructure components – like Consul,
The Consul server really should be three servers spread across two AZs at a minimum,
Needs auto-scaling rules,
Other things that I haven’t spotted yet – needs peer review.

Additionally, maybe a CloudFormation template is not the right answer – perhaps Terraform integrated into Otto as a production deployment target could be. I plan to continue to improve this, but this is certainly an area where more eyes will be better, so send me PRs.

Note: Unfortunately, I have seen production environments in small companies that lack many of the requirements I listed above. In those cases this template may be a better fit for production than a one-off internal solution, since there is at least a pathway to meeting all of those requirements.

Moving Forward

I’m currently working on automating production deployment of Docker images. So, ideally I’ll have an additional write up on that soon.

We made use of simple DNS-based service discovery using Consul, but we did not dive into further capabilities of Consul. In a future update I’ll go into how to store additional configuration using the key/value store in Consul, as well as how to move away from well-defined ports, which enables packing more like-service instances on each host.

Conclusion

There are plenty of frameworks and libraries that have helped developers move quickly and deliver services faster, but even with configuration management, continuous deployment pipelines, and DevOps practices there is a gap between development and production. That gap I believe is defined by the expectations formed around delivered artifacts. Docker simplifies those delivered artifacts, and moves many former environmental dependencies into the build pipeline. Adding Consul to the mix further reduces environmental configuration hazards by allowing simple service discovery. By standardizing environments around Docker artifacts we can increase deployment velocity, and decrease risk of dependency issues in all environments. Finally, using AWS ECS to host Docker containers is a quick and easy way to get started with Docker that allows you to move from development to production very quickly.

Maurice’s Law of Misplaced Items

Maurice’s Law of Misplaced Items basically states that the thing you have misplaced is going to be in the most relatively inconvenient location possible. I had an opportunity to test this just today.

Logic Studio 9 on Quad G5 Unsupported but Working

If, like me, you thought G5 support was dropped too soon in the Logic product line, you may be happy to hear that what this means is that Apple won’t answer your questions. I’ve been running Logic 9 on my four-year-old quad G5 for about a week now. I ran some rather heavy projects and there were no issues, though I did have to bring Logic Node into the mix once.

Delivering and Quitting

Sometimes I get confused. When this happens it can take a good smack to the ego to straighten me out. Ego or pride is the primary problem with knowing when to deliver and when to quit. Here are three points I remembered recently.
