Continuous Deployment: Zero-Downtime Refactorings

Over at Kreuzverweis we have been practicing continuous deployment from the very beginning, and dam simple is continuously deployed as well. The theory of continuous deployment is described, e.g., by Timothy Fitz at IMVU and in the great book “Continuous Delivery” by Jez Humble and David Farley.

In practice, however, it takes some thinking to solve certain problems. You cannot shut the system down for a couple of hours to run a migration step and release a completely new system in one batch. Continuous deployment is the logical consequence of continuous integration: it requires developing changes in small batches, never breaking any code, and pushing to the VCS often. But since every commit is also a deployment, continuous deployment additionally requires migrating data in small batches on the fly and accounting for the existence of legacy or stale data. In this blog post I will describe how we are currently refactoring the complete file storage layer of dam simple with zero downtime.

What’s the problem, dude?

In dam simple, we currently have a 1:1 relation between documents and the files they represent. If you upload, let's say, a PDF, we create a document holding references to the original file and to differently sized thumbnails. It also contains information such as the title, the owner, keywords, etc. However, we have just started to implement versioning support for dam simple, and this obviously requires a different approach to storing information about documents: we basically introduce a 1:n relationship between a document and its files, since different files correspond to different versions of the document.

Currently, a document is modelled as follows (please note that we use a very simplified version here):
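The gist embedded in the original post is not reproduced here; going by the description, a minimal sketch of the old model could look like this (all identifiers are assumptions, not the real dam simple code):

```scala
// Simplified sketch of the old model (identifiers are assumptions).
case class Document(id: String, title: String)

// The old storage layer: the FileStore handles everything and is
// keyed directly on the document.
trait FileStore {
  def original(doc: Document): Array[Byte]
  def thumbnail(doc: Document, size: Int): Array[Byte]
}
```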

We have a case class containing only the ID of the document and its title (among other information that we leave out for brevity). The storage layer component is called FileStore and provides methods to retrieve the original file and differently sized thumbnails for a given document.

In order to support versioning we have to modify this API:
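Again, the original gist is missing; following the description, a sketch of the version-aware API might look like this (identifiers are assumptions):

```scala
// Simplified sketch of the version-aware model (identifiers are assumptions).
case class Document(id: String, title: String)

// Metadata of one stored file, i.e., one version of a document.
case class StoredFile(id: String, contentType: String)

// Access to the current and to arbitrary other versions of a document.
// Options let callers detect documents that only exist in the old format.
trait DocumentStore {
  def currentVersion(doc: Document): Option[StoredFile]
  def version(doc: Document, n: Int): Option[StoredFile]
}

// The FileStore is now keyed on a StoredFile instead of a Document.
trait FileStore {
  def original(file: StoredFile): Array[Byte]
  def thumbnail(file: StoredFile, size: Int): Array[Byte]
}
```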

What has changed? We introduced a DocumentStore trait that provides access to the current version and to arbitrary other versions of a document. Each version is represented by a StoredFile, which contains the file's metadata and is used to access the original file and differently sized thumbnails through the FileStore, as before.

So, to recap, we changed the following details:

  • We split the information about a document into a case class Document and a case class StoredFile.
  • Both classes now access a dedicated component of the storage layer, while before everything was handled by the FileStore.
  • A document can now have more than one associated file.

The problem now is

  • to introduce the new API “on the fly”, i.e., without shutting down the service to migrate data,
  • and to enable it only for a limited number of users, so that we can test the new feature thoroughly and work on the UI without risking disturbing existing users.

I will describe how we refactored our code using simplified examples, showing only the read part of the API; refactoring the other parts works analogously.

Step 1: Introducing the new API and Testing

In the first step we concentrate on the functionality that was available before, i.e., we do not implement versioning support yet, but only access to the original file and its thumbnails. We start by extending the existing API, i.e., we do not touch the existing methods and only add new ones:
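With the gist missing, a sketch of this step under the same assumed identifiers could look as follows — the old methods stay in place and the new overloads simply sit next to them:

```scala
// Sketch of step 1: the old methods stay untouched, new overloads are
// added alongside them (identifiers are assumptions).
case class Document(id: String, title: String)
case class StoredFile(id: String, contentType: String)

trait FileStore {
  // Old API: unchanged, still used by all existing call sites.
  def original(doc: Document): Array[Byte]
  def thumbnail(doc: Document, size: Int): Array[Byte]

  // New API: added in parallel, used only by the new tests so far.
  def original(file: StoredFile): Array[Byte]
  def thumbnail(file: StoredFile, size: Int): Array[Byte]
}
```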

We then write tests for the new code and implement the methods as required. Since we had tests for the old code, we can ensure that both the old and the new API work as expected. We are now in a state to deploy the current code base: the old methods have not been changed, all tests pass, and the new methods are used nowhere except in the tests.

Step 2: Handling Stale Data

Before we can enable the new upload API for a test group, we have to consider one important case: the test group already has data that was uploaded using the old API. Therefore, we will encounter assets whose files cannot be accessed via the new API. We accounted for this case by letting the DocumentStore return Options from the new methods. Whenever the new API returns None, we fall back to the old API, as we will show in a later gist.

Step 3: Enabling the New API for a Test Group

We can now enable the new upload for specific users, using techniques such as feature flipping. I recommend always introducing new features guarded by some kind of feature flipping mechanism, unless you have very comprehensive acceptance tests in place. Being able to test new features before everyone can see them adds one more level of safety when deploying continuously, but you should not end up holding back too many features. With feature flipping, we switch between the old and the new API as demonstrated in the following snippet:
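The referenced gist is not embedded here; a self-contained sketch of such feature-flipped dispatch, including the fallback for stale data, might look like this (every identifier — featureFlags, currentVersion, and so on — is an assumption, not the real dam simple code; the stores are stubs returning strings so the dispatch is visible):

```scala
// Sketch of feature-flipped dispatch with a fallback for stale data.
// All identifiers are assumptions, not real dam simple code.
case class User(name: String, flags: Set[String])
case class Document(id: String, title: String)
case class StoredFile(id: String)

// Stand-in for a real feature flipping mechanism.
object featureFlags {
  def isEnabled(feature: String, user: User): Boolean = user.flags.contains(feature)
}

// Stub document store: returns None for assets uploaded via the old API.
object documentStore {
  def currentVersion(doc: Document): Option[StoredFile] =
    if (doc.id.startsWith("v2-")) Some(StoredFile(doc.id)) else None
}

// Stub file store offering both the old and the new access method.
object fileStore {
  def original(file: StoredFile): String = s"new:${file.id}"
  def original(doc: Document): String = s"old:${doc.id}"
}

def originalFor(user: User, doc: Document): String =
  if (featureFlags.isEnabled("versioning", user))
    // New API, falling back to the old one when we hit stale data.
    documentStore.currentVersion(doc)
      .map(f => fileStore.original(f))
      .getOrElse(fileStore.original(doc))
  else
    fileStore.original(doc)
```

Only users carrying the flag ever reach the new code path; everyone else keeps running against the untouched old API.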

In this example you can also see the fallback code used to handle stale data with the new API. Normally it would be better to hide the fallback code inside the API, but since we added a new type and changed the model quite fundamentally, the current approach seems more appropriate here.

One important note: write a test that checks that the correct API is called when the feature is enabled and when it is disabled. You don’t want to find out in production that you forgot something!
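Such a dispatch test can be written with recording stubs; here is a minimal, self-contained sketch (all names are hypothetical, and the stores are reduced to call recorders):

```scala
// Sketch of a test that verifies the feature flag routes to the right API.
// All identifiers are hypothetical.
case class User(flags: Set[String])
case class Document(id: String)

// Stub storage layer that records which API was called.
class RecordingStores {
  var called: List[String] = Nil
  def oldOriginal(doc: Document): Unit = called ::= "old"
  def newOriginal(doc: Document): Unit = called ::= "new"
}

// The code under test: dispatches on the feature flag.
def serve(user: User, doc: Document, stores: RecordingStores): Unit =
  if (user.flags.contains("versioning")) stores.newOriginal(doc)
  else stores.oldOriginal(doc)

// Flag enabled: the new API must be hit.
val on = new RecordingStores
serve(User(Set("versioning")), Document("d1"), on)
assert(on.called == List("new"))

// Flag disabled: the old API must be hit.
val off = new RecordingStores
serve(User(Set.empty), Document("d1"), off)
assert(off.called == List("old"))
```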

Step 4: Enabling the New API by Default

We are now done with the first phase of the refactoring: since data cannot be updated yet, we have an implementation that uses the new API for newly uploaded content and is still able to access data that was produced using the old API. Once the new feature has been tested within the test group, it can be released for everyone. In this step we can also remove the old API completely from the code base, since everything is now handled by the new one.

Sweet!

We have introduced a new API that is prepared to handle versions appropriately, without introducing a single second of downtime. The keys to this approach were:

  • We introduced the new API in parallel to the old API, which enables us to guard the use of the new API with some sort of feature flipping, enabling the new API only for a test group.
  • We took care that the new API is aware of the old format, i.e., when we encounter stale data in our test group, the implementation falls back to the old API. Please note that the old API does not need to be aware of the new one.
  • In Scala, Options are king. Using Options, we can very easily handle the cases where stale data is encountered.
  • Tests! You need tests for continuous deployment. With good tests in place you can modify your code base in small batches and always verify that existing functionality has not been broken.
  • Feature flipping! If you add new features, enable them only for a limited group of users at first, for instance the developers, everyone in your team, or the test team. Automated tests are good, but enabling new features only for a limited number of people lets you test with real users and uncover problems that your tests might not have covered.

In the next blog post, I will talk about migrating data on the fly. Once we introduce true versioning support, we will also have to handle cases where data is stored using the old model but needs to be updated; in that case we have to migrate the data on the fly. So stay tuned.