Lazy Migrations in MongoDB with Scala Salat

In this post I’ll outline how we perform lazy migration with MongoDB using Salat, a case class serialization library for Scala. Salat in conjunction with a REST-like storage layer makes for a very good approach to handling lazy migration of your database. Though there are still some open questions, specifically how to handle more sophisticated schema designs, this approach has proven very successful over the last few months, during which we migrated our doctivity schema numerous times.

MongoDB Storage Layer

Let’s consider a simple user model:

{
  "_id": 1,
  "email": "user@example.com",
  "name": "Test User"
}

As the application evolves, we might have to add information to this model, e.g., the last login timestamp:

{
  "_id": 1,
  "email": "user@example.com",
  "name": "Test User",
  "lastLogin": 1234
}

Having both types of documents in MongoDB is not an issue. But at some point you have to use those documents, and at that point you have to know what information is available. Writing the code in a way that knows all different versions of the user document at every possible place is cumbersome. Therefore, one needs a strategy to migrate documents in a controlled manner and to provide a single, reliable API to the rest of the application.

Using Salat

The first user model could be represented as the following case class:

case class User(@Key("_id") id: Long, email: String, name: String)

We are using Salat to serialize this case class into a BSON document. The case class member names are mapped 1:1 to fields in the document. Using the @Key annotation, we can control the mapping, i.e., the id member becomes the _id field.

Given an object of that case class, let’s call it user, we can transform it into a MongoDBObject and back as follows:

import com.novus.salat._
import com.novus.salat.global._ // default Salat context
val dbo = grater[User].asDBObject(user)
val userAgain = grater[User].asObject(dbo)

The dbo object is a normal Casbah MongoDBObject which you can store into MongoDB normally. Consequently, you can also retrieve a document using Casbah, put it through grater and get a Scala object of the given case class.

The storage layer consists of a set of stores that are able to store and retrieve different model objects. For the user we would have a UserStore. The API is basically RESTful: you put representations into the store (either new ones or updates for existing ones), and you can query for them or delete them.

trait UserStore {
  def save(user: User): Unit
  def findById(id: Long): Option[User]
  // ...
}

Obviously, with more demanding requirements it might be necessary to extend this simple API, e.g., to allow pushing to arrays directly without first deserializing a complete representation and then storing it back. For now we’ll stick with this simple base architecture.
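To make the put-style semantics concrete, here is a minimal in-memory sketch of such a store. It is not our MongoDB-backed implementation: InMemoryUserStore and the delete method are hypothetical names added for illustration, and the User class omits the Salat annotations so that the snippet runs on its own.

```scala
import scala.collection.mutable

object StoreSketch {
  // Same shape as the first user model, without the Salat @Key annotation.
  final case class User(id: Long, email: String, name: String)

  trait UserStore {
    def save(user: User): Unit
    def findById(id: Long): Option[User]
    def delete(id: Long): Unit // hypothetical, not part of the trait shown above
  }

  // Hypothetical in-memory stand-in for the MongoDB-backed store.
  final class InMemoryUserStore extends UserStore {
    private val docs = mutable.Map.empty[Long, User]

    // A put: inserts a new representation or replaces the existing one.
    def save(user: User): Unit = docs.update(user.id, user)

    def findById(id: Long): Option[User] = docs.get(id)

    def delete(id: Long): Unit = { docs.remove(id); () }
  }
}
```

Saving a user with an id that already exists simply replaces the stored representation, which is the behaviour the migration code relies on later.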

Versioning Documents

Now, if we need to add the timestamp of the last login to this model, we have to consider that all already stored users don’t have this timestamp yet. If we just extend the case class, deserialization of old users will fail, since the case class constructor requires a value for this field. We could make this field an Option, but that would effectively make the fact that there are old documents visible to the outside of the storage layer. Obviously there are cases where information is optional, but in this specific case we assume the lastLogin timestamp is a mandatory field in the application. The updated case class therefore becomes:

case class User(@Key("_id") id: Long, email: String, name: String, lastLogin: DateTime)

But how do we distinguish between old and new documents? One approach would be to check whether the lastLogin property is available. In this case it would even work, but in other cases this might lead to problems, e.g., when the format of an inline document changes between versions. In that case, checking for the version based on the content might easily become cumbersome.

Another approach is to store the version of the document explicitly. We prefer this approach, as it is easy to implement and makes the versioning explicit and easier to understand. Using Salat, versioning is as simple as adding a version field:

case class User(@Key("_id") id: Long, email: String, name: String) {
  @Persist val _version = 1
}

We do not add the version as a constructor parameter, but as a “constant” that is fixed for the case class. The @Persist annotation tells Salat to serialize this member to the document. A resulting document would then look like:

{
  "_id": 1,
  "_version": 1,
  "email": "user@example.com",
  "name": "Test User"
}

Now we have the versioning information in the database and can act upon it accordingly.

Lazy Migration

Lazy migration means you migrate a document when you encounter it. That is to say, every time we get a document from the database and want to transform it into a case class, we check the version and update the document if needed.

To the outside of the storage layer, only the newest version is known. We do not expose old versions to the outside world, as that would mean loss of control. We would spread knowledge of old versions and migration paths across several parts of the application, something we should clearly avoid.

Internally, we use a function that wraps the migrations:

class UserStoreImpl extends UserStore {
  def findById(id: Long): Option[User] = {
    col.findOne(MongoDBObject("_id" -> id)) map (buildObject(_))
  }

  private def buildObject(dbo: MongoDBObject): User = {
    grater[User].asObject(dbo)
  }
}

And obviously, this is the place to handle old versions. Typically, we check for the version field in dbo, and handle each version in a match block:

private def buildObject(dbo: MongoDBObject): User = {
  dbo.get("_version") match {
    case Some(2) => grater[User].asObject(dbo)
    case Some(1) => buildObject_v1(dbo)
    case _ => throw new IllegalStateException("illegal version")
  }
}

The most recent version can be gratered directly; old versions are dispatched to a method that knows how to make a recent version out of them. If an unknown version is encountered we throw an exception, since this is something that shouldn’t happen. You might also decide to handle this case differently, e.g., by returning an Option from the build method, logging an error, and handling the situation more gracefully. How to handle this depends on the application, your personal preferences, and your style.
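One way the more graceful variant might look is sketched below, using an Either instead of an exception. To keep it runnable without Salat or Casbah, a plain Map[String, Any] stands in for the MongoDBObject and lastLogin is a Long. The object name, the common helper, the defaultLastLogin parameter and the error strings are assumptions made for illustration only.

```scala
object VersionDispatchSketch {
  // Plain Map standing in for a Casbah MongoDBObject (assumption).
  type Doc = Map[String, Any]

  // lastLogin as epoch milliseconds instead of a DateTime (assumption).
  final case class User(id: Long, email: String, name: String, lastLogin: Long)

  // Fields shared by both versions; None if one is missing or has the wrong type.
  private def common(doc: Doc): Option[(Long, String, String)] =
    (doc.get("_id"), doc.get("email"), doc.get("name")) match {
      case (Some(id: Long), Some(email: String), Some(name: String)) =>
        Some((id, email, name))
      case _ => None
    }

  // Left describes the problem, Right carries the up-to-date User.
  def buildObject(doc: Doc, defaultLastLogin: Long): Either[String, User] =
    (doc.get("_version"), common(doc), doc.get("lastLogin")) match {
      case (Some(2), Some((id, email, name)), Some(t: Long)) =>
        Right(User(id, email, name, t))
      case (Some(1), Some((id, email, name)), _) =>
        Right(User(id, email, name, defaultLastLogin))
      case (Some(1 | 2), _, _) => Left("malformed user document")
      case (version, _, _)     => Left(s"illegal version: $version")
    }
}
```

The caller can then decide whether a Left should be logged, turned into a None, or escalated, instead of having the store throw.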

One question remains: how do we handle the old document? We cannot grater it into the most recent case class, as the lastLogin field is missing. We could parse the document and then create an object manually, but that would mean extra work and would not take advantage of Salat.

In a case such as our example, I would keep the old case class under a new name

case class User_v1(@Key("_id") id: Long, email: String, name: String) {
  @Persist val _version = 1
}

and then implement buildObject_v1 as follows:

def buildObject_v1(dbo: MongoDBObject): User = {
  val old = grater[User_v1].asObject(dbo)
  val updated = User(old.id, old.email, old.name, DateTime.now)
  save(updated)
  updated
}

This creates an instance of the old case class from the document, uses the available information as input for the case class constructor, and a more or less reasonable default value for the last login timestamp. We then save this migrated user object, which replaces the old document. Finally, we return the object.
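The write-back step is easy to check in isolation. The following sketch replays the same flow against a mutable Map that stands in for the MongoDB collection, so it runs without Salat or Casbah. MapUserStore, the injected now function and the Long-valued lastLogin are assumptions for illustration, not our production code.

```scala
import scala.collection.mutable

object LazyMigrationSketch {
  type Doc = Map[String, Any] // stand-in for a MongoDBObject (assumption)

  final case class User(id: Long, email: String, name: String, lastLogin: Long)

  // `now` is injected so that the default timestamp is predictable in tests.
  final class MapUserStore(now: () => Long) {
    // Hypothetical stand-in for the MongoDB collection, keyed by _id.
    val col = mutable.Map.empty[Long, Doc]

    // Always writes the newest layout, replacing whatever is stored under this _id.
    def save(user: User): Unit = {
      val doc: Doc = Map(
        "_id" -> user.id, "_version" -> 2, "email" -> user.email,
        "name" -> user.name, "lastLogin" -> user.lastLogin)
      col.update(user.id, doc)
    }

    def findById(id: Long): Option[User] = col.get(id).map(buildObject)

    private def buildObject(doc: Doc): User =
      (doc.get("_version"), doc.get("_id"), doc.get("email"), doc.get("name")) match {
        case (Some(2), Some(id: Long), Some(email: String), Some(name: String)) =>
          doc.get("lastLogin") match {
            case Some(t: Long) => User(id, email, name, t)
            case _ => throw new IllegalStateException("version 2 document without lastLogin")
          }
        case (Some(1), Some(id: Long), Some(email: String), Some(name: String)) =>
          val updated = User(id, email, name, now()) // default for the missing field
          save(updated)                              // replaces the version 1 document
          updated
        case _ => throw new IllegalStateException("illegal version or malformed document")
      }
  }
}
```

After the first findById on a version 1 document, the stored document carries _version 2 and a lastLogin value, so later reads take the direct path.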

Open Issues

This approach assumes a REST-like storage layer. So far we always store and retrieve complete representations of a domain object. However, MongoDB often requires schemas that contain inline documents to allow for more performant access to information.

An example could be user notifications. For instance, a user might need to be notified of certain activities within the system. The activities are stored in a separate collection, with the activities that are notifications for a user indexed on the userId. When we retrieve a user with its notifications, we basically make two queries, one against the user collection and one against the activities collection. However, if users are requested very frequently, we have two queries for each request. An alternative approach could be to store the notifications inline:

{
  "_id": 1,
  "_version": 3,
  "email": "user@example.com",
  "name": "Test User",
  "lastLogin": 123,
  "notifications": [
    {
      "_id": 1,
      "_type": "message",
      "_version": 1,
      "message": "How are you?"
    }
  ]
}

But if the activity model now changes, we have to handle the migration of activities in two places: the activity collection and the list of notifications inside each user. If we just update the case class, we have to implement the migration in both places. This works, but results in redundant code.

We are currently experimenting with an approach that provides to the outside world an API wrapping the grater part (which basically corresponds to the buildObject(dbo) method from our example) with the following interface:

def buildUser(dbo: MongoDBObject): Either[User, User]

The idea is to return a Right if the document was up to date, and a Left if the document was migrated. The user store then knows that it has to update the activity in the notifications array.
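A rough sketch of how a caller might use that signal is shown below. As this is still an experiment, treat it as one possible shape rather than our final design: a Map[String, Any] stands in for the MongoDBObject, the now parameter and the load helper with its persist callback are hypothetical, and lastLogin is a Long.

```scala
object EitherSignalSketch {
  type Doc = Map[String, Any] // stand-in for a MongoDBObject (assumption)

  final case class User(id: Long, email: String, name: String, lastLogin: Long)

  // Left = the document was migrated, Right = it was already up to date.
  def buildUser(doc: Doc, now: Long): Either[User, User] =
    (doc.get("_version"), doc.get("_id"), doc.get("email"),
     doc.get("name"), doc.get("lastLogin")) match {
      case (Some(2), Some(id: Long), Some(email: String), Some(name: String), Some(t: Long)) =>
        Right(User(id, email, name, t))
      case (Some(1), Some(id: Long), Some(email: String), Some(name: String), _) =>
        Left(User(id, email, name, now))
      case _ =>
        throw new IllegalStateException("illegal version or malformed document")
    }

  // Hypothetical caller: the builder no longer saves anything itself,
  // the owner of the document decides whether and where to write back.
  def load(doc: Doc, now: Long)(persist: User => Unit): User =
    buildUser(doc, now) match {
      case Left(migrated) =>
        persist(migrated)
        migrated
      case Right(current) => current
    }
}
```

The point of the design is that the same builder could be reused for a top-level collection and for inline documents, with each store supplying its own persist step.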

Conclusion

We have used the approach presented here for a couple of months now, and migrating the database on the fly is a no-brainer in many cases. A crucial point is to write tests before deploying a migration and to make sure that the migration paths are triggered as expected and produce the expected results. We have migrated the database numerous times now and have not had a single problem. Using Salat, versioning documents, handling old versions and persisting the updated documents is extremely simple. We can actually do most of the work in Scala and don’t have to cope with BSON documents directly, even in the case of a migration.

Things only get more complicated when you need to store inline documents redundantly. We are experimenting with some ideas, the most recent being the one outlined before. I would be happy to hear from other people about their approach to handling lazy migration in their applications.
