I recently started to fix some issues with the Smart Keywording ResourceSpace plugin and the latest ResourceSpace. Since I need a running ResourceSpace to test the plugin, and because I also need a fresh ResourceSpace sometimes to check whether everything installs correctly or to test a specific version, I built a configuration for Vagrant that allows me to boot up a fresh installation of ResourceSpace within 5 minutes. Fresh means really fresh, i.e., no data, latest commit from trunk, perfect for development and testing. But Vagrant also allows me to keep the state of a virtual machine until I explicitly destroy it. And I can have different versions of the virtual machine. If I modify the configuration to map different ports, they can even run in parallel.
I encourage you to check out Vagrant, which seems to be great for having small, dedicated VMs for development and testing.
In dam simple we need to reliably determine the file type of uploaded documents. Unfortunately, we realized that browsers do not always send the correct mime type, or, to be more exact, they sometimes get at least the encoding wrong. Furthermore, other clients, such as the dam simple OS X app, do not have this logic built in, so determining the correct mime type would have to be implemented again for every non-browser client.
Since we use the mime type to control the further processing of the documents, it should be reliable (at least to the degree possible), and we decided to figure it out in the backend to have full control over it and to be able to deliver a consistent user experience.
In the following I will explain how to find out the mime type and encoding with standard Unix tools and how to call these tools through the Scala process API.
file and mimetype
file is a standard command available on Unix-like systems (at least all systems I know provide it). According to its man page, it uses different strategies to determine the type of a file: filesystem tests, magic tests, and language tests.
In theory that’s all we need. However, there are some drawbacks, as I learned. First, at least on an Ubuntu 12.10 install, for some file types (e.g., MS Office) it did not provide a mime type but always a human-readable description. Furthermore, the strategy used by file to determine the mime type cannot distinguish between different types of MS Office documents, e.g., between Word and Excel.
I then learned that there is also the mimetype command, provided on Ubuntu 12.10 by the package libfile-mimeinfo-perl. It uses the file extension to determine the mime type. In my tests this clearly worked better and more reliably than file. So I decided to use mimetype to determine the mime type and, for text files, to use file to get the encoding. This gives me the information needed to process documents in dam simple and to return them to a user with the correct mime type.
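The combination described above could be sketched in Scala roughly as follows. This is a minimal sketch under assumptions, not the dam simple implementation: the names FileTypeProbe, firstLine, mimeType, encoding and probe are made up for illustration, both tools are assumed to be on the PATH, and the encoding lookup assumes that the installed file supports the -b and --mime-encoding options.

```scala
import java.io.File
import scala.sys.process._
import scala.util.Try

// Hypothetical helper, not taken from the gist.
object FileTypeProbe {
  // Runs a command and returns its first non-empty stdout line.
  // Returns None if the command is missing, exits non-zero, or prints nothing.
  def firstLine(cmd: Seq[String]): Option[String] =
    Try(cmd.!!(ProcessLogger((_: String) => ()))) // stderr lines are discarded
      .toOption
      .flatMap(_.linesIterator.map(_.trim).find(_.nonEmpty))

  // mimetype -b prints only the mime type, without the file name.
  def mimeType(f: File): Option[String] =
    firstLine(Seq("mimetype", "-b", f.getAbsolutePath))

  // file -b --mime-encoding prints only the encoding, e.g. "utf-8".
  def encoding(f: File): Option[String] =
    firstLine(Seq("file", "-b", "--mime-encoding", f.getAbsolutePath))

  // Mime type for every file, encoding only for text files.
  def probe(f: File): (Option[String], Option[String]) = {
    val mime = mimeType(f)
    val enc = mime.filter(_.startsWith("text/")).flatMap(_ => encoding(f))
    (mime, enc)
  }
}
```

A caller would use FileTypeProbe.probe(new File(path)) and treat a None mime type as "could not be determined", for example by falling back to application/octet-stream.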
Calling system processes from Scala
Scala provides a nice abstraction over calling system processes.
val ret = Seq("mimetype", "-b", file.getAbsolutePath()) ! ProcessLogger(line => retValue = Some(line))
This runs the command specified in the Seq, assigns the process's exit value to ret, and uses the ProcessLogger to capture any stdout/stderr output. Let's look at the different parts in more detail.
The Seq(...) construct is used because each element is passed to the process as one argument, so an argument such as a file name may contain spaces. An alternative would be to provide the command as a single string, but a string is split into arguments on whitespace, which breaks such file names. So the Seq represents the command:
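The difference between the two forms can be made visible with a small experiment. The sketch below is only an illustration: it uses printf as a stand-in command (my assumption, chosen because it repeats its format for every argument it receives), and the file name and object name are hypothetical.

```scala
import scala.sys.process._

object SeqVsString {
  // Hypothetical file name containing a space.
  val fileName = "my report.txt"

  // Seq form: the file name arrives as a single argument.
  def withSeq: String =
    Seq("printf", "[%s]", fileName).!!.trim

  // String form: the command line is split on whitespace,
  // so the file name arrives as two separate arguments.
  def withString: String =
    ("printf [%s] " + fileName).!!.trim
}

// SeqVsString.withSeq    returns "[my report.txt]"
// SeqVsString.withString returns "[my][report.txt]"
```

With mimetype, the second form would look for two files, "my" and "report.txt", instead of the one that was uploaded.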
mimetype -b <filename>
The “!” is basically the run method: it starts the command, waits for it to finish, and returns the process's exit value. The “!” method optionally takes parameters, e.g., as shown here, a ProcessLogger:
ProcessLogger(line => retValue = Some(line))
This instance just assigns the last line it receives to some variable. The variable will therefore contain the mime type of the file, or None if no such line was produced (which should not happen, because that would mean an error).
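One detail to keep in mind: a ProcessLogger built from a single function receives both stdout and stderr lines, so an error message could end up in that variable as well. A possible variation, sketched here under the assumption that you want to keep the two streams apart and check the exit value, uses the two-function form of ProcessLogger. The names CapturingRunner, Result and runCapturing are hypothetical and not taken from the gist.

```scala
import scala.collection.mutable.ListBuffer
import scala.sys.process._

object CapturingRunner {
  // Exit value plus the captured lines of both streams.
  final case class Result(exitCode: Int, out: List[String], err: List[String])

  def runCapturing(cmd: Seq[String]): Result = {
    val out = ListBuffer.empty[String]
    val err = ListBuffer.empty[String]
    // Two-function form: the first handler gets stdout lines,
    // the second handler gets stderr lines.
    val logger = ProcessLogger(
      (line: String) => { out += line; () },
      (line: String) => { err += line; () }
    )
    // "!" blocks until the process has finished and returns its exit value.
    val exitCode = cmd.!(logger)
    Result(exitCode, out.toList, err.toList)
  }
}
```

A caller could then accept result.out.lastOption as the mime type only when result.exitCode is 0, and log result.err otherwise.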
A complete implementation is provided in this gist: https://gist.github.com/4148825
We could probably use some more Scala features to shorten the code, but it is already a fairly concise implementation. There are alternatives, e.g., Apache Tika provides similar functionality, but I felt it would be overkill when good Unix tools exist to solve the problem.
So, after having two blogs for the last year, one in German and a work-related one in English, I decided to set up a single blog, in which I will only blog about professional stuff, i.e., everything related to Semantic Web, Semantic Multimedia, Software Development, or IT in general. This will become my primary homepage from now on. Now I only have to find time to create content :).