In dam simple we need to reliably determine the file type of uploaded documents. Unfortunately, we realized that browsers do not always send the correct mime type or, to be more exact, sometimes get at least the encoding wrong. Furthermore, other clients, such as the dam simple OS X app, do not have this logic built in, so determining the correct mime type would have to be implemented again for every non-browser client.
Since we use the mime type to control the further processing of the documents, it should be reliable (at least to the degree possible), and we decided to figure it out in the backend to have full control over it and to be able to deliver a consistent user experience.
In the following I will explain how to find out the mime type and encoding with standard Unix tools and how to call these tools through the Scala process API.
file and mimetype
file is a default command available on Unix-like systems (at least all systems I know of provide it). It uses different strategies to determine the type of a file:
carsten:~/Downloads$ file Filter\ Mockup.tiff
Filter Mockup.tiff: TIFF image data, big-endian
You can use the --mime option to let it produce a mime type including the encoding, which is relevant for text files:
carsten:~/Downloads$ file --mime test.txt
test.txt: text/plain; charset=utf-8
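As a rough sketch (not the code used in dam simple), an output line like the one above could be taken apart in Scala as follows. The names FileMimeOutput, MimeInfo and parse are made up for illustration, and the regular expression assumes the "name: type; charset=..." format shown above.

```scala
object FileMimeOutput {
  /** Mime type and optional charset as printed by `file --mime`. */
  final case class MimeInfo(mimeType: String, charset: Option[String])

  // Matches e.g. "test.txt: text/plain; charset=utf-8".
  // The leading group is greedy, so the last colon followed by whitespace
  // and something that looks like "type/subtype" separates name and type.
  private val Line =
    """(.*):\s+([\w.+-]+/[\w.+-]+)(?:;\s*charset=(\S+))?\s*""".r

  /** Returns None if the line does not contain a mime type. */
  def parse(line: String): Option[MimeInfo] = line match {
    // An unmatched optional group is null, which Option(...) turns into None.
    case Line(_, mime, charset) => Some(MimeInfo(mime, Option(charset)))
    case _                      => None
  }
}
```

A human-readable description such as "TIFF image data, big-endian" does not look like "type/subtype", so parse returns None for it instead of a wrong mime type.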
In theory that's all we need. However, there are some drawbacks, as I learned. First, at least on an Ubuntu 12.10 install, for some file types (e.g., MS Office) it did not provide a mime type but always a human-readable description. Furthermore, the strategy used by file to determine the mime type is not able to distinguish between different types of MS Office documents; e.g., it cannot distinguish between Word and Excel.
I then learned that there is also the mimetype command, on Ubuntu 12.10 provided by the package libfile-mimeinfo-perl. It uses the file extension to determine the mime type. In my tests this clearly worked better and more reliably than using file. So I decided to use mimetype to determine the mime type and, for text files, file to get the encoding. This allows me to determine the necessary information to process documents in dam simple and to return them to a user with the correct mime type.
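The decision logic described above can be sketched roughly like this. It is only an illustration, not the dam simple code: DocumentType, Detected and detect are hypothetical names, and the two lookups are passed in as plain functions that stand in for the calls to mimetype and file.

```scala
object DocumentType {
  /** Result of the detection: mime type plus charset for text files. */
  final case class Detected(mimeType: String, charset: Option[String])

  /**
   * mimeTypeOf stands in for a call to `mimetype`,
   * charsetOf for a call to `file --mime` (both are assumptions here).
   */
  def detect(path: String,
             mimeTypeOf: String => Option[String],
             charsetOf: String => Option[String]): Option[Detected] =
    mimeTypeOf(path).map { mime =>
      // Only text files get a second lookup for the encoding.
      val charset = if (mime.startsWith("text/")) charsetOf(path) else None
      Detected(mime, charset)
    }
}
```

Passing the lookups as functions keeps the decision separate from the process calls, so it can be tried out without either tool installed.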
Calling system processes from Scala
Scala provides a nice abstraction over calling system processes.
val ret = Seq("mimetype", "-b", file.getAbsolutePath()) ! ProcessLogger(line => retValue = Some(line))
This runs the command specified within the Seq, assigns the process return value to ret and uses the ProcessLogger to capture any stdout/stderr output. Let's look at the different parts in more detail.
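The Seq(...) construct is used because every element is passed as exactly one argument, so spaces in the file name are not a problem. An alternative would be to provide the command as a simple string, but a string is split into arguments at whitespace, so file names containing spaces would need extra care. So the Seq represents the command: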
mimetype -b <filename>
The “!” is basically the run method: it waits for the process to finish and returns the process's exit value. The “!” method optionally takes parameters, e.g., as shown here, a ProcessLogger:
ProcessLogger(line => retValue = Some(line))
This instance just assigns every line it receives to the variable retValue (a var of type Option[String] that starts out as None), so after the process has finished it holds the last line. The variable will therefore contain the mime type of the file, or None if no line was produced (which should not happen, because that would mean an error).
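Put together, the pieces could look roughly like the following sketch. It is not the implementation from the gist: LastLine, run and mimeType are names I made up here, and it assumes that mimetype is installed and prints the mime type as its last line.

```scala
import scala.sys.process._
import scala.util.Try

object LastLine {
  /** Runs the command and returns its exit value plus the last line it printed. */
  def run(command: Seq[String]): (Int, Option[String]) = {
    var last: Option[String] = None
    // A ProcessLogger built from one function receives stdout and stderr lines.
    val exit = command ! ProcessLogger(line => last = Some(line))
    (exit, last)
  }

  /** Mime type via `mimetype -b`, or None if the call fails or prints nothing. */
  def mimeType(path: String): Option[String] =
    // Try covers the IOException thrown when the command cannot be started.
    Try(run(Seq("mimetype", "-b", path))).toOption.collect {
      case (0, Some(line)) if line.trim.nonEmpty => line.trim
    }
}
```

Checking the exit value as well as the captured line avoids treating an error message on stderr as a mime type.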
A complete implementation is provided in this gist: https://gist.github.com/4148825
We could probably use some more Scala features to shorten the code; however, it is already a very concise implementation. There are alternatives, e.g., Apache Tika provides similar functionality, but I felt it would be overkill when good Unix tools exist to solve the problem.