Determine mime-type of a file in Scala

In dam simple we need to reliably determine the file type of uploaded documents. Unfortunately, we realized that browsers do not always send the correct mime type or, to be more exact, they sometimes get at least the encoding wrong. Furthermore, other clients, such as the dam simple OSX app, do not have this built-in logic, so determining the correct mime type would have to be implemented again for every non-browser client.

Since we use the mime type to control the further processing of the documents, it should be reliable (at least to the degree possible), and we decided to figure it out in the backend to have full control over it and to be able to deliver a consistent user experience.

In the following I will explain how to find out the mime type and encoding with standard Unix tools and how to call those tools through the Scala process API.

file and mimetype

file is a default command available on Unix-like systems (at least all systems I know provide it). It uses different strategies to determine the type of a file:

carsten:~/Downloads$ file Filter\ Mockup.tiff
Filter Mockup.tiff: TIFF image data, big-endian

You can use the --mime option to let it produce a mime type including the encoding, which is relevant for text files:

carsten:~/Downloads$ file --mime test.txt 
test.txt: text/plain; charset=utf-8
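As a small illustration, a line printed by file --mime can be split into the mime type and the charset. This is only a sketch of how one might parse that output; the names FileMimeOutput, Detected and parse are made up for this example and are not part of dam simple.

```scala
object FileMimeOutput {
  // Result of parsing one line of `file --mime` output.
  final case class Detected(mimeType: String, charset: Option[String])

  // Parses a line such as "test.txt: text/plain; charset=utf-8".
  // Returns None when the part after the file name does not look like
  // "<type>/<subtype>", e.g. for a human-readable description.
  def parse(line: String): Option[Detected] = {
    // Split at the last ": " because the file name itself may contain one.
    val idx = line.lastIndexOf(": ")
    if (idx < 0) None
    else {
      val parts = line.substring(idx + 2).split(";").map(_.trim).filter(_.nonEmpty)
      parts.headOption.filter(_.contains("/")).map { mime =>
        val charset = parts.drop(1).collectFirst {
          case p if p.startsWith("charset=") => p.stripPrefix("charset=")
        }
        Detected(mime, charset)
      }
    }
  }
}
```

A line that carries only a description, such as the TIFF example further up, yields None here, which makes the missing mime type easy to spot.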

In theory that’s all we need. However, there are some drawbacks, as I learned. First, at least on an Ubuntu 12.10 install, for some file types (e.g. MS Office) it did not provide a mime type but always a human-readable description. Furthermore, the strategy used by file to determine the mime type is not able to distinguish between different types of MS Office documents; for example, it cannot tell Word from Excel.

I then learned that there is also the mimetype command, on Ubuntu 12.10 provided by the package libfile-mimeinfo-perl. It uses the file extension to determine the mime type. In my tests this clearly worked better and more reliably than file. So I decided to use mimetype to determine the mime type and, for text files only, to use file to get the encoding. This gives me the information needed to process documents in dam simple and to return them to a user with the correct mime type.
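The decision described here can be sketched independently of how the two tools are actually invoked. In the following example the two lookups are passed in as functions, so mimeOf would stand for a call to mimetype and charsetOf for a call to file; the object name MimeStrategy and both parameter names are assumptions made for this illustration.

```scala
import java.io.File

object MimeStrategy {
  // Builds a Content-Type style value for `file`.
  // The mime type always comes from `mimeOf`; `charsetOf` is consulted
  // only when the mime type is a text type.
  def contentType(
      file: File,
      mimeOf: File => Option[String],
      charsetOf: File => Option[String]
  ): Option[String] =
    mimeOf(file).map { mime =>
      if (mime.startsWith("text/"))
        charsetOf(file).fold(mime)(cs => s"$mime; charset=$cs")
      else mime
    }
}
```

Passing the lookups as functions keeps the rule itself easy to test with stubs, without having either tool installed.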

Calling system processes from Scala

Scala provides a nice abstraction over calling system processes.

import scala.sys.process._ // brings the ! method and ProcessLogger into scope
val ret = Seq("mimetype", "-b", file.getAbsolutePath()) ! ProcessLogger(line => retValue = Some(line))

This runs the command specified within the Seq, assigns the exit code of the process to ret and uses the ProcessLogger to capture any stdout/stderr output. Let’s look at the different parts in more detail.

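The Seq(...) construct is used because every element is passed to the process as exactly one argument, so a file name may contain spaces. An alternative would be to provide the command as a simple string, but a string is split into arguments at whitespace, which breaks for such file names. So the Seq represents the command: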
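The difference is easy to see with echo, assuming it is available on the PATH. The object name SeqVersusString is made up for this sketch; !! is the scala.sys.process method that returns the standard output of the command as a String.

```scala
import scala.sys.process._

object SeqVersusString {
  // The second element is one argument containing two spaces;
  // echo receives it unchanged.
  def viaSeq: String = Seq("echo", "a  b").!!.trim

  // The same text as a plain string is split at whitespace,
  // so echo receives the two arguments "a" and "b".
  def viaString: String = "echo a  b".!!.trim
}
```

With the Seq the double space survives; with the plain string echo joins its two arguments with a single space.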

mimetype -b <filename>

The “!” is basically the run method: it starts the process, waits for it to finish and returns its exit value. The “!” method optionally takes parameters, e.g., as depicted here, a ProcessLogger:

ProcessLogger(line => retValue = Some(line))

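This instance just assigns every line it receives to the variable retValue, which has to be declared beforehand as a var of type Option[String] with the initial value None, so after the call the variable holds the last line. It will therefore contain the mime type of the file, or None if no line was produced (which should not happen, because that would mean an error).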
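Put together, the parts above might look like the following sketch. It is not the code from the gist: the names LastLine, of and mimeTypeOf are assumptions, stderr is ignored by using the two-function form of ProcessLogger, and the result is returned only when the exit code is 0. It also assumes that mimetype is installed and on the PATH.

```scala
import java.io.File
import scala.sys.process._
import scala.util.Try

object LastLine {
  // Runs `command`, waits for it to finish and returns the last line it
  // wrote to stdout. Returns None if the command cannot be started,
  // exits with a non-zero code, or prints nothing.
  def of(command: Seq[String]): Option[String] = {
    var last: Option[String] = None
    // First function receives stdout lines, second one stderr lines.
    val logger = ProcessLogger((line: String) => { last = Some(line) }, (_: String) => ())
    // `!` throws if the program does not exist, hence the Try.
    Try(command ! logger).toOption.filter(_ == 0).flatMap(_ => last)
  }

  // Hypothetical wrapper for the command used in this post.
  def mimeTypeOf(file: File): Option[String] =
    of(Seq("mimetype", "-b", file.getAbsolutePath))
}
```

Checking the exit code as well as the captured line means a failed call is reported as None instead of handing an error message on as if it were a mime type.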

A complete implementation is provided in this gist: https://gist.github.com/4148825

We could probably use a few more Scala features to shorten the code, but it is already a very concise implementation. There are alternatives: Apache Tika, for example, provides similar functionality, but I felt it would be overkill when good Unix tools already solve the problem.
