Curiouser and Curiouser!

'Where shall I begin, please your Majesty?' He asked. 'Begin at the beginning,' the King said, very gravely, 'and go on till you come to the end: then stop.'

A first concurrency in Erlang

I've recently started learning the Erlang language using Joe Armstrong's new, in-beta, book. In the past I learned some Lisp but I ended up admiring Lisp more in the abstract than I liked it in practice and so I didn't stick at it. Lately I've had a yen to return to functional programming because so much of what I find elegant about Ruby seems to have it's roots in the FP world and I feel I've so much more to learn yet.

Getting started with a new programming language for me starts with picking a problem to solve that is useful, not too difficult, and exercises a reasonable spectrum of the languages features. In this case, because Erlang is so strong in it's support for concurrency, it seemed natural to look for a problem whose solution would be enhanced by a parallel approach. Since I have a very large collection of PDF's which includes quite a lot of duplicates, and since I don't like any of the de-dupe solutions I've tried, I thought a good first problem would be scanning for and finding duplicate PDF's across all my disks.

I spent a bit of time looking at the library that comes with Erlang and was pleased to discover the fold_files function in the filelib module. At a stroke this takes care of most of the actual work of iterating the file-system and filtering the required files. What remained was learning how to build the duplicate database in functional style and bearing in mind Erlang immutable data structures and single-assignment variables.

My approach to finding duplicates was to calculate an MD5 hash of the file size along with the first & last 16 bytes of the file which looks like this:

get_file_hash( FilePath, Size ) ->
  {ok,[Hunk1,Hunk2]} = myfile:open(
    FilePath, [read], fun( File ) -> file:pread( File, [{0,16},{Size-16,16}] ) end
  ),
  hex:hex(
    binary_to_list(
      erlang:md5(
        lists:append( integer_to_list( Size ), lists:append( Hunk1, Hunk2 ) )
      )
    )
  ).

As an aside I really like the way Erlang functions are declared, e.g.

length( [] ) -> 0;
length( [H|T] ) -> 1 + length( T ).

which uses pattern matching on the arguments to find the right clause of the function to evaluate. It's an elegant approach although I did get caught out by not realising that clauses of a function with the same name but different arity do no belong together, e.g.

find( Path ) -> expr;
find( Path, [] ) -> expr.

does not work because the two are not clauses of the same function (being instead clauses of find/1 and find/2 respectively). If you find yourself with perplexing head mismatch errors try looking for this pattern and separate the clauses properly (using a .).

My definition of the function myfile:open(), above, reflects my failure to close any files during the first attempts at iterating a large filesystem leading to emfile errors. I realised I had forgotten that it was even possible to call Ruby's File#open without passing a block because it's so natural and effective to do so, it's something I think Matz got dead right, hence:

myfile:open( Filename, Modes, Block ) when is_function( Block ) ->
  {ok,File} = file:open( Filename, Modes ),
  RetVal = Block( File ),
  file:close( File ),
  RetVal.

that demonstrates another neat aspect of Erlang function clauses, guard expressions.

Another aside is that, in Erlang, the = operator is not "equals" but "matches". This relates to the fact that Erlang, in a manner somewhat akin to Prolog, uses unification in expressions so that matching unbound variables will bind them and thereafter attempt to unify them, a simple example:

X = 2.

where X is an unbound variable will bind X to the value 2. If later you were to evaluate,

X = 3.

you would get an error because 3 does not match the value, 2, that is already bound to X. This, I think, can take some getting used to but is pretty neat. For example, if File is a filespec record then:

#filespec{hash=Key,size=Size} = File

will bind the variables Key and Size (in parallel) to the hash and size attributes of the record bound to File.

I was surprised at how quickly I arrived at a solution to my basic problem. It was almost as fast I could imagine doing it in Ruby. However what the basic solution lacked was that it took no advantage of Erlangs famous concurrency primitives, spawn & co. Initially I was quite unsure about how to parallelize my solution since so much of the work was going on in fold_files. After some pondering I concluded that what I was looking for was futures. I looked through the library that comes with Erlang but couldn't recognize was I was looking for, so I cooked up the following:

-module(futures).
-export([new/1,value/1]).

new( Fun ) ->
  Pid = spawn( fun oneshot/0 ),
  Pid ! { self(), Fun },
  Pid.

value( Pid ) ->
  receive
    { Pid, { ok, Result } } -> Result
  end.

oneshot() ->
  receive
    { Sender, Fun } ->
      Sender ! { self(), { ok, Fun() } };
    Any ->
      io:format( "Oneshot(~p) received unknown message ~p~n", [self(),Any] )
  end.

Which, it turns out, isn't a shocking travesty ;-) I did get some useful suggestions from the fine folks in #erlang including looking at Joe Armstrong's definition of pmap. Now I was able to make essentially a two line change to have a fully concurrent solution:

scan( Path ) ->
  lists:map(
    fun( Future ) -> futures:value( Future ) end,
    filelib:fold_files(
      Path,
      ".*\.pdf",
      true,
      fun( F, Acc ) -> [
        futures:new( fun() ->
          file_info:file_info( F )
        end ) | Acc ] end,
      []
    )
  ).

Creating the future spawns an Erlang process that runs the embedded function (that does all the nasty mucking about with files and MD5 hashing). Later the future is referenced for it's value which will block until a value is available (but by which time all the processes have been spawned off anyway). Crucial to this being efficient is Erlangs ability to cope with hundreds of thousands of processes (I was able to run tests creating millions of processes on my MacBook Pro CD2 with 2GB memory). For more about that take a look here.

The way I have defined the future module highlights what is for me, so far, the only thing I don't like about Erlang (and probably all functional languages). I miss the ability to wrap the future into a nicely packaged class. Calling:

Pid = futures:new( fun .... end ).
futures:value( Pid ).

seems less elegant than (hypothetically):

future = Future.new do ... end
future.value()

Something about having to treat the generic process id as a future... maybe it's a small point but modules do expose the details more and feel somewhat clunky after using Ruby classes for such a long time. This last is apparent in things like:

lists:map( fun ... end, List )

vs.

list.map do ... end

as well. There have to be tradeoff's I suppose.

I've still got a few bugs to work through (the odd future call seemingly to returnin a pid instead of the expected result which is a bit weird) but otherwise I have quickly arived at 80% of a fast neat solution to my problem. The last step of which will be to have it create a customisable script for handling the duplicate files.

I'm beginning to like Erlang a lot.

13/03/2007 13:09 by Matt Mower | Permalink