<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:ent="http://www.purl.org/NET/ENT/1.0/" version="2.0">
  <channel>
    <title>Curiouser and Curiouser! on functional-programming</title>
    <link>http://matt.blogs.it/</link>
    <description>RSS feed for topic functional-programming</description>
    <copyright>Copyright 2007 Matt Mower. Some rights reserved.</copyright>
    <generator>Squib/0.5.0.382</generator>
    <managingEditor>self@mattmower.com</managingEditor>
    <webMaster>self@mattmower.com</webMaster>
    <language>en-gb</language>
    <item>
      <title>A first concurrency in Erlang</title>
      <link>http://matt.blogs.it/entries/00002513.html</link>
      <pubDate>Tue, 13 Mar 2007 13:09:39 +0000</pubDate>
      <description>&lt;p&gt;I've recently started learning the &lt;a href="http://www.erlang.org/"&gt;Erlang&lt;/a&gt; language &lt;a href="http://matt.blogs.it/entries/00002510.html"&gt;using Joe Armstrong's new, in-beta, book&lt;/a&gt;. In the past I learned some Lisp, but I ended up admiring Lisp more in the abstract than I liked it in practice, so I didn't stick at it. Lately I've had a yen to return to functional programming because so much of what I find elegant about Ruby seems to have its roots in the FP world, and I feel I still have so much more to learn.&lt;/p&gt;

&lt;p&gt;Getting started with a new programming language, for me, is about picking a problem that is useful, not too difficult, but exercises a reasonable spectrum of the language. In this case, because Erlang is so strong in its support for concurrency, I needed a problem that would naturally have an elegant concurrent solution. Since I have a very large collection of PDFs which includes quite a lot of duplicates and, since I don't like any of the de-dupe solutions I've tried, I thought a good first problem would be scanning for and finding duplicate PDFs across all my disks.&lt;/p&gt;

&lt;p&gt;I spent a bit of time looking at the library that comes with Erlang and was pleased to discover &lt;code&gt;filelib:fold_files&lt;/code&gt;, which neatly took care of most of the actual work of iterating the file-system. What remained was learning how to build the resulting data structure in functional style, bearing in mind Erlang's immutable data structures and single-assignment variables. My approach to finding duplicates was to calculate an MD5 hash of the file size along with the first &amp;amp; last 16 bytes of the file, which looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;get_file_hash( FilePath, Size ) -&gt;
  % Read the first and last 16 bytes, then hash them together with the file size.
  {ok,[Hunk1,Hunk2]} = myfile:open( FilePath, [read], fun( File ) -&gt; file:pread( File, [{0,16},{Size-16,16}] ) end ),
  hex:hex( binary_to_list( erlang:md5( lists:append( integer_to_list( Size ), lists:append( Hunk1, Hunk2 ) ) ) ) ).
&lt;/code&gt;&lt;/pre&gt;
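
&lt;p&gt;The step that turns those hashes into a list of duplicates isn't shown here. As a minimal sketch, assuming each scanned file has already been reduced to a &lt;code&gt;{Hash, Path}&lt;/code&gt; tuple (the module name &lt;code&gt;dupes&lt;/code&gt; and the function &lt;code&gt;group/1&lt;/code&gt; are made up for illustration), a fold over the list can build the result without mutating anything:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-module(dupes).
-export([group/1]).

%% group/1 takes a list of {Hash, Path} tuples and returns a list of
%% {Hash, Paths} for every hash that was seen for more than one path.
group( HashedFiles ) -&gt;
  ByHash = lists:foldl(
    %% dict:append/3 returns a new dict with Path added to the list stored under Hash.
    fun( {Hash, Path}, Acc ) -&gt; dict:append( Hash, Path, Acc ) end,
    dict:new(),
    HashedFiles
  ),
  Dupes = dict:filter( fun( _Hash, Paths ) -&gt; length( Paths ) &gt; 1 end, ByHash ),
  dict:to_list( Dupes ).
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each step of the fold hands back a fresh dictionary rather than updating the old one, so single assignment never gets in the way. The order of the hashes in the returned list is not specified.&lt;/p&gt;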

&lt;p&gt;As an aside I really enjoy the way Erlang functions are declared, e.g.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;length( [] ) -&amp;gt; 0;
length( [H|T] ) -&amp;gt; 1 + length( T ).
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which uses pattern matching on the arguments to find the right clause of the function. It's elegant, although I did get caught out by not realising that clauses of a function with the same name but different arity do not belong together, e.g.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;find( Path ) -&amp;gt; expr;
find( Path, [] ) -&amp;gt; expr.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;are not clauses of the same function (being instead clauses of &lt;code&gt;find/1&lt;/code&gt; and &lt;code&gt;find/2&lt;/code&gt; respectively), which can lead to a &lt;code&gt;head mismatch&lt;/code&gt; error that is perplexing if you're not expecting it.&lt;/p&gt;
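
&lt;p&gt;A minimal sketch of the fix, using a hypothetical &lt;code&gt;finder&lt;/code&gt; module whose function bodies are placeholders: end &lt;code&gt;find/1&lt;/code&gt; with a full stop rather than a semicolon, so that &lt;code&gt;find/2&lt;/code&gt; starts a new function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-module(finder).
-export([find/1]).

%% find/1 ends with a full stop, so the compiler treats it as complete.
find( Path ) -&gt; find( Path, [] ).

%% find/2 is a separate function that happens to share the name.
find( Path, Acc ) -&gt; [ Path | Acc ].
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Having the short arity delegate to the longer one with a starting accumulator is a common way to lay out this kind of pair.&lt;/p&gt;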

&lt;p&gt;My definition of &lt;code&gt;myfile:open()&lt;/code&gt;, above, reflects my failure to close any files in my first cut at the solution leading to some strange &lt;code&gt;emfile&lt;/code&gt; errors. I can't remember the last time I didn't pass a block to Ruby's &lt;code&gt;File#open&lt;/code&gt; so I think Matz got that dead right, hence:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;open( Filename, Modes, Block ) when is_function( Block ) -&amp;gt;
  {ok,File} = file:open( Filename, Modes ),
  RetVal = Block( File ),
  file:close( File ),
  RetVal.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;that demonstrates another neat aspect of Erlang function clauses, guard expressions.&lt;/p&gt;
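
&lt;p&gt;One gap in the version above is that the file stays open if the block raises an error. A hedged sketch of a variant, in a hypothetical &lt;code&gt;safefile&lt;/code&gt; module, uses &lt;code&gt;try ... after&lt;/code&gt; to close the file either way and tightens the guard to &lt;code&gt;is_function/2&lt;/code&gt; so that the arity of the fun is checked as well:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-module(safefile).
-export([open/3]).

%% Hypothetical variant of myfile:open/3.
%% The guard only accepts funs that take exactly one argument.
open( Filename, Modes, Block ) when is_function( Block, 1 ) -&gt;
  {ok,File} = file:open( Filename, Modes ),
  try
    Block( File )
  after
    %% Runs whether Block returns normally or raises.
    file:close( File )
  end.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The value of the &lt;code&gt;try&lt;/code&gt; expression is whatever the block returns, so callers use it exactly as before.&lt;/p&gt;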

&lt;p&gt;I was surprised at how quickly I had a solution to my basic problem, almost Ruby speed. However, what the basic solution didn't do was take advantage of Erlang's concurrency primitives, &lt;code&gt;spawn&lt;/code&gt; &amp;amp; co. Initially I was quite confused about how to parallelize my solution since so much of the work was going on in &lt;code&gt;fold_files&lt;/code&gt;. After some pondering I concluded that what I was looking for was &lt;a href="http://en.wikipedia.org/wiki/Future_(programming)"&gt;futures&lt;/a&gt;. I looked through the library that comes with Erlang but couldn't recognize what I was looking for, so I cooked up the following:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-module(futures).
-export([new/1,value/1]).

new( Fun ) -&amp;gt;
  Pid = spawn( fun oneshot/0 ),
  Pid ! { self(), Fun },
  Pid.

value( Pid ) -&amp;gt;
  receive
    { Pid, { ok, Result } } -&amp;gt; Result
  end.

  oneshot() -&amp;gt;
    receive
      { Sender, Fun } -&amp;gt;
        Sender ! { self(), { ok, Fun() } };
      Any -&amp;gt;
        io:format( "Oneshot(~p) received unknown message ~p~n", [self(),Any] )
    end.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Which, it turns out, isn't a shocking travesty ;-) I did get some useful suggestions from the fine folks in #erlang, including looking at Joe Armstrong's &lt;a href="http://www.erlang.org/ml-archive/erlang-questions/200606/msg00187.html"&gt;definition of pmap&lt;/a&gt;. Now I was able to make essentially a two-line change to have a fully concurrent solution:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;scan( Path ) -&gt;
  lists:map(
    fun( Future ) -&gt; futures:value( Future ) end,
    filelib:fold_files(
      Path,
      % Escape the dot and anchor the end so only names ending in .pdf match.
      ".*\\.pdf$",
      true,
      fun( F, Acc ) -&gt; [ futures:new( fun() -&gt; file_info:file_info( F ) end ) | Acc ] end,
      []
    )
  ).
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Creating the future spawns an Erlang process that runs the embedded function (that does all the nasty mucking about with files and MD5 hashing). Later the future is referenced for its value, which will block until a value is available (but by which time all the processes have been spawned off anyway). Crucial to this being efficient is Erlang's ability to cope with hundreds of thousands of processes (I was able to run tests creating millions of processes on my MacBook Pro CD2 with 2GB memory). For more about that &lt;a href="http://www.lshift.net/blog/2006/09/10/how-fast-can-erlang-create-processes"&gt;take a look here&lt;/a&gt;.&lt;/p&gt;
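
&lt;p&gt;To see the spawn-first, block-later behaviour in isolation, here is a small self-contained sketch. It is a variant rather than the &lt;code&gt;futures&lt;/code&gt; module above: the hypothetical &lt;code&gt;futures_sketch&lt;/code&gt; module captures the caller's pid in a closure instead of sending it in a first message, and &lt;code&gt;demo/0&lt;/code&gt; is only there for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-module(futures_sketch).
-export([new/1, value/1, demo/0]).

%% Spawn a process that runs Fun and posts the result back to the caller.
new( Fun ) -&gt;
  Parent = self(),
  spawn( fun() -&gt; Parent ! { self(), { ok, Fun() } } end ).

%% Block until the result tagged with this particular Pid arrives.
%% Must be called from the process that called new/1.
value( Pid ) -&gt;
  receive
    { Pid, { ok, Result } } -&gt; Result
  end.

%% All three processes are spawned before the first value is asked for.
demo() -&gt;
  Futures = lists:map( fun( N ) -&gt; new( fun() -&gt; N * N end ) end, [1,2,3] ),
  lists:map( fun value/1, Futures ).
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling &lt;code&gt;futures_sketch:demo()&lt;/code&gt; returns &lt;code&gt;[1,4,9]&lt;/code&gt;. Because each &lt;code&gt;receive&lt;/code&gt; matches on the pid, the results come back in list order whatever order the processes finish in.&lt;/p&gt;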

&lt;p&gt;The way I have defined the future module highlights what is for me, so far, the only thing I don't like about Erlang (and probably all functional languages). I miss the ability to wrap the future into a nicely packaged class. Calling:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;futures:value( Pid )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;seems less elegant than (hypothetically):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;future.value()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Maybe it's a small point but modules do feel very clunky after using Ruby classes for such a long time.&lt;/p&gt;
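
&lt;p&gt;One way to get a little closer to that, offered as a sketch rather than an established idiom, is to have the constructor return a fun that closes over the pid, so the caller never handles the pid at all. The module name &lt;code&gt;future_fun&lt;/code&gt; is hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-module(future_fun).
-export([new/1]).

%% Returns a zero-argument fun; calling it blocks until the result is ready.
new( Fun ) -&gt;
  Parent = self(),
  Pid = spawn( fun() -&gt; Parent ! { self(), { ok, Fun() } } end ),
  fun() -&gt;
    receive
      { Pid, { ok, Result } } -&gt; Result
    end
  end.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With &lt;code&gt;Future = future_fun:new( fun() -&gt; 6 * 7 end )&lt;/code&gt;, the value is read as &lt;code&gt;Future()&lt;/code&gt;. The catch is that it has to be called from the process that created it, and only once, since a second call would wait for a message that never comes.&lt;/p&gt;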

&lt;p&gt;I've still got a few bugs to work through (the odd future call seemingly returning a pid instead of the expected result, which is a bit weird) but otherwise I have quickly arrived at 80% of a fast, neat solution to my problem. The last step will be to have it create a customisable script for handling the duplicate files.&lt;/p&gt;

&lt;p&gt;I'm beginning to like Erlang a lot.&lt;/p&gt;</description>
      <guid isPermaLink="true">http://matt.blogs.it/entries/00002513.html</guid>
      <ent:cloud ent:href="http://matt.blogs.it/topics/">
      </ent:cloud>
    </item>
  </channel>
</rss>