The clojure-hadoop library is an excellent facility for running Clojure functions within the Hadoop MapReduce framework. At Compass Labs we’ve been using its job abstraction for elements of our production flow and found a number of limitations that motivated us to implement extensions to the base library. We’ve promoted this for integration into a new release of clojure-hadoop which should issue shortly.
There are still some design and implementation warts in the current release which should be fixed by ourselves or other users in the coming weeks.
Compass Labs is a heavy user of Clojure for analytics and data process. We have been using Stuart Sierra’s excellent clojure-hadoop package for running a wide variety of algorithms derived from dozens of different Java libraries over our datasets on large Elastic MapReduce clusters.
The standard way to build a Jar for mapreduce is to pack all the libraries for a given package into a single jar and upload it to EMR. Unfortunately, building uberjars for Hadoop is a mallet when a small tap is needed.
We recently reached a point where the overhead of uploading large jars causes a noticeable slow down in our development cycle, especially when launching jobs on connections with limited upload bandwidth and with the slower uberjar creation of lein 1.4.
There are (at least) two solutions to this:
- Build smaller jars by having projects with dependencies specific to a given job
- Cache the dependencies on the cluster and send over your dependency list and pull the set you need into the Distributed Cache at job configuration time.
To allow us to continue to package all our job steps in a single jar, source tree and lein project, we opted for the latter solution which I will now describe.