The clojure-hadoop library is an excellent facility for running Clojure functions within the Hadoop MapReduce framework. At Compass Labs we’ve been using its job abstraction for elements of our production flow and found a number of limitations that motivated us to implement extensions to the base library. We’ve submitted these for integration into a new release of clojure-hadoop, which should be out shortly.
There are still some design and implementation warts in the current release which should be fixed by ourselves or other users in the coming weeks.
I have a bad tendency in my research work to write my own code and libraries from scratch, in large part because I’ve decided to keep most of my coding in Common Lisp to leverage prior tools. However, I’ve recently been given a painful demonstration of how it is often faster to pay the up-front cost to learn the right tool than to rewrite (and maintain) the subsets you think you need. For example, I found myself venturing into Clojure/Java/Hadoop for my commercial work this year as a compromise between Lisp / dynamic language features and integration benefits. This week I’m finding the need to do some rather sophisticated work with graphical models and I need some tools to build and evaluate them.
I’ve looked at a wide variety of open source approaches such as SamIam (no continuous variables), WinBUGS (Windows only), OpenBUGS (not quite ready), HBC (inference only), MALLET (OK, but I don’t like Java, and it doesn’t support all forms of real-valued random variables), Incanter (limited but growing support for graphical models) and R.
What is Self-Tracking?
Self-tracking is a process through which we attempt to uncover patterns in our daily lives or environment. Tracking can be used for a variety of purposes, including exploratory (what correlations do I see?), explanatory (why does this happen?) or experimental (if I change X, Y should happen). Regardless of the specific purpose, our ultimate goal is almost always to develop some model of cause and effect that we can use to inform our future decisions. The discovery of cause-effect relationships and the consequences of interventions is the essential aim of the scientific method. It takes years of education and practical training to understand how to apply methodology to gain valid insights into fundamental questions about cause and effect in some natural or artificial system. Methodology is crucial to avoid drawing incorrect conclusions.
However, we must also acknowledge that tracking, modeling and intervening in a system is a fundamental human exercise.
I’m looking forward to spending the next few days with the Lybba team. Lybba is a fantastic non-profit organization doing important work to transform the way that we interact with our individual health and the healthcare system.
I had a great time giving an unusual talk at the Quantified Self meetup in SF last week. Several people asked me to post slides online. There were also a few questions we didn’t have time to address to which this is a partial answer.
Self-Experimentation without a Written Record
Tracking my lifestyle changes and related symptoms on an ongoing basis has proved to be challenging. The severity of my symptoms has never been such that I’ve made detailed note-taking a priority. Instead, I slowly evolved a mental methodology for keeping track of my experiences by focusing on one hypothesis at a time and slowly accumulating what I consider to be informative observations and conclusions.
In practice, I maintain two mental ‘records’:
Compass Labs is a heavy user of Clojure for analytics and data processing. We have been using Stuart Sierra’s excellent clojure-hadoop package for running a wide variety of algorithms derived from dozens of different Java libraries over our datasets on large Elastic MapReduce clusters.
The standard way to build a jar for MapReduce is to pack all the libraries for a given package into a single jar and upload it to EMR. Unfortunately, building uberjars for Hadoop amounts to swinging a mallet when only a small tap is needed.
We recently reached a point where the overhead of uploading large jars caused a noticeable slowdown in our development cycle, especially when launching jobs on connections with limited upload bandwidth and with the slower uberjar creation of lein 1.4.
There are (at least) two solutions to this:
- Build smaller jars by having projects with dependencies specific to a given job
- Cache the dependencies on the cluster and send over your dependency list and pull the set you need into the Distributed Cache at job configuration time.
To allow us to continue to package all our job steps in a single jar, source tree and lein project, we opted for the latter solution which I will now describe.
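The configuration-time step above can be sketched as follows. This is a minimal illustration, not the actual Compass Labs implementation: the cache location `jar-cache-root` and the helper name `add-cached-deps!` are assumptions, while `DistributedCache/addFileToClassPath` is the standard Hadoop API for placing a cached jar on the job classpath.

```clojure
(ns example.dep-cache
  "Sketch of pulling pre-uploaded dependency jars into the
   DistributedCache at job configuration time, instead of
   bundling everything into an uberjar."
  (:import [org.apache.hadoop.conf Configuration]
           [org.apache.hadoop.fs Path]
           [org.apache.hadoop.filecache DistributedCache]))

;; Assumed location where dependency jars were uploaded once,
;; ahead of time (e.g. to HDFS or S3 visible to the cluster).
(def jar-cache-root "hdfs:///cache/jars")

(defn add-cached-deps!
  "Hypothetical helper: for each named dependency jar, add the
   cached copy to the job's classpath via the DistributedCache.
   Only the small job jar itself then needs to be uploaded."
  [^Configuration conf dependency-jars]
  (doseq [jar dependency-jars]
    (DistributedCache/addFileToClassPath
      (Path. (str jar-cache-root "/" jar)) conf)))

;; Usage at job configuration time, with an illustrative dep list:
;; (add-cached-deps! conf ["clojure-1.2.0.jar" "commons-io-1.4.jar"])
```

Each worker node then fetches the listed jars from the shared cache rather than receiving them inside the job jar, so the per-job upload stays proportional to your own code, not your dependency tree.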
A video link of my talk at the Mayo Clinic Transform Event.