Distributed Classification with ADMM

Wednesday, October 09, 2013

Today Jon Sondag and I presented our paper on ADMM for Hadoop at the IEEE BigData 2013 conference.

The paper describes our implementation of Boyd's ADMM algorithm in Hadoop MapReduce. It covers the statistical details of implementing ADMM as well as the nuances of storing state on Hadoop.

Our presentation gives background on the data pipeline we have built at Intent Media and explains why a Hadoop MapReduce job is the appropriate runtime for us. We mention the alternatives for building distributed logistic regression models, such as sampling the data, Apache Mahout, Vowpal Wabbit, and Spark.

We also discuss alternatives specifically designed for iterative computation on Hadoop, such as HaLoop and Twister.

Our presentation is below:

You may also read the full paper, "Practical Distributed Classification using the Alternating Direction Method of Multipliers Algorithm."

The paper describes our open source, Hadoop-based implementation of the ADMM algorithm and how to use it to compute a distributed logistic regression model.
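To make the structure of the algorithm concrete, here is a minimal single-process sketch of consensus ADMM for L2-regularized logistic regression in plain Ruby. It is not the Hadoop implementation from the paper: the shards are in-memory arrays standing in for mapper inputs, the local update is only approximated with a few gradient steps, and the names local_update, admm_logistic, rho, and lambda_ as well as every default value are assumptions chosen for illustration.

```ruby
# Logistic function.
def sigmoid(t)
  1.0 / (1.0 + Math.exp(-t))
end

# Dot product of two equal-length arrays.
def dot(a, b)
  a.zip(b).sum { |p, q| p * q }
end

# Approximate local x-update: a fixed number of gradient steps on
#   logistic_loss(x) + (rho / 2) * ||x - z + u||^2
# Labels are 0 or 1. The step count and learning rate are illustrative.
def local_update(rows, labels, x, z, u, rho, steps: 50, lr: 0.1)
  x = x.dup
  steps.times do
    grad = Array.new(x.size, 0.0)
    rows.each_with_index do |row, n|
      err = sigmoid(dot(row, x)) - labels[n]
      row.each_with_index { |v, j| grad[j] += err * v }
    end
    x = x.each_index.map { |j| x[j] - lr * (grad[j] + rho * (x[j] - z[j] + u[j])) }
  end
  x
end

# Consensus ADMM: each shard keeps a local weight vector x_i and a scaled
# dual vector u_i; z is the shared (consensus) weight vector.
def admm_logistic(shards, dim, rho: 1.0, lambda_: 0.1, iterations: 30)
  n = shards.size
  xs = Array.new(n) { Array.new(dim, 0.0) }
  us = Array.new(n) { Array.new(dim, 0.0) }
  z = Array.new(dim, 0.0)
  iterations.times do
    # "Map" step: every shard updates its own weights independently.
    xs = shards.each_with_index.map do |(rows, labels), i|
      local_update(rows, labels, xs[i], z, us[i], rho)
    end
    # "Reduce" step: average x_i + u_i, shrunk by the L2 penalty on z.
    scale = n * rho / (lambda_ + n * rho)
    z = (0...dim).map { |j| scale * xs.each_index.sum { |i| xs[i][j] + us[i][j] } / n }
    # Dual update: accumulate each shard's disagreement with the consensus.
    us = us.each_index.map { |i| (0...dim).map { |j| us[i][j] + xs[i][j] - z[j] } }
  end
  z
end

# Toy data: two shards, each row is [bias, feature], labels are 0 or 1.
shards = [
  [[[1.0, 2.0], [1.0, -2.0]], [1, 0]],
  [[[1.0, 1.5], [1.0, -1.0]], [1, 0]]
]
z = admm_logistic(shards, 2)
puts "consensus weights: #{z.inspect}"
```

In a MapReduce setting, the local update would roughly correspond to the map phase and the averaging to the reduce phase, with the x_i and u_i vectors being the state that has to be stored between iterations.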

Categorizing Text in Ruby

Thursday, May 23, 2013

We have open sourced the categorization library that powers the fast dynamic labels and clusters on the Helioid site. This library is built to prioritize performance over accuracy. It takes label quality into account by first generating a set of labels and then assigning documents to those labels; we have found that this increases the likelihood of producing meaningful labels.

The example below shows how to create a set of labeled clusters from documents. First, include the categorize library.

require 'categorize'

include Categorize

Then define your set of documents.

documents = [
  'lorem ipsum dolor',
  'sed perspiciatis unde',
  'vero eos accusamus',
  'vero eos accusamus iusto odio'
]

Now make a model based on an additional query term, 'lorem' in this case.

Model.make_model('lorem', documents)
=> {
   'ipsum'            => [0],
   'sed perspiciatis' => [1],
   'vero'             => [2, 3]
}

The model output is a map of cluster labels to documents within those clusters. Install the gem and try it out.
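As a small sketch of how such a map might be consumed, the snippet below turns the indices back into document text. It does not call the gem: the model hash is copied from the example output above, and reading the integers as zero-based indices into documents is an assumption based on that example.

```ruby
documents = [
  'lorem ipsum dolor',
  'sed perspiciatis unde',
  'vero eos accusamus',
  'vero eos accusamus iusto odio'
]

# Copied from the example output above rather than computed here.
model = {
  'ipsum'            => [0],
  'sed perspiciatis' => [1],
  'vero'             => [2, 3]
}

# Replace each document index with the document text it refers to.
clusters = model.transform_values { |indices| indices.map { |i| documents[i] } }

clusters.each do |label, docs|
  puts "#{label}: #{docs.join(' | ')}"
end
```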

Presenting at ACM DEV 2013

Sunday, January 13, 2013

Prabhas Pokharel presented our paper, "Improving Data Collection and Monitoring through Real-time Data Analysis," on Friday at ACM DEV 2013 in Bangalore. The poster is below:

The paper was coauthored with Prabhas Pokharel, Mark Johnston, and Vijay Modi. The abstract is below:

Feedback based on real-time data is increasingly important for ICT-based interventions in the developing world. Applications such as facility inventories, summarization of patient data from community health workers, etc. need processes for analyzing and aggregating datasets that update over time. In order to facilitate such processes, we have created a modular web service for real-time data analysis: bamboo.

If you are interested in using bamboo, please see the bamboo service website, the Python library pybamboo, and the JavaScript library bamboo.js.

Bamboo - Systematizing Realtime Data Analysis

Sunday, November 11, 2012

We now have a reasonable alpha version of bamboo online. From the docs:

Bamboo provides an interface for merging, aggregating and adding algebraic calculations to dynamic datasets. Clients can interact with Bamboo through a REST web interface and through Python.

bamboo includes JavaScript and Python libraries, and many operations to choose from:

Presenting formhub at MDM 2012

Sunday, July 22, 2012

On July 23rd and 24th, Alex Dorey and I will be presenting formhub at the DataDev workshop at the IEEE Mobile Data Management (MDM 2012) conference.

Here is a blog post discussing our presentation at DataDev.

The formhub poster is below: