This week focuses on Hadoop and AWS Elastic MapReduce, plus how to install the same version of Hadoop on OS X with Homebrew.
- Map-Reduce With Ruby Using Apache Hadoop - A good place to start kicking the tires, but begin about halfway down the page with the Ruby-related content.
- Log analytics with Hadoop and Hive
- Hadoop Streaming for Rapid Prototyping of Distributed Algorithms - An oldie, but a goodie.
- Best Practices with STDIN in Ruby? - Use ARGF instead of STDIN because ARGF will handle both STDIN and named files.
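To make the ARGF tip concrete, here is a minimal sketch of a word-count mapper for Hadoop Streaming. The method name `map_words` and the sample input are made up for illustration. The method accepts anything that responds to `each_line`, so it can be exercised with a `StringIO` here and handed `ARGF` in a real job.

```ruby
require "stringio"

# Hypothetical word-count mapper for Hadoop Streaming.
# `input` can be ARGF, $stdin, a File, or a StringIO: anything with each_line.
def map_words(input, output = $stdout)
  input.each_line do |line|
    line.split.each do |word|
      # Streaming expects one "key<TAB>value" pair per output line.
      output.puts "#{word.downcase}\t1"
    end
  end
end

# Demo on an in-memory string. In an actual mapper script this call
# would be `map_words(ARGF)`.
map_words(StringIO.new("Hello hadoop\nhello ruby\n"))
```

With `map_words(ARGF)` as the last line, the same script should work both as `cat input.txt | ruby mapper.rb` and as `ruby mapper.rb input.txt`, which is handy for testing locally before submitting a job.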
Things I should remember by now
Sorting by a column in Unix AND specifying a tab as the column separator:
cat /tmp/file | sort -t"`printf '\t'`" -k2n
Using awk to sum a column of numbers, again with the tab separator:
awk -F"`printf '\t'`" '{ sum += $2 } END { print sum }'
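When the same operations are needed inside a Ruby streaming script rather than at the shell, they can be sketched roughly as below. The method names and sample data are hypothetical, and this is an approximation of the one-liners, not an exact port: values are parsed as floats, and the order of lines with equal keys may differ from `sort`.

```ruby
require "stringio"

# Sum the second tab-separated column, similar to the awk one-liner.
# Missing or non-numeric values count as 0.0.
def sum_second_column(input)
  input.each_line.sum { |line| line.chomp.split("\t")[1].to_f }
end

# Sort lines numerically by the second column, similar to `sort -k2n`.
def sort_by_second_column(input)
  input.each_line.sort_by { |line| line.chomp.split("\t")[1].to_f }
end

# Demo on made-up data; pass ARGF instead of a StringIO to read
# STDIN or named files.
data = "a\t3\nb\t1\nc\t2\n"
puts sort_by_second_column(StringIO.new(data))
puts sum_second_column(StringIO.new(data))
```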
Installing Hadoop on OS X with Homebrew
If you are using OS X and Homebrew, Hadoop can be installed with a simple:
brew install hadoop
However, if you want to use a version compatible with AWS, specifically 0.20.205.0, you need to hack the brew formula.
Before:
url 'http://www.apache.org/dyn/closer.cgi?path=hadoop/core/hadoop-1.0.1/hadoop-1.0.1.tar.gz'
md5 'e627d9b688c4de03cba8313bd0bba148'
After:
url 'http://www.apache.org/dyn/closer.cgi?path=hadoop/core/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz'
md5 '8016D8A2A50CB2BEB17F2F45A1EA28DA'
Last I checked, this was the only way to do it without forking the project. Before running brew update, remember to cd /usr/local/ && git stash. Afterwards, run git stash pop to re-apply your changes.
Note: If your map or reduce methods catch exceptions, make sure they don't hide problems. You may end up with a successful run but empty output.
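One way to keep rescued exceptions visible is to log each failure to STDERR and bump a job counter, so a run full of bad records does not look like a clean one. The sketch below assumes input lines of the form key, tab, integer. The method name `map_records` and the counter group and name (`Mapper`, `BadRecords`) are made up; the `reporter:counter:` line is the format Hadoop Streaming reads from STDERR to update counters.

```ruby
require "stringio"

# Hypothetical mapper that parses "key<TAB>integer" lines.
# Returns the number of records it had to skip.
def map_records(input, output = $stdout, errors = $stderr)
  failures = 0
  input.each_line do |line|
    begin
      key, value = line.chomp.split("\t")
      # Integer() raises on nil or non-numeric input instead of returning 0.
      output.puts "#{key}\t#{Integer(value)}"
    rescue ArgumentError, TypeError => e
      failures += 1
      # Say what went wrong instead of swallowing it silently.
      errors.puts "skipping #{line.inspect}: #{e.message}"
      # Counter update for Hadoop Streaming (group and name are examples).
      errors.puts "reporter:counter:Mapper,BadRecords,1"
    end
  end
  failures
end

# Demo on made-up data: the second line is malformed.
map_records(StringIO.new("a\t1\nb\toops\n"))
```

Rescuing only the exceptions you expect, rather than everything, also lets genuinely unexpected errors fail the task instead of producing an empty result.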