thoughts on time series and big data

Using MRJob with Hortonworks Sandbox

Choosing a Hadoop test environment

In the past, a typical way to set up a test environment for Hadoop applications was to install Hadoop directly on your computer, assuming you were using a Linux or Mac machine; it was more difficult on Windows. Nowadays most developers prefer to pick one of the available preconfigured VMs. The two most common choices are the Cloudera QuickStart VM and the Hortonworks Sandbox.

I won't attempt a comprehensive comparison, but I tried both and found one of them clearly more convenient to use. If you plan to work predominantly from the command line, and the VM is a step towards working with a regular or AWS-based Hadoop cluster, my recommendation is the Hortonworks Sandbox. It is designed to be accessed from a local command line (e.g. Mac's Terminal) rather than from a terminal window in the VM itself. This, in my opinion, makes it more intuitive to use and makes the transition to a full cluster installation straightforward: not much more than changing the IP address. I also had the impression that the Sandbox is lighter than the QuickStart VM. This is not based on formal tests, but my laptop (MacBook Air, i7, 8GB RAM) became a bit sluggish with the QuickStart VM. I didn't experience this with the Sandbox.

Why Python and MRJob?

Python, next to R, is currently the main language of data science. Few data scientists choose Java outside a conventional corporate setting. Hadoop, one of the main big data and data science tools, is however based on Java. This is a major dissonance. There are three ways to use Hadoop without writing code in Java: move to a higher-level framework such as Pig or Hive, accept the inconveniences of Hadoop Streaming, or find a Python library that will make your life easier.

There are several such libraries available, and not only for Python. The majority of them are wrappers around Hadoop Streaming. Of all of them, MRJob seems to be the most actively maintained. This matters, because Hadoop undergoes frequent modifications. Other libraries include Dumbo and Pydoop; they can offer extra functionality that might sometimes be required.
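To give an idea of what such a wrapper hides, here is a minimal sketch of a word count written directly against the Streaming conventions: the mapper and reducer read lines from standard input and write tab-separated key/value lines to standard output, and the reducer has to detect key boundaries in the sorted input itself. The function names and the "map"/"reduce" command-line switch are my own choices for illustration, not anything Hadoop prescribes.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reduce_lines(sorted_lines):
    """Reducer: input lines arrive sorted by key; sum the counts per key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


if __name__ == "__main__":
    # Hypothetical switch: pass "map" or "reduce" as the only argument.
    if sys.argv[1:] == ["map"]:
        for out in map_lines(sys.stdin):
            print(out)
    elif sys.argv[1:] == ["reduce"]:
        for out in reduce_lines(sys.stdin):
            print(out)
```

With plain Streaming, the parsing, the grouping and the job submission are all left to you. This is the boilerplate that libraries like MRJob take over.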

Setting up MRJob on the Sandbox

Let's get to the installation of MRJob on the Hortonworks Sandbox. In principle it shouldn't take more than running one pip command. In practice, it does. None of the problems is particularly difficult, and I managed to solve each of them with some googling, but I couldn't find a single place that addresses them all explicitly. That's why I decided to write this post.

After you ssh to the Sandbox you want to run the pip command to install MRJob.

pip install mrjob

Depending on the version of the Sandbox, pip might not be installed. If so, you need to install it yourself, for example using yum.

yum -y install python-pip

In some cases the installation of pip through yum, or of mrjob through pip, might fail. This typically results from problems with DNS resolution. It doesn't happen every single time, but I have already encountered it on two different occasions. The problem is that the Sandbox is missing the addresses of DNS servers. You can fix it by editing the resolv.conf file and adding DNS servers, for instance those registered on your local machine.

vi /etc/resolv.conf

Now add one line per DNS server in the format nameserver ???.???.???.???, replacing the question marks with the server's IP address. The default 8.8.8.8 server doesn't always work correctly.
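Before retrying pip or yum, it can help to confirm which name servers the Sandbox actually sees and whether a host name resolves. The sketch below is only an illustration and not part of MRJob or the Sandbox. The function names read_nameservers and can_resolve are hypothetical, and the hosts to test are passed on the command line.

```python
import socket
import sys


def read_nameservers(path="/etc/resolv.conf"):
    """Return the IP addresses listed on 'nameserver' lines of the file."""
    servers = []
    try:
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) >= 2 and fields[0] == "nameserver":
                    servers.append(fields[1])
    except IOError:
        # A missing or unreadable file is treated as "no servers configured".
        pass
    return servers


def can_resolve(hostname):
    """Return True if the host name resolves to an address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.error:
        return False


if __name__ == "__main__":
    print("nameservers: %s" % ", ".join(read_nameservers()))
    for host in sys.argv[1:]:
        print("%s resolves: %s" % (host, can_resolve(host)))
```

Saved under a name of your choice, say dnscheck.py, it can be run as python dnscheck.py followed by the host names you want to check.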

The second problem to solve is the proper configuration of HADOOP_HOME, which is required by MRJob but is not automatically exported in the Sandbox, because it is formally deprecated. You can export it yourself.

export HADOOP_HOME=/usr/hdp/current/hadoop-client

You might also want to add it to .bash_profile located in your home directory, so that HADOOP_HOME is exported automatically every time you start the Sandbox. Alternatively, MRJob configuration files can be used to specify this value, but exporting a global path seems to be a more natural solution.

This doesn't solve all the problems. MRJob expects the Streaming jar to be located under HADOOP_HOME, which is not the case in the Sandbox. The simplest and fastest way to deal with it is to copy the jar there. The version directory in the path below comes from my Sandbox release, so adjust it if yours differs.

cp /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-streaming.jar $HADOOP_HOME
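If the version directory on your Sandbox is different, or you want to confirm that the copy worked, a short search for the jar can save some guessing. This is a minimal sketch under my own assumptions: the helper name find_streaming_jars is hypothetical, and it simply looks for files whose name starts with hadoop-streaming and ends with .jar.

```python
import os


def find_streaming_jars(root):
    """Return sorted paths of files under root that look like the Streaming jar."""
    matches = []
    # os.walk does not follow symlinked subdirectories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith("hadoop-streaming") and name.endswith(".jar"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    hadoop_home = os.environ.get("HADOOP_HOME")
    if not hadoop_home:
        print("HADOOP_HOME is not set")
    else:
        jars = find_streaming_jars(hadoop_home)
        if jars:
            print("\n".join(jars))
        else:
            print("no streaming jar found under %s" % hadoop_home)
```

Calling find_streaming_jars("/usr/hdp") instead lists every copy shipped with the Sandbox, which tells you what to put in the cp command.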

Now you should be able to run MRJob on the Hortonworks Sandbox.

In the case of the Cloudera QuickStart VM there is another, more elegant and documented, solution to this problem, which you can read about here.