Experience installing an HBase 0.20.0 Cluster on Ubuntu 9.04 and EC2

NOTE (Sep 7 2009): Updated info on need to use Amazon Private DNS Names and clarified the need for the masters, slaves and regionservers files. Also updated to use HBase 0.20.0 Release Candidate 3

Introduction

As someone who has "skipped" Java and wants to learn as little as possible about it, and who has not had much experience with Hadoop so far, I found that HBase deployment has a steep learning curve. So some of the things I describe below may be obvious to those who have had experience in those domains.

Where are the docs for HBase 0.20?

If you go to the HBase wiki, you will find that there is not much documentation on the 0.20 version. This puzzled me, since all the twittering, blog posting and other buzz was talking about people using 0.20 even though it's "pre-release".

One of the great things about going to meetups such as the HBase Meetup is that you can talk to the folks who actually wrote the thing and ask them "Where is the documentation for HBase 0.20?"

Turns out it's in the HBase 0.20.0 distribution, in the docs directory. The easiest thing is to get the pre-built 0.20.0 Release Candidate 3. If you download the source from the version control repository, you have to build the documentation using Ant. If you are a Java/Ant kind of person it might not be hard, but just to build the docs you have to meet a number of dependencies first.

What we learnt with 0.19.x

We have been learning a lot about making an HBase cluster work at a basic level. I had a lot of problems getting 0.19.x running beyond a single node in Pseudo Distributed mode. I think a lot of my problems were just not getting how it all fit together with Hadoop and what the different startup/shutdown scripts did.

Then we finally tried the HBase EC2 Scripts, even though they use an AMI based on Fedora 8 and seem wired to 0.19.0. It's a pretty nice script if you want an opinionated HBase cluster set up, and it did educate us on how to get a cluster going. It has a bit of strangeness in that a script, /root/hbase_init, is called at boot time to configure all the Hadoop and HBase conf files and then call the Hadoop and HBase startup scripts. Something like this is pretty much needed on Amazon EC2 since you don't really know what the IP Address/FQDN is until boot time.
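
To give a flavor of what such a boot-time step does, here is a minimal sketch. This is not the actual /root/hbase_init script; the @MASTER@ placeholder, the use of config templates, and passing the master hostname in via EC2 user data are all assumptions for illustration.

# Minimal sketch of a boot-time configuration step (hypothetical placeholder and paths).
# The master's private DNS name isn't known until launch, so assume it was passed in
# as EC2 user data; fetch it from the instance metadata service and substitute it into
# the config files, then call the appropriate Hadoop/HBase startup scripts.
MASTER=`curl -s http://169.254.169.254/latest/user-data`
sed -i "s/@MASTER@/$MASTER/g" /mnt/hadoop/conf/core-site.xml /mnt/hbase/conf/hbase-site.xml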

The scripts also set up an Amazon Security Group for the cluster master and one for the rest of the cluster. I believe it then uses these as a way to identify the group as well.

The main thing we got out of it was that, by going through the /root/hbase_init script, we were able to figure out the process for bringing up Hadoop/HBase as a cluster.

We did build a staging cluster with this script, and we were able to pretty easily change the scripts to use 0.19.3 instead of 0.19.0. But its opinions were different from ours on many things. Plus, after talking to the folks at the HBase Meetup, and having all sorts of weird problems with our app on 0.19.3, we were convinced that our future is in HBase 0.20. And since 0.20 introduces new things like using Zookeeper to manage Master selection, it seems it's not worth it for us to continue using this script. Though it helped our learning quite a bit!

Building an HBase 0.20.0 Cluster

This post will use the HBase pre-built Release Candidate 3 and the prebuilt standard Hadoop 0.20.0.

This post will show how to do all this “by hand”. Hopefully we’ll have an article on how to do all this with Chef sometime soon.

The HBase folks say that you really should have at least 5 regionservers and one master. The master and several of the regionservers can also run the Zookeeper quorum. Of course the master server is also going to run the Hadoop Name Server and Secondary Name Server. Then the 5 other nodes run the Hadoop HDFS Data nodes as well as the HBase region servers. When you build out larger clusters, you will probably want to dedicate machines to Zookeepers and hot-standby HBase Masters. The Name Server is still the Single Point of Failure (SPOF); rumour has it that this will be fixed in Hadoop 0.21.
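
As a concrete illustration of how a small 6-node cluster along those lines might be laid out (just an example layout, not a requirement):

master: Hadoop Name Server + Secondary Name Server, HBase Master, Zookeeper
node1:  HDFS Data node, HBase Region Server, Zookeeper
node2:  HDFS Data node, HBase Region Server, Zookeeper
node3:  HDFS Data node, HBase Region Server
node4:  HDFS Data node, HBase Region Server
node5:  HDFS Data node, HBase Region Server

That gives a 3-member Zookeeper quorum (an odd number is recommended) without dedicating any machines to it.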

We're not using Map/Reduce yet so we won't go into that, but it's just a matter of different startup scripts to make the same nodes do Map/Reduce as well as HDFS and HBase.
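
For reference (we have not run this ourselves), the Map/Reduce daemons have their own start/stop scripts in the Hadoop distribution, so with the /mnt/hadoop install path used below and mapred-site.xml configured, turning them on should just be a matter of running the following on the master:

/mnt/hadoop/bin/start-mapred.sh

and /mnt/hadoop/bin/stop-mapred.sh to stop them.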

In this example, we're installing and running everything as root. It can also be done as a special user like hadoop, as described in the earlier blog post Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard.

Getting the pre-requisites in order

We started with the vanilla Alestic Ubuntu 9.04 Jaunty 64-bit Server AMI: ami-5b46a732 and instantiated 6 High CPU Large Instances. You really want as much memory and as many cores as you can get. You can do the following by hand or combine it with the shell scripting described below in the section Installing Hadoop and HBase.

apt-get update
apt-get upgrade

Then added via apt-get install:

apt-get install sun-java6-jdk
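
One thing to watch out for: installing sun-java6-jdk normally pops up the Sun license acceptance dialog, which will stall an unattended install such as the ssh loop further down. One way around that, as a sketch (assuming the standard debconf question name for the Sun Java 6 packages on Ubuntu of this era), is to pre-accept the license first:

# Pre-accept the Sun Java license so the install can run unattended
echo "sun-java6-jdk shared/accepted-sun-dlj-v1-1 select true" | debconf-set-selections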

Downloading Hadoop and HBase

You can use the production Hadoop 0.20.0 release. You can find it at the mirrors listed at http://www.apache.org/dyn/closer.cgi/hadoop/core/. The example below uses one mirror:

wget http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

You can download the HBase 0.20.0 Release Candidate 3 in a prebuilt form from http://people.apache.org/~stack/hbase-0.20.0-candidate-3/ (You can get the source out of Version Control: http://hadoop.apache.org/hbase/version_control.html but you'll have to figure out how to build it.)

wget http://people.apache.org/~stack/hbase-0.20.0-candidate-3/hbase-0.20.0.tar.gz

Installing Hadoop and HBase

We assume that you are running in your home directory on the master server, that the target for the versioned packages is /mnt/pkgs, and that /mnt will hold a symlink to the hadoop and hbase home directories.

You can do some simple scripting to do the following on all the nodes at once:

Create a file named servers with the fully qualified domain names of all your servers, one per line, using "localhost" for the master.
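
For example, using the private DNS name of one of the servers from this post (yours will differ, with one line per server):

localhost
domU-12-31-39-06-9D-C1.compute-1.internal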

Make sure you can ssh to all the servers from the master. Ideally you are using ssh keys. On master:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

On each of your region servers, make sure that the master's id_dsa.pub is also in their authorized_keys (don't delete any other keys you already have in authorized_keys!).
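
With the servers file from above in place, one quick way to push the key out is a loop like the following (it will prompt for each host's password once; after that, the ssh and scp steps below are passwordless):

# Append the master's public key to each server's authorized_keys
# (skipping localhost, which was already handled above)
for host in `grep -v localhost servers`
do
  cat ~/.ssh/id_dsa.pub | ssh $host 'mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys'
done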

Now with a bit of shell command line scripting you can install on all your servers at once:

for host in `cat servers`
do
  echo $host
  ssh $host 'apt-get update; apt-get upgrade; apt-get install sun-java6-jdk'
  scp ~/hadoop-0.20.0.tar.gz ~/hbase-0.20.0.tar.gz $host:
  ssh $host 'mkdir -p /mnt/pkgs; cd /mnt/pkgs; tar xzf ~/hadoop-0.20.0.tar.gz; tar xzf ~/hbase-0.20.0.tar.gz; ln -s /mnt/pkgs/hadoop-0.20.0 /mnt/hadoop; ln -s /mnt/pkgs/hbase-0.20.0 /mnt/hbase'
done

Use Amazon Private DNS Names in Config files

So far I have found that it's best to use the Amazon Private DNS names in the Hadoop and HBase config files. It looks like HBase uses the system hostname to determine various things at runtime, and on EC2 this is always the Private DNS name. It also means that it's difficult to use the Web GUI interfaces to HBase from outside of the Amazon Cloud. I set up a "desktop" version of Ubuntu running in the Amazon Cloud that I VNC (or NX) into and use its browser to view the Web Interface.
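
If you want to double-check an instance's Private DNS name, the EC2 instance metadata service will tell you (it should match what hostname -f reports on the instance):

curl http://169.254.169.254/latest/meta-data/local-hostname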

In any case, Amazon instances normally have limited TCP/UDP access to the outside world due to the default security group settings. You would have to add the various ports used by HBase and Hadoop to the security group to allow outside access.
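
For example, using the EC2 command line API tools, opening the HBase Master web UI (port 60010 by default) and the Hadoop Name Node web UI (port 50070) to a single outside address might look something like this (the group name and source address are placeholders):

ec2-authorize your-cluster-group -P tcp -p 60010 -s 203.0.113.5/32
ec2-authorize your-cluster-group -P tcp -p 50070 -s 203.0.113.5/32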

If you do use the Amazon Public DNS names in the config files, there will be startup errors like the following for each instance that is assigned to the zookeeper quorum (there may be other errors as well, but these are the most obvious):

ec2-75-101-104-121.compute-1.amazonaws.com: java.io.IOException: Could not find my address: domU-12-31-39-06-9D-51.compute-1.internal in list of ZooKeeper quorum servers
ec2-75-101-104-121.compute-1.amazonaws.com:     at org.apache.hadoop.hbase.zookeeper.HQuorumPeer.writeMyID(HQuorumPeer.java:128)
ec2-75-101-104-121.compute-1.amazonaws.com:     at org.apache.hadoop.hbase.zookeeper.HQuorumPeer.main(HQuorumPeer.java:67)

Configuring Hadoop

Now you have to configure Hadoop on the master in /mnt/hadoop/conf:

hadoop-env.sh:

The minimal things to change are:

Set your JAVA_HOME to where the java package is installed. On Ubuntu:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

Add the hbase path to the HADOOP_CLASSPATH:

export HADOOP_CLASSPATH=/mnt/hbase/hbase-0.20.0.jar:/mnt/hbase/hbase-0.20.0-test.jar:/mnt/hbase/conf

core-site.xml:

Here is what we used. It primarily sets where the Hadoop files live and the name server host and port:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/mnt/hadoop</value>
   </property>

   <property>
     <name>fs.default.name</name>
     <value>hdfs://domU-12-31-39-06-9D-51.compute-1.internal:50001</value>
   </property>

   <property>
     <name>tasktracker.http.threads</name>
     <value>80</value>
   </property>
</configuration>

mapred-site.xml:

Even though we are not currently using Map/Reduce, this is a basic config:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
   <property>
     <name>mapred.job.tracker</name>
     <value>domU-12-31-39-06-9D-51.compute-1.internal:50002</value>
   </property>

   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>4</value>
   </property>

   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>4</value>
   </property>

   <property>
     <name>mapred.output.compress</name>
     <value>true</value>
   </property>

   <property>
     <name>mapred.output.compression.type</name>
     <value>BLOCK</value>
   </property>
</configuration>

hdfs-site.xml:

The main thing to change based on your setup is dfs.replication. It should be no more than the total number of data-nodes / region-servers.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
   <property>
     <name>dfs.client.block.write.retries</name>
     <value>3</value>
   </property>

   <property>
     <name>dfs.replication</name>
     <value>3</value>
   </property>
</configuration>

Put the fully qualified domain name of your master in the file masters and the names of the data-nodes in the file slaves.

masters:

domU-12-31-39-06-9D-51.compute-1.internal

slaves:

domU-12-31-39-06-9D-C1.compute-1.internal
domU-12-31-39-06-9D-51.compute-1.internal

We did not change any of the other files so far.

Now copy these files to the data-nodes:

for host in `cat slaves`
do
  echo $host
  scp slaves masters hdfs-site.xml hadoop-env.sh core-site.xml ${host}:/mnt/hadoop/conf
done

Also format HDFS on the master:

/mnt/hadoop/bin/hadoop namenode -format

Configuring HBase

hbase-env.sh:

Similar to the hadoop-env.sh, you must set the JAVA_HOME:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

and add the hadoop conf directory to the HBASE_CLASSPATH:

export HBASE_CLASSPATH=/mnt/hadoop/conf

And for the master you will want to say:

export HBASE_MANAGES_ZK=true

hbase-site.xml:

You mainly need to define the HBase master, the HBase rootdir and the list of Zookeepers. We also had to bump hbase.zookeeper.property.maxClientCnxns up from the default of 30 to 300.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
     <name>hbase.master</name>
     <value>domU-12-31-39-06-9D-51.compute-1.internal:60000</value>
   </property>

   <property>
     <name>hbase.rootdir</name>
     <value>hdfs://domU-12-31-39-06-9D-51.compute-1.internal:50001/hbase</value>
   </property>
   <property>
     <name>hbase.zookeeper.quorum</name>
     <value>domU-12-31-39-06-9D-51.compute-1.internal,domU-12-31-39-06-9D-C1.compute-1.internal,domU-12-31-39-06-9D-51.compute-1.internal</value>
   </property>
   <property>
     <name>hbase.cluster.distributed</name>
     <value>true</value>
   </property>
   <property>
     <name>hbase.zookeeper.property.maxClientCnxns</name>
     <value>300</value>
   </property>
</configuration>

You will also need a file called regionservers. Normally it contains the same hostnames as the Hadoop slaves file:

regionservers:

domU-12-31-39-06-9D-C1.compute-1.internal
domU-12-31-39-06-9D-51.compute-1.internal

Copy the files to the region-servers:

for host in `cat regionservers`
do
  echo $host
  scp hbase-env.sh hbase-site.xml regionservers ${host}:/mnt/hbase/conf
done

Starting Hadoop and HBase

On the master:

(This just starts the Hadoop File System services, not Map/Reduce services)

/mnt/hadoop/bin/start-dfs.sh

Then start hbase:

/mnt/hbase/bin/start-hbase.sh
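
Once both have started, it is worth a quick sanity check that the cluster actually came up. For example, jps (which ships with the Sun JDK) should show a NameNode and HMaster process on the master, and you can ask HBase itself via the shell:

jps
/mnt/hbase/bin/hbase shell

At the hbase shell prompt, typing status should report the region servers it sees.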

You can shut things down by doing the reverse:

/mnt/hbase/bin/stop-hbase.sh
/mnt/hadoop/bin/stop-dfs.sh

It is advisable to set up init scripts. This is described in the Ubuntu /etc/init.d style startup scripts section of the earlier blog post: Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard.

10 Comments

  1. tom, September 6, 2009

    lovely post! Very helpful for a multinode setup.

  2. Robert J Berger, September 7, 2009

    Updated info on need to use Amazon Private DNS Names and clarified the need for the masters, slaves and regionservers files.

  3. Robert J Berger, September 7, 2009

    Updated to use HBase 0.20.0 Release Candidate 3. All that was needed was to update the code base to the release candidate. The config files are the same as described earlier.

  4. Kevin Peterson, September 8, 2009

    If you are planning on accessing HBase from a Hadoop job, you’ll also need to add the zookeeper jar to hadoop’s classpath and restart the cluster. Either that or distribute zookeeper jar with the map reduce job, but that seems unlikely.

  5. jabir, February 28, 2012

    for host in `cat slaves`
    do
    echo $host
    scp slaves masters hdfs-site.xml hadoop-env.sh core-site.xml ${host}:/mnt/hadoop/conf
    ssh $host '/mnt/hadoop/bin/hadoop namenode -format'
    done

    Not sure why you format the slaves.

    running the name node -format once from any node as super user usually will format the namenode

  6. jabir, February 28, 2012

    But nicely written post 🙂 thnkx

  7. Robert J Berger (Post author), February 28, 2012

    Jabir: Thanks, you are correct. No need to do a format on the slaves. I will update the post.

    By the way, nowadays the easiest way to install on Ubuntu is with the Cloudera packages, assuming you just want a basic install and don't plan to do any mods to the code.

    Another cool way (and what we use now) is with Infochimps Ironfan and Opscode Chef: https://github.com/infochimps-labs/ironfan
