There are two ways to run a Hadoop job from the command line on an Amazon Elastic MapReduce (EMR) cluster that is active in “waiting” mode. First, the hard way:
Running Hadoop Directly by Logging into the Cluster’s Head Node
The following commands show how to log into the cluster’s head node and run Hadoop from the shell, producing the same results you would get by submitting the job through the web interface.
First, log into the head node of your EMR cluster (substitute your own key filename and your cluster’s master public DNS name; EMR clusters use the hadoop user):
ssh -i my-key-pair.pem hadoop@your-master-public-dns-name
Set necessary environment variables. Note that “/home/hadoop/MyJar.jar” is included, as we’ll need this when we run Hadoop:
export HADOOP_CLASSPATH=./:/home/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar:/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:/home/hadoop/MyJar.jar:/home/hadoop/src
export CLASSPATH=./:/home/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar:/home/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:/home/hadoop/MyJar.jar:/home/hadoop/src
Compile your Java classes and make a JAR file with them:
cd src
javac *.java
jar cf MyJar.jar *.class
mv MyJar.jar ../
cd ..
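For reference, here is a minimal sketch of what MyJar.java might contain. It is the standard word-count example written against the Hadoop 2.x MapReduce API, and it assumes only what the commands in this section imply: the main class is named MyJar, the first argument is the input path, and the second is the output path. Your real mapper and reducer logic will of course differ.

// Hypothetical MyJar.java: word count, matching how the commands in this
// section invoke it (class MyJar, args: input path, output path).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJar {

  // Emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word; results end up in output/part-r-00000.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(MyJar.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. hdfs:///input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}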
Copy your input file to the Hadoop filesystem:
hadoop fs -put input hdfs:///
Make sure your output destination doesn’t exist:
hadoop fs -rm -R output
Run the job, passing the main class name followed by the input and output paths:
hadoop jar MyJar.jar MyJar hdfs:///input output
View the results:
hadoop fs -ls output
hadoop fs -cat output/part-r-00000
Running a Job on a “Live” EMR Cluster from Any Terminal
Now for the easy way to run a job on an existing EMR cluster, assuming you’ve installed and configured the Amazon EMR Command Line Interface (the elastic-mapreduce tool). In this case the JAR containing your class is stored on S3, along with the input, and the output will be written to S3 as well. You do not need to be logged into the cluster’s head node for this to work. The first --arg supplies the main class name (just as in the hadoop jar command above), and the remaining --arg values are passed to it as the input and output paths:
elastic-mapreduce -j your-cluster-id --jar s3n://your-S3-bucket/code/MyJar.jar --arg MyJar --arg s3n://your-S3-bucket/input/ --arg s3n://your-S3-bucket/output