Hands on Kite: Writing a Crunch Job
In this example, you'll build and run a Crunch job that counts the number of movies released in each year. This will demonstrate:
- Using the Kite application parent POM to manage dependencies
- Using the Kite Maven plugin to submit a MapReduce job
- Opening a dataset with Kite's API
Download the Maven project tarball and extract it:
wget http://...
tar xzf movies-crunch.tar.gz
Next, look at the contents:
[cloudera@localhost ~]$ tree movies-crunch
movies-crunch
├── pom.xml
└── src
    └── main
        ├── java
        │   └── com
        │       └── cloudera
        │           └── Movies.java
        └── resources
            └── mapred-site.xml
This is a minimal Crunch project:
- Movies.java is the driver program that defines functions to extract the year from each movie title, group the years, and count them.
- pom.xml configures Maven to build and run the driver program.
- mapred-site.xml sets cluster configuration, like the default FS and the metastore URI.
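The extract, group, and count steps that Movies.java performs can be illustrated with a small plain-Java sketch. This is not the Crunch code from the project: it uses only the standard library, the class and method names (YearCountSketch, extractYear, countByYear) are hypothetical, and it assumes each title ends with a four-digit year in parentheses, such as "Toy Story (1995)".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class YearCountSketch {
    // Assumption: the release year is a four-digit number in parentheses
    // at the end of the title, e.g. "Toy Story (1995)".
    private static final Pattern YEAR = Pattern.compile("\\((\\d{4})\\)\\s*$");

    /** Step 1: extract the year from one title, or empty if none is found. */
    public static Optional<Integer> extractYear(String title) {
        Matcher m = YEAR.matcher(title);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1))) : Optional.empty();
    }

    /** Steps 2 and 3: group the extracted years and count each group, sorted by year. */
    public static Map<Integer, Long> countByYear(List<String> titles) {
        return titles.stream()
                .map(YearCountSketch::extractYear)
                .filter(Optional::isPresent)   // drop titles without a year
                .map(Optional::get)
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        List<String> titles = Arrays.asList(
                "Toy Story (1995)", "GoldenEye (1995)", "Fargo (1996)", "Untitled");
        // Print each pair in the same [year,count] shape as the job output.
        countByYear(titles).forEach(
                (year, count) -> System.out.println("[" + year + "," + count + "]"));
    }
}
```

In the actual job the same three steps run as distributed operations over the movies dataset, so the driver expresses them as Crunch functions rather than as an in-memory stream.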
The pom.xml file uses the Kite app POM as its parent (not a dependency):
<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh4</artifactId>
  <version>0.15.0</version>
</parent>
This parent POM configures the project's dependencies for CDH4, including test dependencies and test-jar artifacts. Notice that the only dependency listed directly in the POM is hive-exec, which is required for reading from the movies dataset in Hive. It is listed to change its scope from provided to compile, so that Kite will add it with -libjars when running the job.
The POM also adds the Kite Maven plugin, which performs Kite-specific tasks in Maven. In this case, you will use it to run the driver program, com.cloudera.Movies, after installing the jar:
cd movies-crunch
mvn clean install
mvn kite:run-tool
The toolClass is already configured in the POM, but it could instead be passed on the command line with -Dkite.toolClass=com.cloudera.Movies. You can find more information on the Kite Maven plugin in the [kitesdk.org usage docs|http://kitesdk.org/docs/current/kite-maven-plugin/usage.html] and on the other tasks it can perform in the [goal docs|http://kitesdk.org/docs/current/kite-maven-plugin/plugin-info.html].
If you haven't already, build and run the Crunch tool:
mvn clean install kite:run-tool
The job will create a "year_counts" directory in HDFS that you can view:
[cloudera@localhost movies-crunch]$ hdfs dfs -cat year_counts/*
[1926,1]
[1931,3]
[1932,6]
[1933,11]
[1934,20]
[1935,33]
[1936,48]
[1937,67]
[1938,89]
[1939,118]