Hands on Kite: Writing a Crunch Job

Ryan Blue edited this page Jul 30, 2014 · 4 revisions

In this example, you'll build and run a Crunch job that counts the number of movies released in each year. This will demonstrate:

  • Using the Kite application parent POM to manage dependencies
  • Using the Kite Maven plugin to submit a MapReduce (MR) job
  • Opening a dataset with Kite's API

Download the project

Download the Maven project tarball and extract it:

wget http://...
tar xzf movies-crunch.tar.gz

Next, look at the contents:

[cloudera@localhost ~]$ tree movies-crunch
movies-crunch
├── pom.xml
└── src
    └── main
        ├── java
        │   └── com
        │       └── cloudera
        │           └── Movies.java
        └── resources
            └── mapred-site.xml

This is a minimal Crunch project:

  • Movies.java is the driver program. It defines functions to extract the year from each movie title, group the years, and count them.
  • pom.xml configures Maven to build and run the driver program.
  • mapred-site.xml sets cluster configuration, such as the default file system and the metastore URI.
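
The extract, group, and count steps can be sketched in plain Java to show the shape of the logic. This is not the contents of Movies.java and it does not use the Crunch or Kite APIs: the class name YearCountSketch, its method names, the sample titles, and the assumption that each title ends with a four-digit year in parentheses are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the logic described for Movies.java. It uses
// plain Java collections instead of Crunch so it runs without a cluster.
class YearCountSketch {
    // Assumption: titles end with a four-digit year in parentheses,
    // for example "Some Title (1995)".
    private static final Pattern YEAR = Pattern.compile("\\((\\d{4})\\)\\s*$");

    // Returns the release year, or -1 when the title has no trailing "(YYYY)".
    static int extractYear(String title) {
        Matcher m = YEAR.matcher(title);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Groups titles by year and counts each group; the TreeMap keeps years sorted.
    static Map<Integer, Long> countByYear(List<String> titles) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (String title : titles) {
            int year = extractYear(title);
            if (year > 0) {
                counts.merge(year, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input, not taken from the movies dataset.
        List<String> titles = Arrays.asList(
            "Toy Story (1995)", "GoldenEye (1995)", "Fargo (1996)");
        // Print each entry as [year,count].
        countByYear(titles).forEach(
            (year, n) -> System.out.println("[" + year + "," + n + "]"));
    }
}
```

In a Crunch job the same three steps would typically run as distributed operations over the dataset rather than as a loop over an in-memory list.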

The Maven POM

The pom.xml file uses the Kite app POM as its parent (not a dependency):

  <parent>
    <groupId>org.kitesdk</groupId>
    <artifactId>kite-app-parent-cdh4</artifactId>
    <version>0.15.0</version>
  </parent>

This parent POM configures the project's dependencies for CDH4, including test dependencies and test-jar artifacts. The only dependency listed directly in the POM is hive-exec, which is required for reading the movies dataset from Hive. The POM declares it explicitly to change its scope from provided to compile, so that Kite adds it with -libjars when running the job.

The POM also adds the Kite Maven plugin, which performs Kite-specific tasks in Maven. In this case, you will use it to run the driver program, com.cloudera.Movies, after installing the jar:

cd movies-crunch
mvn clean install
mvn kite:run-tool

The toolClass is already configured in the POM, but could instead be set on the command line with -Dkite.toolClass=com.cloudera.Movies. You can find more information on the Kite Maven plugin in the kitesdk.org usage docs (http://kitesdk.org/docs/current/kite-maven-plugin/usage.html), and on the other tasks it can perform in the goal docs (http://kitesdk.org/docs/current/kite-maven-plugin/plugin-info.html).

Running the tool

If you haven't already, build and run the Crunch tool:

mvn clean install kite:run-tool

The job will create a "year_counts" directory in HDFS that you can view:

[cloudera@localhost movies-crunch]$ hdfs dfs -cat year_counts/*
[1926,1]
[1931,3]
[1932,6]
[1933,11]
[1934,20]
[1935,33]
[1936,48]
[1937,67]
[1938,89]
[1939,118]
