Hands on Kite Lab 1: Using the Kite CLI

Download the Kite dataset tool:

wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.15.0/kite-tools-0.15.0-binary.jar -O dataset
chmod +x dataset
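
As a quick sanity check that the tool downloaded correctly, you can print its built-in help (this assumes the help command is present in this build of the CLI):

./dataset help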

Download the sample data:

wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
tar -zxf ml-100k.tar.gz
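
Before inferring any schemas, it's worth a quick look at the raw files: u.item holds pipe-delimited movie metadata and u.data holds the ratings.

head -2 ml-100k/u.item
head -2 ml-100k/u.data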

Infer the schema for movie records from the sample data:

./dataset csv-schema ml-100k/u.item --delimiter '|' --class org.kitesdk.examples.data.Movie -o movie.avsc

Infer the schema for rating records from the sample data:

./dataset csv-schema ml-100k/u.data --class org.kitesdk.examples.data.Rating -o rating.avsc
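
Take a quick look at the generated schema. It should be an Avro record named Rating with fields such as userId, movieId, rating, and timestamp (the later commands rely on these names), and the numeric fields will be inferred as nullable unions like [ "null", "long" ]:

cat rating.avsc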

Since we'll be partitioning on some of the columns later, those fields can't be nullable. Update the automatically generated schema by rewriting the nullable long unions as plain longs:

sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc
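
Because sed ran in place with the orig backup suffix, you can diff the backup against the edited schema to confirm that only the long unions changed:

diff rating.avscorig rating.avsc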

Now we can create the movies dataset:

./dataset create movies -s movie.avsc
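
To confirm the dataset was created with the expected schema, you can ask the CLI to print it back (assuming the schema command is available in this build of the tool):

./dataset schema movies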

We can also import the sample data into our new dataset:

./dataset csv-import --delimiter '|' ml-100k/u.item movies

We want to partition the rating data, so let's create a partition configuration:

./dataset partition-config timestamp:year timestamp:month timestamp:day -s rating.avsc -o rating-part.json
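
The output file is a JSON partition strategy with three partitioners that derive year, month, and day values from the timestamp field. The sketch in the comments below is an approximation; check the generated file for the exact key names:

cat rating-part.json
# Roughly:
# [ {"type": "year",  "source": "timestamp", "name": "year"},
#   {"type": "month", "source": "timestamp", "name": "month"},
#   {"type": "day",   "source": "timestamp", "name": "day"} ]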

Now we can create the ratings dataset from the schema and partition configuration:

./dataset create ratings -s rating.avsc -p rating-part.json

Since we want to write to a partitioned dataset, it's useful to stage the raw data in HDFS so we can launch a MapReduce job to partition the data and load it into the dataset:

hdfs dfs -put ml-100k/u.data
./dataset csv-import hdfs://localhost.localdomain/user/cloudera/u.data ratings
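
Once the import completes, you can list the dataset directory in HDFS to see the Hive-style partition layout (year=/month=/day= subdirectories). The path below assumes the default repository keeps managed datasets under the Hive warehouse; adjust it to wherever your ratings dataset actually lives:

hdfs dfs -ls -R /user/hive/warehouse/ratings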

Let's take a look at some of the movies data:

./dataset show movies

We can do the same thing with the ratings data:

./dataset show ratings

The cool thing is we can use the same schema to load the data into HBase. Since HBase stores data by key, we will define the keys with a partition configuration:

./dataset partition-config userId:copy movieId:copy -s rating.avsc -o rating-hbase-part.json

We also need to map the fields in the data to the HBase row key and to columns in the table:

./dataset mapping-config userId:key movieId:key rating:f timestamp:f -s rating.avsc -p rating-hbase-part.json -o rating-mapping.json
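
The mapping file is also JSON. Expect an array with key mappings for userId and movieId and column mappings that place rating and timestamp in column family f; the exact JSON keys vary, so inspect the generated file rather than writing it by hand:

cat rating-mapping.json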

Now we can create the HBase-backed dataset using the schema, partition configuration, and mapping configuration:

./dataset create dataset:hbase:localhost.localdomain/ratings -s rating.avsc -p rating-hbase-part.json -m rating-mapping.json
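
You can confirm the table from the HBase shell; you should see a table for the ratings dataset (Kite may create additional metadata tables of its own):

echo "list" | hbase shell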

Once the dataset is created, we can import the same data we staged in HDFS, this time into HBase:

./dataset csv-import hdfs://localhost.localdomain/user/cloudera/u.data dataset:hbase:localhost.localdomain/ratings

Finally, we can use Kite's view URIs to grab a single record from HBase based on the fields that make up the key:

./dataset show "view:hbase:localhost.localdomain/ratings?userId=196&movieId=242"