
Hands on Kite Lab 1: Using the Kite CLI


If you get lost or want to check your work, you can refer to the lab solution at any time.

First, download the Kite dataset tool:

wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.15.0/kite-tools-0.15.0-binary.jar -O dataset
chmod +x dataset

and then download the sample data:

wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
tar -zxf ml-100k.tar.gz

The MovieLens data that you just downloaded includes a number of different files. For this lab, we'll be working with the movie data and the ratings data. The movie data is in the ml-100k/u.item file and the ratings data is in the ml-100k/u.data file.

You can use the csv-schema command to infer a schema for each of the two CSV files.
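
For example, the following is one way to do it, assuming the movie file is pipe-delimited and the ratings file is tab-delimited as in the standard MovieLens distribution. The record names, output file names, and exact option spellings are suggestions, so check the tool's help output if your version differs; the ratings schema is saved as rating.avsc to match the sed command below:

# assumes pipe-delimited movie data and tab-delimited ratings data
./dataset csv-schema ml-100k/u.item --class Movie --delimiter '|' -o movie.avsc
./dataset csv-schema ml-100k/u.data --class Rating --delimiter $'\t' -o rating.avsc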

The current release of Kite infers that every field is nullable. Since we'll be partitioning the ratings data by some of its fields, you need to modify the ratings schema. You can either open a text editor and change the type of each field from [ "null", "long" ] to "long", or you can use the following sed command to do it all at once:

sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc

Now use the create command to create the movies dataset from the movie schema you just generated.
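
For example, assuming the movie schema was saved as movie.avsc and you name the dataset movies:

./dataset create movies --schema movie.avsc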

Once the dataset is created, use the csv-import command to load the movie CSV file from the local file system.
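
A sketch, again assuming pipe-delimited data and the dataset name movies:

./dataset csv-import ml-100k/u.item movies --delimiter '|'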

Before you can create the ratings dataset, you need to create the partition configuration. Use the partition-config command to partition the data by year, month, and day based on the timestamp field.
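
One way to write that, assuming the field is named timestamp in rating.avsc and the result is saved as rating-part.json:

./dataset partition-config timestamp:year timestamp:month timestamp:day --schema rating.avsc -o rating-part.json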

Now use the create command to create the ratings dataset from the ratings schema and ratings partition configuration you just created.
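
For example, with the file names used above (the --partition-by option name is an assumption; check the create command's help output if it differs in your version):

./dataset create ratings --schema rating.avsc --partition-by rating-part.json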

Since the ratings dataset is partitioned, it's useful to upload the raw data into HDFS before running the csv-import command. You can upload the raw data using the hdfs dfs command and then point csv-import at an HDFS URI for its input:

hdfs dfs -put ml-100k/u.data

If you put the data into your home directory as above, then you can use an input URI of hdfs://<namenode>/user/<user>/u.data where <namenode> is the hostname of the namenode and <user> is your user name.
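
For example, substituting your own namenode host and user name (the tab delimiter is an assumption based on the standard MovieLens format):

./dataset csv-import hdfs://<namenode>/user/<user>/u.data ratings --delimiter $'\t'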

Finally, use the show command to print the first 10 records in each dataset to your terminal.
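
For example, assuming the dataset names used above (show should print 10 records by default; if not, look for an option to limit the record count):

./dataset show movies
./dataset show ratings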
