Hands on Kite Lab 1: Using the Kite CLI

If you want to check your work, or if you get lost, you can refer to the lab solution at any time.

If you're running the lab on a CDH5 cluster installed with packages, such as the Cloudera QuickStart VM, then you need to set the following environment variable:

    export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

  1. First you need to download the Kite dataset tool:

     wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.16.0/kite-tools-0.16.0-binary.jar -O dataset
     chmod +x dataset
    

    and then download the sample data:

     wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
     tar -zxf ml-100k.tar.gz
    

    The MovieLens data that you just downloaded includes a number of different files. For this lab, we'll be working with the movie data and the ratings data. The movie data is in the ml-100k/u.item file and the ratings data is in the ml-100k/u.data file.

  2. Use the csv-schema command to infer the schema for each dataset from the two CSV files.
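
    The exact options are listed by ./dataset help csv-schema. As a sketch, assuming the --record-name, --delimiter, and -o flags from the Kite CLI (u.item is pipe-delimited and u.data is tab-delimited; $'\t' is bash syntax for a literal tab):

     ./dataset csv-schema ml-100k/u.item --record-name Movie --delimiter '|' -o movie.avsc
     ./dataset csv-schema ml-100k/u.data --record-name Rating --delimiter $'\t' -o rating.avsc

    Here rating.avsc matches the file name used in step 5, while movie.avsc is just an illustrative name. The MovieLens files have no header row, so check the help output for the CSV header options if the inferred field names don't come out as readable names such as timestamp.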

  3. Use the create command to create the movies dataset from the movies schema you just created.
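
    A minimal sketch, assuming the movie schema was saved as movie.avsc in the previous step:

     ./dataset create movies --schema movie.avsc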

  4. Import the ml-100k/u.item file into the movies dataset using the csv-import command to import the CSV file from the local file system.
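
    For example, again passing --delimiter because u.item is pipe-delimited rather than comma-delimited:

     ./dataset csv-import ml-100k/u.item movies --delimiter '|'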

  5. Before you can create the ratings dataset, you need to create the partition configuration. The current release of Kite generates each field with a nullable type. Since we're partitioning the ratings data by some of the fields, you need to modify the schema to make those fields non-null. You can either open a text editor and change the type of each field from [ "null", "long" ] to "long", or you can use the following sed command to do it all at once:

     sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc
    

    Now use the partition-config command to take the timestamp field and partition the data by year, month and day.
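
    A sketch using the field:type pairs from the Kite CLI reference, with rating-part.json as an arbitrary output file name:

     ./dataset partition-config timestamp:year timestamp:month timestamp:day --schema rating.avsc -o rating-part.json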

  6. Use the create command to create the ratings dataset from the ratings schema and ratings partition configuration you just created.
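
    A minimal sketch, assuming the partition configuration was saved as rating-part.json in the previous step:

     ./dataset create ratings --schema rating.avsc --partition-by rating-part.json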

  7. Since the ratings dataset is partitioned, it's useful to upload the raw data into HDFS before using the csv-import command. You can upload the raw data with the hdfs dfs command and then specify the input to the csv-import command as an HDFS URI:

     hdfs dfs -put ml-100k/u.data
    

    If you put the data into your home directory as above, then you can use an input URI of hdfs://<namenode>/user/<user>/u.data where <namenode> is the hostname of the namenode and <user> is your user name.

    Import the u.data file into the ratings dataset using the csv-import command to import the CSV file from HDFS.
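
    For example, substituting your own values for <namenode> and <user> (u.data is tab-delimited, so the delimiter option is needed again):

     ./dataset csv-import hdfs://<namenode>/user/<user>/u.data ratings --delimiter $'\t'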

  8. Finally, use the show command to print the first 10 records in each dataset to your terminal.
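
    For example:

     ./dataset show movies
     ./dataset show ratings

    The show command prints 10 records by default; see ./dataset help show for the option that changes the count.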
