Hands on Kite Lab 1: Using the Kite CLI
If you get lost or want to check your work, you can refer to the lab solution at any time.
If you're running the lab on a CDH5 cluster installed with packages, such as the Cloudera QuickStart VM, then you need to set the following environment variable:
```
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
```
First you need to download the Kite dataset tool:

```
wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.16.0/kite-tools-0.16.0-binary.jar -O dataset
chmod +x dataset
```

and then download the sample data:

```
wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
tar -zxf ml-100k.tar.gz
```

The MovieLens data that you just downloaded includes a number of different files. For this lab, we'll be working with the movie data and the ratings data. The movie data is in the `ml-100k/u.item` file and the ratings data is in the `ml-100k/u.data` file.
Use the `csv-schema` command to infer the schema for each dataset from the two CSV files.
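A sketch of what this step might look like, assuming the tool was saved as `./dataset` above. The `--class`, `-o`, and `--delimiter` flag names and the record class names are assumptions; check `./dataset help csv-schema` for what your version actually supports. Note that `u.item` is pipe-delimited and `u.data` is tab-delimited, so the default comma delimiter must be overridden:

```
# Sketch only: flag names and class names are assumptions;
# confirm with ./dataset help csv-schema.
./dataset csv-schema ml-100k/u.item --class Movie -o movie.avsc --delimiter '|'
./dataset csv-schema ml-100k/u.data --class Rating -o rating.avsc --delimiter '\t'
```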
-
Use the `create` command to create the `movies` dataset from the movies schema you just created.
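One way this might look, assuming the schema from the previous step was written to `movie.avsc` (the filename is your choice; the `--schema` flag is from the Kite CLI help):

```
# Sketch: create the movies dataset from the inferred schema.
./dataset create movies --schema movie.avsc
```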
Import the `ml-100k/u.item` file into the `movies` dataset using the `csv-import` command to import the CSV file from the local file system.
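A sketch of the import, again with hedged flag names (confirm with `./dataset help csv-import`). The delimiter override is needed because `u.item` is pipe-delimited rather than comma-delimited:

```
# Sketch: import the local pipe-delimited file into the movies dataset.
./dataset csv-import ml-100k/u.item movies --delimiter '|'
```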
Before you can create the `ratings` dataset, you need to create the partition configuration. The current release of Kite generates each field with a nullable type. Since we're partitioning the ratings data by some of the fields, you need to modify the schema to make those fields non-null. You can either open a text editor and change the type of each field from `[ "null", "long" ]` to `"long"`, or you can use the following `sed` command to do it all at once:

```
sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc
```

Now use the `partition-config` command to take the `timestamp` field and partition the data by year, month, and day.
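To see what the `sed` rewrite does, here is a self-contained demonstration on a minimal stand-in file (the `rating.avsc` written here is a one-line fragment created by the script, not the real inferred schema):

```
# Write a minimal stand-in fragment with a nullable field.
cat > rating.avsc <<'EOF'
{"name": "timestamp", "type": [ "null", "long" ]}
EOF

# Same substitution as in the lab; -iorig keeps a backup in rating.avscorig.
sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc

cat rating.avsc
# prints: {"name": "timestamp", "type": "long"}
```

The `-iorig` form edits the file in place while saving the original with an `orig` suffix, so you can diff or restore it if the substitution goes wrong.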
Use the `create` command to create the `ratings` dataset from the ratings schema and the ratings partition configuration you just created.
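Putting the previous step's partition configuration together with this create, a sketch might look like the following. The `-s`, `-o`, and `--partition-by` flag names and the `rating.json` output filename are assumptions; check `./dataset help partition-config` and `./dataset help create`:

```
# Sketch: build the partition strategy from the timestamp field,
# then create the dataset with both the schema and the strategy.
./dataset partition-config timestamp:year timestamp:month timestamp:day -s rating.avsc -o rating.json
./dataset create ratings --schema rating.avsc --partition-by rating.json
```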
Since the `ratings` dataset is partitioned, it's useful to upload the raw data into HDFS before using the `csv-import` command. You can upload the raw data using the `hdfs dfs` command and then specify the input to the `csv-import` command as an HDFS URI:

```
hdfs dfs -put ml-100k/u.data
```

If you put the data into your home directory as above, then you can use an input URI of `hdfs://<namenode>/user/<user>/u.data`, where `<namenode>` is the hostname of the namenode and `<user>` is your user name.

Import the `u.data` file into the `ratings` dataset using the `csv-import` command to import the CSV file from HDFS.
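A sketch of the HDFS import, with the same hedged `--delimiter` flag as before (`u.data` is tab-delimited); replace `<namenode>` and `<user>` with your cluster's values:

```
# Sketch: import the tab-delimited ratings file directly from HDFS.
./dataset csv-import hdfs://<namenode>/user/<user>/u.data ratings --delimiter '\t'
```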
Finally, use the `show` command to print the first 10 records in each dataset to your terminal.
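For example (the `-n` record-count flag is an assumption; run `./dataset help show` to confirm):

```
# Sketch: print the first 10 records of each dataset.
./dataset show movies -n 10
./dataset show ratings -n 10
```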