Hands on Kite Lab 1: Using the Kite CLI
If you want to check your work, or if you get stuck, you can refer to the lab solution at any time.
- First, download the Kite dataset tool:

  ```
  wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.15.0/kite-tools-0.15.0-binary.jar -O dataset
  chmod +x dataset
  ```

  and then download the sample data:

  ```
  wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
  tar -zxf ml-100k.tar.gz
  ```

  The MovieLens data that you just downloaded includes a number of different files. For this lab, we'll be working with the movie data and the ratings data. The movie data is in the `ml-100k/u.item` file and the ratings data is in the `ml-100k/u.data` file.
- Use the `csv-schema` command to infer the schema of the two CSV files.

  The current release of Kite infers that every field is nullable. Since we'll be partitioning the ratings data by some of its fields, you need to modify the schema. You can either open a text editor and change the type of each field from `[ "null", "long" ]` to `"long"`, or you can use the following `sed` command to do it all at once:

  ```
  sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc
  ```
- Use the `create` command to create the `movies` dataset from the movies schema you just created.
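A sketch of the dataset-creation step, assuming the schema was saved as `movie.avsc` (that file name is an assumption; `--schema` can be abbreviated `-s` in some versions — see `./dataset help create`):

```shell
# Create the movies dataset, using the inferred Avro schema to
# define its fields. The schema file name is illustrative.
./dataset create movies --schema movie.avsc
```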
- Once the dataset is created, use the `csv-import` command to import the CSV file from the local file system.
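The local import might look like the following; the `--delimiter` flag is an assumption (the `u.item` file is pipe-delimited), so verify it with `./dataset help csv-import`:

```shell
# Load the local pipe-delimited movie file into the movies dataset.
./dataset csv-import ml-100k/u.item movies --delimiter '|'
```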
- Before you can create the `ratings` dataset, you need to create the partition configuration. Use the `partition-config` command to take the `timestamp` field and partition the data by year, month, and day.

  Now use the `create` command to create the `ratings` dataset from the ratings schema and the ratings partition configuration you just created.
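The two steps above can be sketched as follows. The `field:function` arguments follow the pattern the lab describes (year, month, day of `timestamp`); the output file name `rating-partitions.json` and the `--partition-by` flag are assumptions to check against `./dataset help partition-config` and `./dataset help create`:

```shell
# Build a partition strategy on the timestamp field, then create
# the ratings dataset partitioned by year, month, and day.
./dataset partition-config timestamp:year timestamp:month timestamp:day \
  --schema rating.avsc -o rating-partitions.json
./dataset create ratings --schema rating.avsc --partition-by rating-partitions.json
```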
- Since the `ratings` dataset is partitioned, it's useful to upload the raw data into HDFS before using the `csv-import` command. You can upload the raw data with the `hdfs dfs` command and then specify the input to `csv-import` as an HDFS URI:

  ```
  hdfs dfs -put ml-100k/u.data
  ```

  If you put the data into your home directory as above, then you can use an input URI of `hdfs://<namenode>/user/<user>/u.data`, where `<namenode>` is the hostname of the namenode and `<user>` is your user name.
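Putting the pieces together, the HDFS import might look like this; the `--delimiter '\t'` flag is an assumption (the `u.data` file is tab-delimited), and the `<namenode>`/`<user>` placeholders must be replaced with your cluster's values:

```shell
# Import the ratings from HDFS into the partitioned ratings dataset.
./dataset csv-import hdfs://<namenode>/user/<user>/u.data ratings --delimiter '\t'
```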
- Finally, use the `show` command to print the first 10 records in each dataset to your terminal.
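A minimal sketch of the final step; the `-n` record-count flag is an assumption (see `./dataset help show`), and `show` may default to 10 records without it:

```shell
# Print the first 10 records of each dataset to the terminal.
./dataset show movies -n 10
./dataset show ratings -n 10
```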