Hands on Kite Lab 1: Using the Kite CLI
If you want to check your work, or if you get stuck, you can refer to the lab solution at any time.
- First, download the Kite dataset tool:

  ```
  wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.15.0/kite-tools-0.15.0-binary.jar -O dataset
  chmod +x dataset
  ```

  and then download the sample data:

  ```
  wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
  tar -zxf ml-100k.tar.gz
  ```

  The MovieLens data that you just downloaded includes a number of different files. For this lab, we'll be working with the movie data and the ratings data. The movie data is in the `ml-100k/u.item` file and the ratings data is in the `ml-100k/u.data` file.
- Use the `csv-schema` command to infer the schema of the two CSV files.

  The current release of Kite infers that every field is nullable. Since we'll be partitioning the ratings data by some of its fields, you need to modify the schema. You can either open a text editor and change the type of each field from `[ "null", "long" ]` to `"long"`, or you can use the following `sed` command to do it all at once:

  ```
  sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc
  ```
- Use the `create` command to create the `movies` dataset from the movies schema you just created.
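A sketch of the dataset-creation step, assuming the schema was saved as `movie.avsc` (that file name is an assumption; `--schema` can be abbreviated `-s` in some versions — see `./dataset help create`):

```shell
# Create the movies dataset, using the inferred Avro schema to
# define its fields. The schema file name is illustrative.
./dataset create movies --schema movie.avsc
```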
- Once the dataset is created, use the `csv-import` command to import the CSV file from the local file system.
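The local import might look like the following; the `--delimiter` flag is an assumption (the `u.item` file is pipe-delimited), so verify it with `./dataset help csv-import`:

```shell
# Load the local pipe-delimited movie file into the movies dataset.
./dataset csv-import ml-100k/u.item movies --delimiter '|'
```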
- Before you can create the `ratings` dataset, you need to create the partition configuration. Use the `partition-config` command to take the `timestamp` field and partition the data by year, month, and day.

  Now use the `create` command to create the `ratings` dataset from the ratings schema and the ratings partition configuration you just created.
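The two steps above can be sketched as follows. The `field:function` arguments follow the pattern the lab describes (year, month, day of `timestamp`); the output file name `rating-partitions.json` and the `--partition-by` flag are assumptions to check against `./dataset help partition-config` and `./dataset help create`:

```shell
# Build a partition strategy on the timestamp field, then create
# the ratings dataset partitioned by year, month, and day.
./dataset partition-config timestamp:year timestamp:month timestamp:day \
  --schema rating.avsc -o rating-partitions.json
./dataset create ratings --schema rating.avsc --partition-by rating-partitions.json
```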
- Since the `ratings` dataset is partitioned, it's useful to upload the raw data into HDFS before using the `csv-import` command. You can upload the raw data with the `hdfs dfs` command and then specify the input to `csv-import` as an HDFS URI:

  ```
  hdfs dfs -put ml-100k/u.data
  ```

  If you put the data into your home directory as above, then you can use an input URI of `hdfs://<namenode>/user/<user>/u.data`, where `<namenode>` is the hostname of the namenode and `<user>` is your user name.
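Putting the pieces together, the HDFS import might look like this; the `--delimiter '\t'` flag is an assumption (the `u.data` file is tab-delimited), and the `<namenode>`/`<user>` placeholders must be replaced with your cluster's values:

```shell
# Import the ratings from HDFS into the partitioned ratings dataset.
./dataset csv-import hdfs://<namenode>/user/<user>/u.data ratings --delimiter '\t'
```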
- Finally, use the `show` command to print the first 10 records in each dataset to your terminal.
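A minimal sketch of the final step; the `-n` record-count flag is an assumption (see `./dataset help show`), and `show` may default to 10 records without it:

```shell
# Print the first 10 records of each dataset to the terminal.
./dataset show movies -n 10
./dataset show ratings -n 10
```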