Hands-On Kite Lab 1: Using the Kite CLI
Download the Kite dataset tool:
wget http://central.maven.org/maven2/org/kitesdk/kite-tools/0.15.0/kite-tools-0.15.0-binary.jar -O dataset
chmod +x dataset
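To sanity-check the download, ask the tool for its usage; it should print the list of available subcommands:
./dataset help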
Download the sample data:
wget --http-user=kite --http-password=kite http://bits.cloudera.com/10c0b54d/ml-100k.tar.gz
tar -zxf ml-100k.tar.gz
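It's worth glancing at the raw files before inferring schemas; u.item is pipe-delimited while u.data holds one rating per line:
head -3 ml-100k/u.item
head -3 ml-100k/u.data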
Infer the schema for movie records from the sample data:
./dataset csv-schema ml-100k/u.item --delimiter '|' --class org.kitesdk.examples.data.Movie -o movie.avsc
Infer the schema for rating records from the sample data:
./dataset csv-schema ml-100k/u.data --class org.kitesdk.examples.data.Rating -o rating.avsc
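Take a look at what was generated; csv-schema can't tell from a sample whether a column may be empty, so each field comes out as a nullable union like [ "null", "long" ]:
cat rating.avsc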
Since we'll be partitioning on some of these columns later, we need to update the automatically generated schema to make all of the fields non-null (partition source fields can't be nullable):
sed -iorig -e 's/\[ "null", "long" \]/"long"/' rating.avsc
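To confirm the edit took, count the remaining nullable unions; this should print 0:
grep -c '\[ "null", "long" \]' rating.avsc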
Now we can create the movies dataset:
./dataset create movies -s movie.avsc
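If you want to verify that the dataset was created, the schema command (assuming your version of the tool includes it) prints the Avro schema Kite stored for the dataset:
./dataset schema movies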
We can also import the sample data into our new dataset:
./dataset csv-import --delimiter '|' ml-100k/u.item movies
We want to partition the rating data, so let's create a partition configuration:
./dataset partition-config timestamp:year timestamp:month timestamp:day -s rating.avsc -o rating-part.json
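Feel free to inspect the result; it's a small JSON file describing the strategy, with one entry each for the year, month, and day derived from the timestamp field:
cat rating-part.json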
Now we can create the ratings dataset from the schema and partition configuration:
./dataset create ratings -s rating.avsc -p rating-part.json
Since we want to write to a partitioned dataset, it's useful to stage the raw data in HDFS so we can launch a MapReduce job to partition the data and load it into the dataset:
hdfs dfs -put ml-100k/u.data
./dataset csv-import hdfs://localhost.localdomain/user/cloudera/u.data ratings
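If you're curious where the partitioned records landed, you can browse the dataset's storage; with a default Hive-backed repository the files typically sit under the Hive warehouse directory, split into year/month/day subdirectories (the exact path depends on your configuration):
hdfs dfs -ls -R /user/hive/warehouse/ratings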
Let's take a look at some of the movies data:
./dataset show movies
We can do the same thing with the ratings data:
./dataset show ratings
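Both commands print only the first few records by default; if your version supports it, you can adjust that with the num-records flag:
./dataset show ratings -n 5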
The cool thing is we can use the same schema to load the data into HBase. Since HBase stores data by key, we will define the keys with a partition configuration:
./dataset partition-config userId:copy movieId:copy -s rating.avsc -o rating-hbase-part.json
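As before, you can inspect the generated file; the copy partitioner copies a field's value directly into the key, so rows will be keyed by userId followed by movieId:
cat rating-hbase-part.json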
We also need to map the fields in the data to the HBase row key and to columns in the table:
./dataset mapping-config userId:key movieId:key rating:f timestamp:f -s rating.avsc -p rating-hbase-part.json -o rating-mapping.json
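Inspecting this file shows that userId and movieId make up the row key, while rating and timestamp are stored as cells in the column family f:
cat rating-mapping.json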
Now we can create the HBase-backed dataset using the schema, partition configuration, and mapping configuration:
./dataset create dataset:hbase:localhost.localdomain/ratings -s rating.avsc -p rating-hbase-part.json -m rating-mapping.json
Once the dataset is created, we can import the same data we imported into HDFS into HBase instead:
./dataset csv-import hdfs://localhost.localdomain/user/cloudera/u.data dataset:hbase:localhost.localdomain/ratings
Finally, we can use Kite's view URIs to grab a single record from HBase based on the fields that make up the key:
./dataset show "view:hbase:localhost.localdomain/ratings?userId=196&movieId=242"