Skip to content

Hive Deserializer for DynamoDB backup data format

License

Notifications You must be signed in to change notification settings

lyft/dynamodb-hive-serde

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

dynamodb-hive-serde

Hive Deserializer for DynamoDB backup data format.

When AWS Data Pipeline is used to export backups of DynamoDB tables, the file format is somewhat difficult to parse in Hive. This custom deserializer makes it easy to process files in hive without any pre-processing.

Simply install the DynamoDbSerDe jar and specify the row format as the DynamoDB SerDe in your queries. Pick the DynamoDb column names you want to access and a type they should be. Per line of data the DynamoDb SerDe will locate the columns you specified and coerce the values into the types you specify.

Example query:

ADD jar /path/to/jar/dynamodb-hive-serde-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE dynamodb (id string, updated_at string, created_at string, version int)
ROW FORMAT SERDE 'com.lyft.hive.serde.DynamoDbSerDe'
LOCATION '/dynamodb/input/';

Timestamp format

You can specify a custom time format, which will be used to construct a Joda Time DateTimeFormatter. For example:

CREATE EXTERNAL TABLE dynamodb (id string, updated_at timestamp, created_at timestamp, version int)
ROW FORMAT SERDE 'com.lyft.hive.serde.DynamoDbSerDe'
WITH SERDEPROPERTIES ('input.timestamp.format'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSSZ')
LOCATION '/dynamodb/input/';

Building

First, install maven, then:

mvn package

About

Hive Deserializer for DynamoDB backup data format

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages