@DennisDawson

With an eye toward modularization, I've repurposed CreateEvents.java from the Spark example and placed it in org/kitesdk/examples/data. This lets the customer create the events dataset using the CLI, then populate it with a substantial number of records using the Java utility. The same dataset can be used for both the Flume and Spark examples, without having to delete it after each job runs.

In GenerateEvents, I essentially swapped the CreateEvents create() method with load(). I added the Avro plug-in to pom.xml, copied the avro folder with standard_event.avsc into the main directory, and copied BaseEventsTool.java to org/kitesdk/examples/data.

In my environment, it compiles, runs, and populates the events table as expected.

**Update**

The random records were a little too random: if the user_id, session_id, and ip differ for every record, the Crunch utility finds no sessions to aggregate. I revised the run method to generate the user_id, session_id, and ip first, then use a for loop to generate 1 to 25 random events for that session. I also modified the randomTimestamp method to increase the base interval and add random padding, which produces more realistic session durations.
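As a rough illustration of that session-first approach (a minimal sketch, not the code in this PR), the idea can be written with plain Java and no Kite or Avro dependencies. The `Event` class is a hypothetical stand-in for the Avro-generated record, and the names `generateSession` and `nextTimestamp`, the IP format, and the timing constants are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

class SessionEventSketch {

  // Hypothetical stand-in for the Avro-generated event record.
  static final class Event {
    final long userId;
    final String sessionId;
    final String ip;
    final long timestamp;

    Event(long userId, String sessionId, String ip, long timestamp) {
      this.userId = userId;
      this.sessionId = sessionId;
      this.ip = ip;
      this.timestamp = timestamp;
    }
  }

  static final int MAX_EVENTS_PER_SESSION = 25;
  // Assumed values: a fixed base gap plus random padding between events.
  static final long BASE_GAP_MS = 5_000L;
  static final long MAX_PADDING_MS = 55_000L;

  // Generate the session identity once, then 1 to 25 events that share it.
  static List<Event> generateSession(Random random, long sessionStart) {
    long userId = random.nextInt(10_000);
    String sessionId = new UUID(random.nextLong(), random.nextLong()).toString();
    String ip = "192.168." + random.nextInt(256) + "." + random.nextInt(256);

    int eventCount = 1 + random.nextInt(MAX_EVENTS_PER_SESSION);
    List<Event> events = new ArrayList<>(eventCount);
    long timestamp = sessionStart;
    for (int i = 0; i < eventCount; i++) {
      timestamp = nextTimestamp(random, timestamp);
      events.add(new Event(userId, sessionId, ip, timestamp));
    }
    return events;
  }

  // Advance the clock by the base gap plus a random amount of padding.
  static long nextTimestamp(Random random, long previous) {
    return previous + BASE_GAP_MS + (long) (random.nextDouble() * MAX_PADDING_MS);
  }

  public static void main(String[] args) {
    List<Event> session = generateSession(new Random(), System.currentTimeMillis());
    System.out.println(session.size() + " events for session " + session.get(0).sessionId);
  }
}
```

Because every event in the list shares the same user_id, session_id, and ip, a downstream job that groups on those fields has something to aggregate, and the strictly increasing timestamps give each session a nonzero duration once it has more than one event.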

I'm happy to incorporate any changes that make the code more elegant; my changes just make it work.

@DennisDawson changed the title from "Utility to generate events to existing table." to "CDK-928: Utility to generate events to existing table." on Feb 19, 2015
Contributor

It doesn't look like any of the code in this class is used, so it would be better to remove it and make GenerateEvents implement Tool directly.

Author

Excellent. Done.

Contributor

I noted this elsewhere, but I think it would be better to use a variable rather than the inline test here.

Author

Is this wrong, or just different? Are you suggesting that the test should set the variable before the load method? If the argument is invalid, does it change the result by setting it outside the load method? If the code must change before publication, please provide the acceptable alternate code, rather than have me guess at what I should do.

3 participants