
From Nothing to Eye Candy


Deploy a Cluster

This will actually be quick. The easiest way to get a GeoWave cluster is simply to use EMR with the GeoWave bootstrap actions: http://ngageoint.github.io/geowave/documentation.html#running-from-emr-2 Following those steps will give you a GeoWave cluster on AWS within minutes.

In fact, the baseline here includes a set of scripts that run through this entire example as an EMR bootstrap action. After starting EMR with them, you will end up with a running GeoWave cluster with GDELT data ingested, plus a point layer and a KDE layer exposed through GeoServer on port 8000 of the master node (using these scripts would, of course, be cheating!).
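For reference, a cluster like this might be launched from the AWS CLI along the following lines. This is only a hedged sketch: the bootstrap script path, release label, and instance sizing below are placeholders, and the linked documentation has the real values.

# a minimal sketch, not a tested command; substitute the bootstrap script
# location and instance types recommended in the documentation linked above
aws emr create-cluster \
  --name geowave-cluster \
  --release-label emr-4.3.0 \
  --use-default-roles \
  --ec2-attributes KeyName=your-key-pair \
  --applications Name=Hadoop \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --bootstrap-actions Path=s3://your-bucket/bootstrap-geowave.sh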

Ingest Data

Pick a large, interesting dataset of your choice. OSM GPX data, at approximately 2.8 billion points, offers some interesting use cases for aggregating all of that data, but in this tutorial let's walk through using some GDELT data.

wget http://data.gdeltproject.org/events/md5sums
for file in `cat md5sums | cut -d' ' -f3` ; do wget http://data.gdeltproject.org/events/$file ; done
md5sum -c md5sums 2>&1
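Downloading the full archive can take a while. If you only want a smaller slice, one approach (a sketch, assuming the daily event files are named by date so that each filename starts with the year) is to filter the file list before fetching:

# fetch only the 2015 event files; adjust the pattern for other years
for file in `cat md5sums | cut -d' ' -f3 | grep ^2015` ; do wget http://data.gdeltproject.org/events/$file ; done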
Now that we have the data, let's make sure Accumulo is configured properly. We'll add a `geowave` user and a `geowave` namespace within Accumulo, and associate the `geowave` namespace with GeoWave's Accumulo library. When you install from RPM, the library will be located at `hdfs://${HOSTNAME}:${HDFS_PORT}/accumulo/classpath/geowave/${GEOWAVE_VERSION}-apache/geowave-accumulo.jar`. The shell script below, in essence, allows all tables prefixed by `geowave.` to use the GeoWave library.
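That script, and the geowave commands later in this walkthrough, reference a handful of environment variables. Here is a hypothetical way to set them; every value below is an assumption to adapt to your cluster (the version, ports, and paths in particular):

# hypothetical values for illustration only -- substitute your own
# (HOSTNAME is normally set by the shell already)
export HDFS_PORT=8020              # common HDFS NameNode port; yours may differ
export RESOURCE_MAN_PORT=8032      # common YARN ResourceManager port
export GEOWAVE_VERSION=0.9.1      # whatever version the RPM installed
export INSTANCE=accumulo           # your Accumulo instance name
export NUM_PARTITIONS=32           # number of pre-split partitions
export STAGING_DIR=/mnt/gdelt      # where the GDELT files were downloaded
export GEOWAVE_TOOLS_HOME=/usr/local/geowave/tools   # adjust to your install
export WEST=5.8 SOUTH=47.2 EAST=15.1 NORTH=55.1      # bbox for the optional --cql filter (roughly Germany)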

## configure accumulo
# create the geowave user (the heredoc answers the password prompt twice)
cat <<EOF | accumulo shell -u root -p secret -e "createuser geowave"
geowave
geowave
EOF
# create the geowave namespace and let the geowave user create tables in it
accumulo shell -u root -p secret -e "createnamespace geowave"
accumulo shell -u root -p secret -e "grant NameSpace.CREATE_TABLE -ns geowave -u geowave"
# register the GeoWave jars on HDFS as a classpath context and apply it to the namespace
accumulo shell -u root -p secret -e "config -s general.vfs.context.classpath.geowave=hdfs://${HOSTNAME}:${HDFS_PORT}/accumulo/classpath/geowave/${GEOWAVE_VERSION}-apache/[^.].*.jar"
accumulo shell -u root -p secret -e "config -ns geowave -s table.classpath.context=geowave"
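Optionally, you can sanity-check the configuration from the shell (a sketch, assuming the same root credentials as above):

# confirm the classpath context property took effect and the user holds the namespace permission
accumulo shell -u root -p secret -e "config -f general.vfs.context.classpath.geowave"
accumulo shell -u root -p secret -e "userpermissions -u geowave"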

Now it's just a matter of running geowave commands to configure a store that points at the new namespace (`gwNamespace` defines a table prefix, so by using `geowave.gdelt` we leverage the Accumulo namespace set up in the previous step). The first command below configures a named entity, "gdelt-accumulo", that can be referenced on ingest to supply all of the connection parameters. The second configures a named entity, "gdelt-spatial", an indexing strategy that uses spatial dimensional indexing with a partitioning strategy to avoid hotspotting: each partition is structured spatially, but data is assigned to an arbitrary partition, and the Accumulo table is pre-split based on that partitioning approach. Lastly, the data is ingested by referencing the store and the index strategy that were configured; an optional --cql parameter is applied in this case to filter by bounding box.


geowave config addstore -t accumulo gdelt-accumulo --gwNamespace geowave.gdelt --zookeeper $HOSTNAME:2181 --instance $INSTANCE --user geowave --password geowave
geowave config addindex -t spatial gdelt-spatial --partitionStrategy round_robin --numPartitions $NUM_PARTITIONS
geowave ingest localtogw $STAGING_DIR/gdelt gdelt-accumulo gdelt-spatial -f gdelt --gdelt.cql "BBOX(geometry,${WEST},${SOUTH},${EAST},${NORTH})"
You can verify this step completed successfully by viewing the tables in the Accumulo monitor (port 50095 on your master node). There should be a set of tables prefixed by geowave.gdelt that represent your GDELT points.
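If you'd rather check from the command line, a quick grep of the table list works too (again assuming the root credentials used above):

accumulo shell -u root -p secret -e "tables" | grep geowave.gdelt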

Run Analytics

Now let's run an example distributed process on the data we just ingested. In this case we will run a Kernel Density Estimate (KDE), which, with a supplied color ramp, can be used to display a heatmap. We will configure a new store just so the results land in entirely separate tables from the original dataset; keep in mind this step is unnecessary if you'd prefer to keep the data in the same tables. We are configuring "gdelt-accumulo-out" as the store that the KDE command references as the output of the analytic.


geowave config addstore -t accumulo gdelt-accumulo-out --gwNamespace geowave.kde_gdelt --zookeeper $HOSTNAME:2181 --instance $INSTANCE --user geowave --password geowave
hadoop jar ${GEOWAVE_TOOLS_HOME}/geowave-tools.jar analytic kde --featureType gdeltevent --minLevel 5 --maxLevel 26 --minSplits $NUM_PARTITIONS --maxSplits $NUM_PARTITIONS --coverageName gdeltevent_kde --hdfsHostPort ${HOSTNAME}:${HDFS_PORT} --jobSubmissionHostPort ${HOSTNAME}:${RESOURCE_MAN_PORT} --tileSize 1 gdelt-accumulo gdelt-accumulo-out
Again, like the point ingest, you can verify this step completed successfully by viewing the tables in the Accumulo monitor (port 50095 on your master node). There should be a set of tables prefixed by geowave.kde_gdelt that represent your kernel density estimate results.
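The same command-line check applies here:

accumulo shell -u root -p secret -e "tables" | grep geowave.kde_gdelt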

Setup GeoServer Layers