From Nothing to Eye Candy
Standing up a GeoWave cluster is quick. The easiest way is to use EMR with the GeoWave bootstrap actions (http://ngageoint.github.io/geowave/documentation.html#running-from-emr-2). Following those steps gives you a GeoWave cluster on AWS within minutes.
In fact, the baseline includes a set of scripts that run through this entire example as an EMR bootstrap action. After starting EMR you end up with a running GeoWave cluster with GDELT data ingested, plus a point layer and a KDE layer exposed through GeoServer on port 8000 of the master node. Using these scripts would of course be cheating!
Pick a large, interesting dataset of your choice. The OSM GPX data is approximately 2.8 billion points, and there are some interesting use cases for aggregating all of that data, but this tutorial walks through using some GDELT data. The commands below download every file listed in the GDELT md5sums manifest and then verify the checksums.
wget http://data.gdeltproject.org/events/md5sums
for file in `cat md5sums | cut -d' ' -f3` ; do wget http://data.gdeltproject.org/events/$file ; done
md5sum -c md5sums 2>&1
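The `md5sum -c` call above prints one status line per file, which is tedious to scan when the manifest is long. The following is a minimal sketch that prints only the files that did not verify. It assumes GNU `md5sum` and `awk` are available and that file names do not contain `: `. `failed_downloads` is a hypothetical helper name, not part of GeoWave or the GDELT tooling.

```shell
#!/usr/bin/env bash
# Print the name of every file in the manifest that is missing or
# whose checksum does not match. Prints nothing when all files verify.
# Usage: failed_downloads [manifest]   (manifest defaults to ./md5sums)
failed_downloads() {
    local manifest="${1:-md5sums}"
    # LC_ALL=C keeps the status words untranslated ("OK", "FAILED").
    # md5sum -c writes "<name>: OK" or "<name>: FAILED ..." to stdout;
    # warnings on stderr are discarded.
    LC_ALL=C md5sum -c "$manifest" 2>/dev/null |
        awk -F': ' '$2 != "OK" { print $1 }'
}

# Possible use (not executed here): fetch the bad files again.
# for f in $(failed_downloads); do
#     wget -O "$f" "http://data.gdeltproject.org/events/$f"
# done
```

Re-running `md5sum -c md5sums` after the retry confirms whether the second download succeeded.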
## configure accumulo
cat <<EOF | accumulo shell -u root -p secret -e "createuser geowave"
geowave
geowave
EOF
accumulo shell -u root -p secret -e "createnamespace geowave"
accumulo shell -u root -p secret -e "grant NameSpace.CREATE_TABLE -ns geowave -u geowave"
accumulo shell -u root -p secret -e "config -s general.vfs.context.classpath.geowave=hdfs://${HOSTNAME}:${HDFS_PORT}/accumulo/classpath/geowave/${GEOWAVE_VERSION}-apache/[^.].*.jar"
accumulo shell -u root -p secret -e "config -ns geowave -s table.classpath.context=geowave"
Now it's just a matter of running geowave commands to configure a store that points to the new namespace. The gwNamespace option defines a table prefix, so by using geowave.gdelt we leverage the Accumulo namespace set up in the previous step. The first command below configures a named entity "gdelt-accumulo" that can be referenced on ingest to supply all the connection parameters. The second configures a named entity "gdelt-spatial", an indexing strategy that uses spatial dimensional indexing with a partitioning strategy to avoid hotspotting: each partition is structured spatially, but the data is assigned to an arbitrary partition, and the Accumulo table is pre-split based on that partitioning approach. Lastly, the data is ingested by referencing the store and the index strategy that were configured. The optional --cql parameter is applied here to restrict the ingest to a bounding box.
geowave config addstore -t accumulo gdelt-accumulo --gwNamespace geowave.gdelt --zookeeper $HOSTNAME:2181 --instance $INSTANCE --user geowave --password geowave
geowave config addindex -t spatial gdelt-spatial --partitionStrategy round_robin --numPartitions $NUM_PARTITIONS
geowave ingest localtogw $STAGING_DIR/gdelt gdelt-accumulo gdelt-spatial -f gdelt --gdelt.cql "BBOX(geometry,${WEST},${SOUTH},${EAST},${NORTH})"
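The commands above depend on shell variables that this page never defines ($HOSTNAME, $INSTANCE, $NUM_PARTITIONS, $STAGING_DIR, and the four bounding-box corners). If one is unset, the ingest fails late or runs with a malformed filter. Below is a minimal sketch of a pre-flight check. `require_vars` and `bbox_cql` are hypothetical helper names introduced for illustration; they are not GeoWave commands. The sketch assumes numeric corner values in decimal degrees and rejects boxes where west is not less than east, so a box crossing the antimeridian would need different handling.

```shell
#!/usr/bin/env bash
# Return non-zero if any of the named variables is unset or empty.
require_vars() {
    local name value missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name; names come from
        # the caller, not from untrusted input.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Print the CQL bounding-box filter in the form passed to --gdelt.cql.
# Arguments: west south east north
bbox_cql() {
    local west="${1:-}" south="${2:-}" east="${3:-}" north="${4:-}"
    # awk performs the decimal comparisons that [ ] cannot do.
    if ! awk -v w="$west" -v s="$south" -v e="$east" -v n="$north" \
        'BEGIN { exit !(w < e && s < n && w >= -180 && e <= 180 && s >= -90 && n <= 90) }'
    then
        echo "invalid bounding box: $west,$south,$east,$north" >&2
        return 1
    fi
    printf 'BBOX(geometry,%s,%s,%s,%s)' "$west" "$south" "$east" "$north"
}

# Possible use before the ingest command (not executed here):
# require_vars STAGING_DIR WEST SOUTH EAST NORTH || exit 1
# CQL=$(bbox_cql "$WEST" "$SOUTH" "$EAST" "$NORTH") || exit 1
```

The resulting `$CQL` string can then be passed as the value of `--gdelt.cql` in place of the inline expression.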
Now let's show an example of a distributed process that can be run once the data has been ingested. Here we run a Kernel Density Estimate (KDE), whose output can be displayed as a heatmap with a supplied color ramp. We configure a new store only so that the results land in tables entirely separate from the original dataset; this step is unnecessary if you'd prefer to keep the data in the same tables. The store "gdelt-accumulo-out" is referenced in the KDE command as the output of the analytic.
geowave config addstore -t accumulo gdelt-accumulo-out --gwNamespace geowave.kde_gdelt --zookeeper $HOSTNAME:2181 --instance $INSTANCE --user geowave --password geowave
hadoop jar ${GEOWAVE_TOOLS_HOME}/geowave-tools.jar analytic kde --featureType gdeltevent --minLevel 5 --maxLevel 26 --minSplits $NUM_PARTITIONS --maxSplits $NUM_PARTITIONS --coverageName gdeltevent_kde --hdfsHostPort ${HOSTNAME}:${HDFS_PORT} --jobSubmissionHostPort ${HOSTNAME}:${RESOURCE_MAN_PORT} --tileSize 1 gdelt-accumulo gdelt-accumulo-out
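The KDE command passes two host:port pairs (--hdfsHostPort and --jobSubmissionHostPort) built from $HDFS_PORT and $RESOURCE_MAN_PORT, neither of which is defined on this page. An empty port silently produces a value like `host:`. The sketch below validates a pair before the job is submitted. `host_port` is a hypothetical helper name, and the 1 to 65535 range check is a general TCP assumption rather than anything specific to GeoWave.

```shell
#!/usr/bin/env bash
# Print "host:port" if host is non-empty and port is an integer in
# 1..65535; otherwise print an error and return non-zero.
host_port() {
    local host="${1:-}" port="${2:-}"
    case "$port" in
        ''|*[!0-9]*)
            echo "port must be a positive integer, got '$port'" >&2
            return 1
            ;;
    esac
    if [ -z "$host" ] || [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
        echo "invalid host/port pair: '$host' '$port'" >&2
        return 1
    fi
    printf '%s:%s' "$host" "$port"
}

# Possible use before the hadoop jar call (not executed here):
# HDFS_ADDR=$(host_port "$HOSTNAME" "$HDFS_PORT") || exit 1
# RM_ADDR=$(host_port "$HOSTNAME" "$RESOURCE_MAN_PORT") || exit 1
```

The two resulting values can be supplied to --hdfsHostPort and --jobSubmissionHostPort respectively.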