Funnel analysis is a method for tracking user conversion rates across actions. This enables detection of actions causing high user fallout.
These Hive UDFs enables funnel analysis to be performed simply and easily on any Hive table.
Maven is required to build the funnel UDFs.
There is a provided Makefile with all the build targets.
make jarThis creates a funnel.jar in the target/ directory.
To use the funnel UDFs, you need to register it with Hive.
With temporary functions:
ADD JAR funnel.jar;
CREATE TEMPORARY FUNCTION funnel            AS 'com.yahoo.hive.udf.funnel.Funnel';
CREATE TEMPORARY FUNCTION funnel_merge      AS 'com.yahoo.hive.udf.funnel.Merge';
CREATE TEMPORARY FUNCTION funnel_conversion AS 'com.yahoo.hive.udf.funnel.Conversion';
CREATE TEMPORARY FUNCTION funnel_fallout    AS 'com.yahoo.hive.udf.funnel.Fallout';With permenant functions you need to put the JAR on HDFS, and it will be registered with a database (you have to replace DATABASE and PATH_TO_JAR with your values):
CREATE FUNCTION DATABASE.funnel            AS 'com.yahoo.hive.udf.funnel.Funnel'  USING JAR 'hdfs:///PATH_TO_JAR/funnel.jar';
CREATE FUNCTION DATABASE.funnel_merge      AS 'com.yahoo.hive.udf.funnel.Merge'   USING JAR 'hdfs:///PATH_TO_JAR/funnel.jar';
CREATE FUNCTION DATABASE.funnel_conversion AS 'com.yahoo.hive.udf.funnel.Conversion' USING JAR 'hdfs:///PATH_TO_JAR/funnel.jar';
CREATE FUNCTION DATABASE.funnel_fallout    AS 'com.yahoo.hive.udf.funnel.Fallout' USING JAR 'hdfs:///PATH_TO_JAR/funnel.jar';There are four funnel UDFs provided: funnel,
funnel_merge, funnel_conversion,
funnel_fallout.
The funnel UDF outputs an array of longs showing conversion rates
across the provided funnel steps.
The funnel_merge UDF merges multiple arrays of longs by
adding them together.
The funnel_conversion UDF takes a raw count funnel result and
converts it to the conversion rate.
The funnel_fallout UDF takes a raw count funnel result and
converts it to the fallout rate.
There is no need to sort the data on timestamp, the UDF will take care of it. If there is a collision in the timestamps, it then sorts on the action column.
funnel(action_column, timestamp_column, array(funnel_1_a, funnel_1_b), array(funnel_2), ...)
- Builds a funnel report applied to the action_column, sorted by thetimestamp_column.
- The funnel steps are arrays of the same type as the actioncolumn. This allows for multiple matches to move to the next funnel.- For example, funnel_1 could be array('register_button', 'facebook_invite_register'). The funnel will match the first occurence of either of these actions and proceed to the next funnel.
- Or, funnel_1 could just be array('register_button').
 
- For example, funnel_1 could be 
- You can have an arbitrary number of funnels.
- The timestamp_columncan be of any comparable type (Strings, Integers, Dates, etc).
funnel_merge(funnel_column)
- Merges funnels. Use with funnel UDF.
funnel_conversion(funnel_column)
- Converts the result of a funnel_mergeto a conversion rate. Use with funnel and funnel_merge UDF.
- For example, a result from funnel_mergecould look like[245, 110, 54, 13]. This is result is in raw counts. If we pass this throughfunnel_conversionthen it would look like[1.0, 0.44, 0.49, 0.24].
funnel_fallout(funnel_column)
- Converts the result of a funnel_mergeto a fallout rate. Use with funnel and funnel_merge UDF.
- For example, a result from funnel_mergecould look like[245, 110, 54, 13]. This is result is in raw counts. If we pass this throughfunnel_falloutthen it would look like[0.0, 0.55, 0.50, 0.75].
Older versions of Hive have known security issues. Keep the following issues in mind when deciding what Hive version to use when building the UDFs. Use the following steps to mitigate these issues, or update to Hive 2.3.4 to avoid all issues at once.
In Apache Hive 2.3.3, 3.1.0 and earlier, local resources on HiveServer2 machines are not properly protected against malicious user if ranger, sentry or sql standard authorizer is not in use.
Update pom.xml to use Hive 2.3.4.
In Apache Hive 0.6.0 to 2.3.2, malicious user might use any xpath UDFs (xpath/xpath_string/xpath_boolean/xpath_number/xpath_double/xpath_float/xpath_long/xpath_int/xpath_short) to expose the content of a file on the machine running HiveServer2 owned by HiveServer2 user (usually hive) if hive.server2.enable.doAs=false.
Update pom.xml to use Hive 2.3.3 or do not set hive.server2.enable.doAs to false.
The authorization framework in Apache Hive 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0 and 1.2.1, on clusters protected by Ranger and SqlStdHiveAuthorization, allows attackers to bypass intended parent table access restrictions via unspecified partition-level operations.
Update pom.xml to use Hive 1.2.2.
Assume a table user_data:
| action | timestamp | user_id | gender | 
|---|---|---|---|
| signup_page | 100 | 1 | f | 
| confirm_button | 200 | 1 | f | 
| submit_button | 300 | 1 | f | 
| signup_page | 200 | 2 | m | 
| submit_button | 400 | 2 | m | 
| signup_page | 100 | 3 | f | 
| confirm_button | 200 | 3 | f | 
| decline | 200 | 3 | f | 
| ... | ... | ... | ... | 
SELECT funnel_merge(funnel)
FROM (SELECT funnel(action, timestamp, array('signup_page', 'email_signup'),
                                       array('confirm_button'),
                                       array('submit_button')) AS funnel
      FROM user_data
      GROUP BY user_id) t1;Result: [3, 2, 1]
SELECT funnel_conversion(funnel_merge(funnel))
FROM (SELECT funnel(action, timestamp, array('signup_page'),
                                       array('confirm_button'),
                                       array('submit_button')) AS funnel
      FROM user_data
      GROUP BY user_id) t1;Result: [1.0, 0.66, 0.5]
SELECT gender, funnel_merge(funnel)
FROM (SELECT gender,
             funnel(action, timestamp, array('signup_page'),
                                       array('confirm_button'),
                                       array('submit_button')) AS funnel
      FROM table
      GROUP BY user_id, gender) t1
GROUP BY gender;Result: m: [1, 0, 0], f: [2, 2, 1]
SELECT funnel_merge(funnel1), funnel_merge(funnel2)
FROM (SELECT funnel(action, timestamp, array('signup_page'),
                                       array('confirm_button'),
                                       array('submit_button')) AS funnel1
             funnel(action, timestamp, array('signup_page'),
                                       array('decline')) AS funnel2
      FROM table
      GROUP BY user_id) t1;Result: [3, 2, 1] [3, 1]
Josh Walters, [email protected]