Commit 0c7ce3d (2 parents: 1ec744c + bdecfa0): "i forgot to pull before i made changes"

1 file changed: README.md (28 additions, 10 deletions)
@@ -19,9 +19,11 @@

## Modules
There are two modules inside the Python splicemachine package currently: `splicemachine.spark.context` and `splicemachine.ml.zeppelin`.

The `splicemachine.spark.context` module contains two extremely important classes that help you interact with our database through Spark. You can find the source code for this module at https://github.com/splicemachine/pysplice/blob/master/splicemachine/spark/context.py

The `splicemachine.spark.context.PySpliceContext` class is our native Spark data source implemented in Python. Note: on the Splice Machine Cloud Service (Zeppelin notebooks), you should use the `SpliceCloudContext` class instead. Currently, `PySpliceContext` offers these functions, among others:

1. df: turn the results of a SQL query into a dataframe
2. tableExists: return a boolean indicating whether or not a table exists
3. getConnection: get a connection to the database (used to renew the connection)
@@ -30,13 +32,28 @@

6. update: update records in a table by joining on primary keys from the dataframe
7. dropTable: drop a table from the database
8. getSchema: return the schema of a table from the database
9. upsert: upsert records into a table

```
Usage:
%spark.pyspark
from splicemachine.spark.context import PySpliceContext

# takes in a JDBC URL and a Spark session
splice = PySpliceContext('jdbc:splice://<SOME FRAMEWORK NAME>.splicemachine.io:1527/splicedb;ssl=basic;user=<USER>;password=<PASSWORD>', spark)
my_dataframe = splice.df('SELECT * FROM DEMO.BAKING_CUPCAKES')
filtered_df = my_dataframe.filter(my_dataframe.FLAVOR == 'red_velvet')
# Assume you have created a table with a schema that conforms to filtered_df
splice.insert(filtered_df, 'DEMO.FILTERED_CUPCAKES')
```
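Functions 6 through 9 in the list are not covered by the usage block above, so here is a minimal sketch of how those calls might look, reusing `splice` and `filtered_df` from the example. The exact signatures are assumptions inferred from the function list, not documented APIs; check them against context.py.

```
%spark.pyspark
# A sketch reusing `splice` and `filtered_df` from the example above; the
# argument shapes here are assumptions based on the function list, not
# documented signatures.
print(splice.getSchema('DEMO.FILTERED_CUPCAKES'))         # inspect the table's schema

if splice.tableExists('DEMO.FILTERED_CUPCAKES'):
    splice.update(filtered_df, 'DEMO.FILTERED_CUPCAKES')  # match rows on primary keys, then update
    splice.upsert(filtered_df, 'DEMO.FILTERED_CUPCAKES')  # insert new rows, update existing ones

splice.dropTable('DEMO.FILTERED_CUPCAKES')                # drop the table when finished
```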

The `splicemachine.spark.context.SpliceCloudContext` class is our native Spark data source implemented in Python for use with the cloud service. Although you can use the regular PySpliceContext on the cloud service, SpliceCloudContext provides ease of use (automatic discovery of the JDBC URL, H2O support, etc.).

```
Usage:
%spark.pyspark
from splicemachine.spark.context import SpliceCloudContext

# a Spark session is already created on Zeppelin startup, so enter this code
# exactly as written here
splice = SpliceCloudContext(spark, use_h2o=True)
splice.hc  # the H2O context

# you can also use it like a regular PySpliceContext
my_dataframe = splice.df('SELECT * FROM DEMO.BAKING_CUPCAKES')
filtered_df = my_dataframe.filter(my_dataframe.FLAVOR == 'red_velvet')
# Assume you have created a table with a schema that conforms to filtered_df
splice.insert(filtered_df, 'DEMO.FILTERED_CUPCAKES')
```
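Because `use_h2o=True` attaches an H2O context at `splice.hc`, a natural next step is moving a Splice query result into H2O. The sketch below assumes `splice.hc` behaves like a pysparkling `H2OContext`, whose `asH2OFrame` converts a Spark dataframe; that method name is an assumption, not something this README documents.

```
%spark.pyspark
# A sketch, assuming `splice = SpliceCloudContext(spark, use_h2o=True)` from above
# and that splice.hc is a pysparkling H2OContext (asH2OFrame is an assumed API).
cupcakes_df = splice.df('SELECT * FROM DEMO.BAKING_CUPCAKES')
cupcakes_h2o = splice.hc.asH2OFrame(cupcakes_df)  # hand the Spark dataframe to H2O
cupcakes_h2o.describe()                           # summarize the frame on the H2O side
```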

@@ -79,7 +96,7 @@

### Show Confusion Matrix - Function that shows a nicely formatted confusion matrix for binary classification
1. show_confusion_matrix(spark: Spark session from Zeppelin, TP: true positives, TN: true negatives, FP: false positives, FN: false negatives)
```
Usage:
from splicemachine.ml.zeppelin import show_confusion_matrix
```
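The diff cuts the usage block off after the import, so a complete call is sketched below with made-up counts; the positional argument order follows the signature above.

```
%spark.pyspark
from splicemachine.ml.zeppelin import show_confusion_matrix

# Hypothetical counts from some binary classifier's test set,
# passed in the order given by the signature above: TP, TN, FP, FN.
show_confusion_matrix(spark, 85, 90, 10, 15)
```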

@@ -104,10 +121,9 @@

```
experiment_name = z.input('Experiment name')
experiment_maker(experiment_name)
```

### SpliceBinaryClassificationEvaluator -- Class that Evaluates Binary Classification Models written in PySpark
Methods:
1. `__init__`(spark: Spark session from Zeppelin, label_column='label': string (input dataframe label col), prediction_column='prediction': string (input dataframe pred col), confusion_matrix=True: bool (show confusion matrix after each input df?))
2. ModelEvaluator.input(predictions_dataframe: dataframe containing rows with the label col and prediction col) # input a new run to average
3. ModelEvaluator.get_results(output_type: 'dataframe'/'dict') # print the calculated metrics in either dataframe or dict format
Here are the metrics this calculates:

@@ -129,8 +145,7 @@

```
Usage:
from splicemachine.ml.zeppelin import ModelEvaluator
evaluator = ModelEvaluator(spark)

CV_ITERATIONS = 5
```
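The usage block above is truncated right after `CV_ITERATIONS = 5`; the sketch below shows the kind of loop it appears to set up, where `clf` is some unfitted PySpark classifier and `get_split` is a hypothetical helper that yields train/test splits.

```
%spark.pyspark
# A sketch of the truncated loop; `clf` and `get_split` are hypothetical.
for i in range(CV_ITERATIONS):
    train, test = get_split(i)                    # hypothetical train/test split helper
    predictions = clf.fit(train).transform(test)  # dataframe with 'label' and 'prediction' cols
    evaluator.input(predictions)                  # feed this run into the running average

evaluator.get_results('dataframe')                # averaged metrics across all runs
```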

@@ -145,12 +160,15 @@

### Decision Tree Visualizer - Class that allows you to visualize binary/multiclass Decision Tree classification models in PySpark
Methods:
1. DecisionTreeVisualizer.visualize(model: fitted DecisionTreeClassifier model, feature_column_names: list (in the order of the features in your VectorAssembler), label_classes: list (in the order of your classes), visual=True: bool (True for PNG output via graphviz, False for a code-like text structure))
2. DecisionTreeVisualizer.feature_importance(spark: Spark session from Zeppelin, model: fitted machine learning model, dataset: dataframe containing your preprocessed features (from your VectorAssembler), featuresCol: column containing your feature vector)

```
Usage:
from splicemachine.ml.zeppelin import DecisionTreeVisualizer

myfittedDecisionTreeClassifier = unfittedClf.fit(df)
DecisionTreeVisualizer.visualize(myfittedDecisionTreeClassifier, ['flavor', 'color', 'frosting'], ['juicy', 'not juicy'], True)

DecisionTreeVisualizer.feature_importance(spark, myfittedDecisionTreeClassifier, training_dataframe, "features")
--> You can see your tree at this URL: <SOME URL>
```
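To make the inputs to these two methods concrete, here is a minimal end-to-end sketch; the cupcake feature columns and the VectorAssembler wiring are illustrative assumptions (the feature columns are taken to be numeric already), not part of this README.

```
%spark.pyspark
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from splicemachine.ml.zeppelin import DecisionTreeVisualizer

# Illustrative wiring: assemble (already numeric) feature columns into a vector.
# The column order here must match the list later passed to visualize.
assembler = VectorAssembler(inputCols=['flavor', 'color', 'frosting'], outputCol='features')
training_dataframe = assembler.transform(df)  # df: your labeled cupcake dataframe

unfittedClf = DecisionTreeClassifier(labelCol='label', featuresCol='features')
myfittedDecisionTreeClassifier = unfittedClf.fit(training_dataframe)

DecisionTreeVisualizer.visualize(myfittedDecisionTreeClassifier,
                                 ['flavor', 'color', 'frosting'],
                                 ['juicy', 'not juicy'],
                                 True)
DecisionTreeVisualizer.feature_importance(spark, myfittedDecisionTreeClassifier,
                                          training_dataframe, 'features')
```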