## Modules

There are currently two modules inside the Python splicemachine package: `splicemachine.spark.context` and `splicemachine.ml.zeppelin`.

The `splicemachine.spark.context` module contains two extremely important classes that help you interact with our database through Spark.
You can find the source code for this module at https://github.com/splicemachine/pysplice/splicemachine/spark/context.py

The `splicemachine.spark.context.PySpliceContext` class is our native Spark data source implemented in Python. **You should use the `SpliceCloudContext` class instead when working with the Splice Machine Cloud Service (Zeppelin notebooks).** Currently, it offers these functions, among others:

1. `df`: turn the results of a SQL query into a dataframe
2. `tableExists`: returns a boolean indicating whether or not a table exists
3. `getConnection`: get a connection to the database (used to renew the connection)
6. `update`: update records in a table by joining on primary keys from the dataframe
7. `dropTable`: drop a table from the database
8. `getSchema`: return the schema of a table from the database
9. `upsert`: upsert records into a table
```
Usage:
%spark.pyspark
from splicemachine.spark.context import PySpliceContext
splice = PySpliceContext('jdbc:splice://<SOME FRAMEWORK NAME>.splicemachine.io:1527/splicedb;ssl=basic;user=<USER>;password=<PASSWORD>', spark) # takes in a JDBC URL and a Spark context
my_dataframe = splice.df('SELECT * FROM DEMO.BAKING_CUPCAKES')
```
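
The usage above only exercises `df`; here is a minimal sketch of a few of the other functions from the list. The argument shapes (for example, passing a dataframe plus a `'SCHEMA.TABLE'` string to `update` and `upsert`) are assumptions drawn from the descriptions above, not confirmed signatures -- check `context.py` for the exact API.
```
%spark.pyspark
# Hedged sketch: the method names come from the list above, but the exact
# signatures are assumptions -- verify against splicemachine/spark/context.py.
if splice.tableExists('DEMO.BAKING_CUPCAKES'):           # boolean: does the table exist?
    print(splice.getSchema('DEMO.BAKING_CUPCAKES'))      # schema of the table
    splice.update(my_dataframe, 'DEMO.BAKING_CUPCAKES')  # update rows by joining on primary keys
    splice.upsert(my_dataframe, 'DEMO.BAKING_CUPCAKES')  # upsert rows into the table
```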

The `splicemachine.spark.context.SpliceCloudContext` class is our native Spark data source implemented in Python for use with the cloud service. Although you can use the regular `PySpliceContext` on the cloud service, this class provides ease of use (automatic discovery of the JDBC URL, H2O support, etc.).

```
Usage:
%spark.pyspark
from splicemachine.spark.context import SpliceCloudContext
splice = SpliceCloudContext(spark, use_h2o=True) # the SparkSession is already created on Zeppelin startup, so enter this code exactly as written here
splice.hc # H2O context
# use it normally as well
my_dataframe = splice.df('SELECT * FROM DEMO.BAKING_CUPCAKES')
```

### SpliceBinaryClassificationEvaluator -- Class that Evaluates Binary Classification Models written in PySpark
Methods:
1. `__init__(spark, label_column='label', prediction_column='prediction', confusion_matrix=True)`: spark is the Spark session from Zeppelin; label_column and prediction_column name the input dataframe's label and prediction columns; confusion_matrix controls whether a confusion matrix is shown after each input dataframe
2. `ModelEvaluator.input(predictions_dataframe)`: input a new run to average; the dataframe must contain rows with the label column and the prediction column
3. `ModelEvaluator.get_results(output_type)`: return the calculated metrics as either a 'dataframe' or a 'dict'

Here are the metrics this calculates:
```
Usage:
from splicemachine.ml.zeppelin import ModelEvaluator
evaluator = ModelEvaluator(spark)

CV_ITERATIONS = 5
```
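
For a sense of how `input` and `get_results` fit together, here is a minimal sketch of a cross-validation loop; the classifier choice, split ratios, and dataframe names are illustrative assumptions, not part of the package's API.
```
%spark.pyspark
# Hedged sketch of a cross-validation loop: the classifier, the split, and
# my_dataframe are illustrative assumptions.
from pyspark.ml.classification import LogisticRegression

for _ in range(CV_ITERATIONS):
    train, test = my_dataframe.randomSplit([0.8, 0.2])  # fresh random split per run
    model = LogisticRegression().fit(train)             # any binary classifier producing a 'prediction' column works
    predictions = model.transform(test)                 # dataframe with label and prediction columns
    evaluator.input(predictions)                        # accumulate this run's metrics

evaluator.get_results('dict')                           # averaged metrics across all runs
```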
### Decision Tree Visualizer - Class that allows you to visualize binary/multiclass Decision Tree classification models in PySpark
Methods:
1. `DecisionTreeVisualizer.visualize(model, feature_column_names, label_classes, visual=True)`: model is a fitted DecisionTreeClassifier; feature_column_names is a list in the order of the features included in your VectorAssembler; label_classes is a list in the order of your classes; visual=True renders a PNG via graphviz, while visual=False returns a code-like text structure
2. `DecisionTreeVisualizer.feature_importance(spark, model, dataset, featuresCol)`: spark is the Spark session from Zeppelin; model is a fitted machine learning model; dataset is a dataframe containing your preprocessed features (from VectorAssembler); featuresCol names the column containing your feature vector

```
Usage:
from splicemachine.ml.zeppelin import DecisionTreeVisualizer
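# Hedged sketch based on the method descriptions above: fitted_dt_model,
# assembled_df, and the feature/class names are illustrative assumptions,
# and the exact argument order should be verified against the source.
tree = DecisionTreeVisualizer.visualize(fitted_dt_model,               # fitted DecisionTreeClassifier model
                                        ['flour_cups', 'sugar_cups'],  # feature names, in VectorAssembler order
                                        ['not_cupcake', 'cupcake'],    # label classes, in order
                                        visual=True)                   # True -> PNG via graphviz
importances = DecisionTreeVisualizer.feature_importance(spark, fitted_dt_model, assembled_df, featuresCol='features')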
```