# Splice Machine Python Package

## Installation Instructions: with Pip

`(sudo) pip install git+https://github.com/splicemachine/pysplice --process-dependency-links`

## Installation Instructions: Zeppelin ([x] means substitute the correct value for x)

```
%spark.pyspark (Zeppelin 0.8 and newer) or %pyspark (Zeppelin 0.7.x)
import os

# install pip without sudo
os.system("wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py --user")

# install the splicemachine python package
os.system("""cd ~/.local/bin && wget https://github.com/splicemachine/pysplice/archive/[pysplice RELEASE_VERSION].zip &&
unzip [pysplice RELEASE_VERSION].zip &&
cd pysplice-[pysplice RELEASE_VERSION] &&
./pip install . --user""")
```
## Modules

The splicemachine Python package currently contains two modules: `splicemachine.spark.context` and `splicemachine.ml.zeppelin`.

`splicemachine.spark.context` contains the key class for interacting with the Splice Machine database from Spark.

The `splicemachine.spark.context.PySpliceContext` class is our native Spark data source implemented in Python. Among other functions, it currently offers:
1. df: return the results of a SQL query as a Spark dataframe
2. tableExists: return a boolean indicating whether or not a table exists
3. getConnection: get a connection to the database (used to renew the connection)
4. insert: insert a dataframe into a table
5. delete: delete records in a table, joining on primary keys from the dataframe
6. update: update records in a table, joining on primary keys from the dataframe
7. dropTable: drop a table from the database
8. getSchema: return the schema of a table from the database

You can find the source code for this module at https://github.com/splicemachine/pysplice/splicemachine/spark/context.py

```
Usage:
%spark.pyspark
from splicemachine.spark.context import PySpliceContext

splice = PySpliceContext('jdbc:splice://<SOME FRAMEWORK NAME>.splicemachine.io:1527/splicedb;ssl=basic;user=<USER>;password=<PASSWORD>', sqlContext)
my_dataframe = splice.df('SELECT * FROM DEMO.BAKING_CUPCAKES')
filtered_df = my_dataframe.filter(my_dataframe.FLAVOR == 'red_velvet')

# Assume you have created a table with a schema that conforms to filtered_df
splice.insert(filtered_df, 'DEMO.FILTERED_CUPCAKES')
```
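
The example above only exercises `df` and `insert`. Below is a rough sketch of how the other methods listed above might be called, continuing with the `splice` and `filtered_df` variables from the previous block; the exact signatures are assumptions based on the descriptions, not confirmed here, so check `context.py` for the real ones.

```
%spark.pyspark
# Hypothetical sketch: signatures other than insert() are assumed from the descriptions above
if splice.tableExists('DEMO.FILTERED_CUPCAKES'):
    print(splice.getSchema('DEMO.FILTERED_CUPCAKES'))  # inspect the table's schema

    # update or delete rows, joining on the dataframe's primary keys
    splice.update(filtered_df, 'DEMO.FILTERED_CUPCAKES')
    splice.delete(filtered_df, 'DEMO.FILTERED_CUPCAKES')

    # drop the table once you are done with it
    splice.dropTable('DEMO.FILTERED_CUPCAKES')

# renew the database connection if it has gone stale
splice.getConnection()
```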
The `splicemachine.ml.zeppelin` package, on the other hand, offers machine learning utilities for use in Splice Machine's Zeppelin notebooks.
Some of these functions are written specifically for users of the MLFlow Splice Machine Lifecycle System, while others are generic utilities for PySpark MLlib.

Here are the functions it offers:

### MLFlow Run Wrapper - cross-paragraph logging
Methods:
1. Run.create_new_run: remove the current run and create a new one
2. Run.log_param(key: string, value: string): log a key/value parameter to MLFlow
3. Run.log_metric(key: string, metric: numeric): log a metric to MLFlow
4. Run.log_model(fittedPipeline: FittedPipeline object): log a fitted pipeline for later deployment to SageMaker
5. Run.log_artifact(run_relative_path: string (path)): log a file to MLFlow (decision tree visualization, model logs, etc.)
```
Usage:
from splicemachine.ml.zeppelin import Run

+---------Cell i-----------+
baking_run = Run()
baking_run.create_new_run

+---------Cell i+1---------+
baking_run.log_param('dataset', 'baking')

+---------Cell i+2---------+
baking_run.log_metric('r2', 0.985)

+---------Cell i+3---------+
fittedPipe = pipeline.fit(baking_df)
baking_run.log_model(fittedPipe)

+---------Cell i+4---------+
baking_run.log_artifact('output.txt')
```
### Show Confusion Matrix - Function that shows a nicely formatted confusion matrix for binary classification
1. show_confusion_matrix(sc: Spark context from Zeppelin, sqlContext: SQL context from Zeppelin, TP: true positives, TN: true negatives, FP: false positives, FN: false negatives)

```
Usage:
from splicemachine.ml.zeppelin import show_confusion_matrix

TP = 3
TN = 4
FP = 2
FN = 0
show_confusion_matrix(sc, sqlContext, TP, TN, FP, FN)
---> your confusion matrix will be printed to stdout
```
### Experiment Maker - Function that creates or uses an MLFlow experiment
1. experiment_maker(experiment_name: string): create or use the MLFlow experiment with the given name

```
Usage:
from splicemachine.ml.zeppelin import experiment_maker
import mlflow

mlflow.set_tracking_uri('/mlruns') # so syncing will work
experiment_name = z.input('Experiment name')
experiment_maker(experiment_name)
```
### Model Evaluator - Class that evaluates binary classification models written in PySpark
Methods:
1. __init__(label_column='label': string (input dataframe label column), prediction_column='prediction': string (input dataframe prediction column), confusion_matrix=True: bool (show a confusion matrix after each input dataframe?))
2. ModelEvaluator.setup_contexts(sc: Spark context from Zeppelin, sqlContext: SQL context from Zeppelin) (required to run): set up the Spark contexts
3. ModelEvaluator.input(predictions_dataframe: dataframe (containing rows with the label column and prediction column)): input a new run to average
4. ModelEvaluator.get_results(output_type: 'dataframe'/'dict'): get the calculated metrics in either dataframe or dict format

Here are the metrics this calculates:
```
computed_metrics = {
    'TPR': float(TP) / (TP + FN),
    'SPC': float(TN) / (TN + FP),  # specificity
    'PPV': float(TP) / (TP + FP),
    'NPV': float(TN) / (TN + FN),
    'FPR': float(FP) / (FP + TN),
    'FDR': float(FP) / (FP + TP),
    'FNR': float(FN) / (FN + TP),
    'ACC': float(TP + TN) / (TP + FN + FP + TN),
    'F1': float(2 * TP) / (2 * TP + FP + FN),
    'MCC': float(TP * TN - FP * FN) / np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)),
}
```
```
Usage:
from splicemachine.ml.zeppelin import ModelEvaluator

evaluator = ModelEvaluator()
evaluator.setup_contexts(sc, sqlContext)

CV_ITERATIONS = 5

for _ in range(CV_ITERATIONS):
    # do a train/test split, fit the pipeline, predict, and get predictions_dataframe
    evaluator.input(predictions_dataframe)

evaluator.get_results('dataframe').show()
```
### Decision Tree Visualizer - Class that allows you to visualize binary/multiclass Decision Tree classification models in PySpark
Methods:
1. DecisionTreeVisualizer.visualize(model: fitted DecisionTreeClassifier model, feature_column_names: list (in the order the features were passed to your VectorAssembler), label_classes: list (in the order of your classes), visual=True: bool (True for PNG output via graphviz, False for a code-like text structure))

```
Usage:
from splicemachine.ml.zeppelin import DecisionTreeVisualizer

myfittedDecisionTreeClassifier = unfittedClf.fit(df)
DecisionTreeVisualizer.visualize(myfittedDecisionTreeClassifier, ['flavor', 'color', 'frosting'], ['juicy', 'not juicy'], True)
--> You can see your tree at this URL: <SOME URL>
```