# Splice Machine Python Package

## Installation Instructions: with pip

`(sudo) pip install git+https://github.com/splicemachine/pysplice --process-dependency-links`

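If you need a specific release rather than the latest default-branch code, pip's `git+` URL syntax also accepts a ref after `@` (the tag below is a placeholder, not a confirmed release name):

`(sudo) pip install git+https://github.com/splicemachine/pysplice@<RELEASE_TAG> --process-dependency-links`
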
## Installation Instructions: Zeppelin (replace each bracketed placeholder with a correctly formatted value)
```
%spark.pyspark
# use %spark.pyspark in Zeppelin 0.8; use %pyspark in Zeppelin 0.7.3 and earlier
import os

# install pip without sudo
os.system("wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py --user")

# install the splicemachine python package
os.system("""cd ~/.local/bin && wget https://github.com/splicemachine/pysplice/archive/[pysplice RELEASE_VERSION].zip &&
    unzip [pysplice RELEASE_VERSION].zip &&
    cd pysplice-[pysplice RELEASE_VERSION] &&
    ~/.local/bin/pip install . --user""")
```

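A quick sanity check (a hypothetical follow-up Zeppelin paragraph, assuming the install above succeeded):

```
%spark.pyspark
import splicemachine
print(splicemachine.__file__)  # should point somewhere under ~/.local
```
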
## Modules

There are currently two modules in the Python splicemachine package: `splicemachine.spark.context` and `splicemachine.ml.zeppelin`.

`splicemachine.spark.context` contains one extremely important class that helps you interact with our database through Spark. The `splicemachine.spark.context.PySpliceContext` class is our native Spark data source implemented in Python. Among other functions, it currently offers these (see the examples below):

1. `df`: turn the results of a SQL query into a DataFrame
2. `tableExists`: return a boolean indicating whether or not a table exists
3. `getConnection`: get a connection to the database (used to renew the connection)
4. `insert`: insert a DataFrame into a table
5. `delete`: delete records in a table by joining on primary keys from the DataFrame
6. `update`: update records in a table by joining on primary keys from the DataFrame
7. `dropTable`: drop a table from the database
8. `getSchema`: return the schema of a table in the database

You can find the source code for this module at `splicemachine/spark/context.py` in the [pysplice repository](https://github.com/splicemachine/pysplice).

```
Usage:
%spark.pyspark
from splicemachine.spark.context import PySpliceContext
splice = PySpliceContext('jdbc:splice://<SOME FRAMEWORK NAME>.splicemachine.io:1527/splicedb;ssl=basic;user=<USER>;password=<PASSWORD>', sqlContext)
my_dataframe = splice.df('SELECT * FROM DEMO.BAKING_CUPCAKES')
filtered_df = my_dataframe.filter(my_dataframe.FLAVOR == 'red_velvet')
# Assume you have created a table with a schema that conforms to filtered_df
splice.insert(filtered_df, 'DEMO.FILTERED_CUPCAKES')
```

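A minimal sketch of some of the other methods listed above, continuing from the `splice` context in the previous example; the exact signatures here are assumptions inferred from the method descriptions, not confirmed API:

```
%spark.pyspark
# assumed: tableExists and getSchema take a schema-qualified table name string
if splice.tableExists('DEMO.FILTERED_CUPCAKES'):
    print(splice.getSchema('DEMO.FILTERED_CUPCAKES'))
    # assumed: update and delete mirror insert's (dataframe, table name) signature,
    # matching rows on the table's primary keys
    splice.update(filtered_df, 'DEMO.FILTERED_CUPCAKES')
    splice.delete(filtered_df, 'DEMO.FILTERED_CUPCAKES')
    # drop the table when you no longer need it
    splice.dropTable('DEMO.FILTERED_CUPCAKES')
```
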
The `splicemachine.ml.zeppelin` module, on the other hand, offers machine learning utilities for use in Splice Machine's Zeppelin notebooks. Some of these functions are written specifically for users of the MLFlow Splice Machine Lifecycle System; others are generic helpers for PySpark MLlib.

Here are the functions it offers:

### MLFlow Run Wrapper - cross-paragraph logging
Methods:
1. `Run.create_new_run()`: remove the current run and create a new one
2. `Run.log_param(key: string, value: string)`: log a key/value parameter to MLFlow
3. `Run.log_metric(key: string, metric: numeric)`: log a metric to MLFlow
4. `Run.log_model(fittedPipeline: fitted Pipeline object)`: log a fitted pipeline for later deployment to SageMaker
5. `Run.log_artifact(run_relative_path: string)`: log a file to MLFlow (a decision tree visualization, model logs, etc.)

```
Usage:
from splicemachine.ml.zeppelin import Run

+---------Cell i-----------+
baking_run = Run()
baking_run.create_new_run()

+---------Cell i+1---------+
baking_run.log_param('dataset', 'baking')

+---------Cell i+2---------+
baking_run.log_metric('r2', 0.985)

+---------Cell i+3---------+
fittedPipe = pipeline.fit(baking_df)
baking_run.log_model(fittedPipe)

+---------Cell i+4---------+
baking_run.log_artifact('output.txt')
```

### Show Confusion Matrix - function that shows a nicely formatted confusion matrix for binary classification
1. `show_confusion_matrix(sc, sqlContext, TP, TN, FP, FN)`: `sc` and `sqlContext` are the Spark and SQL contexts from Zeppelin; `TP`, `TN`, `FP`, and `FN` are the counts of true positives, true negatives, false positives, and false negatives

```
Usage:
from splicemachine.ml.zeppelin import show_confusion_matrix
TP = 3
TN = 4
FP = 2
FN = 0
show_confusion_matrix(sc, sqlContext, TP, TN, FP, FN)
---> your confusion matrix will be printed to stdout
```

### Experiment Maker - function that creates an MLFlow experiment or uses an existing one
1. `experiment_maker(experiment_name: string)`: create the named experiment if it does not already exist, otherwise use it

```
Usage:
from splicemachine.ml.zeppelin import experiment_maker
import mlflow

mlflow.set_tracking_uri('/mlruns')  # so syncing will work
experiment_name = z.input('Experiment name')
experiment_maker(experiment_name)
```

### Model Evaluator - class that evaluates binary classification models written in PySpark
Methods:
1. `__init__(label_column='label', prediction_column='prediction', confusion_matrix=True)`: `label_column` and `prediction_column` name the label and prediction columns of the input DataFrames; set `confusion_matrix=False` to suppress the confusion matrix shown after each input
2. `ModelEvaluator.setup_contexts(sc, sqlContext)`: set up the Spark and SQL contexts from Zeppelin (required before use)
3. `ModelEvaluator.input(predictions_dataframe)`: input a new DataFrame of rows with the label and prediction columns; its metrics are averaged into the results
4. `ModelEvaluator.get_results(output_type)`: return the calculated metrics in either `'dataframe'` or `'dict'` format

Here are the metrics it calculates:
```
computed_metrics = {
    'TPR': float(TP) / (TP + FN),
    'SPC': float(TN) / (TN + FP),
    'PPV': float(TP) / (TP + FP),
    'NPV': float(TN) / (TN + FN),
    'FPR': float(FP) / (FP + TN),
    'FDR': float(FP) / (FP + TP),
    'FNR': float(FN) / (FN + TP),
    'ACC': float(TP + TN) / (TP + FN + FP + TN),
    'F1': float(2 * TP) / (2 * TP + FP + FN),
    'MCC': float(TP * TN - FP * FN) / np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)),
}
```

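To make the formulas concrete, here is a small self-contained sketch (not part of the package) that evaluates a few of them with the counts from the Show Confusion Matrix example above:

```
import numpy as np

# counts from the show_confusion_matrix example: TP=3, TN=4, FP=2, FN=0
TP, TN, FP, FN = 3, 4, 2, 0

print('TPR (recall)      =', float(TP) / (TP + FN))                 # 3/3 = 1.0
print('SPC (specificity) =', float(TN) / (TN + FP))                 # 4/6 ~ 0.667
print('PPV (precision)   =', float(TP) / (TP + FP))                 # 3/5 = 0.6
print('ACC (accuracy)    =', float(TP + TN) / (TP + FN + FP + TN))  # 7/9 ~ 0.778
print('F1                =', float(2 * TP) / (2 * TP + FP + FN))    # 6/8 = 0.75
print('MCC               =', float(TP * TN - FP * FN) /
      np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))       # 12/sqrt(360) ~ 0.632
```
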
```
Usage:
from splicemachine.ml.zeppelin import ModelEvaluator
evaluator = ModelEvaluator()
evaluator.setup_contexts(sc, sqlContext)

CV_ITERATIONS = 5

for _ in range(CV_ITERATIONS):
    # do a train/test split, fit the pipeline, predict, and get the predictions dataframe
    evaluator.input(predictions_dataframe)

evaluator.get_results('dataframe').show()
```

### Decision Tree Visualizer - class that allows you to visualize binary/multiclass decision tree classification models in PySpark
Methods:
1. `DecisionTreeVisualizer.visualize(model, feature_column_names, label_classes, visual=True)`: `model` is a fitted DecisionTreeClassifier model; `feature_column_names` is a list in the order of the features in your VectorAssembler; `label_classes` is a list in the order of your classes; `visual=True` produces a PNG via graphviz, while `visual=False` produces a code-like text structure

```
Usage:
from splicemachine.ml.zeppelin import DecisionTreeVisualizer

myfittedDecisionTreeClassifier = unfittedClf.fit(df)
DecisionTreeVisualizer.visualize(myfittedDecisionTreeClassifier, ['flavor', 'color', 'frosting'], ['juicy', 'not juicy'], True)
--> You can see your tree at this URL: <SOME URL>
```