This repository was archived by the owner on Apr 15, 2022. It is now read-only.

Commit 7660e39
Author: Ben Epstein
Parent: 66d0ff9

Feature store api cleanup (#96)

Squashed commit messages:

* cleanup of api
* add context (primary) key to describe feature sets
* optional verbose to print sql in create_training_context
* added get_feature_dataset
* comments
* old code
* i hate upppercase
* commment
* sql format
* i still hate uppercase
* null tx
* sql format (x2)
* docs (x15)
* verbose
* docs (x2)
* column ordering
* feature param cleanup
* training context features
* removed clean_df
* to_lower
* docs
* better logic (x2)
* label column validation
* refactor TrainingContext -> TrainingView, Feature Set Context Key -> Feature Set Join Key
* missed one
* exclude 2 more funcs
* docs
* as list
* missed some more
* hashable
* pep
* docs (x2)
* handleinvalid keep
* feature_vector_sql
* get-features_by_name requires names
* exclude members
* return Feature, docs fix

File tree: 18 files changed (+610, -241 lines)

.readthedocs.yaml (1 addition & 1 deletion)

@@ -21,4 +21,4 @@ formats:
 python:
   version: 3.7
   install:
-    - requirements: requirements.txt
+    - requirements: requirements-docs.txt

docs/conf.py (1 addition & 1 deletion)

@@ -46,7 +46,7 @@
     'private-members':True,
     'inherited-members':True,
     'undoc-members': False,
-    'exclude-members': '_check_for_splice_ctx,_dropTableIfExists, _generateDBSchema,_getCreateTableSchema,_jstructtype,_spliceSparkPackagesName,_splicemachineContext,apply_patches, main'
+    'exclude-members': '_validate_feature_vector_keys,_process_features,__prune_features_for_elimination,_register_metadata,_register_metadata,__update_deployment_status,__log_mlflow_results,__get_feature_importance,__get_pipeline,_validate_training_view,_validate_feature_set,_validate_feature,__validate_feature_data_type,_check_for_splice_ctx,_dropTableIfExists, _generateDBSchema,_getCreateTableSchema,_jstructtype,_spliceSparkPackagesName,_splicemachineContext,apply_patches, main'
 }

 # Add any paths that contain templates here, relative to this directory.
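For context, the keys in this hunk belong to Sphinx autodoc's default options. A minimal sketch of the enclosing setting in conf.py, assuming the standard autodoc_default_options variable (the hunk does not show the dict's name), with the member list abbreviated:

    # docs/conf.py (sketch) -- the name autodoc_default_options is an assumption;
    # only the dict entries appear in the hunk above.
    autodoc_default_options = {
        'private-members': True,
        'inherited-members': True,
        'undoc-members': False,
        # One comma-separated string; this commit adds the new private feature-store
        # helpers (e.g. _validate_training_view) so autodoc skips them.
        'exclude-members': '_validate_feature_vector_keys,_process_features,...',
    }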

docs/getting-started.rst (28 additions & 6 deletions)

@@ -13,11 +13,17 @@ If you are running inside of the Splice Machine Cloud Service in a Jupyter Noteb
 External Installation
 ---------------------
 
-If you would like to install outside of the K8s cluster (and use the ExtPySpliceContext), you can install with
+If you would like to install outside of the K8s cluster (and use the ExtPySpliceContext), you can install the stable build with
 
 .. code-block:: sh
 
-    sudo pip install pysplice
+    sudo pip install git+http://www.github.com/splicemachine/[email protected]
+
+Or latest with
+
+.. code-block:: sh
+
+    sudo pip install git+http://www.github.com/splicemachine/pysplice
 
 Usage
 -----
@@ -28,24 +34,40 @@ This section covers importing and instantiating the Native Spark DataSource
 
 .. tab:: Native Spark DataSource
 
-    To use the Native Spark DataSource inside of the cloud service, first create a Spark Session and then import your PySpliceContext
+    To use the Native Spark DataSource inside of the `cloud service<https://cloud.splicemachine.io/register?utm_source=pydocs&utm_medium=header&utm_campaign=sandbox>`_., first create a Spark Session and then import your PySpliceContext
 
     .. code-block:: Python
 
         from pyspark.sql import SparkSession
         from splicemachine.spark import PySpliceContext
+        from splicemachine.mlflow_support import * # Connects your MLflow session automatically
+        from splicemachine.features import FeatureStore # Splice Machine Feature Store
+
        spark = SparkSession.builder.getOrCreate()
-        splice = PySpliceContext(spark)
+        splice = PySpliceContext(spark) # The Native Spark Datasource (PySpliceContext) takes a Spark Session
+        fs = FeatureStore(splice) # Create your Feature Store
+        mlflow.register_splice_context(splice) # Gives mlflow native DB connection
+        mlflow.register_feature_store(fs) # Tracks Feature Store work in Mlflow automatically
+
 
 .. tab:: External Native Spark DataSource
 
-    To use the External Native Spark DataSource, create a Spark Session with your external Jars configured. Then, import your ExtPySpliceContext and set the necessary parameters
+    To use the External Native Spark DataSource, create a Spark Session with your external Jars configured. Then, import your ExtPySpliceContext and set the necessary parameters.
+    Once created, the functionality is identical to the internal Native Spark Datasource (PySpliceContext)
 
     .. code-block:: Python
 
        from pyspark.sql import SparkSession
        from splicemachine.spark import ExtPySpliceContext
+        from splicemachine.mlflow_support import * # Connects your MLflow session automatically
+        from splicemachine.features import FeatureStore # Splice Machine Feature Store
+
        spark = SparkSession.builder.config('spark.jars', '/path/to/splice_spark2-3.0.0.1962-SNAPSHOT-shaded.jar').config('spark.driver.extraClassPath', 'path/to/Splice/jars/dir/*').getOrCreate()
        JDBC_URL = '' #Set your JDBC URL here. You can get this from the Cloud Manager UI. Make sure to append ';user=<USERNAME>;password=<PASSWORD>' after ';ssl=basic' so you can authenticate in
-        kafka_server = 'kafka-broker-0-' + JDBC_URL.split('jdbc:splice://jdbc-')[1].split(':1527')[0] + ':19092' # Formatting kafka URL from JDBC
+        # The ExtPySpliceContext communicates with the database via Kafka
+        kafka_server = 'kafka-broker-0-' + JDBC_URL.split('jdbc:splice://jdbc-')[1].split(':1527')[0] + ':19092' # Formatting kafka URL from JDBC
        splice = ExtPySpliceContext(spark, JDBC_URL=JDBC_URL, kafkaServers=kafka_server)
+
+        fs = FeatureStore(splice) # Create your Feature Store
+        mlflow.register_splice_context(splice) # Gives mlflow native DB connection
+        mlflow.register_feature_store(fs) # Tracks Feature Store work in Mlflow automatically
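A hypothetical next step with the FeatureStore handle created above. The commit message mentions "get-features_by_name requires names", but this diff does not show that method's signature, so everything below is an assumption for illustration only:

    # Assumes fs = FeatureStore(splice) from the snippet above; the signature of
    # get_features_by_name (a list of names) is inferred from the commit message,
    # not from this diff.
    feats = fs.get_features_by_name(names=['AGE', 'INCOME'])  # hypothetical Feature names
    for f in feats:
        print(f)  # Feature.__str__ prints its ID, name, data type, type, and tags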

docs/splicemachine.features.rst (16 additions & 8 deletions)

@@ -19,15 +19,15 @@ Submodules
    :undoc-members:
    :show-inheritance:
 
-.. automodule:: splicemachine.features.training_context
+.. automodule:: splicemachine.features.training_view
    :members:
    :undoc-members:
    :show-inheritance:
 
 splicemachine.features.feature_store
 ----------------------------------
 
-This Module contains the classes adn APIs for interacting with the Splice Machine Feature Store.
+This Module contains the classes and APIs for interacting with the Splice Machine Feature Store.
 
 .. automodule:: splicemachine.features.feature_store
    :members:
@@ -37,30 +37,38 @@ This Module contains the classes adn APIs for interacting with the Splice Machin
 splicemachine.features.feature_set
 ----------------------------------
 
-This describes the Python representation of a Feature Set. A feature set is a database table that contains Features and their metadata
+This describes the Python representation of a Feature Set. A feature set is a database table that contains Features and their metadata.
+The Feature Set class is mostly used internally but can be used by the user to see the available Features in the given
+Feature Set, to see the table and schema name it is deployed to (if it is deployed), and to deploy the feature set
+(which can also be done directly through the Feature Store). Feature Sets are unique by their schema.table name, as they
+exist in the Splice Machine database as a SQL table. They are case insensitive.
+To see the full contents of your Feature Set, you can print, return, or .__dict__ your Feature Set object.
 
 .. automodule:: splicemachine.features.feature_set
    :members:
-   :undoc-members:
    :show-inheritance:
 
 
 splicemachine.features.Feature
 ----------------------------------
 
-This describes the Python representation of a Feature. A feature is a column of a table with particular metadata
+This describes the Python representation of a Feature. A Feature is a column of a Feature Set table with particular metadata.
+A Feature is the smallest unit in the Feature Store, and each Feature within a Feature Set is individually tracked for changes
+to enable full time travel and point-in-time consistent training datasets. Features' names are unique and case insensitive.
+To see the full contents of your Feature, you can print, return, or .__dict__ your Feature object.
 
 .. automodule:: splicemachine.features.feature
    :members:
    :undoc-members:
    :show-inheritance:
 
-splicemachine.features.training_context
+splicemachine.features.training_view
 ----------------------------------
 
-This describes the Python representation of a Training Context. A Training Context is a SQL statement defining an event of interest, and metadata around how to create a training dataset with that context
+This describes the Python representation of a Training View. A Training View is a SQL statement defining an event of interest, and metadata around how to create a training dataset with that view.
+To see the full contents of your Training View, you can print, return, or .__dict__ your Training View object.
 
-.. automodule:: splicemachine.features.training_context
+.. automodule:: splicemachine.features.training_view
    :members:
    :undoc-members:
    :show-inheritance:

requirements-docs.txt (19 additions & 0 deletions)

@@ -0,0 +1,19 @@
+py4j==0.10.7.0
+pytest
+mlflow==1.8.0
+pyyaml==5.3.1
+mleap==0.15.0
+graphviz==0.13
+requests
+gorilla==0.3.0
+tqdm==4.43.0
+pyspark-dist-explore==0.1.8
+numpy==1.18.2
+pandas==1.0.3
+scipy==1.4.1
+tensorflow==2.2.1
+pyspark
+h2o-pysparkling-2.4==3.28.1.2-1
+sphinx-tabs
+IPython
+cloudpickle==1.6.0

requirements.txt (0 additions & 1 deletion)

@@ -14,6 +14,5 @@ scipy==1.4.1
 tensorflow==2.2.1
 pyspark
 h2o-pysparkling-2.4==3.28.1.2-1
-sphinx-tabs
 IPython
 cloudpickle==1.6.0

splicemachine.egg-info/SOURCES.txt (2 additions & 2 deletions)

@@ -14,7 +14,7 @@ splicemachine/features/constants.py
 splicemachine/features/feature.py
 splicemachine/features/feature_set.py
 splicemachine/features/feature_store.py
-splicemachine/features/training_context.py
+splicemachine/features/training_view.py
 splicemachine/features/utils.py
 splicemachine/mlflow_support/__init__.py
 splicemachine/mlflow_support/constants.py
@@ -25,4 +25,4 @@ splicemachine/spark/constants.py
 splicemachine/spark/context.py
 splicemachine/spark/test/__init__.py
 splicemachine/spark/test/context_it.py
-splicemachine/spark/test/resources/__init__.py
+splicemachine/spark/test/resources/__init__.py

splicemachine/features/__init__.py (1 addition & 0 deletions)

@@ -1,3 +1,4 @@
 from .feature import Feature
 from .feature_set import FeatureSet
 from .feature_store import FeatureStore
+from .constants import FeatureType
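The newly re-exported FeatureType is the class that Feature.is_categorical and friends compare against (see the feature.py hunks below). A minimal sketch of what the new import enables, assuming FeatureType.categorical is a valid member (it is referenced in feature.py) and using the keyword-only Feature constructor shown in that file's diff:

    # FeatureType is now importable from the package root alongside Feature.
    from splicemachine.features import Feature, FeatureType

    f = Feature(name='age', description='customer age', feature_data_type='INT',
                feature_type=FeatureType.categorical, tags=['demographics'])
    assert f.is_categorical()  # compares feature_type against FeatureType.categorical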

splicemachine/features/constants.py (42 additions & 23 deletions)

@@ -54,8 +54,11 @@ class SQL:
     """
 
     get_features_by_name = f"""
-        select feature_id,feature_set_id,Name,Description,feature_data_type, feature_type,Tags,compliance_level,
-        last_update_ts,last_update_username from {FEATURE_STORE_SCHEMA}.feature where Name in ({{feature_names}})
+        select fset.schema_name,fset.table_name,f.Name,f.Description,f.feature_data_type,f.feature_type,f.Tags,
+        f.compliance_level,f.last_update_ts,f.last_update_username,f.feature_id,f.feature_set_id
+        from {FEATURE_STORE_SCHEMA}.feature f
+        join {FEATURE_STORE_SCHEMA}.feature_set fset on f.feature_set_id=fset.feature_set_id
+        where {{where}}
     """
 
     get_features_in_feature_set = f"""
@@ -72,21 +75,33 @@ class SQL:
         FROM {FEATURE_STORE_SCHEMA}.feature_set_key GROUP BY 1) p
         ON fset.feature_set_id=p.feature_set_id
     """
-    get_training_contexts = f"""
-        SELECT tc.context_id, tc.Name, tc.Description, CAST(SQL_text AS VARCHAR(1000)) context_sql,
+
+    get_training_views = f"""
+        SELECT tc.view_id, tc.Name, tc.Description, CAST(SQL_text AS VARCHAR(1000)) view_sql,
         p.pk_columns,
         ts_column, label_column,
-        c.context_columns
-        FROM {FEATURE_STORE_SCHEMA}.training_context tc
+        c.join_columns
+        FROM {FEATURE_STORE_SCHEMA}.training_view tc
         INNER JOIN
-        (SELECT context_id, STRING_AGG(key_column_name,',') pk_columns FROM {FEATURE_STORE_SCHEMA}.training_context_key WHERE key_type='P' GROUP BY 1) p ON tc.context_id=p.context_id
+        (SELECT view_id, STRING_AGG(key_column_name,',') pk_columns FROM {FEATURE_STORE_SCHEMA}.training_view_key WHERE key_type='P' GROUP BY 1) p ON tc.view_id=p.view_id
         INNER JOIN
-        (SELECT context_id, STRING_AGG(key_column_name,',') context_columns FROM {FEATURE_STORE_SCHEMA}.training_context_key WHERE key_type='C' GROUP BY 1) c ON tc.context_id=c.context_id
+        (SELECT view_id, STRING_AGG(key_column_name,',') join_columns FROM {FEATURE_STORE_SCHEMA}.training_view_key WHERE key_type='J' GROUP BY 1) c ON tc.view_id=c.view_id
+    """
+
+    get_feature_set_join_keys = f"""
+        SELECT fset.feature_set_id, schema_name, table_name, pk_columns FROM {FEATURE_STORE_SCHEMA}.feature_set fset
+        INNER JOIN
+        (
+            SELECT feature_set_id, STRING_AGG(key_column_name,'|') pk_columns, STRING_AGG(key_column_data_type,'|') pk_types
+            FROM {FEATURE_STORE_SCHEMA}.feature_set_key GROUP BY 1
+        ) p
+        ON fset.feature_set_id=p.feature_set_id
+        where fset.feature_set_id in (select feature_set_id from {FEATURE_STORE_SCHEMA}.feature where name in {{names}} )
     """
 
     get_all_features = f"SELECT NAME FROM {FEATURE_STORE_SCHEMA}.feature WHERE Name='{{name}}'"
 
-    get_available_features = f"""
+    get_training_view_features = f"""
        SELECT f.feature_id, f.feature_set_id, f.NAME, f.DESCRIPTION, f.feature_data_type, f.feature_type, f.TAGS, f.compliance_level, f.last_update_ts, f.last_update_username
        FROM {FEATURE_STORE_SCHEMA}.Feature f
        WHERE feature_id IN
@@ -96,11 +111,11 @@ class SQL:
        (
        SELECT feature_id FROM
        (
-       SELECT f.feature_id, fsk.KeyCount, count(distinct fsk.key_column_name) ContextKeyMatchCount
+       SELECT f.feature_id, fsk.KeyCount, count(distinct fsk.key_column_name) JoinKeyMatchCount
        FROM
-       {FEATURE_STORE_SCHEMA}.training_context tc
+       {FEATURE_STORE_SCHEMA}.training_view tc
        INNER JOIN
-       {FEATURE_STORE_SCHEMA}.training_context_key c ON c.context_id=tc.context_id AND c.key_type='C'
+       {FEATURE_STORE_SCHEMA}.training_view_key c ON c.view_id=tc.view_id AND c.key_type='J'
        INNER JOIN
        (
        SELECT feature_set_id, key_column_name, count(*) OVER (PARTITION BY feature_set_id) KeyCount
@@ -111,36 +126,36 @@ class SQL:
        WHERE {{where}}
        GROUP BY 1,2
        )match_keys
-       WHERE ContextKeyMatchCount = KeyCount
+       WHERE JoinKeyMatchCount = KeyCount
        )fl
        )
    """
 
-    training_context = f"""
-        INSERT INTO {FEATURE_STORE_SCHEMA}.training_context (Name, Description, SQL_text, ts_column, label_column)
+    training_view = f"""
+        INSERT INTO {FEATURE_STORE_SCHEMA}.training_view (Name, Description, SQL_text, ts_column, label_column)
         VALUES ('{{name}}', '{{desc}}', '{{sql_text}}', '{{ts_col}}', {{label_col}})
     """
 
-    get_training_context_id = f"""
-        SELECT context_id from {FEATURE_STORE_SCHEMA}.Training_Context where Name='{{name}}'
+    get_training_view_id = f"""
+        SELECT view_id from {FEATURE_STORE_SCHEMA}.training_view where Name='{{name}}'
     """
 
     get_fset_primary_keys = f"""
         select distinct key_column_name from {FEATURE_STORE_SCHEMA}.Feature_Set_Key
     """
 
-    training_context_keys = f"""
-        INSERT INTO {FEATURE_STORE_SCHEMA}.training_context_key (Context_ID, Key_Column_Name, Key_Type)
-        VALUES ({{context_id}}, '{{key_column}}', '{{key_type}}' )
+    training_view_keys = f"""
+        INSERT INTO {FEATURE_STORE_SCHEMA}.training_view_key (View_ID, Key_Column_Name, Key_Type)
+        VALUES ({{view_id}}, '{{key_column}}', '{{key_type}}' )
     """
 
     update_fset_deployment_status = f"""
         UPDATE {FEATURE_STORE_SCHEMA}.feature_set set deployed={{status}} where feature_set_id = {{feature_set_id}}
     """
 
     training_set = f"""
-        INSERT INTO {FEATURE_STORE_SCHEMA}.training_set (name, context_id )
-        VALUES ('{{name}}', {{context_id}})
+        INSERT INTO {FEATURE_STORE_SCHEMA}.training_set (name, view_id )
+        VALUES ('{{name}}', {{view_id}})
     """
 
     get_training_set_id = f"""
@@ -171,10 +186,14 @@ class SQL:
         ({{model_schema_name}}, {{model_table_name}}, {{model_start_ts}}, {{model_end_ts}}, {{feature_id}}, {{feature_cardinality}}, {{feature_histogram}}, {{feature_mean}}, {{feature_median}}, {{feature_count}}, {{feature_stddev}})
     """
 
+    get_feature_vector = """
+        SELECT {feature_names} FROM {feature_sets} WHERE 
+    """
+
 class Columns:
     feature = ['feature_id', 'feature_set_id', 'name', 'description', 'feature_data_type', 'feature_type',
                'tags', 'compliance_level', 'last_update_ts', 'last_update_username']
-    training_context = ['context_id','name','description','context_sql','pk_columns','ts_column','label_column','context_columns']
+    training_view = ['view_id','name','description','view_sql','pk_columns','ts_column','label_column','join_columns']
     feature_set = ['feature_set_id', 'table_name', 'schema_name', 'description', 'pk_columns', 'pk_types', 'deployed']
     history_table_pk = ['ASOF_TS','UNTIL_TS']
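A note on the template mechanics in this file: the SQL members are f-strings evaluated at class-definition time, so {FEATURE_STORE_SCHEMA} is interpolated immediately while the doubled braces survive as single-brace placeholders for a later str.format call (get_feature_vector, a plain string, skips the first step). A standalone sketch; the schema value and the formatting call are assumptions based on the brace patterns, not shown in this diff:

    FEATURE_STORE_SCHEMA = 'FEATURESTORE'  # assumed value, for illustration only

    # Mirrors SQL.training_view above: the f-string fills in the schema now;
    # '{{name}}' etc. become '{name}' placeholders filled later via str.format.
    training_view = f"""
    INSERT INTO {FEATURE_STORE_SCHEMA}.training_view (Name, Description, SQL_text, ts_column, label_column)
    VALUES ('{{name}}', '{{desc}}', '{{sql_text}}', '{{ts_col}}', {{label_col}})
    """

    sql = training_view.format(
        name='customer_churn',   # hypothetical view name
        desc='churn label per customer',
        sql_text='SELECT customer_id, event_ts FROM retail.churn_events',
        ts_col='event_ts',
        label_col='NULL',        # unquoted in the template, so SQL NULL is possible
    )
    print(sql)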

splicemachine/features/feature.py (21 additions & 2 deletions)

@@ -13,23 +13,31 @@ def __init__(self, *, name, description, feature_data_type, feature_type, tags,
         self.__dict__.update(args)
 
     def is_categorical(self):
+        """
+        Returns if the type of this feature is categorical
+        """
         return self.feature_type == FeatureType.categorical
 
     def is_continuous(self):
+        """
+        Returns if the type of this feature is continuous
+        """
         return self.feature_type == FeatureType.continuous
 
     def is_ordinal(self):
+        """
+        Returns if the type of this feature is ordinal
+        """
         return self.feature_type == FeatureType.ordinal
 
     def _register_metadata(self, splice):
         """
         Registers the feature's existence in the feature store
-        :return: None
         """
         feature_sql = SQL.feature_metadata.format(
             feature_set_id=self.feature_set_id, name=self.name, desc=self.description,
             feature_data_type=self.feature_data_type,
-            feature_type=self.feature_type, tags=','.join(self.tags)
+            feature_type=self.feature_type, tags=','.join(self.tags) if isinstance(self.tags, list) else self.tags
         )
         splice.execute(feature_sql)
 
@@ -43,8 +51,19 @@ def __eq__(self, other):
 
     def __repr__(self):
         return self.__str__()
+
     def __str__(self):
         return f'Feature(FeatureID={self.__dict__.get("feature_id","None")}, ' \
                f'FeatureSetID={self.__dict__.get("feature_set_id","None")}, Name={self.name}, \n' \
                f'Description={self.description}, FeatureDataType={self.feature_data_type}, ' \
                f'FeatureType={self.feature_type}, Tags={self.tags})\n'
+
+    def __hash__(self):
+        return hash(repr(self))
+
+    def __lt__(self, other):
+        if isinstance(other, str):
+            return self.name < other
+        elif isinstance(other, Feature):
+            return self.name < other.name
+        raise TypeError(f"< not supported between instances of Feature and {type(other)}")
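The new __hash__ and __lt__ (the "hashable" commit) make Features sortable by name, comparable against plain strings, and usable in sets. A small sketch using only the keyword-only constructor shown at the top of this diff; the feature_type value below is a placeholder:

    from splicemachine.features import Feature

    f1 = Feature(name='age', description='', feature_data_type='INT',
                 feature_type='C', tags=[])   # 'C' is a placeholder feature_type
    f2 = Feature(name='income', description='', feature_data_type='DOUBLE',
                 feature_type='C', tags=[])

    sorted([f2, f1])   # __lt__ orders Features by name
    f1 < 'income'      # True: __lt__ also accepts a bare string
    {f1, f2}           # __hash__ (hash of repr) lets Features live in sets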
