This documentation describes the spec files used by Feast. There are four kinds of specs:
- Entity Spec, which describes an entity definition.
- Feature Spec, which describes a feature definition.
- Import Spec, which describes how to ingest/populate data for one or more features.
- Storage Spec, which describes storage, either for warehouse or serving.
An entity is a type with an associated key which generally maps onto a known domain object, e.g. `driver`, `customer`, `area`, and `merchant`. An entity determines how a feature may be retrieved, e.g. for a `driver` entity, all driver features must be looked up with an associated driver id entity key.
An entity is described by an entity spec. Following is an example of an entity spec:
```yaml
name: driver
description: GO-JEK's 2 wheeler driver
tags:
- two wheeler
- gojek
```
| Name | Type | Convention | Description |
|---|---|---|---|
| name | string | Lower snake case (e.g. `driver` or `driver_area`) | Entity name MUST be unique |
| description | string | N.A. | Description of the entity |
| tags | List of string | N.A. | Free form grouping |
A Feature is an individual measurable property or characteristic of an Entity. A feature has a many-to-one relationship with an entity, referencing the entity's name from its feature spec. In the context of Feast, a Feature has the following attributes:
- Entity - it must be associated with an Entity known to Feast (see Entity Spec)
- ValueType - the feature type must be defined, e.g. String, Bytes, Int64, Int32, Float, etc.
Following is an example of a feature spec:
```yaml
id: titanic_passenger.alone
name: alone
entity: titanic_passenger
owner: [email protected]
description: binary variable denoting whether the passenger was alone on the titanic.
valueType: INT64
uri: http://example-jupyter.com/user/your_user/lab/tree/shared/zhiling.c/feast_titanic/titanic.ipynb
```
| Name | Type | Convention | Description |
|---|---|---|---|
| entity | string | lower snake case (e.g. `driver`, `driver_area`) | Entity related to this feature |
| id | string | lower snake case with format `[entity].[featureName]` (e.g. `customer.age_prediction`) | Feature id is a unique identifier of a feature; it is used for feature retrieval from both serving and warehouse |
| name | string | lower snake case, e.g. `age_prediction` | Short name of a feature |
| owner | string | use email or any unique identifier | Owner is mainly used to inform users who is responsible for maintaining the feature |
| description | string | Keep the description concise; more information about the feature can be provided as documentation linked by `uri` | Human readable description of a feature |
| valueType | Enum (one of: `BYTES`, `STRING`, `INT32`, `INT64`, `DOUBLE`, `FLOAT`, `BOOL`, `TIMESTAMP`) | N.A. | Value type of the feature |
| group | string | lower snake case | Feature group inherited by this feature |
| tags | List of string | N.A. | Free form grouping |
| options | Key value string | N.A. | Options are used for extensibility of the feature spec (see the sketch after this table) |
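The feature spec example above omits the optional `group`, `tags`, and `options` fields. Below is a minimal sketch of the same spec with all three set; the group name, tags, and option key/value are hypothetical placeholders, not values Feast defines:

```yaml
id: titanic_passenger.alone
name: alone
entity: titanic_passenger
owner: [email protected]
description: binary variable denoting whether the passenger was alone on the titanic.
valueType: INT64
group: titanic_features # hypothetical feature group to inherit from
tags:
- demo
- titanic
options: # hypothetical free-form key/value pairs
  example_key: example_value
```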
An import spec describes how data is ingested into Feast to populate one or more features. An import spec contains information about:
- The entities to be imported.
- Import source type.
- Import source options, as required by the type.
- Import source schema, if required by the type.
Feast supports ingesting features from four types of sources:
- File (either CSV or JSON)
- BigQuery Table
- PubSub Topic
- PubSub Subscription
The following sections describe the import spec that has to be created for each type of source.
Example of a CSV file import spec:
```yaml
type: file.csv
sourceOptions:
  path: gs://my-bucket/customer_total_purchase.csv
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampValue: 2018-09-25T00:00:00.000Z
  fields:
  - name: timestamp
  - name: customer_id
  - name: total_purchase
    featureId: customer.total_purchase
```
Example of a JSON file import spec:
```yaml
type: file.json
sourceOptions:
  path: gs://my-bucket/customer_total_purchase.json
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampValue: 2018-09-25T00:00:00.000Z
  fields:
  - name: timestamp
  - name: customer_id
  - name: total_purchase
    featureId: customer.total_purchase
```
Notes on the import spec for the file type:
- The `type` field must be `file.csv` or `file.json`.
- `sourceOptions.path` specifies the location of the file; it must be accessible by the Ingestion Job (e.g. on GCS).
- The `entities` list must contain exactly one entity.
- `schema` must be specified.
- `schema.entityIdColumn` must match one of the `fields` names. It determines which column is used as the entity ID.
- `schema.timestampColumn` can be specified to name the timestamp column. This field is mutually exclusive with `schema.timestampValue`; if `schema.timestampValue` is used instead, all features will have the same timestamp (see the sketch after this list).
- `schema.fields` is a list of the columns/fields in the file. You can mark a field as containing a feature by adding a `featureId` attribute that specifies the feature ID.
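For illustration, here is a minimal sketch of the CSV import spec above rewritten to use `schema.timestampColumn` instead of `schema.timestampValue`, assuming each row's event time lives in the file's `timestamp` column:

```yaml
type: file.csv
sourceOptions:
  path: gs://my-bucket/customer_total_purchase.csv
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampColumn: timestamp # read each row's timestamp from this column
  fields:
  - name: timestamp
  - name: customer_id
  - name: total_purchase
    featureId: customer.total_purchase
```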
The import spec for BigQuery is almost the same as the one for importing a file. The only difference is that every `schema.fields.name` has to match a column name of the source table.
For example, to import the following table from the BigQuery table `gcp-project.source-dataset.source-table` into a feature with id `customer.last_login`:
| customer_id | last_login |
|---|---|
| ... | ... |
| ... | ... |
You will need to specify an import spec which looks like this:
```yaml
type: bigquery
options:
  project: gcp-project
  dataset: source-dataset
  table: source-table
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampValue: 2018-10-25T00:00:00.000Z
  fields:
  - name: customer_id
  - name: last_login
    featureId: customer.last_login
```
Notes on the import spec for the bigquery type:
- The `type` field must be `bigquery`.
- `options.project` specifies the GCP project of the source table.
- `options.dataset` specifies the BigQuery dataset of the source table.
- `options.table` specifies the source table name.
- The `entities` list must contain exactly one entity.
- `schema.entityIdColumn` must match one of the `fields` names. It determines which column is used as the entity ID.
- `schema.timestampColumn` can be specified to name the timestamp column. This field is mutually exclusive with `schema.timestampValue`; if `schema.timestampValue` is used instead, all features will have the same timestamp.
- `schema.fields` is a list of the columns in the table, and every `schema.fields.name` must match a column name of the table. You can mark a column as containing a feature by adding a `featureId` attribute that specifies the feature ID.
Important Notes: Be careful when ingesting a partitioned table, since by default the ingestion job will read the whole table, which might cause unpredictable ingestion results. You can specify which partition to ingest by using a table decorator, as follows:
```yaml
type: bigquery
options:
  project: gcp-project
  dataset: source-dataset
  table: source-table$20181101 # yyyyMMdd format; it will only ingest the specified partition
entities:
- customer
schema:
  entityIdColumn: customer_id
  timestampValue: 2018-10-25T00:00:00.000Z
  fields:
  - name: customer_id
  - name: last_login
    featureId: customer.last_login
```
You can ingest from either a PubSub topic or a PubSub subscription.
For example, the following import spec will ingest data from the PubSub topic `feast-test` in the GCP project `my-gcp-project`:
```yaml
type: pubsub
options:
  topic: projects/my-gcp-project/topics/feast-test
entities:
- customer
schema:
  fields:
  - featureId: customer.last_login
```
Or, if the source data comes from a subscription named `feast-test-subscription`:
```yaml
type: pubsub
options:
  subscription: projects/my-gcp-project/subscriptions/feast-test-subscription
entities:
- customer
schema:
  fields:
  - featureId: customer.last_login
```
The PubSub source type expects that all messages the ingestion job receives are binary-encoded FeatureRow protobufs. See FeatureRow.proto.
Note that:
- You must be explicit about which features should be ingested, using `schema.fields`.
There are two kinds of storage in Feast:
- Serving Storage (BigTable and Redis)
- Warehouse Storage (BigQuery)
Serving storage is intended to support low latency feature retrieval, whereas warehouse storage is intended for training or data exploration. Each supported storage type has slightly different options. Following are the storage spec descriptions for each storage type.
BigQuery is used as warehouse storage in Feast.
```yaml
id: BIGQUERY1
type: bigquery
options:
  dataset: "test_feast"
  project: "the-big-data-staging-007"
  tempLocation: "gs://zl-test-bucket"
```
BigTable is used as serving storage. It is advisable to use BigTable for features which don't require very low latency access, or which have a huge amount of data.
```yaml
id: BIGTABLE1
type: bigtable
options:
  instance: "ds-staging"
  project: "the-big-data-staging-007"
```
Redis is recommended for features which require very low latency access. However, it has limited storage capacity compared to BigTable.
```yaml
id: REDIS1
type: redis
options:
  host: "127.0.0.1"
  port: "6379"
```