Description
With the current API, if one wants to project in d=3, one has to know the exact number n of optional arguments before specifying 3 as the n+1th argument. This feels a bit uneasy, and it means that we can't add a supplementary hyperparameter to any method without it being a breaking change.
It seems to be that it would be nice to rethink the API "à la D3", so that:
- all the algorithms can be called interchangeably
- we could separate the training and transform phases (learn then transform? #11)
- we could specify hyperparameters individually
- we could serialize the model (in and out : save and load)
I would imagine that this could be structured as:
- new Druid([method or model]) — create a druid
- druid.values([accessor]) — sets the values accessor if specified, and returns the druid; return the values accessor if not specified
- druid.dimensions([number]) — sets or returns the dimensions (default: 2)
- druid.class([accessor]) — sets or returns the class accessor (for LDA)
- druid.method([name or class]) — sets the current method (UMAP, FASTMAP etc) if specified and returns the druid ; if not specified, return the method (as a Class or function).
- druid.fit(data) — train the model on the data and returns the druid
- druid.transform([data]) — transforms the data if specified; if data is not specified, returns the transformed train set
- druid.model([model]) — returns the serialized model (JSON) if a model is not specified, loads the model if specified
And for each hyperparameter, for example UMAP/min_dist
- druid.min_dist([min_dist]) — if specified, sets the min_dist hyperparameter and returns the druid, or read it if not specified
With this we could say for example:
const dr = new Druid("LDA"); // dr
dr.dimensions(2).class(d => d.species).values(d => [+d.sepal_length, +d.petal_length, …]).fit(data); // dr
dr.transform(); // transformed data
const model = dr.model(); // JSON {}
…
const dr = new Druid(model); // dr
dr.transform([new data]); // apply the model to new data…
I wonder what should be done for NaN, I suppose they should be automatically ignored if the values accessor returns any NaN.
Note also that some methods such as UMAP can accept a distance matrix instead of a data array.
PS: Sorry for spamming your project :) The potential is very exciting.