Enable LLM-Driven Data Exploration with Presets and 3W Integration #55
Overview
This PR lays the groundwork for integrating Large Language Models (LLMs) into BibMon. It introduces a client that allows users to interact with endpoints for data processing and exploration. As a starting point, we have implemented access to the 3W dataset, enabling the model to infer the most relevant column based on the provided data.
Data tailoring is handled through what we call presets, which are located in `bibmon/llm/presets`. These presets allow users to customize how data is structured before being sent to the model.

Limitations
This feature acts purely as a client, meaning it requires an external endpoint for model interaction. For instance, you can use OpenAI's API or self-host an alternative model, as we have done.
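Because the feature is endpoint-agnostic, the request a client would send can be sketched against an OpenAI-compatible `/chat/completions` route. This is a minimal illustration, not BibMon's actual client code; the URL layout, model name, and system prompt are all assumptions:

```python
import json

def build_chat_request(base_url, model, system_prompt, payload):
    # Assemble (but do not send) a request body for an
    # OpenAI-compatible /chat/completions endpoint.
    # All names here are illustrative assumptions.
    return {
        "url": f"{base_url}/chat/completions",
        "body": json.dumps({
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                # The preset-built payload travels as the user message.
                {"role": "user", "content": json.dumps(payload)},
            ],
        }),
    }

req = build_chat_request(
    "http://localhost:8000/v1",
    "local-model",
    "Given the dataset summary, name the most relevant column.",
    {"event_name": "demo"},
)
```

The same builder works whether the base URL points at OpenAI's API or at a self-hosted model, which is the flexibility the client design is aiming for.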
Direct LLM inference within BibMon could be achieved using tools like llama-cpp-python or similar. However, this approach was avoided to prevent unnecessary complexity and bloat in the library.
While fine-tuning the LLM and creating more precise instructions for a dataset is possible, it requires a detailed data annotation process. Additional information on this can be found in our auxiliary notebook.
Note: This PR is dependent on #50.
Implementation Details (3W Dataset Integration)
Data Preset
The data sent to the model follows this structure:
```json
{
  "event_name": "string",
  "event_description": "string",
  "columns_and_description": "dict",
  "data": [
    {
      "event_name": "string",
      "average_values": "string",
      "standard_deviation": "string",
      "head": "string",
      "tail": "string"
    },
    ...
  ]
}
```

Model Response Format
The model will respond with the following structure:
```json
{
  "column": "key of the column of interest",
  "extra": "additional information deemed relevant by the model"
}
```

Usage Example
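As a rough end-to-end sketch (helper names like `summarize` and `parse_response` are hypothetical, not BibMon's API; the event and column names come from the 3W dataset), a payload matching the preset structure can be built and the model's reply parsed like this:

```python
import json
import statistics

def summarize(values):
    # Hypothetical helper mirroring the per-event fields of the preset:
    # condense a list of readings into the summary strings the payload expects.
    return {
        "average_values": f"{statistics.mean(values):.3f}",
        "standard_deviation": f"{statistics.stdev(values):.3f}",
        "head": str(values[:3]),
        "tail": str(values[-3:]),
    }

def parse_response(raw):
    # Parse a reply shaped like the Model Response Format section
    # into (column, extra).
    reply = json.loads(raw)
    return reply["column"], reply.get("extra", "")

entry = {"event_name": "ABRUPT_INCREASE_OF_BSW",
         **summarize([1.0, 2.0, 3.0, 4.0])}
payload = {
    "event_name": "ABRUPT_INCREASE_OF_BSW",
    "event_description": "Abrupt increase of basic sediment and water",
    "columns_and_description": {"P-PDG": "pressure at the PDG sensor"},
    "data": [entry],
}

# A mocked model reply, in place of a real endpoint call:
raw = '{"column": "P-PDG", "extra": "Pressure drop precedes the event."}'
column, extra = parse_response(raw)
```

In real use the payload would be produced by a preset from `bibmon/llm/presets` and `raw` would come back from the configured endpoint; see the notebook linked below for the full workflow.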
Additional Resources
For further information and examples, please refer to our detailed notebook showcasing this feature.