Before you can run the server, install dependencies by running:

```sh
./scripts/setup.sh
```
To run the web server in dev mode with fast edit-refresh:

```sh
./run_server_dev.sh
```
Format TypeScript files:

```sh
npm run format --workspace web/lib
npm run format --workspace web/blueprint
```
Run all the checks before sending a PR:

```sh
./scripts/checks.sh
```
Test Python:

```sh
./scripts/test_py.sh
```
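To run a single Python test, you can usually bypass the wrapper script and invoke pytest directly (this assumes the script wraps pytest; check `./scripts/test_py.sh` for the exact invocation):

```sh
poetry run pytest -k <test_name>
```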
Test JavaScript:

```sh
./scripts/test_ts.sh
```
These tests will all run on GitHub CI as well.
We use HuggingFace Spaces for demo links on PRs:
1. Log in with HuggingFace to access git:

   ```sh
   poetry run huggingface-cli login
   ```

   Follow the instructions to use your git SSH keys to talk to HuggingFace.
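   To verify the login worked, you can check the authenticated user (a standard `huggingface-cli` subcommand, not a Lilac script):

   ```sh
   poetry run huggingface-cli whoami
   ```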
2. Set `.env.local` environment variables so you can upload data to the space:

   ```sh
   # The repo to use for the HuggingFace demo. This does not have to exist when you
   # set the flag; the deploy script will create it if it doesn't exist.
   HF_STAGING_DEMO_REPO='lilacai/your-space'
   # To authenticate with HuggingFace for uploading to the space.
   HF_ACCESS_TOKEN='hf_abcdefghijklmnop'
   ```
3. Deploy to your HuggingFace Space:

   ```sh
   poetry run python -m scripts.deploy_staging \
     --dataset=$DATASET_NAMESPACE/$DATASET_NAME
   ```

   Add `--create_space` if this is the first time running the command, so it will create the space for you.
To publish the pip package:

```sh
./scripts/publish_pip.sh
```
This will:
- Build the package (typescript bundle & python)
- Publish the package on pip and on GitHub Packages
- Build and publish the docker image
- Bump the version at HEAD
- Create release notes on GitHub
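If you want to check the current version before publishing, `poetry version -s` prints the version at HEAD (the same value the Docker builds below substitute):

```sh
poetry version -s
```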
The public HuggingFace demo is hosted in the `lilacai/lilac` space.
To publish the demo:

```sh
poetry run python -m scripts.deploy_demo \
  --project_dir=./demo_data \
  --config=./lilac_hf_space.yml \
  --hf_space=lilacai/lilac
```
Optional flags:

- `--skip_sync` to skip syncing data from the HuggingFace space.
- `--skip_load` to skip loading the data.
- `--load_overwrite` to load all data from scratch, overwriting existing data.
- `--skip_data_upload` to skip uploading data. This will use the datasets already on the space.
- `--skip_deploy` to skip deploying to HuggingFace. Useful for testing locally.
Typically, if we just want to push code changes without changing the data, run:

```sh
poetry run python -m scripts.deploy_demo \
  --project_dir=./demo_data \
  --config=./lilac_hf_space.yml \
  --hf_space=lilacai/lilac \
  --skip_sync \
  --skip_load \
  --skip_data_upload
```
The public demo uses the public pip package, so for code changes to land in the demo they must first be published on pip. This ensures that users who fork the demo always get an up-to-date Lilac.
All Docker images are published under the `lilacai` account on Docker Hub. We build Docker images for two platforms: `linux/amd64` and `linux/arm64`.
NOTE: `./scripts/publish_pip.sh` will do this automatically.
To build the image remotely with Google Cloud Build:

```sh
gcloud builds submit \
  --config cloudbuild.yml \
  --substitutions=_VERSION=$(poetry version -s) \
  --async .
```
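Because the build is submitted with `--async`, you can check on it afterwards with standard gcloud commands (nothing Lilac-specific):

```sh
gcloud builds list --limit=5
```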
To build the image for both platforms locally, run this one-time setup:

```sh
docker buildx create --name mybuilder --node mybuilder0 --bootstrap --use
```
Make sure you have Docker Desktop running and that you are logged in as the `lilacai` account. To build and push the image:

```sh
docker buildx build --platform linux/amd64,linux/arm64 \
  -t lilacai/lilac \
  -t lilacai/lilac:$(poetry version -s) \
  --push .
```
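To confirm that both platforms made it into the pushed manifest, you can inspect it (a quick sanity check using standard `docker buildx` tooling):

```sh
docker buildx imagetools inspect lilacai/lilac:$(poetry version -s)
```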
To use various APIs, API keys need to be provided. Create a file named `.env.local` in the root and add the variables listed in `.env` with your own values. The environment flags we use are listed in `lilac/env.py`.
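For illustration, a `.env.local` might look like the following (variable names are taken from this guide; see `.env` and `lilac/env.py` for the authoritative list):

```sh
# Authenticate with HuggingFace for uploading to spaces.
HF_ACCESS_TOKEN='hf_abcdefghijklmnop'
# Only needed when testing authentication locally.
LILAC_AUTH_ENABLED=true
GOOGLE_CLIENT_SECRET='<your-client-secret>'
LILAC_OAUTH_SECRET_KEY='<random-string>'
```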
End-user authentication is done via Google login when `LILAC_AUTH_ENABLED` is set to true (e.g. in the public HuggingFace demo, where we disable features for most users).
A Google client token should be created from the Google API Console.
By default, the Lilac Google client is used. The secret can be found in the Google Cloud console and should be defined under `GOOGLE_CLIENT_SECRET` in `.env.local`.
For the session middleware, a random string should be created and defined as `LILAC_OAUTH_SECRET_KEY` in `.env.local`. You can generate a random secret key with:
```python
import random
import string

key = ''.join(random.choices(string.ascii_uppercase + string.digits, k=64))
print(f"LILAC_OAUTH_SECRET_KEY='{key}'")
```
You may need the following to install Poetry:

- Install Xcode and sign the license
- Xcode command line tools (macOS)
- Homebrew (macOS)
- pyenv (Python version management)
- Set your current Python version
- Python Poetry
To attach PDB to the Lilac server:

```sh
./run_server_pdb.sh
```
This starts the Lilac webserver in single-threaded mode, ready to accept requests and respond to PDB breakpoints. To trigger your breakpoint, use the Lilac UI to issue an HTTP request, or use Chrome's inspector to copy a logged network request (Copy > Copy as cURL) and replay it against the Lilac server.
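To stop at a specific spot, you can drop a standard `breakpoint()` call on the code path you want to inspect (plain stdlib; the function below is a hypothetical example, not a Lilac API):

```python
def load_rows(rows: list[dict]) -> int:
  """Hypothetical helper on the request path you want to debug."""
  breakpoint()  # Execution pauses here; the PDB prompt appears in the run_server_pdb.sh terminal.
  return len(rows)
```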
line_profiler allows breaking down time spent by line. To use it, decorate a target function with `@profile` and run a Python script with `poetry run kernprof -lv my_script.py`. (The `@profile` decorator is injected into the namespace by kernprof; no imports are needed.) line_profiler excels at profiling numeric code and at giving a high-level overview of where time gets spent.
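As a minimal sketch (the script and function here are hypothetical), decorate the hot function and run the file under kernprof:

```python
# my_script.py -- run with: poetry run kernprof -lv my_script.py
@profile  # Injected by kernprof at runtime; running this file without kernprof raises NameError.
def sum_squares(n: int) -> int:
  total = 0
  for i in range(n):
    total += i * i
  return total

if __name__ == '__main__':
  sum_squares(1_000_000)
```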
In the before/after comparison below, the code got about 3x faster after implementing `load_to_parquet`.
```
Total time: 2.82135 s
File: /Users/brian/dev/lilac/lilac/load_dataset.py
Function: process_source at line 56

Line #  Hits       Time  Per Hit  % Time  Line Contents
==============================================================
    56                                    @profile
    57                                    def process_source(
    58                                      project_dir: Union[str, pathlib.Path],
    59                                      config: DatasetConfig,
    60                                      task_step_id: Optional[TaskStepId] = None,
    61                                    ) -> str:
    62                                      """Process a source."""
    63     1       12.0     12.0     0.0    output_dir = get_dataset_output_dir(project_dir, config.namespace, config.name)
    64
    65     1    91006.0  91006.0     3.2    config.source.setup()
    66     1        0.0      0.0     0.0    try:
    67     1        2.0      2.0     0.0      manifest = config.source.load_to_parquet(output_dir, task_step_id=task_step_id)
    68     1        1.0      1.0     0.0    except NotImplementedError:
    69     1  2627713.0    3e+06    93.1      manifest = slow_process(config.source, output_dir, task_step_id=task_step_id)
    70
    71     2      747.0    373.5     0.0    with open_file(os.path.join(output_dir, MANIFEST_FILENAME), 'w') as f:
    72     1       76.0     76.0     0.0      f.write(manifest.model_dump_json(indent=2, exclude_none=True))
    73
    74     1        1.0      1.0     0.0    if not config.settings:
    75     1     5324.0   5324.0     0.2      dataset = get_dataset(config.namespace, config.name, project_dir)
    76     1    92883.0  92883.0     3.3      settings = default_settings(dataset)
    77     1     3489.0   3489.0     0.1      update_project_dataset_settings(config.namespace, config.name, settings, project_dir)
    78
    79     1       99.0     99.0     0.0    log(f'Dataset "{config.name}" written to {output_dir}')
    80
    81     1        0.0      0.0     0.0    return output_dir
```
```
Total time: 1.00076 s
File: /Users/brian/dev/lilac/lilac/load_dataset.py
Function: process_source at line 56

Line #  Hits       Time  Per Hit  % Time  Line Contents
==============================================================
    56                                    @profile
    57                                    def process_source(
    58                                      project_dir: Union[str, pathlib.Path],
    59                                      config: DatasetConfig,
    60                                      task_step_id: Optional[TaskStepId] = None,
    61                                    ) -> str:
    62                                      """Process a source."""
    63     1       13.0     13.0     0.0    output_dir = get_dataset_output_dir(project_dir, config.namespace, config.name)
    64
    65     1    78952.0  78952.0     7.9    config.source.setup()
    66     1        0.0      0.0     0.0    try:
    67     1   830583.0 830583.0    83.0      manifest = config.source.load_to_parquet(output_dir, task_step_id=task_step_id)
    68                                      except NotImplementedError:
    69                                        manifest = slow_process(config.source, output_dir, task_step_id=task_step_id)
    70
    71     2      248.0    124.0     0.0    with open_file(os.path.join(output_dir, MANIFEST_FILENAME), 'w') as f:
    72     1       90.0     90.0     0.0      f.write(manifest.model_dump_json(indent=2, exclude_none=True))
    73
    74     1        1.0      1.0     0.0    if not config.settings:
    75     1     5490.0   5490.0     0.5      dataset = get_dataset(config.namespace, config.name, project_dir)
    76     1    81904.0  81904.0     8.2      settings = default_settings(dataset)
    77     1     3394.0   3394.0     0.3      update_project_dataset_settings(config.namespace, config.name, settings, project_dir)
    78
    79     1       85.0     85.0     0.0    log(f'Dataset "{config.name}" written to {output_dir}')
    80
    81     1        0.0      0.0     0.0    return output_dir
```
memray enables profiling memory usage over time, allowing one to pinpoint which sources are responsible for memory usage or leakage.
This short snippet will run a test script `test_script.py` and open the profiler results:

```sh
rm memray.bin memray-flamegraph-memray.html; \
poetry run memray run -o memray.bin test_script.py &&
poetry run memray flamegraph memray.bin &&
open memray-flamegraph-memray.html
```
To profile the import time of the lilac package:

```sh
poetry run python -X importtime -c "import lilac" 2> import.log && poetry run tuna import.log
```
To profile the memory usage of importing lilac:

```sh
rm memray.bin memray-flamegraph-memray.html; \
echo "import lilac" > test_import_script.py && \
poetry run memray run -o memray.bin test_import_script.py && \
poetry run memray flamegraph memray.bin && \
open memray-flamegraph-memray.html
```