Development

Setup

Before you can run the server, install the following:

Install dependencies

./scripts/setup.sh

Run Lilac in dev mode

To run the web server in dev mode with fast edit-refresh:

./run_server_dev.sh

Format TypeScript files:

npm run format --workspace web/lib
npm run format --workspace web/blueprint

Testing

Run all the checks before sending a PR:

./scripts/checks.sh

Test Python:

./scripts/test_py.sh

Test JavaScript:

./scripts/test_ts.sh

These tests will all run on GitHub CI as well.
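
If you want to iterate on a single Python test, you can also invoke the test runner directly (this assumes the runner is pytest; the file path below is only illustrative):

poetry run pytest lilac/load_dataset_test.py -v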

Demos for PRs

We use HuggingFace Spaces for demo links on PRs. Example.

  1. Log in with the HuggingFace CLI to access git.

    poetry run huggingface-cli login

    Follow the instructions to use your git SSH keys to talk to HuggingFace.

  2. Set .env.local environment variables so you can upload data to the space:

      # The repo to use for the HuggingFace demo. It does not have to exist yet; the deploy script will create it if needed.
      HF_STAGING_DEMO_REPO='lilacai/your-space'
      # To authenticate with HuggingFace for uploading to the space.
      HF_ACCESS_TOKEN='hf_abcdefghijklmnop'
  3. Deploy to your HuggingFace Space:

    poetry run python -m scripts.deploy_staging \
      --dataset=$DATASET_NAMESPACE/$DATASET_NAME
    
    # Add --create_space the first time you run the command so it creates the space for you.
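
    For example, on the first run you can pass the flag directly:

    poetry run python -m scripts.deploy_staging \
      --dataset=$DATASET_NAMESPACE/$DATASET_NAME \
      --create_space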
    

Publishing

Pip package

./scripts/publish_pip.sh

This will:

  • Build the package (typescript bundle & python)
  • Publish the package on pip and on github packages
  • Build and publish the docker image
  • Bump the version at HEAD
  • Create release notes on GitHub

HuggingFace public demo

The HuggingFace public demo can be found here.

To publish the demo:

poetry run python -m scripts.deploy_demo \
  --project_dir=./demo_data \
  --config=./lilac_hf_space.yml \
  --hf_space=lilacai/lilac

Add:

  • --skip_sync to skip syncing data from the HuggingFace space.
  • --skip_load to skip loading the data.
  • --load_overwrite to load all data from scratch, overwriting existing data.
  • --skip_data_upload to skip uploading data. This will use the datasets already on the space.
  • --skip_deploy to skip deploying to HuggingFace. Useful for testing locally.

Typically, if we just want to push code changes without changing the data, run:

poetry run python -m scripts.deploy_demo \
  --project_dir=./demo_data \
  --config=./lilac_hf_space.yml \
  --hf_space=lilacai/lilac \
  --skip_sync \
  --skip_load \
  --skip_data_upload

The public demo uses the public pip package, so for code changes to land in the demo, they must first be published on pip. This ensures that users who fork the demo always get an up-to-date Lilac.

Docker images

All Docker images are published under the lilacai account on Docker Hub. We build images for two platforms: linux/amd64 and linux/arm64.

NOTE: ./scripts/publish_pip.sh will do this automatically.

Building on Google Cloud

gcloud builds submit \
  --config cloudbuild.yml \
  --substitutions=_VERSION=$(poetry version -s) \
  --async .
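
Since the build runs asynchronously, you can check on its status afterwards, for example:

gcloud builds list --limit=5
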
Building locally

To build the image for both platforms, run the following one-time setup:

docker buildx create --name mybuilder --node mybuilder0 --bootstrap --use

Make sure Docker Desktop is running and you are logged in as the lilacai account. To build and push the image:

docker buildx build --platform linux/amd64,linux/arm64 \
  -t lilacai/lilac \
  -t lilacai/lilac:$(poetry version -s) \
  --push .
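
To sanity-check that both platform variants were pushed, you can inspect the multi-arch manifest:

docker buildx imagetools inspect lilacai/lilac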

Configuration & Environment

To use various APIs, you need to provide API keys. Create a file named .env.local in the repository root and add the variables listed in .env with your own values.

The environment flags we use are listed in lilac/env.py.
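
As a sketch, a minimal .env.local might contain only the keys your workflow needs (the values below are placeholders):

# .env.local
HF_ACCESS_TOKEN='hf_abcdefghijklmnop'
# ...plus any other variables from .env that you need.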

User Authentication for demos

End-user authentication is done via Google login when LILAC_AUTH_ENABLED is set to true (e.g. in the public HuggingFace demo, where we disable features for most users).

A Google Client token should be created from the Google API Console. Details can be found here.

By default, the Lilac google client is used. The secret can be found in Google Cloud console, and should be defined under GOOGLE_CLIENT_SECRET in .env.local.

For the session middleware, a random string should be created and defined as LILAC_OAUTH_SECRET_KEY in .env.local.

You can generate a random secret key with:

import string
import random
key = ''.join(random.choices(string.ascii_uppercase + string.digits, k=64))
print(f"LILAC_OAUTH_SECRET_KEY='{key}'")
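
Putting the pieces together, a demo-style .env.local would then include (placeholder values):

LILAC_AUTH_ENABLED=true
GOOGLE_CLIENT_SECRET='<client secret from the Google Cloud console>'
LILAC_OAUTH_SECRET_KEY='<output of the snippet above>'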

Installing poetry

You may need the following to install poetry:

Debugging and Optimization

Debug Lilac using pdb

To attach PDB to the Lilac server:

./run_server_pdb.sh

This starts the Lilac webserver in single-threaded mode, ready to accept requests and stop at PDB breakpoints. To trigger your breakpoint, you can use the Lilac UI to issue an HTTP request, or use Chrome's inspector to copy a logged network request (Copy > Copy as cURL) and replay it against the Lilac server.
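
A simple way to set a breakpoint is to drop Python's built-in breakpoint() into the code path you want to inspect before starting the server, for example:

# somewhere in the request path you want to inspect (location is up to you)
breakpoint()  # pauses here and opens PDB in the terminal running ./run_server_pdb.sh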

Profiling time with line_profiler

line_profiler breaks down time spent by line. To use it, decorate a target function with @profile and run a Python script with poetry run kernprof -lv my_script.py. (The @profile decorator is injected into the namespace by kernprof; no imports are needed.) line_profiler excels at profiling numeric code and at getting a high-level overview of where time is spent.
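
As a sketch, my_script.py (the name is your choice) might look like this; only the function decorated with @profile gets line-level timings:

# my_script.py -- run with: poetry run kernprof -lv my_script.py
@profile  # injected by kernprof at runtime; do not import it
def slow_sum(n):
  """Sum the first n squares, one element at a time."""
  total = 0
  for i in range(n):
    total += i * i
  return total

if __name__ == '__main__':
  slow_sum(1_000_000)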

In this before/after comparison, the code has gotten 3x faster with the implementation of load_to_parquet.

Total time: 2.82135 s
File: /Users/brian/dev/lilac/lilac/load_dataset.py
Function: process_source at line 56

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    56                                           @profile
    57                                           def process_source(
    58                                             project_dir: Union[str, pathlib.Path],
    59                                             config: DatasetConfig,
    60                                             task_step_id: Optional[TaskStepId] = None,
    61                                           ) -> str:
    62                                             """Process a source."""
    63         1         12.0     12.0      0.0    output_dir = get_dataset_output_dir(project_dir, config.namespace, config.name)
    64
    65         1      91006.0  91006.0      3.2    config.source.setup()
    66         1          0.0      0.0      0.0    try:
    67         1          2.0      2.0      0.0      manifest = config.source.load_to_parquet(output_dir, task_step_id=task_step_id)
    68         1          1.0      1.0      0.0    except NotImplementedError:
    69         1    2627713.0    3e+06     93.1      manifest = slow_process(config.source, output_dir, task_step_id=task_step_id)
    70
    71         2        747.0    373.5      0.0    with open_file(os.path.join(output_dir, MANIFEST_FILENAME), 'w') as f:
    72         1         76.0     76.0      0.0      f.write(manifest.model_dump_json(indent=2, exclude_none=True))
    73
    74         1          1.0      1.0      0.0    if not config.settings:
    75         1       5324.0   5324.0      0.2      dataset = get_dataset(config.namespace, config.name, project_dir)
    76         1      92883.0  92883.0      3.3      settings = default_settings(dataset)
    77         1       3489.0   3489.0      0.1      update_project_dataset_settings(config.namespace, config.name, settings, project_dir)
    78
    79         1         99.0     99.0      0.0    log(f'Dataset "{config.name}" written to {output_dir}')
    80
    81         1          0.0      0.0      0.0    return output_dir


Total time: 1.00076 s
File: /Users/brian/dev/lilac/lilac/load_dataset.py
Function: process_source at line 56

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    56                                           @profile
    57                                           def process_source(
    58                                             project_dir: Union[str, pathlib.Path],
    59                                             config: DatasetConfig,
    60                                             task_step_id: Optional[TaskStepId] = None,
    61                                           ) -> str:
    62                                             """Process a source."""
    63         1         13.0     13.0      0.0    output_dir = get_dataset_output_dir(project_dir, config.namespace, config.name)
    64
    65         1      78952.0  78952.0      7.9    config.source.setup()
    66         1          0.0      0.0      0.0    try:
    67         1     830583.0 830583.0     83.0      manifest = config.source.load_to_parquet(output_dir, task_step_id=task_step_id)
    68                                             except NotImplementedError:
    69                                               manifest = slow_process(config.source, output_dir, task_step_id=task_step_id)
    70
    71         2        248.0    124.0      0.0    with open_file(os.path.join(output_dir, MANIFEST_FILENAME), 'w') as f:
    72         1         90.0     90.0      0.0      f.write(manifest.model_dump_json(indent=2, exclude_none=True))
    73
    74         1          1.0      1.0      0.0    if not config.settings:
    75         1       5490.0   5490.0      0.5      dataset = get_dataset(config.namespace, config.name, project_dir)
    76         1      81904.0  81904.0      8.2      settings = default_settings(dataset)
    77         1       3394.0   3394.0      0.3      update_project_dataset_settings(config.namespace, config.name, settings, project_dir)
    78
    79         1         85.0     85.0      0.0    log(f'Dataset "{config.name}" written to {output_dir}')
    80
    81         1          0.0      0.0      0.0    return output_dir

Profiling memory usage with memray

memray profiles memory usage over time, letting you pinpoint which sources are responsible for memory consumption or leaks.

This short snippet will run a test script test_script.py and open the profiler results.

rm memray.bin memray-flamegraph-memray.html; \
  poetry run memray run -o memray.bin test_script.py &&
  poetry run memray flamegraph memray.bin &&
  open memray-flamegraph-memray.html
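
Here test_script.py is any script whose memory behavior you want to inspect. A trivial, purely hypothetical example that gives the flamegraph something to show:

# test_script.py (hypothetical): allocate and release ~100 MB of small objects
data = [b'x' * 1024 for _ in range(100_000)]
del data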

Profiling import time

poetry run python -X importtime -c "import lilac" 2> import.log && poetry run tuna import.log

Profiling lilac base library memory consumption

rm memray.bin memray-flamegraph-memray.html; \
  echo "import lilac" > test_import_script.py &&
  poetry run memray run -o memray.bin test_import_script.py &&
  poetry run memray flamegraph memray.bin &&
  open memray-flamegraph-memray.html