This repository implements the paper "CADmium: Fine-Tuning Code Language Models for Text-Driven Sequential CAD Design".
All fine-tuned models and datasets are available on the Hugging Face Hub under the CRL-CADmium organization.
Ensure Anaconda (or Miniconda) is installed. From the project root directory, run the following to create the environment and install dependencies:

```bash
conda deactivate
conda create --prefix=venv python=3.11 -y
conda activate venv
# Note: if 'conda activate venv' doesn't target the local './venv' directory,
# use 'conda activate ./venv' instead.
conda install -c conda-forge pythonocc-core -y
pip install -r requirements.txt
pip install -e .
```

This sequence sets up a local Conda environment in the `venv` subdirectory, activates it, and installs all required packages, including the project itself in editable mode.
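To verify the setup, a quick import check can help. This is a minimal sketch; it assumes `requirements.txt` installs PyTorch alongside `pythonocc-core`:

```python
# Minimal environment sanity check (sketch; assumes torch comes from requirements.txt).
from OCC.Core.gp import gp_Pnt  # provided by pythonocc-core
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pythonocc OK:", gp_Pnt(0.0, 0.0, 0.0).X() == 0.0)
```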
To process and tokenize the dataset:

- Prerequisites:
  - Successful environment setup (see above).
  - Raw dataset (e.g., `cadmium_ds`) located in `data/cadmium_ds`.
- Run the tokenization script from the project root directory:

  ```bash
  python cadmium/src/tokenize_dataset.py
  ```

- Output: the script saves tokenized data as three Parquet files (train, validation, and test splits) in the `data/` folder (e.g., `data/train_json_qwen_tokenized.parquet`), which can be inspected as sketched below.
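To inspect a tokenized split, the Parquet files load directly with pandas. A quick sketch (the exact column names depend on the tokenizer configuration):

```python
import pandas as pd

# Load one tokenized split and inspect its size and schema (sketch).
df = pd.read_parquet("data/train_json_qwen_tokenized.parquet")
print(len(df), "rows")
print(df.columns.tolist())
```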
To train the model with the default configuration:

```bash
torchrun --nproc_per_node=4 cadmium/src/train.py --config-name train
```

The training configuration is defined in `cadmium/config/train.yaml`. By default, this setup is optimized for 4 GPUs using Fully Sharded Data Parallel (FSDP). To train on fewer GPUs:

- Remove FSDP: comment out or remove the `fsdp_config` block in the YAML file.
- Maintain the effective batch size: the original configuration uses `per_device_batch_size=4` with 4 GPUs (effective batch size 16). To replicate this on fewer devices (see the sketch after this list):
  - Single GPU: set `per_device_batch_size=16`, or keep `per_device_batch_size=4` and set `gradient_accumulation_steps=4`.
  - Intermediate GPU counts: scale proportionally (e.g., 2 GPUs → `per_device_batch_size=8` or `gradient_accumulation_steps=2`).
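The invariant behind these adjustments is that the effective batch size is the product of the per-device batch size, the number of GPUs, and the gradient-accumulation steps. A minimal sketch (the function name is illustrative, not part of the codebase):

```python
def effective_batch_size(per_device: int, n_gpus: int, grad_accum: int = 1) -> int:
    """Effective batch size = per-device batch * number of GPUs * accumulation steps."""
    return per_device * n_gpus * grad_accum

assert effective_batch_size(4, 4) == 16                 # default: 4 GPUs
assert effective_batch_size(16, 1) == 16                # single GPU, larger batches
assert effective_batch_size(4, 1, grad_accum=4) == 16   # single GPU, accumulation
assert effective_batch_size(8, 2) == 16                 # 2 GPUs
```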
To run inference with a trained model:

```bash
torchrun --nproc_per_node=N cadmium/src/predict.py --config-name predict
```

where `N` is the number of GPUs to use.

- The script works with one or more GPUs without code changes.
- Predictions are saved in `data/results/` as:
  - individual JSON files per sample when using `save_per_batch=True`;
  - a consolidated CSV of results after all batches complete (see the sketch after this list).
- Modify `cadmium/config/predict.yaml` to adjust:
  - batch size (`eval.batch_size`);
  - generation parameters (temperature, top-p, etc.);
  - output directory paths.
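If a run stops after the per-batch JSONs are written but before the consolidated CSV appears, the per-sample files can be merged manually. A sketch, assuming each JSON file holds one flat record (the output filename and schema are assumptions, not the script's actual behavior):

```python
import glob
import json

import pandas as pd

# Merge per-sample prediction JSONs into one CSV (sketch; schema is an assumption).
rows = []
for path in sorted(glob.glob("data/results/*.json")):
    with open(path) as f:
        rows.append(json.load(f))

pd.DataFrame(rows).to_csv("data/results/results_merged.csv", index=False)
```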
This process uses GPT-4.1 via the OpenAI API to generate natural language descriptions for CAD modeling sequences. We use the minimal JSON representations and Blender renders from the Text2CAD Hugging Face dataset.

First, download and extract the renders and minimal JSONs:
```bash
# Create directory structure
mkdir -p data/text2cad_v1.1/{rgb_images,jsons}

# RGB images
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='SadilKhan/Text2CAD',
    allow_patterns='text2cad_v1.1/misc/rgb_images/*.zip',
    local_dir='data/text2cad_v1.1/rgb_images',
    repo_type='dataset'
)"
unzip data/text2cad_v1.1/rgb_images/text2cad_v1.1/misc/rgb_images/\*.zip -d data/text2cad_v1.1/rgb_images

# Minimal JSON descriptions
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id='SadilKhan/Text2CAD',
    filename='text2cad_v1.1/misc/minimal_json/minimal_json_0000_0099.zip',
    local_dir='data/text2cad_v1.1/jsons',
    repo_type='dataset'
)"
unzip data/text2cad_v1.1/jsons/text2cad_v1.1/misc/minimal_json/minimal_json_0000_0099.zip -d data/text2cad_v1.1/jsons

# Clean up zip files
rm -rf data/text2cad_v1.1/{rgb_images,jsons}/text2cad_v1.1
```
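A quick way to confirm the extraction worked is to count the extracted files. A sketch (the `.png` and `.json` extensions are assumptions about the archive contents):

```python
from pathlib import Path

# Count extracted assets (sketch; file extensions are assumed).
n_renders = sum(1 for _ in Path("data/text2cad_v1.1/rgb_images").rglob("*.png"))
n_jsons = sum(1 for _ in Path("data/text2cad_v1.1/jsons").rglob("*.json"))
print(f"{n_renders} renders, {n_jsons} minimal JSONs")
```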
Then, configure API access:

```bash
echo "OPENAI_API_KEY=your-key-here" > .env
```

Finally, run the annotation:
```bash
N_SPLITS=4  # match to available CPU cores
for IDX in $(seq 0 $((N_SPLITS-1))); do
    python cadmium/src/annotate.py +n_splits=$N_SPLITS +idx_split=$IDX &
done
wait
```
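The `+n_splits`/`+idx_split` flags shard the dataset so that each background process annotates a disjoint slice. A sketch of the idea (the actual slicing logic in `annotate.py` may differ, e.g., contiguous chunks instead of strides):

```python
# Illustration of split-based sharding (not the actual annotate.py implementation).
def shard(samples: list, n_splits: int, idx_split: int) -> list:
    """Return the idx_split-th of n_splits disjoint, roughly equal slices."""
    return samples[idx_split::n_splits]

items = list(range(10))
# The four shards together cover every sample exactly once.
assert sorted(sum((shard(items, 4, i) for i in range(4)), [])) == items
```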
To compute metrics for the reconstructed samples, first extract the generated JSON content of each sample listed in `results.csv` (in the `results/` folder) into a separate file:
```bash
python cadmium/src/utils/save_results_json.py --result_dir <path/to/results/dir/containing/results.csv/file>
```
The JSON files for all samples are then stored in `cadmium/src/data/generated_jsons/<dir_name>`.

Note: the above command works for the DeepCAD test data.
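For context, this extraction step amounts to writing one JSON file per row of the results CSV. A rough sketch (the `uid` and `generated_json` column names are hypothetical, not the script's actual schema):

```python
from pathlib import Path

import pandas as pd

# Write one JSON file per sample (sketch; column names are hypothetical).
df = pd.read_csv("results/results.csv")
out_dir = Path("cadmium/src/data/generated_jsons/example_run")
out_dir.mkdir(parents=True, exist_ok=True)
for _, row in df.iterrows():
    (out_dir / f"{row['uid']}.json").write_text(row["generated_json"])
```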
These JSONs can then be used to compute metrics:

```bash
python cadmium/src/utils/Evaluation/eval_seq.py --input_path cadmium/src/data/generated_jsons/<dir_name> --output_dir .
```
For the metrics proposed by CAD-MLLM, we use the code from https://github.com/DavidXu-JJ/CAD-MLLM-metrics.
Provide the path to the checkpoint and other suitable arguments in `cadmium/config/inference_user_input.yaml`, then launch the demo:

```bash
cd cadmium/src/utils/Demo
gradio app.py
```
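Gradio prints the local URL on startup (by default http://localhost:7860); open it in a browser to interact with the model.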
Licensed under the MIT License.