MTP: OOPSLA 2025 Artifact # 359

Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications

Overview

This artifact accompanies the OOPSLA 2025 paper "Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications". It provides a complete implementation of MTP, a novel programming language abstraction that enables type-safe integration of Large Language Models (LLMs) into traditional programming workflows.

The Meaning-Typed Programming (MTP) paradigm is implemented in the open-source Jaseci ecosystem as the MTLLM plugin for the Jac programming language; the "MTP" implementation referred to in the paper is this MTLLM plugin.

Key Innovation: MTLLM bridges the gap between the structured world of programming languages and the unstructured outputs of LLMs through a type system that captures both structural types and semantic meaning, enabling compile-time guarantees for AI-powered functions.

Primary Contributions

  1. Type-Safe LLM Integration: Compile-time type checking for LLM-powered functions with runtime output validation
  2. Automatic Output Transformation: Runtime system that converts unstructured LLM outputs into typed programming language objects
  3. Semantic Type System: Type annotations that capture both structural types (int, str) and semantic meaning for precise LLM guidance
  4. Language-Integrated AI: Native by llm() syntax in the Jac programming language for seamless AI integration

Artifact Contents

This repository contains:

  • Complete MTLLM(MTP) implementation (version 0.3.8) for the Jac programming language
  • Comprehensive benchmark suite with 12 tasks comparing MTLLM(MTP) against DSPy and LMQL baselines
  • Evaluation scripts for reproducing the experimental results from the paper
  • Documentation and examples demonstrating all key features
  • Docker environment for reproducible evaluation

The implementation is based on the open-source Jaseci ecosystem and represents the exact version used for paper evaluation.

Getting Started

Prerequisites

  • Python 3.12+: Required for the Jac language runtime
  • OpenAI API Key: Required for evaluation benchmarks using GPT models
  • Operating System: Linux or macOS (Windows not currently supported)
  • Docker (optional): For containerized evaluation environment

Quick Start Options

Option 1: Direct Installation

# Clone the repository with submodules
git clone --recurse-submodules https://github.com/Jayanaka-98/mtllm-oopsla2025.git
cd mtllm-oopsla2025

# Install MTLLM(MTP) with all required dependencies
pip install "mtllm[openai,ollama,tools]==0.3.8"

# Install evaluation dependencies
pip install -r eval/requirements.txt

# Set up your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"

# Optional: Install Ollama for local model evaluation
curl -fsSL https://ollama.ai/install.sh | sh

Option 2: Docker Environment (Recommended)

For a fully reproducible environment:

# Clone the repository
git clone --recurse-submodules https://github.com/Jayanaka-98/mtllm-oopsla2025.git
cd mtllm-oopsla2025

# Build and start the Docker container
chmod +x setup.sh
./setup.sh

# Inside the container, set your API key
export OPENAI_API_KEY="your-api-key-here"

Verification

Test your installation by running a simple MTLLM(MTP) example:

# Create a test file
cat > test.jac << 'EOF'
import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

def greet(name: str) -> str by llm();

with entry {
    print(greet("OOPSLA reviewers"));
}
EOF

# Run the test
jac run test.jac

If successful, you should see a greeting message generated by the LLM.

Core Features and Examples

The following examples demonstrate the three main usage patterns of MTLLM(MTP), corresponding to Figures 8(a), 8(b), and 8(c) in the paper.

1. Type-Safe LLM Functions

MTLLM(MTP) functions allow you to define function signatures with traditional type annotations while delegating implementation to an LLM. The runtime ensures type safety by validating and converting LLM outputs.

Example: Basic function with type enforcement

import from mtllm.llms {OpenAI}

# Initialize the LLM
glob llm = OpenAI(model_name="gpt-4o");

# Define a type-safe LLM function
def calculate_age(cur_year: int, dob: str) -> int by llm();

with entry {
    age = calculate_age(cur_year=2025, dob="1998");
    print(f"Age: {age}");  # Output is guaranteed to be an integer
}

Run: jac run examples/func.jac

2. LLM-Powered Object Construction

MTLLM(MTP) can generate object fields automatically while maintaining type constraints, enabling AI-driven object initialization with structural guarantees.

Example: Automatic field generation

import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

obj Person {
    has name: str;
    has dob: str;
}

with entry {
    # LLM fills in missing field based on partial information
    einstein = Person(name="Einstein" by llm());
    print(f"{einstein.name} was born on {einstein.dob}");
}

Run: jac run examples/object.jac

3. LLM-Enhanced Object Methods

Methods can leverage LLM capabilities while accessing object state, enabling context-aware AI computations with type safety.

Example: Context-aware method with object state access

import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

obj Person {
    has name: str;
    has dob: str;

    # Method uses object state (self) for computation
    def calculate_age(cur_year: int) -> int by llm(incl_info=(self), temperature=0.7);
}

with entry {
    einstein = Person(name="Einstein", dob="March 14, 1879");
    print(f"Einstein's age in 2024: {einstein.calculate_age(2024)}");
}

Run: jac run examples/method.jac

Advanced Features

  • Multiple LLM Support: OpenAI GPT, Anthropic Claude, and local models via Ollama
  • Type Coercion: Automatic parsing and validation of complex types (lists, objects, enums); see the sketch after this list
  • Error Recovery: Robust handling of malformed LLM outputs with retry mechanisms
  • Native Agentic Support: MTLLM(MTP) supports the ReAct method for building agentic applications
  • Vision Model Support: MTLLM can run inference with multi-modal models that take images and videos as input
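
The following sketch shows how type coercion looks in practice. It follows the same syntax as the examples above, but the object, function names, and return types are illustrative only and are not taken from this artifact's benchmarks:

import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

obj Scientist {
    has name: str;
    has dob: str;
}

# Outputs are parsed and coerced into the annotated types:
# a list of strings and a typed Scientist object.
def list_scientists(field: str) -> list[str] by llm();
def lookup(name: str) -> Scientist by llm();

with entry {
    print(list_scientists("physics"));
    curie = lookup("Marie Curie");
    print(f"{curie.name}, born {curie.dob}");
}

If a raw LLM output cannot be coerced into the annotated type, the retry mechanism mentioned above re-issues the call.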

📖 Complete Documentation: MTLLM User Guide

Evaluation and Benchmarks

This artifact includes a comprehensive evaluation suite that reproduces the experimental results from the paper (with the exceptions noted under Claims 2 and 3 below). The benchmarks compare MTLLM(MTP) against two state-of-the-art frameworks: DSPy and LMQL.

Benchmark Tasks

The evaluation covers 13 diverse tasks across different domains:

Text Processing
  • translation: Multi-language text translation
  • text_to_type: Converting unstructured text to typed objects
  • template: Producing output according to a predefined template

Reasoning
  • mcq_reason: Multiple-choice question reasoning
  • math_problem: Mathematical word problem solving
  • odd_word_out: Pattern recognition and categorization

Content Generation
  • joke_gen: Creative content generation
  • essay_reviewer: Academic text analysis
  • expert_answer: Domain-specific question answering

Applications
  • taskman: Task management and scheduling
  • rpg_level_gen: Game content generation
  • personality_finder: Personality analysis
  • wikipedia: Information extraction and summarization

Performance Metrics

The evaluation measures:

  • Accuracy: Task-specific correctness metrics
  • Token Usage: Total tokens consumed per task
  • Runtime: Execution time per benchmark
  • Cost: Estimated API costs (USD)
  • Sensitivity: Impact of coding practices on accuracy

Claims Validation

The paper makes four key claims that this artifact validates:

Claim 1: Development Complexity Reduction

MTLLM(MTP) reduces development complexity for model-integrated applications

This claim is evaluated mainly through a case study comparing the code, using lines of code (LoC) as the metric. The three versions of each benchmark program used in the paper are included in the benchmarks/ directory. A user study, documented in the paper, also supports this claim.

Evidence: Compare MTLLM implementations with DSPy/LMQL baselines in the benchmarks/ directory. MTLLM consistently requires fewer lines of code and less boilerplate.

Claim 2: Competitive Accuracy

MTLLM(MTP) achieves similar or better accuracy than baseline frameworks

To support this claim, we run each benchmark program for 20 trials and report the average success rate. In addition, we conduct a broader evaluation with multiple LLMs on the GSM8k dataset for the math_problem benchmark. However, that evaluation requires running Llama models on local hardware, which produces variable results even with reasonable timeout limits, so this artifact only includes scripts for the experiments with OpenAI GPT models.

Evidence: Run the evaluation suite to reproduce accuracy results from Table 2 in the paper.

# (requires OpenAI API key)
cd eval

# Generate accuracy summary statistics
python overall_accuracy.py

# Generate evaluation results for the math_problem benchmark on the GSM8k dataset
python GSM8k_accuracy.py

Claim 3: Efficient Resource Usage

MTLLM(MTP) demonstrates similar or lower token usage, cost, and runtime compared to baselines

Cost is calculated using the OpenAI cost equation discussed in the paper. To measure token usage, we used instrumented versions of LMQL, DSPy, and MTLLM that record prompts and LLM responses. These instrumented versions are not included in this artifact, so the token-usage and cost evaluations cannot be reproduced here; runtime evaluation scripts are included.
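
As a hedged illustration of that calculation (per-token prices are model-specific and given in the paper, not reproduced here): cost in USD ≈ input_tokens × input_price_per_token + output_tokens × output_price_per_token.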

Evidence: Runtime measurements are captured during evaluation and match the runtime results reported in the paper.

cd eval

# The following command runs the evaluation suite and measures runtime for both MTLLM and baseline implementations:
python eval.py --config eval.config.json --impl both

Claim 4: Resilience to Coding Practices

MTLLM(MTP) demonstrates resilience to suboptimal coding practices

We evaluate the robustness of MTLLM(MTP) against poor developer coding practices. For this, we introduce seven variations of the level generator (rpg_level_gen) benchmark with differing code quality (see the hedged illustration below).
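
As a hedged illustration of the kind of variation involved (this is not one of the actual seven variants in eval/sensitivity_eval), the same MTLLM function can be written with descriptive or with uninformative identifiers:

import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

# Descriptive identifiers give the runtime meaning to draw on.
def calculate_age(cur_year: int, dob: str) -> int by llm();

# Same signature with uninformative identifiers; the sensitivity
# experiment measures how accuracy holds up under practices like this.
def f(a: int, b: str) -> int by llm();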

Evidence: Robustness tests show MTLLM maintains performance across different implementation styles.

cd eval/sensitivity_eval

# Run the following script to generate the results.
python exp.py

Interactive Demo

Experience MTLLM(MTP) with the included RPG game that uses LLM-powered procedural level generation:

# Install game dependencies
pip install pygame

# Run the interactive RPG demo
cd jaseci/jac/examples/rpg_game/jac_impl/jac_impl_6
jac run main.jac

This demonstrates the real-world application of MTLLM for dynamic content generation in an interactive environment.

Repository Structure

mtllm-oopsla2025/
├── README.md                    # This file
├── Dockerfile                   # Docker environment setup
├── setup.bash                   # Automated setup script
├── benchmarks/                  # Evaluation benchmarks
│   ├── translation/            # Translation task implementations
│   ├── text_to_type/           # Text-to-type conversion tasks
│   ├── mcq_reason/             # Multiple choice reasoning
│   ├── math_problem/           # Mathematical problem solving
│   ├── joke_gen/               # Content generation tasks
│   ├── essay_reviewer/         # Text analysis tasks
│   ├── expert_answer/          # Domain-specific QA
│   ├── taskman/                # Task management
│   ├── rpg_level_gen/          # Game content generation
│   ├── personality_finder/     # Personality analysis
│   ├── odd_word_out/           # Pattern recognition
│   ├── wikipedia/              # Information extraction
│   └── template/               # Template for new benchmarks
├── eval/                       # Evaluation scripts and results
│   ├── eval.py                 # Main evaluation runner
│   ├── overall_accuracy.py     # Results aggregation
│   ├── requirements.txt        # Python dependencies
│   └── local_cache/            # Cached compilation artifacts
└── jaseci/                     # Core Jaseci ecosystem
    ├── jac/                    # Jac language implementation
    ├── jac-mtllm/              # MTLLM(MTP) plugin source
    ├── jac-cloud/              # Cloud deployment tools
    └── scripts/                # Utility scripts

Each benchmark directory contains three implementations:

  • *_mtllm.jac: MTP implementation
  • *_dspy.py: DSPy baseline implementation
  • *_lmql.py: LMQL baseline implementation

Troubleshooting

Common Issues

Python Version Error

ERROR: Python 3.12+ required

Solution: Upgrade Python or use the Docker environment.

API Key Error

openai.AuthenticationError: Invalid API key

Solution: Verify your OpenAI API key is set correctly:

echo $OPENAI_API_KEY  # Should display your key
export OPENAI_API_KEY="your-actual-key-here"

Package Installation Error

ERROR: Could not find a version that satisfies mtllm

Solution: Ensure you're using Python 3.12+ and run:

pip install --upgrade pip
pip install "mtllm[openai,ollama,tools]==0.3.8"

Ollama Connection Error

ConnectionError: Could not connect to Ollama

Solution: Start the Ollama service:

ollama serve
# In another terminal:
ollama pull llama2  # or your preferred model
