MTP: OOPSLA 2025 Artifact # 359

Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications

Overview

This artifact accompanies the OOPSLA 2025 paper "Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications". It provides a complete implementation of MTP, a novel programming language abstraction that enables type-safe integration of Large Language Models (LLMs) into traditional programming workflows.

The Meaning-Typed Programming (MTP) paradigm is implemented in the open-source Jaseci ecosystem as the MTLLM plugin for the Jac programming language; the "MTP" implementation referred to in the paper is this MTLLM plugin.

Key Innovation: MTLLM bridges the gap between the structured world of programming languages and the unstructured outputs of LLMs through a type system that captures both structural types and semantic meaning, enabling compile-time guarantees for AI-powered functions.

Primary Contributions

  1. Type-Safe LLM Integration: Compile-time type checking for LLM-powered functions with runtime output validation
  2. Automatic Output Transformation: Runtime system that converts unstructured LLM outputs into typed programming language objects
  3. Semantic Type System: Type annotations that capture both structural types (int, str) and semantic meaning for precise LLM guidance
  4. Language-Integrated AI: Native by llm() syntax in the Jac programming language for seamless AI integration

Artifact Contents

This repository contains:

  • Complete MTLLM(MTP) implementation (version 0.3.8) for the Jac programming language
  • Comprehensive benchmark suite with 12 tasks comparing MTLLM(MTP) against DSPy and LMQL baselines
  • Evaluation scripts for reproducing the experimental results from the paper
  • Documentation and examples demonstrating all key features
  • Docker environment for reproducible evaluation

The implementation is based on the open-source Jaseci ecosystem and represents the exact version used for paper evaluation.

Getting Started

Prerequisites

  • Python 3.12+: Required for the Jac language runtime
  • OpenAI API Key: Required for evaluation benchmarks using GPT models
  • Operating System: Linux or macOS (Windows not currently supported)
  • Docker (optional): For containerized evaluation environment

Quick Start Options

Option 1: Direct Installation

# Clone the repository with submodules
git clone --recurse-submodules https://github.com/Jayanaka-98/mtllm-oopsla2025.git
cd mtllm-oopsla2025

# Install MTLLM(MTP) with all required dependencies
pip install "mtllm[openai,ollama,tools]==0.3.8"

# Install evaluation dependencies
pip install -r eval/requirements.txt

# Set up your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"

# Optional: Install Ollama for local model evaluation
curl -fsSL https://ollama.ai/install.sh | sh

Option 2: Docker Environment (Recommended)

For a fully reproducible environment:

# Clone the repository
git clone --recurse-submodules https://github.com/Jayanaka-98/mtllm-oopsla2025.git
cd mtllm-oopsla2025

# Build and start the Docker container
chmod +x setup.sh
./setup.sh

# Inside the container, set your API key
export OPENAI_API_KEY="your-api-key-here"

Verification

Test your installation by running a simple MTLLM(MTP) example:

# Create a test file
cat > test.jac << 'EOF'
import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

def greet(name: str) -> str by llm();

with entry {
    print(greet("OOPSLA reviewers"));
}
EOF

# Run the test
jac run test.jac

If successful, you should see a greeting message generated by the LLM.

Core Features and Examples

The following examples demonstrate the three main usage patterns of MTLLM(MTP), corresponding to Figures 8(a), 8(b), and 8(c) in the paper.

1. Type-Safe LLM Functions

MTLLM(MTP) functions allow you to define function signatures with traditional type annotations while delegating implementation to an LLM. The runtime ensures type safety by validating and converting LLM outputs.

Example: Basic function with type enforcement

import from mtllm.llms {OpenAI}

# Initialize the LLM
glob llm = OpenAI(model_name="gpt-4o");

# Define a type-safe LLM function
def calculate_age(cur_year: int, dob: str) -> int by llm();

with entry {
    age = calculate_age(cur_year=2025, dob="1998");
    print(f"Age: {age}");  # Output is guaranteed to be an integer
}

Run: jac run examples/func.jac

2. LLM-Powered Object Construction

MTLLM(MTP) can generate object fields automatically while maintaining type constraints, enabling AI-driven object initialization with structural guarantees.

Example: Automatic field generation

import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

obj Person {
    has name: str;
    has dob: str;
}

with entry {
    # LLM fills in missing field based on partial information
    einstein = Person(name="Einstein" by llm());
    print(f"{einstein.name} was born on {einstein.dob}");
}

Run: jac run examples/object.jac

3. LLM-Enhanced Object Methods

Methods can leverage LLM capabilities while accessing object state, enabling context-aware AI computations with type safety.

Example: Context-aware method with object state access

import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

obj Person {
    has name: str;
    has dob: str;

    # Method uses object state (self) for computation
    def calculate_age(cur_year: int) -> int by llm(incl_info=(self), temperature=0.7);
}

with entry {
    einstein = Person(name="Einstein", dob="March 14, 1879");
    print(f"Einstein's age in 2024: {einstein.calculate_age(2024)}");
}

Run: jac run examples/method.jac

Advanced Features

  • Multiple LLM Support: OpenAI GPT, Anthropic Claude, and local models via Ollama
  • Type Coercion: Automatic parsing and validation of complex types (lists, objects, enums); see the sketch after this list
  • Error Recovery: Robust handling of malformed LLM outputs with retry mechanisms
  • Native Agentic Support: MTLLM(MTP) supports the ReAct method for building agentic applications
  • Vision Model Support: MTLLM can run inference with multi-modal models that take images and videos as input
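
The following sketch shows how type coercion looks in practice. It follows the same syntax as the examples above, but the object, function names, and return types are illustrative only and are not taken from this artifact's benchmarks:

import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

obj Scientist {
    has name: str;
    has dob: str;
}

# Outputs are parsed and coerced into the annotated types:
# a list of strings and a typed Scientist object.
def list_scientists(field: str) -> list[str] by llm();
def lookup(name: str) -> Scientist by llm();

with entry {
    print(list_scientists("physics"));
    curie = lookup("Marie Curie");
    print(f"{curie.name}, born {curie.dob}");
}

If a raw LLM output cannot be coerced into the annotated type, the retry mechanism mentioned above re-issues the call.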

📖 Complete Documentation: MTLLM User Guide

Evaluation and Benchmarks

This artifact includes a comprehensive evaluation suite that reproduces the experimental results from the paper (with the exceptions noted under Claims 2 and 3 below). The benchmarks compare MTLLM(MTP) against two state-of-the-art frameworks: DSPy and LMQL.

Benchmark Tasks

The evaluation covers 13 diverse tasks across different domains:

Text Processing
  • translation: Multi-language text translation
  • text_to_type: Converting unstructured text to typed objects
  • template: Producing output according to a predefined template

Reasoning
  • mcq_reason: Multiple-choice question reasoning
  • math_problem: Mathematical word problem solving
  • odd_word_out: Pattern recognition and categorization

Content Generation
  • joke_gen: Creative content generation
  • essay_reviewer: Academic text analysis
  • expert_answer: Domain-specific question answering

Applications
  • taskman: Task management and scheduling
  • rpg_level_gen: Game content generation
  • personality_finder: Personality analysis
  • wikipedia: Information extraction and summarization

Performance Metrics

The evaluation measures:

  • Accuracy: Task-specific correctness metrics
  • Token Usage: Total tokens consumed per task
  • Runtime: Execution time per benchmark
  • Cost: Estimated API costs (USD)
  • Sensitivity: Impact of coding practices on accuracy

Claims Validation

The paper makes four key claims that this artifact validates:

Claim 1: Development Complexity Reduction

MTLLM(MTP) reduces development complexity for model-integrated applications

This claim is evaluated mainly through a case study comparing the code, using lines of code (LoC) as the metric. The three versions of each benchmark program used in the paper are included in the benchmarks/ directory. A user study, documented in the paper, also supports this claim.

Evidence: Compare MTLLM implementations with DSPy/LMQL baselines in the benchmarks/ directory. MTLLM consistently requires fewer lines of code and less boilerplate.

Claim 2: Competitive Accuracy

MTLLM(MTP) achieves similar or better accuracy than baseline frameworks

To support this claim, we run each benchmark program for 20 trials and report the average success rate. In addition, we conduct a broader evaluation with multiple LLMs on the GSM8k dataset for the math_problem benchmark. However, that evaluation requires running Llama models on local hardware, which produces variable results even with reasonable timeout limits, so this artifact only includes scripts for the experiments with OpenAI GPT models.

Evidence: Run the evaluation suite to reproduce accuracy results from Table 2 in the paper.

# (requires OpenAI API key)
cd eval

# Generate accuracy summary statistics
python overall_accuracy.py

# Generate evaluation results for the math_problem benchmark on the GSM8k dataset
python GSM8k_accuracy.py

Claim 3: Efficient Resource Usage

MTLLM(MTP) demonstrates similar or lower token usage, cost, and runtime compared to baselines

Cost is calculated using the OpenAI cost equation discussed in the paper. To measure token usage, we used instrumented versions of LMQL, DSPy, and MTLLM that record prompts and LLM responses. These instrumented versions are not included in this artifact, so the token-usage and cost evaluations cannot be reproduced here; runtime evaluation scripts are included.
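
As a hedged illustration of that calculation (per-token prices are model-specific and given in the paper, not reproduced here): cost in USD ≈ input_tokens × input_price_per_token + output_tokens × output_price_per_token.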

Evidence: Runtime measurements are captured during evaluation and match the runtime results reported in the paper.

cd eval

# The following command runs the evaluation suite and measures runtime for both MTLLM and baseline implementations:
python eval.py --config eval.config.json --impl both

Claim 4: Resilience to Coding Practices

MTLLM(MTP) demonstrates resilience to suboptimal coding practices

We evaluate the robustness of MTLLM(MTP) against poor developer coding practices. For this, we introduce seven variations of the level generator (rpg_level_gen) benchmark with differing code quality (see the hedged illustration below).
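
As a hedged illustration of the kind of variation involved (this is not one of the actual seven variants in eval/sensitivity_eval), the same MTLLM function can be written with descriptive or with uninformative identifiers:

import from mtllm.llms {OpenAI}

glob llm = OpenAI(model_name="gpt-4o");

# Descriptive identifiers give the runtime meaning to draw on.
def calculate_age(cur_year: int, dob: str) -> int by llm();

# Same signature with uninformative identifiers; the sensitivity
# experiment measures how accuracy holds up under practices like this.
def f(a: int, b: str) -> int by llm();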

Evidence: Robustness tests show MTLLM maintains performance across different implementation styles.

cd eval/sensitivity_eval

# Run the following script to generate the results.
python exp.py

Interactive Demo

Experience MTLLM(MTP) with the included RPG game that uses LLM-powered procedural level generation:

# Install game dependencies
pip install pygame

# Run the interactive RPG demo
cd jaseci/jac/examples/rpg_game/jac_impl/jac_impl_6
jac run main.jac

This demonstrates the real-world application of MTLLM for dynamic content generation in an interactive environment.

Repository Structure

mtllm-oopsla2025/
├── README.md                    # This file
├── Dockerfile                   # Docker environment setup
├── setup.bash                   # Automated setup script
├── benchmarks/                  # Evaluation benchmarks
│   ├── translation/            # Translation task implementations
│   ├── text_to_type/           # Text-to-type conversion tasks
│   ├── mcq_reason/             # Multiple choice reasoning
│   ├── math_problem/           # Mathematical problem solving
│   ├── joke_gen/               # Content generation tasks
│   ├── essay_reviewer/         # Text analysis tasks
│   ├── expert_answer/          # Domain-specific QA
│   ├── taskman/                # Task management
│   ├── rpg_level_gen/          # Game content generation
│   ├── personality_finder/     # Personality analysis
│   ├── odd_word_out/           # Pattern recognition
│   ├── wikipedia/              # Information extraction
│   └── template/               # Template for new benchmarks
├── eval/                       # Evaluation scripts and results
│   ├── eval.py                 # Main evaluation runner
│   ├── overall_accuracy.py     # Results aggregation
│   ├── requirements.txt        # Python dependencies
│   └── local_cache/            # Cached compilation artifacts
└── jaseci/                     # Core Jaseci ecosystem
    ├── jac/                    # Jac language implementation
    ├── jac-mtllm/              # MTLLM(MTP) plugin source
    ├── jac-cloud/              # Cloud deployment tools
    └── scripts/                # Utility scripts

Each benchmark directory contains three implementations:

  • *_mtllm.jac: MTP implementation
  • *_dspy.py: DSPy baseline implementation
  • *_lmql.py: LMQL baseline implementation

Troubleshooting

Common Issues

Python Version Error

ERROR: Python 3.12+ required

Solution: Upgrade Python or use the Docker environment.

API Key Error

openai.AuthenticationError: Invalid API key

Solution: Verify your OpenAI API key is set correctly:

echo $OPENAI_API_KEY  # Should display your key
export OPENAI_API_KEY="your-actual-key-here"

Package Installation Error

ERROR: Could not find a version that satisfies mtllm

Solution: Ensure you're using Python 3.12+ and run:

pip install --upgrade pip
pip install "mtllm[openai,ollama,tools]==0.3.8"

Ollama Connection Error

ConnectionError: Could not connect to Ollama

Solution: Start the Ollama service:

ollama serve
# In another terminal:
ollama pull llama2  # or your preferred model
