@wmoustafa
Contributor

Introduce Symbolic Constraint Solver for SQL-Driven Data Generation

Overview

This PR introduces coral-data-generation, a symbolic constraint solver that inverts SQL expressions to derive input domain constraints. Instead of forward evaluation (generate → test → reject), it solves backward from predicates to derive what inputs must satisfy, enabling efficient test data generation with guaranteed constraint satisfaction.

Motivation

Problem: Traditional test data generation uses rejection sampling—generate random values, evaluate SQL predicates, discard mismatches. This is inefficient for complex nested expressions and cannot detect unsatisfiable queries.

Solution: Symbolic inversion treats SQL expressions as mathematical transformations with inverse functions. Starting from output constraints (e.g., = '50'), the system walks expression trees inward, applying inverse operations to derive input domains.

Examples

1. Nested String Operations

WHERE LOWER(SUBSTRING(name, 1, 3)) = 'abc'
→ name ∈ RegexDomain("^[aA][bB][cC].*$")
Generates: "Abc", "ABC123", "abcdef"
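
As a rough illustration of the inversion in this example, the sketch below builds the case-insensitive prefix regex by hand. It is not the PR's implementation: the class and method names are hypothetical, and it uses java.util.regex only to check candidates, whereas the module itself is described as automaton-backed.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the PR's LowerRegexTransformer/SubstringRegexTransformer.
final class LowerInversionSketch {

  // Regex fragment matching every string whose lowercase form equals `lowered`.
  // Assumes `lowered` has no uppercase letters; otherwise LOWER(x) could never equal it.
  static String invertLower(String lowered) {
    StringBuilder sb = new StringBuilder();
    for (char c : lowered.toCharArray()) {
      char upper = Character.toUpperCase(c);
      if (upper != c) {
        sb.append('[').append(c).append(upper).append(']'); // e.g. 'a' -> [aA]
      } else {
        sb.append(Pattern.quote(String.valueOf(c))); // digits, punctuation: keep literal
      }
    }
    return sb.toString();
  }

  // SUBSTRING(x, 1, n) = <fragment> pins the first n characters and leaves the rest free.
  static String invertPrefixSubstring(String fragment) {
    return "^" + fragment + ".*$";
  }

  public static void main(String[] args) {
    String regex = invertPrefixSubstring(invertLower("abc"));
    System.out.println(regex);
    for (String candidate : new String[] {"Abc", "ABC123", "abcdef", "xabc"}) {
      System.out.println(candidate + " -> " + Pattern.matches(regex, candidate));
    }
  }
}
```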

2. Cross-Domain Arithmetic

WHERE CAST(age * 2 AS STRING) = '50'
→ age ∈ IntegerDomain([25])

3. Date Extraction with Type Casting

WHERE SUBSTRING(CAST(birthdate AS STRING), 1, 4) = '2000'
→ birthdate ∈ DateDomain ∩ RegexDomain("^2000-.*$")
Generates: 2000-01-15, 2000-12-31, 2000-06-20

4. Complex Nested Substring

WHERE SUBSTRING(SUBSTRING(product_code, 5, 10), 1, 3) = 'XYZ'
→ product_code must have 'XYZ' at positions 5-7
→ product_code ∈ RegexDomain("^.{4}XYZ.*$")
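
The offset arithmetic behind this example can be sketched as follows. This is a simplified stand-in with hypothetical names: it only covers the case where the literal fills the outer window and the outer window fits inside the inner one, and it includes a forward `sqlSubstring` helper (an assumption about 1-based, truncating SUBSTRING semantics) to check the result.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of composing two SUBSTRING offsets into one positional regex.
final class NestedSubstringSketch {

  // Forward semantics used for checking: 1-based start, result truncated at end of string.
  static String sqlSubstring(String s, int start, int len) {
    int from = Math.min(start - 1, s.length());
    int to = Math.min(from + len, s.length());
    return s.substring(from, to);
  }

  // Inverts SUBSTRING(SUBSTRING(x, innerStart, innerLen), outerStart, outerLen) = literal.
  static String invert(int innerStart, int innerLen, int outerStart, int outerLen, String literal) {
    boolean fitsOuter = literal.length() == outerLen;
    boolean fitsInner = outerStart - 1 + outerLen <= innerLen;
    if (innerStart < 1 || outerStart < 1 || !fitsOuter || !fitsInner) {
      throw new IllegalArgumentException("case not covered by this sketch");
    }
    // Characters of x that precede the literal: both 1-based starts contribute start - 1.
    int skip = (innerStart - 1) + (outerStart - 1);
    return "^.{" + skip + "}" + Pattern.quote(literal) + ".*$";
  }

  public static void main(String[] args) {
    String regex = invert(5, 10, 1, 3, "XYZ");
    String candidate = "ABCDXYZ123";
    System.out.println(Pattern.matches(regex, candidate));
    // Forward check: evaluate the original expression on the candidate.
    System.out.println(sqlSubstring(sqlSubstring(candidate, 5, 10), 1, 3));
  }
}
```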

5. Contradiction Detection

WHERE SUBSTRING(name, 1, 4) = '2000' AND SUBSTRING(name, 1, 4) = '1999'
→ Empty domain (unsatisfiable - no data generated)
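
One way to picture why this conjunction is unsatisfiable: each predicate fixes characters at known positions, and the intersection fails as soon as two predicates disagree on a position. The sketch below uses a plain position-to-character map as a stand-in for automaton intersection; all names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch: positional constraints instead of a real automaton intersection.
final class ContradictionSketch {

  // SUBSTRING(col, start, literal.length()) = literal fixes characters at 0-based positions.
  static Map<Integer, Character> fromSubstringEquals(int start, String literal) {
    Map<Integer, Character> fixed = new TreeMap<>();
    for (int i = 0; i < literal.length(); i++) {
      fixed.put(start - 1 + i, literal.charAt(i));
    }
    return fixed;
  }

  // Intersection of two constraints; Optional.empty() represents the empty domain.
  static Optional<Map<Integer, Character>> intersect(
      Map<Integer, Character> a, Map<Integer, Character> b) {
    Map<Integer, Character> merged = new TreeMap<>(a);
    for (Map.Entry<Integer, Character> e : b.entrySet()) {
      Character existing = merged.putIfAbsent(e.getKey(), e.getValue());
      if (existing != null && !existing.equals(e.getValue())) {
        return Optional.empty(); // same position, different character: unsatisfiable
      }
    }
    return Optional.of(merged);
  }

  public static void main(String[] args) {
    Map<Integer, Character> p1 = fromSubstringEquals(1, "2000");
    Map<Integer, Character> p2 = fromSubstringEquals(1, "1999");
    System.out.println("satisfiable: " + intersect(p1, p2).isPresent());
  }
}
```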

6. Date String Pattern Matching

WHERE CAST(order_date AS STRING) LIKE '2024-12-%'
→ order_date ∈ RegexDomain("^2024-12-.*$") ∩ DateFormatConstraint
Generates: 2024-12-01, 2024-12-15, 2024-12-31
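
A minimal sketch of the LIKE-to-regex step, combined with a date-format check, might look like this. It assumes CAST(date AS STRING) yields ISO yyyy-MM-dd text and ignores LIKE escape characters; the class and method names are hypothetical and the check uses java.util.regex and java.time rather than the module's own domains.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical sketch: LIKE pattern -> anchored regex, intersected with "is a valid ISO date".
final class LikeToRegexSketch {

  // '%' becomes ".*", '_' becomes ".", everything else is quoted as a literal.
  static String likeToRegex(String like) {
    StringBuilder regex = new StringBuilder("^");
    StringBuilder literal = new StringBuilder();
    for (char c : like.toCharArray()) {
      if (c == '%' || c == '_') {
        if (literal.length() > 0) {
          regex.append(Pattern.quote(literal.toString()));
          literal.setLength(0);
        }
        regex.append(c == '%' ? ".*" : ".");
      } else {
        literal.append(c);
      }
    }
    if (literal.length() > 0) {
      regex.append(Pattern.quote(literal.toString()));
    }
    return regex.append('$').toString();
  }

  // Stand-in for the date-format constraint: the string must parse as an ISO local date.
  static boolean isIsoDate(String s) {
    try {
      LocalDate.parse(s);
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }

  static boolean inDomain(String candidate, String like) {
    return Pattern.matches(likeToRegex(like), candidate) && isIsoDate(candidate);
  }

  public static void main(String[] args) {
    for (String s : new String[] {"2024-12-01", "2024-12-31", "2024-12-32", "2024-11-15"}) {
      System.out.println(s + " -> " + inDomain(s, "2024-12-%"));
    }
  }
}
```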

Key Components

1. Domain System

  • Domain<T, D>: Abstract constraint representation supporting intersection, union, emptiness checking
  • RegexDomain: Automaton-backed string constraints (powered by dk.brics.automaton)
  • IntegerDomain: Interval-based numeric constraints with arithmetic closure
  • Cross-domain conversions: CastRegexTransformer bridges string ↔ numeric types
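
To make the interval-based representation concrete, here is a small sketch of how a list of closed integer intervals could be normalized (overlapping and adjacent intervals merged) and intersected. It is an illustration with hypothetical names, using `long[]{lo, hi}` pairs rather than the PR's IntegerDomain.Interval, and it ignores overflow at Long.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of interval-list normalization and intersection.
final class IntervalMergeSketch {

  // Sorts by lower bound and merges intervals that overlap or touch (e.g. [1,10] and [11,20]).
  static List<long[]> normalize(List<long[]> intervals) {
    List<long[]> sorted = new ArrayList<>(intervals);
    sorted.sort(Comparator.comparingLong(iv -> iv[0]));
    List<long[]> merged = new ArrayList<>();
    for (long[] iv : sorted) {
      if (!merged.isEmpty()) {
        long[] last = merged.get(merged.size() - 1);
        if (iv[0] <= last[1] + 1) {
          last[1] = Math.max(last[1], iv[1]);
          continue;
        }
      }
      merged.add(new long[] {iv[0], iv[1]});
    }
    return merged;
  }

  // Two-pointer intersection of two normalized interval lists; an empty list is the empty domain.
  static List<long[]> intersect(List<long[]> a, List<long[]> b) {
    List<long[]> out = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < a.size() && j < b.size()) {
      long lo = Math.max(a.get(i)[0], b.get(j)[0]);
      long hi = Math.min(a.get(i)[1], b.get(j)[1]);
      if (lo <= hi) {
        out.add(new long[] {lo, hi});
      }
      if (a.get(i)[1] < b.get(j)[1]) {
        i++;
      } else {
        j++;
      }
    }
    return out;
  }

  static String format(List<long[]> intervals) {
    StringBuilder sb = new StringBuilder();
    for (long[] iv : intervals) {
      if (sb.length() > 0) {
        sb.append(", ");
      }
      sb.append(Arrays.toString(iv));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<long[]> d1 = normalize(Arrays.asList(new long[] {1, 20}, new long[] {30, 50}));
    List<long[]> d2 = normalize(Arrays.asList(new long[] {10, 35}, new long[] {45, 60}));
    System.out.println(format(intersect(d1, d2)));
  }
}
```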

2. Transformer Architecture

Pluggable symbolic inversion functions implementing DomainTransformer:

  • SubstringRegexTransformer: Inverts SUBSTRING(x, start, len) with positional constraints
  • LowerRegexTransformer: Inverts LOWER(x) via case-insensitive regex generation
  • CastRegexTransformer: Cross-domain CAST inversion (string ↔ integer ↔ date)
  • PlusRegexTransformer: Arithmetic inversion: x + c = value → x = value - c
  • TimesRegexTransformer: Multiplication inversion: x * c = value → x = value / c
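
For integer inputs, the two arithmetic inversions can be sketched at the interval level as below. This is an assumption-laden illustration, not the PR's PlusRegexTransformer or TimesRegexTransformer (which, by their names, work on regex domains): the class and method names are hypothetical, bounds are closed `long` intervals, and overflow is ignored. The point it shows is that dividing must round inward, so a value that is not a multiple of c yields an empty domain.

```java
import java.util.Optional;

// Hypothetical sketch: inverting x + c and x * c over a closed integer interval [lo, hi].
final class ArithmeticInversionSketch {

  // x + c in [lo, hi]  =>  x in [lo - c, hi - c]
  static long[] invertPlus(long lo, long hi, long c) {
    return new long[] {lo - c, hi - c};
  }

  // Ceiling division for a positive divisor.
  static long ceilDiv(long a, long b) {
    return -Math.floorDiv(-a, b);
  }

  // x * c in [lo, hi]  =>  x in [ceil(lo / c), floor(hi / c)] for c > 0 (bounds flip for c < 0).
  // Returns Optional.empty() when no integer x satisfies the constraint.
  static Optional<long[]> invertTimes(long lo, long hi, long c) {
    if (c == 0) {
      throw new IllegalArgumentException("c == 0 is not covered by this sketch");
    }
    if (c < 0) {
      // x * c in [lo, hi] is the same as x * (-c) in [-hi, -lo]
      return invertTimes(-hi, -lo, -c);
    }
    long newLo = ceilDiv(lo, c);
    long newHi = Math.floorDiv(hi, c);
    return newLo <= newHi ? Optional.of(new long[] {newLo, newHi}) : Optional.empty();
  }

  public static void main(String[] args) {
    // age * 2 + 5 = 25: undo "+ 5" first, then "* 2"
    long[] afterPlus = invertPlus(25, 25, 5);
    Optional<long[]> age = invertTimes(afterPlus[0], afterPlus[1], 2);
    System.out.println(age.map(iv -> "[" + iv[0] + ", " + iv[1] + "]").orElse("empty"));
  }
}
```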

3. Relational Preprocessing

Normalizes Calcite RelNode trees for symbolic analysis:

  • ProjectPullUpController: Fixed-point projection normalization
  • CanonicalPredicateExtractor: Extracts predicates with global field indexing
  • DnfRewriter: Converts to Disjunctive Normal Form for independent disjunct solving
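
To illustrate why DNF helps, the sketch below distributes AND over OR on a tiny expression tree so that each resulting conjunction can be solved on its own. It is a simplified stand-in with hypothetical types: it works on plain strings rather than Calcite RexNode/RelNode trees and does not handle NOT.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of DNF conversion over a minimal AND/OR expression tree.
final class DnfSketch {

  interface Expr {}

  static final class Atom implements Expr {
    final String text;
    Atom(String text) { this.text = text; }
  }

  static final class And implements Expr {
    final Expr left;
    final Expr right;
    And(Expr left, Expr right) { this.left = left; this.right = right; }
  }

  static final class Or implements Expr {
    final Expr left;
    final Expr right;
    Or(Expr left, Expr right) { this.left = left; this.right = right; }
  }

  // Returns a list of disjuncts; each disjunct is a list of atoms that must all hold.
  static List<List<String>> toDnf(Expr e) {
    List<List<String>> out = new ArrayList<>();
    if (e instanceof Atom) {
      out.add(Collections.singletonList(((Atom) e).text));
    } else if (e instanceof Or) {
      out.addAll(toDnf(((Or) e).left));
      out.addAll(toDnf(((Or) e).right));
    } else {
      And and = (And) e;
      List<List<String>> rights = toDnf(and.right);
      for (List<String> l : toDnf(and.left)) {
        for (List<String> r : rights) {
          List<String> conjunction = new ArrayList<>(l);
          conjunction.addAll(r);
          out.add(conjunction); // (l AND r) for every pair of disjuncts
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Expr predicate = new And(new Atom("a = 1"), new Or(new Atom("b = 2"), new Atom("c = 3")));
    System.out.println(toDnf(predicate));
  }
}
```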

4. Solver

DomainInferenceProgram: Top-down expression tree traversal with domain refinement at each step, detecting contradictions via empty domain intersection.

Technical Approach

Symbolic Inversion: For nested expression f(g(h(x))) = constant:

  1. Create output domain from constant
  2. Apply f⁻¹ → intermediate domain
  3. Apply g⁻¹ → refined domain
  4. Apply h⁻¹ → input constraint on x
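
The steps above can be sketched as a fold over a list of inverse functions, applied outermost first. The example below is a toy version under stated assumptions: domains are finite sets of longs rather than the PR's Domain types, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Hypothetical sketch: solving f(g(x)) = constant by applying f's inverse, then g's inverse.
final class InversionChainSketch {

  // Inverse of "v + c": every output value v maps back to v - c.
  static UnaryOperator<Set<Long>> inversePlus(long c) {
    return out -> {
      Set<Long> in = new TreeSet<>();
      for (long v : out) {
        in.add(v - c);
      }
      return in;
    };
  }

  // Inverse of "v * c" over integers: only multiples of c have a preimage.
  static UnaryOperator<Set<Long>> inverseTimes(long c) {
    if (c == 0) {
      throw new IllegalArgumentException("c == 0 is not covered by this sketch");
    }
    return out -> {
      Set<Long> in = new TreeSet<>();
      for (long v : out) {
        if (v % c == 0) {
          in.add(v / c);
        }
      }
      return in;
    };
  }

  // Walks from the output domain inward; stops early once the domain is empty.
  static Set<Long> solve(Set<Long> outputDomain, List<UnaryOperator<Set<Long>>> inversesOutermostFirst) {
    Set<Long> domain = outputDomain;
    for (UnaryOperator<Set<Long>> inverse : inversesOutermostFirst) {
      domain = inverse.apply(domain);
      if (domain.isEmpty()) {
        break; // contradiction: no input can produce the required output
      }
    }
    return domain;
  }

  public static void main(String[] args) {
    // x * 2 + 5 = 25: the outermost operation is "+ 5", then "* 2"
    Set<Long> x = solve(Collections.singleton(25L), Arrays.asList(inversePlus(5), inverseTimes(2)));
    System.out.println(x);
  }
}
```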

Contradiction Detection: Multiple predicates on same variable → domain intersection. Empty result = unsatisfiable query.

Extensibility: Architecture supports multi-table inference (join propagation), fixed-point iteration (recursive constraints), and arbitrary domain types (date, decimal, enum).

Testing

Integration Tests (RegexDomainInferenceProgramTest): 14+ test scenarios covering simple/nested transformations, cross-domain CAST operations, arithmetic inversion, and contradiction detection. All tests validate that generated samples satisfy the original SQL predicates.

Documentation

This module comes with a comprehensive README covering the conceptual model, examples, and API reference.

Future Extensibility

The architecture naturally extends to additional domains (DecimalDomain, DateDomain), more transformers (CONCAT, REGEXP_EXTRACT), multi-table inference (join constraint propagation), and aggregate support (cardinality constraints).


@simbadzina left a comment

Halfway through the README.md. Will continue reading and then proceed to the code.

How does the system in general handle expressions where the values depend on each other?
E.g.,
SELECT * FROM test.suitcase WHERE width + height + length < 25

Does this need a new domain type?

Comment on lines +1 to +194
/**
 * Copyright 2025 LinkedIn Corporation. All rights reserved.
 * Licensed under the BSD-2 Clause license.
 * See LICENSE in the project root for license information.
 */
package com.linkedin.coral.datagen.domain;

import java.util.Arrays;
import java.util.List;

import org.testng.annotations.Test;


/**
 * Tests for IntegerDomain class.
 */
public class IntegerDomainTest {

  @Test
  public void testSingleValue() {
    System.out.println("\n=== Single Value Test ===");
    IntegerDomain domain = IntegerDomain.of(42);
    System.out.println("Domain: " + domain);
    System.out.println("Is empty: " + domain.isEmpty());
    System.out.println("Contains 42: " + domain.contains(42));
    System.out.println("Contains 43: " + domain.contains(43));
    System.out.println("Samples: " + domain.sampleValues(5));
  }

  @Test
  public void testSingleInterval() {
    System.out.println("\n=== Single Interval Test ===");
    IntegerDomain domain = IntegerDomain.of(10, 20);
    System.out.println("Domain: " + domain);
    System.out.println("Contains 10: " + domain.contains(10));
    System.out.println("Contains 15: " + domain.contains(15));
    System.out.println("Contains 20: " + domain.contains(20));
    System.out.println("Contains 21: " + domain.contains(21));
    System.out.println("Samples: " + domain.sampleValues(5));
  }

  @Test
  public void testMultipleIntervals() {
    System.out.println("\n=== Multiple Intervals Test ===");
    List<IntegerDomain.Interval> intervals = Arrays.asList(new IntegerDomain.Interval(1, 5),
        new IntegerDomain.Interval(10, 15), new IntegerDomain.Interval(20, 30));
    IntegerDomain domain = IntegerDomain.of(intervals);
    System.out.println("Domain: " + domain);
    System.out.println("Contains 3: " + domain.contains(3));
    System.out.println("Contains 7: " + domain.contains(7));
    System.out.println("Contains 12: " + domain.contains(12));
    System.out.println("Contains 25: " + domain.contains(25));
    System.out.println("Samples: " + domain.sampleValues(10));
  }

  @Test
  public void testIntersection() {
    System.out.println("\n=== Intersection Test ===");
    IntegerDomain domain1 = IntegerDomain.of(1, 20);
    IntegerDomain domain2 = IntegerDomain.of(10, 30);
    IntegerDomain intersection = domain1.intersect(domain2);
    System.out.println("Domain 1: " + domain1);
    System.out.println("Domain 2: " + domain2);
    System.out.println("Intersection: " + intersection);
    System.out.println("Samples: " + intersection.sampleValues(5));
  }

  @Test
  public void testUnion() {
    System.out.println("\n=== Union Test ===");
    IntegerDomain domain1 = IntegerDomain.of(1, 10);
    IntegerDomain domain2 = IntegerDomain.of(20, 30);
    IntegerDomain union = domain1.union(domain2);
    System.out.println("Domain 1: " + domain1);
    System.out.println("Domain 2: " + domain2);
    System.out.println("Union: " + union);
    System.out.println("Samples: " + union.sampleValues(10));
  }

  @Test
  public void testAddConstant() {
    System.out.println("\n=== Add Constant Test ===");
    IntegerDomain domain = IntegerDomain.of(10, 20);
    IntegerDomain shifted = domain.add(5);
    System.out.println("Original domain: " + domain);
    System.out.println("After adding 5: " + shifted);
    System.out.println("Samples: " + shifted.sampleValues(5));
  }

  @Test
  public void testMultiplyConstant() {
    System.out.println("\n=== Multiply Constant Test ===");
    IntegerDomain domain = IntegerDomain.of(10, 20);
    IntegerDomain scaled = domain.multiply(2);
    System.out.println("Original domain: " + domain);
    System.out.println("After multiplying by 2: " + scaled);
    System.out.println("Samples: " + scaled.sampleValues(5));
  }

  @Test
  public void testNegativeMultiply() {
    System.out.println("\n=== Negative Multiply Test ===");
    IntegerDomain domain = IntegerDomain.of(10, 20);
    IntegerDomain scaled = domain.multiply(-1);
    System.out.println("Original domain: " + domain);
    System.out.println("After multiplying by -1: " + scaled);
    System.out.println("Samples: " + scaled.sampleValues(5));
  }

  @Test
  public void testOverlappingIntervalsMerge() {
    System.out.println("\n=== Overlapping Intervals Merge Test ===");
    List<IntegerDomain.Interval> intervals = Arrays.asList(new IntegerDomain.Interval(1, 10),
        new IntegerDomain.Interval(5, 15), new IntegerDomain.Interval(20, 30));
    IntegerDomain domain = IntegerDomain.of(intervals);
    System.out.println("Input intervals: [1, 10], [5, 15], [20, 30]");
    System.out.println("Merged domain: " + domain);
    System.out.println("Samples: " + domain.sampleValues(10));
  }

  @Test
  public void testAdjacentIntervalsMerge() {
    System.out.println("\n=== Adjacent Intervals Merge Test ===");
    List<IntegerDomain.Interval> intervals = Arrays.asList(new IntegerDomain.Interval(1, 10),
        new IntegerDomain.Interval(11, 20), new IntegerDomain.Interval(30, 40));
    IntegerDomain domain = IntegerDomain.of(intervals);
    System.out.println("Input intervals: [1, 10], [11, 20], [30, 40]");
    System.out.println("Merged domain: " + domain);
    System.out.println("Samples: " + domain.sampleValues(10));
  }

  @Test
  public void testEmptyDomain() {
    System.out.println("\n=== Empty Domain Test ===");
    IntegerDomain empty = IntegerDomain.empty();
    System.out.println("Empty domain: " + empty);
    System.out.println("Is empty: " + empty.isEmpty());
    System.out.println("Samples: " + empty.sampleValues(5));
  }

  @Test
  public void testIntersectionEmpty() {
    System.out.println("\n=== Intersection Empty Test ===");
    IntegerDomain domain1 = IntegerDomain.of(1, 10);
    IntegerDomain domain2 = IntegerDomain.of(20, 30);
    IntegerDomain intersection = domain1.intersect(domain2);
    System.out.println("Domain 1: " + domain1);
    System.out.println("Domain 2: " + domain2);
    System.out.println("Intersection: " + intersection);
    System.out.println("Is empty: " + intersection.isEmpty());
  }

  @Test
  public void testComplexArithmetic() {
    System.out.println("\n=== Complex Arithmetic Test ===");
    // Solve: 2*x + 5 = 25, where x in [0, 100]
    // => 2*x = 20
    // => x = 10
    IntegerDomain output = IntegerDomain.of(25);
    IntegerDomain afterSubtract = output.add(-5); // x = 20
    IntegerDomain solution = afterSubtract.multiply(1).intersect(IntegerDomain.of(0, 100));

    System.out.println("Equation: 2*x + 5 = 25");
    System.out.println("Output domain: " + output);
    System.out.println("After subtracting 5: " + afterSubtract);
    System.out.println("Solution (x must be in [0, 100]): " + solution);

    // Verify
    if (!solution.isEmpty()) {
      long x = solution.sampleValues(1).get(0);
      System.out.println("Sample x: " + x);
      System.out.println("Verification: 2*" + x + " + 5 = " + (2 * x + 5));
    }
  }

  @Test
  public void testMultiIntervalIntersection() {
    System.out.println("\n=== Multi-Interval Intersection Test ===");
    List<IntegerDomain.Interval> intervals1 =
        Arrays.asList(new IntegerDomain.Interval(1, 20), new IntegerDomain.Interval(30, 50));
    List<IntegerDomain.Interval> intervals2 =
        Arrays.asList(new IntegerDomain.Interval(10, 35), new IntegerDomain.Interval(45, 60));

    IntegerDomain domain1 = IntegerDomain.of(intervals1);
    IntegerDomain domain2 = IntegerDomain.of(intervals2);
    IntegerDomain intersection = domain1.intersect(domain2);

    System.out.println("Domain 1: " + domain1);
    System.out.println("Domain 2: " + domain2);
    System.out.println("Intersection: " + intersection);
    System.out.println("Expected: [10, 20] ∪ [30, 35] ∪ [45, 50]");
    System.out.println("Samples: " + intersection.sampleValues(15));
  }
}

These tests don't have assertions. Some other files have tests like these too.

  @Test
  public void testArithmeticExpression() {
    testDomainInference("Arithmetic Expression Test", "SELECT * FROM test.T WHERE age * 2 + 5 = 25", inputDomain -> {
      assertTrue(inputDomain instanceof IntegerDomain, "Should be IntegerDomain");

Even when there is an error, a new test I'm adding still passes:

  @Test
  public void testMultiVariateArithmeticExpression() {
    testDomainInference("Arithmetic Expression Test", "SELECT * FROM test.suitcase WHERE width + height + length < 25", inputDomain -> {
      assertTrue(inputDomain instanceof IntegerDomain, "Should be IntegerDomain");
      IntegerDomain intDomain = (IntegerDomain) inputDomain;
      System.out.println(intDomain);
      assertTrue(intDomain.contains(10), "Should contain 10 (since 10 * 2 + 5 = 25)");
      assertTrue(intDomain.contains(10), "Should contain 10 (since 10 * 2 + 5 = 25)");
      assertTrue(intDomain.isSingleton(), "Should be singleton");
    });
  }

2 participants