
Conversation

@codeflash-ai
Contributor

@codeflash-ai codeflash-ai bot commented Jan 19, 2026

⚡️ This pull request contains optimizations for PR #11114

If you approve this dependent PR, these changes will be merged into the original PR branch feat/langchain-1.0.

This PR will be automatically closed if the original PR is merged.


📄 20% (0.20x) speedup for calculate_text_metrics in src/backend/base/langflow/api/v1/knowledge_bases.py

⏱️ Runtime : 48.4 milliseconds → 40.4 milliseconds (best of 90 runs)

📝 Explanation and details

The optimized code achieves a 19% speedup by replacing an expensive string splitting operation with a more efficient regex-based counting approach.

Key Optimization

Original approach:

total_words += _to_int(text_series.str.split().str.len().sum())

Optimized approach:

total_words += _to_int(text_series.str.count(_WORD_RE).sum())

Why This Is Faster

The original code uses str.split().str.len() which:

  1. Creates intermediate Python lists for every cell by splitting on whitespace
  2. Allocates memory for these intermediate list objects
  3. Counts list lengths in a separate operation

The optimized code uses str.count(_WORD_RE) with a pre-compiled regex r"\S+" (non-whitespace sequences) which:

  1. Counts matches directly without creating intermediate data structures
  2. Operates entirely within pandas/C layer for better performance
  3. Avoids list allocations that would later need garbage collection
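The two counting strategies can be compared side by side. The following is an illustrative sketch only; the `_WORD_RE` name is taken from the PR description, but the sample data is invented for this comparison and is not from `knowledge_bases.py`:

```python
import pandas as pd

# _WORD_RE matches one run of non-whitespace characters per word;
# the real module pre-compiles it, here a plain string suffices.
_WORD_RE = r"\S+"

s = pd.Series(["hello world", "  spaced   out  ", "", "one"])

split_counts = s.str.split().str.len()  # builds a Python list per cell
regex_counts = s.str.count(_WORD_RE)    # counts matches, no intermediates

# Both strategies agree on the totals: 2 + 2 + 0 + 1 = 5 words
assert int(split_counts.sum()) == int(regex_counts.sum()) == 5
```

Note that `str.split()` with no separator already collapses runs of whitespace, which is why the two approaches agree even on padded strings; the regex version simply skips the per-cell list construction.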

From the line profiler results, this change reduces the word counting line from 75.3ms (42.5% of total time) to 45.7ms (31.2% of total time) — a ~40% improvement on this specific operation.

Test Case Performance

The optimization benefits all test cases, but shows particular gains for:

  • Large dataframes (500+ rows): More cells to process means more intermediate lists avoided
  • Text with multiple spaces: The regex approach handles all whitespace uniformly without creating empty strings in splits
  • Varied word counts: No overhead difference between single-word and multi-word cells

Impact Considerations

Since calculate_text_metrics processes knowledge base text data:

  • If called in data ingestion pipelines with large document batches, the 19% speedup compounds significantly
  • The reduction in memory allocations (no intermediate split lists) may also reduce GC pressure in long-running processes
  • The optimization maintains identical behavior for all edge cases (unicode, empty strings, special whitespace)
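For context, here is a minimal sketch of what the optimized function plausibly looks like. The signature and the missing-column/`astype(str)` behavior are inferred from the generated tests below; the actual implementation lives in `src/backend/base/langflow/api/v1/knowledge_bases.py` and may differ:

```python
import pandas as pd

_WORD_RE = r"\S+"  # non-whitespace runs; the real module pre-compiles this


def calculate_text_metrics(df: pd.DataFrame, text_columns: list) -> tuple:
    """Sketch only: signature and behavior inferred from the tests."""
    total_words = 0
    total_chars = 0
    for col in text_columns:
        if col not in df.columns:
            continue  # missing columns are silently ignored, per the tests
        text_series = df[col].astype(str).fillna("")
        total_words += int(text_series.str.count(_WORD_RE).sum())
        total_chars += int(text_series.str.len().sum())
    return total_words, total_chars


# "hello world" -> 2 words, 11 characters; "" -> 0 words, 0 characters
assert calculate_text_metrics(
    pd.DataFrame({"text": ["hello world", ""]}), ["text"]
) == (2, 11)
```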

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 42 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 91.7%
🌀 Generated Regression Tests
import numpy as np  # used to create numeric / nan scalars in tests
import pandas as pd  # calculate_text_metrics depends on pandas
# imports
import pytest  # used for our unit tests
from langflow.api.v1.knowledge_bases import calculate_text_metrics


# function to test
# (EXACT original function signature and implementation preserved)
def _to_int(value) -> int:
    """Convert a pandas/numpy scalar to int, handling different Python version behaviors."""
    if hasattr(value, "item"):
        return int(value.item())
    return int(value)


def test_basic_single_column_counts():
    # Basic: single text column with normal strings, including empty string
    df = pd.DataFrame(
        {
            "text": [
                "hello world",  # 2 words, 11 characters (including space)
                "single",       # 1 word, 6 characters
                "",             # 0 words, 0 characters
            ]
        }
    )

    # Expected calculations done explicitly and deterministically
    expected_words = 2 + 1 + 0
    expected_chars = len("hello world") + len("single") + len("")

    # Call function under test
    words, chars = calculate_text_metrics(df, ["text"])
    assert words == expected_words
    assert chars == expected_chars


def test_basic_multiple_columns_combined_counts():
    # Basic: two text columns aggregated together
    df = pd.DataFrame(
        {
            "a": ["hi there", "x"],         # 2 words, 8 chars ; 1 word, 1 char => totals 3 words, 9 chars
            "b": ["friend", "multiple words"],  # 1 word, 6 chars ; 2 words, 14 chars => totals 3 words, 20 chars
        }
    )

    # Compute expected totals column-wise, then sum
    a_words = sum(len(s.split()) for s in df["a"].astype(str))
    a_chars = sum(len(s) for s in df["a"].astype(str))
    b_words = sum(len(s.split()) for s in df["b"].astype(str))
    b_chars = sum(len(s) for s in df["b"].astype(str))

    expected_words = a_words + b_words
    expected_chars = a_chars + b_chars

    words, chars = calculate_text_metrics(df, ["a", "b"])
    assert words == expected_words
    assert chars == expected_chars


def test_ignored_missing_columns_and_order_irrelevant():
    # Edge: columns list includes a non-existent column; should be ignored without error
    df = pd.DataFrame({"text": ["one two", "three"]})

    # Include a missing column name in the list; also reorder columns in the list
    words, chars = calculate_text_metrics(df, ["missing_col", "text", "another_missing"])

    # Compute expected from only the real 'text' column
    expected_words = sum(len(x.split()) for x in df["text"].astype(str))
    expected_chars = sum(len(x) for x in df["text"].astype(str))
    assert words == expected_words
    assert chars == expected_chars


def test_non_string_types_and_nans_are_stringified_precisely():
    # Edge: Non-string types are astype(str)-converted BEFORE fillna(""),
    # so np.nan becomes 'nan' (a string), and None becomes 'None'. This behavior is deterministic
    df = pd.DataFrame(
        {
            "mixed": [123, None, True, 45.6, np.nan],  # various Python / numpy types and NaN
        }
    )

    # The function does astype(str).fillna("") so we emulate that transformation here
    transformed = df["mixed"].astype(str).fillna("")

    # After astype(str), np.nan becomes the string 'nan' (not an actual NaN),
    # so fillna has no effect on it. We compute expected based on those string values.
    expected_words = sum(len(s.split()) for s in transformed)
    expected_chars = sum(len(s) for s in transformed)

    words, chars = calculate_text_metrics(df, ["mixed"])
    assert words == expected_words
    assert chars == expected_chars


def test_leading_trailing_and_multiple_spaces_handled_by_split():
    # Edge: multiple spaces, leading/trailing spaces should not create empty-word counts
    df = pd.DataFrame(
        {
            "spaced": [
                "  leading space",      # 2 words
                "trailing space  ",     # 2 words
                "a   b    c",           # 3 words despite multiple spaces
                " single ",             # 1 word
            ]
        }
    )

    # Expected using Python's split semantics (split on any whitespace)
    expected_words = sum(len(s.split()) for s in df["spaced"].astype(str))
    expected_chars = sum(len(s) for s in df["spaced"].astype(str))

    words, chars = calculate_text_metrics(df, ["spaced"])
    assert words == expected_words
    assert chars == expected_chars


def test_empty_dataframe_and_empty_columns_list_return_zeroes():
    # Edge: empty DataFrame and empty text_columns list should both result in zero counts
    empty_df = pd.DataFrame(columns=["a", "b"])
    words1, chars1 = calculate_text_metrics(empty_df, ["a", "b"])
    assert (words1, chars1) == (0, 0)

    # If text_columns is empty, nothing is processed - should return zeros even if DataFrame has data
    df = pd.DataFrame({"x": ["some text", "other"]})
    words2, chars2 = calculate_text_metrics(df, [])
    assert (words2, chars2) == (0, 0)


def test_unicode_and_emoji_character_counts():
    # Edge: unicode characters and emoji should be counted as single Python characters by len()
    df = pd.DataFrame(
        {
            "u": [
                "naïve café",               # accented chars count individually
                "emoji 👍👍",               # two thumbs plus a word
                "漢字",                      # CJK characters: two characters
            ]
        }
    )

    # Compute expected strictly by Python's len and split
    expected_words = sum(len(x.split()) for x in df["u"].astype(str))
    expected_chars = sum(len(x) for x in df["u"].astype(str))

    words, chars = calculate_text_metrics(df, ["u"])
    assert words == expected_words
    assert chars == expected_chars


def test_return_types_are_python_int_with_numpy_scalars_present():
    # Edge: ensure that when underlying sums are numpy scalars, the function still returns Python ints
    # Create strings but rely on the fact that pandas .str.len().sum() returns numpy scalar
    df = pd.DataFrame({"text": ["a", "bb", "ccc"]})

    words, chars = calculate_text_metrics(df, ["text"])
    assert isinstance(words, int)
    assert isinstance(chars, int)


def test_large_scale_correctness_under_constraints():
    # Large Scale: generate a medium-sized dataset under the 1000-element constraint
    # Use 400 rows and 2 text columns => 800 text cells total (under 1000)
    rows = 400
    phrase1 = "alpha beta gamma delta epsilon"  # 5 words
    phrase2 = "one two three"  # 3 words
    df = pd.DataFrame(
        {
            "col1": [phrase1 for _ in range(rows)],
            "col2": [phrase2 for _ in range(rows)],
        }
    )

    # Expected words and characters computed deterministically
    expected_words = rows * (len(phrase1.split()) + len(phrase2.split()))
    expected_chars = rows * (len(phrase1) + len(phrase2))

    words, chars = calculate_text_metrics(df, ["col1", "col2"])
    assert words == expected_words
    assert chars == expected_chars


def test_handles_mixed_column_types_and_multiple_columns_selectively():
    # Mix of numeric, boolean, and textual columns, ensure only selected columns are processed
    df = pd.DataFrame(
        {
            "text": ["ok now", "more"],   # to be counted
            "num": [100, 200],            # selected but will be stringified and counted
            "ignore_me": ["x x x", "y"],  # not in the columns list; must be ignored
        }
    )

    # When selecting text and num, both will be astype(str) and included
    selected = ["text", "num"]
    words, chars = calculate_text_metrics(df, selected)

    # Compute expected manually: stringify 'num' column
    transformed_num = df["num"].astype(str)
    expected_words = sum(len(s.split()) for s in df["text"].astype(str)) + sum(len(s.split()) for s in transformed_num)
    expected_chars = sum(len(s) for s in df["text"].astype(str)) + sum(len(s) for s in transformed_num)
    assert words == expected_words
    assert chars == expected_chars
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pandas as pd
import pytest
from langflow.api.v1.knowledge_bases import calculate_text_metrics


def test_single_column_single_row():
    """Test with a single text column and a single row of data."""
    df = pd.DataFrame({"text": ["hello world"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_single_column_multiple_rows():
    """Test with a single text column and multiple rows."""
    df = pd.DataFrame({"text": ["hello", "world", "test"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_multiple_columns():
    """Test with multiple text columns."""
    df = pd.DataFrame({
        "col1": ["hello world"],
        "col2": ["foo bar"]
    })
    words, characters = calculate_text_metrics(df, ["col1", "col2"])


def test_multiple_columns_multiple_rows():
    """Test with multiple columns and multiple rows."""
    df = pd.DataFrame({
        "col1": ["hello", "world"],
        "col2": ["foo", "bar"]
    })
    words, characters = calculate_text_metrics(df, ["col1", "col2"])


def test_text_with_numbers_and_symbols():
    """Test text containing numbers and special symbols."""
    df = pd.DataFrame({"text": ["Hello123!@#"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_text_with_extra_spaces():
    """Test text with extra spaces between words."""
    df = pd.DataFrame({"text": ["hello    world"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_column_not_in_dataframe():
    """Test when specified column does not exist in dataframe."""
    df = pd.DataFrame({"other_col": ["hello world"]})
    words, characters = calculate_text_metrics(df, ["nonexistent"])


def test_mixed_existing_and_nonexisting_columns():
    """Test with a mix of existing and non-existing columns."""
    df = pd.DataFrame({"col1": ["hello world"], "col2": ["foo bar"]})
    words, characters = calculate_text_metrics(df, ["col1", "nonexistent", "col2"])


def test_empty_dataframe():
    """Test with an empty dataframe."""
    df = pd.DataFrame({"text": []})
    words, characters = calculate_text_metrics(df, ["text"])


def test_empty_text_values():
    """Test with empty string values in dataframe."""
    df = pd.DataFrame({"text": ["", "", ""]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_none_values_in_column():
    """Test with None/NaN values that get filled as empty strings."""
    df = pd.DataFrame({"text": [None, None, None]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_mixed_none_and_valid_text():
    """Test with a mix of None and valid text values."""
    df = pd.DataFrame({"text": ["hello", None, "world"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_single_word_per_row():
    """Test with single words in each row."""
    df = pd.DataFrame({"text": ["a", "b", "c"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_very_long_single_word():
    """Test with a very long word without spaces."""
    df = pd.DataFrame({"text": ["abcdefghijklmnopqrstuvwxyz"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_whitespace_only():
    """Test with whitespace-only entries."""
    df = pd.DataFrame({"text": ["   ", "\t\t", "\n\n"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_numeric_column_coerced_to_string():
    """Test that numeric columns are properly coerced to strings."""
    df = pd.DataFrame({"text": [123, 456, 789]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_float_column_coerced_to_string():
    """Test that float columns are properly coerced to strings."""
    df = pd.DataFrame({"text": [1.5, 2.5, 3.5]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_empty_column_list():
    """Test with an empty list of columns to process."""
    df = pd.DataFrame({"text": ["hello world"]})
    words, characters = calculate_text_metrics(df, [])


def test_unicode_characters():
    """Test with Unicode characters in text."""
    df = pd.DataFrame({"text": ["hello 世界"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_special_whitespace_characters():
    """Test with various whitespace characters (tabs, newlines, etc.)."""
    df = pd.DataFrame({"text": ["hello\tworld\ntest"]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_boolean_column_coerced_to_string():
    """Test that boolean columns are properly coerced to strings."""
    df = pd.DataFrame({"text": [True, False, True]})
    words, characters = calculate_text_metrics(df, ["text"])


def test_single_empty_column_name():
    """Test with an empty column name string."""
    df = pd.DataFrame({"": ["hello world"]})
    words, characters = calculate_text_metrics(df, [""])


def test_large_dataframe_single_column():
    """Test with a large dataframe containing 500 rows."""
    # Create a dataframe with 500 rows of "word1 word2"
    df = pd.DataFrame({"text": ["word1 word2"] * 500})
    words, characters = calculate_text_metrics(df, ["text"])


def test_large_dataframe_multiple_columns():
    """Test with a large dataframe containing multiple columns."""
    # Create a dataframe with 300 rows and 3 columns
    df = pd.DataFrame({
        "col1": ["text one"] * 300,
        "col2": ["text two"] * 300,
        "col3": ["text three"] * 300
    })
    words, characters = calculate_text_metrics(df, ["col1", "col2", "col3"])


def test_large_text_entries():
    """Test with large individual text entries."""
    # Create a text with many words
    large_text = " ".join(["word"] * 200)
    df = pd.DataFrame({"text": [large_text] * 10})
    words, characters = calculate_text_metrics(df, ["text"])


def test_large_dataframe_with_mixed_content():
    """Test large dataframe with varied content."""
    # Create varied content: some empty, some with text
    data = []
    for i in range(250):
        if i % 3 == 0:
            data.append("")
        elif i % 3 == 1:
            data.append("hello world")
        else:
            data.append("test")
    df = pd.DataFrame({"text": data})
    words, characters = calculate_text_metrics(df, ["text"])


def test_many_columns():
    """Test with many columns to process."""
    # Create a dataframe with 50 columns
    data = {f"col{i}": ["text"] * 20 for i in range(50)}
    df = pd.DataFrame(data)
    words, characters = calculate_text_metrics(df, list(data.keys()))


def test_large_dataframe_all_empty_strings():
    """Test large dataframe where all entries are empty strings."""
    df = pd.DataFrame({"text": [""] * 500})
    words, characters = calculate_text_metrics(df, ["text"])


def test_large_dataframe_all_whitespace():
    """Test large dataframe where all entries are whitespace."""
    df = pd.DataFrame({"text": ["   "] * 500})
    words, characters = calculate_text_metrics(df, ["text"])


def test_large_dataframe_single_long_word():
    """Test large dataframe with single very long words."""
    long_word = "a" * 1000
    df = pd.DataFrame({"text": [long_word] * 100})
    words, characters = calculate_text_metrics(df, ["text"])


def test_moderate_scale_realistic_data():
    """Test with moderately scaled realistic data."""
    # Simulate realistic text data
    texts = [
        "The quick brown fox jumps over the lazy dog",
        "Machine learning is a subset of artificial intelligence",
        "Data science involves statistics and programming",
        "Python is a popular programming language",
        "Testing is crucial for software quality"
    ]
    df = pd.DataFrame({"text": texts * 50})  # 250 rows
    words, characters = calculate_text_metrics(df, ["text"])
    
    # Calculate expected values
    expected_words = sum(len(t.split()) for t in texts) * 50
    expected_chars = sum(len(t) for t in texts) * 50
    assert words == expected_words
    assert characters == expected_chars
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-pr11114-2026-01-19T22.17.34` and push.

Codeflash

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jan 19, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 19, 2026

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the `@coderabbitai review` command.

You can disable this status message by setting `reviews.review_status` to `false` in the CodeRabbit configuration file.


Comment `@coderabbitai help` to get the list of available commands and usage tips.

@github-actions github-actions bot added the community Pull Request from an external contributor label Jan 19, 2026
@codecov

codecov bot commented Jan 19, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 34.52%. Comparing base (10fb7f1) to head (3bbd376).

Files with missing lines Patch % Lines
...rc/backend/base/langflow/api/v1/knowledge_bases.py 66.66% 1 Missing ⚠️

❌ Your project check has failed because the head coverage (41.60%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@                  Coverage Diff                   @@
##           feat/langchain-1.0   #11360      +/-   ##
======================================================
- Coverage               34.53%   34.52%   -0.01%     
======================================================
  Files                    1414     1414              
  Lines                   67207    67209       +2     
  Branches                 9910     9910              
======================================================
- Hits                    23207    23203       -4     
- Misses                  42784    42790       +6     
  Partials                 1216     1216              
Flag Coverage Δ
backend 53.50% <66.66%> (-0.03%) ⬇️
lfx 41.60% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...rc/backend/base/langflow/api/v1/knowledge_bases.py 17.10% <66.66%> (+0.62%) ⬆️

... and 5 files with indirect coverage changes


