
Conversation


@codeflash-ai codeflash-ai bot commented Jan 19, 2026

⚡️ This pull request contains optimizations for PR #11114

If you approve this dependent PR, these changes will be merged into the original PR branch feat/langchain-1.0.

This PR will be automatically closed if the original PR is merged.


📄 66% (0.66x) speedup for calculate_text_metrics in src/backend/base/langflow/api/v1/knowledge_bases.py

⏱️ Runtime: 80.2 milliseconds → 48.3 milliseconds (best of 74 runs)

📝 Explanation and details

The optimized code achieves a 66% speedup by eliminating redundant pandas string operations in a loop. Here's why it's faster:

Key Optimization: Batch Processing Over Iteration

Original approach: Iterates through each text column, applying astype(str), fillna(""), str.len(), and str.split() separately for each column. This triggers pandas overhead (method dispatch, memory allocation, intermediate series creation) repeatedly—174 times in the profiler results.
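
A minimal sketch of such a per-column loop (the _original/_batched names used in these sketches are only for side-by-side illustration; the signature and return shape are taken from the generated tests below, and this is a reconstruction from the description rather than the PR's exact code):

import pandas as pd

def calculate_text_metrics_original(df: pd.DataFrame, text_columns: list[str]) -> tuple[int, int]:
    """Per-column variant: every column pays the pandas string-method overhead separately."""
    total_words = 0
    total_chars = 0
    for col in text_columns:
        if col not in df.columns:
            continue
        # astype(str) runs before fillna(""), so NaN/None become the literal strings "nan"/"None"
        col_text = df[col].astype(str).fillna("")
        total_chars += int(col_text.str.len().sum())
        total_words += int(col_text.str.split().str.len().sum())
    return total_words, total_chars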

Optimized approach (a code sketch follows the list):

  1. Filters valid columns upfront with a list comprehension
  2. Concatenates all text columns into a single series using pd.concat()
  3. Applies string operations (str.len() and str.split()) once on the combined series
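
A corresponding sketch of the batched variant described in the list above (again a reconstruction for illustration, not the exact merged code):

import pandas as pd

def calculate_text_metrics_batched(df: pd.DataFrame, text_columns: list[str]) -> tuple[int, int]:
    """Batched variant: filter columns once, concatenate them, then run the string ops a single time."""
    valid_cols = [col for col in text_columns if col in df.columns]
    if not valid_cols:
        return 0, 0
    # Same order of operations as the original: astype(str) first, then fillna("")
    combined = pd.concat(
        [df[col].astype(str).fillna("") for col in valid_cols], ignore_index=True
    )
    total_chars = int(combined.str.len().sum())
    total_words = int(combined.str.split().str.len().sum())
    return total_words, total_chars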

Why This Works

Pandas string methods have significant per-call overhead. The line profiler shows:

  • Original: 174 iterations spending ~287ms total on string operations (lines 7-8)
  • Optimized: Single operation spending ~79ms (lines 5-6)

The pd.concat() cost (~94ms) is more than offset by eliminating 173 redundant string method calls. This batching reduces:

  • Function call overhead in pandas vectorized operations
  • Memory allocation for intermediate series objects
  • String operation setup/teardown cycles
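
To sanity-check the effect locally, a rough benchmark of the two sketches above could look like this (purely illustrative; the timings will not match the profiler figures quoted above):

import timeit

import pandas as pd

# Assumes calculate_text_metrics_original and calculate_text_metrics_batched
# from the sketches above are defined in the same module.
df = pd.DataFrame({f"col_{i}": ["hello world"] * 500 for i in range(50)})
cols = list(df.columns)

t_loop = timeit.timeit(lambda: calculate_text_metrics_original(df, cols), number=20)
t_batch = timeit.timeit(lambda: calculate_text_metrics_batched(df, cols), number=20)
print(f"per-column loop: {t_loop:.3f}s   batched concat: {t_batch:.3f}s")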

Test Case Performance

Based on annotated tests, the optimization excels when:

  • Multiple columns are processed (tests like test_multiple_columns_aggregate_counts, test_large_dataframe_multiple_columns) – more columns = higher relative gain from batch processing
  • Large DataFrames with many rows (e.g., 500-row tests) – amortizes the concat overhead
  • Mixed valid/invalid columns – early filtering with list comprehension is more efficient than repeated if col not in df.columns checks inside the loop

The optimization maintains identical behavior for all edge cases (NaN handling, type conversion, empty strings) since the order of operations (astype(str).fillna("")) is preserved.
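
A quick standalone illustration of why that ordering matters: because astype(str) runs before fillna(""), missing values become the literal strings "None"/"nan" and each one counts as a word.

import numpy as np
import pandas as pd

s = pd.Series([None, np.nan, "hello world"])
converted = s.astype(str).fillna("")  # fillna("") is then a no-op: nothing is NaN anymore
print(converted.tolist())                           # ['None', 'nan', 'hello world']
print(int(converted.str.len().sum()))               # 4 + 3 + 11 = 18 characters
print(int(converted.str.split().str.len().sum()))   # 1 + 1 + 2 = 4 words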

Impact Assessment

Without function_references (call-site information for this function), the specific deployment context is unclear. However, this function likely processes knowledge base content where text metrics inform chunking strategies or resource allocation. The 66% speedup would significantly benefit workflows that:

  • Process multiple text columns per document
  • Handle large document collections in batch
  • Repeatedly calculate metrics during indexing pipelines

Correctness verification report:

Test                           Status
⚙️ Existing Unit Tests         🔘 None Found
🌀 Generated Regression Tests  64 Passed
⏪ Replay Tests                🔘 None Found
🔎 Concolic Coverage Tests     🔘 None Found
📊 Tests Coverage              100.0%

🌀 Generated Regression Tests
# imports
import numpy as np  # used to create NaN values for edge-case tests
import pandas as pd  # required because the function operates on pandas.DataFrame
import pytest  # used for our unit tests

# function to test
from langflow.api.v1.knowledge_bases import calculate_text_metrics

# unit tests

# Basic functionality tests

def test_basic_single_column_counts():
    # Single text column with simple phrases.
    df = pd.DataFrame({"text": ["hello world", "foo", ""]})
    # Expected: "hello world" -> 2 words, 11 chars (including space)
    # "foo" -> 1 word, 3 chars
    # "" -> 0 words, 0 chars
    expected_words = 2 + 1 + 0
    expected_chars = len("hello world") + len("foo") + len("")
    codeflash_output = calculate_text_metrics(df, ["text"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

def test_multiple_columns_aggregate_counts():
    # Two text columns - ensure both are aggregated.
    df = pd.DataFrame({
        "a": ["one two", "three"],
        "b": ["x y z", ""],
        "irrelevant": [123, 456]
    })
    # Column 'a': "one two"(2 words, 7 chars) + "three"(1 word, 5 chars)
    # Column 'b': "x y z"(3 words, 5 chars) + ""(0,0)
    expected_words = 2 + 1 + 3 + 0
    expected_chars = len("one two") + len("three") + len("x y z") + len("")
    codeflash_output = calculate_text_metrics(df, ["a", "b"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

def test_missing_columns_are_ignored():
    # If a column in text_columns is not present, it should be skipped (no error).
    df = pd.DataFrame({"a": ["alpha beta"]})
    # Requesting a missing column 'missing' should be ignored.
    expected_words = len(str("alpha beta").split())
    expected_chars = len(str("alpha beta"))
    codeflash_output = calculate_text_metrics(df, ["a", "missing"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

# Edge cases

def test_nan_and_none_are_converted_to_strings_and_counted():
    # Important edge: the implementation does astype(str) then fillna(""),
    # which means None and np.nan become the literal strings "None" and "nan".
    df = pd.DataFrame({"text": [None, np.nan, ""]})
    # Recreate the behavior: function effectively uses str(value) for each cell
    values_as_strings = [str(v) for v in [None, np.nan, ""]]
    expected_chars = sum(len(s) for s in values_as_strings)
    expected_words = sum(len(s.split()) for s in values_as_strings)
    # Validate that calculate_text_metrics follows this behavior (counts "None" and "nan")
    codeflash_output = calculate_text_metrics(df, ["text"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

def test_empty_dataframe_and_empty_columns_list():
    # Empty DataFrame should result in zero counts regardless of requested columns.
    empty_df = pd.DataFrame(columns=["a", "b"])
    codeflash_output = calculate_text_metrics(empty_df, [])
    assert codeflash_output == (0, 0)
    # Requesting columns that exist but dataframe has no rows
    codeflash_output = calculate_text_metrics(empty_df, ["a", "b"]); result = codeflash_output
    assert result == (0, 0)

def test_non_string_types_numbers_and_booleans():
    # Numeric and boolean values should be converted to strings and counted accordingly.
    df = pd.DataFrame({
        "mixed": [123, 45.6, True, False, "end"]
    })
    # Expected behavior: str(123) -> "123" (1 word, 3 chars), str(45.6) -> "45.6", etc.
    values_as_strings = [str(v) for v in df["mixed"].tolist()]
    expected_chars = sum(len(s) for s in values_as_strings)
    expected_words = sum(len(s.split()) for s in values_as_strings)
    codeflash_output = calculate_text_metrics(df, ["mixed"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

def test_whitespace_handling_multiple_spaces_tabs_newlines():
    # Verify splitting behavior on varied whitespace: multiple spaces, tabs, newlines.
    df = pd.DataFrame({
        "w": ["a  b", "c\t d\n e", "   ", ""]
    })
    # Use Python str.split() semantics (split on any whitespace, multiple spaces ignored).
    values_as_strings = [str(v) for v in df["w"].tolist()]
    expected_words = sum(len(s.split()) for s in values_as_strings)
    expected_chars = sum(len(s) for s in values_as_strings)
    codeflash_output = calculate_text_metrics(df, ["w"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

def test_column_with_empty_strings_and_mixed_content():
    # Verify empty strings contribute 0, other strings counted normally.
    df = pd.DataFrame({"t": ["", "", "hello"]})
    expected_words = 0 + 0 + 1
    expected_chars = 0 + 0 + len("hello")
    codeflash_output = calculate_text_metrics(df, ["t"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

# Large-scale scenarios (kept under the specified element limits)

def test_large_scale_multiple_rows_and_columns():
    # Create 500 rows (<1000) to test aggregation performance and correctness.
    rows = 500
    # Each cell contains "word " repeated 3 times -> "word word word " (15 chars, 3 words)
    cell = "word " * 3  # trailing space preserved intentionally
    df = pd.DataFrame({
        "c1": [cell] * rows,
        "c2": [cell] * rows
    })
    # Each cell: 3 words, length len(cell)
    expected_words = rows * 3 * 2  # 2 columns
    expected_chars = rows * len(cell) * 2
    codeflash_output = calculate_text_metrics(df, ["c1", "c2"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

def test_large_single_long_string():
    # Single-row, single-column with a long string (length 2000) to test large string handling.
    long_piece = "x" * 2000  # 2000 characters, one 'word' when split
    df = pd.DataFrame({"long": [long_piece]})
    expected_words = 1  # "xxxxx..." -> one token
    expected_chars = 2000
    codeflash_output = calculate_text_metrics(df, ["long"]); result = codeflash_output
    assert result == (expected_words, expected_chars)

# Additional mutation-sensitive tests (ensure subtle behaviors are asserted)

def test_order_of_fillna_and_astype_effects_are_observed():
    # This test demonstrates the subtlety: astype(str) is applied before fillna(""),
    # so NaN becomes "nan" rather than being treated as an empty string.
    df = pd.DataFrame({"t": [np.nan]})
    # If implementation replaced astype/ fillna order, expected would be 0 words/0 chars.
    # But current behavior counts "nan".
    expected_chars = len(str(np.nan))
    expected_words = len(str(np.nan).split())
    codeflash_output = calculate_text_metrics(df, ["t"]); result = codeflash_output
    assert result == (expected_words, expected_chars)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pandas as pd
import pytest
from langflow.api.v1.knowledge_bases import calculate_text_metrics


class TestCalculateTextMetricsBasic:
    """Basic test cases for calculate_text_metrics function."""

    def test_single_column_simple_text(self):
        """Test with a single column containing simple text."""
        df = pd.DataFrame({"text": ["hello world", "foo bar"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_single_column_with_multiple_words(self):
        """Test with a single column containing multiple words per row."""
        df = pd.DataFrame({"content": ["the quick brown fox", "jumps over"]})
        words, chars = calculate_text_metrics(df, ["content"])

    def test_multiple_columns(self):
        """Test with multiple text columns."""
        df = pd.DataFrame({
            "col1": ["hello", "world"],
            "col2": ["foo bar", "baz qux"]
        })
        words, chars = calculate_text_metrics(df, ["col1", "col2"])

    def test_empty_dataframe(self):
        """Test with empty DataFrame."""
        df = pd.DataFrame({"text": []})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_single_row_single_column(self):
        """Test with a single row and single column."""
        df = pd.DataFrame({"text": ["hello"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_column_with_single_word_per_row(self):
        """Test with column containing single word per row."""
        df = pd.DataFrame({"text": ["apple", "banana", "cherry"]})
        words, chars = calculate_text_metrics(df, ["text"])


class TestCalculateTextMetricsEdgeCases:
    """Edge case test cases for calculate_text_metrics function."""

    def test_nonexistent_column(self):
        """Test with column that doesn't exist in DataFrame."""
        df = pd.DataFrame({"text": ["hello world"]})
        words, chars = calculate_text_metrics(df, ["nonexistent"])

    def test_mixed_existing_and_nonexistent_columns(self):
        """Test with mix of existing and non-existing columns."""
        df = pd.DataFrame({"text": ["hello world"], "content": ["foo bar"]})
        words, chars = calculate_text_metrics(df, ["text", "nonexistent", "content"])

    def test_column_with_null_values(self):
        """Test with column containing NaN/None values."""
        df = pd.DataFrame({"text": ["hello", None, "world", float('nan')]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_column_with_all_null_values(self):
        """Test with column containing only null values."""
        df = pd.DataFrame({"text": [None, None, None]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_column_with_empty_strings(self):
        """Test with column containing empty strings."""
        df = pd.DataFrame({"text": ["", "", ""]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_column_with_whitespace_only(self):
        """Test with column containing only whitespace."""
        df = pd.DataFrame({"text": ["   ", "\t", "\n"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_numeric_column_converted_to_string(self):
        """Test with numeric column that gets converted to string."""
        df = pd.DataFrame({"numbers": [1, 22, 333]})
        words, chars = calculate_text_metrics(df, ["numbers"])

    def test_mixed_types_in_column(self):
        """Test with column containing mixed data types."""
        df = pd.DataFrame({"mixed": [123, "hello", 45.67, "world"]})
        words, chars = calculate_text_metrics(df, ["mixed"])

    def test_text_with_special_characters(self):
        """Test with text containing special characters."""
        df = pd.DataFrame({"text": ["hello@world", "foo#bar$baz"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_text_with_punctuation(self):
        """Test with text containing punctuation."""
        df = pd.DataFrame({"text": ["hello, world!", "foo. bar?"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_text_with_multiple_consecutive_spaces(self):
        """Test with text containing multiple consecutive spaces."""
        df = pd.DataFrame({"text": ["hello     world", "foo   bar"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_empty_column_list(self):
        """Test with empty column list."""
        df = pd.DataFrame({"text": ["hello world"]})
        words, chars = calculate_text_metrics(df, [])

    def test_single_character_text(self):
        """Test with single character text."""
        df = pd.DataFrame({"text": ["a", "b", "c"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_very_long_single_word(self):
        """Test with very long single word."""
        long_word = "a" * 1000
        df = pd.DataFrame({"text": [long_word]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_text_with_numbers_and_words(self):
        """Test with text containing both numbers and words."""
        df = pd.DataFrame({"text": ["hello 123 world", "foo 456 bar 789"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_unicode_characters(self):
        """Test with unicode characters."""
        df = pd.DataFrame({"text": ["hello 世界", "foo 🌍"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_multiline_text(self):
        """Test with multiline text (newlines treated as spaces)."""
        df = pd.DataFrame({"text": ["hello\nworld", "foo\nbar\nbaz"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_tab_separated_words(self):
        """Test with tab-separated words."""
        df = pd.DataFrame({"text": ["hello\tworld", "foo\tbar"]})
        words, chars = calculate_text_metrics(df, ["text"])


class TestCalculateTextMetricsLargeScale:
    """Large scale test cases for calculate_text_metrics function."""

    def test_large_dataframe_single_column(self):
        """Test with large DataFrame (500 rows) and single column."""
        texts = ["hello world"] * 500
        df = pd.DataFrame({"text": texts})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_large_dataframe_multiple_columns(self):
        """Test with large DataFrame (300 rows) and multiple columns."""
        df = pd.DataFrame({
            "col1": ["hello world"] * 300,
            "col2": ["foo bar baz"] * 300,
            "col3": ["test text"] * 300
        })
        words, chars = calculate_text_metrics(df, ["col1", "col2", "col3"])

    def test_large_text_entries(self):
        """Test with large text entries in DataFrame."""
        large_text = " ".join(["word"] * 100)
        df = pd.DataFrame({"text": [large_text] * 50})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_mixed_null_and_large_data(self):
        """Test with large DataFrame containing mixed null and large values."""
        texts = ["hello world"] * 250 + [None] * 250
        df = pd.DataFrame({"text": texts})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_many_columns_with_text(self):
        """Test with many columns (100) containing text."""
        data = {f"col_{i}": ["hello"] * 10 for i in range(100)}
        df = pd.DataFrame(data)
        words, chars = calculate_text_metrics(df, [f"col_{i}" for i in range(100)])

    def test_varying_text_lengths(self):
        """Test with DataFrame containing varying text lengths."""
        texts = [" ".join(["word"] * i) for i in range(1, 101)]
        df = pd.DataFrame({"text": texts})
        words, chars = calculate_text_metrics(df, ["text"])
        expected_words = sum(range(1, 101))
        assert words == expected_words

    def test_large_dataframe_with_empty_values(self):
        """Test large DataFrame with many empty strings."""
        df = pd.DataFrame({
            "text": ["hello world"] * 400 + [""] * 600
        })
        words, chars = calculate_text_metrics(df, ["text"])

    def test_performance_with_large_text_volume(self):
        """Test performance with large total text volume."""
        df = pd.DataFrame({
            "text1": ["the quick brown fox jumps over the lazy dog"] * 250,
            "text2": ["sphinx of black quartz judge my vow"] * 250
        })
        words, chars = calculate_text_metrics(df, ["text1", "text2"])

    def test_all_numeric_columns_large_scale(self):
        """Test conversion of large numeric DataFrame to string metrics."""
        df = pd.DataFrame({
            "col1": range(500),
            "col2": range(500, 1000)
        })
        words, chars = calculate_text_metrics(df, ["col1", "col2"])


class TestCalculateTextMetricsReturnTypes:
    """Test return type consistency and validity."""

    def test_returns_tuple(self):
        """Test that function returns a tuple."""
        df = pd.DataFrame({"text": ["hello world"]})
        codeflash_output = calculate_text_metrics(df, ["text"]); result = codeflash_output

    def test_returns_integers(self):
        """Test that returned values are integers."""
        df = pd.DataFrame({"text": ["hello world"]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_return_values_non_negative(self):
        """Test that returned values are always non-negative."""
        df = pd.DataFrame({"text": ["hello", "world", None, ""]})
        words, chars = calculate_text_metrics(df, ["text"])

    def test_characters_greater_or_equal_to_words(self):
        """Test that character count is greater than or equal to word count."""
        df = pd.DataFrame({
            "text": ["hello world", "foo bar baz", "test", None]
        })
        words, chars = calculate_text_metrics(df, ["text"])


class TestCalculateTextMetricsDataIntegrity:
    """Test data integrity and cumulative calculations."""

    def test_column_order_independence(self):
        """Test that column processing order doesn't affect results."""
        df = pd.DataFrame({
            "text1": ["hello"],
            "text2": ["world"]
        })
        codeflash_output = calculate_text_metrics(df, ["text1", "text2"]); result1 = codeflash_output
        codeflash_output = calculate_text_metrics(df, ["text2", "text1"]); result2 = codeflash_output

    def test_duplicate_columns_in_list(self):
        """Test behavior with duplicate column names in input list."""
        df = pd.DataFrame({"text": ["hello world"]})
        words, chars = calculate_text_metrics(df, ["text", "text"])

    def test_consistency_across_multiple_calls(self):
        """Test that multiple calls with same data produce same results."""
        df = pd.DataFrame({"text": ["hello world", "foo bar"]})
        codeflash_output = calculate_text_metrics(df, ["text"]); result1 = codeflash_output
        codeflash_output = calculate_text_metrics(df, ["text"]); result2 = codeflash_output

    def test_word_count_accuracy(self):
        """Test accuracy of word counting with known examples."""
        test_cases = [
            (["one"], 1),
            (["one two"], 2),
            (["one two three"], 3),
            (["a b c d e"], 5)
        ]
        for text, expected_words in test_cases:
            df = pd.DataFrame({"text": text})
            words, _ = calculate_text_metrics(df, ["text"])

    def test_character_count_accuracy(self):
        """Test accuracy of character counting."""
        test_cases = [
            ("a", 1),
            ("ab", 2),
            ("hello", 5),
            ("hello world", 11),
        ]
        for text, expected_chars in test_cases:
            df = pd.DataFrame({"text": [text]})
            _, chars = calculate_text_metrics(df, ["text"])

    def test_cumulative_across_rows(self):
        """Test that metrics are properly cumulative across rows."""
        df = pd.DataFrame({
            "text": ["hello", "world", "foo"]
        })
        words, chars = calculate_text_metrics(df, ["text"])

    def test_cumulative_across_columns(self):
        """Test that metrics are properly cumulative across columns."""
        df = pd.DataFrame({
            "col1": ["hello"],
            "col2": ["world"],
            "col3": ["foo"]
        })
        words, chars = calculate_text_metrics(df, ["col1", "col2", "col3"])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-pr11114-2026-01-19T15.29.55 and push.

Codeflash

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) label Jan 19, 2026
@github-actions github-actions bot added the community (Pull Request from an external contributor) label Jan 19, 2026

coderabbitai bot commented Jan 19, 2026

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.


codecov bot commented Jan 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 34.52%. Comparing base (2f2b50b) to head (67d448f).

❌ Your project check has failed because the head coverage (41.56%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@                 Coverage Diff                 @@
##           feat/langchain-1.0   #11355   +/-   ##
===================================================
  Coverage               34.52%   34.52%           
===================================================
  Files                    1414     1414           
  Lines                   67194    67188    -6     
  Branches                 9910     9910           
===================================================
- Hits                    23198    23197    -1     
+ Misses                  42781    42775    -6     
- Partials                 1215     1216    +1     
Flag  Coverage Δ
lfx   41.56% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines                                 Coverage Δ
...rc/backend/base/langflow/api/v1/knowledge_bases.py    16.86% <ø> (+0.38%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

