feat: add support to accept Dataframe as input to split text, and added relevant tests #6302

Open · wants to merge 19 commits into base: main
Conversation

@edwinjosechittilappilly (Collaborator) commented Feb 12, 2025

This pull request includes significant changes to the SplitTextComponent and introduces a new SplitTextComponentLegacy class. Additionally, there are updates to the DataFrame class and corresponding unit tests to support new functionalities. The most important changes are summarized below:

New Features and Enhancements:

  • New Component: Added SplitTextComponentLegacy to handle the deprecated text-splitting behavior. (src/backend/base/langflow/components/processing/split_text_legacy.py)
  • Enhanced SplitTextComponent: Updated SplitTextComponent to support both Data and DataFrame input types, added a new input field text_key, and modified output methods to handle data conversion more robustly. (src/backend/base/langflow/components/processing/split_text.py) [1] [2]
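The dual Data/DataFrame input handling described above can be sketched with plain Python stand-ins (a minimal sketch; `normalize_inputs` and the simplified `Data` class are hypothetical names, not the actual Langflow implementation):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union

@dataclass
class Data:
    """Simplified stand-in for langflow.schema.Data."""
    data: Dict[str, Any] = field(default_factory=dict)
    text_key: str = "text"

def normalize_inputs(
    inputs: Union[List[Data], List[Dict[str, Any]]], text_key: str = "text"
) -> List[str]:
    """Accept either a list of Data objects or a list of row dicts
    (standing in for DataFrame rows) and return the raw texts to split."""
    if inputs and isinstance(inputs[0], dict):
        # DataFrame-like branch: the text_key column of each row holds the text.
        return [row.get(text_key, "") for row in inputs]
    # Data branch: pull the text out of each object's payload via its text_key.
    return [d.data.get(d.text_key, "") for d in inputs]

# Both input shapes yield the same list of texts:
print(normalize_inputs([{"text": "hello", "source": "a.txt"}]))
print(normalize_inputs([Data(data={"text": "hello"})]))
```

The point of the `text_key` input field is exactly this: it tells the component which column (or which key in each Data payload) carries the text to be split.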

Codebase Improvements:

  • DataFrame Enhancements:
    • Enhanced DataFrame class to include new attributes text_key and default_value, and added methods for converting to and from LangChain Document objects. (src/backend/base/langflow/schema/dataframe.py) [1] [2]
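A plain-dict sketch of the text_key/default_value convention that the new conversion methods are described as following (assumed behavior inferred from the summary; the real methods operate on pandas rows and LangChain Document objects, and the function names here are illustrative):

```python
def rows_to_documents(rows, text_key="text", default_value=""):
    """Each row's text_key column becomes page_content; the remaining
    columns become the document's metadata."""
    docs = []
    for row in rows:
        meta = dict(row)
        text = meta.pop(text_key, default_value)  # fall back to default_value
        docs.append({"page_content": text, "metadata": meta})
    return docs

def documents_to_rows(docs, text_key="text"):
    """Inverse conversion: merge page_content back in under text_key."""
    return [{text_key: d["page_content"], **d["metadata"]} for d in docs]

rows = [{"text": "hello", "source": "a.txt"}]
docs = rows_to_documents(rows)
assert documents_to_rows(docs) == rows  # round trip preserves the rows
```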

Refactoring and Bug Fixes:

  • Method Refactoring:
    • Replaced split_text method calls with as_data in various components and tests to ensure consistency with the new method names. (src/backend/base/langflow/initial_setup/starter_projects/vector_store_rag.py, src/backend/tests/unit/components/processing/test_split_text_component.py, src/backend/tests/unit/initial_setup/starter_projects/test_vector_store_rag.py) [1] [2] [3]

Testing:

  • Unit Tests:
    • Added comprehensive unit tests for the updated SplitTextComponent and DataFrame functionalities, including new test cases for DataFrame input and URL loader. (src/backend/tests/unit/components/processing/test_split_text_component.py, src/backend/tests/unit/schema/test_schema_dataframe.py) [1] [2]
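A minimal pytest-style sketch of the kind of DataFrame-input test described above (`split_rows` is a hypothetical toy splitter used only for illustration, not the component under test; the real tests exercise SplitTextComponent directly):

```python
def split_rows(rows, text_key="text", chunk_size=5):
    """Toy fixed-width character splitter; the real component delegates to a
    LangChain text splitter with separator/chunk_overlap settings."""
    chunks = []
    for row in rows:
        text = row.get(text_key, "")
        chunks.extend(text[i:i + chunk_size] for i in range(0, len(text), chunk_size))
    return chunks

def test_dataframe_like_input():
    # Rows stand in for DataFrame records; the text column is chunked.
    rows = [{"text": "hello world"}]
    assert split_rows(rows) == ["hello", " worl", "d"]

def test_missing_text_column_yields_no_chunks():
    assert split_rows([{"source": "a.txt"}]) == []

test_dataframe_like_input()
test_missing_text_column_yields_no_chunks()
```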

These changes enhance the flexibility and robustness of the text processing components and ensure that the codebase is well-tested and maintainable.

https://www.loom.com/share/0bab777a383749ffb03707a7e184e97c?sid=ff2da035-c8f5-4ea3-9c36-a1f0ea71d261

@github-actions github-actions bot added the enhancement New feature or request label Feb 12, 2025
@edwinjosechittilappilly edwinjosechittilappilly marked this pull request as ready for review February 12, 2025 05:44
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Feb 12, 2025
Comment on lines 65 to 81 (Contributor):

    if len(self.data_inputs) == 0:
        msg = "DataFrame is empty"
        raise TypeError(msg)

    self.data_inputs.text_key = self.text_key
    try:
        documents = self.data_inputs.to_lc_documents()
    except Exception as e:
        msg = f"Error converting DataFrame to documents: {e}"
        raise TypeError(msg) from e
else:
    if not self.data_inputs:
        msg = "No data inputs provided"
        raise TypeError(msg)

    documents = [_input.to_lc_document() for _input in self.data_inputs if isinstance(_input, Data)]
    documents = []
    for _input in self.data_inputs:

Suggested change (added lines from the flattened diff view; intervening context lines were lost in extraction). The msg temporaries are inlined into direct raise calls, and a completeness check flags unsupported input types:

    if not len(self.data_inputs):
        raise TypeError("DataFrame is empty")
    raise TypeError(f"Error converting DataFrame to documents: {e}") from e
    raise TypeError("No data inputs provided")
    try:
        documents = [_input.to_lc_document() for _input in self.data_inputs if isinstance(_input, Data)]
        if len(documents) != len(self.data_inputs):
            raise TypeError("Unsupported input types present")
    except Exception as e:
        raise TypeError(f"Error processing inputs: {e}") from e
    raise TypeError(f"Error splitting text: {e}") from e


codeflash-ai bot commented Feb 12, 2025

⚡️ Codeflash found optimizations for this PR

📄 39% (0.39x) speedup for SplitTextComponent.split_text in src/backend/base/langflow/components/processing/split_text.py

⏱️ Runtime: 10.1 milliseconds → 7.27 milliseconds (best of 23 runs)

📝 Explanation and details

To optimize the provided Python program for performance, we can reduce overhead by avoiding unnecessary intermediate processing and ensuring efficient usage of memory and computation. Below is the improved version of the split_text method.

  • Reduce intermediate variable assignments: Avoid unnecessary assignments and processes which don't contribute directly to the result.
  • Use list comprehensions for better performance: They are faster and cleaner than traditional for-loops.
  • Efficient error handling: Minimize redundant computations within try-except blocks.

Here's the rewritten split_text method.

Key Improvements:

  1. Reduced Unnecessary Assignments: Directly assigned and processed self.data_inputs without unnecessary intermediate steps.
  2. Optimized Data Processing:
    • Used list comprehensions for document creation.
    • Ensured all documents are processed and handled properly within the comprehensions.
  3. Efficient Error Handling:
    • Combined checks within the try block to minimize repeated exception handling.

Benefits:

  • Performance: Using list comprehensions and avoiding unnecessary assignments gives a performance boost.
  • Readability: Code is cleaner and more understandable.
  • Error Handling: Minimized redundant processing within try blocks enhances performance and makes error messages more informative.

Let's maintain these improvements for more modular and efficient code execution.
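As a rough illustration of the list-comprehension point above, here is a generic Python micro-benchmark (not related to the PR's measured numbers; timings vary by machine and interpreter):

```python
import timeit

def with_loop(items):
    # Traditional loop: repeated attribute lookup and append call per item.
    out = []
    for x in items:
        out.append(x.upper())
    return out

def with_comprehension(items):
    # List comprehension: same result, built by a specialized bytecode path.
    return [x.upper() for x in items]

items = ["chunk"] * 10_000
assert with_loop(items) == with_comprehension(items)

t_loop = timeit.timeit(lambda: with_loop(items), number=50)
t_comp = timeit.timeit(lambda: with_comprehension(items), number=50)
print(f"loop: {t_loop:.4f}s  comprehension: {t_comp:.4f}s")
# The comprehension is typically somewhat faster because it avoids the
# per-iteration append lookup; the exact ratio varies.
```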

Correctness verification report:

  • ⚙️ Existing Unit Tests: 🔘 None Found
  • 🌀 Generated Regression Tests: 14 Passed
  • ⏪ Replay Tests: 🔘 None Found
  • 🔎 Concolic Coverage Tests: 🔘 None Found
  • 📊 Tests Coverage: undefined
🌀 Generated Regression Tests Details
from typing import Any, Dict, List

import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.components.processing.split_text import SplitTextComponent
from pydantic import BaseModel

# NOTE: local Data/DataFrame stand-ins are defined below, so the
# langflow.schema versions are deliberately not imported here, where they
# would immediately be shadowed by the test-local definitions.


class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects."""
    def __init__(self, data: List[Dict[str, Any]] = None, text_key: str = "text", default_value: str = "", **kwargs):
        super().__init__(data, **kwargs)
        self._text_key = text_key
        self._default_value = default_value

    def to_lc_documents(self):
        list_of_dicts = self.to_dict(orient="records")
        documents = []
        for row in list_of_dicts:
            data_copy = row.copy()
            text = data_copy.pop(self._text_key, self._default_value)
            documents.append({"page_content": text, "metadata": data_copy})
        return documents

class Data(BaseModel):
    text_key: str = "text"
    data: Dict[str, Any] = {}
    default_value: str = ""

    def to_lc_document(self):
        data_copy = self.data.copy()
        text = data_copy.pop(self.text_key, self.default_value)
        return {"page_content": text, "metadata": data_copy}


# unit tests
class TestSplitTextComponent:
    # Helper function to create SplitTextComponent
    def create_component(self, data_inputs, separator, chunk_size, chunk_overlap, text_key="text"):
        component = SplitTextComponent()
        component.data_inputs = data_inputs
        component.separator = separator
        component.chunk_size = chunk_size
        component.chunk_overlap = chunk_overlap
        component.text_key = text_key
        return component

    # Basic Functionality Tests
    

import pytest  # used for our unit tests
from langchain_text_splitters import CharacterTextSplitter
from langflow.components.processing.split_text import SplitTextComponent
from langflow.custom import Component
from langflow.logging.logger import logger
from langflow.schema import Data, DataFrame

# unit tests

@pytest.fixture
def setup_component():
    component = SplitTextComponent()
    component.separator = "\\n"
    component.chunk_overlap = 0
    component.chunk_size = 5
    component.text_key = "text"
    return component



def test_empty_inputs_list(setup_component):
    component = setup_component
    component.data_inputs = []
    with pytest.raises(TypeError, match="No data inputs provided"):
        component.split_text()


def test_text_with_special_characters(setup_component):
    component = setup_component
    component.data_inputs = [Data(data={"text": "Hello,\nworld!"})]
    codeflash_output = component.split_text()

def test_very_large_text(setup_component):
    component = setup_component
    component.data_inputs = [Data(data={"text": "A" * 10000})]
    codeflash_output = component.split_text()

def test_invalid_data_types(setup_component):
    component = setup_component
    component.data_inputs = [{"text": "Hello, world!"}]
    with pytest.raises(TypeError, match="Unsupported input type"):
        component.split_text()


def test_different_chunk_sizes_and_overlaps(setup_component):
    component = setup_component
    component.chunk_size = 4
    component.chunk_overlap = 2
    component.data_inputs = [Data(data={"text": "Hello, world!"})]
    codeflash_output = component.split_text()

def test_different_separators(setup_component):
    component = setup_component
    component.separator = " "
    component.data_inputs = [Data(data={"text": "Hello world!"})]
    codeflash_output = component.split_text()


def test_splitting_errors(setup_component):
    component = setup_component
    component.chunk_size = -1  # Invalid chunk size
    component.data_inputs = [Data(data={"text": "Hello, world!"})]
    with pytest.raises(TypeError, match="Error splitting text"):
        component.split_text()

def test_large_number_of_documents(setup_component):
    component = setup_component
    component.data_inputs = [Data(data={"text": "Hello, world!"})] * 1000
    codeflash_output = component.split_text()

def test_complex_nested_data(setup_component):
    component = setup_component
    component.data_inputs = [Data(data={"text": "Hello, world!", "metadata": {"author": "John"}})]
    codeflash_output = component.split_text()

def test_state_modification(setup_component):
    component = setup_component
    original_data = [Data(data={"text": "Hello, world!"})]
    component.data_inputs = original_data.copy()
    component.split_text()


@edwinjosechittilappilly (Collaborator, Author) commented Feb 12, 2025

@mendonk This PR will require a docs update as a follow-up PR.

Labels: enhancement (New feature or request), size:XL (This PR changes 500-999 lines, ignoring generated files.)
Projects: None yet
Development: Successfully merging this pull request may close these issues.

3 participants