feat: add DataToDataFrame component for converting Data objects #6112

Cristhianzl · 2025-02-04T12:43:02Z

This pull request adds a new component to Langflow that converts Data objects into a DataFrame. The main addition is the DataToDataFrameComponent class, which defines the component's inputs and outputs and the method that performs the conversion.

Key changes include:

Addition of DataToDataFrameComponent class in src/backend/base/langflow/components/processing/data_to_dataframe.py to convert one or multiple Data objects into a DataFrame. This class includes a detailed description, icon, name, inputs, outputs, and the build_dataframe method for processing data.

The new component enhances the library's data processing capabilities by allowing easy conversion of structured data into a DataFrame format.

codeflash-ai · 2025-02-12T13:22:20Z

src/backend/base/langflow/components/processing/data_to_dataframe.py

+        # If user passed a single Data, it might come in as a single object rather than a list
+        if not isinstance(data_input, list):
+            data_input = [data_input]
+
+        rows = []
+        for item in data_input:
+            if not isinstance(item, Data):
+                msg = f"Expected Data objects, got {type(item)} instead."
+                raise TypeError(msg)
+
+            # Start with a copy of item.data or an empty dict
+            row_dict = dict(item.data) if item.data else {}
+
+            # If the Data object has text, store it under 'text' col
+            text_val = item.get_text()


Suggested change

-        # If user passed a single Data, it might come in as a single object rather than a list
-        if not isinstance(data_input, list):
-            data_input = [data_input]
-
-        rows = []
-        for item in data_input:
-            if not isinstance(item, Data):
-                msg = f"Expected Data objects, got {type(item)} instead."
-                raise TypeError(msg)
-
-            # Start with a copy of item.data or an empty dict
-            row_dict = dict(item.data) if item.data else {}
-
-            # If the Data object has text, store it under 'text' col
-            text_val = item.get_text()
+        # Ensure data_input is a list
+        # Use list comprehension to create rows more efficiently
+        rows = [
+            {
+                **(item.data if item.data else {}),
+                **({"text": item.get_text()} if item.get_text() else {}),
+            }
+            for item in data_input
+            if isinstance(item, Data)
+        ]
+
+        # Verify all items are Data objects and raise TypeError if not
+        if len(rows) != len(data_input):
+            raise TypeError("All input items must be Data objects.")
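
To compare the two variants side by side, here is a minimal sketch. It assumes a simplified stand-in for Data (a dataclass, not the real langflow class), and it assumes the truncated loop goes on to store non-empty text under "text", as its comment suggests. The function names rows_with_loop and rows_with_comprehension are hypothetical and only serve this illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Data:
    """Hypothetical stand-in for langflow's Data; not the real class."""

    data: dict[str, Any] = field(default_factory=dict)
    text_key: str = "text"

    def get_text(self) -> str:
        return self.data.get(self.text_key, "")


def rows_with_loop(data_input):
    """Loop variant, following the lines removed by the suggestion."""
    # A single Data object is wrapped into a one-element list.
    if not isinstance(data_input, list):
        data_input = [data_input]
    rows = []
    for item in data_input:
        if not isinstance(item, Data):
            msg = f"Expected Data objects, got {type(item)} instead."
            raise TypeError(msg)
        row_dict = dict(item.data) if item.data else {}
        text_val = item.get_text()
        # Assumption: the original hunk is cut off here; storing
        # non-empty text under "text" follows the hunk's own comment.
        if text_val:
            row_dict["text"] = text_val
        rows.append(row_dict)
    return rows


def rows_with_comprehension(data_input):
    """Comprehension variant, following the suggested lines."""
    rows = [
        {
            **(item.data if item.data else {}),
            **({"text": item.get_text()} if item.get_text() else {}),
        }
        for item in data_input
        if isinstance(item, Data)
    ]
    # Non-Data items were filtered out above, so a length mismatch reveals them.
    if len(rows) != len(data_input):
        raise TypeError("All input items must be Data objects.")
    return rows


valid = [Data({"name": "John", "text": "hi"}), Data({"name": "Jane"})]
assert rows_with_loop(valid) == rows_with_comprehension(valid)

# Only the loop variant wraps a bare Data object into a list.
single = Data({"name": "John"})
print(rows_with_loop(single))
```

For a list of valid Data objects the two variants build the same rows. They differ in the details visible in the suggestion itself: the comprehension calls get_text() twice per item, its error message no longer names the offending type, and the suggested lines contain no code that wraps a single Data object into a list.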

codeflash-ai · 2025-02-12T13:22:23Z

⚡️ Codeflash found optimizations for this PR

📄 16% (0.16x) speedup for `DataToDataFrameComponent.build_dataframe` in `src/backend/base/langflow/components/processing/data_to_dataframe.py`

⏱️ Runtime : 410 microseconds → 353 microseconds (best of 33 runs)
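
The 16% figure can be reproduced from the two runtimes. This is only a sketch of the arithmetic, assuming the speedup is measured relative to the optimized runtime; the variable names are illustrative.

```python
# Runtimes reported above, in microseconds.
original_us = 410
optimized_us = 353

# Assumption: speedup = old / new - 1, i.e. relative to the optimized runtime.
speedup = original_us / optimized_us - 1

print(f"{speedup:.2f}x ({speedup:.0%})")  # prints "0.16x (16%)"
```

Measured relative to the original runtime instead, the same numbers would give roughly a 14% reduction, so the reported value appears to use the first convention.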

📝 Explanation and details

Here's the optimized code.

Ensuring data_input is a List Efficiently.

By making these changes, the code maintains its correctness while improving its performance and readability.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 4 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	undefined

🌀 Generated Regression Tests Details

# imports
from typing import Any, Dict, List

import pandas as pd  # used for DataFrame operations
import pytest  # used for our unit tests
from langflow.components.processing.data_to_dataframe import DataToDataFrameComponent
from langflow.custom import Component
from langflow.schema import Data, DataFrame


# function to test
class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects."""

    def __init__(self, data: List[Dict] | List[Data] | pd.DataFrame | None = None, **kwargs):
        if data is None:
            super().__init__(**kwargs)
            return

        if isinstance(data, list):
            if all(isinstance(x, Data) for x in data):
                data = [d.data for d in data if hasattr(d, "data")]
            elif not all(isinstance(x, dict) for x in data):
                msg = "List items must be either all Data objects or all dictionaries"
                raise ValueError(msg)
            kwargs["data"] = data
        elif isinstance(data, (dict, pd.DataFrame)):
            kwargs["data"] = data

        super().__init__(**kwargs)

    def to_data_list(self) -> List[Data]:
        list_of_dicts = self.to_dict(orient="records")
        return [Data(data=row) for row in list_of_dicts]

    def add_row(self, data: Dict | Data) -> "DataFrame":
        if isinstance(data, Data):
            data = data.data
        new_df = self._constructor([data])
        return pd.concat([self, new_df], ignore_index=True)

    def add_rows(self, data: List[Dict | Data]) -> "DataFrame":
        processed_data = []
        for item in data:
            if isinstance(item, Data):
                processed_data.append(item.data)
            else:
                processed_data.append(item)
        new_df = self._constructor(processed_data)
        return pd.concat([self, new_df], ignore_index=True)

    @property
    def _constructor(self):
        def _c(*args, **kwargs):
            return DataFrame(*args, **kwargs).__finalize__(self)
        return _c

    def __bool__(self):
        return not self.empty

class Data:
    def __init__(self, data: Dict[str, Any] | None = None):
        self.data = data or {}
        self.text_key = "text"
        self.default_value = ""

    def get_text(self):
        return self.data.get(self.text_key, self.default_value)

# unit tests

# Basic Functionality Tests


def test_empty_input():
    component = DataToDataFrameComponent()
    component.data_list = []
    codeflash_output = component.build_dataframe()


def test_non_data_objects_in_input():
    component = DataToDataFrameComponent()
    component.data_list = [Data(data={"name": "John"}), {"name": "Jane"}]
    with pytest.raises(TypeError):
        component.build_dataframe()


def test_invalid_input_types():
    component = DataToDataFrameComponent()
    component.data_list = "string"
    with pytest.raises(TypeError):
        component.build_dataframe()

def test_invalid_data_structures():
    component = DataToDataFrameComponent()
    component.data_list = Data(data="invalid_structure")
    with pytest.raises(TypeError):
        component.build_dataframe()

# Performance and Scalability Tests



from typing import Any, Dict, List

import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.components.processing.data_to_dataframe import \
    DataToDataFrameComponent
from pydantic import BaseModel


# function to test
class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects."""

    def __init__(self, data: List[Dict] | List["Data"] | pd.DataFrame | None = None, **kwargs):
        if data is None:
            super().__init__(**kwargs)
            return

        if isinstance(data, list):
            if all(isinstance(x, Data) for x in data):
                data = [d.data for d in data if hasattr(d, "data")]
            elif not all(isinstance(x, dict) for x in data):
                raise ValueError("List items must be either all Data objects or all dictionaries")
            kwargs["data"] = data
        elif isinstance(data, (dict, pd.DataFrame)):
            kwargs["data"] = data

        super().__init__(**kwargs)

    def to_data_list(self) -> List["Data"]:
        """Converts the DataFrame back to a list of Data objects."""
        list_of_dicts = self.to_dict(orient="records")
        return [Data(data=row) for row in list_of_dicts]

    def add_row(self, data: Dict | "Data") -> "DataFrame":
        """Adds a single row to the dataset."""
        if isinstance(data, Data):
            data = data.data
        new_df = self._constructor([data])
        return pd.concat([self, new_df], ignore_index=True)

    def add_rows(self, data: List[Dict | "Data"]) -> "DataFrame":
        """Adds multiple rows to the dataset."""
        processed_data = []
        for item in data:
            if isinstance(item, Data):
                processed_data.append(item.data)
            else:
                processed_data.append(item)
        new_df = self._constructor(processed_data)
        return pd.concat([self, new_df], ignore_index=True)

    @property
    def _constructor(self):
        def _c(*args, **kwargs):
            return DataFrame(*args, **kwargs).__finalize__(self)
        return _c

    def __bool__(self):
        """Truth value testing for the DataFrame."""
        return not self.empty

class Data(BaseModel):
    """Represents a record with text and optional data."""
    text_key: str = "text"
    data: Dict[str, Any] = {}
    default_value: str | None = ""

    def get_text(self):
        """Retrieves the text value from the data dictionary."""
        return self.data.get(self.text_key, self.default_value)

    def set_text(self, text: str | None) -> str:
        """Sets the text value in the data dictionary."""
        new_text = "" if text is None else str(text)
        self.data[self.text_key] = new_text
        return new_text


# unit tests

italojohnny

Hey @Cristhianzl

All Component PRs need tests. Could you, please, add them?

Cristhianzl · 2025-02-12T18:51:19Z

> Hey @Cristhianzl
>
> All Component PRs need tests. Could you, please, add them?

Done! Thanks!

rodrigosnader

LGTM

✨ (data_to_dataframe.py): add a new component to convert Data objects into a DataFrame for easier data manipulation and analysis.

dd89317

Cristhianzl requested review from italojohnny, rodrigosnader and ogabrielluiz February 4, 2025 12:43

Cristhianzl self-assigned this Feb 4, 2025

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. enhancement New feature or request labels Feb 4, 2025

[autofix.ci] apply automated fixes

48d92fa

Cristhianzl added 2 commits February 12, 2025 10:15

📝 (data_to_dataframe.py): improve documentation for the build_dataframe method to explain the process of building a DataFrame from Data objects

91cb01e

📝 (data_to_dataframe.py): add missing comma in the component definition to ensure proper syntax and avoid potential errors

c2a1cb7

codeflash-ai bot reviewed Feb 12, 2025

italojohnny requested changes Feb 12, 2025

Cristhianzl added 2 commits February 12, 2025 14:21

✨ (test_data_to_dataframe.py): Add unit tests for DataToDataFrameComponent to ensure proper construction of DataFrame from Data objects with various fields and configurations.

29d7c5a

✨ (test_data_to_dataframe.py): Refactor test_data_to_dataframe.py to use pandas module instead of turtle for DataFrame operations ♻️ (test_data_to_dataframe.py): Refactor test_data_to_dataframe.py to improve readability and consistency in DataFrame testing assertions

7d77989

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Feb 12, 2025

[autofix.ci] apply automated fixes

da81527

🔧 (test_data_to_dataframe.py): improve variable naming for better readability and consistency in test cases

de0c75a

[autofix.ci] apply automated fixes

ed67906

Cristhianzl requested a review from italojohnny February 12, 2025 18:51

rodrigosnader approved these changes Feb 14, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 14, 2025
