Optimize join tables from different databases: executor#10146

Merged
ZoranPandovski merged 6 commits into main from cte-support
Nov 14, 2024
@ea-rus ea-rus commented Nov 11, 2024

Description

Updates:

A subselect step is used to get the distinct values of a column from the previous step's data:

select distinct <column> from <step data1>

The next step fetches data from the other database, using these values as a filter:

select * from db2.table2 where <column> in (<ids from previous step>)
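
The two steps above can be sketched roughly as follows. This is only an illustration and not the executor code from this PR: it uses two in-memory SQLite databases to stand in for db1 and db2, and the helper name `fetch_with_subselect`, the table names, and the column names are all hypothetical.

```python
import sqlite3


def fetch_with_subselect(left_rows, key, right_conn, right_table, right_key):
    """Fetch only the rows of right_table whose right_key appears in left_rows.

    Hypothetical helper for illustration; table and column names are trusted
    here and interpolated directly, values are passed as bound parameters.
    """
    # Step 1: "select distinct <column> from <step data1>"
    values = list(dict.fromkeys(
        row[key] for row in left_rows if row[key] is not None
    ))
    if not values:
        return []

    # Step 2: "select * from db2.table2 where <column> in (<ids>)"
    placeholders = ", ".join("?" for _ in values)
    sql = f"SELECT * FROM {right_table} WHERE {right_key} IN ({placeholders})"
    cur = right_conn.execute(sql, values)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, r)) for r in cur.fetchall()]


# Two separate databases, as in a cross-database join.
db1 = sqlite3.connect(":memory:")
db1.execute("CREATE TABLE table1 (id INTEGER, customer_id INTEGER)")
db1.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 10), (2, 20), (3, 10)])

db2 = sqlite3.connect(":memory:")
db2.execute("CREATE TABLE table2 (customer_id INTEGER, name TEXT)")
db2.executemany(
    "INSERT INTO table2 VALUES (?, ?)", [(10, "a"), (20, "b"), (30, "c")]
)

left_rows = [
    {"id": r[0], "customer_id": r[1]}
    for r in db1.execute("SELECT id, customer_id FROM table1")
]
right_rows = fetch_with_subselect(
    left_rows, "customer_id", db2, "table2", "customer_id"
)
```

Only the rows of table2 that can match something on the left side are transferred, which is the point of the optimization when table2 is large.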

Side fix:
If a join has no condition:

  • add a 0=0 filter (every left row is paired with every right row, so the row counts multiply)
  • but apply a limit: if the expected number of rows exceeds the limit, raise an exception
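
The guard described in this side fix could look something like the sketch below. The `10 ** 7` threshold is the one visible in the diff under review; the function name `cross_join`, the exception class `JoinTooLargeError`, and the list-of-dicts row representation are assumptions made for the example, not names from the MindsDB code base.

```python
from itertools import product

MAX_JOIN_ROWS = 10 ** 7  # threshold taken from the diff under review


class JoinTooLargeError(Exception):
    """Hypothetical exception for a condition-less join that is too large."""


def cross_join(left_rows, right_rows, max_rows=MAX_JOIN_ROWS):
    """Join without a condition (equivalent to ON 0=0), guarded by a row limit."""
    expected = len(left_rows) * len(right_rows)
    # Check the expected size before materializing anything.
    if expected >= max_rows:
        raise JoinTooLargeError(
            f"Join without condition would produce {expected} rows "
            f"(limit is {max_rows})"
        )
    # Every left row is paired with every right row.
    return [{**l, **r} for l, r in product(left_rows, right_rows)]


left = [{"a": 1}, {"a": 2}]
right = [{"b": "x"}, {"b": "y"}, {"b": "z"}]
joined = cross_join(left, right)
```

Checking the product of the two lengths first means the exception is raised before any rows are allocated.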

Dependent on mindsdb/mindsdb_sql#412

Fixes #issue_number

Type of change

  • ⚡ New feature (non-breaking change which adds functionality)

Verification Process

To ensure the changes are working as expected:

  • Test Location: Specify the URL or path for testing.
  • Verification Steps: Outline the steps or queries needed to validate the change. Include any data, configurations, or actions required to reproduce or see the new functionality.

Additional Media:

  • I have attached a brief loom video or screenshots showcasing the new functionality or change.

Checklist:

  • [x] My code follows the style guidelines (PEP 8) of MindsDB.
  • I have appropriately commented on my code, especially in complex areas.
  • Necessary documentation updates are either made or tracked in issues.
  • Relevant unit and integration tests are updated or added.

@ea-rus ea-rus requested a review from StpMax November 11, 2024 15:09
if step.query.condition is None:
raise NotSupportedYet('Unable to join table without condition')
# prevent memory overflow
if len(left_data) * len(right_data) < 10 ** 7:
Collaborator

Are left_data and right_data DataFrames? If so, it may be better to get their real size (df.memory_usage(index=True, deep=True).sum()) and compare it with free memory.

Collaborator Author

They are ResultSets
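
For reference, the memory-based check suggested in this thread might be sketched as below, assuming the data were available as pandas DataFrames (the reply above notes they are ResultSets, so a conversion would be needed first). The function names, the `available_bytes` parameter, and the `safety_factor` are hypothetical; how free memory is obtained is left to the caller.

```python
import pandas as pd


def estimated_cross_join_bytes(left: pd.DataFrame, right: pd.DataFrame) -> int:
    """Rough size estimate of a cross join: rows multiply, row widths add."""
    if len(left) == 0 or len(right) == 0:
        return 0
    # Real per-row footprint, including the index and object (string) columns.
    left_row = left.memory_usage(index=True, deep=True).sum() / len(left)
    right_row = right.memory_usage(index=True, deep=True).sum() / len(right)
    return int(len(left) * len(right) * (left_row + right_row))


def cross_join_fits(left, right, available_bytes, safety_factor=0.5):
    """Return True if the estimated result uses at most a fraction of free memory."""
    return estimated_cross_join_bytes(left, right) <= available_bytes * safety_factor


left_df = pd.DataFrame({"a": range(100)})
right_df = pd.DataFrame({"b": range(100)})
estimate = estimated_cross_join_bytes(left_df, right_df)
```

Compared with a fixed row-count threshold, this accounts for wide rows and long strings, at the cost of a `deep=True` scan over both inputs.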

3 participants