NestedLoopJoin is significantly slower than vanilla Spark #12294

### Bug description

With the same data and the same stage (NestedLoopJoin operator), vanilla Spark takes about 3 minutes per task, while Gluten (Velox) takes more than 1 hour per task.

The NestedLoopJoin probe side has 150 billion records; the build side has 92 records.
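To give a sense of scale, here is a rough back-of-the-envelope calculation in Python. It assumes the record counts above are totals for the stage and that every probe row is paired with every build row before any join condition is applied; `candidate_pairs` is a hypothetical helper name, not part of Spark, Gluten, or Velox.

```python
# Record counts taken from this report (assumed to be stage-wide totals).
PROBE_ROWS = 150_000_000_000  # 150 billion probe-side records
BUILD_ROWS = 92               # build-side records


def candidate_pairs(probe_rows: int, build_rows: int) -> int:
    """Upper bound on row pairs a nested loop join has to consider."""
    return probe_rows * build_rows


pairs = candidate_pairs(PROBE_ROWS, BUILD_ROWS)
print(f"candidate pairs: {pairs:,}")
```

Any fixed per-probe-row or per-output-row overhead is therefore multiplied by a very large factor, which is why small inefficiencies in the operator matter here.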

![Image](https://github.com/user-attachments/assets/d52985a9-2bb8-481a-9e84-a1b21eb097ef)

The graph above is a flame graph of Velox. It shows that the slowdown comes from generating dictionary vectors for the high-cardinality probe vector.
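For readers unfamiliar with dictionary encoding, the following is a minimal Python sketch of the general idea, not Velox's actual implementation. All names (`cross_join_indices`, `decode`) are hypothetical. It assumes a plain cross join in which each output column is represented as an index list over the original probe or build column instead of a copy of the values.

```python
def cross_join_indices(probe_row_count, build_row_count):
    """Build the index lists that describe every output row of a cross join.

    Output row k refers to probe row probe_indices[k] and
    build row build_indices[k].
    """
    probe_indices = []
    build_indices = []
    for p in range(probe_row_count):
        for b in range(build_row_count):
            probe_indices.append(p)  # the same probe row repeats per build row
            build_indices.append(b)  # build rows cycle for each probe row
    return probe_indices, build_indices


def decode(base, indices):
    """Materialize a dictionary-encoded column (base values plus indices)."""
    return [base[i] for i in indices]


probe = ["p0", "p1", "p2"]
build = ["b0", "b1"]
probe_idx, build_idx = cross_join_indices(len(probe), len(build))
rows = list(zip(decode(probe, probe_idx), decode(build, build_idx)))
print(rows)
```

In this sketch the index lists grow with the number of output rows, so the cost of producing them scales with probe rows times build rows. If the real operator does comparable work for each probe row or output batch, that would be consistent with the time spent in dictionary vector generation in the flame graph.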

### System information

Velox System Info v0.0.2
Commit: 976a5b72a3a068cd1c70cc92ab64cfedae3649a1
CMake Version: 3.28.3
System: Linux-6.10.14-linuxkit
Arch: x86_64
C++ Compiler: /opt/rh/devtoolset-11/root/usr/bin/c++
C++ Compiler Version: 11.2.1
C Compiler: /opt/rh/devtoolset-11/root/usr/bin/gcc
C Compiler Version: 11.2.1
CMake Prefix Path: /usr/local;/usr;/;/usr/local/lib64/python3.6/site-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt

### Relevant logs

```bash

```
