Support for Apache Spark-Based Preprocessing in RecBole for Large-Scale Datasets? #2176

JamorMoussa · 2025-05-03T19:01:15Z

Hello,

I'm working on a recommendation project in the retail domain and recently discovered RecBole, an interesting library that makes it easy to switch between different models and provides a standardized way to represent data. However, I've encountered an issue: RecBole relies on Pandas for data loading and preprocessing, which becomes a bottleneck when working with large datasets.

Does RecBole offer any interface or support for using Apache Spark during the data processing stage, followed by efficient data loading (e.g., using PyTorch DataLoader) for training models?

If not, what would be the recommended approach to preprocess data using Spark and still integrate with RecBole components like its models and trainers?
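
To make the question concrete, here is a minimal sketch of the kind of handoff I have in mind. It is not an existing RecBole or Spark API. It assumes the heavy preprocessing happens in Spark and the result is then written out in RecBole's atomic file format (a tab-separated `.inter` file whose header cells look like `field:type`). The helper name `write_inter_file`, the column names, and the `retail.inter` file name are hypothetical. The rows can be any iterable of mappings; with PySpark, `df.toLocalIterator()` should yield `Row` objects that support `row["field"]` lookups, but that part is an untested assumption here.

```python
import csv
from typing import Any, Iterable, Mapping

# Field types used in RecBole atomic-file headers.
ATOMIC_TYPES = {"token", "token_seq", "float", "float_seq"}


def write_inter_file(
    rows: Iterable[Mapping[str, Any]],
    path: str,
    schema: Mapping[str, str],
) -> int:
    """Stream rows into a tab-separated file with `field:type` headers.

    `rows` could be plain dicts or (assumption) PySpark Row objects
    obtained from `df.toLocalIterator()`. Returns the number of data
    rows written.
    """
    for name, ftype in schema.items():
        if ftype not in ATOMIC_TYPES:
            raise ValueError(f"unsupported type {ftype!r} for field {name!r}")

    fields = list(schema)
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # Header row, e.g. "user_id:token<TAB>item_id:token<TAB>rating:float"
        writer.writerow([f"{name}:{schema[name]}" for name in fields])
        # Rows are written one at a time, so the full dataset never has
        # to fit in a single in-memory DataFrame on the driver.
        for row in rows:
            writer.writerow([row[name] for name in fields])
            written += 1
    return written


if __name__ == "__main__":
    # Hypothetical columns for a retail interaction log.
    example_schema = {
        "user_id": "token",
        "item_id": "token",
        "rating": "float",
        "timestamp": "float",
    }
    example_rows = [
        {"user_id": "u1", "item_id": "i9", "rating": 4.0, "timestamp": 1700000000},
    ]
    write_inter_file(example_rows, "retail.inter", example_schema)
```

As far as I understand, RecBole would then expect the file under `dataset/retail/retail.inter`, but it would still read that file with Pandas at load time, so this only moves the preprocessing to Spark and does not remove the loading bottleneck.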

Thank you!

Replies: 0 comments