Improve augment request handling to batch requests rather than running them serially #162

Description

@seasidesparrow

Currently, the master pipeline receives and processes augment pipeline requests serially, so only one Celery worker handles requests on both the augment and master pipelines. We should also use the load_only argument to avoid loading and sending the fulltext field.

Discussion from Slack (SMD+MT):

SMD: I think we could easily speed up this process. It looks like bibcodes are sent one at a time to augment, which incurs the queueing overhead a huge number of times. If app.request_aff_augment could handle a list of bibcodes, it could package the requests into a list protobuf object:
https://github.com/adsabs/ADSMasterPipeline/blob/41f874a33915b1f972b938316954849e3f2f1070/adsmp/app.py#L486
https://github.com/adsabs/ADSPipelineMsg/blob/master/specs/augmentrecord.proto#L15
The app.request_aff_augment call to get_record should also pass the optional load_only argument, since it only needs bib data and fulltext is big. If that doesn't help enough, we can request multiple database records at once. We could also have run.py simply queue batches of bibcodes and use workers to read the data from postgres and send off the augment requests.
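The batching idea above can be sketched in plain Python. This is only an illustration of the technique, not ADSMasterPipeline code: the names `chunked`, `BATCH_SIZE`, and `request_aff_augment_batch` are hypothetical, and in the real pipeline each batch would be serialized as a list-style protobuf (augmentrecord.proto) and queued as a single Celery task, with the per-record database lookup passing load_only to skip fulltext.

```python
# Hypothetical sketch: queue one message per batch of bibcodes instead of
# one message per bibcode, cutting the per-message queueing overhead.

BATCH_SIZE = 100  # illustrative batch size, not a pipeline constant

def chunked(items, size=BATCH_SIZE):
    """Yield successive fixed-size batches from a list of bibcodes."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def request_aff_augment_batch(bibcodes):
    """Package batches of bibcodes into a small number of messages.

    In the real pipeline, each dict here would instead be a list
    protobuf object handed to the augment queue, and the record
    lookup for each bibcode would use load_only to avoid fulltext.
    """
    messages = []
    for batch in chunked(bibcodes):
        # one queued message per batch -> far fewer queue round-trips
        messages.append({'bibcodes': list(batch)})
    return messages

# Example: 250 bibcodes become 3 queued messages instead of 250.
msgs = request_aff_augment_batch(['bib%04d' % i for i in range(250)])
```

With this shape, run.py could enqueue the batches directly and let workers expand each batch, read the needed columns from postgres, and send the augment requests, as suggested above.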

MT: That matches what I saw on the container. Without making use of the delay function in ADSAffil.tasks, the load was about 0.7, which sounds right for single-threaded operation. With the delay function, the load went up to about 2.2, which again makes sense if the receive, augment, and update queues are all running simultaneously. It also makes sense that adjusting the number of workers within augment_pipeline makes no difference.
