New Source Collector: "research" ML model to generate batches #299

Open
@josh-chamberlain

Description

Context

I have recently been talking with folks who use ChatGPT o3-mini-high to find actual URLs on the internet (not generate them) that contain sources for their mathematical research papers. While I'm still suspicious of closed-source LLMs (and LLMs in general), these tools are subsidized and cheap right now and will soon get too expensive... shouldn't we take advantage while we can?

I think part of what they are solving for is that googling is awful, which is the same problem we account for with our labeling pipeline. These models will probably generate much less junk than our other collection methods, and human labelers will still protect us from whatever junk does get through.

Requirements

  • Create a source collector based on a research ML model: not for doing research, but for finding sources.
    • Try o3-mini-high first.
    • Other models may be tried in the future and should work the same way. Can we run them from the same collector using options, or should we just make new collectors for other models?
  • The collector should accept a prompt and give the user some guidance about what a good prompt might look like. It will probably look different from a Google search query.
  • The collector should generate URLs like any other.
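
To make the requirements above concrete, here is a minimal sketch of one possible shape for such a collector. Everything in it is an assumption for illustration, not existing project code: the class name `ResearchModelCollector`, the `ask_model` callable, the `PROMPT_GUIDANCE` text, and the use of `"o3-mini-high"` as a plain label. The model call is injected as a function so that swapping models becomes an option on one collector rather than a new collector; whether that is the right trade-off is exactly the open question in this issue.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical guidance shown to the user; the wording is an assumption.
PROMPT_GUIDANCE = (
    "Describe the sources you want in full sentences rather than search "
    "keywords, and ask for real, existing URLs only."
)

# Rough URL matcher: stops at whitespace and common closing delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


@dataclass
class ResearchModelCollector:
    """Hypothetical collector: sends a prompt to a model, returns URLs."""

    # Function taking (model_label, prompt) and returning the model's reply
    # text. The real API call would live behind this; none is assumed here.
    ask_model: Callable[[str, str], str]
    # Plain label passed through to ask_model; not a verified API identifier.
    model: str = "o3-mini-high"

    def collect(self, prompt: str) -> List[str]:
        if not prompt.strip():
            raise ValueError("Prompt is empty. " + PROMPT_GUIDANCE)
        reply = self.ask_model(self.model, prompt)
        seen = set()
        urls = []
        for match in URL_PATTERN.findall(reply):
            url = match.rstrip(".,;:")  # drop trailing sentence punctuation
            if url not in seen:  # de-duplicate while keeping first-seen order
                seen.add(url)
                urls.append(url)
        return urls


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the sketch."""
    return (
        "Try https://example.com/reports, and https://example.org/data. "
        "See also https://example.com/reports"
    )


if __name__ == "__main__":
    collector = ResearchModelCollector(ask_model=fake_model)
    for url in collector.collect("Find pages that publish annual reports."):
        print(url)
```

The returned list is only a batch of candidate URLs. Nothing here checks that a URL resolves or is relevant, so the output would still go through the labeling pipeline like any other collector's batch.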

Thoughts

If this works well, we might consider using LLMs to sort more aggressively on relevancy, making the human part of labeling more fun and less subjective.

Metadata

Assignees

Labels

No labels

Type

No type

Projects

Status

Awaiting Dev

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
