Description
Context
I have been talking recently with folks who use ChatGPT o3-mini-high
to find actual URLs on the internet (not generate them) that point to sources for their mathematical research papers. While I'm still suspicious of closed-source LLMs (and LLMs in general), these things are subsidized and cheap now and will soon get too expensive... shouldn't we be taking advantage while we can?
I think part of what they are solving for is that googling is awful, which is exactly what our labeling pipeline accounts for. They will probably generate much less junk than our other collection methods, and we will still be protected from junk by human labelers.
Requirements
- Create a source collector based on a research ML model: not for doing research, but for finding sources.
  - Try o3-mini-high first.
  - Consider that other models may be tried in the future but will work the same way. Can we make them operate from the same collector using options, or should we just make new collectors for other models?
- The collector should accept a prompt and give the user some guidance about what a prompt might look like. It'll probably look different from a Google search.
- The collector should generate URLs like any other.
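Since the requirements above are model-agnostic, one way to sketch the collector is with the LLM client injected as a plain callable. Everything below (class and method names, the URL-extraction regex, the guidance string) is an assumption for illustration, not a settled design:

```python
import re
from typing import Callable

# Hypothetical sketch only: names here are assumptions, not our real API.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

class LLMSourceCollector:
    # Guidance shown to the user; a prompt reads more like a request to a
    # research assistant than a Google-style keyword query.
    PROMPT_GUIDANCE = (
        "Describe the topic and the kind of source you want, e.g. "
        "'find published survey articles on tropical geometry, with URLs', "
        "rather than keywords."
    )

    def __init__(self, complete: Callable[[str], str]):
        # The model-specific client (o3-mini-high or a future model) is
        # wrapped as a plain prompt -> text callable, so swapping models
        # means swapping this one argument (or wiring it to an option).
        self._complete = complete

    def collect(self, prompt: str) -> list[str]:
        """Send the prompt to the model and extract any URLs in the reply."""
        response = self._complete(prompt)
        seen: dict[str, None] = {}
        for url in URL_PATTERN.findall(response):
            # Trim trailing sentence punctuation and dedupe, keeping order;
            # human labelers still filter out junk downstream.
            seen.setdefault(url.rstrip(".,;"), None)
        return list(seen)
```

A stub callable makes this testable without network access; the real callable would wrap whichever model's API the collector is configured to use.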
Thoughts
If this works well, we might consider using LLMs to sort more aggressively on relevancy, making the human part of labeling more fun and less subjective.
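As a rough illustration of that idea, relevancy sorting could reuse the same injection pattern, with the model wrapped as a scoring callable (all names here are hypothetical):

```python
from typing import Callable

def sort_by_relevancy(
    urls: list[str],
    topic: str,
    score: Callable[[str, str], float],  # e.g. wraps an LLM "rate relevance 0-1" prompt
) -> list[str]:
    # Labelers then review a ranked list instead of an unordered pile.
    return sorted(urls, key=lambda u: score(topic, u), reverse=True)
```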