Support providing streams into mixer via CLI #130

@soldni

Description

@IanMagnusson asks

I'm trying to figure out how to mix using the dolma CLI args instead of the config. I want to do something like this, but I can't figure out how to index the streams arg correctly:

dolma mix --streams[0].name "$name" \
            --streams[0].documents "$input_prefix/$file" \
            --streams[0].output.path "$output_prefix/$file" \
            --streams[0].output.max_size_in_bytes 1000000000 \
            --streams[0].attributes s2orc-eval \
            --streams[0].filter.exclude '$@.attributes[?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]'

We should support this use case. As a stopgap, we should support echo '{...}' | dolma -c - mix, i.e. allow passing config through stdin.
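
A rough sketch of what both pieces could look like, not dolma's actual implementation. It assumes the config is JSON, and the helper names `load_config` and `set_by_path` are hypothetical. The first treats `-` as "read from stdin" (the stopgap); the second expands an indexed key such as `streams[0].output.path` into a nested config (the eventual CLI support).

```python
import json
import re
import sys
from typing import Any, TextIO

# Hypothetical helpers for illustration; these names are not part of dolma.
# Matches either a plain key ("streams") or a list index ("[0]").
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def load_config(path: str, stdin: TextIO = sys.stdin) -> dict:
    """Read a JSON config from `path`, treating "-" as standard input."""
    if path == "-":
        return json.load(stdin)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def _grow(items: list, index: int) -> None:
    """Pad `items` with None so that `items[index]` is addressable."""
    while len(items) <= index:
        items.append(None)


def set_by_path(config: dict, dotted: str, value: Any) -> None:
    """Set `value` at a key like 'streams[0].output.path', creating containers as needed."""
    tokens = [name if name else int(index) for name, index in _TOKEN.findall(dotted)]
    node: Any = config
    # Walk every token except the last, creating a list or dict for the next step.
    for token, nxt in zip(tokens, tokens[1:]):
        child: Any = [] if isinstance(nxt, int) else {}
        if isinstance(token, int):
            _grow(node, token)
            if node[token] is None:
                node[token] = child
            node = node[token]
        else:
            node = node.setdefault(token, child)
    last = tokens[-1]
    if isinstance(last, int):
        _grow(node, last)
    node[last] = value
```

With something like this, `echo '{...}' | dolma -c - mix` would resolve to `load_config("-")`, and each `--streams[0].…` flag could be applied on top of the loaded config with `set_by_path`. Quoting the flag names in the shell would still be needed, since unquoted `[0]` can be interpreted as a glob pattern.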

Metadata

Labels: enhancement (New feature or request)
