feat: add media type for unknown file type #53

Open
wants to merge 1 commit into
base: main
from

Conversation

aftersnow
Contributor

@aftersnow aftersnow commented May 19, 2025

In some cases, the model package system may not know the type of the file, or the file is not config, weight, doc, code or dataset, but it is still a valid model file, which is required by the downstream model serving/deployment system.

For instance:

In https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2/tree/main, there is a .nemo file.
In https://huggingface.co/google/gemma-3n-E4B-it-litert-preview/tree/main, there is a .task file.
In https://huggingface.co/mradermacher/SpaceThinker-Qwen2.5VL-3B-i1-GGUF/tree/main, there is a .dat file.

So the application/vnd.cnai.model.unknown.v1.tar media type and its friends are needed to represent this type.

The application/vnd.cnai.model.unknown.v1.tar+gzip and application/vnd.cnai.model.unknown.v1.tar+zstd media types represent the gzip and zstd compressed payloads of the application/vnd.cnai.model.unknown.v1.tar media type. If the file is large, implementations are RECOMMENDED to use application/vnd.cnai.model.unknown.v1.raw media type.
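
As a rough illustration of how a packager might choose among these four media types, here is a minimal Python sketch. The media type strings come from the description above; the function name `unknown_media_type` and the 1 GiB cutoff are assumptions made for the example, since the proposal only says "if the file is large".

```python
from typing import Optional

# Media type strings as written in the PR description.
UNKNOWN_TAR = "application/vnd.cnai.model.unknown.v1.tar"
UNKNOWN_RAW = "application/vnd.cnai.model.unknown.v1.raw"

# Hypothetical cutoff: the proposal does not define what "large" means.
RAW_THRESHOLD_BYTES = 1 << 30  # 1 GiB, chosen only for illustration


def unknown_media_type(size_bytes: int, compression: Optional[str] = None) -> str:
    """Pick a media type for a file the packager cannot categorize."""
    # Large files are recommended to be stored raw, without tar or compression.
    if size_bytes >= RAW_THRESHOLD_BYTES:
        return UNKNOWN_RAW
    if compression is None:
        return UNKNOWN_TAR
    if compression in ("gzip", "zstd"):
        # The compressed variants append a "+gzip" or "+zstd" suffix.
        return f"{UNKNOWN_TAR}+{compression}"
    raise ValueError(f"unsupported compression: {compression!r}")
```

For example, `unknown_media_type(4096, "zstd")` yields the `+zstd` variant, while any file at or above the assumed threshold maps to the `raw` type regardless of the requested compression.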

@aftersnow aftersnow force-pushed the add-unknown-media-type branch from 1350752 to 5d50d45 Compare May 19, 2025 12:45
@caozhuozi
Contributor

caozhuozi commented May 20, 2025

> which is required by the downstream model serving/deployment platform.

Does it mean that model serving/deployment platforms will explicitly require an unknown media type when attempting to serve a specific model?

@aftersnow
Contributor Author

> which is required by the downstream model serving/deployment platform.

> Does it mean that model serving/deployment platforms will explicitly require an unknown media type when attempting to serve a specific model?

No, only the model package system needs the unknown media type, because the file is not config, weight, code, doc, or dataset. The package system should just treat it as an opaque binary. In the end, the unknown media type is passed through to the model serving/deployment platform, which knows the type and consumes it. Examples include a .so or .lib file, or some novel model file type.
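
To make "treat it as an opaque binary" concrete, here is a minimal Python sketch of a packager that records a layer without inspecting its content. The `mediaType`, `digest`, and `size` fields follow the general shape of an OCI content descriptor; the function name `opaque_descriptor` is hypothetical and not part of any existing tool.

```python
import hashlib


def opaque_descriptor(payload: bytes, media_type: str) -> dict:
    """Describe a layer without parsing or interpreting its bytes."""
    # The packager only hashes and measures the payload; it never looks inside.
    digest = hashlib.sha256(payload).hexdigest()
    return {
        "mediaType": media_type,
        "digest": f"sha256:{digest}",
        "size": len(payload),
    }


# Usage: the file content is passed through untouched, and the consuming
# platform decides what to do with it.
layer = opaque_descriptor(
    b"\x7fELF...",  # placeholder bytes standing in for a .so file
    "application/vnd.cnai.model.unknown.v1.raw",
)
```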

@aftersnow aftersnow requested a review from gorkem May 20, 2025 03:24
@caozhuozi
Contributor

@aftersnow Thanks for the clarification. I'm not sure if "unknown" might be overused if we provide this option.
Perhaps we can further restrict its usage by tying it to weight?
How about using "model.weight.unknown"?

@aftersnow
Contributor Author

aftersnow commented May 20, 2025

> @aftersnow Thanks for the clarification. I'm not sure if "unknown" might be overused if we provide this option. Perhaps we can further restrict its usage by tying it to weight? How about using "model.weight.unknown"?

Yes, it might be overused. But when a file of an unknown type does exist, which we see frequently in our production environment, it may be assigned a wrong type. That is worse than an unknown type.

The problem with model.weight.unknown is that some unknown types are not model weights.

@gorkem
Contributor

gorkem commented May 20, 2025

This media type is both unnecessary and imprecise. The unknown media type fails to convey its intended use. A packaging system must embed clear, unambiguous metadata so that downstream services can automatically and reliably recognize exactly what they’re handling. Without this metadata, interoperability collapses and automation pipelines will inevitably break.

@aftersnow
Contributor Author

aftersnow commented May 21, 2025

> This media type is both unnecessary and imprecise. The unknown media type fails to convey its intended use. A packaging system must embed clear, unambiguous metadata so that downstream services can automatically and reliably recognize exactly what they’re handling. Without this metadata, interoperability collapses and automation pipelines will inevitably break.

Yes, I agree with you. But what if the system needs a type for, say, a .so file, a .lib file, or a binary file? It's not a code, config, weight, dataset, or doc type. @gorkem Maybe we can change unknown to other, or misc?

@amisevsk

In my opinion, object and lib files fall under the umbrella of code. I can't think of a case where we would want to distinguish between .so files and actual source code/scripts.

@aftersnow
Contributor Author

Thank you @amisevsk. The .so example may not be enough, so here are more examples from Hugging Face:

In https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2/tree/main, there is a .nemo file.
In https://huggingface.co/google/gemma-3n-E4B-it-litert-preview/tree/main, there is a .task file.
In https://huggingface.co/mradermacher/SpaceThinker-Qwen2.5VL-3B-i1-GGUF/tree/main, there is a .dat file.

Our package system cannot correctly categorize these files. They may not belong to any of the categories model weight, code, doc, config, or dataset. Therefore, we need a media type to accommodate this kind of file. Maybe "unknown" is misleading; perhaps "misc" or "other" would be better?

@amisevsk

amisevsk commented May 22, 2025

My approach in developing KitOps thus far has been to do a 'best effort' categorization, and leave it to the user to clarify any issues. With our current implementation, .nemo, .task, and .dat files would get included as 'code'-type layers (though in this case they appear to all be model-related).

To me, an 'unknown'/'misc' category is an undesirable element of the spec, as it's a dead end. Ideally, as the package system improves, it should be able to categorize all incoming files relatively accurately, accepting user input to correct any errors. With 'unknown' layers, tooling using the spec has to basically pretend they don't exist.

In other words, if we hit a file that can't be categorized, ultimately it will be on the end-user to provide additional context (i.e. say "this .nemo file is a model-related file and we would like it to be treated as such"). Sticking it in an 'unknown' layer type feels like a proxy for just returning an error or requiring additional input at packaging time.
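
A minimal Python sketch of the alternative described above: best-effort categorization by file extension, user-supplied overrides, and an error at packaging time when nothing matches. The extension table, the `categorize` function, and `UncategorizedFileError` are all hypothetical and do not reflect KitOps's actual implementation.

```python
from pathlib import PurePosixPath
from typing import Mapping, Optional

# Hypothetical best-effort table; a real packager would have its own rules.
DEFAULT_CATEGORIES = {
    ".safetensors": "weight",
    ".gguf": "weight",
    ".json": "config",
    ".md": "doc",
    ".py": "code",
}


class UncategorizedFileError(ValueError):
    """Raised when a file needs additional user input to be categorized."""


def categorize(path: str, overrides: Optional[Mapping[str, str]] = None) -> str:
    """Return a layer category, preferring user overrides over defaults."""
    suffix = PurePosixPath(path).suffix.lower()
    # User input takes precedence, so mistakes in the defaults can be corrected.
    if overrides and suffix in overrides:
        return overrides[suffix]
    if suffix in DEFAULT_CATEGORIES:
        return DEFAULT_CATEGORIES[suffix]
    # No 'unknown' fallback: ask the user for context at packaging time.
    raise UncategorizedFileError(f"cannot categorize {path!r}; please specify a type")
```

With this shape, a user who knows that a `.nemo` file is model-related could pass `overrides={".nemo": "weight"}`, while a file with no match fails loudly instead of landing in a catch-all layer.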

@aftersnow
Contributor Author

aftersnow commented May 23, 2025

> To me, an 'unknown'/'misc' category is an undesirable element of the spec, as it's a dead end. Ideally, as the package system improves, it should be able to categorize all incoming files relatively accurately, accepting user input to correct any errors. With 'unknown' layers, tooling using the spec has to basically pretend they don't exist.

It seems the word unknown is the problem. What if we change unknown to opaque (meaning the package system should only pass the file through to the user transparently), or something like that? Can we try to find a better name to solve this problem, as we discussed in the meeting yesterday? @gorkem @amisevsk

> Ideally, as the package system improves, it should be able to categorize all incoming files relatively accurately, accepting user input to correct any errors

First, new file types keep appearing in the rapidly developing field of AI, so the package system may need to be upgraded very frequently, which is a big burden for us. Second, there may be no correct type today for files such as '.nemo' or '.task': none of the current code, config, dataset, weight, or doc types is correct.

4 participants