FoundationVision

19 repositories

UniTok
Public
[NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding
tokenizer generative-model image-generation generative text-to-image autoregressive-models large-language-models generative-ai image-tokenizer
Python • MIT License • 7 forks • 419 stars • 9 issues • 0 pull requests • Updated Sep 22, 2025
Waver
Public
A video foundation model for unified Text-to-Video (T2V) and Image-to-Video (I2V) generation.
46 forks • 621 stars • 5 issues • 1 pull request • Updated Aug 27, 2025
Infinity
Public
[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
transformers generative-model image-generation auto-regressive-model gpt text-to-image gpt-2 autoregressive-models text-to-image-generation
Python • MIT License • 75 forks • 1.5k stars • 51 issues • 4 pull requests • Updated Jun 24, 2025
BitVAE
Public
Official training and inference code for the bitwise tokenizer
vae image-generation vqvae autoregressive-models image-tokenizer
Python • MIT License • 2 forks • 46 stars • 2 issues • 0 pull requests • Updated May 18, 2025
VAR
Public
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
transformers generative-model image-generation auto-regressive-model gpt neurips gpt-2 diffusion-models autoregressive-models vision-transformer
Jupyter Notebook • MIT License • 537 forks • 8.4k stars • 51 issues • 3 pull requests • Updated May 18, 2025
Liquid
Public
Liquid: Language Models are Scalable and Unified Multi-modal Generators
generative text-to-image image-gen autoregressive-models large-language-models text-to-image-generation llms generative-ai multimodal-large-language-models
Python • MIT License • 33 forks • 620 stars • 11 issues • 0 pull requests • Updated Apr 8, 2025
GenerateU
Public
[CVPR2024] Generative Region-Language Pretraining for Open-Ended Object Detection
open-world object-detection multimodality open-vocabulary mllm open-vocabulary-detection
Python • MIT License • 8 forks • 182 stars • 15 issues • 0 pull requests • Updated Mar 29, 2025
FlashVideo
Public
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
generative-models video-generation diffusion-models text-to-video efficient-generative-model
Python • Apache License 2.0 • 24 forks • 446 stars • 13 issues • 1 pull request • Updated Mar 5, 2025
UniRef
Public
[ICCV2023] Segment Every Reference Object in Spatial and Temporal Spaces
object-segmentation unified-model
Python • MIT License • 15 forks • 237 stars • 4 issues • 0 pull requests • Updated Feb 14, 2025
flashvideo-page
Public
HTML • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Feb 10, 2025
infinity.project
Public
HTML • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Dec 24, 2024
GLEE
Public
[CVPR2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
tracking open-world object-detection interactive-segmentation video-object-segmentation referring-expression-segmentation referring-expression-comprehension video-instance-segmentation zero-shot-object-detection referring-video-object-segmentation
Python • MIT License • 75 forks • 1.2k stars • 44 issues • 2 pull requests • Updated Oct 21, 2024
LlamaGen
Public
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
image-generation llama auto-regressive-model diffusion text2image diffusion-models llm
Python • MIT License • 89 forks • 1.9k stars • 64 issues • 1 pull request • Updated Aug 15, 2024
OmniTokenizer
Public
[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
vae image-generation auto-regressive-model tokenization video-generation vqvae
Python • MIT License • 8 forks • 314 stars • 9 issues • 0 pull requests • Updated Jul 9, 2024
vaex
Public
🔥 Stable, simple, state-of-the-art VQVAE toolkit & cookbook
Python • MIT License • 8 forks • 100 stars • 4 issues • 0 pull requests • Updated Jun 23, 2024
ByteTrack
Public
[ECCV 2022] ByteTrack: Multi-Object Tracking by Associating Every Detection Box
real-time deployment multi-object-tracking pytorch
Python • MIT License • 1k forks • 5.7k stars • 296 issues • 29 pull requests • Updated Jun 19, 2024
Groma
Public
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
llama multimodal grounding foundation-models large-language-models llm mllm vision-language-model llama2
Python • Apache License 2.0 • 45 forks • 579 stars • 15 issues • 1 pull request • Updated Jun 7, 2024
VNext
Public
Next-generation video instance recognition framework built on Detectron2, supporting InstMove (CVPR 2023), SeqFormer (ECCV Oral), and IDOL (ECCV Oral)
tracking motion transformer object-detection instance-segmentation video-instance-segmentation
Python • Apache License 2.0 • 55 forks • 615 stars • 41 issues • 1 pull request • Updated Feb 21, 2024
.github
Public
0 forks • 0 stars • 0 issues • 0 pull requests • Updated Dec 16, 2023