Unveiling Pixtral-12B-2409: Mistral AI’s Latest Multimodal Marvel
Mistral AI has recently unveiled its new multimodal large language model, Pixtral-12B-2409. This model combines text and image processing capabilities to deliver strong performance across a range of applications.
Model Architecture
Pixtral-12B-2409 features a 12-billion-parameter multimodal decoder paired with a 400-million-parameter vision encoder. This combination allows the model to natively process both text and images. The vision encoder is trained from scratch and is versatile enough to handle images of various sizes and aspect ratios, converting each 16x16 pixel patch into a token.
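The patching scheme above implies a simple token count per image. The helper below is a rough sketch of that accounting, assuming one token per 16x16 patch; the real tokenizer also inserts special layout tokens, so treat this as an estimate.

```python
import math

def image_token_count(width: int, height: int, patch: int = 16) -> int:
    """Rough token count for one image: one token per 16x16 patch.

    Assumption: partial patches at the right/bottom edges still cost a
    full token, hence the ceiling division. Special layout tokens that
    the real tokenizer adds are not counted here.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols * rows

# A 1024x768 image yields (1024/16) * (768/16) = 64 * 48 = 3072 patch tokens.
print(image_token_count(1024, 768))  # → 3072
```

Because patches are counted per dimension, a wide banner and a tall scan of the same pixel area cost roughly the same number of tokens, which is what lets the model handle arbitrary aspect ratios at native resolution.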
Key Features
- Multimodal Capabilities: The model is trained with interleaved image and text data, enabling it to understand and analyze both mediums effectively.
- Variable Image Sizes: Pixtral can handle images of any size and aspect ratio, processing them at their native resolution.
- Context Window: A 128,000-token context window lets the model process many images in a single prompt.
- Performance: Pixtral excels in both text-only and multimodal tasks, including instruction following, chart and figure understanding, and document question answering.
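The 128,000-token window and per-patch tokenization together bound how many images fit in one prompt. The sketch below estimates that budget under the same one-token-per-16x16-patch assumption as above; it ignores special tokens, so it is a planning number, not an exact limit.

```python
import math

CONTEXT_TOKENS = 128_000  # context window described above

def patches(width: int, height: int, patch: int = 16) -> int:
    # One token per 16x16 patch, rounding partial patches up.
    return math.ceil(width / patch) * math.ceil(height / patch)

def max_images(width: int, height: int, text_budget: int = 2_000) -> int:
    """Upper-bound estimate of how many same-sized images fit in the
    context window alongside a reserved text budget. The 2,000-token
    text budget is an illustrative assumption, not a model constraint."""
    per_image = patches(width, height)
    return (CONTEXT_TOKENS - text_budget) // per_image

print(max_images(1024, 1024))  # → 30 (each 1024x1024 image costs 4096 tokens)
```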
Technical Details
Pixtral-12B-2409 combines 12 billion parameters in the multimodal decoder with 400 million in the vision encoder, about 12.4 billion parameters in total, or roughly 24GB of weights at 16-bit precision. It tokenizes images using a patch size of 16x16 pixels and uses special tokens to handle different aspect ratios. Released under the Apache 2.0 license, it is free for commercial use.
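The ~24GB figure follows directly from the parameter count. A back-of-envelope check, assuming 2 bytes per parameter (bfloat16):

```python
# Weight-size sanity check: decoder + vision encoder parameters at
# 2 bytes each (bfloat16) should land near the ~24GB figure quoted above.
decoder_params = 12_000_000_000
encoder_params = 400_000_000
total_params = decoder_params + encoder_params

bytes_bf16 = total_params * 2        # 2 bytes per bf16 parameter
size_gb = bytes_bf16 / 1e9
print(f"{size_gb:.1f} GB")           # → 24.8 GB
```

Actual disk footprint varies slightly with the serialization format and any extra buffers, so the quoted "approximately 24GB" is consistent with this estimate.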
Benchmarks and Performance
The model performs strongly on various benchmarks, surpassing comparable models such as Qwen2-VL 7B, LLaVA-OV 7B, and Phi-3 Vision. Key benchmark scores include:
- 52.5% on the MMMU reasoning benchmark
- 58.0% on MathVista
- 81.8% on ChartQA
- 90.7% on DocVQA
- 78.6% on VQAv2
Usage and Availability
Pixtral-12B-2409 is accessible via Mistral AI’s platforms, Le Chat and La Plateforme, and is also available on Hugging Face. While the model supports multiple image formats (PNG, JPEG, WEBP, non-animated GIF), it has a file size limit of 10MB per image and a maximum of 8 images per API request. Currently, fine-tuning the image capabilities is not supported.
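The per-request limits above are easy to enforce client-side before calling the API. The sketch below builds a chat-style payload with inline base64 images and validates the documented limits; the message schema mirrors a chat-completions request with `image_url` content parts, but it is an assumption for illustration, so check Mistral's API reference before relying on the exact field names.

```python
import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB per image, per the docs above
MAX_IMAGES = 8                      # maximum images per API request

def build_payload(prompt: str, images: list[bytes]) -> dict:
    """Build a chat payload with inline base64 images, enforcing the
    documented limits client-side.

    The content-part structure below is a sketch of a chat-completions
    style request body, not a verified schema.
    """
    if len(images) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per request")
    content = [{"type": "text", "text": prompt}]
    for img in images:
        if len(img) > MAX_IMAGE_BYTES:
            raise ValueError("each image must be 10MB or smaller")
        b64 = base64.b64encode(img).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": f"data:image/png;base64,{b64}",
        })
    return {
        "model": "pixtral-12b-2409",
        "messages": [{"role": "user", "content": content}],
    }

# POST the returned dict as JSON to the chat-completions endpoint with
# your API key; the payload itself is built and validated offline here.
payload = build_payload("Describe this chart.", [b"\x89PNG...fake bytes"])
```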
Applications
Pixtral-12B-2409 is well-suited for tasks such as image captioning, object counting in photos, OCR, and understanding complex diagrams and documents. It can also process satellite images, making it highly versatile across various domains.
Overall, Pixtral-12B-2409 is a significant advancement in multimodal AI, offering robust performance in both text and image processing tasks.