Unveiling Pixtral-12B-2409: Mistral AI’s Latest Multimodal Marvel
Mistral AI has recently unveiled its new multimodal large language model, Pixtral-12B-2409. This model combines text and image processing capabilities to deliver strong performance across a range of applications.
Model Architecture
Pixtral-12B-2409 features a 12-billion-parameter multimodal decoder paired with a 400-million-parameter vision encoder. This combination allows the model to natively process both text and images. The vision encoder is trained from scratch and is versatile enough to handle images of various sizes and aspect ratios, converting each 16x16 pixel patch into a token.
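The patching scheme above implies a simple token count per image. The helper below is a rough sketch of that accounting, assuming one token per 16x16 patch; the real tokenizer also inserts special layout tokens, so treat this as an estimate.

```python
import math

def image_token_count(width: int, height: int, patch: int = 16) -> int:
    """Rough token count for one image: one token per 16x16 patch.

    Assumption: partial patches at the right/bottom edges still cost a
    full token, hence the ceiling division. Special layout tokens that
    the real tokenizer adds are not counted here.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols * rows

# A 1024x768 image yields (1024/16) * (768/16) = 64 * 48 = 3072 patch tokens.
print(image_token_count(1024, 768))  # → 3072
```

Because patches are counted per dimension, a wide banner and a tall scan of the same pixel area cost roughly the same number of tokens, which is what lets the model handle arbitrary aspect ratios at native resolution.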
Key Features
- Multimodal Capabilities: The model is trained with interleaved image and text data, enabling it to understand and analyze both mediums effectively.
- Variable Image Sizes: Pixtral can handle images of any size and aspect ratio, processing them at their native resolution.
- Context Window: A 128,000-token context window lets the model process many images in a single prompt.
- Performance: Pixtral excels in both text-only and multimodal tasks, including instruction following, chart and figure understanding, and document question answering.
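The 128,000-token window and per-patch tokenization together bound how many images fit in one prompt. The sketch below estimates that budget under the same one-token-per-16x16-patch assumption as above; it ignores special tokens, so it is a planning number, not an exact limit.

```python
import math

CONTEXT_TOKENS = 128_000  # context window described above

def patches(width: int, height: int, patch: int = 16) -> int:
    # One token per 16x16 patch, rounding partial patches up.
    return math.ceil(width / patch) * math.ceil(height / patch)

def max_images(width: int, height: int, text_budget: int = 2_000) -> int:
    """Upper-bound estimate of how many same-sized images fit in the
    context window alongside a reserved text budget. The 2,000-token
    text budget is an illustrative assumption, not a model constraint."""
    per_image = patches(width, height)
    return (CONTEXT_TOKENS - text_budget) // per_image

print(max_images(1024, 1024))  # → 30 (each 1024x1024 image costs 4096 tokens)
```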
Technical Details
Pixtral-12B-2409 combines 12 billion parameters in the multimodal decoder with 400 million in the vision encoder, about 12.4 billion parameters in total, or roughly 24GB of weights at 16-bit precision. It tokenizes images using a patch size of 16x16 pixels and uses special tokens to handle different aspect ratios. Released under the Apache 2.0 license, it is free for commercial use.
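The ~24GB figure follows directly from the parameter count. A back-of-envelope check, assuming 2 bytes per parameter (bfloat16):

```python
# Weight-size sanity check: decoder + vision encoder parameters at
# 2 bytes each (bfloat16) should land near the ~24GB figure quoted above.
decoder_params = 12_000_000_000
encoder_params = 400_000_000
total_params = decoder_params + encoder_params

bytes_bf16 = total_params * 2        # 2 bytes per bf16 parameter
size_gb = bytes_bf16 / 1e9
print(f"{size_gb:.1f} GB")           # → 24.8 GB
```

Actual disk footprint varies slightly with the serialization format and any extra buffers, so the quoted "approximately 24GB" is consistent with this estimate.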
Benchmarks and Performance
The model performs strongly on various benchmarks, surpassing comparable models such as Qwen2-VL 7B, LLaVA-OV 7B, and Phi-3 Vision. Key benchmark scores include:
- 52.5% on the MMMU reasoning benchmark
- 58.0% on MathVista
- 81.8% on ChartQA
- 90.7% on DocVQA
- 78.6% on VQAv2
Usage and Availability
Pixtral-12B-2409 is accessible via Mistral AI’s platforms, Le Chat and La Plateforme, and is also available on Hugging Face. While the model supports multiple image formats (PNG, JPEG, WEBP, non-animated GIF), it has a file size limit of 10MB per image and a maximum of 8 images per API request. Currently, fine-tuning the image capabilities is not supported.
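The per-request limits above are easy to enforce client-side before calling the API. The sketch below builds a chat-style payload with inline base64 images and validates the documented limits; the message schema mirrors a chat-completions request with `image_url` content parts, but it is an assumption for illustration, so check Mistral's API reference before relying on the exact field names.

```python
import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB per image, per the docs above
MAX_IMAGES = 8                      # maximum images per API request

def build_payload(prompt: str, images: list[bytes]) -> dict:
    """Build a chat payload with inline base64 images, enforcing the
    documented limits client-side.

    The content-part structure below is a sketch of a chat-completions
    style request body, not a verified schema.
    """
    if len(images) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per request")
    content = [{"type": "text", "text": prompt}]
    for img in images:
        if len(img) > MAX_IMAGE_BYTES:
            raise ValueError("each image must be 10MB or smaller")
        b64 = base64.b64encode(img).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": f"data:image/png;base64,{b64}",
        })
    return {
        "model": "pixtral-12b-2409",
        "messages": [{"role": "user", "content": content}],
    }

# POST the returned dict as JSON to the chat-completions endpoint with
# your API key; the payload itself is built and validated offline here.
payload = build_payload("Describe this chart.", [b"\x89PNG...fake bytes"])
```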
Applications
Pixtral-12B-2409 is well-suited for tasks such as image captioning, object counting in photos, OCR, and understanding complex diagrams and documents. It can also process satellite images, making it highly versatile across various domains.
Overall, Pixtral-12B-2409 is a significant advancement in multimodal AI, offering robust performance in both text and image processing tasks.