Exploring Vertex AI's Latest: Llama-3.2-90B Vision-Instruct-MaaS
The Vertex AI platform has just unveiled the Llama-3.2-90B Vision-Instruct model, a cutting-edge addition to Meta's new generation of multimodal large language models (LLMs). The model reasons over combined text and image inputs, enabling multimodal tasks such as image captioning and visual question answering.
Model Architecture and Capabilities
The Llama-3.2-90B Vision-Instruct model pairs an optimized transformer architecture with a vision encoder that is wired into the pre-trained Llama 3.1 language model through a series of cross-attention layers. This design lets the model accept both text and image inputs while producing text outputs (a sketch of the cross-attention fusion follows the list below). Key functionalities include:
- Image captioning
- Image-text retrieval
- Visual grounding
- Visual Q&A
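To make the cross-attention fusion concrete, here is a minimal PyTorch sketch in which text hidden states attend over image tokens from a vision encoder. All dimensions, module names, and the single-block structure are illustrative assumptions, not Llama 3.2's actual implementation.

```python
# Illustrative sketch: fusing vision-encoder output into a text stream
# via cross-attention. Dimensions and names are made up for clarity.
import torch
import torch.nn as nn

class VisionCrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Text hidden states act as queries over the image tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from text; keys and values come from the vision encoder.
        attn_out, _ = self.cross_attn(text_hidden, image_tokens, image_tokens)
        return self.norm(text_hidden + attn_out)  # residual connection

block = VisionCrossAttentionBlock()
text = torch.randn(1, 32, 512)    # 32 text positions
image = torch.randn(1, 196, 512)  # e.g. 14x14 patch tokens from the encoder
print(block(text, image).shape)   # torch.Size([1, 32, 512])
```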
Training and Optimization
Trained on NVIDIA H100 Tensor Core GPUs, the model is optimized for high throughput and low latency. It employs supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety. The vision encoder gains additional hardware-level optimizations by being exported to an ONNX graph and then built into a TensorRT engine, a pipeline sketched below.
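As a rough illustration of that export path, the sketch below traces a stand-in vision encoder to ONNX with PyTorch and then builds a TensorRT engine via the trtexec CLI. The torchvision model, file names, and flags are assumptions for demonstration, not the actual Llama 3.2 build process.

```python
# Sketch of the ONNX-then-TensorRT pipeline described above, using a
# torchvision ViT as a stand-in for the real vision encoder.
import torch
import torchvision

encoder = torchvision.models.vit_b_16(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image

torch.onnx.export(
    encoder, dummy, "vision_encoder.onnx",
    input_names=["pixel_values"], output_names=["image_features"],
    dynamic_axes={"pixel_values": {0: "batch"}},  # allow variable batch size
)

# Then, on a machine with TensorRT installed, build the engine:
#   trtexec --onnx=vision_encoder.onnx --saveEngine=vision_encoder.plan --fp16
```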
Deployment on Vertex AI
The Llama-3.2-90B Vision-Instruct model is accessible via Vertex AI's Model Garden, offering a fully managed and serverless Model-as-a-Service (MaaS) experience. Developers can easily access, customize, and deploy the model without managing infrastructure. Key benefits include:
- Simple API calls for experimentation (see the example after this list)
- Fine-tuning with custom data
- Fully managed infrastructure
- Pay-as-you-go billing
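As a taste of those simple API calls, here is a minimal sketch that sends an image-plus-text prompt through the OpenAI-compatible chat completions surface that Vertex AI exposes for MaaS models. The project ID, region, model ID string, and endpoint path are assumptions; check the Model Garden documentation for the current values.

```python
# Minimal sketch: calling the Llama-3.2-90B Vision-Instruct MaaS model
# through Vertex AI's OpenAI-compatible chat completions endpoint.
import google.auth
import google.auth.transport.requests
import requests

PROJECT_ID = "your-project"  # placeholder: your GCP project ID
REGION = "us-central1"       # assumption: a region where the model is served
MODEL_ID = "meta/llama-3.2-90b-vision-instruct-maas"  # assumed model ID

# Authenticate with Application Default Credentials.
creds, _ = google.auth.default()
creds.refresh(google.auth.transport.requests.Request())

url = (f"https://{REGION}-aiplatform.googleapis.com/v1beta1/"
       f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/"
       "chat/completions")

payload = {
    "model": MODEL_ID,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder image
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    "max_tokens": 256,
}

resp = requests.post(url, headers={"Authorization": f"Bearer {creds.token}"},
                     json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```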
Use Cases
This model excels in scenarios requiring visual reasoning, such as:
- Image-based search
- Content generation
- Interactive educational tools
- Image captioning
- Visual Q&A
- Document Q&A
Technical Specifications
The model has 90 billion parameters and a 128K-token context length, and it uses grouped query attention (GQA) to improve inference scalability (illustrated below). For image+text applications it officially supports English, although it has been trained on a broader range of languages for text-only tasks.
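Grouped query attention itself is easy to picture: several query heads share each key/value head, which shrinks the key/value cache during inference. The toy dimensions below are made-up values for clarity, not Llama 3.2's real configuration.

```python
# Toy illustration of grouped query attention (GQA): 8 query heads
# share 2 key/value heads, so each KV head serves a group of 4 queries.
import torch
import torch.nn.functional as F

batch, seq_len = 2, 16
n_q_heads, n_kv_heads, head_dim = 8, 2, 64
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)  # far fewer KV heads to cache
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand each KV head across its group of query heads, then attend.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```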
Integration and Ecosystem
Vertex AI provides a unified platform for experimenting, customizing, and deploying Llama-3.2 models. It integrates with tools like LangChain and Genkit’s Vertex AI plugin, facilitating the creation of intelligent agents powered by Llama-3.2.
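For example, one way to reach the model from LangChain is through its OpenAI-compatible client pointed at the same MaaS endpoint used above; the base_url and model strings remain assumptions, and the dedicated langchain-google-vertexai package or Genkit's Vertex AI plugin are alternative routes.

```python
# Sketch: driving the MaaS endpoint from LangChain via its
# OpenAI-compatible chat client. Endpoint and model ID are assumptions.
import google.auth
import google.auth.transport.requests
from langchain_openai import ChatOpenAI

creds, _ = google.auth.default()
creds.refresh(google.auth.transport.requests.Request())

PROJECT_ID, REGION = "your-project", "us-central1"  # placeholders

llm = ChatOpenAI(
    model="meta/llama-3.2-90b-vision-instruct-maas",  # assumed model ID
    base_url=(f"https://{REGION}-aiplatform.googleapis.com/v1beta1/"
              f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi"),
    api_key=creds.token,  # short-lived OAuth token, not a long-lived API key
)
print(llm.invoke("In one sentence, what is visual grounding?").content)
```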
Availability
The 90B model is currently in preview on Vertex AI and will soon be generally available; an 11B vision model is also expected as MaaS in the near future.