Exploring Vertex AI's Latest: Llama-3.2-90B Vision-Instruct-MaaS

The Vertex AI platform has just unveiled the Llama-3.2-90B Vision-Instruct model, a cutting-edge addition to Meta's new generation of multimodal large language models (LLMs). This model is designed to integrate text and image inputs, enabling a wide range of advanced AI tasks.

Model Architecture and Capabilities

The Llama-3.2-90B Vision-Instruct model pairs an optimized transformer architecture with a vision encoder that connects to the pre-trained Llama 3.1 language model through a series of cross-attention layers. This setup lets the model take both text and image inputs and produce text outputs (a conceptual sketch of the cross-attention wiring follows the list below). Key functionalities include:

  • Image captioning
  • Image-text retrieval
  • Visual grounding
  • Visual Q&A
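
As a rough illustration of the cross-attention pattern described above, the following PyTorch sketch shows text hidden states attending to image patch embeddings. The class name, dimensions, and layer layout are illustrative assumptions, not Meta's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Conceptual sketch: text hidden states attend to image patch embeddings.

    Dimensions and layer layout are illustrative, not the real 90B config.
    """
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states, image_embeddings):
        # Queries come from the language model; keys/values from the vision encoder.
        attended, _ = self.cross_attn(
            query=text_states, key=image_embeddings, value=image_embeddings
        )
        # Residual connection keeps the pre-trained text pathway intact.
        return self.norm(text_states + attended)

# Tiny smoke test with illustrative shapes.
text = torch.randn(1, 32, 512)    # (batch, text tokens, hidden size)
image = torch.randn(1, 256, 512)  # (batch, image patches, hidden size)
print(CrossAttentionBlock()(text, image).shape)  # torch.Size([1, 32, 512])
```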

Training and Optimization

Trained on NVIDIA H100 Tensor Core GPUs, the model is optimized for high throughput and low latency. It employs supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The vision encoder gains hardware-level optimizations by being exported to an ONNX graph and then built into a TensorRT engine, a path sketched below.
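
The ONNX-then-TensorRT path can be sketched as follows. The `TinyVisionEncoder` stand-in, its shapes, and the file names are assumptions for illustration only; the real encoder and build flags will differ.

```python
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    """Stand-in for the actual vision encoder (assumption for illustration)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=14, stride=14)  # patchify the image
        self.proj = nn.Linear(64, 512)                           # project to embed dim

    def forward(self, pixels):
        feats = self.conv(pixels)                 # (B, 64, H/14, W/14)
        feats = feats.flatten(2).transpose(1, 2)  # (B, num_patches, 64)
        return self.proj(feats)                   # (B, num_patches, 512)

encoder = TinyVisionEncoder().eval()
dummy = torch.randn(1, 3, 448, 448)

# 1) Export the encoder into an ONNX graph.
torch.onnx.export(
    encoder, dummy, "vision_encoder.onnx",
    input_names=["pixels"], output_names=["patch_embeddings"],
    opset_version=17,
)

# 2) Build a TensorRT engine from the ONNX graph (run on the command line):
#    trtexec --onnx=vision_encoder.onnx --saveEngine=vision_encoder.plan --fp16
```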

Deployment on Vertex AI

The Llama-3.2-90B Vision-Instruct model is accessible via Vertex AI's Model Garden, offering a fully managed, serverless Model-as-a-Service (MaaS) experience. Developers can access, customize, and deploy the model without managing infrastructure; an example request follows the list below. Key benefits include:

  • Simple API calls for experimentation
  • Fine-tuning with custom data
  • Fully managed infrastructure
  • Pay-as-you-go billing
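
As a starting point for those simple API calls, the snippet below uses the OpenAI-compatible endpoint that Vertex AI exposes for MaaS models. The project ID, region, endpoint path version, and model ID are assumptions to verify against the current Model Garden documentation.

```python
import google.auth
import google.auth.transport.requests
import openai

# Obtain a short-lived access token from Application Default Credentials.
creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(google.auth.transport.requests.Request())

PROJECT = "your-project-id"  # assumption: replace with your project
REGION = "us-central1"       # assumption: check currently supported regions

# OpenAI-compatible MaaS endpoint; confirm the exact path/version in the docs.
client = openai.OpenAI(
    base_url=f"https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT}/locations/{REGION}/endpoints/openapi",
    api_key=creds.token,
)

response = client.chat.completions.create(
    model="meta/llama-3.2-90b-vision-instruct-maas",  # model ID may differ; check Model Garden
    messages=[{"role": "user", "content": "Summarize what multimodal LLMs can do."}],
)
print(response.choices[0].message.content)
```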

Use Cases

This model excels in scenarios requiring visual reasoning, such as the following (a sample visual Q&A request appears after the list):

  • Image-based search
  • Content generation
  • Interactive educational tools
  • Image captioning
  • Visual Q&A
  • Document Q&A
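
To give the visual and document Q&A use cases a concrete flavor, the sketch below reuses the OpenAI-compatible client configured in the deployment section and pairs an image with a question. The image URL is a placeholder assumption.

```python
# `client` is the OpenAI-compatible client configured in the deployment section above.
response = client.chat.completions.create(
    model="meta/llama-3.2-90b-vision-instruct-maas",  # model ID may differ; check Model Garden
    messages=[{
        "role": "user",
        "content": [
            # Placeholder image URL; replace with an image the service can access.
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            {"type": "text", "text": "What is the total amount due on this invoice?"},
        ],
    }],
)
print(response.choices[0].message.content)
```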

Technical Specifications

With 90 billion parameters and a 128K-token context length, the model uses grouped query attention (GQA) for improved inference scalability; a minimal GQA sketch follows below. For image+text applications it supports English, although it has been trained on a broader range of languages for text-only tasks.
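
To make the GQA mention concrete, here is a minimal sketch of grouped query attention, in which several query heads share each key/value head. The head counts and dimensions are illustrative, not the 90B model's actual configuration.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_q_heads=8, num_kv_heads=2):
    """Minimal GQA: groups of query heads share one key/value head.

    q: (batch, seq, num_q_heads, head_dim)
    k, v: (batch, seq, num_kv_heads, head_dim)
    """
    group = num_q_heads // num_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)
    # Standard scaled dot-product attention over heads.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (batch, heads, seq, dim)
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2)                        # (batch, seq, heads, dim)

# Tiny smoke test with illustrative shapes.
b, s, d = 1, 16, 64
q = torch.randn(b, s, 8, d)
k = torch.randn(b, s, 2, d)
v = torch.randn(b, s, 2, d)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 16, 8, 64])
```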

Integration and Ecosystem

Vertex AI provides a unified platform for experimenting with, customizing, and deploying Llama-3.2 models. It integrates with tools such as LangChain and Genkit's Vertex AI plugin, making it easier to build intelligent agents powered by Llama-3.2; one possible wiring is sketched below.
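
One possible LangChain wiring, offered as an assumption rather than the official plugin path, is to point LangChain's OpenAI-compatible chat model at the same MaaS endpoint used earlier.

```python
import google.auth
import google.auth.transport.requests
from langchain_openai import ChatOpenAI

# Obtain a short-lived access token from Application Default Credentials.
creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(google.auth.transport.requests.Request())

# Endpoint path, project, region, and model ID are assumptions; verify in the docs.
llm = ChatOpenAI(
    base_url="https://us-central1-aiplatform.googleapis.com/v1beta1/projects/your-project-id/locations/us-central1/endpoints/openapi",
    api_key=creds.token,
    model="meta/llama-3.2-90b-vision-instruct-maas",
)
print(llm.invoke("Name three visual reasoning tasks.").content)
```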

Availability

Currently in preview on Vertex AI, the 90B model will soon be generally available, with an 11B vision model also coming as MaaS in the near future.
