Introducing Llama 3.1 405B: The Largest Open Foundation Model on Vertex AI
Google Cloud's Vertex AI has just unveiled the Llama 3.1 405B, the largest openly available foundation model to date. This model is part of the Llama 3.1 family, which also includes models with 8B and 70B parameters.
Features and Capabilities
Performance and Versatility: Llama 3.1 405B sets a new standard for open models, offering unparalleled flexibility, control, and innovation. It excels in generating synthetic data, handling complex reasoning tasks, and performing direct inference scenarios with minimal fine-tuning.
Multilingual Support: The models support eight languages, enhancing their global applicability.
Context Length: With an expanded context of 128,000 tokens, these models can deeply comprehend longer, more complex text.
Deployment and Usage
Model-as-a-Service (MaaS): Llama 3.1 models can be accessed via MaaS, enabling simple API calls and comprehensive evaluations without complex deployment processes.
Fine-Tuning: Users can fine-tune the models using their own data to build bespoke solutions. Fine-tuning for the 405B model will be available in the coming weeks.
Infrastructure: Deployment can be done using Google Cloud's A3 nodes with 8 x NVIDIA H100 GPUs, and the model can be loaded using Hugging Face Deep Learning Containers (DLCs).
Technical Details
Quantization: The 405B model is quantized to 8-bit (FP8) numerics, reducing compute requirements and allowing it to run within a single server node.
Resource Requirements: The model requires significant resources, such as 400 GiB of disk space and GPUs that support FP8 data type.
Use Cases
Synthetic Data Generation: The 405B model can generate synthetic data to improve and train smaller models through distillation.
Complex Reasoning and Tool Use: The models are effective for tool use, supporting zero-shot tool use and specific capabilities like search, image generation, code execution, and mathematical reasoning.
Access and Integration
Vertex AI Model Garden: The models are available for self-service in Vertex AI Model Garden, allowing users to choose their preferred infrastructure.
Integration Tools: Users can integrate Llama 3.1 into their AI experiences using tools like LangChain on Vertex AI and Genkit’s Vertex AI plugin.
Steps to Deploy
- Register the Model: Use the
google-cloud-aiplatform
Python SDK to register the Llama 3.1 405B model on Vertex AI. - Create Endpoint: Create a Vertex AI Endpoint and deploy the model using the Hugging Face DLC for Text Generation Inference (TGI).
- Resource Allocation: Ensure the deployment node has sufficient resources, such as an A3 instance with 8 x NVIDIA H100 GPUs.
Additional Notes
General Availability: The 405B model is currently in preview, with general availability expected in the coming weeks.
Cost and Billing: There are no charges during the preview period. For production-ready services, users will need to consider the costs associated with the self-hosted model.