# Model Hosting Platform - Model Onboarding Guide
The RAI Model Hosting Platform is designed to streamline AI model deployment, improve performance efficiency, and provide a seamless experience for model owners solving AI safety challenges. It empowers AI model owners with scalable, efficient, and reliable hosting: models hosted on the platform integrate with robust security infrastructure to provide advanced protection for leading LLMs.
## Value Proposition
### Target Users
- 👩‍💻 **AI Model Owners** – Researchers and engineers building AI models for safety.
### Customer Pain Points vs. Platform Value Proposition
| Model Onboarding Lifecycle Step | Pain Points | Our Value Proposition |
|---|---|---|
| 1. Build the Model Image | 🚨 Slow Image Build Time – Lack of standardization leads to long build times, delaying model deployment and iteration. | ✅ Standardized & Accelerated Model Deployment – Reduces image build time with a standardized format, enabling faster and more efficient deployment. |
| 2. Collect Performance Metrics | 🚨 Inconsistent Performance Metrics – No unified benchmarking methodology, making it difficult to assess model efficiency and estimate capacity needs. | ✅ Unified Performance Benchmarks – Provides accurate AI model metrics using a unified benchmarking approach, improving efficiency and scalability. |
| 3. Manage Capacity | 🚨 Over-Provisioning – Wastes compute power and increases costs. 🚨 Under-Provisioning – Causes latency issues, slow response times, and downtime. | ✅ Cost-Efficient Resource Allocation – Dynamically optimizes resource usage to reduce waste while maintaining performance. |
| 4. Register the Model | 🚨 Lack of Model and Dedicated GPU Utilization Reports – No visibility into model and GPU capacity utilization for allocated resources. | |
| 5. Model Roll Out | 🚨 Slow & Inefficient Model Deployment – Long build times and lack of standardization slow down iteration cycles. | ✅ Seamless Deployment & Real-Time Monitoring – Automates deployment, reducing operational overhead, and strengthens security and compliance through customer isolation. ✅ Model Multi-Version Support – Enables parallel execution of multiple model versions and handles contract changes efficiently. |
## Overview
This guide outlines the essential steps for onboarding your model to the RAI Platform. We offer two onboarding options to meet diverse customer needs: the Regular Tier and the Fast-Tracked Tier. The Regular Tier may take up to 7 days to complete but requires minimal effort from the customer; it is recommended for the initial onboarding of a model. The Fast-Tracked Tier completes within 24 hours but requires more involvement from the customer; it is ideal for model refreshes, including updates to existing models, hotfixes, and configuration changes. Both options assume adequate capacity is available. The following checklist outlines the tasks for each tier.
| # | Task | Regular Tier | Fast-Tracked Tier |
|---|---|---|---|
| 1 | Prepare required information about the model | | |
| 2 | Build the model image | | |
| 3 | Prepare model performance metrics | | |
| 4 | Confirm model capacity with the Capacity PoC | | |
| 5 | Register the model through the Model Registration API | | |
| 6 | Inform the Hosting PoC to start rollout | | |
## Task Description
This section describes each task in detail, covering its purpose, steps, and requirements.
### Prepare Required Model Information
- Model Information
  - Model name and version
  - Harm categories and model modality
  - Input and output schemas. The content type must be JSON, and schemas may not change for the Fast-Tracked Tier (see the example after this list).
  - Target SKU (e.g., CPU/GPU type, GPU microarchitecture such as Volta, Ampere, or Hopper). If unspecified, the SKU is determined during performance evaluation for a new model; for model refreshes, the SKU remains the same.
- Service Capability
  - Maximum payload size
  - Acceptable latency statistics
  - Expected requests per second (RPS) per region/cloud
Fill out the form once you have the above information ready.
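For illustration only, here is a minimal sketch of what a JSON input/output schema pair might look like for a hypothetical text-classification safety model. The field names (`text`, `harm_category`, `severity`) are assumptions, not platform requirements; your model defines its own contract.

```python
import json

# Hypothetical input schema for a text-classification safety model.
# Field names and limits are illustrative assumptions.
input_schema = {
    "type": "object",
    "properties": {
        "text": {"type": "string", "maxLength": 10000},
    },
    "required": ["text"],
}

# Hypothetical output schema: a harm category plus a severity score.
output_schema = {
    "type": "object",
    "properties": {
        "harm_category": {"type": "string"},
        "severity": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["harm_category", "severity"],
}

print(json.dumps(input_schema, indent=2))
print(json.dumps(output_schema, indent=2))
```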
### Build the Model Image
The RAI Platform serves models using Docker containers on Azure Machine Learning with Singularity as the backend.
#### Requirements for the Docker Image
- Must be built on a Linux system.
- Must respond to requests on the liveness, readiness, and scoring routes (see the sketch below).
- Must have an entrypoint of `['runsvdir', '/var/runit']`.
For a step-by-step guide, refer to the Tutorial page.
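As a minimal sketch of the routing requirement, the Flask app below exposes liveness, readiness, and scoring routes. The route paths used here (`/livez`, `/readyz`, `/score`) and the port are assumptions for illustration; use the routes and server setup specified in the Tutorial page.

```python
# Minimal sketch of a container web server answering the three required
# routes. Paths and port are illustrative assumptions, not the platform spec.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/livez")
def liveness():
    # Return 200 as soon as the process is up.
    return jsonify(status="alive")

@app.route("/readyz")
def readiness():
    # Return 200 only once the model is loaded and able to serve traffic.
    return jsonify(status="ready")

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json(force=True)
    # Placeholder scoring logic; replace with real model inference.
    text = payload.get("text", "")
    return jsonify(harm_category="none", severity=0.0, characters=len(text))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```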
The image should be hosted in Azure Container Registry. Please make sure that the `AcrPull` role on the container registry is assigned to the following two service principals:
- AICPRuntime
- RAIServiceOpsPlatformADO
### Prepare Model Performance Metrics
Performance metrics are required to determine model capacity and deployment. The following data should be collected:
- SKU – The computing instance type used to run the model.
- Payload Size – Size of test samples (e.g., image dimensions, text length in characters).
- Latency – Latency statistics under target RPS (max, min, average, P50, P75, P90, P95, P99).
- Resource Utilization – CPU, GPU, memory, and GPU memory usage during load tests. Optional where resource utilization is not applicable.
Ensure the SKU matches those supported by the RAI Platform. If needed, contact the model hosting PoC with the subject line "Performance Evaluation" for assistance in collecting metrics.
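If you collect the metrics yourself, a simple load driver like the sketch below can produce the required latency statistics. The endpoint URL and payload are placeholders, and a real evaluation would typically use a dedicated load-testing tool at each target RPS.

```python
# Minimal load-test sketch: fire requests at a fixed target RPS and
# report the latency statistics the platform asks for.
# The URL and payload below are placeholders.
import time
import statistics
import concurrent.futures

import requests

URL = "http://localhost:8080/score"   # placeholder endpoint
PAYLOAD = {"text": "example input"}
TARGET_RPS = 10
DURATION_S = 30

def one_request():
    start = time.perf_counter()
    ok = False
    try:
        ok = requests.post(URL, json=PAYLOAD, timeout=5).ok
    except requests.RequestException:
        pass
    return (time.perf_counter() - start) * 1000.0, ok

latencies, failures = [], 0
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    futures = []
    for _ in range(TARGET_RPS * DURATION_S):
        futures.append(pool.submit(one_request))
        time.sleep(1.0 / TARGET_RPS)  # pace submissions at the target RPS
    for f in futures:
        ms, ok = f.result()
        latencies.append(ms)
        failures += 0 if ok else 1

q = statistics.quantiles(latencies, n=100)  # 99 cut points: q[49] is P50, etc.
print(f"requests: {len(latencies)}, failure rate: {100 * failures / len(latencies):.1f}%")
print(f"avg {statistics.mean(latencies):.1f} ms, min {min(latencies):.1f}, max {max(latencies):.1f}")
print(f"P50 {q[49]:.1f}  P75 {q[74]:.1f}  P95 {q[94]:.1f}  P99 {q[98]:.1f}")
```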
#### Example Performance Metrics
Below are example latency metrics for models tested on two device types: `Singularity.ND12am_A100_v4` and `Singularity.NC4as_T4_v3`. These numbers are for demonstration purposes only.

**Results on `Singularity.ND12am_A100_v4`**

| Target RPS | 10 | 20 | 30 | 40 |
|---|---|---|---|---|
| Number of Requests | 3000 | 6000 | 9000 | 12000 |
| Total RPS | 9.8 | 19.8 | 29.6 | 38.8 |
| Failure Rate (%) | 0 | 0 | 0 | 20 |
| Average Latency (ms) | 18.5 | 25.3 | 32.4 | 228.5 |
| Max Latency (ms) | 23.4 | 32.5 | 37.4 | 653.3 |
| Min Latency (ms) | 17.2 | 23.2 | 28.4 | 47.5 |
| P50 Latency (ms) | 19.6 | 26.7 | 34.5 | 245.2 |
| P75 Latency (ms) | 21.6 | 28.4 | 35.8 | 275.4 |
| P95 Latency (ms) | 22.8 | 30.2 | 36.4 | 324.8 |
| P99 Latency (ms) | 23.2 | 31.9 | 37.1 | 624.7 |
| Average GPU Count | 1 | 1 | 1 | 1 |
| GPU Utilization (%) | 75 | 74 | 78 | 73 |
| GPU Memory (GB) | 12 | 12 | 12 | 12 |
| Max GPU Utilization (%) | 99 | 99 | 99 | 99 |
| Average CPU Count | 16 | 16 | 16 | 16 |
| Average Logical CPU Count | 16 | 16 | 16 | 16 |
| CPU Utilization (%) | 80 | 80 | 80 | 80 |
| Virtual Memory Used (GB) | 16 | 16 | 16 | 16 |
| Max CPU Utilization (%) | 100 | 100 | 100 | 100 |
| Max Virtual Memory Used (GB) | 20 | 20 | 20 | 20 |
**Results on `Singularity.NC4as_T4_v3`**

| Target RPS | 10 | 20 | 30 | 40 |
|---|---|---|---|---|
| Number of Requests | 3000 | 6000 | 9000 | 12000 |
| Total RPS | 9.8 | 19.8 | 29.6 | 38.8 |
| Failure Rate (%) | 0 | 0 | 30 | 50 |
| Average Latency (ms) | 36.5 | 50.3 | 364.4 | 648.5 |
| Max Latency (ms) | 46.4 | 32.5 | 37.4 | 653.3 |
| Min Latency (ms) | 17.2 | 23.2 | 28.4 | 47.5 |
| P50 Latency (ms) | 19.6 | 26.7 | 34.5 | 245.2 |
| P75 Latency (ms) | 21.6 | 28.4 | 35.8 | 275.4 |
| P95 Latency (ms) | 22.8 | 30.2 | 36.4 | 324.8 |
| P99 Latency (ms) | 23.2 | 31.9 | 37.1 | 624.7 |
| Average GPU Count | 1 | 1 | 1 | 1 |
| GPU Utilization (%) | 75 | 74 | 78 | 73 |
| GPU Memory (GB) | 12 | 12 | 12 | 12 |
| Max GPU Utilization (%) | 99 | 99 | 99 | 99 |
| Average CPU Count | 16 | 16 | 16 | 16 |
| Average Logical CPU Count | 16 | 16 | 16 | 16 |
| CPU Utilization (%) | 80 | 80 | 80 | 80 |
| Virtual Memory Used (GB) | 16 | 16 | 16 | 16 |
| Max CPU Utilization (%) | 100 | 100 | 100 | 100 |
| Max Virtual Memory Used (GB) | 20 | 20 | 20 | 20 |
Assume the acceptable P95 latency is 35 ms and the target RPS in East US is 200. Each `Singularity.ND12am_A100_v4` instance can then serve 20 requests per second (the highest tested RPS at which its P95 latency stays under 35 ms), so 10 GPUs are required in East US.
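The same arithmetic can be written out directly. This sketch re-derives the instance count from the A100 table above under the stated assumptions (35 ms P95 budget, 200 RPS target in East US):

```python
# Re-derive the East US capacity estimate from the A100 results above.
p95_by_target_rps = {10: 22.8, 20: 30.2, 30: 36.4, 40: 324.8}  # ms, from the table
p95_budget_ms = 35.0
target_rps_east_us = 200

# Highest tested RPS whose P95 latency stays within the budget.
max_rps_per_instance = max(
    rps for rps, p95 in p95_by_target_rps.items() if p95 <= p95_budget_ms
)

# Round up, since partial instances cannot be provisioned.
instances_needed = -(-target_rps_east_us // max_rps_per_instance)
print(max_rps_per_instance, instances_needed)  # 20 RPS/instance -> 10 instances
```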
### Confirm Resource Capacity
The number of SKU instances must be calculated for each region. Confirm with the Capacity PoC, Megan Baker, that enough resources are available. If there is not enough quota, additional resources must be requested, which may take several weeks.
### Model Registration
The model is registered through the Model Registration API. If this is the first time you are registering your model, contact the Hosting PoC Placeholder for authorization. Detailed information about the API can be found in Model Registration API.
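Purely as a hypothetical sketch, a registration call might look like the snippet below. The endpoint URL, field names, and authentication header here are placeholders, not the real API; the actual request shape is defined in the Model Registration API documentation.

```python
# Hypothetical registration call; URL, fields, and auth are placeholders.
import requests

REGISTRATION_URL = "https://example.invalid/api/models"  # placeholder endpoint

payload = {
    "name": "my-safety-model",
    "version": "1.0.0",
    "image": "myregistry.azurecr.io/my-safety-model:1.0.0",
    "sku": "Singularity.ND12am_A100_v4",
}

resp = requests.post(
    REGISTRATION_URL,
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # obtained via the Hosting PoC
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```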
### Inform the Hosting PoC to Start Rollout
Inform the Hosting PoC Placeholder to start the rollout process, which is managed by the RAI OPS Platform. You will be updated on the rollout progress. For the Regular Tier, rollout is triggered automatically and no action is required from the customer.
## Need Help?
For assistance, contact us at placeholder@placeholder.