# Model Hosting Platform - Model Onboarding Guide
The RAI Model Hosting Platform is designed to streamline AI model deployment, improve performance efficiency, and provide a seamless experience for model owners solving AI safety challenges. It empowers AI model owners with scalable, efficient, and reliable hosting: models hosted on the platform integrate with robust security infrastructure to provide advanced protection for leading LLMs.
## Value Proposition
### Target Users
- 👩‍💻 **AI Model Owners** – Researchers and engineers building AI models for safety.
### Customer Pain Points vs. Platform Value Proposition
| Model Onboarding Lifecycle Step | Pain Points | Our Value Proposition |
|---|---|---|
| 1. Build the Model Image | 🚨 Slow Image Build Time – Lack of standardization leads to long build times, delaying model deployment and iteration. | ✅ Standardized & Accelerated Model Deployment – Reduces image build time with a standardized format, enabling faster and more efficient deployment. |
| 2. Collect Performance Metrics | 🚨 Inconsistent Performance Metrics – No unified benchmarking methodology, making it difficult to assess model efficiency and estimate capacity needs. | ✅ Unified Performance Benchmarks – Provides accurate AI model metrics using a unified benchmarking approach, improving efficiency and scalability. |
| 3. Manage Capacity | 🚨 Over-Provisioning – Wastes compute power and increases costs. 🚨 Under-Provisioning – Causes latency issues, slow response times, and downtime. | ✅ Cost-Efficient Resource Allocation – Dynamically optimizes resource usage to reduce waste while maintaining performance. |
| 4. Register the Model | 🚨 Lack of Model and Dedicated GPU Utilization Reports – No visibility into model and GPU capacity utilization for allocated resources. | |
| 5. Model Roll Out | 🚨 Slow & Inefficient Model Deployment – Long build times and lack of standardization slow down iteration cycles. | ✅ Seamless Deployment & Real-Time Monitoring – Automates deployment, reducing operational overhead, and strengthens security and compliance through customer isolation. ✅ Model Multi-Version Support – Enables parallel execution of multiple model versions and handles contract changes efficiently. |
## Overview
This guide outlines the essential steps for onboarding your model to the RAI Platform. We offer two onboarding options to meet diverse customer needs: the Regular Tier and the Fast-Tracked Tier. The Regular Tier may take up to 7 days to complete but requires minimal effort from the customer; it is recommended for the initial onboarding of a model. The Fast-Tracked Tier completes within 24 hours but requires more involvement from the customer; it is ideal for model refreshes, including updates to existing models, hotfixes, and configuration changes. Both options assume adequate capacity is available. The following checklist outlines the tasks for each tier.
| # | Task | Regular Tier | Fast-Tracked Tier |
|---|---|---|---|
| 1 | Prepare required information about the model | | |
| 2 | Build the model image | | |
| 3 | Prepare model performance metrics | | |
| 4 | Confirm model capacity with the Capacity PoC | | |
| 5 | Register the model through the Model Registration API | | |
| 6 | Inform the Hosting PoC to start rollout | | |
## Task Description
This section describes each task in detail, covering its purpose, steps, and requirements.
### Prepare Required Model Information
- Model Information
  - Model name and version
  - Harm categories and model modality
  - Input and output schemas. The content type must be JSON, and schemas may not change for the Fast-Tracked Tier (see the example after this list).
  - Target SKU (e.g., CPU/GPU type, GPU microarchitecture such as Volta, Ampere, or Hopper). If unspecified, the SKU is determined during performance evaluation for a new model; for model refreshes, the SKU remains the same.
- Service Capability
  - Maximum payload size
  - Acceptable latency statistics
  - Expected requests per second (RPS) per region/cloud
Fill out the form once you have the above information ready.
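For illustration only, here is a minimal sketch of what a JSON input/output schema pair might look like for a hypothetical text-classification safety model. The field names (`text`, `harm_category`, `severity`) are assumptions, not platform requirements; your model defines its own contract.

```python
import json

# Hypothetical input schema for a text-classification safety model.
# Field names and limits are illustrative assumptions.
input_schema = {
    "type": "object",
    "properties": {
        "text": {"type": "string", "maxLength": 10000},
    },
    "required": ["text"],
}

# Hypothetical output schema: a harm category plus a severity score.
output_schema = {
    "type": "object",
    "properties": {
        "harm_category": {"type": "string"},
        "severity": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["harm_category", "severity"],
}

print(json.dumps(input_schema, indent=2))
print(json.dumps(output_schema, indent=2))
```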
### Build the Model Image
The RAI Platform serves models using Docker containers on Azure Machine Learning with Singularity as the backend.
#### Requirements for the Docker Image
- Must be built on a Linux system.
- Must respond to requests on the liveness, readiness, and scoring routes (see the sketch below).
- Must have an entrypoint of `['runsvdir', '/var/runit']`.
For a step-by-step guide, refer to the Tutorial page.
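As a minimal sketch of the routing requirement, the Flask app below exposes liveness, readiness, and scoring routes. The route paths used here (`/livez`, `/readyz`, `/score`) and the port are assumptions for illustration; use the routes and server setup specified in the Tutorial page.

```python
# Minimal sketch of a container web server answering the three required
# routes. Paths and port are illustrative assumptions, not the platform spec.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/livez")
def liveness():
    # Return 200 as soon as the process is up.
    return jsonify(status="alive")

@app.route("/readyz")
def readiness():
    # Return 200 only once the model is loaded and able to serve traffic.
    return jsonify(status="ready")

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json(force=True)
    # Placeholder scoring logic; replace with real model inference.
    text = payload.get("text", "")
    return jsonify(harm_category="none", severity=0.0, characters=len(text))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```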
The image should be hosted in Azure Container Registry. Please make sure that the `AcrPull` role on the container registry is assigned to the following two service principals:
- AICPRuntime
- RAIServiceOpsPlatformADO
### Prepare Model Performance Metrics
Performance metrics are required to determine model capacity and deployment. The following data should be collected:
- SKU – The computing instance type used to run the model.
- Payload Size – Size of test samples (e.g., image dimensions, text length in characters).
- Latency – Latency statistics under target RPS (max, min, average, P50, P75, P90, P95, P99).
- Resource Utilization – CPU, GPU, memory, and GPU memory usage during load tests. Optional where resource utilization is not applicable.
Ensure the SKU matches those supported by the RAI Platform. If needed, contact the model hosting PoC with the subject line "Performance Evaluation" for assistance in collecting metrics.
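If you collect the metrics yourself, a simple load driver like the sketch below can produce the required latency statistics. The endpoint URL and payload are placeholders, and a real evaluation would typically use a dedicated load-testing tool at each target RPS.

```python
# Minimal load-test sketch: fire requests at a fixed target RPS and
# report the latency statistics the platform asks for.
# The URL and payload below are placeholders.
import time
import statistics
import concurrent.futures

import requests

URL = "http://localhost:8080/score"   # placeholder endpoint
PAYLOAD = {"text": "example input"}
TARGET_RPS = 10
DURATION_S = 30

def one_request():
    start = time.perf_counter()
    ok = False
    try:
        ok = requests.post(URL, json=PAYLOAD, timeout=5).ok
    except requests.RequestException:
        pass
    return (time.perf_counter() - start) * 1000.0, ok

latencies, failures = [], 0
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    futures = []
    for _ in range(TARGET_RPS * DURATION_S):
        futures.append(pool.submit(one_request))
        time.sleep(1.0 / TARGET_RPS)  # pace submissions at the target RPS
    for f in futures:
        ms, ok = f.result()
        latencies.append(ms)
        failures += 0 if ok else 1

q = statistics.quantiles(latencies, n=100)  # 99 cut points: q[49] is P50, etc.
print(f"requests: {len(latencies)}, failure rate: {100 * failures / len(latencies):.1f}%")
print(f"avg {statistics.mean(latencies):.1f} ms, min {min(latencies):.1f}, max {max(latencies):.1f}")
print(f"P50 {q[49]:.1f}  P75 {q[74]:.1f}  P95 {q[94]:.1f}  P99 {q[98]:.1f}")
```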
#### Example Performance Metrics
Below are example latency metrics for models tested on two device types: `Singularity.ND12am_A100_v4` and `Singularity.NC4as_T4_v3`. These numbers are for demonstration purposes only.

**Results on `Singularity.ND12am_A100_v4`**

| Target RPS | 10 | 20 | 30 | 40 |
|---|---|---|---|---|
| Number of Requests | 3000 | 6000 | 9000 | 12000 |
| Total RPS | 9.8 | 19.8 | 29.6 | 38.8 |
| Failure Rate (%) | 0 | 0 | 0 | 20 |
| Average Latency (ms) | 18.5 | 25.3 | 32.4 | 228.5 |
| Max Latency (ms) | 23.4 | 32.5 | 37.4 | 653.3 |
| Min Latency (ms) | 17.2 | 23.2 | 28.4 | 47.5 |
| P50 Latency (ms) | 19.6 | 26.7 | 34.5 | 245.2 |
| P75 Latency (ms) | 21.6 | 28.4 | 35.8 | 275.4 |
| P95 Latency (ms) | 22.8 | 30.2 | 36.4 | 324.8 |
| P99 Latency (ms) | 23.2 | 31.9 | 37.1 | 624.7 |
| Average GPU Count | 1 | 1 | 1 | 1 |
| GPU Utilization (%) | 75 | 74 | 78 | 73 |
| GPU Memory (GB) | 12 | 12 | 12 | 12 |
| Max GPU Utilization (%) | 99 | 99 | 99 | 99 |
| Average CPU Count | 16 | 16 | 16 | 16 |
| Average Logical CPU Count | 16 | 16 | 16 | 16 |
| CPU Utilization (%) | 80 | 80 | 80 | 80 |
| Virtual Memory Used (GB) | 16 | 16 | 16 | 16 |
| Max CPU Utilization (%) | 100 | 100 | 100 | 100 |
| Max Virtual Memory Used (GB) | 20 | 20 | 20 | 20 |
**Results on `Singularity.NC4as_T4_v3`**

| Target RPS | 10 | 20 | 30 | 40 |
|---|---|---|---|---|
| Number of Requests | 3000 | 6000 | 9000 | 12000 |
| Total RPS | 9.8 | 19.8 | 29.6 | 38.8 |
| Failure Rate (%) | 0 | 0 | 30 | 50 |
| Average Latency (ms) | 36.5 | 50.3 | 364.4 | 648.5 |
| Max Latency (ms) | 46.4 | 32.5 | 37.4 | 653.3 |
| Min Latency (ms) | 17.2 | 23.2 | 28.4 | 47.5 |
| P50 Latency (ms) | 19.6 | 26.7 | 34.5 | 245.2 |
| P75 Latency (ms) | 21.6 | 28.4 | 35.8 | 275.4 |
| P95 Latency (ms) | 22.8 | 30.2 | 36.4 | 324.8 |
| P99 Latency (ms) | 23.2 | 31.9 | 37.1 | 624.7 |
| Average GPU Count | 1 | 1 | 1 | 1 |
| GPU Utilization (%) | 75 | 74 | 78 | 73 |
| GPU Memory (GB) | 12 | 12 | 12 | 12 |
| Max GPU Utilization (%) | 99 | 99 | 99 | 99 |
| Average CPU Count | 16 | 16 | 16 | 16 |
| Average Logical CPU Count | 16 | 16 | 16 | 16 |
| CPU Utilization (%) | 80 | 80 | 80 | 80 |
| Virtual Memory Used (GB) | 16 | 16 | 16 | 16 |
| Max CPU Utilization (%) | 100 | 100 | 100 | 100 |
| Max Virtual Memory Used (GB) | 20 | 20 | 20 | 20 |
Assume the acceptable P95 latency is 35 ms and the target RPS in East US is 200. Each `Singularity.ND12am_A100_v4` instance can then serve 20 requests per second (the highest tested RPS at which its P95 latency stays under 35 ms), so 10 GPUs are required in East US.
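The same arithmetic can be written out directly. This sketch re-derives the instance count from the A100 table above under the stated assumptions (35 ms P95 budget, 200 RPS target in East US):

```python
# Re-derive the East US capacity estimate from the A100 results above.
p95_by_target_rps = {10: 22.8, 20: 30.2, 30: 36.4, 40: 324.8}  # ms, from the table
p95_budget_ms = 35.0
target_rps_east_us = 200

# Highest tested RPS whose P95 latency stays within the budget.
max_rps_per_instance = max(
    rps for rps, p95 in p95_by_target_rps.items() if p95 <= p95_budget_ms
)

# Round up, since partial instances cannot be provisioned.
instances_needed = -(-target_rps_east_us // max_rps_per_instance)
print(max_rps_per_instance, instances_needed)  # 20 RPS/instance -> 10 instances
```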
### Confirm Resource Capacity
The number of SKU instances must be calculated for each region. Confirm with the Capacity PoC, Megan Baker, that enough resources are available. If there is not enough quota, additional resources must be requested, which may take several weeks.
### Model Registration
The model is registered through the Model Registration API. If this is the first time you are registering your model, contact the Hosting PoC Placeholder for authorization. Detailed information about the API can be found in Model Registration API.
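Purely as a hypothetical sketch, a registration call might look like the snippet below. The endpoint URL, field names, and authentication header here are placeholders, not the real API; the actual request shape is defined in the Model Registration API documentation.

```python
# Hypothetical registration call; URL, fields, and auth are placeholders.
import requests

REGISTRATION_URL = "https://example.invalid/api/models"  # placeholder endpoint

payload = {
    "name": "my-safety-model",
    "version": "1.0.0",
    "image": "myregistry.azurecr.io/my-safety-model:1.0.0",
    "sku": "Singularity.ND12am_A100_v4",
}

resp = requests.post(
    REGISTRATION_URL,
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # obtained via the Hosting PoC
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```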
### Inform the Hosting PoC to Start Rollout
Inform the Hosting PoC Placeholder to start the rollout process, which is managed by the RAI OPS Platform. You will be updated on the rollout progress. For the Regular Tier, rollout is triggered automatically and no action is required from the customer.
## Need Help?
For assistance, contact us at placeholder@placeholder.