Dedicated Deployments provide exclusive GPU resources for your models, ensuring predictable performance and reliability for production workloads. This feature is designed for customers who require:
  • Consistent, low-latency inference
  • Guaranteed resource availability
  • Scalable infrastructure for enterprise applications
With Dedicated Deployments, your models run on reserved hardware, eliminating resource contention and enabling robust service-level agreements (SLAs).

Why Choose Dedicated Deployments?

When using shared inference, workloads compete for the same GPUs, leading to variability in speed and availability. Dedicated Deployments solve this by assigning hardware exclusively to your model.

Replica Count (Horizontal Scaling)

Description: Configure how your deployment scales horizontally:
  • Start by setting the minimum replica count (--min_replicas). This determines the baseline number of replicas your deployment will always maintain.
  • If you also specify a maximum replica count (--max_replicas), autoscaling is automatically enabled. The system will scale the number of replicas between your minimum and maximum values based on demand.
  • If you do not specify a maximum replica count, only the minimum replica count is used, and the deployment will always run with that fixed number of replicas.
Example: Without autoscaling:
--min_replicas 1
With autoscaling:
--min_replicas 1 --max_replicas 3
Details: More replicas handle higher request volumes. Max replicas is only configurable when autoscaling is enabled (by specifying --max_replicas). Scaling is automatic if autoscaling is enabled; otherwise, replica count is fixed.

Autoscaling

Description: Autoscaling is automatically enabled when you specify a maximum replica count (--max_replicas). The system will scale the number of replicas between your minimum and maximum values based on demand, optimizing resource usage and cost. How to Enable:
  • Set both --min_replicas and --max_replicas when creating your deployment.
Details:
  • Autoscaling responds to traffic spikes and reduces idle resources.
  • Configure thresholds for scaling events as needed.
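The replica rules above amount to a simple clamp: with no maximum set, the count is fixed at the minimum; with a maximum set, the target is bounded between the two. The sketch below illustrates that behavior only and is not the platform's actual scaler logic.

```python
from typing import Optional

def target_replicas(demand: int, min_replicas: int,
                    max_replicas: Optional[int]) -> int:
    """Illustrative sketch of the replica-count rules.

    demand is the number of replicas the current traffic would need.
    Without --max_replicas, autoscaling is off and the count stays
    fixed at min_replicas; otherwise it is clamped to [min, max].
    """
    if max_replicas is None:
        return min_replicas  # fixed-size deployment
    return max(min_replicas, min(demand, max_replicas))
```

For example, with `--min_replicas 1 --max_replicas 3`, a demand of 10 still yields 3 replicas, and a demand of 0 still keeps 1 replica running.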

Accelerator Count (Vertical Scaling)

Description: Assign multiple GPUs per replica to increase throughput for large models or batch processing. Example:
--gpu_count 2
Details:
  • Supported values: 1, 2, 4, or 8.
  • Higher counts are ideal for compute-intensive tasks.
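Because only a few per-replica GPU counts are accepted, a client-side check can catch an invalid `--gpu_count` before you submit a deployment. The helper below is a hypothetical convenience, not part of the gravixlayer SDK.

```python
# Allowed per-replica GPU counts, as listed in the docs above.
ALLOWED_GPU_COUNTS = {1, 2, 4, 8}

def validate_gpu_count(n: int) -> int:
    """Raise early if n is not a supported --gpu_count value."""
    if n not in ALLOWED_GPU_COUNTS:
        raise ValueError(
            f"gpu_count must be one of {sorted(ALLOWED_GPU_COUNTS)}, got {n}"
        )
    return n
```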

Model Selection

Description: Choose your model from the supported list to deploy. The model determines the capabilities and use cases of your deployment. Example:
--model_name "qwen3-4b-instruct-2507"
Details:
  • Refer to the documentation for available models.
  • Select based on your application’s requirements.

Deployments via the UI

Initial Deployment Setup

The Create New Deployment panel allows you to select the model, hardware type, accelerator, and number of replicas. You can also enable autoscaling and review the estimated hourly price before creating your deployment.

Creating Deployment

Once you submit the deployment, the status will show as Creating. The system is provisioning dedicated hardware and initializing your model. You can monitor progress here until the deployment is ready.

Created Deployment

When the deployment is complete, the status will update to Running. Your model is now live and ready to serve requests. You can view details, manage replicas, or delete the deployment from this page.

Deployments via the SDK

Explore Available Hardware

List available GPUs:
gravixlayer deployments gpu --list
This command displays GPU types, memory sizes, and availability, enabling you to select the optimal hardware for your workload.

Create a Deployment

Reserve hardware and deploy your model:
gravixlayer deployments create \
  --deployment_name "production_model" \
  --model_name "qwen3-4b-instruct-2507" \
  --gpu_model "NVIDIA_T4_16GB" \
  --gpu_count 2 \
  --min_replicas 1 \
  --max_replicas 3 \
  --hw_type "dedicated" \
  --wait

Troubleshooting & Additional Resources

If you encounter issues with your deployment, consider the following:
  • Deployment not ready: Wait 5-10 minutes for initialization, and ensure the deployment status is Running before making requests.
  • Slow responses: Review hardware capacity and consider adding replicas.
  • Authentication errors: Ensure your API key is correctly set in environment variables.
  • Unexpected costs: Remove unused deployments promptly.
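For the authentication bullet above, failing fast when the key is missing can save a confusing round-trip to the API. The environment-variable name below is an assumption for illustration; check your account or SDK settings for the exact name your client expects.

```python
import os

def require_api_key(var: str = "GRAVIXLAYER_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    The variable name is a hypothetical default for this sketch.
    """
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before making requests")
    return key
```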
For more usage details and advanced options, see the Querying Dedicated Deployments SDK page.