Deployments provide exclusive GPU resources for your models, ensuring predictable performance and reliability for production workloads. This feature is designed for customers who require consistent, low-latency inference, guaranteed resource availability, and scalable infrastructure for enterprise applications. With Deployments, your models run on reserved hardware, eliminating resource contention and enabling robust service-level agreements (SLAs).

Key Advantages of Deployments

When using shared inference, workloads compete for the same GPUs, leading to variability in speed and availability. Deployments address this by assigning hardware exclusively to your model.
  • Guaranteed Performance: Eliminates “noisy neighbor” effects, ensuring consistent latency and throughput.
  • Complete Isolation: Workloads run in secure, reserved environments.
  • Elastic Scalability: Add replicas or upgrade hardware to handle traffic increases.
  • Enterprise Reliability: Meets SLAs for production services.
  • Cost Control: Pay only for the reserved capacity you choose.

Deployment Configuration Options

Gravix Layer Deployments offer flexible configurations to match your workload and performance needs. The following options can be customized:

1. Accelerator

Description: Select the GPU type for your deployment, such as NVIDIA_T4_16GB. The choice of accelerator impacts model performance and cost. Example:
--gpu_model "NVIDIA_T4_16GB"
Details:
  • Different accelerators offer varying levels of compute power and memory.
  • Selection should be based on your model’s requirements and budget.

2. Hardware Type

Description: Choose "Dedicated" hardware for exclusive access to resources, ensuring consistent performance and security. Example:
--hw_type "dedicated"
Details:
  • Dedicated hardware is not shared with other users.
  • Recommended for production workloads and sensitive data.

3. Replica Count (Horizontal Scaling)

Description: Configure the horizontal scaling of your deployment:
  • Set the minimum replica count (--min_replicas) to determine the baseline number of replicas.
  • If a maximum replica count (--max_replicas) is also specified, autoscaling is enabled, and the system will scale the number of replicas between the minimum and maximum values based on demand.
  • If no maximum is set, the deployment will operate with a fixed number of replicas.
Example (Without Autoscaling):
--min_replicas 1
Example (With Autoscaling):
--min_replicas 1 --max_replicas 3

4. Autoscaling

Description: Autoscaling is automatically enabled when you specify a maximum replica count (--max_replicas). The system adjusts the number of replicas based on demand, optimizing resource usage and cost. How to Enable:
  • Set both --min_replicas and --max_replicas when creating your deployment.
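The actual scaling policy is internal to the platform, but the bounds described above can be sketched in a few lines. The helper below is purely illustrative (not part of any Gravix Layer SDK) and assumes a demand-driven replica target that is clamped between the configured --min_replicas and --max_replicas values:

```python
# Illustrative sketch of the autoscaling bounds described above.
# target_replicas is a hypothetical helper, not a Gravix Layer API:
# it only shows how a demand-driven replica count stays within the
# configured minimum and maximum.

def target_replicas(desired: int, min_replicas: int, max_replicas: int) -> int:
    """Clamp a demand-driven replica count to the configured bounds."""
    return max(min_replicas, min(desired, max_replicas))

# With --min_replicas 1 --max_replicas 3:
print(target_replicas(0, 1, 3))  # 1  (never scales below the minimum)
print(target_replicas(2, 1, 3))  # 2  (demand within bounds is honored)
print(target_replicas(9, 1, 3))  # 3  (capped at the maximum)
```

If --max_replicas is omitted, there is no upper bound to scale toward, which is why the deployment then runs with a fixed replica count.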

5. Accelerator Count (Vertical Scaling)

Description: Assign multiple GPUs per replica to increase throughput for large models or batch processing. Example:
--gpu_count 2
Details:
  • Supported values: 1, 2, 4, or 8.
  • Higher counts are ideal for compute-intensive tasks.
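Because only certain values are accepted, it can help to validate --gpu_count on the client side before issuing a create command. The helper below is hypothetical (not part of the Gravix Layer SDK) and simply encodes the supported values listed above:

```python
# Hypothetical pre-flight check for --gpu_count, based on the
# supported values listed above (1, 2, 4, or 8). Not a Gravix Layer
# SDK function; shown only to make the constraint explicit.

SUPPORTED_GPU_COUNTS = {1, 2, 4, 8}

def validate_gpu_count(count: int) -> int:
    """Return count unchanged if supported, else raise ValueError."""
    if count not in SUPPORTED_GPU_COUNTS:
        raise ValueError(
            f"--gpu_count must be one of {sorted(SUPPORTED_GPU_COUNTS)}, got {count}"
        )
    return count

validate_gpu_count(2)  # OK: 2 is a supported value
```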

6. Model Selection

Description: Choose a model from the supported list to deploy. Example:
--model_name "qwen3-4b-instruct-2507"

Managing Deployments

Deployments via the UI

  1. Initial Deployment Setup: The “Create New Deployment” panel allows you to select the model, hardware type, accelerator, and number of replicas. You can also enable autoscaling and review the estimated hourly price. (Screenshot: Initial Deployment Setup UI)
  2. Creating Deployment: After submission, the status will show as Creating while the system provisions hardware and initializes the model. (Screenshot: Creating Deployment)
  3. Deployment Running: When the status updates to Running, the model is live and ready to serve requests. (Screenshot: Created Deployment)

Deployments using the SDK

  1. Explore Available Hardware: List available GPUs to select the optimal hardware for your workload.
    gravixlayer deployments gpu --list
    
  2. Create a Deployment: Reserve hardware and deploy your model with the following command.
    gravixlayer deployments create \
      --deployment_name "production_model" \
      --model_name "qwen3-4b-instruct-2507" \
      --gpu_model "NVIDIA_T4_16GB" \
      --gpu_count 2 \
      --min_replicas 1 \
      --max_replicas 3 \
      --hw_type "dedicated" \
      --wait
    

Troubleshooting and Additional Resources

  • Initialization Time: Allow 5-10 minutes for the deployment status to become Running before making requests.
  • Slow Responses: Review hardware capacity and consider increasing the replica count or the number of GPUs per replica.
  • Authentication Errors: Verify that your API key is correctly set in your environment variables.
  • Unexpected Costs: Remove unused deployments promptly to avoid unnecessary charges.
For more detailed information, refer to the Querying Dedicated Deployments SDK page.