Key Advantages of Deployments
When using shared inference, workloads compete for the same GPUs, leading to variability in speed and availability. Deployments address this by assigning hardware exclusively to your model.

- Guaranteed Performance: Eliminates “noisy neighbor” effects, ensuring consistent latency and throughput.
- Complete Isolation: Workloads run in secure, reserved environments.
- Elastic Scalability: Add replicas or upgrade hardware to handle traffic increases.
- Enterprise Reliability: Meets SLAs for production services.
- Cost Control: Pay only for the reserved capacity you choose.
Deployment Configuration Options
Gravix Layer Deployments offer flexible configurations to match your workload and performance needs. The following options can be customized:

1. Accelerator
Description: Select the GPU type for your deployment, such as NVIDIA_T4_16GB. The choice of accelerator impacts model performance and cost.
Example:
- Different accelerators offer varying levels of compute power and memory.
- Selection should be based on your model’s requirements and budget.
2. Hardware Type
Description: Choose "Dedicated" hardware for exclusive access to resources, ensuring consistent performance and security.
Example:
- Dedicated hardware is not shared with other users.
- Recommended for production workloads and sensitive data.
3. Replica Count (Horizontal Scaling)
Description: Configure the horizontal scaling of your deployment:

- Set the minimum replica count (--min_replicas) to determine the baseline number of replicas.
- If a maximum replica count (--max_replicas) is also specified, autoscaling is enabled, and the system will scale the number of replicas between the minimum and maximum values based on demand.
- If no maximum is set, the deployment will operate with a fixed number of replicas.
4. Autoscaling
Description: Autoscaling is automatically enabled when you specify a maximum replica count (--max_replicas). The system adjusts the number of replicas based on demand, optimizing resource usage and cost.
How to Enable:
- Set both --min_replicas and --max_replicas when creating your deployment.
5. Accelerator Count (Vertical Scaling)
Description: Assign multiple GPUs per replica to increase throughput for large models or batch processing.

Example:
- Supported values: 1, 2, 4, or 8.
- Higher counts are ideal for compute-intensive tasks.
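Because only the listed counts are offered, it is worth rejecting other values before submitting a deployment. A small guard function (the helper name is illustrative, not part of the Gravix Layer SDK):

```python
SUPPORTED_ACCELERATOR_COUNTS = {1, 2, 4, 8}  # per the supported values above

def validate_accelerator_count(count: int) -> int:
    """Reject GPU-per-replica counts the platform does not offer."""
    if count not in SUPPORTED_ACCELERATOR_COUNTS:
        raise ValueError(
            f"accelerator count must be one of {sorted(SUPPORTED_ACCELERATOR_COUNTS)}, got {count}"
        )
    return count
```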
6. Model Selection
Description: Choose a model from the supported list to deploy.

Managing Deployments
Deployments via the UI
- Initial Deployment Setup: The “Create New Deployment” panel allows you to select the model, hardware type, accelerator, and number of replicas. You can also enable autoscaling and review the estimated hourly price.
- Creating Deployment: After submission, the status will show as Creating while the system provisions hardware and initializes the model.
- Deployment Running: When the status updates to Running, the model is live and ready to serve requests.

Deployments using the SDK
- Explore Available Hardware: List available GPUs to select the optimal hardware for your workload.
- Create a Deployment: Reserve hardware and deploy your model with the SDK's deployment creation command.
Troubleshooting and Additional Resources
- Initialization Time: Allow 5-10 minutes for the deployment status to become “running” before making requests.
- Slow Responses: Review hardware capacity and consider increasing the number of replicas.
- Authentication Errors: Verify that your API key is correctly set in your environment variables.
- Unexpected Costs: Remove unused deployments promptly to avoid unnecessary charges.
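Since initialization can take 5-10 minutes, client code should poll the deployment status rather than fire requests immediately. A minimal polling sketch, where `fetch_status` is a stand-in callable for however your client reads deployment status (not a real Gravix Layer function):

```python
import time

def wait_until_running(fetch_status, timeout_s: float = 600, poll_s: float = 15) -> str:
    """Poll a status callable until it reports "running" or the timeout expires.

    `fetch_status` is a stand-in for your client's status lookup;
    deployments typically need 5-10 minutes to initialize.
    """
    status = None
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status == "running":
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"deployment not running after {timeout_s}s (last status: {status!r})")
```

Using `time.monotonic()` keeps the deadline immune to wall-clock adjustments during the wait.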

