Vision-language models (VLMs) are advanced AI systems capable of processing and understanding both text and image data within a single request. This dual-modal capability allows you to perform a wide range of tasks, such as asking detailed questions about visual content or generating descriptive captions for images.

Querying Models with an Image URL

You can provide images to vision models by referencing a publicly accessible HTTP(S) URL. The example below uses cURL; equivalent requests can be made with the OpenAI-compatible Python and JavaScript clients or the Gravix SDKs.
curl -X POST https://api.gravixlayer.com/v1/inference/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GRAVIXLAYER_API_KEY" \
    -d '{
        "model": "google/gemma-3-12b-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Can you describe this image?" },
                    { 
                        "type": "image_url", 
                        "image_url": { 
                            "url": "https://images.unsplash.com/photo-1720884413532-59289875c3e1?q=80&w=3024&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" 
                        } 
                    }
                ]
            }
        ]
    }'
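The same request can be sketched in Python using only the standard library. The endpoint, model name, and message structure below are copied from the cURL example; the helper names are illustrative, not part of any SDK.

```python
import json
import os
import urllib.request

API_URL = "https://api.gravixlayer.com/v1/inference/chat/completions"


def build_payload(question: str, image_url: str) -> dict:
    """Mirror the JSON body of the cURL example above."""
    return {
        "model": "google/gemma-3-12b-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def describe_image(image_url: str) -> str:
    """POST the payload; requires GRAVIXLAYER_API_KEY in the environment."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(
            build_payload("Can you describe this image?", image_url)
        ).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['GRAVIXLAYER_API_KEY']}",
        },
    )
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    # Standard chat-completions shape: first choice, assistant message text
    return body["choices"][0]["message"]["content"]
```

Separating payload construction from the HTTP call keeps the message format easy to inspect and reuse with any OpenAI-compatible client.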

Querying Models with a Base64 Encoded Image

You can also provide images by embedding them directly in the request payload as a Base64-encoded data URL. This is useful for local files or images that are not publicly accessible. As above, the example uses cURL; the same request can be made with the OpenAI-compatible Python and JavaScript clients or the Gravix SDKs.
curl -X POST https://api.gravixlayer.com/v1/inference/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GRAVIXLAYER_API_KEY" \
    -d '{
        "model": "google/gemma-3-12b-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "Can you describe this image?" },
                    { 
                        "type": "image_url", 
                        "image_url": { 
                            "url": "data:image/jpeg;base64,{base64_encoded_image}"
                        } 
                    }
                ]
            }
        ]
    }'
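To fill in the `{base64_encoded_image}` placeholder from the cURL body, encode the image bytes and prepend the appropriate `data:` prefix. A minimal helper in Python (the function name is illustrative):

```python
import base64
from pathlib import Path


def image_to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local image file as a data URL suitable for the
    "image_url" field of the request body."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string drops in directly, e.g. `{"type": "image_url", "image_url": {"url": image_to_data_url("photo.jpg")}}`. Match the MIME type to the actual file format (`image/png`, `image/webp`, etc.); note that Base64 inflates the payload by roughly a third, so URL references are preferable for large images.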

Common applications for VLMs include:

  • Image Captioning: Automatically generating descriptive text for images.
  • Visual Question Answering (VQA): Answering questions based on the content of an image.
  • Document Analysis: Extracting and interpreting information from scanned documents or forms.
  • Chart Interpretation: Analyzing data visualizations like graphs and charts.
  • Optical Character Recognition (OCR): Extracting printed or handwritten text from images.
  • Content Moderation: Identifying and flagging inappropriate or sensitive visual content.