Text

Use the Text API when you want chat-completion style model access from your own application, script, or automation.

RemoteGPU does not choose a default model for you. Every request must name the model it should run against.

Send requests

Quickstart chat completion

For your first call, send a chat-completion request with model and messages.

Before you send it:

Create an API key with inference access
Send that key in the Authorization: Bearer <api-key> header
Name the model explicitly

If you have not created a key yet, read API keys first.


bash

curl -X POST "https://inference.remotegpu.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.6-27B-FP8",
    "messages": [
      {
        "role": "user",
        "content": "Summarize this note in one concise paragraph."
      }
    ],
    "max_tokens": 256,
    "temperature": 0.2
  }'

javascript

const response = await fetch("https://inference.remotegpu.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.REMOTEGPU_API_KEY}`,
  },
  body: JSON.stringify({
    model: "Qwen/Qwen3.6-27B-FP8",
    messages: [
      {
        role: "user",
        content: "Summarize this note in one concise paragraph.",
      },
    ],
    max_tokens: 256,
    temperature: 0.2,
  }),
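});

// Surface HTTP errors (for example 401, 403, or 504) instead of parsing an error body as a completion.
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}
const completion = await response.json();
console.log(completion.choices[0].message.content);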

python

import os
import requests

response = requests.post(
    "https://inference.remotegpu.ai/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['REMOTEGPU_API_KEY']}",
    },
    json={
        "model": "Qwen/Qwen3.6-27B-FP8",
        "messages": [
            {
                "role": "user",
                "content": "Summarize this note in one concise paragraph.",
            }
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    },
    timeout=60,
)
response.raise_for_status()
completion = response.json()
print(completion["choices"][0]["message"]["content"])

Example response:

json

{
  "id": "chatcmpl-c3b1e0e7-2f7c-4c66-bd13-97b6c2b87f1d",
  "object": "chat.completion",
  "created": 1779174000,
  "model": "Qwen/Qwen3.6-27B-FP8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 48,
    "total_tokens": 80
  }
}

The response is returned synchronously as an OpenAI-compatible chat.completion object.

Decide if this API fits

Use the Text API for:

chat-completion style requests from applications, backends, scripts, or automations
programmatic control over prompts, model selection, and sampling parameters
text inference workflows that do not need Kubernetes resources

Use Application when you want a guided hosted workflow in the console. Use Kubernetes when you want to run and expose your own workloads in a namespace.

How requests work

Authentication

Text inference requests require an API key with inference access in the Authorization header as a Bearer token.

Create an Inference key in API keys, then send it on every request.

The request examples on this page already include the Bearer token header.

If the key is missing or invalid, the API returns 401. If the key is valid but does not allow inference APIs, the API returns 403.

OpenAI-compatible shape

Text requests use these OpenAI-compatible endpoints:

Endpoint	Purpose
`GET /v1/models`	List text models visible through the OpenAI-compatible model list
`POST /v1/chat/completions`	Submit a chat-completion request and receive a synchronous `chat.completion` response

Use https://inference.remotegpu.ai/v1 as the base URL for OpenAI-compatible text inference clients.

For OpenAI SDK style clients, configure:

python

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://inference.remotegpu.ai/v1",
)
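
Once the client is configured, a request might look like the sketch below. It assumes the `openai` Python package (v1 or later) is installed and that the key is stored in the `REMOTEGPU_API_KEY` environment variable, as in the quickstart examples. The helper names `make_client` and `ask` are illustrative and not part of RemoteGPU or the OpenAI SDK.

```python
import os

BASE_URL = "https://inference.remotegpu.ai/v1"
MODEL = "Qwen/Qwen3.6-27B-FP8"


def make_client():
    """Create an OpenAI SDK client pointed at the RemoteGPU base URL."""
    # Imported here so the rest of the module loads even without the SDK installed.
    from openai import OpenAI

    return OpenAI(api_key=os.environ["REMOTEGPU_API_KEY"], base_url=BASE_URL)


def ask(client, prompt, model=MODEL, max_tokens=256):
    """Send one user message and return the assistant's reply text."""
    completion = client.chat.completions.create(
        model=model,  # always explicit: there is no default model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content
```

To try it, call `print(ask(make_client(), "Summarize this note in one concise paragraph."))`.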

Choose a model first

Every request must name a supported model. RemoteGPU does not choose a default model for you.

Before you build a client flow:

Read GET /v1/models or GET /v1/inference/models
Choose a model explicitly
Check the model's current runtime state in GET /v1/inference/models
Keep messages and generation parameters within the selected model's limits

The unified model catalog, GET /v1/inference/models, is public.

Runtime states:

State	Meaning
`ready`	At least one worker is ready to serve requests
`starting`	Capacity is starting or warming
`sleeping`	No warm worker is available for that model

A model can be supported even when its runtime state is sleeping. In that case, the first request may spend time warming up before it starts running.

Message content

Text models accept string content and OpenAI-style text content parts.

Supported message content shapes:

json

{
  "role": "user",
  "content": "Write a short summary."
}

json

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Write a short summary."
    }
  ]
}

Image content parts are not accepted by text-only models. Requests containing image_url content parts return 400 with image_input_not_supported_yet.
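
If your application assembles messages from mixed sources, a small client-side check can catch unsupported parts before the request is sent. The sketch below is illustrative: `content_to_text` is a hypothetical helper, and joining text parts without a separator is an assumption, not documented server behavior.

```python
def content_to_text(content):
    """Flatten message content into a single string.

    Accepts the two documented shapes: a plain string, or a list of
    {"type": "text", "text": ...} parts. Any other part type, such as
    image_url, raises ValueError so the problem surfaces locally.
    """
    if isinstance(content, str):
        return content
    if not isinstance(content, list):
        raise ValueError("content must be a string or a list of content parts")

    pieces = []
    for part in content:
        if not isinstance(part, dict) or part.get("type") != "text":
            kind = part.get("type") if isinstance(part, dict) else type(part).__name__
            raise ValueError(f"unsupported content part: {kind}")
        pieces.append(str(part.get("text", "")))
    # Assumption: parts are concatenated as-is, with no separator added.
    return "".join(pieces)
```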

Response and timeouts

POST /v1/chat/completions waits for the model response and returns a 200 OK response with OpenAI-compatible JSON:

json

{
  "id": "chatcmpl-c3b1e0e7-2f7c-4c66-bd13-97b6c2b87f1d",
  "object": "chat.completion",
  "created": 1779174000,
  "model": "Qwen/Qwen3.6-27B-FP8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 48,
    "total_tokens": 80
  }
}

If the selected model is sleeping, the request may wait while capacity starts. If the request cannot complete before the server-side timeout, the API returns 504 with an OpenAI-style error response. Jobs that have not started running are cancelled internally; jobs that are already running are marked abandoned and allowed to finish in the worker.

Common status codes

Status code	Meaning
`400`	The request body is syntactically valid JSON but the payload is invalid for the selected model or endpoint
`401`	Missing, invalid, revoked, or expired API key
`403`	API key is valid but not authorized for inference APIs
`404`	The selected model does not exist or is not available
`422`	Request validation failed, such as a missing `model` field
`503`	The selected model exists but is unavailable for serving
`504`	The request timed out before a chat-completion response was available
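
One way a client might act on these codes is sketched below. The grouping is a suggestion rather than documented guidance: treating 503 and 504 as worth retrying later is an assumption, and `next_action` is a hypothetical helper. Because a timed-out job that is already running is allowed to finish in the worker, retrying immediately after a 504 can duplicate work.

```python
# Assumed groupings of the documented status codes; adjust to your own policy.
RETRY_LATER = {503, 504}        # capacity unavailable or request timed out
FIX_CREDENTIALS = {401, 403}    # key missing/invalid, or lacks inference access
FIX_REQUEST = {400, 404, 422}   # payload, model name, or validation problem


def next_action(status_code):
    """Map an HTTP status code to a coarse client-side action."""
    if status_code == 200:
        return "ok"
    if status_code in RETRY_LATER:
        return "retry_later"
    if status_code in FIX_CREDENTIALS:
        return "fix_credentials"
    if status_code in FIX_REQUEST:
        return "fix_request"
    return "unexpected"
```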

Reference

Current text models

Model	Max messages	Max output tokens	Default parameters
`Qwen/Qwen3.6-27B-FP8`	64	8192	`1024` max tokens, temperature `0.7`, top-p `1.0`

Request body fields

Field	Required	Notes
`model`	Yes	Must match a supported text model ID
`messages`	Yes	Chat messages with string content or text content parts
`max_tokens`	No	Defaults to the selected model default; maximum is `8192`
`temperature`	No	Defaults to `0.7`; accepted range is `0.0` through `2.0`
`top_p`	No	Defaults to `1.0`; accepted range is `0.0` through `1.0`
`stop`	No	Optional stop sequence or list of stop sequences
`stream`	No	Streaming is not accepted; send `false` or omit the field

If model is omitted, the request is rejected with 422. If the selected model is known but unavailable for serving, the API returns 503.
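
The field rules above can be checked locally before a request is sent. The sketch below hardcodes the limits listed on this page for `Qwen/Qwen3.6-27B-FP8` to stay self-contained; in a real client you would more likely read them from the model catalog. `validate_request` is a hypothetical helper, and the server remains the authority on what is accepted.

```python
MAX_MESSAGES = 64          # "Max messages" for Qwen/Qwen3.6-27B-FP8
MAX_OUTPUT_TOKENS = 8192   # "Max output tokens" for the same model


def validate_request(body):
    """Return a list of problems found in a chat-completion request body."""
    problems = []

    if not body.get("model"):
        problems.append("model is required")

    messages = body.get("messages")
    if not isinstance(messages, list):
        problems.append("messages is required and must be a list")
    elif len(messages) > MAX_MESSAGES:
        problems.append(f"messages exceeds {MAX_MESSAGES} entries")

    max_tokens = body.get("max_tokens")
    if max_tokens is not None and max_tokens > MAX_OUTPUT_TOKENS:
        problems.append(f"max_tokens exceeds {MAX_OUTPUT_TOKENS}")

    temperature = body.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")

    top_p = body.get("top_p")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        problems.append("top_p must be between 0.0 and 1.0")

    if body.get("stream"):
        problems.append("stream must be false or omitted")

    return problems
```

An empty list means the body passed these local checks; it does not guarantee that the API will accept the request.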

Model catalog endpoint

Use the catalog endpoint when you need the full model field details instead of the summarized table above.

bash

curl "https://inference.remotegpu.ai/v1/inference/models"

The response includes:

text[].model: the model identifier to send in POST /v1/chat/completions
text[].parameters: the public field details, including defaults and limits
text[].runtime.state: the current serving state
text[].recent_summary_stats: recent wait, run, and total time summaries

Prefer reading this endpoint over hardcoding per-model limits in clients.
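
A minimal sketch of reading the catalog from Python, using only the standard library. It assumes the response is a JSON object whose `text` key holds a list of entries with `model` and `runtime.state` fields, as the field paths above suggest; the exact payload shape is not shown on this page, and the helper names are illustrative.

```python
import json
import urllib.request

CATALOG_URL = "https://inference.remotegpu.ai/v1/inference/models"


def fetch_catalog(url=CATALOG_URL, timeout=30):
    """Download and decode the model catalog (sent without an API key here)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def text_model_states(catalog):
    """Map each text model ID to its current runtime state."""
    # Assumed shape: {"text": [{"model": ..., "runtime": {"state": ...}}, ...]}
    return {
        entry["model"]: entry["runtime"]["state"]
        for entry in catalog.get("text", [])
    }
```

For example, `text_model_states(fetch_catalog())` gives a dictionary you can check for `ready`, `starting`, or `sleeping` before choosing a model.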

Recommended client flow

Read GET /v1/inference/models to choose a supported text model and inspect its limits, defaults, and current runtime state
Submit POST /v1/chat/completions with an explicit model
Read choices[0].message.content from the returned chat.completion
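
The submit and read steps of this flow can be sketched with the standard library alone. The helper names and the 120-second client timeout are assumptions for illustration; this page does not state the server-side timeout value.

```python
import json
import urllib.request

CHAT_URL = "https://inference.remotegpu.ai/v1/chat/completions"


def build_request(model, prompt, api_key):
    """Build, but do not send, the POST request with an explicit model."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        CHAT_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def first_message_text(completion):
    """Read choices[0].message.content from a chat.completion object."""
    return completion["choices"][0]["message"]["content"]


def complete(model, prompt, api_key, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_request(model, prompt, api_key)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return first_message_text(json.load(response))
```

Choose the `model` argument from the catalog rather than hardcoding it, so the client follows the first step of the flow as well.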
