Text

Use the Text API when you want chat-completion style model access from your own application, script, or automation.

RemoteGPU does not choose a default model for you. Every request must name the model it should run against.

Send requests

Quickstart chat completion

For your first call, send a chat-completion request with model and messages.

Before you send it:

  • Create an API key with inference access
  • Send that key in the Authorization: Bearer <api-key> header
  • Name the model explicitly

If you have not created a key yet, read API keys first.

bash
curl -X POST "https://inference.remotegpu.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.6-27B-FP8",
    "messages": [
      {
        "role": "user",
        "content": "Summarize this note in one concise paragraph."
      }
    ],
    "max_tokens": 256,
    "temperature": 0.2
  }'

js
const response = await fetch("https://inference.remotegpu.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.REMOTEGPU_API_KEY}`,
  },
  body: JSON.stringify({
    model: "Qwen/Qwen3.6-27B-FP8",
    messages: [
      {
        role: "user",
        content: "Summarize this note in one concise paragraph.",
      },
    ],
    max_tokens: 256,
    temperature: 0.2,
  }),
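});

// Fail loudly on 4xx/5xx responses, then print the assistant's reply.
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}
const completion = await response.json();
console.log(completion.choices[0].message.content);

python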
import os
import requests

response = requests.post(
    "https://inference.remotegpu.ai/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['REMOTEGPU_API_KEY']}",
    },
    json={
        "model": "Qwen/Qwen3.6-27B-FP8",
        "messages": [
            {
                "role": "user",
                "content": "Summarize this note in one concise paragraph.",
            }
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    },
    timeout=60,
)
response.raise_for_status()
completion = response.json()
print(completion["choices"][0]["message"]["content"])

Example response:

json
{
  "id": "chatcmpl-c3b1e0e7-2f7c-4c66-bd13-97b6c2b87f1d",
  "object": "chat.completion",
  "created": 1779174000,
  "model": "Qwen/Qwen3.6-27B-FP8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 48,
    "total_tokens": 80
  }
}

The response is returned synchronously as an OpenAI-compatible chat.completion object.
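
Because the response follows the chat.completion shape shown above, a client usually needs only a few fields from it. The sketch below is one way to pull them out. The helper name extract_reply and the returned dictionary layout are illustrative assumptions, not part of the API.

```python
from typing import Any


def extract_reply(completion: dict[str, Any]) -> dict[str, Any]:
    """Pull the assistant text, finish reason, and token usage out of a chat.completion dict."""
    choice = completion["choices"][0]
    usage = completion.get("usage", {})
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Sample shaped like the example response above (content left as "...").
sample = {
    "object": "chat.completion",
    "model": "Qwen/Qwen3.6-27B-FP8",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 32, "completion_tokens": 48, "total_tokens": 80},
}

reply = extract_reply(sample)
print(reply["finish_reason"], reply["total_tokens"])
```

Logging total_tokens alongside each reply can make it easier to see how close requests come to the model's output limit.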

Decide if this API fits

Use the Text API for:

  • chat-completion style requests from applications, backends, scripts, or automations
  • programmatic control over prompts, model selection, and sampling parameters
  • text inference workflows that do not need Kubernetes resources

Use Application when you want a guided hosted workflow in the console. Use Kubernetes when you want to run and expose your own workloads in a namespace.

How requests work

Authentication

Text inference requests must send an API key with inference access as a Bearer token in the Authorization header.

Create an Inference key in API keys, then send it on every request.

The request examples on this page already include the Bearer token header.

If the key is missing or invalid, the API returns 401. If the key is valid but does not allow inference APIs, the API returns 403.

OpenAI-compatible shape

Text requests use these OpenAI-compatible endpoints:

Endpoint                    Purpose
GET /v1/models              List text models visible through the OpenAI-compatible model list
POST /v1/chat/completions   Submit a chat-completion request and receive a synchronous chat.completion response

Use https://inference.remotegpu.ai/v1 as the base URL for OpenAI-compatible text inference clients.

For OpenAI SDK-style clients, set the API key and base URL:

python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://inference.remotegpu.ai/v1",
)
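
With a client configured as above, a request might look like the following sketch. It relies on the chat.completions.create method of the OpenAI Python SDK (version 1 or later). The helper name summarize and the prompt wording are hypothetical, and the client is passed in as an argument so the function does not depend on how it was constructed.

```python
from typing import Any


def summarize(client: Any, note: str, model: str = "Qwen/Qwen3.6-27B-FP8") -> str:
    """Send one user message through an OpenAI-SDK-style client and return the reply text."""
    completion = client.chat.completions.create(
        model=model,  # always explicit; RemoteGPU does not choose a default model
        messages=[
            {
                "role": "user",
                "content": f"Summarize this note in one concise paragraph.\n\n{note}",
            }
        ],
        max_tokens=256,
        temperature=0.2,
        stream=False,  # streaming is not accepted by this API
    )
    return completion.choices[0].message.content
```

A call would then be summarize(client, "Meeting moved to Thursday; bring the Q3 numbers."), where client is the OpenAI object configured with the RemoteGPU base URL.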

Choose a model first

Every request must name a supported model. RemoteGPU does not choose a default model for you.

Before you build a client flow:

  • Call GET /v1/models or GET /v1/inference/models to list the available models
  • Choose a model explicitly
  • Check the model's current runtime state in GET /v1/inference/models
  • Keep messages and generation parameters within the selected model's limits

The unified model catalog, GET /v1/inference/models, is public.

Runtime states:

State      Meaning
ready      At least one worker is ready to serve requests
starting   Capacity is starting or warming
sleeping   No warm worker is available for that model

A model can be supported even when its runtime state is sleeping. In that case, the first request may wait for a worker to warm up before it starts running.
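
As a rough illustration of using the runtime state when choosing a model, the sketch below ranks catalog entries by how soon they are likely to serve a request. It assumes the catalog response has already been parsed into a dictionary and uses only the text[].model and text[].runtime.state fields. The helper name pick_text_model, the ranking, and treating a missing state as sleeping are assumptions.

```python
from typing import Any, Optional

# Lower rank = likely shorter wait, following the runtime state table above.
STATE_RANK = {"ready": 0, "starting": 1, "sleeping": 2}


def pick_text_model(
    catalog: dict[str, Any], wanted: Optional[str] = None
) -> Optional[dict[str, str]]:
    """Return {"model", "state"} for the wanted model, or the warmest model if none is named."""
    entries = [
        {
            "model": item["model"],
            # Assumption: an entry without a runtime state is treated as sleeping.
            "state": item.get("runtime", {}).get("state", "sleeping"),
        }
        for item in catalog.get("text", [])
    ]
    if wanted is not None:
        entries = [entry for entry in entries if entry["model"] == wanted]
    if not entries:
        return None
    # Unknown states sort after every known state.
    return min(entries, key=lambda entry: STATE_RANK.get(entry["state"], len(STATE_RANK)))


# Hand-written sample; only the fields named on this page are included.
catalog = {"text": [{"model": "Qwen/Qwen3.6-27B-FP8", "runtime": {"state": "sleeping"}}]}
choice = pick_text_model(catalog)
print(choice)
```

A sleeping result is still usable; it mainly signals that the client should allow a longer timeout for the first request.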

Message content

Text models accept string content and OpenAI-style text content parts.

Supported message content shapes:

json
{
  "role": "user",
  "content": "Write a short summary."
}

json
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Write a short summary."
    }
  ]
}

Image content parts are not accepted by text-only models. Requests containing image_url content parts return 400 with image_input_not_supported_yet.
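
A client can catch unsupported content parts before sending a request instead of waiting for the 400. The sketch below scans a message list and reports every part that is not plain text. The helper name find_unsupported_parts is hypothetical, and flagging every non-text type (not only image_url) is a conservative assumption rather than documented behavior.

```python
from typing import Any


def find_unsupported_parts(messages: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Return (message index, part type) for every content part that is not plain text."""
    problems = []
    for index, message in enumerate(messages):
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain string content is a supported shape
        for part in content or []:
            part_type = part.get("type", "<missing>")
            if part_type != "text":
                problems.append((index, part_type))
    return problems


messages = [
    {"role": "user", "content": "Write a short summary."},
    {"role": "user", "content": [{"type": "text", "text": "Write a short summary."}]},
    {
        "role": "user",
        "content": [{"type": "image_url", "image_url": {"url": "https://example.com/a.png"}}],
    },
]
print(find_unsupported_parts(messages))
```

An empty list means every message uses one of the two supported shapes.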

Response and timeouts

POST /v1/chat/completions waits for the model response and returns a 200 OK response with OpenAI-compatible JSON:

json
{
  "id": "chatcmpl-c3b1e0e7-2f7c-4c66-bd13-97b6c2b87f1d",
  "object": "chat.completion",
  "created": 1779174000,
  "model": "Qwen/Qwen3.6-27B-FP8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 48,
    "total_tokens": 80
  }
}

If the model is sleeping, the request may wait while capacity starts. If the request cannot complete before the server-side timeout, the API returns 504 with an OpenAI-style error body. On timeout, jobs that have not started running are cancelled internally; jobs that are already running are marked abandoned but still allowed to finish on the worker.

Common status codes

Status code   Meaning
400           The request body is syntactically valid JSON but the payload is invalid for the selected model or endpoint
401           Missing, invalid, revoked, or expired API key
403           API key is valid but not authorized for inference APIs
404           The selected model does not exist or is not available
422           Request validation failed, such as a missing model field
503           The selected model exists but is unavailable for serving
504           The request timed out before a chat-completion response was available
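
One way a client might act on these codes is to group them by what the caller should do next. The mapping below is a sketch. The action names and the decision to treat 503 and 504 as retryable are assumptions, not RemoteGPU guidance. Because a job that was already running when a 504 was returned is allowed to finish on the worker, an immediate retry may duplicate work.

```python
# Suggested client reaction per documented status code.
# The grouping into actions is an illustrative assumption, not RemoteGPU policy.
ACTIONS = {
    400: "fix_request",  # payload invalid for the selected model or endpoint
    401: "fix_key",      # missing, invalid, revoked, or expired API key
    403: "fix_key",      # key is valid but lacks inference access
    404: "fix_model",    # model does not exist or is not available
    422: "fix_request",  # validation failed, such as a missing model field
    503: "retry_later",  # model exists but is unavailable for serving
    504: "retry_later",  # timed out before a response was available
}


def next_action(status_code: int) -> str:
    """Map an HTTP status code to a coarse client-side action."""
    if 200 <= status_code < 300:
        return "ok"
    return ACTIONS.get(status_code, "unknown")


for code in (200, 403, 504):
    print(code, next_action(code))
```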

Reference

Current text models

Model                  Max messages   Max output tokens   Default parameters
Qwen/Qwen3.6-27B-FP8   64             8192                1024 max tokens, temperature 0.7, top-p 1.0

Request body fields

Field         Required   Notes
model         Yes        Must match a supported text model ID
messages      Yes        Chat messages with string content or text content parts
max_tokens    No         Defaults to the selected model default; maximum is 8192
temperature   No         Defaults to 0.7; accepted range is 0.0 through 2.0
top_p         No         Defaults to 1.0; accepted range is 0.0 through 1.0
stop          No         Optional stop sequence or list of stop sequences
stream        No         Streaming is not accepted; send false or omit the field

If model is omitted, the request is rejected with 422. If the selected model is known but unavailable for serving, the API returns 503.
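
The field rules above can also be checked on the client before a request is sent. The following sketch builds a request body and raises on out-of-range values. The helper name build_chat_request is hypothetical, and the lower bounds (at least one message, at least one output token) are assumptions. The constants mirror the tables on this page for a single model; a real client would more likely read them from the catalog endpoint.

```python
from typing import Any, Optional, Union

MAX_MESSAGES = 64         # from the "Current text models" table
MAX_OUTPUT_TOKENS = 8192  # from the same table and the max_tokens note


def build_chat_request(
    model: str,
    messages: list[dict[str, Any]],
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    stop: Union[str, list[str], None] = None,
) -> dict[str, Any]:
    """Validate against the documented limits and return a JSON-ready request body."""
    if not model:
        raise ValueError("model is required; RemoteGPU does not choose a default")
    if not 1 <= len(messages) <= MAX_MESSAGES:
        raise ValueError(f"messages must contain 1 to {MAX_MESSAGES} items")
    if max_tokens is not None and not 1 <= max_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(f"max_tokens must be between 1 and {MAX_OUTPUT_TOKENS}")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0.0 and 1.0")

    body: dict[str, Any] = {"model": model, "messages": messages}
    optional = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stop": stop,
    }
    # Omitted fields fall back to the server-side defaults; stream is never sent.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


body = build_chat_request(
    "Qwen/Qwen3.6-27B-FP8",
    [{"role": "user", "content": "Write a short summary."}],
    max_tokens=256,
    temperature=0.2,
)
print(sorted(body))
```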

Model catalog endpoint

Use the catalog endpoint when you need the full model field details instead of the summarized table above.

bash
curl "https://inference.remotegpu.ai/v1/inference/models"

The response includes:

  • text[].model: the model identifier to send in POST /v1/chat/completions
  • text[].parameters: the public field details, including defaults and limits
  • text[].runtime.state: the current serving state
  • text[].recent_summary_stats: recent wait, run, and total time summaries

Prefer reading this endpoint over hardcoding per-model limits in clients.

A typical client flow:

  1. Read GET /v1/inference/models to choose a supported text model and inspect its limits, defaults, and current runtime state
  2. Submit POST /v1/chat/completions with an explicit model
  3. Read choices[0].message.content from the returned chat.completion

RemoteGPU customer documentation