Text
Use the Text API when you want chat-completion style model access from your own application, script, or automation.
RemoteGPU does not choose a default model for you. Every request must name the model it should run against.
Send requests
Quickstart chat completion
For your first call, send a chat-completion request with model and messages.
Before you send it:
- Create an API key with inference access
- Send that key in the Authorization: Bearer <api-key> header
- Name the model explicitly
If you have not created a key yet, read API keys first.
bash
curl -X POST "https://inference.remotegpu.ai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "Qwen/Qwen3.6-27B-FP8",
    "messages": [
      {
        "role": "user",
        "content": "Summarize this note in one concise paragraph."
      }
    ],
    "max_tokens": 256,
    "temperature": 0.2
  }'
js
const response = await fetch("https://inference.remotegpu.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.REMOTEGPU_API_KEY}`,
  },
  body: JSON.stringify({
    model: "Qwen/Qwen3.6-27B-FP8",
    messages: [
      {
        role: "user",
        content: "Summarize this note in one concise paragraph.",
      },
    ],
    max_tokens: 256,
    temperature: 0.2,
  }),
});
python
import os

import requests

response = requests.post(
    "https://inference.remotegpu.ai/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['REMOTEGPU_API_KEY']}",
    },
    json={
        "model": "Qwen/Qwen3.6-27B-FP8",
        "messages": [
            {
                "role": "user",
                "content": "Summarize this note in one concise paragraph.",
            }
        ],
        "max_tokens": 256,
        "temperature": 0.2,
    },
    timeout=60,
)
response.raise_for_status()
completion = response.json()
print(completion["choices"][0]["message"]["content"])
Example response:
json
{
  "id": "chatcmpl-c3b1e0e7-2f7c-4c66-bd13-97b6c2b87f1d",
  "object": "chat.completion",
  "created": 1779174000,
  "model": "Qwen/Qwen3.6-27B-FP8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 48,
    "total_tokens": 80
  }
}
The response is returned synchronously as an OpenAI-compatible chat.completion object.
Decide if this API fits
Use the Text API for:
- chat-completion style requests from applications, backends, scripts, or automations
- programmatic control over prompts, model selection, and sampling parameters
- text inference workflows that do not need Kubernetes resources
Use Application when you want a guided hosted workflow in the console. Use Kubernetes when you want to run and expose your own workloads in a namespace.
How requests work
Authentication
Text inference requests require an API key with inference access in the Authorization header as a Bearer token.
Create an Inference key in API keys, then send it on every request.
The request examples on this page already include the Bearer token header.
If the key is missing or invalid, the API returns 401. If the key is valid but does not allow inference APIs, the API returns 403.
OpenAI-compatible shape
Text requests use these OpenAI-compatible endpoints:
| Endpoint | Purpose |
|---|---|
| GET /v1/models | List text models visible through the OpenAI-compatible model list |
| POST /v1/chat/completions | Submit a chat-completion request and receive a synchronous chat.completion response |
Use https://inference.remotegpu.ai/v1 as the base URL for OpenAI-compatible text inference clients.
For OpenAI SDK style clients, configure:
python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://inference.remotegpu.ai/v1",
)
Choose a model first
Every request must name a supported model. RemoteGPU does not choose a default model for you.
Before you build a client flow:
- Read GET /v1/models or GET /v1/inference/models
- Choose a model explicitly
- Check the model's current runtime state in GET /v1/inference/models
- Keep messages and generation parameters within the selected model's limits
The unified model catalog is public.
Runtime states:
| State | Meaning |
|---|---|
| ready | At least one worker is ready to serve requests |
| starting | Capacity is starting or warming |
| sleeping | No warm worker is available for that model |
A model can be supported even when its runtime state is sleeping. In that case, the first request may wait while capacity warms up before it starts running.
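As a rough illustration of using the runtime states, the sketch below picks a catalog entry and prefers warmer states. It reads only the text[].model and text[].runtime.state fields described on this page; the preference order, the function name pick_text_model, and the sample entry example/model-a are assumptions made for the example, not part of the API.

```python
from typing import Optional

# Assumed preference order for this sketch: lower rank should respond sooner.
STATE_RANK = {"ready": 0, "starting": 1, "sleeping": 2}


def pick_text_model(catalog: dict, wanted: Optional[str] = None) -> Optional[dict]:
    """Return a catalog entry, preferring warmer runtime states.

    `catalog` is the parsed JSON body of GET /v1/inference/models.
    If `wanted` is given, only entries with that model ID are considered.
    """
    entries = catalog.get("text", [])
    if wanted is not None:
        entries = [e for e in entries if e.get("model") == wanted]
    # Ignore entries whose state this sketch does not recognize.
    known = [e for e in entries if e.get("runtime", {}).get("state") in STATE_RANK]
    if not known:
        return None
    return min(known, key=lambda e: STATE_RANK[e["runtime"]["state"]])


# Hand-written sample payload; "example/model-a" is a hypothetical model ID.
sample = {
    "text": [
        {"model": "example/model-a", "runtime": {"state": "sleeping"}},
        {"model": "Qwen/Qwen3.6-27B-FP8", "runtime": {"state": "ready"}},
    ]
}
chosen = pick_text_model(sample)
print(chosen["model"], chosen["runtime"]["state"])
```

A sleeping model is still returned when it is the only match, since a supported model can be sleeping and still accept requests.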
Message content
Text models accept string content and OpenAI-style text content parts.
Supported message content shapes:
json
{
  "role": "user",
  "content": "Write a short summary."
}
json
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Write a short summary."
    }
  ]
}
Image content parts are not accepted by text-only models. Requests containing image_url content parts return 400 with image_input_not_supported_yet.
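A client can check message content before sending it. The helper below is a minimal sketch, assuming only the two shapes shown above; the name message_text is hypothetical, and joining text parts with an empty string is a choice of this example rather than documented server behavior.

```python
def message_text(message: dict) -> str:
    """Flatten a message's content to plain text.

    Accepts a string, or a list of {"type": "text", "text": ...} parts.
    Any other part type (for example image_url) raises ValueError so the
    caller can fix the request before the API rejects it with 400.
    """
    content = message["content"]
    if isinstance(content, str):
        return content
    texts = []
    for part in content:
        if part.get("type") != "text":
            raise ValueError(f"unsupported content part type: {part.get('type')!r}")
        texts.append(part["text"])
    return "".join(texts)


as_string = {"role": "user", "content": "Write a short summary."}
as_parts = {
    "role": "user",
    "content": [{"type": "text", "text": "Write a short summary."}],
}
print(message_text(as_string) == message_text(as_parts))
```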
Response and timeouts
POST /v1/chat/completions waits for the model response and returns a 200 OK response with OpenAI-compatible JSON:
json
{
  "id": "chatcmpl-c3b1e0e7-2f7c-4c66-bd13-97b6c2b87f1d",
  "object": "chat.completion",
  "created": 1779174000,
  "model": "Qwen/Qwen3.6-27B-FP8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 48,
    "total_tokens": 80
  }
}
If a worker is sleeping, the request may wait while capacity starts. If the request cannot complete before the server-side timeout, the API returns a 504 OpenAI-style error response. Jobs that have not started running are cancelled internally; jobs that are already running are marked abandoned and allowed to finish in the worker.
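One possible client-side response to a 504 is a bounded retry with backoff. The sketch below is an assumption-heavy illustration, not a documented recommendation: send is a hypothetical caller-supplied function that performs one POST /v1/chat/completions and returns (status_code, parsed_json), and the attempt count and delays are arbitrary. Because a job that was already running is allowed to finish in the worker, a retry can repeat work.

```python
import time
from typing import Callable, Tuple


def send_with_504_retry(
    send: Callable[[], Tuple[int, dict]],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call `send` up to `attempts` times, retrying only on status 504.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries
    and returns the last (status, body) pair.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status, body = send()
    for attempt in range(attempts - 1):
        if status != 504:
            break
        sleep(base_delay * 2 ** attempt)
        status, body = send()
    return status, body


# Demo with a fake `send` that times out once and then succeeds.
# The response bodies are placeholders, not real API output.
responses = iter([(504, {}), (200, {"object": "chat.completion"})])
delays = []
status, body = send_with_504_retry(lambda: next(responses), sleep=delays.append)
print(status, delays)
```

Passing sleep as a parameter keeps the sketch testable without real waiting.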
Common status codes
| Status code | Meaning |
|---|---|
| 400 | The request body is syntactically valid JSON but the payload is invalid for the selected model or endpoint |
| 401 | Missing, invalid, revoked, or expired API key |
| 403 | API key is valid but not authorized for inference APIs |
| 404 | The selected model does not exist or is not available |
| 422 | Request validation failed, such as a missing model field |
| 503 | The selected model exists but is unavailable for serving |
| 504 | The request timed out before a chat-completion response was available |
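The table above can be turned into a small lookup that tells a client what kind of follow-up a status code calls for. This is only a sketch: the action labels and the function name next_action are invented for the example, and treating 503 and 504 as retryable is an assumption about client policy.

```python
# Action labels are hypothetical names chosen for this sketch.
STATUS_ACTIONS = {
    200: "ok",
    400: "fix_request",      # payload invalid for the model or endpoint
    401: "fix_key",          # missing, invalid, revoked, or expired key
    403: "fix_key",          # key lacks inference access
    404: "choose_model",     # model does not exist or is not available
    422: "fix_request",      # validation failed, e.g. missing model field
    503: "retry_later",      # model exists but is unavailable for serving
    504: "retry_later",      # timed out before a response was available
}


def next_action(status: int) -> str:
    """Return a coarse follow-up action for a response status code."""
    return STATUS_ACTIONS.get(status, "unexpected")


for code in (200, 403, 422, 504, 418):
    print(code, next_action(code))
```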
Reference
Current text models
| Model | Max messages | Max output tokens | Default parameters |
|---|---|---|---|
| Qwen/Qwen3.6-27B-FP8 | 64 | 8192 | 1024 max tokens, temperature 0.7, top-p 1.0 |
Request body fields
| Field | Required | Notes |
|---|---|---|
| model | Yes | Must match a supported text model ID |
| messages | Yes | Chat messages with string content or text content parts |
| max_tokens | No | Defaults to the selected model default; maximum is 8192 |
| temperature | No | Defaults to 0.7; accepted range is 0.0 through 2.0 |
| top_p | No | Defaults to 1.0; accepted range is 0.0 through 1.0 |
| stop | No | Optional stop sequence or list of stop sequences |
| stream | No | Streaming is not accepted; send false or omit the field |
If model is omitted, the request fails validation and the API returns 422. If the selected model is known but unavailable for serving, the API returns 503.
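The field rules above can be checked locally before a request is sent. The following is a minimal sketch, assuming the limits listed for Qwen/Qwen3.6-27B-FP8 as defaults; validate_request is a hypothetical helper, treating an empty messages list as invalid is an assumption, and the server remains the authority on what it accepts.

```python
def validate_request(body: dict, max_messages: int = 64,
                     max_output_tokens: int = 8192) -> list:
    """Return a list of problems found in a chat-completion request body."""
    problems = []
    if not body.get("model"):
        problems.append("model is required")
    messages = body.get("messages")
    if not messages:
        # Assumption of this sketch: an empty list is treated as missing.
        problems.append("messages must be a non-empty list")
    elif len(messages) > max_messages:
        problems.append(f"too many messages: {len(messages)} > {max_messages}")
    max_tokens = body.get("max_tokens")
    if max_tokens is not None and max_tokens > max_output_tokens:
        problems.append(f"max_tokens exceeds {max_output_tokens}")
    temperature = body.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")
    top_p = body.get("top_p")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        problems.append("top_p must be between 0.0 and 1.0")
    if body.get("stream"):
        problems.append("stream must be false or omitted")
    return problems


good = {
    "model": "Qwen/Qwen3.6-27B-FP8",
    "messages": [{"role": "user", "content": "Write a short summary."}],
    "max_tokens": 256,
    "temperature": 0.2,
}
bad = dict(good, temperature=3.0, stream=True)
print(validate_request(good))
print(validate_request(bad))
```

In practice the limits would come from GET /v1/inference/models rather than from hardcoded defaults.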
Model catalog endpoint
Use the catalog endpoint when you need the full model field details instead of the summarized table above.
bash
curl "https://inference.remotegpu.ai/v1/inference/models"
The response includes:
- text[].model: the model identifier to send in POST /v1/chat/completions
- text[].parameters: the public field details, including defaults and limits
- text[].runtime.state: the current serving state
- text[].recent_summary_stats: recent wait, run, and total time summaries
Prefer reading this endpoint over hardcoding per-model limits in clients.
Recommended client flow
- Read GET /v1/inference/models to choose a supported text model and inspect its limits, defaults, and current runtime state
- Submit POST /v1/chat/completions with an explicit model
- Read choices[0].message.content from the returned chat.completion
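The three steps above can be sketched as one function. To keep the example self-contained, the HTTP layer is injected: get_json and post_json are hypothetical callables that a real client would implement with its HTTP library of choice, and the fake versions below return hand-written payloads rather than real API output.

```python
from typing import Callable


def run_text_flow(
    get_json: Callable[[str], dict],
    post_json: Callable[[str, dict], dict],
    model: str,
    prompt: str,
) -> str:
    """Check the catalog for `model`, submit one request, return the text."""
    # Step 1: confirm the explicitly chosen model is listed in the catalog.
    catalog = get_json("/v1/inference/models")
    available = {entry["model"] for entry in catalog.get("text", [])}
    if model not in available:
        raise LookupError(f"model not in catalog: {model}")
    # Step 2: submit the chat completion with the model named explicitly.
    completion = post_json(
        "/v1/chat/completions",
        {"model": model, "messages": [{"role": "user", "content": prompt}]},
    )
    # Step 3: read the assistant text from the first choice.
    return completion["choices"][0]["message"]["content"]


# Fake transport for demonstration only.
def fake_get(path: str) -> dict:
    return {"text": [{"model": "Qwen/Qwen3.6-27B-FP8",
                      "runtime": {"state": "ready"}}]}


def fake_post(path: str, body: dict) -> dict:
    return {"choices": [{"index": 0, "message": {
        "role": "assistant", "content": f"echo from {body['model']}"}}]}


result = run_text_flow(fake_get, fake_post, "Qwen/Qwen3.6-27B-FP8", "Hello")
print(result)
```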
Read next
- Read API keys to create or rotate inference keys.
- Read Inference API overview to compare the API with hosted applications and Kubernetes.
