Run GPT-OSS model on Nebius Token Factory
GPT-OSS is a highly anticipated open-weights release from OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comes in two sizes:
- a larger model with 117B parameters (gpt-oss-120b),
- and a smaller one with 21B parameters (gpt-oss-20b).
Both are mixture-of-experts (MoE) models and use a 4-bit quantization scheme (MXFP4).
References
- OpenAI open models page | GitHub: openai/gpt-oss
- Model card for 120B param model: gpt-oss-120b
- Model card for smaller 20B param model: gpt-oss-20b
- Nebius Token Factory
Prerequisites
- A Nebius API key. Sign up for free at Nebius Token Factory.
1 - Getting Started
1.1 - Get your Nebius API key at Nebius Token Factory
1.2 - If running on Google Colab ...
Add NEBIUS_API_KEY to the Secrets panel on the left as follows

1.3 - If running locally
Create a .env file with NEBIUS_API_KEY as follows
NEBIUS_API_KEY=your_api_key_goes_here
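To make the key available to the notebook, the .env file has to be loaded into the environment (cookbooks typically use the `python-dotenv` package for this). As a dependency-free sketch of what that loading does, with a hypothetical `load_env` helper:

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal stand-in for python-dotenv: read KEY=value lines into os.environ."""
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("NEBIUS_API_KEY")  # None if the file or key is missing
```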
2 - Install Dependencies
[1]
3 - Load Configuration
[2]
Running in Colab ✅ NEBIUS_API_KEY found
4 - Run the Model
4.1 - Initialize the client
[3]
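The cell above initializes an OpenAI-compatible client pointed at Nebius. Since the cell body isn't reproduced here, below is a stdlib-only sketch of the request such a client builds under the hood; the base URL is an assumption (confirm it in the Token Factory docs), while the model ID `openai/gpt-oss-120b` appears in the responses later in this notebook:

```python
import json, os, urllib.request

BASE_URL = "https://api.studio.nebius.com/v1"  # assumption - confirm in the Token Factory docs
MODEL = "openai/gpt-oss-120b"                  # model ID seen in the responses below

def build_chat_request(messages, model=MODEL):
    """Build (but do not send) an OpenAI-compatible chat-completions request."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('NEBIUS_API_KEY', '')}",
        },
    )

req = build_chat_request([{"role": "user", "content": "Hello!"}])
# urllib.request.urlopen(req) would send it; in practice the OpenAI SDK
# (OpenAI(base_url=..., api_key=...)) does the same with less code.
```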
4.2 - Find out the model's capabilities
[4]
----model answer -----
I’m a GPT‑4‑based AI assistant, so my main strength is working with natural language (and, when an image is supplied, visual information as well). Here’s a quick rundown of what I can do:

| Category | What I can help with |
|----------|----------------------|
| **Information & Knowledge** | Answer factual questions, explain concepts, summarize articles, compare topics, and provide background on science, history, technology, arts, etc. (knowledge up to June 2024). |
| **Writing & Creativity** | Draft essays, reports, emails, blog posts, stories, poems, jokes, dialogue, scripts, marketing copy, social‑media captions, and more. I can also rewrite or edit existing text for tone, clarity, conciseness, or style. |
| **Learning & Tutoring** | Explain math problems step‑by‑step, walk through physics or chemistry concepts, help with language learning, generate practice questions, and provide study strategies. |
| **Programming & Technical Help** | Write, debug, and refactor code in many languages (Python, JavaScript, Java, C++, etc.), explain algorithms, suggest data‑structure choices, create SQL queries, and help with configuration files, APIs, or command‑line tasks. |
| **Productivity & Planning** | Build outlines, to‑do lists, project plans, meeting agendas, timelines, and simple Gantt‑style schedules. I can also generate templates for resumes, cover letters, budgets, and business documents. |
| **Data & Analysis** | Interpret CSV/JSON snippets, perform basic calculations, generate statistical summaries, suggest visualizations, and explain results in plain language. |
| **Language Services** | Translate between many languages, proofread for grammar/spelling, adjust formality level, and localize text for different audiences. |
| **Conversation & Role‑Play** | Simulate interviews, practice negotiations, act as a mock client or user, provide coaching for public speaking, or just chat for entertainment. |
| **Image Understanding (when an image is provided)** | Describe the content, identify objects, read text within the picture (OCR), analyze charts/diagrams, and answer questions about visual details. |
| **Ethical Guidance & Safety** | Offer advice on responsible AI use, data privacy, digital well‑being, and flag content that may be harmful or disallowed. |

### What I **don’t** do
- Access the live internet, real‑time databases, or personal accounts.
- Perform actions in the physical world (e.g., send emails, place orders).
- Provide medical, legal, or financial advice that would replace a qualified professional’s counsel.
- Generate disallowed content (e.g., hate speech, extremist propaganda, instructions for illegal activities).

### How to get the most out of me
1. **Be specific** – clear prompts lead to clearer answers.
2. **Provide context** – if you have a particular audience, format, or constraint, let me know.
3. **Iterate** – you can ask follow‑up questions or request refinements (“make it shorter,” “use a more formal tone,” etc.).

Feel free to ask for anything within these areas, and I’ll do my best to help!

CPU times: user 645 ms, sys: 83 ms, total: 728 ms
Wall time: 5.82 s
4.3 - Ask a factual question
[5]
----model answer -----
The capital of France is **Paris**.
----- full response ----
{
  "id": "chatcmpl-2769efe8f3ad49c9a96879ea7439656a",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "The capital of France is **Paris**.",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": "The user asks: \"What is the capital of France?\" Straightforward answer: Paris."
      },
      "stop_reason": null
    }
  ],
  "created": 1754544553,
  "model": "openai/gpt-oss-120b",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 37,
    "prompt_tokens": 88,
    "total_tokens": 125,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
---------
CPU times: user 6.25 ms, sys: 0 ns, total: 6.25 ms
Wall time: 374 ms
4.4 - Ask a reasoning question
[6]
Below is a **step‑by‑step playbook** you can use (and adapt) the next time you need to shave milliseconds‑to‑seconds off the latency of a model in production.
I’ll walk through the **thought process**, the **key levers** you can pull, and then give **concrete code snippets** for the most common techniques. For each lever I’ll also note **alternatives** and the **trade‑offs** you’ll have to consider.
---
## 1️⃣ Start with a Baseline & Profile the Hot Path
> **Why?** You can’t optimize what you don’t know is slow.
> **What to measure:**
> - End‑to‑end latency (client → API → model → response)
> - Pure model inference time (exclude network, serialization)
> - CPU/GPU utilization, memory bandwidth, cache‑miss rates
**Typical tools**
| Layer | Tool | What it tells you |
|-------|------|-------------------|
| Python code | `cProfile`, `line_profiler` | Python‑level bottlenecks (data prep, post‑proc) |
| TensorFlow | TensorBoard Profiler, `tf.profiler` | Ops breakdown, device placement |
| PyTorch | `torch.profiler`, `torch.utils.bottleneck` | Kernel timings, CUDA sync issues |
| System | `nvidia‑smi`, `perf`, `vtune` | GPU memory, PCIe throughput, CPU stalls |
| End‑to‑end | `wrk`, `hey`, `locust` | Request‑level latency, QPS, tail latencies |
**Action:** Run a few representative payloads (batch‑size = 1, typical batch size, worst‑case input) and record the numbers. This gives you a concrete target (e.g., “current 95th‑pct latency = 210 ms, we need < 80 ms”).
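The "record the numbers" step needs no profiler at all to get started; a stdlib-only harness that collects per-request latencies and reports the percentiles mentioned above (the lambda below is a stand-in for your real inference call):

```python
import statistics, time

def measure_latency(fn, payloads, warmup=3, runs=50):
    """Call fn on representative payloads and report latency percentiles in ms."""
    for p in payloads[:warmup]:
        fn(p)  # warm-up: exclude cold-start effects from the measurement
    samples = []
    for i in range(runs):
        p = payloads[i % len(payloads)]
        start = time.perf_counter()
        fn(p)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return {
        "p50_ms": samples[len(samples) // 2],
        "p95_ms": samples[int(len(samples) * 0.95) - 1],
        "mean_ms": statistics.mean(samples),
    }

# Stand-in "model": replace the lambda with your real inference call
stats = measure_latency(lambda x: sum(range(x)), payloads=[10_000, 50_000])
```

Per-layer profilers (`torch.profiler`, TensorBoard, `perf`) then tell you *where* inside that p95 the time goes.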
---
## 2️⃣ Model‑Level Optimizations
These techniques change the *graph* or *weights* of the model. They are usually the first place you get the biggest win with minimal impact on the serving stack.
| Technique | What it does | Typical speed‑up | When to use | Alternatives |
|-----------|--------------|------------------|-------------|--------------|
| **Quantization** (int8, fp16, bfloat16) | Reduces bit‑width of weights/activations → smaller memory, faster arithmetic on supported HW | 2‑4× on GPU/CPU, up to 10× on specialized ASICs (e.g., EdgeTPU) | Model is tolerant to reduced precision (most CNNs, Transformers after fine‑tuning) | Dynamic quantization, static quantization, QAT (Quantization‑Aware Training) |
| **Pruning / Sparsity** | Zeroes out unimportant weights → fewer MACs, can be exploited by sparse kernels | 1.5‑2× on hardware that supports sparsity (e.g., NVIDIA Ampere’s 2:4 structured sparsity) | You have a lot of over‑parameterization, can re‑train/fine‑tune after pruning | Structured pruning (filter/channel) vs. unstructured |
| **Knowledge Distillation** | Train a small “student” to mimic a large “teacher” → smaller architecture | 2‑10× depending on size reduction | Need to preserve accuracy, have data for distillation | Teacher‑free distillation (self‑distill), early‑exit models |
| **Operator Fusion / Graph Rewrites** | Merge adjacent ops (e.g., Conv+BatchNorm+ReLU) into a single kernel | 1.2‑1.5×, reduces memory traffic | Most frameworks already do this, but custom ops may need manual fusion | Use compiler stacks like TVM, TensorRT, XLA |
| **Neural Architecture Search (NAS) for latency** | Search for architecture that meets a latency budget on target HW | Up to 5× if you start from a generic model | Expensive compute budget, usually done offline | Manual redesign, MobileNet‑style depthwise separable convs |
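The arithmetic behind the int8 quantization row is easy to see in isolation: floats are mapped to 8-bit integers via a scale and a zero-point. A dependency-free sketch of that affine mapping (the helper names are illustrative, not a library API):

```python
def quantize_int8(values):
    """Affine int8 quantization: x_q = clamp(round(x / scale) + zero_point)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0           # map the observed range onto 256 levels
    zero_point = round(-128 - lo / scale)    # integer that represents x = lo
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats: x ≈ (x_q - zero_point) * scale."""
    return [(x - zero_point) * scale for x in q]

q, s, zp = quantize_int8([-1.0, 0.0, 0.5, 1.0])
approx = dequantize(q, s, zp)  # close to the original values, within one scale step
```

The speed-up comes from doing the matrix math on the int8 side; the small round-trip error is the accuracy cost the table refers to.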
### Quick‑Start Quantization Example (PyTorch → ONNX → TensorRT)
```python
# 1️⃣ Load a float32 model (e.g., ResNet50)
import torch, torchvision

model = torchvision.models.resnet50(pretrained=True).eval()

# 2️⃣ Fuse modules (conv+bn+relu) while in evaluation mode
model_fused = torch.quantization.fuse_modules(
    model,
    [['conv1', 'bn1', 'relu']],
    inplace=True
)

# 3️⃣ Prepare for static quantization
model_fused.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model_fused, inplace=True)

# 4️⃣ Calibrate with a few batches of real data
import torch.utils.data as data

calib_loader = data.DataLoader(
    torchvision.datasets.ImageNet(
        './data', split='val',
        transform=torchvision.transforms.Compose([
            torchvision.transforms.Resize(256),
            torchvision.transforms.CenterCrop(224),
            torchvision.transforms.ToTensor()
        ])),
    batch_size=32, shuffle=False, num_workers=4)

with torch.no_grad():
    for imgs, _ in calib_loader:
        model_fused(imgs)
        break  # a few batches are enough for calibration

# 5️⃣ Convert to int8
torch.quantization.convert(model_fused, inplace=True)

# 6️⃣ Export to ONNX (TensorRT can ingest int8 ONNX)
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model_fused, dummy, "resnet50_int8.onnx",
                  opset_version=13, input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={'input': {0: 'batch'}})

# 7️⃣ Build TensorRT engine (CLI example)
# trtexec --onnx=resnet50_int8.onnx --int8 --saveEngine=resnet50_int8.trt
```
**Why this path?**
- **PyTorch → ONNX** is framework‑agnostic, letting you later serve with TensorRT, ONNX Runtime, or any other runtime that supports int8.
- **Static quantization** (calibration) gives the best speed on CPUs/GPUs that have int8 kernels.
- **TensorRT** automatically fuses ops, applies layer‑wise kernel auto‑tuning, and can exploit NVIDIA’s structured sparsity if you also prune.
**Alternatives**
- **Dynamic quantization** (`torch.quantization.quantize_dynamic`) – easier, works on CPU only, no calibration needed.
- **Quantization‑Aware Training (QAT)** – if static quantization hurts accuracy > 1‑2 %, train with fake‑quant nodes to recover it.
- **BFloat16** – on newer GPUs (A100, H100) you can keep fp16‑compatible kernels but avoid the accuracy drop of int8.
---
## 3️⃣ Runtime & Hardware Leveraging
Even a perfectly optimized graph won’t be fast if the runtime or hardware isn’t tuned.
| Lever | What to do | Typical gain | When it matters | Caveats |
|------|------------|--------------|-----------------|---------|
| **GPU/TPU selection** | Use latest generation (e.g., NVIDIA A100, H100, or G4 for inference) | 1.5‑3× over older cards | High QPS workloads, large batch sizes | Cost & availability |
| **FP16 / BF16 inference** | Enable mixed‑precision kernels (`torch.cuda.amp.autocast`, TensorRT `--fp16`) | 2× on GPUs with Tensor Cores | Models that tolerate reduced precision | Slight accuracy loss on some layers |
| **Batching** | Accumulate multiple requests before sending to the model (dynamic or static batch) | 2‑10× throughput, latency may increase slightly | High QPS, latency budget > 10 ms | Must manage request queuing & timeout |
| **Async / Multi‑stream execution** | Use CUDA streams or async inference APIs (e.g., Triton `async_infer`) | Improves GPU utilization, reduces tail latency | Mixed CPU/GPU workloads | Complexity in code |
| **TensorRT / ONNX Runtime / TVM** | Deploy with a high‑performance runtime that does kernel auto‑tuning | 1.5‑3× over vanilla PyTorch/TensorFlow | Production services, need low latency | Requires model conversion (ONNX) |
| **Edge ASICs (EdgeTPU, NPU, FPGA)** | Compile model to specialized instruction set (e.g., Coral EdgeTPU) | 10‑30× for tiny models | On‑device inference, power‑constrained | Model size limits, quantization mandatory |
| **CPU vector extensions** | Enable AVX2/AVX‑512, use Intel OpenVINO, MKL‑DNN | 2‑4× on modern CPUs | When GPU not available or cost‑sensitive | Need to build with appropriate flags |
### Example: Deploying with NVIDIA Triton Inference Server (Batched FP16)
```protobuf
# model_repository/resnet50/config.pbtxt
# (Triton model configs live next to the version directories and use
#  protobuf text format, not YAML)
name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 32        # enables batching up to 32
dynamic_batching { }      # let Triton merge concurrent requests into a batch
input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```
```bash
# 1️⃣ Start Triton (Docker)
docker run --gpus all -p8000:8000 -p8001:8001 -p8002:8002 \
  -v$(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:24.09-py3 tritonserver \
  --model-repository=/models --strict-model-config=true
```
```python
# 2️⃣ Client (Python) – asynchronous batched requests
# Note: the plain tritonclient.http client is synchronous (its async_infer
# returns a future, not an awaitable); use the asyncio variant instead.
import asyncio
import numpy as np
import tritonclient.http.aio as httpclient

async def run(batches):
    client = httpclient.InferenceServerClient(url="localhost:8000")
    try:
        async def infer(image_np):
            inp = httpclient.InferInput("input", image_np.shape, "FP16")
            inp.set_data_from_numpy(image_np.astype(np.float16))
            result = await client.infer(model_name="resnet50", inputs=[inp])
            return result.as_numpy("output")
        # Issue all requests concurrently; Triton batches them server-side
        return await asyncio.gather(*(infer(b) for b in batches))
    finally:
        await client.close()

outputs = asyncio.run(run(list(my_batch_generator())))
```
**Why Triton?**
- Handles **dynamic batching** out‑of‑the‑box (set `max_batch_size` and add a `dynamic_batching` block to the model config).
- Offers **model versioning**, **metrics**, **GPU sharing**, and **multiple runtimes** (TensorRT, ONNX Runtime, PyTorch).
- Works well with **async client libraries**, giving you low tail‑latency without manual thread pools.
**Alternatives**
- **TorchServe** – simpler if you’re all‑PyTorch and need built‑in model‑versioning.
- **KFServing / Seldon Core** – Kubernetes‑native, good for multi‑framework ensembles.
- **Custom Flask/FastAPI + TorchScript** – fastest to prototype but you lose batching & auto‑tuning.
---
## 4️⃣ System‑Level & Architectural Tweaks
| Lever | Description | Typical effect | When to apply |
|------|-------------|----------------|---------------|
| **Cold‑start mitigation** | Keep a warm pool of workers, pre‑load model into GPU memory | Removes first‑request spikes (often > 500 ms) | Low‑traffic services, serverless |
| **Model‑sharding / Pipeline parallelism** | Split a huge model across multiple GPUs, or run pre‑/post‑processing on CPU | Can keep per‑GPU latency low while scaling throughput | Very large Transformers (BERT‑large, LLMs) |
| **Cache repeated inference** | Store embeddings or logits for identical inputs (e.g., recommendation candidate generation) | Near‑zero latency for cache hits | High request duplication, low‑entropy inputs |
| **Input preprocessing off‑load** | Use a separate microservice (or GPU) for image resizing, tokenization | Reduces request‑handler CPU load | When preprocessing dominates latency |
| **Network & Serialization** | Use gRPC + protobuf, enable HTTP/2, compress payloads | 10‑30 % latency reduction on the wire | High‑QPS, large payloads |
| **Autoscaling with latency‑based policies** | Scale out when 95th‑pct latency > target, scale in otherwise | Keeps latency within SLA while controlling cost | Cloud‑native deployments (K8s HPA, Cloud Run) |
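The "cache repeated inference" row from the table above is often just one decorator; a minimal sketch with a stand-in model call (`expensive_model` is hypothetical):

```python
import functools

CALLS = {"n": 0}

def expensive_model(text):
    """Stand-in for the real inference call; counts how often it actually runs."""
    CALLS["n"] += 1
    return text.upper()  # pretend these are logits/embeddings

@functools.lru_cache(maxsize=4096)  # near-zero latency for cache hits
def cached_predict(text):
    return expensive_model(text)

cached_predict("hello")
cached_predict("hello")  # duplicate input: served from the cache
```

This only pays off when inputs repeat (low-entropy traffic); for unbounded key spaces, swap `lru_cache` for an external store like Redis with a TTL.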
**Design Decision Example:**
Suppose you have a BERT‑base model serving 200 RPS with a 95th‑pct latency of 140 ms. After profiling you see:
- 30 ms spent in tokenization (CPU)
- 80 ms spent in the model (GPU)
- 30 ms in network/serialization
A good first step: **move tokenization to a separate async service** (e.g., FastAPI with `uvicorn --workers 4`). This can parallelize CPU work with GPU inference and often cuts total latency to ~110 ms without touching the model itself.
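The overlap described above (tokenize request i+1 while the model runs request i) can be sketched with plain asyncio, using `asyncio.sleep` as a stand-in for the 30 ms tokenizer and the 80 ms model:

```python
import asyncio

async def tokenize(text):
    await asyncio.sleep(0.03)  # stand-in for the 30 ms CPU tokenization stage
    return text.split()

async def infer(tokens):
    await asyncio.sleep(0.08)  # stand-in for the 80 ms GPU inference stage
    return len(tokens)

async def pipeline(requests):
    """Overlap stages: while the model runs request i, tokenize request i+1."""
    results = []
    next_tokens = await tokenize(requests[0])
    for i in range(len(requests)):
        infer_task = asyncio.create_task(infer(next_tokens))
        if i + 1 < len(requests):
            next_tokens = await tokenize(requests[i + 1])  # runs during inference
        results.append(await infer_task)
    return results

out = asyncio.run(pipeline(["a b", "c d e", "f"]))
```

After the first request, each additional request costs roughly max(tokenize, infer) instead of their sum, which is where the ~30 ms saving comes from.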
---
## 5️⃣ Validation & Continuous Monitoring
1. **Regression test** the new pipeline against the baseline (latency, throughput, accuracy).
2. **Canary deploy** the optimized model to a small traffic slice (e.g., 5 %).
3. **Monitor**:
- `latency_p95`, `latency_p99` (Prometheus metrics from Triton/TFServing)
- GPU memory/SM utilization (`nvml` exporter)
- Errors / fallback rate (if you have a fallback to float32)
If the canary meets SLA *and* the accuracy drop is within your tolerance (e.g., < 0.5 % top‑1), promote to production.
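The promotion rule above reduces to a one-line predicate; a sketch with hypothetical metric names:

```python
def should_promote(metrics, sla_p95_ms=80.0, max_top1_drop=0.005):
    """Promote the canary only if it meets the latency SLA and accuracy tolerance."""
    return (metrics["latency_p95_ms"] <= sla_p95_ms
            and metrics["top1_drop"] <= max_top1_drop)

ok = should_promote({"latency_p95_ms": 72.0, "top1_drop": 0.003})
```

In practice this check runs against the Prometheus metrics collected during the canary window.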
---
## 📋 TL;DR Checklist
| ✅ | Action | Tool / Code |
|---|--------|--------------|
| 1 | **Profile** real traffic → identify bottleneck | `torch.profiler`, `nvprof`, `wrk` |
| 2 | **Quantize** (int8) → export ONNX → TensorRT engine | PyTorch QAT / static quant, `trtexec` |
| 3 | **Prune** (structured) → fine‑tune | `torch.nn.utils.prune` |
| 4 | **Distill** to a smaller architecture if size matters | `torch.nn.KLDivLoss` + teacher logits |
| 5 | **Select runtime** that does kernel auto‑tuning | Triton, TensorRT, ONNX Runtime, TVM |
| 6 | **Enable batching** (static or dynamic) | Triton `max_batch_size`, TensorFlow `tf.function` with `batch_size` |
| 7 | **Use mixed‑precision** (FP16/BF16) on Tensor Cores | `torch.cuda.amp.autocast`, TensorRT `--fp16` |
| 8 | **Deploy** with warm workers, async client, and proper health checks | Docker + Triton, FastAPI for pre‑proc |
| 9 | **Monitor & Canary** → roll‑out when metrics meet SLA | Prometheus + Grafana, K8s HPA |
|10| **Iterate**: if latency still high, revisit architecture (e.g., MobileNetV3, EfficientNet, or a Transformer with early‑exit) | NAS, manual redesign |
---
### Final Thought
Speed is a **stacked problem**: you gain the most by first fixing the *slowest* layer (often the model itself), then tightening the *runtime* and *system* layers. A disciplined profiling‑first approach ensures you don’t waste engineering effort on a technique that yields only marginal gains for your particular workload.
Feel free to let me know which part of your stack you’d like to dive deeper into (e.g., “show me how to do quant‑aware training for a Vision Transformer” or “set up dynamic batching with FastAPI + TorchServe”), and I can provide a more focused code walk‑through.
----------
CPU times: user 8.04 ms, sys: 950 µs, total: 8.99 ms
Wall time: 23 s
5 - Try Your Queries
Go ahead and experiment with your own queries. Here are some to get you started:
Write python code to read a csv file
write a haiku about cats
[6]