A New Era of Local Inference Begins
OpenAI’s breakthrough open-weight GPT-OSS models are now available with performance optimizations designed specifically for NVIDIA’s RTX and RTX PRO GPUs. This collaboration enables fast, on-device AI inference with no need for cloud access, letting developers and enthusiasts bring high-performance, intelligent applications directly to their desktops.
With models like GPT-OSS-20B and GPT-OSS-120B now available, users can harness the power of generative AI for reasoning tasks, code generation, research, and more — all accelerated locally by NVIDIA hardware.
Built for Developers, Powered by RTX
These models, based on the powerful mixture-of-experts (MoE) architecture, offer advanced features like instruction following, tool usage, and chain-of-thought reasoning. Supporting a context length of up to 131,072 tokens, they’re ideally suited for deep research, multi-document analysis, and complex agentic AI workflows.
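To put the MoE design in perspective, here is a quick back-of-the-envelope sketch using OpenAI’s reported parameter counts (roughly 21B total / 3.6B active for the 20B model and 117B total / 5.1B active for the 120B model, rounded for illustration), showing how small a slice of each model fires per token:

```python
# Back-of-the-envelope look at the MoE design: only a few experts fire
# per token, so active compute is a small fraction of the total weights.
# Parameter counts are OpenAI's reported figures, rounded for illustration.
MODELS = {
    "gpt-oss-20b":  {"total": 21e9,  "active": 3.6e9},
    "gpt-oss-120b": {"total": 117e9, "active": 5.1e9},
}

for name, p in MODELS.items():
    print(f"{name}: {p['active'] / p['total']:.1%} of weights active per token")
# gpt-oss-20b: 17.1% of weights active per token
# gpt-oss-120b: 4.4% of weights active per token
```

This is what lets a model with over a hundred billion parameters respond at interactive speeds: per-token compute scales with the active parameters, not the total.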
Optimized to run on RTX AI PCs and workstations, the models can now achieve up to 256 tokens per second on the GeForce RTX 5090. These optimizations extend across tools like Ollama, llama.cpp, and Microsoft AI Foundry Local, all designed to bring professional-grade inference into everyday computing.
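Throughput is easy to sanity-check on your own hardware. Ollama reports generated-token counts and generation timing with each response; a rough sketch, assuming the `ollama` Python package is installed and the `gpt-oss:20b` model has been pulled:

```python
# Rough local throughput check via the Ollama Python client.
# Assumes `pip install ollama`, a running Ollama server, and that
# `ollama pull gpt-oss:20b` has been run beforehand.
from ollama import chat

response = chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Explain CUDA Graphs in three sentences."}],
)

# Ollama reports generated tokens (eval_count) and generation time in
# nanoseconds (eval_duration) with each completed response.
print(f"{response.eval_count / response.eval_duration * 1e9:.1f} tokens/s")
```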
MXFP4 Precision Unlocks Performance Without Sacrificing Quality
These are also the first models on RTX to use the new MXFP4 precision format, which balances high output quality with significantly reduced memory and compute demands. This opens the door to advanced AI use on local machines without the resource burdens typically associated with large-scale models.
The 20B model runs comfortably on GeForce RTX GPUs with 16GB of VRAM, such as the RTX 4080, while the larger 120B model is best suited to professional GPUs like the RTX PRO 6000, in both cases with top-tier speed and efficiency.
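The arithmetic behind this is straightforward. MXFP4 packs weights as 4-bit values with one shared 8-bit scale per 32-element block, or roughly 4.25 bits per parameter; a rough sketch of the resulting weight footprint (real usage runs higher once KV cache, activations, and the layers kept in higher precision are counted):

```python
# Approximate weight footprint under MXFP4: 4-bit elements plus one
# shared 8-bit scale per 32-element block -> ~4.25 bits per parameter.
# Ignores KV cache, activations, and higher-precision layers, so real
# memory usage runs somewhat higher than these figures.
BITS_PER_PARAM = 4 + 8 / 32  # 4.25

for name, params in [("gpt-oss-20b", 21e9), ("gpt-oss-120b", 117e9)]:
    gib = params * BITS_PER_PARAM / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB of weights")
# gpt-oss-20b: ~10.4 GiB  -> fits a 16GB GeForce card
# gpt-oss-120b: ~57.9 GiB -> workstation-class GPU territory
```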
Ollama: The Simplest Path to Personal AI
For those eager to try OpenAI’s models with minimal setup, Ollama is the go-to solution. With native RTX optimization, its new app enables point-and-click interaction with the GPT-OSS models through a modern UI, letting users feed in PDFs, images, and large documents with ease while chatting naturally with the model.
Ollama’s interface also includes support for multimodal prompts and customizable context lengths, giving creators and professionals more control over how their AI responds and reasons.
Advanced users can tap into Ollama’s command-line interface or integrate it directly into their apps using the SDK, extending its power across development pipelines.
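As one integration sketch, here is streaming chat through the official `ollama` Python library; the model tag and the enlarged context window are illustrative settings, not requirements:

```python
# Streaming chat through the Ollama Python SDK, suitable for wiring
# into an application pipeline. Assumes `pip install ollama` and a
# local pull of gpt-oss:20b; num_ctx here is an illustrative setting.
from ollama import chat

stream = chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Summarize the trade-offs of MoE models."}],
    options={"num_ctx": 32768},  # enlarge the context window for long inputs
    stream=True,
)
for chunk in stream:
    print(chunk.message.content, end="", flush=True)
```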
More Tools, More Flexibility
Beyond Ollama, developers can explore GPT-OSS on RTX via:
- llama.cpp — with CUDA Graphs and low-latency enhancements tailored for NVIDIA GPUs (a minimal usage sketch follows this list)
- GGML tensor library — the community-driven library that underpins llama.cpp, with Tensor Core optimizations
- Microsoft AI Foundry Local — a robust, on-device inferencing toolkit for Windows, built on ONNX Runtime and CUDA
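As an example of the llama.cpp route, here is a minimal sketch using the llama-cpp-python bindings, assuming they were installed with CUDA support and that a GGUF build of GPT-OSS-20B has been downloaded locally (the model path is a placeholder):

```python
# Minimal llama.cpp usage via the llama-cpp-python bindings.
# Assumes the package was built with CUDA support and that a GGUF
# build of gpt-oss-20b sits at the (placeholder) path below.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/gpt-oss-20b.gguf",  # hypothetical local path
    n_gpu_layers=-1,  # offload every layer to the RTX GPU
    n_ctx=16384,      # context window; raise as VRAM allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
)
print(out["choices"][0]["message"]["content"])
```

Setting `n_gpu_layers=-1` offloads the full model to the GPU, which is where the CUDA Graphs optimizations mentioned above pay off.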
These tools give AI builders unprecedented flexibility, whether they’re building autonomous agents, coding assistants, research bots, or productivity apps — all running locally on AI PCs and workstations.
A Push Toward Local, Open Innovation
As OpenAI steps into the open-weight ecosystem with NVIDIA hardware acceleration behind it, developers worldwide now have access to state-of-the-art models without being tethered to the cloud.
The ability to run long-context models with high-speed output opens new possibilities in real-time document comprehension, enterprise chatbots, developer tooling, and creative applications — with full control and privacy.
NVIDIA’s continued support through resources like the RTX AI Garage and AI Blueprints means the community will keep seeing evolving tools, microservices, and deployment solutions to push local AI even further.