Quick Answer: the RTX 5090 offers 32GB of VRAM and starts around $5,196. It delivers approximately 396 tokens/sec on bigcode/starcoder2-3b (estimated) and typically draws 575W under load.
This GPU delivers reliable throughput for local AI workloads. Pair it with the right model quantization to hit your target tokens/sec, and watch the price listings below to catch the best deal.
With 32GB of VRAM, the RTX 5090 can run models of up to approximately 80B parameters at 4-bit quantization. That covers most popular open models, including Mistral 7B and Llama 3 70B.
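As a rough sanity check on the VRAM figures in the tables below: a weights-only estimate is simply parameter count times bytes per parameter. This sketch is illustrative, not the site's exact methodology; the `overhead` factor is an assumption you can raise to budget for KV cache and activations, and actual fit also depends on quantization format and offloading.

```python
def estimate_vram_gb(params_billion: float, bits: int = 4, overhead: float = 1.0) -> float:
    """Weights-only VRAM estimate in GB.

    bits: quantization width (4 for Q4, 8 for Q8, 16 for FP16).
    overhead: multiplier for KV cache / activations (assumed; e.g. 1.2).
    """
    return params_billion * (bits / 8) * overhead

# A 34B model at Q4 needs about 17GB for weights, matching the table below.
print(estimate_vram_gb(34, bits=4))   # 17.0
# A 70B model at Q4 needs about 35GB for weights, more than this card's 32GB.
print(estimate_vram_gb(70, bits=4))   # 35.0
```

In practice, aggressive sub-4-bit quants or partial CPU offload are what make the largest models usable on a 32GB card.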
Alternatives: consider the RTX 4090 or RTX 6000 Ada. The 24GB Ada-generation RTX 4090 offers better power efficiency than older Ampere cards.
💡 Not ready to buy? Try cloud GPUs first
Test RTX 5090 performance in the cloud before investing in hardware. Pay by the hour with no commitment.
| Model | Quantization | Tokens/sec | VRAM used |
|---|---|---|---|
| bigcode/starcoder2-3b | Q4 | 395.58 tok/s | 2GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q4 | 394.56 tok/s | 1GB |
| LiquidAI/LFM2-1.2B | Q4 | 394.12 tok/s | 1GB |
| google-bert/bert-base-uncased | Q4 | 393.79 tok/s | 1GB |
| tencent/HunyuanOCR | Q4 | 393.19 tok/s | 1GB |
| google/gemma-2-2b-it | Q4 | 390.12 tok/s | 1GB |
| inference-net/Schematron-3B | Q4 | 389.59 tok/s | 2GB |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | 387.38 tok/s | 2GB |
| ibm-research/PowerMoE-3b | Q4 | 383.95 tok/s | 2GB |
| allenai/OLMo-2-0425-1B | Q4 | 382.28 tok/s | 1GB |
| meta-llama/Llama-Guard-3-1B | Q4 | 380.84 tok/s | 1GB |
| WeiboAI/VibeThinker-1.5B | Q4 | 376.34 tok/s | 1GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q4 | 371.04 tok/s | 2GB |
| meta-llama/Llama-3.2-1B-Instruct | Q4 | 369.32 tok/s | 1GB |
| Qwen/Qwen3-ASR-1.7B | Q4 | 367.70 tok/s | 2GB |
| nari-labs/Dia2-2B | Q4 | 366.36 tok/s | 2GB |
| Qwen/Qwen2.5-3B-Instruct | Q4 | 362.86 tok/s | 2GB |
| Qwen/Qwen2.5-3B | Q4 | 360.59 tok/s | 2GB |
| deepseek-ai/DeepSeek-OCR-2 | Q4 | 360.17 tok/s | 2GB |
| apple/OpenELM-1_1B-Instruct | Q4 | 359.92 tok/s | 1GB |
| deepseek-ai/DeepSeek-OCR | Q4 | 357.55 tok/s | 2GB |
| google/gemma-2b | Q4 | 355.97 tok/s | 1GB |
| google-t5/t5-3b | Q4 | 354.58 tok/s | 2GB |
| meta-llama/Llama-3.2-3B | Q4 | 349.12 tok/s | 2GB |
| unsloth/gemma-3-1b-it | Q4 | 343.32 tok/s | 1GB |
| unsloth/Llama-3.2-1B-Instruct | Q4 | 340.86 tok/s | 1GB |
| google/embeddinggemma-300m | Q4 | 336.60 tok/s | 1GB |
| google/gemma-3-1b-it | Q4 | 334.93 tok/s | 1GB |
| nineninesix/kani-tts-2-en | Q4 | 333.47 tok/s | 1GB |
| facebook/sam3 | Q4 | 332.76 tok/s | 1GB |
| ibm-granite/granite-3.3-2b-instruct | Q4 | 332.41 tok/s | 1GB |
| Qwen/Qwen3-0.6B-Base | Q4 | 329.64 tok/s | 3GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q4 | 329.16 tok/s | 4GB |
| meta-llama/Llama-Guard-3-8B | Q4 | 329.11 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q4 | 328.93 tok/s | 4GB |
| Nanbeige/Nanbeige4.1-3B | Q4 | 328.74 tok/s | 3GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q4 | 328.65 tok/s | 4GB |
| Qwen/Qwen3-4B-Thinking-2507 | Q4 | 328.58 tok/s | 2GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q4 | 328.24 tok/s | 2GB |
| FireRedTeam/FireRed-Image-Edit-1.0 | Q4 | 328.02 tok/s | 4GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q4 | 328.00 tok/s | 4GB |
| MiniMaxAI/MiniMax-M2.5 | Q4 | 327.59 tok/s | 4GB |
| microsoft/phi-4 | Q4 | 327.31 tok/s | 4GB |
| meta-llama/Llama-3.2-1B | Q4 | 326.61 tok/s | 1GB |
| tencent/HunyuanVideo-1.5 | Q4 | 326.43 tok/s | 4GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q4 | 326.24 tok/s | 2GB |
| microsoft/Phi-3.5-vision-instruct | Q4 | 325.38 tok/s | 4GB |
| nvidia/personaplex-7b-v1 | Q4 | 325.16 tok/s | 4GB |
| openai-community/gpt2-large | Q4 | 324.95 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q4 | 324.67 tok/s | 2GB |
Note: all performance figures above are calculated estimates, not measured benchmarks; real-world results may vary.
| Model | Quantization | Verdict | Estimated speed | VRAM needed |
|---|---|---|---|---|
| openai-community/gpt2 | Q8 | Fits comfortably | 219.09 tok/s | 7GB |
| openai-community/gpt2 | FP16 | Fits comfortably | 120.80 tok/s | 15GB |
| Qwen/Qwen2.5-7B-Instruct | Q4 | Fits comfortably | 292.61 tok/s | 4GB |
| Qwen/Qwen2.5-7B-Instruct | Q8 | Fits comfortably | 190.82 tok/s | 7GB |
| Qwen/Qwen2.5-7B-Instruct | FP16 | Fits comfortably | 111.68 tok/s | 15GB |
| Qwen/Qwen3-0.6B | Q4 | Fits comfortably | 292.73 tok/s | 3GB |
| Qwen/Qwen3-0.6B | Q8 | Fits comfortably | 202.05 tok/s | 6GB |
| Qwen/Qwen3-0.6B | FP16 | Fits comfortably | 117.98 tok/s | 13GB |
| Gensyn/Qwen2.5-0.5B-Instruct | Q4 | Fits comfortably | 288.38 tok/s | 3GB |
| Gensyn/Qwen2.5-0.5B-Instruct | Q8 | Fits comfortably | 211.59 tok/s | 5GB |
| Gensyn/Qwen2.5-0.5B-Instruct | FP16 | Fits comfortably | 113.11 tok/s | 11GB |
| meta-llama/Llama-3.1-8B-Instruct | Q4 | Fits comfortably | 282.68 tok/s | 4GB |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | Fits comfortably | 221.18 tok/s | 9GB |
| meta-llama/Llama-3.1-8B-Instruct | FP16 | Fits comfortably | 107.34 tok/s | 17GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q4 | Fits comfortably | 106.58 tok/s | 17GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q8 | Not supported | 74.38 tok/s | 35GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | FP16 | Not supported | 41.37 tok/s | 70GB |
| openai/gpt-oss-20b | Q4 | Fits comfortably | 156.71 tok/s | 10GB |
| openai/gpt-oss-20b | Q8 | Fits comfortably | 126.70 tok/s | 20GB |
| openai/gpt-oss-20b | FP16 | Not supported | 58.13 tok/s | 41GB |
| google/gemma-3-1b-it | Q4 | Fits comfortably | 334.93 tok/s | 1GB |
| google/gemma-3-1b-it | Q8 | Fits comfortably | 234.59 tok/s | 1GB |
| google/gemma-3-1b-it | FP16 | Fits comfortably | 132.99 tok/s | 2GB |
| Qwen/Qwen3-Embedding-0.6B | Q4 | Fits comfortably | 318.55 tok/s | 3GB |
| Qwen/Qwen3-Embedding-0.6B | Q8 | Fits comfortably | 212.04 tok/s | 6GB |
| Qwen/Qwen3-Embedding-0.6B | FP16 | Fits comfortably | 122.74 tok/s | 13GB |
| Qwen/Qwen2.5-1.5B-Instruct | Q4 | Fits comfortably | 307.95 tok/s | 3GB |
| Qwen/Qwen2.5-1.5B-Instruct | Q8 | Fits comfortably | 220.62 tok/s | 5GB |
| Qwen/Qwen2.5-1.5B-Instruct | FP16 | Fits comfortably | 120.79 tok/s | 11GB |
| facebook/opt-125m | Q4 | Fits comfortably | 320.49 tok/s | 4GB |
| facebook/opt-125m | Q8 | Fits comfortably | 222.22 tok/s | 7GB |
| facebook/opt-125m | FP16 | Fits comfortably | 120.48 tok/s | 15GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q4 | Fits comfortably | 394.56 tok/s | 1GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q8 | Fits comfortably | 234.41 tok/s | 1GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | FP16 | Fits comfortably | 134.28 tok/s | 2GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q4 | Fits comfortably | 307.74 tok/s | 4GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q8 | Fits comfortably | 226.70 tok/s | 7GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | FP16 | Fits comfortably | 113.49 tok/s | 15GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q4 | Fits comfortably | 326.24 tok/s | 2GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q8 | Fits comfortably | 227.93 tok/s | 4GB |
| Qwen/Qwen3-4B-Instruct-2507 | FP16 | Fits comfortably | 104.72 tok/s | 9GB |
| meta-llama/Llama-3.2-1B-Instruct | Q4 | Fits comfortably | 369.32 tok/s | 1GB |
| meta-llama/Llama-3.2-1B-Instruct | Q8 | Fits comfortably | 267.55 tok/s | 1GB |
| meta-llama/Llama-3.2-1B-Instruct | FP16 | Fits comfortably | 143.68 tok/s | 2GB |
| openai/gpt-oss-120b | Q4 | Not supported | 62.12 tok/s | 59GB |
| openai/gpt-oss-120b | Q8 | Not supported | 41.62 tok/s | 117GB |
| openai/gpt-oss-120b | FP16 | Not supported | 20.83 tok/s | 235GB |
| Qwen/Qwen2.5-3B-Instruct | Q4 | Fits comfortably | 362.86 tok/s | 2GB |
| Qwen/Qwen2.5-3B-Instruct | Q8 | Fits comfortably | 241.79 tok/s | 3GB |
| openai-community/gpt2 | Q4 | Fits comfortably | 318.80 tok/s | 4GB |
Note: all performance figures above are calculated estimates, not measured benchmarks; real-world results may vary.
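The verdict column above reduces to a simple capacity check: compare the model's estimated VRAM need against the card's 32GB. A minimal sketch, assuming the table's verdicts come from a plain threshold comparison (the site's exact rules are not published here):

```python
def fit_verdict(vram_needed_gb: float, vram_available_gb: float = 32) -> str:
    """Assumed logic: a model fits when its estimated VRAM need is at or
    below the card's capacity; otherwise it is not supported."""
    if vram_needed_gb <= vram_available_gb:
        return "Fits comfortably"
    return "Not supported"

# Mirrors the dolphin-2.9.1-yi-1.5-34b rows: 17GB at Q4 fits, 35GB at Q8 does not.
print(fit_verdict(17))   # Fits comfortably
print(fit_verdict(35))   # Not supported
```

A borderline model (for example, one needing exactly 32GB) would leave no headroom for context, so in practice you would want a few GB of slack beyond this check.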
Related comparisons: see how the RTX 5070, RTX 4060 Ti 16GB, RX 6800 XT, RTX 4070 Super, and RTX 3080 stack up for local inference workloads.