Quick Answer: The RTX 4090 offers 24GB of GDDR6X VRAM; street pricing fluctuates, so check current market listings. In our estimated benchmarks it reaches approximately 236 tokens/sec on facebook/sam3 at Q4, and it typically draws 450W under load.
The RTX 4090 remains the go-to GPU for local AI workloads. It fits mainstream models up to roughly 30B parameters at Q4 entirely in VRAM (70B models need aggressive quantization plus partial CPU offload), sustains some of the fastest consumer inference speeds available, and anchors premium builds that scale toward production deployments. A quick way to check whether a model fits is sketched below.
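Before downloading a model, you can sanity-check whether it fits in the 4090's 24GB by estimating the weight footprint from parameter count and quantization width. This is a rough rule of thumb, not a measured profile; the 20% overhead factor for KV cache and runtime buffers is an assumption, and real usage varies with context length and runtime.

```python
# Rough VRAM fit check: weights = params * bytes_per_param, plus an
# assumed ~20% overhead for KV cache and CUDA buffers.
BYTES_PER_PARAM = {"Q4": 0.5, "Q8": 1.0, "FP16": 2.0}

def fits_in_vram(params_billions: float, quant: str, vram_gb: float = 24.0) -> bool:
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    estimated_gb = weights_gb * 1.2  # assumed overhead factor
    print(f"{params_billions}B @ {quant}: ~{estimated_gb:.1f}GB needed")
    return estimated_gb <= vram_gb

fits_in_vram(8, "FP16")   # ~19.2GB -> fits
fits_in_vram(32, "Q4")    # ~19.2GB -> fits
fits_in_vram(70, "Q4")    # ~42.0GB -> exceeds 24GB, needs CPU offload
```

By this estimate, dense models up to roughly 30B fit at Q4, which matches the table below: 70B entries at Q4 land around 34GB and therefore spill past 24GB.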
All throughput figures below are estimates from auto-generated benchmarks, not measured runs.

| Model | Quantization | Tokens/sec (estimated) | VRAM used |
|---|---|---|---|
| facebook/sam3 | Q4 | 236.38 tok/s | 1GB |
| Qwen/Qwen2.5-3B | Q4 | 234.27 tok/s | 2GB |
| unsloth/Llama-3.2-1B-Instruct | Q4 | 233.46 tok/s | 1GB |
| Qwen/Qwen2.5-3B-Instruct | Q4 | 233.12 tok/s | 2GB |
| google/embeddinggemma-300m | Q4 | 232.70 tok/s | 1GB |
| google/gemma-2b | Q4 | 228.89 tok/s | 1GB |
| tencent/HunyuanOCR | Q4 | 228.49 tok/s | 1GB |
| unsloth/gemma-3-1b-it | Q4 | 227.22 tok/s | 1GB |
| meta-llama/Llama-3.2-1B | Q4 | 225.83 tok/s | 1GB |
| LiquidAI/LFM2-1.2B | Q4 | 224.36 tok/s | 1GB |
| WeiboAI/VibeThinker-1.5B | Q4 | 224.29 tok/s | 1GB |
| google/gemma-2-2b-it | Q4 | 223.85 tok/s | 1GB |
| deepseek-ai/DeepSeek-OCR | Q4 | 222.45 tok/s | 2GB |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | 222.08 tok/s | 2GB |
| meta-llama/Llama-Guard-3-1B | Q4 | 221.48 tok/s | 1GB |
| allenai/OLMo-2-0425-1B | Q4 | 219.94 tok/s | 1GB |
| meta-llama/Llama-3.2-1B-Instruct | Q4 | 217.03 tok/s | 1GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q4 | 216.90 tok/s | 2GB |
| bigcode/starcoder2-3b | Q4 | 216.49 tok/s | 2GB |
| google-bert/bert-base-uncased | Q4 | 210.99 tok/s | 1GB |
| google-t5/t5-3b | Q4 | 210.07 tok/s | 2GB |
| meta-llama/Llama-3.2-3B | Q4 | 210.07 tok/s | 2GB |
| ibm-research/PowerMoE-3b | Q4 | 208.48 tok/s | 2GB |
| ibm-granite/granite-3.3-2b-instruct | Q4 | 207.96 tok/s | 1GB |
| unsloth/Llama-3.2-3B-Instruct | Q4 | 207.77 tok/s | 2GB |
| google/gemma-3-1b-it | Q4 | 206.52 tok/s | 1GB |
| inference-net/Schematron-3B | Q4 | 201.03 tok/s | 2GB |
| nari-labs/Dia2-2B | Q4 | 200.46 tok/s | 2GB |
| deepseek-ai/DeepSeek-V3-0324 | Q4 | 197.87 tok/s | 4GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q4 | 197.50 tok/s | 1GB |
| MiniMaxAI/MiniMax-M2 | Q4 | 196.84 tok/s | 4GB |
| apple/OpenELM-1_1B-Instruct | Q4 | 195.34 tok/s | 1GB |
| Qwen/Qwen3-0.6B | Q4 | 195.26 tok/s | 3GB |
| meta-llama/Llama-2-7b-hf | Q4 | 195.21 tok/s | 4GB |
| microsoft/Phi-4-mini-instruct | Q4 | 195.16 tok/s | 4GB |
| microsoft/phi-2 | Q4 | 195.13 tok/s | 4GB |
| petals-team/StableBeluga2 | Q4 | 194.92 tok/s | 4GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q4 | 194.72 tok/s | 2GB |
| microsoft/Phi-3.5-vision-instruct | Q4 | 194.59 tok/s | 4GB |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | Q4 | 194.38 tok/s | 2GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q4 | 194.15 tok/s | 4GB |
| Qwen/Qwen2.5-7B-Instruct | Q4 | 193.62 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3.1 | Q4 | 192.09 tok/s | 4GB |
| Qwen/Qwen2.5-Math-1.5B | Q4 | 191.91 tok/s | 3GB |
| microsoft/phi-4 | Q4 | 191.74 tok/s | 4GB |
| Qwen/Qwen2.5-1.5B | Q4 | 191.19 tok/s | 3GB |
| openai-community/gpt2-large | Q4 | 190.83 tok/s | 4GB |
| openai-community/gpt2 | Q4 | 190.49 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | Q4 | 190.40 tok/s | 2GB |
| Qwen/Qwen2.5-Coder-7B-Instruct | Q4 | 190.28 tok/s | 4GB |
| Qwen/Qwen3-1.7B-Base | Q4 | 189.42 tok/s | 4GB |
| HuggingFaceH4/zephyr-7b-beta | Q4 | 189.35 tok/s | 4GB |
| vikhyatk/moondream2 | Q4 | 189.33 tok/s | 4GB |
| meta-llama/Meta-Llama-3-8B | Q4 | 189.29 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q4 | 189.02 tok/s | 4GB |
| openai-community/gpt2-medium | Q4 | 188.60 tok/s | 4GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q4 | 188.59 tok/s | 4GB |
| Qwen/Qwen3-Embedding-0.6B | Q4 | 188.31 tok/s | 3GB |
| skt/kogpt2-base-v2 | Q4 | 188.25 tok/s | 4GB |
| Qwen/Qwen3-4B-Base | Q4 | 187.88 tok/s | 2GB |
| hmellor/tiny-random-LlamaForCausalLM | Q4 | 187.78 tok/s | 4GB |
| trl-internal-testing/tiny-random-LlamaForCausalLM | Q4 | 187.77 tok/s | 4GB |
| zai-org/GLM-4.5-Air | Q4 | 187.76 tok/s | 4GB |
| mistralai/Mistral-7B-Instruct-v0.1 | Q4 | 187.29 tok/s | 4GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q4 | 187.10 tok/s | 4GB |
| dicta-il/dictalm2.0-instruct | Q4 | 186.96 tok/s | 4GB |
| openai-community/gpt2-xl | Q4 | 186.77 tok/s | 4GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | Q4 | 186.68 tok/s | 4GB |
| Qwen/Qwen2-7B-Instruct | Q4 | 186.64 tok/s | 4GB |
| microsoft/Phi-3.5-mini-instruct | Q4 | 186.36 tok/s | 2GB |
| microsoft/VibeVoice-1.5B | Q4 | 186.23 tok/s | 3GB |
| HuggingFaceTB/SmolLM-135M | Q4 | 186.01 tok/s | 4GB |
| tencent/HunyuanVideo-1.5 | Q4 | 185.62 tok/s | 4GB |
| allenai/Olmo-3-7B-Think | Q4 | 185.49 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q4 | 185.48 tok/s | 2GB |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q4 | 185.08 tok/s | 4GB |
| liuhaotian/llava-v1.5-7b | Q4 | 185.06 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q4 | 184.87 tok/s | 2GB |
| deepseek-ai/DeepSeek-R1 | Q4 | 184.80 tok/s | 4GB |
| bigscience/bloomz-560m | Q4 | 184.62 tok/s | 4GB |
| Qwen/Qwen2.5-0.5B | Q4 | 184.57 tok/s | 3GB |
| NousResearch/Meta-Llama-3.1-8B-Instruct | Q4 | 184.55 tok/s | 4GB |
| rinna/japanese-gpt-neox-small | Q4 | 184.42 tok/s | 4GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q4 | 184.41 tok/s | 4GB |
| distilbert/distilgpt2 | Q4 | 184.15 tok/s | 4GB |
| Qwen/Qwen2.5-0.5B-Instruct | Q4 | 184.14 tok/s | 3GB |
| Qwen/Qwen2.5-1.5B-Instruct | Q4 | 183.97 tok/s | 3GB |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | 183.67 tok/s | 2GB |
| meta-llama/Llama-3.1-8B | Q4 | 183.58 tok/s | 4GB |
| facebook/opt-125m | Q4 | 183.28 tok/s | 4GB |
| rednote-hilab/dots.ocr | Q4 | 183.17 tok/s | 4GB |
| ibm-granite/granite-3.3-8b-instruct | Q4 | 183.11 tok/s | 4GB |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q4 | 183.05 tok/s | 3GB |
| Qwen/Qwen3-Embedding-4B | Q4 | 182.60 tok/s | 2GB |
| meta-llama/Llama-2-7b-chat-hf | Q4 | 182.40 tok/s | 4GB |
| meta-llama/Llama-3.1-8B-Instruct | Q4 | 182.27 tok/s | 4GB |
| meta-llama/Meta-Llama-3-8B-Instruct | Q4 | 182.19 tok/s | 4GB |
| swiss-ai/Apertus-8B-Instruct-2509 | Q4 | 181.92 tok/s | 4GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q4 | 181.77 tok/s | 2GB |
| microsoft/DialoGPT-medium | Q4 | 181.41 tok/s | 4GB |
| google/gemma-3-270m-it | Q4 | 181.39 tok/s | 4GB |
| HuggingFaceTB/SmolLM2-135M | Q4 | 181.24 tok/s | 4GB |
| Tongyi-MAI/Z-Image-Turbo | Q4 | 181.06 tok/s | 4GB |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | Q4 | 180.49 tok/s | 4GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q4 | 180.48 tok/s | 4GB |
| huggyllama/llama-7b | Q4 | 180.27 tok/s | 4GB |
| black-forest-labs/FLUX.2-dev | Q4 | 180.14 tok/s | 4GB |
| sshleifer/tiny-gpt2 | Q4 | 179.43 tok/s | 4GB |
| Qwen/Qwen-Image-Edit-2509 | Q4 | 179.08 tok/s | 4GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q4 | 178.46 tok/s | 4GB |
| Qwen/Qwen3-1.7B | Q4 | 176.83 tok/s | 4GB |
| EleutherAI/gpt-neo-125m | Q4 | 176.74 tok/s | 4GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q4 | 176.70 tok/s | 2GB |
| Qwen/Qwen3-8B | Q4 | 176.20 tok/s | 4GB |
| black-forest-labs/FLUX.1-dev | Q4 | 176.12 tok/s | 4GB |
| numind/NuExtract-1.5 | Q4 | 175.87 tok/s | 4GB |
| parler-tts/parler-tts-large-v1 | Q4 | 175.26 tok/s | 4GB |
| microsoft/Phi-4-multimodal-instruct | Q4 | 174.99 tok/s | 4GB |
| ibm-granite/granite-docling-258M | Q4 | 174.50 tok/s | 4GB |
| GSAI-ML/LLaDA-8B-Instruct | Q4 | 174.37 tok/s | 4GB |
| Qwen/Qwen2.5-7B-Instruct | Q4 | 174.25 tok/s | 4GB |
| llamafactory/tiny-random-Llama-3 | Q4 | 173.66 tok/s | 4GB |
| lmsys/vicuna-7b-v1.5 | Q4 | 173.55 tok/s | 4GB |
| Qwen/Qwen3-0.6B-Base | Q4 | 173.10 tok/s | 3GB |
| EleutherAI/pythia-70m-deduped | Q4 | 172.40 tok/s | 4GB |
| microsoft/Phi-3-mini-4k-instruct | Q4 | 172.07 tok/s | 4GB |
| Qwen/Qwen2.5-Coder-1.5B | Q4 | 171.62 tok/s | 3GB |
| zai-org/GLM-4.6-FP8 | Q4 | 171.26 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-0528 | Q4 | 171.10 tok/s | 4GB |
| GSAI-ML/LLaDA-8B-Base | Q4 | 170.77 tok/s | 4GB |
| microsoft/DialoGPT-small | Q4 | 170.51 tok/s | 4GB |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q4 | 170.21 tok/s | 4GB |
| Qwen/Qwen2-0.5B | Q4 | 170.00 tok/s | 3GB |
| Qwen/Qwen2-1.5B-Instruct | Q4 | 169.97 tok/s | 3GB |
| IlyaGusev/saiga_llama3_8b | Q4 | 169.41 tok/s | 4GB |
| Qwen/Qwen3-8B-Base | Q4 | 168.70 tok/s | 4GB |
| Qwen/Qwen3-Embedding-8B | Q4 | 168.45 tok/s | 4GB |
| Qwen/Qwen3-4B-Thinking-2507 | Q4 | 168.29 tok/s | 2GB |
| deepseek-ai/DeepSeek-V3 | Q4 | 168.21 tok/s | 4GB |
| Qwen/Qwen3-8B-FP8 | Q4 | 167.45 tok/s | 4GB |
| microsoft/Phi-3.5-mini-instruct | Q4 | 167.44 tok/s | 4GB |
| mistralai/Mistral-7B-v0.1 | Q4 | 167.29 tok/s | 4GB |
| Qwen/Qwen2-0.5B-Instruct | Q4 | 167.08 tok/s | 3GB |
| mistralai/Mistral-7B-Instruct-v0.2 | Q4 | 166.05 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q4 | 165.82 tok/s | 3GB |
| ibm-granite/granite-3.3-2b-instruct | Q8 | 165.78 tok/s | 2GB |
| deepseek-ai/DeepSeek-OCR | Q8 | 165.23 tok/s | 4GB |
| Qwen/Qwen3-Reranker-0.6B | Q4 | 165.11 tok/s | 3GB |
| microsoft/Phi-3-mini-128k-instruct | Q4 | 164.87 tok/s | 4GB |
| meta-llama/Llama-Guard-3-8B | Q4 | 164.14 tok/s | 4GB |
| BSC-LT/salamandraTA-7b-instruct | Q4 | 164.06 tok/s | 4GB |
| Qwen/Qwen2.5-3B | Q8 | 163.95 tok/s | 3GB |
| Qwen/Qwen2.5-7B | Q4 | 163.72 tok/s | 4GB |
| Gensyn/Qwen2.5-0.5B-Instruct | Q4 | 162.66 tok/s | 3GB |
| google/embeddinggemma-300m | Q8 | 162.31 tok/s | 1GB |
| Qwen/Qwen3-4B | Q4 | 162.26 tok/s | 2GB |
| google-t5/t5-3b | Q8 | 161.59 tok/s | 3GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q8 | 161.08 tok/s | 1GB |
| meta-llama/Llama-3.2-3B | Q8 | 160.59 tok/s | 3GB |
| ibm-research/PowerMoE-3b | Q8 | 159.85 tok/s | 3GB |
| apple/OpenELM-1_1B-Instruct | Q8 | 158.99 tok/s | 1GB |
| meta-llama/Llama-3.2-1B | Q8 | 158.57 tok/s | 1GB |
| meta-llama/Llama-3.2-1B-Instruct | Q8 | 157.86 tok/s | 1GB |
| tencent/HunyuanOCR | Q8 | 157.55 tok/s | 2GB |
| WeiboAI/VibeThinker-1.5B | Q8 | 155.83 tok/s | 2GB |
| google-bert/bert-base-uncased | Q8 | 155.13 tok/s | 1GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q8 | 155.13 tok/s | 3GB |
| facebook/sam3 | Q8 | 153.17 tok/s | 1GB |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | 150.30 tok/s | 3GB |
| unsloth/Llama-3.2-1B-Instruct | Q8 | 149.29 tok/s | 1GB |
| google/gemma-2-9b-it | Q4 | 147.98 tok/s | 5GB |
| Qwen/Qwen2.5-14B-Instruct | Q4 | 147.71 tok/s | 8GB |
| bigcode/starcoder2-3b | Q8 | 147.29 tok/s | 3GB |
| LiquidAI/LFM2-1.2B | Q8 | 146.98 tok/s | 2GB |
| NousResearch/Hermes-3-Llama-3.1-8B | Q4 | 146.74 tok/s | 4GB |
| Qwen/Qwen2.5-3B-Instruct | Q8 | 146.36 tok/s | 3GB |
| Qwen/Qwen2.5-14B-Instruct | Q4 | 145.64 tok/s | 7GB |
| meta-llama/Llama-Guard-3-1B | Q8 | 145.31 tok/s | 1GB |
| nari-labs/Dia2-2B | Q8 | 145.10 tok/s | 3GB |
| google/gemma-2-2b-it | Q8 | 144.82 tok/s | 2GB |
| ai-forever/ruGPT-3.5-13B | Q4 | 144.64 tok/s | 7GB |
| google/gemma-3-1b-it | Q8 | 144.61 tok/s | 1GB |
| EssentialAI/rnj-1 | Q4 | 143.12 tok/s | 5GB |
| allenai/OLMo-2-0425-1B | Q8 | 142.91 tok/s | 1GB |
| meta-llama/Llama-2-13b-chat-hf | Q4 | 142.39 tok/s | 7GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q8 | 141.96 tok/s | 3GB |
| inference-net/Schematron-3B | Q8 | 140.67 tok/s | 3GB |
| unsloth/gemma-3-1b-it | Q8 | 139.27 tok/s | 1GB |
| google/gemma-2b | Q8 | 138.53 tok/s | 2GB |
| Qwen/Qwen-Image-Edit-2509 | Q8 | 138.50 tok/s | 8GB |
| Qwen/Qwen3-Embedding-0.6B | Q8 | 138.42 tok/s | 6GB |
| Tongyi-MAI/Z-Image-Turbo | Q8 | 138.18 tok/s | 8GB |
| Qwen/Qwen3-0.6B | Q8 | 138.13 tok/s | 6GB |
| OpenPipe/Qwen3-14B-Instruct | Q4 | 138.10 tok/s | 7GB |
| Qwen/Qwen3-8B-FP8 | Q8 | 137.69 tok/s | 9GB |
| liuhaotian/llava-v1.5-7b | Q8 | 137.42 tok/s | 7GB |
| meta-llama/Meta-Llama-3-8B | Q8 | 137.14 tok/s | 9GB |
| swiss-ai/Apertus-8B-Instruct-2509 | Q8 | 136.93 tok/s | 9GB |
| Qwen/Qwen2.5-0.5B | Q8 | 136.91 tok/s | 5GB |
| deepseek-ai/DeepSeek-R1-0528 | Q8 | 136.76 tok/s | 7GB |
| unsloth/Llama-3.2-3B-Instruct | Q8 | 136.58 tok/s | 3GB |
| Qwen/Qwen3-14B | Q4 | 136.55 tok/s | 7GB |
| microsoft/DialoGPT-medium | Q8 | 136.38 tok/s | 7GB |
| Qwen/Qwen2.5-7B | Q8 | 136.30 tok/s | 7GB |
| mistralai/Mistral-7B-Instruct-v0.2 | Q8 | 136.10 tok/s | 7GB |
| meta-llama/Llama-Guard-3-8B | Q8 | 136.04 tok/s | 9GB |
| EleutherAI/pythia-70m-deduped | Q8 | 135.90 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q8 | 135.87 tok/s | 7GB |
| Qwen/Qwen3-Embedding-8B | Q8 | 135.78 tok/s | 9GB |
| Qwen/Qwen3-Reranker-0.6B | Q8 | 135.65 tok/s | 6GB |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Q4 | 135.49 tok/s | 5GB |
| allenai/Olmo-3-7B-Think | Q8 | 135.05 tok/s | 8GB |
| HuggingFaceTB/SmolLM2-135M | Q8 | 134.92 tok/s | 7GB |
| petals-team/StableBeluga2 | Q8 | 134.57 tok/s | 7GB |
| google/gemma-3-270m-it | Q8 | 134.54 tok/s | 7GB |
| meta-llama/Llama-2-7b-hf | Q8 | 134.16 tok/s | 7GB |
| Qwen/Qwen2.5-7B-Instruct | Q8 | 133.43 tok/s | 8GB |
| zai-org/GLM-4.6-FP8 | Q8 | 133.34 tok/s | 7GB |
| deepseek-ai/DeepSeek-V3.1 | Q8 | 133.27 tok/s | 7GB |
| microsoft/Phi-3-medium-128k-instruct | Q4 | 133.07 tok/s | 7GB |
| Qwen/Qwen2.5-Math-1.5B | Q8 | 133.00 tok/s | 5GB |
| microsoft/Phi-3.5-mini-instruct | Q8 | 132.89 tok/s | 7GB |
| Qwen/Qwen3-14B-Base | Q4 | 132.51 tok/s | 7GB |
| sshleifer/tiny-gpt2 | Q8 | 132.47 tok/s | 7GB |
| Qwen/Qwen2.5-0.5B-Instruct | Q8 | 132.32 tok/s | 5GB |
| hmellor/tiny-random-LlamaForCausalLM | Q8 | 132.24 tok/s | 7GB |
| HuggingFaceTB/SmolLM-135M | Q8 | 132.09 tok/s | 7GB |
| bigscience/bloomz-560m | Q8 | 132.00 tok/s | 7GB |
| microsoft/phi-4 | Q8 | 131.74 tok/s | 7GB |
| llamafactory/tiny-random-Llama-3 | Q8 | 131.42 tok/s | 7GB |
| huggyllama/llama-7b | Q8 | 131.19 tok/s | 7GB |
| NousResearch/Meta-Llama-3.1-8B-Instruct | Q8 | 130.90 tok/s | 9GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q8 | 130.79 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | Q8 | 130.64 tok/s | 4GB |
| trl-internal-testing/tiny-random-LlamaForCausalLM | Q8 | 130.55 tok/s | 7GB |
| Qwen/Qwen2-1.5B-Instruct | Q8 | 130.09 tok/s | 5GB |
| openai-community/gpt2-xl | Q8 | 129.47 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q8 | 129.30 tok/s | 9GB |
| Qwen/Qwen3-1.7B | Q8 | 128.73 tok/s | 7GB |
| Qwen/Qwen2.5-14B | Q4 | 128.65 tok/s | 7GB |
| mistralai/Ministral-3-14B-Instruct-2512 | Q4 | 128.52 tok/s | 8GB |
| deepseek-ai/DeepSeek-V3 | Q8 | 128.47 tok/s | 7GB |
| Qwen/Qwen2-0.5B | Q8 | 128.40 tok/s | 5GB |
| MiniMaxAI/MiniMax-M2 | Q8 | 128.39 tok/s | 7GB |
| zai-org/GLM-4.5-Air | Q8 | 127.77 tok/s | 7GB |
| parler-tts/parler-tts-large-v1 | Q8 | 127.74 tok/s | 7GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q8 | 127.67 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q8 | 127.49 tok/s | 4GB |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q8 | 127.21 tok/s | 7GB |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | Q8 | 127.04 tok/s | 7GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q8 | 126.75 tok/s | 4GB |
| Qwen/Qwen2.5-1.5B | Q8 | 126.67 tok/s | 5GB |
| meta-llama/Llama-3.1-8B-Instruct | Q4 | 126.51 tok/s | 4GB |
| mistralai/Mistral-7B-Instruct-v0.1 | Q8 | 126.43 tok/s | 7GB |
| microsoft/Phi-3-mini-4k-instruct | Q8 | 126.12 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q8 | 125.73 tok/s | 5GB |
| ibm-granite/granite-docling-258M | Q8 | 125.49 tok/s | 7GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q8 | 125.42 tok/s | 9GB |
| distilbert/distilgpt2 | Q8 | 125.22 tok/s | 7GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q8 | 125.13 tok/s | 9GB |
| microsoft/VibeVoice-1.5B | Q8 | 125.03 tok/s | 5GB |
| microsoft/Phi-3.5-vision-instruct | Q8 | 124.61 tok/s | 7GB |
| lmsys/vicuna-7b-v1.5 | Q8 | 124.53 tok/s | 7GB |
| meta-llama/Llama-2-7b-chat-hf | Q8 | 124.49 tok/s | 7GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q8 | 124.34 tok/s | 9GB |
| tencent/HunyuanVideo-1.5 | Q8 | 124.24 tok/s | 8GB |
| microsoft/Phi-4-mini-instruct | Q8 | 124.00 tok/s | 7GB |
| EleutherAI/gpt-neo-125m | Q8 | 123.98 tok/s | 7GB |
| Qwen/Qwen2.5-1.5B-Instruct | Q8 | 123.81 tok/s | 5GB |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | 123.49 tok/s | 3GB |
| Qwen/Qwen3-Embedding-4B | Q8 | 122.70 tok/s | 4GB |
| Qwen/Qwen3-0.6B-Base | Q8 | 122.48 tok/s | 6GB |
| ibm-granite/granite-3.3-8b-instruct | Q8 | 122.19 tok/s | 9GB |
| GSAI-ML/LLaDA-8B-Base | Q8 | 122.09 tok/s | 9GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q8 | 121.99 tok/s | 7GB |
| vikhyatk/moondream2 | Q8 | 121.69 tok/s | 7GB |
| microsoft/DialoGPT-small | Q8 | 121.69 tok/s | 7GB |
| microsoft/Phi-4-multimodal-instruct | Q8 | 121.52 tok/s | 7GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | Q8 | 121.43 tok/s | 9GB |
| Qwen/Qwen3-8B-Base | Q8 | 120.67 tok/s | 9GB |
| black-forest-labs/FLUX.2-dev | Q8 | 120.66 tok/s | 8GB |
| Qwen/Qwen3-1.7B-Base | Q8 | 120.43 tok/s | 7GB |
| BSC-LT/salamandraTA-7b-instruct | Q8 | 120.07 tok/s | 7GB |
| Qwen/Qwen3-4B-Base | Q8 | 119.81 tok/s | 4GB |
| meta-llama/Llama-3.1-8B | Q8 | 119.64 tok/s | 9GB |
| Qwen/Qwen2-7B-Instruct | Q8 | 119.45 tok/s | 7GB |
| GSAI-ML/LLaDA-8B-Instruct | Q8 | 119.37 tok/s | 9GB |
| Qwen/Qwen3-4B | Q8 | 119.28 tok/s | 4GB |
| microsoft/Phi-3.5-mini-instruct | Q8 | 119.00 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3-0324 | Q8 | 118.86 tok/s | 7GB |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q8 | 118.60 tok/s | 7GB |
| meta-llama/Meta-Llama-3-8B-Instruct | Q8 | 118.50 tok/s | 9GB |
| Qwen/Qwen2.5-Coder-1.5B | Q8 | 118.28 tok/s | 5GB |
| microsoft/phi-2 | Q8 | 118.24 tok/s | 7GB |
| IlyaGusev/saiga_llama3_8b | Q8 | 118.10 tok/s | 9GB |
| mistralai/Mistral-7B-v0.1 | Q8 | 118.03 tok/s | 7GB |
| Gensyn/Qwen2.5-0.5B-Instruct | Q8 | 117.80 tok/s | 5GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q8 | 117.70 tok/s | 7GB |
| openai-community/gpt2 | Q8 | 117.59 tok/s | 7GB |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | Q8 | 117.32 tok/s | 4GB |
| Qwen/Qwen3-8B | Q8 | 116.60 tok/s | 9GB |
| facebook/opt-125m | Q8 | 116.45 tok/s | 7GB |
| Qwen/Qwen2-0.5B-Instruct | Q8 | 116.34 tok/s | 5GB |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q8 | 116.17 tok/s | 5GB |
| Qwen/Qwen2.5-Coder-7B-Instruct | Q8 | 116.02 tok/s | 7GB |
| openai-community/gpt2-large | Q8 | 115.53 tok/s | 7GB |
| Qwen/Qwen3-4B-Thinking-2507 | Q8 | 115.04 tok/s | 4GB |
| skt/kogpt2-base-v2 | Q8 | 114.96 tok/s | 7GB |
| dicta-il/dictalm2.0-instruct | Q8 | 114.87 tok/s | 7GB |
| microsoft/Phi-3-mini-128k-instruct | Q8 | 114.59 tok/s | 7GB |
| rednote-hilab/dots.ocr | Q8 | 114.48 tok/s | 7GB |
| rinna/japanese-gpt-neox-small | Q8 | 114.33 tok/s | 7GB |
| black-forest-labs/FLUX.1-dev | Q8 | 114.23 tok/s | 8GB |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | 114.22 tok/s | 9GB |
| HuggingFaceH4/zephyr-7b-beta | Q8 | 114.12 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1 | Q8 | 114.11 tok/s | 7GB |
| openai-community/gpt2-medium | Q8 | 113.88 tok/s | 7GB |
| numind/NuExtract-1.5 | Q8 | 113.79 tok/s | 7GB |
| Qwen/Qwen2.5-7B-Instruct | Q8 | 113.61 tok/s | 7GB |
| openai/gpt-oss-safeguard-20b | Q4 | 108.86 tok/s | 11GB |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | Q4 | 103.64 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B | Q4 | 102.52 tok/s | 15GB |
| Qwen/Qwen3-14B-Base | Q8 | 101.56 tok/s | 14GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | Q4 | 101.40 tok/s | 15GB |
| Qwen/Qwen2.5-14B-Instruct | Q8 | 101.38 tok/s | 14GB |
| Qwen/Qwen2.5-14B | Q8 | 101.33 tok/s | 14GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Q4 | 100.80 tok/s | 15GB |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | Q4 | 100.30 tok/s | 10GB |
| mistralai/Ministral-3-14B-Instruct-2512 | Q8 | 99.08 tok/s | 16GB |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q4 | 98.79 tok/s | 10GB |
| Qwen/Qwen2.5-14B-Instruct | Q8 | 98.63 tok/s | 15GB |
| google/gemma-2-27b-it | Q4 | 98.55 tok/s | 14GB |
| Qwen/Qwen3-14B | Q8 | 98.08 tok/s | 14GB |
| unsloth/gpt-oss-20b-BF16 | Q4 | 97.41 tok/s | 10GB |
| microsoft/Phi-3-medium-128k-instruct | Q8 | 97.30 tok/s | 14GB |
| NousResearch/Hermes-3-Llama-3.1-8B | Q8 | 96.97 tok/s | 9GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q4 | 96.32 tok/s | 15GB |
| google/gemma-2-9b-it | Q8 | 96.14 tok/s | 10GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | Q4 | 95.32 tok/s | 15GB |
| OpenPipe/Qwen3-14B-Instruct | Q8 | 95.19 tok/s | 14GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | Q4 | 93.30 tok/s | 15GB |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | 92.85 tok/s | 9GB |
| ai-forever/ruGPT-3.5-13B | Q8 | 92.51 tok/s | 13GB |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | Q4 | 92.02 tok/s | 15GB |
| EssentialAI/rnj-1 | Q8 | 91.89 tok/s | 10GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | Q4 | 91.46 tok/s | 15GB |
| mistralai/Mistral-Small-Instruct-2409 | Q4 | 90.52 tok/s | 11GB |
| Qwen/Qwen2.5-3B-Instruct | FP16 | 89.56 tok/s | 6GB |
| openai/gpt-oss-20b | Q4 | 89.49 tok/s | 10GB |
| google/gemma-2b | FP16 | 89.35 tok/s | 4GB |
| unsloth/Llama-3.2-3B-Instruct | FP16 | 89.02 tok/s | 6GB |
| bigcode/starcoder2-3b | FP16 | 88.81 tok/s | 6GB |
| meta-llama/Llama-3.2-1B | FP16 | 88.64 tok/s | 2GB |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Q8 | 88.09 tok/s | 10GB |
| meta-llama/Llama-Guard-3-1B | FP16 | 87.18 tok/s | 2GB |
| meta-llama/Llama-2-13b-chat-hf | Q8 | 86.72 tok/s | 13GB |
| LiquidAI/LFM2-1.2B | FP16 | 86.54 tok/s | 4GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | FP16 | 86.47 tok/s | 2GB |
| unsloth/gemma-3-1b-it | FP16 | 86.00 tok/s | 2GB |
| google/embeddinggemma-300m | FP16 | 85.90 tok/s | 1GB |
| ibm-research/PowerMoE-3b | FP16 | 85.85 tok/s | 6GB |
| allenai/OLMo-2-0425-1B | FP16 | 85.68 tok/s | 2GB |
| google-t5/t5-3b | FP16 | 83.79 tok/s | 6GB |
| deepseek-ai/DeepSeek-OCR | FP16 | 83.66 tok/s | 7GB |
| meta-llama/Llama-3.2-1B-Instruct | FP16 | 82.59 tok/s | 2GB |
| google/gemma-2-2b-it | FP16 | 82.56 tok/s | 4GB |
| unsloth/Llama-3.2-1B-Instruct | FP16 | 82.52 tok/s | 2GB |
| google/gemma-3-1b-it | FP16 | 82.49 tok/s | 2GB |
| inference-net/Schematron-3B | FP16 | 79.96 tok/s | 6GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | FP16 | 79.83 tok/s | 6GB |
| nari-labs/Dia2-2B | FP16 | 79.73 tok/s | 5GB |
| apple/OpenELM-1_1B-Instruct | FP16 | 79.66 tok/s | 2GB |
| meta-llama/Llama-3.2-3B | FP16 | 79.44 tok/s | 6GB |
| meta-llama/Llama-3.2-3B-Instruct | FP16 | 79.28 tok/s | 6GB |
| Qwen/Qwen2.5-3B | FP16 | 78.43 tok/s | 6GB |
| facebook/sam3 | FP16 | 78.17 tok/s | 2GB |
| WeiboAI/VibeThinker-1.5B | FP16 | 76.82 tok/s | 4GB |
| google-bert/bert-base-uncased | FP16 | 76.58 tok/s | 1GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | FP16 | 76.40 tok/s | 6GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | Q8 | 75.99 tok/s | 31GB |
| ibm-granite/granite-3.3-2b-instruct | FP16 | 75.98 tok/s | 4GB |
| meta-llama/Meta-Llama-3-8B-Instruct | FP16 | 75.24 tok/s | 17GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | FP16 | 75.04 tok/s | 9GB |
| openai-community/gpt2-large | FP16 | 74.89 tok/s | 15GB |
| black-forest-labs/FLUX.1-dev | FP16 | 74.80 tok/s | 16GB |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | Q8 | 74.79 tok/s | 31GB |
| meta-llama/Llama-2-7b-chat-hf | FP16 | 74.78 tok/s | 15GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | FP16 | 74.73 tok/s | 17GB |
| skt/kogpt2-base-v2 | FP16 | 74.62 tok/s | 15GB |
| Qwen/Qwen3-Embedding-0.6B | FP16 | 74.59 tok/s | 13GB |
| microsoft/Phi-3-mini-4k-instruct | FP16 | 74.57 tok/s | 15GB |
| tencent/HunyuanOCR | FP16 | 74.22 tok/s | 3GB |
| rednote-hilab/dots.ocr | FP16 | 74.06 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B | Q8 | 74.04 tok/s | 31GB |
| rinna/japanese-gpt-neox-small | FP16 | 73.97 tok/s | 15GB |
| dicta-il/dictalm2.0-instruct | FP16 | 73.91 tok/s | 15GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | FP16 | 73.65 tok/s | 17GB |
| IlyaGusev/saiga_llama3_8b | FP16 | 73.62 tok/s | 17GB |
| Qwen/Qwen3-4B-Thinking-2507 | FP16 | 73.61 tok/s | 9GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | FP16 | 73.52 tok/s | 15GB |
| unsloth/gpt-oss-20b-BF16 | Q8 | 73.52 tok/s | 20GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | FP16 | 73.45 tok/s | 11GB |
| hmellor/tiny-random-LlamaForCausalLM | FP16 | 73.35 tok/s | 15GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | FP16 | 73.34 tok/s | 17GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | Q8 | 73.20 tok/s | 31GB |
| openai/gpt-oss-20b | Q8 | 72.99 tok/s | 20GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | FP16 | 72.93 tok/s | 9GB |
| bigscience/bloomz-560m | FP16 | 72.84 tok/s | 15GB |
| meta-llama/Llama-2-7b-hf | FP16 | 72.34 tok/s | 15GB |
| numind/NuExtract-1.5 | FP16 | 72.10 tok/s | 15GB |
| allenai/Olmo-3-7B-Think | FP16 | 71.99 tok/s | 16GB |
| meta-llama/Meta-Llama-3-8B | FP16 | 71.94 tok/s | 17GB |
| microsoft/Phi-4-mini-instruct | FP16 | 71.79 tok/s | 15GB |
| Qwen/Qwen3-8B-FP8 | FP16 | 71.78 tok/s | 17GB |
| zai-org/GLM-4.5-Air | FP16 | 71.76 tok/s | 15GB |
| Qwen/Qwen3-1.7B-Base | FP16 | 71.72 tok/s | 15GB |
| deepseek-ai/DeepSeek-V3.1 | FP16 | 71.63 tok/s | 15GB |
| meta-llama/Llama-3.1-8B | FP16 | 71.54 tok/s | 17GB |
| Qwen/Qwen3-Embedding-8B | FP16 | 71.53 tok/s | 17GB |
| Qwen/Qwen3-1.7B | FP16 | 71.50 tok/s | 15GB |
| Qwen/Qwen2.5-7B-Instruct | FP16 | 71.47 tok/s | 16GB |
| meta-llama/Llama-3.1-8B-Instruct | FP16 | 71.28 tok/s | 17GB |
| microsoft/phi-4 | FP16 | 71.26 tok/s | 15GB |
| microsoft/Phi-3-mini-128k-instruct | FP16 | 71.09 tok/s | 15GB |
| mistralai/Mistral-7B-Instruct-v0.2 | FP16 | 71.07 tok/s | 15GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | FP16 | 71.00 tok/s | 9GB |
| tencent/HunyuanVideo-1.5 | FP16 | 70.94 tok/s | 16GB |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | FP16 | 70.75 tok/s | 15GB |
| Qwen/Qwen2.5-1.5B | FP16 | 70.66 tok/s | 11GB |
| Qwen/Qwen3-4B-Base | FP16 | 70.65 tok/s | 9GB |
| Qwen/Qwen2-0.5B-Instruct | FP16 | 70.54 tok/s | 11GB |
| Qwen/Qwen3-8B | FP16 | 70.46 tok/s | 17GB |
| Gensyn/Qwen2.5-0.5B-Instruct | FP16 | 70.42 tok/s | 11GB |
| Qwen/Qwen2.5-Math-1.5B | FP16 | 70.40 tok/s | 11GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q8 | 70.33 tok/s | 31GB |
| GSAI-ML/LLaDA-8B-Instruct | FP16 | 70.08 tok/s | 17GB |
| Qwen/Qwen3-0.6B-Base | FP16 | 69.93 tok/s | 13GB |
| deepseek-ai/DeepSeek-V3 | FP16 | 69.93 tok/s | 15GB |
| Qwen/Qwen2.5-Coder-7B-Instruct | FP16 | 69.88 tok/s | 15GB |
| Qwen/Qwen3-4B | FP16 | 69.87 tok/s | 9GB |
| black-forest-labs/FLUX.2-dev | FP16 | 69.82 tok/s | 16GB |
| mistralai/Mistral-7B-Instruct-v0.1 | FP16 | 69.61 tok/s | 15GB |
| microsoft/Phi-4-multimodal-instruct | FP16 | 69.38 tok/s | 15GB |
| Qwen/Qwen2.5-7B | FP16 | 69.01 tok/s | 15GB |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | Q8 | 68.96 tok/s | 20GB |
| facebook/opt-125m | FP16 | 68.90 tok/s | 15GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | FP16 | 68.88 tok/s | 9GB |
| EleutherAI/pythia-70m-deduped | FP16 | 68.78 tok/s | 15GB |
| microsoft/Phi-3.5-mini-instruct | FP16 | 68.60 tok/s | 8GB |
| deepseek-ai/deepseek-coder-33b-instruct | Q4 | 68.50 tok/s | 17GB |
| zai-org/GLM-4.6-FP8 | FP16 | 68.50 tok/s | 15GB |
| HuggingFaceH4/zephyr-7b-beta | FP16 | 68.46 tok/s | 15GB |
| swiss-ai/Apertus-8B-Instruct-2509 | FP16 | 68.28 tok/s | 17GB |
| microsoft/VibeVoice-1.5B | FP16 | 68.20 tok/s | 11GB |
| lmsys/vicuna-7b-v1.5 | FP16 | 68.00 tok/s | 15GB |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q8 | 67.94 tok/s | 20GB |
| sshleifer/tiny-gpt2 | FP16 | 67.87 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Q8 | 67.85 tok/s | 31GB |
| baichuan-inc/Baichuan-M2-32B | Q4 | 67.71 tok/s | 16GB |
| Qwen/Qwen3-Embedding-4B | FP16 | 67.66 tok/s | 9GB |
| llamafactory/tiny-random-Llama-3 | FP16 | 67.43 tok/s | 15GB |
| mistralai/Mistral-7B-v0.1 | FP16 | 67.42 tok/s | 15GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | FP16 | 67.40 tok/s | 17GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Q4 | 67.30 tok/s | 16GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | Q8 | 67.16 tok/s | 31GB |
| Qwen/Qwen2-7B-Instruct | FP16 | 67.12 tok/s | 15GB |
| HuggingFaceTB/SmolLM2-135M | FP16 | 66.95 tok/s | 15GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | FP16 | 66.85 tok/s | 15GB |
| Qwen/Qwen2.5-7B-Instruct | FP16 | 66.71 tok/s | 15GB |
| Qwen/Qwen3-Reranker-0.6B | FP16 | 66.68 tok/s | 13GB |
| Qwen/Qwen-Image-Edit-2509 | FP16 | 66.52 tok/s | 16GB |
| Qwen/Qwen2-0.5B | FP16 | 66.50 tok/s | 11GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | FP16 | 66.48 tok/s | 17GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | Q8 | 66.46 tok/s | 31GB |
| mistralai/Mistral-Small-Instruct-2409 | Q8 | 66.44 tok/s | 23GB |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | Q8 | 66.44 tok/s | 31GB |
| microsoft/Phi-3.5-mini-instruct | FP16 | 66.31 tok/s | 15GB |
| meta-llama/Llama-Guard-3-8B | FP16 | 66.23 tok/s | 17GB |
| HuggingFaceTB/SmolLM-135M | FP16 | 66.09 tok/s | 15GB |
| deepseek-ai/DeepSeek-R1 | FP16 | 66.02 tok/s | 15GB |
| NousResearch/Meta-Llama-3.1-8B-Instruct | FP16 | 65.90 tok/s | 17GB |
| ibm-granite/granite-docling-258M | FP16 | 65.75 tok/s | 15GB |
| openai-community/gpt2-xl | FP16 | 65.65 tok/s | 15GB |
| MiniMaxAI/MiniMax-M2 | FP16 | 65.61 tok/s | 15GB |
| Qwen/QwQ-32B-Preview | Q4 | 65.60 tok/s | 17GB |
| liuhaotian/llava-v1.5-7b | FP16 | 65.52 tok/s | 15GB |
| Qwen/Qwen3-0.6B | FP16 | 65.49 tok/s | 13GB |
| microsoft/DialoGPT-small | FP16 | 65.43 tok/s | 15GB |
| Qwen/Qwen2.5-32B-Instruct | Q4 | 65.37 tok/s | 16GB |
| microsoft/Phi-3.5-vision-instruct | FP16 | 65.36 tok/s | 15GB |
| Qwen/Qwen3-4B-Instruct-2507 | FP16 | 65.29 tok/s | 9GB |
| openai-community/gpt2 | FP16 | 65.15 tok/s | 15GB |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | FP16 | 65.15 tok/s | 11GB |
| Qwen/Qwen2.5-0.5B | FP16 | 64.92 tok/s | 11GB |
| deepseek-ai/DeepSeek-R1-0528 | FP16 | 64.85 tok/s | 15GB |
| ibm-granite/granite-3.3-8b-instruct | FP16 | 64.84 tok/s | 17GB |
| Qwen/Qwen2.5-32B | Q4 | 64.76 tok/s | 16GB |
| vikhyatk/moondream2 | FP16 | 64.66 tok/s | 15GB |
| GSAI-ML/LLaDA-8B-Base | FP16 | 64.66 tok/s | 17GB |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | FP16 | 64.52 tok/s | 15GB |
| microsoft/DialoGPT-medium | FP16 | 64.51 tok/s | 15GB |
| microsoft/phi-2 | FP16 | 64.47 tok/s | 15GB |
| meta-llama/Llama-3.3-70B-Instruct | Q4 | 64.32 tok/s | 34GB |
| distilbert/distilgpt2 | FP16 | 64.26 tok/s | 15GB |
| Qwen/Qwen2.5-Coder-32B-Instruct | Q4 | 63.85 tok/s | 17GB |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | FP16 | 63.79 tok/s | 15GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | FP16 | 63.22 tok/s | 15GB |
| Qwen/Qwen2.5-1.5B-Instruct | FP16 | 63.18 tok/s | 11GB |
| deepseek-ai/DeepSeek-V3-0324 | FP16 | 63.18 tok/s | 15GB |
| openai/gpt-oss-safeguard-20b | Q8 | 63.17 tok/s | 22GB |
| trl-internal-testing/tiny-random-LlamaForCausalLM | FP16 | 63.06 tok/s | 15GB |
| huggyllama/llama-7b | FP16 | 62.98 tok/s | 15GB |
| google/gemma-2-27b-it | Q8 | 62.88 tok/s | 28GB |
| Qwen/Qwen2.5-0.5B-Instruct | FP16 | 62.84 tok/s | 11GB |
| meta-llama/Llama-3.2-3B-Instruct | FP16 | 62.66 tok/s | 7GB |
| Qwen/Qwen2-1.5B-Instruct | FP16 | 62.48 tok/s | 11GB |
| google/gemma-3-270m-it | FP16 | 62.45 tok/s | 15GB |
| Qwen/Qwen3-8B-Base | FP16 | 62.36 tok/s | 17GB |
| parler-tts/parler-tts-large-v1 | FP16 | 62.31 tok/s | 15GB |
| openai-community/gpt2-medium | FP16 | 62.27 tok/s | 15GB |
| Qwen/Qwen2.5-Coder-1.5B | FP16 | 62.25 tok/s | 11GB |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | FP16 | 62.24 tok/s | 9GB |
| petals-team/StableBeluga2 | FP16 | 62.09 tok/s | 15GB |
| EleutherAI/gpt-neo-125m | FP16 | 62.03 tok/s | 15GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q4 | 62.02 tok/s | 17GB |
| Qwen/Qwen2.5-32B-Instruct | Q4 | 61.99 tok/s | 17GB |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | Q4 | 61.92 tok/s | 34GB |
| BSC-LT/salamandraTA-7b-instruct | FP16 | 61.74 tok/s | 15GB |
| Tongyi-MAI/Z-Image-Turbo | FP16 | 61.64 tok/s | 16GB |
| meta-llama/Meta-Llama-3-70B-Instruct | Q4 | 61.06 tok/s | 34GB |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | Q4 | 60.80 tok/s | 34GB |
| 01-ai/Yi-1.5-34B-Chat | Q4 | 59.89 tok/s | 18GB |
| moonshotai/Kimi-K2-Thinking | Q4 | 59.43 tok/s | 489GB |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q4 | 59.29 tok/s | 16GB |
| deepseek-ai/DeepSeek-V2.5 | Q4 | 59.05 tok/s | 328GB |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | Q4 | 58.79 tok/s | 25GB |
| meta-llama/Llama-3.1-70B-Instruct | Q4 | 58.53 tok/s | 34GB |
| Qwen/Qwen3-32B | Q4 | 57.55 tok/s | 16GB |
| codellama/CodeLlama-34b-hf | Q4 | 57.08 tok/s | 17GB |
| OpenPipe/Qwen3-14B-Instruct | FP16 | 55.83 tok/s | 29GB |
| microsoft/Phi-3-medium-128k-instruct | FP16 | 55.82 tok/s | 29GB |
| EssentialAI/rnj-1 | FP16 | 55.25 tok/s | 19GB |
| Qwen/Qwen2.5-14B | FP16 | 54.88 tok/s | 29GB |
| Qwen/Qwen2.5-14B-Instruct | FP16 | 54.68 tok/s | 30GB |
| NousResearch/Hermes-3-Llama-3.1-8B | FP16 | 53.39 tok/s | 17GB |
| meta-llama/Llama-2-13b-chat-hf | FP16 | 53.38 tok/s | 27GB |
| google/gemma-2-9b-it | FP16 | 52.83 tok/s | 20GB |
| Qwen/Qwen3-14B | FP16 | 52.24 tok/s | 29GB |
| Qwen/Qwen2.5-14B-Instruct | FP16 | 51.76 tok/s | 29GB |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | FP16 | 51.58 tok/s | 19GB |
| meta-llama/Llama-3.1-8B-Instruct | FP16 | 49.87 tok/s | 17GB |
| ai-forever/ruGPT-3.5-13B | FP16 | 49.59 tok/s | 27GB |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | Q8 | 47.64 tok/s | 50GB |
| Qwen/Qwen3-14B-Base | FP16 | 47.63 tok/s | 29GB |
| mistralai/Ministral-3-14B-Instruct-2512 | FP16 | 47.34 tok/s | 32GB |
| meta-llama/Llama-3.3-70B-Instruct | Q8 | 47.17 tok/s | 68GB |
| Qwen/Qwen3-32B | Q8 | 46.63 tok/s | 33GB |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Q4 | 46.37 tok/s | 69GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q8 | 45.42 tok/s | 35GB |
| meta-llama/Meta-Llama-3-70B-Instruct | Q8 | 44.99 tok/s | 68GB |
| moonshotai/Kimi-K2-Thinking | Q8 | 44.78 tok/s | 978GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Q8 | 44.77 tok/s | 33GB |
| Qwen/Qwen2.5-32B-Instruct | Q8 | 44.51 tok/s | 33GB |
| baichuan-inc/Baichuan-M2-32B | Q8 | 44.39 tok/s | 33GB |
| 01-ai/Yi-1.5-34B-Chat | Q8 | 43.98 tok/s | 35GB |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | Q8 | 43.94 tok/s | 68GB |
| codellama/CodeLlama-34b-hf | Q8 | 43.37 tok/s | 35GB |
| deepseek-ai/DeepSeek-V2.5 | Q8 | 43.35 tok/s | 656GB |
| meta-llama/Llama-3.1-70B-Instruct | Q8 | 43.34 tok/s | 68GB |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | Q8 | 43.18 tok/s | 68GB |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q8 | 43.03 tok/s | 33GB |
| Qwen/Qwen2.5-Coder-32B-Instruct | Q8 | 42.13 tok/s | 34GB |
| Qwen/Qwen2.5-32B-Instruct | Q8 | 42.09 tok/s | 34GB |
| Qwen/Qwen2.5-32B | Q8 | 41.03 tok/s | 33GB |
| deepseek-ai/deepseek-coder-33b-instruct | Q8 | 40.33 tok/s | 34GB |
| Qwen/QwQ-32B-Preview | Q8 | 40.29 tok/s | 34GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | FP16 | 40.09 tok/s | 61GB |
| mistralai/Mistral-Small-Instruct-2409 | FP16 | 39.77 tok/s | 46GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | FP16 | 39.63 tok/s | 61GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | FP16 | 38.57 tok/s | 61GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | Q4 | 38.42 tok/s | 39GB |
| google/gemma-2-27b-it | FP16 | 38.25 tok/s | 56GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | FP16 | 38.20 tok/s | 61GB |
| NousResearch/Hermes-3-Llama-3.1-70B | Q4 | 37.70 tok/s | 34GB |
| Qwen/Qwen2.5-Math-72B-Instruct | Q4 | 37.42 tok/s | 36GB |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | FP16 | 37.18 tok/s | 41GB |
| Qwen/Qwen2.5-72B-Instruct | Q4 | 37.11 tok/s | 35GB |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | FP16 | 37.02 tok/s | 41GB |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | FP16 | 36.99 tok/s | 61GB |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | Q4 | 36.61 tok/s | 34GB |
| openai/gpt-oss-20b | FP16 | 36.43 tok/s | 41GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | FP16 | 36.38 tok/s | 61GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct | Q4 | 36.22 tok/s | 39GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking | Q4 | 35.59 tok/s | 39GB |
| openai/gpt-oss-safeguard-20b | FP16 | 35.56 tok/s | 44GB |
| RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic | Q4 | 35.41 tok/s | 44GB |
| meta-llama/Llama-3.1-70B-Instruct | Q4 | 35.31 tok/s | 34GB |
| openai/gpt-oss-120b | Q4 | 35.17 tok/s | 59GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | FP16 | 35.15 tok/s | 61GB |
| AI-MO/Kimina-Prover-72B | Q4 | 34.86 tok/s | 35GB |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | FP16 | 34.78 tok/s | 61GB |
| Qwen/Qwen2.5-72B-Instruct | Q4 | 34.53 tok/s | 36GB |
| unsloth/gpt-oss-20b-BF16 | FP16 | 34.31 tok/s | 41GB |
| mistralai/Mistral-Large-Instruct-2411 | Q4 | 34.01 tok/s | 60GB |
| Qwen/Qwen3-30B-A3B | FP16 | 33.90 tok/s | 61GB |
| meta-llama/Llama-3.3-70B-Instruct | Q4 | 33.20 tok/s | 34GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | Q4 | 32.83 tok/s | 39GB |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Q8 | 30.80 tok/s | 138GB |
| deepseek-ai/DeepSeek-Math-V2 | Q4 | 27.36 tok/s | 383GB |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | Q8 | 26.84 tok/s | 69GB |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | Q4 | 26.69 tok/s | 115GB |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | FP16 | 26.30 tok/s | 66GB |
| 01-ai/Yi-1.5-34B-Chat | FP16 | 25.93 tok/s | 70GB |
| Qwen/Qwen2.5-72B-Instruct | Q8 | 25.85 tok/s | 71GB |
| Qwen/QwQ-32B-Preview | FP16 | 25.78 tok/s | 67GB |
| codellama/CodeLlama-34b-hf | FP16 | 25.78 tok/s | 70GB |
| Qwen/Qwen2.5-32B | FP16 | 25.71 tok/s | 66GB |
| meta-llama/Llama-3.1-70B-Instruct | FP16 | 25.71 tok/s | 137GB |
| meta-llama/Llama-3.3-70B-Instruct | Q8 | 25.63 tok/s | 69GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | FP16 | 25.57 tok/s | 66GB |
| deepseek-ai/deepseek-coder-33b-instruct | FP16 | 25.47 tok/s | 68GB |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | FP16 | 25.46 tok/s | 101GB |
| meta-llama/Llama-3.1-70B-Instruct | Q8 | 25.01 tok/s | 69GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | Q8 | 24.98 tok/s | 78GB |
| mistralai/Mistral-Large-Instruct-2411 | Q8 | 24.98 tok/s | 120GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | Q8 | 24.82 tok/s | 78GB |
| Qwen/Qwen2.5-Math-72B-Instruct | Q8 | 24.40 tok/s | 71GB |
| meta-llama/Llama-3.3-70B-Instruct | FP16 | 24.37 tok/s | 137GB |
| baichuan-inc/Baichuan-M2-32B | FP16 | 24.25 tok/s | 66GB |
| Qwen/Qwen2.5-Coder-32B-Instruct | FP16 | 23.88 tok/s | 67GB |
| openai/gpt-oss-120b | Q8 | 23.64 tok/s | 117GB |
| AI-MO/Kimina-Prover-72B | Q8 | 23.63 tok/s | 70GB |
| NousResearch/Hermes-3-Llama-3.1-70B | Q8 | 23.52 tok/s | 69GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking | Q8 | 23.42 tok/s | 78GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct | Q8 | 23.41 tok/s | 78GB |
| RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic | Q8 | 23.24 tok/s | 88GB |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | FP16 | 23.22 tok/s | 137GB |
| Qwen/Qwen2.5-72B-Instruct | Q8 | 23.15 tok/s | 70GB |
| moonshotai/Kimi-K2-Thinking | FP16 | 23.14 tok/sEstimated Auto-generated benchmark | 1956GB |
| Qwen/Qwen3-235B-A22B | Q4 | 23.11 tok/sEstimated Auto-generated benchmark | 115GB |
| Qwen/Qwen3-32B | FP16 | 23.11 tok/sEstimated Auto-generated benchmark | 66GB |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | FP16 | 23.08 tok/sEstimated Auto-generated benchmark | 137GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | FP16 | 22.80 tok/sEstimated Auto-generated benchmark | 70GB |
| Qwen/Qwen2.5-32B-Instruct | FP16 | 22.47 tok/sEstimated Auto-generated benchmark | 67GB |
| meta-llama/Meta-Llama-3-70B-Instruct | FP16 | 21.82 tok/sEstimated Auto-generated benchmark | 137GB |
| deepseek-ai/DeepSeek-V2.5 | FP16 | 21.74 tok/sEstimated Auto-generated benchmark | 1312GB |
| Qwen/Qwen2.5-32B-Instruct | FP16 | 21.59 tok/sEstimated Auto-generated benchmark | 66GB |
| MiniMaxAI/MiniMax-M1-40k | Q4 | 21.13 tok/sEstimated Auto-generated benchmark | 255GB |
| deepseek-ai/DeepSeek-Math-V2 | Q8 | 20.39 tok/sEstimated Auto-generated benchmark | 766GB |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | Q4 | 19.48 tok/sEstimated Auto-generated benchmark | 378GB |
| MiniMaxAI/MiniMax-VL-01 | Q4 | 19.47 tok/sEstimated Auto-generated benchmark | 256GB |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | Q8 | 19.32 tok/sEstimated Auto-generated benchmark | 231GB |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | FP16 | 19.27 tok/sEstimated Auto-generated benchmark | 275GB |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | Q8 | 16.03 tok/sEstimated Auto-generated benchmark | 755GB |
| RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic | FP16 | 15.04 tok/sEstimated Auto-generated benchmark | 176GB |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | FP16 | 14.90 tok/sEstimated Auto-generated benchmark | 138GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | FP16 | 14.84 tok/sEstimated Auto-generated benchmark | 156GB |
| meta-llama/Llama-3.3-70B-Instruct | FP16 | 14.41 tok/sEstimated Auto-generated benchmark | 138GB |
| MiniMaxAI/MiniMax-M1-40k | Q8 | 14.12 tok/sEstimated Auto-generated benchmark | 510GB |
| Qwen/Qwen3-235B-A22B | Q8 | 13.89 tok/sEstimated Auto-generated benchmark | 230GB |
| mistralai/Mistral-Large-Instruct-2411 | FP16 | 13.81 tok/sEstimated Auto-generated benchmark | 240GB |
| MiniMaxAI/MiniMax-VL-01 | Q8 | 13.80 tok/sEstimated Auto-generated benchmark | 511GB |
| openai/gpt-oss-120b | FP16 | 13.65 tok/sEstimated Auto-generated benchmark | 235GB |
| Qwen/Qwen2.5-Math-72B-Instruct | FP16 | 13.64 tok/sEstimated Auto-generated benchmark | 142GB |
| NousResearch/Hermes-3-Llama-3.1-70B | FP16 | 13.62 tok/sEstimated Auto-generated benchmark | 138GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct | FP16 | 13.42 tok/sEstimated Auto-generated benchmark | 156GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking | FP16 | 13.29 tok/sEstimated Auto-generated benchmark | 156GB |
| Qwen/Qwen2.5-72B-Instruct | FP16 | 13.25 tok/sEstimated Auto-generated benchmark | 142GB |
| Qwen/Qwen2.5-72B-Instruct | FP16 | 12.71 tok/sEstimated Auto-generated benchmark | 141GB |
| AI-MO/Kimina-Prover-72B | FP16 | 12.53 tok/sEstimated Auto-generated benchmark | 141GB |
| meta-llama/Llama-3.1-70B-Instruct | FP16 | 12.42 tok/sEstimated Auto-generated benchmark | 138GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | FP16 | 12.37 tok/sEstimated Auto-generated benchmark | 156GB |
| deepseek-ai/DeepSeek-Math-V2 | FP16 | 10.70 tok/sEstimated Auto-generated benchmark | 1532GB |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | FP16 | 10.25 tok/sEstimated Auto-generated benchmark | 461GB |
| MiniMaxAI/MiniMax-VL-01 | FP16 | 8.30 tok/sEstimated Auto-generated benchmark | 1021GB |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | FP16 | 8.19 tok/sEstimated Auto-generated benchmark | 1509GB |
| Qwen/Qwen3-235B-A22B | FP16 | 7.87 tok/sEstimated Auto-generated benchmark | 460GB |
| MiniMaxAI/MiniMax-M1-40k | FP16 | 7.71 tok/sEstimated Auto-generated benchmark | 1020GB |
Note: these throughput figures are calculated estimates, not measured benchmarks; real-world results may vary.
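Estimates of this kind typically come down to a capacity-and-bandwidth model: weight footprint decides fit, and memory bandwidth bounds decode speed. Below is a minimal sketch of that style of calculation, not the site's published methodology; the 1,008 GB/s bandwidth is the 4090's spec, but the overhead and efficiency factors are assumptions chosen for illustration.

```python
# Rough VRAM-fit and decode-speed estimator (illustrative simplification).
BYTES_PER_PARAM = {"Q4": 0.5, "Q8": 1.0, "FP16": 2.0}
RTX_4090_VRAM_GB = 24
RTX_4090_BANDWIDTH_GBS = 1008  # published memory bandwidth for the RTX 4090

def weight_gb(params_b: float, quant: str, overhead: float = 1.2) -> float:
    """Approximate footprint: raw weights plus ~20% for KV cache/runtime (assumed)."""
    return params_b * BYTES_PER_PARAM[quant] * overhead

def est_tok_per_sec(params_b: float, quant: str, efficiency: float = 0.6) -> float:
    """Bandwidth ceiling: decoding reads every weight once per token.
    The 0.6 efficiency factor is an assumption, not a measured constant."""
    return RTX_4090_BANDWIDTH_GBS * efficiency / (params_b * BYTES_PER_PARAM[quant])

for name, size_b in [("Llama-3.1-70B", 70.0), ("Qwen2.5-32B", 32.0)]:
    for quant in ("Q4", "Q8"):
        need = weight_gb(size_b, quant)
        fit = "fits" if need <= RTX_4090_VRAM_GB else "spills"
        print(f"{name} {quant}: ~{need:.0f}GB ({fit}), ceiling ~{est_tok_per_sec(size_b, quant):.0f} tok/s")
```

Played against the table, the same qualitative pattern emerges: 32B-class models fit at Q4 but not Q8, and anything 70B-class spills at every precision.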
| Model | Quantization | Verdict | Estimated speed | VRAM needed |
|---|---|---|---|---|
| mistralai/Mistral-Large-3-675B-Instruct-2512 | Q4 | Not supported | 19.48 tok/s (estimated) | 378GB (have 24GB) |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | Q8 | Not supported | 16.03 tok/s (estimated) | 755GB (have 24GB) |
| EssentialAI/rnj-1 | FP16 | Fits comfortably | 55.25 tok/s (estimated) | 19GB (have 24GB) |
| EssentialAI/rnj-1 | Q8 | Fits comfortably | 91.89 tok/s (estimated) | 10GB (have 24GB) |
| EssentialAI/rnj-1 | Q4 | Fits comfortably | 143.12 tok/s (estimated) | 5GB (have 24GB) |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | FP16 | Not supported | 8.19 tok/s (estimated) | 1509GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q4 | Fits comfortably | 93.14 tok/s (estimated) | 15GB (have 24GB) |
| Qwen/Qwen3-32B | Q8 | Not supported | 46.48 tok/s (estimated) | 33GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q8 | Not supported | 74.55 tok/s (estimated) | 31GB (have 24GB) |
| distilbert/distilgpt2 | Q8 | Fits comfortably | 124.15 tok/s (estimated) | 7GB (have 24GB) |
| meta-llama/Llama-3.2-1B | FP16 | Fits comfortably | 75.83 tok/s (estimated) | 2GB (have 24GB) |
| Qwen/Qwen3-32B | Q4 | Fits comfortably | 64.33 tok/s (estimated) | 16GB (have 24GB) |
| openai-community/gpt2-large | Q4 | Fits comfortably | 193.72 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen3-4B | FP16 | Fits comfortably | 64.90 tok/s (estimated) | 9GB (have 24GB) |
| openai/gpt-oss-120b | Q8 | Not supported | 24.86 tok/s (estimated) | 117GB (have 24GB) |
| distilbert/distilgpt2 | FP16 | Fits comfortably | 65.68 tok/s (estimated) | 15GB (have 24GB) |
| Qwen/Qwen2.5-1.5B-Instruct | Q8 | Fits comfortably | 114.55 tok/s (estimated) | 5GB (have 24GB) |
| google/gemma-3-1b-it | Q4 | Fits comfortably | 229.10 tok/s (estimated) | 1GB (have 24GB) |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | FP16 | Fits comfortably | 70.59 tok/s (estimated) | 15GB (have 24GB) |
| openai/gpt-oss-120b | Q4 | Not supported | 36.75 tok/s (estimated) | 59GB (have 24GB) |
| bigscience/bloomz-560m | FP16 | Fits comfortably | 72.64 tok/s (estimated) | 15GB (have 24GB) |
| distilbert/distilgpt2 | Q4 | Fits comfortably | 163.99 tok/s (estimated) | 4GB (have 24GB) |
| petals-team/StableBeluga2 | Q4 | Fits comfortably | 194.01 tok/s (estimated) | 4GB (have 24GB) |
| meta-llama/Llama-3.2-1B | Q8 | Fits comfortably | 149.59 tok/s (estimated) | 1GB (have 24GB) |
| Qwen/Qwen2.5-7B | FP16 | Fits comfortably | 63.39 tok/s (estimated) | 15GB (have 24GB) |
| Qwen/Qwen2.5-0.5B-Instruct | FP16 | Fits comfortably | 66.44 tok/s (estimated) | 11GB (have 24GB) |
| allenai/OLMo-2-0425-1B | FP16 | Fits comfortably | 84.55 tok/s (estimated) | 2GB (have 24GB) |
| microsoft/Phi-3-mini-4k-instruct | FP16 | Fits comfortably | 66.81 tok/s (estimated) | 15GB (have 24GB) |
| Qwen/Qwen3-1.7B | Q8 | Fits comfortably | 117.08 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen3-4B | Q8 | Fits comfortably | 113.63 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen2.5-1.5B-Instruct | FP16 | Fits comfortably | 63.38 tok/s (estimated) | 11GB (have 24GB) |
| meta-llama/Llama-3.1-8B-Instruct | Q4 | Fits comfortably | 181.98 tok/s (estimated) | 4GB (have 24GB) |
| Gensyn/Qwen2.5-0.5B-Instruct | Q4 | Fits comfortably | 177.88 tok/s (estimated) | 3GB (have 24GB) |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | Fits comfortably | 117.12 tok/s (estimated) | 9GB (have 24GB) |
| dphn/dolphin-2.9.1-yi-1.5-34b | FP16 | Not supported | 25.32 tok/s (estimated) | 70GB (have 24GB) |
| openai/gpt-oss-20b | FP16 | Not supported | 35.84 tok/s (estimated) | 41GB (have 24GB) |
| Qwen/Qwen3-Embedding-0.6B | Q4 | Fits comfortably | 175.72 tok/s (estimated) | 3GB (have 24GB) |
| Qwen/Qwen2.5-1.5B-Instruct | Q4 | Fits comfortably | 191.01 tok/s (estimated) | 3GB (have 24GB) |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q4 | Fits comfortably | 219.70 tok/s (estimated) | 1GB (have 24GB) |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q4 | Fits comfortably | 177.16 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen3-4B-Instruct-2507 | FP16 | Fits comfortably | 67.92 tok/s (estimated) | 9GB (have 24GB) |
| meta-llama/Llama-3.2-1B-Instruct | FP16 | Fits comfortably | 74.33 tok/s (estimated) | 2GB (have 24GB) |
| Qwen/Qwen2.5-3B-Instruct | Q8 | Fits comfortably | 154.51 tok/s (estimated) | 3GB (have 24GB) |
| bigscience/bloomz-560m | Q8 | Fits comfortably | 126.12 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen3-8B | Q8 | Fits comfortably | 130.66 tok/s (estimated) | 9GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | FP16 | Not supported | 25.95 tok/s (estimated) | 66GB (have 24GB) |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | Fits comfortably | 158.55 tok/s (estimated) | 3GB (have 24GB) |
| vikhyatk/moondream2 | FP16 | Fits comfortably | 61.72 tok/s (estimated) | 15GB (have 24GB) |
| petals-team/StableBeluga2 | FP16 | Fits comfortably | 74.22 tok/s (estimated) | 15GB (have 24GB) |
| meta-llama/Llama-3.2-1B | Q4 | Fits comfortably | 200.93 tok/s (estimated) | 1GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B | Q4 | Fits comfortably | 170.79 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen2.5-7B | Q8 | Fits comfortably | 133.48 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen2.5-0.5B-Instruct | Q4 | Fits comfortably | 191.88 tok/s (estimated) | 3GB (have 24GB) |
| Qwen/Qwen2.5-0.5B-Instruct | Q8 | Fits comfortably | 133.18 tok/s (estimated) | 5GB (have 24GB) |
| allenai/OLMo-2-0425-1B | Q4 | Fits comfortably | 206.77 tok/s (estimated) | 1GB (have 24GB) |
| allenai/OLMo-2-0425-1B | Q8 | Fits comfortably | 145.08 tok/s (estimated) | 1GB (have 24GB) |
| microsoft/Phi-3-mini-4k-instruct | Q4 | Fits comfortably | 195.39 tok/s (estimated) | 4GB (have 24GB) |
| microsoft/Phi-3-mini-4k-instruct | Q8 | Fits comfortably | 131.32 tok/s (estimated) | 7GB (have 24GB) |
| openai-community/gpt2-large | FP16 | Fits comfortably | 68.34 tok/s (estimated) | 15GB (have 24GB) |
| Qwen/Qwen3-1.7B | Q4 | Fits comfortably | 175.06 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen3-1.7B | FP16 | Fits comfortably | 65.21 tok/s (estimated) | 15GB (have 24GB) |
| Qwen/Qwen3-4B | Q4 | Fits comfortably | 182.78 tok/s (estimated) | 2GB (have 24GB) |
| Qwen/Qwen2.5-7B-Instruct | Q8 | Fits comfortably | 130.20 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen2.5-7B-Instruct | Q4 | Fits comfortably | 173.12 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen2.5-7B-Instruct | FP16 | Fits comfortably | 63.24 tok/s (estimated) | 15GB (have 24GB) |
| openai-community/gpt2 | Q8 | Fits comfortably | 121.24 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen3-0.6B | Q8 | Fits comfortably | 119.50 tok/s (estimated) | 6GB (have 24GB) |
| Qwen/Qwen3-0.6B | FP16 | Fits comfortably | 69.96 tok/s (estimated) | 13GB (have 24GB) |
| Gensyn/Qwen2.5-0.5B-Instruct | Q8 | Fits comfortably | 116.89 tok/s (estimated) | 5GB (have 24GB) |
| Gensyn/Qwen2.5-0.5B-Instruct | FP16 | Fits comfortably | 65.44 tok/s (estimated) | 11GB (have 24GB) |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q4 | Fits comfortably | 62.88 tok/s (estimated) | 17GB (have 24GB) |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q8 | Not supported | 46.64 tok/s (estimated) | 35GB (have 24GB) |
| openai/gpt-oss-20b | Q4 | Fits comfortably | 99.45 tok/s (estimated) | 10GB (have 24GB) |
| openai/gpt-oss-20b | Q8 | Fits comfortably | 69.96 tok/s (estimated) | 20GB (have 24GB) |
| google/gemma-3-1b-it | Q8 | Fits comfortably | 150.89 tok/s (estimated) | 1GB (have 24GB) |
| google/gemma-3-1b-it | FP16 | Fits comfortably | 81.32 tok/s (estimated) | 2GB (have 24GB) |
| Qwen/Qwen3-Embedding-0.6B | Q8 | Fits comfortably | 134.53 tok/s (estimated) | 6GB (have 24GB) |
| Qwen/Qwen3-Embedding-0.6B | FP16 | Fits comfortably | 69.15 tok/s (estimated) | 13GB (have 24GB) |
| facebook/opt-125m | Q8 | Fits comfortably | 114.97 tok/s (estimated) | 7GB (have 24GB) |
| facebook/opt-125m | FP16 | Fits comfortably | 64.85 tok/s (estimated) | 15GB (have 24GB) |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q8 | Fits comfortably | 157.92 tok/s (estimated) | 1GB (have 24GB) |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | FP16 | Fits comfortably | 78.24 tok/s (estimated) | 2GB (have 24GB) |
| Qwen/Qwen3-4B-Instruct-2507 | Q4 | Fits comfortably | 197.14 tok/s (estimated) | 2GB (have 24GB) |
| Qwen/Qwen3-4B-Instruct-2507 | Q8 | Fits comfortably | 115.87 tok/s (estimated) | 4GB (have 24GB) |
| meta-llama/Llama-3.2-1B-Instruct | Q4 | Fits comfortably | 233.68 tok/s (estimated) | 1GB (have 24GB) |
| meta-llama/Llama-3.2-1B-Instruct | Q8 | Fits comfortably | 162.50 tok/s (estimated) | 1GB (have 24GB) |
| openai/gpt-oss-120b | FP16 | Not supported | 14.41 tok/s (estimated) | 235GB (have 24GB) |
| Qwen/Qwen2.5-3B-Instruct | Q4 | Fits comfortably | 212.19 tok/s (estimated) | 2GB (have 24GB) |
| Qwen/Qwen2.5-3B-Instruct | FP16 | Fits comfortably | 77.81 tok/s (estimated) | 6GB (have 24GB) |
| bigscience/bloomz-560m | Q4 | Fits comfortably | 172.07 tok/s (estimated) | 4GB (have 24GB) |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q4 | Fits comfortably | 207.56 tok/s (estimated) | 2GB (have 24GB) |
| Qwen/Qwen3-8B | Q4 | Fits comfortably | 175.44 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen3-8B | FP16 | Fits comfortably | 69.02 tok/s (estimated) | 17GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Q8 | Not supported | 43.61 tok/s (estimated) | 33GB (have 24GB) |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | FP16 | Not supported | 21.72 tok/s (estimated) | 137GB (have 24GB) |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | Fits comfortably | 227.10 tok/s (estimated) | 2GB (have 24GB) |
| meta-llama/Llama-3.2-3B-Instruct | FP16 | Fits comfortably | 74.48 tok/s (estimated) | 6GB (have 24GB) |
| vikhyatk/moondream2 | Q4 | Fits comfortably | 162.49 tok/s (estimated) | 4GB (have 24GB) |
| petals-team/StableBeluga2 | Q8 | Fits comfortably | 123.96 tok/s (estimated) | 7GB (have 24GB) |
| openai-community/gpt2 | Q4 | Fits comfortably | 187.81 tok/s (estimated) | 4GB (have 24GB) |
Note: these throughput figures are calculated estimates, not measured benchmarks; real-world results may vary.
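The verdict column itself is mechanical: it compares the estimated footprint against the 4090's 24 GB. A minimal sketch, assuming the cutoff is pure capacity (which is consistent with the rows above, where 20GB fits and 31GB does not):

```python
def verdict(vram_needed_gb: float, vram_have_gb: float = 24.0) -> str:
    """Reproduce the table's two verdicts: a model fits if its estimated
    footprint is within the card's VRAM, otherwise it is not supported."""
    return "Fits comfortably" if vram_needed_gb <= vram_have_gb else "Not supported"

print(verdict(5.0))   # "Fits comfortably" -- e.g. EssentialAI/rnj-1 at Q4
print(verdict(33.0))  # "Not supported"    -- e.g. Qwen3-32B at Q8
```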
The answers below are data-backed, pulled from community benchmarks, manufacturer specs, and live pricing.
Community llama.cpp benchmarks of the ubergarm/Qwen3-30B-A3B-GGUF build show the RTX 4090 sustaining roughly 150–160 tokens/sec with CUDA kernels, keeping decode latency under 7 ms per token.
Source: Reddit – /r/LocalLLaMA (mq59v1k)
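Numbers like these are straightforward to reproduce locally with llama-bench, the benchmarking tool that ships with llama.cpp. A minimal invocation, sketched via Python (the GGUF filename below is a placeholder for whichever quant you downloaded):

```python
import subprocess

# llama-bench reports prompt-processing and token-generation speeds separately.
subprocess.run(
    [
        "llama-bench",
        "-m", "Qwen3-30B-A3B-Q4_K_M.gguf",  # placeholder path to your GGUF file
        "-ngl", "99",                        # offload all layers to the GPU
        "-p", "512",                         # prompt-processing test length
        "-n", "128",                         # token-generation test length
    ],
    check=True,
)
```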
No. A 24 GB RTX 4090 cannot hold Llama 3.1 70B at Q4_K_M entirely in VRAM: builders report roughly half the tensor pages spilling to system RAM, which drags throughput because PCIe becomes the bottleneck. Multi-GPU setups or 48 GB cards avoid the spill.
Source: Reddit – /r/LocalLLaMA (mqcouez)
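The spill is easy to sanity-check from weight size alone. A back-of-the-envelope sketch, assuming ~4.5 effective bits per weight for Q4_K_M (a typical figure, treated here as an assumption):

```python
params = 70e9                  # Llama 3.1 70B parameter count
bits_per_weight = 4.5          # typical effective Q4_K_M rate (assumption)
weight_gb = params * bits_per_weight / 8 / 1e9   # ~39 GB of weights alone
vram_gb = 24
spill = 1 - vram_gb / weight_gb
print(f"~{weight_gb:.0f} GB of weights; ~{spill:.0%} cannot stay on a 24 GB card")
```

Once the KV cache and runtime buffers claim their share of the 24 GB, the spilled fraction climbs toward the "roughly half" that builders report.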
Power users running multi-4090 racks note that a single 4090 comfortably hosts one 32B-class model; parallel agents or MoE workloads need tensor parallelism across multiple GPUs to keep speeds high.
Source: Reddit – /r/LocalLLaMA (mqwkgv3)
NVIDIA rates the RTX 4090 at 450 W board power and recommends at least an 850 W PSU with the 16-pin 12VHPWR connector to maintain headroom for AI workloads.
Source: TechPowerUp – RTX 4090 Specs
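If your PSU falls short of that recommendation, the board power limit can be lowered at runtime; nvidia-smi exposes this directly. A minimal sketch (the 350 W cap is only an example, and the command requires root):

```python
import subprocess

# Lower GPU 0's board power limit to 350 W (revert with "-pl", "450").
subprocess.run(["sudo", "nvidia-smi", "-i", "0", "-pl", "350"], check=True)
```

Throughput impact varies by workload, so benchmark before and after capping rather than assuming the tables above still apply.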
Our price tracker (Nov 2025) shows Amazon at $1,599 in stock.
Source: Supabase price tracker snapshot – 2025-11-03
Explore how RTX 4080 stacks up for local inference workloads.
Explore how RTX 4070 Ti stacks up for local inference workloads.
Explore how RTX 3090 stacks up for local inference workloads.
Explore how NVIDIA RTX 6000 Ada stacks up for local inference workloads.
Explore how RX 7900 XTX stacks up for local inference workloads.
Side-by-side VRAM, throughput, efficiency, and pricing benchmarks for both GPUs.