Quick Answer: The NVIDIA A5000 offers 24GB of VRAM and starts around $10.40. It delivers an estimated 162 tokens/sec on deepseek-ai/DeepSeek-OCR at Q4 and typically draws 230W under load.
This GPU offers reliable throughput for local AI workloads. Pair it with the right model quantization to hit your target tokens/sec (a quick VRAM estimator is sketched below), and monitor the prices below to catch the best deal.
Buy directly on Amazon with fast shipping and reliable customer service.
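How much VRAM a model needs is mostly a function of parameter count and quantization bit-width: weights take roughly params × bits/8 bytes, plus runtime overhead. The sketch below is a back-of-envelope estimate under that assumption, not the formula behind the table; `estimate_vram_gb` and its flat 1GB overhead allowance are illustrative, and real usage also grows with context length via the KV cache.

```python
# Rough VRAM estimate for a quantized model: weight memory plus a flat overhead.
# Assumed rule of thumb for illustration, not the table's actual generator.

BITS = {"Q4": 4, "Q8": 8, "FP16": 16}

def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Weight memory = params * (bits / 8) bytes; 1B params at 1 byte/param is ~1GB."""
    weight_gb = params_billions * BITS[quant] / 8
    return weight_gb + overhead_gb

if __name__ == "__main__":
    # An 8B model: ~5GB at Q4, ~9GB at Q8, ~17GB at FP16,
    # which lines up with the 8B rows in the table below.
    for quant in ("Q4", "Q8", "FP16"):
        print(quant, round(estimate_vram_gb(8, quant), 1), "GB")
```

On a 24GB card like the A5000, that rule of thumb puts 7B-8B models comfortably in range even at FP16, while 30B-class models need Q4 or Q8.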
All throughput and VRAM figures below are auto-generated estimates, not measured benchmarks; treat them as rough guidance. Rows whose VRAM requirement exceeds 24GB will not fit on a single A5000.

| Model | Quantization | Tokens/sec (estimated) | VRAM used |
|---|---|---|---|
| deepseek-ai/DeepSeek-OCR | Q4 | 162.44 tok/s | 2GB |
| meta-llama/Llama-3.2-1B | Q4 | 162.36 tok/s | 1GB |
| Qwen/Qwen2.5-3B | Q4 | 161.67 tok/s | 2GB |
| meta-llama/Llama-3.2-1B-Instruct | Q4 | 161.52 tok/s | 1GB |
| inference-net/Schematron-3B | Q4 | 161.12 tok/s | 2GB |
| unsloth/Llama-3.2-1B-Instruct | Q4 | 160.70 tok/s | 1GB |
| ibm-research/PowerMoE-3b | Q4 | 160.54 tok/s | 2GB |
| tencent/HunyuanOCR | Q4 | 159.81 tok/s | 1GB |
| apple/OpenELM-1_1B-Instruct | Q4 | 158.76 tok/s | 1GB |
| meta-llama/Llama-3.2-3B | Q4 | 157.65 tok/s | 2GB |
| meta-llama/Llama-Guard-3-1B | Q4 | 157.52 tok/s | 1GB |
| unsloth/gemma-3-1b-it | Q4 | 156.37 tok/s | 1GB |
| facebook/sam3 | Q4 | 154.92 tok/s | 1GB |
| google/embeddinggemma-300m | Q4 | 154.17 tok/s | 1GB |
| google-t5/t5-3b | Q4 | 154.08 tok/s | 2GB |
| allenai/OLMo-2-0425-1B | Q4 | 150.84 tok/s | 1GB |
| google-bert/bert-base-uncased | Q4 | 150.55 tok/s | 1GB |
| LiquidAI/LFM2-1.2B | Q4 | 150.55 tok/s | 1GB |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | 149.63 tok/s | 2GB |
| google/gemma-3-1b-it | Q4 | 149.36 tok/s | 1GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q4 | 148.84 tok/s | 1GB |
| google/gemma-2b | Q4 | 145.99 tok/s | 1GB |
| unsloth/Llama-3.2-3B-Instruct | Q4 | 142.08 tok/s | 2GB |
| bigcode/starcoder2-3b | Q4 | 141.44 tok/s | 2GB |
| Qwen/Qwen2.5-3B-Instruct | Q4 | 141.27 tok/s | 2GB |
| nari-labs/Dia2-2B | Q4 | 140.91 tok/s | 2GB |
| google/gemma-2-2b-it | Q4 | 140.87 tok/s | 1GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q4 | 140.60 tok/s | 2GB |
| ibm-granite/granite-3.3-2b-instruct | Q4 | 139.00 tok/s | 1GB |
| deepseek-ai/DeepSeek-R1-0528 | Q4 | 137.85 tok/s | 4GB |
| WeiboAI/VibeThinker-1.5B | Q4 | 137.81 tok/s | 1GB |
| Qwen/Qwen3-Embedding-4B | Q4 | 137.56 tok/s | 2GB |
| microsoft/Phi-3-mini-4k-instruct | Q4 | 137.30 tok/s | 4GB |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | Q4 | 136.81 tok/s | 4GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q4 | 136.54 tok/s | 2GB |
| meta-llama/Llama-2-7b-hf | Q4 | 136.22 tok/s | 4GB |
| BSC-LT/salamandraTA-7b-instruct | Q4 | 135.81 tok/s | 4GB |
| hmellor/tiny-random-LlamaForCausalLM | Q4 | 135.79 tok/s | 4GB |
| numind/NuExtract-1.5 | Q4 | 135.79 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3 | Q4 | 135.39 tok/s | 4GB |
| mistralai/Mistral-7B-v0.1 | Q4 | 135.29 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q4 | 134.93 tok/s | 4GB |
| Qwen/Qwen2-0.5B-Instruct | Q4 | 134.78 tok/s | 3GB |
| HuggingFaceTB/SmolLM2-135M | Q4 | 134.69 tok/s | 4GB |
| microsoft/DialoGPT-small | Q4 | 134.43 tok/s | 4GB |
| Qwen/Qwen3-Embedding-8B | Q4 | 134.39 tok/s | 4GB |
| HuggingFaceH4/zephyr-7b-beta | Q4 | 134.18 tok/s | 4GB |
| Qwen/Qwen3-4B-Base | Q4 | 133.95 tok/s | 2GB |
| ibm-granite/granite-docling-258M | Q4 | 133.81 tok/s | 4GB |
| swiss-ai/Apertus-8B-Instruct-2509 | Q4 | 133.78 tok/s | 4GB |
| vikhyatk/moondream2 | Q4 | 133.69 tok/s | 4GB |
| parler-tts/parler-tts-large-v1 | Q4 | 133.42 tok/s | 4GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q4 | 132.99 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q4 | 132.38 tok/s | 4GB |
| microsoft/phi-4 | Q4 | 132.38 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3.1 | Q4 | 132.27 tok/s | 4GB |
| Qwen/Qwen3-Reranker-0.6B | Q4 | 132.23 tok/s | 3GB |
| lmsys/vicuna-7b-v1.5 | Q4 | 132.19 tok/s | 4GB |
| HuggingFaceTB/SmolLM-135M | Q4 | 131.99 tok/s | 4GB |
| openai-community/gpt2 | Q4 | 131.99 tok/s | 4GB |
| microsoft/Phi-3-mini-128k-instruct | Q4 | 131.83 tok/s | 4GB |
| Qwen/Qwen2.5-1.5B | Q4 | 131.77 tok/s | 3GB |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | 131.49 tok/s | 2GB |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | Q4 | 131.30 tok/s | 2GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q4 | 131.06 tok/s | 4GB |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q4 | 130.82 tok/s | 3GB |
| meta-llama/Llama-Guard-3-8B | Q4 | 130.51 tok/s | 4GB |
| microsoft/Phi-3.5-mini-instruct | Q4 | 130.23 tok/s | 4GB |
| microsoft/VibeVoice-1.5B | Q4 | 129.95 tok/s | 3GB |
| microsoft/Phi-4-multimodal-instruct | Q4 | 129.38 tok/s | 4GB |
| Qwen/Qwen3-1.7B-Base | Q4 | 129.34 tok/s | 4GB |
| Qwen/Qwen2.5-Coder-1.5B | Q4 | 129.25 tok/s | 3GB |
| tencent/HunyuanVideo-1.5 | Q4 | 129.13 tok/s | 4GB |
| meta-llama/Meta-Llama-3-8B-Instruct | Q4 | 128.93 tok/s | 4GB |
| Qwen/Qwen2.5-Math-1.5B | Q4 | 128.44 tok/s | 3GB |
| IlyaGusev/saiga_llama3_8b | Q4 | 128.41 tok/s | 4GB |
| microsoft/Phi-4-mini-instruct | Q4 | 128.05 tok/s | 4GB |
| Qwen/Qwen3-8B-FP8 | Q4 | 127.80 tok/s | 4GB |
| Qwen/Qwen2.5-7B-Instruct | Q4 | 127.80 tok/s | 4GB |
| meta-llama/Meta-Llama-3-8B | Q4 | 126.95 tok/s | 4GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q4 | 126.90 tok/s | 2GB |
| Qwen/Qwen2.5-7B-Instruct | Q4 | 126.64 tok/s | 4GB |
| distilbert/distilgpt2 | Q4 | 126.36 tok/s | 4GB |
| rinna/japanese-gpt-neox-small | Q4 | 126.16 tok/s | 4GB |
| MiniMaxAI/MiniMax-M2 | Q4 | 126.09 tok/s | 4GB |
| openai-community/gpt2-large | Q4 | 125.76 tok/s | 4GB |
| liuhaotian/llava-v1.5-7b | Q4 | 125.46 tok/s | 4GB |
| bigscience/bloomz-560m | Q4 | 125.40 tok/s | 4GB |
| Qwen/Qwen3-0.6B-Base | Q4 | 125.23 tok/s | 3GB |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q4 | 125.22 tok/s | 4GB |
| ibm-granite/granite-3.3-8b-instruct | Q4 | 125.16 tok/s | 4GB |
| zai-org/GLM-4.5-Air | Q4 | 125.13 tok/s | 4GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q4 | 124.81 tok/s | 2GB |
| google/gemma-3-270m-it | Q4 | 124.75 tok/s | 4GB |
| dicta-il/dictalm2.0-instruct | Q4 | 124.74 tok/s | 4GB |
| Qwen/Qwen3-8B | Q4 | 124.69 tok/s | 4GB |
| llamafactory/tiny-random-Llama-3 | Q4 | 124.65 tok/s | 4GB |
| NousResearch/Meta-Llama-3.1-8B-Instruct | Q4 | 124.21 tok/s | 4GB |
| microsoft/Phi-3.5-vision-instruct | Q4 | 123.95 tok/s | 4GB |
| Qwen/Qwen2.5-1.5B-Instruct | Q4 | 123.80 tok/s | 3GB |
| Qwen/Qwen2.5-0.5B | Q4 | 123.62 tok/s | 3GB |
| Qwen/Qwen3-4B-Thinking-2507 | Q4 | 123.57 tok/s | 2GB |
| black-forest-labs/FLUX.1-dev | Q4 | 123.36 tok/s | 4GB |
| meta-llama/Llama-3.1-8B-Instruct | Q4 | 123.32 tok/s | 4GB |
| Qwen/Qwen3-1.7B | Q4 | 122.91 tok/s | 4GB |
| Qwen/Qwen2-1.5B-Instruct | Q4 | 122.90 tok/s | 3GB |
| mistralai/Mistral-7B-Instruct-v0.2 | Q4 | 122.89 tok/s | 4GB |
| sshleifer/tiny-gpt2 | Q4 | 122.83 tok/s | 4GB |
| petals-team/StableBeluga2 | Q4 | 122.36 tok/s | 4GB |
| meta-llama/Llama-3.1-8B | Q4 | 121.96 tok/s | 4GB |
| GSAI-ML/LLaDA-8B-Base | Q4 | 121.40 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3-0324 | Q4 | 121.31 tok/s | 4GB |
| GSAI-ML/LLaDA-8B-Instruct | Q4 | 121.10 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1 | Q4 | 120.79 tok/s | 4GB |
| meta-llama/Llama-2-7b-chat-hf | Q4 | 120.57 tok/s | 4GB |
| allenai/Olmo-3-7B-Think | Q4 | 120.37 tok/s | 4GB |
| EleutherAI/gpt-neo-125m | Q4 | 120.19 tok/s | 4GB |
| zai-org/GLM-4.6-FP8 | Q4 | 120.03 tok/s | 4GB |
| microsoft/Phi-3.5-mini-instruct | Q4 | 119.83 tok/s | 2GB |
| Qwen/Qwen2-0.5B | Q4 | 119.76 tok/s | 3GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q4 | 119.69 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q4 | 119.34 tok/s | 2GB |
| microsoft/DialoGPT-medium | Q4 | 118.62 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q4 | 118.51 tok/s | 3GB |
| Qwen/Qwen3-4B | Q4 | 118.26 tok/s | 2GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q4 | 117.72 tok/s | 2GB |
| Qwen/Qwen-Image-Edit-2509 | Q4 | 117.69 tok/s | 4GB |
| black-forest-labs/FLUX.2-dev | Q4 | 117.39 tok/s | 4GB |
| Qwen/Qwen2.5-7B | Q4 | 117.16 tok/s | 4GB |
| Qwen/Qwen2.5-0.5B-Instruct | Q4 | 117.09 tok/s | 3GB |
| openai-community/gpt2-medium | Q4 | 116.97 tok/s | 4GB |
| huggyllama/llama-7b | Q4 | 116.87 tok/s | 4GB |
| microsoft/phi-2 | Q4 | 116.75 tok/s | 4GB |
| Qwen/Qwen2.5-Coder-7B-Instruct | Q4 | 116.56 tok/s | 4GB |
| Qwen/Qwen3-Embedding-0.6B | Q4 | 116.41 tok/s | 3GB |
| Tongyi-MAI/Z-Image-Turbo | Q4 | 116.37 tok/s | 4GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q4 | 116.30 tok/s | 4GB |
| rednote-hilab/dots.ocr | Q4 | 116.24 tok/s | 4GB |
| trl-internal-testing/tiny-random-LlamaForCausalLM | Q4 | 116.17 tok/s | 4GB |
| Qwen/Qwen3-8B-Base | Q4 | 115.85 tok/s | 4GB |
| inference-net/Schematron-3B | Q8 | 115.72 tok/s | 3GB |
| Qwen/Qwen3-0.6B | Q4 | 115.39 tok/s | 3GB |
| unsloth/Llama-3.2-1B-Instruct | Q8 | 115.31 tok/s | 1GB |
| Qwen/Qwen2-7B-Instruct | Q4 | 115.29 tok/s | 4GB |
| google/embeddinggemma-300m | Q8 | 115.20 tok/s | 1GB |
| LiquidAI/LFM2-1.2B | Q8 | 115.04 tok/s | 2GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | Q4 | 114.97 tok/s | 4GB |
| openai-community/gpt2-xl | Q4 | 114.91 tok/s | 4GB |
| facebook/opt-125m | Q4 | 114.86 tok/s | 4GB |
| skt/kogpt2-base-v2 | Q4 | 114.78 tok/s | 4GB |
| mistralai/Mistral-7B-Instruct-v0.1 | Q4 | 114.52 tok/s | 4GB |
| meta-llama/Llama-Guard-3-1B | Q8 | 114.44 tok/s | 1GB |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q4 | 114.38 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | Q4 | 114.24 tok/s | 2GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q4 | 113.86 tok/s | 4GB |
| google/gemma-2b | Q8 | 113.68 tok/s | 2GB |
| EleutherAI/pythia-70m-deduped | Q4 | 113.24 tok/s | 4GB |
| Gensyn/Qwen2.5-0.5B-Instruct | Q4 | 112.95 tok/s | 3GB |
| Qwen/Qwen2.5-3B | Q8 | 112.70 tok/s | 3GB |
| ibm-research/PowerMoE-3b | Q8 | 111.44 tok/s | 3GB |
| google/gemma-2-2b-it | Q8 | 111.09 tok/s | 2GB |
| facebook/sam3 | Q8 | 110.39 tok/s | 1GB |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | 108.53 tok/s | 3GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q8 | 108.03 tok/s | 3GB |
| deepseek-ai/DeepSeek-OCR | Q8 | 107.87 tok/s | 4GB |
| meta-llama/Llama-3.2-1B | Q8 | 105.67 tok/s | 1GB |
| meta-llama/Llama-3.2-3B | Q8 | 105.28 tok/s | 3GB |
| apple/OpenELM-1_1B-Instruct | Q8 | 104.15 tok/s | 1GB |
| allenai/OLMo-2-0425-1B | Q8 | 104.01 tok/s | 1GB |
| google/gemma-3-1b-it | Q8 | 103.55 tok/s | 1GB |
| Qwen/Qwen3-14B | Q4 | 103.30 tok/s | 7GB |
| google-t5/t5-3b | Q8 | 103.27 tok/s | 3GB |
| WeiboAI/VibeThinker-1.5B | Q8 | 102.86 tok/s | 2GB |
| unsloth/gemma-3-1b-it | Q8 | 102.62 tok/s | 1GB |
| bigcode/starcoder2-3b | Q8 | 102.06 tok/s | 3GB |
| Qwen/Qwen2.5-14B-Instruct | Q4 | 101.73 tok/s | 7GB |
| ibm-granite/granite-3.3-2b-instruct | Q8 | 101.71 tok/s | 2GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q8 | 101.31 tok/s | 3GB |
| meta-llama/Llama-2-13b-chat-hf | Q4 | 100.89 tok/s | 7GB |
| meta-llama/Llama-3.1-8B-Instruct | Q4 | 100.76 tok/s | 4GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q8 | 100.69 tok/s | 1GB |
| meta-llama/Llama-3.2-1B-Instruct | Q8 | 100.32 tok/s | 1GB |
| nari-labs/Dia2-2B | Q8 | 99.87 tok/s | 3GB |
| Qwen/Qwen2.5-3B-Instruct | Q8 | 99.63 tok/s | 3GB |
| google/gemma-2-9b-it | Q4 | 99.14 tok/s | 5GB |
| unsloth/Llama-3.2-3B-Instruct | Q8 | 99.10 tok/s | 3GB |
| EssentialAI/rnj-1 | Q4 | 98.55 tok/s | 5GB |
| Qwen/Qwen2.5-14B-Instruct | Q4 | 98.17 tok/s | 8GB |
| google-bert/bert-base-uncased | Q8 | 97.53 tok/s | 1GB |
| tencent/HunyuanOCR | Q8 | 96.76 tok/s | 2GB |
| microsoft/Phi-3-medium-128k-instruct | Q4 | 96.48 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q8 | 96.44 tok/s | 7GB |
| vikhyatk/moondream2 | Q8 | 96.40 tok/s | 7GB |
| microsoft/Phi-3-mini-128k-instruct | Q8 | 96.13 tok/s | 7GB |
| GSAI-ML/LLaDA-8B-Instruct | Q8 | 96.09 tok/s | 9GB |
| mistralai/Ministral-3-14B-Instruct-2512 | Q4 | 96.05 tok/s | 8GB |
| distilbert/distilgpt2 | Q8 | 96.05 tok/s | 7GB |
| Qwen/Qwen2-0.5B | Q8 | 96.00 tok/s | 5GB |
| petals-team/StableBeluga2 | Q8 | 95.96 tok/s | 7GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q8 | 95.92 tok/s | 7GB |
| HuggingFaceTB/SmolLM2-135M | Q8 | 95.84 tok/s | 7GB |
| Qwen/Qwen2.5-Coder-7B-Instruct | Q8 | 95.82 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q8 | 95.77 tok/s | 9GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q8 | 95.60 tok/s | 7GB |
| black-forest-labs/FLUX.2-dev | Q8 | 95.28 tok/s | 8GB |
| deepseek-ai/DeepSeek-V3 | Q8 | 95.21 tok/s | 7GB |
| rednote-hilab/dots.ocr | Q8 | 95.12 tok/s | 7GB |
| Qwen/Qwen2.5-Coder-1.5B | Q8 | 94.87 tok/s | 5GB |
| Tongyi-MAI/Z-Image-Turbo | Q8 | 94.81 tok/s | 8GB |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q8 | 94.44 tok/s | 7GB |
| liuhaotian/llava-v1.5-7b | Q8 | 94.39 tok/s | 7GB |
| ibm-granite/granite-3.3-8b-instruct | Q8 | 94.28 tok/s | 9GB |
| Qwen/Qwen2.5-14B | Q4 | 94.17 tok/s | 7GB |
| microsoft/Phi-3.5-vision-instruct | Q8 | 94.09 tok/s | 7GB |
| Qwen/Qwen3-0.6B | Q8 | 94.08 tok/s | 6GB |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q8 | 93.73 tok/s | 7GB |
| NousResearch/Hermes-3-Llama-3.1-8B | Q4 | 93.55 tok/s | 4GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q8 | 93.40 tok/s | 9GB |
| Qwen/Qwen3-Reranker-0.6B | Q8 | 93.22 tok/s | 6GB |
| meta-llama/Llama-Guard-3-8B | Q8 | 92.79 tok/s | 9GB |
| zai-org/GLM-4.6-FP8 | Q8 | 92.68 tok/s | 7GB |
| Qwen/Qwen3-Embedding-8B | Q8 | 92.63 tok/s | 9GB |
| facebook/opt-125m | Q8 | 92.60 tok/s | 7GB |
| Qwen/Qwen2.5-1.5B-Instruct | Q8 | 92.47 tok/s | 5GB |
| microsoft/DialoGPT-small | Q8 | 92.47 tok/s | 7GB |
| numind/NuExtract-1.5 | Q8 | 92.29 tok/s | 7GB |
| Qwen/Qwen3-4B | Q8 | 92.09 tok/s | 4GB |
| microsoft/Phi-3.5-mini-instruct | Q8 | 92.05 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3-0324 | Q8 | 91.83 tok/s | 7GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q8 | 91.80 tok/s | 9GB |
| Qwen/Qwen3-8B-FP8 | Q8 | 91.76 tok/s | 9GB |
| lmsys/vicuna-7b-v1.5 | Q8 | 91.63 tok/s | 7GB |
| Gensyn/Qwen2.5-0.5B-Instruct | Q8 | 91.61 tok/s | 5GB |
| Qwen/Qwen-Image-Edit-2509 | Q8 | 91.31 tok/s | 8GB |
| Qwen/Qwen2.5-7B-Instruct | Q8 | 90.88 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1 | Q8 | 90.74 tok/s | 7GB |
| HuggingFaceH4/zephyr-7b-beta | Q8 | 90.68 tok/s | 7GB |
| microsoft/phi-4 | Q8 | 90.64 tok/s | 7GB |
| Qwen/Qwen2.5-7B | Q8 | 90.57 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q8 | 90.55 tok/s | 5GB |
| openai-community/gpt2-large | Q8 | 90.08 tok/s | 7GB |
| hmellor/tiny-random-LlamaForCausalLM | Q8 | 89.64 tok/s | 7GB |
| BSC-LT/salamandraTA-7b-instruct | Q8 | 89.49 tok/s | 7GB |
| Qwen/Qwen2.5-0.5B | Q8 | 89.31 tok/s | 5GB |
| microsoft/VibeVoice-1.5B | Q8 | 89.29 tok/s | 5GB |
| Qwen/Qwen3-4B-Thinking-2507 | Q8 | 89.07 tok/s | 4GB |
| meta-llama/Llama-2-7b-hf | Q8 | 88.86 tok/s | 7GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q8 | 88.81 tok/s | 9GB |
| Qwen/Qwen3-1.7B | Q8 | 88.56 tok/s | 7GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q8 | 88.29 tok/s | 4GB |
| Qwen/Qwen3-1.7B-Base | Q8 | 88.27 tok/s | 7GB |
| sshleifer/tiny-gpt2 | Q8 | 88.22 tok/s | 7GB |
| mistralai/Mistral-7B-v0.1 | Q8 | 88.15 tok/s | 7GB |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q8 | 88.04 tok/s | 5GB |
| microsoft/phi-2 | Q8 | 87.81 tok/s | 7GB |
| GSAI-ML/LLaDA-8B-Base | Q8 | 87.81 tok/s | 9GB |
| EleutherAI/pythia-70m-deduped | Q8 | 87.75 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-0528 | Q8 | 87.61 tok/s | 7GB |
| trl-internal-testing/tiny-random-LlamaForCausalLM | Q8 | 87.40 tok/s | 7GB |
| allenai/Olmo-3-7B-Think | Q8 | 87.11 tok/s | 8GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | Q8 | 86.84 tok/s | 4GB |
| huggyllama/llama-7b | Q8 | 86.80 tok/s | 7GB |
| microsoft/DialoGPT-medium | Q8 | 86.69 tok/s | 7GB |
| HuggingFaceTB/SmolLM-135M | Q8 | 86.65 tok/s | 7GB |
| ibm-granite/granite-docling-258M | Q8 | 86.58 tok/s | 7GB |
| OpenPipe/Qwen3-14B-Instruct | Q4 | 86.35 tok/s | 7GB |
| Qwen/Qwen2.5-0.5B-Instruct | Q8 | 86.25 tok/s | 5GB |
| Qwen/Qwen3-0.6B-Base | Q8 | 86.23 tok/s | 6GB |
| meta-llama/Meta-Llama-3-8B-Instruct | Q8 | 85.92 tok/s | 9GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q8 | 85.86 tok/s | 4GB |
| ai-forever/ruGPT-3.5-13B | Q4 | 85.66 tok/s | 7GB |
| Qwen/Qwen3-Embedding-4B | Q8 | 85.47 tok/s | 4GB |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | Q8 | 85.46 tok/s | 7GB |
| Qwen/Qwen3-14B-Base | Q4 | 85.38 tok/s | 7GB |
| microsoft/Phi-3-mini-4k-instruct | Q8 | 85.22 tok/s | 7GB |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Q4 | 85.02 tok/s | 5GB |
| dicta-il/dictalm2.0-instruct | Q8 | 84.94 tok/s | 7GB |
| black-forest-labs/FLUX.1-dev | Q8 | 84.93 tok/s | 8GB |
| openai-community/gpt2-medium | Q8 | 84.46 tok/s | 7GB |
| Qwen/Qwen3-Embedding-0.6B | Q8 | 84.34 tok/s | 6GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | Q8 | 84.02 tok/s | 9GB |
| MiniMaxAI/MiniMax-M2 | Q8 | 83.94 tok/s | 7GB |
| bigscience/bloomz-560m | Q8 | 83.85 tok/s | 7GB |
| zai-org/GLM-4.5-Air | Q8 | 83.55 tok/s | 7GB |
| google/gemma-3-270m-it | Q8 | 83.46 tok/s | 7GB |
| Qwen/Qwen2.5-7B-Instruct | Q8 | 83.42 tok/s | 8GB |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | Q8 | 83.33 tok/s | 4GB |
| Qwen/Qwen3-4B-Base | Q8 | 83.10 tok/s | 4GB |
| Qwen/Qwen2-7B-Instruct | Q8 | 82.97 tok/s | 7GB |
| Qwen/Qwen3-8B | Q8 | 82.78 tok/s | 9GB |
| skt/kogpt2-base-v2 | Q8 | 82.65 tok/s | 7GB |
| mistralai/Mistral-7B-Instruct-v0.2 | Q8 | 82.63 tok/s | 7GB |
| microsoft/Phi-3.5-mini-instruct | Q8 | 82.61 tok/s | 7GB |
| openai-community/gpt2 | Q8 | 82.07 tok/s | 7GB |
| microsoft/Phi-4-mini-instruct | Q8 | 82.04 tok/s | 7GB |
| NousResearch/Meta-Llama-3.1-8B-Instruct | Q8 | 81.91 tok/s | 9GB |
| EleutherAI/gpt-neo-125m | Q8 | 81.88 tok/s | 7GB |
| IlyaGusev/saiga_llama3_8b | Q8 | 81.54 tok/s | 9GB |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | 81.54 tok/s | 3GB |
| Qwen/Qwen3-8B-Base | Q8 | 81.36 tok/s | 9GB |
| meta-llama/Llama-2-7b-chat-hf | Q8 | 80.98 tok/s | 7GB |
| mistralai/Mistral-7B-Instruct-v0.1 | Q8 | 80.71 tok/s | 7GB |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | 80.70 tok/s | 9GB |
| tencent/HunyuanVideo-1.5 | Q8 | 80.62 tok/s | 8GB |
| parler-tts/parler-tts-large-v1 | Q8 | 80.53 tok/s | 7GB |
| rinna/japanese-gpt-neox-small | Q8 | 80.51 tok/s | 7GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q8 | 80.48 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q8 | 80.42 tok/s | 4GB |
| openai-community/gpt2-xl | Q8 | 80.31 tok/s | 7GB |
| Qwen/Qwen2-1.5B-Instruct | Q8 | 80.18 tok/s | 5GB |
| Qwen/Qwen2.5-Math-1.5B | Q8 | 80.13 tok/s | 5GB |
| Qwen/Qwen2.5-1.5B | Q8 | 80.12 tok/s | 5GB |
| microsoft/Phi-4-multimodal-instruct | Q8 | 80.12 tok/s | 7GB |
| deepseek-ai/DeepSeek-V3.1 | Q8 | 80.06 tok/s | 7GB |
| meta-llama/Llama-3.1-8B | Q8 | 79.84 tok/s | 9GB |
| Qwen/Qwen2-0.5B-Instruct | Q8 | 79.76 tok/s | 5GB |
| llamafactory/tiny-random-Llama-3 | Q8 | 79.55 tok/s | 7GB |
| swiss-ai/Apertus-8B-Instruct-2509 | Q8 | 79.28 tok/s | 9GB |
| meta-llama/Meta-Llama-3-8B | Q8 | 79.17 tok/s | 9GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | Q4 | 73.14 tok/s | 15GB |
| Qwen/Qwen2.5-14B-Instruct | Q8 | 72.16 tok/s | 15GB |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | Q4 | 72.14 tok/s | 15GB |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | Q4 | 71.32 tok/s | 10GB |
| google/gemma-2-9b-it | Q8 | 70.56 tok/s | 10GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | Q4 | 70.47 tok/s | 15GB |
| Qwen/Qwen3-14B | Q8 | 70.37 tok/s | 14GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | Q4 | 70.21 tok/s | 15GB |
| NousResearch/Hermes-3-Llama-3.1-8B | Q8 | 69.78 tok/s | 9GB |
| mistralai/Ministral-3-14B-Instruct-2512 | Q8 | 69.32 tok/s | 16GB |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Q8 | 69.00 tok/s | 10GB |
| Qwen/Qwen3-30B-A3B | Q4 | 68.92 tok/s | 15GB |
| Qwen/Qwen2.5-14B | Q8 | 68.55 tok/s | 14GB |
| openai/gpt-oss-20b | Q4 | 68.36 tok/s | 10GB |
| EssentialAI/rnj-1 | Q8 | 67.53 tok/s | 10GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q4 | 67.52 tok/s | 15GB |
| OpenPipe/Qwen3-14B-Instruct | Q8 | 66.54 tok/s | 14GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | Q4 | 66.50 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Q4 | 66.40 tok/s | 15GB |
| Qwen/Qwen3-14B-Base | Q8 | 65.74 tok/s | 14GB |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | Q4 | 64.07 tok/s | 15GB |
| unsloth/gpt-oss-20b-BF16 | Q4 | 64.05 tok/s | 10GB |
| Qwen/Qwen2.5-14B-Instruct | Q8 | 63.74 tok/s | 14GB |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q4 | 63.70 tok/s | 10GB |
| openai/gpt-oss-safeguard-20b | Q4 | 63.56 tok/s | 11GB |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | 63.55 tok/s | 9GB |
| google/gemma-2-27b-it | Q4 | 63.53 tok/s | 14GB |
| mistralai/Mistral-Small-Instruct-2409 | Q4 | 63.30 tok/s | 11GB |
| ai-forever/ruGPT-3.5-13B | Q8 | 63.14 tok/s | 13GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | FP16 | 62.70 tok/s | 6GB |
| meta-llama/Llama-2-13b-chat-hf | Q8 | 62.61 tok/s | 13GB |
| Qwen/Qwen2.5-3B-Instruct | FP16 | 62.53 tok/s | 6GB |
| allenai/OLMo-2-0425-1B | FP16 | 62.35 tok/s | 2GB |
| microsoft/Phi-3-medium-128k-instruct | Q8 | 61.94 tok/s | 14GB |
| facebook/sam3 | FP16 | 61.90 tok/s | 2GB |
| meta-llama/Llama-Guard-3-1B | FP16 | 61.78 tok/s | 2GB |
| tencent/HunyuanOCR | FP16 | 61.17 tok/s | 3GB |
| meta-llama/Llama-3.2-3B-Instruct | FP16 | 60.61 tok/s | 6GB |
| unsloth/gemma-3-1b-it | FP16 | 59.62 tok/s | 2GB |
| ibm-granite/granite-3.3-2b-instruct | FP16 | 59.58 tok/s | 4GB |
| google-bert/bert-base-uncased | FP16 | 59.54 tok/s | 1GB |
| bigcode/starcoder2-3b | FP16 | 59.47 tok/s | 6GB |
| deepseek-ai/DeepSeek-OCR | FP16 | 58.71 tok/s | 7GB |
| Qwen/Qwen2.5-3B | FP16 | 58.13 tok/s | 6GB |
| LiquidAI/LFM2-1.2B | FP16 | 57.93 tok/s | 4GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | FP16 | 57.68 tok/s | 6GB |
| meta-llama/Llama-3.2-3B | FP16 | 57.25 tok/s | 6GB |
| apple/OpenELM-1_1B-Instruct | FP16 | 56.76 tok/s | 2GB |
| meta-llama/Llama-3.2-1B | FP16 | 56.59 tok/s | 2GB |
| google/gemma-3-1b-it | FP16 | 56.35 tok/s | 2GB |
| inference-net/Schematron-3B | FP16 | 56.02 tok/s | 6GB |
| WeiboAI/VibeThinker-1.5B | FP16 | 55.87 tok/s | 4GB |
| google-t5/t5-3b | FP16 | 55.77 tok/s | 6GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | FP16 | 55.06 tok/s | 2GB |
| nari-labs/Dia2-2B | FP16 | 54.72 tok/s | 5GB |
| unsloth/Llama-3.2-1B-Instruct | FP16 | 54.42 tok/s | 2GB |
| meta-llama/Llama-3.2-1B-Instruct | FP16 | 54.39 tok/s | 2GB |
| google/gemma-2b | FP16 | 54.21 tok/s | 4GB |
| google/embeddinggemma-300m | FP16 | 54.00 tok/s | 1GB |
| ibm-research/PowerMoE-3b | FP16 | 53.26 tok/s | 6GB |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | Q8 | 52.96 tok/s | 20GB |
| unsloth/Llama-3.2-3B-Instruct | FP16 | 52.82 tok/s | 6GB |
| mistralai/Mistral-7B-Instruct-v0.1 | FP16 | 52.33 tok/s | 15GB |
| deepseek-ai/DeepSeek-V3 | FP16 | 52.30 tok/s | 15GB |
| MiniMaxAI/MiniMax-M2 | FP16 | 52.27 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B | Q8 | 52.24 tok/s | 31GB |
| google/gemma-2-2b-it | FP16 | 52.20 tok/s | 4GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | FP16 | 52.14 tok/s | 17GB |
| Qwen/Qwen2.5-7B | FP16 | 52.12 tok/s | 15GB |
| GSAI-ML/LLaDA-8B-Base | FP16 | 52.10 tok/s | 17GB |
| openai-community/gpt2-xl | FP16 | 51.92 tok/s | 15GB |
| Qwen/Qwen2.5-1.5B | FP16 | 51.80 tok/s | 11GB |
| Qwen/Qwen2.5-0.5B | FP16 | 51.73 tok/s | 11GB |
| openai-community/gpt2 | FP16 | 51.66 tok/s | 15GB |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | FP16 | 51.62 tok/s | 15GB |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | FP16 | 51.44 tok/s | 15GB |
| huggyllama/llama-7b | FP16 | 51.36 tok/s | 15GB |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q8 | 51.34 tok/s | 20GB |
| openai/gpt-oss-safeguard-20b | Q8 | 51.19 tok/s | 22GB |
| llamafactory/tiny-random-Llama-3 | FP16 | 51.05 tok/s | 15GB |
| Qwen/Qwen2.5-7B-Instruct | FP16 | 50.97 tok/s | 15GB |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | Q8 | 50.87 tok/s | 31GB |
| Qwen/Qwen3-4B | FP16 | 50.86 tok/s | 9GB |
| parler-tts/parler-tts-large-v1 | FP16 | 50.84 tok/s | 15GB |
| black-forest-labs/FLUX.1-dev | FP16 | 50.68 tok/s | 16GB |
| Qwen/Qwen3-4B-Instruct-2507 | FP16 | 50.53 tok/s | 9GB |
| deepseek-ai/DeepSeek-V3-0324 | FP16 | 50.42 tok/s | 15GB |
| microsoft/Phi-4-mini-instruct | FP16 | 50.40 tok/s | 15GB |
| microsoft/Phi-3-mini-128k-instruct | FP16 | 50.28 tok/s | 15GB |
| Qwen/Qwen2-7B-Instruct | FP16 | 50.19 tok/s | 15GB |
| meta-llama/Llama-2-7b-hf | FP16 | 50.07 tok/s | 15GB |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | FP16 | 50.05 tok/s | 9GB |
| skt/kogpt2-base-v2 | FP16 | 50.05 tok/s | 15GB |
| microsoft/VibeVoice-1.5B | FP16 | 49.93 tok/s | 11GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | FP16 | 49.92 tok/s | 11GB |
| mistralai/Mistral-7B-Instruct-v0.2 | FP16 | 49.89 tok/s | 15GB |
| Gensyn/Qwen2.5-0.5B-Instruct | FP16 | 49.87 tok/s | 11GB |
| Qwen/Qwen3-8B-FP8 | FP16 | 49.83 tok/s | 17GB |
| microsoft/Phi-3.5-vision-instruct | FP16 | 49.65 tok/s | 15GB |
| openai-community/gpt2-medium | FP16 | 49.58 tok/s | 15GB |
| unsloth/gpt-oss-20b-BF16 | Q8 | 49.54 tok/s | 20GB |
| openai-community/gpt2-large | FP16 | 49.54 tok/s | 15GB |
| petals-team/StableBeluga2 | FP16 | 49.20 tok/s | 15GB |
| HuggingFaceTB/SmolLM2-135M | FP16 | 49.15 tok/s | 15GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | FP16 | 49.15 tok/s | 15GB |
| distilbert/distilgpt2 | FP16 | 49.05 tok/s | 15GB |
| meta-llama/Llama-2-7b-chat-hf | FP16 | 49.00 tok/s | 15GB |
| liuhaotian/llava-v1.5-7b | FP16 | 48.97 tok/s | 15GB |
| bigscience/bloomz-560m | FP16 | 48.78 tok/s | 15GB |
| EleutherAI/pythia-70m-deduped | FP16 | 48.77 tok/s | 15GB |
| meta-llama/Llama-Guard-3-8B | FP16 | 48.77 tok/s | 17GB |
| microsoft/DialoGPT-small | FP16 | 48.66 tok/s | 15GB |
| EleutherAI/gpt-neo-125m | FP16 | 48.57 tok/s | 15GB |
| microsoft/Phi-3.5-mini-instruct | FP16 | 48.48 tok/s | 8GB |
| rednote-hilab/dots.ocr | FP16 | 48.40 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q8 | 48.30 tok/s | 31GB |
| ibm-granite/granite-docling-258M | FP16 | 48.26 tok/s | 15GB |
| microsoft/phi-2 | FP16 | 48.16 tok/s | 15GB |
| deepseek-ai/DeepSeek-R1 | FP16 | 47.94 tok/s | 15GB |
| deepseek-ai/deepseek-coder-33b-instruct | Q4 | 47.92 tok/s | 17GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | Q8 | 47.86 tok/s | 31GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | FP16 | 47.71 tok/s | 9GB |
| Qwen/Qwen2-1.5B-Instruct | FP16 | 47.71 tok/s | 11GB |
| black-forest-labs/FLUX.2-dev | FP16 | 47.71 tok/s | 16GB |
| meta-llama/Llama-3.1-8B | FP16 | 47.69 tok/s | 17GB |
| Qwen/Qwen2.5-Math-1.5B | FP16 | 47.61 tok/s | 11GB |
| Qwen/Qwen3-Embedding-0.6B | FP16 | 47.59 tok/s | 13GB |
| codellama/CodeLlama-34b-hf | Q4 | 47.51 tok/s | 17GB |
| HuggingFaceTB/SmolLM-135M | FP16 | 47.44 tok/s | 15GB |
| meta-llama/Meta-Llama-3-8B | FP16 | 47.30 tok/s | 17GB |
| Qwen/Qwen2.5-Coder-1.5B | FP16 | 47.24 tok/s | 11GB |
| HuggingFaceH4/zephyr-7b-beta | FP16 | 47.18 tok/s | 15GB |
| deepseek-ai/DeepSeek-R1-0528 | FP16 | 47.18 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | Q8 | 47.16 tok/s | 31GB |
| Qwen/Qwen-Image-Edit-2509 | FP16 | 47.15 tok/s | 16GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | FP16 | 46.98 tok/s | 17GB |
| vikhyatk/moondream2 | FP16 | 46.94 tok/s | 15GB |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q4 | 46.93 tok/s | 16GB |
| Qwen/Qwen3-1.7B-Base | FP16 | 46.87 tok/s | 15GB |
| meta-llama/Llama-3.1-8B-Instruct | FP16 | 46.84 tok/s | 17GB |
| mistralai/Mistral-Small-Instruct-2409 | Q8 | 46.77 tok/s | 23GB |
| microsoft/phi-4 | FP16 | 46.75 tok/s | 15GB |
| rinna/japanese-gpt-neox-small | FP16 | 46.61 tok/s | 15GB |
| google/gemma-2-27b-it | Q8 | 46.54 tok/s | 28GB |
| dicta-il/dictalm2.0-instruct | FP16 | 46.50 tok/s | 15GB |
| facebook/opt-125m | FP16 | 46.49 tok/s | 15GB |
| NousResearch/Meta-Llama-3.1-8B-Instruct | FP16 | 46.43 tok/s | 17GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | FP16 | 46.43 tok/s | 9GB |
| numind/NuExtract-1.5 | FP16 | 46.41 tok/s | 15GB |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | Q4 | 46.33 tok/s | 34GB |
| openai/gpt-oss-20b | Q8 | 46.33 tok/s | 20GB |
| trl-internal-testing/tiny-random-LlamaForCausalLM | FP16 | 46.18 tok/s | 15GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | Q8 | 46.12 tok/s | 31GB |
| deepseek-ai/DeepSeek-V2.5 | Q4 | 46.10 tok/s | 328GB |
| hmellor/tiny-random-LlamaForCausalLM | FP16 | 46.04 tok/s | 15GB |
| microsoft/Phi-4-multimodal-instruct | FP16 | 45.96 tok/s | 15GB |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | Q4 | 45.95 tok/s | 25GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | FP16 | 45.94 tok/s | 17GB |
| sshleifer/tiny-gpt2 | FP16 | 45.93 tok/s | 15GB |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | Q4 | 45.84 tok/s | 34GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Q8 | 45.84 tok/s | 31GB |
| 01-ai/Yi-1.5-34B-Chat | Q4 | 45.76 tok/s | 18GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | FP16 | 45.68 tok/s | 15GB |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | FP16 | 45.46 tok/s | 11GB |
| Qwen/Qwen3-Embedding-8B | FP16 | 45.43 tok/s | 17GB |
| meta-llama/Meta-Llama-3-8B-Instruct | FP16 | 45.43 tok/s | 17GB |
| meta-llama/Meta-Llama-3-70B-Instruct | Q4 | 45.34 tok/s | 34GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | Q8 | 45.33 tok/s | 31GB |
| Qwen/Qwen2.5-32B-Instruct | Q4 | 45.28 tok/s | 16GB |
| IlyaGusev/saiga_llama3_8b | FP16 | 45.27 tok/s | 17GB |
| microsoft/Phi-3.5-mini-instruct | FP16 | 45.26 tok/s | 15GB |
| moonshotai/Kimi-K2-Thinking | Q4 | 45.24 tok/s | 489GB |
| allenai/Olmo-3-7B-Think | FP16 | 45.23 tok/s | 16GB |
| Qwen/Qwen2-0.5B-Instruct | FP16 | 45.08 tok/s | 11GB |
| mistralai/Mistral-7B-v0.1 | FP16 | 45.04 tok/s | 15GB |
| Qwen/Qwen3-4B-Thinking-2507 | FP16 | 45.04 tok/s | 9GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | FP16 | 45.02 tok/s | 9GB |
| Qwen/Qwen2.5-1.5B-Instruct | FP16 | 44.96 tok/s | 11GB |
| Qwen/Qwen2.5-0.5B-Instruct | FP16 | 44.85 tok/s | 11GB |
| Qwen/Qwen3-0.6B-Base | FP16 | 44.85 tok/s | 13GB |
| meta-llama/Llama-3.2-3B-Instruct | FP16 | 44.59 tok/s | 7GB |
| Qwen/Qwen3-0.6B | FP16 | 44.57 tok/s | 13GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | FP16 | 44.57 tok/s | 15GB |
| Qwen/Qwen3-Embedding-4B | FP16 | 44.53 tok/s | 9GB |
| microsoft/Phi-3-mini-4k-instruct | FP16 | 44.30 tok/s | 15GB |
| BSC-LT/salamandraTA-7b-instruct | FP16 | 44.29 tok/s | 15GB |
| Qwen/Qwen3-8B | FP16 | 44.16 tok/s | 17GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | Q8 | 44.15 tok/s | 31GB |
| Qwen/Qwen3-1.7B | FP16 | 44.13 tok/s | 15GB |
| Qwen/Qwen3-Reranker-0.6B | FP16 | 44.03 tok/s | 13GB |
| Tongyi-MAI/Z-Image-Turbo | FP16 | 43.99 tok/s | 16GB |
| Qwen/Qwen2.5-32B | Q4 | 43.82 tok/s | 16GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q4 | 43.79 tok/s | 17GB |
| lmsys/vicuna-7b-v1.5 | FP16 | 43.68 tok/s | 15GB |
| deepseek-ai/DeepSeek-V3.1 | FP16 | 43.59 tok/s | 15GB |
| Qwen/Qwen2.5-7B-Instruct | FP16 | 43.58 tok/s | 16GB |
| GSAI-ML/LLaDA-8B-Instruct | FP16 | 43.56 tok/s | 17GB |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | FP16 | 43.54 tok/s | 15GB |
| google/gemma-3-270m-it | FP16 | 43.53 tok/s | 15GB |
| zai-org/GLM-4.5-Air | FP16 | 43.47 tok/s | 15GB |
| tencent/HunyuanVideo-1.5 | FP16 | 43.46 tok/s | 16GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | FP16 | 43.44 tok/s | 17GB |
| Qwen/Qwen3-8B-Base | FP16 | 43.43 tok/s | 17GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | FP16 | 43.24 tok/s | 9GB |
| Qwen/Qwen2.5-32B-Instruct | Q4 | 43.23 tok/s | 17GB |
| zai-org/GLM-4.6-FP8 | FP16 | 43.06 tok/s | 15GB |
| microsoft/DialoGPT-medium | FP16 | 43.06 tok/s | 15GB |
| ibm-granite/granite-3.3-8b-instruct | FP16 | 43.05 tok/s | 17GB |
| swiss-ai/Apertus-8B-Instruct-2509 | FP16 | 43.00 tok/s | 17GB |
| Qwen/Qwen2-0.5B | FP16 | 43.00 tok/s | 11GB |
| Qwen/Qwen3-4B-Base | FP16 | 42.96 tok/s | 9GB |
| Qwen/Qwen2.5-Coder-7B-Instruct | FP16 | 42.96 tok/s | 15GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | FP16 | 42.88 tok/s | 17GB |
| baichuan-inc/Baichuan-M2-32B | Q4 | 42.17 tok/s | 16GB |
| Qwen/Qwen3-32B | Q4 | 41.72 tok/s | 16GB |
| meta-llama/Llama-3.3-70B-Instruct | Q4 | 41.41 tok/s | 34GB |
| Qwen/QwQ-32B-Preview | Q4 | 41.38 tok/s | 17GB |
| meta-llama/Llama-3.1-70B-Instruct | Q4 | 41.24 tok/s | 34GB |
| Qwen/Qwen2.5-Coder-32B-Instruct | Q4 | 41.13 tok/s | 17GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Q4 | 40.00 tok/s | 16GB |
| EssentialAI/rnj-1 | FP16 | 39.14 tok/s | 19GB |
| Qwen/Qwen2.5-14B | FP16 | 38.95 tok/s | 29GB |
| meta-llama/Llama-2-13b-chat-hf | FP16 | 38.84 tok/s | 27GB |
| mistralai/Ministral-3-14B-Instruct-2512 | FP16 | 38.34 tok/s | 32GB |
| ai-forever/ruGPT-3.5-13B | FP16 | 38.11 tok/s | 27GB |
| microsoft/Phi-3-medium-128k-instruct | FP16 | 37.56 tok/s | 29GB |
| Qwen/Qwen2.5-14B-Instruct | FP16 | 37.51 tok/s | 30GB |
| Qwen/Qwen3-14B-Base | FP16 | 37.25 tok/s | 29GB |
| Qwen/Qwen2.5-14B-Instruct | FP16 | 35.80 tok/s | 29GB |
| google/gemma-2-9b-it | FP16 | 35.32 tok/s | 20GB |
| OpenPipe/Qwen3-14B-Instruct | FP16 | 35.20 tok/s | 29GB |
| meta-llama/Llama-3.1-8B-Instruct | FP16 | 34.74 tok/s | 17GB |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | FP16 | 34.50 tok/s | 19GB |
| moonshotai/Kimi-K2-Thinking | Q8 | 33.70 tok/s | 978GB |
| Qwen/Qwen3-14B | FP16 | 32.88 tok/s | 29GB |
| NousResearch/Hermes-3-Llama-3.1-8B | FP16 | 32.66 tok/s | 17GB |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Q4 | 32.59 tok/s | 69GB |
| meta-llama/Llama-3.1-70B-Instruct | Q8 | 32.27 tok/s | 68GB |
| meta-llama/Meta-Llama-3-70B-Instruct | Q8 | 32.22 tok/s | 68GB |
| meta-llama/Llama-3.3-70B-Instruct | Q8 | 32.16 tok/s | 68GB |
| deepseek-ai/DeepSeek-V2.5 | Q8 | 31.93 tok/s | 656GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Q8 | 31.89 tok/s | 33GB |
| Qwen/Qwen2.5-32B | Q8 | 31.34 tok/s | 33GB |
| baichuan-inc/Baichuan-M2-32B | Q8 | 31.04 tok/s | 33GB |
| Qwen/QwQ-32B-Preview | Q8 | 30.21 tok/s | 34GB |
| Qwen/Qwen3-32B | Q8 | 30.13 tok/s | 33GB |
| Qwen/Qwen2.5-32B-Instruct | Q8 | 29.91 tok/s | 33GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q8 | 29.57 tok/s | 35GB |
| deepseek-ai/deepseek-coder-33b-instruct | Q8 | 29.54 tok/s | 34GB |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | Q8 | 29.24 tok/s | 68GB |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q8 | 29.02 tok/s | 33GB |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | Q8 | 28.60 tok/s | 50GB |
| Qwen/Qwen2.5-32B-Instruct | Q8 | 28.36 tok/s | 34GB |
| codellama/CodeLlama-34b-hf | Q8 | 28.26 tok/s | 35GB |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | Q8 | 28.26 tok/s | 68GB |
| Qwen/Qwen3-30B-A3B | FP16 | 28.01 tok/s | 61GB |
| Qwen/Qwen2.5-Coder-32B-Instruct | Q8 | 27.93 tok/s | 34GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | FP16 | 27.89 tok/s | 61GB |
| unsloth/gpt-oss-20b-BF16 | FP16 | 27.80 tok/s | 41GB |
| 01-ai/Yi-1.5-34B-Chat | Q8 | 27.73 tok/s | 35GB |
| openai/gpt-oss-20b | FP16 | 27.39 tok/s | 41GB |
| Qwen/Qwen2.5-72B-Instruct | Q4 | 27.31 tok/s | 36GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | FP16 | 27.22 tok/s | 61GB |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | FP16 | 26.92 tok/s | 41GB |
| AI-MO/Kimina-Prover-72B | Q4 | 26.47 tok/s | 35GB |
| openai/gpt-oss-safeguard-20b | FP16 | 26.40 tok/s | 44GB |
| meta-llama/Llama-3.1-70B-Instruct | Q4 | 26.30 tok/s | 34GB |
| mistralai/Mistral-Small-Instruct-2409 | FP16 | 26.25 tok/s | 46GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | FP16 | 25.87 tok/s | 61GB |
| google/gemma-2-27b-it | FP16 | 25.49 tok/s | 56GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | FP16 | 25.32 tok/s | 61GB |
| openai/gpt-oss-120b | Q4 | 25.30 tok/s | 59GB |
| mistralai/Mistral-Large-Instruct-2411 | Q4 | 25.00 tok/s | 60GB |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | FP16 | 24.95 tok/s | 61GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | FP16 | 24.81 tok/s | 61GB |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | Q4 | 24.75 tok/s | 34GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct | Q4 | 24.62 tok/s | 39GB |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | FP16 | 24.56 tok/s | 61GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | FP16 | 24.43 tok/s | 61GB |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | FP16 | 24.39 tok/s | 41GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | Q4 | 24.10 tok/s | 39GB |
| meta-llama/Llama-3.3-70B-Instruct | Q4 | 24.09 tok/s | 34GB |
| Qwen/Qwen2.5-72B-Instruct | Q4 | 23.78 tok/s | 35GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking | Q4 | 23.23 tok/s | 39GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | Q4 | 23.09 tok/s | 39GB |
| Qwen/Qwen2.5-Math-72B-Instruct | Q4 | 23.09 tok/s | 36GB |
| NousResearch/Hermes-3-Llama-3.1-70B | Q4 | 22.96 tok/s | 34GB |
| RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic | Q4 | 22.75 tok/s | 44GB |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Q8 | 22.54 tok/s | 138GB |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | Q8 | 18.84 tok/s | 69GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | Q8 | 18.69 tok/s | 78GB |
| meta-llama/Llama-3.3-70B-Instruct | Q8 | 18.51 tok/s | 69GB |
| 01-ai/Yi-1.5-34B-Chat | FP16 | 18.31 tok/s | 70GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct | Q8 | 18.22 tok/s | 78GB |
| deepseek-ai/deepseek-coder-33b-instruct | FP16 | 18.10 tok/s | 68GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking | Q8 | 18.05 tok/s | 78GB |
| Qwen/Qwen2.5-32B-Instruct | FP16 | 18.04 tok/s | 66GB |
| Qwen/Qwen2.5-32B | FP16 | 17.98 tok/s | 66GB |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | FP16 | 17.91 tok/s | 101GB |
| RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic | Q8 | 17.88 tok/s | 88GB |
| meta-llama/Meta-Llama-3-70B-Instruct | FP16 | 17.86 tok/s | 137GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | FP16 | 17.85 tok/s | 70GB |
| mistralai/Mistral-Large-Instruct-2411 | Q8 | 17.80 tok/s | 120GB |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | FP16 | 17.75 tok/s | 137GB |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | FP16 | 17.66 tok/s | 137GB |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | Q4 | 17.39 tok/s | 115GB |
| openai/gpt-oss-120b | Q8 | 17.20 tok/s | 117GB |
| baichuan-inc/Baichuan-M2-32B | FP16 | 17.19 tok/s | 66GB |
| Qwen/Qwen3-32B | FP16 | 17.06 tok/s | 66GB |
| meta-llama/Llama-3.1-70B-Instruct | Q8 | 17.02 tok/s | 69GB |
| Qwen/Qwen2.5-32B-Instruct | FP16 | 17.02 tok/s | 67GB |
| moonshotai/Kimi-K2-Thinking | FP16 | 16.98 tok/s | 1956GB |
| deepseek-ai/DeepSeek-Math-V2 | Q4 | 16.94 tok/s | 383GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | Q8 | 16.92 tok/s | 78GB |
| Qwen/Qwen2.5-72B-Instruct | Q8 | 16.74 tok/s | 70GB |
| meta-llama/Llama-3.1-70B-Instruct | FP16 | 16.70 tok/s | 137GB |
| deepseek-ai/DeepSeek-V2.5 | FP16 | 16.68 tok/s | 1312GB |
| NousResearch/Hermes-3-Llama-3.1-70B | Q8 | 16.55 tok/s | 69GB |
| Qwen/Qwen2.5-Coder-32B-Instruct | FP16 | 16.51 tok/s | 67GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | FP16 | 16.34 tok/s | 66GB |
| Qwen/Qwen2.5-72B-Instruct | Q8 | 16.31 tok/s | 71GB |
| Qwen/Qwen2.5-Math-72B-Instruct | Q8 | 16.30 tok/sEstimated Auto-generated benchmark | 71GB |
| AI-MO/Kimina-Prover-72B | Q8 | 16.27 tok/sEstimated Auto-generated benchmark | 70GB |
| codellama/CodeLlama-34b-hf | FP16 | 16.23 tok/sEstimated Auto-generated benchmark | 70GB |
| MiniMaxAI/MiniMax-M1-40k | Q4 | 16.14 tok/sEstimated Auto-generated benchmark | 255GB |
| meta-llama/Llama-3.3-70B-Instruct | FP16 | 16.03 tok/sEstimated Auto-generated benchmark | 137GB |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | FP16 | 15.53 tok/sEstimated Auto-generated benchmark | 66GB |
| Qwen/QwQ-32B-Preview | FP16 | 15.44 tok/sEstimated Auto-generated benchmark | 67GB |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | Q4 | 13.81 tok/sEstimated Auto-generated benchmark | 378GB |
| Qwen/Qwen3-235B-A22B | Q4 | 13.77 tok/sEstimated Auto-generated benchmark | 115GB |
| MiniMaxAI/MiniMax-VL-01 | Q4 | 13.60 tok/sEstimated Auto-generated benchmark | 256GB |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | FP16 | 13.37 tok/sEstimated Auto-generated benchmark | 275GB |
| deepseek-ai/DeepSeek-Math-V2 | Q8 | 13.28 tok/sEstimated Auto-generated benchmark | 766GB |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | Q8 | 12.88 tok/sEstimated Auto-generated benchmark | 231GB |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | Q8 | 10.74 tok/sEstimated Auto-generated benchmark | 755GB |
| mistralai/Mistral-Large-Instruct-2411 | FP16 | 10.40 tok/sEstimated Auto-generated benchmark | 240GB |
| meta-llama/Llama-3.3-70B-Instruct | FP16 | 10.40 tok/sEstimated Auto-generated benchmark | 138GB |
| Qwen/Qwen2.5-72B-Instruct | FP16 | 10.38 tok/sEstimated Auto-generated benchmark | 141GB |
| Qwen/Qwen3-235B-A22B | Q8 | 10.35 tok/sEstimated Auto-generated benchmark | 230GB |
| Qwen/Qwen2.5-Math-72B-Instruct | FP16 | 10.21 tok/sEstimated Auto-generated benchmark | 142GB |
| NousResearch/Hermes-3-Llama-3.1-70B | FP16 | 10.13 tok/sEstimated Auto-generated benchmark | 138GB |
| Qwen/Qwen2.5-72B-Instruct | FP16 | 10.08 tok/sEstimated Auto-generated benchmark | 142GB |
| MiniMaxAI/MiniMax-VL-01 | Q8 | 10.04 tok/sEstimated Auto-generated benchmark | 511GB |
| meta-llama/Llama-3.1-70B-Instruct | FP16 | 10.03 tok/sEstimated Auto-generated benchmark | 138GB |
| RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic | FP16 | 9.86 tok/sEstimated Auto-generated benchmark | 176GB |
| MiniMaxAI/MiniMax-M1-40k | Q8 | 9.80 tok/sEstimated Auto-generated benchmark | 510GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct | FP16 | 9.62 tok/sEstimated Auto-generated benchmark | 156GB |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | FP16 | 9.33 tok/sEstimated Auto-generated benchmark | 156GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking | FP16 | 9.17 tok/sEstimated Auto-generated benchmark | 156GB |
| openai/gpt-oss-120b | FP16 | 9.13 tok/sEstimated Auto-generated benchmark | 235GB |
| AI-MO/Kimina-Prover-72B | FP16 | 8.97 tok/sEstimated Auto-generated benchmark | 141GB |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | FP16 | 8.81 tok/sEstimated Auto-generated benchmark | 156GB |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | FP16 | 8.72 tok/sEstimated Auto-generated benchmark | 138GB |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | FP16 | 7.85 tok/sEstimated Auto-generated benchmark | 461GB |
| deepseek-ai/DeepSeek-Math-V2 | FP16 | 7.13 tok/sEstimated Auto-generated benchmark | 1532GB |
| Qwen/Qwen3-235B-A22B | FP16 | 6.19 tok/sEstimated Auto-generated benchmark | 460GB |
| MiniMaxAI/MiniMax-VL-01 | FP16 | 6.15 tok/sEstimated Auto-generated benchmark | 1021GB |
| MiniMaxAI/MiniMax-M1-40k | FP16 | 5.82 tok/sEstimated Auto-generated benchmark | 1020GB |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | FP16 | 5.71 tok/sEstimated Auto-generated benchmark | 1509GB |
Note: Performance estimates are calculated, not measured; real results may vary.
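The estimated VRAM figures above follow from a simple weights-size calculation: one billion parameters at 8 bits is roughly 1 GB. A minimal sketch of that heuristic (the function name is mine, and ignoring KV cache and activation overhead is an assumption, not the site's published methodology):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weights-only footprint: params * bits / 8.
    Assumed heuristic; ignores KV cache and activation overhead."""
    return params_billions * bits_per_weight / 8

# Sanity checks against the 70B rows in the table above:
print(estimate_vram_gb(70, 4))   # 35.0  -> table lists 34-36GB for 70B Q4
print(estimate_vram_gb(70, 8))   # 70.0  -> table lists 68-71GB for 70B Q8
print(estimate_vram_gb(70, 16))  # 140.0 -> table lists 137-142GB for 70B FP16
```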
| Model | Quantization | Verdict | Estimated speed | VRAM needed |
|---|---|---|---|---|
| mistralai/Mistral-Large-3-675B-Instruct-2512 | Q4 | Not supported | 13.81 tok/s (estimated) | 378GB (have 24GB) |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | Q8 | Not supported | 10.74 tok/s (estimated) | 755GB (have 24GB) |
| EssentialAI/rnj-1 | FP16 | Fits comfortably | 39.14 tok/s (estimated) | 19GB (have 24GB) |
| EssentialAI/rnj-1 | Q8 | Fits comfortably | 67.53 tok/s (estimated) | 10GB (have 24GB) |
| EssentialAI/rnj-1 | Q4 | Fits comfortably | 98.55 tok/s (estimated) | 5GB (have 24GB) |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | FP16 | Not supported | 5.71 tok/s (estimated) | 1509GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q4 | Fits comfortably | 134.93 tok/s (estimated) | 4GB (have 24GB) |
| meta-llama/Llama-2-7b-hf | FP16 | Fits comfortably | 50.07 tok/s (estimated) | 15GB (have 24GB) |
| google-t5/t5-3b | Q4 | Fits comfortably | 154.08 tok/s (estimated) | 2GB (have 24GB) |
| openai-community/gpt2-medium | Q4 | Fits comfortably | 116.97 tok/s (estimated) | 4GB (have 24GB) |
| openai-community/gpt2-medium | Q8 | Fits comfortably | 84.46 tok/s (estimated) | 7GB (have 24GB) |
| openai-community/gpt2-medium | FP16 | Fits comfortably | 49.58 tok/s (estimated) | 15GB (have 24GB) |
| GSAI-ML/LLaDA-8B-Instruct | Q8 | Fits comfortably | 96.09 tok/s (estimated) | 9GB (have 24GB) |
| unsloth/Llama-3.2-1B-Instruct | Q8 | Fits comfortably | 115.31 tok/s (estimated) | 1GB (have 24GB) |
| unsloth/gpt-oss-20b-BF16 | Q8 | Fits comfortably | 49.54 tok/s (estimated) | 20GB (have 24GB) |
| google-t5/t5-3b | FP16 | Fits comfortably | 55.77 tok/s (estimated) | 6GB (have 24GB) |
| Qwen/Qwen3-4B-Thinking-2507 | Q4 | Fits comfortably | 123.57 tok/s (estimated) | 2GB (have 24GB) |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q8 | Fits comfortably | 91.80 tok/s (estimated) | 9GB (have 24GB) |
| Qwen/Qwen3-4B-Thinking-2507 | Q8 | Fits comfortably | 89.07 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen3-235B-A22B | FP16 | Not supported | 6.19 tok/s (estimated) | 460GB (have 24GB) |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q4 | Fits comfortably | 125.22 tok/s (estimated) | 4GB (have 24GB) |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q8 | Fits comfortably | 93.73 tok/s (estimated) | 7GB (have 24GB) |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | FP16 | Fits comfortably | 51.62 tok/s (estimated) | 15GB (have 24GB) |
| google/gemma-2b | Q4 | Fits comfortably | 145.99 tok/s (estimated) | 1GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | Q8 | Not supported | 47.16 tok/s (estimated) | 31GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | FP16 | Not supported | 24.56 tok/s (estimated) | 61GB (have 24GB) |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q8 | Fits comfortably | 88.04 tok/s (estimated) | 5GB (have 24GB) |
| ibm-granite/granite-3.3-8b-instruct | Q4 | Fits comfortably | 125.16 tok/s (estimated) | 4GB (have 24GB) |
| ibm-granite/granite-3.3-8b-instruct | Q8 | Fits comfortably | 94.28 tok/s (estimated) | 9GB (have 24GB) |
| ibm-granite/granite-3.3-8b-instruct | FP16 | Fits comfortably | 43.05 tok/s (estimated) | 17GB (have 24GB) |
| HuggingFaceTB/SmolLM2-135M | Q4 | Fits comfortably | 134.69 tok/s (estimated) | 4GB (have 24GB) |
| deepseek-ai/deepseek-coder-33b-instruct | Q8 | Not supported | 29.54 tok/s (estimated) | 34GB (have 24GB) |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | FP16 | Not supported | 7.85 tok/s (estimated) | 461GB (have 24GB) |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q4 | Fits comfortably | 116.30 tok/s (estimated) | 4GB (have 24GB) |
| WeiboAI/VibeThinker-1.5B | Q8 | Fits comfortably | 102.86 tok/s (estimated) | 2GB (have 24GB) |
| WeiboAI/VibeThinker-1.5B | FP16 | Fits comfortably | 55.87 tok/s (estimated) | 4GB (have 24GB) |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | FP16 | Not supported | 17.75 tok/s (estimated) | 137GB (have 24GB) |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | Fits comfortably | 149.63 tok/s (estimated) | 2GB (have 24GB) |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | Fits comfortably | 108.53 tok/s (estimated) | 3GB (have 24GB) |
| meta-llama/Llama-3.2-3B-Instruct | FP16 | Fits comfortably | 60.61 tok/s (estimated) | 6GB (have 24GB) |
| Qwen/Qwen3-32B | FP16 | Not supported | 17.06 tok/s (estimated) | 66GB (have 24GB) |
| rednote-hilab/dots.ocr | Q8 | Fits comfortably | 95.12 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen3-Reranker-0.6B | FP16 | Fits comfortably | 44.03 tok/s (estimated) | 13GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B-Instruct | Q4 | Fits comfortably | 128.93 tok/s (estimated) | 4GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B-Instruct | Q8 | Fits comfortably | 85.92 tok/s (estimated) | 9GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B-Instruct | FP16 | Fits comfortably | 45.43 tok/s (estimated) | 17GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q4 | Fits comfortably | 118.51 tok/s (estimated) | 3GB (have 24GB) |
| Qwen/Qwen3-Embedding-8B | FP16 | Fits comfortably | 45.43 tok/s (estimated) | 17GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | FP16 | Fits comfortably | 44.57 tok/s (estimated) | 15GB (have 24GB) |
| meta-llama/Llama-2-7b-hf | Q4 | Fits comfortably | 136.22 tok/s (estimated) | 4GB (have 24GB) |
| deepseek-ai/DeepSeek-V3.1 | Q4 | Fits comfortably | 132.27 tok/s (estimated) | 4GB (have 24GB) |
| deepseek-ai/DeepSeek-V3.1 | Q8 | Fits comfortably | 80.06 tok/s (estimated) | 7GB (have 24GB) |
| deepseek-ai/DeepSeek-V3.1 | FP16 | Fits comfortably | 43.59 tok/s (estimated) | 15GB (have 24GB) |
| meta-llama/Llama-3.1-8B | Q4 | Fits comfortably | 121.96 tok/s (estimated) | 4GB (have 24GB) |
| codellama/CodeLlama-34b-hf | FP16 | Not supported | 16.23 tok/s (estimated) | 70GB (have 24GB) |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | FP16 | Not supported | 17.66 tok/s (estimated) | 137GB (have 24GB) |
| OpenPipe/Qwen3-14B-Instruct | Q4 | Fits comfortably | 86.35 tok/s (estimated) | 7GB (have 24GB) |
| OpenPipe/Qwen3-14B-Instruct | Q8 | Fits comfortably | 66.54 tok/s (estimated) | 14GB (have 24GB) |
| openai/gpt-oss-120b | Q4 | Not supported | 25.30 tok/s (estimated) | 59GB (have 24GB) |
| openai/gpt-oss-120b | Q8 | Not supported | 17.20 tok/s (estimated) | 117GB (have 24GB) |
| openai/gpt-oss-120b | FP16 | Not supported | 9.13 tok/s (estimated) | 235GB (have 24GB) |
| llamafactory/tiny-random-Llama-3 | Q8 | Fits comfortably | 79.55 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen2.5-7B | Q4 | Fits comfortably | 117.16 tok/s (estimated) | 4GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B | FP16 | Fits comfortably | 47.30 tok/s (estimated) | 17GB (have 24GB) |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | Q8 | Not supported | 12.88 tok/s (estimated) | 231GB (have 24GB) |
| microsoft/Phi-4-mini-instruct | Q4 | Fits comfortably | 128.05 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen2.5-7B | Q8 | Fits comfortably | 90.57 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen2.5-7B | FP16 | Fits comfortably | 52.12 tok/s (estimated) | 15GB (have 24GB) |
| Qwen/Qwen2.5-0.5B-Instruct | Q4 | Fits comfortably | 117.09 tok/s (estimated) | 3GB (have 24GB) |
| google/gemma-3-270m-it | Q8 | Fits comfortably | 83.46 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen3-1.7B | Q8 | Fits comfortably | 88.56 tok/s (estimated) | 7GB (have 24GB) |
| Qwen/Qwen3-1.7B | FP16 | Fits comfortably | 44.13 tok/s (estimated) | 15GB (have 24GB) |
| ibm-granite/granite-docling-258M | Q8 | Fits comfortably | 86.58 tok/s (estimated) | 7GB (have 24GB) |
| ibm-granite/granite-docling-258M | FP16 | Fits comfortably | 48.26 tok/s (estimated) | 15GB (have 24GB) |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | Q4 | Not supported | 24.10 tok/s (estimated) | 39GB (have 24GB) |
| bigcode/starcoder2-3b | FP16 | Fits comfortably | 59.47 tok/s (estimated) | 6GB (have 24GB) |
| lmsys/vicuna-7b-v1.5 | Q8 | Fits comfortably | 91.63 tok/s (estimated) | 7GB (have 24GB) |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | FP16 | Fits comfortably | 45.46 tok/s (estimated) | 11GB (have 24GB) |
| Qwen/Qwen2-0.5B-Instruct | FP16 | Fits comfortably | 45.08 tok/s (estimated) | 11GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | Q8 | Not supported | 47.86 tok/s (estimated) | 31GB (have 24GB) |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q8 | Fits comfortably | 51.34 tok/s (estimated) | 20GB (have 24GB) |
| microsoft/Phi-3.5-mini-instruct | FP16 | Fits comfortably | 45.26 tok/s (estimated) | 15GB (have 24GB) |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q4 | Fits comfortably | 46.93 tok/s (estimated) | 16GB (have 24GB) |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q8 | Not supported | 29.02 tok/s (estimated) | 33GB (have 24GB) |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | Q4 | Not supported | 46.33 tok/s (estimated) | 34GB (have 24GB) |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | Q8 | Not supported | 28.26 tok/s (estimated) | 68GB (have 24GB) |
| Qwen/Qwen2.5-Coder-32B-Instruct | FP16 | Not supported | 16.51 tok/s (estimated) | 67GB (have 24GB) |
| Qwen/Qwen2.5-32B-Instruct | Q8 | Not supported | 28.36 tok/s (estimated) | 34GB (have 24GB) |
| meta-llama/Llama-2-13b-chat-hf | Q8 | Fits comfortably | 62.61 tok/s (estimated) | 13GB (have 24GB) |
| meta-llama/Llama-2-13b-chat-hf | FP16 | Not supported | 38.84 tok/s (estimated) | 27GB (have 24GB) |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | FP16 | Not supported | 24.39 tok/s (estimated) | 41GB (have 24GB) |
| Qwen/Qwen3-1.7B-Base | Q4 | Fits comfortably | 129.34 tok/s (estimated) | 4GB (have 24GB) |
| Qwen/Qwen3-1.7B-Base | Q8 | Fits comfortably | 88.27 tok/s (estimated) | 7GB (have 24GB) |
| mistralai/Mistral-Small-Instruct-2409 | Q4 | Fits comfortably | 63.30 tok/s (estimated) | 11GB (have 24GB) |
| google/gemma-2-27b-it | Q4 | Fits comfortably | 63.53 tok/s (estimated) | 14GB (have 24GB) |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | Q8 | Not supported | 28.60 tok/s (estimated) | 50GB (have 24GB) |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | FP16 | Not supported | 17.91 tok/s (estimated) | 101GB (have 24GB) |
| deepseek-ai/DeepSeek-OCR | Q4 | Fits comfortably | 162.44 tok/s (estimated) | 2GB (have 24GB) |
| deepseek-ai/DeepSeek-OCR | Q8 | Fits comfortably | 107.87 tok/s (estimated) | 4GB (have 24GB) |
| llamafactory/tiny-random-Llama-3 | FP16 | Fits comfortably | 51.05 tok/s (estimated) | 15GB (have 24GB) |
Note: Performance estimates are calculated, not measured; real results may vary.
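The verdict column reduces to comparing each model's estimated footprint against the A5000's 24GB. A minimal sketch of that decision rule, reconstructed from the rows above (the exact behavior at the boundary is an assumption):

```python
A5000_VRAM_GB = 24.0

def verdict(vram_needed_gb: float, vram_available_gb: float = A5000_VRAM_GB) -> str:
    """Assumed reconstruction of the verdict column: a model fits
    when its estimated footprint is within the card's VRAM."""
    return "Fits comfortably" if vram_needed_gb <= vram_available_gb else "Not supported"

print(verdict(20))  # Fits comfortably -- matches gpt-oss-20b at Q8 (20GB)
print(verdict(27))  # Not supported    -- matches Llama-2-13b-chat at FP16 (27GB)
```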
Data-backed answers pulled from community benchmarks, manufacturer specs, and live pricing.
RunPod benchmarks show the 24 GB RTX A5000 pushing ~49 tokens/sec on Mixtral 8x7B Q2_K under Ollama, and about 38 tok/s at Q3_K_S.
Source: Reddit – /r/LocalLLaMA (19428v9)
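To reproduce that kind of throughput number yourself, Ollama's API returns eval_count (tokens generated) and eval_duration (nanoseconds) with each response, which is enough to compute tok/s. A minimal sketch with the official Python client; the model tag below is an assumption, so check `ollama list` for what you actually have pulled:

```python
import ollama  # pip install ollama

resp = ollama.generate(
    model="mixtral:8x7b-instruct-v0.1-q2_K",  # assumed tag for the Q2_K build
    prompt="Explain KV-cache quantization in two sentences.",
)
# eval_duration is reported in nanoseconds by the Ollama API.
tok_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tok_per_sec:.1f} tok/s")
```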
Yes, with low-bit EXL2 quantization. Community guides note that 2.4 bpw EXL2 plus a 4-bit KV cache lets Miqu 70B run entirely within 24 GB on cards like the A5000.
Source: Reddit – /r/LocalLLaMA (kx452no)
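The arithmetic behind that fit is straightforward: at 2.4 bits per weight, a 70B model's weights come to about 21 GB, leaving roughly 3 GB of the A5000's 24 GB for the quantized KV cache and runtime overhead. A quick back-of-the-envelope check:

```python
params_b = 70  # Miqu 70B
bpw = 2.4      # EXL2 bits per weight

weights_gb = params_b * bpw / 8
print(f"weights: {weights_gb:.1f} GB")  # 21.0 GB

headroom_gb = 24 - weights_gb
print(f"left for KV cache + overhead: {headroom_gb:.1f} GB")  # 3.0 GB
```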
Operators of quad-A5000 rigs suggest disabling NVLink peer-to-peer via NCCL environment flags when vLLM underperforms; in one report, removing the bridges boosted throughput from ~14 tok/s to ~25 tok/s.
Source: Reddit – /r/LocalLLaMA (n3vnbez)
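NCCL reads this setting from the environment, so it has to be in place before the engine initializes its workers. A minimal sketch of the workaround in a vLLM launch script; the model and parallelism values are placeholders for a quad-A5000 setup:

```python
import os

# Disable NCCL peer-to-peer (the NVLink path) before vLLM starts its workers.
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM  # imported after the env var so NCCL picks it up

# Placeholder model; tensor_parallel_size=4 matches a quad-A5000 rig.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
```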
The RTX A5000 is rated at 230 W, draws power through a single 8-pin connector, and NVIDIA recommends a 600 W system PSU.
Source: TechPowerUp – RTX A5000 Specs
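To confirm a card is staying inside that 230 W envelope under load, nvidia-smi can report live board power alongside the configured limit. A small polling sketch (the sample output is illustrative):

```python
import subprocess

# One line per GPU, e.g. "228.14 W, 230.00 W" near full load.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=power.draw,power.limit", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())
```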
Our Nov 2025 snapshot showed RTX A5000 cards around $2,399 on Amazon (check current availability).
Source: Supabase price tracker snapshot – 2025-11-03
Explore how these related GPUs stack up for local inference workloads:
- NVIDIA A6000
- NVIDIA A4000
- NVIDIA RTX 6000 Ada
- RTX 4090
- RTX 4080