Quick Answer: The NVIDIA A5000 offers 24GB of VRAM and starts around $2,089. It delivers approximately 89 tokens/sec on TinyLlama/TinyLlama-1.1B-Chat-v1.0 and typically draws 230W under load.
This GPU offers reliable throughput for local AI workloads. Pair it with the right model quantization to hit your desired tokens/sec, and monitor prices below to catch the best deal.
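As a rough illustration of matching quantization to a card's memory, here is a minimal Python sketch that picks the largest quantization whose estimated footprint fits in free VRAM. It assumes PyTorch with a CUDA device; the bytes-per-parameter table and the 1.2x overhead factor (KV cache, activations) are rule-of-thumb assumptions, and `pick_quantization` is a hypothetical helper, not part of any library.

```python
# Minimal sketch: choose the largest quantization that fits in free VRAM.
# Assumption: ~2 bytes/param at FP16, ~1 at Q8, ~0.5 at Q4, plus ~20% overhead
# for KV cache and activations. These are rules of thumb, not measurements.
import torch

BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def pick_quantization(num_params: float, overhead: float = 1.2) -> str | None:
    free_bytes, _total = torch.cuda.mem_get_info()  # (free, total) on current device
    for quant in ("fp16", "q8", "q4"):  # try highest precision first
        if num_params * BYTES_PER_PARAM[quant] * overhead <= free_bytes:
            return quant
    return None  # nothing fits: offload layers to CPU or pick a smaller model

print(pick_quantization(8e9))  # e.g. an 8B-parameter model
```

On a 24GB A5000, an 8B model would come back as "fp16" (roughly 19GB with overhead), while a 32B model would fall through to "q4".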
| Model | Quantization | Tokens/sec | VRAM used |
|---|---|---|---|
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q4 | 88.66 tok/s | 1GB |
| meta-llama/Llama-3.2-1B | Q4 | 87.58 tok/s | 1GB |
| apple/OpenELM-1_1B-Instruct | Q4 | 85.39 tok/s | 1GB |
| meta-llama/Llama-Guard-3-1B | Q4 | 82.88 tok/s | 1GB |
| allenai/OLMo-2-0425-1B | Q4 | 82.40 tok/s | 1GB |
| google/gemma-3-1b-it | Q4 | 80.07 tok/s | 1GB |
| unsloth/Llama-3.2-1B-Instruct | Q4 | 79.20 tok/s | 1GB |
| unsloth/gemma-3-1b-it | Q4 | 78.12 tok/s | 1GB |
| meta-llama/Llama-3.2-1B-Instruct | Q4 | 77.52 tok/s | 1GB |
| ibm-granite/granite-3.3-2b-instruct | Q4 | 66.98 tok/s | 1GB |
| LiquidAI/LFM2-1.2B | Q4 | 66.75 tok/s | 1GB |
| google/gemma-2b | Q4 | 66.59 tok/s | 1GB |
| meta-llama/Llama-3.2-1B | Q8 | 60.77 tok/s | 1GB |
| unsloth/Llama-3.2-3B-Instruct | Q4 | 60.06 tok/s | 2GB |
| google/gemma-3-1b-it | Q8 | 60.06 tok/s | 1GB |
| google/gemma-2-2b-it | Q4 | 59.81 tok/s | 1GB |
| meta-llama/Llama-3.2-1B-Instruct | Q8 | 59.70 tok/s | 1GB |
| bigcode/starcoder2-3b | Q4 | 59.13 tok/s | 2GB |
| unsloth/gemma-3-1b-it | Q8 | 58.28 tok/s | 1GB |
| Qwen/Qwen2.5-3B-Instruct | Q4 | 57.18 tok/s | 2GB |
| Qwen/Qwen2.5-3B | Q4 | 57.14 tok/s | 2GB |
| ibm-research/PowerMoE-3b | Q4 | 57.04 tok/s | 2GB |
| google-t5/t5-3b | Q4 | 56.81 tok/s | 2GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q4 | 56.35 tok/s | 2GB |
| unsloth/Llama-3.2-1B-Instruct | Q8 | 56.18 tok/s | 1GB |
| meta-llama/Llama-Guard-3-1B | Q8 | 56.14 tok/s | 1GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q8 | 55.32 tok/s | 1GB |
| allenai/OLMo-2-0425-1B | Q8 | 54.50 tok/s | 1GB |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | Q4 | 53.91 tok/s | 2GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | Q4 | 53.76 tok/s | 2GB |
| apple/OpenELM-1_1B-Instruct | Q8 | 53.48 tok/s | 1GB |
| meta-llama/Llama-3.2-3B | Q4 | 52.84 tok/s | 2GB |
| Qwen/Qwen3-4B-Base | Q4 | 52.13 tok/s | 2GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q4 | 51.62 tok/s | 2GB |
| Qwen/Qwen3-4B-Thinking-2507 | Q4 | 51.43 tok/s | 2GB |
| Qwen/Qwen3-Embedding-4B | Q4 | 51.10 tok/s | 2GB |
| inference-net/Schematron-3B | Q4 | 51.07 tok/s | 2GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q4 | 50.60 tok/s | 2GB |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | 50.05 tok/s | 2GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q4 | 49.65 tok/s | 2GB |
| Qwen/Qwen2.5-1.5B | Q4 | 49.63 tok/s | 3GB |
| Qwen/Qwen2.5-0.5B | Q4 | 49.19 tok/s | 3GB |
| Qwen/Qwen2.5-Coder-1.5B | Q4 | 48.94 tok/s | 3GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q4 | 48.39 tok/s | 3GB |
| Qwen/Qwen2.5-1.5B-Instruct | Q4 | 48.18 tok/s | 3GB |
| Qwen/Qwen3-4B | Q4 | 48.10 tok/s | 2GB |
| google/gemma-2-2b-it | Q8 | 48.02 tok/s | 2GB |
| microsoft/VibeVoice-1.5B | Q4 | 47.91 tok/s | 3GB |
| Qwen/Qwen3-Embedding-0.6B | Q4 | 46.23 tok/s | 3GB |
| ibm-granite/granite-3.3-2b-instruct | Q8 | 46.02 tok/s | 2GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q4 | 46.00 tok/s | 2GB |
| Gensyn/Qwen2.5-0.5B-Instruct | Q4 | 44.81 tok/s | 3GB |
| Qwen/Qwen2.5-0.5B-Instruct | Q4 | 44.59 tok/s | 3GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q4 | 44.55 tok/s | 2GB |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q4 | 44.49 tok/s | 3GB |
| Qwen/Qwen2-1.5B-Instruct | Q4 | 44.30 tok/s | 3GB |
| Qwen/Qwen3-0.6B | Q4 | 44.17 tok/s | 3GB |
| Qwen/Qwen3-0.6B-Base | Q4 | 44.16 tok/s | 3GB |
| Qwen/Qwen2-0.5B-Instruct | Q4 | 44.10 tok/s | 3GB |
| Qwen/Qwen2-7B-Instruct | Q4 | 44.03 tok/s | 4GB |
| LiquidAI/LFM2-1.2B | Q8 | 43.57 tok/s | 2GB |
| Qwen/Qwen3-1.7B | Q4 | 43.44 tok/s | 4GB |
| dicta-il/dictalm2.0-instruct | Q4 | 43.32 tok/s | 4GB |
| distilbert/distilgpt2 | Q4 | 43.25 tok/s | 4GB |
| microsoft/phi-4 | Q4 | 43.23 tok/s | 4GB |
| microsoft/DialoGPT-medium | Q4 | 43.18 tok/s | 4GB |
| huggyllama/llama-7b | Q4 | 43.10 tok/s | 4GB |
| hmellor/tiny-random-LlamaForCausalLM | Q4 | 43.00 tok/s | 4GB |
| zai-org/GLM-4.6-FP8 | Q4 | 43.00 tok/s | 4GB |
| sshleifer/tiny-gpt2 | Q4 | 42.94 tok/s | 4GB |
| petals-team/StableBeluga2 | Q4 | 42.82 tok/s | 4GB |
| trl-internal-testing/tiny-random-LlamaForCausalLM | Q4 | 42.82 tok/s | 4GB |
| zai-org/GLM-4.5-Air | Q4 | 42.63 tok/s | 4GB |
| meta-llama/Llama-2-7b-chat-hf | Q4 | 42.36 tok/s | 4GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q4 | 42.27 tok/s | 4GB |
| microsoft/Phi-4-mini-instruct | Q4 | 42.19 tok/s | 4GB |
| numind/NuExtract-1.5 | Q4 | 42.19 tok/s | 4GB |
| openai-community/gpt2-medium | Q4 | 42.18 tok/s | 4GB |
| mistralai/Mistral-7B-Instruct-v0.2 | Q4 | 42.11 tok/s | 4GB |
| HuggingFaceTB/SmolLM2-135M | Q4 | 42.06 tok/s | 4GB |
| google/gemma-2b | Q8 | 42.02 tok/s | 2GB |
| microsoft/Phi-3-mini-4k-instruct | Q4 | 41.97 tok/s | 4GB |
| Qwen/Qwen3-Reranker-0.6B | Q4 | 41.94 tok/s | 3GB |
| microsoft/DialoGPT-small | Q4 | 41.62 tok/s | 4GB |
| rinna/japanese-gpt-neox-small | Q4 | 41.48 tok/s | 4GB |
| openai-community/gpt2 | Q4 | 41.34 tok/s | 4GB |
| Qwen/Qwen2-0.5B | Q4 | 41.30 tok/s | 3GB |
| microsoft/Phi-3-mini-128k-instruct | Q4 | 41.29 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-0528 | Q4 | 41.28 tok/s | 4GB |
| microsoft/phi-2 | Q4 | 41.09 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q4 | 41.08 tok/s | 4GB |
| Qwen/Qwen2.5-Math-1.5B | Q4 | 40.96 tok/s | 3GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q4 | 40.87 tok/s | 4GB |
| meta-llama/Llama-3.1-8B-Instruct | Q4 | 40.81 tok/s | 4GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q4 | 40.75 tok/s | 4GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q4 | 40.67 tok/s | 4GB |
| skt/kogpt2-base-v2 | Q4 | 40.60 tok/s | 4GB |
| EleutherAI/gpt-neo-125m | Q4 | 40.53 tok/s | 4GB |
| Qwen/Qwen2.5-7B-Instruct | Q4 | 40.46 tok/s | 4GB |
| google/gemma-3-270m-it | Q4 | 40.44 tok/s | 4GB |
| openai-community/gpt2-xl | Q4 | 40.23 tok/s | 4GB |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q4 | 40.11 tok/s | 4GB |
| facebook/opt-125m | Q4 | 40.09 tok/s | 4GB |
| openai-community/gpt2-large | Q4 | 39.53 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3 | Q4 | 39.42 tok/s | 4GB |
| deepseek-ai/DeepSeek-R1 | Q4 | 39.35 tok/s | 4GB |
| meta-llama/Llama-3.2-3B | Q8 | 39.14 tok/s | 3GB |
| Qwen/Qwen3-1.7B-Base | Q4 | 39.02 tok/s | 4GB |
| mistralai/Mistral-7B-v0.1 | Q4 | 38.98 tok/s | 4GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | Q4 | 38.79 tok/s | 4GB |
| HuggingFaceH4/zephyr-7b-beta | Q4 | 38.78 tok/s | 4GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q4 | 38.61 tok/s | 4GB |
| Qwen/Qwen2.5-7B | Q4 | 38.57 tok/s | 4GB |
| microsoft/Phi-3.5-vision-instruct | Q4 | 38.51 tok/s | 4GB |
| Qwen/Qwen2.5-3B | Q8 | 38.48 tok/s | 3GB |
| liuhaotian/llava-v1.5-7b | Q4 | 38.44 tok/s | 4GB |
| Qwen/Qwen3-8B-FP8 | Q4 | 38.26 tok/s | 4GB |
| llamafactory/tiny-random-Llama-3 | Q4 | 38.25 tok/s | 4GB |
| meta-llama/Meta-Llama-3-8B-Instruct | Q4 | 38.18 tok/s | 4GB |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | Q4 | 38.12 tok/s | 4GB |
| ibm-granite/granite-docling-258M | Q4 | 37.95 tok/s | 4GB |
| mistralai/Mistral-7B-Instruct-v0.1 | Q4 | 37.90 tok/s | 4GB |
| MiniMaxAI/MiniMax-M2 | Q4 | 37.82 tok/s | 4GB |
| rednote-hilab/dots.ocr | Q4 | 37.73 tok/s | 4GB |
| google-t5/t5-3b | Q8 | 37.68 tok/s | 3GB |
| parler-tts/parler-tts-large-v1 | Q4 | 37.63 tok/s | 4GB |
| swiss-ai/Apertus-8B-Instruct-2509 | Q4 | 37.49 tok/s | 4GB |
| vikhyatk/moondream2 | Q4 | 37.48 tok/s | 4GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q4 | 37.35 tok/s | 4GB |
| meta-llama/Llama-2-7b-hf | Q4 | 37.32 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3-0324 | Q4 | 37.31 tok/s | 4GB |
| deepseek-ai/DeepSeek-V3.1 | Q4 | 37.30 tok/s | 4GB |
| ibm-granite/granite-3.3-8b-instruct | Q4 | 37.29 tok/s | 4GB |
| HuggingFaceTB/SmolLM-135M | Q4 | 37.23 tok/s | 4GB |
| BSC-LT/salamandraTA-7b-instruct | Q4 | 37.18 tok/s | 4GB |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | 36.99 tok/s | 3GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | Q8 | 36.95 tok/s | 4GB |
| EleutherAI/pythia-70m-deduped | Q4 | 36.92 tok/s | 4GB |
| unsloth/Llama-3.2-3B-Instruct | Q8 | 36.89 tok/s | 3GB |
| Qwen/Qwen2.5-Coder-7B-Instruct | Q4 | 36.83 tok/s | 4GB |
| microsoft/Phi-3.5-mini-instruct | Q4 | 36.81 tok/s | 4GB |
| microsoft/Phi-4-multimodal-instruct | Q4 | 36.61 tok/s | 4GB |
| lmsys/vicuna-7b-v1.5 | Q4 | 36.55 tok/s | 4GB |
| Qwen/Qwen3-Embedding-8B | Q4 | 36.48 tok/s | 4GB |
| Qwen/Qwen3-4B-Thinking-2507 | Q8 | 36.43 tok/s | 4GB |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q4 | 36.33 tok/s | 4GB |
| meta-llama/Llama-3.1-8B | Q4 | 36.27 tok/s | 4GB |
| bigcode/starcoder2-3b | Q8 | 36.27 tok/s | 3GB |
| bigscience/bloomz-560m | Q4 | 36.26 tok/s | 4GB |
| IlyaGusev/saiga_llama3_8b | Q4 | 36.24 tok/s | 4GB |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | Q8 | 36.09 tok/s | 4GB |
| Qwen/Qwen3-8B | Q4 | 35.85 tok/s | 4GB |
| GSAI-ML/LLaDA-8B-Instruct | Q4 | 35.65 tok/s | 4GB |
| NousResearch/Meta-Llama-3.1-8B-Instruct | Q4 | 35.35 tok/s | 4GB |
| Qwen/Qwen3-Embedding-4B | Q8 | 35.28 tok/s | 4GB |
| Qwen/Qwen3-8B-Base | Q4 | 35.26 tok/s | 4GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q8 | 34.99 tok/s | 3GB |
| ibm-research/PowerMoE-3b | Q8 | 34.99 tok/s | 3GB |
| inference-net/Schematron-3B | Q8 | 34.74 tok/s | 3GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q8 | 34.66 tok/s | 3GB |
| meta-llama/Llama-Guard-3-8B | Q4 | 34.63 tok/s | 4GB |
| Qwen/Qwen2.5-3B-Instruct | Q8 | 34.62 tok/s | 3GB |
| meta-llama/Meta-Llama-3-8B | Q4 | 34.55 tok/s | 4GB |
| GSAI-ML/LLaDA-8B-Base | Q4 | 34.51 tok/s | 4GB |
| Qwen/Qwen3-4B | Q8 | 34.38 tok/s | 4GB |
| meta-llama/Llama-2-13b-chat-hf | Q4 | 34.33 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q8 | 34.04 tok/s | 5GB |
| Qwen/Qwen2.5-0.5B-Instruct | Q8 | 34.02 tok/s | 5GB |
| Qwen/Qwen2-0.5B-Instruct | Q8 | 33.98 tok/s | 5GB |
| Qwen/Qwen3-4B-Instruct-2507 | Q8 | 33.74 tok/s | 4GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q8 | 33.60 tok/s | 4GB |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Q4 | 33.48 tok/s | 5GB |
| Gensyn/Qwen2.5-0.5B-Instruct | Q8 | 33.19 tok/s | 5GB |
| Qwen/Qwen2.5-0.5B | Q8 | 33.16 tok/s | 5GB |
| Qwen/Qwen2.5-1.5B-Instruct | Q8 | 32.85 tok/s | 5GB |
| Qwen/Qwen3-4B-Base | Q8 | 32.76 tok/s | 4GB |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q8 | 32.58 tok/s | 5GB |
| Qwen/Qwen2.5-14B-Instruct | Q4 | 32.12 tok/s | 7GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q8 | 31.98 tok/s | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q8 | 31.90 tok/s | 4GB |
| microsoft/VibeVoice-1.5B | Q8 | 31.47 tok/s | 5GB |
| Qwen/Qwen2.5-1.5B | Q8 | 31.39 tok/s | 5GB |
| Qwen/Qwen3-0.6B-Base | Q8 | 31.17 tok/s | 6GB |
| ai-forever/ruGPT-3.5-13B | Q4 | 30.90 tok/s | 7GB |
| mistralai/Mistral-7B-v0.1 | Q8 | 30.69 tok/s | 7GB |
| microsoft/Phi-3.5-vision-instruct | Q8 | 30.67 tok/s | 7GB |
| openai-community/gpt2 | Q8 | 30.65 tok/s | 7GB |
| Qwen/Qwen2-0.5B | Q8 | 30.58 tok/s | 5GB |
| deepseek-ai/DeepSeek-R1 | Q8 | 30.51 tok/s | 7GB |
| Qwen/Qwen2.5-14B | Q4 | 30.30 tok/s | 7GB |
| Qwen/Qwen3-Reranker-0.6B | Q8 | 30.14 tok/s | 6GB |
| Qwen/Qwen2.5-Math-1.5B | Q8 | 30.10 tok/s | 5GB |
| llamafactory/tiny-random-Llama-3 | Q8 | 29.98 tok/s | 7GB |
| mistralai/Mistral-7B-Instruct-v0.1 | Q8 | 29.90 tok/s | 7GB |
| distilbert/distilgpt2 | Q8 | 29.78 tok/s | 7GB |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | Q8 | 29.70 tok/s | 7GB |
| microsoft/phi-2 | Q8 | 29.67 tok/s | 7GB |
| BSC-LT/salamandraTA-7b-instruct | Q8 | 29.66 tok/s | 7GB |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q8 | 29.60 tok/s | 7GB |
| Qwen/Qwen3-Embedding-0.6B | Q8 | 29.59 tok/s | 6GB |
| skt/kogpt2-base-v2 | Q8 | 29.58 tok/s | 7GB |
| HuggingFaceTB/SmolLM2-135M | Q8 | 29.50 tok/s | 7GB |
| Qwen/Qwen3-0.6B | Q8 | 29.50 tok/s | 6GB |
| rednote-hilab/dots.ocr | Q8 | 29.45 tok/s | 7GB |
| Qwen/Qwen2.5-Coder-1.5B | Q8 | 29.43 tok/s | 5GB |
| Qwen/Qwen2-1.5B-Instruct | Q8 | 29.23 tok/s | 5GB |
| mistralai/Mistral-7B-Instruct-v0.2 | Q8 | 29.18 tok/s | 7GB |
| Qwen/Qwen3-8B-Base | Q8 | 29.07 tok/s | 8GB |
| bigscience/bloomz-560m | Q8 | 29.04 tok/s | 7GB |
| numind/NuExtract-1.5 | Q8 | 28.97 tok/s | 7GB |
| MiniMaxAI/MiniMax-M2 | Q8 | 28.87 tok/s | 7GB |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | 28.85 tok/s | 8GB |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q8 | 28.81 tok/s | 7GB |
| trl-internal-testing/tiny-random-LlamaForCausalLM | Q8 | 28.76 tok/s | 7GB |
| openai-community/gpt2-large | Q8 | 28.73 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-0528 | Q8 | 28.72 tok/s | 7GB |
| dicta-il/dictalm2.0-instruct | Q8 | 28.66 tok/s | 7GB |
| ibm-granite/granite-docling-258M | Q8 | 28.66 tok/s | 7GB |
| parler-tts/parler-tts-large-v1 | Q8 | 28.61 tok/s | 7GB |
| OpenPipe/Qwen3-14B-Instruct | Q4 | 28.55 tok/s | 7GB |
| meta-llama/Meta-Llama-3-8B | Q8 | 28.53 tok/s | 8GB |
| HuggingFaceH4/zephyr-7b-beta | Q8 | 28.52 tok/s | 7GB |
| Qwen/Qwen2.5-7B-Instruct | Q8 | 28.50 tok/s | 7GB |
| Qwen/Qwen3-14B | Q4 | 28.45 tok/s | 7GB |
| huggyllama/llama-7b | Q8 | 28.36 tok/s | 7GB |
| Qwen/Qwen3-14B-Base | Q4 | 28.32 tok/s | 7GB |
| EleutherAI/gpt-neo-125m | Q8 | 28.31 tok/s | 7GB |
| microsoft/Phi-3.5-mini-instruct | Q8 | 28.22 tok/s | 7GB |
| lmsys/vicuna-7b-v1.5 | Q8 | 28.20 tok/s | 7GB |
| EleutherAI/pythia-70m-deduped | Q8 | 28.17 tok/s | 7GB |
| microsoft/Phi-3-mini-128k-instruct | Q8 | 28.12 tok/s | 7GB |
| rinna/japanese-gpt-neox-small | Q8 | 27.99 tok/s | 7GB |
| GSAI-ML/LLaDA-8B-Instruct | Q8 | 27.88 tok/s | 8GB |
| meta-llama/Llama-2-7b-hf | Q8 | 27.83 tok/s | 7GB |
| hmellor/tiny-random-LlamaForCausalLM | Q8 | 27.83 tok/s | 7GB |
| meta-llama/Meta-Llama-3-8B-Instruct | Q8 | 27.81 tok/s | 8GB |
| microsoft/Phi-4-multimodal-instruct | Q8 | 27.81 tok/s | 7GB |
| openai-community/gpt2-xl | Q8 | 27.76 tok/s | 7GB |
| HuggingFaceTB/SmolLM-135M | Q8 | 27.57 tok/s | 7GB |
| meta-llama/Llama-2-7b-chat-hf | Q8 | 27.53 tok/s | 7GB |
| meta-llama/Llama-3.1-8B | Q8 | 27.47 tok/s | 8GB |
| microsoft/DialoGPT-small | Q8 | 27.46 tok/s | 7GB |
| Qwen/Qwen3-8B | Q8 | 27.37 tok/s | 8GB |
| facebook/opt-125m | Q8 | 27.34 tok/s | 7GB |
| petals-team/StableBeluga2 | Q8 | 27.33 tok/s | 7GB |
| microsoft/Phi-3-mini-4k-instruct | Q8 | 27.31 tok/s | 7GB |
| zai-org/GLM-4.6-FP8 | Q8 | 27.30 tok/s | 7GB |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q8 | 27.24 tok/s | 7GB |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q8 | 27.02 tok/s | 7GB |
| Qwen/Qwen3-Embedding-8B | Q8 | 27.01 tok/s | 8GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q8 | 26.79 tok/s | 8GB |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | Q4 | 26.75 tok/s | 10GB |
| openai-community/gpt2-medium | Q8 | 26.74 tok/s | 7GB |
| deepseek-ai/DeepSeek-V3.1 | Q8 | 26.71 tok/s | 7GB |
| Qwen/Qwen3-1.7B | Q8 | 26.68 tok/s | 7GB |
| Qwen/Qwen2.5-Coder-7B-Instruct | Q8 | 26.67 tok/s | 7GB |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Q8 | 26.67 tok/s | 9GB |
| microsoft/phi-4 | Q8 | 26.66 tok/s | 7GB |
| sshleifer/tiny-gpt2 | Q8 | 26.34 tok/s | 7GB |
| liuhaotian/llava-v1.5-7b | Q8 | 26.33 tok/s | 7GB |
| deepseek-ai/DeepSeek-V3 | Q8 | 26.30 tok/s | 7GB |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q8 | 26.11 tok/s | 8GB |
| deepseek-ai/DeepSeek-V3-0324 | Q8 | 26.10 tok/s | 7GB |
| Qwen/Qwen3-1.7B-Base | Q8 | 26.09 tok/s | 7GB |
| vikhyatk/moondream2 | Q8 | 25.98 tok/s | 7GB |
| openai/gpt-oss-20b | Q4 | 25.96 tok/s | 10GB |
| meta-llama/Llama-Guard-3-8B | Q8 | 25.81 tok/s | 8GB |
| NousResearch/Meta-Llama-3.1-8B-Instruct | Q8 | 25.73 tok/s | 8GB |
| google/gemma-3-270m-it | Q8 | 25.70 tok/s | 7GB |
| swiss-ai/Apertus-8B-Instruct-2509 | Q8 | 25.59 tok/s | 8GB |
| microsoft/DialoGPT-medium | Q8 | 25.55 tok/s | 7GB |
| zai-org/GLM-4.5-Air | Q8 | 25.50 tok/s | 7GB |
| Qwen/Qwen2-7B-Instruct | Q8 | 25.40 tok/s | 7GB |
| microsoft/Phi-4-mini-instruct | Q8 | 25.31 tok/s | 7GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q8 | 25.29 tok/s | 7GB |
| Qwen/Qwen2.5-7B | Q8 | 25.28 tok/s | 7GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | Q8 | 25.09 tok/s | 8GB |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q8 | 24.95 tok/s | 8GB |
| unsloth/gpt-oss-20b-BF16 | Q4 | 24.87 tok/s | 10GB |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q8 | 24.74 tok/s | 8GB |
| IlyaGusev/saiga_llama3_8b | Q8 | 24.57 tok/s | 8GB |
| GSAI-ML/LLaDA-8B-Base | Q8 | 24.31 tok/s | 8GB |
| Qwen/Qwen3-8B-FP8 | Q8 | 24.31 tok/s | 8GB |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | Q4 | 24.25 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B | Q4 | 24.21 tok/s | 15GB |
| ibm-granite/granite-3.3-8b-instruct | Q8 | 24.20 tok/s | 8GB |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q4 | 24.04 tok/s | 10GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | Q4 | 23.69 tok/s | 15GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Q4 | 23.60 tok/s | 16GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Q4 | 23.59 tok/s | 15GB |
| Qwen/Qwen2.5-14B | Q8 | 23.57 tok/s | 14GB |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q4 | 23.55 tok/s | 15GB |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | Q4 | 23.01 tok/s | 15GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | Q4 | 23.01 tok/s | 15GB |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q4 | 22.18 tok/s | 17GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | Q4 | 22.13 tok/s | 15GB |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | Q4 | 21.70 tok/s | 15GB |
| OpenPipe/Qwen3-14B-Instruct | Q8 | 21.67 tok/s | 14GB |
| ai-forever/ruGPT-3.5-13B | Q8 | 21.67 tok/s | 13GB |
| Qwen/Qwen2.5-32B | Q4 | 21.40 tok/s | 16GB |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q4 | 21.05 tok/s | 16GB |
| meta-llama/Llama-2-13b-chat-hf | Q8 | 20.88 tok/s | 13GB |
| Qwen/Qwen3-14B | Q8 | 20.85 tok/s | 14GB |
| baichuan-inc/Baichuan-M2-32B | Q4 | 20.74 tok/s | 16GB |
| Qwen/Qwen3-32B | Q4 | 20.56 tok/s | 16GB |
| unsloth/gpt-oss-20b-BF16 | Q8 | 20.46 tok/s | 20GB |
| Qwen/Qwen2.5-32B-Instruct | Q4 | 20.19 tok/s | 16GB |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | Q8 | 19.96 tok/s | 20GB |
| openai/gpt-oss-20b | Q8 | 19.71 tok/s | 20GB |
| Qwen/Qwen3-14B-Base | Q8 | 19.70 tok/s | 14GB |
| Qwen/Qwen2.5-14B-Instruct | Q8 | 19.41 tok/s | 14GB |
| codellama/CodeLlama-34b-hf | Q4 | 19.36 tok/s | 17GB |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q8 | 16.92 tok/s | 20GB |
Note: All speeds shown are estimates from auto-generated benchmarks, not measured results; real-world performance may vary.
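The compatibility table below applies a simple fit check against the A5000's 24GB: a model is marked "Not supported" when its estimated VRAM requirement exceeds the card's capacity. Here is a minimal sketch of that logic, assuming the common rule of thumb of roughly one byte per parameter at Q8 and half that at Q4; the site's exact methodology may differ.

```python
# Sketch of the fit verdict used in the table below, under rule-of-thumb
# assumptions: Q8 ~ 1.0 byte/param, Q4 ~ 0.5 byte/param. Illustrative only.
import math

GPU_VRAM_GB = 24  # NVIDIA A5000

def vram_needed_gb(params_billion: float, quant: str) -> int:
    bytes_per_param = {"Q8": 1.0, "Q4": 0.5}[quant]
    return math.ceil(params_billion * bytes_per_param)

def verdict(params_billion: float, quant: str) -> str:
    needed = vram_needed_gb(params_billion, quant)
    return "Fits comfortably" if needed <= GPU_VRAM_GB else "Not supported"

# A 32B model matches the table: Q8 needs ~32GB (too big), Q4 ~16GB (fits).
print(verdict(32, "Q8"))  # Not supported
print(verdict(32, "Q4"))  # Fits comfortably
```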
| Model | Quantization | Verdict | Estimated speed | VRAM needed |
|---|---|---|---|---|
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | Q8 | Not supported | — | 80GB (have 24GB) |
| Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | Q4 | Not supported | — | 40GB (have 24GB) |
| ai-forever/ruGPT-3.5-13B | Q8 | Fits comfortably | 21.67 tok/s | 13GB (have 24GB) |
| ai-forever/ruGPT-3.5-13B | Q4 | Fits comfortably | 30.90 tok/s | 7GB (have 24GB) |
| baichuan-inc/Baichuan-M2-32B | Q8 | Not supported | — | 32GB (have 24GB) |
| baichuan-inc/Baichuan-M2-32B | Q4 | Fits comfortably | 20.74 tok/s | 16GB (have 24GB) |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | Q8 | Fits comfortably | 29.70 tok/s | 7GB (have 24GB) |
| HuggingFaceM4/tiny-random-LlamaForCausalLM | Q4 | Fits comfortably | 38.12 tok/s | 4GB (have 24GB) |
| ibm-granite/granite-3.3-8b-instruct | Q8 | Fits comfortably | 24.20 tok/s | 8GB (have 24GB) |
| ibm-granite/granite-3.3-8b-instruct | Q4 | Fits comfortably | 37.29 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-1.7B-Base | Q8 | Fits comfortably | 26.09 tok/s | 7GB (have 24GB) |
| Qwen/Qwen3-1.7B-Base | Q4 | Fits comfortably | 39.02 tok/s | 4GB (have 24GB) |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q8 | Fits comfortably | 16.92 tok/s | 20GB (have 24GB) |
| unsloth/gpt-oss-20b-unsloth-bnb-4bit | Q4 | Fits comfortably | 24.04 tok/s | 10GB (have 24GB) |
| BSC-LT/salamandraTA-7b-instruct | Q8 | Fits comfortably | 29.66 tok/s | 7GB (have 24GB) |
| BSC-LT/salamandraTA-7b-instruct | Q4 | Fits comfortably | 37.18 tok/s | 4GB (have 24GB) |
| dicta-il/dictalm2.0-instruct | Q8 | Fits comfortably | 28.66 tok/s | 7GB (have 24GB) |
| dicta-il/dictalm2.0-instruct | Q4 | Fits comfortably | 43.32 tok/s | 4GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | Q8 | Not supported | — | 30GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-6bit | Q4 | Fits comfortably | 21.70 tok/s | 15GB (have 24GB) |
| GSAI-ML/LLaDA-8B-Base | Q8 | Fits comfortably | 24.31 tok/s | 8GB (have 24GB) |
| GSAI-ML/LLaDA-8B-Base | Q4 | Fits comfortably | 34.51 tok/s | 4GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | Q8 | Not supported | — | 30GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-8bit | Q4 | Fits comfortably | 23.01 tok/s | 15GB (have 24GB) |
| Qwen/Qwen2-0.5B-Instruct | Q8 | Fits comfortably | 33.98 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2-0.5B-Instruct | Q4 | Fits comfortably | 44.10 tok/s | 3GB (have 24GB) |
| deepseek-ai/DeepSeek-V3 | Q8 | Fits comfortably | 26.30 tok/s | 7GB (have 24GB) |
| deepseek-ai/DeepSeek-V3 | Q4 | Fits comfortably | 39.42 tok/s | 4GB (have 24GB) |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q8 | Fits comfortably | 32.58 tok/s | 5GB (have 24GB) |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | Q4 | Fits comfortably | 44.49 tok/s | 3GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | Q8 | Not supported | — | 30GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-5bit | Q4 | Fits comfortably | 23.69 tok/s | 15GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | Q8 | Not supported | — | 30GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Thinking-2507 | Q4 | Fits comfortably | 23.01 tok/s | 15GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | Q8 | Not supported | — | 30GB (have 24GB) |
| lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | Q4 | Fits comfortably | 22.13 tok/s | 15GB (have 24GB) |
| AI-MO/Kimina-Prover-72B | Q8 | Not supported | — | 72GB (have 24GB) |
| AI-MO/Kimina-Prover-72B | Q4 | Not supported | — | 36GB (have 24GB) |
| apple/OpenELM-1_1B-Instruct | Q8 | Fits comfortably | 53.48 tok/s | 1GB (have 24GB) |
| apple/OpenELM-1_1B-Instruct | Q4 | Fits comfortably | 85.39 tok/s | 1GB (have 24GB) |
| NousResearch/Meta-Llama-3.1-8B-Instruct | Q8 | Fits comfortably | 25.73 tok/s | 8GB (have 24GB) |
| NousResearch/Meta-Llama-3.1-8B-Instruct | Q4 | Fits comfortably | 35.35 tok/s | 4GB (have 24GB) |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Q8 | Fits comfortably | 26.67 tok/s | 9GB (have 24GB) |
| nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Q4 | Fits comfortably | 33.48 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2.5-3B | Q8 | Fits comfortably | 38.48 tok/s | 3GB (have 24GB) |
| Qwen/Qwen2.5-3B | Q4 | Fits comfortably | 57.14 tok/s | 2GB (have 24GB) |
| lmsys/vicuna-7b-v1.5 | Q8 | Fits comfortably | 28.20 tok/s | 7GB (have 24GB) |
| lmsys/vicuna-7b-v1.5 | Q4 | Fits comfortably | 36.55 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-2-13b-chat-hf | Q8 | Fits comfortably | 20.88 tok/s | 13GB (have 24GB) |
| meta-llama/Llama-2-13b-chat-hf | Q4 | Fits comfortably | 34.33 tok/s | 7GB (have 24GB) |
| Qwen/Qwen3-Next-80B-A3B-Thinking | Q8 | Not supported | — | 80GB (have 24GB) |
| Qwen/Qwen3-Next-80B-A3B-Thinking | Q4 | Not supported | — | 40GB (have 24GB) |
| unsloth/gemma-3-1b-it | Q8 | Fits comfortably | 58.28 tok/s | 1GB (have 24GB) |
| unsloth/gemma-3-1b-it | Q4 | Fits comfortably | 78.12 tok/s | 1GB (have 24GB) |
| bigcode/starcoder2-3b | Q8 | Fits comfortably | 36.27 tok/s | 3GB (have 24GB) |
| bigcode/starcoder2-3b | Q4 | Fits comfortably | 59.13 tok/s | 2GB (have 24GB) |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | Q8 | Not supported | — | 80GB (have 24GB) |
| Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 | Q4 | Not supported | — | 40GB (have 24GB) |
| ibm-granite/granite-docling-258M | Q8 | Fits comfortably | 28.66 tok/s | 7GB (have 24GB) |
| ibm-granite/granite-docling-258M | Q4 | Fits comfortably | 37.95 tok/s | 4GB (have 24GB) |
| skt/kogpt2-base-v2 | Q8 | Fits comfortably | 29.58 tok/s | 7GB (have 24GB) |
| skt/kogpt2-base-v2 | Q4 | Fits comfortably | 40.60 tok/s | 4GB (have 24GB) |
| google/gemma-3-270m-it | Q8 | Fits comfortably | 25.70 tok/s | 7GB (have 24GB) |
| google/gemma-3-270m-it | Q4 | Fits comfortably | 40.44 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q8 | Fits comfortably | 33.60 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q4 | Fits comfortably | 46.00 tok/s | 2GB (have 24GB) |
| Qwen/Qwen2.5-32B | Q8 | Not supported | — | 32GB (have 24GB) |
| Qwen/Qwen2.5-32B | Q4 | Fits comfortably | 21.40 tok/s | 16GB (have 24GB) |
| parler-tts/parler-tts-large-v1 | Q8 | Fits comfortably | 28.61 tok/s | 7GB (have 24GB) |
| parler-tts/parler-tts-large-v1 | Q4 | Fits comfortably | 37.63 tok/s | 4GB (have 24GB) |
| EleutherAI/pythia-70m-deduped | Q8 | Fits comfortably | 28.17 tok/s | 7GB (have 24GB) |
| EleutherAI/pythia-70m-deduped | Q4 | Fits comfortably | 36.92 tok/s | 4GB (have 24GB) |
| microsoft/VibeVoice-1.5B | Q8 | Fits comfortably | 31.47 tok/s | 5GB (have 24GB) |
| microsoft/VibeVoice-1.5B | Q4 | Fits comfortably | 47.91 tok/s | 3GB (have 24GB) |
| ibm-granite/granite-3.3-2b-instruct | Q8 | Fits comfortably | 46.02 tok/s | 2GB (have 24GB) |
| ibm-granite/granite-3.3-2b-instruct | Q4 | Fits comfortably | 66.98 tok/s | 1GB (have 24GB) |
| Qwen/Qwen2.5-72B-Instruct | Q8 | Not supported | — | 72GB (have 24GB) |
| Qwen/Qwen2.5-72B-Instruct | Q4 | Not supported | — | 36GB (have 24GB) |
| liuhaotian/llava-v1.5-7b | Q8 | Fits comfortably | 26.33 tok/s | 7GB (have 24GB) |
| liuhaotian/llava-v1.5-7b | Q4 | Fits comfortably | 38.44 tok/s | 4GB (have 24GB) |
| google/gemma-2b | Q8 | Fits comfortably | 42.02 tok/s | 2GB (have 24GB) |
| google/gemma-2b | Q4 | Fits comfortably | 66.59 tok/s | 1GB (have 24GB) |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q8 | Fits comfortably | 29.60 tok/s | 7GB (have 24GB) |
| trl-internal-testing/tiny-LlamaForCausalLM-3.2 | Q4 | Fits comfortably | 40.11 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-235B-A22B | Q8 | Not supported | — | 235GB (have 24GB) |
| Qwen/Qwen3-235B-A22B | Q4 | Not supported | — | 118GB (have 24GB) |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | Q8 | Fits comfortably | 25.09 tok/s | 8GB (have 24GB) |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-8bit | Q4 | Fits comfortably | 38.79 tok/s | 4GB (have 24GB) |
| microsoft/Phi-4-mini-instruct | Q8 | Fits comfortably | 25.31 tok/s | 7GB (have 24GB) |
| microsoft/Phi-4-mini-instruct | Q4 | Fits comfortably | 42.19 tok/s | 4GB (have 24GB) |
| llamafactory/tiny-random-Llama-3 | Q8 | Fits comfortably | 29.98 tok/s | 7GB (have 24GB) |
| llamafactory/tiny-random-Llama-3 | Q4 | Fits comfortably | 38.25 tok/s | 4GB (have 24GB) |
| HuggingFaceH4/zephyr-7b-beta | Q8 | Fits comfortably | 28.52 tok/s | 7GB (have 24GB) |
| HuggingFaceH4/zephyr-7b-beta | Q4 | Fits comfortably | 38.78 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-4B-Thinking-2507 | Q8 | Fits comfortably | 36.43 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-4B-Thinking-2507 | Q4 | Fits comfortably | 51.43 tok/s | 2GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Q8 | Not supported | — | 30GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 | Q4 | Fits comfortably | 23.59 tok/s | 15GB (have 24GB) |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q8 | Fits comfortably | 24.74 tok/s | 8GB (have 24GB) |
| lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit | Q4 | Fits comfortably | 40.67 tok/s | 4GB (have 24GB) |
| unsloth/Llama-3.2-1B-Instruct | Q8 | Fits comfortably | 56.18 tok/s | 1GB (have 24GB) |
| unsloth/Llama-3.2-1B-Instruct | Q4 | Fits comfortably | 79.20 tok/s | 1GB (have 24GB) |
| GSAI-ML/LLaDA-8B-Instruct | Q8 | Fits comfortably | 27.88 tok/s | 8GB (have 24GB) |
| GSAI-ML/LLaDA-8B-Instruct | Q4 | Fits comfortably | 35.65 tok/s | 4GB (have 24GB) |
| RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic | Q8 | Not supported | — | 90GB (have 24GB) |
| RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic | Q4 | Not supported | — | 45GB (have 24GB) |
| Qwen/Qwen2.5-Coder-7B-Instruct | Q8 | Fits comfortably | 26.67 tok/s | 7GB (have 24GB) |
| Qwen/Qwen2.5-Coder-7B-Instruct | Q4 | Fits comfortably | 36.83 tok/s | 4GB (have 24GB) |
| numind/NuExtract-1.5 | Q8 | Fits comfortably | 28.97 tok/s | 7GB (have 24GB) |
| numind/NuExtract-1.5 | Q4 | Fits comfortably | 42.19 tok/s | 4GB (have 24GB) |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q8 | Fits comfortably | 27.02 tok/s | 7GB (have 24GB) |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q4 | Fits comfortably | 36.33 tok/s | 4GB (have 24GB) |
| hmellor/tiny-random-LlamaForCausalLM | Q8 | Fits comfortably | 27.83 tok/s | 7GB (have 24GB) |
| hmellor/tiny-random-LlamaForCausalLM | Q4 | Fits comfortably | 43.00 tok/s | 4GB (have 24GB) |
| huggyllama/llama-7b | Q8 | Fits comfortably | 28.36 tok/s | 7GB (have 24GB) |
| huggyllama/llama-7b | Q4 | Fits comfortably | 43.10 tok/s | 4GB (have 24GB) |
| deepseek-ai/DeepSeek-V3-0324 | Q8 | Fits comfortably | 26.10 tok/s | 7GB (have 24GB) |
| deepseek-ai/DeepSeek-V3-0324 | Q4 | Fits comfortably | 37.31 tok/s | 4GB (have 24GB) |
| microsoft/Phi-3-mini-128k-instruct | Q8 | Fits comfortably | 28.12 tok/s | 7GB (have 24GB) |
| microsoft/Phi-3-mini-128k-instruct | Q4 | Fits comfortably | 41.29 tok/s | 4GB (have 24GB) |
| sshleifer/tiny-gpt2 | Q8 | Fits comfortably | 26.34 tok/s | 7GB (have 24GB) |
| sshleifer/tiny-gpt2 | Q4 | Fits comfortably | 42.94 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-Guard-3-8B | Q8 | Fits comfortably | 25.81 tok/s | 8GB (have 24GB) |
| meta-llama/Llama-Guard-3-8B | Q4 | Fits comfortably | 34.63 tok/s | 4GB (have 24GB) |
| openai-community/gpt2-xl | Q8 | Fits comfortably | 27.76 tok/s | 7GB (have 24GB) |
| openai-community/gpt2-xl | Q4 | Fits comfortably | 40.23 tok/s | 4GB (have 24GB) |
| OpenPipe/Qwen3-14B-Instruct | Q8 | Fits comfortably | 21.67 tok/s | 14GB (have 24GB) |
| OpenPipe/Qwen3-14B-Instruct | Q4 | Fits comfortably | 28.55 tok/s | 7GB (have 24GB) |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | Q8 | Not supported | — | 70GB (have 24GB) |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 | Q4 | Not supported | — | 35GB (have 24GB) |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | Q8 | Fits comfortably | 36.95 tok/s | 4GB (have 24GB) |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-6bit | Q4 | Fits comfortably | 53.76 tok/s | 2GB (have 24GB) |
| ibm-research/PowerMoE-3b | Q8 | Fits comfortably | 34.99 tok/s | 3GB (have 24GB) |
| ibm-research/PowerMoE-3b | Q4 | Fits comfortably | 57.04 tok/s | 2GB (have 24GB) |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q8 | Fits comfortably | 31.90 tok/s | 4GB (have 24GB) |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-8bit | Q4 | Fits comfortably | 44.55 tok/s | 2GB (have 24GB) |
| unsloth/Llama-3.2-3B-Instruct | Q8 | Fits comfortably | 36.89 tok/s | 3GB (have 24GB) |
| unsloth/Llama-3.2-3B-Instruct | Q4 | Fits comfortably | 60.06 tok/s | 2GB (have 24GB) |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q8 | Fits comfortably | 31.98 tok/s | 4GB (have 24GB) |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q4 | Fits comfortably | 51.62 tok/s | 2GB (have 24GB) |
| meta-llama/Llama-3.2-3B | Q8 | Fits comfortably | 39.14 tok/s | 3GB (have 24GB) |
| meta-llama/Llama-3.2-3B | Q4 | Fits comfortably | 52.84 tok/s | 2GB (have 24GB) |
| EleutherAI/gpt-neo-125m | Q8 | Fits comfortably | 28.31 tok/s | 7GB (have 24GB) |
| EleutherAI/gpt-neo-125m | Q4 | Fits comfortably | 40.53 tok/s | 4GB (have 24GB) |
| codellama/CodeLlama-34b-hf | Q8 | Not supported | — | 34GB (have 24GB) |
| codellama/CodeLlama-34b-hf | Q4 | Fits comfortably | 19.36 tok/s | 17GB (have 24GB) |
| meta-llama/Llama-Guard-3-1B | Q8 | Fits comfortably | 56.14 tok/s | 1GB (have 24GB) |
| meta-llama/Llama-Guard-3-1B | Q4 | Fits comfortably | 82.88 tok/s | 1GB (have 24GB) |
| Qwen/Qwen2-1.5B-Instruct | Q8 | Fits comfortably | 29.23 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2-1.5B-Instruct | Q4 | Fits comfortably | 44.30 tok/s | 3GB (have 24GB) |
| google/gemma-2-2b-it | Q8 | Fits comfortably | 48.02 tok/s | 2GB (have 24GB) |
| google/gemma-2-2b-it | Q4 | Fits comfortably | 59.81 tok/s | 1GB (have 24GB) |
| Qwen/Qwen2.5-14B | Q8 | Fits comfortably | 23.57 tok/s | 14GB (have 24GB) |
| Qwen/Qwen2.5-14B | Q4 | Fits comfortably | 30.30 tok/s | 7GB (have 24GB) |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q8 | Not supported | — | 32GB (have 24GB) |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit | Q4 | Fits comfortably | 21.05 tok/s | 16GB (have 24GB) |
| microsoft/Phi-3.5-mini-instruct | Q8 | Fits comfortably | 28.22 tok/s | 7GB (have 24GB) |
| microsoft/Phi-3.5-mini-instruct | Q4 | Fits comfortably | 36.81 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-4B-Base | Q8 | Fits comfortably | 32.76 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-4B-Base | Q4 | Fits comfortably | 52.13 tok/s | 2GB (have 24GB) |
| Qwen/Qwen2-7B-Instruct | Q8 | Fits comfortably | 25.40 tok/s | 7GB (have 24GB) |
| Qwen/Qwen2-7B-Instruct | Q4 | Fits comfortably | 44.03 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-2-7b-chat-hf | Q8 | Fits comfortably | 27.53 tok/s | 7GB (have 24GB) |
| meta-llama/Llama-2-7b-chat-hf | Q4 | Fits comfortably | 42.36 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-14B-Base | Q8 | Fits comfortably | 19.70 tok/s | 14GB (have 24GB) |
| Qwen/Qwen3-14B-Base | Q4 | Fits comfortably | 28.32 tok/s | 7GB (have 24GB) |
| swiss-ai/Apertus-8B-Instruct-2509 | Q8 | Fits comfortably | 25.59 tok/s | 8GB (have 24GB) |
| swiss-ai/Apertus-8B-Instruct-2509 | Q4 | Fits comfortably | 37.49 tok/s | 4GB (have 24GB) |
| microsoft/Phi-3.5-vision-instruct | Q8 | Fits comfortably | 30.67 tok/s | 7GB (have 24GB) |
| microsoft/Phi-3.5-vision-instruct | Q4 | Fits comfortably | 38.51 tok/s | 4GB (have 24GB) |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q8 | Fits comfortably | 28.81 tok/s | 7GB (have 24GB) |
| unsloth/mistral-7b-v0.3-bnb-4bit | Q4 | Fits comfortably | 42.27 tok/s | 4GB (have 24GB) |
| rinna/japanese-gpt-neox-small | Q8 | Fits comfortably | 27.99 tok/s | 7GB (have 24GB) |
| rinna/japanese-gpt-neox-small | Q4 | Fits comfortably | 41.48 tok/s | 4GB (have 24GB) |
| Qwen/Qwen2.5-Coder-1.5B | Q8 | Fits comfortably | 29.43 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2.5-Coder-1.5B | Q4 | Fits comfortably | 48.94 tok/s | 3GB (have 24GB) |
| IlyaGusev/saiga_llama3_8b | Q8 | Fits comfortably | 24.57 tok/s | 8GB (have 24GB) |
| IlyaGusev/saiga_llama3_8b | Q4 | Fits comfortably | 36.24 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-30B-A3B | Q8 | Not supported | — | 30GB (have 24GB) |
| Qwen/Qwen3-30B-A3B | Q4 | Fits comfortably | 24.21 tok/s | 15GB (have 24GB) |
| deepseek-ai/DeepSeek-R1 | Q8 | Fits comfortably | 30.51 tok/s | 7GB (have 24GB) |
| deepseek-ai/DeepSeek-R1 | Q4 | Fits comfortably | 39.35 tok/s | 4GB (have 24GB) |
| microsoft/DialoGPT-small | Q8 | Fits comfortably | 27.46 tok/s | 7GB (have 24GB) |
| microsoft/DialoGPT-small | Q4 | Fits comfortably | 41.62 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-8B-FP8 | Q8 | Fits comfortably | 24.31 tok/s | 8GB (have 24GB) |
| Qwen/Qwen3-8B-FP8 | Q4 | Fits comfortably | 38.26 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | Q8 | Not supported | — | 30GB (have 24GB) |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | Q4 | Fits comfortably | 24.25 tok/s | 15GB (have 24GB) |
| Qwen/Qwen3-Embedding-4B | Q8 | Fits comfortably | 35.28 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-Embedding-4B | Q4 | Fits comfortably | 51.10 tok/s | 2GB (have 24GB) |
| microsoft/Phi-4-multimodal-instruct | Q8 | Fits comfortably | 27.81 tok/s | 7GB (have 24GB) |
| microsoft/Phi-4-multimodal-instruct | Q4 | Fits comfortably | 36.61 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-8B-Base | Q8 | Fits comfortably | 29.07 tok/s | 8GB (have 24GB) |
| Qwen/Qwen3-8B-Base | Q4 | Fits comfortably | 35.26 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-0.6B-Base | Q8 | Fits comfortably | 31.17 tok/s | 6GB (have 24GB) |
| Qwen/Qwen3-0.6B-Base | Q4 | Fits comfortably | 44.16 tok/s | 3GB (have 24GB) |
| openai-community/gpt2-medium | Q8 | Fits comfortably | 26.74 tok/s | 7GB (have 24GB) |
| openai-community/gpt2-medium | Q4 | Fits comfortably | 42.18 tok/s | 4GB (have 24GB) |
| trl-internal-testing/tiny-random-LlamaForCausalLM | Q8 | Fits comfortably | 28.76 tok/s | 7GB (have 24GB) |
| trl-internal-testing/tiny-random-LlamaForCausalLM | Q4 | Fits comfortably | 42.82 tok/s | 4GB (have 24GB) |
| Qwen/Qwen2.5-Math-1.5B | Q8 | Fits comfortably | 30.10 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2.5-Math-1.5B | Q4 | Fits comfortably | 40.96 tok/s | 3GB (have 24GB) |
| HuggingFaceTB/SmolLM-135M | Q8 | Fits comfortably | 27.57 tok/s | 7GB (have 24GB) |
| HuggingFaceTB/SmolLM-135M | Q4 | Fits comfortably | 37.23 tok/s | 4GB (have 24GB) |
| unsloth/gpt-oss-20b-BF16 | Q8 | Fits comfortably | 20.46 tok/s | 20GB (have 24GB) |
| unsloth/gpt-oss-20b-BF16 | Q4 | Fits comfortably | 24.87 tok/s | 10GB (have 24GB) |
| meta-llama/Meta-Llama-3-70B-Instruct | Q8 | Not supported | — | 70GB (have 24GB) |
| meta-llama/Meta-Llama-3-70B-Instruct | Q4 | Not supported | — | 35GB (have 24GB) |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q8 | Fits comfortably | 26.11 tok/s | 8GB (have 24GB) |
| unsloth/Meta-Llama-3.1-8B-Instruct | Q4 | Fits comfortably | 40.75 tok/s | 4GB (have 24GB) |
| zai-org/GLM-4.5-Air | Q8 | Fits comfortably | 25.50 tok/s | 7GB (have 24GB) |
| zai-org/GLM-4.5-Air | Q4 | Fits comfortably | 42.63 tok/s | 4GB (have 24GB) |
| mistralai/Mistral-7B-Instruct-v0.1 | Q8 | Fits comfortably | 29.90 tok/s | 7GB (have 24GB) |
| mistralai/Mistral-7B-Instruct-v0.1 | Q4 | Fits comfortably | 37.90 tok/s | 4GB (have 24GB) |
| LiquidAI/LFM2-1.2B | Q8 | Fits comfortably | 43.57 tok/s | 2GB (have 24GB) |
| LiquidAI/LFM2-1.2B | Q4 | Fits comfortably | 66.75 tok/s | 1GB (have 24GB) |
| mistralai/Mistral-7B-v0.1 | Q8 | Fits comfortably | 30.69 tok/s | 7GB (have 24GB) |
| mistralai/Mistral-7B-v0.1 | Q4 | Fits comfortably | 38.98 tok/s | 4GB (have 24GB) |
| Qwen/Qwen2.5-32B-Instruct | Q8 | Not supported | — | 32GB (have 24GB) |
| Qwen/Qwen2.5-32B-Instruct | Q4 | Fits comfortably | 20.19 tok/s | 16GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-0528 | Q8 | Fits comfortably | 28.72 tok/s | 7GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-0528 | Q4 | Fits comfortably | 41.28 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-3.1-8B | Q8 | Fits comfortably | 27.47 tok/s | 8GB (have 24GB) |
| meta-llama/Llama-3.1-8B | Q4 | Fits comfortably | 36.27 tok/s | 4GB (have 24GB) |
| deepseek-ai/DeepSeek-V3.1 | Q8 | Fits comfortably | 26.71 tok/s | 7GB (have 24GB) |
| deepseek-ai/DeepSeek-V3.1 | Q4 | Fits comfortably | 37.30 tok/s | 4GB (have 24GB) |
| microsoft/phi-4 | Q8 | Fits comfortably | 26.66 tok/s | 7GB (have 24GB) |
| microsoft/phi-4 | Q4 | Fits comfortably | 43.23 tok/s | 4GB (have 24GB) |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q8 | Fits comfortably | 34.66 tok/s | 3GB (have 24GB) |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q4 | Fits comfortably | 56.35 tok/s | 2GB (have 24GB) |
| Qwen/Qwen2-0.5B | Q8 | Fits comfortably | 30.58 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2-0.5B | Q4 | Fits comfortably | 41.30 tok/s | 3GB (have 24GB) |
| MiniMaxAI/MiniMax-M2 | Q8 | Fits comfortably | 28.87 tok/s | 7GB (have 24GB) |
| MiniMaxAI/MiniMax-M2 | Q4 | Fits comfortably | 37.82 tok/s | 4GB (have 24GB) |
| microsoft/DialoGPT-medium | Q8 | Fits comfortably | 25.55 tok/s | 7GB (have 24GB) |
| microsoft/DialoGPT-medium | Q4 | Fits comfortably | 43.18 tok/s | 4GB (have 24GB) |
| zai-org/GLM-4.6-FP8 | Q8 | Fits comfortably | 27.30 tok/s | 7GB (have 24GB) |
| zai-org/GLM-4.6-FP8 | Q4 | Fits comfortably | 43.00 tok/s | 4GB (have 24GB) |
| HuggingFaceTB/SmolLM2-135M | Q8 | Fits comfortably | 29.50 tok/s | 7GB (have 24GB) |
| HuggingFaceTB/SmolLM2-135M | Q4 | Fits comfortably | 42.06 tok/s | 4GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q8 | Fits comfortably | 26.79 tok/s | 8GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Q4 | Fits comfortably | 41.08 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-2-7b-hf | Q8 | Fits comfortably | 27.83 tok/s | 7GB (have 24GB) |
| meta-llama/Llama-2-7b-hf | Q4 | Fits comfortably | 37.32 tok/s | 4GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q8 | Fits comfortably | 25.29 tok/s | 7GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | Q4 | Fits comfortably | 40.87 tok/s | 4GB (have 24GB) |
| microsoft/phi-2 | Q8 | Fits comfortably | 29.67 tok/s | 7GB (have 24GB) |
| microsoft/phi-2 | Q4 | Fits comfortably | 41.09 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-3.1-70B-Instruct | Q8 | Not supported | — | 70GB (have 24GB) |
| meta-llama/Llama-3.1-70B-Instruct | Q4 | Not supported | — | 35GB (have 24GB) |
| Qwen/Qwen2.5-0.5B | Q8 | Fits comfortably | 33.16 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2.5-0.5B | Q4 | Fits comfortably | 49.19 tok/s | 3GB (have 24GB) |
| Qwen/Qwen3-14B | Q8 | Fits comfortably | 20.85 tok/s | 14GB (have 24GB) |
| Qwen/Qwen3-14B | Q4 | Fits comfortably | 28.45 tok/s | 7GB (have 24GB) |
| Qwen/Qwen3-Embedding-8B | Q8 | Fits comfortably | 27.01 tok/s | 8GB (have 24GB) |
| Qwen/Qwen3-Embedding-8B | Q4 | Fits comfortably | 36.48 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-3.3-70B-Instruct | Q8 | Not supported | — | 70GB (have 24GB) |
| meta-llama/Llama-3.3-70B-Instruct | Q4 | Not supported | — | 35GB (have 24GB) |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q8 | Fits comfortably | 24.95 tok/s | 8GB (have 24GB) |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | Q4 | Fits comfortably | 37.35 tok/s | 4GB (have 24GB) |
| Qwen/Qwen2.5-14B-Instruct | Q8 | Fits comfortably | 19.41 tok/s | 14GB (have 24GB) |
| Qwen/Qwen2.5-14B-Instruct | Q4 | Fits comfortably | 32.12 tok/s | 7GB (have 24GB) |
| Qwen/Qwen2.5-1.5B | Q8 | Fits comfortably | 31.39 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2.5-1.5B | Q4 | Fits comfortably | 49.63 tok/s | 3GB (have 24GB) |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | Q8 | Fits comfortably | 36.09 tok/s | 4GB (have 24GB) |
| kaitchup/Phi-3-mini-4k-instruct-gptq-4bit | Q4 | Fits comfortably | 53.91 tok/s | 2GB (have 24GB) |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | Q8 | Fits comfortably | 19.96 tok/s | 20GB (have 24GB) |
| mlx-community/gpt-oss-20b-MXFP4-Q8 | Q4 | Fits comfortably | 26.75 tok/s | 10GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q8 | Fits comfortably | 34.04 tok/s | 5GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | Q4 | Fits comfortably | 48.39 tok/s | 3GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B-Instruct | Q8 | Fits comfortably | 27.81 tok/s | 8GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B-Instruct | Q4 | Fits comfortably | 38.18 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-Reranker-0.6B | Q8 | Fits comfortably | 30.14 tok/s | 6GB (have 24GB) |
| Qwen/Qwen3-Reranker-0.6B | Q4 | Fits comfortably | 41.94 tok/s | 3GB (have 24GB) |
| rednote-hilab/dots.ocr | Q8 | Fits comfortably | 29.45 tok/s | 7GB (have 24GB) |
| rednote-hilab/dots.ocr | Q4 | Fits comfortably | 37.73 tok/s | 4GB (have 24GB) |
| google-t5/t5-3b | Q8 | Fits comfortably | 37.68 tok/s | 3GB (have 24GB) |
| google-t5/t5-3b | Q4 | Fits comfortably | 56.81 tok/s | 2GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q8 | Not supported | — | 30GB (have 24GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q4 | Fits comfortably | 23.55 tok/s | 15GB (have 24GB) |
| Qwen/Qwen3-4B | Q8 | Fits comfortably | 34.38 tok/s | 4GB (have 24GB) |
| Qwen/Qwen3-4B | Q4 | Fits comfortably | 48.10 tok/s | 2GB (have 24GB) |
| Qwen/Qwen3-1.7B | Q8 | Fits comfortably | 26.68 tok/s | 7GB (have 24GB) |
| Qwen/Qwen3-1.7B | Q4 | Fits comfortably | 43.44 tok/s | 4GB (have 24GB) |
| openai-community/gpt2-large | Q8 | Fits comfortably | 28.73 tok/s | 7GB (have 24GB) |
| openai-community/gpt2-large | Q4 | Fits comfortably | 39.53 tok/s | 4GB (have 24GB) |
| microsoft/Phi-3-mini-4k-instruct | Q8 | Fits comfortably | 27.31 tok/s | 7GB (have 24GB) |
| microsoft/Phi-3-mini-4k-instruct | Q4 | Fits comfortably | 41.97 tok/s | 4GB (have 24GB) |
| allenai/OLMo-2-0425-1B | Q8 | Fits comfortably | 54.50 tok/s | 1GB (have 24GB) |
| allenai/OLMo-2-0425-1B | Q4 | Fits comfortably | 82.40 tok/s | 1GB (have 24GB) |
| Qwen/Qwen3-Next-80B-A3B-Instruct | Q8 | Not supported | — | 80GB (have 24GB) |
| Qwen/Qwen3-Next-80B-A3B-Instruct | Q4 | Not supported | — | 40GB (have 24GB) |
| Qwen/Qwen3-32B | Q8 | Not supported | — | 32GB (have 24GB) |
| Qwen/Qwen3-32B | Q4 | Fits comfortably | 20.56 tok/s | 16GB (have 24GB) |
| Qwen/Qwen2.5-0.5B-Instruct | Q8 | Fits comfortably | 34.02 tok/s | 5GB (have 24GB) |
| Qwen/Qwen2.5-0.5B-Instruct | Q4 | Fits comfortably | 44.59 tok/s | 3GB (have 24GB) |
| Qwen/Qwen2.5-7B | Q8 | Fits comfortably | 25.28 tok/s | 7GB (have 24GB) |
| Qwen/Qwen2.5-7B | Q4 | Fits comfortably | 38.57 tok/s | 4GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B | Q8 | Fits comfortably | 28.53 tok/s | 8GB (have 24GB) |
| meta-llama/Meta-Llama-3-8B | Q4 | Fits comfortably | 34.55 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-3.2-1B | Q8 | Fits comfortably | 60.77 tok/s | 1GB (have 24GB) |
| meta-llama/Llama-3.2-1B | Q4 | Fits comfortably | 87.58 tok/s | 1GB (have 24GB) |
| petals-team/StableBeluga2 | Q8 | Fits comfortably | 27.33 tok/s | 7GB (have 24GB) |
| petals-team/StableBeluga2 | Q4 | Fits comfortably | 42.82 tok/s | 4GB (have 24GB) |
| vikhyatk/moondream2 | Q8 | Fits comfortably | 25.98 tok/s | 7GB (have 24GB) |
| vikhyatk/moondream2 | Q4 | Fits comfortably | 37.48 tok/s | 4GB (have 24GB) |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | Fits comfortably | 36.99 tok/s | 3GB (have 24GB) |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | Fits comfortably | 50.05 tok/s | 2GB (have 24GB) |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | Q8 | Not supported | — | 70GB (have 24GB) |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | Q4 | Not supported | — | 35GB (have 24GB) |
| distilbert/distilgpt2 | Q8 | Fits comfortably | 29.78 tok/s | 7GB (have 24GB) |
| distilbert/distilgpt2 | Q4 | Fits comfortably | 43.25 tok/s | 4GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Q8 | Not supported | — | 32GB (have 24GB) |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | Q4 | Fits comfortably | 23.60 tok/s | 16GB (have 24GB) |
| inference-net/Schematron-3B | Q8 | Fits comfortably | 34.74 tok/s | 3GB (have 24GB) |
| inference-net/Schematron-3B | Q4 | Fits comfortably | 51.07 tok/s | 2GB (have 24GB) |
| Qwen/Qwen3-8B | Q8 | Fits comfortably | 27.37 tok/s | 8GB (have 24GB) |
| Qwen/Qwen3-8B | Q4 | Fits comfortably | 35.85 tok/s | 4GB (have 24GB) |
| mistralai/Mistral-7B-Instruct-v0.2 | Q8 | Fits comfortably | 29.18 tok/s | 7GB (have 24GB) |
| mistralai/Mistral-7B-Instruct-v0.2 | Q4 | Fits comfortably | 42.11 tok/s | 4GB (have 24GB) |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q8 | Fits comfortably | 34.99 tok/s | 3GB (have 24GB) |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q4 | Fits comfortably | 50.60 tok/s | 2GB (have 24GB) |
| bigscience/bloomz-560m | Q8 | Fits comfortably | 29.04 tok/s | 7GB (have 24GB) |
| bigscience/bloomz-560m | Q4 | Fits comfortably | 36.26 tok/sEstimated | 4GB (have 24GB) |
| Qwen/Qwen2.5-3B-Instruct | Q8 | Fits comfortably | 34.62 tok/sEstimated | 3GB (have 24GB) |
| Qwen/Qwen2.5-3B-Instruct | Q4 | Fits comfortably | 57.18 tok/sEstimated | 2GB (have 24GB) |
| openai/gpt-oss-120b | Q8 | Not supported | — | 120GB (have 24GB) |
| openai/gpt-oss-120b | Q4 | Not supported | — | 60GB (have 24GB) |
| meta-llama/Llama-3.2-1B-Instruct | Q8 | Fits comfortably | 59.70 tok/sEstimated | 1GB (have 24GB) |
| meta-llama/Llama-3.2-1B-Instruct | Q4 | Fits comfortably | 77.52 tok/sEstimated | 1GB (have 24GB) |
| Qwen/Qwen3-4B-Instruct-2507 | Q8 | Fits comfortably | 33.74 tok/sEstimated | 4GB (have 24GB) |
| Qwen/Qwen3-4B-Instruct-2507 | Q4 | Fits comfortably | 49.65 tok/sEstimated | 2GB (have 24GB) |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q8 | Fits comfortably | 27.24 tok/sEstimated | 7GB (have 24GB) |
| trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 | Q4 | Fits comfortably | 38.61 tok/sEstimated | 4GB (have 24GB) |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q8 | Fits comfortably | 55.32 tok/sEstimated | 1GB (have 24GB) |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q4 | Fits comfortably | 88.66 tok/sEstimated | 1GB (have 24GB) |
| facebook/opt-125m | Q8 | Fits comfortably | 27.34 tok/sEstimated | 7GB (have 24GB) |
| facebook/opt-125m | Q4 | Fits comfortably | 40.09 tok/sEstimated | 4GB (have 24GB) |
| Qwen/Qwen2.5-1.5B-Instruct | Q8 | Fits comfortably | 32.85 tok/sEstimated | 5GB (have 24GB) |
| Qwen/Qwen2.5-1.5B-Instruct | Q4 | Fits comfortably | 48.18 tok/sEstimated | 3GB (have 24GB) |
| Qwen/Qwen3-Embedding-0.6B | Q8 | Fits comfortably | 29.59 tok/sEstimated | 6GB (have 24GB) |
| Qwen/Qwen3-Embedding-0.6B | Q4 | Fits comfortably | 46.23 tok/sEstimated | 3GB (have 24GB) |
| google/gemma-3-1b-it | Q8 | Fits comfortably | 60.06 tok/sEstimated | 1GB (have 24GB) |
| google/gemma-3-1b-it | Q4 | Fits comfortably | 80.07 tok/sEstimated | 1GB (have 24GB) |
| openai/gpt-oss-20b | Q8 | Fits comfortably | 19.71 tok/sEstimated | 20GB (have 24GB) |
| openai/gpt-oss-20b | Q4 | Fits comfortably | 25.96 tok/sEstimated | 10GB (have 24GB) |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q8 | Not supported | — | 34GB (have 24GB) |
| dphn/dolphin-2.9.1-yi-1.5-34b | Q4 | Fits comfortably | 22.18 tok/sEstimated | 17GB (have 24GB) |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | Fits comfortably | 28.85 tok/sEstimated | 8GB (have 24GB) |
| meta-llama/Llama-3.1-8B-Instruct | Q4 | Fits comfortably | 40.81 tok/sEstimated | 4GB (have 24GB) |
| Gensyn/Qwen2.5-0.5B-Instruct | Q8 | Fits comfortably | 33.19 tok/sEstimated | 5GB (have 24GB) |
| Gensyn/Qwen2.5-0.5B-Instruct | Q4 | Fits comfortably | 44.81 tok/sEstimated | 3GB (have 24GB) |
| Qwen/Qwen3-0.6B | Q8 | Fits comfortably | 29.50 tok/sEstimated | 6GB (have 24GB) |
| Qwen/Qwen3-0.6B | Q4 | Fits comfortably | 44.17 tok/sEstimated | 3GB (have 24GB) |
| Qwen/Qwen2.5-7B-Instruct | Q8 | Fits comfortably | 28.50 tok/sEstimated | 7GB (have 24GB) |
| Qwen/Qwen2.5-7B-Instruct | Q4 | Fits comfortably | 40.46 tok/sEstimated | 4GB (have 24GB) |
| openai-community/gpt2 | Q8 | Fits comfortably | 30.65 tok/sEstimated | 7GB (have 24GB) |
| openai-community/gpt2 | Q4 | Fits comfortably | 41.34 tok/sEstimated | 4GB (have 24GB) |
Note: performance estimates above are calculated rather than measured, so real results may vary. The sizing rule the VRAM column appears to follow is sketched below.
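A plausible reconstruction of that rule (an assumption on our part, not a published methodology) is simply weight memory ≈ parameter count × bits per weight ÷ 8, with no runtime overhead. The numbers it produces line up with the table:

```python
# Assumed sizing rule behind the VRAM column above (a reconstruction, not the
# site's published methodology): weight memory only, no KV cache or activations.
def estimated_vram_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8  # 1B params at 8-bit is roughly 1 GB

print(estimated_vram_gb(8, 4))   # 4.0  -> matches the Meta-Llama-3-8B Q4 row
print(estimated_vram_gb(70, 4))  # 35.0 -> why the 70B Q4 rows read "Not supported" on 24GB
```

Real deployments should budget extra for the KV cache, which grows with context length and batch size, so treat these figures as lower bounds.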
Data-backed answers pulled from community benchmarks, manufacturer specs, and live pricing.
RunPod benchmarks show the 24 GB RTX A5000 pushing ~49 tok/s on Mixtral 8x7B Q2_K under Ollama, and about 38 tok/s at Q3_K_S (a reproduction sketch follows the source note below).
Source: Reddit – /r/LocalLLaMA (19428v9)
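To reproduce a figure like that yourself, you can time generation against a local Ollama server through its REST API. A minimal sketch, assuming Ollama is running on its default port and the Q2_K Mixtral tag below has already been pulled (the tag name is an assumption):

```python
import requests

# Run one non-streaming generation against the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q2_K",  # assumed tag; pull it first
        "prompt": "Explain NVLink in one paragraph.",
        "stream": False,
    },
    timeout=600,
).json()

# Ollama reports eval_count (generated tokens) and eval_duration (nanoseconds).
tok_per_s = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"{tok_per_s:.1f} tok/s")
```

Run it a few times and discard the first result, since the initial request includes model load time.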
Yes, 70B models are feasible with low-bit EXL2 quantization. Community guides note that 2.4 bpw EXL2 plus a 4-bit KV cache lets Miqu 70B run entirely within 24 GB on cards like the A5000 (a minimal loading sketch follows the source note below).
Source: Reddit – /r/LocalLLaMA (kx452no)
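A minimal loading sketch using the exllamav2 library, assuming you have a local ~2.4 bpw EXL2 export of a 70B model (the directory path is illustrative):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config("/models/miqu-70b-2.4bpw-exl2")  # assumed local path
model = ExLlamaV2(config)

# Q4 cache quantizes the KV cache to 4-bit; lazy=True defers allocation so
# load_autosplit can place layers as VRAM allows.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
print(generator.generate_simple("The A5000 is", ExLlamaV2Sampler.Settings(), 64))
```

Expect it to be tight: at 2.4 bpw there is little headroom left for long contexts, even with the quantized cache.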
Operators of quad-A5000 rigs suggest disabling NVLink peer-to-peer via NCCL environment flags when vLLM underperforms; removing the bridges from the communication path boosted throughput from ~14 tok/s to ~25 tok/s. A sketch of the flag setup follows the source note below.
Source: Reddit – /r/LocalLLaMA (n3vnbez)
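A minimal sketch of that workaround, assuming a 4-way tensor-parallel vLLM launch (the model name is illustrative). NCCL_P2P_DISABLE is a standard NCCL environment variable, and it must be set before vLLM spawns its workers so they inherit it:

```python
import os

# Force NCCL traffic through host memory instead of NVLink/PCIe peer-to-peer.
# Whether this helps is topology-dependent; set it before importing vLLM.
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM, SamplingParams  # import after setting the flag

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Benchmark with and without the flag on your own rig; the gain reported above came from one specific quad-A5000 topology.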
RTX A5000 is rated at 230 W, uses a single 8-pin connector, and NVIDIA recommends a 600 W PSU.
Source: TechPowerUp – RTX A5000 Specs
Our 3 Nov 2025 snapshot showed RTX A5000 cards around $1,699 (Amazon, in stock), $1,729 (Newegg, in stock), and $1,749 (Best Buy, in stock).
Source: Supabase price tracker snapshot – 2025-11-03
Explore how RTX 4060 Ti 16GB stacks up for local inference workloads.
Explore how RX 6800 XT stacks up for local inference workloads.
Explore how RTX 3080 stacks up for local inference workloads.