Quick Answer: The NVIDIA L40 offers 48GB of VRAM; see the price tracker below for current market pricing. It delivers approximately 217 tokens/sec on meta-llama/Llama-3.2-1B-Instruct at Q4 (estimated) and typically draws 300W under load.
This GPU offers dependable throughput for local AI workloads. Pair it with the right quantization level to hit your target tokens/sec, and watch the prices below to catch the best deal.
Buy directly on Amazon with fast shipping and reliable customer service.
💡 Not ready to buy? Try cloud GPUs first
Test NVIDIA L40 performance in the cloud before investing in hardware. Pay by the hour with no commitment.
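Whether you rent an L40 by the hour or already have one on hand, it's worth reproducing the throughput numbers below with your own prompts. Here is a minimal sketch using Hugging Face transformers; the model ID comes from the table below, while the prompt, dtype, and token counts are arbitrary choices for illustration:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model taken from the benchmark table below; swap in whatever you plan to run.
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer(
    "Explain speculative decoding in one paragraph.", return_tensors="pt"
).to("cuda")

# Warm-up pass so one-time CUDA setup doesn't skew the measurement.
model.generate(**inputs, max_new_tokens=16)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```

Note that this measures FP16 decoding; the Q4 figures in the table assume 4-bit quantization (e.g. via bitsandbytes or a GGUF runtime), which shrinks VRAM to roughly a quarter of FP16 at some cost in accuracy.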
| Model | Quantization | Tokens/sec (estimated) | VRAM used |
|---|---|---|---|
| meta-llama/Llama-3.2-1B-Instruct | Q4 | 217.30 | 1GB |
| apple/OpenELM-1_1B-Instruct | Q4 | 214.24 | 1GB |
| meta-llama/Llama-Guard-3-1B | Q4 | 214.03 | 1GB |
| meta-llama/Llama-3.2-1B | Q4 | 212.74 | 1GB |
| google-t5/t5-3b | Q4 | 212.54 | 2GB |
| google/gemma-2-2b-it | Q4 | 212.05 | 1GB |
| google/embeddinggemma-300m | Q4 | 211.11 | 1GB |
| facebook/sam3 | Q4 | 210.12 | 1GB |
| deepseek-ai/DeepSeek-OCR | Q4 | 204.89 | 2GB |
| Qwen/Qwen2.5-3B-Instruct | Q4 | 203.91 | 2GB |
| Qwen/Qwen2.5-3B | Q4 | 203.65 | 2GB |
| inference-net/Schematron-3B | Q4 | 201.76 | 2GB |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Q4 | 200.78 | 1GB |
| google/gemma-3-1b-it | Q4 | 199.63 | 1GB |
| unsloth/gemma-3-1b-it | Q4 | 199.35 | 1GB |
| unsloth/Llama-3.2-3B-Instruct | Q4 | 196.99 | 2GB |
| tencent/HunyuanOCR | Q4 | 195.04 | 1GB |
| WeiboAI/VibeThinker-1.5B | Q4 | 193.63 | 1GB |
| context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16 | Q4 | 189.25 | 2GB |
| ibm-research/PowerMoE-3b | Q4 | 188.19 | 2GB |
| meta-llama/Llama-3.2-3B | Q4 | 187.25 | 2GB |
| bigcode/starcoder2-3b | Q4 | 187.09 | 2GB |
| google-bert/bert-base-uncased | Q4 | 186.38 | 1GB |
| unsloth/Llama-3.2-1B-Instruct | Q4 | 185.56 | 1GB |
| nari-labs/Dia2-2B | Q4 | 185.02 | 2GB |
| LiquidAI/LFM2-1.2B | Q4 | 184.67 | 1GB |
| meta-llama/Llama-3.2-3B-Instruct | Q4 | 184.49 | 2GB |
| ibm-granite/granite-3.3-2b-instruct | Q4 | 184.04 | 1GB |
| deepseek-ai/deepseek-coder-1.3b-instruct | Q4 | 182.66 | 2GB |
| microsoft/phi-2 | Q4 | 181.91 | 4GB |
| Qwen/Qwen3-8B-Base | Q4 | 181.70 | 4GB |
| microsoft/DialoGPT-small | Q4 | 181.65 | 4GB |
| Qwen/Qwen3-Embedding-4B | Q4 | 181.25 | 2GB |
| microsoft/Phi-4-multimodal-instruct | Q4 | 181.09 | 4GB |
| hmellor/tiny-random-LlamaForCausalLM | Q4 | 181.05 | 4GB |
| Qwen/Qwen3-4B-Thinking-2507-FP8 | Q4 | 180.92 | 2GB |
| facebook/opt-125m | Q4 | 180.81 | 4GB |
| lmsys/vicuna-7b-v1.5 | Q4 | 180.74 | 4GB |
| lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit | Q4 | 180.52 | 2GB |
| zai-org/GLM-4.5-Air | Q4 | 180.38 | 4GB |
| allenai/OLMo-2-0425-1B | Q4 | 179.97 | 1GB |
| microsoft/Phi-3-mini-128k-instruct | Q4 | 179.92 | 4GB |
| google/gemma-2b | Q4 | 179.85 | 1GB |
| meta-llama/Llama-3.1-8B | Q4 | 179.79 | 4GB |
| huggyllama/llama-7b | Q4 | 179.31 | 4GB |
| skt/kogpt2-base-v2 | Q4 | 178.45 | 4GB |
| openai-community/gpt2-medium | Q4 | 177.98 | 4GB |
| deepseek-ai/DeepSeek-R1-0528 | Q4 | 177.79 | 4GB |
| Qwen/Qwen3-0.6B | Q4 | 177.61 | 3GB |
| microsoft/Phi-3.5-vision-instruct | Q4 | 177.59 | 4GB |
Note: All figures above are calculated estimates, not measured benchmarks; real-world results may vary.
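The VRAM column tracks the standard back-of-envelope rule: parameter count × bytes per weight, plus runtime overhead. Here is a rough sketch of that arithmetic; the 1.2× overhead factor is an assumption of this sketch, not the site's published methodology:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight storage (params * bits / 8) plus a
    fudge factor for KV cache, activations, and runtime buffers."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead

# Example: a 3B-parameter model at Q4 (~4 bits/weight)
print(f"{estimate_vram_gb(3, 4):.1f} GB")  # ~1.8 GB, in line with the ~2GB table entries
```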
| Model | Quantization | Verdict | Estimated speed | VRAM needed |
|---|---|---|---|---|
| zai-org/GLM-4.6-FP8 | Q4 | Fits comfortably | 168.55 tok/s | 4GB (have 48GB) |
| zai-org/GLM-4.6-FP8 | FP16 | Fits comfortably | 65.21 tok/s | 15GB (have 48GB) |
| microsoft/DialoGPT-medium | FP16 | Fits comfortably | 59.91 tok/s | 15GB (have 48GB) |
| MiniMaxAI/MiniMax-M2 | FP16 | Fits comfortably | 64.29 tok/s | 15GB (have 48GB) |
| Qwen/Qwen2-0.5B | Q4 | Fits comfortably | 150.38 tok/s | 3GB (have 48GB) |
| Qwen/Qwen2-0.5B | Q8 | Fits comfortably | 108.31 tok/s | 5GB (have 48GB) |
| microsoft/phi-4 | Q8 | Fits comfortably | 119.29 tok/s | 7GB (have 48GB) |
| microsoft/phi-4 | FP16 | Fits comfortably | 58.08 tok/s | 15GB (have 48GB) |
| unsloth/Meta-Llama-3.1-8B-Instruct | FP16 | Fits comfortably | 66.39 tok/s | 17GB (have 48GB) |
| Qwen/Qwen2.5-Math-1.5B | FP16 | Fits comfortably | 58.34 tok/s | 11GB (have 48GB) |
| trl-internal-testing/tiny-random-LlamaForCausalLM | FP16 | Fits comfortably | 67.26 tok/s | 15GB (have 48GB) |
| EleutherAI/gpt-neo-125m | Q4 | Fits comfortably | 158.77 tok/s | 4GB (have 48GB) |
| EleutherAI/gpt-neo-125m | Q8 | Fits comfortably | 106.22 tok/s | 7GB (have 48GB) |
| Qwen/Qwen3-1.7B-Base | FP16 | Fits comfortably | 63.25 tok/s | 15GB (have 48GB) |
| ibm-granite/granite-3.3-8b-instruct | Q4 | Fits comfortably | 172.20 tok/s | 4GB (have 48GB) |
| ibm-granite/granite-3.3-8b-instruct | FP16 | Fits comfortably | 63.75 tok/s | 17GB (have 48GB) |
| Qwen/QwQ-32B-Preview | Q4 | Fits comfortably | 53.56 tok/s | 17GB (have 48GB) |
| Qwen/QwQ-32B-Preview | Q8 | Fits comfortably | 38.25 tok/s | 34GB (have 48GB) |
| deepseek-ai/DeepSeek-Coder-V2-Instruct-0724 | FP16 | Not supported | 9.50 tok/s | 461GB (have 48GB) |
| facebook/sam3 | Q8 | Fits comfortably | 142.84 tok/s | 1GB (have 48GB) |
| mistralai/Ministral-3-14B-Instruct-2512 | Q4 | Fits comfortably | 134.78 tok/s | 8GB (have 48GB) |
| mistralai/Ministral-3-14B-Instruct-2512 | Q8 | Fits comfortably | 80.65 tok/s | 16GB (have 48GB) |
| mistralai/Ministral-3-14B-Instruct-2512 | FP16 | Fits comfortably | 48.69 tok/s | 32GB (have 48GB) |
| mistralai/Mistral-7B-Instruct-v0.2 | Q4 | Fits comfortably | 163.82 tok/s | 4GB (have 48GB) |
| RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic | Q4 | Fits comfortably | 59.67 tok/s | 34GB (have 48GB) |
| meta-llama/Llama-3.2-3B-Instruct | Q8 | Fits comfortably | 142.98 tok/s | 3GB (have 48GB) |
| meta-llama/Llama-3.2-3B-Instruct | FP16 | Fits comfortably | 77.93 tok/s | 6GB (have 48GB) |
| vikhyatk/moondream2 | FP16 | Fits comfortably | 63.63 tok/s | 15GB (have 48GB) |
| petals-team/StableBeluga2 | Q8 | Fits comfortably | 117.38 tok/s | 7GB (have 48GB) |
| microsoft/Phi-3-mini-4k-instruct | FP16 | Fits comfortably | 68.82 tok/s | 15GB (have 48GB) |
| openai-community/gpt2-large | Q4 | Fits comfortably | 166.24 tok/s | 4GB (have 48GB) |
| openai-community/gpt2-large | Q8 | Fits comfortably | 110.46 tok/s | 7GB (have 48GB) |
| Qwen/Qwen3-1.7B | Q4 | Fits comfortably | 153.43 tok/s | 4GB (have 48GB) |
| Qwen/Qwen3-1.7B | Q8 | Fits comfortably | 114.20 tok/s | 7GB (have 48GB) |
| MiniMaxAI/MiniMax-M2 | Q8 | Fits comfortably | 114.24 tok/s | 7GB (have 48GB) |
| Qwen/Qwen2.5-32B | FP16 | Not supported | 23.91 tok/s | 66GB (have 48GB) |
| meta-llama/Llama-3.1-8B-Instruct | Q8 | Fits comfortably | 83.13 tok/s | 9GB (have 48GB) |
| black-forest-labs/FLUX.1-dev | Q8 | Fits comfortably | 116.30 tok/s | 8GB (have 48GB) |
| tencent/HunyuanVideo-1.5 | Q4 | Fits comfortably | 168.43 tok/s | 4GB (have 48GB) |
| tencent/HunyuanVideo-1.5 | Q8 | Fits comfortably | 118.91 tok/s | 8GB (have 48GB) |
| tencent/HunyuanVideo-1.5 | FP16 | Fits comfortably | 65.47 tok/s | 16GB (have 48GB) |
| nari-labs/Dia2-2B | Q4 | Fits comfortably | 185.02 tok/s | 2GB (have 48GB) |
| nari-labs/Dia2-2B | Q8 | Fits comfortably | 152.56 tok/s | 3GB (have 48GB) |
| nari-labs/Dia2-2B | FP16 | Fits comfortably | 80.35 tok/s | 5GB (have 48GB) |
| Qwen/Qwen3-4B | FP16 | Fits comfortably | 64.72 tok/s | 9GB (have 48GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q4 | Fits comfortably | 92.59 tok/s | 15GB (have 48GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | Q8 | Fits comfortably | 57.63 tok/s | 31GB (have 48GB) |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | FP16 | Not supported | 37.79 tok/s | 61GB (have 48GB) |
| google-t5/t5-3b | FP16 | Fits comfortably | 71.28 tok/s | 6GB (have 48GB) |
| Qwen/Qwen2.5-0.5B | Q8 | Fits comfortably | 119.42 tok/s | 5GB (have 48GB) |
Note: All figures above are calculated estimates, not measured benchmarks; real-world results may vary.
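The verdict column boils down to comparing a model's estimated footprint against the L40's 48GB. Here is a minimal sketch of that check; the thresholds and the "Tight fit" middle verdict are assumptions of this sketch, since the table above only distinguishes "Fits comfortably" from "Not supported":

```python
L40_VRAM_GB = 48

def will_it_run(required_gb: float, available_gb: float = L40_VRAM_GB) -> str:
    """Classify a model/quantization combo the way the table does."""
    if required_gb > available_gb:
        return "Not supported"
    if required_gb > 0.9 * available_gb:
        return "Tight fit"  # hypothetical middle verdict, not used in the table
    return "Fits comfortably"

# Examples drawn from the table above
print(will_it_run(34))   # QwQ-32B-Preview at Q8 -> Fits comfortably
print(will_it_run(461))  # DeepSeek-Coder-V2 at FP16 -> Not supported
```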
Explore how similar GPUs stack up for local inference workloads:
- NVIDIA RTX 6000 Ada
- NVIDIA A6000
- RTX 4090
- RTX 4080
- NVIDIA A5000