Multi-GPU Inference Methodology

Most people think 2 GPUs means 2× the performance. It doesn't. This is what actually happens when you run large language models across multiple GPUs, and when it's worth the complexity.

Overview

Multi-GPU lets you run models that won't fit on a single GPU. You split the model across 2 or more GPUs and they work together. The catch is that 2 GPUs doesn't give you 2× performance. With consumer hardware (RTX 4090, RTX 3090) connected over PCIe, expect around 1.5× speedup with 2 GPUs. Sometimes less.

This isn't marketing. These are conservative estimates based on real benchmarks from llama.cpp and vLLM users, plus the physical limitations of PCIe bandwidth. Professional setups with NVLink do better, but they cost 2-3× more. If you're reading this, you probably care about consumer hardware.

The honest truth: only use multi-GPU when your model literally won't fit on one GPU. If you're close to fitting, try a more aggressive quantization first. It's simpler.

Multi-GPU Fundamentals

How multiple GPUs work together for LLM inference

Parallelism Strategies

Tensor Parallelism

Split the model weights across GPUs. Each GPU handles part of each layer, all at the same time.

  • Needs high-bandwidth interconnect (NVLink at 600-900 GB/s)
  • Works well on professional GPUs with NVLink
  • Consumer reality: PCIe 4.0 is only 32 GB/s, that's 28× slower than NVLink
  • Expect 1.6-1.7× speedup on 2 GPUs with vLLM and NVLink

Pipeline Parallelism

Split layers across GPUs. First GPU handles early layers, second GPU handles later layers, like an assembly line.

  • Works on PCIe, doesn't need fancy interconnects
  • This is what you get with consumer GPUs (RTX 4090, RTX 3090)
  • Trade-off: sequential processing means higher latency
  • Expect 1.4-1.5× speedup on 2 GPUs with llama.cpp
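
To make the layer split concrete, here is a minimal sketch of how contiguous blocks of layers might be assigned to GPUs. It is purely illustrative; the layer and GPU counts are examples and are not tied to any specific model or framework.

    # Illustrative layer assignment for pipeline parallelism: contiguous blocks
    # of layers per GPU, with any remainder going to the earlier GPUs.
    def split_layers(num_layers: int, num_gpus: int) -> list[range]:
        base, extra = divmod(num_layers, num_gpus)
        assignments, start = [], 0
        for gpu in range(num_gpus):
            count = base + (1 if gpu < extra else 0)
            assignments.append(range(start, start + count))
            start += count
        return assignments

    # 80 transformer layers (a 70B-class model) across 2 GPUs:
    for gpu, layers in enumerate(split_layers(80, 2)):
        print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")  # 0-39 and 40-79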

Sources: Shoeybi et al., "Megatron-LM" (2019); Huang et al., "GPipe" (2019); DeepSpeed documentation; llama.cpp community benchmarks.

Interconnect Technologies

Technology | Bandwidth | Availability | Use Case
NVLink 4.0 | 450 GB/s per GPU | RTX 6000 Ada, L40S | Workstation tensor parallelism
NVLink 3.0 | 600 GB/s per GPU | A100, RTX A6000 | Datacenter multi-GPU
NVSwitch | 900 GB/s any-to-any | H100, DGX systems | Large-scale clusters
PCIe 4.0 x16 | 32 GB/s (per direction) | RTX 40-series consumer | Consumer pipeline parallelism
PCIe 5.0 x16 | 64 GB/s (per direction) | Limited support (2025) | Future consumer setups

Consumer GPU Reality

Consumer RTX 40-series GPUs (4090, 4080, 4070 Ti) don't have NVLink. They only have PCIe, which is 28× slower for GPU-to-GPU communication. This is why multi-GPU doesn't scale linearly on consumer hardware. The bottleneck isn't the GPUs, it's the connection between them.
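
The arithmetic behind that ratio, using the peak figures from the table above:

    # 900 GB/s (top-end NVLink/NVSwitch) vs 32 GB/s (PCIe 4.0 x16, per direction)
    nvlink_gb_s, pcie4_gb_s = 900, 32
    print(f"{nvlink_gb_s / pcie4_gb_s:.0f}x")  # ~28x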

Sources: NVIDIA NVLink Technical Blog (developer.nvidia.com); PCIe specifications (pcisig.com).

VRAM Pooling Calculations

Understanding effective vs theoretical VRAM capacity

Theoretical vs Effective VRAM

What People Think

2× RTX 4090 = 2 × 24GB = 48GB

Simple math. Two 24GB GPUs, you get 48GB total. Right?

What Actually Happens

2× RTX 4090 = 48GB × 0.87 = 41GB

You lose 13% to overhead: framework buffers, CUDA contexts, communication.

Formula

Effective_VRAM = (GPU_Count × VRAM_per_GPU) × Efficiency_Factor
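
As a minimal sketch, the formula maps directly to code; the efficiency factors are the same conservative estimates shown in the table below, not measurements.

    # Effective VRAM estimate from the formula above.
    EFFICIENCY = {"nvlink": 0.92, "pcie4": 0.87, "pcie3": 0.85}

    def effective_vram_gb(gpu_count: int, vram_per_gpu_gb: float, interconnect: str = "pcie4") -> float:
        return gpu_count * vram_per_gpu_gb * EFFICIENCY[interconnect]

    print(effective_vram_gb(2, 24))            # ~41.8 GB for 2x RTX 4090 on PCIe 4.0 (this page rounds down to 41GB)
    print(effective_vram_gb(2, 24, "nvlink"))  # ~44.2 GB with NVLink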

Efficiency Factors

Interconnect | Efficiency | Overhead | Example (2×24GB)
NVLink | 92% | 8% | 44GB effective
PCIe 4.0 | 87% | 13% | 41GB effective
PCIe 3.0 | 85% | 15% | 40GB effective

Overhead Sources

Overhead Source | Typical Cost (2×24GB)
PyTorch/CUDA context (per GPU) | 2-3GB (4-6%)
Communication buffers | 1-2GB (2-4%)
Framework overhead (llama.cpp/vLLM) | 1-2GB (2-4%)
Memory fragmentation | 1-2GB (2-4%)
Total Overhead | 5-7GB (12-15%)

Sources: These efficiency numbers come from 50+ real user reports on r/LocalLLaMA and llama.cpp GitHub. The overhead breakdown is from PyTorch memory profiling and framework docs.

We use 87% efficiency (13% overhead) for consumer PCIe setups. This is conservative. Real numbers vary ±3% depending on your software stack and configuration.

Performance Scaling

Why 2 GPUs ≠ 2× performance

Reality Check

2 GPUs don't give you 2× performance. Expect 1.4-1.6× speedup on consumer hardware. If someone tells you they get 2× speedup, they either have professional gear with NVLink, or they're not measuring accurately.

Actual Speedup Factors

PCIe 4.0 (Consumer GPUs)

GPU Count | Theoretical | llama.cpp | vLLM | Efficiency
1 | 1.0× | 1.0× | 1.0× | 100%
2 | 2.0× | 1.45× | 1.60× | 73-80%
3 | 3.0× | 1.85× | 2.05× | 62-68%
4 | 4.0× | 2.10× | 2.45× | 53-61%

Conservative estimate: 2 GPUs on PCIe = 1.5× speedup (average of llama.cpp and vLLM)
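
If you already have a single-GPU baseline, the table above turns into a quick estimate. The 38 tokens/s baseline below is a hypothetical number, not a benchmark.

    # Scale a hypothetical single-GPU baseline by the PCIe factors above.
    PCIE_SCALING = {1: (1.00, 1.00), 2: (1.45, 1.60), 3: (1.85, 2.05), 4: (2.10, 2.45)}  # (llama.cpp, vLLM)

    def estimate_tps(single_gpu_tps: float, gpu_count: int) -> tuple[float, float]:
        llama_cpp_factor, vllm_factor = PCIE_SCALING[gpu_count]
        return single_gpu_tps * llama_cpp_factor, single_gpu_tps * vllm_factor

    print(estimate_tps(38.0, 2))  # ~(55.1, 60.8) tokens/s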

NVLink (Professional GPUs)

GPU Count | Theoretical | vLLM (Actual) | Efficiency
1 | 1.0× | 1.0× | 100%
2 | 2.0× | 1.75× | 88%
4 | 4.0× | 3.10× | 78%
8 | 8.0× | 5.80× | 73%

NVLink performs significantly better than PCIe but still shows sublinear scaling due to synchronization overhead.

Why Sublinear Scaling?

1. Memory Bandwidth Bottleneck

Each GPU still has to load weights from its own VRAM at the same speed. Adding more GPUs gives you more total capacity, but each GPU is still individually bandwidth-limited.

2. Synchronization Overhead

The GPUs have to talk to each other after every layer. PCIe gives you 32 GB/s for this. Each GPU has 1,000+ GB/s of internal memory bandwidth. See the problem?

3. Pipeline Bubbles

When you split layers across GPUs, the first GPU finishes its layers while the second GPU is still working. The first GPU sits idle. This happens in every forward pass.

4. Framework Overhead

PyTorch, CUDA, llama.cpp, vLLM: they all need to coordinate multiple GPUs. That coordination takes time. A single GPU doesn't need any of it.
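
To put rough numbers on reasons 1 and 2, here is a back-of-envelope sketch that assumes decode speed is capped by how fast weights can be read from VRAM. The bandwidth and model-size figures are illustrative, not measurements.

    # Decode is roughly memory-bandwidth-bound: each generated token requires
    # reading (approximately) all model weights once.
    gpu_bandwidth_gb_s = 1008   # RTX 4090 memory bandwidth
    model_weights_gb = 38       # e.g. a 70B model at Q4

    ceiling_1gpu = gpu_bandwidth_gb_s / model_weights_gb
    print(f"1 GPU (if it fit): <= {ceiling_1gpu:.0f} tokens/s")

    # Tensor parallelism lets both GPUs read their half of the weights in
    # parallel, roughly doubling the ceiling; per-layer synchronization over
    # 32 GB/s PCIe is what eats into that in practice.
    print(f"2 GPUs, tensor parallel: <= {2 * ceiling_1gpu:.0f} tokens/s")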

Sources: These speedup numbers come from llama.cpp GitHub discussions, vLLM benchmarks, and r/LocalLLaMA user reports. PCIe bandwidth limits are from the official PCI-SIG specs.

Your mileage may vary ±20-30%. Depends on your model, quantization, context length, and how well your drivers are configured. These numbers assume typical single-user inference.

When Multi-GPU Makes Sense

Decision matrix and cost-benefit analysis

Decision Criteria

✓ Good Fit for Multi-GPU

  • Your model needs more than 24GB (Llama 2 70B Q4 at 38GB, Mixtral 8×7B Q8 at 55GB)
  • You're okay with BIOS configuration and troubleshooting drivers
  • $6,800 for an RTX 6000 Ada sounds like too much
  • You already have one high-VRAM GPU and want to add another

✗ Bad Fit for Multi-GPU

  • Your model fits on one GPU (Llama 3 8B at 6GB, Llama 2 13B at 10GB)
  • This is your first time running LLMs locally (too complex)
  • You value simplicity over saving money
  • You're at 22-23GB on a 24GB GPU (just use more aggressive quantization)

Cost-Benefit Analysis: 30GB Model (Q4)

Option | Cost (USD) | VRAM | Perf (t/s) | Complexity | Verdict
2× RTX 4090 | $3,200 | 41GB | ~45 t/s | High | Best value if technical
RTX 6000 Ada | $6,800 | 48GB | ~50 t/s | Low | Simplest, 2× cost
2× RTX 3090 (used) | $1,400 | 40GB | ~38 t/s | High | Best $/perf, risky
A6000 (used) | $4,500 | 48GB | ~42 t/s | Low | Middle ground

These performance numbers assume llama.cpp. vLLM is 10-15% faster. Used GPUs have no warranty. Cost per tokens/sec: 2×3090 ($37) beats 2×4090 ($71) beats A6000 ($107) beats 6000 Ada ($136).
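
The cost-per-performance ranking reproduces directly from the table's estimates; the prices and tokens/s below are this page's figures, not new measurements.

    # Dollars per token/s, using the cost and throughput estimates above.
    options = {
        "2x RTX 3090 (used)": (1400, 38),
        "2x RTX 4090":        (3200, 45),
        "A6000 (used)":       (4500, 42),
        "RTX 6000 Ada":       (6800, 50),
    }
    for name, (cost_usd, tps) in sorted(options.items(), key=lambda kv: kv[1][0] / kv[1][1]):
        print(f"{name}: ${cost_usd / tps:.0f} per token/s")  # $37, $71, $107, $136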

Recommendations by Model Size

<24GB Models

Llama 3 8B, Mistral 7B, Llama 2 13B (all Q4/Q8)

→ Single GPU sufficient (RTX 4090, RTX 3090)

24-48GB Models

Llama 2 70B Q4 (~38GB), Qwen 72B Q4 (~42GB)

→ 2× RTX 4090 feasible (41GB effective, PCIe)

48-72GB Models

Mixtral 8×7B Q8 (~55GB), Llama 2 70B Q8 (~60GB)

→ 3× RTX 4090 or workstation GPUs (A6000, 6000 Ada)

>72GB Models

Llama 2 70B FP16 (~140GB), Mixtral 8×22B FP16 (~200GB)

→ Datacenter GPUs (H100, A100) or cloud APIs only
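
A minimal fit check that ties these brackets back to the effective-VRAM formula from earlier. The 87% factor is this page's conservative PCIe estimate, and the model sizes are the rough figures listed above.

    # Rough "does it fit?" check using the effective-VRAM estimate.
    def fits(model_gb: float, gpu_count: int, vram_per_gpu_gb: float = 24.0,
             efficiency: float = 0.87) -> bool:
        return model_gb <= gpu_count * vram_per_gpu_gb * efficiency

    print(fits(38, 1))  # False: Llama 2 70B Q4 exceeds a single 24GB card
    print(fits(38, 2))  # True:  ~41GB effective across 2x RTX 4090
    print(fits(55, 2))  # False: Mixtral 8x7B Q8 needs a third GPU or bigger cards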

Consumer GPU Multi-GPU Reality

What actually happens with 2× RTX 4090 on PCIe

The PCIe Problem

  • Consumer RTX 40-series cards have no NVLink, only PCIe
  • PCIe 4.0 x16 gives you 32 GB/s, shared across everything
  • With 2 GPUs, a display output, and an NVMe drive, each GPU gets maybe 20-25 GB/s
  • NVLink on professional cards is 600-900 GB/s, that's 28× faster
  • This is why consumer multi-GPU doesn't scale well

Software Support

llama.cpp (Easiest)

Layer splitting across GPUs via the --split-mode layer and --tensor-split flags, with --n-gpu-layers controlling how many layers are offloaded to GPU.

  • Pros: Simple CLI, widely used, stable
  • Cons: Pipeline only (not tensor parallel), moderate performance
  • Expected: 1.4-1.5× speedup on 2× RTX 4090
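
If you use the llama-cpp-python bindings instead of the CLI, the equivalent knobs are n_gpu_layers and tensor_split. A minimal sketch, assuming a CUDA-enabled build and a local GGUF file (the model path is hypothetical):

    # Sketch with llama-cpp-python (pip install llama-cpp-python, CUDA build).
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-2-70b.Q4_K_M.gguf",  # hypothetical local GGUF path
        n_gpu_layers=-1,           # offload all layers to GPU
        tensor_split=[0.5, 0.5],   # split weights roughly evenly across 2 GPUs
        n_ctx=4096,
    )
    out = llm("Explain pipeline parallelism in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])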

vLLM (Best Performance)

Tensor parallelism via --tensor-parallel-size flag.

  • Pros: Tensor parallel, PagedAttention, best throughput
  • Cons: More complex setup (API server), Python dependencies
  • Expected: 1.5-1.7× speedup on 2× RTX 4090
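
A minimal sketch of the vLLM offline Python API (the library equivalent of the API server). The model id is just an example of an AWQ-quantized 70B checkpoint; substitute whatever quantized model you actually run.

    # Sketch with the vLLM offline API; 2-way tensor parallelism.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Llama-2-70B-chat-AWQ",  # example AWQ-quantized checkpoint
        quantization="awq",
        tensor_parallel_size=2,                 # split each layer across 2 GPUs
        max_model_len=4096,                     # keep the KV cache modest on 24GB cards
    )
    outputs = llm.generate(["Explain tensor parallelism briefly."],
                           SamplingParams(temperature=0.7, max_tokens=64))
    print(outputs[0].outputs[0].text)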

text-generation-webui

Uses llama.cpp or transformers backend.

  • Pros: User-friendly web UI
  • Cons: Performance depends on backend, less control
  • Expected: Similar to llama.cpp (1.4-1.5×)

Setup Complexity

BIOS Configuration
  • Enable PCIe bifurcation (split x16 lanes)
  • Enable "Above 4G Decoding"
  • Set Resizable BAR (if supported)
  • Verify PCIe lane allocation per slot
Driver & Software
  • Install CUDA toolkit (match PyTorch version)
  • Configure CUDA_VISIBLE_DEVICES
  • Test with nvidia-smi
  • Tune --n-gpu-layers per model
Power & Cooling
  • PSU: 1000W+ for 2× RTX 4090 (2 × 450W for the GPUs + ~100W system)
  • Case airflow: Both GPUs need adequate cooling
  • PCIe slot spacing: Minimum 3-slot gap between GPUs
  • Thermal throttling: Monitor GPU temps under load
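
A quick post-setup sanity check, assuming PyTorch with CUDA support is installed:

    # Verify both GPUs are visible before launching an inference framework.
    import os
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")  # set before CUDA initializes

    import torch

    print("CUDA available:", torch.cuda.is_available())
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")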

Expect to spend 1-2 weeks getting this working the first time. PCIe lane conflicts, driver errors, and uneven GPU utilization are common. A single GPU just works.

Data Sources & References

Research papers, community benchmarks, and hardware specifications

Academic Papers

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). ArXiv preprint arXiv:1909.08053. arxiv.org/abs/1909.08053

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Huang, Y., et al. (2019). ArXiv preprint arXiv:1811.06965. arxiv.org/abs/1811.06965

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Dao, T. (2023). ArXiv preprint arXiv:2307.08691. arxiv.org/abs/2307.08691

Industry Documentation

NVIDIA NVLink 4.0 Multi-GPU System Scalability

NVIDIA Developer Blog. developer.nvidia.com/blog

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

vLLM Project Documentation. docs.vllm.ai

llama.cpp: Inference of LLaMA model in pure C/C++

Gerganov, G., et al. (2023-2025). github.com/ggerganov/llama.cpp

Microsoft DeepSpeed: Pipeline Parallelism

DeepSpeed Documentation. deepspeed.ai/tutorials/pipeline

Community Benchmarks

llama.cpp GitHub Discussions

Multi-GPU performance reports from users with 2-4× RTX 4090/3090 setups. github.com/ggerganov/llama.cpp/discussions

r/LocalLLaMA Community Reports

Real-world multi-GPU setups, cost analysis, and troubleshooting guides. reddit.com/r/LocalLLaMA

Hardware Specifications

PCI-SIG Specifications

PCIe 4.0 and 5.0 bandwidth specifications. pcisig.com/specifications

NVIDIA GPU Datasheets

RTX 4090, RTX 6000 Ada, A100, H100 technical specifications. nvidia.com/datasheets

Limitations & Caveats

What we don't know and what can vary

These Estimates Are Conservative

  • Assumes consumer PCIe setups (RTX 4090, RTX 3090), which most people use
  • Baseline llama.cpp performance, not optimized vLLM or TensorRT
  • Single user, batch size of 1
  • Normal context length (2-4K tokens)
  • Doesn't account for FlashAttention-2 or PagedAttention speedups

Real Performance Varies ±30-50%

You Might Do Better:

  • vLLM with tensor parallelism
  • Professional cards with NVLink
  • Batch processing multiple requests
  • FlashAttention-2 compiled in

You Might Do Worse:

  • Long context (more than 8K tokens)
  • Misconfigured drivers or CUDA
  • GPU thermal throttling
  • Other PCIe devices competing for bandwidth

We Can't Test Everything

  • Don't have every GPU combination (a pair of RTX 6000 Adas costs $13,600)
  • Most community benchmarks don't document their exact setup
  • Software changes fast (llama.cpp and vLLM ship updates weekly)
  • Your specific combination of hardware, drivers, and model might behave differently

Complexity Warning

Multi-GPU setup isn't for beginners:

  • BIOS configuration (PCIe bifurcation, above 4G decoding)
  • Getting the right CUDA toolkit version for your PyTorch
  • Learning framework flags (--n-gpu-layers, --tensor-parallel-size)
  • Debugging why your GPUs won't talk to each other
  • Making sure your PSU and cooling can handle 900W of GPUs

Only try this if you're comfortable with command line tools and fixing broken configs. For most people, one good GPU is simpler and less painful.

Last updated: November 29, 2025

Questions about multi-GPU inference? Contact us or see our main methodology page