Scaling a Multimodal Tutor Model on Modal
Benchmarked Qwen3-VL-4B-Instruct on Modal with vLLM, comparing H100 replica scaling, tensor parallelism, concurrency limits, and mixed multimodal traffic to find a low-latency, high-throughput serving configuration.
Benchmarking Qwen/Qwen3-VL-4B-Instruct with vLLM on Modal to tune throughput, time to first token, and replica-based serving for an AI tutoring workload.
InferTutor needs to serve many concurrent student tutoring requests, including short text prompts, long prompts, and image-based multimodal prompts. I benchmarked how serving configuration affects p95 time to first token, streaming latency, request throughput, token/chunk throughput, stability under load, and cost/performance tradeoffs on H100 GPUs.
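The metrics listed above can all be derived from per-request timing records collected by the load tester. The following is a minimal sketch, not the original benchmark code: the `RequestRecord` layout and the nearest-rank percentile method are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Timing for one streamed request (hypothetical record layout)."""
    start_s: float        # when the client sent the request
    first_chunk_s: float  # when the first streamed chunk arrived
    end_s: float          # when the stream finished
    chunks: int           # number of streamed chunks received


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method; other tools interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Aggregate request records into throughput and tail-latency metrics."""
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    ttft_ms = [(r.first_chunk_s - r.start_s) * 1000 for r in records]
    latency_ms = [(r.end_s - r.start_s) * 1000 for r in records]
    return {
        "req_per_s": len(records) / wall_s,
        "chunks_per_s": sum(r.chunks for r in records) / wall_s,
        "ttft_p95_ms": percentile(ttft_ms, 95),
        "latency_p95_ms": percentile(latency_ms, 95),
    }
```

Measuring TTFT at the client, from request send to first streamed chunk, means queueing time in front of vLLM is included in the number, which is what a student would actually experience.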
System Setup
The benchmark used Modal to deploy vLLM as an OpenAI-compatible HTTP server. Experiments varied GPU count, replica count, tensor parallelism, max_seqs, concurrent_inputs, max_batch_tokens, max_model_len, chunked prefill, eager vs. compiled mode, and text-only vs. mixed multimodal traffic.
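As a rough illustration of how these knobs might reach the server, the sketch below builds a `vllm serve` command line from a config object. The field names follow this write-up; the mapping to vLLM flags (`--max-num-seqs`, `--max-num-batched-tokens`, and so on) is an assumption, and the default values are placeholders rather than benchmarked settings. `concurrent_inputs` is presumably a Modal per-container concurrency setting rather than a vLLM flag, so it is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    """Deployment-time knobs; field names follow this write-up, not vLLM."""
    model: str = "Qwen/Qwen3-VL-4B-Instruct"
    tensor_parallel: int = 1
    max_seqs: int = 64            # placeholder, not a benchmarked value
    max_batch_tokens: int = 8192  # placeholder
    max_model_len: int = 8192     # placeholder
    chunked_prefill: bool = True
    eager: bool = False


def vllm_command(cfg: ServerConfig) -> list[str]:
    """Translate the config into vLLM CLI arguments (assumed mapping)."""
    cmd = [
        "vllm", "serve", cfg.model,
        "--tensor-parallel-size", str(cfg.tensor_parallel),
        "--max-num-seqs", str(cfg.max_seqs),
        "--max-num-batched-tokens", str(cfg.max_batch_tokens),
        "--max-model-len", str(cfg.max_model_len),
    ]
    if cfg.chunked_prefill:
        cmd.append("--enable-chunked-prefill")
    if cfg.eager:
        cmd.append("--enforce-eager")  # skip graph capture/compilation
    return cmd
```

Keeping the whole serving configuration in one immutable object makes it easy to log exactly what each endpoint was launched with alongside its benchmark results.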
The key architectural finding was that the model fits comfortably on a single H100, so independent replicas scaled better than tensor parallelism for this workload.
Client load tester -> Modal web endpoint -> vLLM replicas -> H100 GPUs
Key Results
| Experiment | Mode | Users | GPUs | Req/s | Chunks/s | TTFT p95 | Latency p95 | Notes |
|---|---|---|---|---|---|---|---|---|
| 1-GPU baseline | text | 150 | 1 | 20.22 | 1936.3 | 3211 ms | 5045 ms | Useful lower-bound baseline |
| 2-GPU replicas | text | 150 | 2 | 31.82 | 3046.2 | 2150 ms | 3855 ms | Clear replica scaling gain |
| 4-GPU prior best | text | 150 | 4 | 37.10 | 3551.7 | 1151 ms | 3225 ms | Strong latency at 150 users |
| 4-GPU replicas | text | 250 | 4 | 53.36 | 5106.3 | 1371 ms | 4527 ms | Best 4-GPU balance |
| 4-GPU replicas | text | 300 | 4 | 55.31 | 5293.9 | 1843 ms | 5088 ms | Higher throughput, more queueing |
| 8-GPU replicas | mixed | 250 | 8 | 60.78 | 5605.2 | 916 ms | 3772 ms | Best multimodal result |
| 8-GPU replicas | mixed | 300 | 8 | 57.00 | 5256.1 | 1536 ms | 4222 ms | Saturation begins |
| 8-GPU replicas | text | 400 | 8 | 63.05 | 6035.5 | 1556 ms | 4277 ms | Best text throughput |
Throughput and TTFT Updates
The biggest changes came from moving from one larger serving process toward many independent single-H100 replicas. Throughput rose because requests could be spread across more warm vLLM servers, while TTFT improved when each replica had enough headroom to start prefill quickly instead of letting requests sit in a queue.
The load sweeps also show where saturation sets in. On 4 GPUs, moving from 250 to 300 users added only about 2 req/s, but p95 TTFT jumped from 1371 ms to 1843 ms. That is the classic sign of a saturated endpoint: the replicas are already busy, so extra client pressure mostly turns into queueing delay rather than additional throughput. The 8-GPU mixed endpoint shows the same pattern: 250 users was the sweet spot, while 300 users lowered throughput and raised TTFT.
Extra TTFT movement between configurations also came from deployment hygiene. Reusing a warm Modal app could make a health check pass while an older vLLM configuration was still the one serving steady-state traffic, so the benchmark became more reliable after separating deployment-time server settings from client-side load-test settings and warming one endpoint per serving config.
Throughput Scaling
| Configuration | Users | Req/s |
|---|---|---|
| 1 GPU | 150 | 20.22 |
| 2 GPUs | 150 | 31.82 |
| 4 GPUs | 150 | 37.10 |
| 4 GPUs | 250 | 53.36 |
| 8 GPUs, mixed traffic | 250 | 60.78 |
| 8 GPUs, text traffic | 400 | 63.05 |
Throughput improved substantially with replicas, but the improvement depended on driving enough client load. The 8-GPU text endpoint produced the strongest raw throughput at 63.05 requests per second, while the 8-GPU mixed endpoint peaked at 60.78 requests per second with better tail latency.
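One way to make the dependence on client load concrete is to normalize throughput by GPU count. The sketch below uses only the numbers from the table above; the function names are illustrative. Scaling efficiency is defined here as measured req/s divided by perfect linear scaling of the 1-GPU run at the same user count.

```python
# (label, gpus, users, req_per_s) copied from the Throughput Scaling table.
RUNS = [
    ("1 GPU", 1, 150, 20.22),
    ("2 GPUs", 2, 150, 31.82),
    ("4 GPUs", 4, 150, 37.10),
    ("4 GPUs", 4, 250, 53.36),
    ("8 GPUs, mixed traffic", 8, 250, 60.78),
    ("8 GPUs, text traffic", 8, 400, 63.05),
]


def per_gpu_throughput(runs):
    """Requests per second delivered per GPU for each run."""
    return {(label, users): rps / gpus for label, gpus, users, rps in runs}


def scaling_efficiency(runs, users=150):
    """Throughput relative to perfect linear scaling of the 1-GPU run."""
    same_load = [(gpus, rps) for _, gpus, u, rps in runs if u == users]
    base = next(rps for gpus, rps in same_load if gpus == 1)
    return {gpus: rps / (gpus * base) for gpus, rps in same_load}


if __name__ == "__main__":
    for gpus, eff in scaling_efficiency(RUNS).items():
        print(f"{gpus} GPU(s) at 150 users: {eff:.0%} of linear scaling")
    for key, value in per_gpu_throughput(RUNS).items():
        print(key, f"{value:.2f} req/s per GPU")
```

At 150 users the 4-GPU endpoint delivers about 9.3 req/s per GPU, versus about 13.3 at 250 users, which is consistent with the 150-user runs being limited by offered load rather than by the GPUs. The same per-GPU figure is a simple starting point for cost comparisons, since cost scales with GPU count.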
TTFT vs. Load on 4 GPUs
| Users | Req/s | TTFT p95 |
|---|---|---|
| 175 | 42.91 | 1076 ms |
| 200 | 45.31 | 1307 ms |
| 250 | 53.36 | 1371 ms |
| 300 | 55.31 | 1843 ms |
The best 4-GPU text configuration used four single-H100 replicas with moderate per-container concurrency. At 250 users it reached 53.36 requests per second with p95 TTFT of 1.37 seconds. Increasing to 300 users raised throughput only slightly, but pushed p95 TTFT to 1.84 seconds, making 250 users the better latency-throughput operating point.
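The tradeoff between load steps can be expressed as the p95 TTFT paid per extra request per second. This is a small sketch over the 4-GPU sweep numbers above; the metric itself is an illustrative choice, not something the original benchmark reported.

```python
# (users, req_per_s, ttft_p95_ms) from the 4-GPU text sweep above.
SWEEP_4GPU = [
    (175, 42.91, 1076),
    (200, 45.31, 1307),
    (250, 53.36, 1371),
    (300, 55.31, 1843),
]


def marginal_costs(sweep):
    """For each load step, report extra req/s gained and extra p95 TTFT paid."""
    steps = []
    for (u0, r0, t0), (u1, r1, t1) in zip(sweep, sweep[1:]):
        gain = r1 - r0
        penalty = t1 - t0
        steps.append({
            "from_users": u0,
            "to_users": u1,
            "req_per_s_gain": gain,
            "ttft_penalty_ms": penalty,
            # Undefined when throughput does not increase, so report infinity.
            "ms_per_req_per_s": penalty / gain if gain > 0 else float("inf"),
        })
    return steps


if __name__ == "__main__":
    for step in marginal_costs(SWEEP_4GPU):
        print(step)
```

On these numbers, going from 200 to 250 users costs roughly 8 ms of p95 TTFT per extra req/s, while going from 250 to 300 users costs roughly 240 ms per extra req/s, which is why 250 users is the more attractive operating point.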
Mixed Multimodal Saturation
| Users | Req/s | TTFT p95 |
|---|---|---|
| 150 | 40.95 | 918 ms |
| 200 | 53.10 | 819 ms |
| 250 | 60.78 | 916 ms |
| 300 | 57.00 | 1536 ms |
The best 8-GPU mixed multimodal configuration used eight single-H100 replicas with lower per-container concurrency. It reached 60.78 requests per second at 250 users while keeping p95 TTFT below one second. At 300 users, throughput dropped and TTFT rose, showing that the endpoint had been pushed past its best operating point.
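Choosing an operating point from a sweep like this can be framed as maximizing throughput subject to a latency target. The sketch below applies that rule to the mixed-traffic numbers above; the one-second p95 TTFT target is an assumption taken from the "below one second" observation, not a stated requirement of the project.

```python
# (users, req_per_s, ttft_p95_ms) from the 8-GPU mixed multimodal sweep above.
SWEEP_MIXED_8GPU = [
    (150, 40.95, 918),
    (200, 53.10, 819),
    (250, 60.78, 916),
    (300, 57.00, 1536),
]


def best_under_slo(sweep, ttft_slo_ms):
    """Return the highest-throughput load level whose p95 TTFT meets the target.

    Returns None when no load level satisfies the target.
    """
    eligible = [point for point in sweep if point[2] <= ttft_slo_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda point: point[1])


if __name__ == "__main__":
    # Assumed target: p95 TTFT of at most 1000 ms.
    print(best_under_slo(SWEEP_MIXED_8GPU, ttft_slo_ms=1000))
```

With a 1000 ms target this selects the 250-user run. Because throughput at 300 users is lower than at 250, the same run is selected even if the target is relaxed to 2000 ms.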
Replica Scaling vs. Tensor Parallelism
| Configuration | Users | Req/s | TTFT p95 |
|---|---|---|---|
| 4 single-H100 replicas | 250 | 53.36 | 1371 ms |
| 4-H100 tensor-parallel endpoint | 250 | 8.74 | 17884 ms |
| 4-H100 eager tensor-parallel endpoint | 250 | 7.51 | 7311 ms |
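Since Qwen/Qwen3-VL-4B-Instruct fits on one H100, tensor parallelism was unnecessary for this workload. At 250 users, four independent H100 replicas delivered roughly 6x the throughput (53.36 vs. 8.74 req/s) and roughly 13x lower p95 TTFT (1371 ms vs. 17884 ms) than a single tensor-parallel 4-H100 server. Eager mode reduced the tensor-parallel endpoint's p95 TTFT to 7311 ms, but its throughput stayed far below the replica setup.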
Benchmark Workflow
Early experiments showed that benchmark results could be misleading if a warm Modal app was reused after changing server parameters, because /health could pass before the newly intended vLLM configuration was actually serving steady-state traffic. I separated deployment-time parameters from client-side load-test parameters, then used one Modal app per serving configuration and multiple load tests per warmed endpoint.
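The separation described above can be sketched as two distinct config types plus an app name derived from the serving config, so that changing any deployment-time setting forces a fresh app instead of reusing a warm one. This is an illustrative sketch: the class names, fields, the `tutor-bench` prefix, and the example values are all hypothetical, and the actual deploy and load-test steps are left out.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServerConfig:
    """Deployment-time settings: changing any of these means a new deploy."""
    replicas: int
    tensor_parallel: int
    max_seqs: int
    concurrent_inputs: int


@dataclass(frozen=True)
class LoadTest:
    """Client-side settings: safe to vary against one warmed endpoint."""
    users: int
    mode: str  # "text" or "mixed"


def app_name(cfg: ServerConfig, prefix: str = "tutor-bench") -> str:
    """Derive a stable app name from the config so a stale app is never reused."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode()
    return f"{prefix}-{hashlib.sha256(payload).hexdigest()[:8]}"


def plan(configs, load_tests):
    """One deploy per serving config, then every load level against it."""
    return [(app_name(cfg), cfg, list(load_tests)) for cfg in configs]


if __name__ == "__main__":
    # Example values are placeholders, not the benchmarked settings.
    configs = [ServerConfig(4, 1, 64, 32), ServerConfig(8, 1, 64, 16)]
    loads = [LoadTest(150, "text"), LoadTest(250, "text"), LoadTest(300, "text")]
    for name, cfg, tests in plan(configs, loads):
        print(name, cfg, [t.users for t in tests])
```

Hashing the full serving config means two runs share an endpoint only when every deployment-time setting matches, while user count and traffic mix can change freely between runs.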
Final Takeaways
- Replica scaling outperformed tensor parallelism for this model and workload.
- More users helped only until the endpoint reached saturation; after that, TTFT increased faster than throughput.
- Four H100 replicas were enough for roughly 53 req/s text throughput with p95 TTFT near 1.4 seconds.
- Eight H100 replicas handled mixed multimodal traffic well, peaking around 61 req/s with sub-second p95 TTFT.
- The best benchmark workflow was to deploy once per serving config, warm the endpoint, then run multiple client-side load levels against the same URL.
Technical Stack
| Layer | Tool |
|---|---|
| Serving platform | Modal |
| Inference server | vLLM OpenAI-compatible server |
| Model | Qwen/Qwen3-VL-4B-Instruct |
| Hardware | H100 GPUs |
| Workload | Text and multimodal tutoring requests |
| Metrics | Request throughput, chunks/sec, p95 TTFT, p95 latency |