Scaling a Multimodal Tutor Model on Modal

Benchmarking Qwen/Qwen3-VL-4B-Instruct with vLLM on Modal to tune throughput, time to first token, and replica-based serving for an AI tutoring workload.

InferTutor needs to serve many concurrent student tutoring requests, including short text prompts, long prompts, and image-based multimodal prompts. I benchmarked how serving configuration affects p95 time to first token (TTFT), streaming latency, request throughput, token/chunk throughput, stability under load, and cost/performance tradeoffs on H100 GPUs.
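To make the two headline metrics concrete, here is a minimal sketch of how TTFT and a p95 value could be computed on the client side. It assumes TTFT is the time until the first non-empty streamed chunk and that percentiles use the nearest-rank method; the write-up does not state either detail, and the function names are hypothetical.

```python
import math
from typing import Callable, Iterable, Optional


def time_to_first_token(chunks: Iterable[str], clock: Callable[[], float]) -> Optional[float]:
    """Seconds from the start of the call until the first non-empty chunk arrives."""
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty deltas so they do not count as a first token
            return clock() - start
    return None  # the stream ended without producing any content


def percentile_nearest_rank(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

In a real load test, `clock` would be something like `time.perf_counter` and `chunks` would be the streamed response body. Injecting the clock keeps the function easy to check with fake data.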

System Setup

The benchmark used Modal to deploy vLLM as an OpenAI-compatible HTTP server. Experiments varied GPU count, replica count, tensor parallelism, max_seqs, concurrent_inputs, max_batch_tokens, max_model_len, chunked prefill, eager vs. compiled mode, and text-only vs. mixed multimodal traffic.
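As an illustration of how the deployment-time knobs might map onto the vLLM server, the sketch below builds a `vllm serve` command line from a small config object. The mapping from this write-up's parameter names (max_seqs, max_batch_tokens) to vLLM's `--max-num-seqs` and `--max-num-batched-tokens` flags is my assumption, the default values are placeholders rather than the benchmarked settings, and `ServingConfig` and `vllm_serve_args` are hypothetical names. Per-container input concurrency is configured on the Modal side and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    # Placeholder values, not the settings used in the benchmark.
    model: str = "Qwen/Qwen3-VL-4B-Instruct"
    tensor_parallel_size: int = 1
    max_num_seqs: int = 64
    max_num_batched_tokens: int = 8192
    max_model_len: int = 8192
    chunked_prefill: bool = True
    enforce_eager: bool = False


def vllm_serve_args(cfg: ServingConfig, port: int = 8000) -> list:
    """Build the argument list for launching the OpenAI-compatible vLLM server."""
    args = [
        "vllm", "serve", cfg.model,
        "--port", str(port),
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
        "--max-num-seqs", str(cfg.max_num_seqs),
        "--max-num-batched-tokens", str(cfg.max_num_batched_tokens),
        "--max-model-len", str(cfg.max_model_len),
    ]
    if cfg.chunked_prefill:
        args.append("--enable-chunked-prefill")
    if cfg.enforce_eager:
        args.append("--enforce-eager")  # skip graph capture and run in eager mode
    return args
```

Keeping the config in one frozen object makes it straightforward to record exactly which server settings produced a given benchmark row.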

The key architectural finding was that the model fits comfortably on a single H100, so independent replicas scaled better than tensor parallelism for this workload.

Client load tester -> Modal web endpoint -> vLLM replicas -> H100 GPUs

Key Results

Experiment	Mode	Users	GPUs	Req/s	Chunks/s	TTFT p95	Latency p95	Notes
1-GPU baseline	text	150	1	20.22	1936.3	3211 ms	5045 ms	Useful lower-bound baseline
2-GPU replicas	text	150	2	31.82	3046.2	2150 ms	3855 ms	Clear replica scaling gain
4-GPU prior best	text	150	4	37.10	3551.7	1151 ms	3225 ms	Strong latency at 150 users
4-GPU replicas	text	250	4	53.36	5106.3	1371 ms	4527 ms	Best 4-GPU balance
4-GPU replicas	text	300	4	55.31	5293.9	1843 ms	5088 ms	Higher throughput, more queueing
8-GPU replicas	mixed	250	8	60.78	5605.2	916 ms	3772 ms	Best multimodal result
8-GPU replicas	mixed	300	8	57.00	5256.1	1536 ms	4222 ms	Saturation begins
8-GPU replicas	text	400	8	63.05	6035.5	1556 ms	4277 ms	Best text throughput

Throughput and TTFT Updates

Figure: InferTutor throughput and p95 TTFT across serving updates.

The biggest changes came from moving from one larger serving process toward many independent single-H100 replicas. Throughput rose because requests could be spread across more warm vLLM servers, while TTFT improved when each replica had enough headroom to start prefill quickly instead of letting requests sit in a queue.

The curve also shows the saturation point. On 4 GPUs, moving from 250 to 300 users added only about 2 req/s (53.36 to 55.31), but p95 TTFT jumped from 1371 ms to 1843 ms. That is the classic sign of saturation: the replicas are already busy, so extra client load mostly turns into queueing delay instead of completed requests. The 8-GPU mixed endpoint shows the same pattern: 250 users was the sweet spot, while 300 users lowered throughput and raised TTFT.

Some of the TTFT variation between configurations came from deployment hygiene rather than from the serving settings themselves. When a warm Modal app was reused, a health check could pass while an older vLLM configuration was still the one serving steady-state traffic. The benchmark became more reliable after separating deployment-time server settings from client-side load-test settings and warming one endpoint per serving config.

Throughput Scaling

Configuration	Users	Req/s
1 GPU	150	20.22
2 GPUs	150	31.82
4 GPUs	150	37.10
4 GPUs	250	53.36
8 GPUs, mixed traffic	250	60.78
8 GPUs, text traffic	400	63.05

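Throughput improved substantially with replicas, but only when client load was high enough to keep them busy. At 150 users, going from 2 to 4 GPUs added only about 5 req/s (31.82 to 37.10), while the same 4 GPUs reached 53.36 req/s at 250 users. The 8-GPU text endpoint produced the strongest raw throughput at 63.05 requests per second with 400 users, while the 8-GPU mixed endpoint peaked at 60.78 requests per second with 250 users and better tail latency (p95 TTFT of 916 ms versus 1556 ms).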
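The scaling at a fixed load of 150 users can be quantified as speedup over the 1-GPU baseline and per-GPU efficiency. The sketch below uses the req/s figures from the table; `scaling_efficiency` is a hypothetical helper, and the comparison only holds at equal user counts because no 1-GPU run at higher load is reported.

```python
def scaling_efficiency(baseline_rps: float, rps: float, gpus: int) -> tuple:
    """Return (speedup over the 1-GPU baseline, speedup divided by GPU count)."""
    speedup = rps / baseline_rps
    return round(speedup, 2), round(speedup / gpus, 2)


# All three runs used 150 users, so the comparison is like for like.
baseline = 20.22
for gpus, rps in [(2, 31.82), (4, 37.10)]:
    speedup, efficiency = scaling_efficiency(baseline, rps, gpus)
    print(f"{gpus} GPUs: {speedup}x speedup, {efficiency} efficiency per GPU")
```

At 150 users, two replicas give about 1.57x and four give about 1.83x, which is consistent with the four-replica endpoint being underloaded at that user count rather than with replicas scaling poorly.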

TTFT vs. Load on 4 GPUs

Users	Req/s	TTFT p95
175	42.91	1076 ms
200	45.31	1307 ms
250	53.36	1371 ms
300	55.31	1843 ms

The best 4-GPU text configuration used four single-H100 replicas with moderate per-container concurrency. At 250 users it reached 53.36 requests per second with p95 TTFT of 1.37 seconds. Increasing to 300 users raised throughput only slightly, but pushed p95 TTFT to 1.84 seconds, making 250 users the better latency-throughput operating point.
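One way to see the knee in the 4-GPU table is to compare the relative throughput gain with the relative TTFT increase at each load step. The sketch below does that arithmetic on the four rows above; `marginal_steps` is a hypothetical helper, not part of the original benchmark tooling.

```python
# (users, req/s, p95 TTFT in ms) for the 4-GPU text endpoint, copied from the table above.
four_gpu_text = [
    (175, 42.91, 1076),
    (200, 45.31, 1307),
    (250, 53.36, 1371),
    (300, 55.31, 1843),
]


def marginal_steps(rows):
    """For each consecutive pair of load levels, report relative throughput and TTFT change."""
    steps = []
    for (u0, x0, t0), (u1, x1, t1) in zip(rows, rows[1:]):
        steps.append({
            "users": (u0, u1),
            "throughput_gain_pct": round(100 * (x1 - x0) / x0, 1),
            "ttft_increase_pct": round(100 * (t1 - t0) / t0, 1),
        })
    return steps


for step in marginal_steps(four_gpu_text):
    print(step)
```

The last step, 250 to 300 users, buys roughly 3.7% more throughput for roughly 34% more p95 TTFT, while the step from 200 to 250 users buys about 17.8% more throughput for about 4.9% more TTFT.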

Mixed Multimodal Saturation

Users	Req/s	TTFT p95
150	40.95	918 ms
200	53.10	819 ms
250	60.78	916 ms
300	57.00	1536 ms

The best 8-GPU mixed multimodal configuration used eight single-H100 replicas with lower per-container concurrency. It reached 60.78 requests per second at 250 users while keeping p95 TTFT below one second. At 300 users, throughput dropped to 57.00 requests per second and p95 TTFT rose to 1.54 seconds, showing that the endpoint had passed its saturation point.
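Choosing the operating point can be framed as picking the highest-throughput load level whose p95 TTFT stays within a latency budget. The sketch below applies that rule to the mixed-traffic rows above. The 1000 ms budget is an illustrative assumption that matches the sub-second framing in the text, and `best_operating_point` is a hypothetical helper.

```python
# (users, req/s, p95 TTFT in ms) for the 8-GPU mixed endpoint, copied from the table above.
mixed_8gpu = [
    (150, 40.95, 918),
    (200, 53.10, 819),
    (250, 60.78, 916),
    (300, 57.00, 1536),
]


def best_operating_point(rows, ttft_budget_ms):
    """Return the row with the highest req/s among rows within the TTFT budget, or None."""
    eligible = [row for row in rows if row[2] <= ttft_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[1])


print(best_operating_point(mixed_8gpu, ttft_budget_ms=1000))
```

With a 1000 ms budget this selects the 250-user row. A tighter budget would push the choice toward lower load, or return nothing if no measured level qualifies.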

Replica Scaling vs. Tensor Parallelism

Configuration	Users	Req/s	TTFT p95
4 single-H100 replicas	250	53.36	1371 ms
4-H100 tensor-parallel endpoint	250	8.74	17884 ms
4-H100 eager tensor-parallel endpoint	250	7.51	7311 ms

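Since Qwen/Qwen3-VL-4B-Instruct fits on one H100, tensor parallelism was unnecessary for this workload: sharding the model adds cross-GPU communication without providing any memory headroom the model actually needs. At 250 users, four independent H100 replicas delivered about six times the throughput of a single tensor-parallel 4-H100 server (53.36 vs. 8.74 req/s) with roughly one-thirteenth the p95 TTFT (1371 ms vs. 17884 ms). Running the tensor-parallel server in eager mode lowered p95 TTFT to 7311 ms but did not improve throughput (7.51 req/s).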

Benchmark Workflow

Early experiments showed that benchmark results could be misleading if a warm Modal app was reused after changing server parameters, because /health could pass before the new vLLM configuration was actually serving steady-state traffic. To fix this, I separated deployment-time parameters from client-side load-test parameters, then used one Modal app per serving configuration and ran multiple load tests against each warmed endpoint.
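One way to enforce "one app per serving configuration" is to derive the app name from the deployment-time settings, so a changed config can never land on an already-warm app. The sketch below is an assumption about how that could be done, not the tooling used in the benchmark; `app_name_for` and the `infertutor-vllm` prefix are hypothetical.

```python
import hashlib
import json


def app_name_for(config: dict, prefix: str = "infertutor-vllm") -> str:
    """Derive a deterministic app name from the deployment-time serving config."""
    # Sorting keys makes the name independent of the order the settings were written in.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}-{digest}"


# Placeholder settings for illustration only.
server_config = {"gpus": 4, "tensor_parallel_size": 1, "max_num_seqs": 64}
print(app_name_for(server_config))
```

Client-side parameters such as user count stay out of the hashed config, so several load levels can be run against the same warmed URL.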

Final Takeaways

Replica scaling outperformed tensor parallelism for this model and workload.
More users helped only until the endpoint reached saturation; after that, TTFT increased faster than throughput.
Four H100 replicas were enough for roughly 53 req/s text throughput with p95 TTFT near 1.4 seconds.
Eight H100 replicas handled mixed multimodal traffic well, peaking around 61 req/s with sub-second p95 TTFT.
The best benchmark workflow was to deploy once per serving config, warm the endpoint, then run multiple client-side load levels against the same URL.

Technical Stack

Layer	Tool
Serving platform	Modal
Inference server	vLLM OpenAI-compatible server
Model	Qwen/Qwen3-VL-4B-Instruct
Hardware	H100 GPUs
Workload	Text and multimodal tutoring requests
Metrics	Request throughput, chunks/sec, p95 TTFT, p95 latency