
Scaling a Multimodal Tutor Model on Modal

Status: complete

Benchmarked Qwen3-VL-4B-Instruct on Modal with vLLM, comparing H100 replica scaling, tensor parallelism, concurrency limits, and mixed multimodal traffic to find a low-latency, high-throughput serving configuration.

Tags: Modal, vLLM, Inference, H100, Multimodal, Qwen, Benchmarking

Benchmarking Qwen/Qwen3-VL-4B-Instruct with vLLM on Modal to tune throughput, time to first token, and replica-based serving for an AI tutoring workload.

InferTutor needs to serve many concurrent student tutoring requests, including short text prompts, long prompts, and image-based multimodal prompts. I benchmarked how serving configuration affects p95 time to first token, streaming latency, request throughput, token/chunk throughput, stability under load, and cost/performance tradeoffs on H100 GPUs.


System Setup

The benchmark used Modal to deploy vLLM as an OpenAI-compatible HTTP server. Experiments varied GPU count, replica count, tensor parallelism, max_seqs, concurrent_inputs, max_batch_tokens, max_model_len, chunked prefill, eager vs. compiled mode, and text-only vs. mixed multimodal traffic.
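One way to keep these deployment-time knobs together is a small config object that renders the server command line. This is a minimal sketch, not the project's actual code: the field names follow the parameter names used above, the default values are hypothetical, and the mapping onto `vllm serve` flags is my assumption about how those names correspond to vLLM options. Replica count and concurrent_inputs are Modal-side settings and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    """Deployment-time settings for one vLLM server (illustrative only)."""

    model: str = "Qwen/Qwen3-VL-4B-Instruct"
    tensor_parallel: int = 1       # 1 means an independent single-GPU replica
    max_seqs: int = 64             # hypothetical value
    max_batch_tokens: int = 8192   # hypothetical value
    max_model_len: int = 8192      # hypothetical value
    chunked_prefill: bool = True
    eager: bool = False

    def vllm_args(self) -> list[str]:
        """Build the argv list for `vllm serve` from this config."""
        args = [
            "vllm", "serve", self.model,
            "--tensor-parallel-size", str(self.tensor_parallel),
            "--max-num-seqs", str(self.max_seqs),
            "--max-num-batched-tokens", str(self.max_batch_tokens),
            "--max-model-len", str(self.max_model_len),
        ]
        if self.chunked_prefill:
            args.append("--enable-chunked-prefill")
        if self.eager:
            args.append("--enforce-eager")
        return args


# Example: the argv for a 4-GPU tensor-parallel server in eager mode.
print(" ".join(ServingConfig(tensor_parallel=4, eager=True).vllm_args()))
```

Keeping the config immutable makes it easy to treat each distinct set of values as a separate serving configuration.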

The key architectural finding was that the model fits comfortably on a single H100, so independent replicas scaled better than tensor parallelism for this workload.

Client load tester -> Modal web endpoint -> vLLM replicas -> H100 GPUs
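On the client side of that path, TTFT and total latency can be taken from the streamed response itself. The sketch below is an assumption about how such a measurement could look, not the load tester used here: `time_stream` and `p95` are hypothetical helpers, `chunks` stands for any iterator over streamed pieces, and the percentile uses the nearest-rank definition, which is only one of several common choices.

```python
import math
import time
from typing import Callable, Iterable


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter):
    """Consume a streamed response and return (ttft_s, latency_s, n_chunks).

    The clock starts when this function is called, so call it right as the
    request is sent.
    """
    start = clock()
    first = None
    n_chunks = 0
    for _ in chunks:
        if first is None:
            first = clock()  # arrival time of the first streamed chunk
        n_chunks += 1
    end = clock()
    ttft = (first - start) if first is not None else float("nan")
    return ttft, end - start, n_chunks


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]
```

Passing the clock as a parameter keeps the timing logic testable without a live endpoint.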

Key Results

| Experiment | Mode | Users | GPUs | Req/s | Chunks/s | TTFT p95 | Latency p95 | Notes |
|---|---|---|---|---|---|---|---|---|
| 1-GPU baseline | text | 150 | 1 | 20.22 | 1936.3 | 3211 ms | 5045 ms | Useful lower-bound baseline |
| 2-GPU replicas | text | 150 | 2 | 31.82 | 3046.2 | 2150 ms | 3855 ms | Clear replica scaling gain |
| 4-GPU prior best | text | 150 | 4 | 37.10 | 3551.7 | 1151 ms | 3225 ms | Strong latency at 150 users |
| 4-GPU replicas | text | 250 | 4 | 53.36 | 5106.3 | 1371 ms | 4527 ms | Best 4-GPU balance |
| 4-GPU replicas | text | 300 | 4 | 55.31 | 5293.9 | 1843 ms | 5088 ms | Higher throughput, more queueing |
| 8-GPU replicas | mixed | 250 | 8 | 60.78 | 5605.2 | 916 ms | 3772 ms | Best multimodal result |
| 8-GPU replicas | mixed | 300 | 8 | 57.00 | 5256.1 | 1536 ms | 4222 ms | Saturation begins |
| 8-GPU replicas | text | 400 | 8 | 63.05 | 6035.5 | 1556 ms | 4277 ms | Best text throughput |

Throughput and TTFT Updates

Figure: InferTutor throughput and p95 TTFT across serving updates

The biggest changes came from moving from one larger serving process toward many independent single-H100 replicas. Throughput rose because requests could be spread across more warm vLLM servers, while TTFT improved when each replica had enough headroom to start prefill quickly instead of letting requests sit in a queue.

The curve also shows the saturation point. On 4 GPUs, moving from 250 to 300 users added only about 2 req/s, but p95 TTFT jumped from 1371 ms to 1843 ms. That is the classic sign of a saturated endpoint: the replicas are already busy, so extra client pressure mostly turns into queueing delay instead of useful throughput. The 8-GPU mixed endpoint shows the same pattern: 250 users was the sweet spot, while 300 users lowered throughput and raised TTFT.
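The saturation pattern is easier to see as step-to-step differences. The sketch below uses the 4-GPU text and 8-GPU mixed load sweeps reported in this write-up; `marginal_gains` is a hypothetical helper, not part of the benchmark tooling.

```python
def marginal_gains(sweep):
    """For consecutive load levels, return (users_from, users_to, d_req_s, d_ttft_ms)."""
    steps = []
    for (u0, r0, t0), (u1, r1, t1) in zip(sweep, sweep[1:]):
        steps.append((u0, u1, round(r1 - r0, 2), t1 - t0))
    return steps


# Rows are (users, req/s, p95 TTFT in ms), copied from the reported results.
four_gpu_text = [(175, 42.91, 1076), (200, 45.31, 1307),
                 (250, 53.36, 1371), (300, 55.31, 1843)]
eight_gpu_mixed = [(150, 40.95, 918), (200, 53.10, 819),
                   (250, 60.78, 916), (300, 57.00, 1536)]

for name, sweep in [("4-GPU text", four_gpu_text), ("8-GPU mixed", eight_gpu_mixed)]:
    for u0, u1, d_req, d_ttft in marginal_gains(sweep):
        print(f"{name}: {u0} -> {u1} users: {d_req:+.2f} req/s, {d_ttft:+d} ms p95 TTFT")
```

A step where the throughput gain shrinks or turns negative while TTFT grows sharply marks the point past which added users mostly wait in a queue.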

Some of the TTFT movement between configurations came from deployment hygiene rather than from serving changes. Reusing a warm Modal app could let a health check pass while an older vLLM configuration was still serving steady-state traffic. The benchmark became more reliable after I separated deployment-time server settings from client-side load-test settings and warmed one endpoint per serving config.


Throughput Scaling

| Configuration | Users | Req/s |
|---|---|---|
| 1 GPU | 150 | 20.22 |
| 2 GPUs | 150 | 31.82 |
| 4 GPUs | 150 | 37.10 |
| 4 GPUs | 250 | 53.36 |
| 8 GPUs, mixed traffic | 250 | 60.78 |
| 8 GPUs, text traffic | 400 | 63.05 |

Throughput improved substantially with replicas, but the improvement depended on driving enough client load. The 8-GPU text endpoint produced the strongest raw throughput at 63.05 requests per second, while the 8-GPU mixed endpoint peaked at 60.78 requests per second with better tail latency.
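The dependence on client load shows up when the 150-user runs are normalized against the 1-GPU baseline. This is a small arithmetic sketch over the reported numbers; `scaling_report` is a hypothetical helper, and "efficiency" here simply means speedup divided by GPU count.

```python
def scaling_report(baseline_req_s, runs):
    """Return (gpus, speedup, efficiency) rows relative to the 1-GPU baseline."""
    rows = []
    for gpus, req_s in runs:
        speedup = req_s / baseline_req_s
        rows.append((gpus, round(speedup, 2), round(speedup / gpus, 2)))
    return rows


# (GPU count, req/s) at a fixed 150 users, copied from the reported results.
runs_at_150_users = [(1, 20.22), (2, 31.82), (4, 37.10)]

for gpus, speedup, efficiency in scaling_report(20.22, runs_at_150_users):
    print(f"{gpus} GPU(s): {speedup:.2f}x speedup, {efficiency:.0%} per-GPU efficiency")
```

At a fixed 150 users, per-GPU efficiency falls as replicas are added, which is consistent with the replicas being under-driven until the user count is raised.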


TTFT vs. Load on 4 GPUs

| Users | Req/s | TTFT p95 |
|---|---|---|
| 175 | 42.91 | 1076 ms |
| 200 | 45.31 | 1307 ms |
| 250 | 53.36 | 1371 ms |
| 300 | 55.31 | 1843 ms |

The best 4-GPU text configuration used four single-H100 replicas with moderate per-container concurrency. At 250 users it reached 53.36 requests per second with p95 TTFT of 1.37 seconds. Increasing to 300 users raised throughput only slightly, but pushed p95 TTFT to 1.84 seconds, making 250 users the better latency-throughput operating point.


Mixed Multimodal Saturation

| Users | Req/s | TTFT p95 |
|---|---|---|
| 150 | 40.95 | 918 ms |
| 200 | 53.10 | 819 ms |
| 250 | 60.78 | 916 ms |
| 300 | 57.00 | 1536 ms |

The best 8-GPU mixed multimodal configuration used eight single-H100 replicas with lower per-container concurrency. It reached 60.78 requests per second at 250 users while keeping p95 TTFT below one second. At 300 users, throughput dropped to 57.00 requests per second and p95 TTFT rose to 1536 ms, showing that the endpoint had passed its best operating point and was saturated.
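Choosing an operating point from a load sweep can be framed as maximizing throughput under a TTFT budget. The sketch below applies that rule to the two sweeps reported here. The budgets (1500 ms for text, 1000 ms for mixed traffic) are hypothetical values chosen for illustration, not stated requirements of the project, and `best_operating_point` is a hypothetical helper.

```python
def best_operating_point(sweep, ttft_budget_ms):
    """Pick the highest-throughput load level whose p95 TTFT fits the budget.

    Rows are (users, req/s, p95 TTFT in ms). Returns None if nothing fits.
    """
    eligible = [row for row in sweep if row[2] <= ttft_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[1])


four_gpu_text = [(175, 42.91, 1076), (200, 45.31, 1307),
                 (250, 53.36, 1371), (300, 55.31, 1843)]
eight_gpu_mixed = [(150, 40.95, 918), (200, 53.10, 819),
                   (250, 60.78, 916), (300, 57.00, 1536)]

print(best_operating_point(four_gpu_text, ttft_budget_ms=1500))
print(best_operating_point(eight_gpu_mixed, ttft_budget_ms=1000))
```

Under these assumed budgets both sweeps select the 250-user row, matching the operating points chosen above. A looser text budget would instead admit the 300-user row, so the choice depends on the latency target.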


Replica Scaling vs. Tensor Parallelism

| Configuration | Users | Req/s | TTFT p95 |
|---|---|---|---|
| 4 single-H100 replicas | 250 | 53.36 | 1371 ms |
| 4-H100 tensor-parallel endpoint | 250 | 8.74 | 17884 ms |
| 4-H100 eager tensor-parallel endpoint | 250 | 7.51 | 7311 ms |

Since Qwen/Qwen3-VL-4B-Instruct fits on one H100, tensor parallelism was unnecessary for this workload. At 250 users, four independent H100 replicas delivered about 6x the throughput of a single tensor-parallel 4-H100 server (53.36 vs. 8.74 req/s) and roughly 13x lower p95 TTFT (1371 ms vs. 17884 ms).
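A back-of-the-envelope estimate shows why the model fits on one GPU. The numbers below are assumptions rather than measurements: roughly 4 billion parameters (read off the model name, the exact count may differ), 2 bytes per parameter for 16-bit weights, and the 80 GB H100 variant. KV cache, activations, and the vision encoder's working memory are not modeled.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


ASSUMED_PARAMS = 4e9         # assumption: about 4B parameters
ASSUMED_GPU_MEMORY_GB = 80   # assumption: 80 GB H100 variant

weights_gb = weight_memory_gb(ASSUMED_PARAMS)
fraction = weights_gb / ASSUMED_GPU_MEMORY_GB
print(f"weights: about {weights_gb:.0f} GB, {fraction:.0%} of one GPU")
```

With weights taking only a small share of one GPU under these assumptions, most of the memory is left for KV cache, so sharding the model across GPUs adds communication cost without relieving a memory constraint.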


Benchmark Workflow

Early experiments showed that benchmark results could be misleading if a warm Modal app was reused after changing server parameters, because /health could pass before the newly intended vLLM configuration was actually serving steady-state traffic. I separated deployment-time parameters from client-side load-test parameters, then used one Modal app per serving configuration and multiple load tests per warmed endpoint.
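One way to enforce "one app per serving configuration" is to derive the app name from the deployment-time settings, so that changing any server parameter can never silently reuse a warm app. This is a sketch of that idea, not the project's tooling: `app_name_for`, the `infertutor-vllm` prefix, and the example config values are all hypothetical.

```python
import hashlib
import json


def app_name_for(config: dict, prefix: str = "infertutor-vllm") -> str:
    """Derive a stable app name from deployment-time settings only."""
    # Sorting keys makes the name independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}-{digest}"


# Hypothetical server settings; client-side load levels are kept out of the
# hash on purpose so one warmed endpoint can serve several load tests.
server_config = {"gpus": 4, "tensor_parallel": 1, "max_seqs": 64}
load_levels = [175, 200, 250, 300]

name = app_name_for(server_config)
for users in load_levels:
    print(f"run load test with {users} users against app {name}")
```

Because only server-side settings feed the hash, every load level in a sweep targets the same warmed endpoint, while any change to the serving config produces a new name.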


Final Takeaways

  • Replica scaling outperformed tensor parallelism for this model and workload.
  • More users helped only until the endpoint reached saturation; after that, TTFT increased faster than throughput.
  • Four H100 replicas were enough for roughly 53 req/s text throughput with p95 TTFT near 1.4 seconds.
  • Eight H100 replicas handled mixed multimodal traffic well, peaking around 61 req/s with sub-second p95 TTFT.
  • The best benchmark workflow was to deploy once per serving config, warm the endpoint, then run multiple client-side load levels against the same URL.

Technical Stack

| Layer | Tool |
|---|---|
| Serving platform | Modal |
| Inference server | vLLM OpenAI-compatible server |
| Model | Qwen/Qwen3-VL-4B-Instruct |
| Hardware | H100 GPUs |
| Workload | Text and multimodal tutoring requests |
| Metrics | Request throughput, chunks/sec, p95 TTFT, p95 latency |