Inference-benchmarker 🤗

Benchmark results

Summary

This table shows the average value of each metric for every model and QPS (queries per second) rate.

The metrics are:

  • Inter token latency (ITL): Time to generate each new output token for a user querying the system. It translates to the “speed” perceived by the end user. We aim for at least 300 words per minute (the average reading speed), so ITL < 150 ms.
  • Time to First Token (TTFT): Time the user has to wait before seeing the first token of their answer. Low waiting times are essential for real-time interactions, less so for offline workloads.
  • End-to-end latency: The overall time the system takes to generate the full response to the user.
  • Throughput: The number of tokens per second the system can generate across all requests.
  • Successful requests: The number of requests the system was able to honor within the benchmark timeframe.
  • Error rate: The percentage of requests that ended in error, because the system could not process them in time or failed to process them.
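As a rough sketch of how these metrics relate to raw measurements, the snippet below computes them from per-request token timestamps. The `RequestTrace` type, field names, and toy data are illustrative assumptions, not the tool's actual API; the formulas (TTFT as first-token delay, ITL as average gap between output tokens, throughput as total tokens over the benchmark span) follow the definitions above.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    sent_at: float            # when the request was issued (seconds)
    token_times: list[float]  # arrival time of each output token (seconds)
    ok: bool                  # whether the request completed successfully

def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    # Only successful requests with at least one token count toward latency stats.
    ok = [t for t in traces if t.ok and t.token_times]
    ttft = [t.token_times[0] - t.sent_at for t in ok]
    # ITL: average gap between consecutive output tokens of one request.
    itl = [
        (t.token_times[-1] - t.token_times[0]) / (len(t.token_times) - 1)
        for t in ok if len(t.token_times) > 1
    ]
    e2e = [t.token_times[-1] - t.sent_at for t in ok]
    total_tokens = sum(len(t.token_times) for t in ok)
    # Throughput across all requests over the whole benchmark span.
    span = max(t.token_times[-1] for t in ok) - min(t.sent_at for t in ok)
    return {
        "ttft_avg": sum(ttft) / len(ttft),
        "itl_avg": sum(itl) / len(itl),
        "e2e_avg": sum(e2e) / len(e2e),
        "throughput_tok_s": total_tokens / span,
        "successful_requests": len(ok),
        "error_rate_pct": 100.0 * (len(traces) - len(ok)) / len(traces),
    }

# Toy data: one success (TTFT 0.2 s, ITL 0.1 s), one errored request.
traces = [
    RequestTrace(0.0, [0.2, 0.3, 0.4, 0.5], True),
    RequestTrace(0.1, [], False),
]
metrics = summarize(traces)
```

The 150 ms ITL target follows from the reading-speed goal: 300 words per minute at roughly 1.3 tokens per word is about 6.5 tokens per second, i.e. one token every ~150 ms.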

Details

Select a model to see the following charts:

  • Inter Token Latency (lower is better)
  • TTFT (lower is better)
  • End-to-End Latency (lower is better)
  • Request Output Throughput (higher is better)
  • Successful requests (higher is better)
  • Error rate (lower is better)
  • Prompt tokens
  • Decoded tokens