Inference-benchmarker 🤗

Benchmark results

Summary

This table shows the average value of each metric for every model and QPS (queries per second) rate.

The metrics are:

  • Inter token latency (ITL): Time to generate each new output token for a user querying the system. It translates to the “speed” perceived by the end user. We aim for at least 300 words per minute (the average reading speed), so ITL < 150 ms.
  • Time to First Token (TTFT): Time the user has to wait before seeing the first token of their answer. Low waiting times are essential for real-time interactions, less so for offline workloads.
  • End-to-end latency: The overall time the system takes to generate the full response to the user.
  • Throughput: The number of tokens per second the system can generate across all requests.
  • Successful requests: The number of requests the system was able to honor within the benchmark timeframe.
  • Error rate: The percentage of requests that ended in error, because the system could not process them in time or failed to process them.
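As a rough sketch of how these metrics relate to raw measurements, the snippet below computes them from per-request token timestamps. The `RequestTrace` type, field names, and toy data are illustrative assumptions, not the tool's actual API; the formulas (TTFT as first-token delay, ITL as average gap between output tokens, throughput as total tokens over the benchmark span) follow the definitions above.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    sent_at: float            # when the request was issued (seconds)
    token_times: list[float]  # arrival time of each output token (seconds)
    ok: bool                  # whether the request completed successfully

def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    # Only successful requests with at least one token count toward latency stats.
    ok = [t for t in traces if t.ok and t.token_times]
    ttft = [t.token_times[0] - t.sent_at for t in ok]
    # ITL: average gap between consecutive output tokens of one request.
    itl = [
        (t.token_times[-1] - t.token_times[0]) / (len(t.token_times) - 1)
        for t in ok if len(t.token_times) > 1
    ]
    e2e = [t.token_times[-1] - t.sent_at for t in ok]
    total_tokens = sum(len(t.token_times) for t in ok)
    # Throughput across all requests over the whole benchmark span.
    span = max(t.token_times[-1] for t in ok) - min(t.sent_at for t in ok)
    return {
        "ttft_avg": sum(ttft) / len(ttft),
        "itl_avg": sum(itl) / len(itl),
        "e2e_avg": sum(e2e) / len(e2e),
        "throughput_tok_s": total_tokens / span,
        "successful_requests": len(ok),
        "error_rate_pct": 100.0 * (len(traces) - len(ok)) / len(traces),
    }

# Toy data: one success (TTFT 0.2 s, ITL 0.1 s), one errored request.
traces = [
    RequestTrace(0.0, [0.2, 0.3, 0.4, 0.5], True),
    RequestTrace(0.1, [], False),
]
metrics = summarize(traces)
```

The 150 ms ITL target follows from the reading-speed goal: 300 words per minute at roughly 1.3 tokens per word is about 6.5 tokens per second, i.e. one token every ~150 ms.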

Details

Select a model to see the following charts:

  • Inter Token Latency (lower is better)
  • TTFT (lower is better)
  • End-to-End Latency (lower is better)
  • Request Output Throughput (higher is better)
  • Successful requests (higher is better)
  • Error rate (lower is better)
  • Prompt tokens
  • Decoded tokens