Benchmarking LLMs and Local Rerankers on Product Knowledge Tasks: Price vs Performance
Note: Some models are not included properly. This is partly due to the inability to disable reasoning for some models via the OpenRouter API; because this benchmark is latency sensitive, disabling reasoning was a key requirement. This also explains why GPT models have a null score: those models are not designed to run in a non-reasoning mode. Supporting them would require adapting the benchmark to allow reasoning, which would make it less relevant for identifying low-latency workhorse models.
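To make the latency constraint concrete, below is a minimal sketch of how a single benchmark call might disable reasoning and measure wall-clock latency. It assumes OpenRouter's OpenAI-compatible chat completions endpoint and its unified `reasoning` parameter; the `OPENROUTER_API_KEY` environment variable, function name, and prompt handling are illustrative, not the benchmark's actual code.

```python
import os
import time

import requests

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def timed_completion(model: str, prompt: str) -> tuple[str, float]:
    """Request a non-reasoning completion and return (text, latency in seconds)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: OpenRouter's unified `reasoning` parameter is used to
        # disable reasoning. Models that cannot run without reasoning (e.g.
        # some GPT reasoning models) may reject this or return empty output,
        # which is why they score null in the results table.
        "reasoning": {"enabled": False},
    }
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    start = time.perf_counter()
    response = requests.post(OPENROUTER_URL, json=payload, headers=headers, timeout=60)
    elapsed = time.perf_counter() - start
    response.raise_for_status()
    text = response.json()["choices"][0]["message"]["content"]
    return text, elapsed
```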
Figure: Label Augmentation Score (higher is better) vs. cost to run the benchmark in $ (log scale, lower is better).
Figure: Label Augmentation Score (higher is better) vs. time taken in seconds (log scale, lower is better).
| Model | Scenario | Params | Total Cost ($) | Aug Cost / 10k ($) | Rerank Cost / 10k ($) | Time (s) | Tokens | Aug Score | Rerank Dist | Action |
|---|---|---|---|---|---|---|---|---|---|---|