ProductBench Leaderboard

Benchmarking LLMs and Local Rerankers on Product Knowledge Tasks: Price vs Performance

Information: Some models are not fully included. This is partly because reasoning cannot be disabled for certain models via the OpenRouter API; since this benchmark is latency-sensitive, disabling reasoning was a key requirement. This also explains why GPT models show a null score: those models are not designed to run without reasoning. Supporting them would require adapting the benchmark to allow reasoning, which would make it less relevant for identifying low-latency workhorse models.
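As a rough illustration of the limitation above, this is a minimal sketch of how a benchmark harness might ask OpenRouter to skip reasoning for a request. It only builds the JSON payload; the `reasoning` parameter shape follows OpenRouter's documented API, but whether it is honored depends on the model (the model slug here is a hypothetical placeholder):

```python
import json

def build_request(model: str, prompt: str) -> dict:
    """Build an OpenRouter chat-completions payload that requests no reasoning.

    Assumption: OpenRouter forwards the `reasoning` object to providers that
    support it; reasoning-only models cannot honor `enabled: False`, which is
    why such models end up excluded or null-scored in this benchmark.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Request that the provider skip its reasoning phase (latency-sensitive).
        "reasoning": {"enabled": False},
    }

# Hypothetical model slug, for illustration only.
payload = build_request("provider/some-model", "Classify this product label.")
print(json.dumps(payload, indent=2))
```

The payload would then be POSTed to the chat-completions endpoint; models that ignore or reject the `reasoning` field are the ones this note refers to.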

Price / Performance Frontier

X: Cost to run benchmark ($) [Log Scale, Lower is better] | Y: Label Augmentation Score [Higher is better]

Time vs Performance

X: Time Taken (Seconds) [Log Scale, Lower is better] | Y: Label Augmentation Score [Higher is better]

Detailed Results

Model | Scenario | Params | Total Cost ($) | Aug Cost / 10k ($) | Rerank Cost / 10k ($) | Time (s) | Tokens | Aug Score | Rerank Dist | Action