ProductBench Leaderboard

Benchmarking LLMs and Local Rerankers on Product Knowledge Tasks: Price vs Performance

Information: Some models are not fully included. This is partly because reasoning cannot be disabled for certain models via the OpenRouter API; since this benchmark is latency-sensitive, disabling reasoning was a key requirement. This also explains why GPT models show a null score: those models are not designed to run without reasoning. Supporting them would require adapting the benchmark to allow reasoning, which would make it less relevant for identifying low-latency workhorse models.
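As a rough illustration of the limitation above, this is a minimal sketch of how a benchmark harness might ask OpenRouter to skip reasoning for a request. It only builds the JSON payload; the `reasoning` parameter shape follows OpenRouter's documented API, but whether it is honored depends on the model (the model slug here is a hypothetical placeholder):

```python
import json

def build_request(model: str, prompt: str) -> dict:
    """Build an OpenRouter chat-completions payload that requests no reasoning.

    Assumption: OpenRouter forwards the `reasoning` object to providers that
    support it; reasoning-only models cannot honor `enabled: False`, which is
    why such models end up excluded or null-scored in this benchmark.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Request that the provider skip its reasoning phase (latency-sensitive).
        "reasoning": {"enabled": False},
    }

# Hypothetical model slug, for illustration only.
payload = build_request("provider/some-model", "Classify this product label.")
print(json.dumps(payload, indent=2))
```

The payload would then be POSTed to the chat-completions endpoint; models that ignore or reject the `reasoning` field are the ones this note refers to.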

Price / Performance Frontier

X: Cost to run benchmark ($) [Log Scale, Lower is better] | Y: Label Augmentation Score [Higher is better]

Time vs Performance

X: Time Taken (Seconds) [Log Scale, Lower is better] | Y: Label Augmentation Score [Higher is better]

Detailed Results

Model | Scenario | Params | Total Cost ($) | Aug Cost / 10k ($) | Rerank Cost / 10k ($) | Time (s) | Tokens | Aug Score | Rerank Dist | Action