gregsadetsky 16 minutes ago

I'm new/uninformed in this world, but I have an idea for an eval that I think has not been tried yet.

Can anyone direct me towards how to ... make one? At the most fundamental level, is it about having test questions with known, golden (verified, valid) answers, asking different LLMs to answer them, and comparing scores (how many each model got correct)?
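
Something like this minimal loop is what I'm picturing (purely a sketch; `ask_model` is a stand-in for a real provider client, and the canned answers just make it runnable):

    # Minimal eval harness sketch: golden Q/A pairs, one call per
    # question, exact-match scoring.
    golden = [
        {"question": "What is 2 + 2?", "answer": "4"},
        {"question": "What is the capital of France?", "answer": "Paris"},
    ]

    def ask_model(model: str, question: str) -> str:
        # Placeholder: swap in a real API call (OpenAI, Anthropic, etc.).
        canned = {"What is 2 + 2?": "4", "What is the capital of France?": "Paris"}
        return canned.get(question, "")

    def accuracy(model: str) -> float:
        correct = sum(
            ask_model(model, ex["question"]).strip() == ex["answer"]
            for ex in golden
        )
        return correct / len(golden)

    for model in ("model-a", "model-b"):
        print(f"{model}: {accuracy(model):.2%}")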

What are the "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each LLM? What are the non-obvious gotchas?
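
For the repeated-attempts part, I imagine something like this (again just a sketch; the fake `ask_model` flips a biased coin to stand in for sampling nondeterminism at temperature > 0):

    import random

    def ask_model(model: str, question: str, temperature: float = 0.0) -> str:
        # Stand-in for a real API call; the randomness just illustrates
        # why a single attempt per question can mislead.
        return "4" if random.random() < 0.9 else "5"

    def pass_rate(model: str, question: str, gold: str,
                  attempts: int = 10, temperature: float = 0.7) -> float:
        hits = sum(
            ask_model(model, question, temperature=temperature).strip() == gold
            for _ in range(attempts)
        )
        return hits / attempts

    print(pass_rate("model-a", "What is 2 + 2?", "4"))  # e.g. 0.9, not 1.0 or 0.0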

Finally, are there any known/commonly used frameworks for doing this, or would any tooling that can call different LLMs be enough?

Thanks!

esafak 40 minutes ago

No, thanks. Just use evals with error bars. If you can't get error bars, use an A/B test alongside evals to detect spurious results.
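
Concretely, "error bars" here can be as simple as a binomial confidence interval over per-example 0/1 scores (a sketch, assuming independent examples; the 83/100 numbers are made up):

    import math

    def wilson_interval(correct: int, total: int, z: float = 1.96):
        # 95% Wilson score interval for a binomial proportion: the
        # error bars on an eval's accuracy.
        p = correct / total
        denom = 1 + z * z / total
        center = (p + z * z / (2 * total)) / denom
        half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
        return center - half, center + half

    lo, hi = wilson_interval(83, 100)
    print(f"accuracy 83/100, 95% CI: [{lo:.2f}, {hi:.2f}]")  # roughly [0.74, 0.89]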

gk1 3 hours ago

Founder of a/b testing company accuses founder of evals company of misrepresenting how a/b tests are used in practice, then concludes by misrepresenting how evals are used in practice: "Or you can write 10,000,000 evals."

Could've easily been framed as "you need both evals and a/b testing," but instead they chose this route which comes across as defensive, disingenuous, and desperate.

BTW, if a competitor ever writes a whole post to refute something you barely alluded to without even mentioning their name... congratulations, you've won.

  • basket_horse 42 minutes ago

    Agree. This whole post comes across as sales rather than the truth, which is that both are useful for different things.

eitland 2 hours ago

This scared me until I realized it is about raindrop.ai, not raindrop.io.

(Raindrop.io is a bookmark service that AFAIK has "take money from people and store their bookmarks" as its complete business model.)

anonymoushn 4 hours ago

The framing in this post is really weird. Automated evals can be much more informative than unit tests because the results can be much more fine grained. A/B testing in production is not suitable for determining whether all of one's internal experiments are successful or not.
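
To make "fine grained" concrete: the same eval run can be sliced by category, difficulty, failure mode, and so on, which a pass/fail unit-test suite or an aggregate A/B metric won't give you (a sketch; the records and fields are made up):

    from collections import defaultdict

    # Hypothetical per-example eval records.
    results = [
        {"category": "math", "difficulty": "easy", "correct": True},
        {"category": "math", "difficulty": "hard", "correct": False},
        {"category": "code", "difficulty": "easy", "correct": True},
        {"category": "code", "difficulty": "hard", "correct": True},
    ]

    by_slice = defaultdict(list)
    for r in results:
        by_slice[(r["category"], r["difficulty"])].append(r["correct"])

    for (cat, diff), scores in sorted(by_slice.items()):
        print(f"{cat}/{diff}: {sum(scores)}/{len(scores)} correct")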

I don't doubt that Raindrop's product is worthwhile to model vendors, but the post seems like its audience is C suite folks who have no clue how anything works. Do their most important customers even have any of these?

  • CharlieDigital an hour ago

    I think in most cases, outside of pure AI providers or thin AI wrappers, almost every team will realize more gains from focusing on their user domains and solving business problems versus fine-tuning their prompts to eke out a 5% improvement here and there.

    • basket_horse 44 minutes ago

      I don’t think you can use this as a blanket statement. For many use cases the last 5-10% is the difference between demoware and production.

CharlieDigital 4 hours ago

Both of these are kind of silly and vendors trying to sell you tooling you probably don't need.

In a gold rush, each is trying to sell you a different kind of shovel, claiming theirs to be the best, when you really should go find a geologist and figure out where the vein is.

koakuma-chan an hour ago

> Intentionally or not, the word "eval" has become increasingly vague. I've seen at least 6 distinct definitions of evals

This. I am so tired of people saying "evals" without defining what they mean. And now even management is asking me for evals and why we are not fine-tuning.