The irony: common sense is the easiest thing for humans and the hardest thing for models. We’ve taught models to plan, adapt, and use tools. Now we need them to realize that a “refund” request probably means the item already arrived.
Everyone’s acting like models are ready to replace humans. So we built an entire company to make them prove it.

Turns out the world’s worst customer service agent is still an LLM. Worse, even, than the human agents who keep repeating the policy no matter how many times you rephrase your question.

Like when a customer says: “The name under my account should be Sarah Kim.”
GPT-5: “If you want to change your name, contact customer support.”
Customer: uhh… that’s just my name. 🤦

In this eval we learned something important: models are pedantic and literal. Common sense seems to be the ever-elusive frontier of AI, and maybe the one we still don’t know how to cross.

New analysis 👇
https://lnkd.in/eS28Y3Eh