Surge AI

Software Development

San Francisco, California · 19,057 followers

Human intelligence for AGI

About us

Our mission is to raise AGI with the richness of humanity — curious, witty, imaginative, and full of breathtaking brilliance.

Website
https://www.surgehq.ai
Industry
Software Development
Company size
51-200 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2020
Specialties
machine learning, data labeling, artificial intelligence, and software

Updates

  • The irony: common sense is the easiest thing for humans, but the hardest thing for models. We’ve taught models to plan, adapt, and use tools. Now we need them to realize when a “refund” means the item already arrived.

    Edwin Chen, Founder at Surge AI

    Everyone's acting like models are ready to replace humans. So we built an entire company to make them prove it.

    Turns out the world’s worst customer service agent is still an LLM. Even worse than the agents who keep repeating the policy no matter how many times you rephrase your question.

    Like when a customer says: “The name under my account should be Sarah Kim.”
    GPT-5: If you want to change your name, contact customer support.
    Human: uhh… that’s just my name. 🤦

    In this eval we learned something important: models are pedantic and literal. Seems like common sense is the ever-elusive frontier of AI, and maybe the one we still don’t know how to cross.

    New analysis 👇 https://lnkd.in/eS28Y3Eh

  • Is AI more likely to make you a billion dollars … or lose it?

    Edwin Chen, Founder at Surge AI

    We made GPT-5, Claude, and Gemini do real Wall Street work. Then we asked 200 finance pros to grade them.

    GPT-5 “won.” It also got called “pretty bad” or “horrible” by nearly a third of the raters. These models just lacked a financial gut sense.

    One would even have gotten a real bank fined: it picked the right Basel framework, but skipped the parts that make it compliant.

    We saw 5 other big failure patterns. Here’s what breaks when you bring LLMs to Wall Street 👇 https://lnkd.in/eA9eApsX

  • Teaching LLMs to follow instructions? Step 1. Teaching them to have taste? That's the endgame.

    An 8-line poem about the moon can check every box:
    ✅ Moon: mentioned
    ✅ Lines: 8
    ✅ Rhymes: yes!
    ...and still be completely forgettable. (A sketch of such a checkbox rubric follows below.)

    The models that win aren't the most obedient. They're the ones that understand quality. Nuance. Voice. The ones that take your breath away.

    This is what separates the best frontier labs from the rest: understanding the difference between "technically correct" and "actually good." (That's why a lot of post-training researchers aren't just scientists. They're also... artists.)

    Our CEO Edwin Chen talked about this on Gradient Dissent a couple months ago: training for taste, scaling the complexities of human judgment, and the art and science of post-training. https://lnkd.in/eXnZkf-S
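
    To make “checks every box” concrete, here is a minimal sketch of a checkbox rubric in Python. This is an editor’s illustration, not Surge’s eval harness; the function and its heuristics are hypothetical.

    ```python
    # Hypothetical checkbox rubric: every check below is machine-verifiable,
    # yet none of them measures whether the poem is any good.
    def passes_rubric(poem: str) -> bool:
        lines = [ln for ln in poem.strip().splitlines() if ln.strip()]

        mentions_moon = "moon" in poem.lower()   # ✅ Moon: mentioned
        has_eight_lines = len(lines) == 8        # ✅ Lines: 8

        def rhymes(a: str, b: str) -> bool:
            # Crude stand-in for a real rhyme check: compare final letters.
            return a.rstrip(".,!?").lower()[-2:] == b.rstrip(".,!?").lower()[-2:]

        has_rhyme = len(lines) >= 2 and rhymes(lines[0], lines[1])  # ✅ Rhymes: yes!

        return mentions_moon and has_eight_lines and has_rhyme

    # A completely forgettable poem passes every box:
    dull = "\n".join(["The moon is bright tonight"] * 8)
    print(passes_rubric(dull))  # True; nobody would call this poetry
    ```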

  • Models today are acing academic benchmarks. So why do they so often still let users down? Because instead of being trained with users in mind, models are being trained to top academic benchmarks and leaderboards, which often don’t reflect reality.

    That’s why Surge builds its own RL environments. We created a world filled with tools, entities, tasks, and verifiers, then let the models loose inside. The result is a training ground that looks a lot more like the real world: rich, messy, unpredictable. And it’s helping labs create the most creative and resilient models we’ve seen. (A toy sketch of this shape follows below.)
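
    As a toy illustration of the “tools, entities, tasks, verifiers” recipe, here is a minimal Python sketch. All names (SupportEnv, lookup_order, issue_refund, verify) are hypothetical editor’s examples, not Surge’s actual environment API.

    ```python
    class SupportEnv:
        """Toy RL environment: tools act on entities; a verifier grades the task."""

        def __init__(self):
            # Entities: a small world state the agent can inspect and change.
            self.orders = {"A1": {"status": "delivered", "refunded": False}}

        # Tools: actions the model may call, which read or mutate the world.
        def lookup_order(self, order_id):
            return self.orders.get(order_id)

        def issue_refund(self, order_id):
            self.orders[order_id]["refunded"] = True
            return f"refund issued for {order_id}"

        # Verifier: grades the final world state, not the transcript, so the
        # reward reflects whether the task was actually accomplished.
        def verify(self, task):
            if task == "refund order A1":
                return 1.0 if self.orders["A1"]["refunded"] else 0.0
            return 0.0

    env = SupportEnv()
    env.issue_refund("A1")                # the agent's tool call
    print(env.verify("refund order A1"))  # 1.0: task verified
    ```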

  • We’ve been chatting with our Surge Research Fellows (Fields Medalists, Harvard professors, and frontier scholars spanning multiple industries). A common theme keeps surfacing ➡️ the hard part isn’t making AI smart. It’s making it reliable.

    Professor Bogdan Grechuk (IMO gold medalist, Associate Professor at the University of Leicester) put it sharply in a recent conversation: “Models can look convincing but be wrong. In mathematics this should be easier to fix! We have formal proof systems that can verify the answer. If AI could get better at using such tools to check its work, it would be much more helpful, becoming a trustworthy research partner.” (A toy example of such a machine-checked proof follows below.)

    This is why Bogdan works with Surge. He helps design problems just beyond the reach of today’s models, where “intelligence” alone isn’t enough. https://lnkd.in/ezE74Fvx
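
    As an editor’s illustration of the formal proof systems Professor Grechuk mentions (not part of the original post), here is a tiny Lean 4 theorem. The Lean kernel mechanically verifies the proof; a wrong argument simply fails to compile.

    ```lean
    -- The kernel checks that `Nat.add_comm a b` really proves a + b = b + a.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b
    ```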

  • Infusing AI with humanity comes in many forms. Sometimes it involves drinking. (Can an LLM get drunk? What’s the opposite of an adversarial prompt injection?) An incredible group of people showed up to our happy hour last week: engineers from OpenAI & Anthropic, PhDs from Stanford & MIT, founders & builders. Drop a 🍻 if you want an invite to the next one.

  • Thrilled to see Meta (Thomas Scialom, PhD, Grégoire Mialon, and the brilliant Meta Agents team) launch Gaia2, built within their new Agent RL Environment (ARE) platform!

    Gaia2 sets a new bar for agent evaluation by testing models in dynamic, real-world conditions, where APIs time out, priorities shift, and unexpected friction derails even the smartest systems. The ARE framework is designed to run and evaluate agents in environments that look and feel like the real world. (After all, in this day and age, winning IMO medals is a challenge of the past. It’s deploying models into the rich, long-tail messiness of the real world that’s the golden frontier.)

    Two years ago, Meta pioneered this vision with GAIA, a benchmark we were proud to help build. Since then, we’ve had the privilege of collaborating again on Gaia2 and ARE, creating the diverse scenarios and data that helped bring these ideas to life, with a team at Meta that recognized the importance of RL environments long before they became the hot new kid on the block.

    We’ve always believed that training models is only half the story. Measuring them, with the right objectives in the right environments, is just as critical. Who cares if a model gets better at the wrong thing?

    Gaia2 is a big step forward on that journey. We’re excited to keep collaborating toward powerful, reliable AI that works even in the chaos of the real world. Congrats to the Meta team! https://lnkd.in/eF5NRVuF

  • SOTA models hit 67% on coding benchmarks. That sounds good until you realize the other 33% don’t fail quietly; they fail expensively, leading engineers down rabbit holes and burning hours. This is the productivity lottery that the benchmarks miss.

    We ran Gemini 2.5 Pro, Claude Sonnet 4, and GPT-5 on SWE-bench with professional engineers analyzing every failure. What we found: small mistakes spiral into self-reinforcing hallucinations. Some models can’t backtrack. Others succeed by noticing uncertainty and re-checking instead of guessing.

    Leaderboards don’t show you this. Trajectories do. Full breakdown here 👉 https://lnkd.in/euEShCZZ
