LinkedIn has published one of the best reports I've read on deploying LLM applications: what worked and what didn't.

1. Structured outputs
They chose YAML over JSON as the output format because YAML uses fewer output tokens. Initially, only 90% of the outputs were correctly formatted YAML. They used re-prompting (asking the model to fix its YAML responses), which increased the number of API calls significantly. They then analyzed the common formatting errors, added hints about those errors to the original prompt, and wrote an error-fixing script. This reduced their errors to 0.01%.

2. Sacrificing throughput for latency
Originally, they focused on TTFT (Time To First Token), but realized that TBT (Time Between Tokens) hurt them a lot more, especially for Chain-of-Thought queries where users don't see the intermediate outputs. They found that TTFT and TBT inversely correlate with TPS (Tokens Per Second). To achieve good TTFT and TBT, they had to sacrifice TPS.

3. Automatic evaluation is hard
One core challenge of evaluation is coming up with a guideline for what a good response is. For example, for skill fit assessment, the response "You're not a good fit for this job" can be correct, but not helpful. Originally, evaluation was ad hoc: everyone could chime in. That didn't work. They then had linguists build tooling and processes to standardize annotation, evaluating up to 500 daily conversations, and these manual annotations guide their iteration. Their next goal is automatic evaluation, but it's not easy.

4. Initial success with LLMs can be misleading
It took them 1 month to achieve 80% of the experience they wanted, and an additional 4 months to surpass 95%. The initial success made them underestimate how challenging it would be to improve the product, especially when dealing with hallucinations. They found it discouraging how slow it was to achieve each subsequent 1% gain.

#aiengineering #llms #aiapplication
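The flow described in point 1 (try cheap deterministic fixes first, re-prompt the model only as a last resort) can be sketched roughly like this. The fix rules and function names are illustrative assumptions, not LinkedIn's actual code:

```python
# Sketch of a YAML-repair pipeline: parse, apply scripted fixes for
# common LLM formatting errors, and only re-prompt (one extra, costly
# API call) if deterministic repair fails. Illustrative only.
import re
import yaml  # PyYAML


def fix_common_yaml_errors(text: str) -> str:
    """Deterministic repairs for frequent LLM formatting mistakes."""
    # Strip markdown code fences the model sometimes wraps output in.
    text = re.sub(r"^```(?:yaml)?\s*|\s*```$", "", text.strip())
    # YAML forbids tab indentation; replace tabs with spaces.
    return text.replace("\t", "  ")


def parse_llm_yaml(raw: str, reprompt=None):
    """Parse model output, trying scripted fixes before re-prompting."""
    for candidate in (raw, fix_common_yaml_errors(raw)):
        try:
            return yaml.safe_load(candidate)
        except yaml.YAMLError:
            continue
    if reprompt is not None:  # last resort: ask the model to fix itself
        return yaml.safe_load(reprompt(raw))
    raise ValueError("unparseable YAML output")


broken = "```yaml\nskills:\n\t- python\n\t- sql\n```"
print(parse_llm_yaml(broken))  # {'skills': ['python', 'sql']}
```

The ordering matters for the cost concern the post raises: scripted fixes are free, while each re-prompt doubles the API calls for that request.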
I tried the premium feature of automatic evaluation, Chip Huyen. It's not bad, but not amazing. I would've liked more concrete actions on how to be a better fit. It's in beta, as stated.
Alon Bochman, worth speaking to the LinkedIn team about RagMetrics / Eval software?
All points are very interesting. I see many companies report bias in evaluation by human raters. For example, raters tend to prefer long answers that seem more natural and human-like to them, but customers prefer short answers that are more aligned with their expectations for the system, and customers don't like to read a lot of text to solve their tasks. Some companies work on better rater instructions with constant verification of raters against customer preferences; some companies work on better experimentation (A/B, bandits) for LLM experiments in production. Either way, bias in evaluation is quite noticeable.

About 4: 80% quality is good enough to launch many products (it depends on the product and use case), so customers will find it useful despite some errors. Neither Google nor other systems are perfect, yet billions of people use them. 80% is typically the borderline between "ok" and "not usable", so those numbers are good. Certainly the big task is to improve it afterwards. I expect that methods to continuously improve LLMs as one gathers more data and transforms it into better and bigger training sets will become prevalent.
The in-depth analysis and iterative approach shared in the post is impressive, Chip Huyen.
This is a great read. Thanks for sharing.
Isn't 4 a typical example of the 80/20 rule, so no one should be surprised?
Thanks for sharing!
Very helpful!
I'd highly recommend this report to anyone interested in building AI applications. Great write up Juan Pablo Bottaro and Karthik Ramgopal! https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product