The Klarna AI Customer Service Experiment: Three Acts, One Honest Lesson

February 2024. Klarna drops the headline that gets forwarded around every AI newsletter for months: their OpenAI-powered assistant handled 2.3 million conversations in its first month. Two-thirds of all customer chats. Equivalent to 700 full-time agents. Resolution time under 2 minutes, versus 11 minutes for humans. CSAT 4.4 versus 4.2. Repeat inquiries down 25%.

CEO Sebastian Siemiatkowski: “This AI breakthrough means superior experiences for customers at better prices.”

For a lot of enterprise leaders, this was the case study they’d been waiting for. Real numbers, at scale, with proof.

Then the rest of the story happened.

Act 1: What the Numbers Actually Said

Let’s be precise about what was genuinely impressive here, because the signal and the spin got mixed together quickly.

The 2-minute versus 11-minute comparison is real, but only for the queries the AI was handling. That caveat matters more than it looks. Klarna routed tier-1, structured, high-volume interactions to the AI: order status, refund eligibility, payment schedules, account lookups. These are exactly the queries that a well-trained system with defined decision trees handles well. The AI didn’t replace the full distribution of customer service work. It handled the easy part of it, faster.

The CSAT comparison is similarly scoped. It measures queries that were successfully resolved, on channels where customers were already comfortable with digital-first interactions.

That said, handling two-thirds of chat volume with comparable satisfaction at a fraction of the response time is a substantial result. Klarna’s workforce went from roughly 7,400 to around 3,000 through attrition and a hiring freeze. Projected annual profit improvement: $40 million.

Act 1 is a real win. Just read it precisely.

Act 2: Where It Broke

By May 2025, CSAT was deteriorating on complex queries. The AI was hallucinating on edge cases — confidently producing wrong answers on policy questions, dispute resolutions, situations requiring judgment rather than retrieval. Customers with genuinely difficult problems had no human fallback option.

The backlash was tangible enough that Siemiatkowski addressed it publicly: “cost unfortunately seems to have been a too predominant evaluation factor.”

That’s a remarkable sentence for a CEO to say. It’s an admission that the company optimised for cost, reassured by a metric that looked good (average CSAT on successfully resolved queries), rather than for the metric that mattered: quality across the full distribution, including the hard cases and the failures.

The fix was rehiring. But not the same model. Klarna moved to a gig structure: work-from-anywhere, roughly $41 per hour, no permanent employment overhead. This detail is worth noting. It’s not a reversal on AI strategy. It’s a recalibration toward a hybrid cost model that still looks nothing like the pre-AI baseline.

Act 3: Where It Stabilised

Late 2025. The model settles.

The AI agent is now doing the work of 853 FTEs, up from 700. Reported savings are $60 million. It still handles two-thirds of chat volume. But customer service costs rose to $50 million in Q3, up from $42 million a year earlier, because the human layer was added back.

This is the durable model: AI owns tier-1 routine volume; humans handle VIP interactions, escalations, and emotional complexity. The economics still work. The savings are real. But pure replacement didn’t hold, and Klarna is now paying for both the AI infrastructure and a reconstituted human layer on top of it.
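To make that split concrete, here is a minimal Python sketch of a router with a preserved human fallback. It is an illustration under stated assumptions, not Klarna's actual system: the intent labels, the Query fields, and both thresholds are hypothetical, and a real deployment would need its own signals for sentiment and model confidence.

```python
from dataclasses import dataclass

# Hypothetical intent labels. The tier-1 set mirrors the query types
# named earlier in this article; everything else is treated as complex.
TIER1_INTENTS = {
    "order_status",
    "refund_eligibility",
    "payment_schedule",
    "account_lookup",
}


@dataclass
class Query:
    intent: str
    is_vip: bool = False
    customer_requested_human: bool = False
    sentiment: float = 0.0      # assumed scale: -1.0 (upset) to 1.0 (happy)
    ai_confidence: float = 1.0  # assumed model-reported score in [0, 1]


def route(query: Query, min_confidence: float = 0.8,
          min_sentiment: float = -0.5) -> str:
    """Return "ai" or "human". Every rule below is an exit to a human."""
    if query.is_vip or query.customer_requested_human:
        return "human"  # VIPs and explicit requests always get a person
    if query.sentiment < min_sentiment:
        return "human"  # emotional complexity
    if query.intent not in TIER1_INTENTS:
        return "human"  # disputes, policy questions, anything unstructured
    if query.ai_confidence < min_confidence:
        return "human"  # tier-1 intent, but the model is unsure
    return "ai"


print(route(Query(intent="order_status")))                      # ai
print(route(Query(intent="dispute")))                           # human
print(route(Query(intent="order_status", ai_confidence=0.4)))   # human
```

The order of the checks is the design choice here: each rule sends work to a human, and the AI only receives what survives all of them.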

What Enterprise Operators Should Take From This

Klarna is now the canonical case study for both sides of the AI customer service debate, which is why it’s worth reading the whole arc rather than the February 2024 announcement in isolation.

The part that worked: AI genuinely resolves structured, high-volume, repeatable queries faster at comparable satisfaction scores. If your service volume is dominated by order status, FAQs, account lookups, and defined decision trees, the math on AI handling holds. The 2-minute resolution time is real — for those query types.

The part that failed: Removing human fallback for complex queries is a quality bet that doesn’t pay. Customers with genuine disputes, ambiguous situations, or high emotions don’t accept a wrong AI answer and move on. They escalate, churn, and tell people. The hallucination problem on edge cases isn’t a bug you can tune away. It’s a current-generation constraint you need to architect around, not optimise through.

The measurement trap: Average satisfaction scores across resolved queries will look good when you route only the easy work to AI. This isn’t fraud. It’s incomplete measurement. The right metric is quality across the full distribution, including queries that escalate or never resolve. If you’re evaluating an AI service rollout, demand the full distribution, hard cases included, not the headline average.
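A small Python sketch shows how far the two numbers can drift apart. Everything in it is illustrative: the data is synthetic (not Klarna's), the Conversation record is hypothetical, and scoring an unrated failure as 1.0 is an assumption you should replace with whatever your own survey policy justifies.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Conversation:
    outcome: str            # "resolved", "escalated", or "abandoned"
    csat: Optional[float]   # 1-5 survey score, None if no survey answered


def headline_csat(conversations):
    """Average CSAT over resolved, rated conversations only."""
    return mean(c.csat for c in conversations
                if c.outcome == "resolved" and c.csat is not None)


def full_distribution_csat(conversations, unrated_failure_score=1.0):
    """Average over rated conversations plus unrated failures.

    Assumption: an escalated or abandoned conversation with no survey
    response is counted at unrated_failure_score instead of being dropped.
    Unrated resolved conversations are skipped.
    """
    scores = []
    for c in conversations:
        if c.csat is not None:
            scores.append(c.csat)
        elif c.outcome != "resolved":
            scores.append(unrated_failure_score)
    return mean(scores)


def failure_rate(conversations):
    """Share of conversations that escalated or never resolved."""
    failed = sum(c.outcome != "resolved" for c in conversations)
    return failed / len(conversations)


# Synthetic sample: eight easy wins, one escalation, one silent abandonment.
sample = (
    [Conversation("resolved", 4.5) for _ in range(8)]
    + [Conversation("escalated", 2.0), Conversation("abandoned", None)]
)

print(round(headline_csat(sample), 2))           # 4.5
print(round(full_distribution_csat(sample), 2))  # 3.9
print(round(failure_rate(sample), 2))            # 0.2
```

Same ten conversations, two very different stories. Reporting the failure rate next to any average keeps the hard tail visible.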

We’ve seen this same pattern repeatedly in our own deployments: aggregate experience holds or even improves because AI crushes the simple 80%, but the hard tail — complex, emotional, or high-stakes queries — drives disproportionate churn and brand damage that averaged metrics completely mask until it’s too late.

The Honest Read

Siemiatkowski’s February 2025 quote, “I am of the opinion that AI can already do all of the jobs that we, as humans, do,” lasted about three months before his own company walked it back in practice. By October 2025, he was saying “I think there is a massive shift coming to knowledge work,” which is a more defensible and more accurate claim.

The shift is real. The timeline is not linear. And the durable equilibrium is not full replacement — it’s AI handling volume that humans shouldn’t be spending time on, with humans handling situations where being wrong has real consequences.

The lesson isn’t that AI fails at customer service — it’s that removing the human fallback is a quality trade-off most organisations underestimate. The winning model isn’t AI-first or human-first. It’s AI for scale, with a deliberately preserved human layer for the irreducible edge cases.