Datadog Acquires Eppo: Transforming Observability into Strategic Insights

Datadog's strategic acquisition of Eppo transforms tech observability, merging performance monitoring and product analytics to help teams learn faster and drive smarter business outcomes.

At first glance, this looks like a peculiar move up-market, from pure performance monitoring to causal measurement of business outcomes. Look closer and it's deeply strategic: Datadog is staking a claim to a new category that fuses data observability, product analytics, and experimentation under one roof. I think it's a sign of major consolidation to come, and one that will have a lasting impact on experimentation tooling.

Why Eppo?

“Those who learn fastest, win.”

Che Sharma, Eppo CEO

Che Sharma, Eppo’s CEO, has a mantra I love: “Those who learn fastest, win.” Eppo’s warehouse-native stats engine turns that mantra into reality, letting teams run clean experiments and tie every feature flag to business lift.

By folding Eppo in, Datadog didn't just buy an experimentation stats engine; it also bought an on-ramp for PMs, growth teams, and ML engineers to measure impact and ship smarter and faster…which maybe shouldn't be all that surprising given its recent moves:

| Year | Move | What it added |
| --- | --- | --- |
| 2024 | Metaplane | Data-quality alerts & lineage |
| 2024 | Product Analytics and Session Replay | Qualitative UX replays |
| 2025 | Eppo | Feature flags + causal stats |

So Datadog has effectively been building a measurement base that covers not just application performance and system health but also data quality and product analytics. Maybe it can be summarized like this:

  1. From dev-only to dev + PM + growth. Datadog already owns performance telemetry; Eppo gives them feature flags + causal stats, opening the door to the product/marketing and AI/ML teams for measuring releases with certainty.

  2. Infra → outcomes. Logs, traces, and RUM tell you what broke; experiments tell you whether the fix moved the needle. Pair the two with robust feature-flagging capabilities and you shorten the loop from “this looks broken” to “ship, measure, roll out.”

  3. Precedent matters. Note the Metaplane acquisition (i.e. smart data-quality alerts) and Datadog’s quietly growing Product Analytics + Session Replay lines. It’s all geared towards this new category of ‘all-in-one measurement and release platform’.

An “all-in-one measurement and release platform”… I don’t know for certain whether this is how Datadog’s CEO thinks of the company now, but I can’t help thinking: why not? If your page performance data, product analytics, data pipeline alerts, and everything else about your app already live there, doesn’t it make sense to measure those things whenever you release a code change, whether it’s a data infrastructure tweak or a new LLM chatbot interface?

What this unlocks

Che hints at these ideas in his blog post. I’ve fleshed them out a bit here to better understand what’s coming in the near term. It’s exciting, but I also want to call out some concerns they might face.

End-to-End Canary Testing

“Datadog’s real-time observability with Eppo’s flags and stats engine means that a true end-to-end canary test solution will finally be on the market.”

Che Sharma, Eppo CEO

Imagine combining real-time observability with Eppo's feature flags and stats engine. You’ve now got a pipeline where code hits production, gets monitored, and rolls out (or back) based on thresholds and validated by Eppo’s experimentation engine. Here’s an example:

Scenario: Suppose you’re planning to roll out a new checkout API. With a canary test, you can ship the new API and compare its performance against the old one on limited traffic. This is where the Datadog-Eppo combination gets exciting. Here’s the general playbook:

  1. Flag the bundle with Eppo and roll out the new API to 5% of traffic

  2. Define metrics for both performance and the user experience / product analytics funnel:

| Performance | Product Metrics |
| --- | --- |
| P99 latency | Bounce rate from checkout start |
| Spikes in 4xx/5xx errors | Checkout rate |

  3. Decision loop: if latency or errors breach a guardrail, or a product metric like Checkout Rate falls outside its non-inferiority margin, Eppo auto-rolls back. Otherwise, keep increasing the traffic allocation until you reach 100%.

This is a standard canary-test method, but the inclusion of performance data is what’s exciting. Bundling performance with product data is a newer and more complete way to measure a release than most PMs are used to.
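To make that decision loop concrete, here’s a minimal sketch in Python. The thresholds, metric names, and helper functions are all hypothetical illustrations I’ve made up, not Eppo’s or Datadog’s actual APIs.

```python
from dataclasses import dataclass

@dataclass
class CanarySnapshot:
    """Hypothetical metric snapshot for the canary (new checkout API) vs. control."""
    p99_latency_ms: float          # from performance telemetry
    error_rate_5xx: float          # share of requests returning 5xx
    checkout_rate_delta: float     # canary minus control, in percentage points
    checkout_rate_ci_lower: float  # lower bound of the delta's confidence interval

# Illustrative guardrails; real values would come from your SLOs and
# the non-inferiority margin agreed on before the rollout.
MAX_P99_LATENCY_MS = 800
MAX_5XX_ERROR_RATE = 0.01
NON_INFERIORITY_MARGIN = -0.5   # tolerate at most a 0.5 pp drop in checkout rate

ROLLOUT_STEPS = [0.05, 0.20, 0.50, 1.00]  # 5% -> 20% -> 50% -> 100%

def evaluate_canary(snapshot: CanarySnapshot, current_step: int) -> tuple[str, float]:
    """Return (decision, new_traffic_share) for one evaluation window."""
    breached_performance = (
        snapshot.p99_latency_ms > MAX_P99_LATENCY_MS
        or snapshot.error_rate_5xx > MAX_5XX_ERROR_RATE
    )
    # Non-inferiority check: the whole confidence interval must sit above the margin.
    breached_product = snapshot.checkout_rate_ci_lower < NON_INFERIORITY_MARGIN

    if breached_performance or breached_product:
        return "rollback", 0.0
    if current_step + 1 < len(ROLLOUT_STEPS):
        return "advance", ROLLOUT_STEPS[current_step + 1]
    return "complete", 1.0

# Example: a healthy canary at the 5% step gets promoted to 20%.
decision, share = evaluate_canary(
    CanarySnapshot(p99_latency_ms=420, error_rate_5xx=0.002,
                   checkout_rate_delta=0.1, checkout_rate_ci_lower=-0.2),
    current_step=0,
)
print(decision, share)  # advance 0.2
```

In practice the snapshot would be fed by Datadog telemetry and the warehouse-computed experiment stats, but the shape of the loop (guardrail check, non-inferiority check, advance or roll back) stays the same.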

⚠️ A few concerns of mine:

  • Real-time vs warehouse lag. The mix of Datadog’s millisecond-level telemetry with Eppo’s warehouse-native metrics that land minutes (or hours) later isn’t trivial. Curious to see how they handle this.

  • Potential metric overload. Measuring more metrics only increases your chances of a false positive. It’s the same dilemma every team faces, and it will only get worse once performance metrics join the list. Teams will have to be disciplined about choosing a short, high-signal list of metrics (or a composite metric of performance and user experience) and follow proper multiple-comparison procedures; a sketch of one such procedure follows.
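One common mitigation is to control the false discovery rate across the full metric list. Here’s a minimal sketch of the Benjamini-Hochberg procedure; the metric names and p-values are made up for illustration.

```python
def benjamini_hochberg(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Return which metrics remain significant after controlling the FDR at alpha."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    # Largest rank k whose p-value is under the BH threshold (k/m) * alpha.
    cutoff_rank = 0
    for k, (_, p) in enumerate(ranked, start=1):
        if p <= (k / m) * alpha:
            cutoff_rank = k
    return {name: rank <= cutoff_rank for rank, (name, _) in enumerate(ranked, start=1)}

# Illustrative p-values for a mixed performance + product metric list.
results = benjamini_hochberg({
    "p99_latency": 0.004,
    "error_rate_5xx": 0.20,
    "bounce_rate": 0.03,
    "checkout_rate": 0.01,
})
print(results)  # p99_latency, checkout_rate, bounce_rate pass; error_rate_5xx does not
```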

Continuous GenAI Model Optimization

“Datadog’s AI observability with Eppo’s contextual bandits will allow AI teams to ensemble gen AI foundational models in a state of continuous testing and rebalancing”

Che Sharma, Eppo CEO

Datadog’s AI observability married with Eppo’s contextual multi-armed bandits could unlock real velocity in LLM model optimization. A note on each:

AI Observability: Datadog’s LLM/AI Observability gives teams a real-time control panel for generative AI systems, tracking latency, token usage, error rates, data and prediction drift, and even content-quality signals like topic relevance or toxicity.

Contextual MABs: Eppo’s contextual multi-armed bandits are warehouse-native, low-latency feature-flag rules that use reinforcement learning. They dynamically allocate traffic across arms based on the user’s context.

With the advent of LLMs in the workplace, companies are racing to deploy personalized generative AI features to best serve their customers. As system prompts evolve, system behavior shifts, and the foundational models themselves get updated, this combo could dynamically explore and serve the best-performing model variant on a personalized level.

If you’ve ever toyed with the OpenAI or Anthropic APIs, you know the iteration loop is real and needs constant attention, so MABs could be genuinely useful here.

Scenario: You’re continually testing a live support chatbot that ensembles multiple LLMs to best answer typical “How do I…” questions. Assume we’re testing three models: GPT-4o, Claude 3.7 Sonnet, and a small in-house LLM that handles simple queries as the baseline.

  1. Each chat session is assigned by the bandit based on contextual information like question topic, user characteristics, and perhaps available chat or support-ticket resolution history.

  2. Define metrics:

| Performance | Product Metrics |
| --- | --- |
| Latency per answer | Ticket deflection rate |
| Token spend | Resolution time |
|  | CSAT reply 👍/👎 |

  3. Decision loop: the bandit dynamically shifts traffic within each context toward the model with the highest reward.

The cool thing about continuous optimization here is that we could use guardrail metrics, or a composite reward of performance and product metrics, to allocate traffic to models per context vector. Datadog could then continuously optimize and personalize the chatbot for different contextual cohorts while keeping LLM performance in check. Cool!
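To illustrate the composite-reward idea, here’s a minimal per-context, epsilon-greedy sketch. This is not Eppo’s actual bandit implementation; the reward weights, model list, and context bucketing are assumptions made for the example.

```python
import random
from collections import defaultdict

MODELS = ["gpt-4o", "claude-3.7-sonnet", "in-house-small"]
EPSILON = 0.1  # exploration rate; a production bandit might use Thompson sampling instead

# Running reward totals per (context, model). The context key is a coarse bucket
# like the question topic; both the bucketing and the weights below are illustrative.
reward_sums: dict[tuple[str, str], float] = defaultdict(float)
pull_counts: dict[tuple[str, str], int] = defaultdict(int)

def composite_reward(resolved: bool, csat_thumbs_up: bool,
                     latency_s: float, token_cost_usd: float) -> float:
    """Blend product outcomes with performance guardrails into a single reward."""
    return (
        1.0 * resolved            # ticket deflection is the primary outcome
        + 0.5 * csat_thumbs_up    # user satisfaction signal
        - 0.1 * latency_s         # penalize slow answers
        - 2.0 * token_cost_usd    # penalize expensive answers
    )

def choose_model(context: str) -> str:
    """Epsilon-greedy selection: mostly exploit the best-known arm for this context."""
    if random.random() < EPSILON:
        return random.choice(MODELS)
    def mean_reward(model: str) -> float:
        n = pull_counts[(context, model)]
        return reward_sums[(context, model)] / n if n else float("inf")  # try unseen arms first
    return max(MODELS, key=mean_reward)

def record_outcome(context: str, model: str, reward: float) -> None:
    reward_sums[(context, model)] += reward
    pull_counts[(context, model)] += 1

# Example session: a billing question answered quickly and resolved.
ctx = "billing"
model = choose_model(ctx)
record_outcome(ctx, model, composite_reward(resolved=True, csat_thumbs_up=True,
                                            latency_s=1.8, token_cost_usd=0.004))
```

The interesting design choice is the reward function: because performance terms (latency, token spend) sit inside the same reward as product terms (deflection, CSAT), a model that delights users but blows the latency budget gets demoted automatically.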

Learning Velocity

I felt like I couldn’t fully wrap this up and get to my prediction until I expanded on learning velocity and how that equates to ‘winning’. I think it’s good for all growth teams to know.

Learning velocity = (Speed of trustworthy insights) ÷ (Friction in your stack).

  1. Tool consolidation. Too many single-purpose tools slow feedback, iteration and insight gathering. A unified platform trims hand-offs and data silos but watch out for vendor lock-in…

  2. Embracing learning over incrementalism. We want to win all the time, and in doing so, we’re afraid to fail. But swinging big generally leads to big learnings! It’s a perception and culture shift, an adopted “rule of life” that a company needs to embrace. And smarter stats make this easier.

  3. Customer-first. The sooner you know if an experience delights or degrades the UX, the sooner you iterate. Having a ton of tools that make up this feedback loop can potentially add friction.

Datadog now checks box #1 outright and is well-positioned for #2 and #3. It’ll be interesting to see how Che navigates all 3 and empowers Datadog customers to embrace a culture of big swings and customer centricity.

What it means for the experimentation landscape

I think the platform that blends performance + behavioral insights + rapid rollouts into a single workflow sets the gold standard here. And yeah, Datadog just took pole position.

🔮 My prediction in this space: Another publicly traded observability tool, namely Dynatrace, snatches up Statsig or LaunchDarkly to chase after the tail of the Dog. They may not be positioned to tackle product analytics, but they certainly can chase the union between performance and inference.

Thanks for reading 💙 Mike.