AI Impact
Learn more about the impact your AI initiatives are having on your other Multitudes metrics.
Understand how AI tools affect outcomes by looking at their impact on leading and lagging indicators of productivity, quality, and developer experience.
Multitudes conducts ongoing research into the impact of AI on engineering teams. Our findings show that the actions you take as a leader have the biggest impact on the success of your AI rollout – not your tooling. With the right initiatives, you can help your team get more benefit from AI, with fewer of the costs (to codebase quality, learning, and more).
But to do that, you need to be able to measure and compare the impact of each of your AI initiatives. This feature helps you do just that, with holistic metrics that look across productivity, code quality, and developer experience.
Note that we found it is important to control for interventions by looking at pre- and post-intervention metrics. We share more about that below.
High and Low AI Usage Cohorts

Multitudes automatically classifies users into two cohorts based on their AI tool usage patterns over the most recent 12 weeks:
High AI Adopters: Users with AI activity on ≥33% of days in the last 12 weeks
Low AI Adopters: Users with AI activity on <33% of days in the last 12 weeks
Why 33%?
Based on our recent AI impact research, we found that a 50% Daily Active Users (DAU) rate was a strong predictor of meaningful AI usage when measuring workdays only (excluding weekends).
Since our feature calculates DAU across all calendar days (including weekends and holidays, when many developers don't work), we adjusted the threshold: 50% of the roughly five workdays in a seven-day week is about 36% of calendar days, which we round to 33%. This reflects realistic usage patterns while maintaining the same signal of engaged adoption.
Calculation Notes
Cohorts are defined globally across your entire organization, not per team. This means that you have a consistent view of who's a high AI adopter across teams. When you apply team filters, you'll see a subset of these global cohorts.
If multiple AI tools are integrated, DAU is calculated across all tools combined.
Low AI Adopters who consistently increase their usage will eventually graduate to the High AI Adopters cohort — a sign that your adoption initiatives are working.
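To make the classification concrete, here's a minimal sketch of how a user could be bucketed from the set of days on which they had any AI activity. The function name, signature, and data shapes are illustrative assumptions, not Multitudes' internal implementation:

```python
from datetime import date, timedelta

def classify_adopter(active_days: set[date], as_of: date,
                     window_weeks: int = 12, threshold: float = 0.33) -> str:
    """Bucket a user as a High or Low AI Adopter based on the share of
    calendar days with any AI-tool activity in the trailing window."""
    window_days = window_weeks * 7
    window_start = as_of - timedelta(days=window_days - 1)
    days_active = sum(1 for d in active_days if window_start <= d <= as_of)
    return "High AI Adopter" if days_active / window_days >= threshold else "Low AI Adopter"
```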
Measuring Impact on Adoption

This time series chart tracks Daily Active Users (DAU) over time, broken down by the "High" and "Low" AI adoption cohorts. It helps you understand adoption momentum and, when interventions are present, how they've impacted AI adoption. This matters because it underpins the other impact metrics: if an intervention didn't actually increase AI adoption, we can't claim that it led to changes in the follow-on outcome metrics.
What You See
The chart shows:
Two trend lines: One for High AI Adopters, one for Low AI Adopters
Current DAU percentage for each cohort
Intervention markers (when annotated) show when your organization introduced new tools, training, or process changes
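As a rough illustration (not the product's internal code), a per-cohort DAU series could be computed like this, assuming a list of (day, user) AI-activity events and a cohort assignment per user:

```python
from collections import defaultdict

def dau_series(activity, cohorts):
    """activity: iterable of (day, user_id) AI events; cohorts: {user_id: "High" or "Low"}.
    Returns {day: {cohort: percent of that cohort with AI activity on that day}}."""
    cohort_sizes = defaultdict(int)
    for cohort in cohorts.values():
        cohort_sizes[cohort] += 1

    active_users = defaultdict(lambda: defaultdict(set))  # day -> cohort -> users
    for day, user in activity:
        if user in cohorts:
            active_users[day][cohorts[user]].add(user)

    return {
        day: {c: 100 * len(users) / cohort_sizes[c] for c, users in by_cohort.items()}
        for day, by_cohort in active_users.items()
    }
```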
AI Impact Measurement

Using box-and-whisker plots, we show how AI interventions impacted key performance metrics.
The page supports two analysis modes: cohort-based comparisons (viewing High vs Low AI Adopters at a single point in time) and pre/post intervention analysis (tracking how each cohort changes after a specific action). Pre/post analysis provides stronger evidence for causality by controlling for pre-existing differences between groups, but we default to the simpler high vs low AI adopter view because the pre/post data isn't always available.
Why We Use Pre/Post Intervention Analysis
A common mistake in measuring AI impact is simply comparing High AI Adopters to Low AI Adopters and attributing all differences to AI usage. This approach can be misleading.
The people who choose to use AI more are likely different from those who use it less — and these differences existed before AI was introduced. For example:
High AI Adopters might be more productivity-focused or excited about new technology
They might work in codebases or languages where AI performs better
They could be newer to a project and seeking help getting up to speed
These pre-existing differences create selection bias. The high and low usage groups likely started with different metrics even before AI tools were introduced, which means comparing them directly confounds AI's actual impact with these other factors. For more about this issue, read our blog post.
This is why Multitudes emphasizes pre/post intervention analysis: by comparing each cohort to their own baseline, we control for pre-existing differences, which can help isolate AI's actual effect.
When you don't have intervention data
If you're viewing the page without configuring an intervention, you'll see direct comparisons between High and Low AI Adopters. This view is still valuable for understanding patterns and generating hypotheses, but remember:
Be cautious about causality: Differences could come from pre-existing factors, not AI itself
Use it for exploration: Identify interesting patterns worth investigating further
Consider setting up an intervention: Our AI impact research showed that what drives AI adoption isn't tool availability but enablement – so we recommend everyone run an AI intervention. And even small experiments (like a training session) create natural pre/post periods that strengthen your conclusions about AI impact.
The chart comparisons help you spot where differences exist, whereas running an intervention can help you understand why they exist.
How metrics are visualized
Different metrics use different visualization approaches based on how the data is measured:
Box-and-Whisker Plots (for individual event-level metrics): We use these when we have enough underlying event data to construct a box-and-whisker plot – specifically for:
PR Size: Each individual PR has a measurable size
Change Lead Time: Each individual PR has a lead time
Bar Charts (for aggregated weekly metrics): We use these when the given metric is an aggregate, so it makes more sense to roll up all the data over the relevant time period. We do this for:
Merge Frequency: This aggregates based on PRs per week/month and controls for different-sized teams by dividing by contributor count (see the sketch after this list).
Feedback Quality Given: This chart aggregates based on the quality of feedback (e.g., count of highly specific reviews or minimal reviews) divided by the total number of reviews.
Change Failure Rate: This chart aggregates by looking at the number of failures divided by the total number of changes.
Out-of-Hours Commits: This chart, like Merge Frequency, aggregates based on a count per week/month and divides by contributor count to control for different team sizes.
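For example, here's a minimal sketch of the Merge Frequency roll-up; the names and data shapes are assumptions for illustration, not the exact pipeline:

```python
from collections import defaultdict

def weekly_merge_frequency(merged_prs, contributors_per_week):
    """merged_prs: iterable of (iso_week, author_id) for merged PRs.
    contributors_per_week: {iso_week: active contributor count}.
    Returns {iso_week: merged PRs per contributor}, controlling for team size."""
    pr_counts = defaultdict(int)
    for week, _author in merged_prs:
        pr_counts[week] += 1
    return {
        week: pr_counts[week] / count
        for week, count in contributors_per_week.items()
        if count > 0
    }
```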
Metrics with Validated AI Impact
Based on our research analyzing engineering teams before and after AI adoption, we observed the following consistent impacts:
Merge Frequency:
Expected uplift for high AI adopters: 27.2% increase
Teams using AI tools more frequently showed significantly higher merge rates compared to low adopters
Out-of-Hours Commits:
Expected uplift for high AI adopters: 19.6% increase
Note: This represents an increase in out-of-hours activity, which may reflect different working patterns with AI tools
Other Metrics (LOC, Lead Time, etc.):
Standard DORA and internal benchmarks apply (no change)
AI-specific uplift benchmarks not yet available due to inconsistent signal across study participants
These benchmarks will be updated as we gather more data
Important Caveats:
These benchmarks are based on our AI impact study across multiple organizations
Individual organization results may vary significantly - our research showed that each org's AI journey is highly contextual
We will continue to refine these benchmarks as we collect more data over time
Understanding Box-and-Whisker Plots
Box-and-whisker plots visualize the distribution of data, helping you understand not just the average, but the full range of typical values.
Reading the chart
The box represents the middle 50% of values (from the 25th to 75th percentile) – this range is called the interquartile range (the 75th percentile value minus the 25th percentile value)
The line inside the box shows the median (the 50th percentile) — the mid-point value for that group. 50% of the datapoints sit above this line and 50% sit below it.
The whiskers extend to the furthest data points that lie within 1.5 times the interquartile range of the quartiles. Values beyond this range are considered outliers.
The key thing to remember is that height matters: Taller boxes indicate more variability in the data. Shorter boxes suggest more consistent behavior.
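For reference, here's a small sketch of how those pieces are typically computed — these are standard box-plot statistics, not Multitudes-specific code, and quartile conventions can vary slightly between tools:

```python
import statistics

def box_plot_stats(values):
    """Quartiles, median, IQR, whisker ends, and outliers for a box-and-whisker plot."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile method may differ by tool
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    in_range = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(in_range), "whisker_high": max(in_range),
        "outliers": sorted(v for v in values if v < lower_fence or v > upper_fence),
    }
```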
Interpreting the percentage

The percentage shown within the insight represents the relative difference between High and Low AI Adopters.
Without intervention (baseline comparison)
We calculate it as the difference between the High and Low AI Adopter values, relative to the Low AI Adopter value.
Example: If High AI Adopters have a median PR size of 150 LOC and Low AI Adopters have 200 LOC, the difference is -25% (High AI Adopters create 25% smaller PRs).
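In code, assuming the comparison uses the two cohorts' medians as in the example above:

```python
def baseline_difference(high_median: float, low_median: float) -> float:
    """Relative difference between High and Low AI Adopters, as a percentage."""
    return 100 * (high_median - low_median) / low_median

baseline_difference(150, 200)  # -25.0: High AI Adopters' median PR is 25% smaller
```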
With intervention (pre/post analysis)
When we have an intervention in place, we calculate it this way:
Step 1: Calculate percentage change for each group
Step 2: Calculate relative difference
This measures the change in the High AI group minus the change in the Low AI group, isolating the effect of the intervention from other temporal trends.
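A minimal sketch of that two-step calculation, with hypothetical numbers:

```python
def pct_change(pre: float, post: float) -> float:
    """Step 1: percentage change within one cohort, pre- vs post-intervention."""
    return 100 * (post - pre) / pre

def intervention_effect(high_pre, high_post, low_pre, low_post) -> float:
    """Step 2: the High cohort's change minus the Low cohort's change,
    isolating the intervention from trends that affect both groups."""
    return pct_change(high_pre, high_post) - pct_change(low_pre, low_post)

# Hypothetical: High adopters' merge frequency rose 30% after the intervention,
# Low adopters' rose 5%, so the relative difference is +25 percentage points.
intervention_effect(high_pre=2.0, high_post=2.6, low_pre=2.0, low_post=2.1)  # 25.0
```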