AI Impact

Learn more about the impact your AI initiatives are having on your other Multitudes metrics.

Understand how AI tools affect outcomes by looking at their impact on leading and lagging indicators of productivity, quality, and developer experience.

Multitudes conducts ongoing research into the impact of AI on engineering teams. Our findings show that the actions you take as a leader have the biggest impact on the success of your AI rollout – not your tooling. With the right initiatives, you can help your team get more benefit from AI, with fewer of the costs (to codebase quality, learning, and more).

But to do that, we need to be able to measure and compare the impact of each of our AI initiatives. This feature helps you do just that, with holistic metrics looking across productivity, code quality measures, and developer experience.

Note that we found it is important to look at pre- and post-intervention metrics to control for pre-existing differences. We share more about that below.

High and Low AI Usage Cohorts

Multitudes automatically classifies users into two cohorts based on their AI tool usage patterns over the most recent 12 weeks:

  • High AI Adopters: Users with AI activity on ≥35% of days in the last 12 weeks

  • Low AI Adopters: Users with AI activity on <35% of days in the last 12 weeks

Why 35%?

Based on our recent AI impact research, we found that 50% Daily Active Users (DAU) was a strong predictor of meaningful AI usage when measuring only workdays (and excluding weekends).

Since our feature calculates DAU across all calendar days (including weekends and holidays when many developers don't work), we adjusted this threshold to 35% to reflect realistic usage patterns while maintaining the same signal of engaged adoption.
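
To make the threshold concrete, here is a minimal Python sketch of how a user could be bucketed, assuming we already have the set of calendar days on which they had any AI activity (combined across all integrated tools). The names here are illustrative, not Multitudes' implementation.

```python
from datetime import date, timedelta

# Minimal sketch of the cohort rule above. `activity_dates` is assumed to be the
# set of calendar days on which a user had any AI activity, combined across all
# integrated AI tools; names are illustrative, not Multitudes' code.
DAU_THRESHOLD = 0.35   # AI activity on at least 35% of calendar days
WINDOW_DAYS = 12 * 7   # the most recent 12 weeks, including weekends/holidays

def classify_cohort(activity_dates: set[date], today: date) -> str:
    window_start = today - timedelta(days=WINDOW_DAYS)
    active_days = sum(1 for d in activity_dates if window_start < d <= today)
    dau_rate = active_days / WINDOW_DAYS
    return "High AI Adopter" if dau_rate >= DAU_THRESHOLD else "Low AI Adopter"

# Example: active on 35 of the last 84 calendar days (~42% DAU) -> High AI Adopter
recent = {date(2024, 6, 1) + timedelta(days=i) for i in range(0, 70, 2)}
print(classify_cohort(recent, today=date(2024, 8, 10)))
```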

Calculation Notes

  • Cohorts are defined globally across your entire organization, not per team. This means that you have a consistent view of who's a high AI adopter across teams. When you apply team filters, you'll see a subset of these global cohorts.

  • If multiple AI tools are integrated, DAU is calculated across all tools combined.

  • Low AI Adopters who consistently increase their usage will eventually graduate to the High AI Adopters cohort — this is a sign of adoption initiatives working.

Measuring Impact on Adoption

This time series chart tracks Daily Active Users (DAU) over time, broken down by the "High" and "Low" AI adoption cohorts. It helps you understand adoption momentum and, when interventions are present, how they've impacted AI adoption. This matters for interpreting the other impact metrics: if an intervention didn't increase AI adoption, we can't claim that it drove the follow-on outcome metrics.

What You See

The chart shows:

  • Two trend lines: One for High AI Adopters, one for Low AI Adopters

  • Current DAU percentage for each cohort

  • Intervention markers (when annotated) show when your organization introduced new tools, training, or process changes
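
As a rough illustration of what sits behind the trend lines, here is a small Python sketch that computes, for each day, the share of each cohort's members with any AI activity. The data shapes and names are assumptions for the example, not Multitudes' schema.

```python
from datetime import date

# Hypothetical sketch of the adoption time series: for each day, the percentage
# of each cohort's members with any AI activity. `events` maps user -> set of
# days with AI activity; `cohorts` maps user -> "High" or "Low".
def daily_active_share(events: dict[str, set[date]],
                       cohorts: dict[str, str],
                       days: list[date]) -> dict[str, list[float]]:
    series: dict[str, list[float]] = {"High": [], "Low": []}
    for day in days:
        for cohort, points in series.items():
            members = [u for u, c in cohorts.items() if c == cohort]
            active = sum(1 for u in members if day in events.get(u, set()))
            points.append(100 * active / len(members) if members else 0.0)
    return series

# Example: three users over two days
events = {"ana": {date(2024, 8, 1)}, "bo": set(), "cy": {date(2024, 8, 2)}}
cohorts = {"ana": "High", "bo": "Low", "cy": "High"}
print(daily_active_share(events, cohorts, [date(2024, 8, 1), date(2024, 8, 2)]))
# {'High': [50.0, 50.0], 'Low': [0.0, 0.0]}
```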

AI Impact Measurement

Using box-and-whisker plots, we show how AI interventions impacted key performance metrics.

The page supports two analysis modes: cohort-based comparisons (viewing High vs Low AI Adopters at a single point in time) and pre/post intervention analysis (tracking how each cohort changes after a specific action). Pre/post analysis provides stronger evidence for causality by controlling for pre-existing differences between groups, but we default to the simpler high vs low AI adopter view because the pre/post data isn't always available.

Why We Use Pre/Post Intervention Analysis

A common mistake in measuring AI impact is simply comparing High AI Adopters to Low AI Adopters and attributing all differences to AI usage. This approach can be misleading.

The people who choose to use AI more are likely different from those who use it less — and these differences existed before AI was introduced. For example:

  • High AI Adopters might be more productivity-focused or excited about new technology

  • They might work in codebases or languages where AI performs better

  • They could be newer to a project and seeking help getting up to speed

These pre-existing differences create selection bias. The high and low usage groups likely started with different metrics even before AI existed, which means comparing them directly confounds AI's actual impact with these other factors. For more about this issue, read our blog post.

Real-world example: In one organization we worked with, High AI Adopters initially appeared to have smaller PR sizes than Low AI Adopters — suggesting AI reduced PR size. But when we examined pre-intervention data, we discovered High AI Adopters started out with much smaller PRs before the AI rollout. Post-intervention, their PR sizes actually increased compared to their pre-AI data. Without controlling for pre-existing differences, we would have drawn the wrong conclusion about the impact of AI.

This is why Multitudes emphasizes pre/post intervention analysis: by comparing each cohort to their own baseline, we control for pre-existing differences, which can help isolate AI's actual effect.

When you don't have intervention data

If you're viewing the page without configuring an intervention, you'll see direct comparisons between High and Low AI Adopters. This view is still valuable for understanding patterns and generating hypotheses, but remember:

  • Be cautious about causality: Differences could come from pre-existing factors, not AI itself

  • Use it for exploration: Identify interesting patterns worth investigating further

  • Consider setting up an intervention: Our AI impact research showed that what drives AI adoption isn't tool availability but enablement – so we recommend everyone run an AI intervention. And even small experiments (like a training session) create natural pre/post periods that strengthen your conclusions about AI impact.

The chart comparisons help you spot where differences exist, whereas running an intervention can help you understand why those differences exist.

How metrics are visualized

Different metrics use different visualization approaches based on how the data is measured:

Box-and-Whisker Plots: We use these when we have enough underlying event data to construct a box-and-whisker plot, which is the case for charts with individual, event-level metrics – specifically:

  • PR Size: Each individual PR has a measurable size

  • Change Lead Time: Each individual PR has a lead time

Bar Charts: We use these when the given metric is an aggregate, so it makes more sense to roll up all the data over the relevant time period (because we need a larger observation window before there's enough data for a meaningful box-and-whisker plot). We do this for charts including the following (see the sketch after this list for examples of these aggregations):

  • Merge Frequency: This aggregates based on PRs per week/month and controls for different-sized teams by dividing by contributor count.

  • Feedback Quality Given: This chart aggregates based on the quality of feedback (e.g., count of highly specific reviews or minimal reviews) divided by the total number of reviews.

  • Change Failure Rate: This chart aggregates by looking at the number of failures divided by the total number of changes.

  • Out-of-Hours Commits: This chart, like Merge Frequency, aggregates based on a count per week/month and divides by contributor count to control for different team sizes.
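
Here is an illustrative Python sketch of the two aggregation styles above: a per-contributor rate (Merge Frequency, Out-of-Hours Commits) and a simple ratio (Change Failure Rate, Feedback Quality Given). The field names are assumptions for the example, not Multitudes' actual schema.

```python
# Illustrative sketch of the two aggregation styles above; the field names
# (merged_prs, contributor_count, failures, total_changes) are assumptions.
def merge_frequency(merged_prs: int, contributor_count: int, weeks: int) -> float:
    """PRs merged per contributor per week (controls for team size)."""
    return merged_prs / contributor_count / weeks

def change_failure_rate(failures: int, total_changes: int) -> float:
    """Share of changes that resulted in a failure."""
    return failures / total_changes if total_changes else 0.0

# Example: a team of 8 merging 96 PRs over 4 weeks, with 3 failures across 96 changes
print(merge_frequency(96, 8, 4))   # 3.0 PRs per contributor per week
print(change_failure_rate(3, 96))  # 0.03125, i.e. ~3.1%
```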

Understanding Box-and-Whisker Plots

Box-and-whisker plots visualize the distribution of data, helping you understand not just the average, but the full range of typical values.

Reading the chart

  • The box represents the middle 50% of values (from the 25th to 75th percentile) – this range is called the interquartile range (the 75th percentile value minus the 25th percentile value)

  • The line inside the box shows the median (the 50th percentile) — the mid-point value for that group. 50% of the datapoints sit above this line and 50% sit below it.

  • The whiskers extend to the smallest and largest data points within 1.5 times the interquartile range from the quartiles. Values beyond this are considered outliers.

The key thing to remember is that height matters: Taller boxes indicate more variability in the data. Shorter boxes suggest more consistent behavior.
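
For readers who want the definitions precisely, this is a minimal Python sketch of the box-plot statistics described above, assuming a flat list of event-level values (for example, PR sizes in LOC). It is illustrative only, not the exact implementation behind the charts.

```python
import statistics

# Minimal sketch of the box-plot statistics described above, for a flat list of
# event-level values (e.g., PR sizes in LOC). Illustrative only.
def box_plot_stats(values: list[float]) -> dict:
    q1, median, q3 = statistics.quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    iqr = q3 - q1                                       # interquartile range
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    in_range = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(in_range),    # smallest value within 1.5x IQR of the box
        "whisker_high": max(in_range),   # largest value within 1.5x IQR of the box
        "outliers": [v for v in values if not lower_fence <= v <= upper_fence],
    }

print(box_plot_stats([40, 55, 60, 80, 95, 120, 400]))  # the 400 LOC PR is flagged as an outlier
```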

Interpreting the chart insight percentage

The percentage in the insight (at the top of each chart; circled in the image above) shows the relative difference between High and Low AI Adopters.

When we have intervention data, we also calculate the difference from before to after the intervention. Combining that pre/post comparison with the low/high AI adopter comparison gives us higher confidence that the differences we see are from AI.

Simple comparison: Low vs high AI adopters

We calculate the percentage this way:

$$\text{Difference \%} = \left(\frac{\text{median(High AI)}}{\text{median(Low AI)}} - 1\right) \times 100$$

Example: If High AI Adopters have a median PR size of 150 LOC and Low AI Adopters have 200 LOC, the difference is (150/200 - 1) x 100 = -25, or -25%.

This means that High AI Adopters create PRs that are 25% smaller than the PRs created by Low AI Adopters.
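
The same worked example as a quick calculation, using the illustrative medians above:

```python
# The worked example above as a quick calculation; the medians are illustrative.
median_high_ai = 150  # median PR size (LOC), High AI Adopters
median_low_ai = 200   # median PR size (LOC), Low AI Adopters

difference_pct = (median_high_ai / median_low_ai - 1) * 100
print(difference_pct)  # -25.0 -> High AI Adopters' PRs are 25% smaller
```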

Note that while this difference is interesting, we cannot conclude that it is because of AI because there were likely pre-existing differences between your low and high AI adopters (more in this blog post: Don't measure AI impact by comparing low & high AI adopters).

With intervention: Pre vs post and Low vs high AI adopters

When we have an intervention, we change the calculation to consider the intervention, per the steps below.

Step 1: Calculate pre-post change for each adoption group

$$\text{Change\%(High AI)} = \left(\frac{\text{median(High Post)}}{\text{median(High Pre)}} - 1\right) \times 100$$

$$\text{Change\%(Low AI)} = \left(\frac{\text{median(Low Post)}}{\text{median(Low Pre)}} - 1\right) \times 100$$

Step 2: Calculate the relative difference between low and high AI adopters

$$\text{Relative Difference} = \text{Change\%(High AI)} - \text{Change\%(Low AI)}$$

This measures the change in the High AI group minus the change in the Low AI group, isolating the effect of the intervention from other temporal trends.
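
Here is the two-step calculation as a small Python sketch, with hypothetical medians (for example, Change Lead Time in hours); the numbers are illustrative, not benchmarks.

```python
# Sketch of the two-step calculation above, with hypothetical pre/post medians.
def pct_change(post_median: float, pre_median: float) -> float:
    return (post_median / pre_median - 1) * 100

# Step 1: pre/post change within each cohort (e.g., median Change Lead Time in hours)
change_high = pct_change(post_median=20, pre_median=25)  # -20.0%
change_low = pct_change(post_median=24, pre_median=25)   # -4.0%

# Step 2: relative difference (High AI change minus Low AI change)
relative_difference = change_high - change_low
print(relative_difference)  # -16.0 -> a 16-point larger improvement for High AI Adopters
```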

A randomized controlled trial is the best way to understand the causal difference of something, like AI or an AI intervention. But the method above gives us a good fallback option, since we usually don't have time or resources for a randomized controlled trial.

The calculation above gives us more confidence about the impact of AI, because we're removing pre-existing differences between low and high AI adopters, and we're also using the change in the low AI adoption group as a sort of control for other environmental factors that could have influenced the metric over this time period.

What good looks like for AI Impact metrics

Based on our research analyzing engineering teams before and after AI adoption, we observed the typical variation across DORA, SPACE, and other metrics when a company runs an AI intervention.

Merge Frequency:

  • Expected uplift for high AI adopters: 27.2% increase

  • Teams using AI tools more frequently showed significantly higher merge rates compared to low adopters.

Out-of-Hours Commits:

  • Expected uplift for high AI adopters: 19.6% increase

  • Note: This represents an increase in out-of-hours activity, which may reflect different working patterns with AI tools.

Other Metrics (LOC, Lead Time, etc.):

  • There was no consistent impact from AI interventions, at least not in our initial research. Instead, we saw wide variation in these metrics depending on the organization's specific practices.

  • We'll continue to monitor the changes from AI interventions and will update these benchmarks over time.

Important Caveats:

  • These benchmarks are based on our AI impact study across multiple organizations.

  • Individual organization results may vary significantly – our research showed that each organization's AI journey is highly contextual.

  • We will continue to refine these benchmarks as we collect more data over time.
