AI Impact

Learn more about the impact your AI initiatives are having on your other Multitudes metrics.

Understand how AI tools affect outcomes by looking at their impact on leading and lagging indicators of productivity, quality, and developer experience.

Multitudes conducts ongoing research into the impact of AI on engineering teams. Our findings show that the actions you take as a leader have the biggest impact on the success of your AI rollout – not your tooling. With the right initiatives, you can help your team get more benefit from AI, with fewer of the costs (to codebase quality, learning, and more).

But to do that, you need to be able to measure and compare the impact of each AI initiative. This feature helps you do just that, with holistic metrics across productivity, code quality, and developer experience.

Note that we found it is important to control for pre-existing differences by comparing metrics before and after each intervention. We share more about that below.

High and Low AI Usage Cohorts

Multitudes automatically classifies users into two cohorts based on their AI tool usage patterns over the most recent 12 weeks:

  • High AI Adopters: Users with AI activity on ≥33% of days in the last 12 weeks

  • Low AI Adopters: Users with AI activity on <33% of days in the last 12 weeks

Why 33%?

Based on our recent AI impact research, we found that being a Daily Active User (DAU) on at least 50% of days was a strong predictor of meaningful AI usage when measuring workdays only (excluding weekends).

Since our feature calculates DAU across all calendar days (including weekends and holidays when many developers don't work), we adjusted this threshold to 33% to reflect realistic usage patterns while maintaining the same signal of engaged adoption: 50% of a 5-day work week is about 2.5 days, which is roughly a third of a full 7-day week.

Calculation Notes

  • Cohorts are defined globally across your entire organization, not per team. This means that you have a consistent view of who's a high AI adopter across teams. When you apply team filters, you'll see a subset of these global cohorts.

  • If multiple AI tools are integrated, DAU is calculated across all tools combined.

  • Low AI Adopters who consistently increase their usage will eventually graduate to the High AI Adopters cohort — this is a sign of adoption initiatives working.
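To make the classification rule concrete, here's a minimal sketch in Python. The function and variable names are illustrative, not Multitudes' actual implementation: it counts the calendar days with any AI activity in the trailing 12 weeks (across all integrated tools) and applies the 33% threshold.

```python
from datetime import date, timedelta

# Assumes you already have, per user, the set of calendar days on which they
# used any integrated AI tool (activity from all tools merged into one set).
ADOPTION_THRESHOLD = 1 / 3  # "High" = AI activity on >= 33% of days

def classify_cohort(active_days: set[date], window_end: date, window_weeks: int = 12) -> str:
    """Classify a user as a High or Low AI Adopter over the trailing window."""
    window_days = window_weeks * 7
    window_start = window_end - timedelta(days=window_days - 1)
    days_with_activity = sum(1 for d in active_days if window_start <= d <= window_end)
    usage_rate = days_with_activity / window_days  # all calendar days, incl. weekends
    return "High AI Adopter" if usage_rate >= ADOPTION_THRESHOLD else "Low AI Adopter"
```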

Measuring Impact on Adoption

This time series chart tracks Daily Active Users (DAU) over time, broken down by the "High" and "Low" AI adoption cohorts. It helps you understand adoption momentum and, when interventions are present, how they've impacted AI adoption. This matters for interpreting the other impact metrics: if an intervention didn't increase AI adoption, we can't claim it drove the follow-on changes in outcome metrics.

What You See

The chart shows:

  • Two trend lines: One for High AI Adopters, one for Low AI Adopters

  • Current DAU percentage for each cohort

  • Intervention markers (when annotated) show when your organization introduced new tools, training, or process changes
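As a rough illustration of what feeds this chart, the sketch below (hypothetical names, building on classify_cohort above) computes the daily percentage of each cohort's members who were active with an AI tool.

```python
from collections import defaultdict
from datetime import date

def dau_by_cohort(
    activity: dict[str, set[date]],   # user -> days with AI activity
    cohorts: dict[str, str],          # user -> "High AI Adopter" / "Low AI Adopter"
    days: list[date],                 # the days to plot
) -> dict[str, list[float]]:
    """For each cohort, the % of its members with AI activity on each day."""
    members: dict[str, list[str]] = defaultdict(list)
    for user, cohort in cohorts.items():
        members[cohort].append(user)
    return {
        cohort: [
            100 * sum(1 for u in users if day in activity.get(u, set())) / len(users)
            for day in days
        ]
        for cohort, users in members.items()
    }
```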

AI Impact Measurement

Using box-and-whisker plots (and bar charts for aggregated metrics), we show how AI interventions impacted key performance metrics.

The page supports two analysis modes: cohort-based comparisons (viewing High vs Low AI Adopters at a single point in time) and pre/post intervention analysis (tracking how each cohort changes after a specific action). Pre/post analysis provides stronger evidence for causality by controlling for pre-existing differences between groups, but we default to the simpler high vs low AI adopter view because the pre/post data isn't always available.

Why We Use Pre/Post Intervention Analysis

A common mistake in measuring AI impact is simply comparing High AI Adopters to Low AI Adopters and attributing all differences to AI usage. This approach can be misleading.

The people who choose to use AI more are likely different from those who use it less — and these differences existed before AI was introduced. For example:

  • High AI Adopters might be more productivity-focused or excited about new technology

  • They might work in codebases or languages where AI performs better

  • They could be newer to a project and seeking help getting up to speed

These pre-existing differences create selection bias. The high and low usage groups likely started with different metrics even before AI was introduced, which means comparing them directly confounds AI's actual impact with these other factors. For more about this issue, read our blog post.

Real-world example: In one organization we worked with, High AI Adopters initially appeared to have smaller PR sizes than Low AI Adopters – suggesting AI reduced PR size. But when we examined pre-intervention data, we discovered High AI Adopters started out with much smaller PRs before the AI rollout. Post-intervention, their PR sizes actually increased compared to their pre-AI data. Without controlling for pre-existing differences, we would have drawn the wrong conclusion about the impact of AI.

This is why Multitudes emphasizes pre/post intervention analysis: by comparing each cohort to their own baseline, we control for pre-existing differences, which can help isolate AI's actual effect.

When you don't have intervention data

If you're viewing the page without configuring an intervention, you'll see direct comparisons between High and Low AI Adopters. This view is still valuable for understanding patterns and generating hypotheses, but remember:

  • Be cautious about causality: Differences could come from pre-existing factors, not AI itself

  • Use it for exploration: Identify interesting patterns worth investigating further

  • Consider setting up an intervention: Our AI impact research showed that what drives AI adoption isn't tool availability but enablement – so we recommend everyone run an AI intervention. And even small experiments (like a training session) create natural pre/post periods that strengthen your conclusions about AI impact.

These chart comparisons help you spot where differences exist, whereas running an intervention can help you understand why they exist.

How metrics are visualized

Different metrics use different visualization approaches based on how the data is measured:

Box-and-Whisker Plots (for individual event-level metrics): We use these when we have enough underlying event data to construct a box-and-whisker plot – specifically for:

  • PR Size: Each individual PR has a measurable size

  • Change Lead Time: Each individual PR has a lead time

Bar Charts (for aggregated weekly metrics): We use these when the metric is already a weekly aggregate, so it makes more sense to roll up all the data over the relevant time period (see the sketch after this list). We do this for:

  • Merge Frequency: This aggregates based on PRs per week/month and controls for different-sized teams by dividing by contributor count.

  • Feedback Quality Given: This chart aggregates based on the quality of feedback (e.g., count of highly specific reviews or minimal reviews) divided by the total number of reviews.

  • Change Failure Rate: This chart aggregates by looking at the number of failures divided by the total number of changes.

  • Out-of-Hours Commits: This chart, like Merge Frequency, aggregates based on a count per week/month and divides by contributor count to control for different team sizes.
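As a rough sketch of these weekly roll-ups (the function and field names are assumptions, not the exact Multitudes schema), the count-based metrics are normalized by contributor count while the ratio-based metrics divide one count by another:

```python
def merge_frequency(merged_prs: int, contributors: int) -> float:
    """Merged PRs per contributor for the week/month."""
    return merged_prs / contributors

def feedback_quality_given(high_quality_reviews: int, total_reviews: int) -> float:
    """Highly specific reviews as a share of all reviews."""
    return high_quality_reviews / total_reviews

def change_failure_rate(failures: int, total_changes: int) -> float:
    """Failures as a share of all changes."""
    return failures / total_changes

def out_of_hours_commits(ooh_commits: int, contributors: int) -> float:
    """Out-of-hours commits per contributor for the week/month."""
    return ooh_commits / contributors
```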

Metrics with Validated AI Impact

Based on our research analyzing engineering teams before and after AI adoption, we observed the following consistent impacts:

Merge Frequency:

  • Expected uplift for high AI adopters: 27.2% increase

  • Teams using AI tools more frequently showed significantly higher merge rates compared to low adopters

Out-of-Hours Commits:

  • Expected uplift for high AI adopters: 19.6% increase

  • Note: This represents an increase in out-of-hours activity, which may reflect different working patterns with AI tools

Other Metrics (LOC, Lead Time, etc.):

  • Standard DORA and internal benchmarks apply (no change)

  • AI-specific uplift benchmarks not yet available due to inconsistent signal across study participants

  • These benchmarks will be updated as we gather more data

Important Caveats:

  • These benchmarks are based on our AI impact study across multiple organizations

  • Individual organization results may vary significantly – our research showed that each org's AI journey is highly contextual

  • We will continue to refine these benchmarks as we collect more data over time

Understanding Box-and-Whisker Plots

Box-and-whisker plots visualize the distribution of data, helping you understand not just the average, but the full range of typical values.

Reading the chart

  • The box represents the middle 50% of values (from the 25th to 75th percentile) – this range is called the interquartile range (= the 75th percentile value minus the 25th percentile value)

  • The line inside the box shows the median (the 50th percentile) — the mid-point value for that group. 50% of the datapoints sit above this line and 50% sit below it.

  • The whiskers extend to the most extreme data points within 1.5 times the interquartile range of the quartiles. Values beyond the whiskers are considered outliers.

The key thing to remember is that height matters: Taller boxes indicate more variability in the data. Shorter boxes suggest more consistent behavior.
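If it helps to see the mechanics, here's a minimal Python sketch of the statistics behind each box (illustrative only, using the 1.5 × IQR whisker rule described above):

```python
import statistics

def box_plot_stats(values: list[float]) -> dict[str, float]:
    """Quartiles, IQR, and whisker bounds; `values` could be one cohort's PR sizes."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # 25th / 50th / 75th percentiles
    iqr = q3 - q1
    # Whiskers stop at the most extreme data points within 1.5 * IQR of the box;
    # anything beyond them is treated as an outlier.
    lower_whisker = min(v for v in values if v >= q1 - 1.5 * iqr)
    upper_whisker = max(v for v in values if v <= q3 + 1.5 * iqr)
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_whisker": lower_whisker, "upper_whisker": upper_whisker,
    }
```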

Interpreting the percentage

The percentage shown within the insight represents the relative difference between High and Low AI Adopters.

Without intervention (baseline comparison)

We calculate it this way:

$$\text{Difference \%} = \left(\frac{\text{median(High AI)}}{\text{median(Low AI)}} - 1\right) \times 100$$

Example: If High AI Adopters have a median PR size of 150 LOC and Low AI Adopters have 200 LOC, the difference is -25% (High AI Adopters create 25% smaller PRs).
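A quick way to sanity-check that formula, using the example numbers above (150 and 200 LOC are purely illustrative):

```python
def difference_pct(high_median: float, low_median: float) -> float:
    return (high_median / low_median - 1) * 100

print(difference_pct(150, 200))  # -25.0 -> High AI Adopters' PRs are 25% smaller
```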

With intervention (pre/post analysis)

When we have an intervention in place, we calculate it this way:

Step 1: Calculate percentage change for each group

$$\text{Change\%(High AI)} = \left(\frac{\text{median(High Post)}}{\text{median(High Pre)}} - 1\right) \times 100$$

$$\text{Change\%(Low AI)} = \left(\frac{\text{median(Low Post)}}{\text{median(Low Pre)}} - 1\right) \times 100$$

Step 2: Calculate relative difference

$$\text{Relative Difference} = \text{Change\%(High AI)} - \text{Change\%(Low AI)}$$

This measures the change in the High AI group minus the change in the Low AI group, which helps isolate the effect of the intervention from other temporal trends.
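Here's a small sketch of that two-step calculation with made-up medians (a difference-in-differences style comparison; the numbers and function names are illustrative):

```python
def change_pct(post_median: float, pre_median: float) -> float:
    return (post_median / pre_median - 1) * 100

def relative_difference(high_pre: float, high_post: float,
                        low_pre: float, low_post: float) -> float:
    return change_pct(high_post, high_pre) - change_pct(low_post, low_pre)

# Example: High AI Adopters' median lead time drops from 40h to 30h (-25%),
# while Low AI Adopters' drops from 40h to 36h (-10%).
print(relative_difference(40, 30, 40, 36))  # -15.0 -> 15 points more improvement for High AI
```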
