Feedback Quality
Learn how we use AI and machine learning to surface insights about the quality of feedback in your team's code reviews.
This shows the overall quality of feedback given in code reviews. We analyze all code review comments (excluding the PR author's own comments) and classify them into quality categories based on how constructive and actionable the feedback is.
Understanding your team's feedback quality helps you create more inclusive review processes and identify opportunities to improve collaboration.
We analyze feedback given in code reviews using Multitudes' AI models, which have been specifically designed to mitigate algorithmic bias and are grounded in research. First, we find each set of feedback: we group all the comments that a reviewer left on someone's PR into one set. We do this because the first review on a PR is typically more detailed than follow-up reviews, so bringing in the full set of comments gives our model more context.
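As a rough illustration of that grouping step, here is a minimal sketch in Python. It assumes review comments have already been fetched into simple dictionaries with hypothetical "author" and "body" fields; the field names, function name, and example data are illustrative assumptions, not Multitudes' actual pipeline.

```python
from collections import defaultdict

def group_feedback_sets(comments, pr_author):
    """Group a PR's review comments into one feedback set per commenter.

    `comments` is assumed to be a list of dicts with "author" and "body"
    keys. Comments written by the PR author are excluded, matching the
    description above (we only classify feedback given *to* the author).
    """
    feedback_sets = defaultdict(list)
    for comment in comments:
        if comment["author"] == pr_author:
            continue  # skip the PR author's own replies
        feedback_sets[comment["author"]].append(comment["body"])
    return dict(feedback_sets)

# Hypothetical comments on a single PR authored by "casey"
comments = [
    {"author": "ana", "body": "Consider extracting this into a helper."},
    {"author": "ana", "body": "Same pattern here, could reuse that helper."},
    {"author": "bo", "body": "LGTM"},
    {"author": "casey", "body": "Good point, will do!"},  # author's reply, excluded
]

print(group_feedback_sets(comments, pr_author="casey"))
# -> {'ana': ['Consider extracting this into a helper.',
#             'Same pattern here, could reuse that helper.'],
#     'bo': ['LGTM']}
```

Grouping by reviewer like this means a short follow-up comment is judged in the context of that reviewer's fuller first pass, rather than on its own.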
We then classify feedback into the following quality categories:
Highly Specific: Detailed, actionable feedback that clearly explains what needs to change and provides clear reasoning. This also includes thought-provoking comments that share knowledge or offer alternative approaches.
Neutral: Moderately detailed feedback that provides some guidance but could be more comprehensive.
Unspecific: Vague comments that don't provide clear direction for improvement.
Minimal: Short reviews like "LGTM" or "Shippit!!" that provide minimal guidance. These reviews are commonly referred to as "rubber-stamp" code reviews. Note: code reviews submitted with no comments at all are not included in this category.
Negative: Feedback that may come across as harsh, dismissive, or potentially harmful.
We look at patterns across your team's code review conversations to surface insights about collaboration dynamics and feedback culture. If you notice any incorrect model classifications, please provide feedback using the 🚩 button in the drill-down table.
When feedback is classified as "negative", we also identify the specific reasons based on established research on what makes code review feedback destructive:
Personal Attack: Feedback that targets the person rather than the code
Vague Criticism: Critical feedback without clear suggestions for improvement
Judgmental: Feedback with a judgmental or condescending tone
Harsh Language: Use of inconsiderate or unnecessarily harsh language
Excessive Nitpicking: Repeated focus on minor issues without addressing bigger picture concerns
Negative Emojis: Overuse of negative emojis that create a hostile tone
Terseness: Overly brief feedback that comes across as dismissive
This granular analysis helps teams understand not just that negative feedback occurred, but what specific patterns to address in their code review culture.
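To make the taxonomy concrete, here is one possible way to represent a classification result in code. The category and reason labels are the ones described above, but the enums, dataclass, field names, and example values are purely illustrative and not Multitudes' internal data model.

```python
from dataclasses import dataclass, field
from enum import Enum

class FeedbackQuality(Enum):
    HIGHLY_SPECIFIC = "Highly Specific"
    NEUTRAL = "Neutral"
    UNSPECIFIC = "Unspecific"
    MINIMAL = "Minimal"
    NEGATIVE = "Negative"

class NegativeReason(Enum):
    PERSONAL_ATTACK = "Personal Attack"
    VAGUE_CRITICISM = "Vague Criticism"
    JUDGMENTAL = "Judgmental"
    HARSH_LANGUAGE = "Harsh Language"
    EXCESSIVE_NITPICKING = "Excessive Nitpicking"
    NEGATIVE_EMOJIS = "Negative Emojis"
    TERSENESS = "Terseness"

@dataclass
class FeedbackClassification:
    reviewer: str
    pr_number: int
    quality: FeedbackQuality
    # Only populated when quality is NEGATIVE
    negative_reasons: list[NegativeReason] = field(default_factory=list)

# Example: a terse, harsh review set flagged with two specific reasons
result = FeedbackClassification(
    reviewer="ana",
    pr_number=42,  # hypothetical PR number
    quality=FeedbackQuality.NEGATIVE,
    negative_reasons=[NegativeReason.TERSENESS, NegativeReason.HARSH_LANGUAGE],
)
```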
Feedback quality patterns vary significantly based on team dynamics, code review practices, and organizational culture. That said, based on our initial analysis across Multitudes customers, teams typically have 15% highly specific feedback, 20% unspecific feedback, 25% minimal feedback and less than 2% negative feedback.
We recommend teams aim for:
20%+ highly specific feedback, because this provides clear, actionable guidance.
Zero negative feedback, because it can be destructive to team inclusion and performance.
<30% minimal feedback, because some quick approvals are normal, but excessive rates suggest insufficient review depth.
Equitable distribution across the team, with all team members receiving a similar quality of feedback and no one getting significantly less specific feedback.
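As a quick way to reason about these targets, the sketch below computes a team's feedback quality distribution from hypothetical counts and checks it against the recommendations above; the counts are made up, and the thresholds are simply the ones listed in this section.

```python
from collections import Counter

# Hypothetical counts of classified feedback sets for one team over a month
counts = Counter({
    "Highly Specific": 18,
    "Neutral": 45,
    "Unspecific": 22,
    "Minimal": 33,
    "Negative": 2,
})

total = sum(counts.values())
share = {label: n / total for label, n in counts.items()}

# Thresholds taken from the recommendations above
checks = {
    "Highly Specific >= 20%": share["Highly Specific"] >= 0.20,
    "Negative == 0": counts["Negative"] == 0,
    "Minimal < 30%": share["Minimal"] < 0.30,
}

for label, pct in share.items():
    print(f"{label}: {pct:.0%}")
for name, ok in checks.items():
    print(f"{name}: {'OK' if ok else 'needs attention'}")
```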
Use these insights in 1:1s, retros, and team discussions to foster a more supportive and effective code review culture.
Your feedback helps us refine our models and reduce algorithmic bias, ensuring better insights for your teams. We review all flagged predictions and use them to make the feedback quality analysis more reliable for everyone using Multitudes.
Note that our language model does better with domain-specific knowledge in English than in other languages, so we’d especially love to hear from Multitudes users who write reviews regularly in other languages.
The quality of feedback in code reviews directly impacts team psychological safety. Research consistently shows significant feedback gaps across different groups in the workplace. According to one large workplace study, women are more than 20% less likely than men to say their manager gave them critical feedback that contributed to their growth. Additionally, another study found that Black and LatinX employees are more likely to receive feedback about their "personality" than the actual quality of their job performance.
Multitudes is actively conducting research to identify which feedback quality distributions correlate with high-performing teams. As we gather more data and insights, these benchmarks will be updated to provide more precise targets for healthy feedback patterns.
We're continuously improving our AI models to ensure accurate classification. If you notice a comment that has been misclassified, such as constructive feedback labeled as Negative or vague comments marked as Highly Specific, please use the 🚩 flag button in the drill-down table to report it.
The design of this feature is informed by extensive research on feedback quality and code review dynamics, alongside collaboration with our academic partners.