Part 1: How to Analyze for Bias in Performance Reviews

Performance reviews are a critical part of the employee lifecycle, and they can be an early indicator of workplace inequities that later show up as differences in pay, promotion rates, or retention. Those downstream differences can spell trouble for your organization in the form of reduced engagement, increased flight risk, a lack of progress against diversity goals, and legal or regulatory action.

Our 2024 Workplace Equity Trends Report dove deep into issues around performance ratings, and two statistics stand out:

1. Organizations that effectively build diverse teams at every level are 69% more likely than ineffective organizations to analyze performance ratings for bias against particular groups.

2. Over half of organizations still rely only on an employee’s manager to evaluate an employee’s performance. This means there is an absence of alternative perspectives, no need to talk through the reasoning behind a specific rating, and a single “point of failure” when it comes to identifying and interrupting bias when it occurs.

Data analytics can help identify if, where, and to what degree inequities exist within your organization. It can also help you target the right intervention at the right time. For example, pay equity issues may stem from an underlying problem with how performance ratings are distributed. If you only track and remediate the downstream impact on pay, the issue may resurface year after year, and you may miss other ways this difference in performance ratings affects your employees.

How to set up your performance review analysis

Before you calculate any tables showing the distribution of performance ratings, it is worth asking some preliminary questions. This will ensure that you have the information you need to tell the story of how things are playing out at your organization in a way that doesn’t miss anything important, resonates with your leaders, and anticipates key follow-up questions.

1. How (and to what degree) should you subset your organization for analysis?

When we conduct this analysis for Syndio customers, one of the first questions we ask is how they want their organizational structure to be reflected back to them. Typically, it makes sense to look at both “the forest and the trees,” taking a high-level view of the organization and diving into specific leaders, functions, and job levels. This usually takes the form of a nested hierarchy with multiple levels. Beyond the overall results, we see customers break things out by leadership hierarchies, functions and subfunctions, and geography. The right answer will depend on how your organization conceptualizes itself. It may vary from one part of the business to the other.

You may be tempted to try multiple approaches to subsetting your organization. This makes sense, but a word of caution: an “analyze everything and sort it out later” approach can create thousands of results to review, particularly since you will also have to decide which outcomes matter most and which employee identity groups to analyze. Thinking through these decisions beforehand can reduce the number of analyses you need to conduct (though software can help conduct these analyses at scale) and, more importantly, help you think ahead to the most important questions about how a particular employment process plays out at your organization.

2. What outcomes are of particular interest to your stakeholders?

When rated or bucketed, performance evaluations typically take three to five potential values. Each of these ratings might be explicitly linked to specific outcomes (e.g., the highest performance ratings receive enhanced bonus payouts while the lowest ratings receive none), or the relationship can be more relaxed.

Analytically, the broadest question is whether the distribution of performance ratings is different between two groups. However, you may want to ask more specific questions about particular outcomes: Was one group more likely to receive the highest performance rating? Or perhaps to receive a better-than-average rating (e.g., a 4 or 5 on a 5-point scale)? The answers to these narrower questions are often the headline takeaways I frame up when presenting results to clients and business leaders.

Leaders at one of our customers brought this specific question to us: “Are gaps more of an issue at the bottom of our performance curves, at the top, or both?” Your organization may have different questions, but thinking through and documenting the core questions before you start analyzing your data will keep you from losing your way.

3. What groups might be disadvantaged in your organization?

Most organizations analyze their workforce through the lens of gender and aggregate racial groups, comparing workers who identified as white only to those who identified as anything other than (or in addition to) white. This is a good place to start, but sometimes that aggregation can mask differences in outcomes. In the past six months, we have seen cases where employees of color as a group were not disadvantaged but Black employees were, and cases where outcomes were better for women than for men, but only because outcomes were much worse for men of color than for white women.

This may take a couple of rounds and sample size will constrain your ability to go deep, but it’s worth starting with gender, race, the intersection of the two, and a few specific racial groups. Engagement surveys or employee resource groups are a good place to begin to understand which communities feel like they are having a more difficult time finding success and recognition at your organization.

Key considerations when analyzing performance reviews for bias

Now that you know how you want to structure your analysis, here are a couple of lessons we have learned through our experience.

First, performance ratings often differ significantly in different functions. In a review of performance ratings from dozens of our customers, we found that employees in front-line and retail positions had much more differentiation, both high and low, than employees in other lines of the business. This is important because women and workers of color are often disproportionately represented in these positions, so you may want to keep those groups separate — or at least analyze them individually.

Second, ratings also vary by job level. We typically see that company leadership (e.g., levels at director and above) are more likely to receive positive ratings and less likely to see negative ratings, particularly compared to early career employees. Once again, historical opportunity gaps mean this difference may correlate with community-based differences in performance ratings.

It is important to distinguish between these issues: Are workers of color more likely to receive negative ratings because they’re concentrated in functions and lower-level positions where low ratings are more common? Or are they more likely to receive low ratings than their peers in those levels and functions? You can do more harm than good for your workplace equity program by misidentifying the actual issue.
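To make the distinction concrete, here is a minimal Python sketch using made-up headcounts (the functions, group sizes, and counts are hypothetical, not customer data). It compares each group's overall low-rating rate with the rate inside each function.

```python
# HYPOTHETICAL headcounts for illustration only:
# (function, group, employees, low_ratings)
records = [
    ("retail", "white", 200, 20),
    ("retail", "BIPOC", 300, 30),
    ("corporate", "white", 800, 16),
    ("corporate", "BIPOC", 200, 4),
]

def low_rating_rate(rows):
    """Share of employees in the given rows who received a low rating."""
    employees = sum(r[2] for r in rows)
    low = sum(r[3] for r in rows)
    return low / employees

def rate_by_group(rows):
    """Low-rating rate for each group across the given rows."""
    groups = sorted({r[1] for r in rows})
    return {g: low_rating_rate([r for r in rows if r[1] == g]) for g in groups}

# Overall rates ignore where employees sit in the organization.
overall = rate_by_group(records)

# Within-function rates compare employees to peers in the same function.
by_function = {
    f: rate_by_group([r for r in records if r[0] == f])
    for f in sorted({r[0] for r in records})
}

print("Overall:", overall)
for function, rates in by_function.items():
    print(function, rates)
```

In this made-up data, BIPOC employees are almost twice as likely to receive a low rating overall (6.8% vs. 3.6%), yet the rates are identical within each function. Here the gap comes entirely from where employees sit, which calls for a different intervention than a gap that persists among peers in the same function.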

How to conduct the performance review analysis

At this point, you should have a plan for which outcomes you want to track for which communities at which levels of your organization. With those questions answered, the analysis is relatively straightforward.

We prefer the following analyses:

Chi-square analysis of the overall difference in performance ratings between communities. This test can tell you if there are statistically significant differences overall in performance ratings between two groups, but does not have a single statistic demonstrating directionality (which group is favored or disfavored) or effect size (how large the difference is). To identify and explain differences, you should reference a contingency table showing the breakdown of performance ratings by group.

For example, here are performance ratings broken out between white and BIPOC employees:

Group	Poor Performance	Adequate Performance	Strong Performance	Exceptional Performance
White	3%	73%	20%	4%
BIPOC	8%	82%	9%	1%

In this case, the differences in the distribution are statistically significant, driven by higher likelihoods of BIPOC employees receiving lower ratings and lower likelihoods of receiving higher ratings.
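As a rough illustration, the sketch below runs this test in plain Python. The table above reports only percentages, so the headcounts used here (1,000 white and 500 BIPOC employees) are an assumption made purely to turn those percentages into counts; the result for your organization will depend on your actual headcounts. In practice a statistics library would handle this for you (for example, SciPy's chi2_contingency function).

```python
import math

# ASSUMPTION: headcounts of 1,000 white and 500 BIPOC employees, chosen only
# so the percentages in the table above convert to whole-number counts.
ratings = ["Poor", "Adequate", "Strong", "Exceptional"]
observed = {
    "White": [30, 730, 200, 40],  # 3%, 73%, 20%, 4% of 1,000
    "BIPOC": [40, 410, 45, 5],    # 8%, 82%, 9%, 1% of 500
}

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, count in enumerate(row):
            # Count expected in this cell if ratings were independent of group.
            expected = row_totals[r] * col_totals[c] / grand_total
            stat += (count - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df

def chi_square_p_value(stat, df):
    """Upper-tail chi-square probability for whole-number degrees of freedom."""
    if stat <= 0:
        return 1.0
    if df % 2 == 1:
        # Odd df: normal-tail term plus a finite series.
        total = math.erfc(math.sqrt(stat / 2))
        term = math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2)
        for r in range(1, (df - 1) // 2 + 1):
            total += term
            term *= stat / (2 * r + 1)
    else:
        # Even df: finite exponential series.
        total = 0.0
        term = math.exp(-stat / 2)
        for r in range(df // 2):
            total += term
            term *= stat / (2 * (r + 1))
    return total

stat, df = chi_square_statistic(observed)
p_value = chi_square_p_value(stat, df)
print(f"chi-square = {stat:.1f}, df = {df}, p = {p_value:.2g}")
```

With the assumed headcounts, the statistic is about 56.1 on 3 degrees of freedom and the p-value is far below 0.05. The test only says the two distributions differ; you still need the table itself to see which ratings drive the difference.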

2×2 chi-square or Fisher exact tests (depending on headcounts) for key outcomes (e.g., high rating vs. not high rating) for each organizational unit, comparison group, and job level (including overall). I prefer to translate these differences in ratings into relative likelihoods (showing how much more or less likely one group is to receive a particular outcome), as that efficiently communicates the results in a way leaders can understand. As mentioned above, these tests often produce the headline takeaways because they concisely answer specific questions.

In our example, we would state that BIPOC employees were 58% less likely to receive high performance ratings (e.g., “performs exceptionally well” or “is a top performer”), since 24% of white employees did and only 10% of BIPOC employees did. We could also state that white employees were 63% less likely (or, alternatively, BIPOC employees were 2.7x as likely) to receive low performance ratings.
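The arithmetic behind those statements is simple enough to sketch in a few lines of Python. The second half applies a two-sided Fisher exact test to a small unit; that unit's headcounts (25 white and 20 BIPOC employees) are hypothetical and chosen only to mirror the 24% and 10% rates above. The helper names are illustrative.

```python
from math import comb

def relative_likelihood(rate, reference_rate):
    """How likely an outcome is for one group relative to a reference group."""
    return rate / reference_rate

# Rates from the table above, in percent (high = Strong + Exceptional, low = Poor).
high = {"White": 20 + 4, "BIPOC": 9 + 1}
low = {"White": 3, "BIPOC": 8}

high_ratio = relative_likelihood(high["BIPOC"], high["White"])
low_ratio = relative_likelihood(low["White"], low["BIPOC"])
print(f"BIPOC employees: {(1 - high_ratio) * 100:.1f}% less likely to get a high rating")
print(f"White employees: {(1 - low_ratio) * 100:.1f}% less likely to get a low rating")
print(f"BIPOC employees: {1 / low_ratio:.1f}x as likely to get a low rating")

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins that is
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def probability(k):
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    observed = probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        probability(k)
        for k in range(lowest, highest + 1)
        if probability(k) <= observed * (1 + 1e-9)
    )

# HYPOTHETICAL small unit: 6 of 25 white and 2 of 20 BIPOC employees rated high.
p_small_unit = fisher_exact_two_sided(6, 19, 2, 18)
print(f"Fisher exact p-value for the small unit: {p_small_unit:.2f}")
```

The first three lines of output reproduce the figures above before rounding (58.3%, 62.5%, and 2.7x). In the hypothetical small unit, the same 24% versus 10% gap gives a p-value of roughly 0.27, so it would not be flagged as statistically significant on its own. That is why headcounts matter when you drill into smaller parts of the organization.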

Logistic regressions for controlled differences. As mentioned above, there may be legitimate factors that can explain differences in performance ratings. BIPOC employees may be more likely to receive low performance ratings, but no more likely than peers in their same job level or function. Logistic regression is the most common tool for analyzing differences in a binary outcome net of other explanatory factors. Some other statistical analyses, like ordered logit regression, can also work for data like this, but they come with stronger assumptions. Given the tradeoffs, we prefer sticking with the simpler tool and collapsing to binary outcomes. Logistic regression is still complicated, though, and other controls should be applied judiciously and critically.

In our example, positive ratings fell from about 29% of employees at management job levels to about 16% at entry-level positions, while BIPOC representation increased from 20% to 45%. Even so, at each level BIPOC employees were less likely to receive positive ratings than their white peers, so the controlled difference showed BIPOC employees being 40% less likely than their white peers to receive positive ratings. (Note that the output from logistic regression is typically an adjusted odds ratio, which sounds very similar to a relative likelihood but is not the same thing.)
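To show what a controlled difference looks like in practice, here is a self-contained Python sketch that fits a logistic regression by Newton's method on grouped counts. The counts are hypothetical: they were chosen only to match the rounded percentages above (29% and 16% positive ratings, 20% and 45% BIPOC representation) and are not the data behind the example. In real work you would normally use a statistics package rather than hand-rolled fitting code.

```python
import math

# HYPOTHETICAL grouped counts: (is_bipoc, is_entry_level, employees, positive_ratings)
cells = [
    (0, 0, 400, 126),  # white, management
    (1, 0, 100, 19),   # BIPOC, management
    (0, 1, 550, 107),  # white, entry level
    (1, 1, 450, 53),   # BIPOC, entry level
]

def solve(matrix, vector):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def predict(beta, is_bipoc, is_entry_level):
    """Predicted probability of a positive rating under the fitted model."""
    score = beta[0] + beta[1] * is_bipoc + beta[2] * is_entry_level
    return 1 / (1 + math.exp(-score))

def fit_logistic(cells, iterations=25):
    """Fit positive ~ intercept + is_bipoc + is_entry_level by Newton's method."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        gradient = [0.0] * 3
        information = [[0.0] * 3 for _ in range(3)]
        for is_bipoc, is_entry_level, n, positives in cells:
            x = (1.0, is_bipoc, is_entry_level)
            p = predict(beta, is_bipoc, is_entry_level)
            for i in range(3):
                gradient[i] += x[i] * (positives - n * p)
                for j in range(3):
                    information[i][j] += x[i] * x[j] * n * p * (1 - p)
        step = solve(information, gradient)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def crude_rate(is_bipoc):
    """Positive-rating rate for a group, ignoring job level."""
    rows = [c for c in cells if c[0] == is_bipoc]
    return sum(c[3] for c in rows) / sum(c[2] for c in rows)

beta = fit_logistic(cells)
adjusted_odds_ratio = math.exp(beta[1])
crude_ratio = crude_rate(1) / crude_rate(0)
level_ratios = {
    label: predict(beta, 1, level) / predict(beta, 0, level)
    for label, level in [("management", 0), ("entry level", 1)]
}

print(f"Unadjusted relative likelihood: {crude_ratio:.2f}")
print(f"Adjusted odds ratio (BIPOC vs. white): {adjusted_odds_ratio:.2f}")
for label, ratio in level_ratios.items():
    print(f"Relative likelihood at {label}: {ratio:.2f}")
```

With these made-up counts, the unadjusted comparison shows BIPOC employees about 47% less likely to receive a positive rating. After controlling for job level the gap narrows to roughly 40% but does not disappear. The adjusted odds ratio (roughly 0.54) sits further from 1 than the relative likelihoods at each level (roughly 0.6), which is the distinction flagged in the note above.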

Each of these tests includes p-values, which are the conventional measure of statistical significance. Basically, a smaller p-value means that a difference at least as large as the one observed would be less likely to arise from random chance alone. These p-values can help distinguish which results are most meaningful, and are driven by the size of the difference in rates and the sample sizes.

You can conduct all of these tests in statistical software (R and Python are free, open-source solutions). Excel, Google Sheets, and online calculators can run chi-square tests and Fisher’s exact tests, though sometimes quite painfully, and the need to iterate through many cuts of the data is a good argument for statistical software. In our equitable movement suite OppEQ®, we run these specific analyses in the context of supporting workplace equity.

How to digest and communicate your results

Once you have finished the analysis, it is time to “switch gears” into understanding what it all means. My general approach for wrapping my head around a set of results is to start high and then go deeper and note when results change as we go down one or two layers. Generally, I prefer focusing my key takeaways on the highest level results. This means if an issue exists organization-wide, I will focus on that result, only going into specific subsets of the organization if issues are particularly acute or different from the overall trend, or if one key part of the business is driving the overall result. Similarly, if all communities of color are impacted, I will focus on that result rather than on specific or intersectional groups, only going narrower when something appears that either highlights or complements that broader story.

In some cases, however, the narrower stories are the key results — maybe Black employees in finance are more likely to receive negative ratings while Asian women in HR are less likely to receive positive ratings, and results are overall much worse for non-binary and transgender employees in the retail business. If this is the case, it’s helpful to make a heatmap of the organization to help leaders understand the lay of the land. Highlight two or three key, representative issues, and then have your final takeaway be that there are a variety of inconsistent issues that likely need targeted interventions.

Overall, your mission in communicating the results is to make it clear to leaders that you’ve done your homework, and allow them to move past the analysis into its implications — what action do they want to take as a result? Forcing yourself to be concise and clear with a focused set of results and supporting visualizations can help create a more effective conversation.

The Syndio solution

This roadmap will help you uncover potential inequities or bias in your performance review process, identifying if, where, and to what degree this process is failing to generate consistent results across communities. It takes a lot of work to structure, run, digest, and present these analyses — and the process is iterative, meaning your first run through will almost certainly not be your last.

At Syndio, we are committed to helping our partners through that process by creating software that runs best-practice statistics at scale and presents the results in a clear, digestible format. We also have a team of in-house experts and consultants that can help you think through the key considerations of an analysis, and how it may tie to other outcomes like compensation, promotion, and a lack of representation in your org.

Reach out to our team to learn more about how we can help with workplace equity analyses like these. You can read Part 2 of this series to learn what to do if your performance review analysis uncovers issues, and get the full 2024 Workplace Equity Trends Report for more insights.


The information provided herein does not, and is not intended to, constitute legal advice. All information, content, and materials are provided for general informational purposes only. The links to third-party or government websites are offered for the convenience of the reader; Syndio is not responsible for the contents on linked pages.
