ESG Measurement: Why Rating Agencies Can't Agree on Who's Sustainable
June 2022
The $500 Billion Question
In 2021 alone, over $500 billion flowed into ESG-related funds. Total assets in ESG investment products now exceed $1 trillion and continue to grow. Investors want to put their money into sustainable companies. Asset managers want to identify environmental and social risks before they materialize. Everyone agrees that ESG matters.
There is just one problem: nobody agrees on what “good ESG” actually looks like.
This post examines the current state of ESG measurement, how the major rating agencies construct their scores, and why their ratings diverge so dramatically. The divergence is not a minor technical issue. It fundamentally undermines the usefulness of ESG ratings for the investors who rely on them.
What is ESG?
ESG stands for Environmental, Social, and Governance. The term was coined in the 2004 “Who Cares Wins” report, produced by 20 financial institutions, and has since become the standard framework for sustainable investing.
Environmental measures how a company interacts with natural ecosystems. This includes greenhouse gas emissions, wastewater treatment, product eco-friendliness, and resource consumption.
Social measures a company’s impact on people: how it treats workers, customers, and communities. This covers labor practices, diversity, health and safety, and community engagement.
Governance refers to how decisions are made within a company. This includes board composition, executive compensation, shareholder rights, and anti-corruption policies.
The idea behind ESG ratings is straightforward. Investors want an independent, reliable assessment of how sustainable a company is, without relying solely on the company’s own disclosures. ESG scores should help investors identify risks and opportunities that traditional financial analysis might miss.
In theory, ESG investing should channel capital toward responsible companies, funding projects that reduce waste, improve working conditions, and promote ethical business practices.
In practice, the picture is more complicated.
The ESG Rating Landscape
ESG data providers function similarly to credit rating agencies, but instead of assessing creditworthiness, they evaluate sustainability performance. The two largest providers are MSCI ESG and Sustainalytics, covering 7,500+ and 11,000+ companies respectively.
Other major vendors include:
| Provider | Coverage | Notes |
|---|---|---|
| MSCI ESG | 7,500+ companies | Acquired KLD (via RiskMetrics) in 2010 |
| Sustainalytics | 11,000+ companies | Owned by Morningstar |
| Refinitiv ESG | 7,000+ companies | Formerly Thomson Reuters ASSET4 |
| S&P Global ESG | Broad coverage | Integrated with S&P indices |
| Moody’s ESG | Growing coverage | Leverages credit rating expertise |
| Bloomberg ESG | Integrated with terminal | Data-focused approach |
Each of these providers has developed its own methodology for converting raw ESG data into ratings. Understanding these methodologies is essential for understanding why the ratings diverge.
How MSCI Builds ESG Ratings
MSCI aims to measure a company’s resilience to long-term, financially relevant ESG risks. Their ratings run from AAA (best) to CCC (worst), relative to industry peers.
The methodology asks four key questions about each company:
- What are the most significant ESG risks and opportunities facing the company and its industry?
- How exposed is the company to those risks or opportunities?
- How well is the company managing them?
- How does the company compare to its global industry peers?
MSCI identifies key issues for each industry and assigns them to companies based on their business activities. Each Environmental and Social key issue comprises 5% to 30% of the total rating, weighted by the expected contribution to externalities and the time horizon for risks to materialize.
The Governance pillar has a floor of 33% weight, reflecting MSCI’s view that governance issues are universally material across all industries.
For each key issue, MSCI calculates two scores:
Exposure Score (0-10): Based on the breakdown of the company’s business, including core products, location of operations, and supply chain characteristics.
Management Score (0-10): Based on the company’s policies, programs, and track record for managing the risk.
These combine into a Key Issue Score. The key insight is that a company with high exposure needs strong management to score well, but a company with limited exposure can achieve a good score with more modest management efforts.
The final rating is industry-relative. MSCI calculates a weighted average of Key Issue Scores, normalizes based on peer performance, and maps to a letter rating. The maximum score falls between the 95th and 100th percentile of peers, while the minimum falls between the 0th and 5th percentile.
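This aggregation step can be sketched in a few lines. The category names, weights, and the seven equal percentile bands below are illustrative stand-ins, not MSCI's proprietary key issues or cutoffs:

```python
def weighted_score(issue_scores, weights):
    """Weighted average of Key Issue Scores (each 0-10).
    Category names and weights are illustrative."""
    total = sum(weights.values())
    return sum(issue_scores[c] * w for c, w in weights.items()) / total

def letter_rating(score, peer_scores):
    """Map a score to a letter grade by percentile rank among industry
    peers, using seven equal bands as a simplified stand-in for MSCI's
    peer-normalization step."""
    rank = sum(s <= score for s in peer_scores) / len(peer_scores)
    for cutoff, letter in [(1/7, "CCC"), (2/7, "B"), (3/7, "BB"),
                           (4/7, "BBB"), (5/7, "A"), (6/7, "AA")]:
        if rank <= cutoff:
            return letter
    return "AAA"  # top band: above the 6/7 percentile of peers
```

Note that the same weighted score can map to different letters in different industries, since the peer set changes: this is what "industry-relative" means in practice.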
How Sustainalytics Builds ESG Ratings
Sustainalytics takes a different approach. Their ESG Risk Ratings measure the degree to which a company’s economic value is at risk due to unmanaged ESG factors.
The rating consists of a quantitative score (lower is better, starting at zero with an open-ended scale) and a risk category: negligible, low, medium, high, or severe. Critically, these categories are absolute, meaning they can be compared directly across companies in different industries.
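The score-to-category mapping uses fixed ten-point bands (per Sustainalytics' published scale; the band edges below follow that public documentation):

```python
def risk_category(score: float) -> str:
    """Map an ESG Risk Rating score to its absolute risk category.
    Lower scores mean lower unmanaged risk; the scale is open-ended."""
    for edge, label in [(10, "negligible"), (20, "low"),
                        (30, "medium"), (40, "high")]:
        if score < edge:
            return label
    return "severe"  # 40 and above
```

Because the bands are absolute, a "medium" risk oil producer and a "medium" risk software firm carry comparable unmanaged ESG risk, which is exactly what an industry-relative scale like MSCI's does not allow.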
Sustainalytics builds ratings from three blocks:
Corporate Governance: Reflects the conviction that poor governance poses material risks for all companies. On average, this contributes about 20% to the overall risk score.
Material ESG Issues (MEIs): Industry-specific topics that require management attention, like emissions, labor relations, or product safety. This forms the core of the methodology.
Idiosyncratic Issues: Unpredictable, company-specific events that are not tied to the business model but can become material if they pass certain thresholds.
Like MSCI, Sustainalytics assesses both exposure and management. But they use a “beta” concept to reflect how a company’s exposure deviates from the average for its sub-industry. A company with higher-than-average exposure to a risk gets a beta above 1.0; lower-than-average exposure gets a beta below 1.0.
The final score represents unmanaged risk, which Sustainalytics defines as the sum of:
Unmanageable Risk: ESG risks that cannot be addressed by company initiatives, regardless of effort.
Management Gap: Risks that could be managed but are not sufficiently addressed based on Sustainalytics’ assessment.
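The arithmetic structure described above can be sketched as follows. The exact formulas are proprietary; this sketch only mirrors the stated decomposition (beta-adjusted exposure split into an unmanageable part plus a management gap), with all parameter values illustrative:

```python
def unmanaged_risk(subindustry_exposure, beta,
                   manageable_fraction, management_score):
    """Illustrative unmanaged-risk calculation, lower is better.
    management_score is the fraction (0-1) of manageable risk that the
    company's programs actually address."""
    exposure = subindustry_exposure * beta        # company-specific exposure
    manageable = exposure * manageable_fraction   # addressable via initiatives
    unmanageable = exposure - manageable          # persists regardless of effort
    management_gap = manageable * (1 - management_score)
    return unmanageable + management_gap
```

For example, a company with above-average exposure (beta 1.2) that manages only half of its manageable risk ends up with a substantially higher score than a peer with beta 0.8 and strong programs, even if their raw sub-industry exposure is identical.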
Comparing the Methodologies
At first glance, MSCI and Sustainalytics seem similar. Both adjust scores based on industry context. Both evaluate exposure and management. Both attempt to identify material ESG issues.
But the differences are significant:
| Aspect | MSCI | Sustainalytics |
|---|---|---|
| Rating scale | AAA to CCC (letter grades) | 0-50+ (quantitative score) |
| Peer comparison | Industry-relative | Absolute across industries |
| Governance weight | Minimum 33% | Average 20% |
| Exposure adjustment | Past controversies influence weights | Beta concept vs. sub-industry average |
| Final interpretation | Higher is better | Lower is better |
These methodological differences translate into rating differences. The same company can receive a top-tier rating from one agency and a mediocre rating from another. This is not a hypothetical concern. It happens routinely.
The Divergence Problem
How bad is the disagreement? Berg, Koelbel, and Rigobon (2022) calculated a Krippendorff’s alpha of 0.55 for ESG ratings across six major providers on a common sample of 924 companies. Krippendorff’s alpha is a statistical measure of agreement; values below 0.667 indicate substantial disagreement.
For context, credit ratings from different agencies typically correlate above 0.9. ESG ratings correlate at closer to 0.5: knowing one agency’s assessment of a company tells you surprisingly little about what another will say.
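The ~0.5 figure is an average of pairwise correlations across raters. Computing it from a ratings matrix is straightforward; the toy data below is invented purely to show the mechanics:

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def average_pairwise_correlation(ratings):
    """ratings: dict of rater name -> list of scores for the same firms
    in the same order. Averages the correlation over all rater pairs."""
    return mean(pearson(ratings[a], ratings[b])
                for a, b in combinations(ratings, 2))
```

With perfectly agreeing raters this returns 1.0; with ESG data, Berg et al. find values near 0.5, which is why provider choice matters so much.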
This divergence is not just an academic curiosity. It has real consequences for investors. If you screen your portfolio using MSCI ratings, you might include companies that Sustainalytics rates as high-risk. If you rely on Sustainalytics, you might exclude companies that MSCI considers leaders. The choice of rating provider can substantially change your portfolio composition.
Decomposing the Divergence
Berg et al. (2022) decomposed the rating divergence into three components:
Scope: What attributes does the rater consider? Different agencies might include or exclude categories like “climate risk management,” “product safety,” or “supply chain labor practices.”
Measurement: How does the rater assess each attribute? Two agencies might both evaluate “carbon emissions” but use different data sources, time periods, or normalization methods.
Weight: How does the rater aggregate attributes into a final score? Two agencies might both evaluate governance, but one might weight it at 33% while another weights it at 20%.
To perform this decomposition, Berg et al. created a taxonomy of 64 distinct categories by mapping all indicators from six rating agencies to a common framework. They then estimated the weights each agency implicitly applies when aggregating category scores into final ratings.
The results are striking:
| Component | Average Contribution to Divergence |
|---|---|
| Measurement | 56% |
| Scope | 38% |
| Weight | 6% |
Measurement divergence is the primary driver. More than half of the disagreement comes from agencies measuring the same things differently, not from measuring different things or weighting them differently.
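The three-way split can be illustrated with a toy, exact-identity version of the decomposition (Berg et al. estimate it with regressions on real data; the dictionaries below use invented categories and numbers). Each rater's score is the weight-times-score sum over its own scope:

```python
def decompose_gap(w1, s1, w2, s2):
    """Decompose rating1 - rating2 into scope, measurement, and weight
    contributions. w*/s* are dicts of category -> weight / score.
    Uses the identity w1*s1 - w2*s2 = mean(w)*diff(s) + mean(s)*diff(w)
    on shared categories, so the three parts sum exactly to the gap."""
    shared = w1.keys() & w2.keys()
    scope = (sum(w1[c] * s1[c] for c in w1.keys() - shared)
             - sum(w2[c] * s2[c] for c in w2.keys() - shared))
    measurement = sum((w1[c] + w2[c]) / 2 * (s1[c] - s2[c]) for c in shared)
    weight = sum((s1[c] + s2[c]) / 2 * (w1[c] - w2[c]) for c in shared)
    return scope, measurement, weight
```

In the Berg et al. data, the measurement term dominates: even on categories both raters cover with similar weights, the scores themselves differ.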
Why Measurement Diverges
The measurement problem runs deeper than you might expect. Even for seemingly objective facts, agencies disagree.
Berg et al. found that the correlation between agencies on UN Global Compact membership, a binary fact that should be trivially verifiable, was only 0.92 rather than the expected 1.0. Somehow, agencies are disagreeing on whether companies are members of an organization that publishes its membership list.
For more subjective categories, correlations are much lower. Some category pairs even show negative correlations, meaning one agency’s positive assessment predicts another agency’s negative assessment.
Part of this stems from data sources. Agencies use different combinations of company disclosures, third-party data, news reports, and proprietary research. They apply different quality controls and verification procedures. They make different assumptions when data is missing.
But there is also a more troubling explanation: the rater effect.
The Rater Effect
The rater effect, also known as the halo effect, describes a bias where performance in one category influences assessments in other categories. If an analyst perceives a company as “good,” they might unconsciously rate it favorably across multiple dimensions. The opposite happens for companies perceived as “bad.”
ESG rating requires substantial judgment. Analysts must interpret ambiguous disclosures, assess the adequacy of management programs, and make predictions about future risks. This creates opportunities for cognitive biases to influence ratings.
Berg et al. tested for the rater effect using two approaches. First, they ran fixed-effect regressions comparing variation across categories, firms, and raters. Second, they used LASSO regressions to evaluate within-rater patterns.
They found clear evidence of a rater effect. It explains about 15-16% of the variation in category scores. For some agencies, a small number of categories explain most of the final rating, suggesting that general impressions of a company are driving assessments across the board.
The organizational structure of rating agencies might contribute to this effect. If analysts specialize in covering specific companies rather than specific ESG topics, they develop overall impressions of “their” companies that color all their assessments.
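A crude stand-in for the fixed-effects estimate is to measure, for each rater-firm pair, the mean deviation from the cross-rater consensus across all categories. This is a simplification of the regressions Berg et al. actually run, with invented data shapes:

```python
from statistics import mean

def rater_firm_deviation(scores):
    """scores[rater][firm][category] -> numeric score.
    Returns, per (rater, firm), the average deviation from the
    cross-rater consensus over all categories. A large positive value
    means this rater scores this firm above consensus across the
    board, the halo pattern described in the text."""
    raters = list(scores)
    out = {}
    for r in raters:
        for f in scores[r]:
            out[(r, f)] = mean(
                scores[r][f][c] - mean(scores[q][f][c] for q in raters)
                for c in scores[r][f])
    return out
```

If deviations were random noise, they would average out near zero; a consistent one-sided deviation for a rater-firm pair is exactly what the rater-effect regressions pick up.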
Which Categories Matter Most?
Not all categories contribute equally to divergence. Berg et al. identified the categories where measurement disagreement is both large and consequential for final ratings:
- Climate Risk Management
- Product Safety
- Corporate Governance
- Corruption
- Environmental Management System
These are not obscure edge cases. They are central ESG concerns that investors care deeply about. The fact that agencies cannot agree on how to measure them is deeply problematic.
Some high-divergence categories, like “Environmental Fines” or “Clinical Trials,” turn out to have low weight in most methodologies, so their disagreement does not heavily influence final ratings. But for the categories listed above, both divergence and weight are high. Disagreement on these topics translates directly into divergent ratings.
The Weight Paradox
One surprising finding is how little weight differences contribute to overall divergence, despite agencies clearly using different weights.
Berg et al. found that the top three most important categories differ substantially across agencies:
| Agency | Top 3 Categories by Weight |
|---|---|
| KLD | Climate Risk Management, Product Safety, Remuneration |
| Moody’s ESG | Diversity, Environmental Policy, Labour Practices |
| MSCI | Exposure scores dominate |
| Sustainalytics | Varies by industry |
There is almost no overlap. Yet weight divergence only contributes 6% to overall rating divergence.
The explanation is that scope and measurement effects are so large that they dominate the final calculation. Even if agencies weighted categories identically, they would still produce very different ratings because they are measuring different things or measuring the same things differently.
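A quick arithmetic illustration (all numbers invented): give two raters identical weights and identical scores on two of three categories, and a single measurement disagreement still opens a large gap.

```python
# Identical weights across both raters.
weights = {"climate": 0.4, "labor": 0.3, "governance": 0.3}

# The raters differ only in how they measure "climate".
rater_a = {"climate": 9, "labor": 5, "governance": 7}
rater_b = {"climate": 4, "labor": 6, "governance": 7}

def rate(scores):
    """Weighted-sum rating on a 0-10 scale."""
    return sum(weights[c] * scores[c] for c in weights)
```

Here the ratings come out 7.2 versus 5.5, a gap driven almost entirely by the climate measurement, with weights held fixed. Harmonizing weights alone would not close it.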
What Does This Mean for Investors?
The divergence of ESG ratings creates several practical problems.
Portfolio construction becomes arbitrary. The same ESG screen applied with different rating providers produces different portfolios. Investors who think they are buying “sustainable” companies may be buying very different baskets depending on their data source.
Engagement is complicated. If a company wants to improve its ESG performance, which rating should it target? Improvements that boost one rating might not affect another. Companies face a confusing landscape of competing standards.
Research is difficult. Academic studies of ESG performance depend heavily on which ratings are used. Results that hold for one provider may not hold for another. Meta-analyses become nearly impossible.
Greenwashing is easier. Companies can cherry-pick the rating that makes them look best, or the methodology that rewards their particular strengths. Without standardization, it is hard to hold anyone accountable.
Can Standardization Help?
Some observers call for standardized ESG measurement, analogous to accounting standards for financial reporting. The idea is that if everyone measures the same things the same way, ratings will converge.
This is appealing but faces obstacles.
First, ESG is inherently multidimensional. Unlike credit risk, which ultimately boils down to probability of default, ESG encompasses environmental impact, social responsibility, and governance quality. These dimensions do not reduce to a single number in any obvious way. Reasonable people can disagree about the right weighting.
Second, measurement requires judgment. Even with standardized definitions, analysts must interpret incomplete data, assess management quality, and predict future risks. Some subjectivity is unavoidable.
Third, the industry has commercial incentives for differentiation. Rating agencies compete partly on methodology. If everyone used identical methods, ratings would become a commodity with no competitive moat.
Still, progress is possible. Standardizing underlying data disclosures would help. Requiring companies to report specific metrics in specific formats would reduce the measurement divergence that currently dominates. The EU’s Corporate Sustainability Reporting Directive and the SEC’s proposed climate disclosure rules move in this direction.
Conclusion
ESG measurement is harder than it looks. The current landscape of divergent ratings undermines the core promise of ESG investing: that investors can reliably identify sustainable companies and allocate capital accordingly.
The divergence is not primarily about what agencies choose to measure or how they weight different factors. It is about fundamental disagreement on measurement itself. Two agencies looking at the same company, evaluating the same category, come up with different numbers.
This is a solvable problem, but it requires recognizing its scope. Marginal improvements to methodology will not fix measurement divergence. The industry needs better underlying data, more standardized disclosures, and perhaps more humility about what a single ESG score can actually capture.
For investors, the practical implication is clear: do not treat any single ESG rating as ground truth. Understand the methodology behind the ratings you use. Consider multiple sources. And recognize that ESG assessment, like much of investing, involves irreducible uncertainty.
The $500 billion flowing into ESG funds deserves better measurement infrastructure. Building it is one of the most important challenges facing sustainable finance.
This post is based on my seminar thesis, completed at the Karlsruhe Institute of Technology in June 2022.
References
Berg, F., Koelbel, J. F., and Rigobon, R. (2022). Aggregate confusion: The divergence of ESG ratings. Review of Finance, 26(6), 1315-1344.
Boffo, R., and Patalano, R. (2020). ESG investing: Practices, progress and challenges. OECD Paris.
Chatterji, A. K., Durand, R., Levine, D. I., and Touboul, S. (2016). Do ratings of firms converge? Implications for managers, investors and strategy researchers. Strategic Management Journal, 37(8), 1597-1614.
MSCI (2022). MSCI ESG Ratings Methodology.
Sustainalytics (2021). ESG Risk Ratings Methodology Abstract.