Unclear scale: when reviewers don’t know what score to choose

This page is part of a global project to create a better online reviews system. If you want to know more or give your feedback, write to antoine@soch.at and we’ll grab a beer ;)

Traditionally, each star in the 5-star rating system has a specific meaning:

1 = Poor experience
2 = Below expectations
3 = Met most expectations, but room for improvement
4 = Good, met expectations
5 = Excellent, above expectations

Example of TripAdvisor’s rating scale

Most of the time, these descriptions aren’t displayed alongside the stars. Even when they are, users often ignore them, relying instead on their personal interpretation of the scale. This results in a positive inflation of each score.

Remember the “The question asked matters” section? The question most people answer is actually: "Did your experience meet your expectations entirely, or were there specific aspects that fell short?" If you give 5 stars, it means the experience was fully satisfactory. Fewer stars indicate varying levels of disappointment.

In theory, that’s how it works. In practice, people often give 5 stars even if they had minor disappointments because 5 stars have become the new 3: “The experience met enough of my expectations.” While some have adapted to this new standard, not everyone has.

The gap between intended meanings and perceived scores is so wide that companies have started educating customers on how to rate ‘fairly.’ For example, on Airbnb, hosts provide guides to encourage guests to rate more favorably:

Taken from an Airbnb property manager’s website

This image, taken from a blog, captures how hosts and guests alike feel about star ratings:

Some people reserve 5 stars for exceptional performance and consider 4 stars satisfying, respecting the traditional scale used by most websites. However, when reading reviews, others might perceive 4 stars as “somehow not perfect,” rather than “satisfying enough.” This leads to double standards.

It also comes down to subjectivity: everyone has different standards. A 5-star experience for one person might be a 4-star experience for another due to varying expectations and criteria (we’ll explore this in “Categorization & Subjectivity”).

Some argue that the ambiguity of the star-rating system calls for a binary system (e.g., “satisfied” or “not satisfied”).

A survey I received in an email

YouTube and Netflix use this system, but it doesn’t always help. People who are “satisfied for the most part” want to share what went wrong specifically. As we’ll show in the “Nuances” section, no matter how many statements we use, we’ll always miss some information.

When there are enough reviews, aggregation tends to flatten individual differences, making the average rating more reliable. This is known as the “law of large numbers” or the “wisdom of the crowd”. However, this assumes that the reviewers represent the customer base fairly, which is often not the case. And when there are too few reviews, the system fails. We’ll cover these points in the reader-level study.
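To make the aggregation argument concrete, here is a minimal Python sketch. It assumes ratings behave like independent draws from the same customer base, which is exactly the assumption questioned above. The function name and the sample ratings are hypothetical. It computes the average rating and its standard error, which shrinks roughly with the square root of the number of reviews.

```python
import math
import statistics


def mean_with_standard_error(ratings):
    """Return (mean, standard error of the mean) for a list of star ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate the spread")
    mean = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n).
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, std_err


# Hypothetical data: the same mix of ratings, with few vs. many reviews.
few_reviews = [5, 4, 5, 3, 5]
many_reviews = few_reviews * 40  # 200 reviews with the same distribution

few_mean, few_err = mean_with_standard_error(few_reviews)
many_mean, many_err = mean_with_standard_error(many_reviews)
print(f"5 reviews:   mean={few_mean:.2f}, standard error={few_err:.2f}")
print(f"200 reviews: mean={many_mean:.2f}, standard error={many_err:.2f}")
```

Both lists have the same average, but the one with more reviews has a much smaller standard error. Note that a small standard error says nothing about bias: if only very happy or very unhappy customers leave reviews, the average is precise but still unrepresentative.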

💡

Exploration

“Nothing to report” IS feedback. It means that most of the customer’s expectations were met. Was it a 4 or 5-star experience? They don’t know, but at least they didn’t have any issues, and that’s valuable information for the business and other customers.

A proposed survey where “nothing to report” = validation

3 should clearly represent “as good as expected”. A truly neutral rating is missing from the current scale. Building on the above, here’s a proposed design:

An example of a scale with a true neutral.

However, people might still use the top score as the new neutral due to a natural tendency toward positivity, partly because reviewers don’t want to hurt others’ feelings, something we explore in the dedicated section “Feeling bad judging people”.

Allow users to submit reviews without a rating. Sometimes, users only want to provide information without giving a rating (e.g., “the opening hours have changed”).
Remove the ratings: words are enough. The comment already carries the customer’s sentiment, which we can capture with semantic analysis and aggregate at an overview level. The challenge is more about design and UX: how do we make that information available without overloading the reader? How do we maintain users’ trust in the information, since summarizing comments inevitably alters the original meaning? The advantage of ratings is that they’re static: they are fixed pieces of information that can’t be altered. Relying solely on words both adds and removes clarity: the individual sentiments stay true to the reviewers’ thoughts, but the aggregate is harder to compile. Without an “aggregated feeling”, potential customers would lose their current ability to compare options at the “calibrate” step; a design innovation is still needed there.

Yogi’s reviews analysis platform displays what is good and bad in a comment, attributing it to different categories.

A simple scale of 3 options: “good, ok, bad.” Platforms like Beli and Netflix use this effectively. It works because it provides self-serve feedback: the algorithm is refined and personalized to fit a user’s preferences. This approach is also effective for peer-to-peer reviews: if someone has the same taste as me, they’re likely to rely on my opinions. However, it presents a challenge in terms of aggregating the feedback.

Beli’s 3-option scale
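One simple way to approach the aggregation challenge is to report the share of each answer instead of collapsing the three options into a single average. The sketch below is only an illustration: the `summarize` function and the sample votes are hypothetical, and this is not how Beli or Netflix aggregate feedback.

```python
from collections import Counter

OPTIONS = ("good", "ok", "bad")


def summarize(votes):
    """Return the share of each option, keeping the three buckets separate."""
    counts = Counter(votes)
    unknown = set(counts) - set(OPTIONS)
    if unknown:
        raise ValueError(f"unexpected options: {sorted(unknown)}")
    total = len(votes)
    if total == 0:
        return {option: 0.0 for option in OPTIONS}
    return {option: counts[option] / total for option in OPTIONS}


# Hypothetical votes for one restaurant.
votes = ["good", "good", "ok", "bad", "good", "ok", "good", "good"]
shares = summarize(votes)
for option in OPTIONS:
    print(f"{option}: {shares[option]:.0%}")
```

Keeping the three shares separate avoids pretending that “ok” sits exactly halfway between “good” and “bad”, at the cost of giving readers three numbers to compare instead of one.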

How does that compare to…? As discussed in “Why Do We Look at Online Reviews” and “Expectations, Subjectivity, Standards & Risks,” reviews are essentially about comparing options. So, in addition to a simple “good or bad” choice, we can ask the user whether their experience was better than a previous experience with a similar product or service.

Follow-up question on Beli’s app to establish a ranking of restaurants, tailored to your own standards
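The comparison question can be sketched as a ranking procedure: each new experience is placed in the user’s personal ranking by asking “was it better than this one?” a few times. The code below is a minimal illustration that assumes a binary-search insertion; it is not a description of Beli’s actual algorithm, and the restaurant names, scores, and function names are all hypothetical.

```python
def insert_by_comparison(ranking, new_item, is_better):
    """Insert new_item into ranking (best first) using pairwise questions.

    is_better(a, b) should return True when the user prefers a over b.
    Binary search keeps the number of questions close to log2(len(ranking)).
    """
    low, high = 0, len(ranking)
    while low < high:
        mid = (low + high) // 2
        if is_better(new_item, ranking[mid]):
            high = mid  # new item belongs above ranking[mid]
        else:
            low = mid + 1  # new item belongs below ranking[mid]
    return ranking[:low] + [new_item] + ranking[low:]


# Hypothetical user taste, standing in for the answers a person would give.
hidden_taste = {"Noodle Bar": 9, "Taqueria": 7, "Pizza Place": 5, "Diner": 3}
questions_asked = []


def ask_user(a, b):
    """Simulate the follow-up question: did the user prefer a over b?"""
    questions_asked.append((a, b))
    return hidden_taste[a] > hidden_taste[b]


ranking = ["Noodle Bar", "Pizza Place", "Diner"]
ranking = insert_by_comparison(ranking, "Taqueria", ask_user)
print(ranking)
print(f"questions asked: {len(questions_asked)}")
```

The user never has to pick a score: the ranking emerges from relative judgments, which sidesteps the question of what each star is supposed to mean.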

➡️ Next up: Categorization & subjectivity