Christian Legnitto: Star rating is the worst metric I have ever seen

Wednesday, December 2, 2015, 21:14

Note: The below is a slightly modified version of a rant I posted internally at Facebook when I was shipping their mobile apps. Even though the post is years old, I think the issues with star rating still apply in general. These days I mainly rant on Twitter.

Not only is star rating the worst metric I have ever seen at an engineering company, I think it is actively encouraging us to make wrong and irrational decisions.

My criticisms, in no particular order:

1. We can game it easily.

On iOS we prompt¹ people to rate our app and get at least a half-star bump. Is that a valid thing to do, or are we juicing the stats? We don’t really know. On Android we don’t prompt…should we artificially add half a star there to make up for the lack of a prompt and approximate the “real” rating?²

We’re adding in-app rating dialogs to both platforms, which can juice the stats even more³. If we are able to add a simple feature–which I think we should add for what it’s worth–and wildly swing a core metric without actually changing the app itself, I would argue the core metric is not reflective of the state of the app.

2. We don’t understand it.

The star rating is up on Android…we don’t really know why. The star rating is down on iOS and we think we might know why, but we still have big countdown buckets like “performance”. For a concrete example, in the Facebook for Android release before Home we shipped the crashiest release ever…and the star rating was up! We think it was because we added a much-requested feature and people didn’t care about the crashes but we have no way to be sure.

When users give star ratings they are not required to enter text reviews, leaving us blind and with no actionable information for those ratings. So even when we cluster on text reviews (using awesome systems and legit legwork by the data folks) we are working with even fewer data points to try to understand what is happening.⁴

Finally, we have fixed countdown bugs on both platforms in the last quarter…we haven’t seen a step function up or down on either star rating; the trends are pretty constant. This implies that we don’t really know what levers to pull and what they get us.

3. Vocal minorities skewing risk vs reward reasoning.

The absolute number of star ratings is pretty low, so vocal minorities can swing it wildly–representative sample this is not. For example, on the latest iOS app we think 37% of 1-star reviews can be attributed to a crash on start. Based on what we know, the upper bound of affected users is likely ~1MM, which at 130MM MAU⁵ is roughly 0.7%. The fix touches a critical component of the app and mucks around with threading (via blocks), and the master code is completely different. So 0.7% of users make up 37% of our 1-star reviews because of one bug (we think), and we are pushing out a hotfix touching the startup path because of the “37%” when we should really be focusing on the “0.7%”. I think that is the right decision if we put a lot of weight on star rating, but it isn’t the right decision generally. Note that we did not push out a hotfix for the profile picture uploading failure issue in the same release because the 0.5% of DAU affected wasn’t seen as worth the risk and churn.
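To make the mismatch concrete, here is a small sketch of the arithmetic using only the estimates quoted above (~1MM affected users, 130MM MAU, 37% of 1-star reviews). The function name `overrepresentation` is made up for illustration, and the inputs are the post's rough upper bounds, not measured values. The exact quotient is about 0.77%, which the post rounds down to 0.7%.

```python
def overrepresentation(affected_users, total_users, share_of_one_star):
    """Return (affected share of users, how many times over-represented
    that group is among 1-star reviews)."""
    affected_share = affected_users / total_users
    return affected_share, share_of_one_star / affected_share


# Figures quoted in the post; all of them are estimates.
affected_share, ratio = overrepresentation(1_000_000, 130_000_000, 0.37)
print(f"{affected_share:.2%} of users, over-represented {ratio:.0f}x in 1-star reviews")
# prints "0.77% of users, over-represented 48x in 1-star reviews"
```

Under these assumptions, a group of well under 1% of users carries roughly 48 times its weight in the 1-star bucket, which is why reasoning from the "37%" instead of the "0.7%" distorts the risk vs reward call.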

4. It’s fluid-ish.

A user can give us a star rating and then go back and change it. Sometimes they do, but frequently they don’t (we think). This means our overall star ratings likely have an inertia coefficient and may not reflect the current state of the app. We have no visibility into how much this affects ratings and in what ways. If we fix the iOS crash mentioned above, what percent of users will go back and change their star rating from 1 to something else? As far as I know this inertia coefficient isn’t included in any analysis and isn’t really accounted for in our reasoning and goals.⁶
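One way to picture the inertia is a toy model, which is purely an assumption for illustration and not something from the original analysis. Suppose a group of users rated the app while a bug was live, and only a fraction of them revisit their rating after the fix. The name `displayed_average` and the example values (1 star before, 4 stars after) are hypothetical.

```python
def displayed_average(stale_rating, current_rating, update_fraction):
    """Average rating of a group when only `update_fraction` of its members
    replace their old (stale) rating with one reflecting the current app."""
    if not 0.0 <= update_fraction <= 1.0:
        raise ValueError("update_fraction must be between 0 and 1")
    return (1 - update_fraction) * stale_rating + update_fraction * current_rating


# Hypothetical: affected users rated 1 star, and would rate 4 stars after the fix.
for fraction in (0.0, 0.1, 0.5, 1.0):
    print(fraction, displayed_average(1.0, 4.0, fraction))
```

In this sketch, if only 10% of those users ever come back, the group's average barely moves even though the bug is gone. The real update fraction is exactly the unknown the post says nobody measures.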

5. One star != bad experience.

Note: I added #5 today, it wasn’t in the original post.

Digging into our star rating, some curious behavior emerged:

The app stores show reviews on the app listing page. The algorithm that chooses which reviews to show must have some balance component as it usually shows at least one negative and positive review. We found that users in certain countries noticed this and would rate us as 1 star just to see their name on the listing page!
There were a number of 1 star ratings with very positive reviews attached. It turns out that in some cultures 1 star is the best (“we’re number one”) so those users were trying to give us the best rating and instead gave us the worst!

Of course, there is both the standard OMG CHANGE reaction (“Why am I being forced to install Messenger?”) and user support issues (“I am blocked from sending friend requests, please help me!”) that show up frequently in 1 star reviews too. While both of those are important to capture and measure, they don’t really reflect on the quality of the app or a particular release.

The emperor has no clothes.

Everyone working on mobile knows about these issues and has been going along with star rating due to the idea that a flawed metric is better than no metric. I don’t think even using star rating as a knowingly flawed metric is useful from what I’ve seen over the last quarter. I think we should keep an eye on it as a vanity metric. I think we should work to capture that feedback in-app so we can be in control and get actionable data. I think we should be aware of it as an input to our reasoning about hotfixes but make it clear the star rating itself has no value and shouldn’t be optimized for in a specific release cycle.