o!TR primarily uses the OpenSkill algorithm based on this paper, specifically the Plackett-Luce ranking model. The implementation source code can be found here. In short, OpenSkill is a system similar to the Elo or Glicko/Glicko-2 rating systems used in games like chess. It assigns each player an approximate rating and rating deviation; a higher rating deviation means more room for the rating to increase or decrease. Rating updates are performed based on the relative performance of each player in the match.
Tournaments are approved manually by a member of the o!TR team if they are fair and played in a valid competitive environment. This means that most tournaments adhering to badging criteria are accepted, with qualifiers and tryouts excluded. Please see the tournaments page of this wiki for more details.
Anyone is permitted to submit tournaments for approval, but we ask that a list of all bracket match links be provided for consistency. Matches also go through some automatic filtering for warmups (detected via uneven team sizes or unusual mods), and anyone is permitted to submit a case-by-case request to exclude a map or player.
Starting ratings are based on the closest-known rank according to osu!track, or your most recent global rank if none is known. For a rough reference, you'll receive a typical starting rating if you're somewhere around the high end of the 5-digit range (around rank 10,000). The initial placement uses the osu!track record closest in time to when you started playing tournaments. For example, if a player's first match in our verified matches database is from November 2021, and osu!track's closest record is from December 2021, we would base their initial rating on the December 2021 rank. Please remember that ratings will adjust to a significantly more representative value after playing in a few tournaments, regardless of the initial value.
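As an illustrative sketch of that lookup (hypothetical data shapes; the rank-to-rating conversion itself is not part of this snippet), the placement step just picks the osu!track record nearest in time to the first verified match:

```python
from datetime import datetime

def closest_rank_record(rank_history: list[tuple[datetime, int]],
                        first_match_date: datetime) -> int | None:
    """Return the global rank from the osu!track record closest in time
    to the player's first verified match, or None if no history exists."""
    if not rank_history:
        return None  # fall back to the most recent global rank instead
    _, rank = min(rank_history,
                  key=lambda record: abs(record[0] - first_match_date))
    return rank

# Example from the text: first verified match in November 2021 and the
# nearest osu!track record in December 2021 -> the December rank is used.
history = [(datetime(2021, 12, 5), 9_800), (datetime(2022, 6, 1), 7_200)]
print(closest_rank_record(history, datetime(2021, 11, 20)))  # 9800
```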
OpenSkill updates players' ratings based only on their relative ranking within a match, and these relative rankings are determined by a match cost formula. Match cost measures how well a player performs during the match: it compares their scores to everyone else in the lobby and gives a boost to players who play a larger share of the maps. Players who perform better and play more maps in the lobby are placed higher in the ranking and therefore gain more rating. Rating changes correspond to the results of complete matches, which helps mitigate situations where specialists receive an inflated rating for performing exceptionally on one or two maps. Whether a team wins or loses has zero impact on the rating algorithm; rating changes are based purely on one's individual performance in the match.
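As a minimal sketch of how that ranking is formed (not the actual implementation), the players in a lobby are simply ordered by match cost, regardless of which team won:

```python
def rank_by_match_cost(match_costs: dict[str, float]) -> list[str]:
    """Order players by match cost alone; team win/loss never enters the ranking."""
    return sorted(match_costs, key=match_costs.get, reverse=True)

# A player on the losing team can still top the ranking if their individual
# match cost was the highest in the lobby.
print(rank_by_match_cost({"winner A": 1.10, "winner B": 0.95,
                          "loser C": 1.25, "loser D": 0.80}))
# ['loser C', 'winner A', 'winner B', 'loser D']
```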
This is a key difference between o!TR and ETX/SIP: the latter two models consider more refined measurements of skill as well as the difficulty of the maps that are played. Since o!TR is primarily meant to measure individual performance, it does not take star rating into consideration. This means it is possible for low-rated players to "farm rating" off of others by consistently performing well. This is intentional and should be viewed as a sign that the player may be playing in tournaments that do not present a significant challenge for them (i.e. sandbagging). Additionally, it does not matter how much you win by, as long as your match cost is higher than the other players'.
Ratings will be recalculated every Tuesday at 12:00 UTC, taking into account any new matches and tournaments submitted since the previous recalculation. Please note that when any tournament is added to the database, it can have a ripple effect and change the ratings of not just the players in that tournament but of anyone who has ever played against them. Thus, ratings may change even if a player does not play any new matches. This will remain true until the o!TR team stops adding historical data, which could be years from now.
Every player is assigned a rating and volatility. Players start off with high volatility, meaning they gain or lose rating more dramatically at first, but volatility gradually decreases as a player competes in more tournaments and their rating settles. If a player does not play in any matches for 4 months in a row, their volatility will begin to gradually increase over time. This translates to a gradual increase in uncertainty from the model's perspective, directly resulting in more significant rating adjustments when the decaying player returns to tournament play. Furthermore, such players will start to lose a small amount of rating each week until they play again, decaying down to a floor halfway between their peak rating and a base value near 800 rating. If a player's rating is already below this floor, their rating will not decrease, but their volatility will still increase.
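A rough sketch of the decay rule follows; the inactivity threshold and weekly decrement below are placeholder values, since the exact amounts are not specified here:

```python
DECAY_BASE = 800.0          # base value near which the floor is anchored
WEEKS_BEFORE_DECAY = 4 * 4  # roughly 4 months of inactivity (placeholder)
WEEKLY_RATING_LOSS = 10.0   # "small amount" lost per week (placeholder)

def decayed_rating(current: float, peak: float, weeks_inactive: int) -> float:
    """Apply weekly decay after ~4 months of inactivity, never dropping below
    halfway between the player's peak rating and the base value."""
    floor = (peak + DECAY_BASE) / 2
    if weeks_inactive <= WEEKS_BEFORE_DECAY or current <= floor:
        return current  # at or below the floor, rating holds (volatility still rises)
    decay_weeks = weeks_inactive - WEEKS_BEFORE_DECAY
    return max(floor, current - WEEKLY_RATING_LOSS * decay_weeks)

print(decayed_rating(current=1400.0, peak=1500.0, weeks_inactive=30))  # 1260.0
```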
One common concern the o!TR team recognizes is that players may be incentivized to use rating decay or deliberately underperform in tournaments to artificially reduce their TR, defeating the purpose of the system. However, the 4-month period is long enough and the rating decay small enough that the effects would not be noticeable unless a player only plays the same rating-restricted tournament annually, at which point they would be isolating themselves from all other tournaments over the course of a year.
Additionally, calculable metrics such as player decay status and player volatility will be made available through an API should hosts wish to screen out highly volatile players (subject to tournament committee approval for badged tournaments). Furthermore, matches can be manually flagged by the o!TR team to not be included if foul play has been detected, such as players teaming up and underperforming to reduce rating. This is a primary advantage of the tournament and match approval system we have in place.
If you have further questions not answered on this page, feel free to raise a question on our GitHub discussions page.
Our match cost formula is inspired by Bathbot's formula, simplified to remove factors that are not relevant for our purposes. Each player receives a map score between 0.5 and 1.5 for each map they play, with 1.0 being average across the lobby (specifically, this is 0.5 + normal cdf(z-score)). To calculate match cost, we take the average of these map scores and multiply by a lobby bonus factor ranging from 1.0 to 1.3 (specifically, this is 1.0 + 0.3 * sqrt(x), where x ranges linearly from 0.0 for playing only one map to 1.0 for playing all of them).
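A minimal sketch of the two formulas, assuming the z-score is computed per map against the other scores in the lobby (the real implementation may normalize differently):

```python
from statistics import NormalDist, mean, pstdev

def map_score(player_score: float, lobby_scores: list[float]) -> float:
    """0.5 + normal_cdf(z-score): a value between 0.5 and 1.5, with 1.0 ~ lobby average."""
    mu, sigma = mean(lobby_scores), pstdev(lobby_scores)
    z = 0.0 if sigma == 0 else (player_score - mu) / sigma
    return 0.5 + NormalDist().cdf(z)

def match_cost(map_scores: list[float], total_maps: int) -> float:
    """Average map score times the lobby bonus 1.0 + 0.3 * sqrt(x), where x runs
    linearly from 0.0 (played only one map) to 1.0 (played every map)."""
    played = len(map_scores)
    x = 0.0 if total_maps <= 1 else (played - 1) / (total_maps - 1)
    return mean(map_scores) * (1.0 + 0.3 * x ** 0.5)
```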
Bathbot also includes a bonus for playing in a tiebreaker and for playing more mod combinations, but both of those help reflect "general contribution to match" rather than "overall performance" which is why we do not use them. Unlike other match cost formulas, this one ensures that in a 1v1, the winner of the match always has the higher match cost unless warmups are mistakenly counted or the EZ multiplier is different from the assumed value of 1.75x. This is important because the winner of a 1v1 should always gain rating.
The rating system itself is based on OpenSkill, which is a Bayesian approximation algorithm. Without going too deep into any formulas (that's what reading the paper is for), these Bayesian rating algorithms assign each player a rating μ and volatility σ, which together describe a distribution over that player's predicted actual skill. When players compete against each other in a match, a formula calculates the probability of the various possible rankings of that match, and then every player's μ and σ are adjusted based on how "surprising" the actual outcome was. In this case, the formula comes from the Plackett-Luce model, whose fundamental assumption is the independence of irrelevant alternatives: player A beats player B in match cost with the same probability, no matter who else is in the lobby. This assumption is not fully correct because teammates do affect how often one participates, but Plackett-Luce is still used in real-world ranking systems such as horse racing and poker standings.
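To make that assumption concrete, here is a small illustrative sketch (not o!TR code) of how the Plackett-Luce model assigns a probability to a full ranking: each finishing position is chosen proportionally to exp(strength) among the players not yet placed, so the relative odds of A beating B never depend on who else remains.

```python
import math

def plackett_luce_probability(strengths_in_finish_order: list[float]) -> float:
    """Probability of an observed ranking under Plackett-Luce: each successive
    place is a choice among the players who have not been placed yet,
    weighted by exp(strength)."""
    weights = [math.exp(s) for s in strengths_in_finish_order]
    probability = 1.0
    for i, w in enumerate(weights):
        probability *= w / sum(weights[i:])
    return probability

# Three players finishing exactly in order of strength:
print(plackett_luce_probability([2.0, 1.0, 0.5]))  # ~0.39
```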
Quoting from the paper, these adjustments are calculated via "the average of the relative rate of change of [a player's] winning probability with respect to [their] strength," where the average is taken over the prior distribution. The exact Bayesian inference calculations for these types of models are computationally intensive, and OpenSkill is only an approximation of the full calculation (the paper discusses why the simplifications are reasonable). Thus, we use OpenSkill because (1) it has an open license and (2) every time new matches are added, the entire rating history of all matches must be recalculated, so a faster algorithm is favorable.
There are various parameters that can be adjusted when setting up the model; see the documentation here or read the paper for more detailed calculations. In words, μ and σ are the rating and volatility mentioned above, β is an "extra volatility" term for calculating head-to-head matchup probabilities, κ is used as part of a check that volatility stays positive, and τ adds a small amount to variance (squared volatility) after each match. Finally, the function γ is a dampening factor which causes volatility to decrease less significantly when matchups are large. Intuitively, this can be thought of as not treating a match with 10 players in it as counting as much as 9 separate 1v1s against each opponent.
The Plackett-Luce model allows for arbitrary scalings of its parameters, though the OpenSkill documentation recommends that σ start out as 1/3 of μ. We choose a scaling so that the highest ratings look somewhat similar to chess ratings (solely for aesthetic appeal), though we do choose varying initial ratings based on rank. While we keep the default values of γ and κ, we currently initialize σ, β, and τ to a smaller fraction of μ than the default to make it more difficult to farm rating from low-rated players.
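As a hedged sketch only: the values below are illustrative placeholders rather than o!TR's actual configuration, and the constructor arguments assume the openskill.py Python implementation (check its documentation for the exact names). Setting up a Plackett-Luce model with a non-default scaling might look like this:

```python
from openskill.models import PlackettLuce

MU_START = 1200.0  # placeholder starting rating; o!TR varies this by rank

# Placeholder fractions: the documented default is sigma = mu / 3, while o!TR
# uses smaller fractions of mu for sigma, beta, and tau (exact values not shown).
model = PlackettLuce(
    mu=MU_START,
    sigma=MU_START / 5,
    beta=MU_START / 10,
    tau=MU_START / 100,
)

# Three solo players ranked by match cost (rank 1 = highest match cost).
a, b, c = model.rating(), model.rating(), model.rating()
(a,), (b,), (c,) = model.rate([[a], [b], [c]], ranks=[2, 1, 3])
print(round(a.mu), round(b.mu), round(c.mu))  # b gains, c loses, a lands in between
```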
Remember that different players specialize in different skillsets and have skill caps at different levels, so please do not interpret TR as an absolute skill comparison between two players. o!TR identifies when people frequently win relative to others in their rank range or skill level, so if you see a player with what seems like an unusually high rating, we recommend looking at their tournament history and checking whether they are consistently the top performer in their matches.