Are the results fair?

Posted by AlwaysGeeky (twitter: @AlwaysGeeky)
May 14th, 2012 7:59 am

Before I start this post I just want to say that I am not writing this out of any spite or bitterness towards the results, and I'm not angry about anything. I am completely proud of my entry and what I achieved during this LD (my first entry :) ), so don't just dismiss this post as a bitter rant about my personal rankings.

But I do want to bring up the question of how fair the results are, and whether the way they are currently calculated is working as intended…

From what I can see and assume, the results are just calculated as an average of the ratings you were given in each category. So for example, if you were rated by 10 people, your ratings in each category are tallied up and divided by 10 to give you a score for that category. Now this seems a bit broken for a competition like this, where each entry will get a wildly different number of votes/ratings.

So in theory if a game was only rated by 1 person, but that person gave the game 5/5 for each category, then it would get a score of 5.0 for each category… This seems wrong to me and allows for wild fluctuation in the results depending on how many people played/rated your game.

Let’s take a look at some examples and see if what I am saying seems to make sense:

Looking at the top 25 list, it doesn't take long to go down the list and find a game that scored *very* well but, on closer inspection, seems to have been played by only a handful of people…

Lonely Hated Rock – Came 5th overall in the competition and had scores of #8 in theme, #21 in fun, #30 in innovation, #94 in mood and #128 in graphics. Very good scores all round! I'm impressed! That is, until I look at the spreadsheet and find that this game was only played and rated 16 times…

Pocket Planet – Came 10th overall and joint 1st in fun! Well done… 20 people rated this entry.

Pale Blue Dot – 16th overall and #23 in the fun category… but was only played/rated by 20 people.

Necro Gaia – #16 overall, #11 for graphics, #16 for fun and #50 for mood. This game actually ranked in the top 100 in 6 categories! Wow… that's amazing. How many times was that game played/rated? 18!! (What's worse… the author of this game had a coolness rating of 0… yep, according to the data, they never played/rated any other game.)

I am using the latest play/rate data from https://docs.google.com/spreadsheet/ccc?key=0Ao74NZQqNUt5dDNvZUJ1UXVqZGkxUGVlVkxlZ3JnM2c#gid=0

I am not bashing any of the entries in my example, merely using them to point out how having fewer people play your game seems to benefit your final result.

It would seem that using an average score to produce a final result is flawed in a competition such as this. The arithmetic mean is a fine metric, but it should only be used when you can guarantee that each entry will get the same number of data points. If that is not the case, you need to look at alternatives and do a bit more statistics to get fairer results.

It really depends on your viewpoint, but as I have pointed out above, if the final score is calculated as a simple average, the competition and results are open to wild swings and unfair outcomes.

Maybe somebody could compile a spreadsheet showing the number of ratings each game got, ordered by the overall ranking of the game… it would be interesting to see how high up some games can get when only a very small number of people rate their entry…

EDIT: I put an example in one of my reply comments below, but I'm bringing it up into the original post for more exposure.

Here is an extreme example highlighting the problem in a simple way, showing how the system doesn't work for the current dataset:

With the current system, a game that is played and rated by 10 people, all giving it 5/5, would score a 5.0 in the final rankings… but a game that is played by 100 people, rated 5/5 by 50 of them and 4/5 by the other 50, would score 4.5 in the final rankings… now just consider this example and ponder whether that seems correct to you. Personally I think it highlights how the final rankings don't correctly reflect the ratings a game got.
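
To make the arithmetic concrete, here is a tiny sketch of what a plain mean does, using the two hypothetical games from the example above (this assumes the rankings really are a straight average, as described earlier):

```python
# Plain arithmetic mean, which is how the rankings appear to be calculated today.
def plain_average(ratings):
    return sum(ratings) / len(ratings)

game_a = [5] * 10             # 10 players, every one of them rates 5/5
game_b = [5] * 50 + [4] * 50  # 100 players: 50 rate 5/5, 50 rate 4/5

print(plain_average(game_a))  # 5.0
print(plain_average(game_b))  # 4.5 -- ranks below game A despite 50 perfect scores
```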


41 Responses to “Are the results fair?”

  1. Attrition says:

    The more people rating your game, the larger the spread of opinions brought in, which means your scores are more likely to head towards 3s. With the coolness system pumping your game into more people's queues the more you review others, it's likely that if you rate just a few, and have some good reviews (say, a couple of 4s in any category), it's better to stop rating other games and hold out for the rankings. Of course, a couple of bad ratings would end your chances, and you never know what anyone rates you.

    Of course, LD is not about the ratings anyway, but it seems the current system doesn't pan out well, especially with the ever-increasing number of entries.

    Personally I think the larger problem is there are entries in the top 10s with no source code provided… Rankings should not be given if you do not provide source by end of the judging period. I don’t mind if people ignore the theme or whatnot, if the rest of their game compensates that’s fine. Not including the source should be a disqualification.

    Overall I'm actually happy with everything, and though my entry was not complete I did receive some good rankings and it was a lot of fun! I hope they address some of these concerns in the future.

  2. AlwaysGeeky says:

    Yes, that's my point: the more people that rate your game, the more your score will correctly reflect the average that you received.

    For a competition like this, Bayesian averages should be used, so that if your game is rated by a small subset of people the final score won't fluctuate wildly.
    http://en.wikipedia.org/wiki/Bayesian_average
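
    To illustrate the idea, here is a minimal sketch of a Bayesian average applied to the two hypothetical games from the post above. The prior mean m and the weight C are assumptions here (for example, the mean rating across all entries and a typical vote count):

```python
# Bayesian average in the sense of the Wikipedia article linked above:
# the raw mean is pulled towards a prior mean m, weighted by a constant C.
def bayesian_average(ratings, m=3.0, C=30):
    return (C * m + sum(ratings)) / (C + len(ratings))

game_a = [5] * 10             # 10 votes, all 5/5
game_b = [5] * 50 + [4] * 50  # 100 votes: half 5/5, half 4/5

print(round(bayesian_average(game_a), 2))  # 3.5  -- heavily pulled towards the prior
print(round(bayesian_average(game_b), 2))  # 4.15 -- now ranks above game A
```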

    • Raptor85 says:

      Another issue with the current ratings system that I've noted is a tendency for games with lower numbers of ratings to be rated SIGNIFICANTLY higher or lower than games with a lot. (I haven't done a full analysis yet; I intend to do some statistics with everything I've been keeping during this compo, but won't have time for a day or two.) I think this is mostly due to games with lots of ratings pulling from a more diverse sample, while the games with a low number of ratings either do really well or really badly based on the input from just a few people (one person giving it a 5 in a group of 10 is a considerable boost to its rating).

      On the flip side, the varied sample that games with > 100 ratings get seems to have the effect of pulling all scores closer towards 3, even if that category didn't exist in the game!

      With people with a higher "coolness" rating now getting more votes, it did at least bring the average up… which made for a much larger sample of games that are rated fairly in relation to each other, but for games significantly above and below the average the results are pretty skewed. It could still use some work to get things more in order.

      • AlwaysGeeky says:

        Yep, some in-depth statistical analysis would be great. But just as a quick point of comparison it would be good to see the games paired with their rating for each category, along with how many ratings they received… then if you order that by the overall rankings of the games, you will probably find many cases of what I pointed out above:

        Games that received few ratings but scored very well, way above what they would have scored if they had had the same number of ratings as the popular games.

        Like I said previously, the worst case I can see so far is a game being rated by only 16 out of 1400 judges, but going on to claim 5th place…

        I'm still waiting for someone to argue that the current system is working correctly and that this is an accurate reflection of how the ranking system should work.

  3. PaulSB says:

    I think this discussion is sometimes avoided because, after all, Ludum Dare isn't about "winning", it's about making games. But… I do tend to agree: as the number of participants continues to explode, the current system is open to, at the least, unstable results and, at worst, abuse.
    Maybe the default sorting algorithm could prioritise games with very few votes, regardless of Coolness, in future? Following the sorting formula this time, a Coolness of zero appeared to result in 5-6 eventual ratings, which really is too low a sample size I'd say.
    I still don't think it's a huge issue, but something to think about anyway :)

    • AlwaysGeeky says:

      I completely agree, winning really means nothing. Hence the lack of a prize or anything else associated with winning.

      But as long as the thing is called a 'competition', is predominantly displayed with competition-style wording, is structured like a competition and has rankings… then you really need to ensure that the end result is meaningful… otherwise you lose all credibility and may as well drop the competition/rankings part of it… since it isn't an accurate reflection.

  4. TheSheep says:

    There is actually another catch. I don’t know about others, but I generally try to pick and play the games that I like from the screenshot or description — I can’t possibly play them all and I really prefer to enjoy it and give good scores. So the better-looking (or better-described) games naturally will get more votes from people like me. (I can assume that not everyone is like me, and that there are people who like to bash poor games and pick the ones that look the worst, but that also is some kind of bias).

    • digital_sorceress says:

      One observation I've made of the D=R-C system is that, because it is a communal list, just like a fruit stall it will get picked over.

      The games with the most attractive screenshots will quickly disappear from it, while the games with the least attractive screenshots will linger the longest.

      So when we each go to rate games, we’re mostly going to be seeing those with the least attractive screenshots, and there’s probably a fairly strong correlation with those being the lower quality games. At the same time, the best quality games are going to be pointed out in the blog, and they’ll get played enough times that they won’t ever be seen on that list.

      And what that means is that those who use the D=R-C list the most, will mainly be faced with the lower quality games.

  5. Attrition says:

    Also I'd like to note that despite my comment above, I really don't think there's much in the way of gaming or abuse going on; I mainly feel there are unintended consequences to the system that should be addressed.

    • AlwaysGeeky says:

      Yes, that's exactly it. Abuse might be the wrong word because I am not accusing anyone of trying to cheat the system intentionally. It's just a shortcoming of using an arithmetic average when the sample size varies from entry to entry, which should really be fixed.

      This might have worked much better in the old days, when there were many fewer entries and the variance in the number of times each entry was rated was small.

      But it is a real problem for a data set of this size. In fact I would go as far as to say the system breaks down drastically when you have this large a data set.

      Arithmetic averages *should not* be used when the number of data points can vary this much,
      i.e. some games being rated fewer than 20 times while others get over 200 ratings… That will produce really funky results and won't allow you to get anything meaningful from the rankings… as per the 4 examples I pointed out above.

  6. Pierrec says:

    I totally agree, but I don't really care. I prefer my game being played 100 times and rated 4 than played 10 times and rated 4.
    In a way, I think it's cool that Ludum Dare isn't perfectly fair… it's a reminder that we're only here to have some fun and to make/play some games. Anyway, a ranking will never be fair; that's impossible to do. Even if the system changes to be the fairest possible, the ratings themselves never will be.
    For example, I'm sure I give better ratings to people whose work I know and enjoy. It is not conscious, but I'm pretty sure I do.

    That said, I'm not against a better way to rank games. It just doesn't seem that important to me.

    • AlwaysGeeky says:

      No system will be perfect, you are right… but there is a big difference between a system that works and isn't perfect, and a system that just doesn't work…

      If you look at how other people solve this problem, the solution is quite simple: a different averaging system…
      e.g. IMDb.com – if a new film comes out and is rated 10/10 by just one person, that film doesn't immediately get a score of 10… because they use a Bayesian average system.

  7. PoV says:

    No worries, we handle this.

    The cutoff for Ludum Dare 23 was 11 votes. Prior events were 5, then 7. ~200 people, all of whom rated no more than 1 entry (most rated 0), were left below the threshold and were not assigned a score. Everyone that rated 2+ games earned the required number of votes to go above the threshold, thanks to the built-in recommendation system. For best effect, you want to rate many games as early as possible. It should be clear though, with 15% falling below the threshold, that rating no games is a dangerous thing to do. Some people managed to get by, but a lot did not.

    Also, the scores are not a straight-up average. The highest and the lowest rating are discarded, leaving the middle 9+ scores to weigh the results. With all ratings being decided by a two-digit fraction of a point, the resulting error should be low given the sample size. There is more spread among the higher-scoring games, and less spread (thus less error) among the lower-scoring games.

    I haven't done a serious study on it or anything, but once an entry has >20 votes it should reach an equilibrium and shouldn't really deviate much. With fewer than 20, it will be inaccurate by a few hundredths (plus or minus).
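
    For reference, here is a rough sketch of the high/low filter as described above (the exact implementation details are assumed, not taken from the real site code). Note that it still takes a plain mean of what is left, so it does not account for how many votes an entry received:

```python
# Drop the single highest and single lowest rating, then average the rest.
def trimmed_mean(ratings):
    if len(ratings) <= 2:
        return sum(ratings) / len(ratings)   # too few votes to trim anything
    trimmed = sorted(ratings)[1:-1]          # discard one lowest and one highest
    return sum(trimmed) / len(trimmed)

# 11 votes (the LD23 cutoff) -> the mean of the middle 9 scores
print(trimmed_mean([1, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5]))
```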

    • AlwaysGeeky says:

      I don't think the problem is games being rated a different number of times. That is bound to happen and unavoidable with the size of LD these days. But you have got to admit that the way the averages/rankings are calculated is a bit broken??

      Take my examples above, the most extreme case… do you think it is fair, or an accurate reflection of the system, that a game which was only played/rated 16 times came 5th out of 1400 games??

      • PoV says:

        Yes more ratings are better, but there is a prejudice against Windows games (vs Web). I would think players that only play web games are harsher critics, due to the volume they can play. Even if we had 25 votes versus 16, by removing the web critics, I’d still expect that game to be in the top 20.

        • AlwaysGeeky says:

          Nah, I disagree personally I'm afraid… LD is all about the community, and a scoring system that reflects that community should be used; people shouldn't be penalized in their ranking because their game is played by more people, which is what currently happens (because the more people that play your game, the more 'averaged' your score gets; games that are played by very few people don't get 'averaged' as much).
          Using thresholds is not the way to solve this problem unfortunately.

          Looking at an example with the current system:
          (Forgetting the fact that the top and bottom rating is disregarded for a moment)

          A game that is played by exactly 11 people (the current threshold) and rated 5.0/5.0 by each of the 11 people would score a perfect 5.0.
          Whereas a game that is played by 100 people, where 50 people rate it 5.0/5.0 and the other 50 rate it 4.0/5.0, would score 4.5 and rank lower than the previous game… which is wrong in my opinion, since almost 5 times as many people gave it a 5/5 score as gave the first game one, but because it was rated by a larger number of people its final score got 'averaged' down…

          See how the current system doesn’t scale well?

          • PoV says:

            As the number of votes increases, we can raise the cutoff threshold. Realize we cut 15%, >200 people, from getting any score. Had it been pushed to 21 instead of 11, 30%, or ~450 people, would have been left without ratings. If you ask me, that would have been even more tragic for the community.

            Phil seems to have the opposite opinion to me, so there will probably eventually be a compromise.

            • AlwaysGeeky says:

              I already said that cutoffs and thresholds won't solve this problem. There is a way you could remove thresholds completely, allow every game entered to have a ranking, AND make it a fairer system that doesn't penalize people who get rated by more people… as I keep pointing out, using a Bayesian average system would solve all these problems…

              There is no point promoting the community side of things, pushing people to rate more games and really trying to promote the "rate more" part of the scoring system if you don't have a system which makes that a good thing.

              At the moment it is a sad fact that promoting your game and getting it rated by more people will, on average, result in a worse ranking because of this. It's a shame.

              • PoV says:

                Okay, we've noted "Use Bayesian Average" in the bug database. You're probably right. After doing a bit of digging it looks like it may do a better job. I just wish MySQL had a built-in feature for doing it. ;)

              • digital_sorceress says:

                The current system attempts to stop abuse by eliminating the highest and lowest votes.

                In your opinion, should the Bayesian average be used in combination with this filter, rather than replace it?

              • PoV says:

                @digital_sorceress The thing is, there really isn't an abuse problem. But all it took was 1 person giving 1111's to convince everyone there was one. That's why we don't show the results publicly anymore.

              • digital_sorceress says:

                I don’t think it was made clear that it was only one person, for one contest.

                I only thought there was an endemic abuse problem because of the explanation you gave us about the hi/lo vote filter being used! :P

                So… why make a habit of hi/lo filtering the votes? :confused:

              • PoV says:

                The high/low filter was an easy way to get less outlier-driven results than a straight-up average. It had the added benefit of dealing with the perceived "abuse" problem. Ultimately, the core issue is the lackluster data set we have to gauge the quality of an entry from. The improvements to the voting system seem to have dramatically improved the quantity and quality of votes, and we have a whole bunch of new data to criticize and critique.

                As I’m sure I’ve mentioned, running Ludum Dare is a hobby for us. We give up a lot of time and energy making sure the event runs regularly and smoothly. To make it better, we need to prioritize what we do next, and find the time to do it… finding the time being the hardest part. Most new features are done spontaneously, and they are almost always the simplest solution to a problem. While we can’t make everyone happy, we certainly strive to make the least number of people unhappy.

              • AlwaysGeeky says:

                POV, completely respect you and the team for all you do for LD. :) You guys are great.

                I don't mean any disrespect, and I'm not trying to impose my views on anyone; I'm just highlighting a potential flaw in the system and trying to suggest and show how an improvement could be made for the good of LD.

              • caranha says:

                Hey Pov, thanks for taking the time to talk to us here, being all busy and stuff.

                There are a number of people who would volunteer their time (I know I would) to run some statistical analysis on the voting data. Maybe you could make a fully anonymized database available (no game names or voter names) to allow us stat geeks to come up with more concrete ideas about how to deal with the wide variation in vote counts?

        • digital_sorceress says:

          I suggested last time PoV, instead of a numerical list, have groups:

          Top 2% – Gold medal
          Top 6% – Silver medal
          Top 10% – Bronze medal

          No matter how big LD grows, this system will absorb the error and instability that everyone here is concerned about.
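
          A quick sketch of how such percentile buckets could be assigned (the cutoffs follow the suggestion above; the 1-based rank and the tie handling are assumptions):

```python
# Bucket entries by percentile rank instead of publishing an exact position.
def medal(rank, total):
    percentile = rank / total        # rank is 1-based, 1 = best entry
    if percentile <= 0.02:
        return "Gold"
    if percentile <= 0.06:
        return "Silver"
    if percentile <= 0.10:
        return "Bronze"
    return None

print(medal(5, 1400))    # Gold
print(medal(70, 1400))   # Silver
print(medal(130, 1400))  # Bronze
```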

  8. SonnyBone says:

    The current system is the best one yet. Things have greatly improved, as my game was played and rated more times than ever before. With 1,400 entries… there are going to be issues.

    The new averaging system would be nice, and the groups that digital_sorceress suggested would also work.

    Either way, the current system is an improvement, and that means that LD is steadily evolving to meet the ever changing needs of the growing participation.

  9. madflame991 says:

    I got a score of 1.35 in Audio for a game with NO AUDIO. What's frightening is that I received 71 votes. If, say, 65 people rated 1 star for audio and 6 people rated 5 (5 is chosen so as to minimize the number of invalid votes), then my Audio score would be 1.33, which is almost what I scored (1.35). If the invalid voters gave me 3 stars, then the number of invalid votes would be higher (12 or so). 6 invalid votes out of 71 is almost 9%, and 12 out of 71 is almost 17%.

    Of course I still don't understand where N/A fits in all of this… (is it equivalent to 0? is it null?) In the calculations above I didn't account for N/A.

    The system is not perfect and there are abusive voters (this is nothing new). I have no idea how to refine it :(. It would be fair to have a group of judges rate the games, but at this volume you'd need a lot of time to rate all of them…

    • digital_sorceress says:

      The fraction with the smallest denominator that equates to 1.35 is 19/14

      i.e. a + b = 14, a + 2b = 19, so b = 5, a = 9

      9x 1 star (+1 for the one that was deleted) => 10 people gave you 1-star
      5x 2 stars (+1 for the one that was deleted) => 6 people gave you 2-stars
      everyone else gives you N/A = 71 – 10 – 6 => 55 people gave you N/A

      :)

  10. Volute says:

    Very interesting subject! A fair rating system is a real challenge.
    As for me, I'd suggest a voting system with several rounds, like there is for the choice of the theme.

    For example :
    1st round : all the games
    2nd round : the top 40% games of 1st round
    3rd round : the top 40% games of 2nd round

    For 1000 games :
    400 get to the 2nd round
    160 get to the 3rd round

    The votes are reset at each round.

    Of course, for this to work, the rating system would need some rethinking… For instance the overall rating could be an average of the other categories and not a separate rating. That way you’d have a more unified system to calculate the winners of each round.

    I think it would be a great system! It would renew interest in the competition several times during the judging period. People would come back to the website to see the results of each round, instead of rating a few games in the first days and then only coming back to rate a few more towards the end, with nothing in the 15 days in between.
    It would be so much more epic!

    I also think that with such a system, only the best games would be in the top of the competition, because in the last round, people would tend to rate with more objectivity as they should have seen a lot of games by now.

    If the current system stays, next time I can only think twice before promoting a game I liked, in a blog post for example. Maybe that game would be better off with no additional promotion? Maybe it got some very good ratings at the beginning of the competition and more could only hurt?

    I was happy with the reviews my game got on Rock Paper Shotgun and Indiegames.com, and that it got so many ratings (271), but maybe it was not such a good thing for the competition itself? Maybe the first few ratings were enthusiastic 5s and all the others were more reasonable ones ^^? Not that it really matters; like Pierrec says, I prefer my game to be played many times rather than be in the top X. But still, I would prefer to have my mind totally free of such tactical considerations in the next Ludum Dare ;)

    (and sorry for the bad English!)

    • ratboy2713 says:

      First off, I think the feedback is far more important than the ratings, but I can't think of a way to force people to give meaningful feedback.

      Anyway, I agree with a rounds system, though I would propose a "Winners Battle" style. Divide all the games up into sub-sections, and you have to play and rate the games in your sub-section. The person with the best rating advances to the winners round. If you didn't rate all the games, you can't advance. Then everybody gets to vote in the winners round, but there would be significantly fewer games everyone has to play.

      The overall number of games each participant has to play would be reduced, so hopefully people will do it as it would be less of a burden.

      This has the problem of each game getting fewer ratings unless it makes the winners round, which is unfortunate, but if we were to look at it with real numbers, 20 games per subset yields 70 subsets. Each game would have a minimum of twenty ratings, which is a little lower than the current average. And this leaves 70 games for the winners round, a much more feasible number for people to play.

  11. JonathanG says:

    Bear in mind that I don’t think https://docs.google.com/spreadsheet/ccc?key=0Ao74NZQqNUt5dDNvZUJ1UXVqZGkxUGVlVkxlZ3JnM2c#gid=0 corresponds to the actual final number of ratings each game had – although it’s a recent enough snapshot that it supports what you’re saying.

  12. Volute says:

    Another suggestion for a fairer way to rank the games would be something based on the ELO rating system.

    Whenever they feel like judging games, participants are presented with two randomly chosen games. They have to play the games and say which one is better in each category.
    They can also choose "tie" or ask for a new pair of games, if they can't play one of them for instance.

    It's the same system Mark Zuckerberg used for Facemash, as shown in the movie The Social Network, if that rings a bell. Or if you use FlashGameLicense, you may also be familiar with the system that helps developers evaluate the niceness of their game's icon. Every icon starts with 50% niceness, and then, based on the votes of the other users of the website, that percentage rises or drops. The amount of points lost or won after each "match" depends on the difference between the icon's niceness and the niceness of the icon it's confronted with.

    It seems easier to decide whether that game's graphics are better than this one's than to decide whether to rate this game's graphics 3 or 4. And if you can't decide, you say "tie" and the others will decide: your doubts won't influence the rating of the game.

    Promoting a game won't get it ratings anymore, but you can still get comments and the satisfaction of having people play your game (or the one you're suggesting).
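
    For anyone unfamiliar with it, here is a rough sketch of a single Elo-style update for one pairwise judgement. The K-factor and the 400-point scale are the standard chess defaults; how LD would seed ratings or pick the pairs is left open, so treat this purely as an illustration:

```python
# One Elo update: score_a is 1.0 if game A is preferred, 0.0 if B is, 0.5 for a tie.
def elo_update(r_a, r_b, score_a, k=32):
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # expected "win" probability for A
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# A judge prefers game A over game B, both starting at 1500:
print(elo_update(1500, 1500, 1.0))   # (1516.0, 1484.0)
```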

    • digital_sorceress says:

      The nature of chess is that there are only winners/ties/losers, and it is an objectively measured outcome. ELO is designed for that kind of “game”.

      In LD, things are much more subjective, and comparing one game against another isn’t so clear cut.

      It also takes a lot of games for ELO ratings to stabilize… hundreds of games are recommended before an ELO rating is calculated. In LD most people receive just 10-50 ratings each.

      While it’s an interesting idea, I don’t think ELO would be the most suitable system for LD.

      • Volute says:

        Earlier I gave two examples of non-chess implementations of the ELO rating system: Facemash and FGL's icon niceness service.

        But I agree with you, maybe it would take too many ratings for the system to stabilize, even though the ratings should be more evenly spread, since it would be more the system that decides which games people play.

        Why I think it could be a great system is that, when I look back at my rating experience during the judging period, I'm absolutely sure I gave 3s to games I would have given 4s at other times, after rating other games. And it doesn't please me to realise that: I feel biased. Obviously, the games we've seen (or not) (and other things) influence our perception of the new games we are presented with, and our rating scale tends to vary over time.
        That's a subjectivity I don't like.

        On the other hand, you often know intimately which game's graphics (or sound, etc.) you prefer. And I think that's much less likely to vary over time.

        When I participate in FGL's icon niceness rating system, all my decisions seem pretty obvious to me; I decide very quickly and ties don't happen often: I know which icon appeals more to me.
        I don't see why this system couldn't be applied to games, and I don't see why it would be bad that the games that appeal more to people should win.

        When I have to rate a game on any kind of scale, it takes me more time. I hesitate: "wait, I gave a 3 to that game, but I also gave a 3 to this other one that seems less good to me, and I gave a 4 to that one which is better but not that much better… Is that fair?".
        I often come back to change my rating and I don't feel any better afterwards. I feel biased.

        Of course I'm also biased when I say which game I prefer, but I don't feel biased in a bad way: I have just stated my own preference.

    • AlwaysGeeky says:

      This is EXACTLY what I am talking about and have been trying to get across all along… Thank you! :)

      “WRONG SOLUTION #2: Score = Average rating = (Positive ratings) / (Total ratings)”

      This is exactly the reason why the rankings for this particular competition all seem very wrong.

      This image sums up exactly the same thing that is rampant throughout the rankings chart for the compo this time: http://evanmiller.org/rating-amazon.png
      (See the stars and the number of ratings received)

  13. MadGnomeGamer says:

    I think having a ‘least rated’ filter in this LD was a big step towards counteracting the less-rated games having higher/lower scores thing.

    PS I personally seek out BOTH good and ok/bad looking games. I seek out bad-looking ones for several reasons:

    A) sometimes a real gem is hidden behind lousy graphics/unappealing screenshot
    B) someone with 3 comments is more likely to play my game in return for my feedback than someone with 100 comments
    C) I like to be different from everyone else!

  14. AlwaysGeeky says:

    Yes, your point in (B) is correct. And that should be a good thing… unfortunately though, having your game played by more people probably meant you got a lower overall ranking because of it… sorry dude.

    It's a sad shame that you would have had a better statistical chance of getting a higher ranking if fewer people had played/rated your game… Totally counter to what LD *should* be all about.

  15. ananasblau says:

    We should do away with the ratings and instead use Olympic-style medals. As a rater, you give out gold/silver/bronze/no medals and the receiver can print them all out if they want. This will easily mount up to a million medals :)

    And of course custom, obscure categories, so the 3D, Windows-only, source-only entries finally get some gold.
