BBO Discussion Forums: Statistical help wanted for web project - BBO Discussion Forums


Statistical help wanted for web project: football (soccer) sim

#1 User is offline   manudude03 

  • - - A AKQJT9876543
  • Group: Advanced Members
  • Posts: 2,615
  • Joined: 2007-October-02
  • Gender:Male

Posted 2015-January-23, 20:47

About 6 months ago I was working on a Football Manager style incremental game, which I then left on hold for various reasons. I've recently got back into it. I know there are at least a few statistics fanatics here on the forums, so I thought I'd give it a shot. I've been trying to get the game balanced and the scores realistic. Statistics was never my strong suit, but to try and explain the mechanics of a match in this game:

The ball starts in the middle of the pitch at kick-off and its position is determined by a variable T (territory) with an initial value of T=50. A goal is deemed to be scored if T goes below 0 or above 100, at which point it returns to T=50. There are 900 "phases" to the game (you can think of them as passes and dribbles, 900 being an arbitrary number which goes nicely into 90 mins), each of which changes T by the formula rand()*n-(n*M/2), where rand() is a random continuous number (I assume uniformly distributed) between 0 and 1 and M is a multiplier based on the ability of the teams. M can be calculated as 1+(r1-r2)*Q/(r1+r2), where r1 and r2 represent the ability levels of the 2 teams and Q is a limiter to prevent total imbalance (currently set at 1 so can be ignored). Putting this together, when the teams are evenly matched, M=1 and the change in T is a random number between -n/2 and +n/2. The higher M is, the more it favours the home side.

Just putting down those formulas again so they're not lost in the previous paragraph:
ΔT = rand()*n - n*M/2, where rand() is an RNG number between 0 and 1, n is the main number I need to find, and M is as below.
M = 1+(r1-r2)*Q/(r1+r2), where r1 and r2 are the ability ratings of team 1 and team 2, and Q is an arbitrary limiter (currently set to 1 to be irrelevant).
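
For anyone who wants to experiment, the mechanics described above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names, the use of Python's `random` module, and the choice to report goals by which end of the pitch was crossed (rather than by team) are not taken from the actual game code.

```python
import random

def multiplier(r1, r2, q=1.0):
    """M = 1 + (r1 - r2) * Q / (r1 + r2), as defined in the post."""
    return 1.0 + (r1 - r2) * q / (r1 + r2)

def simulate_match(n, r1=1.0, r2=1.0, q=1.0, phases=900, rng=None):
    """Return (goals_at_low_end, goals_at_high_end) for one match.

    Assumption: a goal is counted when T leaves [0, 100], and the ball is
    put back on T = 50 immediately, without using up a phase.
    """
    rng = rng or random.Random()
    m = multiplier(r1, r2, q)
    t = 50.0
    low = high = 0
    for _ in range(phases):
        t += rng.random() * n - n * m / 2.0   # delta T from the post
        if t < 0:
            low += 1
            t = 50.0
        elif t > 100:
            high += 1
            t = 50.0
    return low, high
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing different values of n.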

I managed to get hold of an Excel spreadsheet earlier tonight showing all English Premier League results this season up to last Monday (19th Jan) and found that in 220 games, there was a mean of 2.59 goals scored per game with a standard deviation of 1.635. I figure that if all teams had the same level of ability, then the mean and SD would be slightly lower, so I am aiming for a mean of 2.5 with an SD of 1.5. The last statistical test I ran on my code used a value of n=6 and after 30 games showed a mean of 4.97 and an SD of 1.847 with evenly matched teams.
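
As a rough check on how much a 30-game test can tell you, here is a small sketch of a 95% interval for the true mean. It uses the normal approximation (z = 1.96), which is an assumption on my part; with only 30 games a t-based interval would be slightly wider. The function name is illustrative.

```python
import math

def mean_confidence_interval(mean, sd, games, z=1.96):
    """Approximate 95% interval for the true mean (normal approximation)."""
    half_width = z * sd / math.sqrt(games)
    return mean - half_width, mean + half_width

# Figures quoted in the post: 30 simulated games at n = 6.
low, high = mean_confidence_interval(4.97, 1.847, 30)
print(f"true mean is probably between {low:.2f} and {high:.2f}")
```

With the quoted figures the interval runs from roughly 4.3 to 5.6, so the gap to the 2.5 target is well outside sampling noise even at 30 games.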

There are 3 statistical questions/problems that come with this (if needed, I am looking for a 95% confidence level)
1. Assuming the abilities were equal throughout the field, what value of n (and Q if necessary) allows for the 2.5 mean and 1.5 SD?

2. In practice, there will be a defensive bias as T tends towards the extremes. Assuming both sides play a 4-4-2 formation this leads to the following values of M in my code when Q=1:
When T<=10 or T>=90, M=1.69 (if T<=10) or 0.31 (if T>=90) (these will be symmetrical around 1)
When 10<T<30 or 70<T<90, M=1.6 or 0.4
Otherwise M=1.
How does this affect the value of n (and Q)?
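
The zone-dependent values of M listed in question 2 could be written as a lookup like the one below. It is a sketch only: the function name is made up, and treating T = 10 and T = 90 as part of the outer zones is my reading of the list.

```python
def zone_multiplier(t):
    """Return M for territory t using the Q = 1, 4-4-2 values quoted above."""
    if t <= 10:
        return 1.69
    if t >= 90:
        return 0.31
    if t < 30:
        return 1.6
    if t > 70:
        return 0.4
    return 1.0
```

Because the values are symmetrical around 1, `zone_multiplier(t) + zone_multiplier(100 - t)` comes to 2 for any t.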

3. In my lop-sided tests (M tending to 2 if Q=1 and the home side dominates), I found I was getting unrealistic scores (at n=6, Q=1 and r1=4*r2 there were constant scores of 25-0 and the like). Suppose I wanted the mean to be 10 goals if one side was crushing the other. What values of n and Q would be needed now, bearing in mind that I still want the 2.5 mean if the teams are even?
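
A back-of-the-envelope way to see why the lop-sided scores blow up is to look at the average movement per phase. Since rand() averages 0.5, the mean of rand()*n - n*M/2 is n*(1-M)/2. The sketch below turns that into a crude goal count by assuming the ball simply drifts at the mean rate; it ignores the random spread and the zone-dependent M from question 2, so it is a ballpark only, and the function names are illustrative.

```python
def mean_step(n, m):
    """Average change in T per phase: E[rand()*n - n*M/2] = n * (1 - M) / 2."""
    return n * (1.0 - m) / 2.0

def drift_only_goals(n, m, phases=900, distance=50.0):
    """Rough goals per match if the ball simply drifted at the mean rate."""
    return phases * abs(mean_step(n, m)) / distance

m = 1.0 + (4.0 - 1.0) * 1.0 / (4.0 + 1.0)   # r1 = 4 * r2, Q = 1 gives M = 1.6
print(mean_step(6, m), drift_only_goals(6, m))
```

For n=6 and M=1.6 the drift is about 1.8 units of T per phase, which on this crude estimate is around 32 goals in 900 phases, the same order of magnitude as the 25-0 scores mentioned.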

Thanks in advance for any help, it is 2:45am here, so time for bed I think. I'll get around to testing possible solutions when I wake up (each 30-game test will take about 20 mins real time).
ps. It will probably be a couple of weeks before the site is in a presentable form, so probably wouldn't present it until then.
Wayne Somerville
0

#2 User is offline   mgoetze 

  • Group: Advanced Members
  • Posts: 4,942
  • Joined: 2005-January-28
  • Gender:Male
  • Location:Cologne, Germany
  • Interests:Sleeping, Eating

Posted 2015-January-24, 03:08

Sorry for not commenting on the math at this time, can't think this early in the morning. ;)

From what I have read of soccer/football analytics, the number that correlates best with goals scored is shots on goal. Also, having watched a match or two in my lifetime (it's unavoidable here in Germany), I happen to know that football players don't dribble the ball into the goal. Therefore you might consider amending your simulation algorithm by determining, for each phase, whether the ball will be advanced normally or whether the team on offense has a goal opportunity. This should be more likely, of course, the closer they are to the goal. It will also eventually help you provide more statistical fluff in your simulation output (Player A scored from X feet out). [Naturally, if it is determined that there is no goal opportunity, you will have to give the defense a correspondingly higher advantage.]
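
A minimal sketch of that idea, in Python, might look like this. Every number here is hypothetical (the 5% maximum shot chance, the 30-unit shooting range, the 30% conversion rate), as are the function names; the only thing taken from the suggestion is that a shot becomes more likely the closer the ball is to goal.

```python
import random

def shot_chance(distance_to_goal, max_chance=0.05, shooting_range=30.0):
    """Hypothetical: chance of a shot this phase, rising linearly as the
    ball gets closer to goal and zero beyond shooting_range."""
    if distance_to_goal >= shooting_range:
        return 0.0
    return max_chance * (1.0 - distance_to_goal / shooting_range)

def resolve_phase(distance_to_goal, conversion=0.3, rng=random):
    """Return 'goal', 'miss' or 'advance' for one phase (illustrative)."""
    if rng.random() < shot_chance(distance_to_goal):
        return "goal" if rng.random() < conversion else "miss"
    return "advance"
```

When the result is "advance", the phase would fall through to the normal territory update.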
"One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision"
    -- Bertrand Russell
0

#3 User is offline   Mbodell 

  • Group: Advanced Members
  • Posts: 2,871
  • Joined: 2007-April-22
  • Location:Santa Clara, CA

Posted 2015-January-24, 05:19

If I understand right, I'd rewrite your first equation to be:

deltaT = n * (rand() - M/2)

And for equal-strength teams you'd want M to be 1 (which, for Q=1, is what any r1=r2 gives, e.g. r1=r2=0.5, which seems intuitive enough).

I think if I were simulating this I'd replace your M and M/2 with a log5-based system (named log5 by Bill James, the well-known baseball sabermetrics analyst), structured with r1 and r2 as strengths between 0 and 1 and the M' value as (r1*(1-r2)) / ((r1*(1-r2)) + ((1-r1)*r2)). Intuitively, r1 is the probability that team 1 wins against an average team and r2 is the probability that team 2 wins against an average team. The probability that team 1 beats team 2 is like repeatedly having each team play an average team, throwing out the times both win or both lose, and paying attention only to the times one wins and the other loses. So the numerator is the case where team 1 beats the average team (r1) and team 2 loses to it (1-r2), which is r1 * (1-r2). The denominator is every case where the results differ: that same r1 * (1-r2), plus the case where team 1 loses to the average team (1-r1) and team 2 beats it (r2), which is (1-r1) * r2. It might be notationally easier as W1 * L2 / ((W1 * L2) + (L1 * W2)), which is the same thing.
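
In code, the log5 formula above is a one-liner. This is a sketch in Python with an illustrative function name; it assumes both strengths are strictly between 0 and 1 so the denominator cannot be zero.

```python
def log5(r1, r2):
    """Probability that team 1 beats team 2, given each team's win
    probability (strictly between 0 and 1) against an average team."""
    return r1 * (1.0 - r2) / (r1 * (1.0 - r2) + (1.0 - r1) * r2)
```

Two equal teams give 0.5, and a 0.6 team against an average (0.5) team gives 0.6, as you would hope. One possible way to plug it in, which is my reading rather than anything stated above, is deltaT = n * (rand() - M'), since M' = 0.5 then reproduces the evenly matched case.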

Regardless of all that, in the equal strength case you have:

deltaT = n * (rand() - 1/2).

This is a random walk with steps of up to n/2 in either direction, and you want to know how often it travels 50 in either direction from the starting point.

This is pretty hard, I think. If the steps were always the same size, e.g. if rand() - 1/2 >= 0 then move +s, else move -s, it would be easier to calculate: the expected number of steps to hit 50 or -50 taking steps of size 1 is 2500. If you make the step size 10, the expected number of steps to reach 50 or -50 is 25. More generally, if the step size is s and the limit is + or - k*s, then the expected number of steps is k*k.
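
The fixed-step result is easy to check numerically. The sketch below (Python, illustrative names, arbitrary seed and trial count) compares the formula with a small Monte Carlo run; it assumes the limit is a whole multiple of the step size.

```python
import random

def expected_steps_fixed(limit, step):
    """Expected steps for a fair +/-step walk from 0 to reach +/-limit,
    assuming limit is a whole multiple of step: (limit / step) ** 2."""
    return (limit / step) ** 2

def average_steps_simulated(limit, step, trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        position, steps = 0, 0
        while abs(position) < limit:
            position += step if rng.random() < 0.5 else -step
            steps += 1
        total += steps
    return total / trials
```

For example, `expected_steps_fixed(50, 10)` gives 25, and `average_steps_simulated(50, 10)` should land close to that.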

If you want the average to be 2.5 goals in 900 steps, I think that maps to an expected 900/2.5 = 360 steps to hit your limits. So you want a fixed step size s with 50 = sqrt(360)*s, which gives s of about 2.635. You have a continuous uniform range rather than an exact step size. In order for your average step size to be 2.635, your range has to be twice that size in each direction. For example, steps spread evenly between -5 and 5 have an average size of 2.5. So that maps to an n of 10.54, which would hand you steps of between 5.27 and -5.27 depending on the rand value.
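
The arithmetic above can be reproduced in a few lines, alongside a second estimate for comparison. The second function is a swapped-in alternative, not the reasoning in the post: it matches the variance of the uniform step (n squared over 12) rather than its average size, using the common approximation that the expected number of steps to travel a distance d is about d squared divided by the step variance. Both ignore overshoot past the goal line and any phases used up by restarts, so they are starting points for simulation rather than answers. Function names are illustrative.

```python
import math

def n_from_mean_step(goals, phases=900, distance=50.0):
    """The post's route: unit step = distance / sqrt(phases per goal),
    then n = 4 * step because the mean |step| of a uniform draw on
    [-n/2, n/2] is n / 4."""
    return 4.0 * distance / math.sqrt(phases / goals)

def n_from_step_variance(goals, phases=900, distance=50.0):
    """Alternative: match the step variance n**2 / 12 instead, using
    expected phases per goal ~ distance**2 / variance."""
    return math.sqrt(12.0 * distance ** 2 * goals / phases)
```

For a 2.5-goal target the first gives about 10.54 and the second about 9.13, which brackets the region worth testing.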

But this suggests something is off with my understanding of your formula because you say with n=6 you were getting 4.97 mean, albeit in a small sample of 30 games.

When I simulate with 10.54, using my n * (rand() - 0.5), I get an average of about 3 goals. (Note that I'm using one of the 900 steps to go from a goal back to 50; I'm not sure if you are. So if on turn 298 I get to -0.1 from the positives, that is a goal, on turn 299 I put the ball at 50, and on turn 300 I calculate where the ball moves from 50.) If I instead set n = 10, I get an average of right around 2.5, maybe a hair under. The standard deviation for n=10 is 1.17, so a little less than you wanted. I ran three 45-game samples for my numbers, and the standard deviation is over all 135 games. For goals per game I got 2.45, 2.5, and 2.43 in the 3 samples, or 2.47 across the 135 games.
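
A sketch of this kind of batch test in Python is below. It is not the code actually used for the numbers above; the function names, the seed, and the `restart_cost` parameter are illustrative, with `restart_cost=1` standing in for the convention that putting the ball back on 50 uses up one of the 900 steps.

```python
import random
import statistics

def goals_in_match(n, phases=900, restart_cost=1, rng=None):
    """Total goals for evenly matched teams, with deltaT = n * (rand() - 0.5).

    restart_cost is the number of phases used to return the ball to T = 50
    after a goal (1 matches the convention described in the post).
    """
    rng = rng or random.Random()
    t, goals, phase = 50.0, 0, 0
    while phase < phases:
        t += n * (rng.random() - 0.5)
        phase += 1
        if t < 0 or t > 100:
            goals += 1
            t = 50.0
            phase += restart_cost
    return goals

def sample_stats(n, games=45, seed=0):
    """Mean and sample standard deviation of goals over a batch of games."""
    rng = random.Random(seed)
    results = [goals_in_match(n, rng=rng) for _ in range(games)]
    return statistics.mean(results), statistics.stdev(results)
```

Calling `sample_stats(10)` with a few different seeds gives a feel for how much a 45-game sample moves around. A larger `restart_cost` would be one way to model phases lost after each goal.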
0

#4 User is offline   helene_t 

  • The Abbess
  • Group: Advanced Members
  • Posts: 17,221
  • Joined: 2004-April-22
  • Gender:Female
  • Location:Copenhagen, Denmark
  • Interests:History, languages

Posted 2015-January-25, 07:27

If this is a commercial project you might consider purchasing data from www.optasports.com/Football_Stats

It is quite expensive but very extensive. You can buy, for example, the time and coordinates of every pass (successful or otherwise) or tackle from the Premier League in a particular season. At Buzz Sports we used those data to train models that could predict specific events (goals, corners etc.) from coordinates, possession and the Betfair price of the match.
The world would be such a happy place, if only everyone played Acol :) --- TramTicket
0

#5 User is offline   Fluffy 

  • World International Master without a clue
  • Group: Advanced Members
  • Posts: 17,404
  • Joined: 2003-November-13
  • Gender:Male
  • Location:madrid

Posted 2015-January-25, 08:26

Maybe you are just missing stoppage time (phases lost between goals for celebrations)
0

#6 User is offline   Trinidad 

  • Group: Advanced Members
  • Posts: 4,531
  • Joined: 2005-October-09
  • Location:Netherlands

Posted 2015-January-25, 14:59

I always saw a goal scored as a Poisson Event.

Since, on average, about 4-5 goals are scored in a game, it is entirely possible for the weaker team to win. The stronger team was expected to win 3-2, but they scored fewer than expected, and the weaker team happened to score more than expected.

If I didn't have anything better to do, I would calculate the average goals per game for each team in a competition and see if I can find a function that predicts the score of a match between two teams based on the average goals per team. I would correct for home field advantage and perhaps a few other factors, but that would be it.
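
As a sketch of what a Poisson-based scoreline could look like in Python: each side's goals are drawn from a Poisson distribution with its own expected value. Treating the two counts as independent is an assumption of this sketch, the function names are illustrative, and the draw uses Knuth's simple multiplication method, which is fine for the small rates involved.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_score(lam_home, lam_away, rng=None):
    """Return (home_goals, away_goals) with independent Poisson counts."""
    rng = rng or random.Random()
    return poisson(lam_home, rng), poisson(lam_away, rng)
```

For example, `simulate_score(1.5, 1.1)` (both rates made up for illustration) returns one scoreline, and upsets fall out naturally whenever the weaker side's draw happens to be the larger one.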

Rik
I want my opponents to leave my table with a smile on their face and without matchpoints on their score card - in that order.
The most exciting phrase to hear in science, the one that heralds the new discoveries, is not “Eureka!” (I found it!), but “That’s funny…” – Isaac Asimov
The only reason God did not put "Thou shalt mind thine own business" in the Ten Commandments was that He thought that it was too obvious to need stating. - Kenberg
1

#7 User is offline   helene_t 

  • The Abbess
  • Group: Advanced Members
  • Posts: 17,221
  • Joined: 2004-April-22
  • Gender:Female
  • Location:Copenhagen, Denmark
  • Interests:History, languages

Posted 2015-January-26, 06:13

Trinidad, on 2015-January-25, 14:59, said:

I always saw a goal scored as a Poisson Event.

Since, on average, about 4-5 goals are scored in a game, it is entirely possible for the weaker team to win. The stronger team was expected to win 3-2, but they scored fewer than expected, and the weaker team happened to score more than expected.

If I didn't have anything better to do, I would calculate the average goals per game for each team in a competition and see if I can find a function that predicts the score of a match between two teams based on the average goals per team. I would correct for home field advantage and perhaps a few other factors, but that would be it.

Rik

I did just that when I worked for Buzz sports.

If the expected goals for the two teams are lambda1 and lambda2, log(lambda1) and log(lambda2) follow a bivariate normal distribution with a covariance (at least in the English Premier League, might differ in other leagues) of approximately -0.3.
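
For a made-up league with no data behind it, one could draw pairs of expected goals from such a distribution. The sketch below does that in Python with a hand-written 2x2 Cholesky factor. Only the covariance of -0.3 comes from this post; the means and variances are hypothetical placeholders chosen so that the covariance matrix is valid, and the function names are illustrative.

```python
import math
import random

def sample_log_rates(mean1, mean2, var1, var2, cov, rng):
    """Draw (log lambda1, log lambda2) from a bivariate normal via a
    hand-written 2x2 Cholesky factor."""
    a = math.sqrt(var1)
    b = cov / a
    c = math.sqrt(var2 - b * b)   # requires cov**2 <= var1 * var2
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return mean1 + a * z1, mean2 + b * z1 + c * z2

def sample_rates(rng, mean1=0.3, mean2=0.1, var1=0.4, var2=0.4, cov=-0.3):
    """Hypothetical means/variances; only cov = -0.3 comes from the post."""
    x, y = sample_log_rates(mean1, mean2, var1, var2, cov, rng)
    return math.exp(x), math.exp(y)
```

The negative covariance means that a match where one side's expected goals are high tends to be one where the other side's are low.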

The Betfair price is quite a good predictor of lambda1 and lambda2. If you know the covariance and the global mean you can calculate them from the Betfair price. You might think that the Betfair price for the number of goals scored in the match would add useful information, but that doesn't seem to be the case. There is not much of a seasonal effect or effect of home team advantage (beyond what is already taken into account by Betfair) either.

Obviously the English betting market is very large and it may well be that betting markets are less accurate in other countries or in lower leagues.
The world would be such a happy place, if only everyone played Acol :) --- TramTicket
0

#8 User is offline   manudude03 

  • - - A AKQJT9876543
  • Group: Advanced Members
  • Posts: 2,615
  • Joined: 2007-October-02
  • Gender:Male

Posted 2015-January-26, 08:32

I did some tinkering in the code and found that setting n=4.4 and Q=0.5 got me a mean of 2.4 with an SD of around 1. I'm sticking with that for now while I deal with other issues on the site (though I'm up for ideas). This project uses made-up teams and players, so I don't exactly have stats. I realised that Q couldn't realistically be 1: if, for example, the home team dominates the away team, then dT would effectively just be equal to rand()*n and would hardly ever be pushed back. With Q=0.5, dT becomes n*(rand()-1/4), which at least gives the other side a 25% chance of getting the better of a particular play. I suspect even Q=0.5 is too high for these purposes, but I think I have more important things to fix at the moment.
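
To see how Q caps the imbalance, the sketch below works out M in the most extreme case (one rating at zero) and the chance that a single phase goes against the dominant side. The function names are illustrative. Which end of the pitch counts as "home" is not something the formula itself settles, so the sketch only looks at the sign of the step; the mirror-image case gives the same percentages.

```python
def multiplier(r1, r2, q):
    """M = 1 + (r1 - r2) * Q / (r1 + r2)."""
    return 1.0 + (r1 - r2) * q / (r1 + r2)

def chance_step_is_positive(m):
    """P(rand() * n - n * M / 2 > 0) = 1 - M / 2, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - m / 2.0))

for q in (1.0, 0.5, 0.25):
    m = multiplier(1.0, 0.0, q)          # extreme case: r2 = 0
    print(q, m, chance_step_is_positive(m))
```

At Q=1 the weaker side never wins a phase in this extreme case, at Q=0.5 it wins 25% of them (the figure quoted above), and at Q=0.25 it wins 37.5%.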
Wayne Somerville
0

#9 User is offline   Fluffy 

  • World International Master without a clue
  • Group: Advanced Members
  • Posts: 17,404
  • Joined: 2003-November-13
  • Gender:Male
  • Location:madrid

Posted 2015-January-29, 06:45

I guess you already programmed it, but the quality of one team over the other should depend on where the ball is.
0
