(June 2007)

Engineers and scientists

I know it is bordering on sacrilege, but I am one of those engineers that - on occasion - fancy themselves to be "sort of" scientists as well. At work, such megalomaniacal tendencies quickly vanish; during my free-time, however, if the conditions are right, they can run amuck ;‑)

I have experienced such a "condition" recently, so I documented its stages - for fear of relapse :‑)

Last Saturday, my older brother asked me about the probability of "something" related to a lottery that is played here in Greece. The lottery is called "Joker" (and no, it has nothing to do with the Bat mobile). He described it to me as a set of balls with numbers from 1 to 40; I was told that they are thoroughly mixed in each lottery, and that 5 of them are randomly picked. Then another number (the "Joker") is randomly picked from a separate set of balls numbered from 1 to 20. If you have predicted all 6 of them, you instantly become a millionaire.

"What is the chance that the joker is equal to one of the 5 others?", I was asked.

It's been 15 years since I passed the probability and statistics exam in the University, but... like I said: "relapse" :‑)

In theory...

I vaguely remembered that there was a way that made calculations of probabilities easier - ah, yes, it was when the phenomenon constituted of "logical ANDs"!

When we calculate the odds for a logical AND of events, the overall probability of the combined event is the product of the probabilities of the events. At least, that's how I remember it...

P(A and B and C and D and E) = P(A) x P(B) x P(C) x P(D) x P(E)
In this case, how can I form a "logical AND"? Err... Therefore, the event "A and B and C and D and E" is the logical NOT of the event I want to measure: ALL 5 numbers not equal to the joker, is the logical negation of "any one of them - or even more than one of them - equal to the joker". If I calculate the probability P of this event, then the probability asked by my brother is the "logical negation", 1-P. And the probability of this event (A and B and C and D and E) if memory serves, is the product of the 5 probabilities for the 5 events, A to E. Sounds like a plan.
    p(A) = 39/40
    p(B) = ...
Now, the first one (A) is easy. There are 40 balls, I get one - and the joker is somewhere between 1 and 20. Chances of not hitting that "target" ball are therefore 39 in 40.

The second one (B) is a tad more difficult, though: Event B says that the second ball is not equal to the joker - this means that

either event (A) took place and now, event (B) takes place
    or
event (A) didn't take place and now, event (B) takes place
or, in simpler terms,...
either
(B1) the 1st ball was not the joker and neither is the 2nd one
or
(B2) the 1st ball was equal to the joker and the 2nd one is not
Let's see...
B can only happen if one of (B1, B2) takes place, so if I remember correctly...
P(B) = P(B1 or B2) = P(B1) + P(B2) - P(B1 and B2)
However, B1 and B2? That can't happen - either event (A) takes place, or it doesn't. P(B1 and B2) = 0.
Which means that all I need is find out P(B1) and P(B2).
P(B1) = P(1stEscapesJoker) x 
  P(2ndEscapesJoker given that 1stAlsoEscaped) = 39/40 x 38/39
The second term is 38/39, not 38/40 and not 39/40: the 5 chosen balls are all different; now that we've picked the first one and it was not equal to the joker, only 38 out of the remaining 39 are not equal to neither the joker nor the first one.

Similarly,...

P(B2) = P(1stIsJoker) x 
  P(2ndEscapesJoker given that 1stIsJoker) = 1/40 x 39/39
The second term here is "absolute certainty": with the first ball equal to the joker, the second one can not possibly be equal to the joker (since it can not be equal to the first ball!). All remaining 39 balls are therefore not equal to the joker.

So, what is P(B)?

    P(B) = P(B1) + P(B2) - P(B1 and B2)  =

      39   38      1   39
    = -- x --  +  -- x --  -  0     =
      40   39     40   39

      38    1     39
    = -- + --  =  --
      40   40     40
What do you know! P(A) = P(B)!
The chances of the second ball "escaping" the joker are the same as the chances of the first ball! Hmm...

Let's do it once more, for P(C):

    P(C) = P(third ball not equal to the joker) =
         = P(C1 or C2 or C3) =
         = P(C1) + P(C2) + P(C3) - 
	    - P(C1 and C2) - P(C2 and C3) - P(C1 and C3) +
            + P(C1 and C2 and C3)

Where:
    C1 = first escapes, second escapes, third escapes
    C2 = first escapes, second matches, third escapes
    C3 = first matches, second escapes, third escapes

and therefore,...

            39   38   37     37
    P(C1) = -- x -- x --  =  --
            40   39   38     40

            39    1   38      1
    P(C2) = -- x -- x --  =  --
            40   39   38     40

             1   39   38      1
    P(C3) = -- x -- x --  =  --
            40   39   38     40

which, finally, leads us to...

    P(C) = P(C1) + P(C2) + P(C3) - 
	    - P(C1 and C2) - P(C2 and C3) - P(C1 and C3) +
            + P(C1 and C2 and C3) =
         
             37       1       1
         =   --  +   --  +   --  - 0 - 0 - 0 + 0  = 
             40      40      40

             39
         =   --
             40
At this point, it indeed appears that the chances of escaping the Joker are constant (39/40) regardless of which one of the 5 balls we are looking at. Think about it: even if we were picking 40 balls instead of 5, the 40th ball has the same overall chance of escaping the joker as does the first one! Not immediately apparent, but makes sense if you think about it...

So, what is the chance of ALL 5 balls escaping the joker (the "logical AND")? The product of the 5 probabilities, so...

                          39  5
    P(all 5 escaping) = ( -- )
                          40
...and the chances of one of them being equal to the joker, is therefore,
P(one or more of them equal to joker) = 
    1 - P(all 5 escaping) = 1 - (39/40)^5 = .1189043
Looks good. Illusions of grandeur rapidly manifesting :‑)

Now let's put the theory to the test - Python to the rescue!

In practice...

Let's first see if we can verify this for 2 balls (instead of 5):
#!/usr/bin/env python

success = 0
tries = 0
for joker in xrange(1, 21):            # From 1 to 20
    for first in xrange(1, 41):        # From 1 to 40
        for second in xrange(1, 41):   # From 1 to 40
            if first == second:
                continue         # Ignore invalid lotteries
            if first != joker and second != joker:
                success += 1
            tries += 1

print float(success)/tries
We expect this to print...
P(A and B) = P(A) x P(B) = (39/40)^2 = 0.950625

But...

bash$ python experiment1.py
0.95
Why?

Don't read further down immediately - think this through, reread... treat this as a puzzle!

No, it is not a Python "floating point precision" issue...

Let's try to see what Python gives us for each of P(A) and P(B) (i.e. the chance of the first one escaping the joker, and - separately - the chance of the second one escaping the joker):

#!/usr/bin/env python

success1 = 0
success2 = 0
tries = 0
for joker in xrange(1, 21):
    for first in xrange(1, 41):
        for second in xrange(1, 41):
            if first == second:
                continue
            if first != joker:
                success1 += 1
            if second != joker:
                success2 += 1
            tries += 1

print float(success1)/tries
print float(success2)/tries
Well...
bash$ python experiment2.py 
0.975
0.975
At least this one looks good (0.975 = 39/40). Both the first one and the second one have the same chance of escaping the joker, exactly 39/40. Not all our theory is wrong... We correctly calculated the probabilities of each one of the 5 balls escaping the joker. But... why is the end result for the overall question different?

Hmm...

Science and engineering make up...

The answer given by Python after the first experiment - the chance of both the first and the second balls escaping the joker - was 0.95. We begun our theoretical calculation saying that...
    P(A and B) = P(A) x P(B) = (39/40)^2 = 0.950625
In other words, the real world experiment tells us that our theoretically calculated probability is slightly higher than the real one. Why, though? Isn't the probability of the logical AND of two events the product of their probabilities?

Time to dig up the textbooks... (boxes, dust, cough)

There, I found it...

P(A and B) = P(A) x P(B) if events A and B are independent

Oops - after 15 years I knew I would have missed something.

Independent?

They are most definitely NOT independent! Whether the second one escapes the joker VERY MUCH DEPENDS on whether the first one did! If the first one was equal to the joker, the second one SURELY will NOT (since it cannot be equal to the first one)! Don't get confused by the fact that we calculated the probabilities of each event and found them to be the same; the important point is that the outcome of the first event has implications on the second one.

Now what?

Wait a minute... By knowing P(A) and P(B), we also know P(not A) and P(not B)...

    P(first  is equal to joker)  = P(not A) = 1-P(A) = 1/40
    P(second is equal to joker)  = P(not B) = 1-P(B) = 1/40
... and therefore ...
P(1stIsJoker or 2ndIsJoker) = 
    P(1stIsJoker) + P(2ndIsJoker) - P(1stIsJoker and 2ndIsJoker)
The last term (1stIsJoker and 2ndIsJoker) can't happen - the first and second ball are never equal!

Finally, I got it!

    P(1stIsJoker or 2ndIsJoker) = 
	P(1stIsJoker) + P(2ndIsJoker) = 1/40 + 1/40 = 2/40
And yes, ...
    P(both first and second escape) = 
	1-P(1stIsJoker or 2ndIsJoker) = 1-(2/40) = 0.95
... as experiment1 showed.

Engineering: Bulldozing around till you get your answer...

Wait a minute! Looking back at all the things I used, I realized I run past the one - and only one - I needed!...
When I was calculating B1...
    P(firstEscapes and secondEscapes) = P(B1) = 
    
    (see above in Theory section)

                  39   38     38
                = -- x --  =  --  =  0.95
                  40   39     40 

... I could have just extended this to get to the result easily...
With this alone, I could have seen that with 2 balls, the probability of joker "hitting" one of the two balls is 2/40 (1-(38/40)). For the complete game (5 balls) as questioned by my brother, the chance is 5/40 (or 1/8), since...
p(1st and 2nd and 3rd and 4th and 5th escape the joker) =

   p(1stEscapes) x 
    p(2ndEscapes given 1stEscaped) x 
     p(3rdEscapes given 1stAnd2ndEscaped) x
      p(4thEscapes given 1stAnd2ndAnd3rdEscaped) x 
       p(5thEscapes given 1stAnd2ndAnd3rdAnd4thEscaped) =
  
                  39   38   37   36   35     35  
                = -- x -- x -- x -- x --  =  -- 
                  40   39   38   37   36     40

    and thus, the answer to our problem is... 
                                                    35    5
        P(one of the 5 is equal to the joker) = 1 - -- = --
                                                    40   40
Based on the above, I advised my brother as follows: P.S. The complete simulation of a lottery with all 5 balls is too slow to execute with Python. I wrote the loops in C and verified my results in just a couple of minutes. I then realized that the complete simulation taking minutes in my machine would take decades to execute on my first computer, a Z80 based Sinclair Spectrum... Thank God I live in the 21st century; I can use machines to "pass as a scientist" - on occasion...

profile for ttsiodras at Stack Overflow, Q&A for professional and enthusiast programmers
GitHub member ttsiodras
 
Index
 
 
CV
 
 
Updated: Sat Mar 8 22:58:16 2014