Thursday, March 15, 2012

Who Will Be in the Final Four? – Lessons from an Analytical Journey

The NCAA tournament usually sneaks up on me, to the point where I am filling out my brackets on Thursday morning having given little thought to any sort of rational strategy.
This year my thought was to run a Monte Carlo analysis (surprise, surprise) of different strategies and use the results to select picks for my brackets. The following is an account of this “analytical journey,” with some lessons about analysis thrown in along the way.

Data Please!
Figure A
The first step in an analytical process is data: without numbers to crunch, the analysis is called “qualitative” rather than “quantitative.” Qualitative analysis is appropriate in many situations, but a Monte Carlo simulation needs parameters that can vary, and for that we need numbers.
From Shawn Seigel, I found a data set showing survivorship for each seeded team in each round of the tournament. Figure A portrays the initial data, which shows the number of times each seed has advanced from one round to the next.
This information did not load nicely into Excel and required manual entry.
Lesson #1 – Data is rarely in the form required, and time will be spent preparing it.
By dividing the current round’s survivors by the previous round’s, we can arrive at a probability of victory. For example, in Figure A, 96 of the #2 seeded teams survived the first round, and of those 64 survived the second, for a probability of 64/96 = 2/3.
While these probabilities allow us to calculate an expected value for arriving at the Final Four, this does not serve the Monte Carlo objective, since it is a static calculation with no parameters to vary from one iteration to the next.
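As a rough sketch of that static calculation (Python here for illustration; the post’s work was done in Excel and R), chaining the round-by-round probabilities gives a seed’s expected chance of reaching the Final Four. The 96-to-64 second-round figure is from Figure A; the other counts are placeholders, not the actual data:

```python
# Hypothetical survivor counts for the #2 seed entering each round:
# [entered tournament, survived Round 1, survived Round 2, survived Round 3]
# Only the 96 -> 64 second-round step comes from Figure A.
survivors = [100, 96, 64, 40]

# Win probability per round = current round's survivors / previous round's
win_probs = [survivors[i + 1] / survivors[i] for i in range(len(survivors) - 1)]

# Chance of surviving every listed round = product of the round probabilities
p_reach = 1.0
for p in win_probs:
    p_reach *= p

print(win_probs)  # second-round probability is 64/96, i.e. 2/3
print(p_reach)    # a static expected value, with nothing left to vary
```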
Lesson #2 – To achieve our analytical objective, we often need to go beyond the initial data.

Let’s Start Crunching Data!
Figure B
For our NCAA survivor dataset, we can use the binomial distribution to create a varying parameter, namely the probability itself. This distribution tells us that, given a certain number of trials (n) and an individual event probability (p), we can ascertain the probability of each possible number of occurrences.
For example, Figure B shows how often heads shows up when we flip a coin 10 times (i.e., n = 10, p = 0.5). Reading from it, the chance of getting exactly 5 out of 10 heads is almost 25%, even though 5 is the expected value, while 4 heads and 6 heads (close to the expected 5) each occur slightly over 20% of the time. Thus, approximately 2/3 of the time we will get 4-6 heads if we flip a coin 10 times (my apologies to the academically oriented: 10 flips of a fair coin).
Figure C
The benefit is that, since we want events to vary, the distribution lets us define bounds within which the variation may occur. Using the Figure B data, if we want to capture the outcomes that will occur 95% of the time, the bounds lie between the 2-8 heads range (about a 98% chance) and the 3-7 heads range (about an 89% chance).
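These figures are easy to verify with a short Python sketch using only the standard library (a worked check of the coin-flip numbers above, not part of the original analysis):

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5  # ten flips of a fair coin

print(binom_pmf(5, n, p))                            # exactly 5 heads: ~24.6%
print(sum(binom_pmf(k, n, p) for k in range(4, 7)))  # 4-6 heads: ~65.6%, about 2/3
print(sum(binom_pmf(k, n, p) for k in range(2, 9)))  # 2-8 heads: ~97.9%
print(sum(binom_pmf(k, n, p) for k in range(3, 8)))  # 3-7 heads: ~89.1%
```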
Using this method, Figure C shows the survivorship given a high and a low bound (equivalent to the 95% chance) for each of the first four rounds of the tournament. In Round One the results are fairly linear. This is as expected: since a #1 seed plays a #16, a #2 plays a #15, etc., the probabilities for the top eight seeds should be a mirror image of the bottom eight. The orange line in the graph depicts a straight 45 degree line, with which the results tend to agree.
Problems begin to develop in Round Two (and get worse thereafter). The drop-off is steeper than the 45 degree line, suggesting that teams below the #1 seed are much more equal in strength than their seed rankings suggest. Furthermore, the #10-12 seeds show a higher probability of winning their second round game than the #7-9 seeds. Does this make sense?
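Bounds like those in Figure C can be produced along these lines (a sketch; the 96 games and 2/3 win rate below reuse the #2 seed example from earlier as a stand-in for the full data):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def central_bounds(n, p, coverage=0.95):
    """Trim the least likely tail outcomes until just before total
    probability would drop below `coverage`; return (lo, hi) win counts."""
    probs = [binom_pmf(k, n, p) for k in range(n + 1)]
    lo, hi, mass = 0, n, 1.0
    while lo < hi:
        trim = min(probs[lo], probs[hi])
        if mass - trim < coverage:
            break
        if probs[lo] <= probs[hi]:
            mass -= probs[lo]
            lo += 1
        else:
            mass -= probs[hi]
            hi -= 1
    return lo, hi

# A seed with 96 second-round games and a 2/3 historical win rate
lo, hi = central_bounds(96, 2 / 3)
print(lo / 96, hi / 96)  # low and high survivorship bounds as rates
```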
Lesson #3 – Think about what your data is telling you.

Ring-Ring… Ring-Ring… Data Here… I’m Telling You Something!
When we think about the structure of the NCAA tournament, we can see why this is the case. The team seeded #8 plays the team seeded #9 in the first round, so one of those two will survive to the second round. However, in that round the survivor will in all likelihood face the #1 seed (to date the #1 seeds have 100% survivorship in Round One). Is it any wonder, then, that a #8 or #9 seeded team does not have a very good Round Two record?
Compare that against a #11 seeded team, which will face either the #3 or the #14 seed. Looking at Figure A again, we see that the #14 seed has made it to Round Two 15 times and the #3 seed 85 times. Thus, 15% of the time the #11 seeded team is ranked higher than its opponent.
The fact that different teams face different odds as the tournament progresses is what is known as path dependence. When data exhibits path dependence, we need to adjust our analytical methods to account for it.
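A small sketch of what path dependence does to a second-round estimate: the 85/15 opponent split is from Figure A, while the win probabilities against each opponent are hypothetical placeholders:

```python
# The #11 seed's second-round chances depend on which opponent survived
# the #3 vs. #14 game. Opponent mix from Figure A: 85 games vs. the #3,
# 15 vs. the #14. Win probabilities vs. each opponent are hypothetical.
opponent_mix = {"seed_3": 85 / 100, "seed_14": 15 / 100}
win_prob_vs = {"seed_3": 0.35, "seed_14": 0.65}  # placeholders

# Blended, path-dependent probability of the #11 winning Round Two
blended = sum(opponent_mix[o] * win_prob_vs[o] for o in opponent_mix)
print(blended)  # 0.85 * 0.35 + 0.15 * 0.65 = 0.395
```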
Lesson #4 – The data will throw you curve balls that you will need to approach from a different angle or a new direction.

More Data, Please!
Fortunately, seed vs. seed data was available here, though as per Lesson #1 this data again required significant processing to get it into a format compatible with Excel and with R (a statistical software package). Curiously, this data set did not show up in my first search; looking into why, I discovered that I had used the phrase “seed by seed” in the second search, whereas I had used “seed by round” in the first.
Figure D
Lesson #5 – When searching for data, take a multiple-query approach, because small distinctions will matter.

Round by Round
A #1 seed, after advancing to the second round (which has occurred 100% of the time to date), will play either the #8 or the #9 seed. The top panel of Figure D shows the probability of victory along with confidence intervals around that probability; the orange line depicts 50% probability. The bottom panel shows the number of games played against each seed. In this case, it is fairly even whether the #1 will play a #8 or a #9 seed in the second round.
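The post does not spell out how the Figure D confidence intervals were constructed; one standard choice for a binomial win probability is the Wilson score interval, sketched here with a hypothetical record:

```python
from math import sqrt

def wilson_interval(wins: int, games: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half

# Hypothetical: a #1 seed that has gone 40-8 against #8 seeds in Round Two
print(wilson_interval(40, 48))  # roughly (0.70, 0.91)
```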
Figure E
In the third round, the #1 seeds continue to dominate, as shown in Figure E. The probability of victory begins at 62% against #4 seeds and continues upward from there. Problems begin to emerge here, however, as the sample sizes become smaller. The results for the #12 and #13 seeds are the most problematic: at the historical 100% chance of victory, there is no variability with which to simulate through the Monte Carlo model.
Figure F
In the fourth round, #1 seeds have played a few games against #7, #10 and #11 seeds. Figure F shows the fourth-round data, and it looks strange. There does not appear to be a logical reason why the probability of victory against a #11 should be so low compared to the others. However, the number of times this matchup has occurred gives us a clue as to why: a #1 has played a #11 only 5 times, which is certainly not a sufficient sample to establish a credible estimate of the true probability.
The problem of low sample sizes preventing reliable estimates becomes greater as we evaluate lower seeded teams. Figure G has the results for the #4 seed in the fourth round. Taken at face value, they are expected to do better against a #2 than a #3, and have only a 50% chance of beating a #7. But again, the game counts show the weakness in these conclusions: the #4 has not played more than 10 games against any of these opponents.
Figure G
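The same Wilson interval from above shows just how little five games pin down (the 3-5 record is illustrative; the 5-0 record mirrors the #12/#13 situation in Round Three):

```python
from math import sqrt

def wilson_interval(wins, games, z=1.96):
    p = wins / games
    d = 1 + z**2 / games
    c = (p + z**2 / (2 * games)) / d
    h = z * sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / d
    return c - h, c + h

# Five games, say 3 wins (illustrative): the interval is far too wide to use
print(wilson_interval(3, 5))  # roughly (0.23, 0.88)

# A perfect 5-0 record: the point estimate is 100%, leaving nothing to vary
# in a simulation, even though the true probability is hardly pinned down
print(wilson_interval(5, 5))  # roughly (0.57, 1.00)
```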
To fill in these gaps, we can do a number of things. For our #4 seed, we could combine the individual seed results into groups, such as #2 with #3, and #6, #7 and #11 together. For our #1 seed in the fourth round, we could combine the results of games against the #7, #10 and #11 seeds, in which case we would have an 80% probability of victory. This is in line with the slope of the graph in Figure F.
Combining seeds into “buckets,” while increasing our estimation power, presents the problem of mapping the combined results back to the individual seeds. If the #1 plays a #7 rather than a #11 in the fourth round, how much are we going to “shade” our 80% probability? Presumably we would assign the #7 a lower-than-80% chance, say 75%, the #10 the 80% average, and the #11 a slightly higher-than-average 85%. From a purely analytical point of view this method is somewhat arbitrary, though it might appear reasonable.
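In code, the bucket-and-shade idea might look like this (the game counts are hypothetical placeholders chosen to produce the 80% pooled rate from the text; the 75/80/85 shading is the text’s own example):

```python
# Pool the #1 seed's sparse fourth-round games vs. #7, #10 and #11 into one
# bucket, then shade the pooled estimate back to the individual seeds.
games = {"7": (4, 5), "10": (8, 10), "11": (4, 5)}  # seed: (wins, played)

wins = sum(w for w, _ in games.values())
played = sum(n for _, n in games.values())
pooled = wins / played
print(pooled)  # 16 / 20 = 0.80 pooled probability for the bucket

# Shade around the pooled rate by opponent strength (the text's example values)
shade = {"7": -0.05, "10": 0.0, "11": +0.05}
adjusted = {seed: pooled + shade[seed] for seed in games}
print(adjusted)  # approximately {'7': 0.75, '10': 0.80, '11': 0.85}
```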
Given the sheer magnitude of adjustments that would be needed (there are seed vs. seed combinations that have never occurred, for which we would still need to create variables), and the underlying uncertainty surrounding the reliability of these assignment methods, establishing the parameters to run a Monte Carlo would be a very time consuming process, and the results questionable. For that reason, we are going to abandon our Monte Carlo objective.
Lesson #6 – We must be willing to revise our objectives and our approach.
Lesson #7 – Much as we might like, a Monte Carlo is not always possible.

So Who To Pick?
Given that most brackets award higher points as the tournament progresses, we go back to Figure A and note that over 90% of the teams that win the Finals, Final Four and Elite Eight games are top-4 seeds, and that the #1 seeds are more than twice as likely to win as the next best seed (which in all cases was the #2).
Based on this, going with the #1 seeds in the Final Four is the best shot you can take to win.
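To put rough numbers on that recommendation, here is a toy expected-points calculation; the reach probabilities and the scoring value are illustrative, with the #1 seed set at more than twice the #2’s chance per the Figure A pattern:

```python
# Expected Final Four points for one pick = P(seed reaches Final Four) x points.
# Both the reach probabilities and the 8-point round value are illustrative.
FINAL_FOUR_POINTS = 8
p_reach = {"#1 seed": 0.40, "#2 seed": 0.18}  # hypothetical; #1 > 2x the #2

for seed, p in p_reach.items():
    print(seed, p * FINAL_FOUR_POINTS)  # 3.2 expected points vs. 1.44
```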
Lesson #8 – Simple is oftentimes better.

Key Takeaways
There are many lessons to be learned here about statistical analysis, and they can be loosely classified into the following axioms: data availability and usability is a critical factor; appreciating the limits of what the data can tell us is required; and we must be flexible in our approach to the analysis, its objectives and its results.
Questions
· Who are your Final Four picks?

Add to the discussion with your thoughts, comments, questions and feedback! Please share Treasury Café with others. Thank you!

3 comments:

  1. My favorite (and in my opinion the most critical) points are 3, 4, and 5. Many people tend to see all data as empirical and unequivocal when this is very simply not the case. Unless the data comes as a result of a very carefully conducted and authoritative scientific study with a large (and properly chosen) sample pool, it's probably ambiguous to an extent in one way or another.

    Does this mean that we should just throw our hands up in the air and ignore all but the most academic and stodgy data? Absolutely not. "Listen" to your data, observe the little and seemingly insignificant messages embedded within the numbers. It's these little things that often matter most when making decisions or...choosing a winning final four!

    Replies
    1. Samantha,

      You make a great point that the data one often works with is not "perfect" in one way or another. Much as we like to focus on the rigorous, scientific aspects, there is always an element of art as well.

      Thanks for adding to the conversation!

    2. You're welcome, David. And thank you for your work here on this blog. You actually make stats and analytics interesting for this Texas rebel. ;-)
