March 2006 Archives


Memphis follow-up and UNC as a #1?

Memphis did their part to complicate the #1 seed line situation by losing Thursday to UAB, but the boost to their conference RPI from their last two opponents probably helped them more than the loss hurt.

Recall the discussion of using conference (and non-conference) RPI rank vs. the raw RPI. As of March 1, their conference RPI rank was so out of whack with those of previous #1 seeds that the Crashing The Dance Oracle™ figured them a borderline #3/4 seed.

As a result, I switched to using raw RPI instead of rank, against the yelling of the voice inside my head. I felt dirty. As I wrote then, a .6000 conference RPI (like a .300 batting average in baseball) may not mean the same thing year-to-year, but a #5 conference RPI rank means the same thing every year: they had the 5th best conference RPI. For predicting from past data, it certainly makes more sense to use the rank to make comparisons fair across seasons.

I'm happy to report that I've switched back to using conference (and non-conference) RPI rank. All of our testing last season indicated that rank was a more accurate predictor than raw value. Plus, it just makes sense.

Anyway, with their last two games, Memphis improved from 48th (through March 1) to 27th (through March 4), which is more in line with (though still below) previous #1 seeds. We'll have to see how a drop in the new polls hurts them.

This by no means makes them a lock for a #1. They need to watch their back for, among others, Texas, Illinois, George Washington, and (what?!?) North Carolina. Any of these four, with a strong conference tourney run, could sneak in if Memphis falters.

The CTD Oracle™ currently puts UNC #6 overall in the S-Curve, a jump of 7 spots after last night's upset win in Durham. If they go on to win the ACC tournament, their 6 losses would not rule out a #1 seed. Five teams (Arizona and Michigan State in 2000; Illinois in 2001; Oklahoma and Texas in 2003) got #1 seeds with 6 or more losses. The difference (see the points column on the S-Curve list) between the four teams currently on the #2 line is slim, so any could jump to the #1 line should Memphis slip.
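For anyone new to the S-Curve: it ranks the field from 1 on down, and each run of four consecutive spots makes up one seed line. Here is a minimal Python sketch of that mapping. The function name is made up for illustration and is not part of the Oracle.

```python
def seed_line(s_curve_rank: int) -> int:
    """Map an overall S-Curve position (1 = best) to a seed line.

    Assumes four teams per seed line, so spots 1-4 are #1 seeds,
    spots 5-8 are #2 seeds, and so on.
    """
    if s_curve_rank < 1:
        raise ValueError("S-Curve rank starts at 1")
    return (s_curve_rank - 1) // 4 + 1


print(seed_line(6))  # UNC at #6 overall -> 2
print(seed_line(4))  # last spot on the top line -> 1
```

So a team sitting at #6 needs to pass at least two teams ahead of it to reach the top line.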

Whither Memphis?

(This was written before Texas lost to Texas A&M, which improves the Memphis case. Nonetheless, the argument still applies.)

The top three overall seeds seem to be set. The fourth is up for debate. Most debate centers around Memphis and Texas, and most prognosticators and bracketologists pencil in Memphis. I would wait a bit before writing that in pen.

First, a little background on our methods. (Warning: Technical details ahead.)

The "nitty gritty" report used by the selection committee includes a decent amount of data for each team. Wins and losses are listed in multiple categories (for example, overall, conference, and against teams ranked 1-25 in RPI). Conference and non-conference RPI are listed along with overall RPI. They also have polls and computer rankings (most notably Jeff Sagarin's rankings). For our purposes, these are called the attributes of a team.

There is a principle in statistical machine learning called the "curse of dimensionality." Essentially, it means that as the number of attributes increases, the amount of data needed to train the learner well increases exponentially (that is, really fast). Otherwise, the learner will create a model that fits the training data well but does not generalize. (This is called overfitting.)
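A rough way to see the growth: suppose, purely for illustration, that each attribute is chopped into 10 buckets and we want at least one past team in every combination of buckets. The bucket count and the function name are assumptions made up for this sketch, not something the Oracle does.

```python
def cells_to_cover(num_attributes: int, buckets_per_attribute: int = 10) -> int:
    """Number of distinct bucket combinations for the given attributes.

    Illustrative assumption: every attribute is split into the same
    number of buckets, and we want at least one example per combination.
    """
    return buckets_per_attribute ** num_attributes


for d in (1, 2, 3, 7):
    print(d, cells_to_cover(d))
# 1 10
# 2 100
# 3 1000
# 7 10000000
```

Each added attribute multiplies the space by ten, and a handful of past seasons cannot come close to covering it.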

We want a general model, because that is better for making predictions on future data. Making predictions is what machine learning is all about, so a non-general model leads to bad predictions. And we don't want that.

What this means is that we can't just throw all the attributes at the learner and expect good prediction results. Through experimentation and applying our own expert knowledge (i.e., reading good articles), we arrived at a smaller subset of the attributes to use for training the learner. Most of these are the familiar ones discussed in the media, and even unofficially confirmed, including:

  • "good wins" (total wins against RPI top 25)
  • "bad losses" (total losses against RPI 101 or worse)
  • +/- .500 against other RPI teams
  • polls and Jeff Sagarin's computer rankings
  • conference record +/- .500
  • conference performance and schedule strength (measured by conference RPI)
  • non-conference performance and schedule strength (measured by non-conference RPI)
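To make the count-based attributes concrete, here is a small sketch of how they could be computed from a list of results. The `Game` record, the function name, and the sample results are all invented for illustration. I'm also assuming that "+/- .500" means wins minus losses, and I lump RPI 26-100 into a single band for brevity. This is not the Oracle's actual code.

```python
from dataclasses import dataclass


@dataclass
class Game:
    opponent_rpi_rank: int  # opponent's RPI rank (1 = best)
    won: bool


def count_attributes(games: list[Game]) -> dict[str, int]:
    """Count-based attributes from the list above (assumed definitions)."""
    good_wins = sum(g.won and g.opponent_rpi_rank <= 25 for g in games)
    bad_losses = sum(not g.won and g.opponent_rpi_rank >= 101 for g in games)
    # "+/- .500" is taken here as wins minus losses against RPI 26-100.
    middle = [g for g in games if 26 <= g.opponent_rpi_rank <= 100]
    middle_wins = sum(g.won for g in middle)
    plus_minus = middle_wins - (len(middle) - middle_wins)
    return {
        "good_wins": good_wins,
        "bad_losses": bad_losses,
        "plus_minus_26_100": plus_minus,
    }


# Made-up results, not any real team's schedule.
sample = [
    Game(10, True), Game(20, False), Game(40, True), Game(75, True),
    Game(90, False), Game(150, False), Game(200, True),
]
print(count_attributes(sample))
# {'good_wins': 1, 'bad_losses': 1, 'plus_minus_26_100': 1}
```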

(We now return you to your regular programming.)

The point of all this is that we've tried to identify the most important attributes for predicting selection and seeding, based on the past data we have (2000-2005). Our results last season showed that we guessed pretty well, though there is certainly room for improvement.

Now, back to Memphis. Until recently, we measured conference performance and schedule strength with the team's ranking (rather than the actual value) in conference-specific RPI. (We did the same for non-conference as well.)

The idea is that the actual RPI value is not important; it's their standing in relation to other teams. A .6000 RPI may not mean the same thing year to year (much as a .300 batting average in baseball may not). To build a bracket, only this year's performance matters, so the important thing when looking at past seasons is how the team compares to others.
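As a sketch of the idea, converting raw values to within-season ranks takes only a few lines. The team names and RPI values below are placeholders invented for this example, and the function name is hypothetical.

```python
def rank_within_season(values: dict[str, float]) -> dict[str, int]:
    """Rank teams by a raw value, highest value = rank 1.

    Ties simply fall in sort order here; a real implementation
    would need an explicit tie rule.
    """
    ordered = sorted(values, key=values.get, reverse=True)
    return {team: position for position, team in enumerate(ordered, start=1)}


# Placeholder numbers: the same .6000 lands at different ranks
# depending on what everyone else did that season.
season_a = {"Team A": 0.6000, "Team B": 0.5800, "Team C": 0.5500}
season_b = {"Team X": 0.6400, "Team Y": 0.6200, "Team Z": 0.6000}

print(rank_within_season(season_a)["Team A"])  # 1
print(rank_within_season(season_b)["Team Z"])  # 3
```

The rank throws away how far apart the teams are, which is exactly the trade-off at issue with Memphis.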

This worked well for predicting the 2005 seeds. (We correctly picked all four #1 seeds in order. #2 seeds, not so much.) However, things started looking a bit dodgy this year. Memphis, which many think deserves a #1 seed (certainly a #2), was showing up as a borderline #3/4 seed when we ran these numbers. So I got curious.

I looked at their numbers through games of March 1:

      Rank                Conf         NonConf  Road  L10  1-25  26-50  51-100  101+
W-L   Sagarin  Poll  RPI  RPI     +/-  RPI      +/-   W    W     +/-    +/-     L
26-2  4        3     5    0.5957  12   0.6807   10    10   4     2      5       0

Looks pretty good in general. So I thought maybe something was wrong with my code or the data, but most of the other predictions looked sane. So I dug deeper.

From 2000 to 2005, 24 teams were given #1 seeds. The average conference RPI rank of those teams is a shade over 5th, with only two outside the top 10. The conference RPI rank for Memphis: 48th (as of Mar. 1). During that period, the average non-conference RPI rank for the same teams was 9.5, including a 38th and 42nd. So, at least for this sample, it looks like the committee gives the conference RPI rank more weight than the non-conference RPI.

Out of curiosity, I replaced the conference and non-conference RPI ranks with the actual RPI values. Memphis jumped up to the #1 line (#4 in the S-Curve), and this is what you currently see on the site. I would like to do more investigation for next season, but I'm leaving it as it is now because I think it's more likely for Memphis to be a #1 seed than a #4 seed.

It is possible that the revised RPI formula has some effect on this (previous seasons use the old formula), but I have not had a chance to try it with the new formula. Maybe next year...

None of this is to say that the Tigers' conference RPI (and the corresponding rank) will not improve in their remaining games. Their last 2 regular season opponents are in the RPI top 60, and winning their conference tournament should also help.

Just don't be surprised if they win out and still fall short of the #1 seed line.