Nerd Files: Pi, Proxies, and Big Data

Ah yes, Pi Day was the other day: March 14th, or 3.14. Google put up its cute little graphic display to celebrate. If you were being ultra math nerdy about it, though, you'd celebrate on July 22nd, since 22/7 is actually a closer approximation. (Aside: really though, buck tradition and celebrate Tau Day [τ] with me on 6.28. Or celebrate all three for the chance at great puns and yummy pies outside of Thanksgiving.)

Why would that even matter? Well, because pi is an irrational number: its decimal expansion goes on forever without repeating. Every extra decimal place you carry gets you a better approximation of the real deal, and thus a more accurate output. Depending on the application this might not matter, or it might matter a lot.

For instance, if you only needed a very general approximation you could round it to 3, which puts you about 4.51% off from the true value. On a tiny scale that may not matter, but use it to figure the circumference of a circle with a 100-foot diameter and you'll come up about 14 feet short. That could be a glaring problem if you are building a fence. NASA, for its calculations, uses pi to about 15 decimal places. Essentially, whether you round to the nearest integer or carry pi to the 10 trillionth decimal place, you are using a proxy. An estimate. The more decimal places you use, however, the more accurate the answer. Some things, like NASA rocket science, require more precision.
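If you want to see that shrinkage in action, here's a quick back-of-the-napkin check in Python. The labels are mine, and "NASA-ish" just means pi carried to 15 decimal places:

```python
import math

# A few stand-ins for pi, from crudest to NASA-grade.
approximations = {
    "3 (rounded)": 3.0,
    "3.14 (Pi Day)": 3.14,
    "22/7 (July 22nd)": 22 / 7,
    "15 decimals (NASA-ish)": 3.141592653589793,
}

for name, approx in approximations.items():
    # Percent error relative to Python's own (double-precision) pi.
    relative_error = abs(math.pi - approx) / math.pi * 100
    print(f"{name:24} off by {relative_error:.10f}%")
```

Run it and you'll see 3 lands about 4.51% off, while 22/7 really does beat 3.14, which is the whole case for July 22nd.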

Big data also relies on proxies. There are things that simply can't be measured easily, or measured at all, which makes it difficult to model value or behavior. But in an effort to predict the best outcomes, a number must be assigned somehow. So we introduce proxies. The problems arise, however, when we use these proxies with no error feedback and at large scale. Enter the correlation-equals-causation fallacy. When two things correlate without true causation, yet we feed that correlated data set into algorithms to predict behavior, we can end up in a scenario much like rounding pi to 3 when building a 100-foot fence. Scale it up to a 1000-foot fence and you end up with an even bigger problem: about 142 unfenced feet.
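And here's the fence math itself, a tiny sketch showing that the gap from budgeting with pi = 3 grows in lockstep with the size of the job:

```python
import math

# Feet of fence you come up short when you budget with pi = 3.
# The gap is (pi - 3) * diameter, so it scales linearly with the project.
for diameter_ft in (100, 1000, 10000):
    gap_ft = (math.pi - 3) * diameter_ft
    print(f"{diameter_ft:>6} ft diameter -> about {gap_ft:.0f} unfenced feet")
```

A sloppy proxy doesn't stay a small problem. It grows exactly as fast as whatever you apply it to.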

Let’s talk baseball stats. How good is a player? That’s not directly measurable, so to compute a ranking, many factors get combined, and different teams may even weight those factors differently. The thing about baseball stats, though, is that when they are wrong, somebody goes back in and looks at what caused the error. Maybe a guy scored low on one team because of his strikeout ratio, but then he gets traded and does phenomenally better. The stats analysts look at what happened and change the model, hoping to figure out how not to let a good player go in the future.
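For the curious, here's a toy sketch of that feedback step. Every stat, weight, and threshold below is invented for illustration; the only point is that the prediction error goes back into the model instead of being ignored:

```python
# Invented weights for an invented ranking model.
weights = {"batting_avg": 0.5, "strikeout_ratio": -0.4, "fielding": 0.3}

def score(player_stats):
    """Weighted sum of a player's stats -- one crude way to rank a player."""
    return sum(weights[stat] * value for stat, value in player_stats.items())

# The model said "let him go"; his next team saw a much better result.
predicted = score({"batting_avg": 0.260, "strikeout_ratio": 0.30, "fielding": 0.70})
actual = 0.35  # observed performance on the new team (made up)

error = actual - predicted
if abs(error) > 0.05:  # arbitrary threshold for "the model was wrong"
    # The crucial step: feed the error BACK into the model.
    # Here we just nudge the most suspect weight a little.
    weights["strikeout_ratio"] += 0.1 * error
    print(f"Model missed by {error:.2f}; weights revised: {weights}")
```

The nudge itself is crude on purpose. What matters is that the loop exists at all.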

Now suppose a man gets fired during a company-wide hack-and-slash because he fell below some minimum score. A) He doesn't know the score, doesn't know what he was even scored on, doesn't know the weight each item carries, and doesn't know how the computation was generated. So how could he have improved, even with a six-month warning? He couldn't. B) Say that after he was fired, he was hired at another company and was much more successful there. The model that fired him is never revisited to see what went wrong (how did they let a perfectly good employee go?), and worse, it keeps on trucking along as if it were fine, because no errors were ever reported back. That, by the way, is how many districts are calculating how well or poorly a teacher is performing: value-added big data models and precious little transparency.

With teaching it’s awful because you are taking statistics that hold at a large scale (masses of students) and applying them at a small scale (classrooms of 25-35 students). It's like expecting a fun-size bag of 17 M&Ms to match the standard set by 10,000 M&Ms at the factory. With 10,000 M&Ms you are going to get close to the intended ratio of all the different colors. In a fun-size bag you may not get anything near an even distribution. You may not even see a particular color. Things start becoming random: what are the odds that, even if you buy another fun-size bag from the same store, at the same time of day, wearing the same clothes as yesterday, you'd get the exact same color ratios? How statistically annoying.
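You can watch that randomness happen in a few lines of Python. This assumes the factory mixes all six colors in equal proportion, which real factories don't actually promise:

```python
import random
from collections import Counter

colors = ["red", "orange", "yellow", "green", "blue", "brown"]

# Grab three fun-size bags of 17 from a factory mix with equal color odds.
for bag in range(1, 4):
    handful = random.choices(colors, k=17)  # each candy equally likely
    print(f"Bag {bag}: {dict(Counter(handful))}")
```

Run it a few times: the counts swing wildly from bag to bag, and some bags miss a color entirely, even though the factory ratios never changed. That's a classroom of 25 measured against district-wide statistics.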

I don’t know the answer here, but one thing seems clear to me. Big data may improve a perceived efficiency, but at what ultimate cost? We won’t know, because we aren’t looking too hard at the error rates, or updating the models with the error data, behind these so-called efficiencies. Numbers, data, algorithms, and models aren’t as objective and infallible as some would have us believe. I mean, even pi is an irrational number, which means every answer built on it is only as precise, and therefore only as right, as the approximation we choose. Turns out, machines with all their maths and fancy algorithms still can’t quite rule the world yet. At least not in a fair manner, at any rate.