Wednesday, September 16, 2015

Long Odds: 7,975 to 1 Against

Commentary::Statistics
Audience::Entry
UUID: df96273e-2da9-4d92-9b6d-cbdfdfb9b5c8

What went wrong by noon. So that you can say, "I am doing way better than that guy."

Waking up at 0430

That's 4:30AM to an unfortunate majority of my fellow citizens. Side note: we are one of a very few countries to use this system, and I hate it. It's harder to parse in software, etc. But leaving the side note firmly aside, waking up at 0430 sucks, unless you have some good reason for it. Which I did not. It just happened, so it sucked.

My month-old coffee maker decided to not make coffee

If I am up at 0430, a cup of coffee will usually sound like a truly fine idea. Grrr, and dig out the drip pot I use when I'm camping. It makes very good coffee, though not good enough to wade through warranty hassles. Because nothing is that good.

Discovered that it had not rained

You might think that no rain is a Good Thing, but then you might not be living in western Oregon, in a year that has set records for number of days above 90° F (I don't mind the Fahrenheit scale, because of better resolution). and an ongoing drought. There was rain in the forecast, but NOAA, as usual, blew it. I had been hoping to prep the ground for a new herb garden this evening, but that ground remains hard as stone.

The clothes dryer does not dry clothes

Forgot to toss wet laundry in last night, when I was working late (I work at least 90% from home), so I did it this morning. Motor runs, and it heats up. But the drum doesn't spin, unless you spin it by hand, which you should not be able to do. Great. Probably a belt has broken, or somehow come off of a pulley. Deal with it tomorrow, since I'm driving into town anyway, for another coffee maker. Because screw waiting for some random warranty process to complete: I'll donate the replacement to some worthy group.

Retreat to data analysis fails

The work projects were all well ahead of schedule, except for one new project that I didn't know the extent of. Fine. The sample is a series of small (all under 4 MB) flat files, and the usual approach for this sort of thing, under Linux, is to poke them with bash shell tools, before loading them into an iPython notebook, and doing the real analysis with the Pandas data analysis library. I went a bit too far in the bash stage, binning some things out with regular expressions in GNU grep. How many results are in a single-digit bin, versus a 10-19 bin?

It turned out that in the first file I examined the numbers were equal, which was, to put it mildly, unlikely. Of course, the now you have two problems RegEx issue came to mind, so I did a bit of staring at code, which revealed no bug. I checked the result three different ways, and found it to be accurate.

Then I did what I should have done to start with, given the way the day had gone far. I checked the other files, and found that this was the only file that generated equal numbers. Given the way the day had gone so far, I was biased toward looking for something else that had gone Badly Wrong, however unlikely, and I wasted time.

7,975 to 1 against

Those were the odds (calculated after the fact) against running into that particular scenario. What are the odds of the day, overall, having gone so incalculably wrong? I have no idea. Because incalculable. Duh. We lack crucial numbers on coffee maker (by manufacturer and model number) failure rates...

There's a well-known example of unlikely things happening on a daily basis. If you have a data source, look at how many cars are registered in your state/province/whatever. Then, next time you are doing some random drive, pick a license plate number, and calculate the odds of having seen it.

There are probably very long odds against it. If not, congratulations may be in order: you may have just had an opportunity to discover, in day-to-day life, how difficult randomness (the source of much cryptography) really is. Consider sources of error e.g. picking a plate close to home, at a time many of your neighbors leave for work, is going to skew your result.

A double-plus-good would arise from seeing such a failure of randomness, and gaining an insight into the power, for good or ill, of data mining.

No comments:

Post a Comment

Comments on posts older than 60 days go into a moderation queue. It keeps out a lot of blog spam.

I really want to be quick about approving real comments in the moderation queue. When I think I won't manage that, I will turn moderation off, and sweep up the mess as soon as possible.

If you find comments that look like blog spam, they likely are. As always, be careful of what you click on. I may have had moderation off, and not yet swept up the mess.