Thursday, August 24, 2017

Mean-of-Means Under Unequal Cardinality

These days, I'm hearing too much politics, of the weird and/or horrible sort. This is weird and/or horrible only in the sense of really bad statistics. Assuming that a political poll matches reality, anyway.

A former co-worker recently sent me an 'analysis' of a Gallup poll, purporting to show Trump's approval declining in stages. A lot of effort went into this. Each data point was tediously recorded via mousing over the online plot, then averages (arithmetic means) were calculated for each varying period, which were cherry-picked.

It's clear that while these data show declining popularity, and it sorta looks like that's happening in discrete stages, that's about all you can tell. Beyond the cherry-picking, only simple arithmetic means were calculated, and means-of-means were used across different sample sizes. Which is often a cardinality problem. I'm using the mathematical definition of cardinality here: a measure of the number of elements in a set.

In a bog-standard arithmetic mean I'm using the following notation:
where S = sample
and n = sample count
mean = S/n

so n = cardinality.


See the current Gallup Daily: Trump Job Approval plot to see rollover data, etc.

Immediately after the page title lies the critical sentence, "Each result is based on a three-day rolling average." You should also note that "Margin of error is ±3 percentage points." So a lot of the day-to-day variance is indistinguishable from noise. And not plotting even simple error bars trades information for drama. Bad dog, Gallup. But that is no worse than what is to be expected, and I digress.

Long story short: you have to use a weighted mean here.

The scary bit is that this former co-worker influences financial decisions related to capacity planninng for various security systems. When and where more storage, networking, or CPU would be needed. So this was just waiting to bite, in one or more security-critical system(s). That can be career-limiting, and is best avoided.

This is Python3. I use an unweighted mean-of-mean first, just to demonstrate failure under unequal cardinality.

Note that this is truly terrible Python. While it works, it bears little resemblance to how one would actually write Python. Python has been called "executable pseudocode."  I am trying to capitalize on that to make what it does (illustrate my point) decipherable to those who have never seen Python and have no desire to experiment with it, or may not be any sort of coder at all.

# Python is dynamically typed; it would have been easy to do this without 
# using an array. But this way, and specifying floating point (the 'f' in the 
# a_in = array.array statement) may be more explicit for those who don't use 
# Python.
import array # Efficient arrays of numeric values.
from statistics import mean # Arithmetic mean ("average") of data.

def clr_vars():
    """Clear variables, shown in order of appearance."""
    del a_in    # The input array.
    del ms1     # Mean of subset 1.
    del ms2     # Mean of subset 2.    
    del a_ms12  # Array of the means of the two subsets.
    del ma_in   # Mean of input array.
    del ms12    # Mean of both subsets

print('Equal Cardinality Subsets')
a_in = array.array('f', [1, 2, 3, 4])
print('Input:', a_in)
print('Subset1:', a_in[:2])
print('Subset2:', a_in[2:])
ms1 = mean(a_in[:2])
ms2 = mean(a_in[2:])
print('Mean of subset1:', ms1)
print('Mean of subset2:', ms2)
print('Sum of subset means succeeds:', ms1 + ms2)
a_ms12 = array.array('f', [ms1, ms2])
ma_in = mean(iter(a_in))
print('Mean of entire input array:', ma_in)
ms12 = mean(a_ms12)
print('Mean of subset means succeeds:', ms12)

Which returns

Equal Cardinality Subsets
Input: array('f', [1.0, 2.0, 3.0, 4.0])
Subset1: array('f', [1.0, 2.0])
Subset2: array('f', [3.0, 4.0])
Mean of subset1: 1.5
Mean of subset2: 3.5
Sum of subset means succeeds: 5.0
Mean of entire input array: 2.5
Mean of subset means succeeds: 2.5

clr_vars

print('\nUnequal Cardinality Subsets')
a_in = array.array('f', [1, 2, 3, 4, 5])
print('Input:', a_in)
print('Subset1:', a_in[:2])
print('Subset2:', a_in[2:])
ms1 = mean(a_in[:2])
ms2 = mean(a_in[2:])
print('Mean of subset1:', ms1)
print('Mean of subset2:', ms2)
print('Sum of subset means succeeds:', ms1 + ms2)
a_ms12 = array.array('f', [ms1, ms2])
ma_in = mean(iter(a_in))
print('Mean of entire input array:', ma_in)
print('Mean of subset means fails:', ms12)

Which returns


Unequal Cardinality Subsets
Input: array('f', [1.0, 2.0, 3.0, 4.0, 5.0])
Subset1: array('f', [1.0, 2.0])
Subset2: array('f', [3.0, 4.0, 5.0])
Mean of subset1: 1.5
Mean of subset2: 4.0
Sum of subset means succeeds: 5.5
Mean of entire input array: 3.0
Mean of subset means fails: 2.5

Weighted Means

As you can (hopefully, Python is clear enough) see, cardinality didn't matter with addition. But with mean-of-means, it did. There are some subtleties involved with sum- vs mean-of-means. You may be asking two different questions, depending on which you use, so either can be correct. But here we can just use weighted means, which allow for the different number of samples (cardinality).

Differences

Instead of iterating over the entire array, we group by the subsets.

Recall that in a bog-standard arithmetic mean I'm using the following notation: where S = sample and n = sample count mean = S/n

We calculate a weighted mean of two samples by expressing groups of sample counts and sample means:
Weighted mean = (((mean of s1)(n of s1)) + ((mean of s2)(n of s2))) / ((n of s1)(n of s2))

We introduce two new variables s1 = subset1, s2 = subset2, instead of just printing what they
would have contained.

We then use len() to get n of s1 and s2, storing them in two more new variables, ns1 and ns2.

Substituting Python variables, and using Python syntax (still with extra parens for clarity):
w_mean = ((ms1 * ns1) + (ms2 * ns2)) / (ns1 + ns2)

Here's the code.

s1 = a_in[:2] s2 = a_in[2:]
print('s1 is:', s1) print('s2 is:', s2) ns1 = len(s1) ns2 = len(s2)
print('ns1 is:', ns1) print('ns2 is:', ns2) ms1 = mean(s1) ms2 = mean(s2) print('ms2 is:', ms1) print('ms2 is:', ms2) w_mean = ((ms1 * ns1) + (ms2 * ns2)) / (ns1 + ns2) print('Mean of entire input array:', ma_in) print('Weighted mean succeeds:', w_mean)

Which returns
s1 is: array('f', [1.0, 2.0])
s2 is: array('f', [3.0, 4.0, 5.0])
ns1 is: 2
ns2 is: 3
ms2 is: 1.5
ms2 is: 4.0
Mean of entire input array: 3.0
Weighted mean succeeds: 3.0

There is another potential problem here. Not having the original data, we can't see outliers, and there is no methodology discussion beyond "Daily results are based on telephone interviews with approximately 1,500 national adults". So we don't know that the arithmetic mean was safe to use. On the other hand, rolling averages smooth data. It's probably fine, but we would have to do a lot of analysis to establish a confidence level.

We would start by experimenting with harmonic means (the mean of choice when
there are outliers, though we have to go outside the Python standard library. Harmonic means are also commonly required in IT capacity planning or performance analysis, or anywhere else that data are spikey. My former co-worker should certainly have been aware of this too, but somehow wasn't. Another way to get bitten... The thing is that in most ways, this a pretty knowledgeable security person. On the theory that anything that helps get more science and engineering into the security field, at any level, is A Good Thing, I was happy to help.

What's in the Python stdlib is actually pretty weak for this sort of thing, compared to what's available in the science stack. The only mean available in the statistics module is the arithmetic mean, and arrays are limited as well. The solution starts as follows.

import numpy as np # array goodies and vastly more, required by
from scipy importstats # supplies harmonic mean, and lots more
import matplotlib.pyplot as plt
import seaborn as sns
Note that you also have to have Pandas on the system to use seaborn. Which is fine --
you'll want it anyway. Or use something else; there are several plotting packages available.
When the environment is set up, we would use stats.hmean() and lots of plots as our
exploration tools. That would take more play time than I actually have. Especially since all
this assumes that I want to fool with anything political, and that Gallup polls are worth
anything to begin with. I would really start by looking up their results from the 2016 election,
where we know there were polling failures.
If you care to push your way into the science stack, the packages I mentioned above are at
http://www.numpy.org/ https://www.scipy.org/ https://matplotlib.org/ https://seaborn.pydata.org/
http://pandas.pydata.org/