When I wrote the post The Dance of a Normal: Data Quantity and Dimensionality, I didn’t tie it back to product management. My bad. I’ll do that here. I was reminded that I needed to get that done by John D. Cook’s blog post, “Big data is not enough.”

When we construct a normal from scratch, we need 20 data points before we can draw any conclusions. That’s 20 data points of the same measurement in the same dimension. If that measurement involves data fusion and we change that fusion, we have a different measurement, so we need to segregate the data. If we change a policy or procedure, our numbers, or basis, will change. If we change our pragmatism slice, those numbers will change. If we had enough data, each of those changes would be an alternative hypothesis of its own. Hopefully, they would intersect each other so we could test each of those hypotheses for correlation. But we can’t just aggregate them and expect to reach valid conclusions, even if we now have 80 data points and a normal.
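A minimal sketch of the aggregation trap, using made-up numbers: four segments of 20 points each, where each segment reflects a different measurement basis (a fusion change, a policy change, a slice change, and the original). The segment names and parameters here are hypothetical.

```python
import random
import statistics

random.seed(7)

# Hypothetical segments: each reflects a different measurement basis,
# so each must be analyzed on its own 20 points.
segments = {
    "original":      [random.gauss(100, 5) for _ in range(20)],
    "fusion_change": [random.gauss(110, 5) for _ in range(20)],
    "policy_change": [random.gauss(95, 5) for _ in range(20)],
    "slice_change":  [random.gauss(120, 5) for _ in range(20)],
}

for name, data in segments.items():
    print(f"{name}: mean={statistics.mean(data):.1f}")

# Pooling all 80 points yields a single mean that describes none of
# the underlying measurements.
pooled = [x for data in segments.values() for x in data]
print(f"pooled (80 points): mean={statistics.mean(pooled):.1f}")
```

The pooled mean sits somewhere between the segment means, which is exactly why the 80-point normal is not a valid basis for conclusions about any one measurement.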

With those 20 data points, we have a histogram. We will also have kurtosis when we tell our tools to fit a normal to those 20 data points. We will have to check to see how many nomials we have. Each nomial will have a mean, median, and mode of its own. Those medians lean. The median remains the statistic of centrality while the mode and mean move out into the skew.
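A sketch of that check, under assumed numbers: 20 points drawn from two nomials, with the mode estimated crudely from the most populated histogram bin. The cluster locations are hypothetical.

```python
import random
import statistics

random.seed(3)

# Hypothetical: 20 data points drawn from two nomials (component
# distributions), producing a skewed histogram rather than a normal.
data = ([random.gauss(10, 1) for _ in range(12)]
        + [random.gauss(16, 1) for _ in range(8)])

mean = statistics.mean(data)
median = statistics.median(data)

# Crude mode estimate: the most populated unit-width histogram bin.
bins = {}
for x in data:
    bins[round(x)] = bins.get(round(x), 0) + 1
mode = max(bins, key=bins.get)

# In a skewed pile-up, the median stays central while the mode and
# mean move out into the skew.
print(f"mode={mode}, median={median:.2f}, mean={mean:.2f}")
```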

While you can estimate a normal from 20 data points, don’t expect it to be the answer. There is more work to be done. There is more logic involved. There is more Agile development to do. Don’t move on to the next thing until you have 36 data points for that dimension. If you release some new code, start the data point count over. This implies slack.
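The counting discipline can be sketched as a small tracker, with the dimension name and the 36-point threshold taken from the text; everything else here is a hypothetical illustration.

```python
# Hypothetical sketch: track the data point count per dimension, and
# reset it whenever a release changes the measurement's basis.
class DimensionCount:
    TARGET = 36  # don't move on until the dimension has 36 points

    def __init__(self, name):
        self.name = name
        self.count = 0

    def record(self, value):
        self.count += 1

    def release(self):
        # New code means a new measurement; start the count over.
        self.count = 0

    def ready(self):
        return self.count >= self.TARGET


dim = DimensionCount("time_to_first_response")
for v in range(40):
    dim.record(v)
print(dim.ready())   # 40 points collected, threshold met
dim.release()
print(dim.ready())   # count reset by the release, back to waiting
```

The reset is where the slack comes from: after every release, the dimension goes back to collecting before anyone can conclude anything.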

When I was managing projects, the mean would converge. When you see the same mean several days in a row, you’ve converged. Throw the data out and collect new data. Once the data converges, it is hard to move the number. Your performance might have changed, but the number hasn’t. Things hide in averages.
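The convergence test can be sketched as: the daily running mean repeats, to a chosen precision, several days in a row. The daily values and the three-day window below are hypothetical.

```python
import statistics

# Hypothetical sketch: declare the mean converged when the running
# mean repeats (to a chosen precision) several days in a row.
def converged(daily_means, days=3, precision=1):
    recent = [round(m, precision) for m in daily_means[-days:]]
    return len(recent) == days and len(set(recent)) == 1

data = []
running = []
for day_value in [10.2, 9.8, 10.1, 10.0, 9.9, 10.0, 10.05]:
    data.append(day_value)
    running.append(statistics.mean(data))

print(converged(running))
```

Once this returns true, the accumulated data is doing the hiding: throw it out and start collecting fresh.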

Beware of dimensions. A unit of measure could represent more than one dimension when it’s used in different measurements. What is the logic of this sensor versus another? What is the logic of the illuminator? What is the logic of the mathematics? Are we assuming things? A change in any of that brings us to a new dimension. Write down the definition of each dimension.
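Writing down the definition can be as simple as a record that names the sensor, the illuminator, and the assumed mathematics. The field names and values below are hypothetical illustrations.

```python
from dataclasses import dataclass

# Hypothetical sketch: a written-down dimension definition, so that a
# change in sensor, illuminator, or math creates a new dimension
# instead of silently contaminating an old one.
@dataclass(frozen=True)
class Dimension:
    unit: str
    sensor: str
    illuminator: str
    math: str  # the assumed mathematics behind the measurement

d1 = Dimension("lumens", "sensor_a", "led_ring", "linear_response")
d2 = Dimension("lumens", "sensor_b", "led_ring", "linear_response")

# Same unit of measure, but the sensor changed, so this is a
# different dimension and its data must be kept apart.
print(d1 == d2)  # False
```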

The statistics for each dimension and each measurement take time to reach validity. The rush to production, to release, to iteration leaves us working with invalid numbers until we get there. The numbers your analytics kick out won’t clue you in. Kurtosis can give you a hint if it is not swamped. Slow down.
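A sketch of the kurtosis hint, on assumed data: excess kurtosis computed from the standardized fourth moment, applied to a mixture of two nomials that has not yet reached a single normal.

```python
import random
import statistics

random.seed(1)

# Hypothetical sketch: excess kurtosis as a hint that a measurement
# has not reached validity (still a mixture, not yet one normal).
def excess_kurtosis(data):
    n = len(data)
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 4 for x in data) / n - 3

# A mixture of two well-separated nomials: its excess kurtosis sits
# far from zero, hinting that this is not one normal.
mixed = ([random.gauss(0, 1) for _ in range(50)]
         + [random.gauss(8, 1) for _ in range(50)])
print(f"excess kurtosis of the mixture: {excess_kurtosis(mixed):.2f}")
```

A value near zero is what a true normal would report; a strongly negative value like this one is the hint that the distribution is still two piles, not one.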

Once you have achieved normality with a measurement, how many sigmas do you have: 1, 3, 6, >6, 60? At three, your underlying geometry changes from Euclidean to spherical. Your business will change when your sigma is greater than six. You will have more competition and the number of fast followers will explode.

Adding data points will change the normal, which in turn changes the outliers. This will be even more the case when you attend to the changes to your dimensions and measures, and your TALC phases and pragmatism slices. The carried and carrier will have their own dimensions and measures. They will also have different priorities and levels of effort. When moving from a carried layer to a carrier layer, the outliers will be different, because the carried and the carrier have their own normal distributions, each with its own dimensions and measures. The emphasis changes, so the statistics change. The populations across the stack differ widely.
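The first sentence can be sketched directly: outliers are relative to the current normal, so adding data points moves the mean and sigma, which relabels the outliers. The numbers below are made up for illustration.

```python
import statistics

# Hypothetical sketch: an outlier is defined against the current
# normal (here, more than k sigmas from the mean).
def outliers(data, k=2):
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs(x - m) > k * s]

first = [9, 10, 10, 11, 10, 9, 11, 10, 25]
print(outliers(first))   # 25 stands out against this normal

# After more points arrive in the 20s, the normal shifts and widens,
# and 25 is no longer exceptional.
later = first + [22, 24, 26, 23, 25, 24]
print(outliers(later))
```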

So much mess can be made with metrics. Gaps in the data happen. The past hangs around to assert itself in the future. When you drive down a road, adjacent houses can be from different decades. Data is like that. The infrastructure helps eliminate gaps and the misallocation of data. It’s not as simple as measure to manage; you have to manage to measure.

Enjoy