## Unit of Measure

Back in an earlier post, A Quick Viz, Long Days, I was wondering if the separate areas on a graphic were caused by the raster graphics package I was using, or if they were real. If a pixel is your unit of measure, then the discontinuities are real. The unit of measure drives the data. So yes, those disconnected areas would be Poisson distributions tending to the normal as the units of measurement get smaller.

In this figure, I changed the unit of measure used to measure the top shape. I increased the size of the unit square moving down the page. Then, for each of the measured shapes, I counted complete units and used Excel to give me a moving mean and standard deviation, with time (n) moving left to right on each figure. In the first measurement, I generated a histogram of the black numbers below the shape.

A graph of the moving averages appears above each shape in gray. A graph of the moving sigmas appears above each shape in black. This helps us see the maximum or minimum sigmas and means. It also reveals unimodal to multimodal structure, or how many normals are involved. In all cases, the means were unimodal, involving a single normal. The results from the smallest pixel show that the sigma was bimodal. The middle pixel resulted in three sigmas, as the distribution was trimodal. The largest pixel resulted in a unimodal sigma. In all three cases, the shape generated skewed distributions.

No time series windows were used.

Where the data was smaller than a pixel, it is highlighted in red and omitted from the pixel counts. You can see how the data was reduced each time the pixel size went up. The grids imposing the pixelization were not applied in a standard way. We did not have an average when the grids were applied. The red pixels could be counted with Poisson distributions. They are waiting to trend to the normal. Or, they could be features waiting for validation. In a discontinuous innovation portfolio, they could be lanes in the bowling alley waiting for their client’s period of exclusion to expire, or waiting to cross the chasm. Continuous innovations do not cross Moore’s chasm. Continuous innovations might face scale chasms or downmarket moves via disruption or otherwise. All of these things impede progress through the customer base. They would be red. Do you count them or not?

Grids have size problems just like histogram bins.
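To make the grid-size problem concrete, here is a minimal sketch of counting complete units at several unit sizes. The shape is a hypothetical 8 x 8 mask, not the shape from the figure, and `count_units` is a name made up for this illustration. Cells that are only partly covered play the role of the red pixels.

```python
def count_units(mask, size):
    """Split a 0/1 mask into size x size grid cells and count them.

    Returns (complete, partial): cells fully covered by the shape, and
    cells only partly covered (the "red" ones left out of the count).
    Cells cut short by the edge of the mask can never count as complete.
    """
    rows, cols = len(mask), len(mask[0])
    complete = partial = 0
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            filled = sum(mask[r][c]
                         for r in range(top, min(top + size, rows))
                         for c in range(left, min(left + size, cols)))
            if filled == size * size:
                complete += 1
            elif filled > 0:
                partial += 1
    return complete, partial


# Hypothetical shape: a 6 x 6 block of 1s, offset by one cell inside an 8 x 8 mask.
shape = [[1 if 1 <= r <= 6 and 1 <= c <= 6 else 0 for c in range(8)]
         for r in range(8)]

for size in (1, 2, 4):
    complete, partial = count_units(shape, size)
    print(f"unit={size}  complete={complete}  partial={partial}  "
          f"measured area={complete * size * size}")
```

With this mask, the measured area falls as the unit grows, even though the shape never changes. Shifting the grid by one cell would change the counts again, which is the same sensitivity histogram bins have to where their edges fall.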

## A Moving Mean

When you first start collecting data, each data point changes the normal massively. We hide this by using a large amount of data after the fact, rather than building out a normal towards the standard normal like a time series, or starting with a Poisson distribution and increasing the number of data points until the normal is achieved.

When watching a normal go from 1 to n, it matters where the next data point comes from. If the data point is the third or more, it will be inside or outside the core, or, as an outlier, outside the distribution entirely. In the core, an area defined by being plus or minus one sigma, one standard deviation from the mean, the density goes up, the sigma might shrink. That sigma won’t get wider. Outside the core, in the tail, the sigma might get wider. Far enough into the tail, the sigma won’t get narrower. These would change the circumference of the circle representing the footprint of the normal. An outlier makes the normal wider. That outlier would definitely move the mean.
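A minimal sketch of that core, tail, and outlier behavior follows. It assumes the population sigma (divide by n) and treats anything beyond three sigmas as an outlier; both are assumptions of this sketch. The sample values use the 5, 24, and 75 that appear in this post, plus an assumed second value of 31. `effect_of_next_point` is a name made up for the illustration.

```python
import math


def mean_sigma(values):
    """Population mean and standard deviation (divide by n, not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)


def effect_of_next_point(values, x):
    """Say where x lands in the current normal and how the sigma responds."""
    mean, sigma = mean_sigma(values)
    new_mean, new_sigma = mean_sigma(values + [x])
    z = abs(x - mean) / sigma
    # Assumed cut-offs: core within 1 sigma, outlier beyond 3 sigmas.
    where = "core" if z <= 1 else "tail" if z <= 3 else "outlier"
    return where, sigma, new_sigma, new_mean - mean


for history, x in [([5, 31], 24), ([5, 31, 24], 75)]:
    where, old, new, shift = effect_of_next_point(history, x)
    print(f"x={x}: {where}, sigma {old:.2f} -> {new:.2f}, mean moves {shift:+.2f}")
```

For the population sigma, a new point widens the sigma only when it sits more than sqrt((n + 1) / n) sigmas from the current mean, so a point just outside the core can still narrow the sigma slightly. That threshold approaches one sigma as n grows.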

So what is the big deal about moving the mean? It moves the core. It’s only data. No. That normal resulted from the sum of all the processes and policies of the company. A population makes demands of the company and the product. When the core moves, some capabilities are no longer needed, some attitudes are no longer acceptable. On the financial side of the house, skew risk and kurtosis risk are real. When the core moves, the tails move. The further the core moves, the further the tail moves in the direction of the outlier.

Sales is a random process. Marketing is not. We don’t much notice this when we are selling commodity goods, but with a discontinuous innovation, that outlier sale has many costs that we have never experienced. The technology adoption lifecycle is only random when you pick where you start, your initial position, in the middle and work towards the death of the category. Picking the late mainstream phase because it’s all you know leaves a lot of money on the table and rushes that population to the buy before the business case they need to see is ready to be seen. But, picking late mainstream also means you’re fast following. Don’t worry. The innovation press will still call your company innovative. Hell, yours is purple and the market leader’s version is brown.

But, let’s say you began at the beginning and worked through the early phases, coming out of the tornado as the market leader. You will have gone from a Poisson distribution to the three sigma normal to the six, to the twelve, to more. Your normal will dance around before it sets its anchor at the mean and stays put while it grows outward in sigmas.

That outlier that sales demanded and we refused will eventually be reached. Sales just got ahead of itself and cost the company quite a bit trying to build the capabilities the outlier takes for granted.

I sat down with a spreadsheet and sold one customer, built the normal, and sold another, built another normal. That first customer was narrow and very tall. It’s as tall as that normal will ever be. It looks like a Dirac function. Of course, there is no standard deviation when you have a single data point. I fudged the normal by giving it a standard deviation of one. And, the standard normal looks like any other standard normal. Only the measurement scales changed from one normal to the next. The normals get lower and wider as the population gets larger.

I did this without a spreadsheet. I got normals with a kurtosis value, but those standard normal generators produce no skew or kurtosis. So this first figure is the first data point. It may be a few weeks until the next sale. Or, this might be a developer’s view of some functionality that certainly hasn’t been validated yet. Internal agilists never dealt with this problem. The unit of measure is a standard deviation, a sigma.

In the figure above, DP_{1} is the first data point and the first mean. So I went on to the next data point.

Here, in the figure above, the distribution for the second data point, DP_{2}, is the gold one. The standard deviation was 13. The mean for the gold distribution is represented by the blue line extending to the peak of the gold distribution. The black vertical lines extending upwards to the gold distribution demark the core of the gold normal. In the top-down view, the normal and its core are shown as black circles. With a standard deviation of 13, three standard deviations are 39 units wide.

The next data point, the third data point, DP_{3}, gives us the third mean. This mean is shown as a red line extending to the top of the pink distribution. In the top-down view, this normal and its core are shown as red circles. Notice that the height of this normal is lower than that of the gold normal. Also notice that this new data point is inside the core of the previous normal, so this normal contracts. With a standard deviation of 11, three standard deviations are 33 units wide. The third mean moved, so there is some movement of the distribution.

The figure above is illustrative but wrong. The vertical scale is off. So I rescaled the normals generated for the second and third data points. And, a fourth data point was added as an outlier. No normal was generated for it. That would be the next thing to do in this exploration.
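As a check on the vertical scale, here is a minimal sketch that computes the peak height of each curve. It assumes every curve is drawn as a unit-area normal density, which may not be how the figure was scaled. The means and sigmas are the ones quoted in this post, including the fudged sigma of one for the first data point.

```python
import math


def normal_pdf(x, mean, sigma):
    """Density of a normal distribution with the given mean and sigma at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# (mean, sigma) pairs as quoted in the post.
curves = {
    "DP1 (sigma fudged to 1)": (5, 1),
    "DP2 gold": (18, 13),
    "DP3 pink": (20, 11),
}

for name, (mean, sigma) in curves.items():
    # The peak of a normal sits at its mean.
    print(f"{name}: peak height {normal_pdf(mean, mean, sigma):.4f}")
```

Under that assumption the peak height is 1 / (sigma * sqrt(2 * pi)), so height depends only on the sigma. The single-point curve is by far the tallest, and a curve with a sigma of 11 peaks slightly above one with a sigma of 13.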

The black arrows at the foot of the gold normal show the probability mass flowing into the pink normal. The white area is shared by both distributions.

Where I labeled the mean, median, and mode as being the same is not real either. The distribution is not normal. I tried to draw a skewed distribution with the numbers from the spreadsheet. Eventually, I left that to the spreadsheet. In a skewed distribution all three numbers separate. The mean is closest to the tail.

In the top-down view, the outer circle is associated with the outlier.

The means moved from 5 to 18 to 20, and to 34 in response to the addition of the outlier at 75. The footprint of the normal expands with the addition of the outlier, and contracts in response to the addition of the third data point at 24.
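The sequence of means and sigmas can be reproduced with a short sketch. The first, third, and fourth data points (5, 24, 75) are stated in this post. The second, 31, is an assumption inferred from the second mean of 18. The sketch uses the population sigma (divide by n), which is what matches the quoted sigmas of 13 and 11. `running_stats` is a name made up here.

```python
import math

# 5, 24 and 75 come from the post; 31 is an assumed value for DP2,
# inferred from the second mean being 18.
data_points = [5, 31, 24, 75]


def running_stats(values):
    """Mean and population sigma after each new data point."""
    rows = []
    for n in range(1, len(values) + 1):
        seen = values[:n]
        mean = sum(seen) / n
        sigma = math.sqrt(sum((v - mean) ** 2 for v in seen) / n)
        rows.append((n, mean, sigma))
    return rows


for n, mean, sigma in running_stats(data_points):
    print(f"n={n}  mean={mean:.2f}  sigma={sigma:.2f}  three sigmas={3 * sigma:.2f}")
```

The sigma is zero at n = 1, which is why the first normal had to be fudged. It then contracts when 24 lands inside the core and more than doubles when the outlier at 75 arrives.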

The distribution is like gelatin.

Now, I got out the spreadsheet. I built a histogram and then put the line graph of a normal over it. The line graph doesn’t look normal at all.

So I took the normal off.

This showed three peaks, which drove the normal to show us a trimodal distribution that was right or positively skewed. This data has a long way to go before it is really normal. When I tried to hand draw the distribution, it looked left or negatively skewed. Adding the outlier caused this.
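The direction of the skew can be checked numerically. This is a minimal sketch using the moment-based skewness (third central moment over the sigma cubed) on the data points from this post, with 31 as an assumed value for the second point. It is not necessarily the formula the spreadsheet uses.

```python
import statistics


def skewness(values):
    """Moment-based skewness: third central moment divided by sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


before = [5, 31, 24]        # 31 is an assumed value for DP2
after = before + [75]       # outlier added

for label, values in (("without outlier", before), ("with outlier", after)):
    print(f"{label}: skew={skewness(values):+.2f}  "
          f"mean={statistics.mean(values):.2f}  "
          f"median={statistics.median(values):.2f}")
```

With these values the three-point set comes out negatively skewed and the outlier flips it to positive, pulling the mean above the median and towards the tail.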

No, I’m not going to add another data point and keep on going. I’ll wait until I get my programmer to automate this animation. I did try to get a blog up for our new company, but WordPress has not gotten easier to use since the last time I set up a blog. Anyway, they told us in statistics class that the normal wouldn’t stabilize below 36 data points. We looked at this. Use a Poisson distribution instead. Set some policy about how many data points you have to have before you call a question answered.

In Agile, the developer wants to get to validation as quickly as possible. Using the distributions at n = 2 and n = 3, we can test a hypothesis. We will test at n = 3 (now) and n = 3 - 1 = 2 (previous). Since n = 3 contracted, we could accept H_{1} previously and no longer accept H_{1} now.
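One way to picture that reversal is a simple core-membership check, sketched below. This is not a formal statistical test. The decision rule (accept when the candidate value lies within one sigma of the mean), the candidate value of 7, and the second data point of 31 are all assumptions made for illustration.

```python
import math


def core(values):
    """One-sigma core (mean - sigma, mean + sigma) using the population sigma."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean - sigma, mean + sigma


def in_core(values, candidate):
    """Assumed decision rule: accept when the candidate falls inside the core."""
    low, high = core(values)
    return low <= candidate <= high


data_points = [5, 31, 24]   # 31 is an assumed value for DP2
candidate = 7               # hypothetical value under test

previous = in_core(data_points[:2], candidate)   # n = 2
now = in_core(data_points, candidate)            # n = 3
print(f"accepted at n=2: {previous}, accepted at n=3: {now}")
```

At n = 2 the core runs from 5 to 31, so the candidate is accepted. At n = 3 the core contracts to roughly 9 to 31 and the same candidate falls outside it.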

I did not compensate for the skew in the original situation. The top-down view shows that, with skew, rejecting a hypothesis depends on direction. In our situation, the mean only moved to the right or the left. With another axis, the future distribution could move up or down, so there is even more sensitivity to skew and kurtosis. And, these sensitivities are financial risks. Sales to outliers translate into skew and kurtosis. These sales can also be costly in terms of, again, the cost of the capabilities needed to service the account.

Beware of subsets. With any given subset, that subset will likewise need 36 or more data points before the normal stabilizes. Skew risk and kurtosis risk will be realized otherwise.

Enjoy.