Archive for June, 2019

Pythagorean Theorem for PMs

June 21, 2019

What? Well, my math review forces me to go read about things I know. And, things I didn’t know, or things I never bothered to connect before.

In statistics, or in all math, independent variables are orthogonal. And, in equations, the variables on one side of the equal sign are independent, and the variables on the other side of the equal sign are dependent. Independent and dependent variables have relationships.

Now, change subjects for a moment. In MS project or in all projects, you have independent tasks and dependent tasks. And, these independent and dependent tasks have relationships.

Statistics was built on simple math. Simple math like the Pythagorean Theorem. You can argue about what is simple, but the Pythagorean Theorem is math BC, aka before calculus.

Distance is one of those simple ideas that gets messy fast, particularly when you collect data and you have many dimensions. The usual approach is to add another dimension to the Pythagorean Theorem. That’s what I was expecting when an email sent me out to the Better Explained blog. The author of this blog always has another take. I read this month’s post on another subject and went to look for what else I could find. I found a post, “How to Measure Any Distance with the Pythagorean Theorem.” Read it. Yes, the whole thing. There is more relevant content than I’m going to talk about. The author of this post assumes a Euclidean geometry, which around me means my data has achieved normality.

He builds up slowly. We’ll just dive in the deep end of the cold pool. You know this stuff, like me, or, like me, assume you know this stuff.

The Multidimensional Pythagorean Theorem.

In this figure, I labeled the independent and dependent variables. This labeling assumed finding z was the goal. If we were trying to find b, then b would be dependent so the labels would be different.

In the software as media model, a would be the carrier code, and b would be carried content. Which implies a “b is the unknown” situation. The developer doesn’t know that stuff yet. And, without an ethnographer, might never know that stuff. Steve Jobs knew typography; the developers of desktop publishing software 1.0 didn’t. But, don’t worry, the developers won the long war with MS Word for Windows, which didn’t permit graphic designers to specify a grid, which could be done in MS Word for DOS. Oh, well.

Those triangles would be handoffs, which is one of those dreaded concepts in Agile. The red triangle would be your technical writer; orange, your training people or marketing. However you do it, or they do it.

Independent and dependent variables in a multidimensional application of the Pythagorean Theorem

There are more dependent variables in the equation from the underlying source diagram so I drew another diagram to expose those.

The independent variables are shown on a yellow background. The dependent variables are shown on a white background. Notice that the dependent variables are hypotenuses.
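The point that the dependent variables are hypotenuses can be checked numerically. What follows is a minimal sketch, not the Better Explained author's code: the function name chained_hypotenuses and the leg values 3, 4, and 12 are made up for illustration, and the legs are assumed to be mutually orthogonal.

```python
import math

def chained_hypotenuses(legs):
    """Return the running hypotenuse after each orthogonal leg is added.

    Every intermediate value is the hypotenuse of a right triangle whose
    legs are the previous hypotenuse and the next independent leg.
    """
    running = 0.0
    results = []
    for leg in legs:
        running = math.hypot(running, leg)
        results.append(running)
    return results

# Hypothetical independent variables (the legs), assumed orthogonal.
legs = [3.0, 4.0, 12.0]
hypotenuses = chained_hypotenuses(legs)

# The last hypotenuse should match the one-shot n-dimensional formula.
direct = math.sqrt(sum(leg * leg for leg in legs))
print(hypotenuses, direct)
```

Each value after the first depends on the legs before it, which is why the hypotenuses land on the dependent side of the labeling.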

In an example of linear regression that I worked through to the bitter end, new independent variables kept being added. And, the correlations kept being reordered. This was similar to the order of factors in a factor analysis which runs from steeper and longer working to flatter and shorter. There was always another factor because the budget would run out before the equation converged with the x-axis.

This particular view of the Pythagorean Theorem gives us a very general tool that has its place throughout product management and project management. Play with it. Enjoy.


Box Plots Again

June 3, 2019

I went through my email this morning and came across an email from Medium Daily Digest. I don’t link to them often, but The 5 Basic Statistics Concepts Data Scientists Need to Know looked like it might be a good read. Big data diverges from statistics. The underlying assumptions are not the same.

So my read began. The first thing that struck me was a diagram of a box plot. It needed some interpretation. The underlying distribution is skewed. If the distribution were normal, the median would be in the middle of the rectangle. The median would be slightly closer to 1.0. You can find this by drawing diagonals across the rectangle. They would intersect at the mean. In a normal that has achieved normality, the mean, the median, and the mode converge. You will see this in later diagrams. The box plot is shown here in standard form.

Each quartile contains 25 percent of the dataset.
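To see where the median sits inside the box, you can compute the quartiles yourself. This is a rough sketch using Python's standard statistics module. The sample values are invented for illustration, the helper name median_position_in_box is hypothetical, and the "inclusive" quartile method is just one of several conventions a box plot tool might use.

```python
import statistics

def median_position_in_box(values):
    """Return (q1, median, q3, position).

    position is how far the median sits between Q1 and Q3:
    0.5 means centered in the box, below 0.5 means closer to Q1.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    position = (median - q1) / (q3 - q1)
    return q1, median, q3, position

# Hypothetical right-skewed sample, not data from the article.
sample = [1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 12.0, 20.0]
q1, median, q3, position = median_position_in_box(sample)
print(q1, median, q3, position)
```

A position well below 0.5, as in this sample, is the off-center median described above.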

Skewed distributions should not be prevalent in big data. So we are talking small data, but how can that be, given it is typically used in daily stock price reporting? We’ll get to that later.

In big data, normality is usually assumed, although I got on this “is it normal” kick when I read a big data book telling me not to assume normality. As I do since then, I call it out, as I’m going to do in this post. Normality takes at least 2048 data points in a single dimension. So five dimensions requires 5×2048, or 10240 data points. When we focus on subsets, we might have fewer than 2048 data points, so that gives us a skewed normal. In n-dimensional normals, the constituent normals that we are assuming are normal are not, in fact, normal yet. They are still skewed.

We mostly ignore this at our peril. When we make statistical inferences, we are assuming normality because the inference process requires it. Yes, experts can make inferences with other distributions, and no distribution at all, but we can’t.

I’ve read some papers on estimating distribution parameters where the suggested practice is to compute the parameters using a formula giving you the “standardized” mean and standard deviation.

I revised the above figure to show some of the things you can figure out given a box plot. I added the mean and mode. The mode is always on the short tail side of the distribution. The mean is always on the long tail side of the distribution. If the distribution had achieved normality, the median would be in the middle of the box. As it is, the median is below the center of the rectangle, so it will take more data points before the distribution achieves normality. In a skewed normal, the mean and mode diverge symmetrically from the median. Once normality is achieved, the mode, mean, and median would converge to the same point. There would be a kurtosis of 3, which indicates that the tails are symmetrical. That implies that the curvature of the tails is the same as well.
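Here is a small sketch of how you might check the mode, median, and mean ordering, along with skewness and kurtosis, on a sample. The data are invented for illustration, shape_summary is a hypothetical helper, and the skewness and kurtosis are the plain population-moment versions (a normal distribution has kurtosis 3 under this definition). The ordering it prints holds for this sample; I'm not claiming it proves the general case.

```python
import statistics

def shape_summary(values):
    """Summarize location and shape using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return {
        "mode": statistics.mode(values),
        "median": statistics.median(values),
        "mean": mean,
        "skewness": m3 / m2 ** 1.5,  # 0 for a symmetric sample
        "kurtosis": m4 / m2 ** 2,    # 3 for a normal distribution
    }

# Hypothetical right-skewed sample: the long tail is on the high side.
sample = [1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 12.0, 20.0]
summary = shape_summary(sample)
print(summary)
```

With the long tail on the high side, the mode comes out lowest and the mean highest, and the skewness is positive.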

That curvature would also define a torus sitting on top of the tails. When the distribution is not yet normal, or is skewed, that torus would be a cyclide. A torus has a constant radius while a cyclide is a tube that starts with a small radius, which increases as it is swept around the normal from the short tail to the long tail. The long tail is where the tube has the largest radius. Neither of these is shown in this diagram. That cyclide is important over the life of the distribution, because it orients the normal. Once the distribution achieves normality, that orientation is lost due to symmetry, or not. That challenges some simplifying assumptions I will not address today, as in further research is required. But, accepting the orthodox, symmetry makes that orientation disappear.

A skewed normal as it appears in a box plot.

I showed, in black, where the core of the normal would be. I also indicated where the shoulder of the distribution would be. Kurtosis and tails start at the shoulder. The core is not informative. I used a thick red arrow pointing up to show how the mode, median, and mean would converge or merge. In a skewed distribution, the median is leaning over. As the distribution becomes more normal, it stands up straighter. Once normality is achieved, the median is perpendicular to the base of the distribution. Notice that the short tail does not move. I also used a thick red arrow pointing down to show how the long tail will contract as the distribution becomes normal.

Invest on the stable side of the distribution, or infer on the stable side. Those decisions will last long after normality is achieved.

The next figure shows how to illustrate the curvature of the tails given just the box plot and some assumptions of our own.

Tails, curvatures, and the cyclide.

We begin here on the axis of the analyzed dimension, shown in orange. I’ve extended this horizontal axis beyond the box plot, shown in red.

Take the distance from the mean to the maximum value in the box chart, the point at the top of the diagram marked with a “^” symbol rotated ninety degrees. This is also labeled, in blue, as a point of convergence. That distance is one half the side length of the associated square, shown in red. The circle inside that box represents the diameter of the cyclide tube at the long tail.

Then take the distance from the mode to the minimum value in the box chart, the point at the bottom of the diagram marked with a “^” symbol and labeled as a point of convergence. Again, that distance is one half the side length of the associated square that contains a circle representing the diameter of the cyclide tube at the short tail.
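The two distances above translate into a couple of subtractions. This is only a sketch of my reading of the construction: it assumes the long tail is on the high side, the sample values are invented, and tube_radii is a hypothetical helper name.

```python
import statistics

def tube_radii(values):
    """Return (short_tail_radius, long_tail_radius) for the cyclide tube.

    Assumes the long tail is on the high side, so the long tail radius is
    the distance from the mean to the maximum and the short tail radius is
    the distance from the mode to the minimum.
    """
    mean = statistics.fmean(values)
    mode = statistics.mode(values)
    short_radius = mode - min(values)
    long_radius = max(values) - mean
    return short_radius, long_radius

# Hypothetical right-skewed sample, not data from the box chart.
sample = [1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 12.0, 20.0]
short_radius, long_radius = tube_radii(sample)
print(short_radius, long_radius)
```

The two radii differ in a skewed sample, which is the cyclide case. They would match only once the tube has become a torus.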

On both of the circles, the blue portions represent the curvatures of their respective tails. Here is where some assumptions kick in as well as the limitations of my tools. There are diagonals drawn from the mean and the mode to the origins of the respective curvature circles. Each has an angle associated with it. The blue curvature lines are not data driven. The curves should probably be lower. If we could rotate those red boxes in the direction of the black circular arrow while leaving the circles anchored at their convergence points, and clip the blue lines at the green clipping lines, we’d have better curvatures for the tails.

A tube would be swept around from the small circle to the large circle and continuing around to the small circle.

Here the light blue lines help us imagine the curvature circles being swept around the core of the distribution. This sweep generates the cyclide. This figure also shows the distribution as being skewed. The median eventually stands up perpendicular to the base plane. The purple line equates this standing up of the median with the moment when the distribution has enough data points to no longer be skewed. The distribution would finally be normal. The cyclide would then be a torus. The short tail radius would have to grow, and the long tail radius would have to shrink.

So how does a multidimensional normal end up with a two dimensional distribution and a one dimensional box chart? The box chart shows the aggregation of a lot of information that gets summarized into a single number, the price of the share. Notice that frequency information is encoded in the box chart quartiles, but that is not apparent.

Notice that outliers might extend the range of the dimension. They are not shown. The box chart reflects the market’s knowledge as of the time of purchase. Tomorrow’s information is still unknown. The range of the next day’s information is unknown as well. The number of data points will increase so the distribution could well become normal. But, the increase in the number of data points tomorrow is unknown.

Had we built product and income streams into the long tail, we would be out of luck.

Enjoy.