The Direction of Learning

In my recent reading, I came across “When Bayes, Ockham, and Shannon come together to define machine learning.” That led me to this figure. This figure adds to the bias and variance interactions that I wrote about in Bias. That post was extending the notion of what it took to be normal. That normality brought with it symmetry. Symmetry happens when you have learned what there was to learn. Asymmetry calls out demanding learning.

In the above figure, I annotated several events. Gradient descent brought the system to the optimum. Much like a jar of peanut butter getting on the shelf of your grocery store, there was much involved in achieving that optimum. Call that achievement an event.

Here I’ve annotated the intersections as events. On one side of the intersection, the world works one way, and on the other side, the world works another way. The phases of the technology adoption lifecycle are like that. Each phase is a world. In the figure here, all I can say is that the factors have changed their order and scale. These changes apply over an interval. That interval is traversed over time. Then, the next interval is traversed. Consider the intervals to be defined by their own logic. The transitions can be jarring. As to the meanings of these worlds, I’ll have to know more before I can let you know.

John Cook tweeted about a post on estimating the Poisson distribution from a normal. That’s backward in my thinking. You start collecting data, which initially gives you a Poisson distribution, then you approximate the normal long before normality is achieved. Anyway, Cook’s post led me to this post, “Normal approximation to logistic distribution”And, again we are approximating the logistic distribution with a normal. I took his figure used it to summarize the effects of changes to the standard deviation, aka the variance. The orange circles are not accurate. They represent the extrinsic curvature of the tail. The circle on the right should be slightly larger than the circle on the left. The curvature is the inverse of the radius. The standard deviations are 1.8 for the approximating normal on the left, and 1.6 for the approximating normal on the left. The logistic distribution is the same in both figures.

On the left, the approximation is loose and leaves a large area at the top between the logistic distribution and the approximating normal. As the standard deviation is decreased to the optimal of 1.6, that area is filled with some probability mass that migrated from the tails. That changes the shape of the tail. I do not have the means to accurately compute the tails accurately so I won’t speak to that issue. I draw to discover things.

The logistic distribution is symmetric. And, the normal that Cook is using is likewise symmetric. We are computing these distributions based on their formulas, not on data collection over time. In my earlier discussions of kurtosis, we know that while data is being collected over time, kurtosis goes to zero. That gives us these ideal distributions, but in the approximation process much is assumed. Usually, distributions are built around the assumptions of a mean of zero and a standard distribution of one. I came across a generalization of the normal that used skew as a parameter.

It turns out that the logistic distribution is subject to a similar generalization. In this generalization, skew, or the third moment is used as a parameter. These generalizations allow us to use the distributions in the absence of data.

Skew brings kurtosis with it.

In the first article cited in this post, the one that mentions Bayes, a Bayesian inference is seen as a series of distributions that arrive at a truth in a Douglas MacArthur island hopping exercise, or playing a game of Go where the intersections are distributions. It’s all dynamic and differential, rather than static in the dataset view that was created to prevent p-hunting, yet p-hunting has become the practice.

So these generalizations found skew to be an important departure from the ungeneralized forms. So we can look at these kurtotic forms of the logistic distribution. Here shown in black we can see the ungeneralized form of the logistic distribution. It has two parameters: the mean, and the standard distribution. The generalization adds a third parameter, skew. The red distribution has a fractional skew that is less than one. The blue distribution has a skew greater than one. Kurtosis is multiplicative in this distribution. The kurtosis orients the red and blue distributions via their long and short tails. Having a long tail and a short tail is the visual characteristic of kurtosis. Kurtotic distributions are not symmetrical.

Kurtosis also orients the normal. This is true of both the normal and the generalized skew-normal. In the former, kurtosis is generated by the data. In the latter, kurtosis is generated by the specification of the skew parameter. The latter assumes much.

It would be interesting to watch a skew-normal distribution approximate a skew-logistic distribution.

The three distributions in the last figure illustrate the directionality of the kurtosis. This kurtosis is that of a single dimension. When considered in the sense of an asymmetrical distribution attempting to achieve symmetry, there is a direction of learning, the direction the distribution must move to achieve symmetry.

We make inferences based on the tails involved. Over time the long tail contracts and the short tail lengthens. Statisticians argue that you can infer with kurtotic distributions. I don’t know that I would. I’d bet on the short tails. The long tails will invalidate themselves as more data is collected. The short tails will be constant over the eventual maturity, the differential achievement of symmetry, or the learning of the distribution.

This learning can be achieved when product developers learn the content of their product and make it fit the cognitive models of their users, or when marcom, training, and documentation enable users to learn the product, and lastly, changing the population so its members more closely fit the idealized population served by the product. All three of these learnings happen simultaneously, and optimally without conflict. Each undertaking would require data collection. And, the shape of the distribution of that data would inform us as to our achievement of symmetry, or the success and failure of our efforts.

The technology adoption lifecycle informs us as to the phase, or our interval and its underlying logic. That lifecycle can move us away from symmetry. We have to learn back our symmetry. The pragmatism that organizes that lifecycle also has effects within a phase. This leaves us in a situation where our prospects are unlike our customers or installed users. Learning is constant so divergence from symmetry is also constant. We cannot be our pasts. We must be our present. That is hard to achieve given the lagging indications of our distributions.

Enjoy!