Central Limit Theorem
I watched some YouTube videos on the central limit theorem in which, according to that theorem, a population can be sampled with samples of size 30. The presenter implied that you could take as many such samples as needed to cover the population. But the point was that each sample would have 30 entities in it.
I don’t know, but 30 seems too small. It is nowhere near 2^11. Even 2^11 is too small to get us to a symmetric normal, one without skew and kurtosis.
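The presenter's claim is easy to check with a quick simulation. This is a sketch of my own, not the presenter's code; the exponential parent population and the counts are assumptions chosen to make the skew obvious. Even with a badly skewed parent, the means of size-30 samples cluster tightly around the population mean:

```python
import random
import statistics

# A skewed parent population: exponential, far from normal.
random.seed(42)
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_means(pop, n, trials):
    """Draw `trials` samples of size n and return their means."""
    return [statistics.mean(random.sample(pop, n)) for _ in range(trials)]

means = sample_means(population, n=30, trials=2_000)

# The means cluster near the population mean (1.0 for a rate-1
# exponential), and their spread shrinks roughly as sigma/sqrt(n).
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 2))
```

The distribution of those means is far more symmetric than the parent, which is the theorem's point, whatever one thinks of 30 as a magic number.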
I drew a picture with my blunt tools. I didn’t use a sphere packing algorithm. I just drew a multiplication table. A surprise jumped out.
A normal has a circular footprint. A normal sits inside a square. So what shape are we talking about here? How do we get to 30? 5^2, or 25, is too small, and 6^2, or 36, is too large. We are talking people here, so we cannot, say, have a side of the square root of 30 people.
The red lines are ellipses sitting inside rectangles. They are not normals yet. They are pre‑normals or long‑post‑normals. They are either hyperbolic or spherical. And, somehow, according to the central limit theorem, when added together, they add up to a standard normal. That implies that their mean is 0 and their standard deviation is 1.
A circle implies the absence of a correlation. A rectangle implies the presence of a correlation, or a bias.
Notice that the samples for 1 and 36 are outside the circle. They are omitted from the population. Oh well.
Synthetic Data
Mashhood Ahmed’s discussion of synthetic data came up again out on LinkedIn. See the discussion.
Project management has long been a research topic. Software engineering research is similar. The data justifying and validating the practices has existed for a long time; it is there if anybody goes looking for it. Yes, maybe your way is different. But getting consistent data for yourself and your organization should be simple. Capture it. Analyze it. Integrate it.
Once you know the parameters and the constraint envelopes, you can generate synthetic data on those parameters and constraints. You can run a real project using those synthetic parameters and constraints and then see what your organization delivers. Capturing your outcomes lets you forecast where the parameters and constraints will take you before you go.
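A minimal sketch of that idea, under assumptions of my own: a hypothetical set of captured task durations, a simple normal fit for the parameters, and a floor as the constraint envelope. None of this is a prescription for which distribution to fit; it only shows the capture-then-generate loop.

```python
import random
import statistics

random.seed(1)

# Hypothetical captured task durations (days) from past projects.
observed = [3.1, 4.7, 2.8, 5.2, 3.9, 4.4, 6.0, 3.3, 4.1, 5.5]

# Estimate the parameters; here, a plain normal fit.
mu = statistics.mean(observed)
sigma = statistics.stdev(observed)
FLOOR = 1.0  # constraint envelope: no task finishes in under a day

def synthetic_durations(n):
    """Generate n synthetic durations on the fitted parameters,
    clipped to the constraint envelope."""
    return [max(FLOOR, random.gauss(mu, sigma)) for _ in range(n)]

batch = synthetic_durations(1_000)
print(round(statistics.mean(batch), 1))
```

Run a project plan against a batch like this, compare the synthetic forecast with what the organization actually delivers, and fold the actuals back into `observed` for the next round.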
When I talk about the technology adoption lifecycle, I know what my distributions will look like before I have any customers. I know what the processes are going to be. I know the evolution. My current obsession with regressions to the tail is just a matter of knowing what I can expect and knowing what to do about it when I see it coming. This business of financial turbulence bleeding into adjacent processes and dependent processes is trouble. How do you put that in a box? How do you deal with the coupling and cohesion of that turbulent system? How do you make it an object?
I build my long tail representing my application as I build my application. Use tells me about my tails. Does the tail match up with the requested requirements? The resulting tails validate the survey data that led to the requirements. The resulting tails confirm marketing organizations’ delivery of the appropriate users. With a user interface, there is plenty of ongoing data collection. Call them surveys if you like.
In too many correlation classes, a given correlation is arbitrarily thrown away. A correlation is actually a component of a tail. There are many tails. And, I’ve seen one system of correlations get replaced by another under the assumption that there is only one tail. When a UI control asks you to select one of three possible choices, there are three tails. Or, is that choice just pointless data to be stored? Is that choice eventually expressed by some component of the system? Three choices give you three different probabilities, and three departing Markov chains of probabilities to add to the tail of the predecessor that, assuming only one tail, led us to that control. In AI, the overall UI would have been a small world.
Knowing your expected distributions and putting synthetic data in them should not be a problem.
Open Source Software
Today, I came across a job description. They wanted a product manager for a product that aims to replace Dreamweaver. The product was written for programmers. We used Dreamweaver. Were we programmers? The product is open source software.
Open-source software development is supposed to deliver better software than other development processes. It does this because the programmer is a member of the user community. That programmer knows the carried content. Most programmers know the carrier but have to be taught the carried content. These latter programmers are not users. Those two types of programmers present us with very different propositions.
Hell, I remember an organization that produced carrier type products. That company defined the world. They wanted to become a product company. That meant listening to the outside world, listening to users and others that defined their world, their carried content. In the end, they could not make that leap.
Barcodes and Persistent Homology
In my YouTube watching, I revisited barcodes. I got it this time. Start with a collection of points. Each of those points is the center of a circle of a given radius. All the circles are the same size. Increase the radius of all those circles. At some radius, the circles begin to overlap. Continue to increase the radii. Some space gets surrounded. That surrounded space is a hole. When that happens, a hole is born. Continuing to increase those radii, the surrounded space, the hole, disappears at some radius. That marks the death of the hole.
The barcode starts at the radius where the hole is born. The barcode ends at the radius where the hole dies. That hole is exhibiting a lifecycle. That hole is a topological hole, not an algebraic hole. Tori and cyclides are topological structures that have holes. They show up in the curvatures of the tails around a normal over the lifecycle of that normal. Barcodes tell us how large the holes in those topological structures happen to be.
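Computing the barcodes for actual holes (dimension-1 homology) takes a persistent homology library such as GUDHI or Ripser. But the same birth-and-death bookkeeping can be sketched by hand for connected components (dimension 0): every point is born at radius 0, and a component dies when the growing circles merge it into another. This is my own toy point cloud, two small clusters, and a union-find over edges sorted by length:

```python
from itertools import combinations
import math

# Hypothetical point cloud: two clusters of three points each.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5.5, 6)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Union-find over points; parent[i] tracks each point's component.
parent = list(range(len(points)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

# Process edges shortest-first, as the growing radius would reach them.
edges = sorted((dist(p, q), i, j)
               for (i, p), (j, q) in combinations(enumerate(points), 2))

bars = []  # (birth, death) for each component that dies in a merge
for d, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        bars.append((0.0, d / 2))  # circles of radius d/2 first touch

for birth, death in bars:
    print(f"[{birth}, {round(death, 3)})")
```

The short bars are the merges inside each cluster; the one long bar is the merge between the clusters, and its length is what makes the two-cluster structure visible in the barcode.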
In the figure above, on the left, I put in a point cloud of blue points. These points are in a multidimensional space. I drew my first circles of radius r1. I cheated and moved the points so they enclosed that big red space, our hole. The blue points generated a deformed torus. On the right, I drew circles six points larger. That is radius r2. I overlaid those circles on the earlier ones. I failed to cover the hole. You can still see a red area in the center. Radius r2 needs to be one point larger. I added that point as shown. I did test the figure on the left, but my tools subtracted two points instead of one.
If the points came from survey data for a particular requirement, the barcode would show you the requirement’s lifecycle. In the figure on the left, removing point A representing a customer or user would prevent that hole’s birth.
If you could find the rate involved in the process of moving from r1 to r2, you could put a date on the birth and death of that requirement. The radius r1 tells you how much time you have to deliver that requirement from the start of the demand in your survey data.
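If you assume a fixed growth rate for the radius, the conversion from a bar's endpoints to calendar dates is just arithmetic. The rate, the survey date, and the bar endpoints below are all hypothetical numbers of my own, purely illustrative:

```python
from datetime import date, timedelta

# Assumed: the filtration radius grows with demand at a fixed rate.
RATE = 0.05                    # radius units per day (not measured)
survey_date = date(2024, 1, 15)  # when the survey data was collected

def radius_to_date(r):
    """Map a barcode radius to a calendar date at the assumed rate."""
    return survey_date + timedelta(days=round(r / RATE))

birth_r, death_r = 0.6, 2.1    # hypothetical bar endpoints
print(radius_to_date(birth_r))  # when demand for the requirement is born
print(radius_to_date(death_r))  # when that demand dies out
```

The gap between the survey date and the birth date is the delivery window the section describes: the time you have to ship the requirement before the demand it answers arrives.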
Enjoy!