The primes form the vague, diffuse cluster near the centre
August 22, 2018 11:02 AM   Subscribe

What do numbers look like? When you factorize every number from two to one million, some pairs of numbers will have similar factorizations. This visualization attempts to place each number such that its closeness to any other number is proportional to how similar their factorizations are, using a technique called UMAP.
posted by a snickering nuthatch (31 comments total) 37 users marked this as a favorite
 
There's also an associated twitter thread that includes additional images by others.
posted by a snickering nuthatch at 11:34 AM on August 22, 2018 [1 favorite]


And don't miss this zoomable version.
posted by a snickering nuthatch at 11:43 AM on August 22, 2018 [2 favorites]


This seems to produce a pretty random-looking image, and the video shows the image being "painted" as it moves through the numbers in pretty random movements, sometimes continuous and sometimes not. Can anyone explain why this is so, and why there is not more of a pattern to it?
posted by beagle at 12:14 PM on August 22, 2018 [1 favorite]


beagle, per the twitter thread, the creator of these images isn't willing to commit to a reason for it yet, how much of the perceived structures here are an artifact of the way he's done the mapper vs. inherent patterns in the numbers themselves. A few others have expressed interest and are looking to take some stabs at it.
posted by Enturbulated at 12:18 PM on August 22, 2018 [1 favorite]


how much of the perceived structures here are an artifact of the way he's done the mapper vs. inherent patterns in the numbers themselves.

I really want to see a comparable image produced by t-SNE instead of UMAP to help answer this question. From the twitter thread:
alberto cetoli: Have you also tried with t-SNE?
John Williamson [creator of the original image]: Yes, but only with many fewer numbers as it grinds to a halt with >~50k on my machine. Structure seems similar from what can be seen, with nested clusters and big clouds of primes, but less clean than UMAP -- but that might be that the clean bits are UMAP specific artifacts!
posted by a snickering nuthatch at 12:26 PM on August 22, 2018 [1 favorite]


Senator, I served with the Mandelbrot Set. I knew the Mandelbrot Set. The Mandelbrot Set was a friend of mine. Senator, you're no Mandelbrot Set.
posted by humboldt32 at 12:53 PM on August 22, 2018 [4 favorites]


All the line and spirals in the farther reaches are interesting! Probably an artifact of the code, as the creator said, but it'd be cool if not entirely.
posted by tavella at 1:29 PM on August 22, 2018 [2 favorites]


These high-dimensional vectors are then squashed down from thousands of dimensions for each number to just two dimensions

There's, uh, some missing steps in the description here.
posted by eviemath at 1:33 PM on August 22, 2018 [7 favorites]


Actual description of the UMAP algorithm appears to be here (arXiv).
posted by eviemath at 1:36 PM on August 22, 2018 [3 favorites]


The representations for a number and its square are the same, so plotting all squares up to a million is the same as just plotting the first 1000 numbers.

There are only 168 primes < 1000, so ~99.8% of the columns in the matrix are all 0 except for a single 1 (on the diagonal), and would end up basically being useless. Thank god for sparse matrices I guess :)

Anyway, UMAP, huh. Wonder what the pictures from other algorithms would look like
posted by kleinsteradikaleminderheit at 2:24 PM on August 22, 2018 [1 favorite]


The numbers plotted include all integers 2...10^6, so there are non-squares (and primes) greater than 1000 in there.
posted by a snickering nuthatch at 2:35 PM on August 22, 2018


Yes, but the squares < 1e6 are exactly the numbers < 1000, squared. And a number and its square have the same set of prime factors. Talking about the plot that shows only the squares here -- it's showing the numbers < 1000.
posted by kleinsteradikaleminderheit at 2:51 PM on August 22, 2018 [1 favorite]


My god. It's full of stars!
posted by Quackles at 4:21 PM on August 22, 2018


As the author notes, this has as much to do with the geometry of the dimension-reducing algorithm as any real facts about the numbers. Doesn't make it less beautiful, but may make it less meaningful.

For a much simpler example of putting the numbers on a plane, see the prime number spiral.
posted by Nelson at 5:09 PM on August 22, 2018 [2 favorites]


To reiterate what eviemath said, the linked article doesn't really sufficiently describe the dimensional reduction method. I guess it's probably not what I intuitively figured before getting further in the article, which would be a plot where the y-axis was integer value of numbers as large as n and the x-axis was indexed by prime number up to sqrt(n), with "hotter" colors representing higher counts for that particular prime factor for any given number (e.g. 8 would be 8 steps up on the y-axis and would have a comparatively brighter dot at x=1 than 4).
posted by invitapriore at 6:26 PM on August 22, 2018


There's some stuff on the Hacker News thread too.
posted by RobotVoodooPower at 8:13 PM on August 22, 2018


Here, for the first time, is a two-dimensional view of the first trillion integers: ◄

You can clearly see the numbers appear to form some kind of shape, small at the start and large at the end. What does this mean? More research is required.
posted by East Manitoba Regional Junior Kabaddi Champion '94 at 7:02 AM on August 23, 2018 [4 favorites]


That is fantastic. There's a link in the Hacker News thread to a video of the author of UMAP explaining it in fairly understandable terms.

I imagine there's not much new there if you're a mathematician, but I would love to have a big, interactive map of that image annotated by one.
posted by lucidium at 8:50 AM on August 23, 2018 [2 favorites]


It would be interesting just to see which number goes with any particular dot on mouseover. That should make it easier to understand what up with those starbursts etc.
posted by a snickering nuthatch at 8:56 AM on August 23, 2018 [1 favorite]


The universe was made on purpose, the circle said. In whatever galaxy you happen to find yourself, you take the circumference of a circle, divide it by its diameter, measure closely enough, and uncover a miracle--another circle, drawn kilometers downstream of the decimal point. There would be richer messages farther in. It doesn't matter what you look like, or what you're made of, or where you come from. As long as you live in this universe, and have a modest talent for mathematics, sooner or later you'll find it. It's already here. It's inside everything. You don't have to leave your planet to find it. In the fabric of space and in the nature of matter, as in a great work of art, there is, written small, the artist's signature. Standing over humans, gods, and demons, subsuming Caretakers and Tunnel builders, there is an intelligence that antedates the universe. The circle had closed. She found what she had been searching for.
posted by dmd at 11:01 AM on August 23, 2018 [1 favorite]


Hmm, so, per the links in the original post:

1. Each number up to 10,000 (or whichever upper bound you choose) is assigned a vector based on its prime factorization. This creates a matrix, where the columns correspond to the prime numbers up to 10,000 (or whichever, and the rows correspond to sets of prime factors. Multiplicities of prime factors aren't counted, so the row for 6 is [1 1 0 0 ... ] (having a factor of 2 in the first column, and a factor of 3 in the second column), but 12 is also assigned this same row even though it has two copies of 2 in its prime factorization, as is 18, 36, etc. The matrix doesn't have repeated rows for these different numbers, just the one row for the combo of having prime factors of 2 and 3 but no other prime factors.

2. The UMAP algorithm sorts data into clusters, but unlike some other clustering algorithms, it also maintains some structure within the clusters, and the spaces between clusters have useful info too. To do this, you have to define a metric, i.e. some measure of "distance" on your data. There are a variety of choices for this. For data that consists of binary sequences (ordered lists of 1s and 0s, like our data here), a commonly used one is the Hamming metric, which counts the number of columns in which two rows differ (sometimes normalized based on the length of the sequence/number of columns). For example, [1 1 0 0] and [1 0 1 0] would have a Hamming distance of 2, since they differ in the second and third entries. But [1 1 0 0] and [0 1 0 0] would have a Hamming distance of 1.

In the case linked in this post, however, they used the dot product of two row vectors as a metric. What does this mean? Well, the dot product of [1 1 0 0] and [1 0 1 0] is the sum of the products of each corresponding entry: 1x1 + 1x0 + 0x1 + 0x0 = 1. The dot product of [1 1 0 0] and [0 1 0 0] is 1x0 + 1x1 + 0x0 + 0x0 = 1. The dot product of [1 0 0 0] and [0 1 0 0] is 1x0 + 0x1 + 0x0 + 0x0 = 0. They mention using the cosine of the angles between vectors, so this dot product is actually normalized by the length of the vectors involved, so the final calculated distance between [1 1 0 0] and [1 0 1 0] is their dot product divided by (length of [1 1 0 0] x length of [1 0 1 0]) = 1/(2x2) = 1/4. The final calculated distance between [1 1 0 0] and [0 1 0 0] is 1/(2x2) = 1/4 as well. But the calculated distance between [1 1 0 0] and [1 0 0 0] would be (1x1 + 1x0 + 0x0 + 0x0)/(2x1) = 1/2. Note that this is very much not related to the Hamming distance between the vectors. In particular, vectors with a greater Hamming distance will have a smaller numerator in this dot product calculation. Vectors with more prime factors will also have a larger denominator. And the distance between any pair of primes (or powers of a single prime, since they have the same vector as the individual prime) will be zero.

That's... an odd choice.

3. So, in the image created, they say that the primes are clustered in a nice little cloud around the center of the image. So the center of the image includes a big cluster of numbers that have been labeled as distance zero from each other, or have a large number of prime factors. The points farthest away from this cluster, if I'm understanding the general idea of the UMAP algorithm, will be points with a largest distance from any individual prime number, so those would be numbers that have two prime factors (like 6 = 2x3), which would have, eg., distance from 2 of (dot product of their vectors)/[(length of vector for 2)x(length of vector for 6)] = 1/(1x2) = 1/2. I conjecture that the loops correspond to groups of numbers with few prime factors that all share just a single prime factor in common - like, one loop perhaps corresponds to the set {2, 6=2x3, 10=2x5, 14=2x7, 22=2x11, ...}. I conjecture that the loops are more likely to be clustered based on the smallest prime factor a set of numbers have in common (eg. that 6 would be in the loop above, rather than in a loop {3, 6=3x2, 15=3x5, 21=3x7, ...}, because multiples of 2 occurs 3/2 times as frequently as multiples of 3 in the list of numbers up to 10,000 (or whichever upper bound you choose), so any reasonable clustering algorithm would probably pick out being a multiple of 2 as a more important feature than being a multiple of 3. So this set of products of 2 with a single other prime is probably the biggest loop that is far from the center. But there might be a loop for products of 3 with a single other prime larger than itself (so, not including 6 = 2x3)? Perhaps some of the loops not as far out correspond to sets of products of 2 with two other larger primes, sets of products of 3 with two other larger primes, etc.? Exactly what the clusters are would definitely depend on the specifics of the UMAP algorithm and how it goes about finding clusters.

4. Almost none of this depends on any properties of the prime numbers themselves. Almost. What we are looking at is clustering binary sequences, using a very odd choice of "distance" metric. Except the set of binary sequences that we have doesn't include quite all possible binary sequences of a given length. We throw some out. For example, the set of all binary sequences of length 3, listed as rows of a matrix, would give the following:
1 0 0 <> 2, 4, 8, ...
0 1 0 <> 3, 9, 27, ...
0 0 1 <> 5, 25, ...
1 1 0 <> 6, 12, 18, 36, ...
1 0 1 <> 10, 20, 50, 100, ...
0 1 1 <> 15, 45, 75, ...
1 1 1 <> 30, 60, 90, ...

Notice that the largest consecutive number we can get in this representation is 6. 7 is a new prime number, so would need a fourth column. So the matrix for numbers up to (and including) 6 is missing the last three rows, and is just:
1 0 0 <> 2, 4
0 1 0 <> 3
0 0 1 <> 5
1 1 0 <> 6

Going the other direction, say we wanted the matrix of these binary representations of numbers up to and including 20. Well, that will have to include all prime numbers up to 20, which are: 2, 3, 5, 7, 11, 13, 17, 19. There are eight of them, so our matrix will need eight columns. Let's start listing:
* I'll start with the row for 2
* then add the row for 3
* then write down "4" as corresponding to the row for 2
* then add a row for 5
* then add a row for 6 = 2x3
* then add a row for 7
* then write down "8" as corresponding to the row for 2
* then write down "9" as corresponding to the row for 3
* then add a row for 10, ... etc.

1 0 0 0 0 0 0 0 <> 2, 4, 8, 16
0 1 0 0 0 0 0 0 <> 3, 9
0 0 1 0 0 0 0 0 <> 5,
1 1 0 0 0 0 0 0 <> 6, 12, 18
0 0 0 1 0 0 0 0 <> 7
1 0 1 0 0 0 0 0 <> 10, 20
0 0 0 0 1 0 0 0 <> 11
0 0 0 0 0 1 0 0 <> 13
1 0 0 1 0 0 0 0 <> 14
0 1 1 0 0 0 0 0 <> 15
0 0 0 0 0 0 1 0 <> 17
0 0 0 0 0 0 0 1 <> 19

Looking at this matrix, my guess is that with this weird dot product metric, applying UMAP we'd have a cluster of {2 (and 4, 8, 16), 3 (and 9), 5, 7, 11, 13, 17, 19} near the center of the image. Then clusters for {6 (and 12, 18), 10 (and 20), 14} and for {15} farther out, both at a similar distance from the center cluster. But the point for 6 would be on the side of the cluster for {6, 10, 14} farthest from the point for 15, and the distance between these two (1/4) would be about half of the distance of each of these smaller clusters from their farthest points in the center cluster (1/2).

There might be some information here about distances between primes based on which rows from the complete matrix of all binary sequences of a given length that we throw out. That is, the difference between the UMAP image here and the UMAP image for the complete matrix of binary sequences may give useful information. If the focus is on what's not in the picture, then this weird choice of distance metric may be useful.

Otherwise, I'm not sure that this metric that in some sense reverses the Hamming distance between the binary sequences representing different numbers is a productive one to consider, or that we're getting much more information from the picture than knowing that about half of the numbers less than 10,000 are multiples of 2, a little less than a third of them are multiples of 3 but not of 2, a bit less than 1/5 of them are multiples of 5 but not 2 or 3, etc.
posted by eviemath at 11:53 AM on August 23, 2018 [4 favorites]


Otherwise, I'm not sure that this metric that in some sense reverses the Hamming distance between the binary sequences representing different numbers is a productive one to consider

Unless I've got something wrong, the "cosine distance" is the complement (1-X) of the normalized dot product (which itself is sometimes called "cosine similarity". So the cosine distance wouldn't reverse the sense of the Hamming distance.
posted by a snickering nuthatch at 12:51 PM on August 23, 2018 [1 favorite]


Ah, I see. The normalized dot product is equal to the cosine of the angle between two vectors, but if "cosine distance" means 1 - cos(angle), that makes much more sense as a distance metric ... though fails to explain why all of the separate, individual primes would be clustered together?
posted by eviemath at 2:11 PM on August 23, 2018


Apart from the choice of distance metric, there's at least a couple of technical details that may explain some of the apparent structures. First, there are a couple of parameters in UMAP which result in different amounts of global and local structures being preserved and how densely points can be compressed together. Spontaneous patterns which do not represent anything meaningful in the data tend to emerge with at least some choice of parameters.
Second, because even finding the nearest neighbour of a point becomes computationally expensive in high dimensions, UMAP uses approximations. It might add some additional structures to the result.
But what is interesting is that there seem to be several different kinds of structures. What makes the factorisations in the squiggly fireworks explosion thing on the right different from everything else?
posted by ikalliom at 2:29 PM on August 23, 2018 [2 favorites]


though fails to explain why all of the separate, individual primes would be clustered together?

Think of it like this. Oil (lipid) molecules have no particular attraction to one another. They end up clustered together in bubbles when in water, though, because the water likes itself so much more.

Likewise, primes have only one 1 in their vector, so only one chance to be attractive to other numbers, most of which will be indifferent to it. Composite numbers have better chances to get close to each other. The primes are lumped together not because of any particular affinity but because they're outcasts.
posted by a snickering nuthatch at 3:04 PM on August 23, 2018


Jpfed, is that a useful clustering though? Wouldn't a prime (or prime power) be more closely related to other multiples of that prime than it would be to other primes? Is there a situation where clustering by number of distinct prime factors is more relevant? (This is not my particular area of expertise, so there very well may be!)
posted by eviemath at 6:58 PM on August 23, 2018


I don't know if it's useful. I think it's probably an emergent accident of the primes being kinda shitty at getting close to neighbors. I bet that ball of primes is oriented to point each prime towards some of its multiples, but we can't tell without a mouseover dealie or similar.

Is there a situation where clustering by number of distinct prime factors is more relevant? (This is not my particular area of expertise, so there very well may be!)

I really like this idea! There are lots of ways you could choose to represent the prime factorization as a vector, and lots of possible distance metrics, to emphasize different kinds of similarity.

For example, if the particular prime factors were totally irrelevant and you just wanted to capture the distribution of their powers, you could take the prime factorization and make a vector out of the list of powers, sorted. So 75 = 3^1 * 5^2 would map to just {2, 1, 0, 0, ...}. This would make 75 related to all the other numbers p*q^2 for primes p and q, like 12, 18, 20, ...

This is almost complementary to what the OP did, because OP flattened out the powers and only cared about the presence or absence of each prime factor.
posted by a snickering nuthatch at 8:53 PM on August 23, 2018


I like your enthusiasm, and do not want to discount the idea of looking at domething just because it is cool! But my question was more along the lines of: does it give us any information that is useful for learning about anything else? :P

I'm also dubious still of the idea that a clustering algorithm that claims that the structure within clusters and the spaces between clusters have mathematically relevant meaning would end up grouping all far away outliers together into one central cluster. Some sort of axes/labelling on the graph, and/or a mouse over as suggested would be helpful!
posted by eviemath at 2:33 AM on August 24, 2018


The primes (in white here) are sort of spread out over the whole image, just much more densely in the centre, and they do appear to be dotted around some of the outer clusters. I couldn't get the 3D viewer thing to work on my old laptop, but maybe that has labelling?
posted by lucidium at 3:07 AM on August 24, 2018


But my question was more along the lines of: does it give us any information that is useful for learning about anything else? :P

On an individual level, I can imagine people using this just as a different representation of factorization that might cause some personal understanding to click in who-knows-what way. In the frontiers-of-human-knowledge sense I think the main contribution of the linked work is in the field of Cool Looking Stuff. So, uh, probably not... ¯\_(ツ)_/¯
posted by a snickering nuthatch at 6:07 AM on August 24, 2018 [1 favorite]


The main page has been updated with some more tests, two million and eight million integer versions, and a poster for purchase.
posted by lucidium at 1:37 PM on September 17, 2018


« Older Beyoncé meets Botticelli   |   Foul Play: Paid In Mississippi Newer »


This thread has been archived and is closed to new comments