Oh, very cool, thanks! Spoiler tag in markdown is:

:::spoiler
text here
:::

Heh, sure.

Promote $f$ from a function to a linear operator on the space of functions, $\hat{f}$. The action of this operator is just "multiply by $f$". We'll similarly define $\hat{f}_1, \hat{f}_2, \ldots$, meaning to multiply by the first, second integral of $f$, etc.

Observe (with $D$ the derivative operator and $\int$ the antiderivative operator, this is just integration by parts):

$$\int \hat{f} = \hat{f}_1 - \int \hat{f}_1 D$$

Now we can calculate what we get when applying this identity $n$ times. The calculation simplifies when we note that all terms are of the form $(-1)^{k-1}\hat{f}_k D^{k-1}$. Result:

$$\int \hat{f} = \hat{f}_1 - \hat{f}_2 D + \hat{f}_3 D^2 - \cdots + (-1)^{n-1}\hat{f}_n D^{n-1} + (-1)^n \int \hat{f}_n D^n$$

Now we apply the above operator to a polynomial $p$:

$$\int f p \, dx = f_1 p - f_2 p' + f_3 p'' - \cdots$$

The sum terminates because a polynomial has only finitely many nonzero derivatives.

Use integration by parts:

$$\int p(x) f(x)\,dx = p(x) F(x) - \int p'(x) F(x)\,dx,$$

where $F$ is an antiderivative of $f$.

Then $p'$ is another polynomial (of smaller degree), and $F$ is another "nice" function, so we recurse.
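To see the recursion in action, here's a small worked example of my own (not from the original comment), taking $p(x) = x^2$ and $f(x) = e^x$, whose antiderivative is itself:

$$\int x^2 e^x\,dx = x^2 e^x - \int 2x\, e^x\,dx = x^2 e^x - \Big(2x\, e^x - \int 2 e^x\,dx\Big) = (x^2 - 2x + 2)\,e^x + C.$$

Each integration by parts knocks one degree off the polynomial, so the process stops after two steps.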

Other people have mentioned sites like Mechanical Turk. Just to add another thing in the same category, apparently now people will pay you for helping train language models:

https://www.dataannotation.tech/faq?

Haven't tried it yet myself, but a roommate of mine has and he seems to have had a good experience. He's mentioned that sometimes people find it hard to get assigned work by their algorithm, though. I did a quick search to see what their reputation was, and it seemed pretty okay:

Linkpost for: https://pbement.com/posts/threads.html

Today's interesting number is 961.

Say you're writing a CUDA program and you need to accomplish some task for every element of a long array. Well, the classical way to do this is to divide up the job amongst several different threads and let each thread do a part of the array. (We'll ignore blocks for simplicity, maybe each block has its own array to work on or something.) The method here is as follows:

for (int i = threadIdx.x; i < array_len; i += 32) {  // stride by the number of threads (32 here)
    arr[i] = ...;
}

So the threads make the following pattern (if there are 8 threads):

⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫🟥🟧🟨🟩🟦🟪⬛🟫

This is for an array of length 26. We can see that the work is split as evenly as possible between the threads, except that threads 0 and 1 (black and brown) have to process the last two elements of the array while the rest of the threads have finished their work and remain idle. This is unavoidable because we can't guarantee that the length of the array is a multiple of the number of threads. But this only happens at the tail end of the array, and for a large number of elements, the wasted effort becomes a very small fraction of the total. In any case, each thread will loop $\lceil 26/8 \rceil = 4$ times, though it may be idle during the last loop while it waits for the other threads to catch up.

One may be able to spend many happy hours programming the GPU this way before running into a question: What if we want each thread to operate on a contiguous area of memory? (In most cases, we don't want this.) In the previous method (which is the canonical one), the parts of the array that each thread worked on were interleaved with each other. Now we run into a scenario where, for some reason, the threads must operate on contiguous chunks. "No problem," you say, "we simply break the array into chunks and give a chunk to each thread."

const int chunksz = (array_len + blockDim.x - 1)/blockDim.x;  // ceiling of array_len / blockDim.x
for (int i = threadIdx.x*chunksz; i < (threadIdx.x + 1)*chunksz; i++) {
    if (i < array_len) {  // the last chunk may run past the end of the array
        arr[i] = ...;
    }
}

If we size the chunks at 3 items, that won't be enough (8 threads times 3 items only covers 24 of the 26 elements), so again we need $\lceil 26/8 \rceil = 4$ items per chunk. Here is the result:

⬛⬛⬛⬛🟫🟫🟫🟫🟥🟥🟥🟥🟧🟧🟧🟧🟨🟨🟨🟨🟩🟩🟩🟩🟦🟦

Beautiful. Except you may have noticed something missing. There are no purple squares. Though thread 6 is a little lazy and doing 2 items instead of 4, thread 7 is doing absolutely nothing! It's somehow managed to fall off the end of the array.

Unavoidably, some threads must be idle for a total of $I = t\lceil n/t \rceil - n$ loops, where $n$ is the array length and $t$ is the number of threads. This is the conserved total amount of idleness. With the first method, the idleness is spread out across threads. Mathematically, $I$ can be no greater than $t - 1$ regardless of array length and thread number, and so each thread will be idle for at most 1 loop. But in the contiguous method, the idleness is concentrated in the last threads. There is nothing mathematically impossible about $I$ being as big as the chunk size $\lceil n/t \rceil$ or bigger, and so it's possible for an entire thread to remain unused. Multiple threads, even. Eg. take $n = 9$ and $t = 8$:

⬛⬛🟫🟫🟥🟥🟧🟧🟨

3 full threads are unused there! Practically, this shouldn't actually be a problem, though. The number of serial loops is still the same, and the total number of idle loops is still the same. It's just distributed differently. The reasons to prefer the interleaved method to the contiguous method would be related to memory coalescing or bank conflicts. The issue of unused threads would be unimportant.
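For anyone who wants to play with the arithmetic, here is a quick host-side sketch in plain C (my own addition, not from the linked post); chunk_report, array_len, and nthreads are made-up names standing in for the kernel's array length and blockDim.x. It reproduces the per-thread loads of the two diagrams above:

#include <stdio.h>

/* How many elements each thread gets under the contiguous-chunk scheme. */
void chunk_report(int array_len, int nthreads) {
    int chunksz = (array_len + nthreads - 1) / nthreads;  /* ceiling division */
    int unused = 0;
    for (int t = 0; t < nthreads; t++) {
        int lo = t * chunksz;
        int hi = (t + 1) * chunksz;
        if (hi > array_len) hi = array_len;
        int work = (hi > lo) ? hi - lo : 0;
        if (work == 0) unused++;
        printf("thread %d: %d elements\n", t, work);
    }
    printf("idle loops: %d, fully unused threads: %d\n",
           nthreads * chunksz - array_len, unused);
}

int main(void) {
    chunk_report(26, 8);  /* the 26-element diagram: thread 7 gets nothing */
    chunk_report(9, 8);   /* the 9-element diagram: threads 5, 6, 7 get nothing */
    return 0;
}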

We don't always run into this effect. If $n$ is a multiple of $t$, all threads are fully utilized. Also, we can guarantee that there are no unused threads for $n$ larger than a certain maximal value. Namely, take $n = (t-1)^2$: then $\lceil n/t \rceil = t - 1$, and so the idleness is $t(t-1) - (t-1)^2 = t - 1$, exactly one full chunk. But if $n$ is larger than this, then one can show that all threads must be used at least a little bit.

So, if we're using 32 CUDA threads, then when the array size is $961 = 31^2$, the contiguous processing method will leave thread 31 idle. And 961 is the largest array size for which that is true.
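As a sanity check (again my own sketch, not from the post), here is a brute-force scan over array sizes for 32 threads that reports the largest size whose last chunk starts past the end of the array:

#include <stdio.h>

int main(void) {
    const int nthreads = 32;
    int largest = -1;
    for (int n = 1; n <= 100000; n++) {
        int chunksz = (n + nthreads - 1) / nthreads;  /* ceiling division */
        if (chunksz * (nthreads - 1) >= n)            /* first 31 chunks already cover the array */
            largest = n;
    }
    printf("largest array size leaving thread 31 idle: %d\n", largest);  /* prints 961 */
    return 0;
}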

So once that research is finished, assuming it is successful, you'd agree that many worlds would end up using fewer bits in that case? That seems like a reasonable position to me, then! (I find the partial-trace kinds of arguments that people make pretty convincing already, but it's reasonable not to.)

> MW theories have to specify when and how decoherence occurs. Decoherence isn't simple.

They don't, actually. One could equally well say: "Fundamental theories of physics have to specify when and how increases in entropy occur. Thermal randomness isn't simple." This is wrong because once you've specified the fundamental laws, and they happen to be reversible and not too simple, increasing entropy from a low-entropy initial state is a natural consequence of those laws. Similarly, decoherence is a natural consequence of the laws of quantum mechanics (with a not-too-simple Hamiltonian) applied to a low-entropy initial state.

Good post, and I basically agree with this. I do think it's good to mostly focus on the experimental implications when talking about these things. When I say "many worlds", what I primarily mean is that I predict that we should never observe a spontaneous collapse, even if we do crazy things like putting conscious observers into superposition, or putting large chunks of the gravitational field into superposition. So if we ever did observe such a spontaneous collapse, that would falsify many worlds.

The amount of calculation isn't so much the concern here as the number of bits used to implement that calculation. And there's no law that forces the number of bits encoding the computation to be the same in both theories. Copenhagen can just waste bits on computations that MWI doesn't have to do.

In particular, I mentioned earlier that Copenhagen has to have rules for when measurements occur and what basis they occur in. How does MWI incur a similar cost? What does MWI have to compute that Copenhagen doesn't that uses up the same number of bits of source code?

Like, yes, an expected-value-maximizing agent that has a utility function similar to ours might have to do some computations that involve identifying worlds, but the complexity of the utility function doesn't count against the complexity of any particular theory. And an expected value maximizer is naturally going to try and identify its zone of influence, which is going to look like a particular subset of worlds in MWI. But this happens automatically exactly because the thing is an EV-maximizer, and not because the laws of physics incurred extra complexity in order to single out worlds.

Right, so we both agree that the randomness used to determine the result of a measurement in Copenhagen, and the information required to locate yourself in MWI, come to the same number of bits. But the argument for MWI was never that it had an advantage on this front; rather, it was that Copenhagen uses up some extra bits in the machine that generates the output tape in order to implement the wavefunction collapse procedure. (Not to decide the outcome of the collapse, those random bits are already spoken for. Just the source code of the procedure that collapses the wavefunction and such.) Such code has to answer questions like: Under what circumstances does the wavefunction collapse? What determines the basis the measurement is made in? There needs to be code for actually projecting the wavefunction and then re-normalizing it. This extra complexity is what people mean when they say that collapse theories are less parsimonious/have extra assumptions.
