Aprillion (Peter Hozák)

https://peter.hozak.info


Comments

It's duct tape all the way down!

Bad: "Screw #8463 needs to be reinforced."

The best: "Book a service appointment, ask them to replace screw #8463, do a general check-up, and report all findings to the central database for all those statistical analyses that inform recalls and design improvements."

Consider a car that starts shaking whenever it's driven. It's uncomfortable, so the owner gets a pillow to put on the seat.

I know there are people like that, but I have to say... Aaaaaaaaaaaargh!❕❗😱

Oh, I should probably mention that my weakness is that I cannot remember the stuff well while reading out loud (especially when I focus on pronunciation for the benefit of listeners)... My workaround is to make pauses - the stuff seems to sit in working memory, and my subconscious can process it if I give it a short moment, after which I can think about it consciously too. But if I read a whole page out loud, I would have trouble even trying to summarize the content.

Similarly, a common trick for remembering names is to repeat the name out loud... That doesn't seem to improve recall very much for me: I can hear someone's name many times, and repeating it to myself doesn't seem to help. Perhaps seeing it written while hearing it might be better, but I'm not sure... By far the best method is when I want to write someone a message and have to scroll around until I see their picture - after that I seem to remember the name just fine 😹

Yeah, I myself subvocalize absolutely everything, and I am still horrified when I sometimes try any "fast" reading techniques - those drain all of the enjoyment out of reading for me, as if instead of imagining characters in a story, I were imagining p-zombies.

For non-fiction, visual-only reading cuts connections to my previous knowledge (as if the text were a wave function entangled with the rest of the universe, and by observing every sentence in isolation, I would collapse it to just "one sentence" without further meaning).

I never move my lips or tongue though, I just do the voices (and obviously not just my own voice... imagine reading Dennett without Dennett's delivery - isn't half of the experience gone? how do other people enjoy reading with most of the beauty missing?).

It's faster than physical speech for me too, usually the same speed as verbal thinking.

ah, but booby traps in coding puzzles can be deliberate... one might even say that it can feel "rewarding" when we train ourselves on these "adversarial" examples

the phenomenon of programmers introducing similar bugs in similar situations might be fascinating, but I wouldn't expect a clear answer to the question "Is this true?" without slightly more precise definitions of:

  • "same" bug
  • same "bug"
  • "hastily" cobbled-together programs
  • hastily "cobbled-together" programs ...

To me, as a programmer and not a mathematician, the distinction doesn't make practical intuitive sense.

If we can create 3 functions f, g, h so that they "do the same thing", like f(a, b, c) == g(a)(b)(c) == average(h(a), h(b), h(c)), it seems to me that cross-entropy can "do the same thing" as some particular objective function that would explicitly mention multiple future tokens.
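
For concreteness, a minimal sketch of such a triple (assuming "the same thing" is a simple average; the function bodies are hypothetical, only the call shapes are from the sentence above):

```python
def f(a, b, c):
    return (a + b + c) / 3

def g(a):  # curried version of f
    return lambda b: lambda c: (a + b + c) / 3

def h(x):  # here just the identity, so average(h(a), h(b), h(c)) == f(a, b, c)
    return x

def average(*xs):
    return sum(xs) / len(xs)

# three different shapes, one behavior
assert f(1, 2, 3) == g(1)(2)(3) == average(h(1), h(2), h(3)) == 2.0
```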

My intuition is that cross-entropy-powered "local accuracy" can approximate "global accuracy" well enough in practice that I should expect better global reasoning from larger model sizes, faster compute, algorithmic improvements, and better data.
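
One way to ground this intuition: by the chain rule of probability, the summed next-token cross-entropy over a document is exactly the negative log-likelihood of the whole document, so the "local" per-token objective and the "global" document objective coincide at the level of the training loss:

$$\sum_{t=1}^{T} -\log p_\theta(x_t \mid x_{<t}) \;=\; -\log p_\theta(x_1, \dots, x_T)$$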

Implications of this intuition might be:

  • myopia is a quantity, not a quality: a model can be incentivized to be more or less myopic, but I don't expect it will be proven possible to enforce it "in the limit"
  • instruct training on longer conversations ought to produce "better" overall conversations if the model simulates that it's "in the middle" of a conversation - follow-up questions are better than giving a final answer "when close to the end of this kind of conversation"

What nuance should I consider to understand the distinction better?

transformer is only trained explicitly on next token prediction!

I find myself understanding language/multimodal transformer capabilities better when I think about the whole document (up to context length) as a mini-batch for calculating the gradient in transformer (pre-)training - I imagine it is minimizing the document-global prediction error, not optimizing for the accuracy of just a single next token...
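
A minimal sketch of that view (assuming a `model` that maps token ids to logits; the helper name `document_loss` is mine, not from any particular library): the pre-training loss averages next-token cross-entropy over every position in the document, so one gradient step minimizes document-global prediction error rather than a single position's error:

```python
import torch.nn.functional as F

def document_loss(model, tokens):  # tokens: (batch, seq_len) of token ids
    logits = model(tokens[:, :-1])            # (batch, seq_len-1, vocab)
    targets = tokens[:, 1:]                   # each position predicts the NEXT token
    return F.cross_entropy(                   # mean over ALL positions in the document,
        logits.reshape(-1, logits.size(-1)),  # not just one "next token"
        targets.reshape(-1),
    )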

Can you help me understand a minor labeling convention that puzzles me? I can see how we can label  from the Z1R process as  in MSP because we observe 11 to get there, but why is  labeled as  after observing either 100 or 00, please?

Pushing writing ideas to external memory for my less burned-out future self:

  • agent foundations need a path-dependent notion of rationality (a toy sketch follows at the end of this list)

    • the economic world of average expected values / amortized big O applies if f(x) can go negative or you start very high
    • vs min-maxing / worst-case / risk-averse scenarios if there is a bottom (death)
  • alignment is a capability

    • the two might sound different in the limit, but the difference disappears in practice (even close to the limit? 🤔)
  • in a universe with infinite Everett branches, I was born in the subset that wasn't destroyed by nuclear winter during the Cold War - no matter how unlikely it was that humanity didn't destroy itself (it could have done so in most worlds, and I wasn't born in such a world; I live in the one where Petrov heard the Geiger counter beep in some particular pattern that made him more suspicious or something... something something anthropic principle)

    • similarly, people alive in 100 years will find themselves in a world where AGI didn't destroy the world, no matter what the odds are - as long as there is at least 1 world with non-zero probability (something something Born rule... only if some decision along the way is a wave function, not if all decisions are classical and the uncertainty comes from subjective ignorance)
    • if you took quantum risks in the past, you now live only in the branches where you didn't die (but you could be in pain or whatever)
    • if you personally take a quantum risk now, your future self will find itself only in a subset of the futures, but your loved ones will experience all your possible futures, including the branches where you die ... and you will experience everything until you actually die (something something s-risk vs x-risk)
    • if humanity finds itself in unlikely branches where we didn't kill our collective selves in the past, does that bring any hope for the future?
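
A toy sketch for the first bullet above (all payoff numbers are hypothetical): two gambles with the same expected value are interchangeable under average-case reasoning, but not under a worst-case rule once one of them can hit an absorbing bottom:

```python
# Hypothetical payoffs for two equally likely outcomes of each gamble:
safe = [+1, +1]        # expected value = 1, worst case = +1
risky = [+102, -100]   # expected value = 1, worst case = -100 ("death" if it wipes you out)

def expected_value(outcomes):
    return sum(outcomes) / len(outcomes)

def worst_case(outcomes):
    return min(outcomes)

print(expected_value(safe), expected_value(risky))  # 1.0 1.0 -> indifferent
print(worst_case(safe), worst_case(risky))          # 1 -100 -> prefer safe when there is a bottom
```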