• voodooattack@lemmy.world · 1 day ago

    LLMs are deterministic; the problem is the shared KV-cache architecture, which influences the distribution externally, i.e. the LLM is being influenced by other concurrent sessions.

    • kersplomp@piefed.blahaj.zone · 19 hours ago

      Almost all clients do some random sampling after softmax using temperature. I’m confused why someone who knows about kv caching would not know about temperature. Also shared kv cache while plausible is not standard in open source as of a year or so ago, so i’m curious what you are basing this off of. Did I miss a research paper?
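      For non-technical readers, the temperature sampling mentioned above can be sketched roughly like this. It's a toy stand-in, not any particular client's implementation (the function and variable names here are mine): logits are divided by the temperature, pushed through softmax, and the next-token index is drawn with a PRNG.

```python
import math
import random

def sample_with_temperature(logits, temperature, seed=None):
    """Scale logits by 1/temperature, softmax, then draw one index."""
    rng = random.Random(seed)                 # seeded PRNG: reproducible draws
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):             # inverse-CDF sampling over probs
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

      With a very low temperature the distribution collapses onto the argmax; with the same seed, the draw is exactly repeatable, which is the PRNG point discussed below.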

      • voodooattack@lemmy.world · 17 hours ago

        Almost all clients do some random sampling after softmax using temperature. I’m confused why someone who knows about kv caching would not know about temperature.

        I know what temperature is. Modifying the probability distribution is still not randomness, because even the random sampling is PRNG-based.

        The issue you’re not spotting is that it’s still deterministic, because a binary system cannot source entropy without external assistance or access to qubits. That’s why even OS kernels have to warm up at boot by reading all the accessible analogue signal sources they can reach, and why PRNGs still exist to begin with.
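        The determinism point is easy to demonstrate: two PRNGs seeded identically produce bit-identical streams, so no randomness appears without an external entropy source.

```python
import random

# Two PRNGs with the same seed produce bit-identical output streams;
# no entropy appears without an external source.
a = random.Random(42)
b = random.Random(42)
draws_a = [a.random() for _ in range(5)]
draws_b = [b.random() for _ in range(5)]
print(draws_a == draws_b)  # True
```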

        Also shared kv cache while plausible is not standard in open source as of a year or so ago,

        Shared KV-cache is an economic necessity for big providers; otherwise, 1M-token context windows wouldn’t be a thing.

        so i’m curious what you are basing this off of. Did I miss a research paper?

        Empirical testing, 20 years of experience coding and tinkering with simulators, and Chaos Theory basics. The papers are out there, you just gotta cross some domains to see it.

        • kersplomp@piefed.blahaj.zone · 10 hours ago

          I see, thanks for clarifying. If you’re arguing that a PRNG is not random, then you’re likely to confuse non-technical readers. Besides, whether it’s pseudorandom or actually random is an implementation detail, since /dev/random takes in real-world random signals like network packet timings.

          If it used a seeded PRNG it’s repeatable, but repeatability does not imply predictability which is what a non-technical reader might assume. Remember, most people on here are non-technical.

          Re: the KV cache thing, I don’t think that’s correct, but I don’t have the energy to prove it, sorry. A shared KV cache sounds like a security nightmare, but YMMV.

    • qqq@lemmy.world · 1 day ago

      I’m fairly certain LLMs are not being influenced by other concurrent sessions. Can you share why you think otherwise? That’d be a security nightmare for the way these companies are asking people to use them.

      • voodooattack@lemmy.world · 1 day ago

        Any shared cache of this type makes behaviour non-deterministic. The KV-cache is what does prompt caching. Look at each word of this message, then imagine what the LLM has to do to give you a new response each time. Say this whole paragraph is the first message from you, and you’ve just pressed send.

        Because the LLM is supposedly stateless, it has to read all this text from the beginning. In non-cached inference it repeats that work token by token, which is useless computation because it already responded to all of this previously. When it reaches the last token, the system starts collecting the real response, token by token; each generated token gets fed back to the model as input, and it chugs along until it either outputs a special token saying it’s done responding, or the system stops it due to a timeout, a tool-call limit, or similar. Now you’ve got the response from the LLM, and when you send the next message, this all happens all over again.
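        The loop described above looks roughly like this. It's a toy sketch: `toy_next_token` is a hypothetical stand-in for a full forward pass that re-reads the entire sequence on every step.

```python
EOS = None  # stand-in for the special "done responding" token

def toy_next_token(tokens):
    """Hypothetical model: re-reads the whole sequence, emits the next token."""
    return tokens[-1] + 1 if tokens[-1] < 3 else EOS

def generate(prompt, max_steps=10):
    tokens = list(prompt)
    response = []
    for _ in range(max_steps):        # timeout / tool-call-style limit
        nxt = toy_next_token(tokens)
        if nxt is EOS:                # model signals it is done responding
            break
        response.append(nxt)
        tokens.append(nxt)            # generated token fed back as input
    return response
```

        `generate([0])` walks 0 → 1 → 2 → 3 and stops; every step re-reads the whole, growing sequence, which is exactly the repeated work the KV-cache is meant to avoid.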

        Now imagine if Claude or Gemini had to do that with their 1 million token context window. It would not be computationally viable.

        So the solution is the KV-cache: a store where the LLM architecture keeps a relational key-value store. Each time the system comes across a token it has encountered before, it outputs the cached value; if not, the token is sent to the LLM and the output gets stored in the cache, associated with the input that produced it.
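        That reuse pattern can be sketched like this (all names here, including the toy `kv_pair` "forward pass", are hypothetical): keys/values for the longest already-cached prefix are reused, and only the new suffix gets computed.

```python
kv_cache = {}   # tuple(prefix tokens) -> list of (key, value) pairs
computed = []   # tokens that actually went through the fake forward pass

def kv_pair(token):
    computed.append(token)                 # stands in for real model work
    return (token, token * 2)

def kv_for(tokens):
    for cut in range(len(tokens), 0, -1):  # longest cached prefix wins
        hit = kv_cache.get(tuple(tokens[:cut]))
        if hit is not None:
            # reuse the cached prefix, compute only the suffix
            kv = list(hit) + [kv_pair(t) for t in tokens[cut:]]
            break
    else:
        kv = [kv_pair(t) for t in tokens]  # cold start: compute everything
    kv_cache[tuple(tokens)] = kv
    return kv

kv_for([1, 2, 3])       # computes 3 tokens
kv_for([1, 2, 3, 4])    # reuses the cached prefix, computes only token 4
```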

        So now comes the issue: allocating a dedicated KV-cache region in VRAM per user is a big deal. Again, try to imagine Gemini/Claude with their 1M-token context windows. It’s economically unviable.

        So what do the ML science buffs come up with? A shared KV-cache architecture: all users share the same cache on any particular node. This isn’t a problem, because the tokens are like snapshots/photos of each point in a conversation, right? But it is an external causal connection, and those can have effects. Two conversations that start with “hi” or “What do you think about cats?” could in theory influence one another. If the first user to use the cluster after boot asks “Am I pretty?”, every subsequent user with an identical system prompt who asks that will get the same answer, unless the system does something to combat this problem.

        Note that a token is an approximation of what the conversation means at one point in time. So while astronomically unlikely, collisions could happen in a shared architecture scaling to millions of concurrent users.

        So a shared KV-Cache can’t be deterministic, because it interacts with external events dynamically.
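        The “Am I pretty?” scenario reduces to this deliberately oversimplified, hypothetical sketch: one node-level cache keyed by the full prompt, shared by every session, so the second session gets the first session’s cached result instead of a fresh computation.

```python
shared_cache = {}   # one cache for the whole node, shared by all sessions
model_calls = []    # sessions whose request actually ran the "model"

def respond(session, prompt):
    if prompt in shared_cache:
        return shared_cache[prompt]    # served from another session's work
    model_calls.append(session)        # only a cache miss touches the model
    answer = "answer-to: " + prompt
    shared_cache[prompt] = answer
    return answer

first = respond("user-1", "Am I pretty?")
second = respond("user-2", "Am I pretty?")
# second == first, and only user-1's request ran the model
```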

        • qqq@lemmy.world · 1 day ago

          Hm this tracks to me. I’ve wondered for a bit how they deal with caching, since yes there is a huge potential for wasted compute here, but I haven’t had the time to look into it yet. Do you have a good source to read a bit more about the design decisions or is this just a hypothetical design you came up with and all of that architecture detail is “proprietary”?

          If the first user to use the cluster after boot asks “Am I pretty?”, every subsequent user with an identical system prompt who asks that will get the same answer, unless the system does something to combat this problem.

          This is very interesting to me, because I’d think they were doing something to combat that problem if they’re actually doing something multi-tenant here.

          Wouldn’t the different sessions quickly diverge and the keys would essentially become tied to a session in practice even if they weren’t directly?

          Thanks for the response, it’s definitely something I’ve been trying to understand.

          Edit here, thinking a bit more,

          So the solution is the KV-Cache. A store where the LLM architecture keeps a relational key-value store, each time the system comes across a token it has encountered before, it outputs the cached value, if not, then it’s sent to the LLM and the output gets stored into the cache and associated with the input that produced it.

          This seems like an issue, no? Because the tokens are influenced by the tokens around them in the attention blocks. Without them you’d have a problem, so what exactly would be cacheable here?

          • voodooattack@lemmy.world · edit-2 · 1 day ago

            Do you have a good source to read a bit more about the design decisions or is this just a hypothetical design you came up with and all of that architecture detail is “proprietary”?

            You’re welcome. Here’s an intro with animations: https://huggingface.co/blog/not-lain/kv-caching

            And yes. Most of the tech is proprietary. From what I’ve seen, nobody in ML fully understands it tbh. I have some prior experience from my youth from tinkering with small simulators I used to write in the pre-ML era, so I kinda slid into it comfortably when I got hired to work with it.

            Wouldn’t the different sessions quickly diverge and the keys would essentially become tied to a session in practice even if they weren’t directly?

            Yeah, but the real problem is scale, and collision risk at that scale. Token resolution erodes as the context gets larger, and can become “samey” pretty easily for standard RLHF’d interactions.

            Edit:

            This seems like an issue, no? Because the tokens are influenced by the tokens around them in the attention blocks. Without them you’d have a problem, so what exactly would be cacheable here?

            This is what they do: (from that page I linked)

            Token 1: [K1, V1] → Cache: [K1], [V1]
            Token 2: [K2, V2] → Cache: [K1, K2], [V1, V2]
            ...
            Token n: [Kn, Vn] → Cache: [K1, K2, ..., Kn], [V1, V2, ..., Vn]
            

            So the key is the token and all that preceded it. It’s a kinda weird way to do it tbh, but I guess it’s necessary because of floating point and lossy GPU precision.

      • voodooattack@lemmy.world · 1 day ago

        I didn’t say they normally aren’t. What I’m saying is that a shared KV-Cache removes that guarantee by introducing an external source of entropy.