A software developer and Linux nerd, living in Germany. I’m usually a chill dude, but my online persona doesn’t always reflect my true personality. Take what I say with a grain of salt; I usually try to be nice and give good advice, though.

I’m into Free Software, self-hosting, microcontrollers and electronics, freedom, privacy and the usual stuff. And a few select other random things, too.

  • 3 Posts
  • 783 Comments
Joined 8 months ago
Cake day: June 25th, 2024



  • That laptop should be a bit faster than mine. It’s a few generations newer, has DDR5 RAM and maybe even proper dual channel. As far as I know, LLM inference is almost always memory-bound. That means the bottleneck is your RAM speed (and how wide the bus between CPU and memory is). So whether you use SyCL, Vulkan or even the CPU cores shouldn’t have a dramatic effect. The main thing limiting speed is that the computer has to transfer gigabytes worth of numbers from memory to the processor on each step. So the iGPU or processor spends most of its time waiting for memory transfers. I haven’t kept up with development, so I might be wrong here, but I don’t think more than single-digit tokens/sec is possible on such a computer (there’s a rough back-of-the-envelope estimate after this comment). It’d have to be a workstation or server with multiple separate memory channels, or something like a MacBook with Apple silicon and its unified memory. Or a GPU with fast VRAM on it. Though, you might be able to do a bit more than 3 t/s.

    Maybe keep trying the different computation backends. Have a look at your laptop’s power settings as well. Mine is a bit slow on the default “balanced” power profile. It speeds up once I set it to “performance” or gaming mode. And if you can’t get llama.cpp compiled, maybe just try Ollama or Koboldcpp instead. They use the same framework underneath and might be easier to install. And SyCL might prove to be a bit of a letdown. It’s nice, but it seems few people are using it, so it might not be very polished or optimized.
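
    To put rough numbers on the memory-bound argument above, here’s a back-of-the-envelope sketch. The bandwidth and quantization figures are assumptions for illustration, not measurements:

```python
# Back-of-the-envelope: memory-bound inference means every token
# needs roughly one full pass over the model weights, so
#   tokens/s  <=  RAM bandwidth / model size.
# All numbers below are assumptions, not measurements.

model_params = 8e9        # 8B-parameter model
bytes_per_param = 0.5     # ~4-bit quantization (Q4)
model_bytes = model_params * bytes_per_param  # ~4 GB of weights

ram_bandwidth = 50e9      # ~50 GB/s, plausible for dual-channel DDR5

print(f"ceiling: {ram_bandwidth / model_bytes:.1f} tokens/s")
# -> ~12.5 tokens/s as a hard ceiling; real-world numbers land
#    well below that, which is consistent with the ~3 t/s above.
```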


  • I’m not sure what kind of laptop you own. Mine does about 2-3 tokens/sec if I’m running an 8B parameter model. So your last try seems about right. Concerning the memory: Llama.cpp can load models “memory mapped”. That means the system decides which of the necessary parts to load into memory. It might all be in there, but it doesn’t count as active memory usage. I believe it counts towards the “cached” value in the statistics. If you want to make sure, you have to force it not to memory-map the model. In llama.cpp that’s the parameter --no-mmap. I have no idea how to do it in gpt4all-chat. But I’d say it’s already loaded in your case, it just doesn’t show up as used memory, since it’s memory-mapped.
    Maybe try a few other programs as well, like one of: Ollama, Koboldcpp, llama.cpp, and see how they do. And I wouldn’t run full-precision models on an iGPU. Keep it to quantized models. Q8 or Q5… or Q4…
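
    If you end up scripting this yourself, here’s a minimal sketch with the llama-cpp-python bindings that forces the model fully into resident memory, same idea as --no-mmap. The model path is a placeholder, and I’m assuming a quantized GGUF file:

```python
# Minimal sketch using the llama-cpp-python bindings.
# use_mmap=False loads the whole model into resident memory,
# so it shows up as "used" instead of "cached" in the stats.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b.Q4_K_M.gguf",  # placeholder path
    use_mmap=False,  # disable memory mapping (like --no-mmap)
    n_ctx=2048,      # context window
)

out = llm("Briefly explain memory mapping.", max_tokens=64)
print(out["choices"][0]["text"])
```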



  • Yeah, I hope we someday manage to transition to renewables and get cheap and relatively clean energy. I’m living in a country which isn’t sitting on huge oil reserves, so I’d say it’d be clever if we made an effort… And we kind of do. But it’s probably a bit uncoordinated. And there are people lobbying for the opposite… (And it seems it’s a big undertaking.)

    I hope AI is going to get a bit more democratized in the future. And as you said, more efficient. It’ll probably be a combination of factors: more efficient hardware, custom-built LLMs tailored to specific use-cases, scientific progress… I’d like more affordable hardware to run LLMs at home. I think something like Apple processors with their “unified memory” might be promising. I heard LLMs run pretty well on modern MacBooks, without any separate gaming graphics card.

    And I’m not even sure how it’ll turn out. Sure, the AI companies predict a high demand for AI, and they’re building datacenters and need new power plants to power all of that. But I’m not totally convinced. Maybe that’s part of the big AI hype, and it’ll turn out there is far less demand than what they tell their investors. Or they’re unable to keep up the pace and it’ll take longer until AI is intelligent enough to do all the things they envisioned. AI will be some part of the world’s electricity bill, though.



  • Yeah, I just think it’s dumb that it’s a mandatory fee. And unfair too, because it isn’t tied to income or anything like that. I’d handle it through taxes. And trim it down a bit; nobody needs, for example, the same election-night show with the same opinions, just produced once by ARD and once by ZDF separately. For me, €15 a month would be the limit; I also cancelled Netflix when they got even more expensive. Only here I don’t get a choice. And they should cooperate more with other public broadcasters, e.g. buy Dr. Who from the BBC. And if we’re already looking at America, I think we should take a cue from the idea that things financed collectively then also belong to the citizens. I think we’d be entitled to usage rights if we’re the ones paying for it…

    But yes, in general we’re much better off with our media than some other countries. Especially those two. And here, people mostly do make an effort to practice actual journalism. There are actually plenty of nice shows on our public broadcasters, too. And things like the Aktuelle Stunde or regional stuff would probably be hard to get funded otherwise.



  • Thanks anyways. I guess it’s just a hard problem to tackle. With freedom comes the freedom to abuse it. And yes, the internet has been designed to be very agnostic about what it gets used for. I think it’s a super impressive invention. And it’s very successful if we measure that by looking at how omnipresent it is now. I’m even more impressed when I look at the age of the protocols and the design that powers the foundation of it to this date. A lot of it was adopted around 50 years ago, and the particular design choices scale so well that they pretty much still power an entirely different world 50 years later. I don’t think it’s humanly possible to do a substantially better job at something… But yeah, that doesn’t take away from other things and consequences. I’m often a fan of the analogy with tools. The internet is a tool, and very much like a hammer it can be used to help build a house, or tear it down… It’s not exactly the tool’s fault what it gets used for.

    I’m now getting really off-topic for this community, so I’ll try to make it short: I think abstraction is a very elemental design choice and what makes the internet great. The lower layers transport arbitrary stuff, and that’s what allowed us to build phones on top of it, watch TV over it… things nobody envisioned half a century ago. We’d completely cripple it in that regard by removing that abstraction between the layers. And that’s what makes me think the internet (as in the transport layers) can’t be where we bake ethics in. It has to happen at the top, where things get applied and the individual platforms and services reside. (There’s a toy sketch of what I mean after this comment.)

    I’m sorry, it’s way more complicated than that and more a topic for a long essay, and lots of it wouldn’t be very “casual” to read, as you said. I don’t think it’s a sad story, though. It’s just one taking place in the real world, where things are intertwined, have consequences and often turn out in a way no one anticipated. It’s just complex and the world is a varied place. And this is highly political. I agree.
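
    A toy illustration of that layering point (Python sockets; the port and payload are arbitrary): the transport layer just moves opaque bytes and never inspects what application they belong to.

```python
# Toy demo: TCP (the transport layer) carries opaque bytes.
# It neither knows nor cares whether they are text, video or voice;
# meaning only exists in the applications at the endpoints.
import socket
import threading

srv = socket.create_server(("127.0.0.1", 9000))  # bind + listen first

def echo_once():
    conn, _ = srv.accept()
    with conn:
        conn.sendall(conn.recv(1024))  # echoed back, never inspected

threading.Thread(target=echo_once, daemon=True).start()

with socket.create_connection(("127.0.0.1", 9000)) as c:
    c.sendall(b"any application-layer payload at all")  # could be anything
    print(c.recv(1024))
srv.close()
```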


  • Thanks for the tips. I’ll try to remember some of that. And yes, English is dumb. But also kind of nice. I think it’s comparatively easy to learn. At least that’s what I took from my own experience with learning English in school and then, a few years later, French. French is just loads of exceptions to each and every rule, almost all verbs are irregular, half the letters are silent for some reason… But I guess English does that, too. You can’t really tell how to pronounce something just by reading the letters. Point is, I kind of enjoyed learning English. At least after overcoming the initial hurdles. And I’m exaggerating. We had a nice French teacher, and I wish I hadn’t lost most of it after school due to lack of exposure… And I think learning languages is fun, as you’re bound to learn something about different cultures as well, and it might open doors to interesting places.


  • Yeah, that just depends on what you’re trying to achieve. Depending on what kind of AI workload you have, you can scale it across 4 GPUs. Or it’ll become super slow if it needs to transfer a lot of data between those GPUs. And depending on what kind of math is involved, a Pascal-generation GPU might be perfectly fine, or it’ll lack support for some of the operations involved. So yes, of course you can build that rig. Whether it’s going to be useful in your scenario is a different question. But I’d argue, if you need 96GB of VRAM for more than just the sake of it, you should be able to tell… I’ve seen people discuss these rigs with several P40s or similar on Reddit, in some forums and in GitHub discussions of the software involved. You might just have to do some research and find out if your AI inference framework and your model do well on specific hardware.
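
    For what it’s worth, here’s a hedged sketch of how sharding one model across several GPUs typically looks with Hugging Face transformers/accelerate. The model name is just an example, and whether it performs well on P40s is exactly the kind of thing you’d have to test (Pascal has poor FP16 throughput, for instance):

```python
# Sketch only: shard one model across all visible GPUs with
# device_map="auto" (requires transformers + accelerate installed).
# The model name is an example; P40s run FP16 slowly, so FP32 might
# actually be the safer dtype on Pascal cards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-13b-hf"  # example model, pick your own
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",          # splits layers across available GPUs
    torch_dtype=torch.float16,  # see the dtype caveat above
)

inputs = tok("Hello there,", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```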