@meso they're trying to work out a way to ban self-hosted AI but they can't so far because anybody with a good GPU can do it. But they're thinking hard.
@Moon@meso The amount of backpedaling and attempting to put the genie back in the bottle is insane, but not surprising. Seeing Tim Berners-Lee say if he could do the internet over again he'd make it easier to censor tells me everything I need to know about where tech is today.
@Moon@meso I tried to get ChatGPT to draw ASCII art of a cat with an ampersand in its mouth representing a mouse. It refused until I told it ampersands represented cinnamon rolls.
@meso there are youtube videos that walk through every step. I kind of muddled through it. I'm generating cute girls right now, but soon I'm gonna try generating text, like stories and stuff.
@Moon@Christmas_Man@meso This is the guy making 4-bit quantized models for home use: https://huggingface.co/TheBloke GPTQ models are for GPU-based inference, GGML models are for CPU-based inference (though you can get a speed boost by offloading some of the load to your GPU).
With 24GB of VRAM, you can run 13B to 20B GPTQ models with room to spare for extended (over 2048-token) context and for keeping Stable Diffusion loaded at the same time. Or you should be just about able to run 30B models with 2048 context on a headless Linux machine. Expect double-digit tokens per second; answers will pop up in seconds.
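If you want a rough idea of what the GPU route looks like, here's a minimal sketch using AutoGPTQ (the repo name and settings below are just placeholders, swap in whatever GPTQ model you actually grab from TheBloke and whatever fits your VRAM):

```python
# Rough sketch: GPU-only inference with a 4-bit GPTQ model.
# Assumes auto-gptq and transformers are installed; repo name is a placeholder.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/Llama-2-13B-chat-GPTQ"  # placeholder: pick any GPTQ repo that fits your card

tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    repo,
    device="cuda:0",       # the whole model lives in VRAM
    use_safetensors=True,
)

prompt = "Write a two-sentence story about a cat."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```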
With GGML models your RAM is the limit, and speed depends on your CPU, GPU, RAM speed, and how much you can offload to GPU/VRAM. In general it's likely to be MUCH slower than GPTQ: if you're running as big a model as will fit in your machine, expect single-digit tokens per second, and sometimes waits of over a minute for an answer. Sometimes it's worth it, sometimes not. I've heard people say the returns from 30B to 70B are quite diminished (i.e. it's not really noticeably smarter, just different).
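For the GGML/CPU route, llama-cpp-python is one easy way in. Another rough sketch, where the file name, thread count, and layer count are placeholders you'd tune to your own RAM/VRAM:

```python
# Rough sketch: CPU inference with a GGML model, pushing some layers onto the GPU.
# Assumes llama-cpp-python is installed (built with GPU support if you want offload);
# the model file name and the numbers here are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-13b-chat.ggmlv3.q4_K_M.bin",  # any GGML file downloaded from TheBloke
    n_ctx=2048,       # context window
    n_threads=8,      # CPU threads to use
    n_gpu_layers=20,  # layers offloaded to VRAM; 0 = pure CPU
)

out = llm("Q: Explain GPTQ vs GGML in one sentence. A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```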