this post was submitted on 11 Sep 2023
Technology
It is much faster than Stack Overflow for code snippets. You really need a basic skepticism about all outputs, even with an excellent model, but a basic 70B Llama 2 can generate decent Python code. When the code throws an error, pasting that error back into the prompt will almost always produce a fix. This only applies to short, single-operation tasks, but it is super useful if you already know the basics of code like variables, types, and branching constructs. It can explain APIs and libraries too.
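The paste-the-error-back loop can be sketched roughly like this. It is only an illustration: `ask_model` stands for whatever function sends a prompt to your local model and returns code, and `fake_model` is a made-up stand-in so the sketch runs on its own. Neither is a real library API. Note that `exec` runs whatever the model wrote, so only do this with code you have read or inside a sandbox.

```python
import traceback
from typing import Callable, Optional


def run_snippet(code: str) -> Optional[str]:
    """Run generated code; return the traceback text on failure, None on success."""
    try:
        # WARNING: exec runs arbitrary code. Read it first or use a sandbox.
        exec(code, {})
        return None
    except Exception:
        return traceback.format_exc()


def generate_until_it_runs(
    ask_model: Callable[[str], str], task: str, max_tries: int = 3
) -> Optional[str]:
    """Ask for code, and on failure feed the error back into the next prompt."""
    prompt = task
    for _ in range(max_tries):
        code = ask_model(prompt)
        error = run_snippet(code)
        if error is None:
            return code
        # Same thing you would do by hand: paste the code and its error back in.
        prompt = f"{task}\n\nThis code:\n{code}\n\nfailed with:\n{error}\nPlease fix it."
    return None


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model: buggy first, fixed after seeing an error."""
    if "failed with" in prompt:
        return "total = sum([1, 2, 3])"
    return "total = sum([1, 2, '3'])"


print(generate_until_it_runs(fake_model, "Sum a list of numbers."))
```

With a real model you would swap `fake_model` for a call into whatever backend you run. Capping the retries matters, because a model that keeps producing the same mistake would otherwise loop forever.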
The real value comes from integrating databases and other AI models. I currently have a combination I can talk to with a mic, and it replies with an audio clip, with an LLM generating the reply text. I'm working on integrating a database to help teach myself the computer science curriculum using free materials and a few books. Individualized education is a major application. You can also program it to act as a friend, a professional colleague, or a counselor, or ask it medical questions. There is a lot of effort going into accurate models for fields like medicine, where they can provide citations.
Even with sketchy information from basic models, they will still generate terms and hints that you can plug into a regular search engine to find new information in many instances. This will help you escape the search engine echo chambers that are so pervasive now. Heck, I even asked the 70B about meat smoker heat and timing settings, and it made better suggestions than several YouTube examples I watched and tried. I needed an industrial adhesive a couple of weeks ago and found nothing searching Google and Bing, but when I asked the 70B, it suggested six products, four of which were valid. After plugging these into search, suddenly the search engines knew of thousands of results for what I was looking for.
I honestly didn't expect it to be as useful as it really is. I turn on my computer and start the 70B first thing every day. It unloads itself from memory while idle, but I'm constantly asking it stuff. I go many days without even going online from my workstation.
Are you using ooga booga? What specs does your system have?
I do use Oobabooga a lot. I am developing my own scripts and modifying some of Oobabooga too. I also use Koboldcpp. I am on a 12th-gen i7 with 20 logical cores and 64 GB of system memory, along with a 3080 Ti with 16 GB of VRAM. The 70B 4-bit quantized model, running with 14 layers offloaded onto the GPU, generates 3 tokens a second. So it is 1.5 times faster than running on the CPU alone.
If I were putting together another system, I would only get something with AVX-512 instruction support in the CPU. That instruction set has been troublesome for CVE issues, so you'll probably need to look into this depending on your personal privacy/security threat model. The ability to run larger models is really important. You really want all the RAM. The answer to the question of how much is always yes. You are not going to get enough memory using consumer GPUs; you can only offload a few layers onto a consumer-grade GPU. I can't say how well models even larger than the 70B will perform, since memory is the bottleneck. I can't even say how a 30B or larger runs unquantized, because I can't add any more memory to my system.
As a rule of thumb, running the full, unquantized models requires about 2 GB of RAM for every billion parameters. So a 30B will require around 60 GB of memory just to load initially. Most of these models are float16, so running them at 8-bit cuts the size in half, with penalties in areas like accuracy. Running at 4-bit halves the size again. There is tuning, bias, and asymmetry in the way quantization is done to preserve certain aspects, like emergent phenomena in the original data. This is why a larger model at a lower bit width may outperform a smaller model running at full precision.
For GPUs, if you are at all serious about this, you need 16 GB of VRAM at a bare minimum. Really, we need to see a decently priced consumer option with 40-80 GB of VRAM. The thing is that GPU memory is directly tied to the compute hardware. There isn't the overhead of a memory management system like system memory has. This is what makes GPUs ideal and fast, but it is already the biggest chunk of bleeding-edge silicon in consumer hardware, and we need it to be 4× larger and cheap. That is not going to happen any time soon. This means the most accessible path to larger models is using system memory. While you'll never get the parallelism of a GPU, having CPU instructions that are 512 bits wide is a big performance boost. You also need as many logical cores as you can get. That is just my take.
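The rule-of-thumb arithmetic above can be written out as a quick sketch. It assumes the size is just parameter count times bits per weight, counts 1 GB as 1e9 bytes, and ignores context memory and runtime overhead. The function name is made up for illustration. Real quantized files come out somewhat larger, since the formats store scaling data alongside the weights.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight


# float16 is the baseline; 8-bit halves it and 4-bit halves it again.
for size in (30, 70):
    for bits in (16, 8, 4):
        print(f"{size}B at {bits}-bit: ~{weight_memory_gb(size, bits):.0f} GB")
```

By this estimate a 4-bit 70B needs roughly 35 GB for the weights, which is why it fits in 64 GB of system memory but not on a 16 GB card.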