this post was submitted on 05 Jun 2024

Unable to run TabbyML with GPU on NixOS or Docker (solved on docker!) (lm.paradisus.day)

submitted 5 months ago* (last edited 5 months ago) by [email protected] to c/[email protected]

TabbyML is a self-hosted code assistant. I have been unsuccessful at running it with my Nvidia GPU. There are two ways I've tried to deploy it.

As a docker container

The docs say to use the following docker run command. Below is what I ran, modified to use the port I want:

docker run -it --gpus all \
  -p 11029:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model StarCoder-1B --device cuda

Then I get the following error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

So it would appear that I don't have the nvidia-container-toolkit installed on my machine. I went ahead and enabled it in NixOS:

hardware.nvidia-container-toolkit.enable = true;

To validate that this works, I should be able to run nvidia-smi from within a container. I can run this from the host without issue:

$ nvidia-smi
Wed Jun  5 08:14:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
...and so on

But if I test this from a container, as the Nvidia docs suggest, I am unable to access the GPU from within the container:

$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: unknown or invalid runtime name: nvidia.

Okay, so I go and read the instructions further. Install instructions state that after installation, I need to configure the runtime like so:

$ sudo nvidia-ctk runtime configure --runtime=docker
sudo: nvidia-ctk: command not found

Ah nuts. That's a bug in NixOS. I made a PR for this here: https://github.com/NixOS/nixpkgs/pull/317199 and I'm still awaiting results. I don't know if the fix will be backported to 24.05. Regardless, I wouldn't expect to need this ad-hoc configuration step when I enable the nvidia-container-toolkit option in NixOS. Anyway, this approach could still work given some more time. If you have advice on doing this, let me know.
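One thing that might be worth trying here, though it is not something tested in this post: as far as I understand it, the new nvidia-container-toolkit option exposes the GPU through a CDI spec rather than through an "nvidia" runtime. The sketch below assumes that Docker's CDI feature has to be switched on in the daemon settings and that a Docker release of 25 or later is needed; the pkgs.docker_25 attribute name is an assumption about what your nixpkgs revision provides.

```nix
{ pkgs, ... }:
{
  # Generates the CDI spec for the GPU (the option enabled earlier in this post).
  hardware.nvidia-container-toolkit.enable = true;

  virtualisation.docker = {
    enable = true;
    # ASSUMPTION: CDI support requires Docker 25+, and this attribute
    # exists in your nixpkgs revision. Adjust or drop it if not.
    package = pkgs.docker_25;
    # ASSUMPTION: CDI has to be enabled explicitly as a daemon feature.
    daemon.settings.features.cdi = true;
  };
}
```

If that evaluates and the daemon restarts cleanly, the container test would presumably request the device by its CDI name instead of using --gpus or --runtime, along the lines of: docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi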

FOUND Docker method solution

So, looking closer at people with the error message "no such runtime nvidia", I found this thread. It explains that what nvidia-ctk is supposed to do is add a "runtime" that points to the nvidia-container-runtime executable. So I tried manually adding that to my NixOS configuration using the virtualisation.docker.daemon.settings options. I was having trouble getting that working, because I needed to find the exact path to the nvidia-container-runtime executable. If you know Nix, you know that it isn't just in /usr/bin/.
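For reference, a minimal sketch of that manual approach might look like the following. The package attribute pkgs.nvidia-docker and the bin/nvidia-container-runtime path inside it are assumptions, not something confirmed in this post; check which package in your nixpkgs revision actually ships the executable.

```nix
{ pkgs, ... }:
{
  # Registers an "nvidia" runtime in Docker's daemon.json, which is roughly
  # what `nvidia-ctk runtime configure --runtime=docker` writes on other distros.
  virtualisation.docker.daemon.settings = {
    runtimes.nvidia = {
      # ASSUMPTION: the runtime binary lives in pkgs.nvidia-docker under bin/.
      # Interpolating the package yields its Nix store path, so there is no
      # need to hunt for the path by hand.
      path = "${pkgs.nvidia-docker}/bin/nvidia-container-runtime";
    };
  };
}
```

Referring to the package in the string lets Nix resolve the store path at build time, which sidesteps the "it isn't in /usr/bin/" problem.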

But that's still not a satisfying solution anyway... I shouldn't have to do this. I went in deeper and looked at the module for nvidia-container-toolkit. This module calls a script called cdi-generate.nix, which writes the output of nvidia-ctk to a file called nvidia-container-toolkit.json.

Let's go look for that file...can't find it. I do more searching...anyway, I found the solution.

The nvidia-container-toolkit is a new option in NixOS 24.05. It explicitly states in the release notes that it is supposed to replace the now deprecated virtualisation.{docker, podman}.enableNvidia options. Well, when you go look at the module that defines docker.enableNvidia you see it there at the bottom! This file actually defines the nvidia runtime!

And yes, it works. The now "deprecated" option is the one that actually works. I guess this is another bug to file with NixOS.
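In configuration terms, the working setup amounts to something like the sketch below. Only the enableNvidia option name comes from the release notes mentioned above; the rest of the fragment is the minimal surrounding config I'd assume you already have.

```nix
{
  virtualisation.docker.enable = true;
  # Deprecated in NixOS 24.05 in favour of hardware.nvidia-container-toolkit.enable,
  # but this is the option that actually defines the "nvidia" Docker runtime.
  virtualisation.docker.enableNvidia = true;
}
```

After a rebuild, the nvidia-smi container test from earlier is the quickest way to confirm the runtime is registered.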

This seems to work so far, but I still don't know why the approach using the NixOS module, described next, doesn't work.

As a NixOS module

Let's just do it the full NixOS module way (which is what I tried first). That should be easy. Let's enable the feature and set some options:

services.tabby = {
  enable = true;
  port = 11029;
  acceleration = "cuda";
};
networking.firewall.allowedTCPPorts = [ 11029 ];

It appears to be working! The VSCodium extension sees the server and prompts for an authentication token. I add the token. I type some code and set it to trigger manually... then tabby dies. Let's look at the systemd logs.

tabby[76786]: 📄 Version 0.11.1
tabby[76786]: 🚀 Listening at 0.0.0.0:11029
tabby[76786]:   JWT secret is not set
tabby[76786]:   Tabby server will generate a one-time (non-persisted) JWT secret for the current process.
tabby[76786]:   Please set the TABBY_WEBSERVER_JWT_TOKEN_SECRET environment variable for production usage.
systemd[1]: tabby.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: tabby.service: Failed with result 'exit-code'.
systemd[1]: tabby.service: Consumed 2.285s CPU time, received 121.0K IP traffic, sent 1.6M IP traffic

That's it. It's not very descriptive about what happened. I've had success running it this way using the "cpu" option for acceleration (no GPU) but that's too slow to be useful.

GPU specs

I am running an Nvidia RTX 2060 and using the proprietary drivers, version 550.

Thanks for the read. If you have any input on what to do next, let me know what I can try. Ideally, I'd like to have both options work, since I think the docker implementation may have the same problem as the NixOS module option.

[–] [email protected] 1 points 5 months ago (1 children)

Did you install the Nvidia docker toolkit?

[–] [email protected] 1 points 5 months ago (1 children)

Yes, that's the nvidia-container-toolkit I described above. It should be installed.

[–] [email protected] 2 points 5 months ago (1 children)

Switch to running without the --gpus flag, just specify the Nvidia runtime, then hop into a shell in the container and verify it's correctly showing the GPU. If not, then your runtime toolkit isn't being sourced properly. Double check the daemon config and make sure it's defaulting to nvidia.

[–] [email protected] 2 points 5 months ago

It was not configured correctly. It's a NixOS bug. Thanks for pointing out the daemon config; it's what led me to solving the docker problem.