ssh on Monsoon's Blog

Using a GPU-Accessible VS Code Server on UIUC Delta

Sun, 22 Dec 2024 00:00:00 +0000

Why I wrote this post

Many UIUC students rely on Delta for the GPU resources their research needs. Delta provides 4 ssh-enabled login nodes and many computing nodes with GPUs. Usually, we must first ssh to a login node (with a password and a DUO 2FA OTP), and then use srun to request GPU resources to run our code. In my experience, however, this workflow has several problems:

Unstable network connection: The connection drops frequently when the network is poor. Each time VS Code Remote loses its connection, you must re-enter the password and the DUO 2FA OTP (which means unlocking your phone to get the OTP) to reconnect, which is annoying, time-consuming, and distracting.
Broken OnDemand Code Server: Although you can run VS Code Remote on the login nodes over ssh, they have no GPU for debugging, and the computing nodes are not accessible by ssh. The alternatives are OnDemand Jupyter Lab and Code Server. But Jupyter Lab’s features are limited, and the Code Server is broken: when I request a Code Server on the computing nodes, the system just queues the request and then shows it as completed, without ever entering a running state.

Because of these problems, debugging GPU programs on Delta is a struggle. That’s why I wrote this blog post: by running a private Code Server on a computing node and deploying a Cloudflare Tunnel reverse proxy, you can say goodbye to these annoying problems.

How to

My solution is based on an observation about Delta: all login nodes and computing nodes are in a trusted network. There are no firewalls between them, which means you can access any port on the computing nodes from the login nodes.

The main steps of my solution are simple:

Use srun to get a tty on the computing node (e.g., on gpua042 node).
Run a Code Server on the computing node. It will listen on 0.0.0.0:8080.
Reverse proxy gpua042:8080 to a port you can reach. There are two approaches:
- Use ssh -L to forward the port to your local machine.
- Use Cloudflare Tunnel to reverse proxy the port to a public domain. This approach is more stable in poor network conditions.
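
The first step can be sketched as a small helper that prints the srun command instead of running it. The account name, partition, and time limit below are placeholders I made up for illustration, so check your own allocation for the real values.

```shell
#!/bin/sh
# Sketch: build the srun command for an interactive shell on a GPU node.
# ACCOUNT, PARTITION and TIME_LIMIT are hypothetical placeholders, not values from this post.
ACCOUNT="${ACCOUNT:-my-account-gpu}"
PARTITION="${PARTITION:-gpuA100x4}"
TIME_LIMIT="${TIME_LIMIT:-02:00:00}"

build_srun_cmd() {
    # $1: number of GPUs to request on the node (defaults to 1).
    # --pty attaches a terminal to the job, which gives the tty mentioned above.
    printf 'srun --account=%s --partition=%s --nodes=1 --gpus-per-node=%s --time=%s --pty bash\n' \
        "$ACCOUNT" "$PARTITION" "${1:-1}" "$TIME_LIMIT"
}

# Print the command; paste it on a login node once the placeholders are filled in.
build_srun_cmd 1
```

Once the shell starts, running hostname inside it shows which computing node you landed on (gpua042 in the examples here).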

Run Code Server

Download the Code Server release tarball from its GitHub repository (e.g., code-server-4.96.2-linux-amd64.tar.gz), and extract it. On the computing node, run:

cd code-server-4.96.2-linux-amd64/bin

## no auth
./code-server --bind-addr 0.0.0.0:8080 --auth none

## if port is exposed to untrusted network, use password auth
## password can be modified in ~/.config/code-server/config.yaml
./code-server --bind-addr 0.0.0.0:8080
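
The download and extract step mentioned earlier can be sketched like this. The URL layout is my assumption based on how the coder/code-server releases page names its assets, so verify it against the release you actually want.

```shell
#!/bin/sh
# Sketch: derive the release tarball name and download URL from a version number.
# The URL layout is an assumption based on the coder/code-server GitHub releases page.
VERSION="${VERSION:-4.96.2}"

tarball_name() {
    printf 'code-server-%s-linux-amd64.tar.gz\n' "$1"
}

release_url() {
    printf 'https://github.com/coder/code-server/releases/download/v%s/%s\n' \
        "$1" "$(tarball_name "$1")"
}

# Print the commands instead of running them, so nothing is downloaded by accident.
echo "curl -fLO $(release_url "$VERSION")"
echo "tar -xzf $(tarball_name "$VERSION")"
```

Extracting the tarball produces the code-server-4.96.2-linux-amd64 directory used in the commands above.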

Access Code Server

SSH Port Forwarding

ssh -L can forward a local port to a remote port. Run:

ssh -L 127.0.0.1:8080:gpua042:8080 username@login.delta.ncsa.illinois.edu

Then open http://127.0.0.1:8080 in your browser, and enjoy the Code Server!
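
Since srun may hand you a different node each time, a small helper that builds the forwarding command from the node name can save some typing. This is only a sketch: DELTA_USER is a hypothetical placeholder for your own username, and the -N flag (forward only, no remote shell) is my addition.

```shell
#!/bin/sh
# Sketch: build the ssh -L command for whichever computing node srun allocated.
# DELTA_USER is a hypothetical placeholder for your own username.
DELTA_USER="${DELTA_USER:-username}"
LOGIN_HOST="login.delta.ncsa.illinois.edu"

build_forward_cmd() {
    node="$1"            # computing node name, e.g. gpua042
    port="${2:-8080}"    # port Code Server listens on, reused as the local port
    # -N: only forward the port, do not open a remote shell.
    printf 'ssh -N -L 127.0.0.1:%s:%s:%s %s@%s\n' \
        "$port" "$node" "$port" "$DELTA_USER" "$LOGIN_HOST"
}

build_forward_cmd gpua042 8080
```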

Cloudflare Tunnel

Cloudflare Tunnel is more stable when your computer suffers from a poor network connection, but it requires a domain name.

TODO

Using an SSH Reverse Tunnel to Log Into BitaHub Containers and Hold GPUs Long-Term

Fri, 20 Oct 2023 00:00:00 +0000

Problem

Every year before CVPR, GPUs are always in short supply, and we need to borrow cards from elsewhere. USTC provides BitaHub for on-campus users, but it suffers from the same shortage of cards before CVPR. At the same time, its job-submission-based usage model is very inconvenient: submitting jobs that occupy multiple cards often requires a long wait in the queue, and its data management approach is downright user-hostile.

As the server administrator for my group, in order to make my life easier before CVPR and to avoid repeating the 2021 pre-CVPR ordeal of scrambling to allocate resources, I needed to improve the BitaHub experience:

How to hold GPUs long-term to avoid repeatedly queuing (slightly unethical, but a measure born of necessity);
How to conveniently read data from our own servers, instead of being forced to use BitaHub’s user-hostile data management model;
How to make the BitaHub GPU experience as close as possible to that of our group’s servers, lowering migration costs and improving the flexibility of resource scheduling.

Idea

Jobs in BitaHub run as docker containers, which gives us the possibility of configuring the environment we want inside the container, as long as we can somehow ssh into it.

After some investigation, I found that as long as the startup command does not stop running, a BitaHub container will keep running indefinitely and will not release its GPU resources. At the same time, BitaHub containers have network access, and the BitaHub web page even thoughtfully provides the ssh private key for the root user inside each job’s container.

These facts give us an opportunity to exploit. All we need to do is run a tunnel program inside the container so that external parties can access port 22 of the container, and then we can log in and hold the resources long-term. Moreover, since the container has network access, we can also directly mount the file systems of other on-campus servers.

Solution

The tunnel program I ended up choosing is ssh, which can create a reverse tunnel:

ssh -i  -F none -o "StrictHostKeyChecking no" -o "ServerAliveInterval 15" -v -N -R :localhost:22 jump@

On the jumpserver, configure a user jump and allow login with a specific private key, then somehow get the private key into the container (you could bake it directly into the image, but I chose a more convenient approach: create a BitaHub dataset to store it, and just add this dataset to every job).

The container’s startup command is exactly the command above (considering network fluctuations, you can wrap it in a while true loop or use autossh to reconnect automatically). Once started, it creates a reverse tunnel on a port of the jumpserver, with that port mapped to port 22 inside the container.
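
The while true wrapper could look roughly like the sketch below. The keep_alive helper and its two leading arguments are names I made up for illustration. Passing 0 as the first argument retries forever, which is what a container startup command would want.

```shell
#!/bin/sh
# Sketch: rerun a command every time it exits, a stand-in for a bare "while true" loop.
# keep_alive MAX_TRIES DELAY CMD [ARGS...]
#   MAX_TRIES: stop after this many runs; 0 means retry forever (hypothetical knob)
#   DELAY:     seconds to wait between runs (hypothetical knob)
keep_alive() {
    max_tries="$1"
    delay="$2"
    shift 2
    tries=0
    while :; do
        "$@"    # e.g. the ssh -N -R ... tunnel command; the exit status is ignored on purpose
        tries=$((tries + 1))
        if [ "$max_tries" -gt 0 ] && [ "$tries" -ge "$max_tries" ]; then
            return 0
        fi
        sleep "$delay"    # brief pause so a dead network is not hammered
    done
}

# Demo with a command that exits immediately; replace "true" with the tunnel command.
keep_alive 2 0 true
```

A short delay between attempts keeps the loop from spinning when the jumpserver is unreachable for a while.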

You can set GatewayPorts yes in the sshd_config of the jumpserver so that the reverse tunnel listens on 0.0.0.0 instead of 127.0.0.1. Otherwise, I would have to create a user on the jumpserver for every person, or forward each port with iptables, which is far too tedious. Binding to 0.0.0.0 lets us access it directly from the existing VPN network.
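
A quick way to confirm the setting is in place is to scan the config file for it. The check below is a simplified sketch of my own: it only looks for a plain GatewayPorts yes line and ignores Match blocks and Include files, so treat a negative result as a hint rather than proof.

```shell
#!/bin/sh
# Sketch: report whether an sshd_config-style file enables GatewayPorts.
# Simplification (my assumption): only plain "GatewayPorts yes" lines are considered;
# Match blocks and Include files are ignored.
gateway_ports_enabled() {
    # Case-insensitive keyword match; commented-out lines do not match.
    grep -Eiq '^[[:space:]]*GatewayPorts[[:space:]]+yes[[:space:]]*$' "$1"
}

# Demo against a throwaway file so the real /etc/ssh/sshd_config is not touched.
demo_conf=$(mktemp)
printf 'Port 22\nGatewayPorts yes\n' > "$demo_conf"
if gateway_ports_enabled "$demo_conf"; then
    echo "GatewayPorts is enabled: reverse tunnels may bind to 0.0.0.0"
else
    echo "GatewayPorts is not enabled: reverse tunnels bind to loopback only"
fi
rm -f "$demo_conf"
```

After editing the real sshd_config, sshd has to be reloaded before the change takes effect.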

There are many options for mounting a file system. Considering both security and convenience, I chose SSHFS. Exposing NFS directly to the public internet is too dangerous, while configuring NFS user authentication is too tedious. At the same time, the kernel that BitaHub uses to run containers neither loads the wireguard kmod nor maps /dev/net/tun, so we cannot use a VPN to protect data security. SSHFS can directly reuse the existing user authentication mechanism, and SSH traffic itself is also more likely to be let through by any potential data-center firewall.

Use the following command to mount SSHFS:

sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=30,ssh_command='ssh -p  -i ' @:/path /path
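
To make the moving parts of that command easier to see, here is a sketch that assembles it from named arguments and prints the result. Every value in the demo call (user, host, port, key path, and both directories) is a hypothetical placeholder, not something taken from my setup.

```shell
#!/bin/sh
# Sketch: assemble the sshfs command from named parts and print it.
# Every value passed in the demo call is a hypothetical placeholder.
build_sshfs_cmd() {
    user="$1"          # account on the file server
    host="$2"          # file server reachable from the container
    port="$3"          # ssh port on the file server
    key="$4"           # private key path inside the container
    remote="$5"        # directory to export from the file server
    mountpoint="$6"    # local mount point inside the container
    printf "sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=30,ssh_command='ssh -p %s -i %s' %s@%s:%s %s\n" \
        "$port" "$key" "$user" "$host" "$remote" "$mountpoint"
}

build_sshfs_cmd alice fileserver.example.edu 22 /keys/id_ed25519 /data /mnt/data
```

To unmount afterwards, fusermount -u on the mount point is the usual counterpart (fusermount3 -u on systems that ship FUSE 3).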

Postscript

TODO