linux on Monsoon's Blog

Using GPU accessible VS Code Server on UIUC Delta

Sun, 22 Dec 2024 00:00:00 +0000

Why writing this blog post

Many UIUC students rely on Delta for the GPU resources their research needs. Delta provides 4 ssh-enabled login nodes and many computing nodes with GPUs. Usually, we first ssh to a login node (with a password and a DUO 2FA OTP), and then use srun to request GPU resources to run our code. In my experience, however, this workflow runs into several problems:

Unstable network connection: The connection drops frequently when the network is poor. Each time VS Code Remote loses its connection, you must re-enter the password and the DUO 2FA OTP (which means unlocking your phone) to reconnect, which is annoying, time-consuming, and distracting.
Broken OnDemand Code Server: Although you can run VS Code Remote on the login nodes over ssh, they have no GPU for debugging, and the computing nodes are not reachable by ssh. The alternatives are OnDemand Jupyter Lab and Code Server. But Jupyter Lab’s functionality is limited, and Code Server is broken: when I request a Code Server on a computing node, the request just queues and is then shown as completed, without ever reaching a running state.

Because of these problems, debugging GPU programs on Delta is a struggle. That’s why I wrote this post: by running a private Code Server on a computing node and deploying a Cloudflare Tunnel reverse proxy, you can say goodbye to these annoyances.

How to

My solution is based on an observation about Delta: all login nodes and computing nodes are in a trusted network. There are no firewalls between them, which means you can access any port on the computing nodes from the login nodes.

The main steps of my solution are simple:

Use srun to get a tty on the computing node (e.g., on gpua042 node).
Run a Code Server on the computing node. It will listen on 0.0.0.0:8080.
Reverse proxy gpua042:8080 to a port you can reach. There are two approaches:
- Use ssh -L to forward the port to your local machine.
- Use Cloudflare Tunnel to reverse proxy the port to a public domain. This approach is more stable in poor network conditions.

Run Code Server

Download the Code Server release tarball from the GitHub repository (e.g., code-server-4.96.2-linux-amd64.tar.gz) and extract it. On the computing node, run:

cd code-server-4.96.2-linux-amd64/bin

## no auth
./code-server --bind-addr 0.0.0.0:8080 --auth none

## if port is exposed to untrusted network, use password auth
## password can be modified in ~/.config/code-server/config.yaml
./code-server --bind-addr 0.0.0.0:8080
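Before setting up any forwarding, it can be worth confirming from a login node that the Code Server port is actually reachable. The following is a minimal Python sketch of such a check, not part of my original setup; the function name port_open is made up for illustration.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

With the example values from this post, calling port_open("gpua042", 8080) on a login node should return True once Code Server is listening.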

Access Code Server

SSH Port Forwarding

ssh -L can forward a local port to a remote port. Run:

ssh -L 127.0.0.1:8080:gpua042:8080 username@login.delta.ncsa.illinois.edu

Then open http://127.0.0.1:8080 in your browser, and enjoy the Code Server!

Cloudflare Tunnel

Cloudflare Tunnel is more stable when your computer suffers from a poor network connection, but it requires a domain name.

TODO

NFS Performance Tuning

Fri, 16 Feb 2024 00:00:00 +0000

Introduction

This article is a guide to NFS performance tuning over a 10 Gbps network in production scenarios, which I have distilled from practice. It focuses in particular on optimizing the reading and writing of Lots of Small Files (LOSF).

Tuning

Hardware

On the network hardware side, both bandwidth and latency matter.

To guarantee NFS performance, a high-bandwidth network is necessary. 10 Gbps is the baseline requirement for production scenarios; faster InfiniBand or RoCE networks can be chosen according to your needs and budget.

For the Lots of Small Files (LOSF) scenario, latency is more important than bandwidth. Many tuning tutorials overlook this and focus only on sequential read/write performance; even when they test 4K random read/write, they use the wrong testing method (the correct method is given below).

The importance of latency lies in the fact that if a program’s access to small files is intrinsically serialized, latency determines the upper bound of serialized IOPS. A latency of 0.1 ms caps serialized IOPS at 10k, while a latency of 1 ms corresponds to a cap of 1k.
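The cap is simply the reciprocal of the per-operation latency. A minimal Python sketch of the arithmetic, assuming every operation costs exactly one round trip and nothing else (the function name is mine):

```python
def serialized_iops_cap(latency_ms: float) -> float:
    """Upper bound on IOPS when each request must wait for the previous one."""
    return 1000.0 / latency_ms


for latency_ms in (0.1, 1.0):
    cap = serialized_iops_cap(latency_ms)
    print(f"{latency_ms} ms per operation -> at most {cap:.0f} IOPS")
```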

Intrinsically serialized access scenarios are very common. For example, when the home directory is placed on NFS, the loading of oh-my-zsh and the loading of Python packages are both intrinsically serialized. A 1 ms network latency makes these programs unacceptably slow (e.g., executing import torch takes more than 30s).

Using a decent enterprise-grade switch and a properly configured network topology can minimize latency as much as possible. At the same time, the quality of optical modules and optical-to-electrical port modules can also have a huge impact on latency (the Chinet (中科光电) optical-to-electrical port module I originally used introduced an extra 0.1 ms of latency, causing IOPS to drop by 2/3).

It should be noted that although RDMA can theoretically reduce latency, in actual testing I found that the difference in serialized IOPS between 10 Gbps Ethernet and 100 Gbps InfiniBand is not large; when the budget is limited, using only Ethernet is sufficient.

TODO: jumbo frames

Linux Kernel

The kernel network parameters need to be adjusted to suit a high-speed network:

# Ref: https://gist.github.com/mizanRahman/40ba603759bfb5153189ccdc9dbbd1e4

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 56623104
net.core.wmem_max = 56623104
net.core.rmem_default = 56623104
net.core.wmem_default = 56623104
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 56623104
net.ipv4.tcp_wmem = 4096 65536 56623104

# TCP Congestion Control
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = cake

This set of settings needs to be applied on both the server and the client; it can be written into /etc/sysctl.conf to make it persistent.
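To confirm that the settings took effect on a given machine, the live values can be read back from /proc/sys. This is a minimal Python sketch with hypothetical helper names; it assumes a sysctl name maps to its /proc/sys path by replacing dots with slashes, which holds for the keys used here.

```python
from pathlib import Path

# A subset of the values from the listing above.
EXPECTED = {
    "net.ipv4.tcp_slow_start_after_idle": "0",
    "net.core.rmem_max": "56623104",
    "net.core.wmem_max": "56623104",
    "net.ipv4.tcp_rmem": "4096 87380 56623104",
    "net.ipv4.tcp_congestion_control": "bbr",
}


def sysctl_path(name: str) -> Path:
    """Translate a dotted sysctl name into its /proc/sys path."""
    return Path("/proc/sys") / name.replace(".", "/")


def normalize(value: str) -> str:
    """Collapse whitespace so tab-separated kernel output compares equal."""
    return " ".join(value.split())


def check(expected: dict) -> dict:
    """Return {name: (wanted, actual)} for every key that does not match."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            actual = normalize(sysctl_path(name).read_text())
        except OSError:
            actual = None  # key missing or unreadable on this kernel
        if actual != normalize(wanted):
            mismatches[name] = (wanted, actual)
    return mismatches
```

Calling check(EXPECTED) on both the server and the client returns an empty dict when everything matches.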

Server Side

The number of NFS server threads can be set generously; a larger value improves performance when the server load is relatively high. I simply set it to the number of CPU threads on the server. Modify /etc/nfs.conf:

[nfsd]
threads=128

The following NFS server parameters need to be adjusted:

async: treats synchronous I/O operations as asynchronous. For workloads dominated by synchronous reads/writes this can greatly improve performance, but it may cause data loss when the server crashes; it is not recommended when there are extremely high requirements for data integrity;
no_subtree_check: has no major impact on performance, but in some cases it can improve reliability (with a slight security risk at the same time). See [1].

Client Side

Unless you have a specific reason not to, use the latest NFSv4.2 by default. When NFSv3 uses UDP as the underlying transport, it can cause silent data corruption over high-speed links because of problems in IP fragment reassembly (the 16-bit IP ID field wraps around quickly at high packet rates); see [2].

The following NFS client parameters need to be adjusted:

proto=rdma: set when the network supports RDMA;
nocto: disables close-to-open cache consistency semantics. The default NFS behavior is to write all changes back to the server when a file is closed. If you have relatively high requirements for file consistency across multiple clients, this option is not recommended;
ac: enables attribute caching, so the client caches file attributes. Likewise, for clusters with high requirements for data consistency, this option is not recommended;
fsc: uses FS-Cache to cache data locally. You also need to configure cachefilesd. Strangely, in my testing I did not find data being cached locally; this may require further investigation;
nconnect=16: sets up 16 TCP connections between the NFS client and server. By default the NFS client establishes only one TCP connection, and all RPCs are multiplexed over this connection. In some cases this limits the bandwidth of sequential reads/writes. Increasing nconnect (maximum value 16) can solve this problem.

In particular, the noatime / relatime settings have no effect on NFS [3]; the NFS client always caches atime changes.

Some tutorials recommend modifying rsize and wsize. In NFSv4.2 these two values are already negotiated to their maximum value 1048576 by default, so there is no need to change them manually; you only need to check whether they were negotiated correctly.
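One way to check the negotiated values is to read the mount options from /proc/mounts. The following Python sketch does that; the function name is hypothetical, and it only looks at entries whose filesystem type is nfs or nfs4.

```python
def nfs_mount_sizes(mounts_text: str) -> dict:
    """Map each NFS mount point to its (rsize, wsize) from /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = dict(
            opt.split("=", 1) for opt in fields[3].split(",") if "=" in opt
        )
        rsize = options.get("rsize")
        wsize = options.get("wsize")
        result[fields[1]] = (
            int(rsize) if rsize else None,
            int(wsize) if wsize else None,
        )
    return result
```

On a client, nfs_mount_sizes(open("/proc/mounts").read()) should report 1048576 for both values on every NFSv4.2 mount if negotiation went as expected.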

According to [4], the sunrpc slot table limits (sunrpc.tcp_max_slot_table_entries and the related sunrpc.tcp_slot_table_entries) may affect performance and can be increased appropriately (tcp_slot_table_entries defaults to 2). In my testing, I found that under a sustained small-file access workload on the order of tens of millions, NFS would sometimes hang. After I increased this parameter, the problem went away. Set it in /etc/modprobe.d/sunrpc.conf:

options sunrpc tcp_slot_table_entries=16384

Sometimes I encounter a problem where nfsd consumes a large amount of CPU and performance drops sharply, while a large number of delegreturn RPC calls are recorded. According to [5], this can be resolved by disabling fs.leases-enable. Set /etc/sysctl.conf:

fs.leases-enable = 0

When nfsd restarts for one reason or another, by default there is a 90s grace period for lock recovery, during which nfsd rejects all open requests, shown in the kernel log as:

[1073511.138061] NFSD: starting 90-second grace period (net f0000000)

In practice I found that this period can be reduced appropriately to lessen the impact of nfsd restarts. Set /etc/default/nfs-kernel-server:

# Options for rpc.svcgssd.
RPCSVCGSSDOPTS="--lease-time 10 --grace-time 10"

Testing

TODO

Conclusion

TODO

References

[1] https://man.archlinux.org/man/exports.5.en#no_subtree_check

[2] https://man.archlinux.org/man/nfs.5.en#Using_NFS_over_UDP_on_high-speed_links

[3] https://man.archlinux.org/man/nfs.5.en#File_timestamp_maintenance

[4] https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-linux-concurrency-session-slots

[5] https://docs.gitlab.com/ee/administration/nfs.html#disable-nfs-server-delegation

Building WireGuard VPN for Machine Learning Server Cluster

Mon, 29 Jan 2024 00:00:00 +0000

Motivation

A machine learning cluster needs a secure way to expose services to users, as well as to interconnect servers across the public network. For this, a VPN network needs to be deployed.

Deploying a VPN network requires considering the following factors:

Network topology: an appropriate topology must be chosen to minimize latency as much as possible;
User management: it should be easy to add or remove users and to authorize them;
Simplicity of use and maintenance.

Design

Network Topology

The network topology determines the latency.

The lowest-latency option is obviously full-mesh, i.e. every pair of peers has a direct P2P connection. However, the management complexity of this topology is $\mathcal{O}(n^2)$, and adding a new peer requires modifying the configuration files of all other peers. It also has to deal with the problems introduced by NAT, which requires some automated management software. I tried Netmaker and Headscale, but neither of them seemed able to correctly handle the complex network environment within the campus, such as the symmetric NAT used by various enterprise-grade routers, and the probability of successfully establishing P2P was very low.

In the end I chose a topology that combines full-mesh and hub-and-spoke. Since the number of servers and their IPs rarely change, manually configuring a full-mesh network among the servers is feasible. At the same time, a gateway server is provided as the hub for user access, and users only need to establish a connection with the gateway server. Since most users actually use the VPN within the campus, connecting to the on-campus gateway server and forwarding traffic through it does not introduce much additional latency. This structure balances latency and management complexity, and adding/removing and authorizing users only needs to be done on the gateway server.
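To make the management cost concrete, here is a small Python sketch that counts configuration entries under each topology. The function names are made up for illustration, and it assumes one [Peer] entry per remote peer in every configuration file.

```python
def full_mesh_peer_entries(n: int) -> int:
    """Total [Peer] entries when each of n peers lists all the others."""
    return n * (n - 1)


def configs_touched_when_adding_peer(n_existing: int, full_mesh: bool) -> int:
    """Existing configuration files that must be edited to admit one new peer."""
    # Hub-and-spoke: only the gateway's configuration changes.
    return n_existing if full_mesh else 1


for n in (5, 50):
    print(
        f"{n} peers: {full_mesh_peer_entries(n)} entries in a full mesh, "
        f"{configs_touched_when_adding_peer(n, True)} files to edit per new peer "
        f"(vs. {configs_touched_when_adding_peer(n, False)} with a hub)"
    )
```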

Protocol Choice

The popular OpenVPN and IPSec are both good enough, but the emerging WireGuard offers unparalleled configuration simplicity. On the server side, WireGuard can define a peer and a route with just a few lines of configuration; on the user side, since WireGuard uses key-pair-based authentication, a single configuration file is enough to join the VPN network, with no need to remember an additional password or perform a login operation.

Management Approach

For the sake of predictability and stability, I chose the manual configuration approach. The full-mesh network among servers does not need to be changed frequently once it is configured. User management, on the other hand, is implemented through a script: when a new user needs to be added, the script generates a key pair and allocates an IP, adds the public key and routing information to the gateway server’s peer list, then generates a configuration file containing the private key and the allocated IP, and sends it to the user.

Example of a user peer configuration on the gateway server:

[Peer]
PublicKey = 
AllowedIPs = 10.1.x.y/32
AllowedIPs = fd01::x:y/128
PersistentKeepalive = 25

Example of a user’s access configuration file:

[Interface]
PrivateKey = 
Address = 10.1.x.y/16
Address = fd01::x:y/64

[Peer]
PublicKey = 
AllowedIPs = 10.1.0.0/16  # route all VPN traffic to gateway server
AllowedIPs = fd01::/64
Endpoint = wg.ustcaigroup.xyz:51820  # gateway server is dual stack
# Endpoint = wg.ustcaigroup.xyz:51820  # IPv4
# Endpoint = wg.ustcaigroup.xyz:51820  # IPv6
PersistentKeepalive = 25
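The user-management script described above could look roughly like the following Python sketch. This is not my actual script: all function names are hypothetical, the key pair comes from the wg command-line tool (wg genkey and wg pubkey), and the mapping from 10.1.x.y to fd01::x:y (writing x and y as decimal digits) is an assumption based on the examples above.

```python
import ipaddress
import subprocess


def generate_keypair() -> tuple:
    """Create a WireGuard key pair by calling the wg CLI (must be installed)."""
    private = subprocess.run(
        ["wg", "genkey"], capture_output=True, text=True, check=True
    ).stdout.strip()
    public = subprocess.run(
        ["wg", "pubkey"], input=private + "\n",
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return private, public


def next_free_ip(network: str, used: set) -> ipaddress.IPv4Address:
    """Pick the first host address in network that is not in used."""
    for host in ipaddress.ip_network(network).hosts():
        if host not in used:
            return host
    raise RuntimeError("address pool exhausted")


def ipv6_for(ip: ipaddress.IPv4Address) -> str:
    """Assumed mapping: 10.1.x.y -> fd01::x:y, with x and y written in decimal."""
    _, _, x, y = str(ip).split(".")
    return f"fd01::{x}:{y}"


def gateway_peer_block(public_key: str, ip: ipaddress.IPv4Address) -> str:
    """Render the [Peer] entry to append to the gateway server's configuration."""
    return "\n".join([
        "[Peer]",
        f"PublicKey = {public_key}",
        f"AllowedIPs = {ip}/32",
        f"AllowedIPs = {ipv6_for(ip)}/128",
        "PersistentKeepalive = 25",
    ])


def user_config(private_key: str, gateway_public_key: str,
                ip: ipaddress.IPv4Address, endpoint: str) -> str:
    """Render the configuration file that is sent to the user."""
    return "\n".join([
        "[Interface]",
        f"PrivateKey = {private_key}",
        f"Address = {ip}/16",
        f"Address = {ipv6_for(ip)}/64",
        "",
        "[Peer]",
        f"PublicKey = {gateway_public_key}",
        "AllowedIPs = 10.1.0.0/16",
        "AllowedIPs = fd01::/64",
        f"Endpoint = {endpoint}",
        "PersistentKeepalive = 25",
    ])
```

Keeping the rendering separate from key generation means the templates can be checked without touching the gateway or having wg installed.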

Building Storage System for Machine Learning Server Cluster

Fri, 24 Nov 2023 00:00:00 +0000

This post is unfinished.

Custom PyTorch Operators on Ascend 910B

Tue, 14 Nov 2023 00:00:00 +0000

Environment

The hardware environment this article is based on is the Ascend 910B3, and the software environment includes CANN 7.0-RC1, PyTorch 1.11.0, and Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0. The situation on other CANN and PyTorch versions may differ slightly.

Registration Process

Adding a Custom Operator in the Ascend PyTorch Adapter

References:

https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html

https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation

Add the npu_add_custom function in torch_npu/csrc/aten/npu_native_functions.yaml:

custom:
  - func: npu_add_custom(Tensor x, Tensor y) -> Tensor  # the added function

Add the file AddCustomKernelNpu.cpp in torch_npu/csrc/aten/ops/op_api:

#include 

#include "torch_npu/csrc/framework/utils/OpAdapter.h"
#include "torch_npu/csrc/aten/NPUNativeFunctions.h"
#include "torch_npu/csrc/aten/ops/op_api/op_api_common.h"

namespace at_npu {
  namespace native {
    using torch::autograd::Function;
    using torch::autograd::AutogradContext;

    at::Tensor NPUNativeFunctions::npu_add_custom(const at::Tensor& x, const at::Tensor& y) {
        at::Tensor result = OpPreparation::ApplyTensor(x); // allocate the output tensor

        // calculate the output result of the NPU
        EXEC_NPU_CMD(aclnnAddCustom, x, y, result);
        return result;
    }
  } // namespace native
} // namespace at_npu

Afterwards, recompile and reinstall torch_npu.

Adding the Custom Operator Implementation in CANN

References:

https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0023.html

First, define the operator description file add_custom.json:

[
    {
        "op": "AddCustom",
        "language": "cpp",
        "input_desc": [
            {
                "name": "x",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "fp16"
                ]
            },
            {
                "name": "y",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "fp16"
                ]
            }
        ],
        "output_desc": [
            {
                "name": "z",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "fp16"
                ]
            }
        ]
    }
]

Run

msopgen gen -i add_custom.json -c ai_core-Ascend910B3 -f pytorch -out . -lan cpp

to generate the operator project:

AddCustom
├── build.sh
├── cmake 
│   ├── config.cmake
│   ├── func.cmake
│   ├── intf.cmake
│   ├── makeself.cmake
│   └── util
├── CMakeLists.txt
├── CMakePresets.json          // modify ASCEND_CANN_PACKAGE_PATH
├── framework
├── op_host
│   ├── add_custom_tiling.h    // defines length and tiling-related information
│   ├── add_custom.cpp         // host-side implementation of the operator
│   ├── CMakeLists.txt
├── op_kernel
│   ├── CMakeLists.txt
│   ├── add_custom.cpp         // kernel-side implementation of the operator
└── scripts

In CMakePresets.json, change ASCEND_CANN_PACKAGE_PATH to the CANN installation path.

The content of op_host/add_custom_tiling.h is as follows (a simple implementation):

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(AddCustomTilingData)
    TILING_DATA_FIELD_DEF(uint32_t, size);  // defines the tensor size
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(AddCustom, AddCustomTilingData)
}

In op_host/add_custom.cpp, modify the block_dim used when the operator is invoked:

context->SetBlockDim(20); // block_dim for the 910B3

op_kernel/add_custom.cpp is the concrete implementation of the operator:

#include "kernel_operator.h"

#ifdef __DAV_C220_VEC__

extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tiling_data, tiling);
    uint32_t M = tiling_data.size;  // read the tensor size from tiling_data

    // ...
}

#else

// Important: CANN tries different ccec compile options to infer the operator type (VEC, CUBE, MIXED); compilation fails if no stub function is provided
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling) {
    pipe_barrier(PIPE_ALL);
}

#endif

Compilation and Deployment

$ bash build.sh
$ ./custom_opp_euleros_aarch64.run

Calling it in PyTorch:

import torch
import torch_npu

# ...

z = torch.npu_add_custom(x, y)  # compiled at runtime, so the first call has to wait for compilation

Registration Principles

TODO

References

TODO

Catching Mining Virus

Wed, 01 Nov 2023 00:00:00 +0000

Problem

On October 30, 2023, I received a warning from the data center administrator: the firewall had detected mining traffic coming from a server I manage.

The “mining traffic” was a bitcoin.sipa.be DNS request sent to 223.5.5.5.

Initially, I thought finding the virus process would be simple, just like my previous encounter with another mining virus. In that case, the attacker logged in to the server by cracking a weak SSH password and gained root permission, possibly by exploiting a privilege escalation vulnerability (the server was running EOL Ubuntu 16.04). A cron job was then set up to run the mining virus.

However, this time the situation was different. I couldn’t find any suspicious processes, and there was no unusual GPU usage. Since I didn’t deploy any monitoring programs to record historical processes and sockets, the investigation couldn’t get started.

On October 31, I received the same warning again. Each time mining traffic is detected, the firewall blocks the server’s outbound network access, and losing Internet access causes a lot of trouble.

I suspected that someone had fallen victim to a supply chain attack, for example by downloading a Python package containing a virus, or by cloning code from GitHub and running it without checking it.

The immediate task was to identify which user and which process were responsible.

Solution

While I can’t directly determine the user or the process, I can block and log suspicious traffic for further investigation.

This can be done with iptables:

# iptables -N LOGDROP                   # create a new chain
# iptables -A LOGDROP -j LOG --log-uid  # log info
# iptables -A LOGDROP -j DROP           # drop packet

# iptables -I OUTPUT 1 -p udp -m string --string "bitcoin" --algo bm -j LOGDROP     # match string "bitcoin" in udp packet

The --log-uid option records the UID of the process that generated the packet in /var/log/kern.log, for example:

IN= OUT=wg0 SRC=10.1.92.3 DST=10.1.2.13 LEN=42 TOS=0x00 PREC=0x00 TTL=64 ID=23294 DF PROTO=UDP SPT=52328 DPT=2333 LEN=22 UID=2109 GID=2109
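Once such a line shows up, the UID still has to be mapped to a user. A minimal Python sketch for pulling the fields out of a log line (the function names are hypothetical, and the parsing assumes the space-separated KEY=value layout shown above):

```python
import pwd


def parse_iptables_log(line: str) -> dict:
    """Split an iptables LOG line into its KEY=value fields.

    Bare flags such as DF are skipped. A key that appears twice (LEN does,
    once for IP and once for UDP) keeps its last value.
    """
    fields = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def username_for(uid: str) -> str:
    """Resolve a numeric UID to a login name, falling back to the raw UID."""
    try:
        return pwd.getpwuid(int(uid)).pw_name
    except (KeyError, ValueError):
        return uid
```

For the example line, parse_iptables_log(line)["UID"] gives "2109", which username_for can then look up on the server.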

Result

I’m waiting for the next request sent by the virus.

Optimizing MKL Performance on AMD CPUs

Mon, 19 Jun 2023 00:00:00 +0000

The Problem

My lab has some AMD EPYC 7713 servers. We bought them because some people in the group run programs with very high CPU load (I don’t know what kind of load it is, or why it can’t run on the GPU, and I don’t have the energy to help everyone solve it one by one). AMD processors with their many cores are a great fit for this kind of demand.

But as nice as AMD processors are, using them in a deep-learning lab brings an extra problem: the numpy and PyTorch installed by Anaconda both use MKL as their BLAS implementation by default, and MKL’s library functions are also the hotspots of most high-CPU-load programs. However, MKL checks whether it is running on an Intel CPU, and if not, the optimizations have no effect.

Since this is a deep-learning lab, few people have enough HPC background to compile suitable versions of numpy and PyTorch themselves, and it’s hard for them to break away from Anaconda, so the dependency on MKL is hard to remove. For this reason I needed a solution that is transparent to ordinary users.

The Solution

A widely circulated solution can be found via search engines: set the environment variable MKL_DEBUG_CPU_TYPE=5. This used to work, but it no longer works for MKL 2020 and later versions.

In the end I found a more clever solution (see References).

MKL calls a function mkl_serv_intel_cpu_true() to check whether it is running on an Intel CPU. As long as we provide a fake mkl_serv_intel_cpu_true() that always returns 1, we can trick MKL into thinking it is running on an Intel CPU.

To do this, we can use Linux’s LD_PRELOAD mechanism. The dynamic library pointed to by LD_PRELOAD has the highest loading priority, so as long as we compile the desired mkl_serv_intel_cpu_true() function into an so file and point LD_PRELOAD at it, we can load this function ahead of everything else.

I have often heard of the LD_PRELOAD mechanism being used for library-function hijacking attacks; here it counts as a clever use.

Implementation

Create mkl_trick.c:

int mkl_serv_intel_cpu_true() {
    return 1;
}

Compile it with gcc -shared -fPIC -o libmkl_trick.so mkl_trick.c, and copy the generated libmkl_trick.so to /usr/local/lib.

Add the following to the shell’s global initialization file:

export MKL_DEBUG_CPU_TYPE=5  # compatibility with older MKL versions
export MKL_ENABLE_INSTRUCTIONS=AVX2  # optional, tells MKL it can use AVX2
export LD_PRELOAD=/usr/local/lib/libmkl_trick.so

Some of my labmates use Bash and some use ZSH, so both need to be modified:

Bash: create the file /etc/profile.d/mkl.sh and add the above content
ZSH: add it to /etc/zsh/zshenv
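To check from inside Python whether the trick is active for the current process, something like the following sketch can be used. The function names are hypothetical; it assumes a Linux system, where ctypes.CDLL(None) exposes the symbols already loaded into the process, including those from LD_PRELOAD libraries.

```python
import ctypes
import os


def preload_configured(lib_name: str = "libmkl_trick.so") -> bool:
    """True if LD_PRELOAD in this process's environment mentions lib_name."""
    return lib_name in os.environ.get("LD_PRELOAD", "")


def fake_check_visible() -> bool:
    """True if mkl_serv_intel_cpu_true resolves globally and returns 1."""
    try:
        func = ctypes.CDLL(None).mkl_serv_intel_cpu_true
    except AttributeError:
        # The symbol is not visible, so the preload library is not loaded.
        return False
    func.restype = ctypes.c_int
    return func() == 1


print("LD_PRELOAD set:", preload_configured())
print("fake function active:", fake_check_visible())
```

Both values should be True in a fresh login shell after the initialization files above are in place.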

References

https://documentation.sigma2.no/jobs/mkl.html