linux

Using GPU accessible VS Code Server on UIUC Delta

Why I am writing this blog post Many UIUC students rely on Delta to access GPU resources for their research. Delta provides 4 SSH-enabled login nodes and many compute nodes with GPUs. Usually, we must first ssh to a login node (with a password and a Duo 2FA OTP), and then use srun to request GPU resources to run our code. However, in my experience, we can run into several problems when using Delta this way: ...

NFS Performance Tuning

Introduction This article is a guide to NFS performance tuning over a 10 Gbps network in production scenarios, which I have distilled from practice. It focuses in particular on optimizing the reading and writing of Lots of Small Files (LOSF). Tuning Hardware On the network hardware side, both bandwidth and latency matter. To guarantee NFS performance, a high-bandwidth network is necessary. 10 Gbps is the baseline requirement for production scenarios; faster InfiniBand or RoCE networks can be chosen according to your needs and budget. ...

Building WireGuard VPN for Machine Learning Server Cluster

Motivation A machine learning cluster needs a secure way to expose services to users, as well as to interconnect servers across the public network. For this, a VPN needs to be deployed. Deploying a VPN requires considering the following factors: Network topology: an appropriate topology must be chosen to minimize latency; User management: it should be easy to add, remove, and authorize users; Simplicity of use and maintenance. Design Network Topology The network topology determines the latency. ...

Building Storage System for Machine Learning Server Cluster

This is an unfinished blog post.

Custom PyTorch Operators on Ascend 910B

Environment The hardware environment this article is based on is the Ascend 910B3, and the software environment includes CANN 7.0-RC1, PyTorch 1.11.0, and Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0. Behavior on other CANN and PyTorch versions may differ slightly. Registration Process Adding a Custom Operator in the Ascend PyTorch Adapter References: https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation Add the npu_add_custom function in torch_npu/csrc/aten/npu_native_functions.yaml: custom: - func: npu_add_custom(Tensor x, Tensor y) -> Tensor # the added function Add the file AddCustomKernelNpu.cpp in torch_npu/csrc/aten/ops/op_api: ...

Catching Mining Virus

Problem On October 30, 2023, I received a warning from the data center administrator, informing me that the firewall had detected mining traffic sent from a server I manage. The “mining traffic” was a DNS request for bitcoin.sipa.be sent to 223.5.5.5. Initially, I thought finding the virus process would be simple, just like my previous encounter with another mining virus. In that case, the attacker logged in to the server by cracking a weak SSH password, then gained root privileges, possibly by exploiting a privilege escalation vulnerability (the server was running EOL Ubuntu 16.04). A cron job was then set up to run the mining virus. ...

Optimizing MKL Performance on AMD CPUs

The Problem My lab has some AMD EPYC 7713 servers. We bought them because some people in the group run programs with very high CPU load (I don’t know what kind of load it is, or why it can’t run on the GPU, and I don’t have the energy to help everyone solve it one by one). AMD processors with their many cores are a great fit for this kind of demand. ...