posts

Using GPU accessible VS Code Server on UIUC Delta

Why writing this blog post Many UIUC students rely on Delta to access GPU resources for their research. Delta provides 4 SSH-enabled login nodes and many compute nodes with GPUs. Usually, we must first SSH to a login node (with a password and a Duo 2FA OTP) and then use srun to request GPU resources to run our code. However, in my experience, we sometimes run into problems when using Delta: ...

All About IPv6 Address Allocation

Preface IPv4 has only one method of dynamic address allocation, namely DHCP, but IPv6 has two allocation methods, SLAAC and DHCPv6, and DHCPv6 additionally has the PD (Prefix Delegation) extension. These three allocation methods also interact with each other, which makes problems arising during IPv6 allocation far more common than with IPv4. Most tutorials you can find only solve problems superficially, are ambiguous about the underlying technical details, and do not fundamentally clarify the differences between IPv6 and IPv4. ...

Extracting Graph Topology from Image

The Problem Now we have an image representing a graph, as shown in the figure below: Suppose we already know the category of each pixel: background, node, or edge. How can we extract the graph topology from it and represent the graph by an adjacency matrix? Challenges in Classical Algorithm TODO What about Neural Network? We can use a simple algorithm to extract the position of each node. Suppose the position of a node is $\mathbf{P}(x,y)$, and there are $N$ nodes in total. ...

Latency in LLM Serving

Preface There have been many excellent works on LLM serving, mainly focusing on improving throughput. In practical applications, however, latency is equally important. Few works currently focus on improving LLM serving latency, especially latency optimization under SLA constraints. This blog attempts to summarize the basic concepts and problems in this direction, and to suggest some research directions based on an analysis of latency in LLM serving. ...

How Quantization Works: From a Matrix Multiplication Perspective

Introduction Quantization is a commonly used acceleration technique in NN inference. The primary computational workloads in NNs come from Convolution, Linear Layers, and Attention, which are implemented as GEMM at the lower level. This blog aims to discuss the principles of quantization from the matrix multiplication perspective and to explain why some quantization methods are impractical. It also aims to review several LLM quantization methods from this perspective. I define practical quantization as follows: ...

NFS Performance Tuning

Introduction This article is a guide to NFS performance tuning over a 10 Gbps network in production scenarios, which I have distilled from practice. It focuses in particular on optimizing the reading and writing of Lots of Small Files (LOSF). Tuning Hardware On the network hardware side, both bandwidth and latency matter. To guarantee NFS performance, a high-bandwidth network is necessary. 10 Gbps is the baseline requirement for production scenarios; faster InfiniBand or RoCE networks can be chosen according to your needs and budget. ...

[Paper Reading] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs (arXiv'24)

This blog is a write-up of the paper “ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs” from arXiv'24. Motivation Some workloads (e.g., Simulation Engines for Deep RL, Dynamic DNNs) cannot fully utilize the massive parallelism of GPUs (see Figure 1). The main reason is that these workloads contain lots of small kernels which cannot fully utilize the GPU, and these kernels are not executed concurrently, although most of them are independent and in theory can be executed concurrently. ...

[Paper Reading] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud (PACT'22)

This blog is a write-up of the paper “GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud” from PACT'22. Motivation This paper focuses on GPU sharing in cloud scenarios. Existing GPU sharing techniques can be categorized into 2 types: Time-sharing means executing each concurrent VM on a full device in a round-robin fashion. Pros: Simple and mature. Cons: VMs could still under-utilize the hardware within each time slice. ...

Building WireGuard VPN for Machine Learning Server Cluster

Motivation A machine learning cluster needs a secure way to expose services to users, as well as to interconnect servers across the public network. For this, a VPN needs to be deployed. Deploying a VPN requires considering the following factors: Network topology: an appropriate topology must be chosen to minimize latency as much as possible; User management: it should be easy to add or remove users and to authorize them; Simplicity of use and maintenance. Design Network Topology The network topology determines the latency. ...

Building Storage System for Machine Learning Server Cluster

This is an unfinished blog.

Custom PyTorch Operators on Ascend 910B

Environment The hardware environment this article is based on is the Ascend 910B3, and the software environment includes CANN 7.0-RC1, PyTorch 1.11.0, and Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0. The situation on other CANN and PyTorch versions may differ slightly. Registration Process Adding a Custom Operator in the Ascend PyTorch Adapter References: https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation Add the npu_add_custom function in torch_npu/csrc/aten/npu_native_functions.yaml: custom: - func: npu_add_custom(Tensor x, Tensor y) -> Tensor # the added function Add the file AddCustomKernelNpu.cpp in torch_npu/csrc/aten/ops/op_api: ...

Building Proxy Service for Team

This is an unfinished blog. Preface Due to Internet censorship in China (known as the GFW, Great Firewall, 防火长城), many websites (e.g. Google, Twitter) are blocked, and some websites (e.g. GitHub) suffer connectivity issues. In China, circumventing Internet censorship is referred to as 翻墙 (literally, "climbing over the wall"), and a proxy is essential for accessing the Internet freely. Although various commercial options are available, they may not suit everyone. Therefore, as part of my responsibilities as a system administrator, I have built a user-friendly and easy-to-maintain proxy system for my research group. ...

My TOEFL Experience

Preface As the exam that has caused me the most anxiety since the gaokao, the TOEFL kept me in the dark for most of 2023, and it is also the exam I invested the most time and money into. At the start I set a goal of 100 total and 20 in speaking. Along the way I went through countless days of lost confidence, of being drowned by anxiety, of practicing speaking until my tongue tied itself in knots — and finally, on November 3, 2023, I checked my scores and was satisfied. ...

Catching Mining Virus

Problem On October 30, 2023, I received a warning message from the data center administrator, informing me that the firewall had detected mining traffic sent from a server I manage. The “mining traffic” was a DNS request for bitcoin.sipa.be sent to 223.5.5.5. Initially, I thought finding the virus process would be simple, just like my previous encounter with another mining virus. In that case, the attacker logged in to the server by cracking a weak SSH password and gained root permission, possibly by exploiting a privilege escalation vulnerability (the server was running EOL Ubuntu 16.04). Then a cron job was set up to run a mining virus. ...

Using an SSH Reverse Tunnel to Log Into BitaHub Containers and Hold GPUs Long-Term

Problem Every year before CVPR, GPUs are always in short supply, and we need to borrow cards from elsewhere. USTC provides BitaHub for on-campus users, but it suffers from the same shortage of cards before CVPR. At the same time, its job-submission-based usage model is very inconvenient: submitting jobs that occupy multiple cards often requires a long wait in the queue, and its data management approach is downright user-hostile. As the server administrator for my group, in order to make my life easier before CVPR and to avoid repeating the 2021 pre-CVPR ordeal of scrambling to allocate resources, I needed to improve the BitaHub experience: ...

Enabling QUIC in Nginx While Keeping SNI Routing

Problem Since version 1.25.0, Nginx’s support for QUIC has been merged into mainline. Users who want to try it out can simply use the official nginx docker image, which is very convenient. However, the nginx on my server uses SNI routing, driven by the needs of a new generation of TLS-based proxy protocols such as Shadow TLS and Xray Reality. These proxy protocols cannot have their TLS layer handled by nginx on their behalf (unlike earlier protocols that could use gRPC/WebSocket and the like as their data transport). But in order to achieve the best camouflage effect, using the 443/tcp port is necessary (the whitelisted target sites used for camouflage generally only serve HTTPS on the 443/tcp port). Therefore, multiplexing the 443/tcp port is necessary. ...

Optimizing MKL Performance on AMD CPUs

The Problem My lab has some AMD EPYC 7713 servers. We bought them because some people in the group run programs with very high CPU load (I don’t know what kind of load it is, or why it can’t run on the GPU, and I don’t have the energy to help everyone solve it one by one). AMD processors with their many cores are a great fit for this kind of demand. ...

VCB-Studio Technical Director Entry Test 2023 and My Answer

See the original publication page for more details. All my answer files can be browsed here, or you can download the zipped file (5.9G). Requirements This is a test for candidates who wish to participate in the training class organized by VCB-Studio. Finish as many problems as you can, and then do the following things: Pack your answers, result files, and necessary attachments into a zip/rar/7z file. Source files we provided and intermediate file in your encoding should not be packed in. Register a Baidu Net Disk account (https://pan.baidu.com), upload the zipped file and create a sharing link. Whether you like it or not, Baidu Net Disk has been the most effective way to share files within our team since day one. Other sharing methods will NOT be considered. Send the link via email to [email protected] before Beijing Time (UTC+8) Monday, 23 Jan 2023, 23:59:59. Late submissions will NOT be considered. Prepare a QQ account. The follow-up training courses will be conducted in the QQ group. You should independently complete the answers without any public discussion. Any form of plagiarism will NOT be tolerated. ...

Hello World

My first post on this blog!