posts on Monsoon's Blog

Using GPU accessible VS Code Server on UIUC Delta

Sun, 22 Dec 2024 00:00:00 +0000

Why I wrote this blog post

Many UIUC students rely on Delta to access GPU resources for their research. Delta provides 4 ssh-enabled login nodes and many computing nodes with GPUs. Usually, we must first ssh to a login node (with a password and a DUO 2FA OTP), and then use srun to request GPU resources to run our code. However, in my experience, this workflow runs into several problems:

Unstable network connection: The connection drops frequently when the network is poor. Each time VS Code Remote loses its connection, you must re-enter the password and the DUO 2FA OTP (which means unlocking your phone) to reconnect, which is annoying, time-consuming, and distracting.
Broken OnDemand Code Server: Although you can run VS Code Remote on the login nodes over ssh, they have no GPU for debugging, and the computing nodes are not accessible by ssh. The alternatives include OnDemand Jupyter Lab and Code Server. But Jupyter Lab’s functionality is limited, and the Code Server is broken: when I request a Code Server on the computing nodes, the system just queues the request and then shows it as completed, without it ever reaching a running state.

Because of these problems, debugging GPU programs on Delta is a struggle. That’s why I wrote this blog post: by running a private Code Server on a computing node and deploying a Cloudflare Tunnel reverse proxy, you can say goodbye to these annoying problems.

How to

My solution is based on an observation about Delta: all login nodes and computing nodes are in a trusted network. There are no firewalls between them, which means you can access any port on the computing nodes from the login nodes.

The main steps of my solution are simple:

Use srun to get a tty on the computing node (e.g., on gpua042 node).
Run a Code Server on the computing node. It will listen on 0.0.0.0:8080.
Reverse proxy gpua042:8080 to a port you can access. There are two approaches:
- Use ssh -L to forward the port to your local machine.
- Use Cloudflare Tunnel to reverse proxy the port to a public domain. This approach is more stable in poor network conditions.

Run Code Server

Download the Code Server binary from the GitHub repository (e.g., code-server-4.96.2-linux-amd64.tar.gz), and extract it. On the computing node, run:



cd code-server-4.96.2-linux-amd64/bin

## no auth
./code-server --bind-addr 0.0.0.0:8080 --auth none

## if port is exposed to untrusted network, use password auth
## password can be modified in ~/.config/code-server/config.yaml
./code-server --bind-addr 0.0.0.0:8080

Access Code Server

SSH Port Forwarding

ssh -L can forward a local port to a remote port. Run:


ssh -L 127.0.0.1:8080:gpua042:8080 username@login.delta.ncsa.illinois.edu

Then open http://127.0.0.1:8080 in your browser, and enjoy the Code Server!

Cloudflare Tunnel

Cloudflare Tunnel is more stable when your computer suffers from a poor network connection, but it requires a domain name.

TODO

All About IPv6 Address Allocation

Sat, 12 Oct 2024 00:00:00 +0000

Preface

IPv4 has only one method of dynamic address allocation, namely DHCP, but IPv6 has two allocation methods, SLAAC and DHCPv6, and DHCPv6 additionally has the PD (Prefix Delegation) extension. These three allocation methods also interact with each other, which makes problems arising during IPv6 allocation far more common than with IPv4. Most tutorials you can find only solve problems superficially, are ambiguous about the underlying technical details, and do not fundamentally clarify the differences between IPv6 and IPv4.

This article aims to start from the relevant fundamental concepts and, in a “teach a man to fish” manner, explain how the three IPv6 address allocation methods work, helping to thoroughly resolve the tricky problems in IPv6 allocation.

IPv6 Fundamental Concepts

LLA (Link-Local Address) and EUI-64

LLA actually already existed in IPv4: when DHCP is not working properly, some operating systems assign a 169.254.0.0/16 address to the network interface for temporary point-to-point communication. But LLA is not important in IPv4, playing only an optional fallback role that appears only when DHCP fails. As a result, the vast majority of people (including the author) did not learn about the existence of LLA until IPv6 became widespread.

IPv6 LLA (fe80::/10) inherits the basic point-to-point communication function of IPv4 LLA, but goes further to take on the important functions of NDP (Neighbor Discovery Protocol) and SLAAC (Stateless Address Autoconfiguration). Understanding it is necessary to understand how SLAAC works.

For example, when two network ports are directly connected with a cable, they each automatically generate an IPv6 LLA, such as fe80::dfc2:d2aa:c86f:171e/64 and fe80::da8f:9d5b:57e3:c6a6/64, and each can ping the other’s LLA. On Linux, the ip -6 route command shows the automatically configured LLA route entry:


fe80::/64 dev eth0 proto kernel metric 1024 pref medium

IPv6 LLA is generated from the MAC address using a specific algorithm, namely EUI-64. For example, when the network port’s MAC address is 70:07:12:34:56:78, the generated EUI-64 is 7207:12ff:fe34:5678, and the LLA is fe80::7207:12ff:fe34:5678/64 (the fe80::/64 prefix followed by the EUI-64). The specific generation process is shown in the figure below:
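The generation process can also be sketched in a few lines of Python. This is a minimal illustration, assuming the usual modified EUI-64 steps (flip the universal/local bit of the first octet, then insert ff:fe in the middle of the MAC address); the function names are hypothetical and only the standard `ipaddress` module is used.

```python
import ipaddress


def eui64_interface_id(mac: str) -> int:
    """Return the 64-bit modified EUI-64 interface identifier for a MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address such as 70:07:12:34:56:78")
    octets[0] ^= 0x02                                # flip the universal/local bit
    eui64 = octets[:3] + b"\xff\xfe" + octets[3:]    # insert ff:fe in the middle
    return int.from_bytes(eui64, "big")


def link_local_address(mac: str) -> ipaddress.IPv6Address:
    """Place the interface identifier in the low 64 bits under the fe80:: prefix."""
    prefix = int(ipaddress.IPv6Address("fe80::"))
    return ipaddress.IPv6Address(prefix | eui64_interface_id(mac))


print(link_local_address("70:07:12:34:56:78"))  # fe80::7207:12ff:fe34:5678
```

Note the `::` in the result: the 64-bit prefix is fe80 followed by 48 zero bits, which the compressed notation collapses.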

Generally, routers do not forward traffic for LLA addresses; it is only used for point-to-point communication on the link.

GUA (Global Unicast Address)

IPv6 GUA (2000::/3) can be mapped to the IPv4 concept of a “public IP”. In theory it is globally unique and can be used for communication over the public network. A well-designed network architecture should allow every device to obtain an IPv6 GUA, so as to maximize IPv6’s P2P communication advantage.

Private Addresses

fc00::/7 is defined as the IPv6 private address range, analogous to 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 in IPv4, used for LAN communication. Unlike LLA, it can be forwarded by routers.

Because IPv6 is designed so that every device worldwide can be assigned a GUA, the role of private addresses in IPv6 is greatly diminished. When it is not possible to assign a GUA to every device (as in some campus network environments), assigning IPv6 private addresses on the internal network can serve as an alternative, allowing internal devices to access IPv6.

Multicast

IPv6 multicast addresses (ff00::/8) are similar to IPv4 multicast addresses (224.0.0.0/4), used for one-to-many communication within a network segment. Both SLAAC and DHCPv6 rely on multicast to work. Commonly used multicast addresses include:

ff02::1: all nodes on the local link;
ff02::2: all routers on the local link.

NDP (Neighbor Discovery Protocol)

NDP works on top of ICMPv6 and is similar to IPv4 ARP. It is used to discover other nodes on the data link layer and their corresponding IPv6 addresses, to determine available routes, and to maintain reachability information about available paths and other active nodes. SLAAC works based on NDP. The message types involved are:

RS (Router Solicitation) and RA (Router Advertisement): used to configure IPv6 addresses and routes;
NS (Neighbor Solicitation) and NA (Neighbor Advertisement): used to find the MAC addresses of other devices on the link.

SLAAC (Stateless Address Autoconfiguration)

SLAAC is the IPv6 address allocation method defined in RFC 4862, and is also the recommended allocation method. In fact, Android only supports SLAAC for IPv6 allocation.

The most notable feature of SLAAC is that it is stateless, i.e. it does not require a centralized server responsible for allocation. Below, the author uses an example to illustrate the SLAAC process.

Suppose the lan0 port on the router is connected to the eth0 port on the host. The LLA of lan0 is fe80::1/64, and the MAC address of eth0 is 70:07:12:34:56:78. At the same time, the router holds the GUA prefix 2001:db8::/64, i.e. all GUAs under this subnet will be routed by the upstream router to this router’s wan port. The SLAAC process is as follows:

eth0 generates the EUI-64 7207:12ff:fe34:5678 and the LLA fe80::7207:12ff:fe34:5678/64 based on its MAC address;
The host performs DAD (Duplicate Address Detection) to ensure the LLA is unique on the local link. This is unrelated to address allocation, so it is omitted here; interested readers can look up the relevant material themselves;
The host sends an RS message via the eth0 LLA. The RS is sent to all routers on the local link using the multicast address ff02::2.
The router replies with an RA message to the eth0 LLA. The RA contains the prefix 2001:db8::/64, the validity period, the MTU, and other information.

The host receives the RA, combines the prefix and the EUI-64 into 2001:db8::7207:12ff:fe34:5678/64, assigns it to eth0, and adds the routing table entries:



2001:db8::/64 dev eth0 proto ra metric 1024 expires 2591993sec pref medium
default via fe80::1 dev eth0 proto static metric 1024 onlink pref medium

The host performs DAD detection and uses an NA message to announce the use of the new address to neighbors on the link.

SLAAC looks great, but it has an important flaw: it does not support distributing DNS information, so the host must obtain DNS through some other means (usually DHCPv6). There are two flag bits in the RA to address this problem:

M (Managed Address Configuration): address information can be obtained via DHCPv6;
O (Other Configuration): other information (such as DNS) can be obtained via DHCPv6.

The newer RFC 6106 supports distributing DNS information by adding RDNSS (Recursive DNS Server) and DNSSL (DNS Search List) to the RA. For the level of RDNSS support across operating systems, see Comparison of IPv6 support in operating systems. In practice, in the vast majority of cases you only need to configure IPv4 DNS (obtained via DHCPv4), so the RDNSS extension is not very meaningful.

The problem with the EUI-64-based SLAAC address configuration above is that the addresses it generates are fixed and predictable, which brings security and privacy concerns. The IPv6 SLAAC privacy extension defined in RFC 4941 solves this problem. During SLAAC it also generates random, periodically rotated addresses to address the privacy issue. At the same time, the EUI-64-generated address is also retained, for use by externally incoming connections. With the privacy extension enabled, the IPv6 addresses generated on Linux look like the following, for example (from top to bottom: the privacy address, the EUI-64 GUA, and the LLA):



2: eth0:  mtu 1500 qdisc cake state UP group default qlen 1000
    link/ether 70:07:12:34:56:78 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::dead:beef:aaaa:bbbb/64 scope global temporary dynamic
       valid_lft 2591998sec preferred_lft 604798sec
    inet6 2001:db8::7207:12ff:fe34:5678/64 scope global dynamic mngtmpaddr noprefixroute
       valid_lft 2591998sec preferred_lft 604798sec
    inet6 fe80::7207:12ff:fe34:5678/64 scope link
       valid_lft forever preferred_lft forever

DHCPv6

DHCPv6 operates in broadly the same way as DHCPv4: the host sends a multicast message to ff02::1:2 on UDP port 547, and the DHCPv6 server replies with address, DNS, and other information.

The difference is that DHCPv6 can run in either a stateful or a stateless mode, the distinction being whether or not an address is obtained. When used together with SLAAC, the host only needs to obtain DNS and other information from DHCPv6, so stateless DHCPv6 can be used.

DHCPv6 PD (Prefix Delegation)

PD is a DHCPv6 extension defined in RFC 3633. It is used to distribute IPv6 prefixes across a network.

With the PD extension enabled, the DHCP server grants the host the right to use an IPv6 subnet prefix (such as 2001:db8::/56) and adds routing table entries to ensure that all addresses under this subnet are routed to the host that requested the prefix. The host can then further subdivide and allocate this subnet.

A typical use case for DHCPv6 PD is home ISP network access. The home gateway router requests an IPv6 prefix from the ISP DHCP server, and then distributes addresses from this subnet prefix within the home internal network via SLAAC.
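As a rough illustration of "further subdividing" a delegated prefix, the sketch below splits the example /56 into /64 subnets, one per downstream LAN segment. It only uses Python's standard `ipaddress` module; the idea that the gateway hands out one /64 per segment is a common setup assumed here, not something DHCPv6 PD itself mandates.

```python
import ipaddress

# Delegated prefix from the example above (documentation address space).
delegated = ipaddress.ip_network("2001:db8::/56")

# A /56 contains 2**(64 - 56) = 256 subnets of size /64.
lan_subnets = list(delegated.subnets(new_prefix=64))

print(len(lan_subnets))   # 256
print(lan_subnets[0])     # 2001:db8::/64
print(lan_subnets[1])     # 2001:db8:0:1::/64
print(lan_subnets[-1])    # 2001:db8:0:ff::/64
```

Each of these /64 prefixes can then be announced on one internal link via RA, so that hosts on that link form their addresses with SLAAC.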

Conclusion

This article briefly introduced some of the concepts involved in IPv6 address allocation and explained how SLAAC, DHCPv6, and DHCPv6 PD work. In terms of simplifying address management, IPv6 can be said to have been rather unsuccessful: multiple standards coexist, and there are various combinations of them, which gives clients a non-trivial probability of failing to correctly obtain IPv6.

In practice, the three most common IPv6 allocation scenarios we encounter are:

Pure SLAAC: typical campus networks (education networks) fall into this category. In practice, the author has found cases where a misconfigured host on the internal network indiscriminately sends RAs, causing the IPv6 of all hosts on the entire internal network to be misconfigured. At the same time, in this mode, a router you connect yourself will no longer be able to distribute SLAAC GUAs to downstream devices, because the local-link multicast packets that SLAAC relies on cannot be forwarded by the router (this can be solved via IPv6 bridging or NAT6, which is not elaborated on here).
Pure DHCPv6: some enterprise internal networks use this mode, because DHCPv6 allows centralized management. The biggest problem with this mode is that Android does not support DHCPv6. But under other operating systems, this mode runs fairly stably.
SLAAC + DHCPv6 PD: this is the most common mode for home ISP network access. Most home routers are adapted for it and work out of the box.


Extracting Graph Topology from Image

Thu, 11 Jul 2024 00:00:00 +0000

The Problem

Now we have an image representing a graph, as shown in the figure below:

Suppose we already know the category of each pixel: background, node, or edge. How can we extract the graph topology from it and represent the graph by an adjacency matrix?

Challenges in Classical Algorithm

TODO

What about Neural Network?

We can use a simple algorithm to extract the position of each node. Suppose the position of a node is $\mathbf{P}(x,y)$, and there are $N$ nodes in total.

Then, the task is to fill in the $N\times N$ adjacency matrix with $0$ or $1$. As we can see, this can be converted into a binary classification problem.

We can train a neural network $\mathbf{f}$ that takes the image $\mathbf{I}$ and the positions of a node pair $\left( \mathbf{P}_ 1, \mathbf{P}_ 2 \right)$ as input. It outputs $O\in\{0,1\}$, indicating whether there is a direct connection between the node pair, i.e.,

$$O=\mathbf{f}(\mathbf{I}, \mathbf{P}_ 1, \mathbf{P}_ 2).$$

The dataset can be synthesized by a simple program, and we can use any classification network (e.g., EfficientNet) as our network architecture.

The problem is how to feed $\left( \mathbf{P}_ 1, \mathbf{P}_ 2 \right)$ into the network. We can add an additional “mask channel” to the image, where the pixels belonging to the two input nodes are marked as 1, and the others as 0. Finally, we input this 4-channel “image” into the network.
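A minimal NumPy sketch of building this 4-channel input is shown below. It assumes the image is an (H, W, 3) array and that a boolean pixel mask is already available for each of the two nodes (which follows from knowing the category of each pixel plus a node-separation step); the function name and the toy data are hypothetical.

```python
import numpy as np


def add_node_pair_mask(image, node_mask_1, node_mask_2):
    """Append a fourth channel that is 1 on the pixels of the two query nodes.

    image:        (H, W, 3) float array
    node_mask_*:  (H, W) boolean arrays, True on the pixels of each node
    """
    pair_mask = np.logical_or(node_mask_1, node_mask_2).astype(image.dtype)
    return np.concatenate([image, pair_mask[..., None]], axis=-1)


# Toy 8x8 image with two 2x2 "nodes".
image = np.zeros((8, 8, 3), dtype=np.float32)
node_a = np.zeros((8, 8), dtype=bool)
node_b = np.zeros((8, 8), dtype=bool)
node_a[1:3, 1:3] = True
node_b[5:7, 4:6] = True

net_input = add_node_pair_mask(image, node_a, node_b)
print(net_input.shape)  # (8, 8, 4)
```

For a graph with $N$ nodes, one such input would be built per node pair, and the network's outputs fill the corresponding entries of the adjacency matrix.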

Other Notes

TODO

Latency in LLM Serving

Sun, 07 Jul 2024 00:00:00 +0000

Preface

There have been many excellent works on LLM serving, mainly focusing on improving throughput. In practical applications, however, latency is equally important. Few works currently focus on improving LLM serving latency, especially latency optimization under SLA constraints.

This blog attempts to summarize the basic concepts and problems in this direction, and suggests some research directions based on an analysis of latency in LLM serving.

Latency Metrics

In LLM serving, we mainly focus on three latency metrics:

TBT ($t_ {tbt}$): Time Between Tokens.
TTFT ($t_ {ttft}$): Time to First Token.
TE2E ($t_ {e2e}$): Time of End-to-end.

In practice, rather than the average or median latency, we usually consider a latency SLA, which requires that given percentiles of requests (e.g., 50%, 90%, and 99%) fall below certain thresholds.

Where Does the Latency Come From?

As shown in the figure above, the current popular LLM serving systems (such as vLLM, DeepSpeed) adopt an iteration-level scheduling strategy. The processing of each request is divided into the prefilling stage (prompt inference) and the generation stage (auto-regressive token-by-token generation). For systems such as Sarathi-Serve, the prompt is chunked to improve throughput, thus adding a chunked prefilling stage.

The LLM serving system maintains 3 queues to store requests in these 3 states. The scheduler runs in a loop, and in each iteration, it selects requests from these 3 queues with a certain strategy, and combines them into a batch for the inference engine.

In such systems, the latency of a request mainly comes from two sources: queue latency and inference latency. Let $t_ {qp}$, $t_ {qc}$, and $t_ {qg}$ denote the time a request waits in the prefilling queue, chunked prefilling queue, and generation queue, respectively, before being selected by the scheduler, and let $t_ {inf}$ denote the inference latency of the engine. We get:

$$\begin{aligned} t_ {ttft} &= t_ {qp} + (N_ {chunk} - 1) \cdot t_ {qc} + N_ {chunk} \cdot t_ {inf}, \\\\ t_ {tbt} &= t_ {qg} + t_ {inf}, \\\\ t_ {e2e} &= t_ {ttft} + N_{token} \cdot t_ {tbt}, \end{aligned}$$

where $N_ {chunk}$ is the number of chunks of a prefilling request ($N_ {chunk}=1$ means no chunking), and $N_ {token}$ is the total number of tokens generated by a request.
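To make the three formulas concrete, here is a small Python sketch that evaluates them. It is only an illustration under a simplifying assumption: every queue wait and $t_ {inf}$ is treated as a constant, and the numbers (in milliseconds) are hypothetical.

```python
def ttft(t_qp, t_qc, t_inf, n_chunk=1):
    """Time to first token: one prefilling-queue wait, (n_chunk - 1)
    chunked-prefilling-queue waits, and n_chunk engine iterations."""
    return t_qp + (n_chunk - 1) * t_qc + n_chunk * t_inf


def tbt(t_qg, t_inf):
    """Time between tokens: one generation-queue wait plus one engine iteration."""
    return t_qg + t_inf


def e2e(t_qp, t_qc, t_qg, t_inf, n_chunk, n_token):
    """End-to-end time: TTFT plus n_token token intervals."""
    return ttft(t_qp, t_qc, t_inf, n_chunk) + n_token * tbt(t_qg, t_inf)


# Hypothetical values in milliseconds.
print(ttft(20, 5, 30, n_chunk=4))                  # 155
print(tbt(2, 30))                                  # 32
print(e2e(20, 5, 2, 30, n_chunk=4, n_token=100))   # 3355
```

Even in this toy setting, the generation phase dominates the end-to-end time once a request produces more than a handful of tokens.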

Obviously, $t_ {inf}$ is not a fixed value. It depends on the composition of the batch. We can denote it as:

$$t_ {inf} = f\left( B_ {p}, B_ {c}, B_ {g}, \mathbf{L}_ {p}, L_ {chunk} \right),$$

where $B_p$, $B_c$, and $B_g$ denote the numbers of non-chunked prefilling requests, chunked prefilling requests, and generation requests, respectively. The vector $\mathbf{L}_ {p}$ holds the prompt length of each non-chunked prefilling request in the batch, and $L_ {chunk}$ is the chunk size.

How to Improve It?

Based on the above analysis, we can find that reducing latency mainly involves reducing both queue latency and inference latency. In fact, some techniques, such as iteration-level scheduling and chunked prefilling, can be seen as improvements to queue latency.

On the other hand, improving inference latency has not received much attention. One reason is that, for inference engines, there is a trade-off between latency and throughput. Generally speaking, a larger batch size means higher throughput, but also higher inference latency. Techniques such as quantization and Paged Attention focus on more efficient memory usage to increase batch size, but inference latency may also increase accordingly (TODO: add an example), which means $t_ {tbt}$ and $t_ {ttft}$ may increase and SLA requirements may be violated.

Therefore, there is an opportunity to improve inference latency in current LLM serving systems. The target may be an SLA-aware scheduler, which can maximize throughput without breaking SLA requirements. It should be able to dynamically decide the batch size and batch composition instead of just deploying a static prefilling-prioritize or generation-prioritize strategy.

I believe the key to this design is to predict $t_ {inf}$ to provide latency optimization guidance for the scheduler. Prediction based on profiling results may be a simple approach, but a performance model based on GPU computation capability and memory bandwidth might be more general.

Once we can predict $t_ {inf}$, the queue latencies $t_ {qp}$, $t_ {qc}$, and $t_ {qg}$ can also be predicted using mathematical tools such as queueing theory (e.g., modeling arrivals as a Poisson process), allowing us to optimize serving for the following scenarios:

When the request arrival rate is less than the maximum throughput: we can appropriately reduce batch size to improve $t_ {tbt}$.
When the request arrival rate is greater than the maximum throughput: we can adjust the batch composition dynamically based on queue length, or drop some requests to avoid starvation.
When the request arrival rate suddenly increases: we can adjust the batch composition to avoid breaking the SLA of $t_ {ttft}$.

In summary, this SLA-aware scheduler should provide better results than a static scheduler by considering arrival rate, queue length, and predicted $t_ {inf}$.
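As a toy illustration of the idea, the sketch below picks the largest batch size whose predicted $t_ {tbt}$ still meets an SLA threshold. Everything here is an assumption made for illustration: the linear `predict_t_inf` model, its coefficients, and the function names are hypothetical, and a real scheduler would also have to consider batch composition and $t_ {ttft}$.

```python
def predict_t_inf(batch_size, base_ms=20.0, per_request_ms=0.5):
    """Hypothetical latency model: a fixed cost plus a per-request cost (ms)."""
    return base_ms + per_request_ms * batch_size


def max_batch_under_sla(tbt_sla_ms, t_qg_ms, max_batch=256, predictor=predict_t_inf):
    """Return the largest batch size with t_qg + predicted t_inf <= SLA, or 0 if none."""
    best = 0
    for batch_size in range(1, max_batch + 1):
        if t_qg_ms + predictor(batch_size) <= tbt_sla_ms:
            best = batch_size
    return best


print(max_batch_under_sla(tbt_sla_ms=50.0, t_qg_ms=2.0))  # 56
print(max_batch_under_sla(tbt_sla_ms=10.0, t_qg_ms=2.0))  # 0
```

Replacing the linear model with a profiling-based or hardware-based predictor would leave the selection loop unchanged, which is why predicting $t_ {inf}$ is the key ingredient.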

Some Meaningful Experiment Result

TODO

How Quantization Works: From a Matrix Multiplication Perspective

Wed, 06 Mar 2024 00:00:00 +0000

Introduction

Quantization is a commonly used acceleration technique in NN inference. The primary computational workloads in NNs come from Convolution, Linear Layers, and Attention, which are implemented by GEMM at the lower level. This blog aims to discuss the principles of quantization from the matrix multiplication perspective and to explain why some quantization methods are impractical. It also aims to review several LLM quantization methods from this perspective.

I define practical quantization as follows:

Operation can still be performed using GEMM after quantization. This requires both mathematical feasibility and hardware support. It is a fundamental requirement for achieving acceleration.
Quantization must lead to actual acceleration. Acceleration can arise from higher INT8 hardware throughput, or from the memory bandwidth saved by smaller memory footprint. Importantly, the benefits of acceleration must outweigh the quantization overhead.

Let’s do some math

Suppose an operator can be expressed in the form of matrix multiplication:

$$\mathbf{Y}=\mathbf{X} \mathbf{W}^\top,$$

where $\mathbf{X} \in \mathbb{R}^{N \times C}$, $\mathbf{Y} \in \mathbb{R}^{N \times D}$, $\mathbf{W} \in \mathbb{R}^{D \times C}$, while their quantized versions are denoted as $\hat{\mathbf{X}}$, $\hat{\mathbf{Y}}$, $\hat{\mathbf{W}}$. Our goal is to ensure that operations can still be performed using GEMM after quantization, i.e.:

$$\hat{\mathbf{Y}}=\hat{\mathbf{X}} \hat{\mathbf{W}}^\top.$$

Let the per-element quantization functions for $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{W}$ be denoted as $p_{nc}(\cdot)$, $q_{nd}(\cdot)$, $r_{dc}(\cdot)$ respectively:

$$\begin{aligned} \hat{x}_ {nc} &= p_ {nc}(x_{nc}), \\\\ \hat{y}_ {nd} &= q_ {nd}(y_{nd}), \\\\ \hat{w}_ {dc} &= r_ {dc}(w_{dc}). \end{aligned}$$

The corresponding dequantization functions are denoted as $p_ {nc}^{-1}(\cdot)$, $q_ {nd}^{-1}(\cdot)$, $r_ {dc}^{-1}(\cdot)$, i.e.:

$$\begin{aligned} y_ {nd} &= \sum_ {c=1}^{C} x_ {nc} w_ {dc}, \\\\ q_ {nd}^{-1}(\hat{y}_ {nd}) &= \sum_ {c=1}^{C} p_ {nc}^{-1}(\hat{x}_ {nc}) \cdot r_ {dc}^{-1}(\hat{w}_ {dc}). \end{aligned}$$

The above formulas set the basic constraints that practical quantization should satisfy mathematically.

Some basic quantization methods

With these basic constraints, we can now discuss several fundamental quantization methods, including per-element, per-channel, per-token, and per-tensor quantization.

Per-element and Per-channel

In the basic constraints mentioned above, the dequantization function $q_ {nd}^{-1}(\cdot)$ on the left-hand side does not depend on $c$. Clearly, if the right-hand side dequantization functions $p_ {nc}^{-1}(\cdot)$ and $r_ {dc}^{-1}(\cdot)$ depend on $c$, this constraint will be violated. This implies that these two conditions cannot be satisfied at the same time:

Computation can be done by GEMM.
Different quantization functions can be applied in different channels of $\mathbf{X}$ and $\mathbf{W}$.

In other words, this indicates that per-element and per-channel quantization cannot be accelerated using GEMM. They are impractical.

Per-token and per-tensor

From the above discussion, we know that practical quantization needs to satisfy at least:

$$\begin{aligned} p_ {n}(\cdot) &= p_ {nc} (\cdot), \quad \forall n, c, \\\\ r_ {d}(\cdot) &= r_ {dc} (\cdot), \quad \forall d, c. \end{aligned}$$

That is, the quantization function is the same for all channels. Therefore, the basic constraint can be formulated as:

$$q_ {nd}^{-1}(\hat{y}_ {nd}) = \sum_ {c=1}^{C} p_ {n}^{-1}(\hat{x}_ {nc}) \cdot r_ {d}^{-1}(\hat{w}_ {dc}).$$

Thus, we get per-token quantization. If we further assume:

$$\begin{aligned} p(\cdot) &= p_ {nc} (\cdot), \quad \forall n, c, \\\\ r(\cdot) &= r_ {dc} (\cdot), \quad \forall d, c. \end{aligned}$$

That is, the quantization function is the same for all elements in both $\mathbf{X}$ and $\mathbf{W}$. Therefore, the basic constraint can be formulated as:

$$q_ {nd}^{-1}(\hat{y}_ {nd}) = q^{-1}(\hat{y}_ {nd}) = \sum_ {c=1}^{C} p^{-1}(\hat{x}_ {nc}) \cdot r^{-1}(\hat{w}_ {dc}).$$

We thus obtain per-tensor quantization. While both of these quantization methods are theoretically feasible, their practical value is still limited by hardware support (as discussed in the next section).

For convenience, the following discussion focuses only on per-token quantization. Per-tensor quantization can be seen as a special case of per-token quantization. The most commonly used quantization method in practice is symmetric uniform quantization, which scales the value range using multiplication, i.e.:

$$\begin{aligned} \hat{x}_ {nc} &= p_ {n}(x_ {nc}) = p_ n x_ {nc}, \\\\ \hat{w}_ {dc} &= r_ {d}(w_ {dc}) = r_ d w_ {dc}, \\\\ \hat{y}_ {nd} &= q_ {nd}(y_ {nd}) = p_ n r_ d y_ {nd}. \end{aligned}$$

We can formulate per-token symmetric uniform quantization by matrix multiplication:

$$\begin{aligned} \hat{\mathbf{X}} &= \text{diag}(p_1,\cdots,p_ N)\cdot \mathbf{X} = \begin{pmatrix} p_ 1 & \cdots & p_ 1 \\\\ \vdots & \ddots & \vdots \\\\ p_ N & \cdots & p_ N \end{pmatrix} \otimes \mathbf{X}, \\\\ \hat{\mathbf{W}} &= \text{diag}(r_1,\cdots,r_ D)\cdot \mathbf{W} = \begin{pmatrix} r_ 1 & \cdots & r_ 1 \\\\ \vdots & \ddots & \vdots \\\\ r_ D & \cdots & r_ D \end{pmatrix} \otimes \mathbf{W}, \\\\ \hat{\mathbf{Y}} &= \text{diag}(p_1,\cdots,p_ N)\cdot \mathbf{Y} \cdot \text{diag}(r_1,\cdots,r_ D) = \begin{pmatrix} p_ 1 r_ 1 & \cdots & p_ 1 r_ D \\\\ \vdots & \ddots & \vdots \\\\ p_ N r_ 1 & \cdots & p_ N r_ D \end{pmatrix} \otimes \mathbf{Y}, \end{aligned}$$

where $\otimes$ represents element-wise matrix multiplication. It can be observed that both quantization and dequantization can be efficiently implemented using element-wise matrix multiplication with dimension broadcasting. The following figure illustrates the computation process by an example:
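The same computation can also be sketched numerically with NumPy broadcasting. This is a minimal illustration, not a production kernel: the scales are chosen with a simple abs-max rule (127 divided by the largest magnitude in each row), which is an assumption not stated in the text, and the INT8 GEMM is emulated with INT32 accumulation on the CPU.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, D = 4, 8, 3
X = rng.standard_normal((N, C))   # activations, one row per token
W = rng.standard_normal((D, C))   # weights, one row per output channel
Y = X @ W.T

# One scale per token (row of X) and one per output channel (row of W).
p = 127.0 / np.abs(X).max(axis=1)   # shape (N,)
r = 127.0 / np.abs(W).max(axis=1)   # shape (D,)

# Without rounding, the scaled GEMM equals Y scaled element-wise by p_n * r_d.
X_hat = p[:, None] * X
W_hat = r[:, None] * W
identity_holds = np.allclose(X_hat @ W_hat.T, p[:, None] * Y * r[None, :])

# With rounding to INT8, the GEMM runs on integers and is dequantized afterwards.
X_q = np.round(X_hat).astype(np.int8)
W_q = np.round(W_hat).astype(np.int8)
Y_q = X_q.astype(np.int32) @ W_q.astype(np.int32).T   # INT32 accumulation
Y_deq = Y_q / (p[:, None] * r[None, :])                # dequantize by broadcasting

max_err = np.abs(Y_deq - Y).max()
print(identity_holds, max_err)
```

The only work outside the GEMM is the row-wise scaling of the inputs and the broadcasted division of the output, which is what makes per-token quantization compatible with GEMM.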

Hardware requirements

Hardware support still needs to be considered when we try to utilize GEMM for quantization. For example, on NVIDIA GPUs, Tensor Cores support matrix multiplication for FP16 and INT8, but they do not support mixed-precision FP16/INT8 matrix multiplication. This means that W8A8 quantization can benefit from Tensor Cores, but W8A16 and W16A8 quantization lack hardware support and may not achieve real acceleration on NVIDIA GPUs. Many W8A16 and W16A8 quantization methods actually perform dequantization before GEMM and then use FP16 for computation. The actual acceleration effects of these methods require further discussion (see below).

Performance analysis

The above discussion only shows that per-token quantization can leverage GEMM. This section examines whether it can provide actual acceleration.

We compare the following three setups:

Unquantized, using FP16 for both storage and computation.
W8A8 quantization, with I/O activations stored in FP16. This is the approach used by some works like LLM.int8(). To avoid additional CUDA kernel launch overhead, we assume that quantization and dequantization are fused with GEMM.
W8A16 quantization, internally converting weights to FP16 for computation. Kernel fusion is also applied here.

Without loss of generality, we assume that the hardware INT8 throughput is $2\times$ that of FP16. We normalize the cost of one INT8 operation to $1$ and that of one FP16 operation to $2$. This gives the following table:

Method	FP16	W8A8 (FP16 activations I/O)	W8A16
GEMM OPs	$2NCD$	$NCD$	$2NCD$
GEMM mem I/O	$2(NC+CD+ND)$	$2NC+CD+2ND$	$2NC+CD+2ND$
quant/dequant OPs	$0$	$2NC+4ND$	$2CD$
quant/dequant mem I/O	$0$	$2(N+D)$	$2D$
total OPs	$2NCD$	$NCD+2NC+4ND$	$2NCD+2CD$
total mem I/O	$2(NC+CD+ND)$	$2NC+CD+2ND+2(N+D)$	$2NC+CD+2ND+2D$
total arithmetic intensity (OPs:I/O)	$\cfrac{1}{1/N+1/C+1/D}$	$\cfrac{1+2/D+4/C}{1/N+2/C+2/D+2/(NC)+2/(CD)}$	$\cfrac{1+1/N}{1/(2N)+1/C+1/D+1/(NC)}$
total arithmetic intensity (second-order approximation)	$\cfrac{1}{1/N+1/C+1/D}$	$\cfrac{1}{1/N+2/C+2/D}$	$\cfrac{1}{1/(2N)+1/C+1/D}$

Analyzing the table above, we can draw the following conclusions:

W8A8 quantization (with FP16 activations I/O) reduces the operations by almost half compared to FP16, but it decreases the total arithmetic intensity. Therefore, in memory-bound scenarios, W8A8 quantization may not achieve a $2\times$ throughput improvement (ZeroQuant addresses this issue, as discussed below). But it can still lead to a significant throughput improvement when memory bandwidth is sufficient.
W8A16 quantization keeps the number of operations similar to FP16, but it increases the total arithmetic intensity (the relative gain is largest when $N$ is small, where weight traffic dominates the memory I/O). Therefore, it also has practical value in memory-bound scenarios, especially since activations in LLMs are typically harder to quantize than weights.

Some LLM Quantization works

`LLM.int8()`

LLM.int8() actually employs selective per-token quantization. It stores weights and activations in FP16 and then applies different strategies for different tokens, as illustrated below:

For tokens suitable for quantization, it applies per-token INT8 quantization to weights and activations, computes results using INT8 GEMM, and then dequantizes them to FP16.
For tokens with outliers, it directly computes the FP16 GEMM.

The results from these two parts can be combined to form the final result.

SmoothQuant

Although per-channel quantization is impractical, the main challenge in LLM quantization arises from activations, where values with larger magnitudes may appear on some channels, as shown below:

SmoothQuant observed that these outliers occur consistently in specific channels, while outliers are rare in weights (thus easier to quantize). Therefore, it proposes to “balance” the quantization difficulty between activations and weights by introducing a per-channel scaling factor:

This “balance” can be formulated as:

$$\begin{aligned} \mathbf{Y} &= \mathbf{X}\mathbf{W}^\top \\\\ &= \mathbf{X} \cdot \text{diag}(s_ 1,\cdots,s_ C) \cdot \text{diag}(s_ 1,\cdots,s_ C)^{-1} \cdot \mathbf{W}^\top \\\\ & = \left( \mathbf{X} \cdot \text{diag}(s_ 1,\cdots,s_ C) \right) \cdot \left( \mathbf{W}\cdot \text{diag}(s_ 1,\cdots,s_ C)^{-1} \right)^\top. \end{aligned}$$

By selecting appropriate scaling factors $\text{diag}(s_ 1,\cdots,s_ C)$, we can achieve the goal of balancing outlier values in activations, and then we can quantize $\mathbf{X} \cdot \text{diag}(s_ 1,\cdots,s_ C)$ and $\mathbf{W}\cdot \text{diag}(s_ 1,\cdots,s_ C)^{-1}$. The following figure gives an example:

SmoothQuant is an excellent alternative to per-channel quantization, as demonstrated in the paper by its impressive performance in quantizing LLM to W8A8.
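The balancing identity can be verified with a short NumPy sketch. The choice of $s_ c$ below (the square root of the ratio between the per-channel weight and activation magnitudes) is just one illustrative rule written in this post's convention, where $\mathbf{X}$ is multiplied by $\text{diag}(s)$; it is an assumption for the example, not necessarily the exact setting used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, D = 4, 6, 3
X = rng.standard_normal((N, C))
X[:, 2] *= 50.0                      # simulate an outlier channel in the activations
W = rng.standard_normal((D, C))

# Illustrative per-channel scale: shrink channels where activations are large.
x_max = np.abs(X).max(axis=0)        # shape (C,)
w_max = np.abs(W).max(axis=0)        # shape (C,)
s = np.sqrt(w_max / x_max)

X_s = X * s[None, :]                 # X · diag(s)
W_s = W / s[None, :]                 # W · diag(s)^{-1}

# The product is unchanged, while the per-channel ranges are now balanced.
product_unchanged = np.allclose(X @ W.T, X_s @ W_s.T)
ranges_balanced = np.allclose(np.abs(X_s).max(axis=0), np.abs(W_s).max(axis=0))
print(product_unchanged, ranges_balanced)  # True True
```

After scaling, the outlier channel of the activations is no longer far larger than the rest, so a single per-token scale wastes much less of the INT8 range.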

ZeroQuant

In the above performance analysis of W8A8, we found that using FP16 for activations I/O reduces the overall arithmetic intensity after quantization, which may harm the throughput improvement in memory-bound scenarios. ZeroQuant addresses this issue by fusing the quantization into the previous operator and fusing the dequantization after GEMM, as shown in the figure below.

Thus, the activations I/O between operators remains INT8, which reduces the total memory I/O to $NC+CD+ND+2(N+D)$, boosting arithmetic intensity to the original FP16 level and fully leveraging the high throughput of INT8.

Conclusion

This blog provides a matrix multiplication perspective for quantization, indicating some fundamental requirements for practical quantization and explaining why per-channel quantization is impractical. It also discusses several examples of LLM per-token quantization, including LLM.int8(), SmoothQuant, and ZeroQuant. They are all practical and demonstrate significant acceleration in real-world scenarios.

NFS Performance Tuning

Fri, 16 Feb 2024 00:00:00 +0000

Introduction

This article is a guide to NFS performance tuning over a 10 Gbps network in production scenarios, which I have distilled from practice. It focuses in particular on optimizing the reading and writing of Lots of Small Files (LOSF).

Tuning

Hardware

On the network hardware side, both bandwidth and latency matter.

To guarantee NFS performance, a high-bandwidth network is necessary. 10 Gbps is the baseline requirement for production scenarios; faster InfiniBand or RoCE networks can be chosen according to your needs and budget.

For the Lots of Small Files (LOSF) scenario, latency is more important than bandwidth. Many tuning tutorials overlook this and focus only on sequential read/write performance; even when they test 4K random read/write, they use the wrong testing method (the correct method is given below).

The importance of latency lies in the fact that if a program’s access to small files is intrinsically serialized, latency determines the upper bound of serialized IOPS. A latency of 0.1 ms caps serialized IOPS at 10k, while a latency of 1 ms corresponds to a cap of 1k.

Intrinsically serialized access scenarios are very common. For example, when the home directory is placed on NFS, the loading of oh-my-zsh and the loading of Python packages are both intrinsically serialized. A 1 ms network latency makes these programs unacceptably slow (e.g., executing import torch takes more than 30s).

Using a decent enterprise-grade switch and a properly configured network topology can minimize latency as much as possible. At the same time, the quality of optical modules and optical-to-electrical port modules can also have a huge impact on latency (the Chinet (中科光电) optical-to-electrical port module I originally used introduced an extra 0.1 ms of latency, causing IOPS to drop by 2/3).
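The latency arithmetic above can be sketched in a few lines. This is a simplified model that assumes each request waits for one full round trip before the next one is issued; the 50 µs base latency in the second part is a hypothetical value, chosen because it reproduces the 2/3 drop mentioned above.

```python
def serialized_iops_cap(latency_us: float) -> float:
    """Upper bound on IOPS when every request waits for the previous one."""
    return 1_000_000 / latency_us

print(serialized_iops_cap(100))   # 0.1 ms latency -> 10000.0
print(serialized_iops_cap(1000))  # 1 ms latency   -> 1000.0

# Effect of a module that adds 0.1 ms on top of an assumed 50 us base latency.
base_us = 50    # hypothetical base round-trip latency
extra_us = 100  # extra latency introduced by the module (0.1 ms)
drop = 1 - serialized_iops_cap(base_us + extra_us) / serialized_iops_cap(base_us)
print(f"IOPS drop: {drop:.2%}")
```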

It should be noted that although RDMA can theoretically reduce latency, in actual testing I found that the difference in serialized IOPS between 10 Gbps Ethernet and 100 Gbps InfiniBand is not large; when the budget is limited, using only Ethernet is sufficient.

TODO: jumbo frames

Linux Kernel

The kernel network parameters need to be adjusted to suit a high-speed network:

# Ref: https://gist.github.com/mizanRahman/40ba603759bfb5153189ccdc9dbbd1e4

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 56623104
net.core.wmem_max = 56623104
net.core.rmem_default = 56623104
net.core.wmem_default = 56623104
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 56623104
net.ipv4.tcp_wmem = 4096 65536 56623104

# TCP Congestion Control
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = cake

This set of settings needs to be applied on both the server and the client; it can be written into /etc/sysctl.conf to make it persistent.
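To confirm that the settings are actually in effect on a machine, the configured values can be compared with what the kernel reports under /proc/sys. The following is a minimal sketch; the helper names are my own, and the simple dot-to-slash mapping assumes keys without dots inside a path component (which holds for the keys listed above).

```python
from pathlib import Path

def parse_sysctl(text):
    """Parse sysctl.conf-style text into {key: value}, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = " ".join(value.split())
    return settings

def proc_path(key, root="/proc/sys"):
    """Map a sysctl key such as net.ipv4.tcp_rmem to its /proc/sys file."""
    return Path(root) / key.replace(".", "/")

def find_mismatches(wanted, root="/proc/sys"):
    """Return {key: (wanted, actual)} for settings that differ or are missing."""
    mismatches = {}
    for key, want in wanted.items():
        path = proc_path(key, root)
        # The kernel separates multi-value entries with tabs; normalize to spaces.
        actual = " ".join(path.read_text().split()) if path.exists() else None
        if actual != want:
            mismatches[key] = (want, actual)
    return mismatches

SAMPLE = """
# TCP Congestion Control
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_rmem = 4096 87380 56623104
"""
wanted = parse_sysctl(SAMPLE)
for key, (want, actual) in find_mismatches(wanted).items():
    print(f"{key}: wanted {want!r}, kernel reports {actual!r}")
```

Running it on both the server and the client prints nothing when every listed value matches.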

Server Side

The number of NFS server threads can be set as large as possible; it can improve performance when the server load is relatively high, and I simply set it to the number of threads on the server. Modify /etc/nfs.conf:

[nfsd]
threads=128

The following NFS server parameters need to be adjusted:

async: treats synchronous I/O operations as asynchronous. For workloads dominated by synchronous reads/writes this can greatly improve performance, but it may cause data loss when the server crashes; it is not recommended when there are extremely high requirements for data integrity;
no_subtree_check: has no major impact on performance, but in some cases it can improve reliability (with a slight security risk at the same time). See [1].

Client Side

Unless there is a special reason not to, you should use the latest NFSv4.2 by default. When NFSv3 uses UDP as the underlying transport, it can cause silent data corruption over high-speed networks, because the 16-bit IP fragment ID can wrap around during fragment reassembly; see [2].

The following NFS client parameters need to be adjusted:

proto=rdma: set when the network supports RDMA;
nocto: disables close-to-open cache consistency semantics. The default NFS behavior is to write all changes back to the server when a file is closed. If you have relatively high requirements for file consistency across multiple clients, this option is not recommended;
ac: enables attribute caching, so the client caches file attributes. Likewise, for clusters with high requirements for data consistency, this option is not recommended;
fsc: uses FS-Cache to cache data locally. You also need to configure cachefilesd. Strangely, in my testing I did not find data being cached locally; this may require further investigation;
nconnect=16: sets up 16 TCP connections between the NFS client and server. By default the NFS client establishes only one TCP connection, and all RPCs are multiplexed over this connection. In some cases this limits the bandwidth of sequential reads/writes. Increasing nconnect (maximum value 16) can solve this problem.

In particular, the noatime / relatime settings have no effect on NFS [3]; the NFS client always caches atime changes.

Some tutorials recommend modifying rsize and wsize. In NFSv4.2 these two values are already negotiated to their maximum value 1048576 by default, so there is no need to change them manually; you only need to check whether they were negotiated correctly.
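One way to check the negotiated values is to read the mount options that the kernel reports in /proc/mounts. The following is a minimal sketch; the sample line is a hypothetical excerpt, not output captured from my machines.

```python
def nfs_io_sizes(mounts_text):
    """Return {mount_point: (rsize, wsize)} for NFS entries in /proc/mounts text."""
    sizes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        sizes[fields[1]] = (int(opts.get("rsize", 0)), int(opts.get("wsize", 0)))
    return sizes

# Hypothetical /proc/mounts excerpt with one NFS mount and one local mount.
SAMPLE = (
    "server:/export /home nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,"
    "namlen=255,hard,proto=tcp,nconnect=16,timeo=600,retrans=2,sec=sys 0 0\n"
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
)
sizes = nfs_io_sizes(SAMPLE)
for mount_point, (rsize, wsize) in sizes.items():
    ok = rsize == 1048576 and wsize == 1048576
    print(mount_point, rsize, wsize, "OK" if ok else "check negotiation")
```

On a real client, pass the contents of /proc/mounts (for example `open("/proc/mounts").read()`) instead of the sample string.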

According to [4], the size of the sunrpc slot table may affect performance, and sunrpc.tcp_slot_table_entries can be increased appropriately (the default is 2). In my testing, I found that under a sustained small-file access workload on the order of tens of millions of files, NFS would sometimes hang. When I increased this parameter, the problem was resolved. Set /etc/modprobe.d/sunrpc.conf:

options sunrpc tcp_slot_table_entries=16384

Sometimes I encounter a problem where nfsd consumes a large amount of CPU and performance drops sharply, while a large number of delegreturn RPC calls are recorded. According to [5], this can be resolved by disabling fs.leases-enable. Set /etc/sysctl.conf:

fs.leases-enable = 0

When nfsd restarts for one reason or another, by default there is a 90s grace period for lock recovery, during which nfsd rejects all open requests, shown in the kernel log as:

[1073511.138061] NFSD: starting 90-second grace period (net f0000000)

In practice I found that this period can be reduced appropriately to lessen the impact of nfsd restarts. Set /etc/default/nfs-kernel-server:

# Options for rpc.svcgssd.
RPCSVCGSSDOPTS="--lease-time 10 --grace-time 10"

Testing

TODO

Conclusion

TODO

References

[1] https://man.archlinux.org/man/exports.5.en#no_subtree_check

[2] https://man.archlinux.org/man/nfs.5.en#Using_NFS_over_UDP_on_high-speed_links

[3] https://man.archlinux.org/man/nfs.5.en#File_timestamp_maintenance

[4] https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-linux-concurrency-session-slots

[5] https://docs.gitlab.com/ee/administration/nfs.html#disable-nfs-server-delegation

[Paper Reading] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs (arXiv'24)

Wed, 07 Feb 2024 00:00:00 +0000

This blog is a write-up of the paper “ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs” from arXiv'24.

Motivation

Some workloads (e.g., Simulation Engines for Deep RL, Dynamic DNNs) cannot fully utilize the massive parallelism of GPUs (see Figure 1). The main reason is that these workloads contain lots of small kernels which cannot fully utilize the GPU, and these kernels are not executed concurrently, although most of them are independent and in theory can be executed concurrently.

However, executing these kernels concurrently poses some challenges:

Input-dependent kernel dependencies. For some workloads, the dependencies between kernels are only determined at runtime for each input. Constructing the full computational graph and resolving dependencies before execution introduces high latency (see Figure 2; an average of 47% of overall execution time, according to the paper).

Irregular kernel dependencies. Some workloads have irregular computational graphs. We can partition the computational graph of the workload into independent streams of kernels, but this requires fine-grained scheduling and synchronization, which carry a large overhead (see Figure 3).

Existing solutions:

CUDA Graph and AMD ATMI. They allow users to specify dependencies between kernels as a DAG, which eliminates the synchronization and kernel launch overhead. But the DAG needs to be constructed in full before execution, which makes them unsuitable for dynamic kernel dependencies (such as Dynamic DNNs).
Using events provided by the CUDA stream management API, which allow synchronization between kernels across streams through the cudaStreamWaitEvent API without blocking the host. But this approach still requires deriving the dependencies between all kernels beforehand.
Persistent threads (PT) can eliminate the scheduling and launch overheads, but are only effective when all kernels are homogeneous.

PT is just like a coroutine in some programming languages.
CUDA dynamic parallelism (CDP) or AMD’s device enqueue (DE) enables parent kernels to launch child kernels, but only allows data dependencies between one parent and its children (so it cannot be used to synchronize between multiple tasks).

Design

The goal of this paper is to design a framework that enables efficient concurrent execution of GPU kernels with:

lightweight detection of inter-kernel dependencies at runtime,
low overhead kernel scheduling and synchronization.

The key idea is to perform dependence checking and scheduling within a small window of kernels at runtime, similar to out-of-order instruction scheduling.

The authors propose Automatic Concurrent Scheduling (ACS) as a solution. The overall design of ACS-SW is shown in Figure 4. It provides three main functionalities:

Determining inter-kernel dependencies. By checking for overlaps between read segments and write segments, we determine dependencies between kernels. For a wide range of commonly used kernels (e.g., matrix multiplication, convolution), we can infer the read and write segments from the input easily. But for some kernels, it’s impossible to determine the range of memory accessed statically because of the potential indirect memory accesses, so the authors just assume the entire GPU memory may be accessed.

The authors use a kernel wrapper to finish the dependency detection. get_addresses() is called to get __read_segments__ and __write_segments__.

struct ACE_wrapper { 
  //list of read,write segments defined as
  //[{start_adr1,size1},{start_adr2,size2}..]
  list __read_segments__;
  list __write_segments__;
  // function which gets called at kernel
  // launch to populate read,write segments
  void get_addresses(
    dim3 blocks, dim3 threads, ...
  );
  // function declaration of the kernel
  static __global__ void kernel(...);
};

Tracking kernel state at runtime. Each kernel in the window is in one of three states:
1. Ready: all kernels it depends on have completed execution.
2. Pending: upstream kernels are still pending or executing.
3. Executing.

Eliminating CPU synchronization overheads. See ACS-HW for more details.
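The first two functionalities can be sketched together as a toy scheduling window. This is my own simplified illustration, not the authors' implementation: segments are (start, size) pairs, and I assume that a dependency exists whenever a read-after-write, write-after-read, or write-after-write overlap is found. A kernel whose accesses cannot be inferred could be modeled with a single segment covering the whole memory.

```python
from dataclasses import dataclass, field

def overlaps(a, b):
    """True if two (start, size) segments share at least one byte."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

def any_overlap(xs, ys):
    return any(overlaps(x, y) for x in xs for y in ys)

@dataclass
class Kernel:
    name: str
    reads: list
    writes: list
    state: str = "pending"  # "pending" | "ready" | "executing"
    deps: set = field(default_factory=set)

def depends_on(later, earlier):
    """Assumed hazard rule: RAW, WAR, or WAW overlap between two kernels."""
    return (any_overlap(later.reads, earlier.writes)       # read-after-write
            or any_overlap(later.writes, earlier.reads)    # write-after-read
            or any_overlap(later.writes, earlier.writes))  # write-after-write

class Window:
    """Scheduling window holding kernels that have not finished yet."""
    def __init__(self):
        self.kernels = {}

    def insert(self, k):
        k.deps = {n for n, p in self.kernels.items() if depends_on(k, p)}
        k.state = "pending" if k.deps else "ready"
        self.kernels[k.name] = k

    def launch(self, name):
        assert self.kernels[name].state == "ready"
        self.kernels[name].state = "executing"

    def complete(self, name):
        del self.kernels[name]
        for k in self.kernels.values():
            k.deps.discard(name)
            if k.state == "pending" and not k.deps:
                k.state = "ready"

    def ready(self):
        return [n for n, k in self.kernels.items() if k.state == "ready"]

w = Window()
w.insert(Kernel("A", reads=[(0, 64)], writes=[(64, 64)]))
w.insert(Kernel("B", reads=[(96, 16)], writes=[(256, 64)]))    # reads A's output
w.insert(Kernel("C", reads=[(512, 64)], writes=[(1024, 64)]))  # independent
first = w.ready()   # A and C can run concurrently
w.launch("A")
w.launch("C")
w.complete("A")
second = w.ready()  # B becomes ready once A has finished
print(first, second)
```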

ACS has two variants:

ACS-SW: software-only implementation which emulates the out-of-order kernel scheduling mechanism.
ACS-HW: hardware-facilitated implementation which is more efficient as it also alleviates synchronization overheads.

ACS-SW

Window Module

This module determines inter-kernel dependencies. It is implemented as a separate thread that manages the input FIFO queue and the scheduling window. The kernel state tracking is implemented in the hardware.

Scheduler Module

This module schedules and launches ready kernels for execution. It has a fixed number of CUDA streams, and each stream contains only one kernel at any given time. Threads with empty streams poll the scheduling window for a ready kernel.

ACS-HW

ACS-SW incurs kernel synchronization and launch overheads because the scheduler module launches kernels from the CPU. ACS-HW solves these problems with a software-hardware co-design.

Software-side: maintains an input FIFO queue like ACS-SW, and a list of the kernels in the GPU’s scheduling window, which may be stale.

Hardware-side: the scheduling window and its management are implemented in hardware on the GPU side.

A key novelty of the hardware design is two-stage dependency detection. First, ACS uses software to perform an initial detection with possibly stale kernel information (avoiding frequent synchronization overhead), and then uses hardware to correct the outdated dependency information. This two-stage approach significantly reduces the hardware complexity.

Evaluation

Baseline: a cuDNN implementation (for DNNs) and a JAX implementation (for deep RL simulation), both using CUDA streams.
ACS-SW: on real hardware.
ACS-SW-Sim: ACS-SW on the GPU simulator.
ACS-HW: on the GPU simulator.
CUDAGraph.

Comments

Strengths

This paper focuses on the problem of low GPU utilization caused by the serial execution of numerous small CUDA kernels. I believe this paper effectively addresses this problem, particularly through the following innovative points that impressed me:

Out-of-order dependency detection and scheduling. Out-of-order (OoO) is a common technique in micro-architecture and software (e.g., hard disk I/O queue) designs. It’s an impressive and innovative idea to introduce OoO into this area to find the dynamic dependencies efficiently.
A good trade-off. When I first read the Introduction section of the paper, I thought that detecting read-write dependencies would be a difficult task. To my knowledge, there are no reliable static binary memory access analysis techniques (otherwise, segmentation faults wouldn’t be such a common problem). However, the authors made a good simplification and trade-off regarding this problem. For most common kernels, the memory access areas can be inferred from the input parameters. For the remaining kernels, it can be assumed that they access the entire memory. Since a few common operators account for most of the execution time, this trade-off leads to significant performance improvements with a relatively low scheduling overhead. This innovation is my favorite aspect of this paper.
Two-stage dependency detection in ACS-HW. While a complete hardware dependency detection approach is theoretically feasible, it could incur significant chip area costs (as we know, the re-order buffer in a microprocessor occupies a large area). The authors proposed a two-stage software-hardware co-designed dependency detection, which significantly simplifies the hardware design. It is a brilliant idea.

Weaknesses

This paper has some potential weaknesses:

For each type of kernel, we must write a custom get_addresses function in the kernel wrapper. This weakness may limit the adoption of ACS.
Deciding whether kernels should be executed concurrently requires considering more factors than just data dependencies. If there are resource conflicts (e.g., memory bandwidth, shared memory size) between two large kernels, performance may degrade if they co-execute.

Improvements

I propose some potential improvements to this paper:

In response to the first weakness mentioned above, I propose a profiling-rollback strategy to achieve safe automatic dependency detection. This strategy leverages the paging technique commonly used in OS virtual memory management: we can mark a memory page as read-only or write-only, so that when a program is running, a triggered page fault tells us that a read or write has occurred. While I’m unsure whether Nvidia GPUs provide APIs for users to control page tables, let’s assume such APIs exist. Given that many workloads are iterative (e.g., neural network training), we can profile the workload for just one iteration, using the paging trick to record the memory segments accessed by each kernel. Since this may introduce some inaccuracies, we need a rollback strategy to ensure correct program execution. During runtime, we set the known __write_segments__ as read-write, while other areas are set as read-only. Upon encountering a page fault, we detect an error and revert to the default strategy (assuming all memory areas will be read and written). With this strategy, we can eliminate the need for a manual get_addresses function and maximize the potential parallelism.
Regarding the second weakness, I suggest adopting the method of GPUPool to determine which kernels are suitable for concurrent execution. A naive solution involves tracking the number of SMs each kernel occupies. When the SMs of a GPU are fully occupied, even if there are kernels in the ready state and available CUDA streams, no new kernels are scheduled.

[Paper Reading] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud (PACT'22)

Wed, 07 Feb 2024 00:00:00 +0000

This blog is a write-up of the paper “GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud” from PACT'22.

Motivation

This paper focuses on GPU sharing in cloud scenarios.

Currently, existing GPU sharing techniques can be categorized into 2 types:

Time-sharing means executing each concurrent VM on a full device in a round-robin fashion. Pros: Simple and mature. Cons: VMs could still under-utilize the hardware within each time slice.
Space-sharing: splits a device into partitions and allows multiple workloads to execute on different partitions simultaneously.

Space-sharing can be categorized into 2 types:

Coarse-grained assigns disjoint sets of streaming multiprocessors (SMs) and memory channels to concurrent workloads. For example, Nvidia MIG. Pros: offers great performance isolation among tenants of the same GPU. Cons: (i) resource under-utilization within each SM consisting of heterogeneous functional units (e.g., FP32, INT, FP64, Tensor Cores) meant for different workload types. (ii) inefficient memory bandwidth usage caused by the bursty nature of GPU memory traffic.
Fine-grained allows different workloads to co-run on the same SMs and request memory bandwidth flexibly, such as CUDA Stream and MPS. Pros: Better hardware utilization.

The key problem of GPU sharing in data centers is performance unpredictability. It involves 2 key challenges:

Mitigating interference. The amount of performance improvement from fine-grained sharing varies drastically depending on how compatible the concurrent workloads are in terms of resource usage. Also, the interference cannot be statically estimated. So, it is non-trivial to determine compatibility among a large number of incoming jobs in the cluster.
Providing QoS guarantees.

Existing solutions:

Software-based: kernel slicing or a persistent thread model. Cons: high scheduling overhead.
Hardware-based: integrate sophisticated resource management logic into hardware to allocate resources for concurrent kernels. Cons: expensive and also inflexible.

Common problems of existing solutions:

They are not concerned with interference mitigation at the cluster level.
They do not handle scenarios where incoming jobs must be distributed among multiple GPUs to satisfy QoS constraints.

Figure 1. Simulated system throughput of co-running `parb_spmv` and `rod_hotspot` at various TBs/SM settings

Problems of hardware TB scheduler which hinder the fine-grained sharing:

It always attempts to launch as many thread blocks per SM (TBs/SM) for each kernel as allowed by the execution context storage constraints (e.g., registers, shared memory, thread slots), which leaves insufficient resources for concurrent kernels. As shown in Figure 1, if we could set the TBs/SM individually for each kernel, we might achieve a higher throughput.
It only dispatches concurrent kernels onto SMs after the earlier-arriving one has finished launching all the thread blocks specified by its kernel grid size. This forces an almost serial execution of kernels in some scenarios.

GPU applications in the cloud fall into two main categories: latency-sensitive, and throughput-oriented. Throughput-oriented workloads are good candidates for hardware space-sharing. They have the following characteristics:

Most workloads involve a large variety of kernels with different hardware resource utilization characteristics (e.g., CNN: compute-intensive, batch-norm: memory-intensive).
Active SMs are underutilized in some resources (FP, tensor core, memory bandwidth).
They typically repeatedly execute the same sequence of kernels (e.g., ML).
Relaxed QoS Requirements.

Design

This paper proposes a hardware-software co-designed strategy to solve these challenges.

Hardware

This paper changes the default behavior of CUDA runtime to make it more suitable for fine-grained sharing:

Allows the CUDA runtime to program the TBs/SM setting as one of the kernel launch parameters. The value of TBs/SM is selected by the performance predictor.
Makes the TB scheduler launch TBs from any concurrent kernel whenever that kernel is running under its TBs/SM quota.

Software

Concept Explanation:

Job: a task submitted by a user, such as a DNN training task. It may be iterative and may contain multiple kernels.

Kernel: CUDA kernel.

Normalized Progress (NP): $t _ {isolate} / t _ {co-execute}$.

Two key observations:

Co-execution performance of GPU kernels is highly correlated with resource utilization of individual kernels measured when running in isolation.
Once we have predicted which job pairs can co-execute without violating QoS requirements, the scheduling task can be reduced to the classic maximum cardinality matching problem in graph theory.

Figure 2. Overall System Design of GPUPool

Based on these 2 observations, the authors proposed GPUPool. Its overall system design is shown in Figure 2. It consists of 4 steps:

Kernel Profiler. GPUPool groups all incoming GPU jobs into a batch for every scheduling window (e.g., 30 seconds). Users should provide the application executable and an execution time budget. GPUPool then automatically profiles one iteration of the job in isolation on hardware to collect the performance counter metrics of each kernel.
Co-execution Performance Predictor. This step decides the compatibility of all possible job pairs within the batch using the profiling results. It contains 2 stages:
1. Kernel-wise Predictors. This stage predicts how well each kernel from one job will co-run with the kernels of the other job. It uses a Gradient Boosting Tree (GBT) model to predict the performance of each kernel when co-running with another kernel (based on the 1st key observation). The model takes the profiling data of the kernels as input and outputs the NP. This prediction is done for each feasible TBs/SM setting.
2. Job-wise Predictor. This stage builds an interference matrix (shown in Figure 3) from the NP predicted by the former stage (under the optimal TBs/SM setting), which indicates how much two kernels slow down when they are co-running. GPUPool then uses this matrix to calculate the co-running time of two jobs. Here, the authors found that a full calculation may require tens of thousands of iterations, but the result converges to a steady state after several iterations. So the authors used an approximation algorithm (shown in Figure 4): it stops the timeline calculation once the accumulated slowdown value of each job has changed by less than a small delta over the past epoch.

Figure 3. Interference Matrix

Figure 4. Concurrent Application Timeline

Job dispatcher. It decides which job pairs should co-run to maximize system performance while satisfying QoS. The decisions are found by solving a maximum cardinality matching problem: each node represents a job, and an edge connects two jobs if they can co-run without violating the QoS requirements. A graph algorithm then computes a maximum cardinality matching, i.e., a largest subset of edges in which no two edges share an end node. Due to the potential unreliability of the performance predictor, GPUPool also adds a safety margin $\delta$ to the edge formulation.

$$E = \left\{ ( {job} _ i, {job} _ j ) \mid {job} _ i,{job} _ j \in V\ \text{and}\ {NP} _ {job _ x} > {QoS} _ {job _ x} \times (1 + \delta ), x \in \{i, j\} \right\}$$

Execution. The batch of jobs is assigned to the modified GPU hardware.
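The job dispatcher step can be sketched as follows. This is a toy illustration under my own assumptions: the job names, QoS targets, and predicted NP values are made up, and the matching is found by exhaustive search, which is only viable for tiny graphs. A real system would use a polynomial-time matching algorithm for general graphs instead.

```python
from itertools import combinations

def build_edges(jobs, predicted_np, delta=0.1):
    """Build the pairing graph.

    jobs: {name: qos_target}
    predicted_np: {(a, b): (np_a, np_b)} with a < b, the predicted normalized
    progress of each job when the pair co-runs.
    An edge exists only if both jobs beat their QoS target by the margin delta.
    """
    edges = []
    for a, b in combinations(sorted(jobs), 2):
        if (a, b) not in predicted_np:
            continue
        np_a, np_b = predicted_np[(a, b)]
        if np_a > jobs[a] * (1 + delta) and np_b > jobs[b] * (1 + delta):
            edges.append((a, b))
    return edges

def max_cardinality_matching(edges):
    """Exhaustive search: either skip the first edge or take it and drop
    every edge that shares an end node with it. Exponential time."""
    if not edges:
        return []
    (a, b), rest = edges[0], edges[1:]
    skip = max_cardinality_matching(rest)
    take = [(a, b)] + max_cardinality_matching(
        [e for e in rest if a not in e and b not in e])
    return take if len(take) > len(skip) else skip

# Hypothetical batch: four jobs, each with a QoS target of 0.5.
jobs = {"j1": 0.5, "j2": 0.5, "j3": 0.5, "j4": 0.5}
predicted_np = {
    ("j1", "j2"): (0.80, 0.70),
    ("j1", "j3"): (0.90, 0.90),
    ("j1", "j4"): (0.40, 0.90),  # j1 would miss its QoS target
    ("j2", "j3"): (0.52, 0.90),  # j2 is within the safety margin, so no edge
    ("j3", "j4"): (0.70, 0.60),
}
edges = build_edges(jobs, predicted_np, delta=0.1)
pairs = max_cardinality_matching(edges)
print(edges)
print(pairs)
```

In this example, greedily pairing j1 with j3 would leave j2 and j4 unpaired, whereas the maximum cardinality matching pairs all four jobs.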

Evaluations

The paper compares GPUPool against three baseline systems:

No-Sharing.
Coarse: packing the jobs onto as few GPUs as possible using a greedy scheduling algorithm.
Heuristic: pairing up jobs with the highest and lowest bandwidth utilization (profiled offline) from a batch of incoming jobs.

The metric is system throughput $STP=\sum_{i=1}^n \cfrac{t_{isolated}^i}{t_{shared}^i}$, where $t_{isolated}^i$ and $t_{shared}^i$ are the turnaround times of the i-th concurrent job when executing in an isolated and a shared environment, respectively. The paper also uses ${QoS}_{reached}$ to evaluate the QoS fulfilment rate.
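As a small worked example of the STP formula, the sketch below uses made-up turnaround times for two jobs sharing one GPU.

```python
def system_throughput(t_isolated, t_shared):
    """STP = sum over jobs of t_isolated[i] / t_shared[i]."""
    if len(t_isolated) != len(t_shared):
        raise ValueError("need one isolated and one shared time per job")
    return sum(iso / sh for iso, sh in zip(t_isolated, t_shared))

# Hypothetical turnaround times in seconds: each job slows down by 25% when shared.
stp = system_throughput([10.0, 20.0], [12.5, 25.0])
print(stp)  # 0.8 + 0.8
```

Each job's term is its normalized progress, so an STP above 1 for a single GPU means that the GPU completes more work when shared than it would by running the jobs one after another.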

Comparison of GPU Sharing Systems

Sorted STP on GPUs

Throughput Normalized to QoS Target

Prediction Accuracy of Different ML Techniques

Comments

Strengths

This paper targets the fine-grained GPU sharing problem in the cloud. I believe this work provides a valuable solution to this problem.

From my perspective, fine-grained GPU sharing presents three key challenges:

Limitations imposed by hardware and CUDA, which make it difficult for programmers to flexibly control kernel execution.
Reliable and low-cost performance prediction for concurrent kernel execution. Establishing an analytical performance prediction model is highly challenging. One naive approach is using real hardware to profile, but due to the $\mathcal{O}(n^2)$ ($n$ representing the number of jobs) time complexity, this method is not scalable to larger clusters.
Efficient algorithms to find appropriate job combinations. If we allow an arbitrary number of jobs to execute concurrently, this becomes an NP-hard problem.

This paper cleverly addresses or bypasses these challenges through the following strategies:

Hardware-software co-design, which involves modifying hardware to provide a more flexible API for upper-layer applications. While this prevents the authors from testing their method on actual hardware and forces them to perform experiments on a simulator (GPGPU-Sim), I believe such simulations can provide valuable insights for adjustments to real hardware.
Predicting kernel concurrent execution performance with an ML model. This is a standout aspect of the paper (and my favorite novelty). The authors introduce ML with good motivation to effectively address a challenging performance modeling problem, bypassing complicated analytical modeling. The ML model also has good interpretability: the top-10 important metrics (shown in the figure) align well with human intuition. Furthermore, in my research experience with deep learning compilers (e.g., TVM), I have also found many papers that introduce such ML models for performance prediction. I believe the idea of leveraging ML techniques to bypass complicated modeling problems is highly valuable in systems research, and it is the most important thing I learned from this paper.
Instead of solving the whole NP-hard job combination problem, the authors limit the number of concurrently executed jobs to 2 and consider this simpler case. It is a fantastic trade-off. The simplified problem can be solved by a maximum cardinality matching algorithm, which may not find the optimal combination but trades a reasonable scheduling overhead for a substantial performance improvement.

Weaknesses

This paper also has some potential weaknesses:

It seems to ignore the situation in which two concurrent jobs have different execution times. For instance, when a longer job and a shorter job are executed together, GPUPool seems unable to schedule a new job onto the GPU after the shorter job finishes. Instead, the remaining GPU time is monopolized by the longer job. This could result in lower resource utilization.
The concurrent execution of multiple jobs on a single GPU may also be constrained by GPU memory capacity. A possible improvement is to ask users to indicate the maximum GPU memory usage of their applications and to consider these constraints when constructing the graphs.
This paper does not consider jobs that leverage multiple GPUs. These jobs are quite common in reality. When a job can occupy multiple GPUs, there are some additional constraints:
1. Inter-GPU connection (e.g., NVLink or InfiniBand) bandwidth is the potential bottleneck, especially for distributed training strategies relying on high GPU interconnect bandwidth, such as Data Parallelism. Improper job scheduling may lead to contention for bandwidth among multiple jobs, or jobs requiring high GPU interconnect bandwidth may run on different nodes.
2. When a single job leverages multiple GPUs, the workload types on different GPUs may not be the same. For example, in Pipeline Parallelism, different GPUs run different stages of the neural network.
This paper does not clearly take into account the impact of the memory hierarchy on performance, such as shared memory (or it only considers it implicitly through the ML model). Some CUDA kernels, such as Flash Attention, are optimized by carefully utilizing the shared memory of the SMs. When two kernels run together, does it lead to shared memory contention? Could it result in runtime errors, or in shared memory overflowing into global memory, causing a severe performance decline? The experiments in the paper cannot answer these questions. Also, the profiling metrics selected to train the stage 1 model, listed in Figure 5, do not contain any metrics about shared memory capacity. Another possibility is that the ML model is already good enough to handle this problem. Regardless, the impact of the memory hierarchy on GPU sharing deserves further study.

Figure 5. Metrics Used to Train Stage 1 Prediction Model

Possible Improvements

I have some potential ideas to improve this work:

In response to the first weakness mentioned above, we can extend GPUPool to let it schedule a new job onto the GPU after the shorter job finishes. This improvement can be achieved with a simple modification: keep the running jobs in the incoming window, and if two jobs are still running on the same GPU, also keep the edge between them in the pairing graph. With this modification, when the shorter job finishes, we can re-run the matching algorithm to find a new job to pair with the remaining one.
We can extend GPUPool to support multi-GPU jobs. To achieve that, we should consider inter-GPU connection bandwidth. This may include the following modifications:
1. Ask users to indicate the required inter-GPU bandwidth or connection types (e.g., NVLink/PCIe/InfiniBand/Ethernet).
2. Treat a multi-GPU task as several sub-jobs. Each sub-job is a single-GPU job with interconnection constraints. Then we can reuse the infrastructure of GPUPool to find co-running opportunities.
3. Extend the last step “Execution” to consider the interconnection constraints, so it can dispatch sub-jobs to nodes that meet the constraints. This may require an efficient graph algorithm to find the job placement, which requires further research.
Sometimes the goal of a data center is not just to improve resource utilization, but also to save energy. Improving resource utilization does not necessarily mean energy saving, because the chip’s speed $S$, power consumption $P$, and frequency $f$ have the following approximate relationship:

$$\begin{aligned} S & \propto f \\ P & \propto f^\alpha, \text{where}\ \alpha \in [2, 3] \end{aligned}$$

We can extend the optimization target of GPUPool to power consumption. This can be achieved by adding a power prediction model built with similar methods. Then we can use a multi-objective optimization algorithm to find the best job combination, considering both performance and power consumption.
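To see why utilization and energy are different targets, the relationship above can be turned into a small calculation. This is a rough sketch under the stated proportionalities only: it assumes a fixed amount of work and ignores static power, so the numbers are illustrative rather than measured.

```python
def relative_energy(freq_scale: float, alpha: float = 3.0) -> float:
    """Energy for a fixed amount of work, relative to running at full frequency.

    With S proportional to f and P proportional to f**alpha, the run time
    scales as 1/f, so the energy E = P * t scales as f**(alpha - 1).
    """
    return freq_scale ** (alpha - 1)

def relative_time(freq_scale: float) -> float:
    """Run time for a fixed amount of work, relative to full frequency."""
    return 1.0 / freq_scale

for alpha in (2.0, 3.0):
    print(alpha, relative_time(0.5), relative_energy(0.5, alpha))
```

Under this model, halving the frequency doubles the run time but cuts the energy to between one half (for $\alpha = 2$) and one quarter (for $\alpha = 3$) of the original.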

Building WireGuard VPN for Machine Learning Server Cluster

Mon, 29 Jan 2024 00:00:00 +0000

Motivation

A machine learning cluster needs a secure way to expose services to users, as well as to interconnect servers across the public network. For this, a VPN network needs to be deployed.

Deploying a VPN network requires considering the following factors:

Network topology: an appropriate topology must be chosen to minimize latency as much as possible;
User management: it should be easy to add or remove users and to authorize them;
Simplicity of use and maintenance.

Design

Network Topology

The network topology determines the latency.

The lowest-latency option is obviously full-mesh, i.e. every pair of peers has a direct P2P connection. However, the management complexity of this topology is $\mathcal{O}(n^2)$, and adding a new peer requires modifying the configuration files of all other peers. It also has to deal with the problems introduced by NAT, which requires some automated management software. I tried Netmaker and Headscale, but neither of them seemed able to correctly handle the complex network environment within the campus, such as the symmetric NAT used by various enterprise-grade routers, and the probability of successfully establishing P2P was very low.

In the end I chose a topology that combines full-mesh and hub-and-spoke. Since the number of servers and their IPs rarely change, manually configuring a full-mesh network among the servers is feasible. At the same time, a gateway server is provided as the hub for user access, and users only need to establish a connection with the gateway server. Since most users actually use the VPN within the campus, connecting to the on-campus gateway server and forwarding traffic through it does not introduce much additional latency. This structure balances latency and management complexity, and adding/removing and authorizing users only needs to be done on the gateway server.
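
To make the management trade-off concrete, here is a minimal Python sketch that counts how many tunnels have to be configured under each topology. The function names and the example sizes (5 servers, 20 users) are made up for illustration and do not describe the real cluster.

```python
def full_mesh_links(peers: int) -> int:
    """Number of P2P tunnels when every pair of peers is connected directly."""
    return peers * (peers - 1) // 2


def hybrid_links(servers: int, users: int) -> int:
    """Tunnels in the hybrid design: full-mesh among the servers, plus one
    tunnel from each user to the gateway (assumed to be one of the servers)."""
    return full_mesh_links(servers) + users


if __name__ == "__main__":
    servers, users = 5, 20  # hypothetical sizes
    print(full_mesh_links(servers + users))  # all peers in one full mesh: 300
    print(hybrid_links(servers, users))      # servers meshed, users via gateway: 30
```

Adding one more user changes only one tunnel in the hybrid design, whereas in a single full mesh it touches every existing peer.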

Protocol Choice

The popular OpenVPN and IPSec are both good enough, but the emerging WireGuard offers unparalleled configuration simplicity. On the server side, WireGuard can define a peer and a route with just a few lines of configuration; on the user side, since WireGuard uses key-pair-based authentication, a single configuration file is enough to join the VPN network, with no need to remember an additional password or perform a login operation.

Management Approach

For the sake of predictability and stability, I chose the manual configuration approach. The full-mesh network among servers does not need to be changed frequently once it is configured. User management, on the other hand, is implemented through a script: when a new user needs to be added, the script generates a key pair and allocates an IP, adds the public key and routing information to the gateway server’s peer list, then generates a configuration file containing the private key and the allocated IP, and sends it to the user.

Example of a user peer configuration on the gateway server:

[Peer]
PublicKey = 
AllowedIPs = 10.1.x.y/32
AllowedIPs = fd01::x:y/128
PersistentKeepalive = 25

Example of a user’s access configuration file:

[Interface]
PrivateKey = 
Address = 10.1.x.y/16
Address = fd01::x:y/64

[Peer]
PublicKey = 
AllowedIPs = 10.1.0.0/16  # route all VPN traffic to gateway server
AllowedIPs = fd01::/64
Endpoint = wg.ustcaigroup.xyz:51820  # gateway server is dual stack
# Endpoint = wg.ustcaigroup.xyz:51820  # IPv4
# Endpoint = wg.ustcaigroup.xyz:51820  # IPv6
PersistentKeepalive = 25
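
The user-management script described above could be sketched roughly as follows. This is not the actual script: the function names are hypothetical, only IPv4 is handled, and the keys are passed in as plain strings (in practice they would come from `wg genkey` and `wg pubkey`).

```python
import ipaddress


def allocate_ip(network: str, used: set[str]) -> str:
    """Return the first host address in `network` that is not already taken."""
    for host in ipaddress.ip_network(network).hosts():
        if str(host) not in used:
            return str(host)
    raise RuntimeError(f"no free address left in {network}")


def gateway_peer(public_key: str, ipv4: str) -> str:
    """[Peer] block to append to the gateway server's configuration."""
    return (
        "[Peer]\n"
        f"PublicKey = {public_key}\n"
        f"AllowedIPs = {ipv4}/32\n"
        "PersistentKeepalive = 25\n"
    )


def user_config(private_key: str, ipv4: str, gateway_public_key: str,
                endpoint: str, vpn_network: str = "10.1.0.0/16") -> str:
    """Configuration file that is sent to the new user."""
    prefix = ipaddress.ip_network(vpn_network).prefixlen
    return (
        "[Interface]\n"
        f"PrivateKey = {private_key}\n"
        f"Address = {ipv4}/{prefix}\n"
        "\n"
        "[Peer]\n"
        f"PublicKey = {gateway_public_key}\n"
        f"AllowedIPs = {vpn_network}\n"  # route all VPN traffic to the gateway
        f"Endpoint = {endpoint}\n"
        "PersistentKeepalive = 25\n"
    )
```

Keeping the two renderers as pure functions makes it easy to review what will be written before touching the gateway's live configuration.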

Building Storage System for Machine Learning Server Cluster

Fri, 24 Nov 2023 00:00:00 +0000

This is an unfinished blog.

Custom PyTorch Operators on Ascend 910B

Tue, 14 Nov 2023 00:00:00 +0000

Environment

The hardware environment this article is based on is the Ascend 910B3, and the software environment includes CANN 7.0-RC1, PyTorch 1.11.0, and Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0. The situation on other CANN and PyTorch versions may differ slightly.

Registration Process

Adding a Custom Operator in the Ascend PyTorch Adapter

References:

https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html

https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation

Add the npu_add_custom function in torch_npu/csrc/aten/npu_native_functions.yaml:

custom:
  - func: npu_add_custom(Tensor x, Tensor y) -> Tensor  # the added function

Add the file AddCustomKernelNpu.cpp in torch_npu/csrc/aten/ops/op_api:

#include 

#include "torch_npu/csrc/framework/utils/OpAdapter.h"
#include "torch_npu/csrc/aten/NPUNativeFunctions.h"
#include "torch_npu/csrc/aten/ops/op_api/op_api_common.h"

namespace at_npu {
  namespace native {
    using torch::autograd::Function;
    using torch::autograd::AutogradContext;

    at::Tensor NPUNativeFunctions::npu_add_custom(const at::Tensor& x, const at::Tensor& y) {
        at::Tensor result = OpPreparation::ApplyTensor(x); // allocate memory for the output

        // calculate the output result of the NPU
        EXEC_NPU_CMD(aclnnAddCustom, x, y, result);
        return result;
    }
  } // namespace native
} // namespace at_npu

Afterwards, recompile and reinstall torch_npu.

Adding the Custom Operator Implementation in CANN

References:

https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0023.html

First, define the operator description file add_custom.json:

[
    {
        "op": "AddCustom",
        "language": "cpp",
        "input_desc": [
            {
                "name": "x",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "fp16"
                ]
            },
            {
                "name": "y",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "fp16"
                ]
            }
        ],
        "output_desc": [
            {
                "name": "z",
                "param_type": "required",
                "format": [
                    "ND"
                ],
                "type": [
                    "fp16"
                ]
            }
        ]
    }
]

Run

msopgen gen -i add_custom.json -c ai_core-Ascend910B3 -f pytorch -out . -lan cpp

to generate the operator project:

AddCustom
├── build.sh
├── cmake 
│   ├── config.cmake
│   ├── func.cmake
│   ├── intf.cmake
│   ├── makeself.cmake
│   └── util
├── CMakeLists.txt
├── CMakePresets.json          // change ASCEND_CANN_PACKAGE_PATH here
├── framework
├── op_host
│   ├── add_custom_tiling.h    // defines length and tiling-related information
│   ├── add_custom.cpp         // host-side implementation of the operator
│   └── CMakeLists.txt
├── op_kernel
│   ├── CMakeLists.txt
│   └── add_custom.cpp         // kernel-side implementation of the operator
└── scripts

In CMakePresets.json, change ASCEND_CANN_PACKAGE_PATH to the CANN installation path.

The content of op_host/add_custom_tiling.h is as follows (a simple implementation):

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(AddCustomTilingData)
    TILING_DATA_FIELD_DEF(uint32_t, size);  // define the tensor size
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(AddCustom, AddCustomTilingData)
}

In op_host/add_custom.cpp, modify the block_dim used when the operator is invoked:

context->SetBlockDim(20); // block_dim of the 910B3

op_kernel/add_custom.cpp is the concrete implementation of the operator:

#include "kernel_operator.h"

#ifdef __DAV_C220_VEC__

extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tiling_data, tiling);
    uint32_t M = tiling_data.size;  // read the tensor size from tiling_data

    // ...
}

#else

// Important: CANN tries different ccec compile options to infer the operator type (VEC, CUBE, MIXED); compilation fails if no stub function is provided
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling) {
    pipe_barrier(PIPE_ALL);
}

#endif
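
The kernel body is elided above. One common way to use the `size` field together with a block_dim of 20 is to give each core a contiguous slice of the tensor. The Python sketch below only illustrates that index arithmetic; the function name is hypothetical and this is an assumption, not the author's kernel.

```python
def block_range(size: int, block_dim: int, block_idx: int) -> tuple[int, int]:
    """Half-open element range [start, end) handled by one block.

    Elements are split into contiguous chunks of ceil(size / block_dim);
    trailing blocks may receive a shorter or empty range.
    """
    per_block = -(-size // block_dim)  # ceiling division
    start = min(block_idx * per_block, size)
    end = min(start + per_block, size)
    return start, end


if __name__ == "__main__":
    # Hypothetical tensor of 100 elements on 20 blocks: 5 elements per block.
    print(block_range(100, 20, 0))   # (0, 5)
    print(block_range(100, 20, 19))  # (95, 100)
```

When `size` is not a multiple of the block count, the last blocks get fewer elements, so a real kernel would need to guard against empty ranges.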

Compilation and Deployment

$ bash build.sh
$ ./custom_opp_euleros_aarch64.run

Calling it in PyTorch:

import torch
import torch_npu

# ...

z = torch.npu_add_custom(x, y)  # compiled at runtime, so the first call has to wait for compilation

Registration Principles

TODO

References

TODO

Building Proxy Service for Team

Thu, 09 Nov 2023 00:00:00 +0000

This is an unfinished blog.

Preface

Due to Internet censorship in China (known as the GFW, Great Firewall, 防火长城), many websites (e.g. Google, Twitter) are blocked, and some (e.g. GitHub) suffer from connectivity issues. In China, circumventing Internet censorship is referred to as 翻墙 (literally “climbing over the wall”).

In China, to freely access the Internet, a proxy is essential. Despite various commercial options available, they may not be suitable for everyone. Therefore, I have constructed a user-friendly and easy-to-maintain proxy system for my research group, as a part of my responsibilities as a system administrator.

Target

Easy to use. Team members only need some simple configuration. The proxy client should be able to update its configuration automatically.
Stability.
Sufficient traffic, to download large datasets.
Low latency, to provide a good web experience.
Low cost.
Easy to maintain. Frequent maintenance is unacceptable, and adding a new function should require only simple configuration changes.
Concealment. The cat-and-mouse game between the GFW and anti-censorship tools has been escalating. Ten years ago (2013), an OpenVPN client was all you needed to go “across the Great Wall and reach every corner in the world”. Now, you must use much more sophisticated solutions to prevent your “unusual” traffic from being detected by the GFW. According to GFW Report, the popular Shadowsocks (a proxy protocol that simply encrypts all traffic with a pre-shared key) was detected and blocked, and TLS-based proxies also encountered large-scale blocking in Oct 2022. The tools and protocols used must be concealed well enough to allow the service to run for a long time.

Available Resources

CERNET

Cloudflare WARP

VPS

Server in USTC

Anti-Censorship Tools

Adopted Solution

Deployment

Problems

Client Initialization

Compatibility

Conclusion

My TOEFL Experience

Sun, 05 Nov 2023 00:00:00 +0000

Preface

As the exam that has caused me the most anxiety since the gaokao, the TOEFL kept me in the dark for most of 2023, and it is also the exam I invested the most time and money into.

At the start I set a goal of 100 total and 20 in speaking. Along the way I went through countless days of lost confidence, of being drowned by anxiety, of practicing speaking until my tongue tied itself in knots — and finally, on November 3, 2023, I checked my scores and was satisfied.

I write this article both as a summary of my own past and in the hope that it can help anyone who happens to read it.

The sittings I took and my scores:

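| Exam date | Total | Reading | Listening | Speaking | Writing | Note |
| --- | --- | --- | --- | --- | --- | --- |
| 2023.7.22 | 89 | 27 | 24 | 16 | 22 | before reform |
| 2023.8.15 | 89 | 28 | 25 | 17 | 19 | post-reform (this sitting and all later ones) |
| 2023.9.16 | 96 | 29 | 27 | 19 | 21 | |
| 2023.10.14 | 96 | 30 | 24 | 19 | 23 | |
| 2023.10.28 | 101 | 28 | 27 | 22 | 24 | |
| MyBest | 103 | 30 | 27 | 22 | 24 | |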

Study materials I used:

Vocabulary: MaiMemo
Listening and speaking practice: TAL Kaomanfen, New Oriental TOEFL, all speaking questions from TPO 1~74 bought on Taobao
Speaking reference: New Oriental TOEFL Speaking White Paper
Writing reference: New Oriental TOEFL Writing White Paper, and all post-reform academic discussion writing questions with sample essays

Reading

For most Chinese students this is the easiest section, and any competent student from a 211 university or above can certainly handle it with ease.

Before the exam I only did two passages to get used to the pacing, and I scored 27 on my first attempt, then stayed stable, and hit a full score on my fourth attempt. Personally I feel TOEFL reading is even easier than the Jiangsu gaokao or CET-6 reading. Although I memorized a lot of vocabulary before my first exam, that was mostly preparation for the GRE; TOEFL reading itself poses basically no vocabulary challenge.

While a high score isn’t hard, a full score still takes a bit of luck. On the time I scored full marks, the two reading topics were “the early ocean and atmosphere of Earth” and “the agricultural revolution and irrigation,” both topics I was very familiar with. In that case the reading was just easy mode.

Listening

The TOEFL’s bizarre exam format makes listening, speaking, and writing all test your listening ability. But the listening across these three parts is actually completely different:

The listening section itself:
- Conversation: relatively hard; everyday conversation has always been my weak spot, with the most linking and elision, and a fairly fast pace;
- Lecture: moderate difficulty; although it looks long, the pace is actually slow and tolerant of errors, and if you miss a sentence you can completely infer it from context;
Integrated speaking: the listening here is actually the hardest, as you need to capture as many details as possible and take sufficient notes; my speaking foundation itself was very poor, which made it even harder;
Integrated writing: the lowest difficulty; at the start you read a passage to get familiar with the topic, and the listening has a rigid structure, clear logic, and a slow pace.

But I have to say, with proper training, the listening section is also very easy to improve and to score high on. I did about 20 days of concentrated, intensive training, plus roughly another 30 days of scattered training (mixed in with other things).

The single most important point about listening is that you must figure out the approach to answering questions that suits you. Many study materials emphasize how to take notes correctly during listening, and at first I trained that way too, but after my first exam I realized this method didn’t suit me — taking notes distracts your attention, and the probability of losing track of the listening content (no longer being able to grasp the logical relationships in the context) increases enormously.

My conclusion is that notes are good for recording details, and the human brain is good for remembering logic.

The pure listening section of the TOEFL actually doesn’t focus on details; instead it tests your overall grasp of the listening material. In my later 20 days of dedicated training I completely abandoned note-taking, and it worked very well. I should note that I later found occasional note-taking still useful when the density of details was high — it helps you avoid losing focus, but what you write down is actually useless; I never once looked at it during the exam. Here, taking notes is really just a way to reinforce the brain’s memory, not a way to store information externally.

The listening training method I used: first pass, do the questions; second pass, re-listen; third pass, listen while reading the transcript; then listen several more times until you can hear every detail clearly. During dedicated training, each listening passage took me roughly 20~40 minutes, and I practiced at least 6 passages a day.

Likewise, topic familiarity greatly affects your performance. On the sitting where I first scored 27, one lecture told the classic story of “winning the Nobel Prize by peeling graphene with tape.” Although I was very familiar with it and breezed through, the content was indeed somewhat specialized, with many physics terms, touching on the layered structure of graphene and the principle of its anisotropic conductivity. Since TOEFL listening lectures are still mainly STEM-oriented, useless knowledge you picked up while slacking off on Zhihu or Bilibili — even some popular science books you read back in secondary school — can help you in unexpected ways; a broad knowledge base lets you achieve more with less effort. But by the same token, unfamiliar topics become very troublesome: on my fourth exam I only scored 24 in listening, precisely because I ran into a literature topic and didn’t understand most of the content.

After the July 2023 reform, listening has a pitfall: since the mid-test break was removed, some people finish faster and start speaking while you are still listening, causing serious interference. Although I did dedicated training before the second exam, I still only got 25 in listening — exactly because I fell into this trap.

The way to avoid this pitfall is to quickly skip all the direction parts and end the reading section two minutes early, so that you can be the first in the room to start speaking, ~~letting others be interfered with by you~~.

~~Better that I wrong the world than that the world wrong me.~~

Speaking

Looking at the scores, you can tell this was the part that tormented me the most — the last two sittings were taken purely for speaking (a speaking score below 20 is very risky when applying).

I did high-intensity dedicated speaking training for about 30 days, and the number of non-dedicated training days is beyond counting.

For someone like me with a very poor speaking foundation, a large amount of training can ensure your score lands around 20; beyond that it still comes down to luck and on-the-spot performance.

TOEFL speaking is less a speaking test than a grand integrated test. For me personally, the reading and listening demands within the speaking section are even higher than in the reading and listening sections themselves:

The reading parts of task 2 and task 3 require speed-reading ability; personally I feel you need at least 4 words/s, and if you don’t finish reading in time you get no chance to go back. The reading section, by contrast, can be read at the same speed I normally read papers, and if a sentence isn’t clear you can read it several more times.
The listening in integrated speaking requires you to write down details, whereas in the listening section much of the time you only need to note the logic. Recording details forces you to rely on notes, and balancing note-taking, receiving information, and grasping the overall logic is the hardest part.

Independent Speaking

Accumulating material is necessary, but quantity is not the point — I only prepared 10 commonly used ones; what matters is being able to use them fluently, so that when you see a question you can quickly react with which material to apply. You can practice this specifically with the Golden 80 Speaking Questions on TAL Kaomanfen.

At the same time, material isn’t a cure-all; independent speaking inevitably carries many random factors and often requires making up a story on the spot. In that case it’s faster to quickly think it through in Chinese and then translate it into English (jot down a few keywords and string them into sentences as you speak).

Integrated Speaking

For me this was the hardest part of the whole exam; getting here basically triggered an adrenaline surge every time.

Handling integrated speaking is the part I spent the most time training on. There is no shortcut; you have to find your own feel and your own experience. Here I’ll share the experience I summarized that worked for me:

While reading: although task 2 and task 3 give you 45s of reading, it’s best to scan it in just 15s, find the key sentences (skip non-key sentences entirely), and then copy down the key sentences (not necessarily word for word, but as complete as possible — the kind you can read straight off without having to compose anything). The benefit is that during prep time I can quickly read through it once, and when I formally speak I’m not only fluent at the start but also save time;
While listening: write down as many details as possible, but you must simultaneously filter out the non-essential, and for the essential parts likewise write down keywords/sentences. At the same time, note-taking absolutely must not interfere with receiving the information itself;
During prep: read out what you’re going to say (don’t say it silently in your head — that gives you the illusion that you already speak it fluently) while circling useful information (or crossing out useless information), use arrows to organize a single thread to follow, and where necessary write filler content between some keywords to reduce the burden of composing on the spot;
When formally speaking: make fluency your top priority, and when you’re out of time or stuck you can drop some details. Stammering and repeating a sentence not only lowers your score but also wastes time.

No matter the situation, you must never become overly nervous. Being overly nervous slows your thinking and greatly increases stumbling while speaking. On the sitting where I scored 22, I was in a fairly relaxed state during the speaking section.

My personal training method for integrated speaking: first do it normally, then immediately re-speak it, then look at the answer, then keep re-speaking until you can do it very fluently. Under this method one passage takes about 15~30 minutes, and I practiced 10 passages a day.

Writing

No feelings, all formula. In fact I hardly invested any time in writing training; an average English foundation plus appropriate techniques is enough to get at least 22.

One thing to note is don’t let your typing speed drag you down. I’m someone who types fairly slowly and makes a lot of typos, and in the first two sittings this did affect me, but once I got more practiced it was no longer a problem.

Integrated Writing

For integrated writing you can read the passage at a calm, comfortable pace — the time given is enough for you to read it twice — and you don’t need to take notes. The listening is also simple: the reading sets the stage so you’re familiar with the topic, and the structure is rigid, the logic clear, and the pace slow, so writing down the important details isn’t hard.

The thing to watch is don’t memorize templates rigidly; wasting exam time typing out a template isn’t worth it — just keep the logic clear and the structure neat. The time should be spent reconstructing as many details as possible; for language use, gaokao-level vocabulary is enough to get 24.

Academic Discussion Writing

The July 2023 reform removed independent writing and replaced it with academic discussion writing, shortening the time to 10 minutes. My writing score of 19 on the second exam was because I went in carelessly without practicing the new question type at all, and the result was that I completely failed to answer as required.

Later I spent half a day specifically training academic discussion writing and basically got the hang of it. In the exam you really only need to read the professor’s question, skip the pile of filler, then glance at the two student sample answers and find their core viewpoints — this is to avoid colliding with the same viewpoint, and you don’t need to read their specific content fully — after which you can start writing.

My personal template is as follows:

From my perspective, .

Although , .

.

<(optional, an expression I personally like) sometimes you can say that my method can actually achieve the goal of the method I disagree with even better>.

So, .

Conclusion

Without accumulating small steps, one cannot reach a thousand li.

For me personally, the TOEFL made me reflect on my study patterns since college. My undergraduate courses were either things I was already familiar with or had a foundation in, or things I crammed for right before the exam. A language exam like the TOEFL has no shortcut (unless you’re a language genius); you have to train little by little starting from Day 1, finding your feel and your experience bit by bit. In this process, beyond the obstacle of the questions themselves, there is even more the obstacle of negative emotions, and finding some people you trust and who are also willing to listen, to share your feelings with, is extremely helpful.

Catching Mining Virus

Wed, 01 Nov 2023 00:00:00 +0000

Problem

On October 30, 2023, I received a warning message from the data center administrator, informing me that the firewall had detected mining traffic coming from a server I manage.

The “mining traffic” was a bitcoin.sipa.be DNS request sent to 223.5.5.5.

Initially, I thought finding the virus process would be a simple task, just like my previous encounter with another mining virus. In that case, the hacker logged in to the server by cracking a weak SSH password, then gained root permission, possibly by exploiting a privilege escalation vulnerability (the server was running EOL Ubuntu 16.04). A cron job was then set up to run the mining virus.

However, this time the situation was different. I couldn’t find any suspicious processes, and there was no unusual GPU usage. Since I didn’t deploy any monitoring programs to record historical processes and sockets, the investigation couldn’t get started.

On October 31, I received the same warning again. Each time mining traffic is detected, the firewall blocks the server’s outbound network access, and losing Internet access causes a lot of trouble.

I suspected that someone had suffered a supply chain attack, for example by downloading a Python package containing a virus, or by cloning code from GitHub and running it without any check.

The immediate task was to identify which user and which process were responsible.

Solution

While I can’t directly determine the user or the process, I can block and log suspicious traffic for further investigation.

This can be done with iptables:

# iptables -N LOGDROP                   # create a new chain
# iptables -A LOGDROP -j LOG --log-uid  # log info
# iptables -A LOGDROP -j DROP           # drop packet

# iptables -I OUTPUT 1 -p udp -m string --string "bitcoin" --algo bm -j LOGDROP     # match string "bitcoin" in udp packet

The --log-uid option can enable UID recording in /var/log/kern.log, for example:

IN= OUT=wg0 SRC=10.1.92.3 DST=10.1.2.13 LEN=42 TOS=0x00 PREC=0x00 TTL=64 ID=23294 DF PROTO=UDP SPT=52328 DPT=2333 LEN=22 UID=2109 GID=2109
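
To pull the UID out of such a line automatically, a small parser is enough. The following Python sketch is only an illustration (the function name is hypothetical); it treats the log line as a sequence of KEY=VALUE pairs and uses the sample line above as input.

```python
import re

# KEY=VALUE pairs; the value may be empty (e.g. "IN=").
FIELD = re.compile(r"(\w+)=(\S*)")


def parse_log_fields(line: str) -> dict[str, str]:
    """Turn the KEY=VALUE pairs of an iptables LOG line into a dict.

    If a key occurs twice (LEN appears for both the IP and the UDP header),
    the last occurrence wins. Flags without "=" such as DF are ignored.
    """
    return dict(FIELD.findall(line))


sample = ("IN= OUT=wg0 SRC=10.1.92.3 DST=10.1.2.13 LEN=42 TOS=0x00 PREC=0x00 "
          "TTL=64 ID=23294 DF PROTO=UDP SPT=52328 DPT=2333 LEN=22 UID=2109 GID=2109")
fields = parse_log_fields(sample)
print(fields["UID"], fields["SRC"], fields["DPT"])  # prints: 2109 10.1.92.3 2333
```

The extracted UID can then be mapped to an account name, for example with `getent passwd 2109`.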

Result

I’m waiting for the next request sent by the virus.

Using an SSH Reverse Tunnel to Log Into BitaHub Containers and Hold GPUs Long-Term

Fri, 20 Oct 2023 00:00:00 +0000

Problem

Every year before CVPR, GPUs are always in short supply, and we need to borrow cards from elsewhere. USTC provides BitaHub for on-campus users, but it suffers from the same shortage of cards before CVPR. At the same time, its job-submission-based usage model is very inconvenient: submitting jobs that occupy multiple cards often requires a long wait in the queue, and its data management approach is downright user-hostile.

As the server administrator for my group, in order to make my life easier before CVPR and to avoid repeating the 2021 pre-CVPR ordeal of scrambling to allocate resources, I needed to improve the BitaHub experience:

How to hold GPUs long-term to avoid repeatedly queuing (slightly unethical, but a measure born of necessity);
How to conveniently read data from our own servers, instead of being forced to use BitaHub’s user-hostile data management model;
How to make the BitaHub GPU experience as close as possible to that of our group’s servers, lowering migration costs and improving the flexibility of resource scheduling.

Idea

Jobs in BitaHub run as docker containers, which gives us the possibility of configuring the environment we want inside the container, as long as we can somehow ssh into it.

After some investigation, I found that as long as the startup command does not stop running, a BitaHub container will keep running indefinitely and will not release its GPU resources. At the same time, BitaHub containers have network access, and the BitaHub web page even thoughtfully provides the ssh private key for the root user inside each job’s container.

These facts give us an opportunity to exploit. All we need to do is run a tunnel program inside the container so that external parties can access port 22 of the container, and then we can log in and hold the resources long-term. Moreover, since the container has network access, we can also directly mount the file systems of other on-campus servers.

Solution

The tunnel program I ended up choosing is ssh, which can create a reverse tunnel:

ssh -i <private_key> -F none -o "StrictHostKeyChecking no" -o "ServerAliveInterval 15" -v -N -R <port>:localhost:22 jump@<jumpserver>

On the jumpserver, configure a user jump and allow login with a specific private key, then get that private key into the container (you could bake it directly into the image, but I chose a more convenient approach: create a BitaHub dataset to store it, and add this dataset to every job).

The container’s startup command is exactly the command above (considering network fluctuations, you can wrap it in a while true loop or use autossh to reconnect automatically). Once started, it creates a reverse tunnel on <port> of <jumpserver>, with <jumpserver>:<port> mapped to port 22 inside the container.

You can set GatewayPorts yes in the sshd_config of <jumpserver> so that the reverse tunnel listens on 0.0.0.0 instead of 127.0.0.1. Otherwise, I would have to create a user on <jumpserver> for every person, or forward each port with iptables, which is far too tedious. Binding to 0.0.0.0 lets us access it directly from the existing VPN network.
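
The automatic reconnection mentioned above (a while true loop around the ssh command) could look roughly like this in Python. It is a sketch under assumptions: the function names, the 5-second delay, and the `max_restarts` parameter are all made up for illustration, and the key path, port, and jump host are supplied by the caller.

```python
import subprocess
import time


def tunnel_argv(key_path: str, remote_port: int, jump_host: str) -> list[str]:
    """Build the reverse-tunnel command line as an argument list."""
    return [
        "ssh", "-i", key_path, "-F", "none",
        "-o", "StrictHostKeyChecking no",
        "-o", "ServerAliveInterval 15",
        "-v", "-N",
        "-R", f"{remote_port}:localhost:22",
        f"jump@{jump_host}",
    ]


def keep_tunnel_alive(argv, run=subprocess.call, sleep=time.sleep,
                      delay=5.0, max_restarts=None):
    """Run the tunnel again whenever it exits.

    `run` and `sleep` are injectable so the loop can be exercised without a
    network. With max_restarts=None the loop never returns, which is what
    keeps the container (and its GPUs) alive.
    """
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        run(argv)      # blocks until ssh exits, e.g. after a network drop
        restarts += 1
        sleep(delay)   # short pause before reconnecting
    return restarts
```

In the container, the startup command would then call `keep_tunnel_alive(tunnel_argv(...))` with the real values and no restart limit.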

There are many options for mounting a file system. Considering both security and convenience, I chose SSHFS. Exposing NFS directly to the public internet is too dangerous, while configuring NFS user authentication is too tedious. At the same time, the kernel that BitaHub uses to run containers neither loads the wireguard kmod nor maps /dev/net/tun, so we cannot use a VPN to protect data security. SSHFS can directly reuse the existing user authentication mechanism, and SSH traffic itself is also more likely to be let through by any potential data-center firewall.

Use the following command to mount SSHFS:

sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=30,ssh_command='ssh -p <port> -i <private_key>' <user>@<server>:/path /path

Postscript

TODO

Enabling QUIC in Nginx While Keeping SNI Routing

Tue, 26 Sep 2023 00:00:00 +0000

Problem

Since version 1.25.0, Nginx’s support for QUIC has been merged into mainline. Users who want to try it out can simply use the official nginx docker image, which is very convenient.

However, the nginx on my server uses SNI routing, driven by the needs of a new generation of TLS-based proxy protocols such as Shadow TLS and Xray Reality. These proxy protocols cannot have their TLS layer handled by nginx on their behalf (unlike earlier protocols that could use gRPC/WebSocket and the like as their data transport). But in order to achieve the best camouflage effect, using the 443/tcp port is necessary (the whitelisted target sites used for camouflage generally only serve HTTPS on the 443/tcp port). Therefore, multiplexing the 443/tcp port is necessary.

To make SNI routing and QUIC coexist, you only need to add listen 443 quic to each server in the original SNI routing configuration. An example configuration is shown below.

Configuration

http {
    
    # ...

    server {
        server_name example.com;

        # 443/tcp is already occupied by nginx stream, so it cannot be listened on again
        # listen 443 ssl http2 reuseport so_keepalive=on;
        # listen [::]:443 ssl http2 reuseport so_keepalive=on;

        # Listen on the 443/udp port and enable QUIC
        # ref: https://nginx.org/en/docs/http/ngx_http_v3_module.html
        listen 443 quic reuseport;
        listen [::]:443 quic reuseport;

        # Listen on a unix domain socket to accept connections forwarded from stream; a local port can also be used
        # Accept proxy_protocol, otherwise the connection source address shown in the log will all be unix:
        listen unix:/dev/shm/nginx-example.sock ssl http2 proxy_protocol;
        set_real_ip_from unix:;  # Only override the source address for connections coming from the unix domain socket
        real_ip_header proxy_protocol;

        add_header Alt-Svc 'h3=":443"; ma=86400';  # used to advertise the availability of HTTP/3

        # ...
    }

    server {
        server_name foo.example.com;

        # Multiple domains can share 443/udp
        listen 443 quic;
        listen [::]:443 quic;

        listen unix:/dev/shm/nginx-example-foo.sock ssl http2 proxy_protocol;
        set_real_ip_from unix:;
        real_ip_header proxy_protocol;

        add_header Alt-Svc 'h3=":443"; ma=86400';  # used to advertise the availability of HTTP/3

        # ...
    }
}

stream {

    # ...

    # Route based on TLS SNI
    map $ssl_preread_server_name $name {
        example.com             unix:/dev/shm/nginx-example.sock;
        foo.example.com         unix:/dev/shm/nginx-example-foo.sock;
        learn.microsoft.com     127.0.0.1:8443;  # used for shadow-tls/xray-reality, etc.
        default                 unix:/dev/shm/nginx-default.sock;
    }

    server {
        # Listen on 443/tcp and route based on SNI
        listen 443 reuseport so_keepalive=on;
        listen [::]:443 reuseport so_keepalive=on;
        proxy_pass $name;
        ssl_preread on;
        proxy_protocol on;
    }

}

Testing

Currently, the mainline of curl/wget does not yet support QUIC. You can use the ymuski/curl-http3 docker image:

$ docker run -it --rm ymuski/curl-http3 curl https://static.monsoon-cs.moe/public/ --http3 -IL

HTTP/3 200
server: nginx/1.25.2
date: Tue, 26 Sep 2023 14:52:29 GMT
content-type: text/html; charset=utf-8
strict-transport-security: max-age=63072000
alt-svc: h3=":443"; ma=86400
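
When an HTTP/3-capable curl is not at hand, the Alt-Svc header from an ordinary HTTPS response at least confirms that the server advertises HTTP/3. Below is a minimal Python sketch for parsing that header value. The function names `parse_alt_svc` and `advertises_http3` are hypothetical, and the parser is simplified: it assumes no commas or semicolons inside quoted strings and ignores the special value "clear".

```python
def parse_alt_svc(value):
    """Parse an Alt-Svc header value into (protocol, authority, params) tuples.

    Simplified sketch: assumes no commas or semicolons inside quoted
    strings and does not handle the special value "clear".
    """
    entries = []
    for alternative in value.split(","):
        parts = [p.strip() for p in alternative.split(";") if p.strip()]
        if not parts or "=" not in parts[0]:
            continue
        protocol, authority = parts[0].split("=", 1)
        # Remaining parts are parameters such as ma=86400 (max age in seconds).
        params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        entries.append((protocol.strip(), authority.strip().strip('"'), params))
    return entries


def advertises_http3(value):
    """Return True if any advertised alternative uses the "h3" protocol id."""
    return any(protocol == "h3" for protocol, _, _ in parse_alt_svc(value))


if __name__ == "__main__":
    # Header value taken from the curl output above.
    print(parse_alt_svc('h3=":443"; ma=86400'))
    print(advertises_http3('h3=":443"; ma=86400'))
```

Note that this only checks the advertisement; it does not prove that a QUIC handshake on 443/udp actually succeeds.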

Optimizing MKL Performance on AMD CPUs

Mon, 19 Jun 2023 00:00:00 +0000

The Problem

My lab has some AMD EPYC 7713 servers. We bought them because some people in the group run programs with very high CPU load (I don’t know what kind of load it is, or why it can’t run on the GPU, and I don’t have the energy to help everyone solve it one by one). AMD processors with their many cores are a great fit for this kind of demand.

But as nice as AMD processors are, using them in a deep-learning lab brings an extra problem: the numpy and PyTorch installed by Anaconda both use MKL as their BLAS implementation by default, and MKL’s library functions are also the hotspots of most high-CPU-load programs. However, MKL checks whether it is running on an Intel CPU, and if not, the optimizations have no effect.

Since this is a deep-learning lab, few people have enough HPC background to compile suitable versions of numpy and PyTorch themselves, and it’s hard for them to break away from Anaconda, so the dependency on MKL is hard to remove. For this reason I needed a solution that is transparent to ordinary users.

The Solution

A widely circulated solution can be found via search engines: set the environment variable MKL_DEBUG_CPU_TYPE=5. This used to work, but it no longer works for MKL 2020 and later versions.

In the end I found a more clever solution here.

MKL calls a function mkl_serv_intel_cpu_true() to check whether it is running on an Intel CPU. As long as we provide a fake mkl_serv_intel_cpu_true() that always returns 1, we can trick MKL into thinking it is running on an Intel CPU.

To do this, we can use Linux’s LD_PRELOAD mechanism. The dynamic library pointed to by LD_PRELOAD has the highest loading priority, so as long as we compile the desired mkl_serv_intel_cpu_true() function into an so file and point LD_PRELOAD at it, we can load this function ahead of everything else.

I have often heard of the LD_PRELOAD mechanism being used for library-function hijacking attacks; here it counts as a clever use.

Implementation

Create mkl_trick.c:

int mkl_serv_intel_cpu_true() {
    return 1;
}

Compile it with gcc -shared -fPIC -o libmkl_trick.so mkl_trick.c, and copy the generated libmkl_trick.so to /usr/local/lib.

Add the following to the shell’s global initialization file:

export MKL_DEBUG_CPU_TYPE=5  # compatibility with older MKL versions
export MKL_ENABLE_INSTRUCTIONS=AVX2  # optional, tells MKL it can use AVX2
export LD_PRELOAD=/usr/local/lib/libmkl_trick.so

Some of my labmates use Bash and some use ZSH, so both need to be modified:

Bash: create the file /etc/profile.d/mkl.sh and add the above content
ZSH: add it to /etc/zsh/zshenv
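
If changing the global shell files is not an option, the same three variables can be set for a single process instead. The following is a minimal Python sketch of that idea; the helper name `mkl_env` is hypothetical, and it assumes libmkl_trick.so was copied to /usr/local/lib as described above.

```python
import os


def mkl_env(lib_path="/usr/local/lib/libmkl_trick.so", base=None):
    """Return a copy of the environment with the MKL workaround applied.

    `base` defaults to the current process environment. An existing
    LD_PRELOAD value is kept, with the trick library placed first.
    """
    env = dict(os.environ if base is None else base)
    env["MKL_DEBUG_CPU_TYPE"] = "5"          # compatibility with older MKL versions
    env["MKL_ENABLE_INSTRUCTIONS"] = "AVX2"  # optional, tells MKL it can use AVX2
    existing = env.get("LD_PRELOAD", "")
    if lib_path not in existing.split(":"):
        env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


if __name__ == "__main__":
    # Example use (the script name is a placeholder):
    #   subprocess.run(["python", "train.py"], env=mkl_env())
    print(mkl_env(base={})["LD_PRELOAD"])
```

Because LD_PRELOAD is read by the dynamic loader at process start, it has to be set before the Python process that imports numpy or PyTorch is launched; setting it inside an already running interpreter has no effect on that interpreter.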

References

https://documentation.sigma2.no/jobs/mkl.html

VCB-Studio Technical Director Entry Test 2023 and My Answer

Thu, 25 May 2023 00:00:00 +0000

See original publication page for more details.

All my answer files can be browsed here, or you can download the zipped file (5.9G).

Requirements

This is a test for candidates who wish to participate in the training class organized by VCB-Studio. Finish as many problems as you can, and then do the following things:

Pack your answers, result files, and necessary attachments into a zip/rar/7z file. Source files we provided and intermediate file in your encoding should not be packed in.

Register a Baidu Net Disk account (https://pan.baidu.com), upload the zipped file and create a sharing link. Whether you like it or not, Baidu Net Disk has been the most effective way to share files within our team since day one. Other sharing methods will NOT be considered.

Send the link via email to vcbs.training@gmail.com before Beijing Time (UTC+8) Monday, 23 Jan 2023, 23:59:59. Late submissions will NOT be considered.

Prepare a QQ account. The follow-up training courses will be conducted in the QQ group.

You should independently complete the answers without any public discussion. Any form of plagiarism will NOT be tolerated.

This test has 5 questions. For question 2 and 3, you can choose ONE of them. Choosing both then we will pick one with higher points. The answers should be made in English.

Question1 (15pt)

Please describe yourself as who you are, where do you study, how do you come to know VCB-Studio and why are you interested in this project, etc. Please do not write more than 500 words, or approximately 1 page. (15pt)

Answers are hidden for privacy reasons.

Question2 (30pt)

Scanned pictures (or simply scans) are an important part of BDRips, which are often released as lossless PNG, TIFF format or lossy JPG format. Scans feature high resolution and large size. In the file Q2.7z, two sets of pictures have been provided for you. PNGs are the source scans, and WEBPs are transcoded from PNGs according to VCB-Studio Collation specifications. Your tasks are:

Summarize the format conversion rules of scans in VCB-Studio Collation specifications. (6pt)

Convert the sources to AVIF and JPEG-XL format, with sizes comparable to the WEBPs. (12pt)

Comment on the quality, encoding speed, and compatibility of AVIF and JPEG-XL, and why/why not you may recommend us switching to the new format as the upgrade for WEBP in 2023. (12pt)

You are free to utilize existing tools, but you need to describe clearly where you find the tool and how to use it.

(1) Format conversion rules of scans in VCB-Studio Collation specifications

Choose the format that gives better image quality at the same file size, as long as compatibility is ensured.

(2) Converting test

See Q2/convert.py for my conversion code. The libraries used are Pillow, pillow_avif_plugin and jxlpy. Pillow is the image processing library I use most often; it supports WEBP but not AVIF or JPEG-XL, so I searched Google and found two Pillow plugins that add AVIF and JPEG-XL support.

PNG and WEBP Ref are the given images; WEBP Cus, AVIF and JPEG-XL are my own encodes.

WEBP Cus is encoded by Pillow, which is backed by libwebp. Encoding speed is set to the slowest level (6), and quality is set to 90 to match the size of the reference WEBP images.

AVIF is encoded by pillow-avif-plugin, which is backed by libavif. Encoding speed is set to the slowest level (0), and quality is set to 84 to get a size comparable to the reference WEBP images.

JPEG-XL is encoded by jxlpy, which is backed by libjxl. Encoding speed is set to the slowest level (9), decoding speed is also set to the slowest level (0), and quality is set to 92 to get a size comparable to the reference WEBP images.

The following table shows the result:

Image	PNG (size)	WEBP Ref (size)	WEBP Cus (size/time)	AVIF (size/time)	JPEG-XL (size/time)
01	26.97 MB	2.95 MB	2.95 MB / 3.36 s	2.77 MB / 37.77 s	2.56 MB / 32.00 s
02	26.25 MB	2.93 MB	2.94 MB / 3.27 s	2.71 MB / 34.87 s	2.48 MB / 33.07 s
03	3.60 MB	0.26 MB	0.26 MB / 0.37 s	0.28 MB / 11.48 s	0.28 MB / 5.12 s
04	21.78 MB	1.03 MB	1.03 MB / 2.06 s	1.32 MB / 29.56 s	1.39 MB / 32.25 s
05	2.65 MB	0.13 MB	0.13 MB / 0.24 s	0.15 MB / 9.29 s	0.18 MB / 4.11 s
06	2.66 MB	0.13 MB	0.13 MB / 0.25 s	0.15 MB / 9.39 s	0.16 MB / 3.81 s
07	24.38 MB	1.71 MB	1.71 MB / 2.25 s	1.67 MB / 27.78 s	1.68 MB / 35.59 s
08	55.52 MB	7.58 MB	7.58 MB / 26.48 s	7.93 MB / 83.44 s	6.36 MB / 72.90 s
09	44.39 MB	2.00 MB	2.00 MB / 3.53 s	1.99 MB / 59.79 s	2.47 MB / 71.73 s
10	41.59 MB	1.21 MB	1.21 MB / 3.11 s	1.16 MB / 59.99 s	1.70 MB / 63.65 s

PS: pillow-avif-plugin uses 8 threads to encode images (on an i7-11700), and I didn’t find an option to turn this off. The other encoders use only 1 thread. The jxlpy example suggests that multithreading can be configured, but the setting had no effect in my test.
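
To check how "comparable" the sizes are overall, the per-image numbers from the table above can simply be summed. This is a small Python sketch that uses only the values printed in the table (so the totals inherit its two-decimal rounding); the names `TABLE` and `totals` are made up for this illustration.

```python
# (size in MB, encoding time in s) per image, copied from the table above.
TABLE = {
    "WEBP Cus": [(2.95, 3.36), (2.94, 3.27), (0.26, 0.37), (1.03, 2.06),
                 (0.13, 0.24), (0.13, 0.25), (1.71, 2.25), (7.58, 26.48),
                 (2.00, 3.53), (1.21, 3.11)],
    "AVIF": [(2.77, 37.77), (2.71, 34.87), (0.28, 11.48), (1.32, 29.56),
             (0.15, 9.29), (0.15, 9.39), (1.67, 27.78), (7.93, 83.44),
             (1.99, 59.79), (1.16, 59.99)],
    "JPEG-XL": [(2.56, 32.00), (2.48, 33.07), (0.28, 5.12), (1.39, 32.25),
                (0.18, 4.11), (0.16, 3.81), (1.68, 35.59), (6.36, 72.90),
                (2.47, 71.73), (1.70, 63.65)],
}


def totals(table):
    """Return {format: (total size in MB, total encoding time in s)}."""
    return {
        name: (sum(size for size, _ in rows), sum(time for _, time in rows))
        for name, rows in table.items()
    }


if __name__ == "__main__":
    for name, (size_mb, seconds) in totals(TABLE).items():
        print(f"{name}: {size_mb:.2f} MB, {seconds:.2f} s")
```

Summed this way, the three sets land within roughly 1 MB of each other (about 19.9 MB, 20.1 MB and 19.3 MB), while the AVIF and JPEG-XL encoding times are several times longer than WEBP's.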

(3) Comparison and comment

Quality comparison:

`PNG`	`WEBP Ref`	`AVIF`	`JPEG-XL`

Above is a crop from image 03 in each encoding. The WEBP image has severe smearing in dark areas, and an obvious color shift in the red dots at the upper left and lower right. The AVIF image shows less smearing, but the same color shift as WEBP. The JPEG-XL image is the closest to the reference PNG image.

Detailed compatibility:

Format	Windows	macOS	Android	iOS	Chrome	Firefox	Safari
`WEBP`	≥10	≥11	≥4	≥14	✅	✅	✅
`AVIF`	≥10-1903	≥13	≥12	≥16	✅	✅	✅
`JPEG-XL`	❌	❌	❌	❌	❌	❌	❌

PS: The results for Windows, macOS, Android and iOS were found via Google. Browser compatibility information can be found at https://caniuse.com.

Summary:

Format	Quality	Encoding Speed	Compatibility
`WEBP`	worst	fast	good
`AVIF`	medium	slow	medium
`JPEG-XL`	best	slow	bad

Due to its poor compatibility, JPEG-XL should not be considered an appropriate option. AVIF offers better image quality than WEBP, but it is well supported only on newer platforms and needs time for adoption, especially on the fragmented Android and Windows ecosystems. Although WEBP has a huge advantage in encoding speed, I don’t think encoding speed needs to be considered: even for large images the encoding time is only about 1 minute, and the number of images is not large. Compared with video encoding, this time overhead is negligible.

In summary, I think now is not a suitable time to switch to AVIF or JPEG-XL. But in two years, it will be time for AVIF to show its strength.

Question3 (30pt)

Recently 32-bit audio tracks have appeared in some of the latest Hi-Res music. Although now we would not see these annoying 32-bit tracks in the Blu-ray, we have to start working on them in advance. In the file Q3.7z, two 32-bit PCM files are provided for you. Your tasks are:

Learn about 32-bit tracks and tell the difference between these two files. (6pt)

Try to convert them to FLAC, ALAC, and WavPack losslessly. (15pt)

Consider various aspects such as compression rate, encoding speed, and playback compatibility and select the format you recommend most for 32-bit audio. (9pt)

You are free to utilize existing tools, but you need to describe clearly where you find the tool and how to use it.

(1)

Using ffprobe to get audio encoding info:

Input #0, wav, from '01.wav':
  Duration: 00:03:52.48, bitrate: 6144 kb/s
  Stream #0:0: Audio: pcm_s32le ([1][0][0][0] / 0x0001), 96000 Hz, 2 channels, s32, 6144 kb/s

Input #0, wav, from '02.wav':
  Duration: 00:07:03.00, bitrate: 6144 kb/s
  Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 96000 Hz, 2 channels, flt, 6144 kb/s

The difference is the sample format: 01.wav is encoded as pcm_s32le, and 02.wav is encoded as pcm_f32le.

pcm_s32le means PCM with 32-bit signed integer samples in little-endian byte order, while pcm_f32le means PCM with 32-bit floating-point samples in little-endian byte order.
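
To make the difference concrete, here is a small Python sketch that reads the same four bytes once as a signed integer sample and once as a float sample, and recomputes the bitrate reported by ffprobe. The helper names `decode_sample` and `pcm_bitrate_kbps` are hypothetical; only the standard `struct` module is used.

```python
import struct

# struct format codes: "<i" = little-endian int32, "<f" = little-endian float32.
SAMPLE_FORMATS = {"pcm_s32le": "<i", "pcm_f32le": "<f"}


def decode_sample(raw, codec):
    """Interpret 4 raw bytes as one sample of the given PCM codec."""
    if len(raw) != 4:
        raise ValueError("a 32-bit sample is exactly 4 bytes")
    return struct.unpack(SAMPLE_FORMATS[codec], raw)[0]


def pcm_bitrate_kbps(sample_rate, channels, bits_per_sample):
    """Uncompressed PCM bitrate in kb/s (1 kb = 1000 bits)."""
    return sample_rate * channels * bits_per_sample / 1000


if __name__ == "__main__":
    raw = bytes.fromhex("00000040")  # the same 4 bytes for both interpretations
    print(decode_sample(raw, "pcm_s32le"))  # integer interpretation
    print(decode_sample(raw, "pcm_f32le"))  # floating-point interpretation
    print(pcm_bitrate_kbps(96000, 2, 32))   # 96 kHz, stereo, 32-bit
```

The byte pattern 00 00 00 40 is 1073741824 as an integer but 2.0 as a float, which is why a decoder must know which of the two formats a file uses. Both files have the same 6144 kb/s bitrate because the sample width, rate and channel count are identical.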

(2)

I first tried to convert them losslessly using FFmpeg. If FFmpeg failed, I used Google to find a suitable codec.

This is the result of my attempt:

Format	32-bit integer	32-bit float
`FLAC`	FFmpeg ❌ `flac` (from v1.4.0) ✅	FFmpeg ❌
`ALAC`	FFmpeg (decoding only) `qaac` (backed by Apple `CoreAudioToolbox`) ✅	FFmpeg ❌
`WavPack`	FFmpeg ✅	FFmpeg ✅

The conversion command:

Format	32-bit integer	32-bit float
`FLAC`	`flac -o 01.flac 01.wav`	❌
`ALAC`	`qaac64 -b 32 --alac -i 01.wav -o 01.m4a`	❌
`WavPack`	`ffmpeg -i 01.wav 01.wv`	`ffmpeg -i 02.wav 02.wv`

The resulting files are Q3/01.flac, Q3/01.m4a, Q3/01.wv and Q3/02.wv.

(3)

Encoding speed and compression rate of different encoding methods:

Format	`WAV` file size / encoded file size	audio time / encoding time
`FLAC s32`	1.337	128.44
`ALAC s32`	1.304	69.81
`WavPack s32`	1.280	121.08
`WavPack f32`	1.489	109.02
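
The two columns are ratios, so they can be turned into more tangible numbers. The sketch below converts a compression ratio into the fraction of space saved and an encoding-speed factor into wall-clock seconds. The durations come from the ffprobe output above (03:52.48 and 07:03.00); the function names are made up for this example.

```python
# (WAV size / encoded size, audio time / encoding time) from the table above.
RESULTS = {
    "FLAC s32": (1.337, 128.44),
    "ALAC s32": (1.304, 69.81),
    "WavPack s32": (1.280, 121.08),
    "WavPack f32": (1.489, 109.02),
}

# Track durations in seconds, taken from the ffprobe output.
DURATIONS = {"s32": 232.48, "f32": 423.00}


def space_saving(ratio):
    """Fraction of the original size saved, given original/encoded ratio."""
    return 1 - 1 / ratio


def encoding_seconds(duration_s, speed_factor):
    """Encoding time implied by a speed factor of audio time / encoding time."""
    return duration_s / speed_factor


if __name__ == "__main__":
    for name, (ratio, speed) in RESULTS.items():
        duration = DURATIONS[name.split()[-1]]
        print(f"{name}: saves {space_saving(ratio):.1%}, "
              f"about {encoding_seconds(duration, speed):.2f} s to encode")
```

For example, a ratio of 1.337 corresponds to a saving of roughly 25%, and every encoder finishes each track in a few seconds.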

Summary:

	`FLAC s32`	`FLAC f32`	`ALAC s32`	`ALAC f32`	`WavPack s32`	`WavPack f32`
Compression rate	best	❌	medium	❌	worst	-
Encoding speed	very fast	❌	fast	❌	very fast	very fast
Playback compatibility	bad (`flac` only)	❌	good (FFmpeg)	❌	good (FFmpeg)	good (FFmpeg)

Because FFmpeg is the de facto standard multimedia codec library used by most video players, FLAC is not suitable for 32-bit audio: such files can only be decoded by flac itself. WavPack encodes faster than ALAC, but all three formats are fast in absolute terms (compared to video encoding), so this advantage is of little value. Finally, ALAC compresses better than WavPack, which saves file size.

To sum up, I recommend ALAC for encoding 32-bit audio. But if floating-point encoding is required (which is rare), WavPack is the only choice.

Question4 (35pt)

MSU publishes video encoder tests every year, with the latest one here: https://compression.ru/video/codec_comparison/2021/main_report.html.

For the first time last year, H.266 (VVC) encoders participated in the tests and they performed well in terms of encoding quality in the slow encoding (1 fps) test.

Choose any of the H.266 (VVC) or AV1 encoders in the figure below, and then encode the source file Q4 [E46686C4].m2ts with no more than 2500 Kbps of video bitrate. You’d better use 10bit variants of these encoders, which facilitates the comparison later. In addition, you need to describe clearly where you found the encoder and state the version and parameters you used. If you use H.266 (VVC) encoder, you will get additional 5pt. (10pt+5pt)

We provide an AV1 video file Q4_AV1 [41A7EDDA].mkv, which was encoded via SVT-AV1 10bit encoder without any pre-processing. Comment on the picture quality compared to the source file. When you compare the picture quality, you may want to sample a few frames, attach some screenshots, and comment on the performance of dark scenes and moving scenes. (10pt)

Now compare your own encoding to the given AV1 file in terms of picture quality, encoding speed, and playback compatibility. As a reference, we encoded the above AV1 file at 1.0 fps. (10pt)

(1) VVC encoding

The testing hardware and software environment is:

Encoder: VVenC v1.7.0.
Compiler: AMD Optimizing C/C++ Compiler 4.0.0.
CPU: 2 x AMD EPYC 7713, 128 cores / 256 threads in total.
RAM: 16 channel DDR4-3200.
OS: Ubuntu 18.04.6.

First, use ffmpeg to convert Q4 [E46686C4].m2ts to raw yuv420p10 video:

ffmpeg -i "Q4 [E46686C4].m2ts" -pix_fmt yuv420p10 Q4_yuv420p10.yuv

The parameter -pix_fmt yuv420p10 tells ffmpeg to output the raw video in yuv420p10 format.

Then, use vvencapp to encode the raw video:

vvencapp --input Q4_yuv420p10.yuv --size 1920x1080 --format yuv420_10 --fps 24000/1001 --preset  --bitrate 2500kbps --output Q4_VVC.vvc

Meaning of the parameters:

--size 1920x1080: indicates that the input raw video frame size is 1920x1080.
--format yuv420_10: same meaning as yuv420p10 in ffmpeg.
--fps 24000/1001: indicates that the video frame rate is 23.976 fps (same as the original m2ts file).
--preset : selects a preset combination of VVC encoding parameters. Available options are faster, fast, medium, slow and slower. Detailed settings are listed in https://github.com/fraunhoferhhi/vvenc/blob/master/cfg/randomaccess_*.cfg.
--bitrate 2500kbps: sets the target bitrate of the encoded video to about 2500 kbps.

File	Preset	FPS
`Q4_VVC_faster.vvc`	`faster`	5.762
`Q4_VVC_fast.vvc`	`fast`	2.156
`Q4_VVC_medium.vvc`	`medium`	0.557
`Q4_VVC_slow.vvc`	`slow`	0.177
`Q4_VVC_slower.vvc`	`slower`	0.058
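
Running all five presets by hand is tedious, so the command can be generated per preset. The following Python sketch only builds the argument lists, using exactly the flags and file names shown above; the helper name `vvenc_command` is hypothetical, and actually running the commands assumes vvencapp is on the PATH.

```python
PRESETS = ["faster", "fast", "medium", "slow", "slower"]


def vvenc_command(preset, source="Q4_yuv420p10.yuv"):
    """Build the vvencapp argument list for one preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    return [
        "vvencapp",
        "--input", source,
        "--size", "1920x1080",
        "--format", "yuv420_10",
        "--fps", "24000/1001",
        "--preset", preset,
        "--bitrate", "2500kbps",
        "--output", f"Q4_VVC_{preset}.vvc",
    ]


if __name__ == "__main__":
    for preset in PRESETS:
        # To actually encode, pass the list to subprocess.run(..., check=True).
        print(" ".join(vvenc_command(preset)))
```

Passing the arguments as a list avoids shell quoting problems and keeps the output file names consistent with the table.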

(2) Comparing source video and reference `AV1` encoded video

The video player used is MPV with libvvdec & xHE-AAC support, configured according to https://vcb-s.com/archives/7594.

Dynamic fire against a dark background is a highly challenging scene. Compared to the original video, there are color blocks around the flame in the AV1 video, which is a common problem when the bitrate is insufficient.

Encoding Method	Capture	File
Original		`pics/m2ts-flame.png`
AV1		`pics/av1-flame.png`

(3) Comparing custom `VVC` encoded video and reference `AV1` encoded video

I used the same player as in (2). To be comparable to the AV1 video, I chose the VVC video encoded with the medium preset, which has an encoding speed of 0.557 fps.

The VVC video is much better than the AV1 video in the flame scene. The color blocks are less obvious, and the picture is closer to the original video.

Encoding Method	Capture	File
Original		`pics/m2ts-flame.png`
AV1		`pics/av1-flame.png`
VVC (medium)		`pics/vvc-flame.png`

Question5 (20pt)

When we check an encoded file, we need to locate frames that have been encoded exceptionally awful. We use algorithms like PSNR to evaluate the similarity of each frame in the encoded file to the source file. The result is an array of scores, where the i-th score is tied to the i-th frame. These scores are called raw scores. However, what we are concerned about is the standard score, which is the raw score minus a threshold. A frame with a standard score less than 0 is considered a bad frame. The tasks are:
Find the worst frame, i.e. the one with the lowest standard score among the bad frames, and output its index. If there is more than one worst frame, output the first. If there are no bad frames, output -1. Frames with a standard score of exactly 0 are not considered as bad frames. (10pt)

Input: 2 lines. The first line is two integers that represent the number of frames N and the threshold value S. The second row is an array of integers A[N], representing the raw score of each frame.

For all the data, 1<=N<=200000, 0, 0<=A[i]<=100
Output: An integer, the index of the worst frame. The index starts from 0. If there is more than one worst frame, output the first. If there are no bad frames, output -1.

Sample:
Input
10 30
42 31 44 23 21 26 31 41 50 72
Output
10

Find a continuous sequence of frames that minimizes the sum of their standard scores and output this minimum value. Full scores will only be given if the time complexity of your algorithm is optimal. (10pt)

Input: The same as (1).

Output: An integer, the minimum sum value.

Sample:
Input
10 30
42 31 44 23 21 26 31 41 50 72
Output
-20

For each sub question, use C/C++/Java/Python/C# to write a console program. Read the input from the standard input and write it to standard output. Do NOT use libraries other than built-in ones (for example, no “import numpy as np”). Submit your source code.


(1) Find the worst frame
The following code is consistent with Q5/q5-1.c:


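#include <stdio.h>

int main(void) {
    int frame_num;
    int threshold;
    scanf("%d%d", &frame_num, &threshold);
    int worst_idx = -1;   // stays -1 when there is no bad frame
    int worst_rate = 101; // larger than any valid raw score (0..100)
    for (int i = 0; i < frame_num; i++) {
        int rate;
        scanf("%d", &rate);
        // a frame is bad only if its raw score is strictly below the threshold;
        // the strict comparison with worst_rate keeps the first of equal worst frames
        if (rate < threshold && rate < worst_rate) {
            worst_rate = rate;
            worst_idx = i;
        }
    }
    printf("%d", worst_idx);
    return 0;
}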
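
For quickly checking the edge cases named in the task (ties, no bad frames, a standard score of exactly 0), the same single-pass logic can be written as a Python function. This is only a sketch for experimenting, since the submitted answer is the C program above; the function name `worst_frame` and the example inputs are made up.

```python
def worst_frame(scores, threshold):
    """Return the index of the first frame with the lowest raw score
    strictly below the threshold, or -1 if no frame is bad."""
    worst_idx = -1
    worst_score = None
    for i, score in enumerate(scores):
        # Strict comparisons: a score equal to the threshold is not bad,
        # and a later frame with an equal score does not replace the first.
        if score < threshold and (worst_score is None or score < worst_score):
            worst_score = score
            worst_idx = i
    return worst_idx


if __name__ == "__main__":
    print(worst_frame([25, 20, 20], 30))  # tie between two worst frames
    print(worst_frame([30, 31], 30))      # standard score 0 is not bad
```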


(2) Find minimum subsequence sum
PS: The problem statement is ambiguous about whether a sequence of length 0 satisfies the requirement. This matters when all standard scores are positive: the output should then be either 0 (a subsequence of length 0 is selected) or the smallest score (the sequence length is at least 1). The code I submitted follows the second interpretation (sequence length is at least 1). If the first interpretation (length 0 is allowed) is correct, please comment out int min_sum = 101; and uncomment int min_sum = 0;.
The following code corresponds to Q5/q5-2.c:


#include <stdio.h>

int main(void) {
    int frame_num;
    int threshold;
    scanf("%d%d", &frame_num, &threshold);
    int min_sum = 101; // when all scores > 0, output the minimum
    // int min_sum = 0; // when all scores > 0, output 0
    int sum = 0; // running sum of the current candidate sequence
    for (int i = 0; i < frame_num; i++) {
        int rate;
        scanf("%d", &rate);
        rate -= threshold; // convert the raw score to a standard score
        sum += rate;
        if (sum < min_sum) {
            min_sum = sum;
        }
        // a positive running sum can never help a later minimum, so restart
        if (sum > 0) {
            sum = 0;
        }
    }
    printf("%d", min_sum);
    return 0;
}
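
The linear scan is the minimum-sum variant of Kadane's algorithm, and it is easy to get the reset condition subtly wrong. A Python sketch of the same idea, together with a brute-force check, makes it simple to test small inputs. It follows the "length at least 1" interpretation; the function names are made up for this example.

```python
def min_subarray_sum(scores, threshold):
    """Minimum sum of standard scores over all non-empty contiguous runs, O(N)."""
    best = None
    running = 0
    for score in scores:
        running += score - threshold  # standard score of this frame
        if best is None or running < best:
            best = running
        if running > 0:
            # A positive prefix can only make a later sum larger, so drop it.
            running = 0
    return best


def min_subarray_sum_bruteforce(scores, threshold):
    """Reference implementation that tries every non-empty run."""
    std = [score - threshold for score in scores]
    return min(
        sum(std[i:j])
        for i in range(len(std))
        for j in range(i + 1, len(std) + 1)
    )


if __name__ == "__main__":
    sample = [42, 31, 44, 23, 21, 26, 31, 41, 50, 72]
    print(min_subarray_sum(sample, 30))
    print(min_subarray_sum_bruteforce(sample, 30))
```

On the sample input both functions give -20, the sum of the standard scores -7, -9 and -4 of frames 3 to 5.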



Hello World
Wed, 29 Mar 2023 00:00:00 +0000
My first post on this blog!