<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>posts on Monsoon's Blog</title><link>https://monsoon-cs.moe/posts/</link><description>Recent content in posts on Monsoon's Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 22 Dec 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://monsoon-cs.moe/posts/index.xml" rel="self" type="application/rss+xml"/><item><title>Using GPU accessible VS Code Server on UIUC Delta</title><link>https://monsoon-cs.moe/2024-12-22-uiuc-delta-code-server/</link><pubDate>Sun, 22 Dec 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-12-22-uiuc-delta-code-server/</guid><description>&lt;h2 id="why-writing-this-blog-post"&gt;Why writing this blog post&lt;/h2&gt;
&lt;p&gt;Many UIUC students rely on &lt;a href="https://www.ncsa.illinois.edu/research/project-highlights/delta/"&gt;Delta&lt;/a&gt; to access GPU resources for their research. Delta provides 4 ssh-enabled login nodes and many computing nodes with GPUs. Usually, we must first ssh to a login node (with a password and a DUO 2FA OTP), and then use &lt;code&gt;srun&lt;/code&gt; to request GPU resources to run our code. However, in my experience, working on Delta this way has several problems:&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="why-writing-this-blog-post">Why I wrote this blog post</h2>
<p>Many UIUC students rely on <a href="https://www.ncsa.illinois.edu/research/project-highlights/delta/">Delta</a> to access GPU resources for their research. Delta provides 4 ssh-enabled login nodes and many computing nodes with GPUs. Usually, we must first ssh to a login node (with a password and a DUO 2FA OTP), and then use <code>srun</code> to request GPU resources to run our code. However, in my experience, working on Delta this way has several problems:</p>
<ul>
<li><strong>Unstable network connection</strong>: The connection drops frequently when the network is poor. Each time VS Code Remote loses its connection, you must re-enter the password and the DUO 2FA OTP (which means unlocking your phone to get the OTP) to reconnect, which is annoying, time-consuming, and distracting.</li>
<li><strong>Broken <a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/ood/index.html">OnDemand Code Server</a></strong>: Although you can run VS Code Remote on the login nodes over ssh, they have no GPU for debugging, and the computing nodes are not accessible by ssh. The alternatives are <a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/ood/index.html">OnDemand Jupyter Lab and Code Server</a>. But Jupyter Lab&rsquo;s functionality is limited, and the Code Server is broken: when I request a Code Server on a computing node, the request is queued and then shown as completed, with <strong>no running status</strong>.</li>
</ul>
<p>Because of these problems, debugging GPU programs on Delta is a struggle. That&rsquo;s why I wrote this blog post: by running a private Code Server on a computing node and deploying a <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/">Cloudflare Tunnel</a> reverse proxy, you can say goodbye to these annoying problems.</p>
<h2 id="how-to">How to</h2>
<p>My solution is based on an <strong>observation</strong> about Delta: all login nodes and computing nodes are in a trusted network. There are no firewalls between them, which means you can access any port on the computing nodes from the login nodes.</p>
<p>The main steps of my solution are simple:</p>
<ol>
<li>Use <code>srun</code> to get a tty on the computing node (e.g., on <code>gpua042</code> node).</li>
<li>Run a Code Server on the computing node. It will listen on <code>0.0.0.0:8080</code>.</li>
<li>Reverse proxy <code>gpua042:8080</code> to a port you can access. There are two approaches:
<ul>
<li>Use <code>ssh -L</code> to forward the port to your local machine.</li>
<li>Use Cloudflare Tunnel to reverse proxy the port to a public domain. This approach is more stable in poor network conditions.</li>
</ul>
</li>
</ol>
<h3 id="run-code-server">Run Code Server</h3>
<p>Download the Code Server release archive from the <a href="https://github.com/coder/code-server">GitHub repository</a> (e.g., <code>code-server-4.96.2-linux-amd64.tar.gz</code>), and extract it. On the computing node, run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">cd</span> code-server-4.96.2-linux-amd64/bin
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">## no auth</span>
</span></span><span class="line"><span class="cl">./code-server --bind-addr 0.0.0.0:8080 --auth none
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">## if port is exposed to untrusted network, use password auth</span>
</span></span><span class="line"><span class="cl"><span class="c1">## password can be modified in ~/.config/code-server/config.yaml</span>
</span></span><span class="line"><span class="cl">./code-server --bind-addr 0.0.0.0:8080
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="access-code-server">Access Code Server</h3>
<h4 id="ssh-port-forwarding">SSH Port Forwarding</h4>
<p><code>ssh -L</code> forwards a local port to a remote host and port through the ssh connection. On your local machine, run:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ssh -L 127.0.0.1:8080:gpua042:8080 username@login.delta.ncsa.illinois.edu
</span></span></code></pre></td></tr></table>
</div>
</div><p>Then open <code>http://127.0.0.1:8080</code> in your browser, and enjoy the Code Server!</p>
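<p>If the page does not load, a quick check of the local end of the forward can help. The snippet below is a minimal sketch: the helper name <code>port_is_open</code> is hypothetical, and the address <code>127.0.0.1:8080</code> is the one used in the <code>ssh -L</code> command above. Note that a successful connection only shows that the local ssh listener is up; ssh accepts local connections even when nothing is listening on the remote side.</p>

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1:8080 is the local end of the `ssh -L` forward.
    state = "up" if port_is_open("127.0.0.1", 8080) else "down"
    print(f"local forward is {state}")
```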
<h4 id="cloudflare-tunnel">Cloudflare Tunnel</h4>
<p>Cloudflare Tunnel is more stable when your computer suffers from a poor network connection, but it requires a domain name.</p>
<p>TODO</p>
]]></content:encoded></item><item><title>All About IPv6 Address Allocation</title><link>https://monsoon-cs.moe/2024-10-12-all-about-ipv6-addr-alloc/</link><pubDate>Sat, 12 Oct 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-10-12-all-about-ipv6-addr-alloc/</guid><description>&lt;h2 id="preface"&gt;Preface&lt;/h2&gt;
&lt;p&gt;IPv4 has only one method of dynamic address allocation, namely DHCP, but IPv6 has two allocation methods, SLAAC and DHCPv6, and DHCPv6 additionally has the PD (Prefix Delegation) extension. These three allocation methods also interact with each other, which makes problems arising during IPv6 allocation far more common than with IPv4. Most tutorials you can find only solve problems superficially, are ambiguous about the underlying technical details, and do not fundamentally clarify the differences between IPv6 and IPv4.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="preface">Preface</h2>
<p>IPv4 has only one method of dynamic address allocation, namely DHCP, but IPv6 has two allocation methods, SLAAC and DHCPv6, and DHCPv6 additionally has the PD (Prefix Delegation) extension. These three allocation methods also interact with each other, which makes problems arising during IPv6 allocation far more common than with IPv4. Most tutorials you can find only solve problems superficially, are ambiguous about the underlying technical details, and do not fundamentally clarify the differences between IPv6 and IPv4.</p>
<p>This article aims to start from the relevant fundamental concepts and, in a &ldquo;teach a man to fish&rdquo; manner, explain how the three IPv6 address allocation methods work, helping to thoroughly resolve the tricky problems in IPv6 allocation.</p>
<h2 id="ipv6-fundamental-concepts">IPv6 Fundamental Concepts</h2>
<h3 id="lla-link-local-address-and-eui-64">LLA (Link-Local Address) and EUI-64</h3>
<p>LLA actually already existed in IPv4: when DHCP is not working properly, some operating systems assign a <code>169.254.0.0/16</code> address to the network interface for temporary point-to-point communication. But LLA is not important in IPv4, playing only an optional fallback role that appears only when DHCP fails. As a result, the vast majority of people (including the author) did not learn about the existence of LLA until IPv6 became widespread.</p>
<p>IPv6 LLA (<code>fe80::/10</code>) inherits the basic point-to-point communication function of IPv4 LLA, but goes further to take on the important functions of NDP (Neighbor Discovery Protocol) and SLAAC (Stateless Address Autoconfiguration). Understanding it is necessary to understand how SLAAC works.</p>
<p>For example, when two network ports are directly connected with a cable, they each automatically generate an IPv6 LLA, such as <code>fe80::dfc2:d2aa:c86f:171e/64</code> and <code>fe80::da8f:9d5b:57e3:c6a6/64</code>, and each can <code>ping</code> the other&rsquo;s LLA. On Linux, the <code>ip -6 route</code> command shows the automatically configured LLA route entry:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">fe80::/64 dev eth0 proto kernel metric 1024 pref medium
</span></span></code></pre></td></tr></table>
</div>
</div><p>IPv6 LLA is generated from the MAC address using a specific algorithm, namely EUI-64. For example, when the network port&rsquo;s MAC address is <code>70:07:12:34:56:78</code>, the generated EUI-64 is <code>7207:12ff:fe34:5678</code>, and the LLA is <code>fe80::7207:12ff:fe34:5678/64</code> (the EUI-64 as the lower 64 bits, under the <code>fe80::/64</code> prefix). The specific generation process is shown in the figure below:</p>
<p><img alt="IPv6 LLA generation process, image source https://www.networkacademy.io/ccna/ipv6/stateless-address-autoconfiguration-slaac" loading="lazy" src="/2024-10-12-all-about-ipv6-addr-alloc/generating-link-local-address-example.png"></p>
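<p>The same derivation can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, and it follows the modified EUI-64 procedure (flip the universal/local bit of the first octet, then insert <code>ff:fe</code> in the middle of the MAC address).</p>

```python
import ipaddress


def mac_to_eui64(mac: str) -> str:
    """Convert a 48-bit MAC address into a modified EUI-64 interface ID."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    octets[0] ^= 0x02  # flip the universal/local bit
    eui = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    groups = [f"{(eui[i] << 8) | eui[i + 1]:x}" for i in range(0, 8, 2)]
    return ":".join(groups)


def link_local_address(mac: str) -> ipaddress.IPv6Address:
    """Place the EUI-64 in the lower 64 bits under the fe80::/64 prefix."""
    return ipaddress.IPv6Address("fe80::" + mac_to_eui64(mac))


print(mac_to_eui64("70:07:12:34:56:78"))        # 7207:12ff:fe34:5678
print(link_local_address("70:07:12:34:56:78"))  # fe80::7207:12ff:fe34:5678
```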
<p>Generally, routers do not forward traffic for LLA addresses; it is <strong>only used for point-to-point communication on the link</strong>.</p>
<h3 id="gua-global-unicast-address">GUA (Global Unicast Address)</h3>
<p>IPv6 GUA (<code>2000::/3</code>) can be mapped to the IPv4 concept of a &ldquo;public IP&rdquo;. In theory it is globally unique and can be used for communication over the public network. A well-designed network architecture should allow every device to obtain an IPv6 GUA, so as to maximize IPv6&rsquo;s P2P communication advantage.</p>
<h3 id="private-addresses">Private Addresses</h3>
<p><code>fc00::/7</code> is defined as the IPv6 private address range, analogous to <code>10.0.0.0/8</code>, <code>172.16.0.0/12</code>, and <code>192.168.0.0/16</code> in IPv4, used for LAN communication. Unlike LLA, it can be forwarded by routers.</p>
<p>Because IPv6 is designed so that every device worldwide can be assigned a GUA, the role of private addresses in IPv6 is greatly diminished. When it is not possible to assign a GUA to every device (as in some campus network environments), assigning IPv6 private addresses on the internal network can serve as an alternative, allowing internal devices to access IPv6.</p>
<h3 id="multicast">Multicast</h3>
<p>IPv6 multicast addresses (<code>ff00::/8</code>) are similar to IPv4 multicast addresses (<code>224.0.0.0/4</code>), used for one-to-many communication within a network segment. <strong>Both SLAAC and DHCPv6 rely on multicast to work</strong>. Commonly used multicast addresses include:</p>
<ul>
<li><code>ff02::1</code>: all nodes on the local link;</li>
<li><code>ff02::2</code>: all routers on the local link.</li>
</ul>
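<p>To make the address ranges introduced so far concrete, here is a small sketch that classifies an address by prefix membership with Python&rsquo;s standard <code>ipaddress</code> module. The <code>classify</code> helper and its labels are hypothetical; the link-local range is taken as <code>fe80::/10</code>.</p>

```python
import ipaddress

# Address ranges discussed in this section, checked in order.
RANGES = [
    ("link-local (LLA)", ipaddress.ip_network("fe80::/10")),
    ("private", ipaddress.ip_network("fc00::/7")),
    ("multicast", ipaddress.ip_network("ff00::/8")),
    ("global unicast (GUA)", ipaddress.ip_network("2000::/3")),
]


def classify(addr: str) -> str:
    """Return the name of the first range that contains addr."""
    ip = ipaddress.IPv6Address(addr)
    for name, net in RANGES:
        if ip in net:
            return name
    return "other"


for sample in ("fe80::1", "fd00::1", "ff02::2", "2001:db8::1", "::1"):
    print(sample, "->", classify(sample))
```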
<h3 id="ndp-neighbor-discovery-protocol">NDP (Neighbor Discovery Protocol)</h3>
<p>NDP works on top of ICMPv6 and is similar to IPv4 ARP. It is used to discover other nodes on the data link layer and their corresponding IPv6 addresses, to determine available routes, and to maintain reachability information about available paths and other active nodes. <strong>SLAAC works based on NDP</strong>. The message types involved are:</p>
<ol>
<li>RS (Router Solicitation) and RA (Router Advertisement): used to configure IPv6 addresses and routes;</li>
<li>NS (Neighbor Solicitation) and NA (Neighbor Advertisement): used to find the MAC addresses of other devices on the link.</li>
</ol>
<h2 id="slaac-stateless-address-autoconfiguration">SLAAC (Stateless Address Autoconfiguration)</h2>
<p>SLAAC is the IPv6 address allocation method defined in <a href="https://datatracker.ietf.org/doc/html/rfc4862">RFC 4862</a>, and is also the <strong>recommended allocation method</strong>. In fact, Android only supports SLAAC for IPv6 allocation.</p>
<p>The most notable feature of SLAAC is that it is stateless, i.e. it does not require a centralized server responsible for allocation. Below, the author uses an example to illustrate the SLAAC process.</p>
<p>Suppose the <code>lan0</code> port on the <strong>router</strong> is connected to the <code>eth0</code> port on the <strong>host</strong>. The LLA of <code>lan0</code> is <code>fe80::1/64</code>, and the MAC address of <code>eth0</code> is <code>70:07:12:34:56:78</code>. At the same time, the router holds the GUA prefix <code>2001:db8::/64</code>, i.e. all GUAs under this subnet will be routed by the upstream router to this router&rsquo;s <code>wan</code> port. The SLAAC process is as follows:</p>
<ol>
<li>
<p><code>eth0</code> generates the EUI-64 <code>7207:12ff:fe34:5678</code> and the LLA <code>fe80::7207:12ff:fe34:5678/64</code> based on its MAC address;</p>
</li>
<li>
<p>The host performs DAD (Duplicate Address Detection) to ensure the LLA is unique on the local link. This is unrelated to address allocation, so it is omitted here; interested readers can look up the relevant material themselves;</p>
</li>
<li>
<p>The host sends an RS message via the <code>eth0</code> LLA. The RS is sent to all routers on the local link using the multicast address <code>ff02::2</code>.</p>
</li>
<li>
<p>The router replies with an RA message to the <code>eth0</code> LLA. The RA contains the prefix <code>2001:db8::/64</code>, the validity period, the MTU, and other information.</p>
</li>
<li>
<p>The host receives the RA, combines the prefix and the EUI-64 into <code>2001:db8::7207:12ff:fe34:5678/64</code>, assigns it to <code>eth0</code>, and adds the routing table entries:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">2001:db8::/64 dev eth0 proto ra metric 1024 expires 2591993sec pref medium
</span></span><span class="line"><span class="cl">default via fe80::1 dev eth0 proto static metric 1024 onlink pref medium
</span></span></code></pre></td></tr></table>
</div>
</div></li>
<li>
<p>The host performs DAD on the new address and uses an NA message to announce it to neighbors on the link.</p>
</li>
</ol>
<p><img alt="SLAAC process, image source https://www.networkacademy.io/ccna/ipv6/stateless-address-autoconfiguration-slaac" loading="lazy" src="/2024-10-12-all-about-ipv6-addr-alloc/ipv6-stateless-address-autoconfiguration.gif"></p>
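<p>Step 5 of the process above, combining the advertised prefix with the interface ID, can be sketched as follows. The helper name <code>slaac_address</code> is hypothetical, and the sketch assumes a <code>/64</code> prefix as in the example.</p>

```python
import ipaddress


def slaac_address(prefix: str, interface_id: str) -> ipaddress.IPv6Interface:
    """Combine a /64 prefix from an RA with a 64-bit interface ID."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch expects a /64 prefix")
    # Parse the interface ID as the lower 64 bits of an address.
    iid = int(ipaddress.IPv6Address("::" + interface_id))
    addr = ipaddress.IPv6Address(int(net.network_address) | iid)
    return ipaddress.IPv6Interface(f"{addr}/{net.prefixlen}")


print(slaac_address("2001:db8::/64", "7207:12ff:fe34:5678"))
# 2001:db8::7207:12ff:fe34:5678/64
```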
<p>SLAAC looks great, but it has an <strong>important flaw</strong>: it does not support distributing DNS information, so the host must obtain DNS through some other means (usually DHCPv6). There are two flag bits in the RA to address this problem:</p>
<ul>
<li><code>M</code> (Managed Address Configuration): address information can be obtained via DHCPv6;</li>
<li><code>O</code> (Other Configuration): other information (such as DNS) can be obtained via DHCPv6.</li>
</ul>
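<p>As a rough illustration of how these two bits are read, the sketch below decodes the flags byte of an RA. In the RA message format of RFC 4861, <code>M</code> is the highest bit (<code>0x80</code>) and <code>O</code> the next one (<code>0x40</code>) of the byte that follows the hop limit. The function names and the returned labels are hypothetical, and real hosts apply more policy than shown here.</p>

```python
M_FLAG = 0x80  # Managed Address Configuration
O_FLAG = 0x40  # Other Configuration


def parse_ra_flags(flags_byte: int) -> tuple:
    """Return (managed, other) as booleans from the RA flags byte."""
    return bool(flags_byte & M_FLAG), bool(flags_byte & O_FLAG)


def dhcpv6_mode(flags_byte: int) -> str:
    """Simplified view of which DHCPv6 mode the RA points the host to."""
    managed, other = parse_ra_flags(flags_byte)
    if managed:
        return "stateful"   # addresses (and other information) via DHCPv6
    if other:
        return "stateless"  # only other information (such as DNS) via DHCPv6
    return "none"           # no DHCPv6 hinted by the RA


for flags in (0x00, 0x40, 0x80, 0xC0):
    print(hex(flags), dhcpv6_mode(flags))
```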
<p>The newer <a href="https://datatracker.ietf.org/doc/html/rfc8106">RFC 8106</a> (which obsoletes RFC 6106) supports distributing DNS information by adding RDNSS (Recursive DNS Server) and DNSSL (DNS Search List) options to the RA. For the level of RDNSS support across operating systems, see <a href="https://en.wikipedia.org/wiki/Comparison_of_IPv6_support_in_operating_systems">Comparison of IPv6 support in operating systems</a>. In practice, on dual-stack networks the DNS servers obtained via DHCPv4 are usually sufficient, since they can also answer AAAA queries, so the RDNSS extension is often not essential.</p>
<p>The problem with the EUI-64-based SLAAC address configuration above is that <strong>the addresses it generates are fixed and predictable</strong>, which brings security and privacy concerns. The IPv6 SLAAC privacy extension defined in <a href="https://datatracker.ietf.org/doc/html/rfc4941">RFC 4941</a> solves this problem. During SLAAC it also generates random, periodically rotated addresses to address the privacy issue. At the same time, the EUI-64-generated address is also retained, for use by externally incoming connections. With the privacy extension enabled, the IPv6 addresses generated on Linux look like the following, for example (from top to bottom: the privacy address, the EUI-64 GUA, and the LLA):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">2: eth0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc cake state UP group default qlen 1000
</span></span><span class="line"><span class="cl">    link/ether 70:07:12:34:56:78 brd ff:ff:ff:ff:ff:ff
</span></span><span class="line"><span class="cl">    inet6 2001:db8::dead:beef:aaaa:bbbb/64 scope global temporary dynamic
</span></span><span class="line"><span class="cl">       valid_lft 2591998sec preferred_lft 604798sec
</span></span><span class="line"><span class="cl">    inet6 2001:db8::7207:12ff:fe34:5678/64 scope global dynamic mngtmpaddr noprefixroute
</span></span><span class="line"><span class="cl">       valid_lft 2591998sec preferred_lft 604798sec
</span></span><span class="line"><span class="cl">    inet6 fe80::7207:12ff:fe34:5678/64 scope link
</span></span><span class="line"><span class="cl">       valid_lft forever preferred_lft forever
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="dhcpv6">DHCPv6</h2>
<p>DHCPv6 operates in broadly the same way as DHCPv4: the host sends a multicast message to <code>ff02::1:2</code> on UDP port 547, and the DHCPv6 server replies with address, DNS, and other information.</p>
<p>The difference is that DHCPv6 can run in either a stateful or a stateless mode, the distinction being whether or not an address is obtained. When used together with SLAAC, the host only needs to obtain DNS and other information from DHCPv6, so stateless DHCPv6 can be used.</p>
<h2 id="dhcpv6-pd-prefix-delegation">DHCPv6 PD (Prefix Delegation)</h2>
<p>PD is a DHCPv6 extension defined in <a href="https://datatracker.ietf.org/doc/html/rfc3633">RFC 3633</a>. It is used to distribute IPv6 prefixes across a network.</p>
<p>With the PD extension enabled, the DHCP server grants the host the right to use an IPv6 subnet prefix (such as <code>2001:db8::/56</code>) and adds routing table entries to ensure that all addresses under this subnet are routed to the host that requested the prefix. The host can then further subdivide and allocate this subnet.</p>
<p>A typical use case for DHCPv6 PD is home ISP network access. The home gateway router requests an IPv6 prefix from the ISP DHCP server, and then distributes addresses from this subnet prefix within the home internal network via SLAAC.</p>
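<p>How a delegated prefix is subdivided can be sketched with the standard <code>ipaddress</code> module. The helper name <code>lan_prefixes</code> is hypothetical; the sketch carves <code>/64</code> subnets (the size SLAAC expects) out of the delegated <code>/56</code> from the example above, which yields 256 possible LAN prefixes.</p>

```python
import ipaddress
from itertools import islice


def lan_prefixes(delegated: str, count: int) -> list:
    """Return the first `count` /64 subnets of a delegated prefix."""
    net = ipaddress.IPv6Network(delegated)
    return [str(sub) for sub in islice(net.subnets(new_prefix=64), count)]


delegated = "2001:db8::/56"
print(lan_prefixes(delegated, 3))
# ['2001:db8::/64', '2001:db8:0:1::/64', '2001:db8:0:2::/64']

# Number of /64 subnets available inside the delegated prefix.
print(2 ** (64 - ipaddress.IPv6Network(delegated).prefixlen))  # 256
```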
<h2 id="conclusion">Conclusion</h2>
<p>This article briefly introduced some of the concepts involved in IPv6 address allocation and explained how SLAAC, DHCPv6, and DHCPv6 PD work. In terms of simplifying address management, IPv6 can be said to have been rather unsuccessful: multiple standards coexist, and there are various combinations of them, which gives clients a non-trivial probability of failing to correctly obtain IPv6.</p>
<p>In practice, the three most common IPv6 allocation scenarios we encounter are:</p>
<ul>
<li>Pure SLAAC: typical campus networks (education networks) fall into this category. In practice, the author has found cases where a misconfigured host on the internal network indiscriminately sends RAs, causing the IPv6 of all hosts on the entire internal network to be misconfigured. At the same time, in this mode, a router you connect yourself will no longer be able to distribute SLAAC GUAs to downstream devices, because the local-link multicast packets that SLAAC relies on cannot be forwarded by the router (this can be solved via IPv6 bridging or NAT6, which is not elaborated on here).</li>
<li>Pure DHCPv6: some enterprise internal networks use this mode, because DHCPv6 allows centralized management. The biggest problem with this mode is that <a href="https://www.nullzero.co.uk/android-does-not-support-dhcpv6-and-google-wont-fix-that/">Android does not support DHCPv6</a>. But under other operating systems, this mode runs fairly stably.</li>
<li>SLAAC + DHCPv6 PD: this is the most common mode for home ISP network access. Most home routers are adapted for it and work out of the box.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://www.networkacademy.io/ccna/ipv6/stateless-address-autoconfiguration-slaac">IPv6 Stateless Address Auto-configuration (SLAAC)</a></li>
<li><a href="https://datatracker.ietf.org/doc/html/rfc4862">RFC 4862: IPv6 Stateless Address Autoconfiguration</a></li>
<li><a href="https://datatracker.ietf.org/doc/html/rfc8106">RFC 8106: IPv6 Router Advertisement Options for DNS Configuration</a></li>
<li><a href="https://datatracker.ietf.org/doc/html/rfc4941">RFC 4941: Privacy Extensions for Stateless Address Autoconfiguration in IPv6</a></li>
<li><a href="https://datatracker.ietf.org/doc/html/rfc3633">RFC 3633: IPv6 Prefix Options for Dynamic Host Configuration Protocol (DHCP) version 6</a></li>
<li><a href="https://www.nullzero.co.uk/android-does-not-support-dhcpv6-and-google-wont-fix-that/">Android does not support DHCPv6 and Google &lsquo;Won&rsquo;t Fix&rsquo; that</a></li>
<li><a href="https://en.wikipedia.org/wiki/Comparison_of_IPv6_support_in_operating_systems">Comparison of IPv6 support in operating systems</a></li>
</ul>
]]></content:encoded></item><item><title>Extracting Graph Topology from Image</title><link>https://monsoon-cs.moe/2024-07-11-extracting-graph-topology-from-image/</link><pubDate>Thu, 11 Jul 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-07-11-extracting-graph-topology-from-image/</guid><description>&lt;h2 id="the-problem"&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Now we have an image representing a graph, as shown in the figure below:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://monsoon-cs.moe/2024-07-11-extracting-graph-topology-from-image/image.png"&gt;&lt;/p&gt;
&lt;p&gt;Suppose we already know the category of each pixel: background, node, or edge. How can we &lt;strong&gt;extract the graph topology&lt;/strong&gt; from it and represent the graph by an adjacency matrix?&lt;/p&gt;
&lt;h2 id="challenges-in-classical-algorithm"&gt;Challenges in Classical Algorithm&lt;/h2&gt;
&lt;p&gt;TODO&lt;/p&gt;
&lt;h2 id="what-about-neural-network"&gt;What about Neural Network?&lt;/h2&gt;
&lt;p&gt;We can use a simple algorithm to extract the position of each node. Suppose the position of a node is $\mathbf{P}(x,y)$, and there are $N$ nodes in total.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="the-problem">The Problem</h2>
<p>Now we have an image representing a graph, as shown in the figure below:</p>
<p><img loading="lazy" src="/2024-07-11-extracting-graph-topology-from-image/image.png"></p>
<p>Suppose we already know the category of each pixel: background, node, or edge. How can we <strong>extract the graph topology</strong> from it and represent the graph by an adjacency matrix?</p>
<h2 id="challenges-in-classical-algorithm">Challenges in Classical Algorithm</h2>
<p>TODO</p>
<h2 id="what-about-neural-network">What about Neural Network?</h2>
<p>We can use a simple algorithm to extract the position of each node. Suppose the position of a node is $\mathbf{P}(x,y)$, and there are $N$ nodes in total.</p>
<p>Then, the task is to fill in the $N\times N$ adjacency matrix with $0$ or $1$. As we can see, this can be converted into <strong>a binary classification problem</strong>.</p>
<p>We can train a neural network $\mathbf{f}$ that takes three inputs: the image $\mathbf{I}$ and the positions of a node pair $\left( \mathbf{P}_ 1, \mathbf{P}_ 2 \right)$. It outputs $O\in\{0,1\}$, indicating whether there is a direct connection between the node pair, i.e.,</p>
$$O=\mathbf{f}(\mathbf{I}, \mathbf{P}_ 1, \mathbf{P}_ 2).$$<p>The dataset can be synthesized by a simple program, and we can use any classification network (e.g., <a href="https://arxiv.org/abs/1905.11946">EfficientNet</a>) as our network architecture.</p>
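<p>Given such a classifier, filling in the adjacency matrix is a loop over node pairs. The sketch below is only illustrative: <code>build_adjacency</code> is a hypothetical helper, the classifier is passed in as a plain callable (the image is assumed to be captured inside it), and the graph is assumed to be undirected, so each pair is queried once and the matrix is filled symmetrically.</p>

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, int]


def build_adjacency(
    nodes: Sequence[Point],
    connected: Callable[[Point, Point], int],
) -> List[List[int]]:
    """Query the pairwise classifier for every unordered node pair."""
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Assumption: undirected graph, so the matrix is symmetric.
        adj[i][j] = adj[j][i] = int(connected(nodes[i], nodes[j]))
    return adj


# Stand-in for the trained network: "connected" if the nodes are neighbours
# on a grid. A real classifier would look at the image instead.
def toy_classifier(p1: Point, p2: Point) -> int:
    return int(abs(p1[0] - p2[0]) + abs(p1[1] - p2[1]) == 1)


print(build_adjacency([(0, 0), (0, 1), (5, 5)], toy_classifier))
# [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

<p>With $N$ nodes this issues $N(N-1)/2$ classifier calls, which is worth keeping in mind for large graphs.</p>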
<p>The problem is how to feed $\left( \mathbf{P}_ 1, \mathbf{P}_ 2 \right)$ into the network. We can add an additional &ldquo;mask channel&rdquo; to the image, where the pixels belonging to the two input nodes are marked as 1, and the others as 0. Finally, we input this 4-channel &ldquo;image&rdquo; into the network.</p>
<p><img loading="lazy" src="/2024-07-11-extracting-graph-topology-from-image/nn.png"></p>
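<p>Building the 4-channel input can be sketched as follows. This is a dependency-free illustration using nested Python lists (a real pipeline would use array or tensor types). The helper names are hypothetical, and marking each node as a disc of a given <code>radius</code> around its position is an assumption made here; if the per-pixel node labels are available, they can be used directly instead.</p>

```python
def node_mask(height, width, centers, radius):
    """Binary mask with 1 inside a disc around each (x, y) center."""
    mask = [[0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    mask[y][x] = 1
    return mask


def add_mask_channel(image, mask):
    """Append the mask as a 4th channel to an H x W x 3 image."""
    return [
        [list(pixel) + [m] for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# A blank 4x4 RGB image and a node pair at opposite corners.
image = [[(0, 0, 0) for _ in range(4)] for _ in range(4)]
mask = node_mask(4, 4, centers=[(0, 0), (3, 3)], radius=0)
four_channel = add_mask_channel(image, mask)
print(four_channel[0][0], four_channel[1][1])  # [0, 0, 0, 1] [0, 0, 0, 0]
```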
<h2 id="other-notes">Other Notes</h2>
<p>TODO</p>
]]></content:encoded></item><item><title>Latency in LLM Serving</title><link>https://monsoon-cs.moe/2024-07-07-latency-in-llm-serving/</link><pubDate>Sun, 07 Jul 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-07-07-latency-in-llm-serving/</guid><description>&lt;h2 id="preface"&gt;Preface&lt;/h2&gt;
&lt;p&gt;There have been many excellent works on LLM serving, mainly focusing on improving the throughput. Meanwhile, in practical applications, latency is equally important for LLM serving. However, &lt;strong&gt;currently few works focus on improvement of LLM serving latency, especially the latency optimization under SLA constraint&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This blog attempts to summarize the basic concepts and problems in this direction, and give some novel research directions based on some analysis of latency in LLM serving.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="preface">Preface</h2>
<p>There have been many excellent works on LLM serving, mainly focusing on improving the throughput. Meanwhile, in practical applications, latency is equally important for LLM serving. However, <strong>currently few works focus on improvement of LLM serving latency, especially the latency optimization under SLA constraint</strong>.</p>
<p>This blog attempts to summarize the basic concepts and problems in this direction, and give some novel research directions based on some analysis of latency in LLM serving.</p>
<h2 id="latency-metrics">Latency Metrics</h2>
<p>In LLM serving, we mainly focus on three latency metrics:</p>
<ul>
<li><strong>TBT</strong> ($t_ {tbt}$): Time Between Tokens.</li>
<li><strong>TTFT</strong> ($t_ {ttft}$): Time to First Token.</li>
<li><strong>TE2E</strong> ($t_ {e2e}$): End-to-End Time.</li>
</ul>
<p>In practice, rather than the average or median latency, we usually consider the <strong>latency SLA</strong>: given percentiles of the latency distribution (e.g., the 50th, 90th, and 99th) must each fall below a specified threshold.</p>
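<p>A percentile-based SLA check can be sketched with the standard library. The function name <code>meets_sla</code>, the threshold values, and the choice of the <code>inclusive</code> interpolation method are assumptions for illustration; monitoring systems differ in how they estimate percentiles.</p>

```python
import statistics


def meets_sla(latencies_ms, thresholds_ms):
    """Check percentile latencies against per-percentile limits.

    thresholds_ms maps a percentile (1..99) to the maximum allowed latency.
    """
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return all(cuts[p - 1] <= limit for p, limit in thresholds_ms.items())


# Synthetic latencies of 1..100 ms, purely for illustration.
samples = list(range(1, 101))
print(meets_sla(samples, {50: 51, 90: 91, 99: 100}))  # True
print(meets_sla(samples, {50: 50}))                   # False (P50 is 50.5 ms)
```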
<h2 id="where-the-latency-comes-from">Where Does the Latency Come From?</h2>
<p><img loading="lazy" src="/2024-07-07-latency-in-llm-serving/latency_in_llm_serving.png"></p>
<p>As shown in the figure above, the current popular LLM serving systems (such as vLLM, DeepSpeed) adopt an <strong>iteration-level scheduling strategy</strong>. The processing of each request is divided into the <strong>prefilling stage</strong> (prompt inference) and the <strong>generation stage</strong> (auto-regressive token-by-token generation). For systems such as Sarathi-Serve, the prompt is chunked to improve throughput, thus adding a <strong>chunked prefilling stage</strong>.</p>
<p>The LLM serving system maintains <strong>3 queues</strong> to store requests in these 3 states. The scheduler runs in a loop, and in each iteration, it selects requests from these 3 queues with a certain strategy, and combines them into a batch for the inference engine.</p>
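<p>One iteration of such a scheduler can be sketched as below. This is not how vLLM, DeepSpeed, or Sarathi-Serve actually implement it; the function name, the string request IDs, and the generation-first priority order are assumptions chosen only to show the shape of the loop.</p>

```python
from collections import deque


def schedule_iteration(prefill, chunked, generation, max_batch):
    """Pick up to max_batch requests for the next engine iteration.

    Assumed policy: generation requests first, then chunked prefills,
    then new prefills. Real systems use more elaborate strategies.
    """
    batch = []
    for queue in (generation, chunked, prefill):
        while queue and len(batch) < max_batch:
            batch.append(queue.popleft())
    return batch


prefill = deque(["p1", "p2"])
chunked = deque(["c1"])
generation = deque(["g1", "g2"])

print(schedule_iteration(prefill, chunked, generation, max_batch=4))
# ['g1', 'g2', 'c1', 'p1']
print(list(prefill))  # ['p2'] waits for the next iteration
```

<p>Whatever is left in a queue after an iteration contributes to the queue latency discussed next.</p>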
<p>In such systems, the latency of a request mainly comes from two sources: <strong>queue latency</strong> and <strong>inference latency</strong>. Let $t_ {qp}$, $t_ {qc}$, and $t_ {qg}$ be the time a request waits in the prefilling queue, the chunked prefilling queue, and the generation queue, respectively, before being selected by the scheduler, and let $t_ {inf}$ be the inference latency of the engine. We get:</p>
$$\begin{aligned}
  t_ {ttft} &= t_ {qp} + (N_ {chunk} - 1) \cdot t_ {qc} + N_ {chunk} \cdot t_ {inf}, \\\\
  t_ {tbt} &= t_ {qg} + t_ {inf}, \\\\
  t_ {e2e} &= t_ {ttft} + N_{token} \cdot t_ {tbt},
\end{aligned}$$<p>where $N_ {chunk}$ is the number of chunks of a prefilling request ($N_ {chunk}=1$ means no chunking), and $N_ {token}$ is the total number of tokens generated for a request.</p>
<p>Obviously, $t_ {inf}$ is not a fixed value; it depends on the composition of the batch. We can denote it as:</p>
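<p>The three formulas translate directly into code. The sketch below treats every queue wait and $t_ {inf}$ as constants, as the formulas do; the function names and the example numbers (in milliseconds) are made up for illustration.</p>

```python
def ttft(t_qp, t_qc, t_inf, n_chunk=1):
    """Time to first token: one prefill wait, then waits between chunks."""
    return t_qp + (n_chunk - 1) * t_qc + n_chunk * t_inf


def tbt(t_qg, t_inf):
    """Time between tokens: one generation-queue wait plus one inference."""
    return t_qg + t_inf


def e2e(t_ttft, t_tbt, n_token):
    """End-to-end time for a request that generates n_token tokens."""
    return t_ttft + n_token * t_tbt


# Illustrative numbers only (milliseconds).
first = ttft(t_qp=10, t_qc=2, t_inf=20, n_chunk=3)  # 10 + 2*2 + 3*20 = 74
step = tbt(t_qg=1, t_inf=20)                        # 21
print(first, step, e2e(first, step, n_token=100))   # 74 21 2174
```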
$$t_ {inf} = f\left( B_ {p}, B_ {c}, B_ {g}, \mathbf{L}_ {p}, L_ {chunk} \right),$$<p>where $B_p$, $B_c$, and $B_g$ denote the numbers of non-chunked prefilling requests, chunked prefilling requests, and generation requests in the batch, respectively. The vector $\mathbf{L}_ {p}$ contains the prompt length of each non-chunked prefilling request in the batch, and $L_ {chunk}$ is the chunk size.</p>
<h2 id="how-to-improve-it">How to Improve It?</h2>
<p>Based on the above analysis, we can find that reducing latency mainly involves reducing both <strong>queue latency</strong> and <strong>inference latency</strong>. In fact, some techniques, such as iteration-level scheduling and chunked prefilling, can be seen as improvements to queue latency.</p>
<p>On the other hand, <strong>improving inference latency has not received much attention</strong>. One reason is that, <strong>for inference engines, there is a trade-off between latency and throughput</strong>.
Generally speaking, a larger batch size means higher throughput, but also higher inference latency. Techniques such as quantization and Paged Attention focus on more efficient memory usage to increase the batch size, <strong>but inference latency may also increase accordingly</strong> (TODO: add an example), which means $t_ {tbt}$ and $t_ {ttft}$ may increase and SLA requirements may be violated.</p>
<p>Therefore, <strong>there is an opportunity to improve inference latency in current LLM serving systems</strong>. The target may be an <strong>SLA-aware scheduler</strong>, which maximizes throughput without violating SLA requirements. It should be able to <strong>dynamically decide the batch size and batch composition</strong> instead of just deploying a static prefill-prioritized or generation-prioritized strategy.</p>
<p>I believe the key to this design is to predict $t_ {inf}$ to provide latency optimization guidance for the scheduler. Prediction based on profiling results may be a simple approach, <strong>but a performance model based on GPU computation capability and memory bandwidth might be more general</strong>.</p>
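<p>As a rough illustration of the second approach, the sketch below estimates $t_ {inf}$ with a roofline-style model: an iteration cannot finish faster than its compute time or its memory-traffic time, whichever is larger. The function name, its parameters, and all hardware and workload numbers are illustrative assumptions, not measurements.</p>

```python
def predict_t_inf(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Roofline-style lower bound on one iteration's latency, in seconds.

    flops and bytes_moved describe the batch; peak_flops (FLOP/s) and
    mem_bandwidth (bytes/s) describe the GPU. All four are assumed inputs.
    """
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    # The slower of the two resources bounds the iteration.
    return max(compute_time, memory_time)


# Hypothetical GPU: 100 TFLOP/s of compute and 1 TB/s of memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

# A generation-heavy batch: little compute, dominated by reading weights.
t_gen = predict_t_inf(flops=2e11, bytes_moved=2e10,
                      peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH)
# A prefill-heavy batch: similar weight traffic, far more compute.
t_prefill = predict_t_inf(flops=4e13, bytes_moved=2.5e10,
                          peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH)
print(f"generation batch: {t_gen * 1e3:.1f} ms (memory-bound)")
print(f"prefill batch:    {t_prefill * 1e3:.1f} ms (compute-bound)")
```

<p>A real predictor would derive <code>flops</code> and <code>bytes_moved</code> from the batch composition, and would likely need a correction fitted from profiling, since kernels rarely reach peak numbers.</p>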
<p>Once we can predict $t_ {inf}$, the values of $t_ {qp}$, $t_ {qc}$, and $t_ {qg}$ can also be predicted using mathematical tools such as queueing theory (e.g., modeling request arrivals as a Poisson process), allowing us to optimize serving for the following scenarios:</p>
<ol>
<li>When the request arrival rate is less than the maximum throughput: we can appropriately reduce batch size to improve $t_ {tbt}$.</li>
<li>When the request arrival rate is greater than the maximum throughput: we can adjust the batch composition dynamically based on queue length, or drop some requests to avoid starvation.</li>
<li>When the request arrival rate suddenly increases: we can adjust the batch composition to avoid breaking the SLA of $t_ {ttft}$.</li>
</ol>
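<p>As one concrete instance of such a prediction, the sketch below uses the textbook M/M/1 result for the mean time a request waits in the queue. Treating the engine as a single server with Poisson arrivals and exponentially distributed service times is a strong simplifying assumption (batched LLM serving is not M/M/1), and the function name and the rates are hypothetical.</p>

```python
def mm1_mean_queue_wait(arrival_rate, service_rate):
    """Mean waiting time in queue for an M/M/1 system, in seconds.

    arrival_rate (lambda) and service_rate (mu) are in requests per second.
    Returns None when the system is overloaded, because the queue then
    grows without bound and no steady-state wait exists.
    """
    if arrival_rate >= service_rate:
        return None
    utilization = arrival_rate / service_rate
    # W_q = rho / (mu - lambda)
    return utilization / (service_rate - arrival_rate)


# Hypothetical engine that completes 10 requests per second.
for lam in (2.0, 8.0, 9.5, 12.0):
    wait = mm1_mean_queue_wait(lam, service_rate=10.0)
    if wait is None:
        print(f"lambda={lam}: overloaded, queue grows without bound")
    else:
        print(f"lambda={lam}: mean queue wait {wait * 1e3:.0f} ms")
```

<p>The sharp growth of the wait as the arrival rate approaches the service rate is what separates the first two scenarios in the list above.</p>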
<p>In summary, this SLA-aware scheduler should provide better results than a static scheduler by considering <strong>arrival rate</strong>, <strong>queue length</strong>, and <strong>predicted $t_ {inf}$</strong>.</p>
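<p>A minimal sketch of the batch-size decision, assuming a latency predictor is already available as a callable. The function names, the toy latency model, and the SLA value are hypothetical; a real scheduler would also choose the batch composition, not only its size.</p>

```python
def pick_batch_size(queue_length, tbt_sla_ms, predict_t_inf_ms, max_batch_size):
    """Return the largest batch size whose predicted latency meets the SLA.

    predict_t_inf_ms is any callable mapping a batch size to a predicted
    iteration latency in milliseconds; it is assumed to be non-decreasing.
    One request is always scheduled when the queue is non-empty, so the
    queue keeps draining even if the SLA cannot be met.
    """
    if queue_length == 0:
        return 0
    upper = min(queue_length, max_batch_size)
    best = 1
    for batch_size in range(2, upper + 1):
        if predict_t_inf_ms(batch_size) > tbt_sla_ms:
            break
        best = batch_size
    return best


def toy_latency_model_ms(batch_size):
    # Hypothetical model: 10 ms fixed cost plus 0.5 ms per request.
    return 10 + 0.5 * batch_size


# Long queue: the SLA, not the queue, limits the batch.
print(pick_batch_size(queue_length=100, tbt_sla_ms=30,
                      predict_t_inf_ms=toy_latency_model_ms, max_batch_size=256))
# Short queue: everything waiting fits within the SLA.
print(pick_batch_size(queue_length=8, tbt_sla_ms=30,
                      predict_t_inf_ms=toy_latency_model_ms, max_batch_size=256))
```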
<h2 id="some-meaningful-experiment-result">Some Meaningful Experiment Results</h2>
<p>TODO</p>
]]></content:encoded></item><item><title>How Quantization Works: From a Matrix Multiplication Perspective</title><link>https://monsoon-cs.moe/2024-03-06-quantization-gemm/</link><pubDate>Wed, 06 Mar 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-03-06-quantization-gemm/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Quantization is a commonly used acceleration technique in NN inference. The primary computational workloads in NNs come from Convolution, Linear Layers, and Attention, which are implemented by GEMM in the lower level. This blog aims to &lt;strong&gt;discuss the principles of quantization from the matrix multiplication perspective and to explain why some quantization methods are impractical&lt;/strong&gt;. It also aims to review several LLM quantization methods from this perspective.&lt;/p&gt;
&lt;p&gt;I define &lt;strong&gt;practical quantization&lt;/strong&gt; as follows:&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Quantization is a commonly used acceleration technique in NN inference. The primary computational workloads in NNs come from Convolution, Linear Layers, and Attention, which are implemented as GEMM at a lower level. This blog aims to <strong>discuss the principles of quantization from the matrix multiplication perspective and to explain why some quantization methods are impractical</strong>. It also aims to review several LLM quantization methods from this perspective.</p>
<p>I define <strong>practical quantization</strong> as follows:</p>
<ol>
<li>Operation <strong>can still be performed using GEMM after quantization</strong>. This requires both mathematical feasibility and hardware support. It is a fundamental requirement for achieving acceleration.</li>
<li>Quantization must lead to <strong>actual acceleration</strong>. Acceleration can arise from higher INT8 hardware throughput, or from the memory bandwidth saved by smaller memory footprint. Importantly, the benefits of acceleration must outweigh the quantization overhead.</li>
</ol>
<h2 id="lets-do-some-math">Let&rsquo;s do some math</h2>
<p>Suppose an operator can be expressed in the form of matrix multiplication:
</p>
$$\mathbf{Y}=\mathbf{X} \mathbf{W}^\top,$$<p>
where $\mathbf{X} \in \mathbb{R}^{N \times C}$, $\mathbf{Y} \in \mathbb{R}^{N \times D}$, $\mathbf{W} \in \mathbb{R}^{D \times C}$, while their quantized versions are denoted as $\hat{\mathbf{X}}$, $\hat{\mathbf{Y}}$, $\hat{\mathbf{W}}$. Our goal is to ensure that operations can still be performed using GEMM after quantization, i.e.:
</p>
$$\hat{\mathbf{Y}}=\hat{\mathbf{X}} \hat{\mathbf{W}}^\top.$$<p>Let the <strong>per-element</strong> quantization functions for $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{W}$ be denoted as $p_{nc}(\cdot)$, $q_{nd}(\cdot)$, $r_{dc}(\cdot)$ respectively:
</p>
$$\begin{aligned}
    \hat{x}_ {nc} &= p_ {nc}(x_{nc}), \\\\
    \hat{y}_ {nd} &= q_ {nd}(y_{nd}), \\\\
    \hat{w}_ {dc} &= r_ {dc}(w_{dc}).
\end{aligned}$$<p>
The corresponding dequantization functions are denoted as $p_ {nc}^{-1}(\cdot)$, $q_ {nd}^{-1}(\cdot)$, $r_ {dc}^{-1}(\cdot)$, i.e.:
</p>
$$\begin{aligned}
y_ {nd}
&= \sum_ {c=1}^{C} x_ {nc} w_ {dc}, \\\\
q_ {nd}^{-1}(\hat{y}_ {nd}) &= \sum_ {c=1}^{C} p_ {nc}^{-1}(\hat{x}_ {nc}) \cdot r_ {dc}^{-1}(\hat{w}_ {dc}).
\end{aligned}$$<p>
The above formulas set the <strong>basic constraints</strong> that <strong>practical quantization</strong> should satisfy mathematically.</p>
<h2 id="some-basic-quantization-methods">Some basic quantization methods</h2>
<p>With these basic constraints, we can now discuss several fundamental quantization methods, including per-element, per-channel, per-token, and per-tensor quantization.</p>
<h3 id="per-element-and-per-channel">Per-element and Per-channel</h3>
<p>In the basic constraints mentioned above, the dequantization function $q_ {nd}^{-1}(\cdot)$ on the left-hand side does not depend on $c$. Clearly, if the right-hand side dequantization functions $p_ {nc}^{-1}(\cdot)$ and $r_ {dc}^{-1}(\cdot)$ depend on $c$, <strong>this constraint will be violated</strong>, because the per-channel factors cannot be pulled out of the sum over $c$. This implies that these two conditions cannot be satisfied at the same time:</p>
<ol>
<li>Computation can be done by GEMM.</li>
<li>Different quantization functions can be applied in different channels of $\mathbf{X}$ and $\mathbf{W}$.</li>
</ol>
<p>In other words, this indicates that <strong>per-element and per-channel quantization cannot be accelerated using GEMM. They are impractical</strong>.</p>
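<p>The argument can be checked numerically on a single output element. The sketch below uses exact fractions and omits rounding so that only the algebra is visible; the values and scales are arbitrary choices for illustration.</p>

```python
from fractions import Fraction


def dot(a, b):
    """Inner product over the channel dimension c."""
    return sum(x * y for x, y in zip(a, b))


# One row of X and one row of W with C = 3 channels.
x = [Fraction(1), Fraction(2), Fraction(100)]
w = [Fraction(3), Fraction(-1), Fraction(2)]
y = dot(x, w)

# Per-token style: one scale for the row of X, one for the row of W.
p, r = Fraction(1, 4), Fraction(1, 2)
x_hat = [p * v for v in x]
w_hat = [r * v for v in w]
# The scales factor out of the sum, so one division after the
# GEMM-style dot product recovers y.
assert dot(x_hat, w_hat) / (p * r) == y

# Per-channel style: a different scale for every channel c of X.
s = [Fraction(1), Fraction(1), Fraction(1, 100)]
x_hat_c = [s_c * v for s_c, v in zip(s, x)]
# The factor needed to recover y from the plain dot product depends on
# the data, not only on the scales: it changes when only W changes.
w2 = [Fraction(1), Fraction(1), Fraction(1)]
ratio_1 = y / dot(x_hat_c, w)
ratio_2 = dot(x, w2) / dot(x_hat_c, w2)
print(ratio_1, ratio_2, ratio_1 == ratio_2)
```

<p>Since no fixed post-GEMM factor exists in the per-channel case, the correction would have to be applied inside the sum, which is exactly what a plain GEMM cannot do.</p>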
<h3 id="per-token-and-per-tensor">Per-token and per-tensor</h3>
<p>From the above discussion, we know that practical quantization needs to satisfy at least:
</p>
$$\begin{aligned}
    p_ {n}(\cdot) &= p_ {nc} (\cdot), \quad \forall n, c, \\\\
    r_ {d}(\cdot) &= r_ {dc} (\cdot), \quad \forall d, c.
\end{aligned}$$<p>
That is, the quantization function is the same for all channels. Therefore, the basic constraint can be formulated as:
</p>
$$q_ {nd}^{-1}(\hat{y}_ {nd}) = \sum_ {c=1}^{C} p_ {n}^{-1}(\hat{x}_ {nc}) \cdot r_ {d}^{-1}(\hat{w}_ {dc}).$$<p>
Thus, we get <strong>per-token quantization</strong>. If we further assume:
</p>
$$\begin{aligned}
    p(\cdot) &= p_ {nc} (\cdot), \quad \forall n, c, \\\\
    r(\cdot) &= r_ {dc} (\cdot), \quad \forall d, c.
\end{aligned}$$<p>
That is, the quantization function is the same for all elements in both $\mathbf{X}$ and $\mathbf{W}$. Therefore, the basic constraint can be formulated as:
</p>
$$q_ {nd}^{-1}(\hat{y}_ {nd}) = q^{-1}(\hat{y}_ {nd}) = \sum_ {c=1}^{C} p^{-1}(\hat{x}_ {nc}) \cdot r^{-1}(\hat{w}_ {dc}).$$<p>
We thus obtain <strong>per-tensor quantization</strong>. While both of these quantization methods are theoretically feasible, their practical value is still limited by hardware support (see the hardware requirements section below).</p>
<p>For convenience, the following discussion focuses only on per-token quantization. Per-tensor quantization can be seen as a special case of per-token quantization. The most commonly used quantization method in practice is <strong>symmetric uniform quantization</strong>, which scales the value range using multiplication, i.e.:
</p>
$$\begin{aligned}
    \hat{x}_ {nc} &= p_ {n}(x_ {nc}) = p_ n x_ {nc}, \\\\
    \hat{w}_ {dc} &= r_ {d}(w_ {dc}) = r_ d w_ {dc}, \\\\
    \hat{y}_ {nd} &= q_ {nd}(y_ {nd}) = p_ n r_ d y_ {nd}.
\end{aligned}$$<p>We can formulate per-token symmetric uniform quantization by matrix multiplication:
</p>
$$\begin{aligned}
    \hat{\mathbf{X}} &= \text{diag}(p_1,\cdots,p_ N)\cdot \mathbf{X} = \begin{pmatrix}
        p_ 1 & \cdots & p_ 1 \\\\
        \vdots & \ddots & \vdots \\\\
        p_ N & \cdots & p_ N
    \end{pmatrix} \otimes \mathbf{X}, \\\\
    \hat{\mathbf{W}} &= \text{diag}(r_1,\cdots,r_ D)\cdot \mathbf{W} = \begin{pmatrix}
        r_ 1 & \cdots & r_ 1 \\\\
        \vdots & \ddots & \vdots \\\\
        r_ D & \cdots & r_ D
    \end{pmatrix} \otimes \mathbf{W}, \\\\
    \hat{\mathbf{Y}} &= \text{diag}(p_1,\cdots,p_ N)\cdot \mathbf{Y} \cdot \text{diag}(r_1,\cdots,r_ D) = \begin{pmatrix}
        p_ 1 r_ 1 & \cdots & p_ 1 r_ D \\\\
        \vdots & \ddots & \vdots \\\\
        p_ N r_ 1 & \cdots & p_ N r_ D
    \end{pmatrix} \otimes \mathbf{Y},
\end{aligned}$$<p>
where $\otimes$ represents element-wise matrix multiplication. It can be observed that both quantization and dequantization <strong>can be efficiently implemented using element-wise matrix multiplication with dimension broadcasting</strong>. The following figure illustrates the computation process by an example:</p>
<p><img loading="lazy" src="/2024-03-06-quantization-gemm/quant_matrix.png"></p>
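<p>The same process can be sketched in plain Python: quantize each row of $\mathbf{X}$ and $\mathbf{W}$ with its own scale, run an integer-only GEMM, and dequantize with an element-wise broadcast. Choosing the scale as <code>qmax / max|row|</code> with rounding to the nearest integer is a common convention assumed here, and the matrices are made-up examples.</p>

```python
def quantize_rows(matrix, qmax=127):
    """Symmetric per-row quantization: returns (integer rows, scales).

    The scale of row i is qmax / max|row_i|; it plays the role of p_n
    for X and of r_d for W.
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = qmax / max_abs if max_abs else 1.0
        q_rows.append([round(v * scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def matmul_transposed(a, b):
    """Compute A @ B^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


# N = 2 tokens, C = 3 channels, D = 4 output features.
X = [[0.5, -1.0, 2.0], [10.0, 0.1, -0.2]]
W = [[1.0, 0.5, -0.5], [0.2, 0.2, 0.2], [-3.0, 1.0, 0.0], [0.0, 0.0, 4.0]]

X_hat, p = quantize_rows(X)
W_hat, r = quantize_rows(W)
Y_hat = matmul_transposed(X_hat, W_hat)  # integer-only GEMM
# Dequantize: y_nd = y_hat_nd / (p_n * r_d), an element-wise broadcast.
Y_deq = [[Y_hat[n][d] / (p[n] * r[d]) for d in range(len(W))]
         for n in range(len(X))]
Y_ref = matmul_transposed(X, W)
max_err = max(abs(a - b)
              for row_a, row_b in zip(Y_deq, Y_ref)
              for a, b in zip(row_a, row_b))
print(f"max abs error after dequantization: {max_err:.4f}")
```

<p>The second token contains one large value, so its small entries are quantized coarsely. This is the outlier problem that the LLM quantization methods below try to address.</p>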
<h2 id="hardware-requirements">Hardware requirements</h2>
<p>Hardware support still needs to be considered when we try to utilize GEMM for quantization. For example, on NVIDIA GPUs, Tensor Cores support matrix multiplication for FP16 and INT8, but they do not support mixed-precision matrix multiplication between FP16 and INT8 operands. This means that W8A8 quantization can benefit from Tensor Cores, but W8A16 and W16A8 quantization lack hardware support and may not achieve real acceleration on NVIDIA GPUs. Many W8A16 and W16A8 quantization methods actually perform dequantization before GEMM and then use FP16 for computation. The actual acceleration effects of these methods require further discussion (see below).</p>
<h2 id="performance-analysis">Performance analysis</h2>
<p>The above discussion only shows that per-token quantization can leverage GEMM. The following analysis examines whether it can provide actual acceleration.</p>
<p>We compare the following three setups:</p>
<ol>
<li>Unquantized, using FP16 for both storage and computation.</li>
<li>W8A8 quantization, with I/O activations stored in FP16. This is the approach used by some works like <code>LLM.int8()</code>. To avoid additional CUDA kernel launch overhead, we assume that quantization and dequantization are fused with GEMM.</li>
<li>W8A16 quantization, internally converting weights to FP16 for computation. Kernel fusion is also applied here.</li>
</ol>
<p>Without loss of generality, we can assume that the hardware INT8 throughput is $2\times$ that of FP16. We normalize the cost of one INT8 operation to $1$ and of one FP16 operation to $2$, and count memory I/O in bytes. This gives the following table:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Method</th>
					<th style="text-align: center">FP16</th>
					<th style="text-align: center">W8A8 (FP16 activations I/O)</th>
					<th style="text-align: center">W8A16</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center">GEMM OPs</td>
					<td style="text-align: center">$2NCD$</td>
					<td style="text-align: center">$NCD$</td>
					<td style="text-align: center">$2NCD$</td>
			</tr>
			<tr>
					<td style="text-align: center">GEMM mem I/O</td>
					<td style="text-align: center">$2(NC+CD+ND)$</td>
					<td style="text-align: center">$2NC+CD+2N D$</td>
					<td style="text-align: center">$2NC+CD+2ND$</td>
			</tr>
			<tr>
					<td style="text-align: center">quant/dequant OPs</td>
					<td style="text-align: center">$0$</td>
					<td style="text-align: center">$2NC+4ND$</td>
					<td style="text-align: center">$2CD$</td>
			</tr>
			<tr>
					<td style="text-align: center">quant/dequant Mem I/O</td>
					<td style="text-align: center">$0$</td>
					<td style="text-align: center">$2(N+D)$</td>
					<td style="text-align: center">$2D$</td>
			</tr>
			<tr>
					<td style="text-align: center">total OPs</td>
					<td style="text-align: center">$2NC D$</td>
					<td style="text-align: center">$NC D+2NC+4N D$</td>
					<td style="text-align: center">$2NCD+2CD$</td>
			</tr>
			<tr>
					<td style="text-align: center">total mem I/O</td>
					<td style="text-align: center">$2(NC+C D+N D)$</td>
					<td style="text-align: center">$2NC+C D+2N D+2(N+D)$</td>
					<td style="text-align: center">$2NC+CD+2ND+2D$</td>
			</tr>
			<tr>
					<td style="text-align: center">total arithmetic intensity (OPs:I/O)</td>
					<td style="text-align: center">$\cfrac{1}{1/N+1/C+1/D}$</td>
					<td style="text-align: center">$\cfrac{1+2/D+4/C}{1/N+2/C+2/D+2/(NC)+2/(CD)}$</td>
					<td style="text-align: center">$\cfrac{1+1/N}{1/(2N)+1/C+1/D+1/(NC)}$</td>
			</tr>
			<tr>
					<td style="text-align: center">total arithmetic intensity (second-order approximation)</td>
					<td style="text-align: center">$\cfrac{1}{1/N+1/C+1/D}$</td>
					<td style="text-align: center">$\cfrac{1}{1/N+2/C+2/D}$</td>
					<td style="text-align: center">$\cfrac{1}{1/(2N)+1/C+1/D}$</td>
			</tr>
	</tbody>
</table>
<p>Analyzing the table above, we can draw the following conclusions:</p>
<ol>
<li>W8A8 quantization (with FP16 activations I/O) reduces the operations by almost half compared to FP16, but it decreases the total arithmetic intensity. Therefore, in memory-bound scenarios, W8A8 quantization may not achieve a $2\times$ throughput improvement (ZeroQuant addresses this issue, as discussed below). But <strong>it can still lead to a significant throughput improvement when memory bandwidth is sufficient</strong>.</li>
<li>W8A16 quantization keeps roughly the same number of operations as FP16, but it increases the total arithmetic intensity (the gain is largest when $N$ is small, where weight traffic dominates). Therefore, <strong>it also has practical value in memory-bound scenarios</strong>, especially since activations in LLMs are typically harder to quantize than weights.</li>
</ol>
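<p>To get a feel for these conclusions, the sketch below evaluates the &ldquo;total OPs&rdquo; and &ldquo;total mem I/O&rdquo; rows of the table for a hypothetical layer with $C=D=4096$. It assumes the W8A8 quantization scales cost $2(N+D)$ bytes (one FP16 scale per token and one per output row); the function name and the layer sizes are illustrative.</p>

```python
def arithmetic_intensity(n, c, d):
    """Normalized OPs per byte for the three setups in the table.

    One INT8 operation counts as 1 and one FP16 operation as 2.
    """
    gemm = n * c * d
    fp16 = (2 * gemm) / (2 * (n * c + c * d + n * d))
    w8a8 = (gemm + 2 * n * c + 4 * n * d) / (
        2 * n * c + c * d + 2 * n * d + 2 * (n + d))
    w8a16 = (2 * gemm + 2 * c * d) / (
        2 * n * c + c * d + 2 * n * d + 2 * d)
    return {"FP16": fp16, "W8A8": w8a8, "W8A16": w8a16}


C = D = 4096  # hypothetical layer size
for n in (16, 256, 4096):
    ai = arithmetic_intensity(n, C, D)
    gain = ai["W8A16"] / ai["FP16"]
    print(f"N={n:5d}  FP16={ai['FP16']:8.1f}  W8A8={ai['W8A8']:8.1f}  "
          f"W8A16={ai['W8A16']:8.1f}  W8A16/FP16={gain:.2f}")
```

<p>Under these assumptions, the W8A16 gain over FP16 is close to $2\times$ for small $N$ and shrinks as $N$ grows, while W8A8 with FP16 activations I/O stays below FP16.</p>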
<h2 id="some-llm-quantization-works">Some LLM Quantization works</h2>
<h3 id="llmint8"><code>LLM.int8()</code></h3>
<p><code>LLM.int8()</code> actually employs selective per-token quantization. It stores weights and activations in FP16 and then applies different strategies for different tokens, as illustrated below:</p>
<p><img alt="LLM.int8()" loading="lazy" src="/2024-03-06-quantization-gemm/llm_int8.png"></p>
<ul>
<li>For tokens suitable for quantization, it applies per-token INT8 quantization to weights and activations, computes results using INT8 GEMM, and then dequantizes them to FP16.</li>
<li>For tokens with outliers, it directly computes the FP16 GEMM.</li>
</ul>
<p>The results from these two parts can be combined to form the final result.</p>
<h3 id="smoothquant">SmoothQuant</h3>
<p>Although per-channel quantization is impractical, the main difficulty in quantizing LLMs comes from the activations, where large-magnitude values (outliers) tend to appear in a few channels, as shown below:</p>
<p><img loading="lazy" src="/2024-03-06-quantization-gemm/smooth_quant_motivation.png"></p>
<p>SmoothQuant observed that these outliers occur consistently in specific channels, while outliers are rare in weights (thus easier to quantize). Therefore, it proposes to &ldquo;balance&rdquo; the quantization difficulty between activations and weights by introducing a per-channel scaling factor:</p>
<p><img alt="SmoothQuant" loading="lazy" src="/2024-03-06-quantization-gemm/smooth_quant.png"></p>
<p>This &ldquo;balance&rdquo; can be formulated as:
</p>
$$\begin{aligned}
    \mathbf{Y}
    &= \mathbf{X}\mathbf{W}^\top \\\\
    &= \mathbf{X} \cdot \text{diag}(s_ 1,\cdots,s_ C) \cdot \text{diag}(s_ 1,\cdots,s_ C)^{-1} \cdot \mathbf{W}^\top \\\\
    & = \left( \mathbf{X} \cdot \text{diag}(s_ 1,\cdots,s_ C) \right) \cdot \left( \mathbf{W}\cdot \text{diag}(s_ 1,\cdots,s_ C)^{-1} \right)^\top.
\end{aligned}$$<p>
By selecting appropriate scaling factors $\text{diag}(s_ 1,\cdots,s_ C)$, we can achieve the goal of balancing outlier values in activations, and then we can quantize $\mathbf{X} \cdot \text{diag}(s_ 1,\cdots,s_ C)$ and $\mathbf{W}\cdot \text{diag}(s_ 1,\cdots,s_ C)^{-1}$. The following figure gives an example:</p>
<p><img alt="SmoothQuant example" loading="lazy" src="/2024-03-06-quantization-gemm/smooth_quant_2.png"></p>
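<p>The identity behind this rescaling is easy to verify numerically. The sketch below uses exact fractions and hand-picked smoothing factors; how SmoothQuant actually chooses the factors is described in the paper and is not reproduced here, so the values of <code>s</code> are purely hypothetical.</p>

```python
from fractions import Fraction


def matmul_transposed(a, b):
    """Compute A @ B^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def scale_columns(matrix, factors):
    """Right-multiply by diag(factors): scale column c by factors[c]."""
    return [[v * f for v, f in zip(row, factors)] for row in matrix]


def max_abs(matrix):
    return max(abs(v) for row in matrix for v in row)


# The middle channel of X carries the outliers.
X = [[Fraction(1), Fraction(-64), Fraction(2)],
     [Fraction(-2), Fraction(48), Fraction(1)]]
W = [[Fraction(1), Fraction(1, 2), Fraction(-1)],
     [Fraction(2), Fraction(-1, 4), Fraction(1)]]

# Hypothetical factors: shrink the outlier channel of X by 16.
s = [Fraction(1), Fraction(1, 16), Fraction(1)]
s_inv = [1 / f for f in s]

X_smooth = scale_columns(X, s)      # X . diag(s)
W_smooth = scale_columns(W, s_inv)  # W . diag(s)^-1

# The product is unchanged, so the rescaling is mathematically free.
assert matmul_transposed(X_smooth, W_smooth) == matmul_transposed(X, W)
print("max |X| before and after:", max_abs(X), max_abs(X_smooth))
print("max |W| before and after:", max_abs(W), max_abs(W_smooth))
```

<p>The range of the activations shrinks while the range of the weights grows, which is the &ldquo;balance&rdquo; described above: part of the quantization difficulty moves from $\mathbf{X}$ to $\mathbf{W}$.</p>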
<p><strong>SmoothQuant is an excellent alternative to per-channel quantization</strong>, as demonstrated in the paper by its impressive performance in quantizing LLM to W8A8.</p>
<h3 id="zeroquant">ZeroQuant</h3>
<p>In the above performance analysis of W8A8, we found that using FP16 for activations I/O reduces the overall arithmetic intensity after quantization, which may harm the throughput improvement in memory-bound scenarios. ZeroQuant addresses this issue by fusing the quantization into the previous operator and fusing the dequantization after GEMM, as shown in the figure below.</p>
<p><img alt="ZeroQuant" loading="lazy" src="/2024-03-06-quantization-gemm/zero_quant.png"></p>
<p>Thus, the activations I/O between operators is INT8, which reduces the total memory I/O to $NC+CD+ND+2(N+D)$, restoring the arithmetic intensity to the original FP16 level and fully leveraging the high throughput of INT8.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This blog provides a matrix multiplication perspective for quantization, indicating some fundamental requirements for practical quantization and explaining why per-channel quantization is impractical. It also discusses several examples of LLM per-token quantization, including <code>LLM.int8()</code>, SmoothQuant, and ZeroQuant.
They are all practical and demonstrate significant acceleration in real-world scenarios.</p>
]]></content:encoded></item><item><title>NFS Performance Tuning</title><link>https://monsoon-cs.moe/2024-02-16-nfs-tuning/</link><pubDate>Fri, 16 Feb 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-02-16-nfs-tuning/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;This article is a guide to NFS performance tuning over a 10 Gbps network in production scenarios, which I have distilled from practice. It focuses in particular on optimizing the reading and writing of &lt;strong&gt;Lots of Small Files&lt;/strong&gt; (LOSF).&lt;/p&gt;
&lt;h2 id="tuning"&gt;Tuning&lt;/h2&gt;
&lt;h3 id="hardware"&gt;Hardware&lt;/h3&gt;
&lt;p&gt;On the network hardware side, both &lt;strong&gt;bandwidth&lt;/strong&gt; and &lt;strong&gt;latency&lt;/strong&gt; matter.&lt;/p&gt;
&lt;p&gt;To guarantee NFS performance, a high-bandwidth network is necessary. 10 Gbps is the baseline requirement for production scenarios; faster InfiniBand or RoCE networks can be chosen according to your needs and budget.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>This article is a guide to NFS performance tuning over a 10 Gbps network in production scenarios, which I have distilled from practice. It focuses in particular on optimizing the reading and writing of <strong>Lots of Small Files</strong> (LOSF).</p>
<h2 id="tuning">Tuning</h2>
<h3 id="hardware">Hardware</h3>
<p>On the network hardware side, both <strong>bandwidth</strong> and <strong>latency</strong> matter.</p>
<p>To guarantee NFS performance, a high-bandwidth network is necessary. 10 Gbps is the baseline requirement for production scenarios; faster InfiniBand or RoCE networks can be chosen according to your needs and budget.</p>
<p>For the <strong>Lots of Small Files</strong> (LOSF) scenario, <strong>latency is more important than bandwidth</strong>. Many tuning tutorials overlook this and focus only on sequential read/write performance; even when they test 4K random read/write, they use the <strong>wrong testing method</strong> (the correct method is given below).</p>
<p>The importance of latency lies in the fact that if a program&rsquo;s access to small files is <strong>intrinsically serialized</strong>, <strong>latency determines the upper bound of serialized IOPS</strong>. A latency of 0.1 ms caps serialized IOPS at 10k, while a latency of 1 ms corresponds to a cap of 1k.</p>
<p>Intrinsically serialized access scenarios are very common. For example, when the home directory is placed on NFS, the loading of oh-my-zsh and the loading of Python packages are both intrinsically serialized. A 1 ms network latency makes these programs unacceptably slow (e.g., executing <code>import torch</code> takes more than 30s).</p>
<p>Using a decent enterprise-grade switch and a properly configured network topology can minimize latency as much as possible. At the same time, the quality of optical modules and optical-to-electrical port modules can also have a huge impact on latency (the Chinet (中科光电) optical-to-electrical port module I originally used introduced an extra 0.1 ms of latency, causing IOPS to drop by 2/3).</p>
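<p>The arithmetic behind these numbers is simple enough to sketch. The 0.05 ms baseline below is an assumption, chosen because it is the value at which an extra 0.1 ms of latency produces the 2/3 drop mentioned above.</p>

```python
def serialized_iops_cap(latency_ms):
    """Upper bound on IOPS when each request waits for the previous one."""
    return 1000.0 / latency_ms


for latency_ms in (0.05, 0.1, 0.15, 1.0):
    cap = serialized_iops_cap(latency_ms)
    print(f"{latency_ms} ms per request: at most {cap:,.0f} IOPS")

# Adding 0.1 ms to an assumed 0.05 ms baseline removes two thirds of the cap.
drop = 1 - serialized_iops_cap(0.15) / serialized_iops_cap(0.05)
print(f"relative drop: {drop:.0%}")
```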
<p>It should be noted that although RDMA can theoretically reduce latency, in actual testing I found that the difference in serialized IOPS between 10 Gbps Ethernet and 100 Gbps InfiniBand is not large; when the budget is limited, using only Ethernet is sufficient.</p>
<p>TODO: jumbo frames</p>
<h3 id="linux-kernel">Linux Kernel</h3>
<p>The kernel network parameters need to be adjusted to suit a high-speed network:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="c1"># Ref: https://gist.github.com/mizanRahman/40ba603759bfb5153189ccdc9dbbd1e4</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Disable TCP slow start on idle connections</span>
</span></span><span class="line"><span class="cl"><span class="na">net.ipv4.tcp_slow_start_after_idle</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Increase Linux autotuning TCP buffer limits</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Don&#39;t set tcp_mem itself! Let the kernel scale it based on RAM.</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.rmem_max</span> <span class="o">=</span> <span class="s">56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.wmem_max</span> <span class="o">=</span> <span class="s">56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.rmem_default</span> <span class="o">=</span> <span class="s">56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.wmem_default</span> <span class="o">=</span> <span class="s">56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.optmem_max</span> <span class="o">=</span> <span class="s">40960</span>
</span></span><span class="line"><span class="cl"><span class="na">net.ipv4.tcp_rmem</span> <span class="o">=</span> <span class="s">4096 87380 56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.ipv4.tcp_wmem</span> <span class="o">=</span> <span class="s">4096 65536 56623104</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># TCP Congestion Control</span>
</span></span><span class="line"><span class="cl"><span class="na">net.ipv4.tcp_congestion_control</span> <span class="o">=</span> <span class="s">bbr</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.default_qdisc</span> <span class="o">=</span> <span class="s">cake</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>This set of settings needs to be applied on both the server and the client; it can be written into <code>/etc/sysctl.conf</code> to make it persistent.</p>
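<p>After applying the settings, it can be worth confirming that the running kernel actually uses them. The sketch below relies on the fact that each sysctl key is exposed as a file under <code>/proc/sys</code> with dots replaced by slashes; the helper names and the selection of keys are my own choices for illustration.</p>

```python
from pathlib import Path

EXPECTED = {
    "net.ipv4.tcp_slow_start_after_idle": "0",
    "net.core.rmem_max": "56623104",
    "net.core.wmem_max": "56623104",
    "net.ipv4.tcp_congestion_control": "bbr",
    "net.core.default_qdisc": "cake",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as net.core.rmem_max to its file under root."""
    return Path(root) / key.replace(".", "/")


def check_sysctls(expected, root="/proc/sys"):
    """Return {key: (expected, actual)} for every key that does not match.

    actual is None when the file cannot be read, for example because the
    corresponding kernel module is not loaded.
    """
    mismatches = {}
    for key, want in expected.items():
        try:
            # Multi-value entries are tab-separated; normalize whitespace.
            actual = " ".join(sysctl_path(key, root).read_text().split())
        except OSError:
            actual = None
        if actual != want:
            mismatches[key] = (want, actual)
    return mismatches


if __name__ == "__main__":
    for key, (want, actual) in check_sysctls(EXPECTED).items():
        print(f"{key}: expected {want!r}, found {actual!r}")
```

<p>An empty result means every listed value is in effect. A value of <code>None</code> for <code>tcp_congestion_control</code> or <code>default_qdisc</code> usually means the kernel module is not available.</p>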
<h3 id="server-side">Server Side</h3>
<p>The number of NFS server threads can be set generously; more threads help when the server load is relatively high, and I simply set it to the number of CPU threads on the server. Modify <code>/etc/nfs.conf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[nfsd]</span>
</span></span><span class="line"><span class="cl"><span class="na">threads</span><span class="o">=</span><span class="s">128</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The following NFS server parameters need to be adjusted:</p>
<ul>
<li><code>async</code>: treats synchronous I/O operations as asynchronous. For workloads dominated by synchronous reads/writes this can greatly improve performance, but it may cause data loss when the server crashes; it is not recommended when there are extremely high requirements for data integrity;</li>
<li><code>no_subtree_check</code>: has no major impact on performance, but in some cases it can improve reliability (with a slight security risk at the same time). See [1].</li>
</ul>
<h3 id="client-side">Client Side</h3>
<p>Unless there is a specific reason not to, use the latest NFSv4.2 by default. When NFSv3 runs over UDP, it can silently corrupt data on high-speed networks because the 16-bit IP fragment ID can wrap around during reassembly; see [2].</p>
<p>The following NFS client parameters need to be adjusted:</p>
<ul>
<li><code>proto=rdma</code>: set when the network supports RDMA;</li>
<li><code>nocto</code>: disables close-to-open cache consistency semantics. The default NFS behavior is to write all changes back to the server when a file is closed. If you have relatively high requirements for file consistency across multiple clients, this option is not recommended;</li>
<li><code>ac</code>: enables attribute caching, so the client caches file attributes. Likewise, for clusters with high requirements for data consistency, this option is not recommended;</li>
<li><code>fsc</code>: uses FS-Cache to cache data locally. You also need to <a href="https://github.com/jnsnow/cachefilesd">configure cachefilesd</a>. Strangely, in my testing I did not find data being cached locally; this may require further investigation;</li>
<li><code>nconnect=16</code>: sets up 16 TCP connections between the NFS client and server. By default the NFS client establishes only one TCP connection, and all RPCs are multiplexed over this connection. In some cases this limits the bandwidth of sequential reads/writes. Increasing <code>nconnect</code> (maximum value 16) can solve this problem.</li>
</ul>
<p>In particular, the <code>noatime</code> / <code>relatime</code> settings have no effect on NFS [3]; the NFS client always caches atime changes.</p>
<p>Some tutorials recommend modifying <code>rsize</code> and <code>wsize</code>. In NFSv4.2 these two values are already negotiated to their maximum value <code>1048576</code> by default, so there is no need to change them manually; you only need to check whether they were negotiated correctly.</p>
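<p>One way to check the negotiated values is to read the mount options that the kernel reports in <code>/proc/mounts</code>. The sketch below parses that format (device, mount point, filesystem type, options, and two trailing fields); the function name is my own and the sample line, including the server name and paths, is made up.</p>

```python
def nfs_mount_options(mounts_text):
    """Parse /proc/mounts content into {mount_point: {option: value}}.

    Only NFS mounts are returned. Flag-style options without a value
    (for example "rw") map to True.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            name, sep, value = item.partition("=")
            options[name] = value if sep else True
        result[fields[1]] = options
    return result


# Hypothetical /proc/mounts content.
SAMPLE = (
    "nfs-server:/export /home nfs4 "
    "rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,proto=tcp 0 0\n"
    "tmpfs /run tmpfs rw,nosuid 0 0\n"
)

for mount_point, opts in nfs_mount_options(SAMPLE).items():
    at_max = opts.get("rsize") == "1048576" and opts.get("wsize") == "1048576"
    status = "rsize/wsize at maximum" if at_max else "rsize/wsize reduced"
    print(mount_point, status)
```

<p>On a real client, pass the content of <code>/proc/mounts</code> instead of <code>SAMPLE</code>.</p>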
<p>According to [4], the size of the sunrpc slot table may affect performance. The <code>sunrpc.tcp_slot_table_entries</code> parameter (default <code>2</code>) can be increased appropriately. In my testing, NFS would sometimes hang under a sustained small-file workload on the order of tens of millions of accesses; increasing this parameter resolved the problem. Add the following to <code>/etc/modprobe.d/sunrpc.conf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="na">options sunrpc tcp_slot_table_entries</span><span class="o">=</span><span class="s">16384</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Sometimes I encounter a problem where <code>nfsd</code> consumes a large amount of CPU and performance drops sharply, while a large number of <code>delegreturn</code> RPC calls are recorded. According to [5], this can be resolved by disabling <code>fs.leases-enable</code>. Set <code>/etc/sysctl.conf</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="na">fs.leases-enable</span> <span class="o">=</span> <span class="s">0</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>When <code>nfsd</code> restarts for one reason or another, by default there is a 90s grace period for lock recovery, during which <code>nfsd</code> rejects all <code>open</code> requests, shown in the kernel log as:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">[1073511.138061] NFSD: starting 90-second grace period (net f0000000)
</span></span></code></pre></td></tr></table>
</div>
</div><p>In practice I found that this period can be reduced appropriately to lessen the impact of <code>nfsd</code> restarts. Set <code>/etc/default/nfs-kernel-server</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Options for rpc.svcgssd.</span>
</span></span><span class="line"><span class="cl"><span class="nv">RPCSVCGSSDOPTS</span><span class="o">=</span><span class="s2">&#34;--lease-time 10 --grace-time 10&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="testing">Testing</h2>
<p>TODO</p>
<h2 id="conclusion">Conclusion</h2>
<p>TODO</p>
<h2 id="references">References</h2>
<p>[1] <a href="https://man.archlinux.org/man/exports.5.en#no_subtree_check">https://man.archlinux.org/man/exports.5.en#no_subtree_check</a></p>
<p>[2] <a href="https://man.archlinux.org/man/nfs.5.en#Using_NFS_over_UDP_on_high-speed_links">https://man.archlinux.org/man/nfs.5.en#Using_NFS_over_UDP_on_high-speed_links</a></p>
<p>[3] <a href="https://man.archlinux.org/man/nfs.5.en#File_timestamp_maintenance">https://man.archlinux.org/man/nfs.5.en#File_timestamp_maintenance</a></p>
<p>[4] <a href="https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-linux-concurrency-session-slots">https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-linux-concurrency-session-slots</a></p>
<p>[5] <a href="https://docs.gitlab.com/ee/administration/nfs.html#disable-nfs-server-delegation">https://docs.gitlab.com/ee/administration/nfs.html#disable-nfs-server-delegation</a></p>
]]></content:encoded></item><item><title>[Paper Reading] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs (arXiv'24)</title><link>https://monsoon-cs.moe/2024-02-07-paper-reading-arxiv24-acs/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-02-07-paper-reading-arxiv24-acs/</guid><description>&lt;blockquote&gt;
&lt;p&gt;This blog is a write-up of the paper &amp;ldquo;&lt;a href="https://arxiv.org/abs/2401.12377"&gt;ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs&lt;/a&gt;&amp;rdquo; from arXiv'24.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Some workloads (e.g., Simulation Engines for Deep RL, Dynamic DNNs) cannot fully utilize the massive parallelism of GPUs (see Figure 1). The main reason is that these workloads contain lots of &lt;strong&gt;small kernels&lt;/strong&gt; which cannot fully utilize the GPU, and these kernels are not executed concurrently, although &lt;strong&gt;most of them are independent and in theory can be executed concurrently&lt;/strong&gt;.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
<p>This blog is a write-up of the paper &ldquo;<a href="https://arxiv.org/abs/2401.12377">ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs</a>&rdquo; from arXiv'24.</p>
</blockquote>
<h2 id="motivation">Motivation</h2>
<p>Some workloads (e.g., Simulation Engines for Deep RL, Dynamic DNNs) cannot fully utilize the massive parallelism of GPUs (see Figure 1). The main reason is that these workloads contain lots of <strong>small kernels</strong> which cannot fully utilize the GPU, and these kernels are not executed concurrently, although <strong>most of them are independent and in theory can be executed concurrently</strong>.</p>
<p><img alt="Figure 1. Achieved Occupancy of simulation engines (up) and dynamic DNN (down)" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/achieved_occ.png"></p>
<p>But there are some challenges to execute these kernels concurrently:</p>
<ol>
<li><strong>Input-dependent kernel dependencies</strong>. For some workloads, the dependencies between kernels are only <strong>determined at runtime</strong> for each input. Constructing the full computational graph and resolving dependencies before execution introduces <strong>high latency</strong> (see Figure 2; on average 47% of the overall execution time, according to the paper).</li>
</ol>
<p><img alt="Figure 2. DAG construction time as % of execution time" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/dag_time.png"></p>
<ol start="2">
<li><strong>Irregular kernel dependencies</strong>. Some workloads have irregular computational graphs. We can partition the computational graph of the workload into independent streams of kernels, but this requires <strong>fine-grained scheduling</strong> and <strong>synchronization</strong>, which incur <strong>large overhead</strong> (see Figure 3).</li>
</ol>
<p><img alt="Figure 3. Kernel launch and synchronization overheads" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/sync_overhead.png"></p>
<p>Existing solutions:</p>
<ol>
<li>
<p>CUDA Graph and AMD ATMI. They allow users to specify dependencies between kernels as a DAG, and can eliminate the synchronization and kernel launch overhead. But the DAG needs to be constructed in <strong>full before execution</strong>, which makes them unsuitable for dynamic kernel dependencies (such as Dynamic DNNs).</p>
</li>
<li>
<p>Using events provided by the CUDA stream management API, which allows synchronization between kernels across streams through the <code>cudaStreamWaitEvent</code> API, without blocking the host. But this approach still requires deriving dependencies between all kernels beforehand.</p>
</li>
<li>
<p>Persistent threads (PT) can eliminate the scheduling and launch overheads, but are only effective when all kernels are homogeneous.</p>
<blockquote>
<p>PT is similar to coroutines in some programming languages.</p>
</blockquote>
</li>
<li>
<p>CUDA dynamic parallelism (CDP) or AMD’s device enqueue (DE) enables parent kernels to launch child kernels, but only allows data dependencies between one parent and its children (so it cannot be used to synchronize between multiple tasks).</p>
</li>
</ol>
<h2 id="design">Design</h2>
<p>The <strong>goal</strong> of this paper is to design a framework that enables efficient concurrent execution of GPU kernels with:</p>
<ol>
<li>
<p>lightweight detection of inter-kernel dependencies at runtime,</p>
</li>
<li>
<p>low overhead kernel scheduling and synchronization.</p>
</li>
</ol>
<p><strong>The key idea is to perform the dependence checking and scheduling within a small window of kernels at runtime similar to out-of-order instruction scheduling.</strong></p>
<p>The authors proposed Automatic Concurrent Scheduling (ACS) as a solution. The overall design of ACS-SW is shown in Figure 4. It contains three main functionalities:</p>
<p><img alt="Figure 4. ACS-SW Overview" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/acs_overview.png"></p>
<ol>
<li>
<p><strong>Determining inter-kernel dependencies</strong>. By checking for <strong>overlaps between read segments and write segments</strong>, we determine dependencies between kernels. For a wide range of commonly used kernels (e.g., matrix multiplication, convolution), we can infer the read and write segments from the input easily. But for some kernels, it&rsquo;s impossible to determine the range of memory accessed statically because of the potential indirect memory accesses, so the authors just assume the <strong>entire GPU memory may be accessed</strong>.</p>
<p><img alt="Memory regions written to/accessed by the kernel" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/seg.png"></p>
<p>The authors use a kernel wrapper to perform the dependency detection. <code>get_addresses()</code> is called at kernel launch to populate <code>__read_segments__</code> and <code>__write_segments__</code>.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">ACE_wrapper</span> <span class="p">{</span> 
</span></span><span class="line"><span class="cl">  <span class="c1">//list of read,write segments defined as
</span></span></span><span class="line"><span class="cl">  <span class="c1">//[{start_adr1,size1},{start_adr2,size2}..]
</span></span></span><span class="line"><span class="cl">  <span class="n">list</span> <span class="n">__read_segments__</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">  <span class="n">list</span> <span class="n">__write_segments__</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">  <span class="c1">// function which gets called at kernel
</span></span></span><span class="line"><span class="cl">  <span class="c1">// launch to populate read,write segments
</span></span></span><span class="line"><span class="cl">  <span class="kt">void</span> <span class="nf">get_addresses</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">dim3</span> <span class="n">blocks</span><span class="p">,</span> <span class="n">dim3</span> <span class="n">threads</span><span class="p">,</span> <span class="p">...</span>
</span></span><span class="line"><span class="cl">  <span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="c1">// function declaration of the kernel
</span></span></span><span class="line"><span class="cl">  <span class="k">static</span> <span class="n">__global__</span> <span class="kt">void</span> <span class="nf">kernel</span><span class="p">(...);</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></td></tr></table>
</div>
</div></li>
<li>
<p>Tracking kernel state at runtime. Each kernel in the window is in one of three states:</p>
<ol>
<li><strong>Ready</strong>: all kernels it depends on have completed execution.</li>
<li><strong>Pending</strong>: upstream kernels are still pending or executing.</li>
<li><strong>Executing</strong>.</li>
</ol>
</li>
</ol>
<p><img alt="Kernels in the scheduling window with their state and corresponding upstream kernels" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/window.png"></p>
<ol start="3">
<li>Eliminating CPU synchronization overheads. See ACS-HW for more details.</li>
</ol>
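<p>The segment-overlap check described above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the dictionary layout, the function names, and the assumption that a dependency means any read-after-write, write-after-read, or write-after-write overlap are mine.</p>

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_address, size_in_bytes)


def segments_overlap(a: Segment, b: Segment) -> bool:
    """Return True if the half-open ranges [start, start + size) intersect."""
    a_start, a_size = a
    b_start, b_size = b
    return a_start < b_start + b_size and b_start < a_start + a_size


def any_overlap(xs: List[Segment], ys: List[Segment]) -> bool:
    """Return True if any segment in xs intersects any segment in ys."""
    return any(segments_overlap(x, y) for x in xs for y in ys)


def depends_on(later: dict, earlier: dict) -> bool:
    """Assumed rule: a later kernel depends on an earlier one on any hazard."""
    return (
        any_overlap(later["reads"], earlier["writes"])      # read-after-write
        or any_overlap(later["writes"], earlier["reads"])   # write-after-read
        or any_overlap(later["writes"], earlier["writes"])  # write-after-write
    )


# Hypothetical kernels: the consumer reads part of the producer's output.
producer = {"reads": [(0, 100)], "writes": [(100, 100)]}
consumer = {"reads": [(150, 20)], "writes": [(300, 10)]}
unrelated = {"reads": [(400, 10)], "writes": [(500, 10)]}
```

<p>Here <code>depends_on(consumer, producer)</code> holds while <code>depends_on(unrelated, producer)</code> does not. A kernel whose accesses cannot be inferred could be modeled conservatively with a single segment that covers the whole GPU memory, which makes it depend on every earlier kernel.</p>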
<p>ACS has two variants:</p>
<ol>
<li>
<p>ACS-SW: software-only implementation which emulates the out-of-order kernel scheduling mechanism.</p>
</li>
<li>
<p>ACS-HW: hardware-facilitated implementation which is more efficient as it also alleviates synchronization overheads.</p>
</li>
</ol>
<h3 id="acs-sw">ACS-SW</h3>
<h4 id="window-module">Window Module</h4>
<p>This module determines inter-kernel dependencies. It is implemented as a separate thread that manages the input FIFO queue and the scheduling window. The kernel state tracking is implemented in software.</p>
<h4 id="scheduler-module">Scheduler Module</h4>
<p>This module schedules and launches ready kernels for execution. It has a fixed number of CUDA streams. Each stream contains only one kernel at any given time. Threads with empty streams poll the scheduling window for a ready kernel.</p>
<p><img alt="ACS-SW: The scheduler module" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/acs_hw_scheduler.png"></p>
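<p>To make the interplay of the FIFO queue, the scheduling window, and the streams concrete, here is a toy simulation. It is a sketch under strong assumptions of my own: every kernel takes exactly one time step, all launched kernels finish together, and the names <code>schedule</code>, <code>window_size</code>, and <code>num_streams</code> are hypothetical.</p>

```python
from collections import deque


def schedule(kernels, deps, window_size=4, num_streams=2):
    """Simulate windowed out-of-order kernel scheduling in discrete steps.

    kernels: kernel names in program (FIFO) order.
    deps: maps a kernel name to the set of earlier kernels it depends on.
    Returns a list of steps; each step lists the kernels launched together.
    """
    fifo = deque(kernels)
    window = []      # kernels admitted to the window but not yet launched
    done = set()     # kernels that have completed execution
    steps = []
    while fifo or window:
        # Refill the scheduling window from the input FIFO queue.
        while fifo and len(window) < window_size:
            window.append(fifo.popleft())
        # A kernel is ready once all of its upstream kernels have completed.
        ready = [k for k in window if deps.get(k, set()).issubset(done)]
        launched = ready[:num_streams]   # at most one kernel per stream
        if not launched:
            raise ValueError("no ready kernel: dependencies are unsatisfiable")
        for k in launched:
            window.remove(k)
        steps.append(launched)
        done.update(launched)            # simplification: all finish this step
    return steps


# Two independent chains: A then B, and C then D.
example_deps = {"B": {"A"}, "D": {"C"}}
example_steps = schedule(["A", "B", "C", "D"], example_deps)
```

<p>With a window of four kernels, the two chains interleave and finish in two steps (<code>A</code> with <code>C</code>, then <code>B</code> with <code>D</code>). With <code>window_size=1</code> the scheduler degenerates to in-order execution and needs four steps.</p>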
<h3 id="acs-hw">ACS-HW</h3>
<p>ACS-SW incurs kernel synchronization and launch overheads because the scheduler module launches kernels from the CPU. ACS-HW solves these problems with a software-hardware co-design.</p>
<p><img alt="ACS-HW Overview" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/acs_hw.png"></p>
<p>Software-side: maintains an input FIFO queue like ACS-SW, and a list of kernels in the GPU’s scheduling window, <strong>but this list can be stale</strong>.</p>
<p>Hardware-side: the scheduling window and its management are implemented in hardware on the GPU side.</p>
<p>A key novelty in the hardware design is <strong>two-stage dependency detection</strong>. First, ACS uses software to perform an initial detection using stale kernel information (avoiding frequent synchronization overhead), then uses hardware to correct outdated dependency information. This two-stage approach significantly reduces the hardware complexity.</p>
<p><img alt="ACS-HW Scheduler" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/hw_scheduler.png"></p>
<h2 id="evaluation">Evaluation</h2>
<ol>
<li>Baseline: cuDNN implementation (for DNNs) and a JAX implementation (for deep RL simulation), both using CUDA streams.</li>
<li>ACS-SW: on real hardware.</li>
<li>ACS-SW-Sim: ACS-SW on the GPU simulator.</li>
<li>ACS-HW: on the GPU simulator.</li>
<li>CUDAGraph.</li>
</ol>
<p><img alt="Deep RL physics simulations: Normalized Speedup" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_deep_rl.png"></p>
<p><img alt="Deep RL physics simulations: Normalized Speedup on GPU simulator" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_deep_rl_sim.png"></p>
<p><img alt="Deep RL physics simulations: Achieved occupancy" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_deep_rl_occ.png"></p>
<p><img alt="Dynamic DNNs: Normalized speedup" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_dcnn.png"></p>
<p><img alt="Dynamic DNNs: Achieved occupancy" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_dcnn_occ.png"></p>
<p><img alt="Static DNNs: Normalized speedup" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_scnn.png"></p>
<p><img alt="Static DNNs: Achieved occupancy" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_scnn_occ.png"></p>
<h2 id="comments">Comments</h2>
<h3 id="strengths">Strengths</h3>
<p>This paper focuses on the problem of low GPU utilization caused by the serial execution of numerous small CUDA kernels. I believe this paper effectively addresses this problem, particularly with the following innovative points that impressed me:</p>
<ol>
<li>
<p><strong>Out-of-order dependency detection and scheduling</strong>. Out-of-order (OoO) is a common technique in micro-architecture and software (e.g., hard disk I/O queue) designs. It&rsquo;s an impressive and innovative idea to introduce OoO into this area to find the dynamic dependencies efficiently.</p>
</li>
<li>
<p>A good <strong>trade-off</strong>. When I first read the Introduction section of the paper, I thought that detecting read-write dependencies would be a difficult task. To my knowledge, there are no reliable techniques for static memory access analysis of binaries (otherwise, segmentation faults wouldn&rsquo;t be a common problem). However, the authors made a good <strong>simplification</strong> and <strong>trade-off</strong> regarding this problem. For most common kernels, memory access areas can be inferred from input parameters. The remaining kernels are assumed to access the entire memory. Since a few common operators account for most of the execution time, this trade-off leads to significant performance improvements with a relatively low scheduling overhead. This innovation is my <strong>favorite</strong> aspect of this paper.</p>
</li>
<li>
<p><strong>Two-stage dependency detection</strong> in ACS-HW. While a complete hardware dependency detection approach is theoretically feasible, it could incur significant <strong>chip area costs</strong> (as we know, the re-order buffer in a microprocessor occupies a large area). The authors proposed a two-stage, software-hardware co-designed dependency detection, which significantly simplifies the hardware design. It is a brilliant idea.</p>
</li>
</ol>
<h3 id="weaknesses">Weaknesses</h3>
<p>This paper has some potential weaknesses:</p>
<ol>
<li>
<p>For each type of kernel, we must write a custom <code>get_addresses</code> function in the kernel wrapper. This requirement may limit the adoption of ACS.</p>
</li>
<li>
<p>Deciding whether kernels should be executed concurrently requires considering <strong>more factors</strong> than just data dependencies. If there are resource conflicts (e.g., memory bandwidth, shared memory size) between two <strong>large kernels</strong>, performance may degrade if they co-execute.</p>
</li>
</ol>
<h3 id="improvements">Improvements</h3>
<p>I propose some potential improvements to this paper:</p>
<ol>
<li>
<p>In response to the first weakness mentioned above, I propose a <strong>profiling-rollback</strong> strategy to achieve safe automatic dependency detection. This strategy leverages the <strong>paging</strong> technique commonly used in OS virtual memory management: we can mark a memory page as <strong>read-only</strong> or <strong>write-only</strong>, so that a <strong>page fault</strong> triggered while the program is running tells us that a read or write has occurred. While I&rsquo;m unsure if Nvidia GPUs provide APIs for users to control page tables, let&rsquo;s assume such APIs exist. Given that many workloads are iterative (e.g., neural network training), we can profile the workload for just one iteration, using the aforementioned paging trick to <strong>record the memory access segments</strong> of each kernel. Since this may introduce some inaccuracies, we need a <strong>rollback strategy</strong> to ensure correct program execution. At runtime, we set the known <code>__write_segments__</code> as read-write, while other areas are set as read-only. Upon encountering a page fault, we know that the profile was wrong and revert to the default strategy (assuming all memory areas will be read and written). With this strategy, we can eliminate the need for a manual <code>get_addresses</code> function and maximize the potential parallelism.</p>
</li>
<li>
<p>Regarding the second weakness, I suggest adopting the method of <strong>GPUPool</strong> to determine which kernels are suitable for concurrent execution. A naive solution involves tracking the number of SMs each kernel occupies. When the SMs of a GPU are fully occupied, even if there are kernels in the <code>ready</code> state and available CUDA streams, no new kernels are scheduled.</p>
</li>
</ol>
]]></content:encoded></item><item><title>[Paper Reading] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud (PACT'22)</title><link>https://monsoon-cs.moe/2024-02-07-paper-reading-pact22-gpupool/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-02-07-paper-reading-pact22-gpupool/</guid><description>&lt;blockquote&gt;
&lt;p&gt;This blog is a write-up of the paper &amp;ldquo;&lt;a href="https://dl.acm.org/doi/10.1145/3559009.3569650"&gt;GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud&lt;/a&gt;&amp;rdquo; from PACT'22.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;This paper focuses on the &lt;strong&gt;GPU sharing in cloud scenarios&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Currently, existing GPU sharing techniques can be categorized into 2 types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Time-sharing&lt;/strong&gt; means executing each concurrent VM on a full device in a round-robin fashion. &lt;strong&gt;Pros&lt;/strong&gt;: Simple and mature. &lt;strong&gt;Cons&lt;/strong&gt;: VMs could still under-utilize the hardware within each time slice.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
<p>This blog is a write-up of the paper &ldquo;<a href="https://dl.acm.org/doi/10.1145/3559009.3569650">GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud</a>&rdquo; from PACT'22.</p>
</blockquote>
<h2 id="motivation">Motivation</h2>
<p>This paper focuses on the <strong>GPU sharing in cloud scenarios</strong>.</p>
<p>Currently, existing GPU sharing techniques can be categorized into 2 types:</p>
<ul>
<li>
<p><strong>Time-sharing</strong> means executing each concurrent VM on a full device in a round-robin fashion. <strong>Pros</strong>: Simple and mature. <strong>Cons</strong>: VMs could still under-utilize the hardware within each time slice.</p>
</li>
<li>
<p><strong>Space-sharing</strong>: splits a device into partitions and allows multiple workloads to execute on different partitions simultaneously.</p>
</li>
</ul>
<p>Space-sharing can be categorized into 2 types:</p>
<ul>
<li>
<p><strong>Coarse-grained</strong> assigns disjoint sets of streaming multiprocessors (SMs) and memory channels to concurrent workloads. For example, Nvidia MIG. <strong>Pros</strong>: offers great performance isolation among tenants of the same GPU. <strong>Cons</strong>: (i) resource under-utilization within each SM consisting of heterogeneous functional units (e.g., FP32, INT, FP64, Tensor Cores) meant for different workload types. (ii) inefficient memory bandwidth usage caused by the bursty nature of GPU memory traffic.</p>
</li>
<li>
<p><strong>Fine-grained</strong> allows different workloads to co-run on the same SMs and request memory bandwidth flexibly, such as CUDA Stream and MPS. <strong>Pros</strong>: Better hardware utilization.</p>
</li>
</ul>
<p>The key problem of GPU sharing in data centers is <strong>performance unpredictability</strong>. It involves 2 <strong>key challenges</strong>:</p>
<ol>
<li>
<p><strong>Mitigating interference</strong>. The amount of performance improvement from fine-grained sharing varies drastically depending on how compatible the concurrent workloads are in terms of resource usage. Also, the interference cannot be statically estimated. So, <strong>it is non-trivial to determine compatibility</strong> among a large number of incoming jobs in the cluster.</p>
</li>
<li>
<p><strong>Providing QoS guarantees</strong>.</p>
</li>
</ol>
<p>Existing solutions:</p>
<ul>
<li>
<p><strong>Software-based</strong>: kernel slicing or a persistent thread model. <strong>Cons</strong>: high scheduling overhead.</p>
</li>
<li>
<p><strong>Hardware-based</strong>: integrate sophisticated resource management logic into hardware to allocate resources for concurrent kernels. <strong>Cons</strong>: expensive and also inflexible.</p>
</li>
</ul>
<p>Common problems of existing solutions:</p>
<ol>
<li>
<p>They do not address interference mitigation at the cluster level.</p>
</li>
<li>
<p>They do not handle scenarios where incoming jobs must be distributed among multiple GPUs to satisfy QoS constraints.</p>
</li>
</ol>
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/tb_sm.png"></p>
<center>Figure 1. Simulated system throughput of co-running <code>parb_spmv</code> and <code>rod_hotspot</code> at various TBs/SM settings</center>
<p><strong>Problems of hardware TB scheduler</strong> which hinder the fine-grained sharing:</p>
<ol>
<li>
<p>It always attempts to <strong>launch as many thread blocks per SM</strong> (TBs/SM) for each kernel as allowed by the execution context storage constraints (e.g., registers, shared memory, thread slots). <strong>It leaves insufficient resources for concurrent kernels</strong>. As shown in Figure 1, if we can set the TBs/SM individually for each kernel, we may achieve higher throughput.</p>
</li>
<li>
<p>It only dispatches concurrent kernels onto SMs after the earlier arriving one completes launching all the thread blocks specified by the kernel grid size. This forces <strong>almost serial execution</strong> of kernels in some scenarios.</p>
</li>
</ol>
<p>GPU applications in the cloud fall into two main categories: latency-sensitive, and <strong>throughput-oriented</strong>. Throughput-oriented workloads are good candidates for hardware space-sharing. They have the following characteristics:</p>
<ol>
<li>
<p>Most workloads involve a large variety of kernels with <strong>different hardware resource utilization</strong> characteristics (e.g., CNN: compute-intensive, batch-norm: memory-intensive).</p>
</li>
<li>
<p>Active SMs are <strong>underutilized</strong> in some resources (FP, tensor core, memory bandwidth).</p>
</li>
<li>
<p>They typically repeatedly execute the same sequence of kernels (e.g., ML).</p>
</li>
<li>
<p>Relaxed QoS Requirements.</p>
</li>
</ol>
<h2 id="design">Design</h2>
<p>This paper proposed a <strong>hardware-software co-designed</strong> strategy to solve these challenges.</p>
<h3 id="hardware">Hardware</h3>
<p>This paper changes the default behavior of CUDA runtime to make it more suitable for fine-grained sharing:</p>
<ol>
<li>
<p>Allows CUDA runtime to program the <strong>TBs/SM setting</strong> as one of the kernel launch parameters. The value of TBs/SM is selected by the performance predictor.</p>
</li>
<li>
<p>Makes the TB scheduler <strong>launch TBs from any concurrent kernels</strong> whenever they are running under their TBs/SM quota.</p>
</li>
</ol>
<h3 id="software">Software</h3>
<blockquote>
<p>Concept Explanation:</p>
<ul>
<li>Job: a task submitted by user, such as a DNN training task. It may be iterative and contains multiple kernels.</li>
<li>Kernel: CUDA kernel.</li>
<li>Normalized Progress (NP): $t_{\text{isolate}} / t_{\text{co-execute}}$.</li>
</ul>
</blockquote>
<p><strong>Two key observations</strong>:</p>
<ol>
<li>
<p>Co-execution performance of GPU kernels is highly correlated with resource utilization of individual kernels measured when running in isolation.</p>
</li>
<li>
<p>Once we have predicted which job pairs can co-execute without violating QoS requirements, the scheduling task can be reduced to the classic maximum cardinality matching problem in graph theory.</p>
</li>
</ol>
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/system-design.png"></p>
<center>Figure 2. Overall System Design of GPUPool</center>
<p>Based on these 2 observations, the authors proposed GPUPool. Its overall system design is shown in Figure 2. It consists of 4 steps:</p>
<ol>
<li>
<p><strong>Kernel Profiler</strong>. GPUPool <strong>groups all incoming GPU jobs into a batch</strong> for every scheduling window (e.g., 30 seconds). Users provide the application executable and an execution time budget. GPUPool then automatically <strong>profiles</strong> one iteration of each job in isolation on hardware to collect the <strong>performance counter metrics</strong> of each kernel.</p>
</li>
<li>
<p><strong>Co-execution Performance Predictor</strong>. This step decides the <strong>compatibility</strong> of all possible job pairs within the batch using the profiling result. It contains 2 stages:</p>
<ol>
<li>
<p><strong>Kernel-wise Predictors</strong>. It predicts how well each kernel from one job will co-run with the ones in the other job. This stage uses a <em>Gradient Boosting Tree</em> (GBT) model to <strong>predict the performance of each kernel when co-running with another kernel</strong> (based on the 1st key observation). The model takes the profiling data of kernels as input and outputs the <strong>NP</strong>. This prediction is done for <strong>each feasible TBs/SM</strong> setting.</p>
</li>
<li>
<p><strong>Job-wise Predictor</strong>. It builds an <em>interference matrix</em> (shown in Figure 3) from the <strong>predicted NP</strong> (under the optimal TBs/SM setting) of the former stage, which indicates how much two kernels slow down when they co-run. Then, GPUPool uses this matrix to calculate the <strong>co-running time of two jobs</strong>. Here, the authors found that a full calculation may require tens of thousands of iterations, but the result <strong>converges to a steady state</strong> after several iterations. So the authors used an <strong>approximation algorithm</strong> (shown in Figure 4): it stops the timeline calculation once the accumulated slowdown value of each job changes by less than a small delta over the past epoch.</p>
</li>
</ol>
</li>
</ol>
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/interference_matrix.png"></p>
<center>Figure 3. Interference Matrix</center>
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/stage2.2.png"></p>
<center>Figure 4. Concurrent Application Timeline</center>
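<p>The early-stopping timeline calculation of the job-wise predictor can be approximated by a small event-driven simulation. The sketch below is my own reading, not the paper's algorithm: I assume each job loops over its kernel sequence, that a kernel's progress rate equals its predicted NP against the kernel currently running in the other job, and that an "epoch" ends whenever either job wraps around to its first kernel. All names are hypothetical.</p>

```python
def corun_slowdowns(dur_a, dur_b, np_a, np_b, delta=1e-3, max_events=100_000):
    """Estimate steady-state slowdowns of two looping jobs that share a GPU.

    dur_a, dur_b: isolated durations of each job's kernels, in launch order.
    np_a[i][j]: normalized progress of job A's kernel i next to B's kernel j.
    np_b[i][j]: normalized progress of job B's kernel j next to A's kernel i.
    Returns (slowdown of A, slowdown of B) as co-run time / isolated time.
    """
    i = j = 0                            # running kernel index in each job
    rem_a, rem_b = dur_a[0], dur_b[0]    # remaining work, in isolated time
    elapsed = work_a = work_b = 0.0
    prev = None
    for _ in range(max_events):
        rate_a, rate_b = np_a[i][j], np_b[i][j]
        # Time until each running kernel would finish at its current rate.
        t_a, t_b = rem_a / rate_a, rem_b / rate_b
        dt = min(t_a, t_b)
        elapsed += dt
        work_a += dt * rate_a
        work_b += dt * rate_b
        a_finished, b_finished = t_a <= t_b, t_b <= t_a
        wrapped = False
        if a_finished:
            i = (i + 1) % len(dur_a)
            wrapped = wrapped or i == 0
            rem_a = dur_a[i]
        else:
            rem_a -= dt * rate_a
        if b_finished:
            j = (j + 1) % len(dur_b)
            wrapped = wrapped or j == 0
            rem_b = dur_b[j]
        else:
            rem_b -= dt * rate_b
        if wrapped:  # end of an epoch: compare with the previous estimate
            cur = (elapsed / work_a, elapsed / work_b)
            if prev is not None and all(
                abs(c - p) < delta for c, p in zip(cur, prev)
            ):
                return cur
            prev = cur
    return elapsed / work_a, elapsed / work_b
```

<p>As a sanity check, if every NP entry is 0.5, both jobs are reported as running twice as slow as in isolation, and the loop exits after only two epochs instead of simulating the full job length.</p>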
<ol start="3">
<li><strong>Job dispatcher</strong>. It decides which job pairs should co-run to maximize system performance while satisfying QoS. The decisions are found by solving a <strong>maximum cardinality matching problem</strong>: each node represents a job, and an edge connects two jobs when they can co-run without violating the QoS requirement. A graph algorithm then finds a maximum cardinality matching, i.e., a largest subset of edges in which no two edges share an end node. Due to the potential unreliability of the performance predictor, GPUPool also adds <strong>a safety margin</strong> $\delta$ to the edge formulation.</li>
</ol>
$$E = \left\{ ( {job} _ i, {job} _ j ) \mid {job} _ i,{job} _ j \in V\ \text{and}\ {NP} _ {job _ x} > {QoS} _ {job _ x} \times (1 + \delta ), x \in \{i, j\} \right\}$$<ol start="4">
<li><strong>Execution</strong>. The batch of jobs is assigned to the modified GPU hardware.</li>
</ol>
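<p>The dispatcher step can be illustrated with a tiny example that builds the edge set $E$ with the safety margin $\delta$ and then searches for a maximum cardinality matching. Note the assumptions: the brute-force search below merely stands in for a proper matching algorithm (such as Edmonds' blossom algorithm) and only works for a handful of jobs, and the data layout and function names are hypothetical.</p>

```python
from itertools import combinations


def build_edges(np_pair, qos, delta=0.05):
    """Keep job pairs whose predicted NP clears QoS with a safety margin.

    np_pair maps a job pair (i, j) to (NP of job i, NP of job j) when co-run.
    qos maps a job to the minimum normalized progress it must reach.
    """
    edges = []
    for (i, j), (np_i, np_j) in np_pair.items():
        if np_i > qos[i] * (1 + delta) and np_j > qos[j] * (1 + delta):
            edges.append((i, j))
    return edges


def max_matching(edges):
    """Brute-force maximum cardinality matching; only viable for tiny graphs."""
    for size in range(len(edges), 0, -1):
        for subset in combinations(edges, size):
            nodes = [n for e in subset for n in e]
            if len(nodes) == len(set(nodes)):   # no job appears twice
                return list(subset)
    return []


# Hypothetical predictions for four jobs with a QoS target of 0.5 each.
qos = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5}
np_pair = {
    (0, 1): (0.90, 0.80),
    (1, 2): (0.70, 0.70),
    (2, 3): (0.60, 0.56),
    (0, 3): (0.54, 0.90),   # job 0 misses 0.5 * 1.1 = 0.55, so no edge
}
edges = build_edges(np_pair, qos, delta=0.1)
pairs = max_matching(edges)
```

<p>In this example the pair <code>(0, 3)</code> is rejected by the safety margin even though job 0 nominally meets its QoS target, and the matching pairs jobs 0 with 1 and 2 with 3, so all four jobs share two GPUs.</p>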
<h2 id="evaluations">Evaluations</h2>
<p>The paper compares GPUPool against three baseline systems:</p>
<ol>
<li>
<p>No-Sharing.</p>
</li>
<li>
<p>Coarse: packing the jobs onto <strong>as few GPUs as possible</strong> using a greedy scheduling algorithm.</p>
</li>
<li>
<p>Heuristic: pairing up jobs with the <strong>highest and lowest bandwidth utilization</strong> (profiled offline) from a batch of incoming jobs.</p>
</li>
</ol>
<p>The metric is system throughput $STP=\sum_{i=1}^n \cfrac{t_{isolated}^i}{t_{shared}^i}$, where $t_{isolated}^i$ and $t_{shared}^i$ are the turnaround times of the i-th concurrent job when executing in an isolated and a shared environment, respectively. The paper also uses ${QoS}_{reached}$ to evaluate the QoS fulfilment rate.</p>
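<p>As a quick worked example of the STP formula, the snippet below evaluates it for two co-running jobs. The turnaround times are made-up numbers for illustration only, and <code>system_throughput</code> is a hypothetical helper name.</p>

```python
def system_throughput(t_isolated, t_shared):
    """STP: sum over jobs of isolated turnaround time / shared turnaround time."""
    return sum(iso / sh for iso, sh in zip(t_isolated, t_shared))


# Two jobs that each run 25% longer when sharing a GPU (made-up numbers).
stp = system_throughput(t_isolated=[10.0, 20.0], t_shared=[12.5, 25.0])
```

<p>Each job reaches a normalized progress of 0.8, so the STP is 1.6: the shared GPU does the work of 1.6 dedicated GPUs. An STP of 1.0 for a pair would mean that sharing brings no benefit over running the two jobs back to back.</p>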
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/gpu_sharing_compare.png"></p>
<center>Comparison of GPU Sharing Systems</center>
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/sorted_stp.png"></p>
<center>Sorted STP on GPUs</center>
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/throughput.png"></p>
<center>Throughput Normalized to QoS Target</center>
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/ml_pred.png"></p>
<center>Prediction Accuracy of Different ML Techniques</center>
<h2 id="comments">Comments</h2>
<h3 id="strengths">Strengths</h3>
<p>This paper targets the fine-grained GPU sharing problem in the cloud. I believe this work provides a valuable solution to this problem.</p>
<p>From my perspective, fine-grained GPU sharing presents three key challenges:</p>
<ol>
<li>
<p><strong>Limitations imposed by hardware and CUDA</strong>, which make it difficult for programmers to flexibly control kernel execution.</p>
</li>
<li>
<p><strong>Reliable and low-cost performance prediction</strong> for concurrent kernel execution. Establishing an analytical performance prediction model is highly challenging. One naive approach is using real hardware to profile, but due to the $\mathcal{O}(n^2)$ ($n$ representing the number of jobs) time complexity, this method is not scalable to larger clusters.</p>
</li>
<li>
<p><strong>Efficient algorithms to find appropriate job combinations</strong>. If we allow an arbitrary number of jobs to execute concurrently, this becomes an NP-hard problem.</p>
</li>
</ol>
<p>This paper cleverly addresses or bypasses these challenges through the following strategies:</p>
<ol>
<li>
<p><strong>Hardware-software co-design</strong>, which involves modifying hardware to provide a more flexible API for upper-layer applications. While this prevents the authors from testing their method on actual hardware and forces them to perform experiments on a simulator (GPGPU-Sim), I believe such simulations can provide valuable insights for adjustments on real hardware.</p>
</li>
<li>
<p>Predicting kernel concurrent execution performance <strong>with an ML model</strong>. This is <strong>a standout aspect</strong> of the paper (and also my <strong>favorite novelty</strong>). The authors introduce ML with good motivation to effectively address a challenging performance modeling problem, bypassing complicated analytical modeling. Also, this ML model has good <strong>interpretability</strong>: the top-10 important metrics (shown in a figure in the paper) align well with human intuition. Furthermore, in my research experience with deep learning compilers (e.g., TVM), I also found that many papers introduce such ML models for performance prediction. I believe the idea of <strong>leveraging ML techniques to bypass some complicated modeling problems</strong> is highly valuable in systems research, and it is the most important thing I learned from this paper.</p>
</li>
<li>
<p>Instead of solving the whole NP-hard job combination problem, the authors limit the number of concurrently executed jobs to 2 and consider this simpler case. It is <strong>a fantastic tradeoff</strong>. The simplified problem can be solved by a maximum cardinality matching algorithm, which may not find the globally optimal combination but trades a reasonable scheduling overhead for a substantial performance improvement.</p>
</li>
</ol>
<h3 id="weaknesses">Weaknesses</h3>
<p>This paper also has some potential weaknesses:</p>
<ol>
<li>
<p>It seems to ignore the situation in which <strong>two concurrent jobs have different execution times</strong>. For instance, when a longer job and a shorter job are executed together, GPUPool seems unable to schedule a new job onto the GPU after the shorter job finishes. Instead, the remaining GPU time is monopolized by the longer job. This could result in lower resource utilization.</p>
</li>
<li>
<p>The concurrent execution of multiple jobs on a single GPU may also be <strong>constrained by GPU memory capacity</strong>. A possible improvement is to ask users to indicate the maximum GPU memory usage of their applications and to consider these constraints when constructing the graphs.</p>
</li>
<li>
<p>This paper does not consider <strong>jobs that use multiple GPUs</strong>. Such jobs are quite common in practice. When a job can occupy multiple GPUs, there are some additional constraints:</p>
<ol>
<li>
<p><strong>Inter-GPU connection</strong> (e.g., NVLink or InfiniBand) bandwidth is a potential bottleneck, especially for distributed training strategies that rely on high GPU interconnect bandwidth, such as <em>Data Parallelism</em>. Improper job scheduling may lead to contention for bandwidth among multiple jobs, or may spread a job that requires high GPU interconnect bandwidth across different nodes.</p>
</li>
<li>
<p>When a single job leverages multiple GPUs, <strong>the workload types on different GPUs may not be the same</strong>. For example, in <em>Pipeline Parallelism</em>, different GPUs run different stages of the neural network.</p>
</li>
</ol>
</li>
<li>
<p>This paper does not clearly take into account <strong>the impact of the memory hierarchy on performance</strong>, such as shared memory (or it only considers it implicitly through the ML model). Some CUDA kernels, such as <em>Flash Attention</em>, are optimized by carefully using the shared memory of CUDA SMs. When two kernels run together, does this lead to shared memory contention? Could it result in runtime errors, or in shared memory overflowing into global memory and causing a severe performance decline? The experiments in the paper cannot answer these questions. Also, the profiling metrics selected to train the stage 1 model, listed in Figure 5, do not contain any metrics about shared memory capacity. Another possibility is that an ML model is already good enough to handle this problem. Regardless, the impact of the memory hierarchy on GPU sharing deserves further study.</p>
</li>
</ol>
<p><img loading="lazy" src="/2024-02-07-paper-reading-pact22-gpupool/metrics.png"></p>
<center>Figure 5. Metrics Used to Train Stage 1 Prediction Model</center>
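<p>The memory-capacity constraint from the second weakness could be applied as a filter on the pairing graph before matching. The sketch below assumes that users declare a peak memory figure per job; the function name, the GiB unit, and the example numbers are hypothetical and do not come from the paper.</p>

```python
def memory_feasible_pairs(peak_mem_gib, candidate_pairs, gpu_mem_gib):
    """Keep only the candidate pairs whose combined declared peak memory fits.

    peak_mem_gib:    dict mapping job id -> user-declared peak GPU memory (GiB)
    candidate_pairs: iterable of (job_a, job_b) edges of the pairing graph
    gpu_mem_gib:     memory capacity of one GPU (GiB)
    """
    return [
        (a, b)
        for a, b in candidate_pairs
        if peak_mem_gib[a] + peak_mem_gib[b] <= gpu_mem_gib
    ]


# Hypothetical declarations for three jobs on a 40 GiB GPU.
declared = {"a": 10, "b": 30, "c": 35}
edges = [("a", "b"), ("b", "c"), ("a", "c")]
print(memory_feasible_pairs(declared, edges, gpu_mem_gib=40))
```

<p>Only the pairs that survive this filter would be passed to the matching algorithm, so an infeasible pair can never be scheduled together.</p>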
<h3 id="possible-improvements">Possible Improvements</h3>
<p>I have some potential ideas to improve this work:</p>
<ol>
<li>
<p>In response to the first weakness mentioned above, we can extend GPUPool so that it can schedule a new job to the GPU after the shorter job finishes. This can be achieved with a simple modification: <strong>keep the running jobs in the incoming window, and if two jobs are still running on the same GPU, also keep the edge between them in the pairing graph</strong>. With this modification, when the shorter job finishes, we can re-run the matching algorithm to find a new job to pair with the job that is still running.</p>
</li>
<li>
<p>We can extend GPUPool to support <strong>multi-GPU jobs</strong>. To achieve that, we should consider inter-GPU connection bandwidth. This may include the following modifications:</p>
<ol>
<li>
<p>Ask users to <strong>indicate the required inter-GPU bandwidth or connection types</strong> (e.g., NVLink/PCIe/InfiniBand/Ethernet).</p>
</li>
<li>
<p>Treat a multi-GPU task as several sub-jobs. <strong>Each sub-job is a single-GPU job</strong> with interconnection constraints. Then we can reuse the GPUPool infrastructure to find co-running opportunities.</p>
</li>
<li>
<p>Extend the last <strong>step &ldquo;Execution&rdquo; to consider the interconnection constraints</strong>, so it can dispatch sub-jobs to nodes that meet the constraints. This may require an efficient graph algorithm to find a job placement, which requires further research.</p>
</li>
</ol>
</li>
<li>
<p>Sometimes the goal of a data center is not just to improve resource utilization, but also to <strong>save energy</strong>. Improving resource utilization does not necessarily mean energy saving, because the chip&rsquo;s speed $S$, power consumption $P$, and frequency $f$ have the following approximate relationship:</p>
</li>
</ol>
$$\begin{align}
   S & \propto f \\
   P & \propto f^\alpha, \text{where}\ \alpha \in [2, 3]
\end{align}$$<p>We can extend the optimization target of GPUPool to power consumption. This can be achieved by adding a power prediction model built with similar methods. Then we can use a multi-objective optimization algorithm to find the best job combination, considering both performance and power consumption.</p>
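<p>A short derivation makes the frequency relationship concrete. It is a sketch under two simplifying assumptions of mine: a job performs a fixed amount of work $W$ (a symbol introduced here), and static or idle power is ignored. The running time $t$ and the energy $E$ spent on the job are then:</p>
$$\begin{align}
   t & = \frac{W}{S} \propto \frac{1}{f} \\
   E & = P \cdot t \propto f^{\alpha} \cdot \frac{1}{f} = f^{\alpha - 1}
\end{align}$$<p>With $\alpha \in [2, 3]$, the energy per job grows roughly between linearly and quadratically with frequency. Under these assumptions, finishing a job faster by raising the frequency costs more energy in total, which is why a scheduler that targets energy may prefer a different job combination than one that targets throughput alone.</p>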
]]></content:encoded></item><item><title>Building WireGuard VPN for Machine Learning Server Cluster</title><link>https://monsoon-cs.moe/2024-01-29-wg-for-cluster/</link><pubDate>Mon, 29 Jan 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-01-29-wg-for-cluster/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;A machine learning cluster needs a secure way to expose services to users, as well as to interconnect servers across the public network. For this, a VPN network needs to be deployed.&lt;/p&gt;
&lt;p&gt;Deploying a VPN network requires considering the following factors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Network topology: an appropriate topology must be chosen to minimize latency as much as possible;&lt;/li&gt;
&lt;li&gt;User management: it should be easy to add or remove users and to authorize them;&lt;/li&gt;
&lt;li&gt;Simplicity of use and maintenance.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="design"&gt;Design&lt;/h2&gt;
&lt;h3 id="network-topology"&gt;Network Topology&lt;/h3&gt;
&lt;p&gt;The network topology determines the latency.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="motivation">Motivation</h2>
<p>A machine learning cluster needs a secure way to expose services to users, as well as to interconnect servers across the public network. For this, a VPN network needs to be deployed.</p>
<p>Deploying a VPN network requires considering the following factors:</p>
<ol>
<li>Network topology: an appropriate topology must be chosen to minimize latency as much as possible;</li>
<li>User management: it should be easy to add or remove users and to authorize them;</li>
<li>Simplicity of use and maintenance.</li>
</ol>
<h2 id="design">Design</h2>
<h3 id="network-topology">Network Topology</h3>
<p>The network topology determines the latency.</p>
<p>The lowest-latency option is obviously full-mesh, i.e. every pair of peers has a direct P2P connection. However, the management complexity of this topology is $\mathcal{O}(n^2)$, and adding a new peer requires modifying the configuration files of all other peers. It also has to deal with the problems introduced by NAT, which requires some automated management software. I tried <a href="https://www.netmaker.io/">Netmaker</a> and <a href="https://headscale.net/">Headscale</a>, but neither of them seemed able to correctly handle the <strong>complex network environment</strong> within the campus, such as the symmetric NAT used by various enterprise-grade routers, and <strong>the probability of successfully establishing P2P was very low</strong>.</p>
<p>In the end I chose a <strong>topology that combines full-mesh and hub-and-spoke</strong>. Since the number of servers and their IPs rarely change, manually configuring a full-mesh network among the servers is feasible. At the same time, a gateway server is provided as the hub for user access, and users only need to establish a connection with the gateway server. Since most users actually use the VPN within the campus, connecting to the on-campus gateway server and forwarding traffic through it does not introduce much additional latency. This structure balances latency and management complexity, and adding/removing and authorizing users only needs to be done on the gateway server.</p>
<p><img alt="Network Topology" loading="lazy" src="/2024-01-29-wg-for-cluster/topo.png"></p>
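<p>To make the management-cost argument concrete, the sketch below counts the tunnels that have to be configured in each design. The function names and the example numbers (8 servers, 40 users) are hypothetical, and it assumes the gateway is one of the servers in the mesh.</p>

```python
def full_mesh_tunnels(peers):
    """Number of point-to-point tunnels when every pair of peers is connected."""
    return peers * (peers - 1) // 2


def hybrid_tunnels(servers, users):
    """Full mesh among servers, plus one tunnel per user to the gateway."""
    return full_mesh_tunnels(servers) + users


servers, users = 8, 40  # hypothetical cluster size
print("full mesh over everyone:", full_mesh_tunnels(servers + users))
print("mesh + hub-and-spoke:   ", hybrid_tunnels(servers, users))
```

<p>In the combined design, adding a user touches only the gateway configuration and the new user&rsquo;s own file, while a full mesh would require editing the configuration of every existing peer.</p>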
<h3 id="protocol-choice">Protocol Choice</h3>
<p>The popular OpenVPN and IPsec are both good enough, but the emerging WireGuard offers unparalleled configuration simplicity. On the server side, WireGuard can define a peer and a route with just a few lines of configuration; on the user side, since WireGuard uses key-pair-based authentication, a single configuration file is enough to join the VPN network, with no need to remember an additional password or perform a login operation.</p>
<h3 id="management-approach">Management Approach</h3>
<p>For the sake of predictability and stability, I chose the manual configuration approach. The full-mesh network among servers does not need to be changed frequently once it is configured. User management, on the other hand, is implemented through a script: when a new user needs to be added, the script generates a key pair and allocates an IP, adds the public key and routing information to the gateway server&rsquo;s peer list, then generates a configuration file containing the private key and the allocated IP, and sends it to the user.</p>
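<p>The user-management flow described above can be sketched as follows. This is not the actual script: the function names are hypothetical, only IPv4 is handled for brevity, and the keys are assumed to have been generated beforehand (for example with <code>wg genkey</code> and <code>wg pubkey</code>) and passed in as strings. The placeholder endpoint <code>vpn.example.org:51820</code> is also an assumption.</p>

```python
import ipaddress


def allocate_ip(network, used):
    """Return the first host address in `network` that is not in `used`."""
    for host in ipaddress.ip_network(network).hosts():
        if str(host) not in used:
            return str(host)
    raise RuntimeError("address pool exhausted")


def gateway_peer_stanza(user_public_key, user_ip):
    """[Peer] block appended to the gateway server's configuration."""
    return "\n".join([
        "[Peer]",
        f"PublicKey = {user_public_key}",
        f"AllowedIPs = {user_ip}/32",
        "PersistentKeepalive = 25",
    ])


def user_config(user_private_key, user_ip, gateway_public_key, endpoint,
                network="10.1.0.0/16"):
    """Configuration file that is sent to the new user."""
    prefix = ipaddress.ip_network(network).prefixlen
    return "\n".join([
        "[Interface]",
        f"PrivateKey = {user_private_key}",
        f"Address = {user_ip}/{prefix}",
        "",
        "[Peer]",
        f"PublicKey = {gateway_public_key}",
        f"AllowedIPs = {network}",  # route all VPN traffic to the gateway
        f"Endpoint = {endpoint}",
        "PersistentKeepalive = 25",
    ])


# Placeholder keys; real ones would come from `wg genkey` / `wg pubkey`.
ip = allocate_ip("10.1.0.0/16", used={"10.1.0.1", "10.1.0.2"})
print(gateway_peer_stanza("<user-public-key>", ip))
print(user_config("<user-private-key>", ip, "<gateway-public-key>",
                  "vpn.example.org:51820"))
```

<p>Keeping allocation and rendering as separate functions means the list of used addresses can come from wherever the gateway configuration is stored.</p>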
<p>Example of a user peer configuration on the gateway server:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Peer]</span>
</span></span><span class="line"><span class="cl"><span class="na">PublicKey</span> <span class="o">=</span> <span class="s">&lt;redacted&gt;</span>
</span></span><span class="line"><span class="cl"><span class="na">AllowedIPs</span> <span class="o">=</span> <span class="s">10.1.x.y/32</span>
</span></span><span class="line"><span class="cl"><span class="na">AllowedIPs</span> <span class="o">=</span> <span class="s">fd01::x:y/128</span>
</span></span><span class="line"><span class="cl"><span class="na">PersistentKeepalive</span> <span class="o">=</span> <span class="s">25</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Example of a user&rsquo;s access configuration file:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Interface]</span>
</span></span><span class="line"><span class="cl"><span class="na">PrivateKey</span> <span class="o">=</span> <span class="s">&lt;redacted&gt;</span>
</span></span><span class="line"><span class="cl"><span class="na">Address</span> <span class="o">=</span> <span class="s">10.1.x.y/16</span>
</span></span><span class="line"><span class="cl"><span class="na">Address</span> <span class="o">=</span> <span class="s">fd01::x:y/64</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">[Peer]</span>
</span></span><span class="line"><span class="cl"><span class="na">PublicKey</span> <span class="o">=</span> <span class="s">&lt;redacted&gt;</span>
</span></span><span class="line"><span class="cl"><span class="na">AllowedIPs</span> <span class="o">=</span> <span class="s">10.1.0.0/16  # route all VPN traffic to gateway server</span>
</span></span><span class="line"><span class="cl"><span class="na">AllowedIPs</span> <span class="o">=</span> <span class="s">fd01::/64</span>
</span></span><span class="line"><span class="cl"><span class="na">Endpoint</span> <span class="o">=</span> <span class="s">wg.ustcaigroup.xyz:51820  # gateway server is dual stack</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Endpoint = wg.ustcaigroup.xyz:51820  # IPv4</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Endpoint = wg.ustcaigroup.xyz:51820  # IPv6</span>
</span></span><span class="line"><span class="cl"><span class="na">PersistentKeepalive</span> <span class="o">=</span> <span class="s">25</span>
</span></span></code></pre></td></tr></table>
</div>
</div>]]></content:encoded></item><item><title>Building Storage System for Machine Learning Server Cluster</title><link>https://monsoon-cs.moe/2023-11-24-storage-system-desgin/</link><pubDate>Fri, 24 Nov 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-11-24-storage-system-desgin/</guid><description>&lt;blockquote&gt;
&lt;p&gt;This is an unfinished blog.&lt;/p&gt;
&lt;/blockquote&gt;</description><content:encoded>&lt;blockquote>
&lt;p>This is an unfinished blog.&lt;/p>
&lt;/blockquote>
</content:encoded></item><item><title>Custom PyTorch Operators on Ascend 910B</title><link>https://monsoon-cs.moe/2023-11-14-ascend-910b-custom-op/</link><pubDate>Tue, 14 Nov 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-11-14-ascend-910b-custom-op/</guid><description>&lt;h2 id="environment"&gt;Environment&lt;/h2&gt;
&lt;p&gt;The hardware environment this article is based on is the Ascend 910B3, and the software environment includes &lt;a href="https://www.hiascend.com/developer/download/community/result"&gt;CANN 7.0-RC1&lt;/a&gt;, &lt;a href="https://repo.huaweicloud.com/kunpeng/archive/Ascend/PyTorch/"&gt;PyTorch 1.11.0&lt;/a&gt;, and &lt;a href="https://gitee.com/ascend/pytorch/releases/tag/v5.0.rc3-pytorch1.11.0"&gt;Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0&lt;/a&gt;. The situation on other CANN and PyTorch versions may differ slightly.&lt;/p&gt;
&lt;h2 id="registration-process"&gt;Registration Process&lt;/h2&gt;
&lt;h3 id="adding-a-custom-operator-in-the-ascend-pytorch-adapter"&gt;Adding a Custom Operator in the Ascend PyTorch Adapter&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html"&gt;https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation"&gt;https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Add the &lt;code&gt;npu_add_custom&lt;/code&gt; function in &lt;code&gt;torch_npu/csrc/aten/npu_native_functions.yaml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;func&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;npu_add_custom(Tensor x, Tensor y) -&amp;gt; Tensor &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c"&gt;# the added function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Add the file &lt;code&gt;AddCustomKernelNpu.cpp&lt;/code&gt; in &lt;code&gt;torch_npu/csrc/aten/ops/op_api&lt;/code&gt;:&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="environment">Environment</h2>
<p>The hardware environment this article is based on is the Ascend 910B3, and the software environment includes <a href="https://www.hiascend.com/developer/download/community/result">CANN 7.0-RC1</a>, <a href="https://repo.huaweicloud.com/kunpeng/archive/Ascend/PyTorch/">PyTorch 1.11.0</a>, and <a href="https://gitee.com/ascend/pytorch/releases/tag/v5.0.rc3-pytorch1.11.0">Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0</a>. The situation on other CANN and PyTorch versions may differ slightly.</p>
<h2 id="registration-process">Registration Process</h2>
<h3 id="adding-a-custom-operator-in-the-ascend-pytorch-adapter">Adding a Custom Operator in the Ascend PyTorch Adapter</h3>
<blockquote>
<p>References:</p>
<ul>
<li><a href="https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html">https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html</a></li>
<li><a href="https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation">https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation</a></li>
</ul>
</blockquote>
<p>Add the <code>npu_add_custom</code> function in <code>torch_npu/csrc/aten/npu_native_functions.yaml</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">custom</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">func</span><span class="p">:</span><span class="w"> </span><span class="l">npu_add_custom(Tensor x, Tensor y) -&gt; Tensor </span><span class="w"> </span><span class="c"># the added function</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Add the file <code>AddCustomKernelNpu.cpp</code> in <code>torch_npu/csrc/aten/ops/op_api</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;torch/csrc/autograd/custom_function.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;torch_npu/csrc/framework/utils/OpAdapter.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;torch_npu/csrc/aten/NPUNativeFunctions.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;torch_npu/csrc/aten/ops/op_api/op_api_common.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">namespace</span> <span class="n">at_npu</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="k">namespace</span> <span class="n">native</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">using</span> <span class="n">torch</span><span class="o">::</span><span class="n">autograd</span><span class="o">::</span><span class="n">Function</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">using</span> <span class="n">torch</span><span class="o">::</span><span class="n">autograd</span><span class="o">::</span><span class="n">AutogradContext</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">at</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">NPUNativeFunctions</span><span class="o">::</span><span class="n">npu_add_custom</span><span class="p">(</span><span class="k">const</span> <span class="n">at</span><span class="o">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="n">x</span><span class="p">,</span> <span class="k">const</span> <span class="n">at</span><span class="o">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">at</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">result</span> <span class="o">=</span> <span class="n">OpPreparation</span><span class="o">::</span><span class="n">ApplyTensor</span><span class="p">(</span><span class="n">x</span><span class="p">);</span> <span class="c1">// allocate the output tensor
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1">// calculate the output result of the NPU
</span></span></span><span class="line"><span class="cl">        <span class="n">EXEC_NPU_CMD</span><span class="p">(</span><span class="n">aclnnAddCustom</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">result</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span> <span class="c1">// namespace native
</span></span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="c1">// namespace at_npu
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Afterwards, recompile and reinstall <code>torch_npu</code>.</p>
<h3 id="adding-the-custom-operator-implementation-in-cann">Adding the Custom Operator Implementation in CANN</h3>
<blockquote>
<p>References:</p>
<ul>
<li><a href="https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0023.html">https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0023.html</a></li>
</ul>
</blockquote>
<p>First, define the operator description file <code>add_custom.json</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;op&#34;</span><span class="p">:</span> <span class="s2">&#34;AddCustom&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;language&#34;</span><span class="p">:</span> <span class="s2">&#34;cpp&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;input_desc&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">            <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;x&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;param_type&#34;</span><span class="p">:</span> <span class="s2">&#34;required&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;ND&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">],</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;fp16&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;y&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;param_type&#34;</span><span class="p">:</span> <span class="s2">&#34;required&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;ND&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">],</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;fp16&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">],</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;output_desc&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">            <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;z&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;param_type&#34;</span><span class="p">:</span> <span class="s2">&#34;required&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;ND&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">],</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;fp16&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">msopgen gen -i add_custom.json -c ai_core-Ascend910B3 -f pytorch -out . -lan cpp
</span></span></code></pre></td></tr></table>
</div>
</div><p>to generate the operator project:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">AddCustom
</span></span><span class="line"><span class="cl">├── build.sh
</span></span><span class="line"><span class="cl">├── cmake 
</span></span><span class="line"><span class="cl">│   ├── config.cmake
</span></span><span class="line"><span class="cl">│   ├── func.cmake
</span></span><span class="line"><span class="cl">│   ├── intf.cmake
</span></span><span class="line"><span class="cl">│   ├── makeself.cmake
</span></span><span class="line"><span class="cl">│   └── util
</span></span><span class="line"><span class="cl">├── CMakeLists.txt
</span></span><span class="line"><span class="cl">├── CMakePresets.json          // set ASCEND_CANN_PACKAGE_PATH here
</span></span><span class="line"><span class="cl">├── framework
</span></span><span class="line"><span class="cl">├── op_host
</span></span><span class="line"><span class="cl">│   ├── add_custom_tiling.h    // 定义 length 和 tiling 相关信息
</span></span><span class="line"><span class="cl">│   ├── add_custom.cpp         // 算子 host 侧实现
</span></span><span class="line"><span class="cl">│   ├── CMakeLists.txt
</span></span><span class="line"><span class="cl">├── op_kernel
</span></span><span class="line"><span class="cl">│   ├── CMakeLists.txt
</span></span><span class="line"><span class="cl">│   ├── add_custom.cpp         // 算子 kernel 侧实现
</span></span><span class="line"><span class="cl">└── scripts
</span></span></code></pre></td></tr></table>
</div>
</div><p>In <code>CMakePresets.json</code>, change <code>ASCEND_CANN_PACKAGE_PATH</code> to the CANN installation path.</p>
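<p>Editing the file by hand is fine; for repeated builds a small script can make the same change. The following is a minimal Python sketch, assuming the generated <code>CMakePresets.json</code> keeps <code>ASCEND_CANN_PACKAGE_PATH</code> under <code>configurePresets[*].cacheVariables</code>, either as a plain string or as an object with a <code>value</code> field (both forms are allowed by the CMake presets format). The helper name <code>set_cann_path</code> is hypothetical and not part of CANN; check the layout against your own generated file.</p>

```python
import json
from pathlib import Path


def set_cann_path(presets_file, cann_path):
    """Point ASCEND_CANN_PACKAGE_PATH at cann_path in every configure preset.

    Hypothetical helper (not part of CANN). Returns the number of presets updated.
    """
    presets_file = Path(presets_file)
    data = json.loads(presets_file.read_text(encoding="utf-8"))
    updated = 0
    for preset in data.get("configurePresets", []):
        cache = preset.get("cacheVariables", {})
        if "ASCEND_CANN_PACKAGE_PATH" not in cache:
            continue  # this preset does not define the variable
        entry = cache["ASCEND_CANN_PACKAGE_PATH"]
        if isinstance(entry, dict):
            # Object form: {"type": "PATH", "value": "..."}
            entry["value"] = cann_path
        else:
            # Plain-string form
            cache["ASCEND_CANN_PACKAGE_PATH"] = cann_path
        updated += 1
    # Rewrite the file in place with the updated path
    presets_file.write_text(json.dumps(data, indent=4) + "\n", encoding="utf-8")
    return updated
```

<p>For example, <code>set_cann_path("CMakePresets.json", "/usr/local/Ascend/ascend-toolkit/latest")</code> would rewrite the file in place. The path here is only an example; use wherever CANN is actually installed on your machine.</p>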
<p>The content of <code>op_host/add_custom_tiling.h</code> is as follows (a simple implementation):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;register/tilingdata_base.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">namespace</span> <span class="n">optiling</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl"><span class="n">BEGIN_TILING_DATA_DEF</span><span class="p">(</span><span class="n">AddCustomTilingData</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">TILING_DATA_FIELD_DEF</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>  <span class="c1">// define the tensor size
</span></span></span><span class="line"><span class="cl"><span class="n">END_TILING_DATA_DEF</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">REGISTER_TILING_DATA_CLASS</span><span class="p">(</span><span class="n">AddCustom</span><span class="p">,</span> <span class="n">AddCustomTilingData</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In <code>op_host/add_custom.cpp</code>, modify the <code>block_dim</code> used when the operator is invoked:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="line"><span class="cl"><span class="n">context</span><span class="o">-&gt;</span><span class="n">SetBlockDim</span><span class="p">(</span><span class="mi">20</span><span class="p">);</span> <span class="c1">// block_dim for the 910B3
</span></span></span></code></pre></td></tr></table>
</div>
</div><p><code>op_kernel/add_custom.cpp</code> is the concrete implementation of the operator:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;kernel_operator.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef __DAV_C220_VEC__
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">extern</span> <span class="s">&#34;C&#34;</span> <span class="n">__global__</span> <span class="n">__aicore__</span> <span class="kt">void</span> <span class="n">add_custom</span><span class="p">(</span><span class="n">GM_ADDR</span> <span class="n">x</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">y</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">z</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">tiling</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">GET_TILING_DATA</span><span class="p">(</span><span class="n">tiling_data</span><span class="p">,</span> <span class="n">tiling</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">M</span> <span class="o">=</span> <span class="n">tiling_data</span><span class="p">.</span><span class="n">size</span><span class="p">;</span>  <span class="c1">// get the tensor size from tiling_data
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1">// ...
</span></span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#else
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Important: CANN tries different ccec compile options to infer the operator type (VEC, CUBE, MIXED); compilation fails unless a stub function is provided
</span></span></span><span class="line"><span class="cl"><span class="k">extern</span> <span class="s">&#34;C&#34;</span> <span class="n">__global__</span> <span class="n">__aicore__</span> <span class="kt">void</span> <span class="n">add_custom</span><span class="p">(</span><span class="n">GM_ADDR</span> <span class="n">x</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">y</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">z</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">tiling</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">pip_barrier</span><span class="p">(</span><span class="n">PIPE_ALL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span></code></pre></td></tr></table>
</div>
</div><h3 id="compilation-and-deployment">Compilation and Deployment</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">$ bash build.sh
</span></span><span class="line"><span class="cl">$ ./custom_opp_euleros_aarch64.run
</span></span></code></pre></td></tr></table>
</div>
</div><p>Calling it in PyTorch:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch_npu</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">z</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">npu_add_custom</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>  <span class="c1"># compiled at runtime, so the first call has to wait for compilation</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="registration-principles">Registration Principles</h2>
<p>TODO</p>
<h2 id="references">References</h2>
<p>TODO</p>
]]></content:encoded></item><item><title>Building Proxy Service for Team</title><link>https://monsoon-cs.moe/2023-11-09-proxy-for-team/</link><pubDate>Thu, 09 Nov 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-11-09-proxy-for-team/</guid><description>&lt;blockquote&gt;
&lt;p&gt;This is an unfinished blog.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="preface"&gt;Preface&lt;/h2&gt;
&lt;p&gt;Due to &lt;a href="https://en.wikipedia.org/wiki/Internet_censorship_in_China"&gt;Internet censorship in China&lt;/a&gt; (known as &lt;em&gt;GFW&lt;/em&gt;, &lt;em&gt;Great Firewall&lt;/em&gt;, &lt;em&gt;防火长城&lt;/em&gt;), many websites (e.g. Google, Twitter) are blocked, and some websites (e.g. GitHub) suffer from connectivity issues. In China, circumventing Internet censorship is referred to as &lt;em&gt;翻墙&lt;/em&gt; (meaning &lt;em&gt;climbing over the wall&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;In China, a proxy is essential for accessing the Internet freely. Although various commercial options are available, they may not suit everyone. Therefore, I have built a user-friendly and easy-to-maintain proxy system for my research group, as part of my responsibilities as a system administrator.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
<p>This is an unfinished blog.</p>
</blockquote>
<h2 id="preface">Preface</h2>
<p>Due to <a href="https://en.wikipedia.org/wiki/Internet_censorship_in_China">Internet censorship in China</a> (known as <em>GFW</em>, <em>Great Firewall</em>, <em>防火长城</em>), many websites (e.g. Google, Twitter) are blocked, and some websites (e.g. GitHub) suffer from connectivity issues. In China, circumventing Internet censorship is referred to as <em>翻墙</em> (meaning <em>climbing over the wall</em>).</p>
<p>In China, a proxy is essential for accessing the Internet freely. Although various commercial options are available, they may not suit everyone. Therefore, I have built a user-friendly and easy-to-maintain proxy system for my research group, as part of my responsibilities as a system administrator.</p>
<h2 id="target">Target</h2>
<ol>
<li><strong>Easy to use</strong>. Team members should need only some simple configuration. The proxy client should be able to update its configuration automatically.</li>
<li><strong>Stability</strong>.</li>
<li><strong>Sufficient traffic</strong>, to download large datasets.</li>
<li><strong>Low Latency</strong>, to provide a good web browsing experience.</li>
<li><strong>Low Cost</strong>.</li>
<li><strong>Easy to maintain</strong>. Frequent maintenance is unacceptable, and adding a new feature should require only simple configuration changes.</li>
<li><strong>Concealment</strong>. The cat-and-mouse game between the GFW and anti-censorship tools has been escalating. Ten years ago (2013), an OpenVPN client was all you needed to <a href="https://www.cnnic.com.cn/IDR/hlwfzdsj/201306/t20130628_40563.htm">&ldquo;Across the Great Wall and reach every corner in the world&rdquo;</a>. Now, you must use much more sophisticated solutions to prevent your &ldquo;unusual&rdquo; traffic from being detected by the GFW. According to <a href="https://gfw.report/">GFW Report</a>, the popular <a href="https://shadowsocks.org/">Shadowsocks</a> (a proxy protocol which simply encrypts all traffic using a pre-shared key) was <a href="https://gfw.report/blog/gfw_shadowsocks/">detected and blocked</a>, and TLS-based proxies also <a href="https://github.com/net4people/bbs/issues/129">encountered large-scale blocking in Oct 2022</a>. The tools and protocols used must be concealed enough to allow the service to run for a long time.</li>
</ol>
<h2 id="available-resources">Available Resources</h2>
<h3 id="cernet">CERNET</h3>
<h3 id="cloudflare-warp">Cloudflare WARP</h3>
<h3 id="vps">VPS</h3>
<h3 id="server-in-ustc">Server in USTC</h3>
<h3 id="anti-censorship-tools">Anti-Censorship Tools</h3>
<h2 id="adopted-solution">Adopted Solution</h2>
<!-- draw a picture -->
<h2 id="deployment">Deployment</h2>
<h2 id="problems">Problems</h2>
<h3 id="client-initialization">Client Initialization</h3>
<h3 id="compatibility">Compatibility</h3>
<h2 id="conclusion">Conclusion</h2>
]]></content:encoded></item><item><title>My TOEFL Experience</title><link>https://monsoon-cs.moe/2023-11-05-toefl-exp/</link><pubDate>Sun, 05 Nov 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-11-05-toefl-exp/</guid><description>&lt;h2 id="preface"&gt;Preface&lt;/h2&gt;
&lt;p&gt;As the exam that has caused me the most anxiety since the gaokao, the TOEFL kept me in the dark for most of 2023, and it is also the exam I invested the most time and money into.&lt;/p&gt;
&lt;p&gt;At the start I set a goal of 100 total and 20 in speaking. Along the way I went through countless days of lost confidence, of being drowned by anxiety, of practicing speaking until my tongue tied itself in knots — and finally, on November 3, 2023, I checked my scores and was satisfied.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="preface">Preface</h2>
<p>As the exam that has caused me the most anxiety since the gaokao, the TOEFL kept me in the dark for most of 2023, and it is also the exam I invested the most time and money into.</p>
<p>At the start I set a goal of 100 total and 20 in speaking. Along the way I went through countless days of lost confidence, of being drowned by anxiety, of practicing speaking until my tongue tied itself in knots — and finally, on November 3, 2023, I checked my scores and was satisfied.</p>
<p>I write this article both as a summary of my own past and in the hope that it can help anyone who happens to read it.</p>
<p>The sittings I took and my scores:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Exam date</th>
					<th style="text-align: center">Total</th>
					<th style="text-align: center">Reading</th>
					<th style="text-align: center">Listening</th>
					<th style="text-align: center">Speaking</th>
					<th style="text-align: center">Writing</th>
					<th style="text-align: center">Note</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center">2023.7.22</td>
					<td style="text-align: center">89</td>
					<td style="text-align: center">27</td>
					<td style="text-align: center">24</td>
					<td style="text-align: center">16</td>
					<td style="text-align: center">22</td>
					<td style="text-align: center">before reform</td>
			</tr>
			<tr>
					<td style="text-align: center">2023.8.15</td>
					<td style="text-align: center">89</td>
					<td style="text-align: center">28</td>
					<td style="text-align: center">25</td>
					<td style="text-align: center">17</td>
					<td style="text-align: center">19</td>
					<td style="text-align: center">post-reform (this sitting and all later ones)</td>
			</tr>
			<tr>
					<td style="text-align: center">2023.9.16</td>
					<td style="text-align: center">96</td>
					<td style="text-align: center">29</td>
					<td style="text-align: center">27</td>
					<td style="text-align: center">19</td>
					<td style="text-align: center">21</td>
					<td style="text-align: center"></td>
			</tr>
			<tr>
					<td style="text-align: center">2023.10.14</td>
					<td style="text-align: center">96</td>
					<td style="text-align: center">30</td>
					<td style="text-align: center">24</td>
					<td style="text-align: center">19</td>
					<td style="text-align: center">23</td>
					<td style="text-align: center"></td>
			</tr>
			<tr>
					<td style="text-align: center">2023.10.28</td>
					<td style="text-align: center">101</td>
					<td style="text-align: center">28</td>
					<td style="text-align: center">27</td>
					<td style="text-align: center">22</td>
					<td style="text-align: center">24</td>
					<td style="text-align: center"></td>
			</tr>
			<tr>
					<td style="text-align: center">MyBest</td>
					<td style="text-align: center">103</td>
					<td style="text-align: center">30</td>
					<td style="text-align: center">27</td>
					<td style="text-align: center">22</td>
					<td style="text-align: center">24</td>
					<td style="text-align: center"></td>
			</tr>
	</tbody>
</table>
<p>Study materials I used:</p>
<ul>
<li>Vocabulary: <a href="https://www.maimemo.com/">MaiMemo</a></li>
<li>Listening and speaking practice: <a href="https://toefl.kmf.com/">TAL Kaomanfen</a>, <a href="https://tpo.xdf.cn/">New Oriental TOEFL</a>, all speaking questions from TPO 1~74 bought on Taobao</li>
<li>Speaking reference: <a href="https://book.douban.com/subject/30300871/">New Oriental <em>TOEFL Speaking White Paper</em></a></li>
<li>Writing reference: <a href="https://book.douban.com/subject/26338897/">New Oriental <em>TOEFL Writing White Paper</em></a>, and, for the post-reform format, <a href="https://zhuanlan.zhihu.com/p/648415673">a collection of real academic discussion writing questions with sample essays</a></li>
</ul>
<h2 id="reading">Reading</h2>
<p>For most Chinese students this is the easiest section, and any competent student from a 211 university or above can certainly handle it with ease.</p>
<p>Before the exam I only did two passages to get used to the pacing, and I scored 27 on my first attempt, then stayed stable, and hit a full score on my fourth attempt. Personally I feel TOEFL reading is even easier than the Jiangsu gaokao or CET-6 reading. Although I memorized a lot of vocabulary before my first exam, that was mostly preparation for the GRE; TOEFL reading itself poses basically no vocabulary challenge.</p>
<p>While a high score isn&rsquo;t hard, a full score still takes a bit of luck. On the sitting where I scored full marks, the two reading topics were &ldquo;the early ocean and atmosphere of Earth&rdquo; and &ldquo;the agricultural revolution and irrigation,&rdquo; both topics I was very familiar with. In that case the reading was just easy mode.</p>
<h2 id="listening">Listening</h2>
<p>The TOEFL&rsquo;s bizarre exam format makes listening, speaking, and writing all test your listening ability. But <strong>the listening across these three parts is actually completely different</strong>:</p>
<ul>
<li>The listening section itself:
<ul>
<li>Conversation: relatively hard; everyday conversation has always been my weak spot, with the most linking and elision, and a fairly fast pace;</li>
<li>Lecture: moderate difficulty; although it looks long, the pace is actually slow and tolerant of errors, and if you miss a sentence you can completely infer it from context;</li>
</ul>
</li>
<li>Integrated speaking: the listening here is actually the hardest, as you need to capture as many details as possible and take sufficient notes; my speaking foundation itself was very poor, which made it even harder;</li>
<li>Integrated writing: the lowest difficulty; at the start you read a passage to get familiar with the topic, and the listening has a rigid structure, clear logic, and a slow pace.</li>
</ul>
<p>But I have to say, <strong>with proper training, the listening section is also very easy to improve and to score high on.</strong> I did about 20 days of concentrated, intensive training, plus roughly another 30 days of scattered training (mixed in with other things).</p>
<p>The single most important point about listening is that <strong>you must figure out the approach to answering questions that suits you.</strong> Many study materials emphasize how to take notes correctly during listening, and at first I trained that way too, but after my first exam I realized this method didn&rsquo;t suit me — taking notes distracts your attention, and the probability of losing track of the listening content (no longer being able to grasp the logical relationships in the context) increases enormously.</p>
<p>My conclusion is that <strong>notes are good for recording details, and the human brain is good for remembering logic.</strong></p>
<p><strong>The pure listening section of the TOEFL actually doesn&rsquo;t focus on details; instead it tests your overall grasp of the listening material.</strong> In my later 20 days of dedicated training I completely abandoned note-taking, and it worked very well. I should note that I later found occasional note-taking still useful when the density of details was high — it helps you avoid losing focus, but what you write down is actually useless; I never once looked at it during the exam. Here, taking notes is really just a way to reinforce the brain&rsquo;s memory, not a way to store information externally.</p>
<p>The listening training method I used: first pass, do the questions; second pass, re-listen; third pass, listen while reading the transcript; then listen several more times until you can hear every detail clearly. During dedicated training, each listening passage took me roughly 20~40 minutes, and I practiced at least 6 passages a day.</p>
<p>Likewise, topic familiarity greatly affects your performance. On the sitting where I first scored 27, one lecture told the classic story of &ldquo;winning the Nobel Prize by peeling graphene with tape.&rdquo; Although I was very familiar with it and breezed through, the content was indeed somewhat specialized, with many physics terms, touching on the layered structure of graphene and the principle of its anisotropic conductivity. Since TOEFL listening lectures are still mainly STEM-oriented, useless knowledge you picked up while slacking off on Zhihu or Bilibili — even some popular science books you read back in secondary school — can help you in unexpected ways; a broad knowledge base lets you achieve more with less effort. But by the same token, unfamiliar topics become very troublesome: on my fourth exam I only scored 24 in listening, precisely because I ran into a literature topic and didn&rsquo;t understand most of the content.</p>
<p><strong>After the July 2023 reform</strong>, listening has a pitfall: since the mid-test break was removed, some people finish faster and start speaking while you are still listening, causing serious interference. Although I did dedicated training before the second exam, I still only got 25 in listening — exactly because I fell into this trap.</p>
<p>The way to avoid this pitfall is to quickly skip all the direction parts and end the reading section two minutes early, so that you can be the first in the room to start speaking, <del>letting others be interfered with by you</del>.</p>
<blockquote>
<p><del>Better that I wrong the world than that the world wrong me.</del></p>
</blockquote>
<h2 id="speaking">Speaking</h2>
<p>Looking at the scores, you can tell this was the part that tormented me the most — the last two sittings were taken purely for speaking (a speaking score below 20 is very risky when applying).</p>
<p>I did high-intensity dedicated speaking training for about 30 days, and the number of non-dedicated training days is beyond counting.</p>
<p>For someone like me with a very poor speaking foundation, a large amount of training can ensure your score lands around 20; beyond that it still comes down to luck and on-the-spot performance.</p>
<p><strong>TOEFL speaking is less a speaking test than a grand integrated test</strong>. For me personally, the reading and listening demands within the speaking section are even higher than in the reading and listening sections themselves:</p>
<ul>
<li>The reading parts of task 2 and task 3 require <strong>speed-reading ability</strong>; personally I feel you can&rsquo;t manage at less than about 4 words/s, and <strong>if you don&rsquo;t finish reading in time, there is no chance to go back</strong>. The reading section, by contrast, can be read at the same speed I normally read papers, and if a sentence isn&rsquo;t clear you can read it several more times.</li>
<li>The listening in integrated speaking requires you to write down details, whereas in the listening section much of the time you only need to note the logic. Recording details forces you to rely on notes, and balancing note-taking, receiving information, and grasping the overall logic is the hardest part.</li>
</ul>
<h3 id="independent-speaking">Independent Speaking</h3>
<p>Accumulating material is necessary, but quantity is not the point — I only prepared 10 commonly used ones; what matters is being able to use them fluently, so that when you see a question you can quickly react with which material to apply. You can practice this specifically with the <a href="https://toefl.kmf.com/speak/gold/1">Golden 80 Speaking Questions</a> on TAL Kaomanfen.</p>
<p>At the same time, material isn&rsquo;t a cure-all; independent speaking inevitably carries many random factors and often requires making up a story on the spot. In that case it&rsquo;s faster to quickly think it through in Chinese and then translate it into English (jot down a few keywords and string them into sentences as you speak).</p>
<h3 id="integrated-speaking">Integrated Speaking</h3>
<p>For me this was the hardest part of the whole exam; getting here basically triggered an adrenaline surge every time.</p>
<p><strong>Handling integrated speaking is the part I spent the most time training on. There is no shortcut; you have to find your own feel and your own experience.</strong> Here I&rsquo;ll share the experience I summarized that worked for me:</p>
<ul>
<li>While reading: although task 2 and task 3 give you 45s of reading, <strong>it&rsquo;s best to scan it in just 15s, find the key sentences (skip non-key sentences entirely), and then copy down the key sentences</strong> (not necessarily word for word, but as complete as possible — the kind you can read straight off without having to compose anything). The benefit is that during prep time I can quickly read through it once, and when I formally speak I&rsquo;m not only fluent at the start but also save time;</li>
<li>While listening: write down as many details as possible, but you must simultaneously filter out the non-essential, and for the essential parts likewise write down keywords/sentences. At the same time, note-taking absolutely must not interfere with receiving the information itself;</li>
<li>During prep: read out what you&rsquo;re going to say (don&rsquo;t say it silently in your head — that gives you the illusion that you already speak it fluently) while circling useful information (or crossing out useless information), use arrows to organize a single thread to follow, and where necessary write filler content between some keywords to reduce the burden of composing on the spot;</li>
<li>When formally speaking: make fluency your top priority, and when you&rsquo;re out of time or stuck you can drop some details. Stammering and repeating a sentence not only lowers your score but also wastes time.</li>
</ul>
<p><strong>No matter the situation, you must never become overly nervous.</strong> Being overly nervous slows your thinking and greatly increases stumbling while speaking. On the sitting where I scored 22, I was in a fairly relaxed state during the speaking section.</p>
<p>My personal training method for integrated speaking: first do it normally, then immediately re-speak it, then look at the answer, then keep re-speaking until you can do it very fluently. Under this method one passage takes about 15~30 minutes, and I practiced 10 passages a day.</p>
<h2 id="writing">Writing</h2>
<p><strong>No feelings, all formula.</strong> In fact I hardly invested any time in writing training; an average English foundation plus appropriate techniques is enough to get at least 22.</p>
<p>One thing to note is <strong>don&rsquo;t let your typing speed drag you down</strong>. I&rsquo;m someone who types fairly slowly and makes a lot of typos, and in the first two sittings this did affect me, but once I got more practiced it was no longer a problem.</p>
<h3 id="integrated-writing">Integrated Writing</h3>
<p>For integrated writing you can read the passage at a calm, comfortable pace — the time given is enough for you to read it twice — and you don&rsquo;t need to take notes. The listening is also simple: the reading sets the stage so you&rsquo;re familiar with the topic, and the structure is rigid, the logic clear, and the pace slow, so writing down the important details isn&rsquo;t hard.</p>
<p>The thing to watch is <strong>don&rsquo;t memorize templates rigidly</strong>; wasting exam time typing out a template isn&rsquo;t worth it — just keep the logic clear and the structure neat. The time should be spent reconstructing as many details as possible; for language use, gaokao-level vocabulary is enough to get 24.</p>
<h3 id="academic-discussion-writing">Academic Discussion Writing</h3>
<p>The July 2023 reform removed independent writing and replaced it with academic discussion writing, shortening the time to 10 minutes. My writing score of 19 on the second exam was because I went in carelessly without practicing the new question type at all, and the result was that I completely failed to answer as required.</p>
<p>Later I spent half a day specifically training academic discussion writing and basically got the hang of it. In the exam you really only need to read the professor&rsquo;s question, skip the pile of filler, then glance at the two student sample answers and find their core viewpoints — this is to avoid colliding with the same viewpoint, and you don&rsquo;t need to read their specific content fully — after which you can start writing.</p>
<p>My personal template is as follows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">From my perspective, &lt;my viewpoint&gt;.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Although &lt;pick a sample answer you disagree with and copy down its viewpoint&gt;, &lt;briefly state the advantage of my viewpoint&gt;.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&lt;elaborate in detail, you can use some examples, and you can also point out the shortcomings of the viewpoint you disagree with; 60~70 words is enough&gt;.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">&lt;(optional, an expression I personally like) sometimes you can say that my method can actually achieve the goal of the method I disagree with even better&gt;.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">So, &lt;summarize the viewpoint&gt;.
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="conclusion">Conclusion</h2>
<p><strong>Without accumulating small steps, one cannot reach a thousand li.</strong></p>
<p>For me personally, the TOEFL made me reflect on my study patterns since college. My undergraduate courses were either things I was already familiar with or had a foundation in, or things I crammed for right before the exam. A language exam like the TOEFL has no shortcut (unless you&rsquo;re a language genius); you have to train little by little starting from Day 1, building up your feel and your experience bit by bit. Along the way, negative emotions are an even bigger obstacle than the questions themselves, and it helps enormously to find a few people you trust, and who are willing to listen, to share your feelings with.</p>
]]></content:encoded></item><item><title>Catching Mining Virus</title><link>https://monsoon-cs.moe/2023-11-01-catching-mining-virus/</link><pubDate>Wed, 01 Nov 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-11-01-catching-mining-virus/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;On October 30, 2023, I received a warning message from the data center administrator, informing me that the firewall had detected mining traffic sent from a server I manage.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://monsoon-cs.moe/2023-11-01-catching-mining-virus/firewall_warning.png"&gt;&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;mining traffic&amp;rdquo; was a &lt;code&gt;bitcoin.sipa.be&lt;/code&gt; DNS request sent to &lt;code&gt;223.5.5.5&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Initially, I thought finding the virus process would be simple, just like my previous encounter with another mining virus. In that case, the attacker logged in to the server by cracking a weak SSH password and gained root permission, possibly by exploiting a privilege escalation vulnerability (the server was running EOL Ubuntu 16.04). A cron job was then set up to run the mining virus.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="problem">Problem</h2>
<p>On October 30, 2023, I received a warning message from the data center administrator, informing me that the firewall had detected mining traffic sent from a server I manage.</p>
<p><img loading="lazy" src="/2023-11-01-catching-mining-virus/firewall_warning.png"></p>
<p>The &ldquo;mining traffic&rdquo; was a <code>bitcoin.sipa.be</code> DNS request sent to <code>223.5.5.5</code>.</p>
<p>Initially, I thought finding the virus process would be simple, just like my previous encounter with another mining virus. In that case, the attacker logged in to the server by cracking a weak SSH password and gained root permission, possibly by exploiting a privilege escalation vulnerability (the server was running EOL Ubuntu 16.04). A cron job was then set up to run the mining virus.</p>
<p>However, this time the situation was different. I couldn&rsquo;t find any suspicious processes, and there was no unusual GPU usage. Since I hadn&rsquo;t deployed any monitoring to record historical processes and sockets, I had nothing to start the investigation from.</p>
<p>On October 31, I received the same warning again. Each time mining traffic is detected, the firewall blocks the server&rsquo;s outbound traffic, and losing Internet access causes a lot of trouble.</p>
<p>I suspected that one of the users had fallen victim to a <strong>supply chain attack</strong>, for example by installing a Python package containing a virus, or by cloning code from GitHub and running it without checking it.</p>
<p>The immediate task was to identify which user and which process were responsible.</p>
<h2 id="solution">Solution</h2>
<p>While I can&rsquo;t directly determine the user or the process, I can block and log the suspicious traffic for further investigation.</p>
<p>This can be done with <code>iptables</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># iptables -N LOGDROP                   # create a new chain</span>
</span></span><span class="line"><span class="cl"><span class="c1"># iptables -A LOGDROP -j LOG --log-uid  # log info</span>
</span></span><span class="line"><span class="cl"><span class="c1"># iptables -A LOGDROP -j DROP           # drop packet</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># iptables -I OUTPUT 1 -p udp -m string --string &#34;bitcoin&#34; --algo bm -j LOGDROP     # match string &#34;bitcoin&#34; in udp packet</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The <code>--log-uid</code> option records the UID of the process that generated each packet in the kernel log (<code>/var/log/kern.log</code> on this server), for example:</p>
<pre tabindex="0"><code class="language-log" data-lang="log">IN= OUT=wg0 SRC=10.1.92.3 DST=10.1.2.13 LEN=42 TOS=0x00 PREC=0x00 TTL=64 ID=23294 DF PROTO=UDP SPT=52328 DPT=2333 LEN=22 UID=2109 GID=2109
</code></pre><h2 id="result">Result</h2>
<p>I&rsquo;m now waiting for the virus to send its next request.</p>
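<p>Once a matching packet shows up, the <code>UID</code> field of the log line can be mapped back to an account. The following is only a minimal sketch: it reuses the sample line above instead of reading <code>/var/log/kern.log</code>, and it assumes <code>sed</code> and <code>getent</code> are available on the server.</p>
<pre tabindex="0"><code class="language-shell" data-lang="shell"># Sample line copied from the kern.log example above; in practice, feed in real log lines.
line='IN= OUT=wg0 SRC=10.1.92.3 DST=10.1.2.13 LEN=42 TOS=0x00 PREC=0x00 TTL=64 ID=23294 DF PROTO=UDP SPT=52328 DPT=2333 LEN=22 UID=2109 GID=2109'

# Extract the numeric UID (the leading space in " UID=" avoids matching "GID=").
uid=$(printf '%s\n' "$line" | sed -n 's/.* UID=\([0-9][0-9]*\).*/\1/p')
echo "uid: $uid"

# Resolve the UID to an account name; nothing is printed if the UID is unknown on this host.
getent passwd "$uid" | cut -d: -f1 || true
</code></pre>
<p>With the account name in hand, the remaining search is limited to that user&rsquo;s processes, cron jobs and recently installed packages.</p>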
]]></content:encoded></item><item><title>Using an SSH Reverse Tunnel to Log Into BitaHub Containers and Hold GPUs Long-Term</title><link>https://monsoon-cs.moe/2023-10-20-bitahub/</link><pubDate>Fri, 20 Oct 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-10-20-bitahub/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;Every year before CVPR, GPUs are always in short supply, and we need to borrow cards from elsewhere. USTC provides &lt;a href="https://bitahub.ustc.edu.cn/"&gt;BitaHub&lt;/a&gt; for on-campus users, but it suffers from the same shortage of cards before CVPR. At the same time, its job-submission-based usage model is very inconvenient: submitting jobs that occupy multiple cards often requires a long wait in the queue, and its data management approach is downright user-hostile.&lt;/p&gt;
&lt;p&gt;As the server administrator for my group, in order to make my life easier before CVPR and to avoid repeating the 2021 pre-CVPR ordeal of scrambling to allocate resources, I needed to improve the BitaHub experience:&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="problem">Problem</h2>
<p>Every year before CVPR, GPUs are always in short supply, and we need to borrow cards from elsewhere. USTC provides <a href="https://bitahub.ustc.edu.cn/">BitaHub</a> for on-campus users, but it suffers from the same shortage of cards before CVPR. At the same time, its job-submission-based usage model is very inconvenient: submitting jobs that occupy multiple cards often requires a long wait in the queue, and its data management approach is downright user-hostile.</p>
<p>As the server administrator for my group, in order to make my life easier before CVPR and to avoid repeating the 2021 pre-CVPR ordeal of scrambling to allocate resources, I needed to improve the BitaHub experience:</p>
<ol>
<li>How to hold GPUs long-term to avoid repeatedly queuing (slightly unethical, but a measure born of necessity);</li>
<li>How to conveniently read data from our own servers, instead of being forced to use BitaHub&rsquo;s user-hostile data management model;</li>
<li>How to make the BitaHub GPU experience as close as possible to that of our group&rsquo;s servers, lowering migration costs and improving the flexibility of resource scheduling.</li>
</ol>
<h2 id="idea">Idea</h2>
<p>Jobs in BitaHub run as docker containers, which gives us the possibility of configuring the environment we want inside the container, as long as we can somehow ssh into it.</p>
<p>After some investigation, I found that as long as the startup command does not stop running, a BitaHub container will keep running indefinitely and will not release its GPU resources. <strong>At the same time, BitaHub containers have network access</strong>, and the BitaHub web page even thoughtfully provides the ssh private key for the root user inside each job&rsquo;s container.</p>
<p>These facts give us an opportunity to exploit. All we need to do is run a tunnel program inside the container so that external parties can access port 22 of the container, and then we can log in and hold the resources long-term. Moreover, since the container has network access, we can also directly mount the file systems of other on-campus servers.</p>
<h2 id="solution">Solution</h2>
<p>The tunnel program I ended up choosing is <code>ssh</code>, which can create a reverse tunnel:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">ssh -i &lt;key_file&gt; -F none -o <span class="s2">&#34;StrictHostKeyChecking no&#34;</span> -o <span class="s2">&#34;ServerAliveInterval 15&#34;</span> -v -N -R &lt;port&gt;:localhost:22 jump@&lt;jumpserver&gt;
</span></span></code></pre></td></tr></table>
</div>
</div><p>On the <code>jumpserver</code>, create a user <code>jump</code> and authorize a dedicated key pair for it, then get the private key into the container (you could bake it directly into the image, but I chose a more convenient approach: create a BitaHub dataset to store it, and just add this dataset to every job).</p>
<p>The container&rsquo;s startup command is exactly the command above (considering network fluctuations, you can wrap it in a <code>while true</code> loop or use <code>autossh</code> to reconnect automatically). Once started, it creates a reverse tunnel on <code>&lt;port&gt;</code> of <code>&lt;jumpserver&gt;</code>, with <code>&lt;port&gt;</code> mapped to port <code>22</code> inside the container.</p>
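<p>The <code>while true</code> variant mentioned above could look roughly like the sketch below. The key path, port and jump host are hypothetical placeholders, and the function name <code>keep_tunnel</code> and the extra <code>ExitOnForwardFailure</code> option are my own additions rather than part of the original command.</p>
<pre tabindex="0"><code class="language-shell" data-lang="shell"># Hypothetical placeholders, replace them with your own values.
KEY_FILE=/path/to/key_file
PORT=10022
JUMP=jump@jumpserver.example

keep_tunnel() {
    while true; do
        # ExitOnForwardFailure makes ssh exit when the remote port cannot be bound,
        # so the loop retries instead of idling on a tunnel that forwards nothing.
        ssh -i "$KEY_FILE" -F none \
            -o "StrictHostKeyChecking no" \
            -o "ServerAliveInterval 15" \
            -o "ExitOnForwardFailure yes" \
            -N -R "$PORT:localhost:22" "$JUMP"
        echo "tunnel closed, reconnecting in 5 seconds"
        sleep 5
    done
}
</code></pre>
<p>Ending the container&rsquo;s startup script with a call to <code>keep_tunnel</code> keeps the job alive and re-establishes the tunnel after every disconnect.</p>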
<p>You can set <code>GatewayPorts yes</code> in the <code>sshd_config</code> of <code>&lt;jumpserver&gt;</code> so that the reverse tunnel listens on <code>0.0.0.0</code> instead of <code>127.0.0.1</code>. Otherwise, I would have to create a user on <code>&lt;jumpserver&gt;</code> for every person, or forward each port with <code>iptables</code>, which is far too tedious. Binding to <code>0.0.0.0</code> lets us access it directly from the existing VPN network.</p>
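<p>As a hedged sketch, the setting can also be scoped to the <code>jump</code> user with a <code>Match</code> block instead of being enabled globally. The two extra restrictions are assumptions of mine for a tunnel-only account, not part of my original setup:</p>
<pre tabindex="0"><code class="language-txt" data-lang="txt"># /etc/ssh/sshd_config on &lt;jumpserver&gt;
Match User jump
    GatewayPorts yes            # reverse tunnels bind to 0.0.0.0 instead of 127.0.0.1
    AllowTcpForwarding remote   # assumption: this account only needs -R forwarding
    PermitTTY no                # assumption: no interactive terminal for this account
</code></pre>
<p>Reload <code>sshd</code> after editing the file so that the change takes effect.</p>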
<p>There are many options for mounting a file system. Considering both security and convenience, I chose SSHFS. Exposing NFS directly to the public internet is too dangerous, while configuring NFS user authentication is too tedious. At the same time, the kernel that BitaHub uses to run containers neither loads the <code>wireguard</code> kmod nor maps <code>/dev/net/tun</code>, so we cannot use a VPN to protect data security. SSHFS can directly reuse the existing user authentication mechanism, and SSH traffic itself is also more likely to be let through by any potential data-center firewall.</p>
<p>Use the following command to mount SSHFS:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">sshfs -o reconnect,ServerAliveInterval<span class="o">=</span>15,ServerAliveCountMax<span class="o">=</span>30,ssh_command<span class="o">=</span><span class="s1">&#39;ssh -p &lt;dataserver_port&gt; -i &lt;key_file&gt;&#39;</span> &lt;user&gt;@&lt;dataserver&gt;:/path /path
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="postscript">Postscript</h2>
<p>TODO</p>
]]></content:encoded></item><item><title>Enabling QUIC in Nginx While Keeping SNI Routing</title><link>https://monsoon-cs.moe/2023-09-26-nginx-quic-with-ssl-preread/</link><pubDate>Tue, 26 Sep 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-09-26-nginx-quic-with-ssl-preread/</guid><description>&lt;h2 id="problem"&gt;Problem&lt;/h2&gt;
&lt;p&gt;Since version 1.25.0, Nginx&amp;rsquo;s support for QUIC &lt;a href="https://nginx.org/en/docs/quic.html"&gt;has been merged into mainline&lt;/a&gt;. Users who want to try it out can simply use the official &lt;code&gt;nginx&lt;/code&gt; docker image, which is very convenient.&lt;/p&gt;
&lt;p&gt;However, the nginx on my server uses &lt;a href="https://nginx.org/en/docs/stream/ngx_stream_ssl_preread_module.html"&gt;SNI routing&lt;/a&gt;, driven by the needs of a new generation of TLS-based proxy protocols such as &lt;a href="https://github.com/ihciah/shadow-tls"&gt;Shadow TLS&lt;/a&gt; and &lt;a href="https://github.com/XTLS/REALITY"&gt;Xray Reality&lt;/a&gt;. These proxy protocols cannot have their TLS layer handled by nginx on their behalf (unlike earlier protocols that could use gRPC/WebSocket and the like as their data transport). But in order to achieve the best camouflage effect, using the &lt;code&gt;443/tcp&lt;/code&gt; port is necessary (the whitelisted target sites used for camouflage generally only serve HTTPS on the &lt;code&gt;443/tcp&lt;/code&gt; port). Therefore, multiplexing the &lt;code&gt;443/tcp&lt;/code&gt; port is necessary.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="problem">Problem</h2>
<p>Since version 1.25.0, Nginx&rsquo;s support for QUIC <a href="https://nginx.org/en/docs/quic.html">has been merged into mainline</a>. Users who want to try it out can simply use the official <code>nginx</code> docker image, which is very convenient.</p>
<p>However, the nginx on my server uses <a href="https://nginx.org/en/docs/stream/ngx_stream_ssl_preread_module.html">SNI routing</a>, driven by the needs of a new generation of TLS-based proxy protocols such as <a href="https://github.com/ihciah/shadow-tls">Shadow TLS</a> and <a href="https://github.com/XTLS/REALITY">Xray Reality</a>. These proxy protocols cannot have their TLS layer handled by nginx on their behalf (unlike earlier protocols that could use gRPC/WebSocket and the like as their data transport). But in order to achieve the best camouflage effect, using the <code>443/tcp</code> port is necessary (the whitelisted target sites used for camouflage generally only serve HTTPS on the <code>443/tcp</code> port). Therefore, multiplexing the <code>443/tcp</code> port is necessary.</p>
<p>To make SNI routing and QUIC coexist, you only need to add <code>listen 443 quic</code> to each server in the original SNI routing configuration. An example configuration is shown below.</p>
<h2 id="configuration">Configuration</h2>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span><span class="lnt">41
</span><span class="lnt">42
</span><span class="lnt">43
</span><span class="lnt">44
</span><span class="lnt">45
</span><span class="lnt">46
</span><span class="lnt">47
</span><span class="lnt">48
</span><span class="lnt">49
</span><span class="lnt">50
</span><span class="lnt">51
</span><span class="lnt">52
</span><span class="lnt">53
</span><span class="lnt">54
</span><span class="lnt">55
</span><span class="lnt">56
</span><span class="lnt">57
</span><span class="lnt">58
</span><span class="lnt">59
</span><span class="lnt">60
</span><span class="lnt">61
</span><span class="lnt">62
</span><span class="lnt">63
</span><span class="lnt">64
</span><span class="lnt">65
</span><span class="lnt">66
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-nginx" data-lang="nginx"><span class="line"><span class="cl"><span class="k">http</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    <span class="c1"># ...
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">server</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">server_name</span> <span class="s">example.com</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># 443/tcp is already occupied by nginx stream, so it cannot be listened on again
</span></span></span><span class="line"><span class="cl">        <span class="c1"># listen 443 ssl http2 reuseport so_keepalive=on;
</span></span></span><span class="line"><span class="cl">        <span class="c1"># listen [::]:443 ssl http2 reuseport so_keepalive=on;
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Listen on the 443/udp port and enable QUIC
</span></span></span><span class="line"><span class="cl">        <span class="c1"># ref: https://nginx.org/en/docs/http/ngx_http_v3_module.html
</span></span></span><span class="line"><span class="cl">        <span class="kn">listen</span> <span class="mi">443</span> <span class="s">quic</span> <span class="s">reuseport</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">listen</span> <span class="s">[::]:443</span> <span class="s">quic</span> <span class="s">reuseport</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Listen on a unix domain socket to accept connections forwarded from stream; a local port can also be used
</span></span></span><span class="line"><span class="cl">        <span class="c1"># Accept proxy_protocol, otherwise the connection source address shown in the log will all be unix:
</span></span></span><span class="line"><span class="cl">        <span class="kn">listen</span> <span class="s">unix:/dev/shm/nginx-example.sock</span> <span class="s">ssl</span> <span class="s">http2</span> <span class="s">proxy_protocol</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">set_real_ip_from</span> <span class="s">unix:</span><span class="p">;</span>  <span class="c1"># Only override the source address for connections coming from the unix domain socket
</span></span></span><span class="line"><span class="cl">        <span class="kn">real_ip_header</span> <span class="s">proxy_protocol</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="kn">add_header</span> <span class="s">Alt-Svc</span> <span class="s">&#39;h3=&#34;:443&#34;</span><span class="p">;</span> <span class="kn">ma=86400&#39;</span><span class="p">;</span>  <span class="c1"># used to advertise the availability of HTTP/3
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># ...
</span></span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">server</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">server_name</span> <span class="s">foo.example.com</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Multiple domains can share 443/udp
</span></span></span><span class="line"><span class="cl">        <span class="kn">listen</span> <span class="mi">443</span> <span class="s">quic</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">listen</span> <span class="s">[::]:443</span> <span class="s">quic</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="kn">listen</span> <span class="s">unix:/dev/shm/nginx-example-foo.sock</span> <span class="s">ssl</span> <span class="s">http2</span> <span class="s">proxy_protocol</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">set_real_ip_from</span> <span class="s">unix:</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">real_ip_header</span> <span class="s">proxy_protocol</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="kn">add_header</span> <span class="s">Alt-Svc</span> <span class="s">&#39;h3=&#34;:443&#34;</span><span class="p">;</span> <span class="kn">ma=86400&#39;</span><span class="p">;</span>  <span class="c1"># used to advertise the availability of HTTP/3
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># ...
</span></span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">stream</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># ...
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Route based on TLS SNI
</span></span></span><span class="line"><span class="cl">    <span class="kn">map</span> <span class="nv">$ssl_preread_server_name</span> <span class="nv">$name</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kn">example.com</span>             <span class="s">unix:/dev/shm/nginx-example.sock</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">foo.example.com</span>         <span class="s">unix:/dev/shm/nginx-example-foo.sock</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">learn.microsoft.com</span>     <span class="n">127.0.0.1</span><span class="p">:</span><span class="mi">8443</span><span class="p">;</span>  <span class="c1"># used for shadow-tls/xray-reality, etc.
</span></span></span><span class="line"><span class="cl">        <span class="kn">default</span>                 <span class="s">unix:/dev/shm/nginx-default.sock</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kn">server</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Listen on 443/tcp and route based on SNI
</span></span></span><span class="line"><span class="cl">        <span class="kn">listen</span> <span class="mi">443</span> <span class="s">reuseport</span> <span class="s">so_keepalive=on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">listen</span> <span class="s">[::]:443</span> <span class="s">reuseport</span> <span class="s">so_keepalive=on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_pass</span> <span class="nv">$name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">ssl_preread</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="kn">proxy_protocol</span> <span class="no">on</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="testing">Testing</h2>
<p>At the time of writing, the <code>curl</code>/<code>wget</code> builds shipped by most distributions do not support HTTP/3 yet. You can use the <code>ymuski/curl-http3</code> docker image instead:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">$ docker run -it --rm ymuski/curl-http3 curl https://static.monsoon-cs.moe/public/ --http3 -IL
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">HTTP/3 <span class="m">200</span>
</span></span><span class="line"><span class="cl">server: nginx/1.25.2
</span></span><span class="line"><span class="cl">date: Tue, <span class="m">26</span> Sep <span class="m">2023</span> 14:52:29 GMT
</span></span><span class="line"><span class="cl">content-type: text/html<span class="p">;</span> <span class="nv">charset</span><span class="o">=</span>utf-8
</span></span><span class="line"><span class="cl">strict-transport-security: max-age<span class="o">=</span><span class="m">63072000</span>
</span></span><span class="line"><span class="cl">alt-svc: <span class="nv">h3</span><span class="o">=</span><span class="s2">&#34;:443&#34;</span><span class="p">;</span> <span class="nv">ma</span><span class="o">=</span><span class="m">86400</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="references">References</h2>
<ul>
<li><a href="https://nginx.org/en/docs/stream/ngx_stream_ssl_preread_module.html">https://nginx.org/en/docs/stream/ngx_stream_ssl_preread_module.html</a></li>
<li><a href="https://nginx.org/en/docs/http/ngx_http_v3_module.html">https://nginx.org/en/docs/http/ngx_http_v3_module.html</a></li>
</ul>
]]></content:encoded></item><item><title>Optimizing MKL Performance on AMD CPUs</title><link>https://monsoon-cs.moe/2023-06-19-mkl-on-amd/</link><pubDate>Mon, 19 Jun 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-06-19-mkl-on-amd/</guid><description>&lt;h2 id="the-problem"&gt;The Problem&lt;/h2&gt;
&lt;p&gt;My lab has some AMD EPYC 7713 servers. We bought them because some people in the group run programs with very high CPU load (I don&amp;rsquo;t know what kind of load it is, or why it can&amp;rsquo;t run on the GPU, and I don&amp;rsquo;t have the energy to help everyone solve it one by one). AMD processors with their many cores are a great fit for this kind of demand.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="the-problem">The Problem</h2>
<p>My lab has some AMD EPYC 7713 servers. We bought them because some people in the group run programs with very high CPU load (I don&rsquo;t know what kind of load it is, or why it can&rsquo;t run on the GPU, and I don&rsquo;t have the energy to help everyone solve it one by one). AMD processors with their many cores are a great fit for this kind of demand.</p>
<p>But as nice as AMD processors are, using them in a deep-learning lab brings an extra problem: the numpy and PyTorch installed by Anaconda both use MKL as their BLAS implementation by default, and MKL&rsquo;s library functions are also the hotspots of most high-CPU-load programs. However, <strong>MKL checks whether it is running on an Intel CPU, and if not, it does not use its optimized code paths.</strong></p>
<p>Since this is a deep-learning lab, few people have enough HPC background to compile suitable versions of numpy and PyTorch themselves, and it&rsquo;s hard for them to break away from Anaconda, so the dependency on MKL is hard to remove. For this reason I needed a solution that is <strong>transparent to ordinary users</strong>.</p>
<h2 id="the-solution">The Solution</h2>
<p>A widely circulated solution can be found via search engines: set the environment variable <code>MKL_DEBUG_CPU_TYPE=5</code>. This used to work, but <strong>it no longer works for MKL 2020 and later versions</strong>.</p>
<p>In the end I found a more clever solution <a href="https://documentation.sigma2.no/jobs/mkl.html">here</a>.</p>
<p>MKL calls a function <code>mkl_serv_intel_cpu_true()</code> to check whether it is running on an Intel CPU. As long as we provide a fake <code>mkl_serv_intel_cpu_true()</code> that always returns <code>1</code>, we can trick MKL into thinking it is running on an Intel CPU.</p>
<p>To do this, we can use Linux&rsquo;s <strong><code>LD_PRELOAD</code> mechanism</strong>. The dynamic linker loads the libraries listed in <code>LD_PRELOAD</code> before all others, so their symbols take precedence. As long as we compile the desired <code>mkl_serv_intel_cpu_true()</code> function into an <code>so</code> file and point <code>LD_PRELOAD</code> at it, our version is resolved ahead of MKL&rsquo;s own.</p>
<blockquote>
<p>I have often heard of the <code>LD_PRELOAD</code> mechanism being used for library-function hijacking attacks; here it counts as a clever use.</p>
</blockquote>
<h2 id="implementation">Implementation</h2>
<p>Create <code>mkl_trick.c</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">mkl_serv_intel_cpu_true</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Compile it with <code>gcc -shared -fPIC -o libmkl_trick.so mkl_trick.c</code>, and copy the generated <code>libmkl_trick.so</code> to <code>/usr/local/lib</code>.</p>
<p>Add the following to the shell&rsquo;s global initialization file:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">MKL_DEBUG_CPU_TYPE</span><span class="o">=</span><span class="m">5</span>  <span class="c1"># compatibility with older MKL versions</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">MKL_ENABLE_INSTRUCTIONS</span><span class="o">=</span>AVX2  <span class="c1"># optional, tells MKL it can use AVX2</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">LD_PRELOAD</span><span class="o">=</span>/usr/local/lib/libmkl_trick.so
</span></span></code></pre></td></tr></table>
</div>
</div><p>Some of my labmates use Bash and some use ZSH, so both need to be modified:</p>
<ul>
<li>Bash: create the file <code>/etc/profile.d/mkl.sh</code> and add the above content</li>
<li>ZSH: add it to <code>/etc/zsh/zshenv</code></li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://documentation.sigma2.no/jobs/mkl.html">https://documentation.sigma2.no/jobs/mkl.html</a></li>
</ul>
]]></content:encoded></item><item><title>VCB-Studio Technical Director Entry Test 2023 and My Answer</title><link>https://monsoon-cs.moe/2023-05-25-vcb/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-05-25-vcb/</guid><description>&lt;p&gt;See &lt;a href="https://vcb-s.com/archives/15949"&gt;original publication page&lt;/a&gt; for more details.&lt;/p&gt;
&lt;p&gt;All my answer files can be browsed &lt;a href="https://static.monsoon-cs.moe/public/VCB-Studio%20Entry%20Test%202023%20Answers/"&gt;here&lt;/a&gt;, or you can download the &lt;a href="https://static.monsoon-cs.moe/public/VCB-Studio%20Entry%20Test%202023%20Answers.zip"&gt;zipped file&lt;/a&gt; (5.9G).&lt;/p&gt;
&lt;h2 id="requirements"&gt;Requirements&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;This is a test for candidates who wish to participate in the training class organized by VCB-Studio. Finish as many problems as you can, and then do the following things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pack your answers, result files, and necessary attachments into a &lt;strong&gt;zip/rar/7z&lt;/strong&gt; file. Source files we provided and intermediate file in your encoding should not be packed in.&lt;/li&gt;
&lt;li&gt;Register a Baidu Net Disk account (&lt;a href="https://pan.baidu.com"&gt;https://pan.baidu.com&lt;/a&gt;), upload the zipped file and create a sharing link. Whether you like it or not, Baidu Net Disk has been the most effective way to share files within our team since day one. Other sharing methods will NOT be considered.&lt;/li&gt;
&lt;li&gt;Send the link via email to &lt;a href="mailto:vcbs.training@gmail.com"&gt;vcbs.training@gmail.com&lt;/a&gt; before &lt;strong&gt;Beijing Time (UTC+8) Monday, 23 Jan 2023, 23:59:59&lt;/strong&gt;. Late submissions will NOT be considered.&lt;/li&gt;
&lt;li&gt;Prepare a QQ account. The follow-up training courses will be conducted in the QQ group.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You should independently complete the answers without any public discussion. Any form of plagiarism will NOT be tolerated.&lt;/p&gt;</description><content:encoded><![CDATA[<p>See <a href="https://vcb-s.com/archives/15949">original publication page</a> for more details.</p>
<p>All my answer files can be browsed <a href="https://static.monsoon-cs.moe/public/VCB-Studio%20Entry%20Test%202023%20Answers/">here</a>, or you can download the <a href="https://static.monsoon-cs.moe/public/VCB-Studio%20Entry%20Test%202023%20Answers.zip">zipped file</a> (5.9G).</p>
<h2 id="requirements">Requirements</h2>
<blockquote>
<p>This is a test for candidates who wish to participate in the training class organized by VCB-Studio. Finish as many problems as you can, and then do the following things:</p>
<ol>
<li>Pack your answers, result files, and necessary attachments into a <strong>zip/rar/7z</strong> file. Source files we provided and intermediate file in your encoding should not be packed in.</li>
<li>Register a Baidu Net Disk account (<a href="https://pan.baidu.com">https://pan.baidu.com</a>), upload the zipped file and create a sharing link. Whether you like it or not, Baidu Net Disk has been the most effective way to share files within our team since day one. Other sharing methods will NOT be considered.</li>
<li>Send the link via email to <a href="mailto:vcbs.training@gmail.com">vcbs.training@gmail.com</a> before <strong>Beijing Time (UTC+8) Monday, 23 Jan 2023, 23:59:59</strong>. Late submissions will NOT be considered.</li>
<li>Prepare a QQ account. The follow-up training courses will be conducted in the QQ group.</li>
</ol>
<p>You should independently complete the answers without any public discussion. Any form of plagiarism will NOT be tolerated.</p>
<p>This test has 5 questions. For question 2 and 3, you can choose ONE of them. Choosing both then we will pick one with higher points. The answers should be made in English.</p>
</blockquote>
<h2 id="question1-15pt">Question1 (15pt)</h2>
<blockquote>
<p>Please describe yourself as who you are, where do you study, how do you come to know VCB-Studio and why are you interested in this project, etc. Please do not write more than 500 words, or approximately 1 page. (15pt)</p>
</blockquote>
<p><em>Answers are hidden for privacy reasons.</em></p>
<h2 id="question2-30pt">Question2 (30pt)</h2>
<blockquote>
<p>Scanned pictures (or simply scans) are an important part of BDRips, which are often released as lossless PNG, TIFF format or lossy JPG format. Scans feature high resolution and large size. In the file <strong>Q2.7z</strong>, two sets of pictures have been provided for you. PNGs are the source scans, and WEBPs are transcoded from PNGs according to VCB-Studio Collation specifications. Your tasks are:</p>
<ol>
<li>Summarize the format conversion rules of scans in VCB-Studio Collation specifications. (6pt)</li>
<li>Convert the sources to AVIF and JPEG-XL format, with sizes comparable to the WEBPs. (12pt)</li>
<li>Comment on the quality, encoding speed, and compatibility of AVIF and JPEG-XL, and why/why not you may recommend us switching to the new format as the upgrade for WEBP in 2023. (12pt)</li>
</ol>
<p>You are free to utilize existing tools, but you need to describe clearly where you find the tool and how to use it.</p>
</blockquote>
<h3 id="1-format-conversion-rules-of-scans-in-vcb-studio-collation-specifications">(1) Format conversion rules of scans in VCB-Studio Collation specifications</h3>
<p>Choose the format that gives the best image quality at a given file size, provided that compatibility is ensured.</p>
<h3 id="2-converting-test">(2) Converting test</h3>
<p>See <code>Q2/convert.py</code> for my conversion code. It uses the <code>Pillow</code>, <code>pillow-avif-plugin</code> and <code>jxlpy</code> libraries. <code>Pillow</code> is the image processing library I use most often; it supports <code>WEBP</code> but not <code>AVIF</code> or <code>JPEG-XL</code>, so I searched Google and found two <code>Pillow</code> plugins that add <code>AVIF</code> and <code>JPEG-XL</code> support.</p>
<p><code>PNG</code> and <code>WEBP Ref</code> are given images, and <code>WEBP Cus</code>, <code>AVIF</code>, <code>JPEG-XL</code> are custom encoded images.</p>
<p><code>WEBP Cus</code> is encoded by <code>Pillow</code>, which is backed by <code>libwebp</code>. Encoding speed is set to the slowest (<code>6</code>), and quality is set to <code>90</code> to match the size of the reference WEBP images.</p>
<p><code>AVIF</code> is encoded by <code>pillow-avif-plugin</code>, which is backed by <code>libavif</code>. Encoding speed is set to the slowest (<code>0</code>), and quality is set to <code>84</code> to get a size comparable to the reference WEBP images.</p>
<p><code>JPEG-XL</code> is encoded by <code>jxlpy</code>, which is backed by <code>libjxl</code>. Encoding speed is set to the slowest (<code>9</code>), decoding speed is also set to the slowest (<code>0</code>), and quality is set to <code>92</code> to get a size comparable to the reference WEBP images.</p>
<p>The following table shows the result:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Image</th>
					<th style="text-align: center">PNG (size)</th>
					<th style="text-align: center">WEBP Ref (size)</th>
					<th style="text-align: center">WEBP Cus (size/time)</th>
					<th style="text-align: center">AVIF (size/time)</th>
					<th style="text-align: center">JPEG-XL (size/time)</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center">01</td>
					<td style="text-align: center">26.97 MB</td>
					<td style="text-align: center">2.95 MB</td>
					<td style="text-align: center">2.95 MB / 3.36 s</td>
					<td style="text-align: center">2.77 MB / 37.77 s</td>
					<td style="text-align: center">2.56 MB / 32.00 s</td>
			</tr>
			<tr>
					<td style="text-align: center">02</td>
					<td style="text-align: center">26.25 MB</td>
					<td style="text-align: center">2.93 MB</td>
					<td style="text-align: center">2.94 MB / 3.27 s</td>
					<td style="text-align: center">2.71 MB / 34.87 s</td>
					<td style="text-align: center">2.48 MB / 33.07 s</td>
			</tr>
			<tr>
					<td style="text-align: center">03</td>
					<td style="text-align: center">3.60 MB</td>
					<td style="text-align: center">0.26 MB</td>
					<td style="text-align: center">0.26 MB / 0.37 s</td>
					<td style="text-align: center">0.28 MB / 11.48 s</td>
					<td style="text-align: center">0.28 MB / 5.12 s</td>
			</tr>
			<tr>
					<td style="text-align: center">04</td>
					<td style="text-align: center">21.78 MB</td>
					<td style="text-align: center">1.03 MB</td>
					<td style="text-align: center">1.03 MB / 2.06 s</td>
					<td style="text-align: center">1.32 MB / 29.56 s</td>
					<td style="text-align: center">1.39 MB / 32.25 s</td>
			</tr>
			<tr>
					<td style="text-align: center">05</td>
					<td style="text-align: center">2.65 MB</td>
					<td style="text-align: center">0.13 MB</td>
					<td style="text-align: center">0.13 MB / 0.24 s</td>
					<td style="text-align: center">0.15 MB / 9.29 s</td>
					<td style="text-align: center">0.18 MB / 4.11 s</td>
			</tr>
			<tr>
					<td style="text-align: center">06</td>
					<td style="text-align: center">2.66 MB</td>
					<td style="text-align: center">0.13 MB</td>
					<td style="text-align: center">0.13 MB / 0.25 s</td>
					<td style="text-align: center">0.15 MB / 9.39 s</td>
					<td style="text-align: center">0.16 MB / 3.81 s</td>
			</tr>
			<tr>
					<td style="text-align: center">07</td>
					<td style="text-align: center">24.38 MB</td>
					<td style="text-align: center">1.71 MB</td>
					<td style="text-align: center">1.71 MB / 2.25 s</td>
					<td style="text-align: center">1.67 MB / 27.78 s</td>
					<td style="text-align: center">1.68 MB / 35.59 s</td>
			</tr>
			<tr>
					<td style="text-align: center">08</td>
					<td style="text-align: center">55.52 MB</td>
					<td style="text-align: center">7.58 MB</td>
					<td style="text-align: center">7.58 MB / 26.48 s</td>
					<td style="text-align: center">7.93 MB / 83.44 s</td>
					<td style="text-align: center">6.36 MB / 72.90 s</td>
			</tr>
			<tr>
					<td style="text-align: center">09</td>
					<td style="text-align: center">44.39 MB</td>
					<td style="text-align: center">2.00 MB</td>
					<td style="text-align: center">2.00 MB / 3.53 s</td>
					<td style="text-align: center">1.99 MB / 59.79 s</td>
					<td style="text-align: center">2.47 MB / 71.73 s</td>
			</tr>
			<tr>
					<td style="text-align: center">10</td>
					<td style="text-align: center">41.59 MB</td>
					<td style="text-align: center">1.21 MB</td>
					<td style="text-align: center">1.21 MB / 3.11 s</td>
					<td style="text-align: center">1.16 MB / 59.99 s</td>
					<td style="text-align: center">1.70 MB / 63.65 s</td>
			</tr>
	</tbody>
</table>
<p><strong>PS</strong>: <code>pillow-avif-plugin</code> uses 8 threads to encode images (on i7-11700), and I didn&rsquo;t find an option to turn it off. Other encoders use only 1 thread. <code>jxlpy</code> example shows that it supports setting multithreading, but it doesn&rsquo;t work.</p>
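<p>As a quick sanity check that the three encodes really are of comparable total size, the per-image sizes from the table above can be summed. This is only a minimal sketch: the numbers are copied from the table, and the helper names <code>total_mb</code> and <code>relative_to_ref</code> are made up for illustration.</p>

```python
# Sizes in MB, copied from the table above (images 01 to 10).
webp_ref = [2.95, 2.93, 0.26, 1.03, 0.13, 0.13, 1.71, 7.58, 2.00, 1.21]
avif = [2.77, 2.71, 0.28, 1.32, 0.15, 0.15, 1.67, 7.93, 1.99, 1.16]
jpeg_xl = [2.56, 2.48, 0.28, 1.39, 0.18, 0.16, 1.68, 6.36, 2.47, 1.70]


def total_mb(sizes):
    """Total size of one image set in MB, rounded to two decimals."""
    return round(sum(sizes), 2)


def relative_to_ref(sizes, ref):
    """Total size of an image set as a fraction of the reference total."""
    return round(sum(sizes) / sum(ref), 3)


for name, sizes in [("WEBP Ref", webp_ref), ("AVIF", avif), ("JPEG-XL", jpeg_xl)]:
    print(name, total_mb(sizes), "MB,", relative_to_ref(sizes, webp_ref), "x reference")
```

<p>The totals stay within a few percent of each other, even though individual images (for example 08 and 10) deviate more.</p>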
<h3 id="3-comparison-and-comment">(3) Comparison and comment</h3>
<p>Quality comparison:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center"><code>PNG</code></th>
					<th style="text-align: center"><code>WEBP Ref</code></th>
					<th style="text-align: center"><code>AVIF</code></th>
					<th style="text-align: center"><code>JPEG-XL</code></th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/Q2_png.png"></td>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/Q2_webp.png"></td>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/Q2_avif.png"></td>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/Q2_jxl.png"></td>
			</tr>
	</tbody>
</table>
<p>Above is a crop from image 03 in each encoding. The <code>WEBP</code> image has severe smearing in dark areas, and an obvious color shift in the red dots at the upper left and lower right. The <code>AVIF</code> image shows less smearing, but its color shift is the same as in <code>WEBP</code>. The <code>JPEG-XL</code> image is the closest to the reference <code>PNG</code> image.</p>
<p>Detailed compatibility:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Format</th>
					<th style="text-align: center">Windows</th>
					<th style="text-align: center">macOS</th>
					<th style="text-align: center">Android</th>
					<th style="text-align: center">iOS</th>
					<th style="text-align: center">Chrome</th>
					<th style="text-align: center">Firefox</th>
					<th style="text-align: center">Safari</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center"><code>WEBP</code></td>
					<td style="text-align: center">≥10</td>
					<td style="text-align: center">≥11</td>
					<td style="text-align: center">≥4</td>
					<td style="text-align: center">≥14</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">✅</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>AVIF</code></td>
					<td style="text-align: center">≥10-1903</td>
					<td style="text-align: center">≥13</td>
					<td style="text-align: center">≥12</td>
					<td style="text-align: center">≥16</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">✅</td>
					<td style="text-align: center">✅</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>JPEG-XL</code></td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">❌</td>
			</tr>
	</tbody>
</table>
<p>PS: The results for Windows, macOS, Android and iOS were found via Google search. Browser compatibility information can be found at <a href="https://caniuse.com">https://caniuse.com</a>.</p>
<p>Summary:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Format</th>
					<th style="text-align: center">Quality</th>
					<th style="text-align: center">Encoding Speed</th>
					<th style="text-align: center">Compatibility</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center"><code>WEBP</code></td>
					<td style="text-align: center">worst</td>
					<td style="text-align: center">fast</td>
					<td style="text-align: center">good</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>AVIF</code></td>
					<td style="text-align: center">medium</td>
					<td style="text-align: center">slow</td>
					<td style="text-align: center">medium</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>JPEG-XL</code></td>
					<td style="text-align: center">best</td>
					<td style="text-align: center">slow</td>
					<td style="text-align: center">bad</td>
			</tr>
	</tbody>
</table>
<p>Due to its poor compatibility, <code>JPEG-XL</code> should not be considered an appropriate option. <code>AVIF</code> offers better image quality than <code>WEBP</code>, but it is only well supported on newer platforms and needs time for adoption, especially on the fragmented Android and Windows ecosystems. Although <code>WEBP</code> has a huge advantage in encoding speed, I don&rsquo;t think encoding speed needs to be considered: even for large images the encoding time is only about 1 minute, and the number of images is not large. Compared with video encoding, this time overhead is completely negligible.</p>
<p>In summary, I think <strong>now</strong> is not a suitable time to switch to <code>AVIF</code> or <code>JPEG-XL</code>. But in two years, it will be time for <code>AVIF</code> to show its strength.</p>
<h2 id="question3-30pt">Question3 (30pt)</h2>
<blockquote>
<p>Recently 32-bit audio tracks have appeared in some of the latest Hi-Res music. Although now we would not see these annoying 32-bit tracks in the Blu-ray, we have to start working on them in advance. In the file <strong>Q3.7z</strong>, two 32-bit PCM files are provided for you. Your tasks are:</p>
<ol>
<li>Learn about 32-bit tracks and tell the difference between these two files. (6pt)</li>
<li>Try to convert them to FLAC, ALAC, and WavPack losslessly. (15pt)</li>
<li>Consider various aspects such as compression rate, encoding speed, and playback compatibility and select the format you recommend most for 32-bit audio. (9pt)</li>
</ol>
<p>You are free to utilize existing tools, but you need to describe clearly where you find the tool and how to use it.</p>
</blockquote>
<h3 id="1">(1)</h3>
<p>Using <code>ffprobe</code> to get audio encoding info:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Input #0, wav, from &#39;01.wav&#39;:
</span></span><span class="line"><span class="cl">  Duration: 00:03:52.48, bitrate: 6144 kb/s
</span></span><span class="line"><span class="cl">  Stream #0:0: Audio: pcm_s32le ([1][0][0][0] / 0x0001), 96000 Hz, 2 channels, s32, 6144 kb/s
</span></span></code></pre></td></tr></table>
</div>
</div><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Input #0, wav, from &#39;02.wav&#39;:
</span></span><span class="line"><span class="cl">  Duration: 00:07:03.00, bitrate: 6144 kb/s
</span></span><span class="line"><span class="cl">  Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 96000 Hz, 2 channels, flt, 6144 kb/s
</span></span></code></pre></td></tr></table>
</div>
</div><p>The difference is that <code>01.wav</code> is encoded as <code>pcm_s32le</code>, while <code>02.wav</code> is encoded as <code>pcm_f32le</code>.</p>
<p><code>pcm_s32le</code> means PCM samples stored as 32-bit signed integers with little-endian byte ordering, while <code>pcm_f32le</code> means PCM samples stored as 32-bit floating-point numbers with little-endian byte ordering.</p>
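<p>To make the distinction concrete, here is a minimal sketch using only Python&rsquo;s standard <code>struct</code> module. It reproduces the 6144 kb/s bitrate reported by <code>ffprobe</code> and shows that the same four little-endian bytes decode to very different values depending on whether they are read as a signed integer or as a float. The sample value <code>1.0</code> and the helper name <code>pcm_bitrate_kbps</code> are assumptions chosen for illustration; they are not taken from the test files.</p>

```python
import struct

SAMPLE_RATE_HZ = 96000   # from the ffprobe output
CHANNELS = 2             # from the ffprobe output
BITS_PER_SAMPLE = 32     # both s32 and f32 use 32 bits per sample


def pcm_bitrate_kbps(rate_hz, channels, bits):
    """Uncompressed PCM bitrate in kb/s (1 kb = 1000 bits)."""
    return rate_hz * channels * bits // 1000


print(pcm_bitrate_kbps(SAMPLE_RATE_HZ, CHANNELS, BITS_PER_SAMPLE), "kb/s")

# One illustrative sample: full-scale 1.0 stored as a little-endian float.
raw = struct.pack("<f", 1.0)

as_float = struct.unpack("<f", raw)[0]  # how a pcm_f32le reader sees the bytes
as_int = struct.unpack("<i", raw)[0]    # how a pcm_s32le reader sees the bytes

print(raw.hex(), as_float, as_int)
```

<p>Both layouts have the same size and bitrate, so the container&rsquo;s format tag (<code>0x0001</code> versus <code>0x0003</code> in the <code>ffprobe</code> output) is what tells a decoder how to interpret the samples.</p>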
<h3 id="2">(2)</h3>
<p>I first tried to convert them losslessly using FFmpeg. If FFmpeg failed, I used Google to find a suitable codec.</p>
<p>This is the result of my attempt:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Format</th>
					<th style="text-align: center">32-bit integer</th>
					<th style="text-align: center">32-bit float</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center"><code>FLAC</code></td>
					<td style="text-align: center">FFmpeg ❌<br><a href="https://github.com/xiph/flac"><code>flac</code></a> (from v1.4.0) ✅</td>
					<td style="text-align: center">FFmpeg ❌</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>ALAC</code></td>
					<td style="text-align: center">FFmpeg (decoding only)<br><a href="https://github.com/nu774/qaac"><code>qaac</code></a> (backed by Apple <code>CoreAudioToolbox</code>) ✅</td>
					<td style="text-align: center">FFmpeg ❌</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>WavPack</code></td>
					<td style="text-align: center">FFmpeg ✅</td>
					<td style="text-align: center">FFmpeg ✅</td>
			</tr>
	</tbody>
</table>
<p>The conversion command:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Format</th>
					<th style="text-align: center">32-bit integer</th>
					<th style="text-align: center">32-bit float</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center"><code>FLAC</code></td>
					<td style="text-align: center"><code>flac -o 01.flac 01.wav</code></td>
					<td style="text-align: center">❌</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>ALAC</code></td>
					<td style="text-align: center"><code>qaac64 -b 32 --alac -i 01.wav -o 01.m4a</code></td>
					<td style="text-align: center">❌</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>WavPack</code></td>
					<td style="text-align: center"><code>ffmpeg -i 01.wav 01.wv</code></td>
					<td style="text-align: center"><code>ffmpeg -i 02.wav 02.wv</code></td>
			</tr>
	</tbody>
</table>
<p>The resulting files are <code>Q3/01.flac</code>, <code>Q3/01.m4a</code>, <code>Q3/01.wv</code> and <code>Q3/02.wv</code>.</p>
<h3 id="3">(3)</h3>
<p>Encoding speed and compression rate of different encoding methods:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Format</th>
					<th style="text-align: center"><code>WAV</code> file size / encoded file size</th>
					<th style="text-align: center">audio time / encoding time</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center"><code>FLAC s32</code></td>
					<td style="text-align: center">1.337</td>
					<td style="text-align: center">128.44</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>ALAC s32</code></td>
					<td style="text-align: center">1.304</td>
					<td style="text-align: center">69.81</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>WavPack s32</code></td>
					<td style="text-align: center">1.280</td>
					<td style="text-align: center">121.08</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>WavPack f32</code></td>
					<td style="text-align: center">1.489</td>
					<td style="text-align: center">109.02</td>
			</tr>
	</tbody>
</table>
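<p>The speed column is a ratio, so it helps to convert it back into wall-clock time. The sketch below does that from the durations reported by <code>ffprobe</code> in (1). It is a rough back-of-the-envelope calculation, and the helper names are made up for illustration.</p>

```python
def hms_to_seconds(text):
    """Convert an ffprobe-style HH:MM:SS.ss duration to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Durations from the ffprobe output in (1).
durations = {
    "01.wav": hms_to_seconds("00:03:52.48"),
    "02.wav": hms_to_seconds("00:07:03.00"),
}

# (source file, audio time / encoding time) from the table above.
speed_ratio = {
    "FLAC s32": ("01.wav", 128.44),
    "ALAC s32": ("01.wav", 69.81),
    "WavPack s32": ("01.wav", 121.08),
    "WavPack f32": ("02.wav", 109.02),
}


def encoding_seconds(fmt):
    """Approximate wall-clock encoding time in seconds for one format."""
    source, ratio = speed_ratio[fmt]
    return round(durations[source] / ratio, 2)


for fmt in speed_ratio:
    print(fmt, encoding_seconds(fmt), "s")
```

<p>Every encode finishes within a few seconds, so the differences in encoding speed matter little in practice.</p>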
<p>Summary:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center"></th>
					<th style="text-align: center"><code>FLAC s32</code></th>
					<th style="text-align: center"><code>FLAC f32</code></th>
					<th style="text-align: center"><code>ALAC s32</code></th>
					<th style="text-align: center"><code>ALAC f32</code></th>
					<th style="text-align: center"><code>WavPack s32</code></th>
					<th style="text-align: center"><code>WavPack f32</code></th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center">Compression rate</td>
					<td style="text-align: center">best</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">medium</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">worst</td>
					<td style="text-align: center">-</td>
			</tr>
			<tr>
					<td style="text-align: center">Encoding speed</td>
					<td style="text-align: center">very fast</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">fast</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">very fast</td>
					<td style="text-align: center">very fast</td>
			</tr>
			<tr>
					<td style="text-align: center">Playback compatibility</td>
					<td style="text-align: center">bad (<code>flac</code> only)</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">good (FFmpeg)</td>
					<td style="text-align: center">❌</td>
					<td style="text-align: center">good (FFmpeg)</td>
					<td style="text-align: center">good (FFmpeg)</td>
			</tr>
	</tbody>
</table>
<p>Because FFmpeg is the de facto standard multimedia codec library used by most video players, 32-bit <code>FLAC</code>, which can only be decoded by <code>flac</code>, is not suitable. <code>WavPack</code> encodes faster than <code>ALAC</code>, but since all three formats are fast in absolute terms (compared to video encoding), this advantage is of little value. Finally, <code>ALAC</code> has a better compression rate than <code>WavPack</code>, which saves file size.</p>
<p>To sum up, I recommend <code>ALAC</code> for encoding 32-bit audio. But if floating-point encoding is required (which is rare), <code>WavPack</code> is the only choice.</p>
<h2 id="question4-35pt">Question4 (35pt)</h2>
<blockquote>
<p>MSU publishes video encoder tests every year, with the latest one here:
<a href="https://compression.ru/video/codec_comparison/2021/main_report.html">https://compression.ru/video/codec_comparison/2021/main_report.html</a>.</p>
<p>For the first time last year, H.266 (VVC) encoders participated in the tests and they performed well in terms of encoding quality in the slow encoding (1 fps) test.</p>
<ol>
<li>Choose any of the H.266 (VVC) or AV1 encoders in the figure below, and then encode the source file <em>Q4 [E46686C4].m2ts</em> with no more than 2500 Kbps of video bitrate. You&rsquo;d better use 10bit variants of these encoders, which facilitates the comparison later. In addition, you need to describe clearly where you found the encoder and state the version and parameters you used. If you use H.266 (VVC) encoder, you will get additional 5pt. (10pt+5pt)</li>
<li>We provide an AV1 video file <em>Q4_AV1 [41A7EDDA].mkv</em>, which was encoded via SVT-AV1 10bit encoder without any pre-processing. Comment on the picture quality compared to the source file. When you compare the picture quality, you may want to sample a few frames, attach some screenshots, and comment on the performance of dark scenes and moving scenes. (10pt)</li>
<li>Now compare your own encoding to the given AV1 file in terms of picture quality, encoding speed, and playback compatibility. As a reference, we encoded the above AV1 file at 1.0 fps. (10pt)</li>
</ol>
</blockquote>
<p><img loading="lazy" src="/2023-05-25-vcb/pics/Q4.png"></p>
<h3 id="1-vvc-encoding">(1) VVC encoding</h3>
<p>The testing hardware and software environment is:</p>
<ul>
<li>Encoder: <a href="https://github.com/fraunhoferhhi/vvenc">VVenC</a> v1.7.0.</li>
<li>Compiler: AMD Optimizing C/C++ Compiler 4.0.0.</li>
<li>CPU: 2 x AMD EPYC 7713, 128 cores / 256 threads in total.</li>
<li>RAM: 16 channel DDR4-3200.</li>
<li>OS: Ubuntu 18.04.6.</li>
</ul>
<p>First, use <code>ffmpeg</code> to convert <code>Q4 [E46686C4].m2ts</code> to raw <code>yuv420p10</code> video:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">ffmpeg -i <span class="s2">&#34;Q4 [E46686C4].m2ts&#34;</span> -pix_fmt yuv420p10 Q4_yuv420p10.yuv
</span></span></code></pre></td></tr></table>
</div>
</div><p>The parameter <code>-pix_fmt yuv420p10</code> tells <code>ffmpeg</code> to output the raw video in <code>yuv420p10</code> format.</p>
<p>Then, use <code>vvencapp</code> to encode the raw video:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">vvencapp --input Q4_yuv420p10.yuv --size 1920x1080 --format yuv420_10 --fps 24000/1001 --preset &lt;preset&gt; --bitrate 2500kbps --output Q4_VVC.vvc
</span></span></code></pre></td></tr></table>
</div>
</div><p>Meaning of the parameters:</p>
<ul>
<li><code>--size 1920x1080</code>: indicating the input raw video frame size is 1920x1080.</li>
<li><code>--format yuv420_10</code>: same as <code>yuv420p10</code> meaning in <code>ffmpeg</code>.</li>
<li><code>--fps 24000/1001</code>: indicating the output video fps is <code>23.976</code> (same as original <code>m2ts</code> file).</li>
<li><code>--preset &lt;preset&gt;</code>: preset combination of VVC encoding parameters. Available options are <code>faster</code>, <code>fast</code>, <code>medium</code>, <code>slow</code> and <code>slower</code>. Detailed settings are listed in <a href="https://github.com/fraunhoferhhi/vvenc/blob/master/cfg/">https://github.com/fraunhoferhhi/vvenc/blob/master/cfg/randomaccess_*.cfg</a>.</li>
<li><code>--bitrate 2500kbps</code>: controlling the output encoded video bitrate to about <code>2500kbps</code>.</li>
</ul>
<table>
	<thead>
			<tr>
					<th style="text-align: center">File</th>
					<th style="text-align: center">Preset</th>
					<th style="text-align: center">FPS</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center"><code>Q4_VVC_faster.vvc</code></td>
					<td style="text-align: center"><code>faster</code></td>
					<td style="text-align: center">5.762</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>Q4_VVC_fast.vvc</code></td>
					<td style="text-align: center"><code>fast</code></td>
					<td style="text-align: center">2.156</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>Q4_VVC_medium.vvc</code></td>
					<td style="text-align: center"><code>medium</code></td>
					<td style="text-align: center">0.557</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>Q4_VVC_slow.vvc</code></td>
					<td style="text-align: center"><code>slow</code></td>
					<td style="text-align: center">0.177</td>
			</tr>
			<tr>
					<td style="text-align: center"><code>Q4_VVC_slower.vvc</code></td>
					<td style="text-align: center"><code>slower</code></td>
					<td style="text-align: center">0.058</td>
			</tr>
	</tbody>
</table>
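<p>Running all five presets by hand is tedious, so the command lines can be generated in a loop. The following is a minimal sketch that only builds the argument lists, using exactly the <code>vvencapp</code> options shown above and the output file names from the table. The function name <code>vvenc_command</code> is hypothetical, and it assumes the raw input file sits in the current directory.</p>

```python
PRESETS = ["faster", "fast", "medium", "slow", "slower"]


def vvenc_command(preset, source="Q4_yuv420p10.yuv"):
    """Build the vvencapp argument list for one preset."""
    return [
        "vvencapp",
        "--input", source,
        "--size", "1920x1080",
        "--format", "yuv420_10",
        "--fps", "24000/1001",
        "--preset", preset,
        "--bitrate", "2500kbps",
        "--output", f"Q4_VVC_{preset}.vvc",
    ]


commands = [vvenc_command(preset) for preset in PRESETS]

for command in commands:
    # Print only; pass each list to subprocess.run(command, check=True) to encode.
    print(" ".join(command))
```

<p>Keeping the arguments as a list rather than one string avoids shell quoting problems if the input path contains spaces.</p>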
<h3 id="2-comparing-source-video-and-reference-av1-encoded-video">(2) Comparing source video and reference <code>AV1</code> encoded video</h3>
<p>The video player used is <a href="https://github.com/MartinEesmaa/VVCEasy">MPV with libvvdec &amp; xHE-AAC support</a>, configured according to <a href="https://vcb-s.com/archives/7594">https://vcb-s.com/archives/7594</a>.</p>
<p>Dynamic fire against a dark background is a highly challenging scene. Compared to the original video, there are color blocks around the flame in the AV1 video, which is a common problem when the bitrate is insufficient.</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Encoding Method</th>
					<th style="text-align: center">Capture</th>
					<th style="text-align: center">File</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center">Original</td>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/m2ts-flame.png"></td>
					<td style="text-align: center"><code>pics/m2ts-flame.png</code></td>
			</tr>
			<tr>
					<td style="text-align: center">AV1</td>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/av1-flame.png"></td>
					<td style="text-align: center"><code>pics/av1-flame.png</code></td>
			</tr>
	</tbody>
</table>
<h3 id="3-comparing-custom-vvc-encoded-video-and-reference-av1-encoded-video">(3) Comparing custom <code>VVC</code> encoded video and reference <code>AV1</code> encoded video</h3>
<p>The player is the same as in (2). To keep the comparison with the AV1 video reasonable, I chose the VVC video encoded with the <code>medium</code> preset, which encodes at 0.557 fps.</p>
<p>In the flame scene, the VVC video looks much better than the AV1 video: the color blocks are less obvious and the picture is closer to the original.</p>
<table>
	<thead>
			<tr>
					<th style="text-align: center">Encoding Method</th>
					<th style="text-align: center">Capture</th>
					<th style="text-align: center">File</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: center">Original</td>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/m2ts-flame.png"></td>
					<td style="text-align: center"><code>pics/m2ts-flame.png</code></td>
			</tr>
			<tr>
					<td style="text-align: center">AV1</td>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/av1-flame.png"></td>
					<td style="text-align: center"><code>pics/av1-flame.png</code></td>
			</tr>
			<tr>
					<td style="text-align: center">VVC (medium)</td>
					<td style="text-align: center"><img loading="lazy" src="/2023-05-25-vcb/pics/vvc-flame.png"></td>
					<td style="text-align: center"><code>pics/vvc-flame.png</code></td>
			</tr>
	</tbody>
</table>
<h2 id="question5-20pt">Question5 (20pt)</h2>
<blockquote>
<p>When we check an encoded file, we need to locate frames that have been encoded exceptionally awful. We use algorithms like PSNR to evaluate the similarity of each frame in the encoded file to the source file. The result is an array of scores, where the i-th score is tied to the i-th frame. These scores are called raw scores. However, what we are concerned about is the standard score, which is the raw score minus a threshold. A frame with a standard score less than 0 is considered a bad frame. The tasks are:</p>
<ol>
<li>
<p>Find the worst frame, i.e. the one with the lowest standard score among the bad frames, and output its index. If there is more than one worst frame, output the first. If there are no bad frames, output <code>-1</code>. Frames with a standard score of exactly <code>0</code> are not considered as bad frames. (10pt)</p>
<p><strong>Input:</strong>
2 lines. The first line is two integers that represent the number of frames <code>N</code> and the threshold value <code>S</code>. The second row is an array of integers <code>A[N]</code>, representing the raw score of each frame.</p>
<p>For all the data, <code>1&lt;=N&lt;=200000</code>, <code>0&lt;S&lt;100</code>, <code>0&lt;=A[i]&lt;=100</code></p>
<p><strong>Output:</strong>
An integer, the index of the worst frame. The index starts from <code>0</code>. If there is more than one worst frame, output the first. If there are no bad frames, output <code>-1</code>.</p>
<p><strong>Sample:</strong></p>
<pre tabindex="0"><code>Input
10 30
42 31 44 23 21 26 31 41 50 72

Output
4
</code></pre></li>
<li>
<p>Find a continuous sequence of frames that minimizes the sum of their standard scores and output this minimum value. Full scores will only be given if the time complexity of your algorithm is optimal. (10pt)</p>
<p><strong>Input:</strong>
The same as (1).</p>
<p><strong>Output:</strong>
An integer, the minimum sum value.</p>
<p><strong>Sample:</strong></p>
<pre tabindex="0"><code>Input
10 30
42 31 44 23 21 26 31 41 50 72

Output
-20
</code></pre></li>
</ol>
<p>For each sub question, use C/C++/Java/Python/C# to write a console program. Read the input from the standard input and write it to standard output. Do NOT use libraries other than built-in ones (for example, no “import numpy as np”). Submit your source code.</p>
</blockquote>
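<p>As a sanity check on the two samples, here is a minimal sketch of both tasks written as plain functions. The names <code>worst_frame</code> and <code>min_run_sum_bruteforce</code> are hypothetical placeholders, not part of the problem statement or of the submitted files. The second function is a deliberately naive O(N^2) reference that assumes the sequence must contain at least one frame; it is only meant for cross-checking a faster solution on small inputs.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;

/* Hypothetical helper: index of the first frame with the lowest standard
   score below 0, or -1 if no frame is bad. */
int worst_frame(const int *raw, size_t n, int threshold) {
    int worst_idx = -1;
    int worst_std = 0; /* only strictly negative standard scores qualify */
    for (size_t i = 0; i &lt; n; i++) {
        int std_score = raw[i] - threshold;
        if (std_score &lt; worst_std) { /* strict '&lt;' keeps the first of equal minima */
            worst_std = std_score;
            worst_idx = (int)i;
        }
    }
    return worst_idx;
}

/* Hypothetical O(N^2) reference: minimum sum over every non-empty
   contiguous run of standard scores. Assumes n &gt;= 1. */
long min_run_sum_bruteforce(const int *raw, size_t n, int threshold) {
    long best = (long)raw[0] - threshold;
    for (size_t start = 0; start &lt; n; start++) {
        long sum = 0;
        for (size_t end = start; end &lt; n; end++) {
            sum += raw[end] - threshold;
            if (sum &lt; best) {
                best = sum;
            }
        }
    }
    return best;
}
</code></pre>
<p>With the sample input (<code>N = 10</code>, <code>S = 30</code>), the standard scores are <code>12 1 14 -7 -9 -4 1 11 20 42</code>, so <code>worst_frame</code> returns <code>4</code> and <code>min_run_sum_bruteforce</code> returns <code>-20</code> (frames 3 to 5).</p>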
<h3 id="1-find-the-worst-frame">(1) Find the worst frame</h3>
<p>The following code is consistent with <code>Q5/q5-1.c</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">frame_num</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">threshold</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">scanf</span><span class="p">(</span><span class="s">&#34;%d%d&#34;</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">frame_num</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">threshold</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">worst_idx</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">worst_rate</span> <span class="o">=</span> <span class="mi">101</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">frame_num</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kt">int</span> <span class="n">rate</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="nf">scanf</span><span class="p">(</span><span class="s">&#34;%d&#34;</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rate</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">rate</span> <span class="o">&lt;</span> <span class="n">threshold</span> <span class="o">&amp;&amp;</span> <span class="n">rate</span> <span class="o">&lt;</span> <span class="n">worst_rate</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">worst_rate</span> <span class="o">=</span> <span class="n">rate</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="n">worst_idx</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="nf">printf</span><span class="p">(</span><span class="s">&#34;%d&#34;</span><span class="p">,</span> <span class="n">worst_idx</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="2-find-minimum-subsequence-sum">(2) Find minimum subsequence sum</h3>
<p><strong>PS:</strong> The problem statement is ambiguous about whether a sequence of length 0 is allowed. This matters when all standard scores are positive: the output should be 0 if the empty sequence may be selected, or the smallest single score if the sequence must contain at least one frame. The code I submitted follows the second interpretation (sequence length is at least 1). If the first interpretation (length 0 is allowed) is correct, please comment out <code>int min_sum = 101;</code> and uncomment <code>int min_sum = 0;</code>.</p>
<p>The following code is consistent with <code>Q5/q5-2.c</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">frame_num</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">threshold</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="nf">scanf</span><span class="p">(</span><span class="s">&#34;%d%d&#34;</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">frame_num</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">threshold</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">min_sum</span> <span class="o">=</span> <span class="mi">101</span><span class="p">;</span> <span class="c1">// when all scores &gt; 0, output the minimum
</span></span></span><span class="line"><span class="cl">    <span class="c1">// int min_sum = 0; // when all scores &gt; 0, output 0
</span></span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">frame_num</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="kt">int</span> <span class="n">rate</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="nf">scanf</span><span class="p">(</span><span class="s">&#34;%d&#34;</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rate</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">rate</span> <span class="o">-=</span> <span class="n">threshold</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">sum</span> <span class="o">+=</span> <span class="n">rate</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">sum</span> <span class="o">&lt;</span> <span class="n">min_sum</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="n">min_sum</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="c1">// Reset in a separate check so a positive running sum is never carried forward.
</span></span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="n">sum</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="nf">printf</span><span class="p">(</span><span class="s">&#34;%d&#34;</span><span class="p">,</span> <span class="n">min_sum</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
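<p>The two interpretations discussed in the PS can also be expressed as a single function with a flag. This is only a sketch of the linear scan (Kadane's algorithm applied to the minimum instead of the maximum), not the submitted file: the name <code>min_run_sum</code> and the <code>allow_empty</code> parameter are hypothetical, and it assumes <code>n &gt;= 1</code> as guaranteed by the constraints.</p>
<pre tabindex="0"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;

/* Hypothetical function form of the O(N) scan.
   allow_empty != 0: the zero-length sequence (sum 0) is a valid answer.
   allow_empty == 0: the sequence must contain at least one frame. */
long min_run_sum(const int *raw, size_t n, int threshold, int allow_empty) {
    long best = allow_empty ? 0 : (long)raw[0] - threshold;
    long sum = 0; /* smallest sum of a run ending at the current frame */
    for (size_t i = 0; i &lt; n; i++) {
        sum += raw[i] - threshold;
        if (sum &lt; best) {
            best = sum;
        }
        if (sum &gt; 0) {
            sum = 0; /* a positive prefix can only make later runs worse */
        }
    }
    return best;
}
</code></pre>
<p>The flag only changes the result when every standard score is positive. For raw scores <code>35 33</code> with threshold <code>30</code>, the call with <code>allow_empty = 0</code> returns <code>3</code>, while <code>allow_empty = 1</code> returns <code>0</code>. On the sample input both calls return <code>-20</code>.</p>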
</div>]]></content:encoded></item><item><title>Hello World</title><link>https://monsoon-cs.moe/2023-03-29-hello-world/</link><pubDate>Wed, 29 Mar 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2023-03-29-hello-world/</guid><description>&lt;p&gt;&lt;strong&gt;My first post on blog!&lt;/strong&gt;&lt;/p&gt;</description><content:encoded>&lt;p>&lt;strong>My first post on blog!&lt;/strong>&lt;/p>
</content:encoded></item></channel></rss>