<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>arxiv on Monsoon's Blog</title><link>https://monsoon-cs.moe/tags/arxiv/</link><description>Recent content in arxiv on Monsoon's Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 07 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://monsoon-cs.moe/tags/arxiv/index.xml" rel="self" type="application/rss+xml"/><item><title>[Paper Reading] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs (arXiv'24)</title><link>https://monsoon-cs.moe/2024-02-07-paper-reading-arxiv24-acs/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-02-07-paper-reading-arxiv24-acs/</guid><description>&lt;blockquote&gt;
&lt;p&gt;This blog is a write-up of the paper &amp;ldquo;&lt;a href="https://arxiv.org/abs/2401.12377"&gt;ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs&lt;/a&gt;&amp;rdquo; from arXiv'24.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Some workloads (e.g., Simulation Engines for Deep RL, Dynamic DNNs) cannot fully utilize the massive parallelism of GPUs (see Figure 1). The main reason is that these workloads contain lots of &lt;strong&gt;small kernels&lt;/strong&gt; which cannot fully utilize the GPU, and these kernels are not executed concurrently, although &lt;strong&gt;most of them are independent and in theory can be executed concurrently&lt;/strong&gt;.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
<p>This blog is a write-up of the paper &ldquo;<a href="https://arxiv.org/abs/2401.12377">ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs</a>&rdquo; from arXiv'24.</p>
</blockquote>
<h2 id="motivation">Motivation</h2>
<p>Some workloads (e.g., Simulation Engines for Deep RL, Dynamic DNNs) cannot fully utilize the massive parallelism of GPUs (see Figure 1). The main reason is that these workloads contain lots of <strong>small kernels</strong> which cannot fully utilize the GPU, and these kernels are not executed concurrently, although <strong>most of them are independent and in theory can be executed concurrently</strong>.</p>
<p><img alt="Figure 1. Achieved Occupancy of simulation engines (up) and dynamic DNN (down)" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/achieved_occ.png"></p>
<p>However, executing these kernels concurrently poses some challenges:</p>
<ol>
<li><strong>Input-dependent kernel dependencies</strong>. For some workloads, the dependencies between kernels are only <strong>determined at runtime</strong> for each input. Constructing the full computational graph and resolving dependencies before execution introduces <strong>high latency</strong> (see Figure 2; on average 47% of overall execution time, according to the paper).</li>
</ol>
<p><img alt="Figure 2. DAG construction time as % of execution time" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/dag_time.png"></p>
<ol start="2">
<li><strong>Irregular kernel dependencies</strong>. Some workloads have irregular computational graphs. We can partition the computational graph of the workload into independent streams of kernels, but this requires <strong>fine-grained scheduling</strong> and <strong>synchronization</strong>, which incur <strong>large overhead</strong> (see Figure 3).</li>
</ol>
<p><img alt="Figure 3. Kernel launch and synchronization overheads" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/sync_overhead.png"></p>
<p>Existing solutions:</p>
<ol>
<li>
<p>CUDA Graph and AMD ATMI. They allow users to specify dependencies between kernels as a DAG, and can eliminate the synchronization and kernel launch overhead. But the DAG needs to be constructed in <strong>full before execution</strong>, which makes them unsuitable for dynamic kernel dependencies (such as Dynamic DNNs).</p>
</li>
<li>
<p>Events provided by the CUDA stream management API. They allow synchronization between kernels across streams through the <code>cudaStreamWaitEvent</code> API, without blocking the host. But this approach still requires deriving dependencies between all kernels beforehand.</p>
</li>
<li>
<p>Persistent threads (PT) can eliminate the scheduling and launch overheads, but are only effective when all kernels are homogeneous.</p>
<blockquote>
<p>PT is similar to coroutines in some programming languages.</p>
</blockquote>
</li>
<li>
<p>CUDA dynamic parallelism (CDP) and AMD’s device enqueue (DE) enable parent kernels to launch child kernels, but they only allow data dependencies between one parent and its children (so they cannot be used to synchronize between multiple tasks).</p>
</li>
</ol>
<h2 id="design">Design</h2>
<p>The <strong>goal</strong> of this paper is to design a framework that enables efficient concurrent execution of GPU kernels with:</p>
<ol>
<li>
<p>lightweight detection of inter-kernel dependencies at runtime,</p>
</li>
<li>
<p>low overhead kernel scheduling and synchronization.</p>
</li>
</ol>
<p><strong>The key idea is to perform the dependence checking and scheduling within a small window of kernels at runtime similar to out-of-order instruction scheduling.</strong></p>
<p>The authors propose Automatic Concurrent Scheduling (ACS) as a solution. The overall design of ACS-SW is shown in Figure 4. It contains three main functionalities:</p>
<p><img alt="Figure 4. ACS-SW Overview" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/acs_overview.png"></p>
<ol>
<li>
<p><strong>Determining inter-kernel dependencies</strong>. By checking for <strong>overlaps between read segments and write segments</strong>, we determine dependencies between kernels. For a wide range of commonly used kernels (e.g., matrix multiplication, convolution), we can infer the read and write segments from the input easily. But for some kernels, it&rsquo;s impossible to determine the range of memory accessed statically because of the potential indirect memory accesses, so the authors just assume the <strong>entire GPU memory may be accessed</strong>.</p>
<p><img alt="Memory regions written to/accessed by the kernel" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/seg.png"></p>
<p>The authors use a kernel wrapper to perform the dependency detection. <code>get_addresses()</code> is called at kernel launch to populate <code>__read_segments__</code> and <code>__write_segments__</code>.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">ACE_wrapper</span> <span class="p">{</span> 
</span></span><span class="line"><span class="cl">  <span class="c1">//list of read,write segments defined as
</span></span></span><span class="line"><span class="cl">  <span class="c1">//[{start_adr1,size1},{start_adr2,size2}..]
</span></span></span><span class="line"><span class="cl">  <span class="n">list</span> <span class="n">__read_segments__</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">  <span class="n">list</span> <span class="n">__write_segments__</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">  <span class="c1">// function which gets called at kernel
</span></span></span><span class="line"><span class="cl">  <span class="c1">// launch to populate read,write segments
</span></span></span><span class="line"><span class="cl">  <span class="kt">void</span> <span class="nf">get_addresses</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">dim3</span> <span class="n">blocks</span><span class="p">,</span> <span class="n">dim3</span> <span class="n">threads</span><span class="p">,</span> <span class="p">...</span>
</span></span><span class="line"><span class="cl">  <span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="c1">// function declaration of the kernel
</span></span></span><span class="line"><span class="cl">  <span class="k">static</span> <span class="n">__global__</span> <span class="kt">void</span> <span class="nf">kernel</span><span class="p">(...);</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></td></tr></table>
</div>
</div></li>
<li>
<p>Tracking kernel state at runtime. Each kernel in the window is in one of three states:</p>
<ol>
<li><strong>Ready</strong>: all kernels it depends on have completed execution.</li>
<li><strong>Pending</strong>: upstream kernels are still pending or executing.</li>
<li><strong>Executing</strong>.</li>
</ol>
</li>
</ol>
<p><img alt="Kernels in the scheduling window with their state and corresponding upstream kernels" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/window.png"></p>
<ol start="3">
<li>Eliminating CPU synchronization overheads. See ACS-HW for more details.</li>
</ol>
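<p>The first two functionalities can be sketched on the host side as follows. This is a minimal illustration rather than the authors&rsquo; implementation: the names <code>Segment</code>, <code>KernelEntry</code> and <code>Window</code> are hypothetical, and treating read-after-write, write-after-read and write-after-write overlaps all as dependencies is an assumption about what &ldquo;overlaps between read segments and write segments&rdquo; covers.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// A memory segment [start, start + size), mirroring {start_adr, size}.
struct Segment {
    std::uintptr_t start;
    std::size_t size;
};

// True if two non-empty half-open address ranges share at least one byte.
bool overlaps(const Segment& a, const Segment& b) {
    return a.size > 0 && b.size > 0 &&
           a.start < b.start + b.size && b.start < a.start + a.size;
}

bool any_overlap(const std::vector<Segment>& xs, const std::vector<Segment>& ys) {
    for (const Segment& x : xs)
        for (const Segment& y : ys)
            if (overlaps(x, y)) return true;
    return false;
}

enum class State { Pending, Ready, Executing };

struct KernelEntry {
    int id;
    std::vector<Segment> reads;
    std::vector<Segment> writes;
    std::vector<int> upstream;  // ids of in-window kernels this one waits for
    State state;
};

// Assumption: `later` depends on `earlier` on any RAW, WAR or WAW overlap.
bool depends_on(const KernelEntry& later, const KernelEntry& earlier) {
    return any_overlap(later.reads, earlier.writes) ||
           any_overlap(later.writes, earlier.reads) ||
           any_overlap(later.writes, earlier.writes);
}

// Hypothetical scheduling window that tracks kernel states.
struct Window {
    std::vector<KernelEntry> entries;

    // Insert a kernel and record which kernels already in the window it waits for.
    void insert(int id, std::vector<Segment> reads, std::vector<Segment> writes) {
        KernelEntry k{id, std::move(reads), std::move(writes), {}, State::Pending};
        for (const KernelEntry& e : entries)
            if (depends_on(k, e)) k.upstream.push_back(e.id);
        if (k.upstream.empty()) k.state = State::Ready;
        entries.push_back(std::move(k));
    }

    // Remove a finished kernel and promote kernels whose upstream list is now empty.
    void complete(int id) {
        entries.erase(std::remove_if(entries.begin(), entries.end(),
                                     [id](const KernelEntry& e) { return e.id == id; }),
                      entries.end());
        for (KernelEntry& e : entries) {
            e.upstream.erase(std::remove(e.upstream.begin(), e.upstream.end(), id),
                             e.upstream.end());
            if (e.state == State::Pending && e.upstream.empty()) e.state = State::Ready;
        }
    }

    State state_of(int id) const {
        for (const KernelEntry& e : entries)
            if (e.id == id) return e.state;
        assert(false && "kernel not in window");
        return State::Pending;
    }
};
```

<p>Because a new kernel is only compared against the kernels currently in the window, the cost of each insertion is bounded by the window size rather than by the size of the whole computational graph.</p>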
<p>ACS has two variants:</p>
<ol>
<li>
<p>ACS-SW: software-only implementation which emulates the out-of-order kernel scheduling mechanism.</p>
</li>
<li>
<p>ACS-HW: hardware-facilitated implementation which is more efficient as it also alleviates synchronization overheads.</p>
</li>
</ol>
<h3 id="acs-sw">ACS-SW</h3>
<h4 id="window-module">Window Module</h4>
<p>This module determines inter-kernel dependencies. It is implemented as a separate thread that manages the input FIFO queue and the scheduling window. The kernel state tracking is implemented in the hardware.</p>
<h4 id="scheduler-module">Scheduler Module</h4>
<p>This module schedules and launches ready kernels for execution. It has a fixed number of CUDA streams, each of which contains only one kernel at any given time. Threads with empty streams poll the scheduling window for a ready kernel.</p>
<p><img alt="ACS-SW: The scheduler module" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/acs_hw_scheduler.png"></p>
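<p>The polling behavior described above can be illustrated with a small single-threaded model. It is only a sketch under assumptions: <code>Scheduler</code>, <code>WindowEntry</code>, <code>poll</code> and <code>on_complete</code> are hypothetical names, each slot stands in for one CUDA stream, and the actual kernel launch is left as a comment. The real scheduler module uses one thread per stream.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class State { Pending, Ready, Executing };

struct WindowEntry {
    int kernel_id;
    State state;
};

constexpr int kEmpty = -1;  // marks a stream slot that holds no kernel

// Hypothetical scheduler: one slot per stream, at most one kernel per slot.
class Scheduler {
public:
    explicit Scheduler(std::size_t num_streams) : slots_(num_streams, kEmpty) {
        assert(num_streams > 0);
    }

    // One polling pass: every empty slot takes the oldest ready kernel.
    // Returns how many kernels were picked up in this pass.
    std::size_t poll(std::vector<WindowEntry>& window) {
        std::size_t launched = 0;
        for (int& slot : slots_) {
            if (slot != kEmpty) continue;
            for (WindowEntry& e : window) {
                if (e.state == State::Ready) {
                    e.state = State::Executing;
                    slot = e.kernel_id;  // a real scheduler would launch on this stream here
                    ++launched;
                    break;
                }
            }
        }
        return launched;
    }

    // Free the slot once the kernel running on it has finished.
    void on_complete(int kernel_id) {
        for (int& slot : slots_)
            if (slot == kernel_id) slot = kEmpty;
    }

private:
    std::vector<int> slots_;
};
```

<p>With two slots and three ready kernels, the first pass picks up two kernels and the third waits until a slot is freed.</p>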
<h3 id="acs-hw">ACS-HW</h3>
<p>ACS-SW incurs kernel synchronization and launch overheads because the scheduler module launches kernels from the CPU. ACS-HW solves these problems through a software-hardware co-design.</p>
<p><img alt="ACS-HW Overview" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/acs_hw.png"></p>
<p>Software-side: maintains an input FIFO queue like ACS-SW, and a list of the kernels in the GPU’s scheduling window, <strong>but this list can be stale</strong>.</p>
<p>Hardware-side: the scheduling window and its management are implemented in hardware on the GPU side.</p>
<p>A key novelty in the hardware design is <strong>two-stage dependency detection</strong>. First, ACS uses software to perform an initial detection using stale kernel information (avoiding frequent synchronization overhead), then uses hardware to correct the outdated dependency information. This two-stage approach significantly reduces the hardware complexity.</p>
<p><img alt="ACS-HW Scheduler" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/hw_scheduler.png"></p>
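<p>One way to picture the two stages is the sketch below, which models both in plain host code. All names are hypothetical, and the correction rule is an assumption: the software list is taken to be stale only in the conservative direction (it may still contain kernels that have already completed), so the hardware stage only needs to drop upstream kernels that are no longer in the real window.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

bool contains(const std::vector<int>& ids, int id) {
    return std::find(ids.begin(), ids.end(), id) != ids.end();
}

// Stage 1 (software): of the kernels whose segments conflict with the new
// kernel, keep those that the CPU's possibly stale window copy still lists.
std::vector<int> software_stage(const std::vector<int>& conflicting,
                                const std::vector<int>& stale_window) {
    std::vector<int> upstream;
    for (int id : conflicting)
        if (contains(stale_window, id)) upstream.push_back(id);
    return upstream;
}

// Stage 2 (hardware, modeled in software here): drop upstream kernels that
// have already completed and left the real scheduling window.
std::vector<int> hardware_stage(const std::vector<int>& upstream,
                                const std::vector<int>& actual_window) {
    std::vector<int> corrected;
    for (int id : upstream)
        if (contains(actual_window, id)) corrected.push_back(id);
    return corrected;
}

std::vector<int> resolve_upstream(const std::vector<int>& conflicting,
                                  const std::vector<int>& stale_window,
                                  const std::vector<int>& actual_window) {
    // Assumption: every kernel still in the real window is also in the CPU's copy.
    for (int id : actual_window) assert(contains(stale_window, id));
    return hardware_stage(software_stage(conflicting, stale_window), actual_window);
}
```

<p>In this model the expensive segment comparison stays in software, and the hardware side only performs a membership check on kernel ids.</p>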
<h2 id="evaluation">Evaluation</h2>
<ol>
<li>Baseline: a cuDNN implementation (for DNNs) and a JAX implementation (for deep RL simulation), both using CUDA streams.</li>
<li>ACS-SW: on real hardware.</li>
<li>ACS-SW-Sim: ACS-SW on the GPU simulator.</li>
<li>ACS-HW: on the GPU simulator.</li>
<li>CUDAGraph.</li>
</ol>
<p><img alt="Deep RL physics simulations: Normalized Speedup" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_deep_rl.png"></p>
<p><img alt="Deep RL physics simulations: Normalized Speedup on GPU simulator" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_deep_rl_sim.png"></p>
<p><img alt="Deep RL physics simulations: Achieved occupancy" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_deep_rl_occ.png"></p>
<p><img alt="Dynamic DNNs: Normalized speedup" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_dcnn.png"></p>
<p><img alt="Dynamic DNNs: Achieved occupancy" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_dcnn_occ.png"></p>
<p><img alt="Static DNNs: Normalized speedup" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_scnn.png"></p>
<p><img alt="Static DNNs: Achieved occupancy" loading="lazy" src="/2024-02-07-paper-reading-arxiv24-acs/eval_scnn_occ.png"></p>
<h2 id="comments">Comments</h2>
<h3 id="strengths">Strengths</h3>
<p>This paper focuses on the problem of low GPU utilization caused by the serial execution of numerous small CUDA kernels. I believe this paper effectively addresses this problem, and the following innovative points particularly impressed me:</p>
<ol>
<li>
<p><strong>Out-of-order dependency detection and scheduling</strong>. Out-of-order (OoO) is a common technique in micro-architecture and software (e.g., hard disk I/O queue) designs. It&rsquo;s an impressive and innovative idea to introduce OoO into this area to find the dynamic dependencies efficiently.</p>
</li>
<li>
<p>A good <strong>trade-off</strong>. When I first read the Introduction section of the paper, I thought that detecting read-write dependencies might be a difficult task. To my knowledge, there aren&rsquo;t reliable static binary memory access analysis techniques (otherwise, segmentation faults wouldn&rsquo;t be a common problem). However, the authors made a good <strong>simplification</strong> and <strong>trade-off</strong> regarding this problem. For most common kernels, memory access areas can be inferred from input parameters. For the remaining kernels, it can be assumed that they access the entire memory. Since a few common operators occupy most of the execution time, this trade-off leads to significant performance improvements with a relatively low scheduling overhead. This innovation is my <strong>favorite</strong> aspect of this paper.</p>
</li>
<li>
<p><strong>Two-stage dependency detection</strong> in ACS-HW. While a complete hardware dependency detection approach is theoretically feasible, it could incur significant <strong>chip area costs</strong> (as we know, the reorder buffer in a microprocessor occupies a large area). The authors proposed a two-stage software-hardware co-design for dependency detection, which significantly simplifies the hardware design. It is a brilliant idea.</p>
</li>
</ol>
<h3 id="weaknesses">Weaknesses</h3>
<p>This paper has some potential weaknesses:</p>
<ol>
<li>
<p>For each type of kernel, we must write a custom <code>get_addresses</code> function in the kernel wrapper. This weakness may limit the adoption of ACS.</p>
</li>
<li>
<p>Deciding whether kernels should be executed concurrently requires considering <strong>more factors</strong> than just data dependencies. If there are resource conflicts (e.g., memory bandwidth, shared memory size) between two <strong>large kernels</strong>, performance may degrade if they co-execute.</p>
</li>
</ol>
<h3 id="improvements">Improvements</h3>
<p>I propose some potential improvements to this paper:</p>
<ol>
<li>
<p>In response to the first weakness mentioned above, I propose a <strong>profiling-rollback</strong> strategy to achieve safe automatic dependency detection. This strategy leverages the commonly used <strong>paging</strong> technique in OS virtual memory management: we can set a memory page as <strong>read-only</strong> or <strong>write-only</strong>. When a program is running, if a <strong>page fault</strong> is triggered, we know that a read or write has occurred. While I&rsquo;m unsure if Nvidia GPUs provide APIs for users to control page tables, let&rsquo;s assume such APIs exist. Given that many workloads are iterative (e.g., neural network training), we can profile the workload for just one iteration, utilizing the aforementioned paging trick to <strong>record the memory access segments</strong> of each kernel. Since this may introduce some inaccuracies, we need a <strong>rollback strategy</strong> to ensure correct program execution. During runtime, we set the known <code>__write_segments__</code> as read-write, while other areas are set as read-only. Upon encountering a page fault, we detect an error and revert to the default strategy (assuming all memory areas will be read and written). With this strategy, we can eliminate the need for a manual <code>get_addresses</code> function, and maximize the potential parallelism.</p>
</li>
<li>
<p>Regarding the second weakness, I suggest adopting the method of <strong>GPUPool</strong> to determine which kernels are suitable for concurrent execution. A naive solution involves tracking the number of SMs each kernel occupies. When the SMs of a GPU are fully occupied, even if there are kernels in the <code>ready</code> state and available CUDA streams, no new kernels are scheduled.</p>
</li>
</ol>
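<p>The naive SM-tracking idea in the second improvement could look like the sketch below. It is a hypothetical illustration, not GPUPool&rsquo;s method and not part of ACS: <code>SmBudget</code> and its methods are invented names, and the number of SMs a kernel occupies is assumed to be estimated elsewhere. A kernel is admitted only if it fits in the free SMs, except that an idle GPU always accepts a kernel so that oversized kernels cannot starve.</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical admission control that tracks how many SMs are in use.
class SmBudget {
public:
    explicit SmBudget(std::size_t total_sms) : total_(total_sms), used_(0) {
        assert(total_sms > 0);
    }

    std::size_t free_sms() const { return used_ >= total_ ? 0 : total_ - used_; }

    // Try to admit a ready kernel estimated to occupy `sms` SMs.
    // An idle GPU always admits, so a kernel larger than the GPU still runs.
    bool try_admit(std::size_t sms) {
        if (used_ != 0 && sms > free_sms()) return false;
        used_ += sms;
        return true;
    }

    // Return the SMs of a finished kernel to the budget.
    void release(std::size_t sms) {
        assert(sms <= used_);
        used_ -= sms;
    }

private:
    std::size_t total_;
    std::size_t used_;
};
```

<p>The scheduler would call <code>try_admit</code> before launching a ready kernel and <code>release</code> when that kernel completes, so a ready kernel with a free stream can still be held back while the GPU is full.</p>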
]]></content:encoded></item></channel></rss>