<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>linux on Monsoon's Blog</title><link>https://monsoon-cs.moe/zh/tags/linux/</link><description>Recent content in linux on Monsoon's Blog</description><generator>Hugo</generator><language>zh-CN</language><lastBuildDate>Sat, 17 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://monsoon-cs.moe/zh/tags/linux/index.xml" rel="self" type="application/rss+xml"/><item><title>NFS Performance Tuning</title><link>https://monsoon-cs.moe/zh/2024-02-16-nfs-tuning/</link><pubDate>Fri, 16 Feb 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/zh/2024-02-16-nfs-tuning/</guid><description>&lt;h2 id="前言"&gt;Preface&lt;/h2&gt;
&lt;p&gt;This is a guide to NFS performance tuning on 10 Gbps networks in production, distilled from my own practice, with a particular focus on reads and writes of &lt;strong&gt;lots of small files&lt;/strong&gt; (LOSF).&lt;/p&gt;
&lt;h2 id="调优"&gt;Tuning&lt;/h2&gt;
&lt;h3 id="硬件"&gt;Hardware&lt;/h3&gt;
&lt;p&gt;On the network hardware side, both &lt;strong&gt;bandwidth&lt;/strong&gt; and &lt;strong&gt;latency&lt;/strong&gt; matter.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="前言">Preface</h2>
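<p>This is a guide to NFS performance tuning on 10 Gbps networks in production, distilled from my own practice, with a particular focus on reads and writes of <strong>lots of small files</strong> (LOSF).</p>
<h2 id="调优">Tuning</h2>
<h3 id="硬件">Hardware</h3>
<p>On the network hardware side, both <strong>bandwidth</strong> and <strong>latency</strong> matter.</p>
<p>High bandwidth is a prerequisite for good NFS performance. 10 Gbps is the baseline for production, and faster InfiniBand or RoCE networks can be chosen according to need and budget.</p>
<p>For <strong>lots of small files</strong> (LOSF) workloads, <strong>latency matters more than bandwidth</strong>. Many tuning guides overlook this and only look at sequential read/write performance; even those that test 4K random I/O use a <strong>flawed test method</strong> (the correct method is given below).</p>
<p>Latency matters because, when a program's access to small files is <strong>inherently serialized</strong>, <strong>latency caps the serialized IOPS</strong>: each request has to wait for the previous one to complete. A latency of 0.1 ms caps serialized IOPS at 10k, while 1 ms caps it at 1k.</p>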
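<p>The bound is simple arithmetic: if every request waits for the previous one, at most one request completes per round trip. A minimal Python sketch of that calculation (the function name <code>serialized_iops_limit</code> is made up for illustration):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def serialized_iops_limit(round_trip_ms):
    """Upper bound on IOPS when requests are issued strictly one at a time."""
    # One request per round trip, and there are 1000 ms in a second.
    return 1000.0 / round_trip_ms


if __name__ == "__main__":
    for round_trip_ms in (0.1, 0.2, 1.0):
        limit = serialized_iops_limit(round_trip_ms)
        print(f"{round_trip_ms} ms per request: at most {limit:.0f} IOPS")
</code></pre></div>
<p>For 0.1 ms and 1 ms this reproduces the 10k and 1k limits quoted above. Real workloads land below the bound, because server processing time adds to the network round trip.</p>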
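<p>Inherently serialized access is very common. For example, with home directories on NFS, loading oh-my-zsh and importing Python packages are both inherently serialized. A network latency of 1 ms makes these programs unacceptably slow (for example, <code>import torch</code> takes more than 30 s).</p>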
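<p>To get a feel for this access pattern, the sketch below reads a set of small files strictly one after another and reports files per second. It is only an illustration, not the test method this post refers to; the function name, file count, and file size are arbitrary assumptions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import os
import tempfile
import time


def serialized_read_rate(directory, count=1000, size=4096):
    """Create `count` small files, then read them back one at a time.

    Returns (total bytes read, files read per second).
    """
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"f{i:06d}.bin")
        with open(path, "wb") as fh:
            fh.write(os.urandom(size))
        paths.append(path)

    start = time.perf_counter()
    total = 0
    for path in paths:
        # Each open/read/close finishes before the next one starts,
        # which is what makes the access pattern serialized.
        with open(path, "rb") as fh:
            total += len(fh.read())
    elapsed = time.perf_counter() - start
    return total, count / elapsed


if __name__ == "__main__":
    # The temporary directory is local. Pass a directory on an NFS mount
    # instead to measure NFS, and note that freshly written data may be
    # served from the client page cache rather than from the server.
    with tempfile.TemporaryDirectory() as tmp:
        total, rate = serialized_read_rate(tmp, count=200)
        print(f"read {total} bytes at {rate:.0f} files/s")
</code></pre></div>
<p>Because nothing runs in parallel, the measured rate is bounded by per-file latency rather than by bandwidth.</p>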
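<p>Qualified enterprise-grade switches and a properly configured network topology keep latency as low as possible. The quality of optical transceivers and optical-to-electrical port converter modules can also affect latency dramatically (the 中科光电 converter modules I used to use added 0.1 ms of latency, which cut IOPS by two thirds).</p>
<p>Note that although RDMA can lower latency in theory, in my tests the serialized IOPS of 10 Gbps Ethernet and 100 Gbps InfiniBand did not differ much, so plain Ethernet is sufficient on a limited budget.</p>
<p>TODO: jumbo frames</p>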
<h3 id="linux-kernel">Linux Kernel</h3>
<p>Kernel network parameters need to be adjusted for high-speed networks:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="c1"># Ref: https://gist.github.com/mizanRahman/40ba603759bfb5153189ccdc9dbbd1e4</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Disable TCP slow start on idle connections</span>
</span></span><span class="line"><span class="cl"><span class="na">net.ipv4.tcp_slow_start_after_idle</span> <span class="o">=</span> <span class="s">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Increase Linux autotuning TCP buffer limits</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Don&#39;t set tcp_mem itself! Let the kernel scale it based on RAM.</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.rmem_max</span> <span class="o">=</span> <span class="s">56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.wmem_max</span> <span class="o">=</span> <span class="s">56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.rmem_default</span> <span class="o">=</span> <span class="s">56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.wmem_default</span> <span class="o">=</span> <span class="s">56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.optmem_max</span> <span class="o">=</span> <span class="s">40960</span>
</span></span><span class="line"><span class="cl"><span class="na">net.ipv4.tcp_rmem</span> <span class="o">=</span> <span class="s">4096 87380 56623104</span>
</span></span><span class="line"><span class="cl"><span class="na">net.ipv4.tcp_wmem</span> <span class="o">=</span> <span class="s">4096 65536 56623104</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># TCP Congestion Control</span>
</span></span><span class="line"><span class="cl"><span class="na">net.ipv4.tcp_congestion_control</span> <span class="o">=</span> <span class="s">bbr</span>
</span></span><span class="line"><span class="cl"><span class="na">net.core.default_qdisc</span> <span class="o">=</span> <span class="s">cake</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Apply these settings on both the server and the clients; write them to <code>/etc/sysctl.conf</code> to make them persistent.</p>
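<p>One common way to sanity-check TCP buffer limits is the bandwidth-delay product (BDP): a single connection needs roughly bandwidth × round-trip time of buffer to keep the link full. The sketch below is an illustration under that assumption, with made-up function names; it is not how the values above were derived.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def bdp_bytes(bandwidth_mbps, rtt_us):
    """Bandwidth-delay product in bytes.

    Mbit/s multiplied by microseconds gives bits; divide by 8 for bytes.
    """
    return bandwidth_mbps * rtt_us // 8


def rtt_covered_us(buffer_bytes, bandwidth_mbps):
    """Largest round-trip time (in microseconds) a buffer can keep full."""
    return buffer_bytes * 8 / bandwidth_mbps


if __name__ == "__main__":
    # 10 Gbps link with a 1 ms round trip
    print(bdp_bytes(10_000, 1_000), "bytes in flight")
    # Round trip covered by the 54M limit used above, at 10 Gbps
    print(rtt_covered_us(56623104, 10_000) / 1000, "ms")
</code></pre></div>
<p>At 10 Gbps, a 1 ms round trip corresponds to 1.25 MB in flight, so the 54M maximum leaves ample headroom for LAN latencies.</p>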
<h3 id="server-side">Server Side</h3>
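<p>The number of NFS server threads can be set generously, which helps when the server is under heavy load; I simply set it to the number of hardware threads on the server. Edit <code>/etc/nfs.conf</code>:</p>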
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[nfsd]</span>
</span></span><span class="line"><span class="cl"><span class="na">threads</span><span class="o">=</span><span class="s">128</span>
</span></span></code></pre></td></tr></table>
</div>
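</div><p>The following NFS server export options should be adjusted:</p>
<ul>
<li><code>async</code>: treats synchronous I/O as asynchronous, i.e. the server replies before the changes have been committed to stable storage. This greatly improves performance for workloads dominated by synchronous reads and writes, but a server crash can lose data, so it is not recommended where data integrity is critical;</li>
<li><code>no_subtree_check</code>: has little effect on performance, but can improve reliability in some cases (at a slight security cost). See [1].</li>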
</ul>
<h3 id="client-side">Client Side</h3>
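<p>Unless there is a specific reason not to, use the latest NFSv4.2 by default. When NFSv3 runs over UDP on a high-speed network, the 16-bit IP identification field used for fragment reassembly can wrap around and cause data corruption; see [2].</p>
<p>The following NFS client mount options should be adjusted:</p>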
<ul>
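<li><code>proto=rdma</code>: set this when the network supports RDMA;</li>
<li><code>nocto</code>: disables close-to-open cache coherence semantics. By default, the NFS client writes all changes back to the server when a file is closed. Not recommended when files must stay consistent across multiple clients;</li>
<li><code>ac</code>: enables attribute caching, so the client caches file attributes (this is the default behavior). Likewise, not recommended for clusters with strict data consistency requirements;</li>
<li><code>fsc</code>: uses FS-Cache to cache data locally. This also requires <a href="https://github.com/jnsnow/cachefilesd">configuring cachefilesd</a>. Oddly, I did not see any data being cached locally in my tests, which needs further investigation;</li>
<li><code>nconnect=16</code>: opens 16 TCP connections between the NFS client and the server. By default the client opens a single TCP connection and multiplexes all RPCs over it, which can limit sequential read/write bandwidth in some cases. Raising <code>nconnect</code> (maximum 16) solves this.</li>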
</ul>
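<p>Notably, <code>noatime</code> / <code>relatime</code> have no effect on NFS [3]; the NFS client always caches atime updates.</p>
<p>Some guides recommend changing <code>rsize</code> and <code>wsize</code>. With NFSv4.2 these are negotiated to the maximum value of <code>1048576</code> by default, so there is no need to set them manually; just verify that the negotiated values are correct.</p>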
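<p>The negotiated values appear in the options column of <code>/proc/mounts</code> on the client. A small sketch for reading them follows; the function name and the sample line (server name, export path, mount point) are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def nfs_mount_options(mounts_text):
    """Map each NFS mount point to a dict of its mount options."""
    result = {}
    for line in mounts_text.splitlines():
        # Fields: device, mount point, fs type, options, dump, pass
        fields = line.split()
        if len(fields) != 6 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            options[key] = value  # flags such as "rw" map to ""
        result[fields[1]] = options
    return result


# Hypothetical line in the format of /proc/mounts
SAMPLE = (
    "server:/export /mnt/nfs nfs4 "
    "rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,"
    "hard,proto=tcp,nconnect=16,sec=sys 0 0"
)

if __name__ == "__main__":
    opts = nfs_mount_options(SAMPLE)["/mnt/nfs"]
    print("rsize:", opts["rsize"], "wsize:", opts["wsize"])
</code></pre></div>
<p>On a real client, pass the contents of <code>/proc/mounts</code> instead of <code>SAMPLE</code>.</p>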
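<p>According to [4], the SunRPC slot table size can affect performance, and <code>sunrpc.tcp_slot_table_entries</code> (default <code>2</code>) can be raised moderately. In my tests, NFS would sometimes hang under a sustained small-file workload on the order of tens of millions of files, and raising this parameter resolved the problem. Set it in <code>/etc/modprobe.d/sunrpc.conf</code>:</p>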
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="na">options sunrpc tcp_slot_table_entries</span><span class="o">=</span><span class="s">16384</span>
</span></span></code></pre></td></tr></table>
</div>
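</div><p>I sometimes see <code>nfsd</code> consuming a lot of CPU while performance drops sharply, together with a large number of <code>delegreturn</code> RPC calls. According to [5], this can be solved by disabling <code>fs.leases-enable</code>. Set it in <code>/etc/sysctl.conf</code>:</p>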
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="na">fs.leases-enable</span> <span class="o">=</span> <span class="s">0</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>After <code>nfsd</code> restarts for whatever reason, there is a default 90 s grace period for lock recovery, during which <code>nfsd</code> rejects all <code>open</code> requests. The kernel log shows:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">[1073511.138061] NFSD: starting 90-second grace period (net f0000000)
</span></span></code></pre></td></tr></table>
</div>
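</div><p>In practice this period can be shortened to reduce the impact of an <code>nfsd</code> restart. Note that <code>--lease-time</code> and <code>--grace-time</code> are documented as <code>rpc.nfsd</code> options, so after a restart check that the kernel log reports the shortened grace period. Set it in <code>/etc/default/nfs-kernel-server</code>:</p>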
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Options for rpc.svcgssd.</span>
</span></span><span class="line"><span class="cl"><span class="nv">RPCSVCGSSDOPTS</span><span class="o">=</span><span class="s2">&#34;--lease-time 10 --grace-time 10&#34;</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="测试">Testing</h2>
<p>TODO</p>
<h2 id="总结">Summary</h2>
<p>TODO</p>
<h2 id="参考">References</h2>
<p>[1] <a href="https://man.archlinux.org/man/exports.5.en#no_subtree_check">https://man.archlinux.org/man/exports.5.en#no_subtree_check</a></p>
<p>[2] <a href="https://man.archlinux.org/man/nfs.5.en#Using_NFS_over_UDP_on_high-speed_links">https://man.archlinux.org/man/nfs.5.en#Using_NFS_over_UDP_on_high-speed_links</a></p>
<p>[3] <a href="https://man.archlinux.org/man/nfs.5.en#File_timestamp_maintenance">https://man.archlinux.org/man/nfs.5.en#File_timestamp_maintenance</a></p>
<p>[4] <a href="https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-linux-concurrency-session-slots">https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-linux-concurrency-session-slots</a></p>
<p>[5] <a href="https://docs.gitlab.com/ee/administration/nfs.html#disable-nfs-server-delegation">https://docs.gitlab.com/ee/administration/nfs.html#disable-nfs-server-delegation</a></p>
]]></content:encoded></item><item><title>Building WireGuard VPN for Machine Learning Server Cluster</title><link>https://monsoon-cs.moe/zh/2024-01-29-wg-for-cluster/</link><pubDate>Mon, 29 Jan 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/zh/2024-01-29-wg-for-cluster/</guid><description>&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
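&lt;p&gt;A machine learning cluster needs a secure way to expose services to users and to interconnect servers across the public Internet, which calls for a VPN.&lt;/p&gt;
&lt;p&gt;Deploying the VPN involves the following considerations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Network topology: choose a topology that keeps latency as low as possible;&lt;/li&gt;
&lt;li&gt;User management: adding, removing, and authorizing users should be easy;&lt;/li&gt;
&lt;li&gt;Simple to use and maintain.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="design"&gt;Design&lt;/h2&gt;
&lt;h3 id="网络拓扑"&gt;Network topology&lt;/h3&gt;
&lt;p&gt;The network topology determines latency.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="motivation">Motivation</h2>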
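<p>A machine learning cluster needs a secure way to expose services to users and to interconnect servers across the public Internet, which calls for a VPN.</p>
<p>Deploying the VPN involves the following considerations:</p>
<ol>
<li>Network topology: choose a topology that keeps latency as low as possible;</li>
<li>User management: adding, removing, and authorizing users should be easy;</li>
<li>Simple to use and maintain.</li>
</ol>
<h2 id="design">Design</h2>
<h3 id="网络拓扑">Network topology</h3>
<p>The network topology determines latency.</p>
<p>The lowest-latency option is obviously a full mesh, where every pair of peers has a direct P2P connection. However, the management complexity of this topology is $\mathcal{O}(n^2)$: adding a new peer requires changing the configuration of every other peer, and NAT traversal has to be dealt with as well, so automation software is a must. I tried <a href="https://www.netmaker.io/">Netmaker</a> and <a href="https://headscale.net/">Headscale</a>, but neither seemed to cope with the <strong>complex network environment</strong> on campus, such as the symmetric NAT used by various enterprise-grade routers, and <strong>the chance of successfully establishing a P2P connection was very low</strong>.</p>
<p>In the end I chose <strong>a topology that combines full mesh and hub-and-spoke</strong>. Since the number of servers and their IPs rarely change, manually configuring a full mesh among the servers is feasible. In addition, a gateway server acts as the hub for user access, so users only need to connect to the gateway server. Because most users are actually on campus when using the VPN, connecting to the on-campus gateway server and relaying traffic through it adds little extra latency. This design balances latency against management complexity, and adding, removing, and authorizing users only requires changes on the gateway server.</p>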
<p><img alt="Network Topology" loading="lazy" src="/2024-01-29-wg-for-cluster/topo.png"></p>
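<p>To make the difference in management cost concrete, the sketch below counts the tunnels each design has to maintain. It assumes the gateway is one of the servers, and the node counts in the example are made up.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def tunnel_counts(servers, users):
    """Return (full-mesh tunnels, hybrid tunnels) for the given node counts."""
    nodes = servers + users
    # Full mesh: one tunnel per pair of nodes.
    full_mesh = nodes * (nodes - 1) // 2
    # Hybrid: full mesh among servers, plus one tunnel per user to the gateway.
    hybrid = servers * (servers - 1) // 2 + users
    return full_mesh, hybrid


if __name__ == "__main__":
    full_mesh, hybrid = tunnel_counts(servers=8, users=50)
    print(f"full mesh: {full_mesh} tunnels, hybrid: {hybrid} tunnels")
</code></pre></div>
<p>With 8 servers and 50 users, a full mesh needs 1653 tunnels while the hybrid layout needs 78, and adding a user touches only the gateway.</p>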
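<h3 id="协议选择">Protocol choice</h3>
<p>The popular OpenVPN and IPsec are both good enough, but the newer WireGuard is unmatched in configuration simplicity. On the server side, WireGuard defines a peer and its routes in a few lines of configuration; on the user side, because WireGuard authenticates with key pairs, a single configuration file is all it takes to join the VPN, with no passwords to remember and no login step.</p>
<h3 id="管理方式">Management</h3>
<p>For predictability and stability, I chose manual configuration. The full mesh among the servers rarely needs to change once configured. User management is handled by a script: to add a new user, the script generates a key pair and assigns an IP, adds the public key and route to the gateway server's peer list, then generates a configuration file containing the private key and the assigned IP and sends it to the user.</p>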
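<p>The steps of such a script can be sketched as follows. This is not the actual script: the function names are hypothetical, only IPv4 is handled, and the keys are assumed to have been generated beforehand with <code>wg genkey</code> and <code>wg pubkey</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import ipaddress


def next_free_address(network, used):
    """Return the first host address in `network` that is not in `used`."""
    for host in ipaddress.ip_network(network).hosts():
        if str(host) not in used:
            return str(host)
    raise RuntimeError("address pool exhausted")


def render_gateway_peer(user_public_key, address):
    """Peer entry to append to the gateway server's configuration."""
    return "\n".join([
        "[Peer]",
        f"PublicKey = {user_public_key}",
        f"AllowedIPs = {address}/32",
        "PersistentKeepalive = 25",
        "",
    ])


def render_user_config(user_private_key, address, gateway_public_key, endpoint):
    """Configuration file handed to the user."""
    return "\n".join([
        "[Interface]",
        f"PrivateKey = {user_private_key}",
        f"Address = {address}/16",
        "",
        "[Peer]",
        f"PublicKey = {gateway_public_key}",
        "AllowedIPs = 10.1.0.0/16",
        f"Endpoint = {endpoint}",
        "PersistentKeepalive = 25",
        "",
    ])


if __name__ == "__main__":
    # Placeholder keys and endpoint, for illustration only
    address = next_free_address("10.1.0.0/16", {"10.1.0.1", "10.1.0.2"})
    print(render_gateway_peer("USER_PUBLIC_KEY", address))
    print(render_user_config("USER_PRIVATE_KEY", address,
                             "GATEWAY_PUBLIC_KEY", "vpn.example.org:51820"))
</code></pre></div>
<p>Keeping rendering separate from key generation makes the output easy to review before it is appended to the gateway configuration.</p>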
<p>Example user peer entry on the gateway server:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Peer]</span>
</span></span><span class="line"><span class="cl"><span class="na">PublicKey</span> <span class="o">=</span> <span class="s">&lt;redacted&gt;</span>
</span></span><span class="line"><span class="cl"><span class="na">AllowedIPs</span> <span class="o">=</span> <span class="s">10.1.x.y/32</span>
</span></span><span class="line"><span class="cl"><span class="na">AllowedIPs</span> <span class="o">=</span> <span class="s">fd01::x:y/128</span>
</span></span><span class="line"><span class="cl"><span class="na">PersistentKeepalive</span> <span class="o">=</span> <span class="s">25</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Example user configuration file:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="k">[Interface]</span>
</span></span><span class="line"><span class="cl"><span class="na">PrivateKey</span> <span class="o">=</span> <span class="s">&lt;redacted&gt;</span>
</span></span><span class="line"><span class="cl"><span class="na">Address</span> <span class="o">=</span> <span class="s">10.1.x.y/16</span>
</span></span><span class="line"><span class="cl"><span class="na">Address</span> <span class="o">=</span> <span class="s">fd01::x:y/64</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">[Peer]</span>
</span></span><span class="line"><span class="cl"><span class="na">PublicKey</span> <span class="o">=</span> <span class="s">&lt;redacted&gt;</span>
</span></span><span class="line"><span class="cl"><span class="na">AllowedIPs</span> <span class="o">=</span> <span class="s">10.1.0.0/16  # route all VPN traffic to gateway server</span>
</span></span><span class="line"><span class="cl"><span class="na">AllowedIPs</span> <span class="o">=</span> <span class="s">fd01::/64</span>
</span></span><span class="line"><span class="cl"><span class="na">Endpoint</span> <span class="o">=</span> <span class="s">wg.ustcaigroup.xyz:51820  # gateway server is dual stack</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Endpoint = wg.ustcaigroup.xyz:51820  # IPv4</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Endpoint = wg.ustcaigroup.xyz:51820  # IPv6</span>
</span></span><span class="line"><span class="cl"><span class="na">PersistentKeepalive</span> <span class="o">=</span> <span class="s">25</span>
</span></span></code></pre></td></tr></table>
</div>
</div>]]></content:encoded></item><item><title>Custom PyTorch Operators on Ascend 910B</title><link>https://monsoon-cs.moe/zh/2023-11-14-ascend-910b-custom-op/</link><pubDate>Tue, 14 Nov 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/zh/2023-11-14-ascend-910b-custom-op/</guid><description>&lt;h2 id="环境"&gt;Environment&lt;/h2&gt;
&lt;p&gt;The hardware used in this post is an Ascend 910B3, and the software environment consists of &lt;a href="https://www.hiascend.com/developer/download/community/result"&gt;CANN 7.0-RC1&lt;/a&gt;, &lt;a href="https://repo.huaweicloud.com/kunpeng/archive/Ascend/PyTorch/"&gt;PyTorch 1.11.0&lt;/a&gt;, and &lt;a href="https://gitee.com/ascend/pytorch/releases/tag/v5.0.rc3-pytorch1.11.0"&gt;Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0&lt;/a&gt;. Details may differ slightly on other CANN and PyTorch versions.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="环境">Environment</h2>
<p>The hardware used in this post is an Ascend 910B3, and the software environment consists of <a href="https://www.hiascend.com/developer/download/community/result">CANN 7.0-RC1</a>, <a href="https://repo.huaweicloud.com/kunpeng/archive/Ascend/PyTorch/">PyTorch 1.11.0</a>, and <a href="https://gitee.com/ascend/pytorch/releases/tag/v5.0.rc3-pytorch1.11.0">Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0</a>. Details may differ slightly on other CANN and PyTorch versions.</p>
<h2 id="注册过程">Registration</h2>
<h3 id="ascend-pytorch-adapter-中添加自定义算子">Adding a custom operator to the Ascend PyTorch Adapter</h3>
<blockquote>
<p>References:</p>
<ul>
<li><a href="https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html">https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0045.html</a></li>
<li><a href="https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation">https://gitee.com/ascend/samples/tree/master/operator/AddCustomSample/FrameworkLaunch/PytorchInvocation</a></li>
</ul>
</blockquote>
<p>Add the <code>npu_add_custom</code> function to <code>torch_npu/csrc/aten/npu_native_functions.yaml</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">custom</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">func</span><span class="p">:</span><span class="w"> </span><span class="l">npu_add_custom(Tensor x, Tensor y) -&gt; Tensor </span><span class="w"> </span><span class="c"># the added function</span><span class="w">
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Add the file <code>AddCustomKernelNpu.cpp</code> under <code>torch_npu/csrc/aten/ops/op_api</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;torch/csrc/autograd/custom_function.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;torch_npu/csrc/framework/utils/OpAdapter.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;torch_npu/csrc/aten/NPUNativeFunctions.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;torch_npu/csrc/aten/ops/op_api/op_api_common.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">namespace</span> <span class="n">at_npu</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="k">namespace</span> <span class="n">native</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">using</span> <span class="n">torch</span><span class="o">::</span><span class="n">autograd</span><span class="o">::</span><span class="n">Function</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">using</span> <span class="n">torch</span><span class="o">::</span><span class="n">autograd</span><span class="o">::</span><span class="n">AutogradContext</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">at</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">NPUNativeFunctions</span><span class="o">::</span><span class="n">npu_add_custom</span><span class="p">(</span><span class="k">const</span> <span class="n">at</span><span class="o">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="n">x</span><span class="p">,</span> <span class="k">const</span> <span class="n">at</span><span class="o">::</span><span class="n">Tensor</span><span class="o">&amp;</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">at</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">result</span> <span class="o">=</span> <span class="n">OpPreparation</span><span class="o">::</span><span class="n">ApplyTensor</span><span class="p">(</span><span class="n">x</span><span class="p">);</span> <span class="c1">// allocate the output tensor
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1">// calculate the output result of the NPU
</span></span></span><span class="line"><span class="cl">        <span class="n">EXEC_NPU_CMD</span><span class="p">(</span><span class="n">aclnnAddCustom</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">result</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span> <span class="c1">// namespace native
</span></span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="c1">// namespace at_npu
</span></span></span></code></pre></td></tr></table>
</div>
</div><p>Then rebuild and reinstall <code>torch_npu</code>.</p>
<h3 id="cann-中添加自定义算子的实现">Implementing the custom operator in CANN</h3>
<blockquote>
<p>Reference:</p>
<ul>
<li><a href="https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0023.html">https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/operatordev/Ascendcopdevg/atlas_ascendc_10_0023.html</a></li>
</ul>
</blockquote>
<p>First, define the operator description file <code>add_custom.json</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span><span class="lnt">39
</span><span class="lnt">40
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">[</span>
</span></span><span class="line"><span class="cl">    <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;op&#34;</span><span class="p">:</span> <span class="s2">&#34;AddCustom&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;language&#34;</span><span class="p">:</span> <span class="s2">&#34;cpp&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;input_desc&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">            <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;x&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;param_type&#34;</span><span class="p">:</span> <span class="s2">&#34;required&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;ND&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">],</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;fp16&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">},</span>
</span></span><span class="line"><span class="cl">            <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;y&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;param_type&#34;</span><span class="p">:</span> <span class="s2">&#34;required&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;ND&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">],</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;fp16&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">],</span>
</span></span><span class="line"><span class="cl">        <span class="nt">&#34;output_desc&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">            <span class="p">{</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;name&#34;</span><span class="p">:</span> <span class="s2">&#34;z&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;param_type&#34;</span><span class="p">:</span> <span class="s2">&#34;required&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;ND&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">],</span>
</span></span><span class="line"><span class="cl">                <span class="nt">&#34;type&#34;</span><span class="p">:</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;fp16&#34;</span>
</span></span><span class="line"><span class="cl">                <span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">]</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Run</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">msopgen gen -i add_custom.json -c ai_core-Ascend910B3 -f pytorch -out . -lan cpp
</span></span></code></pre></td></tr></table>
</div>
</div><p>to generate the operator project:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-txt" data-lang="txt"><span class="line"><span class="cl">AddCustom
</span></span><span class="line"><span class="cl">├── build.sh
</span></span><span class="line"><span class="cl">├── cmake 
</span></span><span class="line"><span class="cl">│   ├── config.cmake
</span></span><span class="line"><span class="cl">│   ├── func.cmake
</span></span><span class="line"><span class="cl">│   ├── intf.cmake
</span></span><span class="line"><span class="cl">│   ├── makeself.cmake
</span></span><span class="line"><span class="cl">│   └── util
</span></span><span class="line"><span class="cl">├── CMakeLists.txt
</span></span><span class="line"><span class="cl">├── CMakePresets.json          // set ASCEND_CANN_PACKAGE_PATH here
</span></span><span class="line"><span class="cl">├── framework
</span></span><span class="line"><span class="cl">├── op_host
</span></span><span class="line"><span class="cl">│   ├── add_custom_tiling.h    // defines the length and tiling-related data
</span></span><span class="line"><span class="cl">│   ├── add_custom.cpp         // host-side implementation of the operator
</span></span><span class="line"><span class="cl">│   └── CMakeLists.txt
</span></span><span class="line"><span class="cl">├── op_kernel
</span></span><span class="line"><span class="cl">│   ├── CMakeLists.txt
</span></span><span class="line"><span class="cl">│   └── add_custom.cpp         // kernel-side implementation of the operator
</span></span><span class="line"><span class="cl">└── scripts
</span></span></code></pre></td></tr></table>
</div>
</div><p>In <code>CMakePresets.json</code>, set <code>ASCEND_CANN_PACKAGE_PATH</code> to the CANN installation path.</p>
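<p>The edit can also be scripted. The following is a minimal sketch, not part of the generated project: it assumes the variable lives under <code>configurePresets[].cacheVariables</code>, stored either as a plain string or as a <code>{"type", "value"}</code> object (both forms are allowed by the CMake presets format). The helper names <code>set_cann_path</code> and <code>patch_presets_file</code> are hypothetical.</p>

```python
import json
from pathlib import Path


def set_cann_path(presets: dict, cann_path: str) -> dict:
    """Set ASCEND_CANN_PACKAGE_PATH in every configure preset (modifies presets in place)."""
    for preset in presets.get("configurePresets", []):
        cache = preset.setdefault("cacheVariables", {})
        current = cache.get("ASCEND_CANN_PACKAGE_PATH")
        if isinstance(current, dict):
            # Object form: keep the existing "type" and only replace the value.
            current["value"] = cann_path
        else:
            # String form, or the variable is missing: store a plain string.
            cache["ASCEND_CANN_PACKAGE_PATH"] = cann_path
    return presets


def patch_presets_file(path: str, cann_path: str) -> None:
    """Rewrite a CMakePresets.json file with the new CANN path."""
    file = Path(path)
    presets = json.loads(file.read_text(encoding="utf-8"))
    set_cann_path(presets, cann_path)
    file.write_text(json.dumps(presets, indent=4) + "\n", encoding="utf-8")
```

<p>For example, <code>patch_presets_file("AddCustom/CMakePresets.json", "/usr/local/Ascend/ascend-toolkit/latest")</code>, where the second argument is only an illustrative path; use the location where CANN is actually installed.</p>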
<p><code>op_host/add_custom_tiling.h</code> contains the following (a minimal implementation):</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;register/tilingdata_base.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">namespace</span> <span class="n">optiling</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl"><span class="n">BEGIN_TILING_DATA_DEF</span><span class="p">(</span><span class="n">AddCustomTilingData</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">TILING_DATA_FIELD_DEF</span><span class="p">(</span><span class="kt">uint32_t</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>  <span class="c1">// tensor size
</span></span></span><span class="line"><span class="cl"><span class="n">END_TILING_DATA_DEF</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">REGISTER_TILING_DATA_CLASS</span><span class="p">(</span><span class="n">AddCustom</span><span class="p">,</span> <span class="n">AddCustomTilingData</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>In <code>op_host/add_custom.cpp</code>, change the <code>block_dim</code> used when the operator is launched:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="line"><span class="cl"><span class="n">context</span><span class="o">-&gt;</span><span class="n">SetBlockDim</span><span class="p">(</span><span class="mi">20</span><span class="p">);</span> <span class="c1">// block_dim for the 910B3
</span></span></span></code></pre></td></tr></table>
</div>
</div><p><code>op_kernel/add_custom.cpp</code> holds the actual operator implementation:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&#34;kernel_operator.h&#34;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef __DAV_C220_VEC__
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">extern</span> <span class="s">&#34;C&#34;</span> <span class="n">__global__</span> <span class="n">__aicore__</span> <span class="kt">void</span> <span class="n">add_custom</span><span class="p">(</span><span class="n">GM_ADDR</span> <span class="n">x</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">y</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">z</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">tiling</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">GET_TILING_DATA</span><span class="p">(</span><span class="n">tiling_data</span><span class="p">,</span> <span class="n">tiling</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="kt">uint32_t</span> <span class="n">M</span> <span class="o">=</span> <span class="n">tiling_data</span><span class="p">.</span><span class="n">size</span><span class="p">;</span>  <span class="c1">// read the tensor size from tiling_data
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1">// ...
</span></span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#else
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// 重要：CANN 会尝试不同的 ccec 编译参数以推断算子的类型（VEC、CUBE、MIXED），如果不创建一个 stub 函数将会编译失败
</span></span></span><span class="line"><span class="cl"><span class="k">extern</span> <span class="s">&#34;C&#34;</span> <span class="n">__global__</span> <span class="n">__aicore__</span> <span class="kt">void</span> <span class="n">add_custom</span><span class="p">(</span><span class="n">GM_ADDR</span> <span class="n">x</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">y</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">z</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">workspace</span><span class="p">,</span> <span class="n">GM_ADDR</span> <span class="n">tiling</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="n">pipe_barrier</span><span class="p">(</span><span class="n">PIPE_ALL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span></code></pre></td></tr></table>
</div>
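<p>The kernel body is elided above. One common way to use the tensor size <code>M</code> together with the block count is to give each block a contiguous slice of the elements. The sketch below shows only that index arithmetic in plain Python; it is not Ascend C code and not the kernel from this post, the helper <code>block_range</code> is hypothetical, and it ignores any alignment the hardware may require.</p>

```python
from typing import Tuple


def block_range(total: int, block_dim: int, block_idx: int) -> Tuple[int, int]:
    """Return (offset, length) of the contiguous slice handled by block_idx.

    The first (total % block_dim) blocks each take one extra element,
    so the slices cover [0, total) exactly once.
    """
    base, extra = divmod(total, block_dim)
    length = base + (1 if block_idx < extra else 0)
    offset = block_idx * base + min(block_idx, extra)
    return offset, length


if __name__ == "__main__":
    # 20 blocks, matching the block_dim set on the host side.
    for idx in range(20):
        print(idx, block_range(1003, 20, idx))
```

<p>In a real kernel the block index would come from the runtime rather than a loop, and each block would process only its own slice.</p>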
</div><h3 id="编译部署">Build and Deploy</h3>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">$ bash build.sh
</span></span><span class="line"><span class="cl">$ ./custom_opp_euleros_aarch64.run
</span></span></code></pre></td></tr></table>
</div>
</div><p>Call it from PyTorch:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">torch_npu</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># ...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">z</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">npu_add_custom</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>  <span class="c1"># compiled at runtime, so the first call has to wait for compilation</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="注册原理">How Registration Works</h2>
<p>TODO</p>
<h2 id="参考">References</h2>
<p>TODO</p>
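]]></content:encoded></item><item><title>Optimizing MKL Performance on AMD CPUs</title><link>https://monsoon-cs.moe/zh/2023-06-19-mkl-on-amd/</link><pubDate>Mon, 19 Jun 2023 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/zh/2023-06-19-mkl-on-amd/</guid><description>&lt;h2 id="问题"&gt;Problem&lt;/h2&gt;
&lt;p&gt;Our lab has several AMD EPYC 7713 servers. We bought them because some group members run programs with very heavy CPU load (I don't know what the workloads are or why they can't run on GPUs, and I don't have the energy to help fix them one by one), and AMD's high-core-count processors fit that need well.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="问题">Problem</h2>
<p>Our lab has several AMD EPYC 7713 servers. We bought them because some group members run programs with very heavy CPU load (I don't know what the workloads are or why they can't run on GPUs, and I don't have the energy to help fix them one by one), and AMD's high-core-count processors fit that need well.</p>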
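<p>Attractive as AMD processors are, they bring an extra problem in a deep learning lab: the numpy and PyTorch builds installed by Anaconda use MKL as their BLAS implementation by default, and MKL library functions are the hotspots of most CPU-heavy programs, but <strong>MKL checks whether it is running on an Intel CPU, and if it is not, its optimizations do not take effect.</strong></p>
<p>Because this is a deep learning lab, few people have enough HPC background to build suitable numpy and PyTorch versions themselves, and moving away from Anaconda is hard, so the dependency on MKL is difficult to remove. What is needed is <strong>a solution that is transparent to ordinary users</strong>.</p>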
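<h2 id="解决方案">Solution</h2>
<p>A web search turns up a widely circulated fix: set the environment variable <code>MKL_DEBUG_CPU_TYPE=5</code>. It used to work, but <strong>it no longer works for MKL 2020 and later</strong>.</p>
<p>I eventually found a neater solution <a href="https://documentation.sigma2.no/jobs/mkl.html">here</a>.</p>
<p>MKL calls a function named <code>mkl_serv_intel_cpu_true()</code> to check whether it is running on an Intel CPU. Supplying a fake <code>mkl_serv_intel_cpu_true()</code> that always returns <code>1</code> tricks MKL into believing it is running on an Intel CPU.</p>
<p>Linux's <strong><code>LD_PRELOAD</code> mechanism</strong> makes this possible. Shared libraries listed in <code>LD_PRELOAD</code> are loaded before all others, so their symbols take precedence. Compile the desired <code>mkl_serv_intel_cpu_true()</code> into an <code>so</code> file, point <code>LD_PRELOAD</code> at it, and that definition is the one that gets resolved.</p>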
<blockquote>
<p>I often hear of the <code>LD_PRELOAD</code> mechanism being used for library-function hijacking attacks; this is one of its benign uses.</p>
</blockquote>
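<p>Because <code>LD_PRELOAD</code> is an ordinary environment variable, it can also be set for a single child process instead of globally. The sketch below is an illustration under that assumption, not the setup used in this post: <code>preload_env</code> and <code>run_with_preload</code> are hypothetical helper names, and the library path is a parameter. It prepends the library while keeping any entries that are already present, since <code>LD_PRELOAD</code> accepts a list separated by colons or spaces.</p>

```python
import os
import subprocess
from typing import Dict, List, Mapping


def preload_env(base: Mapping[str, str], library: str) -> Dict[str, str]:
    """Return a copy of base with library placed first in LD_PRELOAD."""
    env = dict(base)
    # Existing entries may be separated by colons or whitespace.
    entries = env.get("LD_PRELOAD", "").replace(":", " ").split()
    if library not in entries:
        entries.insert(0, library)
    env["LD_PRELOAD"] = ":".join(entries)
    return env


def run_with_preload(argv: List[str], library: str) -> int:
    """Run argv with library preloaded and return its exit code."""
    completed = subprocess.run(argv, env=preload_env(os.environ, library), check=False)
    return completed.returncode
```

<p>This form affects only the launched process, which is handy for comparing a program with and without the preloaded library before changing anything system-wide.</p>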
<h2 id="具体实施">Implementation</h2>
<p>Create <code>mkl_trick.c</code>:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">mkl_serv_intel_cpu_true</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>Compile it with <code>gcc -shared -fPIC -o libmkl_trick.so mkl_trick.c</code> and copy the resulting <code>libmkl_trick.so</code> to <code>/usr/local/lib</code>.</p>
<p>Add the following to the shell's global initialization file:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">MKL_DEBUG_CPU_TYPE</span><span class="o">=</span><span class="m">5</span>  <span class="c1"># for older MKL versions</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">MKL_ENABLE_INSTRUCTIONS</span><span class="o">=</span>AVX2  <span class="c1"># optional: tell MKL it may use AVX2</span>
</span></span><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">LD_PRELOAD</span><span class="o">=</span>/usr/local/lib/libmkl_trick.so
</span></span></code></pre></td></tr></table>
</div>
</div><p>Some lab members use Bash and others use Zsh, so both need to be configured:</p>
<ul>
<li>Bash: create the file <code>/etc/profile.d/mkl.sh</code> and put the lines above in it</li>
<li>Zsh: append the lines above to <code>/etc/zsh/zshenv</code></li>
</ul>
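<p>After logging in again, it is worth confirming that the library really is preloaded. The following is a minimal, Linux-only sketch that looks for the library in the memory map of the current Python process; the helper names <code>mapped_libraries</code> and <code>trick_is_loaded</code> are hypothetical. It only shows that the library is mapped, not that MKL is taking its fast paths, so a timing comparison on a real workload is still needed.</p>

```python
from typing import List


def mapped_libraries(maps_text: str, needle: str) -> List[str]:
    """Return distinct mapped file paths from /proc/<pid>/maps text that contain needle."""
    found: List[str] = []
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5] and fields[5] not in found:
            found.append(fields[5])
    return found


def trick_is_loaded(needle: str = "libmkl_trick.so") -> bool:
    """Check whether a library matching needle is mapped into this process."""
    with open("/proc/self/maps", encoding="utf-8") as maps:
        return bool(mapped_libraries(maps.read(), needle))
```

<p>Calling <code>trick_is_loaded()</code> from a Python session started in a fresh login shell should return <code>True</code> if the <code>LD_PRELOAD</code> setting took effect.</p>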
<h2 id="参考">References</h2>
<ul>
<li><a href="https://documentation.sigma2.no/jobs/mkl.html">https://documentation.sigma2.no/jobs/mkl.html</a></li>
</ul>
]]></content:encoded></item></channel></rss>