<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>deepspeed on Monsoon's Blog</title><link>https://monsoon-cs.moe/tags/deepspeed/</link><description>Recent content in deepspeed on Monsoon's Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 07 Jul 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://monsoon-cs.moe/tags/deepspeed/index.xml" rel="self" type="application/rss+xml"/><item><title>Latency in LLM Serving</title><link>https://monsoon-cs.moe/2024-07-07-latency-in-llm-serving/</link><pubDate>Sun, 07 Jul 2024 00:00:00 +0000</pubDate><guid>https://monsoon-cs.moe/2024-07-07-latency-in-llm-serving/</guid><description>&lt;h2 id="preface"&gt;Preface&lt;/h2&gt;
&lt;p&gt;There has been much excellent work on LLM serving, most of it focused on improving throughput. In practical applications, however, latency is equally important. &lt;strong&gt;Currently, few works focus on improving LLM serving latency, especially on latency optimization under SLA constraints&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This post attempts to summarize the basic concepts and problems in this direction, and to propose some new research directions based on an analysis of latency in LLM serving.&lt;/p&gt;</description><content:encoded><![CDATA[<h2 id="preface">Preface</h2>
<p>There has been much excellent work on LLM serving, most of it focused on improving throughput. In practical applications, however, latency is equally important. <strong>Currently, few works focus on improving LLM serving latency, especially on latency optimization under SLA constraints</strong>.</p>
<p>This post attempts to summarize the basic concepts and problems in this direction, and to propose some new research directions based on an analysis of latency in LLM serving.</p>
<h2 id="latency-metrics">Latency Metrics</h2>
<p>In LLM serving, we mainly focus on three latency metrics:</p>
<ul>
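<li><strong>TBT</strong> ($t_ {tbt}$): Time Between Tokens, the interval between two consecutive output tokens.</li>
<li><strong>TTFT</strong> ($t_ {ttft}$): Time to First Token, the time from request arrival to the first output token.</li>
<li><strong>TE2E</strong> ($t_ {e2e}$): End-to-end time, the time from request arrival to the last output token.</li>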
</ul>
<p>In practice, rather than the average or median latency, we usually consider the <strong>latency SLA</strong>, which requires given percentiles of the latency distribution (for example, P50, P90, and P99) to fall below certain thresholds.</p>
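<p>As a minimal sketch of such a percentile check: the nearest-rank percentile definition, the sample values, and the thresholds below are all illustrative assumptions, not measurements from a real system.</p>

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of the data at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def meets_sla(samples, thresholds):
    """thresholds maps a percentile (e.g. 99) to the maximum allowed latency."""
    return all(thresholds[q] >= percentile(samples, q) for q in thresholds)

# Hypothetical TBT measurements in milliseconds, with one slow outlier.
tbt_ms = [20, 22, 21, 25, 30, 24, 23, 90, 26, 22]
# Made-up SLA: P50 within 30 ms, P90 within 50 ms, P99 within 100 ms.
sla_ms = {50: 30, 90: 50, 99: 100}

print(percentile(tbt_ms, 50), percentile(tbt_ms, 90), percentile(tbt_ms, 99))  # 23 30 90
print(meets_sla(tbt_ms, sla_ms))  # True
```

<p>The single 90 ms outlier barely moves the median but determines P99, which is why a tail-percentile SLA is stricter than an average-latency target.</p>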
<h2 id="where-the-latency-comes-from">Where Does the Latency Come From?</h2>
<p><img loading="lazy" src="/2024-07-07-latency-in-llm-serving/latency_in_llm_serving.png"></p>
<p>As shown in the figure above, popular LLM serving systems (such as vLLM and DeepSpeed) adopt an <strong>iteration-level scheduling strategy</strong>. The processing of each request is divided into the <strong>prefilling stage</strong> (prompt inference) and the <strong>generation stage</strong> (auto-regressive token-by-token generation). In systems such as Sarathi-Serve, the prompt is additionally split into chunks to improve throughput, which adds a <strong>chunked prefilling stage</strong>.</p>
<p>The LLM serving system maintains <strong>3 queues</strong> to store requests in these 3 states. The scheduler runs in a loop, and in each iteration, it selects requests from these 3 queues with a certain strategy, and combines them into a batch for the inference engine.</p>
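<p>A toy version of one scheduler iteration might look like the following. The function name <code>schedule_iteration</code> and the static generation-first policy are assumptions made for illustration; this is not the scheduling code of vLLM, DeepSpeed, or Sarathi-Serve.</p>

```python
from collections import deque

def schedule_iteration(prefill_q, chunked_q, gen_q, max_batch):
    """Build one batch with a static generation-first policy (an assumption).

    Requests are taken from the generation queue first, then from the
    chunked prefilling queue, then from the prefilling queue, until the
    batch is full or all queues are empty.
    """
    batch = []
    for queue in (gen_q, chunked_q, prefill_q):
        while queue and max_batch > len(batch):
            batch.append(queue.popleft())
    return batch

# Hypothetical request ids waiting in each of the three queues.
prefill_q = deque(["p1", "p2"])
chunked_q = deque(["c1"])
gen_q = deque(["g1", "g2", "g3"])

print(schedule_iteration(prefill_q, chunked_q, gen_q, max_batch=4))
# ['g1', 'g2', 'g3', 'c1']
print(list(prefill_q))  # ['p1', 'p2'] are left waiting for a later iteration
```

<p>In a real system, the requests in the batch would be put back into the appropriate queue after the iteration unless they have finished.</p>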
<p>In such systems, request latency mainly comes from two sources: <strong>queue latency</strong> and <strong>inference latency</strong>. Let $t_ {qp}$, $t_ {qc}$, and $t_ {qg}$ be the time a request waits in the prefilling queue, the chunked prefilling queue, and the generation queue, respectively, before the scheduler selects it, and let $t_ {inf}$ be the inference latency of the engine.
We get:</p>
$$\begin{aligned}
  t_ {ttft} &= t_ {qp} + (N_ {chunk} - 1) \cdot t_ {qc} + N_ {chunk} \cdot t_ {inf}, \\\\
  t_ {tbt} &= t_ {qg} + t_ {inf}, \\\\
  t_ {e2e} &= t_ {ttft} + N_{token} \cdot t_ {tbt},
\end{aligned}$$<p>where $N_ {chunk}$ is the number of chunks of a prefilling request ($N_ {chunk}=1$ means no chunking), and $N_ {token}$ is the total number of tokens generated by a request.</p>
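<p>To make the three formulas concrete, here is a small worked example. It assumes, purely for illustration, that every queue wait and $t_ {inf}$ are constant across iterations; all numbers are made up and given in milliseconds.</p>

```python
def ttft(t_qp, t_qc, t_inf, n_chunk=1):
    """Time to first token: one prefilling-queue wait, (n_chunk - 1)
    chunked-queue waits, and n_chunk inference iterations."""
    return t_qp + (n_chunk - 1) * t_qc + n_chunk * t_inf

def tbt(t_qg, t_inf):
    """Time between tokens: one generation-queue wait plus one iteration."""
    return t_qg + t_inf

def e2e(t_ttft, t_tbt, n_token):
    """End-to-end time, following the formula in the text."""
    return t_ttft + n_token * t_tbt

# Hypothetical values in milliseconds.
t_ttft = ttft(t_qp=40, t_qc=5, t_inf=30, n_chunk=4)
t_tbt = tbt(t_qg=2, t_inf=30)
print(t_ttft)                           # 175
print(t_tbt)                            # 32
print(e2e(t_ttft, t_tbt, n_token=100))  # 3375
```

<p>With these numbers, splitting the prompt into 4 chunks makes TTFT pay for 4 inference iterations, while the 100 generated tokens dominate the end-to-end time.</p>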
<p>Obviously, $t_ {inf}$ is not a fixed value. It depends on the composition of the batch. We can denote it as:</p>
$$t_ {inf} = f\left( B_ {p}, B_ {c}, B_ {g}, \mathbf{L}_ {p}, L_ {chunk} \right),$$<p>where $B_p$, $B_c$, and $B_g$ denote the numbers of non-chunked prefilling requests, chunked prefilling requests, and generation requests in the batch, respectively. The vector $\mathbf{L}_ {p}$ holds the prompt length of each non-chunked prefilling request in the batch, and
$L_ {chunk}$ is the chunk size.</p>
<h2 id="how-to-improve-it">How to Improve It?</h2>
<p>Based on the above analysis, we can find that reducing latency mainly involves reducing both <strong>queue latency</strong> and <strong>inference latency</strong>. In fact, some techniques, such as iteration-level scheduling and chunked prefilling, can be seen as improvements to queue latency.</p>
<p>On the other hand, <strong>improving inference latency has not received much attention</strong>. One reason is that, <strong>for inference engines, there is a trade-off between latency and throughput</strong>.
Generally speaking, a larger batch size means higher throughput, but also higher inference latency. Techniques such as quantization and Paged Attention focus on more efficient memory usage to increase the batch size, <strong>but inference latency may also increase accordingly</strong> (TODO: add an example), which means that $t_ {tbt}$ and $t_ {ttft}$ may increase and SLA requirements may be violated.</p>
<p>Therefore, <strong>there is an opportunity to improve inference latency in current LLM serving systems</strong>. The goal could be an <strong>SLA-aware scheduler</strong>, which maximizes throughput without violating SLA requirements. It should be able to <strong>dynamically decide the batch size and batch composition</strong> instead of just deploying a static prefill-prioritizing or generation-prioritizing strategy.</p>
<p>I believe the key to this design is to predict $t_ {inf}$ to provide latency optimization guidance for the scheduler. Prediction based on profiling results may be a simple approach, <strong>but a performance model based on GPU computation capability and memory bandwidth might be more general</strong>.</p>
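<p>One possible shape for such a performance model is a roofline-style estimate: take the larger of the compute time and the memory time of one iteration. The sketch below makes strong simplifying assumptions that are not from this post: a dense decoder-only model, roughly 2 FLOPs per parameter per token, weights read from memory once per iteration, and no attention or KV-cache cost. The names <code>Batch</code> and <code>predict_t_inf</code> and all hardware numbers are hypothetical.</p>

```python
from dataclasses import dataclass, field

@dataclass
class Batch:
    prompt_lens: list = field(default_factory=list)  # L_p; B_p is its length
    n_chunked: int = 0    # B_c
    n_gen: int = 0        # B_g
    chunk_size: int = 0   # L_chunk

    def tokens(self):
        """Total number of tokens processed in this iteration."""
        return sum(self.prompt_lens) + self.n_chunked * self.chunk_size + self.n_gen

def predict_t_inf(batch, n_params, bytes_per_param, peak_flops, mem_bandwidth):
    """Roofline-style estimate in seconds (a rough sketch, see assumptions above)."""
    compute_s = 2 * n_params * batch.tokens() / peak_flops  # ~2 FLOPs per parameter per token
    memory_s = n_params * bytes_per_param / mem_bandwidth   # weights read once per iteration
    return max(compute_s, memory_s)

# Hypothetical 7B-parameter model in 16-bit weights on a made-up accelerator.
hw = dict(n_params=7e9, bytes_per_param=2, peak_flops=300e12, mem_bandwidth=2e12)

decode_only = Batch(n_gen=32)                        # 32 generation requests
with_prefill = Batch(prompt_lens=[2048], n_gen=32)   # plus one 2048-token prompt

print(predict_t_inf(decode_only, **hw))   # memory-bound: limited by reading the weights
print(predict_t_inf(with_prefill, **hw))  # compute-bound: limited by prefill FLOPs
```

<p>Even this crude model shows why batch composition matters: under these assumptions, adding a single long prefilling request to a decode-only batch moves the iteration from memory-bound to compute-bound and increases the predicted $t_ {inf}$ by more than an order of magnitude.</p>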
<p>Once we can predict $t_ {inf}$, the queue latencies $t_ {qp}$, $t_ {qc}$, and $t_ {qg}$ can also be predicted using mathematical tools such as queueing theory (e.g., modeling request arrivals as a Poisson process), allowing us to optimize serving for the following scenarios:</p>
<ol>
<li>When the request arrival rate is less than the maximum throughput: we can appropriately reduce batch size to improve $t_ {tbt}$.</li>
<li>When the request arrival rate is greater than the maximum throughput: we can adjust the batch composition dynamically based on queue length, or drop some requests to avoid starvation.</li>
<li>When the request arrival rate suddenly increases: we can adjust the batch composition to avoid breaking the SLA of $t_ {ttft}$.</li>
</ol>
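<p>As a rough illustration of the queueing-theory step, the sketch below uses the textbook M/D/1 model: Poisson arrivals, a single server, and a fixed service time. Treating the inference engine as one server with a constant service time equal to a predicted $t_ {inf}$ is a strong simplifying assumption, since it ignores batching, and the numbers are made up.</p>

```python
def md1_mean_wait(arrival_rate, service_time):
    """Mean time spent waiting in the queue of an M/D/1 system.

    arrival_rate: mean arrivals per second (Poisson process).
    service_time: fixed service time in seconds.
    Returns infinity when the utilization reaches or exceeds 1.
    """
    rho = arrival_rate * service_time  # utilization
    if rho >= 1:
        return float("inf")
    return rho * service_time / (2 * (1 - rho))

# Hypothetical load: 20 requests/s with a 30 ms service time (utilization 0.6).
print(md1_mean_wait(20, 0.03))  # roughly 0.0225 s
# Hypothetical overload: 40 requests/s with the same service time.
print(md1_mean_wait(40, 0.03))  # inf
```

<p>The second call corresponds to scenario 2 above: once the arrival rate exceeds what the server can sustain, the expected wait is unbounded unless the scheduler changes the batch composition or drops requests.</p>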
<p>In summary, this SLA-aware scheduler should provide better results than a static scheduler by considering <strong>arrival rate</strong>, <strong>queue length</strong>, and <strong>predicted $t_ {inf}$</strong>.</p>
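<p>A minimal sketch of the batch-size decision could look like this. It assumes a hypothetical predictor <code>predict_tbt</code> that is non-decreasing in the batch size; the linear toy model and the thresholds are invented for illustration only.</p>

```python
def pick_batch_size(queue_len, max_batch, tbt_sla, predict_tbt):
    """Largest batch size whose predicted TBT stays within the SLA.

    Returns 0 if even a batch of size 1 would exceed the SLA.
    Assumes predict_tbt is non-decreasing in the batch size.
    """
    best = 0
    for b in range(1, min(queue_len, max_batch) + 1):
        if tbt_sla >= predict_tbt(b):
            best = b
        else:
            break  # larger batches can only be slower under the assumption
    return best

def toy_model(batch_size):
    """Hypothetical latency model in ms: 20 ms base plus 0.5 ms per request."""
    return 20 + 0.5 * batch_size

print(pick_batch_size(queue_len=64, max_batch=128, tbt_sla=30, predict_tbt=toy_model))  # 20
print(pick_batch_size(queue_len=8, max_batch=128, tbt_sla=30, predict_tbt=toy_model))   # 8
print(pick_batch_size(queue_len=64, max_batch=128, tbt_sla=10, predict_tbt=toy_model))  # 0
```

<p>A static scheduler would always run the 64 queued requests together; here the batch is capped at 20 so that the predicted $t_ {tbt}$ stays within the 30 ms target.</p>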
<h2 id="some-meaningful-experiment-result">Some Meaningful Experiment Results</h2>
<p>TODO</p>
]]></content:encoded></item></channel></rss>