Monday, October 27, 2025

[DMANET] [Scheduling seminar] Zijie Zhou (IEDA, HKUST) | October 29 | Efficient and Robust Large Language Model (LLM) Inference Scheduling Optimization

Dear scheduling researcher,

We are delighted to announce the talk given by Zijie Zhou (IEDA, HKUST).
The title is "Efficient and Robust Large Language Model (LLM) Inference
Scheduling Optimization". The seminar will take place on Zoom on
Wednesday, October 29 at 14:00 UTC.
Join Zoom Meeting
https://cesnet.zoom.us/j/91957974467?pwd=bIbyEXuAi693d5e0YLjaxmFFYZQOpF.1
Meeting ID: 919 5797 4467
Passcode: 291598

You can also follow the seminar, live or as a recording, on our YouTube
channel:
https://www.youtube.com/channel/UCUoCNnaAfw5NAntItILFn4A

The abstract follows.
We study the problem of optimizing Large Language Model (LLM) inference
scheduling to minimize total completion time. LLM inference is an online,
multi-task, and energy-intensive service process in which a pre-trained
LLM processes input requests and generates output tokens sequentially.
It is therefore vital to improve scheduling efficiency and reduce power
consumption while large volumes of prompt requests arrive. There are two
key challenges: (i) each
request has heterogeneous prefill and decode lengths. In LLM serving,
the prefill length corresponds to the input prompt length, which
determines the initial memory usage in the KV cache. The decode length
refers to the number of output tokens generated sequentially, with each
additional token increasing the KV cache memory usage by one unit. We
show that minimizing total completion time is NP-hard due to the
interplay of batching, placement constraints, precedence relationships,
and linearly increasing memory usage. We then analyze commonly used
scheduling strategies in practice, such as First-Come-First-Serve (FCFS)
and Shortest-First (SF), and prove that their competitive ratios are
unbounded. To address this, we propose a novel algorithm based on a new
selection metric that efficiently forms batches over time. We prove that
this algorithm achieves a constant competitive ratio. (ii) the output
length, which critically impacts memory usage and processing time, is
unknown. We first design a conservative algorithm, Amax, which schedules
requests based on the upper bound of predicted output lengths to prevent
memory overflow. However, this approach is overly conservative: as
prediction accuracy decreases, performance degrades significantly due to
potential overestimation. To overcome this limitation, we propose Amin,
an adaptive algorithm that initially treats the predicted lower bound as
the output length and dynamically refines this estimate during
inference. We prove that Amin achieves a logarithmic competitive ratio.
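
For readers less familiar with LLM serving, the following minimal Python
sketch (our own illustration, not the speaker's model or code) shows the
KV-cache memory dynamics the abstract describes: each request occupies
memory equal to its prefill length on admission and grows by one unit
per generated token, and a batch is feasible only if total usage stays
within capacity. The Request fields, the capacity parameter, and the
conservative no-release assumption are all ours.

    from dataclasses import dataclass

    @dataclass
    class Request:
        prefill: int   # input prompt length = initial KV-cache usage
        decode: int    # number of output tokens to generate

    def batch_memory_at_step(batch: list[Request], t: int) -> int:
        """KV-cache usage of a batch after t decode steps: each request
        holds prefill + min(t, decode) units. For simplicity, memory is
        not released when a request finishes, which overestimates usage
        relative to a real serving system."""
        return sum(r.prefill + min(t, r.decode) for r in batch)

    def fits_until_completion(batch: list[Request],
                              capacity: int) -> bool:
        """Check that the batch never exceeds capacity at any step up
        to the longest decode length in the batch."""
        horizon = max((r.decode for r in batch), default=0)
        return all(batch_memory_at_step(batch, t) <= capacity
                   for t in range(horizon + 1))

    # Example: prefills 8 and 4, decodes 5 and 10, capacity 30.
    # Peak usage is (8 + 5) + (4 + 10) = 27, so the batch fits.
    reqs = [Request(prefill=8, decode=5), Request(prefill=4, decode=10)]
    print(fits_until_completion(reqs, capacity=30))  # True

Real serving systems free a request's KV cache on completion, so this
feasibility check is deliberately pessimistic; it is only meant to make
the "linearly increasing memory usage" in the abstract concrete.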
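The adaptive idea attributed to Amin can be pictured with a similar toy
sketch: start from the predicted lower bound on the output length and
revise the estimate upward whenever a request outlives it. The doubling
update rule below is our own illustrative stand-in, not the speaker's;
the actual refinement scheme and its competitive-ratio analysis are the
subject of the talk.

    def amin_style_estimate(lower_bound: int, upper_bound: int,
                            true_length: int):
        """Yield (tokens generated, current length estimate) as decoding
        proceeds, starting from the predicted lower bound."""
        estimate = lower_bound
        generated = 0
        while generated < true_length:
            generated += 1
            if generated >= estimate:
                # The request outlived the current estimate: revise it
                # upward (illustrative doubling rule), never beyond the
                # predicted upper bound.
                estimate = min(2 * max(estimate, 1), upper_bound)
            yield generated, estimate

    # Example: predicted bounds [4, 64], actual output 20 tokens.
    # The estimate climbs 4 -> 8 -> 16 -> 32 as tokens arrive.
    for generated, estimate in amin_style_estimate(4, 64, 20):
        pass
    print(generated, estimate)  # 20 32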

The next talk in our series will be:
Rachel R. Chen (UC Davis) | November 12 | Outpatient Appointment
Scheduling with Waiting Time Limits
For more details, please visit https://schedulingseminar.com/

With kind regards

Zdenek Hanzalek, Michael Pinedo and Guohua Wan

--
Zdenek Hanzalek
Industrial Informatics Department,
Czech Institute of Informatics, Robotics and Cybernetics,
Czech Technical University in Prague,
Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
https://rtime.ciirc.cvut.cz/~hanzalek/

**********************************************************
*
* Contributions to be spread via DMANET are submitted to
*
* DMANET@zpr.uni-koeln.de
*
* Replies to a message carried on DMANET should NOT be
* addressed to DMANET but to the original sender. The
* original sender, however, is invited to prepare an
* update of the replies received and to communicate it
* via DMANET.
*
* DISCRETE MATHEMATICS AND ALGORITHMS NETWORK (DMANET)
* http://www.zaik.uni-koeln.de/AFS/publications/dmanet/
*
**********************************************************