We are delighted to announce the talk to be given by Zijie Zhou (IEDA, HKUST).
The title is "Efficient and Robust Large Language Model (LLM) Inference 
Scheduling Optimization". The seminar will take place on Zoom on 
Wednesday, October 29 at 14:00 UTC.
Join Zoom Meeting
https://cesnet.zoom.us/j/91957974467?pwd=bIbyEXuAi693d5e0YLjaxmFFYZQOpF.1
Meeting ID: 919 5797 4467
Passcode: 291598
You can also follow the seminar, live or as a recording afterwards, on
our YouTube channel:
https://www.youtube.com/channel/UCUoCNnaAfw5NAntItILFn4A
The abstract follows.
We study the problem of optimizing Large Language Model (LLM) inference
scheduling to minimize total completion time. LLM inference is an online,
multi-task, and energy-intensive service process in which a pre-trained
LLM processes input requests and generates output tokens sequentially.
It is therefore vital to improve scheduling efficiency and reduce power
consumption while large volumes of prompt requests arrive. There are two
key challenges.

(i) Each request has heterogeneous prefill and decode lengths. In LLM
serving, the prefill length corresponds to the input prompt length, which
determines the initial memory usage in the KV cache. The decode length
refers to the number of output tokens generated sequentially, with each
additional token increasing the KV cache memory usage by one unit.
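To make the memory model concrete, here is a minimal sketch in Python
(our own illustration, not material from the talk; the Request fields and
the assumption that finished requests free their cache are ours):

    from dataclasses import dataclass

    @dataclass
    class Request:
        prefill: int  # input prompt length = initial KV cache usage
        decode: int   # total number of output tokens to generate

    def batch_memory(batch, step):
        # KV cache usage at decoding step `step` (1-indexed): each request
        # still decoding holds its prefill plus the tokens generated so
        # far; finished requests are assumed to free their cache.
        return sum(r.prefill + step for r in batch if r.decode >= step)

    batch = [Request(prefill=20, decode=5), Request(prefill=10, decode=3)]
    for t in range(1, 6):
        print(t, batch_memory(batch, t))
    # Usage grows by one unit per active request each step (32, 34, 36),
    # then drops as requests complete and release memory (24, 25).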
We show that minimizing total completion time is NP-hard due to the
interplay of batching, placement constraints, precedence relationships,
and linearly increasing memory usage. We then analyze scheduling
strategies commonly used in practice, such as First-Come-First-Serve
(FCFS) and Shortest-First (SF), and prove that their competitive ratios
are unbounded. To address this, we propose a novel algorithm based on a
new selection metric that efficiently forms batches over time, and we
prove that it achieves a constant competitive ratio.
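As a toy illustration of how batching interacts with the memory limit,
the following sketch greedily forms an FCFS batch subject to a KV cache
cap M; the feasibility rule, the cap, and the numbers are all our own
assumptions, not the algorithm from the talk:

    def fits(batch, M):
        # Feasible iff KV usage stays within the cap at every decoding
        # step, again assuming finished requests free their cache.
        horizon = max((d for _, d in batch), default=0)
        return all(sum(p + t for p, d in batch if d >= t) <= M
                   for t in range(1, horizon + 1))

    def fcfs_batch(queue, M):
        # Greedy FCFS: admit requests in arrival order until one no
        # longer fits under the memory cap.
        batch = []
        for req in queue:
            if not fits(batch + [req], M):
                break
            batch.append(req)
        return batch

    queue = [(30, 10), (5, 50), (40, 5)]  # (prefill, decode) pairs
    print(fcfs_batch(queue, M=60))        # -> [(30, 10), (5, 50)]

A request arriving behind long-running ones can thus be blocked, which
hints at why FCFS-style rules can have unbounded competitive ratios.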
(ii) The output length, which critically impacts memory usage and
processing time, is unknown in advance. We first design a conservative
algorithm, Amax, which schedules requests based on the upper bound of
the predicted output lengths to prevent memory overflow. However, this
approach is overly conservative: as prediction accuracy decreases,
performance degrades significantly due to potential overestimation. To
overcome this limitation, we propose Amin, an adaptive algorithm that
initially treats the predicted lower bound as the output length and
dynamically refines this estimate during inference. We prove that Amin
achieves a log-scale competitive ratio.
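The abstract does not spell out Amin's refinement rule, so the sketch
below is only a guess at its shape: Amax plans with the predicted upper
bound, while the Amin-style estimate starts at the lower bound and grows
geometrically when a request outlives it (the doubling rule is our own
assumption; geometric growth is one natural way log-scale guarantees
arise):

    def amax_estimate(lo, hi):
        # Amax: reserve for the upper bound -- memory-safe, but badly
        # over-conservative when the interval [lo, hi] is loose.
        return hi

    class AminEstimate:
        # Amin-style working estimate of a request's output length.
        def __init__(self, lo, hi):
            self.hi = hi
            self.est = max(lo, 1)  # start from the predicted lower bound

        def observe(self, generated):
            # Refine during decoding: if the request outlives the current
            # estimate, double it (capped at the predicted upper bound).
            while self.est < generated and self.est < self.hi:
                self.est = min(2 * self.est, self.hi)
            return self.est

    est = AminEstimate(lo=4, hi=100)
    print([est.observe(g) for g in (1, 5, 9, 40)])  # -> [4, 8, 16, 64]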
The next talk in our series will be:
Rachel R. Chen (UC Davis) | November 12 | Outpatient Appointment 
Scheduling with Waiting Time Limits
For more details, please visit https://schedulingseminar.com/
With kind regards
Zdenek Hanzalek, Michael Pinedo and Guohua Wan
-- 
Zdenek Hanzalek
Industrial Informatics Department,
Czech Institute of Informatics, Robotics and Cybernetics,
Czech Technical University in Prague,
Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
https://rtime.ciirc.cvut.cz/~hanzalek/
**********************************************************
*
*   Contributions to be spread via DMANET are submitted to
*
*                   DMANET@zpr.uni-koeln.de
*
*   Replies to a  message carried  on DMANET should NOT be
*   addressed to DMANET  but to  the original sender.  The
*   original  sender,  however,  is invited  to prepare an
*   update  of the replies  received and to communicate it
*   via DMANET.
*
*    DISCRETE MATHEMATICS AND ALGORITHMS NETWORK (DMANET)
*      http://www.zaik.uni-koeln.de/AFS/publications/dmanet/
*
**********************************************************