The Operations Research Team at Huawei France Research Center (Boulogne-Billancourt, Paris area) is opening a 12-month postdoctoral position (with a possible 6-month extension) in the context of the ANR project Net4AI.
The topic is the optimization of collective communication within datacenters during large language model (LLM) training and inference, where communication between GPUs is a major bottleneck. The objective is to develop optimization and learning-based approaches that improve communication efficiency and reduce training time, addressing both offline and online decision-making. Offline, the work involves solving a bi-level optimization problem covering GPU assignment, job scheduling, and routing. Online, reinforcement learning methods will be investigated to adaptively buffer and balance tasks based on real-time network conditions.
Keywords: Optimization, Reinforcement Learning, Datacenter Networking, Collective Communication, LLM Training
This position is part of Huawei's research initiative on next-generation AI infrastructure and offers the opportunity to collaborate with a dynamic and multidisciplinary team.
Interested candidates can apply by replying to this email with a detailed CV, a cover letter, university transcripts, and references.
Kind regards,
Dr. Youcef Magnouche
Huawei France Research Center
youcef.magnouche@huawei.com