LLM trained bandit algorithms for improving Q-Commerce grocery swaps

In today's competitive marketplace, retailers must offer consumers convenience, low costs, and a positive customer experience. Online supermarket shopping via quick commerce (Q-commerce) has become increasingly common, and it presents new challenges because product catalogues range from hundreds to tens of thousands of items. At this scale, stockouts and outdated product listings are very common, and maintaining customer satisfaction requires robust, scalable product swap systems. Product similarity is often used to suggest alternative products, but large catalogues and limited customer feedback data make it difficult to monitor swap suggestions and change them when they are unsatisfactory. This paper explores methods for incorporating customer feedback to adapt product swap recommendations over time, using Thompson sampling multi-armed bandits and upper confidence bound (UCB) contextual bandits with network-based experience sharing. In the absence of real-world data, a multimodal LLM provides feedback for the process, generating synthetic, human-like responses automatically.
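To make the Thompson sampling idea concrete, the following is a minimal sketch of a Beta-Bernoulli bandit that chooses among candidate swaps for a single out-of-stock product. It is not the paper's implementation: the class name `SwapBandit`, the product names, and the acceptance rates are hypothetical, and feedback is drawn from fixed probabilities as a stand-in for the LLM-generated responses the abstract describes.

```python
import random


class SwapBandit:
    """Beta-Bernoulli Thompson sampling over candidate swaps (illustrative only)."""

    def __init__(self, candidates):
        # Start each candidate with a uniform Beta(1, 1) prior on its
        # acceptance rate, stored as [alpha, beta].
        self.stats = {c: [1, 1] for c in candidates}

    def suggest(self, rng=random):
        # Draw one plausible acceptance rate per candidate from its posterior
        # and suggest the candidate with the highest draw.
        draws = {c: rng.betavariate(a, b) for c, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, candidate, accepted):
        # An accepted swap increments alpha; a rejected swap increments beta.
        self.stats[candidate][0 if accepted else 1] += 1


def simulate(true_rates, rounds=2000, seed=0):
    """Run the bandit against fixed, hypothetical acceptance probabilities."""
    rng = random.Random(seed)
    bandit = SwapBandit(true_rates)
    counts = {c: 0 for c in true_rates}
    for _ in range(rounds):
        choice = bandit.suggest(rng)
        # Simulated feedback: assumed Bernoulli, standing in for a customer
        # (or synthetic LLM) accept/reject response.
        accepted = rng.random() < true_rates[choice]
        bandit.update(choice, accepted)
        counts[choice] += 1
    return bandit, counts


if __name__ == "__main__":
    # Hypothetical acceptance rates for swaps offered in place of one product.
    rates = {"oat_milk": 0.8, "soy_milk": 0.3, "almond_milk": 0.2}
    bandit, counts = simulate(rates)
    print(counts)
```

Because each suggestion is sampled from the posterior rather than taken greedily, poorly performing swaps are still tried occasionally early on, and the suggestions concentrate on the best-accepted swap as feedback accumulates.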

Conference Name

6th IFAC/INSTICC International Conference on Innovative Intelligent Production and Logistics

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International (CC BY 4.0)

Collections

University of Cambridge Research Outputs (Articles and Conferences)