End-to-end training of neural network solvers for graph combinatorial optimization problems such as the Travelling Salesperson Problem (TSP) has seen a surge of interest recently, but remains intractable and inefficient beyond graphs with a few hundred nodes. While state-of-the-art learning-driven approaches for TSP perform close to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales. This work presents an end-to-end

Code and datasets:

NP-hard combinatorial optimization problems are the family of constrained integer optimization problems which are intractable to solve optimally at large scales. Robust approximation algorithms for popular problems have immense practical applications and are the backbone of modern industries. Among combinatorial problems, the 2D Euclidean Travelling Salesperson Problem (TSP) has been the most intensely studied NP-hard graph problem in the Operations Research (OR) community, with applications in logistics, genetics and scheduling [

An alternate approach by the Machine Learning community is to develop generic learning algorithms which can be trained to solve

Scaling end-to-end approaches to practical and real-world instances is still an open question [

Computational challenges of learning large scale TSP. We compare three identical autoregressive GNN-based models trained on 12.8 Million TSP instances via reinforcement learning. We plot average optimality gap to the Concorde solver on 1,280 held-out TSP200 instances vs. number of training samples (left) and wall clock time (right) during the learning process. Training on large TSP200 from scratch is intractable and sample inefficient. Active Search [

As an illustration, Fig.

We advocate for an alternative to expensive large-scale training: learning efficiently from trivially small TSP and transferring the learnt policy to larger graphs in a

Towards end-to-end learning of

The prevalent evaluation paradigm overshadows models’ poor generalization capabilities by measuring performance on fixed or trivially small TSP sizes.

Generalization performance of GNN aggregation functions and normalization schemes benefits from explicit redesigns which account for shifting graph distributions, and can be further boosted by enforcing regularities such as constant graph diameters when defining problems using graphs.

Autoregressive decoding enforces a sequential inductive bias which improves generalization over non-autoregressive models, but is costlier in terms of inference time.

Models trained with expert supervision are more amenable to post-hoc search, while reinforcement learning approaches scale better with more computation as they do not rely on labelled data.

Our framework and datasets are available online.

In a recent survey, Bengio et al. [

State-of-the-art end-to-end approaches for TSP use Graph Neural Networks (GNNs) [

Other classical problems tackled by similar architectures include Vehicle Routing [

Notably, TSP has emerged as a challenging testbed for neural combinatorial optimization. Whereas generalization to problem instances larger and more complex than those seen in training has at least partially been demonstrated on non-sequential problems such as SAT, MaxCut, and MVC [

From the perspective of graph representation learning, algorithmic and combinatorial problems have recently been used to characterize the expressive power of GNNs [

Advances on classical combinatorial problems have shown promising results in downstream applications to novel or under-studied optimization problems in the physical sciences [

NP-hard problems can be formulated as sequential decision making tasks on graphs due to their highly structured nature. Towards a controlled study of neural combinatorial optimization for TSP, we unify recent ideas [

End-to-end neural combinatorial optimization pipeline: The entire model is trained end-to-end via imitating an optimal solver (i.e. supervised learning) or through minimizing a cost function (i.e. reinforcement learning)

The 2D Euclidean TSP is defined as follows, where ∥·∥_{2} denotes the ℓ_{2} norm.
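As a concrete anchor for this definition, the tour-length objective can be sketched in a few lines (an illustrative helper, not code from our pipeline; the function name is ours):

```python
import numpy as np

def tour_length(coords, tour):
    """Total Euclidean (ell-2) length of a closed tour.

    coords: (n, 2) array of 2D node coordinates.
    tour:   permutation of range(n) giving the visiting order.
    """
    ordered = coords[np.asarray(tour)]
    # np.roll pairs each node with its successor, including the
    # closing edge from the last node back to the first.
    diffs = ordered - np.roll(ordered, -1, axis=0)
    return float(np.linalg.norm(diffs, axis=1).sum())

# A unit square visited corner-by-corner has perimeter 4.
square = np.array([[0., 0.], [0., 1.], [1., 1.], [1., 0.]])
assert abs(tour_length(square, [0, 1, 2, 3]) - 4.0) < 1e-9
```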

Classically, TSP is defined on fully-connected graphs, see Fig.

Most work on learning for TSP has focused on training with a fixed graph size [

A Graph Neural Network (GNN) encoder computes node embeddings; the input node and edge features are the coordinates x_{i} and the Euclidean distance ∥x_{i} − x_{j}∥_{2}, respectively.

We make the aggregation function anisotropic or directional via a dense attention mechanism which scales the neighborhood features using gates computed from the edge features e_{ij}. Anisotropic and attention-based GNNs such as Graph Attention Networks [
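The idea of anisotropic aggregation can be sketched as follows (a simplified NumPy illustration with sigmoid edge gates; the actual encoder uses learned attention weights across multiple layers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def anisotropic_aggregate(h, e, neighbors):
    """One anisotropic aggregation step: each neighbor j of node i is
    weighted by a dense gate computed from the edge feature e[i, j],
    so messages are no longer pooled isotropically.

    h:         (n, d) node features
    e:         (n, n, d) edge features
    neighbors: dict mapping node index -> list of neighbor indices
    """
    out = np.zeros_like(h)
    for i in range(len(h)):
        js = neighbors[i]
        gates = sigmoid(e[i, js])          # (k, d) edge-wise gates
        msgs = gates * h[js]               # scale neighbor features
        out[i] = h[i] + msgs.sum(axis=0)   # residual + aggregation
    return out
```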

Consider TSP as a link prediction task: each edge may or may not belong to the optimal TSP solution, independently of one another [p_{ij} via a softmax.

Although NAR decoders are fast as they produce predictions in one shot, they ignore the sequential ordering of TSP tours. Autoregressive decoders, based on attention [

At each time step, the decoder builds a context from the graph embedding h_{G} and the embeddings of the first and last node in the partial tour. Output probabilities p_{ij} are computed via a final attention mechanism between the context and each node embedding h_{j}, normalized via a softmax over all edges.

NAR approaches, which make predictions over edges independently of one another, have shown strong out-of-distribution generalization for non-sequential problems such as SAT and MVC [

For AR decoding, the predicted probabilities at each step can be used to build a tour either by sampling the next node or by greedily selecting the most probable edge e_{ij}, i.e. greedy search. Since NAR decoders directly output probabilities over all edges independently of one another, we can obtain valid TSP tours using greedy search to traverse the graph starting from a random node and masking previously visited nodes. Thus, the probability of a partial tour
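Greedy search over a NAR edge heatmap can be sketched directly (illustrative; we fix the start node for simplicity and mask visited nodes at each step):

```python
import numpy as np

def greedy_decode(heatmap, start=0):
    """Build a valid tour from an (n, n) matrix of edge probabilities
    by repeatedly following the most probable edge to an unvisited node."""
    n = heatmap.shape[0]
    visited = np.zeros(n, dtype=bool)
    tour, cur = [start], start
    visited[start] = True
    for _ in range(n - 1):
        # Mask previously visited nodes, then take the best outgoing edge.
        scores = np.where(visited, -np.inf, heatmap[cur])
        cur = int(np.argmax(scores))
        visited[cur] = True
        tour.append(cur)
    return tour
```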

During inference, we can increase the capacity of greedy search via limited width breadth-first beam search, which maintains the
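A minimal breadth-first beam search over the same kind of heatmap might look as follows (a compact sketch, not the batched implementation used in practice; scores are sums of log-probabilities):

```python
import numpy as np

def beam_search(heatmap, width=3, start=0):
    """Return the most probable complete tour found with beam width `width`.
    heatmap: (n, n) edge probabilities."""
    n = heatmap.shape[0]
    logp = np.log(heatmap + 1e-12)
    beams = [([start], 0.0)]                    # (partial tour, log-prob score)
    for _ in range(n - 1):
        candidates = []
        for tour, score in beams:
            for j in range(n):
                if j not in tour:               # mask visited nodes
                    candidates.append((tour + [j], score + logp[tour[-1], j]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]              # keep top-`width` partial tours
    return beams[0][0]
```

With `width=1` this reduces exactly to greedy search; larger widths trade inference time for solution quality.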

Models can be trained end-to-end via imitating an optimal solver at each step (i.e. supervised learning). For models with NAR decoders, the edge predictions are linked to the ground-truth TSP tour by minimizing the binary cross-entropy loss for each edge [
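The NAR supervised loss can be sketched as a mean binary cross-entropy over the edge heatmap (illustrative; class-weighting for the sparse positive edges is omitted here):

```python
import numpy as np

def edge_bce_loss(pred, target, eps=1e-12):
    """Mean binary cross-entropy over all edges.

    pred:   (n, n) predicted edge probabilities in (0, 1)
    target: (n, n) 0/1 adjacency matrix of the ground-truth tour
    """
    pred = np.clip(pred, eps, 1 - eps)  # numerical stability at 0/1
    losses = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float(losses.mean())
```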

Reinforcement learning is an elegant alternative in the absence of ground-truth solutions, as is often the case for understudied combinatorial problems. Models can be trained by minimizing problem-specific cost functions (the tour length in the case of TSP) via policy gradient algorithms [_{𝜃}(
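A policy-gradient update with a baseline can be sketched as follows (a schematic REINFORCE estimator, assuming per-sample gradients of the tour log-probability are available; the baseline could come from a critic or a greedy rollout):

```python
import numpy as np

def reinforce_gradient(log_prob_grads, tour_lengths, baseline_lengths):
    """Policy-gradient estimate for tour-length minimization.

    log_prob_grads:   (batch, p) gradient of log p_theta(tour) per sample
    tour_lengths:     (batch,) sampled tour costs
    baseline_lengths: (batch,) baseline costs (e.g. greedy rollout)

    Advantage = cost - baseline: tours cheaper than the baseline get
    a negative advantage, i.e. their probability is pushed up.
    """
    advantage = (tour_lengths - baseline_lengths)[:, None]
    return (advantage * log_prob_grads).mean(axis=0)
```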

We design controlled experiments to probe the unified pipeline described in Section

To quantify ‘good’ generalization, we additionally evaluate our models against a simple, non-learnt furthest insertion heuristic, which iteratively inserts the node whose distance from the partial tour is maximized. Kool et al. [
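The furthest insertion heuristic, the simple non-learnt baseline used for comparison, can be sketched as follows (a standard textbook construction, not the exact baseline implementation used in our experiments):

```python
import numpy as np

def furthest_insertion(coords):
    """Construct a TSP tour: start from the pair of mutually furthest
    nodes, then repeatedly insert the node furthest from the tour at
    the position that increases tour length the least."""
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    tour, rest = [int(i), int(j)], set(range(n)) - {int(i), int(j)}
    while rest:
        # Pick the node whose minimum distance to the tour is largest.
        k = max(rest, key=lambda r: min(dist[r, t] for t in tour))
        # Find the cheapest insertion position for k along the cycle.
        best_pos = min(
            range(len(tour)),
            key=lambda p: dist[tour[p], k]
            + dist[k, tour[(p + 1) % len(tour)]]
            - dist[tour[p], tour[(p + 1) % len(tour)]],
        )
        tour.insert(best_pos + 1, k)
        rest.remove(k)
    return tour
```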

We perform ablation studies of each component of the pipeline by training on variable TSP20-50 graphs for rapid experimentation. We also compare to learning from fixed graph sizes up to TSP100. Each TSP instance consists of

For models with AR decoders, we use 3 GNN encoder layers followed by the attention decoder head, setting hidden dimension

We compare models on a held-out test set of 25,600 TSPs, consisting of 1,280 samples each of TSP10, TSP20,
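The reported metric, the average optimality gap to the solver, is simply the mean percentage excess tour length over the optimal (e.g. Concorde) tour; for clarity (names ours):

```python
import numpy as np

def optimality_gap(pred_lengths, opt_lengths):
    """Average % gap of predicted tour lengths over optimal tour lengths."""
    pred = np.asarray(pred_lengths, dtype=float)
    opt = np.asarray(opt_lengths, dtype=float)
    return float(100.0 * (pred / opt - 1.0).mean())
```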

Learning from various TSP sizes. The prevalent protocol of evaluating on the training sizes overshadows brittle out-of-distribution performance on larger and smaller graphs

Impact of graph sparsification. Maintaining a constant graph diameter across TSP sizes leads to better generalization on larger problems than using full graphs

We train five identical models on fully connected graphs of instances from TSP20, TSP50, TSP100, TSP200 and variable TSP20-50. The line plots of optimality gap across TSP sizes in Fig.

Training on TSP200 graphs is intractable within our computational budget, see Fig.

Figure

Although both sparsification techniques lead to faster convergence on training instance sizes (not shown), we find that only approach (2) leads to better generalization on larger problems than using full graphs. Consequently, all further experiments use approach (2) to operate on sparse 20
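One common way to keep the effective graph diameter roughly constant as instances grow is to connect each node only to its nearest neighbours; a sketch assuming this k-NN scheme (illustrative; k is a free parameter here):

```python
import numpy as np

def knn_sparsify(coords, k):
    """Boolean (n, n) adjacency keeping, for every node, only the edges
    to its k nearest neighbors (self-loops excluded)."""
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)             # never pick self as neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]  # k closest per node
    adj = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    adj[rows, nearest.ravel()] = True
    return adj
```

Note that the resulting adjacency is directed (node degree is fixed at k), which is what keeps neighborhood sizes stable across TSP sizes.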

Impact of GNN aggregation functions. For larger graphs, aggregators that are agnostic to node degree (

Impact of normalization schemes. Modifying BatchNorm to account for changing graph statistics across graph sizes leads to better generalization

In Fig.

We find that the choice of GNN aggregation function does not have an impact when evaluating models within the training size range TSP20-50. As we tackle larger graphs, GNNs with aggregation functions that are agnostic to node degree (
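The degree sensitivity is easy to see concretely: with constant neighbor features, SUM aggregation scales with neighborhood size while MEAN and MAX do not (a toy NumPy illustration, not our trained models):

```python
import numpy as np

# Constant neighbor features; only the neighborhood size changes.
small = np.ones((5, 8))    # 5 neighbors  (e.g. a sparser training graph)
large = np.ones((50, 8))   # 50 neighbors (e.g. a denser test graph)

# SUM grows linearly with node degree -> distribution shift at test time.
assert small.sum(axis=0)[0] == 5 and large.sum(axis=0)[0] == 50

# MEAN and MAX are agnostic to node degree -> stable across graph sizes.
assert small.mean(axis=0)[0] == large.mean(axis=0)[0] == 1
assert small.max(axis=0)[0] == large.max(axis=0)[0] == 1
```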

We also experiment with the following normalization schemes: (1) standard BatchNorm, which learns mean and variance from the training data; (2) BatchNorm using batch statistics; and (3) LayerNorm, which normalizes over the embedding dimension instead of across the batch. Figure
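The schemes differ only in which axes the normalization statistics are computed over; a compact sketch of (2) and (3), assuming (batch, nodes, features) tensors and omitting learned affine parameters:

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each node embedding over its feature dimension only,
    independent of batch composition and graph size."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def batch_norm_batch_stats(h, eps=1e-5):
    """BatchNorm using the current batch's statistics per feature,
    pooled over all nodes of all graphs in the batch."""
    mu = h.mean(axis=(0, 1), keepdims=True)
    var = h.var(axis=(0, 1), keepdims=True)
    return (h - mu) / np.sqrt(var + eps)
```

Because (2) re-estimates statistics from the current batch, it tracks shifting graph statistics at test time, whereas (1) freezes the training-set statistics.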

We further dissect the relationship between graph representations and normalization in Appendix

Comparing AR and NAR decoders. Sequential AR decoding is a powerful inductive bias for TSP as it enables significantly better generalization, even in the absence of graph structure (MLP encoders)

Inference time for various decoders. One-shot NAR decoding is significantly faster than sequential AR, especially when re-embedding the graph at each decoding step [

Figure

Conversely, NAR architectures are a poor inductive bias as they require significantly more computation to perform competitively with AR decoders. For instance, recent models [

Identical models are trained via supervised learning (SL) and reinforcement learning (RL).

In Appendix

Comparing solution search settings. Under greedy decoding, RL demonstrates better performance and generalization. Conversely, SL models improve over their RL counterparts when performing beam search or sampling

Impact of increasing beam width. Teacher-forcing during SL leads to poor generalization under greedy decoding, but makes the probability distribution more amenable to beam search

Our experiments up to this point have focused on isolating the impact of various pipeline components on zero-shot generalization under limited computation. At the same time, recent results in natural language processing have highlighted the power of large scale pre-training for effective transfer learning [

In Fig.

Scaling computation and parameters for SL and RL-trained models. All models are trained on TSP20-50. We plot optimality gap on 1,280 held-out samples of both TSP50 (performance on training size) and TSP100 (out-of-distribution generalization) under greedy decoding. Note that SL models are less amenable than RL models to greedy search. RL models are able to keep improving their performance within as well as outside of the training size range with more data. On the other hand, SL performance is bottlenecked by the need for optimal ground-truth solutions

Since the initial publication of this work [

As a reminder, the unified neural combinatorial optimization pipeline consists of: (1) Problem Definition → (2) Graph Embedding → (3) Solution Decoding → (4) Solution Search → (5) Policy Learning.

The autoregressive Attention Model [

Kwon et al. [

Future work may follow the Geometric Deep Learning blueprint [

Several papers have proposed to improve the one-shot non-autoregressive approach of Joshi et al. [

Notably, the GNN + MCTS framework of Fu et al. [

Overall, this line of work suggests that stronger coupling between the design of both the neural and symbolic/search components of models is essential for out-of-distribution generalization.

Recent work has explored an alternative to constructive AR and NAR decoding schemes which involves learning to iteratively improve (sub-optimal) solutions or learning to perform local search [

A limitation of this line of work is the need for hand-designed local search heuristics, which may be missing for understudied problems. On the other hand, constructive approaches are comparatively easier to adapt to new problems by enforcing constraints during the solution decoding and search procedure (Fig.

Future work could look at novel learning paradigms which explicitly focus on generalization, beyond supervised and reinforcement learning. For example, this work explored zero-shot generalization to larger problems, but the logical next step is to fine-tune the model on a small number of larger problem instances [

Another interesting direction could explore tackling understudied routing problems with challenging constraints via multi-task pre-training on well-known routing problems such as TSP and CVRP, followed by problem-specific finetuning. Similar to language modelling as a pre-training objective in NLP [

Learning-driven solvers for combinatorial problems such as the Travelling Salesperson Problem have shown promising results for trivially small instances up to a few hundred nodes. However, scaling fully

This paper advocates for an alternative to expensive large-scale training: training models efficiently on trivially small TSP and transferring the learnt policy to larger graphs in a

We perform the first principled investigation into zero-shot generalization for learning large scale TSP, unifying state-of-the-art architectures and learning paradigms into one experimental pipeline for neural combinatorial optimization. Our findings suggest that key design choices such as GNN layers, normalization schemes, graph sparsification, and learning paradigms need to be explicitly re-designed to consider out-of-distribution generalization. Additionally, we use our unified pipeline to characterize recent advances in deep learning for routing problems and provide new directions to stimulate future research.

We would like to thank R. Anand, X. Bresson, V. Dwivedi, A. Ferber, E. Khalil, W. Kool, R. Levie, A. Prouvost, P. Veličković and the anonymous reviewers for helpful comments and discussions.


In Fig.

We characterize ‘good’ generalization across our experiments by comparing against the well-known furthest insertion heuristic, which iteratively inserts the node whose distance from the partial tour is maximized.

We motivate our work by showing that learning from large TSP200 is intractable on university-scale hardware, and that efficient pre-training on trivial TSP20-50 enables models to better generalize to TSP200 in a zero-shot manner. Within our computational budget, furthest insertion still outperforms our best models. At the same time, we are not claiming that it is

It is worth mentioning why we chose to study TSP in particular. Firstly, TSP has stood the test of time in terms of relevance and continues to serve as an engine of discovery for general purpose techniques in applied mathematics.

TSP and associated routing problems have also emerged as a challenging testbed for learning-driven approaches to combinatorial optimization. Whereas generalization to problem instances larger and more complex than those seen in training has at least partially been demonstrated on non-sequential problems such as SAT, MaxCut, and MVC [

Fairly timing research code can be difficult due to differences in libraries used, hardware configurations and programmer skill. In Table

Approximate training time (12.8M samples) and inference time (1,280 samples) across TSP sizes and search settings for SL and RL-trained models

Graph Size | Training Time (SL) | Training Time (RL) | Inference (GS) | Inference (BS128) | Inference (S128)
---|---|---|---|---|---
TSP20 | 4h 24m | 8h 02m | 2.62s | 7.06s | 63.37s
TSP20-50 | 9h 49m | 15h 47m | − | − | −
TSP50 | 16h 11m | 40h 29m | 7.45s | 29.09s | 86.48s
TSP100 | 68h 34m | 108h 30m | 19.04s | 98.26s | 180.30s
TSP200 | − | 495h 55m | 54.88s | 372.09s | 479.37s

Figure

We understand this phenomenon as follows: More confident predictions (Fig.

Histograms of greedy selection probabilities (x-axis) across TSP sizes (y-axis)

Our results in Section

We utilize distribution plots to study the variation in embedding statistics

We draw upon work in learning embeddings for computer vision [ℓ_{2} norms, indicating whether embeddings are shrinking to one magnitude or expanding outwards as TSP size increases; and (2)

Distribution plots of ℓ_{2} norms (y-axis) across TSP sizes (x-axis)

Distribution plots of

In Figs.

Distribution plots of ℓ_{2} norms (y-axis) across TSP sizes (x-axis)

Distribution plots of

Figures

We can further visualize this phenomenon through 2D Principal Component Analysis (PCA) plots of graph embedding spaces for

2D PCA of graph embedding spaces. Colors represent TSP instance sizes, e.g. orange: TSP10, teal: TSP20, pink: TSP50, dark grey: TSP200

GNN aggregation functions (NAR decoder)

Comparing learning paradigms and solution search settings

In Section

Figure

In Figures

Scaling computation and model parameters for AR decoder

Scaling computation and model parameters for NAR decoder

As a final note, we present a visualization tool for generating model predictions and heatmaps of TSP instances, see Figures

Prediction visualization for TSP20

Prediction visualization for TSP50

The largest TSP instance solved by Concorde to date has 109,399 nodes, with a running time of 7.5 months.

For RL, we show the greedy rollout baseline. Critic baseline results are available in Appendix

It is worth noting that classical algorithmic and symbolic components such as graph reduction, sophisticated tree search as well as post-hoc local search have been pivotal and complementary to GNNs in enabling such generalization.

Distribution plots show the 0, 5, 50, 95, and 100-percentiles of embedding statistics at various TSP sizes, thus visualizing how the statistics change with problem scale (implemented via TensorBoard [
