Massively Parallel Neural Computation
Paul James Fox
University of Cambridge Computer Laboratory
Jesus College
October 2012

This dissertation is submitted for the degree of Doctor of Philosophy

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the regulation length of 60 000 words, including tables and footnotes.

Summary

Reverse-engineering the brain is one of the US National Academy of Engineering's "Grand Challenges." The structure of the brain can be examined at many different levels, spanning many disciplines from low-level biology through psychology and computer science. This thesis focusses on real-time computation of large neural networks using the Izhikevich spiking neuron model.

Neural computation has been described as "embarrassingly parallel" as each neuron can be thought of as an independent system, with behaviour described by a mathematical model. However, the real challenge lies in modelling neural communication. While the connectivity of neurons has some parallels with that of electrical systems, its high fan-out results in massive data processing and communication requirements when modelling neural communication, particularly for real-time computations.

It is shown that memory bandwidth is the most significant constraint on the scale of real-time neural computation, followed by communication bandwidth, which leads to a decision to implement a neural computation system on a platform based on a network of Field Programmable Gate Arrays (FPGAs), using commercial off-the-shelf components with some custom supporting infrastructure. This brings implementation challenges, particularly lack of on-chip memory, but also many advantages, particularly high-speed transceivers. An algorithm to model neural communication that makes efficient use of memory and communication resources is developed and then used to implement a neural computation system on the multi-FPGA platform.

Finding suitable benchmark neural networks for a massively parallel neural computation system proves to be a challenge. A synthetic benchmark that has biologically-plausible fan-out, spike frequency and spike volume is proposed and used to evaluate the system. It is shown to be capable of computing the activity of a network of 256k Izhikevich spiking neurons with a fan-out of 1k in real-time using a network of 4 FPGA boards. This compares favourably with previous work, with the added advantage of scalability to larger neural networks using more FPGAs.

It is concluded that communication must be considered as a first-class design constraint when implementing massively parallel neural computation systems.

For George and Reg, who couldn't be here to see this completed

Acknowledgements

Simon Moore gave many hours of supervision and advice. Theo Markettos proofread this entire thesis and corrected many typos and other errors. Andrew Moore provided the coffee machine that allowed sufficient work to be put in to get this finished.

None of this would have been possible without the support of my friends and family. Particular thanks go to Cambridge University Yacht Club for providing many opportunities to escape from the pressures of writing this thesis. Hopefully future Commodores will not need to spend a whole week on the phone trying to fix an engine failure half-way across the world.
Contents

1 Introduction
  1.1 Motivation for neural computation
  1.2 Fundamental processes of neural computation
  1.3 Goals and hypotheses
  1.4 Overview
  1.5 Publications
2 Background
  2.1 Introduction
  2.2 Fundamentals of neural computation
  2.3 Neural computation methods
    2.3.1 Mapping neural networks on to computation systems
    2.3.2 Neuron models
    2.3.3 Communication models
    2.3.4 Implementation methods
  2.4 Software-based neural computation
    2.4.1 Supercomputer-based systems
    2.4.2 PC-based systems
    2.4.3 GPU-based systems
  2.5 Hardware-based neural computation
    2.5.1 Modelling neurons
    2.5.2 Modelling neural network topology
    2.5.3 Modelling synaptic connections
  2.6 Design decisions
  2.7 Conclusion
3 High-level requirements
  3.1 Introduction
  3.2 The Izhikevich spiking neuron model
    3.2.1 Conversion to discrete-time
    3.2.2 Conversion to fixed-point
    3.2.3 Sampling interval size
    3.2.4 Final simplification
    3.2.5 Numerical precision
  3.3 Communication properties of neural networks
    3.3.1 Fan-out
    3.3.2 Locality
    3.3.3 Spike frequency
    3.3.4 Synaptic delays
  3.4 Conclusion
4 Requirements
  4.1 Introduction
  4.2 Neural computation tasks
  4.3 Volume of activity
    4.3.1 Evaluating neuron modelling equations
    4.3.2 Synaptic updates
    4.3.3 Delaying synaptic updates
    4.3.4 Applying synaptic updates
    4.3.5 Summary
  4.4 Memory requirements
    4.4.1 Evaluating equations
    4.4.2 Synaptic updates
    4.4.3 Delaying synaptic updates
    4.4.4 Applying synaptic updates
    4.4.5 Summary
  4.5 Communication requirements
    4.5.1 Type of communication infrastructure
    4.5.2 Synaptic update communication
    4.5.3 Summary
  4.6 Implementation technology
    4.6.1 Memory resources
    4.6.2 Communication resources
    4.6.3 Reprogrammability
    4.6.4 Cost
    4.6.5 Power consumption
    4.6.6 Choice of implementation technology
    4.6.7 Implications of this choice
  4.7 Conclusion
5 Platform
  5.1 Introduction
  5.2 Choice of FPGA board
    5.2.1 Number of FPGAs per PCB
    5.2.2 Memory interfaces
    5.2.3 Connectivity
    5.2.4 Symmetry
    5.2.5 Availability of off-the-shelf solutions
    5.2.6 Summary
  5.3 Terasic DE4 evaluation board
    5.3.1 Memory hierarchy
  5.4 Creating a multi-FPGA platform
    5.4.1 Interconnect
    5.4.2 Programming and diagnostics
    5.4.3 Management
  5.5 Conclusion
6 Synaptic updates
  6.1 Introduction
  6.2 Design goals
    6.2.1 Fetching synaptic updates efficiently
    6.2.2 On-chip memory usage
    6.2.3 Inter-FPGA communication
  6.3 Design decisions
    6.3.1 Starting the update process
    6.3.2 Fetching synaptic updates
    6.3.3 Delaying synaptic updates
    6.3.4 Applying synaptic updates
  6.4 Passing pointers
    6.4.1 Delaying pointers to synaptic updates
  6.5 Synaptic update algorithm
    6.5.1 Benefits of the algorithm
  6.6 Evaluation of the algorithm
    6.6.1 Mathematical models
    6.6.2 Results
    6.6.3 Memory size
  6.7 Conclusion
7 Implementation
  7.1 Introduction
  7.2 Splitting the algorithm
    7.2.1 Equation processor
    7.2.2 Fan-out engine
    7.2.3 Delay unit
    7.2.4 Accumulator
    7.2.5 Memory spike source
    7.2.6 Spike injector
    7.2.7 Spike auditor
    7.2.8 Control state machine
    7.2.9 Control logic
  7.3 Input data
    7.3.1 Neural network description format
    7.3.2 Memory image file format
    7.3.3 Memory image file generation
  7.4 Inter-FPGA communication
    7.4.1 Physical layer
    7.4.2 Link layer
    7.4.3 Routing
    7.4.4 Message filtering and delivery
    7.4.5 Application usage
  7.5 Implementation languages
  7.6 Conclusion
8 Evaluation
  8.1 Introduction
  8.2 Benchmark neural network
  8.3 Scale
  8.4 Communication overhead
  8.5 Validation
  8.6 Conclusion
9 Conclusions
  9.1 Evaluation of hypotheses
    9.1.1 Scalability Hypothesis
    9.1.2 Communication-Centric Hypothesis
    9.1.3 Bandwidth Hypothesis
  9.2 Comparison to other systems
    9.2.1 Supercomputer-based systems
    9.2.2 PC-based systems
    9.2.3 GPU-based systems
    9.2.4 Bespoke hardware-based systems
  9.3 Future work
    9.3.1 Increase in scale
    9.3.2 Increase in flexibility

Chapter 1
Introduction

Reverse-engineering the brain is one of the US National Academy of Engineering's "Grand Challenges." While there are many ways to go about this, one method is to study the behaviour of real brains and attempt to emulate their behaviour using computer-based computation of an inferred neural network. This process will be referred to as "neural computation" throughout this work.

1.1 Motivation for neural computation

The structure and behaviour of individual neurons is well known, particularly as a result of experiments by Hodgkin and Huxley (1952), who studied the behaviour of giant squid neurons. We also have a good understanding of the high-level functional structure of the human brain based on studies using brain imaging technologies such as MRI scans. However, we currently know very little about how neurons in the human brain are connected to form neural systems capable of exhibiting the functions that we observe.

One way to explore the connectivity of neurons in the brain is to infer candidate neural networks based on observed functionality and then compute their behaviour. The results of these computations can be used to evaluate and refine the inferred neural networks and hence to gain a greater understanding of the low-level structure of the human brain.

Another application of neural computation is creating "brain-like" systems, particularly in the fields of computer vision (Rowley et al., 1998) and pattern recognition (Rice et al., 2009). By emulating the methods used by the human brain to perform visual processing, there is potential to create systems that have higher performance and lower power consumption than those created using conventional computer architectures. Such systems will need to be developed with the support of neural computation.

1.2 Fundamental processes of neural computation

All neural computation systems need to emulate the behaviour of biological neural networks. While they vary in the level of detail and accuracy that they provide (principally as a result of the algorithms used to model the behaviour of neurons and their synaptic connections), they all need to provide:

Algorithms to model the behaviour of neurons and their interaction via synaptic connections
Data structures to store the parameters of neurons and synaptic connections
Communication infrastructure to support modelling of the interaction between neurons

The majority of existing neural computation systems, including PC-based systems such as NEST (Gewaltig and Diesmann, 2007) and many hardware-based systems such as those by Thomas and Luk (2009) and Fidjeland and Shanahan (2010), have been designed with little focus on either data structures or communication. As a result, the size of neural network that these systems can perform neural computation with is limited to whatever can be handled by a single device (such as a single PC or single custom hardware device) as they lack the ability to scale to handle larger neural networks using additional devices operating in parallel. The state-of-
The state-of- the-art is around 50k neurons and 50M synaptic connections on a single device as exhibited by Fidjeland and Shanahan. The main exceptions to this are supercomputer-based neural computation systems such as that by Ananthanarayanan et al. (2009), which are able to take advantage of the custom communications infrastructure provided by supercomputers. This allows them to perform neural computation with neural networks which are or- ders of magnitude larger than those that can be handled by other systems (around 109 neurons and 1013 synaptic connections), but at a financial cost that puts these systems out of reach of the majority of researchers. 1.3 Goals and hypotheses This work aims to design a neural computation architecture with communication and data structures as first-class design constraints alongside the algorithms used to model neurons and synaptic connections. This will lead to an architecture that can scale to use many devices operating in parallel and hence perform neural com- putation with neural networks consisting of millions of neurons and billions of synaptic connections in real-time. 20 1.3. GOALS AND HYPOTHESES Since I aim to perform neural computations at a scale that greatly exceeds the ca- pacity of any single device, I propose the Scalability Hypothesis. Scalability Hypothesis The scale of neural network that can be handled by a neural computation system in real- time must be able to be increased by scaling the system to multiple devices The Scalability Hypothesis indicates that a neural computation system must be a parallel processing system. There are two major types of resources in a parallel processing system: Compute resources Perform a portion of the computation involved in a parallel processing task Communication resources Pass data between compute resources and provide coordination between them so that the complete system produces the correct result for a parallel processing task If a neural network (as well as the system used to perform neural computation of this network) is considered to be a parallel processing system, then neurons can be considered to be compute resources, while synaptic connections (see Sec- tion 2.2) can be considered to be communication resources. This leads to the Communication-Centric Hypothesis. Communication-Centric Hypothesis The scalability of a neural computation system is communication-bound, not compute- bound The Communication-Centric Hypothesis means that the work involved in model- ling communication in a neural computation system dominates the work involved in modelling the behaviour of neurons. It also means that the scale of neural net- work that can be handled by a neural computation system in real-time is bounded by the availability of communication resources in this system rather than the avail- ability of compute resources. Communication resources in a parallel processing system can be divided into those that provide inter-device communication and those that provide access to memory. This leads to the Bandwidth Hypothesis. CHAPTER 1. INTRODUCTION 21 Bandwidth Hypothesis The scale of neural network that can be handled by a neural computation system in real- time is bounded by inter-device communication bandwidth and memory bandwidth The Bandwidth Hypothesis provides two important considerations when design- ing a massively parallel neural computation system: 1. Using an implementation platform that provides sufficient inter-device com- munication bandwidth and memory bandwidth 2. 
Using this bandwidth efficiently to maximise the scale of neural network that can be handled by the neural computation system in real-time This thesis will provide justification for these hypotheses. 1.4 Overview I begin by reviewing related work on neural computation systems in Chapter 2 to discover what methods have been used to model the behaviour of neurons and their communication, and how the scale of neural computation that can be handled by these systems has been increased, both on a single chip and by scaling to mul- tiple chips. This provides justification for the Scalability Hypothesis and begins to provide justification for the Communication-Centric Hypothesis and Bandwidth Hypothesis. Chapter 3 analyses the Izhikevich spiking neuron model to determine how its neuron modelling equations can be adapted from floating-point and continuous- time to create a fixed-point, discrete-time neuron model that is better suited to massively parallel neural computation. The communication properties of biolo- gical neural networks are also analysed, as they affect the design of both neural computation systems and the benchmark neural networks that are used to evalu- ate them. 22 1.5. PUBLICATIONS Chapter 4 divides neural computation into separate tasks and analyses the volume of work involved in each task and their memory and communication bandwidth requirements. This justifies the Bandwidth Hypothesis and leads to discussion of appropriate implementation platforms for a massively parallel neural computation system, after which a platform based on a network of Field Programmable Gate Ar- rays (FPGAs) is selected. This platform is presented in Chapter 5, with discussion of how its features influence the design of a neural computation system implemen- ted on it. Chapter 6 develops and evaluates an algorithm for modelling communication between neurons that takes account of the Bandwidth Hypothesis and is optimised for a FPGA-based neural computation platform. This platform is implemented in Chapter 7 and evaluated in Chapter 8. Chapter 9 evaluates the three hypotheses proposed in this chapter, draws conclusions and considers future work. 1.5 Publications The following publications were produced in the course of this work: Simon W Moore, Paul J Fox, Steven JT Marsh, A Theodore Markettos and Alan Mujumdar; “Bluehive - A Field-Programable Custom Computing Machine for Extreme-Scale Real-Time Neural Network Simulation”; 20th Annual International Symposium on Field-Programmable Custom Computing Machines pp 133-140; April 29 - May 1 2012 Paul J Fox and Simon W Moore; “Efficient Handling of Synaptic Updates in FPGA- based Large-Scale Neural Network Simulations”; 2012 Workshop on Neural En- gineering using Reconfigurable Hardware; September 1 2012; Held in Conjunction with FPL 2012 Paul J Fox and Simon W Moore; “Communication-focussed Approach for Real- time Neural Simulation”; 3rd International Workshop on Parallel Architectures and Bioinspired Algorithms pp 9-16; September 11 2010; Held in conjunction with PACT 2010 Chapter2 Background 23 24 2.1. INTRODUCTION 2.1 Introduction To create a massively parallel neural computation system that can scale to computa- tions of large neural networks in real-time, we must understand the tasks involved in neural computation, how they have been approached by other researchers and what limits the scale and speed of a neural computation system. 
We begin with an overview of the fundamentals of neural computation, finding that there are two fundamental requirements – modelling neurons and modelling their communication. Modelling neural communication can be further divided into modelling the topology of a neural network and modelling the effect that neurons have on other neurons via synaptic connections. Having identified these funda- mental tasks of neural computation, we review how other work has approached them, ranging from methods that closely mirror the processes and structure of bio- logical neural networks to those that aim to emulate only their externally observ- able behaviour. 2.2 Fundamentals of neural computation Neural computation is the process of replicating the behaviour of a biological neural network using a computer system. A diagram of a biological neuron is shown in Figure 2.1 on the facing page. It has four major components that are pertinent to neural computation: Dendrites Receive input from other neurons via synaptic connections. Soma The main body of a neuron. It receives input from the dendrites. The aggreg- ated input has an effect on its membrane potential. If the membrane potential passes a threshold then the soma will emit an action potential or “spike,” and then reset its membrane potential and become less responsive to input for a period. Axon Carries spikes over some distance. Wrapped in a myelin sheaf to provide electrical insulation. Its end is fanned-out to connect to multiple target neurons. CHAPTER 2. BACKGROUND 25 Synapse A gap between the axon endings of a source neuron and the dendrites of a target neuron. Introduces a delay to the communication of spikes between neurons and also provides a varying degree of effect of each spike on the tar- get neuron, which results in each target neuron receiving a varying “synaptic update.” Axon SynapseSoma Dendrites Figure 2.1: Diagram of neuron showing components that are pertinent to neural computation 2.3 Neural computation methods All methods of neural computation will need to model each of the four major com- ponents of a neuron to some degree. Computation methods can be categorised based on how closely they aim to replicate biological processes, both at a chemical and an electrical level. The soma can be thought of as the computing the state of a neuron, while the dend- rites, axon and synaptic connections collectively communicate synaptic updates between neurons. While each neuron has a single soma and axon, there can be many dendrites and synaptic connections (around 103 per neuron in the human brain). As will be shown in Chapter 3, this means that the work involved in mod- elling neural communication greatly exceeds that involved in modelling the beha- viour of the soma, particularly if the behaviour of each dendrite and synapse is modelled separately. Since the axon, dendrites and synapses are all involved in neural communication rather than computing a neuron’s state, the remainder of this work refers to mod- elling the behaviour of the soma as modelling the behaviour of a neuron. 26 2.3. NEURAL COMPUTATIONMETHODS Figure 2.2: A neural network as a number of communicating processes 2.3.1 Mapping neural networks on to computation systems Before neural computation can be performed with a neural network, it needs to be mapped on to a neural computation system. The first stage of this process involves selecting a neuron model and a communication model. 
The neural network can then be though of as a number of neuron processes communicating with each other via communication channels which represent axons, synapses and dendrites. Figure 2.2 shows a neural network modelled as a number of communicating pro- cesses. The circles represent neuron processes and the lines communication chan- nels between these processes. The red lines show the communication channels between neuron process 1 and neuron processes 6, 7 and 9. This communicating process model provides a layer of abstraction between neural networks (and in particular biological neural networks) and neural computation systems. Given a suitable model for each neuron process and communication channel, it will be pos- sible to model the behaviour of a neural network with the desired level of biological plausibility and accuracy. If a neural computation system conceptually consists of a number of connected devices, then each neuron process will need to be mapped on to one of these devices. The communication channels between the neuron processes will then need to be mapped on to the physical communication links between the devices. Since biological neural networks frequently have communication channels that form “non-uniform” patterns, with it being possible for arbitrary neurons in a net- work to be connected, this means that multiplexing and routing will be needed to overlay these communication channels on to physical communication links. CHAPTER 2. BACKGROUND 27 Figure 2.3: Communicating process network from Figure 2.2 mapped on to a neural computation system For example, the communicating process network in Figure 2.2 could be mapped on to a neural computation system with a number of devices connected in a two- dimensional grid as shown in Figure 2.3. The channels between neuron process 1 and neuron processes 6 and 7 each use one physical link between two adjacent devices, while the channel between neuron process 1 and neuron process 9 uses two physical links and some form of routing on the intermediate device. 2.3.2 Neuron models Neuron models range in complexity from detailed models that are claimed to be biologically accurate, to much simpler models that aim to model only the externally-observable spiking behaviour of a neuron, known as spiking neuron models. Which type of model is used in a neural computation system depends on the goals of individual researchers and the resources available to them, as many researchers are happy to accept the reduced accuracy of spiking neuron models or believe that the extra detail provided by more biologically accurate models is irrelevant. Using a simpler neuron model also allows neural computation to be performed for larger neural networks than more complex models using the same resources. 28 2.3. NEURAL COMPUTATIONMETHODS Integrate and fire Izhikevich Hodgkin-Huxley Increasing computational complexity −→ In cr ea si ng bi ol og ic al pl au si bi lit y −→ Figure 2.4: Biological plausibility and computational complexity of spiking neuron models. Redrawn from (Izhikevich, 2004). Unlabelled points are other neuron models considered by Izhikevich One of the most commonly used biologically-accurate neuron models is the Hodgkin-Huxley model (1952) which aims to model the behaviour of a neuron based on observed chemical and electrical processes, so that both the internal and external behaviour of a modelled neuron follow biology. The model uses ten dif- ferential equations per neuron. 
There are many spiking neuron models, with varying levels of complexity and bio- logical accuracy of the spike patterns that they produce. Some of the simplest spik- ing neuron models are the integrate and fire (Abbott, 1999) and leaky integrate and fire (Stein, 1965) models, which both model the behaviour of a neuron using a single differential equation. Another, more complex, spiking neuron model is the Izhikevich model (2003), which aims to be able to reproduce many observed types of neuron spike patterns using two differential equations per neuron. Izhikevich (2004) surveys a number of spiking neuron models, and classifies them based on their biological plausibility and implementation cost (analogous to com- putational complexity). The results of this analysis are shown in Figure 2.4, and show that (based on the metrics used by Izhikevich) the Izhikevich model has an at- tractive combination of biological plausibility and computational complexity. Neurons can also be modelled using electrical circuits that replicate their behaviour, typically using analogue components such as capacitors to model features such as membrane potential. For example the Hodgkin-Huxley model can be specified in terms of an electrical circuit as well as a set of differential equations (Toumazou et al., 1998). CHAPTER 2. BACKGROUND 29 2.3.3 Communication models Spiking neuron models assume that axons and dendrites are essentially equivalent to electrical wiring, and hence their communication models are restricted to com- municating neural spikes and modelling the effect of synapses on the transmission of these neural spikes from source to target neurons. Synapses are normally mod- elled using a simple model with each synaptic connection having a delay and a multiplicative weight, which can be either positive or negative, to model varying effect of a spike on a target neuron, and hence a varying synaptic update. The weights of each synaptic update targeted at a neuron are summed to produce a single value that is input to the neuron model. More complex neuron models are often accompanied by complex communication models, with detailed models of the effect of axons, dendrites and synapses on the transmission of neural spikes. Axons and dendrites can be modelled using cable theory (Rall, 1959), while synapse models can have a range of complexities, which can model the effect of both electrical and chemical processes on the transmission of neural spikes. This results in complex models of the behaviour of the soma, axon, dendrites and synapses such as that used by Traub et al. (2005). These models are often designed to model specific regions of the brain (particularly cortical regions), for example the model used by Ananthanarayanan et al. (2009) is specific to the cat cortex. These detailed communication models add orders of magnitude of complexity to a neural computation as the number of dendrites and synaptic connections that need to be modelled is orders of magnitude greater (between 103× and 104×) than the number of neurons that need to be modelled. Therefore such models result in neural computation systems that operate many times slower than real- time (Ananthanarayanan et al., 2009). Communication models may also implement some form of learning mechanism, such as Hebbian learning (Hebb, 1949). 
This is often based on altering the con- nectivity or parameters of synaptic connections (synaptic plasticity) in response to spike activity (Song et al., 2000), which will add computational complexity to a neural computation but will not affect communication complexity. 2.3.4 Implementation methods There is a major division in neural computation between software-based and hardware-based systems. Software-based neural computation systems support larger neural networks by using more compute resources, with communication 30 2.4. SOFTWARE-BASED NEURAL COMPUTATION between these resources provided by some form of shared memory. Hardware- based neural computation systems show more variation, with different methods used to model neurons and to model their communication, with work by other re- searchers often combining different neural modelling methods with different com- munication modelling methods. Therefore we will survey methods of modelling neurons and of modelling their communication in hardware-based neural compu- tation systems separately. 2.4 Software-based neural computation In a software-based neural computation system, neurons and neural communica- tion are both modelled in software. Communication resources typically use some form of shared memory. Scaling to larger neural networks is achieved primarily by increasing compute and communication resources (as suggested by the Scalabil- ity Hypothesis), but shared memory communication scales poorly without special- ised interconnect, so neural computation of large neural networks either requires a supercomputer with a custom memory hierarchy (Migliore et al., 2006) or many multiples of real-time on commodity PC-based systems with finite resources. 2.4.1 Supercomputer-based systems A notable supercomputer-based neural computation system has been produced by the Blue Brain project (Markram, 2006), using an IBM BlueGene/L supercomputer. A similar system by Ananthanarayanan et al. (2009) can simulate a neural network based on the cat cortex with 0.9× 109 neurons and 0.9× 1013 synaptic connections around 83× times slower than real-time per Hertz of mean spike frequency (which would be 830× slower than real-time for a mean spike frequency of 10 Hz) with a sampling interval of 1 ms. This saturates the memory bandwidth in an IBM BlueGene/P supercomputer with 147456 CPUs and 144 TB of RAM. The number of CPUs in this system provides evidence for the Scalability Hypothesis. Ananthanarayanan et al. demonstrate that doubling the available RAM in this sys- tem almost doubles the size of neural network that can be handled by this system in a given time-scale, which agrees with the Bandwidth Hypothesis. These results are impressive (a neural network around 25× the size of this network would match the scale of the human brain), but the cost, size and power requirements of the su- percomputer required by this system puts it out of reach of the vast majority of researchers. CHAPTER 2. BACKGROUND 31 2.4.2 PC-based systems PC-based systems are more readily available to researchers than supercomputer- based systems, and can be used for smaller-scale neural computations. For ex- ample NEURON (Hines and Carnevale, 1997), Brian (Goodman and Brette, 2008) and Nest (Gewaltig and Diesmann, 2007) all use commodity PC hardware. While there is support for using multiple processors in these PC-based systems, they are ultimately limited to running on a single PC as the hardware is not designed to support shared memory between separate systems. 
For example, Steven Marsh found that a single-threaded neural network simulator written in C required 48.8 s to perform a neural computation of 300 ms of activity for 256k Izhikevich neur- ons, using the benchmark neural network presented in Section 8.2, which is 163× slower than real-time (Moore et al., 2012). This is broadly comparable to the time- scale of supercomputer-based systems, but for a massively smaller neural network scale. It would be relatively trivial to perform neural computations of many unconnec- ted neural networks using many PCs to create computations with the same total number of neurons as the neural networks used with supercomputer-based neural computation systems, but this is not comparable as the separate, small compu- tations would not model the full set of synaptic connections in the larger neural network. This provides support for the Communication-Centric Hypothesis. 2.4.3 GPU-based systems Between supercomputer-based and PC-based systems sit a more recent class of neural computation systems based on Graphics Processing Units (GPUs). These can provide real-time performance for reasonably large neural networks (for example Fidjeland and Shanahan (2010) report computations of up to 3 × 104 Izhikevich spiking neurons in real-time), but they struggle to scale to larger neural networks using additional GPUs as communication between GPUs is only avail- able via the CPU in the host PC, a limit which supports the Bandwidth Hypo- thesis. 32 2.5. HARDWARE-BASED NEURAL COMPUTATION 2.5 Hardware-based neural computation Hardware-based neural computation systems typically separate modelling neur- ons, modelling the topology of a neural network and modelling synaptic connec- tions into separate, inter-connected tasks that operate in parallel, using separate circuits. We will survey methods for performing each of these tasks separately as different methods are often combined in systems by other researchers. 2.5.1 Modelling neurons Hardware-based neural computation systems can model neurons using both ana- logue circuits and mathematical models. Between these two modelling methods are models that directly map each neuron into hardware using a dedicated digital circuit with similar behaviour to an analogue neuron model. Analogue models Analogue neuron models use an analogue circuit with components that mimic the behaviour of biological neurons. These range from neuron models that closely rep- licate the structure of biological neurons such as those implemented by Mahowald and Douglas (1991), Saighi et al. (2005) and Alvado et al. (2004), which all imple- ment the Hodgkin-Huxley model (1952), to those that replicate only the spiking behaviour of biological neurons, such as the array of leaky integrate and fire neur- ons implemented by Indiveri et al. (2006). Basu et al. (2010) use a combination of hardware blocks that allows computation with both integrate and fire neurons and more complex multichannel neuron models such as Hodgkin-Huxley. In some cases the parameters of the neuron model can be altered, using an ana- logue memory (Saighi et al., 2005) or a more conventional digital memory (Al- vado et al., 2004), while in others the ability to alter parameters is more limited, for example only allowing spike threshold voltages to be altered (Mahowald and Douglas, 1991). Basu et al. allow the behaviour of their configurable blocks to be altered using programmable floating gate transistors. 
The biggest weakness of analogue neurons models is their lack of scalability, for example Indiveri et al. (2006) implement 32 neurons on a chip, while Saighi et al. (2005) implement 5 and Basu et al. up to 84 integrate and fire neurons or a smaller number of Hodgkin-Huxley neurons. CHAPTER 2. BACKGROUND 33 Analogue components have large physical size compared to the transistors used in digital systems, and hence the number of neurons that can be modelled on a single device using analogue neuron models is limited. Direct mapping into hardware An evolution of analogue neuron models are digital models which replace each analogue neuron with a digital neuron implemented in a hardware description language (HDL) such as Verilog or VHDL (Bailey, 2010; Upegui et al., 2005). The number of neurons that can be modelled using a single chip is broadly comparable with analogue models. Bailey (2010) implements 100 neurons and 200 synaptic connections on a Xilinx Virtex 5 100T FPGA, while Upegui et al. (2005) implements 30 neurons with up to 900 synaptic connections in a more constrained topology on a Xilinx Spartan II xc2s200 FPGA. The scale of a neural computation system using direct mapping is limited by need- ing a separate HDL block to represent each neuron, although it is possible to time- multiplex HDL blocks in some cases, such as if the topology of a neural network forms a regular grid (Yang et al., 2011). Maguire et al. (2007) take another approach, with neuron state being stored in memory and HDL blocks being allocated to neur- ons by a coordinating processor, allowing these blocks to be time-multiplexed. This allows significantly more neurons to be modelled per chip, around 2× 103 on an unspecified Xilinx FPGA. Mathematical models Hardware-based neural computation systems can use the same mathematical neuron models as software-based neural computation systems, particularly simple spiking neuron models. Using discrete-time mathematical neuron models allows compute resources to be shared between neurons using time-multiplexing, allow- ing the number of neurons that can be modelled by each resource to be increased significantly. The proportion of total resources required by each neuron depends on the sampling interval of the neuron model. One method of evaluating neuron modelling equations in a hardware-based sys- tem is using many small processor cores running simple embedded software. This approach is taken by Jin et al. (2008), who use a network of custom ASICs with up to 20 small ARM processor cores per chip, with around 103 neurons per core. 34 2.5. HARDWARE-BASED NEURAL COMPUTATION Another method of evaluating neuron modelling equations in a hardware-based system is converting them into a data-flow pipeline (Thomas and Luk, 2009; Cas- sidy et al., 2011; Martinez-Alvarez et al., 2007; Rice et al., 2009). Mathematical neuron models are amenable to being converted to pipelines with many stages (Rice et al. use 23 stages for an Izhikevich neuron model), as the neuron modelling equations must be evaluated for every neuron in a neural network once per sampling inter- val, resulting in simple control flow with no branches that would require a pipeline to be flushed. Since each neuron is represented by a flow of data rather than by physical re- sources, the throughput of both the embedded software and data flow approaches and hence the number of neurons in a real-time neural computation is limited by the ability of the neural computation system to fetch this data from memory. 
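To make the memory-bound nature of this time-multiplexed approach concrete, the sketch below shows one sampling interval as a single sequential pass over neuron state held in (off-chip) memory, written in C (the language of the reference simulator mentioned in Section 2.4.2). The structure layout, function names and callback interface are illustrative assumptions of this sketch, not the design of any of the cited systems.

```c
#include <stdint.h>
#include <stddef.h>

/* Per-neuron state and parameters, stored contiguously so that one sampling
   interval is a single streaming pass over the array. Field names are
   illustrative only. */
typedef struct {
    int32_t v, u;           /* state variables      */
    int32_t a, b, c, d;     /* per-neuron parameters */
} neuron_state_t;

/* One sampling interval: every neuron is updated by the same shared update
   logic ('step'), so the hardware or CPU cost per interval is fixed and the
   real-time neuron count is bounded by how fast this array can be read and
   written back. */
void compute_interval(neuron_state_t *state, size_t num_neurons,
                      const int32_t *input,   /* summed synaptic input per neuron */
                      int (*step)(neuron_state_t *, int32_t),
                      void (*emit_spike)(size_t))
{
    for (size_t n = 0; n < num_neurons; n++)
        if (step(&state[n], input[n]))
            emit_spike(n);
}
```

However the per-neuron step is realised, whether as embedded software or as a data-flow pipeline, each sampling interval requires the whole state array to be streamed from memory and written back.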
This agrees with the Bandwidth Hypothesis. 2.5.2 Modelling neural network topology Modelling the topology of a neural network requires some form of interconnect in a neural computation system. Within an individual device (either custom ASIC, FPGA or CPU) this interconnect can be provided by internal wiring (either fixed, reconfigurable or an on-chip network), but if a neural computation system is to scale to multiple devices (as suggested by the Scalability Hypothesis), some kind of inter-device interconnect will be needed. Analogue wiring Pure analogue neural models use voltages on wires between neuron blocks to model the topology of a neural network. For example, Saighi et al. (2005) con- nect their analogue neuron modelling blocks together using reconfigurable wiring controlled by data in a SRAM, similar to the method used to control routing in FPGAs. Digital wiring Digital wiring can be used to model the topology of a neural network, particu- larly when neuron models are directly mapped into hardware using Field Pro- grammable Gate Arrays (FPGAs), and also in some custom neural computation systems, such as those by Alvado et al. (2004) and Basu et al. (2010). While FPGA- based neural computation systems using this approach are much easier to imple- ment than custom systems as FPGAs are readily available to researchers, the topo- CHAPTER 2. BACKGROUND 35 logy of a neural network can only be altered by resynthesising a FPGA bitstream, which is time-consuming. Altering a neural network topology at run-time is pos- sible using dynamic partial reconfiguration, as done by Upegui et al. (2005), but this requires that a portion of the FPGA bitstream is resynthesised for every topology change, which rules out frequent changes. Modelling the topology of a neural network using wiring that directly mirrors this topology is often limited to single-chip systems as extending this type of topology modelling to multiple chips requires more pins than can be implemented on a chip package, which suggests that this method of neural network topology modelling is not suited to creating neural computation systems for large neural networks. The number of neurons and connections that can be modelled per chip is also limited by routing resources, as connections between neurons tend not to form regular grid pattens, unless the arrangement of these connections is artificially limited by the architecture (Upegui et al., 2005). This provides support for the Communication-Centric Hypothesis. Packet-based signalling The topology of a neural network can be modelled more abstractly by modelling neuron spikes (or other types of communication) using packets that indicates the identity of the source neuron (and possibly other information), commonly called AER packets (Boahen, 2000). These packets must be transmitted to target neurons using some form of network. The topology of a neural network is then modelled using a routing system that identifies which source neurons are connected to which target neurons. This re- quires a routing table, which is typically stored in off-chip memory in larger neural computation systems (Jin et al., 2008; Cassidy et al., 2011). 
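As an illustration of what such a routing table involves, the following C sketch shows an AER packet carrying only the source neuron's identity, and a per-source-neuron fan-out list held in a table that would sit in off-chip memory. The field layout, widths and names are assumptions made for illustration rather than the format used by any of the cited systems.

```c
#include <stdint.h>

/* Hypothetical AER packet: just the identity of the source neuron that
   spiked (Boahen, 2000). */
typedef struct {
    uint32_t source_neuron;
} aer_packet_t;

/* One routing-table entry per synaptic connection of the source neuron. */
typedef struct {
    uint32_t target_neuron;
    uint16_t delay;     /* in sampling intervals          */
    int16_t  weight;    /* fixed-point synaptic weight    */
} synapse_entry_t;

/* Per-source-neuron index into the table: start offset and fan-out
   (around 10^3 for biologically-plausible fan-out). */
typedef struct {
    uint32_t first_entry;
    uint32_t fan_out;
} routing_index_t;

/* On receipt of an AER packet, walk the source neuron's fan-out list.
   Since 'index' and 'table' live in off-chip memory, every spike costs
   fan_out sequential reads, which is where memory bandwidth is consumed. */
void deliver_spike(const aer_packet_t *pkt,
                   const routing_index_t *index,
                   const synapse_entry_t *table,
                   void (*apply)(const synapse_entry_t *))
{
    routing_index_t idx = index[pkt->source_neuron];
    for (uint32_t i = 0; i < idx.fan_out; i++)
        apply(&table[idx.first_entry + i]);
}
```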
The memory needed to store this routing table is typically orders of magnitude larger than that needed to store neuron modelling equation parameters when the number of target neurons per source neuron is high (high fan-out), and hence the scale of neural network that can be handled by these neural computation systems in real-time is bounded by memory bandwidth, supporting the Bandwidth Hypothesis. Packet-based signalling is particularly suited to modelling the topology of a neural network in neural computation systems that span multiple chips, for example In- diveri et al. (2006) use analogue neuron and synapse models with packet-based signalling for inter-chip communication. It can also be used to model the topology of a neural network within a chip using an on-chip network (Dally and Towles, 2001). Packet-based signalling could be used just for long-distance connections, 36 2.5. HARDWARE-BASED NEURAL COMPUTATION such as in the system implemented by Emery et al. (2009), which arranges neurons into tiles and uses programmable logic for connections within a tile and packet- based signalling for long-distance and inter-chip connections. Other systems such those implemented by Jin et al. (2008), Thomas and Luk (2009) and Cassidy et al. (2011) use an on- and inter-chip network to model the entire neural network topo- logy. 2.5.3 Modelling synaptic connections Synaptic connections can be modelled using either analogue models or arithmetic models, both with varying levels of complexity. Analogue models The behaviour of synaptic connections can be modelled using analogue hardware, with a separate hardware block for each synaptic connection. This allows for a more complex range of synaptic behaviour (particular plasticity) than delay-weight models. For example Rachmuth et al. (2011) implement a detailed model of a single synaptic connection on a single chip using a complex model involving ion chan- nels and supporting synaptic plasticity. Even with less complex analogue synaptic connection models, the number of neurons and synapses that can be implemented on a single chip is limited as analogue synaptic connection models take significant chip area, for example Indiveri et al. (2006) report 32 neurons and 256 synapses per chip. This is a mean fan-out of only 8, compared to 103 in the human brain. Given the resource requirements of each synapse, it would appear that implementing a neural computation with biologically-plausible fan-out using an analogue synapse model is not possible. Arithmetic models If a simple delay-weight synapse model is used then modelling synaptic connec- tions requires determining what the delay and weight of each connection should be, applying the delay and then summing the weights of each synaptic update for each neuron. Jin et al. (2008), Maguire et al. (2007), Cassidy et al. (2011) and Rice et al. (2009) all store the delay and weight of each synaptic connection in memory and then fetch it when a spike is received from a source neuron. Basu et al. (2010) encode synaptic weights in the interconnect between neuron blocks using floating gate transistors. CHAPTER 2. BACKGROUND 37 Jin et al. apply the delay and sum the weights using embedded software and tem- porary memory while Maguire et al., Cassidy et al. and Rice et al. use dedicated hardware. Maguire et al. provide a dedicated hardware block for each synapse while Cassidy et al. use a data-flow processing pipeline and store the resulting summed value in memory before it is input to the neuron model. 
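A minimal C sketch of the delay-and-weight scheme described above is given below: each synaptic update adds its weight into a ring buffer of per-neuron accumulators indexed by delay, and the slot for the current interval is read out as the neuron model's input I and then cleared. The buffer sizes, the assumed upper bound on delay and the names are illustrative assumptions, not the organisation used by the cited systems.

```c
#include <stdint.h>

#define MAX_DELAY   16      /* assumed upper bound on synaptic delay, in intervals */
#define NUM_NEURONS 1024    /* illustrative network size */

/* accumulator[d][n] holds the summed weights that become neuron n's input
   d intervals from now; 'now' indexes the ring buffer slot for the current
   sampling interval. */
static int32_t  accumulator[MAX_DELAY][NUM_NEURONS];
static uint32_t now;

/* Apply one synaptic update: add the weight into the slot that will be read
   once the connection's delay (assumed < MAX_DELAY) has elapsed. */
void apply_update(uint32_t target, uint32_t delay, int32_t weight)
{
    accumulator[(now + delay) % MAX_DELAY][target] += weight;
}

/* At the start of each sampling interval, hand the accumulated input for
   neuron n to the neuron model and clear the slot for reuse. */
int32_t read_and_clear_input(uint32_t n)
{
    int32_t I = accumulator[now][n];
    accumulator[now][n] = 0;
    return I;
}

void advance_interval(void)
{
    now = (now + 1) % MAX_DELAY;
}
```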
With the parameters of each synaptic connection stored in memory, memory band- width limits the number of synaptic connections that can be modelled by a neural computation system in a given time period, as suggested by the Bandwidth Hypo- thesis. 2.6 Design decisions The evaluation of neural computation systems reveals several major decisions that need to be made when designing a massively parallel neural computation sys- tem: Software or hardware implementation Software-based neural computation systems allow a wide variety of neuron models to be used, but they are not capable of performing massively paral- lel neural computation in real-time on platforms available to the majority of researchers. Hardware-based systems add many constraints but show signi- ficantly more potential for massively parallel computation. Neuron modelling While neuron modelling using analogue hardware has the potential to be sig- nificantly more accurate than digital modelling (since analogue hardware op- erates in continuous-time and digital in discrete-time), analogue hardware implementations are restricted to custom ASICs and show limited potential to scale to massively parallel computations, particularly as analogue compon- ents use a large amount of chip area. Hence a discrete-time, digital neuron model is most appropriate for massively parallel neural computation. The Izhikevich model will be used as it has an attractive combination of biolo- gical plausibility and computational complexity compared to other models such as integrate and fire. 38 2.6. DESIGN DECISIONS Synaptic connection modelling Similarly synaptic connections can be modelled using either analogue or di- gital hardware. While analogue again gives potential for greater accuracy and complex modelling of synapse behaviour, each synaptic connection will need dedicated hardware, which is unfeasible in a system with millions of neurons and billions of synaptic connections. In the digital domain it is pos- sible to model the behaviour of synaptic connections using a range of models, but a simple model of discrete delay and weight shows the most potential for scaling to a massively parallel neural computation. Neural network topology modelling A variety of methods have been used to model the topology of a neural net- work in neural computation systems. Many are based on providing dedicated wiring that mimics each connection between source and target neurons, of- ten using the reconfigurable routing resources in a FPGA, but these methods exhibit poor scaling. Modelling neural network topology using AER packets shows significantly more scaling potential, particularly when combined with an on- and inter-chip network. This means that the neural network topology will need to be stored in a routing table. The routing table can held in on-chip memory, but this cannot scale to massively parallel neural computations, par- ticularly computations of neural networks with high fan-out as this results in large routing tables that will not fit in on-chip memory. Hence at least some of the data describing the topology of a neural network will need to be held in off-chip memory. Resource sharing A system using a discrete-time neuron model allows for the hardware used to model neurons and their communication to be time-multiplexed. 
Given that the equations used in the neuron model will need to be evaluated at around 1 ms intervals (see Section 3.2.3) and that the pipelines used to evaluate these equations can operate at 100s of MHz, the potential scale of a neural compu- tation is significantly increased by time-multiplexing resources. Numerical precision Researchers who aim to create large-scale neural computation systems often use fixed-point arithmetic for their neuron models, as it is less complex than floating-point and hence has greater scaling potential. There will be a loss of accuracy as a result, manifested largely in differing spike times, but this is not particularly concerning as the overall spike pattern resulting from a neural computation (and in particular the probability density of the distribution of these spikes (Berger and Levy, 2010)) is of more interest than the exact timing of individual spikes. CHAPTER 2. BACKGROUND 39 Time-scale Neural computation systems created by other researchers range from being orders of magnitude slower than real-time to orders of magnitude faster than real-time. The time-scale used by each system depends on the goals of the re- searchers who created it, and on limitations imposed by the system itself, such as maximum clock frequency or memory and communication band- width. Selection of a time-scale is hence somewhat arbitrary, particularly if a neural computation system is implemented using data-flow pipelines where the volume of data that can be processed in a given time period is effectively fixed by available memory bandwidth, and hence the size of neural network that can be handled by the neural computation system is proportional to the time-scale used and the number of memory channels in the system. This work uses a time-scale of real-time (that is that the computation of 1 ms of neural network activity takes 1 ms) as this will ease comparison with other large neural computation systems such as that by Jin et al. (2008), and also al- low for interfacing to real-time systems such as external sensors in the future. 2.7 Conclusion When creating a massively parallel neural computation system that operates in real-time and that can scale to computations of large neural networks with high fan-out, we have much to learn from other neural computation systems. It is clear that a massively parallel neural computation system must model neurons using a digital, discrete-time neuron model (such as the Izhikevich spiking neuron model) as neural computation with analogue neuron models has only been demonstrated for very small scale neural networks. It is also clear that neural network topology should be modelled using packet- based signalling over an on- and inter-chip network, as other neural network to- pology modelling methods do not allow a neural computation system to scale bey- ond a single chip, which is essential for large-scale neural computation systems. Packet-based signalling also allows time-multiplexing of resources, which allows a massive increase in the scale of a neural computation. Further resource saving and hence increased scale can be achieved using fixed-point arithmetic in place of floating-point in neuron models. 40 2.7. CONCLUSION The biggest unsolved challenge is how to efficiently model neural network topo- logy and synaptic connections, which can be referred to as communicating and applying synaptic updates. It is clear that an on- and inter-chip network should be used to do this. 
The amount of data and bandwidth that will be needed to model the topology of a large-scale neural network and to communicate and apply synaptic updates in real-time must be analysed, as the Bandwidth Hypothesis suggests that the scale of neural network that can be handled by a neural computation system in real-time is bounded by inter-device communication bandwidth and memory bandwidth. We begin by analysing the high-level requirements of massively parallel neural computation in the next chapter and then develop these requirements into system resource requirements in Chapter 4.

Chapter 3 High-level requirements

3.1 Introduction

We have chosen to use the Izhikevich spiking neuron model (2003) to perform massively parallel neural computation. While it is a “black-box” model and its equations bear little direct relation to the internal processes of biological neurons, the model is capable of reproducing the spiking patterns of a wide variety of types of biological neuron (Izhikevich, 2004). It also has an attractive combination of biological plausibility and computational complexity compared to other models such as integrate and fire, as discussed in Section 2.3.2.

In its original form the model is unsuitable for implementation in a time-multiplexed, digital system as its equations are designed to operate in continuous-time, and so they must be converted into discrete-time alternatives. Further conversion can be applied to replace floating-point arithmetic with fixed-point, to pre-compute constants and to eliminate terms relating to the length of the sampling interval. This yields much simpler versions of the model's equations which are suited to massively parallel neural computation.

The communication properties of biological neural networks are examined to form some initial bounds on the character and number of synaptic updates that a massively parallel neural computation system will need to handle, which will allow the resource requirements of massively parallel neural computation to be analysed in Chapter 4 and help to evaluate the hypotheses proposed in Chapter 1.

3.2 The Izhikevich spiking neuron model

Izhikevich's spiking neuron model (2003) uses a pair of differential equations to model the behaviour of a single neuron. Equation 3.1a calculates the membrane potential of the neuron and Equation 3.1b the refractory voltage. These equations have two variables (v and u) and two constants (a and b). An additional variable (I) represents the sum of all the synaptic updates targeted at the neuron. An action potential (‘spike’) is produced if v ≥ 30 mV, in which case v and u are reset by Equation 3.1c and Equation 3.1d respectively. This requires two more equations with two constants (c and d).

$$\frac{dv}{dt} = 0.04v^2 + 5v + 140 - u + I \tag{3.1a}$$
$$\frac{du}{dt} = a(bv - u) \tag{3.1b}$$

If v ≥ 30 mV after $\frac{dv}{dt}$ and $\frac{du}{dt}$ are evaluated then

$$v = c \tag{3.1c}$$
$$u = u + d \tag{3.1d}$$

The model appears to be capable of reproducing the behaviour of a range of types of biological neuron by altering parameters a to d. Izhikevich provides several examples (2003).

3.2.1 Conversion to discrete-time

Equation 3.1 operates in continuous-time, which makes it unsuitable for evaluation using a digital system, particularly if we wish to increase the scale of a neural computation by time-multiplexing resources. Thankfully it is straightforward to convert Equation 3.1 to operate in discrete-time. The simplest method is to use Euler integration.
Given equations of the form in Equation 3.2, they can be converted to discrete-time equations, sampled at intervals of length δt, by using Equation 3.3.

$$\frac{dy}{dt} = f(x, y), \qquad y(x_0) = y_0 \tag{3.2}$$
$$x_{n+1} = x_n + \delta t, \qquad y_{n+1} = y_n + \delta t \cdot f(x_n, y_n) \tag{3.3}$$

Applying this method to Equation 3.1 yields the discrete-time version of the Izhikevich neuron model in Equation 3.4.

$$v_{n+1} = v_n + \delta t\,\bigl(0.04(v_n)^2 + 5v_n + 140 - u_n + I\bigr)$$
$$u_{n+1} = u_n + \delta t\,a(bv_n - u_n)$$

If $v_{n+1} \geq 30\,\text{mV}$ after $v_{n+1}$ and $u_{n+1}$ are evaluated then

$$v_{n+1} = c, \qquad u_{n+1} = u_{n+1} + d \tag{3.4}$$

3.2.2 Conversion to fixed-point

Equation 3.4 operates in discrete-time, but still relies on floating-point arithmetic, which is computationally expensive, particularly when aiming for a massively parallel system. By adapting the work of Jin et al. (2008), we can convert Equation 3.4 to use fixed-point arithmetic, which can be evaluated using comparatively simple hardware, particularly if multiplication or division can be approximated to a power of 2, allowing simple bit shifts to be used in their place.

I will extend the work of Jin et al. by providing additional intermediate steps to preserve δt (they used the simplifying assumption δt = 1), and to assist in understanding the process involved in transforming the equations.

The first step is to transform Equation 3.4 into Equation 3.5 by factoring out v_n and multiplying out a. The components of the equation that are applied if v ≥ 30 mV will be omitted for now as they do not require transformation.

$$v_{n+1} = v_n + \delta t\,\bigl[v_n(0.04v_n + 5) + 140 - u_n + I\bigr]$$
$$u_{n+1} = u_n + \delta t\,(ab\,v_n - a\,u_n) \tag{3.5}$$

The floating-point variables and parameters can now be converted to fixed-point by applying scaling factors according to the mapping in Table 3.1. Additional multiplication / division by these scaling factors is used to keep the equations balanced. The result is Equation 3.6.

Table 3.1: Mapping of variables / parameters to scaling factors (Jin et al., 2008)
  Variable / Parameter    Scaling Factor
  v                       p1
  u                       p1
  a                       p2
  b                       p2
  c                       p1
  d                       p1

Table 3.2: Mapping used to simplify Equation 3.6 (Jin et al., 2008)
  Original Value    Replacement
  v p1              V
  u p1              U
  ab p2             A
  -a p2             B
  c p1              C
  d p1              D

$$v_{n+1}p_1 = v_np_1 + \delta t\Bigl[v_np_1\bigl((0.04p_2\,v_np_1)/p_2 + 5p_1\bigr)/p_1 + 140p_1 - u_np_1 + Ip_1\Bigr]$$
$$u_{n+1}p_1 = u_np_1 + \delta t\Bigl([(abp_2)\,v_np_1]/p_2 + [(-ap_2)\,u_np_1]/p_2\Bigr)$$

If $v_{n+1} \geq 30\,\text{mV}$ after $v_{n+1}p_1$ and $u_{n+1}p_1$ are evaluated then

$$v_{n+1}p_1 = cp_1, \qquad u_{n+1}p_1 = u_{n+1}p_1 + dp_1 \tag{3.6}$$

Notice that the majority of the scaling factors are applied directly to variables or parameters, with just a few additional scaling factors needed to keep the equations balanced. Equation 3.6 can be simplified by storing parameters in memory with the relevant scaling factors already applied. Using the mapping in Table 3.2, this results in the simplified Equation 3.7.

$$V_{n+1} = V_n + \delta t\Bigl[V_n\bigl((0.04p_2\,V_n)/p_2 + 5p_1\bigr)/p_1 + 140p_1 - U_n + Ip_1\Bigr]$$
$$U_{n+1} = U_n + \delta t\,(AV_n + BU_n)/p_2$$

If $V_{n+1} \geq 30\,\text{mV}$ after $V_{n+1}$ and $U_{n+1}$ are evaluated then

$$V_{n+1} = C, \qquad U_{n+1} = U_{n+1} + D \tag{3.7}$$

Table 3.3: Scaling factors selected by Jin et al. (2008)
  Scaling Factor    Value
  p1                256
  p2                65536

Table 3.4: Mapping of scaling factors to bit shifts (Jin et al., 2008)
  Operation      Equivalent Shift
  Multiply p1    << 8
  Multiply p2    << 16
  Divide p1      >> 8
  Divide p2      >> 16

The accuracy and ease of evaluation of Equation 3.7 depends on the values of the scaling factors p1 and p2. After experimentation, Jin et al. selected the values in Table 3.3.
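As a concrete illustration of the mapping in Tables 3.1 to 3.3, the following Python sketch converts one neuron's floating-point parameters to the stored fixed-point form. It is illustrative only and is not part of the design flow described here; the example values are the Table 3.5 parameters used later in this chapter.

```python
# Illustrative sketch: scale floating-point Izhikevich parameters to the stored
# fixed-point values of Table 3.2, assuming p1 = 256 and p2 = 65536 (Table 3.3).
P1 = 256      # scaling factor for v, u, c, d
P2 = 65536    # scaling factor for a, b

def scale_parameters(a, b, c, d, v0, u0):
    """Return the stored fixed-point values V, U, A, B, C, D."""
    return {
        "V": round(v0 * P1),        # membrane potential
        "U": round(u0 * P1),        # recovery variable
        "A": round(a * b * P2),     # A = ab * p2
        "B": round(-a * P2),        # B = -a * p2
        "C": round(c * P1),         # reset potential
        "D": round(d * P1),         # reset increment
    }

# Example: the parameters later listed in Table 3.5
# (a = 0.02, b = 0.2, c = -50, d = 2, v0 = -70, u0 = -14).
print(scale_parameters(0.02, 0.2, -50.0, 2.0, v0=-70.0, u0=-14.0))
```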
Since these are powers of 2, simple bit shifts can be used in place of comparatively expensive multiplication and division to apply these scaling factors, resulting in the equivalent bit shifts in Table 3.4. After applying these transformations (and with the values of p1 and p2 fixed as in Table 3.3), Equation 3.7 becomes Equation 3.8.

$$V_{n+1} = V_n + \delta t\Bigl[\bigl(V_n(((0.04 \ll 16)\,V_n) \gg 16 + 5p_1)\bigr) \gg 8 + (140 \ll 8) - U_n + (I \ll 8)\Bigr]$$
$$U_{n+1} = U_n + \delta t\,(AV_n + BU_n) \gg 16 \tag{3.8}$$

Finally the shifted constants in Equation 3.8 can be simplified, resulting in Equation 3.9.

$$V_{n+1} = V_n + \delta t\Bigl[\bigl(V_n((2621\,V_n) \gg 16 + 1280)\bigr) \gg 8 + 35840 - U_n + (I \ll 8)\Bigr]$$
$$U_{n+1} = U_n + \delta t\,(AV_n + BU_n) \gg 16$$

If $V_{n+1} \geq 30\,\text{mV}$ after $V_{n+1}$ and $U_{n+1}$ are evaluated then

$$V_{n+1} = C, \qquad U_{n+1} = U_{n+1} + D \tag{3.9}$$

Note that while it might appear that the calculation of $U_{n+1}$ in Equation 3.9 can be simplified by removing the shift operation and reverting to using parameters ab and −a rather than A and B, both of the original groups of parameters need to be shifted to allow them to be stored using a fixed-point representation.

3.2.3 Sampling interval size

In order to use Equation 3.9 to perform neural computation, a value must be chosen for the sampling interval size (δt). If a neural computation system uses time-multiplexed resources then the sampling interval size has a direct effect on the number of neurons in a real-time computation. Given the same resources, using a smaller sampling interval size will result in a more accurate computation of fewer neurons, while a larger sampling interval size will result in a potentially less accurate computation of more neurons.

In addition, there is potential for the Euler integration method to become unstable and produce totally inaccurate results if the sampling interval size is too large. Particularly when used with coupled differential equations such as those in the Izhikevich neuron model, it is possible for the results given by Equation 3.4 to be radically different from those that would be given if Equation 3.1 were evaluated in continuous-time. The fixed-point arithmetic used in Equation 3.6 introduces further inaccuracies.

Table 3.5: Parameters used to illustrate instability produced by an overly-large sampling interval size. Note that these parameters correspond to those in Equation 3.1, and so must be transformed as described in Table 3.2 before being used in Equation 3.4.
  Parameter    Value
  v0           -70
  u0           -14
  a            0.02
  b            0.2
  c            -50
  d            2
  I_n          15 for n ≥ 22

[Figure 3.1: Result of evaluating Equation 3.4 with the parameters in Table 3.5 and δt = 1.0 ms.]
[Figure 3.2: Result of evaluating Equation 3.4 with the parameters in Table 3.5 and δt = 0.5 ms.]

As an example, consider the effects of evaluating Equation 3.4 with the parameters in Table 3.5. Figure 3.1 shows the results with δt = 1.0 ms, while Figure 3.2 shows the result of evaluating the same equation with δt = 0.5 ms. These graphs were produced using NEURON (Hines and Carnevale, 1997). Figure 3.1 is clearly radically different from Figure 3.2; in particular there are several peaks in Figure 3.1 which correspond to extended flat areas in Figure 3.2.

This simple experiment illustrates that caution is needed when selecting a suitable value for δt.
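The experiment can be reproduced approximately with a few lines of Python. The sketch below is illustrative only (it uses plain floating-point Python rather than NEURON), and it assumes that the input current in Table 3.5 switches on at t = 22 ms.

```python
# Illustrative sketch: Euler integration of Equation 3.4 in floating point
# with the Table 3.5 parameters, for two sampling interval sizes.
def simulate(dt_ms, t_end_ms=100.0, a=0.02, b=0.2, c=-50.0, d=2.0):
    v, u = -70.0, -14.0
    spikes, trace = 0, []
    for n in range(int(t_end_ms / dt_ms)):
        t = n * dt_ms
        i_in = 15.0 if t >= 22.0 else 0.0          # I = 15 from 22 ms onwards (Table 3.5)
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + i_in
        du = a * (b * v - u)
        v, u = v + dt_ms * dv, u + dt_ms * du      # update from old values, as in Equation 3.4
        if v >= 30.0:                              # spike: apply the reset equations
            v, u = c, u + d
            spikes += 1
        trace.append((t, v))
    return spikes, trace

print("spikes with dt = 1.0 ms:", simulate(1.0)[0])   # qualitatively like Figure 3.1
print("spikes with dt = 0.5 ms:", simulate(0.5)[0])   # qualitatively like Figure 3.2
```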
Further experiments with Equation 3.6 using both NEURON (us- ing Equation 3.4) and a simple spreadsheet model in Microsoft Excel (using Equa- tion 3.10) appear to indicate that increasing δt leads to a sudden, sharp transition from results similar (at least in the timing of peaks) to Figure 3.2 to results similar to Figure 3.1. It is likely that the exact value of δt when this transition occurs is dependent on the parameters of Equation 3.4. CHAPTER 3. HIGH-LEVEL REQUIREMENTS 49 Therefore there is a tradeoff between the number of neurons in a real-time compu- tation and the accuracy of Equation 3.4 for a given value of δt. It is possible that there might be a value of δt that is safe regardless of the parameters used. This will not be investigated further – instead the value of δt will be fixed and the opera- tion of the neural computation system will be validated using benchmark neural networks. 3.2.4 Final simplification If, as suggested by Jin et al., δt = 1 ms, then a final simplification of Equation 3.9 can be made to produce Equation 3.10. Vn+1 = (Vn((2621Vn) >> 16+ 1536)) >> 8+ 35840−Un + (I << 8) Un+1 = Un + (AVn + BUn) >> 16 If Vn+1 ≥ 30mV after Vn+1 and Un+1 are evaluated then Vn+1 = C Un+1 = Un+1 + D (3.10) 3.2.5 Numerical precision The variables and parameters of Equation 3.10 could be stored using either floating- point or fixed-point precision. Using fixed-point precision reduces hardware com- plexity compared to floating-point and can also reduce the data storage (and hence memory bandwidth) requirements of processing the equation parameters. IEEE single precision floating-point requires 32 bits per value while IEEE double preci- sion requires 64 bits. In comparison fixed-point precision can use any number of bits, though there will be a loss of accuracy compared to floating-point if fewer bits are used, manifesting itself as rounding errors. Since reducing the memory size and bandwidth requirements of modelling each neuron increases the number of neurons that can be modelled per memory chan- nel, 16 bit fixed-point precision will be used for the variables and parameters of Equation 3.10 (in common with Jin et al.), with 32 bit precision used during calcu- lations to prevent significant loss of accuracy. With 2 variables and 4 parameters this means that 6 × 16 = 96 bits are needed for each neuron. The loss of accur- acy relative to floating-point will manifest itself largely in slightly differing spike 50 3.3. COMMUNICATION PROPERTIES OF NEURAL NETWORKS times owing to accumulation of rounding errors. In practice it is likely that this can be mitigated by altering the parameters of Equation 3.10 to bring the behaviour of each neuron in line with what would be expected if floating-point precision were used. 3.3 Communication properties of neural networks Along with analysing the neuron model, the communication properties of biolo- gical neural networks must be analysed. Analysing neural communication prop- erties will facilitate design and evaluation of an algorithm to communicate and apply synaptic updates, as some algorithms have varying workload depending on the volume of synaptic updates. Other algorithms dedicate resources to all possible synaptic connections regardless of whether they are active or not, and hence do not experience additional workload in proportion to the volume of synaptic updates, but these algorithms do not scale well as network size and fan-out increase. 
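Before examining these communication properties in detail, the fixed-point update arrived at in Equation 3.10 can be summarised as a short sketch. This is an illustrative Python rendering, not the FPGA pipeline itself; it uses Python integers to stand in for the 16-bit stored values and 32-bit intermediates of Section 3.2.5, and it assumes the 30 mV spike test is applied in the p1-scaled domain.

```python
# Illustrative sketch of the Equation 3.10 update (delta t = 1 ms).
P1, P2 = 256, 65536
V_THRESHOLD = 30 * P1            # assumed: 30 mV expressed in the p1-scaled domain

def step(V, U, I, A, B, C, D):
    """One 1 ms fixed-point update of a single neuron. Returns (V, U, spiked)."""
    V_next = ((V * (((2621 * V) >> 16) + 1536)) >> 8) + 35840 - U + (I << 8)
    U_next = U + ((A * V + B * U) >> 16)
    if V_next >= V_THRESHOLD:    # spike: apply the reset equations
        return C, U_next + D, True
    return V_next, U_next, False

# Example run with the Table 3.5 parameters, scaled as in Tables 3.2 and 3.3.
V, U = -70 * P1, -14 * P1
A, B = round(0.02 * 0.2 * P2), round(-0.02 * P2)   # A = ab*p2, B = -a*p2
C, D = -50 * P1, 2 * P1
spike_times = []
for t in range(200):                               # 200 sampling intervals = 200 ms
    V, U, spiked = step(V, U, I=15, A=A, B=B, C=C, D=D)
    if spiked:
        spike_times.append(t)
print(spike_times)
```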
The volume of communication between neurons depends on their: • Fan-out • Locality • Spike frequency • Synaptic delays Each of these properties of biological neural networks will be analysed to consider their effect on the volume of synaptic updates that must be communicated and ap- plied and hence methods used to achieve this in a massively parallel neural com- putation system. 3.3.1 Fan-out As a rough guide, we will assume that each neuron will send synaptic updates to (have a fan-out of) about 103 other neurons. This assumption is based on the hu- man brain containing approximately 109 neurons, with 1012 synaptic connections. Likewise each neuron will be able to receive synaptic updates from about 103 other neurons. However, this is only an assumption regarding the average case, and in principle individual neurons can exhibit significantly greater fan-out. CHAPTER 3. HIGH-LEVEL REQUIREMENTS 51 3.3.2 Locality Bassett et al. (2010) show that interconnect in mammalian brains can be analysed using a variant of Rent’s rule which is often used to analyse communication re- quirements in VLSI chips. They find that communication in biological neural net- works exhibits a significant degree of locality and modularity, with neurons which are close in physical space forming densely connected sets. Larger sets of neurons which perform more complex functions are then formed from smaller sets, with re- latively few additional connections. Hence the majority of communication between neurons covers short physical distances, with there being some medium distance communication, and less still over longer distances. However, it is still possible that any neuron could form a synaptic connection with any other neuron, breaking these principles of modularity and locality. This suggests that it would be efficient to place neurons which are close together in physical space close together in a massively parallel neural computation system too. We should plan for most communication to be relatively local, but must still allow for the possibility that some synaptic updates might need to be sent between neurons which are further apart in both physical space and in a computation sys- tem. 3.3.3 Spike frequency Mead (1990) suggests that biological neurons can produce spikes at a rate of just a ‘few hertz’. We will approximate this to 10Hz, although clearly there will be variation between individual biological neurons. 3.3.4 Synaptic delays To model synaptic delays, we need to consider both their absolute magnitude (to obtain an upper bound) and the difference in their magnitude that appears to have an observable effect on the behaviour of target neurons (so that the number of dis- tinct delay magnitudes can be determined). In common with Jin et al. (2008) we will assume that synaptic delays can range from 1 ms to 16 ms in 1 ms increments. 52 3.4. CONCLUSION 3.4 Conclusion A variant of the Izhikevich spiking neuron model that is suitable for use in a massively parallel neural computation system has been developed and the com- munication properties of biological neural networks have been analysed. We will assume that biological neurons have a mean spike frequency of 10 Hz and a mean fan-out of 103. While individual neurons will deviate from these means, they can be used to estimate the total number of spikes per 1 ms sampling interval and hence the number of neurons that will receive synaptic updates every 1 ms. 
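These means translate directly into per-interval volumes. The short sketch below is a back-of-envelope check only; the figure of 10^5 neurons per device anticipates the example used in Chapter 4 and is an assumption here.

```python
# Back-of-envelope estimate of per-interval activity from the Section 3.3 assumptions.
NEURONS_PER_DEVICE = 10**5     # assumed device population (Chapter 4 example)
SPIKE_RATE_HZ = 10             # mean spike frequency (Section 3.3.3)
FAN_OUT = 10**3                # mean fan-out (Section 3.3.1)
INTERVAL_S = 1e-3              # 1 ms sampling interval

spikes_per_interval = NEURONS_PER_DEVICE * SPIKE_RATE_HZ * INTERVAL_S
updates_per_interval = spikes_per_interval * FAN_OUT

print(f"expected spikes per 1 ms interval:           {spikes_per_interval:,.0f}")    # 1,000
print(f"expected synaptic updates per 1 ms interval: {updates_per_interval:,.0f}")   # 1,000,000
```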
The next chapter will explore this further, and consider the communication, memory and resource requirements of massively parallel neural computation system. The benchmark neural network used to evaluate the massively parallel neural com- putation system developed in this work will build on these assumptions rather than using uniform, random spike activity, as these assumptions are biologically plausible. Many existing neural computation systems have been evaluated using benchmarks which use uniform, random spike activity, often with the mean fan- out of a neuron being proportional to network size (e.g. Thomas and Luk, 2009). Given the discussion of locality in biological neural networks in Section 3.3.2, it would appear that such benchmarks are inappropriate for evaluating a massively parallel neural computation system. Chapter4 System requirements 53 54 4.1. INTRODUCTION 4.1 Introduction The last chapter analysed the communication properties of biological neural net- works and the Izhikevich spiking neuron model. This chapter builds on that ana- lysis to derive some bounds on the communication, memory and resource require- ments of real-time, massively parallel neural computation, and hence provide justi- fication for the Bandwidth Hypothesis. The requirements of modelling 105 neurons per device with mean fan-out of 103 and mean spike frequency of 10 Hz are used as an example. After analysing these requirements, an appropriate implementation technology can be selected. An implementation based on a network of Field Programmable Gate Arrays (FPGAs) is most appropriate for a number of reasons, particularly the availability of high-speed communication and memory interfaces in recent FP- GAs. However, using FPGAs does bring some limitations compared to an Applic- ation Specific Integrated Circuit (ASIC) implementation, particularly limited on- chip memory, and so a neural computation system implemented on a multi-FPGA platform must take account of this. 4.2 Neural computation tasks To model the behaviour of a complete neural network, a massively parallel neural computation system needs to: 1. Provide the variables and parameters of the neuron modelling equation (Equation 3.10) for every neuron every sampling interval 2. Communicate synaptic updates to target neurons when a neuron spikes 3. Delay synaptic updates 4. Sum synaptic updates to produce an I-value for every neuron every sampling interval, which is used in the neuron modelling equation The last two tasks are perhaps the most subtle of the Izhikevich neural model, as they are not part its neuron modelling equation, or its derivatives. Any implement- ation of the model will have to delay and sum synaptic updates, as well as eval- uating the equation for each neuron. Communicating synaptic updates between neurons also require thought, particularly if massively parallel neural computa- tion is performed using a scaleable system consisting of multiple communicating devices as suggested by the Scalability Hypothesis. CHAPTER 4. REQUIREMENTS 55 We will now analyse the volume of activity for each of these tasks, which leads to analysis of their memory and communication requirements and hence to choice of an appropriate implementation technology for a massively parallel neural compu- tation system. 4.3 Volume of activity Neural computation can be broken down into four distinct but linked tasks: 1. Evaluating the neuron modelling equation for each neuron 2. Fetching and communicating synaptic updates 3. Delaying synaptic updates 4. 
Applying synaptic updates Considering the volume of activity for each of these tasks provides a basis for ana- lysis of the memory and communication requirements of massively parallel neural computation, which leads to justification of the Bandwidth Hypothesis. 4.3.1 Evaluating neuron modelling equations The neuron modelling equations will need to be evaluated for every neuron at the desired sampling interval. Chapter 3 defined this as evaluating Equation 3.10 for every neuron every 1 ms. This activity is present throughout the life of a compu- tation and its volume does not change in response to any factor. In particular the workload of evaluating the neuron modelling equations does not depend on the volume of spike activity. This means that the workload of evaluating the neuron modelling equations provides an absolute lower bound on the resources required for a neural computa- tion to operate in real-time. The only circumstances that might slow the evaluation of neuron modelling equations would be resource contention, for example if the volume of spikes were sufficiently high to cause contention for resources such as off-chip memory between processes evaluating neuron modelling equations and processes communicating and applying synaptic updates. 56 4.3. VOLUME OF ACTIVITY 4.3.2 Synaptic updates The workload involved in fetching and communicating synaptic updates depends on the volume of these updates that need to be processed. With a mean spike fre- quency of 10 Hz (Section 3.3.3) and 105 neurons per device, there will be 10× 105 = 106 spike events per second per device. With a mean fan-out of 103 (Section 3.3.1) 106 × 103 = 109 synaptic updates will need to be fetched and communicated by every device every second. 4.3.3 Delaying synaptic updates Based on the assumptions in Section 4.3.2, each device will receive a mean of 109/103 = 106 synaptic updates every 1 ms. These updates will have varying delays, which could have a clear distribution or none at all. It is important to ana- lyse the potential distribution of updates between delays as this has an effect on the total number of updates that are being delayed at any one time. If this total exceeds the capacity provided by the system then either it will deadlock or some updates will have to be discarded, both of which would invalidate a computation. This is in contrast to fetching the parameters of the neuron modelling equations or synaptic updates, where trying to fetch more parameters than there is available bandwidth for would cause a stall and mean that a computation was no longer in real-time, but would not necessarily invalidate its results. Sample delay distributions We will explore how the volume of synaptic updates that need to be delayed de- pends on the distribution of delays in incoming synaptic updates. This will allow us to determine the memory requirements of delaying synaptic updates in Sec- tion 4.4.3. Five sample delay distributions are used, with the first four being shown in Table 4.1 on the facing page. Each delay distribution divides the mean 106 syn- aptic updates that arrive every 1 ms between delay sizes from 1 ms to 16 ms. We also consider a pathological case where the delay size for all 106 updates that ar- rive every 1 ms changes through time between the maximum and minimum so that 16× 106 synaptic updates finish being delayed at the same time. This distribution is shown in Figure 4.1 on page 58. CHAPTER 4. 
Table 4.1: Distribution of synaptic updates between delay sizes for the sample distributions (updates per 1 ms, ×10^6)
  Delay / ms    Uniform    All min    All mid    All max
  1             1/16       1          0          0
  2 to 7        1/16       0          0          0
  8             1/16       0          1          0
  9 to 15       1/16       0          0          0
  16            1/16       0          0          1

Table 4.2: Maximum updates with the same delay completion time for each delay distribution (×10^6)
  Distribution     Maximum updates with same delay completion time
  Uniform          1
  Minimum          1
  Median           1
  Maximum          1
  Pathological     16

We will use these sample delay distributions to determine the total number of synaptic updates in the process of being delayed at any one time and also the maximum number of updates that ever have their delays completing at the same time, which will be important when the memory requirements of delaying synaptic updates are examined in Section 4.4.3.

Figure 4.2 shows how the number of synaptic updates in the process of being delayed varies over time for each of the delay distributions in Table 4.1, and for the pathological delay distribution in Figure 4.1. These figures show that the maximum number of updates in the process of being delayed occurs when all updates need to be delayed for the maximum possible delay. With a maximum delay of 16 ms, up to 17 × 10^6 updates can be in the process of being delayed at any time if 10^6 updates are received every 1 ms. This corresponds to 10^6 updates for each possible delay size and a further 10^6 for a delay of 0, representing updates that are currently being applied to their target neurons. This maximum is also periodically exhibited by the pathological delay distribution.

[Figure 4.1: Distribution of synaptic update delay size over time that creates a pathological number of updates with the same delay completion time.]

Table 4.2 shows the maximum number of updates with the same delay completion time for each delay distribution. The effect of the pathological, time-dependent delay distribution is apparent here, as it is possible for up to 16 × 10^6 updates to have their delays complete at the same time. Therefore the maximum number of updates that complete their delay at the same time must be considered alongside the total number of updates when designing a system to delay synaptic updates, as this affects the bandwidth needed at the output of the delay buffers.

[Figure 4.2: Total updates in the process of being delayed over time for each delay distribution. The delay distributions are described in Table 4.1 and Figure 4.1.]

This model predicts that there are a maximum of u × (1 + s) synaptic updates in the process of being delayed at any one time, where u is the number of synaptic updates received every 1 ms and s is the number of delay sizes. The total synaptic updates that are in the process of being delayed at any one time can be spread between delay completion times in various ways, with it being possible for up to u × s updates to have the same delay completion time.
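This occupancy model can be stated directly in code. The sketch below is an illustration of the model only, not an implementation of the delay buffers; it steps a set of per-delay buckets forward in time for a given arrival pattern and reproduces the u × (1 + s) and u × s bounds.

```python
# Illustrative occupancy model for delayed synaptic updates (delays in whole ms).
def simulate(delay_for_interval, arrivals=1_000_000, sim_ms=64, max_delay=16):
    """delay_for_interval(t) -> {delay_ms: fraction of this interval's arrivals}.

    Returns (peak updates in flight, peak updates completing in one interval)."""
    waiting = [0] * (max_delay + 1)     # waiting[d]: updates with d intervals left; d=0 is being applied
    peak_in_flight = peak_completing = 0
    for t in range(sim_ms):
        for d, frac in delay_for_interval(t).items():
            waiting[d] += arrivals * frac
        peak_in_flight = max(peak_in_flight, sum(waiting))
        peak_completing = max(peak_completing, waiting[1])
        waiting = [waiting[1]] + waiting[2:] + [0]   # advance one sampling interval
    return peak_in_flight, peak_completing

uniform      = lambda t: {d: 1 / 16 for d in range(1, 17)}
all_max      = lambda t: {16: 1.0}
pathological = lambda t: {16 - (t % 16): 1.0}        # delay ramps so 16 batches finish together

for name, dist in [("uniform", uniform), ("all max", all_max), ("pathological", pathological)]:
    total, completing = simulate(dist)
    print(f"{name:12s} peak in flight = {total:>12,.0f}   peak completing = {completing:>12,.0f}")
```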
4.3.4 Applying synaptic updates

The volume of synaptic updates that need to be applied by a device every 1 ms will depend on the volume that finishes being delayed. As shown in Section 4.3.3, this could be up to 16 × 10^6, depending on the distribution of synaptic updates between delay sizes. Any of these synaptic updates could target any neuron on the device.

Table 4.3: Volume of operations performed per second for each task of neural computation
  Operation                                        Generic                     10^5 neurons
  Neuron equation evaluations                      E = N × S                   10^8
  Synaptic updates to fetch and communicate        U = N × H × F               10^9
  Synaptic updates arriving to be delayed          D = N × H × F               10^9
  Maximum synaptic updates:
    being delayed at one time                      M_D = (D/S) × (1 + B)       17 × 10^6
    completing their delays at the same time       M_C = (D/S) × B             16 × 10^6
    to apply                                       M_A = M_C × S = D × B       16 × 10^9

Table 4.4: Variables and example values used in Table 4.3
  Variable    Meaning                              Example value
  N           Number of neurons per device         10^5
  S           Number of time steps per second      10^3
  H           Mean spike frequency (Hz)            10
  F           Mean fan-out                         10^3
  B           Maximum delay length (ms)            16

4.3.5 Summary

A summary of the volumes of each task of neural computation that need to be performed is shown in Table 4.3. This is shown in both a generic form and as an example for 10^5 neurons per device. The variables and their example values are defined in Table 4.4.

These calculations assume that there is no significant variation in mean spike frequency or mean fan-out. In reality there will be local variations in any biologically plausible neural network, which could cause some devices in a system performing a computation of a large network to be more heavily loaded than others. It is assumed that fan-out is balanced between devices, so for example if some fan-out is to other devices then an equal amount of fan-out will come from those devices to this device. If this were not the case then it would theoretically be possible for many devices to have fan-outs that all targeted a single device, with the number of synaptic updates being delayed multiplied by the number of source devices. This would massively increase the workload of communicating and applying synaptic updates on the target device, and is unlikely to be biologically plausible.

4.4 Memory requirements

We now consider the memory requirements of each task of neural computation, using the volumes of activity identified in Section 4.3. The variables in the equations that follow are identified in Table 4.4.

4.4.1 Evaluating equations

While the overall structure of the neuron modelling equation remains the same for each neuron being modelled, each neuron can have different behaviour defined by different equation parameters. The complete set of variables and parameters will need to be available once per sampling interval. The data size and bandwidth needed to achieve this depends on the data size of each set of neural modelling equation parameters.

Section 3.2.5 found that a set of neural modelling equation parameters can be stored in 96 bits. Rounding this up to 128 bits or 16 bytes allows for some extra data to be stored to simplify communication and application of synaptic updates (see Section 6.3.1), and allows a better fit with memory systems that deal with word widths that are a power of 2. Therefore the total set of neural modelling equation parameters takes 128 × N bits, which for the example of 10^5 neurons per device is 1.6 MB.
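As an illustration of the 128-bit record, the sketch below packs one neuron's state into 16 bytes and confirms the storage figure. The layout of the 32 spare bits is hypothetical (the real layout is defined in Section 6.3.1), and the example fixed-point values match the scaling sketch in Chapter 3.

```python
import struct

# Illustrative 128-bit neuron record: six signed 16-bit values (V, U, A, B, C, D)
# plus two hypothetical 16-bit spare fields reserved for routing-related data.
NEURON_RECORD = struct.Struct("<6h2H")

def pack_neuron(V, U, A, B, C, D, spare0=0, spare1=0):
    """Pack one neuron's variables and parameters into a 16-byte record."""
    return NEURON_RECORD.pack(V, U, A, B, C, D, spare0, spare1)

record = pack_neuron(V=-70 * 256, U=-14 * 256, A=262, B=-1311, C=-50 * 256, D=2 * 256)
assert len(record) == 16                                  # 128 bits per neuron

neurons = 10**5                                           # example population per device
print(f"total parameter storage: {neurons * len(record) / 1e6:.1f} MB")                # 1.6 MB
print(f"bandwidth at 1 ms intervals: {neurons * len(record) * 1000 / 1e9:.1f} GB/s")   # 1.6 GB/s
```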
The total bandwidth required to access these parameters is 128 × N × S bits per second, which is 1.6 GB/s for the example. Since all of the parameters for all neurons will need to be accessed once per sampling interval, from start to end with no repeated accesses, there is no temporal locality but perfect spatial locality. This needs to be considered when specifying an appropriate memory hierarchy to store and access these parameters.

4.4.2 Synaptic updates

To determine the data size and bandwidth needed to communicate and apply synaptic updates, we need to start by considering what items of data need to be stored for each synaptic update and how large each item needs to be. The items that are required, exactly where each item is stored and how much data is needed will vary depending on the implementation of the synaptic update communication and application algorithm.

However we will start by assuming a simple, worst-case algorithm that stores the target neuron, delay and weight for each synaptic update separately. The data size needed for each item can be determined by examining their expected maximum and minimum magnitudes. These and the resulting number of bits required are shown in Table 4.5. Every potential target neuron is expected to have a unique identifier.

Table 4.5: Ranges and bit sizes of each item in a synaptic update
  Field     Minimum    Maximum        Bits
  Target    0          4294967295     32
  Delay     1          16             4
  Weight    -2048      2047           12
  Total                               48

These data sizes allow for a computation of up to 4.2 billion neurons, with synaptic connections having positive and negative weights and delay sizes that match previous work (Jin et al., 2008). In reality the synaptic update communication and application algorithm implemented in Chapter 6 finds each target neuron via two layers of indirection, allowing for computations of much larger numbers of neurons, but these figures are sufficient to illustrate the general data size and bandwidth requirements.

With 48 bits = 6 bytes of data per synaptic update, 6 × N × F bytes will be needed for the complete set of synaptic updates. For the example this is 6 × 10^8 bytes or 0.6 GB. This is almost three orders of magnitude more than is needed for the neural modelling equation parameters. The number of synaptic updates that need to be fetched from memory per second is U = N × H × F = 10^5 × 10 × 10^3 = 10^9. At 6 bytes per synaptic update this will require 6 × 10^9 B/s = 6 GB/s of bandwidth. This makes it clear that the scale of neural network that can be handled by a neural computation system in real-time is bounded by inter-device communication bandwidth and memory bandwidth, as suggested by the Bandwidth Hypothesis.

4.4.3 Delaying synaptic updates

Synaptic updates need to be held in some form of buffer to delay them. Since the delay size data for each incoming synaptic update can be discarded once the update has been buffered, and assuming that a buffered update must target a neuron on the local device (see Section 6.3.3), each buffered update will consist of a local neuron identifier and a weight. If local neuron identifiers are allowed to range up to 10^5 (matching the example number of neurons on a device) then 17 + 12 = 29 bits will need to be buffered per update. We will round this up to 32 bits or 4 bytes.
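The two encodings can be pictured with a short bit-packing sketch. It is illustrative only; the exact field positions are hypothetical, and only the widths come from Table 4.5 and the discussion above.

```python
# Illustrative bit-packing for synaptic updates (field positions are hypothetical).

def pack_full_update(target, delay, weight):
    """48-bit update: 32-bit global target, 4-bit delay (1-16), 12-bit signed weight."""
    assert 0 <= target < 2**32 and 1 <= delay <= 16 and -2048 <= weight <= 2047
    return (target << 16) | ((delay - 1) << 12) | (weight & 0xFFF)

def pack_buffered_update(local_target, weight):
    """32-bit buffered update: 17-bit local neuron id, 12-bit signed weight, 3 bits spare."""
    assert 0 <= local_target < 2**17 and -2048 <= weight <= 2047
    return (local_target << 12) | (weight & 0xFFF)

def unpack_buffered_update(word):
    local_target, weight = word >> 12, word & 0xFFF
    if weight & 0x800:                     # sign-extend the 12-bit weight
        weight -= 0x1000
    return local_target, weight

w = pack_buffered_update(local_target=90_210, weight=-37)
assert unpack_buffered_update(w) == (90_210, -37)
```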
REQUIREMENTS 63 If each of these updates were buffered individually than the maximum total size of the buffer (regardless of whether it is subdivided into separate buffers for each delay size or not) would be 4× 16× 106 = 64 MB, plus another 4 MB for the buf- fer currently being emptied. It is assumed that data will not be copied between buffers, rather they will be used to represent delays in a circular fashion (see Sec- tion 7.2.3). The bandwidth required to this buffer depends on the volume of updates that need to be delayed and the distribution of their delay sizes, which was discussed in Section 4.3.3. With the pathological delay size distribution from Section 4.3.3 the maximum bandwidth required will be 4× 106 × 103 = 4 GB/s for input and 4× 16× 106 × 103 = 64 GB/s for output. In the more balanced cases exhibited by the other delay distributions, a maximum of 4 GB/s of bandwidth will be needed for both input and output, giving 8 GB/s total, another justification for the Bandwidth Hypothesis. As mentioned previously, the delay buffer must not be allowed to fill up as this would either deadlock the computation or cause it to become invalid. We must find a way to ensure that this does not happen, with the worst case being that the computation slows so that it is no longer real-time, as would be the case if the volume of synaptic updates being processed at any given time were too high. It is also possible to significantly reduce the amount of data that needs to be buffered to achieve a delay to a set of synaptic updates, and hence the bandwidth needed, as will be shown in Section 6.6.2. 4.4.4 Applying synaptic updates Each I-value is 16 bits, matching the precision of the other variables and paramet- ers of the neuron modelling equation. Any of these I-values could be updated by any of the 16× 109 synaptic updates that need to be applied every second. Since applying these updates is the critical inner loop of neural computation, this process will be particularly sensitive to any delay in fetching or updating the I-values. Two copies of the I-values need to be stored to ensure that the values being updated do not conflict with the values being read by the neural modelling equations. With 105 neurons per device 2 × 2 × 105 = 400 kB of storage will be needed for I-values. Up to 16× 109 updates may need to be applied every second, requiring a bandwidth of 2× 16× 109 = 32 GB/s for read and the same for write, a total of 64 GB/s. Since I-values need to be accessed relatively randomly to apply synaptic updates (assuming that incoming synaptic updates are applied in order of arrival), and any latency in these accesses would slow or deadlock processing, the I-values 64 4.5. COMMUNICATION REQUIREMENTS will be best suited to being stored in any on-chip memory that is provided by an implementation technology. This is in contrast to memory accesses for other neural computation tasks, which deal with lists of instructions that are less sensitive to latency, provided that throughput is maintained. 4.4.5 Summary This analysis shows that a massively parallel neural computation system will need the memory resources in Table 4.6 to perform a neural computation with 105 neur- ons per device, a mean spike frequency of 10 Hz and a mean fan-out of 103. These requirements will be essential when an implementation technology for a massively parallel neural computation system is selected in Section 4.6 and when it is decided how to partition data between on- and off-chip memory in Chapter 6. 
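The double-buffered I-value arrangement described in Section 4.4.4 can be sketched as follows. This is an illustration only: in the real system both banks would sit in on-chip memory and the accumulation would be done in hardware, and the 16-bit width and overflow handling are not modelled.

```python
# Illustrative sketch of double-buffered I-value accumulation (Section 4.4.4).
class IValueStore:
    """Two banks of per-neuron I-value accumulators, swapped every sampling interval."""

    def __init__(self, neurons):
        self.banks = [[0] * neurons, [0] * neurons]
        self.write_bank = 0                      # bank receiving this interval's updates

    def apply_update(self, local_target, weight):
        # Accumulate into the bank that the *next* interval's equation evaluations will read.
        self.banks[self.write_bank][local_target] += weight

    def end_of_interval(self):
        """Swap banks: accumulated values become readable, the other bank is cleared."""
        read_bank = self.write_bank
        self.write_bank ^= 1
        self.banks[self.write_bank] = [0] * len(self.banks[read_bank])
        return self.banks[read_bank]             # I-values for the equation pipeline

store = IValueStore(neurons=10**5)
store.apply_update(local_target=42, weight=-37)
store.apply_update(local_target=42, weight=100)
assert store.end_of_interval()[42] == 63
```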
Summing synaptic updates to produce I-values is the critical inner loop of the neural computation algorithm, and so the I-value data is the most sensitive to latency of all that used in neural computation, as well as having high locality of access. Therefore the I-value data should be given highest priority when deciding how to allocate whatever high-speed memories are provided by an implementation technology. The bandwidth required to delay synaptic updates is high relative to the total data size required, and therefore this data should be given the next highest priority for high-speed memory. Operation Size Bandwidth Neural modelling equations 1.6 MB 1.6 GB/s Fetch synaptic updates 0.6 GB 6 GB/s Delay synaptic updates 68 MB 8 GB/s Apply synaptic updates 400 kB 64 GB/s Table 4.6: Memory resources needed for a neural computation of 105 neurons 4.5 Communication requirements When a neural computation scales from a single device to multiple devices, as the Scalability Hypothesis expects to be necessary for neural computations of large neural networks, there must be communication between these devices. This com- munication will be made up of control messages (particularly to start the computa- tion and ensure that sampling intervals are synchronised) and messages resulting from the computation itself. We will concentrate on messages related to the com- putation rather than control messages as the latter will have very limited volume (of the order of a few tens of messages per sampling interval). CHAPTER 4. REQUIREMENTS 65 4.5.1 Type of communication infrastructure It is assumed that an on- and inter-chip network will be used to communicate syn- aptic updates rather than dedicating resources to every possible synaptic connec- tion, as the analysis of neural computation systems in Chapter 2 suggests that the latter will not scale to more than one device. This means that the volume of net- work traffic will be dependent on the frequency of neural spikes and their fan-out. The volume of network traffic together with routing efficiency will affect scalabil- ity, and hence bound the size of neural network that can be handled by a neural computation system in real-time. 4.5.2 Synaptic update communication Messages between devices only need to be used when there are synaptic connec- tions between neurons on one device and neurons on another device. We will ini- tially assume that each synaptic update is transmitted from the source to the target device individually (unicast) as this will provide an upper bound on communica- tion bandwidth requirements. Section 4.4.2 found that each synaptic update required 6 bytes of data. If we concep- tually consider that all synaptic updates will require some form of communication (even if it is just to the local device) then with 105 neurons per device, a fan-out of 103 and a mean spike frequency of 10 Hz, 6 GB/s of bandwidth will be needed to communicate synaptic updates. The proportion of this bandwidth that needs to be provided between devices de- pends on the locality and overall size of a neural network. Since the discussion in Section 3.3.2 expects biological neural networks to exhibit significant locality, for very large neural computations, spanning many devices we expect to see a great deal of communication between neighbouring devices and very little communica- tion over any distance. 
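The effect of locality on inter-device bandwidth can be illustrated with a short calculation. The single "remote fraction" parameter is a simplifying assumption introduced here to capture how much of each neuron's fan-out leaves its device; a remote fraction of 1 corresponds to the pathological case discussed next.

```python
# Illustrative estimate of inter-device synaptic update traffic for one device.
def inter_device_bandwidth(neurons=10**5, rate_hz=10, fan_out=10**3,
                           bytes_per_update=6, remote_fraction=1.0):
    """Bytes per second of synaptic update traffic leaving one device."""
    updates_per_s = neurons * rate_hz * fan_out
    return updates_per_s * bytes_per_update * remote_fraction

for frac in (1.0, 0.1, 0.01):
    gb_s = inter_device_bandwidth(remote_fraction=frac) / 1e9
    print(f"remote fraction {frac:>5}: {gb_s:.2f} GB/s of inter-device traffic")
# 1.0 -> 6.00 GB/s, 0.1 -> 0.60 GB/s, 0.01 -> 0.06 GB/s
```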
An absolute upper bound on inter-device communication bandwidth can be found using the pathological case that all 105 neurons on a device have synaptic connec- tions to neurons on other devices. In this case all 6 GB/s of bandwidth needs to be provided by inter-device links. Communication latency is also an important consideration. For real-time neural computation we must ensure that all synaptic updates are communicated and ap- plied to their target neurons in well under a 1 ms sampling interval. 66 4.6. IMPLEMENTATION TECHNOLOGY 4.5.3 Summary An on- and inter-chip network will be used to communicate synaptic updates. The maximum bandwidth that may be needed between devices is 6 GB/s, which will decrease both with increased locality and if a more efficient method is used to communicate synaptic updates than unicast messaging. Since many other neural computation systems are not able to provide this amount of bandwidth between devices they are communication-bound as suggested by the Communication- Centric Hypothesis. 4.6 Implementation technology The choice implementation technology for a massively parallel neural computation system is governed by many factors, particularly ability to provide the memory and communication resources that were identified in Section 4.4 and Section 4.5. We also need to consider other factors such as cost and usability. Implementation technology also affects many details of the design of a massively parallel neural computation system, particularly how synaptic updates are communicated and ap- plied, and strategies that are used to make efficient use of memory resources. The principle choice of implementation technology is between Application Specific In- tegrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs). We will explore how the requirements of a massively parallel neural computation system map to the characteristics of these technologies. 4.6.1 Memory resources Compared to what can be implemented in an ASIC, FPGAs have limited on-chip memory, for example the Altera Stratix IV 230 FPGA has approximately 2 MB of Block RAM (BRAM). This constrains the implementation of a massively parallel neural computation system, for example it would not be possible to store all of the I-values for 105 neurons in BRAM on the example FPGA as well as the neuron modelling equation parameters. Certain types of structure are hard or impossible to implement in a FPGA, for example content addressable memories, which are often used as part of the implementation of operations such as multicast routing. Both ASICs and FPGAs are able to make use of large off-chip memory resources, for example DDR2 SDRAM. CHAPTER 4. REQUIREMENTS 67 4.6.2 Communication resources The volume of communication that a massively parallel neural computation sys- tem needs between devices requires high-speed transceivers. While it is possible to implement these in ASICs (remembering that FPGAs are themselves a type of ASIC), this can only be achieved using recent nanometre-scale processes, leading to massively increased costs for small production volumes compared to older pro- cesses. In contrast these transceivers are readily available in high-end commercial FPGAs and so they are available at significantly less cost in a FPGA implementation than in an ASIC implementation. 4.6.3 Reprogrammability ASICs give no scope for reprogramming, and hence limited scope for fixing errors or altering many aspects of system behaviour. 
This means that massive effort is needed for design and testing and hardware cannot be altered to suit future design requirements. In comparison FPGAs can be reprogrammed with little cost beyond the time taken to resynthesise a design. This significantly lowers design and test- ing effort as errors can be resolved by reprogramming, while much more effort is needed to avoid errors in ASICs as they could lead to an entire batch of chips and the photographic masks needed to fabricate them being written off. Reprogrammability also allows significant scope architectural exploration. This could allow for a system to become more efficient over time, to have aspects of its behaviour altered (for example using a different neural modelling equation) or to be migrated to a different platform with comparatively little effort, for example moving to cheaper FPGAs that have sufficient resources for the design after an initial implementation with larger FPGAs. 4.6.4 Cost ASICs have a very high initial cost, both as a result of the design effort noted above and for producing the photographic masks that are needed to produce chip wafers. The unit cost of each chip is then typically much lower than that of a FPGA, though it depends on yield, which is often a function of the complexity and size of a design. Overall this means that an ASIC implementation is not cost-effective if overall volumes are low, but it becomes competitive as volumes increase. The unit cost of a FPGA is relatively independent of volume, though discounts for bulk purchases are often available. This makes FPGAs significantly more cost- effective than ASICs for low volume applications, becoming less cost-effective as 68 4.6. IMPLEMENTATION TECHNOLOGY volumes increase, if there are no other factors affecting the choice of implementa- tion technology. However FPGAs largely remove the risks involved in producing faulty batches of ASICs, and the costs of producing ASICs in a process that can implement high-speed transceivers need to be considered. This means that the volume at which an ASIC will be more cost-effective than a FPGA is much higher than would be the case for a simpler design, particularly one without high-speed transceivers. Another factor to consider is the cost of producing a PCB to power each device and provide it with connectivity and peripherals. A custom ASIC implementation will need custom PCBs. These can be expensive to design and manufacture, particularly when high-speed signalling is involved. In contrast FPGA evaluation boards which provide access to high-speed communication links as well as other resources such as DDR2 SDRAM are readily available off-the-shelf, allowing the evaluation board manufacturer to amortise PCB design and production costs. 4.6.5 Power consumption An ASIC will typically have lower power consumption than a FPGA when per- forming an identical task. This is because the ASIC will provide precisely the re- sources that are required by that task, while the FPGA must implement these re- sources using its network of Lookup Tables (LUTs), which will typically require more transistor switching, and hence more power to achieve the same result. How- ever FPGAs will often be implemented in a more modern, smaller-scale process, which will have the effect of reducing power consumption when compared to an ASIC using an older process. 4.6.6 Choice of implementation technology Given these considerations I have chosen to implement a massively parallel neural computation system using a network of FPGAs. 
This choice is driven primarily by prior experience and the availability of high-speed transceivers. It will also have the useful side-effect of creating a platform that could be used for a wide range of applications beyond massively parallel neural computation. This will amort- ise hardware costs over many projects. It also gives an opportunity to experiment with a range of architectures for a massively parallel neural computation system and select whatever is most suitable, which would not be possible without repro- grammability. CHAPTER 4. REQUIREMENTS 69 4.6.7 Implications of this choice When designing a massively parallel neural computation system using a FPGA im- plementation, we must consider the very limited amount of on-chip memory avail- able (2 MB for an Altera Stratix IV 230). As discussed in Section 4.4.4, two copies of the I-values for the neurons being computed on the FPGA must be stored in on- chip memory (and hence FPGA BRAM) to avoid a performance bottleneck. With two copies of the I-values at 16 bits per neuron the absolute maximum number of neurons that could be computed on a FPGA with 2 MB of BRAM is 2 MB 16 bit×2 = 256k. However in practice FPGA designs cannot make use of every avail- able resource, and BRAM will also be needed for many other purposes, particularly delaying synaptic updates. While there is clearly not enough BRAM available to delay the maximum possible number of synaptic updates using only the BRAM (e.g. 64 MB is needed in the pathological case from Section 4.3.3), there are many cases where the amount of BRAM available will be sufficient. Therefore a hybrid solution will be used to delay synaptic updates, with delays being implemented primarily using BRAM, with ex- cess data being transferred to off-chip memory as required. This will be elaborated on in Section 7.2.3. Beyond these uses of BRAM, all other data (other than structures such as temporary buffers for processing data) will be kept in off-chip memory. Given the calculations in Section 4.4.5, this would require a total off-chip memory data size of 0.6 GB and bandwidth of 7.6 GB/s for a neural computation of 105 neurons with a mean fan- out of 103 and mean spike frequency of 10 Hz. Given the bandwidth available using off-chip memory (e.g. around 6 GB/s for DDR2 SDRAM), this gives some implementation challenges: • Keeping utilisation of off-chip memory high so that bandwidth is not wasted • Communicating synaptic updates between FPGAs without multicast routing that is reliant on large amounts of BRAM • Optimising the organisation of data in off-chip memory to reduce the band- width required to access it to a level that is achievable using currently avail- able technology As Section 6.6 will show, using unicast routing to communicate synaptic updates has poor scaling as it uses communication resources in direct proportion to fan-out, so an algorithm to communicate synaptic updates must be designed that approx- imates multicast routing while using only off-chip memory. 70 4.7. CONCLUSION This means that off-chip memory bandwidth becomes the limiting factor to the volume of synaptic updates that can be communicated in real-time, and by exten- sion the size and scalability of a neural computation system, as proposed by the Bandwidth Hypothesis. The routing algorithm must be designed to make max- imum use of off-chip memory bandwidth, and this means making efficient use of burst read transactions. 
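One way to picture the burst-read requirement is to store each neuron's fan-out as a contiguous run of synaptic update records, so that a single spike translates into one long sequential read rather than many scattered accesses. The sketch below is a simplified illustration of that layout idea, not the algorithm developed in Chapter 6.

```python
# Simplified sketch: contiguous per-neuron fan-out lists indexed by (offset, count),
# so that a spike turns into one sequential, burst-friendly read of off-chip memory.
class FanOutStore:
    def __init__(self):
        self.index = {}        # neuron id -> (offset, count) into the flat update list
        self.updates = []      # flat list standing in for off-chip memory words

    def add_neuron(self, neuron_id, synaptic_updates):
        """synaptic_updates: list of (target, delay, weight) tuples."""
        self.index[neuron_id] = (len(self.updates), len(synaptic_updates))
        self.updates.extend(synaptic_updates)                # stored contiguously

    def burst_read(self, neuron_id):
        """Return the whole fan-out of a spiking neuron as one contiguous slice."""
        offset, count = self.index[neuron_id]
        return self.updates[offset:offset + count]

store = FanOutStore()
store.add_neuron(7, [(1001, 3, 25), (1002, 1, -12), (2090, 16, 7)])
assert store.burst_read(7) == [(1001, 3, 25), (1002, 1, -12), (2090, 16, 7)]
```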
4.7 Conclusion I have chosen to implement a massively parallel neural computation system us- ing a network of FPGAs. The multi-FPGA platform that will be used for this is introduced in the next chapter. FPGAs provide a good balance between perform- ance and cost, and significantly reduce development costs and risks. However they pose implementation challenges, particularly related to the limited amount of on- chip memory that they provide. The next chapter will introduce a multi-FPGA platform that is suitable for imple- menting a massively parallel neural computation system. Then Chapter 6 will de- velop an algorithm to communicate and apply synaptic updates in a massively parallel neural computation system that takes account of the limitations of a FPGA implementation. Chapter5 Multi-FPGA platform 71 72 5.1. INTRODUCTION 5.1 Introduction In the last chapter it was decided to implement a massively parallel neural compu- tation system on a platform with a network of FPGAs on commercial off-the-shelf evaluation boards. The platform, christened “Bluehive” was constructed by Simon Moore and A. Theodore Markettos. This chapter discusses the design of this plat- form and the FPGA evaluation boards that it uses. Many of its features are pertinent to the design of the algorithm for communicating and applying synaptic updates in Chapter 6. 5.2 Choice of FPGA board While it is clear that massively parallel neural computation needs FPGAs that provide high-speed transceivers and memory interfaces, these FPGAs must be placed on PCB boards to allow them to function, and to link them together via their transceivers to create a multi-FPGA platform. There are a number of design choices. 5.2.1 Number of FPGAs per PCB In the past, the only way to create multi-FPGA platforms was placing many FP- GAs on a single large, multi-layer PCB, as inter-FPGA links had to use parallel signalling, and so care was needed to ensure good signal integrity and low skew. This is still done (e.g various Dini Group products), but large PCBs for multi-FPGA systems are very expensive commercially, as a result of the raw cost of producing the large PCB, the cost of design effort (particularly routing the parallel links to avoid skew) and low volume of sales. The introduction of high-speed serial transceivers has greatly simplified routing of links between FPGAs in some ways, as less wires are needed per link, however the higher clock rates involved mean that care is still needed to avoid signal integrity issues. However, differential signalling allows these links to operate at high speeds over relatively long cables. Factors such as the orientation of the cable and the proximity of other similar cables have relatively little bearing on the signal integrity of each link. CHAPTER 5. PLATFORM 73 This means that it is easier, cheaper and more convenient to design a multi-FPGA platform which uses pluggable cabling for all links between FPGAs than a platform using multiple FPGAs on a single large PCB. With one FPGA per PCB the design effort of creating the PCB is significantly reduced, manufacturing costs are lower and volumes higher. All of these factors reduce the commercial unit cost of these single FPGA PCBs. 5.2.2 Memory interfaces In Section 4.4 it became clear that each FPGA needs a high-bandwidth off-chip memory interface. This needs to support burst reads to allow the available band- width to be used as efficiently as possible, keeping delays for setting up reads and writes to a minimum. 
The most suitable type of memory for this application is one of the variants of DDR SDRAM, as a well-implemented memory controller can fetch two words every memory clock cycle, and translate this to providing very long words to the main FPGA logic at a lower clock frequency. 5.2.3 Connectivity A variety of physical interconnects have been used on FPGA boards to link them together. It is important that the physical interconnect can be used to construct to- pologies that are suitable for a massively parallel neural computation system. Some commercial FPGA evaluation boards have suffered from suboptimal interconnects in the past, for example the Terasic DE3 evaluation board provides interconnect us- ing 4 High Speed Mezzanine Connectors (HSMC), using parallel signalling. These connectors are very short and not very flexible (to ensure signal integrity), and in addition the single DDR2 SDRAM connector shared pins with one of the 4 HSMC connectors. This constrained any multi-FPGA implementation using this board to a topology of either a stack, linear row or a ring, and in all cases the physical ar- rangement of boards had to match the topology, so a ring required that the boards be formed into a ring stood on end. This is not suited to creating a large multi- FPGA platform. Therefore the connections between FPGA evaluation boards must use flexible cables, so a variety of communication topologies can be implemented without hav- ing to change the physical arrangement of boards in a multi-FPGA platform. Either the board must provide suitable connections or it must be possible to adapt what is provided. 74 5.2. CHOICE OF FPGA BOARD 5.2.4 Symmetry Some multi-FPGA platforms (particularly those with multiple FPGAs on a PCB), have a different pin-out for each FPGA, for example some FPGAs may be connec- ted to off-chip memory or other I/O devices while others are not. These could be called “non-symmetrical” platforms. Different FPGA bitstreams will be needed for some or all of the FPGAs when implementing a system on a non-symmetrical platform, which makes creating or modifying such a system complex and time- consuming. An alternative approach is to use a “symmetrical” platform, where every FPGA has identical peripherals and connectivity. This allows the same bitstream to be used for all FPGAs if desired, which greatly simplifies creating and modifying a system implemented on the platform. A symmetrical platform is most appropriate for implementing a massively parallel neural computation system as each FPGA performs a portion the computation and (assuming that the neural network description is held in off-chip memory) each FPGA can use an identical bitstream. 5.2.5 Availability of off-the-shelf solutions If a commercial manufacturer can design a single product that can be used by many customers for a wide range of applications, it will make economic sense for them to do so, and unit costs are significantly reduced for all customers. This is significantly more likely to be the case for PCBs that have a single FPGA per PCB and that have a range of interconnects and peripherals than it would be for a PCB with multiple FPGAs or with only the features required by a given application. Since such off-the-shelf solutions exist, it is not sensible to go to the effort and incur the cost of designing a custom PCB. 
However, this could lead to some design trade-offs if the mix of interconnect and peripherals provided by an off-the-shelf solution is not perfectly suited to the needs of a massively parallel neural computation system.

5.2.6 Summary

The most important considerations when creating a multi-FPGA platform for a massively parallel neural computation system are memory interfaces and interconnect. Given the benefits of using commercial off-the-shelf evaluation boards with a single FPGA on each board, Simon Moore decided to identify and use a suitable board that meets the requirements of a massively parallel neural computation system and other similar projects. A combination of technical merit, suitable peripherals and prior experience of working with the manufacturer led to the Terasic DE4 FPGA evaluation board being selected.

5.3 Terasic DE4 evaluation board

The Terasic DE4 evaluation board has a single Altera Stratix IV 230 FPGA, and the following features which are pertinent to the design of a massively parallel neural computation system:

• 228k logic elements
• 2 MB of Block RAM
• 2 DDR2 SO-DIMM sockets, which support up to 4 GB of RAM each
• Bidirectional, high-speed serial transceivers:
– 4 presented as SATA connectors
– 8 presented as a PCI Express (PCIe) edge connector
– 4 presented as 1 Gbit/s Ethernet
– 8 presented in one HSMC connector
– 4 presented in another HSMC connector

The layout of the board and the location of each of these features is shown in Figure 5.1 on the next page.

Figure 5.1: Terasic DE4 evaluation board and its major components (image provided by Terasic)

As identified in Section 5.2.3 and Section 5.2.2, the high-speed serial transceivers and DDR2 memory are particularly pertinent to the design of a massively parallel neural computation system. At the time that construction of the multi-FPGA platform commenced, this evaluation board was one of only a few that provided high-speed transceivers, particularly in a physical format that facilitated constructing a scalable multi-FPGA platform.

5.3.1 Memory hierarchy

There are three types of memory available on the DE4 board. Their capacity, latency and bandwidth are summarised in Table 5.1 on page 77.

Type   Size / MB    Latency / clock cycles   Bandwidth / GB/s
BRAM   2            1                        Various
SRAM   2            3                        0.2
DDR2   2000–4000    5                        6

Table 5.1: Memories available on a DE4 board

The bandwidth of BRAM depends on how it is accessed. There are multiple blocks, each with separate read and write ports. This allows many parallel accesses, and hence it is inefficient (in terms of the percentage of BRAM used) to construct large memories, as this would effectively leave some read or write ports unused.

There are two DDR2 banks which are physically separate and use separate controller logic (although they may share some timing logic). Each supports up to 4 GB of RAM, although only a single bank of 1 GB is being used at present. Note that the latency of DDR2 is very dependent on access patterns. If burst reads and writes are used then the effective latency when accessing a contiguous set of words is reduced compared to accessing non-consecutive words.

5.4 Creating a multi-FPGA platform

It was ascertained that 16 DE4 boards could be fitted in an 8U 19” rack box, 8 at the front and 8 at the back (Figure 5.2 on the next page). Suitable methods needed to be found to connect them and to provide power, programming support (JTAG) and management facilities.
5.4.1 Interconnect

Of the high-speed interconnects on the DE4 board:

• the 4 presented as SATA connectors can be used directly
• the 8 presented as a PCIe edge connector need adapting to be used outside of a host PC
• the 4 presented as Ethernet are too slow for any use other than connecting to a network for management purposes. They are further limited by an Ethernet interface chip between the FPGA transceivers and the physical connectors
• the 12 presented as HSMC connectors are difficult to use for interconnect directly, as they are designed for custom peripherals. Optical transceiver adapters are available, but they are expensive, and will be best suited to communication between racks in a larger multi-FPGA platform in future

Figure 5.2: One of the multi-FPGA boxes with side panels removed, showing 16 DE4 boards at the top and 16 PCIe-to-SATA adapters, SATA links and power supplies at the bottom

The combination of SATA and PCIe gives 12 high-speed interconnects in total, and thus 12 × 6 Gbit/s = 72 Gbit/s = 9 GB/s of bidirectional communication bandwidth per FPGA board. This is higher than the pathological maximum bandwidth requirement of 6 Gbit/s for 10^5 neurons per FPGA with a fan-out of 10^3 and a spike frequency of 10 Hz identified in Section 4.5.2. Even allowing for overheads and before any optimisation, this is more bandwidth than a massively parallel neural computation system is ever likely to need, barring even more extreme pathological cases.

To allow the interconnects in the PCIe 8× edge connector to be used to link multiple FPGA boards, one large motherboard could have been designed, but as noted in Section 5.2.1 this would have been a complex and expensive PCB given the high-speed signalling. Instead a small, four-layer PCIe-to-SATA adapter board was designed to break the DE4's PCIe channels out to SATA connectors (Figure 5.3 on the facing page). SATA links are normally used for hard disk communication since they have very low cost and yet work at multi-gigabit rates. Using SATA3 links, Simon Moore and Theo Markettos found that it was easy to achieve 6 Gbit/s of bidirectional bandwidth per link (Moore et al., 2012) with virtually no bit errors (fewer than 1 error in 10^15 bits), a data rate that is much higher than that reported by Schmidt et al. (2008) for FPGA-to-FPGA links.

Figure 5.3: PCIe-to-SATA adapter board connected to a DE4 board

The SATA and PCIe connections are used at an electrical level only. A custom protocol is used rather than the SATA or PCIe protocols (see Section 7.4). SATA is also available in an external variety, eSATA, offering better electrical characteristics and a more robust connector, so eSATA connectors are used for box-to-box communication, connected to the 4 SATA ports on the board via an adapter. SATA and eSATA are capacitively decoupled, which gives some board-to-board isolation. A common ground plane makes them suitable for rack-scale platforms but probably no further, due to cable length and three-phase power issues, necessitating the use of optical interconnects beyond that scale.

The PCIe-to-SATA adapter board permits intra-box communication to use a reconfigurable (repluggable) topology. A PCIe edge connector socket was chosen, rather than the more conventional through-hole motherboard socket, to allow the adapter board to be in the same plane as the FPGA board (Figure 5.3), allowing a 3D “motherboard” of adapter boards to be constructed.
The adapter board also provides power, JTAG programming and SPI status monitoring (e.g. of FPGA tem- perature). 80 5.4. CREATING AMULTI-FPGA PLATFORM 5.4.2 Programming and diagnostics The DE4 board has an on-board USB-to-JTAG adapter (Altera USB Blaster) which could have been used to connect multiple FPGA boards to a PC via USB hubs to program them. However Altera’s jtagd is only capable of programming boards sequentially and, at 22 s per board, this becomes inconvenient for multi-FPGA plat- forms. Given that it is often desired to program all FPGAs in a system with the same bitstream, a parallel broadcast programming mechanism was developed by Theo Markettos, using a DE2-115 (Cyclone IV) board to fan-out signals from a single USB-to-JTAG adapter. The DE2-115 board has selectable GPIO output voltage, al- lowing it to be easily matched to the DE4’s JTAG chain requirements. A small NIOS processor system on the DE2-115 board facilitates communication with the programming PC, allowing the programming configuration to be selected. This al- lows everything from broadcast of a bitstream to all FPGAs in a system through to programming just one FPGA with an individual bitstream. In broadcast mode, the PC only sees a JTAG chain with one FPGA, but the bitstream is sent to many FPGAs, depending on the programming configuration selected. It is also possible to make the FPGAs appear in one long JTAG chain, e.g. for com- munication with JTAG-UARTs on each FPGA board after configuration. For diagnostic purposes the on-board temperature and power sensors on each DE4 can be monitored via an SPI interface. A small MAX II CPLD is used to multiplex the SPI signals between the DE4 boards and the DE2-115 board. This MAX II sits on a simple two-layer custom PCB (Figure 5.4). This PCB also fans-out the JTAG programming signals from the DE2-115 using the standard Altera JTAG header. The DE2-115 controls the configuration of the fan-out via the MAX II. Figure 5.4: Parallel programming and status monitoring board CHAPTER 5. PLATFORM 81 5.4.3 Management There is a mini-ITX based Linux PC inside each box to act as a remote programming and management device. A small power supply powers this PC and the DE2-115 board. The PC can power up a server-class power supply, via the DE2-115, which powers the DE4 boards. This allows the multi-FPGA platform to be powered up to run a task and then be put to sleep afterwards, reducing standby power to under 10 W. The PC also monitors the DE4 temperatures and voltages and the server power supply status (via an I2C link), and powers them down if they go outside of the desired operating range. 5.5 Conclusion The Bluehive multi-FPGA platform will be used to implement a multi-FPGA, massively parallel neural computation system. It has many features that make it suited to this task, particularly interconnect and off-chip memory interfaces. Section 4.4 found that BRAM needs to be conserved, with the majority of the data that describes a neural network kept in off-chip memory, and that access to this memory would become a bottleneck that limits the scale and speed of neural com- putation, as proposed by the Bandwidth Hypothesis. On the DE4 board the majority of the off-chip memory is DDR2 SDRAM, which is most efficiently accessed using burst reads and writes. 
Therefore a massively parallel neural computation system implemented on the multi-FPGA platform in- troduced in this chapter must make efficient use of burst reads and writes, as this will make most efficient use of memory bandwidth and hence maximise the scale of a real-time neural computation. Chapter6 Communicating and applying synaptic updates 83 84 6.1. INTRODUCTION 6.1 Introduction Neural computation would be straightforward to parallelise if each neuron was modelled in isolation, without synaptic connections. However such a computation would not yield useful results, and so modelling synaptic connections must be in- cluded in the scope of a massively parallel neural computation system. This proves to be a significant challenge to parallelise, as predicted by the Communication- Centric hypothesis. When a neuron spikes, we need to determine what synaptic connections exist, what effect each connection should have on other neurons and how long that effect should be delayed for. We then need to communicate these synaptic updates and apply their effects in the correct sampling interval so that target neurons are appro- priately affected. This process will be referred to as communicating and applying synaptic updates. Communicating and applying synaptic updates efficiently is critical if a massively parallel neural computation system is to operate in real-time, particularly with biologically plausible fan-out. The choice of a FPGA-based implementation plat- form gives many benefits (particularly high-speed transceivers), but it poses chal- lenges to communicating and applying synaptic updates, particularly in relation to memory usage. With high fan-out and significant locality, a multicast routing algorithm would ap- pear to be appropriate to communicate synaptic updates. However, this requires a multi-stage memory lookup to determine which neurons are targeted by each syn- aptic update at target FPGAs. Since there is not room for these lookup tables in on-chip memory, they will need to be stored in off-chip memory. Scanning lookup tables in off-chip memory makes very inefficient use of bandwidth and limits the scale of a real-time neural computation. Therefore a multicast routing algorithm is unsuitable for a FPGA-based massively parallel neural computation system. Unicast routing is also unsuitable as the number of messages it uses between FP- GAs is proportional to the product of spike frequency and fan-out, which would use too much communication bandwidth. To counter both of these limitations, this chapter develops and evaluates an alternative algorithm to communicate and ap- ply synaptic updates which aims to maximise memory efficiency, and finds that it provides a compromise between the communication bandwidth needed by unicast and the memory bandwidth needed by multicast. CHAPTER 6. SYNAPTIC UPDATES 85 6.2 Design goals We aim to communicate and apply synaptic updates in massively parallel neural computations with high fan-out in real-time, within the constraints imposed by a FPGA implementation. Given the communication and resource constraints identi- fied in Chapter 4, this leads to three main design goals: 1. Fetch synaptic updates from off-chip memory as efficiently as possible 2. Make good use of on-chip memory so that its limited quantity does not overly limit the scale of computations 3. Avoid excessive inter-FPGA communication, which would hamper scalabil- ity Each of these goals will now be studied further. 
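As a reminder of the scale behind goal 1, the per-FPGA synaptic update traffic from Chapter 4 can be re-derived in a few lines. The figure of roughly 48 bits per stored update is an assumption, chosen to be consistent with the 0.6 GB and 6 GB/s figures quoted later in Section 6.6.3.

```latex
% Back-of-envelope per-FPGA estimate (assumptions: 10^5 neurons per FPGA,
% mean fan-out 10^3, mean spike frequency 10 Hz, ~48 bits per stored update).
\begin{align*}
\text{spikes per second}           &= 10^{5} \times 10\,\mathrm{Hz} = 10^{6}\\
\text{synaptic updates per second} &= 10^{6} \times 10^{3} = 10^{9}\\
\text{update data read per second} &\approx 10^{9} \times 48\,\mathrm{bit} \approx 6\,\mathrm{GB/s}
\end{align*}
```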
6.2.1 Fetching synaptic updates efficiently To make most efficient use of off-chip memory the characteristics of the memory system provided by the implementation platform must be considered. DDR2 memory supports memory accesses on both the rising and falling edges of a clock, and so using DDR2-800 memory at a memory clock frequency of 400 MHz, memory accesses can be made at an effective frequency of 800 MHz. Each memory access is 64 bits wide. Using buffering and clock-crossing logic, this is equivalent to making 256 bit memory accesses at 200 MHz. The Altera DDR2 memory controller sup- ports pipelining of memory requests and burst reads and writes. It is currently set up to support burst reads of up to 8× 256 bit words. The latency between a burst read request and the first word being returned is around 5 clock cycles, with the remaining words returned on each of the following clock cycles. This means that the most efficient way to access off-chip memory on this platform is to to fetch data in blocks rather than performing multiple, separate single word reads, and to process it 256 bits at a time, making full use of all data returned. Memory requests should be queued in the memory controller so that it is never idle, either by making requests for multiple sets of sequential bursts from one part of a system, or independent requests from different parts of the system. If chains of memory requests are inter-dependent (for example the result of a single read contains the address for a further read) then the memory controller will be left idle when it could have been returning data, which reduces off-chip memory effi- ciency. 86 6.3. DESIGN DECISIONS 6.2.2 On-chip memory usage As discussed in Section 4.6.7, a significant proportion of the on-chip memory provided by a FPGA is used to store two copies of the I-value for each neuron. Since the remainder of the data that describes a neural network is stored in off-chip memory, this leaves the remainder of the on-chip memory to delay synaptic up- dates and to implement neural computation hardware. Structures that use large amounts of on-chip memory, such as buffers with size proportional to the number of neurons in a computation or their fan-out must be avoided. 6.2.3 Inter-FPGA communication Inter-FPGA communication must neither be excessive nor must it introduce excess- ive latency, as either could prevent a massively parallel neural computation system operating in real-time. Some form of multicast routing would appear to be appro- priate for communicating synaptic updates, but this must not be at the expense of inefficient off-chip memory usage or excessive use of on-chip memory. 6.3 Design decisions The task of a synaptic update communication and application algorithm at first appears straightforward – take each synaptic update that occurs when a neuron spikes and apply it to the target neuron. However, this becomes a rather complex process when the scale of the task is considered – around a thousand synaptic up- dates need to be communicated and applied in response to a single spike, with many thousands of neurons spiking every sampling interval resulting in millions of synaptic updates needing to be communicated and applied every millisecond. This means that this process dominates that of modelling the behaviour of neurons, as predicted by the Communication-Centric Hypothesis. The task of communicating and applying synaptic updates can be broken down into several inter-dependant subtasks, each presenting a series of design decisions. 
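Before turning to those decisions, the effect of access pattern described in Section 6.2.1 can be made concrete. The sketch below is an illustrative model only, using the controller characteristics assumed above and in Section 6.6.1 (256 bit words at 200 MHz, bursts of 8 and roughly 5 cycles of latency before the first word of a burst is returned); it is not a measurement of the Altera controller.

```python
# Illustrative model of effective off-chip memory bandwidth for different
# access patterns, using the characteristics assumed in this chapter.
WORD_BITS = 256          # width presented to the FPGA logic
CLOCK_HZ = 200e6         # user-side clock frequency
BURST_LEN = 8            # words returned per burst read
LATENCY = 5              # cycles from request to first word returned

def effective_bandwidth(words_per_request, pipelined):
    """Bytes/s delivered by a stream of reads of `words_per_request` words.

    If `pipelined`, further requests are already queued so the controller is
    never idle and only data-transfer cycles are paid; otherwise the request
    latency is paid on every access.
    """
    cycles = words_per_request if pipelined else LATENCY + words_per_request
    bytes_returned = words_per_request * WORD_BITS // 8
    return bytes_returned / cycles * CLOCK_HZ

print(effective_bandwidth(1, pipelined=False) / 1e9)          # ~1.1 GB/s: single-word reads
print(effective_bandwidth(BURST_LEN, pipelined=False) / 1e9)  # ~3.9 GB/s: bursts with idle gaps
print(effective_bandwidth(BURST_LEN, pipelined=True) / 1e9)   # ~6.4 GB/s: bursts, controller kept busy
```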
Making appropriate design choices will produce an algorithm which makes ef- ficient use of resources and is well suited to implementing a massively parallel neural computation system that operates in real-time. CHAPTER 6. SYNAPTIC UPDATES 87 6.3.1 Starting the update process Each neuron has an associated set of synaptic updates that need to be commu- nicated and applied to other neurons whenever it spikes. These will be stored in off-chip memory. When the neuron spikes the updates will need to be located. This could be done using: Synaptic updates alongside neuron parameters The synaptic updates could be stored at subsequent memory addresses to the parameters of the neuron modelling equation for each neuron. This would avoid any indirection, but would also make memory accesses less efficient as fetching neuron modelling equation parameters would require either sparse storage or a table or chain of pointers to locate each set of neuron parameters. A pointer found using a lookup table A lookup table could be used to find a pointer to the set of synaptic updates, keyed on the identity of the neuron that has spiked. This method is used by Jin et al. (2008) in the Spinnaker system, which uses a lookup table (imple- mented using a binary tree in on-chip memory) to find a pointer to a block of synaptic updates in off-chip memory. A pointer alongside neuron parameters A pointer to the set of synaptic updates could be stored alongside the para- meters of the neuron modelling equation. These parameters will have come from a different region of off-chip memory, and so the contents of one region of off-chip memory will be dependent on the contents of another. Storing synaptic updates alongside neuron parameters is clearly unsuitable as it makes accessing these parameters complex and inefficient as burst reads cannot be used. Using a lookup table will be inefficient unless the table can be stored in on-chip memory, which is inconsistent with the aim of making good use of on-chip memory. While storing a pointer to the set of synaptic updates alongside the neuron model- ling equation parameters may make it slightly harder to update the structure of a neural network while computation is in progress (for example to implement learn- ing), it provides a significant advantage that (unless there are no spare bits in the data used to represent the neuron modelling equation parameters) the pointer can be found without any extra memory accesses. Therefore a pointer to the set of synaptic updates that need to be communicated and applied when a neuron spikes will be stored alongside the parameters of the neuron modelling equation. 88 6.3. DESIGN DECISIONS 6.3.2 Fetching synaptic updates Synaptic updates will need to be fetched from off-chip memory so that they can be communicated and applied to their target neurons. These updates conceptually consist of tuples of (target FPGA, target neuron, delay, weight). Either all or part of these updates could be fetched at: Source FPGA Fetch the tuples at the source FPGA and send them to their target FPGAs using the inter-FPGA communication system. Target FPGA Fetch the tuples at target FPGAs in response to messages received over the inter-FPGA communication system. Other FPGAs Fetch the tuples at some point between the source and target FPGAs and send them to their target FPGAs. Fetching all of the tuples at the source FPGA would mean sending them all between FPGAs, which would make excessive use of inter-FPGA communication and would not be consistent with the design goals. 
Therefore update tuples will be fetched in a combination of locations, some at the source FPGA, some at the target FPGA and some at intermediate FPGAs. 6.3.3 Delaying synaptic updates Delaying synaptic updates requires some form of buffering at a point or several points in the neural computation system. This could be at either or several of: Source FPGA Delay synaptic updates at the source FPGA before communicating them to target FPGAs. This requires at least one message to be sent from the source FPGA to each target FPGA for each delay size. Target FPGA Delay synaptic updates at their target FPGAs. This has the advantage of hiding communication latency. Any synaptic update can be delayed in the inter-FPGA communication system for up to the smallest synaptic delay size without having any effect on the behaviour of the neural communication sys- tem, since the update will still arrive at the target FPGA in time to be pro- cessed without increasing the total delay from the point of view of the target neuron. CHAPTER 6. SYNAPTIC UPDATES 89 Other FPGAs Synaptic updates could be delayed at some point between the source and tar- get FPGAs. This would be difficult to implement, and since some of the syn- aptic updates might have the same source and target FPGA, it would reduce to being equivalent to one of the other two methods. Delaying synaptic updates at the target FPGA is more suitable for several reasons. Firstly it helps to hide communication latency as synaptic updates are communic- ated to their target FPGA before they are delayed. As long as a synaptic update arrives at its target FPGA within the minimum delay size of 1 ms, it will appear to have arrived at the target FPGA with no extra latency. Conversely if synaptic updates are delayed at the source FPGA then any delay caused by communication latency in a real-time neural computation will be in addition to the delay specified by the synaptic update, potentially invalidating the neural computation. Secondly the amount of inter-FPGA communication needed to communicate syn- aptic updates between FPGAs is reduced by delaying synaptic updates at their tar- get FPGAs, as a single message can be sent from the source FPGA to target FPGAs in response to a spike, rather than having to send at least one message per delay size. Given that it reduces complexity, helps to hide communication latency, and poten- tially reduces inter-FPGA communication, synaptic updates will be delayed at their target FPGAs. 6.3.4 Applying synaptic updates Synaptic updates need to be applied so that they have an affect on their target neurons in the correct sampling interval. This could by done by: Applying at the target FPGA The target FPGA updates the I-value of the target neuron. Applying from some other FPGA The FPGA applying the update will need to write to the I-value memory on the target FPGA. Applying synaptic updates at any point in the system other than the target FPGA requires shared memory accesses and infrastructure to allow these shared accesses. This adds significant complexity compared to applying synaptic updates at the tar- get FPGA, which only requires FPGAs to be able to access their local memory and send messages to other FPGAs. 90 6.4. PASSING POINTERS Therefore synaptic updates will be applied at their target FPGAs. This doesn’t necessarily preclude the synaptic updates themselves being fetched at some other point in the system. 
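The decisions made in this section can be summarised as a short software sketch of the per-spike flow; it also anticipates the pointer-passing arrangement developed in the next section. All type, field and method names below are hypothetical, and the FPGA objects are stand-ins for hardware blocks, so this is an illustration of where fetching, delaying and applying happen rather than an implementation.

```python
# Hypothetical sketch of the per-spike flow decided in Section 6.3: fetching
# starts from a pointer stored with the neuron, and synaptic updates are
# delayed and applied at their *target* FPGAs.
from collections import namedtuple

NeuronRecord = namedtuple("NeuronRecord", "v u a b c d fanout_ptr")
FanoutTuple  = namedtuple("FanoutTuple", "target_fpga delay update_ptr")
UpdateTuple  = namedtuple("UpdateTuple", "target_neuron weight")

def on_spike(source_fpga, neuron: NeuronRecord):
    # The pointer comes "for free" alongside the equation parameters.
    for ft in source_fpga.burst_read_fanout(neuron.fanout_ptr):
        # One message per target FPGA (not per synapse) crosses the interconnect.
        source_fpga.send(ft.target_fpga, ft)

def on_message(target_fpga, ft: FanoutTuple):
    # Only a pointer is delayed at the target, not the update tuples themselves.
    target_fpga.delay_buffer[ft.delay].append(ft.update_ptr)

def on_sampling_interval(target_fpga):
    # Updates are fetched and applied locally, so no shared memory is needed.
    for ptr in target_fpga.delay_buffer.pop_due():
        for ut in target_fpga.burst_read_updates(ptr):
            target_fpga.i_value[ut.target_neuron] += ut.weight
```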
6.4 Passing pointers Based on the preceding discussion, it is clear that: • The update process should be started using a pointer stored alongside the parameters of the neuron modelling equation • It is inefficient to fetch all of the synaptic updates at the source FPGA and communicate them to target FPGAs using the inter-FPGA communication system • The synaptic updates must be stored in off-chip memory, and should be ac- cessed using burst reads, making full use of all data returned • Synaptic updates should be delayed at target FPGAs to hide communication latency • Synaptic updates should be applied at target FPGAs so that shared memory is not required The largest open question is how to arrange the synaptic updates in off-chip memory to make optimal use of bandwidth. One method to help achieve this is to make full use of pipelining in the memory controller by having multiple hardware blocks in a neural computation system, that can each make independent memory requests. This can be achieved by breaking synaptic update communication and application into separate stages, each of which accesses data in off-chip memory relevant to its function: 1. Determining the set of target FPGAs (fan-out) 2. Communicating with the target FPGAs (communication) 3. Delaying synaptic updates (delay) 4. Applying synaptic updates (accumulation) Each of these stages will have an allocated region of off-chip memory which will contain a portion of each set of synaptic updates. This means that when each stage communicates with the next (either using the inter-FPGA communication system or internal wiring), the next stage will have to fetch data from its region of off-chip memory. Since lookup tables should be avoided (as they make inefficient use of CHAPTER 6. SYNAPTIC UPDATES 91 off-chip memory bandwidth), this means that each message between stages should contain a pointer to the data that will be used by the next stage. Therefore each stage’s data (other than the last) will need to contain a pointer that will be used by the next stage alongside whatever other data that stage needs to perform its function. 6.4.1 Delaying pointers to synaptic updates Implementing a delay requires buffering, with the total amount of buffering re- quired being proportional to the combination of the size of each item of data being delayed, the number of items being delayed and the length of the delay. Synaptic updates need to be delayed from 1 to 16 ms, and so 16 parallel delay buffers will be needed, one for each size of delay. The size of these delay buffers should be minim- ised to minimise on- and off-chip memory usage, and therefore so should the size and volume of data that passes through them. One way to reduce the size and volume of data being delayed is to delay a pointer to a block of (target neuron, weight) pairs (the remains of synaptic updates after delay information is removed) rather than the pairs themselves. This means that the delay stage will need to fetch tuples of (delay, pointer to updates) from off-chip memory. However, an extra off-chip memory access from the delay stage can be avoided by transmitting these tuples from the fan-out stage. The fan-out stage will hence fetch tuples of (target FPGA, delay, pointer to updates). 
This uses extra inter-FPGA communication (up to 16×more if every possible delay size is used), but it will improve memory access efficiency by avoiding any off-chip memory accesses from the delay stage (other than if it needs to spill buffered data to off-chip memory) and increasing the size of off-chip memory accesses from the fan-out stage. A pointer to the block of tuples that will be fetched by the fan-out stage in response to a neuron spiking will be kept alongside the parameters of the neuron model- ling equation, and will be provided by the hardware that evaluates the equation whenever a neuron spikes. This completes the arrangement of synaptic updates in off-chip memory, which is shown in Figure 6.1 on the following page. 92 6.5. SYNAPTIC UPDATE ALGORITHM Pointer D C B A U V Delay Pointer Neuron ID Weight Equation Parameters Fan-out Tuples Update Tuples Figure 6.1: Arrangement of data used by synaptic update algorithm in off-chip memory 6.5 Synaptic update algorithm The proposed synaptic update communication and application algorithm applies a complete set of synaptic updates by following a tree of pointers through regions of off-chip memory on the source FPGA, some intermediate FPGAs and target FPGAs. It proceeds as follows: 1. A pointer is found alongside the parameters of the neuron modelling equa- tion for the neuron that has spiked 2. The pointer is used to burst read a block of tuples from the fan-out region of off-chip memory. These tuples consist of either: (a) (target FPGA, delay, pointer to update tuples) or (b) (target FPGA, pointer to fan out tuples) 3. All tuples are communicated to their target FPGA using the inter-FPGA com- munication system 4. Tuples of type 2a delayed at their target FPGAs 5. Tuples of type 2b repeat step 2 6. After being delayed each pointer is used to burst read a set of update tuples. These tuples consist of (target neuron, weight). Each weight is added to the target neuron’s I-value to perform the synaptic update. Repetition of step 2 is optional, and is suited to cases where either the number of delay sizes used by the fan-out to a given FPGA is high or a number of target neur- ons are clustered on groups of FPGAs remote from the source. A single message can be sent to one of the FPGAs in the group, with further fan-out then sending messages to the other FPGAs in the group. CHAPTER 6. SYNAPTIC UPDATES 93 6.5.1 Benefits of the algorithm The proposed synaptic update communication and application algorithm minim- ises use of on-chip memory, with the neural network description being stored in off-chip memory. This facilitates implementation of the algorithm using FPGAs. Off-chip memory accesses are optimised (and hence the number of neurons that can be handled by each FPGA is maximised) in a number of ways: • The pointer in step 1 is stored alongside the parameters of the neuron model- ling equation, and so it does not need to be fetched from memory • Separate tuples from the fan-out stage for each pair of target FPGA and delay avoid additional memory accesses at target FPGAs before applying delays • The assumption of locality makes it likely that the majority of synaptic up- dates will target neurons which are either on the same FPGA as the source neuron or on nearby FPGAs. 
Combined with the assumption of high fan- out this means that a large number of fan-out tuples will be fetched by each pointer at the fan-out stage, which makes efficient use of burst memory ac- cesses • The pointer needed by the target FPGA in step 6 is supplied by the source FPGA, unlike multicast algorithms such as that proposed by Jin et al. (2008) that require multiple memory accesses to a lookup table to find the pointer based on the identity of the neuron that has spiked • If the size of an off-chip memory word is greater than the size of a fan-out tuple (either type in step 2) then multiple tuples can fit in a word, which makes more efficient use of off-chip memory bandwidth than storing a single tuple in a word • If the size of an off-chip memory word is greater than the size of an update tuple then multiple updates can be applied in parallel, making efficient use of both time and off-chip memory bandwidth While the algorithm requires that three inter-dependent regions of off-chip memory are populated before a neural computation can run, this is a similar situation to populating the routing tables used by multicast routing algorithms. Since all data required by a neural computation is stored in off-chip memory, it is simpler to load a neural network on to the neural computation platform since it is not necessary to transfer routing data into on-chip memories before the neural computation is run. Nor is it necessary to resynthesise the FPGA bitstream to change the neural network, as is the case with many previous FPGA implementations, such as that by Bailey (2010). 94 6.6. EVALUATION OF THE ALGORITHM 6.6 Evaluation of the algorithm The performance of the proposed synaptic update communication and application algorithm can be evaluated by comparing it to algorithms that communicate syn- aptic updates using unicast and multicast routing. We will use mathematical mod- els of the number of clock cycles needed to communicate and apply the synaptic updates produced by a single neuron spike with a fan-out of 103. There are two mathematical models. The first assumes that full sets of update tuples are delayed at their target FPGAs (“tuple delay”) while the second delays pointers to subsets of these update tuples (grouped by delay) rather than the tuples themselves (“pointer delay”). This cannot be done with unicast routing, which has consistent behaviour using both models. The models count the clock cycles needed for inter-FPGA communication, to per- form off-chip memory accesses and to insert and remove update tuples or pointers from delay buffers. A number of distributions of target neurons relative to the source neuron are used to provide several points of comparison with varying de- grees of locality. 6.6.1 Mathematical models The mathematical models evaluate synaptic update communication and applica- tion algorithms based on the number of clock cycles used to communicate, delay and apply synaptic updates. Each clock cycle represents resource usage somewhere in a neural computation system, with lower resource usage translating to either ability to increase the scale of a computation, or a greater probability that a neural computation will operate in real-time. We now discuss the distributions of target neurons relative to the source neuron that are used with the mathematical models and the method used to sample the clock cycles taken by off-chip memory accesses, inter-FPGA communication and delay buffer accesses. 
Target neuron distributions

The distributions of target neurons used with the mathematical models assume that a network of FPGAs is arranged in a 2-D mesh, with each FPGA having 4 neighbours. A varying percentage of target neurons is distributed between FPGAs a given distance from the FPGA holding the source neuron, using the distributions in Table 6.1. Table 6.2 shows the number of target FPGAs and the number of target neurons on each of these FPGAs for distribution 3 from Table 6.1 as an example.

Distribution   % of target neurons at distance           Neuron-distance
number         0     1     2     3     4     5
1              100   0     0     0     0     0           0
2              80    20    0     0     0     0           200
3              70    16    8     6     0     0           500
4              50    20    15    10    5     0           1000
5              30    30    15    15    5     5           1500

Table 6.1: Target neuron distributions used to evaluate synaptic update communication and application algorithms

Distance   % of target neurons   FPGAs   Neurons / FPGA
0          70                    1       700
1          16                    4       40
2          8                     8       10
3          6                     12      5

Table 6.2: Number of target FPGAs and number of target neurons on each of these FPGAs for distribution 3 from Table 6.1

To quantify the degree of locality exhibited by each of these distributions, a measure that will be called neuron-distance is used. This is the sum of the distance from the source neuron over all target neurons in the distribution. For example, the neuron-distance for distribution 3 in Table 6.1 is 0 × 700 + 1 × 160 + 2 × 80 + 3 × 60 = 500 (a short check later in this section recomputes this column for all five distributions).

Off-chip memory accesses

The off-chip memory is modelled as having a word size of 256 bits, a burst size of 8 and a latency of 5 clock cycles between a burst (or single access) being requested and the first word being returned, with the remaining words returned on each of the following clock cycles, matching the assumed characteristics of the DDR2 SDRAM on the DE4 board in Section 5.3.1. A new read request can be made after the last word in a burst has been returned.

For the proposed algorithm it is assumed that 4 fan-out tuples or 8 update tuples are stored in a word (matching what is achieved in Chapter 7), and that these tuples are accessed using burst reads. A unicast algorithm is assumed to perform burst reads at the source FPGA to retrieve tuples of (target FPGA, target neuron, delay, weight), again with 4 tuples in a word and with no further memory accesses required at each target FPGA.

A multicast algorithm is assumed to perform memory accesses in the FPGAs at the “leaves” of the fan-out tree. In each of these FPGAs, a lookup table needs to be scanned to find a pointer to either a set of tuples of (target neuron, delay, weight) in the tuple delay model, or a list of pointers to sets of (delay, weight) tuples, grouped by delay, in the pointer delay model. The lookup table is assumed to contain 10^3 entries (equal to the fan-in), and so scanning it takes log2(10^3) ≈ 10 clock cycles. Fetching tuples from off-chip memory takes the same number of clock cycles as for the proposed algorithm.

Inter-FPGA communication

The number of times that an inter-FPGA message traverses a link between two FPGAs is counted. It is assumed that each message takes 5 clock cycles to traverse a link. A unicast algorithm sends one message per target neuron, with many of these messages traversing more than one link. A multicast algorithm sends messages in a minimum spanning tree between the source and target FPGAs. The proposed algorithm sends one message per target FPGA, with some messages traversing more than one link.
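The neuron-distance measure is straightforward to reproduce. The short check below recomputes the final column of Table 6.1 from the percentage distributions, assuming a fan-out of 10^3 as in the models.

```python
# Recompute the neuron-distance column of Table 6.1 from the distributions.
FAN_OUT = 1000  # target neurons per spike

# % of target neurons at mesh distance 0..5 for distributions 1-5
distributions = {
    1: [100, 0, 0, 0, 0, 0],
    2: [80, 20, 0, 0, 0, 0],
    3: [70, 16, 8, 6, 0, 0],
    4: [50, 20, 15, 10, 5, 0],
    5: [30, 30, 15, 15, 5, 5],
}

def neuron_distance(percentages):
    # Sum of distance from the source neuron over all target neurons.
    return sum(d * pct * FAN_OUT // 100 for d, pct in enumerate(percentages))

for n, pct in distributions.items():
    print(n, neuron_distance(pct))   # 0, 200, 500, 1000, 1500
```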
Delay buffer accesses

With the tuple delay model, and with a unicast algorithm (regardless of delay model), 10^3 tuples need to be inserted into delay buffers and 10^3 tuples need to be removed, taking 2 × 10^3 clock cycles in total. With the pointer delay model the number of pointers that need to be inserted into and removed from delay buffers is counted, which depends on the distribution of neurons between FPGAs, the number of delay sizes used and the threshold used to determine when a second fan-out stage should be used at target FPGAs. Inserting or removing a pointer from a delay buffer is assumed to take one clock cycle.

Delay sizes used

The performance of the pointer delay model depends on the number of delay sizes (from 1 to 16 ms) that are used by each set of synaptic updates. Using more delay sizes partitions the synaptic updates into more sets of tuples at target FPGAs, and results in more memory accesses for smaller sets of tuples, which decreases performance as the number of tuples fetched by each memory request drops below the burst size of off-chip memory. It is assumed that 8 delay sizes are used by each set of synaptic updates to begin with. The effect that the number of delay sizes used has on performance will be examined later.

Figure 6.2: Clock cycles needed to apply synaptic updates for each algorithm using both tuple and pointer delay models (clock cycles ×10^3 against neuron-distance ×10^3 for unicast, multicast and the proposed algorithm). Note that the two plots for unicast overlap

6.6.2 Results

Figure 6.2 shows the total number of clock cycles needed to communicate and apply a set of 10^3 synaptic updates using each algorithm and both the tuple and pointer delay models. Figure 6.3 on the following page and Figure 6.4 on page 99 show how the clock cycles are divided between memory accesses, messaging and delay buffer accesses for each of the delay models.

At first glance it is clear that the proposed algorithm consistently uses fewer clock cycles than either unicast or multicast using the same delay model, though which of the two delay models is better depends on the total neuron-distance. The number of clock cycles needed by the pointer delay model is significantly less than for the tuple delay model when locality is high (lower neuron-distance), but this advantage is gradually eroded as locality decreases (higher neuron-distance), until the tuple delay model takes fewer clock cycles than the pointer delay model.

Figure 6.3: Breakdown of clock cycles needed to apply synaptic updates using the tuple delay model (communication, memory and delay clock cycles for unicast, multicast and the proposed algorithm at neuron-distances of 0, 200, 500, 1000 and 1500)

This can be seen more clearly in Figure 6.5 on page 100, which compares the proposed algorithm and a multicast algorithm using both the tuple and pointer delay models, based on the clock cycles that they require as a percentage of those required by a unicast algorithm. A unicast algorithm takes the same number of clock cycles regardless of which delay model is used.
Figure 6.5 clearly shows that the two delay models are almost equal at a neuron-distance of 1.0 × 10^3, and that the tuple delay model is more efficient when locality decreases further, at a neuron-distance of 1.5 × 10^3.

This can be explained by considering how the two delay models partition the synaptic updates for each spike. While the tuple delay model divides the synaptic updates into sets of tuples for each target FPGA, the pointer delay model further subdivides these sets by delay size. If the number of target neurons on a given FPGA is already small (e.g. one of the peripheral FPGAs in a case with low locality), then this results in pointers to very small sets of update tuples, e.g. only one or two, which makes inefficient use of off-chip memory bandwidth and can result in the pointer delay model being less efficient than the tuple delay model.

Figure 6.4: Breakdown of clock cycles needed to apply synaptic updates using the pointer delay model (communication, memory and delay clock cycles for unicast, multicast and the proposed algorithm at neuron-distances of 0, 200, 500, 1000 and 1500)

This can be seen clearly in Figure 6.6 on page 101, which plots the number of clock cycles taken by the proposed algorithm as a percentage of those taken by a unicast algorithm against the number of delay sizes used in a set of synaptic updates, for each of the five neuron distributions in Table 6.1. The proposed algorithm becomes significantly less efficient for the neuron distributions with less locality (higher neuron-distance) as more delay sizes are used, to the point that it becomes less efficient than unicast for distributions with less locality when synaptic updates are distributed over more than 8 delay sizes.

Figure 6.5: Clock cycles used by the proposed algorithm and multicast as a percentage of those used by unicast, for varying neuron-distance, using both tuple and pointer delay models

6.6.3 Memory size

The memory requirements of communicating and applying synaptic updates were discussed in Section 4.4. That analysis assumed that the tuple delay model was used, and that synaptic updates were stored in memory in full, without any compression or indirection, resulting in a synaptic update data size of 0.6 GB and a bandwidth requirement of 6 GB/s for a neural computation with 10^5 neurons per device, a mean fan-out of 10^3 and a mean spike frequency of 10 Hz. Using the pointer delay model, the size of the synaptic update data, and hence the bandwidth needed to read it, will depend on the locality of a neural network and the number of delay sizes used by synaptic updates.
Figure 6.6: Clock cycles used by the pointer delay model as a percentage of those used by the tuple delay model, for different numbers of delay sizes used by synaptic updates (neuron-distances of 0, 200, 500, 1000 and 1500; the regions where each delay model is more efficient are marked)

The memory size needed by the proposed algorithm using the pointer delay model is shown as a percentage of the size needed for a unicast algorithm in Figure 6.7 on the following page. This shows that the memory size needed is lower with the pointer delay model when locality is high, but it increases sharply as locality decreases, and further as the number of delay sizes increases. This means that a massively parallel neural computation system using the proposed synaptic update communication and application algorithm will use more memory and more memory bandwidth if a neural network exhibits poor locality. This is because the amount of memory used by the fan-out stage of the algorithm approaches the amount of memory used by the update stage when the number of target neurons on each target FPGA becomes small.

Figure 6.7: Change in memory size needed to store synaptic updates for differing numbers of delay sizes used by these updates when using the pointer delay model rather than the tuple delay model (memory size needed as a percentage of unicast, for neuron-distances of 0, 200, 500, 1000 and 1500)

Figure 6.8 on the next page shows the size of memory needed to delay synaptic updates using the pointer delay model as a percentage of the size needed by the tuple delay model, for varying numbers of delay sizes used by these synaptic updates. This shows a similar trend to Figure 6.7, but in this case the memory size needed by the pointer delay model decreases by at least 50% compared to the tuple delay model, even with all delay sizes in use and little locality. If locality is high then the memory size needed by the pointer delay model decreases dramatically, to less than 10% of that needed with the tuple delay model.

The memory needed to delay synaptic updates was originally calculated to be 68 MB in Section 4.4.3, needing a bandwidth of between 8 GB/s and 68 GB/s. If these requirements are reduced to 10% of the original then 6.8 MB of memory would be needed to delay synaptic updates, with a bandwidth of 0.8 GB/s to 6.8 GB/s.

Figure 6.8: Change in memory size needed to delay synaptic updates for differing numbers of delay sizes used by these synaptic updates

While the increase in memory needed to store synaptic updates as locality decreases is less than ideal, the analysis of the locality of biological neural networks in Section 3.3.2 indicates that locality is expected to be high, and so there will be a large number of target neurons on each target FPGA.
When combined with the dramatic reduction in memory needed to delay synaptic updates in all cases, this means that the proposed synaptic update communication and application al- gorithm appears to meet the goals of efficient off-chip memory usage, minimal on-chip memory usage and avoiding excessive use of inter-FPGA communication set in Section 6.2. 104 6.7. CONCLUSION 6.7 Conclusion The proposed synaptic update communication and application algorithm reduces the memory size, memory bandwidth and inter-FPGA communication needed compared to unicast and multicast algorithms when locality in a neural network is high. There is a more mixed outlook as locality decreases, particularly an increase in the amount of memory needed to store synaptic updates, but given the analysis in Section 3.3.2 that shows that biological neural networks exhibit high locality, any cases of low locality can be considered to be “out of spec.” At worst a massively parallel neural computation system based on this algorithm would exhibit reduced performance and take longer than real-time if it were used to perform a neural computation using a neural network with low locality. It was previously determined that all I-values will have to be stored in on-chip memory. It is clear that there is insufficient on-chip memory to store the remainder of the neural network description, and so it will need to be stored in off-chip memory, and hence off-chip memory bandwidth must be used as efficiently as pos- sible to maximise the size of neural network that can be handled in real-time, as predicted by the Bandwidth Hypothesis. Since the synaptic update communication and application algorithm is designed to group data into as large blocks as is possible, this means that high bandwidth usage efficiency can be achieved by using burst reads, unlike a multicast algorithm which has to use a large number of single-word reads, which makes much less efficient use of bandwidth. The reduction in the amount of memory needed to delay synaptic updates com- pared to other algorithms is such that in some cases there will be sufficient on-chip memory (beyond that needed for the I-values) for the entire delay buffer to fit in on-chip memory, providing random access (needed to deal with incoming point- ers as soon as they arrive and avoid deadlocking the inter-FPGA communication system or the fan-out stage of the algorithm) without inefficient use of memory bandwidth. However this will not always be the case, and so a hybrid solution to delaying synaptic updates will be appropriate, involving a combination of on-chip memory and off-chip memory, with the “middle” values in each delay buffer (which are neither recently arrived nor about to be drained and processed) being “spilled” to off-chip memory. This would allow random access to a delay buffer for incoming and outgoing pointers as well as sufficient size to delay larger numbers of synaptic updates if needed. Chapter7 Implementation 105 106 7.1. INTRODUCTION 7.1 Introduction Given the analysis so far, it is clear that a multi-FPGA massively parallel neural computation system must be implemented with careful use of on-chip memory and making as efficient use as possible of off-chip memory bandwidth, as suggested by the Bandwidth Hypothesis. This will be achieved by splitting the neural computation algorithm down into sev- eral stages, each implemented as a separate hardware block. Each of these blocks will make independent sets of off-chip memory accesses using burst reads. 
The aim is for there to always be a hardware block available to request a new off-chip memory access while other blocks are processing the data returned to them. For each hardware block we will analyse how its design makes efficient use of off- chip memory bandwidth, as the Bandwidth Hypothesis suggests that this is critical when designing a massively parallel neural computation system that operates in real-time. 7.2 Splitting the algorithm The implementation of a massively parallel neural computation system on the Bluehive platform has the following hardware blocks in each FPGA: Equation processor Evaluates the neuron modelling equation for every neuron every 1 ms sampling interval. Passes a pointer to blocks of fan-out tuples to the fan-out engine when a neuron spikes. Fan-out engine Performs the fan-out stage of the algorithm. Router Provides inter-FPGA communication for both communicating synaptic up- dates and command and control of the neural computation system. Delay unit Delays synaptic updates. Accumulator Applies synaptic updates to target neurons. Spike auditor Records spike events to output as the computation results. CHAPTER 7. IMPLEMENTATION 107 Inter-FPGA Links Off-Chip Memory Interface Fan-outEquation Spike Injector Spike Auditor Accumulator Delay Figure 7.1: Architecture of an FPGA in the massively parallel neural computation system. Solid lines indicate a flow of messages while dashed lines indicate memory accesses Spike injector Allows external spike events to be injected into a neural network. This is used to provide an initial stimulus. It could also be used to interface to external systems (e.g. sensors). Control state machine Coordinates the setup and progress of a neural computation. Control logic Sets up the neural computation system and advances the sampling interval in response to commands from the control state machine. The architecture of each FPGA in the massively parallel neural computation system is shown in Figure 7.1. Note that the control state machine and control logic are not shown. 7.2.1 Equation processor The equation processor needs to evaluate the neuron modelling equation (Equa- tion 3.10) for every neuron every 1 ms sampling interval. It reads the parameters for each neuron from off-chip memory, evaluates the equation, writes back the results and sends a pointer to the fan-out engine if the neuron spiked. 108 7.2. SPLITTING THE ALGORITHM Input The main input is a stream of neuron modelling equation parameters. Each neuron has 128 bits of parameters, which are stored in off-chip memory. These are laid out as shown in Figure 7.2. There is also an I-value, which is fetched from the accumulator (described in Section 7.2.4), and a signal from the control logic that indicates that a new 1 ms sampling interval has started. When a new 1 ms sampling interval starts this triggers a series of burst reads to read in the neuron modelling equation parameters for all of the neurons on the FPGA from off-chip memory. The data that is returned is a stream of 256 bit words, with each word containing the parameters for two neurons. There is enough buffer space to store the stream returned by two full burst reads of 8 words each. A new burst read is requested whenever there are at least 8 free spaces in the buffer, until all of the neuron modelling equation parameters have been fetched. 0 16 32 48 64 80 96 128 V U A B C D pointer to block of fan-out tuples Figure 7.2: Layout of neuron modelling equation parameters in off-chip memory. 
The parameters are those of Equation 3.10.

Function

The equation processor takes the stream of 256 bit words from off-chip memory and evaluates the next state for the two neurons that are represented by the two sets of equation parameters in each word. This is done using two parallel pipelines, each with 6 stages:

1. Request the I-value from the accumulator
2. V2 = ((2621 × V) >> 16) + 1536 and U2 = A × V
3. V3 = ((V × V2) >> 8) + 35840 and U3 = (U2 + B × U) >> 16
4. V4 = V3 + (I << 8) − U and U′ = U3 + U
5. If V4 ≥ (30 << 8) then V′ = C, U′ = U + D and spiked = true; otherwise V′ = V4 and spiked = false. Send the updated parameters to the output stage
6. If spiked, send the pointer to the fan-out engine and the neuron identifier to the spike auditor

The original equation parameters are passed between stages alongside the additional intermediate values (V2 to V4, U2 and U3). All intermediate values use 32 bit fixed-point precision to avoid loss of accuracy.

Output

The output of the equation processor is the next state for two neurons, represented by an updated 256 bit word (laid out as in Figure 7.2). The updated words are buffered until there are 8 of them and then they are written back to off-chip memory using a burst write. Because the off-chip memory controller has separate wiring for data being read and data being written, it is possible to interleave this burst write with burst reads of parameters for other neurons, which significantly increases throughput compared to serialising burst reads and burst writes.

Note that the parameters A to D and the pointer do not change, and are only written back to off-chip memory along with V and U because the memory controller expects to receive write commands for whole 256 bit words.

If either or both of the neurons being evaluated spikes, then one or two pointers to blocks of fan-out tuples will be passed to the fan-out engine. These pointers are laid out as in Figure 7.3. If both neurons spike then their pointers are serialised before being passed to the fan-out engine. This causes a stall if the fan-out engine's input buffer is full, but in the vast majority of cases the gain in efficiency of evaluating the state of two neurons in parallel outweighs the penalty of these stalls. Identifiers for neurons that spike are also sent to the spike auditor.

Figure 7.3: Layout of a pointer to a block of fan-out tuples (a 32 bit word containing a length field and an address field)

Summary

Input: off-chip memory — (V, U, A, B, C, D) and a pointer to a block of fan-out tuples (2 × 128 bit chunks every clock cycle); accumulator — I; control logic — new sampling interval signal

Function: neuron modelling equation (Equation 3.10)

Output: off-chip memory — (V, U, A, B, C, D) and a pointer to a block of fan-out tuples (2 × 128 bit chunks every clock cycle); if V ≥ 30 mV: fan-out engine — pointer to block of fan-out tuples; spike auditor — neuron identifier

Efficiency

The equation processor makes efficient use of off-chip memory bandwidth by:

• Using burst reads to fetch large blocks of input data
• Having requests for further blocks of input data ready so that they can be actioned whenever the memory controller is free
• Making full use of the 256 bit words returned by off-chip memory
• Evaluating the neuron modelling equation in a single clock cycle on average, which consumes input words as quickly as possible and avoids stalls
• Writing back updated values of V and U efficiently, interleaving writes with reads to avoid using extra bandwidth
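The fixed-point arithmetic in stages 2 to 5 can be checked with a bit-accurate software model. The sketch below mirrors those stages directly; A to D are assumed to be the pre-scaled fixed-point parameters read from off-chip memory, and unbounded Python integers stand in for the 32 bit datapath, so it is a reference model of the pipeline arithmetic rather than the hardware itself.

```python
# Software reference for pipeline stages 2-5 of the equation processor.
# V and U are the fixed-point state values; A, B, C, D are the pre-scaled
# parameters from off-chip memory; I is the accumulated input for this neuron.
def equation_step(V, U, A, B, C, D, I):
    V2 = ((2621 * V) >> 16) + 1536          # stage 2
    U2 = A * V
    V3 = ((V * V2) >> 8) + 35840            # stage 3
    U3 = (U2 + B * U) >> 16
    V4 = V3 + (I << 8) - U                  # stage 4
    U_next = U3 + U
    if V4 >= (30 << 8):                     # stage 5: threshold at 30 mV
        return C, U + D, True               # reset V, bump U, report a spike
    return V4, U_next, False
```

A software simulation would call this once per neuron per 1 ms sampling interval, feeding the returned V and U back in on the next interval.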
IMPLEMENTATION 111 7.2.2 Fan-out engine Input The fan-out engine receives pointers to blocks of fan-out tuples (Figure 7.3) from the equation processor. Function Each pointer is used to burst read a block of fan-out tuples from off-chip memory. The layout of these tuples is shown in Figure 7.4. These tuples are either targeted at the delay unit on this or another FPGA (delay 6= 0) or at the fan-out engine on another FPGA (delay = 0). 0 32 34 38 58 64 target FPGA delay address length Figure 7.4: Layout of a fan-out tuple Output The fan-out tuples are sent to the delay unit if their target is the current FPGA, and are otherwise sent to the router to be communicated to their target FPGAs. Summary Input equation processor pointer to block of fan-out tuples off-chip memory fan-out tuples Function fetch fan-out tuples Output delay unit fan-out tuples for current FPGA router fan-out tuples for other FPGAs Efficiency The fan-out engine makes efficient use of off-chip memory bandwidth by: • Using burst reads to fetch blocks of fan-out tuples • Fitting multiple fan-out tuples into a 256 bit word • Reducing the off-chip memory usage of the delay unit 112 7.2. SPLITTING THE ALGORITHM 7.2.3 Delay unit Input The delay unit receives fan-out tuples from the fan-out engines on both the current FPGA and other FPGAs in the neural computation system (via the router and inter- FPGA communication system). Function The fan-out tuples are stripped of their target FPGA and delay fields to become pointers to blocks of update tuples. These have the same format as the pointers to blocks of fan-out tuples in Figure 7.3. These pointers are then inserted into a delay buffer. The delay buffers are implemented by 16 FIFOs, one for each delay size (in 1 ms increments). FIFOs are assigned to delay sizes in a cyclical fashion and every 1 ms, the “2 ms” FIFO logically becomes the “1 ms” FIFO and the “1 ms” FIFO becomes the “0 ms” FIFO, and so on. The contents of the current “0 ms” FIFO are drained and become the output. This matches work by Jin et al. (2008). As shown in Section 6.6.3, the maximum total size of the delay buffers depends on the locality of a neural neural network and the number of delay sizes in use, with an estimate of 6.8 MB, requiring a bandwidth of 0.8 GB/s to 6.8 GB/s. This is clearly too much data to fit into on-chip memory, so off-chip memory will have to be used. However, FIFO semantics should be retained, particularly as the delay unit must consume input immediately to avoid deadlocks elsewhere in the system. Therefore the delay unit is implemented using 16 “spillable FIFOs.” These use a small FIFO at the input and another at the output. When the input FIFO becomes full data is transferred to off-chip memory using a burst write. When the output FIFO becomes empty this data is fetched back using a burst read. Output The main output from the delay unit is a stream of pointers to blocks of update tuples. These are sent to the accumulator. These pointers share their format with that in Figure 7.3. There are also off-chip memory accesses from the spillable FI- FOs. CHAPTER 7. 
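The behaviour of a spillable FIFO can be pictured with a short software sketch. This is an illustrative C++ model, not the hardware implementation: the off-chip memory is stood in for by a std::deque, the burst size and on-chip depths are hypothetical, and the cyclic reassignment of the 16 delay FIFOs is omitted.

#include <cstddef>
#include <deque>
#include <queue>

// Illustrative model of a "spillable FIFO": small on-chip input and output
// FIFOs, with excess elements spilled to off-chip memory in bursts.
template <typename T, size_t OnChipDepth = 16, size_t Burst = 8>
class SpillableFifo {
public:
    // Enqueue never blocks the producer: if the input FIFO fills up,
    // a whole burst is spilled to off-chip memory.
    void enqueue(const T& item) {
        in_.push(item);
        if (in_.size() >= OnChipDepth) spill_burst();
    }

    // Dequeue refills the output FIFO, draining spilled data first so that
    // overall FIFO ordering is preserved.
    bool dequeue(T& item) {
        if (out_.empty()) refill();
        if (out_.empty()) return false;   // FIFO genuinely empty
        item = out_.front();
        out_.pop();
        return true;
    }

private:
    void spill_burst() {
        for (size_t i = 0; i < Burst && !in_.empty(); ++i) {
            offchip_.push_back(in_.front());   // stands in for a burst write
            in_.pop();
        }
    }
    void refill() {
        if (!offchip_.empty()) {
            for (size_t i = 0; i < Burst && !offchip_.empty(); ++i) {
                out_.push(offchip_.front());   // stands in for a burst read
                offchip_.pop_front();
            }
        } else {
            while (!in_.empty() && out_.size() < OnChipDepth) {
                out_.push(in_.front());
                in_.pop();
            }
        }
    }

    std::queue<T> in_, out_;   // small on-chip FIFOs
    std::deque<T> offchip_;    // stand-in for burst-accessed off-chip memory
};

Because the enqueue path always accepts input, the delay unit can consume fan-out tuples immediately, which is what prevents deadlocks elsewhere in the system.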
IMPLEMENTATION 113 Summary Input fan-out engine or fan-out tuples router control logic new sampling interval signal off-chip memory burst read FIFO data previously spilled to external memory Function sort pointers into FIFO delay buffers Output accumulator pointer to block of update tuples off-chip memory burst write FIFO data spilled to external memory Efficiency The delay unit makes efficient use of off-chip memory bandwidth by: • Delaying pointers to blocks of synaptic updates rather than the updates them- selves • Allowing excess data to be spilled from premium on-chip memory to more abundant off-chip memory • Using circular allocation for delay buffers to avoid needing to copy data between buffers every 1 ms 7.2.4 Accumulator The accumulator sums synaptic updates for each neuron to produce I-values. Since this operation occurs at the leaves of a tree with high fan-out it is the inner loop of the neural computation algorithm, which makes it the most performance critical task in massively parallel neural computation. Input The accumulator receives a stream of pointers to blocks of update tuples from the delay unit and separately requests for I-values from the equation processor. It also receives a signal marking the end of a 1 ms sampling interval from the control logic. 114 7.2. SPLITTING THE ALGORITHM Function Burst reads are used to fetch streams of update tuples from off-chip memory. The layout of these tuples is shown in Figure 7.5. The tuples are packed in to 256 bit words with up to 8 tuples per word. To make good use of off-chip memory band- width all 8 tuples should be processed in parallel, in a single clock cycle on average. This is done by separating the core of the accumulator into 8 banks, each holding the I values for 1/8 of the neurons on the FPGA in on-chip memory. 0 16 32 neuron id weight Figure 7.5: Layout of an update tuple Neurons that frequently receive synaptic updates together (as a result of locality) are allocated to different banks so that they can be updated in parallel, which res- ults in up to 8 updates being applied per clock cycle in the best case. Buffering is provided to prevent deadlock if multiple update tuples in an input word target neurons in the same bank. This statistical multiplexing means that there will only be stalls when there are an unusually high number of updates to neurons in the same bank. The update tuples are fed into the core of the accumulator block as shown in Fig- ure 7.6 on the next page. This figure shows 4 tuples per input word and 4 banks for clarity. This processes the update tuples as follows: 1. Each tuple is fed to a bank selector, which identifies which of the 8 banks holds the I-value for the neuron being updated by the tuple 2. The tuple is routed to a FIFO queue in its target bank. There is 1 FIFO per bank per position in the input word to allow update tuples to appear in any position in the input word. Without these FIFOs, 2 update tuples targeting the same bank would cause a stall 3. Update tuples are dequeued from the set of FIFOs in each bank in a round- robin fashion 4. The current I-value for the target neuron is fetched from on-chip memory 5. The weight in the update tuple is added to the current I-value 6. The new I-value is stored to on-chip memory CHAPTER 7. 
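A simplified software model of the banked accumulation is sketched below. The decoding of eight update tuples per 256-bit word follows Figure 7.5, but the bank-selection rule (neuron identifier modulo 8), the tuple ordering within the word and the sequential application are simplifications of this sketch; the hardware applies the eight tuples to eight banks in the same clock cycle.

#include <array>
#include <cstdint>
#include <vector>

constexpr int kBanks = 8;

struct Accumulator {
    // One I-value array per bank; in this sketch neuron n lives in bank
    // n % kBanks at local index n / kBanks.
    std::array<std::vector<int32_t>, kBanks> i_values;

    explicit Accumulator(size_t neurons_per_fpga) {
        for (auto& bank : i_values)
            bank.assign((neurons_per_fpga + kBanks - 1) / kBanks, 0);
    }

    // Apply the eight update tuples packed into one 256-bit memory word,
    // modelled here as four 64-bit words.  Each 32-bit tuple holds a 16-bit
    // neuron id and a 16-bit signed weight (Figure 7.5); which half holds
    // which field is an assumption of this sketch.
    void apply_word(const std::array<uint64_t, 4>& word256) {
        for (int t = 0; t < 8; ++t) {
            uint32_t tuple  = static_cast<uint32_t>(word256[t / 2] >> ((t % 2) * 32));
            uint16_t id     = static_cast<uint16_t>(tuple & 0xFFFF);
            int16_t  weight = static_cast<int16_t>(tuple >> 16);
            i_values[id % kBanks][id / kBanks] += weight;
        }
    }

    int32_t read(uint32_t neuron_id) const {
        return i_values[neuron_id % kBanks][neuron_id / kBanks];
    }
};

Double buffering is handled in this model simply by keeping two Accumulator instances and swapping them at each sampling interval.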
IMPLEMENTATION 115 Update Tuple Update Tuple Update Tuple Update Tuple Read Address Read Data Store I-Values Schedule Read Add Write Store I-Values Schedule Read Add Write Store I-Values Schedule Read Add Write Store I-Values Schedule Read Add Write Bank Selector Bank Selector Bank Selector Bank Selector Read Selector Figure 7.6: Layout of the core of the accumulator. 4 banks are shown for clarity, there are actually 8 banks There are two complete copies of the I-values in the accumulator, one used to apply updates for the next sampling interval and the other used to supply the current I value for each neuron to the equation processor. These copies are swapped around every sampling interval. Output The output of the accumulator is a stream of I-values that are requested by the equation processor. Each of the I-values will be read once per 1 ms sampling in- terval. 116 7.2. SPLITTING THE ALGORITHM Summary Input delay unit pointer to block of update tuples off-chip memory blocks of update tuples equation processor requests for I-values control logic new sampling interval signal Function apply the updates to the appropriate I-values in parallel Output equation processor I-values Efficiency The accumulator makes efficient use of off-chip memory bandwidth by: • Consuming a whole 256 bit word from off-chip memory every clock cycle • Providing buffering to prevent stalls in all but the most pathological distribu- tions of update tuples between banks • Keeping I-values in on-chip memory to allow for frequent updates with low latency 7.2.5 Memory spike source The memory spike source allows a “script” of spikes that need to be injected into a neural network to be fetched from off-chip memory. This is particularly useful when providing some initial stimulus to a neural network at the start of a neural computation. Input A stream of tuples is fetched from off-chip memory. Each tuple contains the num- ber of the sampling interval that the spike should be injected in, a neuron identifier and a weight, laid out as shown in Figure 7.7. 0 32 64 80 sampling interval number neuron identifier weight Figure 7.7: Layout of an injected spike tuple CHAPTER 7. IMPLEMENTATION 117 Function Injected spike tuples need to be fetched from off-chip memory in advance of the sampling interval when they need to be applied. This is achieved by buffering a whole burst of 24 tuples and then fetching a new burst when the buffer is almost empty. Output A stream of tuples is sent to the spike injector to be combined with injected spikes from other sources. Summary Input off-chip memory stream of injected spike tuples Function fetch tuples from memory in advance of their sampling interval Output spike injector stream of injected spike tuples Efficiency The memory spike source makes efficient use of off-chip memory bandwidth by fetching injected spike tuples using burst reads and buffering them until they are needed. 7.2.6 Spike injector The spike injector allows injected spikes to be applied to neurons in a neural net- work from external sources It is used to provide some initial stimulus to start a computation (sourcing these injected spikes from the memory spike source), and could also be used to interface to external systems. Input Injected spikes are taken from multiple sources in the form of tuples of the same form as used by the memory spike source (Figure 7.7). Input can be either from the memory spike source or from other sources such as external sensors. 
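The prefetching behaviour of the memory spike source (Section 7.2.5) can be illustrated with the following sketch, in which the burst read is stood in for by a callback returning up to 24 tuples and the refill threshold is a hypothetical value.

#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Injected spike tuple (Figure 7.7): sampling interval, neuron id, weight.
struct InjectedSpike {
    uint32_t interval;
    uint32_t neuron;
    int16_t  weight;
};

// Illustrative model of the memory spike source; fetch_burst stands in for
// a burst read of up to 24 tuples from off-chip memory.
class MemorySpikeSource {
public:
    explicit MemorySpikeSource(std::function<std::vector<InjectedSpike>()> fetch_burst)
        : fetch_burst_(std::move(fetch_burst)) {}

    // Pop the next tuple, prefetching another burst when the buffer runs low
    // so tuples are available ahead of the sampling interval that needs them.
    bool next(InjectedSpike& out) {
        if (buffer_.size() <= kRefillThreshold && !exhausted_) {
            auto burst = fetch_burst_();
            if (burst.empty()) exhausted_ = true;
            buffer_.insert(buffer_.end(), burst.begin(), burst.end());
        }
        if (buffer_.empty()) return false;
        out = buffer_.front();
        buffer_.pop_front();
        return true;
    }

private:
    static constexpr size_t kRefillThreshold = 4;   // hypothetical "almost empty"
    std::deque<InjectedSpike> buffer_;
    std::function<std::vector<InjectedSpike>()> fetch_burst_;
    bool exhausted_ = false;
};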
It is assumed that the tuples presented from each source are already ordered by sampling interval number. The current sampling interval number and a next sampling interval signal are sourced from the control logic. 118 7.2. SPLITTING THE ALGORITHM Function If an input tuple is for the current sampling interval then it is sent to the output. Input tuples for sampling intervals that have already passed are discarded. Output Injected spike tuples are sent to the accumulator so that they can be applied to their target neurons. Summary Input memory spike source streams of injected spike tuples and other sources control logic current sampling interval number Function select tuples for the current sampling interval Output accumulator stream of injected spike tuples 7.2.7 Spike auditor The spike auditor records spike events that are generated by a neural computation. Each FPGA maintains a record of spike events generated by its neurons in off-chip memory. Input A stream of neuron identifiers from the equation processor indicate which neurons have spiked. The current sampling interval number is provided by the control logic. Function Neuron identifiers are combined with the current sampling interval number to form tuples, laid out as in Figure 7.8 on the next page. These tuples are buffered until there are 64, which fill a whole burst write of 8× 256 bit words. CHAPTER 7. IMPLEMENTATION 119 0 32 64 sampling interval number neuron identifier Figure 7.8: Layout of a spike auditor tuple Output A block of tuples is written to off-chip memory using a burst write. These tuples can be read by a host PC to produce an output from the neural computation. Summary Input equation processor neuron identifiers control logic current sampling interval number Function create spike auditor tuples Output off-chip memory blocks of spike auditor tuples Efficiency The spike auditor makes efficient use of off-chip memory bandwidth by writing out spike event records 256 bits at a time. 7.2.8 Control state machine The control state machine runs on a single FPGA in the system. As well as con- trolling setup of the inter-FPGA communication system (see Section 7.4), it also sets up the neural computation system (including memory offsets for each hard- ware block) and controls the progress of the computation by determining when a sampling interval has ended and then requesting that all FPGAs advance to the next sampling interval. Input Messages are received from the control logic in all FPGAs in the system (includ- ing the FPGA that also contains the control state machine) to indicate when the sending FPGA has finished the setup process or finished processing all of the com- putation for the current sampling interval. 120 7.2. SPLITTING THE ALGORITHM Function In the setup phase each the messages that indicate that an FPGA has finished setup are counted until the total number of messages matches the total number of FPGAs in the system. At that point all FPGAs will have finished setup and the neural computation can be started. Once the neural computation has been started a similar process is used to count messages from each FPGA that indicate that it has finished the computation for the current sampling interval. Once a message has been received from all FPGAs the next sampling interval is started. Output A message is sent to the control logic in all FPGAs in the system to start the setup process. 
When a new sampling interval starts a message is sent to the control logic in all FPGAs in the system containing the new sampling interval number. Summary Input control logic finished setup and finished (via inter-FPGA communication) computation messages Function determine when setup has finished and when to start the next sampling interval Output control logic setup request and new sampling (via inter-FPGA communication) interval messages 7.2.9 Control logic The control logic provides the current sampling interval number to other hard- ware blocks, determines when an FPGA has finished computation for a sampling interval and receives commands from the control state machine. CHAPTER 7. IMPLEMENTATION 121 Input Messages from the control state machine indicate when the setup process should be started. Other messages indicate when a new sampling interval has started and the number of this sampling interval. Signals from other hardware blocks indicate when they have finished computation for the current sampling interval. Function Status signals from other hardware blocks are monitored to determine when setup has finished and when computation for the current sampling interval has com- pleted. Output A message is sent to the control state machine when setup has finished or com- putation for the current sampling interval has completed. A signal is sent to each hardware block when a new sampling interval begins as well as the number of this sampling interval. Summary Input control state machine setup and next sampling (via inter-FPGA communication) interval messages other hardware blocks finished computation signal Function determine when all other hardware blocks are ready for the next sampling interval Output control state machine setup finished and ready for next (via inter-FPGA communication) sampling interval messages other hardware blocks new sampling interval signal sampling interval number 122 7.3. INPUT DATA 7.3 Input data A neural network is programmed into the massively parallel neural computation system using memory image files that are copied into the off-chip memory of each FPGA from a host PC. These image files are generated in advance on the host PC from a neural network description. Each of the hardware blocks in the system that accesses off-chip memory will accesses data from the memory image to perform its part of the neural computation. 7.3.1 Neural network description format The input to the memory image file generator consists of two text files, one de- scribing each of the neurons in a neural network and one describing the synaptic connections. Neuron file The neuron file contains the parameters of the neuron modelling equation for a single neuron on each line, separated by spaces. An example extract from this file is shown in Table 7.1. These parameters are for the original version of the Izhikevich neuron modelling equation (Equation 3.1). They are translated to the values required by Equation 3.10 as part of the memory image file generation pro- cess. There are also parameters for the magnitude of an initial injected spike (In) and the sampling interval when this initial spike should be injected (n). Neuron Id v0 u0 a b c d In n 1000 -70.0 -14.0 0.02 0.20 -65.0 6.4 20 1 1001 -75.0 -15.0 0.03 0.25 -66.0 6.2 0 0 1002 -60.0 -13.0 0.02 0.20 -67.0 6.3 0 0 1003 -70.0 -12.0 0.03 0.25 -68.0 6.7 0 0 Table 7.1: Example extract from neuron file CHAPTER 7. 
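As an illustration, the neuron file could be parsed on the host PC along the following lines; the struct and function names are hypothetical, and the conversion of the parameters to the fixed-point form of Equation 3.10 (Section 3.2) is deliberately left out of this sketch.

#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One line of the neuron file (Table 7.1): id, v0, u0, a, b, c, d, In, n.
struct NeuronRecord {
    uint32_t id;
    double v0, u0, a, b, c, d;   // Equation 3.1 parameters (floating point)
    double in_magnitude;         // magnitude of initial injected spike (In)
    uint32_t in_interval;        // sampling interval of initial spike (n)
};

// Parse the space-separated neuron file.  Conversion to the fixed-point
// parameters of Equation 3.10 happens later in image generation.
std::vector<NeuronRecord> parse_neuron_file(const std::string& path) {
    std::vector<NeuronRecord> neurons;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::istringstream fields(line);
        NeuronRecord r{};
        if (fields >> r.id >> r.v0 >> r.u0 >> r.a >> r.b >> r.c >> r.d
                   >> r.in_magnitude >> r.in_interval)
            neurons.push_back(r);
    }
    return neurons;
}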
IMPLEMENTATION 123 Connection file The connection file contains the parameters of a synaptic connection on each line, separated by spaces. An example extract from this file is shown in Table 7.2. Source Neuron Target Neuron Weight Delay 10 25 11 2 12 17 26 6 1012 557 -22 5 2018 2002 -9 15 Table 7.2: Example extract from connection file 7.3.2 Memory image file format The format of a memory image file for each FPGA is shown in Figure 7.9 on the following page. After a header there are four major blocks of data: neural model- ling equation parameters, fan-out tuples, update tuples and spike injector tuples. The format and meaning of each of these tuples was defined in Section 7.2. The layout of these tuples is reproduced for completeness in Figure 7.10 on the next page. The header consists of a 256 bit word that contains the offsets of the neuron mod- elling equation parameter, fan-out tuple, update tuple and injected spike tuple re- gions. These offsets are read by the control logic as part of the system setup process and are then used by the equation processor, fan-out engine, accumulator and memory spike source when reading blocks of tuples from off-chip memory. The layout of the header is shown in Figure 7.11 on page 125. 7.3.3 Memory image file generation As well as translating the input data files from Section 7.3.1 into a format that is usable by the system, the major task involved in generating memory image files for each FPGA is building blocks of tuples and populating other tuples with pointers to these blocks. This process is performed on the host PC in advance of a computation on the system. The memory image files are generated by 4 separate processes (one for each type of tuple), with some cross references between them as a result of the pointers to different types of tuple. 124 7.3. INPUT DATA } Header hhhhhhhhhhhhhhhhhhhhhhhhhh hhhh hhhh hhhh hhhh hhhh hhhh hh  Neuron Parameters hhhhhhhhhhhhhhhhhhhhhhhhhh hhhh hhhh hhhh hhhh hhhh hhhh hh  Fan-out Tuples hhhhhhhhhhhhhhhhhhhhhhhhhh hhhh hhhh hhhh hhhh hhhh hhhh hh  Update Tuples hhhhhhhhhhhhhhhhhhhhhhhhhh hhhh hhhh hhhh hhhh hhhh hhhh hh  Injected Spike Tuples Figure 7.9: Format of a memory image file 0 16 32 48 64 80 96 128 V U A B C D pointer Neuron modelling equation parameters 0 32 34 38 58 64 target FPGA delay address length Fan-out tuple 0 16 32 neuron id weight Update tuple 0 32 64 80 sampling interval number neuron identifier weight Injected spike tuple Figure 7.10: Formats of tuples used by the neural computation system. Reproduced from Section 7.2 CHAPTER 7. IMPLEMENTATION 125 0 32 64 96 128 256 Neuron Parameter Offset Fan-out Tuple Offset Update Tuple Offset Injector Tuple Offset Figure 7.11: Layout of a memory image file header. The upper 128 bits are unused and are not shown The memory image file generation algorithm maintains data structures that repres- ent the off-chip memory of each FPGA, as well as data structures used to generate each type of tuple. These data structures are serialised, combined and a header added to produce the final memory image file for each FPGA. Neuron modelling equation parameters The neuron modelling equation parameters are generated from the neuron file and are converted to fixed-point parameters for Equation 3.10. The pointers to blocks of fan-out tuples are added once the update and fan-out tuples have been generated and allocated memory addresses. 
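The header of Figure 7.11 can be emitted with a few lines of host-side C++, sketched below. Whether the offsets count bytes or 256-bit words is not specified at this point and is treated as an assumption of the surrounding tooling.

#include <array>
#include <cstdint>

// Region offsets stored in the memory image header (Figure 7.11).
struct ImageHeader {
    uint32_t neuron_params;    // offset of neuron modelling equation parameters
    uint32_t fanout_tuples;    // offset of fan-out tuple blocks
    uint32_t update_tuples;    // offset of update tuple blocks
    uint32_t injector_tuples;  // offset of injected spike tuples
};

// Pack the header into one 256-bit word, modelled as eight 32-bit words.
// The four offsets occupy bits 0-127; the upper 128 bits are unused.
std::array<uint32_t, 8> pack_header(const ImageHeader& h) {
    return {h.neuron_params, h.fanout_tuples, h.update_tuples,
            h.injector_tuples, 0, 0, 0, 0};
}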
Fan-out and update tuples The algorithm used to generate blocks of fan-out and update tuples and pointers to these blocks works by replicating the path of every synaptic update through the multi-FPGA, massively parallel neural computation system. Every connection in the connection file is converted to an object that represents the progress of a synaptic update from its source neuron to its target neuron, with a linked list to record the path taken by the update. Another data structure (one for each FPGA) records the updates that have passed through or arrived at each FPGA, grouped by source neuron identifier and delay. Each synaptic update is routed from its source neuron to its target neuron in the same way that it would be in the multi-FPGA, massively parallel neural computa- tion system. When this process has finished, each FPGA will have lists of updates that are targeted at it, grouped by source neuron identifier and delay. These lists are converted to blocks of update tuples and are allocated to memory addresses, allowing pointers to these blocks to be produced. The path taken by a single update in each block is then followed in reverse, al- lowing blocks of fan-out tuples to be produced. Again these tuples are allocated to memory addresses so that pointers to these blocks can be produced. Mul- tiple blocks of fan-out tuples on different FPGAs can be generated for each source neuron. 126 7.4. INTER-FPGA COMMUNICATION Finally the pointer to the block of fan-out tuples for each source neuron that is on the same FPGA as that neuron is added to the neuron modelling equation paramet- ers. Injected spike tuples The injected spike tuples are generated by reading the parameters In and n from the neuron file. 7.4 Inter-FPGA communication The inter-FPGA communication system allows FPGAs in the massively parallel neural computation system to communicate, both to communicate synaptic up- dates and for coordination between the control state machine and control logic on each FPGA. The inter-FPGA communication system makes use of the high-speed transceivers integrated into the Stratix IV FPGA on the DE4 board, as well as some custom physical connectivity, link layer protocols and a routing system. The end result is a communication system that allows any hardware block on any FPGA to com- municate with any other block on any other FPGA, without being aware of the physical topology of the FPGAs, and with the assumption that communication is reliable. 7.4.1 Physical layer Physical connectivity between FPGAs is provided using the high-speed transceiv- ers provided by the Stratix IV FPGA. A combination of 4 SATA connectors on the DE4 board and a further 8 on a breakout board (see Section 5.4.1) gives 12 bidirec- tional, high-speed links per FPGA. While SATA connectors and cabling are used because of their low cost, reliability and interoperability, this does not limit the links between FPGAs to the protocols and data rates defined by industry standards (nominally 3 Gbit/s). The Stratix IV transceivers support data rates of up to 10 Gbit/s in some cases. A data rate of 6 Gbit/s is used as this is the highest data rate that the Altera design tools will ac- cept given the range of available input clock frequencies. Theo Markettos has con- firmed that the physical links (using both the SATA connectors on the DE4 board as well as those on the breakout board) have a bit error rate of less than 1 error in 1015 bits at 6 Gbit/s. With 12 bidirectional links per board a variety of topologies could CHAPTER 7. 
IMPLEMENTATION 127 be supported. However the current version of the massively parallel neural com- putation system only makes use of 4 links to produce a two-dimensional torodical mesh, as this is sufficient for the number of FPGAs in use. 7.4.2 Link layer The link layer was developed by Simon Moore. It encapsulates each physical link to provide a communications channel which: • Presents a FIFO interface to higher-level logic • Marshals data from a flit-based format to individual 8B10B symbols • Handles booting and resetting of the physical link and provides presence de- tection • Provides reliability with retransmission of failed flits as needed Each physical link on a FPGA is wrapped in link layer logic. The resulting reliable links are then connected to the routing infrastructure to produce an inter-FPGA communication system. Data is sent from and received by higher-level logic using flits. The layout of a flit is shown in Figure 7.12. This has fields to indicate the start and end of a packet if flits are combined to produce larger packets and also a channel field that is used to indicate the message type and where it should be delivered and a data field. 0 1 2 11 12 75 sop eop channel data Figure 7.12: Layout of a link layer flit. sop and eop stand for “start of packet” and “end of packet” respectively 7.4.3 Routing To allow communication between any pair of FPGAs in a multi-FPGA system, a routing system is needed. The routing system is designed to be used by any ap- plication on the multi-FPGA platform, regardless of the number of FPGAs and the number of dimensions in their topology. The only assumption is that the FPGAs are configured in an N-dimensional torus. It has the following features: 128 7.4. INTER-FPGA COMMUNICATION • Automatic end-to-end routing of messages from source to target FPGAs without any intervention from the application at intermediate FPGAs • No requirement for the application to be aware of the physical topology of the FPGAs • Automatic allocation of FPGA identifiers and discovery of the FPGA topo- logy at boot time • Support for multiple copies of an application on a FPGA via an additional topology dimension • Support for some links between FPGAs to be omitted Automatic identification system The automatic identification system is used to allocate unique addresses to all FP- GAs in the multi-FPGA platform, and to determine their topology. The following information needs to be provided for the system to perform this function: 1. The number of dimensions in the FPGA topology 2. The dimension allocated to each link on a FPGA, and whether messages fol- lowing the link will reach a FPGA with a higher or lower address in that dimension 3. Which FPGA is the master The system is then able to infer: 1. The address of each FPGA in the multi-FPGA platform 2. The total size of the multi-FPGA platform in each dimension 3. When all FPGAs have been allocated addresses and when the size of the multi-FPGA platform has been determined in all dimensions 4. Whether the FPGA topology is inconsistent in any way, for example a link has been connected between an incorrect pair of FPGAs The identification process is started by the master FPGA. It is the first stage of boot- ing any application on the multi-FPGA platform, as all other stages of the boot process rely on being able to send messages between all FPGAs in the platform. 
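The flit format of Figure 7.12 amounts to a simple bit-packing, sketched below with two 64-bit words standing in for the 76-bit flit. The field widths (1-bit sop and eop, 10-bit channel, 64-bit data) are read from the figure; the hardware marshals flits into 8B10B symbols rather than machine words.

#include <cstdint>
#include <utility>

// Link-layer flit fields (Figure 7.12): sop (bit 0), eop (bit 1),
// channel (bits 2-11), data (bits 12-75) -- 76 bits in total.
struct Flit {
    bool     sop;
    bool     eop;
    uint16_t channel;   // only the low 10 bits are used
    uint64_t data;
};

// Pack a flit into two 64-bit words: low word = bits 0-63, high word = bits 64-75.
std::pair<uint64_t, uint64_t> pack_flit(const Flit& f) {
    uint64_t low  = (f.sop ? 1u : 0u)
                  | (static_cast<uint64_t>(f.eop) << 1)
                  | (static_cast<uint64_t>(f.channel & 0x3FF) << 2)
                  | (f.data << 12);                  // low 52 bits of data
    uint64_t high = f.data >> 52;                    // top 12 bits of data
    return {low, high};
}

Flit unpack_flit(uint64_t low, uint64_t high) {
    Flit f;
    f.sop     = low & 1;
    f.eop     = (low >> 1) & 1;
    f.channel = static_cast<uint16_t>((low >> 2) & 0x3FF);
    f.data    = (low >> 12) | ((high & 0xFFFu) << 52);
    return f;
}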
The identification algorithm is largely stateless, other than each FPGA gradually building up its address in each dimension and then the size of the multi-FPGA platform in each dimension. CHAPTER 7. IMPLEMENTATION 129 Three types of messages used by the algorithm. All are sent point-to-point between adjacent FPGAs: Address setup Contains the address of the sending FPGA. When another FPGA receives this message it knows that the FPGA whose address is given is at the other end of the link the message was received on. The recipient can then either infer its own address or (if it already has an address) or check that the local FPGA topology is consistent. Dimension setup Contains the size of the multi-FPGA platform in the dimension in which the message was received. When a FPGA receives this message it updates its local view of the dimensions of the platform. Identification finished Indicates that the identification system has finished identifying all FPGAs in the platform. The identification algorithm starts by allocating address zero in all dimensions to the master FPGA. The master FPGA then sends an address setup message with its address on each of the links which connect to a FPGA which will have a higher address in each dimension (positive-going links). The remainder of the algorithm then proceeds based on the types of messages received by each FPGA in the plat- form, and whether that FPGA already has an address set, as shown in Figure 7.13 on the following page. The master FPGA can determine the total number of FPGAs in the multi-FPGA platform as soon as it has generated dimension setup messages for every dimen- sion. It then compares the number of FPGAs that it expects to exist with the number of identification finished messages that it receives. Once the correct number of mes- sages is received the boot process can move to the next stage in the knowledge that the end-to-end dimension-ordered routing system is available. The address of each FPGA and the dimensions of the multi-FPGA platform are passed to the target to hops convertor. 130 7.4. INTER-FPGA COMMUNICATION Figure 7.13: Processing of messages by automatic identification system CHAPTER 7. IMPLEMENTATION 131 Target FPGA to hops conversion The target to hops converter takes the address of a target FPGA and converts it to the number of hops needed in each dimension to reach the target FPGA. The hops will be either positive for all dimensions or negative for all dimensions (positive- going or negative-going), so that requests and replies are guaranteed to take dis- joint paths through the inter-FPGA communication system, helping to prevent deadlock. Once the number of hops to reach the target FPGA has been determined, the message and the hops data are passed to the dimension-ordered router. Dimension-ordered router The dimension-ordered router sends and receives messages on each of the reli- able links on a FPGA, as well as to the system implemented within the FPGA. The router is designed to introduce as little latency as possible, by avoiding complex calculations at intermediate routers between source and target FPGAs. The routing algorithm decides which link to send a message on based on the number of hops remaining, whether the hops are positive- or negative-going and the availability of links. For each message: 1. If there are no hops remaining it is delivered to this FPGA 2. Otherwise send the message on the link in the highest dimension with re- maining hops 3. 
If that link does not exist, send the message on a link in the next lowest di- mension with an available link and remaining hops 4. If no exit route is available, an error has occurred This routing algorithm allows for some links to be omitted in higher dimensions. For example if there were three dimensions, x, y and z, the x dimension could represent links between adjacent FPGAs, which will always exist, and the z di- mension could represent links between boxes in a rack, where a link may not be provided between all pairs of FPGAs which logically straddle the divide between boxes. 132 7.4. INTER-FPGA COMMUNICATION 7.4.4 Message filtering and delivery When a message is delivered to a FPGA, it must be passed to one or more hardware blocks in the FPGA. This might be a higher-level block, for example the identific- ation system, or a lower-level block such as part of the massively parallel neural computation system. This is achieved using a hierarchy of simple switches which either deliver a mes- sage to logic at that level in the FPGA or pass it to a switch at the next level down. Switching decisions are made based on the channel field in a flit. Messages with a channel number lower than the switch’s extraction threshold are extracted while other messages are passed to the next level. In addition to this usage of the channel number, the recipient can also use it for any application-specific purpose. 7.4.5 Application usage In order to use the inter-FPGA communication system, an application (such as the massively parallel neural computation system) needs to: 1. Select the type of message using a channel number. As well as indicating to the eventual recipient what actions should be performed, the channel number is also used (as noted above) by the filtering and delivery system 2. Provide the address of the target FPGA and optionally application block within the FPGA. No information about the topology of the multi-FPGA sys- tem is required apart from knowledge of how many application blocks are in a FPGA 3. Provide a payload if appropriate A message is formatted as in Figure 7.14 (which also shows how the message fits into a link layer flit). It is then converted into a flit and passed to a local switch, which will arrange delivery to the target. 0 1 2 11 12 43 44 75 sop eop channel data } flit sop eop channel payload address } message Figure 7.14: Layout of an inter-FPGA message showing how it fits into a link layer flit CHAPTER 7. IMPLEMENTATION 133 7.5 Implementation languages The majority of the multi-FPGA massively parallel neural computation system is implemented using Bluespec SystemVerilog (BSV) (Nikhil and Czeck, 2010), since it is higher-level than Verilog or VHDL but still allows low-level design optimisa- tions. In particular, channel communication can be concisely expressed both within and between modules. This allows the architecture to be easily expressed in a ‘com- municating sequential processes’ style, which is very well suited to a system with multiple components. BSV comes with a simulator (Bluesim) which allows for cycle-accurate simulation of the behaviour of a system which is significantly quicker than an RTL-level sim- ulation such as ModelSim. Bluesim allows simulated memories to be populated using a data file, which allows the massively parallel neural computation system to be simulated using the same off-chip memory image files (with a few minor formatting changes) as are used to program the multi-FPGA system on the Blue- hive platform. 
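Returning to the dimension-ordered router of Section 7.4.3, its per-hop decision can be summarised in a few lines of C++. The representation of remaining hops (non-negative counts with a single positive- or negative-going direction handled elsewhere) and of link availability is an assumption of this sketch.

#include <vector>

enum class RouteAction { DeliverHere, Forward, Error };

struct RouteDecision {
    RouteAction action;
    int dimension;   // valid only when action == Forward
};

// hops[d] is the number of hops remaining in dimension d; link_exists[d]
// says whether this FPGA has an outgoing link in that dimension in the
// required direction.
RouteDecision route(const std::vector<int>& hops,
                    const std::vector<bool>& link_exists) {
    bool remaining = false;
    for (int h : hops) remaining |= (h != 0);
    if (!remaining)
        return {RouteAction::DeliverHere, -1};      // rule 1: deliver locally

    // Rules 2 and 3: highest dimension with remaining hops, falling back to
    // the next lower dimension that has both an available link and hops left.
    for (int d = static_cast<int>(hops.size()) - 1; d >= 0; --d)
        if (hops[d] != 0 && link_exists[d])
            return {RouteAction::Forward, d};

    return {RouteAction::Error, -1};                // rule 4: no exit route
}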
Memory image file generation is performed using a program written in C++. Vari- ous Makefiles and Perl scripts are used to link system components together in sim- ulation and to upload and download data from the multi-FPGA system. 7.6 Conclusion The implementation of a massively parallel neural computation system on the Bluehive multi-FPGA platform consists of separate hardware blocks for neuron modelling equations and to perform the fan-out and accumulation stages of the synaptic update communication and application algorithm introduced in Chapter 6, together with supporting infrastructure. Each hardware block is designed to make efficient use of off-chip memory band- width by using burst reads and processing whole words in a single clock cycle. This should maximise the size of neural network that can be handled in real-time, as suggested by the Bandwidth Hypothesis. Chapter8 Evaluation 135 136 8.1. INTRODUCTION 8.1 Introduction The multi-FPGA, massively parallel neural computation system will be evaluated using two metrics: 1. Scale of neural network that can be handled in real-time 2. Speed-up when adding compute resources while keeping the neural network constant These methods are similar to those used by Ananthanarayanan et al. (2009) to eval- uate their supercomputer-based neural computation system. A benchmark neural network of sufficient scale and with mean fan-out and mean spike frequency that match the assumptions that were made in Section 4.3 is needed. Since there are no benchmark neural networks available that satisfy these criteria, a benchmark neural network is developed in Section 8.2. While the pattern of synaptic connections in this benchmark neural network cannot be claimed to be biologically plausible, in all other respects it is suitable for evaluating the massively parallel neural computation system. The multi-FPGA, massively parallel neural computation system is able to handle up to 64k neurons per FPGA in real-time using the benchmark neural network. Res- ults are provided for systems using 1, 2 and 4 FPGAs, which perform computations for networks of 64k, 128k and 256k neurons respectively. Analysis of the number of clock cycles of work for each sampling interval of the computation for each of these systems shows that there is a significant jump in the number of clock cycles of work per sampling interval between 1 and 2 FPGAs, but little extra between 2 and 4 FPGAs. This provides promise for scaling to larger neural networks with more FPGAs, and meets the requirements of the Scalability Hypothesis. The speed-up when adding compute resources to a computation of a benchmark network of 64k neurons is almost linear with 1, 2 and 4 FPGA systems. This suggests that the massively parallel neural computation system is sufficiently suc- cessful at optimising communication and memory bandwidth usage that (at least for this particular benchmark neural network), neural computation has become compute-bound rather than communication bound, showing that scalability can be achieved by adding compute resources provided that there are also sufficient communication resources. CHAPTER 8. EVALUATION 137 8.2 Benchmark neural network The benchmark neural network is designed to provide a “load test” of the neural computation system, which scales to the number of neurons required and has a biologically-plausible fan-out size (but not fan-out pattern) and spike frequency. 
Taking these parameters from Section 3.3, the benchmark neural network will have a mean fan-out of 103 and a mean spike frequency of 10 Hz. The benchmark neural network is made up of a number of synfire chains. A syn- fire chain (Abeles, 1982) is a neural network where groups of neurons have synaptic connections to other groups of neurons. Each group of neurons spikes synchron- ously after receiving spikes from another group of neurons. In the synfire chain used in the benchmark neural network, each neuron has a combination of neuron modelling equation parameters and synaptic delays and weights (shown in Table 8.1) that result in it spiking 10 ms after it receives a syn- aptic update. They were selected after experiments using NEURON (Hines and Carnevale, 1997), which provides an Izhikevich neuron model, and allows the be- haviour of a single neuron to be observed as its parameters are changed. If an input voltage (I-value) of 16 mV is applied to a neuron with these parameters then it spikes immediately. If 10 of these neurons are connected in a chain with a synaptic delay of 10 ms, then each neuron will spike at 100 ms intervals, which is equivalent to a spike frequency of 10 Hz. Biologically-plausible fan-out size is created by grouping neurons into blocks of 103. Each neuron in a block has 103 synaptic connections, one to each of the other neurons in the block. The weights of the synaptic connections from each set of 100 neurons to the next set of 100 sum to 16 mV, so that when the first 100 neurons in the block spike in unison this will result in the next 100 neurons in the block spiking in unison after 10 ms and so on, with synaptic connections from the last 100 neurons looping around back to the first 100 neurons. The result is a synfire chain where each neuron has a fan-out of 103 and a spike frequency of 10 Hz. Parameter Value v0 −70.0 u0 −14.0 a 0.02 b 0.2 c −65.0 d 6.0 Table 8.1: Parameters of neuron modelling equation (Equation 3.1) used to create benchmark neural network 138 8.3. SCALE The scale of a benchmark neural network is increased by observing that each block of 103 neurons has a set of 100 neurons spiking every 10 ms, with no spike activ- ity in between. Therefore a network of 104 neurons that maintains a biologically- plausible fan-out size and mean spike frequency can be created by interleaving 10 of these 103 neuron blocks. The time of the initial injected spike that starts each block’s spike chain is offset by increments from 0 to 9 ms for blocks 0 to 9 respect- ively. The result is a benchmark neural network of 104 neurons, with biologically- plausible fan-out size and mean spike frequency. To create larger benchmark neural networks, multiple copies of this 104 neuron network are instantiated until the benchmark neural network reaches the required size. For example a benchmark neural network of 6.4 × 104 neurons will have 6.4 copies of the 104 neuron net- work. The decimal part is achieved using a network of 4× 103 neurons with 4 offset blocks of 103 neurons. There will be slight variations in the workload produced by a benchmark neural network with a decimal part between sampling intervals, but fan-out and spike frequency remain unaffected. 8.3 Scale The ability of the multi-FPGA massively parallel neural computation system to perform neural computations for large neural networks in real-time will be shown using a neural computation of a benchmark neural network of 256k neurons on an implementation of the system with 4 FPGAs. 
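For reference, one 1000-neuron block of the benchmark of Section 8.2 could be generated along the following lines, emitting rows in the connection-file format of Table 7.2. Treating the connections outside the next group of 100 as zero-weight (so that the fan-out of 1000 is preserved without affecting the firing pattern), and including a self-connection, are assumptions of this sketch.

#include <cstdint>
#include <vector>

// One row of the connection file (Table 7.2).
struct Connection {
    uint32_t source, target;
    double   weight;   // mV
    uint32_t delay;    // ms
};

// Generate one 1000-neuron synfire block starting at neuron `base`.
// Neurons are grouped into 10 sets of 100; each neuron connects to every
// neuron in the block, but only connections into the next group carry
// weight, 0.16 mV each so that the 100 incoming weights per target sum
// to 16 mV, with a 10 ms synaptic delay.
std::vector<Connection> make_synfire_block(uint32_t base) {
    constexpr uint32_t kBlock = 1000, kGroup = 100, kDelayMs = 10;
    std::vector<Connection> conns;
    conns.reserve(kBlock * kBlock);
    for (uint32_t src = 0; src < kBlock; ++src) {
        uint32_t next_group = ((src / kGroup) + 1) % (kBlock / kGroup);
        for (uint32_t dst = 0; dst < kBlock; ++dst) {
            bool drives = (dst / kGroup) == next_group;
            conns.push_back({base + src, base + dst,
                             drives ? 16.0 / kGroup : 0.0, kDelayMs});
        }
    }
    return conns;
}

Interleaving ten such blocks, with the initial injected spikes offset by 0 to 9 ms, then yields the 10,000-neuron building block described above.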
A plot of the spikes recorded from this neural computation is shown in Figure 8.1 on the next page. Figure 8.2 on page 140 shows the number of neurons spiking in each sampling interval and the number of clock cycles of work per sampling interval. As the number of clock cycles of work for each sampling interval is consistently less than 200k (dashed green line), the computation is running in real-time. Neurons are allocated to FPGAs in groups of two e.g. neurons 0 and 1 are allocated to FPGA 0, neurons 2 and 3 to FPGA 1 etc. Since each neuron has a fan-out of 103 this means that synaptic updates from each neuron will target 250 neurons on each of the 4 FPGAs. Locality is deliberately reduced in this way to avoid an optimal allocation of neurons to FPGAs, which would place all of the neurons in each block of 103 on the same FPGA. This would result in no inter-FPGA communication and hence essentially 4 independent neural computations with 64k neurons each – not a fair test of a system designed for massively parallel neural computation. CHAPTER 8. EVALUATION 139 100 120 140 160 180 200 220 240 260 280 300 0 50 100 150 200 250 Time / ms N eu ro n nu m be r ×1 03 Figure 8.1: Spike pattern for a computation of 256k neurons over 4 FPGAs Figure 8.3 on page 141 shows the number of clock cycles of work per sampling interval for benchmark neural networks of 64k, 128k and 256k neurons. These net- works are distributed over 1, 2 and 4 FPGAs respectively so that there are 64k neur- ons on each FPGA. There are four causes of variation in the number of clock cycles of work per sampling interval between the 64k, 128k and 256k networks: 1. The jump in maximum clock cycles per sampling interval between 64k neur- ons and 128k and 256k neurons is most likely a result of the need to use the inter-FPGA communication system to communicate synaptic updates between FPGAs, which is not necessary with the 64k neuron network. 2. The major peaks and troughs are the result of “excess” neurons that are not part of a complete set of 10k neurons. For example the 128k neuron network has 12× 10k + 8k neurons, with the 8k neurons being constructed of 8× 1k neuron sets and 2 “gaps.” This means that more neurons will spike and hence more synaptic updates will need to be communicated and applied in the first 8 1 ms sampling intervals in every 10 than in the last 2, and hence that the number of clock cycles of work per sampling interval varies in the same pat- tern. 140 8.3. SCALE 100 120 140 160 180 200 220 240 260 280 300 2.35 2.4 2.45 2.5 2.55 2.6 2.65 2.7 2.75 2.8 Time / ms N eu ro ns sp ik in g ×1 03 Neurons spiking Clock cycles 175 180 185 190 195 200 205 Slower than real-time Real-time C lo ck cy cl es of w or k / sa m pl in g in te rv al ×1 03 Figure 8.2: Neurons spiking per sampling interval and clock cycles of work per sampling interval for computation of 256k neurons using 4 FPGAs For the 256k neuron network there are 25 × 10k + 6k neurons, with the 6k neurons being constructed of 6× 1k neuron sets and 4 “gaps,” and hence the peaks and troughs are more equal than with 128k neurons. 3. The amplitude of the peaks and troughs is dependent on the number of FP- GAs that the “excess” neurons are distributed over. With 64k and 128k neur- ons each FPGA has 4k excess neurons while with 256k neurons each FPGA has 1.5k excess neurons, and hence the variation in clock cycles per sampling interval caused by communicating and applying synaptic updates from these excess neurons is lower. 4. 
Minor variations are an artefact of the mechanism used to determine when all computation for a sampling interval has been completed, which relies on each FPGA determining when no data remains in the input and output FIFOs of each component of the neural computation system and then signalling this to the master FPGA, which determines when all FPGAs have completed com- putation and signals that computation for the next sampling interval should be started. CHAPTER 8. EVALUATION 141 60 80 100 120 140 160 180 200 220 240 260 280 300 145 150 155 160 165 170 175 180 185 190 195 Time / ms C lo ck cy cl es of w or k / sa m pl in g in te rv al ×1 03 64k 128k 256k Figure 8.3: Clock cycles of work per sampling interval for benchmark networks of 64k, 128k and 256k neurons on 1, 2 and 4 FPGAs respectively While this will need to be confirmed by further experiments, it is likely that a com- putation of a benchmark neural network of 512k neurons distributed over 8 FPGAs will have a maximum number of clock cycles of work per sampling interval which is close to that of that of the network with 256k neurons as the number of neurons per FPGA will remain constant. There will be 2k excess neurons in total, but only 256 excess neurons per FPGA, and so the number of clock cycles per sampling in- terval will show variation that has wider troughs than peaks but lower amplitude than that shown by the 256k neuron network. This suggests that the system will scale to computations of larger neural networks spread over more FPGAs while preserving real-time. 142 8.4. COMMUNICATION OVERHEAD 8.4 Communication overhead The overhead of inter-FPGA communication and partitioning a neural network between multiple FPGAs can be examined by spreading a computation of a bench- mark neural network of 64k neurons over 1, 2 and 4 FPGAs. If there were no over- head then the 2 and 4 FPGA computations would be expected to take 50% and 25% of clock cycles of work per sampling interval of the 1 FPGA computation respect- ively. Figure 8.4 on the facing page shows the number of clock cycles of work per sampling interval for a benchmark neural network of 64k neurons when the com- putation is spread over 1, 2 and 4 FPGAs. Figure 8.5 on page 144 shows the same data for 2 and 4 FPGAs as a percentage of that for 1 FPGA. This shows that the actual percentages are close to the theoretical values of 50% and 25%. This shows an almost linear speed-up, and suggests that the massively parallel neural com- putation system has been sufficiently successful at optimising communication and memory bandwidth usage that (at least for this particular benchmark neural net- work), neural computation has become compute-bound rather than communica- tion bound. This shows that scalability can be achieved by adding compute re- sources provided that there are also sufficient communication resources. 8.5 Validation The behaviour of the neural computation system was cross-validated against a PC- based system written in Java by Steven Marsh. There was no deviation in the output of the two systems for a sample of 300 ms of activity of the benchmark neural network presented in Section 8.2. This activity included that shown in Fig- ure 8.1. CHAPTER 8. 
Figure 8.4: Clock cycles of work per sampling interval for a benchmark network of 64k neurons spread over 1, 2 and 4 FPGAs

8.6 Conclusion

The evaluation of the multi-FPGA massively parallel neural computation system implemented on the Bluehive platform shows that it is able to perform neural computation with up to 64k neurons per FPGA in real-time. Results are provided for benchmark neural networks of up to 256k neurons. The system shows promise of scaling to handle larger neural networks using more FPGAs. The system shows almost linear speed-up when adding compute resources to a computation of a constant-sized neural network, which suggests that the massively parallel neural computation system is successful at optimising communication and memory bandwidth usage.

Figure 8.5: Clock cycles of work per sampling interval for a benchmark network of 64k neurons spread over 2 and 4 FPGAs as % of clock cycles taken by 1 FPGA

Chapter 9 Conclusions

The major conclusion of this thesis is that massively parallel neural computation systems must be designed with communication as a first-class design constraint. This manifests itself in several ways. Firstly, it is clear that the implementation platform for a massively parallel neural computation system must allow the system to scale to multiple devices by providing high-bandwidth, low-latency connections between the devices that make up the platform. This is made clear when the limitations of PC-based and GPU-based neural computation systems are considered.

Secondly, failure to consider the implications of communication in biological neural networks, and particularly the operating rate of biological neurons, significantly hampers the scale of some types of neural computation systems. Those that dedicate resources to each neuron and synaptic connection cannot match the scale of those that time-multiplex resources, use an on- and inter-chip network for communication and store a neural network description in off-chip memory. Therefore all future massively parallel neural computation systems should use the latter approach.

Thirdly, the scale of neural computation that can be handled by a system implemented using today's electronics is limited by communication and memory bandwidth. Neural computation systems must therefore make maximum use of memory bandwidth by issuing memory requests whenever the memory controller is able to accept them, for example by requesting contiguous blocks of data, or by having multiple parts of the system issue memory requests so that some part of the system is likely to have a request ready whenever the memory controller can accept one, even when other parts of the system are busy. Full use must also be made of all data returned, for example by packing multiple items into long words and processing these items in parallel.

By taking a communication-centric design approach I was able to demonstrate that a scalable FPGA-based massively parallel neural computation system can handle 64k neurons and 64M synaptic connections per FPGA.
This system will be compared to other neural computation systems in Section 9.2. 9.1 Evaluation of hypotheses Three hypotheses were introduced in Chapter 1. We will now examine how they are supported by the work in this thesis and the conclusion that communica- tion must be considered as a first-class design constraint when implementing a massively parallel neural computation system. CHAPTER 9. CONCLUSIONS 147 9.1.1 Scalability Hypothesis The scale of neural network that can be handled by a neural computation system in real-time must be able to be increased by scaling the system to multiple devices Based on the review of related work in Chapter 2, it is clear that a neural com- putation system implemented using today’s electronics needs to scale to multiple devices to support large-scale neural computation. Recent technology scaling trends point towards future computation systems using multiple, small devices rather than fewer large devices. Given that many research- ers ultimately aim to perform neural computations at a scale that matches the hu- man brain, future neural computation systems will need to handle larger neural networks than current systems, and so it appears likely that future neural compu- tation systems will need to be implemented using multiple, linked devices even if the size of neural network that can be handled by an individual device can be increased. This means that the implementation technology must provide suitable inter-device communication infrastructure. 9.1.2 Communication-Centric Hypothesis The scalability of a neural computation system is communication-bound, not compute- bound The analysis in Section 4.3 makes it clear that the volume of neural communication events that needs to be modelled per sampling interval is significantly greater than the volume of neuron modelling equation evaluations. With 105 neurons, mean fan-out of 103, mean spike frequency of 10 Hz and a sampling interval of 1 ms, 109 synaptic updates need to be communicated and applied and 108 neuron modelling equations need to be evaluated every second. Every synaptic update will require some kind of communication in a neural computation system, and so the volume of communication events that need to be handled is an order of magnitude greater than the volume of neuron modelling equation evaluations, which makes neural computation communication-bound rather than compute-bound. 9.1.3 Bandwidth Hypothesis The scale of neural network that can be handled by a neural computation system in real-time is bounded by inter-device communication bandwidth and memory bandwidth Section 4.3 and Section 4.5 found that the memory and communication bandwidth required by a neural computation of 105 neurons with mean fan-out of 103, mean 148 9.2. COMPARISON TO OTHER SYSTEMS spike frequency of 10 Hz and sampling interval of 1 ms is at the limit of what is provided by today’s electronics. Therefore it is impossible to increase the scale of neural network that can be handled by a neural computation system in real-time without either: 1. Making more efficient use of bandwidth by optimising data storage and ac- cess patterns and communication methods 2. Providing more memory and communication bandwidth to each FPGA, us- ing either more memory and communication channels or new technologies 3. 
Using multiple FPGAs, which provides more memory and communication channels Since it is possible that future technology might bring sufficient bandwidth that neural computation on that technology becomes compute-bound rather than com- munication bound, it is most accurate to say that: The scale of neural network that can be handled by a neural computation system that is implemented using today’s electronics in real-time is bounded by inter-device com- munication bandwidth and memory bandwidth There is evidence for this in related work. For example the scale of neural net- work that can be handled in real-time by the GPU-based neural computation sys- tem implemented by Fidjeland and Shanahan (2010) is limited to whatever can be achieved using a single GPU as the implementation platform does not provide suf- ficient inter-device communication bandwidth to allow a computation to scale to multiple GPUs. All three hypotheses point at communication needing to be considered as a first- class design constraint alongside computation when designing a massively parallel neural computation system. 9.2 Comparison to other systems We will now compare the scale and performance of the FPGA-based massively par- allel neural computation system to other neural computation systems, examining how their scalability is affected by the degree to which communication was con- sidered in their design. CHAPTER 9. CONCLUSIONS 149 9.2.1 Supercomputer-based systems Supercomputer-based neural computation systems must consider communication as a first-class design constraint as their programming model relies on distributing work over a very large number of CPUs and inter-process communication using custom interconnects and either shared memory or message passing communica- tion models. A direct comparison of the multi-FPGA neural computation system with supercomputer-based systems is difficult, as supercomputer-based systems have tended to use significantly more complex neuron models than other systems. The neural networks used in published work such as that by Ananthanarayanan et al. (2009) have higher fan-out (104 rather than 103) and a much larger number of neur- ons (109 compared to around 105) than the neural networks that have been used to evaluate other systems, including the system in this work. Supercomputer-based systems typically operate much slower than real-time e.g. around 800× slower than real-time for a network with a mean spike frequency of 10 Hz. A real-time neural computation of 109 Izhikevich neurons with a fan-out of 103 and mean spike frequency of 10 Hz using the architecture from this work would require 16× 103 FPGAs, assuming that the architecture could scale sufficiently. Al- ternatively the architecture could handle a neural computation of around 2× 108 Izhikevich neurons 800× slower than real-time using 4 FPGAs, given the currently unrealistic assumption of each FPGA having 100 MB of on-chip memory or low- latency off-chip memory. If this could be scaled to more FPGAs, then 20 FPGAs should be able to handle a computation of 109 neurons. This would still have a far less complex neuron model and lower fan-out than the supercomputer-based system, and so no direct conclusions can be drawn. How- ever this does serve to illustrate the potential of FPGA-based neural computation systems. The cost of building a multi-FPGA system with 20 FPGAs is many orders of magnitude less than that of a supercomputer such as an IBM BlueGene/L. 
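The FPGA-count estimates above follow from simple scaling of the demonstrated capacity of 64k neurons per FPGA in real-time, assuming that running 800 times slower than real-time lets each FPGA time-multiplex 800 times as many neurons (subject to the memory caveat already noted):

\begin{align*}
\text{FPGAs for } 10^9 \text{ neurons in real-time} &\approx \frac{10^9}{64 \times 10^3 \text{ neurons/FPGA}} \approx 16 \times 10^3\\[4pt]
\text{neurons at } 800\times \text{ slower on 4 FPGAs} &\approx 4 \times 64 \times 10^3 \times 800 \approx 2 \times 10^8\\[4pt]
\text{FPGAs for } 10^9 \text{ neurons at } 800\times \text{ slower} &\approx \frac{10^9}{64 \times 10^3 \times 800} \approx 20
\end{align*}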
In conclusion, supercomputers are a suitable platform for neural computation, as they provide sufficient communication and compute resources for a wide range of neuron and communication models, but their cost means that they are not readily available to most researchers.

9.2.2 PC-based systems

Comparing the multi-FPGA massively parallel neural computation system to a commodity PC-based system gives an indication of the performance increase that can be expected by the majority of neuroscientists, who currently use various PC-based systems. Since commodity PCs only support scaling using networking technologies such as Ethernet, which carries significant protocol overhead, PC-based systems are effectively limited to a single PC.

Steven Marsh has written a neural computation system in C that uses the synaptic update communication and application algorithm from Chapter 6 and the fixed-point neuron modelling equation from Chapter 3 (Moore et al., 2012). Using the same benchmark neural network that is used to evaluate the FPGA-based system, a single-threaded version of this PC-based system required 48.8 s to compute 300 ms of neural activity, running on one thread of a 16-thread, 4-core Xeon X5560 2.80 GHz server with 48 GB of RAM.

The 4-FPGA neural computation system is therefore 162 times faster than this particular PC-based system. The PC-based system can clearly be improved, and in particular an adaptation of NEST (which is commonly used by neuroscientists for large neural computations) would provide a fairer "real-world" comparison. Nevertheless, this is sufficient to show that a neural computation system designed without considering communication as a first-class design constraint has significantly inferior performance to the multi-FPGA system, which was designed with communication and scalability in mind.
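For reference, the per-neuron work that both Marsh's C implementation and the FPGA pipeline perform once per 1 ms sampling interval has the general shape sketched below. This is a minimal floating-point rendering of the published Izhikevich model (Izhikevich, 2003), not the fixed-point formulation derived in Chapter 3; the two half-steps follow Izhikevich's reference code, and the struct layout is purely illustrative.

/* One 1 ms step of the Izhikevich spiking neuron model (Izhikevich, 2003),
 * in floating point for clarity. The thesis's implementation evaluates a
 * fixed-point reformulation of these equations. */
typedef struct {
    float v, u;        /* membrane potential and recovery variable */
    float a, b, c, d;  /* per-neuron model parameters              */
} neuron_t;

/* i_in is the accumulated synaptic input for this sampling interval.
 * Returns 1 if the neuron spiked during this interval.              */
static int izhikevich_step(neuron_t *n, float i_in)
{
    /* Two 0.5 ms half-steps for the membrane potential, as in
     * Izhikevich's reference implementation.                   */
    n->v += 0.5f * (0.04f * n->v * n->v + 5.0f * n->v + 140.0f - n->u + i_in);
    n->v += 0.5f * (0.04f * n->v * n->v + 5.0f * n->v + 140.0f - n->u + i_in);
    n->u += n->a * (n->b * n->v - n->u);

    if (n->v >= 30.0f) {  /* spike threshold (mV)      */
        n->v = n->c;      /* reset membrane potential  */
        n->u += n->d;     /* bump recovery variable    */
        return 1;
    }
    return 0;
}

Each of the 10^8 equation evaluations per second amounts to one call of a routine like this, which is why the arithmetic itself is comfortably within reach of modern hardware and the synaptic update traffic is the harder problem.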
9.2.3 GPU-based systems

The GPU-based system implemented by Fidjeland and Shanahan (2010) provides an interesting point of comparison, as it uses a discrete-time Izhikevich neuron model with a 1 ms sampling interval and benchmark neural networks with a mean fan-out of 10^3 and a mean spike frequency of 10 Hz. This is the same neuron model that is used in the massively parallel neural computation system, and both benchmark neural networks have the same statistical properties, so the two systems are readily comparable.

While the GPU-based system is able to handle neural networks with up to 55k neurons in real-time using a single GPU, it does not scale to larger networks using multiple GPUs. In comparison, the system in this work is able to handle up to 64k neurons per FPGA in real-time, with support for scaling to larger networks using multiple FPGAs.

GPUs have high communication bandwidth between each GPU and the CPU in the host PC, but if there are multiple GPUs then all communication between them must go via the CPU, which has high latency and does not allow the topology of a multi-GPU system to be adapted to suit an application. Their scalability is therefore hampered by communication not being considered as a first-class design constraint.

If GPUs were provided with a more flexible communication topology, with multiple links per GPU, then it seems likely that they could exhibit similar performance to FPGA-based systems. This suggests that future work could focus on neural computation systems that combine the more general-purpose compute architecture and high memory bandwidth of a GPU-based system with the flexible communication topology of an FPGA-based system.

9.2.4 Bespoke hardware-based systems

Many hardware-based neural computation systems have clearly not been designed with communication as a first-class design constraint, as they either do not support scaling to multiple devices or make inefficient use of communication resources. In particular, FPGA-based systems that directly map neuron models into hardware and use FPGA routing resources to connect these "hardware neurons" support orders of magnitude fewer neurons and synaptic connections than the multi-FPGA system introduced in this work.

The situation is similar for analogue hardware-based neural computation. These systems typically model small numbers of neurons and synaptic connections in greater detail than simple spiking neuron models, many times faster than real-time, but many of them are unable to scale to larger neural networks. Those that do support scaling typically use some form of packet-based signalling over an inter-chip network (often combined with an on-chip network), which shows that when communication is considered as a design constraint alongside computation, the scale of neural computation can be increased.

Both of these types of neural computation system are limited in scale if they dedicate resources to each individual neuron. If neural computation systems instead take account of the difference in operating rate between biological neurons and today's electronics (a difference that is expected to grow) by time-multiplexing compute and communication resources, then the scale of neural computation that can be supported per device increases by orders of magnitude. Combined with the support for scalability that this approach brings, it would appear that all large-scale neural computation systems should use time-multiplexed compute and communication resources rather than direct mapping of biological features to hardware.

This can be justified by comparing the scale and scalability of direct-mapped systems with those of systems that multiplex resources. Bailey (2010) implements 100 neurons and 200 synaptic connections per FPGA using direct mapping into hardware and communication via FPGA routing resources, compared to 64k neurons and 64M synaptic connections per FPGA in the system implemented in this work. Another point of comparison is the system implemented by Cassidy et al. (2011), who also use time-multiplexed resources. They claim to compute a network of up to 10^6 neurons on a single FPGA, far more than is likely to be possible with the system implemented in this work using currently available technology. However, the fan-out of each neuron is very limited in their case (perhaps even to the point of each neuron having a fan-out of 1), and so the utility of their system is ultimately limited.
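The scale advantage of time-multiplexing follows from the gap between the 1 ms sampling interval and the clock rate of today's digital logic: a single pipelined evaluator can service many thousands of neurons within each interval, whereas direct mapping dedicates an evaluator to each neuron. The sketch below illustrates the multiplexing factor; the 200 MHz clock and one-update-per-cycle throughput are illustrative assumptions rather than measured figures from this system.

/* Illustration of the time-multiplexing factor available to a pipelined
 * neuron evaluator. The clock rate and per-cycle throughput below are
 * assumptions for illustration only. */
#include <stdio.h>

int main(void)
{
    const double clock_hz          = 200e6; /* assumed pipeline clock      */
    const double updates_per_cycle = 1.0;   /* assumed pipeline throughput */
    const double interval_s        = 1e-3;  /* 1 ms sampling interval      */

    /* Neuron updates one physical evaluator can perform per interval,
     * i.e. how many "hardware neurons" it can stand in for.           */
    double updates_per_interval = clock_hz * updates_per_cycle * interval_s;

    printf("neuron updates per 1 ms interval per evaluator: %.0f\n",
           updates_per_interval);  /* 200000 under these assumptions */
    return 0;
}

In practice memory and communication bandwidth, rather than evaluator throughput, cap the per-FPGA neuron count well below this figure, which is consistent with the 64k neurons per FPGA achieved by the system in this work.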
Given the scaling results in Chapter 8, the system in this work should be able to compute a network of 10^6 neurons in real-time using 16 FPGAs. This would result in a network with the same number of neurons as that used by Cassidy et al., but with at least 100× or even 1000× as much fan-out. It would therefore ultimately outperform any system designed to maximise the number of neurons per FPGA at the expense of fan-out in any application where biologically-plausible fan-out is required. Since increased fan-out leads to increased communication, this could be considered another example of the limitations that arise when communication is not considered as a first-class design constraint.

9.3 Future work

There are two main targets for future work: increasing the scale of neural computation that can be performed by a massively parallel neural computation system, and increasing its flexibility.

9.3.1 Increase in scale

Increasing a system's scale could be achieved both by using more FPGAs in a system and by changing the implementation platform to take advantage of new technology.

Using more FPGAs

The first way to increase the scale of a massively parallel neural computation system is to use more FPGAs. Our Bluehive system has 16 DE4 boards per rack box, so each rack box should allow neural computation of up to 10^6 neurons with a fan-out of 10^3 in real-time.

Change of implementation platform

Altera and Terasic have recently announced their DE5-NET FPGA evaluation board, which uses a Stratix V FPGA. This board has many features in common with the NetFPGA-10G board produced by the NetFPGA project, including 36 MB of QDRII SRAM and up to 8 GB of DDR3 SDRAM. While it lacks some of the peripherals of the DE4 board (such as an SD card slot), the SRAM would be sufficient to hold the parameters of up to 2×10^6 neurons and would provide significantly more bandwidth for accessing them. This would leave the SDRAM dedicated to read-only accesses to fan-out, update and spike injector tuples, along with write accesses from the spike auditor. Combined with the significantly larger FPGA (950k logic elements compared to 228k on the DE4-230), this should allow neural computations of at least 10^6 neurons with a fan-out of 10^3 in real-time on a single FPGA. This is equivalent to the scale of neural computation that should be possible in real-time using 16 DE4 boards; the difference in board count is a result of the higher off-chip memory bandwidth provided by the QDRII SRAM, once again illustrating that neural computation using today's electronics is primarily communication-bound.
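As a rough sanity check of the 2×10^6 neuron estimate above: 36 MB of SRAM shared between 2×10^6 neurons leaves about 18 bytes of parameter and state storage per neuron. The sketch below restates this division and compares it against two illustrative packings of the six per-neuron Izhikevich values; the actual storage format used by the implementation may differ.

/* Sanity check of the DE5-NET SRAM estimate: 36 MB of QDRII SRAM
 * (taken here as 36 x 10^6 bytes) shared between 2 x 10^6 neurons.
 * The per-value widths are illustrative only. */
#include <stdio.h>

int main(void)
{
    const double sram_bytes   = 36e6; /* QDRII SRAM on the DE5-NET     */
    const double neurons      = 2e6;  /* target neuron count           */
    const int    state_values = 6;    /* e.g. v, u and four parameters */

    printf("SRAM bytes available per neuron : %.0f\n", sram_bytes / neurons);
    printf("needed at 16 bits per value     : %d bytes\n", state_values * 2);
    printf("needed at 32 bits per value     : %d bytes\n", state_values * 4);
    return 0;
}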
Given the lack of peripherals on the DE5-NET board compared to the DE4, it may be appropriate to construct a multi-FPGA system using DE5-NET boards as the "core" and DE4 boards at the periphery to handle I/O and control functions. The high-speed transceivers on the DE4 and DE5-NET are expected to be compatible.

9.3.2 Increase in flexibility

The current system is a hand-optimised implementation of a massively parallel neural computation system for a specific neuron modelling equation. Allowing the neuron equation to be specified by the user would increase the number of potential users of the system. The NineML project (Raikov et al., 2011) provides a framework to specify neuron modelling equations programmatically (using a C-like syntax). By analysing an equation with a simple parser, it should be possible to derive an evaluation pipeline in Bluespec, and hence allow the system to use arbitrary neuron modelling equations.

A more flexible system could use a stream processor running a simple software program to evaluate neuron modelling equations, while retaining the synaptic update communication and application system. This would allow computation of any combination of neuron modelling equation and neural network. Given that the performance of a GPU-based system using a single GPU is comparable to that of the FPGA-based system using a single FPGA, a flexible system for high-performance scientific computation of many kinds could be created by combining the compute architecture and high memory bandwidth of GPUs with the communication bandwidth of FPGAs.

Bibliography

ABBOTT, L. (1999). Lapicque's introduction of the integrate-and-fire model neuron (1907). Brain Research Bulletin, 50 (5-6), 303–304.

ABELES, M. (1982). Local cortical circuits: an electrophysiological study. Springer-Verlag.

ALVADO, L., TOMAS, J., SAÏGHI, S., RENAUD, S., BAL, T., DESTEXHE, A. and MASSON, G. L. (2004). Hardware computation of conductance-based neuron models. Neurocomputing, 58-60, 109–115.

ANANTHANARAYANAN, R., ESSER, S. K., SIMON, H. D. and MODHA, D. S. (2009). The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses. In High Performance Computing Networking, Storage and Analysis, Proceedings of 2009 Conference for, pp. 63:1–63:12.

BAILEY, J. A. (2010). Towards the neurocomputer: an investigation of VHDL neuron models. Doctoral Thesis. University of Southampton, School of Electronics and Computer Science.

BASSETT, D. S., GREENFIELD, D. L., MEYER-LINDENBERG, A., WEINBERGER, D. R., MOORE, S. W. and BULLMORE, E. T. (2010). Efficient physical embedding of topologically complex information processing networks in brains and computer circuits. PLoS Computational Biology, 6 (4), Article Id: e1000748.

BASU, A., RAMAKRISHNAN, S. and HASLER, P. (2010). Neural dynamics in reconfigurable silicon. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pp. 1943–1946.

BERGER, T. and LEVY, W. (2010). A mathematical theory of energy efficient neural computation and communication. Information Theory, IEEE Transactions on, 56 (2), 852–874.

BOAHEN, K. (2000). Point-to-point connectivity between neuromorphic chips using address events. Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on, 47 (5), 416–434.

CASSIDY, A., ANDREOU, A. and GEORGIOU, J. (2011). Design of a one million neuron single FPGA neuromorphic system for real-time multimodal scene analysis. In Information Sciences and Systems (CISS), 45th Annual Conference on, pp. 1–6.

DALLY, W. J. and TOWLES, B. (2001). Route packets, not wires: on-chip interconnection networks. In Design Automation, Proceedings of the 38th Conference on, pp. 684–689.

EMERY, R., YAKOVLEV, A. and CHESTER, G. (2009). Connection-centric network for spiking neural networks. In Networks-on-Chip, Proceedings of the 3rd ACM/IEEE International Symposium on, pp. 144–152.

FIDJELAND, A. and SHANAHAN, M. (2010). Accelerated simulation of spiking neural networks using GPUs. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pp. 1–8.

GEWALTIG, M.-O. and DIESMANN, M. (2007). NEST (NEural Simulation Tool). Scholarpedia, 2 (4), 1430.

GOODMAN, D. F. M. and BRETTE, R. (2008). The Brian simulator. Frontiers in Neuroscience, 3 (26), 192–197.

HEBB, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. Wiley.

HINES, M. L. and CARNEVALE, N. T. (1997).
The NEURON simulation environment. Neural Computation, 9 (6), 1179–1209.

HODGKIN, A. L. and HUXLEY, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117 (4), 500–544.

INDIVERI, G., CHICCA, E. and DOUGLAS, R. (2006). A VLSI array of low-power spiking neurons and bistable synapses with spike-timing dependent plasticity. Neural Networks, IEEE Transactions on, 17 (1), 211–221.

IZHIKEVICH, E. (2003). Simple model of spiking neurons. Neural Networks, IEEE Transactions on, 14 (6), 1569–1572.

IZHIKEVICH, E. (2004). Which model to use for cortical spiking neurons? Neural Networks, IEEE Transactions on, 15 (5), 1063–1070.

JIN, X., FURBER, S. and WOODS, J. (2008). Efficient modelling of spiking neural networks on a scalable chip multiprocessor. In Neural Networks, 2008 IEEE International Joint Conference on, pp. 2812–2819.

MAGUIRE, L. P., MCGINNITY, T. M., GLACKIN, B., GHANI, A., BELATRECHE, A. and HARKIN, J. (2007). Challenges for large-scale implementations of spiking neural networks on FPGAs. Neurocomputing, 71 (1-3), 13–29.

MAHOWALD, M. and DOUGLAS, R. (1991). A silicon neuron. Nature, 354 (6354), 515–518.

MARKRAM, H. (2006). The Blue Brain project. Nature Reviews Neuroscience, 7 (2), 153–160.

MARTINEZ-ALVAREZ, J., TOLEDO-MOREO, F. and FERRANDEZ-VICENTE, J. (2007). Discrete-time cellular neural networks in FPGA. In Field-Programmable Custom Computing Machines, 15th Annual IEEE Symposium on, pp. 293–294.

MEAD, C. (1990). Neuromorphic electronic systems. Proceedings of the IEEE, 78 (10), 1629–1636.

MIGLIORE, M., CANNIA, C., LYTTON, W., MARKRAM, H. and HINES, M. (2006). Parallel network simulations with NEURON. Journal of Computational Neuroscience, 21, 119–129.

MOORE, S. W., FOX, P. J., MARSH, S. J. T., MARKETTOS, A. T. and MUJUMDAR, A. (2012). Bluehive - a field-programable custom computing machine for extreme-scale real-time neural network simulation. In Field-Programmable Custom Computing Machines, 20th Annual IEEE Symposium on, pp. 133–140.

NIKHIL, R. S. and CZECK, K. R. (2010). BSV by Example. CreateSpace.

RACHMUTH, G., SHOUVAL, H. Z., BEAR, M. F. and POON, C.-S. (2011). A biophysically-based neuromorphic model of spike rate- and timing-dependent plasticity. Proceedings of the National Academy of Sciences, 108 (49), E1266–E1274.

RAIKOV, I., CANNON, R., CLEWLEY, R., CORNELIS, H., DAVISON, A., DE SCHUTTER, E., DJURFELDT, M., GLEESON, P., GORCHETCHNIKOV, A., PLESSER, H., HILL, S., HINES, M., KRIENER, B., LE FRANC, Y., LO, C.-C., MORRISON, A., MULLER, E., RAY, S., SCHWABE, L. and SZATMARY, B. (2011). NineML: the network interchange for neuroscience modeling language. BMC Neuroscience, 12 (Supplement 1), P330.

RALL, W. (1959). Branching dendritic trees and motoneuron membrane resistivity. Experimental Neurology, 1 (5), 491–527.

RICE, K., BHUIYAN, M., TAHA, T., VUTSINAS, C. and SMITH, M. (2009). FPGA implementation of Izhikevich spiking neural networks for character recognition. In Reconfigurable Computing and FPGAs, 2009 International Conference on, pp. 451–456.

ROWLEY, H., BALUJA, S. and KANADE, T. (1998). Neural network-based face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20 (1), 23–38.

SAÏGHI, S., TOMAS, J., BORNAT, Y. and RENAUD, S. (2005). A conductance-based silicon neuron with dynamically tunable model parameters. In Neural Engineering, 2nd International IEEE EMBS Conference on, pp. 285–288.
SCHMIDT, A., KRITIKOS, W., DATTA, S. and SASS, R. (2008). Reconfigurable computing cluster project: Phase I Brief. In Field-Programmable Custom Computing Machines, 16th Annual IEEE Symposium on, pp. 300–301.

SONG, S., MILLER, K. D. and ABBOTT, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3 (9), 919–926.

STEIN, R. B. (1965). A theoretical analysis of neuronal variability. Biophysical Journal, 5 (2), 173–194.

THOMAS, D. and LUK, W. (2009). FPGA accelerated simulation of biologically plausible spiking neural networks. In Field-Programmable Custom Computing Machines, 17th Annual IEEE Symposium on, pp. 45–52.

TOUMAZOU, C., GEORGIOU, J. and DRAKAKIS, E. (1998). Current-mode analogue circuit representation of Hodgkin and Huxley neuron equations. Electronics Letters, 34 (14), 1376–1377.

TRAUB, R. D., CONTRERAS, D., CUNNINGHAM, M. O., MURRAY, H., LEBEAU, F. E. N., ROOPUN, A., BIBBIG, A., WILENT, W. B., HIGLEY, M. J. and WHITTINGTON, M. A. (2005). Single-column thalamocortical network model exhibiting gamma oscillations, sleep spindles, and epileptogenic bursts. Journal of Neurophysiology, 93 (4), 2194–2232.

UPEGUI, A., PEÑA-REYES, C. A. and SANCHEZ, E. (2005). An FPGA platform for on-line topology exploration of spiking neural networks. Microprocessors and Microsystems, 29 (5), 211–223.

YANG, S., WU, Q. and LI, R. (2011). A case for spiking neural network simulation based on configurable multiple-FPGA systems. Cognitive Neurodynamics, 5, 301–309.