
Edited by: Inês Hipólito, Humboldt University of Berlin, Germany

Reviewed by: Tianqi Wei, University of Edinburgh, United Kingdom; Udaya B. Rongala, Lund University, Sweden

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The neuroplasticity rule Differential Extrinsic Plasticity (DEP) has been studied in the context of goal-free simulated agents, producing realistic-looking, environmentally aware behaviors, but no successful control mechanism has yet been implemented for intentional behavior. The goal of this paper is to determine whether “short-circuited DEP,” a simpler, open-loop variant, can generate desired trajectories in a robot arm. DEP dynamics, both transients and limit cycles, are poorly understood. Experiments were performed to elucidate these dynamics and to test the ability of a robot to leverage them for target reaching and circular motions.

Robot control is still very much a work in progress. While much has been learned of how humans and animals control their bodies (Winter,

One issue with these frameworks is the assumption that the brain directly controls the output of each available degree of freedom; typically a learning agent will adjust its body's motor torques at each time step to produce a desired result in a rigid body system within a given environment (see for example OpenAI Gym; Brockman et al.,

On the other hand, could the complexity of the human body actually be a help and not a hindrance to the perception/control problem? The bio-inspired research agenda known as Embodied Intelligence suggests so (Pfeifer and Bongard,

These latter neuroplasticity-generated spontaneous behaviors, detailed in the book “The Playful Machine” (Der and Martius,

A complicating factor in the practical usage of DEP is the current lack of an analytical solution, despite the research time invested. In all likelihood, even simple DEP systems are too complex to be fully described analytically and so research has tended to be empirical, treating DEP as a pre-existing natural phenomenon. This is not an insurmountable issue; it places DEP within the context of related research into algorithmic information and complexity theory, both areas cited in theories of the development of the human brain (Hiesinger,

Given this, how can we study DEP and map out its potential? First, we must simplify: by temporarily removing environmental feedback we can map out baseline behaviors for DEP, following the methods employed in Pinneri and Martius (

This paper takes the first steps in this direction. By employing “short-circuit DEP” (see below) and with a simple test case, where the output of short-circuit DEP drives a simulated 2 degree of freedom (DOF) robot arm, we show that DEP can be made to accomplish specific goals and that these goals cover a useful region of task space.

DEP describes a way of wiring motors and related sensors together with a neuroplasticity rule, such that a DEP-enabled agent produces a large set of “natural looking” behaviors that respond to interactions with the environment. Summarizing (Der and Martius,

The “classical” version of Differential Extrinsic Plasticity (DEP) comprises two overlapping dynamical systems. The new sensor state x^{t+1} is fed back to the input to start the cycle again; x^{t+1} is also fed to an inverse model that infers the rate of change of the motor torques.

For a two-layer artificial neural network with input layer x_{i}, output layer y_{i}, weights C_{ij}, biases h_{i}, and a tanh activation function, the activation rule is:

y_{i}^{t+1} = tanh(Σ_{j} C_{ij}x_{j}^{t} + h_{i})     (1)

A simple feedback controller for an agent with rotary motors may then be constructed, where y_{i} drives the motor torques and x_{i} is driven by the resulting motor positions (see
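As a concrete illustration, the activation step of such a controller can be sketched as follows. This is a minimal sketch assuming the standard DEP choice of a tanh activation; the variable values are illustrative, not taken from the paper:

```python
import numpy as np

def controller_step(x, C, h):
    """One activation step of the two-layer controller: y = tanh(C @ x + h)."""
    return np.tanh(C @ x + h)

# Illustrative 2-DOF case: a rotation-like weight matrix and zero biases.
x = np.array([0.1, -0.2])          # sensor values (motor positions)
C = np.array([[0.0, 1.0],
              [-1.0, 0.0]])        # controller weights
h = np.zeros(2)                    # biases
y = controller_step(x, C, h)       # motor torques
```

In the feedback loop, y would drive the motors and the resulting motor positions would become the next x.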

The behavior of this dynamical system can be overlaid by a second dynamical system driven by neural plasticity, that is, the evolution over time of the controller's weights. Many plasticity schemes have been studied (see ^{1}

Different plasticity schemes.

Hebbian learning | τĊ_{ij} = x_{i}y_{j} |

Differential Hebbian learning | τĊ_{ij} = ẋ_{i}ẏ_{j} |

Differential extrinsic plasticity | τĊ_{ij} = ẋ_{i}(ẏ_{j} + δẏ_{j}) − C_{ij} |
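In discrete time, the three update rules in the table above can be sketched as simple Euler steps. The time constant τ and unit step size here are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def hebbian(C, x, y, tau=100.0):
    # tau * dC_ij/dt = x_i * y_j  (outer product of activations)
    return C + np.outer(x, y) / tau

def diff_hebbian(C, dx, dy, tau=100.0):
    # tau * dC_ij/dt = dx_i * dy_j  (outer product of derivatives)
    return C + np.outer(dx, dy) / tau

def dep(C, dx, dy, d_dy, tau=100.0):
    # tau * dC_ij/dt = dx_i * (dy_j + delta_dy_j) - C_ij
    # The -C_ij term damps the weights toward zero.
    return C + (np.outer(dx, dy + d_dy) - C) / tau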

Differential Extrinsic Plasticity extends Differential Hebbian learning by introducing an inverse model that maps the new sensor state x^{t+1} back to the inferred rate of change of the motor torques.

In most DEP implementations the inverse model is kept simple and, as in this paper, it is often assumed to be the identity matrix.

The revised update rule uses

One way to think about

This substitution is shown in the final row of

The weight matrix

Finally, the activation rule is modified from Equation (1) to use the normalized Ĉ rather than C.

The combination of these two overlaid dynamical systems produces an agent that cycles through a series of complex behaviors that are responsive to environmental feedback.

A simplified version of DEP was used in Pinneri and Martius, in which the environment is bypassed by setting x^{t+1} = ẏ^{t} and the inverse model is taken to be the identity.

This is effectively Differential Hebbian Learning with damping and normalization.
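A minimal simulation of this short-circuit loop might look as follows. The time constant, the normalization scheme, and the discrete-derivative approximation are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def short_circuit_dep(C0, x0, steps=1000, tau=20.0):
    """Short-circuit DEP: the environment is bypassed, so the change in the
    controller's output is fed straight back as the next input, x[t+1] = dy[t]."""
    C = C0.astype(float).copy()
    x = x0.astype(float).copy()
    y_prev = np.tanh(C @ x)
    trajectory = [x.copy()]
    for _ in range(steps):
        y = np.tanh(C @ x)
        dy = y - y_prev                      # discrete output derivative
        x_next = dy                          # "short circuit": x[t+1] = dy[t]
        dx = x_next - x
        # Differential Hebbian update with damping ...
        C += (np.outer(dx, dy) - C) / tau
        # ... and normalization, keeping the weights bounded.
        C /= np.linalg.norm(C) + 1e-12
        x, y_prev = x_next, y
        trajectory.append(x.copy())
    return np.array(trajectory), C

traj, C_final = short_circuit_dep(np.array([[0.0, 1.0], [-1.0, 0.0]]),
                                  np.array([0.1, 0.2]), steps=500)
```

Running this from different initial C0 and x0 produces the transients and limit cycles discussed below.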

By eliminating the environment, the generated behaviors simplify to a set of predictable limit cycles in the state variables x_{1}, x_{2}.

Example trajectories in x_{1} and x_{2}. The second and third trajectories oscillate between two endpoints. The other trajectories are all rotational.

Map of attractors reached for differing initial x_{1}, x_{2}. The color bar refers to the rotational angle of the attractor, in the case of rotational attractors. Cyan refers to the non-rotational attractors shown in the second and third examples.

Following that paper, a map of the attractors reached based on differing initial conditions for x_{1}, x_{2} is shown in

It should be noted that (Pinneri and Martius,

In the present paper's experiments, as well as the initial values x_{1}, x_{2}, the initial value C_{0} of the weight matrix C matters: C_{0} elicits different trajectories and ultimate limit cycles for each combination of the initial values x_{1}, x_{2}. One way of looking at this is to say that different C_{0} can select different behaviors for a given initial x_{1}, x_{2}.

In the two experiments described, the “short circuit” DEP system is used to drive a simple 2 degree of freedom robotic arm (see

The goal is to find an initial matrix C_{0} that reaches a desired target position p^{⋆} from starting position p_{0}.

The state

so that at each timestep x^{t+1} ← ẏ^{t}.

We can then use a robot arm with segment lengths L_{1}, L_{2}, here 0.5 m, to “read out” the state. The joint angles θ_{1}, θ_{2} are driven by a “driver” function

Note that this is an open-loop controller. None of the reported benefits of environmentally-aware “Classic” DEP are used here, in line with the goal of learning to control a very simple DEP system. The position of the robot's end effector can be considered as a simple transformation or readout of DEP's internal state
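The readout itself is ordinary planar forward kinematics. A sketch using the paper's segment lengths of 0.5 m (function and variable names are illustrative):

```python
import numpy as np

def end_effector(theta1, theta2, L1=0.5, L2=0.5):
    """Position of the 2-DOF arm's end effector for joint angles theta1, theta2."""
    px = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    py = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return np.array([px, py])
```

With both joints at zero the arm is fully extended along the x axis; no feedback from this position reaches the DEP system.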

Two types of task are considered. In the first, the goal is for the robot arm's end effector, starting at position p_{0}, to reach a target position p^{⋆}. For this type of task, a driver function f_{reach} is used.

In other words, the output of

In the second type of task, the goal is for the end effector to trace a circle, using a driver function f_{circle}. This function uses the successive differences x^{t+1} − x^{t} and x^{t} − x^{t−1}, from which θ_{1}, θ_{2} can be defined in the new driver function.

At a fixed point of the dynamics, x^{t+1} → x^{t}.

Under these conditions, θ_{2} will also be a constant.

θ_{2} is derived from the angle between subsequent vertices of

Given an input of an initial end effector position p_{0} and target position or trajectory p^{⋆}, our goal is to obtain an initial matrix C_{0} that will drive the system to reach p^{⋆}. C_{0} is obtained by a search algorithm, detailed in

The 2 × 2 matrix C_{0} has four parameters, each of which here varies between −1 and +1. The algorithm linearly divides the range of each parameter into eight values, giving 8 × 8 × 8 × 8 = 4,096 possible values for C_{0}. A simple grid search is performed, with each value being trialed in a rollout of 20,000 time steps.
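The enumeration and success test can be sketched as follows, where `rollout` is a hypothetical stand-in for a 20,000-step DEP simulation returning end-effector positions:

```python
import itertools
import numpy as np

def c0_grid(n=8, lo=-1.0, hi=1.0):
    """The 8 x 8 x 8 x 8 = 4,096 candidate C0 matrices."""
    vals = np.linspace(lo, hi, n)
    return [np.array([[a, b], [c, d]])
            for a, b, c, d in itertools.product(vals, repeat=4)]

def grid_search(rollout, target, tol=0.01, steps=20_000):
    """Return the first C0 whose rollout brings the end effector within
    tol of target; rollout(C0, steps) -> (steps, 2) array of positions."""
    for C0 in c0_grid():
        positions = rollout(C0, steps)
        if np.min(np.linalg.norm(positions - target, axis=1)) < tol:
            return C0
    return None
```

The circular task replaces the distance criterion with the mean squared radius error over one full rotation.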

In the case of the reaching task, at each time step of the rollout, if the distance between p^{t} and p^{⋆} is within a given tolerance ϵ, then success is declared. Ten random starting positions p_{0} and 10 random targets p^{⋆} were combined to give 100 trials, each of which is an execution of the algorithm in

In the case of the circular task, success is declared after a full rotation of the end effector, where the mean squared radius error with respect to the desired radius is within ϵ. Five random starting positions p_{0} and five random radii were combined to give 25 trials, each of which is an execution of the algorithm in

The experiments were implemented in Python on Jupyter notebooks. The full source code may be downloaded from GitHub^{2}

The trajectories in DEP space produced in the experiments generally consisted of a transient phase, where the system “wanders” in the x_{1}, x_{2} plane, followed by a limit cycle phase. The Reaching task leveraged both the transient and limit cycle phases, while the Circular task leveraged the limit cycles.

One hundred trials of the Reaching task were performed. In every case, the system reported success: it found a path to all end effector targets from all end effector starting positions. The tolerance ϵ had a value of 0.01 m.

Trajectory examples are shown together with their corresponding C_{0} matrices (

Examples of Reaching trials with:

A second example, in the second row of

Twenty-five trials of the Circular task were performed. In every case, the system reported success: it managed to describe a circular trajectory of at least one rotation where the mean squared radius error with respect to the desired radius was less than ϵ, in this case 0.01 m.

Trajectory examples are shown in

Examples of Circular Trajectory trials with:

In the second example, in the second row of

The relationship of search time to tolerance ϵ can be seen in

Search complexity and variance increase with lower error tolerance, for

The controller described in this paper is unlikely to signal the end of inverse kinematics. To borrow Dr. Johnson's phrase, it “is like a dog's walking on his hinder legs. It is not done well; but you are surprised to find it done at all” (Boswell,

First, what is the prognosis for DEP as a control system? The present controller has reduced a high dimensional control problem to the simple selection of one of 4,096 different discrete values of the C_{0} matrix. The original motivation for this paper was to find a way to leverage DEP within the context of Reinforcement Learning. C_{0} provides a low dimensional interface for higher level systems to exploit. Yet most of the solutions are indirect, taking time for the end effector to reach its goal.

The search algorithm could be extended to minimize the number of time steps needed to reach the desired target position or trajectory. Different trajectory types could be produced with different driving functions, although fewer functions would be preferable to more. Driving functions could be abstract, as they are here, or derived from physical models of body elements, such as springs, tissue, or muscles.

There is scope for improving the search algorithm itself beyond a simple grid search, depending on what patterns, if any, can be found in the mapping of target to C_{0}. Are there basins of attraction for C_{0}? Is this controller learnable in a way that generalizes?

Once understanding of the core behavior of “short-circuit” DEP has improved, environmental awareness, one of the core supposed advantages of the neuroplasticity rule, could be reintroduced. This opens the way to recovery from perturbations and short term, “reflex” reactions to changes in the environment.

A continuing expressed frustration in the DEP literature is the lack of a full analytical treatment of DEP behavior. That may be due to a lack of human resources applied to the problem, or it may be that a full treatment is simply intractable. Some algorithms are mathematically “undecidable,” which is to say that their behavior cannot be predicted without executing the algorithm itself. Perhaps DEP falls into this category.

In either case, this paper follows recent work in taking an empirical, engineering approach to analysing DEP, rather than a theoretical treatment. There remain many questions to be answered.

DEP has produced some fascinating simulations, with realistic looking and intriguing behaviors, such as gait switching, overcoming obstacles, and interaction with devices such as handles. How much of the observed behavior is due to DEP as a neuroplasticity rule and how much is due to the particular body morphology of the simulated agents? Passive walkers also produce realistic behaviors and respond to the environment in a limited way, yet they have no neuroplasticity at all. Clearly, the agent's behavior is generated by the complete system of neuroplasticity plus body plus environment. How can we disentangle the contributions of each?

Finally, does DEP scale? What are the limit cycles of higher dimensional DEP systems? Our understanding of DEP behavior is only just beginning.

DEP is an example of self-organization in action: of complexity generated from simple rules. Self-organization is easy to spot, but hard to design, yet may be necessary to enable long-term learning processes such as evolution to work effectively (Kauffman,

The leverage of pre-existing complex behaviors is seen in Physical Reservoir Computing (PRC), a field that applies a thin layer of learning over highly complex, pre-existing dynamics in a real or simulated body. A PRC system leverages a set of dynamical behaviors as if they were basis functions and combines them using a shallow artificial neural network. The network can then be trained to perform some desired function. The dimensionality of a problem that might require training a very deep neural network has been reduced to that of training a shallow one.
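The “thin layer of learning” in PRC is typically a linear readout trained over recorded reservoir states. A minimal ridge-regression sketch (the names, dimensions, and ridge parameter are illustrative):

```python
import numpy as np

def train_readout(states, targets, ridge=1e-6):
    """Fit linear readout weights W so that states @ W approximates targets.
    states: (T, N) trace of reservoir activity; targets: (T, K) desired outputs."""
    N = states.shape[1]
    return np.linalg.solve(states.T @ states + ridge * np.eye(N),
                           states.T @ targets)

# The reservoir's dynamics act like basis functions; only W is trained.
rng = np.random.default_rng(0)
states = rng.standard_normal((200, 5))
W_true = rng.standard_normal((5, 1))
W = train_readout(states, states @ W_true)
```

Only the shallow readout is optimized; the reservoir itself, like DEP, supplies its dynamics for free.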

In the case of PRC, the pre-existing complexity is physical. In other cases it may be algorithmic. A curious example is the history of procedural content generation in computer games (Smith,

What is unclear is whether and to what extent nature has leveraged these potential sources of complexity. In developmental biology, there is a gap between the information specified in the genome and the complexity of the end product (Hiesinger,

Differential Extrinsic Plasticity remains a fascinating phenomenon. Neuroplasticity remains an under-explored component of Embodied Intelligence and a rich opportunity for future work.

The original contributions presented in the study are publicly available. This data can be found here: GitHub,

SB researched DEP, performed the experiments, and wrote the paper under the supervision of FI, with suggestions and comments from AA. All authors contributed to the article and approved the submitted version.

This project was possible thanks to EPSRC Grant EP/L01-5889/1, the Royal Society ERA Foundation Translation Award (TA160113), EPSRC Doctoral Training Program ICASE Award RG84492 (cofunded by G's Growers), EPSRC Small Partnership Award RG86264 (in collaboration with G's Growers), and the BBSRC Small Partnership Grant RG81275.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

The authors would like to thank Josie Hughes for her help and encouragement, and Georg Martius and Cristina Pinneri for giving access to the code behind their paper.

The Supplementary Material for this article can be found online at:

Evolution of the dynamics in the first row of

Evolution of the dynamics in the second row of

Evolution of the dynamics in the third row of

Evolution of the dynamics in the first row of

Evolution of the dynamics in the second row of

Evolution of the dynamics in the third row of

^{1}Using Hebbian learning here would give a system that resembles a continuous variable Hopfield Network, but with normalization and an inverse model.

^{2}