Static Shape Control of Soft Continuum Robots Using Deep Visual Inverse Kinematic Models

Soft continuum robots are highly flexible and adaptable, making them ideal for unstructured environments such as the human body and agriculture. However, their high compliance and maneuverability make them difficult to model, sense, and control. Current control strategies focus on Cartesian space control of the end-effector, but few works have explored full-body control. This study presents a novel image-based deep learning approach for closed-loop kinematic shape control of soft continuum robots. The method combines a local inverse kinematics formulation in the image space with deep convolutional neural networks for accurate shape control that is robust to feedback noise and mechanical changes in the continuum arm. The shape controller is fast and straightforward to implement; it takes only a few hours to generate training data, train the network, and deploy, requiring only a web camera for feedback. This method offers an intuitive and user-friendly way to control the robot's 3-D shape and configuration through teleoperation using only 2-D hand-drawn images of the desired target state without the need for further user instruction or consideration of the robot's kinematics.


I. INTRODUCTION
S OFT continuum robots bend continuously along their length via elastic deformation, making them highly flexible and adaptable [1]. They provide high dexterity and maneuverability over constrained unstructured spaces using fewer actuators and simple control strategies [2]. This makes them ideal for robotic exploration [3] and inspection [4], [5] in constrained environments, especially in medical applications [6], [7], [8], [9], [10]. An example of this type of robot is STIFF-FLOP, designed for minimally invasive robotic surgeries (see Fig. 1). However, the high mechanical adaptability and dexterity of soft continuum robots present numerous challenges in their control.
Unlike the control of rigid robotics, the control of soft robots is nontrivial [11], [12]. Soft continuum arms exhibit large to infinitely many degrees of freedom with elastic materials that are highly nonlinear, making accurate analytical modeling an arduous task. Significant variabilities in their design and actuation make the development of a general modeling framework difficult [13], [14].
Most controllers for soft continuum robots are based on some kinematics model. Kinematic controllers take on a steady-state assumption for the soft manipulator, where they are treated as quasi-static structures. The most common modeling technique used for soft continuum arms is the constant curvature (CC) model, where each section of the arm is assumed to bend with CC. Hence, the kinematic shape of the manipulator can be represented by the arm length, curvature, and its angle [15], [16], [17], [18], [19], [20]. Piecewise constant curvature (PCC) is an extension of CC, which treats the shape of each section itself as a series of finite curved links [21], [22]. These models, however, assume that the manipulator or its sections are uniform and symmetrical with negligible external loads or torsion [15]. More complex models with increased accuracy have also been proposed such as the variable constant curvature (VCC) [23], [24], the spring-mass-damper model [25], the Cosserat rod [26], [27] and beam theory models [16], [28], and finite-element models (FEMs) [29], [30], [31]. Once the forward kinematic model is developed, controllers can be achieved by inverting the model. For CC models, this can be done using several differential inverse kinematics (IK)-based approaches [19], [20], [32]. Similar differential IK-based controllers have also been developed for PCC [33], [34] and VCC [23] kinematic models. Alternatively, optimization-based methods can be used for solving the inverse Fig. 1. Prototype of the soft continuum robot arm used in this article emerged from the EU-project STIFF-FLOP. This version has two segments that have an overall diameter of 11.5 mm and a length of 59 mm. Actuation will result through pneumatic air pressurization of three chamber pairs; each has a diameter of 1.5 mm.
problem [22]. Camarillo et al. [28] were able to achieve configuration open-loop control by direct inversion of the beam theory kinematics. The FEM has demonstrated promise in accurately modeling and controlling soft systems [29], [30], [31]. Nonetheless, their application necessitates preexisting computational models of the robot, which may not be readily accessible, and the need to fine-tune parameters, such as determining the optimal number of nodes, to maintain sufficient accuracy without sacrificing real-time control frequency.
Kapadia et al. [17], [18] developed the first dynamic closedloop controller using a CC-based kinematic model and an Euler-Lagrangian dynamic model for a soft arm, capable of controlling the robot in task space. Another example that uses the CC model with dynamics is the work by Falkenhahn et al. [35]. They developed a dynamic controller that optimized a trajectory in terms of time and actuator jerk. They also developed a controller with feedback linearization capable of fast trajectories that minimizes dynamical errors [36]. Della Santina et al. [34] extended PCC-based point mass models and showed impedance control of the end-effector for interactions with an unstructured environment. Alqumsan et al. [37] introduced a sliding-mode dynamical controller for a simulated Cosserat rod model. Spinelli and Katzschmann [38] utilized model-predictive control for the task-space control of a pneumatic continuum robot, which utilized PCC with augmented rigid-body model assumptions.
These model-based controllers are, however, heavily reliant on the underlying analytical models, which are developed with numerous assumptions. These can cause large discrepancies between the model and the physical robot if factors such as material hysteresis, friction, mechanical asymmetry, torsion, fabrication imperfections and other external loads are not considered or compensated for [27], [39]. As a result, researchers have taken an interest in controllers that are hybrid, combining analytical models with learning-based approaches, as well as completely model-free controllers. These types of controllers utilize real-world data to build internal models and controllers, requiring minimal to no prior knowledge of the physical robot arm or the surrounding environment.
Regarding hybrid controllers, Braganza et al. [40] utilized a combination of a feedforward neural network and a nonlinear feedback component to create a control strategy that compensates for uncertain dynamics during trajectory tracking without requiring accurate knowledge of the continuum robot's dynamical model. Queißer et al. [41] combined feedback control with a feedforward control that approximates the continuum arm's inverse dynamics under equilibrium. They used this technique for kinesthetic teaching of the robot's posture. Wang et al. [42] implemented a hybrid adaptive control approach that employs neural networks for learning the robot's IK and online adaption of proportional-integral-derivative control parameters for path tracking. Bruder et al. [43] utilized Koopman-based modelpredictive controllers (MPCs) to track trajectories when the manipulator has an unknown payload. Tang et al. [44] introduced an iterative learning model-predictive control method for softbending actuators, using the iterative learning controller (ILC) to refine the model. Their results demonstrated that their approach outperformed ILC and MPC independently. Hanh et al. [45] developed a method that uses dynamic motion information to refine their FEM's viscoelastic parameters. Their approach enabled them to design an open-loop control strategy in simulation, which was executed on a physical soft robot. However, their work requires the use of expensive motion capture systems.
For model-free controllers, Yip and Camarillo [46] were one of the first to introduce a completely model-less controller. They achieved this through online empirical estimations of the robot's Jacobian. Although it allows the robot to navigate in unstructured environments, it is seemingly limited to the control of the robot's tip. Alambeigi et al. [47] estimated the Jacobian of a continuum arm using the optimization Broyden update rule to manipulate a target point on a deformable material such that it corresponds to the desired point in their endoscopic camera's image space. Li et al. [22] employed an adaptive Kalman filter controller for trajectory tracking. Fuzzy logic controllers have also been explored [48]. Giorelli et al. [49], [50] were the first to implement feedforward neural networks for learning a one-to-one IK mapping of the actuator space to the task space. Hence, their controller is incapable of tolerating redundant solutions for the same tip position in real soft continuum arms. Rolf and Steil [51] proposed the use of goal babbling for obtaining movement samples, which can be used to bootstrap learnt IK solvers. In the process of generating the samples, a redundancy weighting scheme is applied to encourage smooth solutions for target points in the task space. One major issue with learning the IK is the nonuniqueness of IK solutions that are enclosed in a concave set, which makes learning global IK intractable. Thuruthel et al. [52] proposed a formulation that can achieve direct inversion of forward kinematics (FK) through linearization at the current state. Sahoo and Chakraborty [53] expanded on this work by employing a meta-learning approach to reduce the training sample required for adapting the network to unknown tip-loading conditions. Distal learning is another method for inverting the kinematics of a redundant robot and has been used for soft robots by Melingui et al. [54]. Learning-based approaches have been notably effective for dynamic control of soft robots, where they have been employed for learning an accurate forward dynamics model [55], [56], [57] or for directly learning a closed-loop control policy using reinforcement learning [58], [59], [60]. Irrespective of the method used to control the soft continuum robot, be it model-based, model-free, or hybrid models, control approaches require expensive motion-tracking sensors to achieve closed-loop control [19], [43], [52].

A. Related Work
All the works mentioned above focused mainly on taskspace/end-effector control of the soft continuum arm. Using analytical models for controlling the robot's shape in configuration space, through joint space control, requires making simplifying assumptions. As a result, the robot's actual shape may differ from that predicted by the model. Complete authority over the shape of the continuum robot, however, is critical in scenarios where minimal collisions or whole-arm path planning are necessary, such as in teleoperated endoscopy or minimally invasive surgery of the human body [61], [62].
Shape control of soft continuum robots is a straightforward problem when simple geometric models are used. Bajo et al. [63] showed that the use of both extrinsic and intrinsic sensory information can lead to better regulation and shape tracking performance, mitigating the effects of actuation coupling for a CC model. Wang et al. [39] extended the CC theory with the Kirchoff elastic rod theory for open-loop shape control of a continuum robot designed for in situ aeroengine maintenance. Various works have implemented open-loop follow-the-leader techniques, where desired arm configurations based on CC are traced by extendable robotic arms [64], [65]. Similarly, Santoso and Onal [66] investigated the use of CC with a damped leastsquares Jacobian for the closed-loop control of an origami-based continuum robot to grow into desired shape configurations. Bern et al. [67] created soft "plush" robots which they modeled using the 2-D FEM. To control the robots and move them to a desired 2-D position, they developed a simple and intuitive method of dragging mesh nodes. They then extended this work to the locomotion control of a 3-D soft cable-driven quadruped robot [68]. The feasibility of applying their method to any type of intricate soft robot comprising diverse and functional materials, which give rise to highly irregular structures and employ different types of actuation such as pneumatics or magnetics [69], remains uncertain. Such robots would necessitate a more refined mesh, a greater number of nodes, and an appropriate computational model [70].
Ouyang et al. [71] developed a control approach that used shape correspondence to command the shape of the robot to a hand-drawn curve provided by the user. Their controller uses an online estimation of the Jacobian, with the aim of minimizing the error between the desired curve approximated by cubic spline interpolation and feature points located on the robot. The shape of the robot is then approximated by PCC. Hand-drawn shapes are much more intuitive for users operating the continuum robot.
The development of completely 3-D kinematic shape controllers with an intuitive interface would enhance the versatility of continuum robot arms by enabling direct control to more complex pose configurations [64], [66]. CC and PCC models approximate a large degree-of-freedom arm with a lower dimensional and discretized representation. Extending these control strategies to more complex geometric models like Cosserat Rod, however, is not trivial. Although FEMs show promise, they tend to grow in complexity when dealing with nonhomogeneous structures and functional materials with various forms of actuation methods. Likewise, they also require preexisting computational models. Inversion of fully 3-D kinematic shape models scalable to any arbitrarily complex robots with appropriate representations of the target shapes is a challenge yet to be addressed in this field.
To the best of our knowledge, this article presents the first demonstration of a model-free shape controller for a soft continuum robot. We propose a deep visual IK model for the shape control of a soft redundant continuum robotic manipulator. Based on steady-state assumptions and differential IK, a unique methodology of representing the state of the soft robot as images is presented. The method offers relatively straightforward and fast learning of the differential IK without requiring a priori knowledge about the arm or its environment, such as shape, size and geometry, actuators, hysteresis, friction and internal loads, mechanical asymmetry, and torsion. The method takes only 3 h to generate motor-babbling data (generation of motion through random actuation values), train the network, and deploy on a physical continuum robot.
Furthermore, minimal sensing is needed, requiring only a simple color camera to learn the mapping between the actuator space and the proposed configuration space representation. The generalizing ability of the deep visual model enables a more user-friendly and intuitive method of controlling the soft manipulator, simply by drawing the desired target shapes without any prior training on the dataset, or the need for further user instruction or consideration of the robot's kinematics.
The rest of this article is organized as follows. Section II first explains the formulation of the learning algorithm in task space, followed by the description of the image-based extension. Section III details the simulated spring-mass-damper continuum arm on which the image-based method is trialed to show empirical evidence of the image-based method's efficacy, which is given in the section after. Section V then details the physical STIFF-FLOP manipulator and the experimental setup. This is followed by Section VI showcasing the experimental results. Finally, Section VII concludes this article.

A. Learning Cartesian Kinematic Controllers
For soft robots, kinematic relationships can be formulated using steady-state models [13]. At steady-state conditions, the forward kinematics of a soft robot can be represented as a surjective function from the actuator space q ∈ R n to the task (1) The actuator space q typically consists of the state of the actuators of the soft robot (e.g., pneumatic pressure, tendon forces, etc.). The task space x is typically represented by the Cartesian end-effector pose.
Learning the inverse of this forward mapping is not trivial because of the high redundancy in the system. Due to the high redundancy of a soft CR, there are infinite valid solutions to the inverse problem. Moreover, these infinite solutions do not form a convex set, making the direct learning of the IK invalid [52], [72]. Direct inversion of the forward differential kinematics, however, can be done through linearization at some arbitrary feasible actuator state q 0 . The forward differential kinematics can be obtained by taking the derivative of the forward kinematics at this state (q 0 )ẋ where J is the Jacobian matrix that maps the actuator velocitiesq to the end-effector velocitiesẋ at state q 0 . Further discretization of (2) using Taylor expansion (ignoring second-order terms and above) allows for learning a local IK [72] Δx ≈ J(q 0 )Δq.
To frame it as a learning problem, this expression can be expanded and rearranged as where G is the inverse of the Jacobian matrix, q i and q i+1 are the actuator states at the current and the next time steps, respectively, and likewise for task-space states x i and x i+1 . Training data can then be obtained through random actuator motion (motor babbling) ensuring spatial locality (|q i+1 − q i | < E) [52]. These data can then be learnt by a simple neural network. By providing the target points as x i+1 to the learnt network, the network outputs q i+1 , which brings the robot configuration closer to the target. Repeating the process with the updated q i and x i will eventually bring the Cartesian end-effector coordinates to the desired location, providing that the target location is not physically obstructed and is geometrically reachable, with a task-space tracking sensor that has reasonable accuracy. This is the learning equivalent of the resolved motion rate controller, which also makes the controller robust to inaccuracies in the learned representation of the Jacobian inverse [73].

B. Learning Kinematic Shape Controllers
The theory described above, which was developed for controlling the pose of an end-effector in Cartesian space, can also be applied to other types of task-space representations. Although control of the tip position is sufficient for many tasks, due to the redundancy of the system, the resulting shape of the soft robot is not within our control authority and is heavily influenced by the initial configuration and environmental constraints. This is problematic in scenarios where a certain specific shape is required in order to reduce environmental interactions, such as endoscopic operations [6], [61], [62]. In addition, there are several tasks where shape control is more intuitive for the user, such as in Agritech [74] and other constrained environments [3].
Due to the large degrees of freedom for a soft continuum robot, there is no single method for shape parameterization. CC models have largely been used for shape parameterization, but are based on several simplifying assumptions that parameterize the shape of a soft robot using low-dimensional representations. In this article, we propose the extension of the data-driven IK controller (see Fig. 2) to take image states as a high-dimensional hyper-redundant task-space representation I. The static images are a function of the actuator state q ∈ R n and the environmental conditions η The region of the image that is dependent on the actuator variables obeys the static relationships as described in the previous section If I q i+1 and I q i can be extracted from the observations of I i+1 and I i , then an image-based kinematic controller can be developed. Assuming that a deep network can extract this information implicitly, the mapping (I i+1 , Iq i , I i ) → (q i+1 ) can be sampled and learned as before. Note that Iq i is the transformation of the vector q i into a 2-D image array to ease the concatenation of I i+1 , Iq i , and I i as network inputs. Varying lighting conditions, background noise, and camera displacements, in theory, would not affect the mapping accuracy as long as I q is visible. Occlusions, hence, would affect the accuracy of the mapping. Deep convolutional neural networks (CNNs) are used to learn the image-based differential IK mapping. These networks are excellent at extracting image features from noisy images while exhibiting some shift and scale invariance [75], [76], [77].

A. Simulation Arm
A simulation environment is used to test and validate the image-based kinematic shape controller. We created a simulated continuum arm in MATLAB and Simulink using the Simscape library.
The continuum arm comprises multiple individual modules. As shown in Fig. 3, each module is a three-spring-mass structure, with three springs attached concentrically to a disk with mass. Each spring is a force-driven spring-damper object defined by an equilibrium position, and stiffness and damping coefficients. The stiffness and damping coefficients are set to relatively high values 10 kN · m −1 and 10 N · s · m −1 , respectively, to minimize any dynamical behaviors. They are driven by external force inputs from a controller. Each disk has three degrees of freedom. A translation in Z-axis and rotations about Fig. 2. Convolutional-neural-network-based kinematic shape controller. The deep network provides the control actions that iteratively take the robot to the target shape given the current shape and control action. X and Y axes are related to the previous disk. For our simulation, we mounted ten modules in total to form two sections containing five modules. For each section, consecutive springs receive the same control signal. For example, spring objects in Spring 1 receive the same control signals. This was done to conceptually match the real STIFF-FLOP arm shown in Section V-A. Our simulation robot was intentionally created with a distinct geometry, materials, and actuation mechanism compared to the STIFF-FLOP robot to demonstrate the image-based IK solver's capability in controlling vastly different continuum arms. Gravity was configured to point from the base to the tip to simulate a downward mounting arrangement. However, we deliberately kept the orientations between the robot and the camera to be different in simulation and in reality to demonstrate that the controller can effectively operate even when the camera and the robot have different relative orientations.

B. Simulation Experimental Setup
Training data were obtained through a quasi-static motor babbling method [72]. The vector q i holds the individual actuation values for each of the six springs in the simulation robot. The actuation input applied to each of the six springs was limited to 150 N to produce an arm configuration with sufficient complexity without causing instability. Random shape trajectories were generated by adding stochastic force values Δq to the current force inputs that satisfy 0 ≤ q i ≤ 150 N every 5 s. A Sigmoid smooth-step function S(t) [see (10)] was multiplied to Δq to ensure smooth trajectories in the generated training data. This is done to ensure that the steady-state assumption used in deriving the theory remains valid The parameters a, t 50 , and t r are used to make the smoothstep function reach the Δq values over a time period of 3 s, with 0 ≤ t ≤ 5 s such that the newly generated random shape is reached when the value S(t) is equal to 1. The values used for the parameters a, t 50 , and t r are 1, 2, and 1, respectively. These were determined empirically. Saturation after 3 s was done to include training examples that stabilize to a final shape.
Motor babbling data of 12 000 s were gathered. State images I were obtained at a rate of 1 Hz via screen capture giving 12 000 data samples. The image states were converted to gray scale and downsampled to 128 × 128 to speed up learning. Actuator image inputs Iq i were obtained by repeatedly copying the current actuation signals q i into a 2-D image array. This allows us to concatenate the image state and actuator state easily. The next state I i+1 , current state I i , and actuator inputs Iq i were resized to a height and width of 128 × 128, followed by concatenation to form an 128 × 128 × 3 input. The six-element vector q i+1 was kept as is for the data regression labels.

C. Model-Free Deep Visual Network IK Solver
The controller uses a bespoke architecture to learn the map- Fig. 2). The architecture was built systematically by incrementally adding convolutional and rectified linear unit (ReLu) activations layers until the rootmean-squared error on a small 10% subset of the training data no longer improved. The same architecture is then used for real-world experiments. The network has six regression node outputs for controlling the six simulation springs q i+1 . The control loop for the real arm is given in Fig. 11, which is similar to the simulation control block. The only difference is that the control outputs are fed into the simulation spring objects rather than through pneumatic regulators.
Training data were split into 10% and 90% for training and testing, respectively. The network was trained for 1500 epochs using the Adam optimizer with an initial learning rate of 0.005 at a learning drop rate of 0.99, which drops at every 100 epochs. The state images I i+1 and I i were augmented with random speckle noise, random translations and rotations, and random occlusions to obstruct the visibility of the target and feedback states during training to make the controller more robust to noise. The data were not normalized and no overfitting was observed. For the experiments, the controller is run at 10 Hz, even though the training data were obtained at 1 Hz. It was observed that this strategy led to smoother trajectories and faster target convergence.

IV. SIMULATION RESULTS
Four simulation experiments were undertaken. During the experiments, no online changes were made to the network weights.
The first experiment was to qualitatively verify the imagebased IK controller for random various target shapes taken randomly from the validation dataset, shown in Fig. 4. During the experiments, the arm starts from a home configuration where it is initially straight and all the actuation values are set to zero. The controller requires multiple steps to converge to the target shape. Hence, for each case, the controller ran for 60 s to obtain the final resulting shape state. Note that the IK solutions provided by the controller are not unique. It is dependent on the initial configuration of the robot, and hence, the controller is more robust to model inaccuracies.
The second experiment was to determine the robustness of the learned controller to feedback I i translational and rotational noise to simulate the effect of the camera being displaced out of position and orientation after learning. Twenty-five more shapes were gathered to obtain an average performance of the controller. The overlaid images for these remaining shapes are given in the supplementary material. For the translation noise, the feedback images I i were translated in the X and Y directions. Black pixels were used to pad the images after translation (see Fig. 5(b) for an example). Results are given in Fig. 6. For the rotational noise, the feedback images are rotated clockwise and counterclockwise. As before, black pixels were used to pad the images after rotation (see Fig. 5(c) for an example). The results are given in Fig. 7. The state errors were obtained by the image pixel subtraction of the final resulting state I i to the target state I i+1 images  and averaging the absolute of the error image. Note that the nonaugmented final feedback images were used to gather state errors.
Figs. 6 and 7 demonstrate that the controller's accuracy is greater in the absence of noise, as evidenced by lower means and smaller standard deviations. Despite the increased severity of both types of noise, the fact that their standard deviations significantly overlap suggests that the controller's performance remains unchanged. However, the higher variances indicate that the performance of the controller with noise is also dependent on the target shape. Note that no direct intervention is applied to the network to reduce the error between the target and the feedback. The error reduction arises implicitly based on the learning architecture and the sample data used for learning.
The third experiment was to verify the robustness of the controller when the feedback images were partially occluded. For the same 25 shapes used in the third experiment, a black box is added to the feedback state I i to simulate partial occlusion.   Table II. Although the black box is placed randomly, the same pixel position was used for all 20 shapes (see Fig. 5(d) for an example of this occlusion). Results are given in Table I. From this table, it can be seen that the standard deviations between no noise and occlusion noise overlap greatly, indicating that the control is robust even when the feedback is partially occluded.
The fourth experiment was to verify the robustness of the controller to viscoelastic changes in the soft material properties of physical arms such as creep or stress relaxation. These can occur from usage over time due to periodic strains and stresses from actuation, which can induce permanent deformations [78]. Thus, softening the material over time. To model this behavior, we tested the controller on simulations with reduced spring stiffness. The original simulation stiffness 10 kN · m −1 was reduced in increments of 5% up to 20% reduction. Note that 10 kN · m −1 is the stiffness used to gather the motor babbling data. The same target shape was used for all simulation models with varying stiffness using one of the shapes in Fig. 4. For this particular target shape configuration, only springs 1, 2, and 6 need to be actuated. From Fig. 8, it can be seen that the image-based controller can compensate for the lowered stiffness values by generally reducing the required spring forces in order to achieve   Fig. 8).

A. STIFF-FLOP Continuum Robot
Our proposed static shape controller is implemented on a miniaturized soft, pneumatically actuated manipulator to validate and demonstrate its efficacy. The fundamental design and manufacturing process (as shown in Fig. 9) has been introduced in an EU FP7 project called STIFF-FLOP [79], [80], [81]. Hence, the soft robot in this article is referred to as the STIFF-FLOP manipulator, a cylindrical robotic device made of silicone (Ecoflex 00-50 Supersoft, SmoothOn), with six fully fiber-reinforced chambers. Two adjacent chambers are internally connected together via 1-mm silicone pipes and actuated as one chamber pair. The molds are 3-D-printed using Tough2000 resin (Formlabs Form 3). Following a five-step fabrication process [as illustrated in Fig. 9(a)], the final robotic manipulator has a diameter of 11.5 mm. A central working channel with a 4.5-mm diameter is preserved for feeding through instruments, e.g., to conduct surgical tasks. Details on the dimension of the robot can be found in Fig. 9(b). Two or more manipulators can be connected in series via 3-D-printed connection plates, as shown in Fig. 9(c).

B. Experimental Setup
The STIFF-FLOP continuum arm is mounted upside down on a table platform (see Fig. 10). Six SMC regulator valves rated with a maximum pneumatic output of 0.5 MPa are used to actuate the robot's air chambers. An Arduino Mega is interfaced with six MCP4725 digital-to-analog converters (DACs) through a multiplexer for providing the 0-10 V analog voltage control inputs required by the valves. A Lenovo Webcam with a resolution of 1920 × 1080 is used to capture the image states of the robot. A workstation with an RTX 3070 graphics unit and an Intel i7 processor is used for data processing and learning. The whole platform is placed inside a photo booth for better lighting conditions. Similar to the simulation arm, training data were obtained through the described quasi-static motor babbling algorithm. The training data were not augmented to speed up the learning progress. The maximum control voltage was set to 3 V for all chambers that correspond to 150 kPa, the chamber's pressure  limit. As before, random shape trajectories were generated by adding stochastic pressure values Δq multiplied by the Sigmoid function [see (10)] to the current pressure value q i . The parameters a, t 50 , and t r all had the value of 1. These were also determined empirically. A new trajectory is generated every 5 s. Due to the multiplexing of the DACs, the fastest control frequency allowable is at 2.7 Hz. The STIFF-FLOP has a tendency to bend more in its lower module compared to the upper section due to the influence of gravity on the whole assembly. The upper module also houses the three actuation pipes of the lower section, which makes it stiffer. Similar to the simulation goal babbling, state images I were, therefore, obtained at a rate of 2.7 Hz and were subsequently downsampled to 0.27 Hz for the training data, resulting in approximately 1200 training images. It took approximately an hour and 30 min to gather the whole training data including the setup time. The training took approximately 2 h using the workstation. The network takes an average of 0.002 s to process the image inputs. The state images were cropped, grayscaled, and resized to 128 × 128 followed by concatenation to form the individual training samples. As done previously with the simulation setup, the output of the network, which corresponds to the six actuator inputs for the next time step q i+1 , is used to update q i . This is then transformed into a 2-D matrix of size 128 × 128 (through repeating values) and inserted as a layer between states I i and I i+1 , making it feasible to input it back into the convolutional network controller.
The same network architecture and training parameters were used; however, the network was only trained for 500 epochs due to the lower number of training samples. The complete control diagram of the experimental setup with the proposed CNN IK solver is shown in Fig. 11. A low-pass filter is used on the output of the IK solver, where the difference between the predicted force q i+1 and the current force q i is multiplied by a small gain value, which is then subsequently added to q i+1 to form q i+1 . The low-pass filter was used to ensure that the arm remains quasi-static during movement. The current state I i is obtained via the camera feedback. q i+1 is also fed back as the current force inputs for the next iteration.

VI. EXPERIMENTAL RESULTS
To test and validate the performance of the image-based IK solver on the real STIFF-FLOP arm, four experiments were conducted. The first was to qualitatively validate the controller for six random trajectories taken from the validation dataset. Similar to the simulation experiments, the arm starts in a straight home configuration where all the pressure actuation inputs are set to 0 V. Fig. 12 shows the ability of the solver to generalize IK solutions and reach arbitrary target shapes when starting from the home position. The CNN IK controller is able to accurately determine the correct pose of the end-effector, such as pointing toward or away from the camera, using only the low-resolution 2-D input images and the position of the gray tip of the robot (see Fig. 9). This gray tip essentially acts as a feature point. Nonuniqueness of the 3-D space projection to the 2-D image plane can occur when the robot is bending directly away from the camera (where the tip is fully occluded by the body); however, this issue can be addressed by using additional feature points and multiple camera views in future work. Fig. 13 shows the average error for the six target shapes over time. The errors were obtained by using Otsu thresholding [82] on I i+1 and I i to obtain their masks and subtracting the difference [see Fig. 15(b)]. This was done to remove the effects of noise caused by the flicker of the camera and the slight variances in the lighting conditions. These errors were then normalized and averaged. This metric, however, does have limitations. As this metric uses image masks, in some cases such as when the robot is facing directly toward the camera, nonunique 3-D projections can also occur. However, the six target shapes shown in Fig. 12 13. Average mask error between the target and the feedback states for the six target shapes given in Fig. 12 over the duration of 60 s. Mask errors were obtained by the image subtraction between I i (feedback) and I i+1 (target) states after Otsu thresholding, followed by normalization and averaging of the absolute image errors. On average, the controller converges to the final shape after 10 s. do retain their uniqueness when projecting from 3-D to 2-D, even after obtaining the image mask. For instance, a posture bending toward the camera will appear larger. This makes the masking metric suitable for evaluating the performance of the controllers on the real robot, and using multiple camera views will enhance its robustness. In the future, work will be done to further improve this metric by incorporating 3-D shape sensing technologies.
From Fig. 13, it can be seen that, on average, the error quickly converges at around 10 s or 27 time steps. The relatively high error standard deviations compared to the repeatability test in the following experiment [see Fig. 15(c)] shows that the accuracy of the controller is dependent on the desired target shape. The starting average mask error between the robot's straight position and the target shape was 0.0354. At the final step, the average error value dropped to 0.0133, which represents a 63% decrease in error. This demonstrates that the control system was able to successfully match the target, resulting in a reduction in error, as seen in Fig. 12. It is important to reiterate that error reduction arises implicitly based on the learning formulation. However, the error does not reach zero, as there are minute differences between the resulting final and the desired target states, as seen from Fig. 12. Factors such as the lighting condition, slight changes in the table position with respect to the camera, suboptimal learning, nonlinearities in the physical material properties, such as hysteresis, imperfect air sealing, and the nonuniqueness of the solutions, can contribute to the observed error. Unlike the simulation environment, the real-world data are filled with noise and variabilities, even with the arm being contained in a semiclosed system. An example of this is the differing average lighting intensities between the target and feedback images for the six shapes given in Fig. 12, which can be observed from the histogram in Fig. 14. From this figure, it can be seen that the feedback states are generally darker than the desired target states. From the supplementary video, these variabilities are even more pronounced. Camera flicker was found to occur due to the frequency difference between the camera and the light source. In addition, the supplementary video also shows that there are changes in the background shadows between the target and the camera feedback states due to misalignments. It is important that the controller is able to tolerate changes in the lighting condition and background states, and this is indeed observed in our experimental results. The second experiment was to validate the repeatability of the controller. Using one of the target shapes from Fig. 12, given in Fig. 15(a), ten repeatability tests were performed. Errors between the current state I i and target state I i+1 were obtained using the Otsu-masking subtraction method described before [see Fig. 15(b)]. Fig. 15(c) shows that although there are slight variances as it moves toward the target, the arm is able to converge to a low final error with reasonable precision, as indicated by the small standard deviations in error. This is also evident in Fig. 15(d), which highlights the convergence of the valve control  Table III. voltages to their final values. The low standard deviations of the final states control voltages after 60 s, as given in Table III, also prove the repeatability of the controller. Higher variability before convergence shows that the trajectory taken by the soft arm is not unique, even though the starting condition is the same. For this particular shape, it takes approximately 30 s, which corresponds to 81 time steps to reach the final shape, which is three times longer than average. This is likely due to the nonunique solutions generated by the controller. The oscillations in the mean error are likely due to the imperfect pneumatic sealing, variation, and lighting, as well as the other factors mentioned previously. The third experiment was to determine the effects of adding translation noise to the feedback I i of the solver. In order to have full accurate control of the translation, image feedback states were shifted to the right rather than moving the camera itself. The feedback images are also cropped and padded by white 255 pixels as they are translated to the right. From Fig. 16, despite the loss in accuracy as the feedback states are translated, the controller remains capable of discovering a solution that captures the overall form of the target. Note that no data augmentation was done to the training data of the physical system to make the controller robust to these translations, and the emergent behavior arises from the learning architecture and data structure.
The fourth and final experiment was to validate the generalizability and applicability of the controller. Here, we provide hand-drawn targets to the controller rather than actual images of the robot. Fig. 17 shows how the hand-drawn target shapes were made on top of the background image. First, a template without the arm was created in order to match the image environment. This template was then loaded into Microsoft paint. Random target shapes were then created using the line tools. These images are then given target to the robot. No retraining is done for the experiment.
Qualitative results for the six random hand-drawn target trajectories are given in Fig. 18. Even though the drawn targets are user-defined without the consideration of the robot kinematics, the controller was still able to achieve qualitatively similar configurations. We have deliberately kept the hand-drawn target images ambiguous, omitting tip-like feature points, to test the generalizability of the network. The solver was also able to automatically determine a suitable pose for the robot, despite not being explicitly given in the target states. For example, in the third image in Fig. 18, the given target state is a short line. As there were no feature points, the controller was able to reasonably generalize this to a shape that is pointing away from the camera. Similar to the other experiments, the controller was also able to reduce the error between the target and feedback states without any intervention and was also able to tolerate varying thicknesses in the drawn target shapes (see Fig. 19). Overall, this shows the generalizability of the controller due to the deep learning architecture. This also provides the user with an intuitive way to control the shape of the soft robot.    19. Average mask error between the hand-drawn targets and the feedback states for the six target shapes given in Fig. 18 over the duration of 60 s. Mask errors were obtained by the image subtraction between I i (feedback) and I i+1 (target) states after Otsu thresholding, followed by normalization and averaging of the absolute image errors. On average, the controller converges to the final shape after 10 s.

VII. DISCUSSION AND CONCLUSION
In this article, we introduced an image-based deep-learning kinematic controller for continuum robots. The controller is able to realize desired arbitrary target shapes given only image inputs with high accuracy and minimal errors. Thus, giving direct authority over not only the robot's tip but also its shape and configuration. We have also shown a simple yet elegant technique of teleoperating the soft continuum arm, where 2-D hand-drawn images are used to control the robot's 3-D configuration. This is much more user-friendly and intuitive compared to analytical methods, which require plenty of parameters to define the arc and configuration for each of the robot's section [15], [16], [16], [17], [23], [26]. Hence, due to the generalizability of deep networks, the controller is able to realize an appropriate pose given a hand-drawn target without any user guidance or manual consideration of the robot's kinematics.
Our control system has demonstrated its versatility on a range of continuum robotic manipulators, regardless of their actuation method, materials, geometry, kinematics, and the number of degrees of freedom, as evidenced by our simulation and physical experiments with vastly different robots. This indicates that the system can be effectively applied to any type of such manipulator, as long as the network architecture and training data size are appropriate, and the target shape images are kinematically feasible. Our method is also applicable to robotic manipulators with lower degrees such as robotic fingers or manipulators, provided that they are relatively easy to draw. Similarly, our controller is also able to operate regardless of the relative orientation between the robot and the camera, as shown by our results. Furthermore, our kinematic controller has a simple learning process, as demonstrated in our results. In our work, it took approximately 3 h to generate samples, train, and deploy the controller on the STIFF-FLOP manipulator. We verified the efficacy and robustness of the image-based controller in both simulations and in reality. In the simulation, it was seen that the deep CNN network is able to tolerate noise in the image feedback such that the desired shape is reached when trained with data augmentation techniques such as random translations, rotations, and partial obstructions. More importantly, it was able to compensate for drastic changes in mechanical properties for up to 15% reduction in stiffness, while maintaining the accuracy of the resulting final shape to the desired target shape.
In the physical setup, the controller was shown to tolerate noise and variabilities such as inconsistent lighting conditions and translational noise in the image feedback inputs even without any training data augmentation. Based on our simulation experiments, however, we have shown that data augmentation can increase the performance of the controller under the presence of noise. Future work will, hence, conduct a more thorough analysis of the optimal level of data augmentation necessary to achieve robustness of physical systems to feedback noise. The controller also requires minimal sensors, needing only an inexpensive and off-the-shelf camera without requiring any calibration. These are major advantages over sensors used in current work such as optical markers for end-effector tracking, which are expensive and require calibrations [52]. As the camera is an external form of sensing, it removes the need for additional fabrication steps, manual labor, and the challenges of integrating sensors, particularly in the case of soft robots. The same applies to the process of adding sensors to a preexisting robot. Our controller also eliminates the requirement for additional steps and algorithms to process embedded sensing.
Model-based prior works have relied on simplifications and assumptions such as sectional CC, pure symmetry in the structure, and minimal to no torsion in the exhibited actuated shapes. These assumptions severely limit the possible shape configurations and the scalability to more complicated continuum arms, hence affecting the accuracy of shape controllers based on such models [71]. Feedback control in these cases also involves an additional stage, where visual data have to be segmented and parameterized for comparison to the analytical model. The advantage of our technique is the whole use of image-state representations of both the task and the configuration space. This allows for high-dimensional and hyper-redundant descriptions of the arm, which encodes all of these inherent yet implicit structural, mechanical, and material properties of the arm, which is wholly controllable in 2-D. Therefore, there is also no need to formulate specific mappings of the actuator space to the joint space that is typical of model-based controllers [15].
Counterintuitively, the mapping between the shape configuration of the arm and the IK solutions is also not unique. This is greatly affected by the initial starting configuration, the material and structural properties, the noise in the image data, and even temporal dependence. For example, the same target shape is reachable from infinitely many directions. The ability to tolerate this redundancy is another major advantage of our local IK formulation compared to other model-free global IK techniques, such as goal babbling [51], which learn particular and specific solutions to the IK problem [49]. As shown in Fig. 15(c), there are slight fluctuations in the trajectory leading to convergence for the same target shape, occurring between 10 and 30 s. This is due to two physical factors. The pneumatic tubes had slight leaks, which resulted in unexpected and random vibrations when the pressure was varied. In addition, the STIFF-FLOP is a small silicone-based soft robotic system. Movement is a dynamic process due to the body's mass, combined with the imperfections in the pneumatic system, causing the robot to act like a 3-D pendulum. Despite these fluctuations, our controller was able to bring the robot to the desired shape and hold it once it stabilized, which was consistent with our simulation results where the robot was able to handle changes in mechanical stiffness (refer to Fig. 8). A direct solution to this issue of oscillation would be to reduce the low-pass filter gain and the control frequency, at the expense of speed to convergence.
Enhancements to the shape control accuracy, speed, and robustness can be made to both the control system and the arm design. One way to improve is by gathering more training data through image augmentation and a diverse range of backgrounds and foregrounds typically used in computer vision. In addition, using higher resolution state images and optimizing the network architecture and hyperparameters will increase the controller's resistance to noise and decrease steady-state errors. The use of well-established CNN architectures, such as ResNet [83], which have demonstrated effective performance in noisy environments, will also be taken into consideration. The control frequency is currently bottlenecked by the multiplexed DACs. The network takes 2 ms to process the image inputs; thus, faster control frequencies can be achieved by replacing DACs with a better stand-alone digital-analog board. With faster control frequencies, our work can be extended to dynamic shape control by combining image state representations and convolutional networks with recurrent neural networks as was done in prior work for continuum soft robots [55]. The issue of nonunique projections will also be addressed using additional feature points as well as multiple camera views in future work. Data augmentation would not address the issue of full occlusions, which are common in medical scenarios such as minimally invasive surgery or endoscopy. A direct avenue of future research could be to leverage computed tomography scans previously performed on medical continuum robots [84]. Alternatively, embedded sensors within the arm that permit 3-D shape estimation can also be incorporated.

ACKNOWLEDGMENT
For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.