Adaptive Techniques in Signal Processing and Connectionist Models

By M.R. Lynch, Christ's College

A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy.

Department of Engineering
June 1990

To My Parents

Acknowledgements

I would like to thank my supervisor Dr. P.J. Rayner of Cambridge University Engineering Department for his guidance, encouragement and aid during the course of this research. I would also like to thank the Science and Engineering Research Council and Dr. W. Fitzgerald and Marconi for their aid in financing the research.

Declaration

I certify that this dissertation contains an account of my own work, except where specific reference is made to others, and includes nothing which is the outcome of work done in collaboration. The work was carried out in Cambridge University Engineering Department between October 1986 and January 1990. This dissertation has not been submitted to any other university for any other degree. This dissertation comprises 171 pages in total.

M.R. Lynch

Keywords: Signal Processing, Adaptive, Digital Filters, Image Processing, Connectionist Models, Neural Networks, Non-linear Systems, Volterra Analysis, Conjugate Gradients, Least Mean Squares, Pattern Recognition, Image Registration, Non-linear Filters.

Summary

This thesis covers the development of a series of new methods and the application of adaptive filter theory, which are combined to produce a generalised adaptive filter system which may be used to perform such tasks as pattern recognition. Firstly, the relevant background adaptive filter theory is discussed in Chapter 1, and methods and results which are important to the rest of the thesis are derived or referenced. Chapter 2 of this thesis covers the development of a new adaptive algorithm which is designed to give faster convergence than the LMS algorithm but, unlike the Recursive Least Squares family of algorithms, does not require storage of a matrix with n² elements, where n is the number of filter taps. In Chapter 3 a new extension of the LMS adaptive notch filter is derived and applied which gives an adaptive notch filter the ability to lock onto and track signals of varying pitch without sacrificing notch depth. This application of the LMS filter is of interest as it demonstrates a time-varying filter solution to a stationary problem. The LMS filter is next extended to the multidimensional case, which allows the application of LMS filters to image processing. The multidimensional filter is then applied to the problem of image registration, and this new application of the LMS filter is shown to have significant advantages over current image registration methods. A consideration of the multidimensional LMS filter as a template matcher and pattern recogniser is given. In Chapter 5 a brief review of statistical pattern recognition is given, and in Chapter 6 a review of relevant connectionist models. In Chapter 7 the generalised adaptive filter is derived. This is an adaptive filter with the ability to model non-linear input-output relationships. The Volterra functional analysis of non-linear systems is given, and this is combined with adaptive filter methods to give a generalised non-linear adaptive digital filter. This filter is then considered as a linear adaptive filter operating in a non-linearly extended vector space. This new filter is shown to have desirable properties as a pattern recognition system.
The performance and properties of the new filter are compared with those of current connectionist models, and results are demonstrated in Chapter 8. In Chapter 9 further mathematical analysis of the networks leads to suggested methods to greatly reduce network complexity for a given problem by choosing suitable pattern classification indices and allowing the network to define its own internal structure. In Chapter 10 the robustness of the network to imperfections in its implementation is considered. Chapter 11 finishes the thesis with some conclusions and suggestions for future work.

Contents

1  Current Adaptive Methods and Theory
   1.1  Introduction
        1.1.1  The FIR Wiener Filter
   1.2  Properties of the Correlation Matrix and its Eigen-Values
   1.3  The Adaptive Approach
   1.4  The LMS or Stochastic Gradient Algorithm
        1.4.1  Problems of the LMS algorithm
        1.4.2  Choice of the adaption coefficient
   1.5  Recursive Least Squares
   1.6  IIR Adaptive Filters
   1.7  Singular Value Decomposition

2  Conjugate Gradient Method
   2.1  Introduction
        2.1.1  The LMS Algorithm
   2.2  Derivation of the Algorithm
        2.2.1  Theorem 1
        2.2.2  Theorem 2
        2.2.3  Application to Adaptive Filters
        2.2.4  The Algorithm
   2.3  Results
   2.4  PARTAN Algorithm
   2.5  Conclusion

3  Pitch Extraction
   3.1  Introduction
        3.1.1  Harmonic Signal Case
        3.1.2  Limitations
        3.1.3  Results
        3.1.4  Conclusions

4  Image Registration
   4.1  Image Registration Utilising Multidimensional LMS Adaptive Filters
        4.1.1  Current Registration Methods
        4.1.2  The New Method
   4.2  Derivation
   4.3  Image Registration System
   4.4  The Adaption Coefficient
   4.5  Filter Solutions
   4.6  Output Image
   4.7  Performance
        4.7.1  Photograph of image result 1
        4.7.2  Photograph of image result 2
        4.7.3  Photograph of image result 3
   4.8  Computational Load
   4.9  Template Matching Possibilities
   4.10 Conclusion

5  Pattern Recognition
   5.1  Introduction
   5.2  The Problem
        5.2.1  Adaptivity and Learning
        5.2.2  Interpolation and Generalisation
        5.2.3  Learning Speed
        5.2.4  Certainty of Learning
        5.2.5  Multiple Class Problems
        5.2.6  Parameter Tuning
        5.2.7  Parallelism
        5.2.8  Probabilistic Properties
   5.3  Traditional Approaches
        5.3.1  Bayesian Classification
        5.3.2  Classical Fisher Statistical Pattern Recognition
        5.3.3  Limitations of the Classical Approaches

6  Connectionist Models and Neural Networks
   6.1  Introduction
        6.1.1  Limitations of the Conventional Syntactical Approaches
        6.1.2  The Adaptive Approach
   6.2  Artificial Neural Networks
   6.3  ADALINE
   6.4  Hidden Layer Back-propagation
        6.4.1  Advantages and disadvantages of HLBP
   6.5  Radial Basis Functions
        6.5.1  Advantages and Disadvantages of the RBF Approach
   6.6  Group Method of Data Handling
   6.7  Higher Order Networks
   6.8  The New Approach
   6.9  Other Continuous Supervised Connectionist Model Forms

7  Generalised Adaptive Filters
   7.1  The Signal Processing Approach
        7.1.1  Non-linear System Modelling Approach
        7.1.2  The Volterra Functional Analysis
        7.1.3  Adaptive Volterra Expansions
   7.2  Multiple Level Classifications

8  Network Performance
        8.0.1  Generalisation Results
   8.1  Network Comparisons
        8.1.1  Interpolation Performance Comparison
        8.1.2  Multiclass Performance
   8.2  Determination of Network Capacities and Degeneracy
   8.3  Visual Recognition Results
        8.3.1  Direct Recognition
   8.4  Invariant Recognition
        8.4.1  Preprocessor
        8.4.2  Results
   8.5  Conclusion
   8.6  Audio Recognition
        8.6.1  The Data
        8.6.2  Preprocessing and AR Model
        8.6.3  Results
        8.6.4  Conclusion

9  Output Mapping Problem and Principal Non-linearities
   9.1  Output Mapping
   9.2  Vector Space Approach
   9.3  The Bayesian Approach
        9.3.1  Bayes' Theorem
        9.3.2  Application to Connectionist Models
        9.3.3  Conclusion
   9.4  Optimisation of Output Map Order
        9.4.1  Derivation
        9.4.2  Results
        9.4.3  Conclusion
        9.4.4  Analogy with Generalised Karhunen-Loeve Transform
   9.5  New Network Using Adaptive Output Mapping
   9.6  Principal Non-linearities
   9.7  Adaptive Identification of Principal Non-linearities
        9.7.1  Degeneracy
        9.7.2  Adaptive Minimum Norm Solution
        9.7.3  Perceptron Analogy
        9.7.4  Adaptive Self Collapsing Network Examples
   9.8  Conclusion

10 Implementation
   10.1 Introduction
   10.2 Exact Implementation
   10.3 Random Interconnection
        10.3.1 The Two Neuron Random Model
        10.3.2 Results for Random Connection
        10.3.3 Conclusion
   10.4 Limited Dynamic Range and Imperfect Multiplication
        10.4.1 Results
   10.5 Conclusion

11 Conclusion and Future Research
   11.1 Future Research
        11.1.1 Time Series Network Incorporation
        11.1.2 Non-clustering Patterns and Preprocessing
        11.1.3 Image Coding
        11.1.4 Other Data Sets
   11.2 Conclusion

References and Publications

Symbol Definitions

Where possible the following conventions are used:

w    Weight vector.
x    Input vector.
d    Desired signal.
ε    Error signal.
μ    Adaption coefficient.
v    Expanded space vector.
E()  Expectation operator.
R    Correlation matrix.
p    Cross-correlation vector.
Λ    Diagonal matrix.
λ    Eigenvalue.

Abbreviations

The following are common abbreviations:

LMS   Least Mean Squares.
RLS   Recursive Least Squares.
ADF   Adaptive Digital Filter.
FIR   Finite Impulse Response.
SVD   Singular Value Decomposition.
IIR   Infinite Impulse Response.
OCR   Optical Character Recognition.
RBF   Radial Basis Function.
MLP   Multilayer Perceptron.
HLBP  Hidden Layer Back Propagation.

Introduction

Although the areas of adaptive signal processing and connectionist modeling owe their early origins to the same work, the two subjects have tended to become independent. This thesis uses some approaches and methods more recently developed in the field of signal processing and applies them to the field of connectionist models. Consequently the thesis is divided into two parts: the first of these covers work in the area of adaptive signal processing, and the second the application of signal processing methods to connectionist modeling, to develop a new connectionist model and methods by which to analyse and refine it.

Part I

Chapter 1

Current Adaptive Methods and Theory

1.1 Introduction

Practical information processing systems must be able to function in continually changing environments; they must also be capable of taking into account the many complexities of the real world. Adaptive signal processing gives a method for producing such systems. Adaptive systems may track changes in their environments.
The adaption process may also allow a system to design itself taking account of many complex aspects of the task it is required to perform. The basis of adaptive signal processing is the adaptive filter. In signal processing adaptive filters have until recently tended to be linear, that is, the filter output is a linear function of the data applied to its input. The statistical approach to designing a linear filter is to assume various statistical measures are known, normally the mean and auto correlation functions of the information and noise signal inputs. The adaptive approach does not require this 'a priori' statistical knowledge. The adaption t:f algorithm forms internal estimates of this information whilst the filter operating, and uses these estimates to adapt the filter to a suitable state. One of the first applications of adaptive signal processing was that of adaptive beam forming . A beam former for the reception of RADAR or radio signals was required to be able to null out jamming signals coming from a localised direction. Bernard Widrow [8J Input DeFining SysteM Output Model error Figure 1.1: Adaptive Filter developed such a system based on an error surface gradient following method, which was later to become known as the Least Mean Squares (LMS) algorithm. The similarity of the mathematics applying to the beam forming problem and digital filtering led to the application of this new approach to digital filters. One of the classical applications of these adaptive digital filters is that of noise cancellation [1][8]. In general an adaptive filter is required to minimise some function of an error signal; this error is normally the difference. of the output of the filter and some desired signal. A common function to minimise is the mean square of this error signal. Such a filter may be con figured to model another system as shown in figure 1.1. The importance of adaptive filters has become apparent in the field of signal processing due to their ability to take account of slow time varying systems such as communication channels in channel equalisation, or acoustic pathways in active noise cancelling. Another aspect of adaptive systems, the ability to adapt themselves to solve problems, is making them very important in the field of artificial intelligence. It has become apparent that it is very difficult to define the rules that will allow a machine to perform many 'real-world' tasks, such as visual pattern recognition under general conditions. The beauty of adaptive systems for such problems is that it is no longer necessary to derive a set of explicit rules as the system can be left to adapt itself 'on the job' to solve the problem without being explicitly programmed with sets of rules. In considering such systems it is necessary to return to the basic Wiener Filter. In this thesis the discussion is limited to discrete time systems. The definition of such systems 16 Input x k Er r o r 'diener Fil ter DeSired elk Figure 1.2: Wiener filter may be found in [34J. 1.1.1 The FIR Wiener Filter. The Wiener filter is an optimal filter in the mean square sense. As shown in figure 1.2 an output error is defined as the difference between the desired reponse and the output of the filter. Classical Wiener theory uses the mean square error as the function to be minimised, and assumes that the information and noise processes are statistically stationary. Such a filter is said to be optimal in the least square sense. 
In the Wiener filtering problem it is required to design the optimal filter for producing an output signal as similar (in the mean square sense) to the desired signal as possible. Let d_k be the desired signal, x_ik the input signal at the i th delay point of an N tap FIR filter, y_k the output signal and w_i the filter coefficients. An error signal ε_k may be defined from the filter equation:

ε_k = d_k − y_k = d_k − Σ_{i=1}^{N} w_i x_{ik}

The expectation of the squared error E(ε_k²) is:

E(ε_k²) = E( ( d_k − Σ_{i=1}^{N} w_i x_{ik} )² )

For notational convenience we may define a vector w composed of the filter weights, and likewise a vector x_k of the input values held in the filter. The filter equation becomes:

ε_k = d_k − wᵗ x_k

Defining σ² = E[d_k²]:

E(ε²) = σ² − 2 wᵗ E[d_k x_k] + wᵗ E[x_k x_kᵗ] w
E(ε²) = σ² − 2 wᵗ p + wᵗ R w = J

where R is the correlation matrix

R = E \begin{bmatrix} x_{0k}² & x_{0k}x_{1k} & ⋯ & x_{0k}x_{nk} \\ x_{1k}x_{0k} & x_{1k}² & ⋯ & x_{1k}x_{nk} \\ ⋮ & & ⋱ & ⋮ \\ x_{nk}x_{0k} & x_{nk}x_{1k} & ⋯ & x_{nk}² \end{bmatrix}

p is the cross-correlation vector E[d_k x_k], and J is the mean square error. The minimum square error may easily be found by taking the derivative of the mean square error with respect to the weight vector and setting this to zero:

R w = p
w_opt = R⁻¹ p

This requires that the correlation matrix is non-singular so that its inverse exists.

It is necessary to know a priori the signal statistics of the input, reference, and desired signals. This may not be possible in many practical cases. In a non-stationary environment the Wiener filter theory is deficient as the optimal filter is time varying, that is, the coefficients of the filter are time dependent. Although Kalman filter theory can provide an approach to obtaining results for such problems, an often more convenient approach is that of the adaptive filter.

The adaptive digital filter avoids the need to obtain these a priori estimates and consequently may be used where the signal statistics are unknown or changeable. The statistics must be 'pseudo-stationary', that is, vary sufficiently slowly with time for the adaption algorithms to work. In the case of rank-deficient systems it is necessary to use a method which will give reliable results with singular matrices. A good method for use in such situations is that of singular value decomposition. This gives the minimum norm solution [1].

1.2 Properties of the Correlation Matrix and its Eigen-Values

It is important to state some basic properties of the correlation matrix which will be used later. Further properties and their proofs may be found in Haykin [1].

• The correlation matrix of a stationary discrete time stochastic process is Hermitian: R^H = R, where H denotes conjugate transpose.
• The correlation matrix of a stationary discrete time stochastic process is Toeplitz.
• The correlation matrix of a stationary discrete time stochastic process is always non-negative definite: x^H R x ≥ 0, where x is an arbitrary non-zero M by 1 complex valued vector (the correlation matrix being M by M).

The eigenvector equation defines a vector q (M by 1):

R q = λ q

1.3 The Adaptive Approach

Wiener theory is limited in that it assumes stationary random processes with statistics which are known a priori. In practical applications it is rare to encounter full stationarity or to have precise knowledge of statistics prior to the required operation. One approach might be to use the operating data to estimate the statistics and use these estimates to solve the Wiener filter. This would be computationally heavy and unlikely to be a real-time solution. The adaptive filter is a system that avoids these problems.
It functions as an automaton with the capacity to tailor its operation to suit an unknown, and possibly time-varying, statistical environment. This is achieved by use of a performance feedback recursive algorithm which alters the filter parameters to optirnise performance in some sense. In fact the standard linear transversal adaptive filter will converge to give the optimal Wiener solution for a statistically stationary problem. It should be borne in mind that the adaptive filter parameters being data dependent lead to the adaptive filter as a system being non-linear. However, this non-linear aspect is not usually referred to and the terms 'non-linear' or 'linear' are applied to adaptive filters in respect of their transfer functions at a point in time where their parameters are stationary in time. There is a series of different algorithms to perform the parameter adjustments for FIR transversal filters, but the desired properties of an adapt ion algorithm are similar for many applications . • Convergence Speed: This is the number of algorithm iterations required to reduce the error to within a given tolerance of the optimal solution. This should be as fast as possible, that IS, 20 require as few iterations as possible to get within a given tol- erance of the optimal solution: • Misadjustment: This is the difference between the performance of the filter after it has been allowed to adapt for an infinite time and the optimal filter. • Robustness: This is the ability of the filter to operate in unsuit- able signal and noise environments, for example ill-conditioned data for linear adaptive filters. • Computational Load, Structure and Numerical Sensitivity: The amount and type of requisite arithmetical operations, the sen- sitivity of the algorithm to inaccurate calculation, and the re- quired storage, may all be of importance in various applica- tions. The structure of the algorithm may be of importance in consideration of algorithm implementation. The following isa discussion of the more important of these adaption algorithms. 1.4 The LMS or stochastic gradient algorithm. This is the pioneering algorithm of adaptive signal processing and is notable for its light computational load and ease of implementation. From the above equations it can be seen that the error surface of the FIR filter is quadratic in form. Thus the error surface exhibits no local minima, as it contains only the global minimum. An optimisation algorithm may be applied to find the global minimum. The simplest of these algorithms is the stochastic gradient algorithm. This algorithm, which is also known as the method of steepest descent, involves use of the local gradient to determine the direction of the next iteration. The mean square error equation can be differentiated with respect to the weight vector to give an expression for the gradient on the mean square error surface: 21 , \lwJ = -2p + 2Rw The steepest descent algorithm may thus be applied by always moving in the direction of steepest descent (changing the the time index k to the iteration index n): where n is the iteration number and J1 is the adaption coefficient. As there is no a priori information about the correlation matrix R and the cross-correlation vector p it is not possible to use the above equations to give an exact value of the gradient. The LMS algorithm forms instantaneous estimates of Rand p from the following: R = E(x(n)xt(n)) is estimated by: similarly for p. 
These estimates are unbiased (correct in their mean), but they may have large variances. However, the LMS algorithm smooths these out. The gradient estimate is thus:

∇_{w(n)} J(n) = −2 x(n) d(n) + 2 x(n) xᵗ(n) w(n) = −2 x(n) ε(n)

1.4.1 Problems of the LMS algorithm

The LMS algorithm suffers from some disadvantages, such as slow convergence times, sensitivity to eigenvalue spread, and misadjustment.

Misadjustment

Consider the optimal solution (in the Wiener sense) for the weight vector. This value of the weight vector will be denoted w_o. At any iteration of the algorithm a weight error vector may be defined:

e_w(n) = w(n) − w_o

Haykin [1] gives an involved analysis of the LMS algorithm. This analysis is difficult as the random nature of x(n) and d(n) propagates into the weight vector estimate to give non-stationary weight vector estimates. Haykin [1] gives the result:

E[e_w(n+1)] = (I − μR) E[e_w(n)]

It can be shown that the above equation converges to zero provided μ is chosen in the range 0 < μ < 1/λ_max. This is known as convergence in the mean. The propagation of the weight error vector variance is represented by K, the weight error vector correlation matrix. Haykin [1] gives the following recursion relationship in K:

K(n+1) = K(n) − μ[R K(n) + K(n) R] + μ² R tr[R K(n)] + μ² J_min R

where tr[] is the trace operator. It can be seen that K(n) can never reach zero as the last term represents a forcing term. That is, the weight vector does not reach the optimal vector but fluctuates around it. This is due to the non-exact nature of the gradient estimate. This wandering leads to an excess error J_ex above the J_min value. The magnitude of this misadjustment may be found from Haykin [1], who shows that the excess error at n = ∞ is:

E(J_ex) = \frac{μ J_min Σ_k λ_k}{2 − μ Σ_k λ_k}

in which the λ_k are the eigenvalues of the correlation matrix R. The misadjustment is defined as the ratio J_ex / J_min. Thus it can be seen that in cases where the eigenvalues are widely spread, the misadjustment is mainly limited by the larger eigenvalues.

Effect of Eigenvalue Spread

Considering the previously defined weight error vector equation, it can be seen that the time evolution modes of the error weight vector are coupled. By using the unitary similarity transform the convergence modes may be uncoupled:

R = Q S Qᵗ

The weight error vector may be redefined in terms of the principal axes:

v(n) = Qᵗ e_w(n)

yielding:

E[v(n+1)] = (I − μS) E[v(n)]

This shows that each of the convergence modes is separated into a homogeneous first order difference equation with solution:

E[v_k(n)] = (1 − μλ_k)ⁿ E[v_k(0)]

It is clear that unless all the λ_k are equal the convergence rates for each mode are different, and an optimal μ cannot be found for all the modes.

1.4.2 Choice of the adaption coefficient

It can be found experimentally that the performance of the LMS algorithm can be greatly affected by the choice of the adaption coefficient. Yassa [3] developed a useful approach to the question of the choice of adaption coefficient. Consider the expression for the square error of an LMS filter, written for a complex filter at the (m+1)th iteration as a function of the adaption coefficient, together with the complex gradient vector:

(∇ε²)_m = 2[Rᵗ w_m − p]^* = 2[R^* w_m − p^*]

Figure 1.3: Parabolic relationship of adaption coefficient and mean square error.

Using this it can be seen that, given a nonzero (∇ε²)_m, the coefficient q_m of the quadratic term is positive. Hence the relationship between μ and the mean square error is parabolic.
The optimal μ may thus easily be found as the value at the minimum of this parabola. In the case of the LMS algorithm the instantaneous estimate R = x x^* is used, which reduces the expression for the optimal adaption coefficient to a function of the instantaneous input power. Thus a measure of the input power to the filter, when suitably low pass filtered, may be used to set and keep μ near its optimal value.

1.5 Recursive Least Squares

The performance of the LMS algorithm is not as good, in terms of number of iterations to convergence, as may be expected for an adaptive system from a theoretical analysis of adaptive methods [2][1]. Indeed there are methods which offer convergence in fewer iterations than the LMS algorithm. Perhaps the most common of these, and certainly the basis of a host of related algorithms, is the recursive least squares algorithm [1], which itself arises from the classical method of least squares.

Define: an input data vector u; a forgetting factor λ which, if close to one, gives the algorithm infinite memory and, if smaller, allows the algorithm to track time varying problems; and a matrix P of order m by m, where m is the number of elements of u.

Initialise the algorithm by setting w_0 = 0. For each time instant n = 1, 2, ... compute:

k(n) = \frac{λ⁻¹ P(n−1) u(n)}{1 + λ⁻¹ u^H(n) P(n−1) u(n)}

α(n) = d(n) − w^H(n−1) u(n)

w(n) = w(n−1) + k(n) α^*(n)

P(n) = λ⁻¹ P(n−1) − λ⁻¹ k(n) u^H(n) P(n−1)

It is immediately clear that the algorithm requires the storage of a matrix, and although this is not a problem for many adaptive filter problems it is a prohibitive storage requirement for systems with many weights, such as image filters or connectionist models.

1.6 IIR Adaptive Filters

There has been a great deal of interest in adaptive IIR filters. It has however proved very difficult to develop a practical algorithm. The problems are mainly due to the multimodal nature of the error surface and the difficulty of constraining the filter to be stable for all iterations of the algorithm. As adaptive IIR filter theory is not used elsewhere in the thesis the subject is not pursued here; however it can serve to illustrate some of the problems encountered by adaption methods with multimodal error surfaces [29][32].

1.7 Singular Value Decomposition

This is not an adaptive method but is very important for the study and analysis of adaptive systems in which the correlation matrix is rank deficient. The singular value decomposition theorem states that for a matrix A of rank w there are two unitary matrices X and Y such that:

Y^H A X = \begin{bmatrix} Σ & 0 \\ 0 & 0 \end{bmatrix}

where:

Σ = diag(σ_1, σ_2, ..., σ_w)

and:

σ_1 ≥ σ_2 ≥ ... ≥ σ_w > 0

Normally we could solve the least squares problem by inverting the autocorrelation matrix to give the optimal vector:

w = R⁻¹ p

This theorem (SVD) can be used to derive a solution for w even when the matrix R is singular, by solving p = R w for w. More generally this equation may be rewritten:

w = A⁺ b

where A⁺ is the Moore-Penrose generalised inverse or pseudoinverse of matrix A. Singular value decomposition can give a unique solution to the equation, giving the one solution with the minimum norm. The pseudoinverse of the matrix A may be defined as:

A⁺ = X \begin{bmatrix} Σ⁻¹ & 0 \\ 0 & 0 \end{bmatrix} Y^H

where:

Σ⁻¹ = diag(σ_1⁻¹, σ_2⁻¹, ..., σ_w⁻¹)

and w is the rank of the matrix A. This is used as follows:

w = X \begin{bmatrix} Σ⁻¹ & 0 \\ 0 & 0 \end{bmatrix} Y^H b

The unitary matrices may be partitioned as X = [X_1  X_2] and Y = [Y_1  Y_2]. Assuming L > M:

w = X_1 Σ⁻² X_1^H A^H b = Σ_{i=1}^{w} \frac{x_i x_i^H A^H b}{σ_i²}

This equation may be used by first computing the singular values of the data matrix A and the associated singular vectors x_1 ... x_w and substituting them into the above equation to give w. This gives a numerically well behaved method which may even be used for rank deficient systems.
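To tie together the Wiener solution of Section 1.1.1, the LMS update of Section 1.4 and the pseudoinverse route of this section, the short sketch below estimates R and p from operating data and then obtains the weight vector by direct inversion, by the SVD pseudoinverse, and by LMS adaptation. It is only an illustration: the synthetic fourth-order test system, the signal lengths and the adaption coefficient are assumptions made for the example, not values used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown FIR system to be identified (plays the role of the defining system).
true_w = np.array([0.6, -0.3, 0.1, 0.05])
N = len(true_w)

# Input signal and a noisy desired signal produced by the unknown system.
x = rng.standard_normal(4000)
d = np.convolve(x, true_w)[:len(x)] + 0.01 * rng.standard_normal(len(x))

# Tap-input vectors x_k = [x_k, x_{k-1}, ..., x_{k-N+1}], one per row.
X = np.stack([np.concatenate([np.zeros(i), x[:len(x) - i]]) for i in range(N)], axis=1)

# Estimates of the correlation matrix R and cross-correlation vector p.
R = X.T @ X / len(x)
p = X.T @ d / len(x)

# Wiener solution by direct inversion (requires a non-singular R) ...
w_wiener = np.linalg.solve(R, p)
# ... and via the SVD pseudoinverse, which also gives the minimum-norm
# solution when R is rank deficient.
w_pinv = np.linalg.pinv(R) @ p

# LMS (stochastic gradient) adaptation of the same filter: w <- w + 2*mu*e*x_k.
mu = 0.01
w_lms = np.zeros(N)
for x_k, d_k in zip(X, d):
    e = d_k - w_lms @ x_k
    w_lms += 2 * mu * e * x_k

print("true   :", true_w)
print("Wiener :", np.round(w_wiener, 3))
print("pinv   :", np.round(w_pinv, 3))
print("LMS    :", np.round(w_lms, 3))
```

All three routes recover essentially the same weights here; the point of the comparison is that the batch solutions require forming and inverting (or decomposing) R, whereas the LMS loop needs only the current tap-input vector and error.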
More recently, interesting work has been done on recursive implementations of SVD [43], although the computational load of such methods is still high. 28 Chapter 2 Conjugate Gradient Method 2.1 Introduction Over recent years a series of algorithms for the updating of adaptive filter coefficients have been developed. Some of these methods, notably those related to the method of Recursive Least Squares (RLS) [1] have provided significant increases in performance over the Least Mean Squares (LMS) [1] algorithm. For many applications the problems posed by the lack of performance of the LMS algorithm can be overcome by implementing one of these other algorithms. However, almost all of this new generation of algorithms require the storage of a matrix and a significant increase in the computational load. This is not a problem for most adaptive filters which are one-dimensional and utilise a limited number of coefficients. In the case of multidimensional adaptive filters, such as those used in image processing [7], or connectionist models or for any adaptive filter with many coefficients however, it may be totally impractical to consider storage of a matrix containing 0 (n 2 ) coefficients where n is the number of coefficients. Likewise the increase in computational load associated with current approaches may be too great. In this chapter new algorithm is presented which gives a substantial increase in perfor- mance over the LMS algorithm but does not require the storage of a matrix and has a computational burden of approximately five times that of LMS. 29 Figure 2.1: LMS Convergence Path 2.1.1 The LMS Algorithm In order to construct a new algorithm with increased performance over the LMS algorithm it is necessary first to consider the failing of the LMS algorithm. The update equation for the LMS or stochastic gradient method is [1]: or more generally: Where w is the filter weight vector and k is the time index, a and 11 are adaption co- efficients, t the filter error and x the filter input data vector. Consider an error surface which is a long narrow valley. The steepest descent type algorithms, such as LMS, will in general tend to zig-zag down the valley as shown (Fig.2.1). This behaviour even occurs for perfectly parabolic surfaces and so leads to large inefficiencies in terms of convergence time. 2.2 Derivation of the Algorithm Consider the minimisation of a function f over an N dimensional coordinate system. The function can be approximated by the first three terms of a Taylor series of the function about a point P. 30 Where: 1 =c-bw+-wAw 2 c = f(P) b=-\1f fJ2 f [AJij = 8 8 Wi Wj It should be noted that the above is exact for a quadratic function. Matrix A is the second partial derivative matrix of the function f, the Hessian of f, at P. If a minimisation in the direction of vector u is performed it is necessary to find a direction in which to minimise so as not to destroy the previous minimisation. Consequently the next direction 8w must be such that the gradient is perpendicular to u, that is v! u.8(\1f) = u.A.v = 0 By calculating the gradient of the above approximation with respect to the weights we obtain: \1f = A.w - b And hence, the effect on the gradient of a small change in w is: 8\1f = A.8w u and v are referred to as mutually conjugate. It is also necessary that our search directions must be mutually orthogonal so that the whole space may be spanned in the search. 
Following the derivation of the Fletcher-Reeves [4] algorithm it is possible to invoke a theorem which will allow the construction of mutually orthogonal and conjugate vector sequences.

2.2.1 Theorem 1

Let A be a symmetric, positive definite N by N matrix and g_0 an arbitrary vector, with h_0 = g_0. For values of i = 0, 1, 2, ... two series of vectors may be defined:

g_{i+1} = g_i − λ_i A h_i
h_{i+1} = g_{i+1} + γ_i h_i

where:

λ_i = \frac{g_iᵗ g_i}{h_iᵗ A h_i}, \qquad γ_i = \frac{g_{i+1}ᵗ g_{i+1}}{g_iᵗ g_i}

Then for all i such that i ≠ j:

g_iᵗ g_j = 0
h_iᵗ A h_j = 0

That is, the g_i are mutually orthogonal and the h_i are mutually conjugate. The proof of this theorem is beyond the scope of this chapter but may be found in Polak's book [2]. It is a form of Gram-Schmidt orthogonalisation. However, this theorem still requires storage of a matrix A. Another theorem enables this problem to be overcome.

2.2.2 Theorem 2

If g_i = −∇f(P_i), a minimisation of f in the direction h_i is performed to find a new point P_{i+1}, and g_{i+1} = −∇f(P_{i+1}), then the same sequence is generated as in Theorem 1. This may be proved by induction [2]. Consequently a series of mutually orthogonal vectors and a series of mutually conjugate vectors may be generated without knowledge or storage of A.

2.2.3 Application to Adaptive Filters

In order to use the above theorems it is still necessary to perform line minimisations and evaluate the gradient. The gradient evaluation may be easily performed for the adaptive filter case. For the standard adaptive filter [1] as shown in figure 2.2 (omitting the time index for convenience), the error is:

ε = d − wᵗ x

Taking the mean square error:

E(ε²) = E(d²) − 2 wᵗ v + wᵗ R w

Figure 2.2: Standard adaptive filter.

where x is the vector representing the filter inputs, d is the desired response, w the weight vector, R the autocorrelation matrix of the input and v the cross-correlation vector of the input and the desired signal. The gradient may thus be calculated as:

∇(ε²) = \frac{∂ε²}{∂w} = 2ε \frac{∂ε}{∂w} = −2εx

The other problem is the line minimisation. The mean square error is a parabola as a function of distance along the line. Consider the line generated by:

w' = w + βh

where β is the line generating scalar. The filter error can be written:

ε = d − Σ_i x_i (w_i + βh_i)

The second derivative of the mean square error with respect to distance along the line h may be calculated as:

\frac{∂²ε²}{∂β²} = 2 (xᵗ h)²

Given expressions for the derivative, the second derivative and the knowledge that the relationship is parabolic, it is possible to find that

β = \frac{ε}{xᵗ h}

at the line minimum along h. This gives only an estimate of the line minimum as the instantaneous error values could be affected by noise and incorporate no time averaging aspect. The approximation is found in practice to be sufficient for many applications. However this approximation can lead to a set of mutually conjugate and mutually orthogonal vector series which use up all viable directions before arriving at the minimum. This problem can be cured by periodic resetting of the algorithm, but a much more elegant solution is to use the Polak-Ribiere [2] extension of the Fletcher-Reeves algorithm which, by using a slightly different expression for γ, leads to an algorithm which in effect automatically resets itself if necessary [33].

2.2.4 The Algorithm

With initial values

g_0 = 2εx, \qquad h_0 = 2εx

the algorithm is, at each iteration i:

ε = d − xᵗ w_i
g_{i+1} = 2εx
γ = \frac{(g_{i+1} − g_i)ᵗ g_{i+1}}{g_iᵗ g_i}
h_{i+1} = g_{i+1} + γ h_i
w_{i+1} = w_i + \frac{ε}{h_{i+1}ᵗ x} h_{i+1}
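A minimal sketch of how these update equations might be realised for an FIR adaptive filter is given below. It is purely illustrative: the sum-of-sinusoids test signal, the filter length, the guard against a vanishing denominator in the line-minimisation step (a restart from steepest descent) and the error reporting are assumptions made for the example, not details taken from the experiments reported in the next section.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unknown 50-tap FIR system to be modelled.
true_w = rng.standard_normal(50)
N = len(true_w)

# Input: a sum of sinusoids of random amplitude and frequency.
n = np.arange(20000)
x_sig = sum(a * np.sin(w0 * n) for a, w0 in zip(rng.uniform(0.5, 1.5, 12),
                                                rng.uniform(0.05, 3.0, 12)))
d_sig = np.convolve(x_sig, true_w)[:len(x_sig)]

w = np.zeros(N)      # adaptive filter weights
g = h = None         # previous gradient estimate and search direction
errs = []
for k in range(N, len(x_sig)):
    x = x_sig[k - N + 1:k + 1][::-1]          # tap-input vector [x_k, ..., x_{k-N+1}]
    e = d_sig[k] - w @ x                      # filter error
    errs.append(abs(e))
    g_new = 2.0 * e * x                       # (negative) gradient estimate
    if g is None:
        h = g_new                             # initial direction h0 = g0
    else:
        gamma = ((g_new - g) @ g_new) / (g @ g + 1e-12)   # Polak-Ribiere factor
        h = g_new + gamma * h                 # new search direction
    g = g_new
    denom = x @ h
    if denom == 0 or abs(e / denom) * np.linalg.norm(h) > 10.0:
        h = g_new                             # step too large: restart from steepest descent
        denom = x @ h
    if denom != 0:
        w = w + (e / denom) * h               # line minimum along h: beta = e / (x.h)

print("mean |error|, first 500 adapted samples:", round(float(np.mean(errs[:500])), 4))
print("mean |error|, last 500 adapted samples :", round(float(np.mean(errs[-500:])), 4))
```

Only the vectors w, g and h need to be stored, which is the O(n) storage advantage over the matrix-based methods of Chapter 1.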
2.3 Results

The algorithm was tested against the LMS algorithm. Two filters, one LMS and one conjugate gradient (CG) filter, were run in parallel modelling an FIR system. The experiment was repeated with random coefficients in the 50th order FIR system being modelled. The input signal was also varied: it was composed of a series of sinusoids of random amplitude and frequency. The convergence curves for these runs were averaged and plotted in figure 2.4. The position of the sudden slight rise in error, which can be seen on the conjugate traces even after averaging over runs, can be shown to be related to the filter length. This may be explained as the algorithm resetting itself internally after having converged to a point and yet having an error still present. In fact the algorithm continues to reset itself periodically, as can be shown by monitoring the algorithm terms; however the error is not changed much by these resets as the solutions all lie very close to the optimal solution.

Figure 2.3: CG v LMS convergence test system.

Figure 2.4: Plot of average absolute error over fifty runs, against iteration number.

Figure 2.5: Typical run with no averaging. Note the smooth convergence of the LMS algorithm compared with the jumps of the conjugate gradient algorithm.

2.4 PARTAN Algorithm

There is a method closely related to that of conjugate gradients known as partial tangents (or PARTAN) [25]. This is of interest as it is insensitive to inaccuracies in the gradient determinations. Starting at an arbitrary point x_0, the point x_1 is found by a single step of steepest descent. After that, from a point x_k the corresponding y_k is first found by steepest descent from x_k, and then x_{k+1} is taken to be the minimum point on the line connecting x_{k-1} and y_k. The process is continued for n steps and then restarted with steepest descent. It can be shown that for a quadratic objective function the x_k are the same points as would arise from conjugate gradients [25]. The disadvantage of the algorithm is the need for two line searches per step, as opposed to the one of Fletcher-Reeves. However, PARTAN is insensitive to inaccuracies in the line searches, which is a desirable quality for an adaptive filter algorithm as the effect of noise can seriously upset line searches.

Figure 2.6: PARTAN method.

2.5 Conclusion

The algorithm performs well, giving significant increases in convergence rate over the LMS algorithm, and requires only O(4n) storage rather than the O(n²) of other fast methods. It is necessary to note that the basic algorithm is operating more in a least squares than in a mean least squares mode. This can lead to unexpected results, especially for short filters with low frequency signals, and some form of error averaging may be needed to give a mean least squares solution. Provided that this complication is considered, the algorithm is an improvement on LMS and may be used when methods such as Recursive Least Squares or Singular Value Decomposition cannot be applied due to storage or computational load constraints.

Chapter 3

Pitch Extraction

3.1 Introduction

An adaptive method of pitch estimation is presented which will estimate the pitch of a harmonic signal in noise to a high degree of accuracy, and track any slow drifts in the pitch of the harmonic signal. The approach is based on an adaptive notch filter and hence avoids the problems encountered in the use of integer length adaptive delay lines. The approach is particularly efficient when using an LMS notch filter and will allow increased performance of the notch filter by accurately fixing the reference signal frequency.

Adaptive pitch estimators for harmonic signals in noise can be difficult to design due to the nature of the error surface encountered if one attempts to adapt a fundamentally time stationary system. This approach solves the problem by utilising a time varying system.

An attempt was made to remove a harmonic interference from an archive gramophone recording using a standard FIR LMS adaptive notch filter [1]. This approach worked well in periods in which the harmonic interference was present with background noise of similar amplitude, but failed in periods with music or speech present. The amplitude of the music was many times greater than that of the interference and caused the adaptive notch filter to ring, destroying its interference cancelling properties. As in the case encountered by Lim [35] it was decided to freeze the adaptive notch filter in these regions. However, as these regions persisted for tens of thousands of samples, the reference sinusoids to the adaptive notch filter had to have their pitches accurately determined so that in the frozen periods the interference and the anti-interference output from the FIR filter should stay synchronised.
Adaptive pitch estimators for harmonic signals in noise can be difficult to design due to the nature of the error surface encountered if one attempts to adapt a fundamentally time stationary system. This approach solves this problem by utilising a time varying system. An attempt was made to remove a harmonic interference from an archive gramophone recording using a standard FIR LMS adaptive notch filter [lJ. This approach worked well in periods in which the harmonic interference was present with background noise of similar amplitude, but failed in periods with music or speech present. The amplitude of the music was many times greater than that of the interference and caused the adaptive notch filter to ring and destroyed its interference cancelling properties. As in the case encountered by Lim [35J it was decided to freeze the adaptive notch filter in these regions. However, as these regions persisted for tens of thousands of samples the reference sinusoids to the adaptive notch filter had to have their pitches accurately determined so that in the frozen periods the interference and anti-interference output from the FIR filter should stay synchronised. 38 Sinusoid ~ror I I~k I A,DL' I->Ic;)~ d LI~tPut In pu t _K __________ ----' Figure 3.1: Standard Adaptive LMS Notch Filter Thus an accurate and computationally efficient pitch estimator was required which could also track a slowly drifting pitch. Consider an LMS adaptive notch filter with one notch. Glover [28] analysed the LMS adaptive notch filter and showed its behaviour to include time varying aspects under certain conditions. Let C (z) be the Z- transform of the ith coefficient of the adapted notch filter, with adap- tion coefficient a and reference sinusoid of frequency Wr and amplitude C, U(z) is the Z transform of the LMS weight up,date equation (an integrator) as shown in Fig 3.2. Cj(z) = aC U(z)[E(ze-jwrT)ejOi + E(zejwrT)e-jOi] 2 Glover shows that the pole zero plot of E( z) would indicate poles at z = e±jwdT and E( ze- jwrT) represents a counterclockwise rotation of E( z) through angle wrT. Therefore the pole-zero plot of : would show poles at ±(wr + wd)T and at ±(wr - wd)T. This rotated spectrum is filtered t.hrough U(z) to give Cj(z) . Each weight therefore consist.s of t.he sum and difference frequency of Wr and Wd. U(z) is strongly low pass, hence the difference frequency domi- nates. For the FIR delay line ()j = () - wrT[i - 1] which shows that the coefficients Cj are a sinusoid of frequency Wr with each weight varying as a sinusoid at frequency Wr - Wd. Glover refers to this as the sinusoid in the coefficients 'moving' at a rate equal to the difference frequency between the desired and reference frequencies. This may be viewed 39 " / " / Figure 3.2: Notch Filter Analysis Diagram G( z ) /1 1 1 ~ as a heterodyning process and is a time varying aspect to the solution for an adaptive notch filter; it may also be considered as a simple case of a quasistationary filter using a slowly varying phase reference at the desired frequency. It is this time varying aspect that the adaptive pitch estimator (APE) will use. 
A single notch adaptive filter is constructed and the motion of the sinusoid in the coefficients is detected; the pitch estimate P is then updated at each step in proportion to the detected coefficient motion, scaled by the factor 1/(aC) and a pitch adaption coefficient μ_p:

P_{n+1} = P_n + μ_p \frac{1}{aC} (detected coefficient motion)

Figure 3.4: Pitch tracking system.

3.1.1 Harmonic Signal Case:

In the case of harmonic signals the accuracy of the system may be increased by using a reference sinusoid of frequency near one of the higher harmonics present, rather than the fundamental, as this will increase the difference frequency and increase the coefficient sinusoid movement by a factor equal to the harmonic number. This leads to a larger effect for the same pitch error in the fundamental.

3.1.2 Limitations:

The limitations on the accuracy of the algorithm are set by the adaption noise found on the coefficient values. This adaption noise can be estimated from the expression for the minimum mean square error. It is clear that it may be reduced by reducing the adaption coefficient of the notch filter. The bandwidth of the notch filter is set by this coefficient, which in turn determines the range of frequency error over which the above analysis applies. Thus there is a trade-off between final pitch accuracy and initial lock range, the initial lock range being defined as the maximum difference in frequency between the desired and reference signals for which the system will converge. The notch bandwidth is given by:

BW ≈ \frac{N a C²}{2T} rad/s

where N is the filter length. Limitations on the rate of convergence are set by the need to prevent instability due to undermining the quasistationary assumptions. The changes in the pitch of the reference oscillator must therefore be suitably slow; this may be achieved by using a suitably small pitch adaption coefficient μ_p.

3.1.3 Results:

The algorithm was found to be capable of high accuracy pitch estimates in quite high levels of noise and was capable of pitch convergence at an SNR of -15 dB. In the case of an SNR of 10 dB the system could attain pitch accuracy estimates of the order of 1 part in four thousand. A series of convergence curves is shown in figure 3.5. The occasional initial divergences are caused by transient effects as the data enters the filter.

Fig. 3.5: Pitch tracking curves; pitch error (+20% to -20%) against iteration number (0 to 7000).

Fig. 3.6: Input musical waveform (top), oscillator waveform (bottom) and output (quiet section with no music).

3.1.4 Conclusions:

The system is best suited to applications requiring high accuracy pitch estimates on stationary signals for which a low accuracy pitch estimate is already known. The necessary computations for the system are based around an FIR filter, thus making it easily implementable in hardware. The filter may be quite short in practice and, in the case of an adaptive notch filter to remove a single sinusoid, the pitch estimator may be added at very little extra computational cost by using the cancelling filter as the pitch estimating filter. This leads to deeper notches and increased performance, and as such may be a simple improvement that may be added to FIR adaptive notch filters. It is clear that there are many possibilities for variations on the above theme that may be well suited to other applications.

Chapter 4

Image Registration

4.1 Image Registration Utilising Multidimensional LMS Adaptive Filters

The need for image registration algorithms arises in many fields such as airborne ground sensing, aircraft navigation and medical imaging.
In the latter separate frames of an image must be registered in the presence of involuntary patient movement. This problem arises wherever there is relative movement between the image sensor and the image scene. In practical cases the images may be 'degraded by noise, may contain weakly spatially varying shifts or one of the two images to be registered may contain objects not found in the other or moving independently of the rest of the scene. All of these effects can significantly reduce the performance of current image registration methods such as phase correlation. 4.1.1 Current Registration Methods Broadly speaking there are two practically used approaches to image registration. The first of these is a syntactic approach which normally involves the extraction of simple basic image elements such as line segments or boundaries. These are compared in the two images and some search method is employed to register the two images. This method works well for simple shapes in clean images under constant conditions. It is also necessary that the image contain the features employed, such as sharp line segments. In general, however, images can be noisy, have variable content, be complex and arise from environments with varying lighting and so on. Thus this method cannot usually be applied well to general Images. 45 The second method is that of phase correlation. This method is based on properties of the Fourier Transform. Consider two images 11 and 12 which are to be registered. The images are assumed to be related by a spatially invariant shift. Define F1(wx ,wy ) to be the Fourier transform of 11, FT(It) Define F2(Wx,Wy) to be the Fourier transform of 12. Then by the shift theorem: Where 6.x is the x direction shift and similarly 6.y. Thus the function: ( 1 (F2 P x, y) = FT- Fl ) returns a delta function at the shift value. However noise effects and the fact that many shifts are not constant over the image area mean that this delta function mq.,y become distorted or lost under noise in the phase correlation output image. 4.1.2 The New Method In the case of images the new method employs a two dimensional Least Mean Squares adaptive filter which makes almost no a priori assumptions about the images or their degradations. This allows the method to be used with general images and consequently it is not limited to images with sharp edges or similar artifacts as may be required by syntactic methods. 4.2 Derivation The derivation of the multidimensional LMS adaptive filter is closely related to that of the one dimensional case. The derivation is given for a N dimensional adaptive filter with each dimension of order m. 46 Consider an input tensor x of rank N order m and an N rank weight tensor w. Both tensors elements have N indices i .. .'z. The N-dimensional adaptive linear combiner is given by (k is the iteration index): Where the operation wx is defined as the inner product, that is: L L ... L Xi ... zWi ... z i=l,mj=l,m z=l,m The error may be defined as the difference between a desired output dk and the actual output Yk: Hence: We now assume that f.k, dk and Xk are' statistically stationary and consider the expectation of the error squared over k. This assumption can have some important implications for work with practical images as in many cases the degree of stationarity can be less than is normally found with many practical one dimensional signals. We can define E(XkXk) as the N dimensional correlation 'Tensor' of rank 2N, R. Likewise the cross-correlation tensor E( dkxK) denoted, p. 
This shows that the mean square error is precisely a quadratic function of the weight values. Thus the error surface is unimodal with no local minima. It is this quality that will allow some form of gradient search to converge to the optimum solution. Having shown the nature of the error or Performance surface the problem is now to find a search algorithm to find the minimum. Again the cue is taken from the ID case of 47 the derivation of the Widroff-Hoff LMS algorithm [1] and a simple gradient search IS employed. An adaptive algorithm may be formed by the following approach: 2 8t% 8tk \1tk = - = 2tk- = - 2tkX ,( 8w 8w We can specify a steepest descent algorithm: Where J1 is the adaption coefficient. Explicitly in the case of 2D ADFs the update equation is: The derivation of the LMS adaptive filter is independent of the dimensionality of the data, thus the same derivation can be used to lead to 2D,3D or ND ADFs. This derivation also shows us that the error surface is unimodal even for higher dimensionality data. The fact that 2D and ID ADFs are based on the same theory also allows us to predict with reasonable confidence the behaviour of 2D ADFs by analogy with the ID case, however care must be exercised in this practice as some of the problems encountered in 2D have no analogy in ID; also the nature of practical image data may differ from practical ID signals. 4.3 Image Registration System The basic shift registration system is shown in figure 4: 1. The reference image may in practice often be one of the previous frames of a series of image frames. The adaptive filter is run and its coefficients are updated. The filter will perform a gradient search to minimise the mean square error image. In the case of a shift this will occur when the output image has been correctly shifted by the filter to be registered with the reference image. The maximum shift that the system can cope with is given by the filter size and consequently its maximum spatial delay. The system is arranged so that there is in effect a spatial shift of half of the filter size between the two images so that the filter can cope with a positive or negative delay of magnitude less than m/2. Input I Mo.ge s Mul t idi Me nSiOno.l A.Dr . Output Registered IMQge Figure 4.1: Adaptive Image Registration Using 2D Filters. The particular path over the image that the filter runs can be important in the case of spatially varying shifts.If the variation is occurring more quickly in one of the two axis directions it would be advisable to track the filter along the other axis direction to allow the filter more iterations in which to update to the new shift. Consequently if images have a spatial variation which is mainly a function of the x coordinate it would be advisable to run the filter down the lines of constant y coordinate only advancing in x coordinate at the end of each y line. It is possible to improve the performance of a given order of filter by ensuring that there is no large relative dc offset between the images. This may occur if the two images were captured under very different illumination levels. Although it may be expected that the filter could compensate for such an offset the process of doing this degrades the performance of the system as it is utilising zeroes of the filter that may be more usefully applied elsewhere. Correspondingly it is therefore often useful to high pass filter the images before registration if there is a possibility of large dc offsets. 
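The following toy sketch illustrates the registration scheme described above on a synthetically shifted random image: a small two-dimensional LMS filter is raster-scanned over the image pair and the shift is read off the peak of the converged coefficient array. The image size, filter order, adaption coefficient and the use of an integer circular shift are illustrative assumptions, and the refinements discussed below (power-normalised adaption coefficient, divergence detection, peak interpolation) are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

# Reference image and an input image shifted by (dy, dx) = (3, -2).
ref = rng.standard_normal((80, 80))
dy, dx = 3, -2
inp = np.roll(np.roll(ref, dy, axis=0), dx, axis=1)   # inp[i, j] = ref[i - dy, j - dx]

m = 9                     # filter size; must exceed twice the largest expected shift
c = m // 2                # the half-filter offset between the two images
W = np.zeros((m, m))      # 2-D weight array
mu = 0.002                # adaption coefficient (illustrative value)

# Raster-scan the image, adapting the 2-D LMS filter at every pixel.
for i in range(c, ref.shape[0] - c):
    for j in range(c, ref.shape[1] - c):
        patch = inp[i - c:i + c + 1, j - c:j + c + 1]   # filter input window
        y = np.sum(W * patch)                           # filter output pixel
        e = ref[i, j] - y                               # error against the reference
        W += 2 * mu * e * patch                         # 2-D LMS update

# For a whitish image the converged coefficients peak at index (c + dy, c + dx).
peak = np.unravel_index(np.argmax(np.abs(W)), W.shape)
print("estimated shift (dy, dx):", (peak[0] - c, peak[1] - c))   # expect (3, -2)
```

With a white random image the coefficient array converges towards a Kronecker delta at the shift, as described in Section 4.5 below; for real, more strongly correlated images the peak is less dominant and convergence is slower.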
4.4 The Adaption Coefficient The value of the adaption coefficient is important in determining the mode of operation of the system. As a generalisation the nearer the adaption coefficient approaches the divergence value so the output image becomes sharper and the system can track spatially varying parameters more rapidly. However, smaller values of the adaption coefficient give higher noise immunity. One of the main problems in the practical implementation 49 of adaptive image registration is the choice of a suitable adaption coefficient. In fact because of the nonstationary statistics of practical images it is necessary, in order to improve performance, to implement a self regulating adaption coefficient which varies with the image. A simple yet reasonably effective approach is to update the coefficient of adaption as a function of the input power to the filter. This approach, which is based on the work of Yassa [3], has been simplified to reduce computational load. A power estimate is obtained from the row of unseen data about to enter the filter, this is then filtered by a single pole filter whose feedback factor can be adjusted to control the smoothing of the power estimate. Incident power: Pk = L X;,l i=l,m Smoothed power:Fk = Fk-1g + Pk Where 9 is the feedback factor. The adaption coefficient is updated by: Where /-lint is the initial adaption coefficient. If it is desired to use the filter output image directly or to track spatially varying shifts a larger adaption coefficient may be needed. Even with the above adaption coefficient updator divergence may still occur on certain parts of some images so it is necessary to have a divergence detector. This may be done by checking that the output pixels do not exceed a certain value. If this should occur the /-lint value is immediately reduced by a factor (usually about 0.7) and the filter coefficients reset to either all zero, or all zero with a value one at the coefficient corresponding to a recent shift estimate. 4.5 Filter Solutions As may be expected from the one dimensional analogy the performance of the two di- mensional LMS algorithm depends on the eigenvalue spread of the input data, and the time taken to converge to the correct shift estimate is reduced as the whiteness of the data increases. In order to extract the shift estimate explicitly from the coefficients of 50 Input IMage Power ~Fl EstiMator . L, "'" ilFllter H.P.Filters LMS Coetf'. Update Error DisplaceMent Vector ~ Figure 4.2: The Complete Registration System. the adapted filter we must consider the filter solution. In the case of white input signals and integer shifts the filter converges to a Kronecker delta function at the shift delay as may be expected, and for non-integer shifts it converges to an interpolator. Almost all practical images do converge to solutions with peaks at the delay value. The dominance of this peak is however determined by the whiteness of the input image, whiter images giving a more dominant peak. A simple peak detector may be used to extract the shift and interpolation of the peak may be used to obtain non-integer shift estimates. By using such a shift extractor at regular intervals a displacement vector map may be built up showing the displacement vector over the image. 4.6 Output Image As a byproduct of the processing, a registered and shifted corrected image is produced at the filter output. 
However, apparently due to the sensitivity of the LMS algorithm to eigenvalue spread and the differing convergence rates for different eigenvalues, the solutions reached are near optimal but lead to subjectively slightly blurred images. These problems may be alleviated by using adaption coefficient values near to divergence, but this can lead to a risk of divergence occurring. Another approach is to use the displacement vector field to 'manually' shift back the original image. This is currently done with many registration methods.

4.7 Performance

The filters were run on clean images with a spatially stationary shift. In practice a lock onto the correct shift is achieved within thirty or forty pixels. The adaption coefficient value may be small, such that the output image is very blurred, and yet give good results for the displacement vector. All the results shown were generated using images of 80 by 80 pixels which were quantised to 6 bits. The same experiment was repeated with the addition of equal levels of white uncorrelated noise to each of the images, and the displacement vector field recorded. As may be expected the filters took longer to lock onto the shift and, in the case of high noise levels, the lock could be lost and regained throughout the picture. However, even at these high noise levels further increases in performance could be obtained by averaging of the displacement vector, making use of the property of continuity of the displacement vector [12]. Finally the filters were run on an image pair in which the filter input image had been obtained by a spatially varying shift which was a function of the x coordinate.

Photographs

The photograph format is:

• Top left: reference image.
• Top right: input image.
• Bottom left: error squared.
• Bottom right: output image.
• Small plot of final coefficient values.

The photographs and results are in the following order:

1. Results of constant shift case. Filter 20 by 20 with a (5,4) shift and the initial adaption coefficient set to 0.008.
2. Results of spatially varying shift case. Filter 12 by 12 with a shift δy = 5 sin(0.004 x). Initial adaption coefficient set to 0.004.
3. Results of constant shift case with additive Gaussian white noise. Filter 20 by 20 with a (3,2) shift and the initial adaption coefficient set to 0.004.
4. Plot of final coefficient values for spatially stationary experiment.
5. Plot of final coefficient values for spatially varying experiment.
6. Plot of vector displacement output for spatially stationary experiment.
7. Plot of vector displacement output for spatially varying experiment.

Final coefficients, result 1.

Final coefficients, result 2.

Vector displacement, result 1.
Vector displacement: result 2.

4.8 Computational Load

For spatially stationary shifts the filter need only be run over small portions of the image. The filter size is set by the largest shift to be dealt with, but this may be reduced by calculating displacement vector fields from a decimated image. The algorithm is inherently parallel and suited to the use of current adaptive digital filter integrated circuits.

4.9 Template Matching Possibilities

The following system may be used to implement the multidimensional adaptive filter as a template matcher, by using the reference image of an object as the desired input and a scene as the filter input. The filter correlates the object in the scene with the reference and shifts the scene. Thus the error image can be used as a measure of the match over the relevant region, providing a weakly shift-invariant template matching system. Higher order, for example 3D, adaptive filters may be utilised to register over a sequence of frames by the same principle as is outlined above.

Figure 4.3: Template Matching System.

4.10 Conclusion

The LMS multidimensional adaptive filter can provide a new approach to image registration which is well suited to parallel implementations. The system also has some useful properties, giving the noise immunity of correlation-based methods but not requiring a search over possible shifts. It also has the ability to deal with weakly spatially varying shifts, unlike transform-based methods. The adaptive filter can also handle weak image degradations, as its frequency response is not constrained and may adapt to correct differences in the two images caused by sensors or uneven illumination and so on. The highly non-stationary nature of real images means that care must be taken to optimise the adaption coefficient continually if sharp output images are to be obtained. This appears to be related to the unequal convergence rates for the different eigenvalues of the image correlation matrix, as may be expected from one-dimensional work [1]. Setting the adaption coefficient close to the optimal can lead to streaking effects as the filter momentarily diverges, since the optimal adaption coefficient value changes with the local statistics of the image. In general the method provides a useful and easily implemented approach which will perform better than transform-based methods for many practical cases due to its ability to overcome weak spatially varying effects.

Part II

Chapter 5
Pattern Recognition

5.1 Introduction

A standard application of connectionist networks is in pattern recognition.
This half of the thesis attempts to define the abilities that a pattern recognition system requires, and to investigate the performance of the extended vector space connectionist model with respect to these abilities.

5.2 The Problem

In the pattern recognition problem as addressed in this thesis, a system is required to output a class index which reflects the particular class to which its input signal belongs. Consider a decision rule which partitions a space into regions $\Omega_i$, $i = 1, \ldots, N$, where $N$ is the number of classes. An object is classified as coming from class $\omega_k$ if its corresponding vector representation $\mathbf{x}$ lies in region $\Omega_k$. The boundaries between the regions are called decision surfaces. The task can be made more difficult by the effect of noise on the data and the variability of data for patterns belonging to the same class. The following is a list of desirable properties for a general pattern recognition system.

Figure 5.1: Pattern Recognition Problem. (Input patterns mapped to an output index.)

5.2.1 Adaptivity and Learning

A suitable pattern recognition system will need to be adaptive for most applications. The necessary rules to define most real-world recognition tasks are too complex to be analytically derived. This becomes more apparent when the effects of noise or class variation are considered. These problems may be circumvented by utilisation of a self-learning system, which by experience of the recognition problem adapts itself to solve the problem.

5.2.2 Interpolation and Generalisation

Once a system has been trained using the members of a training set it is hoped that the system will correctly classify a new pattern which was not in the training set but is a member of a class defined by it. This property is often referred to as interpolation or generalisation, as it is the ability of the system to interpolate the classifier output between training patterns. There is an implicit assumption in the concept of interpolation that patterns belonging to the same class will cluster in a similar region of the decision space. This may not always apply, for example in the case of a capital and a small letter 'A'. However, from an analytical viewpoint this may be considered as two classes mapping to the same output value. It would be hoped that capital A letters will cluster, and so on.

5.2.3 Learning Speed

It is obviously an advantage if the recognition system can adapt itself to a state of 'good' performance with as few pattern presentations as possible. The ability to adapt quickly is also of advantage as it enables the recognition system to respond to changes in the noise or class statistics.

5.2.4 Certainty of Learning

It is impractical to have systems which may get stuck in their learning processes and not reach their optimal performance; we thus require a system that will not suffer from local minima problems. These problems would require the system to be supervised and reset, incurring multiple runs and consequently large computational expenditure.

5.2.5 Multiple Class Problems

Although most recognition tasks concerning multiple classes may be reduced to a series of one-class recognition problems, this is in general highly inefficient in terms of computational load. An ideal system should be able to perform multiple class recognition.

5.2.6 Parameter Tuning

For a system to be practically implementable, a method of tuning any of the system network parameters must be found, to prevent the need to repeat training runs in order to optimise these parameters.
5.2.7 Parallelism

The system should be highly parallel to allow high speed operation. A parallel implementation may also allow the processing resources to be distributed, so that the system may be composed of many simple processing units.

5.2.8 Probabilistic Properties

A pattern recognition system must take account of the fundamental probability distributions relating the observations and the true classification of the object. An understanding of the importance of this may be gained from [17] [26] and the following Bayes' recogniser section.

5.3 Traditional Approaches

Pattern recognition methods can be broadly classified into two distinct groups, statistical and syntactic. This thesis is primarily concerned with statistical pattern recognition rather than syntactic recognition. There are many methods which have been developed in the statistical pattern recognition field, and an introduction to many of them may be found in [26]; these include kernel methods, nearest neighbour methods, Fisher discriminants and so forth. The Bayesian basis of pattern recognition and Fisher's approach are briefly described here.

5.3.1 Bayesian Classification

Using the notation of the introduction, begin by assuming the probability that an object comes from class $\omega_j$ is known, $P(\omega_j)$. As this is an overall probability known before an observation vector $x$ has been observed, it is a prior probability. Once an observation is made and an observation vector $x$ is known, we can compare the probabilities of belonging to each class for an observation $x$ and classify according to whichever is larger:

$$x \in \Omega_k \quad \text{if} \quad P(\omega_k \mid x) > P(\omega_j \mid x)$$

for all $j \neq k$, where $\Omega_k$ is the set of objects in the $k$th class. This rule is known as Bayes' minimum error rule. The $P(\omega_j \mid x)$ are known as the posterior probabilities. Sadly these are not normally known and must be estimated. This estimation can be done by making use of samples of known classification, as is pursued later in this thesis. In many cases, however, Bayes' theorem is applied to give:

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}$$

yielding:

$$x \in \Omega_k \quad \text{if} \quad P(x \mid \omega_k) P(\omega_k) > P(x \mid \omega_j) P(\omega_j)$$

for all $j \neq k$. As before, however, the $P(x \mid \omega_j)$ are not often known. It is often the case that an incorrect classification may be of varying importance as a function of class. For example, in medical diagnosis it is often a minor problem to classify a healthy patient as unwell, but it is very dangerous to classify an unwell patient as well. To build in this factor a cost function $c_{ij}$ may be defined which gives the cost of misclassifying an object from class $\Omega_i$ as coming from class $\Omega_j$. If $x \in \Omega_i$ the expected cost is:

$$r_i = \sum_{j=1}^{N} c_{ij} \int_{\Omega_j} P(x \mid \omega_i)\, dx$$

The overall expected cost or risk is thus:

$$r = \sum_{i=1}^{N} r_i\, P(\omega_i)$$

This is minimised by defining $\Omega_k$ such that $x \in \Omega_k$ whenever:

$$\sum_{i=1}^{N} c_{ik} P(x \mid \omega_i) P(\omega_i) \le \sum_{i=1}^{N} c_{ij} P(x \mid \omega_i) P(\omega_i)$$

for all $j \neq k$. This is the Bayes minimum risk decision rule.

5.3.2 Classical Fisher Statistical Pattern Recognition

Fisher in 1936 founded the classical approach to discriminant analysis with Fisher's criterion. The problem is one of finding that direction in the discriminant space along which the two groups to be classified are maximally separated. Fisher defined the separation between the two groups in a particular direction as the distance between the means of the two groups, standardised by the within-group variance in the specified direction. The importance of this standardisation may be appreciated by consideration of the following example. Prior to standardisation the separation in the $x_1$ direction appears greater than that in the $x_2$ direction (Fig. 5.2).
After standardisation it is, however, clear that the separation in the $x_2$ direction is greater. In general, standardisations may be performed in any general direction, and the problem is to find the direction $\mathbf{v}$ such that $\mathbf{v}^t\bar{\mathbf{x}}_1 - \mathbf{v}^t\bar{\mathbf{x}}_2$ is maximised relative to the standard deviation $(\mathbf{v}^t S \mathbf{v})^{1/2}$ in that direction, where $\bar{\mathbf{x}}_i$ is the sample mean of the design set for class $\omega_i$ (i = 1, 2), and $S$ is the assumed common sample variance-covariance matrix.

Figure 5.2: Prior to standardisation.

Figure 5.3: After standardisation.

For each class, if we have a set of observed sample vectors $\alpha_1, \ldots, \alpha_n$, then:

$$\bar{\mathbf{x}}_i = \frac{1}{n} \sum_{p=1}^{n} \alpha_p$$

and:

$$S = \frac{1}{n-1} \sum_{p=1}^{n} (\alpha_p - \bar{\mathbf{x}}_i)(\alpha_p - \bar{\mathbf{x}}_i)^t$$

To find $\mathbf{v}$, maximise with respect to $\mathbf{v}$:

$$\frac{\mathbf{v}^t\bar{\mathbf{x}}_1 - \mathbf{v}^t\bar{\mathbf{x}}_2}{(\mathbf{v}^t S \mathbf{v})^{1/2}}$$

Differentiating this with respect to $\mathbf{v}$ and equating it to zero gives:

$$\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2 = \frac{\mathbf{v}^t(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)}{\mathbf{v}^t S \mathbf{v}}\, S \mathbf{v}$$

As only the direction is required:

$$\mathbf{v} \propto S^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$$

5.3.3 Limitations of the Classical Approaches

It is difficult to generalise over so many different approaches to the pattern recognition problem. However, in general the computational load required to perform multiclass multivariate recognition is prohibitive with many of the classical methods. The computational form is also often not suitable for parallel implementation, and it is often very difficult to create an adaptive method which will optimise itself easily as more data arrives, without a complete recalculation. The most popular classical techniques are the parameterised distribution techniques, which assume that the problem closely resembles a priori probability distributions (such as Gaussian); although some methods allow the distribution parameters to vary as a function of class, the distribution form is usually fixed. This assumption is often invalid, as the probability distributions inherent in the problem are often highly anisotropic, and class and observation dependent. For example, the variation found in handwritten characters arises not from purely random processes: the different observed vector directions, such as displacements or speeds, may correlate in the way they vary. The non-parametric methods such as nearest neighbour normally become computationally very heavy as they may require multidimensional search operations. These problems have led to the search for computationally lighter, adaptive, parallel methods which require fewer a priori assumptions. Connectionist models avoid these limitations as they are non-parametric and make weaker assumptions about the shapes of the underlying distributions than traditional statistical pattern recognisers.

Chapter 6
Connectionist Models and Neural Networks

6.1 Introduction

6.1.1 Limitations of the Conventional Syntactical Approaches

The modern computer has proved itself to be well suited to solving many problems and provides a powerful tool for aiding the solution of many others. There has, however, emerged a type of problem for which standard 'syntactic' or programming methods do not perform well. These problems generally have many input variables and many complex rules relating their outputs to their inputs. The rules that arise in many real-world problems, such as visual object recognition under practical conditions, are in general too complex for a suitable set of rules to be worked out explicitly. Consequently, in general, syntactic methods of sufficient complexity cannot practically be constructed to solve these problems.
6.1.2 The Adaptive Approach

The adaptive neural network can address this problem in that it does not require explicit statement of the rules of the problem, as it will learn them implicitly. Such networks are composed of many simple processing units linked together to create a system capable of performing complex tasks. They are also capable of learning statistical or probabilistic rules which may be fuzzy in nature.

6.2 Artificial Neural Networks

For a long time there has been an interest in reproducing in a machine the abilities of the mammalian brain. Attempts to develop detailed mathematical models began in the 1940s with work by McCulloch [44] and Rosenblatt [45]. In the 1960s Widrow's work [46] with linear perceptrons created much excitement which, as the limitations of the linear approach became apparent, was to fade again until the more recent work by Hopfield [47], Rumelhart and McClelland [48], and the sudden growth of interest that ensued. The definition of what precisely constitutes an artificial neural network, or perhaps more scientifically a connectionist model, is vague, and to attempt to review all the methods under this general heading would be excessive. In this thesis the work relates to a particular class of connectionist model. Networks may be divided into two classes:

1. Binary: These networks normally take binary input data and produce binary output data. However, internally continuous arithmetic may be used. An example of such a network is the Hopfield net.

2. Continuous: These networks take continuous (or multilevel) input data and may produce continuous or binary output.

Each of these two classes may again be divided according to its learning abilities.

1. Supervised: The network is given a series of correct output values from some other problem-defining system and may use these in conjunction with the input data to adapt itself to improve its performance.

2. Unsupervised: The network is not given correct outputs but is required to extract relationships in the data to allow it to form classification groupings which may in some manner classify 'like' patterns together.

Figure 6.1: Classes of Neural Networks. (Binary input: supervised, e.g. Hopfield net; unsupervised, e.g. Grossberg net. Continuous-valued input: supervised, e.g. higher order networks, MLP, RBF; unsupervised, e.g. Kohonen map.)

This thesis is only concerned with continuous input signal networks. This is the most general signal form and that which is found in most practical problems. The discussion will be constrained to deal only with supervised networks, as these most closely resemble adaptive signal processing systems. It is also clear that currently the most general problems being addressed by neural networks are usually of the supervised continuous signal type.

6.3 ADALINE

The Adaptive Linear Neuron, or ADALINE, is a simple linear combiner with an LMS feedback used to adjust its weights. The output is usually passed through a threshold device to produce a binary output. The linear nature of the decision region (the line in the signal space which marks the transition of the outputs from classification -1 to classification +1) is easily shown by considering a 3-coefficient ADALINE [24]. For a threshold of zero:

$$w_0 + w_1 x_1 + w_2 x_2 = 0$$

This gives:

$$x_2 = -\frac{w_0}{w_2} - \frac{w_1}{w_2} x_1$$

Thus the ADALINE can only differentiate signals into classes which are linearly separable.
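As an illustration of the above, the following is a minimal sketch of an ADALINE: a linear combiner whose weights are adjusted by the LMS rule, with a zero threshold applied to the output. The training set, learning rate and epoch count are illustrative assumptions rather than values from this thesis.

```python
import numpy as np

# Minimal ADALINE sketch: a linear combiner trained with the LMS rule, with a
# hard threshold applied to the output for classification. The training set,
# learning rate and epoch count are illustrative, not taken from the thesis.
def train_adaline(patterns, targets, mu=0.1, epochs=100):
    n = patterns.shape[1]
    w = np.zeros(n + 1)                      # w0 (bias) plus one weight per input
    for _ in range(epochs):
        for x, d in zip(patterns, targets):
            xa = np.concatenate(([1.0], x))  # constant input carries the bias weight
            err = d - np.dot(w, xa)          # LMS error is taken on the linear output
            w += mu * err * xa               # stochastic gradient (LMS) update
    return w

def classify(w, x):
    # Decision: threshold the linear combiner output at zero, giving +/-1 classes.
    return 1 if np.dot(w, np.concatenate(([1.0], x))) >= 0 else -1

if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    d_and = np.array([-1, -1, -1, 1], dtype=float)   # AND: linearly separable
    w = train_adaline(X, d_and)
    print([classify(w, x) for x in X])               # -> [-1, -1, -1, 1]
```

Trained on the linearly separable AND problem the sketch converges to a separating line; applied to the exclusive-OR problem discussed in Chapter 7 it cannot succeed, since no single line separates those classes.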
6.4 Hidden Layer Back-propagation

If the decision boundaries of systems composed of linear perceptrons are considered, it is clear that the decision region is a line (in 2 dimensions) or a hyper-plane (in N dimensions). This restricts the system from forming suitably flexible decision regions. To avoid this restriction to linear systems it is necessary to introduce some non-linearity into the transfer function. This may be done by adding a non-linearity to the basic perceptron unit. Obviously there are many non-linearities which could be chosen. One in particular, which has gained favour due to its similarity to some transfer functions measured in biological neurons, is the sigmoid, or hard limiter. The hidden layer back-propagation network is composed of neurons made up from a weighted linear combiner followed by a sigmoidal non-linearity. This non-linearity is normally based on a linear region around the origin. There is not a great deal of support from a signal processing point of view for this particular non-linearity; it has, however, become widely used as it bears a similarity to the transfer function observed in biological neurons. In fact even this justification has been questioned, as the sigmoidal response is found in motor control neurons rather than information processing neurons.

Figure 6.2: Hidden Layer Back-propagation Neuron.

The network is made up from sheets of such neurons connected in layers. The name 'hidden layer' comes from the fact that there may be a series of layers 'hidden' between the output and input layers. The weights of the network are adapted by an algorithm which is very similar to the LMS algorithm, known as back-propagation.

Figure 6.3: Decision Contour for a Back-propagation Multilayer Perceptron Network.

Consider the case of a sigmoidal non-linearity:

$$f(\alpha) = \frac{1}{1 + e^{-(\alpha - \theta)}}$$

Consider an input vector $\mathbf{x}$ and a desired output vector $\mathbf{d}$. From the output layer to the input layer, adjust the weights by:

$$w_{ij}(t+1) = w_{ij}(t) + \eta\, \delta_j x_i'$$

where $w_{ij}(t)$ is the weight from hidden node $i$, or from an input node $i$, to node $j$ at time $t$; $x_i'$ is either the output of node $i$ or is an input; $\eta$ is a gain term; and $\delta_j$ is an error term for node $j$. If node $j$ is an output node then:

$$\delta_j = y_j(1 - y_j)(d_j - y_j)$$

where $d_j$ is the desired output of node $j$ and $y_j$ is the actual output. If node $j$ is an internal hidden node then:

$$\delta_j = x_j'(1 - x_j') \sum_k \delta_k w_{jk}$$

where the sum over $k$ is over all nodes above node $j$. Internal node thresholds are adapted similarly by assuming that they are connection weights on links from auxiliary constant-value inputs. The class regions are bounded by line segments, the rounding at the joins of which is a function of the hardness of the sigmoidal non-linearity. The number of line segments is related to the number of nodes.

6.4.1 Advantages and disadvantages of HLBP

Although perhaps currently the most popular form of connectionist model, it is difficult to find advantages for it over other networks. It is slow to adapt, can become stuck in local minima, has very limited mathematical justification, and is difficult to analyse. The analogy of its form with biological neurons is also questionable. There tend to be a number of ad hoc parameters which must be optimised experimentally to allow the adaption algorithm to function.

6.5 Radial Basis Functions

These networks are based on the method of radial basis functions, a technique for interpolating in high dimensional spaces [16]. The RBF networks are of interest in that they have arisen from a careful and mathematically well founded assessment of the required tasks of a connectionist model.
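Before continuing with the radial basis function approach, the back-propagation update of Section 6.4 may be illustrated concretely. The following is a minimal sketch for a single hidden layer trained on the exclusive-OR problem; the layer size, gain term, epoch count and random initialisation are illustrative assumptions, not values from this thesis.

```python
import numpy as np

# Minimal sketch of a one-hidden-layer back-propagation network with sigmoidal
# units, following the update rules of Section 6.4. Layer size, gain term eta,
# epoch count and random initialisation are illustrative assumptions.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0.0, 1.0, 1.0, 0.0])                 # exclusive-OR targets

W1 = rng.normal(scale=0.5, size=(2, 3))            # input-to-hidden weights
b1 = np.zeros(3)                                   # hidden thresholds
W2 = rng.normal(scale=0.5, size=3)                 # hidden-to-output weights
b2 = 0.0                                           # output threshold
eta = 0.5                                          # gain term

for _ in range(20000):
    for x, t in zip(X, d):
        h = sigmoid(x @ W1 + b1)                   # hidden layer outputs
        y = sigmoid(h @ W2 + b2)                   # network output
        delta_out = y * (1 - y) * (t - y)          # error term for the output node
        delta_hid = h * (1 - h) * (delta_out * W2) # error terms for hidden nodes
        W2 += eta * delta_out * h                  # LMS-like weight updates
        b2 += eta * delta_out
        W1 += eta * np.outer(x, delta_hid)
        b1 += eta * delta_hid

# Outputs should approach the targets 0, 1, 1, 0; an unlucky initialisation can
# instead leave the network in a local minimum, as discussed above.
print(np.round([sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2) for x in X], 2))
```

As noted in Section 6.4.1, convergence is not guaranteed and the many iterations required here contrast with the unimodal adaption of the filter derived in Chapter 7.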
It is clear that the classification problem may be viewed as a multidimensional interpolation problem. RBF networks are consequently a good method of producing a network capable of classifying input vectors into a series of classes. Define a set of $m$ generally non-linear basis functions:

$$\phi(\| \mathbf{x} - \mathbf{y} \|)$$

Interpolating functions may be constructed from these basis functions with the form:

$$s(\mathbf{x}) = \sum_{j=1}^{m} \lambda_j\, \phi(\| \mathbf{x} - \mathbf{y}_j \|)$$

These basis functions may be centred on any points $\mathbf{y}_j$, including points which are not members of the training data set.

6.5.1 Advantages and Disadvantages of the RBF Approach

The RBF approach produces networks which, by the very nature of their derivation, tend to have good generalisation and interpolation properties. The most basic form of the RBF network is also unimodal and consequently avoids all of the problems encountered with HLBP networks due to local minima, and can consequently show quite fast adaption rates. However, the necessity of choosing the basis functions a priori limits the performance of the method, as it assumes a priori the form of the pattern distributions. This assumption may not only be incorrect in general, but it is also likely that the form of pattern clusters and the statistical properties of the patterns may be different for different regions of the recognition space. This can lead to highly variable results depending on the form of RBF and the RBF centre positions for a given problem. The RBFs, being localised, may be expected to work moderately well for well clustered data. Attempts have been made to address the RBF preselection problems by some form of adaption of the RBF position and form, but these suffer from the same problems of multimodality and slow convergence found with the HLBP [48]. From a theoretical viewpoint it is interesting to note that RBF networks are subsets of general higher order networks in which polynomial terms have related coefficients. This can be shown by expanding each RBF as a power series. RBFs are however a well conceived approach and may be of great use for some types of problem in which there is good a priori information.

6.6 Group Method of Data Handling

Ivakhnenko [36] in the sixties was among the first to suggest the possibility of using a polynomial type non-linearity in an adaptive system to solve generalised problems. This approach has sadly been largely neglected by the neural network community. It is in fact a reasonably well conceived method with good performance. A multilayer polynomial system was developed, and this is the method normally referred to as the Group Method of Data Handling (GMDH). This method attempts to reduce the number of polynomial terms by using the outputs of smaller polynomials as the inputs to other polynomials in a layered structure. Whilst the approach is much better grounded in theory than other methods, the multilayer approach leads to multimodality and problems in adaption if a normal 'on the job' adaption is to be done. The method is more suited to block processing on a predetermined training set. The ad hoc nature of some of the elimination rules also requires multiple runs and a priori knowledge to allow good performance. Experience in signal processing shows that it is better to avoid the considerable problems encountered in multimodal system adaption by using a different method to create and select the useful polynomial terms.
Most of the work in this area has been done on prediction problems, with surprisingly little work in pattern recognition, and consequently very little has been done on optimisation of the method for this problem. There is now a series of closely related approaches based in general on the following. Firstly, a set of first layer polynomials is calculated directly from the input data $x_1, \ldots, x_m$ by taking pairs of input variables, giving $l = 1, \ldots, m(m-1)/2$ new values:

$$y_l = A + Bx_i + Cx_j + Dx_i^2 + Ex_j^2 + Fx_ix_j$$

The coefficients of these equations are each optimised so that the error $\epsilon = d - y_l$, taken over the training set, is minimised in the mean square error sense. Polynomials whose error after optimisation is highest are eliminated from the algorithm. These $y_l$ are now used as the inputs to another set of polynomials, and so on. This is continued until the minimum error from a polynomial in stage n is less than that for stage n+1. The normalised RMS error over the training set is called the regularity criterion and is defined:

$$r^2 = \frac{\sum_{i=1}^{p} (d_i - y_i)^2}{\sum_{i=1}^{p} d_i^2}$$

where there are $p$ training set values $d_i$. The multimodal nature of the error surface leads to the need to apply a very slow adaption algorithm, which is usually some form of modified random search. Such approaches, called 'guided random search', are not dissimilar to simulated annealing or genetic algorithms [36] and as such are very slow to converge. Another problem with GMDH is that although two polynomials may each independently fail to lower the output error, their cross products may lower it. Thus the independence of the term selection procedure is questionable.

6.7 Higher Order Networks

These are simple polynomial based networks. During the time of the work covered in this thesis, work has been done in this area [24], mainly motivated by the ease of implementation of multiplication-like terms in optical methods. Most of the higher order approaches used in neural networks are very simple and no attempt is made to optimise their required computational load as in GMDH. They have often been based on binary networks and only applied to small problems.

6.8 The New Approach

The system developed in this thesis is a higher order system which is developed and refined using a signal processing approach and is aimed at the pattern recognition problem. The new approach is related to GMDH in that it also attempts to find a polynomial network. However, it is designed to be adaptive in operation both for training data and for real data, and it is optimised in performance by methods which are consistent with unimodal operation. It is also optimised with respect to the pattern recognition problem.

6.9 Other Continuous Supervised Connectionist Model Forms

There is now a series of proposed modifications of current networks, and some new types of network. These can be found referenced in [24]. Many of these other networks can be expressed as modifications of HLBP or radial basis functions. Some other methods of interest are Kohonen self-organising nets (unsupervised) and adaptive resonance theory (unsupervised) [24].

Chapter 7
Generalised Adaptive Filters

7.1 The Signal Processing Approach

In the early days of neural networks there was a tendency to over-extrapolate known physiological information to provide networks. There was also a tendency to produce networks by empirical methods. This led to systems which could not easily be analysed or optimised.
Recently the field of signal processing has advanced sufficiently to allow the mathematical derivation of a network form using a well defined signal processing approach. Firstly, it is necessary to consider a basic problem. A standard 'benchmark' for neural networks is the exclusive-OR or parity problem. In this problem a series of binary inputs is presented to the network and it is required to return the parity function value of the inputs. We can consider this problem from a signal space approach. Consider a vector space created by plotting each of the input variables along a set of orthogonal axes. For a 2 input variable problem this space is a plane. For the binary exclusive-OR problem the input values (or 'input vectors') are (1,1), (0,0), (1,0) and (0,1), the first two being in class one and the second two in class two. It can be seen from Fig. 7.1 that a linear discriminator cannot be used to differentiate between the classes in the EX-OR problem. We can however transform the problem into an extended vector space in which the problem may be solved using a linear discriminator. We increase the dimensionality of the decision space by adding a dimension $x_1x_2$. A linear plane may now be used to separate the classes.

Figure 7.1: EXOR Problem.

Figure 7.2: Extended space EXOR Problem.

Figure 7.3: Extended space EXOR Problem decision contour.

Figure 7.4: Pattern Recognition as an adaptive Modelling Problem. (Block diagram: the input pattern drives both the classification-defining system, e.g. a human, and the adaptive recogniser; their difference forms the error.)

If we collapse the decision plane of Fig. 7.2 onto the 2D plane as in Fig. 7.3, it can be seen that the decision contour is non-linear. So by using a linear discriminator in an extended vector space it is possible to perform classification tasks requiring non-linear discriminators.

7.1.1 Non-linear System Modelling Approach

The pattern recognition problem may be thought of as a system modelling problem in which the adaptive pattern recogniser is attempting to model the classification system; for example, a human defining the relationship between a set of character font images and a set of collating sequence numbers. The block diagram showing the recognition problem cast as a system modelling problem is shown in Fig. 7.4. It is clear that the relationship between the inputs and outputs of such a system is in general a non-linear one; thus the recognition system must be capable of generating a sufficiently close approximation to the non-linear relationship defining the system.

7.1.2 The Volterra Functional Analysis

In signal processing such non-linear systems may be modelled by use of the Volterra series. The form of the Volterra series was first studied by Vito Volterra, but Norbert Wiener was the first to apply it to non-linear systems theory. He used it to model the input-output relationship of non-linear systems. The continuous Volterra series may be written:

$$y(t) = H_0 + H_1[x(t)] + H_2[x(t)] + \cdots + H_n[x(t)] + \cdots$$

where $y(t)$ is the output of the system and $x(t)$ is the input, and in which:

$$H_n[x(t)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h_n(\tau_1, \ldots, \tau_n)\, x(t - \tau_1) \cdots x(t - \tau_n)\, d\tau_1 \cdots d\tau_n$$

where $h_n(\tau_1, \ldots, \tau_n)$ is the Volterra kernel.

Stability of the Volterra Functional

The stability analysis of a Pth order Volterra functional has not yet been done.
However, a sufficient but not necessary condition for the Volterra functional to be bounded-input, bounded-output (BIBO) stable is, by inspection:

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} | h_p(\tau_1, \ldots, \tau_p) |\, d\tau_1 \cdots d\tau_p < \infty$$

where $h_p$ is the pth order impulse response. [13] gives an example of a stable system not meeting the above.

Discrete Volterra Series

A discrete form of the series was developed. In general this will need to be a heterogeneous Volterra expansion in order to model systems composed of multiple orders of non-linearity. The heterogeneous discrete Volterra series may be written as:

$$y = \sum_{n=0}^{N} H_n$$

where the $H_n$ are the Volterra kernels and the $w$ are coefficients which define the model as follows:

$$H_0 = w_0 \quad \text{(zeroth order or 'D.C.' term)}$$
$$H_1 = \sum_i w_i x_i \quad \text{(first order, linear, term)}$$
$$H_2 = \sum_i \sum_j w_{i,j}\, x_i x_j \quad \text{(second order, quadratic, term)}$$
$$H_3 = \sum_i \sum_j \sum_k w_{i,j,k}\, x_i x_j x_k \quad \text{(and so on)}$$

7.1.3 Adaptive Volterra Expansions

Normally the coefficients are fixed by use of analytical methods; however, they may also be determined adaptively. Consider the following system in which the inputs are combined to give the individual Volterra terms. These terms are then multiplied by their respective coefficients and the terms summed. It can be seen that the system can be decomposed into a vector expansion stage followed by a linear Finite Impulse Response (FIR) filter. The adaption of the coefficients may then be achieved by any of the methods used in FIR adaptive filter theory. The analytical basis of this approach allows us to calculate explicitly the form of the error surface. The Volterra series is found to converge for most practical non-linearities, although large input levels or very high order non-linearities should be avoided. A very closely related expansion may be obtained by using the Gabor-Kolmogorov polynomial. An often convenient notational method for such systems is to write the expansion terms as a series of Kronecker products.

In a system such as that shown, the effect of joining an FIR filter to the Volterra expansion may be considered. The system has a set of input values $x_0, \ldots, x_n$ which are input to the non-linear Volterra state expander to produce a set of output values $v_0, \ldots, v_q$, with $n$ being less than $q$. These values then form the input to an FIR adaptive filter.

Figure 7.5: Adaptive Volterra System.

Consider the output values of the vector state expander. This set of values may be represented by the vector $\mathbf{v}$. The filter coefficients may also be represented by a vector $\mathbf{w}$. Using the standard FIR equation we may write the error $\epsilon$ as:

$$\epsilon = d - \sum_i w_i v_i$$

in which $d$ is the desired output from the filter, that is, in our case, the correct classification given the set of inputs $x_i$ to the whole system. In vector form this may be written:

$$\epsilon = d - \mathbf{w}^t \mathbf{v}$$

If we consider the error function for the Least Mean Squares, or LMS, criterion, with the mean being taken over the pattern presentations, the equation becomes:

$$E[\epsilon^2] = \sigma^2 - 2\mathbf{w}^t \mathbf{p} + \mathbf{w}^t R \mathbf{w}$$

where $R$ is the autocorrelation matrix of the data in the non-linearly extended vector space, $\mathbf{p}$ is the crosscorrelation vector between the desired signal $d$ and the non-linearly expanded data, and $\sigma^2$ is the variance of the desired response. The objective is to find a set of weights $\mathbf{w}$ which will minimise the mean square error. The system will then output classifications which differ from the desired classifications with the minimum mean squared error possible for the system.
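The following is a minimal sketch of such a system: a fixed second order Volterra state expansion followed by a linear combiner whose weights are adapted with the LMS rule, applied to the exclusive-OR problem of Fig. 7.1. The expansion order, adaption coefficient and iteration count are illustrative assumptions rather than values from this thesis.

```python
import numpy as np

# Minimal sketch of the adaptive Volterra (extended vector space) filter: a
# fixed non-linear state expansion followed by a linear combiner adapted with
# the LMS rule. Expansion order, adaption coefficient and iteration count are
# illustrative assumptions.
def expand(x):
    # Second-order heterogeneous Volterra expansion of a 2-element input:
    # [1, x1, x2, x1*x2, x1^2, x2^2]
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 * x1, x2 * x2])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0.0, 1.0, 1.0, 0.0])         # exclusive-OR classification indices

w = np.zeros(6)
mu = 0.2                                    # adaption coefficient
for _ in range(200):
    for x, target in zip(X, d):
        v = expand(x)                       # non-linearly extended state vector
        err = target - w @ v                # error is linear in the weights
        w += mu * err * v                   # LMS update of the linear combiner

print(np.round([w @ expand(x) for x in X], 3))   # approaches [0, 1, 1, 0]
```

Because the error is linear in the weights, the mean square error surface of this sketch is quadratic and unimodal, so the adaption converges without the local minima problems of the networks reviewed in Chapter 6.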
The solution for the weights of the filter is obviously the same as that of a Wiener filter operating in the extended space, and so the methods used to find the Wiener solution explicitly may be applied. A more suitable approach for the recognition problem is to find the weight solution adaptively. This has the advantage of not requiring a priori knowledge of $R$, and does not require the inversion of a matrix. It is clear from the above that the LMS error is solely a quadratic function of the weight vector. This shows that the performance surface is a hyperparaboloid and consequently always unimodal. This is a highly desirable property as it shows that the system will not exhibit local minima on the error surface. Such local minima cause the adaption performance of networks to be greatly reduced, as the adaptive algorithm may get stuck in a local minimum and consequently not reach the optimal coefficient values. The property of unimodality also confers the ability to adapt much more quickly, as there are no valleys in which to dither and the gradient is proportional to the distance from the solution. This last property prevents the adaption being greatly slowed or halted by regions of the error surface with small gradients.

A stochastic gradient algorithm may easily be applied by following the derivation of the LMS algorithm. The usual gradient algorithm may be written by differentiating the mean square error with respect to the weights to give an iterative weight update equation:

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \mu \nabla E[\epsilon^2]$$

where $\mu$ is an adaption coefficient which sets the speed of learning. A simple expression for $\nabla E[\epsilon^2]$ may be derived:

$$\frac{\partial E[\epsilon^2]}{\partial \mathbf{w}} = 2E\left[\epsilon \frac{\partial \epsilon}{\partial \mathbf{w}}\right] = -2E[\epsilon \mathbf{v}]$$

The analytic basis of the method, combined with the property of linearity in the coefficients, allows the application of the standard least squares methods. In general, for pattern recognition the problem is degenerate or rank-deficient. A single solution may still be obtained by the application of Singular Value Decomposition (SVD), which yields the minimum norm solution. Such analytical methods may also be used to study the network performance and obtain the eigenvalues for real recognition problems.

Parameter Tuning

There is only one parameter to be chosen in the new network: the adaption coefficient. However, this adaption coefficient is that of a standard adaptive filter. Consequently the large body of knowledge pertaining to adaptive filter theory may be immediately applied. Yassa [3] developed an expression for choosing the optimal adaption coefficient. No such method may be applied in the case of current networks, as the nature of the error surface is complex and contains local minima. Consequently in current networks parameters such as the adaption coefficient must in general be set by trial and error.

7.2 Multiple Level Classifications

The output of the network is well suited to multiple class classifications. This may be done by the use of different levels on the same output. For example, the desired network outputs may be assigned to the integers. In this case an output within 0.5 of the desired classification can be considered as belonging to that classification. If it is important for the network only to classify patterns correctly, and the cost of an incorrect classification is high, a 'guard' band may be defined around each classification, as sketched below.
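A minimal sketch of this multiple-level decoding follows; the class count and guard band width are chosen purely for illustration.

```python
# Minimal sketch of multiple-level classification on a single network output,
# with an optional guard band around each integer class index. The number of
# classes and the guard band width are illustrative assumptions.
def decode(y, n_classes, guard=0.0):
    # The nearest integer index wins; with a guard band, outputs lying more
    # than (0.5 - guard) from every index are rejected as "no decision".
    k = int(round(y))
    if k < 0 or k >= n_classes:
        return None
    return k if abs(y - k) <= 0.5 - guard else None

print(decode(2.3, 5))              # -> 2    (within 0.5 of index 2)
print(decode(2.3, 5, guard=0.3))   # -> None (falls inside the guard band)
```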
Figure 7.6: Classification against network output: Simple case.

Figure 7.7: Classification against network output: Guard case.

Chapter 8
Network Performance

8.0.1 Generalisation Results

The ability of the network to form general decision contours is tested for small two-variable problems. In fact this is rather a harsh test, as in practice if any hyperplane were as densely populated by patterns as in the examples, a further extension to separate the patterns would probably be used. The contours were obtained by forming the extended space correlation matrix over all the patterns and inverting it to solve the standard weight equation. The first class was assigned an index of one and the second an index of zero. The output threshold was set at 0.5. The peanut cluster has proved a particularly difficult test for some types of network, as it requires a larger number of decision line segments than the simple cluster problem. As there was no noise in this experiment the function was not constrained in regions in which no patterns existed, and consequently used the freedom of these regions to allow itself to fit the patterned regions more accurately.

Figure 8.1: Localised Cluster: Patterns and Decision Contour.
Figure 8.2: Localised Cluster: Discriminant Function Values.
Figure 8.3: Localised Cluster: Thresholded Discriminant Function Values.
Figure 8.4: Dual Cluster: Patterns and Decision Contour.
Figure 8.5: Dual Cluster: Discriminant Function Values.
Figure 8.6: Dual Cluster: Thresholded Discriminant Function Values.
Figure 8.7: Peanut Cluster: Patterns and Decision Contour.
Figure 8.8: Peanut Cluster: Discriminant Function Values.
Figure 8.9: Peanut Cluster: Thresholded Discriminant Function Values.

8.1 Network Comparisons

The new network's performance was compared with that of the current networks. The exclusive-OR problem was used as a test. The adaption times, in terms of numbers of iterations, were compared for the different network types. The times for the hidden layer back-propagation networks are taken from an American study [14]. These timings are for the best run, which was achieved by optimising the network parameters over a series of runs. If the HLBP network became stuck in a local minimum it was stopped and restarted from different initial conditions. In the case of the new network the results are for the first run, as the adaption coefficient could be selected beforehand and the local minimum problem did not exist. Thus when comparing the figures it is necessary to realise that, in terms of total pattern presentation numbers, the factor between the numbers of presentations to each network is in fact considerably greater.
Number of bits   Volterra   HLBP
2                15         95
3                17         265
4                62         1200
5                106        4100
6                292        20000
7                556        100000
8                1287       over 500000

If the figures in the above table are plotted for the two networks, it is possible to see that the learning times for the current network rise quickly as a function of the number of input variables, whereas those for the new network rise at a lower rate. The results of the graph imply that the learning times for current networks are likely to be extremely large for complex problems with more than a few input variables. In fact, to see the Volterra results on the same scale a log plot is needed.

Figure 8.10: Presentations v Number of bits. HLBP solid, Volterra dashed line.

Figure 8.11: LOG Presentations v Number of bits. HLBP solid, Volterra dashed line.

8.1.1 Interpolation Performance Comparison

In this section the interpolative ability of two current network types, Hidden-Layer and Radial Basis Function (RBF) [42], is compared with that of the new network. The network is trained on the exclusive-OR class points (0,0), (1,0), (0,1), (1,1) until full convergence. The adaption is then stopped and the inputs are varied over the full signal space. The class decision is then plotted as a point for class one and no point for class two. Graphs a) and b) were provided by Niranjan [19].

(Graphs a), b) and c): class decision plots over the input plane for the three network types.)

The hidden layer network displays poor interpolative abilities, which is to be expected as the decision region has linear boundaries. The Radial Basis Function network performs much better, producing suitably localised classes. This increase in performance is to be expected due to the relationship between RBFs and multidimensional interpolation. The new network converges to the best solution of the three, which is extremely near to the optimal solution given no information other than at the class points. In fact, if the above is repeated for the network before it has fully converged, it displays results which are similar to those of RBF networks; that is, the decision regions for a partially converged network are of a similar form to those of the fully converged RBF network. However, the new network is not constrained to a set of a priori 'RBFs' and so can converge further. In fact this behaviour can be seen by following the decision contours as the network adapts.

8.1.2 Multiclass Performance

In practice it has not proved possible, in general, to use current networks for classification with more than 26 classes. In general, multiple class problems are addressed by using a series of networks. The flexibility of the decision surfaces of the new network, however, allows a single network to perform multiclass problems. One such problem is that of character recognition. The network was presented with an 8 by 8 pixel image of a character
It is also possible to increase the number of classes that a system can recognise by using the same vector expansion 'hardware' with multiple adaptive filter sections connected to it. 8.2 Determination of Net~ork Capacities and Degeneracy The capabilities of the network to differentiate inputs into a number of classes can depend on the degree of non-linearity of the relationships between the input patterns and output indices. A measurement of the 'fullness' of the network can be determined by considering the rank of the extended-space correlation network R. A rank lower than the number of expanded terms shows that the solution to the problem is degenerate and consequently the network should be able to perform well in that it can produce a number of solutions which will minimise the mean square error. However, Volterra network generalisation properties are found to be reduced for rank deficient problems as the polynomial form of the network can tend to over-model parts of the extended space. This can be avoided by application of the network size optimisers described later in this thesis. The extended-space correlation matrix is given by: R - t ex - VV 106 where v is the extended state vector. The rank of the matrix may be obtained by various methods, although Singular Value Decomposition can give good results when working with rank deficient systems. Once the rank of the extended space correlation matrix is full this implies that there is now a unique solution. The ability of the network correctly to classify patterns now also depends on the regions over which the outputs are thresholded to within the correct class. Thus, for example, a correct output may be obtained if the network desired output is 1.00 but the network actually outputs 1.07, if the output thresholder operates between .5 and 1.5. This question of output region definitions is also relevant to the cost of incorrect classifications and more work may be done in future to relate the thresholded mean square error criterion to the cost and probabilities of incorrect classifications and hence set optimal output ranges. 107 I ' Figure 8.13: Example character. 8.3 Visual Recognition Results 8.3.1 Direct Recognition The network was first applied to recognising binary font images of size 8 by 8 pixels. The sixty four pixels were input to the network and an error signal was generated by subtract- ing the network output from the value of the font member in its collating sequence!. The adaption coefficient was approximately set by using an estimate of the average power of the input image. Figure 8.112: Examples of the font. lThe term 'collating sequence' means the index to the character in an ordered font 108 I I , J 8.4 Invariant Recognition A preprocessing section was used to produce patterns which were invariant to shift, scale or rotation. The recognition process was repeated using real camera images of simple handwritten capital letters. 8.4.1 Preprocessor One of a series of capital letters was drawn with a marker pen onto a card. The image of the card was digitised by the use of a camera and framestore. The first stage performed by the preprocessor was an adaptive level binarisation; this was followed by a calculation of the centroid of the resulting image. The next stage was to use a simple edge finding algorithm to reduce the image to an outline image of the character. A line following algorithm was used to find sixty-four points in order around the outline. 
This algorithm was started from the leftmost point on the letter, and the process was repeated three times to take account of inner loops in the letters, for example in the letter B.

Figure 8.113: Example characters.

The Euclidean distance from each point to the centroid was calculated. The referencing of the distance data to the centroid of the letter confers the property of positional invariance on the data. The power of the data is then normalised to one. This confers the property of scale invariance on the data. A peak finding algorithm was used on the data and the data was shifted in a circular manner to put the peak value as the first data point. This operation gave the data the property of rotational invariance. Consequently the centroidal data patterns output by the preprocessor should be shift, scale, and rotation independent.

Figure 8.114: Preprocessor Stages.

Although in the majority of cases the preprocessor produces similar output patterns for an individual letter, there are some letters which under certain circumstances will produce significantly different patterns. An example is the letter B. The first pass of the outline follower follows the outside of the letter; however, the second pass could follow the top loop or the bottom loop, and correspondingly generate significantly different patterns depending solely on which loop has the leftmost point. Consequently a connectionist model must be used which is powerful enough to cope with two significantly different clusters mapping to the same output index.

Figure 8.115: Centroidal Data form.

8.4.2 Results

Direct recognition

The network was found to recognise correctly all the characters in an eighty character font in around 7000 pattern presentations.

Figure 8.116: Plot of number of errors in the last fifty presentations against number of presentations. (80-member font recognition, 2nd order expansion, 64 pixels.)

Figure 8.117: Plot of number of errors in the last fifty presentations against number of presentations, with added noise.

Preprocessed Recognition

The system was trained on twelve letters, A to L, initially using one card for each letter. Once the network was found to recognise each card correctly, other cards with the same letters were gradually introduced into the training runs. A new card was only introduced when the network had converged to a state giving no errors on the current cards.

Letter   Number of Cards Learnt
A        4
B        4
C        4
D        4
E        4
F        4
G        4
H        4
I        4
K        4
L        4

Figure 8.119: Recognition Results.

8.5 Conclusion

In the direct recognition case the network performed well, learning eighty separate class members: this is a higher number than usually reported for current networks. The convergence times were fast by comparison with current hidden-layer networks, which have been reported to require many tens of thousands of presentations to learn fewer characters. In the preprocessed case the network demonstrated its ability to cope with variation in the patterns, and consequently to be applicable in practical recognition systems which require not only position invariance but tolerance to character variation.
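As an illustration of the invariance properties described in Section 8.4.1, the following is a minimal sketch of the centroidal preprocessing chain, starting from an already extracted, ordered set of outline points; the binarisation, edge finding and line following stages are assumed to have been performed, and the test shape and values are purely illustrative.

```python
import numpy as np

# Minimal sketch of the centroidal preprocessing of Section 8.4.1, starting
# from an ordered list of outline points. The earlier image-processing stages
# are assumed done; the test shape below is an illustrative assumption.
def centroidal_pattern(outline_pts):
    pts = np.asarray(outline_pts, dtype=float)    # shape (64, 2), ordered around the outline
    centroid = pts.mean(axis=0)                   # reference to the centroid: shift invariance
    r = np.linalg.norm(pts - centroid, axis=1)    # distance of each outline point to the centroid
    r = r / np.sqrt(np.mean(r ** 2))              # normalise the power to one: scale invariance
    peak = int(np.argmax(r))                      # circular shift so the peak value comes first:
    return np.roll(r, -peak)                      # rotational invariance

# The same outline at two different positions and scales gives the same pattern.
theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
outline = np.c_[2.0 * np.cos(theta), np.sin(theta)]        # an ellipse traced by 64 points
p1 = centroidal_pattern(outline * 10.0 + [3.0, 7.0])
p2 = centroidal_pattern(outline * 25.0 + [40.0, -2.0])
print(np.allclose(p1, p2))                                  # -> True
```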
Figure 8.14: Example of Mug Data.

8.6 Audio Recognition

8.6.1 The Data

The following audio signals were generated by various acoustical means, for example by hitting a cup with a pencil and so on. A series of recordings were made of signals generated by the same means. No care was taken to produce identical sounds: for example, the cup was not always hit in the same place. The process was repeated for a series of different sound generating processes. These included a mug being hit, a click, a clap and so on. The plots also show the variation within one of the class sets. The similarity of the frequency spectra of different classes may also be seen.

8.6.2 Preprocessing and AR Model

As it is clear that the raw time series data is unlikely to produce clustering patterns for each class, a preprocessor was used. The audio signals were preprocessed by finding the largest sample over a range of forty samples. The position of this sample was used as the starting position for ten samples to be taken. These ten samples had their power normalised, and this pattern served as the network input. Ten-point sections of data were taken from between the tenth and fortieth samples, starting at the maximum sample in this range. This avoided scaling problems and reduced time origin variation effects.

Figure 8.15: Example of Click Data.
Figure 8.16: Example of Clap Data.
Figure 8.17: Example of Box Data.
Figure 8.18: Example of Mug 0 Data.
Figure 8.19: Example of Mug 4 Data.
Figure 8.20: FFT of Mug 4 Data.
Figure 8.21: FFT of Click 0 Data.
Figure 8.22: Processed Mug data (10 examples).
Figure 8.23: Processed Box data (10 examples).
Figure 8.24: Processed Click data (10 examples).
Figure 8.25: Processed Clap data (10 examples).

8.6.3 Results

The network was trained using 8 of the ten realisations of the data for all the classes; after it had learnt to classify these correctly, the 2 remaining examples were presented.

Sound          Desired Output   Actual Output   Error
Mug 0          1                1.01            .01
Mug 1          1                1.02            .02
Mug 2          1                .99             .01
Mug 3          1                .97             .03
Mug 4          1                1.02            .02
Mug 5          1                .99             .01
Mug 6          1                1.02            .02
Mug 7          1                1.00            .00
Mug 8          1                1.01            .01
Mug New 1      1                .6              .4
Mug New 2      1                1.3             .3
Click 0        3                3.01            .01
Click 1        3                3.00            .00
Click 2        3                3.00            .00
Click 3        3                2.98            .02
Click 4        3                2.99            .01
Click 5        3                3.01            .01
Click 6        3                2.84            .16
Click 7        3                2.96            .04
Click 8        3                3.02            .02
Click New 1    3                3.30            .30
Click New 2    3                3.12            .12
Clap 0         4                3.98            .02
Clap 1         4                3.99            .01
Clap 2         4                3.97            .03
Clap 3         4                4.02            .02
Clap 4         4                4.01            .01
Clap 5         4                3.99            .01
Clap 6         4                4.03            .03
Clap 7         4                4.17            .17
Clap 8         4                4.01            .01
Clap New 1     4                4.41            .41
Clap New 2     4                3.65            .35
Box 0          5                5.21            .21
Box 1          5                4.99            .01
Box 2          5                4.99            .01
Box 3          5                5.02            .02
Box 4          5                5.00            .00
Box 5          5                4.98            .02
Box 6          5                5.01            .01
Box 7          5                4.97            .03
Box 8          5                5.09            .09
Box New 1      5                5.39            .39
Box New 2      5                5.41            .41

8.6.4 Conclusion

The network performed well in the audio transient recognition task and required a comparatively light computational load to do so. The audio preprocessor could obviously be extended to form higher order correlations for systems with higher order statistics. The generalisation of the recogniser allowed it to recognise the previously unseen non-training patterns, although the error for such new patterns was considerably higher than for the training data.

Chapter 9
Output Mapping Problem and Principal Non-linearities

9.1 Output Mapping

Consider a single input system with a series of output classes as shown below in Figure 9.1. This would require a high order non-linearity, and consequently complex boundaries to the decision regions, and so a larger network. If the output indices had been suitably chosen the whole problem could have been reduced to a linear one, as in Figure 9.2. Although the above is a simple case, the problem becomes difficult to solve in multiple dimensions with a degenerate system. This question of output values has not been fully appreciated with current networks, and a method for determining suitable indices would lead to far smaller networks. The approach currently being investigated involves a principal component analysis of the pattern difference vectors and has provided the ability to use smaller networks. It was hoped to develop an adaptive approach to network contraction.

Figure 9.1: One Dimensional Problem. (Classification against input variable x.)

Figure 9.2: Linearised Problem. (Classification against input variable x.)

9.2 Vector Space Approach

The problem of deciding on an optimal set of output indices may be approached by a vector space method. The problem is to find a resolving vector, that is, the vector onto which the extended space class vector is projected, which maximises the difference in output value between the two classes with minimum separation. Therefore it is necessary to find a projection vector $\mathbf{w}$ which maximises the difference of $\mathbf{x}_i \cdot \mathbf{w}$ for the two input patterns $\mathbf{x}_l$ and $\mathbf{x}_k$ which give the closest values of $\mathbf{x}_i \cdot \mathbf{w}$. Thus it is necessary to maximise (over all i, j such that i is not equal to j) the minimum of:

$$|\mathbf{e}_{ij} \cdot \mathbf{w}|$$

where $\mathbf{e}_{ij}$ is the difference vector between the ith and jth input pattern vectors. This may be addressed by a Lagrange multiplier approach. However, this led to a fruitless conclusion, as the analysis gave an unusable result in terms of active and inactive constraints.

9.3 The Bayesian Approach

An attempt was made to apply Bayesian methods to the output mapping problem.

9.3.1 Bayes' Theorem

$$P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}$$

where $P(M \mid D)$ is the probability of the model $M$ given the data $D$, and $P(D)$ is the probability of $D$. Bayes' theorem may be applied by using it to specify parameters in a model which is being tested against some data.
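As a minimal numerical illustration of this use of Bayes' theorem, the posterior over a single model parameter may be evaluated on a grid, with the evidence obtained by summation; the data values, prior and noise variance below are assumptions made purely for illustration and are not taken from the thesis.

```python
import numpy as np

# Minimal numerical illustration of Bayes' theorem for a single model parameter
# m: P(m | D) = P(D | m) P(m) / P(D), with P(D) obtained by summing over a grid
# of candidate parameter values. Data, prior and noise level are illustrative.
data = np.array([0.9, 1.2, 1.1, 0.8])
grid = np.linspace(-2.0, 3.0, 501)                 # candidate values of m
prior = np.ones_like(grid) / grid.size             # flat prior P(m)

# Gaussian likelihood P(D | m), assuming zero-mean errors with unit variance.
like = np.array([np.exp(-0.5 * np.sum((data - m) ** 2)) for m in grid])

posterior = like * prior
posterior /= posterior.sum()                        # normalise by the evidence P(D)
print(grid[np.argmax(posterior)])                   # posterior mode, close to the data mean
```

Even in this one-parameter case the evidence is a sum over the whole grid; the difficulty noted in the next section is that the corresponding integrals over many nuisance parameters grow rapidly in cost.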
9.3.2 Application to Connectionist Models

Bayes' theorem may be applied to the connectionist model problem by using the following model. The operation of the network is separated into two stages. The first stage is the familiar vector operation. For the sake of simplicity the first stage will be limited to a linear operator (FIR filter). This first stage is followed by a scalar non-linearity. This non-linearity may be formulated in many ways but the simplest is a truncated polynomial. Bayes' theorem is applied to estimate the coefficients of the power terms in this scalar non-linearity by integrating out the linear vector first stage as nuisance parameters. Thus Bayes' theorem allows the separation of the decision boundary generation and the output mapping translation. This should lead to much smaller networks, as the network's first stage will only have to converge to whichever solution separates the classes. The output mapping, being a scalar non-linearity rather than a vector one, is correspondingly far more efficient.

Consider a series of pattern vectors x_i with pattern classification output value c_i, where i is the pattern index. The proposed system modelling the relationship between the patterns and the output values is assumed to be of the form shown.

Figure: the proposed model - input data passes through a linear 'vector' stage followed by a scalar non-linearity a + bx + cx^2 + dx^3 + ...

The error is defined as usual as

ε_i = c_i − F[ w^t x_i ]

where F[ ] is

F[u] = Σ_{n=1,P} i_n u^n

i_n being the non-linearity coefficients to be determined. From this a likelihood function may be constructed for the error values observed. Assuming ε to be distributed as a Gaussian variable with zero mean and variance σ^2,

P(D | M) = ( 1 / (√(2π) σ) )^N exp( − (1 / 2σ^2) Σ_{i=1,N} ε_i^2 )

M referring to the model and D to the data. Bayes' theorem may now be applied:

P(M | D) = P(D | M) P(M) / P(D)

giving

P(F[ ], w | ε) ∝ ( 1 / (√(2π) σ) )^N exp( − (1 / 2σ^2) Σ_{i=1,N} ε_i^2 ) P(F[ ], w)

The coefficients w of the linear combiner may now be integrated out:

P(F[ ] | ε) ∝ ∫ P(F[ ], w | ε) dw

Assume P(F[ ], w) = A, a constant. The analytical solution of this integral is difficult, so an attempt was made to perform numerical integration.

9.3.3 Conclusion

Although the Bayesian approach is an interesting attempt to investigate the output mapping problem, the impracticality of calculating the integrals by numerical means for systems with more than a few coefficients is at present sufficient to prevent the use of the method. If an analytical solution or a good approximation to the solution could be found this approach could yet prove fruitful.

9.4 Optimisation of Output Map Order

A constrained optimisation approach was developed to allow the optimisation of the output index labels.

9.4.1 Derivation

The following approach to suitable selection of these indices has been shown to give good results. It uses a constraint to ensure that the output indices derived are separated from one another and consequently takes the form of a constrained optimisation.

Suppose that there are M classes and in the training phase there are N_i (i = 1, M) presentations for each class. Let d_i be the classification index for each class. Let x_n(i) be the extended space pattern vector for the ith class and the nth example of it. Let w be the network weight vector. The error over the training set is

E = Σ_{i=1}^{M} Σ_{n=1}^{N_i} [ d_i − w^t x_n(i) ]^2     ... 1

The d_i must be constrained. A suitable constraint is found to be

Σ_{i=1}^{M} d_i^2 = c     ... 2

where c is an arbitrary constant. The method of Lagrange multipliers may now be applied. Set up the objective function

J = Σ_{i=1}^{M} Σ_{n=1}^{N_i} [ d_i − w^t x_n(i) ]^2 + λ ( Σ_{k=1}^{M} d_k^2 − c ),   k = 1, M.

Define

p(i) = Σ_{n=1}^{N_i} x_n(i)

and

R_xx = Σ_{i=1}^{M} Σ_{n=1}^{N_i} x_n(i) x_n^t(i).

Then, setting the derivative of J with respect to w to zero,

Σ_{i=1}^{M} d_i p(i) − R_xx w = 0     ... 3
Setting the derivative of J with respect to d_k to zero gives

(λ + N_k) d_k − Σ_{n=1}^{N_k} w^t x_n(k) = 0

that is

(λ + N_k) d_k − w^t Σ_{n=1}^{N_k} x_n(k) = 0     ... 4

or, using the definition of p(k),

(λ + N_k) d_k − w^t p(k) = 0,   k = 1, M.     ... 5

Multiply 5 by d_k and sum over k (changing the index to i):

Σ_{i=1}^{M} (λ + N_i) d_i^2 − Σ_{i=1}^{M} w^t p(i) d_i = 0

For convenience, assume N_i = N for i = 1, M:

(λ + N) Σ_{i=1}^{M} d_i^2 − Σ_{i=1}^{M} w^t p(i) d_i = 0

But Σ d_i^2 = c, so

(λ + N) c = Σ_{i=1}^{M} w^t p(i) d_i.

Substituting 5 into 3, so that d_i = w^t p(i) / (λ + N), gives

(1 / (λ + N)) Σ_{i=1}^{M} ( w^t p(i) ) p(i) − R_xx w = 0

that is

(1 / (λ + N)) P_xx w = R_xx w

where

P_xx = Σ_{i=1}^{M} p(i) p^t(i).

Assuming R_xx non-singular, then

R_xx^{-1} P_xx w = (λ + N) w.

That is, w is an eigenvector of R_xx^{-1} P_xx:

w = a e     ... 6

where e is an eigenvector. Multiply 3 by w^t:

w^t Σ_{i=1}^{M} d_i p(i) − w^t R_xx w = 0     ... 7

Writing η for the eigenvalue corresponding to e, that is

λ + N = η     ... 8

Now, to decide which eigenvalue to choose, consider the error. Expanding 1, with N_i = N and using the constraint 2,

E = Σ_{i=1}^{M} Σ_{n=1}^{N} [ d_i^2 − 2 d_i w^t x_n(i) + ( w^t x_n(i) )^2 ]
  = N c − 2 Σ_{i=1}^{M} d_i w^t p(i) + w^t R_xx w

Substituting 7,

E = N c − Σ_{i=1}^{M} d_i w^t p(i)     ... 9

From 5 and 8, w^t p(k) = η d_k, and with 7 this gives w^t R_xx w = η Σ d_i^2 = η c; hence w^t p(k) = (d_k / c) w^t R_xx w. Substituting in 9,

E = N c − Σ_{i=1}^{M} (d_i^2 / c) w^t R_xx w = N c − w^t R_xx w

and since w^t R_xx w = η c,

E = c ( N − η )

where η is the eigenvalue corresponding to e. Hence the error is a minimum when the largest eigenvalue is selected. Substitution of 8 into 5 then shows

d_k = w^t p(k) / η.

By use of this expression a set of 'optimal' output indices may be found. These indices are in general not integers, and in fact a convenient approach to practical implementation has been found: the original integer indices are simply reordered for the classes in the same order as the optimal ones. This is found to work well, as may be expected if the connectionist model is viewed as a multidimensional interpolator.

9.4.2 Results

The output mapping scheme was first applied to a two input variable problem with class clusters as shown below in figure 9.3. These clusters and assignments gave the following outputs for each class. The output mapping system was then run to give the optimal indices, and the original indices were reordered. The plots show the desired and actual outputs against pattern number; the solid line is the desired output and the dotted line the actual output. It can be seen that the output error has been greatly reduced by optimal ordering of the output indices. The output mapping approach was next applied to the problem of optical character recognition, using an 80 member font set with 8 by 8 pixels for each character. The desired output was the position of the character in the collating sequence 0 to 79.

Figure 9.3: Input pattern clusters (two input variables; clusters assigned indices -2, -1, 0, 1 and 2).
Figure 9.4: Plot of output against pattern number with the original (assigned) class indices. Solid line is desired output and dotted is actual.

Figure 9.5: Plot of output against pattern number with the re-ordered class indices. Solid line is desired output and dotted is actual.

Figure: Results for 80 member font OCR with first and second order terms (2113 taps), without optimisation of indices, over 6000 presentations.

With the arbitrary indices the linear system is not capable of solving the problem. However, if the indices are optimised a linear system can now solve the problem.

Fig.: OCR results using linear first order terms (65 taps) without optimisation of output indices, over 6000 presentations.

Fig.: Results for 80 member font OCR with first order terms (66 taps) but optimised indices, over 6000 presentations.

9.4.3 Conclusion

The Volterra connectionist model has already been shown to have significant advantages over other systems in terms of analytical basis, speed of learning, scaling properties and generalisation. The output mapping method demonstrates an approach which makes the Volterra connectionist model highly efficient computationally by comparison with current neural networks. Although in the above results the system was applied to yield linear recognisers, exactly the same approach may be used to reduce a higher order non-linear recogniser to a lower order one. The network is no longer being constrained to fit arbitrary indices and so may utilise all of its degrees of freedom to improve recognition performance.

9.4.4 Analogy with Generalised Karhunen-Loeve Transform

The eigenvectors created by application of the output mapping may themselves be used for the creation of a signal space. Consider the K.L. transform. This transform creates a signal space in which the signal energies are optimally concentrated in as few of the space defining vectors as possible. This is of course of interest in communications, as it allows a given number of the new space vectors to carry as much of the signal energy as possible. However, it is not necessarily a good thing for pattern recognition, in which we are more interested in the class differences than in the classes themselves. The space created by the output mapping eigenvectors also has some interesting properties which must be investigated in future work. It appears to offer a space in which the class separation for a limited selection of the eigenvectors is maximised. So two classes which cannot be separated by projection on to the first eigenvectors may be separated by using a multidimensional output space composed from subsequent eigenvectors. More work is to be done on this subject in the future.
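The block procedure of section 9.4.1 is simple to state computationally. The following is a minimal sketch in Python; the function and variable names are illustrative only, and it assumes the class example vectors have already been non-linearly extended:

    import numpy as np

    def optimal_output_indices(class_examples):
        # class_examples: one array per class, shape (N_i, n), whose rows are
        # extended-space pattern vectors x_n(i).
        p = [X.sum(axis=0) for X in class_examples]        # class sums p(i)
        Rxx = sum(X.T @ X for X in class_examples)         # sum of x x^t over all examples
        Pxx = sum(np.outer(pi, pi) for pi in p)            # sum of p(i) p(i)^t

        # w is the eigenvector of Rxx^{-1} Pxx with the largest eigenvalue eta
        evals, evecs = np.linalg.eig(np.linalg.solve(Rxx, Pxx))
        k = int(np.argmax(evals.real))
        eta, w = evals.real[k], evecs[:, k].real

        d = np.array([w @ pi for pi in p]) / eta           # d_k = w^t p(k) / eta
        return d, w

    # In practice the original integer labels are simply re-ordered so that their
    # ranking matches that of the optimal indices d, for example:
    #   order = np.argsort(d)
    #   new_labels = np.empty(len(d)); new_labels[order] = np.sort(original_labels)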
9.5 New Network Using Adaptive Output Mapping

Above, an output mapping scheme was developed which, in the pattern recognition problem, minimised the order of non-linearity required to relate the input data and output indices, and thus enabled much smaller networks than usual to be applied to pattern recognition problems. The method was based on a constrained optimisation. The error defined over M classes in the training phase was

E = Σ_{i=1}^{M} Σ_{n=1}^{N_i} [ d_i − w^t x_n(i) ]^2

where N_i is the number of presentations for each class and x_n(i) is the extended space pattern vector for the ith class and nth example of it. The d_j are constrained:

Σ_{j=1}^{M} d_j^2 = c

where c is an arbitrary constant. In the original derivation the method of Lagrange multipliers is then applied to give, by the solution of a generalised eigenvalue equation and the selection of the largest eigenvalue, a set of optimal output indices. Such a block based method is not directly applicable to an adaptive approach as it requires the indices to be calculated a priori. Consequently an effort was made to derive an adaptive approach to optimisation of the output mapping.

An adaptive approach may be derived as follows. If the original error equation is considered it is clear that the mean square error is a quadratic, or unimodal, function of the index d_j. This makes it suitable for a gradient search approach. This gradient based optimisation must, however, take account of the constraint equation to restrict the locus of search points to the values dictated by the constraint equation. This constraint is most easily handled by performing the index update in spherical polar coordinates; the weight update may still be performed in cartesian coordinates. In SP coordinates the constraint equation becomes a constant radius constraint, with the various angular coordinates being updated by gradient search. The actual radius is an irrelevant gain factor which simply scales all the indices. Rewriting the d_j as cos θ_j in SP coordinates,

Σ_{j=1}^{M} d_j^2 = c

becomes

Σ_{j=1}^{M} cos^2 θ_j = c

As expected the constraint equation is a hypersphere in SP coordinates. The error is defined as before, or in SP coordinates for the output d_j space, for the presented class i,

ε = cos θ_i − w^t x

To optimise the outputs a gradient search is performed to minimise ε^2 with respect to the indices, but steps are only allowed around the hypersphere because of the constraint equation; that is, the component of the steps in the radial direction is zero. The constrained gradient search then becomes

θ_j,new = θ_j,old − μ_om ∂E(ε^2)/∂θ_j

where μ_om is a step size constant parameter. Now

∂E(ε^2)/∂θ_j = 2 ε ∂ε/∂θ_j = −2 ε sin θ_j

therefore the algorithm is

θ_i,new = θ_i,old + 2 μ_om ε sin θ_i,old,   with ε = cos θ_i − w^t x.

The classification attached to each class may be discovered by running a known training pattern through the system as the adaption proceeds. It is hoped to continue work on this approach in the future.

9.6 Principal Non-linearities

As is appreciated in the GMDH [36] approaches, not all of the polynomial terms may actually play an important role in network performance for a given problem. Some terms may be redundant, and it is wasteful to have to calculate these terms at every presentation. Methods which use a priori information to remove redundant terms cannot, however, perform well in general for on-the-job applications, as the redundancy in the terms may change. The remaining terms which are required for good network performance may be called the principal non-linearities.
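The scale of the potential saving is easily quantified: a complete polynomial expansion of N inputs up to order K contains C(N+K, K) terms (constant term included), which grows very rapidly. A minimal sketch in Python, with illustrative figures:

    from math import comb

    def n_expansion_terms(n_inputs, order):
        # Number of terms in a complete polynomial (Kolmogorov-Gabor / Volterra)
        # expansion of n_inputs variables up to the given order, constant included.
        return comb(n_inputs + order, order)

    # For example, an 8 by 8 pixel input (64 variables):
    #   n_expansion_terms(64, 1) ->    65 terms (linear)
    #   n_expansion_terms(64, 2) ->  2145 terms (second order)
    #   n_expansion_terms(64, 3) -> 47905 terms (third order)
    # so discarding terms that carry no useful information matters a great deal.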
9.7 Adaptive Identification of Principal Non-linearities

It is clear that in order to realise a totally general transfer function a very large number of terms from the Volterra series may be needed. Past experience has shown that many real problems may, however, be solved with limited order networks of order one, two or three. This is especially true in the field of pattern recognition, where the above output optimisation has been applied to reduce the order of the required transfer function. It has also become apparent that many problems may be completely independent of some of the terms in the general polynomial. This independence or irrelevance of certain terms leads to larger, less efficient networks, but it is very difficult to predict in advance which terms will be irrelevant and remove them. It has also been demonstrated that networks composed of random terms will in general operate well (shown later in this thesis). This frees the network from the constraint of having complete Volterra expansions. An alteration to the adaption algorithm is derived here which allows the network to organise itself to eliminate some of these unhelpful terms and so decrease network size. Alternatively, the computational load decrease gained by the network when a term is lost may allow another term to be added. Such a network, which removes terms to optimise its structure, may be referred to as a self collapsing network (SCN).

If pattern class densities in the multidimensional space are considered, an interesting argument may be proposed which suggests that, for well preprocessed problems, only limited non-linearities should be required even in complex pattern recognition tasks. Consider an image, which for human purposes may be considered as 256 by 256 pixels. This means the signal space has 256 x 256 = 65536 dimensions, and it is likely that in such a large space a low order network could separate many classes if they were correctly output mapped.

9.7.1 Degeneracy

The problem in the SCN is to detect redundant terms and neutralise them. Consider a degenerate problem in which the extended space correlation matrix R_xx is singular. For such cases there is in fact a set (quite often an infinite number!) of possible solutions. The standard block approach to such a problem is to employ Singular Value Decomposition to yield the unique minimum norm solution. SVD is however not suitable for an adaptive approach and incurs a relatively heavy computational overhead. In a minimum norm solution the weights associated with any redundant terms will be set to zero.

9.7.2 Adaptive Minimum Norm Solution

Attempts were made to develop such an algorithm; however it became apparent that many methods failed with such rank deficient problems, and the only block approach found was to use Singular Value Decomposition. Although there are recursive implementations of SVD these are computationally heavy and would be impractical for many connectionist model applications. In order to obtain a minimum norm solution the following adaptive approach was derived with a new error measure:

η_k^2 = ( d_k − y_k )^2 + α w^t w

in which α is a constant setting the relative importance of term elimination relative to reduction of mean square error. If a term is truly redundant the value of its weight would otherwise be arbitrary; the above addition to the error adds a penalty for setting such a weight to a non-zero value.
The squared error may be differentiated with respect to the weights to lead to a stochastic gradient update equation

w_{k+1} = w_k − μ_scn ∂η_k^2/∂w_k

where μ_scn is a step size constant. The derivative may be found as follows, with ε_k = d_k − w^t x_k as before. The ε_k^2 derivative can (as in the LMS case) be shown to give −2 ε_k x_k. The α w^t w term gives 2 α w_k. Hence the update becomes

w_{k+1} = w_k + 2 μ_scn ( ε_k x_k − α w_k )

It can be seen that the norm minimiser adds only one multiplication and one subtraction per weight update. Now that the minimum norm solution is being preferred by the adaption algorithm, terms may periodically be eliminated. To give a constant size network, terms may also be randomly added each time. Through this process the network may perform self structuring: the irrelevant polynomial terms are selected and eliminated. Similarly, the total number of terms may be set as a function of the classification error rate. It should, however, be noted that the optimal solution of the weights (in the mean square error sense) is perturbed by the extra term, and consequently α is normally kept small to minimise this perturbation. The effect of α being small is for the network initially to adapt its weights irrespective of the norm constraint; only when the mean square error is minimised will the α term begin to have more effect. The setting of α is determined by the relative importance of network collapsing and minimising output error.

Figure 9.6: Type I Neuron.
Figure 9.7: Type II Neuron.

9.7.3 Perceptron Analogy

As will be discussed in chapter 10, the Volterra network can be composed from randomly connected type I and type II perceptrons (Figures 9.6 and 9.7). The second type of neuron is the output neuron. This is the same as the linear perceptron with weighted inputs, and one is needed for each multiclass output. In perceptron terms the performance of each perceptron is being investigated, and if it is found to be of little use to the system its input connections are destroyed and new ones 'grown' to random connection points. Terms may also be added to keep an optimised constant size structure. This has been done by random selection, although a host of other rationales could be investigated using higher order correlations. It should be noted that the weight energy penalty term tends to slightly distort the normal weight solution and thus leads to a misadjustment giving a slightly higher mean square error. The effect can be avoided by only including the penalty term during periods of training or retraining rather than during the recognition phase.

9.7.4 Adaptive Self Collapsing Network Examples

As a simple example, a two pattern problem is addressed. A second order extension is used, with weights w1, w2 and w3 attached to the terms x1, x2 and x1 x2 respectively. The two patterns are alternately presented and the smallest coefficient after 50 presentations is shown:

Pattern 1   Pattern 2   Redundancy   Eliminated Weight
(2,-1)      (-1,2)      x1 x2        w3
(2,-1)      (-2,-1)     x2           w2
(-1,-2)     (-1,2)      x1           w1

The results demonstrate that the SCN correctly eliminates unnecessary terms and so can solve the problem with a smaller network. Unlike SVD, the effect of noise on the apparent degeneracy of a degenerate problem is not of importance, as the SCN method does not require exact degeneracy to work but will tend towards the minimum norm solution for nearly degenerate problems.

9.8 Conclusion

The output mapping approach leads to much smaller networks and consequently significant decreases in computational load with no degradation in performance.
A full understanding of the implications of the new output mapping approach will require more work in order to investigate its behaviour.

The new network has the ability, due to its derivation, to model general non-linearities and maintains high speed convergence without local minima problems, due to its unimodality. It will also function well for multiclass problems and will work with computational loads far lower than other networks by minimising the non-linearity of the problem and then concentrating on the principal non-linearities of the problem. The self collapsing adaptive mapping final network is, for input vector h:

x = F(h)

where F() is a random selection of polynomial terms, with output

y_k = w^t x

Weight update:

w_{k+1} = w_k + 2 μ_scn ( ε_k x_k − α w_k )

Index update:

θ_i,new = θ_i,old + 2 μ_om ε sin θ_i,old,   ε = cos θ_i − w^t x

After a preset number of presentations, terms with a weight value of modulus less than a threshold are eliminated from F() and a new random polynomial term added.

Chapter 10

Implementation

10.1 Introduction

In order to facilitate the understanding of the new network it has so far been discussed as two sections: firstly the vector state expansion and secondly the adaptive filter section. The same network may be implemented in a more distributed form, composed of a series of identical 'neurons', by factorising the expansion polynomial. Once this has been done it is possible to find factorisations composed of identical units. Consequently a network may be created using a network of simple terms. Another approach which is currently under investigation is to define two types of 'neuron': the first is composed of a simple multiplier and the second is a standard linear perceptron. An interesting form of the second implementation is that in which the network elements are randomly connected. This generates a connectionist model containing random terms from the non-linear state expansion. However, provided that the network is large enough, the inherent degeneracy allows the network to find possible solutions without the missing terms.

10.2 Exact Implementation

The first computational reduction in implementing the network may be found by realising that terms such as x1 x2 and x2 x1 are numerically the same and thus only one of them is needed. Computational load may also be saved by storing the result of each term's calculation before it is multiplied by its coefficient, so that the term is not recalculated for the update equation. The network may be directly implemented with each term being calculated from scratch. This is inefficient but easy to code. A more efficient approach is to use previous lower order terms to calculate higher order ones. The full application of this idea is the nested approach: the vector space expansion equation may be factorised to give more efficient implementations of the network by utilisation of nesting. Consider the simple expansion

y = w_0 + w_1 x + w_2 x^2 + w_3 x^3

This may be nested in the form

y = w_0 + x ( w_1 + x ( w_2 + x w_3 ) )

That is, a simple recursive application of

out = const1 + in * const2

(a short sketch of this nesting is given at the end of this section). Other implementation approaches are to separate the expansion and summation operations, to allow multiple problem solving and to allow the random internal connections of self structured networks. The connections of such random or self structuring networks can be easily implemented by a two input quasi-multiplicative element (or 'neuron'!) whose inputs can be arbitrarily connected.
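The nested implementation referred to above can be written in a few lines. The following is a minimal sketch in Python; the names and the choice of a two input, second order expansion are illustrative only:

    def horner(coeffs, x):
        # Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 + ... by repeated
        # application of out = const1 + in * const2.
        out = 0.0
        for c in reversed(coeffs):
            out = c + x * out
        return out

    def nested_two_input_second_order(w, x1, x2):
        # Second order, two input expansion evaluated in nested form, so that
        # no power is ever formed explicitly.  w[i] holds the coefficients of
        # x1^i viewed as a polynomial in x2 (an illustrative layout).
        p0 = horner(w[0], x2)      # w00 + w01*x2 + w02*x2^2
        p1 = horner(w[1], x2)      # w10 + w11*x2
        p2 = horner(w[2], x2)      # w20
        return horner([p0, p1, p2], x1)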
Such random or self structuring connections can easily and efficiently be implemented for simulation on a digital computer by storing the input positions for all the elements on a stack and using indirect addressing to fetch the inputs.

Figure: Nested implementation of a third order, two input expansion.

10.3 Random Interconnection

It is of interest to consider the effects of incorrect or even random connection between the units of the new network. This is of interest because many possible future implementations of large networks may not be able to guarantee 100 percent fabrication accuracy. Such future implementations, using light or analog VLSI, may use tens of thousands of perceptrons and hundreds of thousands of interconnections, and to produce such systems in a totally accurate manner may be very difficult. It is also considered unlikely that biological information processing systems can grow and maintain themselves exactly to a predetermined 'wiring' diagram. Thus it is of interest to consider the effects of faulty or missing connections. In the extreme case it is interesting to consider the performance of large versions of the network composed of perceptrons connected in a random fashion determined by some probability generating function. This would be of interest as it would allow self-organising networks. One scenario has each perceptron growing out a random connection to another perceptron, and there are variations on this theme.

10.3.1 The Two Neuron Random Model

The network may be decomposed into two types of neuron. The first is a neuron which performs a vector space expansion. This may be as simple as a multiplication of two inputs, or a neuron taking in multiple inputs to give the product of them all. The second type of neuron is the output neuron. This is the same as the linear perceptron with weighted inputs, and one is needed for each multiclass output. The network is made up by the random connection of layers of the type I neurons. Random connections are made from the inputs of the type II neurons into the type I layers.

10.3.2 Results for Random Connection

The random network was now applied to the optical character recognition problem, and the number of errors in the last fifty presented characters was plotted against the presentation number.

Figure 10.1: Type I Neuron.
Figure 10.2: Type II Neuron.
Figure 10.3: Random Network.
Fig.: Example Random Network.
Fig.: Results for Random Network.

10.3.3 Conclusion

The random network works well and will converge to give a viable operation; as may be expected, the computational load is heavier than for the non-random network.

10.4 Limited Dynamic Range and Imperfect Multiplication

In the previous section the effect of imperfect or random interperceptron connections was considered. In this section the effect of imperfections in the perceptrons themselves is considered. The imperfections considered are those expected to arise in an analog implementation, such as optical or analog electronic hardware. Practical analog systems can only work over a limited dynamic range: if the signal is too small it becomes lost in noise, and if too large it will exceed the range of the system (for example, hard limiting in an amplifier). In order to avoid this problem the network should be able to function with all internal calculations giving signals within some modest dynamic range.
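One way to model the kind of saturating element just described is sketched below (Python; the limit value and names are illustrative only):

    import numpy as np

    def limited_range_multiply(a, b, limit=100.0):
        # Toy model of an analog multiplication element whose output saturates
        # outside a modest dynamic range.  A small error term that is a fixed
        # function of the inputs can be added in the same way to model an
        # imperfect multiplier, as considered in the next subsection.
        return float(np.clip(a * b, -limit, limit))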
In implementing the new network it is clear that the effect of multiplications implemented by analog multipliers should be considered. The exact form of these imperfections will be specific to the exact implementation used and as such is beyond the scope of this discussion. However, the effect of general degradation may be considered. This degradation may be modelled as low level white noise added to the multiplier transfer function.

10.4.1 Results

A two dimensional problem was attempted using a network for which the expansion stage multiplications were limited in dynamic range by a hard limiter for product results over 100 in magnitude. The input values were normalised over the range 0-10. The extension was next calculated using a noise corrupted multiplier, the noise being uniformly distributed over the range 0 to 0.1 and fixed as a function of time. The network inputs were normalised to the range -0.5 to +0.5. The ideal multiplication c = a * b was replaced by

c = a * b + 0.1 * random( (a + 0.5) * 5, (b + 0.5) * 5 )

where random is a matrix of uniformly distributed numbers ranging from 0 to 1.

1. Fig. 10.4: Output against two inputs for perfect two input multiplier.
2. Fig. 10.5: Output against two inputs for static noise corrupted two input multiplier.
3. Fig. 10.6: Output against two inputs for limited dynamic range two input multiplier.
4. Fig. 10.7: Decision regions for limited dynamic range network case.
5. Fig. 10.8: Discriminator surface for limited dynamic range multiplier network.
6. Fig. 10.9: Decision regions for static noise corrupted multiplier network.
7. Fig. 10.10: Discriminator surface for static noise corrupted multiplier network.

10.5 Conclusion

The network still has the ability to function for limited dynamic range calculation and for imperfect calculation. Its performance is, however, in general degraded for limited range operation. The network will operate correctly with quite severe noise corruption of its multiplication operations. It may be possible in future work to analyse the effects of such degradations by expanding the transfer function as a power series and replacing each network multiply by a power series expansion of the imperfect multiply. By considering the imperfections in this manner, in the light of the network random connection results, it is possible to explain why the network can still operate with imperfect multiplies, as these imperfect multiplies can be written as the addition of a series of perfect multiplies. The effect of time varying multiplication noise errors has been left for further work.

Chapter 11

Conclusion and Future Research

11.1 Future Research

The quadratic error surface lends itself ideally to the use of a more efficient adaption algorithm, such as that of conjugate gradients [2].
A variation of this method is currently being developed, which is giving considerably faster convergence times than those discussed earlier without requiring the storage of very large matrices.

A form of the network based on the Wiener G-functionals is being considered, as these form an orthogonal set. This may lead to advantages in adaptive performance for some applications and form another basis from which to analyse the network behaviour [13]. Likewise other non-linear expansions, such as the method of gating functions, are being considered. This may produce compact networks for problems which contain many saturating elements [13].

The use of a least squares criterion for multiclass problems is not a particularly good one, as it can weight the error likelihood on a class as a function of its output value, so other error functions are being investigated.

Ideally, a pattern recognition system should construct an estimate of the underlying probability density functions which generate the input vectors given an element from a particular class. Although, due to the expectation operator in the mean square error equation, the probability of a pattern vector occurring is taken into account by the network, it may be possible for the network to make far more complete estimations of the underlying probability processes by using Bayesian methods.

Figure 11.1: Recursive Networks (the inputs feed a recognition network which produces a classification; the classification is fed back as a further input).

11.1.1 Time Series Network Incorporation

Many pattern recognition problems require a large amount of contextual information. Such a problem is found in connected speech recognition, in which the recognition of a phoneme may require knowledge of immediately previous phonemes. Crude attempts to achieve this have been made by feeding back a previous classification as an input to the network. The fundamental and complex problem of determining the stability behaviour of such non-linear recursive adaptive systems has, however, not been addressed, and consequently such recursive networks are not robust. The analogy of such systems with adaptive IIR filters demonstrates the likely complexity of such stability analyses even before non-linearities are taken into account.

11.1.2 Non-clustering Patterns and Preprocessing

At the moment there are two approaches to the determination of the preprocessor stage in pattern recognition. The first is to assume problem-specific models or problem properties which may be exploited to produce a network input data form that will lead to a sufficiently small number of single class sub-clusters. The second is an ad hoc trial and error approach. Both of these approaches are problem specific, and the first requires knowledge of a suitable system model.

Figure 11.2: Badly Preprocessed Problem.
Figure 11.3: Well Preprocessed Problem.
Figure 11.4: Successive clustering.

A more general approach would be to generate a transformation from the training data that will transform an unpreprocessed problem into a well processed one by combining sub-clusters belonging to the same classification.
The output mapping work may provide a method to give such a general preprocessor by utilising a space created from the larger eigenvalue eigenvectors of the output mappIng analysis. 11.1.3 Image Coding The image coding problem can be viewed as an attempt to find a low dimensionality signal set that represents a higher dimensionality signal set, the original image. By transmitting the lower dimensionality set, a compression in the amount of data needing to be transmit- ted over a communication channel may be achieved. The Volterra connectionist model when combined with an optimal output mapping strategy has some interesting proper- 164 ties for use in such a problem, and would allow such a system to incorporate subjective perceptual judgements into the compression process. 11.1.4 Other Data sets It is hoped to apply the network to other problems such as speech, fingerprint, underwater transient, and more general image recognition tasks. It is also hoped to apply the network to more examples of time series data, especially time series with non-linear statistics. 11.2 Conclusion The new approaches have been shown to have some advantages over current methods and allow greater understanding of network operation than some of the more empirical methods. Firstly, the network convergences to its optimal solution orders of magnitude faster than current networks, and this convergence time does not rise as quickly as a function of the size of the pattern vectors. Secondly, it is unimodal and conseq~ently does not suffer from the problems of local minima, such as needing to be reset, getting stuck or uncertainty about having converged. These problems can seriously affect the performance of current networks and prevent them being used in an adaptive mode while actually performing their tasks. Thirdly, the complete mathematical derivation of the new network allows a much fuller understanding of its operation and allows direct application of adaptive filter knowledge. This allows easy parameter tuning and application of other adaptive filter update methods. It also allows the effects of approximation and implementation to be considered and the advantages and understanding of aspects such as output index selection to be obtained. The adaptive network output index and network structure methods allow the application of the networks to problems which previously would have needed impractically large networks. The work done in this thesis has only begun to probe the possibilities which may be explored by a firm signal processing approach to connectionist modelling and it is hoped that others will continue on this path. 165 I ' ( Bibliography [1] S.Haykin, Adaptive Filter Theory Englewood Cliffs, NJ:Prentice-Hall, 1986. [2] E.Polak, Computational Methods tn Optimisation London, UK:Academic Press, 1971. [3] F.Yassa, Optimality in the Choice of the Convergence Factor for Gradient Based Adaptive Algorithms IEEE Trans. ASSP, vol. ASSP-35, No.l, Jan. 1987, pp.48-59. [4] R. Fletcher, Practical Methods of Opti'Tr!-isation New York, NY: John Wiley, 1987. [5] E.Eleftheriou & D.Falconer, Tracking Properties and Steady- state Performance of RLS Adaptive Algorithms IEEE Trans. ASSP, vol. 34 ASSP-34, No.5, Oct. 1986, pp.l097-1109. [6] D.Panda & A.Kak, Recursive Least Squares Smoothing of Noise in Images IEEE Trans. ASSP, vol. 25 ASSP-25, No.6, Dec. 1977, pp.520-524. [7] D. Tufts & R. Kumaresan, Data Adaptive Signal Estimation by Singular Value Decomposition of Data Matrix IEEE Proc. Vol. 70 , No.6, Jun. 1982, pp.684-685. 
[8] B.Widrow, "Adaptive Signal Processing", Prentice Hall, New York 1985. 166 [9] F.M.Reed and P.Feintuch, "Time Delay Estimation Using LMS Adaptive Filters.", IEEE trans, ASSP, Vol.29 No.3 June 1981. [10] Y.Bressler and S.Merhav, " Recursive Image Registration with Application to Motion Detection", IEEE Trans. ASSP vol.35 No.! Jan 1987. [11] S.Alliney and C.Morandi, "Digital Image Registration Using Projections", IEEE Trans. PAMI vol.8 No.2 March 1986. [12] H.Nagel and W.Enklemann," An Investigation of the Smooth- ness Constraint for the Estimation of Displacement Vector Fields from Image Sequences", IEEE Trans. PAMI vol.8 No.5 Sept.86. [13] Schetzen, The Volterra and Wiener Theories of Non-linear Sys- tems New York, NY:John Wiley, 1980. [14] G.Tesauro & R.Janssens, Scaling Relationships in Backpropa- gation Learning Technical Report:Center for Complex Systems Research Univ. of Illionis. 19-2-1988. [15] P. Alper, A Consideration of the Discrete Volterra Series IEEE Trans. Automatic Control, vol. , J ul. 1965, pp.322-327. [16] M.J.D.Powell, Radial basis function approximations to Polyno- mials Internal Report . Department of Applied Mathematics and Theoretical Physics. Cabridge University. [17] D.Spect, Generation of Polynomial Discriminant Functions for Pattern Recognition IEEE Trans. EC, vol. EC-16, No.3, Jun. 1967, pp.308-319. [18] S.T.Alexander, Transient Weight Misadjustment Properties for the Finite Precision LMS Algorithm IEEE Trans. ASSP,vol. ASSP-35, No. 9, Sept. 1987, pp.1250-1258. 167 I I I I [19] M.Niranjan & F.Fallside Neural Networks and Radial basis functions in classifying static speech patterns Internal Re- port Cambridge University Engineering Department CUED F- INFENG TR 22(1988). [20] D. Rumelhart & G. Hinton Learning Representations by back- propagating errors Nature Vol.323 9-10-1986. [21] S.Fakhouri, Identification of the Volterra Kernels of Non-linear Systems lEE Proc. vo1.127 No.6, Nov. 1980, pp.296-304. [22] S.Narayan, Frequency Domain Least-Mean-Squares Algorithm IEEE Proc. vol.69 No.l, Jan. 1981, pp.125-126. [23] C.Nikias, Bispectrum Estimation: A Digital Signal Processing Framework IEEE Proc. vol.75 No.7, Jul. 1987, pp.869-891. [24] R.Lippman, An Introduction to Computing with Neural Nets IEEE ASSP Magazine. Apr. 1987, pp.4-22. [25] D Luenberger Linear and Non-linear Programming Addison- Wesley, Massachesetts. [26] C Therrien Decision Estimation and Classification John Wiley, New York. [27] Hand Discrimination and Classification John Wiley, New York. [28] J Glover, Adaptive Noise Cancelling Applied to Sinusoidal In- terference. IEEE Trans. ASSP vol.ASSP -25 1977, pp.484-491. [29] P Feintuch, An adaptive Recursive LMS Filter. IEEE Proc. vol.64 1976, pp.1622. [30] B Friedlander, Analysis and Performance evaluation of an adaptive notch filter. IEEE Trans. Info. Theory voLIT-30 1976, pp.283-295. 168 I \ I i I I' [31] S Elliot & P Darlington, Adaptive Cancellation of periodic syn- chronously sampled interference. IEEE Trans. ASSP vol.ASSP- 33 1985, pp.715-717. [32] C Johnson, Adaptive IIR Filtering:Current Results and Open Issues. IEEE Trans. Info. Theory vol.IT-30 1984, pp.237-250. [33] W Press et. al Numerical Recipes: The Art of scientific Comput- ing. Cambridge University Press 1986. [34] Rabiner and Gold. Theory and Application of Digital Signal Processing. Prentice Hall. 1975. [35] W Harrison and J Lim A New Application of Adaptive Noise Cancellation. IEEE Trans. ASSP vol.ASSP-34 1986, pp.21-27. [36] S. 
Farlow, Self-Organizing Methods in Modeling Dekker Inc, New York 1984. [37] T Koh and E Powers, Second Order Volterra Filtering and Its application to Non-linear System Identification IEEE Trans. ASSP vol.ASSP-33 1985. [38] A Invakhnenko, The Group Method of Data Handling. Soviet Automatic Control, vo1.13 No 12 . 1968. [39] D Psaltis and C Park, Non-linear Discriminant Functions and Associative Memories. American Institute of Physics 1986. [40] R Gorman and T Sejnowski, Learned Classification of Sonar Targets Using a Massively Parallel Network. IEEE Trans. ASSP vol.ASSP-36 1988. [41] V Klema and A Laub, The Singular Value Decomposition: Its Computation and some applications. IEEE Trans. Automatic Control. vol.AC-25 No.2 April 1980. 169 [42] D S Broomhead and D. Lowe, Radial Basis Functions, Multi- variable Functional Interpolation an'd Adaptive Networks . Royal Signals and RADAR establishment Memorandum 4148. IEEE Trans. ASSP vol.ASSP-34 1986, pp.21-27. [43] J Bunch and C Nielsen, Updating the Singular Value Decompo- sition. Numerische Mathematik 31 1978. [44J W.S.McCulloch, A logical Calculus of the ideas imminent in nervous activity. Bulletin of mathematical biophysics 5, 115- 137, 1943. [45] R Rosenblatt, Principles of Neurodynamics. New York, Spartan Books 1959. [46] B. Widrow, Adaptive Switching Circuits. 1960 IRE WESCON, Cony. record Part 4 96-104 Aug 1960. [47] J Hopfield, Computing with Neural Circuits. Science vol. 233 , 625-673 Aug. 1986. [48] Moody and Darka, Learning with localised rceptive fields. Re- search report YALEUjDCSjRR-649 Yale Univ. Published Work 1. M.R.Lynch and P.J.Rayner "A New Approach to Image Regis- tration Utilising Multidimensional Adaptive Filters" , Proceed- ings of the International Conference on Acoustics Speech and Signal Processing (IEEE), New York, USA, April 1988. 2. P.J . Rayner and M.R.Lynch "A New Connectionist Model based on a Non-linear Adaptive Filter", Proceedings of the Interna- tional Conference on Acoustics Speech and Signal Processing (IEEE), Glasgow, UK, May 1989. 170 I I I I I \ I 3. M.R.Lynch and P.J.Rayner "Optical Character Recognition us- ing a New Connectionist Model" Proceedings of the Interna- tional Conference on Image Processing, (lEE) Univ. of War- wick, UK, July 1989. 4. M.R.Lynch and P.J .Rayner "The Properties and Implementa- tion of the Non-Linear Vector Space Connectionist Model", Proceedings of the First International Conference on Artificial Neural Networks, (lEE) Savoy Place, London UK. Oct 1989. 5. P.J .Rayner and M.R.Lynch "Network Complexity Reduction in Volterra Connectionist Modelling by Consideration of out- put Mapping and Principal Non-linearities", Proceedings of the International Conference on Acoustics Speech and Signal Pro- cessing, Alberquerque, New Mexico, USA, April 1990 . Internal Report: 6. M.R.Lynch "A Unimodal Neural Network Utilising Non-linear Vector Space Expansion Methods, Comparison with Current Networks and Research Discussion". Cambridge University En- gineering Department, Trumpington St., Cambridge. Yet to be Submitted for publication: 7. M.R.Lynch "An Adaptive Filter Algorithm usmg Conjugate Gradient Methods." 8. M.R.Lynch "An Adaptive Pitch Estimator for LMS Notch Fil- ters". PhD. Thesis: 9. "Adaptive techniques in Signal Processing and Connectionist Models." 171 I I A New Approach to Image Registration Utilising Multidimensional LMS Adaptive Filters. By Michael R. Lynch and Peter J.W.Rayner Engineering Depar.tment, University of Cambridge Cambridge,England. 
Abstract [n this paper a new method is presented for the adaptive rt'gistration of twu images between whic-h the image ob- jt"ct is shifted. The method is ba~ed on multidimensional LMS adaptive filters, whose derivation is first given and the pt'rformance surface is shown to be u.nimodal. The multidimensional filter's similarities and differences with thl' on~-climensional LMS adaptive filter are hriefly dis- ('Ilssed . It is shown that the filter will perform t.he neces- sary operations of correlation, interpolation,fiitering and shift to produce a registered output image. An algorithm is also dl'veloped to work in conjunction wi th the filter to provide explict shift measurements. The advantages of this iuherently parrallel adaptive approach over current methods are discussed and are must apparf"nt where the images are corrupted by noise or are related by spatially varying shifts. finally the possibility of application of multidimensional LMS adaptive filters to pattern recog- nition is briefly mentioned . Exalllples of the need for application of image registra- tion algorithms arise in many fields such as airborne ground sensing and aircraft navigation, medical imag- ing, in which separat.e frames of an image must be reg- istered in the presence of involuntary patient movement alld wlwl't· ever t.llt're is relat.ive lIIovenH'nt bet.wet'n the illlage sl'nsnr and the image scene. III practical cases the images lIIay he degraded by noise, lIIay contain weakly spatially varying shifts or one of t.he two images tu be regisLt'recl may ('ont.ain ohjl'cts not found in t.he oLher or moving illdl'pclldent.iy of the rest. of the scent'. All of tht'se effeds can significantly reduce the performalH'e of current. image registration methods such as phase cor re- lat.ion. In the case of images this method employs,a two dimen- siollal Least Mean Squares adaptive filter whkh Illakes allllost no a priori assumptions abou t t.he images or their degredat.iolls. This allows the method to he used with gt'neral images and consequently it is not limited t.o im- agt's with sharp edges or similar artifact.s as may be re- quired hy synt.act.ic met.hods. Dt'rjvatjoll Thl'! derivation of the mutildimensioual LMS I1daptivl' filter is closely related to that of tht' one dilllensional ('ast' . The derivation is given for 1\ N dimt'nsiolll\l adaptivt' filt('r with each dimension of order Ill. Consider an input tensur of rank N order III X allcl a N rank weight tensor W .Both tensors element.s have N indices i . .. z. The N-dimensional adaptive linear cumiJiller is ~iv('1I by: Where the operatioll WX is delilled as I he illnel' prociu('t, that is : L L ... L ~· i ... ,II ' i .. . : .:= 1,"1 The error may be defilled as I ht' di fTt'r t' lIct' i)t'llI't'('1I a desired output dk and t.he th h •• , •• 2. S •• Shitt (3.2) Fi~ 4. Results with addition of white noise to both IUl- ~f'S.Fiher 20 by 20,Pinl = .004 Format as Fi@3 Also Dis- pla.cmlent vt'Ctor llIap. Finally tht filters Wf'rt run on an illlaj!:t' pair in whkh the filt~r input imag~ 111\(1 ht"t'1I nhtainecl hy A ~pl"ial vlHying shift which wu a function oC tht x coordinate. I • •• ' " , .. , ... . , ," I '" I •• ' " " I "' . I •• • • '" 1 "" . I I , ," " I ••• "" I I I ., I I" I I I I . I I " t I I I I I I.· •• • , I I I" I I I " 1 ' " I I ' ,. "' " I I " "'" I I '" I • • ' I I • I • •• •• I' • • I ' ' " . I I I I I' .· •• • I. I I " I I : : :: :: :. Fig~. Rt'sul15 of Spatially varying shift case.Shift 6y = 5. sio(O.04 .. zl. 
Filtf'r 12 by 12 Jlint = .004 romput~tio!!'!l1Qacl For spatially stationary shifts the filtt"r nf'ed ollly be rUII ov~r small portions of tilt" imag~. Tht filtf'r size is s~t Ily th~ largt"st shift to hf' df'alt with, but this may bf' reducffi by calculating 0 , ! E 0 1 ~ (Even Parity) y < 0, ! E O2 = (Odd Parity) y =- (XI + X2 )2_2 Y - -2 + x/ + 2XIX2 + X2 2 y > 0 , ! E 0 1 - (Even Parity) y < 0 , ! E ~- {Odd Parity} The lIeneral discriminator may be considered as a non-linear operator on the elements of the pattern vector. However an alternltive viewpoint is to consider the lIeneral discriminator IS a Linear operator over a Non-linearly Extended Vector Space. This approach lives considerable insipt and leads to a Connectionist Model with desirable properties. For the 2-bit parity problem the discriminator: y - XI X2 may be considered as the linear discriminator: y - [0 0 operatin& over the non-linear space shown in fig. 2. Fi(' .. 2. Discriminator boundaries in extended vector space for 2-bit parityproblem 3. Form of tbe Non-linear Extension There are I nu . ber of pOSSIble formulations for the non-linear extellSlOn vector but one which i5 closely related to the Volterra expansion for a non- linear dynamic syste.. is the discrete Kolmolorov- Gabor polynomial [SI y - l!.T ~ .. . (2) This will be termed aa ( N, K ) discriminator system. ie. N variables and upto lU> ' order product terms. An N variable linear discrimiator is an ( N, I ) system. The form of the Extension Vector for a ( 3,2 ) system is: !.1 - [I XI X2 X3 x~ XIX2 XIX, X2XI x~ X2X3 X3XI X3X2 xn 4. Optimum Discriminator Weillht Vector The optimum discriminator weillht vector l!. may be determined with reference to fill.3. Fi&.3. Extended vector space linear discriminator The error in the d~riminator output for an input pattern !, and associlted classification index cJ{j) is ( , - C J(j) '- C J(j} Expected squared error E( (,2 Minimisin& wIth respect to the wei&ht vector l!. where: RI. = E(cJ(j) !.!i } K~. - E{ !.., !.!.} .. . (3) .. . (4) .. . (S) In practical applications the dimensions of the correlation matrix R .. may mean that the computation of the inverse is DOt practicable. Moreover some applications of Connectionist Models require that the ) I Bodel be continuall.y re-trained as the pa ttern data vanes. In slllnal processinll terms this IS equivalent to deahnll with non-stationary slllnals. It is necessary, therefore, to seek an alternative approach. The form of the linear discriminator is identical to that of a Finite Impulse Response filter and eqn.5 is, in eHec:t, the classical Wiener filter but defined over a non- linearly extended vector space. There has been a considerable amount of work over the past few year. d .. linK with adaptive approaches to determininK opti .. al filters (6) and most of this work may be applied to the extended vector space system. The simplest of the a1&orithms wil be derived; this is known as the Least Mean Square (LMS) algorithm or the Stochastic Gradient alKorithm. The block dia&ram of an adaptive approach to the discriminator problem is shown in fi&. 4. Fi&.4. Adaptive extended vector space linear discrimina tor The prinCiple of the LMS al&orithm is to update the discriminator ",el&ht vector in the direction of steepest descent down the error surface. The update equation is: + aE( £,2 !!.'., - !!., /.l a!!., where /.l is a scalar parameter which determines the rate of conver&ence. ~l.l !!., + E(2 /.l £, ;;' } -, From equation 3, -;I -., 2/.l E( £, !., ... 
(6) In tbe LMS al&orithm the expected value of the error gradient is approximated by the instantaneous value so that the update equation becomes: Thus the operation of the system in the supervised learninK phase is to present a sequence of pattern vectors !., and associated classification indices cJm. The discriminator weight vector is updated in accordance with eqn.6 as each pattern and index is presented. A crucial pOlat IS broullht out la eqn." ",hlch shows that the error surface IS hyper-peraboloid and so has a SIngle global minimum. This in sharp contrast to the more standard Connectionist Models in which the shape of the error surface is not k.own and it is not possible, therefore, to determine ",hether the model has conver&ed to a local or &Iobal minimu •• A further point of note is that the learnin& or adaptive phue of the network has ben described in te~ of the LMS al&orithm but there are a number 01 alternative . al&orithms, such as the Recursive Least Squares or Deterministic Kalmaa Filter, which exhibit considerably faster conver&ence but at the expense of additional computational complexity. 5. Network Structure The realisa tioa of the proposed Connectionist Model has not been considered in any detaiJ but it is clear that there is considerable structure iD the non- linear formulation liven by eqn.2. ODe particular decomposi tion rewritten as: will be demonstra ted. Eqn.2 which may be further expressed as: y = I may be The above equation lIIay be computed usilll a nested poloynomial approach as shown below (or a (2,3) system. and in nested form: where: In y - Po + X,(P, + X,(P 2 + X,(P3' )) Po - woo + X2(WOI + X2(W 02 +:I:(w03))) P, - W ,O + x/wl1 + X 2(W ,2» P2 - w20 + X/W21 ) 6. Experimental Results order to compare q uantita tively the performance of tbe extended vector space system with I' I I r current connectlOnlst models, the network was trained to perform the n-bit parity recognition problem . This problem is one of the few for which there are sets of quantitative convergence time data for current connectionist models and demonstrates the ability of the network to perform nonlinear decision making processes. The network is presented with an n bit binary word and is required to output the binary parity value of the word .The nonlinear extended space network was implemented usina the LMS adaptive algorithm on the extension space aenerated by the Kolmoaorov-Gabor polynomial given by equation 2. This network is compared with a standard hidden layer back propoga,tion neural network adapted with the Back Propaaation algorithm [2]. Such networks are among the most powerful and widely used current networks.The figures for the convergence times of the back propogation network are taken from a recent study of network convergence times on the n-bit parity problem (3]. The back progation network used in this study was co.posed of n input units, 2n hidden units and one output unit with full connectivity. The problem of choosing the network parameters was addressed by running the networks many times for each n with differing network parameters and taking the best run. The problem of back propogation nets convergini on local minima was dealt with by discarding from the results any run that did not appear to beconvergini. In the case of the extended space network the problem of parameter choice did not arise as the only parameter, the optimal adaption coefficient Jj, can be chosen using lIIethods standard to adaptive filter theory. 
A Simple and usually adequate method is that of Yassa (7]. Thus only one run was needed for each value of n. The problem of local minima for the extended space network does not arise as the system is unimodal so that the system always converges to the optimal solution. Table I compares the convergence times T for the proposed connectionist model with the convergence times T BP for the Back Projection algorithm as a function of the number of bits n. No.bits Convergence Time n T Tap 2 IS 95 3 17 265 4 62 1200 5 106 4100 6 292 20000 7 558 100000 8 1287 500000 Table 1. Comparison of converience times 7. Concl usions The extended space approach leads to networks which can. with sufficient extension, syntheSise any nonlinear discrimin .. t function whilst maintainini unimoda1ity. This lead.a to ieneral networkl which are ractora or between OM hundred and one thousand time. faster in convergence in the experimental investliations perforllled_ The property of unimodality means that there are 00 local .inima and the network alwaya converge. to the g~1 optimum. The approach haa a firm analytical basis' and large body of supporting knowledge in edaptive filter theory. The adaptive filter knowledge also allows analytical solutions to the vital question of parameter tunini thus allowing good first run performance on practical data.The supporting knowledge from adaptive filter theory may also be used to predict the effects of eigenva1ue IIpread and rank defficiency in the correlation matrix in equation 5, or to utilise other mo~ powerful adaption methods such as Recursive Leut Squares, Singular Value Decomposition, and their derivatives. RefereDCell 11] D. Rumllelhart, J. L. McClelland &. the PDP Re- search Group, "Parallel Distributed Processing: Explor- ations in the Microstructure of Cognition," Vol. I, MlT Press, Cambridge, Massachusetts. 1986 [2] D. E. RUlllmelhart. G. Hinton &. R. Williams, "Learn- ing representations by back propagating erron," Nature, Vol. 323, pp. 533-536. (3] G. Teseuro &. R. lanssens, "Scalini relationship in back-propagation learning: dependence on predicate order," Technical Report, Center for Complex Systems Research, University of Illinois at Urbana-Champaign, 1988 [4] D. J. Hand, "Discrimination and Classification," John Wiley &. Sons, Chichester, 1981 (5] A. G. Ivakhnenko, "Heuristic self-organisation in problems of engineering cybernetiCS," Avtomatika ,Vol. 6, pp. 207-219, 1970 (6] S. S. Haykin, KAdaptive Filter Theory,K Prentice- Hall, New Jersey. 1986. 17] F. Yassa, "Optimality in the choice of convergence f actor fo r grad ient-based adaptive a lgor ithms," IEEE Transactions Acoustics, Speech and Signal Processing, ASSP-35, No.l. pp. 48-59, 1987. 11 Optical Character Recognition using a New Connectioniist Model M.R.Lynch and P.J.Rayner Cambridge University Engineering Department, U.K. 1 Abstract TbU pap~r delcrlbea the developement of. I)'a- tem for recognJalng hand drawn capital lettera. whIch la dellgned to be poaltlon. Icale and ro- tation Invariant. The object of the paper 1. to Invutlgate the performance of a new unlmodal form of connectlonllt model when coupled to a preprocessing stage. The new cODnectionlst model la brIefly derived and Ita error lurface la mown to be unlmodal. The new connecUonlat model may be conaId- ered ... being composed of a non-Unear vector Ipace expander followed b), a Uneu adaptive fil- ter operating In the extended Ipace. The vector Ipace expanalon may be performed b)' meana of the Volterra lIerlell. Thi. 
This approach has the advantage of allowing the direct application of adaptive filter theory.

1.1 Non-linear System Modelling Approach

The pattern recognition problem may be thought of as a system modelling problem, in which the adaptive pattern recogniser is attempting to model the classification system; for example, a human defining the relationship between a set of character font images and a set of collating sequence numbers. The block diagram showing the recognition problem cast as a system modelling problem is shown in Fig.1.

Fig.1 The recognition problem cast as a system modelling (classification) problem.

It is clear that the relationship between the inputs and outputs of such a system is in general a non-linear one; thus the recognition system must be capable of generating a sufficiently close approximation to the non-linear relationship defining the system.

1.2 The Volterra Functional Analysis

In signal processing such non-linear systems may be modelled by use of the Volterra series. The form of the Volterra series was first studied by Vito Volterra, but Norbert Wiener was the first to apply it to non-linear systems theory. He used it to model the input-output relationship of non-linear systems. The continuous Volterra series may be written:

y(t) = H_0 + H_1[x(t)] + H_2[x(t)] + ... + H_n[x(t)] + ...

where y(t) is the output of the system, x(t) is the input and in which:

H_n[x(t)] = int_{-inf}^{+inf} ... int_{-inf}^{+inf} h_n(tau_1, ..., tau_n) x(t - tau_1) ... x(t - tau_n) dtau_1 ... dtau_n

where h_n(tau_1, ..., tau_n) is the Volterra kernel. A discrete form of the series was developed. In general this will need to be a heterogeneous Volterra expansion in order to model systems composed of multiple orders of non-linearity. The heterogeneous discrete Volterra series may be written as:

y = sum_{n=1,N} H_n

where the H_n are the Volterra kernels and the w are coefficients which define the model as follows:

H_0 = w_0                                  Zeroth order 'D.C.' term
H_1 = sum_i w_i x_i                        First order, linear term
H_2 = sum_i sum_j w_ij x_i x_j             Second order, quadratic term
H_3 = sum_i sum_j sum_k w_ijk x_i x_j x_k  and so on.

Normally the coefficients are fixed by use of analytical methods; however they may be fixed adaptively. Consider the following system in which the inputs are combined to give the individual Volterra terms. These terms are then multiplied by their respective coefficients and the terms summed. It can be seen that the system can be decomposed into a vector expansion stage followed by a linear Finite Impulse Response (FIR) filter. The adaption of the coefficients may then be achieved by any of the methods used in FIR adaptive filter theory. The analytical basis of this approach allows us to calculate explicitly the form of the error surface. The Volterra series is found to converge for most practical non-linearities, although large input levels or very high order non-linearities should be avoided. A very closely related expansion may be obtained by using the Gabor-Kolmogorov polynomial.

Fig.2 The connectionist model in block form: a Volterra expansion followed by an adaptive filter operating on the expanded vector.

In a system such as that shown in Fig.2 the effect of joining an FIR filter to the Volterra expansion may be considered. The system has a set of input values x_0 ... x_n which are input to the non-linear Volterra state expander to produce a set of output values v_0 ... v_q, with n being less than q. These values then form the input to an FIR adaptive filter. Consider the output values of the vector state expander. This set of values may be represented by the vector V. The filter coefficients may also be represented by the vector W.
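As an illustration of the vector expansion stage (a sketch under my own naming, not the authors' code), the following Python fragment builds the extended vector V from an input pattern by forming the constant, linear and product terms up to a chosen order.

import numpy as np
from itertools import combinations_with_replacement

def volterra_expand(x, order=2):
    # Expand an input pattern into the non-linear vector space:
    # constant term, linear terms, then products of inputs up to 'order'.
    x = np.asarray(x, dtype=float)
    terms = [1.0]                       # zeroth order 'D.C.' term
    for n in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(x)), n):
            terms.append(np.prod(x[list(idx)]))
    return np.array(terms)

# A 3-input pattern expanded to second order gives the 10-term vector
# [1, x0, x1, x2, x0^2, x0*x1, x0*x2, x1^2, x1*x2, x2^2].
v = volterra_expand([1.0, 0.5, -2.0], order=2)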
Using the standard FIR equation we may write the error eps as:

eps = d - sum_i v_i w_i

in which d is the desired output from the filter, that is, in our case, the correct classification given the set of inputs x_i to the whole system. In vector form this may be written:

eps = d - W^T V

If we wish to consider the error function for the Least Mean Squares, or LMS, criterion, with the mean being taken over the pattern presentations, the equation becomes:

E(eps^2) = E{(d - W^T V)^2} = sigma^2 - 2 W^T P + W^T R W

where R is the autocorrelation matrix of the data in the non-linearly extended vector space, P is the cross-correlation vector between the desired signal d and the non-linearly expanded data, and sigma^2 is the variance of the desired response. The objective is to find a set of weights W which will minimise the mean square error. The system will then output classifications which differ from the desired classifications with the minimum mean square error possible for the system. The solution for the weights of the filter is clearly the same as that of a Wiener filter operating in the extended space, and so the methods used to find the Wiener solution explicitly may be applied. A more suitable approach for the recognition problem is to find the weight solution adaptively. This has the advantage of not requiring a priori knowledge of R, and does not require the inversion of a matrix.

It is clear from the above that the LMS error is solely a quadratic function of the weight vector. This shows that the performance surface is a hyperparaboloid and consequently always unimodal. This is a highly desirable property as it shows that the system will not exhibit local minima on the error surface. Such local minima cause the adaption performance of networks to be greatly reduced, as the adaptive algorithm may get stuck in a local minimum and consequently not reach the optimal coefficient values. The property of unimodality also confers the ability to adapt much more quickly, as there are no valleys in which to dither and the gradient is proportional to the distance from the solution. This last property prevents the adaption being greatly slowed or halted by regions of the error surface with small gradients. A stochastic gradient algorithm may be easily applied, following the derivation of the LMS algorithm. The usual gradient algorithm may be written by differentiating the mean square error with respect to the weights to give an iterative weight update equation:

W_{k+1} = W_k - mu grad E(eps^2)

where mu is an adaption coefficient which sets the speed of learning. A simple expression for grad E(eps^2) may be derived:

dE(eps^2)/dW = 2 eps (d eps / dW) = -2 eps V

The analytic basis of the method, combined with the property of linearity in the coefficients, allows the application of the standard least squares methods. In general for pattern recognition the problem is degenerate or rank-deficient. A single solution may still be obtained by the application of Singular Value Decomposition (SVD), which yields the minimum norm solution. Such analytical methods may also be used to study the network performance and obtain the eigenvalues for real recognition problems.

1.3 Parameter Tuning

There is only one parameter to be chosen in the new network: the adaption coefficient. However, this adaption coefficient is that of a standard adaptive filter. Consequently the large body of knowledge pertaining to adaptive filter theory may be immediately applied. Yassa [5] developed an expression for choosing the optimal adaption coefficient.
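A minimal sketch of the adaptive update just described (purely illustrative; the names are mine): each presentation of an expanded pattern v and desired index d moves the weights along the instantaneous gradient estimate 2*eps*v. The value of mu below is arbitrary; in practice it would be chosen by the analytic methods discussed in the text.

import numpy as np

def lms_step(w, v, d, mu):
    # One LMS update in the extended space: eps = d - W^T V,
    # then move the weights along the instantaneous gradient estimate.
    eps = d - w @ v
    return w + 2.0 * mu * eps * v, eps

# Toy supervised phase: expanded patterns V and classification indices d.
rng = np.random.default_rng(1)
V = rng.standard_normal((200, 10))   # 200 presentations of a 10-term expansion
d = V @ rng.standard_normal(10)      # a target the filter can represent exactly
w = np.zeros(10)
mu = 0.01                            # illustrative only
for v_k, d_k in zip(V, d):
    w, _ = lms_step(w, v_k, d_k, mu)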
No such analytic method of parameter choice may be applied in the case of current networks, as the nature of the error surface is complex and contains local minima. Consequently parameters such as the adaption coefficient must in general be set by trial and error in current networks.

1.4 Implementation

The Volterra expansion may be factorised to give efficient methods for its calculation, requiring fewer multiplications than the direct form. It is necessary to take care in the case of large input signals and high order expansions to prevent overflow problems. Such problems are easily avoided by scaling the data so that it is approximately normalised. The extended space network is found to work well for multiclass problems, in which the network is required to classify the input to one of a series of output classes. This is most efficiently done by using one output which is required to produce a series of levels corresponding to the different classes. This is in contrast with many networks which use a series of binary outputs to represent multiple class classifications. If a direct implementation of the network is used, it is also possible to use the same expansion stage with multiple adaptive filters. This leads to an efficient system which may be used to perform a series of unrelated multiclass problems, each multiclass problem being independent of the others and having its own set of coefficients.

Fig.3 Multiple multiclass recognition: a single expansion stage shared by several independent adaptive filter sections.

2 Direct Recognition

The network was first applied to recognising binary font images of size 8 by 8 pixels. The sixty-four pixels were input to the network and an error signal was generated by subtracting the network output from the value of the font member in its collating sequence. The adaption coefficient was approximately set by using an estimate of the average power of the input image. A second order network was used, adapted using the LMS algorithm.

Fig.4 Examples of the font.

Fig.5 Individual character.

3 Invariant Recognition

A preprocessing section was used to produce patterns which were invariant to shift, scale or rotation. The recognition process was repeated using real camera images of a set of simple handwritten capital letters.

3.1 Preprocessor

One of a series of capital letters was drawn with a marker pen onto a card. The image of the card was digitised by the use of a camera and framestore. The first stage performed by the preprocessor was an adaptive level binarisation; this was followed by a calculation of the centroid of the resulting image. The next stage was to use a simple edge finding algorithm to reduce the image to an outline image of the character. A line following algorithm was used to find sixty-four points in order around the outline. This algorithm was started from the leftmost point on the letter, and the process was repeated three times to take account of inner loops in the letters, for example, the letter B.

Fig.6 Example characters.

The euclidean distance from each point to the centroid was calculated. The referencing of the distance data to the centroid of the letter confers the property of positional invariance on the data. The power of the data is then normalised to be one, which confers the property of scale invariance on the data. A peak finding algorithm was used on the data and the data was shifted in a circular manner to put the peak value as the first data point. This operation gave the data the property of rotational invariance.
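The invariance normalisation just described can be sketched as follows (illustrative Python, my own reading of the text; the binarisation, centroiding and outline-following stages are assumed to have already produced the ordered boundary points).

import numpy as np

def invariant_signature(boundary_xy):
    # Turn an ordered list of outline points (N x 2 array) into a
    # shift-, scale- and rotation-tolerant signature.
    pts = np.asarray(boundary_xy, dtype=float)
    centroid = pts.mean(axis=0)                     # reference to centroid: shift invariance
    dist = np.linalg.norm(pts - centroid, axis=1)   # centroidal distances
    dist /= np.sqrt(np.mean(dist ** 2))             # normalise power to one: scale invariance
    peak = int(np.argmax(dist))
    return np.roll(dist, -peak)                     # start at the peak: rotation alignment

# Example: 64 points sampled around a rough ellipse.
theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
outline = np.stack([3.0 * np.cos(theta) + 10.0, 2.0 * np.sin(theta) - 4.0], axis=1)
sig = invariant_signature(outline)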
Consequently the centroidal data patterns output by the preprocessor should be shift, scale, and rotation independent.

Fig.7 Preprocessor stages.

Although in the majority of cases the preprocessor produces similar output patterns for a given individual letter, there are some letters which under certain circumstances will produce significantly different patterns. An example is the letter B. The first pass of the outline follower follows the outside of the letter; however the second pass could follow the top loop or the bottom loop, and correspondingly generate significantly different patterns depending solely on which loop has the furthest left point. Consequently, a connectionist model must be used which is powerful enough to cope with two significantly different clusters mapping to the same output index.

Fig.8 Centroidal data form.

3.2 Results

Direct recognition: The network was found to correctly recognise all the characters in an eighty character font in around 7000 pattern presentations.

Fig.9 Plot of number of errors in the last fifty presentations against number of presentations.

Preprocessed recognition: The system was trained on twelve letters A ... L, initially using one card for each letter. Once the network was found to correctly recognise each card, other cards with the same letters were gradually introduced into the training runs. A new card was only introduced when the network had converged to a state giving no errors on the current cards.

Letter:        A B C D E F G H I J K L
No. of cards:  4 4 4 4 4 4 4 4 4 4 4 4
% Error:       3 3 2 3 5 5 4 2 1 3 2 2

Table showing the number of letter cards and the percentage error over random position, scale and rotation.

4 Conclusion

In the direct recognition case the network performed well, learning eighty separate class members; this is a higher number than usually reported for current networks. The convergence times were fast by comparison with current hidden-layer networks, which have been reported to require many tens of thousands of presentations to learn fewer characters.

In the preprocessed case the network demonstrated its ability to cope with variation in the patterns, and consequently to be applicable in practical recognition systems which require not only position invariance, but tolerance to character variation.

This paper has introduced an application of the extended space network in optical character recognition. The network has been demonstrated to have a series of desirable properties and, by the use of more advanced preprocessors, may be used as the basis of higher performance recognition systems.

References

[1] S. Haykin, Adaptive Filter Theory, Englewood Cliffs, NJ: Prentice-Hall, 1986.
[2] Schetzen, The Volterra and Wiener Theories of Non-linear Systems, New York, NY: John Wiley, 1980.
[3] P. Alper, A Consideration of the Discrete Volterra Series, IEEE Trans. Automatic Control, Jul. 1965, pp.322-327.
[4] D. F. Specht, Generation of Polynomial Discriminant Functions for Pattern Recognition, IEEE Trans. EC, vol. EC-16, No.3, Jun. 1967, pp.308-319.
[5] F. Yassa, Optimality in the Choice of the Convergence Factor for Gradient Based Adaptive Algorithms, IEEE Trans. ASSP, vol. ASSP-35, No.1, Jan. 1987, pp.48-59.
[6] S. T. Alexander,
Traruient Weight Muodjwtmenl p,.opertiu for the Finite Precuion LMS Algorithm IEEE Trans. ASSP,YoL ASSP-35, No. 9. Sept. 1987, pp.1250-1258. [7] D. Rumelhart It G. Hinton Learning Repruenta- tion. bJl baclrpropogating error. Nature Vo1323 g- 10-1986. [8) E.Pol&k. Com,.,totional Method. in Optimuation London, UK:.Academic Prell, 1971. The Properties and Implementation of the Non-Linear Vector Space 'Connectionist Model. M.R.Lynch and P.J.Rayner Cambridge University Engineering Department, U.K . . 1 Introduction In ~onsidering a connectionist model and its derivation it is necessary to rust define the desirable properties of such a system and then hopefully construct such a system on a firm mathematical basis. A suitable pattern recognition system will need to be adap- tive for most applications. The necessary rules to define most real-world recognition tasks are too complex to be analyti- cally derived. Thi5 becomes more so when the effects of noise or class variation are considered. Once a system has been trained using the members of a training set it is hoped that the system will correctly classify a new pattern which was not in the training set but a member of a class defined by it. This property is often referred to as interpolation or generalisation as it is the a.bility of the system to interpolate the classifier output between training patterns. It is obviously an advantage if the recognition system can adapt itself to a state of 'good' performance with as few pre- sentations as possible. It is impractical to have systems which may get stuck in their learning processes and not reach their optimal performance; we thus require a system that will not suffer from local minima problems. These problems would re- quire the system to be supervised and reset incurring multiple ruru and coruequently large computational expenditure. Although most recognition tasks concerning multiple classes may be reduced to a series of one class recognition problems, this is in general highly inefficient in terms of com- putational load. An ideal system should be able to perform multiple class recognition. For a system to be practically implementable a method of tuning any of the system network parameters must be found to prevent the need to repeat training runs in order to opti- mise these parameters. The system should be highly parallel to allow high speed operation. A parallel implementation may also allow the pro- cessing resources to be distributed and so it may be composed of many simple processing units. By considering work in the field of signal processing and connectionist modili it possible to produce such a network, with a defined mathematical derivation. Firstly consider a basic problem. A standard 'bench-mark' for neural networks is the exclusive OR or parity problem. In thi5 problem a ~ries of binary inputs are input to the network and it is required to return the parity function value of the inputs. We can consider this problem from a signal space approach. Coruider a Tector space created by plotting each of the input variables along a set of orthogonal axes. For a 2 input variable problem this space is a plane. For the binary exclusive-OR problem the input values (or 'input vectors') are (1,1) (0,0) ( 1,0) (0,1). The first two being in class one and the second in class two. It is clear [11) that a linear discriminator cannot be used to differentiate between the classes in the EX-OR problem. 
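A quick numerical check of this claim, and a preview of the extended-space remedy developed in the next paragraph (the check is mine, not the paper's): a least-squares linear discriminator cannot fit the four EX-OR points, but it fits them exactly once the product term x1*x2 is appended.

import numpy as np

# EX-OR points and class labels (parity).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 0.])

def lstsq_error(features, d):
    # Least-squares fit of a linear discriminator (with bias) and its worst residual.
    A = np.hstack([np.ones((len(features), 1)), features])
    w, *_ = np.linalg.lstsq(A, d, rcond=None)
    return np.max(np.abs(A @ w - d))

print(lstsq_error(X, d))                                   # 0.5: not linearly separable
X_ext = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])       # append the x1*x2 term
print(lstsq_error(X_ext, d))                               # ~0: linear in the extended space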
We can, however, traruform the problem into an extended vector space in which the problem may be solved using a linear discriminator . We increase the dimeruionality of the dec is ion space by adding a dimension :I: 1:1:2 • A linear plane may noW' be used to separate the classes. So by using a linear discriminator in an extended vector space it is possible to perform classification tasks requiring non-linear discriminators. .2 Non-linear System Modeling The pattem recognition problem may be thought of as a sys- tem modelling problem, in which the adaptive pattern recog- niser is attempting to model the classification system; for example, a human defining the relationship between a set of character font images and a set of collating sequence num- bers. The block diagram showing the recognition problem cast as a s~stern modelling problem is shown in Fig.I. It is clear thM the relationship between the inputs and outputs of such a system is in general a non-linear one, thus the recognition system must be capable of generating a suf- ficiently cl~ approximation to the non-linear relationship defining the system. Recently the field of Signal Process- ing has se= a large growth in interest in non-linear systenu analysis by use of such methods as higher order spectra, and polycepstra [12) [4) [10). These methods owe much of their basis to the Volterra Functional Analysis of non-linear sys- tems . \ I Fig.l CO~"\ Cl ••• &n,,·UOA .3 The Volterra Functional Analysis In signal processing such non-linear systems may be modeled by use of the Volterra series. The form of the Volterra series was first studied by Vi to Volterra, but Norbert Wiener was the first to apply it to non-linear systems theory. He used it to model the input output relationship of non-linear systems. The continous Volterra series may be written: lI(t) = HO + Hl[x(t)] + H 2[x(t)] + ... + Hn[x(t)] + ... Where y( t) is the output of a system and x{ t) is the input and in which: Hn[x(t)] = 1+00 ... ! +00 h,,( Tl, . .. , T,,)X( t - TI) ... x( t - T" )dTI . .. dT" - 00 -00 Where h .. ( Tl, .• • , T,,) is the Volt err a ke~el. A discrete form of the series was developed. In general this will need to be a heterogenous Volterra expansion in order to model systems composed of multiple orders of non-linearity. The Heterogeneeu~ dUcrete Volterra ~es may written as: L Hn i=l.N Where H" are the Volterra Kernels, and w are coefficients which define the model as follows . Ho = Wo Zeroth Order term 'D.e.' term Hl = L wixi First Order Linear term H2 = L L WijXiXj Second order Quadratic Term j HS = L L L Wij,kXiXjXk .. .. and so on. j k Normally the coefficients are fixed by use of analytical meth- ods, however they may be fixed adaptively. Consider the following system in which the inputs are combined to give the individual Volterra terms. These terms are then multi- plied by their respective coefficients and the terms summed. It can be seen that the system can be decomposed into a vector expansion stage followed by a linear Finite Impulse Response (FIR) filter. The adaption of the coefficients may then be achieved by any of the methods used in FIR adap- tive filter theory. The Volterra Series is found to converge for most practical non-linearites. Other very closely related expansions may also be used such as the Gabor-Kolmogorov polynomial [l3]: Fig.2 d "-_-;-'-w~ ...... I_J:_/_.....Jh'\b ! -==r I . _ x y cl::lS. 
ou~ pattern inL-______________ ~ In a system such as that shown in Fig .2 the effect of joining an FIR filter to the Volterra expansion may be considered. The system has a set of input values:l:o . .. x" which are input to the non-linear Volterra state expander to produce a sel of output values Vo •• • Vq with n being less than q. These values then form the input to an FIR adaptive filter. Consider the output values for the vector state expander. This set of values may be represented by the vector V. The filter coefficients may also be represented by the vector W. Using the standard FIR equation we may write the error t as: In which d is the desired output from the filter, that is in our case the correct classification given the set of inputs to the whole system Xi . In vector form this may be written: If we wish to consider the error function for the Least Mean Squares , or LMS, critereon with the mean being considered over the pattern presentations, the equation becomes: Where R is the auto correlation matrix of the data in the non-linearly extended vector space and similarly P is the crosscorrelation vector between the desired d signal and the non-linearly expanded data.O' is the variance of the desired response . The objective is to find a set of weights W which will minimise the mean square error (the Wiener solution in the extended space) . The most suitable approach for the recog- nition problem is to find the weight solution adaptively. Tlus has the advantage of not requiring a priori knowledge of R, and does not require the inversion of a matrix. It is clear from the above that the LMS error is solely a quadratic function of the weight vector. This .shows that the performance surface is a hyperparaboloid and consequently always unimodal. This is a hlghly desirable property as it shows that the system will not exhibit local minima on the error surface. Such local minima cause the adaption per- formance of networks to be greatly reduced as the adaptive al,orithm may get stuck in local rnirUrna and consequently not reach the optimal coefficient values. The property of uni- modality also confers the ability to adapt much more quickly as t.here are no valleys in whlch to dither and the gradient i. proportional to the distance from the solution. A stochas- tic padient algorithm may be easily applied, following the derivation of the LMS algorithm, to give an iterative weight update equation. Where /J. is an adaption coefficient whlch sets the speed of learning and noting: 8E((l) = 2(~ = -2(X 8W aw The analytic basis of the method combined with the prop- erty of linearity in the coefficients allows the application of the standard and other adaptive least squares methods (RLS (1] and so on) . In general for pattern recognition the problem is degenerate or rank-deficient. A single solution may still be obtained by the application of Singular Value Decomposition (SVD) whlch yields the rninimwn norm solution. Such an- alytical methods are also useful in the study of the network performance. .4 Network Parameter Tuning There is only one parameter to be chosen in the new network. this is the adaption coefficient. However. thls adaption co- efficient is that of a standard adaptive filter . Consequently the large body of knowledge pertaining .to adaptive filter the- ory may be inunediately applied. No such method may be applied in the case of current networks as the nature of the err1>r surface is complex and contains local minima. 
Conse- quently the parameters such as the adapt ion coefficient must in general be set by trial and error. Yassa (7) demonstrated the relationshlp between the av- erage input power P to the adaptive filter and the optimal adaption coefficient /i-op" (/J.;n, is the constant adaption co- efficient) . /J.;n' /J.opla p In setting the adaption coefficient value it has, as may be ex- pected, been found that the system works best with an adap- tion coefficient which differs for each order of non-linearity as the order of the non-linearity affects the input power seen by the adaptive filter section. A neat method of achievin& this is to use the same adaption coefficient for all tenns but to arrange that the input power is roughly normalised for each tap by a suitable constant scaling of the input patterns. This a150 has the advantage of reducing the dynamic range of val- ues for the higher order non-linearities whlch could be helpful for non digital implementations. .5 Network Comparisons The new network', performance was compared with that of the current networks. The exclusive OR problem was used as a test. The adaption times in terms of numbers of itera- tions were com~ for the different networks. The times for the hldden layer back-propagation networks are taken from an American study (3). These timings are for the best run, whlch was achiend by optirnising the network parameters over a series of runs. If the network became stuck in a lo- cal minima it was nopped and restarted from different initial conditions. In the case of the new network the results ale for the first time run &5 the adaption coefficient could be selected beforehand. and the local minimwn problem did not enist. Thus when comparing the figures it is necessary to realise that in tenns of total pattern presentation numbers the fac- tor betw~ the munber of presentations to each network is in fact considerably greater. Fig.3 No_bits Convergence Time • T T~ 2 15 95 3 17 265 .. 62 1200 5 106 4100 6 292 20000 7 558 100000 8 1287 500000 Table 1. Comparison or converaence limea If the figures in the above table are ploted for the two networks it is possible to see that the learning times for the current network rue quickly as a function of the number of input variable!, .... hereas those for the new network rise at a lower rate . The results of the graph imply that the learning time! for current networks are likely to be extremely large for complex probleIru with more than a few input variables. 1 I I Fig.4 EHclus1va-OR Problem • 2 l , 7 11 9 .e ......,.. 01 ~.u UIi Nuar of PrflRIrrtAUons for tonu.rwnc. V. Biu .6 Interpolation Performance Compar- ison. In this section the interpolative ability of two current net- work types (Hidden-Layer [8] and Radial Basis Function (RBFs)[S]) is compared with that of the new network. The network is trained on the exclusive OR class points (0,0), (1,0 ),(0,1 ),( 1,1) W1.til full convergence . The adaption is then stopped and the inputs are varied over the full signal space. The class decision is then ploted as a point for class one and no plot for class two [8] . Fig.S e)H.ldden Ul.ver x2 ,et The hidden layer network displays poor interpolative abili- ties which is to be expected as the decision region is bOW1.ded by linear bOW1.dries. The Radial Basis FW1.ction network per- forl115 much better, producing suitably localised classes. This increase in performance is to be expected due to the relation- ship between RBFs and multidimensional interpolation. 
displ ay s [(sults which are similar to those of RBF networks . Howev(r the new network is not constrained to a .et of (I pn- OM 'RBFs' and so can converge further. In fact RBF networks may be considered as a subset of expanded space networks in which basis fW1.ctions are used by wing the same coefficient multiplying a series of expansion terl115 . .7 Multiclass Performance In practice it has not proved possible, in general, to use cur- rent networlcs for classification with large numbers of classes. In general, multiple class problel115 are addressed by wing a series of networlcs. The flexibility of the decision .urfaces of the new network, however, allow. a single network to perform multiclass problel115. One such problem is that of character recognition. The network was presented with an 8 by 8 pinl image of a character and required to output the characters position in the collating sequence. The network was capable of learning to recognise all 80 characters in the font. A sec- ond order Volterra extension was wed with all terl115 present . The co(fficients were set by the LMS algorithm. More details of this and other recognition experiments may be found in [9]. It is also possible to increase the number of classes that a system can recognise by using the same vector expansion 'hardware ' with multiple adaptive filter sections connected to it. Each problem, although connected to the same expansion system, has completely separate coefficients and thus they do not interact . Fig .6 Plot of number of incorrect classifications in the last fifty presentations against presentation number. 50 . ~'J. 80 .. ember Font. reco~it.ioD i T'r!~~2Dd Order ErpaDaioD,O' Plula 12 .9 'Is i" 9 j zS Fig.Sa 0 The new network converges to the best solution of the three, which is extremely near to the optimal solution, given no information other than at the class points. In fact, if the above is repeated for the network before it has converged, it .8 Implementation Factorisation: In order to facilitate the understanding of the new network it has 50 far been discussed as two sections: firstly the vector state expansion and secondly the adaptive filter section. The 5ame network may be implemented in a more distributed form composed of a series of identical 'neu- rons' by factorising the expansion polynomial. Once this has been done it is possible to find factorisations composed of identical units. Consequently a network may be created us- ing a network of simple terms [13]. Another approach which is currently under investigation is to define two types of 'neu- ron'. The first is composed of a simple multiplier and the second a stan~ard linear perceptron. An interesting form \ I \ of the second implementation is that in which the network elements are randomly connected. This r;enerates a connec- tionist model containing random terms Crom the non-linear state expansion. However, provided that the network is large enough, the inherent degeneracy allows the network to find another possible solution without the mis.sing terms. 3ID-Type I II Fig.7 Basic Perceptrons Th.e level of degeneracy in a given network for a given problem may be indicated by the rank of the extended space conelation matrix R. This may be calculated by Singular Value Decomposition (1). The random network was created by randomly connecting the inputs of type I neurons to type I neuron outputs in any earlier layer . The output type II neuron was connected to random type one outputs. 
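A toy sketch of the randomly connected form (my own simplified reading of the description above, with the type II output unit connected to all available signals rather than a random subset): type I units are simple multipliers wired at random to the inputs or to earlier type I outputs, and the resulting signal vector feeds a single linear type II output unit whose weights would then be adapted by LMS.

import numpy as np

rng = np.random.default_rng(2)

def build_random_product_net(n_inputs, n_units_per_layer, n_layers):
    # Each type I unit multiplies two signals chosen at random from the
    # inputs or from any earlier layer; return the wiring as index pairs.
    wiring, n_signals = [], n_inputs
    for _ in range(n_layers):
        layer = [(int(rng.integers(n_signals)), int(rng.integers(n_signals)))
                 for _ in range(n_units_per_layer)]
        wiring.append(layer)
        n_signals += n_units_per_layer
    return wiring

def forward(x, wiring):
    # Run the multiplier (type I) layers; the full signal vector would feed
    # the linear (type II) output unit.
    signals = list(x)
    for layer in wiring:
        signals += [signals[a] * signals[b] for a, b in layer]
    return np.array(signals)

wiring = build_random_product_net(n_inputs=4, n_units_per_layer=6, n_layers=2)
v = forward(np.array([0.5, -1.0, 0.25, 2.0]), wiring)   # random expansion terms
w = np.zeros(v.size)                                    # type II output weights (to be adapted)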
91 RwJdgn Hrlmr1c : JS8 Jinput t_ I E1_ j 58 1 Z28 input '- 11 E1_ .!i JS .1 __ per 10yr0r 5 bu ? Ph,.] F'ont 20 ]nUorw in font . ~~~~~/~~'~8.S~~~~~ :z 1..., Fig.9 Results of Random Connection Network .9 Conclusion The non-linear vector space expansion connectionist model has been shown to have a number of adv-antages over cur- rently applied networks for application to general connection- ut problems. Firstly, it converg es to its optimal solution far futer, by several orders of magnitude, than cUnent networu, and this convergence time does not rise as quiclcly as a function of the size of the pattern vectors. Secondly, it is unirnodal and consequently does not suffer Crom the problems of local minima such as needing to be reset, getting stuck or uncertainty about having converged. It may thw be left in adaptive mode while being 'on the job' as well as in training. Thirdly, the complete mathematical derivation of the new network allows a much fuller understanding of its operation and allows direct application of adaptive filter knowledge. This allows easy parameter tuning and application of other adaptive filter update methods. It also allows the effech of approximation and implementation to be considered. The ability of the network to function well for random connection also allows the possibility of the implementation ofthe network by use ofleu exact technologies than electronic digital methods. References [1] S.Haykin, Adaptive Filter Theory Englewood CliJU, N J :Prentice-Hall, 1986. [2] Schetzen, The Volterra and Wiener Theorie, 0/ Non- linear Sy~tem~ New York, NY:John Wiley, 1980. (3) G.Tesauro & R.Jarusens, Scaling Relation~hip, in Back- propogation Leaming Technical Report:Center for Com- plex Systems Research Univ. of lllionis.I9-2-1988. (4) P. Alper, A Con,ideration 0/ the Di,crete Volterra Serie, IEEE Tram. Automatic Control, vol. , Jul. 1965, pp.322- 327. (5) M.J.D .Powell, Radial ba,i, /unction approximation. to Polynomial, Proc. Department of Applied Mathematics and Theoretical Physics. (6) D.f.Spect, Generation of Polynomial Di,criminant Function, for Pattem Recognition IEEE Trana. EC, vol. EC-16, No.3, Jun. 1967, pp.30S-3I9. (7) F.Yassa, Optimality in the Choice 0/ the Convergence Factor for Gradient Baud Adaptive Algorithm, IEEE TraIlS. ASSP, vol. ASSP-35, No.l, Jan. 1987, pp.48-59 . (8) M.Niranjan &t F.Falliide Neural Network, and Radial 00- ~i, /unction, in da"ifying Ita tic 'peech pattem, Internal Report Cambridge University Engineering Department CUED F-INFENG TR 22(1988). (9) M.R.Lynch &. P.J.Rayner Optical Character Recognition U,ing a New Connectioni,t Model Proc. IEE Interna- tional Conference on Image Processing 89, Univ of War- wick. [10] S.Fakhouri, Identification of the Volterra K erne/, of Non-linear Sy~terru IEE Proc. vol.l27 No.6, Nov. 1980, pp.296-304. (11) P.J .Reyner &t M.R.Lynch A New Connectioni~t Model Ba,ed on a Non-linear Adaptive Filter Proc. ICASSP 89, pp.1l91. (12) C .Niki8s, Bi'pectrum E,timation: A Digital Signal Pro- ceiling Framework IEEE Proc. vol.75 No.7, Jul. 1987, pp.869-891. (13) A.G.Ivekhnenko Heuri,tic ulf-oryani,ation in Prob- lem, of Engineering CybemetiCl Automatika vol.6 1970, pp.207-219. , l \ .I I' Complexity Reduction in Volterra Connectionist Modelling by Consideration of Output Mapping P.J.Rayner and M.R.Lynch Cambridge University Engineering Department, U.K. Introduction The Volterra Connectionist model has already been demon- strated to have advantages over current networks [6](8)[9]. 
It has been shown to have much faster learning times and be- ing unimodal it does not suffer from local minima problems. The mathematical derivation of the network has also allowed a deeper analysis of network operation, and it is this under- standing of the system which may be used to greatly reduce network complexity by consideration oC the output mapping. Consider the pattern recognition problem as a system trans- fer function : Input Po t tern CIClss;f;cCI t;on ,..----------, Error Fig.l Pattern Recognition Problem In general the relationship between the input pattern data and the output classification data is non-linear and conse- quently the pattern recognition system must in general be a non-linear one. The Volterra connectionist model may be shown to perform this task [2][9] . A heterogenous Volterra may be used to model systems composed of multiple orders of non-linearity. The Heteroge- nous discrete Volterra series may written as: L Hn n=l,N Where H n are the Volterra Kernels, and ware coefficients which define the model as follows . Ho = Wo Zeroth Order 'D.C.' term HI = L wixi First Order Linear term i H2 = L L WijXiXj Second order Quadratic Term j Ha = L L L Wij,kXiXjXk .... and so on. j k Consider the following system in which the inputs are com- bined to give the individual Volterra terms. These terms are then multiplied by their respective coefficients and the terms summed. It can be seen that the system can be de- composed into a vector expansion stage followed by a lin- ear Finite Impulse Response (FIR) filter. The adaption oC the coefficients may then be ou:hieved by any oC the meth- ods used in FIR adaptive filter theory. The Volterra Se- ries is found to converge for most practical non-linearites . d k > 1 7\ ~ V ot terro 7 Exp o ns;on Input Pattern Fig.2 Recognition System Output Mapping The basis of complexity reduction may be easily demon- strated by consideration of a classification problem for one input variable.(Fig.3} . This would require a higher order oC non-linearity and consequently many terms from the Volterra expansion leading to the need for a connectionist model with many terms. If, however, the particular values of the classi- fication indices had been correctly chosen the problem could be reduced to a linear one (Fig.4) . The problem of choosing the indices suitably is considerably more difficult for prou:tical problems with many input variables . 1 c 0 +' X 0 U <+- Vl Vl 0 U Input vo.r ioble x Fig.3 One variable recognition problem . Input Vo.rio.ble x Fig.4 Linearised Problem Derivation The following approa.ch to suitable selection of these indices ha.s been shown to give good results.It uses a constra.int to ensure that the output indices derived are separated from one another and consequently ta.kes the form of a constra.ined optimisation. Suppose that there are M cla.sses and in the tra.ining pha.se there are N, (i=l,M) presentations for ea.ch cla.ss . Let d; be the classification index for each class. Let xn(i) be the extended spa.ce pattern vector for the ith cla.ss and the nth example of it. Let w be the network weight vectors. ,~ t [t. [d. - w'x.U))'] The d; must be constra.ined. A suitable constra.int is found to be: M Ld~ =c ,=1 where c is a.n a.rbitrary consta.nt . The method of Lagrange multipliers may now be applied . Set up the objective function : where k=l ,M. From 1: Define: Ni LXn(i) = p(i) a.nd : M Ni LLXn(i)x~(i) = Rxx i ~ l n=l Then: M Ld,p(i) - Rxxw = 0 .. . 5 .=1 From 2: N. (A + Nk)d k - LW'xn(k) = 0 n = 1 N. 
(>'+Nk)dk - Lx(k) = O ... 4 n=l 4 becomes: (A + Nk)dk - wtp(k) = 0 .. . 6 k = 1, M. Multiply 6 by dk and sum over k (change to i): M M L(>' + N;)d~ - L wtp(i)d; = 0 ;=1 ,=1 For convenience, assume N, = N for i = 1, M : M M (>. + N) L dt - L wtp(i)d; = 0 i=) 1=1 But : so: M >. + N = ;wt L d,p(i) .. . 7 i z: 1 Substitution (5,6) : M L >.: N wtp(i)p(i) - Rxxw = 0 i . 1 1 >. + N Pxxw = Rxxw I I where: AI Pxx = LP(i)pt(i) i . l Assuming Rxx non-singular then: Rx~Pxxw = (A + N)w That is, w is an eigenvector of RXipxx . w= ae where e is an eigenvector. Multiply 5 by w t AI w t L diP(i) - wtRxxw = 0 .. . 10 . %1 S~bstitute 7 into 10: t e(A + N) = w Rxxw ... 11 Substituting 11 into 6 gives: From 11: die = ewtp(k) wtRxxw clJ = a1etRxxe .. . 13 where IJ is the eigenvalue corresponding to e, i.e. A + N = IJ. Now, to decide on which eigenvalue to choose , consider the error: We have: and: and from 13: die = ewtp(k) wtRxxw Expanding 14 gives: M M ( = L L [d: - 2diwtxn(i) + (wtXn(i»l] i = 1 n = l M = Ne - 2 Ldiwtp(i) + wtRxxw i= 1 Substitute 10: From 15: M ( = Ne - Ldiwtp(i) .. . 17 _=1 t die t w p(k) = - w Rxxw e Substitute in 17: = Nc-wtRxxw c = c(N - 1£) where IJ is the eigenvalue corresponding to e. Hence the error is aminimum when the largest eigenvalue is selected. Substitution shows: wtDt, die = ---..:..A 1£ By use of this expression a set of 'optimal' output indices may be found. These indices are in general not integers and in fact a convenient approach to practical implementation has been found. The original integer indices are simply reordered for the classes in the same order as the optimal ones. This is found to work well as may be expected if the connectionist model is viewed as a multidimensional interpolator. Results The output mapping scheme was first applied to a two in- put variable problem with class clusters as shown below: - - '--' i--i---- --'- ___ ~I _ '-__ • ____ ~ __ _ . Fig.S Input pattern clusters and indices These clusters and assignments gave the following outputs for each class . The output mapping system was now run to give the following optimal indices and the original indices reordered . The plots show the desired and actual outputs against pattern number, solid line is the desired output and the dotted line the actual output. Fig.6 Plot of output with original indices It can be seen that the output error has been greatly re- duced by optimal ordering of the output indices. u u Fig.7 Plot of outputs with reordered indices The output mapping approach was next applied to the prob- lem of optical character recognition, using a 80 font member set with 8 by 8 pixels for each character. The desired output was the position of the character in the collating sequence 0 to 79. 00 2113 Tap 2nd Order No.of presentations 6000 Fig 8.Results for 80 member font OCR with first and sec- ond order terms With the arbitrary indices the linear system is not capable of solving the problem. However, if the indices are optimised a linear system can now solve the problem. III 00 '" 0 0 '" IQ '" .., ~ IJl '" 41 65 Tap Linear 0 ...l 0 ~ Z No.of presentations 6000 Fig. 9 Ocr results using linear first order terms. 00 IJl '" 0 0 IQ s.. .., s.. ~ IJl 41 65 Tap Linear '"' ...l 0 ~ 0 Z No.of preaentations 6000 Fig.l0 Results for 80 member font OCR with first order terms but optimised indices Conclusion The Volterra. connectionist model has already been shown to have significant ad vantages over other systems in terms of an- alytical basis, speed of learning, scaling properties, and gener- alisa.tion. 
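The practical reordering scheme described above might be sketched as follows (a hedged reconstruction of the procedure as I read the derivation, assuming Rxx is non-singular; the sign of the selected eigenvector is arbitrary, so the resulting ordering may come out reversed): the classes are ranked by the scores w^T p(i), where w is the eigenvector of Rxx^{-1} Pxx with the largest eigenvalue, and the original integer indices are reassigned in that order.

import numpy as np

def reorder_class_indices(class_patterns):
    # class_patterns: list of (N_i x q) arrays of extended-space pattern
    # vectors, one array per class.  Returns integer indices 0..M-1
    # reassigned in the order of the 'optimal' (non-integer) indices.
    p = [X.sum(axis=0) for X in class_patterns]              # p(i) = sum_n x_n(i)
    Rxx = sum(X.T @ X for X in class_patterns)               # sum_i sum_n x x^T
    Pxx = sum(np.outer(pi, pi) for pi in p)                  # sum_i p(i) p(i)^T
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Rxx, Pxx))
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])     # largest eigenvalue
    scores = np.array([w @ pi for pi in p])                  # proportional to the optimal d_i
    order = np.argsort(scores)
    new_index = np.empty(len(class_patterns), dtype=int)
    new_index[order] = np.arange(len(class_patterns))        # reordered integer indices
    return new_index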
The output mapping method demonstrates an ap- proach which mues the Volterra connectionist model highly efficient computalionally, by comparison with current neural networks. Although in the above results the system was ap- plied to yield linear recognisers exactly the same approach may be used to reduced a higher order non-linear recogniser to a lower order one. The network is no longer being con- strained to fit arbitrary indices and some may utilise all of its degrees of freedom to improve recognition performance. References [1] S.Haykin, Adaptive Filter Theory Englewood Cliffs, NJ:Prentice-Hall,1986. [2] Schetzen, The Volterra and Wiener Theoriej of Non. linear SYjtemj New York, NY:John Wiley, 1980. [3] G.Tesauro &. R.Janssens, Scaling Relationjhipj in Back. propogation Learning Technical Report :Center for Corn· plex Systems Research Univ. of Iilionis.19·2·1988 . [4] P. Alper, A Consideration of the Dijcrete Volterra Series IEEE Trans. Automatic Control, vol. , Jul. 1965, pp.322· 327. [5] D.F.Specht, Generation of Polynomial Dijcriminanl Functionj for Pattern Recognition IEEE Trans. EC, vol. EC-16, No.3, Jun. 1967, pp.308-319. [6] M.R.Lynch &. P.J.Rayner Optical Character Recognition Ujing a New Connectionijt Model Proc. lEE Interna· tional Conference on Image Processing 89, Univ of War- wick. [7] S.Fakhouri, Identification of the Volterra K eme/j of Non·linear Systemj lEE Proc. vo1.l27 No.6, Nov. 1980. pp.296-304. [8] P.J .Rayner &. M.R.Lynch A New Connectionist Model Bajed on a Non· linear Adaptive Filter Proc. ICASSP 89, pp.119!. [9] M.R.Lynch &. P.J .Rayner Propertiej and Implementa. tion of the Non.linear vector space Connectionist model Proc. lEE 1st International Conference on Artificial Neural Networks 89, London. r A Unimodal Neural Network Utilising Non-linear Vector Space Expansion Methods, Comparison with Current Networks and Research Discussion. M.R.Lynch Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 IPZ, U.K. 1 ' Introduction This document is designed to be a basic introduction to the vector space expansion connectionist model. The modern computer has proved itself to be well suited to solving many problems and provides a powerful tool for aiding in the solution of many other problems. There has, however, emerged a type of problem for which standard 'syntactic' or programming methods do not perform well. These problems generally have many input variables and many complex rules relating their outputs to their inputs. The rules that arise in many real world problems, such as visual object recogni- tion under practical conditions, are in general too complex for a suitable set of rules to be explicitly worked out. Conse- quently, in general, syntactic methods of sufficient complexity cannot practically be constructed to solve these problems. The adaptive neural network (15) can address this problem in that it does not require explicit statement of the rules of the problem, as it will implicitly learn them. Such networks are composed of many simple processing units linked together to create a system capable of performing complex tasks. A standard application of such networks is in pattern recognition. This document attempts to define the abilities that a pattern recognition system requires, and, to investigate the performance of the extended vector space connectionist model with respect to these abilities. 1.1 Adaptivity and Learning A suitable pattern recognition system will need to be adap- tive for most applications. 
The necessary rules to define most real-world recognition tasks are too complex to be analyti- cally derived. This becomes more so when the effects of noise or class variation are considered. These problems may be cir- cumvented by utilisation of a self learning system, which by experience of the recognition problem adapts itself to solve the problem. 1.2 Interpolation Once a system has been trained using the members of a train- ing set it is hoped that the system would correctly classify a new pattern which was not in the training set, but a member of a class defined by it. This property is often referred to as interpolation or generalisation, as it is the ability of the system to interpolate the classifier output between training patterns. There is an implicit assumption in the concept of interpolation that patterns belonging to the lame class will cluster to a similar region of the decision space . This may not always apply, for example in the case of a capital and a small letter' A '. However, from an analytical viewpoint this may be considered as two classes mapping to the same output value. It would be hoped that capital A letters will cluster and so on. 1.3 Learning Speed It is obviously an advantage if the recognition system can adapt itself to a state of 'good' performance with as few pre- sentations as possible. The ability to adapt quickly is also of advantage as it also enables the recognition system to respond to changes in the noise or class statistics . 1.4 Certainty of learning It is impractical to have systems which may get stuck in their learning processes and not reach their optimal performancei we thus require a system that will not suffer from local min- ima problems. These problems would require the system to be supervised and reset incurring multiple runs and conse- quently large computational expenditure. 1.5 Multiple class Problems Although most recognition tasks concerning multiple classes may be reduced to a series of one class recogntion problems, this is in general highly inefficient in terms of computational load. An ideal system should be able to perform multiple class recognition. 1.6 Parameter Tuning For a system to be practically implement able a method of tuning any of the system network parameters must be found to prevent the need to repeat training runs inorder to opti- mise these parameters. 1. 7 Parallelism The sy.tem should be highly parallel to allow high speed operation. A parallel implemetation may also allow the pro- cessing resources to be distributed and so it may be composed of many simple processing units. 2 The Signal proach Processing Ap- In the early days of neural networks there was a tendency to over-extrapolate known physiological information to provide networks . There was also a tendency to produce networks by empirical methods. This led to systems which could not easily be analysed or optimised. Recently the field of signal processing has advanced sufficiently to allow the mathemati- cal derivation of a network form by using a well defined signal processing approach. Firstly it is necessary to consider a basic problem. A stan- dard 'bench-mark' for neural networks is the exclusive OR or parity problem. In this problem a series of binary inputs are input to the network and it is required to return the parity function value of the inputs. We can consider this problem from a signal space ap- proach. Consider a vector space created by plotting each of the input variables along a set of orthogonal axes. 
For a 2 input variable problem this space is a plane. For the binary exclusive-OR problem the input val- ues (or 'input vectors') are (1,1) (0,0) (1,0) (0,1). The first two being in class one and the second in class two. Fig. 1 x2 • xl It can be seen from Fig. 1 that a linear discriminator cannot be used to differentiate between the classes in the EX-OR problem. We can however transform the problem into an extended vector space in which the problem may be solved using a linear discriminator. We increase the dimensionalit.y of the desicion space by adding a dimension Xl X2 • A linear plane may now be used to separate the classes. Fig.2 x2 1 If we collapse the desicion plane of Fig.2 onto the 2D plane as in Fig.I it can be seen that the decision contour is non- linear. Fig.3 So by using a linear discriminator in an extended vector space it is possible to perform classification tasks requiring non-linear discriminators. 2.1 Non-linear System Modeling Ap- proach The pattern recognition problem may be thought of as a sys- tem modeling problem, in which the adaptive pattern recog- nition is attempting to model the classification system: for example, a human defining the relationship between a set. of character font images and a set of collating sequence num- bers. The block diagram showing the recognition problem cast. as a system modelling problem is shown in Fig.4. FigA Correct ~ Input Cl .... 1fic .. Uon E.tinl .. t",d Cl .... Ulc .. Uon It is clear that the relationship between the inputs and outputs of such a system is in general a non-linear one, thus the recognition system must be capable of generating a suf- ficiently close approximation to the non-linear relationship defining the system. 2.2 The Volterra Functional Analysis In signal processing such non-linear systems may be modeled by use of the Volterra series [2]. The form of the Volterra series was first studied by Vito Volterra, but Norbert Wimer was the first to apply it to non-linear systems theory. He used it to model the input output relationship of non-linear systems. The continous Volterra series may be written: v(t) = HO + HI [:z:(t)] + H2[:z:(t)] + ... + Hn[:z:(t)] + ... Where lI(t) is the output of a system and :z:(t) is the input and in which: Hn[:z:(t)] = [:00 .. . [:00 hn( TI, ... ,Tn):Z:( t _ T!) ... :I:( t - Tn)dTI ... dTn Where hn( TI, ... , rn) is the Volterra kernel. A discrete form of the series was developed [4]. In general this will need to be a heterogenous Volterra expansion in order to model systems composed of multiple orders of non-linearity. The Heterogenous discrete Volterra series may written as: L Hn i=l,N Where Hn are the Volterra Kernels, and ware coefficients which define the model as follows . Ho = Wo Zeroth Order term 'D.C.' term HI = L wi Xi First Order Linear term H2 = L L WijXiXj Second order Quadratic Term j HS = L L L wIJ.kxIXjXk .... and so on. i j k Normally the coefficients are fixed by use of analytical meth- ods [12] however they may be fixed adaptively. Consider the following system in which the inputs are combined to give the individual Volterra terms. These terms are then multiplied by their respective coefficients and the terms lununed. It can be seen that the system can be decomposed into a vector ex- pansion stage followed by a linear Finite Impulse Response (FIR) filter. The adaption of the coefficients may then be achieved by any of the methods used in FIR adaptive filter theory. 
The analytical basis of this approach allows us to ex- plicitly calculate the form of the error surface. The Volterra Series is found to converge for most practical non-linearites, although large input levels or vey high order non-linearities should be avoided. A very closely related expansion may be obtained by using the Gabor-Kohnogorov polynomial: [16] Fig.5 Volterra Exp. In a system such as that shown in Fig.5 the effect of joining an FIR filter to the Volterra expansion may be considered. The system has a set of input values :1:0 ••• :1:" which are input to the non-linear Volterra state expander to produce a set of output values 110 ••• 119 with n being less than q. These values then form the input to an FIR adaptive filter. Consider the output values for the vector state expander . This set of values may be represented by the vector V . The filter coefficients may also be represented by the vector W. Using the standard FIR equation we may write the error € as: €= d - L 1I i 'Wi In which d is the desired output [1] from the filter, that is in our case the correct classification given the set of inputs to the whole system lei. In vector form thU may be written: If we wish to consider the error function for the Least Mean Squares, or LMS, critereon with the mean being considered over the pattern presentations, the equation becomes: Where R is the autocorrelation matrix of the data in the non-linearly extended vector space and similarly P is the crosscorrelation vector between the desired d signal and the non-linearly expanded data.O' is the variance of the desired response. The objective is to find a set of weights w which will minimise the mean square error. The system then will out- put classifications which differ from the desired classifications with the mimimum mean squared error possible for the sys- tem. The optimal solution for the weights of the filter is the same as that of a Wiener Filter [1] operating in the extended space and so the methods used to find the Weiner solution explicitly may be applied. A more suitable approach for the recognition problem is to find the weight solution adaptively. This has the advantage of not requiring a priori knowledge of R, and does not require the inversion of a matrix. It is clear from the above that the LMS error is solely a quadratic function of the weight vector. This shows that the performance surface is a hyperparaboloid and consequently always unimodal. This is a highly desirable property as it shows that the system will not exhibit local minima on the error surface. Such local minima cause the adapt ion per- formance of networks to be greatly reduced as the adaptive algorithm may get stuck in local minima and consequently not reach the optimal coefficient values. The property of uni- modality also confers the ability to adapt much more quickly as there are no valleys in which to dither and the gradi- ent is proportional to the distance from the solution. This last property prevents the adaption being greatly slowed or halted by regions of the error surface with small gradients. A stochastic gradient algorithm [11] may be easily applied. Fol- lowing the derivation of the LMS algorithm. The usual gra- dient algorithm may be written by differentiating the mean square error with respect to the weights to give an iterative weight update equation. Where J.L is an adaption coefficient which sets the speed of learning . A simple expression for \! E( e2 ) may be derived. 
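Filling in that step with the notation of this section (the earlier papers write X for the vector presented to the adaptive filter, which here is the expanded vector V), and using the instantaneous squared error as the estimate of E(eps^2):

dE(eps^2)/dW = 2 eps (d eps / dW) = -2 eps V

so that the stochastic gradient (LMS) update becomes

W_{k+1} = W_k - mu grad E(eps^2) = W_k + 2 mu eps_k V_k.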
The analytic basis of the method combined with the prop- erty of linearity in the coefficients allows the application of the standard least squares methods. In general for the pat- tern recognition the problem is degenerate or rank-deficient. A single solution may still be obtained by the application of Singular Value Decomposition (SVD) which yields the min- imum norm solution. Such analytical method may also be used to study the network performance and obtain the eigen- values for real recognition problems. 2.3 Parameter Tuning There is only one parameter to be chosen in the new network, this is the adaption coefficient. However, this adaption coef- ficient is that of a standard adaptive filter. Consequently the large body of knowledge pertaining to adaptive filter theory may be immediately applied. Yassa[7] developed an expres- sion for choosing the optimal adaption coefficient. No such method may be applied in the case of current networks as the nature of the error surface is complex and contains local minima. Consequently the parameters such as the adaption coefficient must in general be set by trial and error in current networks. 3 Network Comparisons The new network's performance was compared with that of the current networks. The exclusive OR problem was used as a test. The adapt ion times in terms of numbers of iter- ations were compared for the different network times. The times for the hidden layer back-propogation networks [10] are taken from an American study [3]. These timings are for the best run, which was achieved by optimising the net- work parameters over a series of runs. If the network be- came stuck in a local minima it was stopped and restarted from different initial conditions. In the case of the new net- work the results are for the first time run as the adapt ion coefficient could be selected beforehand, and the local min- imum problem did not exsist. Thus when comparing the figures it is necessary to realise that in terms of total pat- tern presentation numbers the factor between the number of presentations to each network is in fact considerably greater. Fig.6 Pari ty Pro blern. f..L= .Ol Volterra lUdden No.ot Bit,,_ Elrten810n ~yer 11 Tooa. ... TOODY ~ 15 29:5 17 4 62 1200 5 106 4100 6 292 20000 7 556 100000 8 1287 500000 If the figures in the above table are ploted for the two networks it is possible to see that the learning times for the current network rise quickly as a function of the number of r input variables, whereas those for the new network rise at a lower rate. The results of the graph imply that the learning times for current networks are likely to be extremely large for complex problems with more than a few input variables. Fig .7 E~clus~ve-OR Problem Hlddft1 Layer Nrtwork /~---- ~ Uolterra Network 2 3 6 6 7 8 NUf1b.r 0 f b 1 t. 9 10 LOG Nunber of Pre.entgtlon& for Conueroence U. Blt&. Fig.S 'b) Haillul I.:h.f:St~ FUIlCcti01.1 x2 xl The hidden layer network displays poor interpolative abili- ties which is to be expected as the decision region is bounded by linear boundries. The Radial Basis Function network per- forms much better, producing suitably localised classes. This increase in performance is to be expected due to the rela- tionship between RBFs and multidimensional interpolation. The new network converges to the best solution of the three, which is extremely near to the optimal solution, given no in- formation other than at the class points. In fact, if the above 3.1 Interpolation Performance parison. 
3.1 Interpolation Performance Comparison

In this section the interpolative ability of two current network types, hidden layer and Radial Basis Function (RBF) [5][9], is compared with that of the new network. Each network is trained on the exclusive-OR class points (0,0), (1,0), (0,1), (1,1) until full convergence. The adaption is then stopped and the inputs are varied over the full signal space. The class decision is then plotted as a point for class one and no plot for class two. Graphs a) and b) were provided by Niranjan [9].

Fig.8 Decision regions over the (x1, x2) input plane; panel (b) shows the Radial Basis Function network.

The hidden layer network displays poor interpolative abilities, which is to be expected as the decision region is bounded by linear boundaries. The Radial Basis Function network performs much better, producing suitably localised classes. This increase in performance is to be expected due to the relationship between RBFs and multidimensional interpolation. The new network converges to the best solution of the three, which is extremely near to the optimal solution given no information other than at the class points. In fact, if the above experiment is repeated for the new network before it has converged, it displays results which are similar to those of the RBF networks. However, the new network is not constrained to a set of a priori 'RBFs' and so can converge further.

3.2 Multiclass Performance

In practice it has not proved possible, in general, to use current networks for classification with more than 26 classes. In general, multiple class problems are addressed by using a series of networks. The flexibility of the decision surfaces of the new network, however, allows a single network to perform multiclass problems. One such problem is that of character recognition. The network was presented with an 8 by 8 pixel image of a character and required to output the character's position in the collating sequence. The network was capable of learning to recognise all 80 characters in the font. A second-order Volterra extension was used with all terms present. The coefficients were set by the LMS algorithm.

Fig.9 Some examples of the training font.

Fig.10 Plot of the number of incorrect classifications in the last fifty presentations against presentation number.

It is also possible to increase the number of classes that a system can recognise by using the same vector expansion 'hardware' with multiple adaptive filter sections connected to it.

Fig.11 Multiclass problem architecture.

3.3 Implementation

In order to facilitate the understanding of the new network it has so far been discussed as two sections: firstly the vector state expansion and secondly the adaptive filter section. The same network may be implemented in a more distributed form, composed of a series of identical 'neurons', by factorising the expansion polynomial. Once this has been done it is possible to find factorisations composed of identical units. Consequently a network may be created using a network of simple terms. Another approach which is currently under investigation is to define two types of 'neuron': the first is composed of a simple multiplier and the second is a standard linear perceptron. An interesting form of the second implementation is that in which the network elements are randomly connected. This generates a connectionist model containing random terms from the non-linear state expansion. However, provided that the network is large enough, the inherent degeneracy allows the network to find another possible solution without the missing terms. As an example of the factorisation,

y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3
      + w_11 x_1 x_1 + w_12 x_1 x_2 + w_13 x_1 x_3
      + w_21 x_2 x_1 + w_22 x_2 x_2 + w_23 x_2 x_3
      + w_31 x_3 x_1 + w_32 x_3 x_2 + w_33 x_3 x_3

may be factorised as:

y = w_0 + x_1 (w_1 + w_11 x_1 + w_12 x_2 + w_13 x_3)
        + x_2 (w_2 + w_21 x_1 + w_22 x_2 + w_23 x_3)
        + x_3 (w_3 + w_31 x_1 + w_32 x_2 + w_33 x_3)

Fig.12 Implementation of the network.
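A quick numerical check of the factorisation above: the sketch below evaluates the second-order polynomial in both its direct and its factorised forms and confirms that they agree. The names and the random test values are illustrative only.

import numpy as np

def direct_form(x, w0, w, W):
    # y = w0 + sum_i w_i x_i + sum_ij W_ij x_i x_j
    return w0 + w @ x + x @ W @ x

def factorised_form(x, w0, w, W):
    # Same polynomial regrouped as y = w0 + sum_i x_i (w_i + sum_j W_ij x_j),
    # so each bracket is an identical linear unit gated by x_i.
    return w0 + sum(x[i] * (w[i] + W[i] @ x) for i in range(len(x)))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w0, w, W = rng.normal(), rng.normal(size=3), rng.normal(size=(3, 3))
assert np.isclose(direct_form(x, w0, w, W), factorised_form(x, w0, w, W))

Each bracketed term has the same form, which is what allows the network to be built from a set of identical units.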
4 Current Research

The quadratic error surface lends itself ideally to the use of a more efficient adaption algorithm, such as that of conjugate gradients [11]. A variation of this method is currently being developed which is giving considerably faster convergence times than those discussed earlier, without requiring the storage of very large matrices.

As previously mentioned, the effect of various implementations is being considered, including random connection under a series of network generating rules, and the effect of limited accuracy and limited dynamic range multiplications [8].

A form of the network based on the Wiener G-functionals [2] is being considered, as these form an orthogonal set. This may lead to advantages in adaptive performance [13] for some applications, and forms another basis from which to analyse the network behaviour.

An examination of the network from a vector space perspective has made it apparent that the actual values of the outputs for the various classes, if suitably chosen, can lead to the use of very small networks. Although an analytical solution has not yet been found, a series of good approximate solutions have been developed and have yielded some surprisingly small networks. The importance of the output values can easily be demonstrated. Consider a single input system with a series of output classes as shown below:

Fig.13 Class index n against input x (original index assignment).

This would require a high order non-linearity, and consequently complex boundaries to the decision regions, and so a larger network. If the output indices had been suitably chosen the whole problem could have been reduced to a linear one:

Fig.14 Class index n against input x (indices reordered so that the mapping is linear).

Although the above is a simple case, the problem becomes difficult to solve in multiple dimensions with a degenerate system. This question of output values has not been fully appreciated with current networks, and a method for determining suitable indices would lead to far smaller networks. The approach currently being investigated involves a principal component analysis of the pattern difference vectors and has provided the ability to use smaller networks. It is hoped to develop an adaptive approach to network contraction.

The use of a least squares criterion for multiclass problems is not a particularly good one; other error functions are being investigated.

Ideally a pattern recognition system should construct an estimate of the underlying probability density functions which generate the input vectors given an element from a particular class. Although, due to the expectation operator in the mean square error equation, the probability of a pattern vector occurring is taken into account by the network, it may be possible for the network to make far more complete estimations of the underlying probability processes by using Bayesian methods.
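The index-selection procedure is not spelt out here; the following is a minimal sketch of one plausible first step under the description above, computing the principal components of the pairwise pattern difference vectors. The pairing scheme, names and example values are assumptions for illustration only.

import numpy as np
from itertools import combinations

def difference_pca(patterns):
    # Principal component analysis of the pairwise pattern difference vectors.
    # patterns: array of shape (n_patterns, n_features).
    diffs = np.array([a - b for a, b in combinations(patterns, 2)])
    cov = np.cov(diffs, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    return eigvals[order], eigvecs[:, order]

# Example with three 4-dimensional class prototypes (illustrative values)
protos = np.array([[0.0, 0.0, 1.0, 1.0],
                   [1.0, 0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0, 0.0]])
vals, vecs = difference_pca(protos)

Directions along which the class prototypes differ most strongly are then natural candidates along which to order, or separate, the output indices.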
5 Conclusion

The non-linear vector space expansion connectionist model has been shown to have a number of advantages over currently applied networks for general connectionist problems.

Firstly, it converges to its optimal solution faster, by orders of magnitude, than current networks, and this convergence time does not rise as quickly as a function of the size of the pattern vectors.

Secondly, it is unimodal and consequently does not suffer from the problems of local minima, such as needing to be reset, getting stuck, or uncertainty about whether it has converged. These problems can seriously affect the performance of current networks and prevent them being used in an adaptive mode while actually performing their tasks.

Thirdly, the complete mathematical derivation of the new network allows a much fuller understanding of its operation and allows direct application of adaptive filter knowledge. This allows easy parameter tuning and the application of other adaptive filter update methods. It also allows the effects of approximation and implementation to be considered.

References

[1] S. Haykin, Adaptive Filter Theory. Englewood Cliffs, NJ: Prentice-Hall, 1986.
[2] Schetzen, The Volterra and Wiener Theories of Non-linear Systems. New York, NY: John Wiley, 1980.
[3] G. Tesauro & R. Janssens, Scaling Relationships in Back-propagation Learning. Technical Report, Center for Complex Systems Research, University of Illinois, 19-2-1988.
[4] P. Alper, A Consideration of the Discrete Volterra Series. IEEE Trans. Automatic Control, Jul. 1965, pp. 322-327.
[5] M.J.D. Powell, Radial Basis Function Approximations to Polynomials. Proc. Department of Applied Mathematics and Theoretical Physics.
[6] D.F. Specht, Generation of Polynomial Discriminant Functions for Pattern Recognition. IEEE Trans. EC, vol. EC-16, no. 3, Jun. 1967, pp. 308-319.
[7] F. Yassa, Optimality in the Choice of the Convergence Factor for Gradient Based Adaptive Algorithms. IEEE Trans. ASSP, vol. ASSP-35, no. 1, Jan. 1987, pp. 48-59.
[8] S.T. Alexander, Transient Weight Misadjustment Properties for the Finite Precision LMS Algorithm. IEEE Trans. ASSP, vol. ASSP-35, no. 9, Sept. 1987, pp. 1250-1258.
[9] M. Niranjan & F. Fallside, Neural Networks and Radial Basis Functions in Classifying Static Speech Patterns. Internal Report, Cambridge University Engineering Department, CUED/F-INFENG/TR 22, 1988.
[10] D. Rumelhart & G. Hinton, Learning Representations by Back-propagating Errors. Nature, vol. 323, 9-10-1986.
[11] E. Polak, Computational Methods in Optimization. London, UK: Academic Press, 1971.
[12] S. Fakhouri, Identification of the Volterra Kernels of Non-linear Systems. IEE Proc., vol. 127, no. 6, Nov. 1980, pp. 296-304.
[13] S. Narayan, Frequency Domain Least-Mean-Squares Algorithm. IEEE Proc., vol. 69, no. 1, Jan. 1981, pp. 125-126.
[14] C. Nikias, Bispectrum Estimation: A Digital Signal Processing Framework. IEEE Proc., vol. 75, no. 7, Jul. 1987, pp. 869-891.
[15] R. Lippmann, An Introduction to Computing with Neural Nets. IEEE ASSP Magazine, Apr. 1987, pp. 4-22.
[16] A. Ivakhnenko, Heuristic Self-Organisation Problems of Engineering Cybernetics. Automatica, vol. 6, 1970, pp. 207-219.
[17] V. Klema & A. Laub, The Singular Value Decomposition: Its Computation and Some Applications. IEEE Trans. AC, vol. AC-25, no. 2, Apr. 1980, pp. 164-176.

An Adaptive Filter Algorithm using Conjugate Gradient Methods

M.R. Lynch and P.J. Rayner
Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, U.K.

1 Introduction

Over recent years a series of algorithms for updating adaptive filter coefficients has been developed. Some of these methods, notably those related to the method of Recursive Least Squares (RLS) [1], have provided significant increases in performance over the Least Mean Squares (LMS) [1] algorithm. For many applications the problems posed by the lack of performance of the LMS algorithm can be overcome by implementing one of these other algorithms. However, almost all of this new generation of algorithms require the storage of a matrix and a significant increase in the computational load.
This is not a problem for most adaptive filters, which are one-dimensional and utilise a limited number of coefficients. In the case of multidimensional adaptive filters, such as those used in image processing [7], or connectionist models, or any adaptive filter with many coefficients, however, it may be totally impractical to consider the storage of a matrix containing O(n²) coefficients, where n is the number of coefficients. Likewise the increase in computational load associated with current approaches may be too great. In this paper a new algorithm is presented which gives a substantial increase in performance over the LMS algorithm, but does not require the storage of a matrix and has a computational burden of approximately five times that of LMS.

1.1 The LMS Algorithm

In order to construct a new algorithm with increased performance over the LMS algorithm it is necessary first to consider the failings of the LMS algorithm. The update equation for the LMS or stochastic gradient method is [1]:

w_{k+1} = w_k + 2 μ ε x

or, more generally:

w_{k+1} = w_k − α ∇(ε²)

where w is the filter weight vector, k is the time index, α and μ are adaption coefficients, ε is the filter error and x is the filter input data vector. Consider an error surface which is a long narrow valley. Steepest descent type algorithms, such as LMS, will in general tend to zig-zag down the valley as shown in Fig.1. This behaviour occurs even for perfectly paraboloidal surfaces and so leads to large inefficiencies in terms of convergence time.

Fig.1 LMS convergence.

2 Derivation of the Algorithm

Consider the minimisation of a function f over an N-dimensional coordinate system. The function can be approximated by the first three terms of a Taylor series about a point P:

f(w) ≈ f(P) + Σ_i (∂f/∂w_i) w_i + (1/2) Σ_{i,j} (∂²f/∂w_i ∂w_j) w_i w_j
     = c − b·w + (1/2) w^T A w

where:

c = f(P),   b = −∇f,   [A]_ij = ∂²f/∂w_i ∂w_j

It should be noted that the above is exact for a quadratic function. The matrix A is the second partial derivative matrix, that is the Hessian, of f at P. If a minimisation in the direction of a vector u is performed, it is necessary to find a new direction in which to minimise so as not to destroy the previous minimisation. Consequently the next direction v must be such that the change in the gradient is perpendicular to u, that is:

u · δ(∇f) = u^T A v = 0

By calculating the gradient of the above approximation with respect to the weights we obtain:

∇f = A w − b

and hence the effect on the gradient of a small change in w is:

δ(∇f) = A δw

u and v are referred to as mutually conjugate. It is also necessary to construct a set of mutually orthogonal vectors so that the whole space may be spanned in the search.

Following the derivation of the Fletcher-Reeves [4] algorithm, it is possible to invoke a theorem which allows the construction of mutually orthogonal and mutually conjugate vector sequences.

2.1 Theorem 1

Let A be a symmetric, positive definite N×N matrix and g_0 an arbitrary vector, with h_0 = g_0. For i = 0, 1, 2, ... two series of vectors may be defined by:

g_{i+1} = g_i − λ_i A h_i
h_{i+1} = g_{i+1} + γ_i h_i

where:

λ_i = (g_i · g_i) / (g_i · A h_i),    γ_i = −(g_{i+1} · A h_i) / (h_i · A h_i)

Then, for all i ≠ j:

g_i · g_j = 0    and    h_i · A h_j = 0

That is, the g_i are mutually orthogonal and the h_i are mutually conjugate. The proof of this theorem is beyond the scope of this paper but may be found in Polak's book [2]; it is a form of Gram-Schmidt orthogonalisation. However, this theorem still requires the storage of the matrix A.
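Theorem 1 is easy to check numerically. The sketch below generates the two vector sequences for a small symmetric positive definite matrix and prints their Gram matrices, which should be numerically diagonal; the matrix size and values are arbitrary illustrations.

import numpy as np

def conjugate_sequences(A, g0, n_steps):
    # Generate the g (mutually orthogonal) and h (mutually A-conjugate)
    # sequences of Theorem 1, starting from h0 = g0.
    g, h = [g0], [g0.copy()]
    for _ in range(n_steps):
        gi, hi = g[-1], h[-1]
        lam = (gi @ gi) / (gi @ A @ hi)
        g_next = gi - lam * (A @ hi)
        gamma = -(g_next @ A @ hi) / (hi @ A @ hi)
        h.append(g_next + gamma * hi)
        g.append(g_next)
    return np.array(g), np.array(h)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4.0 * np.eye(4)          # symmetric, positive definite
g, h = conjugate_sequences(A, rng.normal(size=4), 3)
print(np.round(g @ g.T, 6))            # off-diagonal terms ~ 0: g_i orthogonal
print(np.round(h @ A @ h.T, 6))        # off-diagonal terms ~ 0: h_i conjugate

The explicit use of A in this construction is exactly what Theorem 2 removes.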
Another theorem enables this problem to be overcome.

2.2 Theorem 2

If g_i = −∇f(P_i), a minimisation of f in the direction h_i is performed to find a new point P_{i+1}, and g_{i+1} = −∇f(P_{i+1}), then the same sequences are generated as in Theorem 1. This may be proved by induction [2]. Consequently a series of mutually orthogonal vectors and a series of mutually conjugate vectors may be generated without knowledge or storage of A.

2.3 Application to Adaptive Filters

In order to use the above theorems it is still necessary to perform line minimisations and to evaluate the gradient. The gradient evaluation may be performed easily for the adaptive filter case. For the standard adaptive filter [1], as shown below,

Fig.2 Standard adaptive filter.

ε = d − w^T x

Taking the mean square error:

E(ε²) = E(d²) − 2 w^T v + w^T R w

where x is the vector representing the filter inputs, d is the desired response, w the weight vector, R the autocorrelation matrix of the input and v the cross-correlation vector between the input and the desired signal. The gradient may thus be calculated as:

∇(ε²) = ∂ε²/∂w = 2 ε ∂ε/∂w = −2 ε x

The other problem is the line minimisation. The mean square error is a parabola as a function of distance along the line. Consider the line generated by:

w' = w + β h

where β is the line-generating scalar. The filter error can then be written:

ε = d − Σ_i x_i (w_i + β h_i)

The second derivative of the mean square error with respect to distance along the line h may be calculated as:

∂²ε²/∂β² = ∂/∂β (2 ε ∂ε/∂β) = 2 (Σ_i x_i h_i)²

Given expressions for the derivative and the second derivative, and the knowledge that the relationship is paraboloidal, it is possible to find:

β = ε / (x^T h)

as the value of β at the line minimum along h. This gives only an estimate of the line minimum, as the instantaneous error values may be affected by noise and incorporate no time averaging aspect. The approximation is found in practice to be sufficient for many applications. This approximation can lead to a set of mutually conjugate and mutually orthogonal vector series which uses up all viable directions before arriving at the minimum. This problem can be cured by periodic resetting of the algorithm, but a much more elegant solution is to use the Polak-Ribiere [2] extension of the Fletcher-Reeves algorithm, which, by using a slightly different expression for γ, leads to an algorithm which in effect automatically resets itself if necessary.

2.4 The Algorithm

The gradient estimate at each step is

g_{i+1} = 2 ε x

with initial values:

g_0 = 2 ε x,    h_0 = 2 ε x

3 Results

The algorithm was tested against the LMS algorithm. Two filters, one LMS and one conjugate gradient (CG) filter, were run in parallel modelling an FIR system. The experiment was repeated with random coefficients in the 50th-order FIR system being modelled. The input signal was also varied: it was composed of a series of sinusoids of random amplitude and frequency. The convergence curves for these runs were averaged and plotted.

Fig.3 Error against iteration number.

Fig.4 Plot of average absolute error values over fifty runs: CG and smoothed LMS against iteration number.

Fig.5 Typical run with no averaging. Note the smooth convergence of the LMS algorithm compared with the jumps of the conjugate gradient algorithm.
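The listing of the algorithm above is terse, so the following sketch assembles the pieces described in this section into a runnable form: the instantaneous gradient estimate g = 2εx, the line-minimum step β = ε/(x·h), and a Polak-Ribiere style γ. The γ expression, the reset-by-clipping, the guard on small denominators and all parameter values are assumptions made for illustration, not the authors' exact implementation.

import numpy as np

def cg_adaptive_filter(x, d, n_taps):
    # Conjugate-gradient adaptive FIR filter (sketch of the method above).
    w = np.zeros(n_taps)
    h = None
    g_old = None
    errors = []
    for k in range(n_taps - 1, len(x)):
        u = x[k - n_taps + 1:k + 1][::-1]        # current tap-input vector
        e = d[k] - w @ u                          # instantaneous error
        g = 2.0 * e * u                           # gradient estimate, g = 2 e x
        if h is None:
            h = g.copy()                          # initial direction h0 = g0
        else:
            # Polak-Ribiere style gamma; clipping at zero acts as an automatic reset
            gamma = max(0.0, ((g - g_old) @ g) / (g_old @ g_old + 1e-12))
            h = g + gamma * h
        denom = u @ h
        if abs(denom) > 1e-8:                     # guard against very small denominators
            w = w + (e / denom) * h               # beta = e / (x.h): step to line minimum
        g_old = g
        errors.append(e)
    return w, np.array(errors)

# Illustrative use: identify a short FIR system from noise-free data
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
true_w = np.array([0.5, -0.3, 0.2, 0.1])
d = np.convolve(x, true_w)[:len(x)]
w_est, err = cg_adaptive_filter(x, d, n_taps=4)

Only a few length-n vectors (the weights, the direction, the previous gradient and the tap vector) need be stored, in line with the O(4n) storage figure quoted in the conclusion below.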
4 Conclusion

The algorithm performs well, giving significant increases in convergence rate over the LMS algorithm, and requires only O(4n) storage rather than the O(n²) of other fast methods. It is necessary to note that the basic algorithm operates more in a Least Squares than a Mean Least Squares mode. This can lead to unexpected results, especially for short filters with low frequency signals, and some form of error averaging may be needed to give a Mean Least Squares solution. Provided that this complication is considered, the algorithm is an improvement on LMS and may be used when methods such as Recursive Least Squares [1][6] or Singular Value Decomposition [7] cannot be applied due to storage or computational load constraints.

References

[1] S. Haykin, Adaptive Filter Theory. Englewood Cliffs, NJ: Prentice-Hall, 1986.
[2] E. Polak, Computational Methods in Optimization. London, UK: Academic Press, 1971.
[3] F. Yassa, Optimality in the Choice of the Convergence Factor for Gradient Based Adaptive Algorithms. IEEE Trans. ASSP, vol. ASSP-35, no. 1, Jan. 1987, pp. 48-59.
[4] R. Fletcher, Practical Methods of Optimisation. New York, NY: John Wiley, 1987.
[5] E. Eleftheriou & D. Falconer, Tracking Properties and Steady-state Performance of RLS Adaptive Algorithms. IEEE Trans. ASSP, vol. ASSP-34, no. 5, Oct. 1986, pp. 1097-1110.
[6] D. Panda & A. Kak, Recursive Least Squares Smoothing of Noise in Images. IEEE Trans. ASSP, vol. ASSP-25, no. 6, Dec. 1977, pp. 520-524.
[7] D. Tufts & R. Kumaresan, Data Adaptive Signal Estimation by Singular Value Decomposition of a Data Matrix. IEEE Proc., vol. 70, no. 6, Jun. 1982, pp. 684-686.

An Adaptive Pitch Estimator for LMS Notch Filters

M.R. Lynch and P.J. Rayner
Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, U.K.

1 Abstract

An adaptive method of pitch estimation is presented which will estimate the pitch of a harmonic signal in noise to a high degree of accuracy, and track any slow drifts in the pitch of the harmonic signal. The approach is based on an adaptive notch filter and hence avoids the problems encountered in the use of integer length adaptive delay lines. The approach is particularly efficient when using an LMS notch filter and will allow increased performance of the notch filter by accurately fixing the reference signal frequency.

Adaptive pitch estimators for harmonic signals in noise can be difficult to design due to the nature of the error surface encountered if one attempts to adapt a fundamentally time stationary system. This approach solves the problem by utilising a time varying system.

An attempt was made to remove a harmonic interference from an archive gramophone recording using a standard FIR LMS adaptive notch filter [1]. This approach worked well in periods in which the harmonic interference was present with background noise of similar amplitude, but failed in periods with music or speech present. The amplitude of the music was many times greater than that of the interference and caused the adaptive notch filter to ring, destroying its interference cancelling properties. As in the case encountered by Lim [3], it was decided to freeze the adaptive notch filter in these regions. However, as these regions persisted for tens of thousands of samples, the reference sinusoids to the adaptive notch filter had to have their pitches accurately determined so that in the frozen periods the interference and the anti-interference output from the FIR filter would stay synchronised.

Fig.1 An adaptive notch filter (signal input, reference oscillator, LMS adaptive filter, error output).
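A minimal sketch of the Fig.1 arrangement may make the later analysis easier to follow: a two-tap filter driven by quadrature reference sinusoids at a known frequency, adapted by LMS, with the error as the notch output. The two-tap quadrature form, the parameter values and the test signal are illustrative assumptions rather than the filter used on the gramophone recordings.

import numpy as np

def lms_notch(signal, w_ref, mu=0.01):
    # Single-frequency LMS adaptive notch filter: the reference oscillator
    # supplies quadrature sinusoids at w_ref (rad/sample); the error output
    # is the input with that frequency notched out.
    w = np.zeros(2)
    out = np.empty_like(signal)
    for n, d in enumerate(signal):
        x = np.array([np.cos(w_ref * n), np.sin(w_ref * n)])   # reference inputs
        y = w @ x                     # adapted estimate of the interference
        e = d - y                     # error = notch-filtered output
        w = w + 2.0 * mu * e * x      # LMS weight update
        out[n] = e
    return out

# Illustrative use: remove a tone at 50 Hz (sample rate 1 kHz) buried in noise
fs, f0 = 1000.0, 50.0
n = np.arange(4000)
rng = np.random.default_rng(2)
sig = np.sin(2 * np.pi * f0 * n / fs) + 0.3 * rng.normal(size=n.size)
cleaned = lms_notch(sig, 2 * np.pi * f0 / fs, mu=0.02)

Freezing the filter amounts to suspending the weight update while continuing to form y and e, which is why the reference frequency must be known accurately over long frozen stretches.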
Thus an accurate and computationally efficient pitch estimator was required which could also track a slowly drifting pitch. Consider an LMS adaptive notch filter with one notch. Glover [2] analysed the LMS adaptive notch filter and showed its behaviour to include time varying aspects under certain conditions.

Fig.2 Notch filter analysis diagram.

Let C_i(z) be the Z-transform of the i-th coefficient of the adapted notch filter, with adaption coefficient α and a reference sinusoid of frequency ω_r and amplitude C. U(z) is the Z-transform of the LMS weight update equation (an integrator), as shown in Fig.2. Glover shows that the pole-zero plot of E(z) would indicate poles at z = e^{±jω_d T}, and that E(z e^{-jω_r T}) represents a counter-clockwise rotation of E(z) through the angle ω_r T. Therefore the pole-zero plot of

E(z e^{-jω_r T}) e^{jθ_i} + E(z e^{jω_r T}) e^{-jθ_i}

would show poles at angles ±(ω_r + ω_d)T and ±(ω_r − ω_d)T. This rotated spectrum is filtered through U(z) to give C_i(z). Each weight therefore consists of the sum and difference frequencies of ω_r and ω_d. U(z) is strongly low pass, hence the difference frequency dominates. For the FIR delay line, θ_i = θ − ω_r T (i − 1), which shows that the coefficients C_i form a sinusoid of frequency ω_r, with each weight varying as a sinusoid at frequency ω_r − ω_d.

Glover refers to this as the sinusoid in the coefficients 'moving' at a rate equal to the difference frequency between the desired and reference frequencies. This may be viewed as a heterodyning process and is a time varying aspect of the solution for an adaptive notch filter; it may also be considered as a simple case of a quasi-stationary filter using a slowly varying phase reference at the desired frequency. It is this time varying aspect that the adaptive pitch estimator (APE) uses. A single notch adaptive filter is constructed and the motion of the sinusoid is detected by the following algorithm:

P_{n+1} = P_n + μ_p