
INTRODUCTION TO THE THEORY OF NEURAL COMPUTATION

John Hertz
NORDITA

Anders Krogh
Niels Bohr Institute

Richard G. Palmer

Duke University and the Santa Fe Institute

Lecture Notes Volume I
SANTA FE INSTITUTE STUDIES IN THE SCIENCES OF COMPLEXITY

CRC Press Taylor & Francis Group Boca Raton London New York

CRC Press is an imprint of the Taylor & Francis Group, an informa business

First published 1991 by Westview Press
Published 2018 by CRC Press, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

CRC Press is an imprint of the Taylor & Francis Group, an informa business

Copyright © 1991 Taylor & Francis Group LLC

No claim to original U.S. Government works

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Hertz, John A.
Introduction to the theory of neural computation / John A. Hertz, Richard G. Palmer, Anders S. Krogh.
p. cm. -- (Santa Fe Institute studies in the sciences of complexity. Lecture notes: v. 1) (Computation and neural systems series)
Includes index.
1. Neural computers. 2. Neural circuitry. I. Palmer, Richard G. II. Krogh, Anders S. III. Title. IV. Series. V. Series: Computation and neural systems series.
QA76.5.H475 1991
006.3--dc20  91-701

ISBN 0-201-51560-1 (pbk.)
ISBN 13: 978-0-201-51560-2 (pbk)

This volume was typeset using TeXtures on a Macintosh II computer.

About the Santa Fe Institute

The Santa Fe Institute (SFI) is a private, independent organization dedicated to multidisciplinary scientific research and graduate education in the natural, computational, and social sciences. The driving force behind its creation in 1984 was the need to understand those complex systems that shape human life and much of our immediate world: evolution, the learning process, the immune system, the world economy. The intent is to make the new tools now being developed at the frontiers of the computational sciences and in the mathematics of nonlinear dynamics more readily available for research in the applied physical, biological, and social sciences.

All titles from the Santa Fe Institute Studies in the Sciences of Complexity series will carry this imprint, the SFI logo, which is based on a Mimbres pottery design (circa A.D. 950-1150), drawn by Betsy Jones.

Santa Fe Institute Editorial Board
September 1990

Dr. L. M. Simmons, Jr., Chair
  Executive Vice President, Santa Fe Institute
Professor Kenneth J. Arrow
  Department of Economics, Stanford University
Professor W. Brian Arthur
  Dean & Virginia Morrison Professor of Population Studies and Economics, Food Research Institute, Stanford University
Dr. David K. Campbell
  Director, Center for Nonlinear Studies, Los Alamos National Laboratory
Dr. George A. Cowan
  President, Santa Fe Institute and Senior Fellow Emeritus, Los Alamos National Laboratory
Professor Marcus W. Feldman
  Director, Institute for Population & Resource Studies, Stanford University
Professor Murray Gell-Mann
  Division of Physics & Astronomy, California Institute of Technology
Professor John H. Holland
  Division of Computer Science & Engineering, University of Michigan
Dr. Bela Julesz
  Head, Visual Perception Research, AT&T Bell Laboratories
Professor Stuart Kauffman
  School of Medicine, University of Pennsylvania
Dr. Alan S. Perelson
  Theoretical Division, Los Alamos National Laboratory
Professor David Pines
  Department of Physics, University of Illinois

Santa Fe Institute Studies in the Sciences of Complexity

PROCEEDINGS VOLUMES
I     David Pines (ed.), Emerging Syntheses in Science, 1987
II    Alan S. Perelson (ed.), Theoretical Immunology, Part One, 1988
III   Alan S. Perelson (ed.), Theoretical Immunology, Part Two, 1988
IV    Gary D. Doolen et al. (eds.), Lattice Gas Methods of Partial Differential Equations, 1989
V     Philip W. Anderson et al. (eds.), The Economy as an Evolving Complex System, 1988
VI    Christopher G. Langton (ed.), Artificial Life: Proceedings of an Interdisciplinary Workshop on the Synthesis and Simulation of Living Systems, 1988
VII   George I. Bell & Thomas G. Marr (eds.), Computers and DNA, 1989
VIII  Wojciech H. Zurek (ed.), Complexity, Entropy, and the Physics of Information, 1990
IX    Alan S. Perelson & Stuart A. Kauffman (eds.), Molecular Evolution on Rugged Landscapes: Proteins, RNA and the Immune System, 1990

LECTURES VOLUMES
I     Daniel L. Stein (ed.), Lectures in the Sciences of Complexity, 1989
II    Erica Jen (ed.), 1989 Lectures in Complex Systems, 1990

LECTURE NOTES VOLUMES
I     John Hertz, Anders Krogh, & Richard Palmer, Introduction to the Theory of Neural Computation, 1991
II    Gerard Weisbuch, Complex Systems Dynamics, 1991

Contents

Series Foreword by L. M. Simmons, Jr.
Foreword by Jack Cowan
Foreword by Christof Koch
Preface

ONE    Introduction
       1.1  Inspiration from Neuroscience
       1.2  History
       1.3  The Issues

TWO    The Hopfield Model
       2.1  The Associative Memory Problem
       2.2  The Model
       2.3  Statistical Mechanics of Magnetic Systems
       2.4  Stochastic Networks
       2.5  Capacity of the Stochastic Network

THREE  Extensions of the Hopfield Model
       3.1  Variations on the Hopfield Model
       3.2  Correlated Patterns
       3.3  Continuous-Valued Units
       3.4  Hardware Implementations
       3.5  Temporal Sequences of Patterns

FOUR   Optimization Problems
       4.1  The Weighted Matching Problem
       4.2  The Travelling Salesman Problem
       4.3  Graph Bipartitioning
       4.4  Optimization Problems in Image Processing

FIVE   Simple Perceptrons
       5.1  Feed-Forward Networks
       5.2  Threshold Units
       5.3  Proof of Convergence of the Perceptron Learning Rule
       5.4  Linear Units
       5.5  Nonlinear Units
       5.6  Stochastic Units
       5.7  Capacity of the Simple Perceptron

SIX    Multi-Layer Networks
       6.1  Back-Propagation
       6.2  Variations on Back-Propagation
       6.3  Examples and Applications
       6.4  Performance of Multi-Layer Feed-Forward Networks
       6.5  A Theoretical Framework for Generalization
       6.6  Optimal Network Architectures

SEVEN  Recurrent Networks
       7.1  Boltzmann Machines
       7.2  Recurrent Back-Propagation
       7.3  Learning Time Sequences
       7.4  Reinforcement Learning

EIGHT  Unsupervised Hebbian Learning
       8.1  Unsupervised Learning
       8.2  One Linear Unit
       8.3  Principal Component Analysis
       8.4  Self-Organizing Feature Extraction

NINE   Unsupervised Competitive Learning
       9.1  Simple Competitive Learning
       9.2  Examples and Applications of Competitive Learning
       9.3  Adaptive Resonance Theory
       9.4  Feature Mapping
       9.5  Theory of Feature Mapping
       9.6  The Travelling Salesman Problem
       9.7  Hybrid Learning Schemes

TEN    Formal Statistical Mechanics of Neural Networks
       10.1 The Hopfield Model
       10.2 Gardner Theory of the Connections

APPENDIX  Statistical Mechanics
       A.1  The Boltzmann-Gibbs Distribution
       A.2  Free Energy and Entropy
       A.3  Stochastic Dynamics

Bibliography
Subject Index
Author Index

Series Foreword

We are witnessing the creation of new sciences of complexity, sciences that may well occupy the center of intellectual life in the twenty-first century. The Santa Fe Institute was founded to assist at the birth of these new sciences. Those involved in this activity are proceeding under the conviction that there is a common set of principles shared by the disparate complex systems under study, that the time is ripe to understand those principles, and that it is essential to develop them and the associated tools for dealing in a systematic way with complex systems.

Complex systems typically do not fit within the confines of one of the traditional disciplines, but require for their successful study knowledge and techniques from several disciplines. Thus one task of the Institute has been to find new ways to encourage cooperative research among scholars from different fields. The Studies in the Sciences of Complexity is one means that the Institute has adopted for accelerating the development of the sciences of complexity. These volumes make available to the scholarly community the results of conferences and workshops sponsored by the Institute, lectures presented in the Complex Systems Summer School, other lecture notes, and monographs by active researchers.

The sciences of complexity are emerging in part as a synthesis of some of the traditional sciences, including biology, computer science, physics, and mathematics. In part they are emerging as a result of new ideas, new questions, and new techniques only recently developed. Among these latter are the emergence of heretofore undreamed of computer power on the scientist's desktop and the not unrelated progress in nonlinear dynamics, computer graphics, and adaptive programs. These newly emerging tools and techniques also offer the prospect of new collaboration between the traditional sciences and the social sciences, a collaboration that will extend modeling techniques to incorporate realistic detailed models of human behavior. Thus, this Series is intended to range broadly across many fields of intellectual endeavor incorporating work in all the areas listed above. The apparently disparate topics, however, share common themes that relate them to the emergent sciences of complexity.

The Santa Fe Institute, and hence this Series, would not exist without the support of farsighted individuals in government funding agencies and private foundations who have recognized the promise of the new approaches to complex systems research being fostered here. It is a pleasure to acknowledge the broad research grants received by the Institute from the Department of Energy, the John D. and Catherine T. MacArthur Foundation, and the National Science Foundation that, together with numerous other grants, have made possible the work of the Institute.

L. M. Simmons, Jr.
Santa Fe, New Mexico
October 1, 1990

Foreword

The past decade has seen an explosive growth in studies of neural networks. In part this was the result of technological advances in personal and main-frame computing, enabling neural network investigators to simulate and test ideas in ways not readily available before 1980. Another major impulse was provided by Hopfield's work on neural networks with symmetric connections. Such networks had previously been dismissed as not brain-like and therefore not worth studying. I myself fell into this trap some twenty-five years ago when I formulated what are now termed the standard equations for studying neural networks, those using the so-called squashing or logistic function. It was to Hopfield's credit that he "stepped back from biological reality", as Toulouse has put it, and uncovered an interesting set of properties and uses for symmetric networks. What followed is an interesting episode in the sociology of science. Hopfield's papers triggered an explosion, particularly in the statistical physics community, leading to a whole series of dramatic advances in the understanding of symmetric networks and their properties, especially in respect of their utility as distributed memory stores, and as solvers of constrained optimization problems, e.g., small versions of the famous Traveling Salesman Problem.

At more-or-less the same time, other developments in neural networks, possibly even more important, were taking place, culminating in the publication by Rumelhart, Hinton, and Williams of the now well-known "Back-Propagation Algorithm" for solving the fundamental problem of training neural networks to compute desired functions, a problem first formulated by Rosenblatt in the late 1950's in his now classical work on Perceptrons. Again this paper triggered a massive explosion of work on trainable neural networks which continues to this day.

The authors of this book, Palmer, Krogh, and Hertz, are statistical physicists who have experienced these developments. They have sought to provide an introduction to the theory behind all the hoopla, and to summarize the current state of the subject. They have, wisely, eschewed neurobiology from their coverage, and have concentrated on what they know best, statistical mechanics, and how it is applied to neural networks. In my opinion they have succeeded admirably in providing a clear and readable account of the statistical mechanical ideas underlying neural networks, including some account of the analogy between neural networks and spin glasses, and the famous Replica Method for analyzing such materials. They have also done justice to Back-Propagation in providing an up-to-date treatment of Recurrent Back-Propagation in its various manifestations. Readers who take the trouble to follow the mathematics outlined in this book will be rewarded with valuable insights into how neural networks really work. One cannot ask for much more in any scientific publication.

Jack Cowan
Mathematics Department
The University of Chicago
September 24, 1990

Foreword

It is quite clear, as convincingly illustrated in this textbook, that the theory underlying learning and computing in networks of linear threshold units has developed into a mature subfield existing somewhere between physics, computer science, and neurobiology. We not only have a growing number of examples where learning techniques are successfully applied to practical problems, such as recognizing handwritten postal mail codes or protein structures, or cases where theories of unsupervised Hebbian learning mimic certain aspects of neuronal development, but now possess a solid understanding of why these algorithms perform so well on certain types of processing or why they fail, why certain features, such as hidden units, are necessary, and how these approaches to learning relate to more traditional methods used in statistics to estimate a poorly sampled or unknown function in the presence of noise. Thus, it appears that neural networks are here to stay after three consecutive cycles of enthusiasm and skepticism, first peaking in the 1940's with McCulloch and Pitts's seminal work, then again in the 1960's with Rosenblatt's perceptron convergence theorem and its denouement by Minsky and Papert, and finally for a third time in the 1980's with Hopfield's energy approach and the modern era of multilayered networks ushered in by the backpropagation learning technique. The influence of the neural network learning paradigm on Artificial Intelligence will be profound, so much so that we will need to modify our basic notion of the Turing test as an operational definition of intelligence to encompass at least some rudimentary learning abilities. At this point in time, it is still too early to describe the long-term effect of neural networks on neurobiology and experimental neuroscience. While neural network analysis has been one of the key impulses behind "computational neuroscience," we have yet to develop specific instances where such an analysis has been used to successfully analyze and understand some real neurobiological circuits.


This monograph succinctly captures these trends and summarizes the current state of the art by way of highlighting the analogies to statistical mechanics and electric circuit theory as well as by discussing various practical applications. It is done without undue emphasis on a formal mathematical treatment, appealing rather to the intuition of the reader. Throughout the book, the emphasis is on those features of neural networks relevant to information processing, storage and recall, that is to computation and function, linking physics to computing machines.

The Computation and Neural Systems Series: Over the past 600 million years, biology has solved the problem of processing massive amounts of noisy and highly redundant information in a constantly changing environment by evolving networks of billions of highly interconnected nerve cells. It is the task of scientists, be they mathematicians, physicists, biologists, psychologists, or computer scientists, to understand the principles underlying information processing in these complex structures. At the same time, researchers in machine vision, pattern recognition, speech understanding, robotics, and other areas of artificial intelligence can profit from understanding features of existing nervous systems. Thus, a new field is emerging: the study of how computations can be carried out in extensive networks of heavily interconnected processing elements, whether these networks are carbon- or silicon-based. Addison-Wesley's new "Computation and Neural Systems" series will reflect the diversity of this field with textbooks, course materials, and monographs on topics ranging from the biophysical modeling of dendrites and neurons, to computational theories of vision and motor control, to the implementation of neural networks using VLSI or optics technology, to the study of highly parallel computational architectures.

Christof Koch
Pasadena, California
September 21, 1990

Preface

We generally like our titles shorter than an Introduction to the Theory of Neural Computation, but all those words are important in understanding our purpose:

Neural Computation
Our subject matter is computation by artificial neural networks. The adjective "neural" is used because much of the inspiration for such networks comes from neuroscience, not because we are concerned with networks of real neurons. Brain modelling is a different field and, though we sometimes describe biological analogies, our prime concern is with what the artificial networks can do, and why. It is arguable that "neural" should be purged from the vocabulary of this field (perhaps Network Computation would have been more accurate in our title) but at present it is firmly ensconced. We do however avoid most other biological terms in non-biological contexts, including "neuron" (unit) and "synapse" (connection).

Theory
We emphasize the theoretical aspects of neural computation. Thus we provide little or no coverage of applications in engineering or computer science; implementations in hardware or software; or implications for cognitive science or artificial intelligence. There are recent books on all these topics and we prefer to complement rather than to compete. On the other hand, we feel that even those whose interest in the subject is completely practical may benefit from a broad theoretical perspective. We are no doubt biased by the fact that we are theorists by trade, but in our own experience we found this background to be essential in using neural networks for practical applications (not described in this book).


Introduction
Our book is intended as an introduction. This has implications at both extremes: where we start from, and how far we go. We try to start from the beginning, and assume little of the reader beyond some mathematical training. We do not assume any prior knowledge of neural networks, or of physics, engineering, or computer science. There are local exceptions to this ideal, but nothing that is central. On the other hand we do not go to the end. The theory built up around neural networks is huge, and we cannot hope to cover it all. We do discuss most of the major architectures and theoretical concepts, but at varying depth. We stop short of very intricate or technical analysis, and of most directions that we consider dead ends.
We aim not at mathematical rigor, but at conveying understanding through mathematics. Understanding should, we feel, consist not only of "knowing what," but also of "knowing how"; especially knowing how to go on [Wittgenstein, 1958]. With that in mind we are usually not satisfied with simply stating or deducing a given result, but instead try to show the reader how to think about it, how to handle and hold it.

Bibliography and Coverage
At the same time we try to provide access to the research literature for further reading. We selected the bibliography with the primary aim of assisting the reader, and only with the secondary aim of attributing credit. When there was a choice we picked the more readable source, especially in references to the associated areas of mathematics, computer science, and physics.
It may be tempting to consider this book as a comprehensive survey of what has been done in the theory of neural computation. It does have that character in some sections, particularly those near the forefront, where we try to describe just who has done what recently. But that is not our overall aim and we have not tried to be complete. Omissions may be for reasons of ignorance, complexity, irrelevance, obscurity, pedagogy, space, or many other reasons. We apologize only for our ignorance.

Approach
Our selection and treatment of material reflects our background as physicists. This background has helped us to understand how these complex systems function, often in terms of physical analogies. Others might find easier paths into the subject area from computer science, statistics, or psychology, and there could be written equally good or better books along these lines. We tell the story the way that we are best able to understand it, and hope that our readers find the perspective enlightening.
We often view the analysis of artificial neural networks as a statistical mechanics problem.


Like many systems in condensed matter physics, these networks are large collections of interacting entities with emergent properties, and it should not be surprising that related mathematical tools are useful. Nevertheless, this is a new feature in the engineering world, where one normally expects every minute aspect of the operation of machinery (or software) to follow an explicit plan or program. But systems are becoming so complex that this kind of detailed engineering is neither possible nor desirable. Some of the networks described in this book illustrate how systems can design themselves with relatively little external guidance. The user will never know all the internal details, but will need methods like those we describe to analyze how the whole thing works.
In our experience some people, even among those working on neural networks, appear to be frightened by the prospect of having to learn statistical mechanics. They also question its necessity. Often, however, their equations turn out to be exactly the same ones that we would have written. Anyone trying to analyze the typical behavior of a many-component system is doing statistical mechanics whether it is called that or not. We hope that the doubters will not be put off by a few partition functions here and there, but will benefit by seeing many problems put in a more unified perspective.
This is not to say that we employ only a statistical mechanics or "physics" approach, or that one has to be a physicist to read this book. Far from it. Explicit statistical mechanics is used only rarely, and is explained where it arises (and in the Appendix). It often underlies and motivates our approach, but is not often visible at the surface. And one certainly does not need to be a physicist; we have tried hard not to assume anything (besides mathematics) of the reader, and to avoid or explain words and ideas specific to physics.

Prerequisites
There are no prerequisites besides mathematics. The mathematical level varies somewhat, with more required for the starred (*) sections and for Chapter 10 than for the rest. These sections may be omitted however without loss of continuity. On the whole we expect our readers to know something about multi-variate calculus, ordinary differential equations, basic probability and statistics, and linear algebra. But in most of these areas a general familiarity should be enough. The exception for some readers may be linear algebra, where we use vectors and matrices, inner products, matrix inversion, eigenvalues and eigenvectors, and diagonalization of quadratic forms. However, eigenvalues and eigenvectors are not used in any essential way until Chapter 8, and diagonalization appears only in starred sections. Kohonen [1989] provides an introduction to the linear algebra needed for neural network theory.

Acknowledgments
This book began as lecture notes for a half-semester course given at Duke University (and broadcast on the North Carolina teleclassroom network) in the spring of 1988.


The audience for these lectures was very broad, including people from neurosciences and cognitive sciences as well as computer science, engineering, and physics. Later versions were used as the basis of summer school lectures at Santa Fe in June 1988 [Palmer, 1989], and for a one-semester course for physics and computer science students at the University of Copenhagen in the fall of 1989. We thank all of the students in all of these courses for constructive feedback that led to successive improvements.
We owe a debt of gratitude to many of our colleagues in Durham, Copenhagen, and Santa Fe who encouraged, supported and helped us in this work. These include Ajay, Alan, Benny, Corinna, Dave, Ingi, David, Frank, Gevene, Jack, John, Jun, Kurt, Lars, Marjorie, Mike, Per, Ronda, Søren, Stu, Tamas, and Xiang. Two of us (JH and AK) also thank the Physics Department at Duke for their hospitality in the spring of 1988, when this whole enterprise got started. AK thanks the Carlsberg Foundation for generous financial support.
Finally, we reserve our deepest appreciation for our wives and families. It is a hackneyed theme to thank loved ones for patience and understanding while a book was being written; but now we know why, and do give heartfelt thanks.

Richard Palmer
Anders Krogh
John Hertz
Durham and Copenhagen, August 1990

Electronic mail addresses for the authors:

palmer@phy.duke.edu
askrogh@nbivax.nbi.dk
hertz@nordita.dk

ONE

Introduction

Anyone can see that the human brain is superior to a digital computer at many tasks. A good example is the processing of visual information: a one-year-old baby is much better and faster at recognizing objects, faces, and so on than even the most advanced AI system running on the fastest supercomputer. The brain has many other features that would be desirable in artificial systems:

■ It is robust and fault tolerant. Nerve cells in the brain die every day without affecting its performance significantly.

■ It is flexible. It can easily adjust to a new environment by "learning"; it does not have to be programmed in Pascal, Fortran or C.

■ It can deal with information that is fuzzy, probabilistic, noisy, or inconsistent.

■ It is highly parallel.

■ It is small, compact, and dissipates very little power.

Only in tasks based primarily on simple arithmetic does the computer outperform the brain!
This is the real motivation for studying neural computation. It is an alternative computational paradigm to the usual one (based on a programmed instruction sequence), which was introduced by von Neumann and has been used as the basis of almost all machine computation to date. It is inspired by knowledge from neuroscience, though it does not try to be biologically realistic in detail. It draws its methods in large degree from statistical physics, and that is why the lectures on which this book is based originally formed part of a physics course. Its potential applications lie of course mainly in computer science and engineering. In addition it may be of value as a modelling paradigm in neuroscience and in sensory and cognitive psychology.


FIGURE 1.1 Schematic drawing of a typical neuron.

The field is also known as neural networks, neurocomputation, associative networks, collective computation, connectionism, and probably many other things. We will use all these terms freely.

1.1 Inspiration from Neuroscience
Today's research in neural computation is largely motivated by the possibility of making artificial computing networks. Yet, as the term "neural network" implies, it was originally aimed more towards modelling networks of real neurons in the brain. The models are extremely simplified when seen from a neurophysiological point of view, though we believe that they are still valuable for gaining insight into the principles of biological "computation." Just as most of the details of the separate parts of a large ship are unimportant in understanding the behavior of the ship (e.g., that it floats, or transports cargo), so many details of single nerve cells may be unimportant in understanding the collective behavior of a network of cells.

Neurons
The brain is composed of about $10^{11}$ neurons (nerve cells) of many different types. Figure 1.1 is a schematic drawing of a single neuron. Tree-like networks of nerve fiber called dendrites are connected to the cell body or soma, where the cell nucleus is located. Extending from the cell body is a single long fiber called the axon, which eventually branches or arborizes into strands and substrands. At the ends of these are the transmitting ends of the synaptic junctions, or synapses, to other neurons.


FIGURE 1.2 Schematic diagram of a McCulloch-Pitts neuron. The unit fires if the weighted sum $\sum_j w_{ij} n_j$ of the inputs reaches or exceeds the threshold $\mu_i$.

The receiving ends of these junctions on other cells can be found both on the dendrites and on the cell bodies themselves. The axon of a typical neuron makes a few thousand synapses with other neurons.
The transmission of a signal from one cell to another at a synapse is a complex chemical process in which specific transmitter substances are released from the sending side of the junction. The effect is to raise or lower the electrical potential inside the body of the receiving cell. If this potential reaches a threshold, a pulse or action potential of fixed strength and duration is sent down the axon. We then say that the cell has "fired". The pulse branches out through the axonal arborization to synaptic junctions to other cells. After firing, the cell has to wait for a time called the refractory period before it can fire again.
McCulloch and Pitts [1943] proposed a simple model of a neuron as a binary threshold unit. Specifically, the model neuron computes a weighted sum of its inputs from other units, and outputs a one or a zero according to whether this sum is above or below a certain threshold:

$$n_i(t+1) = \Theta\Bigl(\sum_j w_{ij}\, n_j(t) - \mu_i\Bigr). \tag{1.1}$$

See Fig. 1.2. Here $n_i$ is either 1 or 0, and represents the state of neuron $i$ as firing or not firing respectively. Time $t$ is taken as discrete, with one time unit elapsing per processing step. $\Theta(x)$ is the unit step function, or Heaviside function:

$$\Theta(x) = \begin{cases} 1 & \text{if } x \ge 0; \\ 0 & \text{otherwise.} \end{cases} \tag{1.2}$$

The weight $w_{ij}$ represents the strength of the synapse connecting neuron $j$ to neuron $i$. It can be positive or negative, corresponding to an excitatory or inhibitory synapse respectively. It is zero if there is no synapse between $i$ and $j$. The cell-specific parameter $\mu_i$ is the threshold value for unit $i$; the weighted sum of inputs must reach or exceed the threshold for the neuron to fire.
Though simple, a McCulloch-Pitts neuron is computationally a powerful device. McCulloch and Pitts proved that a synchronous assembly of such neurons is capable in principle of universal computation for suitably chosen weights $w_{ij}$. This means that it can perform any computation that an ordinary digital computer can, though not necessarily so rapidly or conveniently.
Real neurons involve many complications omitted from this simple description. The most significant ones include:




■ Real neurons are often not even approximately threshold devices as described above. Instead they respond to their input in a continuous way. This is sometimes referred to as a graded response. But the nonlinear relationship between the input and the output of a cell is a universal feature. Our working hypothesis is that it is the nonlinearity that is essential, not its specific form. In any case, continuous-valued units can be modelled too, and are sometimes more convenient to deal with than threshold units.

■ Many real cells also perform a nonlinear summation of their inputs, which takes us a bit further from the McCulloch-Pitts picture. There can even be significant logical processing (e.g., AND, OR, NOT) within the dendritic tree. This can in principle be taken care of by using several formal McCulloch-Pitts neurons to represent a single real one, though there has been little work along these lines so far. We will generally ignore this complication, since the simple McCulloch-Pitts picture is already very rich and interesting to study.

■ A real neuron produces a sequence of pulses, not a simple output level. Representing the firing rate by a single number like $n_i$, even if continuous, ignores much information (such as pulse phase) that might be carried by such a pulse sequence. The majority of experts do not think that phase plays a significant role in most neuronal circuits, but agreement is incomplete.

■ Neurons do not all have the same fixed delay ($t \to t+1$). Nor are they updated synchronously by a central clock. We will in fact use asynchronous updating in much of this book.

■ The amount of transmitter substance released at a synapse may vary unpredictably. This sort of effect can be modelled, at least crudely, by a stochastic generalization of the McCulloch-Pitts dynamics.
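Before turning to the generalization of (1.1) discussed next, the binary threshold rule itself is easy to state in code. The following sketch is our own minimal illustration, not code from the text; the two-unit weight matrix and thresholds are arbitrary choices made only to give the update loop something concrete to do.

import numpy as np

def theta(x):
    # Unit step function (1.2): 1 where x >= 0, else 0.
    return (x >= 0).astype(int)

def mcculloch_pitts_step(n, w, mu):
    # One synchronous step of rule (1.1): n_i(t+1) = Theta(sum_j w_ij n_j(t) - mu_i).
    return theta(w @ n - mu)

# Illustrative two-unit network with mutual inhibition (our choice, not from the text).
w = np.array([[0.0, -1.0],
              [-1.0, 0.0]])
mu = np.array([-0.5, -0.5])   # negative thresholds: an uninhibited unit fires
n = np.array([1, 0])          # initial state

for t in range(4):
    n = mcculloch_pitts_step(n, w, mu)
    print(t + 1, n)            # the state [1 0] is stable under this rule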

A simple generalization of the McCulloch-Pitts equation (1.1) which includes some of these features is

$$n_i := g\Bigl(\sum_j w_{ij}\, n_j - \mu_i\Bigr). \tag{1.3}$$

The number $n_i$ is now continuous-valued and is called the state or activation of unit $i$. The threshold function $\Theta(x)$ of (1.1) has been replaced by a more general nonlinear function $g(x)$ called the activation function, gain function, transfer function, or squashing function. Rather than writing the time $t$ or $t+1$ explicitly as we did in (1.1), we now simply give a rule for updating $n_i$ whenever that occurs.¹ Units are often updated asynchronously: in random order at random times.

¹ Note that we use the symbol := to emphasize that the right-hand side is assigned to the left-hand side upon update; the equality is not continuously true.
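As a concrete reading of (1.3) and of asynchronous updating, here is a minimal sketch of our own; the logistic (sigmoid) squashing function, the random weights, and the small network size are illustrative assumptions, not choices made in the text.

import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # One common choice of squashing function: the logistic (sigmoid) curve.
    return 1.0 / (1.0 + np.exp(-x))

N = 5
w = rng.normal(size=(N, N))    # arbitrary illustrative weights
mu = np.zeros(N)               # thresholds
n = rng.uniform(size=N)        # continuous-valued activations

# Asynchronous updating: visit units in random order and apply rule (1.3) in place.
for sweep in range(3):
    for i in rng.permutation(N):
        n[i] = g(w[i] @ n - mu[i])
    print(sweep, np.round(n, 3))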

Nowhere in this book do we attempt a detailed description of networks of real neurons, or of other neurobiological structures or phenomena. Kandel and Schwartz [1985] give an excellent introduction. We do sometimes appeal to biological realism, and do describe a few models of cortical organization in Chapters 8 and 9, but the emphasis is generally on the computational abilities of network models, not on their direct applicability to brain modelling. Nevertheless, despite the intimidating detail and complexity of real brains, we do believe that the kind of theory discussed in this book is relevant to neuroscience. But the connection is not so much at the level of detailed modelling as at the level of algorithms and representation [Marr, 1982]. That is, this kind of approach can help in formulating and testing what sort of computational algorithms the brain is using in different tasks. While the biological and artificial implementations of the algorithms are very different, there can be many features in common at the algorithmic level.
When discussing artificial neural networks it remains commonplace to talk of "neurons" and "synapses", even though the network components are far simpler than their biological counterparts. We prefer to use the terms "units" and "connections" (or "weights") except when discussing networks that are intended as direct models of brain structures. Other terms in use for the units include "processing elements" and "neurodes".

Parallel Processing
In computer science terms, we can describe the brain as a parallel system of about $10^{11}$ processors. Using the simplified model (1.3) above, each processor has a very simple program: it computes a weighted sum of the input data from other processors and then outputs a single number, a nonlinear function of this weighted sum. This output is then sent to other processors, which are continually doing the same kind of calculation. They are using different weights and possibly different gain functions; the coefficients $w_{ij}$ are in general different for different $i$, and we could also make $g(x)$ be site-dependent. These weights and gain functions can be thought of as local data stored by the processors.
The high connectivity of the network (i.e., the fact that there are many terms in the sum in (1.3)) means that errors in a few terms will probably be inconsequential. This tells us that such a system can be expected to be robust and its performance will degrade gracefully in the presence of noise or errors. In the brain itself cells die all the time without affecting the function, and this robustness of the biological neural networks has probably been essential to the evolution of intelligence.
The contrast between this kind of processing and the conventional von Neumann kind could not be stronger. Here we have very many processors, each executing a very simple program, instead of the conventional situation where one or at most a few processors execute very complicated programs. And in contrast to the robustness of a neural network, an ordinary sequential computation may easily be ruined by a single bit error.
It is worth remarking that the typical cycle time of neurons is a few milliseconds, which is about a million times slower than their silicon counterparts, semiconductor gates. Nevertheless, the brain can do very fast processing for tasks like vision, motor control, and decisions on the basis of incomplete and noisy data, tasks that are far beyond the capacity of a Cray supercomputer. This is obviously possible only because billions of neurons operate simultaneously. Imagine the capabilities of a system which could operate in parallel like this but with switching times of current semiconductor devices!

FIGURE 1.3 A two-layer perceptron.

1.2 History
The history of these sorts of ideas in psychology originates with Aristotle. Yet as a basis for computational or neural modelling we can trace them to the paper of McCulloch and Pitts [1943], which introduced the model described above. During the next fifteen years there was considerable work on the detailed logic of threshold networks. They were realized to be capable of universal computation and were analyzed as finite-state machines; see Minsky [1967]. The problem of making a reliable network with unreliable parts was solved by the use of redundancy [von Neumann, 1956], leading later to distributed redundant representations [Winograd and Cowan, 1963].
At the opposite extreme to detailed logic, continuum theories were also developed. Known as neurodynamics or neural field theory, this approach used differential equations to describe activity patterns in bulk neural matter [Rashevsky, 1938; Wiener, 1948; Beurle, 1956; Wilson and Cowan, 1973; Amari, 1977].
Around 1960 there was a wave of activity centered around the group of Frank Rosenblatt, focusing on the problem of how to find appropriate weights $w_{ij}$ for particular computational tasks. They concentrated on networks called perceptrons, in which the units were organized into layers with feed-forward connections between one layer and the next. An example is shown in Fig. 1.3. Very similar networks called adalines were invented around the same time by Widrow and Hoff [1960; Widrow, 1962].
For the simplest class of perceptrons without any intermediate layers, Rosenblatt [1962] was able to prove the convergence of a learning algorithm, a way to change the weights iteratively so that a desired computation was performed.


Many people expressed a great deal of enthusiasm and hope that such machines could be a basis for artificial intelligence. There was however a catch to the learning theorem, forcefully pointed out by Minsky and Papert [1969] in their book Perceptrons: the theorem obviously applies only to those problems which the structure is capable of computing. Minsky and Papert showed that some rather elementary computations could not be done by Rosenblatt's one-layer² perceptron. The simplest example is the exclusive or (XOR) problem: a single output unit is required to turn on (n = +1) if one or the other of two input lines is on, but not when neither or both inputs are on.
Rosenblatt had also studied structures with more layers of units and believed that they could overcome the limitations of the simple perceptrons. However, there was no learning algorithm known which could determine the weights necessary to implement a given calculation. Minsky and Papert doubted that one could be found and thought it more profitable to explore other approaches to artificial intelligence. With this most of the computer science community left the neural network paradigm for almost 20 years.
Still, there were a number of people who continued to develop neural network theory in the 1970's. A major theme was associative content-addressable memory, in which different input patterns become associated with one another (i.e., trigger the same response) if sufficiently similar. These had actually been proposed much earlier [Taylor, 1956; Steinbuch, 1961], and were later revived or rediscovered by Anderson [1968, 1970; Anderson and Mozer, 1981], Willshaw et al. [1969], Marr [1969, 1971] and Kohonen [1974-1989]. Grossberg [1967-1987] made a comprehensive reformulation of the general problem of learning in networks. Marr [1969, 1970, 1971] developed network theories of the cerebellum, cerebral neocortex, and hippocampus, assigning specific functions to each type of neuron. A number of people, including Marr [1982], von der Malsburg [1973], and Cooper [1973; Nass and Cooper, 1975], studied the development and functioning of the visual system.
Another thread of development can be traced to Cragg and Temperley [1954, 1955]. They reformulated the McCulloch-Pitts network as a spin (magnetic) system of the sort familiar in physics. Memory was believed to reside in the hysteresis of the domain patterns expected for such a system. Caianiello [1961] then constructed a statistical theory, using ideas from statistical mechanics, and incorporated learning in a way which drew on the ideas of Hebb [1949] about learning in the brain. The same theme was taken up in the 1970's by Little [1974; Little and Shaw, 1975, 1978] and again in 1981 by Hopfield [1982]. Hopfield was able to add some helpful physical insight by introducing an energy function, and by emphasizing the notion of memories as dynamically stable attractors. Hinton and Sejnowski [1983, 1986] and Peretto [1984] constructed formulations using stochastic units which follow the dynamics (1.1) or (1.3) only approximately, making "mistakes" with a certain probability analogous to temperature in statistical mechanics.

² We never count input lines as units in numbering layers. Figure 1.3 is thus a two-layer network. Until recently it would often have been called a three-layer network, but the convention is changing.


The real power of statistical mechanics was then brought to bear on the stochastic network problem by Amit et al. [1985a, b; Amit, 1989], using methods developed in the theory of random magnetic systems called spin glasses.
Perhaps the most influential development in this decade, however, takes up the old thread of Rosenblatt's perceptrons where it was cut 20 years ago. Various people have developed an algorithm which works quite well for adjusting the weights connecting units in successive layers of multi-layer perceptrons. Known as back-propagation, it appears to have been found first by Werbos [1974] in the mid-70's, and then independently rediscovered around 1985 by Rumelhart, Hinton, and Williams [1986a, b], and by Parker [1985]. Le Cun [1985] also proposed a related algorithm. Though not yet the holy grail of a completely general algorithm able to teach an arbitrary computational task to a network, it can solve many problems (such as XOR) which the simple one-layer perceptrons could not. Much current activity is centered on back-propagation and its extensions.
Many of the important early papers have been collected in Anderson and Rosenfeld [1988], including many of those mentioned here. This is an excellent collection for those interested in the history of neural networks. We also recommend the review article by Cowan and Sharp [1988a, b], which we drew on for this section.

1.3 The Issues
Massive parallelism in computational networks is extremely attractive in principle. But in practice there are many issues to be decided before a successful implementation can be achieved for a given problem:

■ What is the best architecture? Should the units be divided into layers, or not? How many connections should be made between units, and how should they be organized? What sort of activation functions $g(x)$ should be used? What type of updating should be used: synchronous or asynchronous, deterministic or stochastic? How many units are needed for a given task?

■ How can a network be programmed? Can it learn a task or must it be predesigned? If it can learn a task, how many examples are needed for good performance? How many times must it go through the examples? Does it need the right answers during training, or can it learn from correct/incorrect reinforcement? Can it learn in real-time while functioning, or must the training phase be separated from the performance phase?

■ What can the various types of network do? How many different tasks can they learn? How well? How fast? How robust are they to missing information, incorrect data, and unit removal or malfunction? Can they generalize from known tasks or examples to unknown ones? What classes of input-to-output functions can they represent?

■ How can a network be built in hardware? What are the advantages and disadvantages of different hardware implementations, and how do they compare to simulation in software?

These questions are obviously coupled and cannot be answered independently. The architecture, for instance, strongly influences what the network can do, and what hardware options are available. Much of this book will be concerned with refining and answering the above questions. However we will generally approach them from a theoretical point of view, rather than from a design one. That is, we will attempt to understand the behavior of networks as a function of their architecture, and only rarely raise the question of designing networks to fulfill particular goals. But of course the two viewpoints are not independent, and a strong understanding of principles is invaluable for good design.
Three of the issues raised above deserve a little more comment here, as general background before we become involved in details.

Hardware
Almost everything in the field of neural computation has been done by simulating the networks on serial computers, or by theoretical analysis. Neural network VLSI chips are far behind the models, as is natural at this point. The main problem with making neural network chips is that one needs a lot of connections, often some fraction of the square of the number of units. The space taken up by the connections is usually the limiting factor for the size of a network. The neural chips made so far contain of the order of 100 units, which is too few for most practical applications. Potential alternatives to integrated circuit chips include optical computers. The field is very young, but electro-optical and optical associative memories have already been proposed or built. Efficient hardware is crucially important in the long term if we are going to take full advantage of the capabilities of neural networks, and there is growing activity in this area. However, it is largely beyond the scope of this book; we return to hardware issues only briefly in Section 3.4.

Generalization
The reason for much of the excitement about neural networks is their ability to generalize to new situations. After being trained on a number of examples of a relationship, they can often induce a complete relationship that interpolates and extrapolates from the examples in a sensible way. But what is meant by sensible generalization is often not clear. In many problems there are almost infinitely many possible generalizations. How does a neural network, or a human for that matter, choose the "right" one?


As an example one could train a neural network on three of the four XOR relations mentioned earlier, and it would be very unlikely that any of the known types of networks would actually generalize to the full XOR. Nevertheless neural networks commonly make very useful generalizations that would be judged sensible in human terms.

Programming
Like most of the work done in neural networks, much of this book is concerned with the problem of programming or learning: how do we choose the connection weights so the network can do a specific task? We will encounter some examples where we can choose the weights a priori if we are a little clever. This embeds some information into the network by design. But such problems are the exception rather than the rule. In other cases we can often "teach" the network to perform the desired computation by iterative adjustments of the $w_{ij}$ strengths. This may be done in two main ways:

■ Supervised learning. Here the learning is done on the basis of direct comparison of the output of the network with known correct answers. This is sometimes called learning with a teacher. It includes the special case of reinforcement learning, where the only feedback is whether each output is correct or incorrect, not what the correct answer is.

■ Unsupervised learning. Sometimes the learning goal is not defined at all in terms of specific correct examples. The only available information is in the correlations of the input data or signals. The network is expected to create categories from these correlations, and to produce output signals corresponding to the input category.

There are many exciting implications of the possibility of training a network to do a computation. Instead of having to specify every detail of a calculation, we simply have to compile a training set of representative examples. This means that we can hope to treat problems where appropriate rules are very hard to know in advance, as in expert systems and robotics. It may also spare us a lot of tedious (and expensive) software design and programming even when we do have explicit rules. John Denker has remarked that "neural networks are the second best way of doing just about anything." The best way is to find and use the right rules or the optimum algorithm for each particular problem, but this can be inordinately expensive and time consuming. There is plenty of scope for a second best approach based on learning by example.

TWO

The Hopfield Model

2.1 The Associative Memory Problem
Associative memory is the "fruit fly" or "Bohr atom" problem of this field. It illustrates in about the simplest possible manner the way that collective computation can work. The basic problem is this: Store a set of $p$ patterns $\xi_i^\mu$ in such a way that when presented with a new pattern $\zeta_i$, the network responds by producing whichever one of the stored patterns most closely resembles $\zeta_i$.
The patterns are labelled by $\mu = 1, 2, \ldots, p$, while the units in the network are labelled by $i = 1, 2, \ldots, N$. Both the stored patterns $\xi_i^\mu$ and the test patterns $\zeta_i$ can be taken to be either 0 or 1 on each site $i$, though we will adopt a different convention shortly.
We could of course do this serially in a conventional computer simply by storing a list of the patterns, writing a program which computed the Hamming distance¹

$$\sum_i \bigl[\xi_i^\mu (1 - \zeta_i) + (1 - \xi_i^\mu)\,\zeta_i\bigr] \tag{2.1}$$

between the test pattern and each of the stored patterns, finding which of them was smallest, and printing the corresponding stored pattern out.

¹ The Hamming distance between two binary numbers means the number of bits that are different in the two numbers.
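That serial procedure takes only a few lines to write down explicitly. The following sketch is our own illustration, not code from the book; the particular stored patterns and the corrupted probe are invented for the example, and the patterns are kept in the 0/1 convention used so far.

import numpy as np

def hamming(xi, zeta):
    # Hamming distance (2.1) between two 0/1 patterns.
    return int(np.sum(xi * (1 - zeta) + (1 - xi) * zeta))

def nearest_stored_pattern(patterns, zeta):
    # Serial associative recall: return the stored pattern closest to the probe.
    distances = [hamming(xi, zeta) for xi in patterns]
    return patterns[int(np.argmin(distances))], distances

# Invented example: p = 3 stored patterns of N = 8 bits; probe = pattern 2 with one bit flipped.
patterns = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
                     [1, 1, 1, 1, 0, 0, 0, 0],
                     [0, 0, 1, 1, 0, 0, 1, 1]])
zeta = np.array([1, 1, 1, 1, 0, 1, 0, 0])

best, dists = nearest_stored_pattern(patterns, zeta)
print(dists)   # [5, 1, 5]
print(best)    # [1 1 1 1 0 0 0 0]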


FIGURE 2.1 Example of how an associative memory can reconstruct images. These are binary images with 130 x 180 pixels. The images on the right were recalled by the memory after presentation of the corrupted images shown on the left. The middle column shows some intermediate states. A sparsely connected Hopfield network with seven stored images was used.

will make the network go to the state with rzt* = where it is pattern number Ho that is the smallest distance (2.1) from £,*. Thus we want the memory to be c o n te n t-a d d r e s s a b le and insensitive to small errors in the input pattern. A content-addressable memory can be quite powerful. Suppose, for example, we store coded information about many famous scientists in a network. Then the starting pattern “evolution” should be sufficient to recall everything about Darwin, and “ E = me3” should recall Einstein, despite the error in the input pattern. Note that som e pattern will always be retrieved for any clue (unless we invent a “don’t know” pattern); the network will never retrieve a linear combination of, say, Darwin and Wallace in response to “evolution” but will pick the best match according to what has been stored. This depends on the nonlinearity o f the network, and obviously has advantages for many practical applications. Other com mon examples o f applications for an associative memory are recog­ nition and reconstruction o f images (see Fig. 2.1), and retrieval o f bibliographic information from partial references (such as from an incomplete title o f a paper). Figure 2.2 shows schematically the function o f the dynamic associative (or content-addressable) memories that we construct in this chapter. The space o f all possible states o f the network— the c o n fig u r a tio n s p a ce — is represented by the region drawn. W ithin that space the stored patterns £? are a ttr a c to r s . The dy­ namics o f the system carries starting points into one o f the attractors, as shown by the trajectories sketched. The whole configuration space is thus divided up into


FIGURE 2.2 Schematic configuration space of a model with three attractors.

basins of attraction of the different attractors. This picture is very idealized, and in particular the space should really be a discrete set of points (on a hypercube), not a continuous region. But it is nevertheless a very useful image to keep in mind.

In the next section we treat the basic Hopfield [1982] model of associative memory. In Section 2.3 we turn to statistical mechanics, studying some magnetic systems that are analogous to our networks. Then in Section 2.4 we define a stochastic version of the original model, and analyze it using statistical mechanics methods. Finally, Section 2.5 presents a heuristic derivation of the famous 0.138N capacity of the Hopfield model. Various embellishments and generalizations of the basic model are discussed in the next chapter.

2.2 The Model

For mathematical convenience we now transform to a formulation where the activation values of the units are +1 (firing) and −1 (not firing) instead of 1 and 0. We denote² them by S_i rather than n_i. Conversion to and from the n_i = 0 or 1 notation is easy via S_i = 2n_i − 1. The dynamics of the network corresponding to (1.1) or (1.3) now reads

S_i := \mathrm{sgn}\Bigl( \sum_j w_{ij} S_j - \theta_i \Bigr)    (2.2)

where we take the sign function sgn(x) (illustrated in Fig. 2.3) to be

\mathrm{sgn}(x) = \begin{cases} +1 & \text{if } x \ge 0; \\ -1 & \text{if } x < 0 \end{cases}    (2.3)

²We reserve the symbol S_i for ±1 units throughout this book.


FIGURE 2.3 The function sgn(x).

and the threshold θ_i is related to the μ_i in (1.1) by θ_i = 2μ_i − Σ_j w_ij. In the rest of this chapter we drop these threshold terms, taking θ_i = 0, because they are not useful with the random patterns that we will consider. Thus we use

S_i := \mathrm{sgn}\Bigl( \sum_j w_{ij} S_j \Bigr) .    (2.4)

There are at least two ways in which we might carry out the updating specified by (2.4). We could do it synchronously, updating all units simultaneously at each time step. Or we could do it asynchronously, updating them one at a time. Both kinds of models are interesting, but the asynchronous choice is more natural for both brains and artificial networks. The synchronous choice requires a central clock or pacemaker, and is potentially sensitive to timing errors. In the asynchronous case, which we adopt henceforth, we can proceed in either of two ways:

■ At each time step, select at random a unit i to be updated, and apply the rule (2.4).

■ Let each unit independently choose to update itself according to (2.4), with some constant probability per unit time.

These choices are equivalent (except for the distribution of update intervals) because the second gives a random sequence; there is vanishingly small probability of two units choosing to update at exactly the same moment. The first choice is appropriate for simulation, with central control, while the second is appropriate for autonomous hardware units.

We also have to specify for how long (for how many updatings) we will allow the network to evolve before demanding that its units' values give the desired stored pattern. One possibility in the case of synchronous updating is to require that the network go to the correct memorized pattern right away on the first iteration. In the present discussion (using asynchronous updating) we demand only that the network settle eventually into a stable configuration, one for which no S_i changes any more.

Rather than study a specific problem such as memorizing a particular set of pictures, we examine the more generic problem of a random set of patterns drawn from a distribution. For convenience we will usually take the patterns to be made


up of independent bits which can each take on the values +1 and −1 with equal probability. More general situations are discussed in Section 3.2. Our procedure for testing whether a proposed form of w_ij is acceptable is first to see whether the patterns to be memorized are themselves stable, and then to check whether small deviations from these patterns are corrected as the network evolves.

One Pattern

To motivate our choice for the connection weights, we consider first the simple case where there is just one pattern ξ_i that we want to memorize. The condition for this pattern to be stable is just

\mathrm{sgn}\Bigl( \sum_j w_{ij}\, \xi_j \Bigr) = \xi_i \qquad \text{(for all } i\text{)}    (2.5)

because then the rule (2.4) produces no changes. It is easy to see that this is true if we take

w_{ij} \propto \xi_i \xi_j    (2.6)

since ξ_j² = 1. For later convenience we take the constant of proportionality to be 1/N, where N is the number of units in the network, giving

w_{ij} = \frac{1}{N}\, \xi_i \xi_j .    (2.7)

Furthermore, it is also obvious that even if a number (fewer than half) of the bits of the starting pattern S_i are wrong (i.e., not equal to ξ_i), they will be overwhelmed in the sum for the net input

h_i = \sum_j w_{ij} S_j = \frac{1}{N}\, \xi_i \sum_j \xi_j S_j    (2.8)

by the majority that are right, and sgn(h_i) will still give ξ_i. An initial configuration near (in Hamming distance) to ξ_i will therefore quickly relax to ξ_i. This means that the network will correct errors as desired, and we can say that the pattern ξ_i is an attractor.

Actually there are two attractors in this simple case; the other one is at −ξ_i. This is called a reversed state. All starting configurations with more than half the bits different from the original pattern will end up in the reversed state. The configuration space is symmetrically divided into two basins of attraction, as shown in Fig. 2.4.


FIGURE 2.4 Schematic configuration space for the one-pattern case, including the reversed state.

Many Patterns

This is fine for one pattern, but how do we get the system to recall the most similar of many patterns? The simplest answer is just to make a superposition of terms like (2.7), one for each pattern:

w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu .    (2.9)

Here p is the total number of stored patterns labelled by μ.

This is usually called the "Hebb rule" or the "generalized Hebb rule" because of the similarity between (2.9) and a hypothesis made by Hebb [1949] about the way in which synaptic strengths in the brain change in response to experience: Hebb suggested changes proportional to the correlation between the firing of the pre- and post-synaptic neurons. If we apply our set of patterns to the network during a training phase, and adjust the w_ij strengths according to such pre/post correlations, we arrive directly at (2.9). Technically, however, (2.9) goes beyond Hebb's original hypothesis because it changes the weights positively when neither of the units is firing (ξ_i^μ = ξ_j^μ = −1). This is probably not physiologically reasonable. Equation (2.9) can even cause a particular connection to change from excitatory to inhibitory or vice versa as more patterns are added, which is never believed to occur at real synapses. It is possible to modify the equation in various ways to remedy these defects [Toulouse et al., 1986], but here we use the simple form (2.9) unchanged.

An associative memory model using the Hebb rule (2.9) for all possible pairs ij, with binary units and asynchronous updating, is usually called a Hopfield model. The term is also applied to various generalizations discussed in the next chapter. Although most of the ingredients of the model were known earlier, Hopfield's influential paper [Hopfield, 1982] brought them together, introduced an energy function, and emphasized the idea of stored memories as dynamical attractors. Earlier related models, often also using the Hebb rule, are reviewed by Cowan and Sharp [1988a, b]. Particularly important is the Little model [Little, 1974; Little and Shaw, 1975, 1978], which is based however on synchronous updating.
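To make the prescription concrete, here is a minimal simulation sketch (ours, not from the text) that builds the Hebb-rule weights (2.9) and runs the asynchronous dynamics (2.4) to retrieve a corrupted pattern; the network size, pattern count, and function names are illustrative assumptions.

import numpy as np

def hebb_weights(patterns):
    """Hebb rule (2.9): w_ij = (1/N) * sum_mu xi_i^mu xi_j^mu, with w_ii set to zero."""
    p, N = patterns.shape
    w = patterns.T @ patterns / N
    np.fill_diagonal(w, 0.0)          # drop the self-coupling terms
    return w

def recall(w, state, sweeps=10, rng=None):
    """Asynchronous dynamics (2.4): repeatedly set S_i := sgn(sum_j w_ij S_j)."""
    rng = rng or np.random.default_rng()
    S = state.copy()
    N = len(S)
    for _ in range(sweeps * N):
        i = rng.integers(N)           # pick a unit at random
        S[i] = 1 if w[i] @ S >= 0 else -1
    return S

rng = np.random.default_rng(1)
N, p = 200, 10
xi = rng.choice([-1, 1], size=(p, N))       # random +/-1 patterns
w = hebb_weights(xi)
probe = xi[0].copy()
probe[:20] *= -1                            # corrupt 10% of the bits
print(np.mean(recall(w, probe, rng=rng) == xi[0]))   # close to 1.0 since p/N is small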


Let us examine the stability of a particular pattern ξ_i^ν. The stability condition (2.5) generalizes to

\mathrm{sgn}(h_i^\nu) = \xi_i^\nu \qquad \text{(for all } i\text{)}    (2.10)

where the net input h_i^ν to unit i in pattern ν is

h_i^\nu \equiv \sum_j w_{ij}\, \xi_j^\nu = \frac{1}{N} \sum_j \sum_\mu \xi_i^\mu \xi_j^\mu \xi_j^\nu .    (2.11)

We now separate the sum on μ into the special term μ = ν and all the rest:

h_i^\nu = \xi_i^\nu + \frac{1}{N} \sum_j \sum_{\mu \ne \nu} \xi_i^\mu \xi_j^\mu \xi_j^\nu .    (2.12)

If the second term were zero, we could immediately conclude that pattern number ν was stable according to (2.10). This is still true if the second term is small enough: if its magnitude is smaller than 1 it cannot change the sign of h_i^ν, and (2.10) will still be satisfied.

It turns out that the second term, which we call the crosstalk term, is less than 1 in many cases of interest if p (the number of patterns) is small enough. We will discuss the details shortly; let us assume for now that the crosstalk term is small enough for all i and ν. Then the stored patterns are all stable: if we start the system from one of them it will stay there. Furthermore, a small fraction of bits different from a stored pattern will be corrected in the same way as in the single-pattern case; they are overwhelmed in the sum Σ_j w_ij S_j by the vast majority of correct bits. A configuration near (in Hamming distance) to ξ_i^ν thus relaxes to ξ_i^ν. This shows that the chosen patterns are truly attractors of the system, as already anticipated in Fig. 2.2. The system works as expected as a content-addressable memory.

Storage Capacity

Consider the quantity

C_i^\nu \equiv -\xi_i^\nu\, \frac{1}{N} \sum_j \sum_{\mu \ne \nu} \xi_i^\mu \xi_j^\mu \xi_j^\nu .    (2.13)

This is just −ξ_i^ν times the crosstalk term in (2.12). If C_i^ν is negative the crosstalk term has the same sign as the desired ξ_i^ν term, and thus does no harm. But if C_i^ν is positive and larger than 1, it changes the sign of h_i^ν and makes bit (or unit) i of pattern ν unstable; if we start the system in the desired memory state it will not stay there. The C_i^ν's just depend on the patterns that we attempt to store. For now we consider purely random patterns, with equal probability for ξ_i^μ = +1 and for


−1.

FIGURE 2.5 The distribution of values for the crosstalk C_i^ν given by (2.13). For p random patterns and N units this is a Gaussian with variance σ² = p/N; P_error is given by the shaded area where C_i^ν > 1.

The probability that the crosstalk destabilizes a given bit is then

P_{\mathrm{error}} = \mathrm{Prob}(C_i^\nu > 1) .    (2.14)

Clearly P_error increases as we increase the number p of patterns that we try to store. Choosing a criterion for acceptable performance (e.g., P_error < 0.01) lets us find the storage capacity p_max of the network: the maximum number of patterns that can be stored without unacceptable errors. As we will see, there are actually several different expressions for p_max, depending on the type of criterion we use for P_error.

Let us first calculate P_error. It depends on the number of units N and the number of patterns p. We assume that both N and p are large compared to 1, because this is typically the case and because it makes the mathematics easier. Now C_i^ν is 1/N times the sum of about Np independent random numbers,³ each of which is +1 or −1. From the theory of random coin tosses [Feller, 1968] it has therefore a binomial distribution with mean zero and variance σ² = p/N. But since Np is assumed large this can be approximated by a Gaussian distribution with the same mean and variance, as shown in Fig. 2.5. The probability P_error that C_i^ν exceeds 1 is just the shaded area in Fig. 2.5. Thus

P_{\mathrm{error}} = \tfrac{1}{2}\bigl[ 1 - \mathrm{erf}\bigl( 1/\sqrt{2\sigma^2} \bigr) \bigr] = \tfrac{1}{2}\bigl[ 1 - \mathrm{erf}\bigl( \sqrt{N/2p} \bigr) \bigr]    (2.15)

³There are actually N(p − 1) terms if we include the i = j terms, or (N − 1)(p − 1) terms if we don't, but these are both approximately Np for large N and p.


TABLE 2.1 Capacities

    P_error     p_max/N
    0.001       0.105
    0.0036      0.138
    0.01        0.185
    0.05        0.37
    0.1         0.61

where the error function erf(x) is defined by

\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-u^2)\, du .    (2.16)
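The entries of Table 2.1 can be checked directly from (2.15) and (2.16); the following short sketch (ours, not from the text) evaluates P_error at the tabulated values of p/N.

from math import erf, sqrt

def p_error(alpha):
    """Single-bit error probability (2.15), P_error = 0.5*[1 - erf(sqrt(N/2p))],
    written in terms of alpha = p/N."""
    return 0.5 * (1.0 - erf(sqrt(1.0 / (2.0 * alpha))))

# reproduce Table 2.1: alpha = p_max/N for each error criterion
for alpha in (0.105, 0.138, 0.185, 0.37, 0.61):
    print(f"p/N = {alpha:5.3f}  ->  P_error = {p_error(alpha):.4f}")
# prints roughly 0.001, 0.0036, 0.01, 0.05, 0.1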

Table 2.1 shows the values of p/N required to obtain various values of P_error. Thus if we choose the criterion P_error < 0.01 for example, we arrive at p_max = 0.185N.

This calculation only tells us about the initial stability of the patterns. If we choose p < 0.185N for example, it tells us that no more than 1% of the pattern bits will be unstable initially. But if we start the system in a particular pattern ξ^ν and about 1% of the bits flip, what happens next? It may be that the first few flips will cause more bits to flip. In the worst case there could be an avalanche phenomenon in which more and more bits flip until the final state bears little or no resemblance to the original pattern. So our estimates for p_max are upper bounds; smaller values may be required to keep the final attractors close to the desired patterns. The more sophisticated calculation given in Section 2.5 deals with this problem, and shows that an avalanche occurs if p > 0.138N, making the whole memory useless. Thus p_max = 0.138N if we are willing to accept the errors that occur up to that point. At p = 0.138N Table 2.1 shows that only 0.37% of the bits will be unstable initially, though it turns out that about 1.6% of them flip before a stable attractor is reached.

An alternative definition of the capacity insists that most of the memories be recalled perfectly. Since each pattern contains N bits, we need P_error < 0.01/N to get all N bits right with 99% probability.⁴ This clearly implies p/N → 0 as N → ∞, so we can use the asymptotic expansion of the error function

1 - \mathrm{erf}(x) \to \frac{e^{-x^2}}{\sqrt{\pi}\, x} \qquad (\text{as } x \to \infty)    (2.17)

to obtain

\log P_{\mathrm{error}} \approx -\log 2 - N/2p - \tfrac{1}{2}\log\pi - \tfrac{1}{2}\log(N/2p) .    (2.18)

This turns the condition P_error < 0.01/N into

-\log 2 - N/2p - \tfrac{1}{2}\log\pi - \tfrac{1}{2}\log(N/2p) < \log 0.01 - \log N    (2.19)

⁴Strictly speaking we should write (1 − P_error)^N > 0.99 here, but P_error < 0.01/N is a good approximation from the binomial expansion.


or, taking only the leading terms for large N,

N/2p > \log N    (2.20)

giving the capacity p_max = N/(2 log N) for this case.

Even more stringently, we could ask that all the patterns be recalled perfectly. This requires us to get Np bits right with, say, 99% probability, and so needs P_error < 0.01/pN. It is easy to see that this changes (2.20) to

N/2p > \log(Np)    (2.21)

which gives p_max = N/(4 log N) because log(Np) ≈ log N² = 2 log N in leading order. Note that we have assumed in the perfect recall cases that the C_i^ν's are independent of one another. Closer examination shows that this is justified. More detailed derivations of the N/log N results are available in Weisbuch and Fogelman-Soulie [1985] and McEliece et al. [1987].

In summary, the capacity p_max is proportional to N (but never higher than 0.138N) if we are willing to accept a small percentage of errors in each pattern, but is proportional to N/log N if we insist that most or all patterns be recalled perfectly.

Realistic patterns will not in general be random, though some precoding can make them more so. The Hopfield model is usually studied with random patterns for mathematical convenience, though the effect of correlated patterns has also been examined (see Section 3.2). At the other extreme, if the different patterns are strictly orthogonal, i.e.,

\sum_j \xi_j^\mu \xi_j^\nu = 0 \qquad \text{for all } \mu \ne \nu    (2.22)

then there is no crosstalk at all; C_i^ν = 0 for all i and ν. In this orthogonal case the memory capacity p_max is apparently N patterns, because at most N mutually orthogonal bit strings of length N can be constructed. But the useful capacity is somewhat smaller. Trying to embed N orthogonal patterns with the Hebb rule actually makes all states stable; the system stays wherever it starts, and is useless as a memory. This occurs because the orthogonality conditions (2.22) lead necessarily to⁵

w_{ij} = \begin{cases} 1 & \text{if } i = j; \\ 0 & \text{otherwise} \end{cases}    (2.23)

so each unit is connected only to itself. To define a useful measure of capacity for such a case it is clearly necessary to insist on a finite basin of attraction around each desired pattern. This leads to a useful capacity slightly less than N.

⁵Consider the matrix X with components X_{μi} = ξ_i^μ. Equation (2.22) implies XX^T = N 1, where 1 is the unit matrix, while the Hebb rule (2.9) may be written w = (1/N) X^T X. Using (AB)^T = B^T A^T leads immediately to w = 1.
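The footnote's matrix argument is easy to check numerically. In the sketch below (ours, not from the text), the rows of a 4 × 4 Hadamard matrix serve as N = 4 mutually orthogonal ±1 patterns, and the Hebb rule then yields the identity weight matrix of (2.23).

import numpy as np

# N = 4 mutually orthogonal +/-1 patterns: the rows of a 4x4 Hadamard matrix
X = np.array([[ 1,  1,  1,  1],
              [ 1, -1,  1, -1],
              [ 1,  1, -1, -1],
              [ 1, -1, -1,  1]])
N = X.shape[1]
print(X @ X.T)        # = N * identity: the orthogonality condition (2.22)
w = X.T @ X / N       # Hebb rule (2.9) written in matrix form, w = (1/N) X^T X
print(w)              # = identity matrix, as in (2.23): every state is a fixed point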


FIGURE 2.6 It is often useful (but sometimes dangerous) to think of the energy as something like this landscape. The z-axis is the energy and the 2^N corners of the hypercube (the possible states of the system) are formally represented by the x-y plane.

The Energy Function

One of the most important contributions of the Hopfield [1982] paper was to introduce the idea of an energy function into neural network theory. For the networks we are considering, the energy function H is

H = -\frac{1}{2} \sum_{ij} w_{ij} S_i S_j .    (2.24)

The double sum is over all i and all j. The i = j terms are of no consequence because S_i² = 1; they just contribute a constant to H, and in any case we could choose w_ii = 0.

The energy function is a function of the configuration {S_i} of the system, where {S_i} means the set of all the S_i's. We can thus imagine an energy landscape "above" the configuration space of Fig. 2.2. Typically this surface is quite hilly. Figure 2.6 illustrates the idea.

The central property of an energy function is that it always decreases (or remains constant) as the system evolves according to its dynamical rule. We will show this in a moment for (2.24). Thus the attractors (memorized patterns) in Fig. 2.2 are at local minima of the energy surface. The dynamics can be thought of as similar to the motion of a particle on the energy surface under the influence of gravity (pulling it down) and friction (so that it does not overshoot). From any starting point the particle (representing the whole state {S_i} of the system) slides downhill until it comes to rest at one of these local minima, at one of the attractors. The basins of attraction correspond to the valleys or catchment areas around each minimum. Starting the system in a particular valley leads to the lowest point of that valley.

The term energy function comes from a physical analogy to magnetic systems that we will discuss in the next section. But the concept is of much wider applicability; in many fields there is a state function that always decreases during dynamical evolution, or that must be minimized to find a stable or optimum state.


In some fields the convention is reversed; the function increases or must be maximized. The most general name, from the theory of dynamical systems, is Lyapunov function [Cohen and Grossberg, 1983]. Other terms are Hamiltonian in statistical mechanics, cost function or objective function in optimization theory, and fitness function in evolutionary biology.

For neural networks in general an energy function exists if the connection strengths are symmetric, i.e., w_ij = w_ji. In real networks of neurons this is an unreasonable assumption, but it is useful to study the symmetric case because of the extra insight that the existence of an energy function affords us. The Hebb prescription (2.9) which we are now studying automatically yields symmetric w_ij's. Gerard Toulouse has called Hopfield's use of symmetric connections a "clever step backwards from biological realism." The cleverness arises from the existence of an energy function.

For symmetric connections we can write (2.24) in the alternative form

H = C - \sum_{(ij)} w_{ij} S_i S_j    (2.25)

where (ij) means all the distinct pairs ij, counting for example 12 as the same pair as 21. We exclude the ii terms from (ij); they give the constant C.

It is now easy to show that the dynamical rule (2.4) can only decrease the energy. Let S_i' be the new value of S_i given by (2.4) for some particular unit i:

S_i' = \mathrm{sgn}\Bigl( \sum_j w_{ij} S_j \Bigr) .    (2.26)

Obviously if S_i' = S_i the energy is unchanged. In the other case S_i' = −S_i so, picking out the terms that involve S_i,

H' - H = -\sum_{j \ne i} w_{ij} S_i' S_j + \sum_{j \ne i} w_{ij} S_i S_j
       = 2 S_i \sum_{j \ne i} w_{ij} S_j
       = 2 S_i \sum_j w_{ij} S_j - 2 w_{ii} .    (2.27)

Now the first term is negative from (2.26), and the second term is negative because the Hebb rule (2.9) gives w_ii = p/N for all i. Thus the energy decreases every time an S_i changes, as claimed.

The self-coupling terms w_ii may actually be omitted altogether, both from the Hebb rule (where we can simply define w_ii = 0) and from the energy function. It is straightforward to check that they make no appreciable difference to the stability of the patterns in the large N limit. But they do affect the dynamics and the number of spurious states, and it turns out to be better to omit them [Kanter and Sompolinsky, 1987].


We can see why simply by separating the self-coupling term out of the dynamical rule (2.4):

S_i := \mathrm{sgn}\Bigl( w_{ii} S_i + \sum_{j \ne i} w_{ij} S_j \Bigr) .    (2.28)

If w_ii were larger than |Σ_{j≠i} w_ij S_j| in some state, then S_i = +1 and S_i = −1 could both be stable.⁶ This can produce additional stable spurious states in the neighborhood of a desired attractor, reducing the size of the basin of attraction. If w_ii = 0, then this problem does not arise; for a given configuration of the other spins, S_i will always pick one of its states over the other.
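The monotone decrease of H is easy to verify numerically. The following sketch (ours, with illustrative sizes) builds Hebb weights with w_ii = 0 and checks that the energy (2.24) never increases under the asynchronous rule (2.4).

import numpy as np

def energy(w, S):
    """Energy function (2.24): H = -1/2 sum_ij w_ij S_i S_j."""
    return -0.5 * S @ w @ S

rng = np.random.default_rng(2)
N, p = 100, 5
xi = rng.choice([-1, 1], size=(p, N))
w = xi.T @ xi / N
np.fill_diagonal(w, 0.0)                 # omit self-couplings, as recommended above

S = rng.choice([-1, 1], size=N)          # arbitrary starting configuration
H_prev = energy(w, S)
for _ in range(20 * N):
    i = rng.integers(N)
    S[i] = 1 if w[i] @ S >= 0 else -1    # asynchronous rule (2.4)
    H = energy(w, S)
    assert H <= H_prev + 1e-12           # H never increases for symmetric w
    H_prev = H
print("final energy:", H_prev)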

Starting from an Energy Function

The idea of the energy function as something to be minimized in the stable states gives us an alternate way to derive the Hebb prescription (2.9). Let us start again with the single-pattern case. We want the energy to be minimized when the overlap between the network configuration and the stored pattern ξ_i is largest. So we choose

H = -\frac{1}{2N} \Bigl( \sum_i \xi_i S_i \Bigr)^2    (2.29)

where the factor 1/2N is the product of inspired hindsight. For the many-pattern case, we can try to make each of the ξ_i^μ into local minima of H just by summing (2.29) over all the patterns:

H = -\frac{1}{2N} \sum_{\mu=1}^{p} \Bigl( \sum_i \xi_i^\mu S_i \Bigr)^2 .


FIGURE 2.10 Mean field theory. All the spins but one are replaced by their average values.

It is worth remarking that our result (2.42) also applies to a whole collection of N spins if they all experience the same external field and have no influence on one another. Such a system is called a paramagnet. The famous Curie law for paramagnets, dM/dh ∝ 1/T at h = 0, follows immediately upon identifying the total magnetization M with N⟨S⟩.

Mean Field Theory

In a problem of many interacting spins the problem is not so easily solved. The evolution of spin S_i depends on a local field h_i = Σ_j w_ij S_j + h^ext which involves variables S_j that themselves fluctuate back and forth. There is in general no way to solve the many-spin problem exactly, but there is an approximation which is sometimes quite good. It will turn out to be very useful for the analysis of neural networks. Known as mean field theory, it consists of replacing the true fluctuating h_i by its average value.

FIGURE 2.12 Graphical solution of equation (2.47). For βJ > 1 (dashed line) there are three solutions.

FIGURE 2.13 The positive solution of equation (2.47) as a function of temperature.

small p/N, the relaxation toward any one of them is just like the relaxation of a ferromagnet to a uniformly magnetized state.

At finite temperature we use mean field theory. Using (2.46) in (2.45) and putting in what we know characterizes the ferromagnetic state, i.e., that its magnetization is uniform, ⟨S_i⟩ = ⟨S⟩, we obtain a single equation to be solved for ⟨S⟩:

\langle S \rangle = \tanh(\beta J \langle S \rangle) .    (2.47)

Here we have set h^ext = 0 for convenience, although the generalization is obvious. Equation (2.47) is easily solved graphically for ⟨S⟩ as a function of T, as shown in Fig. 2.12. The kinds of solutions depend on whether βJ is smaller or larger than one. This corresponds to the different behavior above and below the critical temperature T_c, so we can deduce that T_c = J/k_B.

When T > T_c (or βJ < 1), there is only the trivial solution ⟨S⟩ = 0; the spin on each site points up and down equally often. But for T < T_c, there are two other solutions with ⟨S⟩ ≠ 0, one the negative of the other. It turns out that the new solutions are stable against small disturbances in ⟨S⟩, while in this temperature range the trivial ⟨S⟩ = 0 solution is unstable. This says that the ferromagnet can be found with its spins either predominantly up or predominantly down. If it is in one of these phases, it will not flip over to the other (in the limit of an infinite system). The magnitude of the average spontaneous magnetization ⟨S⟩ rises sharply (continuously, but with infinite derivative at T = T_c) as one goes below T_c; see Fig. 2.13. As T approaches 0, ⟨S⟩ approaches ±1; all spins point in the same direction.


Thus this simple mean field theory approximation does give the right behavior for a ferromagnet. It is actually exact for the infinite range case defined by (2.46), but is qualitatively correct for any ferromagnet.
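The self-consistency equation (2.47) is also easy to solve numerically. The sketch below (ours, not from the text) iterates m → tanh(βJm) to a fixed point for a few temperatures, reproducing the behavior of Fig. 2.13; the starting value and iteration count are arbitrary choices.

import numpy as np

def spontaneous_magnetization(T, J=1.0, iters=1000):
    """Solve <S> = tanh(beta*J*<S>), equation (2.47), by fixed-point iteration,
    starting from a small positive value to pick the positive branch."""
    beta = 1.0 / T
    m = 0.1
    for _ in range(iters):
        m = np.tanh(beta * J * m)
    return m

for T in (0.2, 0.5, 0.9, 1.1, 1.5):
    print(f"T = {T:3.1f}   <S> = {spontaneous_magnetization(T):.4f}")
# <S> -> 1 as T -> 0, and <S> vanishes for T > Tc = J (with k_B = 1)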

2.4 Stochastic Networks

We now apply much of the preceding section to neural networks, making the units stochastic and introducing the analogue of temperature. This will enable us to use mean field theory and hence ultimately to compute the storage capacity of the network. Taking the zero-temperature limit will always reduce our system to a deterministic Hopfield network, but the finite temperature extension will prove very useful for analysis.

We make our units behave stochastically exactly as for the spins in an Ising model [Hinton and Sejnowski, 1983; Peretto, 1984]. That is, we use (2.40),

\mathrm{Prob}(S_i = \pm 1) = f_\beta(\pm h_i) = \frac{1}{1 + \exp(\mp 2\beta h_i)}    (2.48)

for unit S_i whenever it is selected for updating, and select units in a random order as before. The function f_β(h) is given by (2.37) and illustrated in Fig. 2.8. It is often called the logistic function in this context. We could actually make other choices for the activation function g(h) in (2.36), but the choice g(h) = f_β(h) allows the application of statistical mechanics. We can also reinterpret (2.48) as describing an ordinary deterministic threshold unit with a random threshold θ drawn from a probability density f_β′(θ).

What is the meaning of this stochastic behavior? In real neural networks, neurons fire with variable strength, and there are delays in synapses, random fluctuations from the release of transmitters in discrete vesicles, and so on. These are effects that we can loosely think of as noise, and crudely represent by thermal fluctuations as we have done in writing (2.48). Of course, the parameter β is not directly related to the physical temperature; it is simply a parameter controlling the noise level, or the likelihood that the deterministic rule (2.4) is violated. Nevertheless, it is useful to define a pseudo-temperature T for the network by

\beta \equiv \frac{1}{T} .    (2.49)

T is emphatically not the real temperature of a piece of brain, or that of a network of circuits. Note that we did not need to put a constant k_B into (2.49), since T is not a physical temperature. In effect we set k_B = 1 in applications of statistical mechanics formulae. Henceforth we use (2.49) constantly and without comment, converting freely between β and T as convenient. Moreover we generally refer to T as the "temperature" even though "pseudo-temperature" would be more appropriate.


As we noted for the magnetic system, and illustrated in Fig. 2.8, the temperature T controls the steepness of the sigmoid f_β(h) near h = 0. At very low temperature the sigmoid becomes a step function and (2.48) reduces to the deterministic McCulloch-Pitts rule (2.4) for the original Hopfield network. As T is increased this sharp threshold is softened up in a stochastic way.

The use of stochastic units is not merely for mathematical convenience, nor simply to represent noise in our hardware or neural circuits. It is actually useful in many situations because it makes it possible to kick the system out of spurious local minima of the energy function. Generally the spurious minima will be less stable (higher in energy) than the desired retrieval patterns, and will not trap a stochastic system permanently.

The network will in general evolve differently every time it is run. Meaningful quantities to calculate are therefore averages over all possible evolutions, weighted by the probabilities of each particular history. This is just the type of calculation for which statistical mechanics is ideal. There is however an additional requirement for the use of most of statistical mechanics: we need to know that our network comes to equilibrium. This means that average quantities, such as the average value ⟨S_i⟩ of a particular S_i, eventually become time-independent. Luckily it can be proved that networks with an energy function, in the present context networks with symmetric connections w_ij, do indeed come to equilibrium. So even though we can no longer talk about absolute stability of particular configurations {S_i} of the network, we can still study stable equilibrium states {⟨S_i⟩} in which the average values do not change in time.
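For readers who want to experiment, here is a minimal sketch (ours, not from the text) of the stochastic dynamics (2.48): units are selected at random and set to +1 with the logistic probability f_β(h_i). The network size, pattern number, and temperature are illustrative assumptions.

import numpy as np

def stochastic_sweep(w, S, beta, rng):
    """One sweep of the stochastic dynamics (2.48):
    Prob(S_i = +1) = f_beta(h_i) = 1 / (1 + exp(-2*beta*h_i))."""
    N = len(S)
    for _ in range(N):
        i = rng.integers(N)
        h = w[i] @ S
        p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
        S[i] = 1 if rng.random() < p_up else -1
    return S

rng = np.random.default_rng(3)
N, p, T = 200, 5, 0.4
xi = rng.choice([-1, 1], size=(p, N))
w = xi.T @ xi / N
np.fill_diagonal(w, 0.0)

S = xi[0].copy()
for _ in range(50):
    stochastic_sweep(w, S, beta=1.0 / T, rng=rng)
print("overlap with pattern 0:", xi[0] @ S / N)    # stays near 1 for T well below 1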

Mean Field Theory

We now apply the mean field approximation to the stochastic model we have just defined, with the Hebb rule (2.9) for the connection strengths w_ij. We restrict ourselves at present to the case of a relatively small number of patterns, p << N.

q = \int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}\, \tanh^2\bigl[ \beta\bigl( m_1 + \sqrt{\alpha r}\, z \bigr) \bigr] .    (2.65)

Now we can calculate r. We just square (2.64):

m_\nu^2 = \frac{1}{N^2} \sum_{ij} \xi_i^\nu \xi_i^1 \xi_j^\nu \xi_j^1\,
\tanh\Bigl[ \beta\Bigl( m_1 + \sum_{\mu \ne 1,\nu} m_\mu \xi_i^\mu \xi_i^1 \Bigr) \Bigr]
\tanh\Bigl[ \beta\Bigl( m_1 + \sum_{\mu \ne 1,\nu} m_\mu \xi_j^\mu \xi_j^1 \Bigr) \Bigr]    (2.66)

and average the result over patterns. Since pattern ν does not occur inside the tanh's, the pattern factors outside the tanh's can be averaged separately, and only the i = j term survives. Then the remaining average of the tanh's just gives a factor of q as in (2.62). The result is independent of ν, so from (2.59):

r = \frac{q}{[1 - \beta(1 - q)]^2} .    (2.67)

We also need an equation for m₁. Using the same approach, starting again from (2.60) with ν = 1, it is easy to obtain

m_1 = \int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}\, \tanh\bigl[ \beta\bigl( m_1 + \sqrt{\alpha r}\, z \bigr) \bigr] .    (2.68)


The three equations (2.65), (2.67), and (2.68) can now be solved simultaneously for m₁, q, and r. In general this must be done numerically. We examine in detail only the T → 0 (or β → ∞) limit. In this limit it is clear from (2.65) that q → 1, but the quantity C ≡ β(1 − q) remains finite. The T → 0 limit lets us evaluate the integrals in (2.65) and (2.68), using the identities

\int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2} \bigl( 1 - \tanh^2 \beta[az + b] \bigr)
= \frac{1}{a\beta} \int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}\, \frac{d}{dz} \tanh \beta[az + b]
\;\longrightarrow\; \frac{1}{a\beta} \sqrt{\frac{2}{\pi}}\; e^{-b^2/2a^2} \qquad (T \to 0)    (2.69)

and

\int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2} \tanh \beta[az + b]
\;\longrightarrow\; \int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}\, \mathrm{sgn}(az + b)
= \mathrm{erf}\Bigl( \frac{b}{\sqrt{2}\, a} \Bigr) \qquad (T \to 0)    (2.70)

where the error function erf(x) was defined in (2.16). Our three equations thus become:

C = \sqrt{\frac{2}{\pi \alpha r}}\; e^{-m^2/2\alpha r}    (2.71)

r = \frac{1}{(1 - C)^2}    (2.72)

m = \mathrm{erf}\Bigl( \frac{m}{\sqrt{2\alpha r}} \Bigr)    (2.73)

where we have written m for m₁. We can find the capacity of the network by solving these three equations. Setting y = m/√(2αr), we obtain the equation

y \Bigl( \sqrt{2\alpha} + \frac{2}{\sqrt{\pi}}\, e^{-y^2} \Bigr) = \mathrm{erf}(y)    (2.74)

which is easily solved graphically as shown in Fig. 2.16.


FIGURE 2.16 Graphical solution of equation (2.74). The solid lines show the left-hand side for several values of α, while the dotted line shows the right-hand side, erf(y). Nontrivial solutions with m > 0 are given by intersections away from the origin.
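Equation (2.74) can also be solved numerically rather than graphically. The sketch below (ours, not from the text) scans α = p/N for the largest value at which a nontrivial crossing of the two sides of (2.74) survives; the grid resolutions and scan range are arbitrary choices.

import numpy as np
from math import erf, sqrt, pi

ys = np.linspace(1e-6, 3.0, 30001)
rhs = np.array([erf(y) for y in ys])             # right-hand side of (2.74)
gauss = (2.0 / sqrt(pi)) * np.exp(-ys ** 2)

def has_retrieval_solution(alpha):
    """True if (2.74), y*(sqrt(2*alpha) + (2/sqrt(pi))*exp(-y^2)) = erf(y),
    has a solution with y > 0, i.e. a retrieval state with m = erf(y) > 0."""
    lhs = ys * (sqrt(2.0 * alpha) + gauss)
    return np.any(rhs >= lhs)

# scan alpha = p/N to find where the nontrivial solution disappears
alphas = np.arange(0.120, 0.150, 0.0001)
alpha_c = max(a for a in alphas if has_retrieval_solution(a))
print(f"alpha_c ~ {alpha_c:.4f}")                # ~ 0.138, in agreement with (2.75)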

FIGURE 2.17 The phase diagram obtained by Amit et al. The desired memory states are only stable in the shaded region.

The Phase Diagram of the Hopfield Model

By solving (2.74) we can see that there is a critical value α_c of α where the nontrivial solutions (m ≠ 0) disappear. A numerical evaluation gives

\alpha_c \approx 0.138 .    (2.75)

The jump in m at this point is considerable: from m ≈ 0.97 to zero. Recalling (2.56), this tells us that (at T = 0) we go discontinuously from a very good working memory with only a few bits in error for α < α_c to a useless one for α > α_c.

Figure 2.17 shows the whole phase diagram for the Hopfield model, delineating different regimes of behavior in the T-α plane.


FIGURE 2.18 An attempt to visualize the energy landscape in different parts of the phase diagram. The dots show the desired memory states, while the small ripples represent spurious states. The four cases correspond to the four regions A-D of the phase diagram (Figure 2.17).

There is a roughly triangular region where the network is a good memory device, as indicated by the shaded region of the figure. The result (2.75) corresponds to the upper limit on the α axis, while the critical temperature T_c = 1 derived previously (see Fig. 2.14) for the p << N case corresponds to the upper limit on the T axis.

α_c) before the bounds are reached, and then not work at all. In between there is an optimum value for η. Then there is no sharp threshold α_c; adding more memories simply erases earlier ones, so the memory is termed a palimpsest.¹ This observation has been related to the limitations of short-term memory [Nadal et al., 1986] and to interpretations of dream sleep [Geszti and Pazmandi, 1987].

Synchronous Dynamics

As mentioned early in the discussion, we could choose to update all the units simultaneously instead of one at a time [Little, 1974; Amit et al., 1985a]. How does this affect the results we have found? Again, there is no significant change. In particular, the memory states (those with ⟨S_i⟩ ∝ ξ_i^μ for some μ) for p

\eta_i^\mu = \sum_\nu (\mathsf{Q}^{-1})_{\mu\nu}\, \xi_i^\nu    (3.21)

which have the property that

\frac{1}{N} \sum_i \eta_i^\mu \xi_i^\nu = \delta_{\mu\nu} .

This is where the name pseudo-inverse arises; if we regard ξ_i^μ and η_i^μ as N × p matrices, then (3.21) says that η^T ξ = N 1, and N^{-1} η^T is called the pseudo-inverse of ξ.

²The Kronecker delta symbol δ_{rs} is defined to be 1 if r = s and 0 otherwise, so δ_{μν} represents the identity matrix.


More generally for any m × n matrix M with m > n the pseudo-inverse (or generalized inverse, or Moore-Penrose inverse) M⁺ can be defined as

\mathsf{M}^{+} = (\mathsf{M}^T \mathsf{M})^{-1} \mathsf{M}^T

whenever M^T M is non-singular. It obeys M⁺M = 1 but not normally MM⁺ = 1. See Rao and Mitra [1971] for details.

In terms of the η_i^μ the connection matrix can be written

w_{ij} = \frac{1}{N} \sum_\mu \xi_i^\mu \eta_j^\mu .    (3.22)

Thinking now of the η_i^μ and the ξ_i^μ as components of vectors η^μ and ξ^μ, these equations tell us that each η^μ is orthogonal to all ξ^ν's (for ν ≠ μ), and that the matrix w is a sum of outer products of ξ^μ and η^μ vectors:

\mathsf{w} = \frac{1}{N} \sum_\mu \xi^\mu (\eta^\mu)^T .    (3.23)

It therefore naturally takes every ξ^μ into itself:

\mathsf{w}\, \xi^\mu = \xi^\mu \qquad \text{for all } \mu .    (3.24)

We can also regard (3.24) as a set of equations to be solved for the N × N weight matrix w, given the p pattern vectors ξ^μ. If p < N there are many different solutions to these equations, including the obvious N × N identity matrix. The solution chosen by the pseudo-inverse prescription (3.23) is the unique one where, for any vector y orthogonal to all the pattern vectors, we have

\mathsf{w}\, \mathbf{y} = 0 .    (3.25)

From (3.24) and (3.25) it follows that w projects onto the p-dimensional subspace spanned by the pattern vectors. This is why this method is also called the projection method. Note that any vector in this pattern subspace is stable in the sense that (3.24) is satisfied, though of course not all such vectors have purely ±1 components.

It is easy to see that the overlap matrix Q cannot be inverted if there are any linear dependencies among the patterns. But then a suitable w matrix can be found by restricting attention in (3.17) and (3.18) to a linearly independent subset of pattern vectors that spans the pattern subspace. If, however, the patterns span the whole N-dimensional space the recipe makes all patterns stable, and the memory is useless; w becomes an identity matrix. The same problem was encountered in (2.23) for the Hebb rule with N orthogonal patterns. So up to N − 1 linearly independent patterns can be usefully stored by this method. That is the capacity p_max in most practical situations.³

³In principle more linearly dependent patterns can be usefully stored if the size of any linearly independent subset is N − 1 or less. But for random ±1 patterns the probability of linear dependency is small (becoming 0 as N → ∞) for p < N, and 1 for p > N.
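The projection properties (3.24) and (3.25) are easy to check numerically. The following sketch (ours, not from the text) builds the pseudo-inverse weights from the overlap matrix Q of (3.17) and (3.18) and verifies that stored patterns are mapped to themselves while vectors orthogonal to the pattern subspace are annihilated; sizes and seeds are illustrative.

import numpy as np

def pseudo_inverse_weights(xi):
    """Projection-method weights, w_ij = (1/N) sum_{mu,nu} xi_i^mu (Q^-1)_{mu nu} xi_j^nu,
    with overlap matrix Q_{mu nu} = (1/N) sum_i xi_i^mu xi_i^nu."""
    p, N = xi.shape
    Q = xi @ xi.T / N
    return xi.T @ np.linalg.inv(Q) @ xi / N

rng = np.random.default_rng(4)
N, p = 50, 20
xi = rng.choice([-1, 1], size=(p, N)).astype(float)
w = pseudo_inverse_weights(xi)

print(np.allclose(w @ xi.T, xi.T))        # (3.24): w takes every pattern into itself
y = rng.normal(size=N)
y -= xi.T @ np.linalg.lstsq(xi.T, y, rcond=None)[0]   # remove y's component in the pattern subspace
print(np.allclose(w @ y, 0.0))            # (3.25): w annihilates vectors orthogonal to all patterns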


For all this linear algebra to hold, one has to retain the self-coupling terms, the diagonal terms w_ii of w. But it can be shown that they can usually be set equal to zero without problems [Kanter and Sompolinsky, 1987]. Indeed, making them zero usually improves the performance, as for the Hebb rule in (2.28).

The original Hebb rule had a biological appeal, in that such connections could develop in response to correlations of firings between only the pre- and post-synaptic cells. With (3.18) no such interpretation is possible; the pseudo-inverse prescription is nonlocal, because computation of w_ij requires knowledge of ξ_k^μ for all k. There is however an iterative scheme which converges to the same set of w_ij's using only local information (including h_i); see Diederich and Opper [1987] for details. We will encounter similar schemes for layered networks in Chapter 5.

Special Cases

There are many special cases of correlated patterns which can be treated with the general prescription (3.18) or by direct analysis. An important one is patterns which have a certain "average pattern" in common [Amit et al., 1987b]. The simplest case is of biased patterns, in which the probabilities for ξ_i^μ = +1 and ξ_i^μ = −1 are unequal (like a biased coin), but still independent for all i and μ. A generalization of this is a hierarchical correlation of patterns, where they can be grouped into families, families within families, and so on, on the basis of their mutual overlaps [Cortes et al., 1987; Krogh and Hertz, 1988; Parga and Virasoro, 1986; Gutfreund, 1988]. In both of these cases the result of the prescription (3.18) is a fairly simple generalization of the Hebb rule. One simply has to imprint both the average features (e.g., the mean value for biased patterns) and the deviations from them in a Hebb-like fashion, with suitable relative weights which can be calculated from (3.18).

Another interesting case is that of sparse patterns, in which almost all the ξ_i^μ are the same. This would be the case, for example, in patterns which are mostly "off", such as a small fraction of black pixels on a predominantly white screen. In this limit it is convenient to go over to units n_i and patterns which both take values 0 or 1 instead of ±1. For random biased patterns with a fraction a of ones, the appropriate distribution is

\xi_i^\mu = \begin{cases} 1 & \text{with probability } a; \\ 0 & \text{with probability } 1 - a; \end{cases}

and we consider the case a << 1.

for every pair of distinct attractors u_i^μ and u_i^ν. A neural network circuit implemented using (3.43) would certainly approach its attractors more rapidly than with (3.32). Barhen et al. [1989] found a remarkable speedup in learning time in an application to inverse kinematics for robot manipulators. More control over spurious attractors and basin sizes also seems to be possible. In applications to networks with hidden units (discussed in Chapter 7) it can be shown that terminal attractors never become repellers, as can happen in some cases with normal attractors. Note however that the locations of the attractors must be known in advance, which, while still appropriate for content-addressable memory, limits the applicability to many other problems in neural computation.

3.4 Hardware Implementations

In this section we briefly discuss two hardware implementations of Hopfield models. One is an analog electrical circuit using continuous-valued units. The other is an optical implementation using binary units and clipped connections. This is the only place in the book where hardware is discussed, and even here we cover only some basic principles.

Electrical Circuit Implementation

Figure 3.5(a) shows an electrical circuit constructed to implement a unit obeying (3.32) [Hopfield, 1984; Hopfield and Tank, 1986]. A network of four such circuits is shown in Fig. 3.5(b), making it clear that the resistors R_ij play the role of the connections w_ij. Such circuits have actually been made in analog VLSI. Each unit i is composed of the circuit shown in Fig. 3.5(a). u_i is the input voltage, V_i is the output voltage, and the operational amplifier has the transfer function V_i = g(u_i). The input of each unit is connected to ground with a resistor ρ in parallel with a capacitor C, and the output of unit j is connected to the input of unit i with a resistor R_ij.


FIGURE 3.5 (a) The circuit described in the text. (b) A network of such circuits.

In terms of modelling a real neuron, we can regard ρ and C as the transmembrane resistance and input capacitance respectively. The circuit equations are

C\, \frac{du_i}{dt} = -\frac{u_i}{\rho} + \sum_j \frac{V_j - u_i}{R_{ij}}    (3.44)

or, equivalently,

\tau_i\, \frac{du_i}{dt} = -u_i + \sum_j w_{ij}\, g(u_j)    (3.45)

where

\tau_i = R_i C    (3.46)

with

\frac{1}{R_i} = \frac{1}{\rho} + \sum_j \frac{1}{R_{ij}}    (3.47)

and

w_{ij} = R_i / R_{ij} .    (3.48)

Equation (3.45) is identical to (3.32). From any starting state the circuit therefore settles down to a steady state in which (3.30) is obeyed.

The circuit implementation of an abstract network involves choosing resistances R_ij to satisfy (3.48). If we choose ρ small enough we can make R_i ≈ ρ for all i and thus simplify (3.48) to

w_{ij} \approx \rho / R_{ij} .    (3.49)


FIGURE 3.6 Circuit for a virtual ground unit, with normal and inverted outputs.

Then the conductance 1/R_ij of each connection is chosen to be proportional to the desired w_ij.

This works only for positive w_ij's. To deal with negative connection strengths we can add inverters to the circuit so that −V_j is available as well as V_j at each unit. If a desired w_ij is negative, the term w_ij V_j is simply replaced by |w_ij|(−V_j); a positive connection |w_ij| is made to the inverted output −V_j. Alternatively, we can transform to unipolar connections, as described on page 49; this requires one extra unit to compute Σ_j V_j, which is then fed to all other units.

It is possible to avoid the need for small ρ, and yet obtain w_ij ∝ 1/R_ij (and τ_i independent of i), by modifying the circuit slightly [Denker, 1986]. As shown in Fig. 3.6, we simply connect the resistor ρ and the capacitor C between the input and the inverted output, instead of between the input and ground. Then if we use a high gain amplifier the negative feedback loop keeps the input close to ground potential, giving rise to the name virtual ground unit. In the high gain limit it is easy to show that the circuit equations reduce to

\tau\, \frac{dV_i}{dt} = -V_i + \sum_j w_{ij} V_j    (3.50)

where τ = ρC and w_ij = ρ/R_ij. The result is linear even if the operational amplifier is nonlinear. To regain a saturation property for V_i one can replace the resistor ρ by a nonlinear element such as a reversed pair of parallel diodes [Denker, 1986].

The hardest problem in creating networks of these circuits in VLSI is fabricating the connection resistors R_ij. We need 2N² resistors for N fully connected units (using inverters), and the resistance values must be large to limit power dissipation. High resistance paths are difficult to make in conventional CMOS technology, and especially difficult to make small. Nevertheless Graf et al. [1986] were able to make custom chips with N = 256 fully connected units, using 2N² ≈ 130,000 resistor sites. Each resistor had to be added to the otherwise finished CMOS chip using electron-beam lithography, and all had approximately the same resistance. Resistors could not be changed once made.
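The settling behavior described by (3.45) can be explored with a simple forward-Euler integration. The sketch below is ours and describes no actual hardware; the Hebb-rule weights, the tanh amplifier with gain β, and the step size are illustrative assumptions.

import numpy as np

def settle(w, u0, beta=4.0, tau=1.0, dt=0.01, steps=5000):
    """Euler integration of the circuit equation (3.45),
    tau * du_i/dt = -u_i + sum_j w_ij g(u_j),  with g(u) = tanh(beta*u)."""
    u = u0.copy()
    for _ in range(steps):
        u += (dt / tau) * (-u + w @ np.tanh(beta * u))
    return np.tanh(beta * u)            # output voltages V_i = g(u_i)

rng = np.random.default_rng(5)
N, p = 100, 3
xi = rng.choice([-1, 1], size=(p, N)).astype(float)
w = xi.T @ xi / N                       # Hebb weights as an example connection matrix
np.fill_diagonal(w, 0.0)

u0 = 0.3 * xi[0] + 0.05 * rng.normal(size=N)        # start near pattern 0
V = settle(w, u0)
print("overlap with pattern 0:", xi[0] @ np.sign(V) / N)   # close to 1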


Others have employed active elements (transistors) for the connections instead of passive resistive paths [Sivilotti et al., 1987; Alspector et al., 1988; Mead, 1989]. Typically the connection strengths have a discrete set of gray levels (e.g., 1/R_ij = 0, 1, ..., 16). Most importantly, they can be programmed after fabrication, allowing general purpose chips and the possibility of learning. In most applications it is important to make the connection strengths non-volatile, so that they retain their values even when the device is turned off.

Optical Implementation

In silicon technology it is easy to construct the units but hard to make the many connections needed. Optical technology has the reverse problem; because of the inherent linearity of most optical media, it is relatively easy to make the connections but harder to construct the units. Different rays can cross or be superposed without interfering with one another, so many optical connections can be made by a set of crossing light rays. But this typical linearity makes it hard to construct units with appropriate nonlinearity such as threshold behavior. Many of the earlier optical implementations of neural networks were therefore hybrid optoelectronic systems, in which the nonlinearity was placed in the electronic part. But now there is more effort going into fully optical systems, using nonlinear optical media and devices.

We first describe a very simple optoelectronic system due to Farhat et al. [1985]. There were 32 units, each represented by a separate electronic circuit shown schematically in Fig. 3.7(a). Each unit had an LED (Light Emitting Diode) as its output and a pair of photodiodes (PDs) as its input, one for excitatory and one for inhibitory signals. The LED was on when the unit was firing, and not otherwise. The desired connections were binarized so that w_ij = −1 or 1 for each ij (and zero on the diagonal). Thus the optical problem was simply to make the light from the LED for unit j shine on unit i's excitatory photodiode, or inhibitory photodiode, or neither, according to the value of w_ij. The weight matrix was photographically coded onto a two-dimensional mask with two pixels for each of the 32 × 32 connections. The two pixels were opaque and transparent as required for the desired w_ij (black-white, white-black, or black-black corresponding to 1, −1, or 0).

Figure 3.7(b) shows how the optical interconnections were made. The transparency encoding the weights was placed in front of an array of 64 long photodiodes such that each column of the transparency exactly covered one diode.⁴ Each of the LED's in the vertical output array illuminated one row of the transparency. This was done by vertically focusing the light and horizontally smearing it with a pair of lenses so that the light from one unit became a horizontal line on the appropriate row of the transparency. The input to each unit was accumulated in the corresponding pair of vertical photodiodes (inhibition in one, excitation in the other).

⁴Actually the transparency was cut into two halves, each covering 32 photodiodes, for technical convenience.


FIGURE 3.7 Optoelectronic Hopfield network of Farhat et al. [1985]. (a) Single unit. The signals from the excitatory and inhibitory photodiodes (PDs) are amplified and thresholded before driving the light emitting diode (LED). (b) Optical arrangement for the interconnections. Lenses are not shown, and the number of units has been scaled down by a factor of four.

Incoherent light was used, so the total intensity on each photodiode automatically gave the appropriate summation of total input. The electronic circuitry allowed the designers to selectively turn units on and off, and then release them. When three or fewer random patterns were stored in the memory mask it was found that any could be recalled by starting the system close enough to it. Indeed the successes and failures corresponded almost exactly to those for the same Hopfield network simulated by computer.

In later versions of this scheme the need for excitatory and inhibitory inputs was eliminated by using unipolar connections, as described on page 49. The adaptive threshold term proportional to Σ_j S_j was easily computed by focusing light from all the LEDs onto a single photodiode.

Most recent work has employed a hologram rather than a mask for the connection matrix [Cohen, 1986; Abu-Mostafa and Psaltis, 1987; Hsu et al., 1988; Peterson et al., 1990]. Using coherent light, a hologram can direct a light wave from any incoming direction independently to any outgoing direction, and thus connect many pairs of directions. A plane hologram on photographic film can implement about 10⁸ such connections per square inch, so 10⁴ sources can easily be fully connected to the same number of sensors. Volume holograms made from photorefractive crystals have an even greater potential; in principle they can hold more than 10¹² connections per cm³, corresponding to a million fully connected units. However, the


readout from a volume hologram is partially destructive, so the stored patterns decay slowly unless refreshed.

Of course one does not want to calculate the appropriate hologram for a particular set of memories; one wants to produce it optically from, e.g., a set of pictures. The details of this are beyond the scope of this book, but it is worth distinguishing angular and spatial multiplexing of different stored patterns. In the angular case the patterns are stored at different input angles, or at different lateral locations on the hologram, while spatial multiplexing involves assigning them to different layers of a volume hologram. Spatial multiplexing appears capable of storing more patterns [Peterson et al., 1990].

Many optical implementations are all-optical rather than optoelectronic. This means that a nonlinear medium or device must be found to perform the thresholding function for the units. Various approaches have been tried, including the use of strongly pumped phase-conjugate mirrors [Soffer et al., 1986; Anderson, 1986], etalon arrays [Wagner and Psaltis, 1987], and two-dimensional spatial light modulators [Abu-Mostafa and Psaltis, 1987; Hsu et al., 1988]. The last of these seems to have the greatest potential.

Some recent optical work has gone beyond Hopfield networks with the construction of adaptive optical neural networks that are able to learn. Simple perceptrons, multilayer networks with back-propagation, and deterministic Boltzmann machines have all been implemented [Wagner and Psaltis, 1987; Hsu et al., 1988; Peterson et al., 1990].

3.5 Temporal Sequences of Patterns

We have so far been concerned with networks that evolve to a stable attractor and then stay there. They can thus act as content-addressable memories for a set of static patterns. It is also interesting to investigate the possibility of storing, recalling, generating, and learning sequences of patterns. These obviously occur widely in biological systems. In this section we mainly examine how to generate a sequence of patterns. Instead of settling down to an attractor we want the network to go through a predetermined sequence, usually in a closed limit cycle. The sequences will be embedded in the choice of connections, as elsewhere in Chapters 2 and 3. Networks that can recognize sequences, or learn sequences incrementally by example, are considered separately in Section 7.3.

Simple sequences can be generated by a set of units that are synchronously updated, by connecting together a chain of units S_i with excitatory connections w_{i+1,i} from each unit to its successor, and then turning the first unit on. A closed cycle can be made by joining the ends of the chain, as shown in Fig. 3.8. However this sort of scheme is very limited. It can only produce sequences of patterns related by shifting or permutation. It is far from robust; one lost bit and the whole scheme


FIGURE 3.8 A very simple sequence generator. This relies on synchronous updating by a central clock, and must be initialized by turning on unit 1. Then it generates the cyclic sequence 1-2-3-4-5-6.

disintegrates. It needs a central pacemaker to control the synchronous updating, and is thus no good as an independent clock or oscillator.

Asymmetric Connections

How can we generate more arbitrary sequences of patterns, using only asynchronous updating? Suppose we have an associative network of units S_i (i = 1, 2, ..., N) and a sequence of patterns ξ_i^μ (μ = 1, 2, ..., p) that we want it to produce in order, with pattern 1 following pattern p cyclically. Hopfield [1982] suggested that asymmetric connections,

w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu + \frac{\lambda}{N} \sum_{\mu=1}^{p} \xi_i^{\mu+1} \xi_j^\mu ,    (3.51)

might be able to achieve this. Here λ is a constant that governs the relative strength of symmetric and asymmetric terms. We take ξ^{p+1} to mean ξ^1. If such a system is in the state S_i = ξ_i^ν, then the input h_i^ν to unit i is

h_i^\nu = \xi_i^\nu + \lambda\, \xi_i^{\nu+1} + \text{cross-terms} .    (3.52)

The cross-terms are small if the patterns are uncorrelated and there are not too many of them, as we have seen earlier. For λ < 1 the original pattern ξ^ν is stable because sgn(ξ_i^ν + λξ_i^{ν+1}) = sgn(ξ_i^ν). But for λ > 1 the second term dominates and tends to move the system to the next pattern.

For correlated patterns the Hebb rule (3.51) may be replaced by a pseudo-inverse rule [Personnaz et al., 1986] generalizing (3.18):

w_{ij} = \frac{1}{N} \sum_{\mu\nu} \xi_i^{\mu+1} (\mathsf{Q}^{-1})_{\mu\nu}\, \xi_j^\nu    (3.53)

where the matrix Q is given by (3.17) as before.

Unfortunately these schemes do not work very well in practice. The asynchronous updating tends to dephase the system so that one obtains states that


overlap several consecutive patterns, and the sequence is soon lost. Only if the length of the sequence is very small (p

0, and added terms to (3.51) to inhibit transitions to pattern states that were not next in sequence. The parameter values were chosen so that the system tended to remain in a particular pattern state for some time before being driven to the next one by the thermal noise (T > 0); in effect λ in (3.51) was a little less than one. No transitions were made at all if the temperature T was too low. The time between transitions was rather unpredictable, but became sharper and sharper as the system size N was increased.

Fast and Slow Connections

Sompolinsky and Kanter [1986], Kleinfeld [1986], and Peretto and Niez [1986] found ways of controlling how long each state in the sequence stabilizes before moving on to the next. In effect the parameter λ is small when each new state is entered, and then grows steadily until it provokes the next transition. Peretto and Niez proposed doing this directly with connection strengths that changed dynamically. However, this is expensive to implement in hardware or software because there are so many connections. The same effect can be achieved with the dynamics located only in the units if we use two types of connections [Sompolinsky and Kanter, 1986; Kleinfeld, 1986]. Between units i and j we have the usual symmetric short-time connections,

w_{ij}^{S} = \frac{1}{N} \sum_\mu \xi_i^\mu \xi_j^\mu ,    (3.54)

that stabilize each pattern, and asymmetric long-time connections,

w_{ij}^{L} = \frac{\lambda}{N} \sum_\mu \xi_i^{\mu+1} \xi_j^\mu ,    (3.55)

that tend to cause transitions in the sequence. The long-time connections represent slow synapses that have delayed or sluggish responses. Explicitly, the input h_i(t) to unit i at time t is now given by

h_i(t) = \sum_j \bigl[ w_{ij}^{S} S_j(t) + w_{ij}^{L} \bar{S}_j(t) \bigr]    (3.56)

where the delayed response S̄_j(t) is a weighted moving average (sometimes called a memory trace) over the past values of S_j:

\bar{S}_j(t) = \int_0^\infty G(t')\, S_j(t - t')\, dt' .    (3.57)


FIGURE 3.9 Illustration of (3.61). The solid curve shows S_j(t) for some unit j. The other curves show S̄_j(t) using equations (3.58) [dotted], (3.59) [dashed], and (3.60) [dot-dashed].

Various forms can be used for the kernel G(t), such as a delta function,⁵

G(t) = \delta(t - \tau)    (3.58)

a step function,

G(t) = \Theta(\tau - t)/\tau    (3.59)

or an exponential decay,

G(t) = \exp(-t/\tau)/\tau .    (3.60)

The last of these is easily implemented using decay units, as discussed in Section 7.3 (page 181). In all cases ∫₀^∞ G(t) dt = 1.

If the network has been in pattern ξ^{ν−1} for a time long compared to τ and then changes to ξ^ν at time t₀, we obtain

\bar{S}_j(t) \to \xi_j^\nu    (3.61)

for t − t₀ >> τ. This is illustrated in Fig. 3.9. Thus

h_i \approx \begin{cases} (1 + \lambda)\, \xi_i^\nu & \text{for } t - t_0 < \tau; \\ \xi_i^\nu + \lambda\, \xi_i^{\nu+1} & \text{for } t - t_0 > \tau \end{cases}    (3.62)

(besides cross-terms), which causes the next transition after a time of order τ if λ > 1.

⁵The Dirac delta function δ(x) is defined to be zero for x ≠ 0, with an infinite peak of area 1 at x = 0. Thus ∫ f(x)δ(x) dx = f(0) for any continuous function f(x) if the range of integration includes x = 0. δ(x) can also be regarded as the derivative of the step function Θ(x). The delta function is not strictly speaking a function, but can be defined as the limit of a sequence of functions.


FIGURE 3.10 Example of sequence generation. The curves show the overlap of the state S_i at time t with each of the embedded patterns ξ¹, ..., ξ⁴, calculated as m^ν(t) = (1/N) Σ_i ξ_i^ν S_i(t). The parameters used were N = 100, p = 4, λ = 2, τ = 8, using the step function kernel (3.59).

An intuitive picture of this process consists of an energy surface that has minima at each ξ^ν, but which also tilts steadily while the system is in a particular state, so that a downhill move from ξ^ν to ξ^{ν+1} occurs eventually. But this picture is a little deceptive, because the dynamics cannot be represented simply as descent on an energy landscape unless the connections are symmetric. In general the scheme works well and is quite robust. Figure 3.10 shows an example using the step function kernel (3.59). The parameters λ and τ must be chosen carefully however; in a detailed analysis Riedel et al. [1988] showed that some choices using (3.60) with large λ can lead to chaotic behavior instead of the desired sequence. The essential idea can also be generalized in most of the ways considered in Section 3.1. For example Gutfreund and Mezard [1988] demonstrated the applicability to the strong dilution model of Derrida, Gardner, and Zippelius [1987] discussed on page 46.
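The mechanism is simple enough to simulate directly. The following sketch is our own minimal illustration, not code from the text: the parameter values follow the Fig. 3.10 caption, the sequence is closed into a cycle (the sum in (3.55) stops at p − 1 in the text), and the step-function kernel (3.59) is approximated by averaging the last τ parallel updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam, tau = 100, 4, 2.0, 8                 # values from the Fig. 3.10 caption

xi = rng.choice([-1, 1], size=(p, N))           # random patterns xi^1 ... xi^p
w_fast = xi.T @ xi / N                          # short-time connections, eq. (3.54)
w_slow = lam * xi[np.r_[1:p, 0]].T @ xi / N     # long-time connections, eq. (3.55),
                                                # closed into a cycle mu -> mu+1 -> ... -> 1
S = xi[0].copy()                                # start in pattern 1
history = [S.copy() for _ in range(tau)]        # crude memory trace for the step kernel

for t in range(60):
    S_bar = np.mean(history[-tau:], axis=0)     # delayed response, step-function kernel (3.59)
    h = w_fast @ S + w_slow @ S_bar             # total input, eq. (3.56)
    S = np.where(h >= 0, 1, -1)                 # deterministic (T = 0) parallel update
    history.append(S.copy())
    print(t, np.round(xi @ S / N, 2))           # overlaps with the four patterns
```

With these values the printed overlaps should show the state dwelling in each pattern for a time of order τ before hopping to the next one, much as in Fig. 3.10.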

Central Pattern Generators  Kleinfeld and Sompolinsky [1989] have applied the model just described to the central pattern generator for swimming in the mollusk Tritonia diomedea. Central pattern generators are groups of neurons, typically in the spinal cord, that collectively produce a cyclic sequence without either feedback from the controlled system or continuous control from the brain. There is no single pacemaker neuron. In some cases multiple patterns can be generated by the same set of neurons. Patterns can be started, stopped, and modulated in period by external control inputs. In the case of Tritonia, there are four neural groups that fire in a well-defined sequence. Modelling these groups by one unit each, Kleinfeld and Sompolinsky calculated the


FIGURE 3.11 Architecture for counting pulses. The connections between the units (heavy lines) have fast and slow components, and are asymmetrical. The input is connected to all units with random ±1 connections.

required connections from (3.54) and (3.55), and then verified that the network generated the appropriate sequence. Particularly impressive was the comparison of their computed connection strengths with those measured experimentally; in Tritonia one can measure individual synapse strengths and identify delayed and non-delayed components. In every case where a synapse was found experimentally its sign was the same as that in the computed set.

Counting  Amit [1988] suggested a modification of the same sequence generation scheme to count input pulses in a network. For counting we want each transition ν → ν + 1 to occur only when an input pulse is received. This can be achieved with the same architecture, using fast and slow connections, if we use λ < 1. Then no transitions occur in the absence of input (at T = 0). The input pulse is arranged to apply an additional input μζ_i to each unit. Here ζ_i = ±1, uncorrelated with any of the sequence patterns ξ_i^ν, and 1 + λ > μ > 1 − λ. Figure 3.11 shows the architecture. If the system has been in state ξ^ν for a time long compared to τ, the total input to unit i is given by

$$h_i = \begin{cases} \xi_i^\nu + \lambda\,\xi_i^{\nu+1} & \text{without input pulse;} \\ \xi_i^\nu + \lambda\,\xi_i^{\nu+1} + \mu\,\zeta_i & \text{with input pulse.} \end{cases} \qquad (3.63)$$

Without an input pulse the current state is stable because λ < 1. With an input pulse, there are two cases. If ξ_i^ν and ξ_i^{ν+1} have the same sign then they determine the sign of h_i. If they have opposite signs, then μζ_i determines the sign of h_i, favoring ξ^ν and ξ^{ν+1} equally. So the input pulse moves the state to a point approximately equidistant from ξ^ν and ξ^{ν+1}, but in the subsequent dynamical motion the delayed connections break the tie and carry the state forward to ξ^{ν+1}, as desired.

Delayed Synapses  In the case G(t) = δ(t − τ) the long-time connections simply correspond to delayed synapses that pass a given signal after a delay τ. Equation (3.62) makes it clear


that τ is also the time between transitions, so the rule (3.55) for the delayed synapse strengths may be written

$$w_{ij}^{S} = \frac{\lambda}{N}\sum_{t=0,\tau,2\tau,\ldots}\xi_i(t+\tau)\,\xi_j(t) \qquad (3.64)$$

if we regard the desired sequence as a time-varying pattern ξ_i(t).

$$\mathbf{w}\cdot\mathbf{x}^\mu > N\kappa \qquad \text{(desired).} \qquad (5.22)$$

This says geometrically that all points in x-space must be further than Nκ/|w| from the plane perpendicular to w. In Fig. 5.2, for example, we can see that pattern F would fail to satisfy the condition unless κ were very small or |w| were very large. We can also give a geometrical picture of the learning process. In our single-output vector notation, (5.21) becomes

$$\Delta\mathbf{w} = \eta\,\Theta(N\kappa - \mathbf{w}\cdot\mathbf{x}^\mu)\,\mathbf{x}^\mu \qquad (5.23)$$

which says that the weight vector w is changed a little in the direction of x^μ if its projection w·x^μ onto x^μ is less than Nκ. This is done over and over again until all projections are large enough. An example for κ = 0 is sketched in Fig. 5.6. Observe that a final solution was found after three steps w → w′ → w″ → w‴; there are no patterns left in the bad region of w‴ so no further updates will occur. Whatever the value of κ, a direction for w in which all the projections are positive will give a solution if scaled up to a large enough magnitude |w|. Depending on the pattern vectors x^μ, there may be a wide range of such directions, or only a narrow cone, or (if no solution exists) none at all (Fig. 5.7 shows two examples). We can use this observation to quantify how easy or hard a problem is. The quantity

$$D(\mathbf{w}) = \frac{1}{|\mathbf{w}|}\,\min_\mu\;\mathbf{w}\cdot\mathbf{x}^\mu \qquad (5.24)$$


FIGURE 5.6 How the weight vector evolves during training for κ = 0, η = 1. Successive values of the weight vector are shown by w, w′, w″, and w‴. The darker and darker shading shows the "bad" region where w·x < 0 for the successive w vectors. Each w is found from the previous one (e.g., w′ from w) by adding an x^μ from the current bad region.


FIGURE 5.7 (a) An easy problem: any weight vector in the 135° angle shown will have positive pattern projections. (b) A harder problem where the weight vector must lie in a narrow cone.

depends on the worst of the projections. It is just the distance of the worst x^μ to the plane perpendicular to w, positive for the "good" side and negative for the "bad" side (see Fig. 5.8). The 1/|w| factor makes it a function only of the direction of w. If D(w) is positive then all pattern points lie on the good side, and so a solution can be found for large enough |w|. If we maximize D(w) over all possible weights we obtain the best direction for w, along which a solution will be found for smallest |w|. This solution is called the optimal perceptron. It can also be defined, equivalently, as the solution with largest margin size κ for fixed |w|. The value

$$D_{\max} = \max_{\mathbf{w}}\,D(\mathbf{w}) \qquad (5.25)$$


FIGURE 5.8 Definitions of D(w) and D_max. Pattern A is nearest to the plane perpendicular to weight vector w, so the distance to A gives D(w). Maximizing D(w) with respect to w gives w′, with D(w′) = D_max. Note that both B and C are distance D_max from the plane.

of D(w) in this direction tells us how easy the problem is: the larger D_max, the easier the problem. If D_max < 0, the problem cannot be solved. For example, D_max is 1/√17 for the AND problem (from Fig. 5.4) and −1/√3 for XOR.
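For concreteness, here is a small sketch of the learning rule (5.23) together with the quantity D(w) of (5.24). The patterns are made up, and we assume the convention (from the part of the section not shown here) that x^μ denotes the input pattern multiplied by its ±1 target, so that (5.22) becomes a single one-sided condition:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, eta, kappa = 20, 10, 1.0, 0.5

xi = rng.choice([-1.0, 1.0], size=(p, N))       # input patterns
zeta = rng.choice([-1.0, 1.0], size=p)          # desired +-1 outputs
x = zeta[:, None] * xi                          # assumed convention: x^mu = zeta^mu * xi^mu

w = np.zeros(N)
n_updates = 0
for _ in range(100_000):                        # (5.32) guarantees termination if D_max > 0
    bad = np.where(x @ w <= N * kappa)[0]       # patterns violating condition (5.22)
    if bad.size == 0:
        break
    w += eta * x[rng.choice(bad)]               # perceptron rule (5.23)
    n_updates += 1

D = (x @ w).min() / np.linalg.norm(w)           # eq. (5.24) for the final weight vector
print("updates:", n_updates, " D(w) =", round(float(D), 3))
```

Since p < 2N here, a solution almost surely exists and D(w) comes out positive; maximizing it over directions would give D_max and the optimal perceptron.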

5.3 Proof of Convergence of the Perceptron Learning Rule *  We assume that there is a solution to the problem and prove that the perceptron learning rule (5.21) reaches it in a finite number of steps. All we need to assume is that we can choose a weight vector w* in a "good" direction, one in which D(w*) > 0. Our proof is related most closely to that given by Arbib [1987]; see also Rosenblatt [1962], Block [1962], Minsky and Papert [1969], and Diederich and Opper [1987]. At each step in the learning process a pattern is chosen and the weights are updated only if the condition (5.20) is not satisfied. Let M^μ denote the number of times that pattern μ has been used to update the weights at some point in the learning process. Then at that time

$$\mathbf{w} = \eta\sum_\mu M^\mu\,\mathbf{x}^\mu \qquad (5.26)$$

if we assume that the initial weights are all zero. The essence of the proof is to compute bounds on |w| and on the overlap w·w* with our chosen "good" vector w*. These let us show that w·w*/|w| would get arbitrarily large if the total number of updates M = Σ_μ M^μ kept on increasing. But this is impossible (since w* is fixed), so the updating must cease at some finite M. Consider w·w* first. Using (5.26) and (5.24) we obtain

$$\mathbf{w}\cdot\mathbf{w}^* = \eta\sum_\mu M^\mu\,\mathbf{x}^\mu\cdot\mathbf{w}^* \;\ge\; \eta M\,\min_\mu\,\mathbf{x}^\mu\cdot\mathbf{w}^* \;=\; \eta M\,D(\mathbf{w}^*)\,|\mathbf{w}^*|. \qquad (5.27)$$


Thus w·w* grows like M. Now for an upper bound on |w|, consider the change in length of w at a single update by pattern α:

$$\Delta|\mathbf{w}|^2 = (\mathbf{w} + \eta\mathbf{x}^\alpha)^2 - \mathbf{w}^2 = \eta^2(\mathbf{x}^\alpha)^2 + 2\eta\,\mathbf{w}\cdot\mathbf{x}^\alpha \;<\; \eta^2 N + 2\eta N\kappa \qquad (5.28)$$

where we used the fact that Nκ > w·x^α whenever an update is performed with pattern α. Note that we also used x_k^α = ±1, so that (x^α)² = N, but the proof is easily generalized to other types of patterns. By summing the increments to |w|² for M steps we obtain the desired bound

$$|\mathbf{w}|^2 < M N \eta\,(\eta + 2\kappa). \qquad (5.29)$$
(5.29)

Thus |w| grows no faster than \ /M , and therefore from (5.27) the ratio w -w */|w | grows at least as fast as V M . But this cannot continue, so M must stop growing. More precisely, we can bound the normalized scalar product _ ^

(w -w * ) 2 |w |2 |w*|2

(5 -30)

which is the squared cosine o f the angle between w and w*. Because it is the squared cosine it is obviously less than or equal to 1 (as also follows from the Cauchy-Schwarz inequality). But with (5.27) and (5.29) we find

\>> M N(r} + 2k)

(5.31) v '

which gives us an upper bound on the number o f weight updates (using the best possible w *): M < N 1 + 2k/>? . U'mrr\\ ax

(5.32)

This bound is proportional to the number o f input units, but interestingly enough it does not depend on the number o f patterns p. O f course the real convergence time does depend on p because one typically has to continue checking all p patterns to find the ones for which a weight update is needed; the number o f such checks increases with p even if the number o f actual updates does not. Additionally, D max typically decreases with increasing p, resulting in a growing M . Note also that the bound on M grows linearly with « , because for larger k the learning must reach a larger |w| along any given good direction.

102

FIVE Simple Perceptrons

5.4 Linear Units So far in our study o f simple perceptrons we have considered only threshold units, with g(h) = sgn(/i). We turn now to continuous-valued units, with g(h) a continuous and differentiable function o f u. The great advantage o f such units is that they allow us to construct a cost function £*[w] which measures the system ’s performance error as a differentiable function o f the weights w = {w ,* }. We can then use an optimization technique, such as gradient descent, to minimize this error measure. We start in this section with lin e a r u n its, for which g(h ) = h. These are not as useful practically as the nonlinear networks considered next, but are simpler and allow more detailed analysis. The output o f a linear simple perceptron subjected to an input pattern £ f is given by

=

(5 -33) k

and the desired association is O f = £ f as usual, or Cf =

X I Wik^k

(desired).

(5.34)

k Note that the O f ’s are continuous-valued quantities now, although we could still restrict the desired values £ f to ± 1 .

Explicit Solution For a linear network we can actually compute a suitable set o f weights explicitly using the p s e u d o -in v e r s e method. We saw on page 50 how to find the weights to satisfy (5.34) for the special case o f auto-association, £ f = £f. The generalization to hetero-association is just

=

Q _1) ^

fits

N. This includes AND and X O R , and indeed all the other threshold network examples used or illustrated in this chapter.

Gradient Descent Learning  We could use (5.35) and (5.36) to compute a set of weights w_ik that produce exactly the desired outputs from each input pattern. But we are more interested here in finding a learning rule that allows us to find such a set of weights by successive improvement from an arbitrary starting point. We define an error measure or cost function by

$$E[\mathbf{w}] = \frac{1}{2}\sum_{i\mu}\bigl(\zeta_i^\mu - O_i^\mu\bigr)^2 = \frac{1}{2}\sum_{i\mu}\Bigl(\zeta_i^\mu - \sum_k w_{ik}\,\xi_k^\mu\Bigr)^2. \qquad (5.39)$$

This is smaller the better our w_ik's are; E is normally positive, but goes to zero as we approach a solution satisfying (5.34). Note that this cost function depends only on the weights w_ik and the problem patterns. In contrast, the energy functions considered in Chapters 2 to 4 depended on the current state of the network, which evolved so as to minimize the energy. Here the evolution is of the weights (learning), not of the activations of the units themselves.


Given our error measure E[w], we can improve on a set of w_ik's by sliding downhill on the surface it defines in w-space. Specifically, the usual gradient descent algorithm suggests changing each w_ik by an amount Δw_ik proportional to the gradient of E at the present location:

$$\Delta w_{ik} = -\eta\,\frac{\partial E}{\partial w_{ik}} = \eta\sum_\mu\bigl(\zeta_i^\mu - O_i^\mu\bigr)\,\xi_k^\mu. \qquad (5.40)$$

If we make these changes individually for each input pattern ξ^μ in turn, we have simply

$$\Delta w_{ik} = \eta\,\bigl(\zeta_i^\mu - O_i^\mu\bigr)\,\xi_k^\mu \qquad (5.41)$$

or

$$\Delta w_{ik} = \eta\,\delta_i^\mu\,\xi_k^\mu \qquad (5.42)$$

if we define the errors (or deltas) δ_i^μ by

$$\delta_i^\mu = \zeta_i^\mu - O_i^\mu. \qquad (5.43)$$

This result, (5.41), or equivalently (5.42) plus (5.43), is commonly referred to as the delta rule, or the adaline rule, or the Widrow-Hoff rule, or the LMS (least mean square) rule [Rumelhart, McClelland, et al., 1986; Widrow and Hoff, 1960]. It is also nearly identical to the Rescorla-Wagner model of classical conditioning in behavioral psychology [Rescorla and Wagner, 1972]. Equation (5.41) is identical to our simple rule (5.19) for the threshold unit network. Note however that (5.19) originated in an empirical Hebb assumption, but now we have derived it from gradient descent. As we will see, the new approach is easily generalized to more layers, whereas the Hebb rule is not. The new approach obviously requires continuous-valued units with differentiable activation functions. Actually the equivalence of (5.41) and (5.19) is a little deceptive because O_i^μ is a different function of the inputs. In the threshold case the term ζ_i^μ − O_i^μ is either 0 or ±2, whereas here it can take any value. As a result the final weight values are not the same in the two cases. Nor are the conditions for existence of a solution (linear separability versus linear independence). And in the threshold case the learning rule stops after a finite number of steps, while here it continues in principle for ever, converging only asymptotically to the solution. The cost function (5.39) is just a quadratic form in the weights. In the subspace spanned by the patterns the surface is a parabolic bowl with a single minimum. Assuming that the pattern vectors are linearly independent, so that there is a solution to (5.34), the minimum is at E = 0. In the directions (if any) of w-space orthogonal to all the pattern vectors the error is constant, as may be seen by inserting (5.38) into (5.39); the extra term makes no difference because its overlap with every pattern ξ^μ vanishes.



FIGURE 5.9 The "rain gutter" shape of the error surface for linear units. The error takes its minimum value of 0 along level valleys if the patterns do not span the whole of ξ-space.

In other words, the error surface in weight space is like a rain gutter, as pictured in Fig. 5.9, with infinite level valleys in the directions orthogonal to the pattern vectors. The gradient descent rule (5.40) or (5.41) produces changes in the weight vectors w_i = (w_i1, ..., w_iN) only in the directions of the pattern vectors ξ^μ. Thus any component of the weights orthogonal to the patterns is left unchanged by the learning, leaving an inconsequential uncertainty of exactly the form (5.38) in the final solution. Within the pattern subspace the gradient descent rule necessarily decreases the error if η is small enough, because it takes us in the downhill gradient direction. Thus with enough iterations we approach the bottom of the valley arbitrarily closely, from any starting point. And any point at the bottom of the valley solves the original problem (5.34) exactly.
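Before turning to the convergence analysis, here is a minimal sketch of the incremental delta rule (5.41)-(5.43) for a single linear output unit; the learning rate and the random data are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, eta = 8, 5, 0.05
xi = rng.choice([-1.0, 1.0], size=(p, N))     # input patterns
zeta = rng.choice([-1.0, 1.0], size=p)        # targets for one output unit

w = np.zeros(N)
for epoch in range(200):
    for mu in rng.permutation(p):             # one pattern at a time, in random order
        O = w @ xi[mu]                        # linear output, eq. (5.33)
        delta = zeta[mu] - O                  # error delta, eq. (5.43)
        w += eta * delta * xi[mu]             # delta rule, eqs. (5.41)/(5.42)

E = 0.5 * np.sum((zeta - xi @ w) ** 2)        # cost function (5.39)
print("final cost:", round(float(E), 8))      # approaches 0 asymptotically
```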

Convergence of Gradient Descent *  The argument just given for convergence to the bottom of the valley is intuitively reasonable but bears a little further analysis. The first step is to diagonalize the quadratic form (5.39). As long as the pattern vectors are linearly independent, this allows us to write E[w] in the form

$$E[\mathbf{w}] = \sum_{\lambda=1}^{M} a_\lambda\,\bigl(w_\lambda - w_\lambda^0\bigr)^2. \qquad (5.44)$$

Here M is the total number of weights, equal to N times the number of output units, and the w_λ's are linear combinations of the w_ik's. The a_λ's and w_λ^0's are constants depending only on the pattern vectors. The eigenvalues a_λ are necessarily positive or zero because of the sum-of-squares form of (5.39); the quadratic form is positive semi-definite. Notice first that if some of the a_λ's are zero then E is independent of the corresponding w_λ's. This is equivalent to the rain gutters already described; the


FIGURE 5.10 Gradient descent on a simple quadratic surface (the left and right parts are copies of the same surface). Four trajectories are shown, each for 20 steps from the open circle. The minimum is at the + and the ellipse shows a constant error contour. The only significant difference between the trajectories is the value of η, which was 0.02, 0.0476, 0.049, and 0.0505 from left to right.

corresponding eigenvector directions in w-space are orthogonal to all the pattern vectors. Now let us perform gradient descent on (5.44). Because the transformation from w_ik to w_λ is linear, this is entirely equivalent to gradient descent in the original basis. But the diagonal basis makes it much easier:

$$\Delta w_\lambda = -\eta\,\frac{\partial E}{\partial w_\lambda} = -2\eta\,a_\lambda\,\bigl(w_\lambda - w_\lambda^0\bigr). \qquad (5.45)$$

Thus the distance δw_λ = w_λ − w_λ^0 from the optimum in the λ-direction is transformed according to

$$\delta w_\lambda^{\rm new} = \delta w_\lambda^{\rm old} + \Delta w_\lambda = \bigl(1 - 2\eta a_\lambda\bigr)\,\delta w_\lambda^{\rm old}. \qquad (5.46)$$

In the directions for which a_λ > 0 we therefore get closer to the optimum as long as |1 − 2ηa_λ| < 1. The approach is first order; each distance δw_λ gets multiplied by a fixed factor at each iteration. The value of η is limited by the largest eigenvalue a^max, corresponding to the steepest curvature direction of the error surface; we must have η < 1/a^max or we will end up jumping too far, further up the other side of the valley than we are at present. But the rate of approach to the optimum is usually limited by the smallest non-zero eigenvalue a^min, corresponding to the shallowest curvature direction. If a^max/a^min is large, progress along the shallow directions can be excruciatingly slow. Figure 5.10 illustrates these points in a simple case. We show gradient descent on the surface E = x² + 20y² for 20 iterations at different values of η. This quadratic form is already diagonal, with a₁ = 1 and a₂ = 20. Notice the distance of the last point from the minimum. At η = 0.02 we reach y ≈ 0 fairly quickly, but then make


only slow progress in x. At the other extreme, if η > 1/20 = 0.05 the algorithm produces a divergent oscillation in y. The fastest approach is approximately when the x and y multipliers |1 − 2η| and |1 − 40η| are equal, giving η = 1/21 ≈ 0.0476 (second illustration). Of course this analysis assumes that we actually take steps along the gradient. Progress will normally be somewhat slower if instead we change the weights one pattern at a time, as in (5.41), though the incremental algorithm is often more convenient in practice. Another alternative sometimes used is to take steps of constant size (usually decreasing with time) in the gradient direction; this can help to speed up convergence when a^max/a^min is large, but requires careful control. We have also assumed of course that a perfect solution exists, which requires linear independence of the patterns. It is however interesting to ask what happens if the patterns are not linearly independent. Then we can only diagonalize (5.39) in the form

$$E = E_0 + \sum_{\lambda=1}^{M} a_\lambda\,\bigl(w_\lambda - w_\lambda^0\bigr)^2 \qquad (5.47)$$

with E₀ > 0. There is still a single minimum (or gutter), but E = E₀ > 0 there, showing that the desired association has not been found. Trying gradient descent on the XOR problem, for example, produces a single minimum with E = 2 at w₀ = w₁ = w₂ = 0, which makes the output always 0. The XOR problem is obviously not linearly independent, since p > N.
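The behaviour shown in Fig. 5.10 is easy to reproduce. A small sketch using the same surface E = x² + 20y² and the η values from the figure caption (the starting point is our own choice, since the figure's open circle is not specified here):

```python
import numpy as np

def descend(eta, steps=20, start=(-1.0, 1.0)):
    w = np.array(start)
    for _ in range(steps):
        grad = np.array([2.0 * w[0], 40.0 * w[1]])   # gradient of E = x^2 + 20 y^2
        w = w - eta * grad                            # fixed-step gradient descent (5.40)
    return w

for eta in [0.02, 0.0476, 0.049, 0.0505]:
    x, y = descend(eta)
    print(f"eta = {eta:<7} final (x, y) = ({x:+.4f}, {y:+.4f})")

# eta = 0.02  : y shrinks fast but x is still far from 0 (shallow direction is slow)
# eta = 0.0476: near-optimal, the multipliers |1 - 2*eta| and |1 - 40*eta| are about equal
# eta = 0.0505: exceeds 1/20, so the y component oscillates with growing amplitude
```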

5.5 Nonlinear Units  It is straightforward to generalize gradient descent learning from the linear g(h) = h discussed in the previous section to networks with any differentiable g(h). The sum-of-squares cost function becomes

$$E[\mathbf{w}] = \frac{1}{2}\sum_{i\mu}\bigl(\zeta_i^\mu - O_i^\mu\bigr)^2 = \frac{1}{2}\sum_{i\mu}\Bigl[\zeta_i^\mu - g\Bigl(\sum_k w_{ik}\,\xi_k^\mu\Bigr)\Bigr]^2. \qquad (5.48)$$

We then find

$$\frac{\partial E}{\partial w_{ik}} = -\sum_\mu\bigl(\zeta_i^\mu - O_i^\mu\bigr)\,g'(h_i^\mu)\,\xi_k^\mu \qquad (5.49)$$

so the gradient descent correction −η ∂E/∂w_ik to w_ik after presentation of pattern number μ is of the same form as (5.42):

$$\Delta w_{ik} = \eta\,\delta_i^\mu\,\xi_k^\mu. \qquad (5.50)$$

But now the quantity

$$\delta_i^\mu = g'(h_i^\mu)\,\bigl(\zeta_i^\mu - O_i^\mu\bigr) \qquad (5.51)$$


has acquired an extra factor g′(h_i^μ) of the derivative of the activation function g(h). With a sigmoid form for g(h), one commonly takes

$$g(h) = \frac{1}{1 + \exp(-2\beta h)} \qquad (6.12)$$

and

$$g(h) = \tanh\beta h \qquad (6.13)$$

respectively for the activation function of 0/1 units and of ±1 units. The steepness parameter β is often set to 1, or 1/2 for (6.12). As we noted in Chapter 5, the derivatives of these functions are readily expressed in terms of the functions themselves, as g′(h) = 2βg(1 − g) for (6.12) and g′(h) = β(1 − g²) for (6.13). Thus one often sees (6.8), for example, written as

$$\delta_i^\mu = O_i^\mu\,\bigl(1 - O_i^\mu\bigr)\,\bigl(\zeta_i^\mu - O_i^\mu\bigr) \qquad (6.14)$$

for 0/1 units with β = 1/2. Because back-propagation is so important, we summarize the result in terms of a step-by-step procedure, taking one pattern μ at a time (i.e., incremental updates). We consider a network with M layers m = 1, 2, ..., M and use V_i^m for the output

Locality is necessary for biological implementation, but not sufficient. Bidirectional bifunctional connections are not biologically reasonable [Grossberg, 1987b], but can be avoided, allowing hypothetical neurophysiological implementations [Hecht-Nielsen, 1989]. Nevertheless back-propagation seems rather far-fetched as a biological learning mechanism [Crick, 1989].


of the ith unit in the mth layer. V_i^0 will be a synonym for ξ_i, the ith input. Note that superscript m's label layers, not patterns. We let w_ij^m mean the connection from V_j^{m−1} to V_i^m. Then the back-propagation procedure is:

1. Initialize the weights to small random values.

2. Choose a pattern ξ_k^μ and apply it to the input layer (m = 0), so that

$$V_k^0 = \xi_k^\mu \qquad \text{for all } k. \qquad (6.15)$$

3. Propagate the signal forwards through the network using

$$V_i^m = g\bigl(h_i^m\bigr) = g\Bigl(\sum_j w_{ij}^m\,V_j^{m-1}\Bigr) \qquad (6.16)$$

for each i and m until the final outputs V_i^M have all been calculated.

4. Compute the deltas for the output layer,

$$\delta_i^M = g'\bigl(h_i^M\bigr)\,\bigl[\zeta_i^\mu - V_i^M\bigr].$$

with a small positive constant μ. Nevertheless this whole approach can fail to converge, and seems to give only a modest speed increase over simple back-propagation. Owens and Filkin [1989] have suggested replacing the discrete gradient descent rule

$$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}} \qquad \text{for all } ij \qquad (6.44)$$

by the continuous differential equations

$$\frac{dw_{ij}}{dt} = -\eta\,\frac{\partial E}{\partial w_{ij}} \qquad \text{for all } ij. \qquad (6.45)$$

When we write out the right-hand side of (6.45) as an explicit function of the weights we obtain a set of coupled nonlinear ordinary differential equations (ODE's). The conventional gradient descent rule (6.44) can be viewed as a simple fixed-step "forward-Euler" integrator for these equations. But in fact there are much better ways of integrating such equations numerically, particularly if the equations are mathematically stiff, which is equivalent to there being shallow steep-sided valleys in the cost function.⁵ Most neural network problems do seem to be stiff, so specialized stiff ODE solvers are appropriate, and are found to give very large speed increases over ordinary back-propagation. A very different approach uses a genetic algorithm to search the weight space without use of any gradient information [Whitley and Hanson, 1989; Montana and Davis, 1989]. See Goldberg [1989] for an introduction. A complete set of weights is coded in a binary string (or chromosome), which has an associated "fitness" that depends on its effectiveness. For example the fitness could be given by −E where E is the value of the cost function for that set of weights. Starting with a random population of such strings, successive generations are constructed using genetic operators such as mutation and crossover to construct new strings out of old ones, with some form of survival of the fittest; fitter strings are more likely to survive and to participate in mating (crossover) operations. The crossover operation combines


part of one string with part of another, and can in principle bring together good building blocks, such as hidden units that compute particular logical functions, found by chance in different members of the population. The way in which the weights are coded into strings and the details of the genetic operators are both crucial in making this effective. Genetic algorithms perform a global search and are thus not easily fooled by local minima. The fitness function does not need to be differentiable, so we can start with threshold units in Boolean problems, instead of having to use sigmoids that are later trained to saturation. On the other hand there is a high computational penalty for not using gradient information, particularly when it is so readily available by back-propagating errors. An initial genetic search followed by a gradient method might be an appropriate compromise. Or a gradient descent step can be included as one of the genetic operators [Montana and Davis, 1989]. There are also large costs in speed and storage for working with a whole population of networks, perhaps making genetic algorithms impractical for large network design. There has not yet been sufficient comparative study to determine which of the techniques discussed here is best overall. Conventional back-propagation is slow compared to almost any of the improved techniques, but of course speed is not the only issue. We should also consider storage requirements, locality of quantities required (for a parallel or network implementation), reliability of convergence, and the problem of local minima. Probably there is no single best approach, and an optimal choice must depend on the problem and on the design criteria.

⁵Technically, stiffness depends on the ratio of the largest and smallest eigenvalues of the underlying Hessian, just as found in Chapter 5 for the maximum convergence rate of gradient descent.
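As a concrete, if toy, illustration of the ODE viewpoint (6.45), the sketch below hands the stiff quadratic surface E = x² + 20y² from Chapter 5 to an off-the-shelf stiff solver (scipy's Radau method) instead of taking fixed forward-Euler steps; the surface and η are our choices, not an example from the text.

```python
import numpy as np
from scipy.integrate import solve_ivp

eta = 1.0

def dw_dt(t, w):                                  # eq. (6.45): dw/dt = -eta * dE/dw
    x, y = w
    return [-eta * 2.0 * x, -eta * 40.0 * y]      # for E = x^2 + 20 y^2

sol = solve_ivp(dw_dt, (0.0, 10.0), [-1.0, 1.0], method="Radau")   # implicit stiff solver
print("solver steps:", sol.t.size, " final w:", sol.y[:, -1])

# A fixed-step forward-Euler integrator (ordinary gradient descent) would need a step
# size below 1/20 here to avoid divergence along the steep y-direction, so it must creep
# along the shallow x-direction; the stiff solver chooses its own step sizes.
```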

Local Minima  Gradient descent, and all of the other optimization techniques discussed above, can become stuck in local minima of the cost function. Actually local minima have not, in fact, been much of a problem in most cases studied empirically, though it is not really understood why this should be so. There certainly are local minima in some problems [McInerny et al., 1989], though apparent local minima sometimes turn out to be the bottoms of very shallow steep-sided valleys. The size of the initial random weights is important. If they are too large the sigmoids will saturate from the beginning, and the system will become stuck in a local minimum (or very flat plateau) near the starting point. A sensible strategy is to choose the random weights so that the magnitude of the typical net input h_i to unit i is less than, but not too much less than, unity. This can be achieved by taking the weights to be of the order of 1/√k_i, where k_i is the number of j's which feed forward to i (the fan-in of unit i). A common type of local minimum is one in which two or more errors compensate each other. These minima are not very deep so just a little noise (random fluctuation) is needed to get out. As we mentioned earlier, a simple and very efficient approach is to use incremental updating (one pattern at a time), choosing the patterns in a random order from the training set. Then the average over patterns is avoided and the random order generates noise. If, on the other hand, the training


set is cycled through in the same sequence all the time it is possible to get stuck because a series of weight changes cancel. Another way of introducing noise is to let upward steps in E be allowed occasionally, as in the stochastic networks discussed earlier, with a temperature T controlling the probabilities. Annealing, a gradual lowering of T, is then performed. However, this approach tends to make learning very slow. Alternatively it is possible to add noise explicitly by randomly changing the weights slightly [von Lehman et al., 1988] or by adding noise to the training set inputs ξ_k^μ, independently at each presentation [Sietsma and Dow, 1988]. In each case there seems to be an optimum amount of noise; a little helps, but too much hurts and slows down the learning process considerably.
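The initialization heuristic mentioned above (random weights of order 1/√k_i, where k_i is the fan-in) is essentially one line in practice; a small sketch with arbitrary layer sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
fan_in, fan_out = 64, 30

# weights of order 1/sqrt(fan_in), so typical net inputs h_i are of order unity
w = rng.uniform(-1.0, 1.0, size=(fan_out, fan_in)) / np.sqrt(fan_in)

x = rng.choice([-1.0, 1.0], size=fan_in)          # a typical +-1 input vector
print(np.abs(w @ x).mean())                       # typically below, but not far below, 1
```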

6.3 Examples and Applications  Although our main focus in this book is on the theoretical aspects of neural computation, it is helpful at this stage to survey some examples. Back-propagation and its variants have been applied to a wide variety of problems, from which we select a few. Our first three examples are "toy problems" which are often used for testing and benchmarking a network. Typically the training set contains all possible input patterns, so there is no question of generalization. In the later examples the training set is only a part of the problem domain, but the networks are able to generalize to previously unseen cases. Most of the examples employ straightforward back-propagation learning by gradient descent in two-layer networks with full connectivity between layers. Some of them illustrate that one can solve some nontrivial real-world problems without special tricks or refinements of this standard, off-the-shelf network. A few examples, on the other hand, require a little more thinking about the problem and consequent modification of the algorithm for a satisfactory solution.

XOR  The XOR problem was described on page 96, where we saw that it could not be solved by a single layer network because it is not linearly separable. However, there are several solutions using one hidden layer [Rumelhart, Hinton, and Williams, 1986b], two of which are shown in Fig. 6.5 for threshold units with 0/1 inputs and outputs. In solution (a) the two hidden units compute the logical OR (left) and AND (right) of the two inputs, and the output fires only when the OR unit is on and the AND unit is off. Solution (b) is not a conventional feed-forward architecture, but is interesting because it needs only two units; the hidden unit computes a logical AND to inhibit the output unit when both inputs are on.

6.3 Examples and Applications

131

FIGURE 6.5 Two networks that can solve the XOR problem using 0/1 threshold units. Each unit is shown with its threshold.

FIGURE 6.6 A network to solve the N = 4 parity problem with 0/1 threshold units.

Solutions like (a) are readily found using back-propagation with continuous-valued units. However, the weight values found become much larger than those shown in the figure, so as to drive the sigmoid units to saturation. The training time from random starting weights is surprisingly long; hundreds of epochs (passes through the training set) are required to get good results.
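The figure's exact weights are not reproduced here, but one concrete choice consistent with the description of solution (a), an OR unit and an AND unit feeding an output that fires when OR is on and AND is off, can be checked directly; the specific numbers below are ours, not necessarily those of the figure:

```python
import numpy as np

def step(h):                        # 0/1 threshold unit
    return (h > 0).astype(int)

w_hidden = np.array([[1.0, 1.0],    # hidden unit 1: OR  (threshold 0.5)
                     [1.0, 1.0]])   # hidden unit 2: AND (threshold 1.5)
t_hidden = np.array([0.5, 1.5])
w_out = np.array([1.0, -1.0])       # output fires when OR is on and AND is off
t_out = 0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden = step(w_hidden @ np.array(x) - t_hidden)
    out = step(w_out @ hidden - t_out)
    print(x, "->", int(out))        # prints 0, 1, 1, 0: the XOR function
```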

Parity  The parity problem is essentially a generalization of the XOR problem to N inputs [Minsky and Papert, 1969]. The single output unit is required to be on if an odd number of inputs are on, and off otherwise. One hidden layer of N units suffices to solve the problem, as shown in Fig. 6.6. Hidden unit j is on when at least j input units are on, and either excites or inhibits the output unit depending on whether j is odd or even. Again back-propagation finds this solution using continuous-valued units, except that the weights become scaled up by a large factor. The parity problem (or XOR, its N = 2 version) is often used for testing or evaluating network designs. It should be realized, however, that it is a very hard problem, because the output must change whenever any single input changes. This is untypical of most real-world classification problems, which usually have much more regularity and allow generalization within classes of similar input patterns [Fahlman, 1989].
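The construction of Fig. 6.6 is just as easy to verify for general N: hidden unit j computes Θ(Σ_k x_k − j + 1/2), and the output adds the hidden units with alternating signs. The thresholds below are the obvious ones implied by that description, not read off the figure:

```python
import numpy as np
from itertools import product

def parity_net(x, N):
    k = int(np.sum(x))
    hidden = np.array([1 if k >= j else 0 for j in range(1, N + 1)])   # unit j: at least j inputs on
    signs = np.array([(-1) ** (j + 1) for j in range(1, N + 1)])       # +1 for odd j, -1 for even j
    return 1 if signs @ hidden - 0.5 > 0 else 0                        # sum is 1 if k is odd, else 0

N = 4
ok = all(parity_net(x, N) == sum(x) % 2 for x in product([0, 1], repeat=N))
print(ok)    # True: the output is on exactly when an odd number of inputs are on
```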


FIGURE 6.7 A 5-3-5 encoder network.

Encoder  The general encoding problem involves finding an efficient set of hidden unit patterns to encode a large number of input/output patterns. The number of hidden units is intentionally made small to enforce an efficient encoding [Ackley, Hinton, and Sejnowski, 1985]. The specific problem usually considered involves auto-association, using identical unary input and output patterns. We take a two-layer network of N inputs, N output units, and M hidden units, with M < N. This is often referred to as an N-M-N encoder; Fig. 6.7 shows the architecture of a 5-3-5 encoder. There are exactly N members of the training set (p = N), each having one input and the corresponding target on, and the rest off: ζ_i^μ = ξ_i^μ = δ_iμ. A permutation of the outputs could also be used; it makes no difference to the network. One solution to the problem is to use binary coding on the hidden layer, so that the activation pattern of the hidden units gives the binary representation of μ, the pattern number. This can be achieved with connection strengths patterned after the binary numbers. Clearly we need M ≥ log₂N for such a scheme. But back-propagation tends to find alternative schemes, often using intermediate (non-saturated) values for the hidden units, and can sometimes solve the problem even when M < log₂N. Encoder problems have been widely used for benchmarking networks. One advantage is that they can be scaled up to any desired size, and another is that their difficulty can be varied by changing the ratio M/log₂N. They also have practical applications for image compression, as discussed later on page 136.

NETtalk  In the preceding examples the training set normally includes all possible input patterns, so no generalization issue arises. But from here onwards the training set is only a part of the problem domain, and the networks are able to generalize to



FIGURE 6.8 The NETtalk architecture.

previously unseen cases. In some cases the emergence of significant internal representations is also observed; some of the hidden units discover meaningful features in the data (meaningful to humans) and come to signify their presence or absence. The NETtalk project aimed at training a network to pronounce English text [Sejnowski and Rosenberg, 1987]. As shown in Fig. 6.8, the input to the network consisted of 7 consecutive characters from some written text, presented in a moving window that gradually scanned the text. The desired output was a phoneme code which could be directed to a speech generator, giving the pronunciation of the letter at the center of the input window. The architecture employed 7 × 29 inputs encoding 7 characters (including punctuation), 80 hidden units, and 26 output units encoding phonemes. The network was trained on 1024 words from a side-by-side English/phoneme source, obtaining intelligible speech after 10 training epochs and 95% accuracy after 50 epochs. It first learned gross features such as the division points between words and then gradually refined its discrimination, sounding rather like a child learning to talk. Some internal units were found to be representing meaningful properties of the input, such as the distinction between vowels and consonants. After training, the network was tested on a continuation of the side-by-side source, and achieved 78% accuracy on this generalization task, producing quite intelligible speech. Damaging the network by adding random noise to the connection strengths, or by removing some units, was found to degrade performance continuously (not catastrophically as expected for a digital computer), with rather rapid recovery upon retraining. It is interesting to compare NETtalk with the commercially available DEC-talk, which is based on hand-coded linguistic rules. There is no doubt that DEC-talk performs better than NETtalk, but this should be seen in the context of the effort involved in creating each system. Whereas NETtalk simply learned from examples, the rules embodied in DEC-talk are the result of about a decade of analysis by many linguists. This exemplifies the utility of neural networks; they are easy to construct and can be used even when a problem is not fully understood. However, rule-based algorithms usually out-perform neural networks when enough understanding is available.


Protein Secondary Structure  The primary structure of a protein consists of a linear sequence of amino acid residues chosen from 20 possibilities, like text with an alphabet of 20 letters. The secondary structure involves the local configuration of this linear strand, most commonly as an α-helix or a β-sheet. This secondary structure is then folded into the tertiary structure. In an obvious parallel to NETtalk, Qian and Sejnowski [1988a] constructed a network which took as input a moving window of 13 amino acids and produced as output a prediction of α-helix, β-sheet, or other for the central part of the sequence. After training on extensive protein folding data it achieved an accuracy of 62% on previously unseen sequences, compared to about 53% for the best available alternative (the Robson method). This scheme was invented independently by Bohr et al. [1988] and developed further. A neural network approach is thus currently the method of choice in this problem. The accuracy obtained may in fact be very close to the best possible achievable from a local window; interactions from residues far along the primary sequence (but spatially nearby in the tertiary structure) may be needed to do any better.

Hyphenation Algorithms  Here the problem is to determine the points at which a word may be hyphenated. This depends on whole syllables, not isolated letter-pairs, as shown by the examples propagation, propaganda, and propane. Reasonably good deterministic algorithms (with exception dictionaries) exist for English [Liang, 1983], but there is a current lack of good methods for languages such as Danish and German. A neural network is likely to be the quickest way to produce a working solution, again with an architecture like that of NETtalk, and is currently under study [Brunak and Lautrup, 1989, 1990].

Sonar target recognition  In another interesting example, Gorman and Sejnowski [1988a, b] trained the same kind of standard two-layer perceptron to distinguish between the reflected sonar signals from two kinds of objects lying at the bottom of Chesapeake Bay: rocks and metal cylinders. The network inputs were not taken directly from the raw signal x_t (where t is time), but were based on the frequency spectrum (Fourier transform) of that signal. This preprocessing was approximately linear,

$$z_k = \sum_t a_{kt}\,x_t \qquad (6.46)$$

(for discrete time and suitable coefficients a_kt), and so should not be necessary in principle; a network should be able to learn an appropriate linear transformation if one is needed. However, going to the frequency domain made it possible to use a much smaller number of input units than with raw time domain data.



FIGURE 6.9 Learning curves for sonar target recognition (averaged over many trials and smoothed), plotted against the number of training epochs. The notation 60-12-2 means 60 inputs, 12 hidden units, and 2 output units.

The network had 60 input units and two output units, one for rock and one for cylinder. The number of hidden units was varied from none (one-layer only) to 24. Figure 6.9 shows the learning curves, the percentage of correct classifications as a function of the number of presentations of the training data set, for 0, 3, and 12 hidden units. Without any hidden units the network rapidly reached about 80% correct performance, but then improved only very slowly. With 12 hidden units the performance came close to 100% accuracy, on every trial. No improvement was visible in the results on increasing the number of hidden units from 12 to 24. After training, the network was tested on new data not in the training set. About 85% correct classification was achieved with 12 hidden units, showing reasonable generalization ability. This was improved to about 90% when the training set was more carefully selected to contain examples of more of the possible signal patterns.

Navigation of a Car  Pomerleau has constructed a neural network controller for driving a car on a winding road [Pomerleau, 1989; Touretzky and Pomerleau, 1989]. The inputs to the network consist of a 30 × 32 pixel image from a video camera mounted on the roof of the car, and an 8 × 32 image from a range finder coding distances in a gray scale. These inputs are fed to a hidden layer of 29 units and from there to an output layer of 45 units arranged in a line. The output unit in the center represents "drive straight ahead" while gradually sharper left and right turns are represented by the units to the left and right. The network was trained on 1200 simulated road images using back-propagation. After training on each image about 40 times the network performed well, and the car could drive with a speed of about 5 km/hr (3 mph) on a road through a wooded area near the Carnegie-Mellon campus. The speed is limited by the time taken by


the small Sun-3 computer in the car to do a forward pass through the large network, and could obviously be increased by using some dedicated parallel hardware. Nevertheless the speed was about twice as fast as that obtained using any of the other (non-network) algorithms which were tried.

Image Compression  In high-definition television, the channel capacity is too small to allow transmission of all the information (intensity and color) characterizing every pixel on the screen. Practical transmission schemes must therefore exploit the redundancy that naturally exists in most images, encoding the picture in a much smaller number of bits than the total required to describe it exactly. The encoded or "compressed" image must then be decoded at the receiver into a full-sized picture. Ideally the encoding and decoding should be done in a way that optimizes the quality of the decoded picture. A great many ad hoc schemes have been tried on this problem. We can formulate this as a supervised learning problem by making the targets equal to the inputs (auto-association), and taking the compressed signal from the hidden layer. Then we just have a scaled-up version of the encoder task discussed earlier (page 132 and Fig. 6.7), sometimes called self-supervised back-propagation in this context. The input-to-hidden connections perform the encoding and the hidden-to-output connections do the decoding. This approach was tried by Cottrell et al. [1987]. The network they studied most took input from 8 × 8-pixel regions of the image (64 input units, each specified to 8-bit precision) and had 16 hidden units. It was trained by standard back-propagation on randomly selected patches of a particular image for typically 150,000 training steps. Then it was tested on the entire image patch by patch, using a complete set of nonoverlapping patches. Cottrell et al. obtained near state-of-the-art results from this simple procedure. As one might expect, the performance was somewhat worse when the network was tested on very different pictures, but it was still respectable. The fact that one can do so well with the simplest kind of architecture imaginable suggests that more complex networks could do even better. An interesting aspect of this problem is that nonlinearity in the hidden units is theoretically of no help [Bourland and Kamp, 1988], and indeed Cottrell et al. found that nonlinearity conferred no advantage in their simulations. Using linear units allows a detailed theoretical analysis [Baldi and Hornik, 1989], which shows that the network projects the input onto the subspace spanned by the first M principal components of the input, where M is the number of hidden units. Principal components are discussed in detail in Section 8.3. A projection onto principal components discards as little information as possible, by retaining those components of the input vector which vary the most.

Backgammon  In the game of backgammon each player in turn rolls two dice and then typically


has a choice of around 20 possible moves consistent with the dice. Look-ahead is difficult because of the number of possible rolls and the number of possible moves for each roll. A network was trained to score, on a scale of −100 (terrible move) to +100 (best move), triples {current position, dice values, possible move} using a set of 3000 examples hand-scored by an expert player [Tesauro and Sejnowski, 1988]. The input to the network consisted of the triple itself, suitably coded, plus some precomputed features such as pip count and degree of trapping, with 459 units in all. There were two hidden layers of 24 units each, and a single continuous-valued output unit for the score. Noise was added to the training data by including some positions with a randomly chosen score. During development notable errors made by the network were corrected by adding hand-crafted examples to the training set. Playing against Sun Microsystems' gammontool program, the final network won about 59% of the time. Note that generalization is almost always necessary; after the first few moves of a game it would be rare to encounter an example from the training set. The network exhibited a great deal of "common sense," almost invariably choosing the best move in situations that were transparent or intuitively clear to a human player. The success rate dropped to 41% without the precomputed features, showing their importance. Without the training set noise (but with the precomputed features) it was 45%, indicating that noise actually helps the system. A later version of "Neurogammon" won the gold medal at the computer olympiad in London, 1989. It defeated all other programs (five commercial and two non-commercial), but lost the game to a human expert by a score of 2-7 [Tesauro, 1990].

Signal Prediction and Forecasting  The problem of signal prediction can be treated in a purely empirical fashion, but the theoretical foundations of the problem are also of fundamental conceptual importance. Since our emphasis in this book is on theory, we digress for a few paragraphs to summarize the important ideas before taking up the operational problem and its neural network implementation. In many situations in science and technology we want to predict the future evolution of a system from past measurements on it. This is in fact the central procedure of classical physics: we make a mathematical model of a system by writing equations of motion and try to integrate them forward in time to predict the future state. Everyone who has taken an elementary physics course is familiar with this paradigm in systems with a small number of degrees of freedom. Mathematically, we would say that we describe the state of the system by a point x in a multi-dimensional space Γ (with one dimension for each degree of freedom), and the dynamics can be characterized as the motion of x in Γ. This procedure runs into trouble, however, in nonlinear systems with many degrees of freedom, like a turbulent fluid, the weather, or the economy. It is not practical to try to solve the equations of fluid dynamics explicitly except for rather simple special situations; we just cannot keep track of motion in such a high-dimensional


space. But all is not lost. Studies of the dynamics of apparently chaotic systems with many degrees of freedom reveal that dissipation (e.g., viscosity) can reduce the number of effectively relevant degrees of freedom to a small number. That is, the motion of the system, which in principle occurs in a very high-dimensional space Γ, becomes confined after some time to a subspace Γ_a of low dimensionality called an attractor. The attractor Γ_a is often a rather strange fractal object, with a dimensionality d that is not even an integer [Mandelbrot, 1982]. If we can somehow identify the coordinates which characterize the attractor Γ_a, then our problem becomes like one in simple physics, with only a few degrees of freedom. This sounds at first difficult or impossible: when there are so many degrees of freedom in the original description, how can we possibly know, without already having solved the problem, which combinations are the few relevant variables? But in fact this is not a difficulty at all. It is not at all crucial how we choose the new variables as long as there are enough of them, and a set of previous values of the quantity to be predicted is in fact a completely adequate choice. The physical argument [Packard et al., 1980] for expecting a result like this is that most measurements will probe some combination of the relevant variables (since the motion of the system lies on the attractor), and measurements at different times will in general probe different combinations. Thus a sequence of m such measurements should contain enough information to predict the motion on the attractor, provided m is sufficiently large compared to the attractor dimensionality d. Packard et al. tested this idea in numerical experiments and found support for it. There is also a rigorous theorem, proved by Takens [1981], which says that there exists a smooth function of at most 2d + 1 past measurements that correctly predicts the future value of the variable in question. The prediction is just as good as the one we would have made if we had been able to solve the complete system with its millions of degrees of freedom. Furthermore, if the measured values are known with infinite precision, then the result is insensitive to both the time interval between the past measurements and to how far ahead we want to predict the future measurement. But in practice noise and imprecision in the measurements limit how freely these quantities can be chosen. Thus we know that we can in principle reduce the prediction problem for a complex dynamical system whose motion lies on a low-dimensional attractor to something like an elementary physics problem. What Takens' theorem does not give us is the explicit form of the function which accomplishes the desired extrapolation. It is here that neural networks come into the picture. The idea is to train a feed-forward network using the set of variables

x(t), x(t − Δ), x(t − 2Δ), ..., x(t − (m − 1)Δ)

(6.47)

as the input pattern, and the (known) value x(t + T) as target, for many past values of t. In this way one is approximating the true extrapolation mapping by a function like (6.4), parameterized by the weights and thresholds of the network. Once the network has been trained, it can be used for prediction a time T into the future.
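Building the training pairs is mechanical: each input is a window of past values as in (6.47) and the target is the value a time T ahead. A sketch (the stand-in signal and all names are ours; in practice x would be the measured series):

```python
import numpy as np

def embed(x, m, delta, T):
    """Return (inputs, targets): each input row is
    x(t), x(t - delta), ..., x(t - (m-1)*delta); the target is x(t + T)."""
    inputs, targets = [], []
    for t in range((m - 1) * delta, len(x) - T):
        inputs.append([x[t - k * delta] for k in range(m)])
        targets.append(x[t + T])
    return np.array(inputs), np.array(targets)

x = np.sin(0.3 * np.arange(500)) + 0.1 * np.sin(0.05 * np.arange(500))  # stand-in signal
X, y = embed(x, m=4, delta=6, T=6)
print(X.shape, y.shape)     # each row of X is one input pattern for the feed-forward net
```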


Lapedes and Farber [1988] tested this idea using a signal x(t) produced by the numerical solution of the Mackey-Glass differential-delay equation [Mackey and Glass, 1977]

$$\frac{dx}{dt} = -0.1\,x(t) + \frac{0.2\,x(t-\tau)}{1 + x(t-\tau)^{10}}. \qquad (6.48)$$

Because of the delay τ, this is an infinite-dimensional system; one has to know the initial value of the function on a continuous interval. τ controls how chaotic the motion is: the larger τ, the larger the dimensionality of the attractor. For τ = 17 one finds d ≈ 2.1, and for τ = 30 one gets d ≈ 3.5. The Lapedes-Farber network had two hidden layers, because of arguments discussed later about how many layers it takes to approximate an arbitrary function efficiently. Furthermore, the single output unit was taken to be linear (g(x) = x) because the target takes on a continuous range of real values; there is no need for the saturation given by the usual sigmoidal nonlinearity. Training this network by ordinary back-propagation is very slow because of the numerical accuracy desired and the presence of two hidden layers. Lapedes and Farber were able to speed up the convergence by using a conjugate gradient technique. Still, they had to use a Cray supercomputer to achieve a prediction accuracy comparable to that achieved by another new method due to Farmer and Sidorowich [1987, 1988]. In the Farmer and Sidorowich approach one keeps a file of thousands of previous input vectors (6.47), stored in a way that makes it easy to find those vectors close to any chosen point in the m-dimensional space. When a new input vector is presented as the basis for a prediction of x(t + T), the program looks up the closest vectors in the file and notes where they evolved to after a time T. Then it does a linear interpolation from those examples, creating a linear fit to the t → t + T mapping in the neighborhood of x(t), and uses that to predict x(t + T). Both methods predict deterministic chaotic time series much better than traditional methods. The best results can often be achieved with a hybrid of the local-map and neural network schemes; this is discussed at the end of Chapter 9, since it involves unsupervised as well as supervised learning.

Recognizing Hand-Written ZIP Codes  A back-propagation network has been designed to recognize handwritten ZIP codes (numerical postal codes) from the U.S. mail [Le Cun et al., 1989]. The network employs many interesting ideas, some of which we will return to in the next section. Almost 10,000 digits recorded from the mail were used in training and testing the system. These digits were located on the envelopes and segmented into digits by another system, which in itself had to solve a very difficult task. The network input was a 16 × 16 array that received a pixel image of a particular handwritten digit, scaled to a standard size. As sketched in Fig. 6.10, this fed forward through three hidden layers to the 10 output units, each of which signified one of the digits 0-9.


FIGURE 6.10 Architecture of the ZIP-code reading network: a 16 by 16 input array feeds 12 feature detectors (8 by 8), then 12 feature detectors (4 by 4), then a layer of 30 units, and finally the 10 output units.

The first two hidden layers consisted of trainable feature detectors. The first hidden layer had 12 groups of units with 64 units per group. Each unit in a group had connections to a 5 × 5 square in the input array, with the location of the square shifting by two input pixels between neighbors in the hidden layer. All 64 units in a group had the same 25 weight values, so they all detected the same feature in different places on the retina. This weight sharing, and the 5 × 5 receptive fields, reduced the number of free parameters for the first hidden layer from almost 200,000 for fully connected layers to only (25 + 64) × 12 = 1068 (including independent thresholds for every unit). The second hidden layer was a very similar set of trainable feature detectors consisting of 12 groups of 16 units, again using 5 × 5 receptive fields, a 50% scale reduction, and weight sharing. Inputs to different units were taken from different combinations of 8 of the 12 groups in the first hidden layer, making 8 × 25 shared weights per second-layer group. The third hidden layer consisted of 30 units fully connected to all units in the previous layer, and the 10 output units were in turn fully connected to all the third-layer units. In all there were 1256 units and 9760 independent parameters. The network was trained by back-propagation, accelerated with the pseudo-Newton rule (6.43). It was trained on 7300 digits and tested on 2000, giving an error rate of about 1% on the training set and 5% on the test set. The error on the


test set could be reduced to about 1% by rejecting marginal cases with insufficient difference between the strongest and next strongest outputs. This procedure led to a rejection rate of about 12%. To obtain good generalization it is important to limit the number of free parameters of the network, as discussed later in this chapter. The weight sharing already reduces this number greatly, but the group has also been experimenting with what they call "optimal brain damage" to remove further weights from the network [Le Cun, Denker, and Solla, 1990]. Using information theoretic ideas a method was developed to make an optimal selection of unnecessary weights. These were then removed and the network was retrained. The resulting network had only around 1/4 as many free parameters as that described above, and worked even better: the rejection rate necessary to achieve 99% correct classification of the test set was reduced to about 9% [Le Cun, Boser, et al., 1990].
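The parameter bookkeeping behind weight sharing is easy to reproduce. The following sketch counts the free parameters of the first hidden layer described above; the function name and the fully-connected comparison are illustrative, not taken from Le Cun et al.

def shared_layer_params(n_groups, units_per_group, kernel_weights):
    """Free parameters of a weight-shared layer: one kernel per group plus one
    (unshared) threshold per unit."""
    return n_groups * (kernel_weights + units_per_group)

# First hidden layer of the ZIP-code network: 12 groups of 8 x 8 = 64 units,
# each looking at a 5 x 5 patch of the 16 x 16 input.
h1 = shared_layer_params(n_groups=12, units_per_group=64, kernel_weights=25)
print(h1)                                # (25 + 64) * 12 = 1068

fully_connected = (16 * 16 + 1) * 12 * 64
print(fully_connected)                   # 197,376 free parameters without sharing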

Speech Recognition

Speech recognition is one of the most studied tasks in artificial intelligence and neural networks. Speech is difficult to recognize for several reasons:

■ Speech is continuous; often there is no pause between words.

■ The speed of speech varies.

■ The meaning and pronunciation of a word are highly dependent on context.

■ Pronunciation, speed, and syntax are speaker dependent.

Current methods generally perform poorly on speaker-independent continuous speech recognition. Many groups have attempted to train various networks to perform this task. Because of the temporal structure of speech it is natural to consider using recurrent networks, as we describe in Section 7.3. Feed-forward networks have also been applied to some speech recognition problems, such as distinguishing among a set of words. The scheme is simple: input some representation of a spoken word and train the perceptron to recognize it. But in practice this only works on tasks limited to a small vocabulary and separate words. Lippmann [1989] gives a good review of the status of speech recognition by neural networks.

6.4 Performance of Multi-Layer Feed-Forward Networks

There are many theoretical questions concerning what multi-layer feed-forward networks can and cannot do, and what they can and cannot learn to do. How many layers are needed for a given task? How many units per layer? To what extent does the representation matter? When can we expect a network to generalize? What do we really mean by generalization? How do answers to these questions depend on the details of learning rate, momentum terms, decay terms, etc.?


Some of these questions, and similar problems, can be answered definitively, often with solid theorems. Answers to others are not yet in, but are under intensive study. The following discussion draws strongly on papers by Lapedes and Farber [1987, 1988] and Denker et al. [1987]. See also Lippmann [1987].

The Necessary Number of Hidden Units

First consider layered networks of continuous-valued units, with activation function g(u) = 1/(1 + e^(-u)) for hidden units and g(u) = u for output units. Overall such a network implements a set of functions y_i = F_i{x_k} from input variables x_k to output variables y_i, where {x_k} means x_1, x_2, ..., x_N. Explicitly, a network with no hidden layers computes

    y_i = Σ_k w_ik x_k - θ_i        (6.49)

and one with one hidden layer computes

    y_i = Σ_j W_ij g( Σ_k w_jk x_k - φ_j ) - θ_i        (6.50)

and so on. We have included explicit thresholds θ_i and φ_j for convenience. Suppose we want to approximate a particular set of functions F_i{x_k} to a given accuracy; how many hidden layers and how many units per layer do we need? The answer is at most two hidden layers, with arbitrary accuracy being obtainable given enough units per layer [Cybenko, 1988]. It has also been proved that only one hidden layer is enough to approximate any continuous function [Cybenko, 1989; Hornik et al., 1989]. The utility of these results depends, of course, on how many hidden units are necessary, and this is not known in general. In many cases it may grow exponentially with the number of input units, as occurs in the case of the general Boolean functions discussed later. Here we give a non-rigorous proof due to Lapedes and Farber [1988] that two hidden layers are enough. The essential points are that 1. any "reasonable" function F_i{x_k} can be represented by a linear combination of localized bumps that are each non-zero only in a small region of the domain {x_k}; and 2. such bumps can be constructed with two hidden layers.

With one variable x the function g(x) - g(x - c) obviously gives a peak at x = c/2 and is zero far from there. A sharp peak anywhere can be produced with g(ax + b) - g(ax + c). In two or more dimensions we can add together one such function for each dimension, producing a highest peak wherever desired, but also some secondary peaks and valleys. All but the highest peak can be suppressed, however, by another application of g(u) with a suitable threshold. In two dimensions, for example, the function

    g( A [ g(ax + b) - g(ax + c) + g(ay + d) - g(ay + e) ] - B )        (6.51)


can be used to produce a localized bump at any desired (x, y). This is of course exactly the type of function that we can calculate with two hidden layers. Following the same scheme in N dimensions, we need 2N units in the first hidden layer and one in the second hidden layer for each bump. The output layer then sums the bumps to produce the desired function(s), in a manner similar to Fourier analysis or Green's function representation. The bump approach may not be the best one for any particular problem, but it is only intended as an existence proof. More than two hidden layers may well permit a solution with fewer units in all, or may speed up learning. In fact the construction says nothing about learning (or generalization), and it is possible that some functions are representable but not learnable with two hidden layers, perhaps because of local minima. It is possible to construct units that themselves have a localized bump-like response, each becoming activated only for inputs in some small region of the input space. Not surprisingly, only one hidden layer of such units is needed to represent any reasonable function [Hartman et al., 1990]. These radial basis function units are discussed further in Section 9.7. Now let us consider Boolean functions, using x_k = ±1 (k = 1, 2, ..., N) for inputs, g(u) = sgn(u) on the single output unit, and g(u) = tanh(u) for the hidden units. How many layers and units per layer do we need to represent any Boolean function? We have already seen (page 94) that only linearly separable functions can be represented with no hidden layers. But just one hidden layer suffices to represent any Boolean function! A one-sentence proof for computer scientists consists of the observation that such a network in the deterministic threshold limit contains a programmable logic array (PLA) as a special case, a PLA contains a read-only memory (ROM) as a special case, and a ROM with N-bit addressing can obviously implement any Boolean function on N bits. For the rest of us, the proof is by construction. Use 2^N units in the hidden layer, j = 0, 1, ..., 2^N - 1, and set w_jk = +b if the kth digit in the binary representation of j is a 1, or w_jk = -b if it is a zero. Use a threshold of (N - 1)b at each hidden unit. Then one of the hidden units (the one coded in binary by the input) receives a net input of h_j = +b, while all others have h_j ≤ -b. Making b large enough thus turns one hidden unit on (≈ +1) and the rest off (≈ -1) for each input pattern. The hidden units are said to act as match filters or grandmother cells.[6] The second stage consists of a link W_ij = +c to the output unit i from grandmother cells representing inputs for which the answer is to be +1, and links W_ij = -c from the rest. The threshold on the output unit is set to -Σ_j W_ij, giving h_i = ±2c for the two answers. Of course this solution is not likely to be practical in the real world. The number of hidden units, and all fan-ins and fan-outs, grow exponentially with the number of input bits N. It is therefore useful to define efficient network solutions

[6] The term comes from discussion as to whether your brain might contain cells that fire only when you encounter your maternal grandmother, or whether such higher-level concepts are more distributed.


as those which use a total number of units that grows only polynomially in N. This immediately raises the question of what Boolean functions can be represented by such networks. Denker et al. [1987] call such functions NERFs, for Network Efficiently Representable Functions, and compare them to low order polynomials in curve fitting. Little is yet known as to what this class includes and excludes. It is actually rather hard to find useful Boolean functions that are definitely not NERFs.
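The bump construction of Eq. (6.51) can be checked numerically. The sketch below is only an illustration of the existence argument; the particular constants (a, A, B), the widths, and the bump centre are arbitrary choices, not values from the text.

import numpy as np

def g(u):
    # Logistic sigmoid used for the hidden units.
    return 1.0 / (1.0 + np.exp(-u))

def bump(x, y, x0=0.3, y0=0.6, a=40.0, width=0.1, A=20.0, B=30.0):
    """Two-hidden-layer 'bump' in the spirit of Eq. (6.51), centred near (x0, y0).
    First hidden layer: one sigmoid pair per input dimension (a ridge each).
    Second hidden layer: one sharp sigmoid that keeps only the main peak."""
    ridge_x = g(a * (x - x0 + width)) - g(a * (x - x0 - width))
    ridge_y = g(a * (y - y0 + width)) - g(a * (y - y0 - width))
    return g(A * (ridge_x + ridge_y) - B)

xs, ys = np.meshgrid(np.linspace(0, 1, 101), np.linspace(0, 1, 101))
z = bump(xs, ys)
print(z.max(), z[60, 30])   # close to 1 near (0.3, 0.6) ...
print(z[10, 80])            # ... and close to 0 elsewhere, including on single ridges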

Input Representation

Another issue is the importance of input representations. Do they really matter, or can we expect a network to be able to learn a task from any representation? The answer is that they certainly do matter. Consider first two predicates concerned with integers n: 1. n is odd. 2. n has an odd number of prime factors.

If n is presented to a network in binary then predicate 1 is easy; you just look at the lowest order bit. Indeed a simple perceptron is guaranteed to learn such a task from a small number of examples. On the other hand 2 is hard; factoring a number is a very hard problem in computer science and there is no reason to expect a network to learn it from a small training set. But suppose we change the representation? In base 3 rather than base 2, predicate 1 becomes hard too; every bit counts. If however we represented numbers by specifying their prime factors, then predicate 2 would become easy. Representation is thus crucial. Indeed, we can prove a silly theorem: learning will always succeed, given the right preprocessor. The proof makes clear why it is silly: let the preprocessor compute the answer and attach it as additional bits to the raw input; then a simple perceptron (or any generalization thereof) is guaranteed to learn to copy the answer bits and ignore the rest. It is also important to ask whether the information apparently represented in the input is actually available to the network. With fully connected layers, for example, the network has no inherent sense of the order of the inputs; a permutation of the input bits is a symmetry of the network. A problem that might appear easy for us, given a particular ordering of the input bits, might be much harder for such a network. An example studied by Denker et al. [1987] is the two-or-more-clumps predicate, in which the result is to be +1 if there are two or more consecutive sequences of +1's in the (ordered) input. Permutation makes the predicate quite obscure, as seen in Table 6.1 for the permutation (0123456789) → (3120459786). Actually a fully connected network learns this 10-input predicate fairly quickly, but it gets rapidly harder as the number of input nodes is made larger [Solla, 1989]. If ordering or another geometrical or topological property of the input is important, then the network should be told about it in some way. For example some input bits may be known to be related or "near" to each other, in the sense that their


TABLE 6.1 The two-or-more-clumps predicate. Each row of the table shows an original input (a string of +'s and -'s), the result of the predicate (-1, +1, -1, +1, -1 for the five examples), and the same input after the bits have been permuted; the clumps that are obvious in the original ordering become quite obscure in the permuted inputs.

correlation is likely to affect the output, whereas others may be "far" from each other, each having a more-or-less independent influence on the output. This occurs for instance when the input bits represent pixel illuminations in a spatial array. In such cases one may limit the spatial range of the input-to-hidden connections. Solla et al. [1988] found for the two-or-more-clumps predicate that this can improve both learning and generalization, and this kind of limited receptive field also proved useful in the ZIP-code reading network described earlier. Another idea is to add a penalty term to the cost function, such as one proportional to Σ_ik |w_ik|^2, which discourages unnecessarily large weights.

6.5 A Theoretical Framework for Generalization

As training examples are presented, the volume of weight space consistent with all of them can only shrink: V_0 ≥ V_1 ≥ V_2 ≥ ... ≥ V_p; see Fig. 6.14. The fraction of the remaining weight space belonging to a particular function f is modified after learning p examples from R_0(f) to

    R_p(f) = V_p(f) / V_p        (6.60)

where V_p(f) is the volume of weight space consistent with both f and the training examples:

    V_p(f) = ∫ dw ρ(w) Θ_f(w) Π_{μ=1..p} I(f_w, ξ^μ)        (6.61)

           = V_0(f) Π_{μ=1..p} I(f, ξ^μ).        (6.62)

Note that we took the I(f, ξ^μ) factors outside of the integral, which then reduced it to V_0(f), because the Θ_f(w) factor makes f_w equal to f in all non-vanishing cases. The result (6.62) shows that V_p(f) is either equal to V_0(f) or is 0, according to whether or not f agrees with the training examples. The corresponding entropy

    S_p = - Σ_f R_p(f) log_2 R_p(f)        (6.63)

is a measure of how many implementable functions are compatible with the training set. As training proceeds it decreases steadily, and would go to zero if we ever reached the stage where only the desired function f̄ was possible. Since S_p is actually a measure of the information required to specify a particular function, the difference S_{p-1} - S_p tells us how much information (in bits) is gained by training on the pth example. This cannot be more than the information required to specify the output, which is one bit for a single binary output, so we would expect S_p = S_0 - p if training were perfectly efficient. We can use this idea, along with an initial estimate of S_0, in a couple of ways: to bound the number of training examples needed to learn a function f̄, or to estimate the actual efficiency of training [Denker et al., 1987]. In this discussion we have discriminated sharply between weights which are consistent and inconsistent with particular examples, using the I(f, ξ^μ) factors in (6.61) and (6.62). For continuous-valued outputs we would have to relax this assumption, replacing the sharp I(f, ξ^μ) by a smooth function exp(-βε_p) of the error ε_p in the pth example. This function falls off gradually from 1 if there is no error


to 0 for large error, with a rate governed by the parameter β. Introducing such exponential factors makes (6.61) and (6.62) look exactly like partition function calculations, and indeed they can be treated by statistical mechanics methods [Tishby et al., 1989]. However, we restrict ourselves in this book to the sharp yes/no case. So far we have said nothing about the training sequence ξ^1, ξ^2, ..., ξ^p. Let us now assume that each input pattern is chosen randomly from some distribution P(ξ) of possible inputs, without any dependence on previous choices or on success rate. Then each factor I(f, ξ^μ) in (6.62) is independent of the others, and we can average over possible training sequences to obtain

    ⟨V_p(f)⟩ = V_0(f) ⟨ Π_{μ=1..p} I(f, ξ^μ) ⟩ = V_0(f) g(f)^p.        (6.64)

Here the averages are over ξ^1 to ξ^p, with the appropriate weights P(ξ^μ), and

    g(f) = ⟨ I(f, ξ) ⟩ = Prob( f(ξ) = f̄(ξ) )        (6.65)

is the probability that a particular function f agrees with f̄ for an input ξ randomly chosen from P(ξ). The quantity g(f) is usually called the generalization ability of f; it tells us how well f conforms to f̄. Note that it is between 0 and 1, and is independent of the training examples. For example, suppose that f and f̄ are Boolean functions of N inputs, and that all 2^N input patterns are equally likely. The functions can each be defined by the 2^N bits in their truth tables. Then g(f) is the fraction of those bits where the two truth tables agree. Now we consider the probability P_p(f) that a particular function f can be implemented after training on p examples of f̄. Our basic approach is to take such probabilities as proportional to weight-space volume, so P_p(f) is equal to the average fraction of the remaining weight space that f occupies:

    P_p(f) = ⟨ V_p(f) / V_p ⟩ ≈ ⟨V_p(f)⟩ / ⟨V_p⟩.        (6.66)

The approximation is based on the assumption that V_p will not vary much with the particular training sequence, so V_p ≈ ⟨V_p⟩ for each probable sequence; we say that V_p is self-averaging. This assumption is expected to be good as long as p is small compared to the total number of possible inputs. We can use (6.66) to compute something more useful, the distribution of generalization ability g(f) across all possible f's:

    P_p(g) = Σ_f P_p(f) δ( g - g(f) ).

But it also leaves intact the q correct classifications by V_1, for which we find

    O_1^μ = sgn( Σ_j w_1j V_j^μ ) = sgn( V_1^μ + ε Σ_j V_j^μ V_j^{μ0} ).        (6.92)

Because (N - 2)ε < 1, the second term in the sign function could only change the correct sign of the first term if |X| = N, where

    X = Σ_j V_j^μ V_j^{μ0}.        (6.93)

But X cannot be -N because V_0^μ = V_0^{μ0} = 1, and it cannot be +N because V^μ and V^{μ0} are not identical if the level-L representation is faithful. So the q originally correct patterns are not upset, and the new unit O_1 classifies at least q + 1 patterns correctly, as claimed. It is not yet clear which of the three methods described is best for a given problem. All of them have given encouraging results on simple test problems, both in terms of generalization and in terms of finding efficient architectures. In one comparative test the upstart algorithm used fewer units than the tiling algorithm [Frean, 1990], but more wide-ranging studies are needed. It is also likely that further construction algorithms will be proposed in the future.

SEVEN Recurrent Networks

The preceding chapter was concerned strictly with supervised learning in feed-forward networks. We now turn to supervised learning in more general networks, with connections allowed both ways between a pair of units, and even from a unit to itself. These are usually called recurrent networks. They do not necessarily settle down to a stable state even with constant input. Symmetric connections (w_ij = w_ji) ensure a stable state of course (as seen in the Hopfield networks), and the Boltzmann machines discussed first are limited to the symmetric case. In the second section we consider networks without the symmetry constraint, but only treat those that do reach a stable state. Then we examine networks that can learn to recognize or reproduce time sequences; some of these produce cyclic output rather than a steady state. We conclude with a discussion of reinforcement learning in recurrent and non-recurrent networks.

7.1 Boltzmann Machines

Hinton and Sejnowski [Hinton and Sejnowski, 1983, 1986; Ackley, Hinton and Sejnowski, 1985] introduced a general learning rule applicable to any stochastic network with symmetric connections, w_ij = w_ji. They called this type of network a Boltzmann machine because the probability of the states of the system is given by the Boltzmann distribution of statistical mechanics. Boltzmann machines may be seen as an extension of Hopfield networks to include hidden units. Just as in feed-forward networks with hidden units, the problem is to find the right connections to the hidden units without knowing from the training patterns what the hidden units should represent.


FIGURE 7.1 (a) A Boltzmann machine has the units divided into visible and hidden ones. (b) The visible units can be divided into input and output units. (In the figure, filled circles denote hidden units and open circles visible units.)

In its original form, the Boltzmann learning algorithm is very slow because of the need for extensive averaging over stochastic variables. There have therefore been few applications compared to back-propagation. But a deterministic "mean field" version of the Boltzmann machine speeds up the learning considerably and may lead to more applications. Note that stochastic networks are sometimes called Boltzmann machines even if the weights are chosen a priori and there is no learning involved. We avoid this usage, and discuss stochastic networks along with their deterministic counterparts in most chapters of this book.

Stochastic Units

The units S_i are divided into visible and hidden units as shown in Fig. 7.1(a). The visible units can, but need not, be further divided into separate input and output units as in Fig. 7.1(b). The hidden units have no connection to the outside world. The connections between units may be complete (every pair) or structured in some convenient way. For example, one might have everything except direct input-output connections. But whatever the choice, all connections must be symmetric, w_ij = w_ji.

The units are stochastic, taking output value S_i = +1 with probability g(h_i) and value S_i = -1 with probability 1 - g(h_i), where

    h_i = Σ_j w_ij S_j        (7.1)

and

    g(h) = 1 / ( 1 + exp(-2βh) )        (7.2)


just as in (2.48). Here β = 1/T as usual, and we omit thresholds for convenience. Because of the symmetric connections (cf. Chapter 2) there is an energy function

    H{S_i} = -(1/2) Σ_ij w_ij S_i S_j        (7.3)

which has minima whenever there is a stable state characterized by S_i = sgn(h_i). The Boltzmann-Gibbs distribution from statistical mechanics,

    P{S_i} = e^{-βH{S_i}} / Z,        (7.4)

gives, at least in principle, the probability of finding the system in a particular state {S_i} after equilibrium is reached. The denominator Z, called the partition function, is the appropriate normalization factor

    Z = Σ_{S_i} e^{-βH{S_i}}.        (7.5)

A more detailed discussion of the Boltzmann-Gibbs distribution is provided in the Appendix. We can use (7.4) to compute the average ⟨X⟩ of any quantity X{S_i} that depends just on the state {S_i} of the system:

    ⟨X⟩ = Σ_{S_i} P{S_i} X{S_i}.        (7.6)
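For readers who want to experiment, a minimal simulation of such stochastic units is easy to write. The sketch below performs asynchronous updates according to (7.1) and (7.2) and estimates equilibrium averages like ⟨S_i S_j⟩ by time averaging; the network size, weights, and sweep counts are arbitrary illustrative choices.

import numpy as np

def glauber_sweep(S, w, beta, rng):
    """One asynchronous sweep of stochastic units obeying Eqs. (7.1)-(7.2).
    S: vector of +/-1 states; w: symmetric weight matrix with zero diagonal."""
    for i in rng.permutation(len(S)):
        h = w[i] @ S                                     # Eq. (7.1), no thresholds
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * h))   # Eq. (7.2)
        S[i] = 1 if rng.random() < p_plus else -1
    return S

rng = np.random.default_rng(0)
n = 8
w = rng.normal(0, 0.5, (n, n)); w = (w + w.T) / 2; np.fill_diagonal(w, 0)
S = rng.choice([-1, 1], n)
corr = np.zeros((n, n))
for sweep in range(2000):
    S = glauber_sweep(S, w, beta=1.0, rng=rng)
    if sweep >= 500:                                     # discard the transient
        corr += np.outer(S, S)
corr /= 1500                                             # estimate of <S_i S_j>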

In Boltzmann learning we attempt to adjust the connections to give the states of the visible units a particular desired probability distribution. We might, for example, want only a few of the possible states to have appreciable probability of occurring, and want those few to have equal probability. At low temperature we can use such a scheme for pattern completion, in which missing bits of a partial pattern are filled in by the system. Or we might divide the visible units into input and output units and do ordinary association. Our information as to the "correct" output for each input pattern might itself be probabilistic (or fuzzy), as for example in medical diagnosis of diseases from patterns of symptoms [Hopfield, 1987]. Asking for an approximation to a probability distribution is a general goal covering these and other cases. The task is similar to that given an auto-associative (Hopfield) network, but the architecture of the Boltzmann machine differs in having hidden units. Without hidden units we can do no more than specify the correlations ⟨S_i S_j⟩ among the visible units themselves. Let α label the states of the visible units and β the states of the hidden units.[1] The probability of finding the visible units in state α in the freely running network is then

    P_α = Σ_β e^{-βH_αβ} / Z        (7.7)

where

    Z = Σ_{αβ} e^{-βH_αβ}        (7.8)

and

    H_αβ = -(1/2) Σ_ij w_ij S_i^{αβ} S_j^{αβ}        (7.9)

is the energy of the system in state αβ, in which S_i^{αβ} is the value (±1) of S_i. Equation (7.7) gives the actual probability P_α of finding the visible units in state α in the freely running system, and is determined by the w_ij's. We, on the other hand, have a set of desired probabilities R_α for these states. A suitable measure of the difference between the distributions P_α and R_α, properly weighted by the occurrence probabilities R_α, is the relative entropy

    E = Σ_α R_α log( R_α / P_α ).        (7.10)

This may be derived from information-theoretic arguments as for (5.52), though we will discuss an alternative statistical mechanics interpretation below. E is always positive or zero, and can only be zero if P_α = R_α for all α.[2] We therefore minimize E, using gradient descent:

    Δw_ij = -η ∂E/∂w_ij = η Σ_α (R_α / P_α) ∂P_α/∂w_ij.        (7.11)

[1] We use β in this sense as well as to mean 1/T, but the latter never occurs as an index or subscript.
[2] It is easy to show from the integral definition of the logarithm that log X ≥ 1 - 1/X. Therefore E ≥ Σ_α R_α (1 - P_α/R_α) = 0.


Using (7.7)-(7.9) we find

    ∂P_α/∂w_ij = β [ (1/Z) Σ_β S_i^{αβ} S_j^{αβ} e^{-βH_αβ} - P_α ⟨S_i S_j⟩ ].        (7.12)

Thus

    Δw_ij = ηβ [ Σ_α (R_α/P_α) Σ_β P_{αβ} S_i^{αβ} S_j^{αβ} - ⟨S_i S_j⟩ ]

where P_{αβ} = e^{-βH_αβ}/Z. The first term is just the correlation ⟨S_i S_j⟩^clamped measured when the visible units are clamped according to the desired probabilities R_α, so this is the standard Boltzmann learning rule (7.13): Δw_ij = ηβ [ ⟨S_i S_j⟩^clamped - ⟨S_i S_j⟩^free ]; a weight is strengthened when its two units are correlated more strongly in the clamped phase than in the free phase. In the deterministic mean field version of the algorithm the stochastic averages ⟨S_i⟩ are replaced by mean field values m_i obeying m_i = tanh( β Σ_j w_ij m_j ), and the free energy of the freely running network is approximated by

    F_MF = -(1/2) Σ_ij w_ij m_i m_j + T Σ_i [ ((1 + m_i)/2) log((1 + m_i)/2) + ((1 - m_i)/2) log((1 - m_i)/2) ]        (7.29)

using S_i → m_i in the energy (7.3). A more formal derivation gives the same result [Peterson and Anderson, 1987]. A similar expression F_MF^c is found for the clamped states, the only difference being that the visible m_i's are fixed at ±1. Minimizing F_MF with respect to m_i by setting ∂F_MF/∂m_i = 0 gives (7.25) again. We may also regard this deterministic version of the Boltzmann machine independently of the original stochastic version, simply looking at the problem of minimizing the cost function

    E_MF = E_0 + β [ F_MF^c - F_MF ]        (7.30)

with respect to the w_ij's [Hinton, 1989]. Equation (7.30) is just the mean field approximation to (7.21). Note that a double minimization is still involved; for each choice of the w_ij's we must minimize (7.29) with respect to the m_i's by iteration of equations (7.26). Using gradient descent for the minimization gives directly

    Δw_ij = -η ∂E_MF/∂w_ij = ηβ [ m_i^c m_j^c - m_i m_j ]        (7.31)

which is the standard result (7.13) within the approximation (7.24).
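A compact sketch of the deterministic (mean field) learning loop may help fix ideas: the mean field equations are iterated once with the visible units clamped and once freely, and the weights are then changed according to (7.31). The damping factor, iteration counts, and random initialization below are illustrative choices, not prescriptions from the text.

import numpy as np

def mean_field(w, beta, clamp=None, iters=200, rng=None):
    """Iterate m_i = tanh(beta * sum_j w_ij m_j) to an approximate fixed point.
    clamp: optional dict {unit index: +/-1} holding visible units fixed."""
    rng = rng or np.random.default_rng(0)
    m = rng.uniform(-0.1, 0.1, len(w))          # small random start for the free phase
    for _ in range(iters):
        m_new = np.tanh(beta * (w @ m))
        if clamp:
            for i, v in clamp.items():
                m_new[i] = v
        m = 0.5 * m + 0.5 * m_new               # damped iteration for stability
    return m

def mf_boltzmann_step(w, beta, patterns, eta=0.05):
    """One deterministic Boltzmann learning step in the spirit of (7.31):
    Delta w_ij = eta * beta * ( <m_i m_j>_clamped - m_i m_j ).
    patterns: list of clamp dictionaries, one per training pattern."""
    clamped = np.mean([np.outer(m, m)
                       for m in (mean_field(w, beta, clamp=p) for p in patterns)], axis=0)
    m_free = mean_field(w, beta)
    dw = eta * beta * (clamped - np.outer(m_free, m_free))
    np.fill_diagonal(dw, 0.0)
    return w + (dw + dw.T) / 2                  # keep the connections symmetric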

7.2 Recurrent Back-Propagation

Pineda [1987, 1988, 1989], Almeida [1987, 1988], and Rohwer and Forrest [1987] have independently pointed out that back-propagation can be extended to arbitrary networks as long as they converge to stable states. At first sight it appears that an N × N matrix inversion is required for each learning step, but Pineda and Almeida each showed that a modified version of the network itself can calculate what is required much more rapidly. The algorithm is usually called recurrent back-propagation. Consider N continuous-valued units V_i with connections w_ij and activation function g(h). Some of these units may be designated as input units and have input


FIGURE 7.2 Recurrent back-propagation can handle arbitrary connections between the units. (a) Units 1 and 2 are output units with targets ζ_1 and ζ_2. Units 1, 3, and 5 are input units. Note that unit 1 is both an input and an output unit, while unit 4 is neither. (b) The corresponding error-propagation network, with reversed connections and error inputs E_1 and E_2.

values specified in pattern μ. We define ξ_i^μ = 0 for the rest. Similarly some may be output units, with desired training values ζ_i^μ. Figure 7.2(a) provides an example. Henceforth we drop the pattern index μ for convenience; appropriate sums over μ are implicit. Various dynamical evolution rules may be imposed on the network, such as

    τ dV_i/dt = -V_i + g( Σ_j w_ij V_j + ξ_i )        (7.32)

which is based on (3.31). It is easily seen that this dynamical rule leads to the right fixed points, where dV_i/dt = 0, given by

    V_i = g( Σ_j w_ij V_j + ξ_i ).        (7.33)

We assume that at least one such fixed point exists and is a stable attractor. Alternatives such as limit cycles and chaotic trajectories will be considered in the next section. A natural error measure for the fixed point is the quadratic one

    E = (1/2) Σ_k E_k^2        (7.34)

where

    E_k = ζ_k - V_k   if k is an output unit;   E_k = 0 otherwise.        (7.35)


Gradient descent gives

    Δw_pq = -η ∂E/∂w_pq = η Σ_k E_k ∂V_k/∂w_pq        (7.36)

and to evaluate ∂V_k/∂w_pq we differentiate the fixed point equation (7.33) to obtain

    ∂V_i/∂w_pq = g'(h_i) [ δ_ip V_q + Σ_j w_ij ∂V_j/∂w_pq ].        (7.37)

Collecting terms this may be written as

    Σ_j L_ij ∂V_j/∂w_pq = δ_ip g'(h_i) V_q        (7.38)

where

    L_ij = δ_ij - g'(h_i) w_ij        (7.39)

and

    h_i = Σ_j w_ij V_j + ξ_i        (7.40)

is the net input to unit i when the network is at the attractor. Inverting the linear equations (7.38) gives

    ∂V_k/∂w_pq = (L^{-1})_{kp} g'(h_p) V_q        (7.41)

and thus from (7.36)

    Δw_pq = η Σ_k E_k (L^{-1})_{kp} g'(h_p) V_q.        (7.42)

This may be written in the familiar delta-rule form

    Δw_pq = η δ_p V_q        (7.43)

if we define

    δ_p = g'(h_p) Σ_k E_k (L^{-1})_{kp}.        (7.44)

Equation (7.43) is our new learning rule. As it stands it requires a matrix inversion for the δ's, which could be done numerically [Rohwer and Forrest, 1987]. If however we write

    δ_p = g'(h_p) Y_p        (7.45)

so that

    Y_p = Σ_k E_k (L^{-1})_{kp}        (7.46)


then we can undo the inversion and obtain linear equations for the Y_p:

    Σ_p L_pi Y_p = E_i        (7.47)

or, using (7.39),

    Y_i - Σ_p g'(h_p) w_pi Y_p = E_i.        (7.48)

This equation is of the same form as the original fixed point equation (7.33), and can be "solved" in the same way, by the evolution of a new error-propagation network with a dynamical equation analogous to (7.32):

    τ dY_i/dt = -Y_i + Σ_p g'(h_p) w_pi Y_p + E_i.        (7.49)

The topology of the required error-propagation network is the same as that of the original network, with the coupling w_ij from j to i replaced by g'(h_i)w_ij from i to j, a simple linear transfer function (g(x) = x), and an input term E_i (the error at unit i in the original network) instead of ξ_i. Figure 7.2(b) shows the error-propagation network corresponding to Fig. 7.2(a). In electrical network theory the two networks would be called network transpositions of one another. The whole procedure is thus: 1. relax the original network with (7.32) to find the V_i's; 2. compare with the targets to find the E_i's from (7.35); 3. relax the error-propagation network with (7.49) to find the Y_i's; and 4. update the weights using (7.43) and (7.45). This allows us to extend back-propagation to general networks without requiring any non-local operations (such as matrix inversion) that go outside the original connectivity of the network. Back-propagation itself is a special case of the general procedure; without recurrent connections the relaxation steps can be replaced by the corresponding fixed-point equations (7.33) and (7.48). One can also have the two networks running together. The original network continuously supplies the error-propagation network with the error signals E_i, and the error-propagation network in turn adjusts the weights in the original network. A similar idea was put forward earlier by Lapedes and Farber [1986a, 1986b] in their master-slave network, where the master network calculates the weights for the slave. However, they had one master unit for each connection in the slave network (N^2 master units for N slave units), and made the master network calculate appropriate weights without using the slave for feedback. The problem solved by the master, finding weights for the slave to minimize the total sum-of-squares error, was essentially an optimization problem, and was treated in much the same way as those described in Chapter 4.
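The four-step procedure can be summarized in a few lines of Python. This is a sketch under simplifying assumptions (tanh units, a crude Euler relaxation with fixed step size and step count), not a tuned implementation.

import numpy as np

def relax(f, x0, steps=400, dt=0.05):
    """Integrate dx/dt = f(x) by Euler steps until (approximately) stationary."""
    x = x0.copy()
    for _ in range(steps):
        x = x + dt * f(x)
    return x

def rbp_step(w, xi, zeta, out_mask, eta=0.1):
    """One recurrent back-propagation update (steps 1-4 in the text).
    w: weight matrix, xi: input vector (zeros for non-input units),
    zeta: target vector, out_mask: boolean vector marking output units."""
    g  = np.tanh
    dg = lambda h: 1.0 - np.tanh(h) ** 2

    V = relax(lambda V: -V + g(w @ V + xi), np.zeros(len(xi)))   # 1. Eq. (7.32)
    h = w @ V + xi                                               #    Eq. (7.40)
    E = np.where(out_mask, zeta - V, 0.0)                        # 2. Eq. (7.35)
    Y = relax(lambda Y: -Y + (dg(h) * Y) @ w + E,                # 3. Eq. (7.49)
              np.zeros(len(xi)))
    delta = dg(h) * Y                                            # 4. Eq. (7.45)
    w += eta * np.outer(delta, V)                                #    Eq. (7.43)
    return w, V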


It is worth noting that (7.48) is a stable attractor of (7.49) if (7.33) is a stable attractor of (7.32) (as we assumed). This may be shown by linearizing the dynamical equations about their respective fixed points, just as we did in (3.36). Writing V_i = V_i* + ε_i in (7.32) and Y_i = Y_i* + η_i in (7.49), where V_i* and Y_i* are respectively the solutions of (7.33) and (7.48), we obtain to linear order

    τ dε_i/dt = -ε_i + g'(h_i) Σ_j w_ij ε_j = -Σ_j L_ij ε_j        (7.50)

and

    τ dη_i/dt = -η_i + Σ_p g'(h_p) w_pi η_p = -Σ_p L_pi η_p.        (7.51)

These linear equations are congruent except that L is replaced by its transpose L^T. But L and L^T have the same eigenvalues, so the local stability of the two equations is the same. Thus both attractors are stable (eigenvalues of L all positive) if one is. A further analysis of stability and convergence properties has been provided by Simard et al. [1989]. The scaling properties of the algorithm are interesting even on a sequential digital computer where no parallelism or hardware network is exploited [Pineda, 1989]. If we consider a fully connected network of N units, then L is an N × N matrix, and using (7.44) with direct matrix inversion would take a time proportional to N^3 using standard methods. In comparison, the time required to integrate (7.49) numerically scales as only N^2 provided the fixed points are stable. This translates into a great saving for large N. Of course both methods are much better than direct numerical differentiation of the error measure (7.34) with respect to all N^2 parameters, which would scale as N^4 (a relaxation of (7.32), taking order N^2 steps, would be needed for each parameter). Using g(x) = tanh(x) and choosing the entropic error measure (5.52) instead of the quadratic one (7.34) simplifies the derivation a little and leads to slightly different equations. Almeida [1987] has shown that recurrent networks give a large improvement in performance over normal feed-forward networks for a number of problems such as pattern completion. Qian and Sejnowski [1988b] have used recurrent back-propagation to train a network to calculate stereo disparity in random-dot stereograms. The network configures itself to use the Marr and Poggio [1976] algorithm in some cases, and finds a new algorithm in others. Barhen et al. [1989] have applied a modified version of recurrent back-propagation, including terminal attractors, to inverse kinematics problems for the control of robot manipulators.

7.3 Learning Time Sequences

In Section 3.5 we discussed ways of making a network generate a temporal sequence of states, usually in a limit cycle. This naturally required a recurrent network with


asymmetric connections; neither a feed-forward network nor a network with symmetric connections will do, because they necessarily go to a stationary state. The desired sequences were embedded into the network by design; the w_ij's were calculated, not learned. We now turn to the problem of learning sequences. There are actually three distinct tasks:

■ Sequence Recognition. Here one wants to produce a particular output pattern when (or perhaps just after) a specific input sequence is seen. There is no need to reproduce the input sequence. This is appropriate, for example, for speech recognition problems, where the output might indicate the word just spoken.

■ Sequence Reproduction. In this case the network must be able to generate the rest of a sequence itself when it sees part of it. This is the generalization of auto-association or pattern completion to dynamic patterns. It would be appropriate for learning a set of songs, or for predicting the future course of a time series from examples.

■ Temporal Association. In this general case a particular output sequence must be produced in response to a specific input sequence. The input and output sequences might be quite different, so this is the generalization of hetero-association to dynamic patterns. It includes as special cases pure sequence generation and the previous two cases.

We discuss in this section several approaches that can learn to do one or more of these tasks. The first task, sequence recognition, does not necessarily require a recurrent network, but belongs naturally with the other cases, which clearly do need recurrency.

Tapped Delay Lines

The simplest way to perform sequence recognition (but not sequence reproduction or general temporal association) is to turn the temporal sequence into a spatial pattern on the input layer of a network. Then conventional back-propagation methods can be used to learn and recognize sequences. We already studied examples of this approach in Chapter 6. In NET-talk (page 132) a text string was moved along in front of a 7-character window, which thus received at a particular time characters n, n-1, ..., n-6 from the string. In our signal processing example (page 137) the values x(t), x(t - Δ), ..., x(t - (m - 1)Δ) from a signal x(t) were presented simultaneously at the input of a network. In a practical network these values could be obtained by feeding the signal into a delay line that was tapped at various intervals, as sketched in Fig. 7.3. Or, in a synchronous network, a shift register could be used, in effect keeping several "old" values in a buffer [Kohonen, 1989]. The resulting architectures are sometimes called time-delay neural networks.
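Converting a temporal sequence into spatial input patterns is a one-line bookkeeping operation; the following sketch (function name and sampling choices are ours) builds the x(t), x(t - Δ), ..., x(t - (m - 1)Δ) rows that a conventional feed-forward network would then be trained on.

import numpy as np

def delay_line_inputs(signal, m, delta):
    """Turn a 1-D signal into rows (x(t), x(t-delta), ..., x(t-(m-1)*delta)),
    one row per admissible t, mimicking a tapped delay line or shift register."""
    start = (m - 1) * delta
    return np.stack([signal[t - np.arange(m) * delta]
                     for t in range(start, len(signal))])

x = np.sin(0.1 * np.arange(200))
X = delay_line_inputs(x, m=4, delta=5)
print(X.shape)   # (185, 4): each row is one spatial input pattern for the network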


FIGURE 7.3 A time-delay neural network. Only one input x(t) is shown. x(t), x(t - τ), ..., x(t - 4τ) are fully connected to a hidden layer.

The tapped delay line scheme has been widely applied to the problem of speech recognition [e.g., McClelland and Elman, 1986; Elman and Zipser, 1988; Tank and Hopfield, 1987b; Waibel et al., 1989; Waibel, 1989; Lippmann, 1989]. Here the signal at a particular time is a set of spectral coefficients (strength in each of several frequency bands), so with a set of different time delays one has input patterns in a two-dimensional time-frequency plane. Different phonemes show different patterns in this plane and can be learned by an ordinary feed-forward network. Similar approaches can be applied to other units of speech or written language, such as syllables, letters, and words. There are several drawbacks to this general approach to sequence recognition [Mozer, 1989]. The length of the delay line or shift register must be chosen in advance to accommodate the longest possible sequence; we cannot work with arbitrary-length sequences. The large number of units in the input layer leads to slow computation (unless fully parallel) and requires a lot of training examples for successful learning and generalization. And, perhaps most importantly, the input signal must be properly registered in time with the clock controlling the shift register or delay line, and must arrive at exactly the correct rate. Tank and Hopfield [1987a] suggested a way of compensating for the last of these problems, thus making the scheme more robust. Given a raw signal x(t) the usual delay line technique would be to use x(t), x(t - τ_1), x(t - τ_2), ..., x(t - τ_m) for the network inputs at time t. Tank and Hopfield suggest replacing the fixed delays by filters that broaden the signal in time as well as delaying it, with greater broadening for longer delays. They used

    y(t; τ_i) = ∫_{-∞}^{t} G(t - t'; τ_i) x(t') dt'        (7.52)

for the network inputs at time t, with

    G(t; τ_i) = (t/τ_i)^a exp( a(1 - t/τ_i) )        (7.53)


FIGURE 7.4 The function G(t - t'; τ_i) given by (7.53) for τ_i = 1, 2, 3, 4, 5 with a = 10. The network inputs at time t are averages of the past signal weighted by these functions.

for each τ_i. This is normalized to have maximum value 1 when t = τ_i. The parameter a controls the broadening; for larger a the function G has a narrower peak. Figure 7.4 shows an example at a = 10. This scheme can successfully compensate for phase and speed variations if they are not too large. Note too that it works in continuous time; there is no need for precise synchronization by a central clock. Schemes like this using a variety of delays are biologically feasible, and may be used in the brain, in auditory cortex for example [Hopfield and Tank, 1989]. Shift registers might also occur biologically [Anderson and Van Essen, 1987].
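A short sketch of the broadened-delay idea, using the kernel form reconstructed in (7.53) above; the normalization and the discrete convolution used to approximate (7.52) are our illustrative choices.

import numpy as np

def G(t, tau, a=10.0):
    """Broadened-delay kernel of Eq. (7.53) (as reconstructed above):
    peaks with value 1 at t = tau, wider for larger tau, narrower for larger a."""
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, (t / tau) ** a * np.exp(a * (1.0 - t / tau)), 0.0)

# Blurred, delayed versions of a signal replace the sharp taps x(t - tau_i).
dt = 0.01
ts = np.arange(0, 10, dt)
x = np.sin(2 * np.pi * ts / 3.0)
taus = [1, 2, 3, 4, 5]
inputs = [dt * np.convolve(x, G(ts, tau))[:len(ts)] for tau in taus]   # approx. Eq. (7.52)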

Context Units

A popular way to recognize (and sometimes reproduce) sequences has been to use partially recurrent networks. The connections are mainly feed-forward, but include a carefully chosen set of feedback connections. The recurrency lets the network remember cues from the recent past, but does not appreciably complicate the training. In most cases the feedback connections are fixed, not trainable, so back-propagation may easily be used for training. The updating is synchronous, with one update for all units at each time step. Such networks are sometimes referred to as sequential networks. Figure 7.5 shows several architectures that have been used. In each case there is one special set of context units C_i, either a whole layer or a part thereof, that receives feedback signals. The forward propagation is assumed to occur quickly, or without reference to time, while the feedback signal is clocked. Thus at time t the context units have some signals coming from the network state at time t - 1, which sets a context for processing at time t. The context units remember some aspects of the past, and so the state of the whole network at a particular time depends on an aggregate of previous states as well as on the current input. The network can therefore recognize sequences on the basis of its state at the end of the sequence. In some cases it can generate sequences too. Elman [1990] suggested the architecture shown in Fig. 7.5(a). Similar ideas were proposed by Kohonen; see Kohonen [1989]. The input layer is divided into two parts: the true input units and the context units. The context units simply hold a


FIGURE 7.5 Architectures with context units, panels (a)-(d), each built from Input, Context, Hidden, and Output layers. Single arrows represent connections only from the ith unit in the source layer to the ith unit in the destination layer, whereas shaded arrows represent fully connected layers.

copy of the activations of the hidden units from the previous time step. The modifiable connections are all feed-forward, and can be trained by conventional back-propagation methods; this is done at each time step in the case of input sequences. Note that we treat the context units like inputs in applying back-propagation, and do not introduce terms like ∂C_k/∂w_ij into the derivation. The network is able to recognize sequences, and even to produce short continuations of known sequences. Cleeremans et al. [1989] have shown that it can learn to mimic an existing finite state automaton, with different states of the hidden units representing the internal states of the automaton. Figure 7.5(b) shows the Jordan [1986, 1989] architecture. It differs from that of Fig. 7.5(a) by having the context units fed from the output layer (instead of the hidden layer), and also from themselves. Similar behavior can probably be obtained whether the feedback is from a hidden or from an output layer, though particular problems might be better suited to one rather than the other. One could even use multiple sets of context units, one set from each layer [Bengio et al., 1990]. More importantly, the self-connections give the context units C_i themselves some individual memory, or inertia. Their updating rule is

    C_i(t + 1) = α C_i(t) + O_i(t)        (7.54)


where the O_i are the output units and α is the strength of the self-connections. We require α < 1. If the outputs O_i were fixed, then the C_i would clearly decay exponentially towards O_i/(1 - α), thus gradually forgetting their previous values. Such units are sometimes called decay units, integrating units, or capacitive units. Iterating (7.54) we obtain

    C_i(t + 1) = O_i(t) + α O_i(t - 1) + α^2 O_i(t - 2) + ··· = Σ_{t'=0..t} α^{t-t'} O_i(t')        (7.55)

or, in the continuum limit,

    C_i(t) = ∫_0^t e^{-γ(t-t')} O_i(t') dt'        (7.56)

where γ = |log α|. Thus in general context units accumulate a weighted moving average or trace of the past values they see, in this case of O_i. By making α closer to 1 the memory can be made to extend further back into the past, at the expense of loss of sensitivity to detail. In general the value of α should be chosen so that the decay rate matches the characteristic time scale of the input sequence [Stornetta et al., 1988]. In a problem with a wide range of time scales it may be useful to have several context units for each output unit, with a different decay rate (α value) for each [Anderson et al., 1989]. Note in passing that the result (7.56) is exactly of the form (3.57) with the exponential decay (3.60), so context units could be used to implement the moving averages needed in Section 3.5. With fixed input, the network of Fig. 7.5(b) can be trained to generate a set of output sequences, with different input patterns triggering different output sequences [Jordan, 1986, 1989]. With an input sequence, the network can be trained to recognize and distinguish different input sequences. Anderson et al. [1989] have applied this to categorizing a class of English syllables, and find that a network trained by one group of speakers can successfully generalize to another group. Figure 7.5(c) shows a simpler architecture, due to Stornetta et al. [1988], that can also perform sequence recognition tasks. Note that the only feedback is now from the context units to themselves, giving them decay properties as in (7.54). But their input is now the network input itself, which only reaches the rest of the network via the context units. In effect the input signal is preprocessed by the context units by the formation of a weighted moving average of past values; technically it is equivalent to processing by an infinite impulse response digital filter with transfer function 1/(1 - αz^{-1}). This preprocessing serves to include past features of the input signal into the present context values, thus letting the network recognize and distinguish different sequences. However, this architecture is less well suited than Fig. 7.5(a) or (b) to generating or reproducing sequences. Mozer [1989] suggested the architecture shown in Fig. 7.5(d). This looks at first similar to case (c), but differs in two important ways. First, there is full connectivity


with modifiable connections between input and context units instead of only one-to-one connections. This gives the network greater flexibility to use the context units in a way appropriate to the problem. Second, the self-connections to the context units are no longer fixed, but are trained like other connections. The context units still act fundamentally as integrators, but the decay rate can be found during training to match the time scales of the input. Now however we need a new learning rule; conventional back-propagation does not allow for the modifiable recurrent connections. Recurrent back-propagation will not suffice either, since it assumes constant inputs and approach to an attractor, whereas here we are interested in input and/or output sequences. Mozer [1989] derived a new learning rule by differentiating an appropriate cost function and using gradient descent. The result is almost a special case of the "real-time recurrent learning" rule of Williams and Zipser [1989a] which we will treat shortly, so we omit the details here. The "almost" comes about because Mozer uses the update rule

    C_i(t + 1) = α_i C_i(t) + g( Σ_j w_ij ξ_j )        (7.57)

for the context units (where ξ_j is an input pattern), whereas the Williams and Zipser approach for fully recurrent networks would apply strictly only to

    C_i(t + 1) = g( α_i C_i(t) + Σ_j w_ij ξ_j ).        (7.58)

The modification to the learning algorithm is minor however. Equation (7.57) seems to work somewhat better than (7.58), which has itself been studied by Bachrach [1988]. Like case (c), the architecture of case (d) is better suited to recognizing sequences than to generating or reproducing them. It works well for distinguishing letter sequences for example. When applying it to a problem involving reproducing a sequence after a delay, Mozer [1989] added an additional set of context units to the input layer to echo the output layer, in the style of case (a) or (b). Other architectures, not illustrated in Fig. 7.5, are of course possible, and some have been tried. Shimohara et al. [1988] categorize the possibilities and examine several.
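The behaviour of decay units is easy to visualize with a few lines of code. The sketch below simply iterates the trace rule (7.54) for a chosen α; the single-pulse input is an arbitrary example.

import numpy as np

def run_context_units(inputs, alpha):
    """Exponential-trace context units, Eq. (7.54): C(t+1) = alpha*C(t) + O(t).
    inputs: array of shape (T, n) holding the signals fed back at each step."""
    C = np.zeros(inputs.shape[1])
    trace = []
    for O in inputs:
        C = alpha * C + O
        trace.append(C.copy())
    return np.array(trace)

# A context unit with alpha = 0.8 remembers roughly the last 1/|log alpha| ~ 4-5 steps.
O = np.zeros((20, 1)); O[3] = 1.0          # a single pulse at t = 3
print(run_context_units(O, alpha=0.8)[:9].ravel())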

Back-Propagation Through Time

The networks just discussed were partially recurrent. Let us now turn to fully recurrent networks, in which any unit V_i may be connected to any other. We retain for now synchronous dynamics and discrete time, so the appropriate update rule is

    V_i(t + 1) = g( h_i(t) ) = g( Σ_j w_ij V_j(t) + ξ_i(t) ).        (7.59)


FIGURE 7.6 Back-propagation through time. (a) A recurrent network. (b) A feed-forward network that behaves identically for 4 time steps.

Here ξ_i(t) is the input, if any, at node i at time t. If there is no such input at that node at that time we put ξ_i(t) = 0. The general temporal association task is to produce particular output sequences ζ_i(t) in response to specific input sequences ξ_i(t).

7.4 Reinforcement Learning

In some cases r^μ = -1 itself tells us the correct output, because -S_i^μ must be correct if S_i^μ is wrong. In that situation reinforcement learning is the same as ordinary supervised learning with known targets, and the present approach reduces to the ordinary delta rule. To construct our learning rule we compare the target ζ_i^μ with the average output value ⟨S_i^μ⟩ to compute the appropriate δ_i^μ, just as in (5.58):

    δ_i^μ = ζ_i^μ - ⟨S_i^μ⟩.        (7.74)

We could find ⟨S_i^μ⟩ by averaging for a while, or simply by calculation from (7.72) as in Chapter 2:

    ⟨S_i^μ⟩ = (+1) g(h_i^μ) + (-1)[1 - g(h_i^μ)] = tanh βh_i^μ.        (7.75)

Then our update rule for the output weights is the usual delta rule

    Δw_ik = η(r^μ) [ ζ_i^μ - ⟨S_i^μ⟩ ] ξ_k^μ        (7.76)

in which we let the learning rate η depend on whether the output was right or wrong. Typically η(+1) is taken as 10-100 times larger than η(-1). Consider now the average reinforcement for pattern μ,

    ⟨r⟩^μ = Σ_S r(S, ξ^μ) P(S|w, ξ^μ)        (7.80)

where the sum is over all possible output configurations S, each of which has a probability P(S|w, ξ^μ) that depends on the input ξ^μ and all the weights w. In fact, since output unit i chooses S_i = ±1 independently of what all the other units are doing, this probability factorizes:

    P(S|w, ξ^μ) = Π_k P(S_k|w, ξ^μ) = Π_k e^{βS_k h_k^μ} / (2 cosh βh_k^μ)        (7.81)

where h_k^μ = Σ_j w_kj ξ_j^μ as usual. Now we want to differentiate (7.80) with respect to some particular weight w_ij, in order to show that our learning rule goes in the gradient direction. So we need


to differentiate the product in (7.81) with respect to w_ij. Carrying this through gives

    ∂⟨r⟩^μ/∂w_ij = β ⟨ r(S, ξ^μ) [ S_i - ⟨S_i⟩ ] ⟩ ξ_j^μ        (7.84)

where the averages ⟨···⟩ are over all output configurations, weighted only by the usual stochastic unit probabilities (7.72). This is our desired result. First note that it would vanish if r(S, ξ^μ) were the same for all output configurations S that had appreciable probability P(S|w, ξ^μ), because ⟨[S_i - ⟨S_i⟩]⟩ = 0. This is what happens when we reach a correct solution: we almost always get r^μ = +1, so of course ⟨r⟩^μ has a local maximum there. Now comparing with (7.79) we see that we may write

    ⟨Δw_ij⟩^μ ∝ ∂⟨r⟩^μ/∂w_ij

so that the average weight update in response to input pattern μ is in the upward gradient direction of our (negative) cost function ⟨r⟩^μ. Finally we can average this result over all patterns μ and hence write

    ⟨Δw_ij⟩ ∝ ∂⟨r⟩/∂w_ij        (7.86)

where now the averages are across all patterns and outcomes. The result suggests that the learning rule (7.79) will continue to increase the average reinforcement until a maximum is reached. It does not prove such convergence however, because it only tells us about the average behavior. In a particular case a sequence of unlucky chances might decrease ⟨r⟩ considerably, and might lead us to become stuck in a very poor configuration. The convergence of stochastic


processes like this is very difficult to treat, and is not yet fully understood for the A_RP algorithm. Nevertheless the result is encouraging, and allows us once again to visualize the procedure in terms of gradient descent on a cost surface, or gradient ascent on an average reinforcement surface. Equation (7.86) applies to the symmetric rule (7.79) and not to the original rule (7.77). It also relies on η being independent of r. Both these restrictions actually make the algorithm worse in practice, but seem to be needed to prove the result. In fact it may be that the better versions are not always moving uphill in ⟨r⟩, even on average, but are thus more free to explore alternatives and to find better solutions in the end. Indeed it is found that the algorithms to which the result applies tend to converge rapidly, but to rather poor local maxima [Williams and Peng, 1989]. Another case in which (7.86) applies is obtained by modifying our reinforcement signal from ±1 to 0/1. Then maximizing ⟨r⟩ is a different objective, in which failures are not penalized. Now (7.86) applies if we use the original A_RP rule (7.77), but set η⁻ to 0. This is sometimes called the associative reward-inaction rule, or A_RI. Williams [1987, 1988b] has identified a wider class of algorithms that possess the gradient ascent property (7.86). His derivation is also applicable to class II problems with a stochastic environment, in which ⟨r⟩ would be an average over the environment's responses as well as over the stochastic choices of the output units. However the class of algorithms, which Williams calls REINFORCE algorithms, still does not include the full A_RP algorithm.
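As an illustration of this family of stochastic, reward-driven rules, here is a sketch of an update of the symmetric form implied by (7.84)-(7.86), with the reinforcement supplied by a user-provided environment function. The single learning rate (no separate η⁺ and η⁻) and the other constants are simplifications; this is not a full A_RP implementation.

import numpy as np

rng = np.random.default_rng(1)

def reward_driven_step(w, xi, env_reward, eta=0.05, beta=1.0):
    """One stochastic update of the symmetric form implied by Eqs. (7.84)-(7.86):
    Delta w_ij proportional to r * (S_i - <S_i>) * xi_j.
    env_reward(S) must return +1 or -1 for the sampled output vector S."""
    h = w @ xi
    mean_S = np.tanh(beta * h)                                   # <S_i>, Eq. (7.75)
    S = np.where(rng.random(len(h)) < (1 + mean_S) / 2, 1, -1)   # sample stochastic outputs
    r = env_reward(S)
    w += eta * r * np.outer(S - mean_S, xi)
    return w, S, r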

Models and Critics

Another approach to reinforcement learning involves modelling the environment with an auxiliary network, which can then be used to produce a target for each output of the main network instead of just a global signal r [Munro, 1987; Williams, 1988a; Werbos, 1987, 1988]. The general idea of a separate modelling network can also be extended to more complex reinforcement learning problems, including class III problems. A suitable architecture is shown in Fig. 7.8. The lower network is the main one that produces an output to the environment in response to an input from it, as before. Again it normally employs stochastic output units to encourage exploration. The upper network is the modelling network. It monitors at its input both the input and the output of the main network, and thus sees everything that the environment does. It has a single continuous-valued output unit R, called the evaluation unit, which is intended to duplicate the reinforcement r produced by the environment. If the model is a good one we will have R ≈ r for each input-output pair that the model sees. In class I and II problems it is easy to see how to train the modelling network. The environment provides the correct response r for each case encountered (at least on average), so we can use an ordinary supervised approach, such as back-propagation, to minimize (r - R)^2. So that the modelling network can see a broad


FIGURE 7.8 A reinforcement learning network which tries to model its environ­ ment.

sample of possible input-output pairs it may be expedient to turn up the temperature parameter (T = 1/β) of the main network's stochastic output units while training the modelling network. Once it is trained the modelling network can be used to train the main network. Now the aim is to maximize R, which we take to be a good model of r. This can also be done by back-propagation, through both networks. We do not change any weights in the modelling network, but propagate the appropriate δ's through it, starting from δ = -R at the evaluation unit. By the time these δ's propagate back to the main network we have an error signal for each unit instead of a global signal r, and can use ordinary supervised learning to adjust the weights. Thus overall the whole scheme reduces the reinforcement learning problem to two stages of supervised learning with known targets. In principle at least we could do both stages at the same time, thinking of the two parts as a single cascaded network. However it is still essential to keep distinct the separate goals of the two subnetworks; it would be no good to train the whole network to minimize (R - r)^2 without any incentive towards maximizing r, and equally pointless to maximize R overall without regard to r. In practice though, it seems best to train the modelling network first, at least partially. One advantage of the present approach is that the modelling network can smooth out the fluctuations of a stochastic environment in class II reinforcement


FIGURE 7.9 A reinforcement learning network with a critic. The main network might be a simple feed-forward network as sketched, or could itself include a modelling network as in Fig. 7.8.

learning problems. With the A_RP algorithm there is a tendency to chase meaningless fluctuations of r, but maximizing R should be less hazardous. For class III problems there are various ways to go further, making the modelling network a better predictor of reinforcement. Actually it should try to estimate a weighted average of the future reinforcement, because the main network needs to maximize the cumulative reinforcement, not the instantaneous value. In a game, for example, the system might need to wait many moves for an eventual reinforcement. And even if continuing evaluation is provided by the environment, it may be appropriate to sacrifice short-term rewards for longer-term gains. It may be worthwhile constructing a separate network (or algorithm) to perform this task of predicting reinforcement. As shown in Fig. 7.9 this critic receives the raw reinforcement signal r from the environment and feeds a processed signal p on to the main network. The p signal represents an evaluation of the current behavior of the main network, whereas r typically involves the past history. If the critic can produce a prediction R of r, we could for example use p = r - R, so the main network would be rewarded when r exceeded expectations, and penalized when it fell short; this is reinforcement comparison [Sutton, 1984]. The critic itself would normally be adaptive and would improve with experience; the name adaptive heuristic critic is sometimes used. We omit descriptions of appropriate algorithms, but note that they are often closely related to dynamic programming; see Sutton [1984, 1988] and Barto et al. [1991]. They are also related to models of classical conditioning in behavioral psychology [e.g., Tesauro, 1986; Sutton and Barto, 1991]. Barto et al. [1983] constructed a network consisting of an adaptive critic and a set of A_RP units that learned a problem of balancing a pole on a movable cart. The system was able to learn the task even though the "failure" reinforcement signal typically came long after the mistakes that produced it.
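A minimal adaptive critic of the reinforcement-comparison kind can be sketched as follows; the running-average predictor of r is an illustrative choice rather than the algorithm of Sutton [1984].

import numpy as np

class ReinforcementComparisonCritic:
    """Keeps a running prediction R of the reward and passes on p = r - R,
    so behaviour is rewarded only when r exceeds expectations."""
    def __init__(self, rate=0.1):
        self.R = 0.0
        self.rate = rate

    def __call__(self, r):
        p = r - self.R                         # processed signal fed to the main network
        self.R += self.rate * (r - self.R)     # slowly update the prediction of r
        return p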

EIGHT Unsupervised Hebbian Learning

8.1 Unsupervised Learning

This is the first of two chapters on unsupervised learning. In unsupervised learning there is no teacher. We still consider a network with both inputs and outputs, but there is no feedback from the environment to say what those outputs should be or whether they are correct. The network must discover for itself patterns, features, regularities, correlations, or categories in the input data and code for them in the output. The units and connections must thus display some degree of self-organization.

Unsupervised learning can only do anything useful when there is redundancy in the input data. Without redundancy it would be impossible to find any patterns or features in the data, which would necessarily seem like random noise. In this sense redundancy provides knowledge [Barlow, 1989]. More technically, the actual information content of the input data stream must be less than the maximum that could be carried on the same channel (the same input lines at the same rate); the difference is the redundancy.

The type of pattern that an unsupervised learning network detects in the input data depends on the architecture, and we will consider several cases. But it is interesting first to consider what sort of things such a network might tell us; what could the outputs be representing? There are a number of possibilities:

1. Familiarity. A single continuous-valued output could tell us how similar a new input pattern is to typical or average patterns seen in the past. The network would gradually learn what is typical.

2. Principal Component Analysis. Extending the previous case to several units involves constructing a multi-component basis, or set of axes, along which to measure similarity to previous examples. A common approach from




statistics, called principal component analysis, uses the leading eigenvector directions of the input patterns' correlation matrix.

3. Clustering. A set of binary-valued outputs, with only one on at a time, could tell us which of several categories an input pattern belongs to. The appropriate categories would have to be found by the network on the basis of the correlations in the input patterns. Each cluster of similar or nearby patterns would then be classified as a single output class.

4. Prototyping. The network might form categories as in the previous case, but then give us as output a prototype or exemplar from the appropriate class. It would then have the function of an associative memory, but the memories would be found directly from the input patterns, not imposed from outside.

5. Encoding. The output could be an encoded version of the input, in fewer bits, keeping as much relevant information as possible. This could be used for data compression prior to transmission over a limited-bandwidth channel, assuming that an inverse decoding network could also be constructed.

6. Feature Mapping. If the output units had a fixed geometrical arrangement

(such as a two-dimensional array) with only one on at a time, they could map input patterns to different points in this arrangement. The idea would be to make a topographic map of the input, so that similar input patterns always triggered nearby output units. We would expect a global organization of the output units to emerge.

These cases are of course not necessarily distinct, and might also be combined in several ways. The encoding problem in particular could be performed using principal component analysis, or with clustering, which is often called vector quantization in this context. Principal component analysis could itself be used for dimensionality reduction of the data before performing clustering or feature mapping. This could avoid the "curse of dimensionality" usually encountered when looking for patterns in unknown data; a high-dimensional space with a modest number of samples is mostly empty.

It is worth noting that unsupervised learning may be useful even in situations where supervised learning would be possible:

■ Multi-layer back-propagation is extremely slow, in part because the best weights in one layer depend on all the other weights in all the other layers. This can be avoided to some extent by an unsupervised learning approach, or a hybrid approach, in which at least some layers train themselves just on the outputs from the immediately preceding layer.



■ Even after training a network with supervised learning, it may be advisable to allow some subsequent unsupervised learning so that the network can adapt to gradual changes or drift in its environment or sensor readings.

Unsupervised learning architectures are mostly fairly simple—the complications and subtleties come mainly from the learning rules. Most networks consist of only a single layer. Most are essentially feed-forward, with the notable exception of Adaptive Resonance Theory. Except in the case of Feature Mapping, usually there are many fewer output units than inputs. The network architectures also tend to be more closely modelled after neurobiological structures than elsewhere in neural computation. In most of the cases considered in this chapter and the next the architecture and learning rule come simply from intuitively plausible suggestions. However in some cases there is a well-defined quantity that is being maximized, such as the information content or variance of the output. These optimization approaches are closer to those of statisticians. There are in fact close connections between many of the networks discussed here and standard statistical techniques of pattern classification and analysis [Duda and Hart, 1973; Devijver and Kittler, 1982; Lippmann, 1987].

In this chapter we consider some techniques based on connections that learn using a modified Hebb rule. The outputs are continuous-valued and do not have a winner-take-all character, in contrast to most of the networks considered in the next chapter. Thus the purpose is not clustering or classification of patterns, but rather measuring familiarity or projecting onto the principal components of the input data. In a multi-layered network this can however lead to some remarkable feature extraction properties, as we will see in Section 8.4.

FIGURE 8.1 Architecture for simple Hebbian learning. The output unit is linear, so V = Σ_j w_j ξ_j.

8.2 One Linear Unit

Let us assume that we have some input vectors ξ, with components ξ_i for i = 1, 2, …, N, drawn from some probability distribution P(ξ). The components could be continuous-valued or binary-valued. At each time step an instance ξ is drawn from the distribution and applied to the network. After seeing enough such samples the network should learn to tell us—as its output—how well a particular input pattern conforms to the distribution. We will make this more precise shortly. The case of a single linear output unit is simplest. The architecture is shown in Fig. 8.1 and we have simply

V = Σ_j w_j ξ_j = w^T ξ = ξ^T w    (8.1)



where w is the weight vector. We find it convenient in this chapter to use matrix multiplication notation (e.g., w^T ξ) rather than inner product notation (e.g., w·ξ), writing x^T for the transpose (a row vector) of x (a column vector). All boldface symbols are vectors, except the matrix C introduced below.

With just one unit we clearly want the output V to become a scalar measure of familiarity. The more probable that a particular input ξ is, the larger the output |V| should become, at least on average. It is therefore natural to try plain Hebbian learning

Δw_i = η V ξ_i    (8.2)

where η controls the learning rate as usual. This strengthens the output in turn for each input presented, so frequent input patterns will have most influence in the long run, and will come to produce the largest output.

But there is a problem; the weights keep on growing without bound and learning never stops. It is nevertheless instructive to examine the rule (8.2) in more detail. Suppose for a moment that there were a stable equilibrium point for w. After sufficient learning the weight vector would remain in the neighborhood of the equilibrium point, just fluctuating around it (by an amount proportional to η) but not going anywhere on average. So at the equilibrium point we would expect the weight changes given by (8.2) to average to zero:

0 = ⟨Δw_i⟩ = η ⟨V ξ_i⟩ = η Σ_j ⟨ξ_i ξ_j⟩ w_j = η (Cw)_i    (8.3)

where the angle brackets indicate an average over the input distribution P(ξ) and we have defined the correlation matrix C by

C_ij = ⟨ξ_i ξ_j⟩    (8.4)

or

C ≡ ⟨ξ ξ^T⟩.    (8.5)

Several things should be noted about C before we proceed. First, it is not quite the covariance matrix of the input, which would be defined in terms of the means μ_i = ⟨ξ_i⟩ as ⟨(ξ_i − μ_i)(ξ_j − μ_j)⟩. Secondly, C is symmetric, C_ij = C_ji, which implies that its eigenvalues are all real and its eigenvectors can be taken as orthogonal. Finally, because of the outer product form (8.5), C is positive semi-definite¹; all its eigenvalues are positive or zero.

Now let us return to (8.3). It says that (at our hypothetical equilibrium point) w is an eigenvector of C with eigenvalue 0. But this could never be stable, because C necessarily has some positive eigenvalues; any fluctuation having a component along an eigenvector with positive eigenvalue would grow exponentially. We might suspect (correctly) that the direction with the largest eigenvalue λ_max of C would eventually become dominant, so that w would gradually approach an eigenvector corresponding to λ_max, with an increasingly huge norm. But in any case w does not settle down; there are only unstable fixed points for the plain Hebbian learning procedure (8.2).

¹Proof: for any vector x, x^T C x = x^T ⟨ξ ξ^T⟩ x = ⟨x^T ξ ξ^T x⟩ = ⟨(ξ^T x)²⟩ ≥ 0.

FIGURE 8.2 Examples of Oja's unsupervised learning rule. The dots show 1000 samples from each P(ξ). The arrows represent the average weight vector w after many updates. The thin lines show a typical trajectory of w during training (for 2500 updates in (a); for 1000 updates in (b)). The parameter η was held at 0.1.

Oja's Rule

We can prevent the divergence of plain Hebbian learning by constraining the growth of the weight vector w. There are several ways to do this, such as a simple renormalization w'_i = a w_i of all the weights after each update, choosing a so that |w'| = 1. But Oja [1982] suggested a more clever approach. By modifying the Hebbian rule itself, he showed that it is possible to make the weight vector approach a constant length, |w| = 1, without having to do any normalization by hand. Moreover, w does indeed approach an eigenvector of C with largest eigenvalue λ_max. We will henceforth call this a maximal eigenvector for brevity. Oja's rule corresponds to adding a weight decay proportional to V² to the plain Hebbian rule:

Δw_i = η V (ξ_i − V w_i).    (8.6)

Note that this looks like reverse delta-rule learning; Δw depends on the difference between the actual input and the back-propagated output. It is not immediately obvious that (8.6) makes w approach unit length or tend to a maximal eigenvector, though we will prove these facts shortly.

First let us see the rule in action. In each part of Fig. 8.2 we drew samples for ξ from a two-dimensional Gaussian distribution; ξ has two components ξ₁ and ξ₂, and there are two weights, w₁ and w₂. The weights were started from small random values and updated according to (8.6) after each sample. The thin lines show that |w| grew initially but reached a constant length (of 1) and then merely fluctuated on an arc of the circle |w| = 1. The fluctuations did not die away and were much larger in case (b) than in case (a). Convergence to the unit circle was much faster in case (b) than in case (a).

The final average weight vectors w are shown by the heavy arrows. What do these tell us? Do they really make the output V represent the familiarity of a particular input ξ? Well, yes and no; they do the best they can within the constraints of our architecture. Because we chose linear units, the output V is just the component of the input ξ along the w direction. In case (a), with zero-mean data, this is zero on average, whatever the direction of w, but it is largest in magnitude for the direction found. In case (b), the average of V itself is maximized for the direction found. So in both cases the direction of w found by Oja's rule gives larger |V|'s on average than for any other direction, for points drawn from the original distribution. Points drawn from another distribution—"unfamiliar" points—would tend to give smaller values of |V|, unless they had larger magnitudes on average. Thus the network does develop a familiarity index for the distribution as a whole, though not necessarily for any particular sample. In case (a), for instance, the most probable (and mean) ξ is at the origin, which gives V = 0.

In fact Oja's rule chooses the direction of w to maximize ⟨V²⟩. This confirms, and makes more precise, the above observations based on Fig. 8.2. For zero-mean data, such as case (a), it corresponds to variance maximization at the output, and also to finding a principal component, as discussed in Section 8.3.
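The behavior described above is easy to reproduce numerically. The following numpy sketch applies plain Hebbian learning (8.2) and Oja's rule (8.6) to samples from a two-dimensional zero-mean Gaussian; the covariance, learning rate, and sample counts are arbitrary choices made for this sketch, not those of Fig. 8.2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean 2-D Gaussian inputs; the covariance is chosen arbitrarily.
cov = np.array([[3.0, 1.0],
                [1.0, 1.0]])
xi = rng.multivariate_normal(np.zeros(2), cov, size=5000)

eta = 0.1
w_hebb = rng.normal(scale=0.01, size=2)
w_oja = rng.normal(scale=0.01, size=2)

for t, x in enumerate(xi):
    if t < 100:                               # plain Hebb (8.2): |w| grows without bound
        V = w_hebb @ x
        w_hebb += eta * V * x
    V = w_oja @ x
    w_oja += eta * V * (x - V * w_oja)        # Oja's rule (8.6)

C = xi.T @ xi / len(xi)                       # sample correlation matrix (8.5)
evals, evecs = np.linalg.eigh(C)              # eigenvalues in ascending order
print("plain Hebb |w| after 100 steps:", np.linalg.norm(w_hebb))
print("Oja |w|:", np.linalg.norm(w_oja))      # close to 1
print("overlap with maximal eigenvector:",
      abs(w_oja @ evecs[:, -1]))              # also close to 1
```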

Theory of Oja's Rule *

We have left ourselves with a number of claims to prove. We have said that Oja's rule converges to a weight vector w with the following properties:

1. Unit length: |w| = 1, or Σ_i w_i² = 1.
2. Eigenvector direction: w lies in a maximal eigenvector direction of C.
3. Variance maximization: w lies in a direction that maximizes ⟨V²⟩.

Let us first show that property 3 is a simple consequence of property 2. Using (8.1) and (8.5) we can write

⟨V²⟩ = ⟨(w^T ξ)²⟩ = ⟨w^T ξ ξ^T w⟩ = w^T C w.    (8.7)

Now for fixed |w| and a symmetric matrix C, it is a standard result that the quadratic form w^T C w is maximized when w is along a maximal eigenvector direction of C. So this direction also maximizes ⟨V²⟩, as claimed. [The standard result can be proved by diagonalizing the quadratic form, writing it as

w^T C w = Σ_a λ_a w_a²    (8.8)



where the λ_a's are the eigenvalues of C and the w_a's are the components of w along the corresponding eigenvectors. These eigenvectors are orthogonal and |w| = 1, so Σ_a w_a² = 1. To maximize (8.8) under this constraint we clearly have to put all the weight into the term with the largest λ_a, which is λ_max by definition. Thus w must be along a maximal eigenvector. In the case of a degenerate largest eigenvalue any direction in the space spanned by the corresponding eigenvectors will suffice.]

To prove properties 1 and 2 we must return to Oja's rule (8.6) itself. Just as in (8.3) for the plain Hebbian case, we expect that the average weight change will vanish when an equilibrium has been reached:

0 = ⟨Δw_i⟩ = η ⟨V ξ_i − V² w_i⟩ = η ( Σ_j ⟨ξ_i ξ_j⟩ w_j − Σ_jk w_j ⟨ξ_j ξ_k⟩ w_k w_i )
  = η ( Σ_j C_ij w_j − [ Σ_jk w_j C_jk w_k ] w_i )    (8.9)

or

0 = ⟨Δw⟩ = η ( Cw − [w^T C w] w ).    (8.10)

Thus an equilibrium vector must obey

Cw = λw    (8.11)

with

λ = w^T C w = w^T λ w = λ |w|².    (8.12)

Equation (8.11) shows clearly that an equilibrium w must be an eigenvector of C, and (8.12) proves that |w| = 1. All that remains is to show that λ = λ_max.

Any normalized eigenvector of C would satisfy (8.10). But only one belonging to λ_max is stable. To show this, let us take w close to a particular normalized eigenvector c^α of C

w = c^α + ε    (8.13)

with C c^α = λ_α c^α and |c^α| = 1. Then, using (8.9), the average subsequent change of ε to order ε is

⟨Δε⟩ = ⟨Δw⟩ = C (c^α + ε) − [ ((c^α)^T + ε^T) C (c^α + ε) ] (c^α + ε)
     = λ_α c^α + Cε − [(c^α)^T C c^α] c^α − [ε^T C c^α] c^α − [(c^α)^T C ε] c^α − [(c^α)^T C c^α] ε + O(ε²)
     = Cε − 2 λ_α [ε^T c^α] c^α − λ_α ε + O(ε²).    (8.14)

Now take the component of this along the direction of another normalized eigenvector c^β of C by multiplying by (c^β)^T on the left (ignoring the O(ε²) terms):

(c^β)^T ⟨Δε⟩ = λ_β (c^β)^T ε − 2 λ_α [ε^T c^α] δ_αβ − λ_α (c^β)^T ε = [ λ_β − λ_α − 2 λ_α δ_αβ ] (c^β)^T ε.    (8.15)



Note that we used (c^β)^T c^α = δ_αβ by orthonormality of the eigenvectors. Equation (8.15) is just what we need; for β ≠ α it says that the component of ε along c^β will grow—and hence the solution will be unstable—if λ_β > λ_α. So if λ_α is not the largest eigenvalue λ_max there will always be an unstable direction. On the other hand, an eigenvector corresponding to λ_max is stable in all directions, including the c^α direction itself. This completes the proofs of our claims.

It is worth remarking that we have not really proved convergence to the solution, only that it is on average a stable fixed point of Oja's rule. In a nonlinear stochastic dynamical system like this there are several things that might conceivably happen besides convergence to a fixed point. We might see quasi-cyclic behavior, or we might simply see the system continuing to fluctuate, just like a thermodynamic system at T > 0 (above a phase transition perhaps, so the typical state could be quite unlike the ground state). We could consider quenching the fluctuations by steadily reducing η to zero, but if we did this too rapidly the system might get stuck in the wrong place and never get to the desired solution. More powerful techniques, such as stochastic approximation theory, are needed to prove convergence in these systems [see e.g., Kushner and Clark, 1978]. Such proofs have in fact been constructed for Oja's rule [Oja and Karhunen, 1985].

Other Rules

Oja's rule (8.6) is not the only way one can modify the plain Hebbian rule (8.2) to keep the weights bounded. Linsker [1986, 1988] uses simple clipping; the individual weights w_i are constrained to obey w_− ≤ w_i ≤ w_+. Yuille et al. [1989] suggest a rule

Δw_i = η (V ξ_i − w_i |w|²)    (8.16)

which makes w converge to the same maximal eigenvector direction as Oja's rule, but with |w| = √λ_max instead of |w| = 1. It has the disadvantage over Oja's rule of being nonlocal—to update w_i we need information about the other w_j's—but the mathematical advantage that there is an associated cost function

E = −½ Σ_ij C_ij w_i w_j + ¼ |w|⁴ = −½ w^T C w + ¼ |w|⁴.    (8.17)

The average effect ⟨Δw_i⟩ of (8.16) corresponds exactly to gradient descent on this cost surface. Pearlmutter and Hinton [1986] also derive a gradient-descent learning rule, based on maximizing the information content of the output. It is applicable to nonlinear as well as linear units.

8.3 Principal Component Analysis

A common method from statistics for analyzing data is principal component analysis (PCA) [see e.g., Jolliffe, 1986]. In communication theory it is known as the Karhunen-Loève transform. It is also closely related to least squares methods, factor analysis, singular value decomposition, and matched filtering. Linsker [1988] notes that performing principal component analysis is equivalent to maximizing the information content of the output signal in situations where that has a Gaussian distribution.

FIGURE 8.3 Illustration of principal component analysis. OA is the first principal component direction of the distribution that generated the cloud of points. The projection onto OA shows up more structure than the projection onto another direction OB. After Linsker [1988].

The aim is to find a set of M orthogonal vectors in data space that account for as much as possible of the data's variance. Projecting the data from their original N-dimensional space onto the M-dimensional subspace spanned by these vectors then performs a dimensionality reduction that often retains most of the intrinsic information in the data. Typically M ≪ N. For zero-mean data the variance of the projection onto a unit vector x is, as in (8.7),

⟨(x^T ξ)²⟩ = x^T C x = Σ_α λ_α x_α²    (8.18)

where the λ_α are the eigenvalues of the correlation matrix C and the x_α are the components of x along the corresponding orthonormal eigenvectors, just as in (8.8). The first principal component direction—the one with the largest projected variance—is therefore along a maximal eigenvector of C. Let us order the eigenvalues so that

λ₁ ≥ λ₂ ≥ … ≥ λ_N    (8.19)

with λ₁ = λ_max. Now we proceed by mathematical induction and assume that principal components 1 to k − 1 are along the first k − 1 eigenvector directions. The kth principal component is constrained to be perpendicular to these directions, so we must have x₁ = … = x_{k−1} = 0. Maximizing (8.18) subject to this condition, with |x| = 1 (and thus Σ_α x_α² = 1), clearly results in

x_α = { ±1  if α = k;  0  otherwise.    (8.20)

Thus the kth principal component is along the kth eigenvector, as claimed. Moreover (8.18) shows that the variance itself is equal to λ_k when x is along the kth principal component direction.
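Since the principal components are just the leading eigenvectors of C, they can of course also be computed directly with an eigensolver rather than with a learning rule. A minimal numpy sketch, assuming zero-mean data (the example covariance is made up for illustration):

```python
import numpy as np

def principal_components(xi, M):
    """First M principal component directions and variances of zero-mean data xi
    (rows are samples), from the correlation matrix C = <xi xi^T>."""
    C = xi.T @ xi / len(xi)
    evals, evecs = np.linalg.eigh(C)          # ascending eigenvalues
    order = np.argsort(evals)[::-1][:M]       # take the M largest
    return evecs[:, order], evals[order]

rng = np.random.default_rng(2)
data = rng.multivariate_normal(np.zeros(3),
                               [[4, 1, 0], [1, 2, 0], [0, 0, 0.5]], size=2000)
dirs, variances = principal_components(data, 2)
projected = data @ dirs                       # dimensionality reduction N=3 -> M=2
print(variances)                              # close to the two largest eigenvalues
```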

One-Layer Feed-Forward Networks

Oja's rule (8.6) finds a unit weight vector which maximizes the mean square output ⟨V²⟩. For zero-mean data this is just the first principal component. It would however be desirable to have an M-output network that extracts the first M principal components. Sanger [1989a] and Oja [1989] have both designed one-layer feed-forward networks that do this. Other network architectures for principal component analysis will be discussed later. The networks are linear, with the ith output V_i given by

V_i = Σ_j w_ij ξ_j = w_i^T ξ = ξ^T w_i.    (8.21)

Note that the vectors here are N-dimensional (i.e., in the input space); w_i is the weight vector for the ith output. Sanger's approach can also be extended to nonlinear units [Sanger, 1989b], but we discuss only the linear case. Sanger's learning rule is

Δw_ij = η V_i ( ξ_j − Σ_{k=1}^{i} V_k w_kj )    (8.22)



and Oja's M-unit rule is

Δw_ij = η V_i ( ξ_j − Σ_{k=1}^{M} V_k w_kj ).    (8.23)

The only difference is in the upper limit of the summation. Both rules reduce to Oja's 1-unit rule (8.6) for the M = 1 single output case. Sanger's rule always reduces to Oja's 1-unit rule for the first unit i = 1, so we already know that that unit will find the first principal component of the input data.² It turns out that in both cases the w_i vectors converge to orthogonal unit vectors, w_i^T w_j = δ_ij. For Sanger's rule the weight vectors become exactly the first M principal component directions, in order:

w_i → ±c^i    (8.24)

where c^i is a normalized eigenvector of the correlation matrix C belonging to the ith largest eigenvalue λ_i; we take the eigenvalues to be in decreasing order, as in (8.19). For Oja's M-unit rule the M weight vectors converge to span the same subspace as these first M eigenvectors, but do not find the eigenvector directions themselves. In both cases the outputs project an input vector ξ onto the space of the first M principal components.

Sanger's rule is usually more useful in applications because it extracts the principal components individually in order, and gives a reproducible result (up to sign differences) on a given data set if the eigenvalues are nondegenerate. It performs exactly the Karhunen-Loève transform. Different outputs are statistically uncorrelated and their variance decreases steadily with increasing i. Thus, in applications to data compression and encoding, fewer bits are needed for later outputs. It may be appropriate to look at the variance of a given output (which is just the value of the corresponding eigenvalue) as a measure of the usefulness of that output, perhaps taking only the first few outputs down to a variance cutoff. Oja's M-unit rule gives weight vectors which, while spanning the right subspace, differ individually from trial to trial. They depend on the initial conditions and on the particular data samples seen during learning. On average the variance of each output is the same; this may be useful in some applications (such as in one layer of a multilayer network) where one wants to keep the information spread uniformly across the units. Furthermore, if any algorithm of this sort is implemented in real brains, it would probably look more like Oja's than Sanger's: there is no obvious advantage for an animal in having information sorted out into individual principal components.

Neither of these new learning rules (8.22) and (8.23) is local. Updating weight w_ij requires more information than is available at input j and output i.

²Here, and henceforth, we consider only zero-mean data. Strictly speaking, the networks find the eigenvectors of the correlation matrix but the principal components are the eigenvectors of the full covariance matrix ⟨(ξ_i − μ_i)(ξ_j − μ_j)⟩. For zero-mean data there is no difference.



FIGURE 8.4 Network to implement Sanger's unsupervised learning rule. Only one input line is shown. The weighted output V_i of unit i is subtracted from the input before it reaches unit i + 1. Note that each weight appears twice; the two values must be kept equal.

However, Sanger suggested a reformulation of his rule which does allow a local implementation. We simply separate out the k = i term in (8.22)

Δw_ij = η V_i [ ( ξ_j − Σ_{k=1}^{i−1} V_k w_kj ) − V_i w_ij ]    (8.25)

and observe that this is just the same as Oja's 1-unit rule for unit i except that the input ξ_j has been replaced by ξ_j − Σ_{k=1}^{i−1} V_k w_kj. So with inputs modified in this way we can use Oja's original rule, which is local. The modified inputs can be calculated by progressively decrementing the total inputs, as indicated in Fig. 8.4.
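A minimal numpy sketch of Sanger's rule (8.22) follows. The input distribution (independent components with unequal variances, so the principal components lie close to the coordinate axes), the learning rate, and the number of presentations are arbitrary choices for this sketch; with nondegenerate eigenvalues the weight vectors should approach the leading eigenvectors in order, up to sign.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, eta = 5, 3, 0.01

# Zero-mean data; each component has a different (arbitrary) variance.
scales = np.array([3.0, 2.0, 1.5, 1.0, 0.5])
xi = rng.normal(size=(20000, N)) * scales

W = rng.normal(scale=0.01, size=(M, N))       # W[i] is the weight vector of output i

for x in xi:
    V = W @ x                                 # linear outputs (8.21)
    for i in range(M):
        back = V[: i + 1] @ W[: i + 1]        # sum over k <= i of V_k w_k
        W[i] += eta * V[i] * (x - back)       # Sanger's rule (8.22)

C = xi.T @ xi / len(xi)
evals, evecs = np.linalg.eigh(C)              # ascending order
for i in range(M):
    print(f"unit {i}: |w| = {np.linalg.norm(W[i]):.3f}, "
          f"overlap with eigenvector {i} = {abs(W[i] @ evecs[:, -1 - i]):.3f}")
```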

Theory of Sanger's Rule *

We examine only Sanger's rule in detail; Oja's M-unit rule has been analyzed by Krogh and Hertz [1990]. Our aim is to demonstrate the result (8.24). First let us substitute (8.21) into (8.22) and average:

⟨Δw_ij⟩/η = ⟨ Σ_p w_ip ξ_p ξ_j − Σ_p w_ip ξ_p Σ_{k=1}^{i} Σ_q w_kq ξ_q w_kj ⟩
         = Σ_p w_ip C_pj − Σ_{k=1}^{i} [ Σ_pq w_ip C_pq w_kq ] w_kj    (8.26)

or

⟨Δw_i⟩/η = C w_i − Σ_{k=1}^{i−1} [ w_k^T C w_i ] w_k − [ w_i^T C w_i ] w_i    (8.27)

where we have separated out the k = i term from the summation. Now let us proceed by mathematical induction and assume that weight vectors 1, 2, …, i − 1 have already found their appropriate eigenvectors, so w_k = ±c^k for k < i. The



first two terms on the right-hand side of (8.27) then become the projection of Cw_i onto the subspace orthogonal to these i − 1 eigenvectors; recall (e.g., from the Gram-Schmidt orthogonalization procedure) that x − (y^T x)y is the projection of x perpendicular to a unit vector y. Thus we have

⟨Δw_i⟩/η = (Cw_i)^⊥ − [ w_i^T C w_i ] w_i    (8.28)

where (Cw_i)^⊥ means the projection of Cw_i into the subspace orthogonal to the first i − 1 eigenvectors. Note that (Cw_i)^⊥ is equal to C w_i^⊥ because C preserves this subspace. Suppose now that w_i has a component not in this subspace. For that component the first term on the right-hand side of (8.28) gives nothing, and the second causes a decay towards zero (recall that w_i^T C w_i is a positive scalar). Thus w_i relaxes towards the subspace. But when restricted to the subspace the whole equation (8.28) becomes just Oja's 1-unit rule (8.6) for unit i, and so leads to the maximal eigenvector in the subspace, which is ±c^i with eigenvalue λ_i. This completes the induction. In fact the weight vectors w_i approach their final values simultaneously, not one at a time, but our argument still applies to the endpoint.

Other Networks

The architectures considered so far for principal component analysis have all been one-layer feed-forward networks. Other networks, with more layers or with lateral connections, can also perform it, and may have some advantages.

We have already met the self-supervised back-propagation approach, on page 136. A two-layer linear network with N inputs, N outputs, and M < N hidden units is trained using back-propagation so that the outputs are as close as possible to the inputs in the training set. This forces the network to encode the N-dimensional data as closely as possible in the M dimensions of the hidden layer, and it should not be surprising to find that the hidden units end up projecting onto the subspace of the first M principal components. The network produces essentially the same result as Oja's M-unit rule, with equal variance on average on each of the hidden units. The dynamical approach to the solution is also similar, though not identical [Sanger, 1989a].

Another approach is to use a one-layer network, but have trainable lateral connections between the M output units V_i. One architecture is shown in Fig. 8.5 [Rubner and Tavan, 1989; Rubner and Schulten, 1990]. Note that a lateral connection u_ij from V_j to V_i is present only if i > j. The ordinary weights w_ij from the inputs are trained with a plain Hebbian rule (8.2), followed by renormalization of each weight vector to unit length. But the lateral weights employ anti-Hebbian learning, equivalent to a Hebb rule with negative η:

Δu_ij = −η V_i V_j.    (8.29)

Clearly the first output unit V₁ extracts the first principal component, just as in Oja's 1-unit rule and Sanger's rule. The second unit would do the same but for its




FIGURE 8.5 A network for principal component analysis. The lateral u_ij connections between the output units use anti-Hebbian learning.

lateral connection u₂₁ from unit 1, which tries through (8.29) to decorrelate it with V₁. This is very like the effect of Sanger's rule, and leads to the same result; the second unit extracts the second principal component. And so on; the network learns to extract the first M principal components in order, just as in Sanger's network. The lateral weights end up going to zero, but are still required to be present for stability against fluctuations. A very similar proposal was made by Foldiak [1989] using full lateral connections in the output layer; there was no i > j restriction.
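The following numpy sketch illustrates this kind of lateral scheme: plain Hebbian learning with renormalization for the feed-forward weights and the anti-Hebbian rule (8.29) for the lower-triangular lateral weights, with outputs computed in order as in Fig. 8.5. The input distribution and all parameter values are arbitrary, and the sequential output computation is one simple reading of the architecture rather than a literal transcription of Rubner and Tavan's algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 4, 2
eta, mu = 0.01, 0.02

scales = np.array([2.5, 1.5, 1.0, 0.5])           # arbitrary input variances
W = rng.normal(scale=0.1, size=(M, N))            # feed-forward weights w_i
U = np.zeros((M, M))                              # lateral weights u_ij, used for j < i

for _ in range(30000):
    x = rng.normal(size=N) * scales
    V = np.zeros(M)
    for i in range(M):                            # outputs computed in order
        V[i] = W[i] @ x + U[i, :i] @ V[:i]
    W += eta * np.outer(V, x)                     # plain Hebb (8.2) ...
    W /= np.linalg.norm(W, axis=1, keepdims=True) # ... followed by renormalization
    for i in range(1, M):
        U[i, :i] += -mu * V[i] * V[:i]            # anti-Hebbian rule (8.29)

C = np.diag(scales ** 2)                          # correlation matrix of this input
evals, evecs = np.linalg.eigh(C)
print("unit 0 overlap with 1st PC:", abs(W[0] @ evecs[:, -1]))
print("unit 1 overlap with 2nd PC:", abs(W[1] @ evecs[:, -2]))
print("lateral weight u_21:", U[1, 0])            # decays towards zero
```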

8.4 Self-Organizing Feature Extraction

Hebbian learning has been applied in a number of ways to producing feature detectors in a network whose input is a two-dimensional pixel array or "retina." Usually there is a well-defined set of possible inputs—such as letter patterns, or bars at certain angles—and each output cell is expected to become most sensitive to one of those inputs, with different output cells choosing different input patterns. There may be as many or more output units as input units; the aim is not to reduce the input information, but to transform it. One can usefully define the selectivity of a particular output O_i as [Bienenstock et al., 1982]

Selectivity = 1 − ⟨O_i⟩ / max O_i    (8.30)

where the average ⟨O_i⟩ and the max are both taken over all the possible inputs. The selectivity approaches 1 if the output unit only responds appreciably to a single input (or a narrow range of inputs in a continuous case), and goes to zero if it responds equally to all inputs. The problem is to define an architecture and learning rule in which the outputs go from low initial selectivity to high final selectivity. Moreover one wants different output units to become optimally sensitive to different input patterns, with some output unit or units approximately matched to every input pattern. Finally, if the output units are arranged in a geometrical way such as a two-dimensional array, one



would like similar input patterns to give maximum response at nearby output units; this is a form of feature mapping, which is discussed more fully in Section 9.4.

Most investigation has been with the visual cortex in mind. Experimental evidence shows the existence (e.g., in area 17 or V1) of neurons that respond preferentially to bars or edges in a particular orientation in the visual field [Hubel and Wiesel, 1962]. These are largely organized into orientational columns perpendicular to the cortical surface; neurons in a column respond maximally to approximately the same orientation.

After the work of Blakemore and Cooper [1970] on the imprinting of orientational preferences in kittens brought up in unusual environments, it was commonly thought that the whole orientational column structure is learned postnatally. Early models therefore employed input patterns of lines or edges, such as might be encountered postnatally, and showed how these could lead to orientational selectivity [von der Malsburg, 1973; Perez et al., 1975; Nass and Cooper, 1975; Bienenstock et al., 1982]. Now it is known that some orientation specificity is present prenatally, so such models cannot be the whole story [von der Malsburg and Cowan, 1982]. They may however be appropriate for later optimization and stabilization of the pattern, which occurs during a critical developmental period after birth.

Recently, Linsker [1986] proposed a model of self-organization in the visual system that does not require structured input. It uses a version of Hebbian learning in a layered network. His work is summarized in Linsker [1988], a paper that also treats principal components, maximum information, etc., and is highly recommended.

Linsker's network resembles the visual system, with an input retina feeding on to a number of layers corresponding to the layers of the visual cortex. The units of the network are linear and are organized into two-dimensional layers indexed A (input), B, C, and so on. There are feed-forward connections between layers, with each unit receiving inputs only from a neighborhood in the previous layer, the receptive field. These limited receptive fields are crucial; they let units respond to spatial correlations in the input or the previous layer. Figure 8.6 shows the arrangement.

Note that this is the first time we have encountered a multi-layer unsupervised learning network. Because the units are linear the final output in principle could be found by a network with just one linear layer—a product of linear transformations is a linear transformation. But this does not apply to the unsupervised learning rule itself, which can benefit from multiple layers.

If a particular unit receives input from K units numbered j = 1, 2, …, K its output O is

O = a + Σ_{j=1}^{K} w_j V_j    (8.31)

where V_j is either the input pattern (if the unit is in layer B) or the output of a previous layer. The threshold term a could be omitted. Linsker used a modified Hebbian learning rule

Δw_i = η (V_i O + b V_i + c O + d)    (8.32)



FIGURE 8.6 Architecture of Linsker's multi-layer Hebbian learning network, showing the receptive fields of a few units. (Layers A–D are shown, with the external input feeding layer A.)

where b–d are parameters which may be tuned to produce various types of behavior. The weights were clipped to the range w_− ≤ w_i ≤ w_+ to prevent them growing indefinitely. Explicit clipping can be avoided through the use of alternative rules such as (8.16) [Kammen and Yuille, 1988; Yuille et al., 1989]. Or, going more in the direction of biological realism, a mixture of purely excitatory (0 ≤ w_i ≤ w_+) and purely inhibitory (w_− ≤ w_i ≤ 0) connections can be used instead, and leads to the same results [Linsker, 1986].

Let us find the average weight change ⟨Δw_i⟩. We assume that all the inputs V_i have the same mean V̄ and put V_i = V̄ + ν_i. Then (8.32) becomes

⟨Δw_i⟩ = η ⟨ (V̄ + ν_i) [ a + Σ_j w_j (V̄ + ν_j) ] + b (V̄ + ν_i) + c [ a + Σ_j w_j (V̄ + ν_j) ] + d ⟩
       = η [ Σ_j C_ij w_j + λ ( μ − Σ_j w_j ) ]    (8.33)

where the constants λ and μ are combinations of the old constants a–d and V̄, and C_ij is the K × K covariance matrix ⟨ν_i ν_j⟩ of the inputs to our chosen unit. Equation (8.33) is the form of the rule actually used. It is easy to verify that it is equivalent to the average of gradient descent learning, Δw_i = −η ∂E/∂w_i, on the cost function

E = −½ w^T C w + (λ/2) ( μ − Σ_j w_j )².    (8.34)

In the first term, w^T C w is just the variance of the output O, by extension of (8.7) to the full covariance matrix. The second term may be regarded as a Lagrange multiplier or penalty term to impose the constraint Σ_j w_j = μ. So Linsker's rule



tries to maximize the output variance subject to this constraint and to the clipping constraint w_− ≤ w_i ≤ w_+. This may be compared to Oja's rule (8.6), which maximizes the same output variance subject to Σ_i w_i² = 1, and does not need a clipping constraint.

An equilibrium state of Linsker's rule will not normally have the right-hand side of (8.33) equal to zero, which would require (1, 1, …, 1) to be an eigenvector of C. Instead, most of the weights w_i will be pinned at one of their boundary values; either at w_− with ⟨Δw_i⟩ < 0 or at w_+ with ⟨Δw_i⟩ > 0. In fact all, or all but one, of the weights must be at a boundary value; any weight vector with two or more components not at a boundary is unstable under (8.33). To prove this, suppose the contrary and consider a perturbation w′ = w + ε away from the supposed equilibrium point. In particular, choose ε so that Σ_j ε_j = 0 (with ε_i = 0 if w_i is at a boundary). Then (8.33) gives simply

⟨Δε⟩ = η C ε    (8.35)

which causes |ε| to grow, because C is positive definite even when restricted to the non-boundary rows and columns. So the chosen point cannot be an equilibrium point, which proves the claim.

The most interesting aspects of Linsker's network arise from the spatial geometry of the layers and the receptive fields. We will approach them in a moment by describing Linsker's simulation results. Theoretical analysis, in detail beyond the level of this book, has been provided by Kammen and Yuille [1988], Yuille et al. [1989], and MacKay and Miller [1990], each using slightly modified rules and a continuum approximation. In the continuum approximation the weights become a function w(x, y) or w(r) of position in the plane. The covariance matrix C_ij becomes a two-point correlation function C(r, s) which takes the form C(r − s) if there is translational invariance. A Fourier transform to wave-vector space makes the analysis easier. The weight patterns that arise can also be described in terms of a set of functions much like those encountered in the quantum mechanics of atoms (1s, 2s, 2p, etc.).
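A small numpy sketch of iterating the averaged rule (8.33) for a single unit, with clipping to [w_−, w_+], is given below. The one-dimensional "receptive field" covariance, the values of λ and μ, and the boundary values are made-up illustrative choices; the point is only that most weights end up pinned at a boundary, as argued above.

```python
import numpy as np

rng = np.random.default_rng(5)
K = 20                                         # inputs in the receptive field
eta, lam, mu = 0.01, 2.0, 3.0                  # learning rate and made-up constants
w_minus, w_plus = -0.5, 0.5

# Toy covariance: nearby inputs within a 1-D "receptive field" are correlated.
idx = np.arange(K)
C = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 3.0) ** 2)

w = rng.uniform(w_minus, w_plus, size=K)
for _ in range(2000):
    w += eta * (C @ w + lam * (mu - w.sum()))  # averaged rule (8.33)
    w = np.clip(w, w_minus, w_plus)            # clipping keeps the weights bounded

print(np.round(w, 2))          # most weights end up pinned at w_- or w_+
```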

Simulation Results

Linsker first trained the connections between the input layer A and the first hidden layer B, then the next layer of connections from B to C, and so forth. Such sequential training may in general be important for teaching successively higher-level "concepts" to a layered unsupervised network, building gradually on elements learned earlier. This may require a teacher who can monitor the progress of learning within the network, and selectively activate or inhibit learning at different levels [Silverman and Noetzel, 1988]. However in Linsker's case the sequential approach was mainly for convenience; only one layer was simulated at a time, using an input covariance matrix calculated from the output properties of the previous layer.

The simulations used the equations (8.33) for the average ⟨Δw_i⟩, not the original learning rule (8.32) for Δw_i itself. Thus only the appropriate input covariance



FIGURE 8.7 Sketch of positive and negative connection strengths within the receptive fields of units in Linsker's network. (a) A center-surround cell in layer C. (b) A bilobed orientation-selective cell in layer G. After Linsker [1986].

matrix C was needed to calculate how a particular unit would behave. Once the resulting connection strengths feeding a whole layer had been calculated, the covariance matrices for that layer could be found from those of the preceding layer (giving a linear transformation of the matrices). Then the next layer could be treated in the same way.

Linsker took independent random noise as input in layer A, making the covariance matrix proportional to the identity matrix, the same for every site in layer B. The resulting weights from layer A to layer B depend on the parameters λ and μ, and on the size of the receptive fields. For a range of these parameters it was found that all the weights saturated to w_+, so the units in layer B simply averaged the input activity over their receptive fields. This made the B-units highly correlated because their receptive fields overlapped one another; a high activity in one of B₁'s inputs would typically be seen also by many of B₁'s neighbors.

As a result of this neighbor correlation in B, the units of layer C developed into center-surround cells. They responded maximally either to a bright spot in the center of their receptive field surrounded by a dark area or to a dark spot on a bright background. In the quantum mechanical analogy this is a "2s" function [MacKay and Miller, 1990]. Figure 8.7(a) illustrates the pattern of connection strengths found for one such unit; since most connections are pinned at either w_− or w_+ we show their values simply by +'s and −'s.

Different units of layer C had a Mexican hat covariance function; nearby units were positively correlated while further away there was a ring of negative correlation. Note that this is a correlation that has evolved through learning, not imposed by the terms of the model. Using the parameter values chosen by Linsker, this correlation gave rise in the following few layers (D–F) to center-surround units with sharper and sharper Mexican hat correlations; the negative trough became deeper and deeper.

For layer G Linsker changed the parameters, increasing the radius of the receptive field. This produced a variety of possible weight patterns, depending on the



parameters, many of which were no longer circularly symmetric. Units were found with alternating bands of positive and negative connections, or with two or three islands of inhibition around the center of an otherwise excitatory field; Fig. 8.7(b) shows such a "bilobed" unit. These units had thus indeed become orientation-selective cells, responding maximally for example to a bright edge or bar in a particular orientation against a dark background. The emergence of orientationally asymmetric units in a system with symmetric architecture and parameters is a broken symmetry phenomenon, and has been examined in those terms by Kammen and Yuille [1988].

The orientation selected by a given unit is rather arbitrary and varies from unit to unit. Up to now we have needed no lateral connections within each layer. But now they can help us to obtain a smooth variation of orientational preference across the output layer. The addition of local excitatory connections within layer G—local lateral excitation—was found to produce banded patterns of approximately co-oriented cells, strikingly similar to some patterns of orientational columns found biologically [Linsker, 1986]. This is an example of feature mapping, which will be discussed in greater detail in the following chapter. Alternatively, lateral inhibitory connections between pairs of neighboring units can give rise to "quadrature pairs" of cells with spatial receptive field patterns that are approximately 90° out of phase [Yuille et al., 1989].

In the mammalian visual system center-surround cells are found as early as in the retina itself, and orientation-selective cells (including quadrature pairs) are found in the visual cortex. Considering the complexity of the visual system, including feedback and nonlinearity, it is remarkable that such a simple model can develop a similar structure. Of course this should not be taken to imply that feature detecting cells in retina and visual cortex develop just as in the model; there are many alternative ideas as to how it might occur [see e.g., Rose and Dobson, 1985]. However it does show that relatively simple mechanisms of Hebbian learning could produce such structures, without either visual input or detailed genetic programming.

NINE Unsupervised Competitive Learning

In the preceding chapter we studied unsupervised learning approaches—all based on Hebbian learning—in which multiple output units are often active together. We now turn to competitive learning in which only one output unit, or only one per group, is on at a time. The output units compete for being the one to fire, and are therefore often called winner-take-all units. They are also sometimes called grandmother cells (cf. page 143).

The aim of such networks is to cluster or categorize the input data. Similar inputs should be classified as being in the same category, and so should fire the same output unit. The classes must be found by the network itself from the correlations of the input data; in the language of statistics we must deal with unlabelled data.

Categorization has obvious uses for any artificially intelligent machine trying to understand its world. More immediately, it can be used for data encoding and compression through vector quantization, in which an input data vector is replaced by the index of the output unit that it fires. It also has applications to function approximation, image processing, statistical analysis, and combinatorial optimization.

It is worth mentioning at the outset that grandmother-cell representations have some rather generic disadvantages over distributed representations [Caudill, 1989]:

■ They need one output cell (and associated connections) for every category involved. Thus N units can represent only N categories, compared to 2^N for a binary code. Actually K-out-of-N codes may prove best for network computation, but are not yet well understood or exploited [Hecht-Nielsen, 1987a].



■ They are not robust to degradation or failure. If one output unit fails, then we lose that whole category. Of course the same sort of problem arises in most digital circuitry, but we have come to expect better of neural networks.



FIGURE 9.1 A simple competitive learning network. The connections shown with open arrows are inhibitory; the rest are excitatory.

■ They cannot represent hierarchical knowledge. Two input patterns are either lumped together or not; there is no way to have categories within categories. Adding further layers to a network clearly does not help unless we relax the single-winner condition.

A closely related topic is feature mapping, which we discuss in Sections 9.4–9.6. Feature mapping is distinguished by the development of significant spatial organization in the output layer, whereas in simple competitive learning there is no particular geometrical relationship between the output units. Nevertheless the history and concepts of the two ideas are closely intertwined. Indeed, because feature mapping is so closely associated with the name of Kohonen, even simple competitive learning units are sometimes called Kohonen units. A history of either field should include the seminal work of Rosenblatt [1962], von der Malsburg [1973], Fukushima [1975, 1980], Grossberg [1976a, b], Kohonen [1982], and Rumelhart and Zipser [1985].

9.1 Simple Competitive Learning

In the simplest competitive learning networks there is a single layer of output units O_i, each fully connected to a set of inputs ξ_j via excitatory connections w_ij > 0. Figure 9.1 shows the architecture. We consider mainly binary 0/1 inputs and outputs. Only one of the output units, called the winner, can fire at a time. The winner is normally the unit with the largest net input

h_i = Σ_j w_ij ξ_j = w_i · ξ    (9.1)

for the current input vector ξ. Thus

w_{i*} · ξ ≥ w_i · ξ    (for all i)    (9.2)



defines the winning unit i* with O_{i*} = 1. If the weights for each unit are normalized, so that (say) |w_i| = 1 for all i, then (9.2) is equivalent to

|w_{i*} − ξ| ≤ |w_i − ξ|    (for all i).    (9.3)

This says that the winner is the unit with normalized weight vector w closest to the input vector ξ.

It does not much matter how the winner-take-all character of the output is implemented. In a computer simulation one can merely search for the maximum h_i. In a real network it is possible to implement a set of winner-take-all units with lateral inhibition; each unit inhibits the others, as shown in Fig. 9.1. A self-excitatory connection is also required, and the lateral weights and nonlinear activation function must be chosen correctly to ensure that only one output is chosen and that oscillation is avoided. For details see Grossberg [1976a, 1980]. Other schemes for winner-take-all networks have been proposed by Feldman and Ballard [1982], Lippmann [1987], and Winters and Rose [1989].

A winner-take-all network implements a pattern classifier using the criterion (9.2) or (9.3). The problem now is how to get it to find clusters in the input data and choose the weight vectors w_i accordingly. We start with small random values for the weights; it is important that any symmetry be broken. Then a set of input patterns ξ^μ is applied to the network in turn or in random order. Alternatively the inputs might be drawn independently from a distribution P(ξ). For each input we find the winner i* among the outputs and then update the weights w_{i*j} for the winning unit only to make the w_{i*} vector closer to the current input vector ξ^μ. This makes the winning unit more likely to win on that input in the future. The most obvious way to do this would be

Δw_{i*j} = η ξ_j    (9.4)

but this is no good by itself because it makes the weights grow without bound, and one unit comes to dominate the competition for all inputs. One way to correct (9.4) is to follow it with a normalization step w'_{i*j} = a w_{i*j} (for all j), choosing a so that Σ_j w'_{i*j} = 1 or Σ_j (w'_{i*j})² = 1 (i.e., |w_{i*}| = 1). It is easy to show that the first of these choices corresponds to an overall rule

Δw_{i*j} = η ( ξ_j / Σ_k ξ_k − w_{i*j} )    (9.5)

in which the first term is a normalized version of the input.

Another solution, which we will henceforth refer to as the standard competitive learning rule, is to use

Δw_{i*j} = η ( ξ_j − w_{i*j} )    (9.6)

which moves w_{i*} directly towards ξ. Of course (9.6) is equivalent to (9.5) if the input is already normalized, and indeed (9.6) works best for pre-normalized inputs.
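A minimal numpy sketch of the standard competitive learning rule (9.6) on three well-separated clusters follows; the cluster centers, learning rate, and the choice of initializing the weights from input samples are arbitrary illustrative choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
eta, n_units = 0.1, 3

# Three arbitrary clusters of input vectors in the plane.
centers = np.array([[0.0, 3.0], [3.0, 0.0], [-2.0, -2.0]])
xi = np.vstack([c + 0.4 * rng.normal(size=(200, 2)) for c in centers])
rng.shuffle(xi)

# Initialize the weights to samples from the input itself (avoids dead units).
w = xi[rng.choice(len(xi), size=n_units, replace=False)].copy()

for x in xi:
    winner = np.argmin(np.linalg.norm(w - x, axis=1))   # closest weight vector, as in (9.3)
    w[winner] += eta * (x - w[winner])                   # standard rule (9.6)

print(np.round(w, 2))        # each weight vector sits near one cluster center
```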



FIGURE 9.2 Competitive learning. The dots represent the input vectors, the crosses represent the weights for each of three units. (a) Before learning. (b) After learning.

Grossberg [1976a] shows how to normalize the input appropriately using an extra input layer of units.

Note that these learning rules are for the winner only. Because O_{i*} = 1 and O_i = 0 for i ≠ i* by definition of the winner, we can write (9.6) in the form

Δw_ij = η O_i ( ξ_j − w_ij )    (9.7)

which is valid for all i and looks more like a Hebb rule with a decay term. In fact it is identical to Sanger's rule (8.22) and to Oja's M-unit rule (8.23) when they are specialized to the single winner case. It is also an appropriate generalization of competitive learning for multiple continuous-valued winners [Grossberg, 1976a].

A geometric analogy is very useful for understanding the consequences of competitive learning rules. Consider a case with 3 inputs so that each input pattern is a three-dimensional vector ξ^μ = (ξ₁^μ, ξ₂^μ, ξ₃^μ). For binary ξ_i^μ this vector lies on one of the vertices of a cube, but in this low dimension it is more convenient to think in terms of continuous variables. We can represent the direction of each ξ^μ by a point on the surface of a unit sphere, as shown by the dots in Fig. 9.2. The direction of the weight vector w_i = (w_i1, w_i2, w_i3) for each output unit i also corresponds to a point on the same sphere; the weights for 3 output units are depicted by crosses in the figure. These vectors are normalized to unit length (and therefore lie on the surface of the sphere) if we use the |w_i| = 1 normalization. Diagram (a) shows a possible initial state and diagram (b) a typical final state; the output units have each discovered a cluster of inputs and have moved to its center of gravity.

This is the principal function of a competitive learning network—it discovers clusters of overlapping input vectors. We can see how this works geometrically. First, (9.2) defines the winner for a given input as the output unit with the greatest value of w_i · ξ, which means the smallest angle θ to the ξ direction (if |w_i| = 1). So the winner of a given dot is the closest cross. Secondly, the winning w_i is modified by (9.5) or (9.6), which



moves it towards the active input vector. The patterns thus compete for output units, continually trying to bring the nearest one closer. The final state shown in Fig. 9.2(b) is a stable attractor if the inputs are activated about equally often.

One problem is that units with w_i far from any input vector may never win, and therefore never learn. These are sometimes called dead units. They may actually be desirable if the future might bring different input patterns, but otherwise can be prevented in several ways:

1. We can initialize the weights to samples from the input itself, thus ensuring that they are all in the right domain.

2. We can update the weights of all the losers as well as those of the winners, but with a much smaller η [Rumelhart and Zipser, 1985; see also Grossberg, 1987b]. Then a unit that has always been losing will gradually move towards the average input direction until eventually it succeeds in winning a competition. This has been called leaky learning.

3. If the units are arranged in a geometrical way, such as in a two-dimensional array, we can update the weights of neighboring losers as well as those of the winners. This is the essence of Kohonen feature mapping, discussed in detail in Section 9.4.

4. We can turn the input patterns on gradually, using αξ^μ + (1 − α)v, where v is some constant vector to which all the weight vectors w_i are initialized. As we turn α up gradually from 0 to 1 the pattern vectors move away from v and take weight vectors with them [Hecht-Nielsen, 1987b].

5. We can subtract a bias (or threshold) term μ_i from h_i in (9.1) and adjust the thresholds to make it easier for frequently losing units to win. Units that win often should raise their μ_i's, while the losers should lower them [Grossberg, 1976b; Bienenstock et al., 1982]. Stabilizing such a scheme is a little tricky, but can be done; indeed one can force M output units to win 1/Mth of the time on average [DeSieno, 1988]. The mechanism is sometimes called a conscience; frequent winners are supposed to feel guilty and so reduce their winning rate (a minimal sketch of such a conscience term follows this list).

6. We can smear the pattern vectors with the addition of noise, using a distribution with a long tail so that there is some positive probability for any ξ [Szu, 1986; Hueter, 1988].
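The following numpy sketch illustrates remedy 5 with one simple, made-up choice of bias update: frequent winners accumulate a handicap so that every unit comes to win roughly equally often. It is not DeSieno's exact scheme, and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
eta, beta, n_units = 0.1, 0.01, 4

xi = rng.normal(size=(2000, 2)) + np.array([2.0, 2.0])   # all data in one region
w = rng.normal(size=(n_units, 2))                          # some units start far away
bias = np.zeros(n_units)                                   # "conscience" term mu_i
wins = np.zeros(n_units)

for x in xi:
    score = -np.linalg.norm(w - x, axis=1) - bias           # closeness minus handicap
    winner = np.argmax(score)
    w[winner] += eta * (x - w[winner])                       # standard rule (9.6)
    # frequent winners raise their bias, making it easier for the others to win
    bias += beta * ((np.arange(n_units) == winner) - 1.0 / n_units)
    wins[winner] += 1

print(wins / len(xi))        # winning rates become roughly equal
```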

Cost Functions and Convergence

It would be nice to prove that competitive learning converges to the "best" solution of a given problem. But the best solution of a general clustering problem is not very clearly defined, and there exist various different criteria in the statistics literature [see e.g., Krishnaiah and Kanal, 1982 or Duda and Hart, 1973]. In practice most authors have defined an algorithm and then looked after the fact (if at all) at what it optimized.



Of most interest is the standard rule (9.6), for which there is an associated cost (Lyapunov) function given by [Ritter and Schulten, 1988c]

E = ½ Σ_{ijμ} M_i^μ ( ξ_j^μ − w_ij )² = ½ Σ_μ | ξ^μ − w_{i*} |².    (9.8)

Here M_i^μ is the cluster membership matrix which specifies whether or not input pattern ξ^μ activates unit i as winner:

M_i^μ = { 1  if i = i*(μ);  0  otherwise.    (9.9)

Note that the winner i* is a function of the input ξ^μ, and of all the weights w_i, in both (9.8) and (9.9). Thus the membership matrix in general will change in the course of learning. Gradient descent on the cost function (9.8) yields

Δw_ij = −η ∂E/∂w_ij = η Σ_μ M_i^μ ( ξ_j^μ − w_ij )    (9.10)

which is just the sum of the standard rule (9.6) over all the patterns for which i is the winner.¹

¹We assume that the patterns are weighted equally. In general a probability P^μ could be inserted into (9.8) and (9.10).

Thus on average (for small enough η) the standard rule decreases (9.8) until we reach a local minimum. This works even though (9.8) is only piecewise differentiable.

This result is, however, somewhat deceptive for two reasons. Firstly, there are typically very many local minima, corresponding to different clusterings of the data. Nevertheless the cost function does let us rank different clusterings in order of desirability, regarding lower cost ones as better. Stochastic noise (e.g., from the input presentation order) may be able like simulated annealing to kick us out of the higher minima and towards progressively lower ones, but there is no guarantee of finding the global minimum. Secondly, we have averaged across the different patterns. We could actually use (9.10) directly, and update in batch mode by accumulating the changes Δw_ij given by each of a finite set of patterns ξ^μ before actually updating the weights. Then our argument would be watertight and we would necessarily reach a stable equilibrium (though still only a local minimum). In fact that procedure corresponds exactly to the classical k-means clustering algorithm [e.g., Krishnaiah and Kanal, 1982]. But with incremental updates—after each pattern μ—continuing changes in the i*(μ) classification can occur. With a regular cyclic presentation of patterns one can even produce examples in which the winners are permuted in every cycle [Grossberg, 1976a]. These problems can be reduced to some extent by doing the weight adjustments with "momentum" as in (6.24), effectively performing a


weighted moving average over a set of recent patterns that resulted in the same winner. Only in the case of sufficiently sparse patterns can one prove stability and convergence theorems for incremental updating [Grossberg, 1987b, and references therein]. The patterns are sparse enough if, for example, we can find a set of clusters so that the minimum overlap ξ^μ · ξ^ν within a cluster exceeds the maximum overlap between that cluster and any other.

In practice, both to prove theorems (using e.g., stochastic approximation methods) and to classify real data, it helps to decrease the learning rate η during learning. An initial large value encourages wide exploration, while later on a smaller value suppresses further winner changes and allows refinement of the weights. It is common to use either η(t) = η₀ t^{−α} (with α ≤ 1) or η(t) = η₀(1 − αt).

It is possible to start with a cost function and derive a learning rule from it, as we have seen elsewhere. An interesting example was suggested by Bachmann et al. [1987], who replaced the quadratic form in (9.8) by a different power law:

    E = -\frac{1}{p} \sum_{i\mu} Q_i^\mu \,\big|\boldsymbol{\xi}^\mu - \mathbf{w}_i\big|^{-p} .        (9.11)

They also let the winning weight vector be repelled by input vectors in other clusters, in addition to being attracted by those in its own cluster, taking

    Q_i^\mu = 2 M_i^\mu - 1 .        (9.12)

For p = N − 2 the motion² of the weight vectors in the N-dimensional input space is like that of electrostatic point charges attracted toward stationary charges fixed at the ξ^μ's of the winning cluster, and repelled by charges at other ξ^μ's. This kind of model is therefore called a Coulomb energy network [Scofield, 1988].
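Returning to the batch-mode update mentioned above: accumulating the changes over a full presentation of the pattern set before moving the weights is one iteration of k-means. A minimal sketch (illustrative names and data; not the book's code):

```python
import numpy as np

def kmeans_style_sweep(W, X, eta=1.0):
    """One batch sweep over the data set X (shape P x N) with prototypes W (shape M x N).

    Every pattern is assigned to its nearest prototype (its winner), and each
    prototype is then moved toward the mean of the patterns it won.  With
    eta = 1 the prototype jumps to that mean, i.e. one k-means iteration.
    """
    winners = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
    for i in range(len(W)):
        members = X[winners == i]
        if len(members):                       # a dead unit simply stays where it is
            W[i] += eta * (members.mean(axis=0) - W[i])
    return winners

# Example: a few sweeps on toy two-dimensional data with three prototypes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + 3.0 * rng.integers(0, 3, size=(300, 1))
W = X[rng.choice(len(X), size=3, replace=False)].copy()   # start from data samples
for _ in range(10):
    kmeans_style_sweep(W, X)
```

Because the whole sweep is accumulated before the weights move, the winner assignments cannot oscillate within a sweep, which is the stability advantage of batch mode discussed above.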

9.2 Examples and Applications of Competitive Learning

Graph Bipartitioning

As a simple illustration of competitive learning we consider the graph bipartitioning problem defined in Section 4.3; we want to divide a given graph into two equal parts with as few links as possible between the parts. Let us have one binary 0/1 input for each vertex of the graph and two output units, one to indicate each partition. Then, after the problem has been solved, we should be able to turn on one vertex and have the output tell us the partition to which it belongs.

²As in almost all the physical analogies in this book, the motion we think of here is overdamped; we should imagine it as taking place in a very viscous fluid.


FIGURE 9.3 Application of competitive learning to graph bipartitioning. There are two output units, L and R, and one input for each vertex. Vertices are shown solid if L wins when that vertex is on alone (monopole stimulus), and open if R wins. Edges are shown heavy if L wins when the vertices at both ends are both on (dipole stimulus), and light if R wins. After Rumelhart and Zipser [1985].

How can we hope to produce such a network? Clearly we have to tell it about the edges of the graph. A good way to do this is to represent each edge by a dipole stimulus in which the two vertices belonging to the edge are turned on together. If we use this set of patterns (one for each edge) as our input data ξ^μ, the network should learn to divide them into two roughly equal halves with minimum overlap. This does not quite correspond to the graph bipartitioning problem, for which we want equal numbers of vertices in each partition, not necessarily equal numbers of edges. However, it is often close enough to be interesting, and serves as a useful illustration [McClelland and Rumelhart, 1988].

Figure 9.3 shows two examples. In (a) the graph is a regular grid, examined in detail by Rumelhart and Zipser [1985] using the rule (9.5). The figure shows a typical result after learning, indicating which of the two output units fires in response to monopole stimuli (one vertex) and to dipole stimuli (two vertices on one edge). The network has found one of the two optimum solutions to the bipartitioning problem, and has divided the dipole stimuli roughly equally. However, it sometimes finds a solution which divides the graph diagonally, which is far from optimal for the bipartitioning problem.

In (b) we show the result of applying the rule (9.6) to the bipartitioning problem of Fig. 4.6. We accumulated the changes Δw_{ij} for each of the 28 dipole stimuli (edges) before actually updating the weights. Starting the weights from uniform random values in the range [0, 1] we found convergence to an optimum solution in approximately 30% of our trials, using η = 0.1. Convergence usually occurred within 10 weight updates. The typical solution shown has not only solved the bipartitioning problem for the vertices, but has also divided the edges exactly in two.
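A small experiment along these lines is easy to set up; in the sketch below the toy graph, learning rate, and number of sweeps are arbitrary illustrative choices rather than the ones used for Fig. 9.3:

```python
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]   # two triangles joined by one edge
n_vertices = 6

# One dipole stimulus per edge: the two endpoint inputs are turned on together.
patterns = np.zeros((len(edges), n_vertices))
for k, (a, b) in enumerate(edges):
    patterns[k, [a, b]] = 1.0

W = rng.uniform(0.0, 1.0, size=(2, n_vertices))   # two output units, "L" and "R"
eta = 0.1
for sweep in range(50):
    for x in patterns:
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        W[winner] += eta * (x - W[winner])

# Monopole test: which unit wins when a single vertex is turned on alone?
for v in range(n_vertices):
    x = np.zeros(n_vertices)
    x[v] = 1.0
    print(v, int(np.argmin(np.linalg.norm(W - x, axis=1))))
```

As in the book's runs, convergence to the intended two-triangle split is not guaranteed on every trial; restarting from different random weights is the simplest remedy.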


FIGURE 9.4 Voronoi tessellation. The space is divided into polyhedral regions according to which of the prototype vectors (dots) is closest. The boundaries are perpendicular bisector planes of lines joining pairs of neighboring prototype vectors.

Vector Quantization

Probably the most important application of competitive learning is vector quantization for data compression. Vector quantization is used both for storage and for transmission of speech and image data. Gray [1984] and Nasrabadi and King [1988] provide reviews of the general problem and of traditional algorithms. The idea is to categorize a given set or distribution of input vectors ξ^μ into M classes, and then represent any vector just by the class into which it falls. The vector components are usually continuous-valued. We can just transmit or store the index of the appropriate class, rather than the input vector itself, once a set of classes, or a codebook, has been agreed on. Normally the classes are defined by a set of M prototype vectors, and we find the class of a given input by finding the nearest prototype vector using the ordinary (Euclidean) metric. This divides up the input space into a Voronoi tessellation (or Dirichlet tessellation), illustrated in Fig. 9.4.

The translation to a competitive learning network is immediate. When an input vector is applied at the network input, the winning output tells us the appropriate class. The weight vectors w_i are the prototype vectors, and we find the winner using (9.3):

    \big|\mathbf{w}_{i^*} - \boldsymbol{\xi}\big| \le \big|\mathbf{w}_i - \boldsymbol{\xi}\big| \quad \text{(for all } i\text{).}        (9.13)

Note that this is not equivalent to maximizing w_i · ξ unless the weight vectors are normalized. We can find an appropriate set of prototype vectors (i.e., weights) with the standard competitive learning algorithm (9.6). When exposed to sample data the weights change to divide up the input space into Voronoi polyhedra containing roughly equal numbers of sample points; they provide a discretized map of the input probability P(ξ). After training we can freeze the weights to establish a static codebook.

Figure 9.5 shows two simple examples. In each case we defined a probability distribution P(ξ) for the two-dimensional input vectors ξ = (ξ_x, ξ_y). In (a) the two output units correctly classified the input into its two clusters. For the L-shaped distribution in (b) the 10 units always divided up the input distribution fairly evenly. More input samples gave a more precise division, symmetric about the diagonal.

In practical applications for data compression it is essential to add additional mechanisms to avoid dead units and to ensure an equitable distribution of units in the pattern space.
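Once trained, the codebook is used simply by transmitting winner indices and reconstructing each vector as its prototype. A minimal sketch (the codebook here is random, standing in for one produced by competitive learning; function names are illustrative):

```python
import numpy as np

def encode(X, codebook):
    """Assign each input vector to the index of its nearest prototype, as in (9.13)."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def decode(indices, codebook):
    """Reconstruct each vector as the prototype of its class."""
    return codebook[indices]

# Example: an M = 16 entry codebook for 8-component vectors compresses each
# vector to a single index; the reconstruction error is the quantization noise.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))      # would normally come from competitive learning
X = rng.normal(size=(1000, 8))
idx = encode(X, codebook)
X_hat = decode(idx, codebook)
print("mean squared quantization error:", ((X - X_hat) ** 2).mean())
```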


FIGURE 9.5 Examples of vector quantization. In each case the 1000 dots are samples from an input probability distribution. The black squares show the weight vectors of each unit after training on these data. A unit wins the competition when its weight vector is closest to the input vector, so the weight vectors shown define a Voronoi tessellation of the plane.

This has been done using Kohonen feature mapping [Nasrabadi and Feng, 1988; Naylor and Li, 1988] and by the use of a conscience mechanism [Ahalt et al., 1990]. The conscience mechanism appears best, and produces near-optimal results.

Kohonen [1989] has also suggested a supervised version of vector quantization called learning vector quantization (LVQ). In a supervised case the classes are predefined and we have a body of labelled sample data; each sample input vector ξ is tagged with its correct class. We may usefully have several prototype vectors per class. The winning rule (9.13) is unmodified, but the update rule depends on whether the class of the winner is correct or incorrect:

    \Delta w_{i^*j} = \begin{cases} +\eta\,(\xi_j - w_{i^*j}) & \text{if the class is correct;} \\ -\eta\,(\xi_j - w_{i^*j}) & \text{if the class is incorrect.} \end{cases}        (9.14)

In the first case the rule is the standard one, but in the second case the weight vector is moved in the opposite direction, away from the sample vector. This minimizes the number of misclassifications.

An improved algorithm called LVQ2, closer in effect to Bayes decision theory, has also been suggested by Kohonen [1989]. The learning rule is only applied if:
1. the input vector ξ is misclassified by the winning unit i*;
2. the next-nearest neighbor i′ has the correct class; and
3. the input vector ξ is sufficiently close to the decision boundary (the perpendicular bisector plane) between w_{i*} and w_{i′}.
Then both w_{i*} and w_{i′} are updated, using the incorrect and correct cases of (9.14) respectively. This rule has shown excellent performance in studies on statistical and speech data [Kohonen et al., 1988].
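The LVQ1 update (9.14) is a one-line modification of the standard rule; a sketch follows (the bookkeeping of per-prototype class labels is an illustrative choice, not Kohonen's reference implementation):

```python
import numpy as np

def lvq1_step(W, proto_labels, x, label, eta=0.05):
    """One LVQ1 update, following (9.14).

    W            : (M, N) array of prototype vectors
    proto_labels : (M,) array giving the class assigned to each prototype
    x, label     : one labelled training vector
    The winning prototype moves toward x if its class is correct, and away
    from x if it is incorrect.
    """
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    sign = 1.0 if proto_labels[winner] == label else -1.0
    W[winner] += sign * eta * (x - W[winner])
    return winner
```

LVQ2 would add the three conditions listed above before applying the paired updates to the winner and the next-nearest prototype.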



FIGURE 9.6 The horizontal/vertical line problem of Rumelhart and Zipser [1985]. (a) Architecture of the network. Not shown are the teacher inputs or the connections; each layer is fully connected to the next. The input layer is a 6 × 6 pixel array on which line stimuli are presented. The intermediate layer has two groups of four winner-take-all units. The output layer has two winner-take-all units, signalling horizontal or vertical input lines. (b) Typical sensitivity of the four units in one intermediate-layer group; one fires for each of the 12 line stimuli.

Multi-Layer Networks

Our competitive learning models so far have had a single layer in which exactly one unit wins for each input pattern. To generalize this to a multi-layer network we need to allow several winners per layer so that there are interesting patterns for subsequent layers to analyze. In this way we can hope to detect successively higher order patterns in the data. One approach is to divide up each layer into groups, in each of which one unit wins [Rumelhart and Zipser, 1985]. Even if the groups in a given layer are structurally identical, they may well cluster the data in different ways (starting from random connection strengths) and hence add to the available information.

As an example, consider the horizontal/vertical line problem studied by Rumelhart and Zipser [1985]. The inputs are arranged in a 6 × 6 array that is fully connected to the remaining network (which therefore has no architectural information about the two-dimensional layout). Input stimuli consist of excitation of a whole horizontal or vertical line of the inputs. We want to have a pair of output units which learn to fire on horizontal and vertical lines respectively. With a single output layer this task is impossible. If the desired vertical feature detector is to win for any of the 6 vertical lines, by symmetry it will have equal weight from each of the 6 inputs in each of the 6 vertical lines. But this is all 36 inputs! Even with a teacher it cannot work.

The simplest architecture that did work required two layers, and is shown in Figure 9.6(a). An extra teaching input was needed initially to say whether the line


was horizontal or vertical, but could be omitted after training. In each group of the intermediate layer the four units learn to respond to three vertical, or three horizontal, lines. The output layer learns the correlations between the two groups in the intermediate layer, which normally leads to discrimination between horizontal and vertical input lines. This fails if both groups in the intermediate layer happen to divide up the lines (6 → 3 + 3) in the same way, but this is rare and could be cured using more than two groups.

Another approach to multi-layer architecture is to impose mutual inhibition between units only when they are within a certain distance d of each other (we assume that the units are arranged geometrically, usually within a two-dimensional layer). Then no two units within d can fire together, though further units may. Each unit has a vicinity area of radius d which is its local group, but groups can overlap. This is perhaps more appropriate for biological modelling than the discrete groups considered above.

Fukushima [1975, 1980] employed this idea in his multi-layer unsupervised learning networks called the cognitron and neocognitron.³ Up to eight layers were used, with a very specific architecture of interconnections between layers, using limited receptive fields and different functional groups. The network was able to learn to differentiate different letters (presented as patterns on a 16 × 16 pixel array) even with the imposition of considerable translation, scaling, or distortion.

9.3 Adaptive Resonance Theory

As noted earlier (page 222), there is no guarantee of stability of the categories formed by competitive learning. Even if we continue to present a finite fixed sequence of patterns, the category (i.e., winning unit) of a particular pattern may continue changing endlessly. One way of preventing this is to reduce the learning rate η gradually to zero, thus freezing the learned categories. But then the network loses its plasticity, or ability to react to any new data. It is not easy to have both stability and plasticity; this is Grossberg's stability-plasticity dilemma. To make a real-time learning system that can exist and continue adapting in a nonstationary world we must clearly deal with this dilemma.

But there is also another related stability issue: how many output units should we utilize? If we keep that number fixed and avoid dead units, then enforcing long-term stability of the learned categories means that there will be no units available for new patterns subsequently encountered. On the other hand, if we somehow have an inexhaustible supply of output units we must avoid paving the input space ever more finely, since that is not categorization at all. A good solution would be to have a finite (or infinite) supply of output units, but not use them until needed. Then at any time there would be some output units in use, stably detecting the known categories,

³Later versions of the neocognitron [Fukushima et al., 1983] used supervised learning.


and some more waiting in the wings until needed. If we used up the whole supply we should probably stop adapting; stability should come above plasticity when we are at full capacity.

Carpenter and Grossberg [1987a, b, 1988] have developed networks called ART1 and ART2 that behave in just this way. These networks overcome the stability-plasticity dilemma by accepting and adapting the stored prototype of a category only when the input is sufficiently similar to it. "ART" stands for adaptive resonance theory; the input and stored prototype are said to resonate when they are sufficiently similar. When an input pattern is not sufficiently similar to any existing prototype a new category is formed, with the input pattern as the prototype, using a previously uncommitted output unit. If there are no such uncommitted units left, then a novel input gives no response.

The meaning of sufficiently similar above depends on a vigilance parameter ρ, with 0 < ρ < 1. If ρ is large the similarity condition becomes very stringent, so many finely divided categories are formed. On the other hand a small ρ gives a coarse categorization. The vigilance level can be changed during learning; increasing it can prompt subdivision of existing categories.

ART1 is designed for binary 0/1 inputs, whereas ART2 is for continuous-valued inputs. We focus exclusively on ART1, the simpler case. We suggest Moore [1988] for further reading, as well as Carpenter and Grossberg [1988] for an overview, and Carpenter and Grossberg [1987a, b] for the details of ART1 and ART2 respectively.

The ART1 algorithm

It is easiest to present ART1 as an algorithm before describing the network implementation. Let us take input vectors ξ and stored prototype vectors w_i, both with N binary 0/1 components. i indexes the output units, or categories, each of which can be enabled or disabled. We start with w_i = 1 for all i, where 1 is the vector of all ones; this will represent an uncommitted state, not a category. Then the algorithm upon presentation of a new input pattern ξ is as follows:

1. Enable all the output units.
2. Find the winner i* among all the enabled output units (exit if there are none left). The winner is defined as the one for which \tilde{w}_i · ξ is largest, where \tilde{w}_i is a normalized version of w_i. The normalization is given by

       \tilde{\mathbf{w}}_i = \frac{\mathbf{w}_i}{\varepsilon + \sum_j w_{ji}}        (9.15)

   where w_{ji} is the jth component of w_i. The small number ε is included to break ties, selecting the longer of two w_i's which both have all their bits in ξ. Note that an uncommitted unit wins if there is no better choice.
3. Test whether the match between ξ and w_{i*} is good enough by computing the ratio

       r = \frac{\boldsymbol{\xi} \cdot \mathbf{w}_{i^*}}{\sum_j \xi_j} .        (9.16)

   If r ≥ ρ, where ρ is the vigilance parameter, there is resonance; go to step 4. Otherwise, if r < ρ, the prototype vector w_{i*} is rejected; disable unit i* and go back to step 2.
4. Adjust the winning vector w_{i*} by deleting any bits in it that are not also in ξ. This is a logical AND operation, and is referred to as masking the input.

This algorithm can terminate in one of three ways. If we find a matching prototype vector we adjust it (if necessary) in step 4 and output that category i*. If we find no suitable prototype vector from among the previous categories, then one of the uncommitted vectors is selected and made equal to the input ξ in step 4; again we output the appropriate (new) category i*. Finally, if there are no matches and no uncommitted vectors we end up with all units disabled, and hence no output.

It should now be clear how this algorithm solves the various problems we raised. It continues to have plasticity until all the output units are used up. It also has stability; a detailed analysis shows that all weight changes cease after a finite number of presentations of any fixed set of inputs. This comes essentially from the fact that the adaptation rule, step 4, can only remove bits from the prototype vector, never add any. Thus a given prototype vector can never cycle back to a previous value.

Note that the loop from step 3 back to step 2 constitutes a search through the prototype vectors, looking at the closest, next closest, etc., by the maximum \tilde{w}_i · ξ criterion until one is found that satisfies the r ≥ ρ criterion. These criteria are different, so going further away by the first measure may actually bring us closer by the second. The first measure is concerned with the fraction of the bits in w_i that are also in ξ, whereas r is the fraction of the bits in ξ that are also in w_i. Of course this search is comparatively slow (and more like a conventional algorithm), but it only occurs before stability is reached for a given input set. After that each category is found on the first attempt and there is never a jump back from step 3.
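The algorithm above translates almost line for line into code. The sketch below is an illustration of the procedure as stated, with ε, the fixed pool of units, and the toy patterns as assumed parameters:

```python
import numpy as np

def art1(patterns, n_units, rho=0.7, eps=1e-6):
    """Cluster binary 0/1 patterns with the ART1 algorithm described above.

    w[i] starts as all ones (an uncommitted unit).  For each input xi we search
    the enabled units in order of the choice value (w_i . xi)/(eps + sum_j w_ji),
    test the match r = (xi . w_i*)/sum_j xi_j against the vigilance rho, and on
    resonance mask the winner: w_i* <- w_i* AND xi.  Returns the prototypes and
    the category assigned to each pattern (-1 means no response).
    """
    N = len(patterns[0])
    w = np.ones((n_units, N))
    categories = []
    for xi in np.asarray(patterns, dtype=float):
        enabled = np.ones(n_units, dtype=bool)
        winner = -1
        while enabled.any():
            choice = (w @ xi) / (eps + w.sum(axis=1))   # normalized prototypes, (9.15)
            choice[~enabled] = -np.inf
            i = int(np.argmax(choice))
            r = (xi @ w[i]) / xi.sum()                  # match ratio, (9.16)
            if r >= rho:                                # resonance
                w[i] = np.minimum(w[i], xi)             # masking: logical AND with the input
                winner = i
                break
            enabled[i] = False                          # reset: disable this unit and search on
        categories.append(winner)
    return w, categories

# Example: noisy versions of two underlying binary prototypes.
pats = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0]]
w, cats = art1(pats, n_units=4, rho=0.6)
print(cats)
```

Raising rho in this sketch produces more, finer categories, exactly as described for the vigilance parameter above.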

Network Implementation of ART1

Carpenter and Grossberg designed the ART1 network using previously developed building blocks that were based on biologically reasonable assumptions. The selection of a winner, the input layer, the weight changes, and the enable/disable mechanism can all be described by realizable circuits governed by differential equations. There are at least three time scales involved: the relaxation time of the winner-take-all circuit; the cycling time of the search process; and the rate of weight update. We describe a simpler version, taking the winner-take-all circuit for granted and simplifying certain other features.

Figure 9.7 shows this reduced version. There are two layers, with units V_j in the input layer and O_i in the output layer, fully connected in both directions. The forward weights w_{ij} are normalized copies of the backward weights w_{ji}, according to (9.15). Note that the w_{ji} are each 0 or 1, as are ξ_j, V_j, O_i, A, and R. The output layer consists of winner-take-all units; only the unit with the largest net input Σ_j w_{ij} V_j among all enabled units has O_i = 1.



FIGURE 9.7 The ART1 network. The large shaded arrows represent connections between all pairs of output units O_i and input units V_j.

If the "reset" signal R is turned on while a winner is active, that unit is disabled and removed from future competitions. All units can be re-enabled by another signal that is not shown. The input units V_j are designed so that

    V_j = \begin{cases} \xi_j & \text{if no output } O_i \text{ is on;} \\ \xi_j \wedge \sum_i w_{ji} O_i & \text{otherwise} \end{cases}        (9.17)

where ∧ means logical AND. For technical reasons, and to allow some other functions that we omit, this is done using an auxiliary unit A, which is on (A = 1) if any input is on but no output is, and off (A = 0) otherwise. A could be a 0/1 threshold unit with total input Σ_j ξ_j − N Σ_i O_i and threshold 0.5, as suggested by the connection weights shown in the figure. The input units receive total input

    h_j = \xi_j + \sum_i w_{ji} O_i + A        (9.18)

and fire (V_j = 1) if this exceeds a threshold of 1.5. This is equivalent to (9.17), and is referred to as the 2/3 rule; two out of its three inputs ξ_j, Σ_i w_{ji} O_i, and A must be on for a unit V_j to fire.

The disabling or "reset" signal is generated when r from (9.16) is less than ρ. This can be accomplished with a 0/1 threshold unit R that receives input ρ Σ_j ξ_j − Σ_j V_j and has threshold 0, as shown by the connection weights in the figure. Finally the backward weights are updated slowly according to

    \frac{dw_{ji}}{dt} = \eta\, O_i\, (V_j - w_{ji})        (9.19)


so that the prototype vector for the winner i* becomes equal to the masked input V_j after resonance has occurred. The forward weights have a slightly more complicated learning rule which leads to a similar, but normalized, result.

Given the details that we have omitted, this network runs entirely autonomously. It does not need any external sequencing or control signals, and can cope stably with an infinite stream of input data. It has fast access of well-known categories, automatic search for less-known categories, creation of new categories when necessary, and refusal to respond to unknown inputs when its capacity is exhausted. And its architecture is entirely parallel.

In practice ART1 networks are somewhat tricky to adjust, and are very sensitive to noise in the input data. If random bits are sometimes missing from the input patterns, then the stored prototype vectors can be gradually degraded by the one-way masking form of the adaptation rule. ART1 networks are also rather inefficient in their storage requirements; we need one unit and 2N modifiable connections for each category, and many fixed connections for the A unit, R unit, and lateral inhibition. They also share the limitations of a grandmothering approach common to most competitive learning schemes (see page 217). Some of these problems are solved in the ART2 network.

9.4 Feature Mapping

Up to now we have paid little attention to the geometrical arrangement of our competitive output units. If however we place them in a line or a plane, we may consider networks in which the location of the winning output unit conveys some information, with nearby outputs corresponding to nearby input patterns. If ξ¹ and ξ² are two input vectors, and r¹ and r² are the locations of the corresponding winning outputs, then r¹ and r² should get closer and closer, eventually coinciding, as ξ¹ and ξ² are made more and more similar. Additionally we should not find r¹ = r² unless the patterns ξ¹ and ξ² are similar. A network that performs such a mapping is called a feature map.

Technically what we are asking for is a topology preserving map from the space of possible inputs to the line or plane of the output units. A topology preserving map, or topographic map, is essentially a mapping that preserves neighborhood relations. The idea may seem almost trivial, but it is not. This is best seen by realizing that such a map is not necessarily possible between a given pair of spaces. For example we cannot map a circle onto a line without a discontinuity somewhere.

We have not yet specified the nature of the inputs, beyond assuming a well-defined distance (metric) between pairs of input patterns. There are actually two cases commonly considered, as exemplified by the two architectures of Fig. 9.8. In both cases there is a geometrical array of outputs, which we show as two-dimensional, but the form of the inputs is quite different.


FIGURE 9.8 Two types of feature mapping architecture. (a) is the conventional one, with continuous-valued inputs ξ₁, ξ₂. (b) is of biological interest in mapping from, e.g., retina to cortex. Layers are fully connected, though only a few connections are shown.

In the first case there are a few continuous-valued inputs, such as ξ₁ and ξ₂ shown in Fig. 9.8(a). We expect a map onto the output space (x, y) from the two-dimensional region of (ξ₁, ξ₂) space in which inputs occur; this might or might not be a square domain. Another natural situation in this class would be a single input ξ₁ mapped to a linear array (line) of outputs. We could also have, for instance, three inputs ξ₁–ξ₃ mapped topologically onto a two-dimensional plane if the actual input patterns all fell into a two-dimensional subspace of the three-dimensional ξ space.

The second case, Fig. 9.8(b), is studied less often but has more biological importance. The inputs ξ_{ij} themselves are arranged in a two-dimensional array that defines the input space (i, j). In the simplest case the input patterns have only one input turned on at a time, thus defining a single point in this space. The problem is to learn a continuous mapping from (i, j) onto (x, y), roughly between points "above" one another in the two spaces. The significance of the problem is the frequent occurrence of such topologically correct mappings in the brain, including connections from the sense organs (eye, ear, skin) to the cortex, and connections between different areas of the cortex. The retinotopic map from the retina to the visual cortex (in higher vertebrates) or optic tectum (in lower vertebrates) is the most studied example of a two-dimensional map, though the somatosensory map from the skin onto the somatosensory cortex (where there is an image of the body surface) is also important. The tonotopic map from the ear onto the auditory cortex forms a one-dimensional example, with sounds of different frequencies laid out in order on the cortex. In each of these cases it is not very likely that the axon routes and synapses are entirely preprogrammed or hardwired by the genes, so one needs a mechanism for creating the appropriate topographic maps during development. Unsupervised learning is one approach (among several) to solving this problem by softwiring.


FIGURE 9.9 The "Mexican hat" form of lateral connection weights.

There are a number o f ways to design an unsupervised learning network that self-organizes into a feature map: 1. For the second case just described, limited receptive fields for the output layer can be put in by hand. Unsupervised learning can then refine the map­ ping, as we saw in Linsker’s model o f orientational selectivity (page 211). However this begs the question for the softwiring problem itself. 2. We can use ordinary competitive learning with the addition o f lateral con­ nections within the output layer. These need to have the Mexican hat form, excitatory between nearby units and inhibitory at longer range, with strength falling off with distance, as in Fig. 9.9. Thus neighboring units gain neighbor­ ing responses, while units further away find select different patterns. 3. We can use ordinary competitive learning, but update weights going to the neighbors o f the winning unit as well as those o f the winning unit itself. This is K o h o n e n ’ s a lg o r ith m . We briefly discuss two examples o f (2), and then focus on (3).

Willshaw and von der Malsburg's Model

An early model using lateral connections was proposed by Willshaw and von der Malsburg [1976] for the retinotopic map problem. They used the architecture of Fig. 9.8(b), but added lateral connections of Mexican hat form within the output layer. Their output units were not strictly winner-take-all, though a threshold ensured that only a few fired at a time. For learning they employed a plain Hebbian rule followed by renormalization of the weights.

To train their network Willshaw and von der Malsburg used dipole stimuli with two adjacent inputs turned on. Learning then converged on an ordered topographic map, such as that shown in Fig. 9.10. Note that this is a diagram of the output weights plotted in the input space. Specifically, each output unit O_i is represented by the intersection between lines that connect it to its neighbors, and is plotted at the point Σ_k w_{ik} r_k, where r_k is the location of input ξ_k in the input space. This gives the center of mass of the effective receptive field of unit O_i.

The mapping could converge on any of eight possible orientations; corners ABCD of the input array might map, for instance, onto BADC of the output.


FIGURE 9.10 A mapping between two 6 × 6 arrays produced by Willshaw and von der Malsburg's network. The outer square represents the input array and the locations of the line intersections represent the output weights. The polarity markers are labelled a–d.

In order to select a particular orientation (as found biologically), Willshaw and von der Malsburg chose four of the inputs to be polarity markers, with initial strong connections to the correct places in the output array. The polarity markers acted as nucleation regions to break the initial symmetry. Another approach would be to pin certain units at the boundary of the desired mapping [Cottrell and Fort, 1986].

Amari [1980] and Takeuchi and Amari [1979] analyzed a modified version of Willshaw and von der Malsburg's model using continuous one-dimensional lines for both input and output. They found that a correct feature map will form provided the width of the Mexican hat function is large enough compared to the width of typical input patterns. If this condition is not satisfied they find that the mapping develops discontinuous jumps.

It is not now believed that this unsupervised learning mechanism is solely responsible for the retinotopic maps found biologically. In particular it cannot reproduce the experimental phenomena of regeneration after damage. Current theories involve some degree of chemoaffinity, in which growing axons carry chemical markers that help to define appropriate target sites. However a degree of softwiring by unsupervised learning is often invoked to refine the map.

von der Malsburg's Model for Orientation Column Formation

Somewhat earlier, von der Malsburg [1973] used a very similar scheme to model the formation of the orientation columns in the visual cortex described in Section 8.4. There is no longer a simple mapping between position in the input space and position in the output layer. Instead, a mapping develops from the angle of orientation of an input object to position in the output layer. The preferred orientation of the output units changes smoothly (with occasional breaks) across the array, as found in nature.

It is important to notice that the network itself is not essentially different from the one which makes the position-to-position map; the difference lies in the set of input images it is trained on. Whereas the Willshaw-von der Malsburg model used spatially localized dipole patterns, the von der Malsburg orientation column model used elongated bar-like patterns, always centered at the middle of the image and presented in various orientations. The very different results illustrate dramatically the influence of experience on the properties of neural networks.


Kohonen's Algorithm

Kohonen's algorithm takes a computational shortcut to achieve the effect accomplished by the Mexican hat lateral interactions. There are no lateral connections, but the weight update rule is modified to involve neighborhood relations in the output array.

The algorithm is normally applied to the architecture of Fig. 9.8(a). Thus there are N continuous-valued inputs, ξ₁ to ξ_N, defining a point ξ in an N-dimensional real space. The output units O_i are arranged in an array (generally one- or two-dimensional), and are fully connected via w_{ij} to the inputs. A competitive learning rule is used, choosing the winner i* as the output unit with weight vector closest to the current input ξ:

    \big|\mathbf{w}_{i^*} - \boldsymbol{\xi}\big| \le \big|\mathbf{w}_i - \boldsymbol{\xi}\big| \quad \text{(for all } i\text{).}        (9.20)

As remarked before, this cannot be done by a linear network unless the weights are normalized, but the algorithm is generally used computationally rather than with an actual network. Thus far we have exactly the same design as for, e.g., vector quantization. The difference is in the learning rule, which is [Kohonen, 1982, 1989]

    \Delta w_{ij} = \eta\, \Lambda(i, i^*)\, (\xi_j - w_{ij})        (9.21)

for all i and j. The neighborhood function Λ(i, i*) is 1 for i = i* and falls off with distance |r_i − r_{i*}| between units i and i* in the output array. Thus units close to the winner, as well as the winner i* itself, have their weights changed appreciably, while those further away, where Λ(i, i*) is small, experience little effect. It is here that the topological information is supplied; nearby units receive similar updates and thus end up responding to nearby input patterns.

The rule (9.21) drags the weight vector w_{i*} belonging to the winner towards ξ. But it also drags the w_i's of the closest units along with it. Therefore we can think of a sort of elastic net in input space that wants to come as close as possible to the inputs; the net has the topology of the output array (i.e., a line or a plane) and the points of the net have the weights as coordinates. This picture is particularly useful to keep in mind when interpreting Figs. 9.10–9.13.

To construct a practical algorithm we have to specify Λ(i, i*) and η. It turns out to be useful (and often essential for convergence) to change these dynamically during learning. We start with a wide range for Λ(i, i*) and a large η, and then reduce both the range of Λ(i, i*) and the value of η gradually as learning proceeds. This allows the network to organize its elastic net rapidly, and then refine it slowly with respect to the input patterns. Making η go to 0 eventually stops the learning process. A typical choice for Λ(i, i*) is

    \Lambda(i, i^*) = \exp\big( -|\mathbf{r}_i - \mathbf{r}_{i^*}|^2 / 2\sigma^2 \big)
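A minimal sketch of the algorithm for a one-dimensional chain of output units follows; the particular schedules chosen here for η and for the width σ of Λ are illustrative, not prescriptions from the text:

```python
import numpy as np

def kohonen_1d(X, n_units=20, n_steps=5000, seed=0):
    """Fit a 1-d chain of units to data X (shape P x N) with Kohonen's rule (9.21)."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    W = rng.uniform(X.min(), X.max(), size=(n_units, N))
    positions = np.arange(n_units)                   # r_i: positions in the output array
    for t in range(n_steps):
        frac = t / n_steps
        eta = 0.5 * (1.0 - frac) + 0.01              # learning rate decreased during learning
        sigma = n_units / 2.0 * (1.0 - frac) + 1.0   # neighborhood width shrinks as well
        x = X[rng.integers(len(X))]
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        Lambda = np.exp(-((positions - winner) ** 2) / (2.0 * sigma ** 2))
        W += eta * Lambda[:, None] * (x - W)         # rule (9.21): neighbors move too
    return W

# Example: the chain of weight vectors arranges itself along a square of 2-d inputs,
# tracing out the "elastic net" picture described above.
rng = np.random.default_rng(1)
W = kohonen_1d(rng.uniform(0.0, 1.0, size=(2000, 2)))
```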

    -\sum_\mu h^\mu \sum_i \xi_i^\mu S_i , \qquad \text{one for each pattern } \mu .        (10.3)

We will set all the strengths h^μ to zero later, after these fields have served their purpose. The partition function (10.1) is now

    Z = e^{-\beta p/2}\, \mathrm{Tr}_S \exp\Big[ \frac{\beta}{2N} \sum_\mu \Big( \sum_i \xi_i^\mu S_i \Big)^2 + \beta \sum_\mu h^\mu \sum_i \xi_i^\mu S_i \Big] .        (10.4)

This would be easy to evaluate if both the terms in the exponent were linear in the S_i's. Then the trace would simply factor into a product of N independent terms, one for each i, every term being a simple sum over S_i = +1 and S_i = −1.

*Usually the trace of an operator (or matrix) means the sum of all the diagonal elements. This way of using it originates in quantum statistics.


Unfortunately the first term is quadratic. But we can use the "Gaussian integral trick" to make it linear, at the expense of some other complications. This trick exploits the identity

    \sqrt{\frac{a}{\pi}} \int_{-\infty}^{\infty} dm\; e^{-a m^2 + b m} = e^{b^2/4a} ,        (10.5)

which can be used to turn an exponential in b² (on the right) into an exponential linear in b (on the left). The cost, of course, is the introduction of the auxiliary variable m, and the integral over it. Our price is actually p times higher, because (10.4) contains p quadratic terms, one for each μ. So we introduce p auxiliary variables m^μ, and take a = βN/2 and b^μ = β Σ_i ξ_i^μ S_i, giving

    Z = e^{-\beta p/2} \Big( \frac{\beta N}{2\pi} \Big)^{p/2} \mathrm{Tr}_S \int \prod_\mu dm^\mu \, \exp\Big[ \sum_\mu \Big( -\tfrac{1}{2}\beta N (m^\mu)^2 + \beta (m^\mu + h^\mu) \sum_j \xi_j^\mu S_j \Big) \Big] .        (10.6)

Now let us adopt a shorthand vector notation, taking m, h, and ξ_i to be p-component vectors with components m^μ, h^μ, and ξ_i^μ respectively. Then (10.6) becomes

    Z = e^{-\beta p/2} \Big( \frac{\beta N}{2\pi} \Big)^{p/2} \int d^p m\; e^{-\frac{1}{2}\beta N \mathbf{m}^2} \prod_i \mathrm{Tr}_{S_i}\, e^{\beta(\mathbf{m}+\mathbf{h})\cdot\boldsymbol{\xi}_i S_i} .        (10.7)

The trace is now easy because the exponent is linear in S_i. Using e^x + e^{−x} = 2 cosh x, we obtain after a little reorganization

    Z = \Big( \frac{\beta N}{2\pi} \Big)^{p/2} \int d^p m\; e^{-N\beta f(\beta, \mathbf{m})}        (10.8)

with

    f(\beta, \mathbf{m}) = \tfrac{1}{2}\alpha + \tfrac{1}{2}\mathbf{m}^2 - \frac{1}{\beta N} \sum_i \log\big( 2\cosh[\beta(\mathbf{m}+\mathbf{h})\cdot\boldsymbol{\xi}_i] \big) ,        (10.9)

where α ≡ p/N.

We still have a p-fold integral to do, but the fact that the exponent in (10.8) is proportional to N allows us to evaluate it in the limit of large N. The bigger N is, the more the integral is dominated by contributions from the region where f is smallest. So we can approximate it by finding the value of m which minimizes f, and expanding the integrand around there. This is called the saddle-point method, and is best understood through a simple example. Suppose that we had a one-dimensional integral of the form

    I = \sqrt{N} \int dx\; e^{-N g(x)} .        (10.10)

Then expanding the exponent around the point x₀ where g(x) is minimized we get

    I = \sqrt{N} \int dx\; \exp\big( -N[\, g(x_0) + \tfrac{1}{2} g''(x_0)(x - x_0)^2 + \cdots ] \big)        (10.11)


using g′(x₀) = 0. If we truncate the expansion at this point, the integral is just a Gaussian one, so

    I = \sqrt{\frac{2\pi}{g''(x_0)}}\; e^{-N g(x_0)} .        (10.12)

For large N this result is dominated by the exponential factor, as can be clearly seen by putting it in the form

    -\frac{1}{N}\log I = g(x_0) + \frac{1}{2N}\big( \log g''(x_0) - \log 2\pi \big) \;\longrightarrow\; g(x_0) .        (10.13)
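The quality of this approximation is easy to check numerically; in the sketch below g(x) is an arbitrary quadratic-plus-constant chosen purely for illustration:

```python
import numpy as np

def laplace_check(N, x0=1.0, gpp=1.0):
    """Compare -(1/N) log I with the expansion (10.13) for a toy g(x)."""
    g = lambda x: 0.3 + 0.5 * gpp * (x - x0) ** 2        # minimum g(x0) = 0.3, g''(x0) = gpp
    x = np.linspace(x0 - 10.0, x0 + 10.0, 200001)
    dx = x[1] - x[0]
    integral = np.exp(-N * (g(x) - g(x0))).sum() * dx    # e^{-N g(x0)} factored out for stability
    exact = g(x0) - np.log(np.sqrt(N) * integral) / N    # = -(1/N) log I, restoring the factor
    expansion = g(x0) + (np.log(gpp) - np.log(2 * np.pi)) / (2 * N)
    return exact, expansion

for N in (10, 100, 1000):
    print(N, laplace_check(N))
```

As N grows, both columns converge to g(x₀) = 0.3, and the 1/(2N) correction in (10.13) accounts for essentially all of the remaining difference.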

Thus all we need to do is to find x₀; this is often called the saddle point, from the behavior in the complex x plane. For (10.8) we use a p-dimensional version of the same idea, thereby obtaining

    -\frac{1}{N}\log Z = \beta \min_{\mathbf{m}} f(\beta, \mathbf{m})        (10.14)

in the N → ∞ limit. Comparing this with (A.9) we see that

    F/N = \min_{\mathbf{m}} f(\beta, \mathbf{m})        (10.15)

where F is the free energy, so min_m f(β, m) gives us the free energy per unit. We now have to minimize f(β, m), which requires ∂f/∂m^μ = 0, or

    \mathbf{m} = \frac{1}{N} \sum_i \boldsymbol{\xi}_i \tanh\big[ \beta(\mathbf{m}+\mathbf{h})\cdot\boldsymbol{\xi}_i \big] .        (10.16)

Note that this is a set of p nonlinear simultaneous equations for the p unknowns m^μ. These equations appear to depend on the random patterns ξ_i, but in fact the system is self-averaging; we can replace the average over units by an average over patterns at any one site, yielding

    m^\mu = \big\langle\!\big\langle\, \xi^\mu \tanh\big[ \beta(\mathbf{m}+\mathbf{h})\cdot\boldsymbol{\xi} \big] \,\big\rangle\!\big\rangle        (10.17)

where ⟨⟨···⟩⟩ indicates an average over the random distribution of ξ patterns. Similarly (10.9) becomes (with α → 0)

    f = \tfrac{1}{2}\mathbf{m}^2 - \beta^{-1} \big\langle\!\big\langle \log\big( 2\cosh[\beta(\mathbf{m}+\mathbf{h})\cdot\boldsymbol{\xi}] \big) \big\rangle\!\big\rangle .        (10.18)
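For a single condensed pattern and h = 0, (10.17) reduces to the scalar equation m = tanh(βm), which can be solved by simple fixed-point iteration. The sketch below (illustrative; the temperatures and iteration count are arbitrary) shows the retrieval overlap appearing below T = 1:

```python
import numpy as np

def retrieval_overlap(T, iters=2000):
    """Solve m = tanh(m/T) by fixed-point iteration; this scalar equation follows
    from (10.17) for a single condensed pattern with h = 0."""
    m = 1.0                       # start from the fully retrieved state
    for _ in range(iters):
        m = np.tanh(m / T)
    return m

for T in (0.5, 0.9, 0.99, 1.1):
    print(T, round(float(retrieval_overlap(T)), 4))
# Below T = 1 a nonzero overlap with the stored pattern survives (retrieval);
# above T = 1 the iteration collapses to m = 0.
```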

It is easy to see how the self-averaging property arises. As we go from unit to unit in the sum on i in (10.9) or (10.16), we are choosing N independent ξ_i's from the distribution P(ξ), which we take to be uniform over the 2^p possibilities. So if N


is large compared to 2^p the average over sites is equivalent to an average over the distribution. This requires p ≪ log₂ N, which we assume.

Let us consider the contribution of the last term in the integrand of (10.34) for a particular μ > s, one of the small m^μ's:

    \Big\langle\!\Big\langle \prod_\rho \exp\Big( \beta\, m_\rho^\mu \sum_i \xi_i^\mu S_i^\rho \Big) \Big\rangle\!\Big\rangle
      = \prod_i \Big\langle\!\Big\langle \exp\Big( \beta\, \xi_i^\mu \sum_\rho m_\rho^\mu S_i^\rho \Big) \Big\rangle\!\Big\rangle
      = \prod_i \cosh\Big( \beta \sum_\rho m_\rho^\mu S_i^\rho \Big)

      = \exp\Big[ \sum_i \log\cosh\Big( \beta \sum_\rho m_\rho^\mu S_i^\rho \Big) \Big]
      \approx \exp\Big[ \sum_i \tfrac{1}{2}\beta^2 \Big( \sum_\rho m_\rho^\mu S_i^\rho \Big)^2 \Big]
      = \exp\Big( \tfrac{1}{2}\beta^2 \sum_{\rho\sigma} \sum_i S_i^\rho S_i^\sigma\, m_\rho^\mu m_\sigma^\mu \Big)        (10.35)

using log cosh x ≈ x²/2, which is valid because these m_ρ^μ are small.

Defining q_{ρσ} = (1/N) Σ_i S_i^ρ S_i^σ for ρ ≠ σ and A_{ρσ} = (1 − β)δ_{ρσ} − β q_{ρσ}, we can thus write the last term (for each μ > s) in (10.34) as²

    \exp\Big( -\tfrac{1}{2}\beta N \sum_{\rho\sigma} A_{\rho\sigma}\, m_\rho^\mu m_\sigma^\mu \Big) .        (10.37)

²Strictly speaking: the saddle-point values of the m_ρ^μ's. That is, we will again evaluate the multidimensional integral by the saddle-point method, and the values of the m_ρ^μ that will matter will be those at the saddle point.


This leaves us with an n-dimensional Gaussian integral, giving

    \int \prod_\rho dm_\rho^\mu\; \exp\Big( -\tfrac{1}{2}\beta N \sum_{\rho\sigma} A_{\rho\sigma}\, m_\rho^\mu m_\sigma^\mu \Big) = \Big( \frac{2\pi}{\beta N} \Big)^{n/2} (\det A)^{-1/2} .        (10.38)

We get a contribution exactly like this for every value of μ greater than s (about p in all, since p ≫ s), giving an overall factor

    (\det A)^{-p/2} = \exp\big( -\tfrac{1}{2} p \log\det A \big) = \exp\Big( -\tfrac{1}{2} p \log \prod_\rho \lambda_\rho \Big)        (10.39)

where λ_ρ are the eigenvalues of A. The extra complications we encounter for α > 0 all come from this factor (10.39), which, together with the other parts of (10.34), now has to be summed over all the S_i^ρ. Unfortunately the S-dependence is buried in the eigenvalues λ_ρ, and the trace is far from easy. So now we use some more auxiliary variable tricks. First let us define a generalized version of A_{ρσ}:

    \tilde{A}_{\rho\sigma} = (1-\beta)\,\delta_{\rho\sigma} - \beta\, \tilde{q}_{\rho\sigma} .        (10.40)

This is equal to A_{ρσ} if

    \tilde{q}_{\rho\sigma} = \begin{cases} q_{\rho\sigma} & \text{for } \rho \neq \sigma; \\ 0 & \text{otherwise.} \end{cases}        (10.41)

Thus we can write any function G{λ_ρ} of the eigenvalues of A in the form

    G\{\lambda_\rho\} = \int \prod_{\rho\neq\sigma} d\tilde{q}_{\rho\sigma}\; G\{\tilde{\lambda}_\rho\} \prod_{\rho\neq\sigma} \delta\Big( \tilde{q}_{\rho\sigma} - \frac{1}{N}\sum_i S_i^\rho S_i^\sigma \Big)        (10.42)

using a Dirac delta function, where the λ̃_ρ's are the eigenvalues of Ã, and are functions of the q̃_{ρσ}'s.

If the system is at any temperature T > 0, its state α will vary with time, and quantities such as H_α that depend on the state will themselves fluctuate. Although there must be some driving mechanism for these fluctuations, part of the idea of temperature involves ignoring this and treating them as random; thermal energy is by definition disordered. When a system is first prepared, or after a change of parameters, the fluctuations may have on average a definite direction, such as that of decreasing energy H_α. But after a while any such trend ceases and the system just fluctuates around constant average values. Then we say that it is in thermal equilibrium.

A fundamental result from physics tells us that in thermal equilibrium each of the possible states α occurs with probability

    p_\alpha = \frac{1}{Z}\, e^{-H_\alpha / k_B T}        (A.1)


(A.2)

is called the p a r t it io n fu n c tio n , k s is B o ltz m a n n ’ s c o n s ta n t, with value 1.38 x 1 0 "16 erg /K . Equation ( A .l) is called the Boltzmann-Gibbs distribution. We will not at­ tempt to justify it here. It is usually derived from very general assumptions about microscopic dynamics, but can also be interpreted from the viewpoint o f informa­ tion theory. It usually works well, but can fail when a system does not explore all o f its possible states. In applications to neural networks, the “temperature” o f a stochastic network is o f course not related to the physical temperature, but is simply a parameter controlling the update rule. So its scale is irrelevant, and we can choose to measure it in units such that k s = 1. We have implicitly done this in all the statistical mechanics calculations in this book, simply omitting the k s factor, as we do in the rest o f this Appendix. If we know the energy function Ha for a system we can in principle use ( A .l) to calculate the probability o f finding it in each o f its states a . Then we can compute the average value (A ) o f any quantity A that has a particular value A a in each state, using (A ) = J 2 A a Pa . a

(A.3)

This is called a th e r m a l a v era g e, and for a large system usually corresponds closely to the measured value o f A . This is how we can compute useful quantities for prediction or comparison with experiment. In practice the hardest part is calculating the partition function Z . On the other hand,once we have done so, we can usually compute the desired averages (A) from Z itself, and do not have to evaluate summations like (A .3) explicitly. Let us see how this works for the example o f a Hopfield network, with the energy function H { S } = - ^ E wHSiSi - E biSi • ij i

(A -4)

Here we have added explicit bias terms 6; for the purposes o f the present calculation; we can set 6,* = 0 later if we so desire. Note that H is a function o f all the activations £,•; the set o f all 5,*’s plays the role o f the generic state label a used above. The partition function is a sum over all the possible states, all the 2N com bi­ nations o f Si = ± 1 :

z -E• •Eexp(^Eu;,'-’+/?E6,'5»') 5 i= ± l

1

ij

55

*

(A -5)

277

A.2 Free Energy and Entropy

where fi = 1/T. The sum over all states is called a t r a c e and instead o f writing out all the summations it is com m on to write (A .5) as Z = Trs exp

^

WijSiSj + / ? ^ 6,-S

Thus e~ F! T/Z (which is just 1) can be thought o f as the sum o f the probabilities o f the individual states. The significance o f this is that it still applies if the sum over all a ’s is replaced by a restricted sum over a subset o f them; then e~ F lT/Z gives the probability o f finding the system in that subset, where F* is the restricted free energy. This is the approach used in Section 7.1.

A.3 Stochastic Dynamics In some cases we have detailed knowledge o f the stochastic sequence o f states ot\, For example in the Hopfield network we used the update rule (2.48) P ,ob (S , = ± 1 ) = / , ( * * , ) =

( A .19)

where hi was the net input Y lj wijS j + &,• to unit i. This can be rewritten as a tr a n s itio n p r o b a b ilit y to flip unit i from Si to —Si

t+ J

(