Bidirectional Retrieval from Associative Memory
Sommer, Friedrich, Palm, Günther
Similarity based fault tolerant retrieval in neural associative mem(cid:173) ories (N AM) has not lead to wiedespread applications. A draw(cid:173) back of the efficient Willshaw model for sparse patterns [Ste61, WBLH69], is that the high asymptotic information capacity is of little practical use because of high cross talk noise arising in the retrieval for finite sizes. Here a new bidirectional iterative retrieval method for the Willshaw model is presented, called crosswise bidi(cid:173) rectional (CB) retrieval, providing enhanced performance. We dis(cid:173) cuss its asymptotic capacity limit, analyze the first step, and com(cid:173) pare it in experiments with the Willshaw model. Applying the very efficient CB memory model either in information retrieval systems or as a functional model for reciprocal cortico-cortical pathways requires more than robustness against random noise in the input: Our experiments show also the segmentation ability of CB-retrieval with addresses containing the superposition of pattens, provided even at high memory load.
Cholinergic Modulation Preserves Spike Timing Under Physiologically Realistic Fluctuating Input
Tang, Akaysha, Bartels, Andreas, Sejnowski, Terrence J.
Neuromodulation can change not only the mean firing rate of a neuron, but also its pattern of firing . Therefore, a reliable neu(cid:173) ral coding scheme, whether a rate coding or a spike time based coding, must be robust in a dynamic neuromodulatory environ(cid:173) ment. The common observation that cholinergic modulation leads to a reduction in spike frequency adaptation implies a modifica(cid:173) tion of spike timing, which would make a neural code based on precise spike timing difficult to maintain. In this paper, the effects of cholinergic modulation were studied to test the hypothesis that precise spike timing can serve as a reliable neural code. Using the whole cell patch-clamp technique in rat neocortical slice prepara(cid:173) tion and compartmental modeling techniques, we show that cholin(cid:173) ergic modulation, surprisingly, preserved spike timing in response to a fluctuating inputs that resembles in vivo conditions. This re(cid:173) sult suggests that in vivo spike timing may be much more resistant to changes in neuromodulator concentrations than previous physi(cid:173) ological studies have implied. 112 A. C. Tang, A. M. Bartels and T. J. Sejnowski
Analytical Mean Squared Error Curves in Temporal Difference Learning
Singh, Satinder, Dayan, Peter
We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its step(cid:173) size and eligibility trace parameters.
3D Object Recognition: A Model of View-Tuned Neurons
Bricolo, Emanuela, Poggio, Tomaso, Logothetis, Nikos K.
In 1990 Poggio and Edelman proposed a view-based model of ob(cid:173) ject recognition that accounts for several psychophysical properties of certain recognition tasks. The model predicted the existence of view-tuned and view-invariant units, that were later found by Lo(cid:173) gothetis et al. (Logothetis et al., 1995) in IT cortex of monkeys trained with views of specific paperclip objects. The model, how(cid:173) ever, does not specify the inputs to the view-tuned units and their internal organization. In this paper we propose a model of these view-tuned units that is consistent with physiological data from single cell responses.
Learning Bayesian Belief Networks with Neural Network Estimators
Monti, Stefano, Cooper, Gregory
In this paper we propose a method for learning Bayesian belief networks from data. The method uses artificial neural networks as probability estimators, thus avoiding the need for making prior assumptions on the nature of the probability distributions govern(cid:173) ing the relationships among the participating variables. This new method has the potential for being applied to domains containing both discrete and continuous variables arbitrarily distributed. We compare the learning performance of this new method with the performance of the method proposed by Cooper and Herskovits in [7]. The experimental results show that, although the learning scheme based on the use of ANN estimators is slower, the learning accuracy of the two methods is comparable. Category: Algorithms and Architectures.
Approximate Solutions to Optimal Stopping Problems
Tsitsiklis, John, Van Roy, Benjamin
We propose and analyze an algorithm that approximates solutions to the problem of optimal stopping in a discounted irreducible ape(cid:173) riodic Markov chain. The scheme involves the use of linear com(cid:173) binations of fixed basis functions to approximate a Q-function. The weights of the linear combination are incrementally updated through an iterative process similar to Q-Iearning, involving sim(cid:173) ulation of the underlying Markov chain. Due to space limitations, we only provide an overview of a proof of convergence (with prob(cid:173) ability 1) and bounds on the approximation error. This is the first theoretical result that establishes the soundness of a Q-Iearning(cid:173) like algorithm when combined with arbitrary linear function ap(cid:173) proximators to solve a sequential decision problem. Though this paper focuses on the case of finite state spaces, the results extend naturally to continuous and unbounded state spaces, which are ad(cid:173) dressed in a forthcoming full-length paper.
Multidimensional Triangulation and Interpolation for Reinforcement Learning
Davies, Scott
Dynamic Programming, Q-Iearning and other discrete Markov Decision Process solvers can be -applied to continuous d-dimensional state-spaces by quantizing the state space into an array of boxes. This is often problematic above two dimensions: a coarse quantization can lead to poor policies, and fine quantization is too expensive. Possible solutions are variable-resolution discretization, or function approximation by neural nets. A third option, which has been little studied in the reinforcement learning literature, is interpolation on a coarse grid. In this paper we study interpolation tech(cid:173) niques that can result in vast improvements in the online behavior of the resulting control systems: multilinear interpolation, and an interpolation algorithm based on an interesting regular triangulation of d-dimensional space. We adapt these interpolators under three reinforcement learning paradigms: (i) offline value iteration with a known model, (ii) Q-Iearning, and (iii) online value iteration with a previously unknown model learned from data. We describe empirical results, and the resulting implications for practical learning of continuous non-linear dynamic control. 1 GRID-BASED INTERPOLATION TECHNIQUES Reinforcement learning algorithms generate functions that map states to "cost-t • Fine grids may be used in one or two dimensions. Above two dimensions, fine grids are too expensive. Value functions can be discontinuous, which (as we will see) can lead to su boptimalities even with very fine discretization in two dimensions . • Neural nets have been used in conjunction with TD [Sutton, 1988] and Q-Iearning [Watkins, 1989] in very high dimensional spaces [Tesauro, 1991, Crites and Barto, 1996]. While promising, it is not always clear that they produce the accurate value functions that might be needed for fine near(cid:173) optimal control of dynamic systems, and the most commonly used methods of applying value iteration or policy iteration with a neural-net value func(cid:173) tion are often unstable. [Boyan and Moore, 1995].
Viewpoint Invariant Face Recognition using Independent Component Analysis and Attractor Networks
Bartlett, Marian, Sejnowski, Terrence J.
We have explored two approaches to recogmzmg faces across changes in pose. First, we developed a representation of face images based on independent component analysis (ICA) and compared it to a principal component analysis (PCA) representation for face recognition. The ICA basis vectors for this data set were more spatially local than the PCA basis vectors and the ICA representa(cid:173) tion had greater invariance to changes in pose. Second, we present a model for the development of viewpoint invariant responses to faces from visual experience in a biological system. The temporal continuity of natural visual experience was incorporated into an attractor network model by Hebbian learning following a lowpass temporal filter on unit activities. When combined with the tem(cid:173) poral filter, a basic Hebbian update rule became a generalization of Griniasty et al. (1993), which associates temporally proximal input patterns into basins of attraction. The system acquired rep(cid:173) resentations of faces that were largely independent of pose. 1 Independent component representations of faces Important advances in face recognition have employed forms of principal compo(cid:173) nent analysis, which considers only second-order moments of the input (Cottrell & Metcalfe, 1991; Turk & Pentland 1991). Independent component analysis (ICA) is a generalization of principal component analysis (PCA), which decorrelates the higher-order moments of the input (Comon, 1994). In a task such as face recogni(cid:173) tion, much of the important information is contained in the high-order statistics of the images. A representational basis in which the high-order statistics are decorre(cid:173) lated may be more powerful for face recognition than one in which only the second order statistics are decorrelated, as in PCA representations. We compared an ICA(cid:173) based representation to a PCA-based representation for recognizing faces across changes in pose. 818 M. S. Bartlett and T. J. Sejnowski
Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems
Singh, Satinder, Bertsekas, Dimitri
In cellular telephone systems, an important problem is to dynami(cid:173) cally allocate the communication resource (channels) so as to max(cid:173) imize service in a stochastic caller environment. This problem is naturally formulated as a dynamic programming problem and we use a reinforcement learning (RL) method to find dynamic channel allocation policies that are better than previous heuristic solutions. The policies obtained perform well for a broad variety of call traf(cid:173) fic patterns. We present results on a large cellular system with approximately 4949 states. In cellular communication systems, an important problem is to allocate the com(cid:173) munication resource (bandwidth) so as to maximize the service provided to a set of mobile callers whose demand for service changes stochastically. A given geograph(cid:173) ical area is divided into mutually disjoint cells, and each cell serves the calls that are within its boundaries (see Figure 1a). The total system bandwidth is divided into channels, with each channel centered around a frequency. Each channel can be used simultaneously at different cells, provided these cells are sufficiently separated spatially, so that there is no interference between them. The minimum separation distance between simultaneous reuse of the same channel is called the channel reuse constraint. When a call requests service in a given cell either a free channel (one that does not violate the channel reuse constraint) may be assigned to the call, or else the call is blocked from the system; this will happen if no free channel can be found. Also, when a mobile caller crosses from one cell to another, the call is "handed off" to the cell of entry; that is, a new free channel is provided to the call at the new cell. If no such channel is available, the call must be dropped/disconnected from the system. RLfor Dynamic Channel Allocation 975 One objective of a channel allocation policy is to allocate the available channels to calls so that the number of blocked calls is minimized. An additional objective is to minimize the number of calls that are dropped when they are handed off to a busy cell. These two objectives must be weighted appropriately to reflect their relative importance, since dropping existing calls is generally more undesirable than blocking new calls. To illustrate the qualitative nature of the channel assignment decisions, suppose that there are only two channels and three cells arranged in a line. Assume a channel reuse constraint of 2, i.e., a channel may be used simultaneously in cells 1 and 3, but may not be used in channel 2 if it is already used in cell 1 or in cell 3. Suppose that the system is serving one call in cell 1 and another call in cell 3. Then serving both calls on the same channel results in a better channel usage pattern than serving them on different channels, since in the former case the other channel is free to be used in cell 2. The purpose of the channel assignment and channel rearrangement strategy is, roughly speaking, to create such favorable usage patterns that minimize the likelihood of calls being blocked. We formulate the channel assignment problem as a dynamic programming problem, which, however, is too complex to be solved exactly. We introduce approximations based on the methodology of reinforcement learning (RL) (e.g., Barto, Bradtke and Singh, 1995, or the recent textbook by Bertsekas and Tsitsiklis, 1996). Our method learns channel allocation policies that outperform not only the most commonly used policy in cellular systems, but also the best heuristic policy we could find in the literature. 1 CHANNEL ASSIGNMENT POLICIES Many cellular systems are based on a fixed assignment (FA) channel allocation; that is, the set of channels is partitioned, and the partitions are permanently assigned to cells so that all cells can use all the channels assigned to them simultaneously without interference (see Figure 1a). When a call arrives in a cell, if any pre(cid:173) assigned channel is unused; it is assigned, else the call is blocked. No rearrangement is done when a call terminates. Such a policy is static and cannot take advantage of temporary stochastic variations in demand for service. More efficient are dynamic channel allocation policies, which assign channels to different cells, so that every channel is available to every cell on a need basis, unless the channel is used in a nearby cell and the reuse constraint is violated. The best existing dynamic channel allocation policy we found in the literature is Borrowing with Directional Channel Locking (BDCL) of Zhang & Yum (1989). It numbers the channels from 1 to N, partitions and assigns them to cells as in FA. The channels assigned to a cell are its nominal channels. If a nominal channel is available when a call arrives in a cell, the smallest numbered such channel is assigned to the call. If no nominal channel is available, then the largest numbered free channel is borrowed from the neighbour with the most free channels. When a channel is borrowed, careful accounting of the directional effect of which cells can no longer use that channel because of interference is done. The call is blocked if there are no free channels at all. When a call terminates in a cell and the channel so freed is a nominal channel, say numbered i, of that cell, then if there is a call in that cell on a borrowed channel, the call on the smallest numbered borrowed channel is reassigned to i and the borrowed channel is returned to the appropriate cell. If there is no call on a borrowed channel, then if there is a call on a nominal channel numbered larger than i, the call on the highest numbered nominal channel is reassigned to i. If the call just terminated was itself on a borrowed channel, the 976 S. Singh and D. Bertsekas call on the smallest numbered borrowed channel is reassigned to it and that channel is returned to the cell from which it was borrowed. Notice that when a borrowed channel is returned to its original cell, a nominal channel becomes free in that cell and triggers a reassignment. Thus, in the worst case a call termination in one cell can sequentially cause reassignments in arbitrarily far away cells - making BDCL somewhat impractical. BOCL is quite sophisticated and combines the notions of channel-ordering, nominal channels, and channel borrowing. Zhang and Yum (1989) show that BOCL is superior to its competitors, including FA. Generally, BOCL has continued to be highly regarded in the literature as a powerful heuristic (Enrico et.al., 1996) . In this paper, we compare the performance of dynamic channel allocation policies learned by RL with both FA and BOCL. 1.1 DYNAMIC PROGRAMMING FORMULATION We can formulate the dynamic channel allocation problem using dynamic program(cid:173) ming (e.g., Bertsekas, 1995). State transitions occur when channels become free due to call departures, or when a call arrives at a given cell and wishes to be assigned a channel, or when there is a handoff, which can be viewed as a simultaneous call departure from one cell and a call arrival at another cell. The state at each time consists of two components: (1) The list of occupied and unoccupied channels at each cell. We call this the configuration of the cellular system. It is exponential in the number of cells. (2) The event that causes the state transition (arrival, departure, or handoff). This component of the state is uncontrollable. The decision/control applied at the time of a call departure is the rearrangement of the channels in use with the aim of creating a more favorable channel packing pattern among the cells (one that will leave more channels free for future assign(cid:173) ments) . Unlike BDCL, our RL solution will restrict this rearrangement to the cell with the current call departure. The control exercised at the time of a call arrival is the assignment of a free channel, or the blocking of the call if no free channel is currently available. In general, it may also be useful to do admission control, i.e., to allow the possibility of not accepting a new call even when there exists a free channel to minimize the dropping of ongoing calls during handoff in the future. We address admission control in a separate paper and here restrict ourselves to always accepting a call if a free channel is available. The objective is to learn a policy that assigns decisions (assignment or rearrangement depending on event) to each state so as to maximize J = E {lCO e- f3t e(t)dt} , where E{-} is the expectation operator, e(t) is the number of ongoing calls at time t, and j3 is a discount factor that makes immediate profit more valuable than future profit. Maximizing J is equivalent to minimizing the expected (discounted) number of blocked calls over an infinite horizon. 2 REINFORCEMENT LEARNING SOLUTION RL methods solve optimal control (or dynamic programming) problems by learning good approximations to the optimal value function, J*, given by the solution to RLfor Dynamic Channel Allocation 977 the Bellman optimality equation which takes the following form for the dynamic channel allocation problem:
Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient
Amari, Shun-ichi
The parameter space of neural networks has a Riemannian met(cid:173) ric structure. The natural Riemannian gradient should be used instead of the conventional gradient, since the former denotes the true steepest descent direction of a loss function in the Riemannian space. The behavior of the stochastic gradient learning algorithm is much more effective if the natural gradient is used. The present paper studies the information-geometrical structure of perceptrons and other networks, and prove that the on-line learning method based on the natural gradient is asymptotically as efficient as the optimal batch algorithm. Adaptive modification of the learning constant is proposed and analyzed in terms of the Riemannian mea(cid:173) sure and is shown to be efficient. The natural gradient is finally applied to blind separation of mixtured independent signal sources. 1
Extraction of Temporal Features in the Electrosensory System of Weakly Electric Fish
Gabbiani, Fabrizio, Metzner, Walter, Wessel, Ralf, Koch, Christof
The encoding of random time-varying stimuli in single spike trains of electrosensory neurons in the weakly electric fish Eigenmannia was investigated using methods of statistical signal processing. At the first stage of the electrosensory system, spike trains were found to encode faithfully the detailed time-course of random stimuli, while at the second stage neurons responded specifically to features in the temporal waveform of the stimulus. Therefore stimulus infor(cid:173) mation is processed at the second stage of the electrosensory system by extracting temporal features from the faithfully preserved image of the environment sampled at the first stage.
A Model of Recurrent Interactions in Primary Visual Cortex
Todorov, Emanuel, Siapas, Athanassios, Somers, David
A general feature of the cerebral cortex is its massive intercon(cid:173) nectivity - it has been estimated anatomically [19] that cortical neurons receive upwards of 5,000 synapses, the majority of which originate from other nearby cortical neurons. Numerous experi(cid:173) ments in primary visual cortex (VI) have revealed strongly nonlin(cid:173) ear interactions between stimulus elements which activate classical and non-classical receptive field regions. Recurrent cortical con(cid:173) nections likely contribute substantially to these effects. However, most theories of visual processing have either assumed a feedfor(cid:173) ward processing scheme [7], or have used recurrent interactions to account for isolated effects only [1, 16, 18]. Since nonlinear sys(cid:173) tems cannot in general be taken apart and analyzed in pieces, it is not clear what one learns by building a recurrent model that only accounts for one, or very few phenomena. Here we develop a relatively simple model of recurrent interactions in VI, that re(cid:173) flects major anatomical and physiological features of intracortical connectivity, and simultaneously accounts for a wide range of phe(cid:173) nomena observed physiologically. All phenomena we address are strongly nonlinear, and cannot be explained by linear feedforward models.
VLSI Implementation of Cortical Visual Motion Detection Using an Analog Neural Computer
Etienne-Cummings, Ralph, Van der Spiegel, Jan, Takahashi, Naomi, Apsel, Alyssa, Mueller, Paul
Two dimensional image motion detection neural networks have been implemented using a general purpose analog neural computer. The neural circuits perform spatiotemporal feature extraction based on the cortical motion detection model of Adelson and Bergen. The neural computer provides the neurons, synapses and synaptic time-constants required to realize the model in VLSI hardware. Results show that visual motion estimation can be implemented with simple sum-and(cid:173) threshold neural hardware with temporal computational capabilities. The neural circuits compute general 20 visual motion in real-time.
Local Bandit Approximation for Optimal Learning Problems
Duff, Michael, Barto, Andrew
In general, procedures for determining Bayes-optimal adaptive controls for Markov decision processes (MDP's) require a pro(cid:173) hibitive amount of computation-the optimal learning problem is intractable. This paper proposes an approximate approach in which bandit processes are used to model, in a certain "local" sense, a given MDP. Bandit processes constitute an important subclass of MDP's, and have optimal learning strategies (defined in terms of Gittins indices) that can be computed relatively efficiently. Thus, one scheme for achieving approximately-optimal learning for gen(cid:173) eral MDP's proceeds by taking actions suggested by strategies that are optimal with respect to local bandit models.
Interpreting Images by Propagating Bayesian Beliefs
Weiss, Yair
A central theme of computational vision research has been the re(cid:173) alization that reliable estimation of local scene properties requires propagating measurements across the image. Many authors have therefore suggested solving vision problems using architectures of locally connected units updating their activity in parallel. Unfor(cid:173) tunately, the convergence of traditional relaxation methods on such architectures has proven to be excruciatingly slow and in general they do not guarantee that the stable point will be a global mini(cid:173) mum. In this paper we show that an architecture in which Bayesian Be(cid:173) liefs about image properties are propagated between neighboring units yields convergence times which are several orders of magni(cid:173) tude faster than traditional methods and avoids local minima. In particular our architecture is non-iterative in the sense of Marr [5]: at every time step, the local estimates at a given location are op(cid:173) timal given the information which has already been propagated to that location. We illustrate the algorithm's performance on real images and compare it to several existing methods.
Why did TD-Gammon Work?
Pollack, Jordan, Blair, Alan
Although TD-Gammon is one of the major successes in machine learn(cid:173) ing, it has not led to similar impressive breakthroughs in temporal dif(cid:173) ference learning for other applications or even other games. We were able to replicate some of the success of TD-Gammon, developing a competitive evaluation function on a 4000 parameter feed-forward neu(cid:173) ral network, without using back-propagation, reinforcement or temporal difference learning methods. Instead we apply simple hill-climbing in a relative fitness environment. These results and further analysis suggest that the surprising success of Tesauro's program had more to do with the co-evolutionary structure of the learning task and the dynamics of the backgammon game itself.
Sequential Tracking in Pricing Financial Options using Model Based and Neural Network Approaches
Niranjan, Mahesan
This paper shows how the prices of option contracts traded in finan(cid:173) cial markets can be tracked sequentially by means of the Extended Kalman Filter algorithm. I consider call and put option pairs with identical strike price and time of maturity as a two output nonlin(cid:173) ear system. The Black-Scholes approach popular in Finance liter(cid:173) ature and the Radial Basis Functions neural network are used in modelling the nonlinear system generating these observations. I show how both these systems may be identified recursively using the EKF algorithm. I present results of simulations on some FTSE 100 Index options data and discuss the implications of viewing the pricing problem in this sequential manner.
NeuroScale: Novel Topographic Feature Extraction using RBF Networks
Lowe, David, Tipping, Michael
Dimension-reducing feature extraction neural network techniques which also preserve neighbourhood relationships in data have tra(cid:173) ditionally been the exclusive domain of Kohonen self organising maps. Recently, we introduced a novel dimension-reducing feature extraction process, which is also topographic, based upon a Radial Basis Function architecture. It has been observed that the gener(cid:173) alisation performance of the system is broadly insensitive to model order complexity and other smoothing factors such as the kernel widths, contrary to intuition derived from supervised neural net(cid:173) work models. In this paper we provide an effective demonstration of this property and give a theoretical justification for the apparent 'self-regularising' behaviour of the 'NEUROSCALE' architecture. 1 'NeuroScale': A Feed-forward Neural Network Topographic Transformation Recently an important class of topographic neural network based feature extraction approaches, which can be related to the traditional statistical methods of Sammon Mappings (Sammon, 1969) and Multidimensional Scaling (Kruskal, 1964), have been introduced (Mao and Jain, 1995; Lowe, 1993; Webb, 1995; Lowe and Tipping, 1996). These novel alternatives to Kohonen-like approaches for topographic feature extraction possess several interesting properties. For instance, the NEuROSCALE architecture has the empirically observed property that the generalisation perfor- 544 D. Lowe and M. E. Tipping mance does not seem to depend critically on model order complexity, contrary to intuition based upon knowledge of its supervised counterparts. This paper presents evidence for their 'self-regularising' behaviour and provides an explanation in terms of the curvature of the trained models. We now provide a brief introduction to the NEUROSCALE philosophy of nonlinear topographic feature extraction. Further details may be found in (Lowe, 1993; Lowe and Tipping, 1996). We seek a dimension-reducing, topographic transformation of data for the purposes of visualisation and analysis. By 'topographic', we imply that the geometric structure of the data be optimally preserved in the transformation, and the embodiment of this constraint is that the inter-point distances in the feature space should correspond as closely as possible to those distances in the data space. The implementation of this principle by a neural network is very simple. A Radial Basis Function (RBF) neural network is utilised to predict the coordinates of the data point in the transformed feature space. The locations of the feature points are indirectly determined by adjusting the weights of the network. The transformation is determined by optimising the network parameters in order to minimise a suitable error measure that embodies the topographic principle. The specific details of this alternative approach are as follows. Given an m(cid:173) dimensional input space of N data points x q , an n-dimensional feature space of points Yq is generated such that the relative positions of the feature space points minimise the error, or 'STRESS', term: N E = 2: 2:(d~p - dqp)2,
Time Series Prediction using Mixtures of Experts
Zeevi, Assaf, Meir, Ron, Adler, Robert
We consider the problem of prediction of stationary time series, using the architecture known as mixtures of experts (MEM). Here we suggest a mixture which blends several autoregressive models. This study focuses on some theoretical foundations of the predic(cid:173) tion problem in this context. More precisely, it is demonstrated that this model is a universal approximator, with respect to learn(cid:173) ing the unknown prediction function . This statement is strength(cid:173) ened as upper bounds on the mean squared error are established. Based on these results it is possible to compare the MEM to other families of models (e.g., neural networks and state dependent mod(cid:173) els). It is shown that a degenerate version of the MEM is in fact equivalent to a neural network, and the number of experts in the architecture plays a similar role to the number of hidden units in the latter model.
Neural Network Models of Chemotaxis in the Nematode Caenorhabditis Elegans
Ferrée, Thomas, Marcotte, Ben, Lockery, Shawn
We train recurrent networks to control chemotaxis in a computer model of the nematode C. elegans. The model presented is based closely on the body mechanics, behavioral analyses, neuroanatomy and neurophysiology of C. elegans, each imposing constraints rel(cid:173) evant for information processing. Simulated worms moving au(cid:173) tonomously in simulated chemical environments display a variety of chemotaxis strategies similar to those of biological worms.
GTM: A Principled Alternative to the Self-Organizing Map
Bishop, Christopher, Svensén, Markus, Williams, Christopher
The Self-Organizing Map (SOM) algorithm has been extensively studied and has been applied with considerable success to a wide variety of problems. However, the algorithm is derived from heuris(cid:173) tic ideas and this leads to a number of significant limitations. In this paper, we consider the problem of modelling the probabil(cid:173) ity density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. We introduce a novel form of latent variable model, which we call the GTM algo(cid:173) rithm (for Generative Topographic Mapping), which allows general non-linear transformations from latent space to data space, and which is trained using the EM (expectation-maximization) algo(cid:173) rithm. Our approach overcomes the limitations of the SOM, while introducing no significant disadvantages. We demonstrate the per(cid:173) formance of the GTM algorithm on simulated data from flow diag(cid:173) nostics for a multi-phase oil pipeline.
Support Vector Method for Function Approximation, Regression Estimation and Signal Processing
Vapnik, Vladimir, Golowich, Steven, Smola, Alex
The Support Vector (SV) method was recently proposed for es(cid:173) timating regressions, constructing multidimensional splines, and solving linear operator equations [Vapnik, 1995]. In this presenta(cid:173) tion we report results of applying the SV method to these problems.
Smoothing Regularizers for Projective Basis Function Networks
Moody, John, Rögnvaldsson, Thorsteinn
Smoothing regularizers for radial basis functions have been studied extensively, but no general smoothing regularizers for projective basis junctions (PBFs), such as the widely-used sigmoidal PBFs, have heretofore been proposed. We de(cid:173) rive new classes of algebraically-simple mH'-order smoothing regularizers for networks of the form f(W, x) = L7=1 Ujg [x T Vj + Vjol + uo, with general projective basis functions g[.]. These regularizers are: Ra(W,m) = LU;lIvjIl2m-1 GlobalForm
Learning from Demonstration
Schaal, Stefan
By now it is widely accepted that learning a task from scratch, i.e., without any prior knowledge, is a daunting undertaking. Humans, however, rarely at(cid:173) tempt to learn from scratch. They extract initial biases as well as strategies how to approach a learning problem from instructions and/or demonstrations of other humans. For learning control, this paper investigates how learning from demonstration can be applied in the context of reinforcement learning. We consider priming the Q-function, the value function, the policy, and the model of the task dynamics as possible areas where demonstrations can speed up learning. In general nonlinear learning problems, only model-based rein(cid:173) forcement learning shows significant speed-up after a demonstration, while in the special case of linear quadratic regulator (LQR) problems, all methods profit from the demonstration. In an implementation of pole balancing on a complex anthropomorphic robot arm, we demonstrate that, when facing the complexities of real signal processing, model-based reinforcement learning offers the most robustness for LQR problems. Using the suggested methods, the robot learns pole balancing in just a single trial after a 30 second long demonstration of the human instructor.
Promoting Poor Features to Supervisors: Some Inputs Work Better as Outputs
Caruana, Rich, de Sa, Virginia
In supervised learning there is usually a clear distinction between inputs and outputs - inputs are what you will measure, outputs are what you will predict from those measurements. This paper shows that the distinction between inputs and outputs is not this simple. Some features are more useful as extra outputs than as inputs. By using a feature as an output we get more than just the case values but can. learn a mapping from the other inputs to that feature. For many features this mapping may be more useful than the feature value itself. We present two regression problems and one classification problem where performance improves if features that could have been used as inputs are used as extra outputs instead. This result is surprising since a feature used as an output is not used during testing.
Hidden Markov Decision Trees
Jordan, Michael, Ghahramani, Zoubin, Saul, Lawrence
We study a time series model that can be viewed as a decision tree with Markov temporal structure. The model is intractable for exact calculations, thus we utilize variational approximations. We consider three different distributions for the approximation: one in which the Markov calculations are performed exactly and the layers of the decision tree are decoupled, one in which the decision tree calculations are performed exactly and the time steps of the Markov chain are decoupled, and one in which a Viterbi-like assumption is made to pick out a single most likely state sequence. We present simulation results for artificial data and the Bach chorales.
Representing Face Images for Emotion Classification
Padgett, Curtis, Cottrell, Garrison
We compare the generalization performance of three distinct rep(cid:173) resentation schemes for facial emotions using a single classification strategy (neural network). The face images presented to the clas(cid:173) sifiers are represented as: full face projections of the dataset onto their eigenvectors (eigenfaces); a similar projection constrained to eye and mouth areas (eigenfeatures); and finally a projection of the eye and mouth areas onto the eigenvectors obtained from 32x32 random image patches from the dataset. The latter system achieves 86% generalization on novel face images (individuals the networks were not trained on) drawn from a database in which human sub(cid:173) jects consistently identify a single emotion for the face .
Contour Organisation with the EM Algorithm
Leite, José, Hancock, Edwin
This paper describes how the early visual process of contour organ(cid:173) isation can be realised using the EM algorithm. The underlying computational representation is based on fine spline coverings. Ac(cid:173) cording to our EM approach the adjustment of spline parameters draws on an iterative weighted least-squares fitting process. The expectation step of our EM procedure computes the likelihood of the data using a mixture model defined over the set of spline cover(cid:173) ings. These splines are limited in their spatial extent using Gaus(cid:173) sian windowing functions. The maximisation of the likelihood leads to a set of linear equations in the spline parameters which solve the weighted least squares problem. We evaluate the technique on the localisation of road structures in aerial infra-red images.
Combining Neural Network Regression Estimates with Regularized Linear Weights
Merz, Christopher, Pazzani, Michael
When combining a set of learned models to form an improved es(cid:173) timator, the issue of redundancy or multicollinearity in the set of models must be addressed. A progression of existing approaches and their limitations with respect to the redundancy is discussed. A new approach, PCR , based on principal components regres(cid:173) sion is proposed to address these limitations. An evaluation of the new approach on a collection of domains reveals that: 1) PCR was the most robust combination method as the redundancy of the learned models increased, 2) redundancy could be handled without eliminating any of the learned models, and 3) the principal compo(cid:173) nents of the learned models provided a continuum of "regularized" weights from which PCR * could choose.
Triangulation by Continuous Embedding
Meila, Marina, Jordan, Michael
When triangulating a belief network we aim to obtain a junction tree of minimum state space. According to (Rose, 1970), searching for the optimal triangulation can be cast as a search over all the permutations of the graph's vertices. Our approach is to embed the discrete set of permutations in a convex continuous domain D. By suitably extending the cost function over D and solving the continous nonlinear optimization task we hope to obtain a good triangulation with respect to the aformentioned cost. This paper presents two ways of embedding the triangulation problem into continuous domain and shows that they perform well compared to the best known heuristic. 1 INTRODUCTION. WHAT IS TRIANGULATION? Belief networks are graphical representations of probability distributions over a set of variables. In what follows it will be always assumed that the variables take values in a finite set and that they correspond to the vertices of a graph. The graph's arcs will represent the dependencies among variables. There are two kinds of representations that have gained wide use: one is the directed acyclic graph model, also called a Bayes net, which represents the joint distribution as a product of the probabilities of each vertex conditioned on the values of its parents; the other is the undirected graph model, also called a Markov field, where the joint distribution is factorized over the cliques! of an undirected graph. This factorization is called a junction tree and optimizing it is the subject of the present paper. The power of both models lies in their ability to display and exploit existent marginal and conditional independencies among subsets of variables. Emphasizing independencies is useful 1 A clique is a fully connected set of vertices and a maximal clique is a clique that is not contained in any other clique. 558 M. Meilii and M. /. Jordan from both a qualitative point of view (it reveals something about the domain under study) and a quantitative one (it makes computations tractable). The two models differ in the kinds of independencies they are able to represent and often times in their naturalness in particular tasks. Directed graphs are more convenient for learning a model from data; on the other hand, the clique structure of undirected graphs organizes the information in a way that makes it immediately available to inference algorithms. Therefore it is a standard procedure to construct the model of a domain as a Bayes net and then to convert it to a Markov field for the purpose of querying it. This process is known as decomposition and it consists of the following stages: first, the directed graph is transformed into an undirected graph by an operation called moralization. Second, the moralized graph is triangulated. A graph is called triangulated if any cycle of length> 3 has a chord (i.e. an edge connecting two nonconsecutive vertices). If a graph is not triangulated it is always possible to add new edges so that the resulting graph is triangulated. We shall call this procedure triangulation and the added edges the fill-in. In the final stage, the junction tree (Kjrerulff, 1991) is constructed from the maximal cliques of the triangulated graph. We define the state space of a clique to be the cartesian product of the state spaces of the variables associated to the vertices in the clique and we call weight of the clique the size of this state space. The weight of the junction tree is the sum of the weights of its component cliques. All further exact inference in the net takes place in the junction tree representation. The number of computations required by an inference operation is proportional to the weight of the tree. For each graph there are several and usually a large number of possible triangu(cid:173) lations, with widely varying state space sizes. Moreover, triangulation is the only stage where the cost of inference can be influenced. It is therefore critical that the triangulation procedure produces a graph that is optimal or at least "good" in this respect. Unfortunately, this is a hard problem. No optimal triangulation algorithm is known to date. However, a number of heuristic algorithms like maximum cardinality search (Tarjan and Yannakakis, 1984), lexicographic search (Rose et al., 1976) and the minimum weight heuristic (MW) (Kjrerulff, 1990) are known. An optimization method based on simulated annealing which performs better than the heuristics on large graphs has been proposed in (Kjrerulff, 1991) and recently a "divide and conquer" algorithm which bounds the maximum clique size of the triangulated graph has been published (Becker and Geiger, 1996). All but the last algorithm are based on Rose's (Rose, 1970) elimination procedure: choose a node v of the graph, connect all its neighbors to form a clique, then eliminate v and all the edges incident to it and proceed recursively. The resulting filled-in graph is triangulated. It can be proven that the optimal triangulation can always be obtained by applying Rose's elimination procedure with an appropriate ordering of the nodes. It follows then that searching for an optimal triangulation can be cast as a search in the space of all node permutations. The idea of the present work is the following: embed the discrete search space of permutations of n objects (where n is the number of vertices) into a suitably chosen continuous space. Then extend the cost to a smooth function over the continuous domain and thus transform the discrete optimization problem into a continuous nonlinear optimization task. This allows one to take advantage of the thesaurus of optimization methods that exist for continuous cost functions. The rest of the paper will present this procedure in the following sequence: the next section introduces and discusses the objective function; section 3 states the continuous version of the problem; section 4 discusses further aspects of the optimization procedure and presents experimental results and section 5 concludes Triangulation by Continuous Embedding
Bayesian Model Comparison by Monte Carlo Chaining
Barber, David, Bishop, Christopher
The techniques of Bayesian inference have been applied with great success to many problems in neural computing including evaluation of regression functions, determination of error bars on predictions, and the treatment of hyper-parameters. However, the problem of model comparison is a much more challenging one for which current techniques have significant limitations. In this paper we show how an extended form of Markov chain Monte Carlo, called chaining, is able to provide effective estimates of the relative probabilities of different models. We present results from the robot arm problem and compare them with the corresponding results obtained using the standard Gaussian approximation framework. 1 Bayesian Model Comparison In a Bayesian treatment of statistical inference, our state of knowledge of the values of the parameters w in a model M is described in terms of a probability distribution function. Initially this is chosen to be some prior distribution p(wIM), which can be combined with a likelihood function p( Dlw, M) using Bayes' theorem to give a posterior distribution p(wID, M) in the form
Consistent Classification, Firm and Soft
Baram, Yoram
A classifier is called consistent with respect to a given set of class(cid:173) labeled points if it correctly classifies the set. We consider classi(cid:173) fiers defined by unions of local separators and propose algorithms for consistent classifier reduction. The expected complexities of the proposed algorithms are derived along with the expected classifier sizes. In particular, the proposed approach yields a consistent re(cid:173) duction of the nearest neighbor classifier, which performs "firm" classification, assigning each new object to a class, regardless of the data structure. The proposed reduction method suggests a notion of "soft" classification, allowing for indecision with respect to objects which are insufficiently or ambiguously supported by the data. The performances of the proposed classifiers in predict(cid:173) ing stock behavior are compared to that achieved by the nearest neighbor method.
Bayesian Unsupervised Learning of Higher Order Structure
Lewicki, Michael, Sejnowski, Terrence J.
Multilayer architectures such as those used in Bayesian belief net(cid:173) works and Helmholtz machines provide a powerful framework for representing and learning higher order statistical relations among inputs. Because exact probability calculations with these mod(cid:173) els are often intractable, there is much interest in finding approxi(cid:173) mate algorithms. We present an algorithm that efficiently discovers higher order structure using EM and Gibbs sampling. The model can be interpreted as a stochastic recurrent network in which ambi(cid:173) guity in lower-level states is resolved through feedback from higher levels. We demonstrate the performance of the algorithm on bench(cid:173) mark problems.
An Architectural Mechanism for Direction-tuned Cortical Simple Cells: The Role of Mutual Inhibition
Sabatini, Silvio, Solari, Fabio, Bisio, Giacomo
A linear architectural model of cortical simple cells is presented. The model evidences how mutual inhibition, occurring through synaptic coupling functions asymmetrically distributed in space, can be a possible basis for a wide variety of spatio-temporal simple cell response properties, including direction selectivity and velocity tuning. While spatial asymmetries are included explicitly in the structure of the inhibitory interconnections, temporal asymmetries originate from the specific mutual inhibition scheme considered. Extensive simulations supporting the model are reported.
Training Algorithms for Hidden Markov Models using Entropy Based Distance Functions
Singer, Yoram, Warmuth, Manfred K. K.
We present new algorithms for parameter estimation of HMMs. By adapting a framework used for supervised learning, we construct iterative algorithms that maximize the likelihood of the observations while also attempting to stay "close" to the current estimated parameters. We use a bound on the relative entropy between the two HMMs as a distance mea(cid:173) sure between them. The result is new iterative training algorithms which are similar to the EM (Baum-Welch) algorithm for training HMMs. The proposed algorithms are composed of a step similar to the expectation step of Baum-Welch and a new update of the parameters which replaces the maximization (re-estimation) step. The algorithm takes only negligi(cid:173) bly more time per iteration and an approximated version uses the same expectation step as Baum-Welch. We evaluate experimentally the new algorithms on synthetic and natural speech pronunciation data. For sparse models, i.e. models with relatively small number of non-zero parameters, the proposed algorithms require significantly fewer iterations. 1 Preliminaries We use the numbers from 0 to N to name the states of an HMM. State 0 is a special initial state and state N is a special final state. Any state sequence, denoted by s, starts with the initial state but never returns to it and ends in the final state. Observations symbols are also numbers in {I, ... , M} and observation sequences are denoted by x. A discrete output hidden Markov model (HMM) is parameterized by two matrices A and B. The first matrix is of dimension [N, N] and ai,j (0:5: i :5: N - 1,1 :5: j :5: N) denotes the probability of moving from state i to state j. The second matrix is of dimension [N + 1, M] and bi ,k is the probability of outputting symbol k at state i. The set of parameters of an HMM is denoted by 0 = (A, B). (The initial state distribution vector is represented by the first row of A.) An HMM is a probabilistic generator of sequences. It starts in the initial state O. It then iteratively does the following until the final state is reached. If i is the current state then a next state j is chosen according to the transition probabilities out of the current state (row i of matrix A). After arriving at state j a symbol is output according to the output probabilities of that state (row j of matrix B). Let P(x, slO) denote the probability (likelihood) that an HMM 0 generates the observation sequence x on the path s starting at state 0 and ending at state N: P(x, sllsl = Ixl + 1, So = 0, slSI = N, 0) ~ I1~~ll as._t,s.bs.,x •. For the sake of brevity we omit the conditions on s and x. Throughout the paper we assume that the HMMs are absorbing, that is from every state there is a path to the final state with a 642 Y. Singer and M. K. Warmuth non-zero probability. Similar parameter estimation algorithms can be derived for ergodic HMMs. Absorbing HMMs induce a probability over all state-observation sequences, i.e. Ex,s P(x, s18) = 1. The likelihood of an observation sequence x is obtained by summing over all possible hidden paths (state sequences), P(xI8) = Es P(x, sI8). To obtain the likelihood for a set X of observations we simply mUltiply the likelihood values for the individual sequences. We seek an HMM 8 that maximizes the likelihood for a given set of observations X, or equivalently, maximizes the log-likelihood, LL(XI8) = r:h EXEX In P(xI8). To simplify our notation we denote the generic parameter in 8 by Oi, where i ranges from 1 to the total number of parameters in A and B (There might be less if some are clamped to zero). We denote the total number of parameters of 8 by I and leave the (fixed) correspondence between the Oi and the entries of A and B unspecified. The indices are naturally partitioned into classes corresponding to the rows of the matrices. We denote by [i] the class of parameters to which Oi belongs and by O[i) the vector of all OJ S.t. j E [i]. If j E [i] then both Oi and OJ are parameters from the same row of one of the two matrices. Whenever it is clear from the context, we will use [i] to denote both a class of parameters and the row number (i.e. state) associated with the class. We now can rewrite P(x, s18) as nf=l O~'(X,S), where ni(x, s) is the number of times parameter i is used along the path s with observation sequence x. (Note that this value does not depend on the actual parameters 8.) We next compute partial derivatives ofthe likelihood and the log-likelihood using this notation.
A Constructive Learning Algorithm for Discriminant Tangent Models
Sona, Diego, Sperduti, Alessandro, Starita, Antonina
(HSS) developed an algo(cid:173) To reduce the computational complexity of classification systems using tangent distance, Hastie et al. rithm to devise rich models for representing large subsets of the data which computes automatically the "best" associated tan(cid:173) gent subspace. Schwenk & Milgram proposed a discriminant mod(cid:173) ular classification system (Diabolo) based on several autoassociative multilayer perceptrons which use tangent distance as error recon(cid:173) struction measure. We propose a gradient based constructive learning algorithm for building a tangent subspace model with discriminant capabilities which combines several of the the advantages of both HSS and Diabolo: devised tangent models hold discriminant capabilities, space requirements are improved with respect to HSS since our algorithm is discriminant and thus it needs fewer prototype models, dimension of the tangent subspace is determined automatically by the constructive algorithm, and our algorithm is able to learn new transformations.
Second-order Learning Algorithm with Squared Penalty Term
Saito, Kazumi, Nakano, Ryohei
This paper compares three penalty terms with respect to the effi(cid:173) ciency of supervised learning, by using first- and second-order learn(cid:173) ing algorithms. Our experiments showed that for a reasonably ade(cid:173) quate penalty factor, the combination of the squared penalty term and the second-order learning algorithm drastically improves the convergence performance more than 20 times over the other com(cid:173) binations, at the same time bringing about a better generalization performance.
Multi-effect Decompositions for Financial Data Modeling
Wu, Lizhong, Moody, John
High frequency foreign exchange data can be decomposed into three components: the inventory effect component, the surprise infonnation (news) component and the regular infonnation component. The presence of the inventory effect and news can make analysis of trends due to the diffusion of infonnation (regular information component) difficult. We propose a neural-net-based, independent component analysis to sep(cid:173) arate high frequency foreign exchange data into these three components. Our empirical results show that our proposed multi-effect decomposition can reveal the intrinsic price behavior.
The Effect of Correlated Input Data on the Dynamics of Learning
Halkjær, Søren, Winther, Ole
The convergence properties of the gradient descent algorithm in the case of the linear perceptron may be obtained from the response function. We derive a general expression for the response function and apply it to the case of data with simple input correlations. It is found that correlations severely may slow down learning. This explains the success of PCA as a method for reducing training time. Motivated by this finding we furthermore propose to transform the input data by removing the mean across input variables as well as examples to decrease correlations. Numerical findings for a medical classification problem are in fine agreement with the theoretical results.
Temporal Low-Order Statistics of Natural Sounds
Attias, Hagai, Schreiner, Christoph
In order to process incoming sounds efficiently, it is advantageous for the auditory system to be adapted to the statistical structure of natural auditory scenes. As a first step in investigating the relation between the system and its inputs, we study low-order statistical properties in several sound ensembles using a filter bank analysis. Focusing on the amplitude and phase in different frequency bands, we find simple parametric descriptions for their distribution and power spectrum that are valid for very different types of sounds. In particular, the amplitude distribution has an exponential tail and its power spectrum exhibits a modified power-law behavior, which is manifested by self-similarity and long-range temporal cor(cid:173) relations. Furthermore, the statistics for different bands within a given ensemble are virtually identical, suggesting translation in(cid:173) variance along the cochlear axis. These results show that natural sounds are highly redundant, and have possible implications to the neural code used by the auditory system.
Learning Temporally Persistent Hierarchical Representations
Becker, Suzanna
A biologically motivated model of cortical self-organization is pro(cid:173) posed. Context is combined with bottom-up information via a maximum likelihood cost function. Clusters of one or more units are modulated by a common contextual gating Signal; they thereby organize themselves into mutually supportive predictors of abstract contextual features. The model was tested in its ability to discover viewpoint-invariant classes on a set of real image sequences of cen(cid:173) tered, gradually rotating faces. It performed considerably better than supervised back-propagation at generalizing to novel views from a small number of training examples. 1 THE ROLE OF CONTEXT The importance of context effects l in perception has been demonstrated in many domains. For example, letters are recognized more quickly and accurately in the context of words (see e.g. McClelland & Rumelhart, 1981), words are recognized more efficiently when preceded by related words (see e.g. Neely, 1991), individual speech utterances are more intelligible in the context of continuous speech, etc. Fur(cid:173) ther, there is mounting evidence that neuronal responses are modulated by context. For example, even at the level of the LGN in the thalamus, the primary source of visual input to the cortex, Murphy & Sillito (1987) have reported cells with "end(cid:173) stopped" or length-tuned receptive fields which depend on top-down inputs from the cortex. The end-stopped behavior disappears when the top-down connections are removed, suggesting that the cortico-thalamic connections are providing contex(cid:173) tual modulation to the LGN. Moving a bit higher up the visual hierarchy, von der Heydt et al. (1984) found cells which respond to "illusory contours", in the absence of a contoured stimulus within the cells' classical receptive fields. These exam(cid:173) ples demonstrate that neuronal responses can be modulated by secondary sources of information in complex ways, provided the information is consistent with their expected or preferred input. 1 We use the term context rather loosely here to mean any secondary source of input. It could be from a different sensory modality, a different input channel within the same modality, a temporal history of the input, or top-down information. Learning Temporally Persistent Hierarchical Representations 825 Figure 1: Two sequences of 48 by 48 pixel images digitized with an IndyCam and prepro(cid:173) cessed with a Sobel edge filter. Eleven views of each of four to ten faces were used in the simulations reported here. The alternate (odd) views of two of the faces are shown above. Why would contextual modulation be such a pervasive phenomenon? One obvious reason is that if context can influence processing, it can help in disambiguating or cleaning up a noisy stimulus. A less obvious reason may be that if context can influence learning, it may lead to more compact representations, and hence a more powerful processing system. To illustrate, consider the benefits of incorporating temporal history into an unsupervised classifier. Given a continuous sensory signal as input, the classifier must try to discover important partitions in its training data. If it can discover features that are temporally persistent, and thus insensitive to transformations in the input, it should be able to represent the signal compactly with a small set offeatures. FUrther, these features are more likely to be associated with the identity of objects rather than lower-level attributes. However, most classifiers group patterns together on the basis of spatial overlap. This may be reasonable if there is very little shift or other form of distortion between one time step and the next, but is not a reasonable assumption about the sensory input to the cortex. Pre-cortical stages of sensory processing, certainly in the visual system (and probably in other modalities), tend to remove low-order correlations in space and time, e.g. with centre-surround filters. Consider the image sequences of gradually rotating faces in Figure 1. They have been preprocessed by a simple edge(cid:173) filter, so that successive views of the same face have relatively little pixel overlap. In contrast, identical views of different faces may have considerable overlap. Thus, a classifier such as k-means, which groups patterns based on their Euclidean distance, would not be expected to do well at classifying these patterns. So how are people (and in fact very young children) able to learn to classify a virtually infinite number of objects based on relatively brief exposures? It is argued here that the assumption of temporal persistence is a powerful constraining factor for achieving this, and is one which may be used to advantage in artificial neural networks as well. Not only does it lead to the development of higher-order feature analyzers, but it can result in more compact codes which are important for applications like image compression. Further, as the simulations reported here show, improved generalization may be achieved by allowing high-level expectations (e.g. of class labels) to influence the development of lower-level feature detectors. 2 THE MODEL Competitive learning (for a review, see Becker & Plumbley, 1996) is considered by many to be a reasonably strong candidate model of cortical learning. It can be implemented, in its simplest form, by a Hebbian learning rule in a network
Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning
Schneider, Jeff
Model learning combined with dynamic programming has been shown to be effective for learning control of continuous state dynamic systems. The simplest method assumes the learned model is correct and applies dynamic programming to it, but many approximators provide uncertainty estimates on the fit. How can they be exploited? This paper addresses the case where the system must be prevented from having catastrophic failures dur(cid:173) ing learning. We propose a new algorithm adapted from the dual control literature and use Bayesian locally weighted regression models with dy(cid:173) namic programming. A common reinforcement learning assumption is that aggressive exploration should be encouraged. This paper addresses the con(cid:173) verse case in which the system has to reign in exploration. The algorithm is illustrated on a 4 dimensional simulated control problem.
Spectroscopic Detection of Cervical Pre-Cancer through Radial Basis Function Networks
Tumer, Kagan, Ramanujam, Nirmala, Richards-Kortum, Rebecca, Ghosh, Joydeep
The mortality related to cervical cancer can be substantially re(cid:173) duced through early detection and treatment. However, cur(cid:173) rent detection techniques, such as Pap smear and colposcopy, fail to achieve a concurrently high sensitivity and specificity. In vivo fluorescence spectroscopy is a technique which quickly, non(cid:173) invasively and quantitatively probes the biochemical and morpho(cid:173) logical changes that occur in pre-cancerous tissue. RBF ensemble algorithms based on such spectra provide automated, and near real(cid:173) time implementation of pre-cancer detection in the hands of non(cid:173) experts. The results are more reliable, direct and accurate than those achieved by either human experts or multivariate statistical algorithms.
Radial Basis Function Networks and Complexity Regularization in Function Learning
Krzyzak, Adam, Linder, Tamás
In this paper we apply the method of complexity regularization to de(cid:173) rive estimation bounds for nonlinear function estimation using a single hidden layer radial basis function network. Our approach differs from the previous complexity regularization neural network function learning schemes in that we operate with random covering numbers and 11 metric entropy, making it po~sibleto consider much broader families of activa(cid:173) tion functions, namely functions of bounded variation. Some constraints previously imposed on the network parameters are also eliminated this way. The network is trained by means of complexity regularization in(cid:173) volving empirical risk minimization. Bounds on the expected risk in tenns of the sample size are obtained for a large class of loss functions. Rates of convergence to the optimal loss are also derived.
On-line Policy Improvement using Monte-Carlo Search
Tesauro, Gerald, Galperin, Gregory
We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo sim(cid:173) ulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP! and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algo(cid:173) rithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
A New Approach to Hybrid HMM/ANN Speech Recognition using Mutual Information Neural Networks
Rigoll, Gerhard, Neukirchen, Christoph
This paper presents a new approach to speech recognition with hybrid HMM/ANN technology. While the standard approach to hybrid HMMI ANN systems is based on the use of neural networks as posterior probability estimators, the new approach is based on the use of mutual information neural networks trained with a special learning algorithm in order to maximize the mutual information between the input classes of the network and its resulting sequence of firing output neurons during training. It is shown in this paper that such a neural network is an optimal neural vector quantizer for a discrete hidden Markov model system trained on Maximum Likelihood principles. One of the main advantages of this approach is the fact, that such neural networks can be easily combined with HMM's of any complexity with context-dependent capabilities. It is shown that the resulting hybrid system achieves very high recognition rates, which are now already on the same level as the best conventional HMM systems with continuous parameters, and the capabilities of the mutual information neural networks are not yet entirely exploited.
Competition Among Networks Improves Committee Performance
Munro, Paul, Parmanto, Bambang
The separation of generalization error into two types, bias and variance (Geman, Bienenstock, Doursat, 1992), leads to the notion of error reduction by averaging over a "committee" of classifiers (Perrone, 1993). Committee perfonnance decreases with both the average error of the constituent classifiers and increases with the degree to which the misclassifications are correlated across the committee. Here, a method for reducing correlations is introduced, that uses a winner-take-all procedure similar to competitive learning to drive the individual networks to different minima in weight space with respect to the training set, such that correlations in generalization perfonnance will be reduced, thereby reducing committee error.
Selective Integration: A Model for Disparity Estimation
Gray, Michael, Pouget, Alexandre, Zemel, Richard, Nowlan, Steven, Sejnowski, Terrence J.
Local disparity information is often sparse and noisy, which creates two conflicting demands when estimating disparity in an image re(cid:173) gion: the need to spatially average to get an accurate estimate, and the problem of not averaging over discontinuities. We have devel(cid:173) oped a network model of disparity estimation based on disparity(cid:173) selective neurons, such as those found in the early stages of process(cid:173) ing in visual cortex. The model can accurately estimate multiple disparities in a region, which may be caused by transparency or oc(cid:173) clusion, in real images and random-dot stereograms. The use of a selection mechanism to selectively integrate reliable local disparity estimates results in superior performance compared to standard back-propagation and cross-correlation approaches. In addition, the representations learned with this selection mechanism are con(cid:173) sistent with recent neurophysiological results of von der Heydt, Zhou, Friedman, and Poggio [8] for cells in cortical visual area V2. Combining multi-scale biologically-plausible image processing with the power of the mixture-of-experts learning algorithm represents a promising approach that yields both high performance and new insights into visual system function. Selective Integration: A Model for Disparity Estimation
The Neurothermostat: Predictive Optimal Control of Residential Heating Systems
Mozer, Michael C., Vidmar, Lucky, Dodier, Robert
The Neurothermostat is an adaptive controller that regulates in(cid:173) door air temperature in a residence by switching a furnace on or off. The task is framed as an optimal control problem in which both comfort and energy costs are considered as part of the con(cid:173) trol objective. Because the consequences of control decisions are delayed in time, the N eurothermostat must anticipate heating de(cid:173) mands with predictive models of occupancy patterns and the ther(cid:173) mal response of the house and furnace. Occupancy pattern predic(cid:173) tion is achieved by a hybrid neural net / look-up table. The Neu(cid:173) rothermostat searches, at each discrete time step, for a decision sequence that minimizes the expected cost over a fixed planning horizon. The first decision in this sequence is taken, and this pro(cid:173) cess repeats. Simulations of the Neurothermostat were conducted using artificial occupancy data in which regularity was systemat(cid:173) ically varied, as well as occupancy data from an actual residence. The Neurothermostat is compared against three conventional poli(cid:173) cies, and achieves reliably lower costs. This result is robust to the relative weighting of comfort and energy costs and the degree of variability in the occupancy patterns. For over a quarter century, the home automation industry has promised to revolu(cid:173) tionize our lifestyle with the so-called Smart House@ in which appliances, lighting, stereo, video, and security systems are integrated under computer control. How(cid:173) ever, home automation has yet to make significant inroads, at least in part because software must be tailored to the home occupants. Instead of expecting the occupants to program their homes or to hire someone to do so, one would ideally like the home to essentially program itself by observing the lifestyle of the occupants. This is the goal of the Neural Network House (Mozer et al., 1995), an actual residence that has been outfitted with over 75 sensors(cid:173) including temperature, light, sound, motion-and actua.tors to control air heating, water heating, lighting, and ventilation. In this paper, we describe one research 954 M. C. Mozer. L. Vidmar and R. H. Dodier project within the house, the Neurothermostat, that learns to regulate the indoor air temperature automatically by observing and detecting patterns in the occupants' schedules and comfort preferences. We focus on the problem of air heating with a whole-house furnace, but the same approach can be taken with alternative or multiple heating devices, and to the problems of cooling and ventilation. 1 TEMPERATURE REGULATION AS AN OPTIMAL
A Comparison between Neural Networks and other Statistical Techniques for Modeling the Relationship between Tobacco and Alcohol and Cancer
Plate, Tony, Band, Pierre, Bert, Joel, Grace, John
Epidemiological data is traditionally analyzed with very simple techniques. Flexible models, such as neural networks, have the potential to discover unanticipated features in the data. However, to be useful, flexible models must have effective control on overfit(cid:173) ting. This paper reports on a comparative study of the predictive quality of neural networks and other flexible models applied to real and artificial epidemiological data. The results suggest that there are no major unanticipated complex features in the real data, and also demonstrate that MacKay's [1995] Bayesian neural network methodology provides effective control on overfitting while retain(cid:173) ing the ability to discover complex features in the artificial data.