Students: Andrew H. Fagg, Joel Hoff, David Lotspeich
Advisor: Dr. George A. Bekey
Reactive Control and Reinforcement Learning in an Autonomous Radio-Controlled Robot Car
Within the field of robotics, much recent attention has been given to control techniques that have been termed 'reactive' or 'behavior-based'. The design of such control systems for even a remotely interesting task is typically a laborious effort, requiring many hours of experimental "tweaking" as the designer observes the system's actual behavior.
Researchers at the Robotics Laboratory at the University of Southern California are working on a reinforcement learning-based approach to the design of reactive control policies in which the designer specifies the desired behavior of the system, rather than the control program that produces the desired behavior.
An adapted radio-controlled car called 'Marvin' has been developed to use such an approach to navigate a laboratory environment. Marvin is equipped with 5 sonar detectors and 2 tactile bumpers. In addition to this sensory information, the environment also provides a reinforcement signal. From this information, the robot must infer which sensory features are relevant for acting, as well as the appropriate actions to take in a particular situation. Given a particular control policy, the robot must first learn to predict the reinforcement that its actions will produce. This prediction is then used to update the control policy toward decisions that are more likely to acquire positive reward and avoid negative reward. The process is repeated until an appropriate control policy is obtained. Desired behaviors include obstacle avoidance, environmental exploration, and wall following.
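The predict-then-update loop described above can be sketched in the style of an actor-critic scheme: a critic learns to predict reward from sensory features, and the control policy (actor) is adjusted using the critic's prediction error. All names, sizes, and parameters below are illustrative assumptions, not Marvin's actual implementation.

```python
# Hypothetical sketch of a reward-prediction + policy-update loop.
# Feature and action counts are placeholders, not the robot's real ones.

N_FEATURES = 8    # stand-in for the robot's sensory feature vector
N_ACTIONS = 3     # e.g., steer left, straight, right

ALPHA_CRITIC = 0.1   # critic (reward predictor) learning rate
ALPHA_ACTOR = 0.05   # actor (control policy) learning rate

critic = [0.0] * N_FEATURES                              # reward prediction weights
actor = [[0.0] * N_ACTIONS for _ in range(N_FEATURES)]   # action preferences

def predict(features):
    """Critic's estimate of the reward for the current sensory state."""
    return sum(w * f for w, f in zip(critic, features))

def choose_action(features):
    """Pick the action with the highest summed preference (greedy)."""
    scores = [sum(actor[i][a] * features[i] for i in range(N_FEATURES))
              for a in range(N_ACTIONS)]
    return scores.index(max(scores))

def learn_step(features, action, reward):
    """Move the critic toward the observed reward, and use the
    prediction error as the training signal for the policy."""
    error = reward - predict(features)
    for i in range(N_FEATURES):
        critic[i] += ALPHA_CRITIC * error * features[i]
        actor[i][action] += ALPHA_ACTOR * error * features[i]
    return error
```

As the critic's predictions improve, the error signal shrinks, so policy updates naturally taper off in well-learned regions of the state space.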
The neural-based architecture under development uses a reinforcement prediction module to solve the temporal credit assignment problem, i.e., to determine which recent actions and features should be associated with the received reinforcement. It is hypothesized that the prediction module propagates reinforcement information through time more effectively, prevents overlearning, and enables better and faster mastery of more complex behaviors.
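One standard way a prediction module can address temporal credit assignment is temporal-difference (TD) learning with eligibility traces, in which reinforcement arriving now is credited back to the features that were active on recent steps. The sketch below shows that mechanism under assumed names and parameters; it is not a description of the architecture's actual equations.

```python
# Minimal TD(lambda) sketch: eligibility traces carry reinforcement
# credit back through time to recently active features.

N_FEATURES = 4
ALPHA = 0.1      # learning rate
GAMMA = 0.9      # discount factor
LAMBDA_ = 0.8    # trace decay: how far back in time credit reaches

weights = [0.0] * N_FEATURES   # prediction of future reinforcement
traces = [0.0] * N_FEATURES    # eligibility: which features were active recently

def value(features):
    """Predicted (discounted) future reinforcement for this state."""
    return sum(w * f for w, f in zip(weights, features))

def reset_traces():
    """Clear eligibility at the start of a new episode."""
    for i in range(N_FEATURES):
        traces[i] = 0.0

def td_step(features, reward, next_features):
    """One TD(lambda) update: the prediction error credits every
    feature in proportion to how recently it was active."""
    delta = reward + GAMMA * value(next_features) - value(features)
    for i in range(N_FEATURES):
        traces[i] = GAMMA * LAMBDA_ * traces[i] + features[i]
        weights[i] += ALPHA * delta * traces[i]
    return delta
```

After repeated trials of a two-step sequence where reward arrives only at the end, the feature active on the *first* step also acquires a positive prediction weight, which is exactly the backward propagation of reinforcement information the text refers to.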
The primary contribution of this work, however, is in its attempt to identify more efficient coding schemes for state space representation. A local winner-take-all operation at the feature detector layer forces moderately different input activity patterns to activate different sets of feature detector units. As a result, learning that is performed in one region of the state space does not tend to interfere with learning in other regions of the space (as is the case with Backpropagation-based learning algorithms).
On the other hand, local support given by winning feature detectors to nearby neighbors tends to create a topological mapping of the input space. As a result, points that are near each other in the input space will activate largely overlapping sets of feature detectors, allowing these points to share their common experiences.
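The two coding properties just described can be illustrated together: a k-winners-take-all operation keeps the feature code sparse (so distinct regions of state space learn independently), while each winner also lends partial activation to its immediate neighbors (so nearby inputs produce overlapping codes). The layout, unit count, and support value below are assumptions for illustration only.

```python
# Illustrative k-winners-take-all feature layer with local neighbor
# support. Units are arranged on a 1-D ring; details are assumed.

N_UNITS = 16   # feature detectors arranged on a ring
K = 2          # number of winning units per input

def feature_code(activations, support=0.5):
    """Return the sparse output of the feature-detector layer.
    Only the K strongest units fire fully; each winner also lends
    partial 'support' activation to its two ring neighbors."""
    ranked = sorted(range(N_UNITS), key=lambda i: activations[i],
                    reverse=True)
    winners = set(ranked[:K])
    output = [0.0] * N_UNITS
    for w in winners:
        output[w] = 1.0
        for nb in ((w - 1) % N_UNITS, (w + 1) % N_UNITS):
            # a neighbor keeps full activation if it is itself a winner
            output[nb] = max(output[nb], support)
    return output
```

Because only a handful of units are nonzero for any input, weight updates driven by one input leave most other units untouched; because the support spills onto neighbors, two similar inputs still share part of their active set and hence part of their learning.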
Ongoing experimentation is focusing on the effectiveness of the overall approach for different behaviors in both staged and non-staged learning.