
Appears in the AAAI-04 workshop on Supervisory Control of Learning and Adaptive Systems.

Guiding a Reinforcement Learner with Natural Language Advice: Initial Results in RoboCup Soccer

Gregory Kuhlmann, Peter Stone, Raymond Mooney
Department of Computer Sciences

The University of Texas at Austin

{kuhlmann,pstone,mooney}@cs.utexas.edu

Jude Shavlik
Department of Computer Sciences
The University of Wisconsin-Madison
shavlik@cs.wisc.edu

Abstract

We describe our current efforts towards creating a reinforcement learner that learns both from reinforcements provided by its environment and from human-generated advice. Our research involves two complementary components: (a) mapping advice expressed in English to a formal advice language and (b) using advice expressed in a formal notation in a reinforcement learner. We use a subtask of the challenging RoboCup simulated soccer task (Noda et al. 1998) as our testbed.

Introduction

Reinforcement learning (RL) is a common way to create adaptive systems that learn to act in complex, dynamic environments (Sutton & Barto 1998). In RL, the learner repeatedly senses the world, chooses an action to perform, and occasionally receives feedback from its environment, which the learner then uses to improve its performance. Employing RL can be a more effective way to create intelligent robots and software agents than writing programs by hand. In addition, RL also requires much less human intervention than supervised machine learning, which requires large sets of labeled training examples. We describe our current efforts towards creating a reinforcement learner that learns both from reinforcements provided by its environment and from human-written suggestions; in fact, it is our goal that these suggestions be provided in ordinary English.

Typically, the feedback given to a reinforcement learner is simply a numeric representation of rewards or punishments. However, there usually is much more feedback that a human teacher of such a learner could provide. Several researchers have designed successful methods where this feedback can include high-level "advice" expressed by humans at a natural level of abstraction using statements in a formal language (Noelle & Cottrell 1994; Maclin & Shavlik 1994; Siegelmann 1994; Eliassi-Rad & Shavlik 2003). Since the agent employs machine learning, such advice need not be perfectly accurate, fully precise, nor completely specified. In some approaches, the RL agent's human partner can provide advice at any time, based on the agent's current behavior. The advice may suggest an action to be taken immediately.


Figure 1: A screen shot from the middle of a 3 vs. 2 keepaway episode in a 20m x 20m region (the boundary, keepers, and takers are labeled in the figure). Flash files illustrating the task are available from http://www.cs.utexas.edu/~AustinVilla/sim/keepaway/

Agents in the RoboCup simulator (Noda et al. 1998) receive visual perceptions every 150 msec indicating the relative distance and angle to visible objects in the world, such as the ball and other agents. They may execute a primitive, parameterized action such as turn(angle), dash(power), or kick(power, angle) every 100 msec. Thus the agents must sense and act asynchronously. Random noise is injected into all sensations and actions. Individual agents must be controlled by separate processes, with no inter-agent communication permitted other than via the simulator itself, which enforces communication bandwidth and range constraints. Full details of the RoboCup simulator are presented in the server manual (Chen et al. 2003).

In this work, we focus exclusively on training the keepers. As a way of incorporating domain knowledge, our learners choose not from the simulator's set of primitive actions but from higher-level actions constructed from a set of basic skills that were implemented by the CMUnited-99 team (Stone, Riley, & Veloso 2000).

Keepers have the freedom to decide which action to take only when in possession of the ball. A keeper in possession may either hold the ball or pass to one of its teammates. Keepers not in possession of the ball are required to select the Receive option, in which the player who can get there the soonest goes to the ball and the remaining players try to get open for a pass.

We further incorporate domain knowledge by providing the keepers with rotationally-invariant state features computed from the world state. The keepers' state variables are computed from the positions of: the keepers K1–Kn and takers T1–Tm, ordered by increasing distance from K1; and C, the center of the playing region. Let dist(a,b) be the distance between a and b, and ang(a,b,c) be the angle between a and c with vertex at b. For 3 keepers and 2 takers, we used the following 13 state variables:

dist(K1,C), dist(K2,C), dist(K3,C),
dist(T1,C), dist(T2,C),
dist(K1,K2), dist(K1,K3),
dist(K1,T1), dist(K1,T2),
Min(dist(K2,T1), dist(K2,T2)),
Min(dist(K3,T1), dist(K3,T2)),
Min(ang(K2,K1,T1), ang(K2,K1,T2)),
Min(ang(K3,K1,T1), ang(K3,K1,T2))
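To make the feature computation concrete, the following Python sketch computes these 13 variables from raw (x, y) positions. The function names, the coordinate representation, and the assumption that the first keeper in the list plays the role of K1 (e.g., the keeper in possession) are ours, not taken from the paper.

import math

def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def ang(a, b, c):
    """Angle in degrees between points a and c with the vertex at b."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def keepaway_state(keepers, takers, center):
    """The 13 rotationally-invariant state variables for 3 vs. 2 keepaway.
    keepers and takers are lists of (x, y) positions; keepers[0] is assumed
    to play the role of K1.  The remaining players are re-ordered by
    increasing distance from K1, as described above."""
    k1 = keepers[0]
    K = [k1] + sorted(keepers[1:], key=lambda p: dist(k1, p))   # K1, K2, K3
    T = sorted(takers, key=lambda p: dist(k1, p))               # T1, T2
    return [
        dist(K[0], center), dist(K[1], center), dist(K[2], center),
        dist(T[0], center), dist(T[1], center),
        dist(K[0], K[1]), dist(K[0], K[2]),
        dist(K[0], T[0]), dist(K[0], T[1]),
        min(dist(K[1], T[0]), dist(K[1], T[1])),
        min(dist(K[2], T[0]), dist(K[2], T[1])),
        min(ang(K[1], K[0], T[0]), ang(K[1], K[0], T[1])),
        min(ang(K[2], K[0], T[0]), ang(K[2], K[0], T[1])),
    ]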

For our purposes, the behavior of the takers is "hard-wired" and relatively simple. The two takers that can get there the soonest go to the ball, while the remaining takers try to block open passing lanes.

An obvious performance measure for this task is average episode duration. The keepers attempt to maximize it while the takers try to minimize it. To this end, the keepers are given a constant positive reward for each time step an episode persists. For full details on the task and the learning scenario, see Stone and Sutton (2001).

Natural Language Interface

Allowing human teachers to provide advice in natural language lets them instruct a learning agent without having to master a complex formal advice language. In our approach, a parser automatically translates natural-language instructions into an underlying formal language appropriate for the domain. Statements in this formal language are then used to influence the action policy learned by the agent. In the RoboCup Coach Competition, teams compete to provide effective instructions to a coachable team in the simulated soccer domain. Coaching information is provided in a formal language called CLANG (Coach Language) (Chen et al. 2003). By constructing English translations for 500 CLANG statements produced by several teams for the 2003 RoboCup Coach Competition, we produced a corpus for training and testing a natural-language interface. Below are some sample annotated statements from this corpus:

• If player 4 has the ball, it should pass the ball to player 2 or 10.

((bowner our {4}) (do our {4} (pass {2 10})))

• No one pass to the goalie.

((bowner our {0}) (dont our {0} (pass {1})))

• If players 9, 10, or 11 have the ball, they should shoot and should not pass to players 2-8.

((bowner our {9 10 11})
 (do our {9 10 11} (shoot))
 (dont our {9 10 11} (pass {2 3 4 5 6 7 8})))

For a sufficiently restricted task, such as RoboCup coaching, parsing natural-language sentences into formal representations is a reasonably manageable task using current NLP technology (Jurafsky & Martin 2000). However, developing such a parser is a very labor-intensive software-engineering project. Consequently, methods that learn to map natural language to a given formal language from input/output pairs such as those above can significantly automate this difficult development process.

We have previously developed methods for learning to translate natural-language sentences into formal semantic representations. In particular, we have developed two integrated systems: CHILL (Zelle & Mooney 1996), which learns a parser for mapping natural-language sentences directly to logical form, and WOLFIE (Thompson & Mooney 2003), which learns a lexicon of word meanings required by this parser.

We are currently adapting CHILL and WOLFIE to learn to map English to CLANG, as well as exploring several new approaches to semantic parsing. Our first new approach uses pattern-based transformation rules to map phrases in natural language directly to CLANG. Our second new approach first uses a statistical parser (Charniak 2000) to produce a syntactic parse tree, then uses pattern-based transformation rules to map subtrees of this parse to CLANG expressions. Our third new approach uses a parser that integrates syntactic and semantic analysis by pairing each production in the syntactic grammar with a compositional semantic function that produces a semantic form for a phrase given semantic forms for its subphrases (Norvig 1992). For lack of space, we elaborate only on the first of these new approaches, which is the simplest method.

CLANG comes with a formal grammar that defines the language using production rules such as:

ACTION → (pass UNUM_SET)

where UNUM_SET is a set of player uniform numbers.
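This excerpt does not show the transformation rules themselves, so the sketch below is only an illustration of the general idea: hand-written patterns that map English phrases directly to CLANG fragments. The regular expressions, templates, and function names are hypothetical and far cruder than a real rule set would be.

import re

def nums(text):
    """Extract player numbers from an English list such as '2 or 10'."""
    return " ".join(re.findall(r"\d+", text))

# Hypothetical pattern-based transformation rules (for illustration only):
# each English pattern is paired with a template that builds a CLANG fragment.
# Negated advice ('dont') and number ranges like '2-8' are not handled here.
RULES = [
    (re.compile(r"if players? ([\d, or]+) (?:has|have) the ball", re.I),
     lambda m: "(bowner our {%s})" % nums(m.group(1))),
    (re.compile(r"pass (?:the ball )?to players? ([\d, or]+)", re.I),
     lambda m: "(pass {%s})" % nums(m.group(1))),
    (re.compile(r"\bshoot\b", re.I),
     lambda m: "(shoot)"),
]

def translate(sentence):
    """Return the CLANG fragments produced by every matching rule."""
    return [build(m) for pattern, build in RULES
            for m in pattern.finditer(sentence)]

print(translate("If player 4 has the ball, it should pass the ball to player 2 or 10."))
# -> ['(bowner our {4})', '(pass {2 10})']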

The action values produced by the advice unit are added to those generated by the function approximator and presented to the learning algorithm.

Figure 2: A pictorial summary of the complete advice integration scheme. (Components shown include the state variables, the CMAC function approximator, and the resulting action values.)

One of the nice things about this advice incorporation method is that it does not require the function approximator and the advice unit to use the same set of features to describe the world. At the same time, it allows the advice to be adjusted by the learner. By adjusting the function approximator's value for an advised action, the agent can effectively "unlearn" the given advice.
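As a rough illustration of this integration scheme, the sketch below adds the advice unit's output to the CMAC's action-value estimates before action selection. The cmac.value interface, the constant bonus magnitude, and the sign convention for "do" versus "dont" advice are assumptions on our part; the surviving text only states that the two sets of values are added.

def action_values(state_vars, world_state, cmac, advice_rules, bonus=1.0):
    """Combine CMAC estimates with the advice unit's output (sketch only).
    cmac.value(state_vars, a) is an assumed interface returning the learned
    estimate for action a; advice_rules is a list of (condition, action,
    sign) triples, with sign +1 for 'do' advice and -1 for 'dont' advice.
    The constant bonus and the sign convention are our assumptions."""
    actions = ["hold", "pass_k2", "pass_k3"]        # keeper-with-ball choices
    values = {a: cmac.value(state_vars, a) for a in actions}
    for condition, action, sign in advice_rules:
        if condition(world_state):                  # advice may use its own features
            values[action] += sign * bonus          # added to, not overriding, learning
    return values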

Experiments with Advice Giving

We conducted a series of preliminary experiments to measure the impact of advice on the quality of the learned policies and the speed at which those policies converge. In each experiment, we add a single piece of advice at the beginning of the learning trial. The advice applies to every member of the keeper team and remains active for the duration of the experiment.

Four Sample Pieces of Advice

We created four pieces of sample advice to test the system. The advice has not been extensively engineered to maximize performance. It is simply a collection of "first thoughts" made from observing the players during training and recognizing some of their shortcomings.

The first piece of advice, Hold Advice, states that the player in possession of the ball should choose to hold onto the ball rather than pass if no opponents are within 8m. We observed that early in learning the keepers tend to pass more often than they probably should. Therefore, it seems reasonable to advise them not to pass when the takers are too far away to be a threat. Figure 3 illustrates this advice. The keepers are represented as darkly-filled circles and the takers are represented as lightly-filled circles.

To give a sense of how advice is represented in the syntax of CLANG, the following is the CLANG rule corresponding to the Hold Advice:

(definerule hold-advice direc
  ((ppos opp {0} 0 0 (arc (pt self) 0 8 0 360))
   (do our {0} (hold))))

The rule describes a circular region (shown dashed in Figure 3) centered at one of the keepers with a radius of 8m, and states that if exactly 0 opponents are in that region, the keeper is advised to perform the hold action.

Figure 3: Hold Advice - If no opponents are within 8m then hold.
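The condition encoded by this rule can be checked directly from player positions; the following is a minimal sketch with our own function and parameter names.

import math

def hold_advice_applies(ball_holder, takers, radius=8.0):
    """Hold Advice condition: no opponent is within radius meters of the
    keeper in possession (the circular region of the CLANG rule above).
    Positions are (x, y) tuples; the names here are ours."""
    return all(math.hypot(t[0] - ball_holder[0], t[1] - ball_holder[1]) >= radius
               for t in takers)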

The next piece of example advice we call Quadrant Advice. As seen in Figure 4, the play region is divided into four quadrants. A player is advised to pass to a teammate if that teammate is in a different quadrant and that quadrant contains no opponents. This advice aims to encourage players to pass to teammates that are not being defended.

Figure 4: Quadrant Advice - Pass to a teammate if he is in a different quadrant that contains no opponents.

Lane Advice (see Figure 5) instructs players to pass to a teammate when the passing lane to that teammate is open. A passing lane is defined as an isosceles triangular region with its apex at the position of the player with the ball and its base midpoint at the position of the intended pass recipient. A lane is open if no opponents are inside the region. The purpose of this advice is to encourage passes that are likely to succeed.

Figure 5: Lane Advice - Pass to a teammate if there are no opponents in the corresponding passing lane.
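A sketch of the corresponding geometric test appears below. The base half-width of the triangle is an assumption on our part, since the text does not specify the lane's width.

import math

def point_in_triangle(p, a, b, c):
    """Standard same-side (sign of cross product) test."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def lane_open(passer, receiver, takers, half_width=4.0):
    """Lane Advice condition (sketch): model the passing lane as an isosceles
    triangle with apex at the passer and base midpoint at the receiver; the
    lane is open if no taker lies inside.  The 4m base half-width is assumed."""
    ax, ay = passer
    bx, by = receiver
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy) or 1e-9
    ux, uy = -dy / norm, dx / norm            # unit vector perpendicular to the pass
    corners = ((ax, ay),
               (bx + half_width * ux, by + half_width * uy),
               (bx - half_width * ux, by - half_width * uy))
    return not any(point_in_triangle(t, *corners) for t in takers)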

The final piece of advice used in our experiments, Edge Advice, differs from the previous advice in that it advises against an action rather than for one. As shown in Figure 6, this rule defines edge regions along the sides of the playing field. These regions are each 5m wide. A player is advised not to pass to a teammate if both players are in the same edge region. The goal of this advice is to discourage passes along the edges of the play region, which have a high probability of going out of bounds due to noise in the simulator.

Figure 6: Edge Advice - Do not pass along the edges of the play region.
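A sketch of the edge-region test follows. The coordinate frame and the treatment of the four 5m strips are our assumptions; the text only states that the regions run along the sides of the play region and are 5m wide.

def same_edge_region(p1, p2, field=20.0, width=5.0):
    """Edge Advice condition (sketch): True if both players are inside the
    same 5m-wide strip along one side of the field x field play region.
    Assumes both x and y coordinates range over [0, field]."""
    strips = ((0, lambda v: v <= width),          # left strip
              (0, lambda v: v >= field - width),  # right strip
              (1, lambda v: v <= width),          # bottom strip
              (1, lambda v: v >= field - width))  # top strip
    for coord, inside in strips:
        if inside(p1[coord]) and inside(p2[coord]):
            return True                           # same edge region: advise against the pass
    return False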

Empirical Results when Using Advice

For each piece of advice tested, we ran five learning trials, each starting from a different random initial state of the learner's function approximator. The results are compared with five learning trials during which no advice is given. For each learning trial, we measure the average episode duration over time. Episodes are averaged using a 1000-episode sliding window. We plot the learning curves for every trial on the same graph to give a sense of the variance. The results are shown in Figures 7-10. In all experiments, 3 keepers played against 2 takers on a 20m × 20m field.
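The sliding-window averaging just described can be sketched as follows; this is a minimal illustration, and the mapping from episode index to training hours on the x-axis is omitted.

from collections import deque

def sliding_average(durations, window=1000):
    """Average episode durations over a sliding window of `window` episodes,
    as used to produce the learning curves."""
    recent, total, curve = deque(), 0.0, []
    for d in durations:
        recent.append(d)
        total += d
        if len(recent) > window:
            total -= recent.popleft()
        curve.append(total / len(recent))
    return curve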

By default in the RoboCup simulator, players are only able to see objects that are within a 90° view cone. Although we have shown previously that the learning method used in this paper is able to work under this condition (Kuhlmann & Stone 2004), in this work we have simplified the problem by giving the players 360° vision. This simplification ensures that the conditions of the advice unit are always accurately evaluated.

Figure 7: Learning curves comparing Hold Advice to no advice. (Axes: training time in hours vs. episode duration in seconds; one set of curves is labeled Hold Advice, the other No Advice.)

It is clear from Figure 7 that the Hold Advice is helpful. Learning with this advice consistently outperforms learning without it. However, it is surprising that the players do not learn faster as a result.

Similarly, the results for the Quadrant Advice shown in Figure 8 demonstrate that this advice, while not speeding up learning, helps the learners to perform better than without it.

Figure 8: Learning curves comparing Quadrant Advice to no advice.

Figure 9: Learning curves comparing Lane Advice to no advice.

Figure 9 shows that the Lane Advice is also helpful. However, the performance improvement is not as dramatic as in the previous cases.

Finally, from Figure 10, we see that the learners did not find the Edge Advice to be consistently beneficial. However, it appears that in one learning trial, the keepers were able to benefit from the advice.

Figure 10: Learning curves comparing Edge Advice to no advice.

Additional Experiments

After establishing that several different kinds of advice are beneficial in isolation, we started exploring the possibility of combining the advice. We ran several experiments in which two or more pieces of advice were active at the same time. Typically, while the learners still performed better than with no advice, the results are not as good as those obtained with each piece of advice activated individually.

A possible explanation for this result is that when two pieces of advice that recommend different actions are triggered at the same time, they effectively cancel each other out. Additional work is needed to fully understand and resolve this difficult issue. We plan to continue to explore ways to combine advice to achieve the desired additive effect that has been reported in other work (Maclin 1995).

Conclusion and Future Work

Allowing humans to provide high-level advice to their software assistants is a valuable way to improve the dialog between humans and the software they use. It changes the metaphor from that of giving commands to computers to that of giving them advice. By being able to accept, adapt, and even discard advice, advice-taking systems have the potential to radically change how we interact with robots and software agents.

We have empirically investigated the idea of giving advice to an adaptive agent that learns how to use the advice effectively. We show that some simple, intuitive advice can substantially improve a state-of-the-art reinforcement learner on a challenging, dynamic task. Several pieces of advice were shown to improve performance on the RoboCup keepaway task, and we plan to continue extending our work on advisable reinforcement learning to cover the complete simulated RoboCup task.

We are currently investigating additional ways of mapping English statements into formal advice and alternate approaches for using advice in reinforcement learners. We are developing multiple approaches to automatically learning to translate natural language to semantic representations, and we will evaluate them on our assembled English/CLANG corpus. Besides extending how we use advice with a CMAC-based learner (e.g., by modifying the weights in the CMAC directly or changing the learner's exploration function to give higher consideration to advised actions), we are also investigating the use of knowledge-based support vector machines (Fung, Mangasarian, & Shavlik 2002) and instructable agents that use relational learning methods (Dzeroski, Raedt, & Driessens 2001).

Acknowledgements

We would like to thank Ruifang Ge, Rohit Kate, and Yuk Wah Wong for contributing to the work on natural-language understanding. This research was supported by DARPA Grant HR0011-02-1-0007 and by NSF CAREER award IIS-0237699.

References

Albus, J. S. 1981. Brains, Behavior, and Robotics. Peterborough, NH: Byte Books.

Charniak, E. 2000. A maximum-entropy-inspired parser. In Proceedings of the Meeting of the North American Association for Computational Linguistics.

Chen, M.; Foroughi, E.; Heintz, F.; Kapetanakis, S.; Kostiadis, K.; Kummeneje, J.; Noda, I.; Obst, O.; Riley, P.; Steffens, T.; Wang, Y.; and Yin, X. 2003. Users manual: RoboCup soccer server manual for soccer server version 7.07 and later. Available at http://sourceforge.net/projects/sserver/.

Clouse, J., and Utgoff, P. 1992. A teaching method for reinforcement learning. In Proceedings of the Ninth International Conference on Machine Learning, 92–101.

Dzeroski, S.; Raedt, L. D.; and Driessens, K. 2001. Relational reinforcement learning. Machine Learning 43:7–52.

Eliassi-Rad, T., and Shavlik, J. 2003. A system for building intelligent agents that learn to retrieve and extract information. International Journal on User Modeling and User-Adapted Interaction, Special Issue on User Modeling and Intelligent Agents 13:35–88.

Fung, G. M.; Mangasarian, O. L.; and Shavlik, J. W. 2002. Knowledge-based support vector machine classifiers. In Advances in Neural Information Processing Systems 14. MIT Press.

Jurafsky, D., and Martin, J. H. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Kate, R. J.; Wong, Y. W.; Ge, R.; and Mooney, R. J. 2004. Learning transformation rules for semantic parsing. Under review; available at http://www.cs.utexas.edu/users/ml/publication/nl.html.

Kuhlmann, G., and Stone, P. 2004. Progress in learning 3 vs. 2 keepaway. In Polani, D.; Browning, B.; Bonarini, A.; and Yoshida, K., eds., RoboCup-2003: Robot Soccer World Cup VII. Berlin: Springer Verlag.

Maclin, R., and Shavlik, J. 1994. Incorporating advice into agents that learn from reinforcements. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 694–699.

Maclin, R. 1995. Learning from instruction and experience: Methods for incorporating procedural domain theories into knowledge-based neural networks. Ph.D. Dissertation, Computer Sciences Department, University of Wisconsin, Madison, WI.

Noda, I.; Matsubara, H.; Hiraki, K.; and Frank, I. 1998. Soccer server: A tool for research on multiagent systems. Applied Artificial Intelligence 12:233–250.

Noelle, D., and Cottrell, G. 1994. Towards instructable connectionist systems. In Sun, R., and Bookman, L., eds., Computational Architectures Integrating Neural and Symbolic Processes. Boston: Kluwer Academic.

Norvig, P. 1992. Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp. San Mateo, CA: Morgan Kaufmann.

Siegelmann, H. 1994. Neural programming language. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 877–882.

Stone, P., and Sutton, R. S. 2001. Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the Eighteenth International Conference on Machine Learning, 537–544. San Francisco, CA: Morgan Kaufmann.

Stone, P.; Riley, P.; and Veloso, M. 2000. The CMUnited-99 champion simulator team. In Veloso, M.; Pagello, E.; and Kitano, H., eds., RoboCup-99: Robot Soccer World Cup III. Berlin: Springer. 35–48.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Tang, L. R., and Mooney, R. J. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In Proceedings of the 12th European Conference on Machine Learning, 466–477.

Thompson, C. A., and Mooney, R. J. 2003. Acquiring word-meaning mappings for natural language interfaces. Journal of Artificial Intelligence Research 18:1–44.

Zelle, J. M., and Mooney, R. J. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1050–1055.
