In RL: continuous policy π(s) → deterministic policy gradient. Discrete policy → softmax needed
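
A minimal sketch of the two cases, in plain Python. Everything here is an illustrative assumption rather than part of the note: the function names (`softmax_score`, `dpg_gradient`), the scalar linear policy a = theta * s, and the toy critic Q(s, a) = -(a - 1)^2. For the discrete case, the softmax turns logits into a differentiable distribution whose log-probability gradient with respect to the logits is onehot(a) - π. For the continuous case, the action itself is differentiable in the parameters, so the chain rule dQ/dθ = dQ/da · da/dθ applies directly.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution over discrete actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_score(logits, action):
    """Gradient of log pi(action) with respect to the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def dpg_gradient(theta, state, dq_da):
    """Deterministic gradient for an assumed scalar linear policy a = theta * state.

    Chain rule: dQ/dtheta = dQ/da * da/dtheta, where da/dtheta = state.
    """
    action = theta * state
    return dq_da(action) * state


def toy_dq_da(action):
    """Derivative of the assumed toy critic Q(s, a) = -(a - 1)^2 with respect to a."""
    return -2.0 * (action - 1.0)


# Discrete case: three actions with equal logits, action 1 was taken.
probs = softmax([0.0, 0.0, 0.0])
score = softmax_score([0.0, 0.0, 0.0], action=1)

# Continuous case: the policy outputs a = 0.25 * 2.0 = 0.5.
grad = dpg_gradient(theta=0.25, state=2.0, dq_da=toy_dq_da)
```

The score vector sums to zero, so raising the probability of the taken action necessarily lowers the others. The deterministic gradient needs no sampling over actions, but it relies on the critic being differentiable in the action.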
