Online Markov Decision Processes under Bandit Feedback

We consider online learning in finite stochastic Markovian environments where, in each time step, a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, it does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, show that after T time steps the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved convergence rate result for the problem.
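
The interaction protocol described in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm proposed in the paper: the agent here simply acts uniformly at random, and the two-state MDP, the names `simulate`, `stationary`, and `uniform_agent`, and the random reward sequence are all hypothetical choices made for illustration. The sketch only shows what "bandit feedback" means (one entry of each reward function is revealed) and what the agent is compared against (the best stationary policy on the same reward sequence).

```python
import random

def simulate(policy, P, rewards, x0=0, seed=0):
    """Run one trajectory; return the total reward and the bandit feedback seen."""
    rng = random.Random(seed)
    x, total, feedback = x0, 0.0, []
    for r_t in rewards:                 # r_t[x][a]: reward function for this step
        a = policy(x, rng)
        reward = r_t[x][a]              # only this single entry of r_t is revealed
        feedback.append((x, a, reward))
        total += reward
        # Known transition kernel: P[x][a] is a distribution over next states.
        x = rng.choices(range(len(P[x][a])), weights=P[x][a])[0]
    return total, feedback

# Hypothetical 2-state, 2-action MDP (an assumption, not taken from the paper).
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.9, 0.1], [0.1, 0.9]]]

# Oblivious adversary: the whole reward sequence is fixed before interaction.
T = 1000
adversary_rng = random.Random(42)
rewards = [[[adversary_rng.random() for _ in range(2)] for _ in range(2)]
           for _ in range(T)]

def stationary(action_for_state):
    """Deterministic stationary policy: the action depends only on the state."""
    return lambda x, rng: action_for_state[x]

def uniform_agent(x, rng):
    """Placeholder learner that ignores feedback and acts uniformly at random."""
    return rng.randrange(2)

agent_total, agent_feedback = simulate(uniform_agent, P, rewards)
best_total = max(simulate(stationary([a0, a1]), P, rewards)[0]
                 for a0 in range(2) for a1 in range(2))
# Single-run gap; the regret in the abstract is defined in expectation.
print("single-run gap to best stationary policy:", best_total - agent_total)
```

The printed gap comes from single trajectories and only deterministic stationary policies are enumerated, so it is a rough stand-in for the expected regret that the O(T^{2/3} (ln T)^{1/3}) bound refers to.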

Channel: VideoLectures

Category: Educational

Date Found: March 28, 2011

Date Produced: March 25, 2011
