Unsupervised Learning, Recommenders, Reinforcement Learning: Coursera Week 3 Answers
Reinforcement learning introduction
1. You are using reinforcement learning to control a four-legged robot. The position of the robot would be its _____.
- reward
- action
- state
- return
2. You are controlling a Mars rover. You will be very, very happy if it gets to state 1 (significant scientific discovery), slightly happy if it gets to state 2 (small scientific discovery), and unhappy if it gets to state 3 (rover is permanently damaged). To reflect this, choose a reward function so that:
- R(1) > R(2) > R(3), where R(1), R(2) and R(3) are negative.
- R(1) > R(2) > R(3), where R(1), R(2) and R(3) are positive.
- R(1) > R(2) > R(3), where R(1) and R(2) are positive and R(3) is negative.
- R(1) < R(2) < R(3), where R(1) and R(2) are negative and R(3) is positive.
3. You are using reinforcement learning to fly a helicopter. Using a discount factor of 0.75, your helicopter starts in some state and receives rewards -100 on the first step, -100 on the second step, and 1000 on the third and final step (where it has reached a terminal state). What is the return?
- -100 - 0.75*100 + 0.75^2*1000
- -0.75*100 - 0.75^2*100 + 0.75^3*1000
- -0.25*100 - 0.25^2*100 + 0.25^3*1000
- -100 - 0.25*100 + 0.25^2*1000
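With γ = 0.75 and the first reward undiscounted (which is how the answer options are written), the return is R1 + γ·R2 + γ²·R3. A minimal sketch of that calculation in plain Python:

```python
# Discounted return for the helicopter example: G = r1 + gamma*r2 + gamma^2*r3
gamma = 0.75
rewards = [-100, -100, 1000]             # rewards on steps 1, 2, 3

G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)                                  # -100 - 0.75*100 + 0.75**2*1000 = 387.5
```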
4. Given the rewards and actions below, compute the return from state 3 with a discount factor of γ=0.25.
- 25
- 0.39
- 6.25
- 0
State-action value function
5. Which of the following accurately describes the state-action value function Q(s,a)?
- It is the return if you start from state s, take action a (once), then behave optimally after that.
- It is the return if you start from state s and repeatedly take action a.
- It is the return if you start from state s and behave optimally.
- It is the immediate reward if you start from state s and take action a (once).
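Equivalently, Q(s, a) can be written recursively as Q(s, a) = R(s) + γ max_{a′} Q(s′, a′), where s′ is the state reached by taking action a in state s; this is the same Bellman relation used to construct the training targets in question 9.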
6. You are controlling a robot that has 3 actions: ← (left), → (right) and STOP. From a given state s, you have computed Q(s, ←) = -10, Q(s, →) = -20, Q(s, STOP) = 0.
What is the optimal action to take in state s?
- STOP
- ← (left)
- → (right)
- Impossible to tell
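Choosing the optimal action is just an argmax over the Q-values listed in the question; a minimal sketch (the action names are only labels):

```python
# Optimal action = argmax over a of Q(s, a) for the three actions in the question.
q_values = {"left": -10, "right": -20, "stop": 0}

best_action = max(q_values, key=q_values.get)
print(best_action)    # "stop", since 0 > -10 > -20
```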
7. For this problem, γ=0.25. The diagram below shows the return and the optimal action from each state. Please compute Q(5, ←).
- 0.625
- 0.391
- 1.25
- 2.5
Continuous state spaces
8. The Lunar Lander is a continuous state Markov Decision Process (MDP) because:
- The state has multiple numbers rather than only a single number (such as position in the x-direction)
- The state-action value Q(s,a) function outputs continuous valued numbers
- The state contains numbers such as position and velocity that are continuous valued.
- The reward contains numbers that are continuous valued
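For context, here is a sketch of what a continuous state vector can look like. The exact layout below is the standard Gym LunarLander observation and is an assumption, not something given in the quiz:

```python
# Assumed layout (standard Gym LunarLander observation): position, velocity,
# angle, and angular velocity are continuous values; the last two entries are
# binary leg-contact flags.
import numpy as np

state = np.array([0.01, 1.40, 0.02, -0.35, 0.005, 0.01, 0.0, 0.0])
# [x, y, x_dot, y_dot, theta, theta_dot, left_leg_contact, right_leg_contact]
```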
9. In the learning algorithm described in the videos, we repeatedly create an artificial training set to which we apply supervised learning, where the input is x = (s, a) and the target, constructed using Bellman’s equations, is y = _____?
- y = R(s′), where s′ is the state you get to after taking action a in state s
- y = R(s) + γ max_{a′} Q(s′, a′), where s′ is the state you get to after taking action a in state s
- y = max_{a′} Q(s′, a′), where s′ is the state you get to after taking action a in state s
- y = R(s)
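A minimal sketch of how such a target could be computed, assuming a hypothetical q_network(s) helper that returns the current Q estimates for every action in state s (the terminal-state handling and the γ value below are assumptions, not given in the quiz):

```python
import numpy as np

gamma = 0.995   # assumed discount factor, not specified in the question

def bellman_target(reward, next_state, done, q_network):
    """y = R(s) + gamma * max_a' Q(s', a'); at terminal states y is just R(s)."""
    if done:
        return reward
    return reward + gamma * np.max(q_network(next_state))
```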
10. You have reached the final practice quiz of this class! What does that mean? (Please check all the answers, because all of them are correct!)
- The DeepLearning.AI and Stanford Online teams would like to give you a round of applause!
- You deserve to celebrate!
- What an accomplishment — you made it!
- Andrew sends his heartfelt congratulations to you!