6 changes: 3 additions & 3 deletions Practical_4_Reinforcement_Learning.ipynb
@@ -795,7 +795,7 @@
" \n",
" def end_episode(self, final_reward): \n",
" \"\"\"At the end of an episode, we compute the loss for the episode and take a \n",
" step in parameter speace in the direction of the gradients.\"\"\"\n",
" step in parameter space in the direction of the gradients.\"\"\"\n",
" \n",
" # Compute the return (cumulative discounted reward) for the episode\n",
" episode_return = sum(self._rewards) + final_reward # Assuming \\gamma = 1\n",
@@ -831,7 +831,7 @@
},
"cell_type": "markdown",
"source": [
"Notice that during the episode we run only the forward-pass of the policy network (inference). At the end of the episode, we replay the states that occured during the episode and run both the forward and backward pass of the policy network (notice the gradient tape!) because we can only compute the loss once we have the episode return at the end of the episode. If the policy network is very complex, this could be inefficient. In that case you could run both the forward an backward pass during the episode and store intermediate gradients/partial derivatives to use in the update at the end of the episode."
"Notice that during the episode we run only the forward-pass of the policy network (inference). At the end of the episode, we replay the states that occured during the episode and run both the forward and backward pass of the policy network (notice the gradient tape!) because we can only compute the loss once we have the episode return at the end of the episode. If the policy network is very complex, this could be inefficient. In that case you could run both the forward and backward pass during the episode and store intermediate gradients/partial derivatives to use in the update at the end of the episode."
]
},
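As a rough illustration of the replay step described in the cell above, the sketch below re-runs the stored states under a `tf.GradientTape` once the episode return is known and applies a REINFORCE-style update. All names here (`policy_net`, `optimizer`, `states`, `actions`, `episode_return`) are assumptions for illustration, not the notebook's actual class attributes:

```python
import tensorflow as tf

def replay_and_update(policy_net, optimizer, states, actions, episode_return):
    """Sketch: forward + backward pass over the whole episode at episode end."""
    with tf.GradientTape() as tape:
        logits = policy_net(tf.stack(states))           # shape (T, n_actions)
        log_probs = tf.nn.log_softmax(logits)
        # Pick out the log-probability of each action actually taken
        taken = tf.one_hot(actions, depth=logits.shape[-1])
        log_prob_taken = tf.reduce_sum(log_probs * taken, axis=1)
        # REINFORCE-style loss: return-weighted negative log-likelihood
        loss = -episode_return * tf.reduce_sum(log_prob_taken)
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
```

The alternative the cell mentions — computing per-step gradients during the episode and combining them at the end — avoids this second forward pass at the cost of storing intermediate gradients.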
{
@@ -934,4 +934,4 @@
"outputs": []
}
]
}
}