Browse assignments

Assignment catalog

33,401 assignments available

[SOLVED] Cs 4/5789 – programming assignment 4

In this section, we will walk through Natural Policy Gradient (NPG) and also implement it for CartPole Simulation. 1.1 Background and Setting We will consider the default CartPole simulator in OpenAI gym where we have two discrete actions A = {0, 1} (here 0 and 1 are the index of actions, and physically, 0 means applying a left push to the cart, and 1 means applying a right push to the cart. You just need to compute a stochastic policy which samples 0 or 1, and feed it to the step function which will do the rest of the job for you). Before defining the policy parameterization, let us first featurize our state-action pair. Specifically, we will use random fourier feature (RFF) ϕ(s, a) ∈ R d , where d is the dimension of RFF feature. RFF is a randomized algorithm that takes the concatenation of (s, a) as input, outputs a vector ϕ(s, a) ∈ R d , such that it approximates the RBF kernel, i.e., for any (s, a),(s ′ , a′ ) pair, we have: lim d→∞ ⟨ϕ(s, a), ϕ(s ′ , a′ )⟩ = k ([s, a], [s ′ , a′ ]), where k is the RBF kernel on the concatenation of state-action (we denote [s, a] as the vector [s ⊤, a] ⊤).In summary, RFF feature approximates RBF kernel, but allows us to operate in the primal space rather than the dual space where we need to compute and invert the Gram matrix (recall Kernel trick and Kernel methods from the ML introduction course) which is very computationally expensive and does not scale well to large datasets.We parameterize our policy as follows: πθ(a|s) = exp θ ⊤ϕ(s, a) P a′ exp (θ⊤ϕ(s, a′)), where the parameters θ ∈ R d . Our goal is of course to find the best θ such that the resulting πθ achieves large expected total reward.1See https://pytorch.org/get-started/locally/ for installation for local machine. Recommended to have torch 1.11.0.1.2 Gradient of Policy’s Log-likelhood TODO Given (s, a), we need to first derive the expression ∇θ ln πθ(a|s). This is part of the previous written homework. Now go to utils.py to implement the computation of ∇θ ln πθ(a|s)in compute log softmax grad.You can also implement compute softmax and compute action distribution first to use them to calculate the gradient. You should pass the tests in test.py for these functions before moving on. Remark The file test.py comprises of tests for some functions. If your implementation is correct, the printed errors of all tests should be quite small, usually not larger than 1e-6.1.3 Fisher Information Matrix and Policy Gradient We consider the finite-horizon MDP. Note that in class we considered the infinite-horizon version, but they are quite similar. Let us now compute the fisher information matrix. Let us consider a policy πθ. Recall the definition of Fisher information matrix: Fθ = Es,a∼d πθ µ0 h ∇θ ln πθ(a|s) (∇θ ln πθ(a|s))⊤ i ∈ R d×d .We approximate Fθ using trajectories sampled from πθ. We first sample N trajectories τ 1 , . . . , τ N from πθ, where τ i = {s i h , ai h , ri h } H−1 h=0 with s i 0 ∼ µ0. We approximate Fθ using all (s, a) pairs: Fbθ = 1 N X N i=1 ” 1 H H X−1 h=0 ∇θ ln πθ(a i h |s i h )∇θ ln πθ(a i h |s i h ) ⊤ # + λI, where λ ∈ R + is the regularization for forcing positive definiteness.Remark Note the way we estimate the fisher information. Instead of doing the roll-in procedure we discussed in the class to get s, a ∼ d πθ µ0 (this is the correct way to ensure the samples are i.i.d), we simply sample N trajectories, and then average over all state-action (sh, ah) pairs from all N trajectories. This way, we lose the i.i.d property (these state-action pairs are dependent), but we gain sample efficiency by using all data.TODO First, go to train.py and finish the sample function to sample trajectories using the current policy. This function should rollout N trajectories and keep track of the gradients and rewards collected during each trajectory. Then go to file utils.py and implement Fbθ in compute fisher matrix. Note that in OpenAI Gym CartPole, there is a termination criteria when the pole or the cart is too far away from the goal position (i.e., during execution, if the termination criteria is met or it hits the last time step H, the simulator will return done = True in step function.During generating a trajectory, it is possible that we will just terminate the trajectory early since we might meet the termination criteria before getting to the last time step H). Hence, we will see that when we collect trajectories, each trajectory might have different lengths.Thus, for estimating Fθ, we need to properly average over the trajectory length. You should pass the tests for compute fisher matrix. The test solutions were computed using the default lambda value of 1e-3. We don’t have tests for sample so make sure that is correct before moving on.TODO Denote V θ as the objective function V θ = E hPH−1 h=0 r(sh, ah)|s0 ∼ µ0, ah ∼ πθ(·|sh) i .Again be mindful that each trajectory might have different lengths due to early termination. Let us implement the policy gradient, i.e., ∇bV θ = 1 N X N i=1 ” 1 H H X−1 h=0 ∇θ ln πθ(a i h |s i h ) ” ( H X−1 t=h r i t ) − b ## , where b is a constant baseline b = PN i=1 R(τ i )/N, i.e., the average total reward over a trajectory. Go to utils.py to implement this PG estimator in compute value gradient. There are tests in test.py for this function.1.4 Implement the step size With Fbθ and ∇b θV θ , recall that NPG has the following form: θ ′ := θ + ηFb−1 θ ∇bV θ . 2We need to specify the step size η here. Recall the trust region interpretation of NPG. We perform incremental update such that the KL divergence between the trajectory distributions of the two successive policies are not that big. Recall that the KL divergence KL(ρ πθ |ρ πθ′ ) can be approximate by Fisher information matrix as follows (ignoring constant factors): KL(ρ πθ |ρ πθ′ ) ≈ (θ − θ ′ ) ⊤ Fθ(θ − θ ′ ).As we explained in the lecture, instead of setting learning rate as the hyper-parameter, we set the trust region (which has a more transparent interpretation) size as a hyper parameter, i.e., we set δ such that: KL(ρ πθ |ρ πθ′ ) ≈ (θ − θ ′ ) ⊤ Fbθ(θ − θ ′ ) ≤ δ. Since θ ′ − θ = ηFb−1 θ ∇bV θ , we have: η 2 (∇bV θ ) ⊤Fb−1 θ ∇bV θ ≤ δ. Solving for η we get η ≤ q δ (∇bV θ)⊤Fb−1 θ ∇bV θ . We will just set η = q δ (∇bV θ)⊤Fb−1 θ ∇bV θ , i.e., be aggressive on setting learning rate while subject to the trust region constraint.To ensure numerical stability when the denominator is close to zero, we add ϵ = 1e − 6 to the denominator so the expression we will use is η = s δ (∇bV θ)⊤Fb−1 θ ∇bV θ + ϵ (1) TODO Now go to utils.py and implement these step size computation in compute eta. Check your implementation with the tests in test.py.1.5 Putting Everything Together Now we can start putting all pieces together. Go to train.py to implement the main framework in train. In each iteration of NPG, we do the following 4 steps: • Collect samples by rolling out N trajectories with the current policy using the sample function. • Compute the fisher matrix using the gradients from the previous steps. Use the default value of λ. • Compute the step size for this NPG step. • Update the model paramaters, θ, by taking an NPG step.In addition to the above 4 steps, keep track of the average episode reward in each iteration of the algorithm. The output of train should be the final model parameters and a list containing the average episode rewards for each step of NPG. For this section, run the above algorithm with the parameters T = 20, δ = 0.01, λ = 0.001, and N = 100. Plot the performance curve of each πθt for t = 0 to 19. You should be able to do this by simply running train.py from the terminal. The training output θ will be saved in folder learned policy/NPG/ and will be needed in the subsequent DAgger task.Hint Your algorithm should achieve average reward of over 190 in about 15 steps if implemented correctly.Note DAgger has not yet been covered in class so far, but it will be covered in a future lecture. In this section, we will implement one classical imitation learning methods – DAgger – and get you started on implementing it in PyTorch. We will use the NPG policy that you trained in the previous sections as the expert policy π ∗ . We will be focusing on the CartPole environment as well.2.1 Background and Setting The goal in imitation learning is, given a set of N expert datapoints D∗ = {s ∗ i , a∗ i } N i=1 from some unknown, black-box expert policy π ∗ , we want to learn a policy π that is as good as the expert.DAgger is a method of interactive imitation learning that is similar to behavioral cloning, but we can query the expert on state-action pairs seen when rolling out our policy. In this way, when rolling out policy, we can essentially get an expert rollout R∗ = {si , π∗ (si)} H i=1, and we can “aggregate” this rollout into our dataset and learn from all our experience collected so far.Partial observation of the learning agent. To demonstrate that imitation learning is powerful, we will constrain the learner’s observation to a subset of the expert’s, yet the learner is still capable of attaining comparable performance. Specifically, recall that the state of CartPole comprises four components: Cart Position, Cart Velocity, Pole Angle, and Pole Angular Velocity. Our goal is to demonstrate that, although the expert observes the full state, the policy learned by DAgger works well even when the learning agent is only able to observe a subset of the state.2.2 Implementation You will need to implement functions in dagger.py, which contains functions defining our DAgger learning agents and the training script. Specifically, please implement the following functions: • DAgger/sample from logits: This method should take in action logits (of size (B, Da)), and sample an action according to the distribution defined by said logits. Specifically, you will need to sample from the distribution Pr(a) = exp (ιa) /( P a′ exp (ιa′ )), where ι denotes the logits.• DAgger/rollout: This method takes the environment and the number of steps to roll out in the environment as inputs, and should return a set of data points (s, π∗ (s)) by rolling out in the environment. Specifically, this can be done in two steps: (1) roll out by taking actions according to the current policy (see self.policy definition in the class), and (2) get the expert’s action for each state (see self.expert policy definition in the class).• DAgger/learn: This method should take in a batch of state and actions and perform a step of gradient descent over the batch. Specifically, you need to compute the action logits for the given states using the current policy, compute the cross entropy loss (see self.loss) using the logits and the expert actions, and perform one step of gradient descent.• experiment: For a number of epochs, collect data while also training the agent on the new data. Specifically, first, we have to add our newly rolled out data into the dataset (see dataset.py for more information), and then create the dataloader to process our data. The dataloader has to be created every time before learning can begin due to the changing size of our dataset. Then, we perform supervised learning, i.e., we get batches from dataloader and perform gradient descent using learner.learn.2.3 Testing You can train the agent by running python dagger.py –dagger epochs 20 –num rollout steps 200 –dagger supervision steps 20 –state to remove x where x is an integer that could be 0,1,2, or 3, corresponding to which state component you want to remove (as we said earlier, we want to constrain the learner’s observation). Once done training, please test your agent via running test dagger.py. You may also need to specify the argument state to remove for testing.The parameter state to remove should be consistent between the training and testing. In other words, we should always remove the same state component during both the training and testing. Other arguments for testing are the same ones specified by the get args function in utils.py.Please try to remove each of the four state components, and then report the average reward in the writeup. In other words, please run the training and testing for four times with –state to remove set to 0,1,2, and 3, respectively, and record the results. Please also save the four trained models with the names x-CartPole-v0.pt (where x=0,1,2, or 3 representing the removed state component) under the folder learned policies/dagger/.Please state in the writeup what you have discovered from these four trials. Which state component can be removed without compromising the performance, and which one impacts the performance the most when removed? Hint You may want to verify your DAgger implementation by running the following to train a DAgger agent with full state information: python dagger.py –dagger epochs 20 –num rollout steps 200 –dagger supervision steps 20 If your implementation is good enough, the agent will achieve an average reward of at least 190.Section 3: Submission YOUR NET ID/ learned policies/ dagger 0-CartPole-v0.pt 1-CartPole-v0.pt 2-CartPole-v0.pt 3-CartPole-v0.pt NPG expert theta.npy answers.pdf common.py dagger.py dataset.py README.md requirements.txt test dagger.py test info.pkl test.py train.py utils.py where answers.pdf should contain your plot for Section 1.5 and your discussion for Section 2.3.