Recent Forum Posts

Regarding noise:
1. Both noising techniques are valid and will be accepted as an answer. Indeed, adding noise to the Q-table and then choosing greedily seems to yield better results. Hint: the following noise added to the Q-table row works well: np.random.randn(1,env.action_space.n)*(1./(i+1)) (see the sketch after this list).

2. Don't use both noising techniques together!!

3. The y variable is the discount factor gamma.

4. The meaning of the comment "# 5. Update episode if we reached the Goal State" is to make sure that you (a) update the total reward (the rAll variable) and (b) end the episode if you reached the terminal/goal state.
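Here is a minimal sketch of how the points above fit together in the tabular Q-learning loop. It assumes the gym FrozenLake-v0 environment with the old-style step/reset signatures; the names y, lr, num_episodes, rAll and rList follow the starter code, but the specific values are illustrative.

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
Q = np.zeros([env.observation_space.n, env.action_space.n])
y = 0.95             # point 3: y is the discount factor gamma
lr = 0.8             # learning rate (illustrative value)
num_episodes = 2000
rList = []
for i in range(num_episodes):
    s = env.reset()
    rAll = 0
    done = False
    while not done:
        # point 1: act greedily on the Q-table row plus decaying noise
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
        s1, r, done, _ = env.step(a)
        # tabular Q-learning update, with y playing the role of gamma
        Q[s, a] += lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        rAll += r    # point 4(a): accumulate the episode's total reward
        s = s1
    rList.append(rAll)   # point 4(b): the while loop ends the episode at any terminal state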

Re: Programming Q2 1 by adampolyak, 20 May 2018 11:20
Re: HW3, Q 1
adampolyak 20 May 2018 11:09
in discussion Discussions / HW3 » HW3, Q 1

1. Yes, the meaning is negation: it counts how many trials of training occurred without significant learning.

2. There may be a bug in the setting of the random seed… Generally in RL, algorithms are run multiple times and the mean/std are reported.
Make sure that your results show that learning has occurred; the numbers don't need to be identical on each run (see the sketch below).
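A small sketch of what "run multiple times and report mean/std" can look like; run_control below is a hypothetical wrapper around a single training run of control.py, not part of the starter code.

import numpy as np

results = []
for seed in range(5):
    np.random.seed(seed)              # vary the seed between runs
    results.append(run_control())     # hypothetical: returns e.g. the number of failures
results = np.array(results)
print("failures: mean %.1f, std %.1f" % (results.mean(), results.std()))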

Re: HW3, Q 1 by adampolyak, 20 May 2018 11:09

A simple matrix is enough (a linear model).

An answer with one *hidden* layer will also be accepted.
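As an illustration only, a single-matrix (linear) Q-model over a one-hot state can look like the sketch below; the sizes, learning rate and helper names are placeholders, not values from the assignment.

import numpy as np

n_states, n_actions = 16, 4              # e.g. FrozenLake sizes
W = np.zeros((n_states, n_actions))      # the single matrix = linear model

def q_values(s):
    x = np.eye(n_states)[s]              # one-hot encoding of the state
    return x @ W                         # Q(s, .) is one matrix product

def q_update(s, a, target, lr=0.1):
    # one gradient step on (target - Q(s, a))^2 with respect to W
    x = np.eye(n_states)[s]
    W[:, a] += lr * (target - q_values(s)[a]) * x

The one-hidden-layer variant simply inserts a second matrix and a nonlinearity between the one-hot input and the Q outputs.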

Re: Q2 (b) - network by adampolyak, 20 May 2018 11:06
rafi levy (guest) 15 May 2018 20:29
in discussion Discussions / HW3 » Programming Q2 1

Hello,

Another question on Q2, 1

About choosing the action: if we apply greedy selection to the "noisy" Q-values, then we should not, in addition, choose a randomized action with probability epsilon. Correct? Conversely, does applying (pure) greedy selection with probability 1 - eps and a randomized action with probability eps somehow represent the noise added to the intended action?
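For contrast with the noisy-greedy choice discussed above, a plain epsilon-greedy selection (the alternative technique, which should not be combined with Q-table noise) is just the following sketch; the function name is illustrative.

import numpy as np

def epsilon_greedy(Q_row, eps):
    # with probability eps explore uniformly, otherwise act greedily on Q(s, .)
    if np.random.rand() < eps:
        return np.random.randint(len(Q_row))
    return int(np.argmax(Q_row))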

thanks
Rafi

by rafi levy (guest), 15 May 2018 20:29
rafi levy (guest) 15 May 2018 20:01
in discussion Discussions / HW3 » Programming Q2 1

Hello,

On Q2, section 1.

What is the meaning of the comment in the code: # 5. Update episode if we reached the Goal State?
It is clear that the episode may not reach the goal (if it reaches a hole), but I am not sure I understand how this fact matters to our computations. Should we not add rAll (=0) to rList? Should we somehow subtract the i index? I believe num_episodes refers to any ended episode, whether the ending state is a hole or a goal. Correct?

by rafi levy (guest), 15 May 2018 20:01
HW3, Q 1
rafi levy (guest) 15 May 2018 14:37
in discussion Discussions / HW3 » HW3, Q 1

Hello,

1. Does "no" in the variable consecutive_no_learning_trials mean a negation, rather than an abbreviation of "number" (no.)?

2. Upon multiple runs of python control.py, the results are not deterministic. In fact, the differences are significant in terms of 'number of failures'. Does that make sense? Since we are using the same seed, I would have expected to get the same results.

thanks,
Rafi

HW3, Q 1 by rafi levy (guest), 15 May 2018 14:37
Q2 (b) - network
Alon (guest) 14 May 2018 14:10
in discussion Discussions / HW3 » Q2 (b) - network

Hi, you wrote to implement a "one layer network". Do you mean a simple matrix (i.e. a linear model) or a one *hidden* layer network (i.e. two matrices with an activation between them)?

Thanks,
Alon

Q2 (b) - network by Alon (guest), 14 May 2018 14:10

I would like to join the question, and also to ask whether the "y" learning parameter in tabular_Q.py refers to the discount factor for future rewards.

Re: Programming Q2 1 by Guy Smoilovski, 12 May 2018 11:58

Sorry for the late reply, obviously it's working now.

Programming Q2 1
Guy (guest) 10 May 2018 07:24
in discussion Discussions / HW3 » Programming Q2 1

In subsection 1: does choosing greedily with noise from the Q-table mean adding noise to the Q-table and then choosing, or choosing using an epsilon-greedy policy?
When using a plain greedy policy I get pretty bad results (almost always 0 successful episodes), and when adding noise to the Q function the results improve substantially.

Thanks

Programming Q2 1 by Guy (guest), 10 May 2018 07:24
mansour 09 May 2018 06:40
in discussion Discussions / General » Lecture 6 Scribe

Look at the paper:
Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning 22 (1996): 123-158.

Specifically, Section 3.3.
More specifically, Theorem 7 (proved in Appendix A.4)
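For reference, a minimal sketch of the distinction that section discusses: in a tabular TD(lambda) update, accumulating and replacing traces differ only in how the trace of the just-visited state is set. The variable names and constants below are illustrative, not taken from the paper.

import numpy as np

n_states = 10
V = np.zeros(n_states)                 # state-value estimates
e = np.zeros(n_states)                 # eligibility traces
alpha, gamma, lam = 0.1, 0.99, 0.9

def td_lambda_step(s, r, s_next, replacing=True):
    delta = r + gamma * V[s_next] - V[s]
    if replacing:
        e[s] = 1.0                     # replacing trace (Singh & Sutton, Sec. 3.3)
    else:
        e[s] += 1.0                    # accumulating trace
    V[:] += alpha * delta * e
    e[:] *= gamma * lam                # decay all traces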

by mansour, 09 May 2018 06:40
Guy (guest) 08 May 2018 19:18
in discussion Discussions / General » Lecture 6 Scribe

I understand how to get the results in slide 50, but the equations in slide 52 are still not clear to me.
For example, as I wrote in the message above, I get a different value. (I tried using the same reasoning, namely that the expected episode length is 1/p.)

by Guy (guest), 08 May 2018 19:18

Look at slide 50, there is a simpler case there that has all the ingredients.

Re: Lecture 6 Scribe by mansour, 08 May 2018 17:24
Lecture 6 Scribe
Guy (guest) 08 May 2018 12:39
in discussion Discussions / General » Lecture 6 Scribe

I would appreciate an explanation of how the expressions for the MDP in Section 6.5.5 (or in slide 52 of Lecture 6) were derived.
Specifically, for the computation of V(s_1) according to the MDP, isn't it the expression:

$V(s_1) = \frac{1}{p} r_1 + r_2$

The expected length of the episode is $\frac{1}{p}$ and the reward at the end is r_2.
For the other expressions, I did not understand how they were derived.
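For completeness, the geometric-expectation step behind the expression above, assuming r_1 is collected on every step spent in $s_1$, the episode terminates with probability p at each step, and there is no discounting:

$\mathbb{E}[\text{steps in } s_1] = \sum_{k=1}^{\infty} k\, p\, (1-p)^{k-1} = \frac{1}{p}$, hence $V(s_1) = \frac{r_1}{p} + r_2$.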

Thanks

Lecture 6 Scribe by Guy (guest), 08 May 2018 12:39

The dates were fixed.
Is the PDF link still broken? It works for me.

Re: Problems with HW3 files by adampolyak, 07 May 2018 07:26
Re: Prog Q1 (2)
adampolyak 07 May 2018 07:25
in discussion Discussions / HW3 » Prog Q1 (2)

Save and submit.
No need to change the plot code.

Re: Prog Q1 (2) by adampolyak, 07 May 2018 07:25
Prog Q1 (2)
anon (guest) 06 May 2018 10:48
in discussion Discussions / HW3 » Prog Q1 (2)

The assignment says: 'Plot a learning curve showing the number of time-steps for which the pole was balanced on each trial. Python starter code already includes the code to plot. Submit the plot.'

The code itself plots the log of these values and some smoothing (in the dotted line). Should I change it to fit the requirements (remove the log function and the smoothing), or should I just save and submit the plot as it is?

Prog Q1 (2) by anon (guest), 06 May 2018 10:48

Hi,
It looks like the "due" and "published" columns of the HW3 entry in the table are reversed, and the link to the PDF is broken.

Problems with HW3 files by Guy Smoilovski, 04 May 2018 15:50

Hi all,

We've received multiple questions regarding Q1 in the theoretical part, about the infinite horizon scenario (as was sent in the clarification mail). As many of you pointed out, the discount factor complicates the solution.

Therefore:
1. Solve Q1 in the finite horizon scenario.
2. Q1 will be a bonus question.
3. We will extend the submission deadline of HW2 to Wednesday (25/4/2018) for those who want to finish the exercise.

Good luck

נגה (guest) 20 Apr 2018 11:56
in discussion Discussions / HW2 » Question 2, section 2

The digit is part of the current state definition, so I think it is OK for the optimal policy to also depend on the digit :/

by נגה (guest), 20 Apr 2018 11:56