Hi, your code prints:
"
Action indices [0, 1, 2, 3] correspond to West, South, East and North.
mdp.P[state][action] is a list of tuples (probability, nextstate, reward).
For example, state 0 is the initial state, and the transition information for s=0, a=0 is
P[0][0] = [(0.3333333333333333, 0, 0.0), (0.3333333333333333, 0, 0.0), (0.3333333333333333, 4, 0.0)]
"
If you look at this example, the list contains two tuples with the same next state (the first two).
Shouldn't it be something like:
P[0][0] = [(0.6666666666666666, 0, 0.0), (0.3333333333333333, 4, 0.0)]
?
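For what it's worth, the two representations are equivalent: the duplicates can be collapsed by summing probabilities over identical (next_state, reward) pairs. A minimal sketch (the helper name `merge_transitions` is mine, not part of the environment's API), using the transition list from the question:

```python
from collections import defaultdict

def merge_transitions(transitions):
    """Combine tuples sharing the same (next_state, reward) by summing probabilities."""
    merged = defaultdict(float)
    for prob, next_state, reward in transitions:
        merged[(next_state, reward)] += prob
    # Preserve first-seen order of (next_state, reward) pairs.
    return [(prob, s, r) for (s, r), prob in merged.items()]

# P[0][0] as printed above (one tuple per slip direction).
p_0_0 = [(1/3, 0, 0.0), (1/3, 0, 0.0), (1/3, 4, 0.0)]
print(merge_transitions(p_0_0))
# -> [(0.6666666666666666, 0, 0.0), (0.3333333333333333, 4, 0.0)]
```

Either form gives the same expected values; the unmerged form just lists one tuple per possible slip outcome.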