$$ \newcommand{\dint}{\mathrm{d}} \newcommand{\vphi}{\boldsymbol{\phi}} \newcommand{\vpi}{\boldsymbol{\pi}} \newcommand{\vpsi}{\boldsymbol{\psi}} \newcommand{\vomg}{\boldsymbol{\omega}} \newcommand{\vsigma}{\boldsymbol{\sigma}} \newcommand{\vzeta}{\boldsymbol{\zeta}} \renewcommand{\vx}{\mathbf{x}} \renewcommand{\vy}{\mathbf{y}} \renewcommand{\vz}{\mathbf{z}} \renewcommand{\vh}{\mathbf{h}} \renewcommand{\b}{\mathbf} \renewcommand{\vec}{\mathrm{vec}} \newcommand{\vecemph}{\mathrm{vec}} \newcommand{\mvn}{\mathcal{MN}} \newcommand{\G}{\mathcal{G}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\S}{\mathcal{S}} \newcommand{\diag}[1]{\mathrm{diag}(#1)} \newcommand{\diagemph}[1]{\mathrm{diag}(#1)} \newcommand{\tr}[1]{\text{tr}(#1)} \renewcommand{\C}{\mathbb{C}} \renewcommand{\R}{\mathbb{R}} \renewcommand{\E}{\mathbb{E}} \newcommand{\D}{\mathcal{D}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerbig}[1]{\left \langle #1 \right \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\two}{\mathrm{II}} \newcommand{\GL}{\mathrm{GL}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\grad}[1]{\mathrm{grad} \, #1} \newcommand{\gradat}[2]{\mathrm{grad} \, #1 \, \vert_{#2}} \newcommand{\Hess}[1]{\mathrm{Hess} \, #1} \newcommand{\T}{\text{T}} \newcommand{\dim}[1]{\mathrm{dim} \, #1} \newcommand{\partder}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\rank}[1]{\mathrm{rank} \, #1} $$

Reinforcement Learning with Python - Part 2

Introductory problem

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:

  • Black tile: the moving agent at its starting point, safe
  • White tiles: frozen surface, safe
  • Red tiles: holes, fall to your doom
  • Green tile: goal, where the frisbee is located

The episode ends when you reach the goal or fall into a hole. Consequently, you have to find a safe path to the frisbee and, if possible, the shortest one.
Of course, we assume that the agent doesn't know where the holes and the frisbee are.
The agent receives a reward of $\mathbf{+1}$ if it reaches the goal, and zero otherwise. We want to solve this problem using a Reinforcement Learning methodology. A Reinforcement Learning algorithm teaches an agent to maximize its gains in an environment where each action yields a reward (positive or negative). It implements the "agent-environment loop":

At each time step, the agent chooses an action, and the environment returns an observation and a reward. The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water (in red). Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile (in green).

Reinforcement Learning

Reinforcement learning (RL) is the subfield of machine learning concerned with decision making. It studies how an agent can learn to achieve goals in a complex, uncertain environment. The environment can be deterministic or stochastic, adversarial (a chess game) or not (a Tetris game), partially observable (mobile robotics), known or unknown to the decision-making agent.

According to Sutton and Barto, reinforcement learning refers to any adaptive method for solving a sequential decision problem. The term "adaptive" means that we start from an inefficient solution and improve it progressively, based on the experience of the agent (or agents).

Reinforcement learning consists in finding:

  • a sequence of actions which provides an optimal reward
  • a strategy in which the agent sometimes settles for small rewards in the short term in order to obtain much better rewards in the long term

It's exciting for two reasons:
  • RL is very general, encompassing all problems that involve making a sequence of decisions: for example, controlling a robot’s motors so that it’s able to run and jump, making business decisions like pricing and inventory management, or playing video games and board games. RL can even be applied to supervised learning problems with sequential or structured outputs.

  • RL algorithms have started to achieve good results in many difficult environments. RL has a long history, but until recent advances in deep learning it required lots of problem-specific engineering. DeepMind's StarCraft II and AlphaGo results both used deep RL algorithms which did not make too many assumptions about their environment, and thus can be applied in other settings.

RL algorithms can be viewed through the lens of finite Markov decision processes.

Finite Markov Decision Processes

At the beginning of the 20th century, the mathematician Andrey Markov studied memoryless stochastic processes, called Markov chains. Such a process has a fixed number of states and evolves randomly from one state to another at each step. The probability that it goes from state $S$ to state $S'$ is fixed and depends only on the pair $(S, S')$, not on past states: the system has no memory.

This Markov chain describes the probability of the various weather patterns given the current weather state. From the "sun" state, the probability of staying in that state is 50%, and the probability of passing to the "rain" state is 20%.
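
To make this more concrete, here is a minimal sketch of how such a memoryless chain can be simulated with numpy. The transition matrix below is only illustrative: the "sun" row follows the probabilities quoted above (with the remaining 30% assigned to a hypothetical third weather state), and the other rows are made-up values.

    import numpy as np

    # Illustrative transition matrix of a small weather Markov chain.
    # Rows = current state, columns = next state; each row sums to 1.
    # States: 0 = sun, 1 = rain, 2 = a hypothetical third weather state.
    T = np.array([[0.5, 0.2, 0.3],
                  [0.4, 0.4, 0.2],
                  [0.3, 0.3, 0.4]])

    rng = np.random.default_rng(0)
    state = 0                  # start in the "sun" state
    trajectory = [state]
    for t in range(10):
        # the next state depends only on the current state: no memory
        state = rng.choice(3, p=T[state])
        trajectory.append(state)
    print(trajectory)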

We now introduce the formal problem of finite Markov decision processes, or finite MDPs. They were first introduced by Richard Bellman in the 1950s. They are very similar to Markov chains, but at each step an agent can choose an action among several possible actions, and the transition probabilities depend on the current state and on the chosen action. In addition, certain transitions between states return a positive or negative reward. The agent's goal is then to find a policy that maximizes the rewards over time. We will use this strategy to solve the previous problem.

MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those, future rewards. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. MDPs are a mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made. We introduce key elements of the problem's mathematical structure, such as returns, value functions, and Bellman equations.

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.

More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0,1,2,3,\dots$ At each time step $t$, the agent receives some representation of the environment's state, $S_t$, and on that basis selects an action in $\{a_0,a_1,...,a_n\}$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $r_{t+1}$, and finds itself in a new state, $S_{t+1}$.

The MDP and agent together thereby give rise to a sequence or trajectory. In a finite MDP, the sets of states, actions, and rewards all have a finite number of elements. In this case, the random variables $r_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action.
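
Concretely, this trajectory is an alternating sequence of states, actions and rewards. Writing $A_t$ for the action chosen at step $t$ (to avoid confusion with the fixed action names $a_0,\dots,a_n$), it reads

$$ S_0, A_0, r_1, S_1, A_1, r_2, S_2, A_2, r_3, \dots $$

and the Markov property means that the distribution of $(S_{t+1}, r_{t+1})$ depends only on the pair $(S_t, A_t)$, not on the earlier history.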

In the following MDP Figure, the states are numbered $S_0$ to $S_{11}$. The purpose of the agent is to wander around the grid until it finally reaches the Diamond in state $S_{11}$. Under all circumstances, the agent should avoid the Fire tile in state $S_7$. There is also a blocked tile, state $S_5$, which acts like a wall: the agent cannot enter it. The agent can take any one of these actions:

$a_0$=UP, $a_1$=DOWN, $a_2$=LEFT and $a_3$=RIGHT

For example, from $S_0$ the agent can choose two actions: $a_0$ and $a_3$. If it chooses action $a_0$, it reaches state $S_4$ with probability 100% and without any reward.
We could also imagine that it reaches state $S_4$ with probability 90% and remains in state $S_0$ with probability 10%, even though it chose action $a_0$.
At state $S_{10}$, if it chooses action $a_3$, it gets a reward of $\mathbf{+1}$ with probability 100% and reaches the final state $S_{11}$. Moreover, if it chooses action $a_0$ at state $S_3$, it gets a reward of $\mathbf{-1}$ with probability 100%. In all other cases, the reward is zero.

Given a state $S_i$, what is the strategy that will get the best reward over time?

The Bellman equation

In the following, we assume that the agent knows the MDP of the system perfectly. We will see in the next part what kind of algorithms we can use when this is not the case. These algorithms are based on Richard Bellman's theory. We assume that we have $m+1$ different states. Richard Bellman found a way to estimate the optimal state value of any state $S_i$, denoted by $V^*(S_i)$: the sum of all discounted future rewards that the agent can expect on average after reaching state $S_i$, assuming that it acts optimally:

for all $i \in \{0,..,m\}:$ $$ V^*(S_i) = \max\limits_{a \in \{a_0,a_1,...,a_n\}} \sum_{j=0}^{m} \mathbb{P}(S_j | S_i,a)(r(S_i,a,S_j)+\gamma V^*(S_j)) $$

where

  • $\mathbb{P}(S_j | S_i,a)$ is the transition probability from state $S_i$ to state $S_j$ knowing that the agent chose action $a$. For example, $\mathbb{P}(S_1| S_0,a_0) = 0.0$ and $\mathbb{P}(S_4| S_0,a_0) = 1.0$ in the previous MDP Figure.

  • Function $r(S_i,a,S_j)$ is the reward obtained by going from state $S_i$ to state $S_j$ by choosing action $a$. For example, $r(S_0,a_3,S_1)=0.0$, $r(S_{10},a_3,S_{11})=1.0$ and $r(S_{3},a_0,S_{7})=-1.0$.

  • Coefficient $\gamma$ is the discount rate. It belongs to $[0,1]$. We will see the key role of this coefficient in the next exercises.

This equation is a necessary condition for optimality. It expresses the value of a decision problem at a certain point in time in terms of the payoff from some initial choices and the value of the remaining decision problem that results from those initial choices. This breaks a dynamic optimization problem into a sequence of simpler subproblems. This recursive equation says that if the agent acts in an optimal way, then the optimal value of the current state is equal to the reward it will obtain on average after carrying out an optimal action, plus the discounted optimal value expected over all the possible successor states to which this action can lead.
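
For instance, in the previous MDP Figure, choosing $a_3$ in state $S_{10}$ leads with probability 1 to the terminal state $S_{11}$ with reward $+1$, and no reward can be collected after the episode ends, so $V^*(S_{11}) = 0$. The Bellman equation then gives

$$ V^*(S_{10}) = \max\limits_{a} \sum_{j=0}^{m} \mathbb{P}(S_j | S_{10},a)\left(r(S_{10},a,S_j)+\gamma V^*(S_j)\right) = 1 + \gamma V^*(S_{11}) = 1, $$

the maximum being attained by $a_3$ (for $\gamma < 1$).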

This equation then leads us directly to the following algorithm, which allows us to estimate the optimal value of each of these $m+1$ states. We initialize these values to zero and adjust them iteratively. For this we use the following iteration algorithm:

for all $i \in \{0,..,m\}:$ \[ V_{k+1}(S_i) = \max\limits_{a \in \{a_0,a_1,...,a_n\}} \sum_{j=0}^{m} \mathbb{P}(S_j | S_i,a)(r(S_i,a,S_j)+\gamma V_k(S_j)) \] where $V_{k}(S_i)$ for $i=0,...,m$ are the estimated values of the states at the $k$-th iteration. Obviously, the sum can be restricted to the states that are actually reachable from each state $S_i$. As we can see, the values are obtained by a fixed-point iteration.
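
Before moving to the exercise, here is a minimal sketch of this fixed-point iteration with an explicit stopping test instead of a fixed number of iterations (the exercise below simply fixes niter); the arrays P and r are assumed to follow the [i, a, j] ordering defined in the exercise, and actions[i] the list of allowed actions in state $S_i$.

    import numpy as np

    def value_iteration(P, r, actions, gamma=0.95, tol=1e-8, max_iter=1000):
        # Sketch of V_{k+1}(S_i) = max_a sum_j P[i,a,j]*(r[i,a,j] + gamma*V_k(S_j))
        m = P.shape[0]
        V = np.zeros(m)
        for k in range(max_iter):
            V_new = np.array([max((np.sum(P[s, a] * (r[s, a] + gamma * V))
                                   for a in actions[s]), default=0.0)
                              for s in range(m)])
            # stop as soon as the values no longer change significantly
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new
        return V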

Exercise [Bellman Equation]
In this exercise, we want to implement this Bellman value estimation in Python and apply it to the previous MDP Figure.
  • For this implementation, we need to define $\mathbb{P}(S_j | S_i,a)$ and $r(S_i,a,S_j)$. To do this, we create numpy arrays for these two functions. We assume that the value of $\mathbb{P}(S_j | S_i,a)$ is stored in the numpy array P[i,a,j], with the indices ordered in this way. You can use the numpy.full function to create an initial array of zeros, and then write a small loop that initializes all the probabilities. We assume here that the chosen action succeeds with probability 100%.

  • import numpy as np

    m = 12   # number of states S0..S11
    n = 4    # number of actions
    P = np.full([m,n,m],0.0)
    r = np.full([m,n,m],0.0)
    for i in range(m):
        for a in range(n):
            for j in range(m):
                # UP/DOWN move by one row (4 columns), LEFT/RIGHT by one column
                if (a==0 and j == i + n) or (a==1 and j == i - n) or (a==2 and j == i - 1) or (a==3 and j == i + 1):
                    # No transition out of the blocked state S5 or into it,
                    # and no transition out of the terminal state S11
                    if (i!=5 and j!=5 and i!=(m-1)):
                        P[i,a,j] = 1.0
    r[3,0,7] = -1.0
    r[6,3,7] = -1.0
    r[11,1,7] = -1.0
    r[7,0,11] = 1.0
    r[10,3,11] = 1.0
    
  • Define the MDP class where:
    - the attributes of the object are the numpy arrays P (the transition probability array), r (the reward array), and a list of the possible actions of each state. For instance, $S_0$ has two possible actions and $S_1$ has three possible actions, so actions=[[0,3],[0,2,3],...].

    - the Bellman method returns a list that contains the estimated optimal value of each state after niter iterations for a given $\gamma$ value ($\gamma$ and niter are parameters). We take $\gamma = 0.95$ and a maximum number of iterations niter = 200.

        class MDP(object):
            def __init__(self,P,r,actions):
                # To Do
    
            def Bellman(self, gamma =0.95, niter = 200):
                # To Do
                return V
    
  • class MDP(object):
    
        def __init__(self,P,r,actions):
            self.P = P
            self.r = r
            self.actions = actions
            self.m = P.shape[0]   # number of states
            
        def Bellman(self,gamma = 0.95, niter = 200):
            V0 = [0.0 for i in range(self.m)]
            for k in range(niter):
                # synchronous update: V1 is computed from the previous estimate V0
                V1 = [0.0 for i in range(self.m)]
                for s in range(self.m):
                    V1[s] = np.max([np.sum([self.P[s,a,sp]*(self.r[s,a,sp] + gamma * V0[sp]) for sp in range(self.m)]) for a in self.actions[s]])
                V0 = V1
            return V0
    
    actions = []
    for i in range(m):
        temp = []
        for a in range(n):
            if (a==0 and i+4 < m) or (a==1 and i-4>=0) or (a==2 and (i%4-1)>=0) or (a==3 and (i%4+1)<=3):
                temp.append(a)
        actions.append(temp)
    
    model = MDP(P,r,actions)
    print(model.Bellman())
    
  • Create an object of the MDP class for the previous MDP problem and compare the results of the Bellman method with $\gamma=0.95$ and $\gamma=0.9$ for niter=200. Can you explain the differences? In fact, a reward (or payment) in the future is not worth quite as much as a reward now. For example, being promised 10000€ next year is worth only 90% as much as receiving 10000€ right now, so a payment $n$ steps in the future is worth only $0.9^n$ of the same payment now (see the short check after this exercise).

  • We assume now that we have the following MDP Figure:
    In state $S_0$ it seems clear that action $a_0$ is the best possible. In state $S_2$ the agent must choose action $a_1$. But in state $S_1$, it is not easy to determine whether the agent should stay there or go through the flames. Apply your Bellman algorithm to this case and compute the expected discounted future value of each state.
  • P = np.array([[[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
                  [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                  [[0.0, 0.0, 0.0], [0.8, 0.1, 0.1], [0.0, 0.0, 0.0]]])
    r = np.array([[[10, 0, 0], [0, 0, 0], [0, 0, 0]],
                  [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
                  [[0, 0, 0], [40, 0, 0], [0, 0, 0]]])
    actions = [[0,1,2],[0,2],[1]]
    
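As a quick check of the role of the discount rate mentioned above, a reward received $n$ steps in the future is scaled by $\gamma^n$, so a smaller $\gamma$ shrinks distant rewards much faster. The short sketch below prints a few such factors and then compares the Bellman values of the 3-state MDP defined just above for the two discount rates (model2 is just an illustrative variable name).

    # A reward n steps away is worth gamma**n of an immediate one.
    for gamma in (0.9, 0.95):
        print(gamma, [round(gamma**n, 3) for n in (1, 5, 10)])
    # 0.9  -> [0.9, 0.59, 0.349]
    # 0.95 -> [0.95, 0.774, 0.599]

    # Comparing the state values of the 3-state MDP for the two discount rates
    model2 = MDP(P, r, actions)
    print(model2.Bellman(gamma=0.95, niter=200))
    print(model2.Bellman(gamma=0.9, niter=200))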

The Q-algorithm

As we can see in the previous exercise, the Bellman equation only gives us the optimal average value of each state. It does not tell the agent what to do, i.e. which optimal action to choose in each state. For this, Richard Bellman designed a new equation called the Q-equation, which leads to the Q-algorithm methodology. The Q-algorithm is one of the techniques for finding an optimal policy in an MDP. Its objective is to find a policy that is optimal in the sense that the expected value of the total reward over all successive steps is the maximum achievable. In other words, the goal of the Q-algorithm is to find the optimal policy by learning the optimal Q-value of each state-action pair.

In the following, we give the Q-algorithm:
for all $i \in \{0,..,m\}$ and for all $l \in \{0,..,n\}:$ \[ Q_{k+1}(S_i,a_l) = \sum_{j=0}^{m} \mathbb{P}(S_j | S_i,a_l)(r(S_i,a_l,S_j)+\gamma \max\limits_{a \in \{a_0,a_1,...,a_n\}}Q_k(S_j,a)) \] Consequently, as soon as we obtain the estimate of the optimal Q matrix, we can define the optimal policy $\pi^*(S_i)$ for each state, i.e. the best action to take in each state: \[ \pi^*(S_i) = \underset{a \in \{a_0,a_1,...,a_n\}} {\text{argmax}} Q^*(S_i,a) \]
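
Note that the optimal state values of the previous section are recovered from the optimal Q-values, since the value of a state is the value of its best action: \[ V^*(S_i) = \max\limits_{a \in \{a_0,a_1,...,a_n\}} Q^*(S_i,a) \]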


Exercise [Q-algorithm]
  • Implement this Q-algorithm. For this, you will define a new method of the MDP class called Qalgo with $n$, $\gamma$ and niter as parameters. Parameter $n$ is the number of possible actions.

                    def Qalgo(self, n, gamma =0.95, niter = 200):
                        Q = np.full([self.m,n],0.0)
                        # To Do
                        return Q
                

  • In order to obtain the optimal action for each state, apply numpy.argmax(Q, axis=1) to the array returned by Qalgo. Show that the optimal Q-values and actions for each state of the first MDP are:

    $Q^*(s,a)$-Table
               $a_0$ (UP)   $a_1$ (DOWN)  $a_2$ (LEFT)  $a_3$ (RIGHT)
    $S_0$      0.81450625   0             0             0.81450625
    $S_1$      0            0             0.77378094    0.857375
    $S_2$      0.9025       0             0.81450625    0.81450625
    $S_3$      -0.05        0             0.857375      0
    $S_4$      0.857375     0.77378094    0             0
    $S_5$      0            0             0             0
    $S_6$      0.95         0.857375      0             -0.05
    $S_7$      1            0.81450625    0.9025        0
    $S_8$      0            0.81450625    0             0.9025
    $S_9$      0            0             0.857375      0.95
    $S_{10}$   0            0.9025        0.9025        1
    $S_{11}$   0            0             0             0


                    optimal actions for each state: [0 3 0 2 0 0 0 0 3 3 3 0]
                

  • class MDP(object):
        
        def __init__(self,P,r,actions):
            self.P = P
            self.r = r
            self.actions = actions
            self.m = P.shape[0]
            
        def Bellman(self,gamma = 0.95,niter = 200):
            V0 = [0.0 for i in range(self.m)]
            for k in range(niter):
                # synchronous update based on the previous estimate V0
                V1 = [0.0 for i in range(self.m)]
                for s in range(self.m):
                    V1[s] = np.max([np.sum([self.P[s,a,sp]*(self.r[s,a,sp] + gamma * V0[sp]) for sp in range(self.m)]) for a in self.actions[s]])
                V0 = V1
            return V0
        
        def Qalgo(self,n,gamma = 0.95,niter = 200):
            Q = np.full([self.m,n],0.0)
            for k in range(niter):
                # Q1 is a frozen copy of the previous estimate
                Q1 = Q.copy()
                for s in range(self.m):
                    for a in self.actions[s]:
                        Q[s,a] = np.sum([self.P[s,a,sp]*(self.r[s,a,sp] + gamma * np.max(Q1[sp])) for sp in range(self.m)])
            return Q
    
    model = MDP(P,r,actions)
    print(model.Bellman())
    res = model.Qalgo(4)
    print(res)
    print("optimal actions for each state:")
    print(np.argmax(res, axis =1))

Interface for the lake problem

To create the interface for the lake problem, solved using the reinforcement learning strategy, we will use the tkinter module.

Exercise [tkinter interface class]

  • Create a class Game which enables you:
    - to initialize the size (p×p) of the graphical grid of the lake
    - to allow the user to define with the mouse all the holes (red tiles)
    - to set the frisbee location (green tile)
    - to create the MDP object for the RL resolution

    For this, you can use the following code, which draws a square of the grid according to the mouse event coordinates.
    from tkinter import *
    
    root = Tk()
    
    def key(event):
        print ("pressed", repr(event.char))
    
    def callback(event):
        global dim
        canvas.focus_set()
        print ("clicked at", event.x, event.y)
        i=event.x//dim
        j=event.y//dim
        canvas.itemconfig(rectangles_array[i][j], fill='blue')
    
    
    canvas= Canvas(root, width=300, height=300)
    canvas.bind("<Key>", key)
    canvas.bind("<Button-1>", callback)
    canvas.pack()
    rectangles_array=[]
    dim = 60
    
    for i in range(5):
        raw=[]
        for j in range(5):
            if (i+j)%2==0:
                raw.append(canvas.create_rectangle(i*dim,j*dim, (i+1)*dim, (j+1)*dim, fill='green'))
            else :
                raw.append(canvas.create_rectangle(i*dim,j*dim, (i+1)*dim, (j+1)*dim, fill='red'))
        rectangles_array.append(raw)
                    
    root.mainloop()
    
  • Add to the class a method path which "plays" the reinforcement strategy until the agent finds the frisbee.

  • class TextBox(Entry):
        def __init__(self,parent=None,width=30):
            self.__text=StringVar()
            super().__init__(parent,width=width, textvariable=self.__text)
            
            
        def get(self):
            return self.__text.get()
            
        def set(self,text):
            self.__text.set(text)
            
            
    #interface class for creating and displaying the problem
    class Game(Tk):
        def __init__(self):
            super().__init__()       
            self.sizeG=0            #size of the square grid
            self.sizeC=30           #size of a square in pixels
            self.configureG=0       #grid-configuration step: 0=stopped, 1=starting point, 2=goal point, 3=holes
            self.depart=(0,0)       #position of the starting point
            self.arrivee=(0,0)      #position of the goal point
            self.trous=[]           #list of the hole positions
            self.rectangles_array=[] #list of the square objects of the grid
            self.create_widgets()   
            self.MDPO=None          #MDP object for the resolution

        def create_widgets(self):        
            #creation of the frames
            #-----------------frame cadre11: grid configuration --------------
            # TO DO

            #------------------ frame cadre12: resolution -------------------
            # TO DO  

            #-------------------- frame cadre13: message display -------------------
            # TO DO

            #---------------- frame cadre2: grid display -------
            self.cadre2=Frame(self, width=10, height=10, borderwidth=5,relief=GROOVE)
            self.cadre2.pack(fill=X)
            self.gride= Canvas(self.cadre2, width=self.sizeG*self.sizeC, height=self.sizeG*self.sizeC)
            self.rectangles_array=[]
            self.gride.bind("<Button-1>", self.callback)
            self.gride.pack()

        #handle mouse clicks in the canvas
        def callback(self,event):
            if self.configureG!=0:
                #mouse coordinates in the grid
                j=event.x//self.sizeC
                i=event.y//self.sizeC
                if self.configureG==1:
                    #definition of the starting point
                    self.gride.itemconfig(self.rectangles_array[j][i], fill='blue')
                    self.configureG=2
                    self.depart=(i,j)
                    self.label1['text']="Choose the goal point"
                elif self.configureG==2:
                    #definition of the goal point
                    self.gride.itemconfig(self.rectangles_array[j][i], fill='green')
                    self.configureG=3
                    self.arrivee=(i,j)
                    self.label1['text']="Choose the hole locations\nClick on the button to finish"
                    self.bouton_config['state'] = 'normal' #the configuration can now be stopped
                else:
                    #definition of the holes
                    self.gride.itemconfig(self.rectangles_array[j][i], fill='red')
                    self.trous.append((i,j))

                    
        #colouring of a square of the grid
        def colorize(self,j,i,color='black'):
            #TO DO
            pass

        #creation of the grid
        def creerG(self,size):
            #TO DO
            pass

        #configuration of the grid / end of the configuration
        def configure(self):       
            if self.configureG!=0:
                #stop
                self.bouton_config.config(relief=RAISED)
                self.bouton_config['text'] = 'Configure the grid'
                self.bouton_calcul['state'] = 'normal'
                self.configureG=0
                self.label1['text']=''
                self.creerMDP()
            else:
                #start of the configuration
                v=self.texte1.get()
                if v.isnumeric():
                    self.creerG(int(v))
                    self.bouton_config.config(relief=SUNKEN)
                    self.bouton_config['state'] = 'disabled'
                    self.bouton_config['text'] = 'End of configuration'
                    self.bouton_calcul['state'] = 'disabled'
                    self.configureG=1
                    self.trous=[]
                    self.label1['text']="Choose the starting point"
                else:
                    self.label1['text']="the grid size must be an integer!"

        #creation of the MDP object from the grid configuration        
        def creerMDP(self):
            #TO DO
            pass

        #resolution of the problem with display of the path    
        def path(self):
            #TO DO
            pass
                
    #Start of the interface
    fenetre=Game()
    fenetre.mainloop()
    from  tkinter import *
    from tkinter.filedialog import askopenfilename
    import numpy as np
    
    #MDP class for solving the problem
    class MDP(object):
        #P : nparray of shape [m,n,m], with m = number of states and n = number of actions
        #P[i,a,j] contains the probability of going from state i to state j with action a
        #r : nparray of shape [m,n,m], with m = number of states and n = number of actions
        #r[i,a,j] contains the reward for going from state i to state j with action a
        #actions : list of lists of length m,
        #actions[i] contains the list of the possible actions from state i
        def __init__(self,P,r,actions):
            self.P = P
            self.r = r
            self.actions = actions
            self.m = P.shape[0]
            self.nb_action=P.shape[1]
        #computation of the optimal state value of each state with the Bellman equation
        def Bellman(self,gamma = 0.95,niter = 200):
            V0 = [0.0 for i in range(self.m)]
            for k in range(niter):
                # synchronous update based on the previous estimate V0
                V1 = [0.0 for i in range(self.m)]
                for s in range(self.m):
                    V1[s] = np.max([np.sum([self.P[s,a,sp]*(self.r[s,a,sp] + gamma * V0[sp]) for sp in range(self.m)]) for a in self.actions[s]])
                V0 = V1
            return V0
        #computation of the optimal value of each action from each state
        def Qalgo(self,gamma = 0.95,niter = 200):
            Q = np.full([self.m,self.nb_action],0.0)
            for k in range(niter):
                # Q1 is a frozen copy of the previous estimate
                Q1 = Q.copy()
                for s in range(self.m):
                    for a in self.actions[s]:
                        Q[s,a] = np.sum([self.P[s,a,sp]*(self.r[s,a,sp] + gamma * np.max(Q1[sp])) for sp in range(self.m)])
            return Q
        
    class TextBox(Entry):
        def __init__(self,parent=None,width=30):
            self.__text=StringVar()
            super().__init__(parent,width=width, textvariable=self.__text)
            
            
        def get(self):
            return self.__text.get()
            
        def set(self,text):
            self.__text.set(text)
            
            
    #interface class for creating and displaying the problem
    class Game(Tk):
        def __init__(self):
            super().__init__()       
            self.sizeG=0            #size of the square grid
            self.sizeC=30           #size of a square in pixels
            self.configureG=0       #grid-configuration step: 0=stopped, 1=starting point, 2=goal point, 3=holes
            self.depart=(0,0)       #position of the starting point
            self.arrivee=(0,0)      #position of the goal point
            self.trous=[]           #list of the hole positions
            self.rectangles_array=[] #list of the square objects of the grid
            self.create_widgets()   
            self.MDPO=None          #MDP object for the resolution
    
        def create_widgets(self):        
            #creation of the frames
            #-----------------frame cadre11: grid configuration --------------
            self.cadre11 = Frame(self, width=400, height=50, borderwidth=5)
            self.cadre11.pack(fill=X)
            self.label11 = Label(self.cadre11, text="size of the square grid:")
            self.label11.pack(side=LEFT,padx=10,pady=10)
            self.texte1 = TextBox(self.cadre11, width=30)
            self.texte1.pack(side=LEFT,padx=10,pady=10)
            self.texte1.set('10')
            self.bouton_config =Button(self.cadre11, text="Configure the grid", command=self.configure)
            self.bouton_config.pack(side=LEFT,padx=10,pady=10)
            #------------------ frame cadre12: resolution -------------------
            self.cadre12 = Frame(self, width=400, height=50, borderwidth=5)
            self.cadre12.pack(fill=X)
            self.label2 = Label(self.cadre12, text="number of iterations:")
            self.label2.pack(side=LEFT,padx=10,pady=10)
            self.texte2 = TextBox(self.cadre12, width=30)
            self.texte2.pack(side=LEFT,padx=10,pady=10)
            self.texte2.set('100')
            self.bouton_calcul =Button(self.cadre12, text="Compute the path", command=self.path)
            self.bouton_calcul.pack(side=LEFT,padx=10,pady=10)
            self.bouton_calcul['state'] = 'disabled'
            #-------------------- frame cadre13: message display -------------------
            self.cadre13 = Frame(self, width=400, height=50, borderwidth=5)
            self.cadre13.pack(fill=X)
            self.label1 = Label(self.cadre13, text="")
            self.label1.pack(side=TOP,padx=10,pady=10)

            #---------------- frame cadre2: grid display -------
            self.cadre2=Frame(self, width=10, height=10, borderwidth=5,relief=GROOVE)
            self.cadre2.pack(fill=X)
            self.gride= Canvas(self.cadre2, width=self.sizeG*self.sizeC, height=self.sizeG*self.sizeC)
            self.rectangles_array=[]
            self.gride.bind("<Button-1>", self.callback)
            self.gride.pack()
        #handle mouse clicks in the canvas
        def callback(self,event):
            if self.configureG!=0:
                #mouse coordinates in the grid
                j=event.x//self.sizeC
                i=event.y//self.sizeC
                if self.configureG==1:
                    #definition of the starting point
                    self.gride.itemconfig(self.rectangles_array[j][i], fill='blue')
                    self.configureG=2
                    self.depart=(i,j)
                    self.label1['text']="Choose the goal point"
                elif self.configureG==2:
                    #definition of the goal point
                    self.gride.itemconfig(self.rectangles_array[j][i], fill='green')
                    self.configureG=3
                    self.arrivee=(i,j)
                    self.label1['text']="Choose the hole locations\nClick on the button to finish"
                    self.bouton_config['state'] = 'normal' #the configuration can now be stopped
                else:
                    #definition of the holes
                    self.gride.itemconfig(self.rectangles_array[j][i], fill='red')
                    self.trous.append((i,j))

                    
        #colouring of a square of the grid
        def colorize(self,j,i,color='black'):
            self.gride.itemconfig(self.rectangles_array[j][i], fill=color)
        #creation of the grid
        def creerG(self,size):
            self.configureG=0
            self.bouton_config.config(relief=RAISED)
            self.label1['text']=''
            self.sizeG=int(size)
            self.gride.delete("all")
            #update of the canvas size
            self.gride.config( width=self.sizeG*self.sizeC, height=self.sizeG*self.sizeC)
            self.rectangles_array=[] 
            #creation of the squares of the grid
            for i in range(self.sizeG):
                raw=[]
                for j in range(self.sizeG):
                    raw.append(self.gride.create_rectangle(i*self.sizeC,j*self.sizeC, (i+1)*self.sizeC, (j+1)*self.sizeC, fill='white'))
                self.rectangles_array.append(raw)
    
        #configuration of the grid / end of the configuration
        def configure(self):       
            if self.configureG!=0:
                #stop
                self.bouton_config.config(relief=RAISED)
                self.bouton_config['text'] = 'Configure the grid'
                self.bouton_calcul['state'] = 'normal'
                self.configureG=0
                self.label1['text']=''
                self.creerMDP()
            else:
                #start of the configuration
                v=self.texte1.get()
                if v.isnumeric():
                    self.creerG(int(v))
                    self.bouton_config.config(relief=SUNKEN)
                    self.bouton_config['state'] = 'disabled'
                    self.bouton_config['text'] = 'End of configuration'
                    self.bouton_calcul['state'] = 'disabled'
                    self.configureG=1
                    self.trous=[]
                    self.label1['text']="Choose the starting point"
                else:
                    self.label1['text']="the grid size must be an integer!"
        #creation of the MDP object from the grid configuration        
        def creerMDP(self):
            m = self.sizeG*self.sizeG
            n = 4
            P = np.full([m,n,m],0.0)    
            #initialization of P
            for i in range(m):
                for a in range(n):
                    for j in range(m):
                        if (a==0 and j == i + self.sizeG) or (a==1 and j == i - self.sizeG) or (a==2 and j == i - 1) or (a==3 and j == i + 1):
                            #the goal is a terminal state: no transition out of it
                            if i!=self.arrivee[0]*self.sizeG+self.arrivee[1]:
                                P[i,a,j] = 1.0
            #initialization of the actions
            actions=[]
            for i in range(m):
                l=i//self.sizeG
                c=i%self.sizeG
                cactions=[]
                if l< self.sizeG-1 : cactions.append(0)
                if l>0 : cactions.append(1)
                if c>0 : cactions.append(2)
                if c< self.sizeG-1 : cactions.append(3)
                actions.append(cactions)
            #initialization of r 
            r = np.full([m,n,m],0.0)
            #reward for reaching the goal tile
            ind=self.arrivee[0]*self.sizeG+self.arrivee[1]
            ind2=(self.arrivee[0]-1)*self.sizeG+self.arrivee[1]
            if ind2 >=0 and ind2< m: r[ind2,0,ind]=1.0
            ind2=(self.arrivee[0]+1)*self.sizeG+self.arrivee[1]
            if ind2>=0 and ind2< m: r[ind2,1,ind]=1.0
            ind2=self.arrivee[0]*self.sizeG+self.arrivee[1]+1
            if ind2>=0 and ind2< m: r[ind2,2,ind]=1.0
            ind2=self.arrivee[0]*self.sizeG+self.arrivee[1]-1
            if ind2>=0 and ind2< m: r[ind2,3,ind]=1.0
            #penalties for the holes
            for i in range(len(self.trous)):
                ind=self.trous[i][0]*self.sizeG+self.trous[i][1]
                ind2=(self.trous[i][0]-1)*self.sizeG+self.trous[i][1]
                if ind2>=0 and ind2< m: r[ind2,0,ind]=-10.0
                ind2=(self.trous[i][0]+1)*self.sizeG+self.trous[i][1]
                if ind2>=0 and ind2< m: r[ind2,1,ind]=-10.0
                ind2=self.trous[i][0]*self.sizeG+self.trous[i][1]+1
                if ind2>=0 and ind2< m: r[ind2,2,ind]=-10.0
                ind2=self.trous[i][0]*self.sizeG+self.trous[i][1]-1
                if ind2>=0 and ind2< m: r[ind2,3,ind]=-10.0
            self.MDPO=MDP(P,r,actions)
    
        #resolution of the problem with display of the path    
        def path(self):
            niter=self.texte2.get()
            if self.MDPO is not None and niter.isnumeric():
                Q=self.MDPO.Qalgo(gamma = 0.95,niter = int(niter))
                ind=self.depart[0]*self.sizeG+self.depart[1]
                ind_final=self.arrivee[0]*self.sizeG+self.arrivee[1]
                while ind!=ind_final:
                    action=np.argmax(Q[ind])
                    if Q[ind][action]>0:
                        if action==0: ind=ind+self.sizeG
                        if action==1: ind=ind-self.sizeG
                        if action==2: ind=ind-1
                        if action==3: ind=ind+1
                        if ind!=ind_final : self.colorize(ind%self.sizeG,ind//self.sizeG)
                    else:
                        raise Exception("path blocked!")
                
    #Start of the interface
    fenetre=Game()
    fenetre.mainloop()
    

References

  1. R.S. Sutton, A.G. Barto (2017), Reinforcement Learning: An Introduction.
  2. R. Bellman (1957), A Markovian Decision Process