Using Q-learning, one of the reinforcement learning algorithms, I tried to solve a simple maze of my own design. I implemented it in Python.
Reinforcement learning is one of the major machine learning paradigms, alongside supervised and unsupervised learning. Briefly, an "agent" is trained to maximize the "value" it obtains in a given "environment". Reinforcement learning is usually described in terms of three elements: the state *s*, the action *a*, and the reward *r*. Each element is described below.
- State *s*: the state of the environment.
- Action *a*: the action the agent takes in a given state of the environment.
- Reward *r*: the reward the agent obtains as a result of taking an action.
If we know the "value of taking an action *a* in a given state *s*", we can train the agent to maximize that value.
In Q-learning, the "value of taking action *a* in state *s*" from the previous section is called the Q value, and learning proceeds by repeatedly updating this value. The Q value of taking action $a_t$ in state $s_t$ is written $Q(s_t, a_t)$. To update $Q(s_t, a_t)$, we use the reward at the next step, $r_{t+1}$, and the maximum Q value in the next state $s_{t+1}$, $\max_a Q(s_{t+1}, a)$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right)$$

Here $\alpha$ (the learning rate) and $\gamma$ (the discount factor) are hyperparameters.
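For a concrete feel of the update, take the hyperparameter values used later in the code ($\alpha = 0.5$, $\gamma = 0.8$) and suppose $Q(s_t, a_t) = 0$, the move reaches the goal so that $r_{t+1} = 1$, and $\max_a Q(s_{t+1}, a) = 0$. The update then gives

$$Q(s_t, a_t) \leftarrow 0 + 0.5 \times (1 + 0.8 \times 0 - 0) = 0.5,$$

so the Q value of that state-action pair moves halfway toward the observed return.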
The code was run with Python 3.5.2.
The implementation of the maze part is as follows. The maze is a $5 \times 5$ grid (see the comments in the code for details).
treasure.py
import numpy as np


class Game():
    def __init__(self):
        self.x_size = 5
        self.y_size = 5
        self.init_position = np.array([4, 0])  # start square S (row 4, column 0)
        self.game_board = np.zeros((self.x_size, self.y_size))
        """
        Maze structure
        0 0 0 0 G
        0 D 0 D D
        0 0 D 0 0
        D 0 0 0 0
        S 0 D 0 0
        G = 1 (goal)
        D = -1 (dead square)
        """
        self.game_board[0][4] = 1
        self.game_board[1][1] = -1
        self.game_board[3][0] = -1
        self.game_board[2][2] = -1
        self.game_board[1][3] = -1
        self.game_board[1][4] = -1
        self.game_board[4][2] = -1

    def judge(self, x, y):
        # Judge whether the game has ended: 0 = continue, 1 = goal, -1 = dead square
        if self.game_board[x][y] == 0:
            return 0
        elif self.game_board[x][y] == 1:
            return 1
        else:
            return -1

    def check_size(self, x, y):
        # Check whether (x, y) is inside the maze: 1 = inside, 0 = outside
        if x < 0 or y < 0 or x >= self.x_size or y >= self.y_size:
            return 0
        else:
            return 1

    def move(self, position, direction):
        # position[0] is the row index, position[1] is the column index
        if direction == 'up':
            position[0] -= 1
        elif direction == 'down':
            position[0] += 1
        elif direction == 'right':
            position[1] += 1
        elif direction == 'left':
            position[1] -= 1
        return position
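As a quick check of how the class behaves (this snippet is not part of the article's scripts), it can be exercised on its own:

```python
import numpy as np

from treasure import Game

game = Game()
print(game.game_board)          # 5 x 5 board: 1 = goal G, -1 = dead square D

position = np.array([4, 0])     # start square S
position = game.move(position, 'up')
print(game.check_size(position[0], position[1]))   # 1 -> still inside the maze
print(game.judge(position[0], position[1]))        # -1 -> (3, 0) is a D square, game over
```

Moving up once from the start lands on the D square at (3, 0), so the game immediately ends with a judgement of -1.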
The rules are as follows: the player starts from the S square and, in each state, can move one square up, down, left, or right. In terms of the elements above, the state *s* is the player's position, the action *a* is the movement direction, and the reward *r* depends on the square reached: reaching G ends the game with a reward of 1, stepping onto a D square ends the game with a reward of -1 as a punishment, and every other move gives a reward of 0.
The training code is as follows.
train.py
import numpy as np
from treasure import Game
import copy
import matplotlib.pyplot as plt

# board process
game = Game()
game_board = game.game_board
print(game_board)
game_direction = ['up', 'down', 'left', 'right']


def get_action(Q_table, dire, epsilon, x, y):
    # epsilon-greedy: random action with probability epsilon, otherwise greedy
    if np.random.rand() < epsilon:
        return np.random.choice(dire)
    else:
        return dire[np.argmax(Q_table[x, y])]


def game_init():
    # init process: new game, player on the start square
    game = Game()
    position = game.init_position
    return game, position


def game_reward(game, position):
    # reward for the square the player has just moved to
    if game.judge(position[0], position[1]) == 1:
        # print('You got a goal!')
        return 1
    elif game.judge(position[0], position[1]) == -1:
        # print('You died..')
        return -1
    else:
        return 0


def game_step(game, position, Q_table, dire, epsilon):
    while True:
        pos = copy.deepcopy(position)
        direction = get_action(Q_table, dire, epsilon, pos[0], pos[1])
        index_dire = dire.index(direction)
        move_position = game.move(pos, direction)
        # accept the move only if it stays inside the maze; otherwise sample again
        if game.check_size(pos[0], pos[1]):
            break
    reward = game_reward(game, move_position)
    return move_position, index_dire, reward


def Q_koushin(Q, state_x, state_y, state_a, state_next_x, state_next_y, alpha, reward, gamma):
    # Q-learning update rule
    Q[state_x, state_y, state_a] += alpha*(reward + gamma*np.max(Q[state_next_x, state_next_y]) - Q[state_x, state_y, state_a])
    return Q[state_x, state_y, state_a]


if __name__ == '__main__':
    # hyper parameters
    epsilon = 0.01
    alpha = 0.5
    gamma = 0.8
    Q_table = np.zeros((game_board.shape[0], game_board.shape[1], len(game_direction)))
    episode = 200
    success = []
    for i in range(episode):
        # one training episode
        game, position = game_init()
        while not game.judge(position[0], position[1]):
            next_position, dire, reward = game_step(game, position, Q_table, game_direction, epsilon)
            Q_table[position[0], position[1], dire] = Q_koushin(Q_table, position[0], position[1], dire, next_position[0], next_position[1], alpha, reward, gamma)
            position = next_position
        if (i+1) % 20 == 0:
            # evaluation: play 100 games with the current Q_table
            count = 0
            heatmap = np.zeros((game_board.shape[0], game_board.shape[1]))
            for j in range(100):
                game, position = game_init()
                while not game.judge(position[0], position[1]):
                    next_position, dire, reward = game_step(game, position, Q_table, game_direction, epsilon)
                    position = next_position
                    heatmap[next_position[0]][next_position[1]] += 1
                if reward == 1:
                    count += 1
            success.append(count)
            print('%d times' % (i+1))
            print(heatmap)
    print(success)
I will explain each part below.
**Action selection by the epsilon-greedy method**
train.py
def get_action(Q_table, dire, epsilon, x, y):
    # epsilon-greedy: random action with probability epsilon, otherwise greedy
    if np.random.rand() < epsilon:
        return np.random.choice(dire)
    else:
        return dire[np.argmax(Q_table[x, y])]
With probability $\epsilon$ the agent chooses an action at random; otherwise it chooses the action with the highest Q value.
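To see what this means in practice, here is a small standalone sketch; the Q values below are made up purely for illustration:

```python
import numpy as np

game_direction = ['up', 'down', 'left', 'right']
epsilon = 0.01

# Made-up Q values for one square: 'down' is the best action here.
Q_table = np.zeros((5, 5, 4))
Q_table[4, 0] = [0.1, 0.7, 0.0, -0.2]

def get_action(Q_table, dire, epsilon, x, y):
    # same function as in train.py
    if np.random.rand() < epsilon:
        return np.random.choice(dire)
    else:
        return dire[np.argmax(Q_table[x, y])]

picks = [get_action(Q_table, game_direction, epsilon, 4, 0) for _ in range(1000)]
print(picks.count('down') / len(picks))   # roughly 0.99
```

With $\epsilon = 0.01$ the greedy action is taken about 99% of the time, so exploration is quite limited.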
**Reward function settings**
train.py
def game_reward(game, position):
    # reward for the square the player has just moved to
    if game.judge(position[0], position[1]) == 1:
        # print('You got a goal!')
        return 1
    elif game.judge(position[0], position[1]) == -1:
        # print('You died..')
        return -1
    else:
        return 0
The reward is 1 when the goal is reached, -1 as a punishment when the game ends on a D square, and 0 otherwise.
**Updating the Q value**
train.py
def Q_koushin(Q, state_x, state_y, state_a, state_next_x, state_next_y, alpha, reward, gamma):
    # Q-learning update rule
    Q[state_x, state_y, state_a] += alpha*(reward + gamma*np.max(Q[state_next_x, state_next_y]) - Q[state_x, state_y, state_a])
    return Q[state_x, state_y, state_a]
The Q value is updated according to the formula given earlier. Note that `Q_koushin` already modifies the Q table in place, so assigning the returned value back into `Q_table` in the main loop is redundant but harmless.
**Moving the player**
train.py
def game_step(game, position, Q_table, dire, epsilon):
    while True:
        pos = copy.deepcopy(position)
        direction = get_action(Q_table, dire, epsilon, pos[0], pos[1])
        index_dire = dire.index(direction)
        move_position = game.move(pos, direction)
        # accept the move only if it stays inside the maze; otherwise sample again
        if game.check_size(pos[0], pos[1]):
            break
    reward = game_reward(game, move_position)
    return move_position, index_dire, reward
This makes one move during the game: it checks whether the chosen move would leave the maze and, once a valid move is found, returns the new position, the index of the chosen direction, and the reward obtained. Because the position array is passed by reference in Python, a deep copy is made first so that the caller's array is not modified.
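A minimal illustration of the aliasing issue (not part of train.py): NumPy arrays, like lists, are passed by reference, so mutating the argument inside `move` would also change the caller's array.

```python
import copy

import numpy as np

position = np.array([4, 0])
alias = position        # no copy: both names refer to the same array
alias[0] -= 1
print(position)         # [3 0] -> the "original" changed as well

position = np.array([4, 0])
pos = copy.deepcopy(position)   # independent copy, as done in game_step()
pos[0] -= 1
print(position, pos)    # [4 0] [3 0]
```

For a flat NumPy array like this one, `position.copy()` would work just as well as `copy.deepcopy`.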
**Learning part**
train.py
if __name__ == '__main__':
    # hyper parameters
    epsilon = 0.01
    alpha = 0.5
    gamma = 0.8
    Q_table = np.zeros((game_board.shape[0], game_board.shape[1], len(game_direction)))
    episode = 200
    success = []
    for i in range(episode):
        # one training episode
        game, position = game_init()
        while not game.judge(position[0], position[1]):
            next_position, dire, reward = game_step(game, position, Q_table, game_direction, epsilon)
            Q_table[position[0], position[1], dire] = Q_koushin(Q_table, position[0], position[1], dire, next_position[0], next_position[1], alpha, reward, gamma)
            position = next_position
        if (i+1) % 20 == 0:
            # evaluation: play 100 games with the current Q_table
            count = 0
            heatmap = np.zeros((game_board.shape[0], game_board.shape[1]))
            for j in range(100):
                game, position = game_init()
                while not game.judge(position[0], position[1]):
                    next_position, dire, reward = game_step(game, position, Q_table, game_direction, epsilon)
                    position = next_position
                    heatmap[next_position[0]][next_position[1]] += 1
                if reward == 1:
                    count += 1
            success.append(count)
            print('%d times' % (i+1))
            print(heatmap)
The Q_table is a $5 \times 5 \times 4$ array of (maze squares) $\times$ (movement directions: up, down, left, right). The Q value is updated after every move until the goal is reached or the game ends. In addition, after every 20 training episodes, 100 games are played with the Q_table at that point to check how often the goal is reached and which squares are visited.
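One convenient way to inspect what has been learned (not in the original script) is to take the greedy action in every square; a minimal sketch, assuming it is appended to train.py after the training loop so that `Q_table` and `game_direction` are in scope:

```python
# Greedy policy per square: index of the action with the highest Q value.
greedy_policy = np.argmax(Q_table, axis=2)
# Map indices back to direction names for readability.
policy_names = np.array(game_direction)[greedy_policy]
print(policy_names)   # 5 x 5 grid of 'up' / 'down' / 'left' / 'right'
```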
The results are as follows.
# Maze structure (S = start, G = goal, -1 = dead square D)
[[ 0. 0. 0. 0. G.]
[ 0. -1. 0. -1. -1.]
[ 0. 0. -1. 0. 0.]
[-1. 0. 0. 0. 0.]
[ S. 0. -1. 0. 0.]]
After 20 episodes
[[4.500e+01 1.800e+01 1.500e+01 6.000e+00 3.000e+00]
[4.000e+01 1.400e+01 5.000e+00 3.000e+00 5.000e+00]
[9.000e+00 4.148e+03 8.000e+00 8.270e+02 5.000e+00]
[6.700e+01 4.172e+03 1.100e+01 8.350e+02 3.000e+00]
[0.000e+00 4.900e+01 0.000e+00 2.000e+00 0.000e+00]]
After 40 episodes
[[1.600e+01 1.000e+01 5.000e+00 2.000e+00 2.000e+00]
[1.300e+01 1.000e+01 2.000e+00 3.000e+00 3.000e+00]
[7.000e+00 4.105e+03 9.000e+00 6.400e+02 3.000e+00]
[7.300e+01 4.132e+03 7.000e+00 6.450e+02 2.000e+00]
[0.000e+00 4.900e+01 0.000e+00 2.000e+00 0.000e+00]]
After 60 episodes
[[1.900e+01 8.000e+00 3.000e+00 3.000e+00 3.000e+00]
[1.700e+01 1.600e+01 0.000e+00 2.000e+00 4.000e+00]
[6.000e+00 3.754e+03 8.000e+00 2.120e+02 4.000e+00]
[6.700e+01 3.783e+03 6.000e+00 2.140e+02 2.000e+00]
[0.000e+00 5.600e+01 0.000e+00 0.000e+00 0.000e+00]]
After 80 episodes
[[1.100e+01 6.000e+00 6.000e+00 6.000e+00 6.000e+00]
[1.100e+01 1.200e+01 0.000e+00 6.000e+00 4.000e+00]
[6.000e+00 3.544e+03 1.300e+01 2.069e+03 1.107e+03]
[5.900e+01 3.571e+03 1.500e+01 2.080e+03 1.109e+03]
[0.000e+00 5.700e+01 0.000e+00 3.000e+00 2.000e+00]]
After 100 episodes
[[8.000e+00 8.000e+00 8.000e+00 8.000e+00 8.000e+00]
[8.000e+00 1.600e+01 0.000e+00 5.000e+00 2.000e+00]
[8.000e+00 4.783e+03 1.500e+01 3.083e+03 1.225e+03]
[5.400e+01 4.824e+03 2.300e+01 3.100e+03 1.224e+03]
[0.000e+00 7.100e+01 0.000e+00 1.000e+01 1.000e+00]]
After 120 episodes
[[1.100e+01 1.100e+01 1.100e+01 1.100e+01 1.100e+01]
[1.100e+01 6.000e+00 0.000e+00 6.000e+00 3.000e+00]
[1.100e+01 4.215e+03 1.500e+01 2.403e+03 1.660e+03]
[5.900e+01 4.251e+03 1.900e+01 2.416e+03 1.665e+03]
[1.000e+00 6.400e+01 0.000e+00 4.000e+00 4.000e+00]]
After 140 episodes
[[ 99. 100. 98. 96. 96.]
[100. 4. 1. 0. 0.]
[101. 101. 0. 0. 0.]
[ 0. 102. 0. 0. 0.]
[ 0. 102. 0. 0. 0.]]
After 160 episodes
[[ 96. 95. 96. 96. 95.]
[ 97. 2. 0. 0. 0.]
[ 97. 100. 1. 0. 0.]
[ 0. 99. 0. 0. 0.]
[ 0. 100. 2. 0. 0.]]
After 180 episodes
[[101. 100. 100. 100. 99.]
[101. 0. 0. 1. 0.]
[100. 101. 0. 0. 0.]
[ 0. 101. 0. 0. 0.]
[ 0. 100. 0. 0. 0.]]
After 200 episodes
[[ 99. 99. 101. 100. 99.]
[100. 1. 1. 0. 0.]
[100. 100. 0. 0. 0.]
[ 0. 101. 0. 0. 0.]
[ 0. 101. 0. 0. 0.]]
# Number of goals out of 100 evaluation games, at each checkpoint
[3, 2, 3, 6, 8, 11, 96, 95, 99, 99]
Looking at the results, in the early stages when learning had not progressed, the games usually ended on the dead squares near the start, but after about 140 training episodes the agent appears to have found almost the correct route: the visited squares line up along the left column and across the top row toward G.
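As a side note, the goal counts collected in `success` could be plotted with matplotlib, which train.py imports but never uses; a minimal sketch, again assuming it is appended after the training loop:

```python
# One point per evaluation checkpoint (every 20 training episodes).
plt.plot(range(20, episode + 1, 20), success, marker='o')
plt.xlabel('training episodes')
plt.ylabel('goals out of 100 evaluation games')
plt.show()
```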
In this post I solved a simple maze of my own design using Q-learning. It was also a good opportunity to practice implementation. In the future, I would like to tackle more complicated games using other reinforcement learning methods.
Reference: Platinum Data Blog by BrainPad, "Introduction to Reinforcement Learning: Basic Knowledge for Those Who Want to Learn Reinforcement Learning" (2017). https://blog.brainpad.co.jp/entry/2017/02/24/121500