Q-learning: after taking each step, evaluate the value of the previous action based on the outcome (experience).
- Game overview
```
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
```
**R, G, Y, B** are the possible pickup and destination locations. The **blue** letter represents the current passenger pick-up location, and the **purple** letter is the current destination.
The filled square represents the taxi, which is **yellow** without a passenger and **green** with a passenger.
The pipe (`|`) represents a wall which the taxi cannot cross.
action number:
0 = south
1 = north
2 = east
3 = west
4 = pickup
5 = dropoff
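The action encoding above can be captured as a small lookup table (an illustrative helper, not part of the environment's API):

```python
# Hypothetical mapping from the env's discrete action numbers to names
ACTION_NAMES = {
    0: "south",
    1: "north",
    2: "east",
    3: "west",
    4: "pickup",
    5: "dropoff",
}

print(ACTION_NAMES[4])  # pickup
```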
- Q-Learning code
```python
import random

import gym  # classic Gym API assumed; on newer Gym versions use "Taxi-v3"
import numpy as np
from IPython.display import clear_output

env = gym.make("Taxi-v2")

q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore action space (probability epsilon = 0.1)
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]      # value of the action just taken in the previous state
        next_max = np.max(q_table[next_state])

        # Q-learning update rule
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value      # score the previous step after moving to the next state

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
```

Code walkthrough:
- Initializing the Q-table

```python
import numpy as np

q_table = np.zeros([env.observation_space.n, env.action_space.n])
```
The Q-table has one row per state, 500 in total: a 5×5 grid for the taxi position, 5 passenger locations (the 4 pickup points plus in-taxi), and 4 destinations, so 5 × 5 × 5 × 4 = 500.
The Q-table has one column per action available in each state; this game has 6 possible actions.
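The table dimensions can be checked without the environment; `n_states` and `n_actions` below are hard-coded stand-ins for `env.observation_space.n` and `env.action_space.n`, following the counting argument above:

```python
import numpy as np

# Stand-ins for env.observation_space.n and env.action_space.n
n_states = 5 * 5 * 5 * 4   # taxi row x taxi column x passenger location x destination
n_actions = 6              # south, north, east, west, pickup, dropoff

q_table = np.zeros([n_states, n_actions])
print(q_table.shape)  # (500, 6)
```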
- Setting hyperparameters and choosing actions

```python
import random

from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore action space (probability epsilon = 0.1)
        else:
            action = np.argmax(q_table[state])  # Exploit learned values
```
Hyperparameter notes:
- `alpha = 0.1` (the learning rate): should decrease as you continue to gain a larger and larger knowledge base.
- `gamma = 0.6` (the discount factor): determines how much importance we give to future rewards. A high value (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes the agent consider only immediate reward, hence making it greedy. (As you get closer and closer to the deadline, your preference for near-term reward should increase, as you won't be around long enough to get the long-term reward, which means your gamma should decrease.)
- `epsilon = 0.1` (the exploration rate): as we develop our strategy, we have less need of exploration and more exploitation to get more utility from our policy, so as trials increase, epsilon should decrease.
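The decay suggested above is not implemented in the code, which keeps epsilon fixed at 0.1; a common variant is a multiplicative schedule per episode. This is a sketch with illustrative values for `decay` and `min_epsilon`, not part of the original code:

```python
epsilon = 0.1        # initial exploration rate
min_epsilon = 0.01   # floor so some exploration always remains
decay = 0.999        # multiplicative decay applied once per episode

history = []
for episode in range(5000):
    history.append(epsilon)
    epsilon = max(min_epsilon, epsilon * decay)

print(history[0], history[-1])  # starts at 0.1, bottoms out at the floor
```

The same idea applies to alpha: shrinking it over time makes later updates smaller as the Q-table estimates stabilize.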
Because every entry of the Q-table starts at 0, the greedy choice `np.argmax(q_table[state])` always resolves the all-zero tie to the first index, so at the beginning the model always moves in the direction of action 0 (south).
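The tie-breaking behavior is easy to confirm: `np.argmax` returns the first index among equal maxima.

```python
import numpy as np

# An all-zero Q-row, as at the start of training: argmax picks the
# first maximal index, i.e. action 0 (south) in this game.
q_row = np.zeros(6)
print(np.argmax(q_row))  # 0
```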
- Updating the Q-table

```python
        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]      # value of the action just taken in the previous state
        next_max = np.max(q_table[next_state])

        # Q-learning update rule
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value      # score the previous step after moving to the next state
```
After moving to the new position, the previous step is scored with `new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)`, and the Q-table entry for the previous state-action pair is updated to this value.
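A worked instance of the update, using the hyperparameters above and illustrative values for the reward and the neighboring Q-entries (a sketch, not output from an actual run):

```python
alpha, gamma = 0.1, 0.6

old_value = 0.0   # Q[state, action] before the step (the table starts at zero)
reward = -1       # standard per-step penalty in Taxi
next_max = 0.5    # illustrative best Q-value at the next state

new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
print(new_value)  # 0.9 * 0.0 + 0.1 * (-1 + 0.6 * 0.5) = approximately -0.07
```

Note that the old value is weighted by `(1 - alpha)`, so with a small alpha each step nudges the estimate only slightly toward the new target.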
- This repeats, episode after episode, until the agent learns the correct route; the loop runs 100,000 episodes to tune the Q-table.