밴디트 문제 (multi-armed bandit problem)

밴디트는 오락실의 슬롯머신을 의미합니다.

슬롯 머신의 목표는 코인을 최대한 많이 얻는 것이죠!

밴디트 문제에서는 무작위성에 현혹되지 않게 '기댓값'을 기준으로 평가합니다.

만약 각 슬롯 머신의 가치(보상 기댓값)을 알면 플레이어는 가장 좋은 슬롯 머신을 고를 수 있음
하지만 실제로는 슬롯머신의 가치를 모름
각 슬롯 머신을 돌려보며 가치를 추정 (가능한 정확하게) 해야 함

따라서 밴디트 문제에서는 greedy 알고리즘과 epsilon 탐색을 통한 문제 풀이를 진행합니다.

Bandit Problem

플레이어는 가치 추정치가 가장 높은 머신을 선택해야 합니다. (Greedy Policy 이용)

exploitation : 지금까지 실제로 플레이 한 결과를 바탕으로 가장 좋다고 생각되는 슬롯 머신을 플레이(greedy)
exploration : 다양한 슬롯 머신 시도
epsilon 탐욕 정책 : epsilon의 확률로 '탐색', 1-epsilon의 확률로 'greedy'

# bandit slot machine

import numpy as np

class Bandit:
    def __init__(self, arms = 10): # arms = 슬롯머신의 개수
        self.rates = np.random.rand(arms) # 각 슬롯머신의 승률
        
    def play(self, arm):
        rate = self.rates[arm]
        if rate > np.random.rand():
            return 1
        else:
            return 0

# agent

Q = 0
for n in range(1, 11):
    reward = bandit.play(0) # 0번 슬롯머신을 플레이
    Q = Q + (reward - Q) / n
    print(Q)

Code

Agent Class

### agent class ###

class Agent:
    def __init__(self, epsilon, action_size=10):
        self.epsilon = epsilon
        self.Qs = np.zeros(action_size)
        self.ns = np.zeros(action_size)
        
    def update(self, action, reward):
        self.ns[action] += 1
        self.Qs[action] += (reward - self.Qs[action]) / self.ns[action]
        
    def get_action(self): # 행동 선택(greedy)
        if np.random.rand() < self.epsilon:
            return np.random.randint(0, len(self.Qs)) # 탐색
        return np.argmax(self.Qs) # greedy

Bandit 프로세스 실행

import matplotlib.pyplot as plt

steps = 1000
epsilon = 0.1

bandit = Bandit()
agent = Agent(epsilon)
total_reward = 0
total_rewards = []
rates = []

for step in range(steps):
    action = agent.get_action()
    reward = bandit.play(action) # 플레이 후 보상
    agent.update(action, reward)
    total_reward += reward
    
    total_rewards.append(total_reward)
    rates.append(total_reward / (step + 1))
    
print(total_reward)

Reward 그래프

단계별 승률을 확인하면 위와 같습니다.

처음엔 잘 맞춘 듯 하지만 초기에 문제를 틀리며 epsilon의 확률로 탐색 후 greedy 알고리즘을 진행하고

step수가 증가할 수록 일정 값의 정확도를 갖는 것을 확인할 수 있습니다.

Plus

np.random.rand와 np.random.randn의 주요 차이점

1. np.random.rand

균일 분포 (Uniform Distribution)에서 난수 생성
0과 1사이의 난수를 생성(0, 1)
모든 값이 나올 확률이 동일함

import numpy as np
import matplotlib.pyplot as plt

rand_nums = np.random.rand(1000)
plt.hist(rand_nums, bins=50)
plt.title("np.random.rand 분포")
plt.show()

2. np.random.randn

표준 정규분포(Standard Normal Distribution)에서 난수 생성
평균이 0이고 표준편차가 1인 가우시안 정규 분포를 따름
중앙값(0) 근처의 값이 나올 확률이 더 높음

import numpy as np
import matplotlib.pyplot as plt

randn_nums = np.random.randn(1000)
plt.hist(randn_nums, bins = 50)
plt.title("np.random.randn 분포")
plt.show()

rand : 확률 기반 선택, 무작위 가중치 초기화 등에 사용
randn : 정규 분포를 따르는 노이즈 생성, 신경망의 가중치 초기화 등에 사용

저작자표시 비영리 변경금지

Bandit Problem

Code

Plus

np.random.rand와 np.random.randn의 주요 차이점

티스토리툴바