# Exploration-Exploitation Trade-off
## Overview
The exploration-exploitation trade-off is a fundamental dilemma in decision-making systems: whether to exploit current knowledge for immediate rewards or explore to gather more information for potentially better future rewards.
Links to:
- [[decision_theory]] - Theoretical framework
- [[active_inference]] - Free energy perspective
- [[reinforcement_learning]] - Learning framework
## Active Inference Perspective
In active inference, this trade-off emerges naturally through the Expected Free Energy components:
1. **Exploitation (Pragmatic Value)**
- Maximizes expected reward
- Uses current beliefs
- Links to [[pragmatic_value]]
2. **Exploration (Epistemic Value)**
- Reduces uncertainty
- Gathers information
- Links to [[epistemic_value]]
## Mathematical Formulation
The trade-off is governed by the balance between the epistemic and pragmatic terms of the expected free energy, which is minimized over policies:
$G(\pi) = -\underbrace{\text{epistemic value}}_{\text{exploration}} - \underbrace{\text{pragmatic value}}_{\text{exploitation}}$
Policies that minimize $G(\pi)$ therefore jointly maximize expected information gain (exploration) and expected reward under prior preferences (exploitation).
Links to:
- [[expected_free_energy]] - Full formulation
- [[efe_components]] - Component details
- [[information_gain]] - Exploration measure
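As a minimal numerical sketch of this balance (the epistemic and pragmatic values below are made up for illustration), the policy with the lowest expected free energy is the one that best trades information gain against expected reward:
```python
import numpy as np

# Hypothetical values for three candidate policies (illustrative only)
epistemic_value = np.array([0.9, 0.4, 0.1])  # expected information gain
pragmatic_value = np.array([0.2, 0.5, 1.1])  # expected reward under preferences

# Expected free energy: lower G means a better exploration-exploitation balance
G = -epistemic_value - pragmatic_value

print("G per policy:", G)                # [-1.1 -0.9 -1.2]
print("selected policy:", np.argmin(G))  # 2
```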
## Implementation
```python
def compute_explore_exploit_ratio(
epistemic: float,
pragmatic: float,
temperature: float = 1.0
) -> float:
"""Compute exploration-exploitation ratio.
Args:
epistemic: Epistemic value (exploration)
pragmatic: Pragmatic value (exploitation)
temperature: Temperature parameter
Returns:
Ratio of exploration vs exploitation
"""
return temperature * (epistemic / (abs(pragmatic) + 1e-10))
```
Links to:
- [[temperature_parameter]] - Control parameter
- [[numerical_methods]] - Implementation details
- [[policy_selection]] - Action selection
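For example (with arbitrary values), the ratio scales with both the temperature and the relative magnitude of the epistemic term:
```python
# Usage example with arbitrary values
ratio_cold = compute_explore_exploit_ratio(epistemic=0.8, pragmatic=0.4, temperature=0.5)
ratio_hot = compute_explore_exploit_ratio(epistemic=0.8, pragmatic=0.4, temperature=2.0)
print(f"low temperature: {ratio_cold:.2f}, high temperature: {ratio_hot:.2f}")
# low temperature: 1.00, high temperature: 4.00
```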
## Control Mechanisms
### Temperature Parameter
- Controls exploration tendency (see the sketch after this list)
- Higher values favor exploration
- Lower values favor exploitation
- Links to:
- [[softmax_function]] - Policy selection
- [[annealing_schedule]] - Temperature dynamics
- [[optimization_parameters]] - Parameter tuning
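A minimal sketch of how the temperature shapes a softmax policy over hypothetical action values, as referenced above:
```python
import numpy as np

def softmax_policy(values: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over action values; higher temperature flattens the distribution."""
    logits = values / temperature
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

values = np.array([1.0, 0.5, 0.2])  # hypothetical action values
print(softmax_policy(values, temperature=0.1))  # near-greedy: exploitation
print(softmax_policy(values, temperature=5.0))  # near-uniform: exploration
```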
### Adaptive Strategies
- Dynamic temperature adjustment (see the annealing sketch after this list)
- Uncertainty-based exploration
- Information-seeking policies
- Links to:
- [[adaptive_control]] - Control theory
- [[uncertainty_estimation]] - Uncertainty measures
- [[information_seeking]] - Active strategies
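One simple way to realize dynamic temperature adjustment is an exponential annealing schedule; the constants below are illustrative, not prescriptive:
```python
import numpy as np

def annealed_temperature(step: int,
                         t_start: float = 2.0,
                         t_end: float = 0.1,
                         decay: float = 0.01) -> float:
    """Exponentially decay the temperature from t_start towards t_end."""
    return t_end + (t_start - t_end) * np.exp(-decay * step)

# Early steps favor exploration, later steps favor exploitation
print([round(annealed_temperature(s), 2) for s in (0, 100, 500)])  # [2.0, 0.8, 0.11]
```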
## Applications
1. **Active Inference**
- Natural emergence from free energy
- Balanced through precision
- Links to [[active_inference_applications]]
2. **Reinforcement Learning**
- ε-greedy strategies (sketched after this list)
- Thompson sampling
- Upper confidence bounds
- Links to:
- [[epsilon_greedy]] - Basic strategy
- [[thompson_sampling]] - Bayesian approach
- [[ucb_algorithms]] - Confidence bounds
3. **Multi-armed Bandits**
- Classic exploration problem
- Online learning setting
- Links to:
- [[bandit_problems]] - Problem formulation
- [[regret_minimization]] - Optimization goal
- [[optimal_stopping]] - When to stop exploring
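As a minimal, self-contained sketch of the ε-greedy strategy on a three-armed Bernoulli bandit (arm means and ε are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])  # unknown Bernoulli arm means
epsilon, T = 0.1, 5000
counts = np.zeros(3)
estimates = np.zeros(3)

for t in range(T):
    if rng.random() < epsilon:            # explore: random arm
        arm = int(rng.integers(3))
    else:                                 # exploit: current best estimate
        arm = int(np.argmax(estimates))
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean

print("estimated means:", np.round(estimates, 2))  # ≈ true_means for well-sampled arms
```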
## Analysis Methods
1. **Performance Metrics**
- Cumulative reward
- Information gain
- Regret bounds (see the regret sketch after this list)
- Links to:
- [[reward_analysis]] - Reward metrics
- [[information_metrics]] - Information measures
- [[regret_analysis]] - Regret computation
2. **Visualization**
- Exploration trajectories
- Value landscapes
- Decision boundaries
- Links to:
- [[trajectory_plots]] - Path visualization
- [[value_visualization]] - Value plotting
- [[decision_visualization]] - Choice analysis
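A small sketch of the regret metric mentioned above: cumulative pseudo-regret is the running sum of gaps between the optimal arm's mean and the mean of the arm actually pulled (the trace below is made up):
```python
import numpy as np

def cumulative_regret(chosen_means: np.ndarray, optimal_mean: float) -> np.ndarray:
    """Cumulative pseudo-regret: running sum of per-step suboptimality gaps."""
    return np.cumsum(optimal_mean - chosen_means)

chosen = np.array([0.2, 0.5, 0.8, 0.8, 0.5, 0.8])  # means of the arms pulled each step
print(cumulative_regret(chosen, optimal_mean=0.8))
# flat segments mean the optimal arm was pulled; growth indicates exploration cost
```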
## Related Concepts
- [[multi_agent_learning]] - Multiple learners
- [[hierarchical_exploration]] - Structured exploration
- [[optimal_control]] - Control theory
- [[information_theory]] - Information measures
- [[decision_making]] - Decision processes
## Advanced Theoretical Framework
### Information-Theoretic Foundation
**Definition** (Information-Theoretic Exploration): The exploration benefit can be quantified using mutual information:
$I(\pi) = H(O) - \mathbb{E}_\pi[H(O|S)]$
where $H(O)$ is the marginal entropy of observations and $H(O|S)$ is their conditional entropy given states; the difference is the mutual information between observations and states under policy $\pi$.
**Theorem** (Optimal Exploration Strategy): For a finite horizon $T$, the optimal exploration strategy maximizes:
$\pi^* = \arg\max_\pi \sum_{t=0}^T \gamma^t \left[\mathbb{E}[R_t|\pi] + \beta I_t(\pi)\right]$
where $\gamma$ is a discount factor, $R_t$ is immediate reward, and $\beta$ controls exploration intensity.
```python
import numpy as np
import scipy.special
import scipy.stats
from typing import Any, Callable, Dict, List

class AdvancedExplorationExploitation:
"""Advanced framework for exploration-exploitation with rigorous mathematical foundation."""
def __init__(self,
state_dim: int,
action_dim: int,
exploration_horizon: int = 10,
discount_factor: float = 0.99):
"""Initialize advanced exploration-exploitation system.
Args:
state_dim: Dimension of state space
action_dim: Dimension of action space
exploration_horizon: Planning horizon for exploration
discount_factor: Temporal discount factor
"""
self.state_dim = state_dim
self.action_dim = action_dim
self.horizon = exploration_horizon
self.gamma = discount_factor
# Initialize belief state and uncertainty estimates
self.belief_state = np.zeros(state_dim)
self.uncertainty_matrix = np.eye(state_dim)
# Information gain tracking
self.information_history = []
self.reward_history = []
def compute_expected_information_gain(self,
policy: np.ndarray,
belief_state: np.ndarray,
uncertainty: np.ndarray) -> float:
"""Compute expected information gain for a policy.
Information gain quantifies reduction in uncertainty:
IG = H(θ) - E[H(θ|o)]
Args:
policy: Policy to evaluate
belief_state: Current belief about state
uncertainty: Uncertainty covariance matrix
Returns:
Expected information gain
"""
# Prior entropy
prior_entropy = 0.5 * np.log((2 * np.pi * np.e)**self.state_dim * np.linalg.det(uncertainty))
# Expected posterior entropy (simplified approximation)
# Full computation would require integrating over observation space
observation_noise = 0.1 * np.eye(self.state_dim)
posterior_uncertainty = np.linalg.inv(
np.linalg.inv(uncertainty) + np.linalg.inv(observation_noise)
)
posterior_entropy = 0.5 * np.log(
(2 * np.pi * np.e)**self.state_dim * np.linalg.det(posterior_uncertainty)
)
return prior_entropy - posterior_entropy
def compute_epistemic_value(self,
actions: np.ndarray,
belief_states: np.ndarray,
uncertainties: np.ndarray) -> np.ndarray:
"""Compute epistemic value for different actions.
Epistemic value measures information-seeking utility:
V_epistemic(a) = E[IG(θ; o) | a]
Args:
actions: Array of candidate actions
belief_states: Corresponding belief states
uncertainties: Uncertainty estimates
Returns:
Epistemic values for each action
"""
epistemic_values = np.zeros(len(actions))
for i, action in enumerate(actions):
# Simulate observation under this action
predicted_observation = self._predict_observation(action, belief_states[i])
# Compute information gain
ig = self._compute_mutual_information(
belief_states[i], uncertainties[i], predicted_observation
)
epistemic_values[i] = ig
return epistemic_values
def compute_pragmatic_value(self,
actions: np.ndarray,
belief_states: np.ndarray,
reward_function: Callable) -> np.ndarray:
"""Compute pragmatic value for different actions.
Pragmatic value measures immediate utility:
V_pragmatic(a) = E[R(s,a) | belief]
Args:
actions: Array of candidate actions
belief_states: Corresponding belief states
reward_function: Reward function R(s,a)
Returns:
Pragmatic values for each action
"""
pragmatic_values = np.zeros(len(actions))
for i, action in enumerate(actions):
# Expected reward under current belief
expected_reward = reward_function(belief_states[i], action)
pragmatic_values[i] = expected_reward
return pragmatic_values
def thompson_sampling_exploration(self,
num_samples: int = 100,
temperature: float = 1.0) -> Dict[str, Any]:
"""Implement Thompson sampling for exploration.
Thompson sampling draws samples from posterior and acts greedily.
Args:
num_samples: Number of posterior samples
temperature: Exploration temperature
Returns:
Thompson sampling results
"""
# Sample from posterior belief
posterior_samples = np.random.multivariate_normal(
self.belief_state, self.uncertainty_matrix, num_samples
)
# Compute action values for each sample
action_values = np.zeros((num_samples, self.action_dim))
for i, sample in enumerate(posterior_samples):
for j in range(self.action_dim):
action = np.zeros(self.action_dim)
action[j] = 1.0
action_values[i, j] = self._evaluate_action(sample, action)
# Softmax action selection with temperature
probabilities = scipy.special.softmax(action_values / temperature, axis=1)
selected_actions = np.array([
np.random.choice(self.action_dim, p=prob) for prob in probabilities
])
return {
'posterior_samples': posterior_samples,
'action_values': action_values,
'action_probabilities': probabilities,
'selected_actions': selected_actions,
'exploration_diversity': np.std(selected_actions)
}
def upper_confidence_bound(self,
confidence_level: float = 0.95,
exploration_bonus: float = 1.0) -> Dict[str, Any]:
"""Implement Upper Confidence Bound exploration.
UCB balances exploitation and exploration via confidence intervals.
Args:
confidence_level: Confidence level for bounds
exploration_bonus: Multiplier for exploration term
Returns:
UCB analysis results
"""
# Compute confidence intervals
confidence_radius = scipy.stats.norm.ppf((1 + confidence_level) / 2)
action_means = np.zeros(self.action_dim)
action_uncertainties = np.zeros(self.action_dim)
for j in range(self.action_dim):
action = np.zeros(self.action_dim)
action[j] = 1.0
# Mean value estimate
action_means[j] = self._evaluate_action(self.belief_state, action)
            # Uncertainty estimate: quadratic form with the action embedded
            # in state space, since action_dim and state_dim may differ
            action_embedded = self._embed_action(action)
            action_uncertainties[j] = np.sqrt(
                action_embedded @ self.uncertainty_matrix @ action_embedded
            )
# Upper confidence bounds
ucb_values = (action_means +
exploration_bonus * confidence_radius * action_uncertainties)
# Select action with highest UCB
best_action = np.argmax(ucb_values)
return {
'action_means': action_means,
'action_uncertainties': action_uncertainties,
'ucb_values': ucb_values,
'selected_action': best_action,
'confidence_intervals': np.column_stack([
action_means - confidence_radius * action_uncertainties,
action_means + confidence_radius * action_uncertainties
])
}
def compute_regret_bounds(self,
time_horizon: int,
suboptimality_gaps: np.ndarray) -> Dict[str, float]:
"""Compute theoretical regret bounds for exploration strategies.
Args:
time_horizon: Total time horizon
suboptimality_gaps: Gap between optimal and suboptimal actions
Returns:
Regret bound analysis
"""
# UCB regret bound: O(√(K T log T))
K = self.action_dim
ucb_regret_bound = np.sqrt(K * time_horizon * np.log(time_horizon))
# Thompson sampling regret bound (Bayesian setting)
thompson_regret_bound = np.sqrt(K * time_horizon)
# ε-greedy regret bound
epsilon = 0.1 # Typical exploration rate
epsilon_greedy_bound = epsilon * time_horizon + (1 - epsilon) * np.sum(suboptimality_gaps)
return {
'ucb_regret_bound': ucb_regret_bound,
'thompson_regret_bound': thompson_regret_bound,
'epsilon_greedy_bound': epsilon_greedy_bound,
'best_theoretical_bound': min(ucb_regret_bound, thompson_regret_bound),
'suboptimality_gaps': suboptimality_gaps
}
def adaptive_exploration_rate(self,
time_step: int,
performance_history: List[float],
uncertainty_history: List[float]) -> float:
"""Compute adaptive exploration rate based on performance and uncertainty.
Args:
time_step: Current time step
performance_history: History of performance metrics
uncertainty_history: History of uncertainty measures
Returns:
Adaptive exploration rate
"""
# Base exploration rate (decreasing with time)
base_rate = 1.0 / np.sqrt(time_step + 1)
# Performance-based adjustment
if len(performance_history) > 1:
recent_improvement = performance_history[-1] - performance_history[-2]
performance_factor = 1.0 + np.tanh(recent_improvement)
else:
performance_factor = 1.0
# Uncertainty-based adjustment
current_uncertainty = uncertainty_history[-1] if uncertainty_history else 1.0
uncertainty_factor = np.sqrt(current_uncertainty)
# Combined adaptive rate
adaptive_rate = base_rate * performance_factor * uncertainty_factor
return np.clip(adaptive_rate, 0.01, 0.99) # Keep in reasonable bounds
def information_directed_sampling(self,
action_space: np.ndarray,
information_ratio: float = 1.0) -> Dict[str, Any]:
"""Implement information-directed sampling strategy.
IDS optimizes the ratio of information gain to squared regret.
Args:
action_space: Available actions
information_ratio: Weight for information vs regret
Returns:
IDS results
"""
num_actions = len(action_space)
# Compute information gains and regrets
information_gains = np.zeros(num_actions)
squared_regrets = np.zeros(num_actions)
for i, action in enumerate(action_space):
# Information gain
ig = self.compute_expected_information_gain(
action, self.belief_state, self.uncertainty_matrix
)
information_gains[i] = ig
# Squared regret (simplified)
optimal_value = np.max([self._evaluate_action(self.belief_state, a)
for a in action_space])
action_value = self._evaluate_action(self.belief_state, action)
squared_regrets[i] = (optimal_value - action_value)**2
# Information-directed ratio
ids_ratios = information_gains / (squared_regrets + 1e-10)
# Select action maximizing information ratio
best_action_idx = np.argmax(ids_ratios)
return {
'information_gains': information_gains,
'squared_regrets': squared_regrets,
'ids_ratios': ids_ratios,
'selected_action': action_space[best_action_idx],
'selected_action_index': best_action_idx
}
    def _embed_action(self, action: np.ndarray) -> np.ndarray:
        """Embed an action vector into state space (zero-padded or truncated)."""
        embedded = np.zeros(self.state_dim)
        n = min(self.state_dim, len(action))
        embedded[:n] = action[:n]
        return embedded
    def _predict_observation(self, action: np.ndarray, state: np.ndarray) -> np.ndarray:
        """Predict observation given action and state."""
        # Simplified linear observation model with additive Gaussian noise
        return state + 0.1 * self._embed_action(action) + 0.05 * np.random.randn(self.state_dim)
    def _compute_mutual_information(self,
                                    state: np.ndarray,
                                    uncertainty: np.ndarray,
                                    observation: np.ndarray) -> float:
        """Compute mutual information between state and observation."""
        # Simplified proxy; the full computation requires integrating over observations
        return 0.5 * np.log(np.linalg.det(uncertainty) + 1e-10)
    def _evaluate_action(self, state: np.ndarray, action: np.ndarray) -> float:
        """Evaluate action value for given state."""
        # Simplified quadratic value function in state space
        return -np.linalg.norm(state - self._embed_action(action))**2
```

### Multi-Armed Bandit Extensions

```python
class ContextualBandit:
"""Contextual bandit with sophisticated exploration strategies."""
def __init__(self,
context_dim: int,
num_arms: int,
regularization: float = 1.0):
"""Initialize contextual bandit.
Args:
context_dim: Dimension of context vectors
num_arms: Number of bandit arms
regularization: Regularization parameter
"""
self.context_dim = context_dim
self.num_arms = num_arms
self.reg = regularization
# Initialize linear models for each arm
        self.arm_weights = [np.zeros(context_dim) for _ in range(num_arms)]
        self.arm_covariances = [regularization * np.eye(context_dim)
                                for _ in range(num_arms)]
        # Running sums of reward-weighted contexts (the b vectors in LinUCB)
        self.arm_rewards = [np.zeros(context_dim) for _ in range(num_arms)]
def linear_ucb(self,
context: np.ndarray,
confidence_width: float = 1.0) -> Dict[str, Any]:
"""Linear Upper Confidence Bound for contextual bandits.
Args:
context: Current context vector
confidence_width: Width of confidence intervals
Returns:
LinUCB results
"""
ucb_values = np.zeros(self.num_arms)
confidence_widths = np.zeros(self.num_arms)
for arm in range(self.num_arms):
# Predicted reward
predicted_reward = self.arm_weights[arm] @ context
# Confidence width
confidence = confidence_width * np.sqrt(
context @ np.linalg.inv(self.arm_covariances[arm]) @ context
)
confidence_widths[arm] = confidence
# Upper confidence bound
ucb_values[arm] = predicted_reward + confidence
# Select arm with highest UCB
selected_arm = np.argmax(ucb_values)
return {
'selected_arm': selected_arm,
'ucb_values': ucb_values,
'confidence_widths': confidence_widths,
'predicted_rewards': [w @ context for w in self.arm_weights]
}
def update_arm_model(self,
arm: int,
context: np.ndarray,
reward: float) -> None:
"""Update linear model for selected arm.
Args:
arm: Selected arm index
context: Observed context
reward: Observed reward
"""
        # Update sufficient statistics: A <- A + x x^T, b <- b + r x
        self.arm_covariances[arm] += np.outer(context, context)
        self.arm_rewards[arm] += reward * context
        # Recompute weights via regularized least squares w = A^{-1} b
        # (in practice, use incremental Sherman-Morrison updates for efficiency)
        self.arm_weights[arm] = np.linalg.inv(self.arm_covariances[arm]) @ self.arm_rewards[arm]
```

### Hierarchical Exploration

```python
class HierarchicalExploration:
"""Hierarchical exploration for structured environments."""
def __init__(self,
hierarchy_levels: int,
state_abstraction: Callable,
action_abstraction: Callable):
"""Initialize hierarchical exploration.
Args:
hierarchy_levels: Number of abstraction levels
state_abstraction: Function mapping states to abstract states
action_abstraction: Function mapping actions to abstract actions
"""
self.levels = hierarchy_levels
self.state_abs = state_abstraction
self.action_abs = action_abstraction
# Exploration at each level
self.level_explorers = [
AdvancedExplorationExploitation(state_dim=10, action_dim=5)
for _ in range(hierarchy_levels)
]
def hierarchical_policy_selection(self,
state: np.ndarray,
exploration_weights: np.ndarray) -> Dict[str, Any]:
"""Select policy using hierarchical exploration.
Args:
state: Current state
exploration_weights: Weights for each hierarchy level
Returns:
Hierarchical policy selection results
"""
level_policies = []
level_values = []
current_state = state
for level in range(self.levels):
# Abstract state at this level
abstract_state = self.state_abs(current_state, level)
# Get policy at this level
explorer = self.level_explorers[level]
actions = np.eye(explorer.action_dim) # Simple action space
epistemic_values = explorer.compute_epistemic_value(
actions, [abstract_state] * len(actions),
[explorer.uncertainty_matrix] * len(actions)
)
pragmatic_values = explorer.compute_pragmatic_value(
actions, [abstract_state] * len(actions),
lambda s, a: -np.linalg.norm(s - a)**2
)
# Combined values
combined_values = (exploration_weights[level] * epistemic_values +
(1 - exploration_weights[level]) * pragmatic_values)
# Select action at this level
selected_action_idx = np.argmax(combined_values)
selected_action = actions[selected_action_idx]
level_policies.append(selected_action)
level_values.append(combined_values)
# Update state for next level
current_state = self.action_abs(current_state, selected_action, level)
return {
'level_policies': level_policies,
'level_values': level_values,
'final_action': level_policies[-1],
'exploration_trace': exploration_weights
}
# Example usage and validation
def validate_advanced_exploration_exploitation():
"""Validate advanced exploration-exploitation framework."""
# Test advanced exploration-exploitation
explorer = AdvancedExplorationExploitation(state_dim=5, action_dim=3)
# Test Thompson sampling
thompson_result = explorer.thompson_sampling_exploration(num_samples=50)
print("Thompson sampling diversity:", thompson_result['exploration_diversity'])
# Test UCB
ucb_result = explorer.upper_confidence_bound()
print("UCB selected action:", ucb_result['selected_action'])
# Test contextual bandit
bandit = ContextualBandit(context_dim=4, num_arms=5)
context = np.random.randn(4)
linucb_result = bandit.linear_ucb(context)
print("LinUCB selected arm:", linucb_result['selected_arm'])
# Test regret bounds
subopt_gaps = np.array([0.1, 0.2, 0.05])
regret_bounds = explorer.compute_regret_bounds(1000, subopt_gaps)
print("UCB regret bound:", regret_bounds['ucb_regret_bound'])
if __name__ == "__main__":
validate_advanced_exploration_exploitation()
```
### Game-Theoretic Exploration
**Definition** (Nash Exploration Equilibrium): In multi-agent settings, the exploration equilibrium satisfies:
$\pi_i^* \in \arg\max_{\pi_i} \mathbb{E}\left[U_i(\pi_i, \pi_{-i}^*) + \beta I_i(\pi_i, \pi_{-i}^*)\right]$
where $U_i$ is utility and $I_i$ is information gain for agent $i$.
```python
import numpy as np
import scipy.special
from typing import Any, Callable, Dict, List

class GameTheoreticExploration:
"""Game-theoretic approach to multi-agent exploration."""
def __init__(self,
num_agents: int,
state_dim: int,
action_dim: int):
"""Initialize game-theoretic exploration.
Args:
num_agents: Number of agents
state_dim: State space dimension
action_dim: Action space dimension
"""
self.num_agents = num_agents
self.state_dim = state_dim
self.action_dim = action_dim
# Agent beliefs and uncertainties
self.agent_beliefs = [np.zeros(state_dim) for _ in range(num_agents)]
self.agent_uncertainties = [np.eye(state_dim) for _ in range(num_agents)]
def compute_nash_exploration_equilibrium(self,
utility_functions: List[Callable],
information_weights: np.ndarray,
max_iterations: int = 100) -> Dict[str, Any]:
"""Compute Nash equilibrium for exploration game.
Args:
utility_functions: Utility function for each agent
information_weights: Information seeking weights
max_iterations: Maximum iterations for convergence
Returns:
Nash exploration equilibrium
"""
# Initialize policies
policies = [np.ones(self.action_dim) / self.action_dim
for _ in range(self.num_agents)]
for iteration in range(max_iterations):
new_policies = []
for agent in range(self.num_agents):
# Compute best response
best_response = self._compute_best_response(
agent, policies, utility_functions[agent],
information_weights[agent]
)
new_policies.append(best_response)
# Check convergence
policy_change = sum(np.linalg.norm(new_policies[i] - policies[i])
for i in range(self.num_agents))
policies = new_policies
if policy_change < 1e-6:
break
return {
'equilibrium_policies': policies,
'iterations_to_convergence': iteration + 1,
'final_policy_change': policy_change
}
def _compute_best_response(self,
agent: int,
other_policies: List[np.ndarray],
utility_function: Callable,
info_weight: float) -> np.ndarray:
"""Compute best response for an agent."""
# Simplified best response computation
# Full implementation would solve optimization problem
action_utilities = np.zeros(self.action_dim)
for action in range(self.action_dim):
# Utility component
utility = utility_function(
self.agent_beliefs[agent], action, other_policies
)
# Information component
info_gain = self._compute_info_gain(agent, action, other_policies)
action_utilities[action] = utility + info_weight * info_gain
# Softmax policy
return scipy.special.softmax(action_utilities)
def _compute_info_gain(self,
agent: int,
action: int,
other_policies: List[np.ndarray]) -> float:
"""Compute information gain for agent's action."""
# Simplified information gain computation
return np.random.exponential(1.0) # Placeholder
```
### Optimal Exploration Theory
**Theorem** (UCB Regret Bound): For the UCB1 algorithm on a $K$-armed bandit with finite horizon $T$, the cumulative regret is bounded by:
$R_T(\pi) \leq \sum_{k: \Delta_k > 0} \frac{8\log T}{\Delta_k} + \left(1 + \frac{\pi^2}{3}\right) \sum_{k=1}^K \Delta_k$
where $\Delta_k$ is the suboptimality gap of arm $k$ (the $\pi$ in $\pi^2/3$ is the mathematical constant, not a policy).
**Proof Sketch**: Uses concentration inequalities and the principle of optimism in the face of uncertainty.
```python
import numpy as np
from typing import Any, Callable, Dict, List

class OptimalExplorationTheory:
"""Theoretical framework for optimal exploration with provable guarantees."""
def __init__(self,
horizon: int,
confidence_level: float = 0.95,
suboptimality_tolerance: float = 0.01):
"""Initialize optimal exploration theory framework.
Args:
horizon: Planning horizon T
confidence_level: Confidence level for bounds
suboptimality_tolerance: Tolerance for suboptimal actions
"""
self.T = horizon
self.delta = 1 - confidence_level
self.epsilon = suboptimality_tolerance
# Theoretical constants
self.exploration_constants = {
'ucb': 2.0,
'thompson': 1.0,
'epsilon_greedy': 1.0
}
def compute_information_ratio(self,
algorithm: str,
problem_instance: Dict[str, Any]) -> Dict[str, float]:
"""Compute information ratio for exploration algorithm.
Information ratio quantifies efficiency of information acquisition:
Γ_t = (Regret_t)² / I_t
where I_t is information gain at time t.
Args:
algorithm: Exploration algorithm name
problem_instance: Problem specification
Returns:
Information ratio analysis
"""
# Extract problem parameters
K = problem_instance.get('num_arms', 10)
gaps = np.array(problem_instance.get('suboptimality_gaps', [0.1] * K))
# Algorithm-specific analysis
if algorithm == 'thompson_sampling':
info_ratio_bound = self._thompson_sampling_info_ratio(gaps)
elif algorithm == 'ucb':
info_ratio_bound = self._ucb_info_ratio(gaps)
elif algorithm == 'information_directed_sampling':
info_ratio_bound = self._ids_info_ratio(gaps)
else:
raise ValueError(f"Unknown algorithm: {algorithm}")
# Compute regret bound using information ratio
regret_bound = self._compute_regret_from_info_ratio(
info_ratio_bound, self.T, gaps
)
return {
'information_ratio_bound': info_ratio_bound,
'regret_bound': regret_bound,
'algorithm': algorithm,
'problem_complexity': np.sum(1 / gaps**2)
}
def analyze_sample_complexity(self,
accuracy: float,
confidence: float,
problem_structure: Dict[str, Any]) -> Dict[str, int]:
"""Analyze sample complexity for different exploration objectives.
Computes number of samples needed to achieve (ε,δ)-PAC guarantees.
Args:
accuracy: Accuracy parameter ε
confidence: Confidence parameter δ
problem_structure: Structure of exploration problem
Returns:
Sample complexity bounds for different objectives
"""
# Pure exploration (best arm identification)
gaps = np.array(problem_structure.get('gaps', [0.1]))
complexity_factor = problem_structure.get('complexity', 1.0)
pure_exploration_bound = int(
(32 * np.log(2 / confidence) * complexity_factor) / (accuracy**2)
)
# Regret minimization
regret_minimization_bound = int(
(8 * np.log(self.T) * np.sum(1 / gaps)) / accuracy
)
# Fixed confidence
fixed_confidence_bound = int(
(2 * np.log(2 / confidence) * complexity_factor) / (accuracy**2)
)
# Fixed budget
fixed_budget_bound = self.T # Given fixed budget
return {
'pure_exploration': pure_exploration_bound,
'regret_minimization': regret_minimization_bound,
'fixed_confidence': fixed_confidence_bound,
'fixed_budget': fixed_budget_bound,
'theoretical_lower_bound': self._compute_lower_bound(accuracy, confidence, gaps)
}
def multi_objective_exploration(self,
objectives: List[Callable],
objective_weights: np.ndarray,
constraint_functions: List[Callable] = None) -> Dict[str, Any]:
"""Analyze multi-objective exploration with theoretical guarantees.
Considers trade-offs between multiple exploration objectives
subject to constraints.
Args:
objectives: List of objective functions
objective_weights: Weights for objectives
constraint_functions: Optional constraint functions
Returns:
Multi-objective exploration analysis
"""
# Pareto frontier analysis
pareto_analysis = self._compute_pareto_frontier(
objectives, objective_weights
)
# Scalarization approaches
scalarization_analysis = self._analyze_scalarization_methods(
objectives, objective_weights
)
# Constraint handling
if constraint_functions:
constrained_analysis = self._analyze_constrained_exploration(
objectives, constraint_functions
)
else:
constrained_analysis = {'type': 'unconstrained'}
# Theoretical guarantees
theoretical_guarantees = self._derive_multi_objective_bounds(
objectives, objective_weights, constraint_functions
)
return {
'pareto_analysis': pareto_analysis,
'scalarization_analysis': scalarization_analysis,
'constrained_analysis': constrained_analysis,
'theoretical_guarantees': theoretical_guarantees
}
def contextual_exploration_theory(self,
context_dimension: int,
feature_map: Callable,
regularity_assumptions: Dict[str, Any]) -> Dict[str, Any]:
"""Theoretical analysis of contextual exploration.
Analyzes exploration in contextual settings with function approximation.
Args:
context_dimension: Dimension of context space
feature_map: Feature mapping function φ(x,a)
regularity_assumptions: Assumptions about problem regularity
Returns:
Contextual exploration theory
"""
# Feature dimension
feature_dim = regularity_assumptions.get('feature_dimension', context_dimension)
# Smoothness parameters
smoothness = regularity_assumptions.get('smoothness', 1.0)
# Covering number analysis
covering_analysis = self._analyze_covering_numbers(
context_dimension, feature_dim, smoothness
)
# Regret bounds for contextual bandits
contextual_regret_bounds = self._compute_contextual_regret_bounds(
feature_dim, smoothness, covering_analysis
)
# Information gain bounds
info_gain_bounds = self._compute_contextual_info_gain_bounds(
feature_dim, smoothness
)
# Eluder dimension analysis
eluder_analysis = self._analyze_eluder_dimension(
feature_map, feature_dim
)
return {
'covering_analysis': covering_analysis,
'regret_bounds': contextual_regret_bounds,
'information_gain_bounds': info_gain_bounds,
'eluder_analysis': eluder_analysis,
'sample_complexity': self._compute_contextual_sample_complexity(
feature_dim, smoothness, self.T
)
}
def _thompson_sampling_info_ratio(self, gaps: np.ndarray) -> float:
"""Compute information ratio bound for Thompson sampling."""
return 1.0 # Known theoretical result
def _ucb_info_ratio(self, gaps: np.ndarray) -> float:
"""Compute information ratio bound for UCB."""
return 2.0 # Known theoretical result
def _ids_info_ratio(self, gaps: np.ndarray) -> float:
"""Compute information ratio bound for Information Directed Sampling."""
return 1.0 # Optimal information ratio
def _compute_regret_from_info_ratio(self,
info_ratio: float,
horizon: int,
gaps: np.ndarray) -> float:
"""Compute regret bound from information ratio."""
return np.sqrt(info_ratio * horizon * np.log(horizon))
def _compute_lower_bound(self,
accuracy: float,
confidence: float,
gaps: np.ndarray) -> int:
"""Compute information-theoretic lower bound."""
return int(np.sum(1 / gaps**2) * np.log(1 / confidence))
def _compute_pareto_frontier(self,
objectives: List[Callable],
weights: np.ndarray) -> Dict[str, Any]:
"""Compute Pareto frontier for multi-objective exploration."""
# Simplified Pareto analysis
num_points = 100
pareto_points = []
for i in range(num_points):
# Vary weights
w = np.random.dirichlet(weights)
# Optimize weighted sum (simplified)
objective_values = [w[j] * np.random.rand() for j in range(len(objectives))]
pareto_points.append(objective_values)
return {
'pareto_points': pareto_points,
'dominated_points': [], # Would compute dominated points
'efficient_frontier': pareto_points # Simplified
}
def _analyze_scalarization_methods(self,
objectives: List[Callable],
weights: np.ndarray) -> Dict[str, Any]:
"""Analyze different scalarization approaches."""
return {
'weighted_sum': 'Linear scalarization',
'chebyshev': 'Min-max scalarization',
'epsilon_constraint': 'Constraint method',
'achievement_scalarization': 'Goal programming'
}
def _analyze_constrained_exploration(self,
objectives: List[Callable],
constraints: List[Callable]) -> Dict[str, Any]:
"""Analyze exploration under constraints."""
return {
'constraint_type': 'inequality',
'lagrangian_approach': 'Available',
'penalty_methods': 'Available',
'barrier_methods': 'Available'
}
    def _derive_multi_objective_bounds(self,
                                       objectives: List[Callable],
                                       weights: np.ndarray,
                                       constraints: List[Callable] = None) -> Dict[str, Any]:
        """Derive theoretical bounds for multi-objective exploration."""
        return {
            'regret_bounds': 'O(√T log T)',
            'constraint_violation_bounds': 'O(√T)' if constraints else 'N/A',
            'convergence_rate': 'O(1/√T)'
        }
    def _analyze_covering_numbers(self,
                                  context_dim: int,
                                  feature_dim: int,
                                  smoothness: float) -> Dict[str, float]:
        """Simplified covering-number analysis for a linear feature class."""
        # Log-covering number scales roughly as d * log(1/epsilon) for linear classes
        return {'log_covering_number': feature_dim * np.log(1.0 / self.epsilon)}
    def _compute_contextual_regret_bounds(self,
                                          feature_dim: int,
                                          smoothness: float,
                                          covering_analysis: Dict[str, float]) -> float:
        """LinUCB-style regret bound O(d √(T log T)) (simplified)."""
        return feature_dim * np.sqrt(self.T * np.log(self.T))
    def _compute_contextual_info_gain_bounds(self,
                                             feature_dim: int,
                                             smoothness: float) -> float:
        """Maximum information gain bound O(d log T) for linear models (simplified)."""
        return feature_dim * np.log(self.T)
    def _analyze_eluder_dimension(self,
                                  feature_map: Callable,
                                  feature_dim: int) -> Dict[str, float]:
        """Eluder dimension of a linear class is O(d log(1/epsilon)) (simplified)."""
        return {'eluder_dimension': feature_dim * np.log(1.0 / self.epsilon)}
    def _compute_contextual_sample_complexity(self,
                                              feature_dim: int,
                                              smoothness: float,
                                              horizon: int) -> int:
        """Simplified sample-complexity estimate for contextual exploration."""
        return int(feature_dim * np.log(horizon) / (self.epsilon**2))
# Example usage and validation
def validate_optimal_exploration_theory():
"""Validate optimal exploration theory framework."""
print("Optimal Exploration Theory Validation")
print("=" * 50)
# Initialize theory framework
theory = OptimalExplorationTheory(horizon=1000)
# Test information ratio analysis
problem = {
'num_arms': 5,
'suboptimality_gaps': [0.1, 0.2, 0.05, 0.15, 0.3]
}
for algorithm in ['thompson_sampling', 'ucb']:
info_ratio_result = theory.compute_information_ratio(algorithm, problem)
print(f"{algorithm} information ratio: {info_ratio_result['information_ratio_bound']:.4f}")
print(f"{algorithm} regret bound: {info_ratio_result['regret_bound']:.4f}")
# Test sample complexity analysis
sample_complexity = theory.analyze_sample_complexity(
accuracy=0.1, confidence=0.05,
problem_structure={'gaps': [0.1, 0.2], 'complexity': 10}
)
print(f"Pure exploration sample complexity: {sample_complexity['pure_exploration']}")
print(f"Regret minimization sample complexity: {sample_complexity['regret_minimization']}")
# Test contextual exploration
contextual_analysis = theory.contextual_exploration_theory(
context_dimension=5,
feature_map=lambda x, a: np.concatenate([x, [a]]),
regularity_assumptions={'feature_dimension': 6, 'smoothness': 1.0}
)
print("Contextual exploration analysis completed")
if __name__ == "__main__":
validate_optimal_exploration_theory()
```
### Learning-Theoretic Perspective
**Definition** (Exploration Complexity): The exploration complexity $\mathcal{C}(\epsilon, \delta)$ is the minimum number of samples needed to identify an $\epsilon$-optimal policy with probability at least $1-\delta$.
**Theorem** (Instance-Dependent Lower Bound): For any algorithm and problem instance with gaps $\{\Delta_k\}$:
$\mathbb{E}[T_{\epsilon,\delta}] \geq \frac{\log(1/\delta)}{2} \sum_{k: \Delta_k > \epsilon} \frac{1}{\Delta_k^2}$
This establishes fundamental limits on sample efficiency for exploration.
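To make this bound concrete, a small worked evaluation for an illustrative gap vector and confidence level (the numbers are arbitrary):
```python
import numpy as np

def exploration_lower_bound(gaps: np.ndarray, epsilon: float, delta: float) -> float:
    """Instance-dependent lower bound on expected samples (theorem above)."""
    relevant = gaps[gaps > epsilon]
    return 0.5 * np.log(1.0 / delta) * np.sum(1.0 / relevant**2)

gaps = np.array([0.05, 0.1, 0.3])  # illustrative suboptimality gaps
print(exploration_lower_bound(gaps, epsilon=0.01, delta=0.05))  # ≈ 766 samples
```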
## References
- [[sutton_barto_2018]] - Reinforcement Learning
- [[friston_2017]] - Active Inference
- [[gittins_1979]] - Bandit Processes
- [[thrun_1992]] - Exploration Strategies
- [[bubeck_cesa_bianchi_2012]] - Regret Analysis
- [[russo_van_roy_2018]] - Information Ratio
- [[lattimore_szepesvari_2020]] - Bandit Algorithms
- [[agrawal_goyal_2012]] - Thompson Sampling Theory