Notes on Bayesian Optimization #
tip
- git clone -> cd
- pip install -r requirements.txt (jax, scipy, botorch, optuna)
Optimization Under Expensive Evaluations #
- Bayesian Optimization (BO) solves black-box optimization problems where each evaluation is expensive:
assumptions:
- gradients are unavailable or too noisy
- objective evaluations are costly (training loops, simulations, experiments)
- query budget is limited, so every new point matters
BO minimizes regret by combining:
- a probabilistic surrogate of \(f(x)\)
- an acquisition strategy that controls exploration vs exploitation
Surrogate Modelling with Gaussian Processes #
- BO commonly models the unknown function with a Gaussian process prior: \[f(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right)\]
- after observing data \(\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{t}\), the prediction at \(x\) is Gaussian with \[\mu_t(x) = k_t(x)^{\top} (K_t + \sigma_n^2 I)^{-1} \mathbf{y}_t, \qquad \sigma_t^2(x) = k(x, x) - k_t(x)^{\top} (K_t + \sigma_n^2 I)^{-1} k_t(x)\] where \(K_t\) is the kernel matrix over the observed inputs, \(k_t(x)\) is the vector of covariances between \(x\) and those inputs, and \(\sigma_n^2\) is the observation-noise variance
interpretation:
- \(\mu_t(x)\) is the current estimate of performance
- \(\sigma_t(x)\) captures epistemic uncertainty
this posterior is the foundation for deciding where to sample next
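The posterior can be sketched in a few lines of plain numpy, assuming a zero-mean GP with an RBF kernel and illustrative hyperparameters (`rbf_kernel` and `gp_posterior` are names chosen here, not library functions):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of 1-D points."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean mu_t(x) and std sigma_t(x) of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, x_query)       # cross-covariances k_t(x)
    K_inv = np.linalg.inv(K)
    mu = k_star.T @ K_inv @ y_train             # posterior mean
    var = rbf_kernel(x_query, x_query).diagonal() - np.einsum(
        "ij,ik,kj->j", k_star, K_inv, k_star)   # posterior variance
    return mu, np.sqrt(np.maximum(var, 0.0))

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
mu, sd = gp_posterior(x_train, y_train, np.array([1.5]))
```

Near the training points the predictive std collapses toward the noise level; far away it reverts to the prior variance, which is exactly what drives exploration.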
Acquisition Functions #
- acquisition functions convert posterior mean/uncertainty into an optimization signal:
- common choices:
- Upper Confidence Bound (UCB) \[\alpha_{\text{UCB}}(x) = \mu_t(x) + \kappa \sigma_t(x)\]
- Probability of Improvement (PI) \[\alpha_{\text{PI}}(x) = \Phi\left(\frac{\mu_t(x)-f(x^+)-\xi}{\sigma_t(x)}\right)\]
- Expected Improvement (EI) \[\alpha_{\text{EI}}(x)= (\mu_t(x)-f(x^+)-\xi)\Phi(z)+\sigma_t(x)\phi(z), \quad z=\frac{\mu_t(x)-f(x^+)-\xi}{\sigma_t(x)}\]
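Expected Improvement follows directly from the posterior; a minimal sketch for maximization using scipy, where `best_f` stands in for \(f(x^+)\) and the floor on \(\sigma\) is an illustrative numerical safeguard:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f, xi=0.01):
    """EI for maximization; mu and sigma are posterior mean/std arrays."""
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    z = (mu - best_f - xi) / sigma
    return (mu - best_f - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```

The first term rewards points whose mean already beats the incumbent (exploitation); the second rewards high uncertainty (exploration), so EI is non-negative everywhere.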
important
- BO performance depends heavily on sensible kernel choices, robust hyperparameter fitting, and stable optimization of the acquisition objective.
Canonical BO Algorithm #
- define bounded search space \(\mathcal{X}\)
- collect initial points (random or Latin hypercube)
- fit/update the GP surrogate with current observations
- maximize acquisition function to propose \(x_{t+1}\)
- evaluate expensive objective at \(x_{t+1}\)
- append \((x_{t+1}, y_{t+1})\) and repeat until budget is exhausted
- practical stop rules:
- fixed evaluation budget
- small improvement over last N steps
- hitting a target metric
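The steps above can be sketched end to end; this toy loop uses a fixed RBF kernel, random candidate search as the acquisition maximizer, and a stand-in quadratic objective (all illustrative choices, not a production setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x):                       # stand-in for an expensive evaluation
    return -(x - 0.3) ** 2

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def posterior(xs, ys, xq, noise=1e-6):
    K = rbf(xs, xs) + noise * np.eye(len(xs))
    ks = rbf(xs, xq)
    K_inv = np.linalg.inv(K)
    mu = ks.T @ K_inv @ ys
    var = 1.0 - np.einsum("ij,ik,kj->j", ks, K_inv, ks)
    return mu, np.sqrt(np.maximum(var, 0.0))

# 1) bounded search space [0, 1]; 2) initial random design
xs = rng.uniform(0.0, 1.0, size=3)
ys = objective(xs)

for _ in range(10):                     # repeat until budget is exhausted
    cand = rng.uniform(0.0, 1.0, size=256)
    mu, sd = posterior(xs, ys, cand)    # 3) fit/update surrogate
    ucb = mu + 2.0 * sd                 # 4) maximize UCB acquisition
    x_next = cand[np.argmax(ucb)]
    y_next = objective(x_next)          # 5) evaluate expensive objective
    xs, ys = np.append(xs, x_next), np.append(ys, y_next)

print(xs[np.argmax(ys)])                # best point found, near 0.3
```

A real implementation would refit kernel hyperparameters each round and use a proper acquisition optimizer (e.g. multi-start L-BFGS), as botorch does.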
Practical Engineering Notes #
- normalize inputs to similar scales before fitting the surrogate
- standardize outputs to improve numerical conditioning
- represent known observation noise in the likelihood
- use log-scaled domains for parameters spanning multiple orders of magnitude
- for higher dimensions, combine BO with trust-region methods or structured priors
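A small sketch of the first and fourth points, with illustrative parameter ranges (`to_unit_cube` is a name chosen here):

```python
import numpy as np

def to_unit_cube(x, low, high, log_scale=False):
    """Map a raw parameter into [0, 1]; use log scale for wide ranges."""
    x, low, high = map(np.asarray, (x, low, high))
    if log_scale:
        return (np.log(x) - np.log(low)) / (np.log(high) - np.log(low))
    return (x - low) / (high - low)

# learning rate spans orders of magnitude -> log scale; depth is linear
print(to_unit_cube(1e-3, 1e-5, 1e-1, log_scale=True))  # 0.5
print(to_unit_cube(7, 2, 12))                          # 0.5
```

Without the log transform, almost the entire learning-rate range would be squashed near 0 and the GP length scale could not treat 1e-5 and 1e-4 as meaningfully different.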
How Optuna Framework Works #
Optuna provides a flexible orchestration layer for hyperparameter optimization, including Bayesian-style samplers and early stopping.
building blocks:
- Study: full optimization process and trial history
- Trial: one objective evaluation with suggested parameters
- Sampler: chooses parameters (default: TPE; alternatives: random, CMA-ES, NSGA-II)
- Pruner: stops weak trials early based on intermediate reports
internal loop:
- user defines an objective function
- study.optimize starts repeated trials
- sampler proposes parameter values (trial.suggest_*)
- objective returns a metric (or reports intermediate steps)
- pruner optionally interrupts underperforming trials
- study updates best params and complete trial database
```python
import optuna

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 12)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    score = train_and_validate_model(
        learning_rate=learning_rate, depth=depth, dropout=dropout
    )
    return score  # maximize here

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(study.best_trial.number)
print(study.best_params)
print(study.best_value)
```
note
- Optuna can persist studies in SQLite/PostgreSQL and run distributed workers, making it practical for real training pipelines.