Notes on Bayesian Optimization #
tip
- git clone -> cd
- pip install -r requirements.txt (jax, scipy, botorch, optuna)
Optimization Under Expensive Evaluations #
- Bayesian Optimization (BO) solves black-box optimization problems where each evaluation is expensive:
assumptions:
- gradients are unavailable or too noisy
- objective evaluations are costly (training loops, simulations, experiments)
- query budget is limited, so every new point matters
BO minimizes regret by combining:
- a probabilistic surrogate of \(f(x)\)
- an acquisition strategy that controls exploration vs exploitation
Surrogate Modelling with Gaussian Processes #
- BO commonly models the unknown function with a Gaussian process prior: \[f(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right)\]
- after observing data \(\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{t}\), the prediction at \(x\) is Gaussian with \[\mu_t(x) = k_t(x)^{\top} (K_t + \sigma_n^2 I)^{-1} \mathbf{y}_t, \qquad \sigma_t^2(x) = k(x, x) - k_t(x)^{\top} (K_t + \sigma_n^2 I)^{-1} k_t(x)\] where \(K_t\) is the kernel matrix over the observed inputs, \(k_t(x)\) is the vector of covariances between \(x\) and those inputs, and \(\sigma_n^2\) is the observation-noise variance
interpretation:
- \(\mu_t(x)\) is the current estimate of performance
- \(\sigma_t(x)\) captures epistemic uncertainty
this posterior is the foundation for deciding where to sample next
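The posterior can be sketched in a few lines of plain numpy, assuming a zero-mean GP with an RBF kernel and illustrative hyperparameters (`rbf_kernel` and `gp_posterior` are names chosen here, not library functions):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of 1-D points."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean mu_t(x) and std sigma_t(x) of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, x_query)       # cross-covariances k_t(x)
    K_inv = np.linalg.inv(K)
    mu = k_star.T @ K_inv @ y_train             # posterior mean
    var = rbf_kernel(x_query, x_query).diagonal() - np.einsum(
        "ij,ik,kj->j", k_star, K_inv, k_star)   # posterior variance
    return mu, np.sqrt(np.maximum(var, 0.0))

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
mu, sd = gp_posterior(x_train, y_train, np.array([1.5]))
```

Near the training points the predictive std collapses toward the noise level; far away it reverts to the prior variance, which is exactly what drives exploration.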
Acquisition Functions #
- acquisition functions convert posterior mean/uncertainty into an optimization signal:
- common choices:
- Upper Confidence Bound (UCB) \[\alpha_{\text{UCB}}(x) = \mu_t(x) + \kappa \sigma_t(x)\]
- Probability of Improvement (PI) \[\alpha_{\text{PI}}(x) = \Phi\left(\frac{\mu_t(x)-f(x^+)-\xi}{\sigma_t(x)}\right)\]
- Expected Improvement (EI) \[\alpha_{\text{EI}}(x)= (\mu_t(x)-f(x^+)-\xi)\Phi(z)+\sigma_t(x)\phi(z), \quad z=\frac{\mu_t(x)-f(x^+)-\xi}{\sigma_t(x)}\]
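Expected Improvement follows directly from the posterior; a minimal sketch for maximization using scipy, where `best_f` stands in for \(f(x^+)\) and the floor on \(\sigma\) is an illustrative numerical safeguard:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f, xi=0.01):
    """EI for maximization; mu and sigma are posterior mean/std arrays."""
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    z = (mu - best_f - xi) / sigma
    return (mu - best_f - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```

The first term rewards points whose mean already beats the incumbent (exploitation); the second rewards high uncertainty (exploration), so EI is non-negative everywhere.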
important
- BO performance depends heavily on sensible kernel choices, robust hyperparameter fitting, and stable optimization of the acquisition objective.
Canonical BO Algorithm #
- define bounded search space \(\mathcal{X}\)
- collect initial points (random or Latin hypercube)
- fit/update the GP surrogate with current observations
- maximize acquisition function to propose \(x_{t+1}\)
- evaluate expensive objective at \(x_{t+1}\)
- append \((x_{t+1}, y_{t+1})\) and repeat until budget is exhausted
- practical stop rules:
- fixed evaluation budget
- small improvement over last N steps
- hitting a target metric
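The steps above can be sketched end to end; this toy loop uses a fixed RBF kernel, random candidate search as the acquisition maximizer, and a stand-in quadratic objective (all illustrative choices, not a production setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x):                       # stand-in for an expensive evaluation
    return -(x - 0.3) ** 2

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def posterior(xs, ys, xq, noise=1e-6):
    K = rbf(xs, xs) + noise * np.eye(len(xs))
    ks = rbf(xs, xq)
    K_inv = np.linalg.inv(K)
    mu = ks.T @ K_inv @ ys
    var = 1.0 - np.einsum("ij,ik,kj->j", ks, K_inv, ks)
    return mu, np.sqrt(np.maximum(var, 0.0))

# 1) bounded search space [0, 1]; 2) initial random design
xs = rng.uniform(0.0, 1.0, size=3)
ys = objective(xs)

for _ in range(10):                     # repeat until budget is exhausted
    cand = rng.uniform(0.0, 1.0, size=256)
    mu, sd = posterior(xs, ys, cand)    # 3) fit/update surrogate
    ucb = mu + 2.0 * sd                 # 4) maximize UCB acquisition
    x_next = cand[np.argmax(ucb)]
    y_next = objective(x_next)          # 5) evaluate expensive objective
    xs, ys = np.append(xs, x_next), np.append(ys, y_next)

print(xs[np.argmax(ys)])                # best point found, near 0.3
```

A real implementation would refit kernel hyperparameters each round and use a proper acquisition optimizer (e.g. multi-start L-BFGS), as botorch does.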
Practical Engineering Notes #
- normalize inputs to similar scales before fitting the surrogate
- standardize outputs to improve numerical conditioning
- represent known observation noise in the likelihood
- use log-scaled domains for parameters spanning multiple orders of magnitude
- for higher dimensions, combine BO with trust-region methods or structured priors
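A small sketch of the first and fourth points, with illustrative parameter ranges (`to_unit_cube` is a name chosen here):

```python
import numpy as np

def to_unit_cube(x, low, high, log_scale=False):
    """Map a raw parameter into [0, 1]; use log scale for wide ranges."""
    x, low, high = map(np.asarray, (x, low, high))
    if log_scale:
        return (np.log(x) - np.log(low)) / (np.log(high) - np.log(low))
    return (x - low) / (high - low)

# learning rate spans orders of magnitude -> log scale; depth is linear
print(to_unit_cube(1e-3, 1e-5, 1e-1, log_scale=True))  # 0.5
print(to_unit_cube(7, 2, 12))                          # 0.5
```

Without the log transform, almost the entire learning-rate range would be squashed near 0 and the GP length scale could not treat 1e-5 and 1e-4 as meaningfully different.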
How Optuna Framework Works #
Optuna provides a flexible orchestration layer for hyperparameter optimization, including Bayesian-style samplers and early stopping.
building blocks:
- Study: full optimization process and trial history
- Trial: one objective evaluation with suggested parameters
- Sampler: chooses parameters (default: TPE; alternatives: random, CMA-ES, NSGA-II)
- Pruner: stops weak trials early based on intermediate reports
internal loop:
- user defines an objective function
- study.optimize starts repeated trials
- sampler proposes parameter values (trial.suggest_*)
- objective returns a metric (or reports intermediate steps)
- pruner optionally interrupts underperforming trials
- study updates best params and complete trial database
```python
import optuna

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 12)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    score = train_and_validate_model(
        learning_rate=learning_rate, depth=depth, dropout=dropout
    )
    return score  # maximize here

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(study.best_trial.number)
print(study.best_params)
print(study.best_value)
```
note
- Optuna can persist studies in SQLite/PostgreSQL and run distributed workers, making it practical for real training pipelines.