MCMC Estimation
Sampler
BFlux.MCMCState — Type

abstract type MCMCState end
Every MCMC method must be implemented via an MCMCState, which keeps track of all important information.
Mandatory Fields
- samples::Matrix : matrix of samples, of dimension num_total_parameter × num_samples
- nsampled::Int : the number of samples taken so far
Mandatory Functions
- update!(sampler, θ, bnn, ∇θ) : where θ is the current parameter vector and ∇θ(θ) is a function providing gradients. The function must return θ and the number of samples taken so far.
- initialise!(sampler, θ, numsamples; continue_sampling) : initialises the sampler. If continue_sampling is true, the final goal is to obtain numsamples samples in total, and thus only the remaining ones still need to be sampled.
- calculate_epochs(sampler, numbatches, numsamples; continue_sampling) : calculates and returns the number of epochs that must be run in order to obtain numsamples samples when numbatches batches are used per epoch. If continue_sampling is true, the goal is to obtain numsamples samples in total, so only the number of epochs still needed to reach this total must be returned, NOT the number of epochs needed to sample numsamples new samples.
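To make this contract concrete, here is a hypothetical skeleton of a custom sampler implementing the interface. The type name and the placeholder update rule are made up for illustration; only the fields and function signatures follow the description above.

```julia
using BFlux
import BFlux: update!, initialise!, calculate_epochs  # assumed extendable methods

# Hypothetical custom sampler; the update rule is a placeholder.
mutable struct MySampler{T<:Real} <: MCMCState
    samples::Matrix{T}   # num_total_parameter × num_samples
    nsampled::Int
end

function update!(s::MySampler, θ::AbstractVector, bnn, ∇θ)
    θ = θ .+ 0.001 .* ∇θ(θ)              # placeholder step; a real sampler goes here
    s.nsampled += 1
    s.samples[:, s.nsampled] = θ
    return θ, s.nsampled                  # the interface asks for θ and the sample count
end

function initialise!(s::MySampler, θ::AbstractVector, numsamples; continue_sampling=false)
    if continue_sampling
        extra = numsamples - s.nsampled   # grow storage so numsamples total fit
        s.samples = hcat(s.samples[:, 1:s.nsampled],
                         Matrix{eltype(θ)}(undef, length(θ), extra))
    else
        s.samples = Matrix{eltype(θ)}(undef, length(θ), numsamples)
        s.nsampled = 0
    end
end

function calculate_epochs(s::MySampler, numbatches, numsamples; continue_sampling=false)
    remaining = continue_sampling ? numsamples - s.nsampled : numsamples
    return ceil(Int, remaining / numbatches)  # assuming one draw per batch update
end
```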
BFlux.mcmc — Function

mcmc(args...; kwargs...)

Sample from a BNN using MCMC.
Arguments
- bnn : a Bayesian Neural Network
- batchsize : batch size
- numsamples : number of samples to take
- sampler : sampler to use
Keyword Arguments
- shuffle::Bool=true : should the data be shuffled after each epoch so that batches differ between epochs?
- partial::Bool=true : are partial batches allowed? If true, some batches might be smaller than batchsize.
- showprogress::Bool=true : should a progress bar be shown?
- continue_sampling::Bool=false : if true and numsamples is larger than sampler.nsampled, then additional samples will be taken
- θstart::AbstractVector{T}=vcat(bnn.init()...) : starting parameter vector
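A minimal usage sketch follows. It assumes a bnn has already been constructed (see the BNN documentation); the SGLD keyword names mirror the fields listed under BFlux.SGLD below but are assumptions rather than verified constructor signatures.

```julia
using BFlux

# Assumes `bnn` was built beforehand; SGLD keywords are assumptions that
# mirror the fields documented below.
sampler = SGLD(Float32; stepsize_a = 1f0, stepsize_b = 0f0, stepsize_γ = 0.55f0)
ch = mcmc(bnn, 32, 10_000, sampler)                             # 10_000 draws, batchsize 32
ch = mcmc(bnn, 32, 20_000, sampler; continue_sampling = true)   # extend to 20_000 total
```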
BFlux.SGLD — Type

Stochastic Gradient Langevin Dynamics as proposed in Welling, M., & Teh, Y. W. (2011). Bayesian Learning via Stochastic Gradient Langevin Dynamics. Proceedings of the 28th International Conference on Machine Learning.
Fields
- θ::AbstractVector : current sample
- samples::Matrix : matrix of samples. Not all columns will be actual samples if sampling was stopped early. See nsampled for the actual number of samples taken.
- nsampled::Int : number of samples taken
- min_stepsize::T : stop decreasing the stepsize when it falls below this value
- didinform::Bool : flag keeping track of whether we informed the user that min_stepsize was reached
- stepsize_a::T : see the stepsize schedule below
- stepsize_b::T : see the stepsize schedule below
- stepsize_γ::T : see the stepsize schedule below
- maxnorm::T : maximum gradient norm; gradients are clipped if their norm exceeds this value
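The three stepsize_* fields parameterise a polynomially decaying stepsize. A sketch, assuming the standard Welling & Teh (2011) schedule a(b + t)^(-γ) with the min_stepsize floor applied:

```julia
# Decaying SGLD stepsize at step t, floored at min_stepsize (assumed form).
sgld_stepsize(a, b, γ, t, min_stepsize) = max(a * (b + t)^(-γ), min_stepsize)
```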
BFlux.SGNHTS — Type

Stochastic Gradient Nosé-Hoover Thermostat as proposed in Leimkuhler, B., & Shang, X. (2016). Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 38(2), A712-A736.
This is similar to SGNHT as proposed in Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., & Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. Advances in Neural Information Processing Systems, 27.
Fields
- samples::Matrix : matrix containing the samples
- nsampled::Int : number of samples taken so far. Can be smaller than size(samples, 2) if sampling was interrupted.
- p::AbstractVector : momentum
- xi::Number : thermostat
- l::Number : stepsize; this is often in the 0.001-0.1 range
- σA::Number : diffusion factor; if the stepsize is small, this should be larger than 1
- μ::Number : free parameter in the thermostat; defaults to 1
- t::Int : current step count
- kinetic::Vector : keeps track of the kinetic energy; the goal of SGNHT is to keep its average close to one
BFlux.SGNHT — Type

Stochastic Gradient Nosé-Hoover Thermostat as proposed in Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., & Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. Advances in Neural Information Processing Systems, 27.
Fields
- samples::Matrix : matrix containing the samples
- nsampled::Int : number of samples taken so far. Can be smaller than size(samples, 2) if sampling was interrupted.
- p::AbstractVector : momentum
- xi::Number : thermostat
- l::Number : stepsize
- A::Number : diffusion factor
- t::Int : current step count
- kinetic::Vector : keeps track of the kinetic energy; the goal of SGNHT is to keep its average close to one
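For intuition, here is a schematic single SGNHT step following Ding et al. (2014). This is a paper-level sketch, not BFlux's internal implementation; ∇U denotes the stochastic gradient of the negative log posterior.

```julia
using LinearAlgebra

# One schematic SGNHT step (Ding et al., 2014): momentum update with friction
# ξ and injected noise, position update, then the thermostat update that pushes
# the average kinetic energy towards one.
function sgnht_step!(θ, p, ξ, ∇U, l, A)
    p .= p .- ξ .* p .* l .- ∇U(θ) .* l .+ sqrt(2A * l) .* randn(length(p))
    θ .= θ .+ p .* l
    ξ = ξ + (dot(p, p) / length(θ) - 1) * l
    return θ, p, ξ
end
```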
BFlux.GGMC — Type

Gradient Guided Monte Carlo as proposed in Garriga-Alonso, A., & Fortuin, V. (2021). Exact Langevin dynamics with stochastic gradients. arXiv preprint arXiv:2102.01691.
Fields
- samples::Matrix : matrix containing the samples. If sampling stopped early, not all columns will actually correspond to samples. See nsampled to check how many samples were actually taken.
- nsampled::Int : number of samples taken
- t::Int : total number of steps taken
- accepted::Vector{Bool} : if true, the sample was accepted; if false, the proposed sample was rejected and the previous sample was kept
- β::T : see the paper
- l::T : step length; see the paper
- sadapter::StepsizeAdapter : a StepsizeAdapter; the default is DualAveragingStepSize
- M::AbstractMatrix : mass matrix
- Mhalf::AbstractMatrix : lower triangular Cholesky factor of M
- Minv::AbstractMatrix : inverse mass matrix
- madapter::MassAdapter : a MassAdapter
- momentum::AbstractVector : last momentum vector
- lMH::T : log of the Metropolis-Hastings ratio
- steps::Int : number of steps to take before calculating the MH ratio
- current_step::Int : current step in the recurring sequence 1, ..., steps
- maxnorm::T : maximum gradient norm; gradients are clipped if their norm exceeds this value
BFlux.HMC — Type

Standard Hamiltonian Monte Carlo (Hybrid Monte Carlo).
Allows for the use of stochastic gradients, but the validity of doing so is not clear. This is motivated by parts of the discussion in Neal, R. M. (1996). Bayesian Learning for Neural Networks (Vol. 118). Springer New York. https://doi.org/10.1007/978-1-4612-0745-0
Code was partially adapted from https://colindcarroll.com/2019/04/11/hamiltonian-monte-carlo-from-scratch/
Fields
- samples::Matrix : samples taken
- nsampled::Int : number of samples taken. Might be smaller than size(samples, 2) if sampling was interrupted.
- θold::AbstractVector : old sample; kept for the rejection step
- momentum::AbstractVector : momentum variables
- momentumold::AbstractVector : old momentum variables; kept for the rejection step
- t::Int : current step
- path_len::Int : number of leapfrog steps
- current_step::Int : current leapfrog step
- accepted::Vector{Bool} : whether a draw in samples was an accepted draw or rejected (in which case it is the same as the previous one)
- sadapter::StepsizeAdapter : stepsize adapter giving the stepsize in each iteration
- l : stepsize
- madapter::MassAdapter : mass matrix adapter giving the inverse mass matrix in each iteration
- Minv::AbstractMatrix : inverse mass matrix
- maxnorm::T : maximum gradient norm; gradients are clipped if their norm exceeds this value
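For reference, a schematic leapfrog integrator step of the kind HMC uses (cf. the linked blog post); this is a sketch, not BFlux's internals. ∇U is the gradient of the negative log posterior, l the stepsize, and Minv the inverse mass matrix.

```julia
# One leapfrog step: half momentum step, full position step, half momentum step.
function leapfrog(θ, momentum, ∇U, l, Minv)
    momentum = momentum .- (l / 2) .* ∇U(θ)
    θ = θ .+ l .* (Minv * momentum)
    momentum = momentum .- (l / 2) .* ∇U(θ)
    return θ, momentum
end
```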
BFlux.AdaptiveMH — Type

Adaptive Metropolis-Hastings as introduced in Haario, H., Saksman, E., & Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 223-242.
Fields
- samples::Matrix : matrix holding the samples. If sampling was stopped early, not all columns will represent samples. To figure out how many columns represent samples, check nsampled.
- nsampled::Int : number of samples obtained
- C0::Matrix : initial covariance matrix
- Ct::Matrix : covariance matrix in iteration t
- t::Int : current time period
- t0::Int : when to start adapting the covariance matrix. The covariance is adapted in a rolling-window fashion.
- sd::T : see the paper
- ϵ::T : will be added to the diagonal to prevent the covariance matrix from becoming numerically non-positive-definite. If you run into numerical problems, try increasing this value.
- accepted::Vector{Bool} : for each sample, indicates whether the sample was accepted (true) or the previous sample was kept (false)
Notes
- Adaptive MH might not be suited if it is very costly to calculate the likelihood, as this needs to be done for each sample on the full dataset. Plans exist to make this faster.
- Works best when started at a MAP estimate.
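The covariance adaptation follows Haario et al. (2001): after t0 steps, the proposal covariance is the scaled empirical covariance of past samples plus a small diagonal jitter. A schematic version (illustrative notation, not BFlux's internal code):

```julia
using LinearAlgebra, Statistics

# Haario et al. (2001) adaptation: Ct = sd * Cov(samples) + sd * ϵ * I, where
# the ϵ-term keeps the matrix positive definite. Columns are samples.
function adapt_covariance(samples::AbstractMatrix, sd, ϵ)
    d = size(samples, 1)
    return sd * cov(samples; dims = 2) + sd * ϵ * I(d)
end
```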
Mass Adaptation
BFlux.MassAdapter — Type

Adapt the mass matrix in MCMC, especially in dynamic MCMC methods such as HMC, GGMC, SGLD, SGNHT, ...
Mandatory Fields
- Minv::AbstractMatrix : the inverse mass matrix used in HMC, GGMC, ...
Mandatory Functions
- (madapter::MassAdapter)(s::MCMCState, θ::AbstractVector, bnn, ∇θ) : every mass adapter must be callable, taking the sampler state, the current sample, the BNN, and a gradient function as arguments. It must return the new Minv matrix.
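A hypothetical minimal adapter illustrating this callable interface; IdentityMassAdapter is made up for illustration and is not part of BFlux.

```julia
using BFlux, LinearAlgebra

# Hypothetical adapter that never adapts: it always returns an identity
# inverse mass matrix. Illustrates the mandatory field and callable.
struct IdentityMassAdapter <: BFlux.MassAdapter
    Minv::AbstractMatrix
end
IdentityMassAdapter(n::Int) = IdentityMassAdapter(Matrix{Float64}(I, n, n))

(m::IdentityMassAdapter)(s::BFlux.MCMCState, θ::AbstractVector, bnn, ∇θ) = m.Minv
```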
BFlux.DiagCovMassAdapter — Type

Use the variances as the diagonal of the inverse mass matrix as used in HMC, GGMC, ...
Fields
- Minv : inverse mass matrix as used in HMC, SGLD, GGMC, ...
- adapt_steps : number of adaptation steps
- windowlength : lookback length for the calculation of the covariance
- t : current step
- kappa : how much to shrink towards the identity
- epsilon : small value to add to the diagonal to avoid numerical instability
BFlux.FullCovMassAdapter — Type

Use the full covariance matrix of a moving average of samples as the mass matrix. This is similar to what is already done in Adaptive MH.
Fields
- Minv : inverse mass matrix as used in HMC, SGLD, GGMC, ...
- adapt_steps : number of adaptation steps
- windowlength : lookback length for the calculation of the covariance
- t : current step
- kappa : how much to shrink towards the identity
- epsilon : small value to add to the diagonal to avoid numerical instability
BFlux.FixedMassAdapter — Type

Use a fixed inverse mass matrix.
BFlux.RMSPropMassAdapter — Type

Use RMSProp as a preconditioner/mass matrix adapter, for use in SGLD and related methods. This was proposed in Li, C., Chen, C., Carlson, D., & Carin, L. (2016). Preconditioned stochastic gradient Langevin dynamics for deep neural networks. Thirtieth AAAI Conference on Artificial Intelligence.
Stepsize Adaptation
BFlux.StepsizeAdapter — Type

Adapt the stepsize of MCMC algorithms.
Implementation Details
Mandatory Fields
- l::Number : the stepsize. Will be used by the sampler.
Mandatory Functions
- (sadapter::StepsizeAdapter)(s::MCMCState, mh_probability::Number) : every stepsize adapter must be callable with the sampler itself and the Metropolis-Hastings acceptance probability as arguments. The method must return the new stepsize.
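A hypothetical adapter illustrating this interface; the halving rule and the type name are made up for illustration.

```julia
using BFlux

# Hypothetical stepsize adapter: halve the stepsize when the MH acceptance
# probability falls below a target, grow it slightly otherwise.
mutable struct HalvingStepsize{T<:Number} <: BFlux.StepsizeAdapter
    l::T        # mandatory field: the current stepsize
    target::T   # desired acceptance probability
end

function (sadapter::HalvingStepsize)(s::BFlux.MCMCState, mh_probability::Number)
    sadapter.l = mh_probability < sadapter.target ? sadapter.l / 2 : sadapter.l * 1.01
    return sadapter.l
end
```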
BFlux.ConstantStepsize — Type

Use a constant stepsize.
BFlux.DualAveragingStepSize — Type

Use the dual averaging method to tune the stepsize.
The use of dual averaging was proposed in Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593-1623.
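Schematically, the dual averaging update from Hoffman & Gelman (2014, Algorithm 5) looks as follows; the constants are the paper's defaults, and this is a paper-level sketch rather than BFlux's implementation.

```julia
# One dual-averaging update: shrink the running statistic Hbar towards the
# gap between the target and observed acceptance, then smooth the log stepsize.
function dual_average(Hbar, logϵbar, mh_probability, t;
                      δ = 0.65, γ = 0.05, t0 = 10, κ = 0.75, μ = log(10))
    Hbar = (1 - 1 / (t + t0)) * Hbar + (δ - mh_probability) / (t + t0)
    logϵ = μ - sqrt(t) / γ * Hbar
    logϵbar = t^(-κ) * logϵ + (1 - t^(-κ)) * logϵbar
    return Hbar, logϵbar, exp(logϵ)     # exp(logϵ) is the stepsize for step t
end
```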