Abstract
In this colab we demonstrate how to use the various optimizers implemented in TensorFlow Probability.
Dependencies and prerequisites
Imports
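The import cell is not reproduced in this extract. A minimal set of imports that would make the code below self-contained (an assumption based on the names used later, not taken from the original notebook) is:

# Assumed imports for the cells below; install with e.g.
# `pip install tensorflow tensorflow-probability tf-keras scipy scikit-learn pandas`.
import contextlib
import functools
import os
import time
import urllib.request

import numpy as np
import pandas as pd
import scipy as sp
import scipy.special
import scipy.stats
import tensorflow as tf
import tensorflow_probability as tfp
import tf_keras
from sklearn import preprocessing

# Matches the /tmp/datasets path seen in the download output further below.
CACHE_DIR = os.path.join(os.sep, 'tmp', 'datasets')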
BFGS and L-BFGS optimizers
Quasi-Newton methods are a class of popular first-order optimization algorithms. These methods use a positive-definite approximation to the exact Hessian to find the search direction.
The Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS) is a specific implementation of this general idea. It is applicable to, and is the method of choice for, medium-sized problems where the gradient is continuous everywhere (e.g. linear regression with an \(L_2\) penalty).
L-BFGS is a limited-memory version of BFGS that is useful for solving larger problems whose Hessian matrices cannot be computed at a reasonable cost or are not sparse. Instead of storing a fully dense \(n \times n\) approximation of the Hessian, it stores only a few vectors of length \(n\) that represent the approximation implicitly.
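As a concrete illustration of the limited-memory idea (not part of the original text): in TFP the number of stored correction vectors is controlled by the num_correction_pairs argument of tfp.optimizer.lbfgs_minimize, so memory grows as a small multiple of \(n\) rather than \(n^2\). A minimal sketch, using a toy quadratic objective of our own:

# Sketch only: limit the L-BFGS history to 5 correction pairs instead of the
# default 10. `objective` is a placeholder quadratic, not from the notebook.
import tensorflow as tf
import tensorflow_probability as tfp

def objective(x):
  return tf.reduce_sum((x - 2.0) ** 2, axis=-1)

result = tfp.optimizer.lbfgs_minimize(
    lambda x: tfp.math.value_and_gradient(objective, x),
    initial_position=tf.zeros([4], dtype=tf.float64),
    num_correction_pairs=5,
    tolerance=1e-8)
print(result.converged, result.position)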
Helper functions
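The helper definitions are omitted from this extract, but the later cells rely on a make_val_and_grad_fn decorator and a run wrapper. A minimal sketch of what they might look like (an assumption; the actual notebook's definitions may differ) is:

# Assumed sketch of the helpers used below.

def make_val_and_grad_fn(value_fn):
  """Wrap a value function so it returns (value, gradient)."""
  @functools.wraps(value_fn)
  def val_and_grad(x):
    return tfp.math.value_and_gradient(value_fn, x)
  return val_and_grad

@contextlib.contextmanager
def timed_execution():
  """Print how long the wrapped block took (see the outputs below)."""
  t0 = time.time()
  yield
  dt = time.time() - t0
  print('Evaluation took: %f seconds' % dt)

def np_value(tensor):
  """Get the numpy value out of a possibly nested tuple of tensors."""
  if isinstance(tensor, tuple):
    return type(tensor)(*(np_value(t) for t in tensor))
  else:
    return tensor.numpy()

def run(optimizer):
  """Run an optimizer once to warm up, then time a second run."""
  optimizer()  # Warmup.
  with timed_execution():
    result = optimizer()
  return np_value(result)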
L-BFGS on a simple quadratic function
# Fix numpy seed for reproducibility
np.random.seed(12345)
# The objective must be supplied as a function that takes a single
# (Tensor) argument and returns a tuple. The first component of the
# tuple is the value of the objective at the supplied point and the
# second value is the gradient at the supplied point. The value must
# be a scalar and the gradient must have the same shape as the
# supplied argument.
# The `make_val_and_grad_fn` decorator helps transform a function
# returning the objective value into one that returns both the value
# and the gradient. It works in both eager and graph mode.
dim = 10
minimum = np.ones([dim])
scales = np.exp(np.random.randn(dim))
@make_val_and_grad_fn
def quadratic(x):
  return tf.reduce_sum(scales * (x - minimum) ** 2, axis=-1)
# The minimization routine also requires you to supply an initial
# starting point for the search. For this example we choose a random
# starting point.
start = np.random.randn(dim)
# Finally an optional argument called tolerance lets you choose the
# stopping point of the search. The tolerance specifies the maximum
# (supremum) norm of the gradient vector at which the algorithm terminates.
# If you don't have a specific need for higher or lower accuracy, leaving
# this parameter unspecified (and hence using the default value of 1e-8)
# should be good enough.
tolerance = 1e-10
@tf.function
def quadratic_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
      quadratic,
      initial_position=tf.constant(start),
      tolerance=tolerance)
results = run(quadratic_with_lbfgs)
# The optimization results contain multiple pieces of information. The most
# important fields are: 'converged' and 'position'.
# Converged is a boolean scalar tensor. As the name implies, it indicates
# whether the norm of the gradient at the final point was within tolerance.
# Position is the location of the minimum found. It is important to check
# that converged is True before using the value of the position.
print('L-BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.014586 seconds
L-BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10
The same problem with BFGS
@tf.function
def quadratic_with_bfgs():
  return tfp.optimizer.bfgs_minimize(
      quadratic,
      initial_position=tf.constant(start),
      tolerance=tolerance)
results = run(quadratic_with_bfgs)
print('BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.010468 seconds
BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10
Linear regression with L1 penalty: prostate cancer data
This example is taken from the book The Elements of Statistical Learning, Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman.
Note that this is an optimization problem with an L1 penalty.
Obtain the dataset
def cache_or_download_file(cache_dir, url_base, filename):
  """Read a cached file or download it."""
  filepath = os.path.join(cache_dir, filename)
  if tf.io.gfile.exists(filepath):
    return filepath
  if not tf.io.gfile.exists(cache_dir):
    tf.io.gfile.makedirs(cache_dir)
  url = url_base + filename
  print("Downloading {url} to {filepath}.".format(url=url, filepath=filepath))
  urllib.request.urlretrieve(url, filepath)
  return filepath

def get_prostate_dataset(cache_dir=CACHE_DIR):
  """Download the prostate dataset and read as Pandas dataframe."""
  url_base = 'http://web.stanford.edu/~hastie/ElemStatLearn/datasets/'
  return pd.read_csv(
      cache_or_download_file(cache_dir, url_base, 'prostate.data'),
      delim_whitespace=True, index_col=0)

prostate_df = get_prostate_dataset()
Downloading http://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data to /tmp/datasets/prostate.data.
Problem definition
np.random.seed(12345)
feature_names = ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp',
                 'gleason', 'pgg45']

# Normalize features
scalar = preprocessing.StandardScaler()
prostate_df[feature_names] = pd.DataFrame(
    scalar.fit_transform(
        prostate_df[feature_names].astype('float64')))

# Select training set
prostate_df_train = prostate_df[prostate_df.train == 'T']
# Select features and labels
features = prostate_df_train[feature_names]
labels = prostate_df_train[['lpsa']]
# Create tensors
feat = tf.constant(features.values, dtype=tf.float64)
lab = tf.constant(labels.values, dtype=tf.float64)
dtype = feat.dtype
regularization = 0 # regularization parameter
dim = 8 # number of features
# We pick a random starting point for the search
start = np.random.randn(dim + 1)
def regression_loss(params):
  """Compute loss for linear regression model with L1 penalty.

  Args:
    params: A real tensor of shape [dim + 1]. The zeroth component
      is the intercept term and the rest of the components are the
      beta coefficients.

  Returns:
    The mean square error loss including L1 penalty.
  """
  params = tf.squeeze(params)
  intercept, beta = params[0], params[1:]
  pred = tf.matmul(feat, tf.expand_dims(beta, axis=-1)) + intercept
  mse_loss = tf.reduce_sum(
      tf.cast(
          tf_keras.losses.mean_squared_error(y_true=lab, y_pred=pred),
          tf.float64))
  l1_penalty = regularization * tf.reduce_sum(tf.abs(beta))
  total_loss = mse_loss + l1_penalty
  return total_loss
Solving with L-BFGS
Fit using L-BFGS. Even though the L1 penalty introduces discontinuities in the derivative, in practice L-BFGS still works quite well.
@tf.function
def l1_regression_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
      make_val_and_grad_fn(regression_loss),
      initial_position=tf.constant(start),
      tolerance=1e-8)
results = run(l1_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('L-BFGS Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.017987 seconds
L-BFGS Results
Converged: True
Intercept: Fitted (2.3879985744556484)
Beta: Fitted [ 0.68626215 0.28193532 -0.17030254 0.10799274 0.33634988 -0.24888523 0.11992237 0.08689026]
Solving with Nelder Mead
The Nelder Mead method is one of the most popular derivative-free minimization methods. This optimizer does not use gradient information and makes no assumptions about the differentiability of the objective function; it is therefore appropriate for non-smooth objectives, such as optimization problems with an L1 penalty.
For an optimization problem in \(n\) dimensions it maintains a set of \(n+1\) candidate solutions that span a non-degenerate simplex. It successively modifies the simplex using a set of moves (reflection, expansion, contraction and shrinkage) based on the function value at each of the vertices.
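To make these moves concrete, here is a tiny numpy illustration of a single, simplified reflection step. It is purely illustrative and is not how tfp.optimizer.nelder_mead_minimize is implemented internally; the toy objective and acceptance test are our own simplifications.

# Illustrative sketch only: one simplified Nelder-Mead reflection step in 2-D.
import numpy as np

f = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2        # toy objective
simplex = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # n + 1 = 3 vertices
order = np.argsort([f(p) for p in simplex])
simplex = simplex[order]                         # best vertex first, worst last
centroid = simplex[:-1].mean(axis=0)             # centroid of all but the worst
reflected = centroid + (centroid - simplex[-1])  # reflect worst through centroid
if f(reflected) < f(simplex[-2]):                # simplified acceptance test
  simplex[-1] = reflected
print(simplex)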
# Nelder mead expects an initial_vertex of shape [n + 1, 1].
initial_vertex = tf.expand_dims(tf.constant(start, dtype=dtype), axis=-1)
@tf.function
def l1_regression_with_nelder_mead():
  return tfp.optimizer.nelder_mead_minimize(
      regression_loss,
      initial_vertex=initial_vertex,
      func_tolerance=1e-10,
      position_tolerance=1e-10)
results = run(l1_regression_with_nelder_mead)
minimum = results.position.reshape([-1])
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('Nelder Mead Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.325643 seconds
Nelder Mead Results
Converged: True
Intercept: Fitted (2.387998456121595)
Beta: Fitted [ 0.68626266 0.28193456 -0.17030291 0.10799375 0.33635132 -0.24888703 0.11992244 0.08689023]
Logistic regression with L2 penalty
In this example we create a synthetic data set for classification and use the L-BFGS optimizer to fit the parameters.
np.random.seed(12345)
dim = 5 # The number of features
n_obs = 10000 # The number of observations
betas = np.random.randn(dim) # The true beta
intercept = np.random.randn() # The true intercept
features = np.random.randn(n_obs, dim) # The feature matrix
probs = sp.special.expit(
    np.matmul(features, np.expand_dims(betas, -1)) + intercept)
labels = sp.stats.bernoulli.rvs(probs) # The true labels
regularization = 0.8
feat = tf.constant(features)
lab = tf.constant(labels, dtype=feat.dtype)
@make_val_and_grad_fn
def negative_log_likelihood(params):
  """Negative log likelihood for logistic model with L2 penalty.

  Args:
    params: A real tensor of shape [dim + 1]. The zeroth component
      is the intercept term and the rest of the components are the
      beta coefficients.

  Returns:
    The negative log likelihood plus the penalty term.
  """
  intercept, beta = params[0], params[1:]
  logit = tf.matmul(feat, tf.expand_dims(beta, -1)) + intercept
  log_likelihood = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
      labels=lab, logits=logit))
  l2_penalty = regularization * tf.reduce_sum(beta ** 2)
  total_loss = log_likelihood + l2_penalty
  return total_loss
start = np.random.randn(dim + 1)
@tf.function
def l2_regression_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
      negative_log_likelihood,
      initial_position=tf.constant(start),
      tolerance=1e-8)
results = run(l2_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('Converged:', results.converged)
print('Intercept: Fitted ({}), Actual ({})'.format(fitted_intercept, intercept))
print('Beta:\n\tFitted {},\n\tActual {}'.format(fitted_beta, betas))
Evaluation took: 0.056751 seconds
Converged: True
Intercept: Fitted (1.4111415084244365), Actual (1.3934058329729904)
Beta:
  Fitted [-0.18016612 0.53121578 -0.56420632 -0.5336374 2.00499675],
  Actual [-0.20470766 0.47894334 -0.51943872 -0.5557303 1.96578057]
Batching support
Both BFGS and L-BFGS support batched computation, for example to optimize a single function from many different starting points, or multiple parametric functions from a single point.
Single function, multiple starting points
Himmelblau's function is a standard optimization test case. The function is given by
\[f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2\]
The function has four minima located at
- (3, 2),
- (-2.805118, 3.131312),
- (-3.779310, -3.283186),
- (3.584428, -1.848126).
All of these minima can be reached from appropriate starting points.
# The function to minimize must take as input a tensor of shape [..., n]. In
# this case n=2 is the size of the domain of the input and [...] are batching
# dimensions. The return value must be of shape [...], i.e. a batch of scalars
# with the objective value of the function evaluated at each input point.
@make_val_and_grad_fn
def himmelblau(coord):
  x, y = coord[..., 0], coord[..., 1]
  return (x * x + y - 11) ** 2 + (x + y * y - 7) ** 2
starts = tf.constant([[1, 1],
                      [-2, 2],
                      [-1, -1],
                      [1, -2]], dtype='float64')
# The stopping_condition allows us to further specify when the search should
# stop. The default, tfp.optimizer.converged_all, will proceed until all points
# have either converged or failed. There is also a tfp.optimizer.converged_any
# to stop as soon as the first point converges, or all have failed.
@tf.function
def batch_multiple_starts():
  return tfp.optimizer.lbfgs_minimize(
      himmelblau, initial_position=starts,
      stopping_condition=tfp.optimizer.converged_all,
      tolerance=1e-8)
results = run(batch_multiple_starts)
print('Converged:', results.converged)
print('Minima:', results.position)
Evaluation took: 0.019095 seconds
Converged: [ True True True True]
Minima: [[ 3. 2. ]
 [-2.80511809 3.13131252]
 [-3.77931025 -3.28318599]
 [ 3.58442834 -1.84812653]]
Multiple functions
For demonstration purposes, in this example we simultaneously optimize a large number of high-dimensional, randomly generated quadratic bowls.
np.random.seed(12345)
dim = 100
batches = 500
minimum = np.random.randn(batches, dim)
scales = np.exp(np.random.randn(batches, dim))
@make_val_and_grad_fn
def quadratic(x):
  return tf.reduce_sum(input_tensor=scales * (x - minimum)**2, axis=-1)
# Make all starting points (1, 1, ..., 1). Note not all starting points need
# to be the same.
start = tf.ones((batches, dim), dtype='float64')
@tf.function
def batch_multiple_functions():
  return tfp.optimizer.lbfgs_minimize(
      quadratic, initial_position=start,
      stopping_condition=tfp.optimizer.converged_all,
      max_iterations=100,
      tolerance=1e-8)
results = run(batch_multiple_functions)
print('All converged:', np.all(results.converged))
print('Largest error:', np.max(results.position - minimum))
Evaluation took: 1.994132 seconds
All converged: True
Largest error: 4.4131473142527966e-08