將 tf.feature_columns 遷移至 Keras 預處理層

tf.feature_columns to Keras preprocessing layers" />
在 TensorFlow.org 上檢視 在 Google Colab 中執行 在 GitHub 上檢視原始碼 下載筆記本

訓練模型通常會進行一定程度的特徵預處理,尤其是在處理結構化資料時。在 TensorFlow 1 中訓練 tf.estimator.Estimator 時,您通常會使用 tf.feature_column API 執行特徵預處理。在 TensorFlow 2 中,您可以直接使用 Keras 預處理層執行此操作。

本遷移指南示範了使用 Feature Columns 和預處理層的常見特徵轉換,接著將說明如何使用這兩種 API 訓練完整的模型。

首先,從幾個必要的匯入作業開始

import tensorflow as tf
import tensorflow.compat.v1 as tf1
import math
2024-01-17 02:27:02.309609: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-17 02:27:02.309651: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-17 02:27:02.311142: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

現在,新增一個公用程式函式來呼叫 Feature Column 以進行示範

def call_feature_columns(feature_columns, inputs):
  # This is a convenient way to call a `feature_column` outside of an estimator
  # to display its output.
  feature_layer = tf1.keras.layers.DenseFeatures(feature_columns)
  return feature_layer(inputs)

輸入處理

若要搭配估算器使用 Feature Columns,模型輸入一律應為張量字典

input_dict = {
  'foo': tf.constant([1]),
  'bar': tf.constant([0]),
  'baz': tf.constant([-1])
}

每個 Feature Column 都需要使用鍵來建立索引,以便索引至來源資料。所有 Feature Column 的輸出會串連在一起,並由估算器模型使用。

columns = [
  tf1.feature_column.numeric_column('foo'),
  tf1.feature_column.numeric_column('bar'),
  tf1.feature_column.numeric_column('baz'),
]
call_feature_columns(columns, input_dict)
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/3124623333.py:2: numeric_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[ 0., -1.,  1.]], dtype=float32)>

在 Keras 中,模型輸入更加彈性。tf.keras.Model 可以處理單一張量輸入、張量特徵清單或張量特徵字典。您可以藉由在模型建立時傳遞 tf.keras.Input 字典來處理字典輸入。輸入不會自動串連,因此可以更彈性地使用。您可以使用 tf.keras.layers.Concatenate 將它們串連在一起。

inputs = {
  'foo': tf.keras.Input(shape=()),
  'bar': tf.keras.Input(shape=()),
  'baz': tf.keras.Input(shape=()),
}
# Inputs are typically transformed by preprocessing layers before concatenation.
outputs = tf.keras.layers.Concatenate()(inputs.values())
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model(input_dict)
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 1.,  0., -1.], dtype=float32)>

One-Hot 編碼整數 ID

常見的特徵轉換是針對已知範圍的整數輸入進行 One-Hot 編碼。以下是使用 Feature Columns 的範例

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'type', num_buckets=3)
indicator_col = tf1.feature_column.indicator_column(categorical_col)
call_feature_columns(indicator_col, {'type': [0, 1, 2]})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/1369923821.py:1: categorical_column_with_identity (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/1369923821.py:3: indicator_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

使用 Keras 預處理層,這些欄可以由單一 tf.keras.layers.CategoryEncoding 層取代,且 output_mode 設定為 'one_hot'

one_hot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=3, output_mode='one_hot')
one_hot_layer([0, 1, 2])
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

正規化數值特徵

處理具有 Feature Columns 的連續浮點特徵時,您需要使用 tf.feature_column.numeric_column。如果輸入已正規化,則轉換為 Keras 非常簡單。您可以直接在模型中使用 tf.keras.Input,如上所示。

numeric_column 也可用於正規化輸入

def normalize(x):
  mean, variance = (2.0, 1.0)
  return (x - mean) / math.sqrt(variance)
numeric_col = tf1.feature_column.numeric_column('col', normalizer_fn=normalize)
call_feature_columns(numeric_col, {'col': tf.constant([[0.], [1.], [2.]])})
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[-2.],
       [-1.],
       [ 0.]], dtype=float32)>

相較之下,使用 Keras 時,可以使用 tf.keras.layers.Normalization 完成此正規化。

normalization_layer = tf.keras.layers.Normalization(mean=2.0, variance=1.0)
normalization_layer(tf.constant([[0.], [1.], [2.]]))
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[-2.],
       [-1.],
       [ 0.]], dtype=float32)>

分桶和 One-Hot 編碼數值特徵

連續浮點輸入的另一個常見轉換是分桶,然後轉換為固定範圍的整數。

在 Feature Columns 中,可以使用 tf.feature_column.bucketized_column 達成此目的

numeric_col = tf1.feature_column.numeric_column('col')
bucketized_col = tf1.feature_column.bucketized_column(numeric_col, [1, 4, 5])
call_feature_columns(bucketized_col, {'col': tf.constant([1., 2., 3., 4., 5.])})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/3043215186.py:2: bucketized_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

在 Keras 中,可以使用 tf.keras.layers.Discretization 取代

discretization_layer = tf.keras.layers.Discretization(bin_boundaries=[1, 4, 5])
one_hot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=4, output_mode='one_hot')
one_hot_layer(discretization_layer([1., 2., 3., 4., 5.]))
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

使用詞彙表進行 One-Hot 編碼字串資料

處理字串特徵通常需要詞彙表查詢,才能將字串轉換為索引。以下是使用 Feature Columns 查詢字串,然後對索引進行 One-Hot 編碼的範例

vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'sizes',
    vocabulary_list=['small', 'medium', 'large'],
    num_oov_buckets=0)
indicator_col = tf1.feature_column.indicator_column(vocab_col)
call_feature_columns(indicator_col, {'sizes': ['small', 'medium', 'large']})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/2845961037.py:1: categorical_column_with_vocabulary_list (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

使用 Keras 預處理層,請使用 tf.keras.layers.StringLookup 層,且 output_mode 設定為 'one_hot'

string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'],
    num_oov_indices=0,
    output_mode='one_hot')
string_lookup_layer(['small', 'medium', 'large'])
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

使用詞彙表嵌入字串資料

對於較大的詞彙表,通常需要嵌入才能獲得良好的效能。以下是使用 Feature Columns 嵌入字串特徵的範例

vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'col',
    vocabulary_list=['small', 'medium', 'large'],
    num_oov_buckets=0)
embedding_col = tf1.feature_column.embedding_column(vocab_col, 4)
call_feature_columns(embedding_col, {'col': ['small', 'medium', 'large']})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/999372599.py:5: embedding_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[ 0.19490445, -0.41044798, -0.5603343 ,  0.32616043],
       [-0.03856034,  0.45223498,  0.15508214,  0.02957107],
       [-0.20157112,  0.27702358, -0.079571  , -0.6845111 ]],
      dtype=float32)>

使用 Keras 預處理層,可以結合 tf.keras.layers.StringLookup 層和 tf.keras.layers.Embedding 層來達成此目的。StringLookup 的預設輸出會是整數索引,可以直接饋送至嵌入。

string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'], num_oov_indices=0)
embedding = tf.keras.layers.Embedding(3, 4)
embedding(string_lookup_layer(['small', 'medium', 'large']))
<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[-0.01309206, -0.01742087,  0.04137394,  0.03209827],
       [ 0.009628  ,  0.03649005,  0.0101964 , -0.02682712],
       [ 0.02947212,  0.04911048,  0.00290644, -0.00840418]],
      dtype=float32)>

加總加權類別資料

在某些情況下,您需要處理類別資料,其中每個類別的出現都帶有關聯的權重。在 Feature Columns 中,可以使用 tf.feature_column.weighted_categorical_column 處理。與 indicator_column 配對時,這會產生加總每個類別權重的效果。

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'ids', num_buckets=20)
weighted_categorical_col = tf1.feature_column.weighted_categorical_column(
    categorical_col, 'weights')
indicator_col = tf1.feature_column.indicator_column(weighted_categorical_col)
call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/3529191023.py:6: weighted_categorical_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4033: sparse_merge (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
No similar op available at this time.
<tf.Tensor: shape=(1, 20), dtype=float32, numpy=
array([[0. , 0. , 0. , 0. , 0. , 1.2, 0. , 0. , 0. , 0. , 0. , 1.5, 0. ,

        0. , 0. , 0. , 0. , 2. , 0. , 0. ]], dtype=float32)>

在 Keras 中,可以藉由將 count_weights 輸入傳遞至 tf.keras.layers.CategoryEncoding (且 output_mode='count') 來完成此操作。

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

# Using sparse output is more efficient when `num_tokens` is large.
count_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=20, output_mode='count', sparse=True)
tf.sparse.to_dense(count_layer(ids, count_weights=weights))
<tf.Tensor: shape=(1, 20), dtype=float32, numpy=
array([[0. , 0. , 0. , 0. , 0. , 1.2, 0. , 0. , 0. , 0. , 0. , 1.5, 0. ,

        0. , 0. , 0. , 0. , 2. , 0. , 0. ]], dtype=float32)>

嵌入加權類別資料

您可能想要改為嵌入加權類別輸入。在 Feature Columns 中,embedding_column 包含 combiner 引數。如果任何樣本包含類別的多個項目,則會根據引數設定 (預設為 'mean') 組合這些項目。

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'ids', num_buckets=20)
weighted_categorical_col = tf1.feature_column.weighted_categorical_column(
    categorical_col, 'weights')
embedding_col = tf1.feature_column.embedding_column(
    weighted_categorical_col, 4, combiner='mean')
call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})
<tf.Tensor: shape=(1, 4), dtype=float32, numpy=
array([[ 0.3714184 , -0.01248874, -0.3219157 , -0.39734876]],
      dtype=float32)>

在 Keras 中,tf.keras.layers.Embedding 沒有 combiner 選項,但您可以使用 tf.keras.layers.Dense 達成相同的效果。上述 embedding_column 只會根據類別權重線性組合嵌入向量。雖然一開始不明顯,但這與將您的類別輸入表示為大小為 (num_tokens) 的稀疏權重向量,並將它們乘以形狀為 (embedding_size, num_tokens)Dense 核心完全相同。

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

# For `combiner='mean'`, normalize your weights to sum to 1. Removing this line
# would be equivalent to an `embedding_column` with `combiner='sum'`.
weights = weights / tf.reduce_sum(weights, axis=-1, keepdims=True)

count_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=20, output_mode='count', sparse=True)
embedding_layer = tf.keras.layers.Dense(4, use_bias=False)
embedding_layer(count_layer(ids, count_weights=weights))
<tf.Tensor: shape=(1, 4), dtype=float32, numpy=
array([[-0.24909994, -0.02287644,  0.06182466, -0.11370523]],
      dtype=float32)>

完整訓練範例

為了展示完整的訓練工作流程,首先準備一些具有三種不同類型特徵的資料

features = {
    'type': [0, 1, 1],
    'size': ['small', 'small', 'medium'],
    'weight': [2.7, 1.8, 1.6],
}
labels = [1, 1, 0]
predict_features = {'type': [0], 'size': ['foo'], 'weight': [-0.7]}

定義 TensorFlow 1 和 TensorFlow 2 工作流程的一些通用常數

vocab = ['small', 'medium', 'large']
one_hot_dims = 3
embedding_dims = 4
weight_mean = 2.0
weight_variance = 1.0

使用 Feature Columns

Feature Columns 必須在建立時以清單形式傳遞至估算器,並會在訓練期間隱含地呼叫。

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'type', num_buckets=one_hot_dims)
# Convert index to one-hot; e.g., [2] -> [0,0,1].
indicator_col = tf1.feature_column.indicator_column(categorical_col)

# Convert strings to indices; e.g., ['small'] -> [1].
vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'size', vocabulary_list=vocab, num_oov_buckets=1)
# Embed the indices.
embedding_col = tf1.feature_column.embedding_column(vocab_col, embedding_dims)

normalizer_fn = lambda x: (x - weight_mean) / math.sqrt(weight_variance)
# Normalize the numeric inputs; e.g., [2.0] -> [0.0].
numeric_col = tf1.feature_column.numeric_column(
    'weight', normalizer_fn=normalizer_fn)

estimator = tf1.estimator.DNNClassifier(
    feature_columns=[indicator_col, embedding_col, numeric_col],
    hidden_units=[1])

def _input_fn():
  return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)

estimator.train(_input_fn)
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/1997355744.py:17: DNNClassifier.__init__ (from tensorflow_estimator.python.estimator.canned.dnn) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/dnn.py:807: Estimator.__init__ (from tensorflow_estimator.python.estimator.estimator) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1844: RunConfig.__init__ (from tensorflow_estimator.python.estimator.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: /tmpfs/tmp/tmpm02fidwe
INFO:tensorflow:Using config: {'_model_dir': '/tmpfs/tmp/tmpm02fidwe', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/dnn.py:446: dnn_logit_fn_builder (from tensorflow_estimator.python.estimator.canned.dnn) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/adagrad.py:138: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/model_fn.py:250: EstimatorSpec.__new__ (from tensorflow_estimator.python.estimator.model_fn) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Done calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1416: NanTensorHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1419: LoggingTensorHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/basic_session_run_hooks.py:232: SecondOrStepTimer.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1456: CheckpointSaverHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Create CheckpointSaverHook.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:579: StepCounterHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:586: SummarySaverHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2024-01-17 02:27:09.322220: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_INT64
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_INT32
    }
  }
}

    for Tuple type infernce function 0
    while inferring type of node 'dnn/zero_fraction/cond/output/_18'
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmpfs/tmp/tmpm02fidwe/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1455: SessionRunArgs.__new__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1454: SessionRunContext.__init__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1474: SessionRunValues.__new__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:loss = 0.6289518, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 3...
INFO:tensorflow:Saving checkpoints for 3 into /tmpfs/tmp/tmpm02fidwe/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 3...
INFO:tensorflow:Loss for final step: 0.81661654.
<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifier at 0x7f4edc48e8b0>

在模型上執行推論時,Feature Columns 也會用於轉換輸入資料。

def _predict_fn():
  return tf1.data.Dataset.from_tensor_slices(predict_features).batch(1)

next(estimator.predict(_predict_fn))
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:596: ClassificationOutput.__init__ (from tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:1307: RegressionOutput.__init__ (from tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:1309: PredictOutput.__init__ (from tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmpfs/tmp/tmpm02fidwe/model.ckpt-3
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'logits': array([0.57964283], dtype=float32),
 'logistic': array([0.6409852], dtype=float32),
 'probabilities': array([0.35901475, 0.6409852 ], dtype=float32),
 'class_ids': array([1]),
 'classes': array([b'1'], dtype=object),
 'all_class_ids': array([0, 1], dtype=int32),
 'all_classes': array([b'0', b'1'], dtype=object)}

使用 Keras 預處理層

Keras 預處理層在呼叫位置方面更具彈性。層可以直接套用至張量、在 tf.data 輸入管線內部使用,或直接建置到可訓練的 Keras 模型中。

在本範例中,您將在 tf.data 輸入管線內套用預處理層。若要執行此操作,您可以定義個別的 tf.keras.Model 來預處理您的輸入特徵。此模型不可訓練,但它是將預處理層分組的便利方式。

inputs = {
  'type': tf.keras.Input(shape=(), dtype='int64'),
  'size': tf.keras.Input(shape=(), dtype='string'),
  'weight': tf.keras.Input(shape=(), dtype='float32'),
}
# Convert index to one-hot; e.g., [2] -> [0,0,1].
type_output = tf.keras.layers.CategoryEncoding(
      one_hot_dims, output_mode='one_hot')(inputs['type'])
# Convert size strings to indices; e.g., ['small'] -> [1].
size_output = tf.keras.layers.StringLookup(vocabulary=vocab)(inputs['size'])
# Normalize the numeric inputs; e.g., [2.0] -> [0.0].
weight_output = tf.keras.layers.Normalization(
      axis=None, mean=weight_mean, variance=weight_variance)(inputs['weight'])
outputs = {
  'type': type_output,
  'size': size_output,
  'weight': weight_output,
}
preprocessing_model = tf.keras.Model(inputs, outputs)

您現在可以在呼叫 tf.data.Dataset.map 時套用此模型。請注意,傳遞至 map 的函式會自動轉換為 tf.function,且適用於撰寫 tf.function 程式碼的常見注意事項 (無副作用)。

# Apply the preprocessing in tf.data.Dataset.map.
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)
dataset = dataset.map(lambda x, y: (preprocessing_model(x), y),
                      num_parallel_calls=tf.data.AUTOTUNE)
# Display a preprocessed input sample.
next(dataset.take(1).as_numpy_iterator())
({'type': array([[1., 0., 0.]], dtype=float32),
  'size': array([1]),
  'weight': array([0.70000005], dtype=float32)},
 array([1], dtype=int32))

接下來,您可以定義個別的 Model,其中包含可訓練的層。請注意,此模型的輸入現在如何反映預處理的特徵類型和形狀。

inputs = {
  'type': tf.keras.Input(shape=(one_hot_dims,), dtype='float32'),
  'size': tf.keras.Input(shape=(), dtype='int64'),
  'weight': tf.keras.Input(shape=(), dtype='float32'),
}
# Since the embedding is trainable, it needs to be part of the training model.
embedding = tf.keras.layers.Embedding(len(vocab), embedding_dims)
outputs = tf.keras.layers.Concatenate()([
  inputs['type'],
  embedding(inputs['size']),
  tf.expand_dims(inputs['weight'], -1),
])
outputs = tf.keras.layers.Dense(1)(outputs)
training_model = tf.keras.Model(inputs, outputs)

您現在可以使用 tf.keras.Model.fit 訓練 training_model

# Train on the preprocessed data.
training_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
training_model.fit(dataset)
3/3 [==============================] - 1s 5ms/step - loss: 0.8194
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1705458433.603835   19973 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
<keras.src.callbacks.History at 0x7f4e700e7a30>

最後,在推論時,將這些個別階段組合成單一模型來處理原始特徵輸入可能會很有用。

inputs = preprocessing_model.input
outputs = training_model(preprocessing_model(inputs))
inference_model = tf.keras.Model(inputs, outputs)

predict_dataset = tf.data.Dataset.from_tensor_slices(predict_features).batch(1)
inference_model.predict(predict_dataset)
1/1 [==============================] - 0s 102ms/step
array([[-0.95717776]], dtype=float32)

這個組合模型可以儲存為 .keras 檔案以供日後使用。

inference_model.save('model.keras')
restored_model = tf.keras.models.load_model('model.keras')
restored_model.predict(predict_dataset)
1/1 [==============================] - 0s 79ms/step
array([[-0.95717776]], dtype=float32)

Feature Column 等效表

為了方便參考,以下是 Feature Columns 和 Keras 預處理層之間的大致對應關係

Feature Column Keras 層
tf.feature_column.bucketized_column tf.keras.layers.Discretization
tf.feature_column.categorical_column_with_hash_bucket tf.keras.layers.Hashing
tf.feature_column.categorical_column_with_identity tf.keras.layers.CategoryEncoding
tf.feature_column.categorical_column_with_vocabulary_file tf.keras.layers.StringLookuptf.keras.layers.IntegerLookup
tf.feature_column.categorical_column_with_vocabulary_list tf.keras.layers.StringLookuptf.keras.layers.IntegerLookup
tf.feature_column.crossed_column tf.keras.layers.experimental.preprocessing.HashedCrossing
tf.feature_column.embedding_column tf.keras.layers.Embedding
tf.feature_column.indicator_column output_mode='one_hot'output_mode='multi_hot'*
tf.feature_column.numeric_column tf.keras.layers.Normalization
tf.feature_column.sequence_categorical_column_with_hash_bucket tf.keras.layers.Hashing
tf.feature_column.sequence_categorical_column_with_identity tf.keras.layers.CategoryEncoding
tf.feature_column.sequence_categorical_column_with_vocabulary_file tf.keras.layers.StringLookuptf.keras.layers.IntegerLookuptf.keras.layer.TextVectorization
tf.feature_column.sequence_categorical_column_with_vocabulary_list tf.keras.layers.StringLookuptf.keras.layers.IntegerLookuptf.keras.layer.TextVectorization
tf.feature_column.sequence_numeric_column tf.keras.layers.Normalization
tf.feature_column.weighted_categorical_column tf.keras.layers.CategoryEncoding

* output_mode 可以傳遞至 tf.keras.layers.CategoryEncodingtf.keras.layers.StringLookuptf.keras.layers.IntegerLookuptf.keras.layers.TextVectorization

tf.keras.layers.TextVectorization 可以直接處理自由形式的文字輸入 (例如,整個句子或段落)。這並非 TensorFlow 1 中類別序列處理的一對一替代方案,但可以為臨時文字預處理提供便利的替代方案。

後續步驟