Taking advantage of context features


In the featurization tutorial we incorporated multiple features beyond just user and movie ids into our models, but we haven't explored whether those features improve model accuracy.

Many factors affect whether features beyond ids are useful in a recommender model:

  1. Importance of context: if user preferences are relatively stable across contexts and time, context features may not provide much benefit. If, however, user preferences are highly contextual, adding context will improve the model significantly. For example, day of the week may be an important feature when deciding whether to recommend a short clip or a movie: users may only have time to watch short content during the week, but can relax and enjoy a full-length movie during the weekend. Similarly, query timestamps may play an important role in modelling popularity dynamics: one movie may be highly popular around the time of its release but decay quickly afterwards. Conversely, other movies may be evergreens that are happily watched time and time again.
  2. Data sparsity: using non-id features may be critical if data is sparse. With few observations available for a given user or item, the model may struggle to estimate a good per-user or per-item representation. To build an accurate model, other features such as item categories, descriptions, and images have to be used to help the model generalize beyond the training data. This is especially relevant in cold-start situations, where relatively little data is available on some items or users.

In this tutorial, we'll experiment with using features beyond movie titles and user ids in our MovieLens model.

Preliminaries

We first import the necessary packages.

pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets
import os
import tempfile

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs
2022-12-14 12:11:12.860829: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-12-14 12:11:12.860920: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-12-14 12:11:12.860930: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

We follow the featurization tutorial and keep the user id, timestamp, and movie title features.

ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "timestamp": x["timestamp"],
})
movies = movies.map(lambda x: x["movie_title"])
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089

We also do some housekeeping to prepare feature vocabularies.

timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))

max_timestamp = timestamps.max()
min_timestamp = timestamps.min()

timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000,
)

unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))
unique_user_ids = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_id"]))))
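
As a quick check on this housekeeping step (a small sketch of our own, not part of the original tutorial), we can print the vocabulary sizes and the range covered by the timestamp buckets:

# Illustration only: inspect the vocabularies we just built.
print(f"Unique movie titles: {len(unique_movie_titles)}")
print(f"Unique user ids: {len(unique_user_ids)}")
print(f"Timestamp buckets: {len(timestamp_buckets)}, "
      f"spanning {timestamp_buckets[0]:.0f} to {timestamp_buckets[-1]:.0f}")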

Model definition

Query model

We start with the user model defined in the featurization tutorial as the first layer of our model, tasked with converting raw input examples into feature embeddings. However, we change it slightly to let us turn timestamp features on or off. This will allow us to more easily demonstrate the effect that timestamp features have on the model. In the code below, the use_timestamps parameter gives us control over whether timestamp features are used.

class UserModel(tf.keras.Model):

  def __init__(self, use_timestamps):
    super().__init__()

    self._use_timestamps = use_timestamps

    self.user_embedding = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
    ])

    if use_timestamps:
      self.timestamp_embedding = tf.keras.Sequential([
          tf.keras.layers.Discretization(timestamp_buckets.tolist()),
          tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
      ])
      self.normalized_timestamp = tf.keras.layers.Normalization(
          axis=None
      )

      self.normalized_timestamp.adapt(timestamps)

  def call(self, inputs):
    if not self._use_timestamps:
      return self.user_embedding(inputs["user_id"])

    return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        tf.reshape(self.normalized_timestamp(inputs["timestamp"]), (-1, 1)),
    ], axis=1)
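
To get a feel for what this layer produces, here is a small illustration of our own (not part of the original tutorial): with timestamps enabled, the model concatenates a 32-dimensional user-id embedding, a 32-dimensional timestamp-bucket embedding, and the scalar normalized timestamp, so each example maps to a 65-dimensional vector.

# Illustration only: inspect the output shape of the user model.
user_model = UserModel(use_timestamps=True)

for batch in ratings.batch(3).take(1):
  print(user_model(batch).shape)  # expected: (3, 65)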

Note that our use of timestamp features in this tutorial interacts with our choice of training-test split in an undesirable way. Because we have split our data randomly rather than chronologically (which would ensure that events in the test dataset happen later than those in the training set), our model can effectively learn from the future. This is unrealistic: after all, we cannot train a model today on data from tomorrow.

This means that adding time features to the model lets it learn future interaction patterns. We do this for illustration purposes only: the MovieLens dataset itself is very dense, and unlike many real-world datasets it does not benefit greatly from features beyond user ids and movie titles.

This caveat aside, real-world models may well benefit from other time-based features such as time of day or day of the week, especially if the data has strong seasonal patterns.
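
To give a concrete flavour of such a feature, here is a minimal sketch of our own (not part of the original tutorial) that derives a day-of-week index from the raw Unix timestamps; the resulting integer could then be embedded just like the timestamp buckets above. The helper name day_of_week is ours.

# Sketch only: derive a day-of-week index (0 = Monday, ..., 6 = Sunday)
# from Unix timestamps; the Unix epoch (1970-01-01) fell on a Thursday.
def day_of_week(unix_seconds):
  days = unix_seconds // 86400
  return tf.cast((days + 3) % 7, tf.int32)

weekdays = ratings.map(lambda x: day_of_week(x["timestamp"]))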

Candidate model

For simplicity, we'll keep the candidate model fixed. Again, we copy it from the featurization tutorial:

class MovieModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
          vocabulary=unique_movie_titles, mask_token=None),
      tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
    ])

    self.title_vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens)

    self.title_text_embedding = tf.keras.Sequential([
      self.title_vectorizer,
      tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
      tf.keras.layers.GlobalAveragePooling1D(),
    ])

    self.title_vectorizer.adapt(movies)

  def call(self, titles):
    return tf.concat([
        self.title_embedding(titles),
        self.title_text_embedding(titles),
    ], axis=1)
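
As with the user model, a small illustration of our own (not in the original tutorial) shows what the candidate model produces: a 32-dimensional title-id embedding concatenated with a 32-dimensional averaged text embedding of the title tokens, for 64 dimensions per title.

# Illustration only: inspect the output shape of the movie model.
movie_model = MovieModel()

for titles in movies.batch(3).take(1):
  print(movie_model(titles).shape)  # expected: (3, 64)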

Combined model

With both UserModel and MovieModel defined, we can put together a combined model and implement our loss and metrics logic.

Here we're building a retrieval model. For a refresher on how these work, see the basic retrieval tutorial.

Note that we also need to make sure that the query model and candidate model output embeddings of compatible size. Because we'll be varying their sizes by adding more features, the easiest way to accomplish this is to use a dense projection layer after each model:

class MovielensModel(tfrs.models.Model):

  def __init__(self, use_timestamps):
    super().__init__()
    self.query_model = tf.keras.Sequential([
      UserModel(use_timestamps),
      tf.keras.layers.Dense(32)
    ])
    self.candidate_model = tf.keras.Sequential([
      MovieModel(),
      tf.keras.layers.Dense(32)
    ])
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movies.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):
    # We only pass the user id and timestamp features into the query model. This
    # is to ensure that the training inputs would have the same keys as the
    # query inputs. Otherwise the discrepancy in input structure would cause an
    # error when loading the query model after saving it.
    query_embeddings = self.query_model({
        "user_id": features["user_id"],
        "timestamp": features["timestamp"],
    })
    movie_embeddings = self.candidate_model(features["movie_title"])

    return self.task(query_embeddings, movie_embeddings)
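
As a quick sanity check (a sketch of our own, not part of the original tutorial), we can confirm that the Dense projections bring both towers to the same embedding size:

# Sanity-check sketch: both towers should emit 32-dimensional embeddings.
query_tower = tf.keras.Sequential([
    UserModel(use_timestamps=True),
    tf.keras.layers.Dense(32),
])
candidate_tower = tf.keras.Sequential([
    MovieModel(),
    tf.keras.layers.Dense(32),
])

for batch in ratings.batch(4).take(1):
  print(query_tower({
      "user_id": batch["user_id"],
      "timestamp": batch["timestamp"],
  }).shape)                                           # expected: (4, 32)
  print(candidate_tower(batch["movie_title"]).shape)  # expected: (4, 32)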

Experiments

Prepare the data

We first split the data into a training set and a testing set.

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()

Baseline: no timestamp features

We're ready to try out our first model: let's start by not using timestamp features, to establish our baseline.

model = MovielensModel(use_timestamps=False)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=3)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]

print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
Epoch 1/3
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 13s 215ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0097 - factorized_top_k/top_5_categorical_accuracy: 0.0208 - factorized_top_k/top_10_categorical_accuracy: 0.0310 - factorized_top_k/top_50_categorical_accuracy: 0.0937 - factorized_top_k/top_100_categorical_accuracy: 0.1620 - loss: 14575.4544 - regularization_loss: 0.0000e+00 - total_loss: 14575.4544
Epoch 2/3
40/40 [==============================] - 9s 194ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0027 - factorized_top_k/top_5_categorical_accuracy: 0.0145 - factorized_top_k/top_10_categorical_accuracy: 0.0274 - factorized_top_k/top_50_categorical_accuracy: 0.1229 - factorized_top_k/top_100_categorical_accuracy: 0.2267 - loss: 14112.4198 - regularization_loss: 0.0000e+00 - total_loss: 14112.4198
Epoch 3/3
40/40 [==============================] - 9s 180ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0024 - factorized_top_k/top_5_categorical_accuracy: 0.0159 - factorized_top_k/top_10_categorical_accuracy: 0.0319 - factorized_top_k/top_50_categorical_accuracy: 0.1421 - factorized_top_k/top_100_categorical_accuracy: 0.2579 - loss: 13928.9752 - regularization_loss: 0.0000e+00 - total_loss: 13928.9752
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 8s 155ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0035 - factorized_top_k/top_5_categorical_accuracy: 0.0226 - factorized_top_k/top_10_categorical_accuracy: 0.0427 - factorized_top_k/top_50_categorical_accuracy: 0.1742 - factorized_top_k/top_100_categorical_accuracy: 0.2950 - loss: 13707.4737 - regularization_loss: 0.0000e+00 - total_loss: 13707.4737
5/5 [==============================] - 3s 212ms/step - factorized_top_k/top_1_categorical_accuracy: 5.5000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0071 - factorized_top_k/top_10_categorical_accuracy: 0.0166 - factorized_top_k/top_50_categorical_accuracy: 0.1054 - factorized_top_k/top_100_categorical_accuracy: 0.2069 - loss: 31034.1432 - regularization_loss: 0.0000e+00 - total_loss: 31034.1432
Top-100 accuracy (train): 0.30.
Top-100 accuracy (test): 0.21.

This gives us a baseline top-100 accuracy of around 0.2.

Capturing time dynamics with time features

Do the results change if we add time features?

model = MovielensModel(use_timestamps=True)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=3)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]

print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
Epoch 1/3
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 12s 227ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0142 - factorized_top_k/top_5_categorical_accuracy: 0.0285 - factorized_top_k/top_10_categorical_accuracy: 0.0393 - factorized_top_k/top_50_categorical_accuracy: 0.1045 - factorized_top_k/top_100_categorical_accuracy: 0.1739 - loss: 14540.3001 - regularization_loss: 0.0000e+00 - total_loss: 14540.3001
Epoch 2/3
40/40 [==============================] - 8s 172ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0037 - factorized_top_k/top_5_categorical_accuracy: 0.0178 - factorized_top_k/top_10_categorical_accuracy: 0.0340 - factorized_top_k/top_50_categorical_accuracy: 0.1443 - factorized_top_k/top_100_categorical_accuracy: 0.2600 - loss: 13946.6094 - regularization_loss: 0.0000e+00 - total_loss: 13946.6094
Epoch 3/3
40/40 [==============================] - 9s 172ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0026 - factorized_top_k/top_5_categorical_accuracy: 0.0200 - factorized_top_k/top_10_categorical_accuracy: 0.0412 - factorized_top_k/top_50_categorical_accuracy: 0.1784 - factorized_top_k/top_100_categorical_accuracy: 0.3091 - loss: 13681.3415 - regularization_loss: 0.0000e+00 - total_loss: 13681.3415
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 8s 157ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0057 - factorized_top_k/top_5_categorical_accuracy: 0.0344 - factorized_top_k/top_10_categorical_accuracy: 0.0638 - factorized_top_k/top_50_categorical_accuracy: 0.2331 - factorized_top_k/top_100_categorical_accuracy: 0.3749 - loss: 13357.4951 - regularization_loss: 0.0000e+00 - total_loss: 13357.4951
5/5 [==============================] - 1s 225ms/step - factorized_top_k/top_1_categorical_accuracy: 9.0000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0093 - factorized_top_k/top_10_categorical_accuracy: 0.0228 - factorized_top_k/top_50_categorical_accuracy: 0.1311 - factorized_top_k/top_100_categorical_accuracy: 0.2531 - loss: 30674.3815 - regularization_loss: 0.0000e+00 - total_loss: 30674.3815
Top-100 accuracy (train): 0.37.
Top-100 accuracy (test): 0.25.

This is quite a bit better: not only is the training accuracy much higher, but the test accuracy is also substantially improved.

Next steps

This tutorial shows that even simple models can become more accurate when incorporating more features. However, to get the most out of your features it is often necessary to build larger, deeper models. Have a look at the deep retrieval tutorial to explore this in more detail.