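In the featurization tutorial we incorporated several features beyond user and movie identifiers into our models, but we have not yet explored whether those features improve model accuracy.
Many factors affect whether features beyond ids are useful in a recommender model: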
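- Importance of context: if user preferences are relatively stable across contexts and time, context features may not provide much benefit. If, however, user preferences are highly contextual, adding context can improve the model significantly. For example, day of the week may be an important feature when deciding whether to recommend a short clip or a movie: users may only have time to watch short content during the week, but can relax and enjoy a full-length movie during the weekend. Similarly, query timestamps may play an important role in modelling popularity dynamics: one movie may be highly popular around the time of its release, but decay quickly afterwards. Conversely, other movies may be evergreens that are happily watched time and time again.
- Data sparsity: using non-id features may be critical if data is sparse. With few observations available for a given user or item, the model may struggle to estimate a good per-user or per-item representation. To build an accurate model, other features such as item categories, descriptions, and images have to be used to help the model generalize beyond the training data. This is especially relevant in cold-start situations, where relatively little data is available on some items or users.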
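As a rough illustration of the context features discussed above, the following sketch derives a day-of-week value and a weekend flag from a Unix timestamp. The helper name `time_context` and the sample timestamps are illustrative assumptions; they are not part of this tutorial's model code.

```python
from datetime import datetime, timezone


def time_context(unix_seconds):
    """Derive simple context features from a Unix timestamp (interpreted as UTC)."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    day_of_week = moment.weekday()  # Monday == 0, Sunday == 6
    return {
        "day_of_week": day_of_week,
        "hour_of_day": moment.hour,
        "is_weekend": day_of_week >= 5,
    }


# Hypothetical sample inputs: the Unix epoch fell on a Thursday,
# and two days later was a Saturday.
thursday = time_context(0)
saturday = time_context(2 * 86_400)
```

Features like these could be fed to a model alongside the raw timestamp; whether they help depends on how contextual the users' preferences actually are.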
In this tutorial, we'll experiment with adding features beyond movie titles and user ids to our MovieLens model.
Preliminaries
We first import the necessary packages.
pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets
import os
import tempfile
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
We follow the featurization tutorial and keep the user id, timestamp, and movie title features.
ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "timestamp": x["timestamp"],
})
movies = movies.map(lambda x: x["movie_title"])
We also do some housekeeping to prepare the feature vocabularies.
timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))
max_timestamp = timestamps.max()
min_timestamp = timestamps.min()
timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000,
)
unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))
unique_user_ids = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_id"]))))
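To make the role of the timestamp bucket boundaries more concrete, here is a minimal pure-Python sketch of bucketization. It assumes that a value equal to a boundary falls into the bucket above it, which is our reading of how a discretization layer with explicit boundaries behaves. The helpers `linspace` and `bucketize` and the toy range 0 to 100 are hypothetical stand-ins for illustration only.

```python
from bisect import bisect_right


def linspace(start, stop, num):
    """Return `num` evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def bucketize(value, boundaries):
    """Map a value to a bucket index in [0, len(boundaries)]."""
    # Values below the first boundary land in bucket 0; values at or
    # above the last boundary land in bucket len(boundaries).
    return bisect_right(boundaries, value)


boundaries = linspace(0.0, 100.0, num=5)  # [0.0, 25.0, 50.0, 75.0, 100.0]
bucket_ids = [bucketize(t, boundaries) for t in (-1.0, 0.0, 30.0, 100.0)]
num_buckets = len(boundaries) + 1
```

Under this assumption, N boundaries produce N + 1 possible bucket indices, which is why an embedding table indexed by bucket needs one more row than there are boundaries.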
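Model definition
Query model
We start with the user model defined in the featurization tutorial as the first layer of our model, tasked with converting raw input examples into feature embeddings. However, we change it slightly so that we can turn timestamp features on or off. This makes it easier to demonstrate the effect that timestamp features have on the model. In the code below, the use_timestamps parameter controls whether timestamp features are used.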
class UserModel(tf.keras.Model):

  def __init__(self, use_timestamps):
    super().__init__()

    self._use_timestamps = use_timestamps

    self.user_embedding = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
    ])

    if use_timestamps:
      self.timestamp_embedding = tf.keras.Sequential([
          tf.keras.layers.Discretization(timestamp_buckets.tolist()),
          tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
      ])
      self.normalized_timestamp = tf.keras.layers.Normalization(
          axis=None
      )

      self.normalized_timestamp.adapt(timestamps)

  def call(self, inputs):
    if not self._use_timestamps:
      return self.user_embedding(inputs["user_id"])

    return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        tf.reshape(self.normalized_timestamp(inputs["timestamp"]), (-1, 1)),
    ], axis=1)
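Note that our use of timestamp features in this tutorial interacts with our choice of training-test split in an undesirable way. Because we split the data randomly rather than chronologically (a chronological split would ensure that events in the test set happen later than those in the training set), our model can effectively learn from the future. This is unrealistic: after all, we cannot train a model today on data from tomorrow.
This means that adding time features to the model lets it learn future interaction patterns. We do this for illustration purposes only: the MovieLens dataset itself is very dense and, unlike many real-world datasets, does not benefit greatly from features beyond user ids and movie titles.
This caveat aside, real-world models may well benefit from other time-based features such as time of day or day of week, especially if the data has strong seasonal patterns.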
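For contrast with the random split used in this tutorial, the following is a minimal sketch of a chronological split on in-memory records. The function name `chronological_split`, the 80/20 fraction, and the toy records are assumptions made for illustration; the tutorial itself does not split the data this way.

```python
def chronological_split(examples, train_fraction=0.8):
    """Split examples so that every test event is at least as late as every training event."""
    ordered = sorted(examples, key=lambda example: example["timestamp"])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical toy interactions with out-of-order timestamps.
toy_ratings = [
    {"user_id": "a", "timestamp": 50},
    {"user_id": "b", "timestamp": 10},
    {"user_id": "a", "timestamp": 40},
    {"user_id": "c", "timestamp": 20},
    {"user_id": "b", "timestamp": 30},
]

train_part, test_part = chronological_split(toy_ratings)
```

With a split like this, timestamp features can no longer leak information about the future into training, so any measured gain is closer to what a deployed model would see.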
Candidate model
For simplicity, we'll keep the candidate model fixed. Again, we copy it from the featurization tutorial:
class MovieModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=unique_movie_titles, mask_token=None),
        tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
    ])

    self.title_vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens)

    self.title_text_embedding = tf.keras.Sequential([
        self.title_vectorizer,
        tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
        tf.keras.layers.GlobalAveragePooling1D(),
    ])

    self.title_vectorizer.adapt(movies)

  def call(self, titles):
    return tf.concat([
        self.title_embedding(titles),
        self.title_text_embedding(titles),
    ], axis=1)
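Combined model
With both UserModel and MovieModel defined, we can put together a combined model and implement our loss and metrics logic.
Here we're building a retrieval model. For a refresher on how this works, see the basic retrieval tutorial.
Note that we also need to make sure that the query model and the candidate model output embeddings of compatible size. Because we'll be varying their sizes by adding more features, the easiest way to accomplish this is to use a dense projection layer after each model: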
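The size mismatch can be seen with a small stand-alone sketch. With timestamps enabled, the user model concatenates a 32-dimensional id embedding, a 32-dimensional bucket embedding, and one normalized scalar (65 values), while the movie model concatenates two 32-dimensional embeddings (64 values). The code below uses random matrices as hypothetical stand-ins for learned weights and a bias-free linear map as a simplified stand-in for a dense layer.

```python
import random

rng = random.Random(0)


def random_matrix(rows, cols):
    """Build a rows x cols matrix of small random values (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(vectors, weights):
    """Apply a bias-free linear projection to each row of `vectors`."""
    columns = list(zip(*weights))  # one tuple per output dimension
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns]
            for vec in vectors]


query_features = random_matrix(2, 65)      # 32 + 32 + 1 features per query
candidate_features = random_matrix(3, 64)  # 32 + 32 features per candidate

# Project both sides to a shared 32-dimensional space.
queries = project(query_features, random_matrix(65, 32))
candidates = project(candidate_features, random_matrix(64, 32))

# Dot-product affinity between every query and every candidate.
scores = [[sum(q * c for q, c in zip(query, candidate)) for candidate in candidates]
          for query in queries]
```

Without the projection, a dot product between 65-dimensional queries and 64-dimensional candidates would not be defined; after it, both sides have 32 dimensions regardless of which features are switched on.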
class MovielensModel(tfrs.models.Model):

  def __init__(self, use_timestamps):
    super().__init__()
    self.query_model = tf.keras.Sequential([
        UserModel(use_timestamps),
        tf.keras.layers.Dense(32)
    ])
    self.candidate_model = tf.keras.Sequential([
        MovieModel(),
        tf.keras.layers.Dense(32)
    ])
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movies.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):
    # We only pass the user id and timestamp features into the query model. This
    # is to ensure that the training inputs would have the same keys as the
    # query inputs. Otherwise the discrepancy in input structure would cause an
    # error when loading the query model after saving it.
    query_embeddings = self.query_model({
        "user_id": features["user_id"],
        "timestamp": features["timestamp"],
    })
    movie_embeddings = self.candidate_model(features["movie_title"])

    return self.task(query_embeddings, movie_embeddings)
Experiments
Prepare the data
We first split the data into a training set and a testing set.
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()
Baseline: no timestamp features
We're ready to try out our first model: let's start by not using timestamp features, to establish our baseline.
model = MovielensModel(use_timestamps=False)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
model.fit(cached_train, epochs=3)
train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
Epoch 1/3
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 13s 215ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0097 - factorized_top_k/top_5_categorical_accuracy: 0.0208 - factorized_top_k/top_10_categorical_accuracy: 0.0310 - factorized_top_k/top_50_categorical_accuracy: 0.0937 - factorized_top_k/top_100_categorical_accuracy: 0.1620 - loss: 14575.4544 - regularization_loss: 0.0000e+00 - total_loss: 14575.4544
Epoch 2/3
40/40 [==============================] - 9s 194ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0027 - factorized_top_k/top_5_categorical_accuracy: 0.0145 - factorized_top_k/top_10_categorical_accuracy: 0.0274 - factorized_top_k/top_50_categorical_accuracy: 0.1229 - factorized_top_k/top_100_categorical_accuracy: 0.2267 - loss: 14112.4198 - regularization_loss: 0.0000e+00 - total_loss: 14112.4198
Epoch 3/3
40/40 [==============================] - 9s 180ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0024 - factorized_top_k/top_5_categorical_accuracy: 0.0159 - factorized_top_k/top_10_categorical_accuracy: 0.0319 - factorized_top_k/top_50_categorical_accuracy: 0.1421 - factorized_top_k/top_100_categorical_accuracy: 0.2579 - loss: 13928.9752 - regularization_loss: 0.0000e+00 - total_loss: 13928.9752
40/40 [==============================] - 8s 155ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0035 - factorized_top_k/top_5_categorical_accuracy: 0.0226 - factorized_top_k/top_10_categorical_accuracy: 0.0427 - factorized_top_k/top_50_categorical_accuracy: 0.1742 - factorized_top_k/top_100_categorical_accuracy: 0.2950 - loss: 13707.4737 - regularization_loss: 0.0000e+00 - total_loss: 13707.4737
5/5 [==============================] - 3s 212ms/step - factorized_top_k/top_1_categorical_accuracy: 5.5000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0071 - factorized_top_k/top_10_categorical_accuracy: 0.0166 - factorized_top_k/top_50_categorical_accuracy: 0.1054 - factorized_top_k/top_100_categorical_accuracy: 0.2069 - loss: 31034.1432 - regularization_loss: 0.0000e+00 - total_loss: 31034.1432
Top-100 accuracy (train): 0.30.
Top-100 accuracy (test): 0.21.
This gives us a baseline top-100 accuracy of around 0.2 on the test set.
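As a reminder of what a top-k accuracy figure measures, here is a small pure-Python sketch: the fraction of examples whose true candidate appears among the k highest-scoring candidates. This is our understanding of the metric rather than the library's implementation, and the function name `top_k_accuracy` and the toy scores are hypothetical.

```python
def top_k_accuracy(scores, true_indices, k):
    """Fraction of rows whose true candidate index is among the k highest scores."""
    hits = 0
    for row, true_index in zip(scores, true_indices):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(true_indices)


# Hypothetical scores for 3 queries over 4 candidates.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.8, 0.5, 0.1],
    [0.4, 0.3, 0.2, 0.6],
]
toy_truth = [0, 2, 1]  # index of the candidate each query actually interacted with

top_1 = top_k_accuracy(toy_scores, toy_truth, k=1)
top_2 = top_k_accuracy(toy_scores, toy_truth, k=2)
top_3 = top_k_accuracy(toy_scores, toy_truth, k=3)
```

Read this way, a top-100 accuracy of about 0.2 means that for roughly one in five test interactions, the movie the user actually watched is among the 100 highest-scoring candidates.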
Capturing time dynamics with time features
Do the results change if we add time features?
model = MovielensModel(use_timestamps=True)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
model.fit(cached_train, epochs=3)
train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
Epoch 1/3
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 12s 227ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0142 - factorized_top_k/top_5_categorical_accuracy: 0.0285 - factorized_top_k/top_10_categorical_accuracy: 0.0393 - factorized_top_k/top_50_categorical_accuracy: 0.1045 - factorized_top_k/top_100_categorical_accuracy: 0.1739 - loss: 14540.3001 - regularization_loss: 0.0000e+00 - total_loss: 14540.3001
Epoch 2/3
40/40 [==============================] - 8s 172ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0037 - factorized_top_k/top_5_categorical_accuracy: 0.0178 - factorized_top_k/top_10_categorical_accuracy: 0.0340 - factorized_top_k/top_50_categorical_accuracy: 0.1443 - factorized_top_k/top_100_categorical_accuracy: 0.2600 - loss: 13946.6094 - regularization_loss: 0.0000e+00 - total_loss: 13946.6094
Epoch 3/3
40/40 [==============================] - 9s 172ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0026 - factorized_top_k/top_5_categorical_accuracy: 0.0200 - factorized_top_k/top_10_categorical_accuracy: 0.0412 - factorized_top_k/top_50_categorical_accuracy: 0.1784 - factorized_top_k/top_100_categorical_accuracy: 0.3091 - loss: 13681.3415 - regularization_loss: 0.0000e+00 - total_loss: 13681.3415
40/40 [==============================] - 8s 157ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0057 - factorized_top_k/top_5_categorical_accuracy: 0.0344 - factorized_top_k/top_10_categorical_accuracy: 0.0638 - factorized_top_k/top_50_categorical_accuracy: 0.2331 - factorized_top_k/top_100_categorical_accuracy: 0.3749 - loss: 13357.4951 - regularization_loss: 0.0000e+00 - total_loss: 13357.4951
5/5 [==============================] - 1s 225ms/step - factorized_top_k/top_1_categorical_accuracy: 9.0000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0093 - factorized_top_k/top_10_categorical_accuracy: 0.0228 - factorized_top_k/top_50_categorical_accuracy: 0.1311 - factorized_top_k/top_100_categorical_accuracy: 0.2531 - loss: 30674.3815 - regularization_loss: 0.0000e+00 - total_loss: 30674.3815
Top-100 accuracy (train): 0.37.
Top-100 accuracy (test): 0.25.
This is quite a bit better: top-100 accuracy rises from 0.30 to 0.37 on the training set and from 0.21 to 0.25 on the test set.
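Next steps
This tutorial shows that even simple models can become more accurate when they incorporate more features. However, to get the most out of your features it is often necessary to build larger, deeper models. See the deep retrieval tutorial to explore this in more detail.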