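In this tutorial, we build a simple two-tower ranking model using the MovieLens 100K dataset and TF-Ranking. The model predicts user ratings, which we can then use to rank and recommend movies for a given user.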
Setup
Install and import the TF-Ranking library
pip install -q tensorflow-ranking
pip install -q --upgrade tensorflow-datasets
from typing import Dict, Tuple
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_ranking as tfr
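Read the data
Prepare to train the model by creating a ratings dataset and a movies dataset. Use user_id as the query input feature, movie_title as the document input feature, and user_rating as the label to train the ranking model.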
%%capture --no-display
# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train")
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train")
# Select the basic features.
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"]
})
Build vocabularies that convert all user IDs and all movie titles into integer indices for the embedding layers
movies = movies.map(lambda x: x["movie_title"])
users = ratings.map(lambda x: x["user_id"])
# StringLookup maps each string to an integer index. It is used here from
# tf.keras.layers rather than the older experimental.preprocessing path.
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(users.batch(1000))
movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocabulary.adapt(movies.batch(1000))
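To make the vocabulary step concrete, the following is a minimal pure-Python sketch of what a string lookup does conceptually: each distinct string receives an integer index, and index 0 is reserved for strings that were never seen. The helper names `build_vocabulary` and `lookup` are hypothetical, and the tie-breaking rule is an assumption of this sketch; the real StringLookup layer may assign indices differently.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_vocabulary(tokens: Iterable[str]) -> Dict[str, int]:
    """Assign an index to each distinct token; index 0 is kept for unknowns."""
    counts = Counter(tokens)
    # Most frequent tokens first. Ties are broken alphabetically so that the
    # result is deterministic (an assumption of this sketch).
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: index for index, token in enumerate(ordered, start=1)}


def lookup(vocabulary: Dict[str, int], tokens: Iterable[str]) -> List[int]:
    """Map tokens to indices, sending unseen tokens to index 0."""
    return [vocabulary.get(token, 0) for token in tokens]


user_ids = ["405", "655", "405", "13", "405", "655"]
vocab = build_vocabulary(user_ids)
indices = lookup(vocab, ["13", "405", "999"])  # "999" was never seen
embedding_rows = len(vocab) + 1  # one extra row for the unknown index
print(vocab, indices, embedding_rows)
```

The embedding table needs one row per vocabulary entry plus one for the unknown index, which is the role `vocabulary_size()` plays when the embedding layers are created later in this tutorial.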
Group by user_id to form the lists for the ranking model
key_func = lambda x: user_ids_vocabulary(x["user_id"])
reduce_func = lambda key, dataset: dataset.batch(100)
ds_train = ratings.group_by_window(
    key_func=key_func, reduce_func=reduce_func, window_size=100)
for x in ds_train.take(1):
    for key, value in x.items():
        print(f"Shape of {key}: {value.shape}")
        print(f"Example values of {key}: {value[:5].numpy()}")
        print()
Shape of movie_title: (100,)
Example values of movie_title: [b'Man Who Would Be King, The (1975)' b'Silence of the Lambs, The (1991)' b'Next Karate Kid, The (1994)' b'2001: A Space Odyssey (1968)' b'Usual Suspects, The (1995)']

Shape of user_id: (100,)
Example values of user_id: [b'405' b'405' b'405' b'405' b'405']

Shape of user_rating: (100,)
Example values of user_rating: [1. 4. 1. 5. 5.]
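The grouping step can be pictured with a small pure-Python sketch. It is only an approximation of what group_by_window does: ratings are buffered per user, a list is emitted once a user has accumulated `window_size` ratings, and shorter leftover lists are emitted at the end. The function name `group_by_user` and the toy ratings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Rating = Tuple[str, str, float]  # (user_id, movie_title, user_rating)


def group_by_user(ratings: Iterable[Rating],
                  window_size: int) -> Iterator[List[Rating]]:
    """Yield lists of ratings sharing a user_id, at most window_size long."""
    buffers: Dict[str, List[Rating]] = defaultdict(list)
    for rating in ratings:
        user_id = rating[0]
        buffers[user_id].append(rating)
        # Emit a full list as soon as a user has window_size ratings.
        if len(buffers[user_id]) == window_size:
            yield buffers.pop(user_id)
    # Emit the remaining, shorter lists once the input is exhausted.
    for leftover in buffers.values():
        yield leftover


toy_ratings = [
    ("405", "Movie A", 1.0),
    ("13", "Movie B", 5.0),
    ("405", "Movie C", 4.0),
    ("405", "Movie D", 1.0),
    ("13", "Movie E", 1.0),
]
groups = list(group_by_user(toy_ratings, window_size=2))
for group in groups:
    print(group)
```

With a window size of 2, user 405 produces one full list and one shorter leftover list, which mirrors why some lists in the tutorial contain fewer than 100 items.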
Generate batched features and labels
def _features_and_labels(
        x: Dict[str, tf.Tensor]) -> Tuple[Dict[str, tf.Tensor], tf.Tensor]:
    # Split the label off from the input features.
    labels = x.pop("user_rating")
    return x, labels
ds_train = ds_train.map(_features_and_labels)
# Batch lists of different lengths into ragged tensors. Dataset.ragged_batch
# replaces the deprecated tf.data.experimental.dense_to_ragged_batch.
ds_train = ds_train.ragged_batch(batch_size=32)
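The user_id and movie_title tensors generated in ds_train have shape [32, None]. The second dimension is 100 in most cases, except when a list in the batch groups fewer than 100 items. A model that works on ragged tensors is therefore used.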
for x, label in ds_train.take(1):
    for key, value in x.items():
        print(f"Shape of {key}: {value.shape}")
        print(f"Example values of {key}: {value[:3, :3].numpy()}")
        print()
    print(f"Shape of label: {label.shape}")
    print(f"Example values of label: {label[:3, :3].numpy()}")
Shape of movie_title: (32, None)
Example values of movie_title: [[b'Man Who Would Be King, The (1975)' b'Silence of the Lambs, The (1991)' b'Next Karate Kid, The (1994)']
 [b'Flower of My Secret, The (Flor de mi secreto, La) (1995)' b'Little Princess, The (1939)' b'Time to Kill, A (1996)']
 [b'Kundun (1997)' b'Scream (1996)' b'Power 98 (1995)']]

Shape of user_id: (32, None)
Example values of user_id: [[b'405' b'405' b'405']
 [b'655' b'655' b'655']
 [b'13' b'13' b'13']]

Shape of label: (32, None)
Example values of label: [[1. 4. 1.]
 [3. 3. 3.]
 [5. 1. 1.]]
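The [32, None] shape means that rows in a batch may have different lengths and are not padded to a common size. A minimal pure-Python sketch of that idea follows; the function name `ragged_batch` and the toy label lists are hypothetical, and the sketch only illustrates the batching concept, not the TensorFlow implementation.

```python
from typing import Iterable, Iterator, List, Sequence


def ragged_batch(lists: Iterable[Sequence[float]],
                 batch_size: int) -> Iterator[List[List[float]]]:
    """Group variable-length lists into batches without padding them."""
    batch: List[List[float]] = []
    for items in lists:
        batch.append(list(items))
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The final batch may hold fewer than batch_size lists.
    if batch:
        yield batch


label_lists = [[1.0, 4.0, 1.0], [3.0, 3.0], [5.0, 1.0, 1.0, 2.0]]
batches = list(ragged_batch(label_lists, batch_size=2))
row_lengths = [[len(row) for row in batch] for batch in batches]
print(row_lengths)
```

Each batch keeps the true length of every list, which is why the loss and metrics used later are created with `ragged=True`.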
Define a model
Define a ranking model by inheriting from tf.keras.Model and implementing the call method
class MovieLensRankingModel(tf.keras.Model):

    def __init__(self, user_vocab, movie_vocab):
        super().__init__()
        # Set up user and movie vocabulary and embedding.
        self.user_vocab = user_vocab
        self.movie_vocab = movie_vocab
        self.user_embed = tf.keras.layers.Embedding(
            user_vocab.vocabulary_size(), 64)
        self.movie_embed = tf.keras.layers.Embedding(
            movie_vocab.vocabulary_size(), 64)

    def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
        # Define how the ranking scores are computed:
        # Take the dot-product of the user embeddings with the movie embeddings.
        user_embeddings = self.user_embed(self.user_vocab(features["user_id"]))
        movie_embeddings = self.movie_embed(
            self.movie_vocab(features["movie_title"]))
        return tf.reduce_sum(user_embeddings * movie_embeddings, axis=2)
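The scoring rule in call is a dot product between one user vector and each movie vector in the list. A minimal sketch in plain Python illustrates the arithmetic, using made-up 3-dimensional embeddings instead of the 64-dimensional learned ones; all names and values here are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_list(user_embedding: Sequence[float],
               movie_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """One ranking score per movie: user vector dotted with movie vector."""
    return [dot(user_embedding, movie) for movie in movie_embeddings]


# Toy embeddings; in the tutorial these are learned during training.
user_embedding = [1.0, 0.0, 2.0]
movie_embeddings = [
    [0.5, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 3.0, -1.0],
]
scores = score_list(user_embedding, movie_embeddings)
# Indices of the movies ordered from highest to lowest score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Because the user and the movie are embedded independently and only meet in the dot product, this is the two-tower structure named in the introduction.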
Create the model, and then compile it with the ranking-specific tfr.keras.losses and tfr.keras.metrics, which are the core of the TF-Ranking package.
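This example uses a ranking-specific **softmax loss**, a listwise loss designed to give every relevant item in a ranked list a better chance of appearing above the irrelevant ones. In contrast to the softmax loss in multi-class classification, where exactly one class is positive and the rest are negative, the TF-Ranking library supports multiple relevant documents in a query list as well as non-binary relevance labels.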
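As a rough illustration of the listwise idea, the sketch below computes a softmax cross-entropy over one list, weighting each item's log-probability by its graded label. It is a simplified form written for this explanation and is not claimed to match the library's implementation in detail; the function names and toy values are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one list of scores."""
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_sum for s in scores]


def listwise_softmax_loss(labels: Sequence[float],
                          scores: Sequence[float]) -> float:
    """Sum over items of -label * log(softmax(score)) for a single list."""
    return -sum(y * lp for y, lp in zip(labels, log_softmax(scores)))


labels = [5.0, 1.0, 3.0]  # graded, non-binary relevance (user ratings)
loss_good = listwise_softmax_loss(labels, [2.0, -1.0, 0.5])  # agrees with labels
loss_bad = listwise_softmax_loss(labels, [-1.0, 2.0, 0.5])   # top two swapped
print(loss_good, loss_bad)
```

Scores that place highly rated items first yield a lower loss than scores that place them last, and every item with a positive label contributes to the loss, not only a single positive class.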
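For ranking metrics, this example uses **Normalized Discounted Cumulative Gain (NDCG)** and **Mean Reciprocal Rank (MRR)**, which calculate the user utility of a ranked query list with position discounts. For more details about ranking metrics, see the evaluation measures documentation on offline metrics.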
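Both metrics can be worked through by hand on a single list. The sketch below uses the common textbook definitions: gain 2^label - 1 with a log2 position discount for NDCG, and the reciprocal rank of the first relevant item for MRR. The relevance threshold of 1.0 is an assumption of this sketch, as are the function names and toy values; the library's own options may differ.

```python
import math
from typing import List, Sequence


def rank_labels(labels: Sequence[float], scores: Sequence[float]) -> List[float]:
    """Labels reordered by descending score (ties keep their input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [labels[i] for i in order]


def dcg(ranked_labels: Sequence[float]) -> float:
    """Discounted cumulative gain with gain 2^label - 1 and log2 discount."""
    return sum((2.0 ** y - 1.0) / math.log2(rank + 1)
               for rank, y in enumerate(ranked_labels, start=1))


def ndcg(labels: Sequence[float], scores: Sequence[float]) -> float:
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(rank_labels(labels, scores)) / ideal if ideal > 0 else 0.0


def mrr(labels: Sequence[float], scores: Sequence[float],
        threshold: float = 1.0) -> float:
    """Reciprocal rank of the first item whose label reaches the threshold."""
    for rank, y in enumerate(rank_labels(labels, scores), start=1):
        if y >= threshold:
            return 1.0 / rank
    return 0.0


ratings = [1.0, 4.0, 1.0, 5.0, 5.0]
perfect = ndcg(ratings, [0.1, 0.8, 0.2, 0.9, 1.0])   # best items scored highest
reversed_ = ndcg(ratings, [1.0, 0.2, 0.9, 0.1, 0.0])  # best items scored lowest
print(perfect, reversed_, mrr(ratings, [1.0, 0.2, 0.9, 0.1, 0.0]))
```

Under the assumed threshold, every MovieLens rating from 1 to 5 counts as relevant, so the first item of any ordering is relevant and MRR stays at 1.0, which would be consistent with the MRR values printed during training later in this tutorial. NDCG is the more informative metric for graded labels like these.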
# Create the ranking model, trained with a ranking loss and evaluated with
# ranking metrics.
model = MovieLensRankingModel(user_ids_vocabulary, movie_titles_vocabulary)
optimizer = tf.keras.optimizers.Adagrad(0.5)
loss = tfr.keras.losses.get(
    loss=tfr.keras.losses.RankingLossKey.SOFTMAX_LOSS, ragged=True)
eval_metrics = [
    tfr.keras.metrics.get(key="ndcg", name="metric/ndcg", ragged=True),
    tfr.keras.metrics.get(key="mrr", name="metric/mrr", ragged=True)
]
model.compile(optimizer=optimizer, loss=loss, metrics=eval_metrics)
Train and evaluate the model
Train the model with model.fit.
model.fit(ds_train, epochs=3)
Epoch 1/3
48/48 [==============================] - 7s 56ms/step - loss: 998.7637 - metric/ndcg: 0.8213 - metric/mrr: 1.0000
Epoch 2/3
48/48 [==============================] - 4s 53ms/step - loss: 997.1824 - metric/ndcg: 0.9161 - metric/mrr: 1.0000
Epoch 3/3
48/48 [==============================] - 4s 53ms/step - loss: 994.8384 - metric/ndcg: 0.9383 - metric/mrr: 1.0000
<keras.src.callbacks.History at 0x7f666424d700>
Generate predictions and evaluate.
# Get movie title candidate list.
for movie_titles in movies.batch(2000):
    break
# Generate the input for user 42.
inputs = {
    "user_id":
        tf.expand_dims(tf.repeat("42", repeats=movie_titles.shape[0]), axis=0),
    "movie_title":
        tf.expand_dims(movie_titles, axis=0)
}
# Get movie recommendations for user 42.
scores = model(inputs)
titles = tfr.utils.sort_by_scores(scores,
                                  [tf.expand_dims(movie_titles, axis=0)])[0]
print(f"Top 5 recommendations for user 42: {titles[0, :5]}")
Top 5 recommendations for user 42: [b'Star Wars (1977)' b'Liar Liar (1997)' b'Toy Story (1995)' b'Raiders of the Lost Ark (1981)' b'Sound of Music, The (1965)']