
Making predictions

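Welcome to the prediction Colab for TensorFlow Decision Forests (TF-DF). In this colab, you will learn how to generate predictions with a previously trained TF-DF model using the Python API.

Note: The Python API shown in this Colab is simple to use and well suited for experimentation. However, other APIs, such as TensorFlow Serving and the C++ API, are better suited for production systems because they are faster and more stable. See the TF-DF documentation for the complete list of Serving APIs.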

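In this colab, you will:

Use the model.predict() function on a TensorFlow Dataset created with pd_dataframe_to_tf_dataset.
Use the model.predict() function on a TensorFlow Dataset created manually.
Use the model.predict() function on NumPy arrays.
Make predictions with the CLI API.
Benchmark the inference speed of a model with the CLI API.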

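Important remark

The dataset used for predictions should have the same feature names and types as the dataset used for training. Failing to do so will most likely raise errors.

For example, training a model with two features f1 and f2 and then trying to generate predictions on a dataset without f2 will fail. Note that it is fine to set some or all feature values as "missing". Similarly, training a model where f2 is a numerical feature (e.g., float32) and applying it to a dataset where f2 is a text feature (e.g., string) will fail.

While abstracted by the Keras API, a model instantiated in Python (e.g., with tfdf.keras.RandomForestModel()) and a model loaded from disk (e.g., with tf_keras.models.load_model()) can behave differently. Notably, a model instantiated in Python automatically applies the necessary type conversions. For example, if a float64 feature is fed to a model expecting a float32 feature, the conversion is performed implicitly. A model loaded from disk cannot perform such conversions. It is therefore important that the training data and the inference data always have exactly the same types.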
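One way to guard against the type mismatches described above is to cast the inference features explicitly before calling the model. The following is a minimal sketch, not part of the TF-DF API: the helper `align_features` and the `TRAIN_DTYPES` record are hypothetical names introduced for illustration, and NaN is assumed to stand for a missing numerical value.

```python
import numpy as np

# Hypothetical record of the feature dtypes used at training time.
TRAIN_DTYPES = {"f1": np.float32, "f2": np.float32}


def align_features(features, expected_dtypes):
    """Return a copy of `features` cast to the dtypes used at training time."""
    missing = set(expected_dtypes) - set(features)
    if missing:
        # A feature that is absent altogether cannot be repaired by casting.
        raise ValueError(f"Missing features: {sorted(missing)}")
    return {
        name: np.asarray(features[name]).astype(dtype)
        for name, dtype in expected_dtypes.items()
    }


# NumPy produces float64 by default, which differs from the float32 above.
# The NaN in f2 is assumed to mark a missing value for that example.
raw = {"f1": np.random.rand(4), "f2": np.array([0.1, np.nan, 0.3, 0.4])}
aligned = align_features(raw, TRAIN_DTYPES)
print({name: values.dtype for name, values in aligned.items()})
```

Casting explicitly keeps the behavior identical whether the model was instantiated in Python or loaded from disk, at the cost of maintaining the dtype record by hand.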

Setup

First, we install TensorFlow Decision Forests...

pip install tensorflow_decision_forests

... and import the libraries used in this example.

import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

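The `model.predict(...)` and `pd_dataframe_to_tf_dataset` functions

TensorFlow Decision Forests implements the Keras model API. As such, TF-DF models have a predict function to make predictions. This function takes a TensorFlow Dataset as input and outputs an array of predictions. The simplest way to create a TensorFlow dataset is to use Pandas and the tfdf.keras.pd_dataframe_to_tf_dataset(...) function.

The next example shows how to create a TensorFlow dataset using pd_dataframe_to_tf_dataset.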

pd_dataset = pd.DataFrame({
    "feature_1": [1,2,3],
    "feature_2": ["a", "b", "c"],
    "label": [0, 1, 0],
})

pd_dataset

tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_dataset, label="label")

for features, label in tf_dataset:
  print("Features:",features)
  print("label:", label)

Features: {'feature_1': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1, 2, 3])>, 'feature_2': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'a', b'b', b'c'], dtype=object)>}
label: tf.Tensor([0 1 0], shape=(3,), dtype=int64)
2024-04-20 11:14:51.301980: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

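Note: "pd_" stands for "pandas". "tf_" stands for "TensorFlow".

A TensorFlow Dataset is a function that outputs a sequence of values. Those values can be simple arrays (called Tensors) or arrays organized into a structure (for example, arrays organized in a dictionary).

The following example shows training and inference (using predict) on a toy dataset: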

# Creating a training dataset in Pandas
pd_train_dataset = pd.DataFrame({
    "feature_1": np.random.rand(1000),
    "feature_2": np.random.rand(1000),
})
pd_train_dataset["label"] = pd_train_dataset["feature_1"] > pd_train_dataset["feature_2"] 

pd_train_dataset

# Creating a serving dataset with Pandas
pd_serving_dataset = pd.DataFrame({
    "feature_1": np.random.rand(500),
    "feature_2": np.random.rand(500),
})

pd_serving_dataset

Let's convert the Pandas DataFrames into TensorFlow datasets:

tf_train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_train_dataset, label="label")
tf_serving_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_serving_dataset)

We can now train a model on tf_train_dataset:

model = tfdf.keras.RandomForestModel(verbose=0)
model.fit(tf_train_dataset)

[INFO 24-04-20 11:14:55.1176 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpdosbv775/model/ with prefix 951a85e27c8d4048
[INFO 24-04-20 11:14:55.1550 UTC decision_forest.cc:734] Model loaded with 300 root(s), 12674 node(s), and 2 input feature(s).
[INFO 24-04-20 11:14:55.1551 UTC abstract_model.cc:1344] Engine "RandomForestOptPred" built
[INFO 24-04-20 11:14:55.1551 UTC kernel.cc:1061] Use fast generic engine
<tf_keras.src.callbacks.History at 0x7f96c017a7f0>

And then generate predictions on tf_serving_dataset:

# Print the first 10 predictions.
model.predict(tf_serving_dataset, verbose=0)[:10]

array([[0.57999957],
       [0.13666661],
       [0.68666613],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.00333333]], dtype=float32)
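The model above was trained on a boolean label, so each row is assumed here to be the predicted probability of the positive class (label True). As a minimal sketch, the probabilities can be turned into class decisions by thresholding. The values are copied by hand from the output above, and the 0.5 threshold is an illustrative choice, not a TF-DF setting.

```python
import numpy as np

# A few of the predictions printed above, copied by hand for illustration.
probabilities = np.array(
    [[0.57999957], [0.13666661], [0.68666613], [0.0], [0.00333333]],
    dtype=np.float32,
)

# Assumed decision threshold; adjust it to trade precision against recall.
threshold = 0.5

# Drop the trailing dimension of size 1, then compare against the threshold.
predicted_classes = (probabilities[:, 0] >= threshold).astype(int)
print(predicted_classes)  # [1 0 1 0 0]
```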

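`model.predict(...)` and manual TF datasets

The previous section showed how to create a TF dataset with the pd_dataframe_to_tf_dataset function. This option is simple but poorly suited to large datasets. Instead, TensorFlow offers several other options for creating a TensorFlow dataset. The next example shows how to create a dataset with the tf.data.Dataset.from_tensor_slices() function.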

dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4,5])

for value in dataset:
  print("value:", value.numpy())

value: 1
value: 2
value: 3
value: 4
value: 5
2024-04-20 11:14:59.117255: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

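TensorFlow models are trained with mini-batching: instead of being fed one at a time, examples are grouped into "batches". For neural networks, the batch size affects the quality of the model, and the optimal value has to be determined by the user during training. For Decision Forests, the batch size has no impact on the model. However, for compatibility reasons, TensorFlow Decision Forests expects the dataset to be batched. Batching is done with the batch() function.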

dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4,5]).batch(2)

for value in dataset:
  print("value:", value.numpy())

value: [1 2]
value: [3 4]
value: [5]
2024-04-20 11:14:59.134734: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
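The output above contains three batches, the last of which holds a single value. A small sketch of the arithmetic, assuming the default behavior of batch() where the final partial batch is kept (drop_remainder=False); the helper names are hypothetical:

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def last_batch_size(num_examples, batch_size):
    """Size of the final batch, which may be smaller than batch_size."""
    remainder = num_examples % batch_size
    return remainder if remainder else batch_size


print(num_batches(5, 2))      # 3
print(last_batch_size(5, 2))  # 1
```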

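TensorFlow Decision Forests expects the dataset to have one of the following two structures:

features, label
features, label, weights

The features can be either a single two-dimensional array (where each column is a feature and each row is an example) or a dictionary of arrays.

The following are examples of datasets compatible with TensorFlow Decision Forests: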

# A dataset with a single 2d array.
tf_dataset = tf.data.Dataset.from_tensor_slices(
    ([[1,2],[3,4],[5,6]], # Features
    [0,1,0], # Label
    )).batch(2)

for features, label in tf_dataset:
  print("features:", features)
  print("label:", label)

features: tf.Tensor(
[[1 2]
 [3 4]], shape=(2, 2), dtype=int32)
label: tf.Tensor([0 1], shape=(2,), dtype=int32)
features: tf.Tensor([[5 6]], shape=(1, 2), dtype=int32)
label: tf.Tensor([0], shape=(1,), dtype=int32)
2024-04-20 11:14:59.152655: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

# A dataset with a dictionary of features.
tf_dataset = tf.data.Dataset.from_tensor_slices(
    ({
    "feature_1": [1,2,3],
    "feature_2": [4,5,6],
    },
    [0,1,0], # Label
    )).batch(2)

for features, label in tf_dataset:
  print("features:", features)
  print("label:", label)

features: {'feature_1': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>, 'feature_2': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([4, 5], dtype=int32)>}
label: tf.Tensor([0 1], shape=(2,), dtype=int32)
features: {'feature_1': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([3], dtype=int32)>, 'feature_2': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([6], dtype=int32)>}
label: tf.Tensor([0], shape=(1,), dtype=int32)
2024-04-20 11:14:59.171912: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
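The two examples above use the (features, label) structure. The three-element structure (features, label, weights) can be built the same way by adding a third array to the tuple. This is a sketch only: the weight values are arbitrary and chosen for illustration.

```python
import numpy as np
import tensorflow as tf

features = {
    "feature_1": np.array([1, 2, 3]),
    "feature_2": np.array([4, 5, 6]),
}
labels = np.array([0, 1, 0])
# Arbitrary illustrative weights, one per example.
weights = np.array([1.0, 2.0, 1.0], dtype=np.float32)

# A dataset with a dictionary of features, a label and a weight per example.
weighted_dataset = tf.data.Dataset.from_tensor_slices(
    (features, labels, weights)
).batch(2)

for batch_features, batch_label, batch_weight in weighted_dataset:
    print("features:", batch_features)
    print("label:", batch_label)
    print("weight:", batch_weight)
```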

Let's train a model with this second option (a dictionary of features).

tf_dataset = tf.data.Dataset.from_tensor_slices(
    ({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    },
    np.random.rand(100) >= 0.5, # Label
    )).batch(2)

model = tfdf.keras.RandomForestModel(verbose=0)
model.fit(tf_dataset)

[INFO 24-04-20 11:14:59.3750 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp208me4tj/model/ with prefix b7fe9aaae54944c5
[INFO 24-04-20 11:14:59.3979 UTC decision_forest.cc:734] Model loaded with 300 root(s), 7574 node(s), and 2 input feature(s).
[INFO 24-04-20 11:14:59.3979 UTC kernel.cc:1061] Use fast generic engine
<tf_keras.src.callbacks.History at 0x7f968c13e8e0>

The predict function can be used directly on the training dataset:

# The first 10 predictions.
model.predict(tf_dataset, verbose=0)[:10]

array([[0.9366659 ],
       [0.42999968],
       [0.9266659 ],
       [0.31999978],
       [0.70999944],
       [0.2133332 ],
       [0.13333328],
       [0.836666  ],
       [0.10666663],
       [0.53333294]], dtype=float32)

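`model.predict(...)` and `model.predict_on_batch()` on dictionaries

In some cases, the predict function can be used with an array (or a dictionary of arrays) instead of a TensorFlow Dataset.

The following example uses the previously trained model with a dictionary of NumPy arrays.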

# The first 10 predictions.
model.predict({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    }, verbose=0)[:10]

array([[0.5366663 ],
       [0.19666655],
       [0.2233332 ],
       [0.99999917],
       [0.3233331 ],
       [0.3866664 ],
       [0.71999943],
       [0.40666637],
       [0.73333275],
       [0.10999996]], dtype=float32)

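In the previous example, the arrays are automatically batched. Alternatively, the predict_on_batch function can be used to make sure that all the examples are run in the same batch.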

# The first 10 predictions.
model.predict_on_batch({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    })[:10]

array([[0.3433331 ],
       [0.42333302],
       [0.9466659 ],
       [0.38333306],
       [0.21666653],
       [0.10999996],
       [0.09333331],
       [0.23999985],
       [0.13999994],
       [0.36999974]], dtype=float32)

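Inference with the YDF format

This example shows how to run a TF-DF model with the CLI API (one of the other Serving APIs). We will also use the benchmark tool to measure the model's inference speed.

Let's start by training and saving a model: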

model = tfdf.keras.GradientBoostedTreesModel(verbose=0)
model.fit(tfdf.keras.pd_dataframe_to_tf_dataset(pd_train_dataset, label="label"))
model.save("my_model")

[WARNING 24-04-20 11:15:00.0298 UTC gradient_boosted_trees.cc:1840] "goss_alpha" set but "sampling_method" not equal to "GOSS".
[WARNING 24-04-20 11:15:00.0299 UTC gradient_boosted_trees.cc:1851] "goss_beta" set but "sampling_method" not equal to "GOSS".
[WARNING 24-04-20 11:15:00.0299 UTC gradient_boosted_trees.cc:1865] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
[INFO 24-04-20 11:15:00.4645 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp_gpxt9u3/model/ with prefix 307d0dfd7bcd4058
[INFO 24-04-20 11:15:00.4725 UTC quick_scorer_extended.cc:911] The binary was compiled without AVX2 support, but your CPU supports it. Enable it for faster model inference.
[INFO 24-04-20 11:15:00.4729 UTC kernel.cc:1061] Use fast generic engine
INFO:tensorflow:Assets written to: my_model/assets
INFO:tensorflow:Assets written to: my_model/assets

Let's also export the dataset to a CSV file:

pd_serving_dataset.to_csv("dataset.csv")

Let's download and extract the Yggdrasil Decision Forests CLI tools.

wget https://github.com/google/yggdrasil-decision-forests/releases/download/1.0.0/cli_linux.zip
unzip cli_linux.zip

--2024-04-20 11:15:01--  https://github.com/google/yggdrasil-decision-forests/releases/download/1.0.0/cli_linux.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/360444739/bfcd0b9d-5cbc-42a8-be0a-02131875f9a6?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240420T111501Z&X-Amz-Expires=300&X-Amz-Signature=01381b3c5a69d831a4be54e2fef635b848ca9b5aaeeac6822698c6acf5f93240&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=360444739&response-content-disposition=attachment%3B%20filename%3Dcli_linux.zip&response-content-type=application%2Foctet-stream [following]
--2024-04-20 11:15:01--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/360444739/bfcd0b9d-5cbc-42a8-be0a-02131875f9a6?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240420T111501Z&X-Amz-Expires=300&X-Amz-Signature=01381b3c5a69d831a4be54e2fef635b848ca9b5aaeeac6822698c6acf5f93240&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=360444739&response-content-disposition=attachment%3B%20filename%3Dcli_linux.zip&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31516027 (30M) [application/octet-stream]
Saving to: ‘cli_linux.zip’

cli_linux.zip       100%[===================>]  30.06M   174MB/s    in 0.2s    

2024-04-20 11:15:01 (174 MB/s) - ‘cli_linux.zip’ saved [31516027/31516027]

Archive:  cli_linux.zip
  inflating: README                  
  inflating: cli.txt                 
  inflating: train                   
  inflating: show_model              
  inflating: show_dataspec           
  inflating: predict                 
  inflating: infer_dataspec          
  inflating: evaluate                
  inflating: convert_dataset         
  inflating: benchmark_inference     
  inflating: edit_model              
  inflating: synthetic_dataset       
  inflating: grpc_worker_main        
  inflating: LICENSE                 
  inflating: CHANGELOG.md

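Finally, let's make predictions:

Remarks

TensorFlow Decision Forests (TF-DF) is based on the Yggdrasil Decision Forests (YDF) library, and a TF-DF model always contains a YDF model internally. When a TF-DF model is saved to disk, the model directory contains an assets sub-directory holding the YDF model. This YDF model can be used with all YDF tools. The next example uses the predict and benchmark_inference tools. See the model format documentation for more details.
YDF tools assume that the type of the dataset is specified with a prefix, e.g. csv:. See the YDF user manual for more details.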

./predict --model=my_model/assets --dataset=csv:dataset.csv --output=csv:predictions.csv

[INFO abstract_model.cc:1296] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO predict.cc:133] Run predictions with semi-fast engine
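When the same command has to be issued from a Python script rather than a notebook cell, it can be assembled as an argument list. The sketch below only reuses the flags shown in the command above. The helper name `ydf_predict_command` is hypothetical, and the "csv:" prefix is hard-coded on the assumption that both files are CSV.

```python
import shlex


def ydf_predict_command(model_dir, dataset_csv, output_csv, binary="./predict"):
    """Build the argument list for the YDF predict tool with CSV input and output."""
    return [
        binary,
        f"--model={model_dir}",
        f"--dataset=csv:{dataset_csv}",  # typed path: format prefix, then file
        f"--output=csv:{output_csv}",
    ]


command = ydf_predict_command("my_model/assets", "dataset.csv", "predictions.csv")
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting issues if a path contains spaces.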

We can now look at the predictions:

pd.read_csv("predictions.csv")

The inference speed of the model can be measured with the benchmark_inference tool.

# Create the empty label column.
pd_serving_dataset["__LABEL"] = 0
pd_serving_dataset.to_csv("dataset.csv")

!./benchmark_inference \
  --model=my_model/assets \
  --dataset=csv:dataset.csv \
  --batch_size=100 \
  --warmup_runs=10 \
  --num_runs=50

[INFO benchmark_inference.cc:245] Loading model
[INFO benchmark_inference.cc:248] The model is of type: GRADIENT_BOOSTED_TREES
[INFO benchmark_inference.cc:250] Loading dataset
[INFO benchmark_inference.cc:259] Found 3 compatible fast engines.
[INFO benchmark_inference.cc:262] Running GradientBoostedTreesGeneric
[INFO decision_forest.cc:639] Model loaded with 49 root(s), 2661 node(s), and 2 input feature(s).
[INFO benchmark_inference.cc:262] Running GradientBoostedTreesQuickScorerExtended
[INFO benchmark_inference.cc:262] Running GradientBoostedTreesOptPred
[INFO decision_forest.cc:639] Model loaded with 49 root(s), 2661 node(s), and 2 input feature(s).
[INFO benchmark_inference.cc:268] Running the slow generic engine
batch_size : 100  num_runs : 50
time/example(us)  time/batch(us)  method
----------------------------------------
         0.44275          44.275  GradientBoostedTreesQuickScorerExtended [virtual interface]
         0.79825          79.825  GradientBoostedTreesOptPred [virtual interface]
           1.877           187.7  GradientBoostedTreesGeneric [virtual interface]
          4.4463          444.62  Generic slow engine
----------------------------------------

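This benchmark shows the inference speed of the different inference engines. For example, "time/example(us) = 0.44275" (the value can change between runs) means that inference on one example takes about 0.44 microseconds. In other words, the model can be run roughly 2.3 million times per second.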
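The conversion from time per example to throughput is a single division. The sketch below applies it to the timings in the table above, copied by hand; these numbers vary from run to run, and the helper name is hypothetical.

```python
# time/example in microseconds, copied from the benchmark table above.
timings_us = {
    "GradientBoostedTreesQuickScorerExtended": 0.44275,
    "GradientBoostedTreesOptPred": 0.79825,
    "GradientBoostedTreesGeneric": 1.877,
    "Generic slow engine": 4.4463,
}


def examples_per_second(time_per_example_us):
    """Convert a per-example latency in microseconds to examples per second."""
    return 1_000_000 / time_per_example_us


for engine, time_us in timings_us.items():
    millions = examples_per_second(time_us) / 1e6
    print(f"{engine}: {millions:.2f} million examples/second")
```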