Private Heavy Hitters

本教學課程示範如何使用 tff.analytics.heavy_hitters.iblt.build_iblt_computation API 來建構 Federated Analytics 運算，以探索母體中最常出現的字串 (Private Heavy Hitters)。

環境設定

請執行以下步驟，確保您的環境設定正確。如果沒有看到歡迎訊息，請參閱安裝指南以取得操作說明。

# tensorflow_federated_nightly also bring in tf_nightly, which
# can causes a duplicate tensorboard install, leading to errors.
pip install --quiet tensorflow-text-nightly
pip install --quiet --upgrade tensorflow-federated

import collections

import numpy as np
import tensorflow as tf
import tensorflow_federated as tff
import tensorflow_text as tf_text

np.random.seed(0)
tff.backends.test.set_sync_test_cpp_execution_context()

tff.federated_computation(lambda: 'Hello, World!')()

b'Hello, World!'

背景：Federated Analytics 中的 Private Heavy Hitters

考量以下設定：每個用戶端都有字串清單，且每個字串都來自開放式集合，這表示可能是任意字串。目標是以私密方式在 Federated 設定中探索最熱門的字串 (heavy hitters) 及其計數。這個 colab 示範了以下列隱私權屬性解決這個問題的解決方案

安全彙總：計算彙總的字串計數，讓伺服器無法得知任何用戶端的個別值。如需詳細資訊，請參閱 tff.federated_secure_sum。
差分隱私 (DP)：一種廣泛使用的方法，用於限制和量化分析中敏感資料的隱私外洩。您可以將使用者層級的中央 DP 套用至 heavy hitter 結果。

安全彙總 API tff.federated_secure_sum 支援整數向量的線性總和。如果字串來自大小為 n 的封閉式集合，則很容易將每個用戶端的字串編碼為大小為 n 的向量：讓向量索引 i 的值為封閉式集合中第 i^個字串的計數。然後，您可以安全地將所有用戶端的向量加總，以取得整個母體的字串計數。不過，如果字串來自開放式集合，則如何正確編碼以進行安全總和並不顯而易見。在這項工作中，您可以將字串編碼為可逆布隆查詢表 (IBLT)，這是一種機率資料結構，能夠有效率地編碼大型 (或開放式) 網域中的項目。IBLT 草圖可以線性加總，因此與安全總和相容。

您可以使用 tff.analytics.heavy_hitters.iblt.build_iblt_computation 建立 TFF 運算，將每個用戶端的本機字串編碼為 IBLT 結構。這些結構會透過密碼編譯安全多方運算協定安全地加總成伺服器可以解碼的彙總 IBLT 結構。然後，伺服器可以傳回前幾個 heavy hitters。以下章節說明如何使用這個 API 建立 TFF 運算，並使用莎士比亞資料集執行模擬。

載入和預先處理 Federated 莎士比亞資料

莎士比亞資料集包含莎士比亞戲劇角色的台詞。在這個範例中，選取了部分角色 (也就是用戶端)。預先處理器會將每個角色的台詞轉換為字串清單，並捨棄任何僅包含標點符號或符號的字串。

# Load the simulation data.
source, _ = tff.simulation.datasets.shakespeare.load_data()

# Preprocessing function to tokenize a line into words.
def tokenize(ds):
  """Tokenizes a line into words with alphanum characters."""
  def extract_strings(example):
    return tf.expand_dims(example['snippets'], 0)

  def tokenize_line(line):
    return tf.data.Dataset.from_tensor_slices(tokenizer.tokenize(line)[0])

  def mask_all_symbolic_words(word):
    return tf.math.logical_not(
        tf_text.wordshape(word, tf_text.WordShape.IS_PUNCT_OR_SYMBOL))

  tokenizer = tf_text.WhitespaceTokenizer()
  ds = ds.map(extract_strings)
  ds = ds.flat_map(tokenize_line)
  ds = ds.map(tf_text.case_fold_utf8)
  ds = ds.filter(mask_all_symbolic_words)
  return ds

batch_size = 5

def client_data(n: int) -> tf.data.Dataset:
  return tokenize(source.create_tf_dataset_for_client(
      source.client_ids[n])).batch(batch_size)

# Pick a subset of client devices to participate in the computation.
dataset = [client_data(n) for n in range(10)]

模擬

若要執行模擬以探索莎士比亞資料集中最熱門的字詞 (heavy hitters)，首先您需要使用 tff.analytics.heavy_hitters.iblt.build_iblt_computation API 和下列參數建立 TFF 運算

capacity：IBLT 草圖的容量。這個數字應約略等於一輪運算中可能出現的唯一字串總數。預設值為 1000。如果這個數字太小，則解碼可能會因雜湊值衝突而失敗。如果這個數字太大，則會消耗超出必要的記憶體。
string_max_bytes：IBLT 中字串的最大長度。預設值為 10。必須為正數。長度超過 string_max_bytes 的字串將會被截斷。
max_words_per_user：每個用戶端允許貢獻的最大字串數。如果不是 None，則必須為正整數。預設值為 None，表示所有用戶端都會貢獻所有字串。
max_heavy_hitters：要傳回的最大項目數。如果解碼結果的項目數超過這個數字，則會依估計計數遞減排序，並傳回前 max_heavy_hitters 個項目。預設值為 None，表示傳回結果中的所有 heavy hitters。
secure_sum_bitwidth：用於安全總和的位元寬度。預設值為 None，這會停用安全總和。如果不是 None，則必須在 [1,62] 範圍內。請參閱 tff.federated_secure_sum。
multi_contribution：是否允許每個用戶端針對每個唯一字詞貢獻多個計數或僅貢獻一個計數。預設值為 True。當需要差分隱私時，這個引數可以提高效用。
batch_size：資料集中每個批次的元素數量。預設值為 1，表示輸入資料集會由 tf.data.Dataset.batch(1) 處理。必須為正整數。

max_words_per_user = 8
iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(
    capacity=100,
    string_max_bytes=20,
    max_words_per_user=max_words_per_user,
    max_heavy_hitters=10,
    secure_sum_bitwidth=32,
    multi_contribution=False,
    batch_size=batch_size)

現在您已準備好使用 TFF 運算 iblt_computation 和預先處理的輸入資料集執行模擬。 iblt_computation 的輸出具有四個屬性

clients：參與運算的用戶端純量數字。
heavy_hitters：彙總 heavy hitters 清單。
heavy_hitters_counts：彙總 heavy hitters 計數的清單。
num_not_decoded：未成功解碼的字串純量數字。

def run_simulation(one_round_computation: tff.Computation, dataset):
  output = one_round_computation(dataset)
  heavy_hitters = output.heavy_hitters
  heavy_hitters_counts = output.heavy_hitters_counts
  heavy_hitters = [word.decode('utf-8', 'ignore') for word in heavy_hitters]

  results = {}
  for index in range(len(heavy_hitters)):
    results[heavy_hitters[index]] = heavy_hitters_counts[index]
  return output.clients, dict(results)

clients, result = run_simulation(iblt_computation, dataset)
print(f'Number of clients participated: {clients}')
print('Discovered heavy hitters and counts:')
print(result)

Number of clients participated: 10
Discovered heavy hitters and counts:
{'to': 8, 'the': 8, 'and': 7, 'you': 4, 'i': 4, 'a': 3, 'he': 3, 'your': 3, 'is': 3, 'of': 2}

具有差分隱私的 Private Heavy Hitters

若要取得具有中央 DP 的 Private Heavy Hitters，則會將 DP 機制套用至開放式集合長條圖。概念是在彙總長條圖中為字串計數新增雜訊，然後僅保留計數高於特定門檻的字串。雜訊和門檻取決於 (epsilon, delta)-DP 預算，如需詳細演算法和證明，請參閱這份文件。雜訊計數會四捨五入為整數作為後處理步驟，這不會削弱 DP 保證。請注意，當需要 DP 時，您會發現較少的 heavy hitters。這是因為門檻步驟會篩選掉計數較低的字串。

iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(
    capacity=100,
    string_max_bytes=20,
    max_words_per_user=max_words_per_user,
    secure_sum_bitwidth=32,
    multi_contribution=False,
    batch_size=batch_size)

clients, result = run_simulation(iblt_computation, dataset)

# DP parameters
eps = 20
delta = 0.01

# Calculating scale for Laplace noise
scale = max_words_per_user / eps

# Calculating the threshold
tau = 1 + (max_words_per_user / eps) * np.log(max_words_per_user / (2 * delta))

result_with_dp = {}
for word in result:
  noised_count = result[word] + np.random.laplace(scale=scale)
  if noised_count >= tau:
    result_with_dp[word] = int(noised_count)
print(f'Discovered heavy hitters and counts with central DP:')
print(result_with_dp)

Discovered heavy hitters and counts with central DP:
{'the': 8, 'you': 4, 'to': 7, 'tear': 3, 'and': 7, 'i': 3}