![]() |
![]() |
![]() |
![]() |
總覽
本教學課程說明如何從 PostgreSQL 資料庫伺服器建立 tf.data.Dataset
,以便將建立的 Dataset
傳遞至 tf.keras
,以用於訓練或推論用途。
SQL 資料庫是資料科學家的重要資料來源。PostgreSQL 是最熱門的開放原始碼 SQL 資料庫之一,廣泛用於企業中,以儲存重要的交易資料。直接從 PostgreSQL 資料庫伺服器建立 Dataset
,並將 Dataset
傳遞至 tf.keras
進行訓練或推論,可大幅簡化資料管線,並協助資料科學家專注於建構機器學習模型。
設定與使用方式
安裝必要的 tensorflow-io 套件,並重新啟動執行階段
try:
%tensorflow_version 2.x
except Exception:
pass
!pip install -q tensorflow-io
安裝及設定 PostgreSQL (選用)
為了在 Google Colab 上示範使用方式,您將安裝 PostgreSQL 伺服器。也需要密碼和空的資料庫。
如果您不是在 Google Colab 上執行此筆記本,或您偏好使用現有的資料庫,請略過以下設定並繼續進行下一個章節。
# Install postgresql server
sudo apt-get -y -qq update
sudo apt-get -y -qq install postgresql
sudo service postgresql start
# Setup a password `postgres` for username `postgres`
sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
# Setup a database with name `tfio_demo` to be used
sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS tfio_demo;'
sudo -u postgres psql -U postgres -c 'CREATE DATABASE tfio_demo;'
Preconfiguring packages ... Selecting previously unselected package libpq5:amd64. (Reading database ... 254633 files and directories currently installed.) Preparing to unpack .../0-libpq5_10.15-0ubuntu0.18.04.1_amd64.deb ... Unpacking libpq5:amd64 (10.15-0ubuntu0.18.04.1) ... Selecting previously unselected package postgresql-client-common. Preparing to unpack .../1-postgresql-client-common_190ubuntu0.1_all.deb ... Unpacking postgresql-client-common (190ubuntu0.1) ... Selecting previously unselected package postgresql-client-10. Preparing to unpack .../2-postgresql-client-10_10.15-0ubuntu0.18.04.1_amd64.deb ... Unpacking postgresql-client-10 (10.15-0ubuntu0.18.04.1) ... Selecting previously unselected package ssl-cert. Preparing to unpack .../3-ssl-cert_1.0.39_all.deb ... Unpacking ssl-cert (1.0.39) ... Selecting previously unselected package postgresql-common. Preparing to unpack .../4-postgresql-common_190ubuntu0.1_all.deb ... Adding 'diversion of /usr/bin/pg_config to /usr/bin/pg_config.libpq-dev by postgresql-common' Unpacking postgresql-common (190ubuntu0.1) ... Selecting previously unselected package postgresql-10. Preparing to unpack .../5-postgresql-10_10.15-0ubuntu0.18.04.1_amd64.deb ... Unpacking postgresql-10 (10.15-0ubuntu0.18.04.1) ... Selecting previously unselected package postgresql. Preparing to unpack .../6-postgresql_10+190ubuntu0.1_all.deb ... Unpacking postgresql (10+190ubuntu0.1) ... Selecting previously unselected package sysstat. Preparing to unpack .../7-sysstat_11.6.1-1ubuntu0.1_amd64.deb ... Unpacking sysstat (11.6.1-1ubuntu0.1) ... Setting up sysstat (11.6.1-1ubuntu0.1) ... Creating config file /etc/default/sysstat with new version update-alternatives: using /usr/bin/sar.sysstat to provide /usr/bin/sar (sar) in auto mode Created symlink /etc/systemd/system/multi-user.target.wants/sysstat.service → /lib/systemd/system/sysstat.service. Setting up ssl-cert (1.0.39) ... Setting up libpq5:amd64 (10.15-0ubuntu0.18.04.1) ... Setting up postgresql-client-common (190ubuntu0.1) ... Setting up postgresql-common (190ubuntu0.1) ... Adding user postgres to group ssl-cert Creating config file /etc/postgresql-common/createcluster.conf with new version Building PostgreSQL dictionaries from installed myspell/hunspell packages... Removing obsolete dictionary files: Created symlink /etc/systemd/system/multi-user.target.wants/postgresql.service → /lib/systemd/system/postgresql.service. Setting up postgresql-client-10 (10.15-0ubuntu0.18.04.1) ... update-alternatives: using /usr/share/postgresql/10/man/man1/psql.1.gz to provide /usr/share/man/man1/psql.1.gz (psql.1.gz) in auto mode Setting up postgresql-10 (10.15-0ubuntu0.18.04.1) ... Creating new PostgreSQL cluster 10/main ... /usr/lib/postgresql/10/bin/initdb -D /var/lib/postgresql/10/main --auth-local peer --auth-host md5 The files belonging to this database system will be owned by user "postgres". This user must also own the server process. The database cluster will be initialized with locale "C.UTF-8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled. fixing permissions on existing directory /var/lib/postgresql/10/main ... ok creating subdirectories ... ok selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting default timezone ... Etc/UTC selecting dynamic shared memory implementation ... posix creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... ok syncing data to disk ... ok Success. You can now start the database server using: /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start Ver Cluster Port Status Owner Data directory Log file 10 main 5432 down postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log update-alternatives: using /usr/share/postgresql/10/man/man1/postmaster.1.gz to provide /usr/share/man/man1/postmaster.1.gz (postmaster.1.gz) in auto mode Setting up postgresql (10+190ubuntu0.1) ... Processing triggers for man-db (2.8.3-2ubuntu0.1) ... Processing triggers for ureadahead (0.100.0-21) ... Processing triggers for libc-bin (2.27-3ubuntu1.2) ... Processing triggers for systemd (237-3ubuntu10.38) ... ALTER ROLE NOTICE: database "tfio_demo" does not exist, skipping DROP DATABASE CREATE DATABASE
設定必要的環境變數
以下環境變數是根據上一節中的 PostgreSQL 設定。如果您有不同的設定,或您使用的是現有的資料庫,則應相應地變更這些變數
%env TFIO_DEMO_DATABASE_NAME=tfio_demo
%env TFIO_DEMO_DATABASE_HOST=localhost
%env TFIO_DEMO_DATABASE_PORT=5432
%env TFIO_DEMO_DATABASE_USER=postgres
%env TFIO_DEMO_DATABASE_PASS=postgres
env: TFIO_DEMO_DATABASE_NAME=tfio_demo env: TFIO_DEMO_DATABASE_HOST=localhost env: TFIO_DEMO_DATABASE_PORT=5432 env: TFIO_DEMO_DATABASE_USER=postgres env: TFIO_DEMO_DATABASE_PASS=postgres
在 PostgreSQL 伺服器中準備資料
為了示範目的,本教學課程將建立資料庫並在資料庫中填入一些資料。本教學課程中使用的資料來自 空氣品質資料集,可從 UCI 機器學習資料庫取得。
以下是空氣品質資料集子集的搶先預覽
Date|Time|CO(GT)|PT08.S1(CO)|NMHC(GT)|C6H6(GT)|PT08.S2(NMHC)|NOx(GT)|PT08.S3(NOx)|NO2(GT)|PT08.S4(NO2)|PT08.S5(O3)|T|RH|AH| ----|----|------|-----------|--------|--------|-------------|----|----------|-------|------------|-----------|-|--|--| 10/03/2004|18.00.00|2,6|1360|150|11,9|1046|166|1056|113|1692|1268|13,6|48,9|0,7578| 10/03/2004|19.00.00|2|1292|112|9,4|955|103|1174|92|1559|972|13,3|47,7|0,7255| 10/03/2004|20.00.00|2,2|1402|88|9,0|939|131|1140|114|1555|1074|11,9|54,0|0,7502| 10/03/2004|21.00.00|2,2|1376|80|9,2|948|172|1092|122|1584|1203|11,0|60,0|0,7867| 10/03/2004|22.00.00|1,6|1272|51|6,5|836|131|1205|116|1490|1110|11,2|59,6|0,7888|
關於空氣品質資料集和 UCI 機器學習資料庫的更多資訊,請參閱「參考資料」章節。
為了簡化資料準備,已準備好空氣品質資料集的 sql 版本,並以 AirQualityUCI.sql 形式提供。
建立表格的陳述式如下:
CREATE TABLE AirQualityUCI (
Date DATE,
Time TIME,
CO REAL,
PT08S1 INT,
NMHC REAL,
C6H6 REAL,
PT08S2 INT,
NOx REAL,
PT08S3 INT,
NO2 REAL,
PT08S4 INT,
PT08S5 INT,
T REAL,
RH REAL,
AH REAL
);
在資料庫中建立表格並填入資料的完整指令如下:
curl -s -OL https://github.com/tensorflow/io/raw/master/docs/tutorials/postgresql/AirQualityUCI.sql
PGPASSWORD=$TFIO_DEMO_DATABASE_PASS psql -q -h $TFIO_DEMO_DATABASE_HOST -p $TFIO_DEMO_DATABASE_PORT -U $TFIO_DEMO_DATABASE_USER -d $TFIO_DEMO_DATABASE_NAME -f AirQualityUCI.sql
從 PostgreSQL 伺服器建立 Dataset 並在 TensorFlow 中使用
從 PostgreSQL 伺服器建立 Dataset 非常簡單,只需使用 query
和 endpoint
引數呼叫 tfio.experimental.IODataset.from_sql
即可。query
是用於選取表格中欄位的 SQL 查詢,而 endpoint
引數是位址和資料庫名稱
import os
import tensorflow_io as tfio
endpoint="postgresql://{}:{}@{}?port={}&dbname={}".format(
os.environ['TFIO_DEMO_DATABASE_USER'],
os.environ['TFIO_DEMO_DATABASE_PASS'],
os.environ['TFIO_DEMO_DATABASE_HOST'],
os.environ['TFIO_DEMO_DATABASE_PORT'],
os.environ['TFIO_DEMO_DATABASE_NAME'],
)
dataset = tfio.experimental.IODataset.from_sql(
query="SELECT co, pt08s1 FROM AirQualityUCI;",
endpoint=endpoint)
print(dataset.element_spec)
{'co': TensorSpec(shape=(), dtype=tf.float32, name=None), 'pt08s1': TensorSpec(shape=(), dtype=tf.int32, name=None)}
如您從上面的 dataset.element_spec
輸出中看到的,建立的 Dataset
元素是 Python 字典物件,其中資料庫表格的欄位名稱作為鍵。這對於套用進一步的運算非常方便。例如,您可以選取 Dataset
的 nox
和 no2
欄位,並計算差異
dataset = tfio.experimental.IODataset.from_sql(
query="SELECT nox, no2 FROM AirQualityUCI;",
endpoint=endpoint)
dataset = dataset.map(lambda e: (e['nox'] - e['no2']))
# check only the first 20 record
dataset = dataset.take(20)
print("NOx - NO2:")
for difference in dataset:
print(difference.numpy())
NOx - NO2: 53.0 11.0 17.0 50.0 15.0 -7.0 -15.0 -14.0 -15.0 0.0 -13.0 -12.0 -14.0 16.0 62.0 28.0 14.0 3.0 9.0 34.0
現在,建立的 Dataset
已準備好直接傳遞至 tf.keras
,以用於訓練或推論用途。
參考資料
- Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- S. De Vito, E. Massera, M. Piga, L. Martinotto, G. Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750-757, ISSN 0925-4005