1. Introduction to pg_bulkload
PostgreSQL ships with the COPY command as a convenient data-loading tool: COPY originated in PostgreSQL and supports loading data from files into tables as well as unloading table data to files. pg_bulkload is a high-speed data-loading tool for PostgreSQL. Compared with COPY, its biggest advantage is speed, which it gains by skipping the shared buffer and WAL buffer and writing to data files directly; pg_bulkload's DIRECT mode implements exactly this idea. Because of this, pg_bulkload also includes a recovery feature: if a load fails, recovery must be run.
2. pg_bulkload Architecture
pg_bulkload consists of two main modules: a reader and a writer. The reader reads the input file and parses it into tuples; the writer writes the parsed tuples to the output target. The earliest versions of pg_bulkload did nothing but load data; version 3.1 added data-filtering functionality.
3. Installing pg_bulkload
The tool is not bundled with PostgreSQL; it must be downloaded and installed separately:
[root@Postgres201 ~]# unzip pg_bulkload-VERSION3_1_10.zip
[root@Postgres201 ~]# cd pg_bulkload-VERSION3_1_10
[root@Postgres201 pg_bulkload-VERSION3_1_10]# make
[root@Postgres201 pg_bulkload-VERSION3_1_10]# make install
Once installed, create the extension before using it:
[postgres@Postgres201 ~]$ psql lottu lottu
psql (9.6.0)
Type "help" for help.
lottu=# create extension pg_bulkload;
CREATE EXTENSION
4. pg_bulkload Parameters and the Control File
[postgres@Postgres201 ~]$ pg_bulkload --help
pg_bulkload is a bulk data loading tool for PostgreSQL
Usage:
Dataload: pg_bulkload [dataload options] control_file_path
Recovery: pg_bulkload -r [-D DATADIR]
Dataload options:
-i, --input=INPUT INPUT path or function
-O, --output=OUTPUT OUTPUT path or table
-l, --logfile=LOGFILE LOGFILE path
-P, --parse-badfile=* PARSE_BADFILE path
-u, --duplicate-badfile=* DUPLICATE_BADFILE path
-o, --option="key=val" additional option
Recovery options:
-r, --recovery execute recovery
-D, --pgdata=DATADIR database directory
Connection options:
-d, --dbname=DBNAME database to connect
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port
-U, --username=USERNAME user name to connect as
-w, --no-password never prompt for password
-W, --password force password prompt
Generic options:
-e, --echo echo queries
-E, --elevel=LEVEL set output message level
--help show this help, then exit
--version output version information, then exit
Besides passing options on the command line, parameters can also be set in a control file. The control-file parameters are described below.
- TYPE=CSV|BINARY|FIXED|FUNCTION: format of the input data; the default is CSV.
CSV: load data from a CSV-formatted text file.
BINARY|FIXED: load data from a fixed-format binary file.
FUNCTION: load data from the output of a function.
- INPUT|INFILE=path|stdin|function_name: the data source (required); its meaning depends on TYPE.
path: a file path, which may be relative; the PostgreSQL server must have permission to read the file.
stdin: pg_bulkload reads data from standard input.
function_name: an SQL function whose result set supplies the rows to insert; it may be a built-in or a user-defined function.
- WRITER=DIRECT|PARALLEL|BUFFERED|BINARY: how data is written; the default is DIRECT.
DIRECT: write data straight into the table files, bypassing shared memory and skipping WAL; a recovery function must be available.
BUFFERED: write data through shared memory with WAL logging, relying on PostgreSQL's normal recovery mechanism.
PARALLEL: parallel-processing mode, faster than DIRECT.
BINARY: convert the input data to binary and then load it.
- OUTPUT|TABLE=table_name|outfile: the output target, i.e. where the data goes.
table: load the data into a database table.
file: a file path; write the data to that file.
- LOGFILE=path: path of the log file that records status during the run.
- MULTI_PROCESS=YES|NO: when set, run in multi-threaded mode and load data in parallel; when unset, run single-threaded, which is the default.
- SKIP|OFFSET=n: number of leading rows to skip; the default is 0. Cannot be combined with "TYPE=FUNCTION".
- LIMIT|LOAD=n: maximum number of rows to load; the default is INFINITE, i.e. load all data. This option can be combined with "TYPE=FUNCTION".
- ON_DUPLICATE_KEEP=NEW|OLD: when the table has a unique constraint, whether to keep the new record or the existing one.
- PARSE_BADFILE=path: file that records all records that failed to load.
- TRUNCATE=YES|NO: whether to truncate all existing rows in the target table first.
- DELIMITER=delimiter_character: field delimiter of the input file.
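Putting a few of these parameters together, a minimal control file might look like the following sketch (all paths and the table name here are placeholder assumptions, not values from the examples below):

```shell
# Write a minimal pg_bulkload control file; paths and table name are hypothetical.
cat > /tmp/minimal.ctl <<'EOF'
TYPE = CSV
INPUT = /tmp/data.csv
OUTPUT = public.target_table
DELIMITER = |
WRITER = DIRECT
TRUNCATE = YES
LOGFILE = /tmp/minimal_load.log
PARSE_BADFILE = /tmp/minimal_bad.txt
EOF
cat /tmp/minimal.ctl
```

A file like this would then be passed as the control_file_path argument shown in the usage text above.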
5. Using pg_bulkload
Create a test table tbl_lottu and a test file tbl_lottu_output.txt:
[postgres@Postgres201 ~]$ psql lottu lottu
psql (9.6.0)
Type "help" for help.
lottu=# create table tbl_lottu(id int,name text);
CREATE TABLE
[postgres@Postgres201 ~]$ seq 100000| awk '{print $0"|lottu"}' > tbl_lottu_output.txt
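Before loading, the generated file can be sanity-checked: each line should be a sequential id and the literal text lottu separated by "|". A quick check (written against /tmp rather than /home/postgres):

```shell
# Regenerate the test data with the same pipeline and inspect it.
seq 100000 | awk '{print $0"|lottu"}' > /tmp/tbl_lottu_output.txt
head -3 /tmp/tbl_lottu_output.txt   # prints: 1|lottu  2|lottu  3|lottu
wc -l < /tmp/tbl_lottu_output.txt   # prints: 100000
```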
- Loading with command-line options, without a control file
[postgres@Postgres201 ~]$ pg_bulkload -i /home/postgres/tbl_lottu_output.txt -O tbl_lottu -l /home/postgres/tbl_lottu_output.log -P /home/postgres/tbl_lottu_bad.txt -o "TYPE=CSV" -o "DELIMITER=|" -d lottu -U lottu
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
0 Rows skipped.
100000 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
[postgres@Postgres201 ~]$ cat tbl_lottu_output.log
pg_bulkload 3.1.9 on 2018-07-12 13:37:18.326685+08
INPUT = /home/postgres/tbl_lottu_output.txt
PARSE_BADFILE = /home/postgres/tbl_lottu_bad.txt
LOGFILE = /home/postgres/tbl_lottu_output.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = |
QUOTE = "\""
ESCAPE = "\""
NULL =
OUTPUT = lottu.tbl_lottu
MULTI_PROCESS = NO
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /data/postgres/data/pg_bulkload/20180712133718_lottu_lottu_tbl_lottu.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = NO
0 Rows skipped.
100000 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
Run began on 2018-07-12 13:37:18.326685+08
Run ended on 2018-07-12 13:37:18.594494+08
CPU 0.14s/0.07u sec elapsed 0.27 sec
- Truncating the table before loading
[postgres@Postgres201 ~]$ pg_bulkload -i /home/postgres/tbl_lottu_output.txt -O tbl_lottu -l /home/postgres/tbl_lottu_output.log -P /home/postgres/tbl_lottu_bad.txt -o "TYPE=CSV" -o "DELIMITER=|" -o "TRUNCATE=YES" -d lottu -U lottu
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
0 Rows skipped.
100000 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
[postgres@Postgres201 ~]$ psql lottu lottu -c "select count(1) from tbl_lottu;"
count
--------
100000
(1 row)
- Loading with a control file
Create a control file lottu.ctl:
INPUT = /home/postgres/lotu01
PARSE_BADFILE = /home/postgres/tbl_lottu_bad.txt
LOGFILE = /home/postgres/tbl_lottu_output.log
LIMIT = INFINITE
PARSE_ERRORS = 0
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 5
DELIMITER = |
QUOTE = "\""
ESCAPE = "\""
OUTPUT = lottu.tbl_lottu
MULTI_PROCESS = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /home/postgres/tbl_lottu.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = YES
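The control file above reads /home/postgres/lotu01, whose contents are not shown. Judging from the run below (5 rows skipped, 95 loaded), it is a 100-line file in the same id|name format; a matching sample can be generated the same way as before. A sketch, writing to /tmp instead of /home/postgres:

```shell
# Generate a 100-row sample input in the same id|name format (path is illustrative).
seq 100 | awk '{print $0"|lottu"}' > /tmp/lotu01
wc -l < /tmp/lotu01              # 100 lines in total
tail -n +6 /tmp/lotu01 | wc -l   # 95 lines remain once SKIP = 5 drops the first 5
```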
Run the load using the control file:
[postgres@Postgres201 ~]$ pg_bulkload /home/postgres/lottu.ctl -d lottu -U lottu
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
5 Rows skipped.
95 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
6. Summary
pg_bulkload is a high-speed data-loading tool for PostgreSQL. Compared with COPY, its biggest advantage is speed, gained by skipping the shared buffer and WAL buffer and writing to data files directly; pg_bulkload's DIRECT mode implements exactly this idea. Its main shortcoming is that the column order of the table must match the column order of the input file; hopefully a future version will remove this limitation.