Bulk importing data
Using the Elasticsearch Bulk API /_bulk
Bulk update
Steps:
Goal: bulk import a list of nouns of type movie into the wordbank index.
-
Prepare the data:
According to the official documentation, the JSON data must be laid out in this format:
action_and_meta_data\n optional_source\n action_and_meta_data\n optional_source\n .... action_and_meta_data\n optional_source\n
Here action must be one of index, create, delete, and update. Next, prepare data like this, with each JSON object on its own line:
{"index": {"_index": "wordbank", "_type": "movie", "_id": 1}}
{"doc": {"name": "權力的游戲"}}
{"index": {"_index": "wordbank", "_type": "movie", "_id": 2}}
{"doc": {"name": "熊出沒"}}
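The file needs exactly one JSON object per line, and the last line must also end with a newline. A minimal shell sketch for generating such a file, assuming the file name movie_names that the curl command in this post reads and the two sample entries above; the loop itself is only an illustration and does no JSON escaping:

```shell
#!/bin/sh
# Write one action line and one source line per movie into movie_names.
# Assumption: the names contain no double quotes or backslashes, since
# this sketch does not escape them for JSON.
out=movie_names
: > "$out"   # create the file or truncate an old copy

id=1
for name in "權力的游戲" "熊出沒"; do
  # Every printf ends its line with \n, so the file ends with a newline too.
  printf '{"index": {"_index": "wordbank", "_type": "movie", "_id": %d}}\n' "$id" >> "$out"
  printf '{"doc": {"name": "%s"}}\n' "$name" >> "$out"
  id=$((id + 1))
done

wc -l < "$out"   # two action lines plus two source lines
```

For a real word list, a JSON-aware tool would be a safer way to build the source lines than printf.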
-
POST bulk
curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/json' --data-binary @movie_names
-
Bulk update succeeded
{"took":50,"errors":false,"items":[{"index":{"_index":"wordbank","_type":"movie","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1,"status":201}},{"index":{"_index":"wordbank","_type":"movie","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1,"status":201}}]}
Pitfalls I ran into:
-
illegal_argument_exception:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"The bulk request must be terminated by a newline [\n]"}],"type":"illegal_argument_exception","reason":"The bulk request must be terminated by a newline [\n]"},"status":400}
- Cause: the JSON file for a bulk import must end with \n, that is, the last line has to be terminated by a newline.
- Fix: add one more line break at the end of the JSON file.
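One way to guard against this before posting is to check the last byte of the file. A small sketch, where the function name ensure_trailing_newline and the file sample_bulk are made up for illustration:

```shell
#!/bin/sh
# Append a newline to a bulk file only if its last byte is not already one.
ensure_trailing_newline() {
  file=$1
  # Command substitution strips a trailing newline, so the result is empty
  # exactly when the last byte is \n (or when the file is empty).
  if [ -n "$(tail -c 1 "$file")" ]; then
    printf '\n' >> "$file"
  fi
}

# Demo on a file whose last line is not terminated.
printf '{"index": {}}\n{"name": "a"}' > sample_bulk
ensure_trailing_newline sample_bulk
ensure_trailing_newline sample_bulk   # second call leaves the file unchanged
```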
-
Header problem:
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}
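- Cause: since Elasticsearch 6.x, content-type checking is strict. A request with a body must declare a supported Content-Type, and curl sends application/x-www-form-urlencoded by default when it posts data.
- Fix: add this option to the curl command:
-H 'Content-Type: application/json'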
-
action_request_validation_exception:
{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: script or doc is missing;2: script or doc is missing;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: script or doc is missing;2: script or doc is missing;"},"status":400}
- Cause: in a bulk update, the fields to change must be nested under the "doc" key. Also, update here really is only an update: if the document does not exist, that action reports an error.
- Fix:
{ "field1" : "value1", "field2" : "value2" } --> { "doc" : { "field1" : "value1", "field2" : "value2" } }
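For the update action, each source line wraps the changed fields in "doc". A minimal sketch of such a file, reusing the wordbank and movie names from this post; the file name movie_updates and the choice of ids are hypothetical. Setting doc_as_upsert to true on a source line is one option for having Elasticsearch create a missing document instead of reporting an error for it:

```shell
#!/bin/sh
# Build a bulk file with two update actions.
# The first only updates; the second also creates the document if missing.
out=movie_updates
{
  printf '{"update": {"_index": "wordbank", "_type": "movie", "_id": 1}}\n'
  printf '{"doc": {"name": "權力的游戲"}}\n'
  printf '{"update": {"_index": "wordbank", "_type": "movie", "_id": 3}}\n'
  printf '{"doc": {"name": "熊出沒"}, "doc_as_upsert": true}\n'
} > "$out"
```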
-
Don't print the curl result straight to the terminal
Cause: curl returns the result as a single line of JSON, and with a large batch that line gets very long. Silent mode (-s) does not help, because -s is a curl option that only hides curl's own progress meter and error messages; the JSON body returned by Elasticsearch is still printed.
-
Fix: redirect the output to a file (or discard it)
curl -s 'http://example.com' > /dev/null
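Discarding the response also hides failures, so it can be worth saving it and checking the top-level errors flag. A rough sketch, assuming the compact single-line JSON shown in the success response earlier; the function name bulk_failed and the file bulk_response.json are made up, and the curl call is left as a comment because it needs a running node:

```shell
#!/bin/sh
# Succeeds (exit status 0) when a saved bulk response reports failed items.
# Assumes the compact form "errors":true without extra whitespace.
bulk_failed() {
  grep -q '"errors":true' "$1"
}

# Hypothetical call against a local node (not run here):
#   curl -s -X POST "localhost:9200/_bulk" -H 'Content-Type: application/json' \
#        --data-binary @movie_names > bulk_response.json

# Stand-in response so the check can be tried without a cluster.
printf '%s\n' '{"took":50,"errors":false,"items":[]}' > bulk_response.json

if bulk_failed bulk_response.json; then
  echo "some items failed, inspect bulk_response.json"
else
  echo "all items succeeded"
fi
```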
-
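Reportedly you should avoid repeating the index and type on every action line (source). My data set may simply be too small, about 20,000 entries, because I saw little difference. Leaving them out does keep the file smaller, though.
Recommended:
POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
rather than:
POST /_bulk
{ "index": { "_index": "website", "_type": "blog" }}
{ "title": "Overriding the default type" }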
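To get a feel for the size difference, the same entries can be written both ways and compared. A small sketch using the wordbank data from this post; the file names bulk_verbose and bulk_compact are hypothetical, and the compact file assumes the index and type go in the URL (for example localhost:9200/wordbank/movie/_bulk on Elasticsearch 6.x):

```shell
#!/bin/sh
# Build the same two entries twice: once with _index/_type repeated on every
# action line, once with only _id on the action line.
: > bulk_verbose
: > bulk_compact
id=1
for name in "權力的游戲" "熊出沒"; do
  printf '{"index": {"_index": "wordbank", "_type": "movie", "_id": %d}}\n' "$id" >> bulk_verbose
  printf '{"index": {"_id": %d}}\n' "$id" >> bulk_compact
  # The source line is identical in both files.
  printf '{"doc": {"name": "%s"}}\n' "$name" | tee -a bulk_verbose >> bulk_compact
  id=$((id + 1))
done

wc -c bulk_verbose bulk_compact   # byte counts of the two variants
```

The saving is a fixed number of bytes per action line, so it grows linearly with the number of entries.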