Explain why shard 0 of index app-log-2024.05.20.18 is unassigned:
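The explanation below comes from the cluster allocation explain API; a request of roughly this shape (naming the index and shard in question) returns it. The same call works for any other unassigned shard:

```
GET /_cluster/allocation/explain
{
  "index": "app-log-2024.05.20.18",
  "shard": 0,
  "primary": true
}
```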
{"index":"app-log-2024.05.20.18","shard":0,"primary":true,"current_state":"started","current_node":{"id":"lkdWWSzjS9iGFcvZaaVvrA","name":"elk-02.zgzf.com","transport_address":"172.19.70.11:9800","attributes":{"ml.machine_memory":"66854977536","ml.max_open_jobs":"20","xpack.installed":"true"},"weight_ranking":2},"can_remain_on_current_node":"yes","can_rebalance_cluster":"no","can_rebalance_cluster_decisions":[{"decider":"rebalance_only_when_active","decision":"NO","explanation":"rebalancing is not allowed until all replicas in the cluster are active"},{"decider":"cluster_rebalance","decision":"NO","explanation":"the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"}],"can_rebalance_to_other_node":"no","rebalance_explanation":"rebalancing is not allowed, even though there is at least one node on which the shard can be allocated","node_allocation_decisions":[{"node_id":"Fyk0dv4hSUaqHVMqsAqXdg","node_name":"elk-01.zgzf.com","transport_address":"172.19.70.10:9800","node_attributes":{"ml.machine_memory":"66854977536","ml.max_open_jobs":"20","xpack.installed":"true"},"node_decision":"yes","weight_ranking":1},{"node_id":"NwmszklQTqqAQ1p9xzYXZw","node_name":"elk-03.zgzf.com","transport_address":"172.19.70.12:9800","node_attributes":{"ml.machine_memory":"66854977536","ml.max_open_jobs":"20","xpack.installed":"true"},"node_decision":"worse_balance","weight_ranking":3}]}
Adjusting the cluster settings

If you are confident that all nodes are healthy and ready to accept shards, you can adjust the cluster settings to allow rebalancing. Temporarily modify the cluster.routing.allocation.allow_rebalance setting:
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.allow_rebalance": "always"
  }
}
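To confirm the override took effect, the setting can be read back (this step is optional; filter_path merely trims the response to the one setting of interest):

```
GET /_cluster/settings?filter_path=transient.cluster.routing.allocation.allow_rebalance
```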
Once rebalancing has completed, restore the default setting:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.allow_rebalance": "indices_all_active"
  }
}
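Alternatively, setting the transient value to null removes the override entirely and lets the cluster fall back to its default (indices_all_active), which avoids pinning an explicit value:

```
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.allow_rebalance": null
  }
}
```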
Explain why shard 1 of index app-log-2024.05.19.09 is unassigned:
{"index":"app-log-2024.05.19.09","shard":1,"primary":false,"current_state":"unassigned","unassigned_info":{"reason":"ALLOCATION_FAILED","at":"2024-06-03T13:48:36.982Z","failed_allocation_attempts":5,"details":"failed shard on node [Fyk0dv4hSUaqHVMqsAqXdg]: failed recovery, failure RecoveryFailedException[[app-log-2024.05.19.09][1]: Recovery failed from {elk-02.zgzf.com}{lkdWWSzjS9iGFcvZaaVvrA}{VcX4s-mXTt65du9F6NI5XA}{172.19.70.11}{172.19.70.11:9800}{dilm}{ml.machine_memory=66854977536, ml.max_open_jobs=20, xpack.installed=true} into {elk-01.zgzf.com}{Fyk0dv4hSUaqHVMqsAqXdg}{QvoljIiTSOaGvOvzXuObkw}{172.19.70.10}{172.19.70.10:9800}{dilm}{ml.machine_memory=66854977536, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-02.zgzf.com][172.19.70.11:9800][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elk-01.zgzf.com][172.19.70.10:9800][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to open reader on writer]; nested: FileSystemException[/zgapp/data/elasticsearch7/nodes/0/indices/T8UUt8CWRd6XZUyHJu8skA/1/index/_1bo_Lucene84_0.tim: Too many open files]; ","last_allocation_status":"no_attempt"},"can_allocate":"no","allocate_explanation":"cannot allocate because allocation is not permitted to any of the nodes","node_allocation_decisions":[{"node_id":"Fyk0dv4hSUaqHVMqsAqXdg","node_name":"elk-01.zgzf.com","transport_address":"172.19.70.10:9800","node_attributes":{"ml.machine_memory":"66854977536","ml.max_open_jobs":"20","xpack.installed":"true"},"node_decision":"no","deciders":[{"decider":"max_retry","decision":"NO","explanation":"shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-06-03T13:48:36.982Z], failed_attempts[5], 
failed_nodes[[Fyk0dv4hSUaqHVMqsAqXdg]], delayed=false, details[failed shard on node [Fyk0dv4hSUaqHVMqsAqXdg]: failed recovery, failure RecoveryFailedException[[app-log-2024.05.19.09][1]: Recovery failed from {elk-02.zgzf.com}{lkdWWSzjS9iGFcvZaaVvrA}{VcX4s-mXTt65du9F6NI5XA}{172.19.70.11}{172.19.70.11:9800}{dilm}{ml.machine_memory=66854977536, ml.max_open_jobs=20, xpack.installed=true} into {elk-01.zgzf.com}{Fyk0dv4hSUaqHVMqsAqXdg}{QvoljIiTSOaGvOvzXuObkw}{172.19.70.10}{172.19.70.10:9800}{dilm}{ml.machine_memory=66854977536, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-02.zgzf.com][172.19.70.11:9800][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elk-01.zgzf.com][172.19.70.10:9800][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to open reader on writer]; nested: FileSystemException[/zgapp/data/elasticsearch7/nodes/0/indices/T8UUt8CWRd6XZUyHJu8skA/1/index/_1bo_Lucene84_0.tim: Too many open files]; ], allocation_status[no_attempt]]]"}]},{"node_id":"NwmszklQTqqAQ1p9xzYXZw","node_name":"elk-03.zgzf.com","transport_address":"172.19.70.12:9800","node_attributes":{"ml.machine_memory":"66854977536","ml.max_open_jobs":"20","xpack.installed":"true"},"node_decision":"no","deciders":[{"decider":"max_retry","decision":"NO","explanation":"shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-06-03T13:48:36.982Z], failed_attempts[5], failed_nodes[[Fyk0dv4hSUaqHVMqsAqXdg]], delayed=false, details[failed shard on node [Fyk0dv4hSUaqHVMqsAqXdg]: failed recovery, failure RecoveryFailedException[[app-log-2024.05.19.09][1]: Recovery failed from 
{elk-02.zgzf.com}{lkdWWSzjS9iGFcvZaaVvrA}{VcX4s-mXTt65du9F6NI5XA}{172.19.70.11}{172.19.70.11:9800}{dilm}{ml.machine_memory=66854977536, ml.max_open_jobs=20, xpack.installed=true} into {elk-01.zgzf.com}{Fyk0dv4hSUaqHVMqsAqXdg}{QvoljIiTSOaGvOvzXuObkw}{172.19.70.10}{172.19.70.10:9800}{dilm}{ml.machine_memory=66854977536, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-02.zgzf.com][172.19.70.11:9800][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elk-01.zgzf.com][172.19.70.10:9800][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to open reader on writer]; nested: FileSystemException[/zgapp/data/elasticsearch7/nodes/0/indices/T8UUt8CWRd6XZUyHJu8skA/1/index/_1bo_Lucene84_0.tim: Too many open files]; ], allocation_status[no_attempt]]]"}]},{"node_id":"lkdWWSzjS9iGFcvZaaVvrA","node_name":"elk-02.zgzf.com","transport_address":"172.19.70.11:9800","node_attributes":{"ml.machine_memory":"66854977536","ml.max_open_jobs":"20","xpack.installed":"true"},"node_decision":"no","deciders":[{"decider":"max_retry","decision":"NO","explanation":"shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-06-03T13:48:36.982Z], failed_attempts[5], failed_nodes[[Fyk0dv4hSUaqHVMqsAqXdg]], delayed=false, details[failed shard on node [Fyk0dv4hSUaqHVMqsAqXdg]: failed recovery, failure RecoveryFailedException[[app-log-2024.05.19.09][1]: Recovery failed from {elk-02.zgzf.com}{lkdWWSzjS9iGFcvZaaVvrA}{VcX4s-mXTt65du9F6NI5XA}{172.19.70.11}{172.19.70.11:9800}{dilm}{ml.machine_memory=66854977536, ml.max_open_jobs=20, xpack.installed=true} into 
{elk-01.zgzf.com}{Fyk0dv4hSUaqHVMqsAqXdg}{QvoljIiTSOaGvOvzXuObkw}{172.19.70.10}{172.19.70.10:9800}{dilm}{ml.machine_memory=66854977536, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-02.zgzf.com][172.19.70.11:9800][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elk-01.zgzf.com][172.19.70.10:9800][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to open reader on writer]; nested: FileSystemException[/zgapp/data/elasticsearch7/nodes/0/indices/T8UUt8CWRd6XZUyHJu8skA/1/index/_1bo_Lucene84_0.tim: Too many open files]; ], allocation_status[no_attempt]]]"},{"decider":"same_shard","decision":"NO","explanation":"the shard cannot be allocated to the same node on which a copy of the shard already exists [[app-log-2024.05.19.09][1], node[lkdWWSzjS9iGFcvZaaVvrA], [P], s[STARTED], a[id=luKIA-plTle91E77OPn1GA]]"}]}]}
The root cause is a "Too many open files" error (FileSystemException) during shard recovery, which made allocation fail repeatedly until the retry limit of 5 was exceeded, leaving the replica unassigned.
Check and increase the file descriptor limit

ulimit -n          # check the current limit
ulimit -n 65536    # temporary change (current shell only)

To make the change persistent, edit /etc/security/limits.conf:

* soft nofile 65536
* hard nofile 65536

The limit is applied when a process starts, so the Elasticsearch process must be restarted for the new value to take effect.
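Note that if Elasticsearch runs as a systemd service, limits.conf may not apply to it; in that case the limit is typically raised via LimitNOFILE in the service unit. Either way, the limit the node actually sees can be verified from the cluster itself via node stats:

```
GET /_nodes/stats/process?filter_path=nodes.*.process.max_file_descriptors
```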
增加文件描述符限制后辙谜,手動(dòng)重試分片分配
POST /_cluster/reroute?retry_failed=true
在重試分配后俺榆,監(jiān)控集群健康狀態(tài),確保所有分片正常分配:
GET /_cluster/health
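If the cluster stays yellow or red, a per-shard view helps pinpoint which shards remain unassigned and why (unassigned.reason is a supported _cat/shards column):

```
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
```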