TiDB: Maintaining Cluster Host Servers with Zero Business Impact
A TiDB cluster is made up of three components: TiDB, PD, and TiKV, and each component runs as a highly available group of nodes.
This post summarizes how to carry out server maintenance on a TiDB cluster, especially in production, with zero business impact: disk upgrades, disk expansion, data migration, network upgrades, server restarts, and so on.
Main methods and points to note:
Maintaining a TiDB node:
At the load-balancing layer (SLB, HAProxy, etc.), lower the weight of the node to be maintained so that traffic drains away; once the node is back, restore its weight or role.
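For example, a minimal HAProxy backend sketch (the backend name, server names, and addresses are illustrative assumptions; weight 0 stops new traffic to the node under maintenance):
backend tidb_cluster
    mode tcp
    balance leastconn
    server tidb1 192.168.11.1:4000 check weight 0    # drained for maintenance
    server tidb2 192.168.11.2:4000 check weight 10
    server tidb3 192.168.11.3:4000 check weight 10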
Maintaining a PD node:
If the node holds the member leader, transfer leadership away first; then delete the PD member to be maintained. After maintenance, clean up its cached member data and rejoin it to the cluster.
Maintaining a TiKV node:
First set the leader weight of the target TiKV store to 0 and add a scheduler to evict its Region leaders onto other nodes. The TiKV service can then be stopped safely. After maintenance, start the service, restore the weight, and remove the scheduler.
(Try to finish work on a TiKV node within one hour: by default, once a store has been down for more than an hour, PD starts recreating its data replicas on the other TiKV nodes, which can drive host load up and even hurt read/write performance of the live business.)
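If the window may run longer, the threshold can be raised temporarily in pd-ctl before stopping TiKV (a sketch; 2h is just an example value, and the original setting should be restored afterwards):
config set max-store-down-time 2h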
Case study:
Here is a concrete case: without adding machines, upgrade the disks of the cluster's existing machines from ordinary disks to SSDs. This involves migrating the data files of every service and restarting all of them, and the whole process is invisible to the business.
Reference steps:
1. Attach the new SSD, format it, create an LVM volume, and mount it at /data2
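A minimal sketch of this step (the device name /dev/vdb and the ext4 filesystem are assumptions; the VG/LV names match the ones unmounted and remounted later):
pvcreate /dev/vdb
vgcreate vg02 /dev/vdb
lvcreate -l 100%FREE -n lv02 vg02
mkfs.ext4 /dev/vg02/lv02
mkdir -p /data2
mount /dev/vg02/lv02 /data2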
2. First copy of the original data directory to the new disk (the main concern is that copying only after stopping services would take a long time; copy first, then do an incremental copy after the services stop. rsync can also be used)
mkdir -p /data2/tidb/deploy
chown -R tidb.tidb /data2
cp -R -a /data/tidb/deploy/data /data2/tidb/deploy/
3. Handle the PD node:
cd /data/tidb/tidb-ansible/resources/bin/
./pd-ctl -u http://192.168.11.2:2379
Check PD information:
member
member leader show
If the PD to be maintained is the leader, transfer leadership to another member first:
member leader transfer pd2
Delete the PD member:
member delete name pd1
4. Handle the TiKV node: lower the leader weight of the target store (store 5 in this case) and schedule its leaders away:
store weight 5 0 1
scheduler add evict-leader-scheduler 5
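Before stopping the TiKV service, it is worth confirming that the eviction has finished; inside the same pd-ctl session, the store's leader_count should drop to 0 (store id 5 as above):
store 5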
5. Stop the related services on this host (those that are present)
Mind the order:
systemctl status pd.service
systemctl stop pd.service
systemctl status tikv-20160.service
systemctl stop tikv-20160.service
檢查服務(wù)及數(shù)據(jù)
curl http://192.168.11.2:2379/pd/api/v1/stores
Before stopping tidb-server, remember to adjust the SLB or HAProxy configuration first so that no traffic reaches this node (for example via the HAProxy runtime API, as sketched below).
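A sketch using the HAProxy admin socket (this assumes a stats socket is enabled at /var/run/haproxy.sock with level admin, plus the backend/server names from the earlier sketch):
echo "set server tidb_cluster/tidb1 state drain" | socat stdio /var/run/haproxy.sock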
systemctl status tidb-4000.service
systemctl stop tidb-4000.service
systemctl status grafana.service
systemctl stop grafana.service
systemctl status prometheus.service
systemctl stop prometheus.service
systemctl status pushgateway.service
systemctl stop pushgateway.service
systemctl status node_exporter.service
systemctl stop node_exporter.service
netstat -lntp
6. Incremental copy of the data:
time \cp -R -a -u -f /data/* /data2/
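The rsync alternative mentioned in step 2 would be (a sketch; note that --delete removes files on the target that no longer exist on the source):
time rsync -aH --delete /data/ /data2/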
7. Unmount the original /data and the new /data2:
fuser -cu /data
ps -ef|grep data
umount /dev/vg01/lv01
umount /dev/vg02/lv02
Mount the new disk at /data:
mount /dev/vg02/lv02 /data
df -h
ls -lrt /data
8. Start the related services
systemctl status node_exporter.service
systemctl start node_exporter.service
systemctl status pushgateway.service
systemctl start pushgateway.service
systemctl status prometheus.service
systemctl start prometheus.service
Rejoining PD to the cluster requires cleaning up its cached member data first:
rm -rf /data/tidb/deploy/data.pd/member/
vi /data/tidb/deploy/scripts/run_pd.sh
Delete the --initial-cluster line and add a --join line:
--join="http://192.168.11.2:2379" \
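After the edit, the start script looks roughly like this (a sketch; the flag values besides --join are assumptions based on this case and the data.pd path above, and the remaining flags are omitted):
exec bin/pd-server \
    --name="pd1" \
    --client-urls="http://192.168.11.1:2379" \
    --data-dir="/data/tidb/deploy/data.pd" \
    --join="http://192.168.11.2:2379" \
    ...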
systemctl status pd.service
systemctl start pd.service
systemctl status tikv-20160.service
systemctl start tikv-20160.service
curl http://192.168.11.2:2379/pd/api/v1/stores
systemctl status tidb-4000.service
systemctl start tidb-4000.service
systemctl status grafana.service
systemctl start grafana.service
netstat -lntp
Restore the store weight and remove the evict-leader scheduler
cd /data/tidb/tidb-ansible/resources/bin/
./pd-ctl -u http://192.168.11.2:2379
store weight 5 1 1
scheduler show
scheduler remove evict-leader-scheduler-5
Watch leader_count and the TiKV logs; after a short while, balance-leader will redistribute leaders back automatically
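For a quick check without an interactive pd-ctl session, the stores API used earlier also reports leader_count per store (a sketch against the same PD endpoint):
curl -s http://192.168.11.2:2379/pd/api/v1/stores | grep leader_count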
9. Update the mount entry in /etc/fstab
vi /etc/fstab
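The new /data entry would look roughly like this (a sketch; the ext4 filesystem type is an assumption, and the old /dev/vg01/lv01 entry for /data should be removed or commented out):
/dev/vg02/lv02  /data  ext4  defaults  0  0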
10. Restore the PD configuration, mainly to keep the configuration complete and consistent, which makes future cluster maintenance easier:
vi /data/tidb/deploy/scripts/run_pd.sh
Delete the --join line and add back the original --initial-cluster line:
--initial-cluster="pd1=http://192.168.11.1:2380,pd2=http://192.168.11.2:2380,pd3=http://192.168.11.3:2380" \