SCN Headroom警報
項目巡檢時朋譬,發(fā)現(xiàn)了有關(guān)于SCN的headroom警報,這是我第一次在項目中遇到SCN相關(guān)的警報假夺,遂記錄下來解決過程以及SCN相關(guān)知識梳毙。
SCN介紹
concept文檔中是這么介紹:
A system change number (SCN) is a logical, internal timestamp used by Oracle Database. SCNs order events that occur within the database, which is necessary to satisfy the ACID properties of a transaction. >
Oracle Database uses SCNs to mark the SCN before which all changes are known to be on disk so that recovery avoids applying unnecessary redo. The database also uses SCNs to mark the point at which no redo exists for a set of data so that recovery can stop.
SCN是Oracle數(shù)據(jù)庫的邏輯時間戳,用來保證Oracle中的事物和數(shù)據(jù)一致性般婆。SCN在數(shù)據(jù)庫中是一個單一的不斷的隨著數(shù)據(jù)庫一致性狀態(tài)的改變而自增的序列到腥。
查看SCN:
SELECT current_scn scn FROM v$database;
SCN兩個極限
SCN會有兩個極限,一個是hard limit,一個是soft limit,即headroom蔚袍。
- Hard Limit
SCNs occur in a monotonically increasing sequence, and there is a very large upper limit to how many SCNs an Oracle Database can use - that limit is currently 281 trillion, or specifically 281,474,976,710,656 (is 2^48) SCN values.
Given that there is an upper limit, it is important that any given Oracle Database does not run out of available SCNs. The Oracle Database uses a time based rationing system to ensure that this does not happen.
最大值281萬億乡范,oracle確保在不出現(xiàn)bug的時候不會達到,理論上最大值需要一兩百年才能達到啤咽,當(dāng)?shù)竭@個最大值時晋辆,數(shù)據(jù)庫將再也啟動不了,只能重新建庫解決宇整。
- Headroom
At any point in time, the Oracle Database calculates a "not to exceed" limit for the number of SCNs a database can have used, based on the number of seconds elapsed since 1988, multiplied by 16,384. This is known as the database's current maximum SCN limit. Doing this ensures that Oracle Databases will ration SCNs over time, allowing over 500 years of data processing for any Oracle Database.
The difference between the current SCN the database is using, and the "not to exceed" upper limit, is known as the SCN headroom. For almost all Oracle Databases, this headroom is constantly increasing every second.
計算方法(sysdate-1988年的秒數(shù))*(16*1024)瓶佳。oracle認(rèn)為合理的情況下每秒鐘SCN最大的增長數(shù)為16384。
SCN警報信息
SCN的警報信息會出現(xiàn)在告警日志中鳞青,通過相關(guān)MOS提供的腳本也能檢查出來涩哟。
當(dāng)告警日志中出現(xiàn)了以下的信息,即代表出現(xiàn)了Headroom問題了盼玄。
-
Warning信息
Warning信息 -
SCN跳躍信息
SCN跳躍信息
由圖中信息可知當(dāng)前數(shù)據(jù)庫SCN存在跳躍信息,向前跳躍7862分鐘潜腻,到了0x0e1332b78159(轉(zhuǎn)化為十進制是 15475618054489)埃儿,遠端機器HAJC,登錄用戶EPOINTOA8_HA融涣,主機名童番、程序等信息精钮。 scnhealthcheck.sql腳本
The "scnhealthcheck.sql" script can be downloaded here: Patch:13498243.
If you install the patch then it will create "scnhealthcheck.sql" in the $ORACLE_HOME/rdbms/admin directory.
Alternatively you can use the script directly from the unzipped patch without actually installing it in your $ORACLE_HOME.
需要打補丁[Patch:13498243](https://support.oracle.com/epmos/faces/ui/patch/PatchDetail.jspx?parent=DOCUMENT&sourceId=1393363.1&patchId=13498243),腳本自動創(chuàng)建到$ORACLE_HOME/rdbms/admin下剃斧。
-
SQL語句檢查
select version, date_time, dbms_flashback.get_system_change_number current_scn, indicator from (select version, to_char(SYSDATE, 'YYYY/MM/DD HH24:MI:SS') DATE_TIME, ((((((to_number(to_char(sysdate, 'YYYY')) - 1988) * 12 * 31 * 24 * 60 * 60) + ((to_number(to_char(sysdate, 'MM')) - 1) * 31 * 24 * 60 * 60) + (((to_number(to_char(sysdate, 'DD')) - 1)) * 24 * 60 * 60) + (to_number(to_char(sysdate, 'HH24')) * 60 * 60) + (to_number(to_char(sysdate, 'MI')) * 60) + (to_number(to_char(sysdate, 'SS')))) * (16 * 1024)) - dbms_flashback.get_system_change_number) / (16 * 1024 * 60 * 60 * 24)) indicator from v$instance)
使用以上SQL檢查轨香,根據(jù)INDICATOR列的值來判定Headroom還可以使用多久,這應(yīng)該是一個趨于穩(wěn)定的值幼东,當(dāng)小于10天的時候應(yīng)該提高注意臂容,系統(tǒng)可能出現(xiàn)問題,而這個項目上的INDICATOR值只有6.1根蟹。
解決辦法
在上一章警報信息中的SCN跳躍信息欄中已經(jīng)解析了日志中的一些信息脓杉,可以用來解決問題。
我們下面應(yīng)該找到HAJC這個庫和POINTOA8_HA用戶简逮,既然是遠端信息連接到這臺HAOA數(shù)據(jù)庫上球散,那肯定是通過Dblink的方式。
所以散庶,在HAJC庫中查找對應(yīng)的Dblink信息蕉堰,信息匹配,確實存在Dblink悲龟。
Dblink有可能會導(dǎo)致SCN爆炸增長屋讶,當(dāng)兩個數(shù)據(jù)庫通過Dblink相互訪問時,他們會選用兩者中當(dāng)前SCN最大的一個來同步SCN躲舌,譬如說數(shù)據(jù)庫A 的SCN 是10000丑婿,而數(shù)據(jù)庫B是20000,當(dāng)2者發(fā)生DBLINK聯(lián)系時没卸,將會用最大的SCN (20000)來同步羹奉,數(shù)據(jù)庫A的SCN將jump跳躍到20000。在一些環(huán)境中约计,往往不是本地數(shù)據(jù)庫觸發(fā)了SCN快速增長的bug诀拭,而是眾多數(shù)據(jù)庫中的某一個存在活躍的SCN BUG,而其他數(shù)據(jù)庫與該問題數(shù)據(jù)庫發(fā)生DBLINK聯(lián)系后煤蚌,就會因為SCN同步而經(jīng)歷 SCN headroom的的極速減少耕挨;異常高的SCN會通過DBLINK傳播給正常的數(shù)據(jù)庫,這種傳播往往呈爆炸式發(fā)展尉桩。
打2012年1月后發(fā)布的CPU或PSU補丁
補丁增加了_external_scn_rejection_threshold_hours參數(shù)筒占,通常設(shè)置該參數(shù)為24小時源頭解決
將對應(yīng)的Dblink源頭找到并禁用
對于此項目的情況,分析Dblink的使用功能是做數(shù)據(jù)同步蜘犁,和相關(guān)負(fù)責(zé)人討論后可以用數(shù)據(jù)交換平臺替代Dblink翰苫,于是決定禁用此Dblink,來解決問題。
看下刪掉Dblink同步后的效果奏窑。
5月27日檢查Headroom只剩下6.1天导披,5月31日drop掉此Dblink,6月1日檢查,Headroom增加到11天埃唯,接下來還會繼續(xù)釋放撩匕,SCN恢復(fù)使用正常,趨于穩(wěn)定墨叛。