Make Service Faults Transparent

This article is written in English because I really need to work on the language. Sorry if it is not easy to understand.

A Summary of What Has Happened Recently

Recently, IT services on my campus have been very unstable.

  • In March, many people posted on forums that when they topped up their campus Internet accounts through WeChat, far more money than they had paid (maybe 100x) was credited.
    • The WeChat top-up service was later disabled. Because most people were not aware of the existing offline top-up-by-card service, many accounts fell into arrears.
    • Several days later, the campus Internet billing system was disabled, which meant you could use the network for free. The billing system was later restored, but it only charged the monthly fee (not the traffic fee).
    • A barely noticeable statement was then published, saying that the problem was caused by a bug from the software company.
  • On March 20th, campus card users who used their cards to get hot water or buy breakfast found their cards locked. (Those lazy guys were not affected at all.)
    • In the morning nobody knew whether the issue was being solved, until around 11 (lunchtime), when my school's instructor sent an announcement: "there will be unlock service in canteens, please keep order and don't panic at the scene". Announcements from the canteen administrators were put up at the canteens. Unlocking was quick and easy, but most people still went to the canteens where Alipay is accepted.
    • Later that afternoon the card administrator's public statement came out: it was a service fault (on BITUnion some said it was a bug that had been hidden for 14 years). IT staff explained on BITUnion that they had tried to work out solutions and mitigate the issue before drafting public statements.
  • In recent months the campus Internet has been unstable: during peak hours it becomes very slow or even unavailable. The downtime is maybe around 2% (over a 24-hour day), which does not look like much, but users could surely feel it.
    • The causes seem very complex. In my view, new DNS servers, old cache servers, new firewall systems, new upstream link providers and upstream link issues can all cause problems. And of course all those new facilities need to be fine-tuned, which takes time.
    • So far no official statement has been published. But the IT service monthly report (which most people are not aware of) said "Issue fully fixed, during peak hours upstream links can work in full bandwidth now". One of the reasons it mentioned was "DDoS attack causing network core server CPU instant usage up to 99% (usually ~20%)".
    • However, a student representatives' meeting will be held soon, and many representatives will raise the hotly debated Internet issue there. But I believe most of them will never get the point of why this is happening.
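
To put the rough 2% downtime estimate above in perspective, here is a small back-of-the-envelope calculation. The 4-hour peak window is purely an assumption for illustration; the real peak period on campus may be different.

```python
# Back-of-the-envelope: what does "about 2% downtime per day" feel like?
# PEAK_HOURS is an assumed value for illustration, not a measured one.
DOWNTIME_RATIO = 0.02   # ~2% of a 24-hour day (the rough estimate from the text)
HOURS_PER_DAY = 24
PEAK_HOURS = 4          # assumption: length of the daily peak window


def downtime_minutes(ratio: float, hours: float = HOURS_PER_DAY) -> float:
    """Minutes of downtime for a given ratio over a period of `hours`."""
    return ratio * hours * 60


def share_of_peak(ratio: float, peak_hours: float) -> float:
    """Fraction of the peak window lost if all downtime falls inside it."""
    return downtime_minutes(ratio) / (peak_hours * 60)


if __name__ == "__main__":
    print(f"Downtime per day: {downtime_minutes(DOWNTIME_RATIO):.1f} minutes")
    print(f"Share of peak window: {share_of_peak(DOWNTIME_RATIO, PEAK_HOURS):.0%}")
```

Under these assumptions, 2% of a day is about 29 minutes, which is roughly 12% of a 4-hour peak window. If the downtime is concentrated in peak hours, it is much more visible than the raw percentage suggests.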
"Totally nailed the fix"

Why Faults Need to Be Transparent

As you can see, all these issues appeared suddenly, but they did not happen for no reason. Anyway, apart from solving the issues, making the solving process transparent is also important. Why?

Because information technology is becoming essential to our lives, just like water and electricity supplies. At this point it is no longer anything "advanced", so people have high expectations of it. What's more, IT is developing fast (measured in years, not decades), so people's expectations are growing fast along with it.

It's quite a challenge for the campus IT service to catch up with that. But first of all, they are working on it. If they don't speak up, people who consider the service essential will imagine "It's just messing up my life, and they just don't try hard to solve that". This is surely a gap in understanding between the two sides.

"Why did you leave the escalator unfixed for ONE MONTH!"

P.S. A kind person has reminded me that sometimes, in the "old system", there are staff who do not work at all. But I guess the staff on my campus work hard.

Another problem when IT service is not transparent "in time" is that users don't know whether they need to report or wait. Of course most of us will silently wait for the fix; most of us are busy, right? But what if the staff don't know about the issue at all? We don't know whether they know about it, and most people won't trust others forever and keep believing "they must be fixing it now". This uncertainty might be even more misleading, and it causes user dissatisfaction.

I can't think of any disadvantage for a hard-working public service in being actively transparent about faults, so I strongly believe in this idea.

Ah, yes, I have to highlight that what I mean by transparency here is "instant transparency". Sometimes this brings one problem: when you realize that a cause you published earlier was wrongly identified, you have to retract the previous statement, which brings confusion. If everybody is wise and realizes that people can make mistakes, this is not a problem at all, and you can just leave your previous "wrong" statement there.

In Staytus's demo, an issue turned red again after being in `Monitoring` status

Tools and Platforms Are Not That Important

People may argue, "we might not have the right tool to do that for now". The tool at hand may well not fit, but once you have the intention to do the right thing, tools and platforms are not a problem.

A good example on my campus is the student financial service. They always use forums to answer students' scholarship questions. The forum they chose is not that popular, and I guess some of the scholarship process information could be presented more nicely on a single page, but first of all they chose to be transparent.

IT service, in contrast, is:

  • Essential, so users need feedback more quickly;
  • Wide, so physical service and on-site announcements in all areas are expensive;
  • Complex, since hardware, software and configuration all matter.

Thus a digital way might be a better way to provide transparency.

But what if "the digital way" itself is faulty? We can put the solution on a school server that hardly ever fails (probably standalone) and that connects to both the Intranet and the Internet. An even better solution might be to prepare for the worst: choose a third party (a VPS outside the Intranet) or a public service (Weibo or WeChat), and hope that it won't fail when our infrastructure fails. Unreliable as it seems, you are winning the lottery if everything fails at once (maybe once in a lifetime?), and in that case you won't hesitate to make physical announcements.

Yeah, maybe your physical announcement is not enough...

A Blueprint Specifically for IT Service

When everybody is busy, this kind of customer service cannot depend only on "I contact you and you talk to me". Some self-service thinking can be incorporated here: make status updates available to everyone. When users need help, they can check the updates, rest assured, and calmly wait.

I heard that a support ticket system for IT services is being considered now, but for now the "status page" is more important.

We have talked about the platforms, right? Let's look into them one by one.

  • A webpage, which is very customizable, seems good. But whether it is opened in the browser or inside a WeChat WebView, it can't push notifications by itself.
    • However, when users run into issues that matter to them, they will check the status themselves. So pushing doesn't matter that much.
    • When we have had a disaster and need to "push" some apologies, that doesn't need to be instant or frequent. It is outside the scope of what we are talking about.
  • Weibo seems good, and can be a choice. But there are two problems: it is so public that sometimes that's not a good thing, and, last but not least, when everybody uses WeChat, who cares about Weibo?
  • The problem with a WeChat official account is that it can't push messages that frequently. When you have limits, you might not want to be that transparent. And yes, users don't want to receive messages that frequently.
  • A WeChat enterprise account doesn't seem to have these problems. It doesn't limit your push frequency. But if you choose this, remember that it is not a long-term solution (it has been superseded by the Enterprise WeChat app), and that it is not supported on PC and Windows Phone. It hardly deserves to be called "transparent" unless you provide a webpage alternative.
  • As for other push methods, people hardly use email, SMS is expensive, and you are probably thinking of a mobile app? Nobody likes something that heavy.

As I said above, when a fault happens, users are motivated to "check status". Thus frequent, up-to-date status updates that don't need to be pushed to everybody look good.

The conclusion is that it's best to have

  • a self-hosted standalone status webpage,
  • linked from major IT platforms (in my campus's case, the WeChat enterprise account and the IT department website),
  • which can be quickly deployed to an external VPS and keep working if the self-hosted one crashes,
  • whose data can be consumed via Webhooks or an API by other official platforms, like Weibo, WeChat or something.
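
The last two points of the list above can be sketched roughly as follows. This is only a minimal illustration, not a real implementation: the `StatusUpdate` fields, the URLs and the `is_up` health check are all hypothetical, and the timestamp is a placeholder.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Optional, Sequence


@dataclass
class StatusUpdate:
    service: str       # hypothetical identifier, e.g. "campus-internet"
    state: str         # investigating / identified / monitoring / resolved
    message: str
    published_at: str  # ISO 8601 timestamp


def to_feed(updates: Sequence[StatusUpdate]) -> str:
    """Serialize updates to JSON so other platforms can consume them."""
    return json.dumps({"updates": [asdict(u) for u in updates]}, ensure_ascii=False)


def pick_mirror(mirrors: Sequence[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable page, trying the self-hosted one first."""
    for url in mirrors:
        if is_up(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical URLs; in practice is_up would be a real HTTP health check.
    mirrors = ["https://status.campus.example", "https://status-backup.example"]
    print(pick_mirror(mirrors, is_up=lambda url: "backup" in url))
    update = StatusUpdate(
        service="campus-internet",
        state="monitoring",
        message="Upstream link restored; watching peak hours.",
        published_at="2000-01-01T20:00:00+08:00",  # placeholder timestamp
    )
    print(to_feed([update]))
```

Keeping the feed as plain JSON means a Weibo or WeChat integration only needs to read it, and the same data can be served from the external VPS when the self-hosted page is down.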

Of course this has some technical cost, so choosing an existing public service (in the short term) is fine, too.

"How to publish" is easy: we can formulate some statement templates (like the well-known investigating/identified/monitoring/resolved model) and add details to the statements when they are used. We can also set update rules to keep things transparent, like publishing at least one update every X hours.

Pre-translated templates in Google's statusboard; notice the "we have additional English explanation" sentences
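
As a rough sketch of the statement templates and the "one update every X hours" rule described above: the template wording, the function names and the timestamps here are made up for illustration, and real templates would be agreed on by the IT staff.

```python
from datetime import datetime, timedelta
from enum import Enum


class State(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


# Hypothetical template wording, one per state of the model.
TEMPLATES = {
    State.INVESTIGATING: "We are investigating an issue with {service}. {details}",
    State.IDENTIFIED: "We have identified the cause of the issue with {service}. {details}",
    State.MONITORING: "A fix for {service} has been applied and we are monitoring the result. {details}",
    State.RESOLVED: "The issue with {service} has been resolved. {details}",
}


def render(state: State, service: str, details: str = "") -> str:
    """Fill the template for `state` with the service name and optional details."""
    return TEMPLATES[state].format(service=service, details=details).strip()


def update_overdue(state: State, last_update: datetime, now: datetime,
                   max_gap: timedelta) -> bool:
    """True if an unresolved incident has gone longer than `max_gap` without an update."""
    return state is not State.RESOLVED and now - last_update > max_gap


if __name__ == "__main__":
    print(render(State.INVESTIGATING, "campus card", "Some cards appear to be locked."))
    last = datetime(2000, 1, 1, 8, 0)   # placeholder timestamps
    now = datetime(2000, 1, 1, 11, 0)
    print(update_overdue(State.INVESTIGATING, last, now, max_gap=timedelta(hours=2)))
```

Here `max_gap` plays the role of "X hours"; whoever publishes the messages could run such a check regularly to see which open incidents need a new update.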

We also need someone to publish messages (I know in China this is a bigger problem). A good technical writer should be recruited. But I think it can be done by students as a part-time job: they sign a confidentiality agreement and join the working discussion group, and if any fault happens, they are responsible for publishing the situation according to the template and the discussion group's conversations. Yeah, I bet these conversations sometimes contain passwords or something else, so confidentiality is important.

Or if the tech staff can do the updates themselves, that's fine (but they are really too busy for that).

"The well-known model" in Staytus

Choosing an Open-Source Solution

As a student who doesn't have that much money, I like open source a lot. For this status page thing, of course I would like to solve it with open-source stuff.

Actually, as far as I know, there is no such "status-page service" in China. For example, Leancloud built its status page itself. The "international" cloud versions don't seem good here, because they might be very slow. So we have to count on self-hosted, open-source ones.

In my opinion, for a status page of school IT service, the most important thing is "update". The overall "status indicator" is not that important.

Yes, this Apple style doesn't fit

After some research, I found that there are not that many open-source status solutions that are dynamic, usable and still maintained.

  • Cachet, the most popular one, but not perfect for now; it uses PHP and MySQL/PostgreSQL (as a reminder, try the dev version; the current stable version doesn't have status updates)
  • Staytus, already elegant and ready to use (but simple); it uses Ruby and MySQL, and the demo is really pretty
  • statuspage, not that popular, and I haven't checked it thoroughly yet, but as a Python alternative it says "Cachet is a great product, I simply despise PHP"
Cachet (dev version)

I know some of you hate databases. Using a static page generator is a good idea. Such solutions exist, but they don't seem that polished, and setting up the workflow is hard work.

  • Netlify StatusKit: though it is "a template to deploy your own Status pages on Netlify", it seems to be a generator
  • or we can build one with Jekyll and customized themes and plugins

I hope these solutions can be helpful. Still, the most important thing is what you are trying to achieve.
