廢話嗶嗶
都說算法是程序的靈魂亚亲,算法源于數(shù)學(xué),數(shù)學(xué)是描述宇宙萬物的語言腐缤,這話一點(diǎn)不假捌归,開發(fā)出身算法用的較少,回過頭看算法岭粤,用到了惜索,遞歸、循環(huán)剃浇、分支巾兆、分治、合并虎囚、取舍調(diào)優(yōu)的思想角塑,確實(shí)精彩,燒腦還挺有意思淘讥。
好的技術(shù)博客必須有做到有No BB,Show Code的干貨圃伶,也得有說明輔助理解,因此寫了這篇博客蒲列。
需求
- 概括:將一個(gè)字符字符串窒朋,拆分單個(gè)的拼音,例如bianchengyuyan(編程語言)蝗岖,拆分成bian侥猩、cheng、yu抵赢、yan欺劳。如果是zhang1san(張1三)洛退,zhangssan(張s三),則返回空數(shù)組杰标。
- 應(yīng)用場景:這算法并非百無一用,拼音分詞彩匕,詞義分析腔剂,品相判別等場景能夠用到。
成品代碼
<?php
/**
* @function 返回拼音組合驼仪,由于冗長掸犬,單獨(dú)拆分出來,拼音手動(dòng)總結(jié)绪爸,可能存在遺漏
* @return array
*/
function pinYinArr() {
return [
1 => ['a', 'e', 'm', 'n', 'o'],
2 => ['ai','an','ao','ba','bi','bo','bu','ca','ce','ci','cu','da','de','di','du','en','er','fa','fo','fu','ga','ge','gu','ha','he','hm','hu','ji','ju','ka','ke','ku','la','le','li','lo','lu','lv','ma','me','mi','mo','mu','na','ne','ng','ni','nu','nv','ou','pa','pi','po','pu','qi','qu','re','ri','ru','sa','se','si','su','ta','te','ti','tu','wa','wo','wu','xi','xu','ya','ye','yi','yo','yu','za','ze','zi','zu'],
3 => ['ang','bai','ban','bao','bei','ben','bie','bin','cai','can','cao','cen','cha','che','chi','chu','cou','cui','cun','cuo','dai','dan','dao','dei','den','dia','die','diu','dou','dui','dun','duo','eng','fan','fei','fen','fou','gai','gan','gao','gei','gen','gou','gua','gui','gun','guo','hai','han','hao','hei','hen','hng','hou','hua','hui','hun','huo','jia','jie','jin','jiu','jue','jun','kai','kan','kao','kei','ken','kou','kua','kui','kun','kuo','lai','lan','lao','lei','lia','lie','lin','liu','lou','lue','lun','luo','mai','man','mao','mei','men','mie','min','miu','mou','nai','nan','nao','nei','nen','nie','nin','niu','nou','nue','nuo','pai','pan','pao','pei','pen','pie','pin','pou','qia','qie','qin','qiu','que','qun','ran','rao','ren','rou','rua','rui','run','ruo','sai','san','sao','sen','sha','she','shi','shu','sou','sui','sun','suo','tai','tan','tao','tie','tou','tui','tun','tuo','wai','wan','wei','wen','xia','xie','xin','xiu','xue','xun','yan','yao','yin','you','yue','yun','zai','zan','zao','zei','zen','zha','zhe','zhi','zhu','zou','zui','zun','zuo'],
4 => ['bang','beng','bian','biao','bing','cang','ceng','chai','chan','chao','chen','chou','chua','chui','chun','chuo','cong','cuan','dang','deng','dian','diao','ding','dong','duan','fang','feng','gang','geng','gong','guai','guan','hang','heng','hong','huai','huan','jian','jiao','jing','juan','kang','keng','kong','kuai','kuan','lang','leng','lian','liao','ling','long','luan','mang','meng','mian','miao','ming','nang','neng','nian','niao','ning','nong','nuan','pang','peng','pian','piao','ping','qian','qiao','qing','quan','rang','reng','rong','ruan','sang','seng','shai','shan','shao','shei','shen','shou','shua','shui','shun','shuo','song','suan','tang','teng','tian','tiao','ting','tong','tuan','wang','weng','xian','xiao','xing','xuan','yang','ying','yong','yuan','zang','zeng','zhai','zhan','zhao','zhei','zhen','zhou','zhua','zhui','zhun','zhuo','zong','zuan'],
5 => ['chang','cheng','chong','chuai','chuan','guang','huang','jiang','jiong','kuang','liang','niang','qiang','qiong','shang','sheng','shuai','shuan','xiang','xiong','zhang','zheng','zhong','zhuai','zhuan'],
6 => ['chuang', 'shuang', 'zhuang'],
];
}
/**
* @function 移除參數(shù)1中右邊包含的參數(shù)2湾碎,并返回剩余的字符,例如strRemoveRightOnce('wahaha', 'ha')奠货,返回waha
* @param $string string 被操作字符串
* @param $part string 要被移除的字符串
* @return string
*/
function strRemoveRightOnce($str, $part) {
if (substr($str, -strlen($part)) == $part) {
return substr($str, 0, - strlen($part));
}
return $str;
}
/**
* @function 獲取字符串存在的拼音數(shù)量介褥,不兼容-符號,從長往短了截取
* @param $str string 字符
* @param $result array 函數(shù)返回的結(jié)果
* @return array
*/
function pinYinCutLongToShort($str, $result = []) {
if($str == '') {
return $result;
}
if(($str == '') && ($result == [])) {
return [];
}
//判斷是否是純拼音递惋,不是直接過濾
if((! preg_match('/^[a-z]+$/', $str)) && ($result == [])) {
return [];
}
$initial_arr = pinYinArr();
$initial_keys = array_keys($initial_arr);
$max = max($initial_keys);
$min = min($initial_keys);
for($i = $max; $i >= $min; $i--) {
$substring = substr($str, - $i);
if(in_array($substring, $initial_arr[$i])) {
array_unshift($result, $substring);
//避免xiao ha ha,用ltrim函數(shù)萍虽,一次性移除掉了兩個(gè)ha造成的計(jì)量有誤
return pinYinCutLongToShort(strRemoveRightOnce($str, $substring), $result);
} else {
if($i == $min) {
return [];
}
}
}
return $result;
}
print_r(pinYinCutLongToShort('bianchengyuyan'));
Array
(
[0] => bian
[1] => cheng
[2] => yu
[3] => yan
)
算法調(diào)優(yōu)注意的地方
- 如果檢測到含有非拼音的字符(例如有數(shù)字)睛廊,以及多余的拼音字符(例如zhangxsan張x三),會(huì)直接返回空數(shù)組杉编。
-
代碼采用從長往短切割的策略超全,以xianggang(香港)舉例:
- 從長到短(for循環(huán)遞減):分成xiang、gang邓馒。粒度大嘶朱,但是失敗率小。
- 從短到長(for循環(huán)遞增):拆分成xi绒净、an见咒,后面的gg沒法切了。粒度更小挂疆,容易出錯(cuò)改览。
- 從長到短缺點(diǎn)也很明顯:相應(yīng)的會(huì)忽略精度,所以xian(西安)會(huì)記作一個(gè)拼音缤言,當(dāng)做xian(賢)處理宝当。
-
代碼采用的從字符串右邊往左切的策略,進(jìn)一步避免切割出錯(cuò)胆萧。例如xianguang(閑逛):
- 從左到右從長到短:xiang庆揩、uang沒辦法切了俐东。
- 從左到右從短到長:xi、an订晌、gu虏辫、an、g沒辦法切了锈拨。
- 從右到左從長到短:guang砌庄、xian,剛剛好奕枢。
- 從右到左從短到長:ang娄昆、gu、an缝彬、xi萌焰,也行。
- 結(jié)語:從長往短了切割用于減少失敗率谷浅,從右往左切割扒俯,用于進(jìn)一步避免出錯(cuò),因此被采用壳贪。
- 注意:以上算法陵珍,并非適合所有場景,可能存在誤差违施,畢竟沒有NLP的AI算法加持(自然語言處理)互纯。
- 補(bǔ)充:若讀者想要獲取最細(xì)粒度的拼音,不必再原有函數(shù)上改動(dòng)磕蒲,可以將返回的數(shù)組結(jié)果遍歷留潦,再次調(diào)用另一個(gè)切割函數(shù)(注意另一個(gè)函數(shù)是從短到長切割),隨后匯總辣往。