開始
有時(shí)由于存儲(chǔ)優(yōu)化或?qū)崿F(xiàn)業(yè)務(wù)要求拗慨,或數(shù)據(jù)引用、去重要求。需要對(duì)業(yè)務(wù)數(shù)據(jù)計(jì)算和存儲(chǔ)指紋信息赵抢。
以下分析幾方面。
1.指紋算法選擇
2.指紋輸入的數(shù)據(jù)選擇
3.指紋儲(chǔ)存方法
指紋算法選擇
理論hash沖突的可能性:
md5 > sha1 > shaXX
這里沒什么心得烦却。不過md5已經(jīng)有彩虹表之類的沖突列表了宠叼。無意的沖突很難其爵,但如果是有意的沖突和搞事,MD5就比較不安全摩渺。
業(yè)務(wù)數(shù)據(jù)指紋的指紋輸入選擇
選擇MD5的輸入公式简烤,建議使用可以直接在sql中計(jì)算的公式,方便后臺(tái)維護(hù)時(shí)在線查對(duì)和做數(shù)乐埠。可以考慮:
注意多字段合并計(jì)算md5時(shí)囚企,需要加入分隔符(想想不加為什么不可以?)龙宏,和考慮是否需要對(duì)分隔符沖突作轉(zhuǎn)義(escape)處理棵逊。
使用時(shí),需要注意DB和表的charset
select
concat_ws('~',
replace(field1,'~','~~'),
replace(field2,'~','~~')
);
Mysql中用好Binary類型做性能優(yōu)化:
需要在mysql表中保存MD5值辆影。考慮三個(gè)方案:
varchar(32)
char(32)
binary(16)
由于MD5值黍特,實(shí)際為128-bit(16 byte)二進(jìn)數(shù)據(jù)蛙讥,字符只是一個(gè)方便人看的表達(dá)灭衷,所以應(yīng)該用msql的binary(16)來保存最為節(jié)省空間(與varchar和CHAR相比)次慢。sha1/shaX 等HASH算法也同理。減少IO和內(nèi)存的操作翔曲,所以理論上同時(shí)可以提高性能迫像。這在多表關(guān)聯(lián)、排序時(shí)差別更大瞳遍。
因?yàn)槿绻鹶archar(32)闻妓。在utf8 環(huán)境 下最少要32*3+長(zhǎng)度記數(shù)個(gè)字節(jié)。長(zhǎng)度比binary16長(zhǎng)幾倍掠械。而且由缆,作為索引或比較時(shí)注祖,varchar的忽略大小寫的比較比直接的binary比較更加慢。所以理論來說犁功,更應(yīng)該考慮binary(16)氓轰。
人工select時(shí),可以使用HEX/UNHEX方法把binary數(shù)據(jù)變?yōu)榭勺x(http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_hex)浸卦。
如有需要,對(duì)于mysql 5.7+ 也可以用virtual列等方法去虛擬一個(gè)文本列案糙。
附錄:參考資料
hen generating our hash, we'll put an unlikely separator between each column during concatenation so that we won't have "1" and "23" getting confused with "12" and "3" as in this example:
mysql> select concat(1,23), md5(concat(1,23));+--------------+----------------------------------+| concat(1,23) | md5(concat(1,23)) |+--------------+----------------------------------+| 123 | 202cb962ac59075b964b07152d234b70 |+--------------+----------------------------------+1 row in set (0.00 sec)mysql> select concat(12,3), md5(concat(12,3));+--------------+----------------------------------+| concat(12,3) | md5(concat(12,3)) |+--------------+----------------------------------+| 123 | 202cb962ac59075b964b07152d234b70 |+--------------+----------------------------------+1 row in set (0.00 sec)
Instead, we'll do this:
mysql> select concat_ws('~',1,23), md5(concat_ws('~',1,23));+---------------------+----------------------------------+| concat_ws('~',1,23) | md5(concat_ws('~',1,23)) |+---------------------+----------------------------------+| 1~23 | 037ef90202e1e89a23016e4b51489326 |+---------------------+----------------------------------+1 row in set (0.00 sec)mysql> select concat_ws('~',12,3), md5(concat_ws('~',12,3));+---------------------+----------------------------------+| concat_ws('~',12,3) | md5(concat_ws('~',12,3)) |+---------------------+----------------------------------+| 12~3 | 4ba8224d8a784c8af2af98b4ceb034c6 |+---------------------+----------------------------------+1 row in set (0.00 sec)
http://dev.mysql.com/doc/refman/5.7/en/char.html :
In contrast to CHAR
, VARCHAR
values are stored as a 1-byte or 2-byte length prefix plus data. The length prefix indicates the number of bytes in the value. A column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.
TRADEOFF #1 Obviously, VARCHAR holds the advantage since variable-length data would produce smaller rows and, thus, smaller physical files.
TRADEOFF #2 Since CHAR fields require less string manipulation because of fixed field widths, index lookups against CHAR field are on average 20% faster than that of VARCHAR fields. This is not any conjecture on my part. The book MySQL Database Design and Tuning performed something marvelous on a MyISAM table to prove this. The example in the book did something like the following:
http://stackoverflow.com/questions/59667/why-would-i-ever-pick-char-over-varchar-in-sql:
As was pointed out by Gaven in the comments, if you are using a multi-byte, variable length character set like UTF8 then CHAR stores the maximum number of bytes necessary to store the number of characters. So if UTF8 needs at most 3 bytes to store a character, then CHAR(6) will be fixed at 18 bytes, even if only storing latin1 characters. So in this case VARCHAR becomes a much better choice.
select * from t where id = x'0cc175b9c0f1b6a831c399e269772661';
MySQL 5.7:
create table users(
id_bin binary(16),
id_text varchar(36) generated always as
(insert(
insert(
insert(
insert(hex(id_bin),9,0,'-'),
14,0,'-'),
19,0,'-'),
24,0,'-')
) virtual,
name varchar(200));
http://www.ovaistariq.net/632/understanding-mysql-binary-and-non-binary-string-data-types/:
A CHAR(10) column would need 30 bytes for each value regardless of the actual value if utf8 character set is used, however, the same column would need 10 bytes for each value if a single-byte character set such as latin1 is used. Keeping these considerations in mind is very important.
http://www.techearl.com/mysql/how-to-store-md5-hashes-in-a-mysql-database:
INSERT INTO md5_test_binary (md5
) VALUES (unhex('0800fc577294c34e0b28ad2839435945'));