哈夫曼樹_Entropy

source

Description

An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with “wasted” or “extra” information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; english text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.

English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in english text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it’s easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a “prefix-free variable-length” encoding.

In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint was not enforced, then such a decoding would be impossible.

Consider the text “AAAAABCD”. Using ASCII, encoding this would require 64 bits. If, instead, we encode “A” with the bit pattern “00”, “B” with “01”, “C” with “10”, and “D” with “11” then we can encode this text in only 16 bits; the resulting bit pattern would be “0000000000011011”. This is still a fixed-length encoding, however; we’re using two bits per glyph instead of eight. Since the glyph “A” occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode “A” with “0”, “B” with “10”, “C” with “110”, and “D” with “111”. (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to “0000010110111”, a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you’ll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.

As a second example, consider the text “THE CAT IN THE HAT”. In this text, the letter “T” and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters “C”, “I’ and “N” only occur once, however, so they will have the longest codes.

There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with “00”, “A” with “100”, “C” with “1110”, “E” with “1111”, “H” with “110”, “I” with “1010”, “N” with “1011” and “T” with “01”. The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
Input
The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word “END” as the text string. This line should not be processed.
Output
For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.

Sample Input

AAAAABCD
THE_CAT_IN_THE_HAT
END

Sample Output

64 13 4.9
144 51 2.8

題意:輸入一串只含有大寫字母或下劃線的字符串,輸入"END"結(jié)束。分別求用8位編碼一個字符,哈夫曼編碼所需的二進(jìn)制位數(shù),以及他們的比值骗爆。
題解:用優(yōu)先隊列模擬哈夫曼樹的建立過程栈虚。

#include<cstdio>
#include<iterator>
#include<queue>
#include<map>
#include<string>
#include<functional>
#include<iostream>
using namespace std;
int main()
{
    string letter;
    while(cin>>letter&&(letter!="END"))
    {
        map<char,int> str;
        int sum=0;
        priority_queue<int,vector<int>,greater<int> > q;
        for(int i=0;i<letter.length();i++){
            str[letter[i]]++;
        }
        map<char,int>::iterator it=str.begin();
        for(;it!=str.end();it++)
        {
            q.push(it->second);
        }
        if(q.size()==1) sum=q.top();
        while(q.size()!=1)
        {
            int t1=q.top();
            q.pop();
            int t2=q.top();
            q.pop();
            sum+=t1+t2;
            q.push(t1+t2);
        }
         printf("%d %d %.1f\n",8*(letter.size()),sum,1.0*8*(letter.size())/sum);
    }
}
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末钞螟,一起剝皮案震驚了整個濱河市纵寝,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌花吟,老刑警劉巖,帶你破解...
    沈念sama閱讀 222,183評論 6 516
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件厨姚,死亡現(xiàn)場離奇詭異衅澈,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)谬墙,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 94,850評論 3 399
  • 文/潘曉璐 我一進(jìn)店門今布,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人芭梯,你說我怎么就攤上這事险耀。” “怎么了玖喘?”我有些...
    開封第一講書人閱讀 168,766評論 0 361
  • 文/不壞的土叔 我叫張陵甩牺,是天一觀的道長。 經(jīng)常有香客問我累奈,道長贬派,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 59,854評論 1 299
  • 正文 為了忘掉前任澎媒,我火速辦了婚禮搞乏,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘戒努。我一直安慰自己请敦,他們只是感情好,可當(dāng)我...
    茶點故事閱讀 68,871評論 6 398
  • 文/花漫 我一把揭開白布储玫。 她就那樣靜靜地躺著侍筛,像睡著了一般。 火紅的嫁衣襯著肌膚如雪撒穷。 梳的紋絲不亂的頭發(fā)上匣椰,一...
    開封第一講書人閱讀 52,457評論 1 311
  • 那天,我揣著相機(jī)與錄音端礼,去河邊找鬼禽笑。 笑死入录,一個胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的佳镜。 我是一名探鬼主播僚稿,決...
    沈念sama閱讀 40,999評論 3 422
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼邀杏!你這毒婦竟也來了贫奠?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 39,914評論 0 277
  • 序言:老撾萬榮一對情侶失蹤望蜡,失蹤者是張志新(化名)和其女友劉穎唤崭,沒想到半個月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體脖律,經(jīng)...
    沈念sama閱讀 46,465評論 1 319
  • 正文 獨居荒郊野嶺守林人離奇死亡谢肾,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 38,543評論 3 342
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現(xiàn)自己被綠了小泉。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片芦疏。...
    茶點故事閱讀 40,675評論 1 353
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖微姊,靈堂內(nèi)的尸體忽然破棺而出酸茴,到底是詐尸還是另有隱情,我是刑警寧澤兢交,帶...
    沈念sama閱讀 36,354評論 5 351
  • 正文 年R本政府宣布薪捍,位于F島的核電站,受9級特大地震影響配喳,放射性物質(zhì)發(fā)生泄漏酪穿。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點故事閱讀 42,029評論 3 335
  • 文/蒙蒙 一晴裹、第九天 我趴在偏房一處隱蔽的房頂上張望被济。 院中可真熱鬧,春花似錦涧团、人聲如沸只磷。這莊子的主人今日做“春日...
    開封第一講書人閱讀 32,514評論 0 25
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽喳瓣。三九已至,卻和暖如春赞别,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背配乓。 一陣腳步聲響...
    開封第一講書人閱讀 33,616評論 1 274
  • 我被黑心中介騙來泰國打工仿滔, 沒想到剛下飛機(jī)就差點兒被人妖公主榨干…… 1. 我叫王不留惠毁,地道東北人。 一個月前我還...
    沈念sama閱讀 49,091評論 3 378
  • 正文 我出身青樓崎页,卻偏偏與公主長得像鞠绰,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子飒焦,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 45,685評論 2 360

推薦閱讀更多精彩內(nèi)容

  • **2014真題Directions:Read the following text. Choose the be...
    又是夜半驚坐起閱讀 9,581評論 0 23
  • 今天早上沒有吃早餐牺荠,中午吃過午飯正準(zhǔn)備午休翁巍,就被安排出去辦事情去了,外面又熱休雌,等待的時候感覺人都困乏得找不著北了灶壶,...
    李小小4161閱讀 293評論 0 0
  • 一見錯過,誰又知那時的心痛為何涌献。
    不過五年閱讀 62評論 0 1
  • 我定一個六點的鬧鐘 起床看花開放 茶幾上昨夜的面包 拿起就啃 防腐劑的味道也沒有變淡 洗臉?biāo)⒀辣称鸨嘲?沖進(jìn)人潮 ...
    angelican閱讀 261評論 0 1