題目
原問(wèn)題鏈接
編程統(tǒng)計(jì)出input.txt文件保存的文章中,每個(gè)單詞出現(xiàn)的次數(shù)劲蜻。文章內(nèi)容如下:
In this chapter we will be looking at files and directories and how to manipulate them. We will learn how to create files, open them, read, write and close them. We'll also learn how programs can manipulate directories, to create, scan and delete them, for example. After the last chapter's diversion into shells, we now start programming in C.
Before proceeding to the way UNIX handles file I/O, we'll review the concepts associated with files, directories and devices. To manipulate files and directories, we need to make system calls (the UNIX parallel of the Windows API), but there also exists a whole range of library functions, the standard I/O library (stdio), to make file handling more efficient.
這段文字來(lái)自網(wǎng)絡(luò)恋昼。為了統(tǒng)計(jì)更有意義御蒲,加入兩個(gè)條件:
統(tǒng)計(jì)過(guò)程中不考慮空格和標(biāo)點(diǎn)符號(hào)
不區(qū)分大小寫(xiě)(可以把所有字母轉(zhuǎn)成小寫(xiě)后參與統(tǒng)計(jì))
解題思路
做一個(gè)詞頻統(tǒng)計(jì)工具
觀察文本發(fā)現(xiàn)洁奈,兩個(gè)單詞之間都有空格隔開(kāi)稍味。這就比較好處理了井濒,因?yàn)槲覀兛梢院茌p松的實(shí)現(xiàn)一次讀取一個(gè)單詞灶似。
如果實(shí)際要處理的文本有標(biāo)點(diǎn)之后沒(méi)空格的情況,如
I love you.My program.
那我們還要加一個(gè)deliver函數(shù)把you和My分開(kāi)
如何存儲(chǔ)出現(xiàn)過(guò)的單詞
1.二維數(shù)組搭配一個(gè)一維數(shù)組實(shí)現(xiàn)
2.利用結(jié)構(gòu)體做一個(gè)鏈表實(shí)現(xiàn)
我選擇了第二種瑞你,因?yàn)槲矣X(jué)得結(jié)構(gòu)體是通往抽象編程的路酪惭,而且使用結(jié)構(gòu)體可讀性更好。
具體思路
1.一次從文本讀取一個(gè)字符串word
2.把字符串word的大寫(xiě)字母轉(zhuǎn)化為小寫(xiě)
3.把字符串word里的 ,.() 四種標(biāo)點(diǎn)消除(想做成一個(gè)好用的詞頻統(tǒng)計(jì)工具的話者甲,根據(jù)實(shí)際需要增添要消除的標(biāo)點(diǎn)符號(hào))
4.檢查word是否出現(xiàn)過(guò)
沒(méi)有出現(xiàn)過(guò): 把他存入結(jié)構(gòu)體
出現(xiàn)過(guò):找到存儲(chǔ)這個(gè)word的結(jié)構(gòu)體春感,并把里面的計(jì)數(shù)變量加一
5.打印所有存儲(chǔ)了單詞的結(jié)構(gòu)體
源碼
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define FILE_PATH "..\\要統(tǒng)計(jì)的文本.txt" //宏存儲(chǔ)文本相對(duì)路徑
#define WORD_LENGTH 60
#define LENGTH sizeof(WORD)
struct WORD
{
char word[WORD_LENGTH]; //存放單詞
int wordCount; //本單詞出現(xiàn)次數(shù)
struct WORD *pNext; //next指針
};
struct WORD pHeader; //鏈表的頭結(jié)點(diǎn)
WORD *countWord(char *word, WORD *pLastWord);
void changeToLower(char *word);
void checkPunct(char *word);
int checkCountWord(char *word);
void addcount(char * word);
WORD * addNewWord(char *word, WORD *pLastWord);
void printResult();
void freeAll();
void main()
{
WORD *pLastWord = &pHeader;//指向最后一個(gè)結(jié)點(diǎn)的指針
char tempWord[WORD_LENGTH];//儲(chǔ)存本次讀取的字符串
pHeader.pNext = NULL;//把頭節(jié)點(diǎn)的指向初始化為空
FILE *fp;
if ((fp = fopen(FILE_PATH, "r")) == NULL)//如果打開(kāi)文本為空
{
printf("System can`t find this file!\n");
exit(0);
}
while ((fscanf(fp, "%s", tempWord)) != EOF)//如果文件還沒(méi)讀取完
{
pLastWord = countWord(tempWord, pLastWord);//統(tǒng)計(jì)詞頻
}
fclose(fp);//關(guān)閉文件
printResult();//打印統(tǒng)計(jì)結(jié)果
freeAll();//釋放所有動(dòng)態(tài)申請(qǐng)的內(nèi)存
system("PAUSE");
}
WORD *countWord(char *word,WORD *pLastWord)
{
changeToLower(word);//把Word全部轉(zhuǎn)換為小寫(xiě)字母
checkPunct(word);//去除word中的 (),. 如果文本含有其他特殊標(biāo)點(diǎn),這里需要再增加一點(diǎn)代碼
if (checkCountWord(word))//如果本單詞在前文出現(xiàn)過(guò)
{
addcount(word);//把這個(gè)單詞的計(jì)數(shù)加一
}
else
{
pLastWord = addNewWord(word, pLastWord);//建立一個(gè)新的單詞結(jié)點(diǎn)
}
return pLastWord;
}
void changeToLower(char *word)//把字符串全部轉(zhuǎn)化為小寫(xiě)字母
{
int i;
for (i = 0; word[i] != '\0'; i++)
{
if (word[i] >= 65 && word[i] <= 90)
{
word[i] += 32;
}
}
}
void checkPunct(char *word)//去除標(biāo)點(diǎn)符號(hào)
{
int i, k;
for (i = 0; word[i] != '\0'; i++)//遍歷字符串
{
if (word[i] == '(' || word[i] == ')' || word[i] == ',' || word[i] == '.')//如果含有標(biāo)點(diǎn)
{
for (k = i; word[k] != '\0'; k++)
{
word[k] = word[k + 1];
}
i--;//防止"sdio)."這種連續(xù)多個(gè)特殊字符出現(xiàn)的情況虏缸,再次檢查當(dāng)前index下的字符
}
}
}
int checkCountWord(char *word)//不在已存儲(chǔ)單詞中返回0. 在就返回1
{
WORD *p;
p = &pHeader;
if (pHeader.pNext == NULL)//如果檢查的是第一個(gè)單詞
{
return 0;
}
do
{
p = p->pNext;
if (strcmp(word, p->word) == 0)
{
return 1;
}
} while (p->pNext != NULL);
return 0;
}
void addcount(char * word)//把出現(xiàn)過(guò)的單詞計(jì)數(shù)加一
{
WORD *p;
p = &pHeader;
do
{
p = p->pNext;
if (strcmp(word, p->word) == 0)//遍歷到重復(fù)的結(jié)點(diǎn)鲫懒,并把計(jì)數(shù) +1
{
p->wordCount++;
return;
}
} while (p->pNext != NULL);
}
WORD * addNewWord(char *word, WORD *pLastWord)//加一個(gè)新單詞結(jié)點(diǎn)
{
WORD *pTemp;
if (pHeader.pNext == NULL)//如果添加的是第一個(gè)結(jié)點(diǎn)
{
pHeader.pNext = (WORD *)malloc(LENGTH);
strcpy(pHeader.pNext->word, word);
pHeader.pNext->wordCount = 1;
pHeader.pNext->pNext = NULL;
pLastWord = pHeader.pNext;
}
else
{
pTemp = pLastWord;
pLastWord = (WORD *)malloc(LENGTH);
strcpy(pLastWord->word, word);
pLastWord->wordCount = 1;
pLastWord->pNext = NULL;
pTemp->pNext = pLastWord;
}
return pLastWord;
}
void printResult()//打印存儲(chǔ)的所有單詞結(jié)點(diǎn)
{
WORD *p;
int sign = 0;
p = &pHeader;
do
{
p = p->pNext;
printf("%-12s%d\t\t", p->word, p->wordCount);
sign++;
if (sign % 3 == 0)
{
printf("\n");
}
} while (p->pNext != NULL);
}
void freeAll()
{
WORD *p ,*piror;
p = pHeader.pNext;//頭節(jié)點(diǎn)不是動(dòng)態(tài)申請(qǐng)的不能釋放
do
{
piror = p;
p = p->pNext;
free(piror);
} while (p->pNext != NULL);
free(p);
}
執(zhí)行結(jié)果
后來(lái)改成了雙向鏈表并在打印前加了排序的函數(shù),下載了一本英文版的 <百年孤獨(dú)> 進(jìn)行了詞頻統(tǒng)計(jì)
很好用(亂碼是因?yàn)橄螺d的文本文件本身就有亂碼...)
就是運(yùn)行效率有些低
統(tǒng)計(jì)<百年孤獨(dú)>運(yùn)行時(shí)間記錄(800K的文本文件)
- 打印結(jié)果占用2秒
- 排序代碼占用1秒
- 其他代碼占用3秒
- 總計(jì)運(yùn)行時(shí)間6秒
總結(jié)
去年考英語(yǔ)六級(jí)時(shí)刽辙,我下載了一本英文電子版的<The Great Gatsby>窥岩,想統(tǒng)計(jì)出它的高頻單詞并背誦。我用了一個(gè)在線詞頻統(tǒng)計(jì)工具來(lái)統(tǒng)計(jì)宰缤,因?yàn)楦鞣N特殊字符的存在結(jié)果特別不理想颂翼。
現(xiàn)在學(xué)會(huì)了寫(xiě)自己的程序,我可以根據(jù)自己的實(shí)際需求來(lái)改寫(xiě)出合適的程序慨灭。It's really cool朦乏!
而且這個(gè)鏈表思維和我寫(xiě)的貪吃蛇實(shí)現(xiàn)蛇身體的思維很相似,我可以優(yōu)化我的貪吃蛇程序了氧骤!