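Preface
Not long ago I set out to write a small text-processing program that needed to read text efficiently. So I summarized the common ways of reading a file and timed each of them. I also read some of the relevant source code, hoping to explain the reasons behind the test results so that I can pick the best method whenever these operations come up again. To cut down on irrelevant noise, the parameter-checking code and the like has been omitted from the source excerpts, and some of the code has been simplified.
Five common ways to read a file
Using BufferedReader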
static long testBuffered(String fileName) throws IOException{
Long startTime = System.currentTimeMillis();
BufferedReader reader = new BufferedReader(new FileReader(fileName));
int count; // number of chars returned by each read
char[] buffer=new char[8*1024];
long sum = 0;
while((count=reader.read(buffer))!=-1)
{
sum += count;
}
reader.close();
Long endTime = System.currentTimeMillis();
System.out.println("Total time of BufferedReader is "+ (endTime - startTime) + " milliseconds, Total byte is " + sum);
return endTime - startTime;
}
BufferedReader is a very common way to read a file. The buffer size is 8*1024, chosen to match the buffer inside BufferedReader itself. BufferedReader's constructors are as follows:
private char cb[];
private static int defaultCharBufferSize = 8192;
public BufferedReader(Reader in, int sz) {
super(in);
this.in = in;
cb = new char[sz];
nextChar = nChars = 0;
}
public BufferedReader(Reader in) {
this(in, defaultCharBufferSize);
}
We can see that if no size is passed to the constructor, the default defaultCharBufferSize is used, that is, $8192=8*1024$. A private array cb is created with this size; I guess the name is short for "char buffer". When BufferedReader reads a run of characters, it calls the following function.
public int read(char cbuf[], int off, int len) throws IOException {
synchronized (lock) {
int n = read1(cbuf, off, len);
if (n <= 0) return n;
while ((n < len) && in.ready()) {
int n1 = read1(cbuf, off + n, len - n);
if (n1 <= 0) break;
n += n1;
}
return n;
}
}
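As you can see, it calls read1 in a loop to fill the passed-in array (cbuf) up to the requested length (len). After that comes a long chain of nested calls (shown as a call-chain diagram in the original post).
After all the nested calls, the read ends up in FileChannel, which is also the fourth method in this article, so it is no surprise that BufferedReader is the less efficient of the two.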
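To make the role of the constructor's size argument concrete, here is a minimal sketch that reads the same input through a BufferedReader built with different internal buffer sizes. The class name BufferSizeSketch and the method countChars are hypothetical and not part of the original test code; the sketch only shows that the size is a tuning knob and makes no claim about which value is fastest.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper, not from the original article.
class BufferSizeSketch {
    // Counts the chars delivered by a BufferedReader whose internal
    // array (cb in the excerpt above) has the given size.
    static long countChars(Reader source, int internalBufferSize) throws IOException {
        long total = 0;
        char[] chunk = new char[8 * 1024]; // caller-side buffer, as in testBuffered
        try (BufferedReader reader = new BufferedReader(source, internalBufferSize)) {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                total += n;
            }
        }
        return total;
    }
}
```

Calling countChars(new FileReader(fileName), 8192) behaves like the one-argument constructor; passing another size changes only how much is buffered internally, not the result.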
Using RandomAccessFile
static long testRandomAccess(String fileName) throws IOException{
Long startTime = System.currentTimeMillis();
RandomAccessFile reader = new RandomAccessFile(fileName,"r");
int count;
byte[] buffer=new byte[8*1024]; // read buffer
long sum = 0;
while((count=reader.read(buffer))!=-1){
sum += count;
}
reader.close();
Long endTime = System.currentTimeMillis();
System.out.println("Total time of RandomAccess is "+ (endTime - startTime) + " milliseconds, Total byte is " + sum);
return endTime - startTime;
}
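Why is the buffer in the code above also 8K? The answer lies in the call chain (shown as a diagram in the original post).
The call chain of this function is very short, and the work is done by a native function. The relevant code at the end of the chain, in io_util.c, is as follows: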
#define BUF_SIZE 8192
jint
readBytes(JNIEnv *env, jobject this, jbyteArray bytes,
jint off, jint len, jfieldID fid)
{
jint nread;
FD fd;
char stackBuf[BUF_SIZE];
char *buf = stackBuf;
if (len > BUF_SIZE) {
buf = malloc(len);
}
fd = GET_FD(this, fid);
nread = IO_Read(fd, buf, len);
(*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte *)buf);
if (buf != stackBuf) {
free(buf);
}
return nread;
}
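From the code above we can see that if the requested length is no greater than 8192, the local stack array is used directly. If it is greater, a block of memory of that size has to be allocated. That is why the test code uses a length of 8192: it avoids a heap allocation on every call. After all, malloc and free in C are not fast, and on a hot path they become an efficiency black hole.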
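The 8192-byte threshold described above can be respected even when the destination array is larger, by asking for at most 8192 bytes per call. The following is a minimal sketch under that assumption; ChunkedReadSketch and readInChunks are hypothetical names, not part of the original test code, and whether this is measurably faster than a single large read was not tested in this article.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not from the original article.
class ChunkedReadSketch {
    // Mirrors BUF_SIZE in the simplified io_util.c excerpt above.
    static final int CHUNK = 8192;

    // Fills dst from the file's current position, never requesting more
    // than CHUNK bytes in one call. Returns the number of bytes stored,
    // which is less than dst.length if the file ends first.
    static int readInChunks(RandomAccessFile file, byte[] dst) throws IOException {
        int filled = 0;
        while (filled < dst.length) {
            int want = Math.min(CHUNK, dst.length - filled);
            int n = file.read(dst, filled, want);
            if (n == -1) {
                break; // end of file
            }
            filled += n;
        }
        return filled;
    }
}
```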
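Using FileInputStream
This approach is also very common, and it works just as the name suggests: it turns the file into an input stream and reads from it. The generic InputStream implementation of read, which fills the array one byte at a time, is shown below. Note that FileInputStream overrides the array-based read methods with a native readBytes call, so this loop is only the fallback defined in the base class.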
public int read(byte b[], int off, int len) throws IOException {
int c = read();
if (c == -1) {
return -1;
}
b[off] = (byte)c;
int i = 1;
try {
for (; i < len ; i++) {
c = read();
if (c == -1) {
break;
}
b[off + i] = (byte)c;
}
} catch (IOException ee) {
}
return i;
}
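The main method later in this article calls testFileStream, but its code is not listed. A plausible reconstruction, modeled on the other test functions, might look like the sketch below. This is an assumption about what the author ran, not the original code; the class name FileStreamSketch and the helper countBytes are hypothetical.

```java
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical reconstruction, not the author's original code.
class FileStreamSketch {
    // Reads the whole file through an 8K byte array and returns the byte count.
    static long countBytes(String fileName) throws IOException {
        long sum = 0;
        byte[] buffer = new byte[8 * 1024];
        try (FileInputStream reader = new FileInputStream(fileName)) {
            int count;
            while ((count = reader.read(buffer)) != -1) {
                sum += count;
            }
        }
        return sum;
    }

    // Same shape as the other tests: time the read and print one summary line.
    static long testFileStream(String fileName) throws IOException {
        long startTime = System.currentTimeMillis();
        long sum = countBytes(fileName);
        long endTime = System.currentTimeMillis();
        System.out.println("Total time of FileStream is " + (endTime - startTime)
                + " milliseconds, Total byte is " + sum);
        return endTime - startTime;
    }
}
```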
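Using FileChannel with ByteBuffer
This approach is much the same as the end of the call chain in the first method, so in principle its speed should be acceptable. The code is as follows: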
static long testFileStreamChannel(String fileName) throws IOException{
Long startTime = System.currentTimeMillis();
FileInputStream reader = new FileInputStream(fileName);
FileChannel ch = reader.getChannel();
ByteBuffer bb = ByteBuffer.allocate(8*1024);
long sum = 0;
int count;
while ((count=ch.read(bb)) != -1 )
{
sum += count;
bb.clear();
}
reader.close();
Long endTime = System.currentTimeMillis();
System.out.println("Total time of FileStreamChannel is "+ (endTime - startTime) + " milliseconds, Total byte is " + sum);
return endTime - startTime;
}
The FileChannel read function it calls is implemented internally by read in IOUtil. The code is as follows:
static int read(FileDescriptor fd, ByteBuffer dst, long position, NativeDispatcher nd) throws IOException
{
if (dst instanceof DirectBuffer)
return readIntoNativeBuffer(fd, dst, position, nd);
ByteBuffer bb = Util.getTemporaryDirectBuffer(dst.remaining());
try {
int n = readIntoNativeBuffer(fd, bb, position, nd);
bb.flip();
if (n > 0)
dst.put(bb); // copy into the caller's buffer
return n;
} finally {
Util.offerFirstTemporaryDirectBuffer(bb);
}
}
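For a heap buffer, it requests a temporary off-heap DirectByteBuffer sized to the remaining space of the passed-in buffer, reads the file into it, and finally copies the data back into the passed-in buffer.
Using FileChannel with MappedByteBuffer
This kind of method is rarely seen. The test code is as follows: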
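IOUtil.read above takes a shortcut when dst is a DirectBuffer. The sketch below is a variant of the fourth test that uses ByteBuffer.allocateDirect, so the temporary buffer and the extra dst.put copy should not be needed. The class and method names are hypothetical, and no timing for this variant was measured in this article.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical variant, not from the original article.
class DirectBufferSketch {
    // Reads the whole file through a direct (off-heap) 8K buffer
    // and returns the number of bytes read.
    static long countBytes(String fileName) throws IOException {
        long sum = 0;
        ByteBuffer bb = ByteBuffer.allocateDirect(8 * 1024);
        try (FileChannel ch = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ)) {
            int count;
            while ((count = ch.read(bb)) != -1) {
                sum += count;
                bb.clear(); // reset position so the next read can refill the buffer
            }
        }
        return sum;
    }
}
```

Direct buffers are more expensive to allocate than heap buffers, so this only makes sense when one buffer is reused across many reads, as it is here.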
static long testFileStreamChannelMap(String fileName) throws IOException{
Long startTime = System.currentTimeMillis();
FileInputStream reader = new FileInputStream(fileName);
FileChannel ch = reader.getChannel();
MappedByteBuffer mb =ch.map( FileChannel.MapMode.READ_ONLY,0L, ch.size() ); // this is the key line
long sum = 0;
sum = mb.limit();
reader.close();
Long endTime = System.currentTimeMillis();
System.out.println("Total time of testFileStreamChannelMap is "+ (endTime - startTime) + " milliseconds, Total byte is " + sum);
return endTime - startTime;
}
Now let's look at what the commented line above actually does:
public MappedByteBuffer map(MapMode mode, long position, long size) throws IOException
{
int pagePosition = (int)(position % allocationGranularity);
long mapPosition = position - pagePosition;
long mapSize = size + pagePosition;
try {
// native method; returns the address of the memory mapping
addr = map0(imode, mapPosition, mapSize);
} catch (OutOfMemoryError x) {
// out of memory: trigger a GC manually, then retry
System.gc();
try {
Thread.sleep(100);
} catch (InterruptedException y) {
Thread.currentThread().interrupt();
}
try {
addr = map0(imode, mapPosition, mapSize);
} catch (OutOfMemoryError y) {
throw new IOException("Map failed", y);
}
}
// construct a Buffer from the address and return it
return Util.newMappedByteBufferR(isize, addr + pagePosition, mfd, um);
}
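The name Util.newMappedByteBufferR in the code above is easy to misread: what it actually constructs is a DirectByteBufferR, a subclass of DirectByteBuffer, which is in turn a subclass of MappedByteBuffer. In other words, it obtains the address at which the file is mapped in virtual memory and constructs a DirectByteBufferR over it. The advantage of this type is that it operates directly on that region of virtual memory.
Testing, analysis, and summary
We can now test the read speed of these five methods by reading generated files of roughly 1KB, 128KB, 256KB, 512KB, 768KB, 1MB, 128MB, 256MB, 512MB, 768MB, and 1GB.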
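The test function above only reads mb.limit() and never touches the mapped bytes themselves. For comparison, here is a minimal sketch that actually walks the mapping and reads every byte. MappedReadSketch and sumBytes are hypothetical names, and no timing for this variant was measured in this article. Note that a single mapping is limited to Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not from the original article.
class MappedReadSketch {
    // Maps the whole file read-only and adds up every byte, which forces
    // each mapped page to be touched.
    static long sumBytes(String fileName) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ)) {
            MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size());
            long sum = 0;
            while (mb.hasRemaining()) {
                sum += mb.get(); // relative get advances the position by one byte
            }
            return sum;
        }
    }
}
```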
static boolean generateFile(String fileName,long size){
try {
BufferedWriter writer = new BufferedWriter(new FileWriter(fileName),8*1024);
for(int count = 0;count < size;count ++){
writer.write('a');
}
writer.close();
}catch (IOException e){
e.printStackTrace();
return false;
}
return true;
}
public static void main(String[] args) {
String fileName = "data.txt";
long m = 1024 ;
long size[] = {m,m * 128,m * 256,m * 512,m * 768,m * 1024,m * 1024 * 128,m * 1024 * 256,m * 1024 * 512,m * 1024 * 768,m * 1024 * 1024};
for (int i = 0;i < size.length;i ++ ) {
generateFile(fileName, size[i]);
try {
testBuffered(fileName);
testRandomAccess(fileName);
testFileStream(fileName);
testFileStreamChannel(fileName);
testFileStreamChannelMap(fileName);
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("--------------------------------------------------------");
}
}
The test output is as follows:
Total time of BufferedReader is 1 milliseconds, Total byte is 1024
Total time of RandomAccess is 1 milliseconds, Total byte is 1024
Total time of FileStream is 0 milliseconds, Total byte is 1024
Total time of FileStreamChannel is 17 milliseconds, Total byte is 1024
Total time of testFileStreamChannelMap is 3 milliseconds, Total byte is 1024
--------------------------------------------------------
Total time of BufferedReader is 16 milliseconds, Total byte is 131072
Total time of RandomAccess is 0 milliseconds, Total byte is 131072
Total time of FileStream is 0 milliseconds, Total byte is 131072
Total time of FileStreamChannel is 0 milliseconds, Total byte is 131072
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 131072
--------------------------------------------------------
Total time of BufferedReader is 5 milliseconds, Total byte is 262144
Total time of RandomAccess is 1 milliseconds, Total byte is 262144
Total time of FileStream is 0 milliseconds, Total byte is 262144
Total time of FileStreamChannel is 1 milliseconds, Total byte is 262144
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 262144
--------------------------------------------------------
Total time of BufferedReader is 9 milliseconds, Total byte is 524288
Total time of RandomAccess is 0 milliseconds, Total byte is 524288
Total time of FileStream is 0 milliseconds, Total byte is 524288
Total time of FileStreamChannel is 1 milliseconds, Total byte is 524288
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 524288
--------------------------------------------------------
Total time of BufferedReader is 10 milliseconds, Total byte is 786432
Total time of RandomAccess is 0 milliseconds, Total byte is 786432
Total time of FileStream is 0 milliseconds, Total byte is 786432
Total time of FileStreamChannel is 5 milliseconds, Total byte is 786432
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 786432
--------------------------------------------------------
Total time of BufferedReader is 2 milliseconds, Total byte is 1048576
Total time of RandomAccess is 1 milliseconds, Total byte is 1048576
Total time of FileStream is 0 milliseconds, Total byte is 1048576
Total time of FileStreamChannel is 3 milliseconds, Total byte is 1048576
Total time of testFileStreamChannelMap is 1 milliseconds, Total byte is 1048576
--------------------------------------------------------
Total time of BufferedReader is 146 milliseconds, Total byte is 134217728
Total time of RandomAccess is 43 milliseconds, Total byte is 134217728
Total time of FileStream is 44 milliseconds, Total byte is 134217728
Total time of FileStreamChannel is 89 milliseconds, Total byte is 134217728
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 134217728
--------------------------------------------------------
Total time of BufferedReader is 230 milliseconds, Total byte is 268435456
Total time of RandomAccess is 88 milliseconds, Total byte is 268435456
Total time of FileStream is 85 milliseconds, Total byte is 268435456
Total time of FileStreamChannel is 107 milliseconds, Total byte is 268435456
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 268435456
--------------------------------------------------------
Total time of BufferedReader is 463 milliseconds, Total byte is 536870912
Total time of RandomAccess is 193 milliseconds, Total byte is 536870912
Total time of FileStream is 393 milliseconds, Total byte is 536870912
Total time of FileStreamChannel is 379 milliseconds, Total byte is 536870912
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 536870912
--------------------------------------------------------
Total time of BufferedReader is 844 milliseconds, Total byte is 805306368
Total time of RandomAccess is 282 milliseconds, Total byte is 805306368
Total time of FileStream is 273 milliseconds, Total byte is 805306368
Total time of FileStreamChannel is 255 milliseconds, Total byte is 805306368
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 805306368
--------------------------------------------------------
Total time of BufferedReader is 1097 milliseconds, Total byte is 1073741824
Total time of RandomAccess is 407 milliseconds, Total byte is 1073741824
Total time of FileStream is 348 milliseconds, Total byte is 1073741824
Total time of FileStreamChannel is 395 milliseconds, Total byte is 1073741824
Total time of testFileStreamChannelMap is 0 milliseconds, Total byte is 1073741824
--------------------------------------------------------
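We can see that the first method takes the longest, which fully matches our expectations. The last one operates directly on memory, so its time is negligible. Note, though, that the test only maps the file and reads mb.limit(); it never reads the mapped bytes, so the 0 ms reflects the cost of creating the mapping rather than of reading the data. Finally, because a buffer cache has to be constructed first, the FileChannel methods also spend some time on small files.
From this we can draw a few conclusions. BufferedReader is comparatively slow in every case and can be dropped entirely. If you only read a small file once, do not use the FileChannel-based methods. Keep the input buffer no larger than 8K: most default buffers are 8K, so the sizes line up easily. Although FileChannel with MappedByteBuffer did very well on large files in this test, it is still used fairly rarely in practice, because it has many problems, such as memory usage and unpredictable file closing: a file opened this way is released only at garbage collection, and that point in time is not deterministic. Most programmers loathe these problems, since the behavior is outside their control. Bugs that cannot be reproduced are the hardest to fix.
Please credit the source when reposting: http://djjowfy.com/2017/09/10/對(duì)java中關(guān)于文件讀取方法效率的比較/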
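Several of the timings above are 0 or 1 milliseconds, and the very first FileChannel read took noticeably longer than later ones. One way to look at such effects separately is to run each reader several times and record every run with System.nanoTime. The following is only a sketch: TimingSketch, IoTask, and timeNanos are hypothetical names, and it is not a substitute for a proper benchmarking harness.

```java
import java.io.IOException;

// Hypothetical helper, not from the original article.
class TimingSketch {
    // A read operation that returns the number of bytes (or chars) it consumed.
    interface IoTask {
        long run() throws IOException;
    }

    // Runs the task `runs` times and returns the elapsed nanoseconds of each run,
    // so that the first (cold) run can be compared with the later ones.
    static long[] timeNanos(IoTask task, int runs) throws IOException {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }
}
```

Any of the read loops in this article can be passed in as a lambda that returns its byte count; elapsed[0] then shows the one-time setup cost and the remaining entries the steady-state cost.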