該文章為本系列的第二篇
第一篇為 : Java POI操作Excel(User Model)
第三篇為 : Java POI操作Excel(Event User Model)
第四篇為 : 使用POI封裝一個輕量級Excel解析框架
前言
在上一篇文章中,我們不僅對POI有了基本的認(rèn)識,也學(xué)會了使用POI的User model(Dom)的方式進(jìn)行Excel解析.
但是這種方式在進(jìn)行大數(shù)據(jù)量Excel解析的時候,要一次性的解析完整個Excel.而在我們的系統(tǒng)中,是存在多個用戶同時上傳的情況,這種情況對內(nèi)存的壓力是很大的.輕則Full GC.嚴(yán)重的話甚至?xí)﨩OM.
所以,POI還提供了抽象程度相比User Model更低,但是也更節(jié)約內(nèi)存的方式來解析Excel.也就是今天要整理的Event Model(Sax)的方式.
xls & xlsx的存儲形式
xls采用的是一種名為BIFF8(BinaryInterchangeFileFormat)的文件格式.而xlsx則是采用OOXML(Office open Xml)的格式存儲數(shù)據(jù).
xls格式Excel的數(shù)據(jù)組織由于使用的是二進(jìn)制,沒找到太直觀的方式展示.故不做特殊說明(實(shí)際上是沒搞明白怎么看)
xlsx格式Excel則采用Xml來組織數(shù)據(jù).所以我們可以很輕易的看到.如下圖:
上面的xml只是sheet1對應(yīng)的xml.而sheet2則對應(yīng)另一個xml.
Event Model節(jié)約內(nèi)存原理
User Model的缺點(diǎn)是一次性將文件讀入內(nèi)存,構(gòu)建一顆Dom樹.并且在POI對Excel的抽象中,每一行,每一個單元格都是一個對象.當(dāng)文件大,數(shù)據(jù)量多的時候?qū)?nèi)存的占用可想而知.
Event Model使用的方式是邊讀取邊解析,并且不會將這些數(shù)據(jù)封裝成Row,Cell這樣的對象.而都只是普通的數(shù)字或者是字符串.并且這些解析出來的對象是不需要一直駐留在內(nèi)存中,而是解析完使用后就可以回收.
所以相比于User Model,Event Model更節(jié)省內(nèi)存.效率也更高.但是作為代價,相比User Model功能更少.門檻也要高一些.
解析
關(guān)于解析代碼,使用POI官網(wǎng)提供的代碼.并做一些局部的修改來達(dá)到學(xué)習(xí)的目的.
開發(fā)環(huán)境
Java版本 :1.8.0_40
Maven版本:3.3.9
POI版本 :3.15
解析xls
Excel內(nèi)容如下圖:
解析代碼:
public class ReadExcel2003 implements HSSFListener {
private SSTRecord sstrec;
/**
* This method listens for incoming records and handles them as required.
*
* @param record The record that was found while reading.
*/
public void processRecord(Record record) {
switch (record.getSid()) {
// the BOFRecord can represent either the beginning of a sheet or the workbook
case BOFRecord.sid:
BOFRecord bof = (BOFRecord) record;
if (bof.getType() == bof.TYPE_WORKBOOK) {
System.out.println("Encountered workbook");
// assigned to the class level member
} else if (bof.getType() == bof.TYPE_WORKSHEET) {
System.out.println("Encountered sheet reference");
}
break;
case BoundSheetRecord.sid:
BoundSheetRecord bsr = (BoundSheetRecord) record;
System.out.println("New sheet named: " + bsr.getSheetname());
break;
case RowRecord.sid:
RowRecord rowrec = (RowRecord) record;
System.out.println("Row found, first column at "
+ rowrec.getFirstCol() + " last column at " + rowrec.getLastCol());
break;
case NumberRecord.sid:
NumberRecord numrec = (NumberRecord) record;
System.out.println("Cell found with value " + numrec.getValue()
+ " at row " + numrec.getRow() + " and column " + numrec.getColumn());
break;
// SSTRecords store a array of unique strings used in Excel.
case SSTRecord.sid:
sstrec = (SSTRecord) record;
for (int k = 0; k < sstrec.getNumUniqueStrings(); k++) {
System.out.println("String table value " + k + " = " + sstrec.getString(k));
}
break;
case LabelSSTRecord.sid:
LabelSSTRecord lrec = (LabelSSTRecord) record;
System.out.println("String cell found with value "
+ sstrec.getString(lrec.getSSTIndex()));
break;
}
}
/**
* Read an excel file and spit out what we find.
*
* @param args Expect one argument that is the file to read.
* @throws IOException When there is an error processing the file.
*/
public static void main(String[] args) throws IOException {
// create a new file input stream with the input file specified
// at the command line
FileInputStream fin = new FileInputStream("workbook05.xls");
// create a new org.apache.poi.poifs.filesystem.Filesystem
POIFSFileSystem poifs = new POIFSFileSystem(fin);
// get the Workbook (excel part) stream in a InputStream
InputStream din = poifs.createDocumentInputStream("Workbook");
// construct out HSSFRequest object
HSSFRequest req = new HSSFRequest();
// lazy listen for ALL records with the listener shown above
req.addListenerForAllRecords(new ReadExcel2003());
// create our event factory
HSSFEventFactory factory = new HSSFEventFactory();
// process our events based on the document input stream
factory.processEvents(req, din);
// once all the events are processed close our file input stream
fin.close();
// and our document input stream (don't want to leak these!)
din.close();
System.out.println("done.");
}
}
輸出
Encountered workbook
New sheet named: sheet01
String table value 0 = This is a String
Encountered sheet reference
Row found, first column at 0 last column at 5
Row found, first column at 0 last column at 5
Row found, first column at 0 last column at 5
Cell found with value 1.0 at row 0 and column 0
Cell found with value 42932.0 at row 0 and column 1
String cell found with value This is a String
Cell found with value 42932.0 at row 0 and column 3
Cell found with value 2.0 at row 1 and column 1
Cell found with value 42932.0 at row 1 and column 2
String cell found with value This is a String
Cell found with value 42932.0 at row 1 and column 4
Cell found with value 3.0 at row 2 and column 2
Cell found with value 42932.0 at row 2 and column 3
String cell found with value This is a String
done.
我們實(shí)現(xiàn)了HSSFListener接口的processRecord方法來自定義了當(dāng)我們遇到不同Record之后的操作.
現(xiàn)在我們的需求就是Event Model要像User Model那種解析出指定Sheet,指定Row,指定Cell的數(shù)據(jù).
我們先來對代碼一下小觀察:
- 只能一次性的解析所有的sheet數(shù)據(jù).不能分頁解析.(這個功能在Dom解析的時候是有的)
- 解析Row中數(shù)據(jù)的時候,并不會解析空Cell
- 先解析了所有的Row,然后才解析了Cell.而不是Row,Cell的嵌套關(guān)系.
- 但是我們可以獲取到Cell對應(yīng)的行號.
通過上面的分析,我們就具備了滿足前面所說需求的能力.
具體來說,換頁的時候,Cell對應(yīng)的行號會變成0.我們可以通過這個方案,來判斷到了第幾個Sheet.
關(guān)于指定的Row數(shù)據(jù),我們可以根據(jù)判斷Cell的行號來判斷是否需要解析.
而具體的Cell數(shù)據(jù),我們可以在一開始傳入列數(shù).將每一行的數(shù)據(jù)解析成一個指定大小的List.并且用空字符串填滿這個List.在解析過程中,使用Cell的列號去覆蓋List指定位置的值,這樣.最終的List中就有這一行中包括空值的數(shù)據(jù).想要第一列就獲取第幾列即可.
解析xlsx
excel的話,我們?nèi)耘f使用上面的,僅僅把格式轉(zhuǎn)換為xlsx.
關(guān)于代碼我們依舊使用官網(wǎng)的demo代碼.
解析代碼:
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.InputStream;
import java.util.Iterator;
public class ReadExcel2007 extends DefaultHandler{
public void processOneSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
// To look up the Sheet Name / Sheet Order / rID,
// you need to process the core Workbook stream.
// Normally it's of the form rId# or rSheet#
InputStream sheet1 = r.getSheet("rId1");
InputSource sheetSource = new InputSource(sheet1);
parser.parse(sheetSource);
sheet1.close();
}
public void processAllSheets(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
Iterator<InputStream> sheets = r.getSheetsData();
while(sheets.hasNext()) {
System.out.println("\nProcessing new sheet:");
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
System.out.println("");
}
}
public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
XMLReader parser =
XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser"
);
ContentHandler handler = new SheetHandler(sst);
parser.setContentHandler(handler);
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private static class SheetHandler extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if(name.equals("c")) {
// Print the cell reference
System.out.print(attributes.getValue("r") + " - ");
// Figure out if the value is an index in the SST
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
System.out.println(lastContents);
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
lastContents += new String(ch, start, length);
}
}
public static void main(String[] args) throws Exception {
ReadExcel2007 example = new ReadExcel2007();
example.processOneSheet("workbook07.xlsx");
example.processAllSheets("workbook07.xlsx");
}
輸出結(jié)果
A1 - 1
B1 - 42932
C1 - This is a String
D1 - 42932
C2 - D3 -
Processing new sheet:
A1 - 1
B1 - 42932
C1 - This is a String
D1 - 42932
C2 - D3 -
Processing new sheet:
A1 - 111
學(xué)過xml解析的肯定不會陌生,這段代碼完全就是解析Xml的Sax解析代碼.如果你只看到了官網(wǎng)的這段demo代碼,你一定是一臉懵逼的.你并不知道xml的格式.而我已經(jīng)在前面為你提供了xml的格式,相信你就不會因此而迷茫了.
幾點(diǎn)說明:
- xlsx的Event Model是可以指定sheet來解析的.并且是以1為第一個sheet.
- 如果單元格里沒有值,則不會有v標(biāo)簽.在真正解析的時候要考慮到這個因素.
- 在做User Model的時候也說過,Excel中使用SST來存儲字符串常量.所以無論是Xls,Xlsx都有SST.具體的單元格的真是存儲實(shí)際是SST中的下標(biāo).
總結(jié)
本文介紹了使用Event Model解析兩種格式Excel的Demo代碼.官網(wǎng)中也沒有對這些代碼如何應(yīng)用于生產(chǎn)提供具體的指導(dǎo).所以在一開始學(xué)習(xí)和應(yīng)用于項(xiàng)目中,都會遇到一些不可避免的問題.
但是在遇到的問題的時候,我們一定要記住,User Model也是基于Event Model.只不過是抽象層級更高.所以理論上的Event Model是可以實(shí)現(xiàn)所有User Model的功能.只不過需要我們做更多的思考和探索.但是我相信通過不停的思考與實(shí)踐.沒有問題是解決不掉的.