TPCx-BB is a big data benchmark. It simulates 30 application scenarios of a retailer and executes 30 queries to measure the end-to-end performance, both hardware and software, of a Hadoop-based big data system. Some of the scenarios also use machine learning algorithms (clustering, linear regression, etc.). To understand the performance of the system under test, you need a thorough understanding of the whole TPCx-BB test flow. This article analyzes the source code of the TPCx-BB kit in detail, hopefully helping readers understand TPCx-BB better.
Code structure
The main directory (`$BENCH_MARK_HOME`) contains the following subdirectories:
- bin
- conf
- data-generator
- engines
- tools
`bin` contains the scripts invoked during a run, one per module: `bigBench`, `cleanLogs`, `logEnvInformation`, `runBenchmark`, `zipLogs`, etc.
`conf` contains two configuration files: `bigBench.properties` and `userSettings.conf`.
`bigBench.properties` mainly sets `workload` (the benchmark phases to execute) and `power_test_0` (the SQL queries to run during the `POWER_TEST` phase).
The default `workload`:
workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1
The default `power_test_0`: `1-30`
`userSettings.conf` holds basic settings: the Java environment, default benchmark settings (database, engine, map_tasks, scale_factor, ...), the Hadoop environment, HDFS config and paths, and Hadoop data generation options (`DFS_REPLICATION`, `HADOOP_JVM_ENV`, ...).
`data-generator` contains the scripts and configuration files related to data generation, described in detail below.
`engines` contains the four engines supported by TPCx-BB: biginsights, hive, impala, and spark_sql. The default engine is hive. In fact, only the hive directory is non-empty; the other three are empty, presumably not yet implemented.
`tools` contains two jar files: `HadoopClusterExec.jar` and `RunBigBench.jar`. `RunBigBench.jar` is central to running a TPCx-BB test; most of the driver logic lives in that jar.
Data generation
Everything related to data generation lives under the `data-generator` directory, which contains `pdgf.jar` and three subdirectories: `config`, `dicts`, and `extlib`.
`pdgf.jar` is the Java data generator. `config` contains two configuration files: `bigbench-generation.xml` and `bigbench-schema.xml`.
`bigbench-generation.xml` mainly defines the raw data to generate (flat files, not database tables): which tables exist, each table's name and size, the output directory, file suffix, delimiter, character encoding, and so on.
<schema name="default">
<tables>
<!-- not refreshed tables -->
<!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons -->
<table name="income_band"></table>
<table name="reason"></table>
<table name="ship_mode"></table>
<table name="web_site"></table>
<!-- /tables not used in benchmark -->
<!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->
<table name="date_dim" static="true"></table>
<table name="time_dim" static="true"></table>
<table name="customer_demographics" static="true"></table>
<table name="household_demographics" static="true"></table>
<!-- /static tables -->
<!-- "normal" tables. split over all nodes. not generated during refresh -->
<table name="store"></table>
<table name="warehouse"></table>
<table name="promotion"></table>
<table name="web_page"></table>
<!-- /"normal" tables.-->
<!-- /not refreshed tables -->
<!--
refreshed tables. Generated on all nodes.
Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)
In "normal"-Phase generate table rows: [0,REFRESH_PERCENTAGE*Table.Size];
In "refresh"-Phase generate table rows: [REFRESH_PERCENTAGE*Table.Size+1, Table.Size]
.Has effect only if ${REFRESH_SYSTEM_ENABLED}==1.
-->
<table name="customer">
<scheduler name="DefaultScheduler">
<partitioner
name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">
<prePartition><![CDATA[
if(${REFRESH_SYSTEM_ENABLED}>0){
int tableID = table.getTableID();
int timeID = 0;
long lastTableRow=table.getSize()-1;
long rowStart;
long rowStop;
boolean exclude=false;
long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});
if(${REFRESH_PHASE}>0){
//Refresh part
rowStart = lastTableRow - refreshRows +1;
rowStop = lastTableRow;
if(refreshRows<=0){
exclude=true;
}
}else{
//"normal" part
rowStart = 0;
rowStop = lastTableRow - refreshRows;
}
return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);
}else{
//DEFAULT
return getParentPartitioner().getDefaultPrePartition(project, table);
}
]]></prePartition>
</partitioner>
</scheduler>
</table>
<output name="SplitFileOutputWrapper">
<!-- DEFAULT output for all Tables, if no table specific output is specified-->
<output name="CSVRowOutput">
<fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.dat</fileEnding>
<delimiter>|</delimiter>
<charset>UTF-8</charset>
<sortByRowID>true</sortByRowID>
</output>
<output name="StatisticsOutput" active="1">
<size>${item_size}</size><!-- a counter per item .. initialize later-->
<fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.csv</fileEnding>
<delimiter>,</delimiter>
<header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR-->
</header>
<footer></footer>
`bigbench-schema.xml` sets many parameters. Some control table scale, such as each table's size (row count); most concern individual columns, such as start and end dates, gender ratio, marriage ratio, and upper/lower bounds of metrics. It also defines exactly how each column is generated and its constraints. Examples:
The amount of data generated is controlled by `SCALE_FACTOR` (`-f`). With `-f 1` the total generated data is roughly 1 GB; with `-f 100`, roughly 100 GB. How does `SCALE_FACTOR` control the generated data size so precisely?
Because `SCALE_FACTOR` (`-f`) determines the row count of every table. As shown below, the `customer` table has `100000.0d * ${SF_sqrt}` rows: with `-f 1` it has `100000 * sqrt(1) = 100,000` rows; with `-f 100` it has `100000 * sqrt(100) = 1,000,000` rows.
<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>
<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>
<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property>
<property name="${gender_likelihood}" type="double">0.5</property>
<property name="${married_likelihood}" type="double">0.3</property>
<property name="${WP_LINK_MIN}" type="double">2</property>
<property name="${WP_LINK_MAX}" type="double">25</property>
<field name="d_date" size="13" type="CHAR" primary="false">
<gen_DateTime>
<disableRng>true</disableRng>
<useFixedStepSize>true</useFixedStepSize>
<startDate>${date_dim_begin_date}</startDate>
<endDate>${date_dim_end_date}</endDate>
<outputFormat>yyyy-MM-dd</outputFormat>
</gen_DateTime>
</field>
<field name="t_time_id" size="16" type="CHAR" primary="false">
<gen_ConvertNumberToString>
<gen_Id/>
<size>16.0</size>
<characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>
</gen_ConvertNumberToString>
</field>
<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">
<gen_Null probability="${NULL_CHANCE}">
<gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />
</gen_Null>
</field>
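The row-count scaling described above can be sanity-checked outside the kit. Below is a minimal sketch (not code from TPCx-BB itself) of how `${customer_size} = 100000.0d * ${SF_sqrt}` grows with the scale factor:

```java
// Sketch: how the customer table's row count scales with SCALE_FACTOR (-f).
// Mirrors the property 100000.0d * ${SF_sqrt}; illustrative only.
public class ScaleFactorDemo {
    static long customerSize(double scaleFactor) {
        double sfSqrt = Math.sqrt(scaleFactor); // ${SF_sqrt}
        return (long) (100000.0d * sfSqrt);     // ${customer_size}
    }

    public static void main(String[] args) {
        System.out.println(customerSize(1));   // 100000 rows at -f 1
        System.out.println(customerSize(100)); // 1000000 rows at -f 100
    }
}
```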
`dicts` contains dictionary files such as city.dict, country.dict, male.dict, female.dict, state.dict, and mail_provider.dict; the field values of each generated record are drawn from these dictionaries.
`extlib` contains the external jar dependencies, such as `lucene-core-4.9.0.jar`, `commons-net-3.3.jar`, `xml-apis.jar`, and `log4j-1.2.15.jar`.
Summary:
`pdgf.jar` reads the configuration in `bigbench-generation.xml` and `bigbench-schema.xml` (table names, column names, row counts, generation rules for each column), pulls the value of each field of each record from the corresponding `.dict` files under `dicts`, and produces the raw data.
A record from the `customer` table looks like this:
0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com gdMmGdU9
Running a TPCx-BB test with `-f 1` (`SCALE_FACTOR = 1`) produces roughly 1 GB of raw data in total (977 MB + 8.6 MB):
[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data
12.7 M 38.0 M /user/root/benchmarks/bigbench/data/customer
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/customer_address
74.2 M 222.5 M /user/root/benchmarks/bigbench/data/customer_demographics
14.7 M 44.0 M /user/root/benchmarks/bigbench/data/date_dim
151.5 K 454.4 K /user/root/benchmarks/bigbench/data/household_demographics
327 981 /user/root/benchmarks/bigbench/data/income_band
405.3 M 1.2 G /user/root/benchmarks/bigbench/data/inventory
6.5 M 19.5 M /user/root/benchmarks/bigbench/data/item
4.0 M 12.0 M /user/root/benchmarks/bigbench/data/item_marketprices
53.7 M 161.2 M /user/root/benchmarks/bigbench/data/product_reviews
45.3 K 135.9 K /user/root/benchmarks/bigbench/data/promotion
3.0 K 9.1 K /user/root/benchmarks/bigbench/data/reason
1.2 K 3.6 K /user/root/benchmarks/bigbench/data/ship_mode
3.3 K 9.9 K /user/root/benchmarks/bigbench/data/store
4.1 M 12.4 M /user/root/benchmarks/bigbench/data/store_returns
88.5 M 265.4 M /user/root/benchmarks/bigbench/data/store_sales
4.9 M 14.6 M /user/root/benchmarks/bigbench/data/time_dim
584 1.7 K /user/root/benchmarks/bigbench/data/warehouse
170.4 M 511.3 M /user/root/benchmarks/bigbench/data/web_clickstreams
7.9 K 23.6 K /user/root/benchmarks/bigbench/data/web_page
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/web_returns
127.6 M 382.8 M /user/root/benchmarks/bigbench/data/web_sales
8.6 K 25.9 K /user/root/benchmarks/bigbench/data/web_site
Execution flow
To run a TPCx-BB test, change into the TPCx-BB source directory, enter the `bin` directory, and execute:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
Here `-f`, `-m`, `-s`, and `-j` are options that users can set according to cluster capacity and their needs. Any option left unspecified falls back to the default defined in `userSettings.conf` under the `conf` directory:
export BIG_BENCH_DEFAULT_DATABASE="bigbench"
export BIG_BENCH_DEFAULT_ENGINE="hive"
export BIG_BENCH_DEFAULT_MAP_TASKS="80"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"
export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"
The defaults are `MAP_TASKS` = 80 (`-m 80`), `SCALE_FACTOR` = 1000 (`-f 1000`), and `NUMBER_OF_PARALLEL_STREAMS` = 2 (`-s 2`).
All available options and their meanings:
General options:
-d  database to use (default: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)
-e  engine to use (default: $BIG_BENCH_DEFAULT_ENGINE -> hive)
-f  scale factor of the data set (default: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)
-h  show help
-m  number of map tasks for data generation (default: $BIG_BENCH_DEFAULT_MAP_TASKS)
-s  number of parallel streams (default: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)
Driver specific options:
-a  run in pretend mode
-b  print the bash scripts invoked during execution to stdout
-i  specify the phases to run (see $BIG_BENCH_CONF_DIR/bigBench.properties for details)
-j  specify the queries to run (default: 1-30, i.e. all 30 queries)
-U  unlock expert mode
If `-U` is given (expert mode unlocked), the script prints:
echo "EXPERT MODE ACTIVE"
echo "WARNING - INTERNAL USE ONLY:"
echo "Only set manually if you know what you are doing!"
echo "Ignoring them is probably the best solution"
echo "Running individual modules:"
echo "Usage: `basename $0` module [options]"
-D  specify the query part to debug; most queries have only one part
-p  benchmark phase to run (default: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)
-q  specify a single query to run
-t  specify the stream to use for that query
-v  SQL script for metastore population (default: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})
-w  SQL script for metastore refresh (default: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})
-y  file with additional user-defined query parameters (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)
-z  file with additional user-defined engine settings (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)
List of available modules:
$BIG_BENCH_ENGINE_BIN_DIR
Back to the command we used to launch the test:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
bigBench
`bigBench` is the main script; `runBenchmark` is a module.
`bigBench` sets many environment variables (paths, engine, number of streams, and so on), because `RunBigBench.jar`, invoked later, reads these variables from inside the Java program.
The first part of `bigBench` is groundwork: setting environment variables, parsing user options, granting file permissions, setting paths, and so on. Its final step calls the `runModule()` function of `runBenchmark`:
-
Set basic paths
export BIG_BENCH_VERSION="1.0"
export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"
export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"
export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"
export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"
export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"
-
Specify the paths of `core-site.xml` and `hdfs-site.xml`; data generation uses the Hadoop cluster and writes to HDFS

```bash
export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"
export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"
```
-
Grant execute permission to all executable files in the package (.sh/.jar/.py)

```bash
find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} +
find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} +
find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +
```
-
Set the path of `userSettings.conf` and source it
USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"
if [ ! -f "$USER_SETTINGS" ]
then
  echo "User settings file $USER_SETTINGS not found"
  exit 1
else
  source "$USER_SETTINGS"
fi
-
Parse input arguments and options, and configure accordingly
The first argument must be the module_name.
If there are no arguments, or the first argument starts with "-", the user did not specify a module to run.
if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]
then
  export MODULE_NAME=""
  SHOW_HELP="1"
else
  export MODULE_NAME="$1"
  shift
fi
export LIST_OF_USER_OPTIONS="$@"
Parse the user's input and set environment variables accordingly:
```bash
while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do
case "$OPT" in
# script options
d)
#echo "-d was triggered, Parameter: $OPTARG" >&2
USER_DATABASE="$OPTARG"
;;
D)
#echo "-D was triggered, Parameter: $OPTARG" >&2
DEBUG_QUERY_PART="$OPTARG"
;;
e)
#echo "-e was triggered, Parameter: $OPTARG" >&2
USER_ENGINE="$OPTARG"
;;
f)
#echo "-f was triggered, Parameter: $OPTARG" >&2
USER_SCALE_FACTOR="$OPTARG"
;;
h)
#echo "-h was triggered, Parameter: $OPTARG" >&2
SHOW_HELP="1"
;;
m)
#echo "-m was triggered, Parameter: $OPTARG" >&2
USER_MAP_TASKS="$OPTARG"
;;
p)
#echo "-p was triggered, Parameter: $OPTARG" >&2
USER_BENCHMARK_PHASE="$OPTARG"
;;
q)
#echo "-q was triggered, Parameter: $OPTARG" >&2
QUERY_NUMBER="$OPTARG"
;;
s)
#echo "-t was triggered, Parameter: $OPTARG" >&2
USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG"
;;
t)
#echo "-s was triggered, Parameter: $OPTARG" >&2
USER_STREAM_NUMBER="$OPTARG"
;;
U)
#echo "-U was triggered, Parameter: $OPTARG" >&2
USER_EXPERT_MODE="1"
;;
v)
#echo "-v was triggered, Parameter: $OPTARG" >&2
USER_POPULATE_FILE="$OPTARG"
;;
w)
#echo "-w was triggered, Parameter: $OPTARG" >&2
USER_REFRESH_FILE="$OPTARG"
;;
y)
#echo "-y was triggered, Parameter: $OPTARG" >&2
USER_QUERY_PARAMS_FILE="$OPTARG"
;;
z)
#echo "-z was triggered, Parameter: $OPTARG" >&2
USER_ENGINE_SETTINGS_FILE="$OPTARG"
;;
# driver options
a)
#echo "-a was triggered, Parameter: $OPTARG" >&2
export USER_PRETEND_MODE="1"
;;
b)
#echo "-b was triggered, Parameter: $OPTARG" >&2
export USER_PRINT_STD_OUT="1"
;;
i)
#echo "-i was triggered, Parameter: $OPTARG" >&2
export USER_DRIVER_WORKLOAD="$OPTARG"
;;
j)
#echo "-j was triggered, Parameter: $OPTARG" >&2
export USER_DRIVER_QUERIES_TO_RUN="$OPTARG"
;;
?)
echo "Invalid option: -$OPTARG" >&2
exit 1
;;
:)
echo "Option -$OPTARG requires an argument." >&2
exit 1
;;
esac
done
```
Set global variables. If the user specified a value for an option, use it; otherwise fall back to the default.
```bash
export BIG_BENCH_EXPERT_MODE="${USER_EXPERT_MODE:-"0"}"
export SHOW_HELP="${SHOW_HELP:-"0"}"
export BIG_BENCH_DATABASE="${USER_DATABASE:-"$BIG_BENCH_DEFAULT_DATABASE"}"
export BIG_BENCH_ENGINE="${USER_ENGINE:-"$BIG_BENCH_DEFAULT_ENGINE"}"
export BIG_BENCH_MAP_TASKS="${USER_MAP_TASKS:-"$BIG_BENCH_DEFAULT_MAP_TASKS"}"
export BIG_BENCH_SCALE_FACTOR="${USER_SCALE_FACTOR:-"$BIG_BENCH_DEFAULT_SCALE_FACTOR"}"
export BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-"$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS"}"
export BIG_BENCH_BENCHMARK_PHASE="${USER_BENCHMARK_PHASE:-"$BIG_BENCH_DEFAULT_BENCHMARK_PHASE"}"
export BIG_BENCH_STREAM_NUMBER="${USER_STREAM_NUMBER:-"0"}"
export BIG_BENCH_ENGINE_DIR="$BIG_BENCH_HOME/engines/$BIG_BENCH_ENGINE"
export BIG_BENCH_ENGINE_CONF_DIR="$BIG_BENCH_ENGINE_DIR/conf"
```
-
Check that the -s, -m, -f, and -j options are numbers
if [ -n "`echo "$BIG_BENCH_MAP_TASKS" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_MAP_TASKS is not a number"
fi
if [ -n "`echo "$BIG_BENCH_SCALE_FACTOR" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_SCALE_FACTOR is not a number"
fi
if [ -n "`echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS is not a number"
fi
if [ -n "`echo "$BIG_BENCH_STREAM_NUMBER" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_STREAM_NUMBER is not a number"
fi
-
Check that the engine exists
if [ ! -d "$BIG_BENCH_ENGINE_DIR" ]
then
  echo "Engine directory $BIG_BENCH_ENGINE_DIR not found. Aborting script..."
  exit 1
fi
if [ ! -d "$BIG_BENCH_ENGINE_CONF_DIR" ]
then
  echo "Engine configuration directory $BIG_BENCH_ENGINE_CONF_DIR not found. Aborting script..."
  exit 1
fi
-
Set the path of `engineSettings.conf` and source it
ENGINE_SETTINGS="$BIG_BENCH_ENGINE_CONF_DIR/engineSettings.conf"
if [ ! -f "$ENGINE_SETTINGS" ]
then
  echo "Engine settings file $ENGINE_SETTINGS not found"
  exit 1
else
  source "$ENGINE_SETTINGS"
fi
-
Check that the module exists
Given a module name, the script first looks for it under `$BIG_BENCH_ENGINE_BIN_DIR/`; if found, it runs `source "$MODULE"`. If it is not there, the script looks under `$BIG_BENCH_BIN_DIR/`; if found, it runs `source "$MODULE"`, otherwise it prints `Module $MODULE not found, aborting script.`
export MODULE="$BIG_BENCH_ENGINE_BIN_DIR/$MODULE_NAME"
if [ -f "$MODULE" ]
then
  source "$MODULE"
else
  export MODULE="$BIG_BENCH_BIN_DIR/$MODULE_NAME"
  if [ -f "$MODULE" ]
  then
    source "$MODULE"
  else
    echo "Module $MODULE not found, aborting script."
    exit 1
  fi
fi
-
Check that the module defines the runModule(), helpModule(), and runEngineCmd() functions
MODULE_RUN_METHOD="runModule"
if ! declare -F "$MODULE_RUN_METHOD" > /dev/null 2>&1
then
  echo "$MODULE_RUN_METHOD was not implemented, aborting script"
  exit 1
fi
-
Run the module
If the module is runBenchmark, execute
runCmdWithErrorCheck "$MODULE_RUN_METHOD"
which amounts to
runCmdWithErrorCheck runModule
To summarize, the bigBench script does the groundwork (setting environment variables, granting permissions, checking and parsing input options) and finally hands control to the `runModule()` function of `runBenchmark`.
runBenchmark
Next, the `runBenchmark` script.
`runBenchmark` defines two functions: `helpModule()` and `runModule()`.
`helpModule()` simply prints the help text.
`runModule()` is the function actually called when the `runBenchmark` module runs. It does four things:
- clean the logs of previous runs
- invoke `RunBigBench.jar` to run the benchmark
- logEnvInformation
- zip the log directory
The source:
runModule () {
#check input parameters
if [ "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" -le 0 ]
then
echo "The number of parallel streams -s must be greater than 0"
return 1
fi
"${BIG_BENCH_BIN_DIR}/bigBench" cleanLogs -U $LIST_OF_USER_OPTIONS
"$BIG_BENCH_JAVA" -jar "${BIG_BENCH_TOOLS_DIR}/RunBigBench.jar"
"${BIG_BENCH_BIN_DIR}/bigBench" logEnvInformation -U $LIST_OF_USER_OPTIONS
"${BIG_BENCH_BIN_DIR}/bigBench" zipLogs -U $LIST_OF_USER_OPTIONS
return $?
}
So running the `runBenchmark` module in turn invokes the `cleanLogs`, `logEnvInformation`, and `zipLogs` modules plus `RunBigBench.jar`. `RunBigBench.jar`, written in Java, is the core of TPCx-BB test execution. Next we analyze its source.
runModule()
The `runModule()` function executes a module. As we have seen, running a module means changing into the `bin` directory under the home directory and executing:
./bigBench module_name arguments
Inside `runModule()`, `cmdLine` assembles this command:
ArrayList cmdLine = new ArrayList();
cmdLine.add("bash");
cmdLine.add(this.runScript);
cmdLine.add(benchmarkPhase.getRunModule());
cmdLine.addAll(arguments);
where `this.runScript` is:
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
`benchmarkPhase.getRunModule()` returns the module to execute.
`arguments` holds the user-supplied options.
So `cmdLine` becomes:
bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments
How is this bash command actually executed? By calling the `runCmd()` method.
boolean successful = this.runCmd(this.homeDir, benchmarkPhase.isPrintStdOut(), (String[])cmdLine.toArray(new String[0]));
Next, the `runCmd()` method.
runCmd()
`runCmd()` creates an operating-system process via `ProcessBuilder` and uses it to execute the bash command above.
`ProcessBuilder` can also set the working directory and the environment.
ProcessBuilder pb = new ProcessBuilder(command);
pb.directory(new File(workingDirectory));
Process p = null;
...
p = pb.start();
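A self-contained sketch of this pattern follows. It is an assumed simplification, not the jar's actual code; the real `runCmd()` also handles logging and stdout capture:

```java
import java.io.File;
import java.io.IOException;

// Minimal sketch of launching a bash command with ProcessBuilder,
// the way runCmd() does; error handling is simplified here.
public class RunCmdDemo {
    static boolean runCmd(String workingDirectory, String... command) {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.directory(new File(workingDirectory)); // working dir, like runCmd()
            pb.inheritIO();                           // forward child output to our console
            Process p = pb.start();
            return p.waitFor() == 0;                  // success iff exit code 0
        } catch (IOException | InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(runCmd(".", "bash", "-c", "echo hello"));
    }
}
```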
getQueryList()
`getQueryList()` returns the list of queries to execute. It reads the list from `$BIG_BENCH_LOGS_DIR/bigBench.properties`, whose content is identical to `$BIG_BENCH_HOME/conf/bigBench.properties`.
In `bigBench.properties`, `power_test_0=1-30` specifies which queries the `power_test_0` phase executes, and in what order.
Queries can be given as ranges such as `5-12` or single numbers such as `21`, separated by commas.
`power_test_0=28-25,2-5,10,22,30` means the `power_test_0` phase runs the queries in the order `28,27,26,25,2,3,4,5,10,22,30`.
To run all 30 queries in ascending order:
power_test_0=1-30
The (decompiled) source that builds the query list:
private List<Integer> getQueryList(BigBench.BenchmarkPhase benchmarkPhase, int streamNumber) {
String SHUFFLED_NAME_PATTERN = "shuffledQueryList";
BigBench.BenchmarkPhase queryOrderBasicPhase = BigBench.BenchmarkPhase.POWER_TEST;
String propertyKey = benchmarkPhase.getQueryListProperty(streamNumber);
boolean queryOrderCached = benchmarkPhase.isQueryOrderCached();
if(queryOrderCached && this.queryListCache.containsKey(propertyKey)) {
return new ArrayList((Collection)this.queryListCache.get(propertyKey));
} else {
Object queryList;
String basicPhaseNamePattern;
if(!this.properties.containsKey(propertyKey)) {
if(benchmarkPhase.isQueryOrderRandom()) {
if(!this.queryListCache.containsKey("shuffledQueryList")) {
basicPhaseNamePattern = queryOrderBasicPhase.getQueryListProperty(0);
if(!this.properties.containsKey(basicPhaseNamePattern)) {
throw new IllegalArgumentException("Property " + basicPhaseNamePattern + " is not defined, but is the basis for shuffling the query list.");
}
this.queryListCache.put("shuffledQueryList", this.getQueryList(queryOrderBasicPhase, 0));
}
queryList = (List)this.queryListCache.get("shuffledQueryList");
this.shuffleList((List)queryList, this.rnd);
} else {
queryList = this.getQueryList(queryOrderBasicPhase, 0);
}
} else {
queryList = new ArrayList();
String[] var11;
int var10 = (var11 = this.properties.getProperty(propertyKey).split(",")).length;
label65:
for(int var9 = 0; var9 < var10; ++var9) {
basicPhaseNamePattern = var11[var9];
String[] queryRange = basicPhaseNamePattern.trim().split("-");
switch(queryRange.length) {
case 1:
((List)queryList).add(Integer.valueOf(Integer.parseInt(queryRange[0].trim())));
break;
case 2:
int startQuery = Integer.parseInt(queryRange[0]);
int endQuery = Integer.parseInt(queryRange[1]);
int i;
if(startQuery > endQuery) {
i = startQuery;
while(true) {
if(i < endQuery) {
continue label65;
}
((List)queryList).add(Integer.valueOf(i));
--i;
}
} else {
i = startQuery;
while(true) {
if(i > endQuery) {
continue label65;
}
((List)queryList).add(Integer.valueOf(i));
++i;
}
}
default:
throw new IllegalArgumentException("Query numbers must be in the form X or X-Y, comma separated.");
}
}
}
if(queryOrderCached) {
this.queryListCache.put(propertyKey, new ArrayList((Collection)queryList));
}
return new ArrayList((Collection)queryList);
}
}
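The decompiled control flow above (with its `label65` jumps) is hard to follow. Here is a cleaner re-implementation of just the range expansion, written for illustration; the class and method names are mine, not from the jar:

```java
import java.util.ArrayList;
import java.util.List;

// Re-implementation of the query-list expansion for readability:
// "28-25,2-5,10" -> 28,27,26,25,2,3,4,5,10 (descending ranges count down).
public class QueryListDemo {
    static List<Integer> expand(String spec) {
        List<Integer> result = new ArrayList<>();
        for (String part : spec.split(",")) {
            String[] range = part.trim().split("-");
            if (range.length == 1) {
                result.add(Integer.parseInt(range[0].trim()));
            } else if (range.length == 2) {
                int start = Integer.parseInt(range[0].trim());
                int end = Integer.parseInt(range[1].trim());
                int step = (start <= end) ? 1 : -1;
                for (int i = start; i != end + step; i += step) {
                    result.add(i);
                }
            } else {
                throw new IllegalArgumentException(
                    "Query numbers must be in the form X or X-Y, comma separated.");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("28-25,2-5,10,22,30"));
    }
}
```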
parseEnvironment()
`parseEnvironment()` reads and parses the system environment variables.
Map env = System.getenv();
this.version = (String)env.get("BIG_BENCH_VERSION");
this.homeDir = (String)env.get("BIG_BENCH_HOME");
this.confDir = (String)env.get("BIG_BENCH_CONF_DIR");
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
this.datagenDir = (String)env.get("BIG_BENCH_DATA_GENERATOR_DIR");
this.logDir = (String)env.get("BIG_BENCH_LOGS_DIR");
this.dataGenLogFile = (String)env.get("BIG_BENCH_DATAGEN_STAGE_LOG");
this.loadLogFile = (String)env.get("BIG_BENCH_LOADING_STAGE_LOG");
this.engine = (String)env.get("BIG_BENCH_ENGINE");
this.database = (String)env.get("BIG_BENCH_DATABASE");
this.mapTasks = (String)env.get("BIG_BENCH_MAP_TASKS");
this.numberOfParallelStreams = Integer.parseInt((String)env.get("BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS"));
this.scaleFactor = Long.parseLong((String)env.get("BIG_BENCH_SCALE_FACTOR"));
this.stopAfterFailure = ((String)env.get("BIG_BENCH_STOP_AFTER_FAILURE")).equals("1");
It also automatically appends `-U` (unlock expert mode) to the user-specified arguments:
this.userArguments.add("-U");
If the user specified `PRETEND_MODE`, `PRINT_STD_OUT`, `WORKLOAD`, or `QUERIES_TO_RUN`, the user's values take precedence; otherwise the defaults are used.
if(env.containsKey("USER_PRETEND_MODE")) {
this.properties.setProperty("pretend_mode", (String)env.get("USER_PRETEND_MODE"));
}
if(env.containsKey("USER_PRINT_STD_OUT")) {
this.properties.setProperty("show_command_stdout", (String)env.get("USER_PRINT_STD_OUT"));
}
if(env.containsKey("USER_DRIVER_WORKLOAD")) {
this.properties.setProperty("workload", (String)env.get("USER_DRIVER_WORKLOAD"));
}
if(env.containsKey("USER_DRIVER_QUERIES_TO_RUN")) {
this.properties.setProperty(BigBench.BenchmarkPhase.POWER_TEST.getQueryListProperty(0), (String)env.get("USER_DRIVER_QUERIES_TO_RUN"));
}
It then reads `workload` and fills `benchmarkPhases`. If `workload` does not contain `BENCHMARK_START` and `BENCHMARK_STOP`, they are inserted automatically at the beginning and the end of `benchmarkPhases`, respectively.
this.benchmarkPhases = new ArrayList();
Iterator var7 = Arrays.asList(this.properties.getProperty("workload").split(",")).iterator();
while(var7.hasNext()) {
String benchmarkPhase = (String)var7.next();
this.benchmarkPhases.add(BigBench.BenchmarkPhase.valueOf(benchmarkPhase.trim()));
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_START)) {
this.benchmarkPhases.add(0, BigBench.BenchmarkPhase.BENCHMARK_START);
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_STOP)) {
this.benchmarkPhases.add(BigBench.BenchmarkPhase.BENCHMARK_STOP);
}
run()
`run()` is the core method of `RunBigBench.jar`: every step is driven from it. It calls `runQueries()`, `runModule()`, `generateData()`, and so on, and those methods in turn call `runCmd()` to spawn OS processes that execute bash commands, which invoke the bash scripts.
Inside `run()`, a while loop walks through every `benchmarkPhase` of the `workload`; depending on the phase, it dispatches to `runQueries()`, `runModule()`, or `generateData()`.
try {
long e = 0L;
this.log.finer("Benchmark phases: " + this.benchmarkPhases);
Iterator startCheckpoint = this.benchmarkPhases.iterator();
long throughputStart;
while(startCheckpoint.hasNext()) {
BigBench.BenchmarkPhase children = (BigBench.BenchmarkPhase)startCheckpoint.next();
if(children.isPhaseDone()) {
this.log.info("The phase " + children.name() + " was already performed earlier. Skipping this phase");
} else {
try {
switch($SWITCH_TABLE$io$bigdatabenchmark$v1$driver$BigBench$BenchmarkPhase()[children.ordinal()]) {
case 1:
case 20:
throw new IllegalArgumentException("The value " + children.name() + " is only used internally.");
case 2:
this.log.info(children.getConsoleMessage());
e = System.currentTimeMillis();
break;
case 3:
if(!BigBench.BenchmarkPhase.BENCHMARK_START.isPhaseDone()) {
throw new IllegalArgumentException("Error: Cannot stop the benchmark before starting it");
}
throughputStart = System.currentTimeMillis();
this.log.info(String.format("%-55s finished. Time: %25s", new Object[]{children.getConsoleMessage(), BigBench.Helper.formatTime(throughputStart - e)}));
this.logTreeRoot.setCheckpoint(new BigBench.Checkpoint(BigBench.BenchmarkPhase.BENCHMARK, -1L, -1L, e, throughputStart, this.logTreeRoot.isSuccessful()));
break;
case 4:
case 15:
case 18:
case 22:
case 27:
case 28:
case 29:
this.runModule(children, this.userArguments);
break;
case 5:
case 10:
case 11:
this.runQueries(children, 1, validationArguments);
break;
case 6:
case 9:
this.runModule(children, validationArguments);
break;
case 7:
this.generateData(children, false, validationArguments);
break;
case 8:
this.generateData(children, true, validationArguments);
break;
case 12:
case 19:
case 24:
this.runQueries(children, 1, this.userArguments);
break;
case 13:
case 14:
case 21:
case 23:
case 25:
case 26:
this.runQueries(children, this.numberOfParallelStreams, this.userArguments);
break;
case 16:
this.generateData(children, false, this.userArguments);
break;
case 17:
this.generateData(children, true, this.userArguments);
}
children.setPhaseDone(true);
} catch (IOException var21) {
this.log.info("==============\nBenchmark run terminated\nReason: An error occured while running a command in phase " + children + "\n==============");
var21.printStackTrace();
if(this.stopAfterFailure || children.mustSucceed()) {
break;
}
}
}
}
The labels `case 1` through `case 29` here are not queries 1-29 but the 29 `BenchmarkPhase` enum constants, shown below:
private static enum BenchmarkPhase {
BENCHMARK((String)null, "benchmark", false, false, false, false, "BigBench benchmark"),
BENCHMARK_START((String)null, "benchmark_start", false, false, false, false, "BigBench benchmark: Start"),
BENCHMARK_STOP((String)null, "benchmark_stop", false, false, false, false, "BigBench benchmark: Stop"),
CLEAN_ALL("cleanAll", "clean_all", false, false, false, false, "BigBench clean all"),
ENGINE_VALIDATION_CLEAN_POWER_TEST("cleanQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Clean power test queries"),
ENGINE_VALIDATION_CLEAN_LOAD_TEST("cleanMetastore", "engine_validation_metastore", false, false, false, false, "BigBench engine validation: Clean metastore"),
ENGINE_VALIDATION_CLEAN_DATA("cleanData", "engine_validation_data", false, false, false, false, "BigBench engine validation: Clean data"),
ENGINE_VALIDATION_DATA_GENERATION("dataGen", "engine_validation_data", false, false, false, true, "BigBench engine validation: Data generation"),
ENGINE_VALIDATION_LOAD_TEST("populateMetastore", "engine_validation_metastore", false, false, false, true, "BigBench engine validation: Populate metastore"),
ENGINE_VALIDATION_POWER_TEST("runQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Power test"),
ENGINE_VALIDATION_RESULT_VALIDATION("validateQuery", "engine_validation_power_test", false, false, true, false, "BigBench engine validation: Check all query results"),
CLEAN_POWER_TEST("cleanQuery", "power_test", false, false, false, false, "BigBench clean: Clean power test queries"),
CLEAN_THROUGHPUT_TEST_1("cleanQuery", "throughput_test_1", false, false, false, false, "BigBench clean: Clean first throughput test queries"),
CLEAN_THROUGHPUT_TEST_2("cleanQuery", "throughput_test_2", false, false, false, false, "BigBench clean: Clean second throughput test queries"),
CLEAN_LOAD_TEST("cleanMetastore", "metastore", false, false, false, false, "BigBench clean: Load test"),
CLEAN_DATA("cleanData", "data", false, false, false, false, "BigBench clean: Data"),
DATA_GENERATION("dataGen", "data", false, false, false, true, "BigBench preparation: Data generation"),
LOAD_TEST("populateMetastore", "metastore", false, false, false, true, "BigBench phase 1: Load test"),
POWER_TEST("runQuery", "power_test", false, true, false, false, "BigBench phase 2: Power test"),
THROUGHPUT_TEST((String)null, "throughput_test", false, false, false, false, "BigBench phase 3: Throughput test"),
THROUGHPUT_TEST_1("runQuery", "throughput_test_1", true, true, false, false, "BigBench phase 3: First throughput test run"),
THROUGHPUT_TEST_REFRESH("refreshMetastore", "throughput_test_refresh", false, false, false, false, "BigBench phase 3: Throughput test data refresh"),
THROUGHPUT_TEST_2("runQuery", "throughput_test_2", true, true, false, false, "BigBench phase 3: Second throughput test run"),
VALIDATE_POWER_TEST("validateQuery", "power_test", false, false, true, false, "BigBench validation: Power test results"),
VALIDATE_THROUGHPUT_TEST_1("validateQuery", "throughput_test_1", false, false, true, false, "BigBench validation: First throughput test results"),
VALIDATE_THROUGHPUT_TEST_2("validateQuery", "throughput_test_2", false, false, true, false, "BigBench validation: Second throughput test results"),
SHOW_TIMES("showTimes", "show_times", false, false, true, false, "BigBench: show query times"),
SHOW_ERRORS("showErrors", "show_errors", false, false, true, false, "BigBench: show query errors"),
SHOW_VALIDATION("showValidation", "show_validation", false, false, true, false, "BigBench: show query validation results");
private String runModule;
private String namePattern;
private boolean queryOrderRandom;
private boolean queryOrderCached;
private boolean printStdOut;
private boolean mustSucceed;
private String consoleMessage;
private boolean phaseDone;
private BenchmarkPhase(String runModule, String namePattern, boolean queryOrderRandom, boolean queryOrderCached, boolean printStdOut, boolean mustSucceed, String consoleMessage) {
this.runModule = runModule;
this.namePattern = namePattern;
this.queryOrderRandom = queryOrderRandom;
this.queryOrderCached = queryOrderCached;
this.printStdOut = printStdOut;
this.mustSucceed = mustSucceed;
this.consoleMessage = consoleMessage;
this.phaseDone = false;
}
Number 3 corresponds to `BENCHMARK_STOP`, 4 corresponds to `CLEAN_ALL`, 29 corresponds to `SHOW_VALIDATION`, and so on.
From this we can see:
`CLEAN_ALL`, `CLEAN_LOAD_TEST`, `LOAD_TEST`, `THROUGHPUT_TEST_REFRESH`, `SHOW_TIMES`, `SHOW_ERRORS`, and `SHOW_VALIDATION` call
this.runModule(children, this.userArguments);
i.e. the `runModule` method with the argument `this.userArguments`.
`ENGINE_VALIDATION_CLEAN_POWER_TEST`, `ENGINE_VALIDATION_POWER_TEST`, and `ENGINE_VALIDATION_RESULT_VALIDATION` call
this.runQueries(children, 1, validationArguments);
i.e. the `runQueries` method with the arguments `1` (the stream number) and `validationArguments`.
`ENGINE_VALIDATION_CLEAN_LOAD_TEST` and `ENGINE_VALIDATION_LOAD_TEST` call
this.runModule(children, validationArguments);
`ENGINE_VALIDATION_CLEAN_DATA` calls
this.generateData(children, false, validationArguments);
`ENGINE_VALIDATION_DATA_GENERATION` calls
this.generateData(children, true, validationArguments);
`CLEAN_POWER_TEST`, `POWER_TEST`, and `VALIDATE_POWER_TEST` call
this.runQueries(children, 1, this.userArguments);
`CLEAN_THROUGHPUT_TEST_1`, `CLEAN_THROUGHPUT_TEST_2`, `THROUGHPUT_TEST_1`, `THROUGHPUT_TEST_2`, `VALIDATE_THROUGHPUT_TEST_1`, and `VALIDATE_THROUGHPUT_TEST_2` call
this.runQueries(children, this.numberOfParallelStreams, this.userArguments);
`CLEAN_DATA` calls
this.generateData(children, false, this.userArguments);
`DATA_GENERATION` calls
this.generateData(children, true, this.userArguments);
Summarizing the method calls above:
- Phases related to `ENGINE_VALIDATION` all use `validationArguments`; the rest use `userArguments`. (The only difference between the two is that `validationArguments` fixes `SCALE_FACTOR` at 1.)
- Phases related to `POWER_TEST` all call the `runQueries()` method, since `POWER_TEST` is exactly the execution of the SQL queries.
- Phases related to `CLEAN_DATA` and `DATA_GENERATION` all call the `generateData()` method.
- Phases related to `LOAD_TEST` and `SHOW` all call the `runModule()` method.
Correspondence between benchmarkPhase and module
The exact correspondence between each `benchmarkPhase` and its `module` (the script that gets executed) is as follows:
CLEAN_ALL -> "cleanAll"
ENGINE_VALIDATION_CLEAN_POWER_TEST -> "cleanQuery"
ENGINE_VALIDATION_CLEAN_LOAD_TEST -> "cleanMetastore"
ENGINE_VALIDATION_CLEAN_DATA -> "cleanData"
ENGINE_VALIDATION_DATA_GENERATION -> "dataGen"
ENGINE_VALIDATION_LOAD_TEST -> "populateMetastore"
ENGINE_VALIDATION_POWER_TEST -> "runQuery"
ENGINE_VALIDATION_RESULT_VALIDATION -> "validateQuery"
CLEAN_POWER_TEST -> "cleanQuery"
CLEAN_THROUGHPUT_TEST_1 -> "cleanQuery"
CLEAN_THROUGHPUT_TEST_2 -> "cleanQuery"
CLEAN_LOAD_TEST -> "cleanMetastore"
CLEAN_DATA -> "cleanData"
DATA_GENERATION -> "dataGen"
LOAD_TEST -> "populateMetastore"
POWER_TEST -> "runQuery"
THROUGHPUT_TEST -> (String)null
THROUGHPUT_TEST_1 -> "runQuery"
THROUGHPUT_TEST_REFRESH -> "refreshMetastore"
THROUGHPUT_TEST_2 -> "runQuery"
VALIDATE_POWER_TEST -> "validateQuery"
VALIDATE_THROUGHPUT_TEST_1 -> "validateQuery"
VALIDATE_THROUGHPUT_TEST_2 -> "validateQuery"
SHOW_TIMES -> "showTimes"
SHOW_ERRORS -> "showErrors"
SHOW_VALIDATION -> "showValidation"
When a `benchmarkPhase` is executed, the corresponding `module` listed above is invoked (the scripts are located in the `$BENCH_MARK_HOME/engines/hive/bin` directory):
cmdLine.add(benchmarkPhase.getRunModule());
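This phase-to-module lookup can be sketched in shell. A minimal sketch, assuming a few entries of the mapping table above and the `$BENCH_MARK_HOME/engines/hive/bin/<module>` path layout described in the text; it is an illustration, not code from the kit:

```shell
#!/usr/bin/env bash
# Sketch: resolve a benchmarkPhase name to the module script it runs.
# Only a handful of phases are listed here for illustration.
declare -A PHASE_MODULE=(
  [CLEAN_DATA]="cleanData"
  [DATA_GENERATION]="dataGen"
  [LOAD_TEST]="populateMetastore"
  [POWER_TEST]="runQuery"
  [VALIDATE_POWER_TEST]="validateQuery"
)

module_script_for_phase () {
  local phase="$1"
  local module="${PHASE_MODULE[$phase]:-}"
  if [ -z "$module" ]; then
    echo "unknown phase: $phase" >&2
    return 1
  fi
  # Module scripts live under the engine's bin directory.
  echo "$BENCH_MARK_HOME/engines/hive/bin/$module"
}
```

For example, `module_script_for_phase POWER_TEST` resolves to the `runQuery` script under the engine's bin directory.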
Program call flow
Next, we describe the function of each module.
cleanAll
1. DROP the database
2. Delete the source data on HDFS
echo "dropping database (with all tables)"
runCmdWithErrorCheck runEngineCmd -e "DROP DATABASE IF EXISTS $BIG_BENCH_DATABASE CASCADE;"
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_HOME}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_HOME}"
cleanQuery
1. Drop the temporary tables created by the corresponding query
2. Drop the result table created by the corresponding query
runCmdWithErrorCheck runEngineCmd -e "DROP TABLE IF EXISTS $TEMP_TABLE1; DROP TABLE IF EXISTS $TEMP_TABLE2; DROP TABLE IF EXISTS $RESULT_TABLE;"
return $?
cleanMetastore
1. Invoke `dropTables.sql` to DROP each of the 23 tables in turn
echo "cleaning metastore tables"
runCmdWithErrorCheck runEngineCmd -f "$BIG_BENCH_CLEAN_METASTORE_FILE"
export BIG_BENCH_CLEAN_METASTORE_FILE="$BIG_BENCH_CLEAN_DIR/dropTables.sql"
dropTables.sql
Drops the 23 tables one by one; the source is as follows:
DROP TABLE IF EXISTS ${hiveconf:customerTableName};
DROP TABLE IF EXISTS ${hiveconf:customerAddressTableName};
DROP TABLE IF EXISTS ${hiveconf:customerDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:dateTableName};
DROP TABLE IF EXISTS ${hiveconf:householdDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:incomeTableName};
DROP TABLE IF EXISTS ${hiveconf:itemTableName};
DROP TABLE IF EXISTS ${hiveconf:promotionTableName};
DROP TABLE IF EXISTS ${hiveconf:reasonTableName};
DROP TABLE IF EXISTS ${hiveconf:shipModeTableName};
DROP TABLE IF EXISTS ${hiveconf:storeTableName};
DROP TABLE IF EXISTS ${hiveconf:timeTableName};
DROP TABLE IF EXISTS ${hiveconf:warehouseTableName};
DROP TABLE IF EXISTS ${hiveconf:webSiteTableName};
DROP TABLE IF EXISTS ${hiveconf:webPageTableName};
DROP TABLE IF EXISTS ${hiveconf:inventoryTableName};
DROP TABLE IF EXISTS ${hiveconf:storeSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:storeReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:webSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:webReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:marketPricesTableName};
DROP TABLE IF EXISTS ${hiveconf:clickstreamsTableName};
DROP TABLE IF EXISTS ${hiveconf:reviewsTableName};
cleanData
1. Delete the /user/root/benchmarks/bigbench/data directory on HDFS
2. Delete the /user/root/benchmarks/bigbench/data_refresh directory on HDFS
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
dataGen
1. Create the /user/root/benchmarks/bigbench/data directory and grant permissions
2. Create the /user/root/benchmarks/bigbench/data_refresh directory and grant permissions
3. Invoke HadoopClusterExec.jar and pdgf.jar to generate the base data into the /user/root/benchmarks/bigbench/data directory
4. Invoke HadoopClusterExec.jar and pdgf.jar to generate the refresh data into the /user/root/benchmarks/bigbench/data_refresh directory
Create the /user/root/benchmarks/bigbench/data directory and grant permissions:
runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
Create the /user/root/benchmarks/bigbench/data_refresh directory and grant permissions:
runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
Invoke HadoopClusterExec.jar and pdgf.jar to generate the base data:
runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 0 -o "'${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1
Invoke HadoopClusterExec.jar and pdgf.jar to generate the refresh data:
runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 1 -o "'${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1
populateMetastore
This phase performs the actual creation of the database tables. Table creation invokes `hiveCreateLoad.sql` in the `$BENCH_MARK_HOME/engines/hive/population/` directory; that SQL file builds the database tables.
- Read the raw `.dat` data under the /user/root/benchmarks/bigbench/data path and create TEXTFILE-format external temporary tables
- Create the final ORC-format database tables with `select * from <temporary table>`
- Drop the external temporary tables
Read the raw .dat data under the /user/root/benchmarks/bigbench/data path and create a TEXTFILE-format external temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
( c_customer_sk bigint --not null
, c_customer_id string --not null
, c_current_cdemo_sk bigint
, c_current_hdemo_sk bigint
, c_current_addr_sk bigint
, c_first_shipto_date_sk bigint
, c_first_sales_date_sk bigint
, c_salutation string
, c_first_name string
, c_last_name string
, c_preferred_cust_flag string
, c_birth_day int
, c_birth_month int
, c_birth_year int
, c_birth_country string
, c_login string
, c_email_address string
, c_last_review_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
Create the final ORC-format database table with `select * from` the temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName};
CREATE TABLE ${hiveconf:customerTableName}
STORED AS ${hiveconf:tableFormat}
AS
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;
Drop the external temporary table:
DROP TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
runQuery
1. runQuery invokes the `query_run_main_method()` function in each query's run.sh
2. `query_run_main_method()` calls `runEngineCmd` to execute the query script (qxx.sql)
runQuery invokes the `query_run_main_method()` function in each query's run.sh:
QUERY_MAIN_METHOD="query_run_main_method"
-----------------------------------------
"$QUERY_MAIN_METHOD" 2>&1 | tee -a "$LOG_FILE_NAME" 2>&1
`query_run_main_method()` calls `runEngineCmd` to execute the query script (qxx.sql):
query_run_main_method () {
QUERY_SCRIPT="$QUERY_DIR/$QUERY_NAME.sql"
if [ ! -r "$QUERY_SCRIPT" ]
then
echo "SQL file $QUERY_SCRIPT can not be read."
exit 1
fi
runCmdWithErrorCheck runEngineCmd -f "$QUERY_SCRIPT"
return $?
}
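The scripts above rely on `runCmdWithErrorCheck`, whose definition is not shown here. A minimal sketch of what such an error-checking wrapper plausibly does (an assumption for illustration, not the kit's actual code):

```shell
#!/usr/bin/env bash
# Assumed sketch of an error-checking command wrapper: run the given
# command, and surface a message plus the exit code when it fails.
runCmdWithErrorCheck () {
  "$@"
  local rc=$?   # exit status of the wrapped command
  if [ "$rc" -ne 0 ]; then
    echo "command failed with exit code $rc: $*" >&2
  fi
  return $rc
}
```

The wrapper preserves the wrapped command's exit status, so callers such as `query_run_main_method()` can still `return $?` after it.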
Normally `query_run_main_method()` just executes the corresponding query script, but queries such as q05 and q20 use machine learning algorithms. For those, after the query script runs, the resulting table is used as input and a jar implementing the machine learning algorithm (e.g. clustering, logistic regression) is invoked to produce the final result.
runEngineCmd () {
if addInitScriptsToParams
then
"$BINARY" "${BINARY_PARAMS[@]}" "${INIT_PARAMS[@]}" "$@"
else
return 1
fi
}
--------------------------
BINARY="/usr/bin/hive"
BINARY_PARAMS+=(--hiveconf BENCHMARK_PHASE=$BIG_BENCH_BENCHMARK_PHASE --hiveconf STREAM_NUMBER=$BIG_BENCH_STREAM_NUMBER --hiveconf QUERY_NAME=$QUERY_NAME --hiveconf QUERY_DIR=$QUERY_DIR --hiveconf RESULT_TABLE=$RESULT_TABLE --hiveconf RESULT_DIR=$RESULT_DIR --hiveconf TEMP_TABLE=$TEMP_TABLE --hiveconf TEMP_DIR=$TEMP_DIR --hiveconf TABLE_PREFIX=$TABLE_PREFIX)
INIT_PARAMS=(-i "$BIG_BENCH_QUERY_PARAMS_FILE" -i "$BIG_BENCH_ENGINE_SETTINGS_FILE")
INIT_PARAMS+=(-i "$LOCAL_QUERY_ENGINE_SETTINGS_FILE")
if [ -n "$USER_QUERY_PARAMS_FILE" ]
then
if [ -r "$USER_QUERY_PARAMS_FILE" ]
then
echo "User defined query parameter file found. Adding $USER_QUERY_PARAMS_FILE to hive init."
INIT_PARAMS+=(-i "$USER_QUERY_PARAMS_FILE")
else
echo "User query parameter file $USER_QUERY_PARAMS_FILE can not be read."
return 1
fi
fi
if [ -n "$USER_ENGINE_SETTINGS_FILE" ]
then
if [ -r "$USER_ENGINE_SETTINGS_FILE" ]
then
echo "User defined engine settings file found. Adding $USER_ENGINE_SETTINGS_FILE to hive init."
INIT_PARAMS+=(-i "$USER_ENGINE_SETTINGS_FILE")
else
echo "User hive settings file $USER_ENGINE_SETTINGS_FILE can not be read."
return 1
fi
fi
return 0
validateQuery
1. Invokes the `query_run_validate_method()` function in each query's run.sh
2. `query_run_validate_method()` compares `$BENCH_MARK_HOME/engines/hive/queries/qxx/results/qxx-result` with `/user/root/benchmarks/bigbench/queryResults/qxx_hive_${BIG_BENCH_BENCHMARK_PHASE}_${BIG_BENCH_STREAM_NUMBER}_result` on HDFS; if the two files are identical, validation passes, otherwise it fails.
if diff -q "$VALIDATION_RESULTS_FILENAME" <(hadoop fs -cat "$RESULT_DIR/*")
then
echo "Validation of $VALIDATION_RESULTS_FILENAME passed: Query returned correct results"
else
echo "Validation of $VALIDATION_RESULTS_FILENAME failed: Query returned incorrect results"
VALIDATION_PASSED="0"
fi
When SF is 1 (`-f 1`), the comparison above is used; when SF is greater than 1, validation passes as long as the result table on HDFS contains at least one row:
if [ `hadoop fs -cat "$RESULT_DIR/*" | head -n 10 | wc -l` -ge 1 ]
then
echo "Validation passed: Query returned results"
else
echo "Validation failed: Query did not return results"
return 1
fi
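The two validation modes can be combined into a single helper. A sketch, assuming the query result lines arrive on stdin (the real script reads them from HDFS via `hadoop fs -cat`):

```shell
#!/usr/bin/env bash
# Sketch: SF=1 compares the full result against the expected file;
# SF>1 only requires that at least one result row exists.
validate_result () {
  local scale_factor="$1" expected_file="$2"
  if [ "$scale_factor" -eq 1 ]; then
    # Exact comparison: result lines on stdin vs. the reference file.
    diff -q "$expected_file" - >/dev/null
  else
    # Sanity check only: the query must have returned at least one row.
    [ "$(head -n 10 | wc -l)" -ge 1 ]
  fi
}
```

For example, piping the HDFS result into `validate_result 1 qxx-result` reproduces the exact-match branch, while any SF greater than 1 falls through to the row-count check.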
refreshMetastore
1. Invokes the `hiveRefreshCreateLoad.sql` script in the `$BENCH_MARK_HOME/engines/hive/refresh/` directory
2. `hiveRefreshCreateLoad.sql` loads each table's data under `/user/root/benchmarks/bigbench/data_refresh/` on HDFS into external temporary tables
3. The external temporary tables then insert each table's data into the corresponding tables of the Hive database
`hiveRefreshCreateLoad.sql` loads each table's data under /user/root/benchmarks/bigbench/data_refresh/ on HDFS into an external temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
( c_customer_sk bigint --not null
, c_customer_id string --not null
, c_current_cdemo_sk bigint
, c_current_hdemo_sk bigint
, c_current_addr_sk bigint
, c_first_shipto_date_sk bigint
, c_first_sales_date_sk bigint
, c_salutation string
, c_first_name string
, c_last_name string
, c_preferred_cust_flag string
, c_birth_day int
, c_birth_month int
, c_birth_year int
, c_birth_country string
, c_login string
, c_email_address string
, c_last_review_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
-----------------
set hdfsDataPath=${env:BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR};
The external temporary table then inserts each table's data into the corresponding table of the Hive database:
INSERT INTO TABLE ${hiveconf:customerTableName}
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;