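1 Background
HappyBase is a developer-friendly Python library for interacting with Apache HBase. It gives application developers a Pythonic API for working with HBase.
See the HappyBase documentation for details.
The happybase scan API also exposes the Filter query interface of HBase Thrift, but it does not document the Filter syntax in any detail, and I could not find a thorough description elsewhere on the internet either.
So I went through the HBase documentation and translated the "Thrift API and Filter Language" part. Section 2.1 introduces the happybase scan interface, and section 2.2 is the translated HBase content.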
2 Using Filters for complex HBase queries
2.1 The happybase scan interface
scan(row_start=None, row_stop=None, row_prefix=None, columns=None, filter=None, timestamp=None, include_timestamp=False, batch_size=1000, scan_batching=None, limit=None, sorted_columns=False, reverse=False)
The filter parameter is the one used for HBase Filter queries. Here is a simple example:
import happybase
hbase_host = ''  # fill in the host of your HBase Thrift server
hbase_port = 9090
# Connect to HBase
conn = happybase.Connection(host=hbase_host, port=hbase_port)
table = conn.table('test')
# Filter string
scan_filter = "SingleColumnValueFilter('info', 'item_delivery_status', =, 'binary:1', true, true)"
# Run the query
result = table.scan(filter=scan_filter)
# Print the query results
for row_key, item in result:
    print(row_key)
    print(item)
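2.2 Filter syntax
The text in this part is translated from the HBase documentation, "Thrift API and Filter Language". The code is my own; change the host, table name, column names and so on to your own before using it.
2.2.1 Basic query syntax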
"FilterName (argument, argument,... , argument)"
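Syntax guidelines:
- Specify the name of the filter first, followed by the argument list in parentheses, with arguments separated by commas.
- If an argument is a string, it should be enclosed in single quotes (').
- If an argument is a boolean, an integer or a comparison operator (such as <, > or !=), it must not be enclosed in quotes.
- The filter name must be a single word. In other words, it may contain any ASCII characters except spaces, single quotes and parentheses.
- Filter arguments may contain any ASCII characters. If an argument contains a single quote, that quote must be escaped with another single quote.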
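The quoting rules above are easy to get wrong when a filter string is assembled by hand, so here is a minimal sketch of a helper that applies them. `format_argument` and `build_filter` are hypothetical names of my own, not part of happybase or HBase, and the helper only produces the string; it does not talk to HBase.

```python
# Hypothetical helpers (not part of happybase or HBase) that apply the
# quoting rules of the filter language when building a filter string.

COMPARE_OPERATORS = {"<", "<=", "=", "!=", ">", ">="}


def format_argument(arg):
    """Render one filter argument according to the syntax guidelines."""
    if isinstance(arg, bool):  # check bool first: bool is a subclass of int
        return "true" if arg else "false"
    if isinstance(arg, int):
        return str(arg)
    if isinstance(arg, str) and arg in COMPARE_OPERATORS:
        return arg  # operators are written bare, without quotes
    # Everything else is treated as a string: wrap it in single quotes and
    # escape an embedded single quote by doubling it.
    return "'" + str(arg).replace("'", "''") + "'"


def build_filter(name, *args):
    """Return 'Name(arg, arg, ...)' for a single filter."""
    if not name or any(ch in " '()" for ch in name):
        raise ValueError("filter name must be one word without spaces, "
                         "quotes or parentheses")
    return "{}({})".format(name, ", ".join(format_argument(a) for a in args))


print(build_filter("PrefixFilter", "0047a"))
print(build_filter("SingleColumnValueFilter", "info", "item_delivery_status",
                   "=", "binary:1", True, True))
print(build_filter("ValueFilter", "=", "binary:it's"))
```

One limitation of this sketch: a string argument that happens to equal an operator symbol (for example the literal text `=`) is written bare, so such a value would need to be quoted by hand.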
2.2.2 Compound filters and logical operators
Binary operators
- AND: both conditions must be satisfied.
- OR: satisfying either condition is enough.
Unary operators
- SKIP: for a particular row, if any of the key-values fail the filter condition, the entire row is skipped.
- WHILE: for a particular row, key-values will be emitted until a key-value is reached that fails the filter condition.
Example
(Filter1 AND Filter2) OR (Filter3 AND Filter4)
Operator precedence
- Parentheses have the highest precedence.
- The unary operators SKIP and WHILE come next and have the same precedence.
- The binary operators come last. AND has higher precedence than OR.
Example 1
Filter1 AND Filter2 OR Filter3
is evaluated as
(Filter1 AND Filter2) OR Filter3
Example 2
Filter1 AND SKIP Filter2 OR Filter3
is evaluated as
(Filter1 AND (SKIP Filter2)) OR Filter3
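When combining filters from Python, I find it safer not to rely on the precedence rules at all and to parenthesize every group explicitly. The sketch below does that; `all_of`, `any_of`, `skip` and `while_` are hypothetical helper names of my own, and I am assuming that the parser accepts the extra outer parentheses, in the same way the "is evaluated as" forms above are written.

```python
# Hypothetical combinators that always add explicit parentheses, so the
# resulting string does not depend on operator precedence.

def _join(keyword, filters):
    if not filters:
        raise ValueError("at least one filter is required")
    return "(" + " {} ".format(keyword).join(filters) + ")"


def all_of(*filters):
    """Every filter must pass (AND)."""
    return _join("AND", filters)


def any_of(*filters):
    """At least one filter must pass (OR)."""
    return _join("OR", filters)


def skip(filter_string):
    """Skip the whole row if any key-value fails the filter."""
    return "(SKIP " + filter_string + ")"


def while_(filter_string):
    """Emit key-values until the first one that fails the filter."""
    return "(WHILE " + filter_string + ")"


combined = any_of(
    all_of("PrefixFilter('0047a')", skip("ValueFilter(!=, 'binary:0')")),
    "PageFilter(5)",
)
print(combined)
```

The resulting string can be passed as the `filter` argument of `table.scan()` like any other filter string.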
2.2.3 Comparison operators
- LESS (<)
- LESS_OR_EQUAL (<=)
- EQUAL (=)
- NOT_EQUAL (!=)
- GREATER_OR_EQUAL (>=)
- GREATER (>)
- NO_OP (no operation)
Use the symbols (<, <=, =, !=, >, >=) to express a comparison operator in a filter string.
2.2.4 Comparators
- BinaryComparator: compares lexicographically against the specified byte array, using Bytes.compareTo(byte[], byte[]).
- BinaryPrefixComparator: prefix comparison. It compares lexicographically against the specified byte array, but only up to the length of that byte array.
- RegexStringComparator: regular expression comparison. It matches against the given regular expression. Only the EQUAL and NOT_EQUAL operators are valid with this comparator.
- SubStringComparator: substring comparison. It tests whether the given substring appears in the value. The comparison is case insensitive. Only the EQUAL and NOT_EQUAL operators are valid with this comparator.
The syntax of a comparator is: ComparatorType:ComparatorValue
The ComparatorType for each comparator is as follows:
- BinaryComparator - binary
- BinaryPrefixComparator - binaryprefix
- RegexStringComparator - regexstring
- SubStringComparator - substring
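The rules for comparator types and the operators they allow can be checked before a query is sent. The following is a minimal sketch under my own naming (`comparator` and `compare_filter` are hypothetical helpers, not happybase functions) that builds filters such as RowFilter or ValueFilter, which take one operator and one comparator.

```python
# Hypothetical helpers that validate the operator/comparator combination
# before building a "FilterName(op, 'type:value')" string.

COMPARE_OPERATORS = {"<", "<=", "=", "!=", ">", ">="}
COMPARATOR_TYPES = {"binary", "binaryprefix", "regexstring", "substring"}
# regexstring and substring only make sense with EQUAL and NOT_EQUAL.
EQUALITY_ONLY = {"regexstring", "substring"}


def comparator(kind, value):
    """Return 'ComparatorType:ComparatorValue'."""
    if kind not in COMPARATOR_TYPES:
        raise ValueError("unknown comparator type: {!r}".format(kind))
    return "{}:{}".format(kind, value)


def compare_filter(filter_name, op, kind, value):
    """Build a filter that takes a compare operator and a comparator."""
    if op not in COMPARE_OPERATORS:
        raise ValueError("unknown compare operator: {!r}".format(op))
    if kind in EQUALITY_ONLY and op not in ("=", "!="):
        raise ValueError("{} only supports = and !=".format(kind))
    # The comparator is a string argument: quote it, doubling any quote.
    quoted = "'" + comparator(kind, value).replace("'", "''") + "'"
    return "{}({}, {})".format(filter_name, op, quoted)


print(compare_filter("RowFilter", "=", "binaryprefix", "0047a"))
print(compare_filter("ValueFilter", "!=", "substring", "abc123"))
```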
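Examples
- binary:abc will match everything that is lexicographically greater than abc (when used with the > operator).
- binaryprefix:abc will match everything whose first three characters are lexicographically equal to abc.
- regexstring:ab*yz will match against the regular expression ab*yz, that is, an a followed by zero or more b characters and then yz.
- substring:abc123 will match everything that contains the substring abc123.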
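To make the four comparators concrete, here is a small local emulation in plain Python. It is only my approximation of the behaviour described above, not HBase code: `matches` is a hypothetical function, values are decoded as UTF-8, and I am assuming that the regular expression may match anywhere inside the value.

```python
import re

# Local emulation of "operator + comparator" against one cell value.
# This mirrors the descriptions in this section; it is not HBase code.


def matches(op, comparator, cell):
    """Return True if the bytes value `cell` passes `op` and `comparator`."""
    kind, _, raw = comparator.partition(":")
    value = raw.encode("utf-8")
    if kind == "binary":
        # Python compares bytes lexicographically, byte by byte.
        ordering = (cell > value) - (cell < value)
    elif kind == "binaryprefix":
        head = cell[:len(value)]  # only compare up to len(value) bytes
        ordering = (head > value) - (head < value)
    elif kind == "regexstring":
        # Assumption: the pattern may match anywhere in the value.
        text = cell.decode("utf-8", errors="replace")
        ordering = 0 if re.search(raw, text) else 1
    elif kind == "substring":
        text = cell.decode("utf-8", errors="replace")
        ordering = 0 if raw.lower() in text.lower() else 1  # case insensitive
    else:
        raise ValueError("unknown comparator type: {!r}".format(kind))
    return {
        "<": ordering < 0, "<=": ordering <= 0, "=": ordering == 0,
        "!=": ordering != 0, ">=": ordering >= 0, ">": ordering > 0,
    }[op]


print(matches(">", "binary:abc", b"abd"))
print(matches("=", "binaryprefix:abc", b"abcdef"))
print(matches("=", "regexstring:ab*yz", b"abbbyz"))
print(matches("=", "substring:abc123", b"xxABC123yy"))
```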
2.2.5 Filters
- KeyOnlyFilter
This filter takes no arguments. It returns only the key of each key-value, together with the row key (values are not included).
Original text: This filter doesn’t take any arguments. It returns only the key component of each key-value.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "KeyOnlyFilter()"
result = table.scan(filter=scan_filter)
for item in result:
    print(item)
- FirstKeyOnlyFilter
This filter takes no arguments. It returns only the first key-value of each row, together with the row key.
Original text: This filter doesn’t take any arguments. It returns only the first key-value from each row.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "FirstKeyOnlyFilter()"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- PrefixFilter
This filter takes a single argument, a row key prefix, and returns the rows whose row key starts with that prefix.
Original text: This filter takes one argument – a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "PrefixFilter('0047a')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- ColumnPrefixFilter
This filter takes one argument, a column prefix, and returns only the columns whose name starts with that prefix.
Original text: This filter takes one argument – a column prefix. It returns only those key-values present in a column that starts with the specified column prefix. The column prefix must be of the form: “qualifier”.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnPrefixFilter('box')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- MultipleColumnPrefixFilter
This filter takes a list of column prefixes and returns only the columns whose name starts with any of them.
Original text: This filter takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes. Each of the column prefixes must be of the form: “qualifier”.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "MultipleColumnPrefixFilter('box', 'create')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- ColumnCountGetFilter
This filter takes one argument, a limit. It returns the first limit columns of the first row.
Original text: This filter takes one argument – a limit. It returns the first limit number of columns in the table.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnCountGetFilter(6)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- PageFilter
This filter takes one argument, a page size, and returns page size rows.
Original text: This filter takes one argument – a page size. It returns page size number of rows from the table.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "PageFilter(5)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
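PageFilter on its own only returns the first page. To walk through a table page by page, one approach is to restart the scan just after the last row key of the previous page. The sketch below assumes this approach; `iter_pages` and `FakeTable` are hypothetical names of my own, and `FakeTable` is only a stand-in so the example runs without an HBase server. As far as I understand, HBase applies PageFilter separately on each region server, so I also pass `limit` to cap the number of rows on the client side.

```python
def iter_pages(table, page_size):
    """Yield lists of (row_key, data) pairs, at most page_size rows each.

    `table` is expected to behave like a happybase Table: scan() accepts
    row_start, filter and limit, and yields (row_key, data) pairs.
    """
    start = None
    while True:
        rows = list(table.scan(row_start=start,
                               filter="PageFilter({})".format(page_size),
                               limit=page_size))
        if not rows:
            return
        yield rows
        if len(rows) < page_size:
            return  # a short page means the table is exhausted
        # Row keys are byte strings; appending a zero byte gives the
        # smallest key that sorts strictly after the last row returned.
        start = rows[-1][0] + b"\x00"


class FakeTable:
    """In-memory stand-in for a happybase Table (illustration only)."""

    def __init__(self, rows):
        self.rows = dict(rows)

    def scan(self, row_start=None, filter=None, limit=None):
        keys = sorted(k for k in self.rows
                      if row_start is None or k >= row_start)
        for key in keys[:limit]:
            yield key, self.rows[key]


fake = FakeTable({b"row%d" % i: {b"info:n": b"%d" % i} for i in range(5)})
for page in iter_pages(fake, 2):
    print([row_key for row_key, _ in page])
```

With a real connection, `iter_pages(conn.table('test'), 5)` would be used in the same way.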
- ColumnPaginationFilter
This filter takes two arguments, a limit and an offset. It skips the first offset columns and then returns at most limit columns. It does this for every row.
Original text: This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnPaginationFilter(3, 7)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
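The argument order (limit first, then offset) is easy to mix up, so here is a local emulation of what ColumnPaginationFilter(3, 7) selects within one row. `paginate_columns` is a hypothetical function of my own, and sorting the `family:qualifier` names as byte strings is my approximation of the column order HBase uses.

```python
def paginate_columns(row, limit, offset):
    """Emulate ColumnPaginationFilter(limit, offset) for a single row.

    `row` maps column names (bytes such as b"info:c07") to values.
    """
    names = sorted(row)[offset:offset + limit]  # skip offset, keep limit
    return {name: row[name] for name in names}


row = {b"info:c%02d" % i: b"v%d" % i for i in range(12)}
print(sorted(paginate_columns(row, 3, 7)))
```

For this row of twelve columns, the emulation keeps the eighth, ninth and tenth columns.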
- InclusiveStopFilter
This filter takes one argument, the row key at which to stop scanning. It returns all columns of the rows up to and including that row key.
Original text: This filter takes one argument – a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "InclusiveStopFilter('005c2_4530489164_10599261608')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- TimeStampsFilter
This filter takes a list of timestamps and returns the key-values whose timestamp matches any of them.
Original text: This filter takes a list of timestamps. It returns those key-values whose timestamps matches any of the specified timestamps.
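I did not write an example for this filter, so the following is only a sketch. It assumes the filter is written as `TimestampsFilter` (lower-case s) in the filter string, which is how I believe the HBase filter parser registers it, and the timestamp values are made up. `timestamps_filter` and `scan_at_timestamps` are hypothetical helper names.

```python
def timestamps_filter(timestamps):
    """Build a filter string from an iterable of integer timestamps."""
    stamps = [int(t) for t in timestamps]
    if not stamps:
        raise ValueError("at least one timestamp is required")
    # Assumption: the filter name is spelled TimestampsFilter.
    return "TimestampsFilter({})".format(", ".join(str(t) for t in stamps))


def scan_at_timestamps(table, timestamps):
    """Scan a happybase table, keeping cells written at these timestamps.

    include_timestamp=True asks happybase to return (value, timestamp)
    pairs, which makes it easy to check what the filter kept.
    """
    return table.scan(filter=timestamps_filter(timestamps),
                      include_timestamp=True)


# Made-up millisecond timestamps, for illustration only.
print(timestamps_filter([1546272000000, 1546358400000]))
```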
- RowFilter
This filter takes a compare operator (=, !=, >, <, >=, <=) and a comparator (binary, binaryprefix, regexstring, substring). It compares each row key with the comparator using the compare operator; if the comparison returns true, the row key and all columns of that row are returned.
Original text: This filter takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that row.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "RowFilter(=, 'binary:0047a_4530641731_102627717')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- FamilyFilter
This filter takes a compare operator and a comparator. It compares each column family name with the comparator using the compare operator; if the comparison returns true, it returns, for every row, the row key and the columns in that column family.
Original text: This filter takes a compare operator and a comparator. It compares each column family name with the comparator using the compare operator and if the comparison returns true, it returns all the Cells in that column family.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "FamilyFilter(=, 'binary:info')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- QualifierFilter
This filter takes a compare operator and a comparator. It compares each column name (qualifier) with the comparator using the compare operator; if the comparison returns true, it returns, for every row, the row key and all matching columns.
Original text: This filter takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "QualifierFilter(=, 'binary:item_delivery_status')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- ValueFilter
This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator; if the comparison returns true, it returns, for every row, the row key and the matching key-values.
Original text: This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ValueFilter(=, 'binary:2')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- DependentColumnFilter
This filter takes two arguments, a column family and a column name (qualifier).
Original text: This filter takes two arguments – a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If the row doesn’t contain the specified column – none of the key-values in that row will be returned.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "DependentColumnFilter('info', 'store_code')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- SingleColumnValueFilter
This filter takes a column family, a column (qualifier), a compare operator and a comparator. It compares the value of the column identified by the family and qualifier with the comparator; if the comparison returns true, the row is emitted with all of its columns. If the specified column does not exist in a row, all the columns of that row are emitted.
Original text: This filter takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found – all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted. If the condition fails, the row will not be emitted.
Note: this filter actually accepts two more arguments, <filterIfColumnMissing_boolean> and <latest_version_boolean>, which indicate whether to filter out rows that lack the column and whether to look only at the latest version.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "SingleColumnValueFilter('info', 'item_delivery_status', =, 'binary:2', true, true)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
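The handling of rows that lack the tested column is the part of SingleColumnValueFilter that surprised me most, so here is a local emulation of it. It only covers the `=` operator with a binary comparator, and `single_column_value_rows` is a hypothetical function of my own, not HBase code.

```python
def single_column_value_rows(rows, column, expected, filter_if_missing=False):
    """Emulate SingleColumnValueFilter(..., =, 'binary:<expected>').

    `rows` maps row keys to {column_name: value} dicts. Rows that lack
    `column` are kept unless filter_if_missing is True.
    """
    kept = {}
    for row_key, data in rows.items():
        if column not in data:
            if not filter_if_missing:
                kept[row_key] = data  # missing column: row passes by default
            continue
        if data[column] == expected:
            kept[row_key] = data  # condition holds: emit the whole row
    return kept


rows = {
    b"r1": {b"info:item_delivery_status": b"2", b"info:store_code": b"A"},
    b"r2": {b"info:item_delivery_status": b"1", b"info:store_code": b"B"},
    b"r3": {b"info:store_code": b"C"},  # tested column is missing
}
column = b"info:item_delivery_status"
print(sorted(single_column_value_rows(rows, column, b"2")))
print(sorted(single_column_value_rows(rows, column, b"2", filter_if_missing=True)))
```

This is why the example above passes true as the fifth argument: without it, rows that have no item_delivery_status column at all would also come back.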
- SingleColumnValueExcludeFilter
This filter takes the same arguments as SingleColumnValueFilter. The difference is that, for the rows that satisfy the condition, all columns are emitted except the tested column.
Original text: This filter takes the same arguments and behaves same as SingleColumnValueFilter – however, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.
conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "SingleColumnValueExcludeFilter('info', 'item_delivery_status', =, 'binary:2')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
- ColumnRangeFilter
Original text: This filter is used for selecting only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.
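I have no tested example for this filter either. The sketch below assumes the argument order minColumn, include-min flag, maxColumn, include-max flag, following the description above; the column names are made up, and `column_range_filter` and `in_column_range` are hypothetical helpers of my own. The second function emulates locally which qualifiers such a range would keep.

```python
def column_range_filter(min_column, min_inclusive, max_column, max_inclusive):
    """Build a ColumnRangeFilter string (argument order is an assumption)."""
    def quote(text):
        return "'" + text.replace("'", "''") + "'"

    def boolean(flag):
        return "true" if flag else "false"

    return "ColumnRangeFilter({}, {}, {}, {})".format(
        quote(min_column), boolean(min_inclusive),
        quote(max_column), boolean(max_inclusive))


def in_column_range(qualifier, min_column, min_inclusive,
                    max_column, max_inclusive):
    """Local emulation: is `qualifier` between min_column and max_column?"""
    above = qualifier > min_column or (min_inclusive and qualifier == min_column)
    below = qualifier < max_column or (max_inclusive and qualifier == max_column)
    return above and below


scan_filter = column_range_filter("box", True, "create", False)
print(scan_filter)
print([q for q in ["a", "box", "box_no", "create", "item"]
       if in_column_range(q, "box", True, "create", False)])
```

The string can then be passed to `table.scan(filter=scan_filter)` in the same way as the other examples.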