Complex HBase Queries in Python with happybase, the Thrift API, and Filters

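1 Background

HappyBase is a developer-friendly Python library for interacting with Apache HBase. It gives application developers a Pythonic API for working with HBase.

For details, see the HappyBase documentation.
The scan API in happybase also exposes the filter interface of the HBase Thrift gateway, but happybase does not document the filter syntax in detail, and I could not find a thorough description elsewhere online.
So I read the HBase documentation and translated its "Thrift API and Filter Language" section. Section 2.1 below introduces the happybase scan interface, and section 2.2 contains the translated HBase material.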

2 Complex HBase queries with filters

2.1 The happybase scan interface

scan(row_start=None, row_stop=None, row_prefix=None, columns=None, filter=None, timestamp=None, include_timestamp=False, batch_size=1000, scan_batching=None, limit=None, sorted_columns=False, reverse=False)
The filter parameter carries the HBase filter expression. A simple example:

import happybase

hbase_host = ''  # fill in the host of your HBase Thrift server
hbase_port = 9090
# Connect to HBase through the Thrift gateway
conn = happybase.Connection(host=hbase_host, port=hbase_port)
table = conn.table('test')
# Filter: keep rows whose info:item_delivery_status equals '1'
scan_filter = "SingleColumnValueFilter('info', 'item_delivery_status', =, 'binary:1', true, true)"
# Run the scan
result = table.scan(filter=scan_filter)
# Print each row key and its columns
for row_key, item in result:
    print(row_key)
    print(item)
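
The other parameters in the scan signature can be combined with filter. The sketch below is one hedged way to do that: it narrows the scan with row_prefix, columns, and limit alongside the filter. The helper name scan_delivered, the row prefix, and the column names are illustrative assumptions, and table is expected to be a happybase Table such as the one created above.

```python
def scan_delivered(table, row_prefix=b'0047a', limit=10):
    """Return up to `limit` rows under `row_prefix` whose delivery status is '1'.

    `table` is assumed to be a happybase Table; the prefix and the column
    names are placeholders to replace with your own.
    """
    scan_filter = (
        "SingleColumnValueFilter("
        "'info', 'item_delivery_status', =, 'binary:1', true, true)"
    )
    return table.scan(
        row_prefix=row_prefix,                   # only row keys starting with this prefix
        columns=[b'info:item_delivery_status'],  # fetch just the tested column
        filter=scan_filter,                      # filter expression sent with the scan
        limit=limit,                             # stop after this many rows
    )
```

It would be used like the scan above, for example: for row_key, item in scan_delivered(table): print(row_key, item).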

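2.2 Filter syntax

The text in this part is translated from the HBase documentation ("Thrift API and Filter Language"); the code is my own. Before running it, replace the host, table name, column names, and so on with your own. The snippets assume that happybase has been imported and that TEST_HBASE_HOST holds the address of your HBase Thrift server.

2.2.1 General filter string syntax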

"FilterName (argument, argument,... , argument)"

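Syntax rules:

  • Give the filter name first, followed by a comma-separated argument list in parentheses.
  • If an argument is a string, enclose it in single quotes (').
  • If an argument is a boolean, an integer, or a comparison operator (such as <, >, or !=), do not enclose it in quotes.
  • The filter name must be a single word: any ASCII characters are allowed except spaces, single quotes, and parentheses.
  • Filter arguments may contain any ASCII character. A single quote inside an argument must be escaped by preceding it with another single quote.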
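
These quoting rules are easy to get wrong when filter strings are assembled by hand. A minimal sketch of a helper that applies them follows. quote_arg and build_filter are hypothetical names, not part of happybase or HBase, and treating every string that equals a comparison symbol as an operator is a simplification.

```python
# Comparison symbols that the filter language expects unquoted.
OPERATORS = {"<", "<=", "=", "!=", ">", ">="}


def quote_arg(value):
    """Render one Python value (bool, int, or str) as a filter argument."""
    if isinstance(value, bool):
        # Booleans are written bare as true / false.
        return "true" if value else "false"
    if isinstance(value, int):
        # Integers are written bare.
        return str(value)
    if value in OPERATORS:
        # Comparison operators are written bare.
        return value
    # Strings go in single quotes; an embedded single quote is doubled.
    return "'" + value.replace("'", "''") + "'"


def build_filter(name, *args):
    """Build a string of the form FilterName(arg, arg, ...)."""
    return "{}({})".format(name, ", ".join(quote_arg(a) for a in args))


print(build_filter("PrefixFilter", "0047a"))
print(build_filter("ColumnPaginationFilter", 3, 7))
print(build_filter("ValueFilter", "=", "binary:it's"))
```

The resulting string can be passed directly as the filter argument of table.scan.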

2.2.2 Compound filters and logical operators

Binary operators

  • AND
    Both conditions must be satisfied.
  • OR
    At least one of the conditions must be satisfied.

Unary operators

  • SKIP
    For a particular row, if any of the key-values fail the filter condition, the entire row is skipped.
  • WHILE
    For a particular row, key-values will be emitted until a key-value is reached that fails the filter condition.

Example

(Filter1 AND Filter2) OR (Filter3 AND Filter4)

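Operator precedence

  • Parentheses have the highest precedence.
  • The unary operators SKIP and WHILE come next and have equal precedence.
  • The binary operators come last; AND has higher precedence than OR.

Example 1

Filter1 AND Filter2 OR Filter3
is evaluated as
(Filter1 AND Filter2) OR Filter3

Example 2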

Filter1 AND SKIP Filter2 OR Filter3
is evaluated as
(Filter1 AND (SKIP Filter2)) OR Filter3
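
Because a compound filter is just a string, it can be assembled in Python before being passed to scan. The following sketch uses hypothetical helpers (all_of, any_of, skip) that add explicit parentheses, so the result does not depend on operator precedence. The filter strings reuse the column names from the earlier example and are placeholders.

```python
def all_of(*filters):
    """Join filter strings with AND inside one pair of parentheses."""
    return "(" + " AND ".join(filters) + ")"


def any_of(*filters):
    """Join filter strings with OR inside one pair of parentheses."""
    return "(" + " OR ".join(filters) + ")"


def skip(filter_string):
    """Prefix a filter with the unary SKIP operator."""
    return "SKIP " + filter_string


prefix = "PrefixFilter('0047a')"
status_1 = "SingleColumnValueFilter('info', 'item_delivery_status', =, 'binary:1', true, true)"
status_2 = "SingleColumnValueFilter('info', 'item_delivery_status', =, 'binary:2', true, true)"

# Rows under the prefix whose status is either 1 or 2.
scan_filter = all_of(prefix, any_of(status_1, status_2))
print(scan_filter)

# Skip an entire row as soon as one of its values equals 0.
print(skip("ValueFilter(!=, 'binary:0')"))
```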

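2.2.3 Comparison operators

  • Less than: LESS (<)
  • Less than or equal to: LESS_OR_EQUAL (<=)
  • Equal to: EQUAL (=)
  • Not equal to: NOT_EQUAL (!=)
  • Greater than or equal to: GREATER_OR_EQUAL (>=)
  • Greater than: GREATER (>)
  • No operation: NO_OP

In a filter string, write the comparison operators as the symbols <, <=, =, !=, >, >=.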

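2.2.4 Comparators

  • BinaryComparator - compares lexicographically against the specified byte array, using Bytes.compareTo(byte[], byte[]).
  • BinaryPrefixComparator - prefix comparison: compares lexicographically against the specified byte array, but only up to the length of that byte array.
  • RegexStringComparator - regular-expression comparison: matches the value against the given regular expression. Only the EQUAL and NOT_EQUAL operators are valid with this comparator.
  • SubStringComparator - substring comparison: matches if the given substring occurs in the value. The comparison is case-insensitive. Only the EQUAL and NOT_EQUAL operators are valid with this comparator.

The general syntax of a comparator is: ComparatorType:ComparatorValue

The ComparatorType for each comparator is: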

  • BinaryComparator - binary
  • BinaryPrefixComparator - binaryprefix
  • RegexStringComparator - regexstring
  • SubStringComparator - substring

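Examples

  1. binary:abc, used with the > operator, matches everything that is lexicographically greater than abc.
  2. binaryprefix:abc matches everything whose first three characters are lexicographically equal to abc.
  3. regexstring:ab*yz matches values against the regular expression ab*yz, that is, values containing an a followed by zero or more b characters and then yz. (The HBase text describes this as "doesn't begin with ab and ends with yz", which is not what the pattern means when read as a regular expression.)
  4. substring:abc123 matches everything that contains the substring abc123.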
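
To make the four comparator types concrete, here is a rough local emulation in plain Python. It is only an illustration of the semantics described above, not HBase code: comparator_matches is a hypothetical helper, Python's re module stands in for Java regular expressions (the two dialects differ in places), and substring is treated as case-insensitive, which is how the HBase reference describes SubStringComparator.

```python
import re


def comparator_matches(comparator, operator, cell):
    """Locally mimic how a comparator string judges one cell value (bytes)."""
    kind, _, value = comparator.partition(":")
    target = value.encode("utf-8")
    if kind == "binary":
        # Lexicographic comparison of the whole byte string.
        order = (cell > target) - (cell < target)
    elif kind == "binaryprefix":
        # Compare only the first len(target) bytes of the cell.
        head = cell[:len(target)]
        order = (head > target) - (head < target)
    elif kind in ("regexstring", "substring"):
        if operator not in ("=", "!="):
            raise ValueError(kind + " only supports = and !=")
        text = cell.decode("utf-8")
        if kind == "regexstring":
            # The pattern may match anywhere in the value.
            found = re.search(value, text) is not None
        else:
            # Case-insensitive containment test.
            found = value.lower() in text.lower()
        order = 0 if found else 1
    else:
        raise ValueError("unknown comparator type: " + kind)
    return {
        "<": order < 0, "<=": order <= 0, "=": order == 0,
        "!=": order != 0, ">=": order >= 0, ">": order > 0,
    }[operator]


print(comparator_matches("binary:abc", ">", b"abd"))
print(comparator_matches("binaryprefix:abc", "=", b"abcdef"))
print(comparator_matches("regexstring:ab*yz", "=", b"xxabbbyz"))
print(comparator_matches("substring:abc123", "=", b"xxABC123yy"))
```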

2.2.5 Filters

  • KeyOnlyFilter
    This filter takes no arguments. For each key-value it returns only the key (row key and column name), with the value omitted.

HBase original: This filter doesn’t take any arguments. It returns only the key component of each key-value.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "KeyOnlyFilter()"
result = table.scan(filter=scan_filter)
for item in result:
    print(item)

  • FirstKeyOnlyFilter
    This filter takes no arguments. It returns only the first key-value of each row, together with the row key.

HBase original: This filter doesn’t take any arguments. It returns only the first key-value from each row.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "FirstKeyOnlyFilter()"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • PrefixFilter
    This filter takes one argument, a row-key prefix, and returns the rows whose keys start with that prefix.

HBase original: This filter takes one argument – a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "PrefixFilter('0047a')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • ColumnPrefixFilter
    This filter takes one argument, a column prefix, and returns only the columns whose names start with that prefix.

HBase original: This filter takes one argument – a column prefix. It returns only those key-values present in a column that starts with the specified column prefix. The column prefix must be of the form: “qualifier”.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnPrefixFilter('box')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • MultipleColumnPrefixFilter
    This filter takes a list of column prefixes and returns only the columns whose names start with any of them.

HBase original: This filter takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes. Each of the column prefixes must be of the form: “qualifier”.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "MultipleColumnPrefixFilter('box', 'create')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • ColumnCountGetFilter
    This filter takes one argument, a limit, and returns the first limit columns of the first row.

HBase original: This filter takes one argument – a limit. It returns the first limit number of columns in the table.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnCountGetFilter(6)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • PageFilter
    This filter takes one argument, a page size, and returns that many rows.

HBase original: This filter takes one argument – a page size. It returns page size number of rows from the table.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "PageFilter(5)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • ColumnPaginationFilter
    This filter takes two arguments, a limit and an offset. For every row, it skips the first offset columns and returns the next limit columns.

HBase original: This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnPaginationFilter(3, 7)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • InclusiveStopFilter
    This filter takes one argument, the row key at which to stop scanning, and returns all columns of every row up to and including that row.

HBase original: This filter takes one argument – a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "InclusiveStopFilter('005c2_4530489164_10599261608')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • TimeStampsFilter
    This filter takes a list of timestamps and returns the key-values whose timestamp matches any of them.

HBase original: This filter takes a list of timestamps. It returns those key-values whose timestamps matches any of the specified timestamps.
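
No example accompanies this filter, so here is a hedged sketch. The helper names are hypothetical and the timestamps are arbitrary sample values. The filter name is spelled TimestampsFilter here to match the Java class name, which is an assumption: the HBase text quoted above writes TimeStampsFilter, so check which spelling your HBase version accepts. table is assumed to be a happybase Table.

```python
def timestamps_filter(*timestamps):
    """Build a filter string that matches cells with any of the given timestamps."""
    # Timestamps are integers, so they are written without quotes.
    return "TimestampsFilter({})".format(", ".join(str(int(ts)) for ts in timestamps))


def scan_by_timestamps(table, *timestamps):
    """Scan `table` for cells with the given timestamps, returning timestamps too."""
    return table.scan(
        filter=timestamps_filter(*timestamps),
        include_timestamp=True,  # each value comes back paired with its timestamp
    )


print(timestamps_filter(74689, 89734))
```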

  • RowFilter
    This filter takes a comparison operator (=, !=, >, <, >=, <=) and a comparator (binary, binaryprefix, regexstring, substring). It compares each row key with the comparator using the operator; if the comparison is true, the row key and all columns of that row are returned.

HBase original: This filter takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that row.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "RowFilter(=, 'binary:0047a_4530641731_102627717')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • FamilyFilter
    This filter takes a comparison operator and a comparator. It compares each column family name with the comparator using the operator; if the comparison is true, the row key and the columns in that column family are returned.

HBase original: This filter takes a compare operator and a comparator. It compares each column family name with the comparator using the compare operator and if the comparison returns true, it returns all the Cells in that column family.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "FamilyFilter(=, 'binary:info')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • QualifierFilter
    This filter takes a comparison operator and a comparator. It compares each column name (qualifier) with the comparator using the operator; if the comparison is true, the row key and the matching columns are returned.

HBase original: This filter takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "QualifierFilter(=, 'binary:item_delivery_status')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • ValueFilter
    This filter takes a comparison operator and a comparator. It compares each value with the comparator using the operator; if the comparison is true, the row key and the matching key-value are returned.

HBase original: This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ValueFilter(=, 'binary:2')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
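    print(item)

  • DependentColumnFilter
    This filter takes two arguments, a column family and a qualifier. It looks for that column in each row and returns the key-values in the row that have the same timestamp as it. If a row does not contain the column, none of its key-values are returned.

HBase original: This filter takes two arguments – a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If the row doesn’t contain the specified column – none of the key-values in that row will be returned.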

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "DependentColumnFilter('info', 'store_code')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • SingleColumnValueFilter
    This filter takes a column family, a qualifier, a comparison operator, and a comparator. It compares the value of the column identified by the family and qualifier with the comparator. If the comparison is true, the row is returned with all of its columns; if it is false, the row is not returned. If a row does not contain the specified column, all of its columns are returned by default.

HBase original: This filter takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found – all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted. If the condition fails, the row will not be emitted.

Note: the filter also accepts two further arguments, <filterIfColumnMissing_boolean> and <latest_version_boolean>, which control whether rows lacking the column are filtered out and whether only the latest version is tested.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = ("SingleColumnValueFilter("
               "'info', 'item_delivery_status', =, 'binary:2', true, true)")
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)

  • SingleColumnValueExcludeFilter
    This filter takes the same arguments as SingleColumnValueFilter and behaves the same way, except that when the column is found and the condition passes, the row's columns are returned without the tested column.

HBase original: This filter takes the same arguments and behaves same as SingleColumnValueFilter – however, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = ("SingleColumnValueExcludeFilter("
               "'info', 'item_delivery_status', =, 'binary:2')")
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
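    print(item)

  • ColumnRangeFilter
    This filter selects only the columns whose names lie between minColumn and maxColumn. Two boolean arguments say whether each bound is included.

HBase original: This filter is used for selecting only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.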
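
No example accompanies this filter either. A minimal sketch follows, assuming the argument order minColumn, minColumnInclusive, maxColumn, maxColumnInclusive; column_range_filter is a hypothetical helper, and the column names abc and xyz are placeholders.

```python
def column_range_filter(min_column, max_column, include_min=True, include_max=False):
    """Build a ColumnRangeFilter string for columns between two qualifiers."""
    def quote(text):
        # Strings go in single quotes; embedded single quotes are doubled.
        return "'" + text.replace("'", "''") + "'"

    def flag(value):
        # Booleans are written bare as true / false.
        return "true" if value else "false"

    return "ColumnRangeFilter({}, {}, {}, {})".format(
        quote(min_column), flag(include_min), quote(max_column), flag(include_max)
    )


# Columns from abc (inclusive) up to xyz (exclusive).
print(column_range_filter("abc", "xyz"))
```

As with the other filters, the string is passed to table.scan through the filter argument.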

3 References

1. happybase documentation
2. HBase documentation - Thrift API and Filter Language
