摘要
Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite’s architecture consists of a modular and extensible query optimizer
with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in bigdata frameworks. It is an active project that continues to introduce
support for the new types of data sources, query languages, and approaches to query processing and optimization
Apache Calcite是一個(gè)基礎(chǔ)的框架非驮,它提供查詢處理灭袁,優(yōu)化器惭笑,拓展查詢語言落君,這些拓展語言可以支持許多流行的開源數(shù)據(jù)處理系統(tǒng)浓恳,例如 Apache Hive, Apache Storm, Apache Flink, Druid, and MapD缺狠。Apache Calcite
主要組成部分
模塊 | 描述 |
---|---|
查詢解析器 | 處理多樣化查詢語言 |
查詢優(yōu)化器 | 模塊化宇攻,可拓展,包含幾百個(gè)內(nèi)建優(yōu)化規(guī)則 |
適配器 | 可拓展奕锌,支持異構(gòu)多種數(shù)據(jù)模型(關(guān)系型著觉,半結(jié)構(gòu)型,流式惊暴,地理hash) |
關(guān)鍵詞
Apache Calcite饼丘,關(guān)系語法,數(shù)據(jù)管理辽话,查詢關(guān)系代數(shù)肄鸽,模塊化優(yōu)化器,存儲(chǔ)適配器
1. 引言
Following the seminal System R, conventional relational database engines dominated the data processing landscape. Yet, as far back as 2005, Stonebraker and ?etintemel [49] predicted that we would see the rise a collection of specialized engines such as column stores,stream processing engines, text search engines, and so forth. Theyargued that specialized engines can offer more cost-effective performance and that they would bring the end of the “one size fits all” paradigm. Their vision seems today more relevant than ever.Indeed, many specialized open-source data systems have since become popular such as Storm [50] and Flink [16] (stream processing),
Elasticsearch [15] (text search), Apache Spark [47], Druid [14], etc.As organizations have invested in data processing systems tailored towards their specific needs, two overarching problems have
arisen:
? The developers of such specialized systems have encountered related problems, such as query optimization [4, 25]
or the need to support query languages such as SQL and related extensions (e.g., streaming queries [26]) as well as
language-integrated queries inspired by LINQ [33]. Without a unifying framework, having multiple engineers independently develop similar optimization logic and language support wastes engineering effort.
? Programmers using these specialized systems often have to integrate several of them together. An organization might rely on Elasticsearch, Apache Spark, and Druid. We need to build systems capable of supporting optimized queries across heterogeneous data sources [55].
Apache Calcite was developed to solve these problems. It isa complete query processing system that provides much of the common functionality—query execution, optimization, and query languages—required by any database management system, except for data storage and management, which are left to specialized engines. Calcite was quickly adopted by Hive, Drill [13], Storm, and many other data processing engines, providing them with advanced query optimizations and query languages.1 For example,Hive [24] is a popular data warehouse project built on top of ApacheHadoop. As Hive moved from its batch processing roots t owards an interactive SQL query answering platform, it became clear that the project needed a powerful optimizer at its core. Thus, Hive adopted Calcite as its optimizer and their integration has been growing since.Many other projects and products have followed suit, including Flink, MapD [12], etc.
Furthermore, Calcite enables cross-platform optimization by exposing a common interface to multiple systems. To be efficient, the optimizer needs to reason globally, e.g., make decisions across different systems about materialized view selection.
看到一系列專業(yè)方向引擎油啤,例如列數(shù)據(jù)庫贴捡,流處理引擎,文本搜索引擎等等村砂,將會(huì)逐步發(fā)展起來,結(jié)束“one size fits all” 的時(shí)代
隨著組織為了滿足特定的訴求開發(fā)數(shù)據(jù)處理系統(tǒng)時(shí)候屹逛,出現(xiàn)了兩個(gè)難題
● 定制系統(tǒng)的開發(fā)工程師幾乎遇到了相同類似的問題础废,例如查詢優(yōu)化器[4,25]汛骂,或者查詢語言(SQL[26])和拓展語言(流式查詢,集成語言的查詢LINQ[33])的支持评腺,如果沒有一個(gè)統(tǒng)一的框架帘瞭,許多工程師各自開發(fā)了一套相似的優(yōu)化邏輯和語言支持,就會(huì)浪費(fèi)大量的精力蒿讥,不可復(fù)用蝶念。
● 用定制數(shù)據(jù)系統(tǒng)的開發(fā)者經(jīng)常不得不把他們集成在一起,一個(gè)團(tuán)隊(duì)也許依賴Elasticsearch, Apache Spark, and Druid芋绸,我們需要構(gòu)建跨越異構(gòu)數(shù)據(jù)源支持優(yōu)化查詢的能力
開發(fā)Apache Calcite是為了解決以上的問題媒殉,它是一個(gè)完整的查詢處理系統(tǒng),提供許多核心的通用能力摔敛,比如查詢執(zhí)行器廷蓉,優(yōu)化器,查詢語言马昙,它不支持?jǐn)?shù)據(jù)存儲(chǔ)和數(shù)據(jù)管理桃犬,這些留給了定制數(shù)據(jù)系統(tǒng)各自去實(shí)現(xiàn)。
Building a common framework does not come without challenges. In particular, the framework needs to be extensible and flexible enough to accommodate the different types of systems requiring integration.We believe that the following features have contributed to Calcite’s wide adoption in the open source community and industry:
- Open source friendliness. Many of the major data processing platforms of the last decade have been either open
source or largely based on open source. Calcite is an opensource framework, backed by the Apache Software Foundation (ASF) [5], which provides the means to collaboratively develop the project. Furthermore, the software is written in Java making it easier to interoperate with many of the latest data processing systems [12, 13, 16, 24, 28, 44] that are often written themselves in Java (or in the JVM-based Scala), especially those in the Hadoop ecosystem.- Multiple data models. Calcite provides support for query optimization and query languages using both streaming
and conventional data processing paradigms. Calcite treats streams as time-ordered sets of records or events that are not persisted to the disk as they would be in conventional data processing systems.- Flexible query optimizer. Each component of the optimizer is pluggable and extensible, ranging from rules to cost
models. In addition, Calcite includes support for multiple planning engines. Hence, the optimization can be broken down into phases handled by different optimization engines depending on which one is best suited for the stage.- Cross-system support. The Calcite framework can run and optimize queries across multiple query processing systems and database backends.
- Reliability. Calcite is reliable, as its wide adoption over many years has led to exhaustive testing of the platform.
Calcite also contains an extensive test suite validating all components of the system including query optimizer rules
and integration with backend data sources.- Support for SQL and its extensions. Many systems do not provide their own query language, but rather prefer to rely
on existing ones such as SQL. For those, Calcite provides support for ANSI standard SQL, as well as various SQL dialects and extensions, e.g., for expressing queries on streaming or nested data. In addition, Calcite includes a driver conforming to the standard Java API (JDBC). The remainder is organized as follows. Section
構(gòu)造一個(gè)通用框架有很多挑戰(zhàn)行楞,這個(gè)框架需要是足夠靈活的攒暇,可拓展的去支持不同系統(tǒng)的集成需求,下面的特性幫助Calcite被廣大社區(qū)和生產(chǎn)環(huán)境使用:
- 友好的開源氛圍
- 多種數(shù)據(jù)模型
- 靈活的優(yōu)化器
- 跨系統(tǒng)支持
- 可靠性
- 支持SQL語法和可拓展
The remainder is organized as follows. Section 2 discusses related work. Section 3 introduces Calcite’s architecture and its main components. Section 4 describes the relational algebra at the core of Calcite. Section 5 presents Calcite’s adapters, an abstraction to define how to read external data sources. In turn, Section 6 describes Calcite’s optimizer and its main features, while Section 7 presents the extensions to handle different query processing paradigms. Section 8 provides an overview of the data processing systems already using Calcite. Section 9 discusses possible future extensions for the framework before we conclude in Section 10.
剩余部分組織如下
第2部分:討論相關(guān)工作
第3部分:介紹Calcite的架構(gòu)和它的主要組件
第4部分:介紹Calcite的核心-關(guān)系代數(shù)
第5部分:展示Calcite的適配器子房,定義了怎樣讀取外部數(shù)據(jù)源的一種抽象
第6部分:反過來描述Calcite優(yōu)化器和它的核心功能
第7部分:處理不同查詢解析的拓展
第8部分:統(tǒng)計(jì)和概覽已經(jīng)使用Calcite的引擎
第9部分:討論未來框架可能的功能拓展
第10部分:結(jié)論
2. 相關(guān)工作
雖然Calcite現(xiàn)在被很多Hadoop生態(tài)系統(tǒng)的大數(shù)據(jù)分析引擎使用形用,但是它背后許多的技術(shù)點(diǎn)都不是新穎的,比如
查詢優(yōu)化的想法來源于
- Volcano【The Volcano Optimizer Generator- Extensibility and Efficient Search】
- Cascades【The Cascades Framework for Query Optimization】
物化重寫 - 【Optimizing Queries with Materialized Views. In Proceedings of the Eleventh International Conference on Data Engineering】
- 【Optimizing Queries Using Materialized Views: A Practical, Scalable Solution】
- 【Implementing Data Cubes Efficiently
Though Calcite is currently the most widely adopted optimizer for big-data analytics in the Hadoop ecosystem, many of the ideas that lie behind it are not novel. For instance, the query optimizer builds on ideas from the Volcano [20] and Cascades [19] frameworks,incorporating other widely used optimization techniques such as materialized view rewriting [10, 18, 22]. There are other systems that try to fill a similar role to Calcite.
?? Orca [45] is a modular query optimizer used in data management products such as Greenplum and HAWQ. Orca decouples the optimizer from the query execution engine by implementing a framework for exchanging information between the two known as Data eXchange Language. Orca also provides tools for verifying the correctness and performance of generated query plans. In contrast to Orca, Calcite can be used as a standalone query execution engine that federates multiple storage and processing backends, including pluggable planners, and optimizers.
?? Spark SQL [3] extends Apache Spark to support SQL query executionwhich can also execute queries over multiple data sourcesas in Calcite. However, although the Catalyst optimizer in Spark SQL also attempts to minimize query execution cost, it lacks the dynamic programming approach used by Calcite and risks falling
into local minima.
?? Algebricks [6] is a query compiler architecture that provides a data model agnostic algebraic layer and compiler framework for big data query processing. High-level languages are compiled to Algebricks logical algebra. Algebricks then generates an optimized job targeting the Hyracks parallel processing backend. While Calcite shares a modular approach with Algebricks, Calcite also includes a support for cost-based optimizations. In the current version of Calcite, the query optimizer architecture uses dynamic programming-based planning based on Volcano [20] with extensions for multi-stage optimizations as in Orca [45]. Though in principle Algebricks could support multiple processing backends (e.g.,Apache Tez, Spark), Calcite has provided well-tested support for
diverse backends for many years.
??Garlic [7] is a heterogeneous data management system which represents data from multiple systems under a unified object model. However, Garlic does not support query optimization across different systems and relies on each system to optimize its own queries.
??FORWARD [17] is a federated query processor that implementsa superset of SQL called SQL++ [38]. SQL++ has a semi-structured data model that integrate both JSON and relational data models whereas Calcite supports semi-structured data models by representing them in the relational data model during query planning.
FORWARD decomposes federated queries written in SQL++ into subqueries and executes them on the underlying databases according to the query plan. The merging of data happens inside the
FORWARD engine.
??Another federated data storage and processing system is Big-DAWG, which abstracts a wide spectrum of data models including relational, time-series and streaming. The unit of abstraction in
BigDAWG is called an island of information. Each island of information has a query language, data model and connects to one or more storage systems. Cross storage system querying is supported within the boundaries of a single island of information. Calcite instead provides a unifying relational abstraction which allows querying
across backends with different data models.
?? Myria is a general-purpose engine for big data analytics, with
advanced support for the Python language [21]. It produces query
plans for other backend engines such as Spark and PostgreSQL.
3. 架構(gòu)
Calcite contains many of the pieces that comprise a typical database management system. However, it omits some key components, e.g., storage of data, algorithms to process data, and a repository for storing metadata. These omissions are deliberate: it makes Calcite an excellent choice for mediating between applications having one
or more data storage locations and using multiple data processing engines. It is also a solid foundation for building bespoke data processing systems.
Figure 1 outlines the main components of Calcite’s architecture. Calcite’s optimizer uses a tree of relational operators as its internal representation. The optimization engine primarily consists of three components: rules, metadata providers, and planner engines. We discuss these components in more detail in Section 6. In the figure, the dashed lines represent possible external interactions with the framework. There are different ways to interact with Calcite.
圖1大概描述了Calcite架構(gòu)的主要組件池颈。Calcite優(yōu)化器使用關(guān)系表達(dá)式樹來作為它的內(nèi)在表示尾序。優(yōu)化器主要由規(guī)則,元數(shù)據(jù)提供者和計(jì)劃引擎三部分組成躯砰,虛線表示外部和框架的交互每币,有許多不同的和Calcite交互的方式。
??First, Calcite contains a query parser and validator that can translate a SQL query to a tree of relational operators. As Calcite does not contain a storage layer, it provides a mechanism to define
table schemas and views in external storage engines via adapters(described in Section 5), so it can be used on top of these engines.
??Second, although Calcite provides optimized SQL support to systems that need such database language support, it also provides optimization support to systems that already have their own language
parsing and interpretation:
- Some systems support SQL queries, but without or with limited query optimization. For example, both Hive and Spark initially offered support for the SQL language, but they did not include an optimizer. For such cases, once the query has been optimized, Calcite can translate the relational expression back to SQL. This feature allows Calcite to work as a stand-alone system on top of any data management system
with a SQL interface, but no optimizer.- The Calcite architecture is not only tailored towards optimizing SQL queries. It is common that data processing
systems choose to use their own parser for their own query language. Calcite can help optimize these queries as well.
Indeed, Calcite also allows operator trees to be easily constructed by directly instantiating relational operators. One
can use the built-in relational expressions builder interface. For instance, assume that we want to express the following Apache Pig [41] script using the expression builder
This interface exposes the main constructs necessary for building relational expressions. After the optimization phase
is finished, the application can retrieve the optimized relational expression which can then be mapped back to the
system’s query processing unit.
第一琢歇,Calcite包含一個(gè)查詢解析器和校驗(yàn)器兰怠,它能轉(zhuǎn)化一個(gè)SQL查詢到一個(gè)關(guān)系表達(dá)式樹。由于Calcite并沒有自己的存儲(chǔ)層李茫,它通過適配器來表達(dá)外部存儲(chǔ)引擎
第二揭保,Calcite也支持框架有自己sql解析器,之后進(jìn)行優(yōu)化
舉個(gè)例子魄宏,對于有自己sql解析器的框架的sql語句如下所示
可以用builder代碼轉(zhuǎn)化為calcite的 relNode關(guān)系表達(dá)式
Calcite可以對relNode進(jìn)行優(yōu)化秸侣,之后可以再把關(guān)系表達(dá)式再轉(zhuǎn)換為SQL
4.查詢關(guān)系代數(shù)
??Operators. Relational algebra [11] lies at the core of Calcite. In addition to the operators that express the most common data manipulation operations, such as filter, project, join etc., Calcite includes additional operators that meet different purposes, e.g., being able to concisely represent complex operations, or recognize optimization
opportunities more efficiently. For instance, it has become common for OLAP, decision making, and streaming applications to use window definitions to express complex analytic functions such as moving average of a quantity
over a time period or number or rows. Thus, Calcite introduces a window operator that encapsulates the window definition, i.e., upper and lower bound, partitioning etc., and the aggregate functions to execute on each window.
?? Traits. Calcite does not use different entities to represent logical and physical operators. Instead, it describes the physical properties associated with an operator using traits. These traits help the optimizer evaluate the cost of different alternative plans. Changing a trait value does not change the logical expression being evaluated, i.e., the rows produced by the given operator will still be the same.
?? During optimization, Calcite tries to enforce certain traits on relational expressions, e.g., the sort order of certain columns. Relational operators can implement a converter interface that indicates how to convert traits of an expression from one value to another.
?? Calcite includes common traits that describe the physical properties of the data produced by a relational expression, such as ordering, grouping, and partitioning. Similar to the SCOPE optimizer [57], the Calcite optimizer can reason about these properties and exploit them to find plans that avoid unnecessary operations. For example, if the input to the sort operator is already correctly ordered— possibly because this is the same order used for rows in the backend system—then the sort operation can be removed.
?? In addition to these properties, one of the main features of Calcite is the calling convention trait. Essentially, the trait represents the data processing system where the expression will be executed. Including the calling convention as a trait allows Calcite to meet its goal of optimizing transparently queries whose execution might
span over different engines i.e., the convention will be treated as any other physical property.
?? For example, consider joining a Products table held in MySQL to an Orders table held in Splunk (see Figure 2). Initially, the scan of Orders takes place in the splunk convention and the scan of Products is in the jdbc-mysql convention. The tables have to be scanned inside their respective engines. The join is in the logical convention,
meaning that no implementation has been chosen. Moreover, the SQL query in Figure 2 contains a filter (where clause) which is pushed into splunk by an adapter-specific rule (see Section 5). One possible implementation is to use Apache Spark as an external engine: the join is converted to spark convention, and its inputs are converters from jdbc-mysql and splunk to spark convention. But there is a more efficient implementation: exploiting the fact that Splunk can perform lookups into MySQL via ODBC, a planner rule pushes the join through the splunk-to-spark converter, and the join is now in splunk convention, running inside the Splunk engine.
4.1 Operators
關(guān)系代數(shù)[11]是Calcite的核心,Calcite不僅僅包含可以表示大部分通用數(shù)據(jù)操作的標(biāo)識符(例如filter, project, join等),還包括額外的操作符味榛,這些操作符可以滿足不同的目的椭坚,比如簡明的表達(dá)復(fù)雜操作,或者辨識更有效的優(yōu)化機(jī)會(huì)
4.2 Traits
Calcite沒有使用不同的實(shí)體來表示邏輯和物理操作符搏色,替代的善茎,它用帶有Traits的操作符來描述物理屬性,這些特性幫助優(yōu)化器來評估不同替代計(jì)劃的代價(jià)频轿。改變一個(gè)特質(zhì)的值不會(huì)改變邏輯表達(dá)式垂涯。例如這些由操作符產(chǎn)生的行依然不變。
RelTrait 與 RelTraitDef
RelTrait 表示 RelNode 的一種性質(zhì)航邢,用于指定物理執(zhí)行屬性耕赘,比如是否需要排序,數(shù)據(jù)的分布(distribution)翠忠,它使用時(shí)由 RelTraitDef 來定義鞠苟,目前分為三類:
ConventionTraitDef,表示由何種數(shù)據(jù)處理引擎處理秽之,對應(yīng)的 RelTrait 類型為 Convention泽本,邏輯執(zhí)行計(jì)劃中玄货,其值默認(rèn)為 org.apache.calcite.plan.Convention#NONE渺杉,物理執(zhí)行計(jì)劃的對應(yīng)的值在優(yōu)化前臭猜,通過方法 org.apache.calcite.plan.RelOptPlanner#changeTraits 指定,Calcite 已經(jīng)定義的有 EnumerableConvention河质,BindableConvention冀惭,JdbcConvention 等。如果為 EnumerableConvention掀鹅,那么生成的物理執(zhí)行計(jì)劃將由 Calcite 的 linq4j 引擎執(zhí)行散休,此外每種 Convention 都對應(yīng)具體的關(guān)系表達(dá)式的轉(zhuǎn)換規(guī)則。
RelCollationTraitDef乐尊,表示排序規(guī)則的定義戚丸,對應(yīng)的 RelTrait 為 RelCollation。比如對于排序表達(dá)式(也稱算子) org.apache.calcite.rel.core.Sort扔嵌,就存在一個(gè) RelCollation 類型的屬性 collation限府。
RelDistributionTraitDef,表示數(shù)據(jù)在物理存儲(chǔ)上的分布方式痢缎,對應(yīng)的 RelTrait 為 RelDistribution胁勺。
另外,對于每個(gè) RelNode 對象独旷,都會(huì)有 RelTraitSet署穗,這是 RelTrait 的一個(gè)有序集合寥裂,RelNode 的 RelTrait 都是保存在該集合中的。
在優(yōu)化過程中蛇捌,Calcite嘗試實(shí)施確定的特質(zhì)在關(guān)系表達(dá)式上抚恒,例如確定列的排序順序,關(guān)系表達(dá)式可以實(shí)現(xiàn)
convert接口來指示怎樣把一個(gè)表達(dá)式的特質(zhì)轉(zhuǎn)換為另一個(gè)络拌。
??上面的查詢描述了ConventionTrait,掃描Orders表發(fā)生在splunk convention回溺,掃描Products表發(fā)生在jdbc-mysql convention春贸,各自引擎進(jìn)行掃描。
??一個(gè)可能的實(shí)現(xiàn)就是用Apache Spark作為一個(gè)外部引擎遗遵,這個(gè)連接join被轉(zhuǎn)化為spark convention萍恕,join的輸入為jdbc-mysql和splunk.
??但是也有一個(gè)更有效的實(shí)現(xiàn)。利用Splunk可以使用ODBC訪問MySQL的原理车要,使用splunk convention查詢允粤。再通過splunk-to-spark converter運(yùn)行在Splunk引擎。
5.適配器
An adapter is an architectural pattern that defines how Calcite incorporates diverse data sources for general access. Figure 3 depicts its components. Essentially, an adapter consists of a model, a schema, and a schema factory. The model is a specification of the physical properties of the data source being accessed. A schema is
the definition of the data (format and layouts) found in the model.The data itself is physically accessed via tables. Calcite interfaces with the tables defined in the adapter to read the data as the query is being executed. The adapter may define a set of rules that are added to the planner. For instance, it typically includes rules to convert various types of logical relational expressions to the corresponding relational expressions of the adapter’s convention. The schema factory component acquires the metadata information from the model and generates a schema.
As discussed in Section 4, Calcite uses a physical trait known as the calling convention to identify relational operators which correspond to a specific database backend. These physical operators implement the access paths for the underlying tables in each adapter.When a query is parsed and converted to a relational algebra expression,
an operator is created for each table representing a scan of the data on that table. It is the minimal interface that an adapter must implement. If an adapter implements the table scan operator, the Calcite optimizer is then able to use client-side operators such as sorting, filtering, and joins to execute arbitrary SQL queries against these tables.
?This table scan operator contains the necessary information theadapter requires to issue the scan to the adapter’s backend database.To extend the functionality provided by adapters, Calcite defines an enumerable calling convention. Relational operators with the enumerable calling convention simply operate over tuples via an iterator interface. This calling convention allows Calcite to implement operators which may not be available in each adapter’s
backend. For example, the EnumerableJoin operator implements joins by collecting rows from its child nodes and joining on the desired attributes.
?For queries which only touch a small subset of the data in a table, it is inefficient for Calcite to enumerate all tuples. Fortunately, the same rule-based optimizer can be used to implement adapter-specific rules for optimization. For example, suppose a query involves filtering and sorting on a table. An adapter which can perform filtering on the backend can implement a rule which matches a LogicalFilter and converts it to the adapter’s calling convention. This rule converts the LogicalFilter into another Filter instance. This new Filter node has a lower associated cost that allows Calcite to optimize queries across adapters.
?The use of adapters is a powerful abstraction that enables not only optimization of queries for a specific backend, but also across multiple backends. Calcite is able to answer queries involving tables across multiple backends by pushing down all possible logic to each backend and then performing joins and aggregations on the resulting data. Implementing an adapter can be as simple as providing a table scan operator or it can involve the design of many advanced optimizations. Any expression represented in the relational algebra can be pushed down to adapters with optimizer rules.
??本質(zhì)上翼岁,適配器由模型类垫,模式,模式工廠三部分組成琅坡,模型是接觸的數(shù)據(jù)源物理屬性的明確說明悉患,在calcite中它是一個(gè)json配置文件,模式是在模型中發(fā)現(xiàn)的數(shù)據(jù)(格式和布局)的定義榆俺。
這部分可以看下源碼demo
??對于只能接觸一個(gè)表中小部分?jǐn)?shù)據(jù)的查詢售躁,對于Calcite來說迭代查詢所有元組是低效,基于規(guī)則的優(yōu)化器可以用來實(shí)現(xiàn)適配器定制的規(guī)則來用于優(yōu)化茴晋。例如陪捷,假設(shè)對于一張表執(zhí)行排序和過濾查詢,
可以針對后臺(tái)數(shù)據(jù)庫實(shí)行過濾的適配器可以實(shí)現(xiàn)一個(gè)規(guī)則诺擅,這個(gè)規(guī)則的觸發(fā)條件是匹配到LogicalFilter市袖,之后將它轉(zhuǎn)換成適配器的calling convention,這個(gè)規(guī)則把LogicalFilter轉(zhuǎn)換成另外的過濾器實(shí)例掀虎,這個(gè)新的過濾器實(shí)例擁有更小的關(guān)聯(lián)代價(jià)凌盯,允許Calcite跨優(yōu)化器執(zhí)行優(yōu)化(條件下推)。
6.查詢過程和優(yōu)化器
The query optimizer is the main component in the framework.Calcite optimizes queries by repeatedly applying planner rules to a relational expression. A cost model guides the process, and the planner engine tries to generate an alternative expression that has the same semantics as the original but a lower cost.
Every component in the optimizer is extensible. Users can add relational operators, rules, cost models, and statistics.
??查詢優(yōu)化器是框架里的核心組件烹玉,Calcite通過對一個(gè)關(guān)系表達(dá)式不斷地執(zhí)行計(jì)劃規(guī)則來優(yōu)化查詢驰怎,一個(gè)代價(jià)估算的模型指引優(yōu)化過程,計(jì)劃引擎嘗試生成一個(gè)可以替代的關(guān)系表達(dá)式二打,擁有和原關(guān)系表達(dá)式一樣的語義和更低的代價(jià)县忌。每一個(gè)組件都是可以拓展的,用戶可以增加過站關(guān)系表達(dá)式操作符,規(guī)則症杏,代價(jià)模型和統(tǒng)計(jì)信息装获。
6.1 計(jì)劃規(guī)則(Planner rules)
??Calcite includes a set of planner rules to transform expression trees. In particular, a rule matches a given pattern in
the tree and executes a transformation that preserves semantics of that expression. Calcite includes several hundred optimization rules. However, it is rather common for data processing systems relying on Calcite for optimization to include their own rules to allow specific rewritings.
??For example, Calcite provides an adapter for Apache Cassandra [29], a wide column store which partitions data by a subset of columns in a table and then within each partition, sorts rows based on another subset of columns. As discussed in Section 5, it is beneficial for adapters to push down as much query processing as
possible to each backend for efficiency. A rule to push a Sort into Cassandra must check two conditions:
??(1) the table has been previously filtered to a single partition
(since rows are only sorted within a partition) and
??(2) the sorting of partitions in Cassandra has some commonprefix with the required sort.
??This requires that a LogicalFilter has been rewritten to aCassandraFilter to ensure the partition filter is pushed down to the database. The effect of the rule is simple (convert a LogicalSort into a CassandraSort) but the flexibility in rule matching enables backends to push down operators even in complex scenarios.
??For an example of a rule with more complex effects, consider the following query:
??The query corresponds to the relational algebra expression presented in Figure 4a. Because the WHERE clause only applies to the sales table, we can move the filter before the join as in Figure 4b. This optimization can significantly reduce query execution time since we do not need to perform the join for rows which do match the predicate. Furthermore, if the sales and products tables were contained in separate backends, moving the filter before
the join also potentially enables an adapter to push the filter into the backend. Calcite implements this optimization via
FilterIntoJoinRule which matches a filter node with a join node as a parent and checks if the filter can be performed by the join. This optimization illustrates the flexibility of the Calcite approach to optimization.
這個(gè)查詢可以表示為圖4a
??Calcite包含了一組計(jì)劃規(guī)則可以轉(zhuǎn)換關(guān)系表達(dá)式樹。對于符合關(guān)系表達(dá)式樹的規(guī)則厉颤,會(huì)對關(guān)系表達(dá)式書進(jìn)行轉(zhuǎn)換穴豫,轉(zhuǎn)換會(huì)保持原語義。Calcite包含幾百個(gè)優(yōu)化規(guī)則逼友。對于基于Calcite優(yōu)化器的數(shù)據(jù)庫系統(tǒng)(dremio)可以包含自己特有的優(yōu)化規(guī)則來實(shí)現(xiàn)重寫.
例如精肃,Calcite為Apache Cassandra提供一個(gè)適配器,Apache Cassandra是一個(gè)多列存儲(chǔ)引擎帜乞,通過一組表中的列進(jìn)行分區(qū)司抱,在每個(gè)分區(qū)內(nèi),行根據(jù)列組進(jìn)行排序黎烈。就像在第5部分討論的习柠,對于適配器來說,下推盡可能多的查詢處理到背后的數(shù)據(jù)引擎是很有效的照棋。下推Sort到Cassandra必須滿足兩個(gè)條件:
(1)這個(gè)表已經(jīng)過濾出來一個(gè)單獨(dú)的分區(qū)资溃,因?yàn)閞ows排序只在每個(gè)分區(qū)內(nèi)有效
(2)Cassandra中分區(qū)的排序根據(jù)需要的排序規(guī)則有一些通用的前綴
這就需要LogicalFilter被重寫成CassandraFilter來確保分區(qū)的filter已經(jīng)下推到數(shù)據(jù)庫。這個(gè)規(guī)則的影響是簡單的(轉(zhuǎn)換LogicalSort到CassandraSort)必怜,靈活的規(guī)則命中機(jī)制可以確保在復(fù)雜的情形中依然可以下推操作符到后臺(tái)數(shù)據(jù)庫肉拓。
??因?yàn)閃HER從句僅僅作用于sales表,我們可以移動(dòng)這個(gè)filter到j(luò)oin之前梳庆,就像圖4表達(dá)的那樣暖途。這個(gè)優(yōu)化可以顯著地減少查詢時(shí)間,因?yàn)槲覀兛梢詻]有必要針對不符合斷言的數(shù)據(jù)進(jìn)行連接操作膏执。此外驻售,如果sales和products表屬于不同的后臺(tái)數(shù)據(jù)庫,允許適配器將過濾下推到后臺(tái)服務(wù)器更米。Calcite通過FilterIntoJoinRule規(guī)則實(shí)現(xiàn)了這項(xiàng)優(yōu)化欺栗。規(guī)則的命中機(jī)制是關(guān)系表達(dá)式節(jié)點(diǎn)的父節(jié)點(diǎn)是join節(jié)點(diǎn)和檢查join是否可以應(yīng)用于filter。上面這個(gè)優(yōu)化例子說明了Calcite優(yōu)化的靈活性征峦。
6.2 元數(shù)據(jù)提供者(Metadata providers)
Metadata is an important part of Calcite’s optimizer, and it serves two main purposes: (i) guiding the planner
towards the goal of reducing the cost of the overall query plan, and (ii) providing information to the rules while they are being applied.
元數(shù)據(jù)對于Calcite的優(yōu)化器來說是一個(gè)很重要的部分迟几,他有兩個(gè)主要目的
(1)指引計(jì)劃朝著減少代價(jià)的目標(biāo)進(jìn)行
(2)為規(guī)則的執(zhí)行提供必要的信息
??Metadata providers are responsible for supplying that information to the optimizer. In particular, the default metadata providers implementation in Calcite contains functions that return the overall cost of executing a subexpression in the operator tree, the number of rows and the data size of the results of that expression, and the maximum degree of parallelism with which it can be executed. In turn, it can also provide information about the plan structure, e.g., filter conditions that are present below a certain tree node.
??Calcite provides interfaces that allow data processing systems toplug their metadata information into the framework. These systems may choose to write providers that override the existing functions, or provide their own new metadata functions that might be used during the optimization phase. However, for many of them, it is
sufficient to provide statistics about their input data, e.g., number of rows and size of a table, whether values for a given column are unique etc., and Calcite will do the rest of the work by using its default implementation.
As the metadata providers are pluggable, they are compiled and instantiated at runtime using Janino [27], a Java lightweight compiler. Their implementation includes a cache for metadata results, which yields significant performance improvements, e.g., when we need to compute multiple types of metadata such as cardinality,\ average row size, and selectivity for a given join, and all these computations rely on the cardinality of their inputs.
??元數(shù)據(jù)提供者負(fù)責(zé)給優(yōu)化器提供必要的信息。
6.3 計(jì)劃引擎(Planner engines)
The main goal of a planner engine is to trigger the rules provided to the engine until it reaches a given objective. At
the moment, Calcite provides two different engines. New engines are pluggable in the framework.
計(jì)劃引擎的主要目的是觸發(fā)提供給引擎的規(guī)則直到達(dá)到了優(yōu)化目標(biāo)栏笆。于此类腮,Calcite提供了兩種不同的引擎
6.3.1 基于代價(jià)的引擎(CBO)
The first one, a cost-based planner engine, triggers the input ruleswith the goal of reducing the overall expression cost. The engineuses a dynamic programming algorithm, similar to Volcano [20],to create and track different alternative plans created by firing the rules given to the engine. Initially, each expression is registered with the planner, together with a digest based on the expression attributes and its inputs. When a rule is fired on an expression e1
and the rule produces a new expression e2, the planner will add e2 to the set of equivalence expressions Sa that e1 belongs to. In addition, the planner generates a digest for the new expression, which is compared with those previously registered in the planner. If a similar digest associated with an expression e3 that belongs to a set Sb is found, the planner has found a duplicate and hence will merge Sa and Sb into a new set of equivalences. The process
continues until the planner reaches a configurable fix point. In particular, it can (i) exhaustively explore the search space until all rules have been applied on all expressions, or (ii) use a heuristicbased approach to stop the search when the plan cost has not improved by more than a given threshold δ in the last planner iterations. The cost function that allows the optimizer to decide which plan to choose is supplied through metadata providers. The default cost function implementation combines estimations for CPU, IO, and memory resources used by a given expression.
觸發(fā)規(guī)則,這些規(guī)則的目標(biāo)是減少全面表達(dá)式的代價(jià)蛉加。使用類似于Volcano的動(dòng)態(tài)規(guī)劃算法蚜枢,通過激活的引擎規(guī)則創(chuàng)建和追蹤不同可替代的計(jì)劃缸逃。默認(rèn)的代價(jià)函數(shù)包含一個(gè)表達(dá)式使用的CPU,IO,內(nèi)存資源.
6.3.2 啟發(fā)式優(yōu)化器(RBO)
The second engine is an exhaustive planner, which triggers rules exhaustively until it generates an expression that is no longer modified by any rules. This planner is useful to quickly execute rules without taking into account the cost of each expression.
窮舉地觸發(fā)規(guī)則,直到它生成一個(gè)表達(dá)式厂抽,表達(dá)式再也不能被規(guī)則修改
Users may choose to use one of the existing planner engines depending on their concrete needs, and switching from one to another, when their system requirements change, is straightforward. Alternatively, users may choose to generate multi-stage optimization logic, in which different sets of rules are applied in consecutive phases of the optimization process. Importantly, the existence of two planners allows Calcite users to reduce the overall optimization
time by guiding the search for different query plans.
6.4 物化結(jié)果
??One of the most powerful techniques to accelerate query processing in data warehouses is the precomputation of
relevant summaries or materialized views. Multiple Calcite adapters and projects relying on Calcite have their own notion of materialized views. For instance, Cassandra allows the user to define materialized views based on existing tables which are automatically maintained by the system.
??These engines expose their materialized views to Calcite. The optimizer then has the opportunity to rewrite incoming queries to use these views instead of the original tables. In particular, Calcite provides an implementation of two different materialized view based rewriting algorithms.
??The first approach is based on view substitution [10, 18]. The aim is to substitute part of the relational algebra tree with an equivalent expression which makes use of a materialized view, and the algorithm proceeds as follows: (i) the scan operator over the materialized view and the materialized view definition plan are registered with the planner, and (ii) transformation rules that try to unify expressions in the plan are triggered. Views do not need to exactly match expressions in the query being replaced, as the rewriting algorithm in Calcite can produce partial rewritings that include additional operators to compute the desired expression, e.g., filters with residual predicate conditions.
??The second approach is based on lattices [22]. Once the data sources are declared to form a lattice, Calcite represents each of the materializations as a tile which in turn can be used by the optimizer
to answer incoming queries. On the one hand, the rewriting algorithm is especially efficient in matching expressions over data sources organized in a star schema, which are common in OLAP applications. On the other hand, it is more restrictive than view substitution, as it imposes restrictions on the underlying schema.
數(shù)據(jù)倉庫處理查詢最有效的加速技術(shù)之一就是預(yù)計(jì)算相關(guān)聚合結(jié)果或者是結(jié)果物化需频。多種Calcite適配器或者基于Calcite的項(xiàng)目有自己的物化視圖概念。
Calcite提供了兩種物化視圖重寫算法的實(shí)現(xiàn)筷凤。
(1)基于view substitution昭殉。這個(gè)實(shí)現(xiàn)的目的就是使用等效的物化視圖替換部分關(guān)系表達(dá)式樹。
(2)基于lattices嵌施,當(dāng)數(shù)據(jù)源被聲明來源于lattice,Calcite可以使用tile表達(dá)每個(gè)物化結(jié)果饲化。這些tile可以被優(yōu)化器使用響應(yīng)查詢。一方面吗伤,重寫算法對于星型查詢的數(shù)據(jù)源的表達(dá)式依然有效。
7.Calcite拓展
As we have mentioned in the previous sections, Calcite is not only tailored towards SQL processing. In fact, Calcite provides extensions to SQL expressing queries over other data abstractions, such as semistructured,
streaming and geospatial data. Its internal operators adapt to these queries. In addition to extensions to SQL, Calcite also includes a language-integrated query language. We describe these extensions throughout this section and provide some examples.
像我們在前面章節(jié)里提到的硫眨,Calcite對于SQL處理不是定制的足淆。事實(shí)上,Calcite通過數(shù)據(jù)抽象提供了SQL查詢語言拓展礁阁,這些抽象包含半結(jié)構(gòu)化數(shù)據(jù)巧号,流式,和地理位置數(shù)據(jù)姥闭。Calcite內(nèi)部的操作符可以適配這些查詢丹鸿。除了對SQL查詢進(jìn)行拓展,Calcite也可以包含整合語言的查詢語言棚品,在下面描述這些拓展和舉例靠欢。
7.1 半結(jié)構(gòu)化數(shù)據(jù)
Calcite supports several complex column data types that enable a hybrid of relational and semi-structured data to be stored in tables. Specifically, columns can be of type ARRAY, MAP, or MULTISET. Furthermore, these complex types can be nested so it is possible for example, to have a MAP where the values are of type ARRAY. Data
within the ARRAY and MAP columns (and nested data therein) can be extracted using the [] operator. The specific type of values stored in any of these complex types need not be predefined.
??For example, Calcite contains an adapter for MongoDB [36], a document store which stores documents consisting of data roughly equivalent to JSON documents. To expose MongoDB data to Calcite, a table is created for each document collection with a single column named _MAP: a map from document identifiers to their data. In many cases, documents can be expected to have a common structure. A collection of documents representing zip codes may each contain columns with a city name, latitude and longitude. It can be useful to expose this data as a relational table. In Calcite, this is achieved by creating a view after extracting the desired values and casting
them to the appropriate type:
With views over semi-structured data defined in this manner, it becomes easier to manipulate data from different semi-structured sources in tandem with relational data.
Calcite支持幾種復(fù)雜的列數(shù)據(jù)類型,可以支持存儲(chǔ)在表里的關(guān)系和半結(jié)構(gòu)化的數(shù)據(jù)的混合铜跑。數(shù)據(jù)類型可以是ARRAY, MAP, or MULTISET门怪,這些復(fù)雜的類型可是嵌套的,例如可以擁有MAP锅纺,它的值是ARRAY.在ARRAY和MAP列中的數(shù)據(jù)可以使用[]操作符提取出來掷空。
7.2 流式數(shù)據(jù)
Calcite provides first-class support for streaming queries [26] based on a set of streaming-specific extensions to standard SQL, namely STREAM extensions, windowing extensions, implicit references to streams via window expressions in joins, and others. These extensions were inspired by the Continuous Query Language [2]
while also trying to integrate effectively with standard SQL. The primary extension, the STREAM directive tells the system that the user is interested in incoming records, not existing ones.
??In the absence of the STREAM keyword when querying a stream, the query becomes a regular relational query, indicating the system should process existing records which have already been received from a stream, not the incoming ones.
??Due to the inherently unbounded nature of streams, windowing is used to unblock blocking operators such as aggregates and joins. Calcite’s streaming extensions use SQL analytic functions to express sliding and cascading window aggregations, as shown in the following example.
Tumbling, hopping and session windows2 are enabled by the TUMBLE, HOPPING, SESSION functions and related utility functions such as TUMBLE_END and HOP_END that can be used respectively in GROUP BY clauses and projections.
??Streaming queries involving window aggregates require the presence of monotonic or quasi-monotonic expressions in the GROUP BY clause or in the ORDER BY clause in case of sliding and cascading
window queries. Streaming queries which involve more complex stream-to-stream joins can be expressed using an implicit (time) window expression in the JOIN clause.
In the case of an implicit window, Calcite’s query planner validates that the expression is monotonic.
Calcite通過對于標(biāo)準(zhǔn)SQL的拓展對于流式查詢提供了很好的支持,這些拓展包含STREAM拓展囤锉,窗口函數(shù)拓展坦弟,通過joins里窗口表達(dá)式隱式使用流式。這些拓展的靈感來源于Continuous Query Language(CQL),最主要的拓展STREAM指令告訴系統(tǒng)用戶操作流入的數(shù)據(jù)官地,而不是存在的數(shù)據(jù)酿傍。
7.3 地理數(shù)據(jù)查詢
Geospatial support is preliminary in Calcite, but is being implemented using Calcite’s relational algebra. The core of this implementation consists in adding a new GEOMETRY data type which encapsulates different geometric objects such as points, curves, and polygons. It is expected that Calcite will be fully compliant with the OpenGIS Simple Feature Access [39] specification which defines a standard for SQL interfaces to access geospatial data. An example
query finds the country which contains the city of Amsterdam:
地理位置查詢在Calcite中初步支持,使用Calcite的關(guān)系代數(shù)來支持区丑,這個(gè)實(shí)現(xiàn)的核心部分在于添加了一個(gè)新的數(shù)據(jù)類型GEOMETRY拧粪,這個(gè)類型可以表示不同的地理幾何對象修陡,比如點(diǎn),曲線可霎,多邊形等魄鸦,Calcite將完全遵從OpenGIS Simple Feature Access的標(biāo)準(zhǔn)定義。
7.4 整合語言查詢
Calcite can be used to query multiple data sources, beyond just relational databases. But it also aims to support more than just the SQL language. Though SQL remains the primary database language,
many programmers favour language-integrated languages like LINQ [33]. Unlike SQL embedded within Java or C++ code, language-integrated query languages allow the programmer to write all of her code using a single language. Calcite provides Language-Integrated Query for Java (or LINQ4J, in short) which closely follows the convention set forth by Microsoft’s LINQ for the .NET languages.
Calcite可以用來查詢多種數(shù)據(jù)源癣朗,不僅僅是關(guān)系型數(shù)據(jù)庫拾因。它不僅僅用來支持SQL語言。雖然SQL語言依然是數(shù)據(jù)庫中的主要語言旷余,需要稱需要傾向于LINQ(Language-Integrated Query)的查詢語言绢记,不像SQL語言,LINQ可以集成java或者C++代碼正卧,整合語言的查詢語言允許開發(fā)者使用一種語言來編寫代碼蠢熄。Calcite支持Language-Integrated Query for Java (LINQ4J),遵從微軟為.NET語言設(shè)計(jì)的LINQ的語法
8.工業(yè)和學(xué)術(shù)應(yīng)用
Calcite enjoys wide adoption, specially among open-source projects used in industry. As Calcite provides certain integration flexibility, these projects have chosen to either (i) embed Calcite within their core, i.e., use it as a library, or (ii) implement an adapter to allow Calcite to federate query processing. In addition, we see a growing interest in the research community to use Calcite as the cornerstone of the development of data management projects. In the following, we describe how different systems are using Calcite.
Calcite被廣泛使用炉旷,尤其是在開源的工業(yè)級項(xiàng)目中签孔。因?yàn)镃alcite提供了集成的靈活性,項(xiàng)目集成Calcite成他們的核心模塊窘行,或者使用它作為庫饥追,或者實(shí)現(xiàn)了一個(gè)適配器來允許Calcite進(jìn)行聯(lián)邦查詢。另外罐盔,我們看到研究型社區(qū)使用Calcite作為數(shù)據(jù)引擎的基礎(chǔ)模塊但绕,下面我們來描述下不同的系統(tǒng)是怎樣使用Calcite的
8.1 集成calcite
Table 1 provides a list of software that incorporates Calcite as a library, including (i) the query language interface that they expose to users, (ii) whether they use Calcite’s JDBC driver (called Avatica),
(iii) whether they use the SQL parser and validator included in Calcite, (iv) whether they use Calcite’s query algebra to represent their operations over data, and (v) the engine that they rely on for execution, e.g., their own native engine, Calcite’s operators (referred to as enumerable), or any other project.
??Drill [13] is a flexible data processing engine based on the Dremelsystem [34] that internally uses a schema-free JSON document datamodel. Drill uses its own dialect of SQL that includes extensions to express queries on semi-structured data, similar to SQL++ [38].
??Hive [24] first became popular as a SQL interface on top of the MapReduce programming model [52]. It has since moved towards being an interactive SQL query answering engine, adoptingCalcite as its rule and cost-based optimizer. Instead of relying onCalcite’s JDBC driver, SQL parser and validator, Hive uses its own implementation of these components. The query is then translated into Calcite operators, which after optimization are translated into Hive’s physical algebra. Hive operators can be executed by multiple engines, the most popular being Apache Tez [43, 51] and Apache Spark [47, 56].
??Apache Solr [46] is a popular full-text distributed search platform built on top of the Apache Lucene library [31]. Solr exposes multiple query interfaces to users, including REST-like HTTP/XML and JSON APIs. In addition, Solr integrates with Calcite to provide SQL compatibility.
?? Apache Phoenix [40] and Apache Kylin [28] both work on top of Apache HBase [23], a distributed key-value store modeled after Bigtable [9]. In particular, Phoenix provides a SQL interface and orchestration layer to query HBase. Kylin focuses on OLAP-style
??SQL queries instead, building cubes that are declared as materializedviews and stored in HBase, and hence allowing Calcite’s optimizer to rewrite the input queries to be answered using those cubes. In Kylin, query plans are executed using a combination of Calcite native operators and HBase.
??Recently Calcite has become popular among streaming systems too. Projects such as Apache Apex [1], Flink [16], Apache Samza [44], and Storm [50] have chosen to integrate with Calcite, using its components to provide a streaming SQL interface to their users. Finally, other commercial systems have adopted Calcite,
表1提供了一個(gè)集成Calcite作為庫的軟件列表,使用方面包括
(1)暴露給用戶的查詢接口
(2)使用Calcite的 JDBC driver(avatica)
(3)使用Calcite的sql解析和校驗(yàn)器
(4)使用Calcite的關(guān)系代數(shù)表達(dá)式
(5)執(zhí)行引擎惶看,例如他們自己的本地引擎捏顺,使用Calcite的關(guān)系表達(dá)式或者其他的項(xiàng)目
8.2 Calcite適配器
Instead of using Calcite as a library, other systems integrate with Calcite via adapters which read their data sources. Table 2 provides the list of available adapters in Calcite. One of the main key components of the implementation of these adapters is the converter responsible for translating the algebra expression to be pushed to the system into the query language supported by that system. Table 2 also shows the languages that Calcite translates into for
its own SQL-like language called CQL whereas the adapter for Apache Pig [41] generates queries expressed in Pig Latin [37]. The adapter for Apache Spark [47] uses the Java RDD API. Finally, Druid [14], Elasticsearch [15] and Splunk [48] are queried through REST HTTP API requests. The queries generated by Calcite for these systems are expressed in JSON or XML.
each of these adapters.
??The JDBC adapter supports the generation of multiple SQL dialects, including those supported by popular RDBMSes such as PostgreSQL and MySQL. In turn, the adapter for Cassandra [8] generates
對于集成Calcite作為庫,有些系統(tǒng)選擇使用Calcite作為適配器來讀取他們的數(shù)據(jù)源碳竟。
8.3 學(xué)術(shù)研究應(yīng)用
In a research setting, Calcite has been considered [54] as a polystorealternative for precision medicine and clinical analysis scenarios. In those scenarios, heterogeneous medical data has to be logically assembled and aligned to assess the best treatments based on the comprehensive medical history and the genomic profile of the patient.
The data comes from relational sources representing patients’ electronic medical records, structured and semi-structured sources representing various reports (oncology, psychiatry,laboratory tests, radiology, etc.), imaging, signals, and sequence data, stored in scientific databases. In those circumstances, Calcite represents a good
foundation with its uniform query interface, and flexible adapter architecture, but the ongoing research efforts are aimed at (i) introduction of the new adapters for array, and textual sources, and (ii) support efficient joining of heterogeneous data sources.
9.未來工作
The future work on Calcite will focus on the development of the new features, and the expansion of its adapter architecture:
- Enhancements to the design of Calcite to further support its use a standalone engine, which would require a support for data definition languages (DDL), materialized views, indexes and constraints.
- Ongoing improvements to the design and flexibility of the planner, including making it more modular, allowing users Calcite to supply planner programs (collections of rules organized into planning phases) for execution.
- Incorporation of new parametric approaches [53] into the design of the optimizer.
- Support for an extended set of SQL commands, functions, and utilities, including full compliance with OpenGIS.
- New adapters for non-relational data sources such as array databases for scientific computing.
- Improvements to performance profiling and instrumentation.
Calcite未來的工作將聚焦在拓展新功能和適配器架構(gòu)的拓展:
● Calcite增強(qiáng)設(shè)計(jì)來支持它作為一個(gè)獨(dú)立引擎來使用草丧,支持DDL,物化視圖,索引和約束
● 持續(xù)提升Calcite計(jì)劃引擎的靈活性莹桅,包括使它更加模塊化昌执,允許用戶提供計(jì)劃引擎programs(計(jì)劃階段的一組規(guī)則)
● 引入多參數(shù)并行化優(yōu)化器的設(shè)計(jì)思想到優(yōu)化器的設(shè)計(jì)
● 支持符合OpenGIS標(biāo)準(zhǔn)的SQL命令,函數(shù)和基礎(chǔ)設(shè)施
● 對于非關(guān)系型數(shù)據(jù)源的適配器支持诈泼,例如array數(shù)據(jù)庫和科學(xué)計(jì)算
● 改進(jìn)性能,資料收集和圖表化
9.1 性能測試和評估
Though Calcite contains a performance testing module, it does not evaluate query execution. It would be useful to assess the performance of systems built with Calcite. For example, we could compare the performance of Calcite with similar frameworks. Unfortunately, it might be difficult to craft fair comparisons. For example, like Calcite, Algebricks optimizes queries for Hive. Borkar et al. [6] compared Algebricks with the Hyracks scheduler against Hive version 0.12 (without Calcite). The work of Borkar et al. precedes significant engineering and architectural changes into Hive. Comparing Calcite against Algebricks in a fair manner in terms of timings does not seem feasible, as one would need to ensure that each uses the same execution engine. Hive applications rely mostly on either Apache Tez or Apache Spark as execution engines whereas Algebricks is tied to its own framework (including Hyracks).
??Moreover, to assess the performance of Calcite-based systems, we need to consider two distinct use cases. Indeed, Calcite can be used either as part of a single system—as a tool to accelerate the construction of such a system—or for the more difficult task of combining several distinct systems—as a common layer. The former is tied to the characteristics of the data processing system, and because Calcite is so versatile and widely used, many distinct benchmarks are needed. The latter is limited by the availability of existing heterogeneous benchmarks. BigDAWG [55] has been used to integrate PostgreSQL with Vertica, and on a standard benchmark, one gets that the integrated system is superior to a baseline where entire tables are copied from one system to another to answer specific queries. Based on real-world experience, we believe that more ambitious goals are possible for integrated multiple systems: they should be superior to the sum of their parts.
10.結(jié)論
Emerging data management practices and associated analytic uses of data continue to evolve towards an increasingly diverse, and heterogeneous spectrum of scenarios. At the same time, relational data sources, accessed through the SQL, remain an essential means to how enterprises work with the data. In this somewhat dichotomous
space, Calcite plays a unique role with its strong support for both traditional, conventional data processing, and for its support of other data sources including those with semi-structured, streaming and geospatial models. In addition, Calcite’s design philosophy with a focus on flexibility, adaptivity, and extensibility, has been another factor in Calcite becoming the most widely adopted query optimizer, used in a large number of open-source frameworks. Calcite’s dynamic and flexible query optimizer, and its adapter architecture allows it to be embedded selectively by a variety of data management frameworks such as Hive, Drill, MapD, and Flink. Calcite’s support for heterogeneous data processing, as well as for the extended set of relational functions will continue to improve, in both functionality and performance.