Time-Series Database Analysis - An Introduction to the TimescaleDB Time-Series Database
A hypertable can be partitioned by additional columns as well -- such as a device identifier, server or container id, user or customer id, location, stock ticker symbol, and so forth. Such partitioning on this additional column typically employs hashing (mapping all devices or servers into a specific number of hash buckets), although interval-based partitioning can be employed here as well. We sometimes refer to hypertables partitioned by both time and this additional dimension as "time and space" partitions.
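As a rough sketch of adding such a "space" dimension, the statements below assume a hypertable named conditions (the name used in the examples at the end of this article) that is already partitioned on time, has a location column, and is still empty. add_dimension and the timescaledb_information.dimensions view are TimescaleDB 2.x features; the bucket count of 4 is an arbitrary illustration.

```sql
-- Assumption: 'conditions' is already a hypertable partitioned on 'time'
-- and does not contain data yet.
-- Hash-partition on 'location' into 4 buckets (the "space" dimension).
SELECT add_dimension('conditions', 'location', number_partitions => 4);

-- Inspect the dimensions of the hypertable (TimescaleDB 2.x view).
SELECT dimension_number, column_name, dimension_type, num_partitions
FROM timescaledb_information.dimensions
WHERE hypertable_name = 'conditions';
```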
This time-and-space partitioning is primarily used for distributed hypertables. With such two-dimensional partitioning, each time interval will also be partitioned across multiple nodes comprising the distributed hypertables. In such cases, for the same hour, information about some portion of the devices will be stored on each node. This allows multi-node TimescaleDB to parallelize inserts and queries for data during that time interval.
Each chunk is implemented using a standard database table. (In PostgreSQL internals, the chunk is actually a "child table" of the "parent" hypertable.) A chunk includes constraints that specify and enforce its partitioning ranges, e.g., that the time interval of the chunk covers ['2020-07-01 00:00:00+00', '2020-07-02 00:00:00+00'), and all rows included in the chunk must have a time value within that range. Any space partitions will be reflected as chunk constraints as well. As these ranges and partitions are non-overlapping, all chunks in a hypertable are disjoint in their partitioning dimensional space.
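The chunk ranges and constraints described above can be inspected from SQL. This is a minimal sketch, assuming TimescaleDB 2.x and a hypertable named conditions; the table name is only an illustration.

```sql
-- List each chunk together with the time range it covers.
SELECT chunk_schema, chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'conditions'
ORDER BY range_start;

-- Show the CHECK constraints that enforce those ranges on the chunk tables.
SELECT conrelid::regclass        AS chunk,
       pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE contype = 'c'
  AND conrelid IN (SELECT c::oid FROM show_chunks('conditions') AS c);
```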
Local indexes. Indexes are built on each chunk independently, rather than as a global index across all data. This similarly ensures that both data and indexes from the latest chunks typically reside in memory, so that updating indexes when inserting data remains fast. TimescaleDB can still ensure global uniqueness on keys that include all partitioning keys, given the disjoint nature of its chunks: for a unique (device_id, timestamp) primary key, it first identifies the corresponding chunk from the chunk constraints, then uses one of that chunk's indexes to ensure uniqueness. This remains simple to use with TimescaleDB's hypertable abstraction: users simply create an index on the hypertable, and these operations (and configurations) are pushed down to both existing and new chunks.
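To illustrate the push-down of index creation, here is a small sketch. It assumes a hypertable named conditions with time and location columns, and that chunks live in the default _timescaledb_internal schema; the index name is hypothetical.

```sql
-- Create the index once, on the hypertable itself.
CREATE INDEX conditions_location_time_idx
    ON conditions (location, time DESC);

-- Each chunk receives its own local copy of the index.
-- Assumption: chunks are stored in the default internal schema.
SELECT tablename, indexname
FROM pg_indexes
WHERE schemaname = '_timescaledb_internal'
ORDER BY tablename, indexname;
```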
Distributed hypertables
TimescaleDB supports distributing hypertables across multiple nodes (i.e., a cluster) by leveraging the same hypertable and chunk primitives as described above. This allows TimescaleDB to scale inserts and queries beyond the capabilities of a single TimescaleDB instance.
Distributed hypertables and regular hypertables look very similar, with the main difference being that distributed chunks are not stored locally. There are also some features of regular hypertables that distributed hypertables do not support (see section on current limitations).
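A minimal sketch of setting up a distributed hypertable follows. It assumes a TimescaleDB 2.x release that still ships multi-node support, an empty regular table named conditions with time and location columns, and two data nodes; the node names and host names are placeholders.

```sql
-- Run on the access node. Node and host names are placeholders.
SELECT add_data_node('dn1', host => 'dn1.example.com');
SELECT add_data_node('dn2', host => 'dn2.example.com');

-- Partition on time and hash-partition on location across the data nodes.
SELECT create_distributed_hypertable('conditions', 'time', 'location');
```

Queries and inserts are then issued against conditions on the access node, just as with a regular hypertable.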
Scaling distributed hypertables
As time-series data grows, a common use case is to add data nodes to expand the storage and compute capacity of distributed hypertables. Thus, TimescaleDB can be elastically scaled out by simply adding data nodes to a distributed database.
As mentioned earlier, TimescaleDB can (and will) adjust the number of space partitions as new data nodes are added. Although existing chunks will not have their space partitions updated, the new settings will be applied to newly created chunks. Because of this behavior, we do not need to move data between data nodes when the cluster size is increased; we simply update how data is distributed for the next time interval. Writes for new incoming data will leverage the new partitioning settings, while the access node can still support queries across all chunks (even those that were created using the old partitioning settings). Note that although the number of space partitions can be changed, the column on which the data is partitioned cannot be changed.
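Scaling out might look like the following sketch, assuming a multi-node TimescaleDB 2.x cluster and an existing distributed hypertable named conditions; the node name, host name, and partition count are placeholders.

```sql
-- Run on the access node: register the new data node ...
SELECT add_data_node('dn3', host => 'dn3.example.com');

-- ... and attach it to the existing distributed hypertable.
-- Only chunks created after this point can be placed on 'dn3'.
SELECT attach_data_node('dn3', 'conditions');

-- Optionally set the number of space partitions explicitly.
SELECT set_number_partitions('conditions', 6);
```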
Data Retention
An intrinsic property of time-series data is that new data keeps accumulating, old data is rarely, if ever, updated, and the relevance of the data diminishes over time. It is therefore often desirable to delete old data to save disk space.
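Because data is organized into time-bounded chunks, old data can be removed by dropping whole chunks instead of deleting individual rows. The sketch below assumes the TimescaleDB 2.x function signatures and a hypertable named conditions; the intervals are arbitrary examples.

```sql
-- One-off cleanup: drop every chunk whose data is older than 3 months.
SELECT drop_chunks('conditions', INTERVAL '3 months');

-- Recurring cleanup: a background job that keeps only the last 6 months.
SELECT add_retention_policy('conditions', INTERVAL '6 months');
```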
Hypertable limitations
Foreign key constraints referencing a hypertable are not supported.
Time dimensions (columns) used for partitioning cannot have NULL values.
Unique indexes must include all columns that are partitioning dimensions.
UPDATE statements that move values between partitions (chunks) are not supported. This includes upserts (INSERT ... ON CONFLICT DO UPDATE).
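The limitations on NULL time values and unique indexes can be illustrated with a small schema sketch. The readings table and its columns are hypothetical and exist only for this example.

```sql
-- The time column is declared NOT NULL, and the primary key includes
-- both partitioning columns (device_id and time).
CREATE TABLE readings (
    time        TIMESTAMPTZ      NOT NULL,
    device_id   TEXT             NOT NULL,
    temperature DOUBLE PRECISION,
    PRIMARY KEY (device_id, time)
);

SELECT create_hypertable('readings', 'time', 'device_id', 4);

-- By contrast, a unique index that omits a partitioning column is rejected:
-- CREATE UNIQUE INDEX ON readings (device_id);
```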
Create your first hypertable
Creating a hypertable is a two-step process. First, we execute a CREATE TABLE statement to create a regular relational table. Second, we execute a SELECT statement that calls the function create_hypertable, specifying the name of the table we want to turn into a hypertable as well as the name of the time column in that table, which is a required parameter.
How hypertables help with time-series data
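The two steps can be sketched as follows. The conditions table matches the name used in the create_hypertable() examples at the end of this article, but its column list here is an assumption made for illustration.

```sql
-- Step 1: create a regular relational table.
CREATE TABLE conditions (
    time        TIMESTAMPTZ      NOT NULL,
    location    TEXT             NOT NULL,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION
);

-- Step 2: turn it into a hypertable partitioned on the 'time' column.
SELECT create_hypertable('conditions', 'time');

-- From here on, the hypertable is used like any ordinary table.
INSERT INTO conditions (time, location, temperature, humidity)
VALUES (now(), 'office', 21.5, 40.0);
```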
Hypertables help speed up ingest rates, since data is only inserted into the current chunk, leaving data in the other chunks untouched. Contrast this with inserting data into a single table, which will become bigger and more bloated as more data is ingested.
Hypertables help speed up queries, since only the chunks that can contain matching rows are queried, thanks to the automatic partitioning and indexing by time and/or space.
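One way to observe which chunks a query touches is to look at its plan. This sketch assumes a hypertable named conditions with time, location, and temperature columns; the time range is an arbitrary example.

```sql
-- With a time predicate, the plan should list only the chunks whose
-- range overlaps the requested interval.
EXPLAIN
SELECT location, avg(temperature) AS avg_temperature
FROM conditions
WHERE time >= '2020-07-01 00:00:00+00'
  AND time <  '2020-07-02 00:00:00+00'
GROUP BY location;
```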
Accessing the dataset
-- copy data from weather_data.csv into weather_metrics
\copy weather_metrics (time, timezone_shift, city_name, temp_c, feels_like_c, temp_min_c, temp_max_c, pressure_hpa, humidity_percent, wind_speed_ms, wind_deg, rain_1h_mm, rain_3h_mm, snow_1h_mm, snow_3h_mm, clouds_percent, weather_type_id) from './weather_data.csv' CSV HEADER;
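The \copy command above requires that weather_metrics already exists. The following is only a sketch of a matching definition: the column names are taken from the \copy column list, while every column type is an assumption and should be checked against the actual CSV file.

```sql
-- Column names follow the \copy list above; all types are assumptions.
CREATE TABLE IF NOT EXISTS weather_metrics (
    time             TIMESTAMP NOT NULL,
    timezone_shift   INTEGER,
    city_name        TEXT,
    temp_c           DOUBLE PRECISION,
    feels_like_c     DOUBLE PRECISION,
    temp_min_c       DOUBLE PRECISION,
    temp_max_c       DOUBLE PRECISION,
    pressure_hpa     DOUBLE PRECISION,
    humidity_percent DOUBLE PRECISION,
    wind_speed_ms    DOUBLE PRECISION,
    wind_deg         INTEGER,
    rain_1h_mm       DOUBLE PRECISION,
    rain_3h_mm       DOUBLE PRECISION,
    snow_1h_mm       DOUBLE PRECISION,
    snow_3h_mm       DOUBLE PRECISION,
    clouds_percent   INTEGER,
    weather_type_id  INTEGER
);

-- Convert it to a hypertable before loading the data.
SELECT create_hypertable('weather_metrics', 'time');
```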
create_hypertable()
SELECT create_hypertable('conditions', 'time', 'location', 4);
SELECT create_hypertable('conditions', 'time', 'location', 4, partitioning_func => 'location_hash');
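The second call above refers to a function named location_hash, which is not defined anywhere in this article and has to exist before create_hypertable is called. One possible definition is sketched below, under the assumption that a custom partitioning function must be IMMUTABLE, take a single anyelement argument, and return a non-negative integer. It relies on hashtext, an internal PostgreSQL hash function, purely for illustration.

```sql
-- Hypothetical partitioning function: hash the text form of the value
-- and clear the sign bit so that the result is never negative.
CREATE OR REPLACE FUNCTION location_hash(val anyelement)
RETURNS integer
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT hashtext(val::text) & 2147483647;
$$;
```

If partitioning_func is omitted, as in the first call, TimescaleDB falls back to its built-in hash partitioning for the space dimension.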