OLTP VS OLAP
Online Transaction Processing (OLTP) retrieves individual rows of data matching some query inputs.
Online Analytical Processing (OLAP) pulls in large numbers of records and performs heavy computation over them to analyze aggregate statistics.
OLTP typically runs on RDS. OLAP runs on a data warehouse such as Redshift.
Relational VS Non-Relational
Collection = Table
Document = Row
Key Value Pairs = Fields
| | Relational | Non-relational |
|---|---|---|
| Data Storage | Rows and columns | Key-value, document, wide-column, graph |
| Schemas | Fixed | Dynamic |
| Querying | SQL | Focused on collections of documents |
| Scalability | Vertical | Horizontal |
| Transactions | Supported | Support varies |
| Consistency | Strong | Eventual and strong |
RDS (Relational Database Service)
Compared to a self-built relational database, Amazon RDS:
- Allows you to tailor your storage performance and cost to the needs of your database. You can scale CPU, memory, storage, and IOPS independently.
- manages backups, software patching, automatic failure detection, and recovery.
- Ensures secure access through IAM. RDS doesn't provide shell access to DB instances.
- You can get high availability with a primary instance and a synchronous secondary replica instance in another AZ that you can fail over to when problems occur.
RDS is not serverless, except for Aurora Serverless.
DB instances
The basic building block of Amazon RDS is the DB instance. A DB instance is an isolated database environment in the AWS Cloud. Each DB instance runs a DB engine. Amazon RDS currently supports the MySQL, MariaDB, PostgreSQL, Oracle, and Microsoft SQL Server DB engines. The computation and memory capacity of a DB instance is determined by its DB instance class.
Each DB instance has a DB instance identifier. The DB instance identifier must be unique for that customer in an AWS Region. The identifier is used as part of the DNS hostname allocated to your instance by RDS.
DB storage classes
DB instance storage comes in three types: Magnetic, General Purpose (SSD), and Provisioned IOPS (PIOPS).
They differ in performance characteristics and price, allowing you to tailor your storage performance and cost to the needs of your database. Each DB instance has minimum and maximum storage requirements depending on the storage type and the database engine it supports.
When using General Purpose SSD storage, your DB instance receives an initial I/O credit balance of 5.4 million I/O credits. This initial credit balance is enough to sustain a burst performance of 3,000 IOPS for 30 minutes. When your storage requires more than the base performance I/O level, it uses I/O credits in the I/O credit balance to burst to the required performance level. Such a burst goes to a maximum of 3,000 IOPS. When your storage uses fewer I/O credits than it earns in a second, unused I/O credits are added to the I/O credit balance. If your storage uses all of its I/O credit balance, its maximum performance remains at the base performance level until I/O demand drops below the base level and unused I/O credits are added to the I/O credit balance.
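The burst figures above can be sanity-checked with a quick calculation (illustrative only; real gp2 volumes also earn credits while bursting):

```python
# Illustrative check of the gp2 burst figures quoted above.
INITIAL_CREDITS = 5_400_000  # initial I/O credit balance
BURST_IOPS = 3_000           # maximum burst rate

# Seconds (and minutes) the initial balance sustains a full burst,
# ignoring any credits earned during the burst itself.
burst_seconds = INITIAL_CREDITS / BURST_IOPS
burst_minutes = burst_seconds / 60

print(burst_minutes)  # 30.0
```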
Provisioned IOPS storage is optimized for online transaction processing (OLTP) workloads that have consistent performance requirements. In this case, when you create a DB instance, you specify the IOPS rate and the size of the volume. Amazon RDS provides that IOPS rate for the DB instance until you change it. You are charged for the provisioned resources whether or not you use them in a given month.
DB Instance classes
Amazon RDS supports three types of instance classes: Standard, Memory Optimized, and Burstable Performance.
The Burstable Performance instance classes provide a baseline level of CPU performance with the ability to burst above the baseline. Current-generation instance classes are powered by the AWS Nitro System.
The processor settings for a DB instance are associated with snapshots of the DB instance. When a snapshot is restored, its restored DB instance uses the processor feature settings used when the snapshot was taken.
Depending on the instance class you're using, you might see lower IOPS performance than the maximum that RDS allows you to provision. Therefore, you should determine the maximum IOPS for the instance class before setting a Provisioned IOPS value for your DB instance.
Performance
Both system activities and database workload can affect storage performance.
Below are the main factors of RDS performance:
IOPS – The number of I/O operations completed each second. Measured IOPS values are independent of the size of the individual I/O operation. This means that when you measure I/O performance, you should look at the throughput of the instance, not simply the number of I/O operations.
Latency – The elapsed time between the submission of an I/O request and its completion.
Throughput – The number of bytes each second that are transferred to or from disk.
Queue Depth – The number of I/O requests in the queue waiting to be serviced. These are I/O requests that have been submitted by the application but have not been sent to the device because the device is busy servicing other I/O requests. Time spent waiting in the queue is a component of latency and service time.
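Because measured IOPS are independent of I/O size, the same IOPS figure can mean very different throughput. A small sketch of the relationship (the I/O sizes are hypothetical):

```python
# Throughput = IOPS x I/O size. Two workloads with identical IOPS
# can differ widely in throughput depending on their I/O size.
def throughput_mib_per_s(iops: int, io_size_kib: int) -> float:
    """Bytes per second transferred, expressed in MiB/s."""
    return iops * io_size_kib / 1024

small_io = throughput_mib_per_s(3000, 16)   # 3,000 IOPS of 16 KiB I/Os
large_io = throughput_mib_per_s(3000, 256)  # 3,000 IOPS of 256 KiB I/Os
print(small_io, large_io)  # 46.875 750.0
```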
Supported DB engines
- MySQL/MariaDB
- Oracle
- Microsoft SQL Server
- PostgreSQL
A security group controls access to a DB instance.
Multi-AZ
In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. The primary DB instance is synchronously replicated across Availability Zones to a standby replica to provide data redundancy, eliminate I/O freezes, and minimize latency spikes during system backups.
The Multi-AZ solution is used for disaster recovery only. The high-availability feature isn't a scaling solution for read-only scenarios; you can't use a standby replica to serve read traffic. To serve read-only traffic, you use a read replica instead. You can configure a read replica for a DB instance that also has a standby replica configured for high availability at the same time.
DB instances using Multi-AZ deployments can have increased write and commit latency compared to a Single-AZ deployment, due to the synchronous data replication that occurs.
In the event of a planned or unplanned outage of your DB instance, Amazon RDS automatically switches to a standby replica in another Availability Zone if you have enabled Multi-AZ. Failover times are typically 60–120 seconds. Failover reasons include:
- The operating system underlying the RDS database instance is being patched in an offline operation.
- The primary host of the RDS Multi-AZ instance is unhealthy.
- The primary host of the RDS Multi-AZ instance is unreachable due to loss of network connectivity.
- The RDS Multi-AZ primary instance is busy and unresponsive.
- The RDS instance was modified by the customer.
- The storage volume underlying the primary host of the RDS Multi-AZ instance experienced a failure.
- The user requested a reboot with failover of the DB instance.
To enable Multi-AZ for a DB instance, RDS takes a snapshot of the primary DB instance's EBS volume and restores it on the newly created standby replica, and then synchronizes both volumes.
The failover mechanism automatically changes the Domain Name System (DNS) record of the DB instance to point to the standby DB instance.
A Local Zone can't be included in a Multi-AZ deployment.
Read Replica
When enabling read replica, the source DB instance becomes the primary DB instance. Updates made to the primary DB instance are asynchronously copied to the read replica. You can reduce the load on your primary DB instance by routing read queries from your applications to the read replica.
Not supported for Microsoft SQL Server.
Common use scenarios include:
- Scaling beyond the compute or I/O capacity of a single DB instance for read-heavy database workloads.
- Serving read traffic while the source DB instance is unavailable, for example, due to maintenance.
- Business reporting or data warehousing scenarios where you might want business reporting queries to run against a read replica, rather than your production DB instance.
- Promoting a read replica to a standalone instance as a disaster recovery solution if the primary DB instance fails.
When you create a read replica, you first specify an existing DB instance as the source. Then Amazon RDS takes a snapshot of the source instance and creates a read-only instance from the snapshot. As a result, you experience a brief I/O suspension on your source DB instance while the DB snapshot occurs. Amazon RDS then uses the asynchronous replication method for the DB engine to update the read replica whenever there is a change to the primary DB instance. The read replica operates as a DB instance that allows only read-only connections.
The type of storage for a read replica is independent of that on the primary DB instance. For example, you might use General Purpose SSD for read replicas with a primary DB instance that uses Provisioned IOPS SSD storage to reduce costs. However, your read replica's performance in this case might differ from that of a configuration where both the primary DB instance and the read replicas use Provisioned IOPS SSD storage. By default, a read replica is created with the same storage type as the source DB instance.
Amazon RDS doesn't support circular replication. You can't configure a DB instance to serve as a replication source for an existing DB instance. You can only create a new read replica from an existing DB instance. For MariaDB and MySQL you can create a read replica from an existing read replica.
If you delete a source DB instance without deleting its read replicas in the same AWS Region, each read replica is promoted to a standalone DB instance.
Each read replica will have its own DNS endpoint.
You can have a read replica in a second region. In these cases, Amazon RDS sets up a secure communications channel between the primary DB instance and the read replica.
You can have read replicas that have Multi-AZ. You can create read replicas of Multi-AZ source databases.
You need to turn on backups to enable a read replica.
Mounted Replica
The Oracle DB engine supports replica databases in mounted mode. A mounted replica doesn't accept user connections and so can't serve a read-only workload. The primary use for mounted replicas is cross-Region disaster recovery.
Backup
When you restore from a backup, Amazon RDS restores it to a new RDS instance with a new DNS endpoint.
Automated Backups
- Enabled by default
- Stored in S3, you get free storage space equal to the size of your DB
- Take a full daily snapshot and will store transaction logs throughout the day
- During the backup window, storage I/O may be suspended and you may experience briefly elevated latency
- Deleted together with the original RDS instance
Database Snapshots
- Manually taken
- Can copy cross-region
- Stored even after you delete the original RDS instance
- Snapshots that use the default Amazon RDS encryption key (aws/rds) can be shared, but you must first copy the snapshot and choose a custom encryption key. Then, you can share the custom key and the copied snapshot.
Parameter Group VS Option Group
Parameter Group - configuration list of database engine (e.g. binlog, charset)
Option Group - optional third-party database features and plugins (e.g. Oracle Java Virtual Machine)
Aurora
A MySQL and PostgreSQL-compatible relational database engine
MySQL and PostgreSQL databases can be migrated to Aurora
Five times better performance than MySQL, and three times better than PostgreSQL
Storage and memory autoscaling
Read and write nodes have different DNS
2 copies of your data in each AZ, across a minimum of 3 AZs, for a total of 6 copies of your data
Can have read replica of Aurora, MySQL, and PostgreSQL
Taking automated backups and snapshots on Aurora does not impact performance
Aurora Global Database replicates writes in the primary region with a typical latency of <1 second to secondary regions, for low latency global reads. In disaster recovery situations, you can promote a secondary region to take full read-write responsibilities in under a minute.
Low latency between the primary and read replicas. A write is considered committed once 4 of the 6 storage copies acknowledge it.
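The 4-of-6 write quorum can be pictured as a simple check (an illustrative simulation, not Aurora's actual storage protocol):

```python
# Aurora keeps 6 copies of the data (2 per AZ across 3 AZs) and
# acknowledges a write once a quorum of copies have confirmed it.
TOTAL_COPIES = 6
WRITE_QUORUM = 4

def write_durable(acks: int) -> bool:
    """A write counts as committed once a quorum of copies acknowledge."""
    return acks >= WRITE_QUORUM

print(write_durable(4))  # True  - quorum reached
print(write_durable(3))  # False - still waiting on more copies
```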
Aurora Serverless
An on-demand, autoscaling configuration for Amazon Aurora clusters. It automatically starts up, shuts down, and scales capacity up or down based on your application's needs.
Used for infrequent, intermittent, or unpredictable workloads
Endpoints
When you connect to an Aurora cluster, the host name and port that you specify point to an intermediate handler called an endpoint. Aurora uses the endpoint mechanism to abstract these connections. Thus, you don't have to hardcode all the hostnames or write your own logic for load-balancing and rerouting connections when some DB instances aren't available.
For certain Aurora tasks, different instances or groups of instances perform different roles. Using endpoints, you can map each connection to the appropriate instance or group of instances based on your use case.
Cluster endpoint
A cluster endpoint (or writer endpoint) for an Aurora DB cluster connects to the current primary DB instance for that DB cluster. This endpoint is the only one that can perform write operations such as DDL statements.
Each Aurora DB cluster has one cluster endpoint and one primary DB instance.
The cluster endpoint provides failover support for read/write connections to the DB cluster. If the current primary DB instance of a DB cluster fails, Aurora automatically fails over to a new primary DB instance.
Read endpoint
A reader endpoint for an Aurora DB cluster provides load-balancing support for read-only connections to the DB cluster.
Each Aurora DB cluster has one reader endpoint.
If the cluster contains one or more Aurora Replicas, the reader endpoint load-balances each connection request among the Aurora Replicas.
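The reader endpoint's behavior can be pictured as spreading connections across replicas, e.g. with a round-robin sketch (illustrative only; Aurora actually balances connections via DNS):

```python
import itertools

class ReaderEndpoint:
    """Toy model of a reader endpoint cycling across Aurora Replicas."""
    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def connect(self) -> str:
        # Each new connection request lands on the next replica in turn.
        return next(self._cycle)

endpoint = ReaderEndpoint(["replica-1", "replica-2", "replica-3"])
print([endpoint.connect() for _ in range(4)])
# ['replica-1', 'replica-2', 'replica-3', 'replica-1']
```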
Custom endpoint
A custom endpoint for an Aurora cluster represents a set of DB instances that you choose. When you connect to the endpoint, Aurora performs load balancing and chooses one of the instances in the group to handle the connection. You define which instances this endpoint refers to, and you decide what purpose the endpoint serves.
You can create up to five custom endpoints for each provisioned Aurora cluster. You can't use custom endpoints for Aurora Serverless clusters.
Instance endpoint
An instance endpoint connects to a specific DB instance within an Aurora cluster. The instance endpoint provides direct control over connections to the DB cluster. For example, it can be used to improve connection speed after a failover.
Each DB instance in a DB cluster has its own unique instance endpoint. So there is one instance endpoint for the current primary DB instance of the DB cluster, and there is one instance endpoint for each of the Aurora Replicas in the DB cluster.
DynamoDB
AWS's non-relational (NoSQL) solution
Key idea: sacrifices some storage space to allow for computationally easier queries, as storage is cheap and computational power is expensive.
Stores indexes and materialized index query results to accelerate computation
- Stored on SSD Storage
- Spread across 3 geographically distinct data centers
- Eventual/Strongly Consistent Reads (eventual by default)
- Encryption at rest using KMS
- Security in transit can be protected by VPC endpoints, VPN, Direct Connect, and IAM policies
Data Structure
- Table = Table
- Item = Row
- Partition key ≈ Primary key. Must be defined when creating the table.
- Sort key/Range key (optional) = Order-by index. Must be defined when creating the table; if not defined, you can implement equivalent sorting by adding indexes later.
- Attributes ≈ Columns, except the partition key and sort key

Table data is stored in partitions based on the partition key
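Partition placement can be pictured as hashing the partition key (an illustrative sketch; DynamoDB's internal hash function is not public):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a partition key to one of N partitions."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# All items sharing a partition key land in the same partition,
# which is why key choice drives how evenly data spreads.
assert partition_for("user#42", 8) == partition_for("user#42", 8)
print(partition_for("user#42", 8))
```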
Secondary Index
Index you can use to sort data without the partition key
Generates a correlated table of sorted data when the secondary index is created
Global secondary index (GSI)
Can add GSI after a table is created
Used for sorting and selection unrelated to the partition key of the base table
Choose a key that will provide uniform workloads
Create a GSI with a subset of table's attributes for quick lookups
Sparse Index
A special type of GSI
Useful when an attribute occurs rarely in the base table
The index contains only the items in which the indexed attribute is present; items with the attribute absent are omitted
Sparse indexes are a great replacement for queries with FilterExpressions that include a contains operator
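The sparse-index behavior can be simulated in a few lines (illustrative, not the DynamoDB API): only items that carry the indexed attribute appear in the index:

```python
items = [
    {"pk": "order#1", "status": "OPEN"},
    {"pk": "order#2"},                    # no "status" attribute
    {"pk": "order#3", "status": "OPEN"},
]

def sparse_index(items, attribute):
    """A sparse GSI contains only items that have the index key attribute."""
    return [item for item in items if attribute in item]

# Querying the index yields just the flagged items, so no
# FilterExpression scan over the whole table is needed.
print(sparse_index(items, "status"))
```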
Local secondary Index (LSI)
Must be created when the table is created
Partition key must be the same as the base table
Share RCU and WCU with the base table
Auto Scaling
Amazon DynamoDB auto scaling uses the AWS Application Auto Scaling service to dynamically adjust provisioned throughput capacity on your behalf, in response to actual traffic patterns.
With Application Auto Scaling, you create a scaling policy for a table or a global secondary index. The scaling policy specifies whether you want to scale read capacity or write capacity (or both), and the minimum and maximum provisioned capacity unit settings for the table or index.
The scaling policy also contains a target utilization—the percentage of consumed provisioned throughput at a point in time. Application Auto Scaling uses a target tracking algorithm to adjust the provisioned throughput of the table (or index) upward or downward in response to actual workloads, so that the actual capacity utilization remains at or near your target utilization.
DynamoDB Accelerator (DAX)
- Fully managed, highly available, in-memory cache
- Compatible with existing Dynamo API calls
- A write through cache sits between your application and DynamoDB
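The write-through pattern can be modeled in a few lines (an illustrative simulation, not the DAX client): every write lands in both the cache and the table, so subsequent reads are cache hits:

```python
class WriteThroughCache:
    """Toy write-through cache sitting in front of a backing table."""
    def __init__(self):
        self.cache = {}
        self.table = {}   # stands in for the DynamoDB table

    def put(self, key, value):
        self.table[key] = value  # write to the table...
        self.cache[key] = value  # ...and to the cache, in one path

    def get(self, key):
        if key in self.cache:    # cache hit: no table read needed
            return self.cache[key]
        value = self.table[key]  # cache miss: read through and populate
        self.cache[key] = value
        return value

dax = WriteThroughCache()
dax.put("user#1", {"name": "Ada"})
print(dax.get("user#1"))  # served from the cache
```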
Read/Write Capacity Mode
On-Demand Capacity
- You don't need to specify how much read and write throughput you expect your application to perform. DynamoDB charges you for the reads and writes that your application performs on your tables in terms of read request units and write request units.
- Pay-per-request pricing with no minimum capacity
- No charge for read/write, only storage and backups
- Pay more per request than with provisioned capacity
- Use for new product launches, when you are not exactly sure what your consumption is going to look like
Provisioned Capacity
- In provisioned mode, you specify the number of reads and writes per second that you require for your application. You specify throughput capacity in terms of read capacity units (RCUs) and write capacity units (WCUs).
Read Capacity Units (RCU): Number of strongly consistent reads per second of items up to 4 KB in size.
Write Capacity Units (WCU): Number of writes per second of items up to 1 KB in size.
Eventually consistent reads use half the provisioned read capacity: 2 eventually consistent reads cost the same as 1 strongly consistent read.
Throughput is divided evenly among partitions.
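Capacity planning with these units is simple arithmetic: one RCU covers one strongly consistent read per second of up to 4 KB, one WCU covers one write per second of up to 1 KB, and eventually consistent reads cost half. A sketch:

```python
import math

def rcus(reads_per_sec: int, item_kb: float,
         strongly_consistent: bool = True) -> int:
    """RCUs needed: 1 RCU = 1 strongly consistent read/s of up to 4 KB.
    Eventually consistent reads need half as many."""
    per_read = math.ceil(item_kb / 4)
    total = reads_per_sec * per_read
    return total if strongly_consistent else math.ceil(total / 2)

def wcus(writes_per_sec: int, item_kb: float) -> int:
    """WCUs needed: 1 WCU = 1 write/s of up to 1 KB."""
    return writes_per_sec * math.ceil(item_kb / 1)

print(rcus(100, 6))                             # 200 (6 KB -> 2 RCUs each)
print(rcus(100, 6, strongly_consistent=False))  # 100
print(wcus(10, 2.5))                            # 30 (2.5 KB -> 3 WCUs each)
```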
On-Demand Backup and Restore
- Full backups at any time
- Zero impact on table performance or availability, unlike RDS
- Consistent within seconds and retained until deleted
- Can only back up and restore into the same region as the source table
DynamoDB Streams
DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table. Whenever an application creates, updates, or deletes items in the table, DynamoDB Streams writes a stream record with the primary key attributes of the items that were modified. This information is stored in a log for up to 24 hours.
Point-in-Time Recovery (PITR)
Point-in-time recovery allows you to restore a table to its state at any point in time
- Recover from accidental writes/deletions by restoring to any point in the last 35 days. Latest restorable time is usually five minutes in the past
- Not enabled by default
Stream
- Time-ordered sequence of item-level change logs in a table, like a FIFO queue
- Stored for 24 hours
Use cases
- Cross region replication (Global tables)
- Establish relations across tables
- Combine with Lambda functions for functionality like stored procedures
Stream Records
- A single data modification in the DynamoDB table to which the stream belongs
- Has a sequence number reflecting the order in which the record was published to the stream
Shard
- A container into which multiple stream records are grouped
- Contains information required for accessing and iterating through these records
Global tables
- Multi-Master, Multi-Region Replication, Globally distributed
- Multi-region redundancy for disaster recovery and high availability
- No application rewrites; all handled by DynamoDB
- Replication latency under 1 second
Use Cases
- Shopping carts
- Workflow engines
- Inventory tracking and fulfillment
- Customer profiles and accounts
Batch Operations VS Transaction Operations
Batch Operation - If one request in a batch fails, the entire operation does not fail.
Transaction Operation - If one request fails, all operations rollback.
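The two failure modes can be simulated in plain Python (illustrative only, not the DynamoDB API):

```python
def run_batch(table: dict, writes):
    """Batch semantics: failed requests are skipped, the rest succeed."""
    unprocessed = []
    for key, value, ok in writes:
        if ok:
            table[key] = value
        else:
            unprocessed.append((key, value))
    return unprocessed  # caller may retry these, like UnprocessedItems

def run_transaction(table: dict, writes):
    """Transaction semantics: one failure rolls back every write."""
    if not all(ok for _, _, ok in writes):
        return False  # nothing is applied
    for key, value, _ in writes:
        table[key] = value
    return True

writes = [("a", 1, True), ("b", 2, False)]  # second request fails
batch_table, txn_table = {}, {}
run_batch(batch_table, writes)      # batch_table keeps the success: {"a": 1}
run_transaction(txn_table, writes)  # txn_table stays empty: {}
```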
Redshift
AWS's data warehousing solution, used by management to run queries on large amounts of data.
Redshift can be configured as a single node with 160 GB of storage, or as multiple nodes, with a leader node that manages client connections and receives queries and up to 128 compute nodes that store data and perform queries and computations.
You are billed per compute node per hour. Leader node hours are not charged.
Redshift always attempts to maintain at least three copies of your data, the original, a replica on the compute nodes, and a backup in Amazon S3.
Only available for single AZ.
Redshift can asynchronously replicate your snapshots to S3 in another AZ or region for disaster recovery.
Encrypted in transit with SSL, at rest with AES-256.
With the Concurrency Scaling feature, you can support virtually unlimited concurrent users and concurrent queries, with consistently fast query performance. Users see the most current data, whether the queries run on the main cluster or a concurrency-scaling cluster. When you turn on concurrency scaling, Amazon Redshift automatically adds additional cluster capacity to process an increase in both read queries and write queries.
ElastiCache
ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud. It helps you retrieve information from fast, managed, in-memory caches, instead of relying entirely on slower disk-based databases.
It supports two open-source in-memory caching engines: Memcached and Redis
Two modes:
- master-slave mode
- cluster mode (no downtime when one node fails; better for disaster recovery)
Set a TTL to expire data automatically
Memcached can scale horizontally.
Redis in ElastiCache runs on a single thread. If you want high concurrency, Memcached is better.
Compared to Memcached, Redis has
- Advanced data types
- Ranking/sorting data types
- Publish/subscribe capabilities
- Persistence (Memcached does not guarantee that data in memory is kept safe)
- Multi-AZ
- Backup & Restore capabilities
Database Migration Service (DMS)
AWS DMS is a server that runs replication software to migrate databases on EC2 instances, RDS, some on-premises databases, etc. to targets such as EC2, S3, RDS, on-premises databases, Aurora, DynamoDB, etc. You create a source and a target connection to tell AWS DMS where to extract from and where to load to. Then you schedule a task on this server to move your data.
Homogeneous migration - source and target have the same database engine
Heterogeneous migration - source and target have different database engines
Supports both homogeneous and heterogeneous migration.
For heterogeneous migration, you need the AWS Schema Conversion Tool (SCT) to convert your table schemas. SCT can run on the same server as the DMS replication software. You do not need SCT if you are migrating between identical database engines.
Elastic Map Reduce (EMR)
Used for big data processing with frameworks such as Apache Hive, Apache Spark, and Apache HBase.
EMR is a cluster consisting of multiple EC2 instances, which are called nodes. Each node has a role within the cluster, referred to as the node type.
By default, logs are stored on your master node under /mnt/var/log/, so if the master node is terminated for some reason, the logs are lost. To ensure log persistence, you can configure the cluster to archive the log files to Amazon S3 every five minutes. This backup configuration must be set up when you first create the cluster.
Node Types
Master Node: Node that manages the cluster, tracks the status of tasks, and monitors the health of the cluster. Each cluster has a master node.
Core Node: Node that runs tasks and stores data in HDFS. Multi-node clusters have at least one core node.
Task Node (Optional): Node with software components that only runs tasks and does not store data in HDFS.
Amazon Neptune
A graph database used to find insights and relationships within large amounts of data
DocumentDB
Amazon DocumentDB is a scalable, highly durable, and fully managed database service for operating mission-critical MongoDB workloads.