Index
Symbols
- 3FS (distributed filesystem), Distributed Filesystems
A
- aborts (transactions), Transactions, Atomicity
- cascading, No dirty reads
- in two-phase commit, Two-Phase Commit (2PC)
- performance of optimistic concurrency control, Performance of serializable snapshot isolation
- retrying aborted transactions, Handling errors and aborts
- abstraction, Layering of cloud services, Simplicity: Managing Complexity, Data Models and Query Languages, Transactions, Summary
- accidental complexity, Simplicity: Managing Complexity
- accountability, Responsibility and Accountability
- accounting (financial data), Summary, Advantages of immutable events
- Accumulo (database)
- wide-column data model, Data locality for reads and writes, Column Compression
- ACID properties (transactions), The Meaning of ACID
- atomicity, Atomicity, Single-Object and Multi-Object Operations
- consistency, Consistency, Maintaining integrity in the face of software bugs
- durability, Making B-trees reliable, Durability
- isolation, Isolation, Single-Object and Multi-Object Operations
- acknowledgements (messaging), Acknowledgments and redelivery
- active/active replication (see multi-leader replication)
- active/passive replication (see leader-based replication)
- ActiveMQ (messaging), Message brokers, Message brokers compared to databases
- distributed transaction support, XA transactions
- ActiveRecord (object-relational mapper), Object-relational mapping (ORM), Handling errors and aborts
- activity (workflows) (see workflow engines)
- actor model, Distributed actor frameworks
- (see also event-driven architecture)
- comparison to stream processing, Event-Driven Architectures and RPC
- adaptive capacity, Skewed Workloads and Relieving Hot Spots
- Advanced Message Queuing Protocol (see AMQP)
- aerospace systems, Byzantine Faults
- Aerospike (database)
- strong consistency mode, Single-object writes
- AGE (graph database), The Cypher Query Language
- aggregation
- data cubes and materialized views, Materialized Views and Data Cubes
- in batch processes, Sorting Versus In-memory Aggregation
- in stream processes, Stream analytics
- aggregation pipeline (MongoDB), Normalization, Denormalization, and Joins, Query languages for documents
- Agile, Evolvability: Making Change Easy
- minimizing irreversibility, Batch Processing, Reprocessing data for application evolution
- moving faster with confidence, The end-to-end argument again
- agreement, Single-value consensus, Atomic commitment as consensus
- (see also consensus)
- AI (artificial intelligence) (see machine learning)
- AI Act (European Union), Data Systems, Law, and Society
- AirByte, Data Warehousing
- Airflow (workflow scheduler), Durable Execution and Workflows, Batch Processing, Scheduling Workflows
- cloud data warehouse integration, Query languages
- use for ETL, Extract–Transform–Load (ETL)
- Akamai
- response time study, Average, Median, and Percentiles
- algorithms
- algorithm correctness, Defining the correctness of an algorithm
- B-trees, B-Trees-B-tree variants
- for distributed systems, System Model and Reality
- mergesort, Constructing and merging SSTables, Shuffling Data
- scheduling, Resource Allocation
- SSTables and LSM-trees, The SSTable file format-Compaction strategies
- all-to-all replication topologies, Multi-leader replication topologies
- AllegroGraph (database), Graph-Like Data Models
- SPARQL query language, The SPARQL query language
- ALTER TABLE statement (SQL), Schema flexibility in the document model, Encoding and Evolution
- Amazon
- Dynamo (see Dynamo (database))
- response time study, Average, Median, and Percentiles
- Amazon Web Services (AWS)
- Aurora (see Aurora (cloud database))
- ClockBound (see ClockBound (time sync))
- correctness testing, Formal Methods and Randomized Testing
- DynamoDB (see DynamoDB (database))
- EBS (see EBS (virtual block device))
- Kinesis (see Kinesis (messaging))
- Neptune (see Neptune (graph database))
- network reliability, Network Faults in Practice
- S3 (see S3 (object storage))
- amplification
- of bias, Bias and Discrimination
- of failures, Maintaining derived state
- of tail latency, Use of Response Time Metrics, Local Secondary Indexes
- write amplification, Write amplification
- AMQP (Advanced Message Queuing Protocol), Message brokers compared to databases
- (see also messaging systems)
- comparison to log-based messaging, Logs compared to traditional messaging, Replaying old messages
- message ordering, Acknowledgments and redelivery
- analytical systems, Operational Versus Analytical Systems
- as derived data systems, Systems of Record and Derived Data
- ETL from operational systems, Data Warehousing
- governance, Beyond the data lake
- analytics, Operational Versus Analytical Systems-Systems of Record and Derived Data
- comparison to transaction processing, Characterizing Transaction Processing and Analytics
- data normalization, Trade-offs of normalization
- data warehousing (see data warehousing)
- predictive (see predictive analytics)
- relation to batch processing, Analytics-Analytics
- schemas for, Stars and Snowflakes: Schemas for Analytics-Stars and Snowflakes: Schemas for Analytics
- snapshot isolation for queries, Snapshot Isolation and Repeatable Read
- stream analytics, Stream analytics
- analytics engineering, Operational Versus Analytical Systems
- anti-entropy, Catching up on missed writes
- Antithesis (deterministic simulation testing), Deterministic simulation testing
- Apache Accumulo (see Accumulo)
- Apache ActiveMQ (see ActiveMQ)
- Apache AGE (see AGE)
- Apache Arrow (see Arrow (data format))
- Apache Avro (see Avro)
- Apache Beam (see Beam)
- Apache BookKeeper (see BookKeeper)
- Apache Cassandra (see Cassandra)
- Apache Curator (see Curator)
- Apache DataFusion (see DataFusion (query engine))
- Apache Druid (see Druid (database))
- Apache Flink (see Flink (processing framework))
- Apache HBase (see HBase)
- Apache Iceberg (see Iceberg (table format))
- Apache Jena (see Jena)
- Apache Kafka (see Kafka)
- Apache Lucene (see Lucene)
- Apache Oozie (see Oozie (workflow scheduler))
- Apache ORC (see ORC (data format))
- Apache Parquet (see Parquet (data format))
- Apache Pig (query language), Query languages
- Apache Pinot (see Pinot (database))
- Apache Pulsar (see Pulsar)
- Apache Qpid (see Qpid)
- Apache Samza (see Samza)
- Apache Solr (see Solr)
- Apache Spark (see Spark (processing framework))
- Apache Storm (see Storm)
- Apache Superset (see Superset (data visualization software))
- Apache Thrift (see Thrift)
- Apache ZooKeeper (see ZooKeeper)
- Apama (stream analytics), Complex event processing
- append-only files (see logs)
- Application Programming Interfaces (APIs), Data Models and Query Languages
- for change streams, API support for change streams
- for distributed transactions, XA transactions
- for services, Dataflow Through Services: REST and RPC-Data encoding and evolution for RPC
- (see also services)
- evolvability, Data encoding and evolution for RPC
- RESTful, Web services
- application state (see state)
- approximate search (see similarity search)
- archival storage, data from databases, Archival storage
- arcs (see edges)
- ArcticDB (database), DataFrames, Matrices, and Arrays
- arithmetic mean, Average, Median, and Percentiles
- arrays
- array databases, DataFrames, Matrices, and Arrays
- multidimensional, DataFrames, Matrices, and Arrays
- Arrow (data format), Column-Oriented Storage, DataFrames
- artificial intelligence (see machine learning)
- ASCII text, Protocol Buffers
- ASN.1 (schema language), The Merits of Schemas
- associative table, Many-to-One and Many-to-Many Relationships, Property Graphs
- asynchronous networks, Unreliable Networks, Glossary
- comparison to synchronous networks, Synchronous Versus Asynchronous Networks
- system model, System Model and Reality
- asynchronous replication, Synchronous Versus Asynchronous Replication, Glossary
- data loss on failover, Leader failure: Failover
- reads from asynchronous follower, Problems with Replication Lag
- with multiple leaders, Multi-Leader Replication
- Asynchronous Transfer Mode (ATM), Can we not simply make network delays predictable?
- atomic broadcast, Shared logs as consensus
- atomic clocks, Clock readings with a confidence interval, Synchronized clocks for global snapshots
- (see also clocks)
- atomicity (concurrency), Glossary
- atomic increment, Single-object writes
- compare-and-set (CAS), Conditional writes (compare-and-set), What Makes a System Linearizable?
- (see also compare-and-set (CAS))
- denormalized data, Trade-offs of normalization
- fetch-and-add/increment, ID Generators and Logical Clocks, Consensus, Fetch-and-add as consensus
- write operations, Atomic write operations
- atomicity (transactions), Atomicity, Single-Object and Multi-Object Operations, Glossary
- atomic commit
- avoiding, Multi-shard request processing, Coordination-avoiding data systems
- blocking and nonblocking, Three-phase commit
- in stream processing, Exactly-once message processing, Exactly-once message processing revisited, Atomic commit revisited
- maintaining derived data, Keeping Systems in Sync
- distributed transactions, Distributed Transactions-Exactly-once message processing revisited
- for multi-object transactions, Single-Object and Multi-Object Operations
- for single-object writes, Single-object writes
- relation to consensus, Atomic commitment as consensus
- auditability, Trust, but Verify-Tools for auditable data systems
- designing for, Designing for auditability
- self-auditing systems, Don’t just blindly trust what they promise
- through immutability, Advantages of immutable events
- tools for auditable data systems, Tools for auditable data systems
- Aurora (cloud database), Cloud-Native System Architecture
- Aurora DSQL (database)
- snapshot isolation support, Snapshot Isolation and Repeatable Read
- auto-scaling, Operations: Automatic or Manual Rebalancing
- Automerge (CRDT library), Pros and cons of sync engines
- availability, Reliability and Fault Tolerance
- (see also fault tolerance)
- in CAP theorem, The CAP theorem
- in leader election, Subtleties of consensus
- in service level agreements (SLAs), Use of Response Time Metrics
- availability zones, Tolerating hardware faults through redundancy, Reading Your Own Writes
- Avro (data format), Avro-Dynamically generated schemas
- dynamically generated schemas, Dynamically generated schemas
- object container files, But what is the writer’s schema?, Archival storage
- reader determining writer’s schema, But what is the writer’s schema?
- schema evolution, The writer’s schema and the reader’s schema
- use in batch processing, MapReduce
- awk (Unix tool), Simple Log Analysis, Distributed Job Orchestration
- Axon Framework, Event Sourcing and CQRS
- Azkaban (workflow scheduler), Batch Processing
- Azure Blob Storage (object storage), Layering of cloud services, Setting Up New Followers
- conditional headers, Fencing off zombies and delayed requests
- Azure managed disks, Separation of storage and compute
- Azure SQL DB (database), Cloud-Native System Architecture
- Azure Storage, Object Stores
- Azure Synapse Analytics (database), Cloud-Native System Architecture
- Azure Virtual Machines
- spot virtual machines, Handling Faults
B
- B-trees (indexes), B-Trees-B-tree variants
- B+ trees, B-tree variants
- branching factor, B-Trees
- comparison to LSM-trees, Comparing B-Trees and LSM-Trees-Disk space usage
- crash recovery, Making B-trees reliable
- growing by splitting a page, B-Trees
- immutable variants, B-tree variants, Indexes and snapshot isolation
- similarity to shard splitting, Rebalancing key-range sharded data
- variants, B-tree variants
- B2 (object storage), Distributed Filesystems
- Backblaze B2 (see B2 (object storage))
- backend, Trade-offs in Data Systems Architecture
- backoff, exponential, Describing Performance, Handling errors and aborts
- backpressure, Describing Performance, Read performance, Messaging Systems, Glossary
- in batch processing, Scheduling Workflows
- in TCP, The Limitations of TCP
- backups
- database snapshot for replication, Setting Up New Followers
- in multitenant systems, Sharding for Multitenancy
- integrity of, Don’t just blindly trust what they promise
- snapshot isolation for, Snapshot Isolation and Repeatable Read
- using object storage, Setting Up New Followers
- versus replication, Replication
- backward compatibility, Encoding and Evolution
- BadgerDB (database)
- serializable transactions, Serializable Snapshot Isolation (SSI)
- BASE, contrast to ACID, The Meaning of ACID
- bash shell (Unix), Storage and Indexing for OLTP
- batch processing, Batch Processing-Summary, Glossary
- and functional programming, MapReduce
- benefits of, Batch Processing
- combining with stream processing, Unifying batch and stream processing
- comparison to stream processing, Processing Streams
- dataflow engines, Dataflow Engines-Dataflow Engines
- fault tolerance, Handling Faults, Messaging Systems
- for data integration, Batch and Stream Processing-Unifying batch and stream processing
- graphs and iterative processing, Machine Learning
- high-level APIs and languages, Query languages-Query languages
- in cloud data warehouses, Query languages
- in distributed systems, Batch Processing in Distributed Systems
- join and group by, JOIN and GROUP BY-JOIN and GROUP BY
- limitations, Batch Processing
- log-based messaging and, Replaying old messages
- maintaining derived state, Maintaining derived state
- measuring performance, Batch Processing
- models of, Batch Processing Models
- resource allocation, Resource Allocation-Resource Allocation
- resource managers, Distributed Job Orchestration
- schedulers, Distributed Job Orchestration
- serving derived data, Serving Derived Data-Serving Derived Data
- shuffling data, Shuffling Data-Shuffling Data
- task execution, Distributed Job Orchestration
- use cases, Batch Use Cases-Serving Derived Data
- using Unix tools (example), Batch Processing with Unix Tools-Sorting Versus In-memory Aggregation
- batch processing frameworks
- comparison to operating systems, Batch Processing in Distributed Systems
- Beam (dataflow library), Unifying batch and stream processing
- BERT (language model), Vector Embeddings
- bias, Bias and Discrimination
- bidirectional replication (see multi-leader replication)
- big ball of mud, Simplicity: Managing Complexity
- big data
- versus data minimization, Data Systems, Law, and Society, Legislation and Self-Regulation
- BigQuery (database), Cloud-Native System Architecture, Cloud Data Warehouses, Batch Processing
- DataFrames, Query languages
- sharding and clustering, Sharding by hash range
- shuffling data, Shuffling Data
- snapshot isolation support, Snapshot Isolation and Repeatable Read
- Bigtable (database)
- sharding scheme, Sharding by Key Range
- storage layout, Constructing and merging SSTables
- tablets (sharding), Sharding
- wide-column data model, Data locality for reads and writes, Column Compression
- binary data encodings, Binary encoding-The Merits of Schemas
- Avro, Avro-Dynamically generated schemas
- MessagePack, Binary encoding-Binary encoding
- Protocol Buffers, Protocol Buffers-Field tags and schema evolution
- binary encoding
- based on schemas, The Merits of Schemas
- by network drivers, The Merits of Schemas
- binary strings, lack of support in JSON and XML, JSON, XML, and Binary Variants
- Bitcoin (cryptocurrency), Tools for auditable data systems
- Byzantine fault tolerance, Byzantine Faults
- concurrency bugs in exchanges, Weak Isolation Levels
- bitmap indexes, Column Compression
- BitTorrent uTP protocol, The Limitations of TCP
- Bkd-trees (indexes), Multidimensional and Full-Text Indexes
- blameless postmortems, Humans and Reliability
- Blazegraph (database), Graph-Like Data Models
- SPARQL query language, The SPARQL query language
- blob storage (see object storage)
- block (file system), Distributed Filesystems
- block device (disk), Separation of storage and compute
- blockchains, Summary
- Byzantine fault tolerance, Byzantine Faults, Consensus, Tools for auditable data systems
- blocking atomic commit, Three-phase commit
- Bloom filter (algorithm), Bloom filters, Read performance, Stream analytics
- BookKeeper (replicated log), Allocating work to nodes
- bounded datasets, Stream Processing, Glossary
- (see also batch processing)
- bounded delays, Glossary
- in networks, Synchronous Versus Asynchronous Networks
- process pauses, Response time guarantees
- broadcast
- total order broadcast (see shared logs)
- brokerless messaging, Direct messaging from producers to consumers
- Brubeck (metrics aggregator), Direct messaging from producers to consumers
- BTM (transaction coordinator), Two-Phase Commit (2PC)
- Buf
- Bufstream (messaging), Setting Up New Followers, Disk space usage
- build or buy, Cloud Versus Self-Hosting
- bursty network traffic patterns, Can we not simply make network delays predictable?
- business analyst, Operational Versus Analytical Systems, From data warehouse to data lake
- business data processing, Characterizing Transaction Processing and Analytics
- business intelligence, Operational Versus Analytical Systems-Data Warehousing
- Business Process Execution Language (BPEL), Durable Execution and Workflows
- Business Process Model and Notation (BPMN), Durable Execution and Workflows
- example, Durable Execution and Workflows
- byte sequence, encoding data in, Formats for Encoding Data
- Byzantine faults, Byzantine Faults-Weak forms of lying, System Model and Reality, Glossary
- Byzantine fault-tolerant systems, Byzantine Faults
- Byzantine Generals Problem, Byzantine Faults
- consensus algorithms and, Consensus, Tools for auditable data systems
C
- caches, Keeping everything in memory, Glossary
- and materialized views, Materialized Views and Data Cubes
- as derived data, Systems of Record and Derived Data, Composing Data Storage Technologies-Unbundled versus integrated systems
- in CPUs, Query Execution: Compilation and Vectorization, Linearizability and network delays
- invalidation and maintenance, Keeping Systems in Sync, Maintaining materialized views
- linearizability, Linearizability
- local disks in the cloud, Separation of storage and compute
- calendar sync, Sync Engines and Local-First Software, Pros and cons of sync engines
- California Consumer Privacy Act (CCPA), Data Systems, Law, and Society
- Camunda (workflow engine), Durable Execution and Workflows
- canonical version (of data), Systems of Record and Derived Data
- CAP theorem, The CAP theorem-The CAP theorem, Glossary
- capacity planning, Operations in the Cloud Era
- Cap’n Proto (data format), Formats for Encoding Data
- carbon emissions, Distributed Versus Single-Node Systems
- cascading aborts, No dirty reads
- cascading failures, Software faults, Operations: Automatic or Manual Rebalancing, Timeouts and Unbounded Delays
- Cassandra (database)
- change data capture, Implementing change data capture, API support for change streams
- compaction strategy, Compaction strategies
- consistency level ANY, Single-Leader Versus Leaderless Replication Performance
- hash-range sharding, Sharding by Hash of Key, Sharding by hash range
- last-write-wins conflict resolution, Detecting Concurrent Writes
- leaderless replication, Leaderless Replication
- lightweight transactions, Single-object writes
- linearizability, lack of, Implementing Linearizable Systems
- log-structured storage, Constructing and merging SSTables
- multi-region support, Multi-region operation
- secondary indexes, Local Secondary Indexes
- use of clocks, Limitations of Quorum Consistency, Timestamps for ordering events
- vnodes (sharding), Sharding
- cat (Unix tool), Simple Log Analysis
- catalog, Cloud Data Warehouses
- causal context, Version vectors
- (see also causal dependencies)
- causal dependencies, The “happens-before” relation and concurrency-Version vectors
- capturing, Version vectors, Ordering events to capture causality, Reads are events too
- by total ordering, The limits of total ordering
- in transactions, Decisions based on an outdated premise
- sending message to friends (example), Ordering events to capture causality
- causality, Glossary
- causal ordering
- total order consistent with, Logical Clocks
- consistency with, Logical Clocks-Enforcing constraints using logical clocks
- happens-before relation, The “happens-before” relation and concurrency
- in serializable transactions, Decisions based on an outdated premise-Detecting writes that affect prior reads
- mismatch with clocks, Timestamps for ordering events
- ordering events to capture, Ordering events to capture causality
- violations of, Consistent Prefix Reads, Problems with different topologies, Timestamps for ordering events
- with synchronized clocks, Synchronized clocks for global snapshots
- cell-based architecture, Sharding for Multitenancy
- CEP (see complex event processing)
- CephFS (distributed filesystem), Batch Processing, Object Stores
- certificate transparency, Tools for auditable data systems
- cgroups, Distributed Job Orchestration
- change data capture, Logical (row-based) log replication, Change Data Capture
- API support for change streams, API support for change streams
- comparison to event sourcing, Change data capture versus event sourcing
- implementing, Implementing change data capture
- initial snapshot, Initial snapshot
- log compaction, Log compaction
- changelogs, State, Streams, and Immutability
- change data capture, Change Data Capture
- for operator state, Rebuilding state after a failure
- in stream joins, Stream-table join (stream enrichment)
- log compaction, Log compaction
- maintaining derived state, Databases and Streams
- chaos engineering, Fault Tolerance, Fault injection
- checkpointing
- in high-performance computing, Cloud Computing Versus Supercomputing
- in stream processors, Microbatching and checkpointing
- circuit breaker (limiting retries), Describing Performance
- circuit-switched networks, Synchronous Versus Asynchronous Networks
- circular buffers, Disk space usage
- circular replication topologies, Multi-leader replication topologies
- Citus (database)
- hash sharding, Fixed number of shards
- ClickHouse (database), Characterizing Transaction Processing and Analytics, Cloud-Native System Architecture
- incremental view maintenance, Maintaining materialized views
- clickstream data, analysis of, JOIN and GROUP BY
- clients
- calling services, Dataflow Through Services: REST and RPC
- offline-capable, Sync Engines and Local-First Software, Stateful, offline-capable clients
- pushing state changes to, Pushing state changes to clients
- request routing, Request Routing
- ClockBound (time sync), Clock readings with a confidence interval
- use in YugabyteDB, Synchronized clocks for global snapshots
- clocks, Unreliable Clocks-Limiting the impact of garbage collection
- atomic clocks, Clock readings with a confidence interval, Synchronized clocks for global snapshots
- confidence interval, Clock readings with a confidence interval-Synchronized clocks for global snapshots
- for global snapshots, Synchronized clocks for global snapshots
- hybrid logical clocks, Hybrid logical clocks
- logical (see logical clocks)
- skew, Last write wins (discarding concurrent writes), Limitations of Quorum Consistency, Relying on Synchronized Clocks-Clock readings with a confidence interval, Implementing Linearizable Systems
- slewing, Monotonic clocks
- synchronization and accuracy, Clock Synchronization and Accuracy-Clock Synchronization and Accuracy
- synchronization using GPS, Unreliable Clocks, Clock Synchronization and Accuracy, Clock readings with a confidence interval, Synchronized clocks for global snapshots
- time-of-day versus monotonic clocks, Monotonic Versus Time-of-Day Clocks
- timestamping events, Whose clock are you using, anyway?
- cloud services, Cloud Versus Self-Hosting-Cloud Computing Versus Supercomputing
- availability zones, Tolerating hardware faults through redundancy, Reading Your Own Writes
- data warehouses, Cloud Data Warehouses
- need for service discovery, Service discovery
- network glitches, Network Faults in Practice
- pros and cons, Pros and Cons of Cloud Services-Pros and Cons of Cloud Services
- quotas, Operations in the Cloud Era
- regions (see regions (geographic distribution))
- serverless, Microservices and Serverless
- shared resources, Network congestion and queueing
- versus supercomputing, Cloud Computing Versus Supercomputing
- cloud-native, Cloud-Native System Architecture-Operations in the Cloud Era
- Cloudflare
- R2 (see R2 (object storage))
- clustered indexes, Storing values within the index
- clustering (record ordering), Sharding by hash range
- CockroachDB (database)
- consensus-based replication, Single-Leader Replication
- consistency model, What Makes a System Linearizable?
- key-range sharding, Sharding, Sharding by Key Range
- serializable transactions, Serializable Snapshot Isolation (SSI)
- sharded secondary indexes, Global Secondary Indexes
- transactions, What Exactly Is a Transaction?, Database-internal Distributed Transactions
- use of model-checking, Model checking and specification languages
- code generation
- for query execution, Query Execution: Compilation and Vectorization
- with Protocol Buffers, Protocol Buffers
- collaborative editing, Real-time collaboration, offline-first, and local-first apps
- column families (Bigtable), Data locality for reads and writes, Column Compression
- column-oriented storage, Column-Oriented Storage-Query Execution: Compilation and Vectorization
- column compression, Column Compression
- Parquet, Column-Oriented Storage, Archival storage
- sort order in, Sort Order in Column Storage-Sort Order in Column Storage
- vectorized processing, Query Execution: Compilation and Vectorization
- versus wide-column model, Column Compression
- writing to, Writing to Column-Oriented Storage
- comma-separated values (see CSV)
- command query responsibility segregation (CQRS), Event Sourcing and CQRS-Event Sourcing and CQRS, Deriving several views from the same event log
- commands (event sourcing), Event Sourcing and CQRS
- commits (transactions), Transactions
- atomic commit, Distributed Transactions-Exactly-once message processing revisited
- (see also atomicity; transactions)
- read committed isolation, Read Committed
- three-phase commit (3PC), Three-phase commit
- two-phase commit (2PC), Two-Phase Commit (2PC)-Coordinator failure
- commutative operations, Conflict resolution and replication
- compaction
- of changelogs, Log compaction
- (see also log compaction)
- for stream operator state, Rebuilding state after a failure
- of log-structured storage, Constructing and merging SSTables
- issues with, Read performance
- size-tiered and leveled approaches, Compaction strategies, Disk space usage
- compare-and-set (CAS), Conditional writes (compare-and-set), What Makes a System Linearizable?
- implementing locks, Coordination Services
- implementing uniqueness constraints, Constraints and uniqueness guarantees
- on object storage, Setting Up New Followers
- relation to consensus, Linearizability and quorums, Consensus, Compare-and-set as consensus
- relation to fencing tokens, Fencing off zombies and delayed requests
- relation to transactions, Single-object writes
- compatibility, Encoding and Evolution, Modes of Dataflow
- calling services, Data encoding and evolution for RPC
- properties of encoding formats, Summary
- using databases, Dataflow Through Databases-Archival storage
- compensating transactions, Advantages of immutable events, Loosely interpreted constraints
- compilation, Query Execution: Compilation and Vectorization
- complex event processing (CEP), Complex event processing
- complexity
- distilling in theoretical models, Mapping system models to the real world
- essential and accidental, Simplicity: Managing Complexity
- hiding using abstraction, Data Models and Query Languages
- managing, Simplicity: Managing Complexity
- composing data systems (see unbundling databases)
- compression
- in SSTables, The SSTable file format
- compute-intensive applications, Trade-offs in Data Systems Architecture
- computer games, Pros and cons of sync engines
- concatenated indexes, Multidimensional and Full-Text Indexes
- in hash-sharded systems, Sharding by hash range
- concurrency
- actor programming model, Distributed actor frameworks, Event-Driven Architectures and RPC
- (see also event-driven architecture)
- bugs from weak transaction isolation, Weak Isolation Levels
- conflict resolution, Dealing with Conflicting Writes-Types of conflict
- definition, Dealing with Conflicting Writes
- detecting concurrent writes, Detecting Concurrent Writes-Version vectors
- dual writes, problems with, Keeping Systems in Sync
- happens-before relation, The “happens-before” relation and concurrency
- in replicated systems, Problems with Replication Lag-Version vectors, Linearizability-Linearizability and network delays
- lost updates, Preventing Lost Updates
- multi-version concurrency control (MVCC), Multi-version concurrency control (MVCC), Synchronized clocks for global snapshots
- optimistic concurrency control, Pessimistic versus optimistic concurrency control
- ordering of operations, What Makes a System Linearizable?
- reducing, through event logs, Concurrency control, Dataflow: Interplay between state changes and application code
- time and relativity, The “happens-before” relation and concurrency
- transaction isolation, Isolation
- write skew (transaction isolation), Write Skew and Phantoms-Materializing conflicts
- conditional write, Conditional writes (compare-and-set)
- in transactions, Single-object writes
- on object storage, Setting Up New Followers
- conference management system (example), Event Sourcing and CQRS
- conflict-free replicated datatypes (CRDTs), CRDTs and Operational Transformation
- for leaderless replication, Capturing the happens-before relationship
- preventing lost updates, Conflict resolution and replication
- conflicts
- avoidance, Conflict avoidance
- causal dependencies, The “happens-before” relation and concurrency
- conflict detection
- in distributed transactions, Problems with XA transactions
- in log-based systems, Uniqueness constraints require consensus
- in serializable snapshot isolation (SSI), Detecting writes that affect prior reads
- in two-phase commit, A system of promises
- conflict resolution
- by aborting transactions, Pessimistic versus optimistic concurrency control
- by apologizing, Loosely interpreted constraints
- last write wins (LWW), Timestamps for ordering events
- using atomic operations, Conflict resolution and replication
- determining what is a conflict, Types of conflict, Uniqueness in log-based messaging
- in leaderless replication, Detecting Concurrent Writes
- lost updates, Preventing Lost Updates-Conflict resolution and replication
- materializing, Materializing conflicts
- resolution, Dealing with Conflicting Writes-Types of conflict
- automatic, Automatic conflict resolution
- in leaderless systems, Detecting Concurrent Writes
- last write wins (LWW), Last write wins (discarding concurrent writes)
- using custom logic, Manual conflict resolution, Capturing the happens-before relationship
- siblings, Manual conflict resolution, Capturing the happens-before relationship
- write skew (transaction isolation), Write Skew and Phantoms-Materializing conflicts
- Confluent
- Freight (messaging), Setting Up New Followers, Disk space usage
- schema registry, JSON Schema, But what is the writer’s schema?
- congestion (networks)
- avoidance, The Limitations of TCP
- limiting accuracy of clocks, Clock readings with a confidence interval
- queueing delays, Network congestion and queueing
- consensus, Consensus-Summary, Glossary
- algorithms, Consensus, Consensus in Practice
- consensus numbers, Fetch-and-add as consensus
- coordination services, Coordination Services-Service discovery
- cost of, Pros and cons of consensus
- impossibility of, Consensus
- preventing split brain, From single-leader replication to consensus
- reconfiguration, Subtleties of consensus
- relation to atomic commitment, Atomic commitment as consensus
- relation to compare-and-set (CAS), Linearizability and quorums, Compare-and-set as consensus
- relation to fetch-and-add, Fetch-and-add as consensus
- relation to replication, Using shared logs
- relation to shared logs, Shared logs as consensus
- relation to uniqueness constraints, Uniqueness constraints require consensus
- safety and liveness properties, Single-value consensus
- single-value consensus, Single-value consensus
- consent (GDPR), Consent and Freedom of Choice
- consistency, Consistency, Timeliness and Integrity
- across different databases, Leader failure: Failover, Keeping Systems in Sync, Deriving several views from the same event log, Derived data versus distributed transactions
- causal, Consistent Prefix Reads, Problems with different topologies, Ordering events to capture causality
- consistent prefix reads, Consistent Prefix Reads-Consistent Prefix Reads
- consistent snapshots, Setting Up New Followers, Snapshot Isolation and Repeatable Read-Snapshot isolation, repeatable read, and naming confusion, Synchronized clocks for global snapshots, Initial snapshot, Creating an index
- (see also snapshots)
- crash recovery, Making B-trees reliable
- enforcing constraints (see constraints)
- eventual, Problems with Replication Lag
- (see also eventual consistency)
- in ACID transactions, Consistency, Maintaining integrity in the face of software bugs
- in CAP theorem, The CAP theorem
- in leader election, Subtleties of consensus
- in microservices, Problems with Distributed Systems
- linearizability, Solutions for Replication Lag, Linearizability-Linearizability and network delays
- meanings of, Consistency
- monotonic reads, Monotonic Reads-Monotonic Reads
- of secondary indexes, The need for multi-object transactions, Indexes and snapshot isolation, Reasoning about dataflows, Creating an index
- read-after-write, Reading Your Own Writes-Reading Your Own Writes
- in derived data systems, Derived data versus distributed transactions
- strong (see linearizability)
- timeliness and integrity, Timeliness and Integrity
- using quorums, Limitations of Quorum Consistency, Linearizability and quorums
- consistent hashing, Consistent hashing
- consistent prefix reads, Consistent Prefix Reads
- constraints (databases), Consistency, Characterizing write skew
- asynchronously checked, Loosely interpreted constraints
- coordination avoidance, Coordination-avoiding data systems
- ensuring idempotence, Uniquely identifying requests
- in log-based systems, Enforcing Constraints-Multi-shard request processing
- across multiple shards, Multi-shard request processing
- in two-phase commit, Distributed Transactions, A system of promises
- relation to consensus, Uniqueness constraints require consensus
- requiring linearizability, Constraints and uniqueness guarantees
- Consul (coordination service), Coordination Services
- use for service discovery, Service discovery
- consumers (message streams), Message brokers, Transmitting Event Streams
- backpressure, Messaging Systems
- consumer groups, Multiple consumers
- consumer offsets in logs, Consumer offsets
- failures, Acknowledgments and redelivery, Consumer offsets
- fan-out, Materializing and Updating Timelines, Multiple consumers, Logs compared to traditional messaging
- load balancing, Multiple consumers, Logs compared to traditional messaging
- not keeping up with producers, Messaging Systems, Disk space usage, Making unbundling work
- content models (JSON Schema), JSON Schema
- contention
- between transactions, Handling errors and aborts
- blocking threads, Process Pauses
- performance of optimistic concurrency control, Pessimistic versus optimistic concurrency control
- under two-phase locking, Performance of two-phase locking
- context switches, Latency and Response Time, Process Pauses
- convergence (conflict resolution), Automatic conflict resolution-CRDTs and Operational Transformation
- coordination
- avoidance, Coordination-avoiding data systems
- cross-datacenter, The limits of total ordering
- cross-region, Geographically Distributed Operation
- cross-shard ordering, Sharding, Synchronized clocks for global snapshots, Using shared logs, Multi-shard request processing
- routing requests to shards, Request Routing
- services, Locking and leader election, Coordination Services-Service discovery
- coordinator (in 2PC), Two-Phase Commit (2PC)
- failure, Coordinator failure
- in XA transactions, XA transactions-Problems with XA transactions
- recovery, Recovering from coordinator failure
- copy-on-write (B-trees), B-tree variants, Indexes and snapshot isolation
- CORBA (Common Object Request Broker Architecture), The problems with remote procedure calls (RPCs)
- coronal mass ejection (see solar storm)
- correctness
- auditability, Trust, but Verify-Tools for auditable data systems
- Byzantine fault tolerance, Byzantine Faults
- dealing with partial failures, Faults and Partial Failures
- in log-based systems, Enforcing Constraints-Multi-shard request processing
- of algorithm within system model, Defining the correctness of an algorithm
- of derived data, Designing for auditability
- of immutable data, Advantages of immutable events
- of personal data, Responsibility and Accountability, Privacy and Use of Data
- of time, Problems with different topologies, Clock Synchronization and Accuracy-Synchronized clocks for global snapshots
- of transactions, Consistency, Aiming for Correctness, Maintaining integrity in the face of software bugs
- timeliness and integrity, Timeliness and Integrity-Coordination-avoiding data systems
- corruption of data
- detecting, The end-to-end argument, Don’t just blindly trust what they promise-Tools for auditable data systems
- due to pathological memory access, Hardware and Software Faults
- due to radiation, Byzantine Faults
- due to split brain, Leader failure: Failover, Distributed Locks and Leases
- due to weak transaction isolation, Weak Isolation Levels
- integrity as absence of, Timeliness and Integrity
- network packets, Weak forms of lying
- on disks, Durability
- preventing using write-ahead logs, Making B-trees reliable
- recovering from, Batch Processing, Advantages of immutable events
- cosine similarity (semantic search), Vector Embeddings
- Couchbase (database)
- document data model, Relational Model versus Document Model
- durability, Keeping everything in memory
- hash sharding, Fixed number of shards
- join support, Convergence of document and relational databases
- rebalancing, Operations: Automatic or Manual Rebalancing
- vBuckets (sharding), Sharding
- CouchDB (database)
- as sync engine, Pros and cons of sync engines
- B-tree storage, Indexes and snapshot isolation
- conflict resolution, Manual conflict resolution
- coupling (loose and tight), Evolvability: Making Change Easy
- covering indexes, Storing values within the index
- CozoDB (database), Datalog: Recursive Relational Queries
- CPUs
- cache coherence and memory barriers, Linearizability and network delays
- caching and pipelining, Query Execution: Compilation and Vectorization
- computing the wrong result, Hardware and Software Faults
- SIMD instructions, Query Execution: Compilation and Vectorization
- crash-stop and crash-recovery faults, System Model and Reality
- CRDTs (see conflict-free replicated datatypes)
- CREATE INDEX statement (SQL), Multi-Column and Secondary Indexes, Creating an index
- credit rating agencies, Responsibility and Accountability
- crypto-shredding, Event Sourcing and CQRS, Limitations of immutability
- cryptocurrencies, Summary
- cryptography
- defense against attackers, Byzantine Faults
- end-to-end encryption and authentication, The end-to-end argument
- CSV (comma-separated values), Storage and Indexing for OLTP, JSON, XML, and Binary Variants
- Curator (ZooKeeper recipes), Locking and leader election, Allocating work to nodes
- Cypher (query language), The Cypher Query Language
- comparison to SPARQL, The SPARQL query language
D
- Daft (processing framework)
- DataFrames, DataFrames
- shuffling data, Shuffling Data
- Dagster (workflow scheduler), Durable Execution and Workflows, Batch Processing, Scheduling Workflows
- cloud data warehouse integration, Query languages
- dashboard (business intelligence), Characterizing Transaction Processing and Analytics
- Dask (processing framework), DataFrames, Matrices, and Arrays
- data catalog, Cloud Data Warehouses
- data connectors, Data Warehousing
- data contracts, Extract–Transform–Load (ETL)
- change data capture, Change data capture versus event sourcing
- data corruption (see corruption of data)
- data cubes, Materialized Views and Data Cubes
- data engineering, Operational Versus Analytical Systems
- data fabric, Extract–Transform–Load (ETL)
- data formats (see encoding)
- data infrastructure, Trade-offs in Data Systems Architecture
- data integration, Data Integration-Unifying batch and stream processing, Summary
- batch and stream processing, Batch and Stream Processing-Unifying batch and stream processing
- maintaining derived state, Maintaining derived state
- reprocessing data, Reprocessing data for application evolution
- unifying, Unifying batch and stream processing
- by unbundling databases, Unbundling Databases-Multi-shard data processing
- comparison to federated databases, The meta-database of everything
- combining tools by deriving data, Combining Specialized Tools by Deriving Data-Ordering events to capture causality
- derived data versus distributed transactions, Derived data versus distributed transactions
- limits of total ordering, The limits of total ordering
- ordering events to capture causality, Ordering events to capture causality
- reasoning about dataflows, Reasoning about dataflows
- need for, Systems of Record and Derived Data
- using batch processing, Batch Processing, Extract–Transform–Load (ETL)
- batch and stream processing, Batch and Stream Processing-Unifying batch and stream processing
- data lake, From data warehouse to data lake
- data lakehouse, Cloud Data Warehouses, Analytics
- data locality (see locality)
- data mesh, Extract–Transform–Load (ETL)
- data minimization, Data Systems, Law, and Society, Legislation and Self-Regulation
- data models, Data Models and Query Languages-Summary
- DataFrames and arrays, DataFrames, Matrices, and Arrays
- graph-like models, Graph-Like Data Models-GraphQL
- Datalog language, Datalog: Recursive Relational Queries-Datalog: Recursive Relational Queries
- property graphs, Property Graphs
- RDF and triple-stores, Triple-Stores and SPARQL-The SPARQL query language
- relational model versus document model, Relational Model versus Document Model-Convergence of document and relational databases
- supporting multiple, Event Sourcing and CQRS
- data pipelines, From data warehouse to data lake, Systems of Record and Derived Data, Extract–Transform–Load (ETL)
- data products, Beyond the data lake
- data protection regulations (see GDPR)
- data residence laws, Distributed Versus Single-Node Systems, Sharding for Multitenancy
- data science, Operational Versus Analytical Systems, From data warehouse to data lake
- data silo, Data Warehousing
- data systems
- correctness, constraints, and integrity, Aiming for Correctness-Tools for auditable data systems
- data integration, Data Integration-Unifying batch and stream processing
- goals for using, Trade-offs in Data Systems Architecture
- heterogeneous, keeping in sync, Keeping Systems in Sync
- maintainability, Maintainability-Evolvability: Making Change Easy
- possible faults in, Transactions
- reliability, Reliability and Fault Tolerance-Humans and Reliability
- hardware faults, Hardware and Software Faults
- human errors, Humans and Reliability
- importance of, Humans and Reliability
- software faults, Software faults
- scalability, Scalability-Principles for Scalability
- unbundling databases, Unbundling Databases-Multi-shard data processing
- unreliable clocks, Unreliable Clocks-Limiting the impact of garbage collection
- data warehousing, Data Warehousing, Glossary
- cloud-based solutions, Cloud Data Warehouses
- ETL (extract-transform-load), Data Warehousing, Keeping Systems in Sync
- for batch processing, Batch Processing
- keeping data systems in sync, Keeping Systems in Sync
- schema design, Stars and Snowflakes: Schemas for Analytics
- sharding and clustering, Sharding by hash range
- slowly changing dimension (SCD), Time-dependence of joins
- data-intensive applications, Trade-offs in Data Systems Architecture
- database administrator, Operations in the Cloud Era
- database-internal distributed transactions, Distributed Transactions Across Different Systems, Database-internal Distributed Transactions, Atomic commit revisited
- databases
- archival storage, Archival storage
- comparison of message brokers to, Message brokers compared to databases
- dataflow through, Dataflow Through Databases
- end-to-end argument for, The end-to-end argument-Applying end-to-end thinking in data systems
- checking integrity, The end-to-end argument again
- relation to event streams, Databases and Streams-Limitations of immutability
- (see also changelogs)
- API support for change streams, API support for change streams, Separation of application code and state
- change data capture, Change Data Capture-API support for change streams
- event sourcing, Change data capture versus event sourcing
- keeping systems in sync, Keeping Systems in Sync-Keeping Systems in Sync
- philosophy of immutable events, State, Streams, and Immutability-Limitations of immutability
- unbundling, Unbundling Databases-Multi-shard data processing
- composing data storage technologies, Composing Data Storage Technologies-Unbundled versus integrated systems
- designing applications around dataflow, Designing Applications Around Dataflow-Stream processors and services
- observing derived state, Observing Derived State-Multi-shard data processing
- datacenters
- failures of, Hardware and Software Faults
- geographically distributed (see regions (geographic distribution))
- multitenancy and shared resources, Network congestion and queueing
- network architecture, Cloud Computing Versus Supercomputing
- network faults, Network Faults in Practice
- dataflow, Modes of Dataflow-Distributed actor frameworks, Designing Applications Around Dataflow-Stream processors and services
- correctness of dataflow systems, Correctness of dataflow systems
- dataflow engines, Dataflow Engines
- comparison to stream processing, Processing Streams
- DataFrames, DataFrames
- support in batch processing frameworks, Batch Processing
- event-driven, Event-Driven Architectures-Distributed actor frameworks
- reasoning about, Reasoning about dataflows
- through databases, Dataflow Through Databases
- through services, Dataflow Through Services: REST and RPC-Data encoding and evolution for RPC
- workflow engines (see workflow engines)
- DataFrames, DataFrames, Matrices, and Arrays
- implementation, DataFrames
- in batch processing, DataFrames
- in notebooks, Machine Learning
- support in batch processing frameworks, Batch Processing
- DataFusion (query engine), Cloud Data Warehouses
- Datalog (query language), Datalog: Recursive Relational Queries-Datalog: Recursive Relational Queries
- Datastream (change data capture), API support for change streams
- datatypes
- binary strings in XML and JSON, JSON, XML, and Binary Variants
- conflict-free, CRDTs and Operational Transformation
- in Avro encodings, Avro
- in Protocol Buffers, Field tags and schema evolution
- numbers in XML and JSON, JSON, XML, and Binary Variants
- Datensparsamkeit, Data Systems, Law, and Society
- Datomic (database)
- B-tree storage, Indexes and snapshot isolation
- data model, Graph-Like Data Models, Triple-Stores and SPARQL
- Datalog query language, Datalog: Recursive Relational Queries
- excision (deleting data), Limitations of immutability
- languages for transactions, Pros and cons of stored procedures
- serial execution of transactions, Actual Serial Execution
- Daylight Saving Time (DST), Time-of-day clocks
- Db2 (database)
- change data capture, Implementing change data capture
- DBA (database administrator), Operations in the Cloud Era
- deadlocks, Explicit locking
- detection, in distributed transactions, Problems with XA transactions
- in two-phase locking (2PL), Implementation of two-phase locking
- Debezium (change data capture), Implementing change data capture
- Cassandra, API support for change streams
- for data integration, Unbundled versus integrated systems
- declarative languages, Data Models and Query Languages, Glossary
- and sync engines, Pros and cons of sync engines
- Datalog, Datalog: Recursive Relational Queries
- in document databases, Convergence of document and relational databases
- recursive SQL queries, Graph Queries in SQL
- SPARQL, The SPARQL query language
- DeepSeek
- 3FS (see 3FS)
- delays
- bounded network delays, Synchronous Versus Asynchronous Networks
- bounded process pauses, Response time guarantees
- unbounded network delays, Timeouts and Unbounded Delays
- unbounded process pauses, Process Pauses
- deleting data, Limitations of immutability
- in LSM storage, Disk space usage
- legal basis, Data Systems, Law, and Society
- Delta Lake (table format), Constructing and merging SSTables, Cloud Data Warehouses
- sharding and clustering, Sharding by hash range
- demilitarized zone (networking), Serving Derived Data
- denormalization (data representation), Normalization, Denormalization, and Joins-Many-to-One and Many-to-Many Relationships, Glossary
- in derived data systems, Systems of Record and Derived Data
- in event sourcing/CQRS, Event Sourcing and CQRS
- in social network case study, Denormalization in the social networking case study
- materialized views, Materialized Views and Data Cubes
- updating derived data, Single-Object and Multi-Object Operations, The need for multi-object transactions, Combining Specialized Tools by Deriving Data
- versus normalization, Deriving several views from the same event log
- derived data, Systems of Record and Derived Data, Stream Processing, Glossary
- batch processing, Batch Processing
- event sourcing and CQRS, Event Sourcing and CQRS
- from change data capture, Implementing change data capture
- maintaining derived state through logs, Databases and Streams-API support for change streams, State, Streams, and Immutability-Concurrency control
- observing, by subscribing to streams, End-to-end event streams
- outputs of batch and stream processing, Batch and Stream Processing
- through application code, Application code as a derivation function
- versus distributed transactions, Derived data versus distributed transactions
- design patterns, Simplicity: Managing Complexity
- deterministic operations, Pros and cons of stored procedures, Faults and Partial Failures, Glossary
- and idempotence, Idempotence, Reasoning about dataflows
- computing derived data, Maintaining derived state, Correctness of dataflow systems, Designing for auditability
- in event sourcing, Event Sourcing and CQRS
- in state machine replication, Using shared logs, Databases and Streams
- in statement-based replication, Statement-based replication
- in testing, Deterministic simulation testing
- joins, Time-dependence of joins
- making code deterministic, Deterministic simulation testing
- overview, Deterministic simulation testing
- deterministic simulation testing (DST), Deterministic simulation testing
- DevOps, Operations in the Cloud Era
- dimension tables, Stars and Snowflakes: Schemas for Analytics
- dimensional modeling (see star schemas)
- directed acyclic graphs (DAG)
- workflows, Scheduling Workflows
- (see also workflow engines)
- dirty reads (transaction isolation), No dirty reads
- dirty writes (transaction isolation), No dirty writes
- disaggregation
- of storage and compute, Separation of storage and compute
- Discord (group chat)
- GraphQL example, GraphQL
- discrimination, Bias and Discrimination
- disks (see hard disks)
- distributed actor frameworks, Distributed actor frameworks
- distributed filesystems, Distributed Filesystems-Distributed Filesystems
- comparison to object storage, Object Stores
- use by Flink, Rebuilding state after a failure
- distributed ledgers, Summary
- distributed systems, The Trouble with Distributed Systems-Summary, Glossary
- Byzantine faults, Byzantine Faults-Weak forms of lying
- detecting network faults, Detecting Faults
- faults and partial failures, Faults and Partial Failures
- formalization of consensus, Single-value consensus
- impossibility results, The CAP theorem, Consensus
- issues with failover, Leader failure: Failover
- multi-region (see regions (geographic distribution))
- network problems, Unreliable Networks-Can we not simply make network delays predictable?
- problems with, Problems with Distributed Systems
- quorums, relying on, The Majority Rules
- reasons for using, Distributed Versus Single-Node Systems, Replication
- synchronized clocks, relying on, Relying on Synchronized Clocks-Synchronized clocks for global snapshots
- system models, System Model and Reality-Deterministic simulation testing
- use of clocks and time, Unreliable Clocks
- distributed transactions (see transactions)
- Django (web framework), Handling errors and aborts
- DMZ (demilitarized zone), Serving Derived Data
- DNS (Domain Name System), Request Routing, Service discovery
- for load balancing, Load balancers, service discovery, and service meshes
- Docker (container manager), Separation of application code and state
- document data model, Relational Model versus Document Model-Convergence of document and relational databases
- comparison to relational model, When to Use Which Model-Convergence of document and relational databases
- multi-object transactions, need for, The need for multi-object transactions
- sharded secondary indexes, Sharding and Secondary Indexes
- versus relational model
- convergence of models, Convergence of document and relational databases
- data locality, Data locality for reads and writes
- document-partitioned indexes (see local secondary indexes)
- domain-driven design (DDD), Simplicity: Managing Complexity, Event Sourcing and CQRS
- dotted version vectors, Version vectors
- double-entry bookkeeping, Summary
- DRBD (Distributed Replicated Block Device), Single-Leader Replication
- drift (clocks), Clock Synchronization and Accuracy
- Druid (database), Characterizing Transaction Processing and Analytics, Column-Oriented Storage, Deriving several views from the same event log
- handling writes, Writing to Column-Oriented Storage
- pre-aggregation, Analytics
- serving derived data, Serving Derived Data
- Dryad (dataflow engine), Dataflow Engines
- dual writes, problems with, Keeping Systems in Sync
- DuckDB (database), Problems with Distributed Systems, Compaction strategies
- column-oriented storage, Column-Oriented Storage
- use for ETL, Extract–Transform–Load (ETL)
- duplicates, suppression of, Duplicate suppression
- (see also idempotence)
- using a unique ID, Uniquely identifying requests, Multi-shard request processing
- durability (transactions), Making B-trees reliable, Durability, Glossary
- durable execution, Durable Execution and Workflows
- reliance on determinism, Deterministic simulation testing
- Restate (see Restate (workflow engine))
- Temporal (see Temporal (workflow engine))
- durable functions (see workflow engines)
- duration (time), Unreliable Clocks
- measurement with monotonic clocks, Monotonic clocks
- dynamically typed languages
- analogy to schema-on-read, Schema flexibility in the document model
- Dynamo (database), Leaderless Replication
- Dynamo-style databases (see leaderless replication)
- DynamoDB (database)
- auto-scaling, Operations: Automatic or Manual Rebalancing
- hash-range sharding, Sharding by hash range
- leader-based replication, Single-Leader Replication
- sharded secondary indexes, Global Secondary Indexes
E
- EBS (virtual block device), Separation of storage and compute
- compared to object storage, Setting Up New Followers
- ECC (see error-correcting codes)
- EDB Postgres Distributed (database), Geographically Distributed Operation
- edges (in graphs), Graph-Like Data Models
- property graph model, Property Graphs
- edit distance (full-text search), Full-Text Search
- effectively-once semantics, Fault Tolerance, Exactly-once execution of an operation
- (see also exactly-once semantics)
- preservation of integrity, Correctness of dataflow systems
- Elastic Compute Cloud (EC2)
- spot instances, Handling Faults
- elasticity, Distributed Versus Single-Node Systems
- cloud data warehouses, Cloud Data Warehouses, Query languages
- Elasticsearch (search server)
- local secondary indexes, Local Secondary Indexes
- percolator (stream search), Search on streams
- serving derived data, Serving Derived Data
- shard rebalancing, Fixed number of shards
- use of Lucene, Full-Text Search
- Elm (programming language), End-to-end event streams
- ELT (extract-load-transform), Data Warehousing
- relation to batch processing, Extract–Transform–Load (ETL)
- embarrassingly parallel (algorithms)
- ETL (see ETL (extract-transform-load))
- MapReduce, MapReduce
- (see also MapReduce)
- embedded storage engines, Compaction strategies
- embedding (vector), Vector Embeddings
- encodings (data formats), Encoding and Evolution-The Merits of Schemas
- Avro, Avro-Dynamically generated schemas
- binary variants of JSON and XML, Binary encoding
- compatibility, Encoding and Evolution
- calling services, Data encoding and evolution for RPC
- using databases, Dataflow Through Databases-Archival storage
- defined, Formats for Encoding Data
- JSON, XML, and CSV, JSON, XML, and Binary Variants
- language-specific formats, Language-Specific Formats
- merits of schemas, The Merits of Schemas
- Protocol Buffers, Protocol Buffers-Field tags and schema evolution
- representations of data, Formats for Encoding Data
- end-to-end argument, The end-to-end argument-Applying end-to-end thinking in data systems
- checking integrity, The end-to-end argument again
- publish/subscribe streams, End-to-end event streams
- enrichment (stream), Stream-table join (stream enrichment)
- Enterprise JavaBeans (EJB), The problems with remote procedure calls (RPCs)
- enterprise software, Trade-offs in Data Systems Architecture
- entities (see vertices)
- ephemeral storage, Separation of storage and compute
- epoch (consensus algorithms), From single-leader replication to consensus
- epoch (Unix timestamps), Time-of-day clocks
- erasure coding (error correction), Distributed Filesystems
- error handling
- for network faults, Network Faults in Practice
- in transactions, Handling errors and aborts
- error-correcting codes, Hardware and Software Faults, Distributed Filesystems
- Esper (CEP engine), Complex event processing
- essential complexity, Simplicity: Managing Complexity
- etcd (coordination service), Coordination Services-Service discovery
- generating fencing tokens, Fencing off zombies and delayed requests, Coordination Services
- linearizable operations, Implementing Linearizable Systems, Subtleties of consensus
- locks and leader election, Locking and leader election
- use for service discovery, Load balancers, service discovery, and service meshes, Service discovery
- use for shard assignment, Request Routing
- use of Raft algorithm, Single-Leader Replication
- Ethereum (blockchain), Tools for auditable data systems
- Ethernet (networks), Cloud Computing Versus Supercomputing, Unreliable Networks, Can we not simply make network delays predictable?
- packet checksums, Weak forms of lying, The end-to-end argument
- ethics, Doing the Right Thing-Legislation and Self-Regulation
- code of ethics and professional practice, Doing the Right Thing
- legislation and self-regulation, Legislation and Self-Regulation
- predictive analytics, Predictive Analytics-Feedback Loops
- amplifying bias, Bias and Discrimination
- feedback loops, Feedback Loops
- privacy and tracking, Privacy and Tracking-Legislation and Self-Regulation
- consent and freedom of choice, Consent and Freedom of Choice
- data as assets and power, Data as Assets and Power
- meaning of privacy, Privacy and Use of Data
- surveillance, Surveillance
- respect, dignity, and agency, Legislation and Self-Regulation
- unintended consequences, Doing the Right Thing, Feedback Loops
- ETL (extract-transform-load), Data Warehousing, Keeping Systems in Sync, Glossary
- relation to batch processing, Extract–Transform–Load (ETL)-Extract–Transform–Load (ETL)
- using batch processing, Batch Processing
- Euclidean distance (semantic search), Vector Embeddings
- European Union
- AI Act (see AI Act)
- GDPR (see GDPR)
- event sourcing, Event Sourcing and CQRS-Event Sourcing and CQRS
- and change data capture, Change data capture versus event sourcing
- comparison to change data capture, Change data capture versus event sourcing
- immutability and auditability, State, Streams, and Immutability, Designing for auditability
- large, reliable data systems, Uniquely identifying requests, Correctness of dataflow systems
- reliance on determinism, Deterministic simulation testing
- event streams (see streams)
- event-driven architecture, Event-Driven Architectures-Distributed actor frameworks
- distributed actor frameworks, Distributed actor frameworks
- events, Transmitting Event Streams
- deciding on total order of, The limits of total ordering
- deriving views from event log, Deriving several views from the same event log
- event time versus processing time, Event time versus processing time, Microbatching and checkpointing, Unifying batch and stream processing
- immutable, advantages of, Advantages of immutable events, Designing for auditability
- ordering to capture causality, Ordering events to capture causality
- reads as, Reads are events too
- stragglers, Handling straggler events
- timestamp of, in stream processing, Whose clock are you using, anyway?
- EventSource (browser API), Pushing state changes to clients
- EventStoreDB (database), Event Sourcing and CQRS
- eventual consistency, Replication, Problems with Replication Lag, Safety and liveness
- (see also conflicts)
- and perpetual inconsistency, Timeliness and Integrity
- strong eventual consistency, Automatic conflict resolution
- evidence
- data used as, Humans and Reliability
- evolvability, Evolvability: Making Change Easy, Encoding and Evolution
- calling services, Data encoding and evolution for RPC
- event sourcing, Event Sourcing and CQRS
- graph-structured data, Property Graphs
- of databases, Schema flexibility in the document model, Dataflow Through Databases-Archival storage, Deriving several views from the same event log, Reprocessing data for application evolution
- reprocessing data, Reprocessing data for application evolution, Unifying batch and stream processing
- schema evolution in Avro, The writer’s schema and the reader’s schema
- schema evolution in Protocol Buffers, Field tags and schema evolution
- schema-on-read, Schema flexibility in the document model, Encoding and Evolution, The Merits of Schemas
- exactly-once semantics, Exactly-once message processing, Exactly-once message processing revisited, Fault Tolerance, Exactly-once execution of an operation
- parity with batch processors, Unifying batch and stream processing
- preservation of integrity, Correctness of dataflow systems
- using durable execution, Durable execution
- exclusive mode (locks), Implementation of two-phase locking
- exponential backoff, Describing Performance, Handling errors and aborts
- ext4 (file system), Distributed Filesystems
- eXtended Architecture transactions (see XA transactions)
- extract-transform-load (see ETL)
F
- Facebook
- Faiss (vector index), Vector Embeddings
- React (user interface library), End-to-end event streams
- social graphs, Graph-Like Data Models
- facts
- fact table (star schema), Stars and Snowflakes: Schemas for Analytics
- in Datalog, Datalog: Recursive Relational Queries
- in event sourcing, Event Sourcing and CQRS
- fail-slow faults, System Model and Reality
- fail-stop model, System Model and Reality
- failover, Leader failure: Failover, Glossary
- (see also leader-based replication)
- in leaderless replication, absence of, Writing to the Database When a Node Is Down
- leader election, Distributed Locks and Leases, Consensus, From single-leader replication to consensus
- potential problems, Leader failure: Failover
- failures
- amplification by distributed transactions, Maintaining derived state
- failure detection, Detecting Faults
- automatic rebalancing causing cascading failures, Operations: Automatic or Manual Rebalancing
- timeouts and unbounded delays, Timeouts and Unbounded Delays, Network congestion and queueing
- using a coordination service, Coordination Services
- faults versus, Reliability and Fault Tolerance
- partial failures, Faults and Partial Failures, Summary
- Faiss (vector index), Vector Embeddings
- false positive (Bloom filters), Bloom filters
- fan-out (messaging systems), Materializing and Updating Timelines, Multiple consumers
- fault injection, Fault Tolerance, Network Faults in Practice, Fault injection
- fault isolation, Sharding for Multitenancy
- fault tolerance, Reliability and Fault Tolerance-Humans and Reliability, Glossary
- formalization in consensus, Single-value consensus
- human fault tolerance, Batch Processing
- in batch processing, Handling Faults
- in log-based systems, Applying end-to-end thinking in data systems, Timeliness and Integrity-Correctness of dataflow systems
- in stream processing, Fault Tolerance-Rebuilding state after a failure
- atomic commit, Atomic commit revisited
- idempotence, Idempotence
- maintaining derived state, Maintaining derived state
- microbatching and checkpointing, Microbatching and checkpointing
- rebuilding state after a failure, Rebuilding state after a failure
- of distributed transactions, XA transactions-Exactly-once message processing revisited
- of leader-based and leaderless replication, Single-Leader Versus Leaderless Replication Performance
- transaction atomicity, Atomicity, Distributed Transactions-Exactly-once message processing
- faults
- Byzantine faults, Byzantine Faults-Weak forms of lying
- failures versus, Reliability and Fault Tolerance
- handled by transactions, Transactions
- handling in supercomputers and cloud computing, Cloud Computing Versus Supercomputing
- hardware, Hardware and Software Faults
- in distributed systems, Faults and Partial Failures
- introducing deliberately (see fault injection)
- network faults, Network Faults in Practice-Detecting Faults
- asymmetric faults, The Majority Rules
- detecting, Detecting Faults
- tolerance of, in multi-leader replication, Geographically Distributed Operation
- software faults, Software faults
- tolerating (see fault tolerance)
- feature engineering (machine learning), From data warehouse to data lake
- federated databases, The meta-database of everything
- Feldera (database)
- incremental view maintenance, Maintaining materialized views
- fence (CPU instruction), Linearizability and network delays
- fencing (preventing split brain), Leader failure: Failover, Fencing off zombies and delayed requests-Fencing with multiple replicas
- generating fencing tokens, Using shared logs, Coordination Services
- properties of fencing tokens, Defining the correctness of an algorithm
- stream processors writing to databases, Idempotence, Exactly-once execution of an operation
- fetch-and-add
- relation to consensus, Fetch-and-add as consensus
- Fibre Channel (networks), Distributed Filesystems
- field tags (Protocol Buffers), Protocol Buffers-Field tags and schema evolution
- Figma (graphics software), Real-time collaboration, offline-first, and local-first apps
- filesystem in userspace (FUSE), Setting Up New Followers, Distributed Filesystems
- on object storage, Object Stores
- financial data
- accounting ledgers, Summary
- immutability, Advantages of immutable events
- time series data, DataFrames, Matrices, and Arrays
- Fivetran, Data Warehousing
- FizzBee (specification language), Model checking and specification languages
- flat index (vector index), Vector Embeddings
- FlatBuffers (data format), Formats for Encoding Data
- Flink (processing framework), Batch Processing, Dataflow Engines
- cost efficiency, Query languages
- DataFrames, DataFrames, Matrices, and Arrays, DataFrames
- fault tolerance, Handling Faults, Microbatching and checkpointing, Rebuilding state after a failure
- FlinkML, Machine Learning
- for data warehouses, Cloud Data Warehouses
- high availability using ZooKeeper, Coordination Services
- integration of batch and stream processing, Unifying batch and stream processing
- query optimizer, Query languages
- shuffling data, Shuffling Data
- stream processing, Stream analytics
- streaming SQL support, Complex event processing
- flow control, The Limitations of TCP, Messaging Systems, Glossary
- FLP result (on consensus), Consensus
- Flyte (workflow scheduler), Machine Learning
- followers, Single-Leader Replication, Glossary
- (see also leader-based replication)
- formal methods, Formal Methods and Randomized Testing-Deterministic simulation testing
- forward compatibility, Encoding and Evolution
- forward decay (algorithm), Use of Response Time Metrics
- Fossil (version control system), Concurrency control
- shunning (deleting data), Limitations of immutability
- FoundationDB (database)
- consistency model, What Makes a System Linearizable?
- deterministic simulation testing, Deterministic simulation testing
- key-range sharding, Sharding by Key Range
- process-per-core model, Pros and Cons of Sharding
- serializable transactions, Serializable Snapshot Isolation (SSI), Performance of serializable snapshot isolation
- transactions, What Exactly Is a Transaction?, Database-internal Distributed Transactions
- fractional indexing, When to Use Which Model
- fragmentation (of B-trees), Disk space usage
- frame (computer graphics), Pros and cons of sync engines
- frontend (web development), Trade-offs in Data Systems Architecture
- FrostDB (database)
- deterministic simulation testing (DST), Deterministic simulation testing
- fsync (system call), Making B-trees reliable, Durability
- full-text search, Full-Text Search, Glossary
- and fuzzy indexes, Full-Text Search
- Lucene storage engine, Full-Text Search
- sharded indexes, Sharding and Secondary Indexes
- Function as a Service (FaaS), Microservices and Serverless
- functional programming
- inspiration for MapReduce, MapReduce
- functional requirements, Defining Nonfunctional Requirements
- FUSE (see filesystem in userspace (FUSE))
- fuzzing, Formal Methods and Randomized Testing
- fuzzy search (see similarity search)
G
- Gallina (specification language), Model checking and specification languages
- game development, Pros and cons of sync engines
- garbage collection
- immutability and, Limitations of immutability
- process pauses for, Latency and Response Time, Process Pauses-Limiting the impact of garbage collection, The Majority Rules
- (see also process pauses)
- gas stations algorithmic pricing, Feedback Loops
- GDPR (regulation), Data Systems, Law, and Society, Limitations of immutability
- consent, Consent and Freedom of Choice
- data minimization, Legislation and Self-Regulation
- legitimate interest, Consent and Freedom of Choice
- right of access, Sharding for Multitenancy
- right to erasure, Data Systems, Law, and Society, Disk space usage, Sharding for Multitenancy
- GenBank (genome database), Summary
- General Data Protection Regulation (see GDPR (regulation))
- genome analysis, Summary
- geographic distribution (see regions (geographic distribution))
- geospatial indexes, Multidimensional and Full-Text Indexes
- Git (version control system), Concurrency control
- local-first software, Real-time collaboration, offline-first, and local-first apps
- merge conflicts, Manual conflict resolution
- GitHub, postmortems, Leader failure: Failover, Leader failure: Failover, Mapping system models to the real world
- global secondary indexes, Global Secondary Indexes, Summary
- globally unique identifiers (see UUIDs)
- GlusterFS (distributed filesystem), Batch Processing, Distributed Filesystems, Object Stores
- GNU Coreutils (Linux), Sorting Versus In-memory Aggregation
- Go (programming language)
- garbage collection, Limiting the impact of garbage collection
- GoldenGate (change data capture), Implementing change data capture
- (see also Oracle)
- Google
- BigQuery (see BigQuery (database))
- Bigtable (see Bigtable (database))
- Chubby (lock service), Coordination Services
- Cloud Storage (object storage), Setting Up New Followers, Object Stores
- request preconditions, Fencing off zombies and delayed requests
- Compute Engine
- preemptible instances, Handling Faults
- Dataflow (stream processing)
- data warehouse integration, Cloud Data Warehouses
- shuffling data, Shuffling Data
- Dataflow (stream processor), Stream analytics, Atomic commit revisited, Unifying batch and stream processing
- (see also Beam)
- Datastream (change data capture), API support for change streams
- Docs (collaborative editor), Real-time collaboration, offline-first, and local-first apps, CRDTs and Operational Transformation
- operational transformation, CRDTs and Operational Transformation
- Dremel (query engine), Column-Oriented Storage
- Firestore (database), Pros and cons of sync engines
- MapReduce (batch processing), Batch Processing
- (see also MapReduce)
- Percolator (transaction system), Implementing a linearizable ID generator
- persistent disks (cloud service), Separation of storage and compute
- Pub/Sub (messaging), Message brokers, Message brokers compared to databases, Using logs for message storage
- response time study, Average, Median, and Percentiles
- Sheets (collaborative spreadsheet), Real-time collaboration, offline-first, and local-first apps, CRDTs and Operational Transformation
- Spanner (see Spanner (database))
- TrueTime (clock API), Clock readings with a confidence interval
- gossip protocol, Request Routing
- governance, Beyond the data lake
- government use of data, Data as Assets and Power
- GPS (Global Positioning System)
- use for clock synchronization, Unreliable Clocks, Clock Synchronization and Accuracy, Clock readings with a confidence interval, Synchronized clocks for global snapshots
- GPT (language model), Vector Embeddings
- GPU (graphics processing unit), Layering of cloud services, Distributed Versus Single-Node Systems
- gradual rollout (see rolling upgrades)
- GraphQL (query language), GraphQL
- validation, Pros and cons of stored procedures
- graphs, Glossary
- as data models, Graph-Like Data Models-GraphQL
- property graphs, Property Graphs
- RDF and triple-stores, Triple-Stores and SPARQL-The SPARQL query language
- DAGs (see directed acyclic graphs)
- processing and analysis, Machine Learning
- query languages
- Cypher, The Cypher Query Language
- Datalog, Datalog: Recursive Relational Queries-Datalog: Recursive Relational Queries
- GraphQL, GraphQL
- Gremlin, Graph-Like Data Models
- recursive SQL queries, Graph Queries in SQL
- SPARQL, The SPARQL query language-The SPARQL query language
- traversal, Property Graphs
- gray failures, System Model and Reality
- in leaderless replication, Single-Leader Versus Leaderless Replication Performance
- Gremlin (graph query language), Graph-Like Data Models
- grep (Unix tool), Simple Log Analysis
- gRPC (service calls), Microservices and Serverless, Web services
- forward and backward compatibility, Data encoding and evolution for RPC
- GUIDs (see UUIDs)
H
- Hadoop (data infrastructure)
- comparison to distributed databases, Batch Processing
- MapReduce (see MapReduce)
- NodeManager, Distributed Job Orchestration
- YARN (see YARN (job scheduler))
- HANA (see SAP HANA (database))
- happens-before relation, The “happens-before” relation and concurrency
- hard disks
- access patterns, Sequential versus random writes
- detecting corruption, The end-to-end argument, Don’t just blindly trust what they promise
- faults in, Hardware and Software Faults, Durability
- sequential vs. random writes, Sequential versus random writes
- sequential write throughput, Disk space usage
- hardware faults, Hardware and Software Faults
- hash function
- in Bloom filters, Bloom filters
- hash join
- in stream processing, Stream-table join (stream enrichment)
- hash sharding, Sharding by Hash of Key-Consistent hashing, Summary
- consistent hashing, Consistent hashing
- problems with hash mod N, Hash modulo number of nodes
- range queries, Sharding by hash range
- suitable hash functions, Sharding by Hash of Key
- with fixed number of shards, Fixed number of shards
- hash tables, Log-Structured Storage
- Hazelcast (in-memory data grid)
- FencedLock, Fencing off zombies and delayed requests
- Flake ID Generator, ID Generators and Logical Clocks
- HBase (database)
- bug due to lack of fencing, Distributed Locks and Leases
- key-range sharding, Sharding by Key Range
- log-structured storage, Constructing and merging SSTables
- regions (sharding), Sharding
- request routing, Request Routing
- size-tiered compaction, Compaction strategies
- wide-column data model, Data locality for reads and writes, Column Compression
- HDFS (Hadoop Distributed File System), Batch Processing, Distributed Filesystems
- (see also distributed filesystems)
- checking data integrity, Don’t just blindly trust what they promise
- DataNode, Distributed Filesystems
- NameNode, Distributed Filesystems
- use in MapReduce, MapReduce
- workflow example, Scheduling Workflows
- HdrHistogram (numerical library), Use of Response Time Metrics
- head (Unix tool), Simple Log Analysis, Distributed Job Orchestration
- head vertex (property graphs), Property Graphs
- head-of-line blocking, Latency and Response Time
- heap files (databases), Storing values within the index
- in multiversion concurrency control, Multi-version concurrency control (MVCC)
- heat management, Skewed Workloads and Relieving Hot Spots
- hedged requests, Single-Leader Versus Leaderless Replication Performance
- heterogeneous distributed transactions, Distributed Transactions Across Different Systems, Problems with XA transactions
- heuristic decisions (in 2PC), Recovering from coordinator failure
- Hex (notebook), Machine Learning
- hexagons
- for geospatial indexing, Multidimensional and Full-Text Indexes
- Hibernate (object-relational mapper), Object-relational mapping (ORM)
- hierarchical model, Relational Model versus Document Model
- hierarchical navigable small world (vector index), Vector Embeddings
- hierarchical queries (see recursive common table expressions)
- high availability (see fault tolerance)
- high-frequency trading, Clock Synchronization and Accuracy
- high-performance computing (HPC), Cloud Computing Versus Supercomputing
- hinted handoff (leaderless replication), Catching up on missed writes
- histograms, Use of Response Time Metrics
- Hive (data warehouse), Cloud Data Warehouses
- query optimizer, Query languages
- HNSW (vector index), Vector Embeddings
- hopping windows (stream processing), Types of windows
- (see also windows)
- Hoptimator (query engine), The meta-database of everything
- Horizon scandal, Humans and Reliability
- lack of transactions, Transactions
- horizontal scaling (see scaling out)
- by sharding, Pros and Cons of Sharding
- HornetQ (messaging), Message brokers, Message brokers compared to databases
- distributed transaction support, XA transactions
- hot keys, Sharding of Key-Value Data
- hot spots, Sharding of Key-Value Data
- due to celebrities, Skewed Workloads and Relieving Hot Spots
- for time-series data, Sharding by Key Range
- relieving, Skewed Workloads and Relieving Hot Spots
- hot standbys (see leader-based replication)
- HTAP (see hybrid transactional/analytic processing)
- HTTP, use in APIs (see services)
- human errors, Humans and Reliability, Network Faults in Practice, Batch Processing
- hybrid logical clocks, Hybrid logical clocks
- hybrid transactional/analytic processing, Data Warehousing, Data Storage for Analytics
- hydrating IDs (join), Denormalization in the social networking case study
- hypergraph, Property Graphs
- HyperLogLog (algorithm), Stream analytics
I
- I/O operations, waiting for, Process Pauses
- IaaS (see infrastructure as a service (IaaS))
- IBM
- Db2 (database)
- distributed transaction support, XA transactions
- serializable isolation, Snapshot isolation, repeatable read, and naming confusion, Implementation of two-phase locking
- MQ (messaging), Message brokers compared to databases
- distributed transaction support, XA transactions
- System R (database), What Exactly Is a Transaction?
- WebSphere (messaging), Message brokers
- Iceberg (table format), Cloud Data Warehouses
- databases on object storage, Setting Up New Followers
- log-based message broker storage, Disk space usage
- idempotence, The problems with remote procedure calls (RPCs), Idempotence, Glossary
- by giving operations unique IDs, Multi-shard request processing
- by giving requests unique IDs, Uniquely identifying requests
- for exactly-once semantics, Exactly-once message processing revisited
- idempotent operations, Exactly-once execution of an operation
- in workflow engines, Durable execution
- immutability
- advantages of, Advantages of immutable events, Designing for auditability
- and right to erasure, Data Systems, Law, and Society, Disk space usage
- crypto-shredding for deletion, Event Sourcing and CQRS, Limitations of immutability
- deriving state from event log, State, Streams, and Immutability-Limitations of immutability
- for crash recovery, Constructing and merging SSTables
- in B-trees, B-tree variants, Indexes and snapshot isolation
- in event sourcing, Event Sourcing and CQRS, Change data capture versus event sourcing
- limitations of, Concurrency control
- impedance mismatch, The Object-Relational Mismatch
- in doubt (transaction status), Coordinator failure
- holding locks, Holding locks while in doubt
- orphaned transactions, Recovering from coordinator failure
- in-memory databases, Keeping everything in memory
- durability, Durability
- serial transaction execution, Actual Serial Execution
- incidents
- accounting software bugs leading to wrongful convictions, Humans and Reliability
- blameless postmortems, Humans and Reliability
- crashes due to leap seconds, Clock Synchronization and Accuracy
- data corruption and financial losses due to concurrency bugs, Weak Isolation Levels
- data corruption on hard disks, Durability
- data loss due to last-write-wins, Timestamps for ordering events
- data on disks unreadable, Mapping system models to the real world
- disclosure of sensitive data due to primary key reuse, Leader failure: Failover
- errors in transaction serializability, Maintaining integrity in the face of software bugs
- gigabit network interface with 1 Kb/s throughput, System Model and Reality
- leap second crash, Software faults
- network faults, Network Faults in Practice
- network interface dropping only inbound packets, Network Faults in Practice
- network partitions and whole-datacenter failures, Faults and Partial Failures
- poor handling of network faults, Network Faults in Practice
- sending message to ex-partner, Ordering events to capture causality
- sharks biting undersea cables, Network Faults in Practice
- split brain due to 1-minute packet delay, Leader failure: Failover, Network Faults in Practice
- SSD failure after 32,768 hours, Software faults
- thread contention bringing down a service, Process Pauses
- vibrations in server rack, Latency and Response Time
- violation of uniqueness constraint, Maintaining integrity in the face of software bugs
- incremental view maintenance (IVM), Maintaining materialized views
- for data integration, Unbundled versus integrated systems
- indexes, Storage and Indexing for OLTP, Glossary
- and snapshot isolation, Indexes and snapshot isolation
- as derived data, Systems of Record and Derived Data, Composing Data Storage Technologies-Unbundled versus integrated systems
- B-trees, B-Trees-B-tree variants
- clustered, Storing values within the index
- comparison of B-trees and LSM-trees, Comparing B-Trees and LSM-Trees-Disk space usage
- covering (with included columns), Storing values within the index
- creating, Creating an index
- full-text search, Full-Text Search
- geospatial, Multidimensional and Full-Text Indexes
- index-range locking, Index-range locks
- multi-column (concatenated), Multidimensional and Full-Text Indexes
- secondary, Multi-Column and Secondary Indexes
- (see also secondary indexes)
- problems with dual writes, Keeping Systems in Sync, Reasoning about dataflows
- sharding and secondary indexes, Sharding and Secondary Indexes-Global Secondary Indexes, Summary
- sparse, The SSTable file format
- SSTables and LSM-trees, The SSTable file format-Compaction strategies
- updating when data changes, Keeping Systems in Sync, Maintaining materialized views
- Industrial Revolution, Remembering the Industrial Revolution
- InfiniBand (networks), Can we not simply make network delays predictable?
- InfluxDB IOx (storage engine), Column-Oriented Storage
- information retrieval (see full-text search)
- infrastructure as a service (IaaS), Cloud Versus Self-Hosting, Layering of cloud services
- InnoDB (storage engine)
- clustered index on primary key, Storing values within the index
- not preventing lost updates, Automatically detecting lost updates
- preventing write skew, Characterizing write skew, Implementation of two-phase locking
- serializable isolation, Implementation of two-phase locking
- snapshot isolation support, Snapshot Isolation and Repeatable Read
- instance (cloud computing), Layering of cloud services
- integrating different data systems (see data integration)
- integrity, Timeliness and Integrity
- coordination-avoiding data systems, Coordination-avoiding data systems
- correctness of dataflow systems, Correctness of dataflow systems
- in consensus formalization, Single-value consensus, Atomic commitment as consensus
- integrity checks, Don’t just blindly trust what they promise
- (see also auditing)
- end-to-end, The end-to-end argument, The end-to-end argument again
- use of snapshot isolation, Snapshot Isolation and Repeatable Read
- maintaining despite software bugs, Maintaining integrity in the face of software bugs
- Interface Definition Language (IDL), Protocol Buffers, Avro, Web services
- invariants, Consistency
- (see also constraints)
- inverted file index (vector index), Vector Embeddings
- inverted index, Full-Text Search
- irreversibility, minimizing, Evolvability: Making Change Easy, Event Sourcing and CQRS, Batch Processing
- ISDN (Integrated Services Digital Network), Synchronous Versus Asynchronous Networks
- isolation (in operating systems)
- cgroups (see cgroups)
- isolation (in transactions), Isolation, Single-Object and Multi-Object Operations, Glossary
- correctness and, Aiming for Correctness
- for single-object writes, Single-object writes
- serializability, Serializability-Performance of serializable snapshot isolation
- actual serial execution, Actual Serial Execution-Summary of serial execution
- serializable snapshot isolation (SSI), Serializable Snapshot Isolation (SSI)-Performance of serializable snapshot isolation
- two-phase locking (2PL), Two-Phase Locking (2PL)-Index-range locks
- violating, Single-Object and Multi-Object Operations
- weak isolation levels, Weak Isolation Levels-Materializing conflicts
- preventing lost updates, Preventing Lost Updates-Conflict resolution and replication
- read committed, Read Committed-Implementing read committed
- snapshot isolation, Snapshot Isolation and Repeatable Read-Snapshot isolation, repeatable read, and naming confusion
- IVF (vector index), Vector Embeddings
J
- Java Database Connectivity (JDBC)
- distributed transaction support, XA transactions
- network drivers, The Merits of Schemas
- Java Enterprise Edition (EE), The problems with remote procedure calls (RPCs), Two-Phase Commit (2PC), XA transactions
- Java Message Service (JMS), Message brokers compared to databases
- (see also messaging systems)
- comparison to log-based messaging, Logs compared to traditional messaging, Replaying old messages
- distributed transaction support, XA transactions
- message ordering, Acknowledgments and redelivery
- Java Transaction API (JTA), Two-Phase Commit (2PC), XA transactions
- Java Virtual Machine (JVM)
- garbage collection, Process Pauses, Limiting the impact of garbage collection
- JIT compilation, Query Execution: Compilation and Vectorization
- process reuse in batch processors, Dataflow Engines
- Jena (RDF framework), The RDF data model
- SPARQL query language, The SPARQL query language
- Jepsen (fault tolerance testing), Fault injection, Aiming for Correctness
- jitter (network delay), Average, Median, and Percentiles, Network congestion and queueing
- JMESPath (query language), Query languages
- join table, Many-to-One and Many-to-Many Relationships, Property Graphs
- joins, Glossary
- expressing as relational operators, Query languages
- handling GraphQL query, GraphQL
- in application code, Normalization, Denormalization, and Joins, Denormalization in the social networking case study
- in DataFrames, DataFrames, Matrices, and Arrays
- in relational and document databases, Normalization, Denormalization, and Joins
- secondary indexes and, Multi-Column and Secondary Indexes
- sort-merge joins, JOIN and GROUP BY
- stream joins, Stream Joins-Time-dependence of joins
- stream-stream join, Stream-stream join (window join)
- stream-table join, Stream-table join (stream enrichment)
- table-table join, Table-table join (materialized view maintenance)
- time-dependence of, Time-dependence of joins
- support in document databases, Convergence of document and relational databases
- JOTM (transaction coordinator), Two-Phase Commit (2PC)
- journaling (filesystems), Making B-trees reliable
- JSON
- aggregation pipeline (query language), Query languages for documents
- Avro schema representation, Avro
- binary variants, Binary encoding
- data locality, Data locality for reads and writes
- document data model, Relational Model versus Document Model
- for application data, issues with, JSON, XML, and Binary Variants
- GraphQL response, GraphQL
- in relational databases, Schema flexibility in the document model
- representing a résumé (example), The document data model for one-to-many relationships
- Schema, JSON Schema
- JSON-LD, Triple-Stores and SPARQL
- JsonPath (query language), Query languages
- JuiceFS (distributed filesystem), Distributed Filesystems, Object Stores
- Jupyter (notebook), Machine Learning
- just-in-time (JIT) compilation, Query Execution: Compilation and Vectorization
K
- Kafka (messaging), Message brokers, Using logs for message storage
- consumer groups, Multiple consumers
- for data integration, Unbundled versus integrated systems
- for event sourcing, Event Sourcing and CQRS
- Kafka Connect (database integration), Implementing change data capture, API support for change streams, Deriving several views from the same event log
- Kafka Streams (stream processor), Stream analytics, Maintaining materialized views
- exactly-once semantics, Exactly-once message processing revisited
- fault tolerance, Rebuilding state after a failure
- ksqlDB (stream database), Maintaining materialized views
- leader-based replication, Single-Leader Replication
- log compaction, Log compaction, Maintaining materialized views
- message offsets, Using logs for message storage, Idempotence
- partitions (sharding), Sharding
- request routing, Request Routing
- schema registry, But what is the writer’s schema?
- serving derived data, Serving Derived Data
- tiered storage, Disk space usage
- transactions, Database-internal Distributed Transactions, Atomic commit revisited
- unclean leader election, Subtleties of consensus
- use of model-checking, Model checking and specification languages
- kappa architecture, Unifying batch and stream processing
- key-value stores, Storage and Indexing for OLTP
- comparison to object stores, Object Stores
- in-memory, Keeping everything in memory
- LSM storage, Log-Structured Storage-Disk space usage
- sharding, Sharding of Key-Value Data-Skewed Workloads and Relieving Hot Spots
- by hash of key, Sharding by Hash of Key, Summary
- by key range, Sharding by Key Range, Summary
- skew and hot spots, Skewed Workloads and Relieving Hot Spots
- Kinesis (messaging), Message brokers, Using logs for message storage
- data warehouse integration, Cloud Data Warehouses
- Kryo (Java), Language-Specific Formats
- ksqlDB (stream database), Maintaining materialized views
- Kubernetes (cluster manager), Cloud Versus Self-Hosting, Microservices and Serverless, Distributed Job Orchestration, Separation of application code and state
- Kubeflow, Machine Learning
- kubelet, Distributed Job Orchestration
- operators, Distributed Job Orchestration
- use of etcd, Request Routing, Coordination Services
- KùzuDB (database), Problems with Distributed Systems, Graph-Like Data Models
- as embedded storage engine, Compaction strategies
- Cypher query language, The Cypher Query Language
L
- labeled property graphs (see property graphs)
- lambda architecture, Unifying batch and stream processing
- Lamport timestamps, Lamport timestamps
- Lance (data format), Cloud Data Warehouses, Column-Oriented Storage
- (see also column-oriented storage)
- large language models (LLMs)
- pre-processing training data, Machine Learning
- last write wins (LWW), Last write wins (discarding concurrent writes), Detecting Concurrent Writes, Implementing Linearizable Systems
- problems with, Timestamps for ordering events
- prone to lost updates, Conflict resolution and replication
- latency, Latency and Response Time
- (see also response time)
- across regions, Distributed Versus Single-Node Systems
- instability under two-phase locking, Performance of two-phase locking
- network latency and resource utilization, Can we not simply make network delays predictable?
- reducing by request hedging, Single-Leader Versus Leaderless Replication Performance
- response time versus, Latency and Response Time
- tail latency, Average, Median, and Percentiles, Use of Response Time Metrics, Local Secondary Indexes
- law (see legal matters)
- layering (of cloud services), Layering of cloud services
- leader-based replication, Single-Leader Replication-Logical (row-based) log replication
- (see also replication)
- failover, Leader failure: Failover, Distributed Locks and Leases
- handling node outages, Handling Node Outages
- implementation of replication logs
- change data capture, Change Data Capture-API support for change streams
- (see also changelogs)
- statement-based, Statement-based replication
- write-ahead log (WAL) shipping, Write-ahead log (WAL) shipping
- linearizability of operations, Implementing Linearizable Systems
- locking and leader election, Locking and leader election
- log sequence number, Setting Up New Followers, Consumer offsets
- read-scaling architecture, Problems with Replication Lag, Single-Leader Versus Leaderless Replication Performance
- relation to consensus, Consensus, From single-leader replication to consensus, Pros and cons of consensus
- setting up new followers, Setting Up New Followers
- synchronous versus asynchronous, Synchronous Versus Asynchronous Replication
- leaderless replication, Leaderless Replication-Version vectors
- (see also replication)
- catching up on missed writes, Catching up on missed writes
- detecting concurrent writes, Detecting Concurrent Writes-Version vectors
- version vectors, Version vectors
- multi-region, Multi-region operation
- quorums, Quorums for reading and writing-Multi-region operation
- consistency limitations, Limitations of Quorum Consistency-Monitoring staleness, Linearizability and quorums
- leap seconds, Software faults, Clock Synchronization and Accuracy
- in time-of-day clocks, Time-of-day clocks
- leases, Process Pauses
- implementation with coordination service, Coordination Services
- need for fencing, Distributed Locks and Leases
- relation to consensus, Single-value consensus
- ledgers (accounting), Summary
- immutability, Advantages of immutable events
- legacy systems, maintenance of, Maintainability
- legal matters, Data Systems, Law, and Society
- data deletion, Data Systems, Law, and Society, Disk space usage
- data residency, Distributed Versus Single-Node Systems, Sharding for Multitenancy
- privacy regulation, Data Systems, Law, and Society, Legislation and Self-Regulation
- legitimate interest (GDPR), Consent and Freedom of Choice
- leveled compaction, Compaction strategies, Disk space usage
- Levenshtein automata, Full-Text Search
- limping (partial failure), System Model and Reality
- Linear (project management software), Real-time collaboration, offline-first, and local-first apps
- linear algebra, DataFrames, Matrices, and Arrays
- linear scalability, Describing Load
- linearizability, Solutions for Replication Lag, Linearizability-Linearizability and network delays, Glossary
- and consensus, Consensus
- cost of, The Cost of Linearizability-Linearizability and network delays
- CAP theorem, The CAP theorem
- memory on multi-core CPUs, Linearizability and network delays
- definition, What Makes a System Linearizable?
- ID generation, Linearizable ID Generators
- in coordination services, Coordination Services
- of derived data systems
- avoiding coordination, Coordination-avoiding data systems
- of different replication methods, Implementing Linearizable Systems-Linearizability and quorums
- using quorums, Linearizability and quorums
- reads in consensus systems, Subtleties of consensus
- relying on, Relying on Linearizability-Cross-channel timing dependencies
- constraints and uniqueness, Constraints and uniqueness guarantees
- cross-channel timing dependencies, Cross-channel timing dependencies
- locking and leader election, Locking and leader election
- versus serializability, What Makes a System Linearizable?
- linked data, Triple-Stores and SPARQL
- LinkedIn
- Espresso (database), But what is the writer’s schema?
- LIquid (database), Datalog: Recursive Relational Queries
- profile (example), The document data model for one-to-many relationships
- Linux, leap second bug, Software faults, Clock Synchronization and Accuracy
- Litestream (backup tool), Setting Up New Followers
- liveness properties, Safety and liveness
- LLVM (compiler), Query Execution: Compilation and Vectorization
- LMDB (storage engine), Compaction strategies, B-tree variants, Indexes and snapshot isolation
- load
- coping with, Principles for Scalability
- describing, Describing Load
- load balancing, Describing Performance, Load balancers, service discovery, and service meshes
- in hardware, Load balancers, service discovery, and service meshes
- in software, Load balancers, service discovery, and service meshes
- using message brokers, Multiple consumers
- load shedding, Describing Performance
- local secondary indexes, Local Secondary Indexes, Summary
- local-first software, Real-time collaboration, offline-first, and local-first apps
- locality (data access), The document data model for one-to-many relationships, Data locality for reads and writes, Glossary
- in batch processing, Dataflow Engines
- in stateful clients, Sync Engines and Local-First Software, Stateful, offline-capable clients
- in stream processing, Stream-table join (stream enrichment), Rebuilding state after a failure, Stream processors and services, Uniqueness in log-based messaging
- location transparency, The problems with remote procedure calls (RPCs)
- in the actor model, Distributed actor frameworks
- lock-in, Pros and Cons of Cloud Services
- locks, Glossary
- deadlock, Explicit locking, Implementation of two-phase locking
- distributed locking, Distributed Locks and Leases-Fencing with multiple replicas, Locking and leader election
- fencing tokens, Fencing off zombies and delayed requests
- implementation with coordination service, Coordination Services
- relation to consensus, Single-value consensus
- for transaction isolation
- in snapshot isolation, Multi-version concurrency control (MVCC)
- in two-phase locking (2PL), Two-Phase Locking (2PL)-Index-range locks
- making operations atomic, Atomic write operations
- performance, Performance of two-phase locking
- preventing dirty writes, Implementing read committed
- preventing phantoms with index-range locks, Index-range locks, Detecting writes that affect prior reads
- read locks (shared mode), Implementing read committed, Implementation of two-phase locking
- shared mode and exclusive mode, Implementation of two-phase locking
- in distributed transactions
- deadlock detection, Problems with XA transactions
- in-doubt transactions holding locks, Holding locks while in doubt
- materializing conflicts with, Materializing conflicts
- preventing lost updates by explicit locking, Explicit locking
- log sequence number, Setting Up New Followers, Consumer offsets
- logical clocks, Timestamps for ordering events, ID Generators and Logical Clocks-Enforcing constraints using logical clocks, Ordering events to capture causality
- for last-write-wins, Last write wins (discarding concurrent writes)
- for read-after-write consistency, Reading Your Own Writes
- hybrid logical clocks, Hybrid logical clocks
- insufficiency for enforcing constraints, Enforcing constraints using logical clocks
- Lamport timestamps, Lamport timestamps
- logical replication, Logical (row-based) log replication
- for change data capture, Implementing change data capture
- LogicBlox (database), Datalog: Recursive Relational Queries
- logs (data structure), Storage and Indexing for OLTP, Shared logs as consensus, Glossary
- (see also shared logs)
- advantages of immutability, Advantages of immutable events
- and right to erasure, Data Systems, Law, and Society, Disk space usage
- compaction, Constructing and merging SSTables, Compaction strategies, Log compaction, State, Streams, and Immutability
- for stream operator state, Rebuilding state after a failure
- implementing uniqueness constraints, Uniqueness in log-based messaging
- log-based messaging, Log-based Message Brokers-Replaying old messages
- comparison to traditional messaging, Logs compared to traditional messaging, Replaying old messages
- consumer offsets, Consumer offsets
- disk space usage, Disk space usage
- replaying old messages, Replaying old messages, Reprocessing data for application evolution, Unifying batch and stream processing
- slow consumers, When consumers cannot keep up with producers
- using logs for message storage, Using logs for message storage
- log-structured storage, Storage and Indexing for OLTP-Compaction strategies
- log-structured merge tree (see LSM-trees)
- relation to consensus, Shared logs as consensus
- replication, Single-Leader Replication, Implementation of Replication Logs-Logical (row-based) log replication
- change data capture, Change Data Capture-API support for change streams
- (see also changelogs)
- coordination with snapshot, Setting Up New Followers
- logical (row-based) replication, Logical (row-based) log replication
- statement-based replication, Statement-based replication
- write-ahead log (WAL) shipping, Write-ahead log (WAL) shipping
- change data capture, Change Data Capture-API support for change streams
- scalability limits, The limits of total ordering
- Looker (business intelligence software), Characterizing Transaction Processing and Analytics, Analytics
- loose coupling, Making unbundling work
- lost updates (see updates)
- Lotus Notes (sync engine), Pros and cons of sync engines
- LSM-trees (indexes), The SSTable file format-Compaction strategies
- comparison to B-trees, Comparing B-Trees and LSM-Trees-Disk space usage
- Lucene (storage engine), Full-Text Search
- similarity search, Full-Text Search
- LWW (see last write wins)
M
- machine learning
- batch inference, Machine Learning
- data preparation with DataFrames, DataFrames, Matrices, and Arrays
- deleting training data, Data Systems, Law, and Society
- deploying data products, Beyond the data lake
- ethical considerations, Predictive Analytics
- (see also ethics)
- feature engineering, From data warehouse to data lake, Machine Learning
- in analytics systems, Operational Versus Analytical Systems
- iterative processing, Machine Learning
- LLMs (see large language models (LLMs))
- models derived from training data, Application code as a derivation function
- relation to batch processing, Machine Learning
- using a data lake, From data warehouse to data lake
- using GPUs, Layering of cloud services, Distributed Versus Single-Node Systems
- using matrices, DataFrames, Matrices, and Arrays
- madsim (deterministic simulation testing), Deterministic simulation testing
- magic scaling sauce, Principles for Scalability
- maintainability, Maintainability-Evolvability: Making Change Easy, A Philosophy of Streaming Systems
- evolvability (see evolvability)
- operability, Operability: Making Life Easy for Operations
- simplicity and managing complexity, Simplicity: Managing Complexity
- many-to-many relationships, Many-to-One and Many-to-Many Relationships
- modeling as graphs, Graph-Like Data Models
- many-to-one relationships, Many-to-One and Many-to-Many Relationships
- in star schema, Stars and Snowflakes: Schemas for Analytics
- MapReduce (batch processing), Batch Processing, MapReduce
- analysis of user activity events (example), JOIN and GROUP BY
- comparison to stream processing, Processing Streams
- disadvantages and limitations of, MapReduce
- fault tolerance, Handling Faults
- higher-level tools, Query languages
- mapper and reducer functions, MapReduce
- shuffling data, Shuffling Data
- sort-merge joins, JOIN and GROUP BY
- workflows, Scheduling Workflows
- (see also workflow engines)
- marshalling (see encoding)
- MartenDB (database), Event Sourcing and CQRS
- master-slave replication (obsolete term), Single-Leader Replication
- materialization, Glossary
- aggregate values, Materialized Views and Data Cubes
- conflicts, Materializing conflicts
- materialized views, Materialized Views and Data Cubes
- as derived data, Systems of Record and Derived Data, Composing Data Storage Technologies-Unbundled versus integrated systems
- in event sourcing, Event Sourcing and CQRS
- incremental view maintenance, Maintaining materialized views
- (see also incremental view maintenance (IVM))
- maintaining, using stream processing, Maintaining materialized views, Table-table join (materialized view maintenance)
- social network timeline example, Materializing and Updating Timelines
- Materialize (database), Materialized Views and Data Cubes
- incremental view maintenance, Maintaining materialized views
- matrices, DataFrames, Matrices, and Arrays
- sparse, DataFrames, Matrices, and Arrays
- Maxwell (change data capture), Implementing change data capture
- mean, Average, Median, and Percentiles
- media monitoring, Search on streams
- median, Average, Median, and Percentiles
- meeting room booking (example), More examples of write skew, Predicate locks, Enforcing Constraints
- Memcached (caching server), Keeping everything in memory
- Memgraph (database), Graph-Like Data Models
- Cypher query language, The Cypher Query Language
- memory
- barrier (CPU instruction), Linearizability and network delays
- corruption, Hardware and Software Faults
- in-memory databases, Keeping everything in memory
- durability, Durability
- serial transaction execution, Actual Serial Execution
- in-memory representation of data, Formats for Encoding Data
- memtable (in LSM-trees), Constructing and merging SSTables
- random bit-flips in, Trust, but Verify
- use by indexes, Log-Structured Storage
- memtable (in LSM-trees), Constructing and merging SSTables
- Mercurial (version control system), Concurrency control
- merge (DataFrame operator), DataFrames, Matrices, and Arrays
- merging sorted files, Constructing and merging SSTables, Shuffling Data
- Merkle trees, Tools for auditable data systems
- Mesos (cluster manager), Separation of application code and state
- message brokers (see messaging systems)
- message-passing (see event-driven architecture)
- MessagePack (encoding format), Binary encoding
- messaging systems, Stream Processing-Replaying old messages
- (see also streams)
- backpressure, buffering, or dropping messages, Messaging Systems
- brokerless messaging, Direct messaging from producers to consumers
- event logs, Log-based Message Brokers-Replaying old messages
- as data model, Event Sourcing and CQRS
- comparison to traditional messaging, Logs compared to traditional messaging, Replaying old messages
- consumer offsets, Consumer offsets
- replaying old messages, Replaying old messages, Reprocessing data for application evolution, Unifying batch and stream processing
- slow consumers, When consumers cannot keep up with producers
- exactly-once semantics, Exactly-once message processing, Exactly-once message processing revisited, Fault Tolerance
- message brokers, Message brokers-Acknowledgments and redelivery
- acknowledgements and redelivery, Acknowledgments and redelivery
- comparison to event logs, Logs compared to traditional messaging, Replaying old messages
- multiple consumers of same topic, Multiple consumers
- versus RPC, Event-Driven Architectures
- message loss, Messaging Systems
- reliability, Messaging Systems
- uniqueness in log-based messaging, Uniqueness in log-based messaging
- metastable failure, Describing Performance
- metered billing
- serverless, Microservices and Serverless
- storage, Operations in the Cloud Era
- microbatching, Microbatching and checkpointing
- microservices, Microservices and Serverless
- (see also services)
- causal dependencies across services, The limits of total ordering
- loose coupling, Making unbundling work
- relation to batch/stream processors, Batch Processing, Stream processors and services
- Microsoft
- Azure Blob Storage (see Azure Blob Storage)
- Azure managed disks, Separation of storage and compute
- Azure Service Bus (messaging), Message brokers, Message brokers compared to databases
- Azure SQL DB (database), Cloud-Native System Architecture
- Azure Storage, Object Stores
- Azure Stream Analytics, Stream analytics
- Azure Synapse Analytics (database), Cloud-Native System Architecture
- DCOM (Distributed Component Object Model), The problems with remote procedure calls (RPCs)
- MSDTC (transaction coordinator), Two-Phase Commit (2PC)
- SQL Server (see SQL Server)
- Microsoft Power BI (see Power BI (business intelligence software))
- migrating (rewriting) data, Schema flexibility in the document model, Different values written at different times, Deriving several views from the same event log, Reprocessing data for application evolution
- MinIO (object storage), Distributed Filesystems
- mobile apps, Trade-offs in Data Systems Architecture
- embedded databases, Compaction strategies
- model checking, Model checking and specification languages
- modulus operator (%), Hash modulo number of nodes
- Mojo (programming language)
- memory management, Limiting the impact of garbage collection
- MongoDB (database)
- aggregation pipeline, Query languages for documents
- atomic operations, Atomic write operations
- BSON, Data locality for reads and writes
- document data model, Relational Model versus Document Model
- hash-range sharding, Sharding by Hash of Key, Sharding by hash range
- in the cloud, Cloud-Native System Architecture
- join support, Convergence of document and relational databases
- joins ($lookup operator), Normalization, Denormalization, and Joins
- JSON Schema validation, JSON Schema
- leader-based replication, Single-Leader Replication
- ObjectIds, ID Generators and Logical Clocks
- range-based sharding, Sharding by Key Range
- request routing, Request Routing
- secondary indexes, Local Secondary Indexes
- shard splitting, Rebalancing key-range sharded data
- stored procedures, Pros and cons of stored procedures
- monitoring, Operations in the Cloud Era, Humans and Reliability, Operability: Making Life Easy for Operations
- monotonic clocks, Monotonic clocks
- monotonic reads, Monotonic Reads
- Morel (query language), Query languages
- MSMQ (messaging), XA transactions
- multi-column indexes, Multidimensional and Full-Text Indexes
- multi-leader replication, Multi-Leader Replication-Types of conflict
- (see also replication)
- collaborative editing, Real-time collaboration, offline-first, and local-first apps
- conflict detection, Types of conflict
- conflict resolution, Dealing with Conflicting Writes
- for multi-region replication, Geographically Distributed Operation, The Cost of Linearizability
- linearizability, lack of, Implementing Linearizable Systems
- offline-capable clients, Sync Engines and Local-First Software
- replication topologies, Multi-leader replication topologies-Problems with different topologies
- multi-object transactions, Single-Object and Multi-Object Operations
- need for, The need for multi-object transactions
- Multi-Paxos (consensus algorithm), Consensus in Practice
- multi-reader single-writer lock, Implementation of two-phase locking
- multi-table index cluster tables (Oracle), Data locality for reads and writes
- multi-version concurrency control (MVCC), Multi-version concurrency control (MVCC), Summary
- detecting stale MVCC reads, Detecting stale MVCC reads
- indexes and snapshot isolation, Indexes and snapshot isolation
- using synchronized clocks, Synchronized clocks for global snapshots
- multidimensional arrays, DataFrames, Matrices, and Arrays
- multitenancy, Separation of storage and compute, Network congestion and queueing
- by sharding, Sharding for Multitenancy
- using embedded databases, Compaction strategies
- versus Byzantine fault tolerance, Byzantine Faults
- mutual exclusion, Pessimistic versus optimistic concurrency control
- (see also locks)
- MySQL (database)
- archiving WAL to object stores, Setting Up New Followers
- binlog coordinates, Setting Up New Followers
- change data capture, Implementing change data capture, API support for change streams
- circular replication topology, Multi-leader replication topologies
- consistent snapshots, Setting Up New Followers
- distributed transaction support, XA transactions
- global transaction identifiers (GTIDs), Setting Up New Followers
- in the cloud, Cloud-Native System Architecture
- InnoDB storage engine (see InnoDB)
- leader-based replication, Single-Leader Replication
- multi-leader replication, Geographically Distributed Operation
- row-based replication, Logical (row-based) log replication
- sharding (see Vitess (database))
- snapshot isolation support, Snapshot isolation, repeatable read, and naming confusion
- (see also InnoDB)
- statement-based replication, Statement-based replication
N
- N+1 query problem, Object-relational mapping (ORM)
- nanomsg (messaging library), Direct messaging from producers to consumers
- Narayana (transaction coordinator), Two-Phase Commit (2PC)
- NATS (messaging), Message brokers
- natural language processing, From data warehouse to data lake
- Neo4j (database)
- Cypher query language, The Cypher Query Language
- graph data model, Graph-Like Data Models
- Neon (database), Setting Up New Followers
- Nephele (dataflow engine), Dataflow Engines
- Neptune (graph database), Graph-Like Data Models
- Cypher query language, The Cypher Query Language
- SPARQL query language, The SPARQL query language
- netcode (game development), Pros and cons of sync engines
- Network Attached Storage (NAS), Shared-Memory, Shared-Disk, and Shared-Nothing Architecture, Distributed Filesystems
- network model (data representation), Relational Model versus Document Model
- Network Time Protocol (see NTP)
- networks
- congestion and queueing, Network congestion and queueing
- datacenter network topologies, Cloud Computing Versus Supercomputing
- faults (see faults)
- linearizability and network delays, Linearizability and network delays
- network partitions, Network Faults in Practice
- in CAP theorem, The Cost of Linearizability
- timeouts and unbounded delays, Timeouts and Unbounded Delays
- NewSQL, Relational Model versus Document Model, Solutions for Replication Lag
- transactions and, What Exactly Is a Transaction?, Database-internal Distributed Transactions
- next-key locking, Index-range locks
- NFS (network file system), Distributed Filesystems
- on object storage, Object Stores
- Nimble (data format), Cloud Data Warehouses, Column-Oriented Storage
- (see also column-oriented storage)
- node (in graphs) (see vertices)
- nodes (processes), Distributed Versus Single-Node Systems, Glossary
- handling outages in leader-based replication, Handling Node Outages
- system models for failure, System Model and Reality
- noisy neighbors, Network congestion and queueing
- nonblocking atomic commit, Three-phase commit
- nondeterministic operations, Statement-based replication
- (see also deterministic operations)
- in distributed systems, Deterministic simulation testing
- in workflow engines, Durable execution
- partial failures, Faults and Partial Failures
- nonfunctional requirements, Defining Nonfunctional Requirements, Summary
- nonrepeatable reads, Snapshot Isolation and Repeatable Read
- (see also read skew)
- normalization (data representation), Normalization, Denormalization, and Joins-Many-to-One and Many-to-Many Relationships, Glossary
- foreign key references, The need for multi-object transactions
- in social network case study, Denormalization in the social networking case study
- in systems of record, Systems of Record and Derived Data
- versus denormalization, Deriving several views from the same event log
- NoSQL, Relational Model versus Document Model, Solutions for Replication Lag, Unbundling Databases
- transactions and, What Exactly Is a Transaction?
- Notation3 (N3), Triple-Stores and SPARQL
- NTP (Network Time Protocol), Unreliable Clocks
- accuracy, Clock Synchronization and Accuracy, Timestamps for ordering events
- adjustments to monotonic clocks, Monotonic clocks
- multiple server addresses, Weak forms of lying
- numbers, in XML and JSON encodings, JSON, XML, and Binary Variants
- NumPy (Python library), DataFrames, Matrices, and Arrays, Column-Oriented Storage
- NVMe (Non-Volatile Memory Express) (see solid state drives (SSDs))
O
- object databases, Relational Model versus Document Model
- object storage, Layering of cloud services, Object Stores
- Azure Blob Storage (see Azure Blob Storage)
- comparison to distributed filesystems, Object Stores
- comparison to key-value stores, Object Stores
- databases backed by, Setting Up New Followers
- for backups, Replication
- for cloud data warehouses, Cloud Data Warehouses, Writing to Column-Oriented Storage
- for database replication, Setting Up New Followers
- Google Cloud Storage (see Google Cloud Storage)
- object size, Separation of storage and compute
- S3 (see S3 (object storage))
- storing LSM segment files, Constructing and merging SSTables
- support for fencing, Fencing off zombies and delayed requests
- use in data lakes, From data warehouse to data lake
- object-relational mapping (ORM) frameworks, Object-relational mapping (ORM)
- error handling and aborted transactions, Handling errors and aborts
- unsafe read-modify-write cycle code, Atomic write operations
- object-relational mismatch, The Object-Relational Mismatch
- observability, Problems with Distributed Systems, Humans and Reliability, Operability: Making Life Easy for Operations
- observer pattern, Separation of application code and state
- OBT (one big table), Stars and Snowflakes: Schemas for Analytics
- offline systems, Batch Processing
- (see also batch processing)
- offline-first applications, Real-time collaboration, offline-first, and local-first apps, Stateful, offline-capable clients
- offsets
- consumer offsets in sharded logs, Consumer offsets
- messages in sharded logs, Using logs for message storage
- OLAP (online analytic processing), Characterizing Transaction Processing and Analytics, Glossary
- data cubes, Materialized Views and Data Cubes
- OLTP (online transaction processing), Characterizing Transaction Processing and Analytics, Glossary
- analytics queries versus, Analytics
- data normalization, Trade-offs of normalization
- workload characteristics, Actual Serial Execution
- on-premises deployment, Cloud Versus Self-Hosting
- data warehouses, Cloud Data Warehouses
- one big table (data warehouse schema), Stars and Snowflakes: Schemas for Analytics
- one-hot encoding, DataFrames, Matrices, and Arrays
- one-to-few relationships, The document data model for one-to-many relationships
- one-to-many relationships, The document data model for one-to-many relationships
- JSON representation, The document data model for one-to-many relationships
- online systems, Batch Processing
- (see also services)
- versus scientific computing, Cloud Computing Versus Supercomputing
- ontologies, Triple-Stores and SPARQL
- Oozie (workflow scheduler), Batch Processing
- OpenAPI (service definition format), Microservices and Serverless, Web services
- use of JSON Schema, JSON Schema
- openCypher (see Cypher (query language))
- OpenLink Virtuoso (see Virtuoso (database))
- OpenStack
- Swift (object storage), Object Stores
- operability, Operability: Making Life Easy for Operations
- operating systems versus databases, Unbundling Databases
- operational systems, Operational Versus Analytical Systems
- (see also OLTP)
- as systems of record, Systems of Record and Derived Data
- ETL into analytical systems, Data Warehousing
- operational transformation, CRDTs and Operational Transformation
- operations teams, Operations in the Cloud Era
- operators (query execution), Query Execution: Compilation and Vectorization
- in stream processing, Processing Streams
- optimistic concurrency control, Pessimistic versus optimistic concurrency control
- optimistic locking, Conditional writes (compare-and-set)
- Oracle (database)
- distributed transaction support, XA transactions
- GoldenGate (change data capture), Implementing change data capture
- hierarchical queries, Graph Queries in SQL
- lack of serializability, Isolation
- leader-based replication, Single-Leader Replication
- multi-leader replication, Geographically Distributed Operation
- multi-table index cluster tables, Data locality for reads and writes
- not preventing write skew, Characterizing write skew
- PL/SQL language, Pros and cons of stored procedures
- preventing lost updates, Automatically detecting lost updates
- read committed isolation, Implementing read committed
- Real Application Clusters (RAC), Locking and leader election
- snapshot isolation support, Snapshot Isolation and Repeatable Read, Snapshot isolation, repeatable read, and naming confusion
- TimesTen (in-memory database), Keeping everything in memory
- WAL-based replication, Write-ahead log (WAL) shipping
- ORC (data format), Cloud Data Warehouses, Column-Oriented Storage
- (see also column-oriented storage)
- orchestration (service deployment), Cloud Versus Self-Hosting, Microservices and Serverless
- batch job execution, Distributed Job Orchestration
- workflow engines, Batch Processing
- ordering
- event logs, Event Sourcing and CQRS
- limits of total ordering, The limits of total ordering
- logical timestamps, Logical Clocks
- of auto-incrementing IDs, ID Generators and Logical Clocks
- shared logs, Consensus in Practice-Pros and cons of consensus
- Orkes (workflow engine), Durable Execution and Workflows
- orphan pages (B-trees), Making B-trees reliable
- outbox pattern, Change data capture versus event sourcing
- outliers (response time), Average, Median, and Percentiles
- outsourcing, Cloud Versus Self-Hosting
- overload, Describing Performance, Handling errors and aborts
P
- PACELC principle, The CAP theorem
- package managers, Separation of application code and state
- packet switching, Can we not simply make network delays predictable?
- packets
- corruption of, Weak forms of lying
- sending via UDP, Direct messaging from producers to consumers
- PageRank (algorithm), Graph-Like Data Models, Query languages, Machine Learning
- paging (see virtual memory)
- pandas (Python library), From data warehouse to data lake, DataFrames, Matrices, and Arrays, Column-Oriented Storage, DataFrames
- Parquet (data format), Cloud Data Warehouses, Column-Oriented Storage, Archival storage, Query languages
- (see also column-oriented storage)
- databases on object storage, Setting Up New Followers
- document data model, Column-Oriented Storage
- use in batch processing, MapReduce
- partial failures, Faults and Partial Failures, Summary
- limping, System Model and Reality
- partial synchrony (system model), System Model and Reality
- partition key, Pros and Cons of Sharding, Sharding of Key-Value Data
- partitioning (see sharding)
- Paxos (consensus algorithm), Consensus, Consensus in Practice
- ballot number, From single-leader replication to consensus
- Multi-Paxos, Consensus in Practice
- payment card industry (PCI), Data Systems, Law, and Society
- PCI (payment card industry) compliance, Data Systems, Law, and Society
- percentiles, Average, Median, and Percentiles, Glossary
- calculating efficiently, Use of Response Time Metrics
- importance of high percentiles, Use of Response Time Metrics
- use in service level agreements (SLAs), Use of Response Time Metrics
- Percolator (Google), Implementing a linearizable ID generator
- Percona XtraBackup (MySQL tool), Setting Up New Followers
- performance
- degradation as fault, System Model and Reality
- describing, Describing Performance
- of distributed transactions, Distributed Transactions Across Different Systems
- of in-memory databases, Keeping everything in memory
- of linearizability, Linearizability and network delays
- of multi-leader replication, Geographically Distributed Operation
- permission isolation, Sharding for Multitenancy
- perpetual inconsistency, Timeliness and Integrity
- pessimistic concurrency control, Pessimistic versus optimistic concurrency control
- pglogical (PostgreSQL extension), Geographically Distributed Operation
- pgvector (vector index), Vector Embeddings
- phantoms (transaction isolation), Phantoms causing write skew
- materializing conflicts, Materializing conflicts
- preventing, in serializability, Predicate locks
- physical clocks (see clocks)
- pickle (Python), Language-Specific Formats
- Pinot (database), Characterizing Transaction Processing and Analytics, Column-Oriented Storage
- handling writes, Writing to Column-Oriented Storage
- pre-aggregation, Analytics
- serving derived data, Serving Derived Data
- pipelined execution
- in data warehouse queries, Query Execution: Compilation and Vectorization
- pivot table, DataFrames, Matrices, and Arrays
- point in time, Unreliable Clocks
- point query, Characterizing Transaction Processing and Analytics
- Polaris (data catalog), Cloud Data Warehouses
- polling, Representing Users, Posts, and Follows
- polystores, The meta-database of everything
- POSIX (portable operating system interface)
- compliant filesystems, Setting Up New Followers, Distributed Filesystems, Object Stores
- Post Office Horizon scandal, Humans and Reliability
- lack of transactions, Transactions
- PostgreSQL (database)
- archiving WAL to object stores, Setting Up New Followers
- change data capture, Implementing change data capture, API support for change streams
- distributed transaction support, XA transactions
- foreign data wrappers, The meta-database of everything
- full text search support, Combining Specialized Tools by Deriving Data
- in the cloud, Cloud-Native System Architecture
- JSON Schema validation, JSON Schema
- leader-based replication, Single-Leader Replication
- log sequence number, Setting Up New Followers
- logical decoding, Logical (row-based) log replication
- materialized view maintenance, Maintaining materialized views
- multi-leader replication, Geographically Distributed Operation
- MVCC implementation, Multi-version concurrency control (MVCC), Indexes and snapshot isolation
- partitioning vs. sharding, Sharding
- pgvector (vector index), Vector Embeddings
- PL/pgSQL language, Pros and cons of stored procedures
- PostGIS geospatial indexes, Multidimensional and Full-Text Indexes
- preventing lost updates, Automatically detecting lost updates
- preventing write skew, Characterizing write skew, Serializable Snapshot Isolation (SSI)
- read committed isolation, Implementing read committed
- representing graphs, Property Graphs
- serializable snapshot isolation (SSI), Serializable Snapshot Isolation (SSI)
- sharding (see Citus (database))
- snapshot isolation support, Snapshot Isolation and Repeatable Read, Snapshot isolation, repeatable read, and naming confusion
- WAL-based replication, Write-ahead log (WAL) shipping
- postings list, Full-Text Search
- in sharded indexes, Local Secondary Indexes
- postmortems, blameless, Humans and Reliability
- PouchDB (database), Pros and cons of sync engines
- Power BI (business intelligence software), Characterizing Transaction Processing and Analytics, Analytics
- pre-aggregation, Analytics
- serving derived data, Serving Derived Data
- pre-splitting, Rebalancing key-range sharded data
- Precision Time Protocol (PTP), Clock Synchronization and Accuracy
- predicate locks, Predicate locks
- predictive analytics, Operational Versus Analytical Systems, Predictive Analytics-Feedback Loops
- amplifying bias, Bias and Discrimination
- ethics of (see ethics)
- feedback loops, Feedback Loops
- preemption, Resource Allocation
- in distributed schedulers, Handling Faults
- of threads, Process Pauses
- Prefect (workflow scheduler), Durable Execution and Workflows, Batch Processing, Scheduling Workflows
- cloud data warehouse integration, Query languages
- Presto (query engine), Cloud Data Warehouses
- primary keys, Multi-Column and Secondary Indexes, Glossary
- auto-incrementing, ID Generators and Logical Clocks
- versus partition key, Sharding by hash range
- primary-backup replication (see leader-based replication)
- privacy, Privacy and Tracking-Legislation and Self-Regulation
- consent and freedom of choice, Consent and Freedom of Choice
- data as assets and power, Data as Assets and Power
- deleting data, Limitations of immutability
- ethical considerations (see ethics)
- legislation and self-regulation, Legislation and Self-Regulation
- meaning of, Privacy and Use of Data
- regulation, Data Systems, Law, and Society
- surveillance, Surveillance
- tracking behavioral data, Privacy and Tracking
- probabilistic algorithms, Use of Response Time Metrics, Stream analytics
- process pauses, Process Pauses-Limiting the impact of garbage collection
- processing time (of events), Reasoning About Time
- producers (message streams), Transmitting Event Streams
- product analytics, Characterizing Transaction Processing and Analytics
- column-oriented storage, Column-Oriented Storage
- programming languages
- for stored procedures, Pros and cons of stored procedures
- projections (event sourcing), Event Sourcing and CQRS
- Prolog (language), Datalog: Recursive Relational Queries
- (see also Datalog)
- property graphs, Property Graphs
- Cypher query language, The Cypher Query Language
- Property Graph Query Language (PGQL), Graph Queries in SQL
- property-based testing, Humans and Reliability, Formal Methods and Randomized Testing
- Protocol Buffers (data format), Protocol Buffers-Field tags and schema evolution, Protocol Buffers
- field tags and schema evolution, Field tags and schema evolution
- provenance of data, Designing for auditability
- publish/subscribe model, Messaging Systems
- publishers (message streams), Transmitting Event Streams
- Pulsar (streaming platform), Acknowledgments and redelivery
- PyTorch (machine learning library), Machine Learning
Q
- Qpid (messaging), Message brokers compared to databases
- quality of service (QoS), Can we not simply make network delays predictable?
- Quantcast File System (distributed filesystem), Object Stores
- query engines
- compilation and vectorization, Query Execution: Compilation and Vectorization
- in cloud data warehouse, Cloud Data Warehouses
- operators, Query Execution: Compilation and Vectorization
- optimizing declarative queries, Data Models and Query Languages
- query languages
- Cypher, The Cypher Query Language
- Datalog, Datalog: Recursive Relational Queries
- GraphQL, GraphQL
- MongoDB aggregation pipeline, Normalization, Denormalization, and Joins, Query languages for documents
- recursive SQL queries, Graph Queries in SQL
- SPARQL, The SPARQL query language
- SQL, Normalization, Denormalization, and Joins
- query optimizers, Query languages
- query plans, Query Execution: Compilation and Vectorization
- queueing delays, Network congestion and queueing
- head-of-line blocking, Latency and Response Time
- latency and response time, Latency and Response Time
- queues (messaging), Message brokers
- QUIC (protocol), The Limitations of TCP
- quorums, Quorums for reading and writing-Multi-region operation, Glossary
- for leaderless replication, Quorums for reading and writing
- in consensus algorithms, From single-leader replication to consensus
- limitations of consistency, Limitations of Quorum Consistency-Monitoring staleness, Linearizability and quorums
- making decisions in distributed systems, The Majority Rules
- monitoring staleness, Monitoring staleness
- multi-region replication, Multi-region operation
- relying on durability, Mapping system models to the real world
- quotas, Operations in the Cloud Era
R
- R (language), From data warehouse to data lake, DataFrames, Matrices, and Arrays, DataFrames
- R-trees (indexes), Multidimensional and Full-Text Indexes
- R2 (object storage), Layering of cloud services, Distributed Filesystems
- RabbitMQ (messaging), Message brokers, Message brokers compared to databases
- quorum queues (replication), Single-Leader Replication
- race conditions, Isolation
- (see also concurrency)
- avoiding with linearizability, Cross-channel timing dependencies
- caused by dual writes, Keeping Systems in Sync
- causing loss of money, Weak Isolation Levels
- dirty writes, No dirty writes
- in counter increments, No dirty writes
- lost updates, Preventing Lost Updates-Conflict resolution and replication
- preventing with event logs, Concurrency control, Dataflow: Interplay between state changes and application code
- preventing with serializable isolation, Serializability
- weak transaction isolation, Weak Isolation Levels
- write skew, Write Skew and Phantoms-Materializing conflicts
- Raft (consensus algorithm), Consensus, Consensus in Practice
- leader-based replication, Single-Leader Replication
- sensitivity to network problems, Pros and cons of consensus
- term number, From single-leader replication to consensus
- use in etcd, Implementing Linearizable Systems
- RAID (Redundant Array of Independent Disks), Separation of storage and compute, Tolerating hardware faults through redundancy, Distributed Filesystems
- railways, schema migration on, Reprocessing data for application evolution
- RAM (see memory)
- RAMCloud (in-memory storage), Keeping everything in memory
- random writes (access pattern), Sequential versus random writes
- range queries
- in B-trees, B-Trees, Read performance
- in LSM-trees, Read performance
- not efficient in hash maps, Log-Structured Storage
- with hash sharding, Sharding by hash range
- ranking algorithms, Machine Learning
- Ray (workflow scheduler), Machine Learning
- RDF (Resource Description Framework), The RDF data model
- querying with SPARQL, The SPARQL query language
- RDMA (Remote Direct Memory Access), Layering of cloud services, Cloud Computing Versus Supercomputing
- React (user interface library), End-to-end event streams
- reactive programming, Pros and cons of sync engines
- read committed isolation level, Read Committed-Implementing read committed
- implementing, Implementing read committed
- multi-version concurrency control (MVCC), Multi-version concurrency control (MVCC)
- no dirty reads, No dirty reads
- no dirty writes, No dirty writes
- read models (event sourcing), Event Sourcing and CQRS
- read repair (leaderless replication), Catching up on missed writes
- for linearizability, Linearizability and quorums
- read replicas (see leader-based replication)
- read skew (transaction isolation), Snapshot Isolation and Repeatable Read, Summary
- read uncommitted isolation level, Implementing read committed
- read-after-write consistency, Reading Your Own Writes, Timeliness and Integrity
- cross-device, Reading Your Own Writes
- in derived data systems, Derived data versus distributed transactions
- read-modify-write cycle, Preventing Lost Updates
- read-scaling architecture, Problems with Replication Lag, Single-Leader Versus Leaderless Replication Performance
- versus sharding, Pros and Cons of Sharding
- reads as events, Reads are events too
- real-time
- analytics (see product analytics)
- collaborative editing, Real-time collaboration, offline-first, and local-first apps
- publish/subscribe dataflow, End-to-end event streams
- response time guarantees, Response time guarantees
- time-of-day clocks, Time-of-day clocks
- Realm (database), Pros and cons of sync engines
- rebalancing shards, Rebalancing key-range sharded data-Operations: Automatic or Manual Rebalancing, Glossary
- (see also sharding)
- automatic or manual rebalancing, Operations: Automatic or Manual Rebalancing
- fixed number of shards, Fixed number of shards
- fixed number of shards per node, Sharding by hash range
- problems with hash mod N, Hash modulo number of nodes
- recency guarantee, Linearizability
- recommendation engines, Operational Versus Analytical Systems
- building using DataFrames, DataFrames, Matrices, and Arrays
- iterative processing, Machine Learning
- reconfiguration (consensus), Subtleties of consensus
- records, MapReduce
- events in stream processing, Transmitting Event Streams
- recursive queries
- in Cypher, The Cypher Query Language
- in Datalog, Datalog: Recursive Relational Queries
- in SPARQL, The SPARQL query language
- lack of, in GraphQL, GraphQL
- SQL common table expressions, Graph Queries in SQL
- Red Hat
- Apicurio Registry, JSON Schema
- red-black tree, Constructing and merging SSTables
- redelivery (messaging), Acknowledgments and redelivery
- Redis (database)
- atomic operations, Atomic write operations
- CRDT support, CRDTs and Operational Transformation
- durability, Keeping everything in memory
- Lua scripting, Pros and cons of stored procedures
- multi-leader replication, Geographically Distributed Operation
- process-per-core model, Pros and Cons of Sharding
- single-threaded execution, Actual Serial Execution
- redo log (see write-ahead log)
- Redpanda (messaging), Message brokers, Setting Up New Followers
- tiered storage, Disk space usage
- Redshift (database), Cloud Data Warehouses
- redundancy
- hardware components, Tolerating hardware faults through redundancy
- of derived data, Systems of Record and Derived Data
- (see also derived data)
- Reed–Solomon codes (error correction), Distributed Filesystems
- refactoring, Evolvability: Making Change Easy
- (see also evolvability)
- regions (geographic distribution), Reading Your Own Writes
- (see also datacenters)
- consensus across, Pros and cons of consensus
- definition, Reading Your Own Writes
- latency, Distributed Versus Single-Node Systems
- linearizable ID generation, Implementing a linearizable ID generator
- replication across, Geographically Distributed Operation-Problems with different topologies, The Cost of Linearizability, The limits of total ordering
- leaderless, Multi-region operation
- multi-leader, Geographically Distributed Operation
- regions (sharding), Sharding
- register (data structure), What Makes a System Linearizable?
- regulation (see legal matters)
- relational data model, From data warehouse to data lake, Relational Model versus Document Model-Convergence of document and relational databases
- comparison to document model, When to Use Which Model-Convergence of document and relational databases
- graph queries in SQL, Graph Queries in SQL
- in-memory databases with, Keeping everything in memory
- many-to-one and many-to-many relationships, Many-to-One and Many-to-Many Relationships
- multi-object transactions, need for, The need for multi-object transactions
- object-relational mismatch, The Object-Relational Mismatch
- representing a reorderable list, When to Use Which Model
- versus document model
- convergence of models, Convergence of document and relational databases
- data locality, Data locality for reads and writes
- relational databases
- eventual consistency, Problems with Replication Lag
- history, Relational Model versus Document Model
- leader-based replication, Single-Leader Replication
- logical logs, Logical (row-based) log replication
- philosophy compared to Unix, Unbundling Databases, The meta-database of everything
- schema changes, Schema flexibility in the document model, Encoding and Evolution, Different values written at different times
- sharded secondary indexes, Sharding and Secondary Indexes
- statement-based replication, Statement-based replication
- use of B-tree indexes, B-Trees
- relationships (see edges)
- reliability, Reliability and Fault Tolerance-Humans and Reliability, A Philosophy of Streaming Systems
- building a reliable system from unreliable components, Faults and Partial Failures
- hardware faults, Hardware and Software Faults
- human errors, Humans and Reliability
- importance of, Humans and Reliability
- of messaging systems, Messaging Systems
- software faults, Software faults
- Remote Method Invocation (Java RMI), The problems with remote procedure calls (RPCs)
- remote procedure calls (RPCs), The problems with remote procedure calls (RPCs)-Data encoding and evolution for RPC
- (see also services)
- data encoding and evolution, Data encoding and evolution for RPC
- issues with, The problems with remote procedure calls (RPCs)
- using Avro, But what is the writer’s schema?
- versus message brokers, Event-Driven Architectures
- renewable energy, Distributed Versus Single-Node Systems
- repeatable reads (transaction isolation), Snapshot isolation, repeatable read, and naming confusion
- replicas, Single-Leader Replication
- replication, Replication-Summary, Glossary
- and durability, Durability
- conflict resolution and, Conflict resolution and replication
- consistency properties, Problems with Replication Lag-Solutions for Replication Lag
- consistent prefix reads, Consistent Prefix Reads
- monotonic reads, Monotonic Reads
- reading your own writes, Reading Your Own Writes
- in distributed filesystems, Distributed Filesystems
- leaderless, Leaderless Replication-Version vectors
- detecting concurrent writes, Detecting Concurrent Writes-Version vectors
- limitations of quorum consistency, Limitations of Quorum Consistency-Monitoring staleness, Linearizability and quorums
- monitoring staleness, Monitoring staleness
- multi-leader, Multi-Leader Replication-Types of conflict
- across multiple regions, Geographically Distributed Operation, The Cost of Linearizability
- conflict resolution, Dealing with Conflicting Writes-Types of conflict
- replication topologies, Multi-leader replication topologies-Problems with different topologies
- reasons for using, Distributed Versus Single-Node Systems, Replication
- sharding and, Sharding
- single-leader, Single-Leader Replication-Logical (row-based) log replication
- failover, Leader failure: Failover
- implementation of replication logs, Implementation of Replication Logs-Logical (row-based) log replication
- relation to consensus, From single-leader replication to consensus, Pros and cons of consensus
- setting up new followers, Setting Up New Followers
- synchronous versus asynchronous, Synchronous Versus Asynchronous Replication-Synchronous Versus Asynchronous Replication
- state machine replication, Statement-based replication, Pros and cons of stored procedures, Using shared logs, Databases and Streams
- event sourcing, Event Sourcing and CQRS
- reliance on determinism, Deterministic simulation testing
- using consensus, Pros and cons of consensus
- using erasure coding, Distributed Filesystems
- using object storage, Setting Up New Followers
- versus backups, Replication
- with heterogeneous data systems, Keeping Systems in Sync
- replication logs (see logs)
- representations of data (see data models)
- reprocessing data, Reprocessing data for application evolution, Unifying batch and stream processing
- (see also evolvability)
- from log-based messaging, Replaying old messages
- request hedging, Single-Leader Versus Leaderless Replication Performance
- request identifiers, Uniquely identifying requests, Multi-shard request processing
- request routing, Request Routing-Request Routing
- approaches to, Request Routing
- residence laws for data, Distributed Versus Single-Node Systems, Sharding for Multitenancy
- resilient systems, Reliability and Fault Tolerance
- (see also fault tolerance)
- resource isolation, Cloud Computing Versus Supercomputing, Sharding for Multitenancy
- resource limits, Operations in the Cloud Era
- response time
- as performance metric, Describing Performance, Batch Processing
- guarantees on, Response time guarantees
- impact on users, Average, Median, and Percentiles
- in replicated systems, Single-Leader Versus Leaderless Replication Performance
- latency versus, Latency and Response Time
- mean and percentiles, Average, Median, and Percentiles
- user experience, Average, Median, and Percentiles
- responsibility and accountability, Responsibility and Accountability
- REST (Representational State Transfer), Web services
- (see also services)
- Restate (workflow engine), Durable Execution and Workflows
- RethinkDB (database)
- join support, Convergence of document and relational databases
- key-range sharding, Sharding by Key Range
- retry storm, Describing Performance, Software faults
- reverse ETL, Beyond the data lake
- Riak (database)
- CRDT support, CRDTs and Operational Transformation, Detecting Concurrent Writes
- dotted version vectors, Version vectors
- gossip protocol, Request Routing
- hash sharding, Fixed number of shards
- leaderless replication, Leaderless Replication
- linearizability, lack of, Linearizability and quorums
- multi-region support, Multi-region operation
- rebalancing, Operations: Automatic or Manual Rebalancing
- secondary indexes, Local Secondary Indexes
- sloppy quorums, Single-Leader Versus Leaderless Replication Performance
- vnodes (sharding), Sharding
- ring buffers, Disk space usage
- RisingWave (database)
- incremental view maintenance, Maintaining materialized views
- rockets, Byzantine Faults
- RocksDB (storage engine), Constructing and merging SSTables
- as embedded storage engine, Compaction strategies
- leveled compaction, Compaction strategies
- serving derived data, Serving Derived Data
- rollbacks (transactions), Transactions
- rolling upgrades, Tolerating hardware faults through redundancy, Encoding and Evolution, Faults and Partial Failures
- in a multitenant system, Sharding for Multitenancy
- routing (see request routing)
- row-based replication, Logical (row-based) log replication
- row-oriented storage, Column-Oriented Storage
- rowhammer (memory corruption), Hardware and Software Faults
- RPCs (see remote procedure calls)
- rules (Datalog), Datalog: Recursive Relational Queries
- Rust (programming language)
- memory management, Limiting the impact of garbage collection
S
- S3 (object storage), Layering of cloud services, Setting Up New Followers, Batch Processing, Distributed Filesystems, Object Stores
- checking data integrity, Don’t just blindly trust what they promise
- conditional writes, Fencing off zombies and delayed requests
- object size, Separation of storage and compute
- S3 Express One Zone, Object Stores
- use in MapReduce, MapReduce
- workflow example, Scheduling Workflows
- SaaS (see software as a service (SaaS))
- safety and liveness properties, Safety and liveness
- in consensus algorithms, Single-value consensus
- in transactions, Transactions
- sagas (see compensating transactions)
- Samza (stream processor), Stream analytics
- SAP HANA (database), Data Storage for Analytics
- scalability, Scalability-Principles for Scalability, A Philosophy of Streaming Systems
- auto-scaling, Operations: Automatic or Manual Rebalancing
- by sharding, Pros and Cons of Sharding
- describing load, Describing Load
- describing performance, Describing Performance
- linear, Describing Load
- principles for, Principles for Scalability
- replication and, Problems with Replication Lag
- scaling up versus scaling out, Shared-Memory, Shared-Disk, and Shared-Nothing Architecture
- scaling out, Shared-Memory, Shared-Disk, and Shared-Nothing Architecture
- (see also shared-nothing architecture)
- by sharding, Pros and Cons of Sharding
- scaling up, Shared-Memory, Shared-Disk, and Shared-Nothing Architecture
- SCD (slowly changing dimension), Time-dependence of joins
- scheduling
- algorithms, Resource Allocation
- batch jobs, Distributed Job Orchestration-Scheduling Workflows
- gang scheduling, Resource Allocation
- schema-on-read, Schema flexibility in the document model
- comparison to evolvable schema, The Merits of Schemas
- schema-on-write, Schema flexibility in the document model
- schemaless databases (see schema-on-read)
- schemas, Glossary
- Avro, Avro-Dynamically generated schemas
- reader determining writer’s schema, But what is the writer’s schema?
- schema evolution, The writer’s schema and the reader’s schema
- dynamically generated, Dynamically generated schemas
- evolution of, Reprocessing data for application evolution
- affecting application code, Encoding and Evolution
- compatibility checking, But what is the writer’s schema?
- in databases, Dataflow Through Databases-Archival storage
- in service calls, Data encoding and evolution for RPC
- flexibility in document model, Schema flexibility in the document model
- for analytics, Stars and Snowflakes: Schemas for Analytics-Stars and Snowflakes: Schemas for Analytics
- for JSON and XML, JSON, XML, and Binary Variants, JSON Schema
- generation and migration using ORMs, Object-relational mapping (ORM)
- merits of, The Merits of Schemas
- migration, Schema flexibility in the document model
- Protocol Buffers, Protocol Buffers-Field tags and schema evolution
- schema evolution, Field tags and schema evolution
- schema migration on railways, Reprocessing data for application evolution
- traditional approach to design, fallacy in, Deriving several views from the same event log
- scientific computing, Cloud Computing Versus Supercomputing
- scikit-learn (Python library), From data warehouse to data lake
- ScyllaDB (database)
- cluster metadata, Request Routing
- consistency level ANY, Single-Leader Versus Leaderless Replication Performance
- hash-range sharding, Sharding by Hash of Key, Sharding by hash range
- last-write-wins conflict resolution, Detecting Concurrent Writes
- leaderless replication, Leaderless Replication
- lightweight transactions, Single-object writes
- linearizability, lack of, Implementing Linearizable Systems
- log-structured storage, Constructing and merging SSTables
- multi-region support, Multi-region operation
- use of clocks, Limitations of Quorum Consistency, Timestamps for ordering events
- vnodes (sharding), Sharding
- search engines (see full-text search)
- searching on streams, Search on streams
- secondaries (see leader-based replication)
- secondary indexes, Multi-Column and Secondary Indexes, Glossary
- for many-to-many relationships, Many-to-One and Many-to-Many Relationships
- problems with dual writes, Keeping Systems in Sync, Reasoning about dataflows
- sharding, Sharding and Secondary Indexes-Global Secondary Indexes, Summary
- global, Global Secondary Indexes
- index maintenance, Maintaining derived state
- local, Local Secondary Indexes
- updating, transaction isolation and, The need for multi-object transactions
- secondary sort (MapReduce), JOIN and GROUP BY
- sed (Unix tool), Simple Log Analysis
- self-hosting, Cloud Versus Self-Hosting
- data warehouses, Cloud Data Warehouses
- self-joins, Summary
- self-validating systems, Don’t just blindly trust what they promise
- semantic search, Vector Embeddings
- semantic similarity, Vector Embeddings
- semantic web, Triple-Stores and SPARQL
- semi-synchronous replication, Synchronous Versus Asynchronous Replication
- sequential writes (access pattern), Sequential versus random writes
- serializability, Isolation, Weak Isolation Levels, Serializability-Performance of serializable snapshot isolation, Glossary
- linearizability versus, What Makes a System Linearizable?
- pessimistic versus optimistic concurrency control, Pessimistic versus optimistic concurrency control
- serial execution, Actual Serial Execution-Summary of serial execution
- sharding, Sharding
- using stored procedures, Encapsulating transactions in stored procedures, Using shared logs
- serializable snapshot isolation (SSI), Serializable Snapshot Isolation (SSI)-Performance of serializable snapshot isolation
- detecting stale MVCC reads, Detecting stale MVCC reads
- detecting writes that affect prior reads, Detecting writes that affect prior reads
- distributed execution, Performance of serializable snapshot isolation, Database-internal Distributed Transactions
- performance of SSI, Performance of serializable snapshot isolation
- preventing write skew, Decisions based on an outdated premise-Detecting writes that affect prior reads
- strict serializability, What Makes a System Linearizable?
- timeliness vs. integrity, Timeliness and Integrity
- two-phase locking (2PL), Two-Phase Locking (2PL)-Index-range locks
- index-range locks, Index-range locks
- performance, Performance of two-phase locking
- Serializable (Java), Language-Specific Formats
- serialization, Formats for Encoding Data
- (see also encoding)
- serverless, Microservices and Serverless
- service discovery, Load balancers, service discovery, and service meshes, Request Routing, Service discovery
- service level agreements (SLAs), Use of Response Time Metrics, Describing Load
- service mesh, Load balancers, service discovery, and service meshes
- Service Organization Control (SOC), Data Systems, Law, and Society
- service time, Latency and Response Time
- service-oriented architecture (SOA), Microservices and Serverless
- (see also services)
- services, Dataflow Through Services: REST and RPC-Data encoding and evolution for RPC
- microservices, Microservices and Serverless
- causal dependencies across services, The limits of total ordering
- loose coupling, Making unbundling work
- relation to batch/stream processors, Batch Processing, Stream processors and services
- remote procedure calls (RPCs), The problems with remote procedure calls (RPCs)-Data encoding and evolution for RPC
- issues with, The problems with remote procedure calls (RPCs)
- similarity to databases, Dataflow Through Services: REST and RPC
- web services, Web services
- session windows (stream processing), Types of windows
- (see also windows)
- sharding, Sharding-Summary, Glossary
- and consensus, Using shared logs
- and replication, Sharding
- distributed transactions across shards, Distributed Transactions
- hot shards, Sharding of Key-Value Data
- in batch processing, Batch Processing
- key-range splitting, Rebalancing key-range sharded data
- multi-shard operations, Multi-shard data processing
- enforcing constraints, Multi-shard request processing
- secondary index maintenance, Maintaining derived state
- of key-value data, Sharding of Key-Value Data-Skewed Workloads and Relieving Hot Spots
- by key range, Sharding by Key Range
- skew and hot spots, Skewed Workloads and Relieving Hot Spots
- origin of the term, Sharding
- partition key, Pros and Cons of Sharding, Sharding of Key-Value Data
- rebalancing
- of key-range sharded data, Rebalancing key-range sharded data
- rebalancing shards, Rebalancing key-range sharded data-Operations: Automatic or Manual Rebalancing
- automatic or manual rebalancing, Operations: Automatic or Manual Rebalancing
- problems with hash mod N, Hash modulo number of nodes
- using fixed number of shards, Fixed number of shards
- using N shards per node, Sharding by hash range
- request routing, Request Routing-Request Routing
- secondary indexes, Sharding and Secondary Indexes-Global Secondary Indexes
- global, Global Secondary Indexes
- local, Local Secondary Indexes
- serial execution of transactions and, Sharding
- sorting sharded data, Shuffling Data
- shared logs, Consensus in Practice-Pros and cons of consensus, The limits of total ordering, Uniqueness in log-based messaging
- algorithms, Consensus in Practice
- for event sourcing, Event Sourcing and CQRS
- for messaging, Log-based Message Brokers-Replaying old messages
- relation to consensus, Shared logs as consensus
- using, Using shared logs
- shared mode (locks), Implementation of two-phase locking
- shared-disk architecture, Shared-Memory, Shared-Disk, and Shared-Nothing Architecture, Distributed Filesystems
- shared-memory architecture, Shared-Memory, Shared-Disk, and Shared-Nothing Architecture
- shared-nothing architecture, Shared-Memory, Shared-Disk, and Shared-Nothing Architecture, Glossary
- distributed filesystems, Distributed Filesystems
- (see also distributed filesystems)
- use of network, Unreliable Networks
- sharks
- biting undersea cables, Network Faults in Practice
- counting (example), Query languages for documents
- shredding (deletion) (see crypto-shredding)
- shredding (in columnar encoding), Column-Oriented Storage
- shredding (in relational model), When to Use Which Model
- shuffle (batch processing), Shuffling Data-Shuffling Data
- siblings (concurrent values), Manual conflict resolution, Capturing the happens-before relationship, Conflict resolution and replication
- (see also conflicts)
- silo, Data Warehousing
- similarity search
- edit distance, Full-Text Search
- genome data, Summary
- simplicity, Simplicity: Managing Complexity
- Singer, Data Warehousing
- single-instruction-multi-data (SIMD) instructions, Query Execution: Compilation and Vectorization
- single-leader replication (see leader-based replication)
- single-threaded execution, Atomic write operations, Actual Serial Execution
- in stream processing, Logs compared to traditional messaging, Concurrency control, Uniqueness in log-based messaging
- SingleStore (database)
- in-memory storage, Keeping everything in memory
- site reliability engineer, Operations in the Cloud Era
- size-tiered compaction, Compaction strategies, Disk space usage
- skew, Glossary
- clock skew, Relying on Synchronized Clocks-Clock readings with a confidence interval, Implementing Linearizable Systems
- in transaction isolation
- read skew, Snapshot Isolation and Repeatable Read, Summary
- write skew, Write Skew and Phantoms-Materializing conflicts, Decisions based on an outdated premise-Detecting writes that affect prior reads
- (see also write skew)
- meanings of, Snapshot Isolation and Repeatable Read
- unbalanced workload, Sharding of Key-Value Data
- compensating for, Skewed Workloads and Relieving Hot Spots
- due to celebrities, Skewed Workloads and Relieving Hot Spots
- for time-series data, Sharding by Key Range
- skip list, Constructing and merging SSTables
- SLA (see service level agreements)
- Slack (group chat)
- GraphQL example, GraphQL
- SlateDB (database), Constructing and merging SSTables, Setting Up New Followers
- sliding windows (stream processing), Types of windows
- (see also windows)
- sloppy quorums, Single-Leader Versus Leaderless Replication Performance
- slowly changing dimension (data warehouses), Time-dependence of joins
- smearing (leap seconds adjustments), Clock Synchronization and Accuracy
- snapshots (databases)
- as backups, Replication
- computing derived data, Creating an index
- in change data capture, Initial snapshot
- serializable snapshot isolation (SSI), Serializable Snapshot Isolation (SSI)-Performance of serializable snapshot isolation
- setting up a new replica, Setting Up New Followers
- snapshot isolation and repeatable read, Snapshot Isolation and Repeatable Read-Snapshot isolation, repeatable read, and naming confusion
- implementing with MVCC, Multi-version concurrency control (MVCC)
- indexes and MVCC, Indexes and snapshot isolation
- visibility rules, Visibility rules for observing a consistent snapshot
- synchronized clocks for global snapshots, Synchronized clocks for global snapshots
- Snowflake (database), Cloud-Native System Architecture, Layering of cloud services, Cloud Data Warehouses, Batch Processing
- column-oriented storage, Column-Oriented Storage
- handling writes, Writing to Column-Oriented Storage
- sharding and clustering, Sharding by hash range
- Snowpark, Query languages
- Snowflake (ID generator), ID Generators and Logical Clocks
- snowflake schemas, Stars and Snowflakes: Schemas for Analytics
- SOAP (web services), The problems with remote procedure calls (RPCs)
- SOC2 (see Service Organization Control (SOC))
- social graph, Graph-Like Data Models
- society
- responsibility towards, Data Systems, Law, and Society, Legislation and Self-Regulation
- sociotechnical systems, Humans and Reliability
- software as a service (SaaS), Trade-offs in Data Systems Architecture, Cloud Versus Self-Hosting
- ETL from, Data Warehousing
- multitenancy, Sharding for Multitenancy
- software bugs, Software faults
- maintaining integrity, Maintaining integrity in the face of software bugs
- solar storm, Hardware and Software Faults
- solid state drives (SSDs)
- access patterns, Sequential versus random writes
- compared to object storage, Setting Up New Followers
- detecting corruption, The end-to-end argument, Don’t just blindly trust what they promise
- failure rate, Hardware and Software Faults
- faults in, Durability
- firmware bugs, Software faults
- read throughput, Read performance
- sequential vs. random writes, Sequential versus random writes
- Solr (search server)
- local secondary indexes, Local Secondary Indexes
- request routing, Request Routing
- use of Lucene, Full-Text Search
- sort (Unix tool), Simple Log Analysis, Simple Log Analysis, Sorting Versus In-memory Aggregation, Distributed Job Orchestration
- sort-merge joins (MapReduce), JOIN and GROUP BY
- Sorted String Tables (see SSTables)
- sorting
- sort order in column storage, Sort Order in Column Storage
- source of truth (see systems of record)
- Spanner (database)
- consistency model, What Makes a System Linearizable?
- data locality, Data locality for reads and writes
- in the cloud, Cloud-Native System Architecture
- snapshot isolation using clocks, Synchronized clocks for global snapshots
- transactions, What Exactly Is a Transaction?, Database-internal Distributed Transactions
- TrueTime API, Clock readings with a confidence interval
- Spark (processing framework), From data warehouse to data lake, Cloud-Native System Architecture, Batch Processing, Dataflow Engines
- cost efficiency, Query languages
- DataFrames, DataFrames, Matrices, and Arrays, DataFrames
- fault tolerance, Handling Faults
- for data warehouses, Cloud Data Warehouses
- high availability using ZooKeeper, Coordination Services
- MLlib, Machine Learning
- query optimizer, Query languages
- shuffling data, Shuffling Data
- Spark Streaming, Stream analytics
- microbatching, Microbatching and checkpointing
- streaming SQL support, Complex event processing
- use for ETL, Extract–Transform–Load (ETL)
- SPARQL (query language), The SPARQL query language
- sparse index, The SSTable file format
- sparse matrices, DataFrames, Matrices, and Arrays
- split brain, Leader failure: Failover, Request Routing, Glossary
- enforcing constraints, Uniqueness constraints require consensus
- in consensus algorithms, Consensus, From single-leader replication to consensus
- preventing, Implementing Linearizable Systems
- using fencing tokens to avoid, Fencing off zombies and delayed requests-Fencing with multiple replicas
- spot instances, Handling Faults
- spreadsheets, Trade-offs in Data Systems Architecture, DataFrames, Matrices, and Arrays
- dataflow programming, Designing Applications Around Dataflow
- pivot table, DataFrames, Matrices, and Arrays
- SQL (Structured Query Language), Simplicity: Managing Complexity, Relational Model versus Document Model, Cloud Data Warehouses
- for analytics, Data Warehousing, Column-Oriented Storage
- graph queries in, Graph Queries in SQL
- isolation levels standard, issues with, Snapshot isolation, repeatable read, and naming confusion
- joins, Normalization, Denormalization, and Joins
- résumé (example), The document data model for one-to-many relationships
- social network home timelines (example), Representing Users, Posts, and Follows
- SQL injection vulnerability, Byzantine Faults
- statement-based replication, Statement-based replication
- stored procedures, Pros and cons of stored procedures
- support in batch processing frameworks, Batch Processing
- views, Datalog: Recursive Relational Queries
- SQL Server (database)
- archiving WAL to object stores, Setting Up New Followers
- change data capture, Implementing change data capture
- data warehousing support, Data Storage for Analytics
- distributed transaction support, XA transactions
- leader-based replication, Single-Leader Replication
- multi-leader replication, Geographically Distributed Operation
- preventing lost updates, Automatically detecting lost updates
- preventing write skew, Characterizing write skew, Implementation of two-phase locking
- read committed isolation, Implementing read committed
- serializable isolation, Implementation of two-phase locking
- snapshot isolation support, Snapshot Isolation and Repeatable Read
- T-SQL language, Pros and cons of stored procedures
- SQLite (database), Problems with Distributed Systems, Compaction strategies
- archiving WAL to object stores, Setting Up New Followers
- SRE (site reliability engineer), Operations in the Cloud Era
- SSDs (see solid state drives)
- SSTables (storage format), The SSTable file format-Compaction strategies
- constructing and maintaining, Constructing and merging SSTables
- making LSM-Tree from, Constructing and merging SSTables
- staged rollout (see rolling upgrades)
- staleness (old data), Reading Your Own Writes
- cross-channel timing dependencies, Cross-channel timing dependencies
- in leaderless databases, Writing to the Database When a Node Is Down
- in multi-version concurrency control, Detecting stale MVCC reads
- monitoring for, Monitoring staleness
- of client state, Pushing state changes to clients
- versus linearizability, Linearizability
- versus timeliness, Timeliness and Integrity
- standbys (see leader-based replication)
- star replication topologies, Multi-leader replication topologies
- star schemas, Stars and Snowflakes: Schemas for Analytics-Stars and Snowflakes: Schemas for Analytics
- Star Wars analogy (event time versus processing time), Event time versus processing time
- starvation (scheduling), Resource Allocation
- state
- derived from log of immutable events, State, Streams, and Immutability
- interplay between state changes and application code, Dataflow: Interplay between state changes and application code
- maintaining derived state, Maintaining derived state
- maintenance by stream processor in stream-stream joins, Stream-stream join (window join)
- observing derived state, Observing Derived State-Multi-shard data processing
- rebuilding after stream processor failure, Rebuilding state after a failure
- separation of application code and, Separation of application code and state
- state machine replication, Statement-based replication, Pros and cons of stored procedures, Using shared logs, Databases and Streams
- event sourcing, Event Sourcing and CQRS
- reliance on determinism, Deterministic simulation testing
- stateless systems, Trade-offs in Data Systems Architecture
- statement-based replication, Statement-based replication
- reliance on determinism, Deterministic simulation testing
- statically typed languages
- analogy to schema-on-write, Schema flexibility in the document model
- statistical and numerical algorithms, DataFrames, Matrices, and Arrays
- StatsD (metrics aggregator), Direct messaging from producers to consumers
- stock market feeds, Direct messaging from producers to consumers
- STONITH (Shoot The Other Node In The Head), Leader failure: Failover
- problems with, Fencing off zombies and delayed requests
- stop-the-world (see garbage collection)
- storage
- composing data storage technologies, Composing Data Storage Technologies-Unbundled versus integrated systems
- Storage Area Network (SAN), Shared-Memory, Shared-Disk, and Shared-Nothing Architecture, Distributed Filesystems
- storage engines, Storage and Retrieval-Summary
- column-oriented, Column-Oriented Storage-Query Execution: Compilation and Vectorization
- column compression, Column Compression-Column Compression
- defined, Column-Oriented Storage
- Parquet, Cloud Data Warehouses, Column-Oriented Storage, Archival storage
- sort order in, Sort Order in Column Storage-Sort Order in Column Storage
- versus wide-column model, Column Compression
- writing to, Writing to Column-Oriented Storage
- in-memory storage, Keeping everything in memory
- durability, Durability
- row-oriented, Storage and Indexing for OLTP-Keeping everything in memory
- B-trees, B-Trees-B-tree variants
- comparing B-trees and LSM-trees, Comparing B-Trees and LSM-Trees-Disk space usage
- defined, Column-Oriented Storage
- log-structured, Log-Structured Storage-Compaction strategies
- stored procedures, Encapsulating transactions in stored procedures-Pros and cons of stored procedures, Glossary
- and shared logs, Using shared logs
- pros and cons of, Pros and cons of stored procedures
- similarity to stream processors, Application code as a derivation function
- Storm (stream processor), Stream analytics
- distributed RPC, Event-Driven Architectures and RPC, Multi-shard data processing
- Trident state handling, Idempotence
- straggler events, Handling straggler events
- Stream Control Transmission Protocol (SCTP), The Limitations of TCP
- stream processing, Processing Streams-Summary, Glossary
- accessing external services within job, Stream-table join (stream enrichment), Microbatching and checkpointing, Idempotence, Exactly-once execution of an operation
- combining with batch processing, Unifying batch and stream processing
- comparison to batch processing, Processing Streams
- complex event processing (CEP), Complex event processing
- fault tolerance, Fault Tolerance-Rebuilding state after a failure
- atomic commit, Atomic commit revisited
- idempotence, Idempotence
- microbatching and checkpointing, Microbatching and checkpointing
- rebuilding state after a failure, Rebuilding state after a failure
- for data integration, Batch and Stream Processing-Unifying batch and stream processing
- for event sourcing, Event Sourcing and CQRS
- maintaining derived state, Maintaining derived state
- maintenance of materialized views, Maintaining materialized views
- messaging systems (see messaging systems)
- reasoning about time, Reasoning About Time-Types of windows
- event time versus processing time, Event time versus processing time, Microbatching and checkpointing, Unifying batch and stream processing
- knowing when window is ready, Handling straggler events
- types of windows, Types of windows
- relation to databases (see streams)
- relation to services, Stream processors and services
- relationship to batch processing, Batch Processing
- search on streams, Search on streams
- single-threaded execution, Logs compared to traditional messaging, Concurrency control
- stream analytics, Stream analytics
- stream joins, Stream Joins-Time-dependence of joins
- stream-stream join, Stream-stream join (window join)
- stream-table join, Stream-table join (stream enrichment)
- table-table join, Table-table join (materialized view maintenance)
- time-dependence of, Time-dependence of joins
- streams, Stream Processing-Replaying old messages
- end-to-end, pushing events to clients, End-to-end event streams
- messaging systems (see messaging systems)
- processing (see stream processing)
- relation to databases, Databases and Streams-Limitations of immutability
- (see also changelogs)
- API support for change streams, API support for change streams
- change data capture, Change Data Capture-API support for change streams
- derivative of state by time, State, Streams, and Immutability
- event sourcing, Change data capture versus event sourcing
- keeping systems in sync, Keeping Systems in Sync-Keeping Systems in Sync
- philosophy of immutable events, State, Streams, and Immutability-Limitations of immutability
- topics, Transmitting Event Streams
- strict serializability, What Makes a System Linearizable?
- timeliness vs. integrity, Timeliness and Integrity
- striping (in columnar encoding), Column-Oriented Storage
- strong consistency (see linearizability)
- strong eventual consistency, Automatic conflict resolution
- strong one-copy serializability, What Makes a System Linearizable?
- subjects, predicates, and objects (in triple-stores), Triple-Stores and SPARQL
- subscribers (message streams), Transmitting Event Streams
- (see also consumers)
- supercomputers, Cloud Computing Versus Supercomputing
- Superset (data visualization software), Analytics
- surveillance, Surveillance
- (see also privacy)
- sushi principle, From data warehouse to data lake
- sustainability, Distributed Versus Single-Node Systems
- Swagger (service definition format), Web services
- swapping to disk (see virtual memory)
- Swift (programming language)
- memory management, Limiting the impact of garbage collection
- sync engines, Sync Engines and Local-First Software-Pros and cons of sync engines
- examples of, Pros and cons of sync engines
- for local-first software, Real-time collaboration, offline-first, and local-first apps
- synchronous networks, Synchronous Versus Asynchronous Networks, Glossary
- comparison to asynchronous networks, Synchronous Versus Asynchronous Networks
- system model, System Model and Reality
- synchronous replication, Synchronous Versus Asynchronous Replication, Glossary
- with multiple leaders, Multi-Leader Replication
- system administrator, Operations in the Cloud Era
- system models, Knowledge, Truth, and Lies, System Model and Reality-Deterministic simulation testing
- assumptions in, Trust, but Verify
- correctness of algorithms, Defining the correctness of an algorithm
- mapping to the real world, Mapping system models to the real world
- safety and liveness, Safety and liveness
- systems of record, Systems of Record and Derived Data, Glossary
- change data capture, Implementing change data capture, Reasoning about dataflows
- event logs, Event Sourcing and CQRS
- treating event log as, State, Streams, and Immutability
- systems thinking, Feedback Loops
T
- t-digest (algorithm), Use of Response Time Metrics
- table-table joins, Table-table join (materialized view maintenance)
- Tableau (data visualization software), Characterizing Transaction Processing and Analytics, Analytics
- tail (Unix tool), Using logs for message storage
- tail latency (see latency)
- tail vertex (property graphs), Property Graphs
- task (workflows) (see workflow engines)
- TCP (Transmission Control Protocol), The Limitations of TCP
- comparison to circuit switching, Can we not simply make network delays predictable?
- comparison to UDP, Network congestion and queueing
- connection failures, Detecting Faults
- flow control, Network congestion and queueing, Messaging Systems
- reliability and duplicate suppression, Duplicate suppression
- retransmission timeouts, Network congestion and queueing
- use for transaction sessions, Single-Object and Multi-Object Operations
- Temporal (workflow engine), Durable Execution and Workflows
- TensorFlow (machine learning library), Machine Learning
- Teradata (database), Cloud-Native System Architecture, Cloud Data Warehouses
- term-partitioned indexes (see global secondary indexes)
- termination (consensus), Single-value consensus, Atomic commitment as consensus
- testing, Humans and Reliability
- thrashing (out of memory), Process Pauses
- threads (concurrency)
- actor model, Distributed actor frameworks, Event-Driven Architectures and RPC
- (see also event-driven architecture)
- atomic operations, Atomicity
- background threads, Constructing and merging SSTables
- execution pauses, Can we not simply make network delays predictable?, Process Pauses-Process Pauses
- memory barriers, Linearizability and network delays
- preemption, Process Pauses
- single (see single-threaded execution)
- three-phase commit, Three-phase commit
- three-way relationships, Property Graphs
- Thrift (data format), Protocol Buffers
- throughput, Describing Performance, Describing Load, Batch Processing
- TIBCO, Message brokers
- Enterprise Message Service, Message brokers compared to databases
- StreamBase (stream analytics), Complex event processing
- TiDB (database)
- consensus-based replication, Single-Leader Replication
- regions (sharding), Sharding
- request routing, Request Routing
- serving derived data, Serving Derived Data
- sharded secondary indexes, Global Secondary Indexes
- snapshot isolation support, Snapshot Isolation and Repeatable Read
- timestamp oracle, Implementing a linearizable ID generator
- transactions, What Exactly Is a Transaction?, Database-internal Distributed Transactions
- use of model-checking, Model checking and specification languages
- tiered storage, Setting Up New Followers, Disk space usage
- TigerBeetle (database), Summary
- deterministic simulation testing, Deterministic simulation testing
- TigerGraph (database)
- GSQL language, Graph Queries in SQL
- Tigris (object storage), Distributed Filesystems
- TileDB (database), DataFrames, Matrices, and Arrays
- time
- concurrency and, The “happens-before” relation and concurrency
- cross-channel timing dependencies, Cross-channel timing dependencies
- in distributed systems, Unreliable Clocks-Limiting the impact of garbage collection
- (see also clocks)
- clock synchronization and accuracy, Clock Synchronization and Accuracy
- relying on synchronized clocks, Relying on Synchronized Clocks-Synchronized clocks for global snapshots
- process pauses, Process Pauses-Limiting the impact of garbage collection
- reasoning about, in stream processors, Reasoning About Time-Types of windows
- event time versus processing time, Event time versus processing time, Microbatching and checkpointing, Unifying batch and stream processing
- knowing when window is ready, Handling straggler events
- timestamp of events, Whose clock are you using, anyway?
- types of windows, Types of windows
- system models for distributed systems, System Model and Reality
- time-dependence in stream joins, Time-dependence of joins
- time series data
- as DataFrames, DataFrames, Matrices, and Arrays
- column-oriented storage, Column-Oriented Storage
- time-of-day clocks, Time-of-day clocks
- hybrid logical clocks, Hybrid logical clocks
- timeliness, Timeliness and Integrity
- coordination-avoiding data systems, Coordination-avoiding data systems
- correctness of dataflow systems, Correctness of dataflow systems
- timeouts, Unreliable Networks, Glossary
- dynamic configuration of, Network congestion and queueing
- for failover, Leader failure: Failover
- length of, Timeouts and Unbounded Delays
- TimescaleDB (database), Column-Oriented Storage
- timestamps, Logical Clocks
- assigning to events in stream processing, Whose clock are you using, anyway?
- for read-after-write consistency, Reading Your Own Writes
- for transaction ordering, Synchronized clocks for global snapshots
- insufficiency for enforcing constraints, Enforcing constraints using logical clocks
- key range sharding by, Sharding by Key Range
- Lamport, Lamport timestamps
- logical, Ordering events to capture causality
- ordering events, Timestamps for ordering events
- timestamp oracle, Implementing a linearizable ID generator
- TLA+ (specification language), Model checking and specification languages
- token bucket (limiting retries), Describing Performance
- tombstones, Constructing and merging SSTables, Disk space usage, Log compaction
- topics (messaging), Message brokers, Transmitting Event Streams
- torn pages (B-trees), Making B-trees reliable
- total order, Glossary
- broadcast (see shared logs)
- limits of, The limits of total ordering
- on logical timestamps, Logical Clocks
- tracing, Problems with Distributed Systems
- tracking behavioral data, Privacy and Tracking
- (see also privacy)
- trade-offs, Trade-offs in Data Systems Architecture-Data Systems, Law, and Society
- transaction coordinator (see coordinator)
- transaction manager (see coordinator)
- transaction processing, Characterizing Transaction Processing and Analytics-Characterizing Transaction Processing and Analytics
- comparison to analytics, Characterizing Transaction Processing and Analytics
- comparison to data warehousing, Data Storage for Analytics
- transactions, Transactions-Summary, Glossary
- ACID properties of, The Meaning of ACID
- atomicity, Atomicity
- consistency, Consistency
- durability, Making B-trees reliable, Durability
- isolation, Isolation
- and derived data integrity, Timeliness and Integrity
- and replication, Solutions for Replication Lag
- compensating (see compensating transactions)
- concept of, What Exactly Is a Transaction?
- distributed transactions, Distributed Transactions-Exactly-once message processing revisited
- avoiding, Derived data versus distributed transactions, Making unbundling work, Enforcing Constraints-Coordination-avoiding data systems
- failure amplification, Maintaining derived state
- for sharded systems, Pros and Cons of Sharding
- in doubt/uncertain status, Coordinator failure, Holding locks while in doubt
- two-phase commit, Two-Phase Commit (2PC)-Three-phase commit
- use of, Distributed Transactions Across Different Systems-Exactly-once message processing
- XA transactions, XA transactions-Problems with XA transactions
- OLTP versus analytics queries, Analytics
- purpose of, Transactions
- serializability, Serializability-Performance of serializable snapshot isolation
- actual serial execution, Actual Serial Execution-Summary of serial execution
- pessimistic versus optimistic concurrency control, Pessimistic versus optimistic concurrency control
- serializable snapshot isolation (SSI), Serializable Snapshot Isolation (SSI)-Performance of serializable snapshot isolation
- two-phase locking (2PL), Two-Phase Locking (2PL)-Index-range locks
- single-object and multi-object, Single-Object and Multi-Object Operations-Handling errors and aborts
- handling errors and aborts, Handling errors and aborts
- need for multi-object transactions, The need for multi-object transactions
- single-object writes, Single-object writes
- snapshot isolation (see snapshots)
- strict serializability, What Makes a System Linearizable?
- weak isolation levels, Weak Isolation Levels-Materializing conflicts
- preventing lost updates, Preventing Lost Updates-Conflict resolution and replication
- read committed, Read Committed-Snapshot Isolation and Repeatable Read
- traversal (graphs), Property Graphs
- trie (data structure), Constructing and merging SSTables, Full-Text Search
- as SSTable index, The SSTable file format
- triggers (databases), Transmitting Event Streams
- Trino (data warehouse), Cloud Data Warehouses
- federated databases, The meta-database of everything
- query optimizer, Query languages
- use for ETL, Extract–Transform–Load (ETL)
- workflow example, Scheduling Workflows
- triple-stores, Triple-Stores and SPARQL-The SPARQL query language
- SPARQL query language, The SPARQL query language
- tumbling windows (stream processing), Types of windows
- (see also windows)
- in microbatching, Microbatching and checkpointing
- Turbopuffer (vector search), Setting Up New Followers
- Turtle (RDF data format), Triple-Stores and SPARQL
- Twitter (see X (social network))
- two-phase commit (2PC), Two-Phase Commit (2PC)-Coordinator failure, Glossary
- confusion with two-phase locking, Two-Phase Locking (2PL)
- coordinator failure, Coordinator failure
- coordinator recovery, Recovering from coordinator failure
- how it works, A system of promises
- performance cost, Distributed Transactions Across Different Systems
- problems with XA transactions, Problems with XA transactions
- transactions holding locks, Holding locks while in doubt
- two-phase locking (2PL), Two-Phase Locking (2PL)-Index-range locks, What Makes a System Linearizable?, Glossary
- confusion with two-phase commit, Two-Phase Locking (2PL)
- growing and shrinking phases, Implementation of two-phase locking
- index-range locks, Index-range locks
- performance of, Performance of two-phase locking
- type checking, dynamic versus static, Schema flexibility in the document model
U
- UDP (User Datagram Protocol)
- comparison to TCP, Network congestion and queueing
- multicast, Direct messaging from producers to consumers
- Ultima Online (game), Sharding
- unbounded datasets, Stream Processing, Glossary
- (see also streams)
- unbounded delays, Glossary
- in networks, Timeouts and Unbounded Delays
- process pauses, Process Pauses
- unbundling databases, Unbundling Databases-Multi-shard data processing
- composing data storage technologies, Composing Data Storage Technologies-Unbundled versus integrated systems
- federation versus unbundling, The meta-database of everything
- designing applications around dataflow, Designing Applications Around Dataflow-Stream processors and services
- observing derived state, Observing Derived State-Multi-shard data processing
- materialized views and caching, Materialized views and caching
- multi-shard data processing, Multi-shard data processing
- pushing state changes to clients, Pushing state changes to clients
- uncertain (transaction status) (see in doubt)
- union type (in Avro), Schema evolution rules
- uniq (Unix tool), Simple Log Analysis, Simple Log Analysis, Distributed Job Orchestration
- uniqueness constraints
- asynchronously checked, Loosely interpreted constraints
- requiring consensus, Uniqueness constraints require consensus
- requiring linearizability, Constraints and uniqueness guarantees
- uniqueness in log-based messaging, Uniqueness in log-based messaging
- Unity (data catalog), Cloud Data Warehouses
- universally unique identifiers (see UUIDs)
- Unix philosophy
- comparison to relational databases, Unbundling Databases, The meta-database of everything
- comparison to stream processing, Processing Streams
- Unix pipes, Simple Log Analysis
- compared to distributed batch processing, Scheduling Workflows
- UPDATE statement (SQL), Schema flexibility in the document model
- updates
- preventing lost updates, Preventing Lost Updates-Conflict resolution and replication
- atomic write operations, Atomic write operations
- automatically detecting lost updates, Automatically detecting lost updates
- compare-and-set (CAS), Conditional writes (compare-and-set)
- conflict resolution and replication, Conflict resolution and replication
- using explicit locking, Explicit locking
- preventing write skew, Write Skew and Phantoms-Materializing conflicts
- utilization
- batch process scheduling, Resource Allocation
- increasing through preemption, Handling Faults
- trade-off with latency, Can we not simply make network delays predictable?
- uTP protocol (BitTorrent), The Limitations of TCP
- UUIDs, ID Generators and Logical Clocks
V
- validity (consensus), Single-value consensus, Atomic commitment as consensus
- vBuckets (sharding), Sharding
- vector clocks, Version vectors
- (see also version vectors)
- and Lamport/hybrid logical clocks, Lamport/hybrid logical clocks versus vector clocks
- and version vectors, Version vectors
- vector embedding, Vector Embeddings
- vectorized processing, Query Execution: Compilation and Vectorization
- vendor lock-in, Pros and Cons of Cloud Services
- Venice (database), Serving Derived Data
- verification, Trust, but Verify-Tools for auditable data systems
- avoiding blind trust, Don’t just blindly trust what they promise
- designing for auditability, Designing for auditability
- end-to-end integrity checks, The end-to-end argument again
- tools for auditable data systems, Tools for auditable data systems
- version control systems
- merge conflicts, Manual conflict resolution
- reliance on immutable data, Concurrency control
- version vectors, Problems with different topologies, Version vectors
- dotted, Version vectors
- versus vector clocks, Version vectors
- Vertica (database), Cloud Data Warehouses
- handling writes, Writing to Column-Oriented Storage
- vertical scaling (see scaling up)
- vertices (in graphs), Graph-Like Data Models
- property graph model, Property Graphs
- video games, Pros and cons of sync engines
- video transcoding (example), Cross-channel timing dependencies
- views (SQL queries), Datalog: Recursive Relational Queries
- materialized views (see materialization)
- Viewstamped Replication (consensus algorithm), Consensus, Consensus in Practice
- use of model-checking, Model checking and specification languages
- view number, From single-leader replication to consensus
- virtual block device, Separation of storage and compute
- virtual file system, Distributed Filesystems
- comparison to distributed filesystems, Distributed Filesystems
- virtual machines, Layering of cloud services
- context switches, Process Pauses
- network performance, Network congestion and queueing
- noisy neighbors, Network congestion and queueing
- virtualized clocks in, Clock Synchronization and Accuracy
- virtual memory
- process pauses due to page faults, Latency and Response Time, Process Pauses
- Virtuoso (database), The SPARQL query language
- VisiCalc (spreadsheets), Designing Applications Around Dataflow
- Vitess (database)
- key-range sharding, Sharding by Key Range
- vnodes (sharding), Sharding
- vocabularies, Triple-Stores and SPARQL
- Voice over IP (VoIP), Network congestion and queueing
- VoltDB (database)
- cross-shard serializability, Sharding
- deterministic stored procedures, Pros and cons of stored procedures
- in-memory storage, Keeping everything in memory
- process-per-core model, Pros and Cons of Sharding
- secondary indexes, Local Secondary Indexes
- serial execution of transactions, Actual Serial Execution
- statement-based replication, Statement-based replication, Rebuilding state after a failure
- transactions in stream processing, Atomic commit revisited
W
- WAL (write-ahead log), Making B-trees reliable
- WAL-G (backup tool), Setting Up New Followers
- WarpStream (messaging), Disk space usage
- web services (see services)
- webhooks, Direct messaging from producers to consumers
- webMethods (messaging), Message brokers
- WebSocket (protocol), Pushing state changes to clients
- wide-column data model, Data locality for reads and writes
- versus column-oriented storage, Column Compression
- windows (stream processing), Stream analytics, Reasoning About Time-Types of windows
- infinite windows for changelogs, Maintaining materialized views, Stream-table join (stream enrichment)
- knowing when all events have arrived, Handling straggler events
- stream joins within a window, Stream-stream join (window join)
- types of windows, Types of windows
- WITH RECURSIVE syntax (SQL), Graph Queries in SQL
- Word2Vec (language model), Vector Embeddings
- workflow engines, Durable Execution and Workflows
- Airflow (see Airflow (workflow scheduler))
- batch processing, Scheduling Workflows
- Camunda (see Camunda (workflow engine))
- Dagster (see Dagster (workflow scheduler))
- durable execution, Durable Execution and Workflows
- ETL (see ETL (extract-transform-load))
- executor, Durable Execution and Workflows
- orchestrators, Durable Execution and Workflows, Batch Processing
- Orkes (see Orkes (workflow engine))
- Prefect (see Prefect (workflow scheduler))
- reliance on determinism, Deterministic simulation testing
- Restate (see Restate (workflow engine))
- Temporal (see Temporal (workflow engine))
- working set, Sorting Versus In-memory Aggregation
- write amplification, Write amplification
- write path (derived data), Observing Derived State
- write skew (transaction isolation), Write Skew and Phantoms-Materializing conflicts
- characterizing, Write Skew and Phantoms-Phantoms causing write skew, Decisions based on an outdated premise
- examples of, Write Skew and Phantoms, More examples of write skew
- materializing conflicts, Materializing conflicts
- occurrence in practice, Maintaining integrity in the face of software bugs
- phantoms, Phantoms causing write skew
- preventing
- in snapshot isolation, Decisions based on an outdated premise-Detecting writes that affect prior reads
- in two-phase locking, Predicate locks-Index-range locks
- options for, Characterizing write skew
- write-ahead log (WAL), Making B-trees reliable, Write-ahead log (WAL) shipping
- in durable execution, Durable execution
- writes (database)
- atomic write operations, Atomic write operations
- detecting writes affecting prior reads, Detecting writes that affect prior reads
- preventing dirty writes with read committed, No dirty writes
- WS-* framework, The problems with remote procedure calls (RPCs)
- WS-AtomicTransaction (2PC), Two-Phase Commit (2PC)
X
- X (social network)
- constructing home timelines (example), Case Study: Social Network Home Timelines, Deriving several views from the same event log, Table-table join (materialized view maintenance), Materialized views and caching
- cost of joins, Denormalization in the social networking case study
- describing load, Describing Load
- fault tolerance, Fault Tolerance
- performance metrics, Describing Performance
- DistributedLog (event log), Using logs for message storage
- Snowflake (ID generator), ID Generators and Logical Clocks
- XA transactions, Two-Phase Commit (2PC), XA transactions-Problems with XA transactions
- heuristic decisions, Recovering from coordinator failure
- problems with, Problems with XA transactions
- xargs (Unix tool), Simple Log Analysis
- XFS (file system), Distributed Filesystems
- XGBoost (machine learning library), Machine Learning
- XML
- binary variants, Binary encoding
- data locality, Data locality for reads and writes
- encoding RDF data, The RDF data model
- for application data, issues with, JSON, XML, and Binary Variants
- in relational databases, Schema flexibility in the document model
- XML databases, Relational Model versus Document Model, Query languages for documents
- Xorq (query engine), The meta-database of everything
- XPath, Query languages for documents
- XQuery, Query languages for documents
Y
- Yahoo
- response time study, Average, Median, and Percentiles
- YARN (job scheduler), Distributed Job Orchestration, Separation of application code and state
- ApplicationMaster, Distributed Job Orchestration
- Yjs (CRDT library), Pros and cons of sync engines
- YugabyteDB (database)
- hash-range sharding, Sharding by hash range
- key-range sharding, Sharding by Key Range
- multi-leader replication, Geographically Distributed Operation
- request routing, Request Routing
- sharded secondary indexes, Global Secondary Indexes
- tablets (sharding), Sharding
- transactions, What Exactly Is a Transaction?, Database-internal Distributed Transactions
- use of clock synchronization, Synchronized clocks for global snapshots
Z
- Zab (consensus algorithm), Consensus, Consensus in Practice
- use in ZooKeeper, Implementing Linearizable Systems
- zero-copy, Formats for Encoding Data
- zero-disk architecture (ZDA), Setting Up New Followers
- ZeroMQ (messaging library), Direct messaging from producers to consumers
- zombies (split brain), Fencing off zombies and delayed requests
- zones (cloud computing) (see availability zones)
- ZooKeeper (coordination service), Coordination Services-Service discovery
- generating fencing tokens, Fencing off zombies and delayed requests, Using shared logs, Coordination Services
- linearizable operations, Implementing Linearizable Systems
- locks and leader election, Locking and leader election
- observers, Service discovery
- use for service discovery, Load balancers, service discovery, and service meshes, Service discovery
- use for shard assignment, Request Routing