读书笔记:Design Data-intensive Application[持续更新]

这是一个持续更新的读书笔记.

书名: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems(DDIA).

原书网站: link

……… 脑子里或者说工具箱里里的东西很多，但是却非常乱，概念实现经常记混，我经常在想：如果能总结归类，用更科学的方法去记住所有我学过的技术，那就好了。。。然后我就遇到了这本书，我读了三遍，然后感觉一切都理顺了… 如果我先读了这本书，才去学上边这些东西，那么我花费的时间将是原来的3分之一到5分之一左右。这本书在广度和一定程度的深度（如果对进一步深度有要求，可以读完此书的reference）上，概括和总结了数据系统，分布式或非分布式环境下面临的本质问题，和解决方案的分类，看完这本，我开始理解了“为什么”这么多的分布式系统要这么设计。而这本书上千的引用论文，给我指明了一条系统学习理论的明路 ………

Chapter 1

3 goals of Data Intensive Application(DIA)

Reliability

Continuing to work correctly, even when things go wrong.

“Wrong”
- Hardware: rare, little correlation. Remedy: Hardware duplication
- Software: wrong assumption about the system/bugs
- Human: most frequent.
  - Remedy:
    - 1. Good system design
      2. Decouple sensitive environment/sandbox
      3. Test
      4. Easy and quick recovery
      5. Good monitoring
      6. Good management practice

Scalability

system’s ability to cope with increased load.

Metrics for “Load”
- Latency/response time of Xth percentile (usually 99th Percentile)
  - Best to estimate continuously over a window (t-digest/HdrHistogram)
- Throughput

Maintainability

Operability: Make it easy for operations teams to keep the system running smoothly.
Simplicity; Make it easy for new engineers to understand the system
Evolvability: Make it easy for engineers to make changes to the system in the future

Chapter 2: Data Structure

Hash Indexes

Key idea: In-memory hashtable of keys, whereby values are byte offset of its on-disk stored file. Log-structured means append-only.
How to avoid run out of disk space (since append-only) : break log into segments, and each segments has its own in-memory hashtable. We close a segment and open a new segment when size gets too big.
How to keep number of segments small: merging old segments together by discarding old values of the same key.
How to retrieve value of a key: firstly look-up in the newest in-memo hashtable, then old ones.

SSTable

Key idea: make key-value pairs sorted by key. So in-memo hashtable can only keep a small number of keys.
Constructing and maintaining SSTables/Memtable
1. Write to in-memory balance tree structure, also called “memtable”
2. When memtable gets too big, save it to disk as SSTable. Create a new memtable.
3. Read request will search memtable, if not found then in newest SSTable
4. Need background process merging and compacting SSTable
5. Keep a log of events in case of crash. Refresh log once memtable is written out as an SSTable
Optimization
- Slow to know a key not in DB: Bloom filter

B-Tree

Key idea: Data are break-up into fix-size pages. For non-leaf pages, each one has sorted indexes and pointers to children pages. Each leaf page contains indexes and their values. All data are on disk

Look-up: n-ary search
Insert: look-up then insert. If full, split into two pages.
Write-ahead log(WAL):

an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the data‐ base comes back up after a crash, this log is used to restore the B-tree back to a con‐ sistent state

B-Tree vs. LSM-Tree

LSM-Tree ✔️B-Tree ✖️:
1. B-Tree must write data twice (WAL and on-disk tree page)
2. LSM-Tree sometimes has lower write amplification. B-Tree often needs to write several pages but LSM-Tree sequentially write compact SSTables.
3. B-Tree leaves unused fragmentation
LSM-Tree ✖️ B-Tree ✔️:
1. LSM-Tree’s compaction process affects concurrent access of disk. B-Tree more predictable.
2. At high write throughput. compaction uses up disk bandwidth
3. If not configured properly, LSM-Tree has too many SSTable that uses up disk space and slow down reads
4. B-Tree has exactly one index in one place, which makes it easy to impose transactional isolation using locks on keys.
5. B-Tree provide consistently good performance for many workloads.

Other Index Structures

Heap file: referring value outside indexes
- What: data with no order
- Why: value could be the actual row (document, vertex), or it could be a reference to the row stored in heap file.
- https://blog.csdn.net/qq910894904/article/details/39312901
Clustered/Non-clustered index/Covering index
- Store row directly within index. e.g. in MySQL’s InnerDB, the primary key of a table is default a clustered index
- Covering index: A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index)

Chapter 5 Data Duplication

Leader based/master-slave

Synchronous Versus Asynchronous Replication

Sync: leader waits for follower to finish update, then responde
- Cons: leader halts/waits for follower: if one follower fails, all nodes wait
- Remedy: semi-sync. make only one sync; make other follower node sync once the old sync follower fails

Setting up new follower

Leader takes a snapshot
Copy to the new follower
Follower requests a log of all data change since the snapshot
Follower process the backlog and “caught up”

Leader failover

Determining that the leader has failed
Choosing a new leader. : involve consensus problem.
Reconfiguring the system to use the new leader.: route all request to new leader and need to tell older leader to stepdown if it ever comes back

Problems with failover

Async writes on older leader not propergates to new leader
Two nodes think they are leader at the same time
What is the best timeout?

Replication Log

Statement based(replicate execute UPDATE, DELETE ,etc.): not used today.
- Problems (non-deterministic statement, side effect, execution order)
Write-ahead log (WAL) shipping:
- Problem: The main disadvantage is that the log describes the data on a very low level: a WAL con‐ tains details of which bytes were changed in which disk blocks.
Logical (row-based) log replication: a sequence of records describing writes to database tables at the granularity of a row:
Trigger-based replication: bind application level code to database changes
- Problems: more overhead, bugs.

To be continued….

DDIA Reading Notes[Updating]