DDIA Chapter 01 Reliable, Scalable and Maintainable Applications


0x01 Reliability

Hardware Faults

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
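A quick sanity check on that arithmetic: taking an MTTF of roughly 27 years ≈ 10,000 days per disk, each disk fails with probability about 1/10,000 on any given day, so a cluster of 10,000 disks sees on the order of 10,000 × 1/10,000 = 1 disk failure per day.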

Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power.

First, evaluate the likelihood of hardware faults using metrics such as MTTF.

For smaller nodes and clusters, component-level redundancy is usually enough: RAID for the disks, hot-standby power supplies, hot-swappable CPUs, and even diesel generators for backup power.
{: .notice--success}

However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults.

Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy.

From the system's point of view, what fails is often not just an individual hardware component, so we need standby nodes and machines as well. A system that can keep running while a machine is down also makes rolling upgrades much easier.

Software Errors

Software errors are more systematic: a single bug can trigger failures on every node in the system.

Software errors usually arise because the developers made some incorrect assumptions about the environment the software runs in.

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances.

Software errors can be mitigated by carefully reasoning about the assumptions and interactions within the system; by thorough testing, process isolation, and planning for processes to crash and restart; and by measuring, monitoring, and analyzing how the system behaves in production.

There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production.

If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found.
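As a toy illustration of such a self-check (a minimal sketch with made-up names, not an implementation from the book), here is a queue that counts messages in and out and raises an alert when the counters and the actual backlog disagree:

```python
import logging
from collections import deque

class SelfCheckingQueue:
    """Toy message queue that continually audits its own guarantee."""

    def __init__(self):
        self._items = deque()
        self.enqueued = 0  # messages in
        self.dequeued = 0  # messages out

    def put(self, msg):
        self._items.append(msg)
        self.enqueued += 1

    def get(self):
        msg = self._items.popleft()
        self.dequeued += 1
        return msg

    def audit(self):
        # The guarantee: messages in == messages out + messages still queued.
        # If the numbers disagree, raise an alert rather than failing silently.
        if self.enqueued != self.dequeued + len(self._items):
            logging.error(
                "queue invariant violated: in=%d out=%d backlog=%d",
                self.enqueued, self.dequeued, len(self._items),
            )
```

In a real system the audit would run on a schedule against durable counters, but the idea is the same: constantly verify the guarantee and alert on any discrepancy.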

Human Errors

  • Design systems in a way that minimizes opportunities for error.
  • Decouple the places where people make the most mistakes from the places where they can cause failures.
  • Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests.
  • Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure.
  • Set up detailed and clear monitoring, such as performance metrics and error rates.

0x02 Scalability

Describing Load

To describe load, we first have to choose load parameters; the best choice of load parameters depends on the architecture of the system.

Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.

Take Twitter as an example:

  • Post tweet: A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
  • Home timeline: A user can view tweets posted by the people they follow (300k requests/sec).

Two architectures have been proposed for Twitter: a join-query approach and a per-recipient timeline-cache (fan-out) approach:

  1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2, you could write a query such as:
SELECT tweets.*, users.* FROM tweets 
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id 
WHERE follows.follower_id = current_user
  2. Maintain a cache for each user’s home timeline—like a mailbox of tweets for each recipient user (see Figure 1-3). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.

[Figure DDIA1-1: the two approaches to implementing a Twitter home timeline]

This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it’s preferable to do more work at write time and less at read time.
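A minimal sketch of approach 2's write and read paths, using in-memory Python dicts as stand-ins for the follower store and the per-recipient home timeline caches (all names here are illustrative, not Twitter's actual implementation):

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the follower graph and the
# per-recipient home timeline caches of approach 2.
followers = defaultdict(set)        # author_id -> set of follower ids
home_timelines = defaultdict(list)  # user_id -> list of tweets, newest first

def post_tweet(author_id, text):
    """Write path: fan the new tweet out to every follower's timeline cache."""
    tweet = {"author": author_id, "text": text}
    for follower_id in followers[author_id]:
        home_timelines[follower_id].insert(0, tweet)

def home_timeline(user_id, limit=20):
    """Read path: cheap, because the result was precomputed at write time."""
    return home_timelines[user_id][:limit]
```

The celebrity problem is also visible here: one post_tweet call by a user with millions of followers becomes millions of cache writes, which is why the hybrid approach described next exempts such accounts from the fan-out.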

The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out.

Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance.

Describing Performance

The difference between latency and response time:

Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.
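Put as a rough decomposition (an informal reading of the passage above, not a formula from the book): response time ≈ network delays + queueing delay + service time, and the queueing/waiting portion is what the author calls latency.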

How response-time percentiles are used in practice in SLAs and SLOs:

An SLA may state that the service is considered to be up if it has a median response time of less than 200ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.
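To make the percentile arithmetic concrete, here is a small sketch (illustrative only, not from the book) that computes the median and 99th percentile of a window of measured response times and checks them against the example thresholds above:

```python
import math

def percentile(times_ms, p):
    """Nearest-rank percentile of a list of response times in milliseconds."""
    ordered = sorted(times_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed nearest rank
    return ordered[rank - 1]

def meets_sla(times_ms):
    """Check one measurement window against the example SLA:
    median under 200 ms and 99th percentile under 1,000 ms."""
    return percentile(times_ms, 50) < 200 and percentile(times_ms, 99) < 1000
```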

A common pitfall in load testing:

The simulated clients in a load test should not send requests serially.

If the client cannot send test requests independently of the response time, the server's request queue is artificially kept short, and the test no longer reproduces the request pressure the server would face in a real environment.

For more on how test environments fail to reproduce real-world latency, see: Everything You Know About Latency Is Wrong – Brave New Geek

When generating load artificially in order to test the scalability of a system, the load generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements.
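A sketch of what "sending requests independently of the response time" looks like in practice: an open-loop load generator that fires requests on a fixed schedule, with send_request standing in for whatever client call is being benchmarked (illustrative only):

```python
import threading
import time

def send_request():
    """Placeholder for the real client request being benchmarked."""
    ...

def open_loop_load(rate_per_sec, duration_sec):
    """Fire requests at a fixed rate, regardless of how long responses take.

    Each request runs in its own thread, so a slow response never delays
    the next send; the queueing the server would experience in production
    is preserved instead of being artificially shortened.
    """
    interval = 1.0 / rate_per_sec
    deadline = time.monotonic() + duration_sec
    next_send = time.monotonic()
    while next_send < deadline:
        threading.Thread(target=send_request).start()
        next_send += interval
        time.sleep(max(0.0, next_send - time.monotonic()))
```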

In practice, a client often calls several backend services in parallel, and the overall latency the user sees is determined by the slowest of those calls.

Measured on the server side, perhaps only one or two out of several hundred backend calls exceed their latency target, yet the user experience can still be very poor. This effect is known as tail latency amplification; see The Tail at Scale | February 2013 | Communications of the ACM.
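The arithmetic behind the amplification: if a single backend call exceeds its 99th-percentile latency only 1% of the time, a user request that has to wait for 100 such calls in parallel is slowed down whenever any one of them is slow, which happens with probability 1 - 0.99^100 ≈ 63%.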

Approaches for Coping with Load

Scaling up vs. scaling out

  • Scaling up: vertical scaling, moving to a more powerful machine

  • Scaling out: horizontal scaling, distributing the load across multiple smaller machines

Manual scaling vs. elastic scaling

  • Elastic systems: they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system).

Basic principle: there is no silver bullet

The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.

0x03 Maintainability

A system with good maintainability should have the following three characteristics:

  • Operability: make it easy for the operations team to keep the system running smoothly, without requiring lots of manual work during normal operation.
  • Simplicity: make it easy for engineers to understand the system, removing overly complex implementations wherever possible. (This is not the same quality as the usability of a user interface.)
  • Evolvability: make it easy for engineers to change the system in the future, adapting it to unanticipated use cases. Also known as extensibility, modifiability, or plasticity.

Operability

A system with good operability greatly reduces the operational workload.

That means a smaller operations team can take responsibility for a larger system.

Even in an ideal world where every operational task is automated, it still takes a skilled operations team to get the system running correctly in the first place and to refine it into sensible modules later on.

Operations responsibilities

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state
  • Tracking down the causes of problems, such as system failures or degraded performance
  • Keeping the system up to date, including security patches and framework upgrades
  • Understanding how systems interact with each other, so that a problematic change can be avoided before it causes damage
  • Anticipating future problems and solving them before they occur, e.g., capacity planning
  • Establishing good practices for deployment, configuration, and management, and writing tools to make sure those practices are followed
  • Performing complex maintenance tasks, such as moving the system from one platform to another
  • Maintaining the security of the system as configuration changes are made
  • Defining processes that make operations predictable and help keep the production environment stable
  • Preserving the team's knowledge of the system, especially among new members

Operations best practices

  • Providing visibility into the system's internal state and runtime behavior through good monitoring
  • Providing good support for automation and integration with standard tools
  • Avoiding dependency on any single machine (allowing machines to be taken down for maintenance while the system as a whole keeps running uninterrupted)
  • Providing good documentation and an easy-to-understand operational model ("If I do X, Y will happen")
  • Providing good default behavior, while also giving administrators the freedom to override defaults when needed
  • Self-healing where appropriate, while also giving administrators manual control over the system state when needed
  • Exhibiting predictable behavior and minimizing surprises

Simplicity

Sources of complexity

There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more.

Further discussion on analyzing system complexity

Abstraction: the ultimate answer to complexity

The best way to deal with system complexity is abstraction.

For example, SQL is an abstraction over a data system, and a high-level programming language is an abstraction over computer hardware.

A good abstraction hides complex implementation details behind a clean, easy-to-understand facade.

A good abstraction not only saves us from re-implementing similar things over and over; it also leads to higher-quality software, because every application built on top of it benefits from improvements to the abstraction.

Good abstractions are hard to find, but they can genuinely solve the problem of system complexity at its root.

Evolvability: making later changes easier

For a small system, the ease of modifying its internals and adding features is usually called agility; for a larger system, the same property is usually called evolvability.

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability.

0x05 Summary

When reviewing a piece of software, we need to look not only at its functional requirements; from an operations perspective, the nonfunctional requirements matter even more:

  • Functional requirements: what the system should do, such as allowing data to be stored, retrieved, searched, and processed in various ways
  • Nonfunctional requirements: general properties such as security, reliability, compliance, scalability, compatibility, and maintainability
    • Reliability: tolerating faults that come from hardware (random and uncorrelated), software (systematic bugs that are hard to eradicate), and humans (who inevitably make mistakes from time to time).
    • Scalability: describing load and performance quantitatively, so that the system can stay reliable under high load by flexibly adding capacity.
    • Maintainability: good maintainability means good visibility into the system's state and effective ways of dealing with decay and putting out fires.