Companies want to capture user happiness in metric form to provide the optimal level of reliability for their software that maximises user happiness. In this series of posts, I’m writing about using Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in data-driven negotiations between Engineering, Product and Business to achieve this goal.

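A Customer Reliability Engineering principle says that it’s the user experience that determines the reliability of our services. Since we can’t measure user happiness directly, SLIs are proxies that help answer the question: “Is our service working as our users expect it to?” The closer to the user we measure our system’s performance, the more accurately our SLIs will reflect user happiness.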

Why social media is not great at measuring user happiness

Social media channels are not good indicators of users’ (un)happiness. We want indicators to be quantifiable and predictable, ideally in a linear relationship with user happiness. Predictability is key and good indicators show long-term trends clearly.

“Social media metrics have several drawbacks. The data isn’t timely and is often dubious and sometimes outright malicious. Competitors can spam Downdetector during your public launches and newspapers pick up on it. It’s powered by crowdsourced user reports of ‘problems’ but isn’t targeted at specific areas of the site, just that there are problems. Monitoring is faster at detecting and measuring incident resolution. Synthetics are more targeted and reliable alternatives.”

Ben Cordero, Staff Software Engineer, SRE, Deliveroo

The anatomy of a good SLI

The main challenge in choosing “good” SLIs is system complexity. Having a lot of SLIs is a no-go because they become a sea of noise.

Commonly chosen types of SLIs that aren’t actually good

❌ System metrics: Tempting because a sharp change is often associated with outages. But most users don’t care about “CPU being 100%”; they care about the system being slow:

  • Load average
  • CPU utilisation
  • Memory usage
  • Bandwidth

❌ Internal state: Data is noisy; there are too many reasons why large changes can occur. They can also be temporary while the system is scaling up or down. None of them has a predictable relationship with the happiness of our users.

  • Thread pool fullness
  • Request queue length
  • Outages

The SLI equation

The proportion of all valid events that were good.

Why we care about valid events

Some events recorded by monitoring tools need to be excluded so that they don’t consume the Error Budget (more about that later), for example, bot requests or health checks. “Valid” events make this possible.

  • For HTTP requests, validity is often determined by request params, e.g. hostname or requested path.
  • For data processing systems, validity is determined by the selection of inputs.
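
As a minimal sketch of the equation and of validity filtering, the snippet below drops health checks and bot traffic before computing the ratio. The `Request` record, the `/healthz` path and the rules in `is_valid` and `is_good` are illustrative assumptions, not part of any real monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    path: str
    user_agent: str
    status: int


def is_valid(req: Request) -> bool:
    """Hypothetical validity rule: ignore health checks and bot traffic."""
    return req.path != "/healthz" and "bot" not in req.user_agent.lower()


def is_good(req: Request) -> bool:
    """Hypothetical goodness rule: anything except a server error is good."""
    return req.status < 500


def sli(requests: list[Request]) -> float:
    """SLI = good events / valid events, expressed as a percentage."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return 100.0  # assumption: no valid traffic means nothing failed
    good = sum(1 for r in valid if is_good(r))
    return 100.0 * good / len(valid)


requests = [
    Request("/checkout", "Mozilla/5.0", 200),
    Request("/checkout", "Mozilla/5.0", 503),
    Request("/healthz", "kube-probe/1.27", 503),  # excluded: health check
    Request("/search", "Googlebot/2.1", 500),     # excluded: bot
    Request("/search", "Mozilla/5.0", 200),
    Request("/search", "Mozilla/5.0", 200),
]
print(sli(requests))  # 75.0
```

Without the validity filter, the failing health check and bot request would have counted against the service even though no human user was affected.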

SLIs are better aggregated over a reasonably long time window to smooth out the noise from the underlying data. This is because SLIs need to provide a clear definition of good and bad events. It’s much harder to set a meaningful threshold for metrics with high variance and poor correlation with user experience.

In the example above, the good metric has a noticeable dip that matches the time span of the outage.

  • It has less noise because the data has been smoothed over a time window.
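  • It has a narrow range of values during normal operation, which is noticeably different from the range during an outage. This makes it easier to set thresholds.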
  • It tracks the performance of the service against the user expectations accurately and predictably.
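
To illustrate the smoothing mentioned in the first point, here is a small sketch that computes the SLI over a sliding window instead of per minute. The per-minute counts and the five-minute window are made-up values for illustration only.

```python
from collections import deque


def rolling_sli(good_per_minute, valid_per_minute, window=5):
    """Yield the SLI (%) over a sliding window of the last `window` minutes."""
    good_win, valid_win = deque(maxlen=window), deque(maxlen=window)
    for good, valid in zip(good_per_minute, valid_per_minute):
        good_win.append(good)
        valid_win.append(valid)
        total_valid = sum(valid_win)
        yield 100.0 * sum(good_win) / total_valid if total_valid else 100.0


# Hypothetical per-minute counts: one noisy minute, then recovery.
good = [100, 100, 60, 100, 100]
valid = [100, 100, 100, 100, 100]
print([round(x, 1) for x in rolling_sli(good, valid, window=5)])
# [100.0, 100.0, 86.7, 90.0, 92.0]
```

The raw per-minute value drops to 60% for a single minute, while the windowed value moves less sharply and recovers gradually. The window length is a trade-off: longer windows reduce noise but react more slowly.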

The bad metric forces us to either set a tight threshold and run the risk of false positives or set a loose threshold and risk false negatives. Worse, choosing the middle ground means accepting both risks.

Five ways to measure SLIs and their trade-offs

The closer to the user we measure, the better approximation of their happiness we’ll have. The options below are listed in increasing proximity to users.

1. Server-side logs

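Server logs are one of the ways to track the reliability of complex user journeys that involve many request-response interactions during long-running sessions.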

Pros

  • Even if we haven’t measured it previously, we can still process request logs retroactively to backfill the SLI data and get an idea of the historical performance.
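  • If an SLI needs convoluted logic to determine which events were good, that logic can be written into the code of the log processing jobs and exported as a much simpler “good events” counter.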

Cons

  • The engineering effort to process logs is significant.
  • Reconstructing user sessions requires an even bigger effort.
  • Ingestion and processing add significant latency between an event occurring and being observed in the SLI, making log-based SLIs unsuitable for triggering an emergency response.
  • Requests that don’t make it to the application servers can’t be observed by log-based SLIs at all.
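
A rough sketch of the backfill idea from the pros above: daily availability is recomputed from raw access-log lines after the fact. The log format and the `/healthz` exclusion are assumptions made for this example; real logs need a parser that matches their actual format.

```python
from collections import defaultdict

# Hypothetical access-log format: "<ISO timestamp> <method> <path> <status> <latency_ms>"
LOG_LINES = [
    "2024-05-01T10:00:01 GET /cart 200 120",
    "2024-05-01T10:00:02 GET /cart 500 30",
    "2024-05-02T09:15:00 POST /checkout 200 450",
    "2024-05-02T09:15:05 GET /healthz 200 2",
]


def backfill_daily_availability(lines):
    """Return {date: availability %} computed retroactively from raw log lines."""
    good, valid = defaultdict(int), defaultdict(int)
    for line in lines:
        timestamp, _method, path, status, _latency = line.split()
        if path == "/healthz":  # not a valid event for the SLI
            continue
        day = timestamp.split("T")[0]
        valid[day] += 1
        if int(status) < 500:
            good[day] += 1
    return {day: 100.0 * good[day] / valid[day] for day in valid}


print(backfill_daily_availability(LOG_LINES))
# {'2024-05-01': 50.0, '2024-05-02': 100.0}
```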

2. Application-level metrics

Application-level metrics capture the performance of individual requests.

Pros

  • Easy to add.
  • They don’t have the same measurement latency as log processing.

Cons

  • Can’t easily measure complex multi-request user journeys by exporting metrics from stateless servers.
  • There’s a conflict of interest between generating the response and exporting metrics related to the response content.
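
As an illustration of how little code request-level metrics need, the sketch below wraps a handler with two in-process counters. The `instrumented` decorator, the counter names and the `get_product` handler are hypothetical; a real service would export the counters through whatever metrics library it already uses.

```python
from collections import Counter

# In-process counters standing in for a real metrics library.
METRICS = Counter()


def instrumented(handler):
    """Wrap a request handler so every call updates the valid/good counters."""
    def wrapper(request):
        METRICS["valid_requests"] += 1
        response = handler(request)  # an exception skips the line below
        METRICS["good_requests"] += 1
        return response
    return wrapper


@instrumented
def get_product(request):
    """Hypothetical handler that fails for unknown product ids."""
    if request["id"] < 0:
        raise ValueError("unknown product")
    return {"id": request["id"], "name": "example"}


for product_id in (1, 2, -1):
    try:
        get_product({"id": product_id})
    except ValueError:
        pass

print(METRICS["good_requests"], METRICS["valid_requests"])  # 2 3
```

Note that the same code path both serves the response and judges it, which is the conflict of interest listed in the cons.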

3. Cloud provider’s front-end load balancer

Pros

  • The cloud’s load balancer has detailed metrics and historical data.
  • The engineering effort to get started is smaller.

Cons

  • Most load balancers are stateless and can’t track sessions, so they don’t have insight into the response data.
  • They rely on setting correct metadata in the response envelope to determine if responses were good.

4. Synthetic clients

Synthetic clients can emulate a user’s interaction with the service to confirm if a full journey has been successful and verify if the responses were good, outside of our infrastructure.

Pros

  • Can monitor a new area of the website or application before getting real traffic, so there’s time to remedy availability and performance issues.
  • Easy to simulate a user in a certain geography.
  • Helpful to assess the reliability of third parties like payment processors, recommendation engines, business intelligence tools etc.

Cons

  • A synthetic client is only an approximation of user behaviour. Users are human, so they do unexpected things. Synthetics might not be enough as the sole measurement strategy.
  • Covering all the edge-cases of the user journey with a synthetic client is a huge engineering effort that usually devolves into integration testing.

5. Client-side telemetry

Another option is to instrument the clients with Real User Monitoring, or RUM tags, to provide telemetry for the SLI.

Pros

  • A far more accurate measure of the user experience.
  • Helpful to assess the reliability of third-party systems involved in the user journey.

Cons

  • Telemetry from clients can incur significant measurement latency, especially for mobile clients. Waking up the device every few seconds is detrimental to battery life and user trust.
  • It’s unsuitable for emergency responses.
  • It captures many factors outside of our direct control, lowering the signal to noise ratio of SLIs. For example, mobile clients could suffer from poor latency and high error rates, but we can’t do much about it, so we have to relax our SLOs to accommodate these situations.

The SLI buffet

To create a good SLI, we need a specification and an implementation.

  • The specification is the desired outcome from a user perspective.
  • The implementation is the specification plus a way to measure it.

1. Request/Response

Example: an HTTP service where the user interacts with the browser or a mobile app to send API requests and receive responses.

1.1. Availability

There are two ways to measure availability: time-based and aggregate.

Time-based availability

How long the service was unavailable for a period of time.

Aggregate availability

The proportion of valid requests served successfully.

Implementation
  1. Which of the requests the system serves are valid for the SLI?
  2. What makes a response successful?

Aggregate availability is a more reasonable approximation of unplanned downtime from a user perspective because most systems are at least partially up all the time. It also provides a consistent metric for systems that don’t have to run all the time, like batch processing.
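
The difference between the two measures can be sketched with a partial outage. The numbers below are invented: a 30-minute incident during which half of the requests fail, with traffic assumed constant at 1,000 valid requests per minute.

```python
def time_based_availability(minutes_up, minutes_total):
    """Share of the period during which the service counted as up (%)."""
    return 100.0 * minutes_up / minutes_total


def aggregate_availability(good_requests, valid_requests):
    """Share of valid requests that were served successfully (%)."""
    return 100.0 * good_requests / valid_requests


# Hypothetical day with a 30-minute partial outage (half of requests failing).
minutes_total = 24 * 60
outage_minutes = 30
valid = minutes_total * 1000           # assumed constant traffic
failed = outage_minutes * 1000 // 2    # only half of the outage traffic failed

print(round(time_based_availability(minutes_total - outage_minutes, minutes_total), 3))
# 97.917
print(round(aggregate_availability(valid - failed, valid), 3))
# 98.958
```

The time-based figure treats the whole 30 minutes as down, even though half of the users were served successfully; the aggregate figure counts only the requests that actually failed.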

When considering the availability of an entire user journey, we need to also consider the voluntary exit scenarios before completing the journey.

1.2. Latency

The proportion of valid requests served faster than a threshold.

Implementation
  1. Which of the requests the system serves are valid for the SLI?
  2. When does the timer for measuring latency start and stop?

When setting a latency threshold, we need to consider the long tail of requests, where 95% or 99% of requests must respond faster than a threshold for users to be happy. The relationship between user happiness and latency follows an S-curve, so it’s good to also set thresholds for 75% to 90% of requests to describe it in a more nuanced way.
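
A minimal sketch of a latency SLI checked against more than one threshold follows. The latencies, the thresholds (250 ms and 1,000 ms) and the targets (75% and 90%) are illustrative assumptions only.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Percentage of valid requests served faster than `threshold_ms`."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


# Hypothetical latencies (ms) for ten valid requests.
latencies = [80, 95, 110, 120, 150, 180, 240, 300, 900, 2500]

# Hypothetical targets: 75% of requests under 250 ms, 90% under 1000 ms.
targets = {250: 75.0, 1000: 90.0}
for threshold_ms, target in targets.items():
    measured = latency_sli(latencies, threshold_ms)
    print(threshold_ms, measured, measured >= target)
# 250 70.0 False
# 1000 90.0 True
```

In this made-up sample the long tail is within target while the typical experience is not, which a single threshold would have hidden.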

Things that affect latency:

  • Pre-fetching
  • Caching
  • Load spikes

RobinHood: Tail Latency-Aware Caching lists several strategies to maintain low request tail latency, such as load balancing, auto-scaling, caching and prefetching. The difficulty lies in user journeys with multiple requests across multiple backends, where the latency of the slowest request defines the latency of a journey. Even when requests can be parallelised among backends and all backends have low tail latency, the resulting tail latency can still be high. The Tail at Scale offers interesting techniques to tolerate latency variability in large-scale web services.
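
A back-of-the-envelope sketch shows why fan-out amplifies tail latency. It assumes each backend answers within its threshold 99% of the time and that backend latencies are independent, which real systems only approximate.

```python
def prob_journey_is_slow(per_backend_fast_fraction, backends):
    """Probability that at least one of `backends` parallel requests is slow.

    Assumes backend latencies are independent of each other.
    """
    return 1.0 - per_backend_fast_fraction ** backends


for n in (1, 10, 100):
    print(n, round(prob_journey_is_slow(0.99, n), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

Under these assumptions, a journey that waits for 100 backends is slow almost two thirds of the time, even though every individual backend looks healthy at its 99th percentile.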

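Latency and batch processing

Latency is equally important to track for data processing or asynchronous queue tasks. For example, if we have a batch processing pipeline that runs daily, that pipeline shouldn’t take more than a day to complete.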

We must be careful when reporting the latency of long-running operations only on their eventual success or failure. For example, if the threshold for operational latency is 30 minutes, but the latency is only reported after the process fails two hours later, there’s a 90-minute window where the operation has missed expectations without being measured.
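
One way to close that measurement gap, sketched here under assumed data shapes, is to evaluate operations that are still running against the threshold instead of waiting for them to finish. The tuple layout and the operation ids are hypothetical.

```python
def late_operations(operations, now, threshold_s):
    """Return ids of operations that have already missed the latency threshold.

    Each operation is (op_id, started_at, finished_at); finished_at is None
    while the operation is still running. Times are in seconds.
    """
    late = []
    for op_id, started_at, finished_at in operations:
        end = finished_at if finished_at is not None else now
        if end - started_at > threshold_s:
            late.append(op_id)
    return late


# Hypothetical operations; the threshold is 30 minutes (1800 seconds).
ops = [
    ("export-1", 0, 600),      # finished in 10 minutes: good
    ("export-2", 0, None),     # still running after 45 minutes: already late
    ("export-3", 1800, None),  # running for 15 minutes: not late yet
]
print(late_operations(ops, now=2700, threshold_s=1800))  # ['export-2']
```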

1.3. Quality

The proportion of valid requests served without degraded quality.

Implementation
  1. Which of the requests the system serves are valid for the SLI?
  2. How to determine whether the response was served without degraded quality?

Sometimes we trade off the quality of the user response with CPU or memory utilisation. We need to track this graceful degradation of service with a quality SLI.

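Users might not be aware of the degradation until it becomes severe. However, degradation can still impact the bottom line. For example, degraded quality could mean serving fewer ads to users, resulting in lower click-through rates.

It’s easier to express this SLI in terms of bad events rather than good ones. The mechanism used by the system to degrade response quality should also mark the responses as degraded and increment metrics to count them.

As with latency, response degradation falls along a spectrum with multiple thresholds. For example, consider a service that fans out incoming requests to 10 backends, each with a 99.9% availability target and the ability to reject requests when overloaded. We might choose to serve 99% of service responses with no missing backend responses, and 99.9% with no more than one missing response.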
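
The two targets in that example can be sanity-checked with a short binomial calculation. This is only a sketch: it assumes the 10 backends fail independently and that each one exactly meets its 99.9% target.

```python
from math import comb


def prob_at_most_missing(k, backends=10, availability=0.999):
    """Probability that at most `k` of `backends` responses are missing.

    Assumes backends fail independently, which is a simplification.
    """
    p_fail = 1.0 - availability
    return sum(
        comb(backends, i) * p_fail**i * availability ** (backends - i)
        for i in range(k + 1)
    )


print(round(100 * prob_at_most_missing(0), 3))  # responses with nothing missing
print(round(100 * prob_at_most_missing(1), 3))  # responses missing at most one
```

Under these assumptions, about 99.0% of responses have no missing backend response and well over 99.9% miss at most one, so both targets in the example sit just inside what the backends can deliver.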

2. Data Processing

Examples
  • A video service that converts from one format to another.
  • A system that processes logs and generates reports.
  • A storage system that accepts data and makes it available for retrieval later on.

2.1. Freshness

The proportion of valid data updated more recently than a threshold.

Implementation
  1. What data is valid for the SLI?
  2. When does the timer measuring data freshness start and stop?

For a batch processing system, freshness can be approximated as the time since the completion of the last successful run. More accurate measurements require processing systems to track generation and source age timestamps.

For streaming processing systems, we can measure freshness with a watermark that tracks the age of the most recent record that has been fully processed.

Serving stale data is a common way for response quality to be degraded without the system making an active choice. If we don’t track it and no user accesses the stale data, we can miss freshness expectations.

The system generating the data must also produce a generation timestamp so that the infrastructure can check against the freshness threshold when it reads the data.
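
A small sketch of that check, assuming each record carries its generation timestamp and that the freshness threshold is one hour (both are illustrative choices):

```python
from datetime import datetime, timedelta


def freshness_sli(generated_at, read_at, threshold):
    """Percentage of records whose age at read time is within `threshold`."""
    fresh = sum(1 for ts in generated_at if read_at - ts <= threshold)
    return 100.0 * fresh / len(generated_at)


# Hypothetical generation timestamps attached to four records.
now = datetime(2024, 5, 1, 12, 0)
timestamps = [
    now - timedelta(minutes=5),
    now - timedelta(minutes=20),
    now - timedelta(minutes=50),
    now - timedelta(hours=3),  # stale
]
print(freshness_sli(timestamps, now, threshold=timedelta(hours=1)))  # 75.0
```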

2.2. Correctness

The proportion of valid data producing correct output.

Implementation
  1. What data is valid for the SLI?
  2. How to determine the correctness of output records?

The methods for determining correctness need to be independent of the methods used to generate the output of the data; otherwise, bugs during generation will also affect validation.

To estimate overall correctness, the input data must be sufficiently representative of real user data and exercise most of the processing system’s code paths.
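
One common way to keep validation independent is a set of “golden” inputs whose expected outputs were verified by hand. The sketch below assumes such a set exists; `normalise_email` is a hypothetical system under test with a deliberate bug.

```python
def correctness_sli(process, golden_records):
    """Percentage of golden inputs for which `process` returns the expected output."""
    correct = sum(1 for given, expected in golden_records if process(given) == expected)
    return 100.0 * correct / len(golden_records)


def normalise_email(value):
    """Hypothetical system under test: lowercases but forgets to strip whitespace."""
    return value.lower()


# Expected outputs are written by hand, not produced by the code being checked.
GOLDEN = [
    ("A@B.com", "a@b.com"),
    (" c@d.com ", "c@d.com"),  # exposes the missing strip()
    ("e@f.com", "e@f.com"),
    ("G@H.COM", "g@h.com"),
]
print(correctness_sli(normalise_email, GOLDEN))  # 75.0
```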

2.3. Coverage

The proportion of valid data processed successfully.

Implementation
  1. What data is valid for the SLI?
  2. How to determine whether the processing of data was successful?

The data processing system should determine whether a record that began processing has finished and whether the outcome is a success or failure.

The challenge is with records that should have been processed but were missed for some reason. To solve this, we need to determine the number of valid records outside the data processing system itself, directly in the data source.

For batch processing, we can measure the proportion of jobs that processed data above a threshold amount.

For streaming processing, we can measure the proportion of incoming records that were successfully processed within a time window.
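
The point about counting valid records at the source can be sketched as follows. The record ids are made up; the important detail is that the denominator comes from the data source rather than from the pipeline’s own bookkeeping.

```python
def coverage_sli(source_record_ids, processed_ok_ids):
    """Percentage of valid source records that were processed successfully.

    The denominator comes from the data source, so records the pipeline
    never saw still count against coverage.
    """
    source = set(source_record_ids)
    processed = source & set(processed_ok_ids)
    return 100.0 * len(processed) / len(source)


source_ids = range(1, 11)                 # 10 valid records in the source
processed_ids = [1, 2, 3, 4, 5, 6, 7, 9]  # record 8 failed, record 10 was never picked up
print(coverage_sli(source_ids, processed_ids))  # 80.0
```

If the pipeline only counted the records it started, record 10 would be invisible and coverage would look better than it is.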

2.4. Throughput

The proportion of time where the data processing rate is faster than a threshold.

Implementation
  1. The units of measurement of the data processing rate, e.g. bytes per second.

How does this differ from latency? Throughput is the rate of events over time. As with latency and quality, throughput rates are a spectrum.
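
As a sketch, a throughput SLI can be computed from periodic samples of the processing rate. The one-sample-per-second rates and the 1 MB/s threshold below are assumptions for illustration.

```python
def throughput_sli(bytes_per_second_samples, threshold_bps):
    """Percentage of sampled seconds where the processing rate beat the threshold."""
    fast = sum(1 for rate in bytes_per_second_samples if rate > threshold_bps)
    return 100.0 * fast / len(bytes_per_second_samples)


# Hypothetical processing rates, one sample per second.
samples = [5_000_000, 4_200_000, 900_000, 3_100_000, 6_000_000]
print(throughput_sli(samples, threshold_bps=1_000_000))  # 80.0
```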

Managing SLI Complexity

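A few SLIs for the most critical user journeys

We should have one to three SLIs for each user journey, even if they are relatively complex. Why?

  1. Not all metrics make good SLIs.
  2. Not all user journeys are equally important. We should prioritise those where reliability has a significant impact on business outcomes.
  3. The more SLIs we have, the higher the cognitive load on the team, which has to learn and understand the signals needed to respond to outages.
  4. Too many SLIs increase the probability of conflicting signals, which drives up the time to resolution because the team will be chasing “red herrings”.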

Monitoring and observability

Having SLIs alone is not enough. We need monitoring and observability. Why?

  1. The deterioration of an SLI is an indication that something is wrong.
  2. When the deterioration becomes bad enough to provoke an incident response, we need other systems like monitoring and observability to identify what is wrong.

Manage complexity with aggregation

Let’s take an example of a typical e-commerce website, where a user lands on the home page, searches or browses a specific category of products and then goes into product details. To simplify, we can group these events into a single “browsing journey”. We can then sum up the valid and good events to get overall browse availability and latency SLIs.

The problem with summing events is that it treats all of them equally, even though some might be more important than others. Request rates can also differ significantly, so summing hides low-traffic events in the noise of high-traffic ones. One solution is to multiply each SLI by a weight, be that traffic rate or user journey importance.
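
The effect can be sketched with made-up numbers for the browsing journey: a low-traffic product details step is failing badly, and the weights are an arbitrary choice that marks it as twice as important as the other steps.

```python
def summed_sli(journey_steps):
    """Sum good and valid events across steps, then take the ratio (%)."""
    good = sum(step["good"] for step in journey_steps)
    valid = sum(step["valid"] for step in journey_steps)
    return 100.0 * good / valid


def weighted_sli(journey_steps):
    """Average the per-step SLIs using an explicit importance weight per step."""
    total_weight = sum(step["weight"] for step in journey_steps)
    weighted_sum = sum(
        step["weight"] * 100.0 * step["good"] / step["valid"]
        for step in journey_steps
    )
    return weighted_sum / total_weight


# Hypothetical browsing journey; product details gets little traffic.
steps = [
    {"name": "home", "good": 99_900, "valid": 100_000, "weight": 1},
    {"name": "search", "good": 49_950, "valid": 50_000, "weight": 1},
    {"name": "product details", "good": 800, "valid": 1_000, "weight": 2},
]
print(round(summed_sli(steps), 2))    # 99.77
print(round(weighted_sli(steps), 2))  # 89.95
```

The summed SLI barely moves although one in five product detail requests fails; the weighted version makes the problem visible. How to choose the weights is a product decision, not something the code can settle.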

Manage complexity with bucketing

Another source of complexity is choosing different thresholds for different SLOs. To reduce the complexity, we can reduce the set of good thresholds and label them with consistent, easily recognisable and comparable labels, for example, choosing one to three discrete response buckets.

Bucket 1: Interactive requests

The first step is to identify when a human user is actively waiting for a response. This is important because requests could also come from bots and mobile devices pre-fetching data overnight on WiFi and AC power. However, we care about the human user experience.

Bucket 2: Write requests

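The second step is to categorise which requests mutate state in the system. This is important because writes and reads, especially in distributed systems, have different characteristics. For example, users are already used to waiting a little longer after clicking “Submit” than when viewing static information on a page.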

Bucket 3: Read requests

The third step is choosing which requests should have the strictest latencies. Choosing a spectrum of thresholds is a good idea:

  1. Annoying requests: 50–75% of requests are faster than this threshold.
  2. Painful requests (long-tail): 90% of requests are faster than this threshold.

Bucket 4 (optional): Third-party dependent requests

When we have third-party dependencies like payment providers, we can’t make requests faster because they’re not within our control. A solution is to make it explicit to the user what the responsibility boundaries are. For example, make it visible when the third-party dependency kicks in during the user journey.

We could also bucket by customer tier: enterprise customers have tighter SLOs than self-serve ones.
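
A sketch of how the buckets above might be applied in code. The request fields (`interactive`, `third_party`, `method`) and all threshold values are hypothetical and would have to come from the service’s own data.

```python
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

# Hypothetical latency thresholds in ms per bucket: (annoying, painful).
THRESHOLDS_MS = {
    "write": (500, 2500),
    "read": (200, 1000),
    "third-party": (1500, 5000),
}


def bucket_for(request):
    """Return the bucket name for a request, or None if no human is waiting."""
    if not request.get("interactive", False):
        return None  # bots and overnight pre-fetching are left out
    if request.get("third_party", False):
        return "third-party"
    return "write" if request["method"] in WRITE_METHODS else "read"


def classify(request, latency_ms):
    """Label one request as 'ok', 'annoying' or 'painful' for its bucket."""
    bucket = bucket_for(request)
    if bucket is None:
        return None
    annoying_ms, painful_ms = THRESHOLDS_MS[bucket]
    if latency_ms >= painful_ms:
        return "painful"
    return "annoying" if latency_ms >= annoying_ms else "ok"


print(classify({"interactive": True, "method": "GET"}, 350))   # annoying
print(classify({"interactive": True, "method": "POST"}, 350))  # ok
print(classify({"interactive": False, "method": "GET"}, 350))  # None
```

The same 350 ms response is annoying for a read but acceptable for a write, which is the reason for keeping a small, consistent set of buckets.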

Achievable vs aspirational SLOs

Once we have chosen the right SLIs, with a close and predictable relationship to user happiness, the next step is to choose good-enough reliability targets.

Users’ expectations are strongly tied to past performance. The best way to arrive at a reasonable target is to have historical monitoring data to tell us what’s achievable. If that’s missing or the business needs change, the solution is to gather data and iterate towards achievable and aspirational targets.

Achievable SLOs are based on historical data when there’s enough information to set the targets that meet the users’ expectations in most cases. The downside of achievable SLOs is that the assumption that users are happy with past and current performance is impossible to validate from monitoring data alone.

  • What if our feature is completely new?
  • What if our users only stick with us because the competition is far worse?
  • What if our users are so happy with our performance that we could relax it a little to increase our margins?

As stated in Disciplined Entrepreneurship: 24 Steps to a Successful Startup, aspirational SLOs are based on business needs. Like OKRs, they are set higher than the achievable ones. Since they start from assumptions about the users’ happiness, it’s totally reasonable to not hit them at first. That’s why it’s more important to set a reasonable target than to set the right target.

The first thing to do when achievable and aspirational SLOs diverge is to understand why.

Why are the users sad even if we’re within an SLO?

To answer that, we need two things:

  1. Tracking signals that are proxies for user happiness outside the monitoring systems. For example, NPS or customer support requests.
  2. Time. We don’t have to wait an entire year before setting some reasonable targets because a lot can happen in one year: the business can pivot or scale 10x. At the same time, we also don’t want to panic every week and change the targets based on fear or hope.

We should keep checking whether our assumptions hold through continuous learning and improvement.

Four steps to arrive at good SLOs

1. Choose an SLI specification from the SLI buffet.

Questions

  • What does the user expect this service to do?
  • What would make the user unhappy with this service?
  • Do different types of failures have different effects on the service?

Output: SLIs for request/response and data processing.

2. Refine the specification into a detailed SLI implementation.

Questions

  • What does the SLI measure?
  • Where is the SLI measured?
  • What metrics should be included and excluded?

Output: A detailed-enough SLI implementation that can be added to a monitoring system.

3. Walk through the user journey and look for coverage gaps.

Questions

  • What edge cases doesn’t the SLI specification cover?
  • How risky are those edge cases?

Output: A documented list of edge cases and/or augmenting measurement strategies.

4. Set aspirational SLO targets based on business needs.

Questions

  • What historical performance data can we use to set the initial targets?
  • What other user happiness signals can we use to estimate targets?
  • If there are competitors on the market, what levels of service do they offer?
  • What is the profile of the user: self-serve or enterprise?
  • What is the cost for the next order of magnitude in the SLO target?
  • What is worse for users: a constant rate of low failures or an occasional full-site outage?

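Output: SLO targets on a spectrum

(The SLO decision matrix from the Google SRE book / Example SLO document)

In part two, I’ll dive into the need for error budgets, the seven properties of a good error budget policy, and an example of a CRE risk analysis template.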
