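So far in this series, we have learned about Resilience4j and its Retry, RateLimiter, TimeLimiter, and Bulkhead modules. In this article, we will explore the CircuitBreaker module. We will find out when and how to use it, and look at a few examples.
Code Example
This article is accompanied by a working code example on GitHub.
What is Resilience4j?
Please refer to the description in the previous article for a quick intro into how Resilience4j works in general.
What is a Circuit Breaker?
The idea of a circuit breaker is to prevent calls to a remote service if we know that the call is likely to fail or time out. We do this so that we don't unnecessarily waste critical resources both in our service and in the remote service. Backing off like this also gives the remote service some time to recover.
How do we know that a call is likely to fail? By keeping track of the results of the previous requests made to the remote service. If, say, 8 out of the previous 10 calls resulted in a failure or a timeout, the next call will likely also fail.
A circuit breaker keeps track of the responses by wrapping the call to the remote service. During normal operation, when the remote service is responding successfully, we say that the circuit breaker is in a “closed” state. When in the closed state, a circuit breaker passes the request through to the remote service normally.
When a remote service returns an error or times out, the circuit breaker increments an internal counter. If the count of errors exceeds a configured threshold, the circuit breaker switches to an “open” state. When in the open state, a circuit breaker immediately returns an error to the caller without even attempting the remote call.
After some configured time, the circuit breaker switches from the open state to a “half-open” state. In this state, it lets a few requests pass through to the remote service to check if it is still unavailable or slow. If the error rate or slow call rate is above the configured threshold, it switches back to the open state. If the error rate or slow call rate is below the configured threshold, however, it switches to the closed state to resume normal operation.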
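To make the three states concrete, here is a minimal sketch of such a state machine in plain Java. It is not how Resilience4j implements its circuit breaker: the class name TinyCircuitBreaker, the use of IllegalStateException for rejected calls, counting consecutive failures, and allowing a single trial call in the half-open state are all simplifying assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker; NOT the Resilience4j implementation. */
class TinyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive errors tolerated before opening
    private final long openMillis;      // how long to stay OPEN before probing again
    private final LongSupplier clock;   // injectable time source in milliseconds

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    TinyCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns the current state, moving OPEN -> HALF_OPEN once the wait has elapsed. */
    State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> remoteCall) {
        if (state() == State.OPEN) {
            // Fail fast: the remote call is not even attempted.
            throw new IllegalStateException("circuit is OPEN");
        }
        try {
            T result = remoteCall.get();
            // Any success (including the HALF_OPEN trial call) closes the circuit.
            state = State.CLOSED;
            failures = 0;
            return result;
        } catch (RuntimeException e) {
            failures++;
            // A failed trial call, or too many consecutive errors, opens the circuit.
            if (state == State.HALF_OPEN || failures > failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            throw e;
        }
    }
}
```

Passing the clock in as a LongSupplier is only there so that the open-to-half-open transition can be exercised without sleeping.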
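Types of Circuit Breakers
A circuit breaker can be count-based or time-based. A count-based circuit breaker switches state from closed to open if a configured share of the last N calls failed or were slow. A time-based circuit breaker switches to the open state if a configured share of the responses in the last N seconds failed or were slow. In both circuit breakers, we specify that share as a threshold for failures or slow calls.
For example, we can configure a count-based circuit breaker to “open the circuit” if 70% of the last 25 calls failed or took more than 2s to complete. Similarly, we could tell a time-based circuit breaker to open the circuit if 80% of the calls in the last 30s failed or took more than 5s.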
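The count-based rule can be sketched in a few lines of plain Java. This is an illustrative sketch only: CountBasedWindow and its methods are hypothetical names, and returning -1 until the window is full is an assumption of this sketch rather than a statement about any library.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: tracks the outcome of the last N calls. */
class CountBasedWindow {
    private final int size;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private int failures = 0;

    CountBasedWindow(int size) {
        this.size = size;
    }

    void record(boolean failed) {
        // Once the window is full, the oldest outcome drops out.
        if (outcomes.size() == size && outcomes.removeFirst()) {
            failures--;
        }
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
    }

    /** Failure rate in percent, or -1 while fewer than N calls were recorded. */
    float failureRate() {
        return outcomes.size() < size ? -1f : 100f * failures / size;
    }

    /** True when the rate is known and has reached the threshold. */
    boolean shouldOpen(float thresholdPercent) {
        float rate = failureRate();
        return rate >= 0 && rate >= thresholdPercent;
    }
}
```

A time-based window works the same way, except that outcomes leave the window because they are older than N seconds rather than because N newer calls have arrived.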
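Resilience4j CircuitBreaker Concepts
resilience4j-circuitbreaker works similarly to the other Resilience4j modules. We provide it the code we want to execute as a functional construct (a lambda expression that makes a remote call, a Supplier of some value retrieved from a remote service, and so on), and the circuit breaker decorates it with code that keeps track of responses and switches states if required.
Resilience4j supports both count-based and time-based circuit breakers.
We specify the type of circuit breaker using the slidingWindowType() configuration. This configuration can take one of two values: SlidingWindowType.COUNT_BASED or SlidingWindowType.TIME_BASED.
failureRateThreshold() and slowCallRateThreshold() configure the failure rate threshold and the slow call rate threshold as percentages.
slowCallDurationThreshold() configures the duration above which a call is considered slow. It takes a Duration, for example Duration.ofSeconds(2).
We can specify a minimumNumberOfCalls() that must be recorded before the circuit breaker can calculate the error rate or slow call rate.
As mentioned earlier, the circuit breaker switches from the open state to the half-open state after a certain time to check how the remote service is doing. waitDurationInOpenState() specifies the time that the circuit breaker should wait before switching to the half-open state.
permittedNumberOfCallsInHalfOpenState() configures the number of calls that are allowed in the half-open state, and maxWaitDurationInHalfOpenState() determines how long a circuit breaker can stay in the half-open state before switching back to the open state. The default value of 0 for this configuration means that the circuit breaker will wait indefinitely until all the permittedNumberOfCallsInHalfOpenState() calls have completed.
By default, the circuit breaker considers any Exception as a failure. We can tweak this: the recordExceptions() configuration specifies a list of exceptions that should be treated as failures, and the ignoreExceptions() configuration specifies a list of exceptions to be ignored.
If we want even finer control when determining whether an Exception should be treated as a failure or ignored, we can provide a Predicate<Throwable> as a recordException() or ignoreException() configuration.
The circuit breaker throws a CallNotPermittedException when it rejects calls in the open state. We can control the amount of information in the stack trace of a CallNotPermittedException using the writableStackTraceEnabled() configuration.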
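As a rough illustration, here is how these options might be combined in a single configuration. This is a sketch, not a recommended setup: it assumes a reasonably recent 1.x release of resilience4j-circuitbreaker on the classpath, the class name CircuitBreakerConfigSketch is made up, and every numeric value and exception type below is an arbitrary placeholder.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

class CircuitBreakerConfigSketch {
    static CircuitBreakerConfig build() {
        return CircuitBreakerConfig.custom()
            // Look at the outcome of the last 20 calls...
            .slidingWindowType(SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(20)
            // ...but only start computing rates once 10 calls were recorded.
            .minimumNumberOfCalls(10)
            // Open when 50% of calls failed, or 80% took longer than 2s.
            .failureRateThreshold(50.0f)
            .slowCallRateThreshold(80.0f)
            .slowCallDurationThreshold(Duration.ofSeconds(2))
            // Stay open for 30s, then allow 5 trial calls in half-open.
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .permittedNumberOfCallsInHalfOpenState(5)
            // Do not linger in half-open for more than 10s.
            .maxWaitDurationInHalfOpenState(Duration.ofSeconds(10))
            // Placeholder exception types: count these as failures...
            .recordExceptions(IOException.class, TimeoutException.class)
            // ...and leave this one out of the statistics entirely.
            .ignoreExceptions(IllegalArgumentException.class)
            // Keep CallNotPermittedException stack traces short.
            .writableStackTraceEnabled(false)
            .build();
    }
}
```

The sections below introduce most of these options one at a time with the flight search example.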
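Using the Resilience4j CircuitBreaker Module
Let's see how to use the various features available in the resilience4j-circuitbreaker module.
We will use the same example as the previous articles in this series. Assume that we are building a website for an airline to allow its customers to search for and book flights. Our service talks to a remote service encapsulated by the class FlightSearchService.
When using the Resilience4j circuit breaker, CircuitBreakerRegistry, CircuitBreakerConfig, and CircuitBreaker are the main abstractions we work with.
CircuitBreakerRegistry is a factory for creating and managing CircuitBreaker objects.
CircuitBreakerConfig encapsulates all the configurations from the previous section. Each CircuitBreaker object is associated with a CircuitBreakerConfig.
The first step is to create a CircuitBreakerConfig: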
CircuitBreakerConfig config = CircuitBreakerConfig.ofDefaults();
This creates a CircuitBreakerConfig with these default values:
Configuration | Default value |
---|---|
slidingWindowType | COUNT_BASED |
failureRateThreshold | 50% |
slowCallRateThreshold | 100% |
slowCallDurationThreshold | 60s |
minimumNumberOfCalls | 100 |
permittedNumberOfCallsInHalfOpenState | 10 |
maxWaitDurationInHalfOpenState | 0s |
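Count-based Circuit Breaker
Let's say we want the circuit breaker to open if 70% of the last 10 calls failed:
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.COUNT_BASED)
  .slidingWindowSize(10)
  .failureRateThreshold(70.0f)
  .build();
We then create a CircuitBreaker with this config:
CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker circuitBreaker = registry.circuitBreaker("flightSearchService");
Let's now express our code to run a flight search as a Supplier and decorate it using the circuitBreaker:
Supplier<List<Flight>> flightsSupplier =
  () -> service.searchFlights(request);
Supplier<List<Flight>> decoratedFlightsSupplier =
  circuitBreaker.decorateSupplier(flightsSupplier);
Finally, let's call the decorated operation a few times to understand how the circuit breaker works. The loop below issues 20 searches one after another; we could also use CompletableFuture to simulate concurrent flight search requests from users:
for (int i = 0; i < 20; i++) {
  try {
    System.out.println(decoratedFlightsSupplier.get());
  } catch (Exception e) {
    // Service failures and CallNotPermittedException both end up here
    e.printStackTrace();
  }
}
The output shows the first few flight searches succeeding, followed by 7 flight search failures. At that point, the circuit breaker opens and throws CallNotPermittedException for subsequent calls: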
Searching for flights; current time = 12:01:12 884
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:01:12 954
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:01:12 957
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:01:12 958
io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search
... stack trace omitted ...
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... other lines omitted ...
io.reflectoring.resilience4j.circuitbreaker.Examples.countBasedSlidingWindow_FailedCalls(Examples.java:56)
at io.reflectoring.resilience4j.circuitbreaker.Examples.main(Examples.java:229)
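Now, let's say we want the circuit breaker to open if 70% of the last 10 calls took 2s or more to complete:
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.COUNT_BASED)
  .slidingWindowSize(10)
  .slowCallRateThreshold(70.0f)
  .slowCallDurationThreshold(Duration.ofSeconds(2))
  .build();
The timestamps in the sample output show requests consistently taking 2s to complete. After 7 slow responses, the circuit breaker opens and does not permit further calls: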
Searching for flights; current time = 12:26:27 901
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:26:29 953
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 12:26:31 957
Flight search successful
... other lines omitted ...
Searching for flights; current time = 12:26:43 966
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... stack trace omitted ...
at io.reflectoring.resilience4j.circuitbreaker.Examples.main(Examples.java:231)
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... stack trace omitted ...
at io.reflectoring.resilience4j.circuitbreaker.Examples.main(Examples.java:231)
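To see what “slow call rate” means in code, here is a small plain-Java sketch that times each call and compares the elapsed time to a threshold. SlowCallTracker is a hypothetical class written for this illustration, not part of Resilience4j, and treating a call that takes exactly the threshold as slow simply follows the “2s or more” wording above; it makes no claim about the library's own boundary handling.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch: measures calls and reports the share that were slow. */
class SlowCallTracker {
    private final long thresholdNanos;
    private final LongSupplier nanoClock; // injectable, e.g. System::nanoTime
    private int totalCalls = 0;
    private int slowCalls = 0;

    SlowCallTracker(Duration slowCallDurationThreshold, LongSupplier nanoClock) {
        this.thresholdNanos = slowCallDurationThreshold.toNanos();
        this.nanoClock = nanoClock;
    }

    <T> T call(Supplier<T> remoteCall) {
        long start = nanoClock.getAsLong();
        try {
            return remoteCall.get();
        } finally {
            // Record the duration whether the call succeeded or threw.
            long elapsed = nanoClock.getAsLong() - start;
            totalCalls++;
            if (elapsed >= thresholdNanos) {
                slowCalls++;
            }
        }
    }

    /** Slow call rate in percent, or -1 if no call has been recorded yet. */
    float slowCallRate() {
        return totalCalls == 0 ? -1f : 100f * slowCalls / totalCalls;
    }
}
```

In production code the clock would be System::nanoTime; it is a constructor argument here only so that the behaviour can be checked without real delays.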
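Usually we would configure a single circuit breaker with both failure rate and slow call rate thresholds:
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.COUNT_BASED)
  .slidingWindowSize(10)
  .failureRateThreshold(70.0f)
  .slowCallRateThreshold(70.0f)
  .slowCallDurationThreshold(Duration.ofSeconds(2))
  .build();
Time-based Circuit Breaker
Let's say we want the circuit breaker to open if 70% of the requests in the last 10s failed:
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.TIME_BASED)
  // Evaluate the rate once 10 calls are recorded (the default is 100)
  .minimumNumberOfCalls(10)
  // For a time-based window, the size is in seconds
  .slidingWindowSize(10)
  .failureRateThreshold(70.0f)
  .build();
We create the CircuitBreaker, express the flight search call as a Supplier<List<Flight>>, and decorate it using the CircuitBreaker just as we did in the previous section.
Here's sample output after calling the decorated operation a few times: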
Start time: 18:51:01 552
Searching for flights; current time = 18:51:01 582
Flight search successful
[Flight{flightNumber='XY 765', ... }]
... other lines omitted ...
Searching for flights; current time = 18:51:01 631
io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search
... stack trace omitted ...
Searching for flights; current time = 18:51:01 632
io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search
... stack trace omitted ...
Searching for flights; current time = 18:51:01 633
... other lines omitted ...
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... other lines omitted ...
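The first 3 requests were successful and the next 7 requests failed. At this point the circuit breaker opened, and the subsequent requests failed by throwing CallNotPermittedException.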
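The time-based evaluation behind this output can be sketched in plain Java as follows. This is a deliberately naive illustration: TimeBasedWindow is a hypothetical class that stores every outcome with its timestamp, which is an assumption of this sketch and not a description of Resilience4j's internal data structure.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: failure rate over the calls of the last N seconds. */
class TimeBasedWindow {
    private static final class Outcome {
        final long second;
        final boolean failed;

        Outcome(long second, boolean failed) {
            this.second = second;
            this.failed = failed;
        }
    }

    private final long windowSeconds;
    private final int minimumNumberOfCalls;
    private final Deque<Outcome> outcomes = new ArrayDeque<>();

    TimeBasedWindow(long windowSeconds, int minimumNumberOfCalls) {
        this.windowSeconds = windowSeconds;
        this.minimumNumberOfCalls = minimumNumberOfCalls;
    }

    void record(long nowSeconds, boolean failed) {
        outcomes.addLast(new Outcome(nowSeconds, failed));
        evict(nowSeconds);
    }

    /** Drops outcomes that are windowSeconds old or older. */
    private void evict(long nowSeconds) {
        while (!outcomes.isEmpty()
                && outcomes.peekFirst().second <= nowSeconds - windowSeconds) {
            outcomes.removeFirst();
        }
    }

    /** Failure rate in percent, or -1 while too few calls are in the window. */
    float failureRate(long nowSeconds) {
        evict(nowSeconds);
        if (outcomes.size() < minimumNumberOfCalls) {
            return -1f;
        }
        long failures = outcomes.stream().filter(o -> o.failed).count();
        return 100f * failures / outcomes.size();
    }
}
```

With a 10s window and a minimum of 10 calls, 3 successes followed by 7 failures within the same second give a rate of 70%, which is the point at which the breaker in the output above opens.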
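Now, let's say we want the circuit breaker to open if 70% of the calls in the last 10s took 1s or more to complete:
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.TIME_BASED)
  .minimumNumberOfCalls(10)
  .slidingWindowSize(10)
  .slowCallRateThreshold(70.0f)
  .slowCallDurationThreshold(Duration.ofSeconds(1))
  .build();
The timestamps in the sample output show requests consistently taking 1s to complete. After 10 requests (minimumNumberOfCalls), when the circuit breaker determines that 70% of the previous requests took 1s or more, it opens the circuit: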
Start time: 19:06:37 957
Searching for flights; current time = 19:06:37 979
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 19:06:39 066
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 19:06:40 070
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 19:06:41 070
... other lines omitted ...
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
... stack trace omitted ...
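Usually we would configure a single time-based circuit breaker with both failure rate and slow call rate thresholds, as the configuration in the next section does.
Specifying Wait Duration in Open State
Let's say we want the circuit breaker to wait 10s when it is in the open state, then transition to the half-open state and let a few requests pass through to the remote service:
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.TIME_BASED)
  .slidingWindowSize(10)
  .minimumNumberOfCalls(10)
  .failureRateThreshold(70.0f)
  .slowCallRateThreshold(70.0f)
  .slowCallDurationThreshold(Duration.ofSeconds(2))
  // Stay open for 10s before moving to half-open
  .waitDurationInOpenState(Duration.ofSeconds(10))
  .build();
The timestamps in the sample output show the circuit breaker transitioning to the open state initially, blocking calls for the next 10s, and then changing to the half-open state. Later, consistently successful responses in the half-open state cause it to switch to the closed state again: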
Searching for flights; current time = 20:55:58 735
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 20:55:59 812
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 20:56:00 816
... other lines omitted ...
io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Flight search failed
at
... stack trace omitted ...
2020-12-13T20:56:03.850115+05:30: CircuitBreaker 'flightSearchService' changed state from CLOSED to OPEN
2020-12-13T20:56:04.851700+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
2020-12-13T20:56:05.852220+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
2020-12-13T20:56:06.855338+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
... other similar lines omitted ...
2020-12-13T20:56:12.862362+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
2020-12-13T20:56:13.865436+05:30: CircuitBreaker 'flightSearchService' changed state from OPEN to HALF_OPEN
Searching for flights; current time = 20:56:13 865
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
... other similar lines omitted ...
2020-12-13T20:56:16.877230+05:30: CircuitBreaker 'flightSearchService' changed state from HALF_OPEN to CLOSED
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 20:56:17 879
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
... other similar lines omitted ...
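Specifying a Fallback Method
A common pattern when using circuit breakers is to specify a fallback method to be called when the circuit is open. The fallback method can provide some default value or behavior for the remote call that was not permitted.
We can use the Decorators utility class for setting this up. Decorators is a builder from the resilience4j-all module with methods like withCircuitBreaker(), withRetry(), and withRateLimiter() that help apply multiple Resilience4j decorators to a Supplier, Function, etc.
We will use its withFallback() method to return flight search results from a local cache when the circuit breaker is open and throws CallNotPermittedException:
Supplier<List<Flight>> flightsSupplier = () -> service.searchFlights(request);
Supplier<List<Flight>> decorated = Decorators
  .ofSupplier(flightsSupplier)
  .withCircuitBreaker(circuitBreaker)
  .withFallback(Arrays.asList(CallNotPermittedException.class),
                e -> this.getFlightSearchResultsFromCache(request))
  .decorate();
Here's sample output showing search results being returned from the cache after the circuit breaker opens: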
Searching for flights; current time = 22:08:29 735
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 22:08:29 854
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 22:08:29 855
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Searching for flights; current time = 22:08:29 855
2020-12-13T22:08:29.856277+05:30: CircuitBreaker 'flightSearchService' recorded an error: 'io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search'. Elapsed time: 0 ms
Searching for flights; current time = 22:08:29 912
... other lines omitted ...
2020-12-13T22:08:29.926691+05:30: CircuitBreaker 'flightSearchService' changed state from CLOSED to OPEN
Returning flight search results from cache
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
Returning flight search results from cache
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... }]
... other lines omitted ...
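Conceptually, a fallback decorator is just a wrapper that catches one kind of exception and substitutes a value. The plain-Java sketch below shows the idea; Fallbacks.withFallback() is a hypothetical helper written for this illustration, and it is not the implementation behind Decorators.withFallback().

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical helper illustrating the fallback pattern. */
class Fallbacks {
    static <T> Supplier<T> withFallback(Supplier<T> primary,
                                        Class<? extends RuntimeException> handled,
                                        Function<RuntimeException, T> fallback) {
        return () -> {
            try {
                return primary.get();
            } catch (RuntimeException e) {
                // Only the handled exception type triggers the fallback.
                if (handled.isInstance(e)) {
                    return fallback.apply(e);
                }
                // Everything else propagates to the caller unchanged.
                throw e;
            }
        };
    }
}
```

Restricting the fallback to a specific exception type matters: with the circuit breaker, we want cached results only when a call was not permitted, while genuine service errors should still surface.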
Reducing Information in the Stacktrace
Whenever a call is rejected because the circuit breaker is open, it throws a CallNotPermittedException:
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
at io.github.resilience4j.circuitbreaker.CallNotPermittedException.createCallNotPermittedException(CallNotPermittedException.java:48)
... other lines in stack trace omitted ...
at io.reflectoring.resilience4j.circuitbreaker.Examples.timeBasedSlidingWindow_SlowCalls(Examples.java:169)
at io.reflectoring.resilience4j.circuitbreaker.Examples.main(Examples.java:263)
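Apart from the first line, the other lines in the stack trace do not add much value. If the CallNotPermittedException occurs multiple times, these stack trace lines repeat in our log files.
We can reduce the amount of information that is generated in the stack trace by setting the writableStackTraceEnabled() configuration to false:
CircuitBreakerConfig config = CircuitBreakerConfig
  .custom()
  .slidingWindowType(SlidingWindowType.COUNT_BASED)
  .slidingWindowSize(10)
  .failureRateThreshold(70.0f)
  .writableStackTraceEnabled(false)
  .build();
Now, when a CallNotPermittedException occurs, only a single line is present in the stack trace: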
Searching for flights; current time = 20:29:24 476
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
Searching for flights; current time = 20:29:24 540
Flight search successful
[Flight{flightNumber='XY 765', flightDate='12/31/2020', from='NYC', to='LAX'}, ... ]
... other lines omitted ...
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
io.github.resilience4j.circuitbreaker.CallNotPermittedException: CircuitBreaker 'flightSearchService' is OPEN and does not permit further calls
...
Other Useful Methods
Similar to the Retry module, the CircuitBreaker configuration also has methods like ignoreExceptions() and recordExceptions() which let us specify which exceptions the CircuitBreaker should ignore and which it should consider when tracking the results of calls.
For example, we might want to ignore a SeatsUnavailableException from the remote flight service, since we don't really want to open the circuit in that case.
Also similar to the other Resilience4j modules we have seen, the CircuitBreaker provides additional methods like decorateCheckedSupplier(), decorateCompletionStage(), decorateRunnable(), and decorateConsumer(), so we can provide our code in constructs other than a Supplier.
Circuit Breaker Events
CircuitBreaker has an EventPublisher which generates events of the following types:
- CircuitBreakerOnSuccessEvent
- CircuitBreakerOnErrorEvent
- CircuitBreakerOnStateTransitionEvent
- CircuitBreakerOnResetEvent
- CircuitBreakerOnIgnoredErrorEvent
- CircuitBreakerOnCallNotPermittedEvent
- CircuitBreakerOnFailureRateExceededEvent
- CircuitBreakerOnSlowCallRateExceededEvent
We can listen for these events and log them, for example:
circuitBreaker.getEventPublisher()
  .onCallNotPermitted(e -> System.out.println(e.toString()));
circuitBreaker.getEventPublisher()
  .onError(e -> System.out.println(e.toString()));
circuitBreaker.getEventPublisher()
  .onFailureRateExceeded(e -> System.out.println(e.toString()));
circuitBreaker.getEventPublisher()
  .onStateTransition(e -> System.out.println(e.toString()));
Here's the sample log output:
2020-12-13T22:25:52.972943+05:30: CircuitBreaker 'flightSearchService' recorded an error: 'io.reflectoring.resilience4j.circuitbreaker.exceptions.FlightServiceException: Error occurred during flight search'. Elapsed time: 0 ms
Searching for flights; current time = 22:25:52 973
... other lines omitted ...
2020-12-13T22:25:52.974448+05:30: CircuitBreaker 'flightSearchService' exceeded failure rate threshold. Current failure rate: 70.0
2020-12-13T22:25:52.984300+05:30: CircuitBreaker 'flightSearchService' changed state from CLOSED to OPEN
2020-12-13T22:25:52.985057+05:30: CircuitBreaker 'flightSearchService' recorded a call which was not permitted.
... other lines omitted ...
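CircuitBreaker Metrics
CircuitBreaker exposes many metrics. These are some of the important ones:
- Total number of successful, failed, or ignored calls (resilience4j.circuitbreaker.calls)
- State of the circuit breaker (resilience4j.circuitbreaker.state)
- Failure rate of the circuit breaker (resilience4j.circuitbreaker.failure.rate)
- Total number of calls that were not permitted (resilience4j.circuitbreaker.not.permitted.calls)
- Slow call rate of the circuit breaker (resilience4j.circuitbreaker.slow.call.rate)
First, we create CircuitBreakerConfig, CircuitBreakerRegistry, and CircuitBreaker as usual. Then, we create a MeterRegistry and bind the CircuitBreakerRegistry to it:
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(registry)
  .bindTo(meterRegistry);
After running the circuit breaker-decorated operation a few times, we display the captured metrics. Here's some sample output: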
The number of slow failed calls which were slower than a certain threshold - resilience4j.circuitbreaker.slow.calls: 0.0
The states of the circuit breaker - resilience4j.circuitbreaker.state: 0.0, state: metrics_only
Total number of not permitted calls - resilience4j.circuitbreakernot.permitted.calls: 0.0
The slow call of the circuit breaker - resilience4j.circuitbreaker.slow.call.rate: -1.0
The states of the circuit breaker - resilience4j.circuitbreaker.state: 0.0, state: half_open
Total number of successful calls - resilience4j.circuitbreaker.calls: 0.0, kind: successful
The failure rate of the circuit breaker - resilience4j.circuitbreaker.failure.rate: -1.0
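In a real application, we would export the data to a monitoring system periodically and analyze it on a dashboard.
Conclusion
In this article, we learned how we can use Resilience4j's CircuitBreaker module to pause making requests to a remote service when it returns errors. We learned why this is important and also saw some practical examples of how to configure it.
You can play around with a complete application illustrating these ideas using the code on GitHub.
This article was translated from: Implementing a Circuit Breaker with Resilience4j – Reflectoring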