This time I walk through "Detecting Down Services with Health Checks" from "Try Envoy". It covers two features that Envoy supports for high availability: health checking and outlier detection.

Detecting Down Services with Health Checks

There are eight steps in total.

Step.1 "Proxy Configuration"
Step.2 "Add Health Check"
Step.3 "Start Proxy"
Step.4 "Failed Services"
Step.5 "Healthy Services"
Step.6 "Total Failure"
Step.7 "Outlier Detection Configuration"
Step.8 "Testing Outlier Detection"

www.envoyproxy.io

www.katacoda.com

Step.1 "Proxy Configuration"

First, check the provided envoy.yaml. By now it should look quite familiar. The key part is clusters: every request is routed round robin to 172.18.0.3 and 172.18.0.4. So what should Envoy do when, say, some of the endpoints fail? That is where health checking comes in.

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          codec_type: auto
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains:
                - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: targetCluster
          http_filters:
          - name: envoy.router
  clusters:
  - name: targetCluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts: [
      { socket_address: { address: 172.18.0.3, port_value: 80 }},
      { socket_address: { address: 172.18.0.4, port_value: 80 }}
    ]
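
To make the ROUND_ROBIN policy concrete, here is a minimal Python sketch of how requests would alternate between the two endpoints. It only illustrates the idea; it is not Envoy's implementation, and the `round_robin` helper is a hypothetical name.

```python
from itertools import cycle

# Endpoints taken from the `hosts` list in envoy.yaml above.
HOSTS = ["172.18.0.3:80", "172.18.0.4:80"]


def round_robin(hosts):
    """Return an iterator that yields hosts in a fixed rotating order."""
    return cycle(hosts)


picker = round_robin(HOSTS)
first_four = [next(picker) for _ in range(4)]
print(first_four)
```

With no health information, a failed endpoint would keep receiving every second request, which is the problem the next step addresses.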

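Step.2 "Add Health Check"

Health checks are configured inside clusters. The main parameters are as follows.

interval : time between health checks
unhealthy_threshold : number of failed checks before a host is marked unhealthy
healthy_threshold : number of successful checks before a host is marked healthy
http_health_check.path : URL path requested by the health check

Here, /health is checked every 10 seconds. A host is marked healthy after 1 successful check and unhealthy after 6 failed checks. interval_jitter also adds a random delay of up to 1 second to each interval.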

clusters:
- name: targetCluster
  connect_timeout: 0.25s
  type: STRICT_DNS
  dns_lookup_family: V4_ONLY
  lb_policy: ROUND_ROBIN
  hosts: [
    { socket_address: { address: 172.18.0.3, port_value: 80 }},
    { socket_address: { address: 172.18.0.4, port_value: 80 }}
  ]
  health_checks:
    - timeout: 1s
      interval: 10s
      interval_jitter: 1s
      unhealthy_threshold: 6
      healthy_threshold: 1
      http_health_check:
        path: "/health"
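
The two thresholds can be read as a small state machine. The sketch below is a simplified model under my own assumptions (the `HostHealth` class is hypothetical, and real Envoy has extra rules, such as how it treats the first check after startup), but it shows how 6 failures and 1 success flip the state with the values above.

```python
from dataclasses import dataclass


@dataclass
class HostHealth:
    """Simplified health state for one host (hypothetical model, not Envoy's code)."""
    unhealthy_threshold: int = 6
    healthy_threshold: int = 1
    healthy: bool = True
    fails: int = 0
    passes: int = 0

    def record(self, check_passed: bool) -> bool:
        """Feed one health check result and return the resulting state."""
        if check_passed:
            self.passes += 1
            self.fails = 0
            if not self.healthy and self.passes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.fails += 1
            self.passes = 0
            if self.healthy and self.fails >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


host = HostHealth()
states = [host.record(False) for _ in range(6)]  # six failed checks in a row
recovered = host.record(True)                    # one successful check
print(states, recovered)
```

The host stays healthy through the first five failures, turns unhealthy on the sixth, and a single successful check brings it back.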

There are many more fine-grained parameters, so refer to the documentation as needed.

www.envoyproxy.io

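Step.3 "Start Proxy"

Now start Envoy. At the same time, start two katacoda/docker-http-server:healthy containers as backend hosts. The container image (the healthy tag) is not documented, so this is only a guess, but it appears to be implemented to pass health checks by default. Its state can also be switched with curl.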

$ docker run -d --name proxy1 -p 80:8080 -v /root/:/etc/envoy envoyproxy/envoy

$ docker run -d katacoda/docker-http-server:healthy

$ docker run -d katacoda/docker-http-server:healthy

Sending requests to Envoy shows that it works. The test environment is finally ready!

$ curl localhost
<h1>A healthy request was processed by host: daab80baf4ae</h1>

$ curl localhost
<h1>A healthy request was processed by host: 80461e934cfd</h1>

$ curl localhost
<h1>A healthy request was processed by host: daab80baf4ae</h1>
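
When reading longer runs of this output, it helps to count responses per host. The following sketch parses the response bodies shown above; the regular expression is inferred from this output only, and `tally` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern of the response body returned by katacoda/docker-http-server:healthy,
# inferred from the curl output above (an assumption, not a documented format).
BODY_RE = re.compile(r"<h1>A (healthy|unhealthy) request was processed by host: (\w+)</h1>")


def tally(bodies):
    """Count responses per (host, status) pair."""
    counts = Counter()
    for body in bodies:
        match = BODY_RE.search(body)
        if match:
            status, host = match.groups()
            counts[(host, status)] += 1
    return counts


sample = [
    "<h1>A healthy request was processed by host: daab80baf4ae</h1>",
    "<h1>A healthy request was processed by host: 80461e934cfd</h1>",
    "<h1>A healthy request was processed by host: daab80baf4ae</h1>",
]
counts = tally(sample)
print(counts)
```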

After Step.3, the architecture looks like this.

[Figure: architecture diagram after Step.3]

Step.4 "Failed Services"

To cause a failure on purpose, call the /unhealthy endpoint on 172.18.0.3. After that, the response body says unhealthy request and the status code becomes 500, so one endpoint has been taken down.

$ curl 172.18.0.3/unhealthy

$ curl 172.18.0.3 -i
HTTP/1.1 500 Internal Server Error
(snip)
<h1>A unhealthy request was processed by host: daab80baf4ae</h1>

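Running curl every 0.5 seconds in a while loop shows that unhealthy request responses stop once the host reaches unhealthy_threshold. With interval: 10s and unhealthy_threshold: 6, that takes roughly a minute. Health checking works as expected.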

$ while true; do curl localhost; sleep .5; done
<h1>A unhealthy request was processed by host: daab80baf4ae</h1>
<h1>A healthy request was processed by host: 80461e934cfd</h1>
<h1>A unhealthy request was processed by host: daab80baf4ae</h1>
<h1>A healthy request was processed by host: 80461e934cfd</h1>
<h1>A unhealthy request was processed by host: daab80baf4ae</h1>

(snip)

<h1>A healthy request was processed by host: 80461e934cfd</h1>
<h1>A healthy request was processed by host: 80461e934cfd</h1>
<h1>A healthy request was processed by host: 80461e934cfd</h1>
<h1>A healthy request was processed by host: 80461e934cfd</h1>
<h1>A healthy request was processed by host: 80461e934cfd</h1>
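
How long the unhealthy responses keep showing up can be estimated from the health check settings. This is a rough back-of-the-envelope sketch; it assumes every check waits the full interval plus maximum jitter, ignores the time curl itself takes, and both function names are hypothetical.

```python
def worst_case_detection_seconds(interval_s, jitter_s, unhealthy_threshold):
    """Upper bound on the time until a failing host is marked unhealthy,
    assuming every check waits the full interval plus the maximum jitter."""
    return unhealthy_threshold * (interval_s + jitter_s)


def failed_responses_seen(detection_s, request_period_s, host_count):
    """Rough number of client requests that still hit the failing host
    under round robin before it is removed."""
    total_requests = detection_s / request_period_s
    return int(total_requests / host_count)


detection = worst_case_detection_seconds(interval_s=10, jitter_s=1, unhealthy_threshold=6)
seen = failed_responses_seen(detection, request_period_s=0.5, host_count=2)
print(detection, seen)
```

With the values from envoy.yaml, the bound is about 66 seconds, during which roughly 66 requests could still land on the failing host.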

After Step.4, the architecture looks like this.

[Figure: architecture diagram after Step.4]

Step.5 "Healthy Services"

To recover from the failure, call the /healthy endpoint on 172.18.0.3. The cluster then goes back to two endpoints.

$ curl 172.18.0.3/healthy

$ curl localhost
<h1>A healthy request was processed by host: daab80baf4ae</h1>

$ curl localhost
<h1>A healthy request was processed by host: 80461e934cfd</h1>

$ curl localhost
<h1>A healthy request was processed by host: daab80baf4ae</h1>

Step.6 "Total Failure"

Next, to cause a total failure, call the /unhealthy endpoint on both 172.18.0.3 and 172.18.0.4.

$ curl 172.18.0.3/unhealthy

$ curl 172.18.0.4/unhealthy

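The "Try Envoy" instructions say Envoy returns 503, but that is not what happened. Under total failure, Envoy kept routing requests to all endpoints, which also differs from what I expected. A likely explanation is Envoy's panic threshold (healthy_panic_threshold, 50% by default): when the share of healthy hosts drops below it, Envoy ignores health status and balances across all hosts. Confirming this, for example by changing the setting, needs a bit more investigation.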

$ curl localhost -i
HTTP/1.1 500 Internal Server Error
(snip)
<h1>A unhealthy request was processed by host: daab80baf4ae</h1>

$ curl localhost -i
HTTP/1.1 500 Internal Server Error
(snip)
<h1>A unhealthy request was processed by host: 80461e934cfd</h1>
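
One possible reason for this behavior is Envoy's panic threshold: when the percentage of healthy hosts falls below a threshold (50% by default), the load balancer stops trusting health status and uses every host. I have not confirmed that this is what happened here, so treat the following as a simplified sketch of that rule, with `routable_hosts` as a hypothetical helper.

```python
def routable_hosts(health_by_host, panic_threshold_percent=50):
    """Return the hosts a load balancer would consider.

    Simplified model: when the share of healthy hosts falls below the
    panic threshold, health status is ignored and all hosts are used.
    """
    hosts = list(health_by_host)
    healthy = [h for h in hosts if health_by_host[h]]
    healthy_percent = 100 * len(healthy) / len(hosts)
    if healthy_percent < panic_threshold_percent:
        return hosts  # panic mode: route to every host
    return healthy


one_down = routable_hosts({"172.18.0.3": False, "172.18.0.4": True})
all_down = routable_hosts({"172.18.0.3": False, "172.18.0.4": False})
print(one_down, all_down)
```

Under this model, one failed host out of two (50% healthy) is still excluded as in Step.4, while two failed hosts (0% healthy) both receive traffic, which would explain the 500 responses instead of 503.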

Step.7 "Outlier Detection Configuration"

Step.7 and Step.8 try outlier detection. The documentation describes the two features as follows. Health checking and outlier detection can also be used together.

"Active" Health Checking : health checking
"Passive" Health Checking : outlier detection

As the words active and passive suggest, health checking sends dedicated probe requests to confirm that an endpoint is healthy, while outlier detection judges health dynamically from the responses that endpoints return to real traffic.

www.envoyproxy.io

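Because it judges from real responses, outlier detection needs very little configuration. envoy.yaml looks like the following: when an endpoint returns a 5xx response code 3 times in a row, it is removed from routing (this is called ejection) for the time set by base_ejection_time. After that it becomes a routing target again.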

clusters:
- name: targetCluster
  connect_timeout: 0.25s
  type: STRICT_DNS
  dns_lookup_family: V4_ONLY
  lb_policy: ROUND_ROBIN
  hosts: [
    { socket_address: { address: 172.18.0.5, port_value: 80 }},
    { socket_address: { address: 172.18.0.6, port_value: 80 }}
  ]
  outlier_detection:
    consecutive_5xx: "3"
    base_ejection_time: "30s"
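
The consecutive_5xx rule boils down to a per-host streak counter. Here is a minimal sketch of that idea, assuming that any non-5xx response resets the streak; `ConsecutiveErrors` is a hypothetical class, not part of Envoy.

```python
class ConsecutiveErrors:
    """Track consecutive 5xx responses for one host (hypothetical helper)."""

    def __init__(self, consecutive_5xx=3):
        self.limit = consecutive_5xx
        self.streak = 0

    def observe(self, status_code):
        """Record one response; return True when the host should be ejected."""
        if 500 <= status_code <= 599:
            self.streak += 1
        else:
            self.streak = 0  # any non-5xx response resets the streak
        if self.streak >= self.limit:
            self.streak = 0
            return True
        return False


tracker = ConsecutiveErrors()
interrupted = [tracker.observe(code) for code in (500, 500, 200, 500, 500)]
third_in_a_row = tracker.observe(500)
print(interrupted, third_in_a_row)
```

Two errors followed by a success do not trigger ejection; only the third error in an unbroken run does.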

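In practice, base_ejection_time is a little more involved. The documentation describes it as quoted below. In short, the actual ejection time is base_ejection_time multiplied by the number of times the host has been ejected, so an endpoint that keeps returning 5xx is ejected for longer and longer.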

The base time that a host is ejected for. The real time is equal to the base time multiplied by the number of times the host has been ejected.

www.envoyproxy.io
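
The quoted rule is simple arithmetic, so a tiny sketch is enough to see how the ejection time grows. It models only the multiplication described in the quote; any other limits Envoy may apply are not modeled, and `ejection_seconds` is a hypothetical name.

```python
def ejection_seconds(base_ejection_time_s, times_ejected):
    """Ejection duration per the quoted rule: base time x number of ejections."""
    return base_ejection_time_s * times_ejected


durations = [ejection_seconds(30, n) for n in (1, 2, 3)]
print(durations)
```

With base_ejection_time: "30s", the first three ejections last 30, 60, and 90 seconds.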

Step.8 "Testing Outlier Detection"

Finally, try outlier detection. Start two more katacoda/docker-http-server:healthy containers, then start another Envoy configured with outlier_detection.

$ docker run -d katacoda/docker-http-server:healthy

$ docker run -d katacoda/docker-http-server:healthy

$ docker run -d --name proxy2 -p 81:8080 \
    -v /root/:/etc/envoy \
    -v /root/envoy1.yaml:/etc/envoy/envoy.yaml \
    envoyproxy/envoy

While running curl every 0.5 seconds in a while loop, call the /unhealthy endpoint on 172.18.0.5: unhealthy appears 3 times in total and then stops. Outlier detection works as expected.

$ curl 172.18.0.5/unhealthy

$ while true; do curl localhost:81; sleep .5; done
<h1>A healthy request was processed by host: f586da250a29</h1>
<h1>A unhealthy request was processed by host: 932aa85e6053</h1>
<h1>A healthy request was processed by host: f586da250a29</h1>
<h1>A unhealthy request was processed by host: 932aa85e6053</h1>
<h1>A healthy request was processed by host: f586da250a29</h1>
<h1>A unhealthy request was processed by host: 932aa85e6053</h1>
<h1>A healthy request was processed by host: f586da250a29</h1>
<h1>A healthy request was processed by host: f586da250a29</h1>
<h1>A healthy request was processed by host: f586da250a29</h1>
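
The pattern in this output can be reproduced with a small simulation that combines round robin with a consecutive-5xx ejection rule. This is a simplified sketch under my own assumptions: the ejected host never comes back within the simulated window, at least one host stays healthy, and `simulate` is a hypothetical function.

```python
from itertools import cycle


def simulate(requests, failing_host, hosts=("f586da250a29", "932aa85e6053"), limit=3):
    """Replay round-robin requests with a consecutive-5xx ejection rule.

    Simplified: the ejected host never returns within the simulated window,
    and at least one host is assumed to stay healthy.
    """
    streak = {h: 0 for h in hosts}
    ejected = set()
    log = []
    rotation = cycle(hosts)
    while len(log) < requests:
        host = next(rotation)
        if host in ejected:
            continue  # skipped by the load balancer
        failed = host == failing_host
        streak[host] = streak[host] + 1 if failed else 0
        log.append((host, "unhealthy" if failed else "healthy"))
        if streak[host] >= limit:
            ejected.add(host)
    return log


log = simulate(requests=9, failing_host="932aa85e6053")
unhealthy_count = sum(1 for _, status in log if status == "unhealthy")
print(unhealthy_count, log[-1])
```

Over nine requests the failing host answers exactly three times before it is ejected, and every later request goes to the remaining host, matching the curl output above.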

Summary

Tried "Detecting Down Services with Health Checks" from "Try Envoy"
Learned that Envoy offers outlier detection in addition to health checking

On to the next one!