Elasticsearch shard allocation failures: causes and fixes for "X of Y shards failed"

July 10, 2023 · about 9 min read

Kibana occasionally shows an error message such as "X of Y shards failed". A common cause is UNASSIGNED shards.

If you check the cluster status when this happens, you will typically find it is Yellow or Red.

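A shard in ES is in one of four states:

INITIALIZING - the shard is being initialized. This occurs briefly when a new index is created or a node starts up, and the shard cannot be used in this state.
RELOCATING - the shard is being moved to another node, for example when a node joins or leaves the cluster and shards are reallocated. This state is also temporary.
STARTED - the shard is active.
UNASSIGNED - the shard could not be assigned to a node.

So when do UNASSIGNED shards occur?

The number of replica shards is set too high and there are not enough nodes to hold them
Shard data has been lost
Disk space is running low

Checking cluster status and the cause of shard allocation failure

First, use the Cluster health API to check the cluster status and the state of shard allocation.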

GET /_cluster/health?pretty

The output looks like this:

{
  "cluster_name" : "my-application",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 565,
  "active_shards" : 565,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 60,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 90.4
}

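The main fields to check are status, unassigned_shards, active_shards_percent_as_number, and delayed_unassigned_shards.

status is the cluster status and takes one of three values:

green - all shards are assigned.
yellow - all primary shards are assigned, but one or more replica shards are not. A node failure in this state can lead to data loss.
red - one or more primary shards are unassigned, so part of the data is unavailable. This can also occur briefly while nodes are starting up.

unassigned_shards is the number of shards that have not been assigned. It is usually greater than 0 when the cluster status is Yellow or Red.

active_shards_percent_as_number is the percentage of shards that are active (assigned). The lower the value, the more UNASSIGNED shards there are.

delayed_unassigned_shards is the number of shards whose allocation is being held back after a node went down, until index.unassigned.node_left.delayed_timeout (default: 1 minute) has elapsed.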
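As a rough illustration of how these fields relate, the sketch below recomputes the active percentage from the shard counts in the sample output above. The function name summarize_health is hypothetical, and the sketch assumes the total shard count is the sum of active, initializing, and unassigned shards (relocating shards being counted as active).

```python
def summarize_health(health: dict) -> dict:
    """Derive a few headline numbers from a Cluster health response."""
    active = health["active_shards"]
    # Assumption: shards that are not active are either initializing or unassigned.
    pending = health["initializing_shards"] + health["unassigned_shards"]
    total = active + pending
    percent = 100.0 * active / total if total else 100.0
    return {
        "status": health["status"],
        "total_shards": total,
        "active_percent": round(percent, 1),
        "needs_attention": health["status"] != "green",
    }


# Values copied from the sample Cluster health output.
sample = {
    "status": "yellow",
    "active_shards": 565,
    "initializing_shards": 0,
    "unassigned_shards": 60,
}
summary = summarize_health(sample)
print(summary["total_shards"], summary["active_percent"])  # 625 90.4
```

565 active shards out of 625 gives 90.4, which agrees with the active_shards_percent_as_number value in the sample output.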

Run the cat shards API to list the unassigned shards.

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason

Or use curl and filter for UNASSIGNED shards only:

curl -s -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
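If you would rather post-process the cat shards output in a script than with grep, a minimal sketch might look like the following. It assumes the five columns requested above (index, shard, prirep, state, unassigned.reason) separated by whitespace; the function name and the sample text are hypothetical and only show the expected shape.

```python
def unassigned_shards(cat_output: str) -> list[dict]:
    """Return the UNASSIGNED rows of a `_cat/shards` text response as dicts."""
    columns = ["index", "shard", "prirep", "state", "reason"]
    rows = []
    for line in cat_output.splitlines():
        fields = line.split()
        # The state is the fourth column; skip blank or non-UNASSIGNED rows.
        if len(fields) < 4 or fields[3] != "UNASSIGNED":
            continue
        rows.append(dict(zip(columns, fields)))
    return rows


# Hypothetical sample: started shards have an empty unassigned.reason column.
sample_output = """\
my-index 0 p STARTED
my-index 0 r UNASSIGNED CLUSTER_RECOVERED
my-index 1 p STARTED
my-index 1 r UNASSIGNED CLUSTER_RECOVERED
"""
for row in unassigned_shards(sample_output):
    print(row["index"], row["shard"], row["prirep"], row["reason"])
```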

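To see the detailed reason why a shard could not be allocated, call the Cluster allocation explain API. Called without a request body, it explains an arbitrary unassigned shard: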

GET /_cluster/allocation/explain?pretty

Output:

{
  "index" : "my-index",
  "shard" : 2,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2023-02-06T06:34:22.345Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "dntxO1EFQVSzk7A4n45OLQ",
      "node_name" : "node-1",
      "transport_address" : "52.208.205.70:9300",
      "node_attributes" : {
        "ml.machine_memory" : "33737449472",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[my-index][2], node[dntxO1EFQVSzk7A4n45OLQ], [P], s[STARTED], a[id=ag-Komw1QJu9EKYkrsrdmw]]"
        }
      ]
    }
  ]
}
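The part of this response that matters most is the list of deciders that answered NO. As a sketch, assuming the response has already been parsed into a Python dict, the hypothetical helper blocking_reasons below pulls out those entries; the sample is a trimmed copy of the output above.

```python
def blocking_reasons(explain: dict) -> list[tuple[str, str, str]]:
    """List (node_name, decider, explanation) for every decider that said NO."""
    reasons = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider["decision"] == "NO":
                reasons.append(
                    (node["node_name"], decider["decider"], decider["explanation"])
                )
    return reasons


# Trimmed copy of the allocation explain output shown above.
sample_explain = {
    "index": "my-index",
    "shard": 2,
    "primary": False,
    "can_allocate": "no",
    "node_allocation_decisions": [
        {
            "node_name": "node-1",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "the shard cannot be allocated to the same "
                    "node on which a copy of the shard already exists",
                }
            ],
        }
    ],
}
for node_name, decider, explanation in blocking_reasons(sample_explain):
    print(f"{node_name}: {decider}: {explanation}")
```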

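Now that we know why the shard could not be allocated, let's look at how to fix it.

Solutions

The number of replica shards is set too high and there are not enough nodes to hold them

Shards are divided into primary shards and replica shards. The data (documents) in an index is distributed across a fixed number of primary shards stored on the nodes, and the number_of_shards setting determines how many primary shards the index has. Each primary shard in turn has a number of copies, called replica shards, and the number_of_replicas setting determines how many. These values are usually specified when creating an index or defining an index template.

One rule to keep in mind: a replica of a primary shard must be stored on a different node from that primary shard. This means the replicas cannot all be allocated when the number of replicas is equal to or greater than the number of nodes, so keep to the formula N (number of nodes) >= R (number of replicas) + 1. The service itself is not affected in this situation, but from ES's point of view the configuration is not reasonable, because it cannot allocate as many replicas as number_of_replicas asks for. The most common mistake is leaving number_of_replicas at 1 (the default) on a single-node cluster.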
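The N >= R + 1 rule can be turned into a quick back-of-the-envelope check. The sketch below is only an illustration: both function names are hypothetical, node_count means data nodes, and other allocation rules (disk watermarks, allocation filtering, and so on) are ignored.

```python
def max_allocatable_replicas(node_count: int) -> int:
    """Largest number_of_replicas that satisfies N >= R + 1."""
    return max(node_count - 1, 0)


def unassigned_replicas(node_count: int, number_of_shards: int,
                        number_of_replicas: int) -> int:
    """Replica shards of one index that cannot be placed on node_count nodes."""
    missing_per_primary = max(
        number_of_replicas - max_allocatable_replicas(node_count), 0
    )
    return missing_per_primary * number_of_shards


# Single node, 5 primaries, default of 1 replica: every replica stays unassigned.
print(unassigned_replicas(node_count=1, number_of_shards=5, number_of_replicas=1))  # 5
# Three nodes can hold a primary plus 2 replicas of each shard.
print(unassigned_replicas(node_count=3, number_of_shards=5, number_of_replicas=2))  # 0
```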

The /_cluster/allocation/explain output shown earlier is exactly this case. It contains the following explanation:

the shard cannot be allocated to the same node on which a copy of the shard already exists

The fix is either to add nodes or to reduce the number of replicas.

PUT /my-index-000001/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}

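Shard data has been lost

This is the case where the data of a primary shard has been lost. It usually happens when a node goes down or a disk is damaged before a replica has been created or allocated.

Calling the Cluster allocation explain API in this situation shows an explanation like the following:

"allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",

You then have to decide whether to recover the problematic node and have it rejoin the cluster, or to accept the data loss and force the allocation of an empty primary. If the data loss is acceptable, allocate an empty primary with the following API call.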

POST /_cluster/reroute?pretty
{
    "commands" : [
        {
          "allocate_empty_primary" : {
                "index" : "<INDEX_NAME>",
                "shard" : 0,
                "node" : "<NODE_NAME>",
                "accept_data_loss" : "true"
          }
        }
    ]
}

When using the allocate_empty_primary command, the "accept_data_loss" : "true" option is mandatory. It is a reminder to run the command only if you are prepared to lose the data.
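If you generate this request from a script, it can help to make the data-loss acknowledgement just as explicit on the client side. The following is a minimal sketch under that assumption; the helper name allocate_empty_primary_body is hypothetical, and it only builds the request body for POST /_cluster/reroute without sending anything.

```python
import json


def allocate_empty_primary_body(index: str, shard: int, node: str,
                                accept_data_loss: bool = False) -> dict:
    """Build a reroute body, refusing unless data loss is explicitly accepted."""
    if accept_data_loss is not True:
        raise ValueError(
            "allocate_empty_primary discards the shard's data; "
            "pass accept_data_loss=True to confirm"
        )
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Hypothetical index and node names, for illustration only.
body = allocate_empty_primary_body("my-index", 0, "node-1", accept_data_loss=True)
print(json.dumps(body, indent=2))
```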

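Disk space is running low

When a node is short on disk space, the master node cannot allocate shards to it. By default, once disk usage exceeds 85% the node is considered to have crossed the low disk watermark and no longer receives new shard allocations.

Use the cat allocation API to check how many shards each node holds and how much of its disk is in use.

GET /_cat/allocation?v

The options are to delete unneeded indices, add nodes, or increase the available disk space.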
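To spot the affected nodes quickly, you could compare each node's disk usage against the watermark. The sketch below assumes the cat allocation rows were fetched as JSON (for example with ?format=json) and that each row carries node and disk.percent fields; the function name and the sample rows are hypothetical.

```python
def nodes_over_watermark(rows: list[dict],
                         low_watermark_percent: float = 85.0) -> list[str]:
    """Return the names of nodes whose disk usage exceeds the low watermark."""
    flagged = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            # Rows without disk figures (such as an UNASSIGNED summary row) are skipped.
            continue
        if float(percent) > low_watermark_percent:
            flagged.append(row["node"])
    return flagged


# Hypothetical rows; cat APIs return numbers as strings in JSON output.
sample_rows = [
    {"node": "node-1", "shards": "320", "disk.percent": "91"},
    {"node": "node-2", "shards": "305", "disk.percent": "40"},
    {"node": "UNASSIGNED", "shards": "60", "disk.percent": None},
]
print(nodes_over_watermark(sample_rows))  # ['node-1']
```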

When the disks are large enough, the default low disk watermark may be too low. Use the Cluster update settings API to set a more suitable value.

Keep in mind that this value acts as a safety margin, so choose it based on practical factors such as the actual data growth rate.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%"
  }
}

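To change the value only temporarily, replace persistent with transient in the request above; transient settings do not survive a full cluster restart. To change it permanently, use persistent.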

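Conclusion

Unassigned shards show up in ES as an unhealthy cluster status. We have seen that an unreasonable shard count setting, node failures, and low disk space can all cause UNASSIGNED shards. Some of these cases do not affect the service, but they still deserve attention because ES does not consider the cluster to be in an optimal state.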
