Resolving Elasticsearch Unassigned Shard Issues: X of Y shards failed
When opening Kibana, you may occasionally see the error message “X of Y shards failed”, which usually indicates that some indices have unassigned shards. In such cases, the Elasticsearch (ES) cluster status is typically Yellow or Red.
Let’s first understand the four shard states in ES:
- INITIALIZING - The shard is initializing and is not yet available. This occurs briefly when an index is created or a node starts.
- RELOCATING - The shard is being moved because a node was added or removed; this is a transient state.
- STARTED - The shard is active and available to handle requests.
- UNASSIGNED - The shard could not be allocated to any node.
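A quick way to see these states in practice is the cat shards API (also used later for diagnosis; the h parameter selects which columns to display):
GET _cat/shards?v&h=index,shard,prirep,state,node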
What Causes Shard Allocation Failures?
Shard allocation failures can occur due to the following reasons:
- The number of replica shards is set too high for the available nodes to allocate them all.
- The delayed-allocation mechanism is triggered when a node goes offline.
- Shard data has been lost.
- Nodes have insufficient disk space.
Check Cluster Status and Diagnose Allocation Failures
First, use the Cluster health API to inspect the cluster status and the overall shard allocation:
GET /_cluster/health?pretty
Example response:
{
  "cluster_name" : "my-application",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 565,
  "active_shards" : 565,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 60,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 90.4
}
Pay attention to the following fields:
- status: Indicates the cluster status:
  - green: All shards are allocated.
  - yellow: All primary shards are allocated, but some replica shards are not. A node failure could make part of the data unavailable.
  - red: Some primary shards are unallocated, making some data unavailable.
- unassigned_shards: Number of unallocated shards.
- active_shards_percent_as_number: Percentage of allocated shards.
- delayed_unassigned_shards: Number of shards waiting for allocation after a node leaves the cluster.
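The health API can also break the status down per index via the level parameter, which helps pinpoint which indices hold the unassigned shards:
GET /_cluster/health?level=indices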
Next, use the cat shards API to view the allocation status of all shards:
GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
Or, in a shell, filter for unassigned shards:
curl -s -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
To investigate the cause of unassigned shards:
GET /_cluster/allocation/explain?pretty
Example output:
{
  "index" : "my-index",
  "shard" : 2,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2023-02-06T06:34:22.345Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "dntxO1EFQVSzk7A4n45OLQ",
      "node_name" : "node-1",
      "transport_address" : "52.208.205.70:9300",
      "node_attributes" : {
        "ml.machine_memory" : "33737449472",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[my-index][2], node[dntxO1EFQVSzk7A4n45OLQ], [P], s[STARTED], a[id=ag-Komw1QJu9EKYkrsrdmw]]"
        }
      ]
    }
  ]
}
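By default, the explain API reports on the first unassigned shard it finds. To examine a specific shard, pass a request body (the values below match the example above):
GET /_cluster/allocation/explain
{
  "index" : "my-index",
  "shard" : 2,
  "primary" : false
}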
Resolve Common Causes of Unassigned Shards
Excessive Replica Shards with Insufficient Nodes
Replica shards cannot be allocated on the same node as their primary shard. If number_of_replicas + 1 exceeds the number of data nodes, some replica shards cannot be placed anywhere and remain unassigned. For example, setting number_of_replicas to 1 on a single-node cluster will cause this issue.
Solution: Increase the number of nodes or reduce the number of replicas:
PUT /my-index-000001/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}
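If many indices are affected, the same setting can be applied to all of them at once with the _all index target; use this with care, as it removes replica redundancy everywhere:
PUT /_all/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}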
Delayed Allocation Mechanism
When a node goes offline, Elasticsearch delays shard reallocation to avoid excessive rebalancing. This delay is controlled by the index.unassigned.node_left.delayed_timeout setting (default: 1 minute).
Solution: You can modify the delay period manually; setting it to 0 makes unassigned shards eligible for reallocation immediately:
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "0"
  }
}
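Conversely, before planned node maintenance you can raise the delay so that shards are not rebalanced unnecessarily while the node restarts (5m here is just an example value):
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}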
Shard Data Loss
If a primary shard is unallocated and no replicas exist, the shard data is considered lost.
Solution: Forcefully allocate an empty primary shard (data loss is inevitable):
POST /_cluster/reroute?pretty
{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "<INDEX_NAME>",
        "shard" : 0,
        "node" : "<NODE_NAME>",
        "accept_data_loss" : true
      }
    }
  ]
}
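If an out-of-date copy of the shard still exists on some node's disk, allocate_stale_primary is a less destructive alternative: it promotes the stale copy and loses only the writes that copy missed (placeholders as above):
POST /_cluster/reroute?pretty
{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "<INDEX_NAME>",
        "shard" : 0,
        "node" : "<NODE_NAME>",
        "accept_data_loss" : true
      }
    }
  ]
}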
Insufficient Disk Space
When a node's disk usage exceeds the low disk watermark (85% by default), Elasticsearch stops allocating new shards to it.
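To check each node's disk usage and shard count before acting, use the cat allocation API:
GET _cat/allocation?v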
Solution: Free up disk space, add more nodes, or increase the watermark threshold:
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%"
  }
}
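Note that Elasticsearch stops retrying a shard after several failed allocation attempts (index.allocation.max_retries, 5 by default). After resolving the underlying cause, such as freeing disk space, you may need to trigger a retry manually:
POST /_cluster/reroute?retry_failed=true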
Conclusion
Unassigned shards are a sign of degraded cluster health, typically indicated by a Yellow or Red cluster status. By diagnosing the cause and taking the appropriate corrective action, you can resolve shard allocation issues and keep your Elasticsearch cluster healthy.