Saturday, February 14, 2015

Splitting Elasticsearch Index

Elasticsearch doesn't provide a facility for splitting an index. The main reason may be that the Elasticsearch nodes may not be able to hold the intermediate data created while splitting an index. So, if we need to split an index, we have to do something like this: (a) create two new indices, and (b) reindex the data from the original index into the new indices, adding alternate documents to each of the two.
The problem with this approach is that, most of the time, we disable storing the source documents in the index. For example, we may index 5 petabytes of data, but we may not want to store the documents themselves in the index, as that would result in a very large index. So, for re-indexing, we need to have all the original documents somewhere else. We cannot just fetch them from the original index itself.
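
To make the "alternate documents" idea concrete, here is a minimal Python sketch. It assumes the original documents are still available as some iterable, which is exactly what the approach requires. The function name assign_alternately is made up for this illustration, and the sketch only decides which index each document would go to; it does not talk to Elasticsearch.

```python
def assign_alternately(documents, first_index, second_index):
    """Yield (target_index, document) pairs, alternating between the two new indices."""
    targets = (first_index, second_index)
    for position, document in enumerate(documents):
        # Even positions go to the first index, odd positions to the second.
        yield targets[position % 2], document
```

Each pair could then be fed to whatever indexing client is in use.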

But sometimes we may still want to split an existing index when it grows very large, typically because an index that is too big takes a performance hit.

So, I came up with the approach below, which worked fine for me. Hopefully it will be useful for you as well.

Let us assume we have an index "original-index" and we want to split it into "original-index-firsthalf" and "original-index-secondhalf".

Basically, we need to follow the steps below.

  • Create an index original-index-firsthalf with the same settings as original-index, and put the same mappings on the new index.
  • Stop adding new docs to original-index-firsthalf and original-index until the splitting is over.
  • Flush original-index
  • Shut down the Elasticsearch nodes
  • Copy (with scp or something similar) the Lucene indices in the shards from original-index to original-index-firsthalf. We need to copy the index directory of shard 0 in the source to the index directory of shard 0 in the destination (e.g. original-index/0/index/* is copied to original-index-firsthalf/0/index/*). The same needs to be repeated for all other shards (and for the replicas as well).
  • Restart the Elasticsearch cluster
  • Now original-index and original-index-firsthalf contain the same indexed documents and will return the same search results
  • Let us assume the indices have two mappings, mapping1 and mapping2, for two types, type1 and type2, and that the mappings contain the fields mapping1.date1 and mapping2.date2, both of type "date". (We may choose to split on the basis of some other mapping field as well; I am choosing date fields just for this example.)
  • Let us assume the docs in type1 have values for mapping1.date1 in the range start_date to end_date and, for simplicity, that the docs in type2 have dates in the same range (from start_date to end_date). Let middle_date be the date that lies roughly halfway between start_date and end_date.
  • From original-index-firsthalf, delete all the documents in type1 and type2 that match the queries "type1.date1 >= middle_date" and "type2.date2 >= middle_date" respectively.
  • From original-index, delete all the documents in type1 and type2 that match the queries "type1.date1 < middle_date" and "type2.date2 < middle_date" respectively.
  • Optimize original-index and original-index-firsthalf.
  • Now original-index-firsthalf and original-index each contain roughly half of the documents from the original index, and they don’t share any documents
  • Finally, we may create an alias original-index-secondhalf for original-index, or simply create an original-index-secondhalf index, replace its data with that of original-index, and then delete original-index.
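
The shard copy step in the list above can be sketched in Python for a single node. This is only a rough illustration under several assumptions: Elasticsearch is already stopped, both indices sit side by side under one directory as <index>/<shard>/index/, and the shards of both indices live on the same node. The function name copy_shard_segments and its parameters are made up for this example; the real data path depends on your installation and Elasticsearch version.

```python
import shutil
from pathlib import Path


def copy_shard_segments(indices_dir, source_index, target_index):
    """Copy each shard's Lucene files from source_index into target_index.

    Assumes a layout of <indices_dir>/<index>/<shard>/index/ and that the
    target index was already created with the same number of shards.
    Returns the names of the shards that were copied.
    """
    indices_dir = Path(indices_dir)
    copied = []
    for shard_dir in sorted((indices_dir / source_index).iterdir()):
        src = shard_dir / "index"
        if not src.is_dir():
            continue  # not a shard directory (for example index-level state)
        dst_shard = indices_dir / target_index / shard_dir.name
        if not dst_shard.is_dir():
            # The matching target shard is not on this node; do not guess.
            raise FileNotFoundError(f"missing target shard directory: {dst_shard}")
        dst = dst_shard / "index"
        if dst.exists():
            shutil.rmtree(dst)  # drop the empty segments of the new index
        shutil.copytree(src, dst)
        copied.append(shard_dir.name)
    return copied
```

On a multi-node cluster something like this would have to run on every node that holds a primary or replica shard, and only where the source and target shards are allocated to the same node; otherwise the files have to be moved between machines, as with the scp approach.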

This may be useful when we want to split big indices into smaller indices (with the same number of shards), as we don’t have to re-index all the documents again. I could have written a shell script to demonstrate the operation, but I don't have time today. I will post a shell script shortly for your benefit :)
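
In the meantime, here is a rough Python sketch of the delete and optimize steps. It only builds the list of REST requests (method, path, body) and does not send anything. The paths assume the delete-by-query (/_query) and /_optimize endpoints of the Elasticsearch 1.x line, and the plain field names date1 and date2 inside each type are also an assumption; please check both against the version you run. The function names are made up for this example.

```python
from datetime import datetime


def middle_date(start_date, end_date):
    """Return the instant halfway between two datetimes."""
    return start_date + (end_date - start_date) / 2


def range_delete_body(field, operator, date):
    """Build a delete-by-query body; operator is a range keyword such as 'gte' or 'lt'."""
    return {"query": {"range": {field: {operator: date.isoformat()}}}}


def split_requests(original, first_half, splits, middle):
    """List (method, path, body) tuples for the delete and optimize steps.

    splits maps each type name to the date field to split on,
    e.g. {"type1": "date1", "type2": "date2"}.
    """
    requests = []
    for doc_type, field in sorted(splits.items()):
        # The first half keeps documents before the middle date, so drop the rest.
        requests.append(("DELETE", f"/{first_half}/{doc_type}/_query",
                         range_delete_body(field, "gte", middle)))
        # The original index keeps documents on or after the middle date.
        requests.append(("DELETE", f"/{original}/{doc_type}/_query",
                         range_delete_body(field, "lt", middle)))
    # Merge segments afterwards so the deleted documents are actually expunged.
    for index in (original, first_half):
        requests.append(("POST", f"/{index}/_optimize", None))
    return requests
```

A call could look like split_requests("original-index", "original-index-firsthalf", {"type1": "date1", "type2": "date2"}, middle_date(start, end)), where start and end are datetime values for start_date and end_date. Printing the resulting tuples before sending them with an HTTP client is a cheap way to review what is about to be deleted.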
