Reindexes documents from a Solr query to a destination collection. Quick tutorial here.
Download the uber-jar from releases and run it with Java (11+). Here's an example with all the options:
```shell
java -jar solr-reindexer.jar \
  -sourceCollection my_collection_v1 \
  -targetCollection my_collection_v2 \
  -uniqueKey id \
  -sourceZkAddress localhost:9983,localhost:2181 \
  -targetZkAddress zoo1:2181,zoo2:2181 \
  -skipFields _version_,text \
  -numWriteThreads 2 \
  -queueSize 10000 \
  -retries 7 \
  -retryInterval 2000 \
  -query "isDeleted:false AND isIgnored:false" \
  -rows 100
```
Only `sourceCollection` and `targetCollection` are mandatory. The rest are:
- `uniqueKey`: we use a cursor to go over the data. The cursor requires sorting on the `uniqueKey` defined in the schema, which in turn defaults to `id`
- `sourceZkAddress` and `targetZkAddress`: the ZooKeeper `host:port` for SolrCloud (source and destination). If there are more hosts, comma-separate them
- `skipFields`: we reindex all the stored and docValues fields by default, but some may need to be skipped, like the default `_version_` (which would break the reindex by causing version conflicts) or copyFields that are also stored (they'd duplicate values, because the copyField operation runs again on write). Comma-separate multiple fields
- `retries` and `retryInterval`: if we encounter an exception, we wait for `retryInterval` millis and retry up to `retries` times
- `queueSize`: the reader thread writes into an in-memory queue of this size (in pages; see `rows` below for the page size). Defaults to 100
- `numWriteThreads`: this many threads consume from the in-memory queue, writing to the target collection. Defaults to 2
- `query`: you may not want to reindex everything with the default `*:*`
- `rows`: we read one page of this size at a time, and also write one batch of this size at a time. Typically, the best performance is at around 1MB per batch. Defaults to 1000 rows per page/batch
You can start multiple instances of the reindexer, one per shard, by specifying `-sourceShards shard1` for one instance, `-sourceShards shard2` for another, and so on. You can also group N shards per reindexer instance by comma-separating the values, e.g. `-sourceShards shard1,shard2`.
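If you're scripting the launch of several instances, splitting the shard list into `-sourceShards` values is a one-liner round-robin. A small illustrative helper (not part of the tool):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: round-robin a collection's shard names into N groups,
// each rendered as a comma-separated value for -sourceShards.
public class ShardGroups {

    static List<String> groups(List<String> shards, int instances) {
        List<StringBuilder> buckets = new ArrayList<>();
        for (int i = 0; i < instances; i++) buckets.add(new StringBuilder());
        for (int i = 0; i < shards.size(); i++) {
            StringBuilder sb = buckets.get(i % instances);
            if (sb.length() > 0) sb.append(',');
            sb.append(shards.get(i));
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder sb : buckets) {
            if (sb.length() > 0) result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Three shards across two instances:
        System.out.println(groups(List.of("shard1", "shard2", "shard3"), 2));
        // [shard1,shard3, shard2]
    }
}
```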
Typically, the bottleneck is reading. You'll want to run the reindexer close to the source. The default of 2 write threads should keep up, unless the destination (or the network to it) is slow.
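The reader/writer split described by `queueSize`, `numWriteThreads`, `retries`, and `retryInterval` can be sketched with plain `java.util.concurrent` primitives. This is a stdlib-only simulation under assumed names (`PipelineDemo`, a counter standing in for the actual indexing call), not the tool's actual classes: one reader puts pages into a bounded queue, N writers drain it, and each write is wrapped in a retry loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical pipeline sketch: reader -> bounded queue -> writer threads.
public class PipelineDemo {
    // Distinct sentinel object marking the end of the page stream.
    static final List<String> POISON = new ArrayList<>();

    // Retry wrapper: rethrows only after `retries` failed attempts,
    // sleeping `retryIntervalMs` between them.
    static void writeWithRetry(Runnable write, int retries, long retryIntervalMs)
            throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                write.run();
                return;
            } catch (RuntimeException e) {
                if (attempt == retries) throw e;
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    // Push `pages` through the pipeline; returns how many pages were written.
    static int run(List<List<String>> pages, int queueSize, int numWriteThreads)
            throws Exception {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(queueSize);
        AtomicInteger written = new AtomicInteger();
        ExecutorService writers = Executors.newFixedThreadPool(numWriteThreads);
        for (int i = 0; i < numWriteThreads; i++) {
            writers.submit(() -> {
                try {
                    List<String> page;
                    while ((page = queue.take()) != POISON) {
                        // Real code would send this batch to the target
                        // collection; here the "write" just counts the page.
                        writeWithRetry(written::incrementAndGet, 2, 10);
                    }
                    queue.put(POISON); // pass the stop signal to the next writer
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (List<String> page : pages) queue.put(page); // the reader side
        queue.put(POISON);
        writers.shutdown();
        writers.awaitTermination(10, TimeUnit.SECONDS);
        return written.get();
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> pages = List.of(List.of("a", "b"), List.of("c"));
        System.out.println(run(pages, 10, 2)); // 2
    }
}
```

The bounded queue is what provides backpressure: when writers fall behind, `queue.put` blocks the reader instead of letting pages pile up in memory.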
Feel free to clone the repository, import it as a Gradle project, and add features. To build the uber-jar, use `gradle jar`.
Tentative roadmap:
- authentication support
- supporting non-SolrCloud
- using Export instead of Cursor