The term “reindex” is not a special thing you can do with Solr. It literally means “index again.” You just have to restart Solr (or reload your core), possibly delete the existing index, and then repeat whatever actions you took to build your index in the first place.
Indexing (and reindexing) is not something that just happens. Solr has no ability to initiate indexing itself. There is the dataimport handler, but it will not do anything until it is called by something external to Solr.
Indexing is something that can be manually done by a person or automatically done by a program, but it is always external to Solr. There is an issue in the bugtracker for adding dataimport handler scheduling to Solr, but it is meeting with committer resistance, because *ALL* modern operating systems have a scheduling capability built in. Also, that would mean that Solr can change your index without external action, which is generally considered a bad idea by committers.
Depending on your setup and goals, you may need to delete all documents before you begin your indexing process. Sometimes it is necessary to delete your index directory entirely before you restart Solr or reload your core.
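Deleting all documents is usually done with a delete-by-query matching everything, followed by a commit. Here is a minimal sketch in Python using only the standard library; the core name `mycore` and the URL are placeholders for your own setup:

```python
import urllib.request

# Hypothetical update endpoint; substitute your own host and core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"

def delete_all_payload():
    # An XML update message whose delete-by-query matches every document.
    return "<delete><query>*:*</query></delete>"

def send_update(body, commit=True):
    # POST the update message; commit=true makes the deletion visible immediately.
    # This only works against a running Solr instance.
    url = SOLR_UPDATE_URL + ("?commit=true" if commit else "")
    req = urllib.request.Request(url, data=body.encode("utf-8"),
                                 headers={"Content-Type": "text/xml"})
    return urllib.request.urlopen(req)
```

Note that this clears the documents but does not remove the index files themselves; for that you still have to delete the index directory while Solr is stopped (or the core is unloaded).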
It’s reasonable to wonder why deleting the existing data and rebuilding the index is necessary. Here’s why: when you change your schema, nothing happens to the data already in the index. When Solr accesses that existing data, it uses the schema as a guide to interpreting it. If the index contains documents with a field built with the SortableIntField class and Solr then tries to access that data with a different class (such as TrieIntField), there’s a good chance an unrecoverable error will occur.
“From my experience, indexing big chunks of data can take a while. The index I’m working on has 2M items (size: 10GB). A full index takes about 40 hours using the DB.
There are some factors that might slow you down:
- Memory. One thing is having memory on the box; the other is allowing Solr to use it. Give Solr as much as you can afford at indexing time (you can easily change that later).
- Garbage collector. With the default one we had a lot of problems (after 20–30 hours, indexing was interrupted and we had to start from the beginning).
- Make Solr cache results from the DB.
- Check how expensive each of your queries is.
- Index in smaller batches. Indexing 300k items in one go takes much longer than indexing them in 3 batches of 100k.
- Having lots of big stored full-text fields does not help (if you don’t need to store something, don’t store it).”
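The batching advice above can be sketched as a simple chunking loop. The `send_batch` callback stands in for whatever actually posts one batch of documents to Solr’s update handler; both names are illustrative:

```python
def in_batches(docs, batch_size):
    # Yield successive slices of at most batch_size documents.
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]

def index_all(docs, send_batch, batch_size=100_000):
    # send_batch is whatever sends one batch to Solr (e.g. an /update request).
    # Committing once per batch, rather than per document, keeps overhead down.
    for batch in in_batches(docs, batch_size):
        send_batch(batch)
```

Smaller batches also mean that a failure partway through (such as the GC-related interruptions mentioned above) only costs you the current batch, not the whole run.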
You may also want to revisit some settings in SolrConfigXml.
See also “The Seven Deadly Sins of Solr.”
Duplicates in Solr index – items added two or more times:
“Actually, all added documents will have an auto-generated unique key, through Solr’s own uuid type:
<field name="uid" type="uuid" indexed="true" stored="true" default="NEW"/>
So any document added to the index will be considered a new one, since it gets a GUID. However, I think we’ve got a problem with some other code here, code that adds items to the index when they are updated, instead of just updating them […]
OK, it turned out there were a couple of bugs in the code updating the index. Instead of updating, we always had a document added to the index, even though it already existed.”
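One way to avoid this whole class of duplicates is to make the unique key an application-supplied, stable identifier rather than an auto-generated UUID: when you re-add a document whose uniqueKey value already exists, Solr replaces the old document instead of creating a second one. A sketch of what that schema.xml change might look like (field and type names here are illustrative):

```xml
<!-- A stable, application-supplied key: re-adding a document with the
     same id replaces the old one instead of duplicating it. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>
```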