Fairly frequently (about once a month) we need to reindex our data, using
DIH to copy the data from one index to another.
Because the index is large, this can take from 12 to 24 hours to complete.
Meanwhile, the old index is still being queried by users.
Sometimes DIH is interrupted partway through by some unexpected exception,
caused by OutOfMemory or something else (many times it has failed when
more than 90% was completed).
On top of that, almost every time, some items are missing from the new
index. It is very hard to find them.
At that point I can't be sure exactly which documents were missed, so I
have to run it again and wait many hours. Meanwhile, the old index
constantly receives new items.
I want to suggest the following way to solve the problem:
• Get a list of all item IDs (by calling the Lucene API, as CLUE does, for example).
• Start DIH, which will iterate over those IDs and query for n items
at a time.
1. Of course, the original DIH class would have to be changed to support this.
• This will give the following advantages:
1. I will know exactly which items failed.
2. I can restart the process from any point and, in case of a DIH failure,
restart it from the point of failure.
So the main difference is that DIH currently runs on a *:* query, and I
suggest running it on a list of IDs.
For example, if I have 1000 docs and want this new DIH to take 100 docs at
a time, it will run 10 queries, each containing 100 IDs
(like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.).
The question is: what do you think about it? Or can all of this be done
another way, and I am trying to reinvent the wheel?
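To make the batching idea concrete, here is a minimal sketch in Python rather than a change to DIH itself. It assumes the IDs are plain tokens that need no query escaping, and `import_batch` is a hypothetical callable standing in for whatever actually copies one batch (DIH, SolrJ, or anything else). It only shows how the id:(...) queries could be generated, how failed batches could be recorded, and how a run could be resumed from a given batch number.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def batch_queries(ids: Iterable[str], batch_size: int) -> Iterator[Tuple[int, str]]:
    """Yield (batch_number, query) pairs, e.g. (0, 'id:(1 2 3)')."""
    batch: List[str] = []
    number = 0
    for doc_id in ids:
        batch.append(doc_id)
        if len(batch) == batch_size:
            yield number, "id:(" + " ".join(batch) + ")"
            batch = []
            number += 1
    if batch:  # final, possibly shorter, batch
        yield number, "id:(" + " ".join(batch) + ")"


def run_import(
    ids: Iterable[str],
    batch_size: int,
    import_batch: Callable[[str], None],  # hypothetical: imports one batch
    start_at: int = 0,
) -> List[int]:
    """Import batches from start_at onward; return the failed batch numbers."""
    failed: List[int] = []
    for number, query in batch_queries(ids, batch_size):
        if number < start_at:
            continue  # already handled in an earlier run
        try:
            import_batch(query)
        except Exception:
            failed.append(number)  # remember it and keep going
    return failed
```

With 1000 IDs and a batch size of 100 this produces 10 queries, and the returned list says which batches have to be repeated.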
asked Feb 20 2015 at 02:32 in Lucene-Solr-User by SolrUser1543

14 Answers

Personally, I much prefer indexing from an independent SolrJ client
to using DIH when I have to take explicit control of errors, etc.
Here's an example:
https://lucidworks.com/blog/indexing-with-solrj/
In your example, you seem to be assuming that the Lucene IDs
(and here I'm assuming you're not talking about the internal Lucene
ID) correspond to some kind of primary key in your database table.
But the correspondence isn't necessarily straightforward: how
would it handle composite keys?
I'll leave actual comments on DIH's internals to people who, you know,
actually understand the code ;)...
Erick
answered Feb 20 2015 at 12:28 by Erick Erickson
It's a little hard to get the overall context, e.g. why you live with OOME
as a matter of course, what the reasoning is for pulling from one index to
another, and what's added during this process.
Make sure that you are aware of
http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor which
queries another Solr, and
http://wiki.apache.org/solr/DataImportHandler#LogTransformer which you can
use to log recently imported IDs, so that you can restart indexing from
that point.
You can drop me more details in your native language if you wish.
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid
answered Feb 20 2015 at 12:55 by Mikhail Khludnev
My index has about 110 million documents. The index is split over several
shards.
Maybe that number is not so big, but each document is relatively large.
The reason to reindex is something like adding new fields, or adding an
update processor that extracts something from one field and puts it in
another, and so on.
Each time I need to reindex the data, I create a new collection and start
importing data from the old one.
This gives the update processors the opportunity to act.
DIH runs with a *:* query and takes some number of items each time.
In case of an exception, the process stops in the middle and I can't
restart from that point.
That's the reason I want to run on a predefined list of IDs.
In that case I will be able to restart from any point and to know which
IDs failed.
answered Feb 20 2015 at 13:57 by SolrUser1543
You can include information on a URL parameter and then use that URL
parameter inside your dih config. If the URL parameter is "idlist" then
you can use ${dih.request.idlist} in your SELECT statement.
Be aware that most servlet containers have a default header length limit
of about 8192 characters, affecting the length of the URL that can be
sent successfully. If the list of IDs is going to get huge, you will
either need to switch from a GET to a POST request where the parameter
is in the post body, or increase the header length limit in the servlet
container that is running Solr.
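As a rough illustration of that trade-off, the Python sketch below builds the DIH request with an "idlist" parameter and falls back to a POST body when the GET URL would come close to the header limit. The base URL, the `build_import_request` helper, and the safety margin are all assumptions made for the example; 8192 is only the typical default mentioned above, and the real limit depends on how the container is configured.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode

HEADER_LIMIT = 8192  # typical default; check the actual container config
SAFETY_MARGIN = 512  # assumed headroom for the request line and other headers


def build_import_request(
    base_url: str, id_list: List[str], limit: int = HEADER_LIMIT
) -> Tuple[str, str, Optional[str]]:
    """Return (method, url, body) for a DIH call carrying an 'idlist' parameter."""
    params = {
        "command": "full-import",
        "clean": "false",  # do not wipe the target index on each batch
        "idlist": " ".join(id_list),
    }
    encoded = urlencode(params)
    get_url = base_url + "?" + encoded
    if len(get_url) + SAFETY_MARGIN <= limit:
        return "GET", get_url, None
    # Too long for the URL: send the same parameters form-encoded in the body.
    return "POST", base_url, encoded
```

For example, `build_import_request("http://localhost:8983/solr/newcollection/dataimport", ["1", "2", "3"])` stays a GET, while a list of several thousand IDs is turned into a POST. The host and collection name here are placeholders.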
Thanks,
Shawn
answered Feb 20 2015 at 14:46 by Shawn Heisey
I realized after I sent this that you are not using a database ... the
list would simply go in the query you send to the other server. I don't
know whether the request that the SolrEntityProcessor sends is a GET or
a POST, so for a really large list of IDs, you might need to edit the
container config on both servers.
Thanks,
Shawn
answered Feb 20 2015 at 14:52 by Shawn Heisey
Yes, you're right, I am not using a DB.
SolrEntityProcessor uses the GET method, so I will need to send a
relatively big URL (something like hundreds of IDs). I hope that will be
possible.
Anyway, I think it is the only way to perform a reindex if I want to
control it and be able to continue from any point in case of failure.
answered Feb 21 2015 at 00:42 by SolrUser1543
Careful with the GETs! There is a real, hard limit on the length of a GET URL (in the low hundreds of characters). That's why a POST is so much better for complex queries; the limit is in the hundreds of megabytes.
answered Feb 21 2015 at 00:46 by steve
That's right, but since it works with GET, I am not sure I will be able to
use POST without changing it.
answered Feb 21 2015 at 00:52 by SolrUser1543
And I'm familiar with the setup and configuration using Python, JavaScript, and PHP; not at all with Java.
answered Feb 21 2015 at 01:12 by steve
The HTTP protocol does not set a limit on GET URL size, but individual web servers usually do. You should get a response code of 414 Request-URI Too Long when the URL is too long.
This limit is usually configurable.
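One way a client could react to a 414 is to halve the batch and retry, as in the Python sketch below. This is only an illustration under stated assumptions: `send` is a hypothetical callable that issues the request for a list of IDs and returns the HTTP status code, and any status other than 200 or 414 is simply treated as a failure.

```python
from typing import Callable, List


def send_with_split(
    ids: List[str], send: Callable[[List[str]], int]
) -> List[List[str]]:
    """Send ids; on HTTP 414 split the batch in half and retry each half.

    Returns the list of batches that were eventually accepted with a 200.
    """
    status = send(ids)  # hypothetical: performs the request, returns the status
    if status == 200:
        return [ids]
    if status != 414:
        raise RuntimeError("request failed with HTTP status %d" % status)
    if len(ids) <= 1:
        raise ValueError("a single ID is already too long for the server")
    mid = len(ids) // 2
    return send_with_split(ids[:mid], send) + send_with_split(ids[mid:], send)
```

The accepted batches cover every ID exactly once and in the original order, so the caller can still tell which IDs have been sent.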
wunder
Walter
answered Feb 21 2015 at 09:50 by Walter Underwood
Thank you! Another 4xx error that makes sense. Quoting from the Book of StackOverflow:
http://stackoverflow.com/questions/2659952/maximum-length-of-http-get-request
"Most webservers have a limit of 8192 bytes (8KB), which is usually configureable somewhere in the server configuration. As to the client side matter, the HTTP 1.1 specification even warns about this, here's an extract of chapter 3.2.1:
Note: Servers ought to be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations might not properly support these lengths.
The limit is in MSIE and Safari about 2KB, in Opera about 4KB and in Firefox about 8KB. We may thus assume that 8KB is the maximum possible length and that 2KB is a more affordable length to rely on at the server side and that 255 bytes is the safest length to assume that the entire URL will come in.
If the limit is exceeded in either the browser or the server, most will just truncate the characters outside the limit without any warning. Some servers however may send a HTTP 414 error. If you need to send large data, then better use POST instead of GET. Its limit is much higher, but more dependent on the server used than the client. Usually up to around 2GB is allowed by the average webserver. This is also configureable somewhere in the server settings. The average server will display a server-specific error/exception when the POST limit is exceeded, usually as HTTP 500 error."
answered Feb 21 2015 at 13:37 by steve
The limit on a GET command (including the GET itself and the protocol
specifier, usually HTTP/1.1) is normally 8K, or 8192 bytes. That's the
default value in Jetty, at least.
A question for the experts: Would it be a good idea to force a POST
request in SolrEntityProcessor? It may be dealing with parameters that
have been sent via POST and may exceed the header size limit.
Thanks,
Shawn
answered Feb 21 2015 at 16:45 by Shawn Heisey
Am I an expert? Not sure, but I worked on an enterprise search spider and search engine for about a decade (Ultraseek Server) and I've done customer-facing search for another 6+ years.
Let the server reject URLs it cannot handle. Great servers will return a 414, good servers will return a 400, broken servers will return a 500, and crapulous servers will hang. In nearly all cases, you'll get a fast fail which won't hurt other users of the site.
Manage your site for zero errors, so you can fix the queries that are too long.
At Chegg, we have people paste entire homework problems into the search for homework solutions, and, yes, we have a few queries longer than 8K. But we deal with it gracefully.
Never do POST for a read-only request. Never. That only guarantees that you cannot reproduce the problem by looking at the logs.
If your design requires extremely long GET requests, you may need to re-think your design.
wunder
Walter
answered Feb 21 2015 at 17:33 by Walter Underwood
I agree with those sentiments ... but those who consume the services we
provide tend to push the envelope well beyond any reasonable limits.
My Solr install deals with some Solr queries where the GET request is
pushing 20000 characters. The queries and filters constructed by the
website code for some of the more powerful users are really large. I
had to configure haproxy and jetty to allow HTTP headers up to 32K. I'd
like to tell development that we just can't handle it, but with the way
the system is currently structured, there's no other way to get the
results they need.
If I were to make it widely known internally that the Solr config is
currently allowing POST requests up to 32 megabytes, I am really scared
to find out what sort of queries development would try to do. I raised
that particular configuration limit (which defaults to 2MB) for my own
purposes, not for the development group.
Thanks,
Shawn
answered Feb 21 2015 at 18:20 by Shawn Heisey