Relatively frequently (about once a month) we need to reindex our data, by
using DIH and copying the data from one index to another.
Because we have a large index, this can take from 12 to 24 hours to
complete. At the same time the old index is being queried by users.
Sometimes DIH gets interrupted in the middle by some unexpected exception
caused by OutOfMemory or something else (many times it has failed when more
than 90% was completed).
More than this, almost every time some items are missing from the new index,
and it is very complicated to find them.
At this stage I can't be sure exactly which documents were missed, and I
have to do it all again and wait for many hours. At the same time the old
index constantly receives new items.
I want to suggest the following way to solve the problem:
• Get a list of all item ids (call the Lucene API, like CLUE does, for example).
• Start DIH, which will iterate over those ids and each time issue a
query for n items.
  1. Of course, the original DIH class would have to be changed to support this.
• This will give the following advantages:
  1. I will know exactly which items failed.
  2. I can restart the process from any point, and in case of a DIH failure
restart it from the point of failure.
So the main difference is that DIH currently runs on a *:* query, and I
suggest running it over a list of IDs.
For example, if I have 1000 docs and want this new DIH to take 100 docs each
time, it will issue 10 queries, each one with 100 IDs
(like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.).
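To make that concrete, here is a minimal sketch of the batching itself (plain Java; the id field name, the sample ids, and the batch size are illustrative only, and real ids might need query escaping):

```java
import java.util.Arrays;
import java.util.List;

public class IdBatcher {

    // Builds one query per batch, e.g. id:(1 2) then id:(3 4), etc.
    static String batchQuery(List<String> ids, int from, int to) {
        StringBuilder q = new StringBuilder("id:(");
        for (int i = from; i < to; i++) {
            if (i > from) q.append(' ');
            q.append(ids.get(i));
        }
        return q.append(')').toString();
    }

    public static void main(String[] args) {
        List<String> allIds = Arrays.asList("1", "2", "3", "4", "5");
        int batchSize = 2;
        for (int start = 0; start < allIds.size(); start += batchSize) {
            int end = Math.min(start + batchSize, allIds.size());
            // Each batch is an independent query, so a failed or interrupted
            // run can be resumed from the first batch that did not complete.
            System.out.println(batchQuery(allIds, start, end));
        }
    }
}
```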
The question is: what do you think about it? Or could all of this be done
another way, and am I trying to reinvent the wheel?
asked Feb 20 2015 at 02:32 in Lucene-Solr-User by SolrUser1543

14 Answers

Personally, I much prefer indexing from an independent SolrJ client
to using DIH when I have to take explicit control of errors, etc.
Here's an example:
https://lucidworks.com/blog/indexing-with-solrj/
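For what it's worth, a minimal sketch of that approach (the collection URLs, the 500-doc batch size, and an "id" uniqueKey are assumptions; the Builder style is SolrJ 6+, older SolrJ used HttpSolrServer):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CopyCollection {
    public static void main(String[] args) throws Exception {
        HttpSolrClient source = new HttpSolrClient.Builder("http://localhost:8983/solr/oldcollection").build();
        HttpSolrClient target = new HttpSolrClient.Builder("http://localhost:8983/solr/newcollection").build();

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(500);                           // batch size: an arbitrary choice
        q.setSort("id", SolrQuery.ORDER.asc);     // cursorMark requires a sort on the uniqueKey
        String cursor = CursorMarkParams.CURSOR_MARK_START;

        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = source.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                SolrInputDocument in = new SolrInputDocument();
                for (String field : doc.getFieldNames()) {
                    if (!"_version_".equals(field)) {  // skip internal fields (and copyField targets, if any)
                        in.addField(field, doc.getFieldValue(field));
                    }
                }
                try {
                    target.add(in);               // update processors on the target fire here
                } catch (Exception e) {
                    // The explicit control DIH doesn't give you: record the failed id for a retry.
                    System.err.println("FAILED id=" + doc.getFieldValue("id") + " : " + e);
                }
            }
            String next = rsp.getNextCursorMark();
            if (next.equals(cursor)) break;       // cursor stopped advancing: all docs seen
            cursor = next;
        }
        target.commit();
        source.close();
        target.close();
    }
}
```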
In your example, you seem to be assuming that the Lucene IDs
(and here I'm assuming you're not talking about the internal Lucene
ID) correspond to some kind of primary key in your database table.
But the correspondence isn't necessarily straightforward; how
would it handle composite keys?
I'll leave actual comments on DIH's internals to people who, you know,
actually understand the code ;)...
Erick
answered Feb 20 2015 at 12:28 by Erick Erickson
It's a little bit hard to get the overall context, e.g. why you live with
OOME as usual, what the reasoning is for pulling from one index to another,
and what's added during this process.
Make sure that you are aware of
http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor, which
queries another Solr instance, and
http://wiki.apache.org/solr/DataImportHandler#LogTransformer, which you can
use to log recently imported ids, to be able to restart indexing from that
point.
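The LogTransformer idea lives in DIH configuration, but the same restartability can be sketched in a client-driven copy. Purely as an illustration (the file name, and resuming via a saved cursorMark, are assumptions, not anything DIH provides):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical checkpoint helper: persists the last fully imported
// cursorMark so an interrupted reindex can resume instead of starting
// over from a *:* import.
public class ReindexCheckpoint {
    private static final Path FILE = Paths.get("reindex.checkpoint");

    static String load() throws Exception {
        // "*" is CursorMarkParams.CURSOR_MARK_START, i.e. start from scratch
        return Files.exists(FILE) ? new String(Files.readAllBytes(FILE)).trim() : "*";
    }

    static void save(String cursorMark) throws Exception {
        // Call only after the batch has been added and committed on the target.
        Files.write(FILE, cursorMark.getBytes());
    }
}
```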
You can drop me more details in your native language if you wish.
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid
answered Feb 20 2015 at 12:55 by Mikhail Khludnev
My index has about 110 million documents. The index is split over several
shards.
Maybe that number is not so big, but each document is relatively large.
The reason to perform the reindex is something like adding new fields, or
adding some update processor which can extract something from one field and
put it in another, etc.
Each time I need to reindex the data, I create a new collection and start to
import the data from the old one.
That gives the update processors the opportunity to act.
The DIH runs with a *:* query and takes some number of items each time.
In case of an exception, the process stops in the middle and I can't
restart from that point.
That's the reason I want to run on a predefined list of IDs.
In that case I will be able to restart from any point and to know which IDs
failed.
answered Feb 20 2015 at 13:57 by SolrUser1543
You can include information on a URL parameter and then use that URL
parameter inside your dih config. If the URL parameter is "idlist" then
you can use ${dih.request.idlist} in your SELECT statement.
Be aware that most servlet containers have a default header length limit
of about 8192 characters, affecting the length of the URL that can be
sent successfully. If the list of IDs is going to get huge, you will
either need to switch from a GET to a POST request where the parameter
is in the post body, or increase the header length limit in the servlet
container that is running Solr.
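A possible way to wire that together from SolrJ (the /dataimport handler path, the core URL, and the "idlist" parameter name are this thread's examples, not fixed API), sending the command as a POST so the id list travels in the request body rather than the URL:

```java
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DihTrigger {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/newcollection").build();

        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("command", "full-import");
        p.set("clean", "false");            // keep what earlier batches already imported
        p.set("idlist", "101 102 103");     // placeholder; a real batch could be hundreds of ids

        // POST keeps the id list out of the URL and away from header length limits.
        QueryRequest req = new QueryRequest(p, SolrRequest.METHOD.POST);
        req.setPath("/dataimport");         // the DIH config reads it as ${dih.request.idlist}
        req.process(solr);

        solr.close();
    }
}
```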
Thanks,
Shawn
answered Feb 20 2015 at 14:46 by Shawn Heisey
I realized after I sent this that you are not using a database ... the
list would simply go in the query you send to the other server. I don't
know whether the request that the SolrEntityProcessor sends is a GET or
a POST, so for a really large list of IDs, you might need to edit the
container config on both servers.
Thanks,
Shawn
answered Feb 20 2015 at 14:52 by Shawn Heisey
Yes, you're right, I am not using a DB.
SolrEntityProcessor is using the GET method, so I will need to send a
relatively big URL (something like hundreds of ids); I hope it will be
possible.
Anyway, I think it is the only method to perform the reindex if I want to
control it and be able to continue from any point in case of failure.
answered Feb 21 2015 at 00:42 by SolrUser1543
Careful with the GETs! There is a real, hard limit on the length of a GET URL (in the low hundreds of characters). That's why a POST is so much better for complex queries; the limit is in the hundreds of megabytes.
answered Feb 21 2015 at 00:46 by steve
That's right, but I am not sure that if it works with GET I will be able to
use POST without changing it.
answered Feb 21 2015 at 00:52 by SolrUser1543
And I'm familiar with the setup and configuration using Python, JavaScript, and PHP; not at all with Java.
answered Feb 21 2015 at 01:12 by steve
The HTTP protocol does not set a limit on GET URL size, but individual web servers usually do. You should get a response code of 414 Request-URI Too Long when the URL is too long.
This limit is usually configurable.
wunder
Walter (my blog)
answered Feb 21 2015 at 09:50 by Walter Underwood
Thank you! Another 4xx error that makes sense. Quoting from the Book of StackOverflow (http://stackoverflow.com/questions/2659952/maximum-length-of-http-get-request):
"Most webservers have a limit of 8192 bytes (8KB), which is usually configurable somewhere in the server configuration. As to the client side matter, the HTTP 1.1 specification even warns about this; here's an extract of chapter 3.2.1: 'Note: Servers ought to be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations might not properly support these lengths.' The limit is in MSIE and Safari about 2KB, in Opera about 4KB and in Firefox about 8KB. We may thus assume that 8KB is the maximum possible length, that 2KB is a more affordable length to rely on at the server side, and that 255 bytes is the safest length to assume that the entire URL will come in.
If the limit is exceeded in either the browser or the server, most will just truncate the characters outside the limit without any warning. Some servers however may send an HTTP 414 error. If you need to send large data, then better use POST instead of GET. Its limit is much higher, but more dependent on the server used than the client. Usually up to around 2GB is allowed by the average webserver. This is also configurable somewhere in the server settings. The average server will display a server-specific error/exception when the POST limit is exceeded, usually as an HTTP 500 error."
answered Feb 21 2015 at 13:37 by steve
The limit on a GET command (including the GET itself and the protocol
specifier, usually HTTP/1.1) is normally 8K, or 8192 bytes. That's the
default value in Jetty, at least.
A question for the experts: Would it be a good idea to force a POST
request in SolrEntityProcessor? It may be dealing with parameters that
have been sent via POST and may exceed the header size limit.
Thanks,
Shawn
answered Feb 21 2015 at 16:45 by Shawn Heisey
Am I an expert? Not sure, but I worked on an enterprise search spider and search engine for about a decade (Ultraseek Server) and I've done customer-facing search for another 6+ years.
Let the server reject URLs it cannot handle. Great servers will return a 414, good servers will return a 400, broken servers will return a 500, and crapulous servers will hang. In nearly all cases, you'll get a fast fail which won't hurt other users of the site.
Manage your site for zero errors, so you can fix the queries that are too long.
At Chegg, we have people paste entire homework problems into the search for homework solutions, and, yes, we have a few queries longer than 8K. But we deal with it gracefully.
Never do POST for a read-only request. Never. That only guarantees that you cannot reproduce the problem by looking at the logs.
If your design requires extremely long GET requests, you may need to re-think your design.
wunder
Walter (my blog)
answered Feb 21 2015 at 17:33 by Walter Underwood
I agree with those sentiments ... but those who consume the services we
provide tend to push the envelope well beyond any reasonable limits.
My Solr install deals with some Solr queries where the GET request is
pushing 20000 characters. The queries and filters constructed by the
website code for some of the more powerful users are really large. I
had to configure haproxy and jetty to allow HTTP headers up to 32K. I'd
like to tell development that we just can't handle it, but with the way
the system is currently structured, there's no other way to get the
results they need.
If I were to make it widely known internally that the Solr config is
currently allowing POST requests up to 32 megabytes, I am really scared
to find out what sort of queries development would try to do. I raised
that particular configuration limit (which defaults to 2MB) for my own
purposes, not for the development group.
Thanks,
Shawn
answered Feb 21 2015 at 18:20 by Shawn Heisey

Related Discussions

  • Return Solr Docs In A Specific Order By List Of Ids in Lucene-solr-user

  • Hi, I have a use case where I have a list of doc ids and I need to return Documents from solr in the same order as my list of ids. For instance: 459,185,569,8,1,896 Is it possible to return docs is Solr following in the same order? Regards, Sergio View this message in context: http://lucene.472066.n3.nabble.com/Return-Solr-docs-in-a-specific-order-by-list-of-ids-tp4128570.html ...

  • Filter Query From External List Of Solr Unique IDs in Lucene-solr-user

  • At the Lucene Revolution conference I asked about efficiently building a filter query from an external list of Solr unique ids. Some use cases I can think of are: 1) personal sub-collections (in our case a user can create a small subset of our 6.5 million doc collection and then run filter queries against it) 2) tagging documents 3) access control lists 4) anything that needs ...

  • Retrieving And Updating Large Set Of Documents On Solr 4.7.2 in Lucene-solr-user

  • 0 down vote favorite I am trying to implement an activity feed for a website, and planning to use Solr for this case. As it does not have any follower/following relation, Solr is fitting for the requirements. There is one point which makes me concerned about performance. So as user A, I may have 10K activities in the feed, and then I have updated my preferences, so the activities that...

  • Solr Subset Searching In 100-million Document Index in Lucene-solr-user

  • Hi, We have a Solr index of around 100 million documents with each document being given a region id growing at a rate of about 10 million documents per month - the average document size being aronud 10KB of pure text. The total number of region ids are themselves in the range of 2.5 million. We want to search for a query with a given list of region ids. The number of region ids in this list...

  • Solr 3.4 With NTiers >= 2: Usage Of Ids Param Causes NullPointerException (NPE) in Lucene-solr-user

  • Hello, Hopefully this question is not too complex to handle, but I'm currently stuck with it. We have a system with nTiers, that is: Solr front base ---> Solr front --> shards Inside QueryComponent there is a method createRetrieveDocs(ResponseBuilder rb) which collects doc ids of each shard and sends them in different queries using the ids parameter: [code] sreq.params.add(ShardParams.IDS, StrUtils...

  • Restrict Search To Subset (a List Of Aprrox 40,000 Ids From An External Service) Of Corpus in Lucene-solr-user

  • Hi guys, How do I search only a subset of my corpus based on a large list of non consecutive unique key ids (cannot do a range query). Is there a way around doing this q=id:(id1 OR id2 OR id3 OR id4 ... OR id40000 ) AND name:* Also what is the limit of "OR"s i can apply on the query if that is the Thanks...

  • Handling Large No. Of Ids In Solr in Lucene-solr-user

  • 1 down vote favorite I need to perform an online search in solr i.e user need to find list of user which are online with particular criteria. how i am handling this: we store the ids of user in a table and i send all online user id in solr request like &fq=-id:(id1 id2 id3 ............id5000) problem with this approach is that when ids become large, solr talking too much time to resolved and we ...

  • Ranking Based On Number Of Matches In A Multivalued Field? in Lucene-solr-user

  • So suppose I have a multivalued field for categories. Let's say we have 3 items with these categories: Item 1: category ids [1,2,5,7,9] Item 2: category ids [4,8,9] Item 3: category ids [1,4,9] I now run a filter query for any of the following category ids [1,4,9]. I should get all of them back as results because they all include at least one category which I'm querying. Now, how do I order it ...

  • Filter On Millions Of IDs From External Query in Lucene-solr-user

  • I am working with an in index of ~10 million documents. The index does not change often. I need to preform some external search criteria that will return some number of results -- this search could take up to 5 mins and return anywhere from 0-10M docs. I would like to use the output of this long running query as a filter in solr. Any suggestions on how to wire this all together? My initial...

  • Performance Implications On Using Lots Of Values In Fq in Lucene-solr-user

  • I have documents in SOLR such that each document contains one to many points (latitude and longitudes). Currently we store the multiple points for a given document in the db and query the db to find all of the document ids around a given point first. Once we have the list of ids, we populate the fq with those ids and the q value and send that off to SOLR to do a search. In the "longest" query...

  • Filtering/faceting By A Big List IDs in Lucene-solr-user

  • Hi all, I am running a Solr application and I would need to implement a feature that requires faceting and filtering on a large list of IDs. The IDs are stored outside of Solr and is specific to the current logged on user. An example of this is the articles/tweets the user has read in the last few weeks. Note that the IDs here are the real document IDs and not Lucene internal docids. So the question...

  • Retrieve Ids Of All Indexed Docs Efficiently in Lucene-solr-user

  • Hi -- I'd like to retrieve the ids of all the docs in my Solr 5.3.1 index. In my query, I've set rows=1000, fl=id, and am using the cursorMark mechanism to split the overall traversal into multiple requests. Not because I care about the order, but because the documentation implies that it's necessary to make cursorMark work reliably, I've also set sort=id asc. While this does give me the data...

  • Solr Ids Query Parameter in Lucene-solr-user

  • Hello, I am trying to do a distributed search with solr and for some reason I get an internal server error. The set up is like this: I have 4 solr servers that index data (say daily each with 10 cores) and I use another bunch of solr instances (lets call one of them as L1aggregator) that does a distributed request to all the 40 cores of 4 solr servers. I also have another solr instance (lets call...

  • Interesting Practical Solr Question in Lucene-solr-user

  • Hi, I use Solr to search through a set of about 200,000 documents. Each document has a numeric ID. How to do the following: 1) I use facets and want to return the facets for "all documents" as the starting point of the user interface. In other words, I want to /select the facet counts for about 10 facets (like states for example) for all documents without having to do a search. Is this possible...

  • Solr _docid_ Parameter in Lucene-solr-user

  • In Solr, I noticed that I can sort by the internal Lucene _docid_. -> http://wiki.apache.org/solr/CommonQueryParameters > You can sort by index id using sort=_docid_ asc or sort=_docid_ desc * I have also read the docid is represented by a sequential number. -> http://lucene.472066.n3.nabble.com/Get-DocID-after-Document-insert-td556278.html > Your document IDs may change, and in fact...

  • Search Or Filter By A List Of Document Ids And Return Them In The Same Order. in Lucene-solr-user

  • Hi I am trying to search or filter a list of documents by their ids (product id field). The requirement is that the returned documents must be in the same order as the list searched or filtered by. E.g. if I search or filter on the below list of ids, the documents must be returned in the same order too 1083342171 1079463095 1078278592 1085253674 1076558399 Is this possible? Thanks, Derek...

  • Clearing FieldValueCache In Solr 4.6 in Lucene-solr-user

  • Hello. We're just starting to use solr in production. We've indexed 18,000 documents or so. We've just implemented faceted search results. We mistakenly stored integer ids in what was meant to be a string field. So, our facet results are showing numbers instead of the textual values. After fixing this oversight, reindexing the documents yields the correct results, but the faceted...

  • Solr Atomic Updates By Query in Lucene-solr-user

  • Hi! I have one more question about atomic updates in Solr (Solr 4.4.0). Is it posible to generate atomic update by query? I mean I want to update those documents in which IDs contain some string. For example, index has: Doc1, id="123|a,b" Doc2, id="123|a,c" Doc3, id="345|a,b" Doc4, id="345|a,c,d". And if I don't want to generate all IDs to update, but I know that necessary IDs start ...

  • Multiple Unique Ids in Lucene-solr-user

  • Hi, I have two Ids DocumentId and AuthorId. I want both of them unique. Can i have two in my document? id authorId Regards, Ninad Raut...

  • Dynamic Boosting Of Ids At Search Time in Lucene-solr-user

  • Hi, I have to boost certain ids at the search time and these ids are not fixed, so i can't keep them in DismaxRequest handler. I mean, if for query x, ids to be boosted are 243452,346563,773567, then for query y the ids to be boosted won't be the same. They are calculated at the search time. Also, I cant keep them in the lucene query as the list goes in thousands. Please suggest a good resolution ...