On Fri, Jun 8, 2012 at 9:35 AM, Tom Brown wrote:
> Is there any way to introduce a different ordering scheme from
> the base comparable bytes? My use case is that I am using UTF-8 data
> for my keys, and I would like to have scans use UTF-8 collation.
>
> Could this be done by providing an alternate implementation of
> WritableComparable?
>
> Thanks in advance!

Unfortunately no, Tom. The database is all sorted the same way. Different sorts per table would complicate system interactions (the catalog tables would have to change sort by table). It might be doable but it would take some work.

Can you store your data as UTF-16 or UTF-32? It's been a while since I dealt w/ this stuff but IIRC, their sort order is byte order? (WARNING! I could be way off here.)

St.Ack

answered Jun 8 2012 at 17:14
Storing the bytes as native UTF-16 or UTF-32 will not help. Even strings in UTF-8 format can be sorted by their code points when stored as bytes. Unfortunately, that's not really useful for collation, as characters like "è" (U+00E8) should appear between "e" (U+0065) and "f" (U+0066), but the code points do not allow this.

Thanks anyway!

--Tom

answered Jun 8 2012 at 17:34
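To make Tom's point concrete, here is a minimal sketch that compares the same strings two ways: by the unsigned bytes of their UTF-8 encoding, and with a locale-aware collator. The class name `ByteOrderVsCollation`, the helper names, and the choice of `Locale.FRENCH` are assumptions made for illustration; nothing here is HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Locale;

final class ByteOrderVsCollation {

    // Unsigned, byte-by-byte comparison of the UTF-8 encodings: the kind of
    // ordering a byte-sorted store applies to raw keys.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    // Locale-aware comparison. Locale.FRENCH is only an example choice.
    static int compareCollated(String a, String b) {
        return Collator.getInstance(Locale.FRENCH).compare(a, b);
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // U+00E8, e with grave accent
        // Positive: in byte order the accented letter lands after "f".
        System.out.println("bytes:    " + compareUtf8Bytes(eGrave, "f"));
        // Negative: the collator places it before "f".
        System.out.println("collator: " + compareCollated(eGrave, "f"));
    }
}
```

The byte comparison puts U+00E8 after every ASCII letter because its UTF-8 encoding starts with the byte 0xC3, which is larger than any ASCII byte.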
Tom, another approach you could take would be to store an ASCII encoded version of the string as the row key or column qualifier, and then the full UTF-8 string elsewhere (e.g. in the cell value, or even later in the row key). That wouldn't work out the fine sorting (whether "è" sorts before or after "e") but it would solve the gross sorting ("è" would always come before "f").

If you need true UTF-8 collation in the results, you could then implement it as a layer on top of that (in your app, or maybe a co-processor, I'm not sure about the latter). But at least with this approach, you'd be able to take advantage of rowkey ranges in your scans, which would probably make up for any time spent doing a secondary sort.

Ian

answered Jun 8 2012 at 17:40
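A rough sketch of Ian's idea, assuming the "ASCII encoded version" is produced by decomposing the string and dropping combining marks. The class name `FoldedRowKey`, the helper names, and the key layout (folded form, a 0x00 separator, then the original string) are hypothetical choices for illustration. Letters that have no decomposition (for example U+00F8) pass through unchanged, so a real folding step would likely need more rules.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class FoldedRowKey {

    // Decompose to NFD, then drop combining marks, so U+00E8 becomes "e".
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Hypothetical layout: folded form, 0x00 separator, original string.
    // The folded prefix drives the gross ordering; the original is kept
    // so the client can recover it and do a finer secondary sort.
    static byte[] rowKey(String s) {
        return (fold(s) + '\u0000' + s).getBytes(StandardCharsets.UTF_8);
    }

    // Unsigned byte-by-byte comparison, mimicking a byte-sorted store.
    static int compareUnsigned(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String word = "cr\u00e8me"; // contains U+00E8
        System.out.println(fold(word));
        // Negative: the folded key sorts before a key starting with "f".
        System.out.println(compareUnsigned(rowKey(word), rowKey("fudge")));
    }
}
```

With keys built this way, a scan over a folded prefix range (say, everything starting with "cre") returns the accented and unaccented variants together, and the exact order within that range is left to the layer on top.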
Yet another approach is to transform your keys into byte-comparable values that preserve your desired sort order, and store those instead. The ICU library has the ability to do this for various collations of UTF strings:

http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys

So for this case HBase could store the ICU sort key rather than the actual UTF string. You then get correct scans, but just as in Ian's example, you need to implement a layer that converts your client's requests to HBase from UTF strings to sort keys. This will almost certainly give you better HBase performance, since memcmp is generally faster than a custom comparator.

answered Jun 8 2012 at 17:58
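The sort-key idea can be sketched without any extra dependency by using the JDK's own `java.text.Collator` as a stand-in for ICU (ICU4J's `com.ibm.icu.text.Collator` offers an analogous `getCollationKey` call). The class name `CollationSortKey`, the helper names, and `Locale.FRENCH` are assumptions for illustration only.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationSortKey {

    // Locale.FRENCH is an example; writers and readers must use the same collator.
    private static final Collator COLLATOR = Collator.getInstance(Locale.FRENCH);

    // Byte-comparable sort key for the string, suitable for use as a stored key.
    static byte[] sortKey(String s) {
        return COLLATOR.getCollationKey(s).toByteArray();
    }

    // Unsigned byte-by-byte comparison, i.e. what memcmp-style ordering does.
    static int compareUnsigned(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // U+00E8
        // Both are negative: the keys order e, then U+00E8, then f.
        System.out.println(compareUnsigned(sortKey("e"), sortKey(eGrave)));
        System.out.println(compareUnsigned(sortKey(eGrave), sortKey("f")));
    }
}
```

Two caveats on this design. A sort key generally cannot be turned back into the original string, so the string itself would have to be stored elsewhere (for example in the cell value). Keys are also only comparable with other keys produced by the same collator rules, so changing locale or library version would mean rewriting the stored keys.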
On Fri, Jun 8, 2012 at 10:58 AM, Jason Frantz wrote:
> Yet another approach is to transform your keys into byte comparable values
> that preserve your desired sort order, and store that instead. The ICU
> library has the ability to do this for various collations of UTF strings:
>
> http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys
>
> So for this case HBase could store the ICU sortkey rather than the actual
> UTF string. You then get correct scans, but just as in Ian's example, you
> need to implement a layer that converts requests your client requests to
> HBase UTF to sortkey. This will almost certainly give you better HBase
> performance since memcmp is generally faster than a custom comparator.

I love this mailing list. Thanks, you just helped solve a problem for me unrelated to HBase.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

answered Jun 8 2012 at 18:59
Group Hbase-user
asked Jun 8 2012 at 16:35
active Jun 8 2012 at 18:59