QnaList > Groups > Hbase-User > Jun 2012

Collation Order Of Items

Is there any way to introduce a different ordering scheme from
the base comparable bytes?  My use case is that I am using UTF-8 data
for my keys, and I would like to have scans use UTF-8 collation.
Could this be done by providing an alternate implementation of
WritableComparable?
Thanks in advance!
--Tom

asked Jun 8 2012 at 16:35 by Tom Brown



5 Replies for: Collation Order Of Items
Unfortunately no, Tom.  The database is all sorted the same way.
Different sorts per table would complicate system interactions (the
catalog tables would have to change sort by table).  It might be
doable but it would take some work.
Can you store your data as UTF-16 or UTF-32?  It's been a while since I dealt
w/ this stuff but IIRC, their sort order is byte order?  (WARNING!  I
could be way off here).
St.Ack

answered Jun 8 2012 at 17:14 by Stack


Storing the bytes as native UTF-16 or UTF-32 will not help.  Even
strings in UTF-8 format can be sorted by their code points when stored
as bytes.  Unfortunately, that's not really useful for collation as
characters like "è" (U+00E8) should appear between "e" (U+0065) and
"f" (U+0066), but the code points do not allow this.
Thanks anyway!
--Tom

answered Jun 8 2012 at 17:34 by Tom Brown
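
To illustrate the point above, here is a minimal Java sketch that sorts the same three keys twice: once by unsigned UTF-8 byte order (how HBase orders row keys), and once with the JDK's java.text.Collator. The class and method names are made up for this example, and Locale.FRENCH is only an assumed choice of collation; Arrays.compareUnsigned needs Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of HBase.
class ByteOrderVsCollation {

    // Orders strings by unsigned lexicographic comparison of their UTF-8 bytes,
    // which is the same as ordering them by code point.
    static List<String> sortByUtf8Bytes(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    // Orders strings with a locale-aware collator instead.
    static List<String> sortByCollation(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // Decompose accented characters so "è" is compared as "e" plus an accent.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("f", "\u00e8", "e"); // \u00e8 is "è"
        System.out.println(sortByUtf8Bytes(keys));                // order: e, f, è
        System.out.println(sortByCollation(keys, Locale.FRENCH)); // order: e, è, f
    }
}
```

In UTF-8, "è" is the two bytes 0xC3 0xA8, so it lands after every ASCII letter in a byte-ordered scan.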


Tom, another approach you could take would be to store an ASCII-encoded version of the string
as the row key or column qualifier, and then the full UTF-8 string elsewhere (e.g. in the
cell value, or even later in the row key). That wouldn't work out the fine sorting (whether
"è" sorts before or after "e") but it would solve the gross sorting ("è" would always come
before "f"). If you need true UTF-8 collation in the results, you could then implement it
as a layer on top of that (in your app, or maybe a coprocessor, I'm not sure about the latter).
But at least with this approach, you'd be able to take advantage of rowkey ranges in your
scans, which would probably make up for any time spent doing a secondary sort.
Ian

answered Jun 8 2012 at 17:40 by Ian Varley
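
One way the folded-key idea could look in Java is sketched below. It assumes "ASCII encoded" means stripping accents via Unicode decomposition (java.text.Normalizer), and it assumes a row key layout of folded text, a zero byte, then the original string; both choices, and all names, are illustrative rather than anything prescribed in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, not part of HBase.
class FoldedRowKey {

    // Strips accents by decomposing to NFD and dropping combining marks.
    // Letters with no canonical decomposition (for example "ø" or "ß")
    // pass through unchanged, so this is only an approximation of ASCII folding.
    static String foldAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Assumed layout: folded form first (drives the gross sort), then a zero
    // byte as separator, then the original string so it can be read back.
    static byte[] rowKey(String original) {
        String key = foldAccents(original) + '\u0000' + original;
        return key.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(foldAccents("cr\u00e8me")); // prints "creme"
        System.out.println(rowKey("\u00e8").length);   // 1 + 1 + 2 = 4 bytes
    }
}
```

With this layout a scan over the range from "e" to "f" picks up both "e" and "è" rows; the fine ordering within that range still has to be done by the client.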


Yet another approach is to transform your keys into byte-comparable values
that preserve your desired sort order, and store those instead. The ICU
library has the ability to do this for various collations of UTF strings:
http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys
So for this case HBase could store the ICU sort key rather than the actual
UTF string. You then get correct scans, but just as in Ian's example, you
need to implement a layer that converts the strings in your client's
requests to sort keys before they go to HBase. This will almost certainly
give you better HBase performance, since memcmp is generally faster than a
custom comparator.

answered Jun 8 2012 at 17:58 by Jason Frantz
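
A rough sketch of the sort-key idea follows. To stay self-contained it uses the JDK's java.text.Collator and CollationKey as a stand-in for ICU; that substitution is an assumption of this example, and ICU's own collator would produce different (and typically more compact) key bytes. The class name, method names, and the choice of Locale.FRENCH are hypothetical.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of HBase or ICU.
class SortKeyRowKey {

    // Turns a string into bytes whose unsigned byte order matches the
    // collator's order, so they can be used directly as a row key.
    static byte[] sortKey(String s, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.getCollationKey(s).toByteArray();
    }

    // Compares two keys the way a byte-ordered store would (memcmp-style).
    static int compareAsRowKeys(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        byte[] e = sortKey("e", Locale.FRENCH);
        byte[] eGrave = sortKey("\u00e8", Locale.FRENCH); // "è"
        byte[] f = sortKey("f", Locale.FRENCH);
        System.out.println(compareAsRowKeys(e, eGrave) < 0); // true
        System.out.println(compareAsRowKeys(eGrave, f) < 0); // true
    }
}
```

A sort key cannot be turned back into the original string, so the original would still need to be stored somewhere such as the cell value. Scan start and stop rows would have to be built with the same collator settings as the stored keys.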


I love this mailing list. Thanks, you just helped solve a problem for
me unrelated to HBase.
Best regards,
   - Andy
Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

answered Jun 8 2012 at 18:59 by Andrew Purtell

