Is there any way to introduce a different ordering scheme from
the base comparable bytes?  My use case is that I am using UTF-8 data
for my keys, and I would like to have scans use UTF-8 collation.
Could this be done by providing an alternate implementation of WritableComparable<ImmutableBytesWritable>?
Thanks in advance!
--Tom
asked Jun 8 2012 at 16:35 in Hbase-User by Tom Brown

5 Answers

On Fri, Jun 8, 2012 at 9:35 AM, Tom Brown wrote:
> Is there any way to introduce a different ordering scheme from
> the base comparable bytes?  My use case is that I am using UTF-8 data
> for my keys, and I would like to have scans use UTF-8 collation.
>
> Could this be done by providing an alternate implementation of
> WritableComparable<ImmutableBytesWritable>?
>
> Thanks in advance!
Unfortunately no, Tom. The database is all sorted the same way. Different sorts per table would complicate system interactions (the catalog tables would have to change sort by table). It might be doable but it would take some work.
Can you store your data as UTF-16 or UTF-32? It's been a while since I dealt w/ this stuff but IIRC, their sort order is byte order? (WARNING! I could be way off here).
St.Ack
answered Jun 8 2012 at 17:14 by Stack
Storing the bytes as native UTF-16 or UTF-32 will not help. Even strings in UTF-8 format can be sorted by their code points when stored as bytes. Unfortunately, that's not really useful for collation, as characters like "è" (U+00E8) should appear between "e" (U+0065) and "f" (U+0066), but the code points do not allow this.
Thanks anyway!
--Tom
answered Jun 8 2012 at 17:34 by Tom Brown
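To make the point above concrete, here is a minimal sketch (not from the thread) that sorts "e", "è" and "f" two ways: by unsigned comparison of their UTF-8 bytes, which is the ordering HBase applies to keys, and with the JDK's java.text.Collator. The class and helper names are made up for illustration, and Locale.FRENCH is just an assumed example locale.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; none of these names come from HBase.
final class ByteOrderVsCollation {

    // Unsigned lexicographic comparison of two byte arrays (memcmp-style).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Sorts the strings by the raw bytes of their UTF-8 encoding.
    static String[] sortByUtf8Bytes(String... keys) {
        String[] out = keys.clone();
        Arrays.sort(out, (x, y) -> compareBytes(
                x.getBytes(StandardCharsets.UTF_8),
                y.getBytes(StandardCharsets.UTF_8)));
        return out;
    }

    // Sorts the strings with a locale-aware collator.
    static String[] sortByCollation(Locale locale, String... keys) {
        String[] out = keys.clone();
        Arrays.sort(out, Collator.getInstance(locale));
        return out;
    }

    public static void main(String[] args) {
        // "\u00e8" is "è"; written as an escape to avoid source-encoding issues.
        // Byte order: e, f, è (0xC3 0xA8 sorts after every ASCII byte).
        System.out.println(Arrays.toString(sortByUtf8Bytes("f", "\u00e8", "e")));
        // Collation order: e, è, f.
        System.out.println(Arrays.toString(sortByCollation(Locale.FRENCH, "f", "\u00e8", "e")));
    }
}
```

The same mismatch appears with UTF-16 or UTF-32 bytes, since those encodings also order "è" by its code point rather than by a collation rule.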
Tom, another approach you could take would be to store an ASCII encoded version of the string as the row key or column qualifier, and then the full UTF-8 string elsewhere (e.g. in the cell value, or even later in the row key). That wouldn't work out the fine sorting (whether "è" sorts before or after "e") but it would solve the gross sorting ("è" would always come before "f"). If you need true UTF-8 collation in the results, you could then implement it as a layer on top of that (in your app, or maybe a co-processor, I'm not sure about the latter). But at least with this approach, you'd be able to take advantage of rowkey ranges in your scans, which would probably make up for any time spent doing a secondary sort.
Ian
answered Jun 8 2012 at 17:40 by Ian Varley
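A rough sketch of the folded-key idea described above, under several assumptions that are not in the thread: diacritics are stripped with java.text.Normalizer, and the row key is laid out as the folded ASCII form, a 0x00 separator, then the original UTF-8 bytes. The class name, helper names and key layout are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class; the key layout is an assumption for illustration.
final class FoldedRowKey {

    // Decomposes accented characters (NFD) and drops the combining marks,
    // so "è" folds to "e". Characters with no decomposition are left as-is.
    static String foldToAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Builds: folded ASCII bytes + 0x00 + original UTF-8 bytes.
    // Anything still non-ASCII after folding is encoded as '?' by US_ASCII.
    static byte[] rowKey(String s) {
        byte[] folded = foldToAscii(s).getBytes(StandardCharsets.US_ASCII);
        byte[] original = s.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[folded.length + 1 + original.length];
        System.arraycopy(folded, 0, key, 0, folded.length);
        key[folded.length] = 0; // separator between folded and original parts
        System.arraycopy(original, 0, key, folded.length + 1, original.length);
        return key;
    }

    // Unsigned lexicographic comparison, mirroring byte-ordered key sorting.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // With this layout the key for "è" sorts before the key for "f".
        System.out.println(compareBytes(rowKey("\u00e8"), rowKey("f")) < 0);
    }
}
```

This only gets the gross ordering right. Letters that have no canonical decomposition (for example "ø") are not folded by this approach and would need their own mapping, and the fine ordering among strings that fold to the same ASCII form still has to be resolved in the application layer.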
Yet another approach is to transform your keys into byte comparable values that preserve your desired sort order, and store that instead. The ICU library has the ability to do this for various collations of UTF strings:
http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys
So for this case HBase could store the ICU sortkey rather than the actual UTF string. You then get correct scans, but just as in Ian's example, you need to implement a layer that converts your client's requests to HBase from UTF strings to sortkeys. This will almost certainly give you better HBase performance since memcmp is generally faster than a custom comparator.
answered Jun 8 2012 at 17:58 by Jason Frantz
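As an illustration of the sort-key approach, the sketch below uses the JDK's built-in java.text.Collator and CollationKey as a stand-in for ICU, so it runs without extra dependencies; it is not ICU code and its keys are not interchangeable with ICU's. The class and method names are hypothetical, and Locale.FRENCH is an assumed example locale.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class using the JDK collator in place of ICU.
final class CollationSortKey {

    // Returns a byte array whose unsigned byte order follows the collator's
    // ordering of the original strings.
    static byte[] sortKey(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    // Unsigned lexicographic comparison (memcmp-style), as used for row keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        byte[] e = sortKey(collator, "e");
        byte[] eGrave = sortKey(collator, "\u00e8"); // "è"
        byte[] f = sortKey(collator, "f");

        // Plain byte comparison of the sort keys now gives e < è < f.
        System.out.println(compareBytes(e, eGrave) < 0);
        System.out.println(compareBytes(eGrave, f) < 0);
    }
}
```

A sort key cannot be turned back into the original string, so the UTF-8 text still has to be kept somewhere, such as the cell value. Keys are also only comparable when they were produced by the same collator configuration, so scan start and stop rows need to go through the same conversion as the stored keys.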
On Fri, Jun 8, 2012 at 10:58 AM, Jason Frantz wrote:
> Yet another approach is to transform your keys into byte comparable values
> that preserve your desired sort order, and store that instead. The ICU
> library has the ability to do this for various collations of UTF strings:
>
> http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys
>
> So for this case HBase could store the ICU sortkey rather than the actual
> UTF string. You then get correct scans, but just as in Ian's example, you
> need to implement a layer that converts your client's requests to HBase
> from UTF strings to sortkeys. This will almost certainly give you better
> HBase performance since memcmp is generally faster than a custom comparator.
I love this mailing list. Thanks, you just helped solve a problem for me unrelated to HBase.
Best regards,
   - Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
answered Jun 8 2012 at 18:59 by Andrew Purtell
