Hi everyone,
I'm working on a project that needs a distributed inverted index, and we
are getting fair results using HBase and Hadoop (Crawlers -> Document
Repository (HBase) --M/R-> Document Index (HBase) --M/R-> Inverted Index).
However, we are also investigating more efficient ways to use this
inverted index. So after reading [1] we are wondering if anyone has figured
out a way to let a HBase
I would like to compare strings; I'm using this for my navigation.
PSEUDO-CODE
IF $title CONTAINS "abc: def"
    print TITLE;
    print SUB1;
    print SUB2;
    print SUB3;
ELSE
    print TITLE;
Comparing the two strings character by character (strcmp) does not do
the trick, since I need a substring match rather than equality.
Is there anything besides a general regex?
Or is my concern about processing speed unwarranted?
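For a plain substring test like the pseudo-code above, no regex is needed: most languages provide a direct containment check that is cheaper than compiling a pattern. A minimal Java sketch of the same logic (the title values are made-up examples):

```java
public class TitleCheck {
    // Prints the full heading block only when the title contains the marker.
    static String render(String title) {
        if (title.contains("abc: def")) {   // plain substring test, no regex
            return "TITLE\nSUB1\nSUB2\nSUB3";
        }
        return "TITLE";
    }

    public static void main(String[] args) {
        System.out.println(render("xyz abc: def ghi")); // matching title
        System.out.println(render("plain title"));      // non-matching title
    }
}
```

`String.contains` does a straightforward substring scan, so for a handful of navigation titles the processing cost is negligible.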
Can the Hive optimizer take advantage of tables that are CLUSTERED BY … SORTED BY … when
performing aggregations or analytic functions (window aggregations)?
If so, how?
More details in this SO question: optimize Hive table storage for subsequent Aggregations
and/or Window Analytic Functions
Hi all,
I recently performance-tested an ActiveMQ queue. The message processing
speed of my producers and consumers is very low: about 3,000 messages per
second. However, when I test delivery from producer to consumer, messages
are received quickly, at about 12,000 messages per second. Why is this? Can
anyone help me analyze it? Thank you very much.
Hi guys,
On what factors does HBase read latency primarily depend? What would be the
approximate theoretical limit for read latency in v0.90.1 on a cluster of 7
nodes (16 cores/16 GB RAM on 5 machines and 36 GB on the other two)? I have
an application where I generate around 1000 rows/s to be inserted into
HBase. Then I have to read this data and process it at regular intervals.
Write speed is not a problem
Hi, I need to change the storage format of some data.
The data looks like:
/masterdata/source/some_source/archive/YEAR/MONTH/DAY/HOUR/part*
Right now the data is in CSV format.
I want to convert it to SequenceFile with Snappy compression.
I need to preserve the date partitioning /YEAR/MONTH/DAY/HOUR/
I need to get a single file for each partition (SeqFiles are splittable, so I can reduce the quantity of files and save
Hi All,
I am using the schema in the Impala VM and trying to create a dynamically partitioned table on date_dim.
New table is called date_dim_i and schema for that is defined as:
create table date_dim_i
(
d_date_sk int,
d_date_id string,
d_date timestamp,
d_month_seq int,
d_week_seq
Hi,
I am planning a system to process information with Hadoop, and I will have
a few look-up tables that each processing node will need to query. There
are perhaps 20-50 such tables, and each has on the order of one million
entries. What is the best mechanism for this look-up? Memcache, HBase,
JavaSpace, Lucene index, anything else?
Thank you,
Mark
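One common baseline for side tables of this size is to load each one into an in-memory hash map on every processing node (for example, shipped via the distributed cache) and measure whether it fits before reaching for an external store. A minimal sketch of the lookup side; the table name and entries here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class SideLookup {
    private final Map<String, String> table = new HashMap<>();

    // In a real job this would be populated from a file shipped to the node.
    void put(String key, String value) { table.put(key, value); }

    // O(1) expected lookup; returns null when the key is absent.
    String lookup(String key) { return table.get(key); }

    public static void main(String[] args) {
        SideLookup countries = new SideLookup();  // hypothetical look-up table
        countries.put("DE", "Germany");
        countries.put("FR", "France");
        System.out.println(countries.lookup("DE"));
    }
}
```

With roughly a million small entries per table, each map is on the order of tens to a few hundred MB, so 20-50 tables may or may not fit per node; the external options (Memcache, HBase, a Lucene index) only become necessary when they do not.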
I seem to be seeing something entirely unexpected: when my JmsConsumer
thread throws a SchedulerException, that JmsConsumer thread will not process
any new message until it has finished processing the original ActiveMQ
message. I was expecting that having a delay would mean the JmsConsumer
thread is free to process another ActiveMQ message.
Here's my setup:
1) I have a whole slew
I’m totally stumped on this bug ...
Essentially, I have a queue that locks up, and consumers in my main daemon
no longer consume messages from it.
It’s basically dead. Even if I restart my daemon, no more messages are consumed.
I can browse the queue and consume messages from my desktop, but I can’t
consume them from my main daemon.
I’ve done all the normal debugging. JMX shows there are
Hi
I am new to Hadoop; we now have 32 nodes for Hadoop study.
I need to speed up Hadoop processing by finding the best configuration,
for example io.sort.mb, io.sort.record.percent, etc. But I do not know how
to start with so many parameters available for optimization.
BRs
Geelong
--
From Good To Great
Hi all,
we have switched to ActiveMQ and we are using automated web tests. Some of
these tests put messages in a queue, and we have to wait for those messages
to be processed before we can go on with the tests.
Messages get redelivered on failures. Currently I use JMX in a loop with a
sleep, and assume the run failed if the queue does not become empty within a
period of time.
Now the question is if there
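The wait-for-empty loop itself can be separated from the JMX plumbing: poll a queue-size supplier at an interval and give up after a deadline. A minimal sketch of that pattern, using a plain in-memory counter as a stand-in for the broker's queue-size attribute:

```java
import java.util.function.LongSupplier;

public class QueueDrainWaiter {
    // Polls queueSize until it reports 0 or the timeout elapses.
    // Returns true if the queue drained in time.
    static boolean awaitEmpty(LongSupplier queueSize, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (queueSize.getAsLong() == 0) {
                return true;
            }
            Thread.sleep(pollMs);
        }
        return queueSize.getAsLong() == 0; // final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long[] pending = {3};              // stand-in for the broker's queue size
        Thread drainer = new Thread(() -> {
            while (pending[0] > 0) {
                pending[0]--;              // simulate consumers working
                try { Thread.sleep(10); } catch (InterruptedException e) { return; }
            }
        });
        drainer.start();
        System.out.println(awaitEmpty(() -> pending[0], 2000, 20));
        drainer.join();
    }
}
```

In the real test the supplier would read the queue-depth attribute over JMX; note that with redelivery enabled the queue can stay non-empty for a long time, so the timeout is what turns a stuck redelivery loop into a test failure rather than a hang.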
Hi,
I have a bolt that polls from Kafka (a bolt, not a spout). It has 3 threads,
as the topic has 3 partitions. The downstream processing bolt has 200
threads. With this setup I hoped the processing bolt could keep up with the
Kafka bolt. However, in my testing it seems that at any single time only 3
processing bolt threads are processing data, and the Kafka polling bolt
seems to be blocked
Hi, all!
I have 15 segments with 180,000 URLs.
When I try to execute the mergesegs tool,
the process hangs at "Processing 80000 pages" and then nothing...
HEAP_SIZE=512 MB
Please help!
--
Regards,
Dima
When running Hive with our own SerDe it all works fine. But when using Beeline to see the data, it does not pick up the SerDe, even though it
prints out the jar on the classpath. Has anyone seen this?
[root@cloudera-dev auxlib]# beeline
-hiveconf (No such file or directory)
hive.aux.jars.path=file:/usr/lib/hive/auxlib/celertech-flume-logger-0.22.0-SNAPSHOT.jar,file:/usr/lib/hive/auxlib/hbase.jar,file:/
Hadoopers,
“Hadoop ships the code to the data instead of sending the data to the code.”
Say you added two DNs/TTs to the cluster. They have no data at this point,
i.e. you have not run the balancer.
In view of the above quoted statement, will these two nodes not participate
in the MapReduce job until you balanced some data onto those
Hi all,
Would anyone have some logic ideas on how to do this processing given
the following HTML?
Selected?  Manufacturer  Colour #  Colour      Can Order?
           DMC           B5200     Snow white
Is there a method provided with the MM JDBC driver that will
take care of escaping special characters in strings? I'm
looking for something analogous to the $dbh->quote($string)
method provided in the Perl DBI for MySQL.
In the docs, I see the org.gjt.mm.mysql.EscapeProcessor class,
but there isn't much description of how to use it. Will this
do what I want?
Thanks,
-Don
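I'm not certain of EscapeProcessor's API either, but the standard JDBC way to avoid hand-escaping entirely is a PreparedStatement, which binds values as parameters rather than splicing them into the SQL text. Purely for illustration, here is a hand-rolled escaper covering the characters MySQL treats specially inside quoted string literals (in real code, prefer `?` placeholders):

```java
public class SqlEscape {
    // Escapes the characters MySQL treats specially inside a quoted string
    // literal: backslash, single quote, double quote, NUL, newline,
    // carriage return, and Ctrl-Z.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\'': out.append("\\'");  break;
                case '"':  out.append("\\\""); break;
                case '\0': out.append("\\0");  break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case 0x1A: out.append("\\Z");  break; // Ctrl-Z
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("O'Brien")); // O\'Brien
    }
}
```

With a PreparedStatement (`SELECT ... WHERE name = ?` plus `setString(1, value)`), the driver handles quoting for you, much like `$dbh->quote($string)` does in Perl DBI.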