Hi everyone,
I'm working on a project that needs a distributed inverted index, and we
are getting fair results using HBase and Hadoop (Crawlers -> Document
Repository (HBase) --M/R-> Document Index (HBase) --M/R-> Inverted Index).
However, we are also investigating more efficient ways to use this
inverted index. So after reading [1] we are wondering if anyone has figured
out a way to let a HBase
I would like to compare strings; I'm using this for my navigation.
PSEUDO-CODE
IF $title CONTAINS "abc: def"
    print TITLE;
    print SUB1;
    print SUB2;
    print SUB3;
ELSE
    print TITLE;
Comparing the two strings character by character (strcmp) does not do
the trick, since I need a substring match rather than equality.
Is there anything besides a general regex?
Or is my concern about processing speed unwarranted?
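For a plain substring test like the pseudo-code above, no regex is needed: most languages provide a direct containment check that is cheaper than compiling a pattern. A minimal Java sketch of the same logic (the title values are made-up examples):

```java
public class TitleCheck {
    // Prints the full heading block only when the title contains the marker.
    static String render(String title) {
        if (title.contains("abc: def")) {   // plain substring test, no regex
            return "TITLE\nSUB1\nSUB2\nSUB3";
        }
        return "TITLE";
    }

    public static void main(String[] args) {
        System.out.println(render("xyz abc: def ghi")); // matching title
        System.out.println(render("plain title"));      // non-matching title
    }
}
```

`String.contains` does a straightforward substring scan, so for a handful of navigation titles the processing cost is negligible.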
Can the Hive optimizer take advantage of tables that are CLUSTERED BY … SORTED BY … when
performing aggregations or analytic functions (window aggregations)?
If so, how?
More details in this SO question: optimize Hive table storage for subsequent Aggregations
and/or Window Analytic Functions
Hi all,
I recently performance-tested an ActiveMQ queue. The message processing
speed of my producers and consumers is very low: about 3,000 messages per
second. However, when I test delivery from producer to consumer, messages
are received quickly, at about 12,000 messages per second. Why is this? Can
anyone help me analyze it? Thank you very much.
Hi guys,
On what factors does HBase read latency primarily depend? What would be the
approximate theoretical limit for read latency in v0.90.1 on a cluster of 7
nodes (16 cores/16 GB RAM on 5 machines and 36 GB on the other two)? I have
an application where I generate around 1000 rows/s to be inserted into
HBase. Then I have to read this data and process it at regular intervals.
Write speed is not a problem
Hi, I need to change the storage format of some data.
The data looks like:
/masterdata/source/some_source/archive/YEAR/MONTH/DAY/HOUR/part*
Right now the data is in CSV format.
I want to convert it to SequenceFile with Snappy compression.
I need to preserve the date partitioning /YEAR/MONTH/DAY/HOUR/
I need to get a single file for each partition (SeqFiles are splittable, so I can reduce the quantity of files and save
Hi All,
I am using the schema in the Impala VM and trying to create a dynamically partitioned table on date_dim.
New table is called date_dim_i and schema for that is defined as:
create table date_dim_i
(
d_date_sk int,
d_date_id string,
d_date timestamp,
d_month_seq int,
d_week_seq
Hi,
I am planning a system to process information with Hadoop, and I will have
a few look-up tables that each processing node will need to query. There
are perhaps 20-50 such tables, and each has on the order of one million
entries. What is the best mechanism for this look-up? Memcache, HBase,
JavaSpace, Lucene index, anything else?
Thank you,
Mark
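One common baseline for side tables of this size is to load each one into an in-memory hash map on every processing node (for example, shipped via the distributed cache) and measure whether it fits before reaching for an external store. A minimal sketch of the lookup side; the table name and entries here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class SideLookup {
    private final Map<String, String> table = new HashMap<>();

    // In a real job this would be populated from a file shipped to the node.
    void put(String key, String value) { table.put(key, value); }

    // O(1) expected lookup; returns null when the key is absent.
    String lookup(String key) { return table.get(key); }

    public static void main(String[] args) {
        SideLookup countries = new SideLookup();  // hypothetical look-up table
        countries.put("DE", "Germany");
        countries.put("FR", "France");
        System.out.println(countries.lookup("DE"));
    }
}
```

With roughly a million small entries per table, each map is on the order of tens to a few hundred MB, so 20-50 tables may or may not fit per node; the external options (Memcache, HBase, a Lucene index) only become necessary when they do not.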
I seem to be seeing something entirely unexpected: when my JmsConsumer
thread throws a SchedulerException, that JmsConsumer thread will not process
any new message until it has finished processing the original ActiveMQ
message. I was expecting that having a delay would mean the JmsConsumer
thread is free to process another ActiveMQ message.
Here's my setup:
1) I have a whole slew
I’m totally stumped on this bug ...
Essentially, I have a queue that locks up, and consumers in my main daemon
no longer consume messages from it.
It’s basically dead. Even if I restart my daemon, no more messages are consumed.
I can browse the queue and consume messages from my desktop, but I can’t
consume them from my main daemon.
I’ve done all the normal debugging. JMX shows there are
Hi
I am new to Hadoop; we now have 32 nodes for Hadoop study.
I need to speed up Hadoop processing by finding the best configuration,
for example io.sort.mb, io.sort.record.percent, etc. But I do not know how
to start with so many parameters available for optimization.
BRs
Geelong
--
From Good To Great
Hi all,
we have switched to ActiveMQ and we are using automated web tests. Some of
these tests put messages in a queue, and we have to wait for those messages
to be processed before we can go on with the tests.
Messages get redelivered on failures. Currently I use JMX in a loop with a
sleep, and assume the run failed if the queue does not become empty within a
period of time.
Now the question is if there
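The wait-for-empty loop itself can be separated from the JMX plumbing: poll a queue-size supplier at an interval and give up after a deadline. A minimal sketch of that pattern, using a plain in-memory counter as a stand-in for the broker's queue-size attribute:

```java
import java.util.function.LongSupplier;

public class QueueDrainWaiter {
    // Polls queueSize until it reports 0 or the timeout elapses.
    // Returns true if the queue drained in time.
    static boolean awaitEmpty(LongSupplier queueSize, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (queueSize.getAsLong() == 0) {
                return true;
            }
            Thread.sleep(pollMs);
        }
        return queueSize.getAsLong() == 0; // final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long[] pending = {3};              // stand-in for the broker's queue size
        Thread drainer = new Thread(() -> {
            while (pending[0] > 0) {
                pending[0]--;              // simulate consumers working
                try { Thread.sleep(10); } catch (InterruptedException e) { return; }
            }
        });
        drainer.start();
        System.out.println(awaitEmpty(() -> pending[0], 2000, 20));
        drainer.join();
    }
}
```

In the real test the supplier would read the queue-depth attribute over JMX; note that with redelivery enabled the queue can stay non-empty for a long time, so the timeout is what turns a stuck redelivery loop into a test failure rather than a hang.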
Hi,
I have a bolt that polls from Kafka (a bolt, not a spout). It has 3 threads,
as the topic has 3 partitions. The downstream processing bolt has 200
threads. With this setup I hoped the processing bolt could keep up with the
Kafka bolt. However, in my testing it seems that at any single time only 3
processing bolt threads are processing data, and the Kafka polling bolt
seems to be blocked
Hi, all!
I have 15 segments with 180,000 URLs.
When I try to execute the mergesegs tool,
the process hangs at "Processing 80000 pages" and then nothing...
HEAP_SIZE=512 MB
Please help!
--
Regards,
Dima
When running Hive with our own SerDe it all works fine. But when using Beeline to see the data, it does not pick up the SerDe, even though it
prints out the jar on the classpath. Has anyone seen this?
[root@cloudera-dev auxlib]# beeline
-hiveconf (No such file or directory)
hive.aux.jars.path=file:/usr/lib/hive/auxlib/celertech-flume-logger-0.22.0-SNAPSHOT.jar,file:/usr/lib/hive/auxlib/hbase.jar,file:/
Hadoopers,
“Hadoop ships the code to the data instead of sending the data to the code.”
Say you added two DNs/TTs to the cluster. They have no data at this point,
i.e. you have not run the balancer.
In view of the above quoted statement, will these two nodes not participate
in the MapReduce job until you balanced some data onto those
Hi all,
Would anyone have some logic ideas on how to do this processing given
the following HTML?
Selected?  Manufacturer  Colour #  Colour      Can Order?
           DMC           B5200     Snow white
Is there a method provided with the MM JDBC driver that will
take care of escaping special characters in strings? I'm
looking for something analogous to the $dbh->quote($string)
method provided in the Perl DBI for MySQL.
In the docs, I see the org.gjt.mm.mysql.EscapeProcessor class,
but there isn't much description of how to use it. Will this
do what I want?
Thanks,
-Don
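I'm not certain of EscapeProcessor's API either, but the standard JDBC way to avoid hand-escaping entirely is a PreparedStatement, which binds values as parameters rather than splicing them into the SQL text. Purely for illustration, here is a hand-rolled escaper covering the characters MySQL treats specially inside quoted string literals (in real code, prefer `?` placeholders):

```java
public class SqlEscape {
    // Escapes the characters MySQL treats specially inside a quoted string
    // literal: backslash, single quote, double quote, NUL, newline,
    // carriage return, and Ctrl-Z.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\'': out.append("\\'");  break;
                case '"':  out.append("\\\""); break;
                case '\0': out.append("\\0");  break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case 0x1A: out.append("\\Z");  break; // Ctrl-Z
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("O'Brien")); // O\'Brien
    }
}
```

With a PreparedStatement (`SELECT ... WHERE name = ?` plus `setString(1, value)`), the driver handles quoting for you, much like `$dbh->quote($string)` does in Perl DBI.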