
DataFrame Python UDF Performance Too Slow

Hi,
I am running Spark 1.6.0 on EMR, and the job fails with an OOM error. I have a
DataFrame with 250 columns and I am applying a UDF to more than 50 of them. I
register the DataFrame as a temp table and apply the UDF in a hive_context
SQL statement. The UDF is applied after a sort-merge join of two DataFrames
(each around 4 GB) and multiple broadcast joins with 22 dimension tables.
Below is how I am applying the UDF:
data_frame.registerTempTable("temp_table")
new_df = hive_context.sql(
    "select python_udf(column_1), python_udf(column_2), ... from temp_table")
There is a Jira for the same issue
(https://issues.apache.org/jira/browse/SPARK-8632), which is marked as resolved
for 1.6.0, but I am still running into a similar issue.
Thanks,
Bijay
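
The query in the post applies python_udf to dozens of columns by hand. As a minimal sketch only, the select list could be generated instead of typed out. The helper name build_udf_select, the passthrough column "id", and the three-column list are hypothetical and not from the original post; the sketch only builds the SQL string and does not by itself address the OOM.

```python
def build_udf_select(udf_name, udf_columns, passthrough_columns, table_name):
    """Build a SELECT statement that wraps each of `udf_columns` in `udf_name`.

    Hypothetical helper for illustration; columns in `passthrough_columns`
    are selected unchanged.
    """
    # One "udf(col) AS col" expression per column that needs the UDF.
    udf_exprs = ["{0}({1}) AS {1}".format(udf_name, col) for col in udf_columns]
    select_list = ", ".join(list(passthrough_columns) + udf_exprs)
    return "select {0} from {1}".format(select_list, table_name)


# Assumed column names, following the column_N pattern used in the post.
udf_columns = ["column_{0}".format(i) for i in range(1, 4)]
query = build_udf_select("python_udf", udf_columns, ["id"], "temp_table")
print(query)
```

The resulting string would then be passed to hive_context.sql(query), as in the post, after the DataFrame has been registered with registerTempTable("temp_table").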

asked Mar 24 2016 at 09:20

Group Spark-user
