
Best Way To Store Avro Objects As Parquet Using SPARK

Hi All,
          In my current project there is a requirement to store Avro data
(JSON format) as Parquet files.
I was able to use AvroParquetWriter separately to create the Parquet
files. Along with the data, the Parquet files also had the Avro schema
stored in their footer.
           But when I tried using Spark Streaming I could not find a way to
store the data with the Avro schema information. The closest I got was
to create a DataFrame from the JSON RDDs and store it as Parquet. Here
the Parquet files had a Spark-specific schema in their footer.
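Roughly, that closest approach looks like the sketch below (Spark 1.4.x Java
API; jsonRdd and outputPath are placeholders for what we actually use):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// jsonRdd is a JavaRDD<String> of the Avro records rendered as JSON (placeholder name)
SQLContext sqlContext = new SQLContext(jsonRdd.context());
DataFrame df = sqlContext.read().json(jsonRdd);  // Spark infers its own schema from the JSON
df.write().parquet(outputPath);                  // the footer carries that Spark schema, not the Avro one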
      Is this the right approach, or is there a better one? Please guide me.
We are using Spark 1.4.1.
Thanks In Advance!!

asked Mar 20 2016 at 22:55

Manivannan Selvadurai



1 Reply for: Best Way To Store Avro Objects As Parquet Using SPARK
We use this, but I'm not sure how the schema is stored:
// Configure the Hadoop job so Parquet uses Avro write support; AvroWriteSupport
// embeds the Avro schema in the Parquet file footer.
Job job = Job.getInstance();
ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
AvroParquetOutputFormat.setSchema(job, schema);
// LazyOutputFormat only creates output files for non-empty partitions.
LazyOutputFormat.setOutputFormatClass(job, ParquetOutputFormat.class);
// Skip the _SUCCESS marker and the Parquet summary (_metadata) files.
job.getConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
job.getConfiguration().set("parquet.enable.summary-metadata", "false");

// Save the RDD of Avro records as Parquet; keys are unused (null/Void).
// clazz is the Avro record class, timeStamp the batch time of the micro-batch.
rdd.mapToPair(me -> new Tuple2(null, me))
   .saveAsNewAPIHadoopFile(
       String.format("%s/%s", path, timeStamp.milliseconds()),
       Void.class,
       clazz,
       LazyOutputFormat.class,
       job.getConfiguration());
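
For what it's worth, AvroWriteSupport should write the Avro schema into the
Parquet footer as key/value metadata (under parquet.avro.schema in newer
parquet-avro releases, avro.schema in older ones), so running
parquet-tools meta on one of the output files should show it in the file's
extra metadata.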

answered Mar 20 2016 at 23:58

Sebastian Piu

