
Best Way To Store Avro Objects As Parquet Using SPARK

Hi All,
          In my current project there is a requirement to store Avro data
(JSON format) as Parquet files.
I was able to use AvroParquetWriter separately to create the Parquet
files. Along with the data, the Parquet files also had the Avro schema
stored in their footer.
           But when I tried using Spark Streaming I could not find a way to
store the data with the Avro schema information. The closest I got was
to create a DataFrame from the JSON RDDs and store it as Parquet. Here
the Parquet files had a Spark-specific schema in their footer.
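Roughly, that closest approach looks like the sketch below (Spark 1.4.x Java
API; jsonRdd and outputPath are placeholders for what we actually use):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// jsonRdd is a JavaRDD<String> of the Avro records rendered as JSON (placeholder name)
SQLContext sqlContext = new SQLContext(jsonRdd.context());
DataFrame df = sqlContext.read().json(jsonRdd);  // Spark infers its own schema from the JSON
df.write().parquet(outputPath);                  // the footer carries that Spark schema, not the Avro one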
      Is this the right approach, or is there a better one? Please guide me.
We are using Spark 1.4.1.
Thanks In Advance!!

asked Mar 20 2016 at 22:55

Manivannan Selvadurai



1 Reply for: Best Way To Store Avro Objects As Parquet Using SPARK
We use this, but I'm not sure how the schema is stored:
// Configure the Hadoop job so Parquet uses Avro write support; AvroWriteSupport
// embeds the Avro schema in the Parquet file footer.
Job job = Job.getInstance();
ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
AvroParquetOutputFormat.setSchema(job, schema);
// LazyOutputFormat only creates output files for non-empty partitions.
LazyOutputFormat.setOutputFormatClass(job, ParquetOutputFormat.class);
// Skip the _SUCCESS marker and the Parquet summary (_metadata) files.
job.getConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
job.getConfiguration().set("parquet.enable.summary-metadata", "false");

// Save the RDD of Avro records as Parquet; keys are unused (null/Void).
// clazz is the Avro record class, timeStamp the batch time of the micro-batch.
rdd.mapToPair(me -> new Tuple2(null, me))
   .saveAsNewAPIHadoopFile(
       String.format("%s/%s", path, timeStamp.milliseconds()),
       Void.class,
       clazz,
       LazyOutputFormat.class,
       job.getConfiguration());
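
For what it's worth, AvroWriteSupport should write the Avro schema into the
Parquet footer as key/value metadata (under parquet.avro.schema in newer
parquet-avro releases, avro.schema in older ones), so running
parquet-tools meta on one of the output files should show it in the file's
extra metadata.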

answered Mar 20 2016 at 23:58

Sebastian Piu

