Hi,
We would like to use Spark SQL to store data in Parquet format and then
query that data using Impala.
We've come up with a working solution, but it doesn't seem right, so I was
wondering if you could tell us the correct way to do this. We are using
Spark 1.0 and Impala 1.3.1.
First we register our tables using Spark SQL:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
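For reference, a hedged sketch of how the rest of that flow can look on Spark 1.0; the Event class, paths, and table name below are illustrative, not taken from the original message:

    import sqlContext.createSchemaRDD

    // Illustrative schema and input path.
    case class Event(id: Int, value: String)
    val events = sc.textFile("hdfs:///raw/events")
      .map(_.split("\t"))
      .map(f => Event(f(0).toInt, f(1)))

    events.registerAsTable("events")               // queryable from Spark SQL
    events.saveAsParquetFile("hdfs:///warehouse/events_parquet")

    // Impala 1.3.1 side, run in impala-shell:
    //   CREATE EXTERNAL TABLE events (id INT, value STRING)
    //     STORED AS PARQUETFILE LOCATION '/warehouse/events_parquet';
    //   REFRESH events;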
Hi all,
How should I store a one-to-many relationship using Spark SQL and the Parquet
format? For example, I have the following case class
case class Person(key: String, name: String, friends: Array[String])
gives an error when I try to insert the data into a Parquet file; it doesn't
like the Array[String].
Any suggestion will be helpful.
Regards,
Jao
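A hedged workaround sketch: early Spark SQL releases reject repeated fields such as Array[String] when writing Parquet, so one option is to flatten the one-to-many side into its own table. The class names and sample data below are illustrative:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    // Flatten the one-to-many side into its own (personKey, friend) table.
    case class PersonRow(key: String, name: String)
    case class FriendRow(personKey: String, friend: String)

    val raw = sc.parallelize(Seq(("p1", "Jao", Array("alice", "bob"))))
    val people  = raw.map { case (k, n, _)  => PersonRow(k, n) }
    val friends = raw.flatMap { case (k, _, fs) => fs.map(f => FriendRow(k, f)) }

    people.saveAsParquetFile("hdfs:///tmp/people.parquet")
    friends.saveAsParquetFile("hdfs:///tmp/friends.parquet")

A join on the key reassembles the relationship at query time, and newer Spark releases added Parquet support for such nested fields directly.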
Hi,
We are using CDH 4.6 for our Hadoop cluster. We would like to store real-time data in HDFS, in Parquet format, using partitions.
Please suggest the best tool to achieve this. We are evaluating Spark, but we don't know how to store the data in HDFS in partitions.
Thanks in advance.
Regards,
Riyaz
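A minimal sketch of the partitioned write, assuming the Spark 1.4+ DataFrame writer and a hypothetical `df` holding the incoming records with a `date` column:

    // `df` and the `date` column are assumptions for illustration.
    df.write
      .partitionBy("date")
      .parquet("hdfs:///data/events")
    // Produces hdfs:///data/events/date=2015-06-01/part-*.parquet and so on,
    // a directory layout Hive and Impala both recognize as partitions.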
Hi,
Is there a way to insert data into an existing Parquet file using Spark?
I am using Spark Streaming and Spark SQL to store real-time data in Parquet files and then query it using Impala.
Spark creates multiple subdirectories of Parquet files, which makes loading them into Impala a challenge. I want to insert data into an existing Parquet file instead of creating a new one each time.
I have
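A hedged sketch of the closest available option: Parquet files are immutable, so a file cannot be extended in place; the usual approach is to append new part-files to one directory and keep their number down with coalesce(). This assumes the Spark 1.4+ writer API, a hypothetical `batch` DataFrame, and an illustrative path:

    import org.apache.spark.sql.SaveMode

    // Append new part-files to the same directory rather than rewriting
    // existing files; coalesce(1) limits each batch to one file.
    batch.coalesce(1)
      .write
      .mode(SaveMode.Append)
      .parquet("hdfs:///data/events")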
Hi,
Is it possible to use Spark as a clustered key/value store (say, like
redis-cluster or Hazelcast)? Will it perform well for writes/reads and other
operations?
My main motivation is to use the same RDD from several different
SparkContexts without saving to disk or using spark-jobserver, but I'm
curious whether someone has already tried using Spark as a key/value store.
Thanks,
Hajime
We used MapReduce for ETL and stored the results in Avro files, which are
loaded into Hive/Impala for querying.
Now we are trying to migrate to Spark, but we haven't found a way to write
the resulting RDD to Avro files.
I wonder if there is a way to do it, and if not, why doesn't Spark support
Avro as well as MapReduce does? Are there any plans?
Or what's the recommended way to output Spark results with a schema?
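One route that works is reusing the Hadoop output format from avro-mapred through saveAsNewAPIHadoopFile. A sketch, with an illustrative schema and path:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job

    val schemaString =
      """{"type":"record","name":"Result","fields":[{"name":"id","type":"int"}]}"""

    val job = Job.getInstance(sc.hadoopConfiguration)
    AvroJob.setOutputKeySchema(job, new Schema.Parser().parse(schemaString))

    val records = sc.parallelize(1 to 3).mapPartitions { it =>
      // Schema is not serializable in older Avro releases, so parse it
      // inside the task instead of closing over it from the driver.
      val schema = new Schema.Parser().parse(schemaString)
      it.map { i =>
        val r = new GenericData.Record(schema)
        r.put("id", i)
        (new AvroKey[GenericRecord](r), NullWritable.get())
      }
    }

    records.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/results.avro",
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[GenericRecord]],
      job.getConfiguration)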
Has anyone tried this? I'd like to read a bunch of Avro GenericRecords
from a Parquet file. I'm having a bit of trouble with respect to
dependencies. My latest attempt looks like this:
export
SPARK_CLASSPATH="/Users/laserson/repos/parquet-mr/parquet-avro/target/parquet-avro-1.3.3-SNAPSHOT.jar:/Users/laserson/repos/parquet-mr/parquet-hadoop/target/parquet-hadoop-1.3.3-SNAPSHOT.jar:/Users/laserson
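For reference, once the parquet-avro and parquet-hadoop jars are on the classpath as above, one approach is Hadoop's new-API input format. A sketch with a hypothetical path; package names are from the pre-rename parquet-mr 1.3.x line used here:

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.mapreduce.Job
    import parquet.avro.AvroReadSupport
    import parquet.hadoop.ParquetInputFormat

    val job = Job.getInstance(sc.hadoopConfiguration)
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

    val records = sc.newAPIHadoopFile(
        "hdfs:///data/file.parquet",
        classOf[ParquetInputFormat[GenericRecord]],
        classOf[Void],
        classOf[GenericRecord],
        job.getConfiguration)
      .map(_._2)
    // GenericRecord itself is not serializable, so convert each record to a
    // plain type before shuffling or collecting.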
Hi guys, I'm new to MLlib on Spark. After reading the documentation, it seems
that MLlib does not support deep learning. I want to know: is there any way
to implement deep learning on Spark?
*Must I use a third-party package like Caffe or TensorFlow?*
or
*Is a deep learning module on the MLlib development plan?*
Many thanks
Hi spark users,
I'm using Spark SQL to create Parquet files on HDFS. I would like to store
the Avro schema in the Parquet metadata so that non-Spark-SQL applications
can unmarshal the data using the Avro Parquet reader, without needing the
Avro schema on the side.
Currently, schemaRDD.saveAsParquetFile does not allow this. Is there another
API that lets me do it?
Best Regards,
Jerry
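One possible workaround, sketched rather than guaranteed: skip saveAsParquetFile and write through parquet-avro's AvroWriteSupport, which stores the Avro schema in the Parquet footer metadata. The schema and paths below are illustrative, and the package names are from pre-rename parquet-mr:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.mapreduce.Job
    import parquet.avro.AvroWriteSupport
    import parquet.hadoop.ParquetOutputFormat

    val schemaString =
      """{"type":"record","name":"Event","fields":[{"name":"id","type":"int"}]}"""

    val job = Job.getInstance(sc.hadoopConfiguration)
    ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
    AvroWriteSupport.setSchema(job.getConfiguration,
      new Schema.Parser().parse(schemaString))

    sc.parallelize(1 to 3)
      .mapPartitions { it =>
        // Schema is not serializable, so parse it inside the task.
        val schema = new Schema.Parser().parse(schemaString)
        it.map { i =>
          val r = new GenericData.Record(schema)
          r.put("id", i)
          (null.asInstanceOf[Void], r.asInstanceOf[GenericRecord])
        }
      }
      .saveAsNewAPIHadoopFile(
        "hdfs:///tmp/events.parquet",
        classOf[Void],
        classOf[GenericRecord],
        classOf[ParquetOutputFormat[GenericRecord]],
        job.getConfiguration)
    // AvroWriteSupport records the Avro schema in the footer metadata, so
    // the Avro Parquet reader can recover it without an external schema.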
Hi list,
I'm writing a Spark Streaming program that reads from a Kafka topic,
performs some transformations on the data, and then inserts each record into
a database with foreachRDD. I was wondering about the best way to handle the
database connection, so that each worker, or even each task, uses a
different connection, and the database inserts/updates
would be performed
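A common pattern for this, sketched with a hypothetical JDBC URL and table: create the connection inside foreachPartition, so it is built on the worker that runs the task and shared by all records of that partition:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Created here, on the worker running the task, so nothing
        // non-serializable crosses over from the driver; one connection per
        // task/partition. A per-executor pool would let tasks reuse them.
        val conn = java.sql.DriverManager.getConnection(
          "jdbc:postgresql://dbhost:5432/mydb", "user", "secret")
        try {
          val stmt = conn.prepareStatement("INSERT INTO events (payload) VALUES (?)")
          records.foreach { rec =>
            stmt.setString(1, rec.toString)
            stmt.executeUpdate()
          }
        } finally {
          conn.close()
        }
      }
    }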
Hi,
I'm quite new and recently started trying Spark. I've set up a single-node
Spark "cluster" and followed the tutorials in the Quick Start guide, but
I've come across some issues.
What I was trying to do is use the Java API and run it on the single-node
"cluster". I followed Quick Start / A Standalone App in Java and
successfully ran it using Maven. But when I was trying to use
./bin/spark
I'm building some elasticity into my model and I'd like to know when my
workers have come online. It appears at present that the API only supports
getting information about applications. Is there a good way to determine
how many workers are available?
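One heuristic that works at the SparkContext level rather than through the cluster API; treat it as an approximation of the executor count:

    // Each entry in this map is one block manager; the driver has an entry
    // too, so subtracting one approximates the number of live executors.
    val workerCount = sc.getExecutorMemoryStatus.size - 1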
Hi,
We are planning to use Spark to load data into Parquet, and this data will be
queried by Impala for visualization through Tableau.
Can we achieve this flow? How do we load data into Parquet from Spark? Will
Impala be able to access the data loaded by Spark?
I would greatly appreciate it if someone could help with an example that
achieves this goal.
Thanks in advance.
Regards,
Riyaz
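Yes, this flow is achievable. A hedged end-to-end sketch; the case class, paths, and table name are illustrative:

    // Spark 1.x side: derive a schema from a case class and write Parquet.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD

    case class Sale(id: Int, amount: Double)
    sc.textFile("hdfs:///raw/sales.csv")
      .map(_.split(","))
      .map(f => Sale(f(0).toInt, f(1).toDouble))
      .saveAsParquetFile("hdfs:///warehouse/sales_parquet")

    // Impala side, in impala-shell: point an external table at the directory
    // and refresh after each load, then connect Tableau to Impala as usual.
    //   CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE)
    //     STORED AS PARQUETFILE LOCATION '/warehouse/sales_parquet';
    //   REFRESH sales;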
I am new to Parquet and am using the Parquet format to store Spark Streaming data in HDFS.
Is it possible to merge multiple small Parquet files into one?
Please suggest an example.
Thanks in Advance!
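There is no in-place merge for Parquet, but rewriting through Spark gets the same effect. A sketch assuming the Spark 1.4+ reader/writer and hypothetical paths:

    // Read the directory of small files and rewrite it as a single file.
    // coalesce(1) funnels everything through one task, so this is only
    // sensible while the total data volume stays modest.
    val merged = sqlContext.read.parquet("hdfs:///data/stream_out")
    merged.coalesce(1).write.parquet("hdfs:///data/stream_merged")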
Hi All,
We have written Pig jobs which output the data in Parquet format.
For eg:
register parquet-column-1.3.1.jar;
register parquet-common-1.3.1.jar;
register parquet-format-2.0.0.jar;
register parquet-hadoop-1.3.1.jar;
register parquet-pig-1.3.1.jar;
register parquet-encoding-1.3.1.jar;
A = load 'path' using PigStorage('\t') as (in:int, name:chararray);
store A into 'output_path
Hi All,
I want to store a CSV text file in Parquet format in HDFS and then do some
processing in Spark.
Somehow my search for a way to do this was futile; more help was available
for Parquet with Impala.
Any guidance here? Thanks !!
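A minimal sketch of the usual Spark 1.x route; the case class, paths, and column layout are illustrative. Parse the CSV into a case class, let Spark SQL derive the schema, and save as Parquet:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD

    // Parse each CSV line into a case class so Spark SQL can derive a schema.
    case class Record(id: Int, name: String)
    val records = sc.textFile("hdfs:///input/data.csv")
      .map(_.split(","))
      .map(f => Record(f(0).toInt, f(1)))

    records.saveAsParquetFile("hdfs:///output/data.parquet")
    // Read it back later for processing:
    val loaded = sqlContext.parquetFile("hdfs:///output/data.parquet")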
Hi All,
Currently we are reading (multiple) topics from Apache Kafka and storing
them in HBase (multiple tables) using Twitter Storm (one tuple is stored in
four different tables).
But we are facing some performance issues with HBase, so we are replacing
*HBase* with *Parquet* files and *Storm* with *Apache Spark*.
Difficulties:
1. How do we read multiple topics from Kafka using Spark? (A sketch follows below.)
2. One tuple
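On question 1, a sketch using the receiver-based Kafka API; the topic names, ZooKeeper address, and consumer group are hypothetical, and a StreamingContext `ssc` is assumed:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // One receiver-based stream can subscribe to several topics at once;
    // the map value is the number of consumer threads per topic.
    val topics = Map("topicA" -> 1, "topicB" -> 1)
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", topics)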
Hi guys!
I'm using Spark SQL 1.3 on Hive with HDFS and Parquet. I've configured the
Hive metastore and I'd like to start using it.
Is it possible to create a table in the Hive metastore based on the metadata
stored in Parquet files? The stored tables contain lots of columns
(sometimes volatile) and I can't specify all of them.
I see that Impala added the SQL keyword "LIKE PARQUET 'path'", but it looks like
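One possibility under Spark 1.3, sketched with a hypothetical table name and path: the data source DDL infers the column list from the Parquet files themselves, similar in spirit to Impala's LIKE PARQUET:

    // Through a HiveContext, this registers the table in the metastore and
    // reads the schema from the Parquet footers, so no columns are listed.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql(
      """CREATE TABLE my_table
        |USING org.apache.spark.sql.parquet
        |OPTIONS (path 'hdfs:///warehouse/parquet/my_table')""".stripMargin)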
Hi -- I have a Kafka stream producing JSON and want to use Spark Streaming
or Camus to write it to HDFS in Parquet format for use with Impala.
I just wanted to see if anyone has that working and can point me to how
schema evolution is handled in this case?
Thanks!
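A sketch of the Spark Streaming half under the 1.x APIs; the paths are hypothetical and `stream` is assumed to be the DStream[String] of JSON payloads. Note that jsonRDD infers a schema per batch, which is exactly where schema evolution shows up: each output directory can carry a different schema, and Impala needs them to stay compatible:

    stream.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        val table = sqlContext.jsonRDD(rdd) // schema inferred from this batch
        table.saveAsParquetFile(
          s"hdfs:///data/json_parquet/batch-${System.currentTimeMillis}")
      }
    }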
Hi,
I am reading data from an HBase table into an RDD, and then, using foreach
on that RDD, I do some processing on every Result from the HBase table.
After this processing I want to store the processed data back into another
HBase table. How can I do that? If I use the standard Hadoop and HBase
classes to write data to HBase, I run into serialization issues.
How should I write data to HBase in this
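The serialization issue is usually the HBase client object being captured in the driver-side closure. A common sketch, assuming an RDD of (rowKey, value) string pairs, hypothetical table/column names, and the 0.94-0.98 era HBase client API: build the client inside foreachPartition, on the worker:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    processed.foreachPartition { rows =>
      // Build the non-serializable HBase client on the worker, once per
      // partition, instead of capturing it from the driver.
      val table = new HTable(HBaseConfiguration.create(), "output_table")
      rows.foreach { case (key, value) =>
        val put = new Put(Bytes.toBytes(key))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()
    }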