How Do I Load JSON In Pig?

I have some JSON data with a uniform schema. I want to load it in Pig.
JsonStorage doesn't work, because the data has no schema.
How can I load JSON data in Pig?
-- 
Russell Jurney twitter.com/rjurney datasyndrome.com

asked Nov 17 2012 at 22:09 by Russell Jurney



18 Replies for: How Do I Load JSON In Pig?
Maybe JsonLoader from Elephant Bird can be useful?
https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/JsonLoader.java

answered Nov 17 2012 at 23:40 by Adam Kawa


Not sure if this helps, but in 0.11 I've been using this on EMR for some of
our JSON data:
raw = load 'hdfs:///cleaned_logs/clicks2/$year_id/$month_id/part-*'
    USING JsonLoader('a:chararray,at:chararray,c1:(url:chararray,useragent:chararray,referrer:chararray,window:(innerheight:chararray,innerwidth:chararray,outerheight:chararray,outerwidth:chararray),resolution:(height:chararray,width:chararray)),cst:chararray,d:(a:chararray,b:chararray),i:chararray,id:chararray,ip:chararray,k:chararray,l:(lat:chararray,lng:chararray),p:chararray,pv:chararray,sa:chararray,sid:chararray,sst:chararray,t:chararray,uuid:chararray,v:chararray');
Regards,
Dano

answered Nov 18 2012 at 01:23 by Dan Young


keep calm
and use elephant-bird
https://github.com/kevinweil/elephant-bird
I posted an example here yesterday of how to load tweets in JSON.
Here it goes again. I hope it helps.
    register 'elephant-bird-core-3.0.0.jar';
    register 'elephant-bird-pig-3.0.0.jar';
    register 'google-collections-1.0.jar';
    register 'json-simple-1.1.jar';
    json_lines = LOAD '/twitter_data/tweets/stream/v1/json/2012_10_10/08'
        USING com.twitter.elephantbird.pig.load.JsonLoader();
    geo_tweets = FOREACH json_lines GENERATE
        (CHARARRAY) $0#'id' AS id,
        (CHARARRAY) $0#'geoLocation' AS geoLocation;
    only_not_nulls = FILTER geo_tweets BY geoLocation is not null;
    store only_not_nulls into '/twitter_data/results/geo_tweets';
Arian Rodrigo Pasquali
FEUP, SAPO Labs
http://www.arianpasquali.com
twitter @arianpasquali

answered Nov 18 2012 at 02:30 by Arian Pasquali


Thanks, that is excellent.
Russell Jurney http://datasyndrome.com

answered Nov 18 2012 at 04:32 by Russell Jurney


Thanks - looks like I don't have to specify the schema, which is good.
I'll try and build elephant-bird.
Russell Jurney http://datasyndrome.com

answered Nov 18 2012 at 17:19 by Russell Jurney


You don't need to build it either.
Just download the two jars I used in my example.
Arian

answered Nov 18 2012 at 22:30 by Arian Pasquali


They come prebuilt? Neat!
Russell Jurney twitter.com/rjurney

answered Nov 18 2012 at 22:46 by Russell Jurney


You don't need to build it either.
Just download the two jars I used in my example.
Arian

answered Nov 18 2012 at 22:46 by Arian Pasquali


I don't think you really need to build it.
You can find it in any Maven repository.
Arian Rodrigo Pasquali
FEUP, SAPO Labs
http://www.arianpasquali.com
twitter @arianpasquali

answered Nov 19 2012 at 00:31 by Arian Pasquali


answered Nov 19 2012 at 16:23 by Russell Jurney


Got it building. Are google collections and json-simple external deps?

answered Nov 19 2012 at 19:27 by Russell Jurney


Talking to myself... never mind, guava and json-simple are included with
Pig.

answered Nov 19 2012 at 19:30 by Russell Jurney


Wait... com.twitter.elephantbird.pig.load.JsonLoader() does not infer the
schema from a record. Schema inference is what I was looking for. Looks like
I have to write that myself.
And yes, I understand the tradeoffs in doing so. Assuming that one sample
reflects the overall schema is a big assumption.
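
A minimal sketch of the kind of schema inference discussed above, assuming one JSON object per line and that a single sample record is representative (the big assumption already noted). It is a standalone Python helper, not part of Pig or elephant-bird. The function names `pig_type`, `fields`, and `infer_pig_schema`, the type mapping, and the placeholder field name `v` are all illustrative choices. The output follows the schema-string style of the built-in JsonLoader call earlier in the thread, but whether a given loader accepts every inferred shape would need checking.

```python
import json


def pig_type(value):
    """Map one decoded JSON value to a Pig schema type string (illustrative mapping)."""
    if isinstance(value, bool):       # test bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        # Nested object -> tuple with named fields.
        return "(" + fields(value) + ")"
    if isinstance(value, list):
        # Array -> bag; the element type is guessed from the first element only.
        inner = pig_type(value[0]) if value else "chararray"
        if not inner.startswith("("):
            inner = "(v:" + inner + ")"   # bags hold tuples; 'v' is a made-up field name
        return "{" + inner + "}"
    # Strings and nulls fall back to chararray.
    return "chararray"


def fields(record):
    """Render the fields of one JSON object as 'name:type,name:type,...'."""
    return ",".join(f"{key}:{pig_type(val)}" for key, val in record.items())


def infer_pig_schema(json_line):
    """Infer a schema string from a single sample record (one JSON object)."""
    return fields(json.loads(json_line))
```

Null values and empty arrays carry no type information, so they fall back to chararray here. Records whose fields differ from the sample are not handled.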

answered Nov 19 2012 at 19:33 by Russell Jurney


OK, it's even worse. My data is a big array.
Am I being negative in saying that JSON and Pig together are like a nightmare?
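
One possible workaround for a file that is a single top-level array, assuming the loader in use expects one JSON object per line, is to rewrite the array as one record per line before loading it. The sketch below is a standalone Python preprocessing step, not something from Pig or elephant-bird; the function name `array_to_lines` is hypothetical, and it reads the whole array into memory, so it only suits files that fit in RAM.

```python
import json


def array_to_lines(src, dst):
    """Read one JSON array from src and write each element to dst on its own line.

    src and dst are open text file objects. Returns the number of records written.
    """
    records = json.load(src)          # parses the entire array into memory
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    count = 0
    for record in records:
        # json.dumps escapes newlines inside strings, so each record stays on one line.
        dst.write(json.dumps(record) + "\n")
        count += 1
    return count
```

It could be called with two open files, for example `array_to_lines(open("big_array.json"), open("records.json", "w"))`, where both file names are placeholders.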

answered Nov 19 2012 at 19:35 by Russell Jurney


I also ran into the same dilemma. Here is something that I found easier and
that works for me. I compiled some sources from http://www.json.org/java/
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
public class JsonParser extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        TupleFactory tf = TupleFactory.getInstance();
        Tuple t = tf.newTuple();
        // Parse the first field as a JSON object and append the extracted value.
        if (input.get(0) != null) {
            String inString = (String) input.get(0);
            try {
                JSONObject jsn = new JSONObject(inString);
                t.append(getJsonArr(jsn));
            } catch (JSONException e) {
                e.printStackTrace();
            }
        }
        return t;
    }
    private String getJsonArr(JSONObject jsn) {
        String jsnArrVal = "";
        try {
            if (!jsn.has("jsonKey"))
                return null;
            JSONArray jTagArray = jsn.getJSONArray("jsonKey");
            for (int i=0; i

answered Nov 19 2012 at 20:22 by Deepak Tiwari


I'm also experiencing problems working with JSON objects in Pig.
I have managed to load a log file in JSON format, but I can only query the
top-level objects. Whenever I try to access anything that is nested, it fails.
-- Register JARS
register elephant-bird-2.2.3.jar;
register json-simple-1.1.jar;
-- Load data
nestobject = LOAD '/Users/Path/GoogleDrive/test.json'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true')
        AS (json:map[]);
DUMP nestobject;
-- Example query
tester = FOREACH nestobject GENERATE json#'event', json#'uid',
        json#'data'#'expired_reason' AS reason;
DUMP tester;
The above fails ...
Does anyone have any ideas?
Thanks
Sax

answered Nov 21 2012 at 05:56 by Saxifrage Cucvara


Try
com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
This should allow access to nested objects as nested maps ($0#'level1'#'level2'#'level3' …)
David

answered Nov 21 2012 at 14:25 by David LaBarbera


Thanks David.
However, I did try this. I can read things on the first level of the JSON
file, but anything in any of the nested levels is failing.
Not sure if the below errors help with identifying what the problem might
be:
2012-11-22 09:29:07,065 [Thread-39] WARN  org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in cleanup
2012-11-22 09:29:07,065 [Thread-39] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0009
org.apache.pig.backend.executionengine.ExecException: ERROR 1081: Cannot cast to map. Expected bytearray but received: chararray
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:1422)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.processInput(POMapLookUp.java:87)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:117)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:320)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassCastException
2012-11-22 09:29:07,199 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0009
2012-11-22 09:29:07,199 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-11-22 09:29:12,207 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local_0009 has failed! Stop running all dependent jobs
2012-11-22 09:29:12,207 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-11-22 09:29:12,207 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!

answered Nov 21 2012 at 22:36 by Saxifrage Cucvara


Group Pig-user

asked Nov 17 2012 at 22:09

active Nov 21 2012 at 22:36

posts:19

users:7
