QnaList > Groups > Sqoop-User > Nov 2014

Handling Special Characters During Sqoop Import

Hi,
I am running a Sqoop import with MySQL as the source. I recently found that
the imported data contains special characters, and even control characters,
which lose their meaning once they are written to the Sqoop data files.
I am looking for a way to handle these special characters or, if possible,
to prune the unwanted data out of my target dataset.
Looking for a resolution at the earliest!
Thanks!

asked Nov 21 2014 at 05:57

Vineet Mishra



5 Replies for: Handling Special Characters During Sqoop Import
Hey there,
Could you explain what you mean by "losing its meaning"? It's possible you
may need to set the character set:
http://dev.mysql.com/doc/connector-j/en/connector-j-reference-charsets.html.
-Abe

answered Nov 21 2014 at 12:45

Abraham Elmahrek
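
A minimal sketch of what setting the character set on the connection could look like. It only builds the argument list for a Sqoop import and prints it. `useUnicode` and `characterEncoding` are Connector/J connection properties; the host, database, table and target directory are made-up placeholders, and the helper names are hypothetical.

```python
import shlex


def mysql_jdbc_url(host, database, port=3306, encoding="utf8"):
    """Build a MySQL JDBC URL that asks Connector/J to use the given encoding."""
    return (
        f"jdbc:mysql://{host}:{port}/{database}"
        f"?useUnicode=true&characterEncoding={encoding}"
    )


def sqoop_import_args(jdbc_url, table, target_dir):
    """Return the argument list for a basic `sqoop import` (nothing is executed)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
    ]


# Placeholder values, not taken from the thread.
url = mysql_jdbc_url("dbhost", "mydb")
args = sqoop_import_args(url, "people", "/data/people")

# shlex.join quotes the URL, which matters because an unquoted '&'
# would be interpreted by the shell instead of reaching Sqoop.
print(shlex.join(args))
```

If the command is typed by hand, the `--connect` value needs quotes for the same reason.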


Hi Abe,
What I mean by that statement is that the data residing in MySQL is
different from what gets imported via Sqoop.
Here is an example:
Data in MySQL: सुरेन्द्र कुमार पाण्डेय
Data in HDFS (Sqoop import): M-`M-$M-8M-`M-%M-
This is the kind of change I am running into, and the data completely
loses its meaning.
Any help would be appreciated.
Thanks again!

answered Nov 22 2014 at 00:28

Vineet Mishra
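
One thing worth checking before blaming the import: the `M-` sequences in the HDFS sample look like the notation that `cat -v` (and some pagers) use for bytes above 0x7F, where `M-x` stands for the byte of `x` with the high bit set and `^x` stands for a control character. Assuming that is what produced the output, the sketch below turns the notation back into bytes so they can be tested as UTF-8. The function name is hypothetical, and a literal `M-` or `^` in the data would be ambiguous.

```python
def decode_cat_v(text: str) -> bytes:
    """Convert `cat -v` style notation (ASCII input only) back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        high = 0
        # "M-" followed by at least one more character marks a byte >= 0x80.
        if text.startswith("M-", i) and i + 2 < len(text):
            high = 0x80
            i += 2
        ch = text[i]
        if ch == "^" and i + 1 < len(text):
            # "^A" is 0x01, "^M" is 0x0D, "^?" is 0x7F: flip bit 0x40.
            value = ord(text[i + 1]) ^ 0x40
            i += 2
        else:
            value = ord(ch)
            i += 1
        out.append(high | value)
    return bytes(out)


# First three escapes from the sample in this thread.
sample = "M-`M-$M-8"
raw = decode_cat_v(sample)
print(raw.hex(), raw.decode("utf-8"))
```

If the decoded bytes are valid UTF-8, the file may be fine and only the tool used to view it is escaping the non-ASCII bytes.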


The problem could be in one of two places: loading into HDFS, or extracting
from MySQL. Sqoop should load everything as UTF-8 by default, which supports
Hindi.
What is your default character set in MySQL? Could you copy/paste your
my.cnf? Also, what version of MySQL are you running?

answered Nov 22 2014 at 11:10

Abraham Elmahrek
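
To tell the two places apart, it can help to look at what kind of damage the bytes show. The sketch below is only an illustration in plain Python, with latin-1 standing in as an assumed wrong character set: bytes that are still valid UTF-8 but read with the wrong charset can be repaired, while text pushed through a charset that cannot represent it is replaced by `?` and is gone. All function names are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def misread_as_latin1(text: str) -> str:
    """Simulate a reader that interprets UTF-8 bytes as latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_latin1_misread(garbled: str) -> str:
    """Undo misread_as_latin1; only works if no bytes were lost."""
    return garbled.encode("latin-1").decode("utf-8")


def lossy_to_latin1(text: str) -> bytes:
    """Simulate a conversion to a charset that cannot hold the text."""
    return text.encode("latin-1", errors="replace")


original = "सुरेन्द्र"
garbled = misread_as_latin1(original)      # looks wrong, bytes are intact
recovered = repair_latin1_misread(garbled)
lossy = lossy_to_latin1(original)          # every character becomes b"?"

print(is_valid_utf8(original.encode("utf-8")), recovered == original, lossy)
```

Question marks in the imported file point at a lossy conversion before the data reached HDFS; intact but odd-looking bytes point at how the file is being read.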


Hi Abe,
Thanks for your mail. The MySQL table is defined as utf-8, and the data
displays correctly there, as shown below:
Data in MySQL: सुरेन्द्र कुमार पाण्डेय
But when I move the same data through a Sqoop import it gets corrupted, as
shown in my previous mail in this thread.
I even tried setting the parameters
*useUnicode=true&characterEncoding=utf8* and *--direct --
but still no luck.
Additionally, the data contains control characters such as Ctrl-A (\x01)
and Ctrl-M, which collide with the field delimiter of the Sqoop import,
since that is set to exactly Ctrl-A. Is there a delimiter that can safely
handle any special or control character that shows up in the data?
Looking for a quick response.
Thanks!

answered Nov 24 2014 at 02:50

Vineet Mishra
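
The delimiter collision described above is easy to reproduce outside Sqoop. The sketch below, in plain Python with made-up sample values, shows a Ctrl-A inside a field producing an extra column, and then the lossy fix of dropping or replacing the offending characters before writing. Sqoop documents `--hive-drop-import-delims` and `--hive-delims-replacement` for removing or replacing \n, \r and \01 in string fields; whether they apply to a given import depends on the Sqoop version and import type, so check the user guide. The helper below is a hypothetical stand-in, not Sqoop's code.

```python
FIELD_DELIM = "\x01"                    # Ctrl-A, as used in the import above
DELIMS_TO_DROP = ("\n", "\r", "\x01")   # characters that break delimited text


def clean_field(value: str, replacement: str = "") -> str:
    """Drop (or replace) record and field delimiters found inside a value."""
    for ch in DELIMS_TO_DROP:
        value = value.replace(ch, replacement)
    return value


# Made-up row: three columns, the second one contains a Ctrl-A.
row = ["42", "first\x01second", "ends with CR\r"]

naive = FIELD_DELIM.join(row)
print(len(naive.split(FIELD_DELIM)))    # more columns than the row had

cleaned = FIELD_DELIM.join(clean_field(v) for v in row)
print(cleaned.split(FIELD_DELIM))
```

This keeps the column count stable but changes the data, so it only fits cases where the control characters are unwanted anyway.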


Well, it seems the issue is with the MySQL client configuration on the
datanodes where Sqoop invokes the M/R job.
I ran a test on my local machine, loading the same data into MySQL and
doing a Sqoop import to HDFS, and the data arrived in HDFS correctly.
This indicates that the issue is in the MySQL client configuration, which I
need to fix by setting the character set to utf-8 (I had assumed utf-8 was
the default).
But the latter part of the question still stands: how do I handle control
characters in the data? I don't know in advance what the data may contain
(I have already encountered control characters), so choosing a control
character as the delimiter does not help if the data itself contains that
character.
Looking for the standard solution.
Thanks!

answered Nov 25 2014 at 21:45

Vineet Mishra
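
When the content of the data is unknown, no single delimiter is safe by itself; the usual way out is to escape or enclose field values so that the delimiter can appear inside them. Sqoop exposes this through its output formatting options `--escaped-by`, `--enclosed-by` and `--optionally-enclosed-by`. The sketch below shows the idea in plain Python with a backslash escape; the escape codes are an assumption made for this example and are not Sqoop's on-disk format.

```python
FIELD_DELIM = "\x01"
ESCAPE = "\\"

# Hypothetical escape codes chosen for this sketch.
_ENCODE = {ESCAPE: ESCAPE + ESCAPE, FIELD_DELIM: ESCAPE + "A",
           "\n": ESCAPE + "n", "\r": ESCAPE + "r"}
_DECODE = {ESCAPE: ESCAPE, "A": FIELD_DELIM, "n": "\n", "r": "\r"}


def join_record(fields):
    """Escape each field, then join the fields with the delimiter."""
    return FIELD_DELIM.join(
        "".join(_ENCODE.get(ch, ch) for ch in field) for field in fields
    )


def split_record(line):
    """Split on unescaped delimiters and undo the escaping."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == ESCAPE:
            code = next(chars, None)
            if code not in _DECODE:
                raise ValueError("invalid escape sequence in record")
            current.append(_DECODE[code])
        elif ch == FIELD_DELIM:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Made-up row containing the delimiter, the escape character and a CR/LF pair.
row = ["42", "a\x01b\\c", "line1\r\nline2"]
record = join_record(row)
print(split_record(record) == row)
```

Unlike dropping the characters, this round-trips the original values, at the cost that every consumer of the files has to understand the escaping.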

