What is SQOOP? - step by step
::SQOOP:: (SQL + HADOOP)
Sqoop is a Hadoop ecosystem component, built on top of HDFS, that is meant for interacting with RDBMS systems.
RDBMS (Oracle or MySQL): a relational DB carries structured data. When the volume of data grows huge, it becomes hard to store and process it in an RDBMS, so it is better to move it to HDFS.
Moving data from an RDBMS to HDFS needs a tool, and that tool is called Sqoop.
Below are some of the key observations with respect to Sqoop:
- Sqoop import/export works with HDFS only (there is no intervention of the LFS).
- To interact with any RDBMS using Sqoop, the target RDBMS must have a JDBC driver (i.e., it should be Java compatible).
- Interacting with a specific RDBMS through Sqoop needs that RDBMS's connector, i.e. its jar file must be present in Sqoop's installed lib directory.
- Import an entire table using Sqoop.
- Import part of a table with a "where" clause or a "columns" clause.
- Import all tables from a specified DB.
- Export a table from HDFS to an RDBMS.
::Connect to RDBMS via Sqoop::
sqoop import --connect jdbc:mysql://<ip address>/<DBname> --table <Table Name>;
Sometimes you'll get an "access denied" error because of privileges. Here I'm using it locally, so run the following in MySQL:
grant all privileges on <DBname>.* to '%'@'localhost';
For only one user, use the username instead of %.
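For example, a grant for a single user might look like the following sketch (the user name 'hduser' and the DB name batch9 are assumptions for illustration, not from the original setup):

```shell
# Hypothetical example: grant privileges on batch9 to one MySQL user
# 'hduser' instead of the wildcard %, then reload the grant tables.
mysql -u root -p -e "grant all privileges on batch9.* to 'hduser'@'localhost'; flush privileges;"
```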
By default, Sqoop uses 4 mappers to perform a task. This can be changed with -m <num of mappers>.
The default field delimiter is ",".
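As a quick sketch of changing the mapper count (using the batch9 DB and emp table from the examples below; 2 mappers is an arbitrary choice):

```shell
# Run the import with 2 mappers instead of the default 4;
# the output in HDFS will then consist of 2 part files.
sqoop import --connect jdbc:mysql://localhost/batch9 --table emp -m 2
```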
Syntax: sqoop <tool-name> [GENERIC-ARGS] [TOOL-ARGS]
Following are some common arguments mostly used in Sqoop:
Change the target directory with --target-dir '<path>'
Use a different delimiter instead of "," with --fields-terminated-by '|'
Import only specific columns with --columns 'empid,ename'
Filter rows with --where 'esal>2000'
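The arguments above can be combined in a single command; here is a hypothetical sketch (the column list, filter, and target path are illustrative, not from the original post):

```shell
# Import selected columns of emp, filtered by salary, pipe-delimited,
# into a chosen HDFS directory with a single mapper.
sqoop import --connect jdbc:mysql://localhost/batch9 --table emp \
  --columns 'empid,ename,esal' \
  --where 'esal>2000' \
  --fields-terminated-by '|' \
  --target-dir '/emp_filtered' \
  -m 1
```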
You can get the available Sqoop commands by using: sqoop help
List of databases: sqoop list-databases --connect jdbc:mysql://localhost;
List of tables: sqoop list-tables --connect jdbc:mysql://localhost/batch9
Import all tables from a DB: sqoop import-all-tables --connect jdbc:mysql://localhost/batch9 -m 1
eval: evaluate a SQL statement and display the results on the console.
Note the difference: "eval --query" runs against the RDBMS and prints to the console, not to HDFS, whereas "import --query" writes its results to HDFS, not back to the RDBMS.
Example of eval and query:
sqoop eval --connect jdbc:mysql://localhost/batch9 --query "select * from emp";
Example of "import and --query"
sqoop import --connect jdbc:mysql://localhost/batch9 --query "select * from emp where \$CONDITIONS" --target-dir '/import9query';
sqoop import --connect jdbc:mysql://localhost/batch9 --query "select * from emp where sal > 1000 and \$CONDITIONS" --target-dir '/import9query';
NOTE: "where \$CONDITIONS" should be present in every free-form query; Sqoop replaces it with the split conditions for each mapper.
Import binary data:
sqoop import --connect jdbc:mysql://localhost/batch9 --table emp -m 1 --where 'sal > 1000' --target-dir '/import9query' --as-sequencefile;
Avro file: --as-avrodatafile
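A full Avro variant of the import above might look like this sketch (the target directory '/import9avro' is a hypothetical name for illustration):

```shell
# Same emp import, but stored in HDFS as an Avro data file
# instead of a sequence file.
sqoop import --connect jdbc:mysql://localhost/batch9 --table emp -m 1 \
  --target-dir '/import9avro' --as-avrodatafile
```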
Job management arguments:
sqoop job [GENERIC-ARGS] [JOB-ARGS] [-- [<tool-name>] [TOOL-ARGS]]
--create <job-id>          Create/save a new job.
--delete <job-id>          Delete a saved job.
--exec <job-id>            Run a saved job.
--help                     Print usage instructions.
--list                     List saved jobs.
--meta-connect <jdbc-uri>  Specify the JDBC connect string for the metastore.
--show <job-id>            Show the parameters for a saved job.
--verbose                  Print more information while working.
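Putting these together, a typical workflow saves an import as a job and reruns it later; this sketch uses a hypothetical job name 'empjob' with the emp table from the earlier examples:

```shell
# Save an import definition as a reusable job (note the space after "--"
# before the tool name), then list, inspect, and execute it.
sqoop job --create empjob -- import --connect jdbc:mysql://localhost/batch9 --table emp -m 1
sqoop job --list
sqoop job --show empjob
sqoop job --exec empjob
```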
EXPORT: Some key observations with respect to Sqoop export:
1) Before the export, the target table's schema should already be present in the RDBMS.
2) The field order and types should match.
3) If we are exporting multiple files and even a single record does not match, the entire export fails.
sqoop export --connect jdbc:mysql://localhost/<DB> --table emp --export-dir '<filename_path>' --fields-terminated-by '|';
ERROR: Sometimes we face a "--split-by" issue; it comes when the table does not contain a primary key, because Sqoop cannot decide how to split the data across mappers.
Resolve it by setting the number of mappers to 1, adding a primary key to the table, or passing --split-by <column> to Sqoop.
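Both workarounds can be sketched as follows (the table name emp_nopk and the split column empid are hypothetical names for a table without a primary key):

```shell
# Option 1: force a single mapper, so no split column is needed.
sqoop import --connect jdbc:mysql://localhost/batch9 --table emp_nopk -m 1

# Option 2: tell Sqoop explicitly which column to split on.
sqoop import --connect jdbc:mysql://localhost/batch9 --table emp_nopk --split-by empid
```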
Comment below for any queries and share it. @SsaiK