
What can HIHO fetch for me?

All query results. Anything you can write in a SQL statement: single tables, table joins, whatever your traditional JDBC driver supports!

I want to perform periodic fetching and keep data in one place. Does HIHO allow that?

Yes, surely. HIHO has an append feature, which allows you to write to the same folder again as part of the import step. Be careful with it, though: make sure repeated imports into the same folder are really what you want.

How does fetching work?

HIHO internally connects to the database using JDBC. You tell HIHO what query to run to fetch the records. Since you will often want to parallelize the fetch and split it across mappers, you can also specify a bounding query: a query that fetches the upper and lower bounds for the split. Based on the number of mappers you want to use, HIHO then splits your records by ranges of the bounding column. Let's say you want to import records which were added between two days ago and yesterday. Your bounding query, in pseudo-SQL, might be

select id from tableName where addedDate >= (sysdate - 2) and addedDate <= (sysdate - 1)

This provides a range of ids for the records to be imported, say between 1000 and 10000. Now, if your main query is

select name, age, ... from tableName

HIHO splits it into range queries over the column you specify for the split in the orderBy configuration.

Say you run 10 mappers: the first mapper fetches records with ids between 1000 and 2000, the second between 2000 and 3000, and so on, with the last fetching between 9000 and 10000. Typically, you will have range queries over the primary key column of the table you are reading.
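To make the splitting concrete, here is a minimal sketch (not HIHO's actual implementation) of how a bounding range is divided into per-mapper range queries. The table and column names are carried over from the example above:

```java
// Illustrative sketch only -- not HIHO's internal code.
// Divides the bounding range [1000, 10000] among 10 mappers and
// builds one range query per mapper over the split column "id".
public class SplitSketch {
    public static void main(String[] args) {
        long lower = 1000, upper = 10000;   // bounds returned by the bounding query
        int mappers = 10;
        long chunk = (upper - lower) / mappers;
        for (int i = 0; i < mappers; i++) {
            long lo = lower + i * chunk;
            // the last mapper takes any remainder so the full range is covered
            long hi = (i == mappers - 1) ? upper : lo + chunk;
            String query = "select name, age from tableName"
                    + " where id >= " + lo + " and id < " + hi
                    + " order by id";
            System.out.println("mapper " + i + ": " + query);
        }
    }
}
```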

How is data transformed?

Data is read from the JDBC ResultSet. HIHO currently ships with two formats for transforming the data: a simple delimited format, and conversion to an Avro GenericRecord. You can specify the delimiter you want; HIHO is configurable to a wide degree. If you need to escape some portions of your delimited output, you can modify your query based on your needs.
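Conceptually, the delimited format joins the column values of each row with your chosen delimiter. A minimal sketch of that idea, purely illustrative and not HIHO's actual output format class:

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

// Illustrative only: turns one JDBC result-set row into a delimited line,
// roughly what the delimited output does for each record.
public class DelimitedRow {
    public static String toLine(ResultSet rs, String delimiter) throws SQLException {
        ResultSetMetaData meta = rs.getMetaData();
        StringBuilder line = new StringBuilder();
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            if (i > 1) line.append(delimiter);
            line.append(rs.getString(i));   // any escaping is handled in your query
        }
        return line.toString();
    }
}
```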

An Avro GenericRecord is more suitable for binary data, or if you want to define your downstream MapReduce workflow pipeline in Avro. Avro is lightweight and fast, and is slated to become the RPC mechanism for Hadoop. It is easy to build rule engines and complex processing over Avro schemas. Avro [benchmarks exceedingly well](http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking) against other JVM serialization libraries. You can read more on the [Avro website](http://hadoop.apache.org/avro/).
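For a feel of what a fetched row looks like as a GenericRecord, here is a sketch using the standard Avro API, independent of HIHO; the schema is invented for the example:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Illustrative only: a row from the earlier example (name, age)
// represented as an Avro GenericRecord. The schema is made up here.
public class AvroRowExample {
    public static void main(String[] args) {
        String json = "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(json);
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "alice");
        record.put("age", 30);
        System.out.println(record);   // {"name": "alice", "age": 30}
    }
}
```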

Can I create my own custom format?

Yes, certainly. HIHO understands that you may need to process your data in a different way, or that you may want to perform some analysis while records are being fetched. You can define your own mapper and use com.meghsoft.hiho.mapreduce.lib.db.GenericDBWritable as the value, or you can build your own RecordReader over com.meghsoft.hiho.mapreduce.lib.db.DBQueryRecordReader.
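As a starting point, a custom mapper might look like the sketch below. The key type and the way column values are pulled out of GenericDBWritable are assumptions here, so check the HIHO source for the real method names:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.meghsoft.hiho.mapreduce.lib.db.GenericDBWritable;

// Sketch of a custom mapper over fetched records. The key type and the
// GenericDBWritable accessors are assumptions -- verify against the HIHO source.
public class MyRecordMapper
        extends Mapper<LongWritable, GenericDBWritable, Text, Text> {

    @Override
    protected void map(LongWritable key, GenericDBWritable value, Context context)
            throws IOException, InterruptedException {
        // Analyze or transform the record here while it is being fetched,
        // then emit whatever your downstream steps need.
        context.write(new Text(key.toString()), new Text(value.toString()));
    }
}
```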

I don't want to dump and then process. Can I process the data while it is being fetched?

Yes. See the above answer.
