Data Engines: Join Strategies in Hive

http://www.openkb.info/2014/11/understanding-hive-joins-in-explain.html

Common Join :

This is the default join. It is expensive and resource-intensive because it always runs through both the Map stage and the Reduce stage (with a shuffle in between), irrespective of the table sizes.
Hint : /*+ STREAMTABLE(table_alias) */

If a table is named in the hint, that table is streamed; otherwise the rightmost table in the join is streamed.
Usually the largest table should be streamed and the other tables should be buffered.
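
To make the buffered/streamed distinction concrete, here is a minimal Python sketch of a reduce-side join. It is only an illustration of the idea, not Hive's implementation; the function name common_join and the (key, value) row shape are assumptions made for this example.

```python
from collections import defaultdict

def common_join(buffered_rows, streamed_rows):
    """Toy reduce-side (shuffle) inner join. Rows are (key, value) tuples."""
    # "Map + shuffle": every row of both tables is grouped by its join key.
    shuffled = defaultdict(lambda: ([], []))
    for key, value in buffered_rows:
        shuffled[key][0].append(value)
    for key, value in streamed_rows:
        shuffled[key][1].append(value)

    # "Reduce": for each key, the buffered side is held in memory while the
    # streamed side is walked row by row. (In this toy both sides are lists;
    # a real reducer would only iterate over the streamed rows.)
    for key in sorted(shuffled):
        buffered, streamed = shuffled[key]
        for streamed_value in streamed:
            for buffered_value in buffered:
                yield (key, buffered_value, streamed_value)
```

Because the buffered side must fit in reducer memory for each key, it makes sense to stream whichever table contributes the most rows per key.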

Map Join :

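In a Map Join, the small table in the join is loaded into memory and the result is computed on the map side, without a reduce stage.
Full Outer joins are generally not converted to map joins, and a Left or Right Outer join can be converted only when the table whose rows must all be preserved is the one being streamed (so a Right Outer join will not work if the right table is the small one).

Hint : /*+ MAPJOIN(small_table_alias) */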
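
Conceptually a map join is a hash join: build a hash table from the small table, then probe it while scanning the big one. The sketch below illustrates that idea in Python; the name map_join and the row shape are assumptions for this example, not Hive internals.

```python
def map_join(small_rows, big_rows):
    """Toy map-side inner join. Rows are (key, value) tuples."""
    # Build phase: the whole small table goes into an in-memory hash table.
    hash_table = {}
    for key, value in small_rows:
        hash_table.setdefault(key, []).append(value)

    # Probe phase: the big table is streamed once; no shuffle or reduce step.
    for key, big_value in big_rows:
        for small_value in hash_table.get(key, []):
            yield (key, small_value, big_value)
```

The sketch also shows why the small table has to fit in memory: every mapper needs the full hash table.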

Bucket Join :

In this join, the tables are bucketed on the columns used in the join condition.
The number of buckets in one table must be a multiple of the number of buckets in the other table.
It is meant for tables that are too large for a plain Map Join.

The flag below should be set:
set hive.optimize.bucketmapjoin=true;

Cons:
The tables need to be bucketed on the same columns that the SQL joins on, so the optimization cannot be used for queries that join on other columns.
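
The "multiple of" rule is what lets each bucket of the larger table be joined against a single bucket of the smaller one. The Python sketch below illustrates this under simplifying assumptions: keys are integers, the bucket function is key modulo the bucket count (a stand-in for Hive's hash-based bucketing), and only the case where the big table's bucket count is a multiple of the small table's is handled. The names bucketize and bucket_map_join are hypothetical.

```python
def bucketize(rows, num_buckets):
    """Split (key, value) rows into buckets by key % num_buckets (assumed rule)."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets

def bucket_map_join(small_buckets, big_buckets):
    """Toy bucket map join: each big bucket probes exactly one small bucket."""
    n_small, n_big = len(small_buckets), len(big_buckets)
    if n_big % n_small != 0:
        raise ValueError("bucket counts must be multiples of each other")
    out = []
    for i, big_bucket in enumerate(big_buckets):
        # Keys in big bucket i can only appear in small bucket i % n_small,
        # so only that one small bucket has to be loaded into memory.
        lookup = {}
        for key, value in small_buckets[i % n_small]:
            lookup.setdefault(key, []).append(value)
        for key, big_value in big_bucket:
            for small_value in lookup.get(key, []):
                out.append((key, small_value, big_value))
    return out
```

For example, with 2 buckets on the small side and 4 on the big side, big bucket 3 only ever needs small bucket 1.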

Sort Merge Bucket Join :

In this join, the tables are both bucketed and sorted on the columns used in the join condition.

1. The tables need to be created bucketed and sorted on the same join columns, and the data also needs to be bucketed when it is inserted.
One way is to set "hive.enforce.bucketing=true" before inserting data.

2. The parameters below need to be set to convert an SMB join to an SMB map join.
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
3. The big table selection policy parameter "hive.auto.convert.sortmerge.join.bigtable.selection.policy" determines which table is only streamed.
4. The "MAPJOIN" hint can determine which table is small and should be loaded into memory.
5. Small tables are read on demand, which means the small tables are not held in memory.
6. Outer joins are supported.
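
Because matching buckets are already sorted on the join key, they can be joined with a single merge pass instead of a hash table. The Python sketch below shows that merge step for one pair of buckets; it is a simplified illustration (inner join only), and the name merge_join is an assumption for this example.

```python
def merge_join(left, right):
    """Toy sort-merge inner join of two (key, value) lists sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                      # left is behind, advance it
        elif left_key > right_key:
            j += 1                      # right is behind, advance it
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return out
```

Only the current run of equal keys is looked at on each step, which matches the point that the small table is read on demand rather than held in memory.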

Skew Join :

This join is used when some values of the column used in the join condition are highly skewed.

The dimension table rows that have the skewed values are kept in memory.
Mappers are triggered over the Fact table and join the rows with a highly skewed value directly against that in-memory data.

For the remaining values, the rows go through the regular Map and Reduce stages.

This way the small table (dimension table) is traversed twice and everything else is read once,

while the Fact table rows with skewed values skip the shuffle and Reduce operations.
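
The flow described above can be sketched in Python as follows. This is a conceptual illustration only, assuming the set of skewed keys is already known; the name skew_join, the skewed_keys argument, and the row shape are hypothetical and do not correspond to Hive settings.

```python
def skew_join(dim_rows, fact_rows, skewed_keys):
    """Toy skew join. Returns (map_side_results, reduce_side_results)."""
    # First pass over the dimension table: keep only skewed-key rows in memory.
    skew_lookup = {}
    for key, value in dim_rows:
        if key in skewed_keys:
            skew_lookup.setdefault(key, []).append(value)

    # Single pass over the fact table.
    map_side, to_shuffle = [], []
    for key, fact_value in fact_rows:
        if key in skewed_keys:
            # Skewed rows are joined in the mapper and never shuffled.
            for dim_value in skew_lookup.get(key, []):
                map_side.append((key, dim_value, fact_value))
        else:
            to_shuffle.append((key, fact_value))

    # Second pass over the dimension table: regular join for the other keys
    # (stands in for the shuffle + reduce stage).
    dim_by_key = {}
    for key, value in dim_rows:
        if key not in skewed_keys:
            dim_by_key.setdefault(key, []).append(value)
    reduce_side = [
        (key, dim_value, fact_value)
        for key, fact_value in to_shuffle
        for dim_value in dim_by_key.get(key, [])
    ]
    return map_side, reduce_side
```

Note how the dimension table is scanned twice while the fact table is scanned once, which is the trade-off the section describes.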
