Data Engines: Data structures in Spark

RDD:

An RDD (Resilient Distributed Dataset) is a collection of data partitioned across the nodes of a cluster.
An RDD is an abstraction: it has no concrete or physical existence until its data is actually computed or stored.
It is fault tolerant, which means it can rebuild itself when a node fails. The rebuild is done using the RDD lineage graph, which records the transformations used to derive the RDD.
An RDD can be created either from a file or manually from an in-memory collection using the parallelize method.
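
The points above can be sketched in a few lines of Scala. This is a minimal illustration, not a production setup: the local master, the app name, and the commented-out file path are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD manually from a local collection, split into 4 partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // Create an RDD from a file; "data/input.txt" is a hypothetical path.
    // val lines = sc.textFile("data/input.txt")

    // Transformations are lazy: nothing is computed here, only lineage is recorded.
    val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

    // Print the lineage that Spark would replay to rebuild a lost partition.
    println(evenSquares.toDebugString)

    // An action triggers the actual computation.
    val result = evenSquares.collect().toList
    assert(result == List(4, 16, 36, 64, 100))

    spark.stop()
  }
}
```

Until collect() is called, evenSquares is only a description of how to compute the data, which is also what makes the rebuild after a node failure possible.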

DataFrame:

A DataFrame is a set of data with named columns that is divided across nodes; in other words, a set of rows distributed among the nodes.
A DataFrame is similar to a table in an RDBMS, which gives us the flexibility to use traditional SQL operations such as order by, group by, and filter on the data.
DataFrames are used through a SQLContext (reached via SparkSession in Spark 2.0 and later), which lets us run SQL queries on them.
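
As a rough sketch of the SQL-style operations mentioned above, the example below runs the same query twice, once through the DataFrame API and once as a SQL string. The sample rows, the column names, and the view name employees are all made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-sketch")
      .getOrCreate()
    import spark.implicits._ // enables toDF on local collections

    // Hypothetical sample rows with named columns.
    val employees = Seq(
      ("Asha", "eng", 120),
      ("Ben", "eng", 100),
      ("Cara", "ops", 90)
    ).toDF("name", "dept", "salary")

    // DataFrame API: filter, group by, order by.
    val byDept = employees
      .filter(col("salary") >= 100)
      .groupBy("dept")
      .agg(avg("salary").as("avg_salary"))
      .orderBy("dept")
    byDept.show()

    // The same query as SQL, run against a temporary view.
    employees.createOrReplaceTempView("employees")
    val viaSql = spark.sql(
      """SELECT dept, AVG(salary) AS avg_salary
        |FROM employees
        |WHERE salary >= 100
        |GROUP BY dept
        |ORDER BY dept""".stripMargin)
    viaSql.show()

    spark.stop()
  }
}
```

Both forms go through Spark SQL, so the choice between them is mostly a matter of taste.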

Dataset:

Recall that a set of data with named columns available across nodes is a DataFrame. A Dataset is a typed version of the same idea: a distributed collection of objects of a known type.
A Dataset can use RDD-style functions such as map and filter while keeping the optimized execution of a DataFrame, which is why it is called a hybrid entity (it uses Spark SQL). In other words, it combines RDD and DataFrame functionality.
A Dataset can be created either from an RDD or from a DataFrame.
Datasets can be joined, unioned, and aggregated.
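
The sketch below puts these points together in Scala, since the typed Dataset API is available in Scala and Java but not in Python. The Person case class, the sample values, and the cities lookup table are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical record type for the example.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-sketch")
      .getOrCreate()
    import spark.implicits._ // provides encoders, toDS, toDF, and as[T]

    // Create a Dataset from an RDD.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Asha", 31), Person("Ben", 17)))
    val fromRdd = rdd.toDS()

    // Create a Dataset from a DataFrame by attaching a type to its rows.
    val df = Seq(("Cara", 45)).toDF("name", "age")
    val fromDf = df.as[Person]

    // Union two Datasets of the same type.
    val people = fromRdd.union(fromDf)

    // RDD-style typed functions: filter and map operate on Person objects.
    val adultNames = people.filter((p: Person) => p.age >= 18).map(p => p.name)
    assert(adultNames.collect().toSet == Set("Asha", "Cara"))

    // Join with a small lookup table on the shared "name" column.
    val cities = Seq(("Asha", "Pune"), ("Cara", "Oslo")).toDF("name", "city")
    people.join(cities, "name").show()

    // Aggregate over the whole Dataset.
    people.agg(avg("age").as("avg_age")).show()

    spark.stop()
  }
}
```

Note that join and agg as used here return untyped DataFrames, while filter and map keep the result typed.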
