Multi Load Utility

MLoad (MultiLoad) is a command-driven utility used to perform INSERT, UPDATE, and DELETE operations.

The following tables are created by MultiLoad:


Error Tables: MultiLoad requires two error tables per target table, one for constraint violations and the other for UPI violations.


Work Tables: Hold the requested DML tasks and the input data that is ready to be applied to the AMPs.

Log Table: Keeps a record of the results from each phase of the load so that MultiLoad knows the proper point from which to restart.


There are five phases in MultiLoad, described below.

Preliminary Phase:
First checks whether the SQL and MultiLoad commands are valid.
Establishes MultiLoad sessions with Teradata.
Creates the error tables, work tables, and log table.
DML Phase:
All the SQL DML statements are sent to the database.
The PE parses the DML and generates a step-by-step plan to execute the request.
This execution plan is communicated to each AMP and stored in the appropriate work table for each target table.
During the Acquisition Phase, the actual input data will also be stored in the work tables so that it can be applied in the Application Phase.

Acquisition Phase:
MultiLoad now acquires the data in large, unsorted 64K blocks from the host and sends it to the AMPs (round robin).
Each AMP hashes every row on the primary index and sends it over the BYNET to the proper AMP, where it will ultimately be used; the row is not inserted into its target table just yet.
Each AMP puts all of the hashed rows it has received from the other AMPs into its work tables, where they are assembled with the stored SQL steps.
The utility then places a load lock on each target table in preparation for the Application Phase.
Application Phase:

In this phase the data is loaded from the work tables into the actual target tables.
The main purpose of this phase is to write, or APPLY, the specified changes to both the target tables and the NUSI subtables.
End Phase (Basic Cleanup):

If the last error code is zero (0), all of the job steps have completed successfully.
All of the empty error tables, the work tables, and the log table are dropped. All locks, both Teradata and MultiLoad, are released.
The MultiLoad session is logged off.
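
As an illustration of how these phases map to a script, below is a minimal sketch of a MultiLoad import job run from the shell. All of the names (mydb.employee, the work/error/log tables, emp.txt, and the logon string) are hypothetical placeholders:

mload <<'EOF'
.LOGTABLE mydb.emp_log;                     /* restart log table (Preliminary Phase) */
.LOGON tdpid/user,password;                 /* establishes MultiLoad sessions        */
.BEGIN MLOAD TABLES mydb.employee
   WORKTABLES  mydb.emp_wt                  /* holds DML steps and acquired data     */
   ERRORTABLES mydb.emp_et mydb.emp_uv;     /* constraint and UPI violation tables   */
.LAYOUT emp_layout;
.FIELD emp_id   * VARCHAR(10);
.FIELD emp_name * VARCHAR(30);
.DML LABEL ins_emp;                         /* DML Phase: plan stored in work tables */
INSERT INTO mydb.employee (emp_id, emp_name)
VALUES (:emp_id, :emp_name);
.IMPORT INFILE emp.txt                      /* Acquisition and Application Phases    */
   FORMAT VARTEXT '|'
   LAYOUT emp_layout
   APPLY ins_emp;
.END MLOAD;                                 /* End Phase: cleanup, locks released    */
.LOGOFF;
EOF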

Limitations of MultiLoad:


    * MultiLoad does not support the SELECT statement.
    * Concatenation of multiple input data files is not allowed.
    * MultiLoad does not support arithmetic functions (e.g. ABS, LOG) in the MLoad script.
    * MultiLoad does not support exponentiation or aggregate operators (e.g. AVG, SUM) in the MLoad script.
    * MultiLoad does not support USIs (Unique Secondary Indexes), referential integrity, join indexes, hash indexes, or triggers.
    * An import task requires the use of a PI (Primary Index).

TPump in Teradata


TPump is a data-loading utility that helps maintain (update, delete, insert, and atomic upsert) data in the Teradata Database.
It allows near-real-time data to be achieved in the data warehouse.
Teradata TPump updates information in real time, acquiring data from the client system with low processor utilization.
It uses row-hash locks rather than table-level locks, so instead of updating Teradata databases overnight or in batches throughout the day, queries can be run while Teradata TPump is running.
Setup is simple and hassle-free: it does not require staging of data, intermediary files, or any special hardware.
Operation is efficient and time-saving: jobs can continue running in spite of database restarts, dirty data, and network slowdowns, and they restart without any intervention.

How TPump Works:



With a single invocation of Teradata TPump, one or more distinct TPump tasks can be executed in series, along with any TPump support commands.
A Teradata TPump task acquires data from client files and applies it to target tables using INSERT, UPDATE, or DELETE statements that specify the full primary index of the tables. Data is fetched from the client and sent as transaction rows to the Teradata Database, where it is immediately applied to the various target tables.
Each Teradata TPump task can acquire data from one or many client files with the same or different layouts. From each source record, one or more INSERT, UPDATE, or DELETE statements can be generated and directed to any target table.
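
For illustration, a minimal TPump job for such a task might look like the sketch below (all table, file, and logon names are made up). Note that the DML specifies the full primary index, and the BEGIN LOAD options correspond to features listed in the next section:

tpump <<'EOF'
.LOGTABLE mydb.emp_tpump_log;            /* restart log table                      */
.LOGON tdpid/user,password;
.BEGIN LOAD
   SESSIONS 4                            /* sessions into Teradata                 */
   PACK 20                               /* statements packed per request          */
   SERIALIZE ON                          /* keeps operations on a row in order     */
   CHECKPOINT 10                         /* time-based checkpoint interval (min)   */
   ERRORTABLE mydb.emp_tpump_err;
.LAYOUT emp_layout;
.FIELD emp_id   * VARCHAR(10) KEY;       /* KEY drives hashing records to sessions */
.FIELD emp_name * VARCHAR(30);
.DML LABEL upd_emp;
UPDATE mydb.employee
SET    emp_name = :emp_name
WHERE  emp_id   = :emp_id;               /* full primary index is specified        */
.IMPORT INFILE emp_changes.txt
   FORMAT VARTEXT '|'
   LAYOUT emp_layout
   APPLY upd_emp;
.END LOAD;
.LOGOFF;
EOF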

Benefits  of TPump:


The following points describe how Teradata TPump works.

• The language of Teradata TPump commands and statements is used to describe the task that needs to be accomplished.
• Teradata TPump examines all the commands and statements for a task, from the BEGIN LOAD command to the END LOAD command, before actually executing the task.
• After all commands and statements in a given task have been processed and validated by Teradata TPump, the task is executed.
• Teradata TPump optionally supports data serialization for a given row, which guarantees that if a row insert is immediately followed by an update of that row, the insert is processed first. This is done by hashing records to a given session.
• It supports bulletproof restartability using time-based checkpoints. Frequent checkpoints make restarting easier, but at the expense of checkpointing overhead.
• It supports upsert logic similar to MultiLoad.
• It supports insert/update/delete statements in multiple-record requests.
• It uses macros to minimize network overhead.
• It supports interpretive, record manipulating and restarting features similar to MultiLoad.
• It supports conditional apply logic similar to MultiLoad.
• It supports error treatment options similar to MultiLoad.
• It runs as a single process.
• It supports Teradata Database internationalization features such as kanji
character sets.
• Up to 2430 statements can be packed into a single request for network efficiency. This limit of 2430 may change, as the overall limit for a request is one megabyte. Teradata TPump assumes that every statement will be a one-step request, or two-step for fallback tables.

Operating Modes


Teradata TPump runs in the following operating modes:
1) Interactive
• Interactive processing requires more or less continuous participation from the user.
2) Batch
• Batch programs process data in discrete groups of previously scheduled operations, typically as a separate operation rather than interactively or in real time.


Limitations:

The maximum row size (data plus indicator bytes) is 64K.
The SELECT statement, concatenation, and aggregate functions are not allowed.
A single TPump job file is limited to four IMPORT commands.
For dates before 1900 or after 1999, the year portion of the date must be represented by four numerals (yyyy).
Unlike MLoad and FastLoad, which use access logging, TPump does not use access logging, for performance reasons.
When performing Teradata TPump deletes and updates against a partitioned table, specify the values for the partitioning column set to avoid lock-contention problems that can degrade performance, and avoid updating primary index and partitioning columns with Teradata TPump, which also degrades performance (see the sketch below).
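
As a hedged illustration (the table and column names are hypothetical), the DML portion of such a TPump job would name both the full primary index and the partitioning column:

UPDATE mydb.sales
SET    amount    = :amount
WHERE  sale_id   = :sale_id        /* full primary index            */
AND    sale_date = :sale_date;     /* partitioning column specified */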

TPT in Teradata

Teradata Parallel Transporter (TPT) is an object-oriented client application that provides high-speed, scalable, parallel data:

                                    • Extraction
                                    • Loading
                                    • Updating
TPT builds on the functionality of the traditional Teradata extract and load utilities: FastLoad, MultiLoad, FastExport, and TPump (also known as the standalone utilities).
Multiple targets are possible in a single TPT job. A source or target for a TPT job can be any of the following:
                               • Databases (both relational and non-relational)
                               • Database servers
                               • Data storage devices
                               • File objects, text files, and comma-separated values (CSV)

TPT Execution Process:


When a job script is submitted, TPT does the following:

Analyzes the statements in the job script.
Initializes its internal components.
Creates, optimizes, and executes a parallel plan for completing the job by:
creating instances of the required operator objects;
creating a network of data streams that interconnect the operator instances;
and coordinating the execution of the operators.
Coordinates checkpoint and restart processing.
Restarts the job automatically when the Teradata Database signals a restart.
Terminates the processing environments.
Between the data source and destination, TPT jobs can:

Retrieve, store, and transport specific data objects via parallel data streams.
Merge or split multiple parallel data streams.
Duplicate data streams for loading multiple targets.
Filter and cleanse data.
TPT supports the following types of parallelism (a job script sketch follows):
Pipeline parallelism
Data parallelism
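
The sketch below shows what a simple TPT job script might look like, assuming invented object, table, and file names. It defines a file-reading producer operator and a Load consumer operator connected by a data stream, and is submitted with tbuild:

cat > job.tpt <<'EOF'
DEFINE JOB load_employee
DESCRIPTION 'Load the employee table from a delimited file'
(
  DEFINE SCHEMA emp_schema                 /* row layout shared by the operators */
  (
    emp_id   VARCHAR(10),
    emp_name VARCHAR(30)
  );

  DEFINE OPERATOR file_reader              /* producer: reads the source file */
  TYPE DATACONNECTOR PRODUCER
  SCHEMA emp_schema
  ATTRIBUTES
  (
    VARCHAR DirectoryPath = '/data/',
    VARCHAR FileName      = 'employee.txt',
    VARCHAR Format        = 'Delimited',
    VARCHAR TextDelimiter = '|'
  );

  DEFINE OPERATOR load_emp                 /* consumer: bulk Load operator */
  TYPE LOAD
  SCHEMA *
  ATTRIBUTES
  (
    VARCHAR TdpId        = 'mytdp',
    VARCHAR UserName     = 'user',
    VARCHAR UserPassword = 'password',
    VARCHAR TargetTable  = 'mydb.employee',
    VARCHAR LogTable     = 'mydb.employee_log'
  );

  APPLY ('INSERT INTO mydb.employee (emp_id, emp_name) VALUES (:emp_id, :emp_name);')
  TO OPERATOR (load_emp)
  SELECT * FROM OPERATOR (file_reader);    /* data stream connects the operators */
);
EOF
tbuild -f job.tpt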

Workflow Manager Comparison: Airflow vs Oozie vs Azkaban

Airflow has a very powerful UI, is written in Python, and is developer friendly, which makes it extremely easy to create a new workflow based on a DAG. It provides both a CLI and a UI that let users visualize dependencies, progress, logs, related code, and when various tasks are completed. It has many features such as centralized logging, strong automation, scheduling, and backfilling options. Its retry feature automatically performs retries before failing a task. It is modular, scalable, and highly extensible. Airflow has built-in script parameterization using the Jinja templating engine. At Airbnb, where it originated, it has been used for analytics such as search ranking and sessionization to track users' clickstreams and time spent.

GUI: UI and Python
Tools compatible: Hive, Presto, MySQL, HDFS, Postgres, or S3
Event driven: Supports event-based scheduling
Programmability: UI (needs simple Python coding)


Oozie is an open-source Java web application. It provides out-of-the-box support for MapReduce, Pig, Hive, Sqoop, and DistCp, as well as system-specific jobs. It is scalable, reliable, extensible, and highly available. Using Oozie, identical workflows can be parameterized to run concurrently, and jobs can be killed, suspended, or resumed in bulk. Multiple coordinator and workflow jobs can be packaged and managed together through an Oozie bundle.
GUI: Command line, Java API, and web browser
Tools compatible: MapReduce, Pig, Hive, Sqoop, DistCp, and system-specific jobs
Event driven: Supports both time-based and input-data-based scheduling
Programmability: XML (easy to use)


Azkaban is another open-source workflow manager. Workflows can be uploaded through a simple web and HTTP interface. It is modular and pluggable for each part of the Hadoop ecosystem. It has features such as tracking of user actions, authentication, and authorization. It provides a separate workspace for each new project and sends email alerts on SLAs, failures, and successes. It allows users to retry failed jobs.
GUI: Web browser only
Tools compatible: Compatible with any version of Hadoop
Event driven: Azkaban supports only time-based scheduling
Programmability: Java

Encryption & Decryption in GCP

We have two approaches to encrypt and decrypt data in GCP.

Approach-I:


Before we encrypt or decrypt our data, we need two objects: a KeyRing and a CryptoKey.

Creating KeyRings and CryptoKeys:

Cloud KMS uses an object hierarchy in which a CryptoKey belongs to a KeyRing, and a KeyRing resides in a particular location.


Syntax to create a KeyRing:

gcloud kms keyrings create KEYRING_NAME --location LOCATION

Sample code used:
gcloud kms keyrings create my-key --location global



Description:
Creates a new KeyRing named "my-key" in location "global".

Create a CryptoKey:

Syntax:
gcloud kms keys create CRYPTOKEY_NAME --location LOCATION --keyring KEYRING_NAME --purpose encryption

Sample code used:
gcloud kms keys create sree123-key --location global --keyring my-key --purpose encryption



Description:
Creates a new CryptoKey "sree123-key" for the KeyRing "my-key".
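
To verify the hierarchy, the KeyRings and the CryptoKeys they contain can be listed with the names used above:

gcloud kms keyrings list --location global
gcloud kms keys list --location global --keyring my-key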

Encrypting the Data:

Syntax:
gcloud kms encrypt \
    --location=global  \
    --keyring=my-key-ring \
    --key=my-key \
    --plaintext-file=YOUR_FILEPATH_AND_FILENAME_TO_ENCRYPT \
    --ciphertext-file=YOUR_FILEPATH_AND_FILENAME.enc

Sample code used:
gcloud kms encrypt \
    --location=global \
    --keyring=my-key \
    --key=sree123-key \
    --plaintext-file=/scripts/sample.txt \
    --ciphertext-file=/scripts/sample.txt.enc

Description:
To encrypt data, we have to provide the appropriate key information, specify the name of the plaintext file to encrypt, and specify the name of the file that will contain the encrypted content.



Decrypting the Data:

Syntax:
gcloud kms decrypt \
    --location=global \
    --keyring=my-key-ring \
    --key=my-key \
    --ciphertext-file=YOUR_FILEPATH_AND_FILENAME_TO_DECRYPT \
    --plaintext-file=YOUR_FILEPATH_AND_FILENAME_TO_DECRYPT.dec

Sample code used:
gcloud kms decrypt \
    --location=global \
    --keyring=my-key \
    --key=sree123-key \
    --ciphertext-file=/scripts/sample.txt.enc \
    --plaintext-file=/scripts/sample1.txt.dec

Description:
To decrypt data, we have to provide the appropriate key information, specify the name of the encrypted file (ciphertext file) to decrypt, and specify the name of the file that will contain the decrypted content.



Approach-II:

Encryption and decryption of a file can be done with OpenSSL (Secure Sockets Layer) protocols using key-based symmetric ciphers. Below are the commands to encrypt the file.



Encryption commands:

openssl enc -in employee.txt \
    -aes-256-cbc \
    -pass stdin > employee.txt.enc

The above command encrypts employee.txt to employee.txt.enc using 256-bit AES (Advanced Encryption Standard), a strong symmetric encryption algorithm. A secret key is used for both encryption and decryption of the data. The command waits for the user to enter a password and uses it to derive an appropriate key.



Copying the Encrypted File to GCP Using gsutil:

Command: gsutil cp E:\employee.txt.enc gs://emp2/



Copying the Encrypted File to the Local Cloud Shell:
Command: gsutil cp gs://emp2/employee.txt.enc /scripts

Decryption Commands:

openssl enc -in employee.txt.enc \
    -d -aes-256-cbc \
    -pass stdin > employee.txt