Thursday, March 11, 2021

Multi Load Utility

MultiLoad (MLoad) is a command-driven utility used to perform INSERT, UPDATE, and DELETE operations on Teradata tables.

The following tables are created by MLoad:


Error Tables: MLoad requires two error tables per target table, one for constraint violations and the other for unique primary index (UPI) violations.

Work Tables: hold the DML tasks that are requested and the input data that is ready to be applied to the AMPs.

Log Table: keeps a record of the results from each phase of the load so that MultiLoad knows the proper point from which to restart.


There are five phases in MultiLoad, described below.

Preliminary Phase:
Checks whether the SQL and MultiLoad commands in the script are valid.
Establishes MultiLoad sessions with Teradata.
Creates the error tables, work tables, and log table.
DML Phase:
All the SQL DML statements are sent to the database.
The parsing engine (PE) parses the DML and produces a step-by-step plan to execute the request.
This execution plan is communicated to each AMP and stored in the appropriate work tables for each target table.
Later, during the Acquisition Phase, the actual input data is also stored in the work tables so that it can be applied in the Application Phase.

Acquisition Phase:
MultiLoad now acquires the data in large, unsorted 64K blocks from the host and sends it to the AMPs in round-robin fashion.
Each AMP hashes every row on the primary index and sends it over the BYNET to the AMP that will ultimately own it; the row is not inserted into its target table just yet.
Each AMP places the hashed rows it receives from the other AMPs into its work tables, where they are assembled for the SQL to be applied.
The utility then places a load lock on each target table in preparation for the Application Phase.
Application Phase :

In this phase the data is loaded from the work tables into the actual target tables.
The main purpose of this phase is to write, or APPLY, the specified changes to both the target tables and the NUSI subtables.
End Phase – Basic Cleanup :

If the last error code is zero (0), all of the job steps have completed successfully (i.e., all has ended well).
The empty error tables, the work tables, and the log table are dropped, and all locks, both Teradata and MultiLoad, are released.
The MultiLoad session is then logged off.
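
The error, work, and log tables described above are ordinary Teradata tables, so after a load they can be inspected like any other table. The sketch below is a hedged example using the teradatasql Python driver (pip install teradatasql); the host, credentials, and the ET_/UV_ error-table names are illustrative assumptions, not part of a real job.

# Illustrative post-load check: count the rows left in the two MLoad error tables.
# Connection details and table names are made up for this example.
import teradatasql

with teradatasql.connect(host="tdhost", user="dbuser", password="dbpass") as con:
    cur = con.cursor()
    for error_table in ("mydb.ET_employee", "mydb.UV_employee"):
        cur.execute("SELECT COUNT(*) FROM " + error_table)
        print(error_table, cur.fetchone()[0])

A non-zero count in the first table points to constraint or conversion problems with the input rows, while rows in the second table indicate uniqueness (UPI) violations, matching the two error tables described above.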

Limitations of Multi-Load:


    * MultiLoad does not support the SELECT statement.
    * Concatenation of multiple input data files is not allowed.
    * MultiLoad does not support arithmetic functions (e.g. ABS, LOG) in an MLoad script.
    * MultiLoad does not support exponentiation or aggregate operators (e.g. AVG, SUM) in an MLoad script.
    * MultiLoad does not support USIs (Unique Secondary Indexes), referential integrity, join indexes, hash indexes, or triggers.
    * An IMPORT task requires the use of the primary index (PI).

TPump in Teradata


TPump is a data-loading utility that helps maintain (update, delete, insert, and atomic upsert) data in the Teradata Database.
It allows near-real-time data to be achieved in the data warehouse.
Teradata TPump updates information in real time by acquiring data from the client system with low processor utilization.
Instead of updating the Teradata Database overnight or in batches throughout the day, it uses row-hash locks rather than table-level locks, which allows queries to be run while Teradata TPump is running.
Setup is simple and hassle-free: it does not require staging of data, intermediary files, or any special hardware.
Operation is efficient and time-saving: jobs continue running in spite of database restarts, dirty data, and network slowdowns, and they restart without any intervention.

How TPump Works:



With a single invocation of Teradata TPump, one or more distinct TPump tasks can be executed in series, along with any TPump support commands.
The Teradata TPump task provides the acquisition of data from client files for application to target tables using INSERT, UPDATE, or DELETE statements that specify the full primary index of the tables. Data is fetched from the client and sent as transaction rows to the Teradata Database, where it is immediately applied to the various target tables.
Each Teradata TPump task acquires data from one or many client files with the same or different layouts. From each source record, one or more INSERT, UPDATE, or DELETE statements can be generated and directed to any target table.

Benefits of TPump:


The following points describe how Teradata TPump operates.

• The language of Teradata TPump commands and statements is used to describe the task that needs to be accomplished.
• Teradata TPump examines all the commands and statements for a task, from the BEGIN LOAD command through the END LOAD command, before actually executing the task.
• After all commands and statements in a given task have been processed and validated, the Teradata TPump task is executed.
• Teradata TPump optionally supports data serialization for a given row, which guarantees that if a row insert is immediately followed by an update of that row, the insert is processed first. This is done by hashing records to a given session.
• It supports bulletproof restartability using time-based checkpoints. Frequent checkpoints make restarting easier, at the expense of checkpointing overhead.
• It supports upsert logic similar to MultiLoad (a sketch of the atomic upsert statement appears after this list).
• It supports insert/update/delete statements in multiple-record requests.
• It uses macros to minimize network overhead.
• It supports interpretive, record-manipulating, and restarting features similar to MultiLoad.
• It supports conditional apply logic similar to MultiLoad.
• It supports error treatment options similar to MultiLoad.
• It runs as a single process.
• It supports Teradata Database internationalization features such as kanji character sets.
• Up to 2430 operations can be packed into a single request for network efficiency. This limit of 2430 may change, as the overall limit for a request is one megabyte; Teradata TPump assumes that every statement will be a one- or two-step (for fallback) request.
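
TPump scripts themselves are not written in Python, but the atomic upsert it relies on is ordinary Teradata SQL. The following hedged sketch uses the same teradatasql driver as the MultiLoad example above to issue that kind of UPDATE ... ELSE INSERT statement; the host, credentials, and the mydb.employee table (with emp_id as its primary index) are illustrative assumptions.

# Illustrative only: shows the atomic UPSERT form TPump applies, sent here through
# the teradatasql driver rather than through a TPump job. All names are made up.
import teradatasql

upsert_sql = """
UPDATE mydb.employee
   SET emp_name = ?
 WHERE emp_id = ?
ELSE INSERT INTO mydb.employee (emp_id, emp_name)
     VALUES (?, ?)
"""

with teradatasql.connect(host="tdhost", user="dbuser", password="dbpass") as con:
    cur = con.cursor()
    # One atomic UPSERT per input row: update if the primary index value exists, else insert.
    cur.execute(upsert_sql, ["Asha", 101, 101, "Asha"])

In an actual TPump job, the equivalent statement would be declared in the script's DML section and applied to incoming records by the IMPORT step rather than sent through a driver.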

Operating Modes


Teradata TPump runs in the following operating modes:
1) Interactive
• Interactive processing requires more or less continuous participation by the user.
2) Batch
• Batch programs process data in discrete groups of previously scheduled operations, typically as a separate operation rather than interactively or in real time.


Limitations:

The maximum row size (data plus indicator bytes) is 64K.
The SELECT statement, concatenation of input data files, and aggregate functions are not allowed.
A single TPump job file is limited to four IMPORT commands.
For dates before 1900 or after 1999, the year portion of the date must be represented by four digits (yyyy).
Unlike MLoad and FastLoad, which use access logging, TPump does not use access logging, for performance reasons.
Specify values for the partitioning column set when performing Teradata TPump deletes and updates, to avoid lock-contention problems that can degrade performance, and avoid updating the primary index and partitioning columns with Teradata TPump to reduce performance degradation.

TPT in Teradata

Teradata Parallel Transporter (TPT) is an object-oriented client application that provides high-speed, scalable, parallel data:

                                    • Extraction
                                    • Loading
                                    • Updating
TPT builds on the functionality of the traditional Teradata extract and load utilities, i.e. FastLoad, MultiLoad, FastExport, and TPump (also known as the standalone utilities).
Multiple targets are possible in a single TPT job. A source or target for a TPT job can be any of the following:
                               • Databases (both relational and non-relational)
                               • Database servers
                               • Data storage devices
                               • File objects, text files, and comma-separated values (CSV)

TPT Execution Process:


When a job script is submitted, TPT does the following:

Analyzes the statements in the job script.
Initializes its internal components.
Creates, optimizes, and executes a parallel plan for completing the job by:
    creating instances of the required operator objects,
    creating a network of data streams that interconnect the operator instances, and
    coordinating the execution of the operators.
Coordinates checkpoint and restart processing.
Restarts the job automatically when the Teradata Database signals a restart.
Terminates the processing environments.
Between the data source and destination, TPT jobs can:

Retrieve, store, and transport specific data objects via parallel data streams.
Merge or split multiple parallel data streams.
Duplicate data streams for loading multiple targets.
Filter and cleanse data.
TPT supports the following types of parallelism:
Pipeline Parallelism
Data Parallelism

Workflow managers comparison: Airflow vs Oozie vs Azkaban

Airflow has a very powerful UI, is written in Python, and is developer friendly, so it is extremely easy to create a new workflow based on a DAG (directed acyclic graph). It provides both a CLI and a UI that let users visualize dependencies, progress, logs, related code, and when various tasks are completed. It has many features such as centralized logging, great automation, scheduling, and backfilling options. It also has a retries feature, which automatically retries a task before failing it. It is modular, scalable, and highly extensible. Parameterizing scripts is built in via the Jinja templating engine. It can also be used to provide analytics on search ranking and sessionization information to track users' clickstream and time spent. A minimal DAG sketch follows the comparison attributes below.

GUI: UI and Python
Tools compatible: Hive, Presto, MySQL, HDFS, Postgres, or S3
Event driven: supports event-based scheduling
Programmability: UI (needs simple Python coding)
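
To make this concrete, here is a minimal, hedged DAG sketch using the Airflow 2.x API; the DAG id, schedule, and bash commands are illustrative assumptions, not taken from a real pipeline.

# Minimal Airflow 2.x DAG sketch; dag_id, commands, and schedule are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_clickstream",            # hypothetical pipeline name
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
    catchup=True,                          # enables backfilling of missed runs
    default_args={"retries": 2},           # automatic retries before a task fails
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract clickstream'")
    load = BashOperator(task_id="load", bash_command="echo 'load to warehouse'")

    extract >> load                        # dependency: extract runs before load

Dropping a file like this into the scheduler's dags/ folder is enough for Airflow to pick it up; the dependency graph, logs, retries, and backfills then appear in the UI described above.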


Oozie is an open-source Java web application. It provides out-of-the-box support for MapReduce, Pig, Hive, Sqoop, and DistCp, as well as jobs that are system-specific. It is scalable, reliable, and extensible, and offers high availability. Using Oozie, identical workflows can be parameterized to run concurrently. It allows jobs to be killed, suspended, or resumed in bulk. Multiple coordinator and workflow jobs can be packaged and managed together through an Oozie Bundle.
GUI: command line, Java API, and web browser
Tools compatible: MapReduce, Pig, Hive, Sqoop, DistCp, and system-specific jobs
Event driven: supports both time-based and input-data-based scheduling
Programmability: XML (easy to use)


Azkaban is another open-source workflow manager. Simple web and HTTP workflow uploads can be done in Azkaban. It is modular and pluggable for each part of the Hadoop ecosystem. It has features like tracking user actions, authentication, and authorization. It provides a separate workspace for each new project and also provides email alerts on SLAs, failures, and successes. It allows users to retry failed jobs.
GUI: web browser only
Tools compatible: compatible with any version of Hadoop
Event driven: Azkaban supports only time-based scheduling
Programmability: Java

Wednesday, March 10, 2021

Encryption & Decryption in GCP

We have two approaches to encrypt and decrypt data in GCP.

Approach-I:


Before we encrypt or decrypt our data we need two objects: a KeyRing and a CryptoKey.

Creating KeyRings and CryptoKeys:

Cloud KMS uses an object hierarchy, such that a CryptoKey belongs to a KeyRing, which resides in a particular location.


Syntax to Create a KeyRing:  

gcloud kms keyrings create KEYRING_NAME --location LOCATION

Sample Code Used:
gcloud kms keyrings create my-key --location global



Description:
Creating a new KeyRing with name "my-key" in location "global"
Create a CryptoKey:
Syntax:  
gcloud kms keys create CRYPTOKEY_NAME --location LOCATION --keyring KEYRING_NAME --purpose encryption 
Sample Code Used:
gcloud kms keys create sree123-key --location global --keyring my-key --purpose encryption



Description:
Create a new CryptoKey "sree123-key" for the KeyRing "my-key"

Encrypting the Data:

Syntax:
gcloud kms encrypt \
    --location=global  \
    --keyring=KEYRING_NAME \
    --key=KEY_NAME \
    --plaintext-file=YOUR_FILEPATH_AND_FILENAME_TO_ENCRYPT \
    --ciphertext-file=YOUR_FILEPATH_AND_FILENAME.enc

Sample Code Used:
gcloud kms encrypt \
    --location=global  \
    --keyring=my-key \
    --key=sree123-key \
    --plaintext-file=/scripts/sample.txt \
    --ciphertext-file=/scripts/sample.txt.enc

Description:
To encrypt data, we have to provide the appropriate key information, specify the name of the plaintext file to encrypt, and specify the name of the file that will contain the encrypted content



Decrypting the Data:

Syntax:
gcloud kms decrypt \
    --location=global \
    --keyring=KEYRING_NAME \
    --key=KEY_NAME \
    --ciphertext-file=YOUR_FILEPATH_AND_FILENAME_TO_DECRYPT \
    --plaintext-file=YOUR_FILEPATH_AND_FILENAME_TO_DECRYPT.dec

Sample Code Used:
gcloud kms decrypt \
    --location=global \
    --keyring=my-key \
    --key=sree123-key \
    --ciphertext-file=/scripts/sample.txt.enc \
    --plaintext-file=/scripts/sample1.txt.dec

Description:
To decrypt data, we have to provide the appropriate key information, specify the name of the encrypted file (ciphertext file) to decrypt, and specify the name of the file that will contain the decrypted content.
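
The same operations can also be performed programmatically. Below is a hedged sketch using the google-cloud-kms Python client library (pip install google-cloud-kms); "my-project" is a placeholder project id, and the key ring and key names mirror the gcloud examples above.

# Hedged sketch using the google-cloud-kms client library (2.x-style API).
# "my-project" is a placeholder; "my-key"/"sree123-key" mirror the gcloud examples.
from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_name = client.crypto_key_path("my-project", "global", "my-key", "sree123-key")

# Encrypt: read the plaintext file and write the ciphertext file.
with open("/scripts/sample.txt", "rb") as f:
    plaintext = f.read()
encrypt_response = client.encrypt(request={"name": key_name, "plaintext": plaintext})
with open("/scripts/sample.txt.enc", "wb") as f:
    f.write(encrypt_response.ciphertext)

# Decrypt: read the ciphertext file and write the recovered plaintext.
with open("/scripts/sample.txt.enc", "rb") as f:
    ciphertext = f.read()
decrypt_response = client.decrypt(request={"name": key_name, "ciphertext": ciphertext})
with open("/scripts/sample1.txt.dec", "wb") as f:
    f.write(decrypt_response.plaintext)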



Approach-II:

Encryption and decryption of a file can also be done with OpenSSL (the Secure Sockets Layer toolkit) using key-based symmetric ciphers. Below are the commands to encrypt the file.



Encryption: [Commands]

openssl enc -in employee.txt \
-aes-256-cbc \
 -pass stdin > employee.txt.enc

The above command encrypts employee.txt to employee.txt.enc using 256-bit AES (Advanced Encryption Standard), a strong symmetric encryption algorithm. A secret key is used for both encryption and decryption of the data. The command waits for the user to enter a password on stdin and uses it to derive an appropriate key.



Copying the encrypted file to Cloud Storage using gsutil:

Command: gsutil cp E:\employee.txt.enc gs://emp2/



Copying the encrypted file from Cloud Storage to the local Cloud Shell:
Command: gsutil cp gs://emp2/employee.txt.enc /scripts

Decryption Commands:

openssl enc -in employee.txt.enc \
-d -aes-256-cbc \
 -pass stdin > employee.txt

Row level security without views


Giving different users access to different rows without creating separate views in BigQuery


To give different users access to different rows in your table, you don't need to create separate views for each one. The options below all make use of the CURRENT_USER() function in BigQuery, which returns the e-mail address of the currently running user.


For example, running SELECT CURRENT_USER(); we get back tigani@google.com.


The simplest option for displaying different rows to different users is to add a column to your table naming the user who is allowed to see each row.

For example, the schema {customer: string, id: integer}
would become
{customer: string, id: integer, allowed_viewer: string}.

Then each user would be able to see only the rows where CURRENT_USER() matches the value in the allowed_viewer column.

This approach has its own drawbacks, however: you can grant access to only a single user at a time. One option would be to make allowed_viewer a repeated field, which would let you provide a list of users that have access to each row.

However, this is still pretty restrictive and requires a lot of information to be stored about which users should have access to which row; chances are, what you'd really like to do is specify a group. So your schema would look like {customer: string, id: integer, allowed_group: string}, and anyone in the allowed_group would be able to see your rows.

You can make this work by having another table that holds the group mappings. That table would look like {group: string, user_name: string}, and the rows might look like:

{engineers, tigani@google.com}
{engineers, some_engineer@google.com}
{administrators, some_admin@google.com}
{sales, some_salesperson@google.com}
...

Let's call this table private.access_control. Then we can change our view definition:

SELECT c.customer, c.id
FROM private.customers c
INNER JOIN (
    SELECT group
    FROM private.access_control
    WHERE CURRENT_USER() = user_name) g
ON c.allowed_group = g.group

Note: you will want to make sure that there are no duplicates in private.access_control; otherwise, rows could be repeated in the results.

In this way, we can manage the groups in the private.access_control separately from the data table (private.customers).

There is still one piece missing that you might want: the ability for groups to contain other groups. You can get this by doing a more complex join to expand the groups in the access control table (you might want to consider doing this only once and saving the results, to avoid repeating the work each time the main table is queried).

Hadoop MapReduce Architecture

 Advent of Hadoop

To understand Hadoop better, let's start with an example. Say we have a site for selling mobile phones called MobileZone, and our technical system looks something like the setup below: an e-commerce system plus an inventory system.



Now suppose we need to find out how many iPhones we should order. We need data from both the e-commerce system and the inventory system, so we load that data into a traditional data warehouse, do the reporting using tools like Tableau, and decide based on the report. This system works well when the volume of data being generated is relatively small.
Now change the question: how many customers bought the iPhone, loved it, but hated the delivery? Those customers posted their reviews on Twitter and other sites, and the data generated from such sources does not fit into our traditional database system.

This data, which comes from different sources and in different formats, is called big data. We need something new to store and process this big data, and that is when Hadoop came into the picture.
Hadoop
In a crude way, think of Hadoop as a very big data warehouse that takes data from any source and in any format. It hosts a master and many nodes, and it gives us two services: storage and processing.
After Hadoop processes the data, the processed data can be loaded into analytics and reporting tools, so that we can predict or decide on future sales.

Hadoop is a framework for the distributed processing of large data sets across clusters of computers using simple programming models.
Architecture
Hadoop Processing:
Let us say we have a site and we need to create a dashboard that shows how many people liked or viewed the site. Our first task is to set up the cluster, so the Hadoop admin sets up a cluster with one master node, also called the NameNode, and four DataNodes (more on NameNodes and DataNodes below). Once the cluster is set up, the data is ingested into Hadoop.

For example, we have the file facebook.json (640 MB) which, when ingested into Hadoop, is broken into 128 MB blocks. Each 128 MB block is replicated 3 times for fault tolerance, which is what makes Hadoop so reliable. So in total we have 15 blocks.



To process facebook.json, the metadata about the blocks loaded into the DataNodes is stored in the NameNode, while the actual data is stored in the respective DataNodes.
As shown below, the blocks are distributed among the DataNodes accordingly.



The NameNode has a service called the JobTracker, and the DataNodes have a service called the TaskTracker. Once the data is loaded, the JobTracker reads the metadata stored in the NameNode and assigns the respective tasks to the respective TaskTrackers, which perform their jobs locally on the data they hold.


NameNode:

       The NameNode is the master node in an HDFS file system. It maintains the file system namespace and keeps track of the metadata for the blocks stored across the DataNodes; it stores only the metadata, not the data itself. Client applications talk to the NameNode whenever they need to locate a file, or when they want to add, copy, move, or delete a file, and in response the NameNode returns a list of the relevant DataNode servers where the data lives.

Secondary NameNode:

 The NameNode is a single point of failure for the HDFS cluster, because the metadata is stored only on the NameNode; this means HDFS is not, by itself, a high-availability system. When the NameNode goes down, the file system goes down. We can host an optional Secondary NameNode on a separate machine. It creates checkpoints of the namespace by merging the edits file into the fsimage file, and hence does not provide any real redundancy. Hadoop 0.21+ provides a BackupNameNode that the user can configure to improve availability.

DataNodes:

  An HDFS cluster can have many DataNodes. DataNodes store the blocks of data, and blocks from different files can be stored on the same DataNode. Each DataNode periodically marks its presence by sending a heartbeat message ("I am alive"), which helps the NameNode keep track of the DataNodes and maintain the metadata accordingly.

JobTracker Service:

  The JobTracker is the service within Hadoop that farms MapReduce tasks out to specific nodes in the cluster according to the client's job, ideally the nodes that hold the data, or at least nodes in the same rack.

TaskTracker:

  A TaskTracker is a service that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker. Every TaskTracker is configured with a set of slots, which indicates the number of tasks it can accept. To schedule a task, the JobTracker first looks for an empty slot on the server that hosts the DataNode containing the data; if none is available, it looks for an empty slot on a machine in the same rack. The TaskTracker starts a separate JVM process to do the actual work, so that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these processes, capturing the output and exit codes. When a process finishes, successfully or not, the TaskTracker notifies the JobTracker accordingly. TaskTrackers also send heartbeat messages to the JobTracker to confirm that they are still alive and active, so that the JobTracker can update its metadata about empty slots.

MapReduce Execution Process:

  •  The job or task is submitted to the JobTracker.
  •  The JobTracker connects to the NameNode to find the location of the data.
  •  The JobTracker locates TaskTracker nodes with available slots at or near the data.
  •  The JobTracker assigns the task to the identified TaskTracker nodes.
  •  The TaskTracker nodes are monitored through their heartbeat signals; if a TaskTracker appears to have failed, its task is assigned to a different TaskTracker.
  •  The JobTracker is notified when a task fails. It may then resubmit the job, skip that specific record from further processing, or blacklist the TaskTracker as unreliable.
  •  The TaskTracker updates the status once the task is completed.
  •  Client applications can ask the JobTracker for information on the tasks being processed.
  •  The JobTracker is a single point of failure for the MapReduce service; all jobs are halted if it goes down.
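
To make the Map and Reduce steps concrete, here is a hedged word-count sketch written for Hadoop Streaming; the script names mapper.py and reducer.py and the input data are illustrative assumptions, not part of the MobileZone example above.

# mapper.py -- the Map step: read raw lines from stdin and emit "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- the Reduce step: Hadoop Streaming delivers the mapper output sorted
# by key, so counts for each word can be summed in a single pass over stdin.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job like this is submitted with the Hadoop Streaming JAR, passing mapper.py and reducer.py as the -mapper and -reducer options together with -input and -output HDFS paths; the JobTracker then schedules the map and reduce tasks on TaskTrackers close to the data blocks, exactly as described above.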