Tuesday, May 27, 2025

Training

Data Engines IT Training Institute - Transform Your Career

Transform Your Career with Data Engines IT Training Institute!

At Data Engines, we specialize in delivering high-quality Training & Consultancy Services tailored to modern enterprise needs. With a focus on the courses below, our mission is to bridge the gap between development and operations through hands-on, real-world instruction and expert guidance.

Master in-demand tech skills with our hands-on training programs. Get industry-ready with real-world projects and expert guidance.

Explore Courses

Our Specialized Courses

DevOps & AWS

Master cloud infrastructure, automation, and deployment strategies with hands-on AWS experience.

Data Sciences

Learn statistical analysis, machine learning, and data visualization to extract insights from complex datasets.

Data Engineering

Build robust data pipelines and architectures for large-scale data processing and analytics.

Power BI

Create interactive dashboards and business intelligence solutions for data-driven decision making.

Ab Initio

Master enterprise-grade ETL processes and data integration using the Ab Initio platform.

Python

Learn versatile programming with Python for web development, automation, and data analysis.

Why Choose Data Engines?

100% Hands-On Training

Learn by doing with practical exercises and real-world scenarios.

Online Training

Flexible learning from anywhere with live interactive sessions.

Real-Time Projects

Work on industry-relevant projects and use cases.

Career-Oriented Curriculum

Curriculum designed to meet current industry demands.

Certification Guidance

Prepare for industry-recognized certifications.

Interview Preparation

Comprehensive interview preparation and placement support.

Recorded Sessions

Access to recorded sessions for revision and practice.

🚀 Hurry! New Batches Starting Soon!

⚡ Limited Time Offer

Enroll now & Get 10% off!

Offer valid only for a limited period

Contact Us Now

Get In Touch

Call or WhatsApp

© 2025 Data Engines IT Training Institute. All rights reserved.

Empowering careers through quality training and expert guidance.

Friday, March 1, 2024

BigQuery Optimization Techniques

Quick tips for performance tuning and best practices in BigQuery

Simple in nature (reduce the data being processed):

· Projections: select only the required columns. Use the EXCEPT keyword with SELECT * to drop the columns you do not need (a sketch follows this list).

· Selections: select only the required rows with a WHERE clause. Use a LIMIT clause for ad-hoc data analysis.

· Filter using the partition pseudo column (_PARTITIONDATE).
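A minimal sketch of these reductions using the google-cloud-bigquery Python client; the project, dataset, table, and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# Project only the columns you need (EXCEPT drops the rest), keep only the rows
# you need with WHERE, prune partitions with the _PARTITIONDATE pseudo column,
# and use LIMIT for ad-hoc inspection of the results.
sql = """
SELECT * EXCEPT (raw_payload, debug_info)
FROM `myproject.mydataset.events`
WHERE _PARTITIONDATE = DATE '2024-02-29'
  AND event_type = 'purchase'
LIMIT 1000
"""

for row in client.query(sql).result():
    print(row)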

Medium in nature:

· Use partitioned tables (a sketch follows this list).

· Reduce data before a join.

· WHERE clause: operations on BOOL, INT, FLOAT, and DATE columns are typically faster than operations on STRING or BYTE columns.
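A brief sketch of the first two points, again with hypothetical names: a date-partitioned table is created once, and the partitioned side of a join is filtered down before the join so less data is shuffled.

from google.cloud import bigquery

client = bigquery.Client()

# Create a date-partitioned copy of a table so later queries that filter on the
# partitioning column scan only the partitions they need.
client.query("""
CREATE TABLE IF NOT EXISTS `myproject.mydataset.orders_part`
PARTITION BY DATE(order_ts) AS
SELECT * FROM `myproject.mydataset.orders_raw`
""").result()

# Reduce data before the join: filter the partitioned table to the date range of
# interest first, then join the smaller result to the dimension table.
client.query("""
WITH recent_orders AS (
  SELECT order_id, customer_id, amount
  FROM `myproject.mydataset.orders_part`
  WHERE DATE(order_ts) >= DATE '2024-02-01'
)
SELECT c.customer_id, SUM(o.amount) AS total_amount
FROM recent_orders AS o
JOIN `myproject.mydataset.customers` AS c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id
""").result()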

Complex in nature:

· Reuse repeatedly used transformations (a sketch follows this list).

· Avoid referencing the same CTE (common table expression) multiple times, as it can be re-evaluated on each reference.

· Avoid repeated joins and subqueries.
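One way to realize the first two points, shown as an illustrative sketch with hypothetical names: materialize the shared transformation once in a temporary table inside a multi-statement query, then let several aggregations reuse it instead of re-declaring it as a CTE in each query.

from google.cloud import bigquery

client = bigquery.Client()

client.query("""
-- Materialize the repeatedly used transformation once.
CREATE TEMP TABLE cleaned_events AS
SELECT user_id, event_type, DATE(event_ts) AS event_date
FROM `myproject.mydataset.events`
WHERE event_type IS NOT NULL;

-- Both aggregations reuse the materialized result.
SELECT event_date, COUNT(*) AS events_per_day
FROM cleaned_events
GROUP BY event_date;

SELECT user_id, COUNT(*) AS events_per_user
FROM cleaned_events
GROUP BY user_id;
""").result()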

Heavy in nature:

· Split complex queries into smaller ones.

· Materialize large intermediate datasets and reuse them where needed.

· Optimize your join patterns: place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size (a sketch follows this list).
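A hypothetical join-ordering sketch following that guideline (table and column names are illustrative):

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT f.order_id, d.date_label, c.customer_name
FROM `myproject.mydataset.fact_orders`  AS f   -- largest table first
JOIN `myproject.mydataset.dim_date`     AS d   -- smallest table second
  ON f.order_date = d.date_key
JOIN `myproject.mydataset.dim_customer` AS c   -- remaining tables by decreasing size
  ON f.customer_id = c.customer_id
"""
client.query(sql).result()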

Minimize data skew in the two situations below using BigQuery features:

· Skew at the partition level: partition skew occurs when data is not evenly distributed across partitions, so some partitions end up much larger than others. The imbalance forces certain slots to process more data than others, which is inefficient.

· Skew at the join level: data skew can also happen when you use JOIN clauses. When BigQuery organizes the data for joining, it may put too many rows with the same join key into one group, which can overwhelm the slot that processes it (a common mitigation is sketched below).
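One common mitigation for join-level skew, shown as an illustrative sketch with hypothetical names (not the only approach): pre-aggregate the skewed side so each join key appears only once before the shuffle, and drop NULL keys that would otherwise all land in a single group.

from google.cloud import bigquery

client = bigquery.Client()

client.query("""
WITH clicks_per_user AS (
  SELECT user_id, COUNT(*) AS clicks
  FROM `myproject.mydataset.click_events`
  WHERE user_id IS NOT NULL   -- drop the hot NULL key
  GROUP BY user_id            -- one row per join key before the join
)
SELECT u.user_id, u.signup_date, c.clicks
FROM `myproject.mydataset.users` AS u
JOIN clicks_per_user AS c
  ON u.user_id = c.user_id
""").result()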


Thursday, June 24, 2021

Python code to deal with a REST API by posting JSON data

Here we are taking the Cordial API as an example.

Making an API call by posting JSON data using Python:

To use the Cordial API, you need their API key, and the external IP of the client making the calls must be whitelisted.

Cordial also provides a convenient feature that allows data exports to be saved directly to a GCS bucket.


The code is as follows:


# Imports assumed by this snippet: requests for the HTTP call, HTTPBasicAuth for
# authentication, and BaseHook (Airflow 2.x path) to read the Airflow connection.
import requests
from requests.auth import HTTPBasicAuth
from airflow.hooks.base import BaseHook

# execution_date, ts_name, gcs_path and gcs_bucket are provided by the
# surrounding Airflow DAG/task context.

# Parameters required by the Cordial API call
url = 'https://api.cordial.io/v2/contactactivityexport'
compress = False
all_properties = True

# Start datetime parameter
start_time = execution_date.subtract(hours=1).strftime('%Y-%m-%dT%H:%M:%S.000Z')
print("start_time api: " + start_time)  # Format: 2020-04-15T17:00:00.000Z

# End datetime parameter
end_time = execution_date.subtract(seconds=1).strftime('%Y-%m-%dT%H:%M:%S.000Z')
print("end_time api: " + end_time)  # Format: 2020-04-15T17:00:00.000Z

# Get the API key from Airflow connections
api_key = BaseHook.get_connection('cordial_api_key').password

# Send a POST request to the Cordial API
call_cordial_api = requests.post(
    url,
    json={
        "name": ts_name,
        "exportType": "json",
        "destination": {
            "type": "gcs",
            "path": gcs_path,
            "gcs_bucket": gcs_bucket
        },
        "selected_timeframe_start": start_time,
        "selected_timeframe_end": end_time,
        "selected_action_names": ['open', 'click', 'optout', 'message-sent',
                                  'message-stopped', 'bounce', 'complaint', 'custom'],
        "showAllProperties": all_properties,
        "compress": compress,
        "confirmEmail": "gcpnotifications@company.com"
    },
    auth=HTTPBasicAuth(api_key, ''),
    headers={'Accept': 'application/json'}
)
print(call_cordial_api)

# Check the status code of the response received. Success - 200/201. Error - e.g. 401.
if call_cordial_api.status_code not in (200, 201):
    raise Exception("Cordial API call failed with status code %s"
                    % call_cordial_api.status_code)

print(call_cordial_api.json())


 

Sample Output:

{"_id":"5f9754f","cID":"5d9cf4","ts":"2020-10-26T23:00:03+0000","mcID":"946:5f93fe:1","baseAggregate":"ot","UID":"6af0896","chnl":"email","chnl-type":"email","dur":0,"first":true,"tzo":-6,"rl":"8","email":"aecoghan@icloud.com","message_name":"SockSaleEndsToday","message_sent":"2020-10-25T11:00:00.0Z","message_tags":["Promotional"],"action":"open","time":"2020-10-26T23:00:00+0000","bmID":"946:5f93204ace295161aa12f50c:ot"}



Contact Activities:

The Contact Activities collection contains all contact-related activities such as opens, clicks, sends, and any custom actions.

Friday, June 4, 2021

Python code to deal with a REST API by uploading images

Here we are taking the MulticolorEngine API as an example.

Making an API call by uploading images using Python:

The image file is sent to the MulticolorEngine API (only one may be specified at a time), and the dominant colors are returned in the response. For each color, a similarity rank, a weight factor, and a color class are returned along with the RGB values.

We can make the API call either by uploading the image files or by sending the image URLs. 

i) Python code using the image files:

import requests
from requests.auth import HTTPBasicAuth

api_url = 'https://multicolorengine.tineye.com/sandbox/rest/extract_image_colors/'
user = 'username'
password = 'password'
auth = HTTPBasicAuth(user, password)

# Open the image file in binary mode and send it as the images[0] parameter
files = {
    'images[0]': open('image1.jpg', 'rb')
}

response = requests.post(api_url, auth=auth, files=files)
json_data = response.json()
print(json_data)


images[n]: The image file from which to extract colors. Either this or urls is required, but only one may be specified at a time. This parameter can be included multiple times to specify multiple values, with n starting at 0 and increasing for each additional value.

Image limitations:

Image size: For optimal performance, uploaded images (those given by images[n] parameters) should be about 600px in the smallest dimension. For example, an image of 1200x800 pixels is larger than required and will take longer to transfer to your MulticolorEngine server; it would be faster to resize it to 900x600 before sending. Smaller images may also work and need not be scaled up.

Image format: Accepted formats are JPEG, PNG, WebP, GIF, BMP and TIFF files. Animated images are not supported.


ii) To send the URL of an image file instead of the image file itself:

import requests
from requests.auth import HTTPBasicAuth

api_url = 'https://multicolorengine.tineye.com/sandbox/rest/extract_image_colors/'
user = 'username'
password = 'password'
auth = HTTPBasicAuth(user, password)

# Send a publicly accessible image URL as urls[0] and limit the extraction to 2 colors
data = {'limit': 2, 'urls[0]': 'https://content.website.com/images/image1.jpg'}

response = requests.post(api_url, auth=auth, data=data)
json_data = response.json()
print(json_data)


urls[n]: The publicly-accessible URL where the API can find the image from which to extract colors. Either this or images is required, but only one may be specified at a time. This parameter can be included multiple times to specify multiple values, with n starting at 0 and increasing for each additional value.

limit: The maximum number of colors to extract; it defaults to 32. For example, to extract only two colors from an image, set limit to 2.


Sample output:

{
    "status": "ok",
    "error": [],
    "method": "extract_image_colors",
    "result": [
        {
            "color": [194, 66, 28],
            "rank": 1,
            "class": "orange-dark,red",
            "weight": 76.37
        },
        {
            "color": [141, 125, 83],
            "rank": 2,
            "class": "brown-light",
            "weight": 23.63
        }
    ]
}


Response codes: a status code of 200 or 201 indicates success; 401 indicates an authentication failure or error.


Saturday, May 8, 2021

Mapreduce Example

Hadoop MapReduce:


Map phase: This is the phase where the mappers accept their tasks and process the portion of the computation assigned to each node.
The result is a set of key-value pairs. This intermediate output is stored on local disk.

Sort and shuffle:
The key-value pairs from each mapper are collected, the values are grouped by key, and the result is stored on local disk. After sorting and shuffling by key, the grouped values are sent to the reducers.

Reduce: The output of the sort and shuffle phase is reduced and stored in HDFS. This is the final output.

Key-value pair: This is the output of the mapper, which is passed on for sorting and merging.

Combiner: Often called a mini reducer, it is generally used for aggregations over a data set (for example, finding the highest salary in an employee table).
It computes the partial result, such as the highest value, from each mapper's output in the Map stage.

Hive cannot convert nested subqueries into joins


Sample Text :

<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>

Map :

<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

Combiner :

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

Partitioner: The partitioner moves the intermediate output to the appropriate reducers based on the key values from the mapper stage.


Number of partitioners = number of reducers



Reducer :

<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
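To make the map, sort/shuffle, and reduce phases concrete, below is a small word-count sketch in the Hadoop Streaming style (a sketch only; file names and the streaming invocation depend on your setup). The mapper emits <word, 1> pairs, Hadoop sorts them by key, and the reducer sums the counts per word.

#!/usr/bin/env python3
import sys

def mapper():
    # Map phase: emit an intermediate <word, 1> pair for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key (done by the sort/shuffle phase),
    # so we can sum the counts per word and emit the final <word, count> pair.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # Select the role from the command line: "wordcount.py map" or "wordcount.py reduce".
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()

In a real cluster the mapper and reducer would typically be two separate scripts passed to the Hadoop Streaming jar; for word count, the combiner can simply reuse the reducer logic.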

Hadoop YARN Architecture

YARN (Yet Another Resource Negotiator):


YARN was introduced to split the functionalities of resource management and job scheduling/monitoring into separate processes. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application can be a single job or a DAG of jobs, so existing MapReduce jobs run unchanged on top of YARN with just a recompile.

Resource Manager: There are two components in the ResourceManager:

1. Scheduler

2. ApplicationsManager

Applications Manager: The ApplicationsManager accepts submitted jobs and negotiates the first container for executing the application-specific ApplicationMaster; on failure, it provides the service for restarting the ApplicationMaster container. The per-application ApplicationMaster then negotiates the appropriate resource containers from the Scheduler, tracks their status, and monitors their progress.

Scheduler:
      The Scheduler is mainly responsible for allocating resources to the various running applications, subject to the familiar constraints of capacities, queues, etc. The Scheduler does not perform any monitoring or tracking of application status, and it offers no guarantees about restarting failed tasks, whether the failure is due to the application or to hardware. The Scheduler schedules based on the resource requirements of the applications; scheduling is done against the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk, and network.

Node Manager:
      The NodeManager is the per-node slave, and there are many per cluster. Upon starting, the NodeManager periodically sends a heartbeat signal to the ResourceManager. The NodeManager offers resources to the cluster for the execution of programs; its resource capacity is the amount of memory and the number of vcores. At run time, the Resource Scheduler decides how to use this capacity. A container is a fraction of the NodeManager's capacity and is used by the client for running a program.

Container:
      A container is an allocated resource in the cluster. A set of system resources, such as CPU cores and RAM, is allocated to each container. The ResourceManager is the sole authority that allocates containers to applications.

Application Master:
      The ApplicationMaster is responsible for the execution of a single application. The Resource Scheduler (ResourceManager) provides the required containers on which the specific programs (e.g., the main of a Java class) are executed. The ApplicationMaster knows the application logic and is therefore framework-specific; the MapReduce framework provides its own implementation of an ApplicationMaster.

In YARN, there are three actors:
o          The Job Submitter (the client)
o          The Resource Manager (the master)
o          The Node Manager (the slave)

YARN execution process:

The application startup process is the following:
o          The client program submits the MapReduce application to the ResourceManager, along with the information required to launch the application-specific ApplicationMaster.
o          The ResourceManager negotiates a container for the ApplicationMaster and launches the ApplicationMaster.
o          The ApplicationMaster boots and registers itself with the ResourceManager, thereby allowing the original calling client to converse directly with the ApplicationMaster.
o          The ApplicationMaster negotiates resources (resource containers) for the client application.
o          The ApplicationMaster provides the container launch specification to the NodeManager, which launches a container for the application.
o          During execution, the client polls the ApplicationMaster for application status and progress.
o          On completion, the ApplicationMaster deregisters with the ResourceManager and shuts down, returning its containers to the resource pool.
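As a small illustration of how a client can talk to the ResourceManager, the sketch below polls the ResourceManager's REST API (the /ws/v1/cluster/apps endpoint on the RM web port) for application status; the host name here is hypothetical.

import requests

# Hypothetical ResourceManager address; in a real cluster this is the RM web UI host:port.
RM_URL = "http://resourcemanager.example.com:8088"

def list_applications(state="RUNNING"):
    """List YARN applications in the given state via the ResourceManager REST API."""
    resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps", params={"states": state})
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app", [])
    for app in apps:
        # Each entry includes the application id, name, state, and progress.
        print(app["id"], app["name"], app["state"], app.get("progress"))

if __name__ == "__main__":
    list_applications()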