Massively Parallel Processing (MPP)

May 28, 2019


Introduction:

      • Massively Parallel Processing (MPP) is a type of processing in which the data and the processing power are split across several nodes (servers), with one leader node and one or more compute nodes.
      • In MPP, the leader node communicates with the client and with the compute nodes. It receives a query from the client, assigns work to the compute nodes, consolidates their results, and passes the final result back to the client.
      • An MPP system scales horizontally: you add more compute nodes rather than upgrading to ever more expensive individual servers (scaling vertically). Adding nodes to a cluster spreads the data and processing across more machines, so a query that parallelizes well will typically complete sooner.
      • MPP is also known as a “shared nothing” system, because each node uses its own memory, operating system, and storage rather than sharing them with other nodes.
      • Amazon Redshift is a cloud data warehouse that uses this MPP architecture.

Amazon Redshift MPP Architecture:

Amazon Redshift uses the following components to achieve MPP.

Leader node:

If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. The leader node is responsible for communicating with clients; parsing, rewriting, and planning incoming queries; and compiling code to be sent to the compute nodes for execution.

Compute Nodes:

The compute nodes are responsible for storing your data, executing the code sent to them by the leader node, and returning their intermediate result sets to the leader node. Each compute node has its own CPU, memory, and attached disk storage.

Node slices:

A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node.
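
To make the node and slice idea concrete, here is a minimal Python sketch of how rows might be spread across slices by hashing a key column. It is an illustration only, not Redshift's actual distribution algorithm: the function names `slice_for_key` and `distribute`, the use of CRC32, and the node and slice counts are all assumptions made for this example.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def slice_for_key(key: str, num_nodes: int, slices_per_node: int) -> Tuple[int, int]:
    """Map a key to a (node, slice) pair using a stable hash (hypothetical scheme)."""
    total_slices = num_nodes * slices_per_node
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    global_slice = zlib.crc32(key.encode("utf-8")) % total_slices
    # divmod turns the global slice number into (node index, slice index within node).
    return divmod(global_slice, slices_per_node)


def distribute(rows: List[dict], key_field: str,
               num_nodes: int = 2, slices_per_node: int = 2) -> Dict[Tuple[int, int], List[dict]]:
    """Assign every row to exactly one slice, based on the value of key_field."""
    placement = defaultdict(list)
    for row in rows:
        target = slice_for_key(str(row[key_field]), num_nodes, slices_per_node)
        placement[target].append(row)
    return dict(placement)


if __name__ == "__main__":
    sample = [{"customer_id": i, "amount": i * 10} for i in range(8)]
    for (node, slc), assigned in sorted(distribute(sample, "customer_id").items()):
        print(f"node {node}, slice {slc}: {len(assigned)} rows")
```

Because the same key always hashes to the same slice, each slice can work on its own portion of the data without consulting the others, which is the property the shared-nothing design relies on.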

 

MPP workflow:

    1. The leader node breaks the large workload into smaller, manageable pieces.
    2. Each piece is handed off to an individual compute node, so each compute node works on a different part of the same large workload.
    3. The leader node communicates with each compute node and coordinates the work.
    4. Once all compute nodes have completed their assigned work, the leader node combines the separate results into one result set.
    5. The leader node then sends the final result set to the client.
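
The five steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Redshift is implemented: threads stand in for compute nodes, the query is a simple average, and the names `split_workload`, `compute_node`, and `leader_average` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_workload(rows, num_nodes):
    """Step 1: the leader breaks the workload into one piece per compute node."""
    return [rows[i::num_nodes] for i in range(num_nodes)]


def compute_node(piece):
    """Step 2: a compute node works only on its own piece and returns a partial result."""
    return {"sum": sum(piece), "count": len(piece)}


def leader_average(rows, num_nodes=4):
    """Steps 3-5: the leader coordinates the nodes, merges partials, and returns the answer."""
    pieces = split_workload(rows, num_nodes)
    # Threads are a stand-in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(compute_node, pieces))
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(leader_average(list(range(1, 101))))  # prints 50.5
```

Note that each node returns a partial sum and count rather than its own average. The leader can only combine results correctly when the partial results are chosen so that they merge cleanly.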

Disadvantages of MPP:

      • A common issue with MPP databases is structuring the data: they expect data with a defined schema and do not support unstructured data.
      • Even structured data, such as that from a MySQL or PostgreSQL database, will require some processing to make sure it fits the MPP structure.
      • This is because MPP databases are usually columnar, a layout that allows analytical queries to be processed faster.
      • However, AWS also offers Redshift Spectrum, which lets you query data stored in files on Amazon S3, including semi-structured formats, without first loading it into Redshift tables.
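
To illustrate why a columnar layout suits analytical queries, the following Python sketch stores the same small table row by row and column by column. The table contents and the function names `to_columns`, `total_amount_row_store`, and `total_amount_column_store` are invented for this example; real columnar engines also add compression and on-disk block layouts that are not modeled here.

```python
def to_columns(rows):
    """Convert a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_store(rows):
    """Row layout: every full row is visited even though only one field is needed."""
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns):
    """Column layout: only the 'amount' column is read; other columns are never touched."""
    return sum(columns["amount"])


if __name__ == "__main__":
    orders = [
        {"order_id": 1, "customer": "alice", "amount": 10.0},
        {"order_id": 2, "customer": "bob", "amount": 20.5},
        {"order_id": 3, "customer": "carol", "amount": 5.0},
    ]
    columns = to_columns(orders)
    print(total_amount_row_store(orders), total_amount_column_store(columns))
```

Both functions return the same total, but the columnar version reads a single contiguous list. The conversion step in `to_columns` is also a small picture of the processing that row-oriented data needs before it fits a columnar store.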