Parallel Processing and Distributed Storage

James points out something I’d like to touch on regarding parallel processing and centralized data. For this discussion, let’s consider the idea that centralized data doesn’t necessarily mean a single copy of the data in a single location. Centralized data could mean a single authoritative view of the data, one that lets you interact with it seamlessly, as if it were all physically in one place.

This is really what Parallel Data Warehousing is in a nutshell. You submit your request to the Parallel Data Warehouse, and it determines where the request is best resolved. Sometimes the request is resolved in multiple phases, even on multiple machines, with the partial results brought back and stitched together into the final composite result.

In an OLTP relational database system, this process doesn’t work as efficiently out of the box. How do companies like Google, Amazon, eBay, et al. handle this kind of load? They don’t have a single data store. They have distributed data, and the request to process that data is distributed as well. Rather than bringing all the data into a central location for processing, the processing is sent to the location of the data, so both the data and the processing are distributed.

There is no reason an OLTP database couldn’t be partitioned in this same fashion. The questions then become what kind of partitioning is reasonable, how much data would be redundant across multiple data stores (the same question you have to solve in Parallel Data Warehouse configurations), and how to keep that data synchronized across all the storage devices involved.

The key to this kind of parallel data processing is to have an infrastructure that is capable of determining quickly where the processing is best serviced, routing that processing, and stitching together final results if the processing is distributed across more than one location.
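To make the route, process, and stitch steps concrete, here is a minimal sketch in Python. Everything in it is a made-up illustration rather than any product’s API: the node names, the region-based routing table, and the functions route, local_total, and total_sales are all assumptions, and the "data stores" are just in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "data stores": each node holds the rows for one region.
NODES = {
    "node_a": [{"region": "east", "amount": 120}, {"region": "east", "amount": 80}],
    "node_b": [{"region": "west", "amount": 200}],
    "node_c": [{"region": "north", "amount": 50}, {"region": "north", "amount": 25}],
}

# Hypothetical routing table: which node owns which region.
ROUTING = {"east": "node_a", "west": "node_b", "north": "node_c"}


def route(regions):
    """Decide which nodes have to take part in the request."""
    return sorted({ROUTING[r] for r in regions})


def local_total(node, regions):
    """Runs 'at the data': each node sums only the rows it owns."""
    return sum(row["amount"] for row in NODES[node] if row["region"] in regions)


def total_sales(regions):
    """Scatter the request to the owning nodes, then stitch the partial sums together."""
    nodes = route(regions)
    if not nodes:
        return 0
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: local_total(n, regions), nodes))
    return sum(partials)


print(total_sales(["east"]))           # touches a single node
print(total_sales(["east", "north"]))  # touches two nodes, results combined
```

A request for one region is serviced by one node, much like a record-focused OLTP statement. A request spanning regions fans out and only the small partial sums travel back, which is the range-focused case where this design tends to pay off.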

James:
Thank you for the article. In my job I have to wear several DBA hats: SQL Server, Unix DB2, and mainframe DB2. The bigger Unix boxes and mainframes have been expanding horizontally in their computing power for years. Sure, the chips get somewhat faster, allowing more vertical scaling, but as you pointed out, the real growth in capability has been the horizontal scaling.

As the capability of the hardware continues to scale horizontally, this growth can be exploited as long as the workload can scale accordingly. For the single OLTP T-SQL statement, it is more difficult since the focus in OLTP is usually on a single table partition. As you pointed out, Microsoft is finding a rich opportunity in DW/BI. The reason is that OLTP is record focused and BI is range focused. A good database design will enable the BI T-SQL to engage multiple table partitions easily.

But that is the “rub”. If the workload doesn’t scale along the lines of the database and the hardware, it is counter-productive. The amount of overhead in managing and consolidating the answer set (with integrity) overwhelms any potential elapsed time improvements.

Our default is to not enable parallelism. Only when we see the possibility and prove it in testing do we enable it.
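James’s point about overhead can be put in rough numbers. The following is only a toy cost model, not a measurement of any database engine: it assumes the work divides evenly across workers and that each additional worker adds a fixed coordination cost. The function names and the figures passed to them are hypothetical.

```python
def elapsed_time(work_seconds, workers, overhead_per_worker):
    """Toy model: work splits evenly; each extra worker adds a fixed coordination cost."""
    return work_seconds / workers + overhead_per_worker * (workers - 1)


def best_worker_count(work_seconds, overhead_per_worker, max_workers=16):
    """Return the worker count with the lowest modeled elapsed time."""
    return min(
        range(1, max_workers + 1),
        key=lambda n: elapsed_time(work_seconds, n, overhead_per_worker),
    )


# Assumed figures: a 60-second range query versus a 20-millisecond single-row lookup,
# each with half a second of coordination cost per extra worker.
print(best_worker_count(60, 0.5))    # the long range query benefits from many workers
print(best_worker_count(0.02, 0.5))  # the short lookup is fastest left serial
```

Under these assumptions the short statement never recovers the cost of coordinating the workers, which is consistent with leaving parallelism off by default and enabling it only where testing proves a gain.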

Elkiki:
This is good for ensuring unified data, but it has several disadvantages, such as:

Delays due to network congestion.
Low performance due to the large number of requests on the main server handling the data.
Remote terminals left dead when the main server or the network is down.

To keep data unified when it is distributed, we can:

Broadcast flags on data changes to the other data servers until the next synchronization interval.
Define intervals for data synchronization.
Set a timestamp before starting reports.
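One possible reading of Elkiki’s three suggestions, sketched in Python. This is an interpretation, not a description of any real replication feature: the Replica and Cluster classes, the logical clock, and the "newest write wins" rule used at synchronization time are all assumptions made for illustration.

```python
class Replica:
    def __init__(self):
        self.rows = {}      # key -> (value, changed_at)
        self.stale = set()  # keys flagged as changed elsewhere, not yet synchronized


class Cluster:
    def __init__(self, names):
        self.replicas = {name: Replica() for name in names}
        self.clock = 0  # logical timestamp, incremented on every write

    def write(self, origin, key, value):
        """Apply the change locally and broadcast only a flag to the other servers."""
        self.clock += 1
        self.replicas[origin].rows[key] = (value, self.clock)
        self.replicas[origin].stale.discard(key)
        for name, replica in self.replicas.items():
            if name != origin:
                replica.stale.add(key)

    def synchronize(self):
        """Run at each synchronization interval: newest write wins, flags are cleared."""
        newest = {}
        for replica in self.replicas.values():
            for key, (value, changed_at) in replica.rows.items():
                if key not in newest or changed_at > newest[key][1]:
                    newest[key] = (value, changed_at)
        for replica in self.replicas.values():
            replica.rows = dict(newest)
            replica.stale.clear()

    def report(self, name):
        """Take a timestamp first, then read only unflagged rows changed up to that time."""
        as_of = self.clock
        replica = self.replicas[name]
        rows = {
            key: value
            for key, (value, changed_at) in replica.rows.items()
            if changed_at <= as_of and key not in replica.stale
        }
        return as_of, rows, sorted(replica.stale)


cluster = Cluster(["a", "b"])
cluster.write("a", "x", 1)
print(cluster.report("b"))  # "x" is flagged on server b, so the report leaves it out
cluster.synchronize()
print(cluster.report("b"))  # after the interval, "x" is available on server b
```

The flag is cheap to broadcast and tells a server which rows it cannot trust yet, while the full data moves only at the interval. The price is that reports between intervals may be incomplete for flagged rows.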

I’ll have more reader responses next time…I ran out of room today. You can get into the discussion by sending your comments to btaylor@sswug.org.

Cheers,

Ben


Featured Article(s)
The State of Database Administration 2011: Types, Trends and Technologies (Part 2)
Database administration needs to be practiced in a more rigorous manner than it is today.
