Editorials, Encryption/Data Security

Should Programs be separated from Data?

SSWUGtv
with Stephen Wynkoop
Can your MySQL database scale? If not, what are your options? Do you need to purchase a bigger machine? Do you need to purchase a sharding wrapper, or write your own data broker allowing your databases to shard? Instead, you now have the option to take advantage of the Cloud Database As a Service from Xeround. Watch this interview with Xeround.
Watch the Show

Should Programs be separated from Data?
I chose this topic today as a follow on to the question discussed last week about the value of the CLR in SQL Server; specifically, why CLR support is excluded from SQL Azure. A lot of factors contribute to the answer to this question. David writes in with a well thought out answer to this question.

The most efficient technique to process data on any scale is to keep the volume of data to be processed from growing. Google has proven it is more efficient to pass algorithms around to smaller data sets, and bring the results back together than it is to gather data into parallel processes pulling data from remote data stores.

SQL Azure, as currently implemented, has the ability to scale data through federations as a sort of sharding. Wow! What if it also supported the CLR in these federations? Perhaps you would see more Map/Reduce types of processes, no longer restricted to un-structured data.

David Writes:
Since you’re asking for opinion, here goes. 🙂

There are a two fundamental units of measure dealing with a unit of data: it’s size and it’s speed of retrieval — bytes and latency.

But what *it* is also has two definitions: the final product of the piece of data, and all of the inputs necessary to produce that final product.

Compression algorithms take the final piece of data, and extract only those necessary pieces of data required to reconstruct the original piece of data according to a set of rules, working under the assumption that the latency of downloading those smaller pieces and the latency in the CPU in reconstituting them is less than the latency for downloading the original form.

Databases work under a similar principle that it makes more sense to have intelligence where the data is stored so that only the pieces of the database necessary to construct a particular output are transferred. In this sense, the application reconstructs all possible states of the application output when required, rather than requiring the pre-calculation of all possible inputs and outputs (probably an infinite set, anyways) and the storage of all of these outputs.

So, traditionally, the required calculations have been placed in the application layer, while only the retrieval logic (WHERE clauses) resides in the database — and therefore the development of SQL as (originally, at least) a non-Turing-complete declarative language, so you don’t have to deal with the halting problem (and therefore accidentally providing the SQL server a command that can *never* finish).

But this is simply a set of trade-offs, and in modern computer architecture (especially when dealing with clusters/clouds of computers) there are multiple "stops" for the data, each with their own storage maximums and latencies — generally the "deeper" you go, the lower the latency and smaller the storage space (L1 cache, L2 cache, RAM, Disk, Remote Networked Storage). This also means the further away your data is, the slower your processing goes (assuming a naive algorithm that just asks for the next piece of data when it absolutely needs it and doesn’t cache [which is bringing the data closer]). That’s why SQL has a WHERE clause — give the database a simple definition of what’s not desirable in the result set so you don’t have to spend time filtering it out on your side, paying the higher latency penalty.

The reason why SQL is a declarative language rather than an imperative or functional language, beyond the halting problem, is because the majority of your data retrieval problems can be solved with the declarative filters and they can be optimized in the general sense for exactly the kind of storage system you’re working with (I didn’t bring up things like hard disks which have poor "random" access [getting the first byte] but high "sequential" access [reading the bytes that follow that first byte] that complicate algorithms and storage patterns that most programmers don’t want to think about).

But there is that small number of queries that don’t really fit the mold (and may not be able to, if a compact index cannot be defined), and as the application gets further and further abstracted from its database across networks, and that database gets subdivided into slower and slower channels (a cluster that has to coordinate with each other on retrieval adds more hierarchy and new chokepoints — done only because the underlying computing architecture is hitting a brick wall in raw-performance), the more these queries start to matter, and the more queries fall under this umbrella as chokepoints in the database architecture make certain kinds of joins or filters or sorts expensive as they have to cross over slower communication channels.

The basic idea behind Map-Reduce style database queries is this: you need to perform a calculation using a very large amount of data, but the specific calculations involving any one piece of data are small, and the results of these individual calculations can be rolled into the final result in a simple manner, so you can gain not only by reducing the amount of data transferred, but also reducing the calculation latency by using all of those database server CPUs in parallel.

This seems very restricted, but when you think about it, all an Index is is a pre-computed Map of which columns of data match a particular value, and your WHERE clause is like a Reduce, reading that Index and selecting only those that meet the range or equality, etc specified. Some of these filters can be easily applied in parallel, and belong at the base-server level.

Some of them cannot be easily indexed across a cloud of database servers, and they should be applied *after* the other queries. These can still be done by the database server easily within the declarative SQL syntax.

But other filters still cannot be indexed at all, and need Turing-complete calculation. These have traditionally resided within the application, but where they belong is governed simply by the following equation:

tDelta = tDbFullDatasetTransfer + tAppFiltering – tDbFiltering – tDbPartialDatasetTransfer

The time to transfer the full dataset and the time for the application to filter it minus the time for the database to filter it and transfer the reduced set of data. If that delta is positive, then you gain by executing Turing-complete code on the database-side, negative then you lose. It doesn’t matter if you’re talking about compressing/decompressing data, encrypting/decrypting, filtering out unwanted data, or sorting the data according to a custom equation, they’re all data manipulations that apply.

Being able to run .NET code lets you make that choice, not being able to prevents you from making that choice, so I’m of the opinion that not allowing such code on Azure is *a particularly bad idea for the very concept of SQL Azure*. If they’re worried about a user abusing shared resources, the Windows kernel should be able to ratelimit users based on a rolling average of CPU and I/O load (to allow bursts but not allow constant hogging of resources) since each instance *should* be run in a separate (possibly temporarily instantiated) user account so security vulnerabilities are less likely to let data leak.

To *not* allow this means that you have to transfer more data than necessary to/from your cloud, costing you more money and making the cloud-based offering less attractive.

Now, if you’ll excuse me, I’m going to ponder over a possible query syntax that can be turing complete and can also determine where in your "stack" the actual filtering code should be run so you don’t have to worry about it from either the application or database side..

Send an Email to btaylor@sswug.org if you have other thoughts on this issue.

Cheers,

Ben

$$SWYNK$$

Featured White Paper(s)
Encryption & Key Management for Microsoft SQL Server 2008
Written by Townsend Security

Simplify encryption and key management on … (read more)