SSWUGtv
With Stephen Wynkoop
In this episode Paul Zikopoulus with IBM – talking about big data, what it means, why it is important and much more.
Watch the Show
$$SWYNK$$
SQL Azure Federation Sharding
I received a few questions about SQL Azure Federation today. My understanding is strictly through reading documentation materials from Microsoft…so be aware that I may be incorrect in some aspects. I have researched quite a bit to be as accurate as I know how without having actually used the service myself.
First of all, you need to understand that there are no cross database query capabilities in SQL Azure. This is because the SQL Azure Cloud completely handles all database allocation amongst many instances of SQL Server. I am not sure how many are on a single machine; but, there are definitely many machines, and therefore many instances.
You, as a consumer of SQL Azure database services do not have access to know what instance your particular database(s) are mounted. The SQL Azure system handles the routing and connection through resolving your database connection string and credentials.
So, what do you do when your database grows too big for the maximum size of database in SQL Azure? One solution is to use SQL Azure Federation. This feature is ONLY supported in the SQL Server Azure release of SQL Server at this time.
Federation allows you to create one database as the Federation database. Once the database is created, you can add many different instances of the database to that single Federation. When you create your tables there are some additional TSQL clauses that determine what kind of table is created and how many instances of that table are created.
A sharded table will be created on each sharded database based on a sharding key, and only the data meeting the criteria of the key for each individual shard will be included.
A non-sharded table will be created on each sharded database and all data will be included in all shards (replicated data). This data is often in the form of lookup tables or dimension tables for a data warehouse.
Why do you need the second type of table? Because each shard is completely autonomous. It cannot communicate with any other database except in returning results to the master Federation database.
In essence, the federation database acts as a data broker, keeping track of where each separate instance of your database is located. It also performs the role of distributing your queries and consolidating (union) of the results returned from each shard.
What is unclear to me at this point is if you can actually have tables on the Master Federation database. Of course those tables would not be replicated to each shard, and would only be joined to the final results returned from each shard. I would think you would want to keep those tables to a minimum, especially if there is potential filtering criteria in those tables. You may be returning more data from the sharded database queries than you would if the tables were replicated across each shard.
Paulo Writes:
Can SQL Azure Federations allow me to do the data warehousing implementation on the cloud?
Editor:
I’m not sure. It surely supports a Snowflake or Star Schema. However, you will have to run Analysis Services on a server outside the Azure platform. It is not supported in the current release as well as SSIS, and SSRS.
David Writes:
I read through the article on SQL Azure Federations, and the FEDERATED ON (id=some_column) looks very much like sharding techniques in NoSQL databases, but it also implies that you can only use your database in essentially the exact same way as a document database like MongoDB.
What do I mean by that? When you query a single table, you can query on any column that you have indexed on that table, and it will distribute it amongst all of the shards and get you back the results very quick just like you mentioned, if you don’t specify an ORDER BY that will force it to cache the incoming results until the query is completed by all underlying shards.
But what if you INNER JOIN that table to another federated table? If the ON clause in the join is using the second federated table’s federation key, then it will "know" which database has the "correct" table that could contain the requested data and query only that one, like a cross-DB join, nothing more expensive than that. But if you INNER JOIN on a column that isn’t the federation key for that table, then for *each* row of data found in the first table, the database must *wait* for all of the federated databases to return a record or nothing, because if ALL return nothing, then the INNER JOIN can’t return that record, so it can’t start transmitting record results until the entire INNER JOIN chain has been completed. Once at least one record is returned, again assuming there’s no ORDER BY clause, the pause in data transmission ends and it can continue returning data, but if nothing matches it has to wait for all databases to complete their searches.
That’s also wasteful of computing resources and will slow your entire federation down if the possible result set is either one or zero on that inner join, since *every* DB in the federation has to run the query, while an INNER JOIN on the federation id column would only run on the database that *could* have the data.
So why is this like MongoDB? Because a Document database, for sharding performance, eliminates arbitrary JOINs with implicit ones, the "document structure". You have a document and indexes on that document, while sharding occurs on the document’s id. The queries are fast because there’s only one structure to deal with, basically like always JOINing on the federation keys, so if you query on a sub-object in your document, it’s index can tell you which DBs have matches, and then those DBs can run their internal queries for the data.
At that point, when you’re basically giving up the composability of your data structures in SQL, why not switch to the data representation that better fits the application you’re clearly fitting your database to? (And JSON’s tree-and-array structure much more closely fits the kinds of work software is going to use the data for.) That way, you have simpler logic in your application code: convert my data structure to JSON to store in a simple 1:1 fashion, then convert back to my data structure on retrieval. And the conversion process is non-existent if you use MongoDB with Node.js, since they both speak JSON.
I may be missing something here, but the SQL Azure Federation restricts you in much the same way Document Databases restrict you, but since it still allows arbitrary SQL, someone not familiar with what these restrictions mean can completely shoot themselves in the foot on performance if they assume the near-linear increase in query performance on a single table will also apply to more complex JOINs. (And they’ll be quite confused about it, too, since sometimes it *will* work, with the federation key being the JOIN restrictor, while other times they’ll get much worse performance, and those latter problem queries can "gum up" the entire federation.)
Editor:
Thanks for the critical thoughts, David. I tried to address your question in my editorial above. Again, please don’t take my understanding of the documentation to be authoritative.
At this point I’m going to see if I can get access to evaluate this capability in more detail and write about it later. If you are already using SQL Azure Federation, why not write to btaylor@sswug.org with your experience. I would especially appreciate any feedback if I have miss-understood or miss-represented anything about Federation in my study of the service. I definitely would want to post corrections for our readers.
Cheers,
Ben
Featured Article(s)
Configuring FILESTREAM Storage in SQL Server
We will take a simple step by step example of FILESTREAM storage configuration, including how to enable FILESTREAM storage in an Instance SQL Server.
Featured White Paper(s)
SharePoint 2010 Enables the Enterprise
Written by KnowledgeLake
read more)