Centralized Data
Data Warehousing and Business Intelligence systems have become key assets for company success. More focus is being placed on centralizing data from different data sources, providing a single view of customer information. Less emphasis is being placed on designing new sources of data.
If you are not convinced of this shift in technology, simply look at job postings and compare the number calling for BI-related skills with the number seeking a traditional DBA. Demand is shifting toward ETL expertise across many different platforms and tools, and data mining and reporting skills are also sought after.
Key problems with centralized data include maintaining its purity, handling change arriving from multiple sources, and synchronizing that change out to other data stores. Techniques for maintaining data synchronization and purity are evolving, and many new products are being built around those requirements.
I am curious how many companies actually achieve the holy grail of useful centralized data. A great number of resources are required, in people, software, and hardware, to create and maintain clean centralized data. Inaccurate data provided quickly may, in certain scenarios, be more damaging than accurate data obtained more slowly.
What are the key challenges you are facing when implementing centralized data? Did you have the necessary skills when you started data warehousing? Are there specific tools or skills you have found useful? Are you warehousing data in stores other than relational databases?
Share your thoughts or experiences with us by sending an Email to btaylor@sswug.org.
Reader Comments – Making SQL Work Too Hard
Aaron:
Are you NUTS?
Using a computed column for Address allows you to have ONE version of the calculation.
Better yet... using VIEWS extensively offers the same advantage!!!
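For illustration, here is a minimal sketch of both suggestions in T-SQL; the table and column names are placeholders, not anything from Aaron's actual system:

    -- Computed column: the address calculation is defined once, in the table itself
    CREATE TABLE dbo.Customer
    (
        CustomerID int IDENTITY(1,1) PRIMARY KEY,
        Street     varchar(100) NOT NULL,
        City       varchar(50)  NOT NULL,
        State      char(2)      NOT NULL,
        Zip        varchar(10)  NOT NULL,
        -- Every query that selects Address sees the same calculation
        Address AS (Street + ', ' + City + ', ' + State + ' ' + Zip)
    );

    -- The same idea exposed through a view over an existing table
    CREATE VIEW dbo.vCustomerAddress
    AS
    SELECT CustomerID,
           Street + ', ' + City + ', ' + State + ' ' + Zip AS Address
    FROM dbo.Customer;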
Maurice:
Examples have to be chosen carefully on this matter.
For example, I believe SQL Server is much better suited to doing aggregation. If you reproduce the SUM() aggregate on the client side instead of letting the server do it, you pay the cost of transporting a lot of rows to the client. Networking issues are also a factor that limits SQL scale-up.
Putting more code in the database implies a lot of savings in network bandwidth, query optimization, etc. When no database code is used and the logic is performed mostly elsewhere, data has to travel back and forth between the server and the tier, and a lot more queries are necessary.
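As a small illustration of the difference (the table and column names here are assumptions):

    -- Server-side: one summary row per customer crosses the network
    SELECT CustomerID, SUM(OrderTotal) AS TotalSales
    FROM dbo.Orders
    GROUP BY CustomerID;

    -- Client-side equivalent: every detail row has to travel to the client,
    -- which then loops over the rows to compute the same sums
    -- SELECT CustomerID, OrderTotal FROM dbo.Orders;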
We have optimized a lot of code by putting it on the server side, with order-of-magnitude gains in performance.
For example, SQL Server allows UPDATEs to be built using a standard FROM clause and/or even a CTE. Such constructs can do pre-processing of the data (e.g. generating sequence numbers with ROW_NUMBER(), or using a CASE statement to reproduce some data transform). So in a single pass you can re-sequence numerous rows without a lot of traffic, include some logic in the data transformation, and do it in a single transaction with a single trigger activation (if there is one). A single transaction instead of numerous ones means a lot fewer writes to the log. There is no cursor, locks are held briefly, and the optimizer sees the whole picture of the operation. If many of the rows are going to be modified, it may use a better join strategy that costs less than too many accesses through indexes.
Suppose your update implies a join between table A and table B in order to update table B; if you update a lot of rows, sometimes a hash join is going to be better than a nested loop join with an index seek.
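For readers who want to see the shape of such statements, here is a rough sketch of a CTE-driven UPDATE with ROW_NUMBER() and a join-driven UPDATE using the FROM clause; all table and column names are hypothetical:

    -- Re-sequence line numbers per order in one pass: one statement, one transaction, no cursor
    ;WITH Resequenced AS
    (
        SELECT LineNumber,
               ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY CreatedDate) AS NewLineNumber
        FROM dbo.OrderLines
    )
    UPDATE Resequenced
    SET LineNumber = NewLineNumber
    WHERE LineNumber <> NewLineNumber;

    -- Join-driven UPDATE with a FROM clause: when many rows qualify,
    -- the optimizer may choose a hash join rather than repeated index seeks
    UPDATE b
    SET    b.Status = a.Status
    FROM   dbo.TableB AS b
    JOIN   dbo.TableA AS a
           ON a.ID = b.ID;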
Nima:
I think the main reason that leads us to put much of our computation in other layers, such as the business layer, is maintainability. We know that unit testing and evolving SQL commands the way it is done in the business layer is not possible (think about the difference between OOA&D and T-SQL), and on the other hand we all know that technologies such as RDBMSs are continually changing, and we want the opportunity to switch to other RDBMS products (changing the DAL implementation) without involving any other layers (loose coupling).
I do know that sometimes we should put some calculation or logic in the database in order to get better performance. Yes! We should not be extreme on either side. It is actually a trade-off between maintainability, testability, and other concerns like performance.
Cheers,
Ben
Featured Article(s)
Tips for using SQL Server 2008 Integration Services (Part 3)
In this article, you can find some helpful tips for performance tuning and optimizing SQL Server 2008 Integration Services.