Editorials

Data Hoarding Experiences

Based on personal experience consulting with many different companies, conversations with who knows how many IT professionals, and even responses from you, our readers, it appears that there is enough data hoarding going on to start a reality TV series.

Data hoarding, to me, could be defined as retaining data in its original transaction form, on active storage equivalent to that used for current transactional data, beyond the time when the detail of the record is meaningful for any immediate purpose.

I can think of a few reasons for hoarding data:

  • I may want to mine that data. I need to retain the detail data because I really don’t know yet what to mine (hence the big data movement).
  • I’m working on data warehousing, but really don’t have it all defined yet. ETL processes are in progress.
  • I don’t have time to purge data.
  • I don’t know how to partition data.
  • My system performs fine… why should I delete data I may need sometime?

Here are some thoughts from our readers…

Tom:
hey man, i want to know, too. wouldn’t an easy scale-up path and proper old-school OLTP and OLAP databases be the most cost-effective and practical solution? i feel that the noise needs to be filtered or i need to be aware of what is tried-and-true and what is hype …

Eric:
Another concern to consider when amassing large quantities of data is the e-discovery/retention implication from a legal perspective. These records are legally admissible evidence and can be a two-edged sword in a court case. As a government entity, we’re required to manage our retention to a pre-planned schedule. Deviation from this schedule leaves the organization open to more complicated open-records requests, potential subpoena discovery efforts and other records management related challenges.

I agree that the value to the business must come first, but the flip side of that coin must be considered before actions are taken to arbitrarily amass large quantities of data – especially sensitive, personally identifiable information. Making a plan early, with the guidance of your legal counsel as well as the business leaders is necessary to ensure that both the business needs are met and risks are managed. Once that plan has been made, it must be followed and acted upon as well. Ensuring that data is managed in accordance with the plan will help mitigate risks the organization took conscious efforts to try to avoid by making the plan in the first place.
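As an illustration of Eric's point about following a pre-planned schedule, here is a minimal Python sketch of a retention check. Everything in it is an assumption made for the example: the record categories, the retention periods, and the function name `is_past_retention` do not come from Eric's organization, and a real schedule would come from legal counsel and the business.

```python
from datetime import date

# Hypothetical retention schedule: record category -> years to keep.
# These categories and periods are illustrative only.
RETENTION_YEARS = {
    "invoice": 7,
    "audit_log": 3,
    "web_session": 1,
}


def is_past_retention(category, created, today):
    """Return True when a record has reached the end of its scheduled retention."""
    years = RETENTION_YEARS[category]
    try:
        cutoff = created.replace(year=created.year + years)
    except ValueError:
        # created was Feb 29 and the cutoff year is not a leap year
        cutoff = created.replace(year=created.year + years, day=28)
    return today >= cutoff


if __name__ == "__main__":
    # A 2010 invoice checked in mid-2018 is past its hypothetical 7-year period.
    print(is_past_retention("invoice", date(2010, 1, 15), date(2018, 6, 1)))
```

Keeping the schedule in one table means the purge or archive job and the written plan can be compared line by line, which is the kind of traceability Eric describes.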

Jim:
Probably the best two-word answer about why data is getting hoarded is “predictive analytics”.

I would seriously doubt that operational data is being hoarded as it would jeopardize critical performance. Rather it is operational data that is being migrated to data warehouses for data mining. With the advent of data warehouse “appliances” (e.g., Greenplum and Netezza), multi-terabyte and multi-petabyte scans don’t have to take hours to complete. From trending in retail, to weather forecasting, and even community-based policing, predictive analytics is driving businesses and agencies to retain more and more data.

David:
Yes, we hoard data… because disk is cheap and we can… and we’re amazed that we "need" over 1TB of disk.

My company is medium-sized… about 700 people nationwide. We run the company on a custom SQL Server app. And we "need" over 1TB of disk for this one app.

Our production database is about 70 GB, and it still holds every transaction since the company began in 1997. We purge some non-critical data, but everything meaningful is still there. We keep an eye for performance issues, but the hardware price/performance curve has always been about parallel with our data growth, so we’ve simply upgraded the box when needed, and we know we could easily throw more hardware at it.

We have some auxiliary log databases, mostly used to help troubleshoot issues and investigate changes to data. Those are about 100 GB a year, and we start each year with new empty versions. We keep the last few years’ versions around with one backup each, and we do run normal SQL backups to disk of the current databases.

So… it all adds up. 70GB production database with 2 days of backups and transaction logs on local disk, and we’re up to 250GB. 3 years of log databases with backups is 600GB. Throw in SQL Server and Windows install DVDs copied to disk just in case we need something, and yep, we’ve filled 1TB… mind-blowing to those of us with many years in the industry.
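David's totals can be checked with a little arithmetic. The Python sketch below uses only the figures he quotes. Two things are assumptions on my part: that each log database backup is about the same size as the database it copies, and that 1 TB is counted as 1000 GB.

```python
# Rough check of the storage figures quoted above (all sizes in GB).
PROD_WITH_BACKUPS_GB = 250   # production DB + 2 days of backups + transaction logs
LOG_DB_PER_YEAR_GB = 100     # auxiliary log databases, per year
LOG_YEARS_KEPT = 3           # years of log databases retained
BACKUPS_PER_LOG_DB = 1       # one backup kept for each yearly log database
ONE_TB_GB = 1000             # assumption: decimal terabyte


def log_archive_gb(per_year_gb, years, backups_per_db):
    """Each retained year costs the database plus its backups.

    Assumption: a backup is roughly as large as the database it copies.
    """
    return per_year_gb * years * (1 + backups_per_db)


def storage_estimate():
    log_total = log_archive_gb(LOG_DB_PER_YEAR_GB, LOG_YEARS_KEPT, BACKUPS_PER_LOG_DB)
    subtotal = PROD_WITH_BACKUPS_GB + log_total
    return {
        "log_archive_gb": log_total,
        "subtotal_gb": subtotal,
        "left_of_1tb_gb": ONE_TB_GB - subtotal,
    }


if __name__ == "__main__":
    for name, value in storage_estimate().items():
        print(f"{name}: {value}")
```

Under these assumptions the log databases come to the 600 GB David mentions, and production plus logs come to 850 GB. The remainder would be the install media and anything else on the disk, for which he gives no figure.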

Our mantra for years has been "disk is cheap". But we’re always wary that all those bytes can have an unexpected cost: when a disk array is configured to max hardware capacity and the drives fill up, or when a backup scheme can no longer complete during its scheduled time, or when copying files to a new machine takes many hours.

Does your company hoard data? Are there other reasons you choose to keep detailed transactions for a long time? What costs have you paid for hoarding? Drop me a note with your experience on these or other issues you would like to bring to the conversation. Maybe we need to get someone to come help us clean out our closets? Send your thoughts to btaylor@sswug.org.

Cheers,

Ben


Featured Article(s)
Total Database Information At Your Fingertips (Part I)
This article is good for novices who have recently started their careers in databases and have sometimes had to scratch their heads at silly results or code. Experienced database developers or administrators are probably familiar with this material, but it is still worth looking over as a memory refresher. The article covers information that is not needed frequently, but that becomes backbreaking to retrieve when you do need it if you don’t know which queries to execute.

Featured White Paper(s)
How to Use SQL Server’s Extended Events and Notifications to Proactively Resolve Performance Issues

Featured Script
dba3_sp_HelpReNameIndicies
Bonus Proc dbo.sp_HelpReNameIndiciesForEachTable Bonus Script – applies to all DBs on DBMS…