The Cloud and Big Data
How do you manage big data in the public Cloud? Using a public cloud with your big data raises two problems: 1) Import/Export and 2) Mining.
The cost of cloud services varies by vendor. Vendors may charge for space used, processing resources, and external network utilization. With big data, all three factor into your cost. For now, cost is generally not the main issue, since vendors are discounting their services to get you to move to their platform.
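As a rough illustration of how those three charges combine, here is a minimal Python sketch. The rate values are made-up placeholders, not any vendor's actual pricing, and the names `Rates` and `monthly_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical unit prices; substitute your vendor's real numbers."""
    storage_per_gb_month: float   # charge for space used
    compute_per_hour: float       # charge for processing resources
    egress_per_gb: float          # charge for external network utilization


def monthly_cost(rates: Rates, stored_gb: float, compute_hours: float,
                 egress_gb: float) -> dict:
    """Break a monthly bill into its three components plus a total."""
    bill = {
        "storage": rates.storage_per_gb_month * stored_gb,
        "compute": rates.compute_per_hour * compute_hours,
        "network": rates.egress_per_gb * egress_gb,
    }
    bill["total"] = sum(bill.values())
    return bill


if __name__ == "__main__":
    # Placeholder rates and workload, chosen only to make the arithmetic visible.
    example = Rates(storage_per_gb_month=0.02, compute_per_hour=0.50,
                    egress_per_gb=0.10)
    print(monthly_cost(example, stored_gb=50_000, compute_hours=400,
                       egress_gb=5_000))
```

Keeping the three components separate makes it easier to see which one dominates as your data volume grows.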
The bigger problem may be getting data into the cloud in the first place. That’s the whole point of Big Data: to have big data. Using a remote cloud restricts, to some extent, how quickly you can get your data into it.
So, unless the data you mine is fairly static, using a public cloud may become less feasible.
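To put a number on that, here is a back-of-the-envelope Python sketch. It assumes a fixed link speed and a guessed efficiency factor for protocol overhead. The function names and the 0.8 default are illustrative assumptions, not measurements.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_tb: float, link_mbps: float,
                  efficiency: float = 0.8) -> float:
    """Estimate days needed to move data_tb terabytes over a link_mbps link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Mbps = 1e6 bits per second).
    efficiency is an assumed fraction of the link you actually get.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / SECONDS_PER_DAY


def keeps_up(daily_growth_tb: float, link_mbps: float,
             efficiency: float = 0.8) -> bool:
    """True if one day's worth of new data can be uploaded in under a day."""
    return transfer_days(daily_growth_tb, link_mbps, efficiency) < 1.0


if __name__ == "__main__":
    # Hypothetical scenario: 10 TB initial load, 100 Mbps uplink.
    print(f"Initial load: {transfer_days(10, 100):.1f} days")
    print("Keeps up with 0.5 TB/day:", keeps_up(0.5, 100))
    print("Keeps up with 2 TB/day:", keeps_up(2, 100))
```

If a day's worth of new data takes longer than a day to upload, the cloud copy never catches up, which is why fairly static data is the easier fit.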
Azure is working on a new service that lets you ship your data to a hosting site, where their staff will mount your drive and import or export your data. They are looking to do this at terabyte scale using 3.5” drives. That could resolve some of the data loading issues.
Once you resolve the problem of getting your data into and out of the Cloud, you also need to get your computing capacity into the cloud. I have seen at least one project that could not move to the Cloud simply because some of the data generation and querying could not be moved inside it, and the network cost and reduced performance were hurdles that could not be resolved.
Cloud platforms that provide computation resources can resolve the network performance issue by placing the computing near or alongside the data storage. At that point it pretty much comes back to cost, and whether you want to be a shared tenant on a computation platform or isolate yourself.
Are you dedicating Big Data projects to the Cloud? If not, what are some of the roadblocks you are running into? Share your experience below or drop an email to btaylor@sswug.org.
Cheers,
Ben
