Most Big Data projects fail. The reasons vary, but they usually come down to a weak business case, poor data quality and a lack of skills. Data is a key part of ‘Cloud Transformation’. ‘Transforming’ how data is used can lead to new products, services and insights, and to better operational management. Even though many data projects fail, data is a good area to look at if you want to either decrease costs or increase revenues. The problem is ‘how to do it?’
Big data refers to vast amounts of data that can be structured, semi-structured or unstructured. It is all about analytics, and the data is usually derived from different sources, such as user input, IoT sensors and sales data.
What is big data in the cloud?
Big data also refers to the act of processing enormous volumes of data to answer a query or to identify a trend or pattern. Data is analysed through a set of mathematical algorithms, which vary depending on what the data means, how many sources are involved and the business’s intent behind the analysis. Distributed computing software platforms, such as Apache Hadoop, Databricks and Cloudera, are used to split up and organize such complex analytics.
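The split-and-combine idea behind these platforms can be illustrated with a minimal sketch in plain Python. It is not how Hadoop, Databricks or Cloudera are actually implemented. The function names and the sample records are made up for illustration, and the chunks are processed one after another here, whereas a real platform would spread them across many machines.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_count):
    """Divide the records into chunk_count slices, one per hypothetical worker."""
    return [records[i::chunk_count] for i in range(chunk_count)]


def map_chunk(chunk):
    """Map step: count records per source within a single chunk."""
    return Counter(record["source"] for record in chunk)


def reduce_counts(partial_counts):
    """Reduce step: merge the per-chunk counts into one overall total."""
    return reduce(lambda left, right: left + right, partial_counts, Counter())


def count_by_source(records, chunk_count=4):
    chunks = split_into_chunks(records, chunk_count)
    # Sequential here; a distributed platform would run each map step on a separate node.
    partial_counts = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(partial_counts)


# Illustrative records only, using the kinds of sources mentioned in the text.
records = [
    {"source": "iot_sensor"}, {"source": "sales"}, {"source": "user_input"},
    {"source": "iot_sensor"}, {"source": "sales"}, {"source": "iot_sensor"},
]
totals = count_by_source(records, chunk_count=3)
print(totals)
```

The answer does not depend on how many chunks are used, which is what lets a platform add or remove workers without changing the analysis itself.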
The problem with big data is the size of the computing and networking infrastructure needed to build a big data facility.
Cloud computing provides computing resources and services on demand. A user can easily assemble the desired infrastructure of cloud-based compute instances and storage resources, connect cloud services, upload data sets and perform analyses in the cloud. Users can engage almost limitless resources across the public cloud, use those resources for as long as needed and then dismiss the environment — paying only for the resources and services that were actually used.
The public cloud has emerged as an ideal platform for big data. A cloud has the resources and services that a business can use on demand, and the business doesn’t have to build, own or maintain the infrastructure. Thus, the cloud makes big data technologies accessible and affordable to almost any size of enterprise. The cloud brings a variety of important benefits to businesses of all sizes. Some of the most immediate and substantial benefits of big data in the cloud include the following.
A typical business data centre faces limits in physical space, power, cooling and the budget to purchase and deploy the sheer volume of hardware it needs to build a big data infrastructure. By comparison, a public cloud manages hundreds of thousands of servers spread across a fleet of global data centres. The infrastructure and software services are already there, and users can assemble the infrastructure for a big data project of almost any size.
One project may need 100 servers, and another project might demand 2,000 servers. With the cloud, users can employ as many resources as needed to accomplish a task and then release those resources when the task is complete.
A business data centre is an enormous capital expense. Beyond hardware, businesses must also pay for facilities, power, ongoing maintenance and more. The cloud folds all those costs into a flexible rental model in which resources and services are available on demand and billed per use.
Many clouds provide a global footprint, which enables resources and services to deploy in most major global regions. This enables data and processing activity to take place close to where the data for the big data task is located. For example, if the bulk of the data is stored in a certain region of a cloud provider, it’s relatively simple to implement the resources and services for a big data project in that specific cloud region, rather than bearing the cost of moving that data to another region.
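A rough way to see what pay-per-use means is to price a job as servers multiplied by hours multiplied by an hourly rate. The sketch below uses the 100-server and 2,000-server projects mentioned earlier. The hourly rate and the eight-hour run time are assumptions chosen for illustration, not real provider prices.

```python
def on_demand_cost(servers, hours, hourly_rate):
    """Cost of renting `servers` machines for `hours` at a flat per-server hourly rate."""
    return servers * hours * hourly_rate


HOURLY_RATE = 0.50  # hypothetical price per server-hour
JOB_HOURS = 8       # hypothetical run time of each project

small_project = on_demand_cost(100, JOB_HOURS, HOURLY_RATE)
large_project = on_demand_cost(2000, JOB_HOURS, HOURLY_RATE)

print(small_project)  # 400.0
print(large_project)  # 8000.0
```

Under these assumptions the larger project costs 20 times more for the hours it runs, but nothing once the resources are released. A data centre sized for 2,000 servers would carry its cost whether or not a job was running.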
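The region choice can be sketched as a simple comparison: run the project where most of the data already sits, and move only the remainder. The region names and data volumes below are hypothetical, and real decisions would also weigh transfer pricing, latency and regulation.

```python
def pick_compute_region(data_gb_by_region):
    """Return the region holding the most data and the GB that would still need moving."""
    best_region = max(data_gb_by_region, key=data_gb_by_region.get)
    gb_to_move = sum(
        gb for region, gb in data_gb_by_region.items() if region != best_region
    )
    return best_region, gb_to_move


# Hypothetical distribution of a data set across three regions.
data_gb_by_region = {"region-a": 50_000, "region-b": 1_200, "region-c": 300}

region, gb_to_move = pick_compute_region(data_gb_by_region)
print(region, gb_to_move)  # region-a 1500
```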
Data is the real value of big data projects, and the main benefit of cloud resilience is reliable data storage. Clouds replicate data as a matter of standard practice to maintain high availability in storage resources, and even more durable storage options are available in the cloud.
Public clouds and many third-party big data services have proven their value in big data use cases. Despite the benefits, businesses must also consider some of the potential pitfalls.
Some major disadvantages of big data in the cloud can include the following.
Cloud use depends on complete network connectivity from the LAN, across the internet, to the cloud provider’s network. Outages along that network path can result in increased latency at best or complete cloud inaccessibility at worst. While an outage might not impact a big data project in the same ways that it would affect a mission-critical workload, the effect of outages should still be considered in any big data use of the cloud.
Data storage in the cloud can present a substantial long-term cost for big data projects. The three principal issues are data storage, data migration and data retention. It takes time to load large amounts of data into the cloud, and then those storage instances incur a monthly fee. If the data is moved again, there may be additional fees. Also, big data sets are often time-sensitive, meaning that some data may have no value to a big data analysis even hours into the future. Retaining unnecessary data costs money, so businesses must employ comprehensive data retention and deletion policies to manage cloud storage costs around big data.
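A retention policy of this kind can be expressed as a simple rule: anything older than a set window is a candidate for deletion. The sketch below is a minimal illustration, not a provider API. The data set names, sizes, the 90-day window and the price per GB are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def split_by_retention(datasets, retention, now):
    """Separate data sets into those to keep and those older than the retention window."""
    keep, delete = [], []
    for dataset in datasets:
        age = now - dataset["loaded_at"]
        (delete if age > retention else keep).append(dataset)
    return keep, delete


def monthly_storage_cost(datasets, price_per_gb_month):
    """Monthly fee for storing the given data sets at a flat per-GB price."""
    return sum(dataset["size_gb"] for dataset in datasets) * price_per_gb_month


now = datetime(2024, 6, 1, tzinfo=timezone.utc)  # fixed date so the example is repeatable
datasets = [  # hypothetical data sets
    {"name": "clickstream-recent", "size_gb": 800, "loaded_at": now - timedelta(days=10)},
    {"name": "clickstream-old", "size_gb": 1200, "loaded_at": now - timedelta(days=140)},
    {"name": "sensor-old", "size_gb": 500, "loaded_at": now - timedelta(days=100)},
]

PRICE_PER_GB_MONTH = 0.02  # hypothetical storage price
keep, delete = split_by_retention(datasets, retention=timedelta(days=90), now=now)
monthly_saving = monthly_storage_cost(delete, PRICE_PER_GB_MONTH)

print([dataset["name"] for dataset in delete])
print(round(monthly_saving, 2))
```

In practice most cloud storage services can apply rules like this automatically, so the policy is usually configured once rather than run by hand.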
Big data projects can involve proprietary or personally identifiable data that is subject to data protection and other industry- or government-driven regulations. Cloud users must take the steps needed to maintain security in cloud storage and computing through adequate authentication and authorization, encryption for data at rest and in flight, and copious logging of how they access and use data.
Lack of standardization
There is no single way to architect, implement or operate a big data deployment in the cloud. This can lead to poor performance and expose the business to possible security risks. Business users should document big data architecture along with any policies and procedures related to its use. That documentation can become a foundation for optimizations and improvements for the future.
Big Data Challenges
Organizations typically have four different cloud models to choose from: public, private, hybrid and multi-cloud. It’s important to understand the nature and trade-offs of each model.
Private clouds give businesses control over their cloud environment, often to accommodate specific regulatory, security or availability requirements. However, a private cloud is more costly because the business must own and operate the entire infrastructure. Thus, a private cloud might only be used for sensitive, small-scale big data projects.
The combination of on-demand resources and scalability makes public cloud ideal for almost any size of big data deployment. However, public cloud users must manage the cloud resources and services they use. In a shared responsibility model, the public cloud provider handles the security of the cloud, while users must configure and manage security in the cloud.
A hybrid cloud is useful when sharing specific resources. For example, a hybrid cloud might enable big data storage in the local private cloud — effectively keeping data sets local and secure — and use the public cloud for compute resources and big data analytical services. However, hybrid clouds can be more complex to build and manage, and users must deal with all of the issues and concerns of both public and private clouds.
With multiple clouds, users can maintain availability and take advantage of cost differences between providers. However, resources and services are rarely identical between clouds, so multiple clouds are more complex to manage. This cloud model also carries more risk of security oversights and compliance breaches than single public cloud use. Considering the scope of big data projects, the added complexity of multi-cloud deployments can add unnecessary challenges to the effort.
Data ‘transformation’ is an important part, perhaps the central part, of transforming business and IT processes. Big Data management is best done in the Cloud. Choose your platform, become familiar with it, optimise it and make sure that the challenges are properly addressed in your business case. Track the ROI and benefits. Learn from mistakes. Train your staff on the target platform. Standardise your tooling. Use agile teams to build the Big Data platform. Be patient. Be realistic.