In a recent dialogue with CIO&Leader, Piyush Agarwal, SE Leader at Cloudera, delved into the challenges and optimal approaches for businesses in managing data amidst the changing hybrid cloud landscape.
The increasing adoption of multi-cloud environments, digital ecosystems, and AI-driven tools has transformed how businesses handle data storage, processing, and management. A robust data management strategy and effective data utilization are essential today. Navigating the complexities of a solid data strategy requires investing in cutting-edge analytics tools, such as modern data platforms, to derive value from data regardless of where it resides.
Piyush Agarwal, SE Leader at Cloudera, an American software company, recently spoke with Jatinder Singh, Executive Editor of CIO&Leader, about the challenges businesses face in managing data in the evolving hybrid cloud landscape. In the conversation, he shared insights on the best practices for various data platform solutions and discussed the journey toward modernization.
Cloudera offers an enterprise data management and analytics platform designed to work seamlessly with all major public cloud providers, including Azure, AWS, and GCP. It provides tailored solutions for customer analytics, IoT, security, risk, and compliance, giving businesses the foundation to treat data as an asset that can be monetized to create new revenue streams, drive innovation, and deliver cost savings.
Here are the excerpts from the interview.
Piyush Agarwal, SE Leader at Cloudera
CIO&Leader: With the global surge in data, how can businesses adapt and evolve their data management techniques for the next decade?
Piyush Agarwal: In today’s digital economy, enterprises are witnessing more significant volumes and varieties of data from an ever-increasing number of sources. This surge in data complexity poses substantial challenges for organizations trying to access, utilize, and derive value from their data, as traditional data infrastructures often lack the processing capabilities needed to support them effectively.
For most enterprises, the primary focus is generating real-time insights for business-critical decisions. Maintaining agility in a rapidly evolving digital landscape demands that businesses continually refine their cloud strategies and embrace robust data strategies. Investing in advanced analytics tools, such as modern data platforms, enables companies to unlock value from their data, regardless of where it resides.
The investment in a comprehensive end-to-end data management and analytics platform capable of efficiently ingesting, processing, analyzing, and modeling data has become critical to extracting meaningful insights from the vast amounts of data enterprises encounter. Advanced analytics features, including artificial intelligence and machine learning (AI/ML), and a built-in security and governance layer are integral to achieving this goal. This layer must be consistently applicable across all of a company’s cloud and on-premises environments, ensuring the business can identify and manage sensitive data for compliance purposes without impeding operational workflows.
CIO&Leader: How can modern data platforms cater to diverse industry requirements, especially in sectors that have historically been less data-driven?
Piyush Agarwal: Data is a strategic asset and will remain digital gold for businesses. Making sense of every bit of data allows companies to derive greater value, from becoming more attuned to their customers’ needs to providing improved service and managing their supply chains better. The potential is even greater for businesses in historically less data-driven industries like retail, energy, or automotive.
To fully leverage data, organizations need the ability to quickly move data across environments for analysis and modeling in real time. With these capabilities, businesses can easily manage vast, intricate datasets to support rapidly evolving analytic needs.
For instance, we are working with Thai oil and retail conglomerate PTT Oil and Retail Business Public Company Limited (PTTOR) to power the organization’s data architecture. This allows them to design omnichannel experiences for online and in-person shopping at their gas stations and restaurants. We have also worked with German retailer NEW YORKER to harness its inventory data, improving retail operations and order monitoring at its stores for seamless in-store experiences.
CIO&Leader: Can you provide a real-world example of a business that underwent significant transformation by harnessing the power of modern data platforms?
Piyush Agarwal: In India, we have helped YES Bank, one of the largest private sector banks, architect its data strategy to simplify banking for its customers. YES Bank required a solution framework that could offer speed, agility, flexibility, and the storage capacity needed to process unstructured data and conduct real-time analytics, all while enhancing data security. Central to this endeavor was implementing a platform capable of meeting the rigorous governance standards and strict data security regulations prevalent in the financial services industry.
In collaboration with YES Bank, we developed a ‘unified on-premise data management platform,’ enabling the Bank to craft bespoke, innovative solutions with faster turnaround times for product launches and updates. Integrating Cloudera Shared Data Experiences (SDX) technologies within the platform alleviated compliance concerns and ensured secure data access for numerous users. It also helped the Bank internalize a previously outsourced customer loyalty program, resulting in a quarter of a million dollars in annual savings.
We are also helping a leading Indian fintech platform, PhonePe, achieve greater operational efficiencies and higher scalability, agility, flexibility, and reliability through our unified data solution, the Cloudera Data Platform (CDP).
CIO&Leader: What are the driving factors behind the shift from traditional data lakes to Data Lakehouse architecture? Why is this architecture pivotal for modern businesses?
Piyush Agarwal: Traditional data lakes act as a unified data repository, storing all collected data from various sources in unprocessed form, including structured, unstructured, and semi-structured data. They often complement a business’s data warehouse, enabling companies to enjoy the analytical flexibility of data lakes and the simplicity and speed of SQL queries in data warehouses. Although this architecture offers the economic advantage of inexpensive, scalable storage within the data lake, it also presents challenges such as data duplication, limited analytical service support, and inflexibility. With data confined to the data warehouse, running analytics incurs high transformation and integration costs, leading to unacceptable delays.
The Data Lakehouse model merges the best attributes of both worlds by supplanting the largely independent data lake and data warehouse architectures. It offers a more structured approach that addresses the limitations of each while preserving their respective benefits. By merging the storage and processing capabilities of data lakes with the relational querying functionality of data warehouses, data lakehouses eliminate data duplication and support a comprehensive ecosystem of analytical engines and advanced analytics, including machine learning. This marks a significant step towards a more streamlined, efficient, and agile data management strategy.
However, companies’ analytic needs evolve rapidly. Modern data lakehouses cater to business intelligence, AI/ML, and data engineering within a single platform. Achieving this effectively necessitates a willingness to explore different analytic engines or even vendors.
CIO&Leader: What are the main challenges arising in a hybrid cloud environment, and how can they be addressed effectively?
Piyush Agarwal: Hybrid cloud environments enable businesses to be flexible, agile, and compliant, and to scale according to their needs. However, this often results in data trapped in silos across different infrastructures, presenting significant challenges in accessing or managing it. Rapid global digital transformation has also increased interconnectedness, which means more entry points that businesses must actively safeguard.
Unlocking deeper value from data requires that organizations re-architect their data platforms for portability and realize synergies across environments.
CIO&Leader: How can we ensure that AI algorithms remain transparent and understandable when dealing with vast and intricate data sets?
Piyush Agarwal: AI is only as good as the data it has been trained on. The business’s ability to trust its AI begins with trusting its data. Ensuring data quality, accuracy, and consistency by documenting data sources, handling missing values, and addressing data biases goes a long way. It is imperative to choose relevant and interpretable features and avoid overly complex or irrelevant ones. Additionally, businesses should publish transparency reports that provide insights into the model’s behavior, strengths, and limitations. This can build trust with users and stakeholders.
Strong data governance measures can further ensure the trustworthiness of data and mitigate these risks. Good data governance clearly defines who is responsible for which data, ensuring data is in the right hands and properly handled as it is accessed, modified, archived, and deleted. Data security, governance, lineage, management, and automation should be applied holistically across the data lifecycle and all environments, and consistently across an organization’s private and public clouds. Beyond the IT department, all employees should be aware of the data they create daily, how to store and access it securely, and how to maintain that security when working with outside partners, customers, and contractors.
CIO&Leader: With the increasing blend of vast data sources, how can organizations ensure AI’s ethical and responsible use?
Piyush Agarwal: AI systems inherently pose several dangers: biases ingrained in their foundations (as introduced by humans) and the risks stemming from the current lack of regulation around AI innovation. To develop ethical AI systems that can deal with bias, causality, correlation, uncertainty, and human oversight, it is imperative to maintain strong controls over data management and governance, as well as the ability to reproduce outcomes reliably.
In leveraging data and analytics for informed decision-making, an organization can quickly run into data governance issues, even if the number of data sources used is minimal. Gaps will exist in audit logs if data access policies and lineage are inconsistent across an organization’s private and public clouds. More worryingly, poor data quality and lineage may mean a data user bases decisions on incomplete or out-of-date information. Without knowing how data is derived, data users lack trust in their data assets, impairing their ability to make informed, business-critical decisions. Continued use of such data assets hinders the business’s ability to innovate and differentiate, and may damage the customer experience and brand reputation.
Also concerning are AI hallucinations – users of AI today often encounter instances where generative models produce inaccurate or irrelevant results that appear confident and well-phrased enough to be easily believable.
Trusted AI results from strong controls over your data across its entire lifetime, from the edge to AI. Businesses must carefully control who or what has access to the data and, significantly, how the data is transformed or changed over time.
CIO&Leader: How might the integration of LLMs reshape the data analytics landscape? What challenges and opportunities could arise?
Piyush Agarwal: Enterprises view embracing generative AI as a strategic imperative for staying relevant in competitive markets. Central to this success is the enhanced capacity of generative AI and Large Language Models (LLMs) to significantly reduce the time and effort required to prepare unstructured data for analysis and model training. Approximately 80% of organizational data is unstructured, underscoring the invaluable role of technologies like OpenAI’s GPT-3 LLM in accelerating AI capabilities within the enterprise.
The key to the success of any LLM or generative AI model lies in understanding data and enterprise context. With widespread interest in generative AI and reduced barriers to entry, more organizations are expected to invest in bolstering their data architectures for seamless integration with existing IT systems and comprehensive control over the data lifecycle. Whether deploying generative or traditional AI solutions, data quality and enterprise context are paramount. Thus, building trust in AI among enterprises becomes essential to realize synergies across data mesh, fabric, and lakehouse environments, facilitating advanced use cases such as generative AI and LLM deployment.
Organizations must develop responsible and reliable generative AI solutions for the enterprise, securely and responsibly leveraging enterprise knowledge bases or proprietary data sources to mitigate privacy risks and contextual limitations.