The first, critical step to AI monetization

Intelligent, automated AI data pipelines for unstructured data are the linchpin of AI success.

Prateek Kansal,
Head of Engineering, India Operations,
Komprise

As AI developments continue to heat up, CXOs are expecting more. They don’t want to be left behind in the race to leverage AI for competitive and operational value. A survey conducted by the MIT Center for Information Systems Research found that top-performing organizations (in terms of profitability, revenue growth, customer experience, and other factors) attributed 11% of their revenues to data monetization, compared with just 2% reported by bottom-performing organizations. AI is the key to data monetization moving forward.

Unstructured data is the primary fuel for AI given that it makes up at least 80% of all data in the world. Creating the right unstructured data foundation for AI is a critical first step—regardless of which AI tools and services you develop or procure. In this article, we will discuss the rising importance of AI data pipelines as a fundamental component of that data foundation.

What problems do AI data pipelines solve?

AI data pipelines are the process and technology for curating data from multiple sources, preparing the data for ingestion, mobilizing it to the destination, and persisting any transformation results in a global metadata database. By automating these data workflows, the system optimizes for performance and governance requirements. Data pipelines for unstructured data have special considerations, since unstructured data is large, diverse, and difficult to search, organize, and move.
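To make this concrete, the sketch below outlines those four stages in simplified Python. The function bodies, the in-memory catalog, and the tag names are illustrative assumptions rather than any specific product's API; a real pipeline would plug in a global file index and storage connectors at these points.

```python
# Minimal sketch of the four stages: curate -> prepare -> mobilize -> persist.
# The catalog, tags, and destination are hypothetical placeholders.

def curate(catalog: dict[str, dict], required_tags: dict) -> list[str]:
    """Select files whose metadata tags match the query, across all silos."""
    return [path for path, meta in catalog.items()
            if all(meta.get(k) == v for k, v in required_tags.items())]

def prepare(catalog: dict[str, dict], paths: list[str]) -> list[str]:
    """Exclude anything flagged as sensitive before it can reach the AI tool."""
    return [p for p in paths if not catalog[p].get("contains_pii")]

def mobilize(paths: list[str], destination: str) -> None:
    """Copy or move only the curated subset to the AI tool's ingest location."""
    print(f"transferring {len(paths)} files to {destination}")

def persist(catalog: dict[str, dict], paths: list[str], run_id: str) -> None:
    """Record which files fed which AI run in the global metadata store."""
    for p in paths:
        catalog[p]["last_ai_run"] = run_id

catalog = {
    "/nas/specs/widget_v3.pdf": {"doc_type": "product_spec"},
    "/nas/hr/salaries.xlsx": {"doc_type": "spreadsheet", "contains_pii": True},
}
selected = prepare(catalog, curate(catalog, {"doc_type": "product_spec"}))
mobilize(selected, "s3://ai-staging/chatbot")
persist(catalog, selected, run_id="2024-06-run-01")
```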

Data pipelines for unstructured data differ from traditional extract, transform, and load (ETL) processes, which focus on data transformation. Instead, they rely on global metadata indexing, search, curation, and intelligent data mobilization. Unstructured data transformation (sometimes called vectorization) is done locally in the AI process.

What makes AI data pipelines so powerful and relevant is their ability to continually update the generative AI application with new data, such as ensuring a customer service chatbot always has the latest product information.

There are several challenges that enterprises face in preparing and moving unstructured data to AI tools:

  • Unstructured data, which is data that does not fit neatly in a database or a table, is large and unwieldy. Enterprises are storing multiple petabytes of such multimodal data, including audio, video, documents, application data, and logs – sometimes reaching 50 PB or more in data-heavy sectors like healthcare. Unstructured data frequently lives across multiple silos, from on-premises to the edge to the cloud.
  • Unstructured data is difficult to search and leverage because of the lack of unifying structure and context. 
  • Data governance issues, including privacy and security concerns with AI, are hampering industry progress. IT leaders need fail-proof methods to filter out protected data, such as personally identifiable information (PII), and proprietary data and exclude it from AI ingestion.
  • If an AI program produces generative output that violates privacy or copyright, or that presents inaccurate or dangerous information, business leaders need ways to investigate and remediate the issue.
  • AI is expensive. The more corporate data that you run AI on, the more you will pay. If you run AI on the same data repeatedly, you could be incurring redundant costs. 

A recipe for AI data pipelines that maximizes unstructured data value

There’s already a lot of buzz about AI data pipelines, but it’s important to understand what’s needed to ensure that they will be efficient, adaptive, and flexible as your AI plans evolve. 

Granular search and classification of unstructured data: An oft-cited statistic holds that 80% of a data scientist’s time is spent finding the right data sets for the project at hand. File and object data is hugely varied in size, type, and structure: consider PDF files, different medical image files, social media data, audio and video files, images from different cell phones, instrument data, sensor data, and so on.

Metadata tagging and enrichment – which can be augmented using AI tools – is a feature of unstructured data management that adds context and structure to unstructured data so that employees can easily search and find what they need across billions of files. This also requires a global file index, essentially a metadata database, that can look across all storage silos, whether they are on-premises or in the cloud.
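As a simplified illustration of what such a global file index can look like, the sketch below keeps one searchable metadata record per file regardless of which silo holds the bytes. The field names, silo labels, and tag values are assumptions for the example, not a particular vendor's schema.

```python
# Illustrative global file index: one metadata record per file, searchable by tags.
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    path: str            # full path or object key
    silo: str            # e.g. "on-prem-nas", "s3-archive"
    size_bytes: int
    tags: dict[str, str] = field(default_factory=dict)  # enrichment, possibly AI-assisted

index: list[FileRecord] = [
    FileRecord("/nas/radiology/scan_001.dcm", "on-prem-nas", 52_428_800,
               {"modality": "MRI", "project": "oncology"}),
    FileRecord("s3://archive/contracts/acme.pdf", "s3-archive", 1_048_576,
               {"doc_type": "contract", "contains_pii": "true"}),
]

def search(index: list[FileRecord], **tag_query: str) -> list[FileRecord]:
    """Find files across all silos whose tags match every key/value in the query."""
    return [r for r in index
            if all(r.tags.get(k) == v for k, v in tag_query.items())]

mri_files = search(index, modality="MRI", project="oncology")
```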

Address data quality issues such as inconsistencies and inaccuracies: For unstructured data, data quality largely means assessing the accuracy and reliability of the metadata used to find the right data. For instance, is metadata gleaned through a given process reliable and interpreted correctly in context? Was the metadata generated by AI, or has it been verified by a human? Is the metadata stale or up to date? These are issues that a systematic metadata management system should be able to address.
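A systematic approach can encode those questions as checks that run over the metadata itself. The sketch below is a minimal illustration; the field names (source, verified_by, updated_at) and the 180-day freshness window are assumptions, not a standard schema.

```python
# Simplified metadata quality checks over a single metadata record.
from datetime import datetime, timedelta, timezone

def metadata_quality_issues(record: dict, max_age_days: int = 180) -> list[str]:
    issues = []
    # Was the tag generated by AI, and if so, has a human verified it?
    if record.get("source") == "ai" and not record.get("verified_by"):
        issues.append("AI-generated tag has no human verification")
    # Is the metadata stale relative to a chosen freshness window?
    updated = record.get("updated_at")
    if updated is None:
        issues.append("no update timestamp recorded")
    elif datetime.now(timezone.utc) - updated > timedelta(days=max_age_days):
        issues.append("metadata older than freshness window")
    return issues

record = {"tag": "contains_pii", "source": "ai", "verified_by": None,
          "updated_at": datetime(2023, 1, 15, tzinfo=timezone.utc)}
print(metadata_quality_issues(record))
# For this record, both the unverified AI tag and the stale timestamp are flagged.
```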

Minimize the time and cost of moving large data sets to AI: It’s prohibitively expensive and time-consuming to move petabytes of data to a central data lake and then process all of it for the small subset you’ll actually need for an AI program. Each time you run a new job, you must process it all again. Further, many AI solutions have their own proprietary methods for retrieval-augmented generation (RAG) and vectorization, so you shouldn’t need to vectorize data beforehand. Therefore, curating the precise data sets you need, using metadata enrichment for classification, and sending only that data to AI is a more cost-efficient way to feed an AI pipeline.
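One way to implement that curation step, sketched below under the assumption that the global index records which AI destinations each file has already been sent to, is to select only matching files that have not yet gone to a given destination, so repeat runs don't pay to move and process the same data twice. The field names ("tags", "sent_to") are illustrative.

```python
# Incremental curation: pick only matching files not yet sent to this destination.

def incremental_batch(index: list[dict], destination: str, **tag_query: str) -> list[dict]:
    """Return files that match the curation query and have not yet been
    sent to this destination, avoiding redundant transfer and processing cost."""
    batch = []
    for record in index:
        if not all(record["tags"].get(k) == v for k, v in tag_query.items()):
            continue                      # outside the curated data set
        if destination in record.get("sent_to", set()):
            continue                      # already ingested by this AI process
        batch.append(record)
    return batch

index = [
    {"path": "/nas/specs/widget_v2.pdf", "tags": {"doc_type": "product_spec"},
     "sent_to": {"chatbot-rag"}},
    {"path": "/nas/specs/widget_v3.pdf", "tags": {"doc_type": "product_spec"},
     "sent_to": set()},
]
new_files = incremental_batch(index, "chatbot-rag", doc_type="product_spec")
# Only widget_v3.pdf is selected; v2 was already sent on a previous run.
```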

Address risks of sensitive data: Most organizations have PII, IP, and other sensitive data inadvertently stored in places where it should not live. Storage teams need better ways to find and isolate this data than hunting and pecking. For one, they don’t always have access to the organization’s security tools. For two, enterprise security suites may not be able to scale their detection capabilities across petabytes of data in multiple storage systems. AI data pipelines require tools that can crack open files and detect sensitive data, with rules to move it into secure storage where it cannot be discovered or ingested into an AI tool. It’s also important to track which corporate data was fed to which AI process so there is an audit trail.
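A rule-based screen in front of ingestion can cover both halves of this: blocking obvious sensitive content and recording an audit trail of what was approved or quarantined. The sketch below is deliberately simplified; the two regex patterns catch only clearly formatted US Social Security and card numbers, and the paths and audit-log format are illustrative assumptions.

```python
# Simplified sensitive-data screen with an audit trail of ingestion decisions.
import re
from datetime import datetime, timezone

SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def screen_file(path: str, text: str, audit_log: list[dict]) -> bool:
    """Return True if the file is safe to send to the AI tool; otherwise flag it
    for quarantine. Either way, record the decision for later audit."""
    hits = [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)]
    audit_log.append({
        "path": path,
        "time": datetime.now(timezone.utc).isoformat(),
        "action": "quarantine" if hits else "approved_for_ingestion",
        "matched_rules": hits,
    })
    return not hits

audit_log: list[dict] = []
ok = screen_file("/nas/hr/offer_letter.txt", "SSN: 123-45-6789", audit_log)
# ok is False; the audit trail records which rule matched and when.
```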

How AI data pipelines will evolve

AI data pipelines will need to evolve to address the unique requirements of unstructured data, such as faster and more intelligent data indexing, better data pre-processing, and built-in data governance. Continual innovation is needed to buttress data mobility and data workflow technologies so they can handle the scale and performance requirements of moving and processing large data sets. IT leaders should endeavor to create an auditable process and workflow that maximizes the value of vast stores of unstructured data and delivers AI results within budget and with lower risk. Once you do this, you’re well on the way to driving data monetization, which your CEO and CFO will love.
