Rewiring Enterprise AI: From Data Proximity to Memory-First Intelligence

As enterprises confront growing concerns around data leakage, cost spikes, and AI scalability, a new architectural shift is emerging—one that brings AI closer to where data resides. In this conversation with CIO&Leader, Renu Raman, CEO & Founder of Proximal Cloud, outlines how “data-adjacent AI” and memory-centric design are redefining enterprise AI infrastructure. From air-gapped deployments and sovereign architectures to microkernel-driven performance and cost optimization, Raman argues that the future of AI is not just about models—but about how data, memory, and compute interact. The discussion offers a grounded view into building secure, scalable, and economically viable AI systems in enterprise environments.

CIO&Leader: How does “data-adjacent AI” reduce leakage risk compared to public cloud GenAI?

Renu Raman: Primarily, we deploy it in the private cloud and on-prem. The data is not moving into the public cloud in the first instance. Effectively, it’s air-gapped.

Now, on top of that, you want controls, because both internal and external parties may need access. That level of access control is something we partner with others to provide; we don’t do it ourselves. The primary deployment is in your own VPC-like environments, with partners like NISA, NextGen, and others, where you have your own cages and your own storage. We bring or build our compute into those cages, wherever your data is stored.

It’s an air-gapped model to start with, overlaid with additional access controls. We do not build those controls ourselves, because customers have already chosen their tools; we work with whatever they have picked.

CIO&Leader: What’s unique about your memory-centric architecture for dark enterprise data?

Renu Raman: Well, there are two parts to that question. Dark enterprise data is an opportunity because it is sitting there, untapped, waiting for better insights. Most of it is historical—30 or 40 years of data. Old invoices, paper clips, everything is useful today. Whether your data is in paper form or already stored digitally, we can bring our compute to that dark data. There is so much of it sitting there. That’s number one.

The unique value in bringing that in is that, regardless of whether it’s dark data or not, you are interacting with the computing system in natural language. Behind the scenes, you are working with enterprise context and a variety of complex data systems. It has to execute at three different layers. We look at memory as the first principle in optimization. At the hardware layer, it’s CPUs and GPUs.

If you take a common enterprise query, you will be ping-ponging between databases like SQL, Oracle, SAP, LLMs, knowledge graphs, and others. Some run only on CPUs, some only on GPUs, and you’re ping-ponging between them. A typical query ends up moving across these subsystems, and you still have to get an answer in 120 milliseconds. So you cache them. There is an AI memory layer on top of these multiple data systems that we build and interact with. That’s the physical layer we cache.
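The caching idea Raman describes—an AI memory layer sitting in front of several heterogeneous data systems so a repeated query does not ping-pong across all of them again—can be sketched roughly as follows. This is an editorial illustration only, not Proximal Cloud’s actual code; the class and function names are hypothetical, and the backends are stand-in callables.

```python
from time import monotonic

class AIMemoryCache:
    """Illustrative sketch: a TTL cache that fronts multiple backend
    data systems (SQL, knowledge graph, LLM, ...), so a repeated
    query is answered from memory instead of re-querying them all."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # query key -> (timestamp, answer)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        ts, answer = entry
        if monotonic() - ts > self.ttl:  # stale: fall through to backends
            del self._store[key]
            return None
        return answer

    def put(self, key, answer):
        self._store[key] = (monotonic(), answer)

def answer_query(query, cache, backends):
    """Consult the cache first; on a miss, fan out to each backend
    (hypothetical callables) and cache the combined answer."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    parts = [backend(query) for backend in backends]
    answer = " | ".join(parts)
    cache.put(query, answer)
    return answer
```

In a real system the cache key would also encode session context, and staleness would be driven by data-change events rather than a fixed TTL; the sketch only shows why the cache sits above the individual data systems.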

That’s the memory on the hardware side. On the software side, we use a microkernel called Compute AI to handle complex queries. If you ask something simple like, “What was the steel price when Trump issued a policy?” you can run a select and find the price. But in natural language, queries become complex. You may need to pull data across databases into complex joins.

We do memory tiering to optimize for cost. The first part is caching for performance. The second is cost optimization across memory tiers in the physical layer. The third, and more important, is deep context. You want the session context of your interaction with the system and a much deeper understanding of what you’re asking, so it becomes a self-learning system. That context memory is also something we optimize.
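The cost side of the memory tiering Raman mentions—keep the hottest entries in a fast, expensive tier and demote the rest to a cheaper one—can be illustrated with a minimal sketch. This is a hypothetical, two-tier toy, not the company’s implementation; real tiering would operate on pages and track cost per tier.

```python
class TieredMemory:
    """Illustrative two-tier store: a bounded fast tier plus an
    unbounded cheap tier. Entries are promoted on access and the
    least-accessed entry is demoted when the fast tier overflows."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = {}   # key -> (hit count, value)
        self.slow = {}   # key -> value

    def get(self, key):
        if key in self.fast:
            hits, value = self.fast[key]
            self.fast[key] = (hits + 1, value)
            return value
        if key in self.slow:
            value = self.slow.pop(key)
            self.put(key, value)  # promote on access
            return value
        return None

    def put(self, key, value):
        self.fast[key] = (1, value)
        while len(self.fast) > self.fast_capacity:
            # demote the least-hit entry to the cheaper tier
            coldest = min(self.fast, key=lambda k: self.fast[k][0])
            _, demoted = self.fast.pop(coldest)
            self.slow[coldest] = demoted
```

The third element Raman names—session context as a self-learning memory—would sit on top of a structure like this, with context entries scored by relevance rather than raw hit counts.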

Finally, the logical layer is depersonalization. Across all layers, we apply different optimizations to the logical and physical views of memory. Think of it as a memory-first design, not a CPU problem. It’s a memory-first problem, and then we wrap the necessary components around it at all layers. That’s what we design and what we bring to market.

CIO&Leader: How do you handle fine-tuning and inference on-prem without GPU cost spikes?

Renu Raman: For on-prem—well, on-prem is typically already invested in the capital cost structure, right? So there are two parts to that question. We are not primarily focused on inference engines; we partner with third-party companies.

Think of us as a query engine for complex natural language queries that have to interrogate various data systems. Today, we work with third-party companies on the inference engines. We either use open-source models or, via an API call, a public cloud model.

Regarding fine-tuning, we will proceed on a customer-by-customer basis. Based on the context and the depersonalization agent, we will fine-tune the model, but that is coming in the future. We don’t have it today.

And to answer the second part, about cost spikes: it is about model selection and then model optimization. Initially, we’ll start with a simple example. Say you type “hi,” and that alone brings up a large model—you are already burning about 680 gigabytes of memory just to load it.

But that’s not efficient. So we bring up a simpler model first, and when the actual question comes—after the “hi”—we can bring up the bigger-parameter model.

Then, based on what you’re asking and querying, we track runtime. Without sacrificing response quality, we trade off memory optimization, model selection, and model optimization to give you optimal cost and performance. That is what the backend system does—cost and utilization optimization. That solves part of the spike issue, but not all of it.
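The model-selection step Raman walks through—route trivial inputs to a small model and escalate to a bigger one only when the query warrants it—can be sketched as a simple router. This is an illustrative assumption-laden toy: the model names, memory figures, and the token-count complexity heuristic are all hypothetical.

```python
def estimate_complexity(prompt: str) -> int:
    """Crude stand-in heuristic: word count as a proxy for query
    complexity. A real router would look at intent, joins needed, etc."""
    return len(prompt.split())

# Hypothetical model tiers, cheapest first: (name, memory_gb, max_complexity)
MODELS = [
    ("small-7b", 16, 4),
    ("medium-70b", 160, 32),
    ("large-680b", 680, 10_000),
]

def select_model(prompt: str, models):
    """Pick the cheapest model whose capacity covers the estimated
    complexity; fall back to the largest model otherwise."""
    complexity = estimate_complexity(prompt)
    for name, mem_gb, max_complexity in models:
        if complexity <= max_complexity:
            return name, mem_gb
    return models[-1][0], models[-1][1]
```

Under this sketch, a bare “hi” routes to the 16 GB model instead of paging in the 680 GB one, which is the memory saving described above.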

You may have other spikes when more people in the same company are using the system at once. That’s where the virtualization layer comes in—it’s on the roadmap. There are two parts to it.

At the hardware layer, we assume it’s an inference-only, shardable system. Say every node has 512 gigabytes or a terabyte of memory. That gives you a certain model capacity, and you can bin-pack a certain number of models onto each node.

We can automatically move the model across nodes, and we connect them with high-speed Ethernet—400G and 800G. This ability to move the model across multiple nodes over the Ethernet fabric helps mitigate spikes in query load.
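The bin-packing Raman refers to—fitting as many models as possible into nodes of fixed memory capacity—is a classic packing problem. A minimal first-fit-decreasing sketch, purely illustrative (node capacities and model sizes are made up, and real placement would also weigh load and migration cost over the Ethernet fabric):

```python
def bin_pack_models(models, node_capacity_gb):
    """First-fit-decreasing bin packing: assign models, given as
    (name, memory_gb) pairs, to inference nodes of fixed memory
    capacity. Returns a list of per-node model-name lists."""
    placements = []  # each node: [free_gb, [model names]]
    for name, mem in sorted(models, key=lambda m: m[1], reverse=True):
        if mem > node_capacity_gb:
            raise ValueError(f"{name} exceeds node capacity; shard it")
        for node in placements:
            if node[0] >= mem:        # first node with room
                node[0] -= mem
                node[1].append(name)
                break
        else:
            placements.append([node_capacity_gb - mem, [name]])
    return [names for _, names in placements]
```

For example, two 300 GB models and two 100 GB models fit into two 512 GB nodes; when query load spikes on one node, a replica of a hot model can be placed on another node with free capacity, which is the migration-over-Ethernet mechanism described above.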

There are certain things we handle ourselves at the physical layer, but for the rest we rely on vLLM and other third-party inference engines to handle spikes.

CIO&Leader: What does sovereign AI mean at the infrastructure and model-governance layers?

Renu Raman: That’s a very broad and big question. Sovereign means many things to many people. We started using the word “private” more than “sovereign,” but “sovereign,” fundamentally, means there is no data leakage.

There is a chart we work from. At the highest level, you should be able to operate the infrastructure without any dependencies on any party. That includes not depending on third-party licensed software or on another country to communicate via an API, right? What if, for example, an API call crosses national boundaries? How do you deal with that?

We don’t solve all those problems. To a certain extent, we are self-selected—we go on-prem and private. If you choose an open-source model running on your own infrastructure, you address a significant portion of the sovereignty and privacy concerns.

There are still other elements that you cannot operate in isolation, as you depend on many services. And that we have to work on with customers, because customers have a more complex environment and expect third-party integrations.

We only control what we deal with—the data we have access to and the compute—and we ensure it stays within the walled garden for privacy. The rest we have to work through with customers to meet their sovereignty requirements. There is a wide range of people with very different definitions of sovereignty.

CIO&Leader: How does Crania integrate SQL, NoSQL, and unstructured data with LLMs at scale?

Renu Raman: So there are three core elements of the kernel: resiliency, concurrency, and cost optimization. Resiliency means handling failure modes.

What I said earlier about memory is about how you handle and move pages across memory tiers to optimize cost. Concurrency: if you go to Snowflake, you get only eight concurrent queries per instance at any given moment. Part of that is the business model, and part of it is how the engine is designed.

The microkernel is designed to handle very high throughput. Especially—now we’re getting into low-level technical details—when transactions block on the memory system. That’s where we treat it as an operating-systems problem, and that’s where we have engineered higher concurrency.

Now, that leads to the second problem. It’s not just in a single node—you go across multiple nodes, and you have failure modes. Because you’re going across multiple nodes, failures occur at the physical layer as well as in software. So, how do you handle graceful degradation?

The benchmark we show is going from a single TPCx instance to large-scale runs—thousands, even trillions of transactions—and how we deal with failure modes along the way. We won’t fail outright; we’ll see gradual performance degradation.

It’s about how you manage resources and handle the query so that, as fewer machine resources become available, we run on whatever is available. You get degraded performance, but we guarantee execution—we will not fail.
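The graceful-degradation guarantee described here—keep executing on whatever nodes survive, at reduced speed—can be sketched as a shard scheduler. This is an editorial illustration under stated assumptions: the node dictionaries, the `healthy` flag, and the `execute` callable are hypothetical stand-ins for real health checks and shard execution.

```python
def run_with_degradation(query_shards, nodes, execute):
    """Illustrative sketch: distribute query shards round-robin over
    the healthy nodes only. With fewer healthy nodes, each node takes
    more shards, so the query still completes—just more slowly."""
    healthy = [n for n in nodes if n.get("healthy", True)]
    if not healthy:
        raise RuntimeError("no healthy nodes: cannot guarantee execution")
    results = []
    for i, shard in enumerate(query_shards):
        node = healthy[i % len(healthy)]  # round-robin over survivors
        results.append(execute(node, shard))
    return results
```

With three nodes and one down, the same four shards land on the two survivors: execution is guaranteed, and the degradation shows up as latency rather than failure.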

CIO&Leader: What security controls prevent model drift and data exfiltration in private AI clouds?

Renu Raman: We are not in the model and inference-engine business, so we have to work with whatever model the customers have picked and deal with that problem there. Obviously, we will detect model drift based on the query, lifecycle, and context.

There are still things we will have to work on. Because we depend on open-source models, we don’t have much control over model drift per se; that is something we have to work through with customers.

CIO&Leader: How will the on-prem AI appliance change the economics of enterprise AI deployment?

Renu Raman: There are two or three different segments we look at. There are small enterprises and medium enterprises, and they are very cost-conscious. If you think of a 10 crore business or a 100 crore business, they can spend only about 2% on something like this. So the question is what price point you can achieve—you need the right price point and simplicity.

They are not very tech-savvy. Out of the box, you want the experience to let you view all your data and get better responses. They’ll only want to talk in natural language. Today, it’s all English everywhere, but hopefully, with what Sarvam, Bhashyam, and others are doing, all the Indian languages will be there.

Actually, we have a partner working on natural language-to-SQL generation for Kiranawala. So they can organize themselves in the back end. Anyway, that’s a digression.

So, in that segment, we believe the price point we’re going to achieve—25 lakhs to 50 lakhs, if you take 2%—will map to the 10 crore to 50 crore business segment.

Now, the second part is not only capital; it’s the operational part. What do we do to simplify the software stack, operationalize it, and manage its lifecycle? We’ll have resiliency, but there’s still an operational piece we need to work on with them, since they’ll be deployed in all kinds of places.

Then there’s a mid-market category. You can call it about $50 million to $100 million businesses, or roughly 50 crore and above. And what is the price point there?

So, there, you need two things. First, what is the pay-as-you-go model? You start with, say, one rack of gear costing—call it—one crore per unit; one rack could be 16 units, so roughly a 16 crore rack. Then, how do you provide a more scalable solution as they scale up? That’s what we enable.

And along the way, how do you build a scalable system to run this Crania across the 16 nodes in the rack? You can dynamically add nodes or delete nodes. Now, not all the features are there on day one—that will come over time—but it’s engineered to get there.

The third category is the nifty-fifty and the larger enterprises. There, they don’t have a cost problem per se, but they may have other issues to start with. At scale, you face a different cost problem.

At scale, you can manage supply chains and source boxes at lower cost, but you are not optimizing the box cost per se. It’s really the cost of compute, compute utilization, and memory utilization.

So the biggest difference between the mid-market and large-scale enterprises is how our microkernel manages memory utilization and how quickly queries can be executed. That’s where the benefits will come, primarily from the TCO model.

The second part is operational cost—how many staff they need to run the system. We believe the microkernel and the Crania architecture will get there, but we need real-world experience to prove they will require fewer IT staff, to be honest.
