Why is the AI industry still talking as if everything belongs in the cloud?

Neel Khokhani
Founder and CEO of Epochal Corporation

Neel Khokhani, Founder and CEO of Epochal Corporation, argues that the economics, physics, and regulation shaping AI infrastructure all point in the same direction: inference is moving to the edge, and the market has been slow to catch up.

Three out of every four dollars spent on AI chips in 2025 went to edge processing. The industry is still talking as if AI lives in the cloud. Increasingly, it does not.

In 2025, edge processing accounted for more than 75% of all AI chip revenue by processing type. Not cloud. Not hyperscale data centres. Edge. Devices, endpoints, local nodes. The chips sitting in pockets, on factory floors, and in server closets, closer to where decisions actually happen.

Meanwhile, inference, the act of actually using a trained AI model, now accounts for roughly two-thirds of all AI compute workloads, up from one-third in 2023. Deloitte projects that the market for inference-optimised chips alone will exceed $50 billion in 2026. IDC forecasts that by 2030, half of all enterprise AI inference will be processed locally on endpoints or edge nodes, rather than in the cloud.

The infrastructure conversation has not caught up with the infrastructure reality. The industry still defaults to a mental model in which AI means centralised GPU clusters in hyperscale facilities. That model made sense for training. It makes less and less sense for inference. And inference is where AI increasingly interacts with the physical world.

This was not the trajectory many expected. The assumption was that the cloud would absorb the AI workload the way it absorbed everything else. Instead, the opposite is beginning to happen, for reasons that are not philosophical. They are economic, physical, and legal.

The toll booth on the way out

There is a term gaining traction among infrastructure engineers that the broader market has been slow to absorb: the cloud tax.

The cloud tax is not just a metaphor. It is a line item. Every time data leaves a hyperscale cloud provider’s network, the organisation that owns that data pays an egress fee. Azure charges $0.087 per gigabyte. Google Cloud charges $0.12 per gigabyte for the first terabyte. Those rates run four to six times higher than what the same providers charge to store the data in the first place.

For a traditional web application, this is a manageable cost. For an AI workload, it is a structural problem. AI inference at scale requires continuous, high-volume data transfer: camera feeds, sensor readings, model outputs, real-time decisions, flowing back and forth between where the data is generated and where it gets processed. The meter runs every time.

Consider a concrete example. A company storing 50 terabytes and moving 20 terabytes out per month, a modest figure for a serious AI deployment, faces a combined storage-and-egress bill of roughly $31,500 a year on Azure, or about $35,100 on Google Cloud, with egress alone accounting for more than 65% of the total for active workloads. One company recently received a $250,000 egress charge from AWS on a single invoice. AWS eventually waived it. Most companies do not get that call.
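To make the arithmetic concrete, here is a minimal sketch of how that kind of bill accumulates. The egress rate is the Azure figure quoted above; the storage rate, the flat untiered pricing, and the absence of free allowances or discounts are simplifying assumptions, so the output is an illustration rather than a quote.

```python
# Illustrative cloud-bill arithmetic for a 50 TB / 20 TB-per-month workload.
# Assumptions (not vendor quotes): flat per-GB pricing, no free tiers,
# no committed-use discounts, decimal terabytes (1 TB = 1,000 GB).

STORAGE_TB = 50           # data held in the cloud
EGRESS_TB_PER_MONTH = 20  # data pulled back out each month

STORAGE_RATE_PER_GB_MONTH = 0.018  # assumed hot-tier storage rate, $/GB-month
EGRESS_RATE_PER_GB = 0.087         # Azure egress rate cited above, $/GB

def annual_bill(storage_tb, egress_tb_per_month, storage_rate, egress_rate):
    storage_cost = storage_tb * 1_000 * storage_rate * 12
    egress_cost = egress_tb_per_month * 1_000 * egress_rate * 12
    return storage_cost, egress_cost

storage_cost, egress_cost = annual_bill(
    STORAGE_TB, EGRESS_TB_PER_MONTH,
    STORAGE_RATE_PER_GB_MONTH, EGRESS_RATE_PER_GB,
)
total = storage_cost + egress_cost

print(f"Storage: ${storage_cost:,.0f}/year")
print(f"Egress:  ${egress_cost:,.0f}/year")
print(f"Total:   ${total:,.0f}/year ({egress_cost / total:.0%} of it is egress)")
```

With those assumed rates, egress contributes roughly two-thirds of the annual total, which is the structural point: the charge scales with how often data moves, not with how much is stored.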

The economics are worth stating plainly. The hyperscalers make it cheap or free to put data in. They make it expensive to get data out. The architecture is not neutral. It makes leaving costly. Egress fees are not simply a pricing decision. They also function as a retention mechanism.

For AI workloads running at the edge, the maths inverts. Data processed locally never crosses the network boundary. There is no egress fee because there is no egress. The data that does travel is smaller, more refined, and less frequent. A farmer who mills his grain at the edge of the field does not pay to transport the chaff.

This is not a theoretical saving. It is a structural one. And it compounds with every additional sensor, every additional camera feed, every additional inference cycle. The cloud tax is invisible at a small scale. At AI scale, it becomes one of the largest recurring expenses in the deployment, and it is entirely a function of architectural choice.

The two-thirds problem

The reason the cloud tax matters more now than it did five years ago is that the nature of AI compute has shifted fundamentally, and many of the industry’s infrastructure assumptions have not shifted with it.

Training a large language model is a concentrated, one-time effort. It requires massive centralised compute, enormous datasets, and weeks or months of continuous processing. The hyperscale model was built for this. Aggregate everything. Optimise for throughput. Centralise without limit. For training, that logic still holds. Few would argue otherwise.

Inference is different in every way that matters architecturally. Inference is every time the model is used. A fraud detection system screening a transaction. A predictive maintenance tool flagging a vibration anomaly on the factory floor. A surgical assistance system processing visual data in real time. A logistics platform recalculating routes as conditions shift. A real-time quality-control agent on a production line. These are not batch jobs. They happen continuously, in milliseconds, at the exact point where the operation runs.

Routing those decisions to a data centre in another region introduces latency that, in some cases, is incompatible with the use case. A surgical system cannot wait for a round trip. Neither can an autonomous inspection drone or an industrial safety monitor. The speed of light is not a software problem. It is a physics problem. And physics does not compress in response to capital expenditure.
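A back-of-the-envelope calculation shows why. The only physical input below is the assumption that light in optical fibre covers roughly 200 kilometres per millisecond, about two-thirds of its speed in a vacuum; everything a real network adds on top (routing hops, queuing, server-side processing) only pushes the numbers higher. The sketch computes the propagation floor alone.

```python
# Rough lower bound on network round-trip time to a remote inference endpoint.
# Assumption: signals in optical fibre travel at roughly 200,000 km/s,
# i.e. about 200 km per millisecond; real paths add routing and queuing.

FIBRE_KM_PER_MS = 200  # assumed fibre propagation speed, km per millisecond

def min_round_trip_ms(distance_km: float) -> float:
    """Propagation delay alone for a request and its response."""
    return 2 * distance_km / FIBRE_KM_PER_MS

for km in (50, 500, 2_000, 8_000):
    print(f"{km:>5} km -> at least {min_round_trip_ms(km):4.1f} ms before any processing")
```

On-device inference skips that budget entirely. For a control loop that needs single-digit-millisecond responses, the distance allowance runs out quickly, and no amount of spending at the far end changes it.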

If training is writing the textbook, inference is the student using what they learned in the field. The student does not need the entire library beside them. They need the right knowledge, immediately, where they are standing. Sending every question back to a central library hundreds of miles away and waiting for the answer defeats the purpose of having learned anything at all.

Inference now represents two-thirds of all AI compute. By 2028, it will dominate. The infrastructure serving it needs to match what it actually requires: compute close to where the decision happens.

The quiet migration

Something significant is happening in the hardware layer that many commentators have overlooked because it lacks the drama of a new GPU launch or a hyperscaler earnings beat.

Neural processing units, chips designed specifically for AI inference, are now embedded in nearly every computing device sold. Not as an experiment. As a standard feature. Gartner projected 114 million AI PCs shipped in 2025, a 165% increase from 2024. By 2026, NPU-equipped laptops will represent nearly 60% of all global PC shipments. Gartner has gone further, predicting that by 2026, AI laptops will become the standard choice for large businesses.

The smartphone market tells the same story. End-user spending on NPU-equipped smartphones is projected to reach $393 billion in 2026. Apple’s Neural Engine, Qualcomm’s Hexagon NPU, and MediaTek’s APU are putting inference-capable silicon into devices that cost a few hundred dollars and fit in a pocket.

These are not marketing exercises. Dell’s Latitude 7455, equipped with the Snapdragon X Elite, can run 13-billion-parameter language models locally. That is a laptop running a Llama-class model with no cloud connection, no egress fee, no network latency, and no data leaving the building.
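To illustrate what local means in practice, a sketch along the following lines runs entirely on the machine, assuming the open-source llama-cpp-python bindings are installed and a quantised model file has already been downloaded; the file name is a placeholder, and whether the work lands on the CPU, GPU, or NPU depends on how the library was built for that device.

```python
# Minimal local-inference sketch using the llama-cpp-python bindings.
# Assumes a quantised GGUF checkpoint already sits on local disk;
# the file name below is a placeholder, not a specific released model.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-13b-q4_k_m.gguf",  # hypothetical local file
    n_ctx=4096,    # context window
    n_threads=8,   # runs on local cores; no network connection involved
)

result = llm(
    "Summarise the last shift's quality-control alerts in two sentences.",
    max_tokens=128,
)
print(result["choices"][0]["text"])
```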

Five years ago, running a model of that complexity required a server rack, a climate-controlled room, and a meaningful electricity bill. Today, it runs on a device that weighs less than two kilograms. The cost of inference-capable hardware is falling on a curve that makes the centralisation assumption harder to defend with each passing quarter.

When the hardware was expensive and power-hungry, the argument for centralisation was strong. Concentrate the costly machines, amortise across users, accept the latency. But when the hardware becomes cheap, compact, and efficient enough to deploy at the point of use, the calculus changes. The latency stops being worth tolerating. The egress fees stop being worth paying. The complexity of routing everything back to a central facility stops being justified by economics.

The edge AI hardware market reached $26 billion in 2025 and is growing at nearly 18% annually. The on-device AI market is projected to reach $157 billion by 2033. These are not niche categories; they are becoming a central part of AI computing, arriving on a timeline the market has consistently underestimated.

The letter that never left the building

There is a legal dimension to the edge migration that the technology press has largely overlooked, and it is accelerating faster than many infrastructure planning cycles can accommodate.

Data sovereignty regulation is tightening globally, and the trajectory is unambiguous. The European Union’s GDPR was the opening chapter. India’s Digital Personal Data Protection Act imposes strict conditions on cross-border data transfers. Brazil’s LGPD carries enforcement mechanisms with real consequences. Indonesia, Vietnam, Nigeria, Saudi Arabia, and a growing list of jurisdictions are introducing or strengthening requirements that personal data be processed within national borders.

For organisations running AI inference on customer data, patient records, financial transactions, or employee information, centralising that processing in a small number of hyperscale facilities creates legal exposure that compounds with every new regulation. The compliance cost is not just legal fees. It is the architectural complexity of routing data through approved channels, maintaining audit trails across borders, and managing the risk that a law changes faster than the infrastructure can adapt.

Edge infrastructure resolves this by design. When inference runs locally, within the relevant jurisdiction, the data never leaves. There is no cross-border transfer to defend. No complex legal scaffolding to maintain. No regulatory risk that scales with your geographic footprint.
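Architecturally, the change is often as simple as a routing decision made before any data moves. The sketch below is schematic, with invented node names and a deliberately blunt policy, but it captures the shape of it: a record subject to a residency rule is only ever processed by a node inside that jurisdiction, and the system refuses rather than silently exporting the data.

```python
# Schematic data-residency routing: keep inference inside the jurisdiction
# where the data originates. Node names and the rule are illustrative only.
from dataclasses import dataclass

@dataclass
class EdgeNode:
    name: str
    jurisdiction: str  # e.g. ISO country code

# Hypothetical fleet of local inference nodes.
NODES = [
    EdgeNode("factory-floor-muc-01", "DE"),
    EdgeNode("clinic-rack-blr-02", "IN"),
    EdgeNode("warehouse-gru-01", "BR"),
]

def select_node(data_jurisdiction: str) -> EdgeNode:
    """Pick an inference node in the same jurisdiction as the data.
    If none exists, refuse rather than silently exporting the data."""
    for node in NODES:
        if node.jurisdiction == data_jurisdiction:
            return node
    raise RuntimeError(f"No in-country node for {data_jurisdiction}; "
                       "a cross-border transfer would need legal review")

print(select_node("IN").name)  # -> clinic-rack-blr-02
```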

A letter that never leaves the building does not need a customs declaration. For a global enterprise operating across dozens of regulatory environments, that simplicity is not a feature. It is a strategic necessity that becomes more valuable with every new privacy law that enters into force.

One grid, too many megawatts

There is one more constraint that centralised infrastructure cannot spend its way around, and it may be the most stubborn of all.

Power availability, not price, is becoming the binding constraint on data centre capacity. In Northern Virginia, the most concentrated cloud hub on earth, utilities have projected connection timelines for large new projects stretching up to seven years. Ireland’s data centres already consume more than 20% of national electricity. Singapore imposed a multi-year moratorium on new data centre construction. These are not temporary disruptions. They are the predictable result of concentrating enormous compute demand into a small number of locations.

AI workloads are materially more power-intensive than traditional cloud computing. Microsoft has billions of dollars of AI chips sitting in warehouses because the electrical grid cannot absorb them. Satya Nadella has said it publicly: the constraint is not chip supply. It is the ability to get builds done fast enough, close to power. Microsoft has $80 billion in Azure backlog it cannot fulfil. Not because demand softened, but because power and construction timelines intervened.

A river that spreads across a floodplain nourishes the land. The same volume of water forced through a single channel overwhelms it. Edge deployments distribute the energy demand across many smaller sites, each drawing from local grid capacity rather than requiring dedicated substations and years of utility planning. The megawatt problem becomes more tractable when it does not need solving in one place.

The architecture the internet already solved

None of this is theoretically novel. The internet answered this question 20 years ago.

In the early 2000s, the architects of the internet faced a structurally identical problem: how do you build a system that handles massive, unpredictable demand without breaking when any single part fails? Their answer was peer-to-peer networking. P2P systems distributed load across thousands of individual nodes. No single point of failure. Intelligence close to the user. Resilience built into the topology rather than bolted on top.

BitTorrent did not solve file transfer by building faster central servers. It distributed the problem across the network. Each node sat close to a user. Each node handled local demand locally. When individual nodes dropped off, the system degraded at the margin; no central failure could take the whole network down. It outperformed centralised alternatives on speed, resilience, and scale simultaneously.

Edge computing applies the same logic to AI infrastructure. Smaller, modular compute facilities positioned close to where data originates distribute the inference workload in much the same way that P2P distributed file transfer. Each site handles local decisions locally. The network becomes more resilient because no single facility carries the entire load.
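A schematic dispatch loop makes the parallel concrete. The node names, coordinates, and simulated failure below are invented for illustration, but the behaviour is the point: each request goes to the nearest healthy site, and the loss of one node reroutes only its own neighbourhood instead of stalling everything.

```python
# Schematic of P2P-style dispatch: each request goes to the nearest healthy
# edge node, so the loss of one node degrades only its own neighbourhood.
# Node names, coordinates, and the failure are illustrative only.
import math

# Hypothetical edge nodes: (name, x, y, healthy)
NODES = [
    ("plant-north", 0.0, 10.0, True),
    ("plant-south", 0.0, -10.0, True),
    ("depot-east", 12.0, 0.0, False),  # simulate one node dropping off
]

def nearest_healthy_node(x: float, y: float) -> str:
    healthy = [(name, nx, ny) for name, nx, ny, ok in NODES if ok]
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    return min(healthy, key=lambda n: math.hypot(n[1] - x, n[2] - y))[0]

# A request originating near the failed node is rerouted to the next-nearest
# site; requests everywhere else are untouched.
print(nearest_healthy_node(11.0, 1.0))   # -> plant-north (depot-east is down)
print(nearest_healthy_node(0.0, -9.0))   # -> plant-south (unaffected)
```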

The market has it backwards

I want to be precise about what this argument is and what it is not.

This is not a case against cloud infrastructure. Training workloads, large-scale batch processing, and many enterprise applications will continue to run efficiently in centralised environments. The hyperscalers have built extraordinary infrastructure, and their role remains important for the workloads they were designed to serve.

But the workloads are changing. Inference is different from training. Edge is different from cloud. And the economics, the physics, the regulations, and the hardware are all converging on the same conclusion: the next phase of AI infrastructure will be more distributed, not more centralised.

The edge AI market is projected to grow from $25 billion in 2025 to more than $118 billion by 2033. The edge data centre market will grow from $15 billion in 2025 to $72 billion by 2035. IDC predicts that by 2027, 80% of CIOs will turn to edge services to meet the demands of AI inference. These are not speculative projections from the fringe. They are widely cited forecasts from firms many organisations use for planning.

Yet the conversation remains anchored in a paradigm built for training, not inference. Built for centralisation, not distribution. Built for a world where compute was expensive and scarce, not cheap and embedded in every device sold.

Edge processing already generates three-quarters of AI chip revenue. Inference already consumes two-thirds of all AI compute. NPU-equipped devices will be the majority of all PCs and smartphones shipped by 2026. The shift is no longer theoretical. It is already under way. The market has been slower to reflect that reality.

The engineers who built P2P networks understood something fundamental. Distributing intelligence across the network can make it stronger, faster, and more resilient. As inference pushes AI out of the data centre and into the factories, hospitals, vehicles, and devices where the world actually operates, that lesson is becoming harder to ignore.

The cloud was the answer to the last generation of computing problems. Edge is emerging as an answer to this one. The companies that recognise the difference early will be better placed to build infrastructure that is cheaper to run, easier to govern, faster to respond, and harder to disrupt. The companies that do not may spend the next decade paying the cloud tax and wondering where the margin went.

The internet solved this problem 20 years ago. It is time we listened.
