Narek Tatevosyan, Product Director at Nebius AI, explores how AI startups can boost efficiency by optimising their tech stack and using full-stack platforms.
The current generative AI boom is pushing the boundaries of technology, transforming industries and driving up demand for compute. Many AI startups are falling into the ‘compute trap’: focusing on gaining access to the latest, most powerful hardware whatever the cost, rather than optimising their existing infrastructure or finding more effective and efficient ways to build GenAI applications.
While GPU power will always be critical to training large AI models and other machine learning workloads, it isn’t the only thing that matters. Without state-of-the-art CPUs, high-speed network interface cards supporting NDR 400Gb/s InfiniBand, DDR5 memory, and a motherboard and server rack that can tie it all together, it’s impossible to get maximum performance from an NVIDIA H100 or other top-spec GPUs. And beyond taking that broader view of compute, a more holistic approach to developing AI applications – one built on efficient data preparation, optimised training runs and scalable inference infrastructure – allows you to scale and evolve your applications sustainably.
The problem with compute
All else being equal, the more compute you have available and the larger your dataset, the more powerful the AI models you can build. For example, Meta’s Llama 3.1 8B and 405B LLMs were trained on the same 15-trillion-token dataset using NVIDIA H100s – but the 8B version took 1.46 million GPU hours, while the significantly more powerful 405B version took 30.84 million, roughly 21 times as much.
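To put that gap in money terms, here is a quick back-of-the-envelope sketch. The GPU-hour figures are Meta’s published numbers; the hourly rate is an invented placeholder, since real H100 pricing varies widely by provider and commitment.

```python
# Back-of-the-envelope comparison of the two Llama 3.1 training runs.
# GPU-hour figures come from Meta's model card; the hourly rate is an
# assumed placeholder, not a quote from any real provider.
GPU_HOURS_8B = 1.46e6
GPU_HOURS_405B = 30.84e6
ASSUMED_USD_PER_GPU_HOUR = 2.50  # hypothetical H100 rate

print(f"Compute ratio: ~{GPU_HOURS_405B / GPU_HOURS_8B:.0f}x")  # ~21x
for name, hours in (("8B", GPU_HOURS_8B), ("405B", GPU_HOURS_405B)):
    cost_millions = hours * ASSUMED_USD_PER_GPU_HOUR / 1e6
    print(f"Llama 3.1 {name}: ~${cost_millions:.1f}M at the assumed rate")
```

Even at a conservative assumed rate, the difference between the two runs is tens of millions of dollars – exactly the scale of spend most startups cannot match.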
In the real world, of course, all else is seldom equal, and very few AI companies have the resources to compete head-on with a company like Meta. Instead of falling into the compute trap and trying to match the compute spend of some of the largest companies in the world, it can be a more effective competitive strategy to focus holistically on the whole tech stack driving your ML development.
It’s also worth noting that while Llama 3.1 8B isn’t as powerful as the 405B version, it’s still an effective and competitive LLM that outperforms many older, larger models. While Meta obviously used a huge amount of compute in developing Llama, its researchers were innovating aggressively in other areas too.
The full stack advantage
Using a single platform to manage everything – from data preparation and labelling to model training, fine-tuning and even inference – comes with a number of advantages.
Developing and deploying an AI application with a single full-stack provider means your team only has to learn one set of tools rather than several different platforms. Similarly, your data stays in one place, so you don’t have to deal with the complexities and inefficiencies of multi-cloud operations. Perhaps most usefully, if you run into any issues, you are dealing with a single support team that understands every layer of your stack.
And, of course, there can also be financial benefits: by using the same infrastructure for data handling, training and inference, you are more likely to get better pricing from your infrastructure provider.
Beyond the big three
While hyperscalers like AWS, Microsoft Azure and Google Cloud might seem the obvious choice if you are investing in a single platform, they can have downsides for many, if not most, AI companies.
Most notably, the big three cloud computing platforms are expensive. If you run an incredibly well-funded startup or a massive tech company, this might not be an issue – but for the majority of AI companies, the bigger cloud providers don’t offer the best ROI. Moreover, they aren’t optimised for AI-specific workloads, so you pay significant premiums for features you don’t need.
Dedicated full-stack AI platforms like Nebius offer a much more efficient solution. As well as providing more affordable compute and hardware setups optimised for both training and inference, they include only the tools and features needed for developing AI applications. You can focus on developing, training and optimising your AI models, confident that they’re running on the right hardware for the job, rather than navigating a sprawling server backend or wondering why your expensive GPUs aren’t getting the data throughput they should.
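That last point is worth checking for yourself. Below is a minimal sketch, assuming a PyTorch training loop, that measures how much of each training step is spent waiting for data rather than computing; the model and dataset are stand-ins, but the timing pattern transfers to real pipelines.

```python
import time

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Stand-in model and data, purely for illustration.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    loader = DataLoader(data, batch_size=256, num_workers=4)
    model = torch.nn.Linear(1024, 10).to(device)

    data_time = compute_time = 0.0
    t0 = time.perf_counter()
    for x, y in loader:
        t1 = time.perf_counter()
        data_time += t1 - t0  # time spent blocked waiting for the next batch
        loss = F.cross_entropy(model(x.to(device)), y.to(device))
        loss.backward()
        if device == "cuda":
            torch.cuda.synchronize()  # flush async GPU work so the timer sees it
        t0 = time.perf_counter()
        compute_time += t0 - t1  # host-to-device transfer plus forward/backward

    print(f"waiting on data: {100 * data_time / (data_time + compute_time):.0f}% of step time")

if __name__ == "__main__":  # guard needed for num_workers > 0 on spawn-based platforms
    main()
```

If a large share of each step goes to waiting on data, faster GPUs won’t help – faster storage, more loader workers or better-prepared data will.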
While leveraging a full-stack approach to ML development requires up-front investment, it minimises your ongoing infrastructure costs. Building a better-optimised application from the start not only saves on training runs, but should also reduce the cost of inference. These kinds of savings can compound over multiple generations of AI models, as the toy calculation below illustrates: a slightly more efficient prototype leads to a much more efficient production model, and so on into the future.
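As a rough sketch, suppose each model generation ships 15% cheaper to train and serve than the last – an invented figure, chosen only to show how modest gains stack up:

```python
# Toy illustration of how per-generation efficiency gains compound.
# The 15% saving per generation is an invented, purely illustrative figure.
relative_cost = 1.0
saving_per_generation = 0.15  # hypothetical
for generation in range(1, 6):
    relative_cost *= 1 - saving_per_generation
    print(f"Generation {generation}: {relative_cost:.2f}x the original cost")
# By generation five, costs have fallen to roughly 0.44x the original.
```

Modest, repeatable gains like these add up quickly. The decisions you make now could be what gives your company the runway to make it to an IPO.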