DataPelago came out of stealth this week with a new virtualization layer that it says will let users move AI, data analytics and ETL workloads to the physical processor of their choice, without making changes to the code, potentially bringing significant new efficiency and performance gains to the fields of data science, data analytics and data engineering, as well as HPC.
The advent of generative AI has sparked a rush toward high-performance processors that can handle the massive computational demands of large language models (LLMs). At the same time, businesses are looking for ways to maximize the efficiency of their existing compute spend for advanced analytics and big data pipelines, while coping with the endless growth of structured, semi-structured and unstructured data.
The folks at DataPelago have responded to these market signals by creating what they call a universal data processing engine that eliminates the need to hardwire data-intensive workloads to the underlying compute infrastructure, allowing users to run big data, advanced analytics, AI and HPC workloads on whatever public cloud or on-premises systems they have, or whichever meets their price/performance requirements.
“Just as Sun built the Java Virtual Machine or VMware invented the hypervisor, we are building a virtualization layer that runs in software, not hardware,” says Rajan Goyal, co-founder and CEO of DataPelago. “Running in software gives a clean abstraction, and everything on top of that benefits.”
The DataPelago virtualization layer sits between the query engine, such as Spark, Trino, Flink or standard SQL, and the underlying infrastructure, consisting of storage and physical processors such as CPUs, GPUs, TPUs and FPGAs. Users and applications can submit jobs as they normally would, and the DataPelago layer will automatically route the work to the appropriate processor and execute it to meet user-defined availability or cost/performance requirements.
At a technical level, when a user or application executes a job, such as a data pipeline job or a query, the processing engine, such as Spark, converts it into a plan, and then DataPelago calls an open source layer, such as Apache Gluten, to convert that plan into an intermediate representation (IR) using open standards like Substrait or Velox. The plan is sent to the worker node in the DataOS component of the DataPelago platform, while the IR is converted into an executable Data Flow Graph (DFG) that runs in DataVM, another component of the platform. DataVM then evaluates the nodes of the DFG and dynamically maps them to the right processing element, according to the company.
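For illustration, a minimal Python sketch of that plan-to-IR-to-DFG flow might look like the following. Every class and function name here is hypothetical, since DataPelago has not published its internal APIs; only the stages themselves come from the company’s description.

```python
# Hypothetical sketch of the pipeline described above: engine plan -> IR ->
# Data Flow Graph -> processing element. All names are invented.

from dataclasses import dataclass, field

@dataclass
class DFGNode:
    op: str                              # e.g. "scan", "filter", "agg"
    device: str = "cpu"                  # filled in by the placement step
    inputs: list = field(default_factory=list)

def plan_to_ir(plan: list[str]) -> list[dict]:
    """Gluten-like step: convert an engine plan into portable IR records
    (standing in for Substrait/Velox)."""
    return [{"op": op} for op in plan]

def ir_to_dfg(ir: list[dict]) -> list[DFGNode]:
    """DataOS-like step: lower the IR into an executable Data Flow Graph."""
    nodes = [DFGNode(op=rec["op"]) for rec in ir]
    for up, down in zip(nodes, nodes[1:]):
        down.inputs.append(up)           # simple linear pipeline
    return nodes

def place(dfg: list[DFGNode]) -> list[DFGNode]:
    """DataVM-like step: map each DFG node to a processing element.
    The placement table is invented for illustration."""
    preferred = {"scan": "cpu", "filter": "fpga", "agg": "gpu"}
    for node in dfg:
        node.device = preferred.get(node.op, "cpu")
    return dfg

# A toy Spark-style plan for: SELECT key, SUM(v) ... WHERE v > 0 GROUP BY key
plan = ["scan", "filter", "agg"]
for node in place(ir_to_dfg(plan_to_ir(plan))):
    print(f"{node.op:>6} -> {node.device}")
```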
Having an automated method for matching the right workload to the right processor will be a boon to DataPelago customers, who in many cases have not seen the performance gains they expected when adopting accelerated computing engines, Goyal explains.
“CPUs, FPGAs and GPUs each have their own sweet spot, and SQL or Python workloads have a variety of operators. Not all of them run efficiently on a CPU, GPU or FPGA,” Goyal tells BigDATAwire. “We know these strengths. So our software at runtime maps the operators to the correct…processing element. It can split a massive query or workload into thousands of tasks, and some will run on CPUs, some on GPUs, some on FPGAs. This runtime-adaptive mapping to the right computing element is what is missing in other frameworks.”
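A toy version of that runtime decision might look like the sketch below, assuming an invented per-device cost table and a flat transfer penalty; a real engine would presumably derive these from profiling and hardware models rather than hard-coded numbers.

```python
# Hypothetical cost-based dispatch illustrating "runtime adaptive mapping."
# Relative execution cost per operator on each processing element
# (lower is better; all numbers are made up for illustration).
COST = {
    "parse_json":  {"cpu": 1.0, "gpu": 3.0, "fpga": 0.4},
    "hash_join":   {"cpu": 5.0, "gpu": 1.0, "fpga": 2.5},
    "regex_match": {"cpu": 4.0, "gpu": 2.0, "fpga": 0.8},
    "aggregate":   {"cpu": 3.0, "gpu": 0.7, "fpga": 1.5},
}

TRANSFER_COST = 0.5   # assumed penalty for moving data between devices

def assign(operators: list[str]) -> list[tuple[str, str]]:
    """Greedily pick the cheapest device per operator, charging a penalty
    whenever consecutive operators land on different devices."""
    assignment, prev = [], None
    for op in operators:
        best = min(
            COST[op],
            key=lambda dev: COST[op][dev]
            + (TRANSFER_COST if prev and dev != prev else 0.0),
        )
        assignment.append((op, best))
        prev = best
    return assignment

query = ["parse_json", "regex_match", "hash_join", "aggregate"]
for op, dev in assign(query):
    print(f"{op:>12} -> {dev}")
```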
DataPelago obviously can’t exceed the maximum performance an application can achieve by developing natively in CUDA for Nvidia GPUs, ROCm for AMD GPUs, or LLVM for high-performance CPU jobs, says Goyal. But the company’s product can get much closer to the optimal performance available from those programming layers, while shielding users from the underlying complexity and without tying them and their applications to those middleware layers, he says.
“There is a huge gap between the peak performance expected from GPUs and what applications actually achieve. We are closing that gap,” he says. “You will be shocked to find that applications, even Spark workloads running on GPUs today, get less than 10% of peak GPU FLOPS.”
One reason for the performance gap is I/O bandwidth, says Goyal. GPUs have their own local memory, which means you have to move data from host memory to GPU memory to use it. People often don’t factor data movement and I/O into their performance expectations when moving to GPUs, Goyal says, but DataPelago can even eliminate the need to worry about that.
“This virtual machine handles it in such a way that we fuse operators and run data flow graphs,” says Goyal. “Things don’t move from one domain to another. There is no data movement. We operate in streaming mode. We don’t do store-and-forward. As a result, I/O is much lower and we can sustain the GPUs at 80 to 90% of their peak performance. This is the beauty of this architecture.”
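The contrast between store-and-forward and fused, streaming execution can be sketched in a few lines of Python, with generators standing in for fused operators. This is an analogy, not DataPelago’s implementation.

```python
# Store-and-forward vs. fused streaming execution, in miniature.

rows = range(1_000_000)

def staged(rows):
    """Store-and-forward: each operator materializes a full intermediate
    result -- the analogue of shuttling buffers between host and GPU memory."""
    filtered = [r for r in rows if r % 3 == 0]   # materialized
    squared = [r * r for r in filtered]          # materialized again
    return sum(squared)

def fused(rows):
    """Fused/streaming: rows flow through all stages in one pass,
    with no intermediate buffers and hence far less I/O."""
    return sum(r * r for r in rows if r % 3 == 0)

assert staged(rows) == fused(rows)
```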
The company is targeting all kinds of data-intensive workloads that organizations try to speed up by running them on accelerated compute engines. That includes ad hoc analytics queries using SQL, Spark, Trino and Presto; ETL workloads built in SQL or Python; and streaming data workloads using frameworks like Flink. Generative AI workloads can benefit too, at both the LLM training stage and at runtime, thanks to DataPelago’s ability to accelerate retrieval-augmented generation (RAG), fine-tuning and the creation of vector embeddings for a vector database, says Goyal.
“So it’s a unified platform to do both classic lakehouse analytics and ETL, as well as GenAI data preprocessing,” he says.
Customers can run DataPelago on-premises or in the cloud. When running alongside a cloud lakehouse, such as AWS EMR or Google Cloud’s Dataproc, the system can do the same amount of work previously handled by a 100-node cluster with just a 10-node cluster, Goyal explains. While the queries themselves run 10x faster with DataPelago, the end result is a 2x improvement in total cost of ownership after accounting for licensing and maintenance, he says.
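As a back-of-the-envelope illustration, the arithmetic behind those figures might run as follows. The dollar amounts are invented; only the 100-to-10-node reduction and the 2x TCO figure come from Goyal, so the licensing line is solved for, not reported.

```python
# Hypothetical cost model reconciling a 10x node reduction with a 2x TCO gain.
node_hour = 1.00                    # assumed infra cost per node-hour
baseline = 100 * node_hour          # 100-node cluster: $100/hour

dp_infra = 10 * node_hour           # same work on 10 nodes: $10/hour
tco_gain = 2.0                      # improvement Goyal cites
dp_total = baseline / tco_gain      # => $50/hour all-in

implied_license = dp_total - dp_infra
print(f"implied licensing + maintenance: ${implied_license:.2f}/hour")
# Infra alone shrinks 10x; licensing and maintenance absorb the rest,
# which is why the net TCO improvement lands at 2x rather than 10x.
```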
“But above all, this is done without any code change,” he notes. “You write Airflow. You use Jupyter notebooks, you write Python or PySpark, Spark or Trino: whatever you work in, it remains unchanged.”
The company benchmarked its software against some of the fastest data lake platforms on the market. When run against Databricks Photon, which Goyal calls “the gold standard,” DataPelago showed a 3- to 4-fold performance improvement, he says.
Goyal says there is no reason customers can’t use the DataPelago virtualization layer to accelerate scientific computing workloads running on HPC setups, including AI as well as simulation and modeling workloads.
“If you have custom code written for specific hardware, say you’re optimizing for an A100 GPU with 80 GB of GPU memory, lots of SMs and lots of threads, you can optimize for that,” he says. “You are orchestrating your low-level code and kernels to maximize your FLOPS or operations per second. What we’ve done is provide a layer of abstraction where all of that is now done underneath, and we can hide it, so it gives extensibility while playing by the same principle.
“At the end of the day, there’s no magic here. There are only three things: compute, I/O and storage,” he continues. “As long as you design your system with impedance matching of those three things, so you’re not limited by I/O, not limited by compute and not limited by storage, then life is good.”
DataPelago already has paying customers using its software, some in pilot phases and others moving into production, Goyal says. The company plans to make the software generally available in the first quarter of 2025.
Meanwhile, the Mountain View company came out of stealth today announcing that it has raised $47 million in funding from Eclipse, Taiwania Capital, Qualcomm Ventures, Alter Venture Partners, Nautilus Venture Partners and Silicon Valley Bank, a division of First Citizens Bank.