Getting data from where it’s created to where it can be put to use for analytics and AI isn’t always a straight line. It’s the role of data orchestration technologies like the open source Apache Airflow project to help build data pipelines that get data where it needs to be.
Today, the Apache Airflow project is set to release its 2.10 update, the first major update since Airflow 2.9 arrived in April. Airflow 2.10 introduces hybrid execution, enabling organizations to optimize resource allocation across diverse workloads, from simple SQL queries to compute-intensive machine learning (ML) tasks. Enhanced lineage capabilities provide greater visibility into data flows, which is critical for governance and compliance.
Going a step further, Astronomer, the leading commercial vendor behind Apache Airflow, is updating its Astro platform to incorporate the open source dbt-core (Data Build Tool) technology, unifying data orchestration and transformation workflows on a single platform.
These enhancements aim to streamline data operations and bridge the gap between traditional data flows and emerging AI applications. The updates provide enterprises with a more flexible approach to data orchestration, addressing the challenges of managing diverse data environments and AI processes.
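The announcement doesn’t spell out the mechanics, but Astronomer’s open source Cosmos library (not named here) already illustrates the pattern of running a dbt-core project as a first-class Airflow DAG. Below is a minimal sketch assuming that pattern; the project path, profile names and dag_id are illustrative placeholders.

```python
# Minimal sketch: rendering a dbt-core project as an Airflow DAG with
# Astronomer's open source Cosmos library. All paths, profile names and
# the dag_id below are illustrative placeholders.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dbt_orders_dag = DbtDag(
    dag_id="dbt_orders",
    # Points at a dbt project checked into the Airflow deployment.
    project_config=ProjectConfig("/usr/local/airflow/dbt/orders_project"),
    # Reuses an existing dbt profiles.yml for warehouse credentials.
    profile_config=ProfileConfig(
        profile_name="orders_project",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
    ),
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```

The appeal of this pattern is that each dbt model becomes a visible, retryable Airflow task, so transformation and orchestration share one scheduler and one monitoring surface.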
“If you think about why you adopt orchestration in the first place, it’s because you want to coordinate things across the entire data supply chain, you want that central pane of glass,” Julian LaNeve, CTO of Astronomer, told VentureBeat.
How Airflow 2.10 Improves Data Orchestration with Hybrid Execution
One of the big updates in Airflow 2.10 is the introduction of a feature called hybrid execution.
Prior to this update, Airflow users had to select a single execution mode for an entire deployment, such as running everything on a Kubernetes cluster or using Airflow’s Celery executor. Kubernetes is better suited to heavier compute tasks that require granular, per-task control, while Celery is lighter and more efficient for simpler tasks.
However, as LaNeve explained, real-world data pipelines often mix workload types. In a single Airflow deployment, an organization might only need to run a simple SQL query somewhere to get data, while a machine learning workflow connected to that same pipeline requires a heavier Kubernetes deployment to run. With hybrid execution, both can now coexist.
The hybrid execution capability differs significantly from previous versions of Airflow, which forced a single choice for the entire deployment. Now users can optimize each component of a data pipeline for the appropriate level of compute resources and control.
“Being able to choose at the pipeline and task level, instead of having everything use the same execution mode, I think opens up a whole new level of flexibility and efficiency for Airflow users,” LaNeve said.
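In practice, this means the executor can be chosen per task rather than per deployment. The following is a minimal sketch of what that looks like in Airflow 2.10, assuming a deployment configured with both the Celery and Kubernetes executors; the DAG and task names are illustrative.

```python
# Minimal sketch of Airflow 2.10 hybrid execution. It assumes both
# executors are enabled in the deployment's config, e.g.:
#   [core]
#   executor = CeleryExecutor,KubernetesExecutor
# The first executor listed acts as the default. DAG and task names
# here are illustrative.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def hybrid_pipeline():
    # Lightweight task: runs on the default (Celery) executor.
    @task
    def extract_rows():
        # Imagine a simple SQL query against a warehouse here.
        return [1, 2, 3]

    # Compute-heavy ML task: routed to Kubernetes for per-task
    # resource isolation and control.
    @task(executor="KubernetesExecutor")
    def train_model(rows: list):
        print(f"training on {len(rows)} rows")

    train_model(extract_rows())


hybrid_pipeline()
```

Only the tasks that need Kubernetes-level isolation pay its scheduling overhead, while everything else stays on the lighter Celery path.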
Why Data Lineage in Data Orchestration Matters for AI
Understanding where data comes from and how it flows is the domain of data lineage. It’s a critical capability for traditional data analytics as well as for emerging AI workloads, where organizations need to trace data provenance.
Prior to Airflow 2.10, data lineage tracking had some limitations. LaNeve said that with the new lineage capabilities, Airflow will be able to better capture dependencies and data flow within pipelines, even for custom Python code. This improved lineage tracking is crucial for AI and machine learning workflows, where data quality and provenance are paramount.
“A key element of any next-generation AI application that people are building today is trust,” LaNeve said.
If an AI system provides an incorrect or unreliable result, users won’t continue to rely on it. Trustworthy lineage information helps solve that problem by providing a clear, verifiable trail showing how data was obtained, transformed, and used to train a model. Strong lineage capabilities also enable more comprehensive data governance and security controls around the sensitive information used in AI applications.
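Airflow already exposes lineage through task-level inlets and outlets, which the OpenLineage integration can forward to a lineage backend. Below is a minimal sketch of declaring lineage on a custom Python task; the dataset URIs and names are illustrative.

```python
# Minimal sketch of declaring lineage on an Airflow task with the
# inlets/outlets API and Dataset objects. URIs below are illustrative;
# with the OpenLineage provider installed, Airflow can emit these
# relationships to a lineage backend.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

raw_orders = Dataset("s3://example-bucket/raw/orders.parquet")
clean_orders = Dataset("s3://example-bucket/clean/orders.parquet")


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    # Declaring inlets/outlets records which datasets this task reads
    # and writes, even though the transformation is custom Python code.
    @task(inlets=[raw_orders], outlets=[clean_orders])
    def transform():
        print("read raw orders, write cleaned orders")

    transform()


orders_pipeline()
```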
Waiting for Airflow 3.0
“Data governance, security and privacy are becoming more important than ever because you want to make sure you have complete control over how your data is used,” LaNeve said.
While Airflow 2.10 brings several notable improvements, LaNeve is already looking forward to Airflow 3.0.
According to LaNeve, the goal of Airflow 3.0 is to modernize the technology for the era of next-generation AI. The main priorities are to make the platform more language-agnostic, so users can write tasks in any language, and to make Airflow more data-aware, shifting the focus from orchestrating processes to managing data flows.
“We want to make sure Airflow is the standard for orchestration for the next 10 to 15 years,” he said.