Today, I’m very excited to announce the general availability of Amazon SageMaker Lakehouse, a feature that unifies data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse is part of the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI that brings together widely adopted AWS machine learning and analytics capabilities into one integrated experience.
Customers want to do more with their data. To accelerate their analytics journey, they choose the storage and databases best suited to that data. The data ends up distributed across data lakes, data warehouses, and different applications, creating data silos that make it difficult to access and use. This fragmentation leads to duplicate data copies and complex data pipelines, which increases costs for the organization. Additionally, customers are forced to use specific query engines and tools, because how and where the data is stored limits their options, and this restriction hinders their ability to work with the data as they would like. Finally, inconsistent access to data makes it difficult for customers to make informed business decisions.
SageMaker Lakehouse addresses these challenges by helping you unify data across Amazon S3 data lakes and Amazon Redshift data warehouses. It gives you the flexibility to access and query data in place with all Apache Iceberg-compatible engines and tools. With SageMaker Lakehouse, you can centrally define granular permissions and apply them across multiple AWS services, simplifying data sharing and collaboration. Getting data into your SageMaker Lakehouse is simple. In addition to seamlessly accessing data from your existing data lakes and data warehouses, you can use zero-ETL integrations from operational databases such as Amazon Aurora, Amazon RDS for MySQL, and Amazon DynamoDB, as well as from applications such as Salesforce and SAP. SageMaker Lakehouse fits into your existing environments.
Get started with SageMaker Lakehouse
For this demonstration, I’m using a preconfigured environment with multiple AWS data sources. I go to the Amazon SageMaker Unified Studio console (preview), which provides an integrated development experience for all your data and AI. With Unified Studio, you can seamlessly access and query data from a variety of sources through SageMaker Lakehouse, while using familiar AWS tools for analytics and AI/ML.
This is where you can create and manage projects, which serve as shared workspaces. These projects allow team members to collaborate, work with data, and develop AI models together. Creating a project automatically configures AWS Glue Data Catalog databases, creates catalogs for Redshift Managed Storage (RMS) data, and grants the necessary permissions. You can start by creating a new project or continue with an existing one.
To create a new project, I choose Create project.
I have two project profile options for building and working with a lakehouse. The first is Data analytics and AI-ML model development, where you can analyze data and build ML and generative AI models powered by Amazon EMR, AWS Glue, Amazon Athena, Amazon SageMaker AI, and SageMaker Lakehouse. The second is SQL analytics, where you can analyze your data in SageMaker Lakehouse using SQL. For this demo, I proceed with SQL analytics.
I enter a project name in the Project name field and choose SQL analytics under Project profile. I choose Continue.
Under Tools, I enter the values for all the parameters: the values to create my lakehouse databases, the values to create my Amazon Redshift Serverless resources, and finally a name for my catalog under Lakehouse Catalog.
In the next step, I review the resources and choose Create project.
Once the project is created, I look at the project details.
I go to Data in the navigation pane and choose the + (plus) sign to add data. I choose Create catalog to create a new catalog, then choose Add data.
Once the RMS catalog is created, I choose Build in the navigation pane, then choose Query Editor under Data Analysis & Integration to create a schema under the RMS catalog, create a table, and load the table with sample sales data, along the lines of the sketch below.
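Here is a minimal sketch of the kind of SQL this step involves; the schema, table, and column names (sales, store_sales, and so on) are hypothetical placeholders rather than the exact statements from the demo:

    CREATE SCHEMA sales;

    CREATE TABLE sales.store_sales (
        sale_id   INTEGER,
        sale_date DATE,
        product   VARCHAR(100),
        amount    DECIMAL(10, 2)
    );

    -- Load a few rows of sample sales data
    INSERT INTO sales.store_sales VALUES
        (1, '2024-11-01', 'Widget', 19.99),
        (2, '2024-11-02', 'Gadget', 24.50),
        (3, '2024-11-03', 'Widget', 19.99);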
After entering the SQL queries into the designated cells, I choose Select data source from the drop-down menu on the right to establish a connection to the Amazon Redshift data warehouse. This connection allows me to run the queries and retrieve the desired data from the database.
Once the database connection is established, I choose Run all to run all the queries and monitor the execution progress until all the results are displayed.
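To confirm the load, a simple aggregation over the new table (reusing the hypothetical names from the sketch above) is enough:

    SELECT product,
           COUNT(*)    AS sales_count,
           SUM(amount) AS total_amount
    FROM sales.store_sales
    GROUP BY product
    ORDER BY total_amount DESC;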
For this demonstration, I’m using two additional preconfigured catalogs. A catalog is a container that organizes your lakehouse object definitions, such as schemas and tables. The first is an Amazon S3 data lake catalog (catalog-test-s3) that stores customer records containing detailed transactional and demographic information. The second is a lakehouse catalog (unsubscribe_lakehouse) dedicated to storing and managing customer churn data. This integration creates a unified environment where I can analyze customer behavior alongside churn predictions.
In the navigation pane, I choose Data and locate my catalogs under the Lakehouse section. SageMaker Lakehouse offers several analytics options, including Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.
Note that you must choose the Data analytics and AI-ML model development profile when you create a project if you want to use the Open in Jupyter Lab notebook option. If you choose Open in Jupyter Lab notebook, you can interact with SageMaker Lakehouse using Apache Spark via Amazon EMR 7.5.0 or AWS Glue 5.0 by configuring the Iceberg REST catalog, allowing you to process data in your data lakes and data warehouses in a unified way.
This is what a query looks like using the Jupyter Lab notebook:
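The following is a hedged sketch of such a query, written as Spark SQL you could pass to spark.sql(...) in the notebook. It assumes the Spark session is already configured against the Iceberg REST catalog; the catalog alias rms_catalog and the schema, table, and column names are hypothetical:

    -- Aggregate the sample sales data through the Iceberg REST catalog;
    -- rms_catalog stands for whatever catalog name the session is configured with
    SELECT product,
           SUM(amount) AS total_amount
    FROM rms_catalog.sales.store_sales
    GROUP BY product
    ORDER BY total_amount DESC;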
I continue by choosing Query with Athena. With this option, I can use Amazon Athena’s serverless query capability to analyze sales data directly in SageMaker Lakehouse. When I select Query with Athena, the query editor launches automatically, providing a workspace where I can compose and execute SQL queries against the lakehouse. This integrated query environment offers a seamless experience for data exploration and analysis, complemented by syntax highlighting and autocomplete features to improve productivity.
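For instance, a query over the customer records in the S3 data lake catalog might look like the following; catalog-test-s3 is the catalog from this demo, while the database, table, and column names are hypothetical:

    SELECT customer_id, region, lifetime_value
    FROM "catalog-test-s3"."customers"."customer_records"
    WHERE region = 'EMEA'
    LIMIT 10;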
I can also use the Query with Redshift option to run SQL queries against the lakehouse.
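Because both catalogs surface through the lakehouse, a single engine can join across them, for example to relate customer records to churn predictions. A hedged sketch, assuming both catalogs are mounted as databases in the Redshift query editor and with all schema, table, and column names hypothetical:

    -- Join S3 data lake customer records with churn predictions
    SELECT r.customer_id, r.segment, c.churn_score
    FROM "catalog-test-s3"."customers"."customer_records" AS r
    JOIN "unsubscribe_lakehouse"."churn"."predictions" AS c
        ON r.customer_id = c.customer_id
    WHERE c.churn_score > 0.8;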
SageMaker Lakehouse provides a comprehensive solution for modern data management and analytics. By unifying data access across multiple sources, supporting a wide range of analytics and ML engines, and providing fine-grained access controls, SageMaker Lakehouse helps you get the most out of your data assets. Whether you’re working with data lakes in Amazon S3, data warehouses in Amazon Redshift, or operational databases and applications, SageMaker Lakehouse provides the flexibility and security you need to drive innovation and make data-informed decisions. You can use hundreds of connectors to integrate data from various sources. Additionally, you can access and query data in place with federated query capabilities across third-party data sources.
Now available
You can access SageMaker Lakehouse through the AWS Management Console, APIs, the AWS Command Line Interface (AWS CLI), or AWS SDKs. You can also access it through the AWS Glue Data Catalog and AWS Lake Formation. SageMaker Lakehouse is available in the US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Asia Pacific (Sydney), Asia Pacific (Hong Kong), Asia Pacific (Tokyo), and Asia Pacific (Singapore) AWS Regions.
For pricing information, visit Amazon SageMaker Lakehouse pricing.
For more information about Amazon SageMaker Lakehouse and how it can simplify your data analytics and AI/ML workflows, visit Amazon SageMaker Lakehouse documentation.