During his QCon London presentation, Danilo Sato, vice president of data and AI at Thoughtworks, reaffirmed the importance of using domain-driven design and the principles of team topologies when implementing data products. This ensures effective encapsulation of data in a more complex landscape where data responsibilities “shift left” towards the developer.
Cassie Shum, the host of the “Architectures You’ve Always Wondered About” track at the conference, mentioned:
Given how important data is today as everyone moves toward AI, I can’t imagine how this presentation wouldn’t be part of the program.
Sato began the presentation by walking through several data architectures from different industries (from traditional to streaming), emphasizing that all share the familiar components: ingestion (batch or stream), data pipelines, storage, consumers, and analytics.
Additionally, the two “worlds” of the data universe – the operational world and the analytical world – are getting closer. He emphasized the importance of focusing on data product concepts rather than the underlying technologies: “architect for data products or with data products”, meaning treating data entities as real products with versioning and contracts.
Sato pointed out that the technical landscape has evolved from the simple decisions of the 2000s, where you just had to choose which DBMS to use, to an ever-changing landscape now known as Machine Learning, AI, and Data (MAD). Whatever the technology, he stressed, “modeling is still difficult”, especially since the usefulness of a model depends on the problem its users are trying to solve, and there is no way to objectively evaluate its effectiveness.
As George E. P. Box put it: “All models are wrong, but some are useful.”
Drawing on Gregor Hohpe’s “architect elevator” – the idea that an architect must be able to communicate at different levels of an enterprise, from vision to implementation details – Sato presented data architecture concerns to keep in mind, from the lowest level to the highest.
The lowest level refers to the data that flows within a system. Although historically it was acceptable to break encapsulation (“give me access to the data and I will take care of the analytics”), when thinking from a product perspective there are more aspects to consider than just operational ones such as volume, velocity, consistency, availability, latency, or access patterns.
The analytical side refers more to the data product, a socio-technical approach to managing and sharing data.
Data needs to be further encapsulated, with input, output, and control ports that manage the flow of data, as well as a versionable data product specification. If the system’s data is well encapsulated, you can even replace the underlying database without any impact on the outside world. As usual with architectural decisions, there are trade-offs to consider when choosing which type of database to use for the implementation.
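The port-based encapsulation Sato described can be sketched as a small data structure. This is a minimal, hypothetical illustration – the class names, port kinds, and fields below are assumptions, not part of any specific framework or of the talk itself:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    """One connection point of a data product (names are illustrative)."""
    name: str
    kind: str        # "input", "output", or "control"
    protocol: str    # e.g. "kafka", "sql", "rest"
    schema_ref: str  # reference to a versioned schema definition

@dataclass(frozen=True)
class DataProductSpec:
    """A versionable data product specification."""
    name: str
    version: str               # consumers pin against this version
    owner: str                 # the long-term owning team
    ports: tuple[Port, ...]

    def outputs(self) -> list[Port]:
        # Output ports form the public contract; everything behind
        # them (e.g. the underlying database) can change freely.
        return [p for p in self.ports if p.kind == "output"]

spec = DataProductSpec(
    name="customer-orders",
    version="2.1.0",
    owner="orders-team",
    ports=(
        Port("orders-cdc", "input", "kafka", "orders-cdc-v3"),
        Port("daily-orders", "output", "sql", "daily-orders-v2"),
        Port("quality-metrics", "control", "rest", "dq-metrics-v1"),
    ),
)
print([p.name for p in spec.outputs()])  # → ['daily-orders']
```

Because consumers depend only on the output ports and the spec version, swapping the storage technology behind the product does not break them.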
The middle tier refers to the data that flows between systems. At this stage, decisions have a broader impact. Exposing data to other systems creates a contract that must be respected; in effect, it exposes an API to your data product. Points to consider as part of the data contract include supported data formats, inter-organization standards, the data schema, metadata, and discoverability.
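A data contract of this kind can be enforced with a simple conformance check at the boundary. The contract fields and record shape below are hypothetical, chosen only to illustrate the idea of validating produced data against the schema promised to consumers:

```python
# A hypothetical data contract: the schema a producer promises to
# consumers, alongside format and version metadata.
CONTRACT = {
    "schema": {"order_id": str, "amount": float, "currency": str},
    "format": "json",
    "version": "1.0.0",
}

def conforms(record: dict) -> bool:
    """Check that a record has exactly the contracted fields and types."""
    schema = CONTRACT["schema"]
    return set(record) == set(schema) and all(
        isinstance(record[key], typ) for key, typ in schema.items()
    )

print(conforms({"order_id": "o-1", "amount": 9.99, "currency": "GBP"}))  # → True
print(conforms({"order_id": "o-1", "amount": "9.99"}))                   # → False
```

In practice this role is usually played by schema registries and compatibility rules rather than hand-written checks, but the principle is the same: the producer cannot change the exposed shape without violating the contract.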
When talking about data in motion, there are two main paradigms: batch (a bounded data set for a given time period) and streaming (an unbounded data set). For the latter, you need additional checks in case data arrives late or goes unprocessed. The Data Mesh book mentions three types of data products: source-aligned, aggregated, and consumer-aligned data products.
The highest level refers to organizational structure and data governance at the enterprise level. Here one learns how the business is organized, what its domains are and who owns them, as well as the seams between those domains. DDD’s strategic design helps answer these questions.
Moving to a decentralized model requires ensuring that each data product is owned for the long term, even though a team may own multiple products. A self-service platform is essential for implementing consistent data products: it avoids heterogeneous approaches when building products and facilitates the implementation of governance.
At the enterprise level, it is also important to talk about data governance: how does the enterprise manage data ownership, accessibility, security, and quality? Since governance is also about people and processes, Sato pointed to Team Topologies as a source of inspiration for how to organize teams around data.
He concluded by saying: “Thinking about data encompasses many things: data in the system, between systems, and at the enterprise level.”