Data Engineering Podcast

Tobias Macey

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

2 Listeners

All episodes

Best episodes

Top 10 Data Engineering Podcast Episodes

Goodpods has curated a list of the 10 best Data Engineering Podcast episodes, ranked by the number of listens and likes each episode have garnered from our listeners. If you are listening to Data Engineering Podcast for the first time, there's no better place to start than with one of these standout episodes. If you are a fan of the show, vote for your favorite Data Engineering Podcast episode by adding your comments to the episode page.

Exploring Incident Management Strategies For Data Teams

Data Engineering Podcast

03/20/22 • 57 min

Summary

Data assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and management of up-time similar to application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free... or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the worlds first data engineering bootcamp. Learn in small groups with likeminded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Francisco Alberini and Mei Tao about patterns and practices for incident management in data teams

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing some of the ways that an "incident" can manifest in a data system?
- At a high level, what are the steps and participants required to bring an incident to resolution?
The principle of incident management is familiar to application/site reliability teams. What is the current state of the art/adoption for these practices among data teams?
What are the signals that teams should be monitoring to identify and alert on potential incidents?
- Alerting is a subjective and nuanced practice, regardless of the context. What are some useful practices that you have seen and enacted to reduce alert fatigue and provide useful context in the alerts that do get sent?
  - Another aspect of this problem is the proper routing of alerts to ensure that the right person sees and acts on it. How have you seen teams deal with the challenge of delivering alerts to the right people?
When there is an active incident, what are the steps that you commonly see data teams take to understand the cause and scope of the issue?
How can...

1 Listener

Leading The Charge For The ELT Data Integration Pattern For Cloud Data Warehouses At Matillion

Data Engineering Podcast

05/02/22 • 53 min

Summary

The predominant pattern for data integration in the cloud has become extract, load, and then transform or ELT. Matillion was an early innovator of that approach and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration workflows.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more.
Your host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Matillion is and the story behind it?
What are the use cases and user personas that you are focused on supporting?
- How does that influence the focus and pace of your feature development and priorities?
How is Matillion architected?
- How have the design and goals of the system changed since you started working on it?
The ecosystems of both cloud technologies and data processing have been rapidly growing and evolving, with new patterns and paradigms being introduced. What are the elements of your produc...

1 Listener

Operational Analytics At Speed With Minimal Busy Work Using Incorta

Data Engineering Podcast

04/24/22 • 71 min

Summary

A huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit http://www.dataengineeringpodcast.com/montecarlo?utm_source=rss&utm_medium=rss to learn more.
Your host is Tobias Macey and today I’m interviewing Matthew Halliday about Incorta, an in-memory, unified data and analytics platform as a service

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Incorta is and the story behind it?
What are the use cases and customers that you are focused on?
- How does that focus inform the design and priorities of functionality in the product?
What are the technologies and workflows that Incorta might replace?
- What are the systems and services that it is intended to integrate with and extend?
Can you describe how Incorta is implemented?
- What are the core technological decisions that were necessary to make t...

1 Listener

The Future of Data Engineering: AI, LLMs, and Automation

Data Engineering Podcast

02/26/25 • 59 min

Summary
In this episode of the Data Engineering Podcast Gleb Mezhanskiy, CEO and co-founder of DataFold, talks about the intersection of AI and data engineering. He discusses the challenges and opportunities of integrating AI into data engineering, particularly using large language models (LLMs) to enhance productivity and reduce manual toil. The conversation covers the potential of AI to transform data engineering tasks, such as text-to-SQL interfaces and creating semantic graphs to improve data accessibility, and explores practical applications of LLMs in automating code reviews, testing, and understanding data lineage.
Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about

Interview

Introduction
How did you get involved in the area of data management?
modern data stack is dead
where is AI in the data stack?
"buy our tool to ship AI"
opportunities for LLM in DE workflow

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

1 Listener

Repeatable Patterns For Designing Data Platforms And When To Customize Them

Data Engineering Podcast

04/03/22 • 47 min

Summary

Building a data platform for your organization is a challenging undertaking. Building multiple data platforms for other organizations as a service without burning out is another thing entirely. In this episode Brandon Beidel from Red Ventures shares his experiences as a data product manager in charge of helping his customers build scalable analytics systems that fit their needs. He explains the common patterns that have been useful across multiple use cases, as well as when and how to build customized solutions.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl
Hey Data Engineering Podcast listeners, want to learn how the Joybird data team reduced their time spent building new integrations and managing data pipelines by 93%? Join our live webinar on April 20th. Joybird director of analytics, Brett Trani, will walk through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible. Visit www.rudderstack.com/joybird?utm_source=rss&utm_medium=rss to register today.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Brandon Beidel about his data platform journey at Red Ventures

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Red Ventures is and your role there?
- Given the relative newness of data product management, where do you draw inspiration and direction for how to approach your work?
What are the primary categories of data product that your data consumers are building/relying on?
What are the types of data sources that you are working with to power those downstream use cases?
Can you describe the size and composition/organization of your data team(s)?
How do you approach the build vs. buy decision while designing and evolving your data platform?
What are the tools/platforms/architectural and usage patterns that you and your team have developed for your platform?
- What are the primary goals and constraints that have contributed to your decisions?
- How have the goals and design of the platform changed or evolved since you started working with the team?
You recently went through the process of establishing and reporting on SLAs for your data products. Can you describe the approach you took and the useful lessons that were learned?
What are the technical and organizational components of the data work at Red Ventures that have proven most difficult?
What excites you most about the future of data engineering?
What are the most interesting, innovative, or unexpected ways that you have seen teams building more reliable data systems?
What aspects of...

Data Observability Out Of The Box With Metaplane

Data Engineering Podcast

01/08/22 • 50 min

Summary

Data observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state of the art engineering to it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Metaplane is and the story behind it?
Data observability is an area that has seen a huge amount of activity over the past couple of years. What is your working definition of that term?
- What are the areas of differentiation that you see across vendors in the space?
Can you describe how the Metaplane platform is architected?
- How have the design and goals of Metaplane changed or evolved since you started working on it?
establishing seasonality in data metrics
blind spots from operating at the level of the data warehouse
What are the most interesting, innovative, or unexpected ways that you have seen Metaplane used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaplane?
When is Metaplane the wrong choice?
What do you have planned for the future of Metaplane?

Contact Info

LinkedIn
@kevinzhenghu on Twitter

Build Your Second Brain One Piece At A Time

Data Engineering Podcast

04/28/24 • 50 min

Summary
Generative AI promises to accelerate the productivity of human collaborators. Currently the primary way of working with these tools is through a conversational prompt, which is often cumbersome and unwieldy. In order to simplify the integration of AI capabilities into developer workflows Tsavo Knott helped create Pieces, a powerful collection of tools that complements the tools that developers already use. In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain.
Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Tsavo Knott about Pieces, a personal AI toolkit to improve the efficiency of developers

Interview

Introduction
How did you get involved in machine learning?
Can you describe what Pieces is and the story behind it?
The past few months have seen an endless series of personalized AI tools launched. What are the features and focus of Pieces that might encourage someone to use it over the alternatives?
model selections
architecture of Pieces application
local vs. hybrid vs. online models
model update/delivery process
data preparation/serving for models in context of Pieces app
application of AI to developer workflows
types of workflows that people are building with pieces
What are the most interesting, innovative, or unexpected ways that you have seen Pieces used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pieces?
When is Pieces the wrong choice?
What do you have planned for the future of Pieces?

Contact Info

Parting Question

From your perspective, what is the biggest barrier to adoption of machine learning today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.

Links

Building A Data Lake For The Database Administrator At Upsolver

Data Engineering Podcast

06/02/20 • 56 min

Summary

Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Your host is Tobias Macey and today I’m interviewing Ori Rafael and Yoni Iny about building a data lake for the DBA at Upsolver

Interview

Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of what a data lake is and what it is comprised of?
We talked last in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?
- How has Upsolver changed or evolved since we last spoke?
  - How has the evolution of the underlying technologies impacted your implementation and overall product strategy?
What are some of the common challenges that accompany a data lake implementation?
How do those challenges influence the adoption or viability of a data lake?
How does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?
- What are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?
What are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform?
How is the SQL layer in Upsolver implemented?
- What are the most challenging or compl...

Achieving Data Reliability: The Role of Data Contracts in Modern Data Management

Data Engineering Podcast

07/28/24 • 49 min

Summary
Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.
Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data

Interview

Introduction
How did you get involved in the area of data management?
Can you describe the scope and purpose of data contracts in the context of this conversation?
In what way(s) do they differ from data quality/data observability?
Data contracts are also known as the API for data, can you elaborate on this?
What are the types of guarantees and requirements that you can enforce with these data contracts?
What are some examples of constraints or guarantees that cannot be represented in these contracts?
Are data contracts related to the shift-left?
Data contracts are also known as the API for data, can you elaborate on this?
The obvious application of data contracts are in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
How did you approach the design of the syntax and implementation for Soda's data contracts?
Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt, great expectations?
Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
When are data contracts the wrong choice?
What do you have planned for the future of data contracts?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The

Bringing The Metrics Layer To The Masses With Transform

Data Engineering Podcast

07/23/21 • 61 min

Summary

Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform. He explains the challenges that occur when metrics are maintained across a variety of systems, the benefits of unifying them in a common access layer, and the potential that it unlocks for everyone in the business to confidently answer questions with data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Nick Handel about Transform, a platform providing a dedicated metrics layer for your data stack

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Transform is and the story behind it?
How do you define the concept of a "metric" in the context of the data platform?
What are the general strategies in the industry for creating, managing, and consuming metrics?
- How has that been changing in the past couple of years?
  - What is driving that shift?
What are the main goals that you have for the Transform platform?
- Who are the target users? How does that focus influence your approach to the design of the platform?
How is the Transform platform architected?
- What are the core capabilities that are required for a metrics service?
What are the integration points for a metrics service?
Can you talk through the workflow of defining and consuming metrics with Transform?
- What are the challenges that teams face in establishing consensus or a shared understanding around a given metric definition?
- What are the lifecycle stages that need to be factored into th...

Show more best episodes

FAQ

How many episodes does Data Engineering Podcast have?

Data Engineering Podcast currently has 464 episodes available.

What topics does Data Engineering Podcast cover?

The podcast is about Podcasts, Technology and Education.

What is the most popular episode on Data Engineering Podcast?

The episode title 'Exploring Incident Management Strategies For Data Teams' is the most popular.

What is the average episode length on Data Engineering Podcast?

The average episode length on Data Engineering Podcast is 54 minutes.

How often are episodes of Data Engineering Podcast released?

Episodes of Data Engineering Podcast are typically released every 6 days, 23 hours.

When was the first episode of Data Engineering Podcast?

The first episode of Data Engineering Podcast was released on Jan 8, 2017.

Show more FAQ

Data Engineering Podcast

Tobias Macey

Top 10 Data Engineering Podcast Episodes

Summary

Announcements

Interview

Summary

Announcements

Interview

Summary

Announcements

Interview

Summary

Announcements

Interview

Summary

Announcements

Interview

Contact Info

Summary

Announcements

Interview

Summary

Announcements

Interview

FAQ

How many episodes does Data Engineering Podcast have?

What topics does Data Engineering Podcast cover?

What is the most popular episode on Data Engineering Podcast?

What is the average episode length on Data Engineering Podcast?

How often are episodes of Data Engineering Podcast released?

When was the first episode of Data Engineering Podcast?

Comments