Data Engineering Weekly
Ananth Packkildurai
www.dataengineeringweekly.com
Top 10 Data Engineering Weekly Episodes
Goodpods has curated a list of the 10 best Data Engineering Weekly episodes, ranked by the number of listens and likes each episode has garnered from our listeners. If you are listening to Data Engineering Weekly for the first time, there's no better place to start than with one of these standout episodes. If you are a fan of the show, vote for your favorite Data Engineering Weekly episode by adding your comments to the episode page.
DEW #119: Netflix's Scaling Media Machine Learning at Netflix, Open Table Formats Square Off in Lakehouse Data Smackdown & Building a semantic layer in Preset (Superset) with dbt
We are super excited to be back discussing Data Engineering Weekly newsletter articles every week. We take two or three articles from each week's Data Engineering Weekly edition and go through an in-depth analysis.
From Data Engineering Weekly edition #119, we are taking three articles:
- #1 Netflix's article about Scaling Media Machine Learning at Netflix
https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243
- #2 Alex Woodie's article about Open Table Formats Square Off in Lakehouse Data Smackdown
https://www.datanami.com/2023/02/15/open-table-formats-square-off-in-lakehouse-data-smackdown/
- #3 Plum Living's article about Building a semantic layer in Preset (Superset) with dbt
https://medium.com/plum-living/building-a-semantic-layer-in-preset-superset-with-dbt-71ee3238fc20
We referenced David Jayatillake's article about Metricalypse in the show.
DEW #120: The Case for Data Contracts, Action-Position data quality assessment framework & Stop emphasizing the Data Catalog
Data Engineering Weekly
03/12/23 • 36 min
Please read Data Engineering Weekly Edition #120
Topic 1: Colin Campbell: The Case for Data Contracts - Preventative data quality rather than reactive data quality
In this episode, we focus on the importance of data contracts in preventing data quality issues. We discuss an article by Colin Campbell highlighting the need for data contracts and the market scope for data contract solutions. We also touch on the idea that data creation will become a decentralized process and the role of tools like data contracts in enabling successful decentralized data modeling. We emphasize the importance of creating high-quality data and the need for both technological and organizational solutions to achieve this goal.
Key highlights of the conversation
- "Preventative data quality rather than reactive data quality. It should start with contracts." - Colin Campbell. - Author of the article
- "Contracts put a preventive structure in place" - Ashwin.
- "The successful data-driven companies all do one thing very well. They create high-quality data." - Ananth.
Link:
https://uncomfortablyidiosyncratic.substack.com/p/the-case-for-data-contracts
https://www.dataengineeringweekly.com/p/introducing-schemata-a-decentralized
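To make the preventative idea concrete, here is a minimal sketch of contract-first validation on the producer side, assuming a pydantic model as the contract; the event shape and field names are illustrative, not from Colin Campbell's article.
```python
# A minimal sketch of contract-first, preventative data quality:
# the producer validates an event against the agreed contract *before*
# publishing it, so bad data never reaches downstream consumers.
# The event shape and field names are illustrative assumptions.
from pydantic import BaseModel, Field, ValidationError

class OrderCreated(BaseModel):
    order_id: str = Field(min_length=1)
    user_id: str = Field(min_length=1)
    amount_cents: int = Field(ge=0)  # contract: amounts are never negative

def publish(payload: dict) -> None:
    try:
        event = OrderCreated(**payload)  # enforce the contract at the source
    except ValidationError as err:
        raise ValueError(f"Contract violation; refusing to publish: {err}")
    print(f"published {event.order_id}")  # stand-in for the real message bus

publish({"order_id": "o-1", "user_id": "u-9", "amount_cents": 1299})
```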
Topic 2: Yerachmiel Feltzman: Action-Position data quality assessment framework
In this conversation, we discuss a framework for data quality assessment called the Action-Position framework. The framework helps define what action should be taken based on the severity of a data quality problem. We also discuss two patterns for data quality: Write-Audit-Publish (WAP) and Audit-Write-Publish (AWP). The WAP pattern writes incoming data to a staging area, audits it there, and only then publishes it to production, while the AWP pattern audits the data first, then writes and publishes it. We encourage readers to share their best practices for addressing data quality issues; see the sketch after the link below for the WAP flow.
Are you using any data quality framework in your organization? Do you have any best practices for how you address data quality issues? What do you think of the Action-Position data quality framework? Please add your comments in the Substack chat.
Link:
Dremio WAP pattern: https://www.dremio.com/resources/webinars/the-write-audit-publish-pattern-via-apache-iceberg/
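For intuition, here is a library-free sketch of the WAP flow; production systems typically implement the staging step with Iceberg branches or staging tables, as the Dremio webinar above describes.
```python
# A library-free sketch of Write-Audit-Publish (WAP): new data lands in
# a staging area, is audited there, and is only promoted to the
# production table if the audit passes. Illustrative only.

def audit(rows: list[dict]) -> bool:
    # Example audit: primary keys present and amounts non-negative.
    return all(r.get("id") is not None and r.get("amount", 0) >= 0 for r in rows)

def write_audit_publish(new_rows: list[dict], production: list[dict]) -> None:
    staging = list(new_rows)      # 1. WRITE: land data in staging, not production
    if not audit(staging):        # 2. AUDIT: validate the staged data
        raise RuntimeError("Audit failed; production left untouched")
    production.extend(staging)    # 3. PUBLISH: promote only after the audit passes

prod_table: list[dict] = []
write_audit_publish([{"id": 1, "amount": 42}], prod_table)
```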
Topic 3: Guy Fighel - Stop emphasizing the Data Catalog
We discuss the limitations of data catalogs and the author's view of the semantic layer as an alternative. The author argues that data catalogs are passive and quickly become outdated, and that a stronger contract with enforced data quality could be a better solution. We also highlight the cost factors of implementing a data catalog and suggest that a more decentralized approach may be necessary to keep up with the increasing number of data sources. Innovation in this space is needed to improve the discoverability and consumption of data assets within organizations.
Something to think about from this conversation: "If you don't catalog everything and we only catalog what is required for the purpose of business decision-making, does that solve the data catalog problem in an organization?"
Link:
https://www.linkedin.com/pulse/stop-emphasizing-data-catalog-guy-fighel/
https://www.dataengineeringweekly.com/p/data-catalog-a-broken-promise
DEW #129: DoorDash's Generative AI, Europe data salary, Data Validation with Great Expectations, Expedia's Event Sourcing
Data Engineering Weekly
05/27/23 • 31 min
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.
On DEW #129, we selected the following articles:
DoorDash identifies Five big areas for using Generative AI
Generative AI has taken the industry by storm, and every company is trying to determine what it means for them. DoorDash writes about its exploration of Generative AI and the areas where it can boost the business:
- The assistance of customers in completing tasks
- Better tailored and interactive discovery [Recommendation]
- Generation of personalized content and merchandising
- Extraction of structured information
- Enhancement of employee productivity
https://doordash.engineering/2023/04/26/doordash-identifies-five-big-areas-for-using-generative-ai/
Mikkel Dengsøe: Europe data salary benchmark 2023
Fascinating findings on data salaries across various European countries. The key findings are:
- Germany-based roles pay less.
- London and Dublin-based roles have the highest compensations. The Dublin sample is skewed to more senior roles, with 55% of reported salaries being senior, which is more indicative of the sample than jobs in Dublin paying higher than in London.
- Jobs at the 75th percentile in Amsterdam, London, and Dublin pay nearly 50% more than those in Berlin.
https://medium.com/@mikldd/europe-data-salary-benchmark-2023-b68cea57923d
Trivago: Implementing Data Validation with Great Expectations in Hybrid Environments
The article by Trivago discusses the integration of data validation with Great Expectations. It presents a well-balanced case study that emphasizes the significance of data validation and the necessity for sophisticated statistical validation methods.
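As a flavour of what such validation looks like, here is a minimal sketch using Great Expectations' classic pandas API; the column names and thresholds are illustrative, not from the Trivago article.
```python
# A minimal sketch of data validation with Great Expectations' classic
# pandas API (pre-1.0 versions). Column names and bounds are
# illustrative assumptions, not Trivago's actual checks.
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({
    "hotel_id": [1, 2, 3],
    "price": [120.0, 89.5, 240.0],
}))

df.expect_column_values_to_not_be_null("hotel_id")
df.expect_column_values_to_be_between("price", min_value=0, max_value=10_000)

result = df.validate()
print(result.success)  # True if every expectation passed
```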
Expedia: How Expedia Reviews Engineering Is Using Event Streams as a Source Of Truth
“Events as a source of truth” is a simple but powerful idea: persist the state of a business entity as the sequence of state-changing events. How do you build such a system? Expedia writes about its review stream system to demonstrate how it adopted the event-first approach.
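To illustrate the idea (not Expedia's implementation), here is a toy sketch in which the current state of a review is derived by replaying its event log.
```python
# A toy event-sourcing sketch: the event log is the source of truth,
# and state is a fold over it. Entity and event names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewSubmitted:
    review_id: str
    text: str

@dataclass(frozen=True)
class ReviewApproved:
    review_id: str

def apply(state: dict, event) -> dict:
    # Each event transitions the entity to a new state.
    if isinstance(event, ReviewSubmitted):
        return {"id": event.review_id, "text": event.text, "status": "pending"}
    if isinstance(event, ReviewApproved):
        return {**state, "status": "approved"}
    return state

events = [ReviewSubmitted("r1", "Great stay!"), ReviewApproved("r1")]
state: dict = {}
for e in events:
    state = apply(state, e)
print(state)  # {'id': 'r1', 'text': 'Great stay!', 'status': 'approved'}
```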
DEW #124: State of Analytics Engineering, ChatGPT, LLM & the Future of Data Consulting, Unified Streaming & Batch Pipeline, and Kafka Schema Management
Data Engineering Weekly
04/29/23 • 36 min
Welcome to another episode of Data Engineering Weekly. Aswin and I select 3 to 4 articles from each edition of Data Engineering Weekly and discuss them from the author’s and our perspectives.
On DEW #124 [https://www.dataengineeringweekly.com/p/data-engineering-weekly-124], we selected the following articles:
dbt: State of Analytics Engineering
dbt publishes the State of Analytics Engineering report. If you follow Data Engineering Weekly, we actively talk about data contracts and how data is a collaboration problem, not just an ETL problem. The survey validates this: two of the top five concerns are data ownership and collaboration between data producers and consumers. Here are the top 5 key learnings from the report.
- 46% of respondents plan to invest more in data quality and observability this year— the most popular area for future investment.
- Lack of coordination between data producers and data consumers is perceived by all respondents to be this year’s top threat to the ecosystem.
- Data and analytics engineers are most likely to believe they have clear goals and are most likely to agree their work is valued.
- 71% of respondents rated data team productivity and agility positively, while data ownership ranked as a top concern for most.
- Analytics leaders are most concerned with stakeholder needs. 42% say their top concern is “Data isn’t where business users need it.”
https://www.getdbt.com/state-of-analytics-engineering-2023/
Rittman Analytics: ChatGPT, Large Language Models and the Future of dbt and Analytics Consulting
It is fascinating to read about the potential impact of LLMs on the future of dbt and analytics consulting. The author predicts we are at the beginning of the industrial revolution of computing.
Future iterations of generative AI, public services such as ChatGPT, and domain-specific versions of these underlying models will make IT and computing to date look like the spinning jenny that was the start of the industrial revolution.
🤺🤺🤺🤺🤺🤺🤺🤺🤺 May the best LLM win!! 🤺🤺🤺🤺🤺🤺
LinkedIn: Unified Streaming And Batch Pipelines At LinkedIn: Reducing Processing time by 94% with Apache Beam
One of the curses of adopting Lambda Architecture is rewriting the business logic in both the streaming and batch pipelines. Spark attempts to solve this with a unified RDD model for streaming and batch; Flink introduced its Table API to bridge the gap with batch processing. LinkedIn writes about its experience adopting Apache Beam, whose unified pipeline abstraction can run on any target data processing runtime, such as Samza, Spark, and Flink.
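As a flavour of Beam's unified model, here is a minimal word-count sketch: the same pipeline code runs in batch or streaming depending on the chosen runner and source. The logic is illustrative, not LinkedIn's pipeline.
```python
# A minimal sketch of Beam's unified pipeline abstraction: the pipeline
# is written once, and the runner (Direct, Flink, Spark, Samza, ...) is
# a deployment-time choice. Word count here is illustrative only.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(input_path: str, output_path: str) -> None:
    options = PipelineOptions()  # select DirectRunner, FlinkRunner, etc. here
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Split" >> beam.FlatMap(str.split)
            | "Pair" >> beam.Map(lambda w: (w, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda w, c: f"{w}: {c}")
            | "Write" >> beam.io.WriteToText(output_path)
        )

if __name__ == "__main__":
    run("input.txt", "counts")
```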
Wix: How Wix manages Schemas for Kafka (and gRPC) used by 2000 microservices
Wix writes about managing schemas for 2,000 (😬) microservices by standardizing the schema structure with protobuf and a Kafka schema registry. Highlights include patterns such as an internal Wix Docs approach and integrating documentation publishing into the CI/CD pipelines.
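As a rough sketch of the registry side, assuming Confluent's Python client (confluent_kafka), this registers a protobuf schema under a subject; the subject name and message fields are illustrative, not Wix's internal setup.
```python
# A hedged sketch: registering a protobuf schema with a schema registry,
# assuming Confluent's Python client. Subject and fields are illustrative.
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

client = SchemaRegistryClient({"url": "http://localhost:8081"})

proto_schema = """
syntax = "proto3";
message ReviewCreated {
  string review_id = 1;
  string site_id = 2;
  string body = 3;
}
"""

# Registering under "<topic>-value" follows the common subject-naming strategy.
schema_id = client.register_schema(
    "reviews-value", Schema(proto_schema, schema_type="PROTOBUF")
)
print(f"registered schema id: {schema_id}")
```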
DEW #123: Generative AI at BuzzFeed, Building OnCall Culture & Dimensional Modeling at WhatNot
Data Engineering Weekly
04/22/23 • 33 min
Welcome to another episode of Data Engineering Weekly Radio. Ananth and Aswin discussed a blog from BuzzFeed that shares lessons learned from building products powered by generative AI. The blog highlights how generative AI can be integrated into a company's work culture and workflow to enhance creativity rather than replace jobs. BuzzFeed provided their employees with intuitive access to APIs and integrated the technology into Slack for better collaboration.
Some of the lessons learned from BuzzFeed's experience include:
- Getting the technology into the hands of creative employees to amplify their creativity.
- Effective prompts are a result of close collaboration between writers and engineers.
- Moderation is essential and requires building guardrails into the prompts.
- Demystifying the technical concepts behind the technology can lead to better applications and tools.
- Educating users about the limitations and benefits of generative AI.
- The economics of using generative AI can be challenging, especially for hands-on business models.
The conversation also touched upon the non-deterministic nature of generative AI systems, the importance of prompt engineering, and the potential challenges in integrating generative AI into data engineering workflows. As technology progresses, it is expected that the economics of generative AI will become more favorable for businesses.
https://tech.buzzfeed.com/lessons-learned-building-products-powered-by-generative-ai-7f6c23bff376
Moving on, we discuss the importance of on-call culture in data engineering teams. We emphasize the significance of data pipelines and their impact on businesses. With a focus on communication, ownership, and documentation, we highlight how data engineers should prioritize and address issues in data systems.
We also discuss the importance of on-call rotation, runbooks, and tools like PagerDuty and Airflow to streamline alerts and responses. Additionally, we mention the value of having an on-call handoff process, where one engineer summarizes their experiences and alerts during their on-call period, allowing for improvements and a better understanding of common issues.
Overall, this conversation stresses the need for a learning culture within data engineering teams, focusing on building robust systems, improving team culture, and increasing productivity.
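As a small illustration of routing pipeline failures to the on-call engineer, here is a sketch assuming Airflow 2.4+; the page_on_call callback is a hypothetical stand-in for a PagerDuty (or similar) integration.
```python
# A sketch of alerting on pipeline failure, assuming Airflow 2.4+.
# page_on_call is a hypothetical stand-in for a PagerDuty integration.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def page_on_call(context):
    ti = context["task_instance"]
    # In a real setup, this would call the incident tool's API instead.
    print(f"PAGE on-call: {ti.dag_id}.{ti.task_id} failed at {context['ts']}")

with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"on_failure_callback": page_on_call},
) as dag:
    PythonOperator(task_id="load", python_callable=lambda: None)
```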
Finally, Ananth and Aswin discuss an article about adopting dimensional data modeling in hyper-growth companies. We appreciate the learning culture and emphasize balancing speed, maturity, scale, and stability.
We highlight how dimensional modeling was initially essential due to limited computing and expensive storage. However, as storage became cheaper and computing more accessible, dimensional modeling was often overlooked, leading to data junkyards. In the current landscape, it's important to maintain business-aware domain-driven data marts and acknowledge that dimensional modeling still has a role.
The conversation also touches upon the challenges of tracking slowly changing dimensions and the responsibility of data architects, engineers, and analytical engineers in identifying and implementing such dimensions. We discuss the need for a fine balance between design thinking and experimentation and stress the importance of finding the right mix of correctness and agility for each company.
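To ground the slowly-changing-dimension discussion, here is a toy Type 2 sketch: rather than overwriting an attribute, the current row is closed and a new versioned row is appended. Real implementations usually do this with MERGE statements in the warehouse; the table and columns here are illustrative.
```python
# A toy sketch of a Type 2 slowly changing dimension: instead of
# overwriting an attribute, close the current row and append a new
# versioned one. Table and column names are illustrative assumptions.
from datetime import date

dim_customer = [
    {"customer_id": 1, "city": "Berlin",
     "valid_from": date(2022, 1, 1), "valid_to": None, "is_current": True},
]

def apply_scd2(dim: list[dict], customer_id: int, new_city: str, as_of: date) -> None:
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # attribute unchanged; keep the current version
            row["valid_to"] = as_of   # close out the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": as_of, "valid_to": None, "is_current": True})

apply_scd2(dim_customer, 1, "Munich", date(2023, 4, 1))
# dim_customer now holds both the closed Berlin row and the current Munich row.
```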
Data Engineering Weekly: Reflecting on 2023 and Looking Ahead to 2024
Data Engineering Weekly
12/25/23 • 38 min
Welcome to another insightful edition of Data Engineering Weekly. As we approach the end of 2023, it's an opportune time to reflect on the key trends and developments that have shaped the field of data engineering this year. In this episode, Ananth and Ashwin summarize the crucial points that defined the year.
Understanding the Maturity Model in Data Engineering
A significant part of our discussion revolved around the maturity model in data engineering. It's crucial for organizations to recognize their current position in the data maturity spectrum to make informed decisions about adopting new technologies. This approach ensures that adopting new tools and practices aligns with the organization's readiness and specific needs.
The Rising Impact of AI and Large Language Models
2023 witnessed a substantial impact of AI and large language models in data engineering. These technologies are increasingly automating processes like ETL, improving data quality management, and evolving the landscape of data tools. Integrating AI into data workflows is not just a trend but a paradigm shift, making data processes more efficient and intelligent.
Lakehouse Architectures: The New Frontier
Lakehouse architectures have been at the forefront of data engineering discussions this year. The key focus has been interoperability among different data lake formats and the seamless integration of structured and unstructured data. This evolution marks a significant step towards more flexible and powerful data management systems.
The Modern Data Stack: A Critical Evaluation
The modern data stack (MDS) has been a hot topic, with debates around its sustainability and effectiveness. While MDS has driven hyper-specialization in product categories, challenges in integration and overlapping tool categories have raised questions about its long-term viability. The future of MDS remains a subject of keen interest as we move into 2024.
Embracing Cost Optimization
Cost optimization has emerged as a priority in data engineering projects. With the shift to cloud services, managing costs effectively while maintaining performance has become a critical concern. This trend underscores the need for efficient architectures that balance performance with cost-effectiveness.
Streaming Architectures and the Rise of Apache Flink
Streaming architectures have gained significant traction, with Apache Flink leading the way. Its growing adoption highlights the industry's shift towards real-time data processing and analytics. The support and innovation around Apache Flink suggest a continued focus on streaming architectures in the coming year.
Looking Ahead to 2024
As we look towards 2024, there's a sense of excitement about the potential changes in fundamental layers like S3 Express and the broader impact of large language models. The anticipation is for more intelligent data platforms that effectively combine AI capabilities with human expertise, driving innovation and efficiency in data engineering.
In conclusion, 2023 has been a year of significant developments and shifts in data engineering. As we move into 2024, the focus will likely be on refining these trends and exploring new frontiers in AI, lakehouse architectures, and streaming technologies. Stay tuned for more updates and insights in the next editions of Data Engineering Weekly. Happy holidays, and here's to a groundbreaking 2024 in the world of data engineering!
DEW #121: Data Product @ Oda, Reflection Talking with Data Leaders & Great Migration To Snowflake
Data Engineering Weekly
03/22/23 • 43 min
Subscribe to www.dataengineeringweekly.com
From Data Engineering Weekly Edition #121, we took the following articles
Oda: Data as a product at Oda
Oda writes an exciting blog about “Data as a Product,” describing why we must treat data as a product, dashboard as a product, and the ownership model for data products.
https://medium.com/oda-product-tech/data-as-a-product-at-oda-fda97695e820
The blog highlights six key principles for value creation with data:
- Domain knowledge + discipline expertise
- Distributed Data Ownership and shared Data Ownership
- Data as a Product
- Enablement over Handover
- Impact through Exploration and Experimentation
- Proactive attitude towards Data Privacy & Ethics
Ashwin & Ananth Conversation Highlights
- "Oda builds the whole data product principle & the implementation structure being built on top of the core values, instead of reflecting any industry jargons.”
- "Don't make me think. The moment you make your users think, you lose your value proposition as a platform or a product.”
- "The platform enables the domain; domain enables your consumer. It's a chain of value creation going on top and like simplifying everyone's life, accessing data, making informed decisions.”
- "I think putting that, documenting it, even at the start of it, I think that's where the equations start proving themselves. And that's essentially what product thinking is all about.”
Peter Bruins: Some reflections on talking with Data leaders
Data Mesh, Data Product, Data Contract: all of these concepts are trying to address this problem, and it is a billion-dollar problem to solve. The author leaves a bigger question: ownership plays a central role in all these concepts, but what is the incentive to take ownership?
https://www.linkedin.com/pulse/some-reflections-talking-data-leaders-peter-bruins/
Ashwin & Ananth Conversation Highlights
- "Ownership. It's all about the ownership." - Peter Burns.
- "The weight of the success (growth of adoption) of the data leads to its failure.
Faire: The great migration from Redshift to Snowflake
Is Redshift dying? I’m seeing an increasing pattern of people migrating from Redshift to Snowflake or a Lakehouse. Faire wrote a detailed blog on the reasoning behind its Redshift-to-Snowflake migration, the journey, and the key takeaways.
https://craft.faire.com/the-great-migration-from-redshift-to-snowflake-173c1fb59a52
Faire also open-sourced some utility scripts to make it easier to move from Redshift to Snowflake:
https://github.com/Faire/snowflake-migration
Ashwin & Ananth Conversation Highlights
- "If you left like one percent of my data is still in Redshift and 99% of your data in Snowflake, you're degrading your velocity and the quality of your delivery.”
We thank all the writers of these blogs for sharing their knowledge with the data community.
Data Engineering Weekly #75
Data Engineering Weekly
02/21/22 • 15 min
FAQ
How many episodes does Data Engineering Weekly have?
Data Engineering Weekly currently has 28 episodes available.
What topics does Data Engineering Weekly cover?
The podcast is about Podcasts and Technology.
What is the most popular episode on Data Engineering Weekly?
The episode title 'DEW #119: Netflix's Scaling Media Machine Learning at Netflix, Open Table Formats Square Off in Lakehouse Data Smackdown & Building a semantic layer in Preset (Superset) with dbt' is the most popular.
What is the average episode length on Data Engineering Weekly?
The average episode length on Data Engineering Weekly is 35 minutes.
How often are episodes of Data Engineering Weekly released?
Episodes of Data Engineering Weekly are typically released every 2 days.
When was the first episode of Data Engineering Weekly?
The first episode of Data Engineering Weekly was released on Feb 21, 2022.