Data Engineering Weekly - What Happened at Data Council 2023?

04/06/23 • 36 min
Hey folks, have you heard about the Data Council conference in Austin? The three-day event was jam-packed with exciting discussions and innovative ideas on data engineering and infrastructure, data science and algorithms, MLOps, generative AI, streaming infrastructure, analytics, and data culture and community.

People in the data community are so nice. Meeting them and brainstorming across so many ideas and thought processes was an amazing experience; the conference felt like a jam session of different perspectives, ideas, and entrepreneurship.

The keynote by Shirshanka Das from Acryl Data discussed how data catalogs are becoming the control plane for data pipelines, a potential game-changer for the industry.

I also had a chance to attend a session on Malloy, a new way of thinking about SQL queries. It is still experimental, but it has some cool ideas for abstracting complicated SQL. ChatGPT will change the game for data engineering jobs and productivity; it has improved my own productivity by roughly 60%. And generative AI is becoming advanced enough to produce dynamic SQL from just a few lines of prompting.

But of course, with all this innovation and change, there are still questions about the future. Will Snowflake and Databricks outsource data governance experience to other companies? Will the modern data stack become more mature and consolidated? These are the big questions we need to ask as we move forward in the world of data.

Uber gave a talk on migrating their Ubermetric system from Elasticsearch to Apache Pinot, which, by the way, is an incredibly flexible and powerful system. We also chatted about Pinot's semi-structured storage support, which is important in modern data engineering.

Now, let's talk about something (non)controversial: the idea that big data is dead. DuckDB brought up three intriguing points to back up this claim.

  1. Not every company has big data.
  2. Instances with large amounts of memory are becoming a commodity.
  3. Even when companies do have big data, they mostly run incremental processing, and each increment can be small enough to handle on a single machine.
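To make the third point concrete, here is a back-of-the-envelope sketch. The data volumes are hypothetical numbers of my own, not figures from the talk:

```python
# Hypothetical numbers: a company with 5 years of event data,
# but a daily pipeline that only processes yesterday's partition.
total_days = 5 * 365
total_data_tb = 50.0                                    # full history, in terabytes
daily_partition_gb = total_data_tb * 1024 / total_days  # one day's slice

print(f"Full history: {total_data_tb} TB")
print(f"One daily increment: ~{daily_partition_gb:.1f} GB")
# A ~28 GB working set fits comfortably in RAM on a single large instance,
# which is the point: the *incremental* workload is not "big data".
```

Even a petabyte-scale history can hide a surprisingly modest per-run working set.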

Abhi Sivasailam presented a thought-provoking approach to metric standardization. He introduced the concept of "metric trees" - connecting high-level metrics to lower-level metrics and building semantics around them. The best part? You can create a whole tree structure that shows the impact of one metric on another. You could even simulate your business performance by tweaking inputs in the metric tree, which is mind-blowing.
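As a rough illustration of the metric-tree idea (a sketch of my own, not Abhi Sivasailam's actual model; the metric names and formula are hypothetical):

```python
# A minimal metric tree: each node is a metric whose value is either a raw
# input (leaf) or derived from its child metrics.

def leaf(name):
    return lambda inputs: inputs[name]

def derived(fn, *children):
    return lambda inputs: fn(*(c(inputs) for c in children))

# revenue = active_users * conversion_rate * avg_order_value
active_users = leaf("active_users")
conversion   = leaf("conversion_rate")
order_value  = leaf("avg_order_value")
revenue = derived(lambda u, c, v: u * c * v, active_users, conversion, order_value)

baseline = {"active_users": 10_000, "conversion_rate": 0.02, "avg_order_value": 50.0}
print(round(revenue(baseline), 2))   # 10000.0

# "Simulating" the business: tweak one leaf metric and re-evaluate the tree.
scenario = dict(baseline, conversion_rate=0.025)
print(round(revenue(scenario), 2))   # 12500.0
```

Because the tree is explicit, the impact of any input metric on the top-level metric falls out of re-evaluation, which is what makes what-if simulation possible.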

Another amazing talk was about cross-company data exchange, where Pardis discussed the various ways companies share data, like APIs, file uploads, or even Snowflake sharing. But the real question is: How do we deal with revenue sharing, data governance, and preventing sensitive data leaks? Pardis's startup, General Folders, is tackling this issue, aiming to become the "Dropbox" of data exchange. How cool is that?

To wrap it up, three key learnings from the conference were:

  1. The intriguing idea that "big data is dead" and how it impacts data infrastructure architecture.
  2. The data catalog as a control plane for the modern data stack: is it a dream or a reality?
  3. The growing importance of data contracts and the fascinating idea of metric trees.

Overall, the Data Council conference was an incredible experience, and I can't wait to see what they have in store for us next year.




This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.dataengineeringweekly.com

Next Episode

Podcast: dbt Reimagined, Change Data Capture @ Brex, on Data Products and how to describe them

dbt Reimagined by Pedram Navid

The challenge with Jinja templating, I found, is twofold. First, it resolves at runtime: you have to build the project and then run it to see whether your template did what you intended.

Jinja templates also add cognitive load. Developers have to understand both how the Jinja template renders and how the resulting SQL behaves, which makes models harder to read and understand.
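To illustrate the runtime-resolution complaint, here is a toy sketch of my own. It uses a hand-rolled renderer rather than dbt's actual Jinja engine, and the model and variable names are made up:

```python
import re

# A toy renderer for a dbt-style template (illustrative only, not dbt's engine).
# The point: the final SQL does not exist until render time, so you cannot
# inspect or validate it just by reading the model file.
template = """
select order_id, amount
from {{ ref('stg_orders') }}
where amount > {{ var('min_amount') }}
"""

context = {
    "ref": lambda name: f"analytics.{name}",            # resolve model references
    "var": lambda name: {"min_amount": "100"}[name],    # resolve project variables
}

def render(tpl, ctx):
    # Resolve {{ fn('arg') }} placeholders by calling fn(arg) from the context.
    return re.sub(r"\{\{\s*(\w+)\('(\w+)'\)\s*\}\}",
                  lambda m: ctx[m.group(1)](m.group(2)), tpl)

print(render(template, context))
# Only after rendering do we see the actual SQL:
#   select order_id, amount
#   from analytics.stg_orders
#   where amount > 100
```

The reader of the original file has to mentally evaluate two languages at once, which is exactly the cognitive-load problem a purpose-built DSL would avoid.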

In this conversation with Aswin, we discuss the article "dbt Reimagined" by Pedram Navid. We talk about the strengths and weaknesses of dbt and what we would like to see in a future version of the tool.

Aswin agrees with Pedram Navid that a DSL would be better than a templated language for dbt. He also points out that the Jinja templating system can be difficult to read and understand.

I agree with both Aswin and Pedram Navid. A DSL would be a great way to improve dbt: it would make the tool more powerful and easier to use.

I'm also interested in a native programming language for dbt. It would let developers write their own custom functions and operators, giving them even more flexibility in using the tool.

The conversation then shifts to the advantages of a DSL over templated code, and we discuss other tools like SQLMesh, Malloy, and an internal tool built at Criteo. I believe more experimentation with SQL is needed.

Overall, "dbt Reimagined" is a valuable contribution to the discussion about the future of data transformation tools. It raises important questions about dbt's strengths and weaknesses and offers interesting ideas for improving it.

Change Data Capture at Brex by Jun Zhao

https://medium.com/brexeng/change-data-capture-at-brex-c71263616dd7

Aswin provided a great definition of CDC, explaining it as a mechanism to listen to database replication logs and capture, stream, and reproduce data in real time🕒. He shared his first encounter with CDC back in 2013, working on a Proof of Concept (POC) for a bank🏦.

Aswin explains that CDC is a way to capture changes made to data in a database. This can be useful for a variety of reasons, such as:

  1. Auditing: CDC can be used to track changes made to data, which can be useful for auditing purposes.
  2. Compliance: CDC can be used to ensure that data complies with regulations.
  3. Data replication: CDC can replicate data from one database to another.
  4. Data integration: CDC can be used to integrate data from multiple sources.
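For a flavor of the replication use case, here is a minimal sketch of applying a change stream to a target table. The event format is an assumption of mine, loosely modeled on Debezium-style c/u/d operation codes, not an actual connector payload:

```python
# A minimal CDC-apply sketch: consume a stream of change events and replay
# them into an in-memory "target table" keyed by primary key.
# Event format is hypothetical (op codes loosely modeled on Debezium's c/u/d).

change_log = [
    {"op": "c", "key": 1, "row": {"id": 1, "balance": 100}},   # insert
    {"op": "u", "key": 1, "row": {"id": 1, "balance": 80}},    # update
    {"op": "c", "key": 2, "row": {"id": 2, "balance": 500}},   # insert
    {"op": "d", "key": 1, "row": None},                        # delete
]

def apply_changes(events, table=None):
    """Replay change events in order to rebuild the target table."""
    table = {} if table is None else table
    for e in events:
        if e["op"] in ("c", "u"):   # inserts and updates both upsert the row
            table[e["key"]] = e["row"]
        elif e["op"] == "d":        # deletes remove the key if present
            table.pop(e["key"], None)
    return table

print(apply_changes(change_log))   # {2: {'id': 2, 'balance': 500}}
```

Replaying the log in order is what lets a downstream replica converge on the source's state, which is the essence of log-based CDC.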

Aswin also discusses some of the challenges of using CDC, such as:

  1. Complexity: CDC can be a complex process to implement.
  2. Cost: CDC can be a costly process to implement.
  3. Performance: CDC can impact the performance of the database.

To summarize the conversation about change data capture (CDC):

  1. CDC is a way to capture changes made to data in a database.
  2. CDC can be used for various purposes, such as auditing, compliance, data replication, and data integration.
  3. CDC can be implemented using a variety of tools, such as Debezium.
  4. Some of the challenges of CDC include latency, cost, and performance.
  5. CDC can't carry business context, which can be expensive to recreate.

Overall, CDC is a valuable tool for data engineers.

On Data Products and How to Describe Them by Max Illis

https://medium.com/@maxillis/on-data-products-and-how-to-describe-them-76ae1b7abda4

The library example is close to heart for Aswin since his father started his career as a librarian! 📖

👨‍💻 Aswin highlights Max's broad definition of data products, including data sets, tables, views, APIs, and machine learning models. Anand agrees that BI dashboards can also be data products. 📊

🔍We emphasize the importance of exposing tribal knowledge and democratizing the data product world. Max's journey from skeptic to believer in data products is very admirable. 🌟

📝We dive into data products' structural and behavioral properties and Max's detailed description of build-time and runtime properties. We also appreciate the idea of reference queries to facilitate data consumption. 🧩

🚀In conclusion, Max's blog post is one of the best write-ups on data products around! Big thanks to Max for sharing his thoughts! 🙌


