Chad Sanderson

These are the best posts from Chad Sanderson.

6 viral posts with 11,157 likes, 648 comments, and 477 shares.
3 image posts, 0 carousel posts, 0 video posts, 3 text posts.

Best Posts by Chad Sanderson on LinkedIn

Businesses are failing their data science teams, and we are seeing a mass transition from Data Science to Data Engineering for that reason.

Data Scientists join a company expecting to apply statistical modeling and machine intelligence to challenging business problems. What they find instead is an impenetrable maze of low-quality data: virtually indecipherable JSON blobs with little indication of ownership or semantic meaning.

Instead of model building, data scientists spend the majority of their time on validation and 'untangling' spaghetti SQL in the Data Warehouse. They are encouraged to 'prove business value' and 'move fast and break things,' yet the underlying infrastructure allows them to do neither effectively.

This is not sustainable. For data scientists to truly add value in a scalable way, data engineers and data scientists need to operate from a shared understanding of data quality and infrastructure. That means:

1. An investment in data architecture early on
2. Clearly defined ownership of core data assets
3. Business meaning of the data, defined centrally
4. A shared responsibility for data quality
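
As a rough illustration of points 2-4, here is a minimal sketch in Python (asset, owner, and field names are all hypothetical) of what centrally defined ownership and business meaning can look like; real teams would typically reach for a contract or catalog tool, but the idea is the same:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldDef:
    name: str
    dtype: str
    meaning: str  # business definition, written once, centrally

@dataclass(frozen=True)
class DataAsset:
    name: str
    owner: str  # the team accountable for this asset's quality
    fields: tuple[FieldDef, ...] = ()

# Hypothetical core asset: ownership and meaning live in one reviewed place
# instead of in tribal knowledge.
ORDERS = DataAsset(
    name="analytics.orders",
    owner="checkout-team@company.com",
    fields=(
        FieldDef("order_id", "string", "Unique identifier assigned at checkout."),
        FieldDef("gross_revenue_usd", "decimal", "Pre-refund revenue in USD, excluding tax."),
    ),
)

def missing_fields(asset: DataAsset, row: dict) -> list[str]:
    """A simple, shared quality check both producers and consumers can run."""
    return [f.name for f in asset.fields if row.get(f.name) is None]

print(missing_fields(ORDERS, {"order_id": "o-123", "gross_revenue_usd": None}))
# -> ['gross_revenue_usd']
```

The point is that the definition lives in one reviewed place, so producers and consumers validate against the same source of truth.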

Good luck!

#dataengineering

Data Infrastructure is a long-term opportunity that can't be exclusively judged on its short-term value, yet SO many businesses do exactly that.

Many data engineering teams find their projects being deprioritized because it's challenging to explain exactly how many dollars an initiative will contribute to the bottom line. This leads to DEs spinning their wheels, jumping from one broken pipeline or unscalable SQL query to the next, with no end in sight.

Do data catalogs, monitoring tools, and access control all help? Yes, they do. But they cannot address the core problem: The data platform is built on years (sometimes decades) of tech debt with no ownership and low quality. Solving THAT problem takes time, strategy, and collaboration across the entire business.

#dataengineering
Post image by Chad Sanderson

While ML/AI is the hot topic today, a foundation of tech debt and crumbling data infrastructure is the cost for tomorrow.

Companies excited by the promise of ML/AI will quickly discover that unless the data itself represents the real world accurately, arrives in a timely fashion, has clear rules of ownership, and can be debugged at the source - every data science program is at risk of implosion from within.

The first two steps any ML/AI project manager should take are to 1.) invest in contracts between producers and consumers, which drive clear ownership and accountability and set the 'failure state' for data as a product, and 2.) stand up robust data monitoring to guard against data drift and other anomalies that can't be prevented in the build phase or the stream-processing layer.
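
A minimal sketch of the second step, assuming a single numeric feature and a hypothetical agreed-upon baseline (production setups would lean on a dedicated monitoring tool, but the core check is a comparison of today's data against a baseline):

```python
import statistics

def drift_alert(baseline: list[float], current: list[float], threshold: float = 3.0) -> bool:
    """Flag drift when the current mean sits more than `threshold` baseline
    standard deviations away from the baseline mean (a crude z-score check)."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1e-9  # avoid division by zero
    return abs(statistics.mean(current) - mu) / sigma > threshold

# Hypothetical feature values: the agreed baseline vs. today's batch.
baseline = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
today = [18.9, 19.4, 18.7, 19.1, 19.0, 19.3]

if drift_alert(baseline, today):
    print("Data drift detected: page the owning team named in the contract.")
```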

This foundation of quality ensures that new data features can be implemented quickly, with high quality and semantic validity, while the owning team stays vigilant against edge cases and bugs.

Good luck and have a great weekend!

#machinelearning
Post image by Chad Sanderson

The vast majority of data models, cloud spend, and dashboards are unnecessary.

The majority of businesses have a relatively small number of business-critical pipelines that power ROI-associated assets like ML/AI models, financial reporting, and other production-grade data products.

However, because these data products often sit downstream of a tangled mess of spaghetti SQL, the data assets that generate the most value are often completely under-served, lacking data contracts, monitoring, CI/CD, alerting, and ownership.

Data products that should be incrementally improved are left to rot, because the cost of refactoring the entire upstream pipeline is far too high for the upside.

My advice: Forget about complete warehouse refactors (which will be out of date in 6 months anyway). Focus ONLY on the most valuable, high ROI data sets. Data consumers and producers should work together to create data contracts around the core schemas, layer in strong change management, and apply quality downstream.

Some great places to start with data contracts:
- A table that captures usage-based pricing for accounting
- sev0 ML models like pricing or offer relevance
- Data that is surfaced to a 3rd party customer or embedded in an app
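
To make the first item concrete, here is a hedged sketch (table and column names are hypothetical) of a contract for a usage-based pricing table, plus the kind of change-management check that could run in CI before a producer ships a schema change:

```python
# Hypothetical contract for a usage-based pricing table, plus a CI-style
# breaking-change check. Table and column names are illustrative only.
USAGE_PRICING_CONTRACT = {
    "account_id": "string",
    "metered_unit": "string",
    "units_consumed": "bigint",
    "unit_price_usd": "decimal(12,4)",
    "billed_at": "timestamp",
}

def breaking_changes(contract: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """Return violations the producer must resolve (or renegotiate) before deploying."""
    problems = []
    for column, dtype in contract.items():
        if column not in proposed:
            problems.append(f"removed column: {column}")
        elif proposed[column] != dtype:
            problems.append(f"type change on {column}: {dtype} -> {proposed[column]}")
    return problems  # new columns are additive, so they are allowed

proposed = dict(USAGE_PRICING_CONTRACT, units_consumed="string")  # producer's new schema
print(breaking_changes(USAGE_PRICING_CONTRACT, proposed))
# -> ['type change on units_consumed: bigint -> string']
```

The check is deliberately one-sided: consumers depend on the contracted columns and types, while additive changes pass through without a renegotiation.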

Good luck!

#dataengineering
Post image by Chad Sanderson

The true semantic definition of a “Data Warehouse” (a representation of the real world through data) has been lost. Today, a Data Warehouse means “a place to dump data and do transforms.” There are many reasons why this happened - chief among them data product vendors and consultants co-opting terminology in order to make sales.

Your average data platform architect is only loosely familiar with Bill Inmon or Ralph Kimball. Data engineers who studied warehousing in school may be more familiar but are unlikely to have practically implemented older design methodologies, like Entity-Relationship Diagrams. Or worse yet, they have attempted to implement such a methodology only to be met with confusion or stony silence from revenue-focused leadership that frankly couldn't care less about data governance.

As more and more modern companies transition to the Modern Data Stack with ELT at its core, these classic design structures and the learnings which drove them are being forgotten, replaced by an obsession for speed and cheap compute.

Where we are now in data bears a striking resemblance to the state of DevOps in the mid-2000s. While CI/CD and agile development methodologies were leveraged by fast-growing startups that needed to raise capital and deploy quickly, the norm was still waterfall release schedules with heavy governance and a high quality-control overhead.

So what changed?

In my opinion, the introduction of Git (and subsequently GitHub) drove a culture shift in how teams build software. Large governance organizations weren't as needed - it became simple to create a branch, kick off peer review with the right set of team members, deploy iteratively, and roll back changes in case something broke. All this happened at the right level of abstraction: code.

Data has yet to have its watershed technology moment: one that enables rapid deployment WITH enough governance and baked-in best practices to keep those deployments safe. ETL is akin to Waterfall, as ELT is akin to Agile. The system we need is one that facilitates great data architecture and design, the development of a 'true' data warehouse, with the speed and flexibility provided by the Cloud, at the right level of abstraction for data: semantics. I call this model the Immutable Data Warehouse.

In my next set of blog posts, I'll be diving into specific implementations of the Immutable Data Warehouse and how to preserve strong design and governance at a fast-moving modern tech company.

See you soon!

ChatGPT will not replace data engineers. Yes, it can write SQL, but the hard part of data development is understanding how code translates to the real world.

Every business has a unique way of storing data. One customerID could be stored in a MySQL DB. Another could be imported from Mixpanel as nested JSON, and a third might be collected from a CDP. All three IDs and their properties are slightly (or significantly) different and must be integrated into a single table in the Data Warehouse.

As smart as ChatGPT might be, it would need to understand the SEMANTICS of how these IDs coalesce into something meaningful in order to automate any step of modeling or ETL. To do that, the algorithm must have some cognition of the real world, grok how the business works, and dynamically tie the data model to that understanding. That's not machine learning anymore - it's general intelligence.
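
As a rough sketch of that integration (all source and field names are hypothetical), the hard part is the hand-written semantic mapping below; the surrounding glue code is the easy part an LLM could generate:

```python
# Three hypothetical representations of "the same" customer, as described above.
mysql_row = {"id": 42, "email": "a@example.com"}                       # MySQL table
mixpanel_event = {"properties": {"distinct_id": "42", "plan": "pro"}}  # nested JSON export
cdp_profile = {"customer_key": "cust-42", "traits": {"tier": "PRO"}}   # CDP profile

def canonical_customer_id(source: str, record: dict) -> str:
    """The semantic knowledge: how each source's identifier maps to one canonical
    customer_id. A human who understands the business has to encode this."""
    if source == "mysql":
        return str(record["id"])
    if source == "mixpanel":
        return record["properties"]["distinct_id"]
    if source == "cdp":
        return record["customer_key"].removeprefix("cust-")
    raise ValueError(f"unknown source: {source}")

rows = [("mysql", mysql_row), ("mixpanel", mixpanel_event), ("cdp", cdp_profile)]
print({canonical_customer_id(s, r) for s, r in rows})  # -> {'42'}
```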

A lot of folks in the 'LLMs are magical' camp don't understand how utterly disastrous the underlying data of most data ecosystems actually are, or how badly models would hallucinate without a complete overhaul of the infrastructure, which in and of itself would probably take years of effort.

What ChatGPT can and will do is (eventually) make the work of query optimization far simpler. By directing an LLM at a common question (query), ChatGPT could scan the pipeline and recommend optimizations and simplifications, reducing cost and increasing usability.

LLMs will become an amazing developer tool that radically improves productivity and data modeling speed, but not one that kills your job. If a human can't make sense of your data infrastructure, there's no way a machine could do it either.

Good luck!

#dataengineering
