You shouldn't try to learn all of data engineering at once! You'll get overwhelmed and feel like you aren't making any progress!
A piecemeal approach to eating the data engineering elephant is better.
Start with:
- SQL
Get good with SELECT, GROUP BY, WHERE, HAVING, JOIN, etc. DataLemur (Ace the SQL & Data Interview) is a great resource to get into this.
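Those clauses compose into one pattern you'll write constantly: join two tables, aggregate, then filter the aggregates. A tiny sketch using Python's built-in sqlite3 and a made-up customers/orders schema:

```python
import sqlite3

# Toy schema (hypothetical tables) for practicing JOIN, GROUP BY, and HAVING.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 75.0), (3, 2, 20.0);
""")

# Customers with more than $60 in total orders.
# Note: WHERE filters rows before aggregation; HAVING filters after.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    HAVING SUM(o.amount) > 60
""").fetchall()
print(rows)  # [('Ada', 125.0)]
```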
Then branch into:
- Python
Get good with loops, variables, classes, dictionaries, tuples, and lists. LeetCode still seems like the best place to practice this.
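A single small exercise can touch all of those at once. Here's one (entirely made up) that counts words with a class, a dictionary, a loop over a list, and tuples in the result:

```python
# Toy exercise touching the basics: a class, a dictionary, a loop, tuples.
class WordCounter:
    def __init__(self):
        self.counts = {}  # dictionary mapping word -> count

    def add(self, words):
        for w in words:  # loop over a list
            self.counts[w] = self.counts.get(w, 0) + 1

    def top(self):
        # Return (word, count) tuples sorted by count, highest first.
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)

wc = WordCounter()
wc.add(["etl", "sql", "etl", "python", "sql", "etl"])
print(wc.top())  # [('etl', 3), ('sql', 2), ('python', 1)]
```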
Then branch into:
- job orchestration
Airflow is the most popular option here, but the startup costs to get going are kind of high. A newer option is Mage, which is much easier to get started with and can orchestrate things as well as Airflow.
Then branch into:
- distributed compute (Snowflake, Spark, BigQuery, etc)
Almost all of these platforms have free trials. Key things to learn about here are partitioning, memory management, broadcast joins, and caching.
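Broadcast joins are worth understanding early: when one table is small, the engine ships a full copy of it to every worker and joins locally, avoiding an expensive shuffle of the big table. A miniature sketch of the idea with a made-up fact/dimension pair (Spark and friends do this automatically below a size threshold):

```python
# Broadcast join in miniature: copy the small table to every partition of the
# big table and hash-join locally, instead of shuffling the big side.
small_dim = {1: "US", 2: "DE"}  # small dimension table, "broadcast" everywhere

big_fact_partitions = [          # big fact table, already split across workers
    [(1, 100), (2, 50)],
    [(1, 25)],
]

def join_partition(partition, broadcast):
    # Each worker joins its own partition against its local copy.
    return [(broadcast[key], amount) for key, amount in partition if key in broadcast]

joined = [row
          for part in big_fact_partitions
          for row in join_partition(part, small_dim)]
print(joined)  # [('US', 100), ('DE', 50), ('US', 25)]
```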
Then branch into:
- data modeling
Learn about fact tables, dimension tables, slowly changing dimensions, cumulative table design, and change data capture. Reading one of Bill Inmon's data warehousing books will put you ahead here!
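Slowly changing dimensions (Type 2 specifically) are a good one to internalize: instead of overwriting an attribute, you close out the old row and add a new versioned row so history survives. A toy sketch with in-memory dicts and a hypothetical customer dimension; real pipelines do this inside the warehouse:

```python
# SCD Type 2 in miniature: never overwrite, always version.
# Each row carries start_date/end_date; end_date=None marks the current row.
def scd2_update(dim_rows, customer_id, new_city, as_of):
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            if row["city"] == new_city:
                return dim_rows          # attribute unchanged, nothing to do
            row["end_date"] = as_of      # close out the current version
    dim_rows.append({"customer_id": customer_id, "city": new_city,
                     "start_date": as_of, "end_date": None})
    return dim_rows

dim = [{"customer_id": 1, "city": "Austin",
        "start_date": "2023-01-01", "end_date": None}]
scd2_update(dim, 1, "Denver", "2024-06-01")
print(len(dim))            # 2 -- Austin (closed out) and Denver (current)
print(dim[0]["end_date"])  # 2024-06-01
```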
#dataengineering