Introduction to Data Engineering
Overview of data engineering in modern data architectures.
Data lifecycle and pipelines.
Key technologies and trends (e.g., ETL, ELT, batch processing, streaming).
Activity: Discuss a real-world data pipeline use case.
Data Storage
Relational databases (e.g., MySQL, PostgreSQL) vs. NoSQL databases (e.g., MongoDB, Cassandra).
Cloud-based data storage (AWS S3, Azure Blob Storage).
Choosing the right storage based on use cases.
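To make the storage comparison concrete, here is a minimal sketch that holds the same customer data in three shapes: normalized relational tables, a nested document, and a serialized object stored under a key. It uses only the Python standard library. SQLite stands in for MySQL/PostgreSQL, a plain dict stands in for a MongoDB-style document, and the bytes payload stands in for an S3 or Blob Storage object. The table names, field names, and the key `customers/1.json` are made up for illustration.

```python
import json
import sqlite3

# Relational model: fixed schema, normalized tables, joined at query time.
# SQLite is only a stand-in here for a server database such as PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0);
""")
relational_total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN customers c ON c.id = o.customer_id WHERE c.name = ?",
    ("Ada",),
).fetchone()[0]

# Document model: one nested record per customer, no join needed to read it.
# A plain dict is only a stand-in for a document-store record.
customer_doc = {
    "_id": 1,
    "name": "Ada",
    "orders": [{"id": 10, "amount": 25.0}, {"id": 11, "amount": 40.0}],
}
document_total = sum(order["amount"] for order in customer_doc["orders"])

# Object storage: an opaque blob addressed by a key. The store itself does not
# query inside the blob; a reader downloads and parses it.
object_key = "customers/1.json"  # hypothetical key layout
object_body = json.dumps(customer_doc).encode("utf-8")

print(relational_total, document_total, object_key, len(object_body))
```

The relational form suits ad hoc queries across entities, the document form suits reading one entity in a single lookup, and the object form suits cheap bulk storage of files that other tools parse later.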
Batch Processing with Apache Spark
Understanding Spark architecture.
Loading and transforming data using Spark.
Difference between RDDs, DataFrames, and Datasets.
Activity: Run a sample batch processing job using Spark on a dataset.
Data Transformation & ETL Tools
Understanding ETL vs ELT.
Using ETL tools: Talend, Apache NiFi, or Apache Airflow.
Data cleansing and transformation concepts.
Activity: Create a data pipeline with Talend/Airflow for a simple ETL process.
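The ETL vs ELT distinction and the cleansing step in this module can be sketched without any of the tools named above. The example below is an illustration only, using Python's built-in `sqlite3` as a stand-in warehouse. The sample rows, the table names (`sales_etl`, `sales_raw`, `sales_elt`), and the helper functions are hypothetical. In ETL the rows are cleansed in application code before loading; in ELT the raw rows are loaded first and cleansed with SQL inside the database.

```python
import sqlite3

# Hypothetical raw extract: inconsistent casing, stray whitespace, bad values.
RAW_ROWS = [
    ("  Alice ", "42.50"),
    ("BOB", "n/a"),      # unparseable amount
    ("alice", "7.50"),
    ("", "3.00"),        # missing name
]


def clean(name, amount):
    """Return a cleansed (name, amount) pair, or None if the row is unusable."""
    name = name.strip().lower()
    try:
        value = float(amount)
    except ValueError:
        return None
    if not name:
        return None
    return name, value


def run_etl(conn):
    """ETL: transform in application code, then load only the clean rows."""
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    cleaned = [row for row in (clean(n, a) for n, a in RAW_ROWS) if row is not None]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw rows unchanged, then transform inside the database."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    # The GLOB check is a deliberately simple "starts with a digit" filter,
    # good enough for this sample data but not a full numeric validator.
    conn.execute("""
        CREATE TABLE sales_elt AS
        SELECT lower(trim(name)) AS name, CAST(amount AS REAL) AS amount
        FROM sales_raw
        WHERE trim(name) <> '' AND amount GLOB '[0-9]*'
    """)


def totals(conn, table):
    """Sum amounts per name for one of the tables created above."""
    query = f"SELECT name, SUM(amount) FROM {table} GROUP BY name ORDER BY name"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_totals = totals(conn, "sales_etl")
elt_totals = totals(conn, "sales_elt")
print(etl_totals, elt_totals)
```

Both paths end with the same cleansed totals. The practical difference is where the transformation runs and whether the raw data is kept: the ELT path retains `sales_raw`, so the cleansing rules can be changed and re-run later without extracting again.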