Course Content
Prerequisites for a Data Engineering
Preparing for a Data Engineering boot-camp can enhance your experience and success. Here are the core prerequisites:
Data Ingestion, Storage & Processing
Introduction to Data Engineering
Overview of Data Engineering in modern architectures. Data lifecycle and pipelines. Key technologies and trends (e.g., ETL, ELT, Batch Processing, Streaming). Activity: Discuss a real-world data pipeline use case.
Data Ingestion Techniques
Understanding structured, semi-structured, and unstructured data. Batch ingestion: Using Apache Sqoop, Talend. Streaming ingestion: Using Apache Kafka.
Data Storage Solutions
Relational databases (e.g., MySQL, PostgreSQL) vs. NoSQL databases (e.g., MongoDB, Cassandra). Cloud-based data storage (AWS S3, Azure Blob Storage). Choosing the right storage based on use cases.
Batch Processing with Apache Spark
Understanding Spark architecture. Loading and transforming data using Spark. Difference between RDDs, DataFrames, and Datasets. Activity: Run a sample batch processing job using Spark on a dataset.
Data Transformation, Orchestration & Monitoring
Data Transformation & ETL Tools
Understanding ETL vs. ELT. Using ETL tools: Talend, Apache NiFi, or Airflow. Data cleansing and transformation concepts. Activity: Create a data pipeline with Talend/Airflow for a simple ETL process.
Data Orchestration
Introduction to orchestration tools: Apache Airflow, AWS Step Functions. Creating workflows to manage complex pipelines. Managing dependencies and retries in workflows.
Data Engineering
About Lesson

Relational Databases (RDBMS) vs. NoSQL Databases

| Feature | Relational Databases (RDBMS) | NoSQL Databases |
|---|---|---|
| Data Model | Structured data in tables (rows & columns) | Schema-less or flexible schema (documents, key-value, etc.) |
| Schema | Fixed schema (predefined structure) | Flexible schema (dynamic, schema-less) |
| Examples | MySQL, PostgreSQL, Oracle | MongoDB (Document), Cassandra (Wide-column), Redis (Key-value) |
| Data Relationships | Supports complex relationships via joins | Typically avoids joins; uses de-normalized data for relationships |
| Query Language | SQL (Structured Query Language) | Query mechanisms vary: MongoDB (JSON-like queries), Cassandra (CQL) |
| Transactions | ACID compliance (strong consistency) | BASE (eventual consistency; some databases support ACID) |
| Scaling | Primarily vertical scaling (scaling up) | Horizontal scaling (scaling out, distributed across nodes) |
| Performance | Optimized for transactions and complex queries | Optimized for high throughput in distributed environments |
| Storage Format | Tables with rows and columns | Documents (JSON, BSON), key-value pairs, wide columns, etc. |
| Consistency | Strong consistency | Eventual consistency (some databases offer strong consistency) |
| Best Use Cases | Financial systems, inventory, CRM, ERP | Big data analytics, IoT, real-time applications, social media |
| Scaling Complexity | Horizontal scaling is complex (requires sharding) | Horizontal scaling is easier (data is typically distributed by default) |
| Handling Unstructured Data | Less flexible for unstructured data | Designed for unstructured and semi-structured data |

This table provides a quick and clear comparison of the two database types based on key factors.
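
The "Data Relationships" and "Schema" rows can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for an RDBMS and a plain dictionary as a stand-in for a document store; the table names, field names, and sample customers are hypothetical and chosen only for illustration.

```python
import sqlite3

# Relational side: a fixed schema with two normalized tables.
# sqlite3 (in-memory) stands in for an RDBMS such as MySQL or PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 'keyboard'), (11, 1, 'monitor');
""")


def orders_via_join(customer_name):
    """Resolve the customer-to-orders relationship with a SQL join."""
    rows = conn.execute(
        "SELECT o.item FROM orders AS o "
        "JOIN customers AS c ON c.id = o.customer_id "
        "WHERE c.name = ? ORDER BY o.id",
        (customer_name,),
    ).fetchall()
    return [item for (item,) in rows]


# Document side: a dict stands in for a document store such as MongoDB.
# Orders are embedded (de-normalized) inside each customer document, and
# documents may carry different fields (note "loyalty_tier" on one only).
customer_docs = {
    "asha": {"name": "Asha", "orders": [{"item": "keyboard"}, {"item": "monitor"}]},
    "ben": {"name": "Ben", "orders": [], "loyalty_tier": "gold"},
}


def orders_from_document(key):
    """Read the orders directly from the embedded list; no join is needed."""
    return [order["item"] for order in customer_docs[key]["orders"]]


print(orders_via_join("Asha"))
print(orders_from_document("asha"))
```

Both lookups return the same items. The relational version keeps each fact in one place and joins at query time, while the document version reads a single record that already contains everything it needs.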

  • Schema-less databases: A schema-less database offers several benefits over a schema-based one. It gives greater flexibility over data types, and you can often change data types without taking the database offline or updating connected systems.
  • Data normalization: Organizes data into related tables so that each fact is stored only once, removing redundancy and leaving standardized data.
  • De-normalization: Combines data from multiple tables into a single table that can be queried faster, at the cost of some redundancy.
  • Horizontal scaling: Adding more nodes (machines) to the system.
  • Vertical scaling: Adding more power (CPU, memory, storage) to your current machines.
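
As a rough illustration of horizontal scaling, the sketch below spreads record keys across nodes by hashing each key. This is a simplified assumption about how data might be partitioned, not the scheme of any particular database; the node names, key format, and the helpers node_for_key and distribute are hypothetical.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick a node for a key using a stable hash (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


keys = [f"user-{i}" for i in range(1000)]

# Scaling out: the same data spread over three nodes, then over four.
three_nodes = distribute(keys, ["node-a", "node-b", "node-c"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

print({node: len(stored) for node, stored in three_nodes.items()})
print({node: len(stored) for node, stored in four_nodes.items()})
```

Adding a fourth node lowers the share of keys each node holds, which is the point of scaling out. Plain modulo placement moves many keys whenever the node count changes, so distributed databases commonly use more careful schemes such as consistent hashing.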

NoSQL Use Cases:

  • Real-time analytics, content management systems, recommendation engines, IoT data, large-scale distributed applications.