Course Content
Prerequisites for a Data Engineering Boot-camp
Preparing for a Data Engineering boot-camp can enhance your experience and success. Here are the core prerequisites:
Data Ingestion, Storage & Processing
Introduction to Data Engineering
Overview of data engineering in modern architectures. Data lifecycle and pipelines. Key technologies and trends (e.g., ETL, ELT, batch processing, streaming). Activity: Discuss a real-world data pipeline use case.
Data Ingestion Techniques
Understanding structured, semi-structured, and unstructured data. Batch ingestion: Using Apache Sqoop, Talend. Streaming ingestion: Using Apache Kafka.
Data Storage Solutions
Relational databases (e.g., MySQL, PostgreSQL) vs. NoSQL databases (e.g., MongoDB, Cassandra). Cloud-based data storage (AWS S3, Azure Blob Storage). Choosing the right storage based on use cases.
Batch Processing with Apache Spark
Understanding Spark architecture. Loading and transforming data using Spark. Difference between RDDs, DataFrames, and Datasets. Activity: Run a sample batch processing job using Spark on a dataset.
Data Transformation, Orchestration & Monitoring
Data Transformation & ETL Tools
Understanding ETL vs. ELT. Using ETL tools: Talend, Apache NiFi, or Apache Airflow. Data cleansing and transformation concepts. Activity: Create a data pipeline with Talend/Airflow for a simple ETL process.
Data Orchestration
Introduction to orchestration tools: Apache Airflow, AWS Step Functions. Creating workflows to manage complex pipelines. Managing dependencies and retries in workflows.

Activity: Hands-On Setup for Storing Ingested Data Using AWS S3 or Azure Blob Storage

This hands-on activity guides you through setting up either an AWS S3 bucket or Azure Blob Storage to store ingested data. You can choose which cloud platform to use, depending on your preference. Below are instructions for both.


1. AWS S3 Setup for Storing Ingested Data

Prerequisites:

  • AWS account (free tier is available).
  • AWS Management Console access.

Steps to Set Up AWS S3 Bucket:

  1. Log in to the AWS Management Console:

    • Go to https://aws.amazon.com/console/ and sign in with your AWS credentials.
  2. Navigate to S3:

    • From the AWS Management Console, search for “S3” in the search bar and select Amazon S3 from the results.
  3. Create a New S3 Bucket:

    • Click on Create bucket.
    • Enter a Bucket name (e.g., data-ingestion-bucket).
    • Choose an AWS Region closest to where your application or data pipeline is running.
    • Leave the Block Public Access settings enabled unless you specifically need objects to be publicly readable (e.g., web assets).
    • Click Create bucket at the bottom.
  4. Set Up Folder Structure (Optional):

    • After the bucket is created, click on the bucket name.
    • Create folders if necessary to organize your data (e.g., ingested-data/).
  5. Upload Ingested Data:

    • Click on Upload in your bucket.
    • Drag and drop files, or browse to the files that you want to upload. This data could be CSVs, JSON files, images, etc.
    • Click Upload to store the data in your S3 bucket. (A boto3 sketch for scripting this upload appears after these steps.)
  6. Grant Permissions (Optional):

    • You can control access to the data via Bucket Policies or IAM Roles. For data ingestion, it’s common to grant specific roles access to upload or read files from the bucket.
  7. Accessing Data via URL:

    • Once data is uploaded, each object (file) has a unique S3 URL. You can use these URLs to access files within your applications or data pipelines.
    • Example URL: https://<bucket-name>.s3.<region>.amazonaws.com/<object-key>
  8. Set Lifecycle Policies (Optional):

    • To manage data storage costs, you can set up Lifecycle Policies to automatically move data to cheaper storage classes (e.g., S3 Glacier) after a certain time.
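
The console workflow above can also be scripted. The following is a minimal sketch using the boto3 SDK that mirrors steps 5, 7, and 8. The bucket name, region, file name, and lifecycle rule are illustrative placeholders from this activity rather than required values, and the sketch assumes AWS credentials are already configured locally (for example via aws configure).

# Minimal sketch: upload an ingested file to S3, derive an access URL, and
# apply an optional lifecycle rule. Names below are placeholders.
import boto3

BUCKET = "data-ingestion-bucket"   # placeholder: your bucket name
REGION = "us-east-1"               # placeholder: your chosen region

s3 = boto3.client("s3", region_name=REGION)

# Step 5: upload a local file under the ingested-data/ prefix.
s3.upload_file("sample.csv", BUCKET, "ingested-data/sample.csv")

# Step 7: private objects are best shared via a time-limited presigned URL.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": BUCKET, "Key": "ingested-data/sample.csv"},
    ExpiresIn=3600,  # valid for one hour
)
print(url)

# Step 8: transition objects under ingested-data/ to S3 Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-ingested-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "ingested-data/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)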

2. Azure Blob Storage Setup for Storing Ingested Data

Prerequisites:

  • Microsoft Azure account (free tier available).
  • Access to the Azure portal.

Steps to Set Up Azure Blob Storage:

  1. Log in to the Azure Portal:

    • Go to https://portal.azure.com/ and sign in with your Microsoft Azure credentials.
  2. Create a Storage Account:

    • In the search bar at the top of the portal, search for Storage accounts.
    • Click on Create to create a new storage account.
    • Choose a Subscription, Resource group, and a Storage account name (e.g., ingestionstorage).
    • Select the appropriate Region.
    • Under Performance, choose Standard (for most use cases) or Premium for low-latency access.
    • Leave the other settings as default for now, and click Review + Create. Then click Create to provision the account.
  3. Create a Blob Container:

    • Once your storage account is created, navigate to it from the Storage accounts list.
    • On the left-side menu, click Containers under the Data storage section.
    • Click + Container to create a new blob container. Name it (e.g., ingested-data) and set the Public access level to Private (unless you need public access).
    • Click Create.
  4. Upload Data to Blob Storage:

    • Click on the newly created container.
    • Click Upload and browse to select files from your local system.
    • Click Upload to store the data in Azure Blob Storage. (An azure-storage-blob sketch for scripting this upload appears after these steps.)
  5. Accessing Data:

    • Each blob (file) uploaded has a unique Blob URL. You can use this URL to access the file within your application or data processing pipeline.
    • Example URL: https://<yourstorageaccount>.blob.core.windows.net/<containername>/<filename>
  6. Set Access Policies (Optional):

    • To manage access to your blob storage, you can set Shared Access Signatures (SAS) for temporary, controlled access to files.
    • Alternatively, use Azure role-based access control (RBAC) with Azure AD roles to manage user and application access.
  7. Set Lifecycle Policies (Optional):

    • To manage costs, you can configure Lifecycle Management Policies to automatically move blobs to different tiers (e.g., Cool or Archive) based on access patterns.
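
As with S3, the portal workflow can be scripted. The following is a minimal sketch using the azure-storage-blob Python package (v12) that mirrors steps 4, 5, and 6. The connection string, account name, account key, container, and file names are placeholders from this activity, not fixed values.

# Minimal sketch: upload an ingested file to Azure Blob Storage, print its URL,
# and issue a short-lived SAS token. Assumes `pip install azure-storage-blob`.
from datetime import datetime, timedelta
from azure.storage.blob import (
    BlobServiceClient,
    BlobSasPermissions,
    generate_blob_sas,
)

CONNECTION_STRING = "<your-storage-account-connection-string>"  # placeholder
CONTAINER = "ingested-data"                                     # from step 3

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER)

# Step 4: upload a local file as a blob.
with open("sample.csv", "rb") as data:
    container.upload_blob(name="sample.csv", data=data, overwrite=True)

# Step 5: each blob has a unique URL of the form
# https://<yourstorageaccount>.blob.core.windows.net/<containername>/<filename>
blob = container.get_blob_client("sample.csv")
print(blob.url)

# Step 6 (optional): grant temporary read access with a Shared Access Signature.
sas = generate_blob_sas(
    account_name="<yourstorageaccount>",   # placeholder
    container_name=CONTAINER,
    blob_name="sample.csv",
    account_key="<your-account-key>",      # placeholder: from the Access keys page
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)
print(f"{blob.url}?{sas}")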

Next Steps:

  • Integrate this storage with your data pipeline so that ingested data lands in it automatically.
  • Set up versioning or replication (AWS S3 Cross-Region Replication or Azure Blob geo-redundancy) to ensure data durability.
  • Automate ingestion into these storages using tools like Apache Kafka, Apache NiFi, or custom ingestion scripts; a minimal Kafka-to-S3 sketch follows below.
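
For the automation step, one option is a small custom consumer that drains a Kafka topic and lands batches in the S3 bucket created earlier. This is a hedged sketch using the kafka-python and boto3 packages; the topic name, broker address, batch size, and bucket name are placeholders, not values defined in this lesson.

# Minimal sketch: consume messages from Kafka and write them to S3 in
# line-delimited batches. All names below are placeholders.
import time

import boto3
from kafka import KafkaConsumer  # pip install kafka-python

BUCKET = "data-ingestion-bucket"              # placeholder bucket

consumer = KafkaConsumer(
    "ingestion-topic",                        # placeholder topic
    bootstrap_servers="localhost:9092",       # placeholder broker
    auto_offset_reset="earliest",
)
s3 = boto3.client("s3")

batch = []
for message in consumer:
    batch.append(message.value.decode("utf-8"))
    if len(batch) >= 100:                     # flush every 100 records
        key = f"ingested-data/batch-{int(time.time())}.txt"
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(batch).encode("utf-8"))
        batch = []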

This hands-on experience helps you understand how to set up cloud-based storage for storing ingested data, a fundamental step in building scalable data pipelines.
