Course Content
Prerequisites for a Data Engineering Boot-camp
Preparing for a Data Engineering boot-camp can enhance your experience and success. Here are the core prerequisites:
Data Ingestion, Storage & Processing
Introduction to Data Engineering
Overview of Data Engineering in modern architectures. Data lifecycle and pipelines. Key technologies and trends (e.g., ETL, ELT, Batch Processing, Streaming). Activity: Discuss a real-world data pipeline use case.
Data Ingestion Techniques
Understanding structured, semi-structured, and unstructured data. Batch ingestion: Using Apache Sqoop, Talend. Streaming ingestion: Using Apache Kafka.
Data Storage Solutions
Relational databases (e.g., MySQL, PostgreSQL) vs. NoSQL databases (e.g., MongoDB, Cassandra). Cloud-based data storage (AWS S3, Azure Blob Storage). Choosing the right storage based on use cases.
Batch Processing with Apache Spark
Understanding Spark architecture. Loading and transforming data using Spark. Difference between RDDs, DataFrames, and Datasets. Activity: Run a sample batch processing job using Spark on a dataset.
Data Transformation, Orchestration & Monitoring
Data Transformation & ETL Tools
Understanding ETL vs. ELT. Using ETL tools: Talend, Apache NiFi, or Airflow. Data cleansing and transformation concepts. Activity: Create a data pipeline with Talend/Airflow for a simple ETL process.
Data Orchestration
Introduction to orchestration tools: Apache Airflow, AWS Step Functions. Creating workflows to manage complex pipelines. Managing dependencies and retries in workflows.
Data Engineering
About Lesson

Running a sample batch processing job using Apache Spark involves several steps, including setting up your environment, loading a dataset, performing transformations, and writing the output. Below, I’ll walk you through a simple example using PySpark, which is the Python API for Spark.

Activity Overview

In this activity, we will:

  1. Set up a Spark environment.
  2. Load a sample dataset (CSV file).
  3. Perform some transformations (filtering and aggregating).
  4. Write the results back to a file.

Prerequisites

  1. Apache Spark installed (or use a managed service like AWS EMR, Databricks, or Google Dataproc).
  2. Python and PySpark installed (if running locally).
  3. A sample dataset. For this example, you can create a simple CSV file with the following structure:

Sample Dataset: people.csv

csv
name,age,city
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
David,28,New York
Eva,45,Los Angeles
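
If you do not already have this file, a quick way to create it is with a few lines of plain Python (the file name people.csv and its location are just examples; adjust the path to wherever you plan to read it from):

python
# Write the sample dataset to a local CSV file (example path)
rows = [
    "name,age,city",
    "Alice,30,New York",
    "Bob,25,Los Angeles",
    "Charlie,35,Chicago",
    "David,28,New York",
    "Eva,45,Los Angeles",
]

with open("people.csv", "w") as f:
    f.write("\n".join(rows) + "\n")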

Step 1: Set Up Spark Environment

If you are running this locally, you can set up a Spark session in Python as follows:

python
from pyspark.sql import SparkSession

# Create a Spark session
spark = (
    SparkSession.builder
    .appName("Sample Batch Processing Job")
    .getOrCreate()
)
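
If you are running Spark on a single machine rather than a managed cluster, you can optionally point the session at a local master so the job uses all available cores. This is a minimal sketch of that variation; if you omit .master(), Spark uses whatever master your environment is already configured with:

python
# Optional: run Spark locally using all available cores
spark = (
    SparkSession.builder
    .appName("Sample Batch Processing Job")
    .master("local[*]")
    .getOrCreate()
)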

Step 2: Load the Dataset

Next, we’ll load the dataset into a Spark DataFrame:

python
# Load the dataset into a DataFrame
df = spark.read.csv("path/to/people.csv", header=True, inferSchema=True)

# Show the DataFrame
df.show()
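
As an optional sanity check, you can also print the inferred schema to confirm that the age column was read as an integer rather than a string:

python
# Optional: inspect the schema produced by inferSchema
df.printSchema()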

Step 3: Perform Transformations

Now, let’s perform some transformations. In this example, we will filter the DataFrame to include only people over the age of 30 and count the number of people in each city.

python
# Filter the DataFrame to include only people over the age of 30
filtered_df = df.filter(df.age > 30)

# Group by city and count the number of people in each city
result_df = filtered_df.groupBy("city").count()

# Show the result DataFrame
result_df.show()
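
An equivalent way to express the same filter and aggregation uses the pyspark.sql.functions module, which also lets you give the aggregate column an explicit alias. This is just an alternative sketch of the same logic, not an additional step:

python
from pyspark.sql import functions as F

# Same result: filter on age, then count rows per city with a named column
result_df = (
    df.filter(F.col("age") > 30)
      .groupBy("city")
      .agg(F.count("*").alias("count"))
)
result_df.show()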

Step 4: Write the Results

Finally, we will write the results back to a CSV file.

python
# Write the result DataFrame to a CSV file
result_df.write.csv("path/to/output/result.csv", header=True)
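
Note that Spark writes CSV output as a directory containing one part file per partition, not a single file. For a small result like this one, you can optionally coalesce to a single partition first (the output path below is just an example; avoid coalesce(1) on large results):

python
# Optional: write a single part file and overwrite any previous output (example path)
result_df.coalesce(1).write.mode("overwrite").csv("path/to/output/result_single.csv", header=True)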

Complete Code Example

Here’s the complete code snippet for running the entire batch processing job:

python
from pyspark.sql import SparkSession

# Step 1: Set up Spark environment
spark = (
    SparkSession.builder
    .appName("Sample Batch Processing Job")
    .getOrCreate()
)

# Step 2: Load the dataset
df = spark.read.csv("path/to/people.csv", header=True, inferSchema=True)

# Show the DataFrame
df.show()

# Step 3: Perform transformations
filtered_df = df.filter(df.age > 30) # Filter people over age 30
result_df = filtered_df.groupBy("city").count() # Count people by city

# Show the result DataFrame
result_df.show()

# Step 4: Write the results
result_df.write.csv("path/to/output/result.csv", header=True)

# Stop the Spark session
spark.stop()

Running the Code

  1. Replace "path/to/people.csv" and "path/to/output/result.csv" with the actual paths to your dataset and desired output location.
  2. Run the script in your preferred Python environment where Spark is configured.

Expected Output

After executing the above code, you should see an output similar to this on the console:

text
+-----------+-----+
|       city|count|
+-----------+-----+
|Los Angeles|    1|
|    Chicago|    1|
+-----------+-----+

And the output location (Spark writes result.csv as a directory containing one or more part files) should contain the aggregated results:

csv
city,count
Los Angeles,1
Chicago,1
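
As an optional final check, you can read the written output back into Spark and display it. Run this before calling spark.stop() (or create a new session first); the path is whatever you used in Step 4:

python
# Optional: read the output back to verify the aggregated results
check_df = spark.read.csv("path/to/output/result.csv", header=True, inferSchema=True)
check_df.show()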

Conclusion

You have successfully set up a simple batch processing job using Apache Spark! You can expand this example by using larger datasets, more complex transformations, or integrating it with other data sources and sinks.
