Course Content
Prerequisites for a Data Engineering Boot-camp
Preparing for a Data Engineering boot-camp can enhance your experience and success. Here are the core prerequisites:
Data Ingestion, Storage & Processing
Introduction to Data Engineering
Overview of Data Engineering in modern architectures. Data lifecycle and pipelines. Key technologies and trends (e.g., ETL, ELT, Batch Processing, Streaming). Activity: Discuss a real-world data pipeline use case.
Data Ingestion Techniques
Understanding structured, semi-structured, and unstructured data. Batch ingestion: Using Apache Sqoop, Talend. Streaming ingestion: Using Apache Kafka.
Data Storage Solutions
Relational databases (e.g., MySQL, PostgreSQL) vs. NoSQL databases (e.g., MongoDB, Cassandra). Cloud-based data storage (AWS S3, Azure Blob Storage). Choosing the right storage based on use cases.
Batch Processing with Apache Spark
Understanding Spark architecture. Loading and transforming data using Spark. Difference between RDDs, DataFrames, and Datasets. Activity: Run a sample batch processing job using Spark on a dataset.
Data Transformation, Orchestration & Monitoring
Data Transformation & ETL Tools
Understanding ETL vs. ELT. Using ETL tools: Talend, Apache NiFi, or Airflow. Data cleansing and transformation concepts. Activity: Create a data pipeline with Talend/Airflow for a simple ETL process.
Data Orchestration
Introduction to orchestration tools: Apache Airflow, AWS Step Functions. Creating workflows to manage complex pipelines. Managing dependencies and retries in workflows.
Data Engineering
About Lesson

Running a sample batch processing job using Apache Spark involves several steps, including setting up your environment, loading a dataset, performing transformations, and writing the output. Below, I’ll walk you through a simple example using PySpark, which is the Python API for Spark.

Activity Overview

In this activity, we will:

  1. Set up a Spark environment.
  2. Load a sample dataset (CSV file).
  3. Perform some transformations (filtering and aggregating).
  4. Write the results back to a file.

Prerequisites

  1. Apache Spark installed (or use a managed service like AWS EMR, Databricks, or Google Dataproc).
  2. Python and PySpark installed (if running locally).
  3. A sample dataset. For this example, you can create a simple CSV file with the following structure:

Sample Dataset: people.csv

csv
name,age,city
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
David,28,New York
Eva,45,Los Angeles
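
If you do not already have this file, a quick way to create it is with a few lines of plain Python (the file name people.csv and its location are just examples; adjust the path to wherever you plan to read it from):

python
# Write the sample dataset to a local CSV file (example path)
rows = [
    "name,age,city",
    "Alice,30,New York",
    "Bob,25,Los Angeles",
    "Charlie,35,Chicago",
    "David,28,New York",
    "Eva,45,Los Angeles",
]

with open("people.csv", "w") as f:
    f.write("\n".join(rows) + "\n")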

Step 1: Set Up Spark Environment

If you are running this locally, you can set up a Spark session in Python as follows:

python
from pyspark.sql import SparkSession

# Create a Spark session
spark = (
    SparkSession.builder
    .appName("Sample Batch Processing Job")
    .getOrCreate()
)
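
If you are running Spark on a single machine rather than a managed cluster, you can optionally point the session at a local master so the job uses all available cores. This is a minimal sketch of that variation; if you omit .master(), Spark uses whatever master your environment is already configured with:

python
# Optional: run Spark locally using all available cores
spark = (
    SparkSession.builder
    .appName("Sample Batch Processing Job")
    .master("local[*]")
    .getOrCreate()
)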

Step 2: Load the Dataset

Next, we’ll load the dataset into a Spark DataFrame:

python
# Load the dataset into a DataFrame
df = spark.read.csv("path/to/people.csv", header=True, inferSchema=True)

# Show the DataFrame
df.show()
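
As an optional sanity check, you can also print the inferred schema to confirm that the age column was read as an integer rather than a string:

python
# Optional: inspect the schema produced by inferSchema
df.printSchema()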

Step 3: Perform Transformations

Now, let’s perform some transformations. In this example, we will filter the DataFrame to include only people over the age of 30 and count the number of people in each city.

python
# Filter the DataFrame to include only people over the age of 30
filtered_df = df.filter(df.age > 30)

# Group by city and count the number of people in each city
result_df = filtered_df.groupBy("city").count()

# Show the result DataFrame
result_df.show()
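
An equivalent way to express the same filter and aggregation uses the pyspark.sql.functions module, which also lets you give the aggregate column an explicit alias. This is just an alternative sketch of the same logic, not an additional step:

python
from pyspark.sql import functions as F

# Same result: filter on age, then count rows per city with a named column
result_df = (
    df.filter(F.col("age") > 30)
      .groupBy("city")
      .agg(F.count("*").alias("count"))
)
result_df.show()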

Step 4: Write the Results

Finally, we will write the results back to a CSV file.

python
# Write the result DataFrame to a CSV file
result_df.write.csv("path/to/output/result.csv", header=True)
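
Note that Spark writes CSV output as a directory containing one part file per partition, not a single file. For a small result like this one, you can optionally coalesce to a single partition first (the output path below is just an example; avoid coalesce(1) on large results):

python
# Optional: write a single part file and overwrite any previous output (example path)
result_df.coalesce(1).write.mode("overwrite").csv("path/to/output/result_single.csv", header=True)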

Complete Code Example

Here’s the complete code snippet for running the entire batch processing job:

python
from pyspark.sql import SparkSession

# Step 1: Set up Spark environment
spark = (
    SparkSession.builder
    .appName("Sample Batch Processing Job")
    .getOrCreate()
)

# Step 2: Load the dataset
df = spark.read.csv("path/to/people.csv", header=True, inferSchema=True)

# Show the DataFrame
df.show()

# Step 3: Perform transformations
filtered_df = df.filter(df.age > 30) # Filter people over age 30
result_df = filtered_df.groupBy("city").count() # Count people by city

# Show the result DataFrame
result_df.show()

# Step 4: Write the results
result_df.write.csv("path/to/output/result.csv", header=True)

# Stop the Spark session
spark.stop()

Running the Code

  1. Replace "path/to/people.csv" and "path/to/output/result.csv" with the actual paths to your dataset and desired output location.
  2. Run the script in your preferred Python environment where Spark is configured.

Expected Output

After executing the above code, you should see an output similar to this on the console:

text
+-----------+-----+
|       city|count|
+-----------+-----+
|Los Angeles|    1|
|    Chicago|    1|
+-----------+-----+

And the output location (Spark writes result.csv as a directory containing one or more part files) should contain the aggregated results:

csv
city,count
Los Angeles,1
Chicago,1
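
As an optional final check, you can read the written output back into Spark and display it. Run this before calling spark.stop() (or create a new session first); the path is whatever you used in Step 4:

python
# Optional: read the output back to verify the aggregated results
check_df = spark.read.csv("path/to/output/result.csv", header=True, inferSchema=True)
check_df.show()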

Conclusion

You have successfully set up a simple batch processing job using Apache Spark! You can expand this example by using larger datasets, more complex transformations, or integrating it with other data sources and sinks.
