Exercise 1: Data Collection and Preprocessing
Objective: To understand the importance of clean data in machine learning.
Task:
-
Download Dataset: Use a simple dataset such as the Iris dataset or a sample dataset of your choice (e.g., student scores, product sales).
-
Create a Table in Excel:
- The dataset should contain various features (such as age, weight, height, or sales numbers) and a target column (such as category labels, pass/fail, etc.).
-
Data Cleaning:
- Remove any duplicates using the Excel feature Data → Remove Duplicates.
- Handle missing values by either filling them with the mean or median value (using AVERAGE or MEDIAN functions) or removing the rows with missing data.
-
Task:
- Create a new column “Age Group” by categorizing ages as “Young”, “Middle”, or “Old”. Use IF statements for classification.
- Example:
=IF(A2<30, "Young", IF(A2<60, "Middle", "Old"))
.
-
Questions:
- What kind of issues can missing or dirty data cause in a machine learning model?
- How does data preprocessing improve the quality of predictions?
Exercise 2: Data Visualization for Understanding Patterns
Objective: To visualize the dataset and understand potential patterns or relationships.
Task:
-
Use Excel Charts:
- After cleaning your dataset, create scatter plots or histograms to visualize relationships between variables. For example, if you’re working with a dataset on student scores, plot the relationship between hours studied and scores achieved.
- To create a scatter plot: Select your data → Insert → Scatter Plot.
-
Task:
- Create a histogram of a numerical feature (like “Age” or “Score”) to observe its distribution. Use Insert → Histogram.
-
Questions:
- What kind of patterns do you observe in the scatter plot (e.g., linear, non-linear)?
- What conclusions can you make about the relationship between the data points based on the histogram?
Exercise 3: Basic Regression Analysis
Objective: To introduce basic linear regression concepts using Excel.
Task:
-
Use Excel’s Built-in Linear Regression:
- If you have data like “Hours Studied” and “Exam Scores”, use Excel’s Data Analysis Toolpak to perform regression analysis.
-
Steps:
- Go to Data → Data Analysis → Select Regression.
- Input the “Hours Studied” as the X Range and “Exam Scores” as the Y Range.
- Click OK, and Excel will output a regression analysis, including coefficients, R-squared value, and other statistical data.
-
Task:
- Using the coefficients, create a regression line by plotting it in a scatter plot, and add the regression equation.
- Interpret the R-squared value to understand the goodness of fit.
-
Questions:
- What does the regression model suggest about the relationship between hours studied and exam scores?
- How can we interpret the R-squared value in terms of prediction accuracy?
Exercise 4: Decision Tree Concept using Excel
Objective: To introduce decision trees for classification tasks.
Task:
-
Prepare the Dataset: Create a simple dataset (e.g., students with features like “Study Hours”, “Grades”, “Attendance”) and a target column like “Pass/Fail”.
-
Manually Create a Simple Decision Tree:
- Use an IF-THEN logic in Excel to create a basic decision tree.
- For example, if “Study Hours” > 5, “Pass” else “Fail”.
- Nested IF statements can help create this tree structure.
-
Task:
- Apply the decision rule for different inputs.
- Add more complexity by adding another variable, like “Attendance”, and creating additional branches in the tree (e.g., if “Study Hours” > 5 and “Attendance” > 80%, “Pass”; otherwise, “Fail”).
-
Questions:
- How would the decision tree change with more features in the dataset?
- What are the advantages of using decision trees in machine learning?
Exercise 5: K-Means Clustering Simulation
Objective: To understand the concept of clustering and how K-Means works.
Task:
-
Prepare the Dataset: Create or use a simple dataset with features like “Height”, “Weight”, or “Income”, and assign random initial cluster centers.
-
Manual K-Means Implementation:
- Randomly assign initial cluster centroids (e.g., choose two rows as initial centroids).
- Calculate the Euclidean distance between each data point and the centroids using Excel’s SQRT and SUMSQ functions.
- Assign each data point to the closest centroid.
- Recalculate the centroids by averaging the data points assigned to each cluster.
- Repeat this process until the centroids stabilize.
-
Task:
- Create a new column to indicate cluster assignment.
- Use conditional formatting to visually differentiate the clusters in the dataset.
-
Questions:
- How do the centroids change after each iteration?
- What do you notice about the clusters once they stabilize? Are they well-separated?
Exercise 6: Basic Model Evaluation with Confusion Matrix
Objective: To understand the concept of model evaluation and metrics.
Task:
-
Create a Predicted vs. Actual Outcomes Table:
- Create a table with Predicted Outcome and Actual Outcome columns (e.g., predicted “Pass” or “Fail” vs. actual).
-
Build a Confusion Matrix:
- Use Excel formulas to create a confusion matrix by comparing predicted values against actual values.
- Formula for True Positives (TP):
=COUNTIFS(PredictedColumn, "Pass", ActualColumn, "Pass")
- Similarly calculate True Negatives (TN), False Positives (FP), and False Negatives (FN).
-
Calculate Evaluation Metrics:
- Accuracy:
(TP + TN) / Total
- Precision:
TP / (TP + FP)
- Recall:
TP / (TP + FN)
- F1-Score:
2 * (Precision * Recall) / (Precision + Recall)
- Accuracy:
-
Questions:
- How do the evaluation metrics reflect the performance of the model?
- What is the significance of precision and recall in classification models?