Introduction to AI
About Lesson

Exercise 1: Data Collection and Preprocessing

Objective: To understand the importance of clean data in machine learning.

Task:

  1. Download Dataset: Use a simple dataset such as the Iris dataset or a sample dataset of your choice (e.g., student scores, product sales).

  2. Create a Table in Excel:

    • The dataset should contain various features (such as age, weight, height, or sales numbers) and a target column (such as category labels, pass/fail, etc.).
  3. Data Cleaning:

    • Remove any duplicates using the Excel feature Data → Remove Duplicates.
    • Handle missing values by either filling them with the mean or median value (using AVERAGE or MEDIAN functions) or removing the rows with missing data.
  4. Task:

    • Create a new column “Age Group” by categorizing ages as “Young”, “Middle”, or “Old”. Use IF statements for classification.
    • Example: =IF(A2<30, "Young", IF(A2<60, "Middle", "Old")).
  5. Questions:

    • What kind of issues can missing or dirty data cause in a machine learning model?
    • How does data preprocessing improve the quality of predictions?

Exercise 2: Data Visualization for Understanding Patterns

Objective: To visualize the dataset and understand potential patterns or relationships.

Task:

  1. Use Excel Charts:

    • After cleaning your dataset, create scatter plots or histograms to visualize relationships between variables. For example, if you’re working with a dataset on student scores, plot the relationship between hours studied and scores achieved.
    • To create a scatter plot: Select your data → Insert → Scatter Plot.
  2. Task:

    • Create a histogram of a numerical feature (like “Age” or “Score”) to observe its distribution. Use Insert → Histogram.
  3. Questions:

    • What kind of patterns do you observe in the scatter plot (e.g., linear, non-linear)?
    • What conclusions can you make about the relationship between the data points based on the histogram?

Exercise 3: Basic Regression Analysis

Objective: To introduce basic linear regression concepts using Excel.

Task:

  1. Use Excel’s Built-in Linear Regression:

    • If you have data like “Hours Studied” and “Exam Scores”, use Excel’s Data Analysis Toolpak to perform regression analysis.
  2. Steps:

    • Go to Data → Data Analysis → Select Regression.
    • Input the “Hours Studied” as the X Range and “Exam Scores” as the Y Range.
    • Click OK, and Excel will output a regression analysis, including coefficients, R-squared value, and other statistical data.
  3. Task:

    • Using the coefficients, create a regression line by plotting it in a scatter plot, and add the regression equation.
    • Interpret the R-squared value to understand the goodness of fit.
  4. Questions:

    • What does the regression model suggest about the relationship between hours studied and exam scores?
    • How can we interpret the R-squared value in terms of prediction accuracy?

Exercise 4: Decision Tree Concept using Excel

Objective: To introduce decision trees for classification tasks.

Task:

  1. Prepare the Dataset: Create a simple dataset (e.g., students with features like “Study Hours”, “Grades”, “Attendance”) and a target column like “Pass/Fail”.

  2. Manually Create a Simple Decision Tree:

    • Use an IF-THEN logic in Excel to create a basic decision tree.
    • For example, if “Study Hours” > 5, “Pass” else “Fail”.
    • Nested IF statements can help create this tree structure.
  3. Task:

    • Apply the decision rule for different inputs.
    • Add more complexity by adding another variable, like “Attendance”, and creating additional branches in the tree (e.g., if “Study Hours” > 5 and “Attendance” > 80%, “Pass”; otherwise, “Fail”).
  4. Questions:

    • How would the decision tree change with more features in the dataset?
    • What are the advantages of using decision trees in machine learning?

Exercise 5: K-Means Clustering Simulation

Objective: To understand the concept of clustering and how K-Means works.

Task:

  1. Prepare the Dataset: Create or use a simple dataset with features like “Height”, “Weight”, or “Income”, and assign random initial cluster centers.

  2. Manual K-Means Implementation:

    • Randomly assign initial cluster centroids (e.g., choose two rows as initial centroids).
    • Calculate the Euclidean distance between each data point and the centroids using Excel’s SQRT and SUMSQ functions.
    • Assign each data point to the closest centroid.
    • Recalculate the centroids by averaging the data points assigned to each cluster.
    • Repeat this process until the centroids stabilize.
  3. Task:

    • Create a new column to indicate cluster assignment.
    • Use conditional formatting to visually differentiate the clusters in the dataset.
  4. Questions:

    • How do the centroids change after each iteration?
    • What do you notice about the clusters once they stabilize? Are they well-separated?

Exercise 6: Basic Model Evaluation with Confusion Matrix

Objective: To understand the concept of model evaluation and metrics.

Task:

  1. Create a Predicted vs. Actual Outcomes Table:

    • Create a table with Predicted Outcome and Actual Outcome columns (e.g., predicted “Pass” or “Fail” vs. actual).
  2. Build a Confusion Matrix:

    • Use Excel formulas to create a confusion matrix by comparing predicted values against actual values.
    • Formula for True Positives (TP): =COUNTIFS(PredictedColumn, "Pass", ActualColumn, "Pass")
    • Similarly calculate True Negatives (TN), False Positives (FP), and False Negatives (FN).
  3. Calculate Evaluation Metrics:

    • Accuracy: (TP + TN) / Total
    • Precision: TP / (TP + FP)
    • Recall: TP / (TP + FN)
    • F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
  4. Questions:

    • How do the evaluation metrics reflect the performance of the model?
    • What is the significance of precision and recall in classification models?
wpChatIcon
wpChatIcon