Excel based Assignments - KnowPowerSolutions

Introduction to AI

Objective: To understand the importance of clean data in machine learning.

Task:

Download Dataset: Use a simple dataset such as the Iris dataset or a sample dataset of your choice (e.g., student scores, product sales).
Create a Table in Excel:
- The dataset should contain various features (such as age, weight, height, or sales numbers) and a target column (such as category labels, pass/fail, etc.).
Data Cleaning:
- Remove any duplicates using the Excel feature Data → Remove Duplicates.
- Handle missing values by either filling them with the mean or median value (using AVERAGE or MEDIAN functions) or removing the rows with missing data.
Task:
- Create a new column “Age Group” by categorizing ages as “Young”, “Middle”, or “Old”. Use IF statements for classification.
- Example: =IF(A2<30, "Young", IF(A2<60, "Middle", "Old")).
Questions:
- What kind of issues can missing or dirty data cause in a machine learning model?
- How does data preprocessing improve the quality of predictions?

Objective: To visualize the dataset and understand potential patterns or relationships.

Task:

Use Excel Charts:
- After cleaning your dataset, create scatter plots or histograms to visualize relationships between variables. For example, if you’re working with a dataset on student scores, plot the relationship between hours studied and scores achieved.
- To create a scatter plot: Select your data → Insert → Scatter Plot.
Task:
- Create a histogram of a numerical feature (like “Age” or “Score”) to observe its distribution. Use Insert → Histogram.
Questions:
- What kind of patterns do you observe in the scatter plot (e.g., linear, non-linear)?
- What conclusions can you make about the relationship between the data points based on the histogram?

Objective: To introduce basic linear regression concepts using Excel.

Task:

Use Excel’s Built-in Linear Regression:
- If you have data like “Hours Studied” and “Exam Scores”, use Excel’s Data Analysis Toolpak to perform regression analysis.
Steps:
- Go to Data → Data Analysis → Select Regression.
- Input the “Hours Studied” as the X Range and “Exam Scores” as the Y Range.
- Click OK, and Excel will output a regression analysis, including coefficients, R-squared value, and other statistical data.
Task:
- Using the coefficients, create a regression line by plotting it in a scatter plot, and add the regression equation.
- Interpret the R-squared value to understand the goodness of fit.
Questions:
- What does the regression model suggest about the relationship between hours studied and exam scores?
- How can we interpret the R-squared value in terms of prediction accuracy?

Objective: To introduce decision trees for classification tasks.

Task:

Prepare the Dataset: Create a simple dataset (e.g., students with features like “Study Hours”, “Grades”, “Attendance”) and a target column like “Pass/Fail”.
Manually Create a Simple Decision Tree:
- Use an IF-THEN logic in Excel to create a basic decision tree.
- For example, if “Study Hours” > 5, “Pass” else “Fail”.
- Nested IF statements can help create this tree structure.
Task:
- Apply the decision rule for different inputs.
- Add more complexity by adding another variable, like “Attendance”, and creating additional branches in the tree (e.g., if “Study Hours” > 5 and “Attendance” > 80%, “Pass”; otherwise, “Fail”).
Questions:
- How would the decision tree change with more features in the dataset?
- What are the advantages of using decision trees in machine learning?

Objective: To understand the concept of clustering and how K-Means works.

Task:

Prepare the Dataset: Create or use a simple dataset with features like “Height”, “Weight”, or “Income”, and assign random initial cluster centers.
Manual K-Means Implementation:
- Randomly assign initial cluster centroids (e.g., choose two rows as initial centroids).
- Calculate the Euclidean distance between each data point and the centroids using Excel’s SQRT and SUMSQ functions.
- Assign each data point to the closest centroid.
- Recalculate the centroids by averaging the data points assigned to each cluster.
- Repeat this process until the centroids stabilize.
Task:
- Create a new column to indicate cluster assignment.
- Use conditional formatting to visually differentiate the clusters in the dataset.
Questions:
- How do the centroids change after each iteration?
- What do you notice about the clusters once they stabilize? Are they well-separated?

Objective: To understand the concept of model evaluation and metrics.

Task:

Create a Predicted vs. Actual Outcomes Table:
- Create a table with Predicted Outcome and Actual Outcome columns (e.g., predicted “Pass” or “Fail” vs. actual).
Build a Confusion Matrix:
- Use Excel formulas to create a confusion matrix by comparing predicted values against actual values.
- Formula for True Positives (TP): =COUNTIFS(PredictedColumn, "Pass", ActualColumn, "Pass")
- Similarly calculate True Negatives (TN), False Positives (FP), and False Negatives (FN).
Calculate Evaluation Metrics:
- Accuracy: (TP + TN) / Total
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
Questions:
- How do the evaluation metrics reflect the performance of the model?
- What is the significance of precision and recall in classification models?