Machine Learning Worksheet: Predicting Health Risks with Google Colab

In this project, you will use Google Colab to analyze health data and build a machine learning model to predict if someone is at risk of high blood pressure. You’ll explore data, train a model, and test it by entering your own values.

1. Understand the Goal

What’s this about? Machine learning helps computers learn from data to make predictions. In this project, you’ll use a health dataset to predict whether someone is at risk of high blood pressure (a condition where blood pushes too hard against the artery walls). The computer will look for patterns in five features: age, exercise hours, diet quality, stress level, and smoking status.

What do you think makes someone at risk for high blood pressure (e.g., smoking, stress)?

2. Set Up Google Colab

Google Colab is a free online tool that lets you write and run Python code in your browser, like a digital notebook. It’s great for machine learning because it has all the tools you need pre-installed.

What are you doing? You’re creating a new Colab notebook where you’ll write code to analyze data and build your model. You’ll learn how to use code cells (for Python) and text cells (for notes).

  1. Open a web browser and go to Google Colab (colab.research.google.com).
  2. Sign in with your Google account (ask me if you need help).
  3. Click File > New Notebook to start a new notebook.
  4. Name your notebook “Health ML Project” by clicking the title at the top.
  5. Add a code cell by clicking + Code or a text cell by clicking + Text in the toolbar. Code cells run Python code, while text cells are for notes or instructions.
  6. Click the play button next to a code cell to run it. The output (like a table or graph) will appear below the cell. For text cells, click outside the cell to save your text.

Remember: Google Colab runs Python code in “cells.” Use code cells for Python and text cells for explanations. The play button runs a code cell and shows its output right below it.

Figure: Google Colab interface showing code and text cells
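
If you want to check that your notebook works before moving on, here is a small practice cell you could try. It is only a sketch for warming up, and the variable names (message, hours_per_day, hours_per_week) are made up for this example.

```python
# A first practice cell: Python runs these lines from top to bottom
message = "Hello, Colab!"
print(message)

# Variables keep their values, so later lines (and later cells) can use them
hours_per_day = 24
hours_per_week = hours_per_day * 7
print("Hours in a week:", hours_per_week)
```

After you click the play button, the printed lines should appear directly under the cell.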

3. Load the Health Dataset

What’s this about? The dataset contains health information about 15 people, such as their age and whether they smoke.

What are you doing? You’re using Python code to create and display the dataset as a table. This lets you see the health features and the risk of high blood pressure for each person.

  1. Copy and paste this code into a new code cell in your Colab notebook.
  2. Click the play button to run the cell.
import pandas as pd

# Create a health dataset
data = {
    'Age': [25, 45, 30, 60, 35, 50, 28, 40, 55, 32, 48, 27, 62, 38, 44],
    'Exercise_Hours': [3, 1, 0, 2, 4, 1, 5, 2, 0, 3, 1, 4, 0, 2, 3],
    # Diet_Quality: 1=Poor, 2=Average, 3=Good
    'Diet_Quality': [3, 2, 1, 2, 3, 1, 3, 2, 1, 3, 2, 3, 1, 2, 3],
    # Stress_Level: 1=Low, 2=Medium, 3=High
    'Stress_Level': [2, 3, 3, 2, 1, 3, 1, 2, 3, 1, 2, 1, 3, 2, 1],
    # Smoking: 0=No, 1=Yes
    'Smoking': [0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    # High_BP_Risk: 0=No Risk, 1=Risk
    'High_BP_Risk': [0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)

# Show the first 5 rows
df.head()

# Show all rows
#df
    
Run the code and look at the table. df.head() shows the first 5 rows. In Python, a line that starts with # is a comment, so the computer skips it. To see all 15 rows instead, remove the # in front of df (this is called uncommenting) and run the cell again. Identify who has a high blood pressure risk and guess why based on their features (e.g., smoking or stress).
The dataset has 15 people. Each row includes Age, Exercise Hours, Diet Quality (1=Poor, 2=Average, 3=Good), Stress Level (1=Low, 2=Medium, 3=High), Smoking (0=No, 1=Yes), and High BP Risk (0=No Risk, 1=Risk).
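
If you would rather count than scan the table by eye, a cell like the one below is one way to do it. It is a sketch, not a required step: it rebuilds the same dataset so it can run by itself, and the names n_rows, n_cols, and risk_counts are made up for this example.

```python
import pandas as pd

# Same 15-person dataset as in the worksheet
df = pd.DataFrame({
    'Age': [25, 45, 30, 60, 35, 50, 28, 40, 55, 32, 48, 27, 62, 38, 44],
    'Exercise_Hours': [3, 1, 0, 2, 4, 1, 5, 2, 0, 3, 1, 4, 0, 2, 3],
    'Diet_Quality': [3, 2, 1, 2, 3, 1, 3, 2, 1, 3, 2, 3, 1, 2, 3],
    'Stress_Level': [2, 3, 3, 2, 1, 3, 1, 2, 3, 1, 2, 1, 3, 2, 1],
    'Smoking': [0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    'High_BP_Risk': [0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})

# Number of rows (people) and columns (5 features plus the target)
n_rows, n_cols = df.shape
print("Rows:", n_rows, "Columns:", n_cols)

# How many people are in each risk group (0=No Risk, 1=Risk)
risk_counts = df['High_BP_Risk'].value_counts()
print(risk_counts)

# Show only the people marked as at risk
print(df[df['High_BP_Risk'] == 1])
```

The last line uses a filter: df['High_BP_Risk'] == 1 is True for each at-risk person, and df[...] keeps only those rows.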

4. Explore the Data

What’s this about? Exploring data means looking for patterns, like whether people who exercise less have higher health risks. This helps you understand the data before building a model.

What are you doing? You’re using code to calculate statistics (like average age) and create a scatter plot to visualize if exercise hours relate to high blood pressure risk.

  1. Add a new code cell and paste this code to see statistics:
# Show summary statistics
df.describe()
    
  2. Add another code cell for a scatter plot to see if exercise hours relate to risk:
import matplotlib.pyplot as plt

# Scatter plot
plt.scatter(df['Exercise_Hours'], df['High_BP_Risk'], color='blue')
plt.xlabel('Exercise Hours per Week')
plt.ylabel('High BP Risk (1=Yes, 0=No)')
plt.title('Exercise Hours vs. High BP Risk')
plt.show()
    
Run both cells. What’s the average age? Does more exercise mean lower risk? Try plotting Age instead of Exercise_Hours. What do you see?
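
A scatter plot with only two possible y-values can be hard to read because points land on top of each other. As an optional extra, you could compare the groups with numbers instead. The sketch below copies just three columns from the dataset so it runs by itself; the names means and table are made up for this example.

```python
import pandas as pd

# Three columns copied from the worksheet dataset
# (in your notebook you can skip this part and use your existing df)
df = pd.DataFrame({
    'Exercise_Hours': [3, 1, 0, 2, 4, 1, 5, 2, 0, 3, 1, 4, 0, 2, 3],
    'Smoking': [0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    'High_BP_Risk': [0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})

# Average exercise hours for each risk group (0=No Risk, 1=Risk)
means = df.groupby('High_BP_Risk')['Exercise_Hours'].mean()
print(means)

# Count people by smoking status (rows) and risk group (columns)
table = pd.crosstab(df['Smoking'], df['High_BP_Risk'])
print(table)
```

Compare the two averages, then look at the table. Do the numbers agree with what you saw in the plot?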

5. Prepare the Data

What’s this about? Machine learning models need data split into inputs (features like age) and outputs (what you’re predicting, like high blood pressure risk). You also split the data into training (to teach the model) and testing (to check its performance).

What are you doing? You’re organizing the dataset so the model can learn from features and predict the risk, and dividing the data to ensure you can test the model fairly.

  1. Add a new code cell and paste this code:
from sklearn.model_selection import train_test_split

# Features (inputs) and target (output)
X = df[['Age', 'Exercise_Hours', 'Diet_Quality', 'Stress_Level', 'Smoking']]
y = df['High_BP_Risk']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check sizes
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)
    
Run the code. How many people are in the training set? The testing set?
You’re using 80% of the data to train the model and 20% to test it.
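
With so few people, a random split can sometimes put people from only one risk group into the test set. One optional variation, which is not part of the worksheet's code, is the stratify argument of train_test_split. It keeps the mix of risk and no-risk people similar in both sets. The sketch below uses new variable names ending in _s (made up for this example) so it does not overwrite your own split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same 15-person dataset as in the worksheet
df = pd.DataFrame({
    'Age': [25, 45, 30, 60, 35, 50, 28, 40, 55, 32, 48, 27, 62, 38, 44],
    'Exercise_Hours': [3, 1, 0, 2, 4, 1, 5, 2, 0, 3, 1, 4, 0, 2, 3],
    'Diet_Quality': [3, 2, 1, 2, 3, 1, 3, 2, 1, 3, 2, 3, 1, 2, 3],
    'Stress_Level': [2, 3, 3, 2, 1, 3, 1, 2, 3, 1, 2, 1, 3, 2, 1],
    'Smoking': [0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    'High_BP_Risk': [0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})
X = df[['Age', 'Exercise_Hours', 'Diet_Quality', 'Stress_Level', 'Smoking']]
y = df['High_BP_Risk']

# stratify=y asks for a similar share of each risk group in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set size:", X_train_s.shape)
print("Testing set size:", X_test_s.shape)
print("Risk groups in the test set:", sorted(y_test_s.unique()))
```

The rest of this worksheet keeps using the plain split from the cell above, so you do not need this variation to continue.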

6. Train the Model

What’s this about? Training a model means teaching the computer to find patterns in the data, like how smoking might increase health risks. A Decision Tree model makes predictions by following a series of yes/no questions, like a flowchart.

What are you doing? You’re using a Decision Tree to learn from the training data and make predictions on the test data to see if it works.

  1. Add a new code cell and paste this code:
from sklearn.tree import DecisionTreeClassifier

# Create and train the model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Show predictions
print("Predictions:", y_pred)
print("Actual:", y_test.values)
    
Run the code. Do the predictions match the actual results?
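
To see the yes/no questions your Decision Tree learned, you could print them as text. The sketch below uses export_text from sklearn.tree. It repeats the earlier steps so it can run by itself, and the names feature_names and rules are made up for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Same 15-person dataset as in the worksheet
df = pd.DataFrame({
    'Age': [25, 45, 30, 60, 35, 50, 28, 40, 55, 32, 48, 27, 62, 38, 44],
    'Exercise_Hours': [3, 1, 0, 2, 4, 1, 5, 2, 0, 3, 1, 4, 0, 2, 3],
    'Diet_Quality': [3, 2, 1, 2, 3, 1, 3, 2, 1, 3, 2, 3, 1, 2, 3],
    'Stress_Level': [2, 3, 3, 2, 1, 3, 1, 2, 3, 1, 2, 1, 3, 2, 1],
    'Smoking': [0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    'High_BP_Risk': [0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})
feature_names = ['Age', 'Exercise_Hours', 'Diet_Quality', 'Stress_Level', 'Smoking']
X = df[feature_names]
y = df['High_BP_Risk']

# Same split and model settings as the worksheet cells
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Print the learned rules; each indented line is one yes/no question
rules = export_text(model, feature_names=feature_names)
print(rules)
```

Each line that ends in "class: 0" or "class: 1" is a final answer. Which features does the tree ask about, and which ones does it ignore?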

7. Check the Model’s Accuracy

What’s this about? Accuracy measures how often the model’s predictions are correct. A higher percentage means the model is better at predicting high blood pressure risk.

What are you doing? You’re calculating the accuracy of your model by comparing its predictions to the actual test data results.

  1. Add a new code cell and paste this code:
from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Round to one decimal place so the percentage is easy to read
print(f"Model accuracy: {accuracy * 100:.1f}%")
    
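Run the code. Is the accuracy high? What does this mean? Keep in mind that the test set has only 3 people, so a single wrong prediction lowers the accuracy by about 33 percentage points.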
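
Accuracy is just the number of correct predictions divided by the total number of predictions. The sketch below works it out by hand and compares the result with accuracy_score. The two lists are made-up values for illustration only; they are not output from your model.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only (not results from the worksheet's model)
actual = [1, 0, 1]
predicted = [1, 0, 0]

# Count the positions where the prediction equals the actual value
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
manual_accuracy = correct / len(actual)

print(f"Correct: {correct} out of {len(actual)}")
print(f"Accuracy by hand: {manual_accuracy * 100:.1f}%")
print(f"accuracy_score:   {accuracy_score(actual, predicted) * 100:.1f}%")
```

Both lines should show the same percentage, because accuracy_score does the same division for you.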

8. Test Your Model with Your Inputs

What’s this about? Once the model is trained, you can use it to make predictions for new people by entering their health details. This shows how machine learning can be applied to real-world scenarios.

What are you doing? You’re entering values (like age or diet quality) to see if the model predicts a high blood pressure risk for a new person, testing its practical use.

  1. Add a new code cell and paste this code to test the model:
# Get inputs for a new person
print("Enter health details:")
age = float(input("Age (years): "))
exercise_hours = float(input("Exercise Hours per Week: "))
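# .strip() removes accidental spaces so an answer like " good " still works
diet_quality = input("Diet Quality (poor, average, good): ").strip()
stress_level = input("Stress Level (low, medium, high): ").strip()
smoking = input("Smoking (yes, no): ").strip()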

# Convert text to numbers
diet_map = {'poor': 1, 'average': 2, 'good': 3}
stress_map = {'low': 1, 'medium': 2, 'high': 3}
smoking_map = {'no': 0, 'yes': 1}
diet_num = diet_map[diet_quality.lower()]
stress_num = stress_map[stress_level.lower()]
smoking_num = smoking_map[smoking.lower()]

# Create new data
new_person = pd.DataFrame({
    'Age': [age],
    'Exercise_Hours': [exercise_hours],
    'Diet_Quality': [diet_num],
    'Stress_Level': [stress_num],
    'Smoking': [smoking_num]
})

# Predict
prediction = model.predict(new_person)
print("Prediction:", "At Risk of High BP" if prediction[0] == 1 else "Not At Risk")
    
Run the code and enter values (e.g., Age: 30, Exercise Hours: 2, Diet Quality: average, Stress Level: low, Smoking: no). What’s the prediction? Try different values (e.g., Smoking: yes). Does it change?
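
The cell above looks up each typed answer in a dictionary, so an answer that is not in the dictionary (for example "great" instead of "good") stops the cell with a KeyError. One possible way to get a friendlier message is a small helper function. This is only a sketch: parse_choice is a made-up name, not part of pandas or scikit-learn.

```python
def parse_choice(text, mapping):
    """Turn a typed answer into its number, ignoring case and extra spaces."""
    key = text.strip().lower()
    if key not in mapping:
        options = ", ".join(mapping)
        raise ValueError(f"'{text}' is not one of: {options}")
    return mapping[key]


diet_map = {'poor': 1, 'average': 2, 'good': 3}

# A valid answer with extra spaces and capital letters still works
print(parse_choice("  Good ", diet_map))

# An answer that is not in the dictionary gives a readable message
try:
    parse_choice("great", diet_map)
except ValueError as err:
    print("Problem:", err)
```

In your notebook, you could use it like diet_num = parse_choice(diet_quality, diet_map), and do the same for stress level and smoking.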

9. Save and Share

Saving your work ensures you can return to it later or share it with others. Colab stores notebooks in Google Drive, making it easy to share.

  1. Click File > Save in Colab to save your notebook.
  2. Click Share in the top-right corner and share the link with your instructor, or download it as a .ipynb file.

10. Wrap-Up and Next Steps

Great job! You’ve built and tested a machine learning model in Google Colab. You learned how to load and explore a health dataset, train a Decision Tree model to predict high blood pressure risk, and test it with your own inputs. You also visualized data patterns, checked model accuracy, and thought about what makes predictions work (or not). These skills (handling data, building models, and making predictions) are the foundation of data science!

Now it’s time to apply what you’ve learned to a new challenge. Download a machine learning task from the Materials Section of the class website and use your skills to tackle it. This task will let you experiment with new data or predictions, just like a real data scientist.