Feb 7, 2025
1. Introduction
Have you ever wondered if a person’s biosignals could reveal their smoking status? In this blog post, we’ll walk through an end-to-end data science pipeline to predict whether an individual is a smoker, covering exploratory data analysis (EDA), feature engineering, and modeling. The dataset comes from a Kaggle competition, Binary Prediction of Smoker Status using Bio-Signals, and our goal is to classify individuals as either Smoker (1) or Non-Smoker (0) based on various physiological features.
2. Understanding the Problem and the Dataset
Target Variable: Smoker status (Smoker = 0 or 1).
Features (Bio-Signals): Multiple physiological measurements (e.g., blood pressure, heart rate, or other biosignals).
Objective: Use these signals to build a model that accurately predicts who smokes and who does not.
Key Questions to Explore
Which biosignals are the most predictive of smoking status?
Are there outliers or anomalies in the data that might affect model performance?
Which preprocessing and feature engineering strategies lead to the best predictive model?
3. Exploratory Data Analysis (EDA)
3.1 Data Overview
After loading the dataset into a Pandas DataFrame, we typically run the following checks:
Shape of the dataset: Number of rows and columns.
Missing values: Identify any columns with large proportions of missing data.
Data types: Ensure numeric columns are indeed numeric and categorical columns are recognized as categories.
3.2 Visualizations
Distributions: Histograms and boxplots show the spread of each biosignal and help detect outliers.
Correlation Heatmap: A correlation matrix identifies which features strongly correlate with each other, and potentially with the Smoker variable.
Key Insights:
Some biosignals might have a high correlation with each other, indicating possible redundancy.
Certain signals may differ markedly between smokers and non-smokers, hinting at predictive potential.
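In code, the overview checks and the correlation look sketched above might look like this. The DataFrame here is a synthetic stand-in with illustrative column names; in practice you would load the actual Kaggle training file.

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the Kaggle training data; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "systolic": rng.normal(120, 15, 500),
    "hemoglobin": rng.normal(14, 1.5, 500),
    "smoking": rng.integers(0, 2, 500),
})

print(df.shape)          # shape: rows and columns
print(df.isna().sum())   # missing values per column
print(df.dtypes)         # data types

# Correlation matrix (the input to a seaborn/matplotlib heatmap).
corr = df.corr(numeric_only=True)
print(corr["smoking"].sort_values(ascending=False))
```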
4. Feature Engineering
4.1 Handling Missing Values
Imputation: If missing data is not extensive, we can impute it using strategies such as mean, median, or mode.
Dropping Columns: If a feature has too many missing values, it might be best to remove it altogether.
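A minimal sketch of both strategies, using a tiny made-up DataFrame (the 50% drop threshold and median imputation are illustrative choices, not the only reasonable ones):

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "heart_rate": [72, np.nan, 80, 65],           # sparse missingness -> impute
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # mostly missing -> drop
})

# Drop columns where more than half the values are missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# Impute what remains with the column median.
imputer = SimpleImputer(strategy="median")
df[df.columns] = imputer.fit_transform(df)
print(df)
```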
4.2 Transformations
Scaling: Many models perform better if data is standardized. We might use a StandardScaler or MinMaxScaler.
Encoding: If we have categorical features (e.g., gender, region), we might need to encode them (e.g., One-Hot Encoding).
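Both transformations can be combined in a single ColumnTransformer so numeric and categorical columns are handled in one pass. The column names here are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "systolic": [110.0, 130.0, 125.0, 140.0],  # numeric biosignal
    "gender": ["F", "M", "M", "F"],            # hypothetical categorical feature
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["systolic"]),                       # scale to mean 0, std 1
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),   # one column per category
])
X = pre.fit_transform(df)
print(X.shape)  # one scaled column + two one-hot columns
```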
4.3 Feature Selection
Statistical Approaches (e.g., ANOVA, chi-square test).
Model-based approaches (e.g., feature importance from a random forest).
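Both selection routes above can be sketched on synthetic data (a stand-in for the real biosignal matrix):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 8 features, of which 3 are informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Statistical: ANOVA F-test keeps the k highest-scoring features.
selector = SelectKBest(f_classif, k=3).fit(X, y)
print(selector.get_support())  # boolean mask of kept features

# Model-based: random-forest feature importances (sum to 1).
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)
```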
5. Modeling
5.1 Train-Test Split
We partition our data into training and testing sets to evaluate how well the model generalizes.
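A standard split looks like the following; stratifying on the target keeps the smoker/non-smoker ratio equal in both sets (the data is again a synthetic stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# 80/20 split, stratified so both sets preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```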
5.2 Model Selection
We test a few popular classification algorithms:
Logistic Regression
Random Forest Classifier
Gradient Boosting Classifier
Support Vector Machine (SVM)
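One simple way to compare these four candidates is a cross-validation loop over default configurations (synthetic data again; on the real dataset you would fit on the preprocessed training split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```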
5.3 Hyperparameter Tuning
Using GridSearchCV or RandomizedSearchCV helps identify the best set of hyperparameters for each model.
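A minimal GridSearchCV sketch for the random forest; the parameter grid here is deliberately tiny and illustrative, not a recommended search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {"n_estimators": [100, 200], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",  # F1 is a sensible default if classes are imbalanced
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```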
6. Results and Evaluation
Accuracy: Measures the proportion of correct predictions.
Precision and Recall: Especially helpful if the classes are imbalanced.
F1 Score: Harmonic mean of precision and recall.
Confusion Matrix: Gives a complete picture of how the model is classifying positive (smoker) vs. negative (non-smoker) cases.
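All four metrics and the confusion matrix are one call each in scikit-learn; the labels below are a small made-up example, with smoker = 1 as the positive class:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # actual smoker status
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows = true class, cols = predicted class
```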
Finding: Often, ensembles like Random Forest or Gradient Boosting outperform simpler models (like logistic regression) for such tasks, but logistic regression can provide more interpretability.
7. Key Takeaways
EDA is Vital: Understanding distributions, correlations, and potential data issues up front saves a lot of headaches later.
Feature Engineering: Proper handling of missing values, outliers, and scaling can significantly boost model performance.
Ensemble Methods Excel: Random Forest and other ensemble techniques often yield high accuracy for binary classification.
Interpretability Matters: While black-box models can be accurate, simpler models can offer actionable insights into which factors most influence smoking status.
8. Conclusion and Next Steps
Predicting whether an individual is a smoker or not using biosignals is a fascinating classification problem. Our approach involved a thorough EDA phase, strategic feature engineering, and iterative modeling with hyperparameter tuning. The results show promising accuracy, indicating that certain biosignals indeed carry strong predictive power regarding smoker status.
Future directions might include:
Collecting more diverse data to ensure generalizability.
Exploring advanced feature selection or deep learning approaches.
Incorporating domain knowledge (e.g., medical or public health expertise) to interpret findings more effectively.
To follow along with the full notebook, code, and visualizations, click here.
Thank You for Reading!
If you found this post helpful, feel free to share your thoughts in the comments. Happy coding and modeling!
References
Kaggle Competition: Binary Prediction of Smoker Status using Bio-Signals
Scikit-learn Documentation: https://scikit-learn.org/stable/documentation.html