
Feb 11, 2024

📈 Linear Regression vs Logistic Regression for Beginners

Introduction

This is the first article in my series, Understanding the Essentials. I am starting this series to practice and strengthen the Machine Learning and Data Science concepts I am learning.

Over the series, I will use a lot of frameworks, code and software, which I will list at the end. For example, for this article I wrote a draft in Overleaf, as I wanted to learn it for my Capstone project. I then looked for datasets on Kaggle that fit the topic and implemented a notebook. After that, I put everything on GitHub to document the process and published it on Medium.

Moving on to explaining the topic.

Regression Explained

Regression is a statistical method used to understand the relationship between dependent (target) and independent (predictor) variables. It is primarily used for prediction and forecasting, where one variable is predicted based on the information available from other variables.

To give you a better understanding of independent and dependent variables:

Independent Variables

Independent variables are also known as predictor or explanatory variables. These are variables that are manipulated or selected by the researcher (in this article, by me) to investigate their effect on the dependent variable. They are considered the cause or input in an experiment or model. Some of their characteristics are:

They are assumed to be the cause of changes in the dependent variable.
In a study or model, these variables can be controlled or changed to observe how they affect the outcome.
Independent variables are plotted on the x-axis in graphical representations.

Dependent Variables

Dependent Variables are also known as response or outcome variables. These are the ones that are being tested or measured in a model. They are considered the effect or output that occurs as a result of changes in the independent variables.

Note:

They can be viewed as cause and effect. If the Independent variables are changed, then an effect is seen in the dependent variable.

To get a clearer picture, let's apply this to a dataset.

Dataset

We will be using a dataset called Data for Admission in University for this article. Pretty straightforward. It can be found here.

Since we have learned about dependent and independent variables, we can classify the variables in the dataset as follows:

Independent Variables from the data:

GRE score — Continuous Variable, ranging up to 340
TOEFL score — Continuous Variable, ranging up to 120
University rating — Categorical Variable, rated from 1 to 5
SOP (Statement of Purpose) — Continuous or Ordinal, rated from 1 to 5
LOR (Letter of Recommendation) — Continuous or Ordinal, rated from 1 to 5
Undergrad GPA — Continuous variable, scaled up to 10
Research Experience — Binary Variable, represented as 0 (no experience) or 1 (has experience)

Dependent Variable from the data:

Chance of Admit — Continuous Variable, ranging from 0 to 1, representing the probability of Admission

Let’s look at the data

Initial impression looks like this:

If you want to follow along with the code, the link to the notebook is here.

Just by looking at this, we can draw out some insights:

All the columns are numeric.
We can drop the column “Serial No.” as its values are just unique identifiers and will not help our model.
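The two observations above can be checked in code. This is a minimal sketch on a hypothetical three-row stand-in for the dataset (the values in `sample_df` are illustrative, not taken from the real file); only the column names follow the ones used in this article.

```python
import pandas as pd

# Hypothetical stand-in for the admissions data; the values are illustrative only.
sample_df = pd.DataFrame({
    "Serial No.": [1, 2, 3],
    "GRE Score": [320, 310, 330],
    "CGPA": [8.5, 7.9, 9.2],
    "Chance of Admit ": [0.75, 0.60, 0.90],
})

# Confirm that every column has a numeric dtype before modelling.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in sample_df.dtypes)
print("All columns numeric:", all_numeric)

# Drop the identifier column; it only labels rows and carries no predictive signal.
features_df = sample_df.drop(columns=["Serial No."])
print(features_df.columns.tolist())
```

`drop` returns a new DataFrame, so the original stays untouched unless you reassign it.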

To understand the correlation between the variables we can do lots of things, but a heatmap is apt in this situation:

This tells us which variables are most strongly correlated with our dependent variable and whether they are likely to help in building our model.

We can see that CGPA is highly correlated with Chance of Admit, whereas Research is the least correlated.
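For readers who want to see where the numbers in such a heatmap come from, here is a rough sketch. It uses synthetic data that I generate so that CGPA tracks the target closely and Research only weakly; `toy_df` and all its values are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: CGPA is built to drive the target, Research adds only a small bump.
rng = np.random.default_rng(0)
n = 200
cgpa = rng.uniform(6.5, 10.0, n)
research = rng.integers(0, 2, n)
chance = 0.1 * cgpa - 0.2 + 0.02 * research + rng.normal(0, 0.03, n)

toy_df = pd.DataFrame({"CGPA": cgpa, "Research": research, "Chance of Admit ": chance})

# Pearson correlation between every pair of columns.
corr = toy_df.corr()

# Correlation of each predictor with the target, strongest first.
target_corr = corr["Chance of Admit "].drop("Chance of Admit ").sort_values(ascending=False)
print(target_corr)

# With seaborn installed, sns.heatmap(corr, annot=True) would draw this matrix as a heatmap.
```

Keep in mind that correlation only measures how strongly two columns move together in a straight line; it does not prove that one causes the other.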

Linear Regression Explained

Linear Regression (LR) is a predictive modeling technique that models the relationship between a dependent variable and one or more independent variables as a linear (straight-line) relationship.

We can use LR to predict the “Chance of Admit” based on the independent variables. This approach treats the chance of admit as a continuous outcome that can be predicted by the applicant’s profile, exam scores and experience. The LR model will help us in understanding how each variable contributes to the chance of admission and which have the most impact.
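To illustrate how a fitted model shows each variable's contribution, here is a small sketch on synthetic data. The weights used to generate `y_toy` are assumptions I chose for this toy example, so we can check that the model recovers them; they say nothing about the real admissions data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic predictors in roughly the same ranges as the real columns.
rng = np.random.default_rng(42)
n = 300
X_toy = pd.DataFrame({
    "GRE Score": rng.uniform(290, 340, n),
    "CGPA": rng.uniform(6.5, 10.0, n),
})

# Assumed ground truth for this toy example only, plus a little noise.
y_toy = 0.002 * X_toy["GRE Score"] + 0.1 * X_toy["CGPA"] - 0.8 + rng.normal(0, 0.01, n)

toy_model = LinearRegression().fit(X_toy, y_toy)

# One coefficient per column: the predicted change in y for a one-unit
# increase in that column while the other columns stay fixed.
coefficients = pd.Series(toy_model.coef_, index=X_toy.columns)
print(coefficients)
print("Intercept:", toy_model.intercept_)
```

Each coefficient is expressed in the units of its own column, so a small coefficient on GRE Score (which spans about 50 points) is not directly comparable to a larger one on CGPA (which spans only a few points). Standardizing the columns first is one common way to compare impact.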

Data Science Pipeline

The step-by-step process is as follows:

Start by Importing Libraries

import pandas as pd
# seaborn and matplotlib are needed for the plot at the end
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Do a bit of Exploratory Data Analysis
Select the dependent and independent variables

# Selecting the dependent and the independent variables
# Note: the column names 'LOR ' and 'Chance of Admit ' are used with a trailing space
X = admission_df[['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research']]
y = admission_df['Chance of Admit ']

Split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Initialize and fit the model

lr = LinearRegression()
lr.fit(X_train, y_train)

Predict, Evaluate and plot the model

pred1 = lr.predict(X_test)
mse = mean_squared_error(y_test, pred1)
r2 = r2_score(y_test, pred1)

# Shows the R^2 score for training and testing data
print('Score For Train Data : {}'.format(lr.score(X_train, y_train)))
print('Score For Test Data : {}'.format(lr.score(X_test, y_test)))

# Plot showcasing how well the model fitted the testing data
sns.scatterplot(x=y_test, y=pred1)
plt.xlabel('y_test')
plt.ylabel('predictions')
plt.title('Actual test data vs Model predictions')
plt.show()

The output looks like this:
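To make the two metrics used in the evaluation step more concrete, here is a small sketch that computes MSE and R² by hand and compares them with scikit-learn. The arrays `y_true` and `y_pred` are made-up numbers for illustration, not results from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([0.90, 0.65, 0.72, 0.50, 0.81])
y_pred = np.array([0.85, 0.70, 0.70, 0.55, 0.80])

# MSE: the average squared difference between actual and predicted values.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, "R^2:", r2_manual)

# Cross-check the hand-written formulas against scikit-learn.
print("MSE matches sklearn:", np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print("R^2 matches sklearn:", np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

For a `LinearRegression` model, `score()` returns this same R² value, which is why it reads as "share of variance explained" and not as a classification accuracy.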

Logistic Regression Explained

If we want to categorize the outcome (for example, admitted vs not admitted), we could convert “Chance of Admit” into a binary variable (for example, 0 if the chance is below a certain threshold like 0.5, and 1 if it is at or above it). Then we can use logistic regression to predict admission status based on the same set of independent variables.

In Python, it would look something like this:

# Convert 'Chance of Admit ' to a binary outcome
admission_df['Admit_Binary'] = (admission_df['Chance of Admit '] >= 0.5).astype(int)

The rest of the process is similar to linear regression, with a few syntax changes shown in the notebook!
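As a rough idea of what those remaining steps can look like, here is a minimal sketch. It is not the notebook's code: the data in `toy_df` is synthetic, the two-column feature set is an assumption to keep the example short, and `stratify=y` and `max_iter=1000` are my own choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the admissions data; values are generated, not real.
rng = np.random.default_rng(7)
n = 400
toy_df = pd.DataFrame({
    "CGPA": rng.uniform(6.5, 10.0, n),
    "Research": rng.integers(0, 2, n),
})
chance = 0.2 * toy_df["CGPA"] - 1.1 + 0.05 * toy_df["Research"] + rng.normal(0, 0.05, n)

# Same thresholding idea as in the article: 1 if chance >= 0.5, else 0.
toy_df["Admit_Binary"] = (chance >= 0.5).astype(int)

X = toy_df[["CGPA", "Research"]]
y = toy_df["Admit_Binary"]

# stratify=y keeps the class proportions similar in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# predict() returns class labels; predict_proba() returns class probabilities.
pred_labels = clf.predict(X_test)
pred_probs = clf.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, pred_labels)
print("Test accuracy:", accuracy)
```

On a real dataset it is worth checking how many rows fall on each side of the threshold (for example with `y.value_counts()`), because a very uneven split can make accuracy look better than the model really is.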

Software Used

Overleaf: It’s an online LaTeX editor. I was tired of using Word and wanted to try something new, and I wasn’t disappointed.

Kaggle: It’s the world’s largest data science community, with a lot of datasets. It has competitions and forums as well.

Conclusion

Both models provide valuable insights: the linear regression model quantifies the relationship between each predictor and the chance of admit, while the logistic regression model categorizes applicants into admitted or not based on their profile.

These analyses can help in understanding what factors are most important for admission and how changes in these factors affect the probability of being admitted. I hope you enjoyed this article. Follow for more and keep checking back for interesting stuff.

This concludes our article. If you have any doubts about the concepts, please feel free to reach out to me! Thank you!
