Introduction:

Cybersecurity is becoming increasingly important in today’s world. With the growing number of cyber threats, security experts need to find new and effective methods to combat them. Machine learning plays a significant role in addressing such threats in cybersecurity. In this article, we will focus on anomaly detection in cybersecurity using machine learning techniques.

Learning Objectives:

Understand the fundamentals of machine learning
Learn machine learning techniques for anomaly detection
Develop an anomaly detection model using the Python programming language and the Scikit-learn library

Purpose of this Project:

The purpose of this project is to develop a machine learning model that detects anomalies in network traffic data. The model uses the Local Outlier Factor (LOF) algorithm, a density-based method built on k-nearest-neighbor distances, to distinguish between normal and abnormal behavior in a network. To keep the example self-contained, we work with synthetic data that stands in for real traffic features (see also Machine Learning in Network Security: Preventing Cyber Attacks).

Importing Libraries:

In this section, we import the Python libraries we’ll use in the project: Pandas for data processing, NumPy for numerical computation, and Scikit-learn for the anomaly detection algorithm, preprocessing, data splitting, and evaluation metrics.

import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

Creating the Dataset:

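Here, we generate a sample dataset of 1000 normal and 50 abnormal data points, each with two features. Normal points are drawn from a normal distribution with mean 0 and standard deviation 1. Abnormal points are drawn with mean 5 and standard deviation 1, so the two groups are well separated. Because no random seed is set, the exact values change on every run.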

# Creating the dataset
data = {
    'feature1': np.random.normal(0, 1, 1000).tolist() + np.random.normal(5, 1, 50).tolist(),
    'feature2': np.random.normal(0, 1, 1000).tolist() + np.random.normal(5, 1, 50).tolist()
}
df = pd.DataFrame(data)
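
If you want the same dataset on every run, one option is to draw the samples from a seeded generator. The sketch below is a minimal variant of the step above; the helper name make_dataset, its parameters, and the seed value 42 are assumptions made for illustration and are not part of the original project.

```python
import numpy as np
import pandas as pd


def make_dataset(n_normal=1000, n_abnormal=50, seed=42):
    """Build the two-feature toy dataset with a fixed seed (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Normal points: centered at 0 on both features
    normal = rng.normal(loc=0, scale=1, size=(n_normal, 2))
    # Abnormal points: centered at 5 on both features
    abnormal = rng.normal(loc=5, scale=1, size=(n_abnormal, 2))
    features = np.vstack([normal, abnormal])
    return pd.DataFrame(features, columns=['feature1', 'feature2'])


df_seeded = make_dataset()
print(df_seeded.shape)
```

Calling make_dataset() twice with the same seed returns identical frames, which makes the later evaluation numbers comparable between runs.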

Labeling Anomalies:

We label the samples in the dataset as normal (0) or abnormal (1). LOF is an unsupervised algorithm and never sees these labels; we keep them only as ground truth for evaluating its predictions later.

# Labeling anomalies (0 = normal, 1 = abnormal)
df['label'] = [0]*1000 + [1]*50

Splitting Features and Labels:

The dataset is split into features (X) and the target variable (y). The features are the model's input: the two numeric columns used for anomaly detection. The labels indicate whether each sample is normal or abnormal.

# Splitting features and labels
X = df[['feature1', 'feature2']]
y = df['label']

Scaling the Data:

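We scale the features using StandardScaler, which transforms each feature to zero mean and unit variance. LOF relies on distances between points, so a feature with a much larger range would otherwise dominate the result. Note that the scaler here is fitted on the full dataset before splitting; in a real project, fit it on the training portion only so that test-set statistics do not leak into preprocessing (see also Assessing Password Strength with Machine Learning in Python).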

# Scaling the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
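
To make the scaling step concrete, the following sketch compares StandardScaler with the manual formula z = (x - mean) / std on a small made-up array. The demo data and the names X_demo, scaler_demo, and manual are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data: two features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(500, 2))

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

# Same transformation written by hand: subtract the column mean,
# divide by the column (population) standard deviation
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_demo_scaled, manual))
print(X_demo_scaled.mean(axis=0).round(6), X_demo_scaled.std(axis=0).round(6))
```

After scaling, both columns have a mean of approximately 0 and a standard deviation of approximately 1, so neither dominates the distance calculations.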

Splitting the Data into Training and Testing Sets:

The dataset is divided into training and testing sets. This creates a training dataset to train the model and a separate testing dataset to evaluate the model’s performance.

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
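
With only 50 abnormal points out of 1050, a purely random split can leave the test set with more or fewer anomalies than expected. A possible refinement, shown here as a sketch on placeholder arrays, is to pass stratify to train_test_split so both sets keep the same class ratio. The names X_demo, y_demo, X_tr, X_te, y_tr, and y_te are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features with the same class balance as the article's dataset
X_demo = np.zeros((1050, 2))
y_demo = np.array([0] * 1000 + [1] * 50)

# stratify=y_demo keeps the normal/abnormal ratio equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f'abnormal in train: {y_tr.sum()}, abnormal in test: {y_te.sum()}')
```

Stratification matters most when the rare class is small, because a few samples more or fewer change precision and recall noticeably.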

Training the LOF Model:

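We apply the Local Outlier Factor (LOF) model for anomaly detection. LOF compares the local density around each data point with the density around its nearest neighbors; points that lie in much sparser regions than their neighbors are flagged as abnormal. In its default mode, LOF is not trained on one dataset and then applied to another: fit_predict scores the same samples it is fitted on. For that reason the model is run directly on the test set here, and the training split is not used. The contamination parameter sets the expected share of anomalies, which is close to the true rate of 50 in 1050 (about 4.8%).

# Fitting LOF on the test set and predicting in one step
# (default mode: outlier detection on the data it is fitted on)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred = lof.fit_predict(X_test)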
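
If the goal is to fit on one dataset and score unseen samples, Scikit-learn's LocalOutlierFactor offers a novelty mode (novelty=True), in which fit is called on reference data and predict on new data. The sketch below is an alternative to the step above, not the approach used in this article; the synthetic arrays and the names X_train_normal, X_new, and lof_novelty are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Assumed reference data: normal behavior only
X_train_normal = rng.normal(0, 1, size=(800, 2))

# Assumed unseen data: 200 normal points followed by 10 abnormal points
X_new = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.normal(5, 1, size=(10, 2)),
])

# novelty=True enables predict() on data not seen during fit()
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_train_normal)

pred_new = lof_novelty.predict(X_new)  # 1 = inlier, -1 = outlier
print('flagged among normal points:', (pred_new[:200] == -1).sum())
print('flagged among abnormal points:', (pred_new[200:] == -1).sum())
```

In novelty mode, fit_predict is not available, so the two modes should not be mixed on the same estimator.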

Adjusting the LOF Results:

The LOF algorithm labels abnormal data points as -1 and normal data points as 1. In this step, we adjust these labels to represent normal (0) and abnormal (1).

# Adjusting the LOF results (-1 for abnormal, 1 for normal)
y_pred = np.where(y_pred == 1, 0, 1)

Evaluating the Results:

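We evaluate the model’s performance using a confusion matrix and a classification report. The confusion matrix counts correct and incorrect predictions for each class, and the classification report summarizes precision, recall, and F1-score. With so few abnormal points, the precision and recall of the abnormal class (1) say more than overall accuracy.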

# Evaluating the results
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
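
To show how the numbers in the classification report follow from the confusion matrix, the sketch below unpacks a binary confusion matrix into its four cells and computes precision and recall for the abnormal class by hand. The small label arrays y_true_demo and y_pred_demo are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (0 = normal, 1 = abnormal)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# For labels [0, 1], ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of flagged points that are truly abnormal
recall = tp / (tp + fn)     # share of abnormal points that were flagged

print(f'tn={tn}, fp={fp}, fn={fn}, tp={tp}')
print(f'precision={precision:.3f}, recall={recall:.3f}')
```

In a security setting, false negatives (missed anomalies) and false positives (false alarms) usually carry different costs, which is why both cells are worth reading separately.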

Calculating the ROC-AUC Score:

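The ROC-AUC score summarizes the model’s ability to distinguish between normal and abnormal data points. Because it is computed here from binary predictions rather than continuous anomaly scores, the value equals the average of the true positive rate and the true negative rate (balanced accuracy) at the single threshold implied by the contamination setting.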

# Calculating the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred)
print(f'ROC-AUC Score: {roc_auc}')
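
ROC-AUC is usually more informative when it is computed from continuous scores, since it then reflects every possible threshold. After fitting, LocalOutlierFactor exposes negative_outlier_factor_, which is lower for more abnormal points; negating it gives a score that rises with abnormality. The sketch below compares both variants on made-up data; the arrays and the names lof_demo, labels_demo, and scores_demo are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Assumed demo data: 200 normal points followed by 10 abnormal points
X_demo = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.normal(5, 1, size=(10, 2)),
])
y_demo = np.array([0] * 200 + [1] * 10)

lof_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
# Map LOF output to the article's convention: -1 -> 1 (abnormal), 1 -> 0 (normal)
labels_demo = np.where(lof_demo.fit_predict(X_demo) == -1, 1, 0)

# Negate so that higher values mean "more abnormal"
scores_demo = -lof_demo.negative_outlier_factor_

auc_labels = roc_auc_score(y_demo, labels_demo)
auc_scores = roc_auc_score(y_demo, scores_demo)
print(f'ROC-AUC from binary labels: {auc_labels:.3f}')
print(f'ROC-AUC from LOF scores:    {auc_scores:.3f}')
```

The score-based value does not depend on the contamination setting, which makes it easier to compare models that use different thresholds.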

Conclusion:

In this article, we highlighted the importance of machine learning techniques in cybersecurity and built an anomaly detection model with the Local Outlier Factor algorithm, using synthetic data that stands in for network traffic. In this dataset the abnormal points are well separated from the normal ones, which makes the task easier than real traffic, where anomalies are noisier and rarer. Before relying on such a model in practice, it should be validated on real network data. As machine learning techniques become more prevalent in cybersecurity, there will be an increasing need for such studies to build a safer digital world.
