Inspector Gadget ANN

Where This Started

The original artifact for this submission is a Jupyter notebook from CS 370, Current/Emerging Trends in CS, where I was first introduced to artificial neural networks using Keras. That assignment used a small, scaffolded dataset to demonstrate forward propagation and gradient descent. It was functional, but the problem it solved was synthetic. If you would like to compare the original and enhanced artifacts, you can see both in this GitHub repo here.

Teaching a Machine to Recognize Threats

The enhancement replaces that proof-of-concept with a production-grade multi-class classifier trained on real labeled network traffic from the CIC-IDS2017 dataset and exports it to a quantized TFLite model that runs on a Raspberry Pi 4. Every preprocessing decision, every architectural choice, and every metric I tracked was made with one constraint in mind: this model has to be accurate enough to be useful for real-time intrusion detection, and fast enough to run on edge hardware without missing packets.

Feature Selection

The CIC-IDS2017 dataset has 80 feature columns and 15 labels: BENIGN plus 14 attack subtypes. Neither number is useful as-is. The first task was consolidating the labels. The attack subtypes, including DoS Hulk, DoS GoldenEye, FTP-Patator, and Web Attack XSS, collapse into broader classes. Two categories, Heartbleed and Infiltration, are too rare to model reliably (Heartbleed has only 11 records in the full dataset), so they are dropped entirely in preprocessing, leaving seven classes: BENIGN and six attack classes. The mapping is explicit: every label in the raw data maps to a known target class, and anything unrecognized surfaces immediately as a null rather than being silently misclassified. Many thanks to Noushin Pervez for her work on preprocessing and feature correlation. You can check out her CIC-IDS2017 repo here.

attack_map = {
	'BENIGN': 'BENIGN',
	'DDoS': 'DDoS',
	'DoS Hulk': 'DoS',
	'DoS GoldenEye': 'DoS',
	'DoS slowloris': 'DoS',
	'DoS Slowhttptest': 'DoS',
	'PortScan': 'Port Scan',
	'FTP-Patator': 'Brute Force',
	'SSH-Patator': 'Brute Force',
	'Bot': 'Bot',
	'Web Attack  Brute Force': 'Web Attack',
	'Web Attack  XSS': 'Web Attack',
	'Web Attack  Sql Injection': 'Web Attack',
	'Infiltration': 'Infiltration',
	'Heartbleed': 'Heartbleed'
}
data['Attack Type'] = data['Label'].map(attack_map)
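One way to confirm the mapping is complete is to list any raw labels that have no entry in the dictionary. The sketch below is dependency-free and uses a hypothetical helper, `find_unmapped`, with a shortened, illustrative copy of the mapping. In the notebook the equivalent check would be looking for nulls in the mapped column, since pandas' `Series.map` returns NaN for keys missing from a dict.

```python
def find_unmapped(labels, mapping):
    """Return the set of raw labels that have no entry in the mapping."""
    return {label for label in labels if label not in mapping}

# Shortened, illustrative copy of the mapping (not the full attack_map)
demo_map = {'BENIGN': 'BENIGN', 'DoS Hulk': 'DoS', 'PortScan': 'Port Scan'}

# 'DoS hulk' differs only in case, so it has no match and should be flagged
raw_labels = ['BENIGN', 'DoS Hulk', 'DoS hulk', 'PortScan']
unmapped = find_unmapped(raw_labels, demo_map)
print(unmapped)
```

An empty result means every raw label has a target class; anything else points at a spelling, spacing, or encoding difference in the source data.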

# Build connected components from correlated pairs
from collections import defaultdict, deque

def find_clusters(pairs, features):
	# Build adjacency list; each pair is (feature_a, feature_b, correlation)
	graph = defaultdict(set)
	for a, b, _ in pairs:
		graph[a].add(b)
		graph[b].add(a)

	# Find connected components via BFS
	visited = set()
	clusters = []

	for feature in features:
		if feature in visited or feature not in graph:
			continue
		cluster = set()
		queue = deque([feature])
		visited.add(feature)
		while queue:
			node = queue.popleft()  # FIFO order makes this a true BFS
			# Store each member with its correlation to the target;
			# `corr` is the correlation matrix computed in the notebook
			cluster.add((node, corr['Attack Number'][node]))
			for neighbor in graph[node] - visited:
				visited.add(neighbor)
				queue.append(neighbor)
		clusters.append(cluster)

	return clusters

With clean labels, the next step is reducing 80 features to a set that is predictive and not redundant. The approach is correlation-driven: compute the Pearson correlation of every numeric feature against a numeric encoding of the attack class, then keep features with meaningful signal, whether positive or negative, and discard the rest.

Correlated features don't just correlate with the label. They often correlate with each other as well. Keeping both members of a redundant pair inflates the feature vector without adding information. To handle this, I built a graph where each node is a feature and each edge connects two features with high mutual correlation. A BFS over this graph yields connected components, which are clusters of features that tell the same story. From each cluster, only the member with the strongest correlation to the target label is kept; the rest are dropped.
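The final step of that paragraph, keeping one feature per cluster, can be sketched as follows. This assumes clusters shaped like the output of `find_clusters`, that is, sets of `(feature, correlation)` pairs. The helper name `pick_representatives`, the feature names, and the values are all hypothetical, and ranking by absolute value is my reading of "strongest correlation", since negative correlations count as signal too.

```python
def pick_representatives(clusters):
    """From each cluster of (feature, correlation) pairs, keep the feature
    whose correlation with the target has the largest magnitude."""
    return [max(cluster, key=lambda pair: abs(pair[1]))[0]
            for cluster in clusters]

# Hypothetical clusters; names and correlation values are illustrative only
demo_clusters = [
    {('feat_a', 0.41), ('feat_b', -0.55), ('feat_c', 0.12)},
    {('feat_d', 0.30), ('feat_e', 0.28)},
]
kept = pick_representatives(demo_clusters)
print(kept)
```

Here `feat_b` wins its cluster despite being negatively correlated, because its magnitude is the largest.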

# Positive correlation features for 'Attack Number'
corr = data.corr(numeric_only=True).round(2)
pos_corr_features = corr['Attack Number'][
	(corr['Attack Number'] > 0) & (corr['Attack Number'] < 1)]
pos_corr_features = pos_corr_features.sort_values(ascending=False)

# Negative correlation features for 'Attack Number'
neg_corr_features = corr['Attack Number'][
	(corr['Attack Number'] < 0) & (corr['Attack Number'] > -1)]
neg_corr_features = neg_corr_features.sort_values(ascending=True)
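For readers who want to see what each cell of that correlation matrix represents, here is a minimal, dependency-free sketch of the Pearson coefficient for one feature against a numeric class encoding. The function name `pearson` and the sample values are illustrative assumptions, not data from CIC-IDS2017.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values: a feature that tends to be larger for class 1
feature = [1.0, 2.0, 3.0, 4.0]
attack_number = [0, 0, 1, 1]
r = pearson(feature, attack_number)
print(round(r, 2))
```

Values near +1 or -1 indicate a strong linear relationship with the encoded class, and values near 0 indicate little linear signal.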

The last feature selection step was the most deployment-critical. Several features in the dataset, including Flow Duration, Idle Mean, and Max Packet Length, can only be computed once a flow is fully complete. The state machine on the Pi processes packets in sliding windows, not complete flows, so these features are unavailable at inference time. Keeping them in training would produce a model that scores well in the notebook but is impossible to serve on the Pi. They were dropped.

# Drop features that require complete flow statistics (unusable at
# inference time on a sliding window of packets)
drop = {'Bwd Packet Length Mean', 'Packet Length Mean',
		'Bwd Packet Length Max', 'Idle Mean', 'Max Packet Length',
		'Fwd IAT Max', 'Idle Min', 'Idle Max', 'Flow IAT Max',
		'Flow Duration'}
all_kept_feat = [f for f in all_kept_feat if f not in drop]

Handling Imbalanced Data

CIC-IDS2017 is heavily imbalanced. BENIGN traffic accounts for roughly 80% of all records, while Brute Force, Web Attack, and Bot each represent less than 1%. A model trained on this raw distribution learns to predict BENIGN for almost everything and still reports high accuracy. That is not a useful intrusion detector. Heartbleed and Infiltration are dropped first because they're far too rare for SMOTE to synthesize meaningful samples. With only 11 Heartbleed records in the entire dataset, any synthetic neighbor would be nearly identical to the original.

# Heartbleed (11 records) and Infiltration cannot be meaningfully
# synthesized with SMOTE - drop them and reshape to 7 classes
data = data[~data['Attack Type'].isin(['Heartbleed', 'Infiltration'])]
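To make the accuracy trap described above concrete, the following sketch scores a "model" that always predicts BENIGN on an illustrative 80/20 split. The helper names and the class counts are assumptions chosen to mirror the rough proportions quoted in the text, not actual dataset figures.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def attack_recall(y_true, y_pred, benign='BENIGN'):
    """Fraction of attack records that were not predicted as benign."""
    attacks = [(t, p) for t, p in zip(y_true, y_pred) if t != benign]
    return sum(p != benign for _, p in attacks) / len(attacks)

# Illustrative distribution: 80 benign records, 20 attack records
y_true = ['BENIGN'] * 80 + ['DoS'] * 15 + ['Bot'] * 5
y_pred = ['BENIGN'] * 100   # always predicts BENIGN

print(accuracy(y_true, y_pred))       # high accuracy
print(attack_recall(y_true, y_pred))  # yet no attack is ever flagged
```

The gap between the two numbers is the reason the class distribution is rebalanced before training rather than left as-is.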

For the remaining six attack classes I used a two-phase balancing strategy. First, BENIGN is downsampled to match the total attack count, cutting the dataset to a manageable size without discarding any attack records.

# Separate benign flows from attacks
benign = data.loc[data['Attack Type'] == 'BENIGN']
attacks = data.loc[data['Attack Type'] != 'BENIGN']

# Down sample BENIGN to match total attack count
benign = benign.sample(n=len(attacks), replace=False)

# Combine new distribution and shuffle
new_data = pd.concat([benign, attacks])
new_data = new_data.sample(frac=1, random_state=40).reset_index(drop=True)

For the three smallest attack classes, Brute Force, Web Attack, and Bot, downsampling BENIGN alone isn't enough, because they remain far smaller than the other attack classes. SMOTE generates synthetic training samples by interpolating between existing minority examples and their k-nearest neighbors. Raising k_neighbors from its default of 5 to 15 widens the neighborhood, producing more varied synthetic samples and reducing the risk of overfitting to the original minority points. SMOTE is applied to the training split only, so the validation and test sets contain no synthetic records.

# Split first; SMOTE below is fitted on the training split only
import numpy as np
from sklearn.model_selection import train_test_split

X = new_data.drop(columns=['Attack Type']).values.astype(np.float32)
y = new_data['Attack Type'].values

# 80/20 train and temp split (stratified)
X_train, X_temp, y_train, y_temp = train_test_split(
	X, y, test_size=0.2, random_state=42, stratify=y)

# Split temp evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
	X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Oversample the three smallest attack classes with SMOTE
from imblearn.over_sampling import SMOTE

# k_neighbors=15 widens the neighborhood SMOTE draws from,
# producing more variable synthetic samples
sm = SMOTE(
	sampling_strategy={'Brute Force': 64064, 'Web Attack': 15001, 'Bot': 13671},
	random_state=42,
	k_neighbors=15)
X_train, y_train = sm.fit_resample(X_train, y_train)
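The interpolation at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration, not imblearn's implementation: it skips the k-nearest-neighbor search and takes the neighbor as given. The helper name `smote_sample` and the two points are hypothetical.

```python
import random

def smote_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and a neighbor."""
    gap = rng.random()   # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 10.0]          # an existing minority sample (illustrative)
neighbor = [3.0, 20.0]   # one of its nearest minority neighbors (illustrative)
synthetic = smote_sample(x, neighbor, rng)
print(synthetic)
```

Because every synthetic point lies between two real ones, a larger neighbor pool gives the interpolation more distinct segments to draw from, which is the motivation for a wider k_neighbors setting.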

Scaling Without Leaking

Before any data reaches the model it passes through a StandardScaler that normalizes each feature to zero mean and unit variance. The scaler is fitted on training data only. Fitting on the full dataset before the split would let validation-set and test-set statistics leak into training, which makes the evaluation metrics optimistically biased and hides how the model will actually perform in production. The fitted scaler is saved to disk so the Pi can apply the exact same transformation at inference time. A scaler mismatch would silently corrupt every prediction.

# Normalize data using sklearn's standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)  # fit on train only to prevent data leakage
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Save scaler for reuse in the inference model on the Pi
import joblib
joblib.dump(scaler, 'NMMscaler.joblib')
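The fit-on-train-only rule is easy to see in a dependency-free sketch. The helpers `fit_scaler` and `transform` below are hypothetical stand-ins for what StandardScaler does (per-column mean and population standard deviation), and the rows are made-up numbers.

```python
import math

def fit_scaler(rows):
    """Compute per-column mean and population standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def transform(rows, means, stds):
    """Apply previously fitted statistics to any split."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

train = [[1.0, 100.0], [3.0, 300.0]]
test = [[5.0, 200.0]]

means, stds = fit_scaler(train)             # statistics come from train only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # same statistics reused
print(test_scaled)
```

The test row is scaled with the training mean and deviation, which is the same thing the Pi does when it loads the saved scaler.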

The target labels are encoded twice: first with LabelEncoder to assign a stable integer to each class in alphabetical order, then with to_categorical to produce a one-hot vector for categorical_crossentropy loss. The LabelEncoder is also saved with joblib for use in deployment. The index mapping, where 0 is BENIGN, 1 is Bot, and so on, must be identical in training and at inference time, or the model will output the correct index for the wrong class label.

# Integer encoding for numerical representation
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_val = le.transform(y_val)
y_test = le.transform(y_test)
joblib.dump(le, 'NMMlabel_encoder.joblib')

# One-hot encoding for categorical_crossentropy loss
y_train = to_categorical(y_train, CLASSES)
y_val = to_categorical(y_val, CLASSES)
y_test = to_categorical(y_test, CLASSES)
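The index contract can be illustrated without any libraries. The sketch below sorts the seven class names to mimic the ordering LabelEncoder assigns, builds a one-hot vector, and decodes a probability vector back to a name. The helpers `one_hot` and `decode` and the probabilities are hypothetical.

```python
# Sorting mimics the index order LabelEncoder assigns to string labels
classes = sorted(['DoS', 'BENIGN', 'Web Attack', 'Bot', 'DDoS',
                  'Port Scan', 'Brute Force'])
class_to_index = {name: i for i, name in enumerate(classes)}

def one_hot(index, num_classes):
    """Return a one-hot list with a 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def decode(probabilities):
    """Map the highest-probability index back to its class name."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best]

encoded = one_hot(class_to_index['DoS'], len(classes))
predicted = decode([0.05, 0.02, 0.03, 0.10, 0.70, 0.05, 0.05])
print(classes)
print(encoded, predicted)
```

If the inference side used a different class order, the same index would decode to a different name, which is why the fitted encoder is saved rather than rebuilt.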

Architecture and Training

The network is a three-hidden-layer feedforward Artificial Neural Network with Rectified Linear Unit (ReLU) activations, BatchNormalization after the first two layers to stabilize training, and Dropout for regularization. Layer widths step down from 250 to 100 to 50, forming a funnel shape that forces the network to compress the 19 input features into progressively more abstract representations before the 7-class softmax output. The architecture is deliberately modest: deep enough to capture non-linear boundaries between attack classes, small enough to quantize cleanly and run on the Pi without latency problems.

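from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout

model = Sequential(name='Inspector_Gadget')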
model.add(Input(shape=(FEATURES,)))
model.add(Dense(FIRST_LAYER, activation='relu'))   # 250
model.add(BatchNormalization())
model.add(Dropout(DROPOUT))
model.add(Dense(SECOND_LAYER, activation='relu'))  # 100
model.add(BatchNormalization())
model.add(Dropout(DROPOUT))
model.add(Dense(THIRD_LAYER, activation='relu'))   # 50
model.add(Dense(CLASSES, activation='softmax'))    # 7
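A rough parameter count shows how modest this network is. The sketch below uses the layer widths stated above (19 inputs, 250, 100, 50, and 7 outputs) and counts four values per BatchNormalization feature (gamma, beta, moving mean, moving variance). The size estimates are back-of-the-envelope assumptions: real file sizes will differ because of format overhead and converter optimizations.

```python
def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """gamma, beta, moving mean, and moving variance per feature."""
    return 4 * n_features

layer_sizes = [19, 250, 100, 50, 7]
dense_total = sum(dense_params(a, b)
                  for a, b in zip(layer_sizes, layer_sizes[1:]))
bn_total = batchnorm_params(250) + batchnorm_params(100)
total = dense_total + bn_total

approx_float32_kib = total * 4 / 1024   # 4 bytes per float32 value
approx_int8_kib = total * 1 / 1024      # 1 byte per int8 value
print(total, round(approx_float32_kib, 1), round(approx_int8_kib, 1))
```

The 4:1 ratio between the two estimates is where the "roughly 4×" size reduction from INT8 quantization comes from.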

Accuracy alone is not a useful metric for an intrusion detector. A model that predicts BENIGN for everything would report high accuracy on any dataset where BENIGN dominates. I tracked Precision, Recall, and the Area Under the Curve (AUC) alongside accuracy to keep the full picture visible. Precision is the share of positive predictions that are correct, so it falls as false positives (alerts that aren't real threats) rise. Recall is the share of actual positives that are detected, so it falls as false negatives (real threats that went undetected) rise. For a security tool, a missed threat is the more dangerous failure mode. Training uses EarlyStopping on val_loss with a patience of 3: if validation loss fails to improve by at least 1e-4 for three consecutive epochs, training stops and restore_best_weights=True rolls the model back to its best weights.

# Compile with a full suite of metrics to see past accuracy alone
model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=[
	'accuracy',                       # overall correctness
	tf.keras.metrics.Precision(),     # falls as false positives rise
	tf.keras.metrics.Recall(),        # falls as false negatives rise
	tf.keras.metrics.AUC(name='auc')  # separation between classes
])
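As a worked example of the two metrics, the sketch below computes precision and recall from raw counts for a single attack class. The helper name and the counts are hypothetical. Note also that the Keras `Precision` and `Recall` metrics, when constructed with default arguments, aggregate over all one-hot outputs at a 0.5 threshold; per-class figures would need something like the `class_id` argument or a separate report.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for one attack class:
# 90 correct alerts, 10 false alarms, 30 missed attacks
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)
```

In this made-up case most alerts are genuine, but a quarter of the real attacks slip through, which is the failure mode recall exposes.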

# EarlyStopping rolls back to the best weights if val_loss stalls
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(
	monitor='val_loss',
	min_delta=1e-4,
	patience=3,
	mode='min',
	restore_best_weights=True)

history = model.fit(
	X_train, y_train,
	batch_size=BATCH_SIZE,
	epochs=EPOCHS,
	validation_data=(X_val, y_val),
	callbacks=[early_stop])
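The stopping rule can be simulated on a list of validation losses. This is a simplified sketch of the patience and min_delta logic, not Keras's actual implementation, and the loss values are invented for illustration.

```python
def early_stopping(val_losses, patience=3, min_delta=1e-4):
    """Return (epoch where training stops, epoch whose weights are kept)."""
    best = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:      # counts as an improvement
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:         # patience exhausted
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Invented losses: the best value arrives at epoch 2, then progress stalls
stopped, restored = early_stopping(
    [0.50, 0.40, 0.35, 0.36, 0.35, 0.37, 0.30])
print(stopped, restored)
```

Training halts three epochs after the last real improvement, so the later, lower loss in the list is never reached; that is the trade-off a short patience makes.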

Quantization for the Edge

The trained model runs at full float32 precision on a desktop GPU, but the Raspberry Pi 4 runs inference on its ARM CPU and has limited memory bandwidth. Post-training integer quantization converts weights and activations from 32-bit floats to 8-bit integers, reducing model size by roughly 4× and significantly improving inference throughput on ARM hardware. The quantization is data-aware: a representative dataset drawn from the training split tells the converter the actual dynamic range of each activation so it can choose the right scale factors. To avoid calibrating on only the majority class, the calibration set is balanced at 50 randomly drawn samples per class.

# Balanced per-class calibration data for INT8 quantization
y_train_labels = np.argmax(y_train, axis=1)
Nper_class = 50
indices = []
for label in np.unique(y_train_labels):
	class_indices = np.where(y_train_labels == label)[0]
	samples = np.random.choice(class_indices, Nper_class, replace=False)
	indices.append(samples)
indices = np.concatenate(indices)
np.random.shuffle(indices)
calibration_data = X_train[indices]

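def representative_dataset():
	# Yield one float32 sample at a time, shaped (1, 19) to match the model input
	for sample in calibration_data:
		yield [sample.reshape(1, 19).astype(np.float32)]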
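To show why the calibration range matters, here is a minimal sketch of affine INT8 quantization: a scale and zero point are derived from the observed minimum and maximum, then values are mapped to integers and back. It illustrates the general scheme only; the converter's actual per-tensor and per-channel choices are not reproduced, and the function names and calibration values are hypothetical.

```python
def quant_params(values, qmin=-128, qmax=127):
    """Derive scale and zero point from an observed value range.
    Assumes the values are not all zero."""
    lo = min(min(values), 0.0)   # the range must include zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 value, clamping to the valid range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale

calibration = [-1.0, 0.0, 0.5, 2.0]   # illustrative activation values
scale, zero_point = quant_params(calibration)
restored = [dequantize(quantize(x, scale, zero_point), scale, zero_point)
            for x in calibration]
print(scale, zero_point, restored)
```

If calibration only saw a narrow range, values outside it would be clamped at inference time, which is why the calibration set is balanced across classes.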

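The converter is configured for full integer quantization using TFLITE_BUILTINS_INT8, which constrains every operation in the graph to integer arithmetic, not just the weights. This produces the most aggressive size and speed reduction and is a format that TensorFlow Lite's XNNPACK delegate can accelerate on the Pi's ARM CPU. Because inference_input_type and inference_output_type are left at their defaults, the model still accepts and returns float32 tensors, so the Pi can feed it scaled features directly. The output is a single .tflite file that the state machine loads via tf.lite.Interpreter at startup.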

# Post-training INT8 quantization for Raspberry Pi deployment
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_quant_model = converter.convert()

with open('Inspector_Gadget_quant.tflite', 'wb') as f:
	f.write(tflite_quant_model)

Conclusions

This enhancement demonstrates the ability to use well-founded techniques for designing and evaluating computing solutions. The feature selection pipeline, which includes correlation filtering, pruning of redundant feature clusters, and removal of features that are unavailable at inference time, reflects an understanding that a model's value is determined by how it behaves at inference time, not by how it scores in a notebook. The preprocessing decisions around SMOTE and downsampling reflect a deliberate choice to fix the data distribution rather than mask the imbalance with a better-looking metric.

The most important lesson here was the gap between a model that trains well and a model that deploys well. Getting to 97% accuracy was straightforward once the data was clean and balanced. Getting the model to actually run on the Pi, quantized, paired with the correct scaler and label encoder, and producing correctly mapped predictions, required careful attention to every artifact the training pipeline produces, not just the weights file. Every saved .joblib file is a contract between the notebook and the inference engine. Breaking that contract produces wrong predictions with no error message. By far the most difficult parts of deploying the model were dependency management and reliably feeding it every captured flow. For the first 24 hours of deployment, while running tcpreplay, the model never saw an attack flow. Once I corrected this, the model proved very reliable on a pcap from malware-traffic-analysis.net.

References

Canadian Institute for Cybersecurity. (2017). CIC-IDS2017. University of New Brunswick. https://www.unb.ca/cic/datasets/ids-2017.html

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953

TensorFlow. (2024). Post-training integer quantization. https://www.tensorflow.org/lite/performance/post_training_integer_quant

Joshua Shoemaker