DATASET:
To reproduce tamper situations, a test box was built to collect data from the different scenarios. The box contains a computer fan to cool it, an incandescent bulb to warm it, and a small door through which a person can blow or smoke over the sensor. Label 1 corresponds to the box warming/cooling under normal conditions, label 2 to tampering by blowing on the sensor, label 3 to smoke near the sensor, and label 4 to fire detection. The IoT device combines the well-known SCD-30 CO2/temperature/humidity sensor, a LoPy4 as the microcontroller, and a relay board to drive the fan and the bulb.
APPROACH:
Most machine learning (ML) proposals in the Internet of Things (IoT) space are designed and evaluated on pre-processed datasets, where the data acquisition and cleaning steps are often treated as a black box. In practice, the acquisition stage requires additional data cleaning/anomaly detection techniques, which translate into additional resources, energy, and storage. We propose to carry out these techniques not on cloud servers but closer to the data source: on the IoT device itself. Accordingly, this application defines three anomaly detection steps, using smoothing filters, unsupervised learning, and deep learning techniques (a hybrid model), to detect the different kinds of anomalies while keeping a small computational/memory footprint.
BACKGROUND:
Noise: The sensor's signal is discretized/digitized so the microcontroller can process it. However, errors such as voltage fluctuations, non-linear response, and vibrations insert noise into the electrical signal, confusing the ML algorithm at the feature extraction stage. Signal smoothing filters reduce these noise components and yield a cleaner signal, provided the phenomenon does not require high sampling frequencies.
Outlier detection: These methods detect samples whose distribution differs from the rest of the data. This process is carried out through unsupervised analysis.
Tamper detection: In some scenarios, the information acquired by the IoT device can be compromised by malicious users trying to steal or modify data. Tamper detection techniques require knowing all the manipulation possibilities in order to detect them, which calls for supervised learning.
ASSEMBLING THE DATASET:
The LoPy4 sends data to the computer over serial communication, and the computer receives it with a Python script (the same script is run once per label). More information on the LoPy4 code here:
READ THE DATASET:
In a Python environment, we read and plot the dataset (red: label 1, blue: label 2, black: label 3, green: label 4):
#libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#read the dataset
dataset = pd.read_csv('data.csv', sep=',')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values
#plot the dataset
fig = plt.figure(figsize=(15,11))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dataset['CO2'], dataset['TEMP'], dataset['HUM'],
           c=dataset['LABEL'].replace({1:"red", 2:"blue", 3:"black", 4:"green"}),
           s=50)
ax.legend(loc="best")
ax.set_xlabel("CO2")
ax.set_ylabel("Temperature")
ax.set_zlabel("Humidity")
plt.show()
#Code to run on the computer
#Libraries:
import pandas as pd
import serial
#create the dataframe
dataset = pd.DataFrame(columns=["co2", "temp", "hum"])
#serial communication object (adjust the COM port to your setup)
com = serial.Serial(port='COM12', baudrate=115200)
#sample counter
i = 0
#number of samples
samples = 500
while True:
    #wait for incoming data
    if com.in_waiting > 0:
        #read one line of data
        data = com.readline()
        #split the line into tokens; the device prints "co2 ; temp ; hum",
        #so the values land at indices 0, 2, and 4
        val = data.split()
        #confirm the line was split correctly
        if len(val) > 4:
            #store up to `samples` rows (change `samples` as needed)
            if i < samples:
                #store the reading in the dataframe
                dataset.loc[len(dataset)] = [val[0].decode("utf-8"),
                                             val[2].decode("utf-8"),
                                             val[4].decode("utf-8")]
                #progress confirmation
                i += 1
                print(i)
            else:
                #close the COM port
                com.close()
                #export the dataframe to CSV (one file per label)
                dataset.to_csv("label1.csv")
                break
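Each run of the script above exports one per-label file (label1.csv, ..., label4.csv). The merging step is not shown here; a minimal sketch, assuming the four files are combined into the labeled data.csv read earlier:

#Assumed merge step: combine the four per-label CSVs into one labeled dataset
import pandas as pd
frames = []
for lbl in (1, 2, 3, 4):
    df = pd.read_csv(f'label{lbl}.csv', index_col=0)
    df.columns = ['CO2', 'TEMP', 'HUM']  # rename to the column names used when reading
    df['LABEL'] = lbl
    frames.append(df)
pd.concat(frames, ignore_index=True).to_csv('data.csv', index=False)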
#Code to run on the device
#Libraries:
import time
import math
import pycom
from machine import UART
from machine import I2C
from scd30 import *

uart = UART(0, baudrate=115200)  # UART configuration
pycom.heartbeat(False)           # disable the heartbeat LED
i2c = I2C(2)  # create and use default PIN assignments (P9=SDA, P10=SCL)
# NOTE: Could not make it work using the ESP32 hardware I2C buses (0 & 1),
# but the bitbanged software bus (2) works
sensor = SCD30(i2c, 0x61)
while True:
    for i in range(500):
        # Wait for sensor data to be ready to read (by default every 2 seconds)
        while sensor.get_status_ready() != 1:
            time.sleep_ms(200)
        (co2, temperature, hum) = sensor.read_measurement()
        temperature -= 3  # NOTE: adjust for the PCB heating effect
        # send the readings over serial, separated by ';'
        print(round(co2, 2), ';', round(temperature, 2), ';', round(hum, 2))
        time.sleep_ms(5000)
SPLIT THE DATASET INTO LABELS:
#libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#read the dataset
dataset = pd.read_csv('anomaly_detection.csv', sep=',')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values
#plot the dataset
fig = plt.figure(figsize=(15,11))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dataset['CO2'], dataset['TEMP'], dataset['HUM'],
           c=dataset['LABEL'].replace({1:"red", 2:"blue", 3:"black", 4:"green"}),
           s=50)
ax.legend(loc="best")
ax.set_xlabel("CO2")
ax.set_ylabel("Temperature")
ax.set_zlabel("Humidity")
plt.show()
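The snippet above reads and plots the merged file; the per-label frames used in the rest of the analysis (label1 in the smoothing code, and presumably the smoothed nd1 ... nd4 frames in the outlier step) come from filtering on the LABEL column. A minimal sketch:

#split the dataset by label
label1 = dataset[dataset['LABEL'] == 1]
label2 = dataset[dataset['LABEL'] == 2]
label3 = dataset[dataset['LABEL'] == 3]
label4 = dataset[dataset['LABEL'] == 4]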
NOISE DETECTION:
Errors in data collection on IoT devices are common due to voltage fluctuations at the sensors and unexpected board movements; smoothing algorithms reduce this noise. To determine the best smoothing algorithm, we need statistical metrics that capture different classes of error. In this work we use the signal-to-noise ratio (SNR), MSE, MAE, RMSE, and the R2 score. More information in: https://towardsdatascience.com/comparing-robustness-of-mae-mse-and-rmse-6d69da870828, and https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e
# Signal-to-noise metric
import numpy as np
def signaltonoise(a, axis=0, ddof=0):
    a = np.asanyarray(a)
    m = a.mean(axis)
    sd = a.std(axis=axis, ddof=ddof)
    return np.where(sd == 0, 0, m/sd)
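As a quick usage example (assuming the per-label split shown earlier), the SNR of a raw signal can be checked directly:

print(signaltonoise(label1["CO2"]))  # a higher ratio indicates a cleaner signal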
OUTLIER DETECTION:
Outliers are observations that differ strongly from the rest of the data points, and smoothing algorithms cannot detect them. Outlier detection algorithms can be applied per label or to the entire dataset. More info in: https://towardsdatascience.com/5-outlier-detection-methods-that-every-data-enthusiast-must-know-f917bf439210
TAMPERING DETECTION:
Tamper detection is the ability of a device to sense that an active attempt to compromise the device integrity or the data associated with the device is in progress; the detection of the threat may enable the device to initiate appropriate defensive actions. More information in: https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-5906-5_229
It is important to remember that the proposed system has four labels, two of which are tamper options. With the refined dataset obtained in the previous section, the multi-class classification should therefore improve. Thus, the classification algorithms and deep learning techniques are first trained on the original samples, and then on the refined dataset to demonstrate the effect of the outlier detection phase. More information on classification metrics in:
https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-5906-5_229
###### SVM ################
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
EXPORTING MODELS:
The one-class SVM prunes outliers adequately, the decision tree has the best classification metrics, and the neural network performs similarly on the original and improved samples. Therefore, we defined two scenarios: using the outlier detection method plus the decision tree or SVM as classifier, or deploying the deep learning model alone.
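For the first scenario, the two exported models (see the m2cgen exports below) can be chained on the device. A minimal sketch, assuming the generated osvm.py and dtc2.py modules each expose the score() function that m2cgen emits (a decision value for the one-class SVM, a per-class score list for the tree):

#Hypothetical two-stage pipeline for the first scenario
from osvm import score as osvm_score   # one-class SVM exported by m2cgen
from dtc2 import score as dtc_score    # decision tree exported by m2cgen

def classify(sample):                  # sample = [co2, temp, hum]
    if osvm_score(sample) < 0:         # negative decision value -> outlier
        return None                    # prune the reading instead of classifying it
    scores = dtc_score(sample)         # per-class scores for labels 1..4
    return scores.index(max(scores)) + 1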
RESOURCES:
from sklearn import metrics
import numpy as np
def timeseries_evaluation_metrics(y_true, y_pred):
    print('Evaluation metric results: ')
    print(f'MSE value : {metrics.mean_squared_error(y_true, y_pred)}')
    print(f'MAE value : {metrics.mean_absolute_error(y_true, y_pred)}')
    print(f'RMSE value : {np.sqrt(metrics.mean_squared_error(y_true, y_pred))}')
    print(f'SNR : {signaltonoise(y_pred)}')
    print(f'R2 score : {metrics.r2_score(y_true, y_pred)}', end='\n\n')
Several smoothing algorithms can be applied to analog signals; the most representative are the moving average, median, Gaussian, and Savitzky-Golay filters. Each algorithm is applied to each analog signal, split by label.
## CO2 signal smoothing analysis
from scipy.ndimage import gaussian_filter
from scipy import signal
import math
size = 5
x = label1["CO2"]
#moving average
y_moving = pd.Series(x.rolling(window=size).mean())
y_moving[0:size] = pd.Series(x[0:size])  # fill the initial NaNs of the rolling window
#median
y_med = pd.Series(signal.medfilt(x, size))
#Savitzky-Golay
y_vi = pd.Series(signal.savgol_filter(x, 5, 2))
#Gaussian
y_gaussian = pd.Series(gaussian_filter(x, sigma=2))
output = [y_moving, y_med, y_vi, y_gaussian]
#Plotting the smoothed CO2 signal for the different algorithms
#(the +20/+40/... vertical offsets only separate the curves visually)
plt.plot(x, color="blue", label='Original')
plt.plot(y_moving+20, color="green", label='Average')
plt.plot(y_med+40, color="tomato", label='Median')
plt.plot(y_vi+60, color="black", label='Savitzky-Golay')
plt.plot(y_gaussian+80, color='red', label='Gaussian')
plt.grid(True)
plt.legend(loc='best')
plt.show()
#Evaluation of the smoothing algorithms
for i in output:
    timeseries_evaluation_metrics(x, i)
    print(signaltonoise(i))
OUT: Evaluation metric results:
MSE value : 439.966779661017
MAE value : 9.394576271186441
RMSE value : 20.975385089695422
SNR : 8.136239429638426
R2 score : 0.9133006210165
#ISOLATION FOREST BY LABELS
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.3)
###label 1
aux = iso.fit_predict(nd1.iloc[:,:-1])
out1 = nd1[(aux==1)]
###label 2
aux = iso.fit_predict(nd2.iloc[:,:-1])
out2 = nd2[(aux==1)]
###label 3
aux = iso.fit_predict(nd3.iloc[:,:-1])
out3 = nd3[(aux==1)]
###label 4
aux = iso.fit_predict(nd4.iloc[:,:-1])
out4 = nd4[(aux==1)]
isolation_database = pd.concat([out1, out2, out3, out4])
print(isolation_database)
# ISOLATION FOREST WITH THE ENTIRE DATASET
# (df2: the full dataset with all four labels)
iso = IsolationForest(contamination=0.3)
aux = iso.fit_predict(df2.iloc[:,:-1])
isolation_database = df2[(aux==1)]
print(isolation_database)
#ONE CLASS SVM
from sklearn.svm import OneClassSVM
osvm = OneClassSVM(kernel='poly', nu=0.3)
aux=osvm.fit_predict(df2.iloc[:,:-1])
one_svm_database=df2[(aux==1)]
print(one_svm_database)
The isolation forest prunes 173 samples (573 -> 400 rows):
CO2 TEMP HUM LABEL
0 656 25.786349 39.935088 1
1 655 25.763746 39.980219 1
2 652 25.714007 40.072484 1
3 650 25.633944 40.212785 1
4 647 25.530006 40.392756 1
.. ... ... ... ...
90 19797 32.434967 37.276137 4
91 21473 32.380756 37.199532 4
[400 rows x 4 columns]
The one-class SVM prunes 221 samples (573 -> 352 rows):
CO2 TEMP HUM LABEL
0 656 25.786349 39.935088 1
1 655 25.763746 39.980219 1
2 652 25.714007 40.072484 1
3 650 25.633944 40.212785 1
4 647 25.530006 40.392756 1
.. ... ... ... ...
90 19797 32.434967 37.276137 4
91 21473 32.380756 37.199532 4
[352 rows x 4 columns]
Original samples -> Classification algorithms:
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
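# The train/test split for the original samples is not shown in the original;
# assumed here to be the same 80/20 split used later for the refined dataset:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)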
classifier=SVC(kernel='sigmoid', random_state=0)
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
              precision    recall  f1-score   support
           1       0.72      1.00      0.83        53
           2       0.50      1.00      0.67         8
           3       0.41      0.22      0.29        32
           4       0.25      0.09      0.13        22
    accuracy                           0.61       115
   macro avg       0.47      0.58      0.48       115
weighted avg       0.53      0.61      0.54       115
[[53 0 0 0]
[ 0 8 0 0]
[16 3 7 6]
[ 5 5 10 2]]
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
              precision    recall  f1-score   support
           1       1.00      0.96      0.98        53
           2       0.73      1.00      0.84         8
           3       1.00      0.97      0.98        32
           4       1.00      1.00      1.00        22
    accuracy                           0.97       115
   macro avg       0.93      0.98      0.95       115
[[51 2 0 0]
[ 0 8 0 0]
[ 0 1 31 0]
[ 0 0 0 22]]
Original Samples -> Deep Learning:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
encoder = OneHotEncoder(sparse=False)
y_nn=pd.DataFrame(y) # <- New variables to fit into neural network models
y_nn = encoder.fit_transform(y_nn)
sc=MinMaxScaler()
X_nn=sc.fit_transform(X) #<- scaling the dataset between 0 to 1
X_train_nn, X_test_nn,y_train_nn,y_test_nn=train_test_split(X_nn,y_nn,test_size=0.2, random_state=0)
model = Sequential()
model.add(Dense(6, input_shape=(3,), activation='relu', name='fc1'))
model.add(Dense(24, activation='relu', name='fc2'))
model.add(Dense(8, activation='relu', name='fc3'))
model.add(Dense(4, activation='softmax', name='output'))
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])
history=model.fit(X_train_nn, y_train_nn, validation_split=0.33, batch_size=5, epochs=20)
Epoch 1/20
38/38 [============================] - 1s 7ms/step - loss: 1.4285 - accuracy: 0.2074 - val_loss: 1.3595 - val_accuracy: 0.3226
...
Epoch 20/20
38/38 [============================] - 0s 4ms/step - loss: 0.1689 - accuracy: 0.9787 - val_loss: 0.2187 - val_accuracy: 0.9462
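Given the small computational/memory footprint targeted in the approach, the network size is worth checking; Keras reports the parameter count directly (for the 6-24-8-4 stack above, only a few hundred trainable parameters):

model.summary()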
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.ylabel('ACCURACY',fontname="Times New Roman")
plt.xlabel('EPOCH',fontname="Times New Roman")
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
# "Loss"
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('LOSS',fontname="Times New Roman")
plt.xlabel('EPOCH',fontname="Times New Roman")
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
y_pred = model.predict(X_test_nn)
y_pred = (y_pred > 0.5)            # threshold the one-hot outputs
index = np.argmax(y_pred, axis=1)  # predicted class index for each test sample
index
from sklearn import metrics
print("")
print("Precision: {}%".format(100*metrics.precision_score(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1), average="weighted")))
print("Recall: {}%".format(100*metrics.recall_score(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1), average="weighted")))
print("f1_score: {}%".format(100*metrics.f1_score(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1), average="weighted")))
print("Error: {}%".format(metrics.mean_absolute_error(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1))))
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
report=classification_report(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1))
print('\nReport\n')
print(report)
CONCLUSIONS:
Precision: 98.63422962014512%
Recall: 98.59154929577466%
f1_score: 98.57176503839088%
Error: 0.014084507042253521%
Report
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        12
           2       0.97      1.00      0.98        32
           3       1.00      1.00      1.00         8
    accuracy                           0.99        71
   macro avg       0.99      0.98      0.99        71
weighted avg       0.99      0.99      0.99        71
The classification algorithms and deep learning models are then trained with the refined dataset obtained by the one-class SVM, since this algorithm pruned more samples.
Improved Samples -> Classification Algorithms:
X = one_svm_database.iloc[:,:-1].values
y = one_svm_database.iloc[:,-1].values
X_train, X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=0)
classifier=SVC(kernel='linear', random_state=0)
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))
              precision    recall  f1-score   support
           1       1.00      1.00      1.00        53
           2       0.89      1.00      1.00         8
           3       1.00      1.00      1.00        32
           4       1.00      1.00      1.00        22
    accuracy                           1.00       115
   macro avg       1.00      1.00      1.00       115
[[19 2 0 0]
[ 0 12 0 0]
[ 0 1 32 0]
[ 0 0 0 8]]
Improved Samples -> Deep Learning:
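Here X_train_nn and y_train_nn are assumed to be rebuilt from the refined (one-class SVM) dataset with the same preprocessing used for the original samples:

#Assumed preprocessing on the refined dataset (mirrors the earlier steps)
y_nn = encoder.fit_transform(pd.DataFrame(y))
X_nn = sc.fit_transform(X)
X_train_nn, X_test_nn, y_train_nn, y_test_nn = train_test_split(X_nn, y_nn, test_size=0.2, random_state=0)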
model = Sequential()
model.add(Dense(4, input_shape=(3,), activation='relu', name='fc1'))
model.add(Dense(24, activation='relu', name='fc2'))
model.add(Dense(8, activation='relu', name='fc3'))
model.add(Dense(4, activation='softmax', name='output'))
model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])
history=model.fit(X_train_nn, y_train_nn, validation_split=0.33, batch_size=5, epochs=20)
Epoch 1/20
38/38 [============================] - 1s 7ms/step - loss: 1.4205 - accuracy: 0.1011 - val_loss: 1.3812 - val_accuracy: 0.2366
...
Epoch 20/20
38/38 [============================] - 0s 4ms/step - loss: 0.7994 - accuracy: 0.8883 - val_loss: 0.8741 - val_accuracy: 0.8172
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.ylabel('ACCURACY',fontname="Times New Roman")
plt.xlabel('EPOCH',fontname="Times New Roman")
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
# "Loss"
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('LOSS',fontname="Times New Roman")
plt.xlabel('EPOCH',fontname="Times New Roman")
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
Outlier detection:
##########run once#########
pip install m2cgen
###########################
import m2cgen as m2c
osvm = OneClassSVM(kernel='sigmoid', nu=0.3)
osvm.fit(X)  # the one-class SVM is unsupervised; no labels needed
code = m2c.export_to_python(osvm)
with open('osvm.py', 'w') as file:
    file.write(code)
code2 = m2c.export_to_c(osvm)  # C source, to match the .h file extension
with open('osvm.h', 'w') as file:
    file.write(code2)
Classification algorithms:
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
code = m2c.export_to_python(clf)
with open('dtc2.py', 'w') as file:
    file.write(code)
print(code)
code2 = m2c.export_to_c(clf)
with open('dtc2.h', 'w') as file:
    file.write(code2)
Deep learning:
for layerNum, layer in enumerate(model.layers):
    weights = layer.get_weights()[0]
    biases = layer.get_weights()[1]
    for toNeuronNum, bias in enumerate(biases):
        print(f'{layerNum}B -> L{layerNum+1}N{toNeuronNum}: {bias}')
    for fromNeuronNum, wgt in enumerate(weights):
        for toNeuronNum, wgt2 in enumerate(wgt):
            print(f'L{layerNum}N{fromNeuronNum} -> L{layerNum+1}N{toNeuronNum} = {wgt2}')
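The printed weights and biases make it possible to run the network on the LoPy4 without any ML library by hard-coding the forward pass. A minimal sketch, assuming the values above are pasted into per-layer (W, b) pairs and that inputs are min-max scaled with the training ranges first:

#Minimal forward pass in plain Python (also runs under MicroPython)
import math

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    e = [math.exp(x) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dense(x, W, b):
    # W[i][j]: weight from input neuron i to output neuron j; b[j]: bias of neuron j
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def predict(x, layers):
    # layers: list of (W, b) pairs in order fc1, fc2, fc3, output
    for k, (W, b) in enumerate(layers):
        x = dense(x, W, b)
        x = softmax(x) if k == len(layers) - 1 else relu(x)
    return x.index(max(x)) + 1  # labels are 1..4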