ANOMALY DETECTION

DATASET: 

To study tampering situations, a test box was built to collect data under several scenarios. The box contains a computer fan to cool it, an incandescent bulb to warm it, and a small door through which people can smoke or blow over the sensor. Label 1 is assigned when the box is warming or cooling under normal conditions, label 2 when people tamper with the sensor by blowing on it, label 3 when people smoke near the sensor, and label 4 when the sensor detects fire. The IoT device combines the well-known SCD-30 CO2, temperature, and humidity sensor, a LoPy4 microcontroller, and a relay board that switches the fan and the bulb.

APPROACH:

Most machine learning (ML) proposals in the Internet of Things (IoT) space are designed and evaluated on pre-processed datasets, where the data acquisition and cleaning steps are often treated as a black box. In practice, the data acquisition stage requires additional data cleaning and anomaly detection techniques, which cost additional resources, energy, and storage. We propose to run these techniques not on cloud servers but closer to the data source, on the IoT device itself. Consequently, this application defines three anomaly detection steps that use smoothing filters, unsupervised learning, and deep learning techniques (a hybrid model) to detect the different kinds of anomalies while keeping a small computational and memory footprint.

BACKGROUND:

 

Noise: The sensor's signal is discretized (digitized) so that the microcontroller can process it. However, error sources such as voltage fluctuations, non-linear response, and vibrations insert noise into the electrical signal, which confuses the ML algorithm in the feature extraction stage. Signal smoothing filters reduce these noise components and yield a cleaner signal, provided the phenomenon being measured does not vary at high frequency.
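As a rough illustration of signal smoothing on a constrained device, the sketch below implements a causal moving average in plain Python, with no NumPy, so it should also run under MicroPython. The function name moving_average and the default window of 5 samples are assumptions made for this example, not part of the original project.

```python
def moving_average(samples, window=5):
    """Causal moving average: each output is the mean of the last
    `window` samples seen so far (fewer at the start of the signal)."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(samples):
        running_sum += value
        if i >= window:
            # drop the sample that just left the window
            running_sum -= samples[i - window]
        count = min(i + 1, window)
        smoothed.append(running_sum / count)
    return smoothed


print(moving_average([1, 2, 3, 4, 5], window=3))
# -> [1.0, 1.5, 2.0, 3.0, 4.0]
```

A running sum keeps the cost per sample constant, which matters more on a microcontroller than on a desktop.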

 

Outlier detection: These methods detect data whose distribution differs from the rest of the samples. The detection is carried out through unsupervised analysis.

 

Tamper detection: In some scenarios, the information acquired by the IoT device can be compromised by malicious users trying to steal or modify data. Tamper detection techniques need to know all the manipulation possibilities in advance in order to detect them, which calls for supervised learning.

ASSEMBLING THE DATASET:

The LoPy 4 sends data to the computer over serial communication. On the computer, a Python script receives the data (the script is run once for each label). More information on programming the LoPy 4 here

READ THE DATASET:

 

In a Python environment, we read and plot the dataset (red: label 1, blue: label 2, black: label 3, green: label 4):

#libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 

 

#read the dataset
dataset = pd.read_csv('data.csv', sep=',')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

 

#plot the dataset
fig = plt.figure(figsize=(15,11))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dataset['CO2'],dataset['TEMP'],dataset['HUM'],

            c=dataset['LABEL'].replace({1:"red",2:"blue",3:"black",4:"green"})

            ,s=50)
ax.legend(loc="best")
ax.set_xlabel("CO2")
ax.set_ylabel("Temperature")
ax.set_zlabel("Humidity")
plt.show()

#Code to run on the computer

#Libraries:
import pandas as pd
import serial

# serial communication object
com = serial.Serial(port='COM12', baudrate=115200)

# number of samples to collect (change as needed)
samples = 500
# collected rows
rows = []

while len(rows) < samples:
    # wait for incoming data
    if com.in_waiting > 0:
        # read one line, e.g. b"650.1 ; 25.3 ; 40.2\r\n"
        data = com.readline()
        # split on whitespace: [co2, ';', temp, ';', hum]
        val = data.split()
        # check that the line was split into the expected fields
        if len(val) > 4:
            # store the sample
            rows.append({'co2': val[0].decode("utf-8"),
                         'temp': val[2].decode("utf-8"),
                         'hum': val[4].decode("utf-8")})
            # progress confirmation
            print(len(rows))

# close the COM port
com.close()
# build the dataframe and export the dataset to csv
dataset = pd.DataFrame(rows, columns=["co2", "temp", "hum"])
dataset.to_csv("label1.csv")

#Code to run on the device

#Libraries:
import time
import pycom
from machine import UART
from machine import I2C
from scd30 import *

uart = UART(0, baudrate=115200) # UART configuration
pycom.heartbeat(False) # disable the heartbeat LED
i2c = I2C(2) # create and use default PIN assignments (P9=SDA, P10=SCL)
# NOTE: Could not make it work using the ESP32 hardware I2C buses (0 & 1),
# but the bitbanged software bus (2) works
sensor = SCD30(i2c, 0x61)

while True:
    for i in range(500):
        # Wait for sensor data to be ready to read (by default every 2 seconds)
        while sensor.get_status_ready() != 1:
            time.sleep_ms(200)
        (co2, temperature, hum) = sensor.read_measurement()
        temperature -= 3 # NOTE: Adjust for PCB heating effect
        # send the information as "co2 ; temperature ; hum"
        print(round(co2,2),';',round(temperature,2),';',round(hum,2))
        time.sleep_ms(5000)
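The device prints each sample as three numbers separated by " ; ". The sketch below shows one way the receiving side could turn such a line into floats and reject malformed lines. The helper name parse_line is hypothetical and the sample bytes are made up for illustration.

```python
def parse_line(raw):
    """Parse one serial line such as b"650.12 ; 25.3 ; 40.1\\r\\n".

    Returns (co2, temperature, humidity) as floats, or None when the
    line does not contain exactly three numeric fields.
    """
    try:
        fields = [f.strip() for f in raw.decode("utf-8").split(";")]
        if len(fields) != 3:
            return None
        co2, temp, hum = (float(f) for f in fields)
    except (UnicodeDecodeError, ValueError):
        return None
    return co2, temp, hum


print(parse_line(b"650.12 ; 25.3 ; 40.1\r\n"))
# -> (650.12, 25.3, 40.1)
```

Converting to float at this point keeps the CSV columns numeric, so no casting is needed when the dataset is read back.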

SPLIT THE DATASET INTO LABELS:

 

#libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 

 

#read the dataset
dataset = pd.read_csv('anomaly_detection.csv', sep=',')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

 

#plot the dataset
fig = plt.figure(figsize=(15,11))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dataset['CO2'],dataset['TEMP'],dataset['HUM'],

            c=dataset['LABEL'].replace({1:"red",2:"blue",3:"black",4:"green"})

            ,s=50)
ax.legend(loc="best")
ax.set_xlabel("CO2")
ax.set_ylabel("Temperature")
ax.set_zlabel("Humidity")
plt.show()
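The later snippets refer to per-label dataframes such as label1. A minimal sketch of how the dataset could be split by the LABEL column is shown below. The small in-line dataframe is made-up stand-in data so the example runs on its own, and the dictionary name by_label is an assumption.

```python
import pandas as pd

# Tiny stand-in for the real CSV (values are made up for illustration).
dataset = pd.DataFrame({
    "CO2":   [650, 655, 900, 1500, 20000],
    "TEMP":  [25.6, 25.7, 26.1, 27.0, 32.4],
    "HUM":   [40.2, 40.0, 55.3, 38.9, 37.2],
    "LABEL": [1, 1, 2, 3, 4],
})

# One dataframe per label, each with a fresh 0-based index.
by_label = {
    label: group.reset_index(drop=True)
    for label, group in dataset.groupby("LABEL")
}
label1, label2, label3, label4 = (by_label[k] for k in (1, 2, 3, 4))

print({k: len(v) for k, v in by_label.items()})
```

Resetting the index keeps each per-label signal aligned with filter outputs that are rebuilt as new pandas Series.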

NOISE DETECTION:

Errors in the data collected by IoT devices are common because of voltage fluctuations on sensors and unexpected board movements, and smoothing algorithms reduce the resulting noise. To choose the best smoothing algorithm, it is necessary to compare statistical metrics that capture different classes of error. In this work, we use the signal-to-noise ratio (SNR), MSE, MAE, RMSE, and the R2 score. More information in: https://towardsdatascience.com/comparing-robustness-of-mae-mse-and-rmse-6d69da870828, and https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e

 

# Signal to noise metric

import numpy as np

 

def signaltonoise(a, axis=0, ddof=0):

    a = np.asanyarray(a)

    m = a.mean(axis)

    sd = a.std(axis=axis, ddof=ddof)

    return np.where(sd == 0, 0, m/sd)
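As a quick sanity check of this metric, the sketch below applies the same function (repeated here so the example is self-contained) to a synthetic noisy signal before and after a 5-point moving average. The signal level, noise level, and random seed are arbitrary assumptions.

```python
import numpy as np


def signaltonoise(a, axis=0, ddof=0):
    # SNR defined as mean / standard deviation (0 when the deviation is 0)
    a = np.asanyarray(a)
    m = a.mean(axis)
    sd = a.std(axis=axis, ddof=ddof)
    return np.where(sd == 0, 0, m / sd)


rng = np.random.default_rng(0)
clean = np.full(200, 650.0)                  # constant "CO2" level
noisy = clean + rng.normal(0, 20, 200)       # additive Gaussian noise
smoothed = np.convolve(noisy, np.ones(5) / 5, mode="valid")

snr_noisy = float(signaltonoise(noisy))
snr_smoothed = float(signaltonoise(smoothed))
print(snr_noisy, snr_smoothed)
```

With this definition a higher value means less spread around the mean, so the smoothed signal should score higher than the raw one.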

OUTLIER DETECTION:

Outliers are observations that differ strongly from the rest of the data points, and smoothing algorithms cannot detect them. Outlier detection algorithms can be applied per label or to the entire dataset. More info in: https://towardsdatascience.com/5-outlier-detection-methods-that-every-data-enthusiast-must-know-f917bf439210

TAMPERING DETECTION:

Tamper detection is the ability of a device to sense that an active attempt to compromise the device integrity or the data associated with the device is in progress; the detection of the threat may enable the device to initiate appropriate defensive actions. More information in: https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-5906-5_229

It is important to remember that the proposed system has four labels, and two of them are tamper options. The refined dataset obtained from the previous section is therefore expected to improve the multi-class classification. To show the effect of the outlier detection phase, the classification algorithms and deep learning techniques are first trained with the original samples and then trained with the refined dataset. More information on classification metrics in:

https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-5906-5_229

###### SVM ################

from sklearn.svm import SVC

from sklearn.model_selection import train_test_split

from sklearn import metrics

from sklearn.metrics import accuracy_score

from sklearn.metrics import precision_score

from sklearn.metrics import recall_score

from sklearn.metrics import f1_score

EXPORTING MODELS:

One-class SVM prunes outliers adequately, the decision tree has better classification metrics, and the neural network has similar results using original and improved samples. Therefore, we defined two scenarios: using the outlier detection method followed by a decision tree or SVM classifier, and deploying the deep learning model alone.
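The first scenario can be pictured as a two-stage pipeline: an outlier detector filters each sample, and only inliers reach the classifier. The sketch below is an illustration under stated assumptions, not the project's trained models. It uses synthetic two-cluster data, an RBF kernel with nu=0.1, and a hypothetical helper named classify.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: two well-separated clusters with labels 1 and 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 3)) for c in (0.0, 5.0)])
y = np.repeat([1, 2], 50)

# Stage 1: unsupervised outlier detector fitted on all samples.
detector = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
inlier_mask = detector.predict(X) == 1

# Stage 2: classifier trained only on the samples kept by stage 1.
clf = DecisionTreeClassifier(random_state=0).fit(X[inlier_mask], y[inlier_mask])


def classify(sample):
    """Return the predicted label, or None when the sample is an outlier."""
    sample = np.asarray(sample, dtype=float).reshape(1, -1)
    if detector.predict(sample)[0] == -1:
        return None
    return int(clf.predict(sample)[0])


print(classify([0.0, 0.0, 0.0]), classify([5.0, 5.0, 5.0]), classify([100.0, 100.0, 100.0]))
```

Returning None for outliers lets the caller decide whether to discard the reading or report it separately.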

 

RESOURCES:

 

from sklearn import metrics

def timeseries_evaluation_metrics(y_true, y_pred):

   print('Evaluation metric results: ')

   print(f'MSE value : {metrics.mean_squared_error(y_true, y_pred)}')

   print(f'MAE value : {metrics.mean_absolute_error(y_true, y_pred)}')

   print(f'RMSE value : {np.sqrt(metrics.mean_squared_error(y_true, y_pred))}')

   print(f'SNR : {signaltonoise(y_pred)}')

   print(f'R2 score : {metrics.r2_score(y_true, y_pred)}',end='\n\n') 

Several smoothing algorithms can be applied to analog signals; the most representative are the moving average, median, Gaussian, and Savitzky-Golay filters. Each algorithm is applied to each analog signal, split by label.

## CO2 signal smoothing analysis
import pandas as pd
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
from scipy import signal

size = 5  # window length shared by the filters
x = label1["CO2"]

# moving average; the first samples have no full window, so keep the raw values
y_moving = x.rolling(window=size).mean()
y_moving.iloc[:size] = x.iloc[:size].values

# median
y_med = pd.Series(signal.medfilt(x, size))

# Savitzky-Golay (window of 5 samples, polynomial order 2)
y_vi = pd.Series(signal.savgol_filter(x, size, 2))

# Gaussian
y_gaussian = pd.Series(gaussian_filter(x, sigma=2))

output = [y_moving, y_med, y_vi, y_gaussian]

# Plotting the CO2 signal smoothed by the different algorithms
# (each curve is shifted by a constant offset so the curves do not overlap)
plt.plot(x, color="blue", label='Original')
plt.plot(y_moving+20, color="green", label='Average')
plt.plot(y_med+40, color="tomato", label='Median')
plt.plot(y_vi+60, color="black", label='Savitzky-Golay')
plt.plot(y_gaussian+80, color='red', label='Gaussian')
plt.grid(True)
plt.legend(loc='best')
plt.show()

#Evaluation of smoothed algorithms

for i in output:

    timeseries_evaluation_metrics(x,i)

    print(signaltonoise(i))

OUT: Evaluation metric results:
     MSE value : 439.966779661017
     MAE value : 9.394576271186441
     RMSE value : 20.975385089695422
     SNR : 8.136239429638426
     R2 score : 0.9133006210165

#ISOLATION FOREST BY LABELS
from sklearn.ensemble import IsolationForest

iso=IsolationForest(contamination=0.3)

###label 1
# fit_predict returns 1 for inliers and -1 for outliers; keep the inliers
aux=iso.fit_predict(nd1.iloc[:,:-1])
out1=nd1[(aux==1)]

###label 2
aux=iso.fit_predict(nd2.iloc[:,:-1])
out2=nd2[(aux==1)]

###label 3
aux=iso.fit_predict(nd3.iloc[:,:-1])
out3=nd3[(aux==1)]

###label 4
aux=iso.fit_predict(nd4.iloc[:,:-1])
out4=nd4[(aux==1)]

isolation_database=pd.concat([out1,out2,out3,out4])
print(isolation_database)

# ISOLATION FOREST WITH THE ENTIRE DATASET

iso=IsolationForest(contamination=0.3)

aux=iso.fit_predict(df2.iloc[:,:-1])

isolation_database=df2[(aux==1)]

print(isolation_database)

#ONE CLASS SVM

from sklearn.svm import OneClassSVM

osvm = OneClassSVM(kernel='poly', nu=0.3)

aux=osvm.fit_predict(df2.iloc[:,:-1])

one_svm_database=df2[(aux==1)]

print(one_svm_database)

The dataset is pruned by 173 samples:

       CO2       TEMP        HUM  LABEL
0      656  25.786349  39.935088      1
1      655  25.763746  39.980219      1
2      652  25.714007  40.072484      1
3      650  25.633944  40.212785      1
4      647  25.530006  40.392756      1
..     ...        ...        ...    ...
90   19797  32.434967  37.276137      4
91   21473  32.380756  37.199532      4

[400 rows x 4 columns]

The dataset is pruned by 221 samples:

       CO2       TEMP        HUM  LABEL
0      656  25.786349  39.935088      1
1      655  25.763746  39.980219      1
2      652  25.714007  40.072484      1
3      650  25.633944  40.212785      1
4      647  25.530006  40.392756      1
..     ...        ...        ...    ...
90   19797  32.434967  37.276137      4
91   21473  32.380756  37.199532      4

[352 rows x 4 columns]

from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix

from sklearn.metrics import classification_report

classifier=SVC(kernel='sigmoid', random_state=0)

classifier.fit(X_train,y_train)

y_pred=classifier.predict(X_test)

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.72      1.00      0.83        53
           2       0.50      1.00      0.67         8
           3       0.41      0.22      0.29        32
           4       0.25      0.09      0.13        22

    accuracy                           0.61       115
   macro avg       0.47      0.58      0.48       115
weighted avg       0.53      0.61      0.54       115

[[53  0  0  0]
 [ 0  8  0  0]
 [16  3  7  6]
 [ 5  5 10  2]]

from sklearn import tree

clf = tree.DecisionTreeClassifier()

clf = clf.fit(X_train, y_train)

y_pred=clf.predict(X_test)

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

           1       1.00      0.96      0.98        53
           2       0.73      1.00      0.84         8
           3       1.00      0.97      0.98        32
           4       1.00      1.00      1.00        22

    accuracy                           0.97       115
   macro avg       0.93      0.98      0.95       115

[[51  2  0  0]
 [ 0  8  0  0]
 [ 0  1 31  0]
 [ 0  0  0 22]]

Original samples -> Classification algorithms:

Original Samples -> Deep Learning:

from keras.models import Sequential

from keras.layers import Dense

from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import MinMaxScaler

import tensorflow as tf

 

# one-hot encode the labels (scikit-learn older than 1.2 uses sparse=False instead)
encoder = OneHotEncoder(sparse_output=False)
y_nn=pd.DataFrame(y) # <- New variables to fit into neural network models
y_nn = encoder.fit_transform(y_nn)

sc=MinMaxScaler()
X_nn=sc.fit_transform(X) #<- scaling the dataset between 0 and 1

X_train_nn, X_test_nn,y_train_nn,y_test_nn=train_test_split(X_nn,y_nn,test_size=0.2, random_state=0)

model = Sequential()

model.add(Dense(6, input_shape=(3,), activation='relu', name='fc1'))

model.add(Dense(24, activation='relu', name='fc2'))

model.add(Dense(8, activation='relu', name='fc3'))

model.add(Dense(4, activation='softmax', name='output'))

model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])

history=model.fit(X_train_nn, y_train_nn, validation_split=0.33, batch_size=5, epochs=20)

Epoch 1/20

38/38 [============================] - 1s 7ms/step - loss: 1.4285 - accuracy: 0.2074 - val_loss: 1.3595 - val_accuracy: 0.3226

...

Epoch 20/20

38/38 [============================] - 0s 4ms/step - loss: 0.1689 - accuracy: 0.9787 - val_loss: 0.2187 - val_accuracy: 0.9462

plt.plot(history.history['accuracy'])

plt.plot(history.history['val_accuracy'])

plt.ylabel('ACCURACY',fontname="Times New Roman")

plt.xlabel('EPOCH',fontname="Times New Roman")

plt.legend(['Train', 'Validation'], loc='upper left')

plt.show()

# "Loss"

plt.plot(history.history['loss'])

plt.plot(history.history['val_loss'])

plt.ylabel('LOSS',fontname="Times New Roman")

plt.xlabel('EPOCH',fontname="Times New Roman")

plt.legend(['train', 'validation'], loc='upper left')

plt.show()

# class probabilities for each test sample; the predicted class is taken
# below with argmax(axis=1)
y_pred = model.predict(X_test_nn)

from sklearn import metrics

print("")

print("Precision: {}%".format(100*metrics.precision_score(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1), average="weighted")))

print("Recall: {}%".format(100*metrics.recall_score(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1), average="weighted")))

print("f1_score: {}%".format(100*metrics.f1_score(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1), average="weighted")))

print("Error: {}%".format(metrics.mean_absolute_error(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1))))

from sklearn.metrics import confusion_matrix, accuracy_score,classification_report

report=classification_report(np.array(y_test_nn).argmax(axis=1),y_pred.argmax(axis=1))

print('\nReport\n')

print(report)

CONCLUSIONS:

 

  • Smoothing algorithms and outlier detection techniques improve the samples by eliminating errors and data with different distributions.
  • Even though classification algorithms reach high scores without improved samples, the model complexity and size are reduced when smoothing and outlier techniques are deployed.
  • Deep learning models need outliers to fine-tune the model. In addition, these models need more samples than classification algorithms.

Precision: 98.63422962014512%
Recall: 98.59154929577466%
f1_score: 98.57176503839088%
Error: 0.014084507042253521%

Report

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        12
           2       0.97      1.00      0.98        32
           3       1.00      1.00      1.00         8

    accuracy                           0.99        71
   macro avg       0.99      0.98      0.99        71
weighted avg       0.99      0.99      0.99        71

Classification algorithms and deep learning models are trained with the refined dataset obtained with one-class SVM, since this algorithm pruned more samples.

Improved Samples -> Classification Algorithms:

X = one_svm_database.iloc[:,:-1].values

y = one_svm_database.iloc[:,-1].values

 

X_train, X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state=0)

              precision    recall  f1-score   support

           1       1.00      1.00      1.00        53
           2       0.89      1.00      1.00         8
           3       1.00      1.00      1.00        32
           4       1.00      1.00      1.00        22

    accuracy                           1.00       115
   macro avg       1.00      1.00      1.00       115

[[19  2  0  0]
 [ 0 12  0  0]
 [ 0  1 32  0]
 [ 0  0  0  8]]

classifier=SVC(kernel='linear', random_state=0)

classifier.fit(X_train,y_train)

y_pred=classifier.predict(X_test)

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test,y_pred))

 

from sklearn import tree

clf = tree.DecisionTreeClassifier()

clf = clf.fit(X_train, y_train)

y_pred=clf.predict(X_test)

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00        53
           2       0.89      1.00      1.00         8
           3       1.00      1.00      1.00        32
           4       1.00      1.00      1.00        22

    accuracy                           1.00       115
   macro avg       1.00      1.00      1.00       115

[[19  2  0  0]
 [ 0 12  0  0]
 [ 0  1 32  0]
 [ 0  0  0  8]]

Improved Samples -> Deep Learning:


model = Sequential()

model.add(Dense(4, input_shape=(3,), activation='relu', name='fc1'))

model.add(Dense(24, activation='relu', name='fc2'))

model.add(Dense(8, activation='relu', name='fc3'))

model.add(Dense(4, activation='softmax', name='output'))

model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])

history=model.fit(X_train_nn, y_train_nn, validation_split=0.33, batch_size=5, epochs=20)

Epoch 1/20

38/38 [============================] - 1s 7ms/step - loss: 1.4205 - accuracy: 0.1011 - val_loss: 1.3812 - val_accuracy: 0.2366

...

Epoch 20/20

38/38 [============================] - 0s 4ms/step - loss: 0.7994 - accuracy: 0.8883 - val_loss: 0.8741 - val_accuracy: 0.8172

plt.plot(history.history['accuracy'])

plt.plot(history.history['val_accuracy'])

plt.ylabel('ACCURACY',fontname="Times New Roman")

plt.xlabel('EPOCH',fontname="Times New Roman")

plt.legend(['Train', 'Validation'], loc='upper left')

plt.show()

# "Loss"

plt.plot(history.history['loss'])

plt.plot(history.history['val_loss'])

plt.ylabel('LOSS',fontname="Times New Roman")

plt.xlabel('EPOCH',fontname="Times New Roman")

plt.legend(['train', 'validation'], loc='upper left')

plt.show()

Outlier detection:

##########run once (in a shell)#########
# pip install m2cgen
###########################
import m2cgen as m2c

osvm = OneClassSVM(kernel='sigmoid', nu=0.3)
osvm.fit(X)  # one-class SVM is unsupervised, so no labels are needed

# export the model as pure Python code
code=m2c.export_to_python(osvm)
with open('osvm.py', 'w') as file:
    file.write(code)
print(code)

code2=m2c.export_to_c_sharp(osvm)
with open('osvm.h', 'w') as file:
    file.write(code2)

clf = tree.DecisionTreeClassifier()

clf.fit(X_train, y_train)

code=m2c.export_to_python(clf)

with open('dtc2.py', 'w') as file:

    file.write(code)

print(code)

 

code2=m2c.export_to_c_sharp(clf)

with open('dtc2.h', 'w') as file:

    file.write(code2)

Classification algorithms:

Deep learning:

for layerNum, layer in enumerate(model.layers):
    # Dense layers return [weights, biases]
    weights = layer.get_weights()[0]
    biases = layer.get_weights()[1]

    # bias feeding each neuron of the next layer
    for toNeuronNum, bias in enumerate(biases):
        print(f'{layerNum}B -> L{layerNum+1}N{toNeuronNum}: {bias}')

    # weight of every connection between this layer and the next
    for fromNeuronNum, wgt in enumerate(weights):
        for toNeuronNum, wgt2 in enumerate(wgt):
            print(f'L{layerNum}N{fromNeuronNum} -> L{layerNum+1}N{toNeuronNum} = {wgt2}')
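Once the weights and biases are dumped, inference reduces to a few matrix products. The sketch below shows a forward pass for a stack of dense layers with ReLU hidden activations and a softmax output, matching the activations used in the model above. The function forward and the toy weights are assumptions for illustration; a real deployment would load the values printed by the loop.

```python
import numpy as np


def relu(v):
    return np.maximum(v, 0.0)


def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()


def forward(sample, layers):
    """Forward pass through dense layers.

    `layers` is a list of (weights, biases) pairs in layer order, with
    weights shaped (n_inputs, n_outputs) as Keras Dense layers store them.
    Hidden layers use ReLU; the last layer uses softmax.
    """
    a = np.asarray(sample, dtype=float)
    for idx, (w, b) in enumerate(layers):
        z = a @ w + b
        a = softmax(z) if idx == len(layers) - 1 else relu(z)
    return a


# Toy weights (not taken from the trained model).
toy_layers = [
    (np.eye(2), np.zeros(2)),
    (np.eye(2), np.zeros(2)),
]
probs = forward([1.0, -1.0], toy_layers)
print(probs, int(np.argmax(probs)))
```

The input sample must be scaled the same way as during training (here, the MinMaxScaler step) before calling the forward pass.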