Outlier Detection on Multiple Temperature Datastreams

Disclaimer

Disruptive Technologies (DT) does not provide an end-solution for outlier detection in temperature data. The approach presented here is intended as an example for developers who want to get started with outlier detection on multistream data.

This guide assumes you are familiar with the DT ecosystem and have access to temperature sensors and Cloud Connectors. If not, take a look at our Getting Started guide and consider ordering a sensor kit from our Webshop.

 

Introduction

When running large-scale services, continuously monitoring asset temperatures can provide essential information for smooth long-term operation. Whether the assets are large office spaces, machinery in a production line, or server racks in a data center, some applications call for many sensors to be deployed at once. If one or more sensors report temperature values deviating too far from the norm, preventive steps can be taken to avoid further degradation.

Due to their small size and long battery life, Disruptive Technologies (DT) Wireless Temperature Sensors are well suited for monitoring large numbers of assets in parallel. They can be deployed in almost any environment, and because the temperature is measured every 15 minutes, trends in the data can be monitored and possible outliers caught in realtime.

In this application note, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to the data streams from 25 temperature sensors with the aim of catching outlier events. As shown in figure 1, the data from most sensors are quite similar in both level and trend. Sudden spikes or level shifts caught by the algorithm are therefore considered outliers for which appropriate action can be taken.

header.png

Figure 1: One week of temperature data from 25 DT Wireless Temperature Sensors where outlier events in the data caught by the DBSCAN algorithm are highlighted for visibility. 

 

Sensor Placement

If the aim is to highlight outlier behavior in the temperature originating from a specific device or environment, certain considerations should be taken when mounting the sensors. For instance, if room temperatures throughout a building are of interest, sensors should be placed away from external heating or cooling sources such as air-conditioning or direct sunlight. Otherwise, the algorithm might classify such external influence as an outlier, resulting in false alarms.

 

DT Studio Project Configuration

The implementation is built around using the DT Developer API to interact with a single DT Studio project containing all temperature sensors for which outlier detection is performed. If not already done, a project needs to be created and configured to enable the API functionality.

Project Authentication

For authenticating the developer API against a DT Studio project, three separate authentication details have to be located, later to be used in the example code. To generate the correct authentication details, please follow this guide.

Labeling Temperature Sensors

The 'outlier_detection' label should be given to any sensor included in the outlier detection scheme. Labels can be set in the sensor details page in DT Studio, as shown in figure 2. If a sensor resides in a different project, the option for moving it can also be found here.

studio.png

Figure 2: Sensor overview page in DT Studio where labels can be assigned.

 

Example Code

An example code repository is provided in this application note. It illustrates one way of detecting outliers in multistream data and is meant to serve as a precursor for further development and implementation. It uses the Developer API to interact with the DT Studio project.

Source Access

The example code source is publicly hosted on the official Disruptive Technologies GitHub repository under the MIT license. It can be found by following this link.

Environment Setup

The code has been written in and tested for Python 3. Required dependencies can be installed using pip and the provided requirements text file. While not required, it is recommended to use a virtual environment to avoid package conflicts.

pip3 install -r requirements.txt 

Using the details found during the project authentication section, edit the following lines in sensor_stream.py to authenticate the API with your DT Studio project.

USERNAME   = "SERVICE_ACCOUNT_KEY"    # this is the key
PASSWORD   = "SERVICE_ACCOUNT_SECRET" # this is the secret
PROJECT_ID = "PROJECT_ID"             # this is the project id 

Usage

If the example code is correctly authenticated to the DT Studio project as described above, running the script sensor_stream.py will start streaming data from each labeled temperature sensor in the project and perform outlier detection as new data arrive.

python3 sensor_stream.py 

For more advanced usage, such as fetching event history, one or several flags can be provided upon execution. 

usage: sensor_stream.py [-h] [--starttime] [--endtime] [--timestep]
                        [--clusterwidth] [--no-plot]

Outlier detection for multistream temperature data.

optional arguments:
  -h, --help       show this help message and exit
  --starttime      Event history UTC starttime [YYYY-MM-DDTHH:MM:SSZ].
  --endtime        Event history UTC endtime [YYYY-MM-DDTHH:MM:SSZ].
  --timestep       Time in seconds between clusterings.
  --clusterwidth   Seconds of data in clustering data window.
  --no-plot        Suppress streaming plot. 

The arguments --starttime and --endtime should be of the format YYYY-MM-DDThh:mm:ssZ, where YYYY is the year, MM the month, and DD the day. Likewise, hh, mm, and ss are the hour, minutes, and seconds respectively. Notice the separator, T, and Z, which must be included. It should also be noted that the time is given in UTC. Local timezone corrections should, therefore, be made accordingly.
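As a minimal sketch of the timezone correction described above, a local timestamp can be converted to the expected UTC string with the Python standard library. The helper name to_utc_flag and the UTC+2 offset are assumptions made for illustration and are not part of the example repository.

```python
from datetime import datetime, timedelta, timezone


def to_utc_flag(local_dt):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    if local_dt.tzinfo is None:
        # A naive datetime carries no offset, so the correction is undefined.
        raise ValueError("expected a timezone-aware datetime")
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical example: 09:30 local time in a UTC+2 timezone.
local = datetime(2020, 6, 1, 9, 30, 0, tzinfo=timezone(timedelta(hours=2)))
starttime = to_utc_flag(local)
print(starttime)  # 2020-06-01T07:30:00Z
```

The resulting string can then be passed to --starttime or --endtime.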

By providing the --timestep argument, the number of seconds between each clustering event can be changed. The default is 3600 seconds, or once every hour. Similarly, the --clusterwidth argument changes how many seconds of data are used in each clustering event. As a rule of thumb, a wider window will produce a more generalized result, whereas a narrower window will be better at detecting short-lived outliers. The default is 10800 seconds, or 3 hours.
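The interplay between the two arguments can be sketched as follows. This is an illustration only, not code from the repository: the function name clustering_windows is hypothetical, and it assumes that the first clustering happens once a full window of data is available.

```python
def clustering_windows(start, end, timestep=3600, clusterwidth=10800):
    """Yield (window_start, window_end) in seconds for each clustering event.

    Assumption: clustering runs every `timestep` seconds and always looks
    back `clusterwidth` seconds from the current time.
    """
    t = start + clusterwidth  # wait until one full window has accumulated
    while t <= end:
        yield (t - clusterwidth, t)
        t += timestep


# Six hours of history with the default values gives four overlapping windows.
windows = list(clustering_windows(0, 6 * 3600))
for window_start, window_end in windows:
    print(window_start, window_end)
```

With the defaults, consecutive windows overlap by two hours, so each sample takes part in up to three clustering events.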

 

Implementation

Classifying data for outlier detection is an ongoing research field that has seen many approaches over the years. Lately, machine learning techniques have been the new frontier in this area at the cost of complexity. In contrast, clustering techniques can be comparably simple while still providing good performance. In particular, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm has been found to provide good performance with relatively little parameter tweaking [1]. 

Preprocessing

Depending on the application, time-series data are often feature engineered before being applied to a classification scheme. However, each sample in a time series of length \(N\) can also be considered a feature in an \(N\)-dimensional space and be applied directly. During testing, this was found to result in much better performance than extracting mean, kurtosis, skew, and other typical time-series features for cluster input.

At regular intervals, the most recent data accumulated within a set window of time are, for all sensors, uniformly resampled to a common time-axis. This synchronization of samples is necessary because new events arrive in the stream independently of each other, whereas DBSCAN expects an equal number of features for each input. No further filtering or other modification of the data is performed.
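The resampling step could look something like the sketch below. It assumes linear interpolation with NumPy, which the text above does not specify, and the function name, sensor identifiers, and values are made up for illustration.

```python
import numpy as np


def resample_to_common_axis(series, t_start, t_end, n_samples):
    """Resample every sensor onto the same uniformly spaced time-axis.

    series: dict mapping sensor id -> (timestamps, temperatures), with
            timestamps in ascending order (seconds).
    Returns the sorted sensor ids, the common time-axis, and a matrix with
    one row per sensor and one column (feature) per resampled sample.
    """
    t_common = np.linspace(t_start, t_end, n_samples)
    ids = sorted(series)
    rows = []
    for sensor_id in ids:
        timestamps, temperatures = series[sensor_id]
        # Assumption: linear interpolation. np.interp holds the first and
        # last value constant outside the sensor's own time range.
        rows.append(np.interp(t_common,
                              np.asarray(timestamps, dtype=float),
                              np.asarray(temperatures, dtype=float)))
    return ids, t_common, np.vstack(rows)


# Two hypothetical sensors reporting at different instants.
series = {
    "sensor_a": ([0, 900, 1800], [20.0, 21.0, 22.0]),
    "sensor_b": ([300, 1200], [19.0, 22.0]),
}
ids, t_common, X = resample_to_common_axis(series, 0, 1800, 3)
print(X.shape)  # (2, 3): two sensors, three features each
```

Each row of X is then one \(N\)-dimensional point for the clustering step.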

window.png

Figure 3: Windowing of the most recent 24 hours of data that are uniformly resampled before providing it as an \(N\)-dimensional input for the DBSCAN clustering algorithm.

DBSCAN

Unlike k-means clustering, DBSCAN does not require prior knowledge about the number of clusters in the data. It is also unsupervised, simplifying its use in many applications. One feature that makes it particularly useful for outlier detection is its notion of noise in the data. If a point does not fit in any cluster, it is classified as noise instead of being assigned to the closest match. Figure 4 shows the result of applying DBSCAN to some synthetic data with two features. This website provides excellent animated visualizations of the clustering procedure [2].

cluster.png

Figure 4: DBSCAN applied to data in 2 dimensions, identifying two individual clusters and noise.

When grouping the points into clusters, DBSCAN uses a distance metric, here Euclidean distance [3], to determine if two or more points should be linked. For this, the two search parameters \(\epsilon\) and \(p\) must be given, where \(\epsilon\) is the search radius and \(p\) the minimum number of points that can define a cluster. When scanning the dataset, each \(N\)-dimensional point is classified as one of three possible categories. A core point is defined as one that neighbors at least \(p\) other points within a distance of \(\epsilon\). A border point is one that can be reached from a core point but does not fulfill the requirement itself, marking the edge of a cluster. If a point cannot be reached from any core point, it is defined as noise. Figure 5 shows an example of how points are classified to form a cluster.
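The three categories can be illustrated with a short NumPy sketch that follows the definitions above literally. It only classifies the points and does not assign cluster labels, so it is not a full DBSCAN implementation. The function name and the sample points are made up for illustration. Note that some libraries count differently: scikit-learn's min_samples parameter includes the point itself, whereas the sketch counts other points only.

```python
import numpy as np


def classify_points(X, eps, p):
    """Label each point as 'core', 'border', or 'noise'."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    within = dist <= eps
    np.fill_diagonal(within, False)  # a point is not its own neighbor
    # Core: at least p other points inside the eps-neighborhood.
    is_core = within.sum(axis=1) >= p
    # Border: not core, but inside the eps-neighborhood of a core point.
    near_core = (within & is_core[None, :]).any(axis=1)
    labels = np.where(is_core, "core", np.where(near_core, "border", "noise"))
    return labels.tolist()


points = [(0, 0), (0, 1), (1, 0), (2, 0), (10, 10)]
categories = classify_points(points, eps=1.1, p=2)
print(categories)  # ['core', 'border', 'core', 'border', 'noise']
```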

dbscan.png

Figure 5: Cluster generation procedure of the DBSCAN algorithm where the \(\epsilon\) neighborhood is found for each point, classifying said point as either noise, border, or core.

Finding a balance between generalized behavior and performance is one of the challenges when choosing \(\epsilon\) and \(p\). Here, if we assume that an outlier does not correlate with other potential outliers, setting \(p=2\) should result in said outliers being classified as noise by DBSCAN, as there should be no other similar series. The search radius \(\epsilon\), on the other hand, is dynamically recalculated on each call to compensate for changes in the data. The average of all time series in the window is found first, and \(\epsilon\) is then calculated as the median Euclidean distance from each series to that average.
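A minimal sketch of this parameter choice is given below, assuming NumPy and the DBSCAN implementation in scikit-learn. Whether the example repository uses the same library is an assumption, and the function name and synthetic data are made up for illustration. Because scikit-learn counts the point itself in min_samples, min_samples=2 here means that a series needs one other series within \(\epsilon\) to avoid being labeled noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def detect_outliers(X, min_points=2):
    """Cluster the rows of X (one time series per row) and return labels.

    eps is recalculated on every call as the median Euclidean distance
    from each series to the average series. Series labeled -1 are noise.
    """
    X = np.asarray(X, dtype=float)
    mean_series = X.mean(axis=0)
    distances = np.linalg.norm(X - mean_series, axis=1)
    eps = float(np.median(distances))  # must be > 0 for DBSCAN to accept it
    model = DBSCAN(eps=eps, min_samples=min_points, metric="euclidean")
    return model.fit_predict(X), eps


# Synthetic data: ten similar series and one shifted by 5 degrees.
t = np.linspace(0.0, 2.0 * np.pi, 12)
base = 20.0 + np.sin(t)
offsets = np.linspace(-0.2, 0.2, 10)
X = np.vstack([base + o for o in offsets] + [base + 5.0])

labels, eps = detect_outliers(X)
outliers = np.flatnonzero(labels == -1)
print(outliers)  # [10]
```

If all series in a window are identical, the median distance is zero and a fallback value for eps would be needed.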

 

Realtime Application

Depending on the usage, the algorithm can be applied either in post-processing or as a realtime application. For instance, if left running in the background, an alarm or notification can be sent if an outlier is detected. In the presented case, the \(\epsilon\) parameter has, for illustration purposes, been set so that the implementation is very sensitive to outliers. This can, of course, be adjusted at will. Figure 6 shows a four-day rolling window where DBSCAN is applied for each new sample that arrives in the stream.
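For the realtime case, each sensor needs a buffer that only holds the most recent window of data. One possible way to do this is sketched below with a deque from the Python standard library. The class name RollingWindow is hypothetical, and the sketch assumes that samples for a sensor arrive in chronological order.

```python
from collections import deque


class RollingWindow:
    """Keep only the samples that fall within the last `width` seconds."""

    def __init__(self, width):
        self.width = width
        self.samples = deque()  # (timestamp, temperature) pairs, oldest first

    def add(self, timestamp, temperature):
        self.samples.append((timestamp, temperature))
        # Drop samples that have fallen out of the window.
        while self.samples[0][0] < timestamp - self.width:
            self.samples.popleft()

    def timestamps(self):
        return [t for t, _ in self.samples]


# A 3-hour window receiving four samples over 3.5 hours.
window = RollingWindow(width=10800)
for timestamp, temperature in [(0, 20.0), (5400, 20.5), (10800, 21.0), (12600, 21.5)]:
    window.add(timestamp, temperature)
print(window.timestamps())  # [5400, 10800, 12600]
```

The content of each buffer can then be resampled and clustered whenever a new sample arrives or a set time has passed.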

output.gif

Figure 6: DBSCAN being continuously applied to 25 different temperature data streams in realtime as new data arrive, highlighting the streams that deviate from the rest.

 

References

  1. https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
  2. https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
  3. https://en.wikipedia.org/wiki/Euclidean_distance