Discovering the motivation of behavior from the electricity consumption signal

Key Points

This chapter briefly explains how to analyze the motivation behind people's operation of a system from the system's electricity consumption signal and other data.

  • Objective: understand how and why occupants interact with the system.
  • System: ventilation systems in passive houses with an adjustable flow rate option.
  • Raw data: electricity consumption signal; environmental sensor records (temperature, humidity, CO2, etc.); 3-min intervals over 2 years (2013–2015): 325,946 rows × 25 features.
  • Techniques: noise reduction (Gaussian filter); edge detection (1st-derivative Gaussian filter); feature selection (L1-penalized logistic regression, recursive feature elimination).

Figure 1 below shows the overall pipeline I designed.

Figure 1 Overall Pipeline Flowchart

(1) After essential preprocessing and cleaning (NaNs are backfilled), we start with a system electricity consumption signal like the one in Figure 2 below. A sudden change in the signal can imply an occupant's interaction with the system (e.g. once the occupant turns the flow rate to a higher option, there should be a steep rising edge in the electricity consumption signal). The first thing to do is to filter out the noise (caused by wind, the system itself, etc.) and the "fake operations" (status changes with too-short duration).

Figure 2 Demo of Electricity Consumption Signal

(2) With a finely tuned 1st-derivative Gaussian filter, the noise and "fake operations" can be filtered out and the valid operations marked, as shown in Figure 3 below.

Figure 3 Noise-reduced signal, 1st-derivative-filtered signal, and operations marked

(3) The marked data set then undergoes undersampling, since at this point it is skewed (the number of records marked 'no operation' is far larger than that of records with an operation, whether increase or decrease). Undersampling gives each class a balanced share of the data set, which keeps the subsequent classification algorithm effective.

(4) After undersampling, the training set is normalized and fed into an L1-penalized logistic regression classifier. Since a linear model penalized with the L1 norm has sparse solutions, i.e. many of its estimated coefficients are zero, it can be used for feature selection. Figure 4 below shows an example of the coefficient output from one experiment.

Figure 4 Coefficient output of the logistic regression model

Then the logistic regression is run repeatedly to perform recursive feature elimination: first, the estimator is trained on the initial set of features and a weight is assigned to each of them; then the features whose absolute weights are smallest are pruned from the current set. At last, the most informative feature combination (judged by cross-validation accuracy) can be determined, as Figure 5 below shows; these features imply the occupant's motivation for his/her behavior.

Figure 5 Best feature combination after recursive feature elimination

(5) Repeat the process above for different occupants. The results imply there are different kinds of people, since their "best feature combinations" vary a lot: e.g. some show a strong "time pattern" while others are more sensitive to the indoor environment, such as temperature. A K-Means clustering can demonstrate this by grouping the occupants into different user profiles, as sketched below.

Grouping: different user profiles found
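
The clustering step is not included in the log below. As a minimal sketch of the idea, assuming each occupant is represented by the absolute coefficient vector of his/her fitted logistic regression over a shared feature list (clf_by_occupant, a dict of fitted classifiers, is hypothetical):

import numpy as np
from sklearn.cluster import KMeans

# hypothetical input: one row per occupant, one column per shared feature,
# values = |coefficient| from that occupant's fitted logistic regression
profiles = np.abs(np.vstack([clf.coef_[0] for clf in clf_by_occupant.values()]))

# group occupants into user profiles; 3 clusters is an arbitrary choice here
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)
print(kmeans.labels_)  # e.g. 'time pattern' users vs. indoor-climate-sensitive users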

From here on is the technical log covering the relevant theory and code used to realize the whole process.

Technical Details: Noise reduction & Edge detection

Context

There is a ventilation system (with heat recovery) in each passive house, whose ventilation flow rate is controlled by a fan system and is adjustable by the occupants. There are 3 available options for the fan flow rate setting (say low, medium, and high rate).

The electricity consumption of the fan system is recorded by a smart meter in pulses. Obviously, the occupants' flow rate setting has a significant influence on the electricity consumption, so we can tell when and how people adjust their ventilation system from the electricity consumption record.

However, on the one hand, under the influence of back pressure, wind speed, etc., the record is not a clean 3-stage square wave; it is quite noisy. On the other hand, our research covers many different houses (with similar structure but different scales of records). Together, these make it impractical to calibrate the ventilation setting position by fixed intervals (like pulse < 3 == position 1; 3 < pulse < 5 == position 2, etc.). We need an algorithmic method to do this job.

Here is a tiny piece of the elec. consumption record (day 185 of year 2014, house #9):



Methodology

To describe what we want to do in a few words: smooth the noise and detect the edges automatically, without any preset boundary intervals. This is actually a classic problem in the signal processing and computer vision fields. For this 1-D signal the simplest solution may be a Gaussian derivative filter; for similar problems on 2-D matrices (images) the Canny edge detector can be effective. The figure below gives a vivid impression of what we are going to do:



Basic idea

Tune a Gaussian derivative filter to properly smooth the noise and take the 1st derivative, then set an appropriate threshold to detect the edges.

Terms

Gaussian filter

For noise smoothing or "image blur". In layman's words: replace each point by a weighted average of its neighbors, with the weights drawn from a Gaussian distribution, then normalize the result.


Gaussian:

G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right)

(If you are dealing with a 2-D matrix (an image), use a 2-D Gaussian instead.)

Effect:
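
A minimal sketch of this smoothing, using a synthetic stand-in for the real record (scipy.ndimage.gaussian_filter1d does the weighting and normalization in one call):

import numpy as np
from scipy.ndimage import gaussian_filter1d

# synthetic stand-in for the pulse record: a noisy 3-stage square wave
rng = np.random.default_rng(0)
signal = np.repeat([2.0, 5.0, 8.0], 200) + rng.normal(0, 0.4, 600)

# each point becomes a Gaussian-weighted average of its neighbors;
# sigma controls the width of the neighborhood
smoothed = gaussian_filter1d(signal, sigma=8)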

Gaussian derivative filter

For noise smoothing and edge detection in one step. In layman's words: replace each point by a weighted average of its neighbors, with the weights drawn from the 1st derivative of a Gaussian distribution, then normalize the result.


The 1st derivative of the Gaussian:

G'(x) = -\frac{x}{\sqrt{2\pi}\,\sigma^3} \exp\left(-\frac{x^2}{2\sigma^2}\right)

Effect:


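
A sketch continuing from the smoothing example above: passing order=1 makes the same call convolve with the 1st derivative of the Gaussian, so smoothing and differentiation happen in one pass.

from scipy.ndimage import gaussian_filter1d

# order=1 convolves with the 1st derivative of the Gaussian:
# flat (even noisy) regions map to ~0, step edges become sharp peaks
derivative = gaussian_filter1d(signal, sigma=8, order=1)  # `signal` as above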

Advantages of the Gaussian kernel compared to other low-pass filters:

  • It can be derived from a small set of scale-space axioms.
  • It does not introduce new spurious structures at coarse scales that do not correspond to simplifications of corresponding structures at finer scales.

Scale Space

Representing a signal/image as a one-parameter family of smoothed signals/images, parametrized by the size of the smoothing kernel used for suppressing fine-scale structures. Specifically, for Gaussian kernels: t = sigma^2.


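
A sketch of how such a scale-space plot can be built, under the same assumptions as the snippets above (each row is the signal smoothed at one sigma):

import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter1d

sigmas = range(1, 9)
# stack the signal smoothed at increasing scales (for Gaussian kernels, t = sigma^2)
stack = np.vstack([gaussian_filter1d(signal, sigma=s) for s in sigmas])

plt.imshow(stack, aspect='auto', origin='lower')
plt.yticks(np.arange(len(sigmas)), list(sigmas))
plt.ylabel('sigma')
plt.xlabel('time step')
plt.title('Scale space')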

Results

Finished a demo of auto edge detection on our elec. consumption record, including a tuned Gaussian derivative filter, the detected edge positions, and a scale-space plot.

Original

Gaussian filter smoothed (sigma = 8)

1st derivative Gaussian filtered (sigma = 8)

Edge position detected (threshold = 0.07 * global min/max)

Scale Space (sigma = range(1, 9))
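
The demo code itself is not reproduced here; below is a minimal sketch of the detection step under the same assumptions as the snippets above. The detect_edges helper is hypothetical; the threshold is taken relative to the global min/max of the filtered derivative, as in the captions.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def detect_edges(signal, sigma=8, rel_threshold=0.07):
    # smooth and differentiate in one pass
    d = gaussian_filter1d(signal, sigma=sigma, order=1)
    ops = np.zeros(len(d), dtype=int)
    ops[d > rel_threshold * d.max()] = 1   # steep rise -> 'up' operation
    ops[d < rel_threshold * d.min()] = -1  # steep drop -> 'down' operation
    return ops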

Finer Tuning

In practice, tailored tuning strategies for the parameters are usually needed to meet the specific requirements raised by researchers. E.g., in one case the experts from the built environment field would like to filter out short-lived statuses (even if they are quite steep in terms of pulse count). The strategy is to carefully increase sigma (which flattens the Gaussian curve, so the center weights become less significant and short peaks are better wiped out by their flat neighbors) and also to properly increase the threshold (which makes it harder for the derivatives of smoothed short peaks to pass the threshold and be recognized as an effective operation). Once sigma and the threshold reach an optimized combination, the results look like the following for this case:
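
With the hypothetical detect_edges sketch above, this tuning is just a larger sigma and threshold:

ops = detect_edges(signal, sigma=10, rel_threshold=0.35)       # this case
ops_lazy = detect_edges(signal, sigma=20, rel_threshold=0.5)   # the larger-scale demo below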

Edge position detected (sigma = 10, threshold = 0.35 * global min/max)



At a larger scale, see how our finely tuned lazy filter filters the fake operations out (sigma = 20, threshold = 0.5 * global min/max):




Technical Details: Feature selection

Before feature selection I applied undersampling to the data set to ensure every class has a balanced weight in the whole data set (before that, the ratio was something like 150,000 'no operation' vs. 400 increase and 400 decrease).

The feature selection process is carried out in Python with scikit-learn. First, each feature in the data set needs to be standardized, since the objective function of the L1-regularized linear model used here assumes that all features are centered on zero and have variance of the same order. If a feature had a significantly larger scale or variance than the others, it might dominate the objective function and keep the estimator from learning from the other features as expected. Here I used sklearn.preprocessing.scale() to standardize each feature to zero mean and unit variance.

Then the standardized data set was fed into a recursive feature elimination with cross-validation (RFECV) loop with an L1-penalized logistic regression kernel, since linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero, which can be exploited for feature selection.

Below is the main part of the coding script for this session (ipynb format).

Feature selection after Gaussian filter
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import math

Data Loading

ventpos = pd.read_csv("/Users/xinyuyangren/Desktop/1demo.csv")
ventpos = ventpos.fillna(method='backfill')  # backfill NaNs, as in step (1)
ventpos.head()
ventpos.op.value_counts()  # class balance before undersampling

UnderSampling

# size of the 'no operation' sample: the mean of the two operation class counts
sample_size = math.ceil((sum(ventpos.op == 1) + sum(ventpos.op == -1))/2)
sample_size
noop_indices = ventpos[ventpos.op == 0].index
noop_indices
# draw a random 'no operation' subset of that size
random_indices = np.random.choice(noop_indices, sample_size, replace=False)
random_indices
noop_sample = ventpos.loc[random_indices]
up_sample = ventpos[ventpos.op == 1]
down_sample = ventpos[ventpos.op == -1]
op_sample = pd.concat([up_sample,down_sample])
op_sample.head()

Feature selection: up operation

undersampled_up = pd.concat([up_sample,noop_sample])
undersampled_up.head()
# generate month/hour attributes from the datetime string
undersampled_up['dt'] = pd.to_datetime(undersampled_up['dt'])
t = pd.DatetimeIndex(undersampled_up.dt)
hr = t.hour
undersampled_up['HourOfDay'] = hr
month = t.month
undersampled_up['Month'] = month
year = t.year
undersampled_up['Year'] = year
undersampled_up.head()
for col in undersampled_up:
    print(col)
# remap the categorical 't' flags in the wc_* columns to 0/1
def remap(x):
    if x == 't':
        x = 0
    else:
        x = 1
    return x

for col in ['wc_lr', 'wc_kitchen', 'wc_br3', 'wc_br2', 'wc_attic']:
    w = undersampled_up[col].apply(remap)
    undersampled_up[col] = w
undersampled_up.head()
# aggregate the five wc_* columns into a single open-window count
openwin = undersampled_up.wc_attic + undersampled_up.wc_br2 + undersampled_up.wc_br3 + undersampled_up.wc_kitchen + undersampled_up.wc_lr
undersampled_up['openwin'] = openwin
undersampled_up = undersampled_up.drop(['wc_lr', 'wc_kitchen', 'wc_br3', 'wc_br2', 'wc_attic','Year','dt','pulse_channel_ventilation_unit'],axis = 1)
undersampled_up.head()
for col in undersampled_up:
    print(col)

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
#shuffle the order
undersampled_up = undersampled_up.reindex(np.random.permutation(undersampled_up.index))
undersampled_up.head()
y = undersampled_up.pop('op')  # labels: 1 = up operation, 0 = no operation
# Columnwise normalization
from sklearn import preprocessing
X_scaled = pd.DataFrame()
for col in undersampled_up:
    X_scaled[col] = preprocessing.scale(undersampled_up[col])
X_scaled.head()
from sklearn.model_selection import cross_val_score
lg = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')  # liblinear supports the L1 penalty
scores = cross_val_score(lg, X_scaled, y, cv=10)
# the mean score and the 95% confidence interval of the score estimate
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
clf = lg.fit(X_scaled, y)
plt.figure(figsize=(12,9))
y_pos = np.arange(len(X_scaled.columns))
plt.barh(y_pos,abs(clf.coef_[0]))
plt.yticks(y_pos + 0.4,X_scaled.columns)
plt.title('Feature Importance from Logistic Regression')

RFECV FEATURE OPTIMIZATION

from sklearn.feature_selection import RFECV
selector = RFECV(lg, step=1, cv=10)
selector = selector.fit(X_scaled, y)
mask = selector.support_ 
mask
selector.ranking_
X_scaled.keys()[mask]
selector.score(X_scaled, y)
X_selected = pd.DataFrame()
for col in X_scaled.keys()[mask]:
    X_selected[col] = X_scaled[col]
X_selected.head()
scores = cross_val_score(lg, X_selected, y, cv=10)
#The mean score and the 95% confidence interval of the score estimate
scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
clf_final = lg.fit(X_selected, y)
y_pos = np.arange(len(X_selected.columns))
plt.barh(y_pos, abs(clf_final.coef_[0]))
plt.yticks(y_pos + 0.4, X_selected.columns)  # label bars with the selected features
plt.title('Feature Importance After RFECV Logistic Regression')