正文之前
事情是這樣的抵碟,我前面說過了桃漾。。立磁。呈队。就是我的畢業(yè)論文字?jǐn)?shù)寫到14200的時候就感覺有點寫不動了,雖然還有性能度量和致謝和一大批的文獻(xiàn)參考沒寫唱歧,但是我總感覺這樣不妥宪摧,所以就特地的又加了點東西。在后剪枝方法和連續(xù)值離散化之間颅崩,我選擇了離散化這個相對好點的東西几于。后剪枝感覺沒什么好補充的。沿后。
正文
從不廢話沿彭,先放代碼!
/* *********************
* Author : HustWolf --- 張照博
* Time : 2018.1-2018.5
* Address : HUST
* Version : 1.0
* 定義一些靜態(tài)的數(shù)值尖滚,并且提供getter
********************* */
import java.text.NumberFormat;
import java.util.*;
class Alone_Value_Category implements Comparable<Alone_Value_Category>{
private float sensor;
private float category;
// private float[] range = new float[2];
Alone_Value_Category(float a, float b){
super();
this.sensor = a;
this.category = b;
}
float getSensor(){
return sensor;
}
float getCategory(){
return category;
}
// void setRange(float a, float b){
// range[0] = a;
// range[1] =b;
// }
@Override
public String toString() {
return "\n[Sensor:" + sensor + ", category=" + category + "]";
}
@Override
public int compareTo(Alone_Value_Category o) {
return Float.compare(this.sensor,o.sensor);
}
}
上面這個是??定義的一個存儲數(shù)據(jù)的地方喉刘,這個類用來分割數(shù)據(jù),做到單屬性對分類的格式漆弄。一條4 Sensor 1Category 一共會被拆解為4個這種類的實例分別參與EADC離散化的過程睦裳。
class Interval{
private float top;
private float bottom;
public Map<Float,List<Alone_Value_Category> > sample = new HashMap<Float, List<Alone_Value_Category>>();
Interval(){};
Interval(Interval b){
top = b.top;
bottom = b.bottom;
sample = b.sample;
}
Interval(float a, float b, float c, List<Alone_Value_Category> d){
this.top = a;
this.bottom = b;
sample.put(c,d);
}
public float getTop() {
return top;
}
public float getBottom() {
return bottom;
}
public void setTop(float top) {
this.top = top;
}
public void setBottom(float bottom) {
this.bottom = bottom;
}
public void setSample(Map<Float, List<Alone_Value_Category>> sample) {
this.sample = sample;
}
public Interval addTmp(Interval b){
Interval re = new Interval(b);
if (top>b.top) re.setTop(top);
else re.setTop(b.top);
if (bottom<b.bottom) re.setBottom(bottom);
else re.setBottom(b.bottom);
re.sample.putAll(sample);
return re;
}
public void merge(Interval b){
if (top<b.top)
top = b.top;
if (bottom>b.bottom)
bottom = b.bottom;
sample.putAll(b.sample);
}
public int getCount(){
int count = 0;
for(List<Alone_Value_Category> s:sample.values()){
count+=s.size();
}
return count;
}
@Override
public String toString() {
return "bottom:"+bottom+" top:"+top+" size:"+getCount();
}
}
區(qū)間類,每一個區(qū)間有上界撼唾,下界廉邑,還有對應(yīng)的Alone_Value_Category集合。不過這里面的集合是按照類別-->List的模式存儲倒谷。按照我的數(shù)據(jù)蛛蒙,應(yīng)該是每一個Interval都有兩個List
public class Parameter {
private static int rate = 2;
private static int trainNum = 40000;
private static int testNum = trainNum/rate;
public static int getTrainNum(){
return trainNum;
}
public static int getRate(){
return rate;
}
public static int getTestNum(){
return testNum;
}
public static int getTestDistance(){
return 2000000/testNum;
}
public static int getTrainDistance(){
return 2000000/trainNum;
}
public static void setRate(int r){
rate = r;
testNum = trainNum / rate;
}
public static void setTrainNum(int t){
trainNum = t;
testNum = trainNum / rate;
}
public static void setTestNum(int t){
testNum = t;
trainNum = testNum * rate;
}
public static void Clear(ArrayList<Interval> allInterval){
ArrayList<Interval> del = new ArrayList<>();
for (int s = 0;s<allInterval.size();++s) {
if (allInterval.get(s).getCount() == 0){
if (s>0) {
allInterval.get(s - 1).merge(allInterval.get(s));
del.add(allInterval.get(s));
}
continue;
}
}
allInterval.removeAll(del);
}
static double Entropy(ArrayList<Interval> set, int size){
double shang = 0;
NumberFormat nf = NumberFormat.getNumberInstance();
nf.setMaximumFractionDigits(4);
for (Interval x:set){
double p =(double)x.getCount()/(double)size;
shang -= p*(Math.log(p)/Math.log(2));
}
return Double.parseDouble(nf.format(shang));
}
public static ArrayList<List<Float>> EADC(float[][] dat) {
ArrayList<List<Float>> re = new ArrayList<>();
for (int valueindex = 0; valueindex< dat[0].length-1;++valueindex) {
ArrayList<Alone_Value_Category> LIST = new ArrayList<>();
for (int i = 0; i < dat.length; ++i) {
LIST.add(new Alone_Value_Category(dat[i][valueindex], dat[i][dat[valueindex].length - 1]));
//便利舊集合沒有就添加到新集合
}
Collections.sort(LIST);
float len = LIST.get(LIST.size() - 1).getSensor() - LIST.get(0).getSensor();
int k = 40;
float gap = (len + 1) / k;
float Lowest = LIST.get(0).getSensor() - 0.50f;
float Highest = LIST.get(LIST.size()-1).getSensor() + 0.50f;
NumberFormat nf = NumberFormat.getNumberInstance();
nf.setMaximumFractionDigits(1);
List<Float> range = new LinkedList<>();
for (int x = 0; x <= k; ++x) {
range.add(Float.parseFloat(nf.format(Lowest + x * gap)));
}
ArrayList<Interval> allInterval = new ArrayList<>();
for (int i = 0; i < k; ++i) {
Interval newarea = new Interval();
newarea.setBottom(range.get(i));
newarea.setTop(range.get(i + 1));
for (Alone_Value_Category s : LIST) {
if (s.getSensor() > range.get(i) && s.getSensor() < range.get(i + 1)) {
if (!newarea.sample.containsKey(s.getCategory())) {
newarea.sample.put(s.getCategory(), new LinkedList<>());
}
newarea.sample.get(s.getCategory()).add(s);
}
}
allInterval.add(newarea);
}
int size = 0;
Clear(allInterval);
for (Interval s : allInterval) {
size += s.getCount();
}
k = allInterval.size();
int k0 = k;
double Ck0 = 0.5;
boolean Loop = true;
double Hpk_1 = 0;
while (Loop && k >= 10) {
double minD = 1000;
int mergePoint = 0;
double Hp0 = Entropy(allInterval, size);
double Hpk;
ArrayList<Interval> newA = new ArrayList<>();
for (int i = 0; i < allInterval.size() - 1; ++i) {
newA.addAll(allInterval);
newA.get(i).merge(newA.get(i + 1));
newA.remove(i + 1);
Hpk = Entropy(newA, size);
if (Hpk - Hp0 < minD) {
Hpk_1 = Hpk;
minD = Hpk - Hp0;
mergePoint = i;
}
newA.clear();
}
allInterval.get(mergePoint).merge(allInterval.get(mergePoint + 1));
allInterval.remove(allInterval.get(mergePoint + 1));
double Ck_1 = (k0 - 1) * Hpk_1 - Hp0 * (k - 2);
if (Ck_1 > Ck0) {
--k;
} else {
Loop = false;
--k;
}
// Ck = Ck_1;
}
range.clear();
range.add(-100f);
for (Interval s:allInterval) {
range.add(s.getTop());
}
range.add(100f);
re.add(range);
// long endTime=System.currentTimeMillis(); //獲取結(jié)束時間
// System.out.println("\n程序運行時間: "+(endTime-startTime)+"ms");
}
return re;
}
}
主體類,也是EADC算法的(一種基于熵的連續(xù)屬性離散化算法)的Java實現(xiàn)渤愁!我是三天曬網(wǎng)牵祟,一天打漁,不過終于今天還是肝出來了抖格。诺苹。這就意味著差不多要收工了!美滋滋QK妗s菸病!
具體來說其實還好吧办桨。筹淫。。等后面畢業(yè)了我把我的畢業(yè)論文寫成簡書發(fā)出來,大家伙就看的明白了咯损姜!現(xiàn)在先上數(shù)學(xué)表達(dá)饰剥!
最后得到的偽代碼就是下面的了:
當(dāng)然,他這個有點看不明白摧阅,看我的解釋吧汰蓉!
整個離散化的過程如下:
(1) 從數(shù)據(jù)庫讀取數(shù)據(jù),傳入到離散化方法中棒卷;
(2) 先針對單一的屬性顾孽,取出所有的值,并且對其進(jìn)行排序比规;
(3) 排序后劃分區(qū)間若厚,并且利用熵的計算公式計算出初始熵,設(shè)置度量數(shù)值Ck = 0 蜒什;
(4) 合并兩個相鄰區(qū)間测秸,使合并前后的熵差最小,并且重置劃分點,保存合并后的熵值;
(5) 根據(jù)上面的度量公式計算出Ck-1 = h妒貌;
(6) 如果Ck-1 > Ck ,那么k = k -1,回到第(4)步沈撞;
(7) 如果Ck-1 < Ck ,保存當(dāng)前的區(qū)間劃分,結(jié)束區(qū)間劃分進(jìn)程仔戈;
(8) 將傳入的數(shù)據(jù)根據(jù)當(dāng)前區(qū)間劃分進(jìn)行離散化关串。
離散化流程圖如下:
上面這圖花了好久拧廊。才算是理清了监徘。。吧碾。不容易啊不容易;丝!
正文之后
爭取今晚寫完論文倦春,明天排版完畢户敬,最好事明天先自查,然后大后天上知網(wǎng)查重睁本。尿庐。。大大后天呢堰,要給某人一個驚喜抄瑟,就是不知道她能不能看到了!枉疼!