Introduction to Intrinsic Dimension Estimation, with a Focus on Fractal-Based Methods (GP + CV)

What I want to find out is:

  • how to define intrinsic dimension (how do we characterize what the data looks like in its latent dimension, and what justifies saying that the data lies in exactly that dimension? ...);
  • and how to estimate it.

Content of papers [1][2]:

ID estimation methods depend on the scale of the data and still suffer from the curse of dimensionality (robustness). A lower bound on the data-set cardinality is considered necessary to guarantee an accurate ID estimate; however, such a bound is available only for fractal-based methods and Little–Jung–Maggioni's algorithm.

Firstly,

The definition of intrinsic dimension:


[Figure: definition of intrinsic dimension]
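(For completeness, since the figure is not reproduced here: roughly speaking, a data set $\Omega \subseteq \mathbb{R}^N$ is said to have intrinsic dimension $M$ if its elements lie entirely within an $M$-dimensional submanifold of $\mathbb{R}^N$, with $M \leq N$; this is the standard definition used in [1].)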

Secondly,

  • The categories of ID estimation methods:

    • Global methods use the whole data set, implicitly making the assumption that the data lie on a unique manifold of a fixed dimensionality.

      • Global methods can be grouped into five families: projection, multidimensional scaling, fractal-based, multiscale, and other methods, where all the methods that cannot be assigned to the first four families are collected.
    • Local methods are algorithms that provide an ID estimation using the information contained in sample neighborhoods.

      • In this case, data do not lie on a unique manifold of constant dimensionality but on multiple manifolds of different dimensionalities. Since a unique ID estimate for the whole data set is clearly not meaningful, it is preferable to provide an ID estimate for each small subset of data, assuming that it lies on a manifold of constant dimensionality. Strictly speaking, local methods estimate the topological dimension of the data manifold. The topological dimension is often referred to as the local dimension; for this reason, methods that estimate the topological dimension are called local. Local ID estimation methods include Fukunaga–Olsen's algorithm, local MDS, local multiscale methods, and nearest-neighbor algorithms.
    • Pointwise methods: this category contains the algorithms that can produce both a global ID estimate of the whole data set and a local, pointwise ID estimate for each pattern of the data set.

      • Unlike local methods, where the term local refers to the topological dimension, in pointwise methods local means that the dimension is estimated for the neighborhood of each data sample, thus providing an estimate of the pointwise dimension (see the first part). The global ID estimate is given by the mean of the pointwise dimensions over all patterns of the data set.
  • Meanwhile, an ideal ID estimator is expected to have the following characteristics:
  1. be computationally feasible;
  2. be robust to multiscaling;
  3. be robust to high dimensionality;
  4. have a work envelope (or operative range);
  5. be accurate, i.e., give an ID estimate close to the underlying manifold dimensionality (accuracy).
  • ID method evaluation:

    [Figure: ID method evaluation]

Thirdly,

A more detailed description of fractal-based methods (GP + CV):

  • Fractal-based methods are global methods that were originally proposed in physics to estimate the attractor dimension of nonlinear systems. Since fractals are generally characterized by a non-integer dimensionality (e.g., the Koch curve has dimension $\frac{\ln 4}{\ln 3}$), these methods are called fractal.
    • Kégl's algorithm
      Since the Hausdorff dimension is very hard to estimate, fractal methods usually replace it with an upper bound, the Box-Counting Dimension. Let $\Omega = \begin{Bmatrix}\overrightarrow{x_1}, ..., \overrightarrow{x_l}\end{Bmatrix} \subseteq \mathbb{R}^N$ be a data set; we denote by $\nu(r)$ the number of boxes (i.e., hypercubes) of size $r$ required to cover $\Omega$. It can be proved that $\nu(r) \propto \left(\frac{1}{r}\right)^M$, where $M$ is the dimension of the set. This motivates the following definition. The Box-Counting Dimension (or Kolmogorov capacity) $M_B$ of the set $\Omega$ is defined by
      $$M_B = \lim_{r\to 0} \frac{\ln \nu(r)}{\ln \frac{1}{r}}$$
      where the limit is assumed to exist.
      Kégl's algorithm is a fast algorithm for estimating the Box-Counting Dimension; however, it does not address multiscaling, nor is it robust to high dimensionality. Kégl's algorithm is based on the observation that $\nu(r)$ is equivalent to the cardinality of the maximum independent vertex set $MI(G_r)$ of the graph $G_r(V, E)$ with vertex set $V=\Omega$ and edge set $E=\begin{Bmatrix}(\overrightarrow{x_i}, \overrightarrow{x_j}) \mid d(\overrightarrow{x_i}, \overrightarrow{x_j})<r\end{Bmatrix}$. Kégl proposed to estimate $MI(G)$ using the following greedy approximation. Given a data set $\Omega$, we start with an empty set $C$. In one pass over $\Omega$, we add to $C$ the data points that are at distance of at least $r$ from all elements already in $C$. The cardinality of $C$, after every point in $\Omega$ has been visited, is the estimate of $\nu(r)$. The Box-Counting Dimension estimate is given by:
      $$M_B = -\frac{\ln \nu(r_2) - \ln \nu(r_1)}{\ln r_2 - \ln r_1}$$
      where $r_1$ and $r_2$ are values that can be set heuristically; a small code sketch of this procedure follows.
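To make the procedure concrete, here is a minimal Python sketch of the greedy cover counting and of the two-scale estimate above. It is my own illustration: the function names, the test data, and the choice of $r_1$, $r_2$ are assumptions, not taken from the papers.

```python
import numpy as np

def greedy_cover_count(X, r):
    """Greedy estimate of nu(r): keep a point only if it is at distance
    at least r from every point kept so far (one pass over the data)."""
    centers = []
    for x in X:
        if all(np.linalg.norm(x - c) >= r for c in centers):
            centers.append(x)
    return len(centers)

def kegl_dimension(X, r1, r2):
    """Two-scale Box-Counting Dimension estimate:
    M_B = -(ln nu(r2) - ln nu(r1)) / (ln r2 - ln r1)."""
    n1, n2 = greedy_cover_count(X, r1), greedy_cover_count(X, r2)
    return -(np.log(n2) - np.log(n1)) / (np.log(r2) - np.log(r1))

# Example: a 2-D plane embedded in R^5; r1, r2 chosen heuristically.
rng = np.random.default_rng(0)
X = np.zeros((2000, 5))
X[:, :2] = rng.random((2000, 2))
print(kegl_dimension(X, r1=0.05, r2=0.1))   # expected to be close to 2
```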

    • Grassberger–Procaccia algorithm (GP)
      A good alternative to the Box-Counting Dimension, among the many that have been proposed, is the Correlation Dimension, defined as follows. If the correlation integral $C(r)$ is given by:
      $$C(r) = \lim_{l\to \infty}\frac{2}{l(l-1)} \sum_{i=1}^{l} \sum_{j=i+1}^{l} I( \begin{Vmatrix}\overrightarrow{x_j}- \overrightarrow{x_i} \end{Vmatrix}\leq r)$$
      where $I$ is an indicator function (i.e., it is 1 if the condition holds, 0 otherwise), then the Correlation Dimension $M_c$ of $\Omega$ is:
      $$M_c = \lim_{r\to 0} \frac{\ln C(r)}{\ln r}$$
      It can be proved that the Correlation Dimension is a lower bound of the Box-Counting Dimension. The most popular method to estimate the Correlation Dimension is the Grassberger–Procaccia algorithm. This method consists in plotting $\ln C(r)$ versus $\ln r$; the Correlation Dimension is the slope of the linear part of the curve (see the sketch below).
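A minimal Python sketch of the GP procedure follows. It is my own illustration; in particular, picking the fitting range from percentiles of the pairwise distances is a heuristic of mine, not something prescribed by the papers.

```python
import numpy as np
from scipy.spatial.distance import pdist

def gp_correlation_dimension(X, radii=None):
    """Grassberger-Procaccia estimate: slope of ln C(r) versus ln r,
    where C(r) is the fraction of point pairs whose distance is <= r."""
    d = pdist(X)                               # all l*(l-1)/2 pairwise distances
    if radii is None:
        # Heuristic scale range (my choice): between the 1st and 10th
        # percentiles of the pairwise distances.
        radii = np.logspace(np.log10(np.quantile(d, 0.01)),
                            np.log10(np.quantile(d, 0.10)), 20)
    C = np.array([(d <= r).mean() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope

# Example: data on a 1-D curve (a helix) embedded in R^3.
t = np.linspace(0, 4 * np.pi, 1500)
X = np.c_[np.cos(t), np.sin(t), 0.3 * t]
print(gp_correlation_dimension(X))             # expected to be close to 1
```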

    • Takens' method
      Takens proposed a method, based on the Maximum Likelihood principle, that estimates the expectation value of the Correlation Dimension. Let $Q = \begin{Bmatrix}q_k \mid q_k<r\end{Bmatrix}$ be the set of the Euclidean distances $q_k$ between data points of $\Omega$ that are smaller than the so-called cut-off radius $r$. Using the Maximum Likelihood principle, Takens proved that the expectation value of the Correlation Dimension $\left\langle M_c \right\rangle$ is:
      $$ \left\langle M_c \right\rangle = -\left(\frac{1}{|Q|} \sum_{k=1}^{|Q|}\ln\frac{q_k}{r} \right)^{-1} $$
      where $|Q|$ denotes the cardinality of $Q$. Takens' method has some drawbacks. Firstly, the cut-off radius can be set only by using heuristics. Moreover, the method is optimal only if the correlation integral $C(r)$ has the form $C(r)= \alpha r^D[1+ \beta r^2 + o(r^2)]$, where $\alpha, \beta \in \mathbb{R}^+$. A small code sketch follows.
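A small Python sketch of the Takens estimator, under the same caveat: choosing the cut-off radius as a quantile of the pairwise distances is my own heuristic, not prescribed by the papers.

```python
import numpy as np
from scipy.spatial.distance import pdist

def takens_dimension(X, r=None):
    """Takens' maximum-likelihood estimate of the correlation dimension:
    <M_c> = -( mean of ln(q_k / r) over distances q_k < r )^(-1)."""
    d = pdist(X)
    if r is None:
        r = np.quantile(d, 0.1)       # heuristic cut-off radius (my assumption)
    q = d[(d > 0) & (d < r)]          # distances below the cut-off
    return -1.0 / np.mean(np.log(q / r))

# Example: points filling a 2-D unit square embedded in R^4.
rng = np.random.default_rng(1)
X = np.zeros((1500, 4))
X[:, :2] = rng.random((1500, 2))
print(takens_dimension(X))            # expected to be close to 2
```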

    • Work envelope of fractal-based methods
      Unlike the other ID methods described above, fractal-based methods satisfy the fourth ideal-ID requirement, i.e., they have a work envelope: they provide a lower bound that the cardinality of the data set must fulfill in order to get an accurate ID estimate. Eckmann and Ruelle proved that, to get an accurate estimate of the dimension $M$, the data-set cardinality $l$ has to satisfy the following inequality:
      $$M<2\log_{10}l$$
      which shows that the number of data points required to accurately estimate the dimension of an $M$-dimensional set is at least $10^{\frac{M}{2}}$. Even for sets of moderate dimension this leads to huge values of $l$.

      To improve the reliability of the ID estimate when the cardinality $l$ does not fulfill the inequality, the method of surrogate data was proposed. The method of surrogate data is based on bootstrap. Given a data set $\Omega$, the method consists in creating a new synthetic data set $\Omega'$, with larger cardinality, that has the same mean, variance and Fourier spectrum as $\Omega$. Although the cardinality of $\Omega'$ can be chosen arbitrarily, the method of surrogate data becomes infeasible when the dimensionality of the data set is high. For instance, by the inequality above, a 50-dimensional data set must contain at least $10^{25}$ data points for its dimension to be estimated accurately.
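As a quick arithmetic check of the bound (my own trivial sketch), consistent with the $10^{25}$ figure quoted above:

```python
# Rearranging M < 2*log10(l) gives the minimum cardinality l > 10^(M/2).
for M in (2, 10, 20, 50):
    print(f"M = {M:2d}  ->  need more than 10^{M / 2:g} data points")
```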

    • Camastra–Vinciarelli's algorithm (CV)
      For the reasons described above, fractal-based algorithms do not satisfy the third ideal-ID requirement, i.e., they do not provide a reliable ID estimate when the dimensionality of the data set is high and the available cardinality is limited. In order to cope with this problem, Camastra and Vinciarelli proposed an algorithm that strengthens the Grassberger–Procaccia (GP) method with respect to high dimensionality, by evaluating empirically how much the GP method underestimates the dimensionality of a data set when the data-set cardinality is inadequate for estimating the ID properly. Let $\Omega = \begin{Bmatrix}\overrightarrow{x_1}, ..., \overrightarrow{x_l}\end{Bmatrix} \subseteq \mathbb{R}^N$ be a data set; Camastra–Vinciarelli's algorithm has the following steps:

    1. Create a set $\Omega'$, whose ID $M$ is known, with the same cardinality $l$ as $\Omega$. For instance, $\Omega'$ could be composed of $l$ data points randomly generated in an $M$-dimensional hypercube.

    2. Measure the Correlation Dimension $M_c$ of $\Omega'$ by the GP method.

    3. Repeat the two previous steps for $T$ different values of $M$, obtaining the set $C = \begin{Bmatrix}(M_i, M^{i}_{c}) : i = 1, 2,...,T\end{Bmatrix}$.

    4. Perform a best fitting of the data in $C$ by a nonlinear method. A plot (reference curve) $P$ of $M_c$ versus $M$ is generated. The reference curve allows inferring the value of $M_c$ when $M$ is known.

    5. The Correlation Dimension $M_c$ of $\Omega$ is computed by the GP method and, by using $P$, the ID of $\Omega$ can be estimated.

      The algorithm assumes that the reference curve $P$ depends essentially on the cardinality $l$, and that its dependence on the particular data sets used to build it is negligible. It is worth mentioning that Oganov and Valle used Camastra–Vinciarelli's algorithm to estimate the ID of high-dimensional Crystal Fingerprint spaces. A rough sketch of the whole procedure is given below.
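Here is a rough Python sketch of the Camastra–Vinciarelli procedure, reusing the GP estimator sketched earlier. The candidate dimension range, the number of trials, and the quadratic form of the fitted reference curve are my own assumptions; the papers only prescribe a generic nonlinear fit.

```python
import numpy as np

def cv_intrinsic_dimension(X, gp_estimator, dims=range(1, 21), trials=3, seed=0):
    """Rough sketch of Camastra-Vinciarelli's correction of a GP estimate.
    Steps 1-3: for each candidate dimension M, draw l points uniformly in the
    M-dimensional unit hypercube and measure their correlation dimension M_c.
    Step 4:    fit a reference curve P with M_c = P(M) (quadratic fit assumed here).
    Step 5:    compute the GP estimate on X and invert P to get the corrected ID."""
    rng = np.random.default_rng(seed)
    l = len(X)
    Ms = list(dims)
    Mcs = [np.mean([gp_estimator(rng.random((l, M))) for _ in range(trials)])
           for M in Ms]
    P = np.poly1d(np.polyfit(Ms, Mcs, deg=2))   # reference curve M -> M_c
    mc_data = gp_estimator(X)                   # raw (underestimated) GP value for X
    # Corrected ID: the candidate M whose predicted M_c best matches mc_data.
    return min(Ms, key=lambda M: abs(P(M) - mc_data))

# Usage, reusing gp_correlation_dimension from the GP sketch above:
# estimated_id = cv_intrinsic_dimension(X, gp_correlation_dimension)
```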


  1. Intrinsic dimension estimation: Advances and open problems
  2. Fractal-Based Methods as a Technique for Estimating the Intrinsic Dimensionality of High-Dimensional Data: A Survey
