-------------------------
HUB TOOLBOX VERSION 2.1
October 16, 2015
-------------------------
This is the HUB TOOLBOX for Matlab/Octave
(c) 2013, Dominik Schnitzer <[email protected]>
and
(c) 2015, Roman Feldbauer <[email protected]>
If you use the functions in your publication, please cite:
@article{schnitzer2012local,
title={Local and global scaling reduce hubs in space},
author={Schnitzer, Dominik and Flexer, Arthur and Schedl, Markus and Widmer,
Gerhard},
journal={Journal of Machine Learning Research},
volume={13},
pages={2871--2902},
year={2012}
}
The full publication is available at:
http://jmlr.org/papers/volume13/schnitzer12a/schnitzer12a.pdf
The HUB TOOLBOX is a collection of hub/anti-hub analysis tools. To quickly
try the various scaling functions on your distance matrices and evaluate their
impact, use the hubness_analysis() function:
>> hubness_analysis(D, classes, vectors);
'D' is your (NxN) distance matrix, 'classes' is an optional vector with one
class number per row of D, and 'vectors' is the optional matrix of original
data vectors. The function outputs various hubness measurements, tries to
remove hubs, and then evaluates the rescaled data again.
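As a minimal sketch of how the three inputs fit together, the following
Matlab/Octave script builds them from made-up toy data. The data, the sizes
'n' and 'd', and the choice of Euclidean distance are assumptions made for
illustration only; they are not prescribed by the toolbox. The analysis call
is guarded so the script also runs when the toolbox is not on the path.

```octave
% Toy data (made up): n items with d features, two classes.
n = 60;
d = 10;
classes = [ones(n / 2, 1); 2 * ones(n / 2, 1)];    % one class number per row of D
vectors = randn(n, d) + 2 * repmat(classes, 1, d); % shift the classes apart

% Pairwise Euclidean distances using only core functions:
% ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a * b'
sq = sum(vectors .^ 2, 2);
D2 = bsxfun(@plus, sq, sq') - 2 * (vectors * vectors');
D2 = max(D2, 0);                 % clip tiny negative values caused by rounding
D  = sqrt(D2);
D  = (D + D') / 2;               % enforce exact symmetry
D(1:n + 1:end) = 0;              % exact zeros on the diagonal

% Run the analysis only if the toolbox is available.
if exist('hubness_analysis', 'file')
    hubness_analysis(D, classes, vectors);
end
```

Any other symmetric distance matrix can be passed as 'D' in the same way.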
Internally the function uses the:
* mutual_proximity(D),
* local_scaling(D, k),
* shared_nn(D, k)
functions to reduce hubness with different methods, and
* hubness(D, k),
* knn_classification(D, classes, k),
* goodman_kruskal(D, classes),
* intrinsic_dim(vectors),
to do the hubness analysis. Use the functions separately to do a more specific
analysis of your own data.
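For orientation, hubness is commonly measured as the skewness of the
k-occurrence distribution, where the k-occurrence of an item is the number of
k-nearest-neighbor lists it appears in. The script below is a self-contained
sketch of that definition on made-up 1-D points. It is not the toolbox's
implementation: the variable names are hypothetical, and hubness(D, k) may
differ in details such as tie handling or the skewness estimator. The last
two values follow one plausible reading of the anti-hub and largest-hub
percentages.

```octave
% Made-up 1-D points and their pairwise distance matrix.
x = [0; 1; 3; 6; 10];
D = abs(bsxfun(@minus, x, x'));
k = 1;
n = size(D, 1);

% k nearest neighbors of every item, excluding the item itself.
Dnn = D;
Dnn(1:n + 1:end) = Inf;
[~, idx] = sort(Dnn, 2);         % sort each row by increasing distance
nn = idx(:, 1:k);

% k-occurrence: how often each item appears in the neighbor lists.
Nk = accumarray(nn(:), 1, [n 1]);

% Skewness of the k-occurrence distribution (third standardized moment).
mu = mean(Nk);
skew = mean((Nk - mu) .^ 3) / mean((Nk - mu) .^ 2) ^ 1.5;

% Share of items that occur in no list, and share of lists containing
% the most frequent item.
anti_hubs_pct   = 100 * mean(Nk == 0);
largest_hub_pct = 100 * max(Nk) / n;

printf('skewness: %.2f, anti-hubs: %.2f%%, largest hub: %.2f%%\n', ...
       skew, anti_hubs_pct, largest_hub_pct);
```

In Matlab, replace printf with fprintf.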
--------------------------------------
EXAMPLE WITH BUNDLED DEXTER DATA SET
--------------------------------------
If hubness_analysis() is called without parameters, the DEXTER data set is
loaded and evaluated. See example_datasets/ABOUT for more information about
the data.
>> hubness_analysis()
NO PARAMETERS GIVEN! Loading & evaluating DEXTER data set.
DEXTER is a text classification problem in a bag-of-word
representation. This is a two-class classification problem
with sparse continuous input variables.
This dataset is one of five datasets of the NIPS 2003 feature
selection challenge.
http://archive.ics.uci.edu/ml/datasets/Dexter
Hubness Analysis
ORIGINAL DATA:
data set hubness (S^n=5) : 4.22
% of anti-hubs at k=5 : 26.67%
% of k=5-NN lists the largest hub occurs: 23.67%
k=5-NN classification accuracy : 80.33%
Goodman-Kruskal index (higher=better) : 0.104
original dimensionality : 20000
intrinsic dimensionality estimate : 161
MUTUAL PROXIMITY (Empiric/Slow):
data set hubness (S^n=5) : 0.64
% of anti-hubs at k=5 : 3.33%
% of k=5-NN lists the largest hub occurs: 6.00%
k=5-NN classification accuracy : 90.00%
Goodman-Kruskal index (higher=better) : 0.132
LOCAL SCALING (Original, k=10):
data set hubness (S^n=5) : 1.42
% of anti-hubs at k=5 : 5.33%
% of k=5-NN lists the largest hub occurs: 7.67%
k=5-NN classification accuracy : 86.00%
Goodman-Kruskal index (higher=better) : 0.156
SHARED NEAREST NEIGHBORS (k=10):
data set hubness (S^n=5) : 1.77
% of anti-hubs at k=5 : 5.67%
% of k=5-NN lists the largest hub occurs: 8.67%
k=5-NN classification accuracy : 73.33%
Goodman-Kruskal index (higher=better) : 0.152
>>