ABSTRACT:
Outlier detection in high-dimensional data presents various challenges resulting from
the “curse of dimensionality.” A prevailing view is that distance
concentration, i.e., the tendency of distances in high-dimensional data to
become indiscernible, hinders the detection of outliers by making
distance-based methods label all points as almost equally good outliers. In
this paper we provide evidence supporting the opinion that such a view is too
simple, by demonstrating that distance-based methods can produce more
contrasting outlier scores in high-dimensional settings. Furthermore, we show
that high dimensionality can have a different impact, by reexamining the notion
of reverse nearest neighbors in the unsupervised outlier-detection context.
Namely, it was recently observed that the distribution of points’
reverse-neighbor counts becomes skewed in high dimensions, resulting in the
phenomenon known as hubness. We provide insight into how some points (antihubs)
appear very infrequently in k-NN lists of other points, and explain the
connection between antihubs, outliers, and existing unsupervised
outlier-detection methods. By evaluating the classic k-NN method, the
angle-based technique (ABOD) designed for high-dimensional data, the
density-based local outlier factor (LOF) and influenced outlierness (INFLO)
methods, and antihub-based methods on various synthetic and real-world data
sets, we offer novel insight into the usefulness of reverse neighbor counts in
unsupervised outlier detection.
AIM
The aim of this paper is to show that high dimensionality can affect outlier
detection in ways other than distance concentration, by reexamining the notion
of reverse nearest neighbors in the unsupervised outlier-detection context.
SCOPE
The scope of this project is to evaluate the classic k-NN method, the
angle-based technique (ABOD) designed for high-dimensional data, the
density-based local outlier factor (LOF) and influenced outlierness (INFLO)
methods, and antihub-based methods on various synthetic and real-world data
sets, in order to offer insight into the usefulness of reverse neighbor counts
in unsupervised outlier detection.
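As a point of reference for the methods listed above, the following is a minimal sketch of a classic k-NN style outlier score, assuming the score of a point is the Euclidean distance to its k-th nearest neighbor. The function name knn_distance_scores and the toy data are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Assumption: larger score means more outlying.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical toy data: a tight cluster plus one isolated point (index 4).
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 6.0)]
scores = knn_distance_scores(data, k=2)
most_outlying = max(range(len(data)), key=scores.__getitem__)
print("most outlying index:", most_outlying)
```

On this toy data the isolated point receives by far the largest score, because its second-nearest neighbor lies in the distant cluster.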
EXISTING SYSTEM
Prior work distinguishes three problems brought by the “curse of
dimensionality” in the general context of search, indexing, and data mining
applications: poor discrimination of distances caused by concentration,
presence of irrelevant attributes, and presence of redundant attributes, all of
which hinder the usability of traditional distance and similarity measures. The
authors conclude that despite such limitations, common distance/similarity
measures still form a good foundation for secondary measures, such as
shared-neighbor distances, which are less sensitive to the negative effects of
the curse. Related work extends the discussion of problems relevant to
unsupervised outlier-detection methods in high-dimensional data by identifying
seven issues in addition to distance concentration: noisy attributes,
definition of reference sets, bias (comparability) of scores, interpretation
and contrast of scores, exponential search space, data-snooping bias, and
hubness. In this article we will focus on the aspect of hubness, and assume
that all attributes carry useful information, i.e., are not overly noisy.
DISADVANTAGES:
- Curse of dimensionality
- The tendency of distances in high-dimensional data to become indiscernible.
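The second disadvantage, distance concentration, can be observed with a small experiment. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast (max - min) / min of the distances from one random query point to the others. The function name relative_contrast, the sample size, and the seed are illustrative assumptions, not part of the paper.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over distances from one query to random points.

    Assumption: all coordinates are i.i.d. uniform on [0, 1).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(dim=2)     # low-dimensional setting
high = relative_contrast(dim=500)  # high-dimensional setting
print(f"relative contrast, d=2: {low:.2f}  d=500: {high:.2f}")
```

Under these assumptions the contrast in 500 dimensions comes out much smaller than in 2 dimensions, which is the effect usually blamed for making distance-based outlier scores hard to tell apart.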
PROPOSED SYSTEM
Reverse nearest-neighbor counts have been proposed in the past as a method for
expressing outlierness of data points, but no insight apart from basic
intuition was offered as to why these counts should represent meaningful
outlier scores. Recent observations that reverse-neighbor counts are affected
by increased dimensionality of data warrant their reexamination for the
outlier-detection task. In this light, we revisit the ODIN method and explore
two ways of using k-occurrence information for expressing the outlierness of
points. Our main goal is to provide insight into the behavior of k-occurrence
counts in different realistic scenarios (high and low dimensionality,
multimodality of data) that would assist researchers and practitioners in using
reverse neighbor information in a less ad-hoc fashion. We describe experiments
with synthetic and real data sets, the results of which illustrate the impact
of factors such as dimensionality, cluster density and antihubs on outlier
detection, demonstrating the benefits of the methods and the conditions in
which the benefits are expected.
- Focusing on the effects of high dimensionality on unsupervised outlier-detection methods and the hubness phenomenon, extending the previous examinations of (anti)hubness to large values of k, and exploring the relationship between hubness and data sparsity.
- It would be interesting to examine supervised and semi-supervised methods as well.
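The idea of using k-occurrence information as an outlier score can be sketched as follows. The k-occurrence count of a point is the number of other points that include it in their k-NN lists; points with very low counts are the antihubs. The mapping from count to score used here, 1 / (count + 1), is an assumption made for illustration (any decreasing function would preserve the ranking), and the names k_occurrences and antihub_scores are hypothetical rather than taken from the paper.

```python
import math

def k_occurrences(points, k):
    """Count, for each point, how many other points list it among their k-NN."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        # Indices of all other points, ordered by distance from p.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            counts[j] += 1
    return counts

def antihub_scores(points, k):
    """Map low reverse-neighbor counts to high outlier scores.

    Assumption: score = 1 / (count + 1), so a count of 0 gives the maximum 1.0.
    """
    return [1.0 / (n + 1) for n in k_occurrences(points, k)]

# Hypothetical toy data: a tight cluster plus one isolated point (index 4).
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 6.0)]
counts = k_occurrences(data, k=2)
scores = antihub_scores(data, k=2)
print("k-occurrence counts:", counts)
print("outlier scores:", scores)
```

The isolated point appears in no other point's 2-NN list, so it has a count of zero and the highest score. The counts always sum to k times the number of points, since every point contributes exactly k entries to the k-NN lists.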
SYSTEM CONFIGURATION
HARDWARE REQUIREMENTS:-
· Processor - Pentium III
· Speed - 1.1 GHz
· RAM - 256 MB (min)
· Hard Disk - 20 GB
· Floppy Drive - 1.44 MB
· Keyboard - Standard Windows Keyboard
· Mouse - Two or Three Button Mouse
· Monitor - SVGA
SOFTWARE REQUIREMENTS:-
· Operating System : Windows 7
· Front End : JSP and Servlet
· Database : MySQL
REFERENCE:
Nanopoulos, A.; Ivanovic, M.; Radovanovic, M., “Reverse Nearest Neighbors in
Unsupervised Distance-Based Outlier Detection”, IEEE Transactions on Knowledge
and Data Engineering, Volume 27, Issue 5, November 2014.