Knowledge Discovery and Data Mining
Edited by Max Bramer (University of Portsmouth, UK)
Published by the Institution of Electrical Engineers. Summer 1999.
ISBN 0 85296 767 5. 326 pp.
Volume 1 in the IEE Professional Applications of Computing series
A collection of 12 refereed papers on the theory and practice of Knowledge Discovery
and Data Mining
Modern computer systems are accumulating data at an
almost unimaginable rate and from a very wide variety of sources: from
point of sale machines in the high street to machines logging every cheque
clearance, bank cash withdrawal and credit card transaction, to Earth
observation satellites in space. Three examples will serve to give an
indication of the volumes of data involved:
- The 1990 US census collected over a million million bytes of data
- The Human Genome project will store thousands of bytes for each of several
billion genetic bases
- NASA Earth observation satellites generate a terabyte (i.e. 10^12 bytes)
of data every day
Alongside advances in storage technology which increasingly
make it possible to store such vast amounts of data at relatively low
cost, whether in commercial data warehouses, scientific research laboratories
or elsewhere, has come a growing realisation that such data contains buried
within it knowledge that can be critical to a company's growth or decline,
knowledge that could lead to important discoveries in science, knowledge
that could enable us accurately to predict the weather and natural disasters,
knowledge that could enable us to identify the causes of and possible
cures for lethal illnesses, knowledge that could literally mean the difference
between life and death. Yet the huge volumes involved mean that most of
this data is merely stored - never to be examined in more than the most
superficial way, if at all. Machine learning technology, some of it very
long established, has the potential to solve the problem of the tidal
wave of data that is flooding around organisations, governments and individuals.
Knowledge Discovery has been defined as the 'non-trivial extraction
of implicit, previously unknown and potentially useful information from data'.
The underlying technologies of knowledge discovery include induction of decision
rules and decision trees, neural networks, genetic algorithms, instance-based
learning and statistics. There is a rapidly growing body of successful applications
in a wide range of areas as diverse as:
- Medical Diagnosis
- Weather Forecasting
- Product Design
- Electric Load Prediction
- Thermal Power Plant Optimisation
- Analysis of Organic Compounds
- Credit Card Fraud Detection
- Predicting Share of Television Audiences
- Real Estate Valuation
- Toxic Hazard Analysis
- Automatic Abstracting
- Financial Forecasting
The book comprises six papers on technical issues in the field of Knowledge
Discovery and Data Mining followed by six chapters on applications. It grew
out of a colloquium on Knowledge Discovery and Data Mining which I organised
for Professional Group A4 (Artificial Intelligence) of the Institution of Electrical
Engineers (IEE) in London on May 7th and 8th 1998. This was the third in a series
of colloquia on this topic which began in 1995. The colloquium was co-sponsored
by BCS-SGES (the British Computer Society Specialist Group on Knowledge Based
Systems and Applied Artificial Intelligence), AISB (the Society for Artificial
Intelligence and Simulation of Behaviour) and AIED (the International Society
for AI and Education).
The papers included here have been significantly expanded from those presented
at the colloquium and were selected for inclusion following a rigorous refereeing
process. The book should be of particular interest to researchers and active
practitioners in this increasingly important field. I should like to thank the
referees for their valuable contribution and Jonathan Simpson (formerly of the
IEE) for his encouragement to publish the proceedings in book form.
Part I: Knowledge Discovery and Data Mining in Theory looks
at a variety of technical issues, all of considerable practical importance for
the future development of the field.
- Estimating Concept Difficulty with Cross-Entropy by Kamal Nazar
and Max Bramer presents an approach to anticipating and overcoming
some of the problems which can occur in applying a learning algorithm due
to unfavourable characteristics of a particular dataset such as feature interaction.
- Analysing Outliers by Searching for Plausible Hypotheses by Xiaohui
Liu and Gongxian Cheng describes a method for determining whether
'outliers' in data are merely noise or potentially valuable information and
presents experimental results on visual function data used for diagnosing
two blinding diseases: glaucoma and onchocerciasis.
- Attribute-Value Distribution as a Technique for Increasing the Efficiency
of Data Mining by David McSherry describes a method for efficient
rule discovery, illustrated by generating rules for the domain of contact
- Using Background Knowledge with Attribute-Oriented Data Mining by
Mary Shapcott, Sally McClean and Bryan Scotney looks at the important
question of how background knowledge of a domain can be used to aid the data
- A Development Framework for Temporal Data Mining by Xiaodong Chen
and Ilias Petrounias is concerned with datasets which include information
about time. The paper presents a framework for discovering temporal patterns
and a query language for extracting them from a database.
- An Integrated Architecture for OLAP and Data Mining by Zhengxin
Chen examines features of data mining specific to the data warehousing environment
where On-Line Analytical Processing (OLAP) takes place. An integrated architecture
for OLAP and data mining is proposed.
Part II: Knowledge Discovery and Data Mining in Practice begins
with a chapter entitled Empirical Studies of the Knowledge Discovery Approach
to Health Information Analysis by Michael Lloyd-Williams which introduces
the basic concepts of knowledge discovery, identifying data mining as an information
processing activity within a wider knowledge discovery process (although the terms
knowledge discovery and data mining are often used interchangeably). The chapter
presents empirical studies of the use of a neural network learning technique known
as the Kohonen Self-Organising Map in the analysis of health information taken
from three sources: the World Health Organisation's 'Health for All' database,
the 'Babies at Risk of Intrapartum Asphyxia' database and a series of databases
containing infertility information.
The next chapter, Direct Knowledge Discovery and Interpretation from a
Multilayer Perceptron Network that Performs Low Back Pain Classification by Marilyn
Vaughn et al., discusses the use of a widely used type of neural network,
the Multi-Layer Perceptron (MLP), to classify patients suffering from low back
pain, an ailment that an estimated 60% to 80% of the population
will experience at least once in their lives. A particular emphasis
of this work is on the induction of rules from the training examples.
The two chapters on medical applications are followed by two on meteorology.
Discovering Knowledge from Low-Quality Meteorological Databases by Craig
Howard and Vic Rayward-Smith proposes a strategy for dealing with databases
containing unreliable or missing data based on experiences derived from experiments
with a number of meteorological datasets.
A Meteorological Knowledge Discovery Environment by Alex Buchner
describes a knowledge discovery environment which allows experimentation with
a variety of textual and graphical geophysical data, incorporating a number of
different types of data mining model.
The final two chapters are concerned with the application of knowledge discovery techniques in two other important
areas: organic chemistry and the electricity supply industry.
Mining the Organic Compound Jungle: A Functional Programming Approach
by Kathryn Burn-Thornton and John Bradshaw describes experiments aimed
at enabling researchers in the pharmaceutical industry to determine common substructures
of organic compounds using data mining techniques rather than by the traditional
method involving visual inspection of graphical representations.
Data Mining with Neural Networks: An Applied Example in Understanding
Electricity Consumption Patterns by Philip Brierley and W. Batty gives
further information about neural networks and shows how they can be used to analyse
electricity consumption data as an aid to comprehension of the factors influencing
demand. Fortran 90 source code for a multi-layer perceptron is also provided as
a way of showing that 'implementing a neural network can be a very simple process
that does not require sophisticated simulators or super-computers'.
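To give a feel for how small such an implementation can be, here is a minimal sketch of a multi-layer perceptron with one hidden layer, trained by plain back-propagation. It is not the Fortran 90 code from the chapter: it is an independent illustration in Python, and the function name train_mlp, the XOR training set, the number of hidden units, the learning rate and the epoch count are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_mlp(samples, n_hidden=3, epochs=5000, lr=0.5, seed=0):
    """Train a one-hidden-layer MLP with online gradient descent.

    samples is a list of (inputs, target) pairs with targets in [0, 1].
    Returns (predict, loss_before, loss_after). All names and defaults
    here are illustrative assumptions, not taken from the book.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each weight vector stores its bias as the last element.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1])
             for ws in w_hidden]
        y = sigmoid(sum(w * hi for w, hi in zip(w_out[:-1], h)) + w_out[-1])
        return h, y

    def mean_squared_error():
        return sum((forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    loss_before = mean_squared_error()
    for _ in range(epochs):
        for x, t in samples:
            h, y = forward(x)
            # Error signals for squared error with sigmoid units.
            delta_out = (y - t) * y * (1 - y)
            delta_hidden = [delta_out * w_out[j] * h[j] * (1 - h[j])
                            for j in range(n_hidden)]
            # Update output-layer weights and bias.
            for j in range(n_hidden):
                w_out[j] -= lr * delta_out * h[j]
            w_out[-1] -= lr * delta_out
            # Update hidden-layer weights and biases.
            for j in range(n_hidden):
                for i in range(n_in):
                    w_hidden[j][i] -= lr * delta_hidden[j] * x[i]
                w_hidden[j][-1] -= lr * delta_hidden[j]

    def predict(x):
        return forward(x)[1]

    return predict, loss_before, mean_squared_error()


# Toy training set (XOR), used only to exercise the code.
XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
predict, loss_before, loss_after = train_mlp(XOR)
```

Comparing loss_before with loss_after shows whether training reduced the error on the toy data. A real application such as electricity demand analysis would need real measurements, input scaling and a held-out test set, none of which are attempted here.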