There are two problems in modern science:

 too many people use different terminology to solve the same problems;

 even more people use the same terminology to address completely different issues.


In recent years, there has been an explosive growth of methods for learning (or estimating dependencies) from data. This is not surprising given the proliferation of

 low-cost computers (for implementing such methods in software)

 low-cost sensors and database technology (for collecting and storing data)

 highly computer-literate application experts (who can pose “interesting” application problems)

A learning method is an algorithm (usually implemented in software) that estimates an unknown mapping (dependency) between a system’s inputs and outputs from the available data, namely from known (input, output) samples. Once such a dependency has been accurately estimated, it can be used for prediction of future system outputs from the known input values. This book provides a unified description of principles and methods for learning dependencies from data.
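As a minimal illustration of this definition, the sketch below estimates a dependency from a handful of known (input, output) samples and then predicts the output for a new input. Least-squares line fitting is used here only as a simple stand-in for a learning method; the function names and the training data are hypothetical and are not taken from this book.

```python
# Illustrative sketch only: a straight-line least-squares fit stands in
# for "a learning method". All names and data here are hypothetical.
from statistics import mean


def fit_line(samples):
    """Estimate (slope, intercept) from known (input, output) samples."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    # Closed-form least-squares estimates for a one-dimensional linear model.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in samples)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Use the estimated dependency to predict the output for a new input."""
    slope, intercept = model
    return slope * x + intercept


# Assumed training data lying exactly on y = 2x + 1 (noise-free for clarity).
training_samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = fit_line(training_samples)
prediction = predict(model, 4.0)  # predicted output for an input not seen in training
```

With noisy samples or a more flexible model, the same two steps (estimate from data, then predict) still apply, but the quality of prediction depends on issues such as model complexity and sample size.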

Methods for estimating dependencies from data have been traditionally explored in diverse fields such as statistics (multivariate regression and classification), engineering (pattern recognition), and computer science (artificial intelligence, machine learning, and, more recently, data mining). Recent interest in learning from data has resulted in the development of biologically motivated methodologies, such as artificial neural networks, fuzzy systems, and wavelets.

Unfortunately, developments in each field are seldom related to other fields, despite the apparent commonality of issues and methods. The mere fact that hundreds of “new” methods are being proposed each year at various conferences and in numerous journals suggests a certain lack of understanding of the basic issues common to all such methods.

The premise of this book is that there are just a handful of important principles and issues in the field of learning dependencies from data. Any researcher or practitioner in this field needs to be aware of these issues in order to successfully apply a particular methodology, understand a method’s limitations, or develop new techniques.

This book is an attempt to present and discuss such issues and principles (common to all methods) and then describe representative popular methods originating from statistics, neural networks, and pattern recognition. Often methods developed in different fields can be related to a common conceptual framework. This approach enables better understanding of a method’s properties, and it has methodological advantages over traditional “cookbook” descriptions of various learning algorithms.

Many aspects of learning methods can be addressed under a traditional statistical framework. At the same time, many popular learning algorithms and learning methodologies have been developed outside classical statistics. This happened for several reasons:

1. Traditionally, the statistician’s role has been to analyze the inferential limitations of the structural model constructed (proposed) by the application-domain expert. Consequently, the conceptual approach (adopted in statistics) is parameter estimation for model identification. For many real-life problems that require flexible estimation with finite samples, the statistical approach is fundamentally flawed. As shown in this book, learning with finite samples should be based on the framework known as risk minimization, rather than density estimation.

2. Statisticians have been late to recognize and appreciate the importance of computer-intensive approaches to data analysis. The growing use of computers has fundamentally changed the traditional boundaries between a statistician (data modeler) and a user (application expert). Nowadays, engineers and computer scientists successfully use sophisticated empirical data-modeling techniques (e.g., neural nets) to estimate complex nonlinear dependencies from the data.

3. Statistics (being part of mathematics) has developed into a closed discipline, with its own scientific jargon and academic objectives that favor analytic proofs rather than practical methods for learning from data.


Historically, we can identify three stages in the development of predictive learning methods. First, in 1985–1992 classical statistics gave way to neural networks (and other empirical methods, such as fuzzy systems) due to an early enthusiasm and naive claims that biologically inspired methods (e.g., neural nets) can achieve model-free learning not subject to statistical limitations. Even though such claims later proved to be false, this stage had a positive impact by showing the power and usefulness of flexible nonlinear modeling based on the risk minimization approach. Then in 1992–1996 came the return of statistics, as researchers and practitioners of neural networks became aware of their statistical limitations, initiating a trend toward interpretation of learning methods using a classical statistical framework. Finally, the third stage, from 1997 to the present, is dominated by the wide popularity of support vector machines (SVMs) and similar margin-based approaches (such as boosting), and the growing interest in the Vapnik–Chervonenkis (VC) theoretical framework for predictive learning.

This book is intended for readers with varying interests, including researchers/practitioners in data modeling with a classical statistics background, researchers/practitioners in data modeling with a neural network background, and graduate students in engineering or computer science.

The presentation does not assume a special math background beyond a good working knowledge of probability, linear algebra, and calculus at an undergraduate level. Useful background material on optimization and linear algebra is included in Appendixes A and B, respectively. We do not provide mathematical proofs, but, whenever possible, in place of proofs we provide intuitive explanations and arguments. Likewise, mathematical formulation and discussion of the major concepts and results are provided as needed. The goal is to provide a unified treatment of diverse methodologies (i.e., statistics and neural networks), and to that end we carefully define the terminology used throughout the book. This book is not easy reading because it describes fairly complex concepts and mathematical models for solving inherently difficult (ill-posed) problems of learning with finite data. To aid the reader, each chapter starts with a brief overview of its contents. Also, each chapter concludes with a summary containing an overview of open research issues and pointers to other (relevant) chapters.

Book chapters are conceptually organized into three parts:

 Part I: Concepts and Theory (Chapters 1–4). Following an introduction and motivation given in Chapter 1, we present a formal specification of the inductive learning problem in Chapter 2, which also introduces major concepts and issues in learning from data. In particular, it describes an important concept called an inductive principle. Chapter 3 describes the regularization (or penalization) framework adopted in statistics. Chapter 4 describes Vapnik’s statistical learning theory (SLT), which provides the theoretical basis for predictive learning with finite data. SLT, also known as VC theory, is important for understanding various learning methods developed in neural networks, statistics, and pattern recognition, and for developing new approaches, such as SVMs (described in Chapter 9) and noninductive learning settings (described in Chapter 10).

 Part II: Constructive Learning Methods (Chapters 5–8). This part describes learning methods for regression, classification, and density approximation problems. The objective is to show conceptual similarity of methods originating from statistics, neural networks, and signal processing and to discuss their relative advantages and limitations. Whenever possible, we relate constructive learning methods to the conceptual framework of Part I. Chapter 5 describes nonlinear optimization strategies commonly used in various methods. Chapter 6 describes methods for density approximation, which include statistical, neural network, and signal processing techniques for data reduction and dimensionality reduction. Chapter 7 provides descriptions of statistical and neural network methods for regression. Chapter 8 describes methods for classification.

 Part III: VC-Based Learning Methodologies (Chapters 9 and 10). Here we describe constructive learning approaches that originate in VC theory. These include SVMs (or margin-based methods) for several inductive learning problems (in Chapter 9) and various noninductive learning formulations (described in Chapter 10).

The chapters should be followed in sequential order, as the description of constructive learning methods is related to the conceptual framework developed in the first part of the book. A shortened sequence of Chapters 1–3 followed by Chapters 5–8 is recommended for beginning readers who are interested only in the description of statistical and neural network methods. This sequence omits the mathematically and conceptually challenging Chapters 4 and 9. Alternatively, more advanced readers who are primarily interested in SLT and SVM methodology may adopt the sequence of Chapters 2, 3, 4, 9, and 10.

In the course of writing this book, our understanding of the field has changed. We started with the currently prevailing view of learning methods as a collection of tricks. Statisticians have their own bag of tricks (and terminology), neural networks have a different set of tricks, and so on. However, in the process of writing this book, we realized that it is possible to understand the various heuristic methods (tricks) within a sound general conceptual framework. Such a framework is provided by SLT, developed mainly by Vapnik over the past 35 years. This theory combines fundamental concepts and principles related to learning with finite data, well-defined problem formulations, and rigorous mathematical theory. Although SLT is well known for its mathematical aspects, its conceptual contributions are not fully appreciated. As shown in our book, the conceptual framework provided by SLT can be used for improved understanding of various learning methods even where its mathematical results cannot be directly applied. Modern learning methods (i.e., flexible approaches using finite data) have slowly drifted away from the original problem statements posed in classical statistical decision and estimation theory. A major conceptual contribution of SLT is in revisiting the problem statement appropriate for modern data mining applications. On the very basic level, SLT makes a clear distinction between the problem formulation and a solution approach (also known as an inductive principle) used to solve a problem. Although this distinction appears trivial on the surface, it leads to a fundamentally new understanding of the learning problem not explained by classical theory. Although it is tempting to skip directly to constructive solutions, this book devotes enough attention to the learning problem formulation and important concepts before describing actual learning methods.





