"Genetic Algorithms for Data Mining and Multivariate Data Analysis"
by Charles E. Davidson

Table of Contents

1 Introduction
2 Multivariate Classification and Pattern Recognition
2.1 Data Preprocessing
2.2 Mapping and Display
2.3 Clustering
2.4 Classification
2.4.1 k-Nearest Neighbor
2.4.2 Discriminant Analysis
2.4.3 Partial Least Squares
2.4.4 SIMCA
2.5 Practical Considerations
2.6 Applications of Pattern Recognition Techniques
2.6.1 Fuel Spill Identification
2.6.2 Sorting Plastics for Recycling
2.7 A Closer Look at Feature Selection
2.7.1 Exhaustive Search
2.7.2 Weight-Based Selection
    Variance Weights
    Fisher Weights
    Classifier-based Weights
    Including/Excluding a Feature
2.8 Conclusion
3 Genetic Algorithms
3.1 The Simple Genetic Algorithm (SGA)
3.1.1 Schemata
3.1.2 The Schema Theorem
3.1.3 Optimization
3.2 Applying a GA
3.3 Customizing a GA
3.3.1 Encoding
3.3.2 The Fitness Function
3.3.3 Selection
3.3.4 Reproduction
    1-Point versus 2-Point Crossover
    From 2-Point to n-Point
    Crossover Reordering
3.3.5 Insertion
3.3.6 Other Operators
3.3.7 Controlling Parameters
3.4 Conclusion
4 A Genetic Algorithm for Feature Selection
4.1 Basic PCKaNN
4.1.1 Population
4.1.2 Fitness Function
4.1.3 Selection
4.1.4 Crossover
4.1.5 Mutation
4.1.6 Insertion
4.1.7 End Criterion
4.2 Advanced PCKaNN
4.2.1 Culling
4.2.2 Ordinal PCKaNN
4.2.3 Taking the PCA out of PCKaNN
4.2.4 A Clustering GA
4.2.5 Incorporation of Transverse Learning
4.3 Conclusion
5 Analysis of Complex Chromatographic and Spectroscopic Data
5.1 Chemical Communication Among Social Insects
5.1.1 Experimental
5.1.2 Results
    Time Influence
    Queen Influence
    Time vs. Queen Influence
5.1.3 Conclusion
5.2 Quality Control of Pharmaceutical Tablets
5.2.1 Un-normalized Data
5.2.2 Normalized Data
5.2.3 Conclusion
5.3 Raman Spectroscopy of Hard, Soft, and Tropical Woods
5.3.1 Experimental
5.3.2 Results
5.4 Fuel Spill Identification
6 Extracting Information from Biological Tissue
6.1 DNA Microarray Data
6.1.1 Small Round Blue Cell Tumors
6.1.2 Leukemia
6.2 Proteomic Data
6.2.1 Ovarian Cancer
6.3 Conclusion
7 Extracting Information from Biological Compounds
7.1 Musk Odorants
7.1.1 Experimental
7.1.2 TAE Descriptors
    Results
7.1.3 Wavelet/PEST Descriptors
    Macrocycles
    Nitroaromatics
    Macrocycles + Nitroaromatics
8 Conclusion
A MATLAB Implementation
A.1 Data Formatting Issues
A.2 Data Organization within MATLAB
A.3 Building a Project
A.4 Building a Run Plan
A.5 Activating a Run Plan
A.6 Helpful Hints and Advanced Techniques
A.6.1 Population Parameters
A.6.2 Tuning the k-Value
    The Asymmetric Case
A.6.3 Ordinal Fitness
A.6.4 Clustering
A.6.5 Transverse Learning
A.6.6 Kohonen Neural Network
A.6.7 Other Non-PCA Approaches
A.7 Investigating the Results of a Run
A.7.1 Using projview
    projview Menus
    Viewing the Raw Data
A.7.2 Using somproj
    somproj Menus
    Weights and Net Topology
    Neighborhood and Learning Functions
A.8 Manually Editing the Project
A.9 Do-it-yourself Functions
A.9.1 prune and process
A.10 Conclusion
B Dynamics Simulations and Molecular Binding
B.1 Data Formatting Issues
B.2 Building a Project
B.3 Determining Similarity from Molecular Coordinates
B.4 Visualization by PCA
B.5 Feature Selection
B.6 An Example: Oxygen Binding by Myoglobin
B.7 Conclusion