The 13th ACM International Conference on

Web Search and Data Mining

 

TUTORIAL

Intelligible Machine Learning and Knowledge Discovery Boosted by Visual Means

 

Boris Kovalerchuk

Department of Computer Science, Central Washington University, USA

Slides

 

Motivation

Intelligible machine learning and knowledge discovery are important for modeling individual and social behavior, user activity, link prediction, community detection, crowd-generated data, and other tasks. Interpretable methods also play a significant role in web search and mining, where they enhance clustering, classification, data summarization, knowledge acquisition, opinion and sentiment mining, web traffic analysis, and web recommender systems.

The success of deep learning in prediction accuracy, and its failure to explain the produced models without special interpretation efforts, motivated a surge of work on making Machine Learning (ML) models more intelligible and understandable. The prominence of visual methods in producing appealing explanations of ML models motivated the growth of deep visualization and visual knowledge discovery.

This tutorial covers state-of-the-art research, development, and applications in the area of Intelligible Knowledge Discovery and Machine Learning boosted by Visual Means. The topic is interdisciplinary, bridging the efforts of research and applied communities in Data Mining, Machine Learning, Visual Analytics, Information Visualization, and HCI. This is a novel and fast-growing area with significant applications and potential.

Interactive Machine Learning and Visual Knowledge Discovery (VKD) enhance the analytical and visualization methods for discovering hidden patterns in multidimensional data. The fundamental challenge for visual discovery in multidimensional data is that we cannot see n-D data with the naked eye and need visual analytics tools ("n-D glasses"). This challenge starts at 4-D. Multidimensional data are often visualized by non-reversible, lossy dimension reduction methods such as Principal Component Analysis (PCA). While these methods are very useful, they can remove information critical for knowledge discovery in n-D data before the search for n-D patterns even starts, and the artificial features they generate are difficult to interpret. Therefore, the expansion of reversible, lossless, and interpretable visualization methods is important. Hybrid methods, which combine such reversible methods with non-reversible visualization and ML methods, open wide new opportunities for knowledge discovery in n-D data. Lossless displays are important because they make it possible (1) to restore all attributes of each n-D data point from the displayed graphs, (2) to leverage the unique power of human vision to compare hundreds of their features in parallel, and (3) to speed up the selection of an appropriate n-D model.
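The contrast between lossy and reversible displays can be illustrated with a minimal sketch. It assumes synthetic 4-D data, a plain SVD-based PCA, and parallel coordinates as a simple stand-in for a reversible display. The function names (`pca_project`, `pca_reconstruct`, `to_polyline`, `from_polyline`) are hypothetical and chosen for this illustration only; they are not taken from the tutorial materials.

```python
import numpy as np

def pca_project(X, k=2):
    """Project rows of X onto the top-k principal components (lossy when k < n)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:k]
    return (X - mean) @ components.T, components, mean

def pca_reconstruct(Z, components, mean):
    """Approximate inverse of pca_project; exact only if no variance was dropped."""
    return Z @ components + mean

def to_polyline(x):
    """Encode an n-D point as the 2-D vertices (i, x_i) of a parallel-coordinates polyline."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.arange(len(x)), x])

def from_polyline(vertices):
    """Recover the n-D point by reading the vertical coordinate of each vertex."""
    return vertices[:, 1].copy()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # synthetic 4-D data, for illustration only

# Lossy path: 4-D -> 2-D -> 4-D leaves a nonzero reconstruction error.
Z, components, mean = pca_project(X, k=2)
pca_error = np.abs(pca_reconstruct(Z, components, mean) - X).max()

# Reversible path: each point becomes a polyline from which all attributes can be read back.
polylines = [to_polyline(x) for x in X]
polyline_error = max(np.abs(from_polyline(p) - x).max() for p, x in zip(polylines, X))

print(f"max PCA reconstruction error:      {pca_error:.3f}")
print(f"max polyline reconstruction error: {polyline_error:.3f}")
```

In this sketch the 2-D PCA projection cannot restore the original four attributes, while the polyline encoding restores them exactly. The price of the reversible display is a more complex graph per point, which is one reason hybrid approaches are of interest.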

 

Outline of topics

The tutorial includes an analysis of the major approaches: (1) to visualize Machine Learning models produced by analytical ML methods, (2) to discover ML models by visual means, (3) to explain deep and other ML models by visual means, (4) to discover visual ML models assisted by analytical ML algorithms, and (5) to discover analytical ML models assisted by visual means. Approach (1) has multiple goals, in contrast with the specific goal of explanation in (3). Some ML model visualizations do not produce an explanation but can create a basis for deriving one. Also, discovering ML models by visual means, as in (2), can be quite limited without the assistance from analytical ML covered by (4).

The presenter will review and compare reversible and non-reversible visual knowledge discovery methods such as General Line Coordinates, PCA, Multidimensional Scaling, Manifolds, and others. Successful real-world applications will be presented, along with a discussion of how to apply these methods in multiple domains. The presenter will use relevant material from the references and from his books, including "Visual Knowledge Discovery and Machine Learning" (Springer, 2018), as well as from his recent tutorials at the ACM, IEEE and HCI International Conferences and at the e-Science Institute of the University of Washington.

Participants will be introduced to lossless visual representation of high-dimensional data, based on the new concept of General Line Coordinates (GLC). The theoretical background, which includes the Johnson-Lindenstrauss lemma, will be provided. Methods for combining GLC with embeddings of high-dimensional data will be presented to demonstrate the scalability of the methods.
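For reference, the Johnson-Lindenstrauss lemma mentioned above can be stated in its common textbook form as follows. The symbols m, n, k, ε, and f are notation chosen here for illustration, and how the tutorial itself applies the lemma is not specified in this description.

```latex
% Johnson-Lindenstrauss lemma (common textbook form).
% The symbols m, n, k, \varepsilon, f are notation assumed here, not taken from the tutorial.
For any $0 < \varepsilon < 1$ and any set $X$ of $m$ points in $\mathbb{R}^n$,
there is a linear map $f : \mathbb{R}^n \to \mathbb{R}^k$ with
$k = O(\varepsilon^{-2} \log m)$ such that, for all $u, v \in X$,
\[
  (1 - \varepsilon)\,\lVert u - v \rVert^2
  \;\le\; \lVert f(u) - f(v) \rVert^2
  \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2 .
\]
```

The target dimension k depends on the number of points and the tolerance ε, not on the original dimension n. Such an embedding preserves pairwise distances only approximately, so it is not reversible in the lossless sense described earlier.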