Data Mining in Finance


Data Mining Standard

Comments on the Microsoft draft standard (specification) for Data Mining



KDnuggets News:
B. Kovalerchuk, Abstract
Z. Tang, OLE DB for DM


April 23, 2000

Microsoft, with support from data mining companies (ANGOSS Software, Appsource, Comshare, DB Miner Technology, Knosys, Magnify, Megaputer Intelligence, Maximal Innovative Intelligence, NCR, PolyVista and SPSS), has developed a draft standard for data mining (OLE DB for Data Mining, DRAFT Specification):

http://www.microsoft.com/presspass/press/2000/Mar00/DataMiningPR.asp

http://www.microsoft.com/data/oledb/.

This draft (Version 0.9) is open for public discussion until May 15, 2000.

From our viewpoint the main goals of these specifications are

1) to unify terminology,

2) to unify, simplify and speed up communications between databases, data mining tools (called data mining services), mined knowledge (in the form of data mining models) and data mining final output (in the form of forecasts, ranking, distributions, associations, correlations and so on for a particular data set), and

3) to help to select (automatically) the most appropriate DM services/algorithms for a specific data set.

To solve these tasks Microsoft specified metadata. These metadata describe each data column (target column, column used for forecasting the target, numeric data formats, contents of the column, type of the possible DM model and so on).

Similarly metadata are specified for DM services, characterizing an algorithm's capabilities.

Some flexibility is permitted. DM services can add provider-specific metadata.

Potentially these two sets of metadata (database metadata and DM service metadata) can be matched automatically for selecting an appropriate Data Mining service. This productive idea of matching probably was most clearly illuminated by Dhar and Stein in the concept of problem ID [Intelligent Decision Support Methods, Prentice Hall, 1997].
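
The matching idea can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not part of the draft: the classes `ColumnMeta` and `ServiceMeta`, the service names and the flag sets are hypothetical, and a service is taken to match when it supports the content flag of every column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical database-side metadata for one column."""
    name: str
    content: str  # content flag, e.g. "DISCRETE" or "CONTINUOUS"


@dataclass(frozen=True)
class ServiceMeta:
    """Hypothetical provider-side metadata for one DM service."""
    name: str
    supported_content: frozenset


def matching_services(columns, services):
    """Return the names of services that support every column's content flag."""
    needed = {c.content for c in columns}
    return [s.name for s in services if needed <= s.supported_content]


# Illustrative data only; neither the columns nor the services come from the draft.
columns = [ColumnMeta("Age", "CONTINUOUS"), ColumnMeta("Gender", "DISCRETE")]
services = [
    ServiceMeta("tree_service", frozenset({"DISCRETE", "ORDERED", "CONTINUOUS"})),
    ServiceMeta("regression_service", frozenset({"CONTINUOUS"})),
]
print(matching_services(columns, services))
```

The sketch makes the dependence on terminology visible: the match is only as good as the agreement between consumer and provider on what each flag means.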

For this matching to occur, it is critical that the specifications capture the really important features of both data and algorithms and are flexible enough to incorporate future algorithm developments and improvements. From our viewpoint, the most sensitive component for matching is the type of contents of columns. Microsoft suggests the following flags for types of data contents: key, discrete, continuous, cyclical, ordered, probability distribution and so on. For instance, the flag PROBABILITY permits matching services that work with probability distributions to databases that contain probability data. However, the matching for discrete, ordered, continuous and some other content data types is not so obvious.

There are two difficulties:

1. Terminology (equal terms should have the same meanings for DM consumers and providers)
2. OLE DB for DM Grammar (Microsoft draft, p.80) should permit adequate matching.

The MS draft provides the following description for data contents.

Type Flags

KEY: The column is discrete and is a key. Key columns will not have any other flags except in the case of a nested table with no attribute columns.

CONTINUOUS: The column contains values in a continuous range, such as Age or Salary.

DISCRETE: The column contains a discrete set of values, such as Gender.

DISCRETIZED: The column contains a continuous set of values that should be converted to buckets.

ORDERED: The column contains a discrete set of values that are ordered, such as Salary Level.

CYCLICAL: The column contains an ordered discrete set of values that are cyclical, such as Day of Week, or Month.

SEQUENCE TIME: The column contains time measurement units.

Specifically, DISCRETE is described as follows: "Even if the values are numeric, NO ORDERING is implied by the values. ("Area Code" is a good example.)"
This means that DISCRETE could be used to represent ordered data (salary levels) or unordered data (gender). No specific flag is provided for data that are both DISCRETE and UNORDERED. Two flags should be set for this case, DISCRETE and UNORDERED, but the grammar presented by Microsoft does not provide the flag UNORDERED.
According to Microsoft, a column definition has one of the following forms:

<column name> <type> [<content flags>] [<column relation>] [<prediction flag>]
<column name> TABLE [<prediction flag>] ( < non-table column definition list > )

The field <content flags> can be selected as one of the words: continuous, discrete, discretized, sequence_time, ordered and cyclical.

<col_content> -> DISCRETE
| CONTINUOUS
| DISCRETIZED( [<disc_method> [, <numeric_const>]] )
| SEQUENCE_TIME
<col_content_qual> -> ORDERED
| CYCLICAL

We suggest adding a new flag to identify DISCRETE and UNORDERED using the term NOMINAL (or classification):
<col_content> -> NOMINAL
| DISCRETE
| CONTINUOUS
| DISCRETIZED( [<disc_method> [, <numeric_const>]] )
| SEQUENCE_TIME

This term has been used as a standard term in measurement theory for more than 30 years [P. Suppes, J. Zinnes, Basic Measurement Theory, in: Handbook of Math. Psychology, v. 1, 1963, Wiley].
We think that it will be more efficient to add the NOMINAL type directly into the grammar rather than to rely on non-unified terms provided by an individual vendor.

Why is it important to add the NOMINAL data type to the grammar directly?
The Microsoft draft provides the following example (Tree Model to Predict Credit Risk):

<?xml version="1.0"?>
<pmml>
<statements>
<statement type = "CREATE" value = "Create Mining Model CreditTree1
( ID long key,
Credit text discrete predict,
Education text discrete,
Age text discrete,
Pay text discrete
) using microsoft_decision_trees
"/>
<statement type = "TRAIN" value = "Insert Into CreditTree1 ( ID, Credit, Education, Age, Pay)
OPENROWSET("Microsoft.Jet.OLEDB.4.0",
"data source=w:\test\demozero\credit.mdb",
"SELECT ID, Credit, Education, Age , Pay FROM CreditTraining" )
"/>

This example uses four DISCRETE columns: Credit (bad, good), Education (Bachelor, Partial College, High School, Partial High School, ...), Age (Young, Middle, Old) and Pay (Weekly pay, Monthly salary, ...). Credit, Education and Age are clearly ORDERED. On the other hand, when one considers Pay, it is not so obvious what kind of order makes sense. Yet all four columns are described as DISCRETE without any specifics for Pay. Similarly, a new discrete column Occupation (professor, student, composer, artist, ...) can be added to this example and coded as 1, 2, 3, 4, ... Again, there is no obvious order for occupation. Therefore, a DM service should not treat the codes 1, 2, 3, 4 as ordered numbers. They are just labels, and any meaningful data mining algorithm should treat them as such, i.e., avoid knowledge discovery computations that include the relations ">" or "<", because these are not defined for Occupation. DISCRETE and UNORDERED (NOMINAL) is just one example from the large set of content data types not represented in the draft grammar.
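
To make the point about labels concrete, here is a minimal Python sketch. It is our own illustration and not part of the draft; the class name `Occupation` and the codes 1 to 4 are hypothetical. A plain `Enum` behaves like a NOMINAL type: equality is defined, but the relations "<" and ">" are not, even though the underlying codes are integers.

```python
from enum import Enum


class Occupation(Enum):
    """Nominal labels: the integer codes are identifiers, not magnitudes."""
    PROFESSOR = 1
    STUDENT = 2
    COMPOSER = 3
    ARTIST = 4


a, b = Occupation.PROFESSOR, Occupation.ARTIST

# Equality is a meaningful relation for nominal data.
same_occupation = (a == b)

# Ordering is not: plain Enum members reject "<" with a TypeError.
try:
    a < b
    ordering_defined = True
except TypeError:
    ordering_defined = False

print(same_occupation, ordering_defined)
```

A service that received the raw codes 1, 2, 3, 4 without such a type description would have no way to know that comparing them is meaningless.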
Another example is the grades that millions of students receive each year at universities and colleges. Professors give letter grades such as A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F and I (incomplete).
The I is a grade (a cell value); it is not a mark that the cell is empty. Is a letter grade a discrete, ordered data content type? Without the grade I (incomplete) it is, but the I grade makes this data type special. There is no term for this data type in the grammar. Ignoring the I grade, we might match this ORDERED column with a data mining algorithm (service) such as decision trees, which work properly with ordered data. However, letter grades are converted by the university registrar's office into numeric values to compute GPA, using a mapping such as 4 for A. In this way, an ORDERED column can become DISCRETIZED or even CONTINUOUS, completely changing the set of applicable data mining algorithms. Now we can apply the whole spectrum of numeric statistical methods and discover trends in students' performance by comparing GPA probability distributions for different subjects, professors, groups of students and so on. It is not clear how the current draft can address this issue. If a DM service provider gets letter grades generated by professors without a mapping into numbers, this can prevent the service from using some of the most suitable algorithms. If the computed numeric column is provided instead of letter grades then the problem can be solved, but in this case the DM service will actually work with a secondary database, and developing that secondary database requires extra effort. In both scenarios information about the mapping should be available. The most natural place to keep this mapping would be metadata, which should accompany the original column.
However, we oppose unrestricted extension of the grammar with more and more terms. Instead we suggest adding a special reference to metadata, similar to Microsoft's suggestion for Model ID (Model Catalog, Model Schema, Model name). This will be a reference to an application programming interface (API) that represents the contents of the column, e.g., letter grade. It can be a C++ header file with a description of the content data type and an implementation file for the member functions. In particular, the above-mentioned mapping for letter grades can be naturally implemented in this OOP approach and supplied to a DM service provider by a DM consumer. It can include the alphabet of the column and all meaningful operations and relations over it. The set of these APIs can be developed by industry vendors or by a DM volunteer community (similar to Linux) and made publicly available. If this idea is implemented, it will impact not only Microsoft products but also many others.
We would suggest an open discussion on the subject. This OOP approach is outlined in the recent book [Data Mining in Finance: Advances in Relational and Hybrid Methods, by B. Kovalerchuk and E. Vityaev, Kluwer, 2000, see pp. 164, 169-186]. The study is based on the concept of measurement scales and homomorphisms for scales pioneered by Professor P. Suppes at Stanford University [Foundations of Measurement, by D. Krantz, R. Luce, P. Suppes and A. Tversky, Academic Press, v. 1-3, 1971, 1989, 1990].
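
A content-type description of this kind might look roughly as follows. The sketch is written in Python rather than C++ purely for brevity, and everything in it is an assumption made for illustration: the class name `LetterGrade`, its methods, and the grade-point values other than "4 for A". It carries the alphabet of the column, the mapping to numbers, and an ordering that is deliberately undefined for the I grade.

```python
class LetterGrade:
    """Hypothetical content-type description for a "letter grade" column."""

    # Assumed mapping: only "4 for A" is taken from the text above,
    # the remaining values are illustrative.
    POINTS = {
        "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
        "C+": 2.3, "C": 2.0, "C-": 1.7, "D+": 1.3, "D": 1.0,
        "D-": 0.7, "F": 0.0,
    }
    # "I" (incomplete) is a legal value but has no position in the order.
    ALPHABET = frozenset(POINTS) | {"I"}

    def __init__(self, symbol):
        if symbol not in self.ALPHABET:
            raise ValueError(f"not a letter grade: {symbol!r}")
        self.symbol = symbol

    def is_ordered(self):
        """True for grades that take part in the ordering (all except I)."""
        return self.symbol in self.POINTS

    def to_points(self):
        """Numeric value used for GPA, or None for the I grade."""
        return self.POINTS.get(self.symbol)

    def __lt__(self, other):
        # The relation "<" is defined only between two ordered grades.
        if not (self.is_ordered() and other.is_ordered()):
            return NotImplemented
        return self.to_points() < other.to_points()


def gpa(grades):
    """Mean grade points over the grades that carry a numeric value."""
    points = [g.to_points() for g in grades if g.is_ordered()]
    return sum(points) / len(points) if points else None


grades = [LetterGrade(s) for s in ["A", "B", "I", "C"]]
print(gpa(grades))
```

With such a description supplied by the DM consumer, a service can decide for itself whether to treat the column as ordered symbols or as numbers, without a secondary database.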
We omit a specific analysis of other flags such as CONTINUOUS and CYCLICAL, but we want to provide a few comments showing that they also should be analyzed more closely. The Microsoft draft suggests that CYCLICAL is discrete (p. 8):

CYCLICAL: A set of values that have cyclical ordering. Day of the week is a good example, since day number one follows day number seven. Attributes with a type flag of CYCLICAL are also considered to be ORDERED and DISCRETE.

However, AZIMUTH is CYCLICAL but not necessarily discrete, and its structure is stronger than simple ordering. It can be CONTINUOUS. Again, the draft does not permit setting the flags CONTINUOUS and CYCLICAL simultaneously unless the requirement for cyclical data to be discrete is relaxed. Similarly, temperature or salary require a separate flag combination, CONTINUOUS and NON-CYCLICAL. Algorithms appropriate for CYCLICAL data can be inappropriate for NON-CYCLICAL data and vice versa.
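
A small Python sketch shows why the distinction matters for an algorithm. The function names and the 360-degree period are our own assumptions for illustration. For a continuous cyclical quantity such as azimuth, the distance between two values must wrap around the period; for a non-cyclical quantity such as temperature, the plain difference is the right one.

```python
def linear_difference(a, b):
    """Distance between two values of a CONTINUOUS, NON-CYCLICAL column."""
    return abs(a - b)


def cyclical_difference(a, b, period=360.0):
    """Distance between two values of a CONTINUOUS, CYCLICAL column.

    The shorter way around the cycle is taken, so 350 and 10 degrees
    are 20 degrees apart rather than 340.
    """
    d = abs(a - b) % period
    return min(d, period - d)


print(linear_difference(350.0, 10.0))    # 340.0
print(cyclical_difference(350.0, 10.0))  # 20.0
```

A distance-based method applied to azimuth with the linear difference would place 350 and 10 degrees far apart, which is exactly the kind of mismatch the flags are meant to prevent.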

Summary of suggestions:
1. Add NOMINAL to <col_content> and consider adding some more flags (see #6 below).
2. Permit negated flags, such as NON-CYCLICAL.
3. Permit combinations of flags, such as CONTINUOUS and NON-CYCLICAL.
4. Add the flag COL_REFERENCE <reference ID> for special column content data types, such as "letter grade". This flag will be used in addition to the common terms presented directly in the grammar for <col_content> and <col_content_qual>.
5. Develop APIs that describe content data types as C++ classes (OOP approach). Each API should be an implementation for the <reference ID> in a COL_REFERENCE <reference ID> statement. For more about this OOP approach see [Kovalerchuk B., Vityaev E., Data Mining in Finance: Advances in Relational and Hybrid Methods, Kluwer, 2000, pp. 164, 169-186].
6. Make the terms used in <col_content> and <col_content_qual> consistent with terms already used in measurement theory [D. Krantz, R. Luce, P. Suppes and A. Tversky, Foundations of Measurement, Academic Press, v. 1-3, 1971, 1989, 1990].

A discussion of the Microsoft specification for Data Mining in KDnuggets can be an important input for further development of DM applications.


Boris Kovalerchuk, Ph.D.
Dept. of Computer Science, Central Washington University
Ellensburg, WA 98926-7520
E-mail: borisk@cwu.edu
http://www.cwu.edu/~borisk/finance