Step by Step Guide
To create your first Statistics, Decision Rules, and
Decision Trees, follow these steps:
Loading and setting your data
Load the Learning database
The first step in the data mining process is selecting the target database. It
can be any database containing information you would like to use for data
mining. This database is called the Learning database and can be loaded from
the "Database" page (using the button there) or from the main menu. Loading a
database takes a few steps, guided by a wizard.
Check the ID field
After you load the Learning database, the list of tables and the details of
their fields are displayed on the "Database" page. If your database contains
only one table, you don't need an ID field.
Analysing two or more tables together. ID field
If you've selected a database with several tables, you might want to
analyse dependencies between these tables. In that case you need a field
that links the records in these tables. Such a field should contain only
unique values (each value should occur only once), can be of any type, and
should appear in every table you want to analyse. This field is called the ID
field. The program sets it automatically, if it detects one, after you load a
database. To change the ID field, select the necessary table from the "Main
Table" menu and the field from the "ID Field" menu.
If your tables do not have ID fields, you can analyse each table separately
from the others.
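The ID field requirements above (present in every table, no repeated values) can be checked on your own data before loading it. The following is a minimal sketch in Python, not part of EDM: the function name `is_valid_id_field` and the table layout (a dictionary of table name to list of row dictionaries) are assumptions made for illustration.

```python
def is_valid_id_field(tables, field):
    """Return True if `field` appears in every row of every table
    and holds no repeated value within a table.

    `tables` is a hypothetical structure: {table_name: [row_dict, ...]}.
    """
    for rows in tables.values():
        values = []
        for row in rows:
            if field not in row:
                return False  # the field is missing from this table
            values.append(row[field])
        if len(set(values)) != len(values):
            return False  # a value occurs more than once
    return True


tables = {
    "clients": [{"client_id": 1, "name": "Ann"}, {"client_id": 2, "name": "Bob"}],
    "payments": [{"client_id": 1, "late": True}, {"client_id": 2, "late": False}],
}
print(is_valid_id_field(tables, "client_id"))  # present and unique everywhere
print(is_valid_id_field(tables, "name"))       # missing from "payments"
```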
Understanding comments to Table fields
The "Field Type", "% of Unique values" and "Comments" columns in the list of
table fields are helpful for further analysis.
"Field Type" contains the type of the field in the database. You need to know
the field type if you have problems setting up your first query. For example,
suppose you are using an Excel file with several columns, including a "Sales"
column that should contain numeric values. After selecting this column as the
"class/examined field", you find that the controls for creating intervals are
not displayed. Instead, you see the list of values found in the column, and
one of them is not the number zero ("0") but the text "None". With such
uncleaned data, the program detects that the "Sales" column contains Text
data rather than numeric data. For good results, clean the data: replace
"None" with zero ("0").
"% of Unique Values" contains the approximate percentage of unique values
found in the column. A field with too many unique values will, in most cases,
not give good results for analysis. Such fields are marked in grey in the
controls on the "Initial Query" page. They are not recommended for analysis,
but keep in mind that in your case such a field might still reveal useful
information.
"Comments" depends on "% of Unique values" and describes which roles the
selected field can perform, for example, whether it can be the ID field or
whether it is not recommended for use in analysis.
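To see why a single "None" entry turns a numeric column into a Text column, and how the percentage of unique values is obtained, the following Python sketch profiles a column outside the program. It is an illustration only: `profile_column` and `clean_placeholders` are hypothetical helpers, and the exact way EDM detects types or approximates the percentage is not documented here.

```python
def profile_column(values):
    """Return (field_type, percent_unique) for a list of raw cell values."""
    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    # One non-numeric cell is enough to make the whole column Text.
    field_type = "Numeric" if all(is_number(v) for v in values) else "Text"
    percent_unique = 100.0 * len(set(values)) / len(values) if values else 0.0
    return field_type, percent_unique


def clean_placeholders(values, placeholders=("None",)):
    """Replace placeholder text such as "None" with "0"."""
    return ["0" if v in placeholders else v for v in values]


sales = ["120", "None", "75.5", "0"]
print(profile_column(sales))                      # ('Text', 100.0)
print(profile_column(clean_placeholders(sales)))  # ('Numeric', 75.0)
```

After cleaning, the column is recognised as numeric, and the share of unique values drops because "None" and "0" now count as the same value.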
Your first query
There are three types of queries in EDM: the initial Statistics query, the
Rules query, and the Decision Tree query.
To perform a Rules query or a Decision Tree query, you first have to perform
the initial Statistics query.
To create rules or trees you can use the Query Wizard, in which case the
Statistics query is performed automatically, or you can go to the
"Initial Query" page and set everything manually. The difference between these
two methods is that the "Query Wizard" performs the Statistics query every
time, so the total data mining time is longer than when you use the controls
on the "Initial Query" page.
This guide explains how to work with the "Initial Query" page.
Select the examined "class" field
To select the examined field, scroll through the "Manage Fields" list and
double click a field, or drag and drop it to "Examined Field/Class".
Class values are all the unique values found in the examined field. For
example, if a field contains only two values, "True" and "False", there are
two classes in this field.

The best way to get valuable results from data mining is to analyse the field
that contains the key information in a record. It can be of any type, but it
must not be the ID field. Text fields with a high percentage of unique values
will probably not give good results either. For example, a "Customer name"
field containing the names of a company's customers will probably contain many
unique values, one per customer. If you set "Customer name" as the class
field, you will receive rules that are equivalent to records, each describing
a single customer. This doesn't apply to numeric fields, because for them
classes are created manually, and you can create as many intervals as you
need.
Creating numerical classes
Creating a numerical class means creating the intervals you want to analyse.
By default the program creates 2 equal intervals. You can use them or create
your own. To change the number of ranges, change the "Intervals" value and
click the "check" button.
By default the first interval is selected and available for adjustment. Its
minimum and maximum values are shown in the "From" and "To" controls. Enter
the desired values into these fields and click the "check" button to adjust
the interval.
The intervals you create should not intersect. For example, intervals "1..10"
and "11..20" do not intersect, while "1..10" and "6..20" do. If you create
intersecting intervals, you will be asked to change the entered value. Add
intervals to the list or delete them using the corresponding buttons.
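The non-intersection rule can be expressed as a short check. The sketch below is written in Python for illustration and assumes that both ends of an interval are inclusive, which matches the "1..10" and "11..20" example but is otherwise an assumption about how the program treats boundaries. The function names are hypothetical.

```python
def intervals_intersect(a, b):
    """Two closed intervals (low, high) intersect when each one
    starts no later than the other one ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_intersections(intervals):
    """Return every pair of intervals in the list that intersect."""
    clashes = []
    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            if intervals_intersect(intervals[i], intervals[j]):
                clashes.append((intervals[i], intervals[j]))
    return clashes


print(intervals_intersect((1, 10), (11, 20)))  # False
print(intervals_intersect((1, 10), (6, 20)))   # True
```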
To decide which intervals to analyse, use the "Class Field Values" chart,
which shows how many values fall into equal intervals.
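The kind of count such a chart displays can be reproduced with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and the bins are treated as half-open (a value on an inner boundary goes to the upper bin, the maximum goes to the last bin), which may differ from how the program draws its chart.

```python
def equal_intervals(low, high, n):
    """Split [low, high] into n equal-width intervals."""
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]


def count_per_interval(values, low, high, n):
    """Count how many values fall into each of n equal-width intervals."""
    width = (high - low) / n
    counts = [0] * n
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), n - 1)  # maximum goes to the last bin
            counts[index] += 1
    return counts


sales = [5, 12, 18, 40, 95, 100]
print(equal_intervals(0, 100, 2))            # [(0.0, 50.0), (50.0, 100.0)]
print(count_per_interval(sales, 0, 100, 2))  # [4, 2]
```

A lopsided count like this one suggests that the default equal intervals may not describe the data well and that custom boundaries are worth trying.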
If you choose a numeric field as the "class" field, all intervals you
create are selected for analysis by default. Otherwise you are
shown a list of all unique values found in the target field. Mark the values
you would like to analyse.
Select fields to be analysed
After selecting the class field, you need to select at least one
other field from the "Manage Fields" list. To add a field,
double click it in the "Manage Fields" list, or drag and drop it to the "Fields
List". You can also use the popup menu or the buttons below the list to manage
the fields list.
Fields marked in grey are the ones with a high percentage of unique values and
are not recommended for use. Nevertheless, such fields might give you
important results, so it is worth repeating your analysis several times with
different fields before stopping your work.
You can click the "Recommended" button to select all recommended fields
automatically.
Start your first request and view its results
Click the "Process Statistics Query" button on the
"Initial Query" page. A notification of a
successful query will be displayed.
Now you can view statistics on the
"Statistics" page and create rules and decision trees.
Viewing statistics
After obtaining statistics you can view the following data:
- Classes Statistics - all values found in the class field,
with their number and percentage.
- Fields Statistics - all values found in the fields selected for analysis
(except the class field and ID fields) and the number of cases for
each value.
- Exceptions - unexpected groups of data found in fields.
- Field Values and Class - dependencies between values in the
class field and another selected field.
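Two of the views listed above, Classes Statistics and Field Values and Class, are essentially frequency counts. The following Python sketch shows one way to compute comparable numbers on a small sample; the record layout and function names are assumptions for illustration and do not describe the program's internals.

```python
from collections import Counter


def class_statistics(records, class_field):
    """Return {class_value: (count, percent)} for the class field."""
    counts = Counter(r[class_field] for r in records)
    total = len(records)
    return {value: (n, 100.0 * n / total) for value, n in counts.items()}


def field_values_by_class(records, field, class_field):
    """Count how often each (field value, class value) pair occurs."""
    return Counter((r[field], r[class_field]) for r in records)


records = [
    {"region": "North", "bad_debt": "True"},
    {"region": "North", "bad_debt": "False"},
    {"region": "South", "bad_debt": "False"},
    {"region": "South", "bad_debt": "False"},
]
print(class_statistics(records, "bad_debt"))
print(field_values_by_class(records, "region", "bad_debt"))
```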
Create Profiles (Rules) and view results
Before creating rules, make sure you've created Statistics. To create
decision rules, check the current rules query settings
(on the "Initial Query"
or "Query Parameters" page). It is better to start with higher values and then
repeat rules creation with lower settings. The first time you might get no
rules at all, or only a small number of rules, but if you are working with a
large database with many thousands of records, this greatly improves
performance and helps you avoid "overfitted" rules. For example, suppose you
want to analyse a database with 40 000 records, and as the class field you've
selected a field that contains the values True/False. If you set the minimum
number of cases for a rule to 5, you would probably get thousands of rules
that describe small data patterns. If it is hard to decide what value to set
for "Minimum Number of Cases for a Rule", check "Classes Statistics",
take the smallest value in the "Met In" column, and use it for
"Minimum Number of Cases for a Rule". The "Minimum rule probability" setting
also has a direct influence on the number of rules and, as a result, on the
time needed to create and output them. It is also recommended to start with
higher values for this setting, for example 50%-90%.
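The effect of the two thresholds can be illustrated with a small Python sketch. It assumes a hypothetical representation of a rule as a dictionary with `cases` and `probability` keys; EDM's own rule format is not shown in this guide. The first helper picks the starting value suggested above (the smallest class count), and the second keeps only the rules that pass both thresholds.

```python
from collections import Counter


def smallest_class_size(records, class_field):
    """Smallest count among class values, i.e. the lowest "Met In" number."""
    return min(Counter(r[class_field] for r in records).values())


def filter_rules(rules, min_cases, min_probability):
    """Keep rules that cover enough cases and are probable enough."""
    return [
        rule for rule in rules
        if rule["cases"] >= min_cases and rule["probability"] >= min_probability
    ]


records = [{"bad_debt": "True"}] * 3 + [{"bad_debt": "False"}] * 37
rules = [
    {"class": "True", "cases": 2, "probability": 0.95},   # too few cases
    {"class": "True", "cases": 3, "probability": 0.90},   # passes both thresholds
    {"class": "False", "cases": 30, "probability": 0.40}, # probability too low
]
min_cases = smallest_class_size(records, "bad_debt")
print(min_cases)
print(filter_rules(rules, min_cases, 0.5))
```

Raising either threshold shrinks the list, which is why starting high and lowering the values step by step keeps the first runs fast.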
Switch to the "Query Parameters" tab. Select the classes and fields you would
like to use for rules creation.
Start rules creation by pressing the
"Create Rules" button.
The program will automatically switch to the
"Rules" page after creating the rules.
Viewing rules
All rules are grouped by class and sorted by length. In the "Rule
Details" form you can view the details of each rule, or sort rules by
length, probability, and cases. After sorting, rules remain grouped by
class. On this page you can also view the number of rules for each class,
undefined classes, and the overall number of rules. You can also create a
report or export the rules to a .TXT file.
Create Decision Tree and view results
Before creating the decision tree, make sure you've created Statistics.
Use the default settings for your first decision tree. If the "Minimum number
of cases for a branch" value is set to "1", this setting is turned off.
If necessary, you can change this value to cut off branches that describe small
data patterns. Switch to the "Query Parameters" tab. Select the classes and
fields you would like to use for decision tree creation.
Start decision tree creation by pressing the "Create Decision Tree" button.
The program will automatically switch to the
"Decision Tree" page after creating the
tree. On this page you can view the resulting decision tree, create a report,
or export the tree to a .TXT file.
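A value of "1" for the minimum number of cases per branch behaves as "off" because every branch that exists covers at least one case. The Python sketch below illustrates this with a hypothetical tree structure (nested dictionaries with `label`, `cases` and `children` keys); it is not the program's algorithm, only a picture of what cutting off small branches means.

```python
def prune(node, min_cases):
    """Return a copy of the tree without branches covering fewer than min_cases cases."""
    kept = [
        prune(child, min_cases)
        for child in node["children"]
        if child["cases"] >= min_cases
    ]
    return {"label": node["label"], "cases": node["cases"], "children": kept}


tree = {
    "label": "root", "cases": 100, "children": [
        {"label": "region=North", "cases": 97, "children": [
            {"label": "age<30", "cases": 2, "children": []},
            {"label": "age>=30", "cases": 95, "children": []},
        ]},
        {"label": "region=South", "cases": 3, "children": []},
    ],
}
print(prune(tree, 1) == tree)  # True: a minimum of 1 removes nothing
print(prune(tree, 5))          # the branches with 2 and 3 cases are cut off
```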
Using BI Functions: WHAT-IF
Use Decision Rules and Decision Trees for WHAT-IF analysis
Once you've created rules or decision trees, you can use them not only
for reports and analysis, but also for fast
WHAT-IF analysis of a single case.
To perform WHAT-IF analysis, go to the "WHAT-IF" page. You will see the list
of fields you selected for the initial query.
For example, suppose you've created rules describing clients with a high risk
of bad debt. Now you can analyse information about a new client with just a few
clicks: enter the data you have about the new client into the corresponding
fields, make sure that the "Use Rules" option is marked in the "Analyse Case
Options" dialog (opened with a button on the
"WHAT-IF" page), and press the "Analyse" button.
If some rules match the entered
data, they are displayed in the list below, grouped by the classes they
describe. You will then know how much the new client has in common with clients
with a high risk of bad debt. You can also view charts that present the results
graphically.
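Conceptually, the WHAT-IF step compares one case against every rule and groups the matching rules by class. The following Python sketch shows that idea under assumptions: a rule is represented as a hypothetical dictionary with a `class` and a set of `conditions` (field equals value), and a rule matches only when all of its conditions hold for the case. The program's actual matching logic may differ.

```python
def matching_rules(case, rules):
    """Return {class: [rules]} for the rules whose conditions all hold for the case."""
    matches = {}
    for rule in rules:
        if all(case.get(field) == value for field, value in rule["conditions"].items()):
            matches.setdefault(rule["class"], []).append(rule)
    return matches


rules = [
    {"class": "high risk", "conditions": {"late_payments": "yes", "region": "North"}},
    {"class": "high risk", "conditions": {"late_payments": "yes"}},
    {"class": "low risk", "conditions": {"late_payments": "no"}},
]
new_client = {"late_payments": "yes", "region": "South"}
print(matching_rules(new_client, rules))
```

Here only the second rule applies, so the new client shares one pattern with the high-risk group and none with the low-risk group.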
Use Decision Rules and Decision Trees to search a database or test
obtained data
You can also apply the obtained rules and decision trees to search for target
records in a database. Continuing the example of clients with a high risk of
bad debt described above: after creating rules or decision trees, go to
the "Search a Database" page and
make sure that the "Use Rules" option is selected in the options dialog. Now
load the target database using the corresponding button. Press the "Search"
button
on the "Search a Database" page and select the rules you want to use for
the search. To receive records described by any of the selected rules, mark
the "Join Rules with OR" option. To receive only records described by
every selected rule, select the "Join Rules with AND" option. Now press the
"Process Query" button. You will see a list of matching records and their
number. If you don't receive any records, try changing the list of rules, or
mark the "Join Rules with OR" option.
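The difference between the two join options can be sketched in Python. The example assumes a simplified, hypothetical rule format (a dictionary of field and required value) and shows why AND usually returns fewer records than OR: a record has to satisfy every selected rule instead of at least one.

```python
def rule_matches(record, rule):
    """A rule matches when every field it names has the required value."""
    return all(record.get(field) == value for field, value in rule.items())


def search(records, rules, join="OR"):
    """Return records described by any rule (OR) or by every rule (AND)."""
    if join not in ("OR", "AND"):
        raise ValueError("join must be 'OR' or 'AND'")
    combine = any if join == "OR" else all
    return [r for r in records if combine(rule_matches(r, rule) for rule in rules)]


records = [
    {"name": "A", "late_payments": "yes", "region": "North"},
    {"name": "B", "late_payments": "yes", "region": "South"},
    {"name": "C", "late_payments": "no", "region": "North"},
]
rules = [{"late_payments": "yes"}, {"region": "North"}]
print([r["name"] for r in search(records, rules, "OR")])   # ['A', 'B', 'C']
print([r["name"] for r in search(records, rules, "AND")])  # ['A']
```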