CRISP-DM stands for Cross-Industry Standard Process
for Data Mining. The CRISP-DM methodology is practical,
flexible and useful when solving business problems with analytics.
CRISP-DM is a data mining methodology, or process model, that provides a
blueprint for conducting a data mining project. It was conceived in 1996 and
developed by a consortium of major companies including Daimler-Benz, ISL, NCR
and OHRA. These companies drew on experience from around 200 data mining users
and tool providers before they came up with this model. It is a
non-proprietary, documented and freely available process, designed so that
everybody can use it.
How does it help?
CRISP-DM provides a roadmap, best practices and a structure for getting
better and faster results from data mining. It gives the business a process
to follow while planning and carrying out a data mining project.
Business Understanding is the first phase, where we understand the project
from a business perspective and then convert the business objective into a
data mining objective: a set of data mining tasks to which we can apply
modeling techniques.
Four major tasks to focus on in Business Understanding:
Determine business objectives – Here we work out the true goal of the
project and the important factors that we need to know about the business.
Assess the situation – Here we list the assumptions that we need to make
and carry out the cost-benefit analysis.
Determine data mining goals – Here we set the data mining objectives for
the team or the business.
Produce a project plan – Here we provide a project plan with specific
outlines, propose a timeline, and list the tools and techniques that we are
going to use.
Data Understanding is the second phase. It starts with the initial collection
of data, where we become familiar with the data and form hypotheses based on
its quality and on the data we already have. If we find any interesting data
sets, we can form an initial hypothesis from the hidden information that we
have uncovered.
Four major tasks to focus on in Data Understanding:
Collecting the data – Data collection is where we acquire the data. If we
encounter any problems while doing so, we make a note of them.
Describing the data – Here we examine the surface properties of the data and
note any problems that we had while acquiring it. We also record the formats,
the quality and quantity of the data, and the records and fields in each
table.
Exploring the data – Data exploration is where we record our first findings
or initial hypotheses and deliver them as a data exploration report.
Data Quality – This is a significant task. Here we look for missing
attributes, blank fields and misspelled values, and make a note of the overall
quality of the data. If we see any conflicts in the data, we mention those as
well.
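The describing and data quality tasks above can be sketched in a few lines of plain Python. This is only a minimal illustration, not part of the CRISP-DM standard: the sample records, the field names and the helper functions `describe`, `quality_report` and `inconsistent_spellings` are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical sample records; None and "" stand for missing values.
records = [
    {"customer_id": 1, "city": "Berlin", "age": 34},
    {"customer_id": 2, "city": "berlin", "age": None},
    {"customer_id": 3, "city": "", "age": 51},
    {"customer_id": 4, "city": "Munich", "age": 29},
]


def describe(rows):
    """Describe the data: record count and sorted field names."""
    fields = sorted({name for row in rows for name in row})
    return {"records": len(rows), "fields": fields}


def quality_report(rows):
    """Data quality: count missing or blank values per field."""
    missing = Counter()
    for row in rows:
        for name, value in row.items():
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[name] += 1
    return dict(missing)


def inconsistent_spellings(rows, field):
    """Flag values that differ only by letter case, one simple kind of conflict."""
    variants = {}
    for row in rows:
        value = row.get(field)
        if isinstance(value, str) and value.strip():
            variants.setdefault(value.strip().lower(), set()).add(value.strip())
    return {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}


summary = describe(records)
missing = quality_report(records)
conflicts = inconsistent_spellings(records, "city")
print(summary)
print(missing)
print(conflicts)
```

The three results are the kind of notes that would go into a data description and data quality report. Real spelling mistakes need fuzzier matching than the case-only comparison used here.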
Data Preparation is the third phase. By now we have acquired the data and
assessed its quality, so in data preparation we build the final data set that
will be used for modeling, which is the next phase. By definition, it is all
about gathering the data into a final data set that will be fed into the
modeling tools that we are going to use in the next phase.
Some straightforward actions that we have to take are:
Select – Decide what data we are going to use.
Clean – Here we go back to the data quality findings and fix any missing
attributes or spelling mistakes, so that we end up with correct, verified
data.
Construct – Here we develop new records or derive new attributes that we
want to create.
Integrate – Here we combine multiple records and tables, integrating and
aggregating the data.
Format – Here we remove any illegal characters that we find and trim the
values as required by the model.
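The five actions above can be sketched as a small pipeline in plain Python. This is a hedged illustration under stated assumptions: the `customers` and `orders` tables, the derived field `is_senior`, the rule that rows without an age are dropped, and every function name are invented for this example.

```python
import re

# Hypothetical raw tables; all names and values are made up for illustration.
customers = [
    {"customer_id": 1, "name": " Alice# ", "age": 34, "notes": "vip"},
    {"customer_id": 2, "name": "Bob", "age": None, "notes": ""},
    {"customer_id": 3, "name": "Carol", "age": 51, "notes": "new"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 3, "amount": 40.0},
]


def select(rows, fields):
    """Select: keep only the fields we decided to use."""
    return [{f: row[f] for f in fields} for row in rows]


def clean(rows, required):
    """Clean: drop rows where a required field is missing (one possible policy)."""
    return [row for row in rows if all(row[f] is not None for f in required)]


def construct(rows):
    """Construct: derive a new attribute from an existing one."""
    return [{**row, "is_senior": row["age"] >= 50} for row in rows]


def integrate(rows, order_rows):
    """Integrate: aggregate the orders table and merge it into each customer row."""
    totals = {}
    for order in order_rows:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [{**row, "total_spent": totals.get(row["customer_id"], 0.0)} for row in rows]


def format_rows(rows):
    """Format: strip illegal characters and surrounding whitespace from names."""
    return [
        {**row, "name": re.sub(r"[^A-Za-z ]", "", row["name"]).strip()}
        for row in rows
    ]


selected = select(customers, ["customer_id", "name", "age"])
final_dataset = format_rows(integrate(construct(clean(selected, ["age"])), orders))
for row in final_dataset:
    print(row)
```

Dropping incomplete rows is only one way to clean. Depending on the project, filling in a default or going back to the data source may be the better choice.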
Modeling is the fourth phase, where we consider various modeling techniques,
select the ones that suit the problem, apply them, and see what options we
have.
Four major tasks to focus on in Modeling:
Select the modeling technique
Design the test for the model
Build the model
Assess the model
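As a rough sketch of how these four tasks fit together, the example below picks a very simple technique (a one-feature threshold rule), holds out part of the data as the test design, builds the model on the rest, and assesses it by accuracy. The data, the churn scenario and the function names are assumptions made up for illustration. A real project would normally use a modeling library and a more careful test design.

```python
# Hypothetical data: (monthly_spend, churned) pairs, made up for illustration.
data = [
    (10, 1), (15, 1), (20, 1), (25, 1), (30, 1),
    (60, 0), (65, 0), (70, 0), (75, 0), (80, 0),
]


def split(rows, every=5):
    """Test design: hold out every `every`-th record for testing."""
    train = [r for i, r in enumerate(rows) if (i + 1) % every != 0]
    test = [r for i, r in enumerate(rows) if (i + 1) % every == 0]
    return train, test


def build_threshold_model(train_rows):
    """Build: put the threshold midway between the two class means.

    Assumes churned customers spend less than those who stay.
    """
    churned = [x for x, y in train_rows if y == 1]
    stayed = [x for x, y in train_rows if y == 0]
    threshold = (sum(churned) / len(churned) + sum(stayed) / len(stayed)) / 2
    return lambda x: 1 if x < threshold else 0


def accuracy(model, rows):
    """Assess: share of records where the prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


train, test = split(data)
model = build_threshold_model(train)
test_accuracy = accuracy(model, test)
print(len(train), len(test), test_accuracy)
```

Keeping the held-out records separate from the ones used to build the model is the point of the test design task: the assessment then reflects data that the model has not seen.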
Evaluation is the fifth phase, where we check the results against our
business objectives, produce an evaluation of those results, review the
process, and determine the next steps. Here we summarize the whole result and
judge it against the business criteria.
Deployment is the sixth and final phase, where we present the report and
decide whether to carry the project to the next level or into the business.
Some major tasks are:
Plan final report
In this article we saw the CRISP-DM process and how it works. We will
discuss CRISP-DM further in the upcoming