In this article, we introduce two well-known approaches for managing Data Science projects: CRISP-DM (Cross-Industry Standard Process for Data Mining) and Microsoft's TDSP (Team Data Science Process).
The first version of CRISP-DM was developed in 1999 as part of a concerted effort to identify and establish industry recommendations for the data mining process. Several improvements and extensions have been proposed since then. As of 2014, CRISP-DM was the most widely used methodology for analytics, data mining, and data science projects. In October 2016, Microsoft released TDSP, a new Data Science methodology that leverages well-known frameworks and tools such as Git version control.
Both methodologies aim to provide Data Science teams with a systematic approach based on industry best practices to guide and structure Data Science projects, improve team collaboration, enhance learning, and, ultimately, ensure quality and efficiency throughout the project development and delivery of data-driven solutions.
CRISP-DM is a planning approach made up of six phases:
1. Business Understanding focuses on understanding the project objectives and requirements from a business standpoint before translating this knowledge into a Data Science problem formulation.
2. Data Understanding focuses on collecting and becoming familiar with the data; this is important for identifying data quality issues, gaining early insights into the data, and forming hypotheses.
3. Data Preparation is the process of transforming the raw data into the final dataset that will be used as input to the modeling techniques (e.g., Machine Learning algorithms).
4. Modeling is the process of applying several modeling techniques to the dataset in order to obtain a set of candidate models.
5. Evaluation: Once the models are built, they must be validated to ensure that they generalize to unseen data and that the essential business objectives are met (e.g., the final model must be fair, human-interpretable, and achieve an accuracy X% higher than the client's current solution). The champion model is the output of this phase.
6. Deployment: The champion model is put into production so that it can make predictions on new data. All data preparation steps are included so that the model treats new raw data in the same way as during model development. A minimal end-to-end sketch of phases 3-6 is shown after this list.
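To make these phases concrete, here is a minimal, self-contained sketch of Data Preparation, Modeling, Evaluation, and Deployment using scikit-learn. The synthetic dataset, the candidate models, and the output file name are illustrative assumptions rather than part of CRISP-DM itself.

```python
# A minimal sketch of CRISP-DM phases 3-6 with scikit-learn.
# The synthetic data, models, and file name are illustrative assumptions.
from joblib import dump
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data Preparation: a synthetic stand-in for the prepared dataset.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Modeling: train several candidate models, each wrapped in a pipeline so
# the preprocessing steps are reapplied identically at prediction time.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1_000),
    "random_forest": RandomForestClassifier(random_state=42),
}
results = {}
for name, model in candidates.items():
    pipeline = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipeline.fit(X_train, y_train)
    results[name] = (accuracy_score(y_test, pipeline.predict(X_test)), pipeline)

# Evaluation: select the champion model based on held-out performance.
champion_name, (champion_score, champion) = max(results.items(), key=lambda kv: kv[1][0])
print(f"Champion: {champion_name} (accuracy={champion_score:.3f})")

# Deployment: persist the full pipeline (preprocessing + model) as one artifact.
dump(champion, "champion_model.joblib")
```

Persisting the whole pipeline, rather than the model alone, is what guarantees that new raw data goes through exactly the same preparation steps in production.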
To support the effective execution of Data Science projects, Microsoft introduced the Team Data Science Process in October 2016 as an Agile, iterative Data Science methodology built on best practices from Microsoft and other companies.
The process is made up of four major components:
A definition of the Data Science lifecycle
A standardized project structure
Infrastructure and resources for Data Science projects
Tools and utilities for project execution
In this article, we will focus on the first component: the Data Science lifecycle.
TDSP provides a lifecycle that structures the development of Data Science projects, defining all of the phases that are typically followed when executing a project. Because Data Science projects are R&D in nature, using standardized templates and a well-defined set of artifacts helps minimize misunderstandings and makes it easier to communicate tasks to other team members as well as to clients.
The TDSP lifecycle is made up of five stages:
Business Understanding
Data Acquisition & Understanding
Modeling
Deployment
Customer Acceptance
1. Business Understanding: This stage entails identifying the business problem, defining the business goals, and identifying the key business variables that the analysis must predict. This stage also defines the metrics that will be used to measure the project's success. Another critical task is to survey the available data sources and understand which types of data are relevant for answering the questions underlying the project objectives. This analysis helps determine whether additional data collection or other data sources are required.
2. Data Acquisition and Understanding: Because data is the most important component of any Data Science project, the second stage is all about data. Before proceeding to the modeling stage, it is critical to assess the current state of the data, including its size and quality. At this stage, the data is explored, preprocessed, and cleaned. This is critical not only for helping data scientists develop an initial understanding of the data, but also for avoiding error propagation downstream and increasing the likelihood of producing a trustworthy and accurate model.
This stage also seeks to uncover patterns in the data that can guide the selection of the most suitable modeling techniques. By the end of this stage, the data scientists should have a clearer notion of whether the existing data is sufficient, whether new data sources are required to supplement the initial dataset, and whether the data is suitable for answering the questions underlying the project goals. A minimal data-audit sketch is shown below.
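As a minimal illustration of this kind of data audit, the sketch below uses pandas to inspect size, schema, missing values, and duplicates. The file name and its columns are hypothetical placeholders.

```python
# A minimal data-quality audit with pandas.
# "customers.csv" and its contents are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("customers.csv")

# Size and schema: how much data do we have, and of what types?
print(df.shape)
print(df.dtypes)

# Quality: share of missing values per column and number of duplicate rows.
print(df.isna().mean().sort_values(ascending=False))
print(df.duplicated().sum())

# Early insights: summary statistics and correlations help form hypotheses
# and spot suspicious distributions before modeling.
print(df.describe(include="all"))
print(df.select_dtypes("number").corr())
```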
3. Modeling: At this stage, feature engineering is performed on the cleaned dataset to create a new, enriched dataset that will aid model training. Feature engineering is typically driven by the insights gathered during data exploration as well as the data scientist's domain expertise (see the sketch below). After ensuring that the dataset contains (mostly) useful features, various models are trained and evaluated before the best one is chosen for deployment.
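As a small, hedged illustration, the sketch below derives two new features from hypothetical raw columns; both the columns and the derived features are assumptions standing in for whatever the exploration stage actually suggests.

```python
# A minimal feature-engineering sketch; the columns and derived features
# are hypothetical examples of encoding domain knowledge into the dataset.
import pandas as pd

df = pd.DataFrame({
    "total_spend": [120.0, 300.0, 75.0],
    "n_orders": [3, 10, 1],
    "signup_year": [2019, 2021, 2022],
})

# Derived features often carry more signal than the raw columns they combine.
df["avg_order_value"] = df["total_spend"] / df["n_orders"]
df["customer_tenure"] = df["signup_year"].max() - df["signup_year"]
print(df)
```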
4. Deployment: At this stage, the data pipeline and the winning model are deployed to a production or production-like environment. Whether the model will serve predictions in real time or in batches must also be decided at this point; the sketch below illustrates the batch option.
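This is a minimal batch-scoring sketch, assuming a previously persisted pipeline (here, the hypothetical "champion_model.joblib" artifact from the earlier example) and a hypothetical file of new raw data.

```python
# A minimal batch-scoring sketch. The artifact and file names are
# hypothetical; the pipeline bundles preprocessing with the model.
import pandas as pd
from joblib import load

pipeline = load("champion_model.joblib")

new_data = pd.read_csv("new_customers.csv")          # new raw data to score
new_data["prediction"] = pipeline.predict(new_data)  # preprocessing is reapplied inside
new_data.to_csv("scored_customers.csv", index=False)
```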
5. Customer Acceptance: The final stage of TDSP, which has no CRISP-DM equivalent, is customer acceptance. It entails two critical tasks: (1) system validation and (2) project hand-off. System validation checks that the deployed model satisfies the customer's goals and expectations, whereas project hand-off means handing the project over to the person who will run the system in production, along with all project reports and documentation.
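System validation can often be expressed as an automated acceptance test. Below is a minimal sketch, assuming a hypothetical accuracy threshold agreed with the customer, a hypothetical held-out acceptance dataset with known labels, and the pipeline artifact from the deployment stage.

```python
# A minimal system-validation sketch: check the deployed model against the
# acceptance criteria agreed with the customer. The threshold, file names,
# and target column are hypothetical.
import pandas as pd
from joblib import load
from sklearn.metrics import accuracy_score

ACCEPTANCE_THRESHOLD = 0.85  # hypothetical bar agreed during Business Understanding

pipeline = load("champion_model.joblib")
acceptance_set = pd.read_csv("acceptance_set.csv")
y_true = acceptance_set.pop("label")  # known labels for the acceptance data

accuracy = accuracy_score(y_true, pipeline.predict(acceptance_set))
assert accuracy >= ACCEPTANCE_THRESHOLD, f"Below acceptance bar: {accuracy:.3f}"
print(f"System validated: accuracy={accuracy:.3f}")
```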