In my last blog, I discussed how members of my team helped me to understand how the techniques that I used in data migrations and data warehousing could contribute immensely to the world of Data Science, namely through the discipline of Features Engineering. Before you read this blog, it is well worth watching Buck Woody’s short video, an Introduction to the Cortana Intelligence Suite (CIS), to give you some context. I will see you back here in 20 minutes……
…. Ok, so in Buck’s presentation he talks about the CRISP-DM process and how it maps to the Cortana Intelligence Suite of technologies. In the Data Preparation part of the process, he explains how we can generate Features and Labels, and how these can be consumed within Machine Learning models, R or Python to perform predictive or pre-emptive analytics.
Feature selection refers to the process of selecting the input data for processing and analysis, or of finding the most meaningful inputs. A related term, feature engineering (or feature extraction), refers to the process of extracting useful information or features from existing data.
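To make the distinction concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `orders` rows, the `SNAPSHOT` date, the derived column names and the "drop anything that never varies" rule are all hypothetical, and a real project would choose features with much more care and business input.

```python
from datetime import date
from statistics import pvariance

# Hypothetical source rows, as they might come out of a sales table.
orders = [
    {"customer_id": 1, "order_date": date(2016, 1, 15), "amount": 120.0, "currency": "GBP"},
    {"customer_id": 2, "order_date": date(2016, 3, 2), "amount": 35.5, "currency": "GBP"},
    {"customer_id": 3, "order_date": date(2016, 3, 28), "amount": 260.0, "currency": "GBP"},
]

SNAPSHOT = date(2016, 4, 1)  # assumed "as of" date for the extract


def engineer_features(row):
    """Feature engineering: derive new numeric inputs from existing columns."""
    return {
        "amount": row["amount"],
        "order_month": row["order_date"].month,
        "days_since_order": (SNAPSHOT - row["order_date"]).days,
        "is_gbp": 1 if row["currency"] == "GBP" else 0,
    }


def select_features(rows):
    """Feature selection (a deliberately crude rule): keep only features that vary."""
    names = rows[0].keys()
    return [n for n in names if pvariance([r[n] for r in rows]) > 0]


features = [engineer_features(r) for r in orders]
selected = select_features(features)
print(selected)
```

In this toy data every order is in GBP, so `is_gbp` carries no information and is dropped by the selection step. If you squint, the first function is the kind of derived-column logic you might already write in a T-SQL view or an SSIS data flow.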
Hold on a minute! Don’t we do this already as SQL Server professionals? Well, we do. If you have worked on a BI or data warehouse project, or perhaps you have done a data migration, one of the most important aspects of these projects is to identify the source data that we need to migrate to the new system or to a data warehouse. During the load we are highly likely to cleanse and transform that data to make it useful for our destination.
Within Data Science, Features Engineering is widely regarded as an important aspect of the field. The Data Scientists I work with agree that it is just as important as, if not more important than, choosing the right model to use in your machine learning experiments.
It seems to be a subject that is talked about very little in the Data Science world, but it is a discipline that can utilise many of the skills that a SQL Server Professional possesses. Don’t get me wrong, there are still plenty of new skills to be learned in this process and in using the Cortana Intelligence Suite, but it is worth knowing that the skills that we currently have will be useful in this arena.
So now I’m really starting to feel like I am part of a Data Science team. It’s a bit like a team that performs cardiovascular surgery. OK, so I may not be the surgeon, or the anaesthetist, but as the scrub nurse it’s important that I pass on the right tools for the job and that those tools are clean. And my role is seen as just as important as that of the surgeon who ties the knots. In fact, Ryan Swanstrom has written a great post on Data Science and the Perfect Team.
The first task in the data preparation phase is to extract the features from the source data. At this point you may cleanse the data, transform it or indeed augment it with additional data. In the past, SQL Server Integration Services (SSIS) would have been the tool of choice to do this. You can still use SSIS, but you can also extract and manipulate data with the following technologies in the Cortana Intelligence Suite.
- Azure Data Factory – a service that orchestrates data movement and cleansing in Azure
- R – an open source language that can perform data manipulation and statistical analysis
- Azure Stream Analytics – a service that simplifies complex event processing using SQL-like statements, based on a subset of Transact-SQL, to manipulate datasets.
I will go through these tools in future blogs, and I know R isn’t strictly in the CIS, but as you become more familiar with the language you will find that it can be integrated with many of the CIS technologies that are available, to great effect.
And for completeness, Labels are the attributes that you are trying to predict in a model. So Features Engineering plays a particularly important role in ensuring that your features are primed as well as they can be, given the business understanding of the data. But remember, as Buck states in his video, you may not even do any modelling with the features that you select, and may just push the features to a Power BI Dashboard for descriptive analytics.
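As a rough illustration of how features and a label sit side by side in a prepared dataset, here is a small Python sketch. The rows, the column names and the choice of `churned` as the label are all hypothetical; the point is only that the label is the one column you set aside to predict, and everything else is a candidate feature.

```python
# Hypothetical prepared rows: "churned" is the attribute we want to predict.
rows = [
    {"tenure_months": 3, "monthly_spend": 20.0, "churned": 1},
    {"tenure_months": 26, "monthly_spend": 55.0, "churned": 0},
]

LABEL = "churned"  # assumed label column for this example


def split_features_and_label(rows, label):
    """Separate each row into its feature columns and its label value."""
    feature_rows = [{k: v for k, v in r.items() if k != label} for r in rows]
    label_values = [r[label] for r in rows]
    return feature_rows, label_values


X, y = split_features_and_label(rows, LABEL)
print(X)  # the features a model would learn from
print(y)  # the labels a model would try to predict
```

If you skip the modelling altogether, as Buck suggests you might, it is the feature rows on their own that would feed a descriptive dashboard.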
So what we have established in this blog is that, as SQL Server professionals, we already have some of the skillsets that can contribute to data science through the field of Features Engineering. It’s time to get the popcorn and look into this a bit more.