Here, I dive deeper into the technical elements of a successful implementation. Some of these topics may be considered beyond the scope of a C-level executive. However, for CIOs who have to bridge the gap between business and technology, it is advantageous to not only approach problems from a strategic point of view, but also maintain a programmatic perspective on model requirements and efficacy. So, what are the four most important things you should know about data preparation, programming languages, and ML tools?
Garbage in, garbage out
All data sets are flawed, and poor-quality data produces poor-quality outputs – a concept known in computer science as garbage in, garbage out (GIGO). That’s why data preparation is such a critical part of the machine learning process. However, preparing data is typically one of the most time-consuming parts of data analytics – particularly if the organization doesn’t have a data-driven culture with a focus on data governance (I discussed the importance of this in my first article).
This will become even more important as we face an exponential increase in the complexity of relationships amongst corpora of training data, driven by the growing number of interconnected devices (the Internet of Things). By 2025, IDC projects that connected IoT devices alone will create 79.4ZB of data. And instead of structured data in fixed locations, IoT devices generate huge amounts of unstructured and semi-structured data. The impact? A need for forensic data preparation and cleansing.
3 steps to data preparation
There are a number of steps that data science teams must take to successfully prepare data for machine learning:
1. Data cleansing and transformation
Your business likely stores data in multiple places such as IT or ERP systems, CRM tools, and custom applications. Data cleansing involves removing meaningless data (like dummy data, duplicates, or contradictory values) from the data sets, while data transformation converts your clean data from one format into another for easier processing. For more technologically advanced organizations, data cleansing can be automated using tools like Azure Machine Learning or Amazon ML. Data cleansing and transformation should both take place before loading into your enterprise data warehouse (a central location where you collect, store, and manage data from all your business sources).
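To make this concrete, here is a minimal cleansing-and-transformation sketch in Python using pandas. The file name and column names are hypothetical stand-ins; a real pipeline would be driven by your own data governance rules.

```python
import pandas as pd

# Load a raw export (hypothetical file and column names, for illustration only)
raw = pd.read_csv("customer_records.csv")

# Cleansing: drop exact duplicates and rows missing critical fields
clean = raw.drop_duplicates()
clean = clean.dropna(subset=["customer_id", "order_value"])

# Remove obviously meaningless values (e.g. negative order values used as dummies)
clean = clean[clean["order_value"] >= 0]

# Transformation: normalize formats before loading into the warehouse
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
clean["country"] = clean["country"].str.strip().str.upper()

# Export the cleaned, transformed data for loading into the enterprise data warehouse
clean.to_csv("customers_clean.csv", index=False)
```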
2. Dimensionality reduction and PCA
As well as collecting clean data, it is important to collect the right volume of data. When there isn’t enough data, it is hard to perform accurate analysis. But when there are too many input variables relative to the signal they carry, they can end up decreasing a model’s accuracy (as well as adding computational cost). It is important to find a balance between the type, quality, and volume of data to mitigate these risks. Dimensionality reduction reduces the number of input variables for a predictive model, which can improve prediction performance. The most popular method for dimensionality reduction in ML is Principal Component Analysis, or PCA for short.
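As a rough illustration of what PCA looks like in practice, the sketch below uses scikit-learn on synthetic data; the data set size and the 95% variance threshold are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 1,000 rows whose 50 columns are noisy mixes of 5 latent factors,
# standing in for correlated business variables
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 50))

# PCA is sensitive to scale, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)  # e.g. (1000, 50) -> (1000, 5)
print("variance explained:", round(pca.explained_variance_ratio_.sum(), 3))
```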
Feature selection and feature engineering are closely related steps: feature selection keeps a subset of the original data features, while feature engineering extracts new variables from raw data. A detailed discussion goes beyond the scope of this article, but it is something all tech leaders should consider as part of their data processing – reach out to me for more information on this.
3. Testing vs. training data
Following data preparation, you need to split your clean data between training and testing sets. The training data set is used to build your model (more on this in my next article), while the test data set is used to validate it. Typically, a 20:80 split between testing and training data is a good starting point – the bigger the training set, the more accurate the model is likely to be. However, the exact split is up to you, and will change depending on your business goals, the data set being used, and the desired model outputs.
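In practice, scikit-learn’s train_test_split handles this split in a couple of lines; the sketch below uses synthetic placeholder data where your own prepared data set would go.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 rows, 10 features, binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# Hold back 20% of rows for testing and keep 80% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), "training rows,", len(X_test), "test rows")
```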
Sometimes, data sets are so massive that you are unable to train or test on the entirety of the data. In this instance, you can employ special techniques such as sampling. Sampling (with or without replacement) gives you a representative subset of the data for training, testing, or both, without slowing the process down with millions of rows of data.
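As a simple illustration, pandas’ sample method can draw such a subset directly; the DataFrame and the 1% sampling fraction below are hypothetical stand-ins for your own data and requirements.

```python
import numpy as np
import pandas as pd

# Hypothetical large table standing in for "millions of rows"
rng = np.random.default_rng(1)
big = pd.DataFrame({"order_value": rng.normal(loc=100, scale=20, size=1_000_000)})

# Sampling without replacement: each row appears at most once in the subset
subset = big.sample(frac=0.01, replace=False, random_state=42)

# Sampling with replacement (useful for bootstrapping): rows may repeat
bootstrap = big.sample(frac=0.01, replace=True, random_state=42)

print(len(subset), "rows without replacement,", len(bootstrap), "with replacement")
```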
Choosing your programming language
When building your team of data scientists, you will want people with proficiency in a range of different programming languages. There are many to choose from, and your decision will vary depending on your individual project and business requirements. For example, Python is better for handling massive volumes of data and building deep learning models, while R has the advantage when creating graphics and data visualizations. There’s a lot to be said for hiring ‘polyglot’ developers, particularly if you create solutions for a range of clients.
Selecting your tools
Machine learning is a complex discipline but implementing ML models is far less daunting than it used to be thanks to the variety of frameworks and tools on the market. I’ve listed just a small selection of the most popular of these below:
- Scikit-learn: Scikit-learn is an open-source machine learning framework for the Python programming language which has a library of many supervised and unsupervised learning algorithms. In practice, it is used for a broader range of traditional models than TensorFlow (a short usage sketch follows after this list).
- TensorFlow: TensorFlow is an open-source machine learning framework from Google. It is particularly useful for real-world applications of deep learning; for example, it is used by DeepFace (Facebook’s facial recognition system) and Apple’s Siri for voice recognition.
- Kubeflow: Kubeflow is a specialized open-source machine learning platform, built on Kubernetes, that helps scale machine learning models and deploy them to production. It is a great tool for data scientists who want to build and experiment with ML pipelines.
- GUI systems: Graphical user interface (GUI) systems allow your team to take actions without knowing the coding behind the action; KNIME, Weka, and RapidMiner are three examples of these. GUIs are useful for democratizing machine learning skills across your team.
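To give a flavour of how compact a scikit-learn workflow can be, here is a minimal supervised learning sketch using one of the library’s bundled toy data sets; the choice of data set and model is purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a bundled toy data set and split it into training and test sets, as described earlier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a supervised model and validate it on the held-back test set
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```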
The elements I’ve outlined in this article are fundamental to your ML roadmap. In my final article in this series, I will discuss the remaining pieces of the puzzle – model selection and evaluation, and the increasing necessity of MLOps (machine learning ops) to scale your program across the organization.