Exploring the benefits of AI and ML
Companies start building AI projects and rely heavily on those statistical functions and the math behind them to predict customer churn, recognize images, detect fraud, mine knowledge, and much more. ML projects often begin as part of a big data or Data Warehouse project, but it can also be the other way around: the start of an AI or machine learning project often leads to the development of an analytical system.
As my data scientist colleagues tell me, if you want to reliably predict an event based on incoming data, a machine learning model needs to be trained on quite a large amount of data. As a rule, the more representative data you can bring to train a model, the more accurate the model will be in the end.
A wonderful image recognition experiment, for example, can be done at www.customvision.ai. You can start by examining one of the example projects there. I like the "Chocolate or Dalmatian" example.
This is a nice experiment that did not need much input to enable the image recognizer to distinguish between stracciatella chocolate ice cream and Dalmatian dogs. When you try to teach the system with different images and circumstances, you might find that you need far more training images than six per group.
Understanding ML challenges
I have experimented with the service and uploaded images of people in an emergency versus images of people relaxing, doing yoga, or similar. I used around 50 to 60 images for each group, and the accuracy I reached (74%) was still not really satisfying.
With this experiment, I even created a model with a bias that I didn't understand myself at first. Too many "emergency" cases were being interpreted incorrectly as "All good" cases. By discussing this with my data scientist colleagues and examining the training set, we found out why.
There were too many pictures in the "All good" training set that showed people on grass or in nature, with lots of green around them. This, in turn, led the system to interpret green as a signifier of "All good," no matter how big the emergency obviously was. Someone with their leg at a strange angle and a broken arm, in a meadow? The model would interpret it as "All good."
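A quick audit of the training set can surface this kind of imbalance before a model is trained. The following is a minimal sketch, not part of the Custom Vision service: the helper names `green_fraction` and `green_share_by_label`, the 1.2 dominance factor, and the pixel values are all assumptions made for illustration. Images are represented as plain lists of (red, green, blue) tuples so that the example needs no imaging library.

```python
from statistics import mean


def green_fraction(pixels):
    """Share of pixels whose green channel clearly dominates red and blue."""
    if not pixels:
        return 0.0
    # The 1.2 factor is an arbitrary illustrative threshold, not a standard.
    green = sum(1 for r, g, b in pixels if g > r * 1.2 and g > b * 1.2)
    return green / len(pixels)


def green_share_by_label(dataset):
    """Average green fraction per label for an iterable of (label, pixels)."""
    by_label = {}
    for label, pixels in dataset:
        by_label.setdefault(label, []).append(green_fraction(pixels))
    return {label: mean(values) for label, values in by_label.items()}


# Tiny synthetic stand-in for a real training set (made-up pixel values).
MEADOW = [(40, 180, 50)] * 8 + [(200, 170, 150)] * 2   # mostly grass
INDOOR = [(120, 110, 100)] * 9 + [(40, 180, 50)] * 1   # mostly grey and brown

dataset = [
    ("All good", MEADOW),
    ("All good", MEADOW),
    ("Emergency", INDOOR),
    ("Emergency", INDOOR),
]

shares = green_share_by_label(dataset)
for label, share in shares.items():
    print(f"{label}: {share:.0%} green-dominant pixels on average")
```

A large gap between the groups, as in this synthetic data, suggests that the model may learn the background instead of the situation, and that the training set needs more varied images in both groups.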
In this sandbox environment, I did no harm at all. But imagine a case where a system is used to help detect emergency situations, and hopefully kickstart an alarm those vital seconds earlier. This is only a very basic example of how the right tool in the wrong hands might cause a lot of damage.
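A single accuracy figure can hide exactly this kind of failure, so it helps to look at each class separately. The sketch below uses invented labels and predictions (they are not the results of my experiment) and hypothetical helper names to show how overall accuracy and per-class recall can diverge.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(counts):
    """Share of all examples that were classified correctly."""
    total = sum(counts.values())
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / total if total else 0.0


def recall(counts, label):
    """Share of the examples of one class that the model actually found."""
    total = sum(n for (a, _), n in counts.items() if a == label)
    return counts[(label, label)] / total if total else 0.0


# Invented evaluation data: 10 emergencies and 10 harmless scenes.
actual = ["Emergency"] * 10 + ["All good"] * 10
predicted = (
    ["Emergency"] * 6 + ["All good"] * 4    # 4 emergencies are missed
    + ["All good"] * 9 + ["Emergency"] * 1  # 1 false alarm
)

counts = confusion_counts(actual, predicted)
print(f"accuracy:         {accuracy(counts):.2f}")
print(f"emergency recall: {recall(counts, 'Emergency'):.2f}")
print(f"all-good recall:  {recall(counts, 'All good'):.2f}")
```

In this made-up data, the model is right in 15 of 20 cases overall but misses 4 of 10 emergencies. For a system meant to raise an alarm, the recall of the "Emergency" class is the number to watch.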
There are many different cases where machine learning can help make processes more accurate, speed them up, or save a company money through the right predictions, such as predicting when and why a machine will fail before it actually fails, helping to mine information from massive data, and more.
Sorting ML into the Modern Data Warehouse
How does this relate to the Modern Data Warehouse? As we stated previously, the Modern Data Warehouse does not only offer scalable, fast, and secure storage components. It also offers at least one (and, in the context of this book, six) compute component(s) that can interact with the storage services and can be used to create and run machine learning models at scale. The "run" can be implemented in batch mode, in near real time, or even in real time, depending on the streaming engine used. The Modern Data Warehouse can then store the results of ML calculations in a suitable presentation layer to provide this data to the downstream consumers, who will process the data further, visualize it, draw their insights from it, and take action. The system can close the loop by using the enterprise service bus and the available integration services to feed insights back as parameters for the surrounding systems.
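The difference between the run modes can be sketched in a few lines. This is an illustration only: `score` is a hypothetical stand-in for a trained model, the field names and the 90 degree threshold are invented, and a plain Python list plays the role of the presentation layer. In a real system, the compute and streaming services would take over these roles.

```python
from typing import Callable, Iterable, Iterator


def score(record: dict) -> float:
    """Hypothetical stand-in for a trained model: flags overheating machines."""
    return 1.0 if record["temperature_c"] > 90 else 0.0


def run_batch(records: list, model: Callable[[dict], float]) -> list:
    """Batch mode: score a stored set of records in one pass."""
    return [{**r, "failure_risk": model(r)} for r in records]


def run_stream(records: Iterable, model: Callable[[dict], float]) -> Iterator:
    """Streaming mode: score each record as soon as it arrives."""
    for r in records:
        yield {**r, "failure_risk": model(r)}


# Invented sensor readings.
readings = [
    {"machine": "press-1", "temperature_c": 72},
    {"machine": "press-2", "temperature_c": 95},
]

# Batch: results are written to the presentation layer in one go.
presentation_layer = run_batch(readings, score)

# Streaming: each result is available immediately, one record at a time.
streamed = []
for result in run_stream(iter(readings), score):
    streamed.append(result)

print(presentation_layer)
```

The same model function serves both modes. What changes is when the data is scored and how quickly the result reaches the downstream consumers.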
Understanding responsible ML/AI
A responsible data scientist will need tools that support them in conducting this work properly. The buzzword of the moment in that area is machine learning operations (MLOps). One of the most important steps in creating responsible AI is having complete traceability/auditability of the source data and of the versions of the datasets used to train and retrain a certain model at a certain timestamp. With this information, the results of the model can also be audited and interpreted. This information is vital when it comes to audits and traceability regarding legal questions, for instance. The collaborative aspects of an MLOps-driven environment are another important factor.
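To make the idea of dataset traceability concrete, here is a minimal sketch using only the Python standard library. It is not how any particular MLOps product works: the function names, the record fields, and the sample file names are assumptions for illustration. The idea is to fingerprint the exact training data and store that fingerprint together with the model version and a timestamp.

```python
import hashlib
import json
from datetime import datetime, timezone


def dataset_fingerprint(files):
    """SHA-256 over file names and contents, independent of input order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and content hash
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


def training_record(model_name, model_version, files, trained_at=None):
    """Build an audit record that ties a model version to its training data."""
    trained_at = trained_at or datetime.now(timezone.utc)
    return {
        "model": model_name,
        "model_version": model_version,
        "dataset_sha256": dataset_fingerprint(files),
        "file_count": len(files),
        "trained_at": trained_at.isoformat(),
    }


# Invented example: file name -> raw bytes of each training image.
training_files = {
    "emergency/001.jpg": b"fake-image-bytes-1",
    "all_good/001.jpg": b"fake-image-bytes-2",
}

record = training_record("emergency-detector", "1.0.0", training_files)
print(json.dumps(record, indent=2))
```

If a single image is added, removed, or changed, the fingerprint changes, so an auditor can tell which version of the dataset produced which version of the model. A full MLOps environment adds much more on top, such as storing the data versions themselves and tracking who changed what.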
Note
We can find a definition of Responsible AI, following the principles of Fairness, Reliability and Safety, Privacy and Security, Inclusiveness, and Transparency and Accountability, at https://www.microsoft.com/en-us/ai/responsible-ai.
An MLOps environment embedded in the Modern Data Warehouse is another piece of the puzzle and helps integrate those principles into the analytical estate of any company. With the tight interconnectivity between data services, storage, compute components, streaming services, IoT technology, ML and AI, and visualization tools, the world of analytics today offers a wide range of possibilities at a far lower cost than ever before. The ease of use of these services, and the productivity that comes with it, is constantly growing.