Machine learning: Tackling the ‘big’ in Big Data
Tuesday, October 25, 2016
Big Data is becoming too big to manage manually. The amount of data coming from sensors, streams and social media is astronomical—but that’s only part of the problem. Out of all the data that is being collected, only a small amount of it is actually essential, making it an impossible task to find the needle (value) in the haystack (data).
“Data collection is easy,” said Sri Ambati, CEO of H2O.ai, a machine learning solution provider. “But it is not just about collecting data for your customer anymore; it is knowing what they want that makes a big difference.”
In order to sift out the value from all the data, organizations are turning to machine learning technologies to learn from their data, make sense of their data, and make better business decisions based on the data. “Machine learning is the crucial link between business use, between applications at the business level, and between ROI to the actual collection of data,” said Ambati.
Big Data has become the norm in today’s enterprise, and machine learning is now becoming imperative to that norm, according to Steven Noels, cofounder and CTO of NGDATA, a Big Data analytics and management provider. Businesses need to continuously pull insights out of their massive amounts of data in order to improve customer experience, streamline business processes, optimize solutions, and understand the business in real time.
“There is so much data available today that machine learning has become mandatory for any company that wants to take advantage of that data,” said Noels. “The human brain cannot process all that is needed to truly gain insight from Big Data today and into the future.”
However, Forrester’s principal analyst Mike Gualtieri doesn’t believe it is exactly mandatory. “Humans do it all the time; it is often referred to [as] business logic or decision logic,” he said. The problem with humans, though, is that they naturally have cognitive biases, so when they manually try to derive value from their data and have insufficient information, they often try to fill in the blanks, make assumptions and take shortcuts, which leads to improper decision models and improper predictive models. The point of machine learning in Big Data is being able to give the machine as much data as possible so it figures out the value on its own and predicts something, he explained.
“Machine learning isn’t capable of thinking completely as a human, which is a good thing from the perspective of the biases that might be added to the thought processes,” said Noels. “It doesn’t do any decisioning based on gut feeling. It is more precise and it moves through data faster, allowing data scientists to then review decisions that have been made based on all of the data vs. static sets of data.”
No matter how they derive those insights, one thing is clear: Companies not only want more, but they expect more from their data, according to Lance Olson, director for Cortana intelligence at Microsoft. “They are collecting and storing more data than ever before, and they want more insights from their data,” he said.
BI versus machine learning
Traditionally, organizations have turned to business intelligence (BI) to gain meaningful insights from their data. Today, with machine learning, organizations can gain those insights in real time.
Business intelligence enables a business to present its information in an aesthetically pleasing way so it can see what has happened in the past. Machine learning takes that data, brings insights to the surface, and puts them into action, according to NGDATA’s Noels.
“Business intelligence is looking after the fact. It is all about the historical data,” said Manish Sainani, principal product manager for machine learning at Splunk, an operational intelligence platform provider. “You use business intelligence to go and report on things and how you can do better in the future. Machine learning is looking at the data and helping you predict what is going to happen. You are able to detect something before it is going to happen.”
With machine learning, you can build systems that take action on behalf of the organization and take humans out of the equation, according to Joshua Lewis, vice president of products for Alpine Data, an advanced analytics company.
However, that doesn’t mean machine learning will eventually replace business intelligence. They are complementary, according to Lewis. For example, he said, you would still use business intelligence when you have to make a major decision such as acquiring a company or launching a new business unit. Business intelligence is very powerful in situations that “have a strong component of the real world because you need to understand the state of the world as a person, as a human who is grounded in the context of what the business does,” he said.
However, if a business wants to make one type of decision a zillion times (and take humans out of the loop), it should turn to machine learning. “Think of it like self-driving cars,” said Lewis. “A driver makes a zillion decisions all the time, and you can take over some of that authority.”
Over time machine learning will predominate over business intelligence, but businesses will still have to go back and report on what happened. Even with machine learning, we are still going to need humans to monitor these systems and understand how they are performing, according to Splunk’s Sainani. “Both are going to coexist. It is just that machine learning is going to play a bigger and bigger role as enterprises invest in their particular technology space,” he said.
How companies use machine learning
Along with the massive amounts of data, there are many ways machine learning can be applied to a business to derive results. According to NGDATA’s Noels, it all depends on the use cases you are designing for.
“Every industry has its own unique set of requirements that will dictate what methods are most successful for utilizing machine learning and Big Data,” he said. “It can turn interactions with customers into more relevant interactions. It can allow companies to better engage their customers through targeted marketing campaigns, inbound marketing, and even reduce churn.”
Splunk’s Sainani sees three major types of machine learning algorithms that businesses are using: clustering, classification and regression.
Clustering is where you take data and sort them into groups. “It is a type of machine learning known as unsupervised learning where you are just taking these algorithms, looking at the data, and classifying them into different groups,” said Sainani.
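As a rough illustration of the grouping Sainani describes, here is a minimal sketch of k-means, one common clustering algorithm, written in plain Python. The spending figures, the starting centers and the function name `kmeans_1d` are all made up for this example; the article does not say which clustering method any vendor uses.

```python
# Minimal 1-D k-means sketch (illustrative data, not from the article).

def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest of the given starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value with its closest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical monthly spend per customer; no labels are supplied.
spend = [12, 15, 14, 90, 95, 88]
centers, groups = kmeans_1d(spend, centers=[0, 100])
print(groups)  # [[12, 15, 14], [90, 95, 88]]
```

Nothing in the input says which customers are low or high spenders; the two groups emerge from the data alone, which is what makes the approach unsupervised.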
Classification is where you predict whether something is going to happen or not based on historical references. An example of classification is looking at historical breast cancer test data along with new data to determine whether or not a person has breast cancer without running new tests.
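One simple way to picture classification is a nearest-neighbor vote: a new case gets the label most common among the historical cases it most resembles. The sketch below assumes two made-up numeric measurements per case and a hypothetical `classify` helper; it is not drawn from any real medical dataset or from a tool named in this article.

```python
from collections import Counter


def classify(history, new_point, k=3):
    """Predict a label by majority vote among the k closest labeled examples."""
    by_distance = sorted(
        history,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], new_point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled history: (two made-up measurements, known outcome).
history = [
    ((1.0, 1.2), "negative"),
    ((0.8, 1.0), "negative"),
    ((1.1, 0.9), "negative"),
    ((3.0, 3.2), "positive"),
    ((3.3, 2.9), "positive"),
    ((2.9, 3.1), "positive"),
]

print(classify(history, (3.1, 3.0)))  # positive
print(classify(history, (0.9, 1.1)))  # negative
```

Unlike clustering, every historical example here carries a known outcome, and the prediction is only as good as those labels.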
Regression leverages historical data to come up with a prediction of what a future value will be. For example, you can use regression to do anomaly detection, Sainani explained.
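To make the link between regression and anomaly detection concrete, the sketch below fits a straight line to a short history with ordinary least squares, forecasts the next value, and flags an observation that strays too far from that forecast. The request counts, the `tolerance` threshold and the function names are assumptions chosen for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Hypothetical history: requests served on days 1 through 5.
days = [1, 2, 3, 4, 5]
requests = [100, 110, 120, 130, 140]
slope, intercept = fit_line(days, requests)


def predict(day):
    """Value the fitted trend expects on the given day."""
    return slope * day + intercept


def is_anomaly(day, observed, tolerance=15):
    """Flag an observation that strays too far from the fitted trend."""
    return abs(observed - predict(day)) > tolerance


print(predict(6))          # 150.0
print(is_anomaly(6, 152))  # False
print(is_anomaly(6, 400))  # True
```

In practice the tolerance would be derived from how noisy the historical data is rather than fixed by hand.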
One of the most common examples of machine learning algorithms is a recommendation engine such as the one devised by Netflix that predicts movies or shows based on what you have viewed previously and liked or disliked, according to Forrester’s Gualtieri.
“The main output of machine learning is a predictive model. What you are trying to do is you are trying to create a predictive model that might predict customer behavior,” he said.
Machine learning is also commonly used for data ETL, “where you can guide and recommend actions which can aid users to build the data integration pipeline, cloud data warehouses, and analytical solutions,” according to Oracle’s Zavery.
Other examples of machine learning include fraud detection, weather pattern detection, and medical detection such as looking at medical history to pick up on signs or chances that a patient will have a particular disease in the future, according to Dinesh Nirmal, vice president of next-generation platform, Big Data and analytics for IBM.
“One of the great things about machine learning is that there are solutions out there for pretty much every industry,” said NGDATA’s Noels. “Partnering with the right technology provider can enable companies to focus on the core elements of their business while the machine learning experts help them to optimize the output of their data.”
Taking machine learning a step further
Machine learning is a part of the broader field of artificial intelligence (AI). AI refers to intelligent systems that help people make decisions at scale without requiring human interaction.
“AI needs to build a smart human by eliminating stuff that does not need to be done manually in a more automated way, allowing humans to focus on what they are already good at as opposed to pure logic,” said H2O.ai’s Ambati.
Part of AI is the use of a technique called deep learning, which many believe to be the next step to machine learning. “Deep learning is a general-purpose learning system,” said Forrester’s Gualtieri. “With machine learning and predictive models, you have to tell it what you want it to know, and then it will try to predict that. Deep learning is different. You feed it all kinds of data and it just learns about the data.”
Deep learning takes machine learning a step forward in that it uses neural networks. Currently deep learning is associated with self-driving cars, and image recognition and classification, according to Splunk’s Sainani. “Deep learning will help bring data to the surface that is actually valuable. It will help businesses eliminate data that machines don’t need or don’t use,” he said. For instance, deep learning will help enable better resource utilization and revolutionize consumer behavior.
“Deep learning is at the height of the Gartner hype cycle; machine learning is a little below that because it has gotten democratized, and people understand it. Deep learning is more of a [game-changing technology player],” said Sainani. “It is about collecting data with all these sources, understanding when something bad is happening and being able to have a neural network-based system that is constantly learning and constantly evolving.” However, it isn’t something he sees the vast majority of enterprises using in the near future because they have to start with basic machine learning techniques before they can get into neural network capabilities.
“Machine learning and Big Data are still in their infancy, and there is a tremendous amount of innovation that we will see in this specific [area] across all sectors in the near future,” said Doug Rybacki, vice president of product management at Conga. “Long term it will be absolutely critical for almost any data-driven application to incorporate machine learning to process the increasing amount of data our systems and processes produce.”
Most enterprises are still only starting to scratch the surface of machine learning, according to IBM’s Nirmal. “Machine learning is where their models are being built, where their enterprises are beginning to run to make sure they are learning about their customers, trends, patterns and making the right decisions,” he said. Going forward, the task is to make sure the complexity is taken out of machine learning so that it is simple for everyone. “As the data grows, we [have] to make this simple enough that we democratize the machine learning to the general in our profession and everyone else out there.”
Oracle’s Zavery sees machine learning becoming an integral part of the business instead of being something separate. “It needs to be used as a differentiator as well as part of the core application infrastructure so that consumers can benefit from those things, not only for a few people,” he said.
To do that, Ronen Schwartz, senior vice president and general manager of data integration and cloud integration at Informatica (a data integration and management software provider), envisions more out-of-the-box, simple-to-use algorithms that reduce the need for specialization, more sophisticated algorithms that make it easier to consume data, and users who are empowered to access more data so they can actually learn from it. “It will become very easy to collect data not just from mobile devices but from a lot of other things, and the amount of data it is collecting is actually going to continue to grow in an exponential way,” he said.
And that requires maturing the key algorithms and making vision, speech and language understanding more accurate and efficient, according to Microsoft’s Olson. “Over the last 30 years, the vast majority of breakthroughs in business technology have come from advances in hardware and software. While these two areas continue to improve, increasingly the biggest breakthroughs are coming from data applied to machine learning algorithms,” he said. “These sort of cognitive services lower the cost and complexity involved in applying machine learning to a given business problem by providing prebuilt machine learning systems which don’t require a data scientist to code them up, but instead can be ‘trained’ with your data.”
Alpine Data’s Lewis believes the next challenge is moving to real-time from batch. According to him, people have streams of data constantly coming in, as well as things that are constantly generating data. They need a way to measure and understand the data at a more granular level.
“What you want to do is connect those fire hoses up to systems that are built by data scientists that help you do something, change your behavior, and act based on what is going on right now,” said Lewis. “It is a switch from ‘Let’s look at the historical state of the world and infer things from it’ to ‘We have done that, now we want to use it and to react to what is going on in the world in real time.’ ”
Democratizing machine learning
To make machine learning easier and more accessible to all, companies have to democratize it.
In today’s software development world, there aren’t enough developers, programmers or data scientists to go around. According to Code.org, there are currently more than 500,000 computing jobs open nationwide, with fewer than 50,000 computer science graduates coming into the workforce last year.
“The data is growing so much, it is impossible for enterprises to hire people to go and do that analysis themselves,” said Splunk’s Sainani. “There are only so many data scientists we have in the world.”
Traditionally, machine learning has only been available to organizations that can afford to make significant capital investments, or that can afford to hire a team of data scientists, according to Microsoft’s Olson. But with data becoming a competitive differentiator, vendors and tool providers are shifting gears to focus on citizen analysts or line-of-business analysts. “The demand for prediction and prescription matched up with the low supply of pro data scientists is what is driving the rise of the citizen data scientists,” he said.
In order to work with machine learning, you need a working knowledge of the algorithms and what you want them to do, the kind of models you want to build, the kind of parameters you are going to build into that model, and then the ability to train that model. This requires skills that not many people have. That is why IBM is working on democratizing machine learning and making it easy for everyone to use, explained IBM’s Nirmal.
According to him, the company has a cognitive assistant for data scientists that will become a part of machine learning. The solution is designed to look at an organization’s data and pick the best algorithm for them. “Today, a lot of enterprises have some level of data science skills in house, but the way the data is growing, there is no way they will be able to keep up. It is going to be very critical for us to make it simple enough that everyone can easily deploy, learn and build machines,” he said.
IBM is working on three things: simplification, collaboration and convergence. Simplification is looking at how to make it easier for any professional to build machine learning into their Big Data process. Collaboration refers to all the different personas that exist within a company who can collaborate to build a better model. And convergence is taking all the sets of software a company uses to build a model or do machine learning and converging them into a single platform.
But once businesses move toward citizen data scientists, there will be a broad range of sophistication in the data world that will pose a challenge for tool providers, according to Alpine Data’s Lewis. He suggests consulting with a team of data scientists who can help organizations figure out what can and can’t be done with machine learning, identifying the sources of data, the quality of data, and the data issues.
“Not every business can hire their way out of these problems,” said Lewis. “We can’t just sit here and wait for customers who have a tremendous amount of sophistication.”
Oracle believes tools should enable users to make predictions and discoveries without having to understand how they happened. “You don’t really need to learn any of the tooling; you don’t have to learn the coding part of it and don’t have to worry about how things work underneath the cover,” Oracle’s Zavery said. “That reduces the scope of the usage as well as reduces the amount of users who can really benefit out of it.”
What you need to do machine learning
Machine learning isn’t magic, however. You won’t be able to get the results you are looking for without doing the hard work of building the machine-learning models. The reality is that being successful with machine learning takes a lot of effort, according to Splunk’s Sainani.
To work with machine learning, he said you need to have data scientists on staff, or someone with a strong statistical or mathematical background who is trained on the capabilities of machine learning. “It is not like you are going to magically turn on the machine learning toolkit and put some data in it and get predictions,” he said.
However, Forrester’s Gualtieri notes that if you pick the right tool, you may not actually need a data scientist. “There are plenty of tools out there that actually have machine learning built in,” he said.
But Informatica’s Schwartz believes that while there is a movement to democratize machine learning, for now companies that want to derive value need to have at least a few data scientists on their team to help empower the businesspeople.
“We are seeing more advanced companies where they are building a competency center for data scientists, and this group is starting to focus on how to empower other people to use machine learning,” Schwartz said.
You also have to make sure you have historical data around in order to begin using machine learning, and that the data is clean and labeled. Having your data structured and organized makes it easier to learn from it, according to NGDATA’s Noels.
“Machine learning is only as good as the data you provide it,” said Conga’s Rybacki. “It may take a short amount of time for it to learn what your organization is attempting to track. It is not meant to be 100%, but to gather as much intelligence out of as much data as possible as quickly and as accurately as possible.”
It is also important to set up a well-governed, centralized data lake organized by those who can help make sure the data is organized in a way that will help support the business. “These central entities should be scored and re-scored as regularly as possible (preferably in real time) so that organizations can detect events and alerts that indicate significant opportunities or problems,” said Noels. Then that data should be integrated with a decision engine that ensures the right actions can be done as things happen.
The tools you choose should help you do everything from pre-processing and feature selection to feature extraction and feature engineering, according to Splunk’s Sainani. They should then employ algorithms that help with things like classification, regression, clustering, recommendation, and text analytics.
The other important things to look for are how easy a tool is to use, how quickly you can start building models with it, and what kind of interface it provides (graphical guided or visual). Basically, “what algorithms does it use? How does it use it? And how does it govern the results to make sure they are applied correctly?” said Informatica’s Schwartz.
But companies also need to make sure they don’t fixate on the algorithms, according to Lewis. Every machine learning tool vendor will have a list of algorithms their tools possess, so if you are focused on just the algorithms, you are going to be overwhelmed by choices. He adds that you need to be looking for how the tools connect all the algorithms and models into the business layer, and whether the platform allows you to deploy into the outside world. “You need to think about how it actually connects instead of if it has these algorithms and runs on Hadoop. That is only 10% of the problem,” he said.
According to Forrester, the three key elements a tool should provide are machine learning algorithms that do not require data scientists or application developers to code them; a convenient way of preparing an analytical dataset; and a way to evaluate the model and test whether it is accurate and useable.
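The last of those elements, checking whether a model is accurate, often comes down to scoring it on data it has not seen before. The sketch below shows the idea with a hypothetical `churn_model` rule standing in for a trained model and a made-up set of held-out examples; real tools wrap the same calculation in their own interfaces.

```python
def accuracy(model, examples):
    """Fraction of held-out examples the model labels correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


# Hypothetical stand-in for a trained model: a fixed threshold rule.
def churn_model(monthly_logins):
    return "churn" if monthly_logins < 3 else "stay"


# Hypothetical held-out data the model never saw during training.
holdout = [(0, "churn"), (1, "churn"), (2, "stay"), (5, "stay"), (9, "stay")]

print(accuracy(churn_model, holdout))  # 0.8
```

Here the rule misjudges one of five customers, so the score is 0.8; whether that is usable depends on what a wrong prediction costs the business.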
At the end of the day, organizations should just make sure to “test drive it with your data, but give the solution time to understand it as well,” said Conga’s Rybacki.
The biggest thing to remember is that it is going to take some time, said Splunk’s Sainani. “It is going to take a month or even three months, depending on the complexity of [the] space. You are not going to see results through machine learning overnight,” he said.
A new era of data
Machine learning is bringing us into a new era of Big Data where businesses are now able to serve their customers even better with more personalization and more targeted solutions.
We are in the midst of a technology revolution where machine learning is becoming an absolute requirement for businesses and applications to remain competitive, according to Rybacki. “Simply, computers are faster than humans at technical tasks. Machine learning can execute hundreds of times more models with ever increasing refinement than humans can.”
Machine learning isn’t new, but the rise of the cloud is making it easier for companies to obtain, according to Microsoft’s Olson. “Machine learning in the cloud changes the game by making the power of machine learning accessible to anyone with a browser,” he said.
The cloud is making applications simpler to use and easier to adopt because you don’t have to worry about the back-end systems necessary to implement, provision, manage and upgrade them, according to Oracle’s Zavery. “The ability to provide updates, newer algorithms, and new capabilities is so much faster in the cloud than a traditional on-premises deployment,” he said.
The other benefit of the cloud is that machine learning technologies are able to learn from multiple datasets, incorporate different sources, and allow businesses to collaborate easily in the cloud. This is making it easier to train machine learning algorithms and make them smarter because they are able to get many interactions in the cloud as opposed to being trained in an isolated environment, according to Zavery.
Big Data combined with machine learning is pushing us into the Data 3.0 era, according to Informatica’s Schwartz. Data 1.0 happened 20 to 30 years ago as a part of the application. In Data 1.0, you were able to see the data, he explained.
Data 2.0 has been going on for the last 10 to 15 years, and it has been all about enterprise data. “How am I able to collect data, mainly structured information from across the enterprise and really get a single view of the data from multiple applications?” said Schwartz.
We are now in the midst of Data 3.0, where data is growing and becoming a core part of the business. “Data is becoming the center, and data is becoming one of the biggest assets that you have,” said Schwartz. “Machine learning is going to be one of the things that help you derive more value from your data, do it faster and in a better way. The data itself is still key for your success.”
According to H2O.ai’s Ambati, businesses are now having to rewrite their legacy apps to get knowledge out of, operationalize, and modernize their data. “It is enabling us to build applications that are not rule-based, but based on patterns, based on more experiences from the data,” he said.
What is machine learning?
According to a report by Forrester’s Gualtieri (along with analyst Rowan Curran), machine learning is “a field of computer science involving creating and continuously improving algorithms that automatically analyze data to identify patterns or predict outcomes.” It refers to “a broad set of algorithms which can add a cornucopia of new functionality, understanding, and experiences to applications.”
According to the report, the power of machine learning apps lies in becoming able to anticipate user needs, and to adapt based on changing circumstances.
Gualtieri explains there are two types of machine learning: unsupervised and supervised.
Supervised machine learning refers to the creation of a predictive model. “It is supervised because you are giving it data, and you are telling it what you want it to predict,” said Gualtieri.
Unsupervised machine learning refers to giving a machine a dataset, but not telling it what you are looking for. “The unsupervised machine learning will then try to find patterns or clusters of information that might or might not be interesting to you,” Gualtieri said.
The difference is that with supervised machine learning you are training the model, and with unsupervised machine learning the model is training itself, according to IBM’s Nirmal.
Gualtieri explains supervised machine learning is more commonly seen in the enterprise today than unsupervised learning.
According to Gualtieri, an example of when you would use unsupervised machine learning is in a healthcare situation. If you have electronic health records that contain doctors’ notes in them, you could use machine learning to look at those notes and find something you may not have been looking for, such as the likelihood of people visiting the doctor needing to go to a hospital, or being able to notice the outbreak of a disease. “You are not sure what you are looking for, so you are trying to find those patterns that may or may not be interesting,” he said.
The report explains: “Unsupervised machine learning is often used to find segments of customers, or to analyze free-form text, such as social media posts, to determine sentiment.
“Supervised machine-learning algorithms are used when you know what you are looking for in the data.”