Every year, some 720,000 blogs, 10,000 research papers, and data from countless malware varieties, viruses, and software vulnerabilities add to the massive, growing, and often messy collection of cybersecurity knowledge.
But because most of that information is in written form and not formally structured for data-crunching computers, much of it is never analyzed and dissected to help solve today's most pressing digital security problems.
Now, researchers at IBM want to see if they can use the company’s Watson supercomputer to digest that data, in hopes that the machine can help humans outsmart malicious hackers.
If its winning performance on "Jeopardy!" is any indication, Watson's processing power may be a boon to an industry drowning in data and struggling to more quickly find and fix computer vulnerabilities.
"Security analysis is based upon the consumption of lots of data," said Jon Oltsik, an analyst at Enterprise Strategy Group, a tech research firm.
But since many cybersecurity professionals can't spend all day crunching data, "Watson is engineered to do this and actually learn as it does so. It can help sort through the noise and point analysts toward relevant content," he said.
Given the huge skills gap that exists in the security industry, most organizations do not have anywhere near the resources required to manually pore through and correlate data from other sources with the data generated by their own devices.
Applying machine learning technology to the problem offers a way to combine and extract value from much broader and more diverse data sets than is possible today, says Caleb Barlow, vice president of IBM Security.
"Watson is an unstructured data engine," said Mr. Barlow, referring to the technology’s ability to make sense of data that has not been specifically structured for use by computers. "It allows us to go look at thing in blogs, wikis, video transcripts and bring that data into the context of trying to solve cybersecurity challenges."
IBM says its research shows that a staggering 80 percent of all security information on the Internet is in a form that cannot be easily consumed by modern security software tools. In fact, the average organization taps just 8 percent of the data available to it that is not generated by a network security product.
But before Watson begins analyzing cyberthreats, it'll need to learn the language of cybersecurity, Barlow said. Just as IBM researchers trained the supercomputer over a period of time to play "Jeopardy!," they now need to train it to look at documents and data and extract security intelligence from them.
That's a task that requires annotating and inputting huge volumes of security reports into the system and helping it identify the terms, the definitions and the language associated with cybersecurity – similar to Watson's brief stint as a chef, where the supercomputer learned to develop recipes from thousands of ingredients for a food truck at the South by Southwest festival in 2014.
Over the next several months, students from the California State Polytechnic University, Pomona, Pennsylvania State University, the Massachusetts Institute of Technology, New York University, and four other universities will process and input content into Watson from an average of 15,000 security documents per month.
"This isn’t like developing a normal software development product," IBM's Barlow said. "It is much like teaching a child to read. We have to teach Watson how to read and understand security data. We have to teach it what an attack is, who an attacker is and what an indicator of compromise looks like."
Smart as Watson is, it can make mistakes, said Barlow. A case in point has been Watson’s tendency to classify the term "ransomware" as a city. "We really had to go in and force the correction that ransomware is not a city."