Fishing for data
To harness the vast information flow generated each day, scientists are developing sophisticated software that can instantly mine streaming data, such as videos, without ever needing to archive it.
If looking for a needle in a haystack sounds difficult, try searching for a few flecks of gold dust in the gush of a fire hose, extracting them, then analyzing them for purity, all in a few seconds.
That's the kind of challenge computer scientist Hillol Kargupta and his colleagues face as they try to develop new tools to "mine" digital databases and the streams of information that feed them.
One result of their efforts is a wireless hand-held computer that constantly monitors several financial-data services and immediately alerts a user when changes occur in a sector of the stock market that could affect the user's portfolio.
The experimental setup, a prototype for other wireless applications, may not shake the world of high finance. But it typifies an intensifying push among computer scientists worldwide to quickly navigate and retrieve nuggets from vast storehouses of data that are growing at unprecedented rates.
Humans seem to "expand to fit whatever space we have," observes Ronald Indeck, a professor of electrical engineering at Washington University in St. Louis. "If it's a big house, within a matter of weeks we fill up every space and start throwing stuff out into the garage."
The world of computers is no different, he continues. "Data storage is growing, and the challenge is that we want to be able to get the stuff back."
Each day, he says, the US intelligence community gathers an amount of information equal to all the printed pages in the US Library of Congress. The World Wide Web is growing by more than 1.5 million Web pages daily, taxing the ability of the current generation of search engines to track down the answer to a user's query quickly and accurately. Overall, he notes, the average size of a database and the software needed to use it is growing at a faster rate than computer-processing speeds, which double roughly once every 18 months.
Meanwhile, storage capacity has grown even as the cost has plummeted.
Four years ago, a credit-card-size hard-disk storage device in a typical consumer laptop computer might have held a respectable six gigabytes of data - enough space to store six conventional movies. Today, laptop hard drives holding 40 to 60 gigabytes are common, and capacities range up to 120 gigabytes or more.
Dr. Indeck adds that the cost of hard-drive space, after adjusting for inflation, now stands at a paltry one-tenth of one cent per megabyte and continues to plummet.
These trends, combined with the growth in mobile and wireless computing and with visions of "wearable" computers in the future, are prompting researchers to explore pathways to a new generation of data-mining technologies.
Current data-mining techniques emerged to support "market-basket analysis," says Anupam Joshi, a computer scientist at the University of Maryland at Baltimore. Supermarkets, for example, use data on consumer buying patterns to reposition products on the shelf to boost sales.
Dr. Joshi cites an example of a supermarket chain whose data showed that men who bought beer also were highly likely to buy disposable diapers at the same time. The key, apparently, was football. As men headed to the store to stock up for the big game, wives would remind them to pick up diapers. "So the stores stuck their diapers next to the beer cases" to serve as an in-store reminder to bring home the diapers during the off-season, he says.
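For readers curious about the arithmetic behind such a finding, the following is a minimal sketch of market-basket analysis in Python. The transactions are invented for illustration and are not the supermarket chain's data, and the function names `pair_stats` and `confidence` are hypothetical. It counts how often two items share a basket ("support") and how often buyers of one item also buy the other ("confidence").

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]


def pair_stats(baskets):
    """Count how often each item and each unordered item pair appears."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        # Sort so each pair is always stored in the same order.
        pair_counts.update(combinations(sorted(basket), 2))
    return item_counts, pair_counts


def confidence(antecedent, consequent, item_counts, pair_counts):
    """Fraction of baskets containing `antecedent` that also hold `consequent`."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


item_counts, pair_counts = pair_stats(baskets)

# Support: share of all baskets that contain both beer and diapers.
support = pair_counts[("beer", "diapers")] / len(baskets)

# Confidence: of the baskets with beer, the share that also have diapers.
beer_to_diapers = confidence("beer", "diapers", item_counts, pair_counts)

print(f"support={support:.2f} confidence={beer_to_diapers:.2f}")
```

In this toy data, two of the five baskets contain both items, and two of the three beer baskets also contain diapers. A real retailer would run the same counts over millions of receipts and typically keep only pairs above chosen support and confidence thresholds.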
But demands are growing for more sophisticated data-mining techniques, he notes. Homeland-security applications will likely require an ability to search existing financial, criminal, immigration, or other fixed databases, as well as monitor streaming video and audio sources for evidence of potential terrorist activity.
Streaming data in particular can be challenging to harness, since the amount of information flowing is almost too much to archive, notes Dr. Kargupta, also at the University of Maryland at Baltimore. And getting that data to people in the field who may be using portable computers and wireless networks poses its own set of challenges.
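One way to picture mining a stream without archiving it is a one-pass algorithm that keeps only a few running totals. The sketch below is an assumption-laden illustration, not Dr. Kargupta's system: it uses Welford's method for a running mean and variance, and the names `RunningStats` and `watch`, the threshold, and the sample prices are all hypothetical. Each value is examined once, compared against what has been seen so far, and then discarded.

```python
import math


class RunningStats:
    """Running mean and variance (Welford's method); stores no past values."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # Sample standard deviation; zero until there are two values.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def watch(stream, threshold=3.0, warmup=5):
    """Yield values that sit more than `threshold` deviations from the mean.

    `threshold` and `warmup` are illustrative defaults, not tuned values.
    """
    stats = RunningStats()
    for x in stream:
        if stats.n >= warmup:
            sd = stats.stddev()
            if sd > 0 and abs(x - stats.mean) > threshold * sd:
                yield x  # flag the value before folding it into the totals
        stats.update(x)


# Hypothetical price ticks, invented for illustration only.
prices = iter([10.0, 10.1, 9.9, 10.0, 10.2, 10.1, 14.0, 10.0])
alerts = list(watch(prices))
print(alerts)  # [14.0]
```

Because the monitor holds only three numbers no matter how long the stream runs, the same idea could in principle fit on a hand-held device, though a production system would need to handle drift, multiple data sources, and network interruptions.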