Fishing for data
To harness the vast information flow generated each day, scientists are developing sophisticated software that can instantly mine streaming data, such as videos, without ever needing to archive it.
If looking for a needle in a haystack sounds difficult, try searching for a few flecks of gold dust in the gush of a fire hose, extracting them, then analyzing them for purity, all in a few seconds.
That's the kind of challenge computer scientist Hillol Kargupta and his colleagues face as they try to develop new tools to "mine" digital databases and the streams of information that feed them.
One result of their efforts is a wireless hand-held computer that constantly monitors several financial-data services and immediately alerts a user when changes occur in a sector of the stock market that could affect the user's portfolio.
The experimental setup, a prototype for other wireless applications, may not shake the world of high finance. But it typifies an intensifying push among computer scientists worldwide to quickly navigate and retrieve nuggets from vast storehouses of data that are growing at unprecedented rates.
Humans seem to "expand to fit whatever space we have," observes Ronald Indeck, a professor of electrical engineering at Washington University in St. Louis. "If it's a big house, within a matter of weeks we fill up every space and start throwing stuff out into the garage."
The world of computers is no different, he continues. "Data storage is growing, and the challenge is that we want to be able to get the stuff back."
Each day, he says, the US intelligence community gathers an amount of information equal to all the printed pages in the US Library of Congress. The World Wide Web is growing by more than 1.5 million Web pages daily, taxing the ability of the current generation of search engines to track down the answer to a user's query quickly and accurately. Overall, he notes, the average size of a database and the software needed to use it is growing at a faster rate than computer-processing speeds, which double roughly once every 18 months.
Meanwhile, storage capacity has grown even as the cost has plummeted.
Four years ago, a credit-card-size hard disk in a typical consumer laptop computer might have held a respectable six gigabytes of data - enough space to store six conventional movies. Today, laptop hard drives holding 40 to 60 gigabytes are common, and capacities range up to 120 gigabytes or more.
Dr. Indeck adds that the cost of hard-drive space, after adjusting for inflation, now stands at a paltry one-tenth of one cent per megabyte and continues to plummet.
These trends, combined with the growth in mobile and wireless computing and with visions of "wearable" computers in the future, are prompting researchers to explore pathways to a new generation of data-mining technologies.
Current data-mining techniques emerged to support "market-basket analysis," says Anupam Joshi, a computer scientist at the University of Maryland at Baltimore. Supermarkets, for example, use data on consumer buying patterns to reposition products on the shelf to boost sales.
Dr. Joshi cites an example of a supermarket chain whose data showed that men who bought beer also were highly likely to buy disposable diapers at the same time. The key, apparently, was football. As men headed to the store to stock up for the big game, wives would remind them to pick up diapers. "So the stores stuck their diapers next to the beer cases" to serve as an in-store reminder to bring home the diapers during the off-season, he says.
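For readers curious how such a pattern surfaces, market-basket analysis typically scores item pairs by "support" (the share of all baskets containing both items) and "confidence" (the share of baskets with the first item that also contain the second). The following is a minimal sketch in Python; the baskets, the function name pair_rules, and the 0.3 support cutoff are invented for illustration and do not come from any supermarket's data.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be interesting
        # Each pair yields two directed rules: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Hypothetical shopping baskets, made up for this example.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["beer", "chips"],
    ["milk", "diapers"],
    ["beer", "diapers", "milk"],
]
rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, three of the five baskets hold both beer and diapers, and three of the four beer baskets also hold diapers, which is the kind of signal a store might act on.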
But demands are growing for more sophisticated data-mining techniques, he notes. Homeland-security applications will likely require an ability to search existing financial, criminal, immigration, or other fixed databases, as well as monitor streaming video and audio sources for evidence of potential terrorist activity.
Streaming data in particular can be challenging to harness, since the amount of information flowing is almost too much to archive, notes Dr. Kargupta, also at the University of Maryland at Baltimore. And getting that data to people in the field who may be using portable computers and wireless networks has its own set of challenges.
"You're getting a continuous flow of data, and you have a limited amount of time to analyze large numbers of data points quickly," he says. Systems also must be designed to use battery power sparingly and deal with communications links that can carry substantially less information than do fiber-optic cables or copper wire.
His team's stock-market monitor, which can run on a wireless Palm Pilot or other personal digital assistant, is testing software approaches to meeting those requirements, he says. The team also has developed a system for monitoring truck shipments, providing more information on the condition of the vehicle and cargo than merely receiving periodic updates on a truck's position via navigation satellites.
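The constraints of streaming analysis (look at each data point once, keep little in memory, decide immediately) can be illustrated with a small sliding-window monitor. This is only a sketch under stated assumptions, not the Maryland team's software: the class name StreamMonitor, the five-reading window, the 5 percent alert threshold, and the price series are all hypothetical, and prices are assumed to be positive.

```python
from collections import deque


class StreamMonitor:
    """Flag readings that stray from the average of the last few readings."""

    def __init__(self, window=5, threshold=0.05):
        # A bounded deque discards the oldest reading automatically,
        # so memory use stays fixed no matter how long the stream runs.
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, price):
        """Process one reading; return True if it warrants an alert."""
        alert = False
        if len(self.window) == self.window.maxlen:
            baseline = sum(self.window) / len(self.window)
            alert = abs(price - baseline) / baseline > self.threshold
        self.window.append(price)
        return alert


# Hypothetical price stream; nothing is archived beyond the window.
prices = [100, 100, 100, 100, 100, 101, 110]
monitor = StreamMonitor(window=5, threshold=0.05)
alerts = [monitor.update(p) for p in prices]
print(alerts)
```

Only the final reading, which jumps nearly 10 percent above the recent average, triggers an alert. Keeping the state this small is one way to respect the limits on memory and battery power that hand-held devices impose.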
Others, such as Washington University's Indeck, are taking hardware approaches to boosting database search speed. Typically, he says, a database on a storage device such as a hard drive must cross from the drive to the computer's main memory and processor for the search to take place, substantially slowing the search. The interconnection, called a bus, is basically an "electronic water pipe," Indeck says, and has a fixed carrying capacity.
Indeck and colleagues have developed a hard drive that contains its own processing circuitry, so the only signals that must cross the bus are the initial query and the answers, not the entire contents of the database itself. By using this configuration, he says, searches that once might have taken days can be concluded in "many seconds." Overall, his team estimated that the approach can run searches 200 times faster than existing technologies.
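The saving comes from how much data has to squeeze through the "water pipe." A rough way to see it is to count the bytes that cross the bus under each arrangement. The sketch below is a toy accounting exercise in Python, not a model of Indeck's hardware; the function names and the four sample records are invented for illustration.

```python
def host_side_search(records, needle):
    """Ship every record across the bus, then filter on the host."""
    bytes_moved = sum(len(r) for r in records)
    matches = [r for r in records if needle in r]
    return matches, bytes_moved


def drive_side_search(records, needle):
    """Filter next to the storage; only the query and the hits cross the bus."""
    matches = [r for r in records if needle in r]
    bytes_moved = len(needle) + sum(len(r) for r in matches)
    return matches, bytes_moved


# Hypothetical records standing in for a database on disk.
records = [b"alpha:001", b"beta:002", b"alpha:003", b"gamma:004"]

host_matches, host_bytes = host_side_search(records, b"alpha")
drive_matches, drive_bytes = drive_side_search(records, b"alpha")
print(host_bytes, drive_bytes)
```

Both searches return the same records, but the drive-side version moves only the query and the matching records. With four records the difference is small; it widens as the database grows while the number of matches stays modest.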
These and other technologies are likely to be high on the shopping list for the federal government's Total Information Awareness project, spearheaded by the Defense Advanced Research Projects Agency (DARPA). The program, which some have dubbed "the mother of all data-mining projects," kicked off last year with the fiscal 2002 budget. The R&D program aims to "detect, classify, identify, and track terrorists so that we may understand their plans and act to prevent them from being executed," according to John Poindexter, the project's director.
Speaking at a meeting on the project last summer, Dr. Poindexter noted that much of the effort will focus on unifying and probing databases that carry information on financial transactions.
Maryland's Kargupta notes that researchers are working to preserve privacy by designing software that randomly masks the characteristics of individuals in a monitored group. The group's activities can then be monitored as a whole without revealing any one individual's identity. If the need arises, however, that safeguard can be lifted for any individual in the group.
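One well-known way to mask individuals while preserving the picture of the group is to add random noise that averages out to zero across many records. The sketch below illustrates that general idea in Python; it is an assumption about the flavor of technique involved, not a description of the researchers' software. The function name mask_values, the noise range, and the synthetic "spending" figures are hypothetical, and the sketch does not model lifting the safeguard for a single person.

```python
import random


def mask_values(values, spread=50.0, seed=0):
    """Add zero-mean uniform noise to every individual's value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-spread, spread) for v in values]


# Synthetic data: 10,000 made-up individual figures centered near 200.
data_rng = random.Random(1)
true_values = [data_rng.gauss(200.0, 30.0) for _ in range(10_000)]

masked = mask_values(true_values)

true_mean = sum(true_values) / len(true_values)
masked_mean = sum(masked) / len(masked)
print(f"group mean before masking: {true_mean:.1f}")
print(f"group mean after masking:  {masked_mean:.1f}")
print(f"first individual: {true_values[0]:.1f} -> {masked[0]:.1f}")
```

Any single masked figure may be off by as much as 50, yet the group average barely moves, because the noise largely cancels across thousands of records.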
Indeck adds that systems can be established that give users access to the gross output of a search or query, but not to the raw information from which the output was derived.
Researchers acknowledge that as data-mining technologies improve, the software they write will have to reflect existing privacy laws and be easy to adjust as legal rulings on privacy issues emerge.
But some privacy advocates doubt that those efforts will be sufficient to ensure that civil liberties will be maintained.
"The problem overall is that so much emphasis is being put on the data-mining aspect with little being said about controls," says Lee Tien, senior staff attorney with the Electronic Frontier Foundation in San Francisco, referring to DARPA's push. The project's efforts to improve the human-computer interactions and use technology to boost collaboration between federal agencies "is hard to argue with," he says, especially in light of the missed clues that might have heightened alerts in advance of last year's terrorist attacks on the World Trade Center and the Pentagon.
But the emphasis on surveillance, he says, raises questions of accountability, from the software engineers who design the programs to the people who would provide human checks on the automated results - a challenge common to many envisioned data-mining schemes.
"The big questions are how do you define privacy and how will you maintain it?" agrees Indeck. "We have to engage the privacy issue and embed it properly into our systems."