Machine Learning Moves Popular Data Elements Into A Bucket Of Their Own

If you look under the hood of the internet, you’ll find lots of gears churning along that make it all possible.

For example, take a company like AT&T. They need a detailed understanding of which internet data is going where so that they can better accommodate different levels of usage. But it isn’t practical to precisely monitor every packet of data, because companies simply don’t have unlimited amounts of storage space. (Researchers actually call this the “Britney Spears problem,” named for search engines’ long-running efforts to tally trending topics.)

Because of this, tech companies use special algorithms to roughly estimate the amount of traffic heading to different IP addresses. Traditional frequency-estimation algorithms involve “hashing,” or randomly splitting items into different buckets. But this approach discounts the fact that there are patterns that can be uncovered in high volumes of data, like why one IP address tends to generate more internet traffic than another.
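To make the hashing idea concrete, here is a minimal sketch of one well-known hashing-based frequency estimator, the Count-Min sketch. The article does not say which algorithm any particular company uses, so the class name, the table dimensions, and the choice of BLAKE2 as the hash function are all illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: several rows of buckets, one hash per row."""

    def __init__(self, width, depth):
        self.width = width    # buckets per row (assumed, tune to memory budget)
        self.depth = depth    # number of independent hash rows (assumed)
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # Salting the hash with the row index gives a different
        # pseudo-random bucket assignment for every row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a bucket, so the smallest of the
        # row counters is the tightest available estimate.
        return min(
            self.table[row][self._bucket(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch(width=64, depth=4)
sketch.add("10.0.0.1", 1000)   # a busy address
sketch.add("10.0.0.2", 30)     # a quiet address
print(sketch.estimate("10.0.0.1"))  # never below the true count of 1000
```

Because unrelated items can land in the same bucket, the estimate can overshoot but never undershoot. A rare address that happens to collide with a very busy one inherits most of its count, which is the kind of error a pattern-aware approach tries to avoid.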

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have devised a new way to find such patterns using machine learning.

Their system uses a neural network to automatically predict if a specific element will appear frequently in a data stream. If it does, it’s placed in a separate bucket of so-called “heavy hitters” to focus on; if it doesn’t, it’s handled via hashing.
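The routing step can be sketched roughly as follows. This is not the CSAIL team’s implementation: the trained neural network is replaced by a hypothetical placeholder predictor (a simple set lookup), and the names `LearnedFrequencyEstimator` and `predict_heavy` are invented for illustration.

```python
import hashlib


class LearnedFrequencyEstimator:
    """Predicted heavy hitters get exact counters; the rest are hashed."""

    def __init__(self, predict_heavy, width):
        self.predict_heavy = predict_heavy  # stand-in for the neural network
        self.heavy = {}                     # one dedicated bucket per heavy hitter
        self.buckets = [0] * width          # shared hashed buckets for the rest

    def _bucket(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "little") % len(self.buckets)

    def add(self, item, count=1):
        if self.predict_heavy(item):
            self.heavy[item] = self.heavy.get(item, 0) + count
        else:
            self.buckets[self._bucket(item)] += count

    def estimate(self, item):
        if self.predict_heavy(item):
            return self.heavy.get(item, 0)      # exact, no collisions
        return self.buckets[self._bucket(item)]  # may overcount on collision


# Assumption: a fixed set plays the role of the learned model here.
likely_popular = {"britney spears"}
estimator = LearnedFrequencyEstimator(
    predict_heavy=lambda query: query in likely_popular,
    width=16,
)

estimator.add("britney spears", 500)
estimator.add("count-min sketch tutorial", 3)
print(estimator.estimate("britney spears"))  # exact count: 500
```

Keeping the predicted heavy hitters out of the shared buckets means their large counts cannot spill onto rare items through collisions, which is where a purely random split loses most of its accuracy.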

“It’s like a triage situation in an emergency room, where we prioritize the biggest problems before getting to the smaller ones,” says MIT Professor Piotr Indyk, co-author of a new paper about the system that will be presented in May at the International Conference on Learning Representations in New Orleans, Louisiana. “By learning the properties of heavy hitters as they come in, we can do frequency-estimation much more efficiently and with much less error.”

In tests, Indyk’s team showed that their learning-based approach had upwards of 57 percent fewer errors for estimating the amount of internet traffic in a network, and upwards of 71 percent fewer errors for estimating the number of queries for a given search term.
