
What is the Size of the Big Data Universe?

Writer: George Miguel


The Data Age has officially begun. We leave a trail of data everywhere we go, from browser cookies to our social media profiles. So just how much information is out there, and how much data do we deal with on a daily basis? Welcome to the Zettabyte Era.

1. Zettabyte Era

Bits and bytes are the units of measurement for data. A single bit can be either 0 or 1. A byte is made up of eight bits. From there, we have kilobytes (1,000 bytes), megabytes (1,000² bytes), gigabytes (1,000³ bytes), terabytes (1,000⁴ bytes), petabytes (1,000⁵ bytes), exabytes (1,000⁶ bytes) and zettabytes (1,000⁷ bytes).
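
As a quick illustration (a minimal sketch, not from the article itself), a few lines of Python show how these decimal units scale and how a raw byte count converts between them:

```python
# A minimal sketch of the decimal data units above, each step a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    """Render a raw byte count in the largest convenient decimal unit."""
    value = float(num_bytes)
    for unit in UNITS[:-1]:
        if value < 1000:
            return f"{value:.2f} {unit}"
        value /= 1000
    return f"{value:.2f} {UNITS[-1]}"

one_zettabyte = 1000 ** 7                   # 1,000^7 bytes
print(human_readable(one_zettabyte))        # -> "1.00 ZB"
print(one_zettabyte // 1000 ** 4, "TB")     # -> 1000000000 TB, i.e. a billion terabytes
```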

In 2016, Cisco estimated that total annual Internet traffic had passed one zettabyte, meaning all the data we upload and share on the web, the majority of it file sharing. In computing, a zettabyte is a unit of storage capacity equal to 1,000⁷ bytes (1,000,000,000,000,000,000,000 bytes). A zettabyte is the same as 1,000 exabytes, 1 billion terabytes, or 1 trillion gigabytes. That's a lot, in other words, especially when we consider that the Internet is only about 40 years old. According to Cisco, annual Internet traffic would exceed 2 zettabytes by 2020.

All personal and business devices contribute to the total amount of stored data, not just the Internet. Estimates of global storage for 2019 already range from 10 to 50 zettabytes, and this is expected to reach 150–200 zettabytes by 2025.

Since the rate at which new data is generated will only increase in the coming years, you might wonder whether there is an upper limit to how much data can be stored. There are limits, but they are so far away that we will never come close to them. Just one gram of DNA can store about 700 terabytes of data, which would let us store all of our current data in roughly 1,500 kg of DNA, packed densely enough to fit in an average-sized living room. That, however, is a far cry from what we can actually manufacture today: the largest currently available hard drive holds 15 terabytes, and the largest SSD 100 terabytes.
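
A rough sanity check of that figure (a back-of-the-envelope sketch: the 700 TB/gram density is the one quoted above, and one zettabyte, roughly the annual-traffic figure from the previous paragraphs, stands in for the data to be stored):

```python
# Back-of-the-envelope DNA storage estimate; all figures are taken from or
# assumed for the paragraph above, not measured.
def dna_mass_kg(data_tb: float, tb_per_gram: float = 700) -> float:
    """Kilograms of DNA needed to hold `data_tb` terabytes at the given density."""
    return data_tb / tb_per_gram / 1000

one_zettabyte_tb = 1000 ** 3                        # 1 ZB expressed in terabytes
print(f"{dna_mass_kg(one_zettabyte_tb):,.0f} kg")   # -> roughly 1,429 kg
```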

A dataset is called "Big Data" when it is too large or complex for a typical computer to handle, which makes the term a function of whatever hardware is currently on the market. In 1999 the world held about 1.5 exabytes of data, and 1 gigabyte was still considered a large amount. By 2006 the total had already reached 160 exabytes, a more than 100-fold increase in seven years. Today, talking about "big data" means talking about at least a terabyte, not a gigabyte. By that logic, a natural definition is that any dataset exceeding the total amount of data created in the world divided by 1,000³ counts as "Big Data."
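
Under that definition, the threshold moves as the world's total data grows. A small sketch, assuming roughly 33 zettabytes of global data (a figure inside the 2019 range quoted earlier):

```python
# Illustrative only: the "Big Data" threshold under the definition above,
# i.e. total world data divided by 1,000^3.
world_data_bytes = 33 * 1000 ** 7                # assumed ~33 ZB of global data
threshold_bytes = world_data_bytes / 1000 ** 3
print(f"{threshold_bytes / 1000 ** 4:.0f} TB")   # -> 33 TB under this assumption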

 

2. Petaflops

For data to be useful, you must not only store it but also access and process it. Computing power can be measured in two ways: instructions per second (IPS) or floating-point operations per second (FLOPS). IPS is the less precise of the two because it depends on the programming language, whereas FLOPS is easy to picture: it corresponds directly to how many multiplications or divisions of decimal numbers we can perform per second. Most modern CPUs sit in the range of 20–60 GFLOPS (1 gigaFLOPS = 1,000³ FLOPS), compared with a simple handheld calculator, which needs only a handful of FLOPS. IBM's record-breaking supercomputer of 2018 reached 122.3 petaFLOPS (1 petaFLOPS = 1,000⁵ FLOPS), with a peak performance of around 200 petaFLOPS, outclassing even the most powerful desktop PC by a wide margin.
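
To get a feel for these numbers, a rough way to estimate the FLOPS of your own machine is to time a large matrix multiplication, which takes about 2n³ floating-point operations (a sketch; the result mostly reflects the BLAS library NumPy links against, not a formal benchmark):

```python
# Rough FLOPS estimate: time an n x n matrix multiplication (~2*n^3 operations).
import time
import numpy as np

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
a @ b
elapsed = time.perf_counter() - start

print(f"~{2 * n ** 3 / elapsed / 1e9:.1f} GFLOPS")  # typical CPUs land in the tens to hundreds
```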

GPUs outperform CPUs at floating-point calculations, with mass-market cards delivering hundreds of GFLOPS or more. Things get really interesting when you look at specialized architectures. The best-known example of this new trend is Google's TPU, which has a peak performance of 45 teraFLOPS (1 teraFLOPS = 1,000⁴ FLOPS) and can be accessed via the cloud.

If you need massive computations but do not own a supercomputer, the next best thing is to rent one or run your computations in the cloud. Amazon's P3 instances deliver up to about 1 petaFLOPS, while a Google TPU pod delivers up to 11.5 petaFLOPS.

3. Machine learning and Big Data

Now that you have the data and the computing power to analyze it, it's time to put the two together and generate new insights. Machine learning is the only way to get the most out of both. AI is at the forefront of data use, powering predictions about the weather, traffic, and health, from discovering new drugs to the early detection of cancer.

A useful metric for gauging computing power versus data usage in AI is the amount of compute required to train a model to peak performance. According to a 2018 OpenAI report, the compute used in the largest AI training runs, measured in petaflop/s-days (petaFD), has been doubling every 3.5 months since 2012. One petaFD means performing 10¹⁵ (1,000⁵) neural-net operations per second for an entire day, which works out to roughly 10²⁰ operations. The beauty of this metric is that it not only captures the network architecture in terms of the number of operations required, but also ties that architecture to an actual implementation on current devices (compute time).
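
The arithmetic behind that unit, as a small sketch:

```python
# One petaflop/s-day (petaFD) expressed as a raw operation count.
ops_per_second = 10 ** 15          # one petaFLOPS = 1,000^5 operations per second
seconds_per_day = 24 * 60 * 60     # 86,400 seconds

one_petafd = ops_per_second * seconds_per_day
print(f"{one_petafd:.3e} operations")          # -> 8.640e+19, i.e. roughly 10^20

def petafd(sustained_flops: float, days: float) -> float:
    """Training compute in petaFD, given sustained FLOPS and training time in days."""
    return sustained_flops * days / 10 ** 15
```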

The chart in that OpenAI report shows how many petaFD were used in recent AI breakthroughs.

AlphaGo Zero, developed by DeepMind, used the most training compute of all, around 1,000 petaFD (1 exaFD), making it the clear leader in that chart. How many resources does that represent? According to one detailed estimate, replicating the training yourself could easily cost up to $3 million. Taking 1,000 petaFD at face value, it is at least the equivalent of running the best Amazon P3 instance (about 1 petaFLOPS) for 1,000 days. At the current price of $31.218 per hour, that works out to $31.218 × 24 × 1,000 = $749,232. And this is a lower bound: it assumes that one neural-net operation equals one floating-point operation and that the P3 achieves the same performance as the GPUs/TPUs actually used.
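
The cost arithmetic above, reproduced as a sketch (on-demand prices change, and the assumptions are exactly those stated in the paragraph):

```python
# Lower-bound cost of ~1,000 petaFD of training on a single Amazon P3 instance,
# using the figures quoted above (illustrative only).
p3_price_per_hour = 31.218      # USD, as quoted above
p3_petaflops = 1.0              # ~1 petaFLOPS, as quoted above
target_petafd = 1000            # ~1 exaFD, AlphaGo Zero's estimated training compute

days_needed = target_petafd / p3_petaflops          # 1,000 days at full utilization
cost = days_needed * 24 * p3_price_per_hour
print(f"${cost:,.0f}")                               # -> $749,232
```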

As this example illustrates, training state-of-the-art AI demands a great deal of power and resources. There have been recent advances in machine learning that did not require enormous amounts of compute or data, but they are the exception rather than the rule. That's why upgrading our supercomputers and expanding our data centers makes sense if we want to advance artificial intelligence and, by extension, our civilization. Supercomputers are to AI what the Large Hadron Collider is to physics: we build ever-larger colliders to probe the mysteries of the universe, and we scale up FLOPS to probe our own intelligence and creative abilities.

The Zettabyte Era is here, and the Yottabyte Era is not far away, so make the most of it.


 

What exactly constitutes "Big Data"?

 

How do we define "big data," and what exactly constitutes a "big data" analysis, in an era when nearly everything is dubbed "big data"? Does merely searching a multi-petabyte dataset for specific keywords count? Is extracting a few million tweets from a trillion-tweet archive with a date filter "big data"? Does running a 100-petabyte file server, or simply keeping 100 petabytes of backups in the cloud, count as "big data" storage? What, exactly, should be considered "big data" today?

In 2013, I would begin my data science presentations by stating that the day before, I had run several hundred analyses over a 100-petabyte database containing more than 30 trillion rows and incorporating more than 200 indicators. When I asked whether this counted as a "big data" analysis, the audience was unanimous that it did.

However, when I explained that I had simply run a bunch of Google searches, something millions of people around the world do every day, the audience usually agreed that this was not a "big data" analysis at all, but merely a "search." After all, the argument that a 10-year-old running Google searches should count as a bleeding-edge "big data" analyst is hardly satisfying.

What about Twitter? With just over a trillion tweets sent in its history as of spring 2018, and around 350 million new tweets a day at the time, Twitter would seem to be a textbook example of the traditional "three V's" of big data: volume, velocity, and variety.

Yet the vast majority of Twitter studies look at only a tiny portion of that trillion-tweet archive. A researcher might, for example, run a keyword search over every tweet since Twitter's inception and end up with a dataset of 100,000 tweets. Does searching a trillion tweets for a keyword count as a trillion-tweet "big data" analysis? Even if sentiment mining is then performed on the results, only 100,000 tweets take part in the actual analysis: all of the findings, patterns, and conclusions rest on 100,000 tweets, not a trillion.
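
The pattern looks something like the sketch below: a cheap filter over the huge archive, with the expensive analysis applied only to the matches (the tweet list and sentiment scorer here are hypothetical stand-ins, not a real Twitter or NLP API):

```python
# Hypothetical sketch: keyword-filter a huge archive, then analyze only the hits.
from typing import Iterable, Iterator, List, Tuple

def keyword_filter(tweets: Iterable[str], keyword: str) -> Iterator[str]:
    """Yield only tweets containing the keyword; everything else is never analyzed."""
    for tweet in tweets:
        if keyword.lower() in tweet.lower():
            yield tweet

def score_sentiment(tweet: str) -> float:
    """Placeholder; a real pipeline would call a sentiment model here."""
    return 0.0

archive = ["big data is everywhere", "nice weather today"]   # stand-in for a trillion tweets
matches: List[Tuple[str, float]] = [(t, score_sentiment(t))
                                    for t in keyword_filter(archive, "big data")]
print(len(matches))   # only the matching tweets are ever touched by the analysis
```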

On top of this, many commercial social media analysis platforms that offer advanced tools such as sentiment mining or topical tagging apply their algorithms only to a random sample of the search results, which can be as small as 1,000 randomly selected tweets.

Are "big data" analyses really possible when only 1,000 randomly selected tweets are analyzed from a trillion tweets????

The commercial sector is worth considering too. In 2013, Facebook's internal enterprise data warehouse held more than 250 petabytes of data, yet according to the company, only 320 terabytes of it were accessed each day. In other words, despite sitting on 250 petabytes, Facebook touched only about one tenth of one percent of that data daily. The rest simply sat at rest.

This raises an interesting question: which factor really matters, the size of the dataset, its complexity, how much of it a query actually touches, or the volume of data returned?

In 2013, a 250-petabyte warehouse was impressive for Facebook, and it still is today. If data size is our only metric, Facebook's analyses are clearly "big data" based on the size of the underlying dataset alone.

But how much of that data do queries actually touch each day? 320 terabytes is still a lot of data to analyze. Spread across the 850 employees running those queries, however, it averages out to roughly 376 GB per person per day, which is significant but far less eye-popping. Saying that your data mart processes hundreds of terabytes of queries each day is impressive; saying that a small group of employees each runs a few hundred gigabytes of queries a day is not.
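
The arithmetic behind those figures, as a quick check:

```python
# Facebook warehouse figures quoted above, reduced to per-day and per-employee terms.
warehouse_pb = 250
queried_tb_per_day = 320
employees = 850

print(f"{queried_tb_per_day / (warehouse_pb * 1000):.2%} of the warehouse touched per day")  # -> 0.13%
print(f"{queried_tb_per_day * 1000 / employees:.0f} GB of queries per employee per day")     # -> ~376 GB
```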

Facebook is hardly alone in the fact that most analyses touch only a small fraction of the data. The Twitter Decahose in 2012 weighed 2.8 terabytes per month as uncompressed JSON, yet the tweet text itself amounted to just 112.7 GB, about 4% of the total dataset. Indeed, analytic platforms like Google's BigQuery use columnar storage precisely so that a query reads only the columns it needs rather than the entire dataset.

Does the sophistication of the analysis matter? If most of those 320 TB of daily queries were simple numeric range scans, do they count as "big data" any more than a Google keyword search over 100 petabytes does?

A "big data" analysis, on the other hand, would require a 100 petaflop TPU pod to run a massive neural network on just a few gigabytes of input data? Is the amount of data being analyzed more important than the complexity of the analysis?

How much weight should be given to the size of the result? An enormous neural network running on a multi-petaFLOPS array may require days of processing and terabytes of input data, yet return only a single yes/no answer. Is that "big data"? For analyses that extract simple findings such as timelines or go/no-go decisions from massive piles of input, output size seems a less than satisfactory yardstick for what counts as "big data."

All of this feeds the debate over "big data management" versus "big data analysis." I once served on an advisory board with the CIO of a Fortune 50 company who claimed his company was at the forefront of the big data revolution in its industry because it held petabytes of information. Yet when I asked where all that data came from, he said it consisted of the backup images of the company's thousands of desktop and laptop computers, and that they hardly ever used the information stored in those backups.

Maintaining multiple petabytes of on-premises storage infrastructure can be a major undertaking, requiring both specialized hardware and software engineering expertise. Even so, a cold storage system holding a few petabytes that are accessed only a few times a year is very different from a real-time analytics platform that performs complex analyses over petabytes in minutes, thousands of times per day.

CERN had stored more than 100 petabytes of data by 2013, with 13 PB on disk and the remaining 88 PB spread across eight robotic archival tape libraries. What about data that sits on tape and must be staged back onto disk, with extremely high latency, before it can be used? Again, the answer comes back to whether "big data" covers only the analysis of large datasets or also the operational complexity of storing them, whether or not they are ever accessed.

Does storing 100 PB count as "big data" if it takes the form of a file server rather than cold storage? By 2012, Facebook stored more than 100 PB of user photos and videos in what amounted to a giant file server. That data can be accessed at any time, but it feels wrong to count data that mostly just sits there, served off disk on demand, as "big data" analysis.

Just five years ago, then, Google, Facebook, and CERN each had datasets of roughly 100 PB: Google's was searched by keyword, the bulk of CERN's sat in cold storage on tape, and Facebook's was served off disk as a file server. Can we classify any, or all, of these as "big data"?

What about companies that outsource their "big data storage" to the commercial cloud? Can a company claim to be "big data" if it merely keeps a few petabytes in Google's or Amazon's cold storage offerings? And if we never do anything with the data we park in the cloud, are we still a "big data" company, even though we no longer have to worry about the engineering and operational requirements?

A company that installs a five-petabyte tape backup system on premises for its global desktop backup program can at least claim to run a "petascale storage fabric." If all of those desktops simply uploaded their backup images to the cloud instead, that petascale storage fabric would be run by the cloud provider. Is the company still a "big data" company if it outsources all of that work?

That, after all, is precisely the point of commercial cloud computing: the companies that pioneered petascale storage and analytics now run it for everyone else. When a company like Google can operate your petascale analytics infrastructure through BigQuery, letting you leverage the collective engineering, performance, and analytic creativity of its workforce, why bother managing your own?

If we are counting "big data" sizes, do we count all of the data generated, or only the data actually written to disk? For comparison, the Large Hadron Collider generated over one petabyte of data a second in 2013, an enormous volume even by today's standards. That data, however, is not kept: a specialized prefiltering process selects only around 100 gigabytes per second of the stream, which is then reduced by a further 99 percent to roughly one gigabyte per second. From a petabyte per second of raw data, only about 25 petabytes per year ever reach long-term storage.
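
The reduction chain described above, as rough arithmetic (a sketch using only the figures quoted; the ~25 PB/year in the text presumably reflects the collider not running year-round):

```python
# LHC data-reduction chain, using only the figures quoted above.
raw_bytes_per_s = 1e15        # ~1 PB/s generated by the detectors
after_prefilter = 100e9       # ~100 GB/s kept by hardware prefiltering
after_selection = 1e9         # ~1 GB/s kept after the further 99% reduction

print(f"prefilter keeps {after_prefilter / raw_bytes_per_s:.3%} of the raw stream")     # -> 0.010%
seconds_per_year = 365 * 24 * 3600
print(f"~{after_selection * seconds_per_year / 1e15:.0f} PB/year if recorded non-stop") # -> ~32 PB
```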

Do we rate a "big data" project by its theoretical data volume or by what is actually recorded? A company with GPS trackers in its vehicles could, in theory, log every vehicle's position every second and generate petabytes of data per day. In practice, the hardware cannot update that quickly, so positions are recorded every few seconds or every minute. Which number counts: the theoretical multi-petabyte-per-day stream, or the gigabytes the company actually collects each day?
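
To make the gap concrete, here is a sketch with entirely made-up fleet numbers; the point is the ratio between the theoretical per-second stream and what actually gets recorded, which with these assumptions is a factor of 600 regardless of fleet size:

```python
# Theoretical vs. actually recorded GPS volume for a fleet; every number is an assumption.
def daily_volume_gb(vehicles: int, fixes_per_day: int, bytes_per_fix: int) -> float:
    """Raw position data per day, in gigabytes."""
    return vehicles * fixes_per_day * bytes_per_fix / 1e9

SECONDS_PER_DAY = 24 * 3600
theoretical = daily_volume_gb(1_000_000, SECONDS_PER_DAY, 1_000)  # every second, rich record
recorded = daily_volume_gb(1_000_000, 24 * 60, 100)               # one small fix per minute

print(f"theoretical: {theoretical / 1000:,.1f} TB/day")   # -> 86.4 TB/day with these numbers
print(f"recorded:    {recorded:,.0f} GB/day")             # -> 144 GB/day with these numbers
```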

Is "big data" confined to the realm of digitally stored information? A vast archive of scanned pages, OCR'd text, and search indexes was created by Google in 2012 after scanning and digitizing more than 20 million books. As a result, the public is more likely to associate a traditional academic library with dust and obsolete technology than with cutting-edge "big data" technology. Google Books is a "big data storage" application, but the libraries it draws from and represents only a fraction of are not.

To sum up, what does "big data" actually mean in 2019? Do we measure the size of the underlying dataset or only the data a query touches? Do we count the theoretical data volume or just what is actually written to disk? Does it matter whether the data sits on premises or in the commercial cloud, or whether it is digital at all? Big data is a complex concept, and it is worth taking a step back to ask exactly what we mean when we use the term.

