Data refers to a collection of facts and statistics gathered together for analysis. Our lives, in one way or another, are shaped by the collection and processing of various forms of data. This data can be anything, from the number of times you’ve ordered from your favorite restaurant to the number of traffic signals you encounter on your daily drive. Your brain then processes this data to give you a handy set of results, like a change of menu or an alteration of the route. In essence, this is exactly what data science is.
The primary purpose of Data Science is to process and analyze raw data to produce conclusive, usable results in each scenario. You can think of data science as preparing food. First, one must clean the meat and vegetables to be used, followed by chopping and sorting the raw ingredients into manageable pieces. Then one must cook the meat and vegetables carefully, adding spices and other ingredients to help the process along. Finally, one needs to serve the food appealingly.
Similarly, in Data Science, one must first turn the raw data into usable data through processes like data cleansing, data munging, and ETL (extract, transform, and load). The data is then sorted into various categories so it can be processed in appropriate ways. Following this, the data is worked on by algorithms, and various sources and features are added to help turn the data into usable results. Finally, these results are presented as easy-to-understand graphics and charts to be utilized for various purposes.
Now that we understand the basics of what Data Science is, let’s take a deeper dive into various facts and statistics associated with it.
Basic Facts about Data Science
Data science is a vast field and can seem daunting to get into for the uninitiated. So first, let’s start by learning some basic facts about this industry.
1. The term Data Science is a lie: This is to say, data science isn’t classified as a scientific field of study. Data scientists don’t generally work in academia, and their work doesn’t revolve around research and publishing papers. The term was coined in 1974 by Peter Naur in his book Concise Survey of Computer Methods. However, the term’s current definition originated in 1996, during the second Japanese-French Statistics Symposium, to refer to the methods, and the people who utilize them, to analyze various data.
(Source: Perceptual Edge)
2. Users generate nearly 900 exabytes of raw data: Raw data refers to all the data collected from a certain source using a basic set of parameters; for example, the user data for everyone who bought a Toyota car. Users are responsible for generating 900 exabytes (1 exabyte is equivalent to 1 billion gigabytes!) of data, of which various enterprises store 80%.
(Source: Towards Data Science)
3. Only 0.5% of the data we create is ever processed: The human race is constantly creating gigantic amounts of data. Our lives have become increasingly digital, giving rise to a massive amount of data on a moment-to-moment basis. However, only a minuscule fraction of this data is ever processed and organized.
(Source: Analytics Training)
4. Nearly 80% of a data scientist’s job is just cleaning! Data cleansing refers to the process of removing or correcting corrupted or unimportant records in raw data. It is the first step in any data science application. The expected output of this procedure is data that is consistent with the rest of its set. Cleaning takes up the majority of the time in a project, as having a clean data set improves the efficiency of all further steps by orders of magnitude.
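As a toy illustration of what cleansing involves, here is a minimal sketch in plain Python. The field names ("name", "age") and validation rules are made-up assumptions for the example, not any standard cleansing recipe:

```python
# Minimal data-cleansing sketch: drop duplicates and records with
# missing or malformed fields, and normalize what survives.

def clean(records):
    """Return only well-formed, deduplicated records."""
    seen = set()
    cleaned = []
    for row in records:
        name = (row.get("name") or "").strip()
        try:
            age = int(row.get("age"))
        except (TypeError, ValueError):
            continue  # malformed age -> discard the record
        if not name or age < 0:
            continue  # missing name or impossible age
        key = (name.lower(), age)
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned

raw = [
    {"name": "Ada ", "age": "36"},   # messy but salvageable
    {"name": "Ada", "age": 36},      # duplicate
    {"name": "", "age": 22},         # missing name
    {"name": "Bob", "age": "n/a"},   # malformed age
]
print(clean(raw))  # only one normalized "Ada" record survives
```

Even in this tiny example, three of four records are discarded or merged, which hints at why cleansing eats so much project time on real, messy data.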
5. Artificial Intelligence does most of the work: When working with large amounts of data (large companies often need hundreds of terabytes processed), it is impractical to create models and algorithms to process it all manually. So, data scientists train machine learning models to do much of the work for them. Machine learning is a way to teach a computer to react dynamically to different types of data and, in turn, process it accordingly. Machine learning algorithms take much longer to create than simple hand-written ones, but they are very effective at their jobs and leave far less room for human error.
(Source: Machine Learning Mastery)
6. Graphic Designers are data scientists too: Once the respective algorithms have generated the required results, these need to be collated and organized into a form that is easy to present and forward to the respective destinations. Here is where graphic designers come in. These are usually data analysts with a graphic design background, and they convert numbers and statistics into easy-to-understand charts and graphs. These are then forwarded to the various analysts who can use the results to benefit the organizations for which the scientists work.
The Massive Data Science Industry
Now that we understand the basics of how data science works and what the various terms associated with it are, let’s take a brief look into just how huge the Data Science industry is.
7. 90% of all data was created in the past three years: Due to the rapid rise of IoT and the interest in Big Data, the amount of data we create has shot up massively, with some estimates suggesting that nearly 90% of all data in existence was created in the past three years alone.
(Source: Base Line)
8. We create nearly 2.5 quintillion bytes of data every day: Most of us now have access to at least two devices connected to the internet, which means we keep creating more data about ourselves every single second. All this data resides in the digital universe until someone makes use of it.
(Source: Data Never Sleeps)
9. Internet traffic is growing at a rate of more than 50,000 GB/s: According to estimates by IBM, internet traffic in 2018 reached 50,000 gigabytes per second. Some of the things that happen on the internet every minute: 72 hours of footage are uploaded to YouTube, 216,000 posts are made on Instagram, 204 million emails are sent, and 500,000 cat/dog images are viewed (a very ruff estimate).
(Source: IBM Big Data & Analytics Hub)
10. It is projected to be a 40.6-billion-dollar industry soon: Data analytics is a multi-billion-dollar, rapidly expanding industry, projected to be worth 40.6 billion dollars globally by 2023, a compound annual growth rate of 29.7% from 2017. In India alone, this industry is worth nearly 2.7 billion dollars and is expected to be worth almost 3.3 billion dollars by 2021.
(Source: Inside Big Data)
11. The Big Data job market is going to grow by 12%: At a time when most major industries are witnessing large-scale layoffs, an increasing number of companies are looking to hire data analysts. The job market is expected to grow by at least 12% by 2024, with the median pay expected to be around 110,000 dollars annually.
12. Federal investments of up to 200 million dollars are being made in data science: In 2012, the Obama administration invested 200 million dollars in developing the big data analytics industry. This investment was directly responsible for the rapid rise of data science in the US and led to a series of further investments in each following year, totaling more than 125 million dollars across various federal agencies utilizing data science.
13. A mere 10% increase in data accessibility increases revenue by 65 million dollars: Data quality and accessibility are the two highest priorities for large companies. Higher-quality data produces far better results than unreliable data, and those results can then be utilized to design campaigns, products, services, etc., that improve the company’s market share. According to Forbes, for a Fortune 1000 company, the increase in revenue can be up to 65 million dollars for just a 10% increase in the accessibility of its data.
14. Bad/corrupted data causes major losses to industries: Given how integral data analytics has become to industry, it is no surprise that bad data can cause horrific losses to companies. Each year, nearly 3.1 trillion dollars are lost in the US alone due to bad data. This number is estimated at almost 21 trillion dollars worldwide.
(Source: IBM Big Data & Analytics Hub)
A Brief History of Data Science
Having looked at the current state of the data science industry and understood the massive impact it has on the modern world, it is fascinating to note that the industry only came into being in the early 2000s. Once advancements in compression and storage had been made, it became cheaper and cheaper to store and process data. The easy availability of large quantities of storage and powerful processing capabilities led to the rise of Big Data. Let’s briefly explore the milestones and upheavals this industry went through, from its inception to the state it is in now.
15. A college student laid the groundwork for Data Science: As we have seen, data science depends heavily on having large amounts of data to work with. The mass storage of data is only possible thanks to a compression algorithm created by David Huffman in 1951, when he was a student at MIT. Huffman devised his encoding scheme as a term paper for an information theory course. This technique forms the basis of most modern compression algorithms and is the reason large companies can store and process millions of terabytes of data without much trouble.
16. The first commercially viable personal computer became available in 1981: IBM created the IBM PC and released it to the public in 1981, marking the first time common users could generate and interact with data in a meaningful way. The home computer market was further helped along by Apple’s Lisa in 1983 and then the meteoric rise of Microsoft Windows following its launch in 1985. These advancements led to the public starting to generate data that could be fruitfully utilized.
17. Data Science was recognized as a field in 1996: In 1996, the members of the International Federation of Classification Societies recognized and classified Data Science as a field of emerging importance. This classification helped data science get the attention it needed to enter the mainstream and kick-started interest in the field among various organizations.
(Source: Data Science, Classification, and Related Methods)
18. Google used data science to beat its rival by 2003: Both AltaVista and Google were used more and more as the internet became commonplace, and by 2000 both search engines had started using rudimentary data science methods to improve their search efficiency. Recognizing the potential of data science, Google invested heavily in data analysis techniques. Its proprietary PageRank algorithm, which ranks pages by the structure of the links pointing to them, directly led to Google becoming the most used search engine by 2003, a position it still holds.
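The core idea behind PageRank can be sketched with a short power-iteration loop (a simplified illustration, not Google’s actual implementation): each page repeatedly shares its rank along its outgoing links, plus a small "random surfer" baseline, until the scores settle:

```python
# Simplified PageRank by power iteration over a tiny link graph.

def pagerank(links, damping=0.85, iters=50):
    """links: {page: [pages it links to]}. Returns a rank per page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}       # start uniform
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}  # random-surfer baseline
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += damping * share    # pass rank along each link
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
# "c" is linked to by both "a" and "b", so it ends up ranked highest.
print(ranks)
```

The insight is that a link acts as a vote, and votes from highly ranked pages count for more, which is exactly what the iteration converges to.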
19. In 2015 there were more than five times as many devices as people on Earth: In the past decade, the number of people owning and operating a device connected to the internet has increased manyfold, with the number of devices far exceeding the population of the Earth. This, in turn, has enabled data science as an industry to flourish, since there are now tons of ways to get useful data about an individual, from their smartphones to their wearable electronics.
Applying Data Science to Solve Everything
We all know that the answer to everything in the Universe is the number 42. However, arriving at this conclusion takes a lot of computing power and data. After all, it was the most powerful Artificial Intelligence ever, working with the largest data set in the universe, that came up with the number 42 in the first place.
While modern data science hasn’t yet reached the capability to answer every question in the universe, it is slowly approaching that state. Let’s take a look at the various ways data science is being applied by organizations all around us to answer any and all questions they might have.
20. Most banks now use data science to approve loans: One of the most widespread uses of data science is in the banking and finance sector. Most major banking companies rely on something known as a “Credit Score,” which is a number assigned to every individual.
This score is assigned by an algorithm that takes all the available financial information for the individual, from their banking statements to their salary, housing status, family assets, etc., and generates a single tangible score that ranks them against every other applicant who has been given a credit score.
This score is then taken into consideration by banks to assign the appropriate scheme to the applicant.
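A purely illustrative sketch of such a scoring algorithm might look like the following; the features, weights, and score range here are invented for the example and do not reflect any bank’s real credit model:

```python
# Hypothetical credit-scoring sketch: clamp each financial feature
# to [0, 1], weight it, and add it to a 300-point floor.

def credit_score(applicant):
    """Map a few financial features to a single 300-850 style score."""
    weights = {
        "on_time_payment_rate": 400,  # share of bills paid on time (0-1)
        "income_to_debt": 100,        # capped income-to-debt ratio (0-1)
        "years_of_history": 50,       # account age, capped at 1.0
    }
    score = 300  # floor for an applicant with no data
    for feature, weight in weights.items():
        value = min(max(applicant.get(feature, 0.0), 0.0), 1.0)
        score += weight * value
    return round(score)

print(credit_score({"on_time_payment_rate": 0.95,
                    "income_to_debt": 0.6,
                    "years_of_history": 1.0}))  # -> 790
```

Real models are far richer (and often machine-learned), but the principle is the same: many financial signals are collapsed into one comparable number.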
21. Stockbrokers are getting their jobs done for them: As if people needed any more reason to dislike stockbrokers! Most modern stockbroking and trading companies rely on proprietary algorithms to track the stock market. These algorithms take into account a vast array of parameters related to a company’s performance. After collating these parameters, the algorithm can predict market movements with reasonable accuracy. This information is then put to use by the stockbrokers.
22. Machine learning algorithms are a powerful weapon in the fight against cancer: Until recently, doctors had to rely on visual analysis of patients’ MRI and CT scans to identify the presence of cancerous cells. While medical professionals are among the most skilled and careful people, human error is unavoidable. Hence, to prevent misdiagnosis, extensive research has yielded many tools that help doctors process the scans through machine learning algorithms, which can diagnose patients accurately and greatly reduce human error.
23. Machines monitor your every move to bring you the product/service you desire: Ever wonder how Amazon or Flipkart knows exactly what you want to buy and when? Has it ever happened that you started getting ads for flour or sugar right when you ran out of them? Big corporations achieve such marketing by using data science to track your purchases and searches online and bring you products and services targeted exclusively at you.
24. Tagging people on Facebook has never been easier!: As soon as you upload a photo to Facebook, you instantly get an option to tag all your friends in the photo. This is due to a facial recognition algorithm working behind the scenes. Facial recognition is a subset of image processing: algorithms designed to extract usable information from an image and use it for a particular purpose. In this case, Facebook extracts information about a person’s face from their images and uses it to suggest people to tag.
25. Blame Artificial Intelligence the next time you can’t find an appropriate flight: Aviation as an industry is struggling heavily, with most major carriers operating at a loss. To help recoup some of these losses, more and more of them are turning to data science to help make management decisions. They use machine learning to predict delays, survey users to figure out what services to offer, build loyalty programs, figure out connections between flights, and more.
(Source: Analytics India Magazine)
26. Your game is playing you!: Even the gaming industry has recognized the potential of data science and is actively using it to make games more appealing to players. For example, games like Fortnite and World of Warcraft study their players extensively to find out their gaming habits, how much they are expected to play, and why. Using this information, the developers can make their games more enticing, incentivizing players to play for longer periods of time.
27. Are you not entertained?: Entertainment companies, especially those that own and operate a streaming service, use deep neural networks to track your likes and dislikes on their service. These are then put to use crafting the perfect library of entertainment for you, be it Netflix suggesting movies, Spotify suggesting songs, or even Steam suggesting games. All of this is done to ensure you have the best experience you can on their service and keep paying them money!
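A much simpler stand-in for those deep networks, illustrating the same idea, is neighbour-based collaborative filtering: recommend items enjoyed by users whose tastes resemble yours. The movie names and ratings below are invented for the example:

```python
# Toy collaborative filtering: score unseen items by how similar
# their fans are to the target user.
from math import sqrt

def similarity(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def recommend(target, others):
    """Suggest unseen items, weighted by each neighbour's similarity."""
    scores = {}
    for other in others:
        sim = similarity(target, other)
        for item, rating in other.items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

me = {"Movie A": 5, "Movie B": 4}
others = [{"Movie A": 5, "Movie C": 5},   # very similar taste
          {"Movie B": 1, "Movie D": 2}]   # mostly disagrees with me
print(recommend(me, others))  # "Movie C" comes first
```

Production recommenders add time, context, and learned embeddings on top, but "people like you also liked X" is still the heart of it.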
28. Self-driving cars are closer than you think, all thanks to data science: Self-driving cars have always been a staple of fiction, but they are quickly turning into fact. Utilizing road data from sources such as traffic cameras, GPS systems, voluntary data collectors, and more, many companies have been able to create working prototypes of self-driving cars. In particular, Tesla cars are constantly learning on the road and improve over time, adapting to your particular city and driving preferences.
Bibliography & Data Sources:
- Perceptual Edge
- Towards Data Science
- Analytics Training
- Machine Learning Mastery
- IBM Big Data & Analytics Hub
- Data Never Sleeps
- Base Line
- Inside Big Data
- Data Science, Classification, and Related Methods
- Analytics India Magazine