The future of Data Science

With the advent of computers and the internet, and with virtually everything now connected through the ‘Internet of Things’, a gigantic amount of data is generated continuously. The ever-expanding horizon of ‘Data’ is growing exponentially – the size of the digital universe was expected to double every two years beyond 2020.

The Covid pandemic might have pushed that rate even higher. ‘Big Data’ is perhaps the biggest hype of recent years. People aspire to slice and dice data to devise effective strategies in every aspect of life and lifestyle. Michael Lewis’ 2003 book ‘Moneyball’, and the 2011 movie based on it starring Brad Pitt, told the real story of how the Oakland Athletics’ general manager Billy Beane used past data and analytics to achieve huge success in Major League Baseball despite a lean budget. The ‘Moneyball’ culture has since seeped into almost every corner of our lives.

And this has created a new class of professionals – data scientists – a role that a 2012 Harvard Business Review article famously declared the sexiest job of the 21st century. A recent and growing phenomenon has been the emergence of ‘data science’ programs at leading universities and institutes around the world, and India is certainly no exception. But is data science really going to reshape our lives?

And how easy is it to leverage that huge volume of data? We have neither the statistical expertise to handle thousands of variables, nor suitable computational algorithms and equipment to process billions of data points. Even where algorithms exist, standard computers are inadequate for data of this gigantic volume.

Data has always been instrumental in the development of science and the growth of human knowledge. Nearly two centuries ago, Charles Darwin’s theory of natural selection, set out in his seminal book ‘On the Origin of Species’, was largely based on observational data that he collected on his voyage around the globe between 1831 and 1836, as part of the survey expedition of HMS Beagle.

About a century and a half ago, from data collected in his experiments on pea plants, Gregor Mendel developed the three principles of Mendelian inheritance that describe the transmission of genetic traits. So, historically, science has been data-driven on many important occasions. The only difference is that data now arrive in waves.

Statistics, certainly, is a data-driven science. But the main focus of Statistics is to develop theories – possibly based on data insights. Take the example of Sir Francis Galton, a cousin of Charles Darwin. In 1884, Galton created the Anthropometric Laboratory in London, a centre for collecting data on people who volunteered to participate. He managed to get data on more than 10,000 people, which was certainly big data at that time.

Galton observed a clear pattern in these data: children of parents who lie at the tails of a distribution tend to lie closer to the middle of the distribution. He coined the term “reversion towards mediocrity” for this phenomenon; today, we call it regression to the mean. In the process, Galton invented what we now call the correlation coefficient. Later, data on the ratio of ‘forehead’ breadth to body length for 1,000 crabs sampled at Naples by Professor W.F.R. Weldon provided the impetus for Karl Pearson’s earliest statistical innovations. They also led Pearson, one of the doyens of statistics in its formative phase, to shift his professional interests from an established career as a mathematical physicist to a new one as a biometrician, or statistician.
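Both ideas are easy to see in a small simulation. The sketch below uses hypothetical parent and child heights (not Galton’s actual records): it computes Pearson’s correlation coefficient and then checks where the children of the very tallest parents sit relative to the overall mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical heights (cm): a shared family component plus independent noise,
# so parent and child heights are correlated but not identical.
family = rng.normal(170, 5, n)
parent = family + rng.normal(0, 5, n)
child = family + rng.normal(0, 5, n)

# Pearson's correlation coefficient between parent and child heights.
r = np.corrcoef(parent, child)[0, 1]
print(f"correlation coefficient r = {r:.2f}")

# Regression to the mean: among the tallest 5% of parents,
# the children are, on average, pulled back towards the overall mean.
tall = parent > np.quantile(parent, 0.95)
print(f"mean height of the tallest parents: {parent[tall].mean():.1f} cm")
print(f"mean height of their children:      {child[tall].mean():.1f} cm")
print(f"overall mean child height:          {child.mean():.1f} cm")
```

With the numbers chosen above, the correlation works out to roughly 0.5, and the children of the tallest parents average noticeably closer to the overall mean than their parents do – Galton’s ‘reversion’ in miniature.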

A remarkable event in the history of Statistics, for sure! It is also well known that extensive study of anthropometric data helped Professor Prasanta Chandra Mahalanobis arrive at his D-squared statistic, his most notable research contribution. In the early 1900s, William Gosset, writing under the pseudonym ‘Student’, used data from the Guinness brewery to develop the famous Student’s t-distribution, which remains widely used to this day.

Take an interesting example from the 1930s. A woman colleague of the legendary British statistician R.A. Fisher claimed that she could tell whether tea or milk had been added first to a cup of tea. Fisher wanted to verify the claim. He prepared eight cups of tea: milk was added first in four of them, and tea was added first in the remaining four. His colleague correctly identified six cups, three from each group. Fisher devised a test procedure for analysing these data, a procedure now widely treated as one of the two supporting pillars of the randomization analysis of experimental data in the statistical literature.
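As a rough sketch of the calculation behind such a test (not Fisher’s original write-up), the chance of doing at least this well by pure guessing can be worked out from the hypergeometric distribution, since the taster knows that exactly four cups are of each kind:

```python
from math import comb

# Tea-tasting setup: 8 cups, 4 with milk added first; the taster must
# pick which 4 those are.  Under the null hypothesis of pure guessing,
# the number of correctly picked milk-first cups is hypergeometric.
total_cups, milk_first, picks = 8, 4, 4
correct = 3  # three of the four milk-first cups were identified correctly

def prob_exactly(k: int) -> float:
    """Probability of exactly k correct milk-first picks under random guessing."""
    return comb(milk_first, k) * comb(total_cups - milk_first, picks - k) / comb(total_cups, picks)

# One-sided p-value: probability of at least `correct` right answers by chance.
p_value = sum(prob_exactly(k) for k in range(correct, min(milk_first, picks) + 1))
print(f"P(at least {correct} correct by guessing) = {p_value:.3f}")  # about 0.243
```

Getting three of the four milk-first cups right can happen by guessing almost a quarter of the time; correctly classifying all eight cups, by contrast, would occur by chance only once in 70 trials, a probability of about 0.014.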

About 60 years ago, American mathematician and statistician John Tukey – another doyen of modern statistics – called for a reformation of academic statistics. Through his 1962 paper ‘The Future of Data Analysis’ in the journal The Annals of Mathematical Statistics, Tukey deeply shocked his readers (academic statisticians) when he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’.

The term ‘data science’, in the sense of “the science of dealing with data”, was, however, first used by the Danish computer science pioneer and Turing Award winner Peter Naur in 1960, as a substitute for computer science. Today’s data science is certainly a combination of statistics, mathematics, algorithms, engineering chops, and communication and management skills. Still, many continue to perceive data science as essentially statistics. The American Statistical Association, after all, defines Statistics as the “science of learning from data”.

So, there is enough scope for confusing data science with statistics, for sure. C.F. Jeff Wu, now a professor at the Georgia Institute of Technology, delivered a famous lecture entitled ‘Statistics = Data Science?’ at the University of Michigan in 1997 and at the Indian Statistical Institute in 1998. The confusion somewhat continues.

I came across a telling comment by Andrew Gelman, a professor at Columbia University: “Statistics is the least important part of data science.” Meanwhile, reliance on computer programs and software has become almost synonymous with ‘data analysis’. In 2017, David Donoho, a professor of statistics at Stanford University, wrote an interesting paper in the Journal of Computational and Graphical Statistics entitled ‘50 Years of Data Science’.

Starting from Tukey’s 1962 paper, he traced the development of data science. At the end of the paper, Donoho also looked ahead to its future. He wrote: “In 2065, mathematical derivation and proof will not trump conclusions derived from state-of-the-art empiricism. …theory which produces new methodology for use in data analysis or machine learning will be considered valuable, based on its quantifiable benefit in frequently occurring problems, as shown under empirical test.”

About half a century from now, will things really be so straightforward? I doubt it. A shade of uncertainty remains. I have seen an interesting meme on the internet: a statistician and a data scientist are sitting together at a press conference. All the media microphones, however, are placed in front of the data scientist – everyone wants to hear from him – while the statistician sits idle, with nobody seeking the statistician’s view. That, in fact, reflects both the present hype and the immediate future of data science.

(The writer is Professor of Statistics, Indian Statistical Institute, Kolkata.)