The Center for Data Science Manifesto

Yann LeCun, November 2011, (rev4.1 2012-02-12)

History of this Document

A version of this document was first written in the spring of 2009, and refined over the years until the fall of 2011, in part to alert the NYU administration to the growing importance of data-driven science, and the emergence of the field of data science. In the fall of 2011, NYU launched a University-Wide Initiative in Data Science and Statistics, under the leadership of Gérard Ben Arous, the new director of the Courant Institute and the Vice Provost for Science and Engineering. I was asked to chair a working group to define what NYU should do in this area. The working group was composed of faculty from all over the university, and submitted a report in the summer of 2012 recommending the creation of the multi-disciplinary Center for Data Science (http://cds.nyu.edu), and the creation of a Master of Science in Data Science and a PhD program in Data Science. The launch of the CDS was publicly announced in February 2013, and I was named founding director. The first MSDS students start in September 2013. The PhD program is pending approval.

Executive Summary

  • The deluge of data from science, business, government, and the digital society has caused the emergence of a new discipline, Data Science, whose goal is to develop methods, software, and systems that automatically extract knowledge from data. As the flow of data increases exponentially, it is increasingly processed, analyzed, and acted upon by machines, not humans.
  • The data deluge has created a need for “data scientists”, in industry, academia, and government. The domain of competence of a data scientist is a combination of statistics (particularly model estimation and selection), computer and information science (machine learning, database systems, AI, data mining, scientific computing), applied mathematics (numerical methods, optimization, stochastic processes, probability), and knowledge in the relevant application domains. No existing program at NYU (or anywhere else) trains students with the comprehensive combination of skills required to be a well-rounded data scientist.
  • The demand from industry for data scientists at the MSc and PhD levels is large, growing very quickly, and largely unfulfilled. Creating a program to fill this need is a huge opportunity for NYU in general, and for the Courant Institute in particular. The primary industries with large activities in Data Science include web and e-commerce companies, pharmaceutical, financial, telecom, information technology, and defense. These sectors are particularly strong in the NYC area.
  • If we judge by the amount of funding, the number of top PhD applicants, and the number of faculty involved, Data Science has become a topic of major importance in statistics, computer science, and applied mathematics. Several influential mathematicians (from all specialties) have mentioned “large datasets” as the source of the biggest challenges for mathematical scientists in the coming decades.
  • We are planning to create a Center for Data Science, and a graduate program in Data Science at NYU. The center and program will be hosted at Courant, but will involve faculty from many different NYU schools and departments. A significant number of NYU faculty are already involved in Data Science research and teaching all over the university.
  • As Data Science gains in importance in Academia and society, the creation of significant research and teaching activities in Data Science at Courant is necessary to preserve the Courant Institute's prominence in applied mathematics in the coming decades.
  • Because of the high demand for data scientists, and the absence of such programs at all but a few universities around the country, there is little doubt that a graduate program in Data Science at NYU will be an instant success.
  • Data Science is becoming an essential component of progress in many fields of science and the humanities, including biology and genomics, neuroscience, AI/robotics, astronomy/cosmology, high-energy physics, medicine, and finance. It is fast gaining importance in fields such as linguistics, economics, law, history, sociology, even political science and international relations.
  • A Center for Data Science would make it easier to win the increasingly large government funding devoted to the subject, and would make it possible to raise funds from industry partners (through an industry affiliate program) as well as from donors.
  • The creation of the MSc program would require the initial hiring of a few tenure-track or tenured faculty. Together with the existing NYU faculty whose main research focus is Data Science, they would constitute the core of the Center for Data Science. A number of faculty from all schools and many departments within NYU will be affiliated with the center.
  • Finding space to regroup the core faculty and their graduate students and postdocs is highly desirable, but will require major donations and investment from the university.



The Age of the Exabyte

We live in the “Age Of The Petabyte”, soon to become “The Age Of The Exabyte”. Our networked world is generating a deluge of data that no human, or group of humans, can process fast enough. Some of that data is readily interpretable by machines, but much of it consists of unstructured data captured from the real world using a variety of sources: sensors from scientific experiments, pictures and videos from the Web, Web usage data, location data from smartphones, link data from social networks, customer data from e-commerce websites, transaction data from financial companies, text from news agencies, blogs, and collaborative filtering websites, usage data from credit card companies and utilities, the list goes on.

The deluge of data is transforming the way science and medicine are carried out. New knowledge is being derived by automatically analyzing massive amounts of data. New fields of science are emerging that are built around the automated analysis of massive datasets. These include new branches of genomics, biochemistry, astronomy and cosmology, high-energy physics, and neuroscience (particularly brain imaging). Several areas of social science and the humanities are already starting to be influenced by automated data analysis, including law, economics, sociology, political science, and history. The trend is only going to accelerate in the coming years.

We will see an increasingly large number of traditional disciplines spawning new sub-disciplines with the adjective “computational” in front of them. We already know about computational physics, computational neuroscience and computational biology. We will soon see the emergence of computational economics, computational history, computational psychology, and many others.

The common thread of these new domains will be their reliance on automated methods to analyze massive amounts of data, and to extract knowledge from them. While each application domain has its own set of specific techniques, a new discipline has emerged whose object is to provide the underlying methods of the data revolution. This new discipline overlaps multiple traditional disciplines including Statistics, Computer Science, Applied Mathematics, and application domains. This emergent discipline is known by several names: computational statistics, machine learning, or even “modern” Artificial Intelligence.

We will call it “Data Science”.

The phrase Data Science has gained acceptance because its methods emerge at the intersection (or the union) of several existing disciplines (mainly statistics, machine learning, AI, applied mathematics, and databases), without being entirely contained in any one of them.

The Data-Centered World

We are at the cusp of what Joe Hellerstein of Berkeley has called “The Industrial Revolution of Data” [1], in which information (and perhaps “knowledge”) is primarily produced by machines, not people.

Data Science has already revolutionized the commercial world. Without the tools of Data Science, companies such as Google, Amazon, Facebook, and LinkedIn would not be the successes that they are. Even brick-and-mortar companies such as Walmart owe their success partly to Data Science: tracking and predicting customer and supplier behavior allows them to optimize stocks and prices. More broadly, all modern enterprises rely on so-called “business analytics”, which is the business world's name for its application of data science. Business analytics is an increasingly strategic segment: business analytics startups are being bought by large consulting and services companies such as IBM and SAP.

There is a huge, and largely unmet, need for data science specialists on the job market today. Established companies such as Google, IBM, AT&T, and Amazon hire large numbers of graduates with computer science, mathematics, statistics, or engineering backgrounds to turn them into data scientists, but they are not the only ones. Practically every Wall Street company now uses computational methods to predict financial data, analyze business news, and automate trading. They recruit in financial mathematics programs, but also in statistics and machine learning. Every company with an e-commerce website (even brick-and-mortar stores) relies on data analytics to manage customers and control prices, sales, and stocks. Many mobile app companies collect usage and location data from users. Services companies such as banks and telecom companies collect data on customer behavior for credit assessment, churn prediction, and fraud detection. Biotech and pharmaceutical companies hire data scientists to analyze biological data and drug testing results.

Data Science is so important that the CTOs and Chief Scientists of a large proportion of successful web companies have backgrounds in AI, Machine Learning or Computational Statistics (e.g. Google, bit.ly, Microsoft's on-line advertising and search divisions, Yahoo! labs).

Data scientists may be the most sought-after graduates of the next few years. The New York Times ran an article in August 2009 entitled “For Today’s Graduate, Just One Word: Statistics” which included the following quote [2]:

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.”
Hal Varian, Chief Economist at Google

Another quote from the same article [2] is from Erik Brynjolfsson, director of the MIT Center for Digital Business:

“…But the big problem is going to be the ability of humans to use, analyze and make sense of the data.”

Data Science will also revolutionize medicine, education, and other areas that are slated to become “evidence-based”. Here is a quote from Peter Orszag [3], former director of the White House Office of Management and Budget:

“Robust, unbiased data are the first step toward addressing our long-term economic needs and key policy priorities. […] I noted two particular areas where more and better data would be useful: health care and education.”

A Gaping Educational Void

Despite the huge and growing demand for data scientists, not a single university in the US offers a degree in Data Science today. There is a gaping void to fill, and a huge opportunity for the first universities to fill it.

Existing programs in statistics, computer science, applied mathematics, information science and business management include courses related to data science, but they don't quite fit the bill. Many computer science departments offer courses in machine learning, many statistics departments offer courses in data analytics, many business schools and schools of information science have programs in data mining, and some applied mathematics programs offer courses in data analysis, visualization and optimization. But very few graduates of these programs have the breadth of skills required to call themselves Data Scientists.

Drew Conway's Data Science Venn Diagram

The problem is well expressed by Drew Conway, an NYU graduate student in Politics who is a major contributor to the Data Science blog “The Dataist”. Drew published the Data Science Venn Diagram [4]: CS programs don't have enough mathematics, substantive expertise in an application area, and exposure to large datasets; statistics and mathematics programs don't have enough programming, algorithmics, substantive expertise, and exposure to large datasets; and business and information science programs don't have enough mathematics, programming and algorithmics. None provides nearly enough exposure to large, real-world datasets. Machine Learning is at the intersection of computer science and math/stats, traditional research is at the intersection of math/stats and topical expertise, and data science is at the intersection of all three sets: CS, math/stats, and topical expertise.

Data Science requires skills in continuous mathematics that are currently only required from students majoring in physics, mathematics, and some engineering fields. The mathematical pre-requisites for Data Scientists include such topics as multivariate calculus, linear algebra, optimization and probability. Data Science also requires programming skills that are only offered to CS majors (C/C++ programming, scientific computing, database systems, machine learning and AI). Data Science also clearly requires solid training in statistics. In addition, Data Science can profit from skills in a number of application areas, such as signal and image processing, bioinformatics, natural language processing, etc.

There is a stupendously large educational gap to fill, and a huge opportunity for the educational institutions that fill it first.

Here is another quote from Hal Varian (Google's chief economist) from a 2009 McKinsey Quarterly article [5]:

“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.”

However, Mike Driscoll in his 2009 Dataspora.com article entitled “The Three Sexy Skills of the Data Geeks” [6] echoes Drew Conway's point when he writes:

“I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics, data munging, and data visualization.”

Nathan Yau in his 2009 Flowingdata.com article entitled “Rise of the Data Scientist” [7] reinforces this important point:

“We're seeing data scientists - people who can do it all - emerge from the rest of the pack. […] Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data.”

Many comments from readers of these articles are from employers asking “where can I find data scientists?”. The sad answer is “nowhere”. Because of this unmet demand, an MSc and PhD program in Data Science is almost guaranteed to be an instant success. The Courant Institute is particularly well positioned to seize the opportunity by putting together and hosting such a program.

In a recent study, McKinsey & Company estimates that

“the United States alone faces a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts to analyze big data and make decisions based on their findings”.

The New Data-Centered Science

While data analytics has revolutionized business practices, it is also revolutionizing the way science is carried out.

A recent Wired Magazine article quoted famous statistician George Box's well-known quip: “All models are wrong, but some are useful.” The title of the article was: “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” [8]. While this is certainly a journalistic exaggeration, the article points to a clear trend: more and more branches of science are relying on the automatic analysis of massive datasets to extract new knowledge. Some of the best examples can be found at NYU: genomics and proteomics, astronomy, high-energy physics, brain imaging and neuroscience, economics, social network analysis, text mining in law, history, and political science, etc.

Science Magazine ran a special issue in February 2011 entitled “Dealing with Data”. Even the popular science press is echoing the trend: the November 2011 issue of Popular Science is a special issue entitled “Data is Power, how information is driving the future”, with articles such as “The Glory of Big Data”, “The Rise of the Machines”, and “The Data-Centric Universe”.

Good evidence that a new field is emerging is the migration of researchers from previously unrelated fields to data-centered science. Perhaps the best example is genomics. A number of prominent researchers in genomics and computational biology have backgrounds outside of biology. A famous representative of this trend is Prof. David Haussler from UCSC who, starting as a machine learning theorist in the early 90's, went on to develop some of the first statistical methods for biological sequence analysis, and played a key role in the public effort to sequence the human genome. He now sees himself as a biologist who uses data analysis to explore the evolution of the human genome. Another one is Eric Lander, founding director of the MIT Whitehead Institute and a major contributor to the human genome sequencing project, whose PhD is in mathematics.

The Science behind Data Science

Data Science is helping the sciences, but it was born from a desire to answer one of the most important scientific questions of our times: what is intelligence, and how does the brain work? The reason Data Science attracts so many bright students is that, beyond the applications to business and science, its goal is to produce intelligent machines. Data Science is the new AI. Its methods come not only from statistics and applied mathematics, but also from theoretical and computational neuroscience. Some classes of models, such as artificial neural networks, sparse modeling, non-negative matrix factorization, convolutional networks, and others have their roots in neuroscience and cognitive science. The connection between AI, machine learning, computer vision, and robotics on one hand, and computational neuroscience and cognitive science on the other hand has proved very fruitful in both directions.

The methods of Data Science draw on statistics for Bayesian methods, parameter estimation, generalization bounds, theoretical guarantees, and inference methods such as Monte-Carlo, MCMC, and variational methods. They draw on applied mathematics for large-scale numerical methods, large-scale SVD and eigen-problems, convex optimization, non-convex optimization, and harmonic analysis. They draw on computer science for machine learning, kernel methods, parallel processing, database systems, and application areas such as computer vision, robotics, and AI in general. Areas of pure mathematics are poised to have a large impact on the “big data” problem, particularly geometry and analysis.

Data Science: an Opportunity and a Threat for NYU and the Courant Institute

Courant is ranked number one in the nation for Applied Mathematics. What has made Courant so successful is a tradition of developing new mathematics and new computational methods with an eye towards real-world applications. If Courant is to maintain its leading position, it must continue that tradition and lead the next wave in Applied Mathematics.

In the “Age of the Exabyte”, the next wave in applied mathematics is undoubtedly the analysis of large datasets. With its almost seamless boundary between CS and Mathematics, and its strong link with the Sciences, Courant is uniquely positioned to lead that wave.

In fact, if Courant does not lead the data revolution, it will almost certainly lose its dominance in the world of applied mathematics.

Margaret Wright and Yann LeCun have been serving on a National Academy of Sciences committee named “Mathematical Sciences in 2025”, which is studying future important directions in the field. The committee has been interviewing a number of leading figures in the mathematical sciences, as well as in fields that are heavy users of mathematics. When asked about the next challenge for mathematical sciences, virtually every interviewee put “large datasets” at the top of their list. This answer came not only from statisticians and biologists, but also from prominent pure mathematicians, applied mathematicians, and the leaders of almost all the professional societies, national laboratories, and major industry research labs.

Perhaps the best evidence that Applied Mathematics is on a collision course with Data Science is the research direction adopted by many leading applied mathematicians over the last few years. Many are actively involved in the development of intelligent data analysis methods. In the interviews for MathSci-2025, David Donoho (Statistician at Stanford) said “mathematics doesn't give enough credit to the empirical sciences […] we should encourage people to be mathematical phenomenologists”. Ronald Coifman (Applied Mathematician at Yale) said that the biggest challenge for the coming years is “the underlying mathematics of approximating high-dimensional functions”, “the mathematics behind high-dimensional data representations”, and “the need to invent new mathematical tools to deal with high-dimensional data”.

Data Science indeed poses interesting questions for pure mathematics and probability theory: what is the structure of data in high-dimensional spaces? Many branches of pure mathematics and probability theory could make major contributions to our understanding of data, particularly geometry. Some of the new mathematics of the 21st century will undoubtedly be inspired by the “big data” question.

The Role of Courant in a University-Wide Initiative in Data Science

With Courant's reputation in applied mathematics, the uniquely tight relationship between mathematics and computer science, and the strength of the faculty in theoretical statistics, probability, machine learning, AI, computer vision, optimization, computational mathematics, and mathematical finance, Courant has a unique mix of talents in the basic methods of data science. However, the NYU expertise in data science goes well beyond Courant. Other groups within NYU have considerable strengths in data science and its applications. This includes a number of departments in FAS, notably biology, physics, neuroscience, and economics, which have strong activities in data science for such things as genomics, astronomy, high-energy physics, brain imaging, neural signal analysis, and economic prediction. Stern has a strong presence in machine learning and predictive modeling in its Information Systems department and programs. Steinhardt has two programs that are strongly connected with data science: quantitative methods in social sciences, and music information retrieval in the music technology program. Poly has expertise in data visualization, signal processing, and robotics. The medical school has a large bioinformatics group with considerable strength in machine learning and statistical modeling, as well as a biostatistics group with a graduate program.

While NYU at large and Courant in particular have clear strengths in applied math, machine learning, theoretical statistics and probability, and a large number of application areas, other universities have relied on the strength of their statistics departments, establishing links with computer science and applied mathematics to build their efforts in data science. While the absence of a computationally-oriented statistics department at NYU could be seen as a handicap, it can also be turned into an advantage: it makes it possible to design a 21st century computational statistics and data science program from the ground up, and to leapfrog other institutions without being hampered too much by the weight of legacy.

By creating a program in data science, Courant would not replace existing programs within the university, but would strengthen them: students in these programs would be able to take classes on fundamental methods from the data science program. Conversely, students from the data science program would be able to take courses and gain expertise in a wide range of application areas.

In many fields (such as CS and biology) where research relies heavily on graduate students and postdocs, the quality of the graduate students has a considerable influence on the quality of the research, which in turn influences the ability to publish in top venues, to attract funding, and to attract top faculty. A strong PhD program in data science would help NYU at large, and Courant in particular, to attract top students, funding, and faculty. The ideal data science PhD candidate has a strong mathematical background and a strong background in computer science, with an undergraduate major in mathematics, statistics, physics, EE, or a math-heavy CS degree. But many math majors lack enough familiarity with computer science and programming. Conversely, many candidates with a pure CS background have too little background in continuous mathematics (some very prominent machine learning researchers admit to never taking North-American CS majors as PhD students for that reason). Unfortunately, many non-CS students are scared away by the course requirements of CS PhD programs (e.g. in systems), while non-Math students are scared away by Mathematics PhD requirements (e.g. in complex variables). A Data Science program would attract more top PhD candidates with a strong mathematical background and good computer science skills, and would offer them an appropriate range of courses.

The topics of Data Science, Machine Learning and AI are extremely popular among the best PhD applicants in Computer Science: during the fall 2009 PhD admission process, the Computer Science Department pre-selected roughly 120 top applications, of which approximately 35% were in ML, computer vision, or robotics, despite the fact that the NYU CS department does not have a specific program in ML/statistics, and had only two faculty whose primary interest was ML, and three in computer vision. Furthermore, ML and computer vision attract a considerable proportion of the federal and private funding into the CS department. There is a huge demand from prospective students and from funding agencies in this area, which cannot currently be met because of lack of space and faculty members.

Demand from Industry in the NYC Area and Beyond

There is a huge demand for data scientists from a quickly increasing number of companies, and data scientists are nowhere to be found (see an example of a job posting for a data scientist at Facebook in Appendix D). The occurrence of the term “data scientist” in job offers has exploded in recent years [9].

A large number of Web-oriented startup companies in New York City are entirely built around data analytics, and many have significant R&D activities in machine learning, data mining, and data science in general. The pace of recruitment in these areas has been accelerating in the last few years. The most prominent companies include Google (Manhattan), Yahoo (Manhattan), NEC Labs (Princeton), IBM (Yorktown Heights), AT&T (Florham Park), Lucent/Bell Labs (Murray Hill), Telcordia (Piscataway), and SRI (Princeton). The pharmaceutical and bio-tech companies in New Jersey and in the Hudson Valley are desperately looking for skilled statisticians and data scientists. Many Wall Street companies such as Standard & Poor's, as well as hedge funds such as D. E. Shaw and Renaissance, actively recruit graduates with a machine learning background. A large number of Web-oriented startups in and around NYC are also recruiting in this area (bit.ly, foursquare, etc).

The annual Strata Conference (held in New York City in September 2011) covers the latest and best tools and technologies for data science, from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

An NYC-based “meetup” on Machine Learning was formed a few years ago. This informal monthly gathering brings together people from various backgrounds around an invited speaker. Participants typically number in the hundreds.

References


Appendix A: Data Science, Artificial Intelligence, and Machine Learning

The New AI

Beyond solving the very immediate and practical problems associated with the data deluge, Data Science provides new methodologies with which to attack long-standing scientific questions.

One of the grand scientific challenges of our times is to understand intelligence and explain how the brain works. One of the grand technical challenges of our times is to build intelligent machines and robots. These apparently different challenges boil down to a single set of questions: what is intelligence? what is learning? how can we build intelligent machines that learn from data?

While the history of AI has been described as a succession of false starts and broken promises, considerable progress has been achieved over the last few years, largely due to breakthroughs in large-scale automatic machine learning, and to the ever-increasing power of computers. Sometime in the next half century, the raw computing power of computers will approach and surpass that of the human brain. If theory and algorithms are developed concurrently, AI may finally deliver on its promises. Intelligent data analysis systems will increasingly sift through massive amounts of data from industry, commerce, government, and the digital society. Intelligent autonomous robots will be roaming around. Intelligent cars will drive themselves (prototypes already exist). Intelligent knowledge extraction systems will help scientists and researchers in all fields to produce new science from large collections of experimental data.

There is no AI without Machine Learning

Noted IBM/JHU speech recognition expert Fred Jelinek is famously quoted as having said in 1988: “Every time I fire a linguist, the performance of our speech recognition system goes up”. Underlying this joke is the fact that AI systems that are built by applying statistical learning to massive datasets far outperform those that are carefully engineered “by hand”, i.e. by attempting to translate expert knowledge into algorithms. The recent history of almost every branch of Artificial Intelligence points in the same direction: machine learning methods are already surpassing the ingenuity of human engineers when it comes to designing artificially intelligent machines. Examples abound in many sub-fields of AI, including speech recognition, handwriting recognition, computer vision, robotics, natural language processing, automatic language translation, information retrieval, and question-answering systems. In all of these domains, machine learning plays a key role, and is arguably the single most important cause of progress in recent times (besides progress in hardware).

Over the next few decades, the raw computing power of computers will approach that of the human brain. A low-ball estimate of the raw computing power of the human brain is 10^18 operations per second. The most powerful computer in the world will reach that speed sometime in the 2020's. By 2035, thousands of computers will be as powerful as the human brain (see http://www.top500.org/lists/2008/11/performance_development ). While building intelligent machines will require some major conceptual breakthroughs, not just powerful computers, we may finally enter the age of intelligent machines after years of unfulfilled promises. Intelligent machines and robots will revolutionize society in ways that are difficult to envision today.

The AI industry, which had a number of false starts in the 80's and 90's, has finally taken off. In the 80's, “knowledge engineers” were going to compile human knowledge into rule-based expert systems. That approach failed, but with the emergence of machine learning and the availability of large datasets and fast computers, the process of deriving knowledge and predictive models from data is becoming more automated and much more effective.

Machine Learning at NYU

NYU has several faculty members whose primary research area is machine learning, including Nathaniel Daw (CNS), Panos Ipeirotis (Stern), Yann LeCun (CS/CNS), Mehryar Mohri (CS), Foster Provost (Stern), David Sontag (CS), and Alexander Statnikov (Med School/Medical Informatics). A number of faculty who work in AI-related fields, such as computer vision, image processing, and robotics, are also active in ML research: Chris Bregler (CS), Rob Fergus (CS), Yann LeCun (CS/CNS), and Eero Simoncelli (CNS/Courant). A large number of NYU faculty in other areas, while not primarily involved in ML research, work in closely related fields and are heavy users of ML techniques. Such domains include:

  • neuroscience and cognitive science
  • computational biology, bio-statistics, and medical informatics
  • astronomy and high-energy physics
  • natural language processing, computational linguistics
  • signal processing and music theory
  • business data analytics
  • quantitative methods in social sciences and the humanities
  • economics and finance



Appendix B: Q&A

Q: Isn't Data Science a sub-field of Machine Learning and Computer Science?

A: No, for two main reasons: 1. the topic is multi-disciplinary, overlapping computer science, statistics, and applied mathematics for the methods, and overlapping many fields for its application domains (AI, computer vision, computational neuroscience, computational biology, mathematical finance, and many others); 2. many graduate students who are best positioned for a degree in Data Science do not have a background in Computer Science, and are not motivated to join a traditional CS program, with its requirements in “core” computer science, such as operating systems, programming languages, compilers, and such. In a Data Science program, these requirements would be replaced by a number of courses in machine learning, statistics, optimization, numerical analysis, linear algebra, signal processing, data mining, and information theory.

In many universities, Machine Learning researchers have their primary affiliation with computer science, EECS, statistics, or computational biology. However, few ML researchers and graduate students have a purely computer science or purely statistics background. Many have a background in engineering, physics, or mathematics. Similarly, many graduate students whose primary interest is ML do not feel at home in computer science, and few PhD applicants with a pure CS background are admitted into ML research groups. Why? Because the tools of ML and statistics are continuous mathematics: multivariate calculus, linear algebra, optimization, geometry, Hilbert spaces, harmonic analysis, etc. By contrast, computer science traditionally deals in discrete mathematics (e.g. combinatorics, graph theory, etc.). ML students often have little interest (and little background) in what is considered “core” computer science, such as operating systems, compilers, theory of computation, security, cryptography, and such. Currently, most CS programs around the country do not have much to offer to these students.

Q: Isn't Data Science a sub-field of Statistics?

A: While Statistics is a crucially important part of Data Science, it would be a mistake to see Data Science as being contained in Statistics. The computational aspects, and some of the mathematical aspects, are squarely within applied mathematics and computer science. Furthermore, the application areas go beyond the traditional territory of statistics. The main purpose of Data Science is often prediction and model building from very large datasets and with correspondingly large models. The current and future challenges are in the extraction of knowledge from very large datasets, be it for science or commerce. Taking on these challenges will require a tight collaboration between computer scientists, applied mathematicians, statisticians, and specialists of the application areas.

 
 
/srv/www/cilvr/htdocs/data/pages/publications/cds-manifesto.txt · Last modified: 2013/04/19 11:40 by yann