Behold my pithiest attempt: "Data science is the discipline of making data useful." Feel free to flee now or stick around for a tour of its three subfields.
The term no one really defined
If you poke around in the early history of the term data science, you see two themes coming together. Allow me to paraphrase for your amusement:
- Big(ger) data means more tinkering with computers.
- Statisticians can't code their way out of a paper bag.
And thus, data science is born. The way I first heard the job defined is "A data scientist is a statistician who can code." I'll be full of opinions on that in a moment, but first, why don't we examine data science itself?
Twitter definitions circa 2014.
I love how the 2003 launch of the Journal of Data Science goes right for the broadest possible scope: "By 'Data Science' we mean almost everything that has something to do with data." So… everything, then? It's hard to think of something that has nothing to do with information. (I should stop thinking about this before my head explodes.)
Since then, we've seen a multitude of opinions, from Conway's well-traveled Venn diagram (below) to Mason and Wiggins' classic post.
Drew Conway's definition of data science. My personal taste runs more towards the definition on Wikipedia.
Wikipedia has one that's very close to what I teach my students:
Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data.
That's a mouthful, so let me see if I can make it short and sweet:
"Data science is the discipline of making data useful."
What you're thinking right about now might be, "Nice try, Cassie. It's cute, but it's an egregiously lossy reduction. How does the word 'useful' capture all of that jargon stuff?"
Well, okay, let's argue it out with pictures.
Here's a map of data science for you, perfectly faithful to the Wikipedia definition.
What are these things and how do you know where you are on the map?
If you're about to try breaking them down by standard toolkits, slow down. The difference between a statistician and a machine learning engineer is not that one uses R and the other uses Python. The SQL vs R vs Python taxonomy is ill-advised for so many reasons, not least of which is that software evolves. (As of recently, you can even do ML in SQL.) Wouldn't you prefer a breakdown that'll last? In fact, just go ahead and unread this entire paragraph.
Perhaps worse is the favorite way novices split the space. Yup, you guessed it: by the algorithm (surprise! it's how university courses are structured). Pretty please, don't taxonomize by histograms vs t-tests vs neural networks. Frankly, if you're clever and you have a point to make, you can use the same algorithm for any part of data science. It might look like Frankenstein's monster, but I assure you it can be forced to do your bidding.
Enough with the dramatic buildup! Here's the taxonomy I propose:
What on earth is this? Why, decisions, of course! (Under incomplete information. When all the facts you need are visible to you, you can use descriptive analytics for making as many decisions as you please. Just look at the facts and you're done.)
It's through our actions, our decisions, that we affect the world around us.
I'd promised you we were going to talk about making data useful. To me, the idea of usefulness is tightly coupled with influencing real-world actions. If I believe in Santa Claus, it doesn't particularly matter unless it might influence my behavior in some way. Then, depending on the potential consequences of that behavior, it might start to matter an awful lot. It's through our actions, our decisions, that we affect the world around us (and invite it to affect us right back).
So here's the new decision-oriented picture for you, complete with the three main ways to make your data useful.
If you don't know what decisions you want to make yet, the best you can do is go out there in search of inspiration. That's called data-mining or analytics or descriptive analytics or exploratory data analysis (EDA) or knowledge discovery (KD), depending on which crowd you hung out with during your impressionable years.
Golden rule of analytics: only make conclusions about what you can see.
Unless you know how you intend to frame your decision-making, start here. The great news is that this one is easy. Think of your dataset as a bunch of negatives you found in a darkroom. Data-mining is about working the equipment to expose all the images as quickly as possible so you can see whether there's anything inspiring on them. As with photos, remember not to take what you see too seriously. You didn't take the photos, so you don't know much about what's off-screen. The golden rule of data-mining is: stick to what is here. Only make conclusions about what you can see, never about what you can't see (for that you need statistics and a lot more expertise).
Other than that, you can do no wrong. Speed wins, so start practicing.
Expertise in data-mining is judged by the speed with which you can examine your data. It helps not to snooze past the interesting nuggets.
The darkroom's intimidating at first, but there's not that much to it. Just learn to work the equipment. Here's a tutorial in R and here's one in Python to get you started. You can call yourself a data analyst as soon as you start having fun and you can call yourself an expert analyst when you're able to expose photos (and all the other kinds of datasets) with lightning speed.
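To make the darkroom analogy concrete, here's a minimal sketch of what "exposing the negatives" quickly might look like in Python. The signup numbers and channel names are invented for illustration; the point is just that a first look means fast summaries, not conclusions:

```python
from collections import Counter
from statistics import mean, median

# Hypothetical dataset: daily signup counts by channel (illustrative only)
signups = [
    {"channel": "email", "count": 120},
    {"channel": "ads", "count": 45},
    {"channel": "email", "count": 130},
    {"channel": "organic", "count": 80},
    {"channel": "ads", "count": 40},
]

# "Expose the negatives" fast: a few quick summaries, zero leaps beyond the data
counts = [row["count"] for row in signups]
print("n =", len(counts))
print("mean =", mean(counts), "median =", median(counts))
print("rows per channel:", Counter(row["channel"] for row in signups))
```

Nothing here says anything about next week's signups or about channels you didn't log. That's the golden rule in action: describe what's in front of you and go looking for the next inspiring negative.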
Statistical inference
Inspiration is cheap, but rigor is expensive. If you want to leap beyond the data, you're going to need specialist training. As someone with undergrad and graduate majors in statistics, I may be just a tad biased here, but in my opinion statistical inference (statistics for short) is the most difficult and philosophy-laden of the three areas. Getting good at it takes the most time.
Inspiration is cheap, but rigor is expensive.
If you intend to make high-quality, risk-controlled, important decisions that rely on conclusions about the world beyond the data available to you, you're going to have to bring statistical skills onto your team. A great example is that moment when your finger is hovering over the launch button for an AI system and it occurs to you that you need to check it works before releasing it (always a good idea, seriously). Step away from the button and call in the statistician.
Statistics is the science of changing your mind (under uncertainty).
If you want to learn more, I've written this 8-minute super-summary of statistics for your enjoyment.
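To give one small taste of what "leaping beyond the data" with rigor can look like, here's a permutation test, one of many inferential techniques, sketched in plain Python. The A/B measurements are made up for illustration; the idea is that if group labels didn't matter, shuffling them shouldn't often reproduce a difference as large as the one we observed:

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical A/B test measurements (invented numbers, illustrative only)
control = [12.1, 11.8, 12.5, 11.9, 12.0, 12.3]
treatment = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

observed = mean(treatment) - mean(control)

# Permutation test: repeatedly shuffle the labels and see how often a
# difference at least as extreme as the observed one shows up by chance.
pooled = control + treatment
n = len(control)
trials = 10_000
more_extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[n:]) - mean(pooled[:n])
    if abs(diff) >= abs(observed):
        more_extreme += 1

p_value = more_extreme / trials
print(f"observed difference: {observed:.2f}, p-value estimate: {p_value:.4f}")
```

A small p-value here is a statement about the world beyond these twelve numbers, and making that statement responsibly (assumptions, error rates, what "extreme" means) is exactly the philosophy-laden part that takes training.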
Machine learning
Machine learning is essentially making thing-labeling recipes using examples instead of instructions. I've written a few posts about it, including whether it's different from AI, how to get started with it, why businesses fail at it, and the first couple of articles in a series of plain-language takes on the jargon nitty gritties (start here). Oh, and if you want to share them with non-English-speaking friends, a bunch of them are translated here.
What about data engineering, the work that delivers data to the data science team in the first place? Since it's a sophisticated field in its own right, I prefer to shield it from data science's hegemonic aspirations. Besides, it's much closer in species to software engineering than to statistics.
The difference between data engineering and data science is a difference of before and after.
Feel free to see the data engineering versus data science difference as before versus after. Most of the technical work leading up to the birthing of the data (before) may comfortably be called "data engineering" and everything we do once some data have arrived (after) is "data science".
Decision intelligence
DI is all about decisions, including decision-making at scale with data, which makes it an engineering discipline. It augments the applied aspects of data science with ideas from the social and managerial sciences.
Decision intelligence adds components from the social and managerial sciences.
In other words, it's a superset of those bits of data science not concerned with researchy things like creating fundamental methodologies for general-purpose use.
Still hungry? Here's a breakdown of the roles in a data science project to entertain you while I go clack on my keyboard.
What on earth is data science? was originally published in Hacker Noon on Medium.