Data Science Is Not Magic

Throughout my career, people have referred to my Data Science work as magic countless times. I have even had colleagues who refer to parts of their own Data Science projects as magic. I am always quick to try to correct this, because as Data Science professionals we don’t work in a world of magic and mystique. We work in a world of data, logic, and facts.

Magic starts with fiction

I think this entire Data Science versus magic issue stems from Arthur C. Clarke, the British science fiction writer. He is famously credited with three laws:

  1. When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
  2. The only way of discovering the limits of the possible is to venture a little way past them into the impossible.
  3. Any sufficiently advanced technology is indistinguishable from magic.

I agree with Mr. Clarke’s first law. I have often said to stakeholders in reference to Data Science projects that with enough time, effort and money, we can accomplish anything. So far, in my career, this has held true.

The second law is also true: we need to push our boundaries to discover what is truly possible. If you limit yourself, you will never really discover anything new or of note.

The third law is where I start to diverge from Mr. Clarke’s thinking. I get where he is going with it. To an observer with no understanding of the technology, the technology appears to be magic. Imagine if we were able to travel back in time 50 years to 1970 and show someone a smartphone. That person would have no basis for understanding it. They wouldn’t even understand the concept of personal computing, as it was nearly a decade away from reaching the general public. The closest references they would have would be the NASA space missions and Star Trek from the late 1960s. I’m sure that smartphone would seem like some sort of wizardry. But seeming like wizardry to an observer is not the same as being wizardry.

Data Science starts with observation

If we aren’t going to refer to Data Science as magic, then it must follow a real scientific approach. The scientific method is how we take our observations of the real world, form a hypothesis about how things should work, and experiment to see whether things work that way or another way entirely. Wash, rinse, and repeat.

You might challenge this right now and say that you just know better because of your years of experience, and that there is no need for science or magic because your experience will drive the right decisions. It sounds funny to think about it that way, but many businesses operate in just that way. We’ll call this the gut instinct approach. It’s a viable system for running a business. I’ve been around multi-million dollar businesses that function entirely on gut instinct. One thing about gut instinct is that it’s psychologically safe, but it isn’t necessarily better. It gives you about the same odds as flipping a coin, which is a 50% chance of being right and a 50% chance of being wrong.

This is exactly why we want to tackle business issues with Data Science. We want an edge on our competition. If they are operating at around a 50% success rate, all we have to do is gain a few points and we’ll start lapping them. How do we get there? We must observe some characteristic of the business. Many modern businesses are sitting on piles of data, hence the term “big data.” Often, this data is not organized in a way that is useful, so we might have to merge, append, and normalize it. After observing a trend in the business, or something in the data, we can then start to ask questions about it.
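To make the merge, append, and normalize step a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the record layout, the field names (order_id, customer_id, amount, region), and the values are all invented, and I am reading “normalize” as rescaling a numeric column to the range 0 to 1, which is only one of several things the word can mean.

```python
# Hypothetical yearly sales extracts; all fields and values are invented.
sales_2018 = [
    {"order_id": 1, "customer_id": "C1", "amount": 40.0},
    {"order_id": 2, "customer_id": "C2", "amount": 100.0},
]
sales_2019 = [
    {"order_id": 3, "customer_id": "C1", "amount": 70.0},
]
# Hypothetical customer attributes, keyed by customer_id.
customers = {
    "C1": {"region": "North"},
    "C2": {"region": "South"},
}


def prepare(sales_batches, customers):
    """Append the batches, merge in customer attributes, and rescale amount."""
    # Append: stack the separate extracts into one list of rows.
    rows = [dict(row) for batch in sales_batches for row in batch]

    # Merge: attach customer attributes to each sale by customer_id.
    for row in rows:
        row.update(customers.get(row["customer_id"], {}))

    # Normalize (assumed here to mean min-max scaling to [0, 1]).
    amounts = [row["amount"] for row in rows]
    low, high = min(amounts), max(amounts)
    span = (high - low) or 1.0  # avoid dividing by zero if all amounts match
    for row in rows:
        row["amount_scaled"] = (row["amount"] - low) / span
    return rows


prepared = prepare([sales_2018, sales_2019], customers)
```

In a real project this kind of work is usually done with a data-frame library or in the database itself; the plain-Python version is only meant to show the three operations side by side.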

Question everything

When analyzing our observations, we start to form questions about why things are the way they are. Eventually, we will settle on a good question. This leads to our hypothesis. A good hypothesis is both falsifiable and testable.

Let’s set up an example. For our example, we are a store that sells many products. We have always assumed that the best strategy for advertising toys to maximize our 4th quarter sales is to run ads after Thanksgiving to capture the shoppers buying close to the Christmas holiday.

We can start by looking at attributes of customers, sales, and the date and time of orders. From there we could make some assumptions about sales and create several hypotheses to test. For this example, we’ll just choose one hypothesis:

Running toy advertisements the week before Thanksgiving will significantly increase 4th quarter sales.

Now that we have a hypothesis, we’ll refer to it as H1, because in statistics we call this our alternative hypothesis. It’s also important to create a null hypothesis. A null hypothesis is what we already assume to be true and what we will try to reject in our experiment. We refer to it as H0:

Running toy advertisements the week before Thanksgiving has little to no effect on 4th quarter sales.
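One way to pit H1 against H0 in code is a permutation test. This is my choice of technique for illustration, not the only option, and the sketch below rests on assumptions: the sales figures are invented, the function name permutation_p_value is hypothetical, and the 0.05 cutoff is just the conventional threshold. The idea is that if H0 holds, the “early ads” label carries no information, so shuffling the labels should produce a difference as large as the observed one fairly often.

```python
import random


def average(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(treated) > mean(control)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = average(treated) - average(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)

    at_least_as_large = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are interchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = average(pooled[:n_treated]) - average(pooled[n_treated:])
        if diff >= observed:
            at_least_as_large += 1

    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Invented 4th quarter toy sales (thousands of dollars), illustration only.
early_ads = [128, 135, 122, 140, 131, 137]      # periods with early ads
no_early_ads = [118, 121, 125, 115, 123, 119]   # periods without them

p_value = permutation_p_value(early_ads, no_early_ads)
reject_h0 = p_value < 0.05  # conventional threshold, an assumption here
```

A small p-value says the observed gap would be rare if the early ads had little to no effect, which is grounds for rejecting H0. It does not prove H1, and it says nothing about other variables that may differ between the two groups.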

Test, test, and test some more

Now that we have our H0 and H1, we can test our alternative hypothesis. You might be wondering how we go about testing the hypothesis if we have never run toy advertisements the week before Thanksgiving. One way is to give it the old college try! We’ll call this the trial and error method. Our company could just run the ads the week before Thanksgiving and measure what happens. The downside is that it might not work, and we would lose all the money we spent on advertising without gaining any sales. Then you might wonder if we can conclude anything, or if some other variable was introduced that affected the test. Should we test again? What if we lose money once again? These are all good questions.

What if we used our historical sales records and sampled certain situations? We could start by pulling a sample of toy sales from the week before Thanksgiving. Then we pull toy sales from after Thanksgiving and before Christmas. We could then pull sales of other merchandise that we were advertising during the week before Thanksgiving. There are several other samples we could pull too. After we pulled all of these, we could compare them and try to find similarities across other merchandise, customers, and proximity to holidays and large shopping days like Black Friday. We start to build a data set with many attributes to describe all of the data. We can then choose among several strategies, such as clustering or regression analysis, to see how the sales are similar or dissimilar and how different variables affect the forecasted outcomes.
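As a rough illustration of the regression option, the sketch below fits a straight line by ordinary least squares. It is a sketch under stated assumptions, not the analysis itself: the history is invented, and the variables (weeks of advertising run ahead of a holiday, and the sales that followed) and the function name fit_line are hypothetical stand-ins for whatever attributes the real data set would contain.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Invented history, illustration only.
ad_weeks = [1, 2, 3, 4, 5]            # weeks of advertising before a holiday
sales = [102, 108, 115, 119, 126]     # resulting sales, thousands of dollars

slope, intercept = fit_line(ad_weeks, sales)

# Forecast for a lead time we have not tried yet (6 weeks).
forecast = intercept + slope * 6
```

The slope estimates how much sales move for each extra week of advertising in this made-up history. Forecasting at 6 weeks is an extrapolation beyond the observed range, so in practice it deserves more caution than a prediction inside the range, and a real model would include more than one variable.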

Data Science arrives at conclusions

Once we’ve run our tests, we can start to draw some conclusions about what we have learned. The nice thing about using the data in our example is that we can use many years of history to try to predict what will happen when we run toy ads the week before Thanksgiving. If we just used the trial and error method, we might have to keep attempting the strategy for many years before we could draw any conclusions about it. Using the data, we are able to predict how the future might look based on our experiences over the entire course of our business history.

Can we predict the future with any degree of accuracy? Yes, we can. I know we all like to complain when the local weather person gets the forecast wrong, but in reality weather forecasters are right most of the time with short-term forecasts of a few days. Why is this the case? Weather professionals are working with much more data than ever before and using computing power that can process the information faster than ever before. They have roughly a 9 out of 10 chance of being right inside of a 5-day forecast. That is pretty impressive!

For our example, suppose I were to come to you as a decision maker and tell you that you had a 90% chance of significantly increasing your 4th quarter sales just by adding one week of advertising to your marketing budget. I think most people would implement that strategy nearly every time.

Data Science is science, not magic

Data Science is a fun topic, and there is much value that can be extracted from the data sets you have within your organization. It is a high-demand field. It’s being used to affect how businesses operate at every level, from back-end systems, to finances, to operations, and even on the front lines with chatbots and other learning systems.

Today’s businesses are collecting data at a rate that we’ve never seen before, and oftentimes it is neglected and underutilized. This data can be put to work for the business and make a huge impact on its success in both the short term and the long term.

As we discussed, Data Science is not magic. We can follow the scientific method: formulate a hypothesis, test it, and draw conclusions about business strategy. We don’t have to rely on gut instinct or trial and error when we can use data that the business already has to draw meaningful insights about how a given strategy might play out in the future.

Oftentimes, we are so busy trying to get ahead that we don’t stop for a moment and find time for whimsy, so I thought it would be fun to wrap this up with a Data Science joke.

There are two kinds of Data Scientists:

1.) Those who can extrapolate from incomplete data.

[Image: Flash Slothmore reacts to my Data Science joke]
