Franco Arda

Why raw SQL can be bad in Data Science ...

Let’s say you have the cities “Los Angeles”, “New York” and “Boston.” In your dataset, you also have misspellings such as “Bston.” In a relational database, such invalid entries might be dropped. In big data lingo, problems like these are also filed under “veracity” or “variety.” Now, if you had pulled the whole dataset into an analysis tool such as R, Python, or Tableau first, you might have spotted the misspelling.
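To make the city example concrete, here is a minimal sketch in Python using only the standard library. It assumes one way the entry could get lost: a query that filters on the expected spellings. The `sales` table, its `amount` column, and the rows are all made up for illustration.

```python
import sqlite3
from collections import Counter
from difflib import get_close_matches

VALID_CITIES = ["Los Angeles", "New York", "Boston"]

# Hypothetical toy data; the rows are invented for this illustration.
rows = [
    ("Los Angeles", 120),
    ("New York", 200),
    ("Boston", 80),
    ("Bston", 15),  # the misspelled entry
    ("Boston", 60),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Raw SQL that filters on the expected spellings silently skips "Bston".
placeholders = ", ".join("?" for _ in VALID_CITIES)
filtered = conn.execute(
    "SELECT city, SUM(amount) FROM sales "
    f"WHERE city IN ({placeholders}) GROUP BY city ORDER BY city",
    VALID_CITIES,
).fetchall()

# Pulling every row first and counting the distinct values exposes the typo.
all_cities = [city for (city,) in conn.execute("SELECT city FROM sales")]
counts = Counter(all_cities)

# For each unexpected value, suggest the closest known city name.
suspects = {
    city: get_close_matches(city, VALID_CITIES, n=1)
    for city in counts
    if city not in VALID_CITIES
}
conn.close()

print(filtered)   # totals per valid city; the "Bston" row is missing
print(counts)     # frequency of every spelling actually in the data
print(suspects)   # unexpected spellings with their nearest valid match
```

The filtered query returns a clean-looking result with no hint that a row was left out, while the frequency count shows “Bston” right next to “Boston.” The same check is a one-liner in R or a quick bar chart in Tableau; the point is looking at all the values before filtering.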

I’m biased, but I see another trend. When you start in Data Science, you still hear that SQL is a must. This is also reflected in job profiles. Worse, many interviews contain some SQL questions.

I personally think this is dangerous for producing great Data Scientists. Sure, they should know some basic SQL. Sure, SQL will not go away in the near future (nor should it). But what I think is dangerous is requiring new Data Scientists to spread themselves thin across many tools such as SQL, R, Python, and Tableau.

If you’re not really good at one tool, you might have fallen for “Bston.”

Again, I’m biased. I would take an expert in one or two tools over a generalist in many, any time.

I’ve seen some confirmation of this trend, though:

DataCamp, one of the biggest platforms for training Data Scientists (biggest in terms of traffic, which is around 5 million per month), dropped SQL from its Data Scientist curriculum.

One of the most popular Data Science books on Amazon (popular as measured by the number of reviews), “R for Data Science,” favors doing SQL-style joins directly in R as opposed to raw SQL.

Again, my personal opinion, but if you want to become a superstar Data Scientist, I would favor becoming an expert in one single tool. To me, R is the number one choice, but it could also be Python or Tableau.

With R, you cover everything in Data Science (let’s leave ML and DL aside). You know how to pull data, wrangle it, visualize it, analyze it, and model it. Tableau could get you pretty far as well … but not as far as R. Then again, learning Tableau and learning R require different amounts of effort. I guess you’d be pretty good after 500 hours of studying Tableau. But reaching the top with R, including statistical models, requires thousands of hours. No wonder R has so many Ph.D. users.



Franco Arda, Frankfurt am Main (Germany), franco.arda@hotmail.com