Which is the best course on data analysis or data science on Coursera, Udacity or elsewhere...

Jason T Widjaja

A bit late to the question so I'm not sure how much this will be seen, but I felt the need to add an alternative perspective on core data analyst skills that have not been covered.

I am repeating an answer I gave to a related question, but hopefully this helps you save your organisation a lot of pain (and money!).

Let me start with two points:

Firstly, data analysis is not data science and doesn't have to be. There are a lot of skills you can learn that will let you add value without doing any modelling. I understand the interest around data science (I did a master's in analytics myself), but sound data analysis has been around for a long time and good analysts are valuable.

Secondly, tools are important but secondary to the task - you can use whatever tools you are comfortable with, as long as the application is fundamentally sound. For instance, if you do end up modelling, don't use a binary classifier on a multi-class categorical label, and don't apply log transformations to negative values. But whether you use R or SAS for modelling, or QlikView or Tableau for visualisation, doesn't matter. It is like debating whether brand X or brand Y pencils are better for drawing.
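
To make the log transform point concrete, here is a quick sketch in Python (numpy, with made-up numbers): a naive log on values that span zero silently produces NaN and -inf, while a signed log1p is one common workaround.

import numpy as np

values = np.array([120.0, -45.0, 300.0, 0.0])

# Naive log transform: negative and zero values turn into nan/-inf,
# which silently poison anything downstream.
with np.errstate(invalid="ignore", divide="ignore"):
    naive = np.log(values)
print(naive)   # [4.787..., nan, 5.703..., -inf]

# One common workaround for data spanning zero: a signed log1p,
# which preserves the sign and handles zero cleanly.
signed = np.sign(values) * np.log1p(np.abs(values))
print(signed)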

So with that in mind, here are seven suggestions that I've mostly learned the hard way, and try my best to drill into my analysts:

Understand the data generating process. Say you are given purchase order data. If you take it at face value without realising that half of the data was automatically generated and half was manually input, each with different lead times, that could cost you hundreds of thousands in bad inventory forecasting.
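
A hypothetical illustration in Python (pandas), assuming the purchase order table has a 'source' column distinguishing automatic from manual entries - the column names and numbers here are invented:

import pandas as pd

po = pd.DataFrame({
    "source": ["auto", "auto", "manual", "manual"],
    "ordered": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-02", "2024-01-05"]),
    "received": pd.to_datetime(["2024-01-09", "2024-01-10", "2024-01-20", "2024-01-30"]),
})
po["lead_time_days"] = (po["received"] - po["ordered"]).dt.days

# Pooling everything hides the fact that the two processes behave
# differently; splitting by source makes the gap obvious.
print(po.groupby("source")["lead_time_days"].describe())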

Sanity check your data. Say you are given a data set of financial transactions to analyse for trends. By taking the time to do exploratory data analysis and making sure the data makes sense, you might discover millions of dollars of transactions dated 50 years in the future - obviously a mistake or system quirk that would have corrupted any calculations you did.
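
A minimal version of that check in Python (pandas), assuming a 'txn_date' column - the names and figures are hypothetical:

import pandas as pd

txns = pd.DataFrame({
    "txn_date": pd.to_datetime(["2014-03-01", "2064-03-01", "2014-04-15"]),
    "amount": [1200.0, 5_000_000.0, 300.0],
})

# Flag anything dated after today before trusting any trend analysis.
future = txns[txns["txn_date"] > pd.Timestamp.today()]
if not future.empty:
    print(f"{len(future)} transaction(s) dated in the future:")
    print(future)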

Check for changing definitions. If you look at any census data or 'open data' data sets, there is a danger of a definition (e.g. what constitutes a 'serious' criminal offence) changing midway through the time series.
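
One cheap way to catch this, sketched in Python (pandas) with invented data: tabulate category proportions by year and look for a sudden structural break.

import pandas as pd

offences = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011, 2012, 2012],
    "category": ["serious", "minor", "serious", "minor", "serious", "serious"],
})

# A stable process should have roughly stable proportions; a sudden jump
# like 2012 here is a prompt to check whether the definition changed.
print(pd.crosstab(offences["year"], offences["category"], normalize="index"))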

Think carefully about sample bias. One government study on public transport was conducted at a train station. There is no way that is representative of what the general population thinks: if someone hated public transport or felt it didn't meet their needs, they wouldn't be at the train station.

Think about the context of the data. In payroll data, comparing packages across countries can be tricky. Besides shifting exchange rates, different countries have different conventions around fixed salary vs. commissions, minimum wage, bonuses, regulated savings and so on.

Understand statistics. There will be instances where comparable figures come close together, and consumers of your analysis will be hungry for 'signal' that sways them one way or another. It is your responsibility to point out when a finding is just not statistically significant. It is also your responsibility to choose your metrics carefully - the 'average' simply fails in many situations, such as heavily skewed distributions where a single outlier drags the mean.
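
Both points in a few lines of Python (numpy and scipy, with fabricated numbers purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(100, 15, size=30)
b = rng.normal(103, 15, size=30)   # looks a touch higher, but is it signal?

t_stat, p_value = stats.ttest_ind(a, b)
print(f"p-value: {p_value:.3f}")   # if this is large, resist calling it a trend

# And why 'average' fails: one outlier drags the mean, not the median.
salaries = np.array([50, 52, 55, 48, 51, 900])   # one executive package
print(f"mean: {salaries.mean():.0f}, median: {np.median(salaries):.0f}")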

Talk to DBAs to understand system-specific quirks. Real-life data sets are rife with strange behaviour driven by the way different systems handle data. In particular, nulls and missing data can be handled in a variety of ways, and operations like integer division or division by zero can wreak havoc on metrics.
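
Two of those quirks illustrated in Python (pandas), with made-up values:

import numpy as np
import pandas as pd

# Integer division: systems that truncate integer division quietly
# lose the fractional part of a metric.
revenue, units = 7, 2
print(revenue // units)   # 3   (truncated, as many SQL engines divide ints)
print(revenue / units)    # 3.5

# Division by zero and nulls: pandas turns x/0 into inf and propagates
# NaN, while other systems may error out or return NULL instead.
df = pd.DataFrame({"num": [10.0, 5.0, np.nan], "den": [2.0, 0.0, 4.0]})
df["ratio"] = df["num"] / df["den"]
print(df)   # note the inf and NaN rows: decide explicitly how to treat them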

I hope that helps. All the best and feel free to drop me a message if you have any specific questions.
