tl;dr: Choosing the right technology stack can mean the difference between SUCCESS and FAILURE in data science. It can also mean the difference between ‘I Feel Productive’ and ‘Everything I try takes so much time’. When deciding which toolkit/stack to use, be sure to learn from others, consider implementation partners, and always remain cautious.
Note: In addition to this post, I am surveying people to determine which data science technology stacks are being used right now. The survey responses will remain confidential and can be accessed via this link: http://bit.ly/ToolkitSurvey (takes less than 200 seconds!). Those who complete the survey will get a copy of the report (which will detail the pros/cons of popular data science technology stacks).
The context is that I have a short time to evaluate the quality of a recently completed data science project. This project has generated an insight that tells us something about our business. From my evaluation, I need to decide whether the business should act upon the insights generated. Acting upon insights will cost time and money. If the initiative fails, which is healthy and happens often, I’m going to need to both justify my actions and learn from the failure. To do this, I need to make sure to act only upon quality analyses. Repeatedly doing this requires me to learn how to sniff out the good and bad projects quickly.
If I need to quickly evaluate the quality of a data science project, I try to answer three simple questions:
- Data Understanding: Has the underlying data supporting the output been correctly understood?
- Insight Value: Are the insights generated valuable to the business?
- Execution: Was the technical execution of the initiative robust?
In my experience, a high-quality data scientist will answer these questions as an output of the project. For a new data scientist (or one without business context), asking questions allows me to quickly reach a conclusion on each.
If you want to be an excellent data scientist, ensuring you nail each of the above should be the compass that directs your efforts.
I usually try to answer some or all of these questions to reach a conclusion. Feel free to suggest ones I’ve missed in the comments below.
- Has the data quality been assessed? What did this exercise look like? Were there any significant concerns?
- For data that is captured manually (at some stage) by humans, has this been accounted for? How?
- Is the data history clear? Has this data been through multiple legacy system migrations? Has the impact of this been determined? How was this dealt with?
- How have missing or null values been addressed? What proportion of such values were there?
- Is the organisation’s data capture process clear? Were there any data items (having looked at the data capture process) that didn’t match their definitions?
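As a quick way to operationalise the missing-value question above, an audit might look like this. This is a minimal sketch using pandas; the dataset and column names are purely illustrative, not from any real project:

```python
import pandas as pd

# Hypothetical customer dataset; columns are illustrative only
df = pd.DataFrame({
    "age": [34, None, 45, 29, None],
    "signup_channel": ["web", "store", None, "web", "web"],
    "spend": [120.5, 80.0, None, 55.2, 210.0],
})

# Proportion of missing values per column -- a first data-quality check
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Flag columns where more than 20% of values are missing
concerning = missing_share[missing_share > 0.2]
print(list(concerning.index))
```

A report that states these proportions (and how the gaps were handled) is a good sign; one that never mentions them is worth probing.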
- Is the insight actionable (can I act upon this insight)? Are there any legal, regulatory, logistical or other challenges that would prevent me from acting upon this insight?
- Is the insight valuable? If this insight is correct, and I take action based upon it, will those actions lead to better outcomes (e.g. profits, subscribers, success) for the organisation?
- Is the insight testable? Is it possible to verify the conclusions of the project? If not, does this present a major problem?
- Will acting on the insights be cost-effective relative to the potential gains?
- Are there any major risks to acting upon this insight? Will I be opening a Pandora’s box (e.g. significantly changing our relationship with customers)? How do these risks stack up against the potential benefits?
- Has the code quality been tested?
- Is the statistical model / implementation based on reliable assumptions (e.g. homogeneity of time series patterns)?
- Have the model outputs been tested independently of the model (e.g. train/test)?
- Was good data governance maintained throughout the project (so nothing got inadvertently messed up)?
- Are the significant data and business assumptions documented?
- Was testing designed and performed independently of the project delivery?
- What statistical methods were tried and rejected? Why were these rejected?
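The train/test question above can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data; the model and dataset are placeholders, not any project's actual pipeline:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the project's real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out data the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Scoring on the held-out set tests the outputs independently of the fit
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

If a project only ever reports performance on the data used to build the model, that is a red flag worth raising immediately.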
[Your data science project is either: (1) Bespoke & Once-off, (2) Big & Risky, (3) Quick & Dirty, (4) Simple & Effective]
Imagine a senior executive at your firm walks into the room right now, looks at you and says ‘I need you to design a solution that will help us understand what the data is telling us. Specifically, we are interested in….’ (and let us suppose he continues and outlines some relevant and interesting question in your domain).
How do you begin the process of designing this solution? Assuming that the design will have to meet some non-trivial set of requirements, where do you start?
A good place to start is by thinking about the strategic trade-offs the design will make by thinking about the solution space. The most important trade-off is between the level of sophistication and the ease of execution (or repeatability). The level of sophistication includes the novelty and accuracy of techniques used, expert input during execution, and the quantity of variables and observations included (mostly things correlated with the power and quality of the output). Ease of execution refers to the cost (in time and money) for the organisation to produce this output (now and in the future).
Clearly an ideal solution is a sophisticated, easily executed implementation. But in all cases (given fixed or limited resources) there is a choice between sophistication and ease of execution. Hammering out how stakeholders weigh these against each other is the first (and most important) question that you should answer when designing the solution.
A good way to do this is to break the solution space into four different areas (or quadrants). The relative strengths and weaknesses of each quadrant can often facilitate a discussion to identify what is important to the stakeholders.
The 4-Quadrants Heuristic for Data Science Solutions
- Bespoke & Once-off: Projects that require sophisticated tools and skill-sets (best suited to analyses that need only be completed once)
- Quick & Dirty: Simple quick-result analysis usually completed in a spreadsheet or short script (best suited for decisions that need to be made quickly)
- Simple & Effective: Stable, well-implemented solution that produces regular (or on-demand) outputs automatically (best suited for environments where the data signal is strong and data changes periodically/frequently)
- Big & Risky: Transformational high-risk projects with big potential if successful (best suited when data science is going to be a core competency of your organisation)
So the next time you see scope for a data science solution, your first thought should be “Which Quadrant?”
If you enjoyed this post, you can subscribe to this blog using the link provided in the sidebar. Questions, comments and feedback are always welcome.