Data Science | Becoming A Data Scientist

The scenario is this: You are a data scientist supporting a marketing manager in preventing customers from switching to your competitor. She is quite savvy and has a reliable technique, which costs some amount per use, that is excellent at convincing customers not to switch, e.g. an unexpected discount on their bill. She needs your expertise in identifying the list of customers she should apply this technique to. A good list, she tells you, should include only customers for whom the intervention increases profit (revenue less expenditure) for your company.

Clearly the objective is to best split the list into 2 groups (‘high-switch-risk’ and ‘not-high-switch-risk’) given the underlying uncertainty. But here’s the rub: the size of the high-switch-risk group has not been specified. It would be easy to generate a list with a really high certainty about which customers are actually high-switch-risk. However, this list would probably only number a handful of customers, because to increase the accuracy we’d only include customers we are really certain about. This list would likely miss out on many high-switch-risk customers.

Therefore, we need to increase the size of the list, and hence reduce the accuracy. In order to determine when to stop increasing the size, and remembering that the objective is to maximise profits, we often ask ourselves the following questions:

False Positive Cost: What is the cost of incorrectly identifying a customer as high risk of switching (i.e. wasted marketing cost)?
False Negative Cost: What is the cost of failing to identify a high-risk customer (i.e. lost revenue)?
How many False Positive Costs = A False Negative Cost?

Ask yourself these questions when you score a model for quality. Remember that depending on the context, sometimes having fewer false positives is more important than overall accuracy, and sometimes having fewer false negatives is more important.
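To make the trade-off concrete, here is a minimal Python sketch of how the list size could be chosen once the two costs are known. Everything in it is an assumption for illustration: the function name `best_list_size`, the predicted switch probabilities, the cost and value figures, and the `success_rate` parameter (set to 1.0 here, i.e. the intervention is assumed to always work on a customer who would otherwise have switched). It ranks customers by predicted switch probability and keeps adding them to the list while the cumulative expected profit still improves.

```python
from typing import List, Tuple


def best_list_size(
    switch_probs: List[float],
    intervention_cost: float,
    retained_value: float,
    success_rate: float = 1.0,
) -> Tuple[int, float]:
    """Return (list size, expected profit) for the most profitable top-k list.

    switch_probs      -- hypothetical model scores: P(customer switches)
    intervention_cost -- cost of applying the technique to one customer
    retained_value    -- revenue kept if a would-be switcher stays
    success_rate      -- assumed chance the technique works on a switcher
    """
    # Target the customers the model is most certain about first.
    ranked = sorted(switch_probs, reverse=True)

    best_k, best_profit = 0, 0.0
    running_profit = 0.0
    for k, p in enumerate(ranked, start=1):
        # Expected gain from this customer, minus what we pay to intervene.
        running_profit += p * success_rate * retained_value - intervention_cost
        if running_profit > best_profit:
            best_k, best_profit = k, running_profit
    return best_k, best_profit


# Made-up scores for six customers; intervention costs 20, a retained customer is worth 100.
probs = [0.9, 0.8, 0.6, 0.3, 0.1, 0.05]
k, profit = best_list_size(probs, intervention_cost=20.0, retained_value=100.0)
print(k, round(profit, 2))
```

With these invented numbers the most profitable list has four customers, not the two or three the model is most sure about: the fourth customer is more likely to stay than to switch, yet including them still adds to expected profit because the intervention is cheap relative to the revenue at stake.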

The ability to execute upon an insight (=’Actionability’) is nearly always the most important metric to judge model success. Models exist to help us create more desirable outcomes; therefore models are scored on their ability to enable us to create more desirable outcomes. The objective is to maximise profitability, and from this perspective some insights are more valuable than others.

I’ve summarised below a number of stylized scenarios to illustrate the concept. In a future post, I plan to dive into this in more detail, including implementing a predictive solution in R (if you have any good multivariate datasets I could use for this, please do point them towards me).

Case A
Fraud Investigation (limited number of investigations)
An investigation by an IRS or HMRC agent costs a lot, so wrong predictions are expensive!
– False Positive is very expensive
– False Negative is inexpensive
– List should have few customers, and be really accurate

Case B
Mail Package Identification
Opening a package doesn’t take much time, but letting €mm worth of contraband into a country is not good!
– False Positive is inexpensive
– False Negative is expensive
– List should have many packages, and hence be less accurate

Case C
Churn Prediction, where intervention is inexpensive (e.g. ‘cinema tickets’) and losing customers is expensive
– False Positive is inexpensive
– False Negative is expensive
– List should have many customers, and hence be less accurate

Case D
Churn Prediction, where intervention is expensive (e.g. ‘big discount on bill’) and losing customers is expensive
– False Positive is expensive
– False Negative is expensive
– List should be midsized, with reasonable accuracy and reasonable number of false negatives
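One rough way to put numbers on the four cases is a break-even probability. If acting on a wrong prediction costs C_FP and missing a true case costs C_FN, then acting on an item with predicted probability p is worthwhile when p × C_FN > (1 − p) × C_FP, which rearranges to p > C_FP / (C_FP + C_FN). The Python sketch below applies that rule; the function name and all cost figures are invented for illustration and are not taken from any real investigation or campaign.

```python
def break_even_probability(false_positive_cost: float,
                           false_negative_cost: float) -> float:
    """Lowest predicted probability at which acting still pays off."""
    return false_positive_cost / (false_positive_cost + false_negative_cost)


# Hypothetical (false positive cost, false negative cost) pairs for each case.
scenarios = {
    "A: fraud investigation": (1000.0, 100.0),
    "B: mail package identification": (1.0, 1000.0),
    "C: churn, cheap intervention": (10.0, 500.0),
    "D: churn, expensive intervention": (500.0, 500.0),
}

thresholds = {
    name: break_even_probability(fp_cost, fn_cost)
    for name, (fp_cost, fn_cost) in scenarios.items()
}

for name, threshold in thresholds.items():
    # A high threshold means a short, accurate list; a low one means a long list.
    print(f"{name}: act when p > {threshold:.3f}")
```

Under these made-up costs, Case A only acts on near-certain predictions, Cases B and C act on almost anything with a small chance of being a true case, and Case D sits in the middle at a threshold of 0.5.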

If you look up the word ‘Ideal’ in the dictionary you get: “A person or thing regarded as perfect”. This definition stems from moral philosophy where perfection or ‘the ideal’ is seen not as something attainable, but rather the direction one should strive towards.

In order to become a better data scientist, one approach would be to define the ideal data scientist and begin by striving towards that ideal. The goal of this blog is to help you accomplish that task, by sketching the different skills a data scientist may have and discussing how and why different positions place emphasis on different skills.

Producing valuable insight from data requires more than just an ability to understand or implement a statistical technique. It requires an ability to communicate, understand existing challenges, manage data and (sometimes) bigger data, quantify and justify trade-offs, deliver outputs at the right time, and help others see the value.

In order for organisations to generate valuable insight, skills across multiple dimensions are required. Most likely, these skills are brought together by a combination of people who have different, but complementary, skill sets.

For example, data science teams in startups usually consist of someone who can manage the data and someone who can build the predictive models. Each may be proficient in the other’s area, but their key contribution comes from what they’re responsible for delivering. Imagine a CEO recruiting for this team. Suppose the choice is between (a) a generalist who could do both tasks well or (b) a specialist who could build models that are 50% better than the generalist, but manage data with 50% less ability. What choice do you think the CEO will make?

As teams grow larger, or as problems require relatively more of a particular skill set, there can be great demand for more specialist data scientists. As business priorities change, an ability to adapt becomes more important, and often more generalist data scientists become successful.

Below I propose the complete set of skills that a data science team needs in order to be successful. From this beginning, the objective of this blog will be to explain each skill and describe ways in which you can demonstrably improve your ability to perform in that area. Some of these may seem outside the traditional ‘data scientist’ job description, but the focus here is to identify the skills that together make the Ideal data science team.

Please share your experience, and let me know what you think in the comments below.

‘Building Things’ Skills

Statistical & Analytical Techniques: Able to understand and use statistical or machine learning approaches for a particular problem
Programming Proficiency: Can write well structured, reliable code that achieves project aims
Data Management: Understands, and can work with / develop large structured databases and data schemas.
Optimisation / Big Data: Architects & implements solutions to scale methods from desktop to server or server to cluster

‘Sharing Things’ Skills

Communication: Knowing how often, at what level of detail and when to communicate to others. Empathetic.
People Management: Supports others in succeeding; lets them know (and helps them out) when they’re not
Visionary: Inspiring; helping others believe that achieving a difficult goal is possible; respected

‘Coordinating Things’ Skills

Project Management: Managing for an ‘on time’ and ‘on budget’ delivery. Recognizing early when this will not happen
Financial: Able to quantify the financial value of an insight to the organisation; supports trade-off decisions or business case development
Operationalise: Can take a solution and transition it from one-off/bespoke to business as usual

‘Selling Things’ Skills

Industry Knowledge: Can provide perspective on key stakeholders, the history of failed / successful products and customers’ reactions. Has relationships with some key industry players.
External Sales: Knows customer’s pain points, seeks and sells insights that address these pain points

In a later post, I plan on outlining how these skills work well together, but in the meantime, you might want to think about what your strengths and weaknesses are, and what your organisation may be lacking.

Simple Self-Assessment.

On a five-point scale, rate your current ability in the areas below.
On the same five-point scale, rate where you would like to be in 1 year and 3 years (this will help you think through how to focus your development efforts).

Data Science

Statistical & Analytical Techniques:
Programming Proficiency:
Data Management:
Optimisation / Big Data:
Communication:
People Management:
Visionary:
Project Management:
Financial:
Operationalise:
Industry Knowledge:
External Sales:

You can sketch your score on the radar chart below and share your experience on our Facebook page (which itself is a brand new experiment).


Becoming A Data Scientist

Category Archives: Data Science

Model accuracy is less important than you think

The Ideal Data Scientist (and how becoming more like one is different than you think)
