Who am I?
Those who read my previous posts may know that my current job title is "Data Engineer" (D.E.). I work at the French business unit of a multinational retail company specialised in DIY. I have been working there as a D.E. for more than 3 years as a consultant, and I recently signed with them.
My role is already evolving: my manager recently asked me to take on the "Tech Lead" role in my squad. Yet it is important for me to keep developing and improving (or at least maintaining) my skills as a D.E.
The context
After reading some articles about the roles of Data Engineer, Data Analyst and Data Scientist, I thought that I could share some of my insights on one of those roles: the Data Scientist.
First, I should give some context about how my company is organized. Previously, our team was labelled the "BI Team" and was divided into "Squads" with specific perimeters: "Supply", "Performance", "HR", and so on. I have been part of the Supply Squad since the start of my consulting missions. With the Covid pandemic, the need for a much stronger Data team was felt, resulting in the merger of the BI and "Big Data" teams. My Squad was the first to integrate our fellow Data Scientists, and we all felt this was a very good thing. We kept the squad organization, with some squads being split up to handle smaller perimeters.
As I'll be talking about Data Science, one may wonder about my qualifications here. At some point in my career, I thought about becoming a Data Scientist. I read books, followed courses, did some Kaggle competitions, and in the last few weeks put my knowledge into practice with data coming from the Hive Blockchain. The specifics of this little exercise will be discussed in another article.
What is the Data Scientist job about?
I found this little article while browsing Reddit:
https://medium.com/@Code2Relax/data-engineer-vs-pipeline-engineer-vs-data-scientist-vs-data-analyst-6e03a6c241e
I quote:
Data Scientists
This is what requires little advanced skills and the things you hear about AI/ML ( artificial intelligence and machine learning).
Data scientists help identify patterns and insights that are otherwise hard to see in plain data. They predict future based on the past. That might sound impossible but there is mathematics and probablity involved here.
That is only one cherry-picked example, yet this little quote precisely captures the problem with the Data Scientist role: it's a false promise!
Data Science students are mainly taught how to create and optimize models. They take statistics courses, in-depth courses on machine learning algorithms, ... Oh right, they also learn the basics of SQL and the importance of data cleansing and preparation. But they practice on small, non-representative datasets and not on real "industry" data. This is often a problem because, even if they can build amazing, precisely tuned models, they don't realize that the data they will come across when working in a company is full of flaws.
Don't get me wrong: it's a really good thing that they practice their model building skills on those almost clean and perfect datasets, because the data they will feed to their industry model should also be clean and almost perfect if they want to produce excellent predictions.
But when they arrive in a company, they don't know what awaits them: data exploration, data retrieval, data preparation, data cleansing, data understanding... and especially how massive a part of their work this will be.
Before working on the model
That's precisely where our Data Scientists struggle. Their teachers and the managers who hired them told them they would be working on fantastic projects, helping the company predict important things and improving the efficiency of many colleagues. That's usually true... in the end.
But they're usually not prepared for the amount of work they will have to produce before even writing their first line of Python code.
Most big companies have an oldish data warehouse at hand. For us, it was a quite massive Teradata cluster that was well stocked with data.
More and more companies are switching to "Data centric" organizations and cloud databases (BigQuery, Redshift, etc.), and implementing Data Stewardship and Ownership through Data Dictionary and Data Governance tools such as Collibra. This shift can be a Data Engineer's dream: we get to work on and learn new stacks, propose new pipeline ideas, and have a lot of work to do.
But this shift takes a massive amount of time (we are talking about years here), and the business teams can't wait for the data to be clean and available. So the Data Scientists have to go and find their data themselves...
They find:
- a bit of prepared data, linked to business terms on the data governance platform
- a lot of old prepared data, coming from the legacy data warehouse, often without proper documentation
- raw data, coming from the many, many legacy IT systems
- and a lot of data that lives only in the head of a key business user
They'll need to understand this data, then combine those different sources and make them talk to each other to get all the information they need. Then they will realise that the quality of this data is merely "meeeh!"... And finally, they will have to move this data into their own project.
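As a rough illustration of what "making sources talk" involves, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the two extracts, their column names (`product_id`, `sku`, `stock`, `label`) and the helper functions `normalize_key` and `reconcile` do not come from any real system. It joins a legacy warehouse extract with a raw IT system extract whose keys are formatted differently, then counts the most common quality problems.

```python
# Hypothetical extracts: every table, column and value here is invented for illustration.
legacy_warehouse = [
    {"product_id": "000123", "store": "LILLE", "stock": 40},
    {"product_id": "000456", "store": "LILLE", "stock": None},  # missing measure
    {"product_id": "000789", "store": "PARIS", "stock": 12},
]
raw_it_system = [
    {"sku": 123, "label": "Cordless drill"},
    {"sku": 456, "label": "Wall paint 10L"},
    {"sku": 999, "label": "Garden hose"},  # product unknown to the warehouse
]


def normalize_key(value):
    """Bring both key formats ("000123" and 123) to the same integer form."""
    return int(str(value).strip())


def reconcile(warehouse_rows, raw_rows):
    """Left-join warehouse rows with raw labels and build a small quality report."""
    labels = {normalize_key(r["sku"]): r["label"] for r in raw_rows}
    warehouse_keys = {normalize_key(r["product_id"]) for r in warehouse_rows}

    joined = []
    for row in warehouse_rows:
        key = normalize_key(row["product_id"])
        joined.append({
            "product_id": key,
            "store": row["store"],
            "stock": row["stock"],
            "label": labels.get(key),  # None when the raw source has no match
        })

    report = {
        "rows": len(joined),
        "missing_stock": sum(1 for r in joined if r["stock"] is None),
        "missing_label": sum(1 for r in joined if r["label"] is None),
        "only_in_raw": sorted(set(labels) - warehouse_keys),
    }
    return joined, report


joined, report = reconcile(legacy_warehouse, raw_it_system)
print(report)
```

Even on three rows, the report surfaces a missing measure, a product without a label and a product that exists in only one source. On real data, each of those findings usually means a conversation with whoever owns the source.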
Data Scientists are not prepared or trained for this. Nobody seriously told them that they would spend more time diving and maybe drowning in a (data)lake than coding and tweaking their model.
And last but not least: once they get nice results, it's not the end of their journey! Now they have to make their model production-ready.
This leads to demotivation, sometimes to real suffering, and pushes many Data Scientists to question their career path. Yet this is not bound to happen: management can organize the collaboration between the different profiles, teachers can insist on the reality of the job, etc.
Conclusion
Maybe all I said only applies to a certain kind of organization, or to a certain level of company data literacy and maturity. The thing is, I used to work as a consultant, and my previous company had many Data Scientists whom I happened to be friends with. The complaint was the same, whatever the client.
Last year, one of my Data Scientist colleagues was struggling with all the gathering and preparation. He could not deliver value to the business at a regular pace, so I was asked to help him.
The gain was gigantic. What previously took 4 weeks or more could be done in one, because each of us was focusing on his own area of expertise. We learned from each other and, more than that, we built a real professional rapport and work synergy.
After a few months, he had gained a lot of knowledge of the data, the pipelines and the governance. As I had built the right infrastructure for him, he became very autonomous and fast, and my personal contribution to the project fell to one day a week at most.
After that, I was able to point him to the right data quickly because I knew his project very well. I could prepare a new source for him in a very short time because I had already worked on this particular project. He knew our rules and conventions and followed them, which allowed me to come back to his Git repo and "be at home" instantly.
So, yeah, for a moment there were 2 full-time engineers on the project and that was not planned. But in the end, in terms of productivity, 1+1 was equal to 3 or 4.
Personal advice
To the Data Science student
There is no magical recipe here. You won't change your school's program. Just be aware that you will need other qualifications and skills, quite different from those you will be taught at school. For example, SQL is not that old "not even a real language" stuff that is easy and that you won't use because, you know, you've got NumPy, Pandas and Python! No, SQL will be your best friend for most of your work.
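To make that point concrete, here is a small sketch of letting the database do the heavy lifting instead of pulling every row into Python first. It uses the standard-library `sqlite3` module with an in-memory database purely as a stand-in for a real warehouse; the `sales` table, its columns and its values are assumptions invented for the example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a real data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("LILLE", "drill", 3),
        ("LILLE", "paint", 5),
        ("PARIS", "drill", 2),
        ("PARIS", "drill", None),  # a row with a missing quantity
    ],
)

# Aggregate and count missing values in SQL, so only the summary leaves the database.
query = """
SELECT store,
       SUM(quantity) AS total_quantity,
       SUM(CASE WHEN quantity IS NULL THEN 1 ELSE 0 END) AS missing_quantity
FROM sales
GROUP BY store
ORDER BY store
"""
summary = conn.execute(query).fetchall()
conn.close()

for store, total_quantity, missing_quantity in summary:
    print(store, total_quantity, missing_quantity)
```

On four rows the difference is invisible, but against a table with millions of rows, returning one line per store rather than the whole table is often what makes the exploration phase bearable.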
To the Data Scientist starting in a new position
Don't let yourself be overwhelmed by all the data extraction and preparation work. Go see your manager and explain that you need a fellow D.E. to help and guide you. When your data is finally prepared and you start using and publishing it, stick to the standards and describe your data in the governance portal... You'll be helping the next Data Scientist, and your future self.
To the Data Science/BI team manager
Assigning a D.E. to help a Data Scientist is actually THE smart move. Don't let your Data Scientists struggle with the pipelines, the ETL/ELTs, and the gathering and preparation of the data (unless they really want to!).
The Data Engineer will be more proficient at those tasks and will probably love them. And the Data Scientist will be able to focus on adding value on top of that data.
Final Words
As I said, that's my personal take on the subject.
What's yours? How does your team handle the subject?
Do you have any advice to share?