From Build a Career in Data Science by Emily Robinson and Jacqueline Nolis
This article guides you through what it’s like working as a data scientist. We’re going to walk you through what to expect in your first few months and how to use them to set yourself up for success. These months will have an outsized impact on how the job goes and this is your chance to set up a system and support network that allows you to be successful.
Check out part 1 (attending a bootcamp), part 2 (building a portfolio) and part 3 (looking for a job) of our series if you’re still preparing yourself for a job in data science.
When you start working, you’ll instinctively want to get as much done as possible immediately. Fight that instinct. You need to be sure that you aren’t merely accomplishing tasks but doing them in the right way. This is the easiest time to ask questions about how something should be done, because you aren’t expected to know this at your new company. Managers occasionally forget that you don’t have the institutional knowledge that your predecessor may have had, and you might get tasked with something that doesn’t make sense to you. You might be able to fake your way through the first few tasks, but you’ll be much better served by asking questions early on and finding out how to approach your work process.
Although each data science job is different, there are some broad patterns as well as principles that can be applied to any job.
The First Month
Your first month will look different depending on the type of company you’re joining. Large and small companies, or more specifically, companies with large data science teams or small ones, approach your onboarding from almost opposite perspectives. Let’s take a peek at what you can expect from two companies: one that is massive, with tons of data scientists, and one with no (or barely any) data science team. These two examples highlight two ends of a spectrum, but the company you’re working at could easily fall in between.
Onboarding at a large organization: a well-oiled machine
You’re one of the dozens of people starting this week. You get an email the week before telling you where to go, when to arrive, and what you need to bring. You then begin a formal, multi-day onboarding process with people from all different departments. Together, you’re issued your laptop and go through setting it up. You listen to presentations on the company culture, HR policies, and how the company is organized. Everything runs like clockwork: they’ve done this thousands of times before.
On the data science side, you’ll get help setting up your coding environment. It’s likely you’ll be given a checklist or extensive documentation on everything you need to do to get access to the data. A central repository of old reports and documentation of the data is there for you to read and absorb. No one’s expecting you to deliver much right away—although they’re excited to have you join the team, they know you’ll need time to adjust. You’ll be expected to take a few weeks to go through all the training and get your access approved for the systems. You may feel almost frustrated that it’s taking a long time before you feel productive, but a slow start going through a process is natural in this environment.
Again, if you’re given a list of to-dos or an assignment, you should take it seriously but worry more about the process than the result. Established data science teams can often have their own idiosyncrasies that you’ll need to adopt. It’s not only good to ask questions at this stage, it’s essential to your ability to perform your job later on. The first few months are your chance to see what’s been done before you and get yourself well versed in the rhythm of your peers.
Onboarding at a small company: what onboarding?
“Oh, you’re starting today?” If you’re joining a small start-up, don’t be surprised if everything isn’t ready, including your laptop. You may be left on your own to figure out how to access the data, and when you get in, you may find it isn’t well optimized for your job: it takes six minutes to run a SQL query on a small table of 100,000 rows. Onboarding sessions to learn about the company may not happen for weeks, if there even are any, because there aren’t enough people starting in a given week to justify running one each week.
No standards exist, and no one is telling you what programming language to use or which approach is standard for an analysis. They ask you to quickly start getting results. Unlike at a large organization, you don’t have to worry about not being productive, because you’ll be asked to be productive right away. You need to worry a lot more that you’re accidentally doing something wrong and no one tells you, or that you only find out a few months later, after your (incorrect) work is already being relied upon. This is why it’s still imperative that you ask questions and work to get a foothold before you no longer have the fallback of being new. Going from crisis to crisis causes you to burn out quickly; work to build your own processes that allow you to be successful in the long run.
Understanding and setting expectations
One of the most important things you can do in your first few weeks is to have a meeting with your manager to discuss priorities. This matters because it tells you what you’re supposed to be working towards in your job. In some data science jobs, the priority is to provide analyses to a specific set of stakeholders to help grow a particular part of the business. For other data science jobs, the goal is to make high-performing models that help run the business. For some jobs it’s both or neither of these.
It may feel like you should already know what the job expectations are from the job posting and interview process. Although this is sometimes true, a lot can change between the interview process and when the job starts. The interviewers may not be on the same timeframe you’re working on, or the organization may have changed before you joined. By talking to your manager as early as possible, you’ll get this information as soon as possible, and in a setting where you can likely talk about it for multiple hours.
Ideally, your manager has a vision for what you’ll be doing but is open to your priorities and strengths. Together, you want to define what success means in your job. Generally, this is tied to making your team and/or manager successful—if the members of the data science team aren’t all working broadly towards the same objective it can be difficult to support each other. To define your own success, you need to understand what problems the team’s trying to solve and how performance is evaluated. Will you be helping generate more revenue by working on experiments to increase conversion, or will you be making a machine learning model to help customer service agents predict what concern a customer has, with the goal to decrease average time spent per request?
Performance usually doesn’t mean “make a machine learning model with 99% accuracy” or “use the newest statistical model in your analysis.” These are the tools to help you solve a problem, not the end goal itself. If your models and analyses are on problems people don’t care about, they’re pretty much useless. This is a common misconception from people entering their first data science job. It makes sense that it’s common, because a lot of academic research and educational courses cover the many different methods of making accurate models. Ultimately, for most data science jobs having highly accurate models isn’t enough to be successful. Things like the model’s usefulness, level of insight, and maintainability are often more important.
You can’t know, when you start a new job, what the expectations are in terms of job responsibilities. Some companies value teamwork, and you may be expected to help with several different projects at once and drop your work at a moment’s notice to help a colleague. Other companies ask that you have deliverables on a regular basis, and it’s OK to ignore emails or Slack messages in order to finish your project. The way you find out if you’re meeting expectations is by having regular meetings with your direct supervisor. Most companies have a weekly one-on-one to discuss what you’re working on or any issues you have. These exist to find out if you’re spending your time on the tasks that matter to your boss. Why should you guess at what’s wanted when you can get explicit feedback? Thinking in shorter-term blocks helps you be sure that you are on the right track for when the larger performance reviews come along.
Unless you’re at a tiny company, there’s a formal performance review process; be sure to ask about what that entails and when it happens. One common practice is to have one every six months, with salary increases and promotions potentially following afterwards. Many companies do this as a “360” process, where you get feedback directly not only from your manager but also your peers. If this is the case, find out whether you choose the peers or if they’re chosen by your manager.
For more established data science teams, there may be a matrix that says which areas you’re evaluated on and what’s expected for each of them at different levels of seniority. For example, one area could be “technical expertise.” A junior data scientist may be expected to have the foundations and show that they’re learning, a mid-level that they’ve mastered one area of expertise, and a senior that they’re the go-to person at the company for a whole area, like A/B testing or writing big data jobs. If one doesn’t exist, see if you can come up with a few areas with your manager.
Regardless of the system, make a plan with your manager to have a three-month review if it’s not common practice. This review helps you make sure you’re on the same page as your manager, give updates, and plan for the rest of your first six months and year.
The point of defining success isn’t that you already need to be excelling in every area in your first months; in fact, most companies won’t do a formal performance evaluation of someone who’s been there less than six months because much of that time has been on-ramping. Rather, it’s to make sure as you learn about your role and begin work that you do this with the big picture in mind.
Knowing your data
You need to learn about the data science part as well. If your company has been doing data science for a while, a great place to start is by reading reports they’ve written. Reports tell you not only what types of data your company keeps and some key insights, but also the tone and style of how to communicate your results. Much of a data scientist’s job is conveying information to non-technical peers, and by reading reports you’ll get a sense of how non-technical those peers are. See how simplified or complex certain concepts are, and you’ll be less likely to over- or under-explain when it comes time to write your own reports.
Then you’ll need to learn where the data lives and get access to it. This includes knowing what table contains the data you want, but maybe also what data system has it. Perhaps the most frequently accessed data lives in a SQL database, but the event data from two years ago lives in HDFS (Hadoop Distributed File System), which you need another language to access.
Take a broad look at the data you’re going to be working with on a regular basis, but go in with an open mind. Some tables have documentation (either packaged with the data or in a report about the data) that explains potential quality issues or quirks. Read those first, as it keeps you from investigating “mysteries” that later turn out to already have been solved. Then take a look at a few rows and summary statistics. This can help you avoid “gotchas,” where you find out that some subscriptions start in the future or that a column often has missing values. When you find these undocumented surprises, usually the best way to figure them out is to talk to the expert on that table. That may be a data scientist, if your company’s large enough, or those who collected the data. You might find out that a surprise is a true issue that needs fixing, or it might turn out to be expected. For example, subscriptions that start in the future could be those that have been paused and set to restart on that date. Or you might find that coupons for last year’s New Year promo were also used in May of this year because support issued them.
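As a rough illustration of that first look, the sketch below counts missing values per column and pulls out rows dated in the future. It uses only the Python standard library; the function name profile_rows, the column names, and the sample rows are all made up for this example, and in practice you might do the same thing with whatever data tools your team already uses.

```python
from collections import Counter
from datetime import date


def profile_rows(rows, date_column, today):
    """Count missing values per column and collect rows dated after `today`."""
    missing = Counter()
    future_rows = []
    for row in rows:
        for column, value in row.items():
            # Treat both None and empty strings as missing.
            if value is None or value == "":
                missing[column] += 1
        start = row.get(date_column)
        if start is not None and start != "" and start > today:
            future_rows.append(row)
    return dict(missing), future_rows


# Made-up sample rows standing in for a real subscriptions table.
subscriptions = [
    {"id": 1, "plan": "basic", "start_date": date(2023, 1, 5)},
    {"id": 2, "plan": None, "start_date": date(2023, 3, 1)},
    {"id": 3, "plan": "pro", "start_date": date(2030, 1, 1)},
    {"id": 4, "plan": "", "start_date": None},
]

missing, future_rows = profile_rows(
    subscriptions, "start_date", today=date(2024, 1, 1)
)
print("Missing values per column:", missing)
print("Rows starting in the future:", [row["id"] for row in future_rows])
```

The output is a list of questions rather than answers: each flagged row is something to take to whoever knows the table best.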
Some companies are better than others at keeping data created for testing separate from real data; others merge the two without a second thought. In the latter case, you need to ask around to learn whether there are orders or activity generated by test accounts or special business partnerships that you should exclude. Similarly, some datasets include users with radically different behavior. American Airlines once offered a lifetime flying pass that included a companion fare. One of the people with the pass used the companion fare for strangers, pets, and his violin, and might fly multiple times a day. Although you may not have anyone this extreme, it’s not uncommon for newer businesses to offer deals that later look silly (e.g. 10 years of access for $100), which may need to be accounted for in your analysis.
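Once you’ve asked around and have a list of test accounts, the exclusion itself can be small. The following is a minimal sketch, assuming orders are simple records with an account field. The account names, the helper names split_orders and heavy_users, and the "ten times the median" threshold are all hypothetical choices for illustration, not a recommended rule.

```python
from statistics import median


def split_orders(orders, test_accounts):
    """Separate real orders from those placed by known test accounts."""
    real = [o for o in orders if o["account"] not in test_accounts]
    test = [o for o in orders if o["account"] in test_accounts]
    return real, test


def heavy_users(orders, multiple=10):
    """Return accounts whose order count exceeds `multiple` times the median."""
    counts = {}
    for order in orders:
        counts[order["account"]] = counts.get(order["account"], 0) + 1
    typical = median(counts.values())
    return sorted(a for a, n in counts.items() if n > multiple * typical)


# Made-up order log: one QA account, three ordinary users, one extreme user.
orders = (
    [{"account": "qa_bot"} for _ in range(3)]
    + [{"account": "alice"}]
    + [{"account": "bob"} for _ in range(2)]
    + [{"account": "carol"}]
    + [{"account": "frequent_flyer"} for _ in range(40)]
)

real, test = split_orders(orders, test_accounts={"qa_bot"})
print(len(real), "real orders;", len(test), "test orders")
print("Accounts to review:", heavy_users(real))
```

Flagging heavy users for review, rather than dropping them automatically, leaves the decision about how to treat them in the analysis to you and the people who know the business.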
Throughout this process of investigating the data, you’re figuring out what kind of overall shape your data is in. If you’re at a smaller company, you may find that you need to work with engineers to collect more data before the overall data is useful. If you’re at a larger company, you’ll need to decipher dozens of tables to determine whether the data you want exists. Maybe you’re looking for tables with a column called “order” across twelve databases. Ideally, there should be well-documented, well-maintained tables for the core business metrics, like transactions or subscriptions. But this is unlikely to be the case for other, less important datasets, and you should try to learn more if you’re going to be focused on one of the less documented areas.
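Searching for a column by name is something you can often script instead of clicking through each table. Here is a small sketch using Python’s built-in sqlite3 module and an in-memory database with made-up tables. It is only a stand-in: many database servers expose similar metadata through an information_schema.columns view, and the right query depends on the system your company uses.

```python
import sqlite3


def find_columns(connection, fragment):
    """Return (table, column) pairs whose column name contains `fragment`."""
    tables = [
        name
        for (name,) in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    matches = []
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        # Table names come from sqlite_master itself, not from user input.
        for column in connection.execute(f'PRAGMA table_info("{table}")'):
            if fragment.lower() in column[1].lower():
                matches.append((table, column[1]))
    return matches


# Made-up schema standing in for a real database.
connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, email TEXT);
    CREATE TABLE purchases (order_id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE shipments (shipment_id INTEGER, order_ref INTEGER);
    """
)
print(find_columns(connection, "order"))
```

Running the same search against each database and saving the results gives you a first draft of a map of where the data lives.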
Make sure you learn how the data got to you. If you’re working with something like website data, it’ll likely flow through multiple systems to get from the website to a database you can use. Each of these systems likely changes the data in some way. When data collection suddenly stops, you want to know where to look for the problem (rather than panicking). But some places have data that people input manually, such as records entered by doctors in a hospital or survey results. In these situations, you have to worry less about pipelines and much more about understanding the many different attributes of the data and the places where a human may have entered it incorrectly. Pretty much anywhere you go, you must deal with some data dirtiness.
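One simple way to notice that collection has stopped is to look for days with no records at all. The sketch below assumes you can pull a list of event dates from your database. The function name missing_days and the sample dates are invented for this example, and a real check would also need to account for days that are legitimately quiet.

```python
from datetime import date, timedelta


def missing_days(event_dates, start, end):
    """Return every date from `start` to `end` (inclusive) with no events."""
    seen = set(event_dates)
    gaps = []
    day = start
    while day <= end:
        if day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Made-up event dates: nothing was recorded on March 3, 4, or 6.
events = [date(2024, 3, 1), date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 5)]
gaps = missing_days(events, start=date(2024, 3, 1), end=date(2024, 3, 6))
print("Days with no events:", [d.isoformat() for d in gaps])
```

A gap tells you when something changed, which narrows down which system in the pipeline to ask about.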
As you go along, try writing down any “gotchas” in the data and a map of where everything lives. It’s difficult to remember these sorts of facts over the course of a job, and many companies don’t have a great system for documentation or data discovery. Like commenting code to allow your future self and others to understand its purpose, documenting data pays enormous dividends. Although keeping this documentation locally on your laptop is okay, the best option is to store it somewhere that everyone in the company can access. You’ll be helping future new hires and even current data scientists at the company who aren’t familiar with that specific area.