From Build a Career in Data Science by Emily Robinson and Jacqueline Nolis

If you have already learned the skills you need for a data science job, why not put them to use where a potential employer can see them by building a portfolio?


Take 37% off Build a Career in Data Science by entering fccrobinson into the discount code box at checkout at manning.com.


Let’s imagine that prior to reading this you have finished a bootcamp, a degree program, a set of online courses, or a series of data projects in your current job. If you’re thinking about attending a bootcamp, check out part 1 for a thorough walkthrough of why you might (or might not) want to attend one and what to expect.

Congratulations—you’re ready to get a data scientist job! Right?

Well, maybe, but there’s another step that can help you be successful—building a portfolio. A portfolio is a data science project (or set of projects) that you can show to people to explain what kind of data science work you can do.

A strong portfolio has two main parts: GitHub repositories (repos for short) and a blog. Your GitHub repo hosts the code for a project, and the blog shows off your communication skills and the non-code part of your data science work. Most people don’t want to read through thousands of lines of code (your repo); they want a quick explanation of what you did and why it’s important (hence the blog). And who knows, you might even get data scientists from around the world reading your blog, depending on the topic. Your posts don’t have to be limited to analyses you did or models you built; you could explain a statistical technique, write a tutorial for a text analysis method, or even share career advice, like how you picked your degree program.

This isn’t to say you need a blog or GitHub repos filled with projects to be a successful data scientist. In fact, most data scientists don’t have them, and people get jobs without a portfolio all the time. But a portfolio is a great way to stand out, and building one helps you practice your data science skills and get better. Hopefully it’s fun, too!

This article walks you through how to build a good portfolio. The first part covers doing a data science project and organizing it on GitHub. The second covers best practices for starting and sharing your blog, so you get the most value out of the work you’ve done.

Creating a project

A data science project starts with two things: an interesting dataset and a question to ask about it. For example, you could take government census data and ask, “How are the demographics across the country changing over time?” The combination of question and data is the kernel of the project, and with those two things you can start doing data science.



Finding the data and asking a question

When thinking about what data you want to use, the most important thing is to find data that interests you. Why this data in particular? Your choice of data is a way to show off your personality or the domain knowledge from your previous career or studies. For example, if you’re in fashion, you could look at articles about fashion week and see how styles have changed in the past twenty years. If you’re an enthusiastic runner, you could show how your runs change over time and see whether they’re related to the weather.

Something you shouldn’t do is use the Titanic dataset, MNIST, or any other popular beginner dataset. It’s not that these aren’t good learning experiences; they can be, but you’re probably not going to find anything novel, and they won’t surprise or intrigue employers or teach them more about you.

Sometimes you let a question lead you to your dataset. For example, you may be curious about how the gender distribution of college majors has changed over time and whether it’s related to median earnings after graduation. You would then take to Google and try to find the best source of that data. But maybe you don’t have a burning question you’ve been waiting to have the data science skills to answer. In that case, you can start by browsing datasets and seeing whether you can come up with any interesting questions. Here are a few suggestions of where to start:

  • Kaggle: Kaggle started as a website for data science competitions. Companies post a dataset and a question and usually offer a prize for the best answer. Because the questions are machine learning problems where you’re trying to predict something, like whether someone will default on a loan or how much a house will sell for, you can compare models based on their performance on a holdout test set and get a performance metric for each one. Kaggle also has discussion forums and “kernels” where people share their code, so you can learn how others approached a dataset. All of this means that Kaggle has thousands of datasets with accompanying questions and examples of how other people analyzed them.
  • The biggest benefit of Kaggle is also its biggest drawback: by handing you a (generally cleaned) dataset and problem, it has done a lot of the work for you. You also have thousands of people tackling the same problem, so it’s difficult to make a unique contribution. One way to use Kaggle is to take a dataset but pose a different question or do an exploratory analysis. Generally, we think Kaggle is best for learning: tackle a project, see how you perform compared to others, and learn from what their models did, rather than using it as a piece of your portfolio.
  • Datasets in the news: Recently, many news organizations have started making their data public. For example, FiveThirtyEight, a website focused on opinion-poll analysis, politics, economics, and sports blogging, publishes the data it uses for its articles and even links to the raw data directly from the article page. Though these datasets often require manual cleaning, the fact that they were in the news means there’s probably an obvious question associated with them.
  • APIs: APIs are Application Programming Interfaces: developer tools that let you access data directly from companies. You know how you can type in a URL and get to a website? APIs are like URLs, but instead of a website you get data. Some examples of companies with helpful APIs are the New York Times and Yelp, which let you pull their articles or reviews, respectively. Some APIs even have R or Python packages that make them easier to work with; for example, rtweet for R lets you quickly pull Twitter data, and you can find tweets with a specific hashtag, see what the trending topics in Kyoto are, or check what tweets Stephen King is favoriting (a small sketch of pulling data this way appears after this list). Keep in mind that there are rate limits and terms of service for how you can use them; for example, right now Yelp limits you to five thousand calls a day, so you couldn’t pull every review ever written. The advantage of using APIs for a project is that they can provide extremely robust, organized data from many sources. The downside is that you often have to think hard about what interesting question to tackle: you have five thousand Yelp reviews, now what?
  • Government open data: A lot of government data is available online—you can use census data, employment data, the General Social Survey, and tons of local government data, like New York City’s 911 calls or traffic counts. Sometimes you can download this data directly as a CSV; other times, you need to use an API. You can even submit freedom-of-information requests to government agencies to get data that isn’t publicly listed. Government data is great because it’s often detailed and on unusual subjects, like data on the registered pet names of every animal in Seattle. The downside is that it’s often not well formatted, such as tables stored within PDFs.
  • Your own data: You can download data about yourself from many places: social media websites and email services are two big ones, but if you use apps to keep track of your physical activity, reading list, budget, sleep, or anything else, you can usually download that data as well. Maybe you could build a chatbot based on your emails with your spouse. Or you could look at the most common words in your tweets and how they’ve changed over time. Perhaps you could track your caffeine intake and exercise for a month and see whether they predict how much and how well you sleep.
  • Web scraping: Web scraping is a way to extract data from websites that don’t have an API by automating the process of visiting webpages and copying the data. For example, you could create a program that searches a movie website for a list of one hundred actors, loads their profiles, copies the list of movies they’ve been in, and puts it all in a spreadsheet. You must be careful, though: scraping a website can be against its terms of use, and you can be banned. You can check a website’s robots.txt file to find out what it allows. You also want to be nice to websites—if you hit them too many times, you can bring down the site. But assuming the terms allow it and you build in time between your requests, scraping can be a great way to get unique data (a small scraping sketch also appears after this list).
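For instance, here’s a minimal sketch of pulling data through an API with the rtweet package. It assumes you’ve already set up a Twitter developer account and authorized the package; the hashtag and the number of tweets are just placeholders.

    # Minimal sketch: pulling tweets with rtweet (assumes you've already
    # created a Twitter developer account and authorized the package)
    library(rtweet)

    # Pull recent tweets for a hashtag of interest (placeholder hashtag)
    rstats_tweets <- search_tweets("#rstats", n = 500, include_rts = FALSE)

    # Quick look at how many of those tweets were posted each day
    table(as.Date(rstats_tweets$created_at))

Even a small pull like this gives you a data frame you can explore, and the same pattern works for the other data the package exposes.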
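And here’s a similarly minimal web-scraping sketch using the rvest package. The URL and the CSS selector are hypothetical placeholders; swap in a real page and selector, and check the site’s robots.txt and terms of use first.

    # Minimal sketch: scraping a page with rvest
    # (the URL and the ".movie-title" selector are placeholders)
    library(rvest)

    page <- read_html("https://example.com/movies")

    # Pull the text of every element matching the (hypothetical) selector
    titles <- html_text2(html_elements(page, ".movie-title"))

    Sys.sleep(2)   # be polite: pause before requesting the next page
    head(titles)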

What makes a side project “interesting”? One recommendation is to pick a more exploratory analysis where any result either teaches the reader something or clearly demonstrates your skills. For example, creating an interactive map of 311 calls in Seattle, color-coded by category, clearly demonstrates your visualization skills, and you can write about the patterns that emerge. On the other hand, if you try to predict the stock market, you’ll likely fail, and it’s hard for an employer to assess your skills when all you have is a negative outcome.
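To make that concrete, here’s a minimal sketch of that kind of interactive map using the leaflet package in R. The calls_311 data frame and its longitude, latitude, and category columns are hypothetical; you’d substitute whatever the real open-data extract gives you.

    # Minimal sketch: an interactive map of 311 calls, color-coded by category.
    # calls_311 is a hypothetical data frame with longitude, latitude, and
    # category columns; adapt the column names to your real data.
    library(leaflet)

    pal <- colorFactor("Set1", calls_311$category)

    leaflet(calls_311) %>%
      addTiles() %>%
      addCircleMarkers(~longitude, ~latitude, color = ~pal(category),
                       radius = 3, stroke = FALSE, fillOpacity = 0.6) %>%
      addLegend(pal = pal, values = ~category)

Dropping a widget like this into an R Markdown document gives readers something they can pan and zoom, which is far more compelling than a static screenshot.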

Another tip is to see what comes up when you google your question. If the first results are newspaper articles or blog posts answering exactly the question you were asking, you may want to rethink your approach. Sometimes you can expand upon someone else’s analysis or bring in other data to add another layer to the analysis, but you may need to start the process over again.

Choosing a direction

Building a portfolio doesn’t need to be a huge time commitment. Perfect is definitely the enemy of the good here: something is better than nothing, and employers are first and foremost looking for evidence that you can code and communicate about data. You may be worried that people will look at your code and laugh, or say, “Wow, we thought this person might be okay, but look at their terrible code!” It’s unlikely that this will happen. One reason is that employers tailor their expectations to seniority level: you won’t be expected to code like a computer science major if you’re a beginning data scientist. Generally, the bigger worry is that you can’t code at all.

This is also a good time to think about which area of data science you want to focus on. Do you want to specialize in visualization? Make an interactive graph using D3. Do you want to do natural language processing? Use text data. Machine learning? Predict something.

Use your project to force yourself to learn something new. This kind of hands-on analysis shows you where the holes in your knowledge are. If the data you’re interested in is on the web, you’ll learn web scraping. If you think a particular graph looks ugly, you’ll learn how to tweak ggplot2. If you’re self-studying, it’s a nice way to solve the paralysis of not knowing what to learn next.

A common problem with self-motivated projects is over-scoping: you want to do everything, or you keep adding more as you go. You can always keep improving, editing, and supplementing, but that means you never finish. One strategy is to think like Hollywood: make sequels. Set yourself a question and answer it, but if you think you might want to revisit it later, end your write-up with a question or topic for further investigation, or even a “TO BE CONTINUED…?” if you must.

Another problem is not being able to pivot. Sometimes the data you wanted isn’t available, or there’s not enough of it, or you can’t clean it. This is frustrating, and it can be easy to give up at this point, but it’s worth figuring out what you can salvage. Do you already have enough to write a blog post tutorial, maybe on how you collected the data? Employers look for people who learn from their mistakes and aren’t afraid to admit them. Showing what went wrong so others don’t suffer the same fate is still valuable.

Filling out a GitHub README

Maybe you’re in a bootcamp or a degree program where you’re already doing your own projects. You’ve even committed your code to GitHub. Is that enough?

Nope! A minimal requirement for a useful GitHub repository is filling out the README. Here are a couple of questions to answer:

  • What is the project? For example, what data does it use? What question is it answering? What was the output – a model, a machine learning system, a dashboard, or a report?
  • How is the repository organized? This implies that the repo is organized in some manner! Lots of different systems are out there, but a basic one is splitting your code into separate scripts: getting the data (if relevant), cleaning it, exploring it, and running the final analysis (one possible layout is sketched after this list). This way, people know where to go depending on what they’re interested in. It also suggests that you’ll keep your work organized when you go to a company. A company doesn’t want to risk hiring you and then, when it’s time to hand off a project, having you give someone an uncommented, five-thousand-line script that may be impossible to figure out and use. Good project management also helps future you: if you want to reuse part of the code later, you’ll know where to find it.
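As an illustration, a hypothetical repo for a small R project might look something like this; the file names are made up, and the exact split matters less than having one at all.

    pet-names-analysis/        <- hypothetical project folder
        README.md              <- what the project is, the data source, the results
        data/                  <- raw and cleaned data, or instructions for getting it
        01_get_data.R          <- pull the data from an API or scrape it
        02_clean_data.R        <- fix types, handle missing values
        03_explore.R           <- exploratory plots and summary tables
        04_analysis.R          <- the final model or analysis
        output/                <- figures and the final report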

Although doing a project and making it publicly available on a documented GitHub repo is good, it’s hard to look at code and understand why it’s important. After you do a project, the next step is to write a blog, which lets people know why what you did was cool and interesting. No one cares about pet_name_analysis.R, but everyone cares about “I used R to find the silliest pet names!”

Stay tuned for part 3, in which we will put data science skills on the back burner and dive into strategies for looking for a data science job that fits you.

That’s all for this article. If you want to learn more, check out the book on our browser-based liveBook reader here.