Hirokazu Miyake

Data Science

Data science broadly encompasses ideas and techniques, including machine learning and deep learning, for analyzing data quantitatively and systematically to extract actionable insights. Some of the projects I worked on before I found a job in this area can be found on my GitHub page, and below are a few specific projects with links. Below that, I list some resources I found helpful in learning the technical aspects of data science and in breaking into the field. At the end, I list resources that can help you learn more once you are a "real" data scientist.

GitHub

Select Projects

Learning Resources

The topics that are particularly important for a data science practitioner are 1) computer science/programming, 2) probability and statistics, 3) machine learning, and 4) Structured Query Language (SQL). Below is a list of resources, including Massive Open Online Courses (MOOCs), websites, and books, that I've found particularly useful in learning about these topics and preparing for data science job interviews. I've personally gone through each of the resources and believe each of them will be worth your time, at least for your intellectual enrichment, if not to help you do well in your data science job interviews. They are most appropriate for people who have some college-level math and computer science experience. Many of the resources listed below are accessible online for free.

General Data Science-Related Books

  1. The Signal and the Noise
    Nate Silver, Penguin Books, 2015.
    Readable account of various applications of analytics in the real world.
  2. Thinking, Fast and Slow
    Daniel Kahneman, Farrar, Straus and Giroux, 2011.
    Primarily about behavioral economics but the techniques and lessons are broadly applicable to data science.
  3. Naked Statistics
    Charles Wheelan, W. W. Norton and Company, 2013.
    Nice, readable book on fundamental probability and statistics.
  4. How to Lie with Statistics
    Darrell Huff, W. W. Norton and Company, 1993.
    Short, readable book on the pitfalls of statistics used in the real world.
  5. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy
    Cathy O'Neil, Broadway Books, 2017.
    Describes some of the pitfalls which should ideally be considered in a data science product.

Computer Science and Programming

  1. Think Python
    A concise and excellent introduction to Python and computer science. Web and PDF versions are both free.
  2. MITx: Introduction to Computer Science and Programming Using Python
  3. MITx: Introduction to Computational Thinking and Data Science
    A two-part edX course sequence that teaches both foundational computer science and coding in Python.
  4. Princeton - Algorithms, Part I
  5. Princeton - Algorithms, Part II
    Two-part Coursera course on algorithms. Programming assignments are in Java. Challenging but worth the effort.
  6. Stanford - Algorithms
    Another set of courses on algorithms, covering topics similar to those in the Princeton courses.
  7. https://leetcode.com/
    Definitive resource for coding interview preparation.

Probability and Statistics

  1. MITx: Probability - The Science of Uncertainty and Data
    Great edX MOOC on probability for those without formal college-level coursework on the topic.
  2. MITx: Fundamentals of Statistics
    A continuation of the above course on probability, focusing on statistics in a rigorous yet accessible way.
  3. Udacity A/B Testing
    A course on A/B testing, which is essentially applied hypothesis testing as used in data science.
  4. Statistics, 4th edition by D. Freedman, R. Pisani, and R. Purves, W. W. Norton and Company, 2007.
    Excellent book on statistics. Pretty basic at the beginning, but by the end you will be able to explain in plain English what confidence intervals, hypothesis tests, and p-values really are.
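
To make the link between A/B testing and hypothesis testing concrete, here is a minimal sketch of a two-sided, two-proportion z-test using only the Python standard library. It is not taken from any of the resources above; the function name, the argument names, and the visitor and conversion counts are all hypothetical, and the pooled z-test is just one common way to analyze such an experiment.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a, conv_b: number of conversions in groups A and B (hypothetical names).
    n_a, n_b: number of visitors in groups A and B.
    Returns the z statistic and the two-sided p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate,
    # so the standard error is computed from the pooled rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up example: 100 of 1000 visitors convert on A, 150 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.5f}")
```

A small p-value here means that a difference at least this large would be unlikely if the two variants truly had the same conversion rate. It does not say how likely it is that variant B is better.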

Machine Learning

  1. Stanford - Machine Learning
    Classic, accessible Coursera course on machine learning, from linear regression up to neural networks.
  2. Stanford - Statistical Learning
    Another great course on machine learning. Covers various regression methods as well as topics like cross-validation.
  3. MITx: The Analytics Edge
    This edX course covers many common machine learning algorithms, as well as linear programming. Also serves as a crash course on R.
  4. An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie, and R. Tibshirani, Springer, 2014.
    Excellent introductory book on machine learning that is free online. The statistical learning MOOC above is based on this book. It is more accessible than The Elements of Statistical Learning and is probably sufficient for getting into data science.

Structured Query Language (SQL)

  1. https://www.sqlteaching.com/
    If you are new to SQL, this website gives you a quick, hands-on introduction that can be completed in a few hours. It has no frills but is a good place to start.
  2. http://sqlzoo.net/
    This is a good place to further improve your SQL skills. It has significantly more exercises than the one above.
  3. https://leetcode.com/
    In addition to coding questions, they have SQL questions.

Learning After Becoming a Data Scientist

Congratulations! You are now a data scientist. But this is just the beginning, and you still have a lot to learn. Here are more resources to help you become a more effective data scientist.

  1. Hacker News
    Keep up with the latest news in software.
  2. The Missing Semester of Your CS Education
    Short and sweet overview of many fundamental computer skills not taught in traditional classes.
  3. The Linux Command Line
    This website and the corresponding free PDF book are a great introduction to the power of the command line.
  4. Pro Git
  5. Atlassian Git Tutorials
    Knowing how to use Git is important for working with colleagues on a shared code base. However, mastering Git is hard, and no single resource explains everything well. These two links are reasonably good resources for understanding how to use Git. Otherwise, just search online for the specific thing you want to do.
  6. Coursera Deep Learning Specialization by deeplearning.ai
    This series covers the fundamentals of deep learning, including convolutional and recurrent neural networks.
  7. Stanford CS229 - Machine Learning
    If you found that the first Coursera machine learning course was not rigorous enough, this course by the same instructor goes into more detail.