Data Science
Data science broadly encompasses ideas and techniques,
including machine learning and deep learning, for
analyzing data quantitatively and systematically to
extract actionable insights.
Some of the projects I worked on before I found a job
in this area can be found at
my GitHub page,
and below are a few specific projects with links.
Below that, I list some resources I found helpful in
learning about the technical aspects and breaking into data science.
And at the end, I list resources that can help you learn more
once you are a "real" data scientist.
Select Projects
-
Arxiv Topic Trend Analysis
arXiv
is a preprint server where many physics researchers
post their papers for open access. I used its API
with Python to explore how a few topics have trended
over the years.
-
Movie Recommender with Collaborative Filtering
I implemented a movie recommendation system based
on collaborative filtering in Python. The movie rating
data is taken from the
MovieLens
website.
-
Word Prediction
I developed an n-gram model for word prediction in R.
This was deployed as a
Shiny app
and is written up in an R Jupyter notebook.
-
Strontium Atom Number Variation
Jupyter notebook examining how the atom number
measured with a CCD camera depends on several
observables in our ultracold strontium experiment,
including laboratory temperature and humidity.
-
PCA on Cold Atom Images
Python code for removing unwanted fringes from
cold atom absorption images taken with CCD cameras.
It was written for image analysis in cold atom
experiments and is based on principal component
analysis (PCA).
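To give a flavor of the collaborative filtering idea behind the movie recommender project listed above, here is a minimal sketch in plain Python. It is not the code from that project: the toy ratings are made up (they are not MovieLens data), and the user-based approach with cosine similarity is just one common variant, assumed here for illustration. The function names are hypothetical.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {movie: rating}); not MovieLens data.
ratings = {
    "alice": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 4, "Brazil": 5, "Duck Soup": 2},
    "carol": {"Casablanca": 5, "Duck Soup": 4, "Alien": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, movie, ratings):
    """Similarity-weighted average of other users' ratings for `movie`.

    Returns None when no other user has rated the movie.
    """
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        num += sim * their[movie]
        den += abs(sim)
    return num / den if den else None


def recommend(user, ratings, top_n=3):
    """Rank movies the user has not rated by predicted rating."""
    seen = set(ratings[user])
    candidates = {m for r in ratings.values() for m in r} - seen
    scored = [(m, predict_rating(user, m, ratings)) for m in candidates]
    scored = [(m, s) for m, s in scored if s is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Calling `recommend("alice", ratings)` returns a list of `(movie, predicted_rating)` pairs for movies Alice has not rated yet. A real system would typically also center the ratings per user and handle sparse data more carefully.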
Learning Resources
The topics that are particularly important for a data
science practitioner are 1) computer science/programming,
2) probability and statistics, 3) machine learning, and
4) Structured Query Language (SQL). Below are some
resources, including Massive Open Online Courses (MOOCs),
websites, and books, that I found particularly useful
for learning these topics and preparing for
data science job interviews. I have personally gone
through each of them and
believe each is worth your time,
at least for your intellectual enrichment, if not
for doing well in your data science job interviews.
They are most appropriate for people who have
some college-level math and computer science experience.
Many of the resources
listed below are available online for free.
General Data Science-Related Books
- The Signal and the Noise
Nate Silver, Penguin Books, 2015.
Readable account of various applications of
analytics in the real world.
- Thinking, Fast and Slow
Daniel Kahneman, Farrar, Straus and Giroux, 2011.
Primarily about behavioral economics but the
techniques and lessons are broadly applicable to
data science.
- Naked Statistics
Charles Wheelan, W. W. Norton and Company, 2013.
Nice, readable book on fundamental probability
and statistics.
- How to Lie with Statistics
Darrell Huff, W. W. Norton and Company, 1993.
Short, readable book on the pitfalls of
statistics used in the real world.
- Weapons of Math Destruction: How Big Data Increases
Inequality and Threatens Democracy
Cathy O'Neil, Broadway Books, 2017.
Describes some of the pitfalls that should
ideally be considered when building a data science product.
Computer Science and Programming
-
Think Python
A concise and excellent introduction to Python and computer science.
Web and PDF versions are both free.
-
MITx: Introduction to Computer Science and
Programming Using Python
-
MITx: Introduction to Computational Thinking and
Data Science
A two-part edX course sequence that teaches both
foundational computer science and coding in Python.
-
Princeton - Algorithms, Part I
-
Princeton - Algorithms, Part II
Two-part Coursera course on algorithms.
Programming assignments are in Java. Challenging
but worth the effort.
-
Stanford - Algorithms
Another set of courses on algorithms covering
topics similar to the Princeton ones.
-
https://leetcode.com/
Definitive resource for coding interview preparation.
Probability and Statistics
-
MITx: Probability - The Science of Uncertainty and Data
Great edX MOOC on probability for those without formal
college-level coursework on the topic.
-
MITx: Fundamentals of Statistics
A continuation of the above course on probability,
focusing on statistics in a rigorous yet accessible way.
-
Udacity A/B Testing
A course on A/B testing, which is essentially
applied hypothesis testing as used in data science.
- Statistics, 4th edition by D. Freedman, R. Pisani,
and R. Purves, W. W. Norton and Company, 2007.
Excellent book on statistics. Pretty basic at
the beginning, but by the end you will be able to
explain in plain English what confidence intervals,
hypothesis tests, and p-values really are.
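Since A/B testing is described above as applied hypothesis testing, a small worked example may help make that concrete. The sketch below is a two-sided two-proportion z-test using only the Python standard library. It is one common way to compare conversion rates, not necessarily the method taught in any of the courses listed, and the function name and the example counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a, conv_b: number of conversions in groups A and B.
    n_a, n_b: number of users in groups A and B.
    Returns (z, p_value). Relies on the normal approximation,
    so it assumes reasonably large samples.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Standard normal CDF evaluated at |z|, written in terms of erf.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical example: 100/1000 conversions in A versus 150/1000 in B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The p-value here is the probability of seeing a difference at least this large if the two groups truly had the same conversion rate, which is the plain-English interpretation the Freedman, Pisani, and Purves book builds toward.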
Machine Learning
-
Stanford - Machine Learning
Classic, accessible Coursera course on machine learning,
from linear regression up to neural networks.
-
Stanford - Statistical Learning
Another great course on machine learning.
Covers various regression methods as well as topics
like cross-validation.
-
MITx: The Analytics Edge
This edX course covers many common machine learning
algorithms, as well as linear programming.
Also serves as a crash course on R.
- An Introduction to Statistical Learning by G. James,
D. Witten, T. Hastie, and R. Tibshirani, Springer, 2014.
Excellent introductory book on machine learning
that is freely available online.
The statistical learning MOOC above is based on
this book. It is more accessible than the book
Elements of Statistical Learning and is probably
sufficient for getting into data science.
Structured Query Language (SQL)
-
https://www.sqlteaching.com/
If you are new to SQL, this website gives
you a quick, hands-on introduction that
can be completed in a few hours.
It has no frills but is a good place to start.
-
http://sqlzoo.net/
This is a good place to further improve your
SQL skills. It has significantly more exercises
than the one above.
-
https://leetcode.com/
In addition to coding questions, they have SQL questions.
Learning After Becoming a Data Scientist
Congratulations! You are now a data scientist.
But this is just the beginning, and you still have a lot to learn.
Here are more resources to help you become a more effective
data scientist.
-
Hacker News
Keep up with the latest news in software.
-
The Missing Semester of Your CS Education
Short and sweet overview of many fundamental computer
skills not taught in traditional classes.
-
The Linux Command Line
This website and the corresponding free PDF book are a
great introduction to the power of the command line.
-
Pro Git
-
Atlassian Git Tutorials
Knowing how to use git is important to be able to work
with colleagues on a common code base. However,
mastering git is hard and there is no one resource that
explains everything well. These two links are reasonably good
resources to understand how to use git. Otherwise, just search
online for the specific thing you want to do.
-
Coursera Deep Learning Specialization by deeplearning.ai
This series covers the fundamentals of deep learning,
including convolutional and recurrent neural networks.
-
Stanford CS229 - Machine Learning
If you found that the first Coursera machine
learning course was not rigorous enough,
this course by the same instructor goes into more detail.