Data Science has never been more appealing as a career option than it is now. Says who? Says the Harvard Business Review, to name just one.
The Internet has proliferated into most of our lives. An estimated 42.4% of the world’s population (about 3,079 million people) was already online by December 2014, and the number is projected to rise steeply, thanks largely to cellular devices. This has led to a whole variety of data being collected at an incredible velocity. For instance, YouTube ingests a whopping 300 hours of video every minute. If that doesn’t convince you, consider this: 2.7 zettabytes of data exist in the digital universe today. Zettabytes! How many zeroes is that? (Twenty-one.) Imagine the sheer scale of the logs produced by a popular social networking site like Facebook, or by an e-commerce vendor like Amazon.
The variety and speed of data collection are set to change by orders of magnitude with the next big technological wave: the Internet of Things, or IoT. The number of connected devices already exceeds the number of people in the world, and researchers predict there will be between 30 and 100 billion connected devices by 2020. Some connected devices already making it big are Fitbit, Nest, Google Glass and the iWatch. The Google Car does not seem far away either.
Now that everyone has data, they want to use it to gain useful insights into various aspects of their business: finding influential customers, identifying the products that most often co-occur in transaction baskets, recommending products suited to a customer, or even projecting the revenue expected for the next month. None of this can be achieved without those funky “Data Scientists”.
The impact Data Science already has on everyday life is tremendous. Think “accurate weather forecasts”, think “Netflix recommendations”, think “Facebook friend suggestions”, think “spam detection by Gmail”. Now is the time for you to make a difference, the Data way!
I have been working in this domain for a fair amount of time, in both academic and industrial setups. Based on my experience, I have listed below the concepts, tools, and skills that will be most useful to an aspiring data scientist, along with some online courses and competitions that can be leveraged to learn the concepts and develop the necessary skills.
1. Conceptual foundations for a Data Scientist
While you will mostly be using pre-built tools for analyzing data, you need strong conceptual foundations to leverage these tools effectively. Some of these foundations are listed below.
Data Structures and Algorithms
This is probably a must-have if one wants to solve any computer science problem! A thorough knowledge of data structures, and of choosing the right one, is essential. It helps tremendously when you have to preprocess large volumes of data as part of building a learning system. For instance, if we simply wanted to create a document frequency matrix to feed to a text classification algorithm, we would need to understand which data structure stores a sparse matrix efficiently, and what could go wrong if we loaded everything into a dense matrix representation.
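To make the sparse-versus-dense point concrete, here is a minimal sketch (assuming SciPy is available; the three-document corpus is invented) that builds a document frequency matrix in a sparse representation:

```python
# Sketch: why a sparse representation matters for a document-term matrix.
# Assumes scipy is installed; the tiny corpus below is made up.
from scipy.sparse import lil_matrix

docs = ["the cat sat", "the dog ran", "cat and dog"]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# The sparse matrix stores only the non-zero counts; a dense one would
# allocate len(docs) * len(vocab) cells even though most are zero.
dtm = lil_matrix((len(docs), len(vocab)), dtype=int)
for row, doc in enumerate(docs):
    for word in doc.split():
        dtm[row, index[word]] += 1

print(dtm.nnz, "non-zero entries out of", dtm.shape[0] * dtm.shape[1])
```

For a real corpus with, say, 100,000 documents and a 50,000-word vocabulary, the dense version would need five billion cells, almost all of them zero.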
Know your maths and statistics
It’s imperative to understand basic maths and statistics, and the associated tools and techniques, to be a good data modeler. The fundamental roots of many machine learning algorithms also lie in maths and statistics. If you know optimization too, that would be awesome.
Machine Learning and Data Mining
There are several machine learning and data mining techniques that you need to understand to be a good data scientist. Some of the commonly used techniques include classification, clustering, regression, and frequent pattern mining.
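As a taste of what classification involves, here is a toy nearest-neighbour classifier written from scratch, using only the standard library (the points and labels are invented):

```python
# Sketch: classification via a tiny 1-nearest-neighbour classifier.
def nearest_neighbour(train, point):
    """Return the label of the training example closest to `point`."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda ex: dist(ex[0], point))[1]

# (features, label) pairs forming two loose clusters.
train = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
         ((8.0, 9.0), "large"), ((9.0, 8.5), "large")]

print(nearest_neighbour(train, (1.1, 1.1)))
print(nearest_neighbour(train, (8.5, 9.0)))
```

Real classifiers are far more sophisticated, but the core idea of labelling a new point from labelled examples is exactly this.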
Big Data and Distributed Systems
How do you think data is stored and processed by applications like Google, Facebook, YouTube and Twitter? Do they store and process it all on a single server? Of course not: that would neither let the system scale as the data grows, nor provide robustness and reliability. While distributed systems have been in research and development for several years, it is Hadoop that ushered in a revolution in this space, and the space has kept evolving rapidly, with fantastic open source systems appearing every few months. Nowadays, extremely cool systems like Apache Spark for scalable online computations and Apache Giraph for large scale graph computations are becoming very popular. Knowing the basics of these is a great asset. Remember, the more hands-on you are, the better. So go and explore these technologies. Solve a toy problem or two to start with, realize their potential, and adapt to them, because trust me, you have to think different.
Data Visualization
As a data scientist you can see and understand complex models and statistics, but you also have to make the people in the board room (and other lay audiences) understand the implications of your work. Creating easy-to-understand visualisations is one way to achieve this. Many Python and R packages can be used for the purpose; D3 is another popular option.
Communication Skills
Again, it’s critical to be able to communicate your thoughts to others who may not be familiar with data science jargon. Communication is a must when interacting with clients and with peers in your own organization. Explain to them the magic you can do.
Creative Problem Solving
You have to be adept at this. People will throw all kinds of data at you and ask you to wave your magic wand (you are their Harry Potter, after all) and give them insights. You MUST be creative in thinking about the kinds of problems you can solve with the data, and how. This mapping of a business problem onto machine learning algorithms is a critical piece of the puzzle, and you have to train yourself to be good at it.
2. The Technology Platforms
A brief (by no means exhaustive) list of technologies that are handy to know:
Scripting Languages
While you will use pre-built libraries for various machine learning and data mining algorithms, you will need to stitch them together into an integrated solution for the end user, and for that you need a programming language. There are many options, but it is scripting languages like Python and R that have gained the most ground in this space. I personally prefer Python. Also remember that there are many add-on libraries and packages available for both Python and R; use them for data preprocessing, data visualisation, or complex computations.
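A sketch of the kind of glue scripting this involves, using only Python’s standard library (the sales records below are invented):

```python
# Sketch: everyday glue scripting — parse raw records, skip dirty rows,
# and aggregate. Standard library only; the sales data is made up.
import csv
import io
from collections import defaultdict

raw = """region,amount
north,100
south,not_available
north,250
south,80
"""

totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    try:
        totals[row["region"]] += float(row["amount"])
    except ValueError:
        continue  # skip rows whose amount is not a number

print(dict(totals))
```

Real pipelines read from files or databases rather than an inline string, but the clean-then-aggregate shape is the same.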
SQL and NoSQL Databases
What is the critical resource for a data science application? Data. It will typically be huge, and we need a persistent store to leverage it.
For several decades, relational databases (also called SQL databases) like Oracle, Microsoft SQL Server, MySQL and Postgres have been very popular. If you have structured data, an SQL database is a good choice for storing it. You could use MySQL, as it is open source and easy to install and use. Writing efficient queries involving joins and nested subqueries comes in very handy and should be mastered.
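Here is a runnable sketch of a join combined with a nested subquery, using Python’s built-in sqlite3 module (the schema and rows are invented for illustration):

```python
# Sketch: a JOIN plus a nested subquery against an in-memory SQLite
# database. Table names and rows are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 40.0);
""")

# Customers whose total spend exceeds the average total spend.
rows = con.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
    HAVING SUM(o.amount) > (SELECT AVG(t) FROM
        (SELECT SUM(amount) AS t FROM orders GROUP BY customer_id))
""").fetchall()

print(rows)
```

The same query would run essentially unchanged against MySQL or Postgres; SQLite just makes it easy to practise without a server.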
NoSQL databases have also become very popular over the last few years for storing large volumes of unstructured data, especially for non-transactional applications. MongoDB has emerged as a popular NoSQL database. Another category that has emerged over the same period is graph databases, such as Neo4j. There is a whole array of non-relational databases out there, but one should select carefully and justify the need thoroughly.
Big Data Technologies
As discussed earlier, modern applications like social computing and IoT generate huge amounts of data that cannot be handled by conventional databases and need new distributed approaches to storage and processing.
Hadoop, the paradigm-shifting data analysis technology, has become the go-to solution for businesses needing quick and reliable processing of growing datasets at a fraction of the time and cost of previous technologies. With Hadoop, no data is too big: Pete Warden famously used it to analyse 220 million public Facebook profiles in about 11 hours, at a total cost of around $100 (Facebook later threatened legal action over the crawl!). To sum it up, Hadoop is nothing less than the Excalibur in a data scientist’s toolkit.
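The map and reduce phases at the heart of Hadoop can be sketched in plain Python on a toy corpus; a real job would of course run distributed over HDFS:

```python
# Sketch: word count in the MapReduce style, simulated in one process.
# A real Hadoop job distributes each phase across many machines.
from itertools import groupby
from operator import itemgetter

lines = ["big data big ideas", "data beats opinion"]

# Map phase: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: bring all pairs with the same key together,
# exactly as the framework does between map and reduce.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each word.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(mapped, key=itemgetter(0))}

print(counts)
```

Once this three-phase shape is familiar, reading real Hadoop (or Spark) programs becomes much easier.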
Apache Spark, the open source cluster computing system, is simple and elegant, and offers order-of-magnitude performance improvements over Hadoop MapReduce for many workloads. It offers rich built-in libraries and simple APIs in Python, SQL, Scala and Java, and supports other types of computation such as stream processing and interactive queries.
Apache Giraph offers distributed graph storage and processing, and is designed for scalability. Facebook currently uses it to study the social networks formed by its users and their friends. Giraph is based on Pregel, Google’s own graph processing system, and the legendary research paper describing Pregel is the first place to go.
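Pregel’s “think like a vertex” model can be sketched in plain Python: every vertex repeatedly adopts the smallest label among itself and its neighbours, one superstep per loop iteration, which computes connected components on a toy graph (a real Giraph job distributes this across workers):

```python
# Sketch: Pregel-style label propagation for connected components.
# Toy undirected graph as an adjacency list; labels converge to the
# smallest vertex id in each component.
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}

labels = {v: v for v in edges}   # each vertex starts as its own label
changed = True
while changed:                   # one loop iteration = one superstep
    changed = False
    for v, neighbours in edges.items():
        best = min([labels[v]] + [labels[n] for n in neighbours])
        if best < labels[v]:
            labels[v] = best
            changed = True

print(labels)
```

Vertices 1–3 end up in one component and 4–5 in another; in Giraph the same per-vertex update runs in parallel with messages standing in for the neighbour lookups.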
Machine Learning and Statistics
There are multiple packages and tools available for machine learning and statistics, from scikit-learn, NumPy, and SciPy in Python to Weka in Java; R has most of these capabilities built in. In my experience, scikit-learn works quite well.
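To show how little code a scikit-learn model needs, here is a sketch that trains a k-nearest-neighbours classifier on the iris dataset bundled with the library (assuming scikit-learn is installed):

```python
# Sketch: train/test split, fit, and evaluate in a few lines.
# Assumes scikit-learn is installed; iris ships with the library.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```

Swapping in a different algorithm is usually a one-line change, which is a big part of scikit-learn’s appeal.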
Data Visualization Tools
You can try matplotlib in Python for data visualization, or ggplot2, plotly, and googleVis in R. Tableau can be another very interesting option, but be aware of its learning curve and establish your requirements beforehand.
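A minimal matplotlib sketch (assuming matplotlib is installed; the revenue figures are invented) that renders a chart to a file without needing a display:

```python
# Sketch: a simple bar chart saved to disk. The Agg backend lets this
# run on a server with no screen; the revenue numbers are made up.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 160]

fig, ax = plt.subplots()
ax.bar(months, revenue)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousands)")
ax.set_title("Monthly revenue")
fig.savefig("revenue.png")
```

A chart like this, dropped into a slide, often lands a point better than any table of numbers.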
To summarize: SQL is awesome, and you can do a lot with it. Python and R rock. My personal preference is Python with scikit-learn.
3. Where do I learn the concepts and skills?
One obvious approach is to join a Masters program that offers a specialization in data science. However, many of us may not have the time or financial flexibility to join such a program. What are the alternatives?
Since most of us are avid web users (and know that there is more to the Internet than Facebook and WhatsApp!), there is a wide array of online resources for learning Data Science, most of which cost nothing at all.
Given below are links to some very useful articles, in which the respective authors have summarised online resources available for these areas.
4. How can I practice?
While the online courses offer their own assignments and projects, you can go a step further and work on the challenging, real-life problems offered by online challenge platforms.
Some of the leading platforms for data science challenges are listed below for your reference.
5. Where to find data on which I can experiment?
One of the important resources for practising your data science skills is the availability of large datasets from different domains.
Some of the popular datasets that are publicly available are listed below.
- SNAP contains data which could be used for large network analysis such as influencer mining, community detection etc.
- UCI Repository contains a variety of datasets that can be used for different kinds of Machine Learning problems.
- MovieLens Repository contains user and movie rating data for recommendation engine related problems.
- Synthetic Graph Generator can be used to generate synthetic graphs for problems like subgraph mining. Other such tools are available as well.