Some Data Science Interview Questions

See how well you would do in a Data Science interview with the following 10 interview questions!
The questions were collected mainly from the Edureka free learning platform; some have been modified.


1. What do you understand by selection bias?
Selection bias is a systematic error that arises when the sample used in a study is not representative of the population it is meant to describe, usually because of how the sample was chosen.
For example, imagine a vote prediction model that incorrectly forecasts the winner. This could happen if most of the voters surveyed were high-income individuals, which skews the results toward the preferences of that group.
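
To make the voting example concrete, here is a minimal sketch with made-up numbers. The group sizes and support shares are assumptions chosen for illustration, not real polling data, and `support_share` is a hypothetical helper.

```python
def support_share(groups):
    """Share of voters supporting candidate A.

    groups: list of (number_of_voters, share_supporting_A) pairs.
    """
    total = sum(n for n, _ in groups)
    return sum(n * share for n, share in groups) / total

# Assumed population: 20% high-income voters (70% support A),
# 80% other voters (40% support A).
population = [(200, 0.70), (800, 0.40)]

# Assumed survey that over-samples high-income voters (80% of respondents).
biased_sample = [(80, 0.70), (20, 0.40)]

true_share = support_share(population)         # about 0.46: A loses
surveyed_share = support_share(biased_sample)  # about 0.64: survey says A wins

print(f"true support: {true_share:.2f}, surveyed support: {surveyed_share:.2f}")
```

The survey is wrong not because of random noise but because of who was asked, which is why collecting more responses the same way would not fix it.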


2. What is the difference between Type I and Type II errors?
Follow-up: Is it better to have too many Type I or Type II errors in a solution?

  • Type I error (false positives): Claiming something has happened when, in reality, it hasn't.
  • Type II error (false negatives): Claiming nothing has happened when, in fact, something has.

Follow-up answer: It depends on the problem and domain. For example, in fire alarm systems, it's better to have more false positives (Type I) than false negatives (Type II), because a false alarm costs far less than a missed fire.
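
The trade-off can be sketched with a toy fire alarm. The sensor readings, labels, and thresholds below are invented for illustration, and `count_errors` is a hypothetical helper.

```python
def count_errors(readings, is_fire, threshold):
    """Return (false_positives, false_negatives) for an alarm that
    triggers whenever a reading is at or above the threshold."""
    fp = sum(1 for r, fire in zip(readings, is_fire) if r >= threshold and not fire)
    fn = sum(1 for r, fire in zip(readings, is_fire) if r < threshold and fire)
    return fp, fn

# Assumed smoke-sensor readings and whether there really was a fire.
readings = [0.1, 0.3, 0.35, 0.5, 0.6, 0.8, 0.9]
is_fire = [False, False, True, False, True, True, True]

sensitive = count_errors(readings, is_fire, threshold=0.3)  # more Type I errors
strict = count_errors(readings, is_fire, threshold=0.7)     # more Type II errors

print("sensitive alarm (FP, FN):", sensitive)
print("strict alarm (FP, FN):", strict)
```

Lowering the threshold trades missed fires for false alarms, which is usually the right direction for a safety system.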


3. You are working with a time series data set. Your goal is to build a high-accuracy model. You start with a decision tree algorithm, knowing it works fairly well on all sorts of data. Later, you try a time series regression model and get higher accuracy. Can this happen? Why?
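Yes, it can. This question tests whether you understand when each model is a good fit. A decision tree predicts piecewise-constant values and cannot extrapolate beyond the range of targets it saw during training, so it struggles with trends. A time series regression model can capture a roughly linear trend directly. This does not mean time series data is always linear: the regression model gives more robust predictions only if the dataset satisfies its linearity assumptions, and a tree-based model may still do better when the relationships are strongly non-linear.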
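
One situation where this happens is a series with a steady trend. The sketch below uses a tiny hand-written regression tree and a least-squares line on synthetic data; both are simplified stand-ins written for illustration (not a library implementation), and the series `y = 2t + 1` is an assumption.

```python
from statistics import mean

def fit_line(ts, ys):
    """Ordinary least squares for y = slope * t + intercept."""
    mt, my = mean(ts), mean(ys)
    slope = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
             / sum((t - mt) ** 2 for t in ts))
    return slope, my - slope * mt

def fit_tree(ts, ys, depth=3):
    """Toy 1-D regression tree; assumes ts is sorted in increasing order."""
    if depth == 0 or len(ts) < 2:
        return mean(ys)  # leaf: predict the average training target

    def sse(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    # Choose the split position that minimises the total squared error.
    i = min(range(1, len(ts)), key=lambda k: sse(ys[:k]) + sse(ys[k:]))
    split = (ts[i - 1] + ts[i]) / 2
    return (split,
            fit_tree(ts[:i], ys[:i], depth - 1),
            fit_tree(ts[i:], ys[i:], depth - 1))

def predict_tree(tree, t):
    # Walk down until a leaf (a plain number) is reached.
    while isinstance(tree, tuple):
        split, left, right = tree
        tree = left if t <= split else right
    return tree

# Synthetic trending series: train on t = 0..15, forecast t = 19.
train_t = [float(t) for t in range(16)]
train_y = [2.0 * t + 1.0 for t in train_t]
future_t = 19.0

slope, intercept = fit_line(train_t, train_y)
line_forecast = slope * future_t + intercept
tree_forecast = predict_tree(fit_tree(train_t, train_y), future_t)

print("true value:", 2.0 * future_t + 1.0)
print("line forecast:", line_forecast)
print("tree forecast:", tree_forecast)
```

The tree's forecast can never exceed the largest target it saw in training, so it falls behind as the trend continues, while the line follows it.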


4. Name some Python libraries for data analysis and scientific computation.
Some key libraries are:

  • NumPy
  • SciPy
  • pandas
  • scikit-learn
  • Matplotlib
  • Bokeh
  • Seaborn

Use cases:

  • Quick analysis: Matplotlib
  • Publishing/presentation: Bokeh
  • In-depth analysis: Seaborn

5. You are given a dataset consisting of variables with more than 30% missing values. For example, out of 50 variables, 8 have more than 30% missing values. How do you deal with them?

  • Remove the variables if they are not important to the analysis.
  • Check how the missing values are distributed with respect to the target variable. If the missingness follows a pattern, treat "missing" as its own category and keep the variable; otherwise remove it.
  • Impute the missing values using similar or correlated variables in the dataset.
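
The first steps of that workflow can be sketched in plain Python. The column names and values are invented, the 30% cut-off comes from the question, and median imputation is a simple stand-in for imputing from similar variables.

```python
from statistics import median

def missing_fraction(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def impute_with_indicator(values):
    """Fill gaps with the median and keep a 0/1 flag marking what was missing."""
    fill = median(v for v in values if v is not None)
    imputed = [fill if v is None else v for v in values]
    indicator = [int(v is None) for v in values]
    return imputed, indicator

# Hypothetical dataset: column name -> values, None marks a missing entry.
data = {
    "age": [25, None, 40, None, 31],
    "income": [50, 60, None, 55, 65],
}

# Flag the variables with more than 30% missing values.
sparse_columns = [name for name, col in data.items() if missing_fraction(col) > 0.30]

# Keep a flagged variable by imputing it and recording where it was missing.
age_imputed, age_was_missing = impute_with_indicator(data["age"])

print(sparse_columns, age_imputed, age_was_missing)
```

The indicator column preserves any pattern in the missingness, so it can still be checked against the target variable after imputation.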

6. What is a Restricted Boltzmann Machine (RBM)?
A Restricted Boltzmann Machine (RBM) is a generative stochastic artificial neural network that learns a probability distribution over its set of inputs.

Applications:

  • Dimensionality reduction
  • Collaborative filtering
  • Feature learning

Structure:

  • It has two layers:
    • Visible layer (input layer)
    • Hidden layer

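Key restriction: No intra-layer connections. Nodes within the same layer are not connected to each other; connections exist only between the visible and hidden layers. Each node processes its input and makes a stochastic decision about whether to transmit it.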
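
The stochastic decision can be sketched as follows, assuming binary units and the usual sigmoid activation probability. The weights, biases, and input below are arbitrary illustrative values, and training (for example with contrastive divergence) is not shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probabilities(visible, weights, hidden_bias):
    """P(h_j = 1 | v) for each hidden unit j.

    weights[i][j] connects visible unit i to hidden unit j; there are
    no visible-visible or hidden-hidden weights (the "restriction").
    """
    return [
        sigmoid(hidden_bias[j] + sum(v * weights[i][j] for i, v in enumerate(visible)))
        for j in range(len(hidden_bias))
    ]

def sample_units(probabilities, rng):
    """Each unit turns on with its own probability, independently."""
    return [1 if rng.random() < p else 0 for p in probabilities]

# Assumed toy RBM: 3 visible units, 2 hidden units.
visible = [1, 0, 1]
weights = [[1.0, -1.0],
           [0.2, 0.3],
           [1.0, -1.0]]
hidden_bias = [0.0, 0.0]

probs = hidden_probabilities(visible, weights, hidden_bias)
hidden_sample = sample_units(probs, random.Random(0))

print("activation probabilities:", probs)
print("sampled hidden states:", hidden_sample)
```

Because hidden units are not connected to each other, each one can be sampled independently given the visible layer.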


7. Is it recommended to use ReLU or a linear activation in the hidden layers of a neural network? Why?
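ReLU is recommended because:

  • A linear activation in the hidden layers collapses the whole network into a single linear model, so it cannot learn non-linear relationships.
  • Compared with other non-linear activations such as sigmoid or tanh, ReLU does not saturate for positive inputs, so it is less prone to killing gradients.
  • Sigmoid outputs are not zero-centered. Tanh is preferred over sigmoid for this reason.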

Why do we use non-linearity in a neural network?
With linear activation functions, a neural network is equivalent to a single linear layer, no matter how many layers it has, because a composition of linear maps is itself linear. Since most real-world problems are non-linear, non-linear activation functions are needed for the network to model the incoming data effectively.
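
The collapse can be checked numerically. The sketch below uses two small weight matrices with arbitrary values and no biases, for simplicity.

```python
def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(v):
    return [max(0.0, x) for x in v]

# Arbitrary weights for a two-layer network.
W1 = [[1.0, -2.0],
      [0.5, 1.0]]
W2 = [[2.0, 1.0],
      [-1.0, 3.0]]
x = [1.0, 2.0]

# Two linear layers applied one after the other...
two_linear_layers = matvec(W2, matvec(W1, x))
# ...give the same result as one layer with weights W2 @ W1.
single_layer = matvec(matmul(W2, W1), x)
# With ReLU in between, the network is no longer one linear map.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_linear_layers, single_layer, with_relu)
```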


8. What is a Confusion Matrix?
A Confusion Matrix is a table used for measuring the performance of a classification algorithm.

Example:
For a binary classifier, the table contains:

  • True Positives (TP)
  • False Negatives (FN)
  • False Positives (FP)
  • True Negatives (TN)
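
As a small illustration, the four cells can be counted directly from true and predicted labels. The labels below are made up, and `confusion_counts` is a hypothetical helper rather than a library function.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
    }

# Invented example labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_counts(y_true, y_pred)
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])

print(cm, accuracy, precision, recall)
```

The FP and FN cells correspond to the Type I and Type II errors from question 2.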

9. What is the difference between inductive and deductive learning?
The key difference lies in their direction of reasoning:

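  • Inductive learning: Observation → Conclusion. The learner generalizes from specific examples to a general rule.
  • Deductive learning: Conclusion → Observation. The learner starts from a general rule and applies it to specific cases.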

Diagram:
Inductive: Data → (induction) → Model → (deduction) → Prediction
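
The two steps in the diagram can be sketched with a toy example. The observations and the "line through the origin" model are assumptions made for illustration.

```python
def induce_rule(observations):
    """Induction: generalize from (x, y) observations to a rule y = slope * x
    using a least-squares fit through the origin."""
    return (sum(x * y for x, y in observations)
            / sum(x * x for x, _ in observations))

def deduce(slope, x):
    """Deduction: apply the general rule to a specific new case."""
    return slope * x

observations = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # Data
slope = induce_rule(observations)                     # Model
prediction = deduce(slope, 5.0)                       # Prediction

print(slope, prediction)
```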


10. What is a Capsule Neural Network?

A Capsule Neural Network is a newer type of neural network that aims to improve on CNNs by taking the spatial hierarchies between features into account.