• Antoreep Jana

Understanding Various Sampling Methods

Sampling, in layman's terms, means selecting or reshaping a subset of the data so that its distribution suits the task at hand. It is an important topic in Data Science that often gets less attention than it deserves. Let's talk a bit about it.

Sampling differs from Feature Selection: Feature Selection decides which features (columns) to keep, whereas Sampling decides which samples (rows) to keep and how to redistribute them across classes.

E.g. when a particular class/label has far fewer or far more samples than the others, resampling is a common way to rebalance the data.

There are many ways to perform data sampling:

1) Random Undersampling
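     Randomly drops samples from the over-represented (majority) class until the classes are balanced. The imblearn package provides RandomUnderSampler for this. Tomek links, also available in imblearn, are a related but non-random undersampling method that removes majority samples lying right next to minority samples.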
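To make the idea concrete, here is a minimal pure-Python sketch of random undersampling. It is not the imblearn implementation; the function name `random_undersample` and the toy data are assumptions made up for illustration.

```python
import random
from collections import Counter, defaultdict

def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class.

    Hypothetical helper for illustration only (not part of imblearn).
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    # Every class is cut down to the size of the smallest class.
    n_keep = min(len(rows) for rows in rows_by_class.values())

    X_out, y_out = [], []
    for label, rows in rows_by_class.items():
        X_out.extend(rng.sample(rows, n_keep))  # drawn without replacement
        y_out.extend([label] * n_keep)
    return X_out, y_out

# Toy imbalanced data: 90 rows of class 0, 10 rows of class 1.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))  # both classes now have 10 rows
```

The price of this approach is that 80 majority rows are thrown away, which is why it tends to suit cases where the majority class has data to spare.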

2) Random Oversampling
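     Randomly duplicates samples from the under-represented (minority) class until the classes are balanced. The imblearn package provides RandomOverSampler for this. SMOTE, also in imblearn, is a common alternative that creates synthetic minority samples by interpolating between neighbours instead of duplicating existing ones.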
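A rough pure-Python sketch of random oversampling could look like the following. It only duplicates existing rows (it does not synthesize new points the way SMOTE does), and the name `random_oversample` and the toy data are assumptions for illustration, not imblearn's API.

```python
import random
from collections import Counter, defaultdict

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class.

    Hypothetical helper for illustration only (not part of imblearn).
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    # Every class is grown to the size of the largest class.
    n_target = max(len(rows) for rows in rows_by_class.values())

    X_out, y_out = [], []
    for label, rows in rows_by_class.items():
        # Extra rows are drawn with replacement from the class's own rows.
        extra = rng.choices(rows, k=n_target - len(rows))
        X_out.extend(rows + extra)
        y_out.extend([label] * n_target)
    return X_out, y_out

# Toy imbalanced data: 90 rows of class 0, 10 rows of class 1.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 90 rows
```

No information is lost here, but the duplicated minority rows can encourage overfitting, so oversampling is normally applied to the training split only.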

3) Simple Random Sampling
     A simple subset selection where each element has an equal probability of being selected.
     e.g. with a pandas DataFrame df: subset = df.sample(100)

4) Stratified Sampling
     Stratified sampling preserves the class proportions of the full dataset in every sampled subset, so no class ends up over- or under-represented by chance.
     e.g. with scikit-learn: train_test_split(X, y, stratify=y, test_size=0.2)
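To show what stratification does under the hood, here is a simplified pure-Python sketch of a stratified train/test split. It is not scikit-learn's algorithm (which allocates rows per class more carefully); the function name `stratified_split` and the toy data are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(X, y, test_size=0.2, seed=0):
    """Split each class separately so train and test keep the class proportions.

    Hypothetical helper for illustration only (not scikit-learn's implementation).
    """
    rng = random.Random(seed)

    # Collect the row indices belonging to each class.
    idx_by_class = defaultdict(list)
    for i, label in enumerate(y):
        idx_by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in idx_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data with an 80/20 class ratio.
X = list(range(100))
y = ["a"] * 80 + ["b"] * 20

X_train, X_test, y_train, y_test = stratified_split(X, y, test_size=0.2)
print(Counter(y_train), Counter(y_test))  # both keep the 80/20 ratio
```

Because every class is split on its own, the 80/20 ratio of the full dataset carries over to both the training and the test set.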

5) Reservoir Sampling
     Reservoir sampling draws a fixed-size random sample from a data stream whose length is unknown in advance or too large to hold in memory. It keeps a "reservoir" of k items and replaces entries with decreasing probability as new items arrive, so that every item seen so far is equally likely to be in the sample. It is used in Big Data applications where a large stream of input flows in, must be processed in a single pass, and estimates are then computed from the sample.
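A minimal sketch of the classic single-pass variant (often called Algorithm R) is shown below. The function name `reservoir_sample`, the `seed` parameter and the example stream are assumptions chosen for illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i + 1 should be kept with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# The generator is consumed one item at a time and never stored in full.
stream = (x * x for x in range(100_000))
sample = reservoir_sample(stream, k=5, seed=42)
print(sample)
```

Only the k reservoir slots are kept in memory, however long the stream is. If the stream holds fewer than k items, the function simply returns all of them.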