Understanding Various Sampling Methods
Sampling, in layman's terms, means customizing the data distribution to fit a purpose or requirement. It is an important topic in Data Science that is often given the least importance. Let's talk a bit about it.
Sampling differs from Feature Selection: Feature Selection deals with particular features, whereas Sampling deals with the classes of the samples and how to reshuffle/redistribute them.
E.g. when a particular class/label has far fewer (or far more) samples than the others, sampling is the way to go.
There are many ways to perform data sampling:
1) Random Undersampling: use the imblearn package. Tomek links are a common way to undersample the majority class: pairs of nearest-neighbour samples from opposite classes are found, and the majority-class member of each pair is removed.

2) Random Oversampling: use the imblearn package. SMOTE is a common method for oversampling the minority class; it synthesizes new samples between existing minority samples.

3) Simple Random Sampling: a subset selection where each element has an equal probability of being selected. e.g. subset = df.sample(100)

4) Stratified Sampling: the class labels are taken into account while sampling, so that each class keeps its original proportion in the subset. e.g. train_test_split(X, y, stratify=y, test_size=0.2)

5) Reservoir Sampling: doesn't require knowing the length of the data in advance. Instead, it draws a fixed-size sample from a (potentially infinite) data stream, using probability to give every element an equal chance of selection. Used in Big Data applications where a large stream of input flows in and close approximations are made.
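The under/oversampling methods above can be sketched with imblearn. This is a minimal sketch, assuming the imbalanced-learn package is installed; the dataset here is a hypothetical toy one built with scikit-learn's make_classification, not anything from the article.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

# Undersampling: remove the majority-class member of each Tomek-link pair.
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print("after Tomek links:", Counter(y_tl))

# Oversampling: SMOTE synthesizes minority samples until classes are balanced.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_sm))
```

Note that Tomek links only trim ambiguous boundary points from the majority class, while SMOTE (by default) grows the minority class until it matches the majority.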
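Reservoir sampling, unlike the other methods, has no one-line library call in the examples above, so here is a minimal sketch of the classic Algorithm R; the function name and parameters are my own choices for illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: draw a uniform random sample of k items from a
    stream of unknown (possibly unbounded) length, using O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1): pick a random
            # slot in [0, i]; if it falls inside the reservoir, replace.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 10 items from a "stream" of a million without storing it all.
sample = reservoir_sample(range(1_000_000), k=10, seed=42)
print(sample)
```

Because only the reservoir (and the current item) is ever held in memory, this works on streams far too large to fit in RAM, which is exactly the Big Data use case mentioned above.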