R Rattle Preprocessing Data | R With NO Coding

Preprocess data in R with no coding. Rattle is a GUI front end for R that is used for data analysis. This video is part of a series, Data Analysis made Easy in R Using Rattle. Some of the other videos that you may find useful include:

Installing R: How to Install R on Windows
Installing Rattle: How to Install Rattle on Windows
Upload and Explore Data in Rattle: Explore Data in R With No Coding | Up...

The 2 data sets used in this video were:
Test.csv: https://bit.ly/3huuOVg
CatData.csv: https://bit.ly/3itWCdL

Exporting data
After data transformation, you can export your new dataset. In the main menu, press the Export button; this will open a dialog window. Choose a directory and a filename and save.
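For readers curious about what an export amounts to in plain R, here is a minimal sketch. It assumes the export is a CSV file. The data frame `transformed` and the file name are hypothetical stand-ins, and this is not necessarily the code Rattle itself runs.

```r
# Hypothetical data frame standing in for the transformed dataset.
transformed <- data.frame(id = 1:3, score = c(10.5, 8.2, 9.9))

# Choose a directory and a filename; tempdir() is used only so the
# sketch runs on any machine.
out_path <- file.path(tempdir(), "transformed.csv")

# Save the data as CSV without the automatic row-name column.
write.csv(transformed, out_path, row.names = FALSE)

# Read the file back to confirm that it was written as expected.
check <- read.csv(out_path)
```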

Cleaning Data
The Cleanup option in the Transform tab allows you to delete columns and observations from your dataset.

The following are the different available cleanup options:
• Delete Ignored: This will delete the variables marked as Ignore.
• Delete Selected: This will delete the selected variables.
• Delete Missing: This will delete every variable that has any missing values.
• Delete Obs with Missing: This will delete the observations (rows) that have missing values in the selected variable.
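The cleanup options above can be approximated in base R as follows. This is only a sketch of the idea, not Rattle's own code. The data frame `df`, its column names, and the choice of `notes` as the ignored variable and `age` as the selected variable are all made up for illustration.

```r
# Hypothetical dataset with some missing values (NA).
df <- data.frame(
  id    = 1:5,
  age   = c(23, NA, 31, 45, 52),
  city  = c("A", "B", NA, "D", "E"),
  notes = c("x", "y", "z", "w", "v"),
  stringsAsFactors = FALSE
)

# Delete Ignored / Delete Selected: drop the named columns.
ignored <- c("notes")
no_ignored <- df[, !(names(df) %in% ignored), drop = FALSE]

# Delete Missing: drop every column that contains at least one NA.
no_missing_vars <- df[, colSums(is.na(df)) == 0, drop = FALSE]

# Delete Obs with Missing: drop the rows where the selected
# variable (here 'age') is NA.
no_missing_obs <- df[!is.na(df$age), , drop = FALSE]
```

Note how different the results are: deleting variables with missing values keeps all five rows but only the `id` and `notes` columns, while deleting observations keeps every column but loses a row.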

Indicator Variables
1. Some algorithms (like many clustering models) only work with numeric variables. A simple way to convert a categorical variable into numeric form is to create indicator variables: one 0/1 variable per category.
2. In Rattle, the Transform tab has an Indicator Variable option. To apply this transformation, select the categorical variable you want to convert, select Recode, then select Indicator Variable, and click Execute. Rattle will create one new variable for each category of the selected variable.
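To make the idea of indicator variables concrete, here is a small base R sketch. The `colour` variable is invented for the example, and `model.matrix()` is just one common way to build indicator columns in R. The columns Rattle creates may be named differently.

```r
# Hypothetical categorical variable with three categories.
df <- data.frame(colour = factor(c("red", "green", "blue", "green")))

# The "- 1" removes the intercept, so every category gets its own
# 0/1 column instead of one category being treated as the baseline.
indicators <- model.matrix(~ colour - 1, data = df)

# Each row has a 1 in exactly one column: the column for its category.
print(indicators)
```

The result has one column per category (`colourblue`, `colourgreen`, `colourred`), which numeric-only algorithms can then use.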

Partitioning the Data
1. Select the Data tab and ensure Partition is ticked.
2. In the text box next to Partition, change the numbers to 60/20/20 (60% train, 20% validate, 20% test).
3. The first number, 60, is the percentage of the data used for training. The second number, 20, is the percentage used for validating. The last number, 20, is the percentage used for testing. You can use any percentages for training, validating and testing, provided they add up to 100.
4. Note that in many situations the data is split only into training and testing sets.
5. The Seed number initializes the random number generator, so the data is selected randomly but reproducibly: the same seed gives the same partition every time. You can use any number you prefer.
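The partitioning steps above can be sketched in base R like this. It is an illustration under assumptions, not Rattle's exact procedure: the seed value 42 and the row count of 100 are arbitrary choices for the example.

```r
# Fix the seed so the random split is the same on every run.
set.seed(42)

n <- 100          # hypothetical number of observations
idx <- sample(n)  # random permutation of the row numbers 1..n

# Sizes for a 60/20/20 split; the test set takes whatever is left.
n_train <- round(0.60 * n)
n_valid <- round(0.20 * n)

train    <- idx[seq_len(n_train)]
validate <- idx[n_train + seq_len(n_valid)]
test     <- idx[(n_train + n_valid + 1):n]
```

Each of `train`, `validate` and `test` holds row numbers, so `df[train, ]` would give the training rows of a data frame `df`. Re-running the sketch with the same seed reproduces the same three sets. Changing the seed gives a different, but equally valid, random split.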
