Eliminating duplicated rows in R: Speeding up unique()

Tags: rstats, data.table, unique, duplicated cases, duplicated rows, statistics, Base R, benchmark, bench

A client recently told me she was eliminating duplicated rows from a large dataset using Base R's unique() function. She achieved the desired result, but her script ran too slowly.

I came up with two ideas to speed it up:
1. Instead of applying unique() to the whole dataset, check only a subset of columns. Even if you don't have an ID column, a combination of a few variables should suffice to identify duplicated rows. Base R's unique() has no argument for restricting the comparison to certain columns of a data frame, so we switch to duplicated() on those columns and keep the rows it does not flag.
2. Use data.table's unique() method instead of Base R's. To do that, just pass a data.table object to unique(). To use a subset of columns, we can conveniently supply the by argument.

Combining these two ideas gives us four options to benchmark: Base R or data.table, each applied either to all columns or to a subset of columns.
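The four options can be sketched as follows. This is a minimal illustration on made-up toy data, not the code or data from the video: the data frame, the column names and the choice of `key_cols` are assumptions, and the subset variants only give the same result as the full variants because the toy data is built so that `id` and `group` fully identify a row.

```r
library(data.table)

# Hypothetical toy data (not from the video): 'id' and 'group' identify a row,
# and 'value' is fixed per id, so duplicates on the key are full duplicates.
set.seed(1)
base_rows <- data.frame(
  id    = 1:1000,
  group = sample(letters, 1000, replace = TRUE),
  value = rnorm(1000)
)
# Sample rows with replacement so the data contains duplicated rows
df <- base_rows[sample(nrow(base_rows), 5000, replace = TRUE), ]
rownames(df) <- NULL
dt <- as.data.table(df)

key_cols <- c("id", "group")  # assumed identifying columns

# 1. Base R, all columns
res_base_all <- unique(df)

# 2. Base R, subset of columns: flag duplicates on the key columns only
res_base_sub <- df[!duplicated(df[, key_cols]), ]

# 3. data.table, all columns
res_dt_all <- unique(dt)

# 4. data.table, subset of columns via the by argument
res_dt_sub <- unique(dt, by = key_cols)
```

All four variants keep the first occurrence of each duplicated row. If the chosen columns do not really identify a row, options 2 and 4 will drop rows that options 1 and 3 keep, so it is worth checking the row counts against each other once.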

Reducing the number of variables used to identify duplicates brings, in our example, about a two-fold speed improvement. data.table makes a far bigger difference: it runs about 100 times faster than its Base R counterpart on the full dataset (using all columns) and about 60 times faster on a subset of columns. Combining the two approaches and comparing the fastest option to the slowest, the speedup is 126-fold.

Note that the factor by which your code speeds up with these approaches will likely depend on your use case. Still, we can be confident of a massive gain from data.table.

We used the bench package to measure runtimes. It has some convenient features I missed in microbenchmark: it is more explicit about the garbage collector, and it automatically checks whether the benchmarked expressions really return identical results, which can be crucial!
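A benchmark along these lines could look like the sketch below. The toy data and the helper `same_rows()` are assumptions for illustration and are not taken from the video's repository. By default `bench::mark()` compares results with `all.equal()`, which also compares attributes such as class and row names; since a data.frame and a data.table differ there, this sketch passes a custom check function that compares column contents only.

```r
library(bench)
library(data.table)

# Hypothetical toy data, far smaller than a realistic use case
set.seed(42)
n  <- 2000
df <- data.frame(id = sample(500, n, replace = TRUE))
df$x <- df$id * 2  # x is determined by id, so id alone identifies duplicates
dt <- as.data.table(df)

# Hypothetical helper: treat results as equal if their columns match,
# ignoring attributes such as class and row names
same_rows <- function(a, b) isTRUE(all.equal(a, b, check.attributes = FALSE))

results <- bench::mark(
  base_all = unique(df),
  base_sub = df[!duplicated(df$id), ],
  dt_all   = unique(dt),
  dt_sub   = unique(dt, by = "id"),
  check    = same_rows,     # mark() stops with an error if results differ
  min_iterations = 5
)

# Runtime, memory and garbage collection counts per expression
results[, c("expression", "median", "mem_alloc", "n_gc")]
```

On data this small the timings say little about real workloads; the point is only the shape of the call and the fact that `bench::mark()` refuses to report timings for expressions whose results do not agree.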

You can find the code used in this video here:
https://github.com/fjodor/unique

Contact me, e.g. to discuss (online) R workshops, trainings, or webinars:

LinkedIn:   / wolfriepl
Twitter:   / statistikindd
Xing: https://www.xing.com/profile/Wolf_Riepl
Facebook:   / statistikdresden

https://statistik-dresden.de/kontakt
R Workshops: https://statistik-dresden.de/r-schulu...
Blog (German, translate option): https://statistik-dresden.de/statisti...

Playlist: Music chart history
   • Music Chart History
