dmv.community is one of the many independent Mastodon servers you can use to participate in the fediverse.
A small regional Mastodon instance for those in the DC, Maryland, and Virginia areas. Local news, commentary, and conversation.


#RStats

126 posts · 109 participants · 19 posts today

#rstats Is there an existing tool to automate a reprex → RPubs pipeline? My current manual workflow: make a reprex in an .R script, copy the contents into a .qmd, and use the publish feature in the RStudio IDE.

Sometimes my reprexes get just a tad more complex and require some prose to walk through the steps. In those cases I like publishing them almost like standalone micro blog posts.

Ex: this reprex doc I made to show how to recover ggrepel coordinates rpubs.com/yjunechoe/ggrepel-re

rpubs.com — RPubs: Recover ggrepel drawn positions
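There doesn't seem to be a single built-in function for this, but the two documented entry points — `reprex::reprex()` for rendering and `rsconnect::rpubsUpload()` for publishing — can be chained in a small helper. A sketch, assuming {reprex} writes its HTML output next to the input with the default `_reprex.html` suffix; the helper name and example file are made up:

```r
library(reprex)
library(rsconnect)

publish_reprex <- function(r_script, title) {
  # Render the .R script as an HTML reprex; by default the output
  # lands next to the input as <input>_reprex.html
  reprex(input = r_script, venue = "html", html_preview = FALSE)
  html_file <- sub("\\.R$", "_reprex.html", r_script)

  # rpubsUpload() returns a list with a continueUrl to finish
  # publishing in the browser (RPubs needs a logged-in session)
  result <- rpubsUpload(
    title       = title,
    contentFile = html_file,
    originalDoc = r_script
  )
  browseURL(result$continueUrl)
}

# publish_reprex("ggrepel-recover.R", "Recover ggrepel drawn positions")
```

This skips the .qmd step entirely, since reprex can render straight to HTML — though for the prose-heavy ones, a .qmd in between still makes sense.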

#rstats hivemind: would it be too funky to define a package version major.minor.patch.dev as YYYY.MM.DD.VERSION, i.e. map major to the year, minor to the month, patch to the day, and leave the dev component for the actual version..? I'm thinking of a data package whose upstream data releases are versioned by date... has anyone ever tried such a heretical approach? Would CRAN maintainers be okay with this?! ;)

asking for a friend.
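For what it's worth, base R's version parser happily accepts four-component versions and compares them numerically component by component, so date-mapped versions at least sort correctly — whether CRAN maintainers like the look of them is a separate question. A quick check:

```r
# package_version() parses dotted integer versions of any length
v_old <- package_version("2024.12.31.2")
v_new <- package_version("2025.4.1.0")

v_new > v_old   # TRUE: components compare as integers, not as strings

# Caveat: leading zeros are not preserved, since each component
# is an integer -- "2025.04.01.0" and "2025.4.1.0" are the same version
package_version("2025.04.01.0") == v_new   # TRUE
```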

On #regression models #rstats:

1) the cleaning method applied to test data has more impact than the one applied to training data

2) best performance does not require the same cleaning process for both training and test data

3) regression modelers should evaluate several test-data cleaning pipelines.

peerj.com/articles/cs-2793/

PeerJ Computer Science — The effects of mismatched train and test data cleaning pipelines on regression models: lessons for practice

Data quality problems are present in all real-world, large-scale datasets. Each of these potential problems can be addressed in multiple ways through data cleaning. However, there is no single best data cleaning approach that always produces a perfect result, meaning that a choice needs to be made about which approach to use. At the same time, machine learning (ML) models are being trained and tested on these cleaned datasets, usually with one single data cleaning pipeline applied. In practice, however, data cleaning pipelines are updated regularly, often without retraining of production models. It is therefore common to apply different test (or production) data than the data on which the models were originally trained. The changes in these new test data and the data cleaning process applied can have potential ramifications for model performance. In this article, we show the impact that altering a data cleaning pipeline between the training and testing steps of an ML workflow can have. Through the fitting and evaluation of over 6,000 models, we find that mismatches between cleaning pipelines on training and test data can have a meaningful impact on regression model performance. Counter-intuitively, such mismatches can improve test set performance and potentially alter model selection choices.
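Point 3 is easy to prototype: fit one model with a fixed training-data cleaning, then score the same test set under several cleaning variants. A hypothetical base-R sketch — the toy data, cleaning functions, and RMSE comparison are illustrative, not the paper's actual pipelines:

```r
# Toy data with injected missingness in the predictor
set.seed(1)
n <- 200
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n)
df$x[sample(n, 20)] <- NA

train <- df[1:150, ]
test  <- df[151:200, ]

# Candidate cleaning pipelines (illustrative)
pipelines <- list(
  drop_na     = function(d) d[!is.na(d$x), ],
  mean_impute = function(d) { d$x[is.na(d$x)] <- mean(d$x, na.rm = TRUE); d },
  zero_impute = function(d) { d$x[is.na(d$x)] <- 0; d }
)

# One fixed cleaning for training...
fit <- lm(y ~ x, data = pipelines$mean_impute(train))

# ...then evaluate every cleaning variant on the test data
rmse <- sapply(pipelines, function(clean) {
  te <- clean(test)
  sqrt(mean((te$y - predict(fit, te))^2))
})
round(rmse, 3)
```

Per the paper's counter-intuitive finding, the best-scoring test cleaning need not match the training one.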

Looking forward to a (virtual) homecoming next week! Guest lecturing at my alma mater for STAT 447 @ UIUC on Wednesday, April 9, 6pm Central.

Shiny Without Boundaries: One App, Multiple Destinations

Deploy your #RStats #rshiny apps anywhere: cloud, desktop, browser & beyond.