Title: | Fill Data Points |
---|---|
Description: | Provides numerous functions to fill data. These can be applied either to missing or skewed data. The functions are designed within the scope of Student Analytics. |
Authors: | Tomer Iwan [aut, cre], Yaïr Jacob [ctb], VU Analytics [cph] |
Maintainer: | Tomer Iwan <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.6.7.9000 |
Built: | 2025-03-11 05:01:25 UTC |
Source: | https://github.com/vusaverse/vvfiller |
Checks whether some, but not all, values in a vector are missing and returns a boolean. The check saves time by skipping vectors that do not need filling.
check_some_missing(x)
x |
the vector to check |
TRUE or FALSE
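The check can be illustrated with a minimal base R sketch. It assumes "missing" means `NA`; the helper name `some_but_not_all_missing` is hypothetical and this is not the package's own implementation.

```r
# Hypothetical helper: TRUE only when x contains at least one NA
# and at least one known value.
some_but_not_all_missing <- function(x) {
  na <- is.na(x)
  any(na) && !all(na)
}

some_but_not_all_missing(c(1, NA, 3))  # TRUE: filling may be useful
some_but_not_all_missing(c(NA, NA))    # FALSE: nothing known to fill from
some_but_not_all_missing(c(1, 2, 3))   # FALSE: nothing to fill
```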
Calculate a summary statistic (mean, median, vvconverter::mode, min, max etc.) by group and use it to fill missing values in a column. Primarily for use in fill_df_with_agg_by_group().
fill_col_with_agg_by_group(df, group, col, statistic)
df |
tibble to use |
group |
string or vector of strings: columns to group by |
col |
string: column to impute |
statistic |
function: summary statistic to use (mean, median, min etc.). Currently requires a function that accepts an na.rm argument |
a filled vector
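The idea of filling from a group-wise statistic can be sketched in base R with `ave()`. The data frame and its column names (`programme`, `grade`) are made up for illustration, and the package itself may implement this differently.

```r
# Hypothetical data: grades per programme, with gaps.
df <- data.frame(
  programme = c("A", "A", "A", "B", "B"),
  grade     = c(6, NA, 8, 5, NA)
)

# Group-wise statistic, computed with na.rm = TRUE and repeated for each row.
group_stat <- ave(df$grade, df$programme,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Keep known values; use the group statistic only where grade is NA.
filled <- ifelse(is.na(df$grade), group_stat, df$grade)
filled  # 6 7 8 5 5
```

The `na.rm = TRUE` call is why the statistic must accept an `na.rm` argument: without it, any group containing a missing value would yield `NA`.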
Function to calculate a summary statistic (mean, median, vvconverter::mode, min, max etc.) by group and use it to fill missing values. Note: this takes and produces a tibble rather than a vector.
fill_df_with_agg_by_group( df, group, columns, overwrite_col = FALSE, statistic = mean, fill_empty_group = FALSE )
df |
tibble to use |
group |
string or vector of strings: columns to group by |
columns |
string or vector of strings: columns to impute |
overwrite_col |
boolean: whether to overwrite column. If FALSE, a new column with suffix _imputed will be created |
statistic |
function: summary statistic to use (mean, median, min etc.). Currently requires a function that accepts an na.rm argument |
fill_empty_group |
boolean: If TRUE, fills groups that only contain NA with summary statistic of entire column |
a tibble with filled column(s)
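To make the `overwrite_col` and `fill_empty_group` options more concrete, here is a base R sketch of the behaviour the argument descriptions suggest. It is an approximation under that reading, not the package's code, and the example data is hypothetical.

```r
# Hypothetical data: group C has no known grade at all.
df <- data.frame(
  programme = c("A", "A", "B", "B", "C"),
  grade     = c(6, NA, 4, NA, NA)
)

# Group means; a group with only NA yields NaN from mean(..., na.rm = TRUE).
group_mean <- ave(df$grade, df$programme,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Analogue of fill_empty_group = TRUE: fall back to the whole-column mean.
overall_mean <- mean(df$grade, na.rm = TRUE)
group_mean[is.na(group_mean)] <- overall_mean

# Analogue of overwrite_col = FALSE: write to a new column with suffix _imputed.
df$grade_imputed <- ifelse(is.na(df$grade), group_mean, df$grade)
df$grade_imputed  # 6 6 4 4 5
```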
Wrapper function that runs the checks and then calls the appropriate fill_vector function.
fill_missing(x, min_known_n = NULL, min_known_p = NULL, type)
x |
The vector to fill |
min_known_n |
numeric value: the minimum number of not-missing values |
min_known_p |
numeric value between 0 and 1: the minimum fraction of not-missing values |
type |
the type of fill missing function to be called |
filled vector
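The two thresholds can be read as a guard that decides whether filling is attempted at all. The sketch below assumes a vector qualifies when the count of known values is at least `min_known_n` and the known fraction is at least `min_known_p`, with `NULL` meaning "no threshold". The helper `enough_known` is hypothetical, and the package's exact comparison rules may differ.

```r
# Hypothetical helper: do the known values meet the optional thresholds?
enough_known <- function(x, min_known_n = NULL, min_known_p = NULL) {
  known <- sum(!is.na(x))
  ok_n <- is.null(min_known_n) || known >= min_known_n
  ok_p <- is.null(min_known_p) || known / length(x) >= min_known_p
  ok_n && ok_p
}

x <- c(1, NA, NA, NA)
enough_known(x)                      # TRUE: no thresholds set
enough_known(x, min_known_n = 2)     # FALSE: only 1 known value
enough_known(x, min_known_p = 0.25)  # TRUE: 1 of 4 values known
```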
Fill all missing values for an interval observed in the vector
fill_missing_interval(x, min_known_n = NULL, min_known_p = NULL)
x |
The vector to fill |
min_known_n |
numeric value: the minimum number of not-missing values |
min_known_p |
numeric value between 0 and 1: the minimum fraction of not-missing values |
a filled vector
fill_missing_interval(c(NA, 1, 2, NA))
fill_missing_interval(c(NA, 10, 20, NA))
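The description does not spell out what "interval" means. One plausible reading, assumed here, is that the known values advance by a constant step per position and the gaps are filled by extending that step in both directions. The helper `fill_constant_step` is a hypothetical base R sketch of that reading only, so its output may not match what the package returns.

```r
# Hypothetical helper: extend a single constant step across the whole vector.
fill_constant_step <- function(x) {
  idx <- which(!is.na(x))
  if (length(idx) < 2) return(x)            # need two known points for a step
  step <- unique(diff(x[idx]) / diff(idx))  # step per position
  if (length(step) != 1) return(x)          # no single interval observed
  x[idx[1]] + (seq_along(x) - idx[1]) * step
}

fill_constant_step(c(NA, 1, 2, NA))    # 0 1 2 3
fill_constant_step(c(NA, 10, 20, NA))  # 0 10 20 30
```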
Fill all missing values in a vector with the last value if it is known.
fill_missing_last(x, min_known_n = NULL, min_known_p = NULL)
x |
The vector to fill |
min_known_n |
numeric value: the minimum number of not-missing values |
min_known_p |
numeric value between 0 and 1: the minimum fraction of not-missing values |
a filled vector
fill_missing_last(c(1, 2, NA))
fill_missing_last(c(NA, 1, 2, NA))
Fill all missing values in a vector with the maximum value if it is known.
fill_missing_max(x, min_known_n = NULL, min_known_p = NULL)
x |
The vector to fill |
min_known_n |
numeric value: the minimum number of not-missing values |
min_known_p |
numeric value between 0 and 1: the minimum fraction of not-missing values |
a filled vector
fill_missing_max(c(1, 2, NA))
fill_missing_max(c(NA, 1, 2, NA))
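In base R terms, filling with the maximum amounts to computing the maximum of the known values and assigning it to the missing positions. The following is a sketch of that idea, not the package's implementation; the variable name `x_na_omit` simply mirrors the argument name used by fill_vector_max.

```r
x <- c(NA, 1, 2, NA)

# Known values only, then their maximum.
x_na_omit <- x[!is.na(x)]
fill_with <- max(x_na_omit)

# Assign the maximum to every missing position.
filled <- x
filled[is.na(filled)] <- fill_with
filled  # 2 1 2 2
```

Replacing `max` with `min` gives the corresponding minimum fill.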
Fill all missing values in a vector with the minimum value if it is known.
fill_missing_min(x, min_known_n = NULL, min_known_p = NULL)
x |
The vector to fill |
min_known_n |
numeric value: the minimum number of not-missing values |
min_known_p |
numeric value between 0 and 1: the minimum fraction of not-missing values |
a filled vector
fill_missing_min(c(1, 2, NA))
fill_missing_min(c(NA, 1, 2, NA))
Fill all missing values in a vector with the previous value if it is known.
fill_missing_previous(x, min_known_n = NULL, min_known_p = NULL)
x |
The vector to fill |
min_known_n |
numeric value: the minimum number of not-missing values |
min_known_p |
numeric value between 0 and 1: the minimum fraction of not-missing values |
a filled vector
fill_missing_previous(c(1, 2, NA))
fill_missing_previous(c(NA, 1, 2, NA))
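Filling with the previous value is often called "last observation carried forward". A base R sketch of the technique follows. The helper `fill_previous_sketch` is hypothetical, and it assumes a leading `NA` with no earlier known value stays `NA`, which the package may or may not do.

```r
# Hypothetical helper: carry the most recent known value forward.
fill_previous_sketch <- function(x) {
  known <- !is.na(x)
  # Index of the most recent known value at each position (0 if none yet).
  last_idx <- cummax(seq_along(x) * known)
  out <- x
  has_prev <- last_idx > 0
  out[has_prev] <- x[last_idx[has_prev]]
  out
}

fill_previous_sketch(c(1, 2, NA))      # 1 2 2
fill_previous_sketch(c(NA, 1, 2, NA))  # NA 1 2 2
```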
Impute missing values of a count variable. Imputation is done by counting on from the last known value. Example: c(NA, 4, NA, NA) becomes c(NA, 4, 5, 6).
fill_missing_rownumber(x)
x |
Integer vector. |
Integer vector with filled values.
fill_missing_rownumber(c(NA,4,NA,NA))
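"Counting from the last known value" can be sketched in base R as: take the last known value and add the number of positions since it was observed. The helper `count_on_sketch` is hypothetical and assumes values before the first known one stay `NA`. It illustrates the technique rather than reproducing the package's code.

```r
# Hypothetical helper: continue a count from the most recent known value.
count_on_sketch <- function(x) {
  pos <- seq_along(x)
  # Index of the most recent known value at each position (0 if none yet).
  last_idx <- cummax(pos * !is.na(x))
  out <- x
  has_prev <- last_idx > 0
  # Last known value plus the distance travelled since it was observed.
  out[has_prev] <- x[last_idx[has_prev]] + (pos[has_prev] - last_idx[has_prev])
  out
}

count_on_sketch(c(NA, 4L, NA, NA))  # NA 4 5 6
```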
Fill all missing values in a vector with the same value if it is known. Only fills the value when all known values are the same
fill_missing_strict(x, min_known_n = NULL, min_known_p = NULL)
x |
The vector to fill |
min_known_n |
numeric value: the minimum number of not-missing values |
min_known_p |
numeric value between 0 and 1: the minimum fraction of not-missing values |
a filled vector
fill_missing_strict(c(NA, 1))
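The strict rule (fill only when every known value is identical) can be sketched in a few lines of base R. The helper `fill_strict_sketch` is a hypothetical illustration, not the package's implementation.

```r
# Hypothetical helper: fill only when exactly one distinct known value exists.
fill_strict_sketch <- function(x) {
  known <- unique(x[!is.na(x)])
  if (length(known) == 1) {
    x[is.na(x)] <- known
  }
  x
}

fill_strict_sketch(c(NA, 1))     # 1 1
fill_strict_sketch(c(NA, 1, 2))  # NA 1 2 (known values differ, so unchanged)
```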
Returns a vector with all missing values filled with another value
fill_value(x, value)
x |
the vector in which missing values are filled |
value |
a value with the same class as x |
a vector with the same length as x
fill_value(c(NA,1), 2)
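For comparison, the same effect can be obtained in base R with `replace()`. This is an equivalent sketch for a plain numeric vector, not a description of how the package implements fill_value.

```r
x <- c(NA, 1)

# Replace every NA position with the fill value; the length is unchanged.
filled <- replace(x, is.na(x), 2)
filled  # 2 1
```

Using a fill value of the same class as `x` avoids silent type coercion of the result.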
fill_vector_interval
fill_vector_interval(x)
x |
the vector to be filled |
fill_vector_last
fill_vector_last(x, x_na_omit)
x |
the vector to be filled |
x_na_omit |
the x vector without NA values |
fill_vector_max
fill_vector_max(x, x_na_omit)
x |
the vector to be filled |
x_na_omit |
the x vector without NA values |
fill_vector_min
fill_vector_min(x, x_na_omit)
x |
the vector to be filled |
x_na_omit |
the x vector without NA values |
fill_vector_previous
fill_vector_previous(x)
x |
the vector to be filled |
fill_vector_strict
fill_vector_strict(x, x_na_omit)
x |
the vector to be filled |
x_na_omit |
the x vector without NA values |
A specialized function that takes a variable and turns it into two new variables for use in a prediction model: the variable with missing values imputed by the median for the given year, and an indicator of whether the variable was missing.
na_impute_median(data, var, year = 2014, year_column)
data |
The data frame. |
var |
The variable used to create new variables. |
year |
Year used for the median for imputation. |
year_column |
Column with year to use median on. |
New data frame in which missing values are filled.
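The two derived variables can be sketched in base R as follows. The sketch assumes the median is computed only over rows of the reference year and then used for every missing value. The data and the column names `score_missing` and `score_imputed` are hypothetical, and the names the package actually generates may differ.

```r
# Hypothetical data: a score observed in two years, with gaps.
data <- data.frame(
  year  = c(2014, 2014, 2014, 2015, 2015),
  score = c(5, 7, NA, NA, 9)
)

# Median of the variable within the reference year only.
ref_median <- median(data$score[data$year == 2014], na.rm = TRUE)

# Indicator first, so it records the original missingness.
data$score_missing <- is.na(data$score)

# Imputed variable: known values kept, gaps filled with the reference median.
data$score_imputed <- ifelse(is.na(data$score), ref_median, data$score)
data$score_imputed  # 5 7 6 6 9
```

Keeping the indicator alongside the imputed variable lets a prediction model distinguish observed values from imputed ones.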