5: Data Wrangling in R

[0.] Checklist

☑ Previous lectures
☑ R version

[1.] What is tidy data?

1.1 Raw data and tidy data
1.2 Rules of a tidy dataset
1.3 tidyverse
1.4 Cheat sheet

[2.] Adding variables to a dataset

2.0 Initial setup

☑ Class script:
☑ Libraries

2.1 Datasets available in R's memory
2.2 The $ operator
2.3 mutate()
2.4 Generating variables using conditionals:
2.5 Applying functions to variables

Sorting an object by the values of a variable:

[3.] Removing rows and/or columns

3.1 Selecting variables

3.1.1 Selecting variables by part of the name
3.1.2 Selecting variables by type
3.1.3 Selecting variables using a vector
3.1.4 Deselecting variables

3.2 Removing rows/observations

3.2.1 Removing rows using conditionals

[4.] The pipe operator (%>%)

Let's look at an example:
Let's look at another example:

Further reading:

[0.] Checklist

Before starting this lecture, make sure you have covered the following…

☑ Previous lectures

Make sure you have reviewed the previous lecture: https://lectures-r.gitlab.io/unimag-202201/lecture-04/

☑ R version

Have R version 4.1.1 (2021-08-10) installed:

R.version.string

## [1] "R version 4.1.1 (2021-08-10)"
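
The same check can be done programmatically. The following is a minimal sketch, assuming you want a script to warn when the running R is older than the version used in this lecture; the variable names `required` and `ok` are introduced here for illustration only.

```r
# version used when this lecture was written
required <- "4.1.1"

# getRversion() returns the running version as a comparable object
ok <- getRversion() >= required

if (!ok) {
  message("R ", getRversion(), " is older than ", required)
}
ok
```

Whether the lecture code also runs on older versions is not stated in the lecture, so treat the message as a hint rather than an error.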

[1.] What is tidy data?

1.1 Raw data and tidy data

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”, Hadley Wickham

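1.2 Rules of a tidy dataset

Following R for Data Science, a dataset is tidy when each variable has its own column, each observation has its own row, and each value has its own cell.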
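
As a small illustration of what "tidy" means in practice, the sketch below reshapes a table in which the values of one variable (the year) are spread across column names. The table `messy` and its values are invented for this example, and pivot_longer() from tidyr is just one possible tool for the job; it is not covered elsewhere in this lecture.

```r
library(tibble)
library(tidyr)

# hypothetical messy table: the years are column names, not values
messy <- tibble(country = c("A", "B"),
                `2020`  = c(10, 20),
                `2021`  = c(11, 21))

# tidy version: one column per variable, one row per country-year observation
tidy <- pivot_longer(data = messy,
                     cols = c(`2020`, `2021`),
                     names_to = "year",
                     values_to = "value")
tidy
```

After reshaping, `year` is a proper variable that can be filtered or sorted like any other column.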

1.3 `tidyverse`

Illustration: Allison Horst

tidyverse is a collection of 8 libraries designed to clean and manipulate datasets in R:

library("tidyverse")

> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
> ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
> ✓ tibble  3.1.0     ✓ dplyr   1.0.5
> ✓ tidyr   1.1.3     ✓ stringr 1.4.0
> ✓ readr   1.4.0     ✓ forcats 0.5.1
> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
> x dplyr::filter() masks stats::filter() 
> x dplyr::lag()    masks stats::lag()

To see the conflicts between function names in tidyverse and function names in other libraries, type tidyverse_conflicts() in the console.

1.4 Cheat sheet

You can find a cheat sheet for each library here.

[2.] Adding variables to a dataset

You can add variables to a dataframe/tibble using data$var from base R or the mutate() function from the dplyr library.

2.0 Initial setup

☑ Class script:

To replicate this exercise, first download the clase-03 file and open the clase-02.Rproj file. You can then follow the class script located at: script/clase-03.R

☑ Libraries

Install/load the pacman library, and use the p_load() function to install/load the libraries for this session:

## load pacman (install it first with install.packages("pacman") if it is missing)
require(pacman)

## use pacman's p_load function to install/load the libraries for this class
p_load(tidyverse, # functions to manipulate/clean datasets
       rio, # import/export functions: read/write files in different formats
       skimr, # skim function: describes a dataset
       janitor) # tabyl function: relative frequencies

2.1 Datasets available in R's memory

data(package="datasets")

Data sets in package ‘datasets’:

iris                            Edgar Anderson's Iris Data
iris3                           Edgar Anderson's Iris Data
islands                         Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)          Monthly Deaths from Lung Diseases in the UK
lh                              Luteinizing Hormone in Blood Samples
longley                         Longley's Economic Regression Data
lynx                            Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)          Monthly Deaths from Lung Diseases in the UK
morley                          Michelson Speed of Light Data
mtcars                          Motor Trend Car Road Tests
uspop                           Populations Recorded by the US Census
volcano                         Topographic Information on Auckland's Maunga Whau Volcano
women                           Average Heights and Weights for American Women

Only some of the 104 datasets available in the datasets library are shown.

2.2 The `$` operator

Create an object with the women dataset:

df = as_tibble(x = women) # get the dataset
df

## # A tibble: 15 × 2
##    height weight
##     <dbl>  <dbl>
##  1     58    115
##  2     59    117
##  3     60    120
##  4     61    123
##  5     62    126
##  6     63    129
##  7     64    132
##  8     65    135
##  9     66    139
## 10     67    142
## 11     68    146
## 12     69    150
## 13     70    154
## 14     71    159
## 15     72    164

Create a variable with height in centimeters (1 inch = 2.54 centimeters):

df$height_cm = df$height*2.54 # add new variable
df

## # A tibble: 15 × 3
##    height weight height_cm
##     <dbl>  <dbl>     <dbl>
##  1     58    115      147.
##  2     59    117      150.
##  3     60    120      152.
##  4     61    123      155.
##  5     62    126      157.
##  6     63    129      160.
##  7     64    132      163.
##  8     65    135      165.
##  9     66    139      168.
## 10     67    142      170.
## 11     68    146      173.
## 12     69    150      175.
## 13     70    154      178.
## 14     71    159      180.
## 15     72    164      183.
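
The `$` operator works in both directions: it extracts a column when read and creates or overwrites one when assigned to. The sketch below shows this on the base data frame women (no tidyverse needed); the name `heights` is introduced here only for illustration.

```r
df <- women  # built-in data frame with columns height and weight

# `$` on the right-hand side extracts a column as a plain vector
heights <- df$height

# `$` on the left-hand side creates (or overwrites) a column
df$height_cm <- heights * 2.54

# assigning NULL removes the column again
df$height_cm <- NULL
names(df)
```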

2.3 mutate()

Generate a variable with the weight/height_cm ratio:

df = mutate(.data = df , weight_hcm = weight/height_cm)
head(x=df, n=5)

## # A tibble: 5 × 4
##   height weight height_cm weight_hcm
##    <dbl>  <dbl>     <dbl>      <dbl>
## 1     58    115      147.      0.781
## 2     59    117      150.      0.781
## 3     60    120      152.      0.787
## 4     61    123      155.      0.794
## 5     62    126      157.      0.800
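
mutate() can also create several variables in one call, and a later expression may use a variable defined earlier in the same call. A minimal sketch, starting again from the original women dataset:

```r
library(dplyr)

df <- as_tibble(women)

# both variables are created in one call; weight_hcm uses height_cm,
# which is defined one argument earlier in the same mutate()
df <- mutate(.data = df,
             height_cm  = height * 2.54,
             weight_hcm = weight / height_cm)
head(x = df, n = 3)
```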

2.4 Generating variables using conditionals:

Generate one variable that flags women with a weight/height_cm ratio of at least 0.85 and another that flags women taller than 180 cm:

Logical operator	Description
< , >	Less than, greater than
<= , >=	Less than or equal to, greater than or equal to
==	Equal to
!=	Not equal to
&	and
|	or
!	Negation

# data$var
df$height_180 = ifelse(test=df$height_cm>180 , yes=1 , no=0)

#mutate
df = mutate(.data=df , sobrepeso = ifelse(test=weight_hcm>=0.85 , yes=1 , no=0))

head(x=df, n=5)

## # A tibble: 5 × 6
##   height weight height_cm weight_hcm height_180 sobrepeso
##    <dbl>  <dbl>     <dbl>      <dbl>      <dbl>     <dbl>
## 1     58    115      147.      0.781          0         0
## 2     59    117      150.      0.781          0         0
## 3     60    120      152.      0.787          0         0
## 4     61    123      155.      0.794          0         0
## 5     62    126      157.      0.800          0         0

Generate a variable with categories for the weight/height_cm ratio:

df = mutate(df , category = case_when(weight_hcm>=0.85 ~ "pesado" ,
                                      weight_hcm>=0.8 & weight_hcm<0.85 ~ "promedio" ,
                                      weight_hcm<0.8 ~ "liviano"))
head(x=df, n=5)

## # A tibble: 5 × 7
##   height weight height_cm weight_hcm height_180 sobrepeso category
##    <dbl>  <dbl>     <dbl>      <dbl>      <dbl>     <dbl> <chr>   
## 1     58    115      147.      0.781          0         0 liviano 
## 2     59    117      150.      0.781          0         0 liviano 
## 3     60    120      152.      0.787          0         0 liviano 
## 4     61    123      155.      0.794          0         0 liviano 
## 5     62    126      157.      0.800          0         0 promedio
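
case_when() checks its conditions from top to bottom and assigns the first one that matches, so the upper bound in the middle condition is not strictly needed. The sketch below is one possible shorter form of the same classification; it assumes there are no missing values in weight_hcm, because the catch-all `TRUE ~` branch would label them "liviano" as well.

```r
library(dplyr)

df <- mutate(as_tibble(women), weight_hcm = weight / (height * 2.54))

# conditions are evaluated in order; each row gets the first match,
# and TRUE ~ ... catches every row that matched nothing above
df <- mutate(df, category = case_when(weight_hcm >= 0.85 ~ "pesado",
                                      weight_hcm >= 0.8  ~ "promedio",
                                      TRUE               ~ "liviano"))
table(df$category)
```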

2.5 Applying functions to variables

Convert all variables to character:

df = mutate_all(.tbl=df , .funs = as.character)
str(df)

## tibble [15 × 7] (S3: tbl_df/tbl/data.frame)
##  $ height    : chr [1:15] "58" "59" "60" "61" ...
##  $ weight    : chr [1:15] "115" "117" "120" "123" ...
##  $ height_cm : chr [1:15] "147.32" "149.86" "152.4" "154.94" ...
##  $ weight_hcm: chr [1:15] "0.780613630192778" "0.780728680101428" "0.78740157480315" "0.793855686072028" ...
##  $ height_180: chr [1:15] "0" "0" "0" "0" ...
##  $ sobrepeso : chr [1:15] "0" "0" "0" "0" ...
##  $ category  : chr [1:15] "liviano" "liviano" "liviano" "liviano" ...

Convert only some variables to numeric:

df = mutate_at(.tbl=df , .vars = c("height","weight","height_cm","weight_hcm"),.funs = as.numeric)
glimpse(df)

## Rows: 15
## Columns: 7
## $ height     <dbl> 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72
## $ weight     <dbl> 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150,…
## $ height_cm  <dbl> 147.32, 149.86, 152.40, 154.94, 157.48, 160.02, 162.56, 165…
## $ weight_hcm <dbl> 0.7806136, 0.7807287, 0.7874016, 0.7938557, 0.8001016, 0.80…
## $ height_180 <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ sobrepeso  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "1",…
## $ category   <chr> "liviano", "liviano", "liviano", "liviano", "promedio", "pr…

Convert to numeric only the variables that are character:

df2 = mutate_if(.tbl=df , .predicate = is.character,.funs = as.numeric)

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

glimpse(df2)

## Rows: 15
## Columns: 7
## $ height     <dbl> 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72
## $ weight     <dbl> 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150,…
## $ height_cm  <dbl> 147.32, 149.86, 152.40, 154.94, 157.48, 160.02, 162.56, 165…
## $ weight_hcm <dbl> 0.7806136, 0.7807287, 0.7874016, 0.7938557, 0.8001016, 0.80…
## $ height_180 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1
## $ sobrepeso  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1
## $ category   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
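
Since dplyr 1.0.0 the scoped verbs mutate_all(), mutate_at() and mutate_if() are superseded by across(), which is used inside a regular mutate(). The following sketch shows how the three conversions above might look in that style; it uses the plain women dataset and the object names `df_chr`, `df_at` and `df_if` are chosen here for illustration.

```r
library(dplyr)

df <- as_tibble(women)

# all columns to character (same idea as mutate_all)
df_chr <- mutate(df, across(.cols = everything(), .fns = as.character))

# only the named columns back to numeric (same idea as mutate_at)
df_at <- mutate(df_chr, across(.cols = c(height, weight), .fns = as.numeric))

# only the columns that satisfy a predicate (same idea as mutate_if)
df_if <- mutate(df_chr, across(.cols = where(is.character), .fns = as.numeric))
glimpse(df_if)
```

The scoped verbs still work, so this is a matter of style rather than a required change.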

Sorting an object by the values of a variable:

Sort a dataframe: ascending alphabetical order

df = arrange(.data=df , category)
head(df)

## # A tibble: 6 × 7
##   height weight height_cm weight_hcm height_180 sobrepeso category
##    <dbl>  <dbl>     <dbl>      <dbl> <chr>      <chr>     <chr>   
## 1     58    115      147.      0.781 0          0         liviano 
## 2     59    117      150.      0.781 0          0         liviano 
## 3     60    120      152.      0.787 0          0         liviano 
## 4     61    123      155.      0.794 0          0         liviano 
## 5     69    150      175.      0.856 0          1         pesado  
## 6     70    154      178.      0.866 0          1         pesado

Sort a dataframe: descending alphabetical order

df = arrange(.data=df , desc(category)) 
head(df)

## # A tibble: 6 × 7
##   height weight height_cm weight_hcm height_180 sobrepeso category
##    <dbl>  <dbl>     <dbl>      <dbl> <chr>      <chr>     <chr>   
## 1     62    126      157.      0.800 0          0         promedio
## 2     63    129      160.      0.806 0          0         promedio
## 3     64    132      163.      0.812 0          0         promedio
## 4     65    135      165.      0.818 0          0         promedio
## 5     66    139      168.      0.829 0          0         promedio
## 6     67    142      170.      0.834 0          0         promedio

Sort a dataframe: ascending numeric order

df = arrange(.data=df , height_cm)
head(df)

## # A tibble: 6 × 7
##   height weight height_cm weight_hcm height_180 sobrepeso category
##    <dbl>  <dbl>     <dbl>      <dbl> <chr>      <chr>     <chr>   
## 1     58    115      147.      0.781 0          0         liviano 
## 2     59    117      150.      0.781 0          0         liviano 
## 3     60    120      152.      0.787 0          0         liviano 
## 4     61    123      155.      0.794 0          0         liviano 
## 5     62    126      157.      0.800 0          0         promedio
## 6     63    129      160.      0.806 0          0         promedio

Sort a dataframe: descending numeric order

df = arrange(.data=df , desc(height_cm))
head(df)

## # A tibble: 6 × 7
##   height weight height_cm weight_hcm height_180 sobrepeso category
##    <dbl>  <dbl>     <dbl>      <dbl> <chr>      <chr>     <chr>   
## 1     72    164      183.      0.897 1          1         pesado  
## 2     71    159      180.      0.882 1          1         pesado  
## 3     70    154      178.      0.866 0          1         pesado  
## 4     69    150      175.      0.856 0          1         pesado  
## 5     68    146      173.      0.845 0          0         promedio
## 6     67    142      170.      0.834 0          0         promedio
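
arrange() also accepts several sorting keys at once: ties in the first key are broken by the second, and each key can be wrapped in desc() independently. A minimal sketch on the women dataset, where the grouping variable `tall` is a hypothetical flag created only for this example:

```r
library(dplyr)

# hypothetical flag: "yes" for heights of 65 inches or more
df <- mutate(as_tibble(women),
             tall = ifelse(test = height >= 65, yes = "yes", no = "no"))

# first by tall (alphabetical: "no" before "yes"),
# then by weight within each group, heaviest first
df <- arrange(.data = df, tall, desc(weight))
head(df)
```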

[3.] Removing rows and/or columns

3.1 Selecting variables

iris is a dataset from the datasets library that contains the measurements, in centimeters, of sepal length and width and petal length and width for 50 flowers from each of 3 iris species:

db =  mutate(.data = iris, Species=as.character(Species))
db[1:3,]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

The select() function selects columns from a dataframe or a tibble, using the name or the position of the variable in the dataset:

select(.data = db[1:3,], c(1,3,5))

##   Sepal.Length Petal.Length Species
## 1          5.1          1.4  setosa
## 2          4.9          1.4  setosa
## 3          4.7          1.3  setosa

select(.data = db[1:3,], Petal.Length , Petal.Width , Species)

##   Petal.Length Petal.Width Species
## 1          1.4         0.2  setosa
## 2          1.4         0.2  setosa
## 3          1.3         0.2  setosa

3.1.1 Selecting variables by part of the name

Variable names that start with "Sepal":

select(.data = db[1:3,], starts_with("Sepal"))

##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2

Variable names that contain the word "Width":

select(.data = db[1:3,], contains("Width"))

##   Sepal.Width Petal.Width
## 1         3.5         0.2
## 2         3.0         0.2
## 3         3.2         0.2

3.1.2 Selecting variables by type

Character variables:

select_if(.tbl = db[1:3,], is.character)

##   Species
## 1  setosa
## 2  setosa
## 3  setosa

Numeric variables:

select_if(.tbl = db[1:3,], is.numeric)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
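
In recent dplyr versions the same type-based selection can be written with the where() helper inside a regular select(), which also makes it possible to combine it with the name-based helpers. A sketch under that assumption; the object name `out` is introduced here for illustration:

```r
library(dplyr)

db <- mutate(.data = iris, Species = as.character(Species))

# where() wraps a predicate function, as select_if() does
select(.data = db[1:3, ], where(is.numeric))

# helpers can be combined: numeric columns whose name ends with "Width"
out <- select(.data = db[1:3, ], where(is.numeric) & ends_with("Width"))
out
```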

3.1.3 Selecting variables using a vector

Character vector:

vars = c("Species","Sepal.Length","Petal.Width")
select(.data = db[1:3,], all_of(vars))

##   Species Sepal.Length Petal.Width
## 1  setosa          5.1         0.2
## 2  setosa          4.9         0.2
## 3  setosa          4.7         0.2

Numeric vector:

nums = c(5,2,3)
select(.data = db[1:3,], all_of(nums))

##   Species Sepal.Width Petal.Length
## 1  setosa         3.5          1.4
## 2  setosa         3.0          1.4
## 3  setosa         3.2          1.3
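
all_of() is strict: it stops with an error if any name in the vector is missing from the data. When some names may be absent, any_of() selects only those that exist. The sketch below contrasts the two; the entry "not_a_column" is a deliberately nonexistent, hypothetical name.

```r
library(dplyr)

db <- mutate(.data = iris, Species = as.character(Species))

# the last name is hypothetical and does not exist in db
vars <- c("Species", "Sepal.Length", "not_a_column")

# any_of() silently skips names that are not in the data
out <- select(.data = db[1:3, ], any_of(vars))
out

# all_of() raises an error for the same vector; try() captures it here
res <- try(select(.data = db[1:3, ], all_of(vars)), silent = TRUE)
inherits(res, "try-error")
```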

3.1.4 Deselecting variables

select(.data = db[1:3,], -Species)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

3.2 Removing rows/observations

This part of the class uses the starwars dataset from the dplyr library.

df = tibble(starwars)
df

## # A tibble: 87 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
##  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
##  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
## 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

3.2.1 Removing rows using conditionals

The subset() function belongs to base R and selects all rows/observations of a dataset that satisfy one or more logical conditions:

subset(x = df, height > 180)  # height greater than 180

## # A tibble: 38 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Darth Vader    202 136   none    white   yellow     41.9 male  mascu… Tatooi…
##  2 Biggs Dark…    183  84   black   light   brown      24   male  mascu… Tatooi…
##  3 Obi-Wan Ke…    182  77   auburn… fair    blue-g…    57   male  mascu… Stewjon
##  4 Anakin Sky…    188  84   blond   fair    blue       41.9 male  mascu… Tatooi…
##  5 Chewbacca      228 112   brown   unknown blue      200   male  mascu… Kashyy…
##  6 Boba Fett      183  78.2 black   fair    brown      31.5 male  mascu… Kamino 
##  7 IG-88          200 140   none    metal   red        15   none  mascu… <NA>   
##  8 Bossk          190 113   none    green   red        53   male  mascu… Trando…
##  9 Qui-Gon Ji…    193  89   brown   fair    blue       92   male  mascu… <NA>   
## 10 Nute Gunray    191  90   none    mottle… red        NA   male  mascu… Cato N…
## # … with 28 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

The filter() function from the dplyr library does the same job:

filter(.data = df, mass > 100) # mass greater than 100 kg

## # A tibble: 10 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  2 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  3 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
##  4 Jabba Desi…    175  1358 <NA>    green-… orange    600   herm… mascu… Nal Hu…
##  5 Jek Tono P…    180   110 brown   fair    blue       NA   male  mascu… Bestin…
##  6 IG-88          200   140 none    metal   red        15   none  mascu… <NA>   
##  7 Bossk          190   113 none    green   red        53   male  mascu… Trando…
##  8 Dexter Jet…    198   102 none    brown   yellow     NA   male  mascu… Ojom   
##  9 Grievous       216   159 none    brown,… green,…    NA   male  mascu… Kalee  
## 10 Tarfful        234   136 brown   brown   blue       NA   male  mascu… Kashyy…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## #   ³eye_color, ⁴birth_year, ⁵homeworld
## # ℹ Use `colnames()` to see all variable names

The name of dplyr's filter() function conflicts with the name of the filter() function in the stats library (part of base R).

tidyverse_conflicts() # see conflicts with tidyverse function names

── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

One way to resolve this conflict is to use :: to call the function from its library, as in dplyr::filter(), or to create an object that holds the preferred function:

filter = dplyr::filter # note that no parentheses are used
filter(.data = df, mass > 100)  # mass greater than 100 kg

## # A tibble: 10 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  2 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  3 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
##  4 Jabba Desi…    175  1358 <NA>    green-… orange    600   herm… mascu… Nal Hu…
##  5 Jek Tono P…    180   110 brown   fair    blue       NA   male  mascu… Bestin…
##  6 IG-88          200   140 none    metal   red        15   none  mascu… <NA>   
##  7 Bossk          190   113 none    green   red        53   male  mascu… Trando…
##  8 Dexter Jet…    198   102 none    brown   yellow     NA   male  mascu… Ojom   
##  9 Grievous       216   159 none    brown,… green,…    NA   male  mascu… Kalee  
## 10 Tarfful        234   136 brown   brown   blue       NA   male  mascu… Kashyy…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## #   ³eye_color, ⁴birth_year, ⁵homeworld
## # ℹ Use `colnames()` to see all variable names
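
Conditions can be combined inside filter() with the logical operators & and |. The sketch below, which calls dplyr::filter() explicitly to sidestep the naming conflict, also illustrates that rows for which the condition evaluates to NA are dropped. The object names `tall_heavy` and `tall_or_heavy` are introduced here for illustration.

```r
library(dplyr)

df <- as_tibble(starwars)

# both conditions must hold (&); rows where the result is NA are dropped
tall_heavy <- dplyr::filter(df, height > 180 & mass > 100)

# at least one condition must hold (|)
tall_or_heavy <- dplyr::filter(df, height > 180 | mass > 100)

nrow(tall_heavy)
nrow(tall_or_heavy)
```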

[4.] The pipe operator (%>%)

The pipe is an operator that chains functions in R. %>% puts the focus on the transformation applied to the object rather than on the object itself, which makes the code shorter and easier to read.

Let's look at an example:

Earlier, a tibble was created from the women dataset, then two variables were generated with the mutate() function, and finally the data were sorted by the variable height_cm:

df = as_tibble(x = women)
df = mutate(.data = df , height_cm = height*2.54,
                         weight_hcm = weight/height_cm)
df = arrange(.data=df , desc(height_cm))
head(x=df , n=5)

## # A tibble: 5 × 4
##   height weight height_cm weight_hcm
##    <dbl>  <dbl>     <dbl>      <dbl>
## 1     72    164      183.      0.897
## 2     71    159      180.      0.882
## 3     70    154      178.      0.866
## 4     69    150      175.      0.856
## 5     68    146      173.      0.845

Another way to do it is to use the pipe operator %>%:

df = as_tibble(x = women) %>% 
     mutate(height_cm = height*2.54, weight_hcm = weight/height_cm) %>%
     arrange(desc(height_cm))
head(x=df , n=5)

## # A tibble: 5 × 4
##   height weight height_cm weight_hcm
##    <dbl>  <dbl>     <dbl>      <dbl>
## 1     72    164      183.      0.897
## 2     71    159      180.      0.882
## 3     70    154      178.      0.866
## 4     69    150      175.      0.856
## 5     68    146      173.      0.845

With %>% there is no need to name the object in each new transformation, and the three separate assignments to df collapse into a single chained statement.
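
Since R 4.1.0, base R also ships a native pipe, |>, which covers simple chains like this one without loading magrittr. As a sketch, the same chain could be written as follows (this is an alternative spelling, not the form used in the rest of this lecture):

```r
library(dplyr)

# same chain as in the %>% example, written with the native pipe (R >= 4.1.0)
df <- as_tibble(x = women) |>
      mutate(height_cm = height * 2.54, weight_hcm = weight / height_cm) |>
      arrange(desc(height_cm))
head(x = df, n = 5)
```

The two pipes behave alike when the piped object goes into the first argument, as it does here; they differ in some advanced features, so mixing them inside one chain is best avoided.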

Let's look at another example:

Try rewriting the following code using the %>% operator:

df <- import("https://www.datos.gov.co/resource/epsv-yhtj.csv")
df <- as_tibble(df)
df <- select(df, -cod_ase_)
df <- mutate(df, estrato = ifelse(is.na(estrato), 1, estrato)) # replace missing estrato values with 1

Further reading:

Wickham, Hadley and Grolemund, Garrett, 2017. R for Data Science
- Ch. 5: Data transformation
- Ch. 10: Tibbles
- Ch. 12: Tidy data

R Workshop Applied to Research in Economics and Business

Lecture 5: Data Wrangling in R

Eduard Fernando Martínez González
