The rray

library(rray)
library(vctrs)

Introduction

rray provides a new array class that changes some of the behavior with base R arrays that I think could be altered to result in code that is more predictable and easy to program around. In the same spirit as tibble, rrays do less than base R arrays to try and work as consistently and intuitively as possible.

Subsetting

The problem

Base R subsets matrices and arrays with a default of drop = TRUE, similar to data frames. This is one of the most common sources of bugs, especially when programming around arrays, because a lot of mental work is required to ensure that functions don’t accidentally subset matrices in a way that coerces them to vectors.

x_mat <- matrix(1:6, ncol = 2)
x_mat
#>      [,1] [,2]
#> [1,]    1    4
#> [2,]    2    5
#> [3,]    3    6

x_mat[1:2,]
#>      [,1] [,2]
#> [1,]    1    4
#> [2,]    2    5

x_mat[1L,]
#> [1] 1 4

x_mat[,1:2]
#>      [,1] [,2]
#> [1,]    1    4
#> [2,]    2    5
#> [3,]    3    6

x_mat[,1L]
#> [1] 1 2 3

This example demonstrates that [ is not type stable. This means it is impossible to predict the type of the output solely based on the input’s type. Type is defined in the vctrs sense (read about it here), but for usage in rray think of the type as being the class of an object along with its attributes, and the shape of the object, which is the dimensionality of the array excluding the first dimension. The first dimension is handled separately as the size.

To give a concrete example, x_mat has a type of <integer[,2]>. Both 1:2 and 1L have types of <integer>. So the full subset expression looks like: <integer[,2]>[<integer>,]. If [ was type stable, you would be able to predict the output’s type based on these inputs. But you can’t! In some cases, this returns <integer[,2]>, and in others it returns <integer>.

Looking past the type instability, the fix for the dropping of dimensions is usually to use drop = FALSE.

x_mat[1, , drop = FALSE]
#>      [,1] [,2]
#> [1,]    1    4

But this requires careful thinking about how to subset arrays programmatically, adding the right number of commas where required to ensure that dimensions aren’t dropped. Consider how you would pull the first row of a 3D vs 4D array, and pay attention to the varying number of commas required to prevent a vector from being returned.

x_3D <- array(1:12, c(2, 3, 2))
x_4D <- array(1:12, c(2, 3, 2, 1))

x_3D[1, , , drop = FALSE]
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]    7    9   11

x_4D[1, , , , drop = FALSE]
#> , , 1, 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> 
#> , , 2, 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    7    9   11

The solution

rray takes a different approach, and never drops dimensions when subsetting. This results in a type stable [ method, and actually frees up subsetting syntax that I find more intuitive, but results in an error with base R. To convert x_3D to an rray, use as_rray().

x_3D_rray <- as_rray(x_3D)

# First row, but still 3D
x_3D_rray[1]
#> <rray<int>[,3,2][1]>
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]    7    9   11

# Trailing commas are ignored, so this is the same as above
x_3D_rray[1,]
#> <rray<int>[,3,2][1]>
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]    7    9   11

# In base R this is an error
x_3D[1,]
#> Error in x_3D[1, ]: incorrect number of dimensions

This type stability makes rrays more predictable to program around. Whether subsetting one element or multiple elements from a specific dimension, you can know that the output will have a consistent type.

x_3D_rray[, 1L]
#> <rray<int>[,1,2][2]>
#> , , 1
#> 
#>      [,1]
#> [1,]    1
#> [2,]    2
#> 
#> , , 2
#> 
#>      [,1]
#> [1,]    7
#> [2,]    8

x_3D_rray[, 1:2]
#> <rray<int>[,2,2][2]>
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    7    9
#> [2,]    8   10

If you are writing a package where you aren’t sure if the user is going to pass in an rray or an array, you’ll want to be sure that you can subset consistently, no matter the input type. For that, rray_subset() has been exposed, which is what powers [. This means you can have this type stable subsetting with base R objects.

x_3D[1, 1]
#> Error in x_3D[1, 1]: incorrect number of dimensions

rray_subset(x_3D, 1, 1)
#> , , 1
#> 
#>      [,1]
#> [1,]    1
#> 
#> , , 2
#> 
#>      [,1]
#> [1,]    7

# Same as with the rray
rray_subset(x_3D_rray, 1, 1)
#> <rray<int>[,1,2][1]>
#> , , 1
#> 
#>      [,1]
#> [1,]    1
#> 
#> , , 2
#> 
#>      [,1]
#> [1,]    7

Subsetting by position

The problem

Base R allows you to access “inner” elements of an array by using [ without any commas, i.e. x[i]. While technically this one operation on its own is type stable, when you compare it to x[i,] and x[,i], it becomes clear that it is doing something completely different.

# Positions 1 and 2
x_3D[1:2]
#> [1] 1 2

# Positions 2 and 3
# Correspond to elements at indices: 
# (2, 1, 1) and (1, 2, 1)
x_3D[2:3]
#> [1] 2 3

rray_dim(x_3D[1:2])
#> [1] 2
rray_dim(x_3D[1:2,,])
#> [1] 2 3 2
rray_dim(x_3D[,1:2,])
#> [1] 2 2 2
rray_dim(x_3D[,,1:2])
#> [1] 2 3 2

I think that this is a completely separate operation, called subsetting by position. The traditional behavior of x[i, j, ...] is called subsetting by index.

The solution

As seen briefly in the subsetting section, rrays never drop dimensions, and ignore trailing commas when subsetting. This means that x[i] selects rows from x, not positions.

x_3D_rray[2]
#> <rray<int>[,3,2][1]>
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    2    4    6
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]    8   10   12

# Same as this
x_3D_rray[2,]
#> <rray<int>[,3,2][1]>
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    2    4    6
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]    8   10   12

This behavior results in a nice symmetrical predictable behavior that is easy to see if you draw out the different operations. Just by adding commas, we go from selecting rows, to columns, to elements in the third dimension. The type is stable the entire time.

x[i]
x[,i]
x[,,i]

Because trailing commas are ignored, the following are also equivalent to the above explanation.

x[i,]
x[,i,]
x[,,i,]

But what about selecting by position? This is still a useful operation, so it would make sense to have some way to do it. For that, you can equivalently use either rray_yank() or [[. rray_yank() always returns a 1D vector, and allows you to pass in an integer vector to subset by position.

# First 3 positions
x_3D_rray[[1:3]]
#> [1] 1 2 3

rray_yank(x_3D_rray, 1:3)
#> [1] 1 2 3

There are assignment variations of these as well, which are very useful for replacing NA values if you combine it with the other type of object that i can be, a logical with the same dimensions as x. TRUE values are interpreted as the positions to subset.

dummy <- x_3D_rray

# Set the first row to NA
dummy[1] <- NA

# Set NA values to 0
dummy[[is.na(dummy)]] <- 0

dummy
#> <rray<int>[,3,2][2]>
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    0    0    0
#> [2,]    2    4    6
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]    0    0    0
#> [2,]    8   10   12

The 3rd arm of subsetting is when you want to subset by index like rray_subset(), but you want to actually drop the dimensions like rray_yank(). This is accomplished with rray_extract().

rray_subset(x_3D_rray, 1)
#> <rray<int>[,3,2][1]>
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]    7    9   11

rray_extract(x_3D_rray, 1)
#> [1]  1  3  5  7  9 11

Dimension Names

The problem

Base R has consistent dimension name handling rules, but they can be surprising because they often don’t retain the maximum amount of information that it seems like they could.

x_row_nms <- matrix(1:2, dimnames = list(c("r1", "r2")))
x_col_nms <- matrix(1:2, dimnames = list(NULL, c("c1")))

x_row_nms
#>    [,1]
#> r1    1
#> r2    2

x_col_nms
#>      c1
#> [1,]  1
#> [2,]  2

# names from x_row_nms are used
x_row_nms + x_col_nms
#>    [,1]
#> r1    2
#> r2    4

# names from x_col_nms are used
x_col_nms + x_row_nms
#>      c1
#> [1,]  2
#> [2,]  4

It is reasonable that dimension names from both inputs could be used here.

The solution

rray has a different set of dimension name reconciliation rules that attempts to pull names from all inputs.

x_row_nms_rray <- as_rray(x_row_nms)
x_col_nms_rray <- as_rray(x_col_nms)

x_row_nms_rray + x_col_nms_rray
#> <rray<int>[,1][2]>
#>    c1
#> r1  2
#> r2  4

x_col_nms_rray + x_row_nms_rray
#> <rray<int>[,1][2]>
#>    c1
#> r1  2
#> r2  4

The order of the inputs still matters. If both inputs have row names, for example, the row names of the result come from the first input.

x_col_and_row_nms_rray <- rray_set_row_names(x_col_nms_rray, c("row1", "row2"))
x_col_and_row_nms_rray
#> <rray<int>[,1][2]>
#>      c1
#> row1  1
#> row2  2

x_row_nms_rray + x_col_and_row_nms_rray
#> <rray<int>[,1][2]>
#>    c1
#> r1  2
#> r2  4

x_col_and_row_nms_rray + x_row_nms_rray
#> <rray<int>[,1][2]>
#>      c1
#> row1  2
#> row2  4

This approach emphasizes maintaining the maximum amount of information possible. The underlying engine for dimension name handling is rray_dim_names_common(). Pass it multiple inputs and it will return the common dimension names using rray’s rules.

rray_dim_names_common(x_row_nms, x_col_nms)
#> [[1]]
#> [1] "r1" "r2"
#> 
#> [[2]]
#> [1] "c1"

Introduction

Subsetting

The problem

The solution

Subsetting by position

The problem

The solution

Dimension Names

The problem

The solution

Contents