I have a large matrix that I would like to transpose without having to bring it into memory. There are three ways I can think of to accomplish this:

  1. Write the original matrix to a .txt file column by column. Later, read it into memory row by row with readLines(...), and sequentially write these rows to a new file. The problem with this approach is that I am unaware of how to append to a .txt file by column rather than by row.
  2. Read the matrix from the .txt file column by column, then write the columns to a new file by row. I have tried this with scan(pipe("cut -f1 filename.txt")), but this operation opens a separate connection at each iteration and therefore takes far too long to complete due to the overhead associated with opening and closing these connections.
  3. Use some unknown R function to complete the task.

Is there something I am missing here? Do I need to do this with a separate program? Thanks in advance for the help!


Best Answer


There are many languages far better suited to this kind of task. If you really want to use R, you will have to read the file one row at a time, take one element from the column you want, store it in a vector, and then write that vector as a row, repeating this for every column.

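    Columns = 1e9
    Rows = 1e6
    FileName = "YourFile.csv"
    NewFile = "NewFileName"
    for (i in 1:Columns) {
      # One input column holds one value per input row
      ColumnToBeRow = vector("numeric", Rows)
      for (j in 1:Rows) {
        # Read only row j, then keep the element in column i
        ColumnToBeRow[j] = read.csv(FileName, nrows=1, skip=(j - 1), header=F)[[i]]
      }
      # write.csv ignores append=TRUE, so use write.table to append one row
      write.table(t(ColumnToBeRow), NewFile, append=TRUE, sep=",",
                  row.names=FALSE, col.names=FALSE)
    }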
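The loop above re-reads the file once per cell, which is very slow. A minimal sketch of a variant that makes one pass over the file per block of columns is below. The function name `transpose_by_column_blocks` and its arguments are hypothetical, and it assumes a purely numeric file with no header, no empty fields, and a known number of rows and columns.

```r
# Hypothetical helper: transpose a numeric delimited file using one pass over
# the input per block of columns. Memory use is about nrows * block values.
transpose_by_column_blocks <- function(infile, outfile, nrows, ncols,
                                       block = 100L, sep = ",") {
  out <- file(outfile, open = "wt")
  on.exit(close(out), add = TRUE)
  for (start in seq(1L, ncols, by = block)) {
    cols <- start:min(start + block - 1L, ncols)
    buf <- matrix(NA_real_, nrow = nrows, ncol = length(cols))
    con <- file(infile, open = "rt")  # opened once per block, not once per cell
    for (j in seq_len(nrows)) {
      line <- readLines(con, n = 1L)
      fields <- as.numeric(strsplit(line, sep, fixed = TRUE)[[1]])
      buf[j, ] <- fields[cols]
    }
    close(con)
    # Each column of the block becomes one row of the output file.
    write.table(t(buf), out, sep = sep, row.names = FALSE, col.names = FALSE)
  }
  invisible(outfile)
}

# Small demonstration on a 6 x 5 matrix written to a temporary file.
m <- matrix(1:30, nrow = 6)
src <- tempfile(fileext = ".csv")
dst <- tempfile(fileext = ".csv")
write.table(m, src, sep = ",", row.names = FALSE, col.names = FALSE)
transpose_by_column_blocks(src, dst, nrows = 6, ncols = 5, block = 2L)
result <- as.matrix(read.csv(dst, header = FALSE))
```

A larger `block` means fewer passes over the input at the cost of more memory, so it is worth setting it as high as memory allows.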

This post to the R-help mailing list includes my naive (pseudo?) code to split the input file into n transposed output files, then tile across chunks of the n output files (in a checkerboard fashion) to stitch the transposed columns back together. It's efficient to do this in chunks of rows in both the transpose and stitch phases. It's worth asking what you're hoping to do after transposing the matrix, since the result is a file that still won't fit in memory. There is also a scholarly literature on efficient out-of-memory matrix transposition (e.g.).
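The split-and-stitch idea can be sketched roughly as follows. This is not the code from the mailing-list post. It is an illustration under assumptions: the function name `transpose_in_chunks` is hypothetical, the file is assumed to be tab-delimited with no header and no empty fields, and the values are handled as text, so they are never converted to numbers.

```r
# Hypothetical helper illustrating a two-phase, chunked transpose.
transpose_in_chunks <- function(infile, outfile, chunk_rows = 1000L, sep = "\t") {
  # Phase 1: transpose each chunk of rows in memory and write it to its own
  # temporary file. Every such file has one line per input column.
  con <- file(infile, open = "rt")
  parts <- character(0)
  ncols <- 0L
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break
    # cbind turns each input row into a column, so `fields` is already transposed.
    fields <- do.call(cbind, strsplit(lines, sep, fixed = TRUE))
    ncols <- nrow(fields)
    part <- tempfile(fileext = ".txt")
    writeLines(apply(fields, 1L, paste, collapse = sep), part)
    parts <- c(parts, part)
  }
  close(con)

  # Phase 2: read the same block of lines from every chunk file and paste the
  # blocks side by side to form complete rows of the transposed matrix.
  ins <- lapply(parts, file, open = "rt")
  out <- file(outfile, open = "wt")
  written <- 0L
  while (written < ncols) {
    blocks <- lapply(ins, readLines, n = chunk_rows)
    if (length(blocks[[1]]) == 0L) break  # guard against a short chunk file
    writeLines(do.call(paste, c(blocks, sep = sep)), out)
    written <- written + length(blocks[[1]])
  }
  close(out)
  for (p in ins) close(p)
  unlink(parts)
  invisible(outfile)
}

# Small demonstration: 6 rows split into chunks of 4 and 2 rows.
m <- matrix(1:30, nrow = 6)
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
write.table(m, src, sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
transpose_in_chunks(src, dst, chunk_rows = 4L)
result <- as.matrix(read.table(dst, sep = "\t"))
```

Only `chunk_rows` input rows (phase 1) or `chunk_rows` lines from each chunk file (phase 2) are held in memory at a time. With very many chunk files, the number of simultaneously open connections could become the limiting factor.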

scan can read it in as a stream, and all you need to add to the mix is the number of rows. Since your original matrix has a dimension attribute you just need to save the column value and use it as the row value when reading back in.

    MASS::write.matrix(matrix(1:30, 6), file="test.txt")
    matrix( scan("test.txt"), 5)
    #-------------
    Read 30 items
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    1    2    3    4    5    6
    [2,]    7    8    9   10   11   12
    [3,]   13   14   15   16   17   18
    [4,]   19   20   21   22   23   24
    [5,]   25   26   27   28   29   30

I suspect that your code to write the rows of matrices as lines will not be as fast as what Ripley's MASS package achieves, but if I'm wrong, you should offer the improvement to Prof. Ripley.