In the third project, a large amount of data is generated for analysis. However, the simulated data is too big to load into memory, so I need to extract some information without opening the file. After quite some time searching for a solution, the sed command in bash helped me out. More details about the third project are coming soon.
The data file
The data file contains a few snapshots taken at different time points. Some matrices are written to the file whenever a snapshot time is hit. These matrices may be very large, depending on the parameter settings you choose. What I want to extract are the matrices \(\boldsymbol{D}\) and \(\boldsymbol{R}\) at the end.
Extracting the matrix
First, I extract all the \(\boldsymbol{D}\) matrices, for example, from the data file. The following code does this for me.
1 |
|
As you can see, there are 25 data files in total that I need to deal with. For each data file, sed finds the lines from the one starting with “D{length(D)+1} = [” through the next “];” in the file testji.m and prints (p) all the matched lines to the file Dsji.Rdata.
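The original loop is not reproduced here, so the following is only a minimal sketch of the kind of range extraction described above. It assumes a hypothetical 5 × 5 grid of indices j and i (which would give the 25 files) and a tiny made-up test11.m; only the file name patterns testji.m and Dsji.Rdata come from the text.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing real is overwritten.
work=$(mktemp -d) && cd "$work" || exit 1

# A tiny stand-in for one data file (contents are made up for illustration).
cat > test11.m <<'EOF'
t{length(t)+1} = 1;
D{length(D)+1} = [
1 0 ;
0 1 ;
];
R{length(R)+1} = [
2 ;
];
D{length(D)+1} = [
3 0 ;
0 3 ;
];
EOF

# Hypothetical index ranges: 5 x 5 = 25 files in a real run; only test11.m exists here.
for j in 1 2 3 4 5; do
  for i in 1 2 3 4 5; do
    [ -f "test$j$i.m" ] || continue
    # -n turns off automatic printing; p prints only the lines inside the range
    # that starts at "D{length(D)+1} = [" and ends at the next line starting with "];".
    sed -n '/^D{length(D)+1} = \[/,/^];/p' "test$j$i.m" > "Ds$j$i.Rdata"
  done
done

cat Ds11.Rdata
```

In this sketch the R block is skipped because its header does not match the start pattern, so Ds11.Rdata holds only the two D blocks.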
Replacing the last D matrix
Then, I want to extract the last \(\boldsymbol{D}\) matrix. How can I do that? Use the following code:
1 |
|
The command grep returns the number of occurrences of the line “D{length(D)+1} = [”. Then sed replaces the header of the last matrix with D, extracts this matrix, and writes it to the file Dji.Rdata. As I want to work with these results in R, I reformat them to fit R's matrix format, which looks something like:
D = structure(c(1,0,0,1),.Dim=c(2,2))
Finally, I can source the file directly in R and read the matrix.
Details in sed
The first sed in the second script is the key part of this procedure.
sed '/D'{'length(D)+1'}'/{G;s/\nX\{'$B'\}//;tend;x;s/^/X/;x;P;d};p;d;:end;s/D'{'length(D)+1'}'/D/;:a;n;ba' Ds"$j$i".Rdata>Dt"$j$i".Rdata
I don’t fully understand what every letter in this command means. In this answer, the author explains that the idea is to store an X in the hold space at each match and, once all the Xs are there, to loop until the end of the file. If you know more, please leave a comment below the post. Thanks!
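To make the idea a bit more concrete, here is a small annotated sketch of the same hold-space counter, written one command per line and run on a made-up file. Two things are assumptions on my side: that B is the number of header lines minus one (computed here with grep -c), and that a later step pulls the renamed block out with a range print. Neither is confirmed by the original script.

```shell
#!/usr/bin/env bash
work=$(mktemp -d) && cd "$work" || exit 1

# Made-up input with two D blocks, in the shape of a Dsji.Rdata file.
cat > Ds11.Rdata <<'EOF'
D{length(D)+1} = [
1 0 ;
0 1 ;
];
D{length(D)+1} = [
3 0 ;
0 3 ;
];
EOF

# -c counts matching lines; -F treats the pattern as a fixed string.
A=$(grep -cF 'D{length(D)+1}' Ds11.Rdata)
B=$((A - 1))   # assumption: let B headers pass, then act on the next one

sed '
/D{length(D)+1}/{
  # G appends a newline plus the hold space (the Xs collected so far).
  G
  # Try to delete the newline and B Xs; this works only once B headers were seen.
  s/\nX\{'"$B"'\}//
  # If it worked, this is header number B+1 (the last one): jump to the label end.
  t end
  # Otherwise add one X to the hold space ...
  x
  s/^/X/
  x
  # ... print the header line unchanged and start the next cycle.
  P
  d
}
# Lines that are not headers are printed as they are.
p
d
:end
# Rename the last header to plain D, then copy the rest of the file through.
s/D{length(D)+1}/D/
:a
n
b a
' Ds11.Rdata > Dt11.Rdata

# Assumed follow-up step: keep only the renamed block.
sed -n '/^D = \[/,/^];/p' Dt11.Rdata > D11.Rdata
cat D11.Rdata
```

Here A is 2 and B is 1, so the first header passes through untouched and only the second one is renamed to “D = [”.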
The third sed reformats the matrix to fit R. The -e option lets several commands run in a single sed call.
sed -i -e 's/D = \[/D = structure(c(/' -e 's/\];/),.Dim=c('$(echo $C)','$(echo $C)'))/' -e 's/;/ /g' -e '2,${s/ /,/g}' -e '2,${s/,,/ /g}' -e '3,'$(echo $D)'{s/^/,/g}' D"$j$i".Rdata
The first expression replaces “D = [” with “D = structure(c(”, while the second replaces the closing “];” with “),.Dim=c(row,col))”. Then every “;” is replaced by a space, and from the second line to the end every space is replaced by “,”. Next, again from the second line on, each “,,” is replaced by a space, and a “,” is added at the beginning of each line from the third line up to line $D. In the end, a standard R matrix definition is rebuilt.
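As a check on this description, the sketch below runs essentially the same expressions on a made-up 2 × 2 matrix. The input layout (one row per line, entries separated by single spaces, each row ending in “ ;”) is my assumption about what the extracted file looks like, and so are the values C=2 (the matrix dimension) and D=3 (the number of the last row line). It assumes GNU sed and prints to standard output instead of editing in place with -i.

```shell
#!/usr/bin/env bash
work=$(mktemp -d) && cd "$work" || exit 1

# Made-up input in the assumed layout.
cat > D11.Rdata <<'EOF'
D = [
1 0 ;
0 1 ;
];
EOF

C=2   # assumed: number of rows and columns
D=3   # assumed: line number of the last matrix row

# Each -e adds one expression; they are applied in order to every input line.
result=$(sed -e 's/D = \[/D = structure(c(/' \
    -e 's/\];/),.Dim=c('"$C"','"$C"'))/' \
    -e 's/;/ /g' \
    -e '2,${s/ /,/g}' \
    -e '2,${s/,,/ /g}' \
    -e '3,'"$D"'{s/^/,/}' D11.Rdata)

printf '%s\n' "$result"
```

With this input the four output lines are “D = structure(c(”, “1,0 ”, “,0,1 ” and “),.Dim=c(2,2))”, which R reads as c(1,0,0,1) with dimensions 2 × 2. A different row layout in the real files would need different expressions.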
BTW, a good tutorial can be found here. Have fun!