Sed:a fast way to extract information from data

In the third project, a bunch of mega data are generated for analysis. However, the simulated data is too big to load in memory. Thus, I need to extract some information without opening the file. After quite a few time searching for a solution, sed command in bash language helps me out. More details about the third project is coming soon.

The data file

The data file contains a few snapshot on different time points. Some matrix are written in the file when a snapshot time is hit. These matrix may be very large according to the parameter settings you chose. What I want to extract is the matrix $\boldsymbol{D}$ and $\boldsymbol{R}$ at the end.

Raw data files.

Extracting the matrix

Firstly, I extract all the $\boldsymbol{D}$ s, for example, from the data file. The following code does the thing for me.

#!/bin/bash
for j in {0..4};
do
for i in {0..4};
do 
sed -n '/D'{'length(D)+1'}' = \[\r/,/\];/p' test"$j$i".m > Ds"$j$i".Rdata
echo $j$i' done'
done
done

As you see, there are in total 25 data files I need to deal with. For each data file, sed finds the line starting with “ $D\left\{length(D)+1\right\} = [$ ” till “ $];$ ” from the file testji.m and writes/prints ( $p$ ) all the matched lines to the file Dsji.Rdata.

All D matrix are extracted.

Replacing the last D matrix

Then, I want to extract the last $\boldsymbol{D}$ matrix. How can I do that? Use the following code:

#!/bin/bash
for j in {0..4};
do
for i in {0..4};
do 
A=$(grep -c 'D'{'length(D)+1'}' = \[' Ds"$j$i".Rdata)
B=$[$A-1]
sed '/D'{'length(D)+1'}'/{G;s/\nX\{'$B'\}//;tend;x;s/^/X/;x;P;d};p;d;:end;s/D'{'length(D)+1'}'/D/;:a;n;ba' Ds"$j$i".Rdata>Dt"$j$i".Rdata
sed -n '/D = \[\r/,/\];/p' Dt"$j$i".Rdata > D"$j$i".Rdata

C=$(cat D"$j$i".Rdata |wc -l)
C=$[$C-2]
D=$[$C+1]

sed -i -e 's/D = \[/D = structure(c(/' -e 's/\];/),.Dim=c('$(echo $C)','$(echo $C)'))/' -e 's/;/ /g' -e '2,${s/ /,/g}' -e '2,${s/,,/ /g}' -e '3,'$(echo $D)'{s/^/,/g}' D"$j$i".Rdata
echo $j$i' done'
done
done

The command grep returns the number of the replicates of the line “ $D\left\{length(D)+1\right\} = [$ ”. Then sed replaces the head of the last matrix with $D$ and extracts this matrix out and writes into file Dij.Rdata. As I want to work in R with these results, I reformat the structure of the results to fit the matrix format in R which is something like:

1	D = structure(c(1,0,0,1),.Dim=c(2,2))

At last, I can directly source the file in R and read the matrix.

The last D matrix is captured.

Details in sed

The first sed in the second script is the key part in this function.

1	sed '/D'{'length(D)+1'}'/{G;s/\nX\{'$B'\}//;tend;x;s/^/X/;x;P;d};p;d;:end;s/D'{'length(D)+1'}'/D/;:a;n;ba' Ds"$j$i".Rdata>Dt"$j$i".Rdata

I don’t fully understand what every letter in this command means. In this answer, the author explained the idea is to store a $X$ at each match in the holdspace, and when all the $X$ s are there, loop till the end of file. If you know more, please comment after the post. Thanks!

The third sed reformats the matrix to fit in R. -e executes several commands in one line.

1	sed -i -e 's/D = \[/D = structure(c(/' -e 's/\];/),.Dim=c('$(echo $C)','$(echo $C)'))/' -e 's/;/ /g' -e '2,${s/ /,/g}' -e '2,${s/,,/ /g}' -e '3,'$(echo $D)'{s/^/,/g}' D"$j$i".Rdata

The first segment replaces “ $D = [$ ” with “ $D = structure(c($ ” while the second replaces the tail “];” with “(,.Dim=c(row,col)”. Then all the signs “;” are substituted by space and the single space is substituted by “,” from the second line to the end. Following the replacement of “,,” by space from the second line and add “,” at the beginning of the line from the third line on. Finally, a standard matrix form is rebuilt.

BTW, a good tutorial can be found here. Have fun!