As discussed in the main introduction, a numpy array has been defined that imitates the binary structure of a Seismic Unix (SU) file.
Why an SU file? There are a number of reasons.
So why not just use SU directly?
What are some of the downsides of using Python?
All data on computers is stored as ones and zeros. Each one or zero is referred to as a bit. A group of 8 bits (an 8 bit "word") is referred to as a byte.
The actual order of the ones and zeros, the 'encoding', depends on its type. Whole numbers (integers) such as 1, 2, 3 and 4 are easy - they are stored as the direct binary equivalent. Generally, 4 bytes are allocated to store each integer - that is, 32 bits. Thus, there are $2^{32}$ possible combinations available. If you don't care about negative numbers, these can all be positive (unsigned). But generally we want both positive and negative numbers (signed).
In [1]:
print("unsigned (positive numbers only) 2^32 =", 2**32)
print("signed (positive and negative numbers) (2^32/2) - 1 =", 2**32 // 2 - 1)  # have to account for zero
print("the binary representation of 0 is", bin(0))
print("the binary representation of (2^32/2) - 1 (signed) is", bin(2**32 // 2 - 1))
Numbers with decimal points are referred to as floating point numbers (floats). Obviously if you're working with ones and zeros, you can't really store a decimal point. So special encodings are used, for example IEEE floats.
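As a quick illustration (not part of the SU workflow), Python's standard `struct` module can show the raw bits of an IEEE 754 single-precision float - one sign bit, 8 exponent bits, and 23 fraction bits:

```python
import struct

# pack 1.0 as a 4-byte big-endian IEEE 754 single-precision float
raw = struct.pack('>f', 1.0)
bits = ''.join(format(b, '08b') for b in raw)
print(bits)  # → 00111111100000000000000000000000 (0x3F800000)
```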
These numbers do not have to be 4 bytes. If you know you don't need $2^{32}$ integers, you can use 2 bytes ($2^{16}$), often called a short integer. If 32 bits isn't enough, you can use 64 bits ($2^{64}$), called a long int. A more complete list of fundamental data types available to numpy can be seen here.
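These widths can be checked directly in numpy; every dtype reports its size in bytes via `itemsize`:

```python
import numpy as np

# byte widths and ranges of the integer types discussed above
for dt in (np.int16, np.int32, np.int64):
    d = np.dtype(dt)
    print(d.name, "-", d.itemsize, "bytes, max value", np.iinfo(dt).max)
```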
Why is this important? Let's have a quick look at the SU header definition. There are 71 entries, a mixture of 2-byte and 4-byte integers. Most entries are rarely used, but there are some "highly recommended" entries:
Byte # | Size (bytes) | SU keyword | Description |
---|---|---|---|
001 - 004 | 4 | tracl | Trace sequence number within line. |
009 - 012 | 4 | fldr | Original field record number. |
013 - 016 | 4 | tracf | Trace sequence number within original field record. |
029 - 030 | 2 | trid | Trace identification code. |
115 - 116 | 2 | ns | Number of samples in this trace. |
117 - 118 | 2 | dt | Sample interval of this trace in microseconds. |
To this list I will add 4 more entries:
Byte # | Size (bytes) | SU keyword | Description |
---|---|---|---|
073 - 076 | 4 | sx | X source coordinate. |
077 - 080 | 4 | sy | Y source coordinate. |
081 - 084 | 4 | gx | X receiver group coordinate. |
085 - 088 | 4 | gy | Y receiver group coordinate. |
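The entries above can be sketched as a numpy structured dtype. This is only a subset for illustration (the real `su_header_dtype` in PySeis covers all 71 entries); each offset is the starting byte number minus one:

```python
import numpy as np

# subset of the SU header: field names, numpy types, and byte offsets
su_subset = np.dtype({
    'names':   ['tracl', 'fldr', 'tracf', 'trid', 'sx',   'sy',   'gx',   'gy',   'ns',   'dt'],
    'formats': ['<i4',   '<i4',  '<i4',   '<i2',  '<i4',  '<i4',  '<i4',  '<i4',  '<u2',  '<u2'],
    'offsets': [0,        8,      12,      28,     72,     76,     80,     84,     114,    116],
    'itemsize': 240,  # the full SU header is always 240 bytes
})
print(su_subset.itemsize)  # → 240
```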
So the first 4 bytes contain an integer called "tracl", and so on. Interestingly, "ns" is a 2-byte integer, a short, which turns out to be a significant limitation. 2 bytes = 16 bits = $2^{16}$ samples.
In [2]:
print("thus a single trace can only contain 2^16 (%d) samples" % 2**16)
If we assume a 1 ms sample rate ("dt" = 1000), then a single trace can only contain about 65.5 seconds of data. This seems like a lot, until you consider that a surface source like mini-SOSIE can easily record a single event for over a minute.
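The limit is easy to check, remembering that "dt" is stored in microseconds:

```python
dt_us = 1000       # sample interval in microseconds (1 ms)
max_ns = 2**16     # number of samples addressable by a 2-byte "ns"
print(max_ns * dt_us / 1e6, "seconds")  # → 65.536 seconds
```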
Another limitation is the coordinates themselves. As 4-byte integers, they cannot store fractional information. Assuming an easting/northing coordinate system (highly recommended), this limits the resolution to 1 metre. You can get around this, but we'll leave that for later.
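A quick sketch of that precision loss, using a made-up UTM-style easting with sub-metre precision:

```python
import numpy as np

true_easting = 456789.37           # metres, with sub-metre precision (hypothetical value)
stored = np.int32(true_easting)    # what a 4-byte integer header field keeps
print(stored)                      # → 456789, the 0.37 m is lost
```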
So an SU file has the following format:
Item | Size (bytes) | Cumulative sum (bytes) |
---|---|---|
header | $240$ | $240$ |
trace | $4ns$ | $240+4ns$ |
header | $240$ | $240+(240+4ns)$ |
trace | $4ns$ | $2(240+4ns)$ |
header | $240$ | $240+2(240+4ns)$ |
trace | $4ns$ | $3(240+4ns)$ |
header | $240$ | $240+3(240+4ns)$ |
trace | $4ns$ | $4(240+4ns)$ |
.... | .... | .... |
EOF | total file size | $nt(240+4ns)$ |
Where "ns" is the number of samples per trace and "nt" is the number of traces.
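The formula is easy to verify with a numpy dtype describing one full trace (hypothetical values for "ns" and "nt"):

```python
import numpy as np

ns, nt = 500, 100   # hypothetical samples per trace and number of traces

# one SU trace = 240-byte header followed by ns 4-byte floats
trace_dtype = np.dtype([('header', 'u1', (240,)), ('data', '<f4', (ns,))])
print(trace_dtype.itemsize)       # → 2240, i.e. 240 + 4*ns
print(nt * trace_dtype.itemsize)  # → 224000, the predicted total file size
```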
In [ ]:
import sys
sys.path.insert(2, '../PySeis')
import toolbox
import numpy as np
%pylab nbagg

sutype = toolbox.su_header_dtype
headers = [sutype[item].itemsize for item in sutype.fields]  # byte size of each header field
headers_cumsum = np.cumsum(headers)                          # running byte offset within the 240-byte header
n = 0
pylab.figure(figsize=(15, 1))
for i in range(4):  # lay out the first four traces end to end
    plot(n + headers_cumsum, np.zeros_like(headers_cumsum), 'r.', label='header')
    plot(n + np.arange(500 * 4) + 240, np.zeros(500 * 4), 'b.', label='trace')
    n += 240 + (500 * 4)  # advance past one header plus 500 4-byte samples
In [ ]:
import toolbox
import numpy as np
%pylab nbagg

sutype = toolbox.su_header_dtype
headers = [sutype[item].itemsize for item in sutype.fields]  # byte size of each header field
headers_cumsum = np.cumsum(headers)                          # running byte offset within the header
n = 0
pylab.figure(figsize=(5, 5))
for i in range(100):  # one vertical column of bytes per trace
    plot(np.ones_like(headers_cumsum) * i, headers_cumsum, 'rs', label='header')
    plot(np.ones(500) * i, n + np.arange(0, 500 * 4, 4) + 240, 'bs', label='trace')
pylab.ylim([1000, 0])  # byte 0 at the top, as in the file
pylab.show()