problem
The Konrad Lorenz Research Station (University of Vienna) has multiple terabytes of video data — continuous wildlife footage from fixed cameras monitoring ravens. 1TB translates to roughly 85 days of non-stop video.
Researchers manually watch the footage and annotate behavioral events. Most of it contains nothing, punctuated by brief bursts of activity. Missing short events is not an option, so the work is time-consuming and demands constant concentration.
sample footage from the station
goal
Automatically detect activity segments.
first attempt (2019): gpu frame differencing
Stack: Docker, OpenCV, NumPy, CuPy (CUDA), PyNvVideoCodec (NVDEC hardware decoding).
Decode every frame, compute pixel-by-pixel differences between consecutive frames. Areas with large differences indicate motion.
frame differencing output — bright areas show motion
This works, but requires significant compute. I pursued several optimizations.
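The core step is just an absolute difference between consecutive frames. A minimal sketch, using OpenCV for decoding in place of NVDEC and CuPy for the GPU math (paths and any thresholding are left to the caller):

```python
import cv2
import cupy as cp

def frame_diff_scores(path):
    """Mean absolute difference between consecutive frames, computed on the GPU."""
    cap = cv2.VideoCapture(path)
    prev, scores = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cur = cp.asarray(frame, dtype=cp.int16)    # int16 avoids uint8 wrap-around
        if prev is not None:
            diff = cp.abs(cur - prev)              # bright areas = motion
            scores.append(float(diff.mean()))      # one activity score per frame pair
        prev = cur
    cap.release()
    return scores
```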
optimization 1: downsampling
Reduced resolution from 1440×1080 to 360×270, converted RGB to grayscale.
1440×1080 rgb
360×270 rgb
360×270 gray
| stage | per frame | 27:30 video (41,250 frames) |
|---|---|---|
| 1440×1080 RGB | 4.67 MB | 192 GB |
| 360×270 RGB | 292 KB | 12 GB |
| 360×270 Gray (float16) | 194 KB | 8 GB |
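A sketch of the downsampling step, assuming OpenCV for the resize and a float16 output to match the table above:

```python
import cv2
import numpy as np

def shrink(frame_bgr):
    """Downsample a 1440x1080 BGR frame to 360x270 grayscale float16 (~194 KB instead of ~4.67 MB)."""
    small = cv2.resize(frame_bgr, (360, 270), interpolation=cv2.INTER_AREA)  # dsize is (width, height)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    return gray.astype(np.float16)
```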
optimization 2: frame batching
Split the 27:30 video into 4 equal segments (~7 min each). At each step, sample one frame from each segment's current position and tile all 4 into a single GPU buffer. This processes 4 frames at once, one from each segment.
Diagram: the video is split into 4 segments processed in parallel; at each step, frame t and frame t+1 from every segment are tiled into a single GPU buffer and the four diffs are computed in one pass.
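A rough sketch of the tiling, assuming 4 downsampled grayscale frames per time step (the 2×2 layout and the helpers are illustrative, not the exact buffer format used):

```python
import cupy as cp

def tile_2x2(frames):
    """Pack 4 grayscale frames of shape (270, 360) into one 540x720 GPU buffer."""
    top = cp.concatenate([cp.asarray(frames[0]), cp.asarray(frames[1])], axis=1)
    bottom = cp.concatenate([cp.asarray(frames[2]), cp.asarray(frames[3])], axis=1)
    return cp.concatenate([top, bottom], axis=0).astype(cp.float16)

def batched_diff(frames_t, frames_t1):
    """One diff over the tiled buffer, then one motion score per segment."""
    diff = cp.abs(tile_2x2(frames_t) - tile_2x2(frames_t1))
    h, w = diff.shape[0] // 2, diff.shape[1] // 2
    return [float(diff[i*h:(i+1)*h, j*w:(j+1)*w].mean()) for i in (0, 1) for j in (0, 1)]
```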
optimization 3: frame skipping
Process at 2 fps instead of 25 fps. Tradeoff: sub-500 ms events get missed. I suspect those are hard to catch manually too.
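Skipping is straightforward with OpenCV's grab/retrieve split: advance through every frame but only materialize the sampled ones (the path and the 2 fps target are placeholders):

```python
import cv2

TARGET_FPS = 2
cap = cv2.VideoCapture("video.mp4")                              # placeholder path
step = max(1, round(cap.get(cv2.CAP_PROP_FPS) / TARGET_FPS))     # ~12 for a 25 fps source
idx = 0
while cap.grab():                                                # advance one frame
    if idx % step == 0:
        ok, frame = cap.retrieve()                               # materialize only sampled frames
        # ... hand `frame` to the differencing step
    idx += 1
cap.release()
```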
subject detection pipeline
After identifying motion regions, detect actual subjects (birds) to filter out false positives caused by wind, shadows, and lighting changes.
Step 1: Sparse sampling — pick one peak frame per activity region
Motion curve shows activity over time. Green vertical lines mark selected frames at peak activity in each region. Only 2 frames selected from 80 seconds of video.
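A sketch of the peak-picking step: threshold the motion curve into activity regions, then keep the frame index at each region's maximum (the threshold and the exact region definition are assumptions about the pipeline's intermediate data):

```python
import numpy as np

def pick_peak_frames(motion, threshold):
    """motion: 1-D array of per-frame motion scores. Returns one frame index per activity region."""
    active = np.append(motion > threshold, False)   # sentinel closes a trailing region
    peaks, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                                              # region begins
        elif not a and start is not None:
            peaks.append(start + int(np.argmax(motion[start:i])))  # frame at peak activity
            start = None
    return peaks
```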
Step 2 & 3: Segment selected frames, classify segments
(segment)
(classify)
Meta SAM segments each frame into regions. ResNet-50 (pretrained on ImageNet) classifies each segment. Only segments classified as "bird" with high confidence are kept.
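A sketch of steps 2 and 3 with the off-the-shelf APIs (segment-anything's automatic mask generator and torchvision's pretrained ResNet-50). The checkpoint path, confidence threshold, and the small set of bird labels are placeholders, not the exact filter used:

```python
import torch
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from torchvision.models import resnet50, ResNet50_Weights

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint
mask_gen = SamAutomaticMaskGenerator(sam)

weights = ResNet50_Weights.IMAGENET1K_V2
classifier = resnet50(weights=weights).eval()
preprocess = weights.transforms()
labels = weights.meta["categories"]
BIRD_LABELS = {"magpie", "jay", "kite", "goldfinch"}  # illustrative subset of ImageNet's bird classes

def bird_segments(frame_rgb, min_conf=0.6):
    """Segment a frame with SAM, classify each segment crop, keep confident bird hits."""
    keep = []
    for m in mask_gen.generate(frame_rgb):
        x, y, w, h = (int(v) for v in m["bbox"])          # segment bounding box (XYWH)
        crop = frame_rgb[y:y + h, x:x + w]
        if crop.size == 0:
            continue
        batch = preprocess(torch.from_numpy(crop).permute(2, 0, 1)).unsqueeze(0)
        with torch.no_grad():
            probs = classifier(batch).softmax(dim=1)[0]
        conf, idx = probs.max(dim=0)
        if float(conf) >= min_conf and labels[int(idx)] in BIRD_LABELS:
            keep.append((m["bbox"], labels[int(idx)], float(conf)))
    return keep
```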
new approach (2025): h.264 already computed this
Stack: Docker, FastAPI, PostgreSQL, React, ffmpeg, mvextractor, PyAV.
While debugging the pipeline above, I kept hitting out-of-memory errors: uncompressed frames exhausted RAM and swap, and the 256 MB video file expanded to gigabytes in memory. I knew H.264 used motion vectors but hadn't considered extracting them directly.
H.264 doesn't store frames as RGB matrices. It uses two main concepts to reduce storage — one of which I use to solve the problem:
- I-frames (keyframes) — full image as DCT coefficients
- P-frames (predicted) — motion vectors + residuals, referencing the previous I or P frame
Decoding starts at an I-frame, then each P-frame applies its motion vectors and residuals to reconstruct the next.
motion vectors (temporal compression)
Instead of storing pixels, P-frames store how 16×16 blocks moved from the previous frame. A vector like (3, -2) means "copy this block from 3 pixels right, 2 pixels up." The encoder searches for the best matching block and stores only the displacement.
This is why motion detection data is already embedded in the file — we just need to extract it.
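This is where mvextractor comes in: it hands back the per-block vectors for each frame. A sketch that turns them into a per-frame motion score (the (N, 10) column layout follows the library's documentation; verify against your version):

```python
import numpy as np
from mvextractor.videocap import VideoCap   # pip package: motion-vector-extractor

def motion_per_frame(path):
    """Per-frame motion score computed from the H.264 motion vectors."""
    cap = VideoCap()
    if not cap.open(path):
        raise RuntimeError(f"could not open {path}")
    scores = []
    while True:
        ok, _frame, mvs, frame_type, _ts = cap.read()
        if not ok:
            break
        if frame_type == "I" or len(mvs) == 0:
            scores.append(0.0)                   # I-frames carry no motion vectors
            continue
        dx = mvs[:, 7] / mvs[:, 9]               # motion_x / motion_scale
        dy = mvs[:, 8] / mvs[:, 9]               # motion_y / motion_scale
        scores.append(float(np.mean(np.hypot(dx, dy))))  # mean block displacement in pixels
    cap.release()
    return scores
```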
discrete cosine transform (spatial compression)
I-frames and P-frame residuals are compressed using the Discrete Cosine Transform. Each 8×8 block is represented as a weighted sum of 64 basis patterns. After quantization, the high-frequency coefficients are typically near zero and can be discarded; the resulting runs of zeros compress well with run-length encoding.
Diagram: 64 fixed basis patterns (the same for all blocks) weighted by 64 coefficients; after quantization only ~18 non-zero coefficients are stored.
Reconstructing an 8×8 block of the letter "A" from DCT coefficients (left: result so far; center: weighted pattern being added; right: basis pattern × coefficient).
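The reconstruction above is just an inverse 2-D DCT over a mostly-zero coefficient matrix. A tiny sketch with SciPy (the coefficient values are made up to illustrate a sparse, quantized block):

```python
import numpy as np
from scipy.fft import idctn

# Hypothetical sparse block: DC term plus a few low-frequency AC terms,
# mimicking what survives quantization.
coeffs = np.zeros((8, 8))
coeffs[0, 0] = 52.0     # DC: average brightness of the block
coeffs[0, 1] = -18.0    # low-frequency horizontal detail
coeffs[1, 0] = 11.0     # low-frequency vertical detail
coeffs[2, 1] = -6.0

block = idctn(coeffs, norm="ortho")  # inverse 2-D DCT: weighted sum of the 64 basis patterns
print(block.round(1))                # 8x8 pixel block reconstructed from 4 coefficients
```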
what's actually stored in the file
I extracted the motion vectors and residuals from the compressed stream and rendered them as videos for visualization:
motion vectors
residuals
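For a quick look at the vectors without custom extraction code, ffmpeg can draw them directly onto the decoded frames (a generic overlay, not the exact rendering shown above; paths are placeholders):

```python
import subprocess

# Overlay the exported motion vectors from P- and B-frame predictions onto the decoded video.
subprocess.run([
    "ffmpeg", "-flags2", "+export_mvs",   # ask the decoder to export motion vectors
    "-i", "input.mp4",
    "-vf", "codecview=mv=pf+bf+bb",       # draw the vectors on each frame
    "mv_overlay.mp4",
], check=True)
```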
advantages
- no GPU required, runs entirely on CPU
- parallelizable: one video per core
- full resolution, complete motion vectors, no temporal or spatial sampling
result
The target machine has a 4-core CPU and no GPU. Frame differencing would run slower than realtime — processing 1TB would take months.
| | frame differencing | MV extraction |
|---|---|---|
| dev machine (16-core, GPU) | 17× realtime (5 days/TB) | 168× realtime (12 hours/TB) |
| target machine (4-core, no GPU) | 0.7× realtime (4 months/TB) | 52× realtime (1.5 days/TB) |
context
Built in cooperation with the University of Vienna. The Konrad Lorenz Research Station studies corvid behavior — specifically ravens and their social dynamics.
Behavioral science is outside my background. The signal processing problem was interesting, and it felt good to build something that might support actual research.