problem
The Konrad Lorenz Research Station (University of Vienna) has multiple terabytes of video data — continuous wildlife footage from fixed cameras monitoring ravens. 1TB translates to roughly 85 days of non-stop video.
Researchers manually watch the footage and annotate behavioral events. Most of it contains nothing, punctuated by brief bursts of activity. Missing short events is not an option, so the work is time-consuming and demands constant concentration.
sample footage from the station
goal
Automatically detect activity segments.
first attempt (2019): gpu frame differencing
Stack: Docker, OpenCV, NumPy, CuPy (CUDA), PyNvVideoCodec (NVDEC hardware decoding).
Decode every frame, compute pixel-by-pixel differences between consecutive frames. Areas with large differences indicate motion.
frame differencing output — bright areas show motion
This works, but requires significant compute. I pursued several optimizations.
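The core step is just an absolute difference between consecutive frames. A minimal sketch, using OpenCV for decoding in place of NVDEC and CuPy for the GPU math (paths and any thresholding are left to the caller):

```python
import cv2
import cupy as cp

def frame_diff_scores(path):
    """Mean absolute difference between consecutive frames, computed on the GPU."""
    cap = cv2.VideoCapture(path)
    prev, scores = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cur = cp.asarray(frame, dtype=cp.int16)    # int16 avoids uint8 wrap-around
        if prev is not None:
            diff = cp.abs(cur - prev)              # bright areas = motion
            scores.append(float(diff.mean()))      # one activity score per frame pair
        prev = cur
    cap.release()
    return scores
```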
optimization 1: downsampling
Reduced resolution from 1440×1080 to 360×270, converted RGB to grayscale.
1440×1080 rgb
360×270 rgb
360×270 gray
| stage | per frame | 27:30 video (41,250 frames) |
|---|---|---|
| 1440×1080 RGB | 4.67 MB | 192 GB |
| 360×270 RGB | 292 KB | 12 GB |
| 360×270 Gray (float16) | 194 KB | 8 GB |
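A sketch of the downsampling step, assuming OpenCV for the resize and a float16 output to match the table above:

```python
import cv2
import numpy as np

def shrink(frame_bgr):
    """Downsample a 1440x1080 BGR frame to 360x270 grayscale float16 (~194 KB instead of ~4.67 MB)."""
    small = cv2.resize(frame_bgr, (360, 270), interpolation=cv2.INTER_AREA)  # dsize is (width, height)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    return gray.astype(np.float16)
```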
optimization 2: frame batching
Split the 27:30 video into 4 equal segments (~7 min each). At each step, sample one frame from each segment's current position and tile all 4 into a single GPU buffer. This processes 4 frames at once, one from each segment.
Diagram: the video is split into 4 segments processed in parallel; at each step, frame t and frame t+1 from every segment are tiled into a single GPU buffer and the four diffs are computed in one pass.
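A rough sketch of the tiling, assuming 4 downsampled grayscale frames per time step (the 2×2 layout and the helpers are illustrative, not the exact buffer format used):

```python
import cupy as cp

def tile_2x2(frames):
    """Pack 4 grayscale frames of shape (270, 360) into one 540x720 GPU buffer."""
    top = cp.concatenate([cp.asarray(frames[0]), cp.asarray(frames[1])], axis=1)
    bottom = cp.concatenate([cp.asarray(frames[2]), cp.asarray(frames[3])], axis=1)
    return cp.concatenate([top, bottom], axis=0).astype(cp.float16)

def batched_diff(frames_t, frames_t1):
    """One diff over the tiled buffer, then one motion score per segment."""
    diff = cp.abs(tile_2x2(frames_t) - tile_2x2(frames_t1))
    h, w = diff.shape[0] // 2, diff.shape[1] // 2
    return [float(diff[i*h:(i+1)*h, j*w:(j+1)*w].mean()) for i in (0, 1) for j in (0, 1)]
```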
optimization 3: frame skipping
Process at 2 fps instead of 25 fps. Tradeoff: sub-500 ms events get missed. I suspect those are hard to catch manually too.
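Skipping is straightforward with OpenCV's grab/retrieve split: advance through every frame but only materialize the sampled ones (the path and the 2 fps target are placeholders):

```python
import cv2

TARGET_FPS = 2
cap = cv2.VideoCapture("video.mp4")                              # placeholder path
step = max(1, round(cap.get(cv2.CAP_PROP_FPS) / TARGET_FPS))     # ~12 for a 25 fps source
idx = 0
while cap.grab():                                                # advance one frame
    if idx % step == 0:
        ok, frame = cap.retrieve()                               # materialize only sampled frames
        # ... hand `frame` to the differencing step
    idx += 1
cap.release()
```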
subject detection pipeline
After identifying motion regions, detect actual subjects (birds) to filter out false positives caused by wind, shadows, and lighting changes.
Step 1: Sparse sampling — pick one peak frame per activity region
Motion curve shows activity over time. Green vertical lines mark selected frames at peak activity in each region. Only 2 frames selected from 80 seconds of video.
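A sketch of the peak-picking step: threshold the motion curve into activity regions, then keep the frame index at each region's maximum (the threshold and the exact region definition are assumptions about the pipeline's intermediate data):

```python
import numpy as np

def pick_peak_frames(motion, threshold):
    """motion: 1-D array of per-frame motion scores. Returns one frame index per activity region."""
    active = np.append(motion > threshold, False)   # sentinel closes a trailing region
    peaks, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                                              # region begins
        elif not a and start is not None:
            peaks.append(start + int(np.argmax(motion[start:i])))  # frame at peak activity
            start = None
    return peaks
```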
Step 2 & 3: Segment selected frames, classify segments
(segment)
(classify)
Meta SAM segments each frame into regions. ResNet-50 (pretrained on ImageNet) classifies each segment. Only segments classified as "bird" with high confidence are kept.
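A sketch of steps 2 and 3 with the off-the-shelf APIs (segment-anything's automatic mask generator and torchvision's pretrained ResNet-50). The checkpoint path, confidence threshold, and the small set of bird labels are placeholders, not the exact filter used:

```python
import torch
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from torchvision.models import resnet50, ResNet50_Weights

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint
mask_gen = SamAutomaticMaskGenerator(sam)

weights = ResNet50_Weights.IMAGENET1K_V2
classifier = resnet50(weights=weights).eval()
preprocess = weights.transforms()
labels = weights.meta["categories"]
BIRD_LABELS = {"magpie", "jay", "kite", "goldfinch"}  # illustrative subset of ImageNet's bird classes

def bird_segments(frame_rgb, min_conf=0.6):
    """Segment a frame with SAM, classify each segment crop, keep confident bird hits."""
    keep = []
    for m in mask_gen.generate(frame_rgb):
        x, y, w, h = (int(v) for v in m["bbox"])          # segment bounding box (XYWH)
        crop = frame_rgb[y:y + h, x:x + w]
        if crop.size == 0:
            continue
        batch = preprocess(torch.from_numpy(crop).permute(2, 0, 1)).unsqueeze(0)
        with torch.no_grad():
            probs = classifier(batch).softmax(dim=1)[0]
        conf, idx = probs.max(dim=0)
        if float(conf) >= min_conf and labels[int(idx)] in BIRD_LABELS:
            keep.append((m["bbox"], labels[int(idx)], float(conf)))
    return keep
```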
new approach (2025): h.264 already computed this
Stack: Docker, FastAPI, PostgreSQL, React, ffmpeg, mvextractor, PyAV.
While debugging the pipeline above, I kept hitting out-of-memory errors: uncompressed frames exhausted RAM and swap, and the 256 MB video file expanded to gigabytes in memory. I knew H.264 used motion vectors but hadn't considered extracting them directly.
H.264 doesn't store frames as RGB matrices. It uses two main concepts to reduce storage — one of which I use to solve the problem:
- I-frames (keyframes) — full image as DCT coefficients
- P-frames (predicted) — motion vectors + residuals, referencing the previous I or P frame
Decoding starts at an I-frame, then each P-frame applies its motion vectors and residuals to reconstruct the next.
motion vectors (temporal compression)
Instead of storing pixels, P-frames store how 16×16 blocks moved from the previous frame. A vector like (3, -2) means "copy this block from 3 pixels right, 2 pixels up." The encoder searches for the best matching block and stores only the displacement.
This is why motion detection data is already embedded in the file — we just need to extract it.
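This is where mvextractor comes in: it hands back the per-block vectors for each frame. A sketch that turns them into a per-frame motion score (the (N, 10) column layout follows the library's documentation; verify against your version):

```python
import numpy as np
from mvextractor.videocap import VideoCap   # pip package: motion-vector-extractor

def motion_per_frame(path):
    """Per-frame motion score computed from the H.264 motion vectors."""
    cap = VideoCap()
    if not cap.open(path):
        raise RuntimeError(f"could not open {path}")
    scores = []
    while True:
        ok, _frame, mvs, frame_type, _ts = cap.read()
        if not ok:
            break
        if frame_type == "I" or len(mvs) == 0:
            scores.append(0.0)                   # I-frames carry no motion vectors
            continue
        dx = mvs[:, 7] / mvs[:, 9]               # motion_x / motion_scale
        dy = mvs[:, 8] / mvs[:, 9]               # motion_y / motion_scale
        scores.append(float(np.mean(np.hypot(dx, dy))))  # mean block displacement in pixels
    cap.release()
    return scores
```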
discrete cosine transform (spatial compression)
I-frames and P-frame residuals are compressed using the Discrete Cosine Transform. Each 8×8 block is represented as a weighted sum of 64 basis patterns. After quantization, the high-frequency coefficients are typically near zero and can be discarded; the resulting runs of zeros compress well with run-length encoding.
Diagram: 64 fixed basis patterns (the same for all blocks) weighted by 64 coefficients; after quantization only ~18 non-zero coefficients are stored.
Reconstructing an 8×8 block of the letter "A" from DCT coefficients (left: result so far; center: weighted pattern being added; right: basis pattern × coefficient).
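The reconstruction above is just an inverse 2-D DCT over a mostly-zero coefficient matrix. A tiny sketch with SciPy (the coefficient values are made up to illustrate a sparse, quantized block):

```python
import numpy as np
from scipy.fft import idctn

# Hypothetical sparse block: DC term plus a few low-frequency AC terms,
# mimicking what survives quantization.
coeffs = np.zeros((8, 8))
coeffs[0, 0] = 52.0     # DC: average brightness of the block
coeffs[0, 1] = -18.0    # low-frequency horizontal detail
coeffs[1, 0] = 11.0     # low-frequency vertical detail
coeffs[2, 1] = -6.0

block = idctn(coeffs, norm="ortho")  # inverse 2-D DCT: weighted sum of the 64 basis patterns
print(block.round(1))                # 8x8 pixel block reconstructed from 4 coefficients
```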
what's actually stored in the file
I extracted the motion vectors and residuals from the compressed stream and rendered them as videos for visualization:
motion vectors
residuals
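For a quick look at the vectors without custom extraction code, ffmpeg can draw them directly onto the decoded frames (a generic overlay, not the exact rendering shown above; paths are placeholders):

```python
import subprocess

# Overlay the exported motion vectors from P- and B-frame predictions onto the decoded video.
subprocess.run([
    "ffmpeg", "-flags2", "+export_mvs",   # ask the decoder to export motion vectors
    "-i", "input.mp4",
    "-vf", "codecview=mv=pf+bf+bb",       # draw the vectors on each frame
    "mv_overlay.mp4",
], check=True)
```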
advantages
- no GPU required, runs entirely on CPU
- parallelizable: one video per core
- full resolution, complete motion vectors, no temporal or spatial sampling
result
The target machine has a 4-core CPU and no GPU. Frame differencing would run slower than realtime — processing 1TB would take months.
| | frame differencing | MV extraction |
|---|---|---|
| dev machine (16-core, GPU) | 17× realtime (5 days/TB) | 168× realtime (12 hours/TB) |
| target machine (4-core, no GPU) | 0.7× realtime (4 months/TB) | 52× realtime (1.5 days/TB) |
context
Built in cooperation with the University of Vienna. The Konrad Lorenz Research Station studies corvid behavior — specifically ravens and their social dynamics.
Behavioral science is outside my background. The signal processing problem was interesting, and it felt good to build something that might support actual research.