Read tail of big logfile

What is an efficient (memory,…) way in Crystal to read lines at the end of a large log file?

  1. open file
  2. seek to the end
  3. seek backwards one character at a time, looking for the ‘\n’ (line end) character, until you have found the desired number of lines (and stop if you reach the beginning of the file first)
  4. read the lines from that position to the end of the file

This is the simplest approach. It is probably slow (the access pattern resembles random IO), but it uses practically no memory at all.

This is basically what @pfischer described.

pp tail_file("file.log", 5)
pp tail_file("file.log", 10)

def tail_file(filename : String, max_lines : Int)
  File.open filename do |f|
    newline_count = 0
    # Look up the size once instead of asking for it again on every byte
    last_position = f.size - 1
    last_position.downto(0) do |position|
      f.seek position
      # Count both the end of the file and a newline as a newline
      if position == last_position || f.read_byte === '\n'
        newline_count += 1

        # We use `>` vs `>=` because we're counting newlines, not lines of text.
        # After the break, the file position is just past the newline we stopped on.
        break if newline_count > max_lines
      end
    end

    # If we didn't get enough lines to fill the cap, the file doesn't have that
    # many lines, so we just go back to the beginning
    if newline_count <= max_lines
      f.seek 0
    end

    lines = Array(String).new(newline_count + 1)
    while line = f.gets
      lines << line
    end

    lines
  end
end

Regarding the note about performance, if you’re using a local SSD, it’s probably fine. I haven’t tested it on something like an NFS partition on spinning metal, though, so YMMV.

I think the kernel buffer cache should absorb most of these sequential reads, even though they run backwards, so I wouldn’t expect the storage technology to play such a big role. You’ll likely be reading only from the last block of the file.

Without having tested this, I’d expect the sheer number of syscalls to have a more relevant impact on efficiency. Of course a syscall is orders of magnitude faster than a disk access, but this is something we can address.

My recommendation for an efficient tail read would be to load a buffer into memory and then search that buffer for line breaks.

Untested “pseudo” code:

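File.open(path) do |file|
  buffer_data = uninitialized UInt8[4096]
  buffer = buffer_data.to_slice
  # Seek back one buffer length from the end, or to the start if the file is smaller
  file.seek(-(file.size < buffer.bytesize ? file.size : buffer.bytesize), IO::Seek::End)
  bytes = file.read(buffer)
  buffer = buffer[0, bytes]
  # Ignore a trailing newline so the last line is not reported as empty
  last_pos = buffer.bytesize - 1
  last_pos -= 1 if last_pos >= 0 && buffer[last_pos] == '\n'.ord
  # Walk backwards through the buffer; lines come out last to first.
  # Whatever precedes the first newline is skipped, since it may be a partial line.
  while last_pos >= 0 && (pos = buffer.rindex('\n'.ord.to_u8!, last_pos))
    line = String.new(buffer[pos + 1..last_pos])
    last_pos = pos - 1
  end
end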

I chose 4KB for the buffer, YMMV. If you can get away with it, make it big enough that everything you need to read can fit into it. Otherwise you’ll have to iterate. Shouldn’t be too hard, but I’m skipping that for conciseness.
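For the case where one buffer is not enough, here is a minimal sketch of the iteration: read fixed-size chunks from the end towards the start, count newlines, and stop once enough have been seen. The names `tail_start_offset` and `tail_lines` and the `chunk_size` parameter are made up for illustration, and this is only one way to do it, not a benchmarked implementation.

```crystal
# Hypothetical helper: returns the byte offset at which the last `max_lines`
# lines of `file` begin, scanning backwards in `chunk_size`-byte chunks.
def tail_start_offset(file : File, max_lines : Int32, chunk_size : Int32) : Int64
  size = file.size
  chunk = Bytes.new(chunk_size)
  newlines = 0
  pos = size
  while pos > 0
    # Step back by one chunk, or by whatever is left before the start
    len = Math.min(chunk_size.to_i64, pos).to_i32
    pos -= len
    file.seek(pos)
    block = chunk[0, len]
    file.read_fully(block)
    (len - 1).downto(0) do |i|
      next unless block[i] == 10_u8 # 10 is '\n'
      # A newline as the very last byte terminates the last line; skip it
      next if pos + i == size - 1
      newlines += 1
      # The wanted lines start right after this newline
      return pos + i + 1 if newlines == max_lines
    end
  end
  # Fewer lines than requested: start from the beginning
  0_i64
end

# Hypothetical helper: returns up to `max_lines` trailing lines of `path`.
def tail_lines(path : String, max_lines : Int32, chunk_size : Int32 = 4096) : Array(String)
  return [] of String if max_lines <= 0
  File.open(path) do |file|
    file.seek(tail_start_offset(file, max_lines, chunk_size))
    file.gets_to_end.lines
  end
end
```

Usage would look like `pp tail_lines("file.log", 5)`. Memory use is bounded by the chunk size plus the lines that are actually returned.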

Oh neat, I made a shard for this sort of problem a bit ago.

The idea is to index the file concurrently by newlines and then do a pread to read arbitrary lines anywhere in the file.

So you could do something like:

# create an index if you want to do more than one lookup
flashlight index <path_to_file> <path_to_index>
# find the number of lines in the file
flashlight size -f <path_to_index>
# lookup the lines
flashlight lookup -f <path_to_index> <start_line> <number_of_lines_to_read>
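To illustrate the general idea of a newline index (this is not flashlight's actual code, and it builds the index sequentially rather than concurrently), a minimal Crystal sketch could look like the following. The names `build_line_index` and `read_lines` are hypothetical; `File#read_at` is the standard-library call that reads a section of a file at a given offset.

```crystal
# Hypothetical helper: returns the byte offset at which every line of `path`
# starts, plus a final entry equal to the total byte length, so that line n
# spans index[n]...index[n + 1].
def build_line_index(path : String) : Array(Int64)
  offsets = [0_i64]
  File.open(path) do |file|
    # chomp: false keeps the newline so that bytesize matches the bytes on disk
    file.each_line(chomp: false) do |line|
      offsets << offsets.last + line.bytesize
    end
  end
  offsets
end

# Hypothetical helper: reads `count` lines starting at the zero-based
# `start_line`, touching only the bytes that belong to those lines.
def read_lines(path : String, index : Array(Int64), start_line : Int32, count : Int32) : Array(String)
  line_total = index.size - 1
  first = Math.max(0, Math.min(start_line, line_total))
  last = Math.min(Math.max(first + count, first), line_total)
  return [] of String if first == last
  File.open(path) do |file|
    file.read_at(index[first], index[last] - index[first]) do |io|
      io.gets_to_end.lines
    end
  end
end
```

With such an index, the last `n` lines would be `read_lines(path, index, index.size - 1 - n, n)`. The index costs one `Int64` per line, so this only pays off when you do more than one lookup.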

I’m not currently messing with it too much because I ended up porting it to C and making a desktop reader using the concept.

I’d be interested to hear whether this solves the issue.

Tested “real” code which works:

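File.open("log.log") do |file|
  buffer_data = uninitialized UInt8[4096]
  buffer = buffer_data.to_slice
  # Seek back one buffer length from the end, or to the start of a smaller file
  file.seek(-(file.size < buffer.bytesize ? file.size : buffer.bytesize), IO::Seek::End)
  bytes = file.read(buffer)
  buffer = buffer[0, bytes]

  # Skip the first (probably partial) line by starting just after the first newline
  last_pos = (buffer.index('\n'.ord) || 0) + 1
  # Print each complete line, oldest first; text after the final newline is not printed
  while pos = buffer.index('\n'.ord, last_pos)
    puts String.new(buffer[last_pos..pos])
    last_pos = pos + 1
  end
end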

Thanks for the library! I’m curious about the motives behind porting the Crystal version to C since, for me at least, Crystal is optimized and low-level enough for 99% of the tasks for which one would previously have chosen C, while being friendlier to the brain. :grin:

Fair question.
I can say the C version is definitely much faster, although that could be attributed to optimizations that make fewer passes over the index. Mostly, though, the C version has control over both concurrency and threading, which Crystal abstracts away.

Also, my goal was to build a desktop app over the library, but I find the Crystal ecosystem to be lacking or frustrating. I’m for sure guilty of this too, but there are too many untested libraries that have partial functionality.

Although C development can be painful and slow, the C feedback loop is fast, and the ecosystem is just a little more reliable in this regard, so it made sense.

I think tests need to be a much more critical part of the ecosystem if I’m to reach for Crystal over other languages though.

Constructive feedback like yours is essential to Crystal’s improvement. Thanks for the detailed answer!
