Write CSV straight to disk as GZIP'd file?

Is there a way to GZIP a CSV file while writing it? I didn’t understand how to combine the IO docs, the Gzip docs, and the CSV docs.

require "csv"
require "gzip"
# crystal run --release convert.cr

input_file_name = "data_before.csv"

p "converting file #{input_file_name}"

loop_count = 1
counter = 0

output_file_name = "data_after.csv"
p ({:processing => {:output_file_name => output_file_name}})
File.open(input_file_name) do |infile|
  reader = CSV.new(infile, headers: true)

  index_cust_id = reader.indices["cust_id"]

  File.open(output_file_name, "w") do |outfile|
    CSV.build(outfile) do |writer|
      writer.row reader.headers

      # pp reader.headers
      # pp reader.indices
      reader.each do |entry|
        (1..loop_count).each do |loop_index|
          row = entry.row.to_a
          # pp row
          row[index_cust_id] = "#{entry.row[index_cust_id]}#{loop_index}"
          writer.row row

          counter += 1

          if counter % 2000000 == 0
            p "[#{date}] lines: #{counter.humanize}"
            # p `ps -p #{Process.pid} -o rss`
            # p `ls -lh #{output_file_name}`
            # p `ps -caxm -orss= | awk '{ sum += $1 } END { print "Resident Set Size: " sum/1024 " MiB" }'`
            # p GC.stats
            # GC.collect
            outfile.flush
          end
        end
      end
    end
  end
end

p `ls -lh *.csv`

p "finished!"

My key constraints are:

  • the input and output CSV files will both be larger than my RAM
  • the output file will come close to filling up my disk
  • AWS says that S3 will upload the file faster if it’s GZIP’d
  • AWS says that Redshift will read the file faster if it’s GZIP’d in S3

There’s another overload of CSV.build that takes an IO as its first argument (this is generally true for every method in the standard library that builds something into a string).

So, the way to do it would be to:

  1. Open a file to get an IO for it, using File.open(..., "w")
  2. Open a Gzip::Writer and pass it the IO from the previous point
  3. Call CSV.build and pass it the IO from the previous point

Right now I don’t have time to work out full code for this, but let us know if you get stuck; I can help later, or others probably will too.
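A minimal, untested sketch of those three nested steps (assuming Crystal 0.35+, where the gzip writer lives under Compress::Gzip, and using made-up file names and rows) could look like:

require "csv"
require "compress/gzip"

File.open("data_after.csv.gz", "w") do |file|
  Compress::Gzip::Writer.open(file) do |gzip|
    CSV.build(gzip) do |csv|
      csv.row ["id", "name"]
      csv.row ["1", "example"]
    end
  end
end

Each layer only sees an IO, so the CSV rows are compressed on the fly and written straight to disk without the whole file ever being held in memory.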

This is what I’ve got (minus my business logic). If I don’t update this post, assume it worked in 0.35.1:

require "csv"
require "compress/gzip"
# time crystal run --release convert_gzip.cr

input_file_name = "data_before.csv"

p "converting file #{input_file_name}"

loop_count = 1
counter = 0

output_file_name = "data_after.csv.gz"

# File.open(output_file_name = "data_after.csv", "w") do |outfile|
Compress::Gzip::Writer.open(output_file_name) do |outfile|
  File.open(input_file_name) do |infile|
    reader = CSV.new(infile, headers: true)

    index_cust_id = reader.indices["cust_id"]

    CSV.build(outfile) do |writer|
      writer.row reader.headers
      reader.each do |entry|
        (1..loop_count).each do |loop_index|
          row = entry.row.to_a

          row[index_cust_id] = "#{entry.row[index_cust_id]}#{loop_index}"

          writer.row row
          counter += 1

          # pp ({:line => __LINE__})
          if counter % 2000000 == 0
            p "lines: #{counter.humanize}"
            # p `ps -p #{Process.pid} -o rss`
            p `ls -lh #{output_file_name}`
            p `ps -caxm -orss= | awk '{ sum += $1 } END { print "Resident Set Size: " sum/1024 " MiB" }'`
            # p GC.stats
            # GC.collect
            outfile.flush
          end
        end
      end
    end
  end

  outfile.flush
end

pp ({:line => __LINE__})
p ({:counter => counter.humanize})

p `ls -lh *.csv *.csv.gz`

p "finished!"

thanks!


FYI, for ultimate performance, and per the discussion below,

instead of

if counter % 2000000 == 0

do

if counter.remainder(2000000) == 0

or

if counter.divisible_by?(2000000)


I would like to skip the filesystem and upload the GZIP’d output straight to S3, but I’m not sure how to combine IO.new and upload = uploader.upload(bucket: ENV["AWS_BUCKET"], object: uploaded_filename, io: infile) with the GZIP writer to get the desired effect.

Yeah, it’s really hard to write to an IO that will write to an HTTP request.

I think what you’ll need to do here is use a pipe:

reader, writer = IO.pipe
spawn do
  # Here you write the gzipped csv file to the writer
end

# Here you pass the `reader` IO to AWS
uploader.upload(bucket: ENV["AWS_BUCKET"], object: uploaded_filename, io: reader)

Note: I didn’t try it and I’m not sure it will work.
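For reference, a fuller sketch of that pipe idea (untested, with made-up rows, and copying to a local file at the end just to show the wiring; in the real program the reader end would go to the uploader.upload call instead):

require "csv"
require "compress/gzip"

reader, writer = IO.pipe

spawn do
  # Write the gzipped CSV into the write end of the pipe.
  Compress::Gzip::Writer.open(writer) do |gzip|
    CSV.build(gzip) do |csv|
      csv.row ["cust_id", "name"]
      csv.row ["42_1", "example"]
    end
  end
  # Close the write end so the reading side sees EOF once the data is done.
  writer.close
end

# Stand-in for the upload: drain the read end somewhere.
File.open("output.csv.gz", "w") do |file|
  IO.copy(reader, file)
end

Note that awscr-s3’s FileUploader#upload calls io.size, which a pipe’s reader end doesn’t have, so handing it the reader directly runs into the errors shown below.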

I had to define them as io_reader, io_writer = IO.pipe to not conflict with the CSV reader/writer.

In convert_gzip.cr:55:26

 55 | Compress::Gzip::Writer.open(io_writer) do |outfile|
                             ^---
Error: instantiating 'Compress::Gzip::Writer.class#open(IO::FileDescriptor)'


In convert_gzip.cr:55:26

 55 | Compress::Gzip::Writer.open(io_writer) do |outfile|
                             ^---
Error: instantiating 'Compress::Gzip::Writer.class#open(IO::FileDescriptor)'


In convert_gzip.cr:95:23

 95 | upload = uploader.upload(bucket: ENV["AWS_BUCKET"], object: uploaded_filename, io: io_reader)
                        ^-----
Error: instantiating 'Awscr::S3::FileUploader#upload()'


In lib/awscr-s3/src/awscr-s3/file_uploader.cr:32:13

 32 | if io.size < UPLOAD_THRESHOLD
            ^---
Error: undefined method 'size' for IO::FileDescriptor

IO::FileDescriptor trace:

  lib/awscr-s3/src/awscr-s3/file_uploader.cr:32

          if io.size < UPLOAD_THRESHOLD

crystal run --release convert_gzip.cr --error-trace  0.71s user 0.27s system 95% cpu 1.017 total

I guess the S3 library expects the size to be known up front. I don’t know enough about IO to do something like chunk every 5MB into a "part" for an S3 multipart upload.

I found something: https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header

It might not be supported by the Crystal library you are using, so you might have to add that, or just write to disk and upload the file later.

Here is my streaming upload of a generated, GZIP’d CSV:

require "csv"
require "compress/gzip"
require "awscr-s3"
# time crystal run --release app.cr

client = Awscr::S3::Client.new(
  region: ENV["AWS_REGION"],
  aws_access_key: ENV["AWS_KEY"],
  aws_secret_key: ENV["AWS_SECRET"]
)
uploader = Awscr::S3::FileUploader.new(client)

input_file_name = "input.csv"
uploaded_filename = "output.csv.gz"

p "converting file #{input_file_name}"

status_interval = 2000000
size_check_interval = 10000

# S3 multipart uploads allow at most 10,000 parts,
# so part_size * 10,000 is the largest object this can produce.
# part_size = 100000000 # 100MB parts for a 1TB ceiling
# part_size = 50000000 # 50MB parts for a 500GB ceiling
part_size = 25000000 # 25MB parts for a 250GB ceiling

counter = 0

multipart_upload = client.start_multipart_upload(
  bucket: ENV["AWS_BUCKET"],
  object: uploaded_filename)
part_number = 1
parts = Array(Awscr::S3::Response::UploadPartOutput).new

io = IO::Memory.new

Compress::Gzip::Writer.open(io) do |outfile|
  File.open(input_file_name) do |infile|
    csv_reader = CSV.new(infile, headers: true)

    index_cust_id = csv_reader.indices["cust_id"]

    CSV.build(outfile) do |writer|
      writer.row csv_reader.headers
      csv_reader.each do |entry|
        row = entry.row.to_a

        # do business stuff
        row[index_cust_id] = "#{entry.row[index_cust_id]}_b"

        writer.row row
        counter += 1

        if counter.divisible_by?(size_check_interval)
          if io.bytesize > part_size
            pp ({:line => __LINE__, :"io.bytesize" => io.bytesize.humanize, :part_number => part_number})
            resp = client.upload_part(
              bucket: ENV["AWS_BUCKET"],
              object: uploaded_filename,
              part: io.to_slice,
              part_number: part_number,
              upload_id: multipart_upload.upload_id)
            # pp resp
            parts << resp
            # empty the IO because the whole point
            # of stream is to conserve RAM/disk
            io.clear
            part_number += 1
          end
        end
        if counter.divisible_by?(status_interval)
          p "[#{date}] lines: #{counter.humanize}"
          # p `ps -p #{Process.pid} -o rss`
          p `ls -lh #{output_file_name}`
          p `ps -caxm -orss= | awk '{ sum += $1 } END { print "Resident Set Size: " sum/1024 " MiB" }'`
        end
      end
    end
  end
end

# if we have a final part, upload it
if io.bytesize > 0
  pp ({:line => __LINE__, :"io.bytesize" => io.bytesize.humanize, :part_number => part_number, :upload_id => multipart_upload.upload_id})

  resp = client.upload_part(
    bucket: ENV["AWS_BUCKET"],
    object: uploaded_filename,
    upload_id: multipart_upload.upload_id,
    part_number: part_number,
    part: io.to_slice)
  parts << resp
  io.clear
end

# tell AWS to finalize the parts
client.complete_multipart_upload(
  bucket: ENV["AWS_BUCKET"],
  object: uploaded_filename,
  upload_id: multipart_upload.upload_id,
  parts: parts.sort_by(&.part_number)
)

@carcinocron thank you for sharing this example!