
How to Import Millions of Records via ActiveRecord Within Minutes, Not Hours


In today's tutorial, I'll show you how to dramatically speed up an ActiveRecord import script (in my case, from roughly two hours down to three minutes). My solution is better than the usual workarounds because it doesn't rely on any SQL hacks, so you retain data integrity by running everything through ActiveRecord as normal.

At work, I was assigned the task of importing millions of rows from a 300MB CSV file into a Rails app. The rake task takes a FILE environment variable and processes the file with ActiveRecord.

FILE=/tmp/big_file.csv rake data:import

And the rake task would call my ProductsImporter.run method, summarised below:

class ProductsImporter

  def self.run
    # ...
    # read the CSV file and assign its rows to the `rows` variable

    ActiveRecord::Base.transaction do
      rows.each do |row|
        p = Product.find_or_initialize_by_product_id(row[:product_id])
        p.assign_attributes(row)
        p.save!
      end
    end
  end
end
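
For reference, the rake task itself is only a thin wrapper around the importer. Here is a minimal sketch of what it could look like; apart from the data:import name and the FILE variable from the invocation above, the details are my assumptions:

# lib/tasks/data.rake -- a minimal sketch, not the original task
namespace :data do
  desc 'Import records from the CSV file given in FILE'
  task import: :environment do
    abort 'Usage: FILE=/path/to/file.csv rake data:import' if ENV['FILE'].nil?
    ProductsImporter.run
  end
end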

I soon bumped into a performance issue: because every save happens inside one huge ActiveRecord::Base.transaction, the objects pile up faster than the garbage collector can release them. The script took ~2 hours to complete, which was unacceptable by my standards.
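
If you want to observe this yourself, you can log GC activity as the import runs. This is purely a diagnostic sketch I'm adding for illustration; it is not part of the original importer, and the 10,000-row interval is arbitrary:

# Diagnostic variant of the import loop: print GC statistics every
# 10,000 rows. Because the surrounding transaction keeps references to
# every record saved so far, the live object count keeps growing.
rows.each_with_index do |row, i|
  p = Product.find_or_initialize_by_product_id(row[:product_id])
  p.assign_attributes(row)
  p.save!
  puts "#{i} rows, GC runs: #{GC.count}, stats: #{GC.stat.inspect}" if (i % 10_000).zero?
end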

There are various workarounds on the net, such as the activerecord-import gem, which generates bulk SQL INSERT statements. However, I do not like these SQL solutions: my models have many callbacks and data integrity is very important.
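
For context, the bulk-insert route I rejected looks roughly like this. This is only a sketch based on my understanding of activerecord-import, the column names are made up, and the point is that per-record callbacks never fire:

# Rough sketch of the bulk-insert approach (not what I ended up doing).
# activerecord-import compiles the rows into one multi-row SQL INSERT,
# so my model callbacks are skipped for every imported record.
require 'activerecord-import'

columns = [:product_id, :name, :price]   # hypothetical columns
values  = rows.map { |row| row.values_at(:product_id, :name, :price) }

Product.import columns, values, validate: false

So I came up with an alternative solution instead: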

  • Split big_file.csv into smaller files
  • Loop through these smaller chunks and run the rake task on each one

Now you may wonder why this solution works better. It is because we run many small processes, so no single Rails process ever builds up enough garbage to bog down the GC, and as soon as a process finishes, its memory is released back to the operating system. Let's code this up as a shell script; I chose bash for the example (please adapt it to fit your purpose):

#! /bin/bash

NUMBER_OF_SPLIT_LINES=50000
SPLIT_FILE_PREFIX='small_'
BIG_FILE_PATH=/tmp/big_file.csv
SPLIT_FILES=/tmp/$SPLIT_FILE_PREFIX*

temp_home () {
  cd /tmp
}

rails_app_home () {
  cd /your_app
}

split_big_csv_into_small_chunks () {
  echo "Split $BIG_FILE_PATH file into small chunks with size $NUMBER_OF_SPLIT_LINES lines..."
  temp_home && split -l $NUMBER_OF_SPLIT_LINES $BIG_FILE_PATH $SPLIT_FILE_PREFIX
}

process_split_files () {
  for f in $SPLIT_FILES
  do
    echo "Processing $f file..."
    # nohup keeps the rake task alive if the SSH session drops;
    # the trailing & runs each chunk in the background, in parallel
    rails_app_home && FILE=$f nohup rake data:import &
  done
}

split_big_csv_into_small_chunks
process_split_files

Let's go through the script. I use the split UNIX command to break the big file into many smaller files of 50,000 lines each. Then I loop through these small files and pass each one to the rake task via the FILE variable. I use nohup to make sure the tasks are not killed by SIGHUP (in my case, I run this script on a remote host over SSH). Note that each rake process runs in the background, so the chunks are processed in parallel.

If you want a more performant version of split, I highly recommend giving csv-split a try; you can download it at https://github.com/PerformanceHorizonGroup/csv-split.

Now, how many minutes do you think our bash script takes to finish? Three minutes, no kidding! That is a massive gain compared to 2 hours.

Ruby/Rails is not the best at dealing with huge chunks of memory. So before deciding to try some SQL trickery, be pragmatic and let UNIX do the heavy lifting: spawn as many processes as your machine can handle and you'll be surprised by how much you gain. Good luck!
