In today's tutorial, I'll show you how to dramatically speed up an ActiveRecord import script (in my case, from roughly two hours down to three minutes). My solution is better than the alternatives because it doesn't rely on any SQL hacks, so you retain data integrity by running everything through ActiveRecord as usual.
At work, I was assigned the task of importing millions of rows from a 300MB CSV file into a Rails app. The rake task takes a FILE argument and processes it with ActiveRecord.
The rake task calls my `ProductsImporter.run` method.
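The original code isn't shown here, but a minimal sketch of what such an importer might look like follows. The class name `ProductsImporter` comes from the post; the batch size, the `model:` keyword, and the per-batch transaction shape are my assumptions:

```ruby
require "csv"

# Hypothetical sketch of ProductsImporter.run (a reconstruction, not the
# author's original code).
class ProductsImporter
  BATCH_SIZE = 1_000 # assumed batch size

  # `model` is injected so the sketch is testable without Rails; any
  # ActiveRecord::Base subclass responds to .transaction and .create!.
  def self.run(file, model:)
    CSV.foreach(file, headers: true).each_slice(BATCH_SIZE) do |batch|
      # One transaction per batch rather than one giant transaction:
      # smaller transactions keep fewer objects alive for the GC.
      model.transaction do
        batch.each { |row| model.create!(row.to_h) }
      end
    end
  end
end
```

Because every row goes through `create!`, model validations and callbacks still fire, which is the whole point of avoiding raw SQL here.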
I soon ran into a performance issue: objects retained inside one long-running ActiveRecord transaction could not be garbage-collected effectively. The script took around two hours to complete, which is unacceptable by my standards.
There are various workarounds on the net, such as the ar_import gem, which uses raw SQL INSERTs. However, I don't like these SQL solutions: my models have many callbacks, and data integrity is very important to me. So I came up with an alternative:
- Split the big_file.csv into smaller files
- Loop through these smaller chunks and run the rake task on each one as a separate process
So now you may wonder why this approach works better. It is because we run many small processes, so Rails never has to garbage-collect a huge heap; once a process exits, all of its memory is released back to the OS immediately. Now, let's code this up as a shell script. I chose bash as the example (please adapt it to fit your purpose):
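A sketch of the chunk-and-spawn script described below, wrapped in a function for reuse. The rake task name `products:import` and the `FILE=` argument are assumptions; substitute your own task:

```shell
#!/usr/bin/env bash

# Split a big CSV into chunks and import each chunk in its own
# background process (hypothetical task name products:import).
import_in_chunks() {
  local big_file="$1"
  local lines_per_chunk="${2:-50000}"
  local prefix="chunk_"

  # Split the big CSV into smaller files of $lines_per_chunk lines each.
  split -l "$lines_per_chunk" "$big_file" "$prefix"

  # Run the rake task on each chunk as a background process.
  # nohup keeps the jobs running even if SSH sends SIGHUP on disconnect.
  for chunk in "$prefix"[a-z][a-z]; do
    nohup rake products:import FILE="$chunk" > "$chunk.log" 2>&1 &
  done

  # Wait for every background import to finish.
  wait
}
```

Each chunk's output goes to its own `.log` file, which makes it easy to spot a chunk that failed and re-run just that file.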
Let's walk through the script. I use the `split` UNIX command to break the big file into many smaller files of 50,000 lines each. Then I loop through these small files and pass each one to the rake task. I use `nohup` to make sure the jobs are not killed by SIGHUP (in my case, I run this script on a remote host over SSH). Note that each process runs in the background.
If you want a more performant version of `split`, I highly recommend giving `csv-split` a try; you can download it at https://github.com/PerformanceHorizonGroup/csv-split.
Now, how many minutes do you think our bash script takes to finish? Three minutes, no kidding! That is a massive gain compared to two hours.
Ruby and Rails are not the best at dealing with huge chunks of memory. So before reaching for a raw-SQL approach, be pragmatic and lean on UNIX: spawn as many processes as your machine can handle, and you will be surprised by how much of a gain you can achieve. Good luck!