This feature is scheduled for: Rails v2.3
ActiveRecord got a little batch-help today with the addition of ActiveRecord::Base#find_each and ActiveRecord::Base#find_in_batches. The former lets you iterate over all the records in cursor-like fashion (only retrieving a set number of records at a time to avoid cramming too much into memory):
1 2 3 |
Article.find_each { |a| ... } # => iterate over all articles, in chunks of 1000 (the default) Article.find_each(:conditions => { :published => true }, :batch_size => 100 ) { |a| ... } # iterate over published articles in chunks of 100 |
You’re not exposed to any of the chunking logic – all you need to do is iterate over each record and just trust that they’re only being retrieved in manageable groups.
find_in_batches performs a similar function, except that it hands back each chunk array directly instead of just a stream of individual records:
1 2 3 4 |
Article.find_in_batches { |articles| articles.each { |a| ... } } # => articles is array of size 1000 Article.find_in_batches(batch_size => 100 ) { |articles| articles.each { |a| ... } } # iterate over all articles in chunks of 100 |
find_in_batches is also kind enough to observe good scoping practices:
1 2 3 4 5 6 |
class Article < ActiveRecord::Base named_scope :published, :conditions => { :published => true } end Article.published.find_in_batches(:batch_size => 100 ) { |articles| ... } # iterate over published articles in chunks of 100 |
One quick caveat exists: you can’t specify :order or :limit in the options to find_each or find_in_batches as those values are used in the internal looping logic.
Batched finds are best used when you have a potentially large dataset and need to iterate through all rows. If done using a normal find the full result-set will be loaded into memory and could cause problems. With batched finds you can be sure that only 1000 * (each result-object size) will be loaded into memory.
tags: ruby, rubyonrails

I suspect that should say “One quick caveat exists: you can’t specify :offset or :limit…”
Will Paginate’s (http://wiki.github.com/mislav/will_paginate) paginated_find will let you do almost exactly this.
yeah, exactly what i need, hidden complexity :) Done this by hand several times, not so dry…
Why doesn’t it use real database pointers?
Hey Pete, nope – it’s
:orderand:limitthat you can’t specify as they’re overridden by the internal iteration logic.Andrew, you can get this with will paginate, but you have to manage the looping and page increment logic yourself. This pushes that all into the background.
Great news. Being a situation you don’t often see in development, you will have to deal with large datasets in production (you hope anyways) and when you deal with really large datasets you often run into problems you don’t see in smaller ones. This would be great for games where you have to run through periodic loops, for example to increment resources, or game play turns, and have hundreds of thousands of rows to loop through.
Shaun, please do work on making this work with database cursors. That would be a great addition. Looking forward to the patch!
Shaun, did you forget the 37 signal motto, make a crappy more simple version. Then charge for it!!!!!
@Shaun,@carlity,@DHH or just take an existing project and rebrand, Look we made online chat rooms and we charge for it!!!! Thanks for the idea irc!!!!
@carlity and @paulc – stop being asshats. The marketplace has clearly voted and finds that 37s provides a lot of value in their solutions.
If you’re using MySQL and your data set is more than a few hundred thousand rows, you might find that :limit slows to a crawl in that range. Apparently MySQL starts from zero and parses every row until the :limited mark with each query, so you get an increasingly large seek time, even though it’s speedy at first. Because nothing makes your day quite like watching your urgent migration cruise along when you leave for the night, only to find it at 30% the next morning (A True Story).
Two alternatives which give a potentially huge performance boost:
1. Use ActiveRecord::Base.connection.select_values to get an array of all the IDs (or in batches) and then #find each one in turn. It’ll cost a few hundred megs of RAM (or more) for all those integers, but CPU is typically your bottleneck with migrations (especially on a 32 GB production box), so that’s okay.
2. Find your starting point and end point, and iterate through your finds like this: previous_id = 0 while previous_id < final_id model = Model.find :first, :conditions => [‘SELECT model WHERE id > ?’, previous_id] model.do_stuff! previous_id = model.id end
I’ve found lots of success with the latter one (you can start and end at specific points by modifying previous_id & final_id or providing different conditions, so it can be fairly flexible) Hmm.. maybe I should wrap it into its own #each variant somehow?
@MikeLaurence: It’s not :limit that slows it down. It’s :offset. MySQL processes every row in the table until it gets to :offset at which point it finally starts returning the rows ul to :limit.
The best current solution without using a cursor specific to each db adapter is:
Model.all(:conditions => {:indexed_column => offset..(offset+limit-1)}, :order => :indexed_column)
That should provide the highest performance on large data sets. Obviously it still needs to be wrapped inside a method like the new version of each mentioned in this article. :)
@DaveMauldin: Right… meant :offset.
You’re still going to encounter the issue I mentioned if you have many iterations, though. Large batches can lower the iteration count and mitigate the problem somewhat, but you still will get dragged down if you have a huge table (parsing 10 million rows to get to batch 10,000 of 1,000 batched records, etc.). Also, really hefty AR objects might not be doable in large batches, depending on your memory situation.
I haven’t looked into the database cursor option much – I believe the MySQL docs mentioned that only prepared statements & triggers could utilize them, which would quickly demoralize most Rails programmers :-)
Let me know if I’m wrong, though! I spent a bunch of time in PL-SQL (Oracle), and while it’s a heinous language to code in, it’s ridiculously fast for data crunching.
Follow-up – after a note from a friend and a peek at the batching code, I noticed that find_in_batches is already using the ”> last id” method I described, and not the troublesome :offset method. So batch away, MySQLers!
@Mike, thanks for the heads up, that is really good to know.
I wrote this function for our app a while back, super helpful.
A word of warning though: often when iterating though lots (hundreds of thousands+) of objects, rubys garbage collector won’t run enough, the process will get huge and then linux will silently kill it and you won’t know what happened.
My version runs GC.start every 10th time through the loop.
Good post Ryan!
This is good news. I understand it’ll save me time next time when I’ll need to iterate over 10 000 documents… Unfortunately, you can’t do everything on database level so sometimes little things like that can save the day!
A minimal change that will have a big impact on the memory consumption of each of your Rails processes – especially if you’ve got a complex app with lots of routes.
Just in case anybody’s trying this as described above—the #each method was apparently renamed to #find_each. You’ll get an “undefined method `each’” if you try #each on an AR model.
So that’s: instead of YourModel.each(:batch_size => 1000), use YourModel.find_each(...)