Monday, March 1, 2010

Ruby: Parsing Large XML Files with SAX

SAX (Simple API for XML) is an event-driven approach to parsing XML.

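A SAX parser reads the XML sequentially and fires an event for each thing it encounters, such as an opening tag, a closing tag, or a run of text. So, if you want to use SAX, you have to implement the code that handles these events. It's quite different from the DOM model, where the whole XML document is parsed and loaded into a tree in memory, which is what makes SAX a good fit for large files.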
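To make the event model concrete, here is a minimal sketch that logs the events fired for a tiny document. It uses the stream parser from REXML only because it needs no gem to be installed. The EventLogger class and the sample XML are made up for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that records every event it receives.
class EventLogger
  include REXML::StreamListener

  attr_reader :events

  def initialize
    @events = []
  end

  # Called for each opening tag.
  def tag_start(name, attrs)
    @events << [:start, name]
  end

  # Called for text content; whitespace-only text is skipped.
  def text(data)
    @events << [:text, data] unless data.strip.empty?
  end

  # Called for each closing tag.
  def tag_end(name)
    @events << [:end, name]
  end
end

xml = "<INFO><CD>1</CD><DESC>First</DESC></INFO>"
logger = EventLogger.new
REXML::Document.parse_stream(xml, logger)

# Print the events in the order the parser fired them.
logger.events.each { |event| p event }
```

Nothing is kept in memory except what the listener chooses to store, so the same pattern scales to files that would be too large to load as a tree.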

The Ruby XML Library

The Ruby standard library has a built-in XML parser called REXML (with both DOM and SAX-style interfaces), but it's terribly slow, so it's highly advisable to use libxml instead. It's a binding to the popular libxml2 library from the GNOME project, and it is released as a gem.

Installing libxml is simple. Just run the following command:

gem install libxml-ruby

The following example reads a large XML file and inserts the data into the corresponding database tables.

require 'xml/libxml'

CombinationPackInd.benchmark("Truncating look up tables and inserting new records") do
  # Truncate both lookup tables before reloading them from the XML file.
  ActiveRecord::Base.connection.execute("TRUNCATE combination_pack_ind")
  ActiveRecord::Base.connection.execute("TRUNCATE combination_prod_ind")
  ############################# ---TRUNCATING ENDS--- ########################

  class Lookuphandler
    include XML::SaxParser::Callbacks

    # Fired for every opening tag.
    def on_start_element_ns(name, attributes, prefix, uri, namespaces)
      @tag_name = name
      @value = "" # start collecting this element's text from scratch
      @main_tag_name = "" if @main_tag_name.nil?
      # Remember which lookup section we are in; any other tag leaves it unchanged.
      @main_tag_name = case name
                       when "COMBINATION_PACK_IND" then "COMBINATION_PACK_IND"
                       when "COMBINATION_PROD_IND" then "COMBINATION_PROD_IND"
                       else @main_tag_name
                       end
    end

    # Fired for every closing tag.
    def on_end_element_ns(name, prefix, uri)
      @end_element = name

      # Saving combination pack ind
      if @main_tag_name == "COMBINATION_PACK_IND"
        if @tag_name == "CD"
          @cd = @value
        elsif @tag_name == "DESC"
          @desc = @value
        end

        if @end_element == 'INFO'
          @comb_pack_ind = CombinationPackInd.new(:DESC => @desc)
          @comb_pack_ind.CD = @cd
          @comb_pack_ind.save
          # Clear both variables so that they are not carried over to the next record.
          # (The original "@cd, @desc = ''" would set @desc to nil, not to "".)
          @cd, @desc = "", ""
        end
      end

      # Saving combination prod ind
      if @main_tag_name == "COMBINATION_PROD_IND"
        if @tag_name == "CD"
          @cd = @value
        elsif @tag_name == "DESC"
          @desc = @value
        end

        if @end_element == 'INFO'
          @comb_prod_ind = CombinationProdInd.new(:DESC => @desc)
          @comb_prod_ind.CD = @cd
          @comb_prod_ind.save
          # Clear both variables so that they are not carried over to the next record.
          @cd, @desc = "", ""
        end
      end

      # Clear the tag info once the closing tag has been handled.
      @tag_name = ""
      # ("A" || "B") evaluates to just "A", so test membership in a list instead.
      @main_tag_name = "" if ["COMBINATION_PROD_IND", "COMBINATION_PACK_IND"].include?(@end_element)
    end

    # Fired for text content. A single text node may arrive in several chunks,
    # so append to the buffer instead of overwriting it.
    def on_characters(s)
      @value = @value.to_s + s
    end
  end

  ################# ---LOOKUP PARSING BEGINS--- #########################################################
  # RAILS_ROOT is the Rails 2 constant; on Rails 3 and later use Rails.root instead.
  file_path = RAILS_ROOT + "/data/file.xml"
  parser = XML::SaxParser.file(file_path)
  parser.callbacks = Lookuphandler.new
  parser.parse
end
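
Because a SAX handler is just an object with callback methods, its logic can be tried out without libxml, Rails, or a database by calling the callbacks by hand. The sketch below is a simplified variant under a few assumptions: RecordCollector and its records array are hypothetical names (not part of libxml-ruby), and it gathers rows into hashes instead of saving ActiveRecord models. It appends text in on_characters, since a SAX parser may deliver one text node in several pieces.

```ruby
# Hypothetical stand-in for a lookup handler: it gathers rows in memory
# instead of saving models, so it runs without libxml or Rails.
class RecordCollector
  SECTIONS = ["COMBINATION_PACK_IND", "COMBINATION_PROD_IND"].freeze

  attr_reader :records

  def initialize
    @records = []
    @section = nil
    @row = {}
    @text = ""
  end

  # Remember the current section and reset the text buffer.
  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
    @section = name if SECTIONS.include?(name)
    @text = ""
  end

  # Append rather than overwrite, in case text arrives in chunks.
  def on_characters(chars)
    @text += chars
  end

  # Store field values, emit a row at the end of INFO, leave the section at its end tag.
  def on_end_element_ns(name, prefix, uri)
    case name
    when "CD", "DESC"
      @row[name] = @text.strip
    when "INFO"
      @records << @row.merge("section" => @section) if @section
      @row = {}
    when *SECTIONS
      @section = nil
    end
    @text = ""
  end
end

# Drive the callbacks by hand, splitting the CD text into two chunks.
collector = RecordCollector.new
collector.on_start_element_ns("COMBINATION_PACK_IND", {}, nil, nil, {})
collector.on_start_element_ns("INFO", {}, nil, nil, {})
collector.on_start_element_ns("CD", {}, nil, nil, {})
collector.on_characters("A")
collector.on_characters("1")
collector.on_end_element_ns("CD", nil, nil)
collector.on_start_element_ns("DESC", {}, nil, nil, {})
collector.on_characters("Sample pack")
collector.on_end_element_ns("DESC", nil, nil)
collector.on_end_element_ns("INFO", nil, nil)
collector.on_end_element_ns("COMBINATION_PACK_IND", nil, nil)

p collector.records
```

With libxml-ruby, a class like this could include XML::SaxParser::Callbacks and be assigned to parser.callbacks in the same way as the handler in the example. The collected rows could then be inserted in batches rather than one save per record.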