Split An XML File Into Chunks

From SifWiki

I needed a quick way of turning a huge (100MB+) XML file into something more manageable. I'm always annoyed by the lack of command-line tools for processing XML compared with .csv, but this little bit of PHP I found on Wedell's blog worked well. I improved the wrapper a little and dropped the class into the same file to keep it compact.

Full credit to the original script from: http://www.bradwedell.com/blog/2008/06/04/php-xml-chunk-class/

Usage

After dropping it into a file called dechunk.php and making it executable with chmod +x, you can call it from the command line like this:

./dechunk.php /home/you/hugeBigXMLFile.xml [RootTag] [ItemTag] [RecordCount]

Besides the path to the XML file, there are two parameters you'll need to pass; the third is optional.

RootTag = The root element of the XML which you want to split.

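ItemTag = The tag that marks each section of the XML you want to split on. The script counts every line containing this text, so a bare tag name matches both the start and the end tag when they sit on separate lines.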

RecordCount = How many of the <ItemTag>...</ItemTag> sections to include in each chunk. Optional, defaults to 2000.
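To pick a sensible RecordCount it helps to know how many lines in the file actually match your ItemTag. The sketch below is not part of dechunk.php; countMatchingLines is a hypothetical helper that mimics the script's line-by-line matching, on the assumption that a plain substring match is what you want.

```php
<?php
// Hypothetical helper (not in the original script): counts the lines that
// contain $itemTag, which is how the script decides when a chunk is full.
function countMatchingLines(string $path, string $itemTag): int
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $path");
    }
    $count = 0;
    // Read line by line so a huge file never has to fit in memory.
    while (($line = fgets($handle)) !== false) {
        if (strpos($line, $itemTag) !== false) {
            $count++;
        }
    }
    fclose($handle);
    return $count;
}
```

Dividing the result by RecordCount gives a rough idea of how many chunk files to expect.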

I've stuck this in my ~/scripts directory, which is part of my $PATH.


The Script

It's probably best to grab this from GitHub https://github.com/Siftah/Wiki/blob/master/dechunk.php

Or you can copy and paste if you're feelin' old skool...

#!/usr/bin/php
<?php

class xmlChunk
{
// no constructor needed (PHP 4 style constructors are deprecated since PHP 7.0)
/*$basefilename // the base file name for the chunks
$xmlfile // the xml file name to be processed
$xmldatadelimiter // core data delimiter
$xmlitemdelimiter // record delimiter
$chunksize = 2000; // number of records in each chunk file
$dir // path to where splits will be stored
*/
function doChunk( $basefilename, $xmlfile, $xmldatadelimiter, $xmlitemdelimiter, $chunksize=2000, $dir= "/var/www/public_html"){
//initialize vars
$begin=time(); // script start time
$start = time(); // last gate time
$interval=time(); // current gate time
$minutes=1; // intervals for gates
$filenum = 1; // start chunk file number at 1
$recordnum = 1; // start at record 1
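// NOTE: the XML declaration was lost from this copy of the script;
// the minimal declaration below is an assumption
$xmlstring = '<?xml version="1.0"?>'."\n";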
$xmlstring.="<$xmldatadelimiter>\n";
// xmlchunk file header
//dirs and files
$exportfile = "$dir"."/splits/$basefilename-$filenum.xml";
//start processing
echo "Processing (".$dir."/$xmlfile)\n";
$handle = @fopen($dir."/$xmlfile","r");
if ($handle) {
while (!feof($handle)) {
$buffer = fgets($handle, 4096);
// if item delimiter reached
// increment record number iterator
// ereg() was removed in PHP 7, so use a plain substring match instead
if (strpos((string)$buffer, $xmlitemdelimiter) !== false) {
$recordnum++;
}
//write line to chunk file
error_log("$buffer",3,$exportfile);
// if chunk limit reached then start to
// close the file with well formed xml
if ($recordnum>$chunksize) {
// post feed end tag
error_log("</$xmldatadelimiter>\n",3,$exportfile);
// and increment file number to start new log file chunk
//reset record counter number for new chunk file
// reset to 1 (the initial value) so every chunk holds the same number of records
$recordnum=1;
$filenum++;
//update export file name
$exportfile = "$dir"."/splits/$basefilename-$filenum.xml";
//echo status report to STDOUT
echo"Segment $filenum. Record ".($chunksize*$filenum).".\n";
// write new chunk xml file header
error_log($xmlstring,3,$exportfile);
}
//put in a catch so that script doesn't run riot and
//will die after X number of cycles
if ($filenum>5000) {
die("Stopping: more than 5000 chunk files written.\n");
}
if (($interval-$start)>60) {
$minutes++;
echo $minutes." Minutes so far.\n";
$start=time();
} else {
$interval = time();
}
}
fclose($handle);
} else {
echo "Unable to open file! (".$dir."/$xmlfile)\n";
}
$procend = time();
echo "\n####\n";
echo "Split Complete (".floor((($procend-$begin)/60))." Minutes)\n";
}
}

// the file path, RootTag and ItemTag are all required
if (isset($argv[1], $argv[2], $argv[3])) {
	$fileToParse = $argv[1];
	$directoryToUse = dirname($fileToParse);
	$basefilename = basename($fileToParse, ".xml");
	$filename = basename($fileToParse);
	// create the output directory next to the source file if it is missing
	if (!is_dir("$directoryToUse/splits")) {
		mkdir("$directoryToUse/splits");
	}

	// optional fourth argument: records per chunk
	if (isset($argv[4]) && $argv[4] != "") {
		$recordLimit = (int)$argv[4];
	} else {
		$recordLimit = 2000;
	}

	echo "Creating ".$basefilename." Splits\n";
	$chunk = new xmlChunk();
	$chunk->doChunk($basefilename, $filename, $argv[2], $argv[3], $recordLimit, $directoryToUse);
	unset($chunk);
} else {
	echo "Usage: ./dechunk.php /home/you/hugeBigXMLFile.xml [RootTag] [ItemTag] [RecordCount]\n";
}
?>
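Because the script splits on lines rather than parsing the XML, it's worth checking that each chunk is still well formed. A minimal sketch, assuming the SimpleXML extension is available; findBrokenChunks is a hypothetical helper and not part of the original script.

```php
<?php
// Hypothetical checker (not in the original script): returns the names of
// the chunk files in $splitsDir that fail to parse as XML.
function findBrokenChunks(string $splitsDir): array
{
    // Collect libxml parse errors quietly instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $broken = [];
    foreach (glob($splitsDir . '/*.xml') ?: [] as $file) {
        // Compare with === false: a valid but empty element is also "falsy".
        if (simplexml_load_file($file) === false) {
            $broken[] = basename($file);
        }
        libxml_clear_errors();
    }
    libxml_use_internal_errors($previous);
    return $broken;
}
```

Point it at the splits directory that sits next to your source file; an empty array means every chunk parsed.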