Split An XML File Into Chunks

From SifWiki
Revision as of 13:07, 19 June 2013 by Siftah


I needed a quick way of turning a huge (100MB+) XML file into something more manageable. I'm always annoyed by the lack of command-line tools for processing XML compared with .csv, but this little bit of PHP I found on Wedells blog worked well. I improved the wrapper a little and dropped the class into the same file to keep it compact.

After dropping it into a file called dechunk.php and chmod +x'ing it, you can call it from the command line like this:

./dechunk.php /home/you/hugeBigXMLFile.xml [RootTag] [ItemTag] [RecordCount]

The file path, RootTag and ItemTag are required; RecordCount is optional.

RootTag = The root element of the XML which you want to split.

ItemTag = The element name whose start and end tags delimit each record you want to split on.

RecordCount = How many of the <ItemTag>...</ItemTag> sections to include in each chunk. Optional, defaults to 2000.
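
To try the parameters out without a 100MB file, a small test document helps. The following is a minimal sketch, not part of dechunk.php: makeSampleXml and the catalog/product tag names are hypothetical, and it assumes the splitter works best when every tag sits on its own line.

```php
<?php
// Hypothetical helper (not part of dechunk.php): builds a small XML document
// with one tag per line, the layout a line-based splitter handles best.
function makeSampleXml(string $rootTag, string $itemTag, int $count): string
{
    $lines = ['<?xml version="1.0"?>', "<$rootTag>"];
    for ($i = 1; $i <= $count; $i++) {
        $lines[] = "<$itemTag>";
        $lines[] = "<id>$i</id>";
        $lines[] = "</$itemTag>";
    }
    $lines[] = "</$rootTag>";
    return implode("\n", $lines) . "\n";
}

// Write a 5-record file to the system temp directory.
$path = sys_get_temp_dir() . '/sample.xml';
file_put_contents($path, makeSampleXml('catalog', 'product', 5));
echo "Wrote $path\n";
```

With RootTag = catalog, ItemTag = product and RecordCount = 2, the intent is chunks of at most two product records each, every chunk wrapped in its own catalog root.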

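Chunks are written to a splits directory alongside the input file, named after it with a numeric suffix (hugeBigXMLFile-1.xml, hugeBigXMLFile-2.xml and so on).

I've stuck this in my ~/scripts directory, which is part of my $PATH.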

Here's the script:

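#!/usr/bin/php
<?php
// Note: the published listing was damaged. Lines marked "assumed" fill gaps
// and may differ from the original script.

class xmlChunk
{
    /*
     * $basefilename      the base file name for the chunks
     * $xmlfile           the xml file name to be processed
     * $xmldatadelimiter  core data delimiter (root tag name)
     * $xmlitemdelimiter  record delimiter (item tag name)
     * $chunksize         number of records in each chunk file
     * $dir               path to where splits will be stored
     */
    function doChunk($basefilename, $xmlfile, $xmldatadelimiter, $xmlitemdelimiter, $chunksize = 2000, $dir = "/var/www/public_html")
    {
        // initialize vars
        $begin = time();     // script start time
        $start = time();     // last gate time
        $minutes = 1;        // minutes reported at the next gate
        $filenum = 1;        // start chunk file number at 1
        $recordnum = 1;      // start at record 1
        $initem = false;     // assumed: true while inside an item
        $out = null;         // handle of the chunk file being written

        // xmlchunk file header and footer (assumed: the tags were stripped from the listing)
        $header = "<?xml version=\"1.0\"?>\n<$xmldatadelimiter>\n";
        $footer = "</$xmldatadelimiter>\n";
        // assumed: items are matched by bare tag name, with tags on their own lines;
        // preg_match/strpos replace ereg(), which was removed in PHP 7
        $itemopen = '/<' . preg_quote($xmlitemdelimiter, '/') . '[\s>\/]/';
        $itemclose = "</$xmlitemdelimiter>";

        // dirs and files
        if (!is_dir("$dir/splits")) {
            mkdir("$dir/splits", 0755, true);   // assumed: create the output directory if missing
        }

        // start processing
        echo "Processing ($dir/$xmlfile)\n";
        $handle = @fopen("$dir/$xmlfile", "r");
        if ($handle) {
            while (($buffer = fgets($handle)) !== false) {
                // an item is starting: open a new chunk file if none is open
                if (!$initem && preg_match($itemopen, $buffer)) {
                    $initem = true;
                    if ($out === null) {
                        $exportfile = "$dir/splits/$basefilename-$filenum.xml";
                        $out = fopen($exportfile, "w");
                        if (!$out) {
                            die("Unable to write $exportfile\n");
                        }
                        // write new chunk xml file header
                        fwrite($out, $header);
                    }
                }
                if ($initem) {
                    // write line to chunk file
                    fwrite($out, $buffer);
                    // if item delimiter reached, increment record number iterator
                    if (strpos($buffer, $itemclose) !== false) {
                        $initem = false;
                        $recordnum++;
                        // if chunk limit reached then close the file with well formed xml
                        if ($recordnum > $chunksize) {
                            // post feed end tag
                            fwrite($out, $footer);
                            fclose($out);
                            $out = null;
                            // echo status report to STDOUT
                            echo "Segment $filenum. Record " . ($chunksize * $filenum) . ".\n";
                            // increment file number and reset record counter for the new chunk
                            $filenum++;
                            $recordnum = 1;
                            // put in a catch so that script doesn't run riot
                            if ($filenum > 5000) {
                                die("Stopping: more than 5000 chunk files.\n");
                            }
                        }
                    }
                }
                // report progress roughly once a minute
                $interval = time();   // current gate time
                if (($interval - $start) > 60) {
                    echo $minutes . " Minutes so far.\n";
                    $minutes++;
                    $start = $interval;
                }
            }
            fclose($handle);
            // close the last, partially filled chunk
            if ($out !== null) {
                fwrite($out, $footer);
                fclose($out);
            }
        } else {
            echo "Unable to open file! ($dir/$xmlfile)\n";
        }
        $procend = time();
        echo "\n####\n";
        echo "Split Complete (" . floor(($procend - $begin) / 60) . " Minutes)\n";
    }
}

// wrapper: file, RootTag and ItemTag are required; RecordCount defaults to 2000
if (isset($argv[3])) {
    $fileToParse = $argv[1];
    $directoryToUse = dirname($fileToParse);
    $basefilename = basename($fileToParse, ".xml");
    $filename = basename($fileToParse);
    $recordLimit = isset($argv[4]) ? (int) $argv[4] : 2000;

    echo "Creating " . $basefilename . " Splits\n";
    $chunk = new xmlChunk();
    $chunk->doChunk($basefilename, $filename, $argv[2], $argv[3], $recordLimit, $directoryToUse);
} else {
    echo "Usage: ./dechunk.php /home/you/hugeBigXMLFile.xml [RootTag] [ItemTag] [RecordCount]\n";
}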
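
Once a split has finished, it's worth checking that each chunk holds the number of records you expect. The following is a rough sketch of one way to do that, not part of the script above: countItems and the check.php file name are hypothetical, and it assumes PHP's XMLReader extension is available.

```php
<?php
// Hypothetical checker (not part of dechunk.php): counts <$itemTag> elements
// in one file. XMLReader streams the document, so large chunks are fine.
function countItems(string $path, string $itemTag): int
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Unable to open $path");
    }
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $itemTag) {
            $count++;
        }
    }
    $reader->close();
    return $count;
}

// Usage: php check.php /path/to/splits ItemTag   (both arguments are placeholders)
if (isset($argv[2])) {
    foreach (glob($argv[1] . '/*.xml') ?: [] as $file) {
        echo basename($file) . ': ' . countItems($file, $argv[2]) . " records\n";
    }
}
```

XMLReader raises a warning and stops reading when it hits malformed XML, so a chunk that reports fewer records than expected, or a warning, is a sign the split went wrong at that point.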