Thursday, February 2, 2012

Xerces C++ Tutorial

I was never able to find a good xerces-c tutorial that walked me through parsing and reading a basic XML document.  I finally figured it out after hours of frustration, so I thought I'd share my results in a moderately complex tutorial that can be easily duplicated.  In this tutorial I will use xerces to iterate through an XML document, storing the XML data in a C++ STL map. I used the CodeLite 3.5 IDE on Ubuntu Linux 11.10 32-bit to develop and run this project.

This tutorial will assume you were able to get xerces installed and linked properly.  If you get stuck here, let me know and maybe I'll write a tutorial for that, but it would only be for the CodeLite 3.5 IDE.

To start with, here is the XML file I used: testdoc.xml
And here is the XML schema for that doc (important): schema.xsd
And finally here is the .cpp file that runs it all: main.cpp

Put testdoc.xml and schema.xsd where your program can find them.  I used CodeLite 3.5 as my IDE, and I put these two files in the root of my CodeLite workspace.  The code below opens the XML file as "../testdoc.xml", so adjust that path if your layout is different.

I'm going to walk you through that cpp file a few lines at a time, beginning here:

#include <vector>
#include <string>
#include <map>
#include <iostream>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/sax/HandlerBase.hpp>

using namespace std;
using namespace xercesc;

That part was pretty simple.  If you get any errors in this section, you'll need to re-check your xerces include path and link settings.

Now, let's move on to setting up our data container (an STL map) and initializing xerces!

// This is where our data will go after it's pulled out of the XML file
map<string,pair<int,int> > myData;

// Initialize xerces
try { XMLPlatformUtils::Initialize(); }
catch (const XMLException& toCatch) {
    char* message = XMLString::transcode(toCatch.getMessage());
    cout << "Error during initialization! :\n"
         << message << "\n";
    XMLString::release(&message);
    return 1;
}

This portion of the code is the basic initialization call for xerces, and that's really all it does - we're still not doing anything interesting yet.  The one thing that is important here is our STL map declaration, which is the first actual code in the main function.  As you can see from the XML file we'll be using, there are 3 values for each data item (each data item is called a "word" in the XML): one string (the word text) and two integers.  Therefore, I decided to use a map, which stores key/value pairs and looks them up quickly by key.  The key will be a string, the word text.  The value will be another container - an STL "pair".  This pair will contain the two integers.
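
If the map-of-pairs idea is new to you, here is a minimal sketch that uses only the standard library (no xerces), so you can try the container on its own.  The helper names storeWord and describeWord are my own inventions for illustration, not part of main.cpp.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Same container type as myData in the tutorial: word text -> (type, value).
typedef std::map<std::string, std::pair<int, int> > WordMap;

// Hypothetical helper: store one word. operator[] overwrites an existing
// entry, so a word that appears twice keeps only its last (type, value).
void storeWord(WordMap& data, const std::string& text, int type, int value) {
    assert(!text.empty());  // a word needs some text to act as the key
    data[text] = std::make_pair(type, value);
}

// Hypothetical helper: look a word up with find(), which (unlike operator[])
// does not insert a new entry when the key is missing.
std::string describeWord(const WordMap& data, const std::string& text) {
    WordMap::const_iterator it = data.find(text);
    if (it == data.end()) {
        return text + ": not found";
    }
    std::ostringstream out;
    out << text << ": type " << it->second.first
        << ", value " << it->second.second;
    return out.str();
}
```

One consequence worth knowing: because the word text is the key, two <word> items with the same text end up as a single map entry.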


// Create parser, set parser values for validation
XercesDOMParser* parser = new XercesDOMParser();
parser->setValidationScheme(XercesDOMParser::Val_Always);
parser->setDoNamespaces(true);
parser->setDoSchema(true);
parser->setValidationConstraintFatal(true);

// You'll probably need to change the path below, or you'll get a segmentation fault.
// parse() accepts a plain C string, so there is no need to transcode() it (which would leak the buffer).
parser->parse("../testdoc.xml");

This chunk of code creates the document parser and sets a few options, mostly to make sure we get valid XML.  This will fail and the program will crash if the schema in your .xsd file doesn't match the document.  This will also fail and crash with a segmentation fault if it can't find your XML document - so make sure that path is correct!  Nothing in this snippet checks whether parsing succeeded, so the crash most likely shows up later, when the code starts using the document.  If you can't seem to get it to work with a relative path, try an absolute path.
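
Since a wrong path is the most common way to hit that segmentation fault, a cheap sanity check before parsing can save some head-scratching.  Here is a small sketch using only the standard library; canOpenForReading is a hypothetical helper name of mine, and it only tells you whether the file can be opened, not whether it is valid XML.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: returns true if the file at 'path' can be opened for
// reading. A relative path is resolved against the program's current working
// directory, which an IDE may set to a different folder than you expect.
bool canOpenForReading(const std::string& path) {
    assert(!path.empty());  // an empty path is always a caller mistake
    std::ifstream file(path.c_str());
    return file.is_open();
}
```

Calling canOpenForReading("../testdoc.xml") just before parser->parse(...) and printing a message when it returns false turns a mysterious crash into a readable error.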


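DOMDocument* doc = parser->getDocument();
DOMElement* docRootNode = (doc != NULL) ? doc->getDocumentElement() : NULL;
if (docRootNode == NULL) {
    cout << "Could not load the XML document." << endl;
    return 1;
}

// Create the node iterator that will walk through each element.
DOMNodeIterator* walker = NULL;
try {
    walker = doc->createNodeIterator(docRootNode, DOMNodeFilter::SHOW_ELEMENT, NULL, true);
} catch (const DOMException&) {
    cout << "Could not create the node iterator." << endl;
    return 1;
}

The code above creates a number of pointers that will be used for the rest of the code.  It gets the parsed document and loads it into "doc", which is then used to load the root node (called simply "root" in our XML file) into "docRootNode".  If there is no root node, we stop here instead of crashing later.  It then creates the iterator, "walker", which will visit only element nodes (that's what SHOW_ELEMENT asks for).  The error handling shown is the bare minimum - I'm not going to deal with it in any depth here.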


// Some declarations
DOMNode * current_node = NULL;
string thisNodeName;
string parentNodeName;
bool wordParts[3] = {false,false,false};
string wordText = "";
pair<int,int> wordTypeValue;

This code consists of a few more declarations, used to hold temporary information as we loop through all of the elements of the XML document.  "thisNodeName" will hold the name of the node we're currently reading.  "parentNodeName" will hold the name of the current node's parent.  "wordParts" will hold 3 true/false values, one for each part of a <word>.  As we iterate through the 3 elements that make up a <word> (see the XML doc to understand this), we will load them into "wordText" (for the word itself) and "wordTypeValue" (a pair of integers, for the other two values).  As we come across each piece, the matching boolean in "wordParts" is set to true, and when all 3 are true we know we have loaded all the info for one <word> into the temporary variables (wordText and wordTypeValue).  When this happens, we can store that data as one entry in the map that was defined at the very beginning, and that entry will represent one <word>.


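for (current_node = walker->nextNode(); current_node != 0; current_node = walker->nextNode()) {
    // transcode() allocates a new buffer on every call, so copy the text and then release it
    char* name = XMLString::transcode(current_node->getNodeName());
    thisNodeName = name;
    XMLString::release(&name);
    char* parentName = XMLString::transcode(current_node->getParentNode()->getNodeName());
    parentNodeName = parentName;
    XMLString::release(&parentName);

The first line above starts the loop that will continue until we reach the end of the XML document.  It goes through the nodes one by one, making the data available through the current_node pointer.

The rest of the code above assigns the correct values to thisNodeName and parentNodeName each time we visit a new node (see the last code section for what these variables represent).  XMLString::transcode() hands back a newly allocated C string, so each one is copied into a std::string and then freed with XMLString::release(), just like the error message in the initialization code.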


if(parentNodeName == "word" ) {
    if(thisNodeName == "wordText") {
        wordParts[0] = true;
        char* text = XMLString::transcode(current_node->getFirstChild()->getNodeValue());
        wordText = text;
        XMLString::release(&text);  // free the buffer that transcode() allocated
    } else if(thisNodeName == "wordType") {
        wordParts[1] = true;
        wordTypeValue.first = 
            XMLString::parseInt(current_node->getFirstChild()->getNodeValue());
    } else if(thisNodeName == "wordValue") {
        wordParts[2] = true;
        wordTypeValue.second = 
            XMLString::parseInt(current_node->getFirstChild()->getNodeValue());
    }

This is where the magic happens!  As the program begins to iterate through the nodes of the XML document, eventually it will get to a value that we want.  As you can see from our XML document, the values we're interested in will always be directly underneath a <word> tag.  I used this as the logic for when to take a closer look.  That first if statement above will be true for any of the elements directly underneath a <word> tag, such as <wordText>.

The next three if statements evaluate the name of the current node.  The first node we hit after the <word> tag will be <wordText>.  This will put us inside the first nested if statement above (if(thisNodeName == "wordText")).

So now we know that we're on the element <wordText> - we just need to know the value that it contains.  This is actually NOT in the current element!  The value inside the <wordText> tags is actually in the first child node of <wordText>, which is a text node.  We access this node using the second line of code you see under the nested if, and assign its value to our temporary holder wordText.  We also set the first of the three values in the boolean array to true, so that we know we've found the first item we need to store the <word> data.

On the next iteration, we'll be on <wordType>!  We repeat the same basic process as we did with <wordText>, but instead of putting its child's value into the temporary holder wordText, we'll put it into the first element of our integer pair, wordTypeValue.  Another iteration later we'll be on <wordValue>, and we'll put its child's value into the second element of wordTypeValue.

Now that we've iterated through all three of these elements, all three of the booleans will be true!


if(wordParts[2] && wordParts[1] && wordParts[0]) {
    myData[wordText] = wordTypeValue;
    wordParts[0] = false;
    wordParts[1] = false;
    wordParts[2] = false;
}

So now that all three booleans are true, the condition of the if statement above will be true.  The first line inside the if statement loads the temporary values we pulled out of the XML in the previous code segment into our map.  The remaining lines reset the booleans to false so that we are ready to start fresh with another word.


} else {
    // Not in a word
    wordParts[0] = false;
    wordParts[1] = false;
    wordParts[2] = false;
}
} // end of the for loop that walks the document

This is the code that executes if the parent element isn't <word>.  It will reset the booleans, so that we're ready to start over when we do find the elements with a <word> parent.
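
If you want to convince yourself that the boolean bookkeeping works without involving xerces at all, the loop can be sketched with plain data.  Everything below is a stand-in of my own: ElementEvent plays the role of one element visited by the iterator, makeEvent and collectWords are hypothetical helper names, and std::atoi replaces XMLString::parseInt (it is more forgiving of bad input, so treat it only as a placeholder).

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for one element visited by the node iterator. The real loop gets
// these three strings from xerces; here they are plain data.
struct ElementEvent {
    std::string parentName;
    std::string name;
    std::string text;
};

// Hypothetical convenience function for building sample input.
ElementEvent makeEvent(const std::string& parent, const std::string& name,
                       const std::string& text) {
    assert(!name.empty());  // every element has a name
    ElementEvent e;
    e.parentName = parent;
    e.name = name;
    e.text = text;
    return e;
}

typedef std::map<std::string, std::pair<int, int> > WordMap;

// Hypothetical helper that applies the same flag logic as the tutorial loop.
WordMap collectWords(const std::vector<ElementEvent>& events) {
    WordMap result;
    bool wordParts[3] = {false, false, false};
    std::string wordText;
    std::pair<int, int> wordTypeValue;

    for (std::size_t i = 0; i < events.size(); ++i) {
        const ElementEvent& e = events[i];
        if (e.parentName == "word") {
            if (e.name == "wordText") {
                wordParts[0] = true;
                wordText = e.text;
            } else if (e.name == "wordType") {
                wordParts[1] = true;
                wordTypeValue.first = std::atoi(e.text.c_str());
            } else if (e.name == "wordValue") {
                wordParts[2] = true;
                wordTypeValue.second = std::atoi(e.text.c_str());
            }
            // All three parts seen: store the word and start over.
            if (wordParts[0] && wordParts[1] && wordParts[2]) {
                result[wordText] = wordTypeValue;
                wordParts[0] = wordParts[1] = wordParts[2] = false;
            }
        } else {
            // Not inside a <word>: discard any partially collected word.
            wordParts[0] = wordParts[1] = wordParts[2] = false;
        }
    }
    return result;
}
```

Notice that the <word> element itself has a parent other than "word", so visiting it lands in the else branch and clears the flags.  That is what keeps a <word> with a missing part from leaking its leftovers into the next one.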


cout << endl << "STL map contents:" << endl << endl;

for ( map<string, pair<int,int> >::const_iterator iter = myData.begin();
        iter != myData.end(); ++iter ) {
    cout << "Word: " << iter->first << ", ";
    cout << "Type: " << iter->second.first << ", ";
    cout << "Value: " << iter->second.second << "." << endl;
    
}
cout << endl << "There are " << myData.size();
cout << " words loaded." << endl << endl;

This last bit of code just iterates through the map and outputs everything to the console, to show that the process worked.  If you don't understand this code, read up on the STL map and STL pair.
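
One thing to keep in mind when you read the output: a std::map keeps its entries sorted by key, so the words print in sorted order (plain string comparison), not in the order they appear in the XML file.  The sketch below shows this with the standard library only; keysInIterationOrder is a hypothetical helper name, and the sample words are made up.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::map<std::string, std::pair<int, int> > WordMap;

// Hypothetical helper: returns the keys in the order the print loop visits them.
std::vector<std::string> keysInIterationOrder(const WordMap& data) {
    std::vector<std::string> keys;
    for (WordMap::const_iterator it = data.begin(); it != data.end(); ++it) {
        keys.push_back(it->first);
    }
    assert(keys.size() == data.size());  // one key per map entry
    return keys;
}
```

If you insert "pear", "apple" and "mango" in that order, the helper returns "apple", "mango", "pear".  If document order matters to you, a vector of entries would be a better fit than a map.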

Not so bad after all!!!  Please leave me a comment if you can't get it to work, or have trouble understanding any part of the code.

17 comments:

  1. Thanks, I'm going to use your example code which was explained well. I also could not find a realistically simple xerces example to start from.

    1. It's so weird I couldn't find simple examples too.

  2. Very clear, not beating around the bush like *cough* the Xerces docs. Good job!

  3. Thanks for this snippet, very nice.

  4. Files are not anymore online. Can you put them again?

  5. hey.. Nice tutorial.. But difficult to understand because testdoc.xml is not there... Can you please upload it again.??

  6. Thanks for the code snippets. It's very useful and easy. Pls upload the xml file also for the better understanding.

  7. Those files are accessible again, although I doubt that it will do anyone who already commented any good. Sorry about that!

  8. hey, any input on how to get all the code compiled and linked for use in your own project? :)

    thank you

  9. >> hey, any input on how to get all the code
    >> compiled and linked for use in your own project?

    g++ main.cpp -L/usr/local/lib -lxerces-c

    The above assumes libxerces-c* file is placed in /usr/local/lib.

  10. This comment has been removed by the author.

  11. This comment has been removed by the author.

  12. C++ was a difficult program for me for the first time, but having read this post I realized that it is not that difficult.

  13. Would you please share a link to tutorial on "installation and linking xerces" ?

  14. I'm in no doubt coming back again to read these articles and blogs.
    c++ programming

  15. Fantastic! Really simple... but the files are missing, and it is interesting to have them to understand the code, because you put the node names hardcoded. Does you have the files? Best regards.

  16. this was by far the most useful blog regarding xercers, very easy to follow and very good implementation wise. thanks
