The XML parsing article that should (not) be written!
Posted by Wong Shao Voon
Location: Desktop development - C/C++ | License: The Microsoft Public License (Ms-PL)
Skill: Beginner | Posted: 22/12/2010
The C++ XML parsing article which should have been written since the advent of XML! This article defines a new Elmax abstraction model over the DOM model.
<Books>
  <Book>
    <Price>12.990000</Price>
  </Book>
</Books>
To create the above XML, see the C++ code below,
The 3rd line of code detects that the 3 elements do not exist, so the float assignment creates those 3 elements, converts 12.99f to a string and assigns it to the Price element. To read the Price element, we just assign it to a float variable (see below),
It is good practice to check if the price element exists, using Exists(), before reading it.
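The six decimal places in `<Price>12.990000</Price>` are what a printf-style "%f" conversion of 12.99f produces. The sketch below assumes that kind of formatting; the article does not say how Elmax converts floats internally, and `FormatFloat` is a hypothetical helper, not part of Elmax.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not an Elmax function): format a float the way
// printf's "%f" does, i.e. with six digits after the decimal point.
std::string FormatFloat(float value)
{
    char buffer[64]; // large enough for any finite float printed with "%f"
    // The float is promoted to double when passed to a variadic function.
    const int written = std::snprintf(buffer, sizeof(buffer), "%f", value);
    assert(written > 0 && written < static_cast<int>(sizeof(buffer)));
    return std::string(buffer, static_cast<std::size_t>(written));
}
```

Under this assumption, `FormatFloat(12.99f)` gives "12.990000", matching the element text shown above.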
Over the years as a C++ software developer, I have occasionally had to maintain XML file formats for application project files. I found the DOM difficult to navigate and use. I have come across many articles and XML libraries which claim to be easy to use, but none is as easy as the internal XML library co-developed by my ex-coworkers, Srikumar Karaikudi Subramanian and Ali Akber Saifee. Srikumar wrote the 1st version, which could only read from an XML file, and Ali later added the node creation capability, which allowed the content to be saved to an XML file. However, that library is proprietary. After I left the company, I lost the use of a really easy-to-use XML library. Unlike many talented programmers out there, idiots like me need an idiot-proof XML library. Too bad LINQ to XML (XLinq) is not available in C++/CLI! I decided to reconstruct Srikumar's and Ali's XML library and make it open source! I dedicate this article to Srikumar Karaikudi Subramanian and Ali Akber Saifee.
Ali Akber Saifee and I are what we call "the world's greatest arch-rivals". While we worked together in the same company, I would take every opportunity to find 'flaws' with Ali, email him to expose some of his 'problems', and carbon-copy everyone else. My arch-rival, as always, beat me with some of his best replies. Ali once offered me a chance to make good and work together to conquer the world. But I rejected his offer (a thinly veiled plot to subjugate me)! The world's greatest arch-rivals can never work together!
Whenever I lose a friend on Facebook, I always check whether it was Ali who defriended me. The readers may ask why. Do you, the readers, know the ramifications of the world's greatest arch-rivals defriending each other on Facebook? Answer: there can never be world peace! The readers may also ask why the world's greatest arch-rivals are on each other's Facebook in the 1st place! Well, that is another story for another article on another day!
Why am I rewriting and promoting my arch-rival's XML library? Before Ali says this, let me pre-empt him and say this myself: Imitation is the most sincere form of flattery. The truth is his XML library is really easy to use!
In this section, let us first look at the advantages of XML over binary serialization before we discuss Elmax. I'll not discuss XML serialization because I am not familiar with it. Below is the simplified (version 1) file format for an online bookstore.
Version=1
Books
  Book*
    ISBN
    Title
    Price
    AuthorID
Authors
  Author*
    Name
    AuthorID
The child elements are indented under their parent. Elements which can occur more than once are marked with an asterisk (*). The diagram below shows what the (version 1) binary serialization file format will typically look like.

Let's say that in version 2, we add a Description under the Book and a Biography under the Author.
Version=2
Books
  Book*
    ISBN
    Title
    Price
    AuthorID
    Description(new)
Authors
  Author*
    Name
    AuthorID
    Biography(new)
The diagram below shows the version 1 and version 2 binary serialization file formats. The new additions in version 2 are in lighter colors.

Notice that versions 1 and 2 are binary incompatible? Below is how a binary file format (note: not binary serialization) would typically implement it.
Version=2
Books
  Book*
    ISBN
    Title
    Price
    AuthorID
Authors
  Author*
    Name
    AuthorID
Description(new)*
Biography(new)*

In this way, version 1 of the application can still read the version 2 binary file while ignoring the new parts at the back of the file. If XML is used, then without any extra work, version 1 of the application can still read the version 2 XML file (forward compatible) while ignoring the new elements, provided that the original elements are not removed and their data types remain unchanged. And the version 2 application can read a version 1 XML file by using the old parsing code (backward compatible). The downside of XML is that parsing is slower than for a binary file format and the files take up more space, but XML files are self-describing.
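The compatibility argument rests on looking fields up by name rather than by position. Here is a minimal sketch of that idea, using a plain map as a stand-in for a parsed `<Book>` element; `Record`, `GetOr`, `ReadBookV1` and `ReadBookV2` are hypothetical names for illustration and have nothing to do with the Elmax API.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a parsed <Book>: child name -> text content.
typedef std::map<std::string, std::string> Record;

// Look a field up by name; fall back when the field is absent.
std::string GetOr(const Record& record, const std::string& name,
                  const std::string& fallback)
{
    const Record::const_iterator it = record.find(name);
    return it == record.end() ? fallback : it->second;
}

struct BookV1 { std::string isbn, title, price, authorId; };
struct BookV2 { BookV1 base; std::string description; };

// Version 1 reader: only asks for the names it knows about, so any
// extra fields written by a newer application are simply ignored.
BookV1 ReadBookV1(const Record& record)
{
    BookV1 book;
    book.isbn     = GetOr(record, "ISBN", "");
    book.title    = GetOr(record, "Title", "");
    book.price    = GetOr(record, "Price", "");
    book.authorId = GetOr(record, "AuthorID", "");
    return book;
}

// Version 2 reader: reuses the old parsing code and tolerates a
// missing Description, so it can also read version 1 data.
BookV2 ReadBookV2(const Record& record)
{
    BookV2 book;
    book.base = ReadBookV1(record);
    book.description = GetOr(record, "Description", "");
    return book;
}
```

A positional binary reader has no equivalent of `GetOr`, which is why the binary format above has to append the new data at the back of the file instead.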

Below is an example of how I would implement the file format in XML, followed by a code example to create the XML file.
<?xml version="1.0" encoding="UTF-8"?>
<All>
  <Version>1</Version>
  <Books>
    <Book ISBN="1111-1111-1111">
      <Title>How not to program!</Title>
      <Price>12.990000</Price>
      <Desc>Learn how not to program from the industry's worst programmers!</Desc>
      <AuthorID>111</AuthorID>
    </Book>
    <Book ISBN="2222-2222-2222">
      <Title>Caught with my pants down</Title>
      <Price>10.000000</Price>
      <Desc>Novel about extra-marital affairs</Desc>
      <AuthorID>111</AuthorID>
    </Book>
  </Books>
  <Authors>
    <Author Name="Wong Shao Voon" AuthorID="111">
      <Bio>World's most funny author!</Bio>
    </Author>
  </Authors>
</All>
In this section, we'll look at how to use the Elmax library to perform creation, reading, update and deletion (CRUD) on elements, attributes, CData sections and comments. As you can see from the previous code sample, Elmax makes use of the Microsoft XML DOM library. That's because I do not wish to re-create all that XML functionality, for instance, XPath. Since Elmax depends on Microsoft XML, which in turn depends on COM to work, we have to call CoInitialize(NULL); to initialize the COM runtime at the start of the application and also call CoUninitialize(); to uninitialize it before the application ends. Elmax is an abstraction over the DOM; however, it does not seek to replicate all the functionality of the DOM. For example, a programmer cannot use Elmax to read element siblings. In the Elmax model, the element is a 1st class citizen. Attributes, CData sections and comments are children of an element! This is different from the DOM, where they are nodes in their own right. The reason I designed CData sections and comments to be children of an element is that they are not identifiable by name or ID.
Typically, we use CreateNew to create elements. There is also a Create method. The difference is that the Create method will not create the elements if they already exist. Notice that I did not use Create or CreateNew to create the All and Version elements? That's because they are created automatically when I assign a value to the last element on the chain. Note that when you call CreateNew repeatedly, only the last element gets created again. Let me show you a code example to explain this.
In the 1st CreateNew call, elements "aa", "bb" and "cc" are created. In each subsequent call, only element "cc" is created. This is the resultant XML (indented for easy reading).
<aa>
  <bb>
    <cc/>
    <cc/>
    <cc/>
  </bb>
</aa>
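To make the difference between the two creation behaviours concrete, here is a toy in-memory model of the semantics described above. It is only a sketch: `Node`, `CreatePath` and `ToXml` are hypothetical names, the model ignores namespaces, attributes and MSXML entirely, and it is not how Elmax is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Toy element: a name plus an ordered list of child elements.
struct Node {
    std::string name;
    std::vector<std::unique_ptr<Node>> children;
    explicit Node(const std::string& n) : name(n) {}

    Node* FindChild(const std::string& n) {
        for (auto& child : children)
            if (child->name == n) return child.get();
        return nullptr;
    }
    Node* AddChild(const std::string& n) {
        children.push_back(std::unique_ptr<Node>(new Node(n)));
        return children.back().get();
    }
};

// Walk `path` below `parent`, creating whatever is missing.
// alwaysNewLast == false: reuse an existing last element (Create-like).
// alwaysNewLast == true : always append a fresh last element (CreateNew-like).
Node* CreatePath(Node& parent, const std::vector<std::string>& path,
                 bool alwaysNewLast)
{
    assert(!path.empty());
    Node* current = &parent;
    for (std::size_t i = 0; i < path.size(); ++i) {
        const bool isLast = (i + 1 == path.size());
        Node* next = (isLast && alwaysNewLast) ? nullptr
                                               : current->FindChild(path[i]);
        if (!next) next = current->AddChild(path[i]);
        current = next;
    }
    return current;
}

// Serialize without indentation, using <x/> for empty elements.
std::string ToXml(const Node& node)
{
    if (node.children.empty()) return "<" + node.name + "/>";
    std::string out = "<" + node.name + ">";
    for (const auto& child : node.children) out += ToXml(*child);
    return out + "</" + node.name + ">";
}
```

Calling `CreatePath` three times with the path aa, bb, cc and `alwaysNewLast` set to true yields the three `<cc/>` elements shown above; with it set to false, the second and third calls find the existing "cc" and add nothing.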
Create and CreateNew have an optional parameter to specify the namespace URI. If your element belongs to a namespace, then you must create it explicitly, using Create or CreateNew; you cannot rely on value assignment to create it automatically. More on this later. Note: if you call any instance Element method other than Create, CreateNew, the setters and the accessors when the element does not exist, Elmax will raise an exception!
Note: in the current version, the AddNode method can only add a node which has previously been removed.
At the beginning of the article, I showed how to create elements and assign a value to the last element at the same time. I'll repeat that code snippet here.
It turns out that this example is dangerous, as it relies on the compiler to pick the overloaded assignment operator. What if you mean to assign a float but assign an integer instead, just because you forgot to add a ".0" and append an 'f' to the value? Not much harm in this case, I suppose. In all scenarios, it is better to use the setter method to assign the value explicitly.
Here is the list of setter methods available.
At the beginning of the article, I showed how to read a value from an element. I'll repeat the code snippet here.
This is the more correct version, using the GetFloat accessor to specify a default value.
Price will get the default value of 10.0f if the value does not exist or is invalid, whereas the previous example will get 0.0f because no default value is specified. But by default, Elmax cannot tell that a string value is an improperly formatted float unless you use a regular expression to validate it. Set REGEX_CONV instead of NORMAL_CONV in the root element to use the regular expression type convertor. As an alternative, you can use a schema or DTD to validate your XML before doing Elmax parsing. To learn about schema or DTD validation, please consult your favorite MSDN.
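The difference between a plain conversion and a regex-validated one can be sketched in standard C++ as follows. This is an illustration only: `GetFloatOrDefault`, its parameters and the pattern are assumptions of mine, not the Elmax convertors, whose exact behaviour the article does not spell out.

```cpp
#include <cstdlib>
#include <regex>
#include <string>

// Hypothetical helper: convert text to float, returning defaultValue when
// the text is missing or unusable. With validateWithRegex set, the whole
// string must look like a float; without it, strtof happily accepts a
// numeric prefix such as the "12" in "12abc".
float GetFloatOrDefault(const std::string& text, float defaultValue,
                        bool validateWithRegex)
{
    if (validateWithRegex) {
        static const std::regex pattern(
            R"([+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)");
        if (!std::regex_match(text, pattern)) return defaultValue;
    }
    char* end = nullptr;
    const float value = std::strtof(text.c_str(), &end);
    if (end == text.c_str()) return defaultValue; // nothing was parsed
    return value;
}
```

With validation off, "12abc" converts to 12.0f and the stray "abc" goes unnoticed; with validation on, the same text falls back to the default. That is the kind of improper value the paragraph above warns about.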
This is the declaration of SetConvertor method.
To use your own custom type convertor, set the optional pConv pointer.
You are responsible for the deletion of pCustomTypeConv if it is allocated on the heap. There are locale type convertors in Elmax, but they are not tested at this point because I am not sure how to test them: in Asia, number representation is the same across countries, unlike in Europe. As a tip to readers who might be modifying Elmax, remember to run through all 220 unit tests to make sure you did not break anything after a modification. The unit tests can only be run in Visual Studio 2010. Below is a list of the value accessors available.
For GetBool and the interpretation of boolean value, "true", "yes", "ok" and "1" evaluate to be true while "false", "no", "cancel" and "0" evaluate to be false. They are not case-sensitive.
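The interpretation rules above can be written down as a small standalone function. This is a sketch of the stated rules only; `ParseBool` is a hypothetical name, and falling back to a caller-supplied default for unrecognised text is my assumption, since the article does not say what Elmax does in that case.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper implementing the rules listed above:
// "true", "yes", "ok", "1" -> true; "false", "no", "cancel", "0" -> false;
// matching is case-insensitive. Anything else yields defaultValue
// (an assumption, not documented Elmax behaviour).
bool ParseBool(const std::string& text, bool defaultValue)
{
    std::string lower(text);
    std::transform(lower.begin(), lower.end(), lower.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    if (lower == "true" || lower == "yes" || lower == "ok" || lower == "1")
        return true;
    if (lower == "false" || lower == "no" || lower == "cancel" || lower == "0")
        return false;
    return defaultValue;
}
```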
To create an element under a namespace URI, see below,
The XML output is as below,
<?xml version="1.0" encoding="UTF-8"?>
<All>
  <Version>1</Version>
  <Books>
    <Book xmlns="http://www.yahoo.com"/>
  </Books>
</All>
To create a bunch of elements and attribute under a namespace URI, see below,
The XML output is as below,
<All>
  <Version>1</Version>
  <Books>
    <Yahoo:Book xmlns:Yahoo="http://www.yahoo.com" Yahoo:ISBN="1111-1111-1111">
      <Yahoo:Title>How not to program!</Yahoo:Title>
      <Yahoo:Price>12.990000</Yahoo:Price>
      <Yahoo:Desc>Learn how not to program from bad programmers!</Yahoo:Desc>
      <Yahoo:AuthorID>111</Yahoo:AuthorID>
    </Yahoo:Book>
  </Books>
</All>
You can use the AsCollection method to get siblings with the same name in a vector.
This overloaded form (below) of AsCollection is faster as it does not create a temporary vector before returning.
You can use the GetCollection method to get children with the same name in a vector.
This overloaded form (below) of GetCollection is faster as it does not create a temporary vector before returning.
To query the number of children for each name, you can use QueryChildrenNum method.
There is also an overloaded form (below) of QueryChildrenNum which does not create a temporary vector before returning. Note: QueryChildrenNum can only query for elements, not attributes or CData sections or comments.
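The "faster overload" pattern used by AsCollection, GetCollection and QueryChildrenNum is the classic choice between returning a container by value and filling one supplied by the caller. Below is a generic sketch of the two forms, counting children per name; `NameCounts` and `CountChildrenByName` are hypothetical names and the signatures are not those of Elmax.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::size_t> NameCounts;

// Out-parameter form: fills a caller-owned container, so nothing is
// returned by value and the caller can reuse the container across calls.
void CountChildrenByName(const std::vector<std::string>& childNames,
                         NameCounts& counts)
{
    counts.clear();
    for (std::size_t i = 0; i < childNames.size(); ++i)
        ++counts[childNames[i]];
}

// Return-by-value form: more convenient to call, built on the form above.
NameCounts CountChildrenByName(const std::vector<std::string>& childNames)
{
    NameCounts counts;
    CountChildrenByName(childNames, counts);
    return counts;
}
```

On compilers with move semantics and return value optimization, the gap between the two forms may be small; the out-parameter form mattered more on the pre-C++0x compilers that were common when the article was written.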
In the previous enumeration example, I used
instead of
because the 2nd form creates temporary elements, "aa" and "bb", on the stack which are not used. The 1st form saves some tedious typing and only returns 1 element from the overloaded [] operator, not to mention that it is faster too. '\\' and '/' can be used as delimiters as well. To speed up the code below, which uses temporaries excessively,
you can assign it to an Element variable, and use that variable instead.
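As a side note on the single-string form of the [] operator: a delimited path has to be broken into element names at some point. The sketch below splits on '\\' and '/', the two delimiters named above. `SplitPath` is a hypothetical helper using narrow strings for brevity, and it says nothing about how Elmax does this internally.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split "aa/bb/cc" or "aa\\bb\\cc" into its element
// names. Empty segments (leading, trailing or doubled delimiters) are dropped.
std::vector<std::string> SplitPath(const std::string& path,
                                   const std::string& delimiters = "\\/")
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start <= path.size()) {
        const std::string::size_type end = path.find_first_of(delimiters, start);
        const std::string::size_type stop =
            (end == std::string::npos) ? path.size() : end;
        if (stop > start) parts.push_back(path.substr(start, stop - start));
        if (end == std::string::npos) break;
        start = end + 1;
    }
    return parts;
}
```

Parsing the path once and returning a single element is what lets the 1st form avoid the intermediate temporaries that a chain of [] calls creates.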
The root element is created when you call SetDomDoc on the element. You should know by now that the [] operator is used to access a child element. For the root element, however, the [] operator accesses the root itself and checks whether its name corresponds to the name given in the [] operator.
The "aa" element in the above example actually refers to the root, not a child of the root. If SetDomDoc() has not been called on an element, then "aa" refers to its child. When using the [] operator, please remember to prefix the (wide) string literal with 'L', e.g. elem[L"Hello"], else you will get a strange, unhelpful error.
To create an attribute (if it does not exist) and assign a string to it, see the example below.
To create an attribute with a namespace URI and assign a string to it, you have to create it explicitly.
To delete an attribute, use the Delete method.
To find out whether an attribute with a given name exists, use the Exists method.
The list of Attribute setters and accessors is the same as for Element, and they use the same type convertor.
Below are a bunch of operations you can use with comments. For your information, XML comments come in the form of <!--My example comments here-->
You can get a vector of Comment objects which are children of the element, using GetCommentCollection method.
Below are a bunch of operations you can use with CData sections. For your information, XML CData sections come in the form of <![CDATA[" <IgnoredInCDataSection/> "]]>. An XML CData section typically contains data which is not parsed by the parser, therefore it can contain < and > and other characters which are otherwise invalid in text. Some programmers prefer to store such data in Base64 format (see next section).
You can get a vector of CData sections which are children of the element, using GetCDataCollection method.
Some programmers prefer to store binary data in Base64 format under an element, instead of in a CData section, so that it is easy to identify and find. The downside is that the Base64 format takes up more space and the data conversion takes time. The code example shows how to convert to Base64 before assignment, and how to convert back from Base64 to binary data after reading.
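For readers who have not met Base64 before, here is a compact standalone encoder and decoder using the standard alphabet with '=' padding. It is a generic sketch for illustration, not the Base64 class that ships with Elmax; `Base64Encode` and `Base64Decode` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode every 3 input bytes as 4 output characters; pad the tail with '='.
std::string Base64Encode(const std::vector<std::uint8_t>& data)
{
    std::string out;
    out.reserve(((data.size() + 2) / 3) * 4);
    for (std::size_t i = 0; i < data.size(); i += 3) {
        const std::size_t rest = data.size() - i; // bytes left, at least 1
        std::uint32_t n = static_cast<std::uint32_t>(data[i]) << 16;
        if (rest > 1) n |= static_cast<std::uint32_t>(data[i + 1]) << 8;
        if (rest > 2) n |= static_cast<std::uint32_t>(data[i + 2]);
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += (rest > 1) ? kAlphabet[(n >> 6) & 63] : '=';
        out += (rest > 2) ? kAlphabet[n & 63] : '=';
    }
    return out;
}

// Decode until the first '='; characters outside the alphabet
// (for example line breaks) are skipped.
std::vector<std::uint8_t> Base64Decode(const std::string& text)
{
    std::vector<std::uint8_t> out;
    std::uint32_t buffer = 0;
    int bits = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '=') break;
        if (c == '\0') continue;
        const char* pos = std::strchr(kAlphabet, c);
        if (!pos) continue;
        buffer = (buffer << 6) | static_cast<std::uint32_t>(pos - kAlphabet);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}
```

Every 3 bytes become 4 characters, which is where the extra space mentioned above comes from: the encoded text is roughly a third larger than the binary data.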
The Elmax library defines some C++0x move constructors and move assignment operators. In order to build the library in Visual Studio versions prior to 2010, you have to hide them by defining _HAS_CPP0X to be 0 in stdafx.h.
The abstraction model and the library are named "Elmax" because there is an 'X', an 'M' and an 'L' in "Elmax". <whisper>I can tell you the real reason but you must not tell anyone, else I have to eliminate you from this world! The reason is that the author likes to crack jokes in real life. But all his jokes are deemed by everyone to be lame and cold. In the Chinese language, a cold joke means a joke which is not funny or laughable at all! If you rearrange the letters in "Elmax", you get "LameX", which refers to the author!</whisper>
In the next article, XML parsing is going to get even easier! That is, parsing is eliminated; the programmer does not have to do the XML parsing himself/herself! XML parsing is done automatically, along the lines of Object Relational Mapping (ORM). I personally don't see the need for the programmer to do XML parsing. Just pass in specially formatted structure(s) with an XML file and the library will fill in the structure for you! Just kidding! There is no way I'll have time for this, as my part-time Bachelor degree course is starting soon!
Thanks for reading!
For bug reports and feature requests, please file them here. When you file a bug report, please include the sample code and XML file (if any) needed to reproduce the bug. The current Elmax is at version 0.5 Beta. Its CodePlex site is located at http://elmax.codeplex.com/
Base64 conversion class used in Elmax is from Jan Raddatz's article on Codeguru: BASE 64 Decoding and Encoding Class
24/12/2010 : 1st release: My Christmas present for everyone! Happy holiday!
This article, along with any associated source code and files, is licensed under The Microsoft Public License (Ms-PL)
Wong Shao Voon
I guess I'll write here what I do in my free time, rather than an accolade of skills which I currently possess. I believe the things I do in my free time say more about me. When I am not working, I like to watch Japanese anime. I am also writing a movie script, hoping to see my own movie on the big screen one day. I like to jog because it makes me feel good, having done something meaningful in the morning before the day starts. I also write articles for IntelliProject; I have a few ideas to write about but never get around to writing them because of my hectic schedule.