asp tutorials, asp.net tutorials, sample code, and Microsoft news from 15Seconds
Data Access  |   Troubleshooting  |   Security  |   Performance  |   ADSI  |   Upload  |   Email  |   Control Building  |   Component Building  |   Forms  |   XML  |   Web Services  |   ASP.NET  |   .NET Features  |   .NET 2.0  |   App Development  |   App Architecture  |   IIS  |   Wireless
 
Pioneering Active Server
 Power Search





Active News
15 Seconds Weekly Newsletter
• Complete Coverage
• Site Updates
• Upcoming Features

More Free Newsletters
Reference
News
Articles
Archive
Writers
Code Samples
Components
Tools
FAQ
Feedback
Books
Links
DL Archives
Community
Messageboard
List Servers
Mailing List
WebHosts
Consultants
Tech Jobs
15 Seconds
Home
Site Map
Press
Legal
Privacy Policy
internet.commerce














internet.com
IT
Developer
Internet News
Small Business
Personal Technology
International

Search internet.com
Advertise
Corporate Info
Newsletters
Tech Jobs
E-mail Offers

HardwareCentral
Compare products, prices, and stores at Hardware Central!

Creating an In-Memory Database Using XML and XPath -- Part 2
By Niall Ginsbourg
Rating: 4.6 out of 5
Rate this article


  • email this article to a colleague
  • suggest an article

    Best Practices for Structuring XML Files

    (Part 1 located here)

    Speed is in the Structure. Some of you may have tried this before, and if you used something like ADO to produce your XML file from a recordset, you will find this technique very, very slow. I was put well off the scent about a year ago when I tried it myself. This is simply because I was using the ADO XML structure which was really designed for persisting/restoring recordsets and is not optimized for searching.

    The secret is in the structure of your XML data, and it makes all the difference. In fact, it suddenly transforms a slow, clunky solution into a mean, lean, turbo-charged machine.

    How Data Is Stored Internally in XML

    When I started trying out different XML representations of the same set of data, one of my first questions was what the optimal structure for searching with XPath was. I asked around, even spoke to the "experts," and didn't get many valuable responses. So instead, without being privy to Microsoft's exact implementation details, I made a few assumptions on how things work internally in MSXML 3, through various tinkering and testing.

    Flat Structure vs. Hierarchical Structure

    Look at how a flat (two-dimensional)-type structure would be represented, which is similar to the Microsoft ADO XML format. If you were to perform the query

    R/n[@A="1" and @B="2" and (@C="3" or @C="4")]

    and the criteria was embedded into each node (of type 'N'),

    e.g., <n id="1234" A="1" B="3", C="4">

    then the XML parser would have to evaluate every single node in the structure to see if it meets the required criteria.


    Figure 2.1 - Flat XML structure. Every node must be scanned for most queries.

    XML allows hierarchical tree structures of almost any dimension to be easily described, so it makes sense to use this feature for several reasons. Otherwise you would just persist this information into a two-dimensional array. By grouping the data so that a new set of child nodes exist at each decision point, you can eliminate the parser having to scan every single node in the structure to perform its search. The diagram below illustrates the same set of data branched at each decision point.


    Figure 2.2 - Hierarchical XML structure.

    If we were to perform an XPath query such as "R/A1/B2/*[name()="C3" or name()="C4")/n" then the following sets of nodes would be selected (highlighted in red):


    Figure 2.3 - Nodes scanned as result of XPath query.

    At each decision point we are eliminating large parts of the structure so by the time we get to the level of the nodes of type 'n' we only have a small subset of nodes to scan through. In this case, the nodes of type 'n' don't actually have to be evaluated at all.

    From this theory we can safely assume that a hierarchical structure will always be optimal (given the correct branching structure has been devised).

    Storing Values as Tags, Attributes, or Text

    A value inside XML can be stored in 3 ways:

    • as an attribute value (e.g., <A id="#VALUE#"/>)
    • as text (e.g., <A>#VALUE#</A>)
    • or as the node name itself (e.g., <#VALUE#/>)
    When it's stored as an attribute or as text, it is safe to assume that MSXML 3.0 has to allocate some memory and store the value into that location. If you are performing an XPath query such as *[@id="#VALUE"], then the parser has to first iterate through the potential nodes, find the memory location of where the attribute is stored, and perform a string comparison to determine if it has found a valid match.

    When MSXML parses a document, it generates a lookup table in memory of all the node names (which may be repeated several times throughout the structure) and gives them unique IDs. When you perform a query such as A/B/#VALUE#, it accesses the lookup tables and then uses them to quickly locate the correct matches. Although it is unclear whether each separate node type is indexed within its context - or it traverses through it performing simple numeric matches - it is safe to assume that this process would be faster than the string matching performed when saving a value as an attribute or text .

    Golden Rules for Structuring XML Files

    Based on the assumptions made above, we can devise a set of golden rules that will help you devise the best structure possible for your data.

    1. Thou shalt structure your XML so that child nodes are created at every decision point. You'll need to analyze what your query is doing to determine the best structure for this. Try to keep the number of child nodes at any one branch to less than 10-15 wherever possible.
    2. Thou shalt avoid using XML attributes for IDs you are searching for and those that are not unique. For example, if you have several nodes with the data, <Location dbid="1"> is better represented as <Location1>. If your search criteria includes a location, then it is faster to find nodes that meet the XPath search pattern.
    3. Thou shalt be sparse with characters and white space. Keep tags as short as possible. For example, for the Suburb node,it is better to simply represent it as <S#id#> because this will make the XML file smaller, - which in turn will reduce creation time, parsing time, and initial memory required for parsing. You may be tempted to indent your XML and keep every tag on separate lines, however, it adds up to wasted space when you have a large amount of records.

    How to Extract XML Files

    To come up with your best XML structure for searching, try a few different representations of your data. A you will need a way to generate each file. (See any of the articles on this site explaining how to create an XML file using ASP scripts or straight VB.)

    Benchmarking Your XML Performance
    The Test Scenario

    Here's a rundown of the problem at hand. As mentioned in the first part of this article, the data being stored into the IMDB is a list of 40,000 real estate advertisements. The HTML search form allows users to select various criteria to narrow down their search. The criteria is as follows:

    State: User selects one or more states of Australia (i.e., Victoria, NSW, Queensland).
    Suburb: User selects one or more suburbs/towns (i.e., South Melbourne, Carnegie, etc.).
    Type: User selects one or more house types (i.e., apartment, unit, townhouse, shop, etc.).
    Price Range: User selects one or more price ranges (i.e., $0 - $100K. $50 - $100K, etc.).
    SaleTypes: User selects one or more sale types (i.e., private, agent, auction, lease, etc.).
    Keywords: User inputs a list of keywords with Boolean logic (i.e., garage NOT carport). Each advertisement record contains an ID field for each of the criteria above (except the keywords field). and it is important to note that only one ID can be selected for each criteria. (For example, a property can only be categorized into one location, one type, one price range, and one sale type.)

    The keywords will be matched against the Property Title, Short Description, and Extended Description fields in the Advertisements table. However, only the Title, Description, and Advertiser Name need to be displayed to the user for the purpose of search results.

    In this set of tests we will deal with a subset of the ads, covering one state, which totals about 3,800 records. This will allow us to generate the XML and perform the tests more quickly and determine how changing the structure can affect performance.

    From the initial SQL Server profiling performed (described in Part 1), I was able to extract 150 search queries that had been performed on the site. The subset of data accurately represented the type of regular activity on the server (25percent of queries were keyword searches and the other 75 percent were index-based searches only).

    These 150 queries will be used to perform all the tests described below.

    SQL Server/ASP Performance Profile

    First let's benchmark the performance of the SQL version of this problem. In order to determine whether this is a worthwhile exercise we need to know how much better the performance is using XML.

    For the SQL tests I isolated the subset of advertisements into its own table (3800 records) and then recreated the various indices we were using on our live site.

    A standalone Visual Basic executable was created which ran the 150 queries against a dedicated SQL Server machine. The VB EXE was located on a separate machine, is similar to our Web site's setup. The executable used ADO 2.5.For fairness I conducted two types of tests with SQL - one where a new connection was made/closed for each query and another where a single connection was maintained throughout. As ASP normally pools your connections (but takes a small amount of time to look up the pool), I deduced that the real scenario on a Web site would be somewhere in between these two tests.

    The SQL test is slightly different to XML because it is using SQL's LIKE clause to locate keywords inside a field. As a result, "MAN" could match "man," "manager," "gamman," etc. Consequently, the SQL tests located an extra 250 records overall, but it is still close enough to do a head-to-head comparison.

    Lets look at the results from these two tests:

    Attempt 1: (SQL Server - New Connection per Query)


    Overall Matches: 8904 Total Time: 8843ms Avg per Query: 58.593ms

    Attempt 2: (SQL Server - One Connection per All Queries)


    Overall Matches: 8904 Total Time: 5849ms Avg per Query: 38.993ms

    As mentioned earlier, the average of these two times will more accurately reflect a traditional ASP/ADO/SQL solution.

    Average from SQL Server Tests


    Avg per Query: approx 48ms

    Therefore, the performance we are trying to beat with our IMDB will be 48ms.

    XML Performance with Different Structures

    For the purpose of illustrating how important structure is, we will start off with a flat structure primarily using attributes to store values and we will gradually apply the "best practices" mentioned above. Please note that we will use the attribute names A, B, C, D, instead of Suburb/Town, PropertyType, Industry, and Discipline to save disk/memory.In addition, we will also monitor the memory usage and parsing time for each XML structure because this will be an important issue when it comes to implementation.

    [ATTEMPTS 3 and 4: FLAT XML STRUCTURE WITH/WITHOUT WHITE SPACE] We will use a flat XML structure, similar to what ADO would produce if using the SavetoXML feature. However, we will include white space (in the form of indentations) to make it easier to read. Each record would look something like this:

    <ads>
     <j id="710185" A="1" B="1" C="1" D="2">
      <ja><![CDATA[ADVERTISER NAME]]></ja>
      <jd><![CDATA[DESCRIPTION IN HERE]]></jd>
      <jt><![CDATA[ADVERTISEMENT TITLE]]></jt>
      <k id="AFFORDABLE"/><k id="AVAILABLE"/><k id="AN"/><k id="ATRIUM"/>
    <k id="ATTIC"/><k id="AFTERMARKET"/><k id="BOUTIQUE"/><k id="BOULEVARD"/> <k id="BACKGROUND"/><k id="BE"/><k id="CARPORT"/> [REST OF KEYWORDS - IN HERE] </j>
    The relevant XPath query would then look something like this:

    ads/j[((@A="1")) and ((@B="3")) and ((@C="2")) and ((@D="18")) and ((k/@id="SPA") or (k/@id="POOL"))]

    The non-white-space version would use exactly the same XPath query.

    Let's put them to the test:

    Attempt 3: Flat XML Structure with White Space
    XML File: 5,165kb Memory Usage: 33.2MB Parsing Time: 3926ms
    Matches: 8628 Total Time: 15160ms Avg per Query: 101.080ms

    Attempt 4: Flat XML Structure with No White Space
    XML File: 5,052kb Memory Usage: 33.2MB Parsing Time: 3896ms
    Matches: 8628 Total Time: 15152ms Avg per Query: 101.013ms

    Well, these initial results weren't very impressive. In fact, it is over twice as slow as our SQL solution (which averaged around 48ms). The removal of white space helped the parsing and file size marginally. However, if we had used large tag names (such as "location"), then we can assume these timesand memory usage would have been significantly increased.

    [ATTEMPT 5: BRANCHED XML STRUCTURE]

    OK, so our first attempts were rather unsuccessful. Let's apply the Golden Rule #1 and branch our structure at each decision point (A, B, C, D).

    The structure for a record would now look something like this:

    <ads>
    <A id="1">
    <B id="1">
    <C id="1">
    <D id="2">
    <j id="710185">
    <ja><![CDATA[ADVERTISER NAME]]></ja>
    <jd><![CDATA[DESCRIPTION IN HERE]]></jd>
    <jt><![CDATA[ADVERTISEMENT TITLE]]></jt>
    <k id="AFFORDABLE"/><k id="AVAILABLE"/><k id="AN"/><k id="ATRIUM"/>
    <k id="ATTIC"/><k id="AFTERMARKET"/><k id="BOUTIQUE"/><k id="BOULEVARD"/> <k id="BACKGROUND"/><k id="BE"/><k id="CARPORT"/> [REST OF KEYWORDS - IN HERE] </j>
    The relevant XPath query would then look something like this:

    ads/A[@id="1"]/B[@id="3"]/C[@id="2"]/D[@id="18"]/j [(k/@id="SPA") or (k/@id="POOL")]

    Let's put it to the test:

    Attempt 5: BRANCHED XML Structure
    XML File: 4,979kb Memory Usage: 32.2MB Parsing Time: 3776ms
    Matches: 8628 Total Time: 5508ms Avg per Query: 36.720ms

    That one structure change sped up our solution up by 300 percent. We are also now down below the SQL Server performance benchmark of 48ms.

    But wait; there's still a lot more we can do with this!

    [ATTEMPT 6: BRANCHED XML STRUCTURE - KEYWORDS AS TAGS]

    Apply the second golden rule to our keyword nodes (which being at the bottom of the structure substantially outnumber any other nodes).

    For example, a keyword node <k id="KEYWORD"> would become <KEYWORD/>.

    The structure for a record would now look something like this:

    <ads>
    <A id="1">
    <B id="1">
    <C id="1">
    <D id="2">
    <j id="710185">
    <ja><![CDATA[ADVERTISER NAME]]></ja>
    <jd><![CDATA[DESCRIPTION IN HERE]]></jd>
    <jt><![CDATA[ADVERTISEMENT TITLE]]></jt>
    <AFFORD/><BOULEVARD/><AVAILABLE/><AN/><ATRIUM/>
    [REST OF KEYWORDS - IN HERE]
    </j>
    
    The relevant XPath query would look something like:

    ads/A[@id="1"]/B[@id="3"]/C[@id="2"]/D[@id="18"]/j [(SPA|POOL)]

    Let's put this to the test:

    Attempt 6: BRANCHED XML Structure - Keywords as Tags
    XML File: 3,346kb Memory Usage: 19.5MB Parsing Time: 2204ms
    Matches: 8628 Total Time: 721ms Avg per Query: 4.801ms

    We just sped this up by a further 900 percent by making that one change! We are now running at about 10 times the speed of our SQL solution. Not bad!

    Some other interesting results are that the file size is now 2 MB smaller, the memory usage has dropped by 10 MB, and the parsing time has almost halved.

    [ATTEMPT 7: BRANCHED XML STRUCTURE - KEYWORDS AND CRITERIA AS TAGS]

    Now let's try converting all our nonunique attributes into tags (A, B, C, D).

    For example, a keyword node <A id="1"> would become <A1>.

    The structure for a record would now look something like this:

    <ads>
    <A1>
    <B1>
    <C1>
    <D1>
    <j id="710185">
    <ja><![CDATA[ADVERTISER NAME]]></ja>
    <jd><![CDATA[DESCRIPTION IN HERE]]></jd>
    <jt><![CDATA[ADVERTISEMENT TITLE]]></jt>
    <AFFORD/><BOULEVARD/><AVAILABLE/><AN/><ATRIUM/>
    [REST OF KEYWORDS - IN HERE]
    </j>
    
    The relevant XPath query would look something like:

    ads/A1/B3/C2/D18/j [(SPA|POOL)]

    Let's put this to the test:

    Attempt 7: BRANCHED XML Structure - Keywords and CRITERIA as Tags
    XML File: 3,341kb Memory Usage: 19.5MB Parsing Time: 2264ms
    Matches: 8628 Total Time: 611ms Avg per Query: 4.073ms
    We just sped this up by a further 20 percent. The change was not as significant as the keyword one because the number of keyword tags far outnumber the number of criteria tags.

    [ATTEMPT 8: BRANCHED XML STRUCTURE - KEYWORDS/CRITERIA AS TAGS, KEYWORD GROUPS]

    If you remember, our first Golden Rule stated we should try to keep the maximum number of child nodes at any one branch below 10 - 15. If we look at the keywords, we find that there may be as many as 100 unique words per record.

    A simple change we can make here is to group the keywords by their starting letter, i.e., <ABUNDANT/><AFFORDABLE/><BOUTIQUE/>

    becomes

    <A><ABUNDANT/><AFFORDABLE/></A><B><BOUTIQUE/></B>

    The structure for a record would now look something like this:

    <ads>
    <A1>
    <B1>
    <C1>
    <D1>
    <j id="710185">
    <ja><![CDATA[ADVERTISER NAME]]></ja>
    <jd><![CDATA[DESCRIPTION IN HERE]]></jd>
    <jt><![CDATA[ADVERTISEMENT TITLE]]></jt>
    <A><AFFORD/><BOULEVARD/><AVAILABLE/><AN/><ATRIUM/></A>
    [REST OF KEYWORDS GROUPS/WORDS- IN HERE]
    </j>
    
    When constructing our XPath query,we can simply perform a Left function to determine the first letter of each keyword. The relevant XPath query would then look something like this:

    ads/A1/B3/C2/D18/j [(C/POOL|S/SPA)]

    Let's put this modification to the test:

    Attempt 8: BRANCHED XML Structure - Keywords and Criteria as Tags - Grouped Keywords
    XML File: 3,817kb Memory Usage: 21.7MB Parsing Time: 2714ms
    Matches: 8628 Total Time: 351ms Avg per Query: 2.340ms

    Once again we were able to speed this up by approximately 40 percent!

    As we actually added a number of new tags (for the keyword groups) the parsing, memory, and file-size figures all went up slightly. However, the increased speed was worth the trade-off!

    Summary of Testing Results

    Description

    File size kb

    Memory MB

    Parsing

    ms

    Avg Speed ms

    1) SQL Query — New Connection per Query

    N/A

    N/A

    N/A

    58.953

    2) SQL Query — One Connection Overall

    N/A

    N/A

    N/A

    38.993

    3) Flat XML — with White Space

    5,165

    33.2

    3926

    101.080

    4) Flat XML — No White Space

    5,052

    33.2

    3896

    101.013

    5) Branched XML on Criteria

    4,979

    32.2

    3776

    36.720

    6) Branched XML on Criteria — Keywords as Tags

    3,346

    19.5

    2204

    4.801

    7) Branched XML on Criteria — Keywords + IDs as Tags

    3,341

    19.5

    2264

    4.073

    8) Branched XML — Keywords + IDs as Tags — Grouped Keywords

    3,817

    21.7

    2714

    2.340

    We have sped up our searching speed from 48ms on SQL Server down to 2.340ms using the IMDB! That in itself should be enough to convince you that this is worthwhile solution -- but it should be implemented using the Golden Rules.

    From the above chart we can see just how important structure is, not just for speed, but it also reduces memory usage, file size, and initial parsing time. When you want to represent even more data than what we did here, these factors will be ever more important.

    Don't forget that you can have several of these IMDBs stored at once in your application object and that a single IMDB can store as many table-like structures as you want.

    Conclusion

    This section of the article shows you how structure affects performance, how XML works under the hood, and how much faster this solution is than a dedicated SQL Server.

    In the next part, I will start to attack some real-life issues, such as strategies for updating your XML cache as your database changes, ways of persisting the XML DOMs both inside and outside IIS, scalability issues, and how an IMDB can be used for off-line/batch processes that are becoming commonplace on large membership Web sites.

    About the Author

    Niall Ginsbourg is a chief Web architect at Seek Communications in Melbourne, Australia. He has spent the last years as a senior A/P, team leader, and consultant in various industries, specializing in Microsoft VB, SQL, ASP/Site-Server-based e-commerce and client/server solutions. Niall is self taught and also has knowledge in areas such as XML, WAP, Flash/Generator, Delphi, and extensive multimedia/CD-ROM production experience. He can be reached at nginsbourg@today.com.au.

  • Rate This Article
    Not HelpfulMost Helpful
    1 2 3 4 5
    Supporting Products/Tools
    Stonebroom.ASP2XML
    Stonebroom.ASP2XML(c) is an interface component designed to make building applications that transport data in XML format much easier. It can be used to automatically pass updates back to the original data source.
    [Top]
    Other Articles
    Sep 22, 2005 - Implementing Remote Calling Without Using AJAX
    Right now the latest buzzword around town is AJAX. AJAX is an acronym for Asynchronous JavaScript and XML and is a method used to implement remote calling. The problem is that AJAX is only implemented in ASP.NET 2.0. This article will show you one way to implement remote calling without using AJAX or the XMLHttpRequest object. The technique outlined can even be used from classic ASP and is sufficient for most remote calling needs.
    [Read This Article]  [Top]
    Aug 18, 2005 - SQL Server 2005 XQuery and XML-DML - Part 3
    This article is the third and final installment of Alex Homer's series covering the new XML support in Microsoft SQL Server 2005. In it he covers updating the contents of xml columns, comparing traditional XML update techniques with XQuery, and using XQuery in a managed code stored procedure.
    [Read This Article]  [Top]
    Aug 11, 2005 - SQL Server 2005 XQuery and XML-DML - Part 2
    In the second part of his series on SQL Server 2005's new XML support, Alex Homer looks at extracting data from XML columns, comparing traditional XML data access approaches with XQuery, and combining XQuery and XSL-T.
    [Read This Article]  [Top]
    Aug 3, 2005 - SQL Server 2005 XQuery and XML-DML - Part 1
    Microsoft SQL Server 2005 now offers great support for and close integration with XML as a data persistence format. In the first article of his series examining this new support, Alex Homer offers an overview of how SQL Server 2005 stores XML documents and schemas, examines how it supports querying and manipulating XML documents, and provides a simple test application that allows you to experiment with XQuery.
    [Read This Article]  [Top]
    Jun 30, 2005 - Reading and Writing XML in .NET Version 2.0 - Part 3, Cont'd
    In the final article of his series on reading and writing XML in .NET 2.0, Alex Homer looks at how the updated XML document store objects XmlDocument, XmlDataDocument and PathDocument can be used to read, persist and write XML documents and fragments more easily and more efficiently than in .NET 1.x.
    [Read This Article]  [Top]
    Jun 29, 2005 - Reading and Writing XML in .NET Version 2.0 - Part 3
    In the final article of his series on reading and writing XML in .NET 2.0, Alex Homer looks at how the updated XML document store objects XmlDocument, XmlDataDocument and PathDocument can be used to read, persist and write XML documents and fragments more easily and more efficiently than in .NET 1.x.
    [Read This Article]  [Top]
    Jun 16, 2005 - Reading and Writing XML in .NET Version 2.0 - Part 2, Cont'd
    Alex Homer continues his series on reading and writing XML in .NET 2.0. In part one, we focused on the reading side of things, examining the XmlReader and XmlReaderSettings classes. In this article, we move on to look at the XmlWriter and XmlWriterSettings classes, and how they can be used to write XML documents and fragments more easily and more efficiently than in version 1.x of .NET.
    [Read This Article]  [Top]
    Jun 15, 2005 - Reading and Writing XML in .NET Version 2.0 - Part 2
    Alex Homer continues his series on reading and writing XML in .NET 2.0. In part one, we focused on the reading side of things, examining the XmlReader and XmlReaderSettings classes. In this article, we move on to look at the XmlWriter and XmlWriterSettings classes, and how they can be used to write XML documents and fragments more easily and more efficiently than in version 1.x of .NET.
    [Read This Article]  [Top]
    Jun 2, 2005 - Reading and Writing XML in .NET Version 2.0 - Part 1, Cont'd
    In the first part of his series on reading and writing XML in .NET 2.0, Alex Homer discusses the XmlReader and XmlReaderSettings classes. The XmlReader exposes several useful new features and the all new XmlReaderSettings class makes it easy to generate single or multiple instances of an XmlReader with a range of useful properties.
    [Read This Article]  [Top]
    Jun 1, 2005 - Reading and Writing XML in .NET Version 2.0 - Part 1
    In the first part of his series on reading and writing XML in .NET 2.0, Alex Homer discusses the XmlReader and XmlReaderSettings classes. The XmlReader exposes several useful new features and the all new XmlReaderSettings class makes it easy to generate single or multiple instances of an XmlReader with a range of useful properties.
    [Read This Article]  [Top]
    Mailing List
    Want to receive email when the next article is published? Just Click Here to sign up.

    Support the Active Server Industry

    internet.comearthweb.comDevx.commediabistro.comGraphics.com

    Search:

    Jupitermedia Corporation has two divisions: Jupiterimages and JupiterOnlineMedia

    Jupitermedia Corporate Info

    Legal Notices, Licensing, Reprints, Permissions, Privacy Policy.
    Advertise | Newsletters | Tech Jobs | Shopping | E-mail Offers