What are the Options for Storing Hierarchical Data in a Relational Database?
What are the Options for Storing Hierarchical Data in a Relational Database?
Good Overviews
Generally speaking you're making a decision between fast read times (e.g. nested set) or fast write times (adjacency list). Usually you end up with a combination of the options below that best fit your needs. The following provides some in depth reading:
- One more Nested Intervals vs. Adjacency List comparison: the best comparison of Adjacency List, Materialized Path, Nested Set and Nested Interval I've found.
- Models for hierarchical data: slides with good explanations of tradeoffs and example usage
- Representing hierarchies in MySQL: very good overview of Nested Set in particular
- Hierarchical data in RDBMSs: most comprehensive and well organized set of links I've seen, but not much in the way on explanation
Options
Ones I am aware of and general features:
- Adjacency List:
- Columns: ID, ParentID
- Easy to implement.
- Cheap node moves, inserts, and deletes.
- Expensive to find level (can store as a computed column), ancestry & descendants (Bridge Table combined with level column can solve), path (Lineage Column can solve).
- Use Common Table Expressions in those databases that support them to traverse.
- Nested Set (a.k.a Modified Preorder Tree Traversal)
- Popularized by Joe Celko in numerous articles and his book Trees and Hierarchies in SQL for Smarties
- Columns: Left, Right
- Cheap level, ancestry, descendants
- Compared to Adjacency List, moves, inserts, deletes more expensive.
- Requires a specific sort order (e.g. created). So sorting all descendants in a different order requires additional work.
- Nested Intervals
- Combination of Nested Sets and Materialized Path where left/right columns are floating point decimals instead of integers and encode the path information. In the later development of this idea nested intervals gave rise to matrix encoding.
- Bridge Table (a.k.a. Closure Table: some good ideas about how to use triggers for maintaining this approach)
- Columns: ancestor, descendant
- Stands apart from table it describes.
- Can include some nodes in more than one hierarchy.
- Cheap ancestry and descendants (albeit not in what order)
- For complete knowledge of a hierarchy needs to be combined with another option.
- Flat Table
- A modification of the Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
- Expensive move and delete
- Cheap ancestry and descendants
- Good Use: threaded discussion - forums / blog comments
- Lineage Column (a.k.a. Materialized Path, Path Enumeration)
- Column: lineage (e.g. /parent/child/grandchild/etc...)
- Limit to how deep the hierarchy can be.
- Descendants cheap (e.g.
LEFT(lineage, #) = '/enumerated/path'
) - Ancestry tricky (database specific queries)
- Multiple lineage columns
- Columns: one for each lineage level, refers to all the parents up to the root, levels down from the items level are set to NULL
- Limit to how deep the hierarchy can be
- Cheap ancestors, descendants, level
- Cheap insert, delete, move of the leaves
- Expensive insert, delete, move of the internal nodes
Database Specific Notes
MySQL
Oracle
- Use CONNECT BY to traverse Adjacency Lists
PostgreSQL
- ltree datatype for Materialized Path
SQL Server
- General summary
- 2008 offers HierarchyId data type appears to help with Lineage Column approach and expand the depth that can be represented.
Answer by CesarGon for What are the Options for Storing Hierarchical Data in a Relational Database?
This is a very partial answer to your question, but I hope still useful.
Microsoft SQL Server 2008 implements two features that are extremely useful for managing hierarchical data:
- the HierarchyId data type.
- common table expressions, using the with keyword.
Have a look at this article for starts. See also my own question here.
Answer by Paul Morgan for What are the Options for Storing Hierarchical Data in a Relational Database?
Joe Celko wrote the book on SQL Trees & Hiearichies
This is the first edition. Look at the second edition in Bob's comment.
Answer by Quassnoi for What are the Options for Storing Hierarchical Data in a Relational Database?
Some articles from my blog on the subject:
- Adjacency list vs. nested sets: MySQL
- Adjacency list vs. nested sets: PostgreSQL
- Adjacency list vs. nested sets: Oracle
Hierarchical queries in MySQL (querying adjacency lists in
MySQL
)
Answer by Tegiri Nenashi for What are the Options for Storing Hierarchical Data in a Relational Database?
This is kind of a question that is still interesting even after all big 3 vendors implemented Recursive WITH
clause. I'd suggest that different readers would be pleased with different answers.
- Comprehensive list of references by Troels Arvin.
- For the lack of competition, introductory textbook by Joe Celko "Trees and Hierarchies in SQL for Smarties" can indeed be considered a classic.
- Review of various tree encodings with emphasis to nested intervals.
Answer by Jeff Moden for What are the Options for Storing Hierarchical Data in a Relational Database?
My favorite answer is as what the first sentence in this thread suggested. Use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.
The problem up until now has been that the coversion method from an Adjacecy List to Nested Sets has been frightfully slow because most people use the extreme RBAR method known as a "Push Stack" to do the conversion and has been considered to be way to expensive to reach the Nirvana of the simplicity of maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.
I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.
Duration for 1,000 Nodes = 00:00:00:870 Duration for 10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10) Duration for 100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) Duration for 1,000,000 Nodes = 'Didn't even try this'
And here's the duration for the new method (with the push stack method in parenthesis).
Duration for 1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870) Duration for 10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783) Duration for 100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730) Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)
Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.
You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/
I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/
If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.
Answer by Adam Sanderson for What are the Options for Storing Hierarchical Data in a Relational Database?
If your database supports arrays, you can also implement a lineage column or materialized path as an array of parent ids.
Specifically with Postgres you can then use the set operators to query the hierarchy, and get excellent performance with GIN indices. This makes finding parents, children, and depth pretty trivial in a single query. Updates are pretty manageable as well.
I have a full write up of using arrays for materialized paths if you're curious.
Answer by TMS for What are the Options for Storing Hierarchical Data in a Relational Database?
This design was not mentioned yet:
Multiple lineage columns
Though it has limitations, if you can bear them, it's very simple and very efficient. Features:
- Columns: one for each lineage level, refers to all the parents up to the root, levels below the current items' level are set to NULL
- Limit to how deep the hierarchy can be
- Cheap ancestors, descendants, level
- Cheap insert, delete, move of the leaves
- Expensive insert, delete, move of the internal nodes
Here follows an example - taxonomic tree of birds so the hierarchy is Class/Order/Family/Genus/Species - species is the lowest level, 1 row = 1 taxon (which corresponds to species in the case of the leaf nodes):
CREATE TABLE `taxons` ( `TaxonId` smallint(6) NOT NULL default '0', `ClassId` smallint(6) default NULL, `OrderId` smallint(6) default NULL, `FamilyId` smallint(6) default NULL, `GenusId` smallint(6) default NULL, `Name` varchar(150) NOT NULL default '' );
and the example of the data:
+---------+---------+---------+----------+---------+-------------------------------+ | TaxonId | ClassId | OrderId | FamilyId | GenusId | Name | +---------+---------+---------+----------+---------+-------------------------------+ | 254 | 0 | 0 | 0 | 0 | Aves | | 255 | 254 | 0 | 0 | 0 | Gaviiformes | | 256 | 254 | 255 | 0 | 0 | Gaviidae | | 257 | 254 | 255 | 256 | 0 | Gavia | | 258 | 254 | 255 | 256 | 257 | Gavia stellata | | 259 | 254 | 255 | 256 | 257 | Gavia arctica | | 260 | 254 | 255 | 256 | 257 | Gavia immer | | 261 | 254 | 255 | 256 | 257 | Gavia adamsii | | 262 | 254 | 0 | 0 | 0 | Podicipediformes | | 263 | 254 | 262 | 0 | 0 | Podicipedidae | | 264 | 254 | 262 | 263 | 0 | Tachybaptus |
This is great because this way you accomplish all the needed operations in a very easy way, as long as the internal categories don't change their level in the tree.
Answer by hyc for What are the Options for Storing Hierarchical Data in a Relational Database?
For more complex hierarchies, like an LDAP tree, the OpenLDAP-MySQL Cluster architecture was presented at the MySQL User Conference back in 2009.
http://en.oreilly.com/mysql2009/public/schedule/detail/6219
It's quite similar to the "Multiple lineage columns" scheme shown above.
Answer by djhallx for What are the Options for Storing Hierarchical Data in a Relational Database?
This is really a square peg, round hole question.
If relational databases and SQL are the only hammer you have or are willing to use, then the answers that have been posted thus far are adequate. However, why not use a tool designed to handle hierarchical data? Graph database are ideal for complex hierarchical data.
The inefficiencies of the relational model along with the complexities of any code/query solution to map a graph/hierarchical model onto a relational model is just not worth the effort when compared to the ease with which a graph database solution can solve the same problem.
Consider a Bill of Materials as a common hierarchical data structure.
class Component extends Vertex { long assetId; long partNumber; long material; long amount; }; class PartOf extends Edge { }; class AdjacentTo extends Edge { };
Shortest path between two sub-assemblies: Simple graph traversal algorithm. Acceptable paths can be qualified based on criteria.
Similarity: What is the degree of similarity between two assemblies? Perform a traversal on both sub-trees computing the intersection and union of the two sub-trees. The percent similar is the intersection divided by the union.
Transitive Closure: Walk the sub-tree and sum up the field(s) of interest, e.g. "How much aluminum is in a sub-assembly?"
Yes, you can solve the problem with SQL and a relational database. However, there are much better approaches if you are willing to use the right tool for the job.
Answer by azerafati for What are the Options for Storing Hierarchical Data in a Relational Database?
Adjacency Model + Nested Sets Model
I went for it because I could insert new items to the tree easily (you just need a branch's id to insert a new item to it) and also query it quite fast.
+-------------+----------------------+--------+-----+-----+ | category_id | name | parent | lft | rgt | +-------------+----------------------+--------+-----+-----+ | 1 | ELECTRONICS | NULL | 1 | 20 | | 2 | TELEVISIONS | 1 | 2 | 9 | | 3 | TUBE | 2 | 3 | 4 | | 4 | LCD | 2 | 5 | 6 | | 5 | PLASMA | 2 | 7 | 8 | | 6 | PORTABLE ELECTRONICS | 1 | 10 | 19 | | 7 | MP3 PLAYERS | 6 | 11 | 14 | | 8 | FLASH | 7 | 12 | 13 | | 9 | CD PLAYERS | 6 | 15 | 16 | | 10 | 2 WAY RADIOS | 6 | 17 | 18 | +-------------+----------------------+--------+-----+-----+
- Every time you need all children of any parent you just query the
parent
column. - If you needed all descendants of any parent you query for items which have their
lft
betweenlft
andrgt
of parent. - If you needed all parents of any node up to the root of the tree, you query for items having
lft
lower than the node'slft
andrgt
bigger than the node'srgt
and sort the byparent
.
I needed to make accessing and querying the tree faster than inserts, that's why I chose this
The only problem is to fix the left
and right
columns when inserting new items. well I created a stored procedure for it and called it every time I inserted a new item which was rare in my case but it is really fast. I got the idea from the Joe Celko's book, and the stored procedure and how I came up with it is explained here in DBA SE http://dba.stackexchange.com/q/89051/41481
Answer by IVO GELOV for What are the Options for Storing Hierarchical Data in a Relational Database?
I am using PostgreSQL with closure tables for my hierarchies. I have one universal stored procedure for the whole database:
CREATE FUNCTION nomen_tree() RETURNS trigger LANGUAGE plpgsql AS $_$ DECLARE old_parent INTEGER; new_parent INTEGER; id_nom INTEGER; txt_name TEXT; BEGIN -- TG_ARGV[0] = name of table with entities with PARENT-CHILD relationships (TBL_ORIG) -- TG_ARGV[1] = name of helper table with ANCESTOR, CHILD, DEPTH information (TBL_TREE) -- TG_ARGV[2] = name of the field in TBL_ORIG which is used for the PARENT-CHILD relationship (FLD_PARENT) IF TG_OP = 'INSERT' THEN EXECUTE 'INSERT INTO ' || TG_ARGV[1] || ' (child_id,ancestor_id,depth) SELECT $1.id,$1.id,0 UNION ALL SELECT $1.id,ancestor_id,depth+1 FROM ' || TG_ARGV[1] || ' WHERE child_id=$1.' || TG_ARGV[2] USING NEW; ELSE -- EXECUTE does not support conditional statements inside EXECUTE 'SELECT $1.' || TG_ARGV[2] || ',$2.' || TG_ARGV[2] INTO old_parent,new_parent USING OLD,NEW; IF COALESCE(old_parent,0) <> COALESCE(new_parent,0) THEN EXECUTE ' -- prevent cycles in the tree UPDATE ' || TG_ARGV[0] || ' SET ' || TG_ARGV[2] || ' = $1.' || TG_ARGV[2] || ' WHERE id=$2.' || TG_ARGV[2] || ' AND EXISTS(SELECT 1 FROM ' || TG_ARGV[1] || ' WHERE child_id=$2.' || TG_ARGV[2] || ' AND ancestor_id=$2.id); -- first remove edges between all old parents of node and its descendants DELETE FROM ' || TG_ARGV[1] || ' WHERE child_id IN (SELECT child_id FROM ' || TG_ARGV[1] || ' WHERE ancestor_id = $1.id) AND ancestor_id IN (SELECT ancestor_id FROM ' || TG_ARGV[1] || ' WHERE child_id = $1.id AND ancestor_id <> $1.id); -- then add edges for all new parents ... INSERT INTO ' || TG_ARGV[1] || ' (child_id,ancestor_id,depth) SELECT child_id,ancestor_id,d_c+d_a FROM (SELECT child_id,depth AS d_c FROM ' || TG_ARGV[1] || ' WHERE ancestor_id=$2.id) AS child CROSS JOIN (SELECT ancestor_id,depth+1 AS d_a FROM ' || TG_ARGV[1] || ' WHERE child_id=$2.' || TG_ARGV[2] || ') AS parent;' USING OLD, NEW; END IF; END IF; RETURN NULL; END; $_$;
Then for each table where I have a hierarchy, I create a trigger
CREATE TRIGGER nomenclature_tree_tr AFTER INSERT OR UPDATE ON nomenclature FOR EACH ROW EXECUTE PROCEDURE nomen_tree('my_db.nomenclature', 'my_db.nom_helper', 'parent_id');
For populating a closure table from existing hierarchy I use this stored procedure:
CREATE FUNCTION rebuild_tree(tbl_base text, tbl_closure text, fld_parent text) RETURNS void LANGUAGE plpgsql AS $$ BEGIN EXECUTE 'TRUNCATE ' || tbl_closure || '; INSERT INTO ' || tbl_closure || ' (child_id,ancestor_id,depth) WITH RECURSIVE tree AS ( SELECT id AS child_id,id AS ancestor_id,0 AS depth FROM ' || tbl_base || ' UNION ALL SELECT t.id,ancestor_id,depth+1 FROM ' || tbl_base || ' AS t JOIN tree ON child_id = ' || fld_parent || ' ) SELECT * FROM tree;'; END; $$;
Closure tables are defined with 3 columns - ANCESTOR_ID, DESCENDANT_ID, DEPTH. It is possible (and I even advice) to store records with same value for ANCESTOR and DESCENDANT, and a value of zero for DEPTH. This will simplify the queries for retrieval of the hierarchy. And they are very simple indeed:
-- get all descendants SELECT tbl_orig.*,depth FROM tbl_closure LEFT JOIN tbl_orig ON descendant_id = tbl_orig.id WHERE ancestor_id = XXX AND depth <> 0; -- get only direct descendants SELECT tbl_orig.* FROM tbl_closure LEFT JOIN tbl_orig ON descendant_id = tbl_orig.id WHERE ancestor_id = XXX AND depth = 1; -- get all ancestors SELECT tbl_orig.* FROM tbl_closure LEFT JOIN tbl_orig ON ancestor_id = tbl_orig.id WHERE descendant_id = XXX AND depth <> 0; -- find the deepest level of children SELECT MAX(depth) FROM tbl_closure WHERE ancestor_id = XXX;
Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 72
0 comments:
Post a Comment