Thursday, May 13, 2004

It's amazing what 6 hours sleep will do. And to think, 4 months ago I thought I needed nearly 8 to get a good night's sleep. :-)

I refactored my code from yesterday. I replaced the triples with constraints, as I said I would. The main difference in structure is that a constraint has a model and possibly a NOT statement. Model quantifiers are not legal for these constraints, as we only want to iterate through the transitive predicates within the models in the from clause, and further modifiers would not be of much use.

Since the grammar is looking for a constraint with an optional model, and an optional NOT statement, I had to include code to test for the model or NOT modifier and throw a QueryException if either is found. Similarly, I was able to test for appropriate placement of variables. The TransitiveConstraint object from yesterday then had to change quite a bit to handle the encapsulated constraints. I was able to wrap the TransitiveConstraint object around the constraint for the transitive predicate in order to meet most of the ConstraintExpression interface. This refactoring involved changing much more code than I expected, but it's done now.
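The modifier check boils down to something like the following Java sketch. All the class and method names here are made up for illustration (only QueryException comes from the actual code); the real grammar-layer code is structured differently:

```java
// Hypothetical sketch of rejecting illegal modifiers on a transitive
// constraint. QueryException is real; everything else is illustrative.
public class TransitiveCheck {
    static class QueryException extends Exception {
        QueryException(String msg) { super(msg); }
    }

    // A model is allowed on the inner constraint, but NOT statements and
    // model quantifiers must be rejected with a QueryException.
    static void validate(boolean hasNot, boolean hasModelQuantifier)
            throws QueryException {
        if (hasNot) {
            throw new QueryException("NOT is illegal in a transitive constraint");
        }
        if (hasModelQuantifier) {
            throw new QueryException("Model quantifiers are illegal in a transitive constraint");
        }
    }

    // Convenience wrapper for exercising the check.
    static boolean rejects(boolean hasNot, boolean hasModelQuantifier) {
        try {
            validate(hasNot, hasModelQuantifier);
            return false;
        } catch (QueryException e) {
            return true;
        }
    }
}
```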

Now that I have these constraints available I can use them directly when they get down to the query layer. I can allow the query to go ahead as usual with just the transitive predicate constraint, and it will find each instance for me. I can then use the anchor (if it exists) to start the run through these results, and find my starting position. Otherwise I can just iterate over the whole result set to look for the transitive statements I'm after.

I'm disappointed that it took so long to get out of the grammar code, but it should all go smoothly from here.

Second Design Revisited
Now that I'm more awake than yesterday I can recall the details of this design much better. :-)

To recap: There are 7 index files. Each file is a hashmap which takes one of the 7 possible combinations of Subject, Predicate and Object as its keys. Each value from the hashmap is an index into a data file.

The 3 indexes based on the "single element" hash functions (Subject, Predicate or Object hashes) point to an entry in their data files. These entries contain 3 items: a next pointer; a previous pointer; and an index into the next data file. These entries form a circular doubly-linked list that covers every possible value for their key (so the "subject" data file holds a linked list with elements for each possible subject). A small "Root" file with 3 indexes in it holds the heads of these linked lists. These heads are needed to find a starting point for each of the linked lists.

The index into the next data file is for the "pair-index" data file which is based on the pair of elements that starts with the element used in the first index. Going with the example of the "subject" index, its data file holds indexes into the "subject-predicate" data file. The entries found in this file also form a circular linked list, only this time there are several lists in the file. Each list represents subject-predicate pairs with a fixed subject. So if there are n subjects in a graph, there will be n linked lists in the subject-predicate data file. Similarly, entries in the predicate data file point to lists in the predicate-object data file, and entries in the object data file point to lists in the object-subject data file. Lists in the predicate-object file are for fixed predicates, and lists in the object-subject file are for fixed objects.

Sticking to the example of the subject-predicate pair, if only the subject is known then the hash is done on the subject to find an entry in the "subject" data file. Then the entry from the "subject" data file will find the list of subject-predicate pairs in the "subject-predicate" data file. Alternatively, if both the subject and predicate are known, then a hash can be done on the 2 of them together, and the subject-predicate index hashmap can be looked up. This will find the subject-predicate entry immediately in the subject-predicate data file (the fact that this entry is in a linked list is of no interest if the entry was found using both data elements, as there is no need to find pairs with a different predicate).
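The two lookup paths can be modelled with a few in-memory maps. This is just a toy sketch: the integers stand in for file offsets, the string key "s1|p1" stands in for hashing the pair, and none of these names exist in the real code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the two lookup paths: hash the subject alone and indirect
// through the "subject" data file, or hash subject+predicate and jump
// straight into the "subject-predicate" data file. Integers simulate
// file offsets; all names are illustrative.
public class PairLookup {
    // subject -> offset of its entry in the "subject" data file
    static Map<String, Integer> subjectIndex = new HashMap<>();
    // "subject" data file: entry offset -> offset of the pair list head
    // in the "subject-predicate" data file
    static Map<Integer, Integer> subjectDataFile = new HashMap<>();
    // (subject, predicate) -> offset in the "subject-predicate" data file
    static Map<String, Integer> subjectPredicateIndex = new HashMap<>();

    static {
        subjectIndex.put("s1", 0);              // hash(s1) -> entry 0
        subjectDataFile.put(0, 42);             // entry 0 -> pair list at 42
        subjectPredicateIndex.put("s1|p1", 42); // hash(s1,p1) -> entry 42
    }

    // Path 1: only the subject is known (two lookups, one indirection).
    static int lookupBySubject(String s) {
        return subjectDataFile.get(subjectIndex.get(s));
    }

    // Path 2: subject and predicate both known (one direct lookup).
    static int lookupByPair(String s, String p) {
        return subjectPredicateIndex.get(s + "|" + p);
    }
}
```

Both paths land on the same subject-predicate entry; the second just skips the indirection.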

The entries in the index-pair data files point to entries in the "Triple" data file. The entries in this file hold more information than in the previous data files. For a start, they hold the entire triple. They also hold "next" and "prev" pointers like the other data files, but in this case they hold 6 such pointers - a pair for the next/prev subject, the next/prev object, and a pair for the next/prev predicate. So each "triple" entry participates in three linked lists, with each list varying by only one element (lists may be only 1 element long).
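A triple entry with its three pointer pairs might look like the sketch below, using in-memory references in place of file offsets (the class and field names are mine, not the project's):

```java
// Sketch of a "triple file" entry: the full triple plus three next/prev
// pointer pairs, so each entry participates in three circular
// doubly-linked lists (one per varying element). In-memory references
// stand in for file offsets; all names are illustrative.
public class TripleEntry {
    long subject, predicate, object;
    // index 0: subject list, 1: predicate list, 2: object list
    TripleEntry[] next = new TripleEntry[3];
    TripleEntry[] prev = new TripleEntry[3];

    TripleEntry(long s, long p, long o) {
        subject = s; predicate = p; object = o;
        // A new entry starts as a list of length 1 in all three lists.
        for (int i = 0; i < 3; i++) { next[i] = this; prev[i] = this; }
    }

    // Splice this entry in before 'head' in list number 'list'.
    void insertBefore(TripleEntry head, int list) {
        prev[list] = head.prev[list];
        next[list] = head;
        head.prev[list].next[list] = this;
        head.prev[list] = this;
    }

    // Count the entries in list number 'list', starting here.
    int length(int list) {
        int n = 1;
        for (TripleEntry t = next[list]; t != this; t = t.next[list]) n++;
        return n;
    }

    // Demo: two triples varying only in object share one list.
    static boolean demo() {
        TripleEntry a = new TripleEntry(1, 2, 3);
        TripleEntry b = new TripleEntry(1, 2, 4);
        b.insertBefore(a, 2);
        return a.length(2) == 2 && a.length(0) == 1 && b.next[2] == a;
    }
}
```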

The data files which represent the single element and dual element data don't actually carry the value of the elements in the node. This is because the value can be looked up in the triple data file. So if a subject were looked up and it led to a list of subject-predicates, then for any of these entries it is possible to look through to the triple file and see what the subject-predicate of that triple is (the object could be anything, depending on where in the linked list the subject-predicate index pointed to). This does lead to an extra disk read in order to know what data we have, but in almost every case we need to get to a triple before we can start using the data anyway.

Lookups are performed as follows:

  • Take the known nodes and hash.
  • Look up the appropriate index for these nodes, and go to the correct place in the associated data file.
  • Return an iterator which knows about the searched-for nodes, and has the index into the correct data file.

Iterating for a single known node means indirecting through the first data file into the list in the second data file. For each element of this list, indirect through to the list in the triple data file. Then iterate through this list via the appropriate "next" pointer.

Iterating for a pair of known nodes is the same, only now the iterator starts with an index to a list in the second data file, and indirects through these into the lists in the triple file.

Iterators for all three nodes known are trivial, as they can find the triple directly (with the hashmap index for the triple file), and don't have to iterate at all.
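The iterator idea can be sketched as follows: remember the starting position and walk "next" pointers until arriving back at it. Here Node stands in for a data-file entry and the circular list lives in memory; none of these names come from the actual code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal sketch of iterating a circular doubly-linked list from a known
// start: walk "next" pointers until back at the starting entry. Node is
// an in-memory stand-in for a data-file entry.
public class CircularIterator implements Iterator<Long> {
    static class Node {
        long value;
        Node next, prev;
        Node(long v) { value = v; next = prev = this; }
    }

    private final Node start;
    private Node current;
    private boolean done;

    CircularIterator(Node start) { this.start = start; this.current = start; }

    public boolean hasNext() { return !done; }

    public Long next() {
        long v = current.value;
        current = current.next;
        if (current == start) done = true;  // completed the circle
        return v;
    }

    // Demo: build a circular list from the values and iterate it once.
    static List<Long> roundTrip(long... values) {
        Node head = new Node(values[0]);
        for (int i = 1; i < values.length; i++) {
            Node n = new Node(values[i]);
            n.prev = head.prev; n.next = head;      // insert before head,
            head.prev.next = n; head.prev = n;      // i.e. at the tail
        }
        List<Long> out = new ArrayList<>();
        CircularIterator it = new CircularIterator(head);
        while (it.hasNext()) out.add(it.next());
        return out;
    }
}
```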

Insertions are performed as follows:

  • Find a free entry in the triple file (using a free list - described for the first design), and put the triple there. Remember where it went.
  • Hash the full triple, and put the triple location in the data file into the correct position in the full triple hash table.
  • For each of the paired-indexes:
    • Hash the pair from the triple to test if the entry already exists in the appropriate index file. If so, then look up the location in the paired-index data file for the triple list based on that pair. Modify next/prev pointers of the found triple (and the one before it) and the newly stored triple to insert the new triple in the list. The insert is now over for that index-pair, and the single index work (below) can be skipped.
    • Otherwise the pair was not in the hashtable, so find a free entry in the paired-index data file and point it to the element in the triple file. Point the triple's appropriate next and prev pointers back at itself, so it forms a list of length 1.
  • For each of the single-indexes:
    • Hash the single node value from the triple to test if the entry already exists in the appropriate index file. If so, then look up the location in the single-index data file for the list of pairs based on that node. Modify next/prev pointers of the pair found at the head of this list (and the one before it) and the newly stored pair entry to insert this new entry in the list. The insert is now over for that single node.
    • Otherwise the single node was not in the hashtable, so find a free entry in the single-index data file and point it to the element in the index-pair data file. Point this element's appropriate next and prev pointers back at itself, so it forms a list of length 1.
    • Go to the root file and find the head of the list in the data file. Adjust next/prev pointers to insert the new entry in the single-index data file.
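The interesting part of the insertion steps is the branching: if the pair already hashes to a list, the new triple is spliced in and the single-index work is skipped entirely. A rough sketch of that control flow, with in-memory maps standing in for the hash index files (all names are mine):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the insertion branching: an existing pair means "splice in
// and stop"; a new pair falls through to the single-index work. Maps and
// Entry objects stand in for index files and data-file entries.
public class InsertSketch {
    static class Entry {
        Entry next = this, prev = this;     // length-1 circular list
        void insertAfter(Entry head) {
            next = head.next; prev = head;
            head.next.prev = this; head.next = this;
        }
    }

    static Map<String, Entry> pairIndex = new HashMap<>();
    static Map<String, Entry> singleIndex = new HashMap<>();

    // Returns true if the single-index work was needed (a new pair).
    static boolean insert(String single, String pair) {
        Entry triple = new Entry();             // new triple-file entry
        Entry head = pairIndex.get(pair);
        if (head != null) {
            triple.insertAfter(head);           // pair exists: splice in,
            return false;                       // skip the single-index work
        }
        pairIndex.put(pair, triple);            // new length-1 list
        Entry pairEntry = new Entry();          // new paired-index entry
        Entry listHead = singleIndex.get(single);
        if (listHead != null) pairEntry.insertAfter(listHead);
        else singleIndex.put(single, pairEntry); // plus root update, omitted
        return true;
    }
}
```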

Deletions are performed as follows:

  • Find the triple with the triple index and remove it from the hashmap.
  • Remove the triple in the triple data file from its three linked lists (by modifying next/prev pointers for it and for the next/prev triples). Add the triple's index to the free list for the file.
  • Look up all 3 of the pair-indexes for this triple. If any of them pointed to the removed triple entry then update to the next triple in the list.
  • If a triple had no next triple for any of its indexes, then perform the following steps on those indexes; otherwise the deletion is over.
  • Go back to the pair-index hashmap and remove the entry. Go to the data file entry and remove it from its linked list. Add the entry to the free list for the file.
  • Look up the single-index hashmap for this pair. If it pointed to the removed pair-index entry then update to the next pair-index entry in the list.
  • If the pair-index data entry had no next entry, then perform the following steps on the associated single-index hashmap and data file; otherwise the deletion for this index is over.
  • Go back to the single-index hashmap and remove the entry. Go to the data file entry and remove it from its linked list. Add the entry to the free list for the file.
  • Look up the Root for this node type. If it pointed to the single node that was just deleted then point it to the next one instead. If there is no next one, then set the root to -1.

Note that most of the time, the deletion code above will exit after the fourth step, making most deletions very fast.
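The unlink step that recurs throughout the deletion procedure - splice an entry out of its circular list and fix up the head pointer, falling back to -1 when the list empties - can be sketched like this (names are illustrative, not from the real code):

```java
// Sketch of the "remove from circular list, then fix the head" step used
// throughout the deletion procedure. Returns the new head, or null when
// the removed entry was the last one (the root-becomes- -1 case).
public class UnlinkSketch {
    static class Node {
        int value;
        Node next = this, prev = this;
        Node(int v) { value = v; }
        void insertBefore(Node head) {
            prev = head.prev; next = head;
            head.prev.next = this; head.prev = this;
        }
    }

    static Node unlink(Node head, Node removed) {
        if (removed.next == removed) return null;   // list of length 1
        removed.prev.next = removed.next;           // splice out
        removed.next.prev = removed.prev;
        return (head == removed) ? removed.next : head;
    }

    // Demo: remove the head of a two-element list; the head moves on.
    static int headAfterRemoval() {
        Node a = new Node(1);
        Node b = new Node(2);
        b.insertBefore(a);          // circular list: 1 <-> 2
        Node h = unlink(a, a);      // remove the head itself
        return (h == null) ? -1 : h.value;
    }
}
```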

Discussion on Second Design
This design is more circuitous than the first design, but it saves the space of two pointers per triple when back pointers are used in the linked lists.

Saving that space while keeping back pointers is specifically supposed to provide faster deletions, which it does. Most deletions require 4 reads and 3 writes (ignoring the free list operation, which is another read and write each). However, as I said yesterday, faster deletions are not really essential, except when deleting models. So maybe the complexities of this design don't make the minor space saving worth it.

Lookups are trivial, with only one disk read, but iterators are more complex than the first design.

Insertions are not too bad, with a full database taking 7 reads and 4 writes (ignoring the free list, which should be a single read and write, though the read could be cached). A database without much data takes up to another 24 writes and 12 reads (that was a quick count, but I think I got them all, excluding the 6 read/writes on the free lists). The figures can't be accurate due to hashing collisions, available data in the free list, and the order that data is inserted and removed. The worst case figures can also be ignored, as the database would quickly reduce the number of IO operations towards the minimum as more data was inserted.

Like the first design, no ordering is used here, but can be added into the linked lists in an identical way to the first design.

Deletions in the first design are quite slow as they need to traverse linked lists, unless we choose to nearly double the size of the data file with back pointers on the lists. However, we normally only see deletions on entire models, and if we can do that by simply dropping the directory we might be fine.

Insertions and iterating over datasets are easier in the first design, and I think they would be faster - particularly the insertions.

Overall I think I prefer the first design. We'll see which direction DM wants to go in. I think it will depend on what features we want to have fast, and what we're prepared to compromise on.

It's actually 12:45am, but I'm posting the time as 11:59pm so I can remember that this is all for the 13th, and not the 14th. I did the same for yesterday.
