Adamovsky Direct Tracking Directed Acyclic Word Graph
-
An Advanced Lexicon Data Structure

| Preamble | | Executive Summary | | ADTDAWG | | Part One | | Part Two | | Part Three & Four | | C Implementation | | 64-Bit CRC Transition To Bytewise Lookup-Table | | Contact Information | | Parent Web Page |

Preamble

The traditional Directed Acyclic Word Graph, or DAWG, has been thoroughly investigated, and has been found wanting. DAWG is a form of compressed Trie that relies heavily on partitioning the raw Trie data into discrete lists. In this way, redundant postfix lists can be eliminated. These lists are stored contiguously inside of an integer array so that the next node in a list is simply the next integer in the array unless a Boolean flag indicates that the current node is the end of a list. As fast as the traversal of a traditional DAWG may appear, it suffers from three prominent weaknesses. The first, and most obvious flaw is that each node representing the end of a word, most likely represents the final character of a great number of words in the lexicon being represented. In this way, information, such as a time stamp, cannot be traced back to one word in the lexicon. The second and third weaknesses arise from the raw-list-structure paradigm. When inside of a single node, exact information about the current node's child list is not available. Traversal of the list below the current node is required to discover information about it. Whether the node being looked for is there or not, time is needlessly being spent scrolling through a list for a node that may, or may not, exist. Finally, the traditional DAWG list structure enforces an "order matters" traversal policy, where when looking for a letter, one must first inspect, and then skip all nodes in the list that alphabetically precede the letter being sought out. A traversal without backtracking must proceed in alphabetical order. The ADTDAWG was designed to score a Boggle board very fast, so the elimination of these shortcomings was the top priority, while maintaining the favourable properties of the traditional, list style, DAWG encoding.

Executive Summary

This web page presents a novel DAWG encoding, aptly named ADTDAWG, which addresses and eliminates the three salient drawbacks of the traditional 32 bit, list style, DAWG. The new, 64 bit, long integer used in the second part of this data structure justifies the complexity of the creation algorithm required. In the new ADTDAWG, the Direct refers to internalized full information about the child list of each node, and the Tracking refers to the ability to pinpoint a reference to the exact word that one has landed on while traversing the data structure in an unspecified manner. It is surprising then, that the size augmentation, required for the new functionality, has been limited to roughly 30%, for a 14 character subset of the TWL06 Scrabble word list. This makes the immutable ADTDAWG lexicon data structure ideally suited for complete storage in processor cache. Included in the Implementation section is a set of Standard C files to create the ADTDAWG and an anagram program to test the results. The C files will have to be modified if the character set or the size of the lexicon is changed.

ADTDAWG: Four Arrays Come Together

If one driving force behind this data structure must be isolated, it is the placement of contiguous child lists of nodes in an integer array. This allows for a single first child reference, and a collection of offset references to each potential child, to replace explicit index values for all possible children. Even a superficial analysis of the English language will find it to be ripe with patterns related to the order, and position, that letters appear in the majority of the words in the lexicon.

Part One contains a generalized and scaled back version of the traditional DAWG nodes. Part Two contains a remarkably short list of 64 bit integers holding all of the child-list offset arrangements present in the lexicon. The minute size of this list echoes the patterns that emerge from an investigation of the English lexicon. Part Three and Part Four represent the data required to implement exact word tracking, simply, the number of words underneath the current node and the trailing nodes towards the end of the current branch list.

Part One: The Main ADTDAWG Structure

The first part of the ADTDAWG holds all of the high level node information for the lexicon data structure. It contains only an End Of Word Flag, an index reference to a Part Two node, where the full child offset information will be found, and finally a reference to the first child of the current node. It is important to note that an ADTDAWG node does not store a letter inside of it. The letter that a particular node represents is assigned by the route used to arrive at it. Another redundant element of a traditional DAWG node, due to the Part Two Direct Child Information, is the Boolean End Of List Flag. The letter information removed from each node now renders many list structures in the DAWG as redundant. Further, more bits are now available for the three remaining elements of each node. The redundant lists have been removed, and the list order has been reorganized to reduce the structure's Part Three and Part Four, space requirement. These operations are non-trivial because the nodes reference each other. Moving or eliminating an entire list becomes quite a tedious task, when each node is interconnected. This will become clear when reading the included C code.

Part Two: The Child Offset Pattern Data

This component of the ADTDAWG is responsible for containing the data required for direct child information to be a member of each node in Part One of the data structure. The direct child information makes it possible to traverse the data structure through only the nodes representing characters contained within a particular word. This novel scheme adds a significant performance boost to all lexicon traversal functions. It also enforces the "Order Does Not Matter," paradigm. As the heading of this section suggests, the 64-bit integers in this array simply contain an ordered list of child offsets from the first child in the list. This is accomplished by using the lowest number of bits required to contain the offset for each character in the lexicon. 64 Bits will hold a maximum of 18 characters, and thus, two long integers are required to hold the direct offsets for the 26 characters in the Standard English lexicon. The format of a Part Two node will be demonstrated below, for the 14 character set used in the BIG.c Boggle board scoring algorithm. An offset of zero will indicate that the associated character does not exist in the list below the current node. An offset of one indicates that the character is located at the First Child Reference. Extraction of a child offset simply requires a bit-mask followed by a bit-shift. These two operations have a very low complexity. If the particular child being searched for does not exist, only a bit-mask is required.

14 Character Set
Character	Bits Required
A C D E G I L M N O P R S T	1 2 2 3 3 3 3 4 4 4 4 4 4 4

Resulting Format of The Part Two, 64-Bit Integer
XXXXX	XXXXX	XXXXX	XXXX	1110	1101	1100	1011	1010	1001	1000	111	110	101	100	11	10	1
XXXXX	XXXXX	XXXXX	XXXX	T---	S---	R---	P---	O---	N---	M---	L--	I--	G--	E--	D-	C-	A

Part Three and Part Four: The Tracking Algorithm

The Governing Equation For ADTDAWG Word Tracking:

Next Marker = Current Marker - FCWTEOBL + NCWTEOBL - EndOfWordFlag

* FCWTEOBL = First Child -> Words To End Of Branch List

* NCWTEOBL = Next Child -> Words To End Of Branch List

* EndOfWordFlag = { 0, or 1 }

When on a particular node, the unique number attached to the word that may end on that node is the Current Marker. This number is a direct result of the path used to arrive at that node. The first level Current Marker is simply the WTEOBL value stored in Part Three of the data structure, as there is only one way to get to the first level of nodes due to the acyclic nature of the graph.

The ability to pinpoint the exact word that an EndOfWord flagged node is currently representing is made possible by the third and fourth part of the ADTDAWG data structure. Combined, these arrays hold the "Words To End Of Branch List" for every single Part One node. Lists that contain nodes with WTEOBL values larger than 255 are rare, but they require more than a single byte to store their WTEOBL values. All of these heavily populated lists have been relocated to the very front of the Part One array and their WTEOBL values are stored in Part Three as 32 bit integers. The remaining nodes have their WTEOBL values stored in the Part Four array as 8 bit unsigned chars. Further research is required to determine if the space saving benefit of using 8 bit unsigned chars outweighs the amount of overhead required to extract an 8 bit number from the processor word. With such a compressed lexicon, it is safe merge Part Three and Part Four together and use 32-bit integers for speed. In the case of a larger lexicon with many more Part One nodes, this will not be the case and size reduction will result in a performance increase.

The "Words To End Of Branch List" data, as opposed to the more intuitive "Words To Start Of Branch List," was chosen due to the nature of the node reduction algorithm. When traversing a DAWG, the route taken to get to a specific node allows for various partial lists to be used. These partial lists do not share a common Start Of Branch List," but invariably, they all end at the same node. Thus, when tracking a word, it may seem odd that words near the beginning of the lexicon are marked with high values, and words near the end of the lexicon are marked with low values. This is a necessary consequence of the node reduction algorithm associated with a node-list type DAWG. The last word in the lexicon will have an association with the number 1.

C Implementation

In the spirit of full disclosure, all of the files required to create the ADTDAWG and then use it have been included in the list below. The code on this page has been converted to an html format, so please download the untampered with files below. The first two C files need to be run in order to create the four data files. The final two C files use the data files to carry out functionality tests. Constants must be altered if any changes are made to Lexicon_14.

Lexicon_14.txt - The Required Word List
ADTDAWG_Creator_Part_1.c - Use "gcc -lm" to compile
ADTDAWG_Creator_Part_2.c - Use "gcc -lm" to compile
Four_Part_1_DTDAWG_For_Lexicon_14.dat - Pre-created data file
Four_Part_2_DTDAWG_For_Lexicon_14.dat - Pre-created data file
Four_Part_3_DTDAWG_For_Lexicon_14.dat - Pre-created data file
Four_Part_4_DTDAWG_For_Lexicon_14.dat - Pre-created data file
ADTDAWG_Word_Search_For_Lexicon_14.c - Use "gcc" to compile - Search for words in Lexicon_14
ADTDAWG_Anagram_For_Lexicon_14.c - Use "gcc" to compile - Rearrange letters to make words in Lexicon_14

"-lm" Includes the "math.h" library.

ADTDAWG_Creator_Part_1.c - This Program Is Non-Trivial

// Simply put, this program needs only 3 things to create an optimal 14 char DAWG for you. // 1) A text file with a lexicon inside of it and the number of words is written on the very first line, in linux format, so no DOS CR. // 2) Set the constant MAX below to the length of the longest word in the lexicon. Modify if your shortest words are not of length 2. Have fun. // 3) Define the 14 characters that you have chosen, and that is it. // The 14 Dawg data structure is important because each node will contain information about all of the child nodes below it. // In this respect, traversal time will be reduced beyond any published DAWG implementation to my knowledge. #include <stdlib.h> #include <math.h> #include <stdio.h> #include <string.h> #include <time.h> #include <assert.h> #define MAX 15 #define NUMBER_OF_REDUCED_LETTERS 14 #define INPUT_LIMIT 30 // C requires a boolean variable type so use C's typedef concept to create one. typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; // we will store the Dawg into a text file to verify the output. Use The Init function this time #define RAW_LEXICON "Lexicon_14.txt" #define OPTIMAL_DAWG_DATA "Traditional_Dawg_For_Lexicon_14.dat" #define OPTIMAL_DIRECT_DAWG_DATA_PART_ONE "Optimal_Part_One_Direct_Dawg_For_Lexicon_14.dat" #define OPTIMAL_DIRECT_DAWG_DATA_PART_TWO "Optimal_Part_Two_Direct_Dawg_For_Lexicon_14.dat" #define OPTIMAL_DAWG_TEXT_DATA "Traditional_Dawg_Text_For_Lexicon_14.txt" #define END_OF_LIST_DATA "End_Of_List_Nodes_Direct_Dawg_For_Lexicon_14.dat" int LetterArrayLocation[26] = {0, -999, 1, 2, 3, -999, 4, -999, 5, -999, -999, 6, 7, 8, 9, 10, -999, 11, 12, 13, -999, -999, -999, -999, -999, -999}; int BitShiftMe[26] = {0, -999, 1, 3, 5, -999, 8, -999, 11, -999, -999, 14, 17, 21, 25, 29, -999, 33, 37, 41, -999, -999, -999, -999, -999, -999}; inline void CutOffNewLine(char *String){ String[strlen(String) - 1] = '\0'; } // Returns the positive integer rerpresented by "TheNumberNotYet" string, and if not a number greater then zero, returns 0. int StringToPositiveInt( char* TheNumberNotYet ) { int result = 0; int X; int Digits = strlen( TheNumberNotYet ); for ( X = 0; X < Digits; X++ ) { printf("X =|%d|, TheNumberNotYet[X]=|%c|, result=|%d|, add|%d|\n", X, TheNumberNotYet[X], result, (int)((TheNumberNotYet[X] - '0')*pow( 10, Digits - X - 1 ))); if ( !(TheNumberNotYet[X] >= '0' && TheNumberNotYet[X] <= '9') ) return 0; result += (int)((TheNumberNotYet[X] - '0')*pow( 10, Digits - X - 1 )); } printf("The result=|%d| is in.\n", result); return result; } // The BinaryNode string must be at least 32 + 5 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertIntNodeToBinaryString( int TheNode, char* BinaryNode){ int X; int Bit; BinaryNode[0] = '|'; // Bit 31, the leftmost bit is the sign bit. BinaryNode[1] = (TheNode < 0)?'1':'0'; // Bit 30 to bit 27 are unused bits in a 32 bit node. BinaryNode[2] = (TheNode & (int)pow(2,30))?'1':'0'; BinaryNode[3] = (TheNode & (int)pow(2,29))?'1':'0'; BinaryNode[4] = (TheNode & (int)pow(2,28))?'1':'0'; BinaryNode[5] = (TheNode & (int)pow(2,27))?'1':'0'; BinaryNode[6] = '|'; // Bit 26 is the boolean EOW flag for end of word. BinaryNode[7] = (TheNode & (int)pow(2,26))?'1':'0'; BinaryNode[8] = '|'; // Bit 25 to bit 15 represent the child offset index. Bit = 25; for ( X = 9; X <= 19; X++ ) { BinaryNode[X] = (TheNode & (int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryNode[20] = '|'; // The first child index that requires 15 bits will finish off the 32 bit int. Bit = 14; for ( X = 21; X <= 35; X++ ) { BinaryNode[X] = (TheNode & (int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryNode[36] = '|'; BinaryNode[37] = '\0'; } // The BinaryOffset string must be at least 64 + 16 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertOffsetLongIntToBinaryString( long int TheOffset, char* BinaryOffset){ int X; int Bit; BinaryOffset[0] = '|'; // Bit 63, the leftmost bit is the sign bit. BinaryOffset[1] = (TheOffset < 0)?'1':'0'; // Bit 62 to bit 45 are not used with the 14 char set chosen for this DAWG. 18 Unused bits. All bits including the sign bit will be used with an 18 character set. Bit = 62; for ( X = 2; X <= 19; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[20] = '|'; // The seven letters that require a 4 bit offset encoding start here --------------------------------------------------- // Bit [44,41] represent the "T" child. Bit = 44; for ( X = 21; X <= 24; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[25] = '|'; // Bit [40,37] represent the "S" child. Bit = 40; for ( X = 26; X <= 29; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[30] = '|'; // Bit [36,33] represent the "R" child. Bit = 36; for ( X = 31; X <= 34; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[35] = '|'; // Bit [32,29] represent the "P" child. Bit = 32; for ( X = 36; X <= 39; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[40] = '|'; // Bit [28,25] represent the "O" child. Bit = 28; for ( X = 41; X <= 44; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[45] = '|'; // Bit [24,21] represent the "N" child. Bit = 24; for ( X = 46; X <= 49; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[50] = '|'; // Bit [20,17] represent the "M" child. Bit = 20; for ( X = 51; X <= 54; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[55] = '|'; // The four letters that require a 3 bit offset encoding start here --------------------------------------------------- // Bit [16,14] represent the "L" child. Bit = 16; for ( X = 56; X <= 58; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[59] = '|'; // Bit [13,11] represent the "I" child. Bit = 13; for ( X = 60; X <= 62; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[63] = '|'; // Bit [10,8] represent the "G" child. Bit = 10; for ( X = 64; X <= 66; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[67] = '|'; // Bit [7,5] represent the "E" child. Bit = 7; for ( X = 68; X <= 70; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[71] = '|'; // The two letters that require a 2 bit offset encoding start here --------------------------------------------------- // Bit [4,3] represent the "G" child. Bit = 4; for ( X = 72; X <= 73; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[74] = '|'; // Bit [2,1] represent the "E" child. Bit = 2; for ( X = 75; X <= 76; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[77] = '|'; // Bit 0 represent the "A" child. BinaryOffset[78] = (TheOffset & (long int)pow(2,0))?'1':'0'; BinaryOffset[79] = '|'; BinaryOffset[80] = '\0'; } /*This Function converts any lower case letters in the string "RawWord," into all capitals, so that the whole string is made of capital letters. */ void MakeMeAllCapital(char *RawWord){ int count = 0; for ( count=0; count < strlen(RawWord); count++ ){ if (RawWord[count] >= 97 && RawWord[count] <= 122 ) RawWord[count] = RawWord[count] - 32; } } /*Trie to Dawg TypeDefs*/ struct tnode { struct tnode* Next; struct tnode* Child; struct tnode* ParentalUnit; // When populating the array, you must know the indices of only the child nodes. Hence we Store ArrayIndex in every node so that we can mine the information from the reduced trie. int ArrayIndex; char DirectChild; char Letter; char MaxChildDepth; char Level; char NumberOfChildren; char Dangling; char EndOfWordFlag; }; typedef struct tnode Tnode; typedef Tnode* TnodePtr; int TnodeArrayIndex( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->ArrayIndex; } char TnodeDirectChild( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->DirectChild; } TnodePtr TnodeNext( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Next; } TnodePtr TnodeChild ( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Child; } TnodePtr TnodeParentalUnit( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->ParentalUnit; } char TnodeLetter( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Letter; } char TnodeMaxChildDepth( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->MaxChildDepth; } char TnodeNumberOfChildren( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->NumberOfChildren; } char TnodeEndOfWordFlag( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->EndOfWordFlag; } char TnodeLevel ( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Level; } char TnodeDangling ( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Dangling; } TnodePtr TnodeInit( char Chap, TnodePtr OverOne, char WordEnding, char Leveler, int StarterDepth, TnodePtr Parent, char IsaChild ) { TnodePtr Result = (TnodePtr)malloc( sizeof(Tnode) ); assert( Result != NULL ); Result->Letter = Chap; Result->ArrayIndex = 0; Result->NumberOfChildren = 0; Result->MaxChildDepth = StarterDepth; Result->Next = OverOne; Result->Child = NULL; Result->ParentalUnit = Parent; Result->Dangling = FALSE; Result->EndOfWordFlag = WordEnding; Result->Level = Leveler; Result->DirectChild = IsaChild; return Result; } void TnodeOutput( TnodePtr ThisTnode ){ assert( ThisTnode != NULL ); printf( "Letter|%c|, DirectChild|%d|, ArrayIndex|%d|, Level|%d|, MaxChildDepth|%d|, Dangling|%d|, NumberOfChildren|%d|, EndOfWordFlag|%d|\n" , ThisTnode->Letter, ThisTnode->DirectChild, ThisTnode->ArrayIndex, ThisTnode->Level, ThisTnode->MaxChildDepth, ThisTnode->Dangling , ThisTnode->NumberOfChildren, ThisTnode->EndOfWordFlag ); } void TnodeSetArrayIndex( TnodePtr ThisTnode, int TheWhat ){ assert( ThisTnode != NULL ); ThisTnode->ArrayIndex = TheWhat; } void TnodeSetChild( TnodePtr ThisTnode, TnodePtr Nexis ){ assert( ThisTnode != NULL ); ThisTnode->Child = Nexis; } void TnodeSetNext ( TnodePtr ThisTnode, TnodePtr Nexi ){ assert( ThisTnode != NULL ); ThisTnode->Next = Nexi; } void TnodeSetParentalUnit ( TnodePtr ThisTnode, TnodePtr Parent ){ assert( ThisTnode != NULL ); ThisTnode->ParentalUnit = Parent; } void TnodeSetMaxChildDepth( TnodePtr ThisTnode, int NewDepth ){ assert( ThisTnode != NULL ); ThisTnode->MaxChildDepth = NewDepth; } void TnodeSetDirectChild( TnodePtr ThisTnode, char Status ){ assert( ThisTnode != NULL ); ThisTnode->DirectChild = Status; } void TnodeFlyEndOfWordFlag ( TnodePtr ThisTnode ){ assert( ThisTnode != NULL ); ThisTnode->EndOfWordFlag = TRUE; } // This function Dangles a node, but also recursively dangles every node under it as well //, that way nodes that are not direct children do hit the chopping block. The function returns the total number of nodes dangled as a result. int TnodeDangle( TnodePtr ThisTnode ){ assert( ThisTnode != NULL ); if ( ThisTnode->Dangling == TRUE ) return 0; int Result = 0; if ( (ThisTnode->Next) != NULL ) Result += TnodeDangle( ThisTnode->Next ); if ( (ThisTnode->Child) != NULL ) Result += TnodeDangle( ThisTnode->Child); ThisTnode->Dangling = TRUE; Result += 1; return Result; } // This function returns the pointer to the Tnode in a parallel list of nodes with the letter "ThisLetter", // and returns NULL if the Tnode does not exist. In that case an insertion is required. TnodePtr TnodeFindParaNode( TnodePtr ThisTnode, char ThisLetter ) { if ( ThisTnode == NULL ) return NULL; TnodePtr Result = ThisTnode; if ( Result->Letter == ThisLetter ) return Result; while ( Result->Letter < ThisLetter ){ Result = Result->Next; if ( Result == NULL ) return NULL; } if ( Result->Letter == ThisLetter ) return Result; else return NULL; } // This function inserts a new Tnode before a larger letter node or at the end of a para list If the list does not esist then it is put at the beginnung. // The new node has ThisLetter in it. AboveTnode is the node 1 level above where the new node will be placed. // This function should never be passed a NULL pointer. ThisLetter should never exist in the next child para list. void TnodeInsertParaNode( TnodePtr AboveTnode, char ThisLetter, char WordEnder, int StartDepth){ assert( AboveTnode != NULL ); AboveTnode->NumberOfChildren += 1; TnodePtr Holder = NULL; TnodePtr Currently = NULL; // Case 1: ParaList does not exist yet so start it. if ( AboveTnode->Child == NULL ) AboveTnode->Child = TnodeInit( ThisLetter, NULL, WordEnder, AboveTnode->Level + 1, StartDepth, AboveTnode, TRUE ); // Case 2: ThisLetter should be the first in the ParaList. else if ( ((AboveTnode->Child)->Letter) > ThisLetter ) { Holder = AboveTnode->Child; // The holder node is no longer a direct child so set it as such. TnodeSetDirectChild( Holder, FALSE ); AboveTnode->Child = TnodeInit( ThisLetter, Holder, WordEnder, AboveTnode->Level + 1, StartDepth, AboveTnode, TRUE ); // The parent node needs to be changed on what used to be the child. it is the Tnode in "Holder". Holder->ParentalUnit = AboveTnode->Child; } // Case 3: The ParaList exists and ThisLetter is not first in the list. else if ( (AboveTnode->Child)->Letter < ThisLetter ) { Currently = AboveTnode->Child; while ( Currently->Next !=NULL ){ if ( TnodeLetter( Currently->Next ) > ThisLetter ){ break; } Currently = Currently->Next; } Holder = Currently->Next; Currently->Next = TnodeInit( ThisLetter, Holder, WordEnder, AboveTnode->Level + 1, StartDepth, Currently, FALSE ); if ( Holder != NULL ) Holder->ParentalUnit = Currently->Next; } } // The MaxChildDepth of the two nodes can not be assumed equal due to the recursive nature of this function, so we must check for equivalence. char TnodeAreWeTheSame( TnodePtr FirstNode, TnodePtr SecondNode){ if ( FirstNode == SecondNode ) return TRUE; if ( FirstNode == NULL || SecondNode == NULL ) return FALSE; if ( FirstNode->Letter != SecondNode->Letter ) return FALSE; if ( FirstNode->MaxChildDepth != SecondNode->MaxChildDepth ) return FALSE; if ( FirstNode->NumberOfChildren != SecondNode->NumberOfChildren ) return FALSE; if ( FirstNode->EndOfWordFlag != SecondNode->EndOfWordFlag ) return FALSE; if ( TnodeAreWeTheSame( FirstNode->Child, SecondNode->Child ) == FALSE ) return FALSE; if ( TnodeAreWeTheSame( FirstNode->Next, SecondNode->Next ) == FALSE ) return FALSE; else return TRUE; } struct dawg { int NumberOfTotalWords; int NumberOfTotalNodes; TnodePtr First; }; typedef struct dawg Dawg; typedef Dawg* DawgPtr; // Set up the parent nodes in the Dawg. DawgPtr DawgInit (void){ DawgPtr Result; Result = (DawgPtr)malloc( sizeof(Dawg)); assert( Result != NULL ); Result->NumberOfTotalWords = 0; Result->NumberOfTotalNodes = 0; Result->First = TnodeInit( '0', NULL, FALSE, 0, 0, NULL, FALSE ); return Result; } // This function simply makes DawgAddWord look a hell of a lot smaller. It returns the number of new nodes that it inserted. int TnodeDawgAddWord( TnodePtr ParentNode, const char *Word ){ assert( ParentNode != NULL ); int Result = 0; int X, Y = 0; int WordLength = strlen( Word ); TnodePtr HangPoint = NULL; TnodePtr Current = ParentNode; for ( X = 0; X < WordLength; X++){ HangPoint = TnodeFindParaNode ( TnodeChild( Current ), Word[X] ); if ( HangPoint == NULL ) { TnodeInsertParaNode( Current, Word[X], (X == WordLength - 1 ? TRUE : FALSE), WordLength - X - 1 ); Result++; Current = TnodeFindParaNode ( TnodeChild( Current ), Word[X] );; for ( Y = X+1; Y < WordLength; Y++ ){ TnodeInsertParaNode( Current, Word[Y], (Y == WordLength - 1 ? TRUE : FALSE), WordLength - Y - 1 ); Result += 1; Current = TnodeChild( Current ); } break; } else { if ( TnodeMaxChildDepth( HangPoint ) < WordLength - X - 1 ) TnodeSetMaxChildDepth( HangPoint, WordLength - X - 1 ); } Current = HangPoint; // The path for the word that we are trying to insert already exists, so just make sure that the end flag is flying on the last node. //This should never happen if we are to add words in increasing word length, but hey, better safe than sorry. if ( X == WordLength - 1 ) TnodeFlyEndOfWordFlag( Current ); } return Result; } void DawgAddWord ( DawgPtr ThisDawg, char * NewWord ) { assert( ThisDawg != NULL ); ThisDawg->NumberOfTotalWords += 1; int WordLength = strlen( NewWord ); int X = 0; char Temp; int Inter = 0; Inter = TnodeDawgAddWord( ThisDawg->First, NewWord ); ThisDawg->NumberOfTotalNodes += Inter; } // This is a standard depth first preorder tree traversal, whereby the objective is to count nodes of various MaxChildDepths to allow for an optimal reduction process. void TnodeGraphTabulateRecurse ( TnodePtr ThisTnode, int Level, int* Tabulator ) { assert( ThisTnode != NULL ); if ( Level == 0 ) TnodeGraphTabulateRecurse( TnodeChild( ThisTnode ), Level + 1, Tabulator ); else{ if ( ThisTnode->Dangling == FALSE ) { Tabulator[ThisTnode->MaxChildDepth] += 1; // Go Down if possible. if ( ThisTnode->Child != NULL ) TnodeGraphTabulateRecurse( TnodeChild( ThisTnode ), Level + 1, Tabulator ); // Go Right if possible. if ( ThisTnode->Next != NULL && ThisTnode->Dangling == FALSE ) TnodeGraphTabulateRecurse( TnodeNext( ThisTnode ), Level, Tabulator ); } } } // Count the nodes of each max child depth. void DawgGraphTabulate( DawgPtr ThisDawg, int* Count ) { int Numbers[MAX]; int X = 0; for ( X = 0; X < MAX; X++ ) Numbers[X] = 0; if ( ThisDawg->NumberOfTotalWords > 0){ TnodeGraphTabulateRecurse( ThisDawg->First, 0, Numbers ); for ( X = 0; X < MAX; X++ ) { Count[X] = Numbers[X]; } } } // This function can never be called after a trie has been mowed because this will result in pointers being freed twice resulting in a core dump! void FreeTnodeRecurse( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); if ( ThisTnode->Child != NULL ) FreeTnodeRecurse( ThisTnode->Child ) ; if ( ThisTnode->Next != NULL ) FreeTnodeRecurse( ThisTnode->Next ); free(ThisTnode); } // This function can never be called after a trie has been mowed because this will result in pointers being freed twice resulting in a core dump! void FreeDawg( DawgPtr ThisDawg ) { if ( ThisDawg->NumberOfTotalWords > 0 ){ FreeTnodeRecurse( ThisDawg->First ); } free( ThisDawg ); } // An important function, this function returns the first node that is identical to ThisTnode. TnodePtr TnodeMexicanEquivalent( TnodePtr ThisTnode, TnodePtr ** MaxChildDepthWist ) { assert ( ThisTnode != NULL ); int Tall = ThisTnode->MaxChildDepth; int X = 0; while ( TnodeAreWeTheSame( ThisTnode, MaxChildDepthWist[Tall][X] ) == FALSE ) { X += 1; } return MaxChildDepthWist[Tall][X]; } // Recursively replaces all redundant nodes in a trie with their first equivalent. void TnodeLawnMowerRecurse( TnodePtr ThisTnode, TnodePtr ** HeightWits ) { assert (ThisTnode != NULL); if ( ThisTnode->Level == 0 ) TnodeLawnMowerRecurse( ThisTnode->Child, HeightWits ); else { if ( ThisTnode->Next == NULL && ThisTnode->Child == NULL ) goto AndImDone; if ( ThisTnode->Child != NULL ) { // we have found a node that has been tagged to be mowed, so let us destroy it not literally and replace it with its first equivelant in the "HeightWits" list, and it won't be tagged. if ( (ThisTnode->Child)->Dangling == TRUE ) { ThisTnode->Child = TnodeMexicanEquivalent( ThisTnode->Child, HeightWits ); } else TnodeLawnMowerRecurse( ThisTnode->Child, HeightWits ); } if ( ThisTnode->Next != NULL ){ if ( (ThisTnode->Next)->Dangling == TRUE ) { ThisTnode->Next = TnodeMexicanEquivalent( ThisTnode->Next, HeightWits ); } else TnodeLawnMowerRecurse( ThisTnode->Next, HeightWits ); } } AndImDone:; } // Replaces all redundant nodes in a trie with teir first equivalent. void DawgLawnMower( DawgPtr ThisDawg, TnodePtr ** HeightWise ) { TnodeLawnMowerRecurse( ThisDawg->First, HeightWise ); } ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // A queue is required for breadth first traversal, and the rest is self evident. struct breadthqueuenode { TnodePtr Element; struct breadthqueuenode *Next; }; typedef struct breadthqueuenode BreadthQueueNode; typedef BreadthQueueNode* BreadthQueueNodePtr; void BreadthQueueNodeSetNext( BreadthQueueNodePtr ThisBreadthQueueNode, BreadthQueueNodePtr Nexit ) { assert ( ThisBreadthQueueNode != NULL ); ThisBreadthQueueNode->Next = Nexit; } BreadthQueueNodePtr BreadthQueueNodeNext( BreadthQueueNodePtr ThisBreadthQueueNode ) { assert ( ThisBreadthQueueNode != NULL ); return ThisBreadthQueueNode->Next; } TnodePtr BreadthQueueNodeElement( BreadthQueueNodePtr ThisBreadthQueueNode ) { assert ( ThisBreadthQueueNode != NULL ); return ThisBreadthQueueNode->Element; } BreadthQueueNodePtr BreadthQueueNodeInit( TnodePtr NewElement ) { BreadthQueueNodePtr Result = (BreadthQueueNodePtr)malloc( sizeof( BreadthQueueNode ) ); assert( Result != NULL ); Result->Element = NewElement; Result->Next = NULL; return Result; } struct breadthqueue { BreadthQueueNodePtr Front; BreadthQueueNodePtr Back; int Size; }; typedef struct breadthqueue BreadthQueue; typedef BreadthQueue* BreadthQueuePtr; BreadthQueuePtr BreadthQueueInit( void ) { BreadthQueuePtr Result = (BreadthQueuePtr)malloc( sizeof( BreadthQueue ) ); assert( Result != NULL ); Result->Front = NULL; Result->Back = NULL; Result->Size = 0; } void BreadthQueuePush( BreadthQueuePtr ThisBreadthQueue, TnodePtr NewElemental ) { assert( ThisBreadthQueue != NULL ); assert( NewElemental != NULL ); BreadthQueueNodePtr Noob = BreadthQueueNodeInit( NewElemental ); assert( Noob != NULL ); if ( (ThisBreadthQueue->Back) != NULL ) BreadthQueueNodeSetNext( ThisBreadthQueue->Back, Noob ); else ThisBreadthQueue->Front = Noob; ThisBreadthQueue->Back = Noob; (ThisBreadthQueue->Size) += 1; } TnodePtr BreadthQueuePop( BreadthQueuePtr ThisBreadthQueue ) { assert( ThisBreadthQueue != NULL ); if ( ThisBreadthQueue->Size == 0 ) return NULL; if ( ThisBreadthQueue->Size == 1 ) { ThisBreadthQueue->Back = NULL; ThisBreadthQueue->Size = 0; TnodePtr Result = (ThisBreadthQueue->Front)->Element; free( ThisBreadthQueue->Front ); ThisBreadthQueue->Front = NULL; return Result; } TnodePtr Result = (ThisBreadthQueue->Front)->Element; BreadthQueueNodePtr Holder = ThisBreadthQueue->Front; ThisBreadthQueue->Front = (ThisBreadthQueue->Front)->Next; free( Holder ); ThisBreadthQueue->Size -= 1; return Result; } void BreadthQueuePopulateReductionArray( BreadthQueuePtr ThisBreadthQueue, TnodePtr Root, TnodePtr **Holder ) { printf( "inside external function.\n" ); int Taker[MAX]; int X = 0; for ( X = 0; X < MAX; X++ ) Taker[X] = 0; int PopCount = 0; int CurrentMaxChildDepth = 0; TnodePtr Current = Root; // Push the first row onto the queue. while ( Current != NULL ) { BreadthQueuePush( ThisBreadthQueue, Current ); Current = Current->Next; } // Initiate the pop followed by push all children loop. while ( (ThisBreadthQueue->Size) != 0 ) { Current = BreadthQueuePop( ThisBreadthQueue ); PopCount += 1; CurrentMaxChildDepth = Current->MaxChildDepth; Holder[CurrentMaxChildDepth][Taker[CurrentMaxChildDepth]] = Current; Taker[CurrentMaxChildDepth] += 1; Current = TnodeChild( Current ); while ( Current != NULL ) { BreadthQueuePush( ThisBreadthQueue, Current ); Current = TnodeNext( Current ); } } printf( "Completed Populating The Reduction Array.\n" ); } int BreadthQueueUseToIndex( BreadthQueuePtr ThisBreadthQueue, TnodePtr Root ) { int IndexNow = 0; TnodePtr Current = Root; // Push the first row onto the queue. while ( Current != NULL ) { BreadthQueuePush( ThisBreadthQueue, Current ); Current = Current->Next; } // Pop each element off of the queue and only push its children if has not been dangled yet. Assign index if one has not been given to it yet. while ( (ThisBreadthQueue->Size) != 0 ) { Current = BreadthQueuePop( ThisBreadthQueue ); if ( TnodeDangling( Current ) == FALSE && TnodeArrayIndex( Current ) == 0 ) { IndexNow += 1; TnodeSetArrayIndex( Current, IndexNow ); Current = TnodeChild( Current ); while ( Current != NULL ) { BreadthQueuePush( ThisBreadthQueue, Current ); Current = Current->Next; } } } printf( "Completed Assigning Indices To Living Nodes.\n" ); return IndexNow; } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // Next and Child become indices. struct arraydnode{ int Next; int Child; char Letter; char EndOfWordFlag; char Level; }; typedef struct arraydnode ArrayDnode; typedef ArrayDnode* ArrayDnodePtr; void ArrayDnodeInit( ArrayDnodePtr ThisArrayDnode, char Chap, int Nextt, int Childd, char EndingFlag, char Breadth ){ ThisArrayDnode->Letter = Chap; ThisArrayDnode->EndOfWordFlag = EndingFlag; ThisArrayDnode->Next = Nextt; ThisArrayDnode->Child = Childd; ThisArrayDnode->Level = Breadth; } void ArrayDnodeTnodeTranspose( ArrayDnodePtr ThisArrayDnode, TnodePtr ThisTnode ){ ThisArrayDnode->Letter = ThisTnode->Letter; ThisArrayDnode->EndOfWordFlag = ThisTnode->EndOfWordFlag; ThisArrayDnode->Level = ThisTnode->Level; if ( ThisTnode->Next == NULL ) ThisArrayDnode->Next = 0; else ThisArrayDnode->Next = (ThisTnode->Next)->ArrayIndex; if ( ThisTnode->Child == NULL ) ThisArrayDnode->Child = 0; else ThisArrayDnode->Child = (ThisTnode->Child)->ArrayIndex; } int ArrayDnodeNext( ArrayDnodePtr ThisArrayDnode ){ return ThisArrayDnode->Next; } int ArrayDnodeChild (ArrayDnodePtr ThisArrayDnode){ return ThisArrayDnode->Child; } char ArrayDnodeLetter( ArrayDnodePtr ThisArrayDnode ){ return ThisArrayDnode->Letter; } char ArrayDnodeEndOfWordFlag (ArrayDnodePtr ThisArrayDnode){ return ThisArrayDnode->EndOfWordFlag; } int ArrayDnodeNumberOfChildrenPlusString(ArrayDnodePtr DoggieDog, int Index, char* FillThisString){ int X; int CurrentArrayPosition; if ( (DoggieDog[Index]).Child == 0 ) { FillThisString[0] = '\0'; return 0; } CurrentArrayPosition = (DoggieDog[Index]).Child; for ( X = 0; X < NUMBER_OF_REDUCED_LETTERS; X++ ) { FillThisString[X] = (DoggieDog[CurrentArrayPosition]).Letter; if ( (DoggieDog[CurrentArrayPosition]).Next == 0 ) { FillThisString[X+1] = '\0'; return (X + 1); } CurrentArrayPosition = CurrentArrayPosition + 1; } } struct arraydawg { int NumberOfStrings; ArrayDnodePtr DawgArray; int First; char MinStringLength; char MaxStringLength; }; typedef struct arraydawg ArrayDawg; typedef ArrayDawg* ArrayDawgPtr; //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // This function is the core of the dawg creation procedure. Pay close attention to the order of the steps involved. ArrayDawgPtr ArrayDawgInit( char **Dictionary, int *SegmentLenghts, int MaxStringLength ){ int X = 0; int Y = 0; int *ChildCount; char *ChildStrings; printf("step 0 - Allocate the framework for the array data structure.\n"); /// Dynamically allocate the upper Data Structure. ArrayDawgPtr Result = (ArrayDawgPtr)malloc( sizeof( ArrayDawg ) ); /// set MinStringLength, MaxStringLength, and NumberOfStrings. X = 0; while ( SegmentLenghts[X] == 0 ) X++; Result->MinStringLength = X; Result->MaxStringLength = MaxStringLength; Result->NumberOfStrings = 0; for ( X = Result->MinStringLength; X <= Result->MaxStringLength ; X++ ) Result->NumberOfStrings += SegmentLenghts[X]; printf("step 1 - Create a temporary trie and begin filling it with the provided lexicon.\n"); /// Create a Temp Trie structure and then feed in the given dictionary. DawgPtr TemporaryTrie = DawgInit(); for ( Y = Result->MinStringLength; Y <= Result->MaxStringLength; Y++ ){ for ( X = 0; X < SegmentLenghts[Y]; X++ ){ DawgAddWord ( TemporaryTrie, &(Dictionary[Y][(Y + 1)*X]) ); } } printf("step 2 - Finished adding words to the temporary trie.\n"); /// Create the associated pointer array with all the nodes inside it. int *NodeNumberCounter = (int*)malloc( (Result->MaxStringLength)*sizeof(int) ); for ( X = 0; X < Result->MaxStringLength; X++ ) NodeNumberCounter[X] = 0; int *NodeNumberCounterInit = (int*)malloc( (Result->MaxStringLength)*sizeof(int) ); for ( X = 0; X < Result->MaxStringLength; X++ ) NodeNumberCounterInit[X] = 0; // Pass a fake pointer because we haven't counted the nodes yet, after that, set NodeNumberCounterInit. Pass FALSE kuz we are not filling the array yet. printf("step 3 - Count the total number of nodes in the raw trie by height.\n"); DawgGraphTabulate( TemporaryTrie, NodeNumberCounter ); printf("step 4 - Initial counting of trie nodes complete, so display results.\n"); int TotalNodeSum = 0; for ( X = 0; X < Result->MaxStringLength; X++ ){ NodeNumberCounterInit[X] = NodeNumberCounter[X]; TotalNodeSum += NodeNumberCounter[X]; } for ( X = 0; X < Result->MaxStringLength; X++ ){ printf( "Initial Count For MaxChildDepth |%d| is |%d|\n", X, NodeNumberCounterInit[X] ); } printf("The Total Number Of Nodes In The Trie = |%d| \n", TotalNodeSum); // We will have exactly enough space for all of the Tnode pointers. printf("step 5 - Allocate a 2 dimensional array of Trie node pointers to minimize the trie into an optimal DAWG.\n"); TnodePtr ** HolderOfAllTnodePointers = (TnodePtr **)malloc( (Result->MaxStringLength)*sizeof( TnodePtr * ) ); for ( X = 0; X < MAX; X++ ) HolderOfAllTnodePointers[X] = (TnodePtr *)malloc(NodeNumberCounterInit[X]*sizeof(TnodePtr)); // When populating the array we are going to need to go about doing so in a breadth first manner to // make sure that the first time that we encounter something it will come first in the array structure. // Or maybe this is unnecessary, because we have literally replaced all references to obsolete nodes. printf("step 6 - Populate the 2 dimensional trie node pointer array.\n"); // Let us use a breadth first traversal to populate the HolderOfAllTnodePointers array so that the first occurance of a group of identical nodes becomes the one that survives. BreadthQueuePtr Populator = BreadthQueueInit(); BreadthQueuePopulateReductionArray( Populator, (TemporaryTrie->First)->Child, HolderOfAllTnodePointers ); printf("step 7 - Population complete.\n"); // Flag all of the reduntant nodes. Flagging requires the node comparison function that will take a very long time // for a big dictionary especially when comparing the nodes with small MaxChildDepths because there are so many, so start at the top. int NumberDangled = 0; int TotalDangled = 0; int W = 0; // keep track of the number of nodes of each MaxChildDepth dangled recursively so we can check how many remaining nodes we need for the optimal array. int DangleCount[Result->MaxStringLength]; for ( X = 0; X < Result->MaxStringLength; X++) DangleCount[X] = 0; printf("step 8 - Begin to tag redundant nodes as dangled - Use recursion because only direct children are considered for elimination to reduce node size.\n"); // Start at the largest MaxChildDepth and work down from there for recursive reduction to take place. for ( Y = (Result->MaxStringLength - 1); Y >= 0 ; Y--){ NumberDangled = 0; // Move through the holder array looking for any nodes that have not been dangled. for ( W = 0; W < (NodeNumberCounterInit[Y] - 1); W++ ){ // The Node is already Dangling. Note the this node need not be the first in a child line. //In order to eliminate the need for the next index all identical reduced nodes only must be direct children only. if ( TnodeDangling( HolderOfAllTnodePointers[Y][W] ) == TRUE ) { continue; } // Traverse the rest of the array looking for equivalent nodes that are both not dangled and are tagged as direct children. Then dangle when found to be identical. for ( X = W + 1; X < NodeNumberCounterInit[Y]; X++ ){ if ( TnodeDangling( HolderOfAllTnodePointers[Y][X] ) == FALSE && TnodeDirectChild( HolderOfAllTnodePointers[Y][X] ) == TRUE ) { if ( TnodeAreWeTheSame( HolderOfAllTnodePointers[Y][W], HolderOfAllTnodePointers[Y][X] ) == TRUE ){ NumberDangled += TnodeDangle( HolderOfAllTnodePointers[Y][X] ); } } } } printf( "Dangled |%d| Nodes In all, through recursion, for MaxChildDepth of |%d|\n", NumberDangled, Y ); DangleCount[Y] = NumberDangled; TotalDangled += NumberDangled; } printf( "Total Number Of Dangled Nodes |%d|\n", TotalDangled ); int NumberOfLivingNodes = TotalNodeSum - TotalDangled; printf( "Total Number Of Living Nodes |%d|\n", NumberOfLivingNodes ); printf("step 9 - Count the number of living nodes in the trie before replacement so that we check our numbers.\n"); DawgGraphTabulate( TemporaryTrie, NodeNumberCounter ); for ( X = 0; X < Result->MaxStringLength; X++ ){ printf( "Count for living nodes of MaxChildDepth |%d| is |%d|. It used to be |%d| and so the number dangled is |%d| \n" , X, NodeNumberCounter[X], NodeNumberCounterInit[X], NodeNumberCounterInit[X] - NodeNumberCounter[X] ); } int TotalDangledCheck = 0; for ( X = 0; X < MAX; X++ ) { TotalDangledCheck += (NodeNumberCounterInit[X] - NodeNumberCounter[X]); } assert( TotalDangled == TotalDangledCheck ); printf("step 10 - Dangling is complete, so replace all dangled nodes with their mexican equivelants in the Trie to make a distributed DAWG.\n"); // Node replacement has to take place before indices are set up so nothing points to redundant nodes. - This step is absolutely critical. Mow The Lawn so to speak! Then Index. DawgLawnMower( TemporaryTrie, HolderOfAllTnodePointers ); printf("step 11 - Mowing of the lawn is now complete, so start to assign array indices to all living nodes.\n step 11 - The assigning of array indices is accomplished with a breadth first queue so that all nodes with a Next refrence not equal to NULL will point to the next element of the array, this is after all an optimal DAWG.\n"); BreadthQueuePtr OrderMatters = BreadthQueueInit(); // Try to find out if the nodes we are setting are unique before we set them. int IndexCount = BreadthQueueUseToIndex( OrderMatters, HolderOfAllTnodePointers[MAX - 1][0] ); printf( "Finish\n" ); printf( "NumberOfLivingNodes from after the dangling process|%d|\n", NumberOfLivingNodes ); printf( "IndexCount from the index handing out breadth first traversal |%d|\n", IndexCount ); // We are going to likely need a FIFO queue to do a breadth first traversal but know that this will attempt to add the same nodes multiple times unless we test for it. // Assigning the array indices the way I have been doing to does not allow for the elimination of the next index reference, so a new method, using the trie structure is necessary. // Then assign array indices to the living nodes. Keep track of the total number of living nodes. The living nodes are now supposed to point to only living nodes. /// Allocate the space needed to store the "DawgArray". Result->DawgArray = (ArrayDnodePtr)calloc( (NumberOfLivingNodes + 1), sizeof( ArrayDnode ) ); int IndexFollow = 0; int IndexFollower = 0; /// Roll through the pointer arrays and use the "ArrayDnodeTnodeTranspose" function to populate it. // Set the dummy entry at the beginning of the array. ArrayDnodeInit( &(Result->DawgArray[0]), 0, 0, 0, 0, 0 ); Result->First = 1; printf("step 12 - Populate the new array dawg structure with each node being as small as possible for optimization.\n"); // Traverse the whole 2D pointer array and look for undangled nodes, if so then transpose that node into the optimal DAWG array. for ( X = Result->MaxStringLength - 1; X >= 0; X-- ){ for (W = 0; W < NodeNumberCounterInit[X]; W++ ){ if ( TnodeDangling( HolderOfAllTnodePointers[X][W] ) == FALSE ){ IndexFollow = TnodeArrayIndex( HolderOfAllTnodePointers[X][W] ); ArrayDnodeTnodeTranspose( &(Result->DawgArray[IndexFollow]), HolderOfAllTnodePointers[X][W] ); if ( IndexFollow > IndexFollower ) IndexFollower = IndexFollow; } } } printf( "IndexFollower, which is the largest index assigned in the array = |%d|\n", IndexFollower ); printf( "NumberOfLivingNodes|%d|, assert that these two are equal because they must be.\n", NumberOfLivingNodes ); assert ( IndexFollower == NumberOfLivingNodes ); // Do Garbage cleanup and free the whole Trie, which is no longer needed. So write a free function. // But never use it here because after mowing it is rendered useless. Free from the holding array. for ( X = 0; X < Result->MaxStringLength; X++ ) for ( W = 0; W < NodeNumberCounterInit[X]; W++ ) free( HolderOfAllTnodePointers[X][W] ); free( TemporaryTrie ); free( NodeNumberCounter ); free( NodeNumberCounterInit ); for ( X = 0; X < Result->MaxStringLength; X++ ) free( HolderOfAllTnodePointers[X] ); free( HolderOfAllTnodePointers ); printf("step 13 - Creation of dawg is complete, so store the optimal DAWG into a text file for verification as well as 32 and 64 bit binary files for use.\n"); FILE *Text; Text = fopen( OPTIMAL_DAWG_TEXT_DATA,"w" ); FILE *Short; FILE *Long; Short = fopen( OPTIMAL_DIRECT_DAWG_DATA_PART_ONE,"wb" ); Long = fopen( OPTIMAL_DIRECT_DAWG_DATA_PART_TWO,"wb" ); // The following variables will be used when setting up the child-offset 64 bit nodes. int CurrentNumberOfChildren = 0; char CurrentChildLetterString[NUMBER_OF_REDUCED_LETTERS + 1]; CurrentChildLetterString[0] = '\0'; char TheNodeInBinary[32+5+1]; char TheOffsetInBinary[64+16+1]; TheNodeInBinary[0] = '\0'; long int ChildOffsetNumber[NUMBER_OF_REDUCED_LETTERS]; for ( X = 0; X < NUMBER_OF_REDUCED_LETTERS; X++ ) ChildOffsetNumber[X] = 0; long int CurrentOffsetNumber; long int TotalOffsetNumber = 0; int CurrentOffsetNumberIndex; int CompleteThirtyTwoBitNode; fwrite( &NumberOfLivingNodes, 4, 1, Short ); FILE *ListE; int EndOfListCount = 0; int EOLTracker = 0; int *EndOfListIndicies; ListE = fopen ( END_OF_LIST_DATA, "wb" ); // Set up an array to hold all of the unique child strings for the reduced lexicon DAWG. The empty placeholder will be all zeds. long int NumberOfUniqueChildStrings = 0; int InsertionPoint = 0; Bool IsSheUnique = TRUE; char **HolderOfUniqueChildStrings = (char**)malloc(NumberOfLivingNodes * sizeof(char*)); for ( X = 0; X < NumberOfLivingNodes; X++ ) { HolderOfUniqueChildStrings[X] = (char*)malloc((NUMBER_OF_REDUCED_LETTERS + 1) * sizeof(char)); strcpy(HolderOfUniqueChildStrings[X], "ZZZZZZZZZZZZZZ"); } // Right here we will tabulate the child information so that it can be turned into a long int array and stored in a data file. // Also, 45 bits allows for an optimal child offset format, but tabulate how many unique values these bits hold. For TWL06 and the 14 chars chosen it is 1505, requiring 11 bits. // The idea is that there are a very small number of actual values that these 45 bits will hold due to their format and the prevalent patterns in the English Language. for ( X = 1; X <= NumberOfLivingNodes ; X++ ){ CurrentNumberOfChildren = ArrayDnodeNumberOfChildrenPlusString(Result->DawgArray, X, CurrentChildLetterString); // Insert the CurrentChildLetterString into the HolderOfUniqueChildStrings if, and only if, it is unique. for ( Y = 0; Y < NumberOfUniqueChildStrings; Y++ ) { if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) == 0 ) { IsSheUnique = FALSE; InsertionPoint = 0; break; } if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) < 0 ) { IsSheUnique = TRUE; InsertionPoint = Y; break; } IsSheUnique = TRUE; InsertionPoint = Y + 1; } if ( IsSheUnique == TRUE ) { for ( Y = NumberOfUniqueChildStrings; Y > InsertionPoint; Y-- ) { strcpy(HolderOfUniqueChildStrings[Y], HolderOfUniqueChildStrings[Y-1]); } strcpy(HolderOfUniqueChildStrings[InsertionPoint], CurrentChildLetterString); NumberOfUniqueChildStrings += 1; } } long int *ParallelChildOffsetValues = (long int*)malloc(NumberOfUniqueChildStrings*sizeof(long int)); // Convert the unique child strings into the equivalent bitwise offset numbers, and store these values in a parallel array. for ( X = 0; X < NumberOfUniqueChildStrings ; X++ ){ strcpy(CurrentChildLetterString, HolderOfUniqueChildStrings[X]); CurrentNumberOfChildren = strlen(CurrentChildLetterString); for ( Y = 0; Y < CurrentNumberOfChildren; Y++ ) { CurrentOffsetNumber = Y + 1; CurrentOffsetNumber <<= BitShiftMe[(CurrentChildLetterString[Y] - 'A')]; ChildOffsetNumber[LetterArrayLocation[(CurrentChildLetterString[Y] - 'A')]] = CurrentOffsetNumber; } for ( Y = 0; Y < NUMBER_OF_REDUCED_LETTERS; Y++ ) { TotalOffsetNumber += ChildOffsetNumber[Y]; ChildOffsetNumber[Y] = 0; } ParallelChildOffsetValues[X] = TotalOffsetNumber; TotalOffsetNumber = 0; } // We are now ready to output to the 32 bit node data file and the text file. for ( X = 1; X <= NumberOfLivingNodes ; X++ ){ CurrentNumberOfChildren = ArrayDnodeNumberOfChildrenPlusString(Result->DawgArray, X, CurrentChildLetterString); // Get the correct offset index to store into the current node for ( Y = 0; Y < NumberOfUniqueChildStrings; Y++ ) { if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) == 0 ) { CurrentOffsetNumberIndex = Y; break; } } CompleteThirtyTwoBitNode = CurrentOffsetNumberIndex; CompleteThirtyTwoBitNode <<= 15; CompleteThirtyTwoBitNode += (Result->DawgArray)[X].Child; // Bits (0, 14) are for the first child index. Bits (15, 25) contain the offset index. Bit 26 then represents the EOW flag. if ( (Result->DawgArray)[X].EndOfWordFlag == 1 ) CompleteThirtyTwoBitNode += 67108864; fwrite( &CompleteThirtyTwoBitNode, 4, 1, Short ); ConvertIntNodeToBinaryString(CompleteThirtyTwoBitNode, TheNodeInBinary); ConvertOffsetLongIntToBinaryString(ParallelChildOffsetValues[CurrentOffsetNumberIndex], TheOffsetInBinary); fprintf( Text, "N%d-%s,Lev|%d|,{'%c',%d,%d,%d},Childs|%d|-|%s|,Raw|%d|,Offset%s-|%ld|.\n", X, TheNodeInBinary, (Result->DawgArray)[X].Level, (Result->DawgArray)[X].Letter, (Result->DawgArray)[X].EndOfWordFlag, (Result->DawgArray)[X].Next, (Result->DawgArray)[X].Child, CurrentNumberOfChildren, CurrentChildLetterString, CompleteThirtyTwoBitNode, TheOffsetInBinary, ParallelChildOffsetValues[CurrentOffsetNumberIndex] ); if ( CompleteThirtyTwoBitNode == 0 ) printf("The Fuck You Do\n"); TotalOffsetNumber = 0; } fclose(Short); fwrite( &NumberOfUniqueChildStrings, 8, 1, Long ); for ( X = 0; X < NumberOfUniqueChildStrings; X++ ) { fwrite( &ParallelChildOffsetValues[X], 8, 1, Long ); } fclose(Long); for ( X = 0; X < NumberOfUniqueChildStrings; X++ ) { printf("#%d - |%s| - |%ld| - \n", X, HolderOfUniqueChildStrings[X], ParallelChildOffsetValues[X] ); } fprintf(Text, "Number Of Living Nodes |%d| Plus The NULL Node. Also, there are %ld child offset long ints.\n\n", NumberOfLivingNodes, NumberOfUniqueChildStrings); for ( X = 0; X < NumberOfUniqueChildStrings; X++ ) { fprintf(Text, "#%d - |%s| - |%ld|\n", X, HolderOfUniqueChildStrings[X], ParallelChildOffsetValues[X] ); } // free all of the memory used to compile the two part DAWG encoding. for ( X = 0; X < NumberOfLivingNodes; X++ ) { free(HolderOfUniqueChildStrings[X]); } free(HolderOfUniqueChildStrings); free(ParallelChildOffsetValues); printf( "\nThe Number Of Unique Child Strings Has Been Found to be |%ld|.\n\n", NumberOfUniqueChildStrings ); for ( X = 1; X <= NumberOfLivingNodes; X++ ) { if ( (Result->DawgArray)[X].Next == 0 ){ EndOfListCount += 1; } } EndOfListIndicies = (int*)malloc(EndOfListCount * sizeof(int)); fwrite( &EndOfListCount, 4, 1, ListE ); for ( X = 1; X <= NumberOfLivingNodes; X++ ) { if ( (Result->DawgArray)[X].Next == 0 ){ EndOfListIndicies[EOLTracker] = X; EOLTracker += 1; } } printf("EndOfListCount |%d|\n", EndOfListCount); for ( X = 0; X < EndOfListCount; X++ ) { fwrite( &EndOfListIndicies[X], 4, 1, ListE ); } fclose(ListE); fprintf(Text, "\nEndOfListCount |%d|\n\n", EndOfListCount); for ( X = 0; X < EndOfListCount; X++ ) { fprintf(Text, "#%d - |%d|\n", X, EndOfListIndicies[X]); } fclose(Text); printf( "Out of text, 32 bit, and 64 bit output to file clean.\n" ); FILE *Data; Data = fopen( OPTIMAL_DAWG_DATA,"wb" ); int CurrentNodeInteger = NumberOfLivingNodes; // It is critical especially in a binary file that the first integer written to the file be the number of nodes stored in the file. fwrite( &CurrentNodeInteger, 4, 1, Data ); // Write the NULL node to the file first. CurrentNodeInteger = 0; fwrite( &CurrentNodeInteger, 4, 1, Data ); for ( X = 1; X <= NumberOfLivingNodes ; X++ ){ CurrentNodeInteger = (Result->DawgArray)[X].Child; CurrentNodeInteger <<= 5; CurrentNodeInteger += ((Result->DawgArray)[X].Letter) - 'A'; if ( (Result->DawgArray)[X].EndOfWordFlag == TRUE ) CurrentNodeInteger += 8388608; if ( (Result->DawgArray)[X].Next == 0 ) CurrentNodeInteger += 4194304; fwrite( &CurrentNodeInteger, 4, 1, Data ); } fclose(Data); printf( "Out of 32 bit traditional data output to file clean.\n" ); return Result; } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// char ArrayDnodeWordSearchRecurse( ArrayDnodePtr DoggieDog, int Index, char *PotWord ){ if ( Index == 0 ) return FALSE; if ( (DoggieDog[Index].Letter) == PotWord[0] ){ printf( "Found first letter |%c| in |%s|\n", PotWord[0], PotWord ); if ( PotWord[1] == '\0' && DoggieDog[Index].EndOfWordFlag == TRUE ) return TRUE; else { return ArrayDnodeWordSearchRecurse( DoggieDog, DoggieDog[Index].Child, &(PotWord[1]) ); printf( "Not quite a match yet, continue.\n" ); } } else { printf( "|%c| in dawg does not equal |%c| in the word\n", DoggieDog[Index].Letter, PotWord[0] ); if ( (DoggieDog[Index].Letter) > PotWord[0] ){ printf( "|%c| is past the letter we need |%c|\n", DoggieDog[Index].Letter, PotWord[0] ); return FALSE; } else{ printf("Keep looking down the line\n"); return ArrayDnodeWordSearchRecurse( DoggieDog, DoggieDog[Index].Next, PotWord ); } } } char ArrayDawgWordSearch( ArrayDawgPtr ThisArrayDawg, char *PotWord ){ printf("in\n"); return ArrayDnodeWordSearchRecurse( ThisArrayDawg->DawgArray, 1, PotWord ); } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// int main() { printf( "The main function lives\n" ); int X = 0; // All of the words of similar length will be stored sequentially in the same array so that there will be (MAX + 1) arrays in total. //The Smallest length of a string is assumed to be 2. char *AllWordsInEnglish[MAX + 1]; for ( X = 0; X < (MAX + 1); X++ ) AllWordsInEnglish[X] = NULL; FILE *Input; Input = fopen( RAW_LEXICON,"r" ); char ThisLine[100] = "\0"; int FirstLineIsSize = 0; int LineLengthWithNl = 0; int LineLength = 0; fgets( ThisLine, 100, Input ); // Correction is needed to get rid of the new line character that fgets appends to the word. LineLengthWithNl = strlen( ThisLine ); LineLength = LineLengthWithNl - 1; ThisLine[LineLength] = '\0'; FirstLineIsSize = StringToPositiveInt( ThisLine ); printf( "FirstLineIsSize is |%d|\n", FirstLineIsSize ); int KeepTracker[MAX + 1]; for ( X = 0; X <= MAX; X++ ) KeepTracker[X] = 0; char **LexiconInRam = (char**)malloc( FirstLineIsSize*sizeof( char* ) ); // The first line gives us the number of words so read in all of them to ram temporarily. for ( X = 0; X < FirstLineIsSize; X++ ) { fgets( ThisLine, 100, Input ); LineLengthWithNl = strlen( ThisLine ); LineLength = LineLengthWithNl - 1; ThisLine[LineLength] = '\0'; //if ( LineLength <= MAX ) { if ( LineLength <= MAX ) KeepTracker[LineLength] += 1; LexiconInRam[X] = (char*)malloc( (LineLength + 1)*sizeof( char ) ); strcpy( LexiconInRam[X], ThisLine ); //} } printf( "Ram write complete\n" ); for ( X = 0; X < (MAX + 1); X++ ) printf( "There are |%d| words of length |%d|\n", KeepTracker[X], X ); // Allocate enough space to hold all of the words in strings so that we can add them to the trie by length. for ( X = 2; X < (MAX + 1); X++ ) AllWordsInEnglish[X] = (char*)malloc( (X + 1)*KeepTracker[X]*sizeof( char ) ); printf( "Initial malloc() complete\n" ); int CurrentTracker[MAX + 1]; for ( X = 0; X < (MAX + 1); X++ ) CurrentTracker[X] = 0; int CurrentLength = 0; // Copy all of the strings into the halfway house 1. for ( X = 0; X < FirstLineIsSize; X++ ) { CurrentLength = strlen( LexiconInRam[X] ); //printf("|%s| - |%d|\n", LexiconInRam[X], CurrentLength ); // As convoluted as this command might seem, it simply copies a string from its temporary ram location to the array of length equivelant strings for processing in making the DAWG. if ( CurrentLength <= MAX ) strcpy( &((AllWordsInEnglish[CurrentLength])[(CurrentTracker[CurrentLength]*(CurrentLength + 1))]), LexiconInRam[X] ); CurrentTracker[CurrentLength] += 1; } printf( "halfway house copy complete\n" ); // Make sure that the counting has resulted in all of the strings being placed correctly. for ( X = 0; X < (MAX + 1); X++ ) assert( KeepTracker[X] == CurrentTracker[X] ); // Free the initial ram read space for ( X = 0; X < FirstLineIsSize; X++ ) free( LexiconInRam[X] ); free( LexiconInRam ); clock_t start_clock = clock(); int DictionarySizeIndex[MAX + 1]; for ( X = 0; X <= MAX; X++ ) DictionarySizeIndex[X] = KeepTracker[X]; printf( "I will now start the init function.\n" ); ArrayDawgPtr Adoggy = ArrayDawgInit( AllWordsInEnglish, DictionarySizeIndex, MAX ); printf( "The Init is over, so move on to a traditional DAWG word search. This will clearly demonstrate the list-based DAWG's shortcomings.\n" ); // Use an initial word to get into the search. char Query[INPUT_LIMIT +1] = "AGE"; while ( strlen(Query) > 1 ){ MakeMeAllCapital( Query ); if ( ArrayDawgWordSearch( Adoggy, Query ) == TRUE ) printf( "\nIIIIIIIIII - Found |%s|\n\n", Query ); else printf( "\nXXXXXXXXXX - |%s| NOT FOUND\n\n", Query ); printf( "Enter the word that you want to look for...(<2 letters to exit):" ); fgets(Query, INPUT_LIMIT, stdin); CutOffNewLine(Query); } char Outer[INPUT_LIMIT +1]; printf( "Processor Time Used: %g sec... --press enter to exit--", (clock() - start_clock) / (double) CLOCKS_PER_SEC ); fgets(Outer, INPUT_LIMIT, stdin); return 0; }

ADTDAWG_Creator_Part_2.c - This Program Is Non-Trivial

// The second part of the creation procedure is responsible for reording the lists, and deleting reduntant ones. #include <stdlib.h> #include <math.h> #include <stdio.h> #include <string.h> #include <assert.h> #define OPTIMAL_DIRECT_DAWG_DATA_PART_ONE "Optimal_Part_One_Direct_Dawg_For_Lexicon_14.dat" #define OPTIMAL_DIRECT_DAWG_DATA_PART_TWO "Optimal_Part_Two_Direct_Dawg_For_Lexicon_14.dat" #define END_OF_LIST_DATA "End_Of_List_Nodes_Direct_Dawg_For_Lexicon_14.dat" #define DRUMPSTER "Numbered_Lexicon_14.txt" #define FOUR_PART_DTDAWG_14_PART_ONE "Four_Part_1_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_TWO "Four_Part_2_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_THREE "Four_Part_3_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_FOUR "Four_Part_4_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_TEXT "Four_Part_DTDAWG_For_Lexicon_14.txt" #define SIZE_OF_CHARACTER_SET 14 #define MAX_WORD_LENGTH 15 #define END_OF_WORD_FLAG 67108864 #define CHILD_MASK 32767 #define OFFSET_INDEX_MASK 67076096 #define OffSET_BIT_SHIFT 15 typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; char CharacterSet[SIZE_OF_CHARACTER_SET] = {'A', 'C', 'D', 'E', 'G', 'I', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T'}; long int ChildLetterBitMasks[26] = {1, 0, 6, 24, 224, 0, 1792, 0, 14336, 0, 0, 114688, 1966080, 31457280, 503316480, 8053063680, 0, 128849018880, 2061584302080, 32985348833280, 0, 0, 0, 0, 0, 0}; int ChildLetterBitShifts[26] = {0, -999, 1, 3, 5, -999, 8, -999, 11, -999, -999, 14, 17, 21, 25, 29, -999, 33, 37, 41, -999, -999, -999, -999, -999, -999}; FILE *WhatR; // The BinaryNode string must be at least 32 + 5 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertIntNodeToBinaryString( int TheNode, char* BinaryNode){ int X; int Bit; BinaryNode[0] = '|'; // Bit 31, the leftmost bit is the sign bit. BinaryNode[1] = (TheNode < 0)?'1':'0'; // Bit 30 to bit 27 are unused bits in a 32 bit node. BinaryNode[2] = (TheNode & (int)pow(2,30))?'1':'0'; BinaryNode[3] = (TheNode & (int)pow(2,29))?'1':'0'; BinaryNode[4] = (TheNode & (int)pow(2,28))?'1':'0'; BinaryNode[5] = (TheNode & (int)pow(2,27))?'1':'0'; BinaryNode[6] = '|'; // Bit 26 is the boolean EOW flag for end of word. BinaryNode[7] = (TheNode & (int)pow(2,26))?'1':'0'; BinaryNode[8] = '|'; // Bit 25 to bit 15 represent the child offset index. Bit = 25; for ( X = 9; X <= 19; X++ ) { BinaryNode[X] = (TheNode & (int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryNode[20] = '|'; // The first child index that requires 15 bits will finish off the 32 bit int. Bit = 14; for ( X = 21; X <= 35; X++ ) { BinaryNode[X] = (TheNode & (int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryNode[36] = '|'; BinaryNode[37] = '\0'; } // This function is used to make a text file for verification purposes. // The BinaryOffset string must be at least 64 + 16 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertOffsetLongIntToBinaryString( long int TheOffset, char* BinaryOffset){ int X; int Bit; BinaryOffset[0] = '|'; // Bit 63, the leftmost bit is the sign bit. BinaryOffset[1] = (TheOffset < 0)?'1':'0'; // Bit 62 to bit 45 are not used with the 14 char set chosen for this DAWG. 18 Unused bits. All bits including the sign bit will be used with an 18 character set. Bit = 62; for ( X = 2; X <= 19; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[20] = '|'; // The seven letters that require a 4 bit offset encoding start here --------------------------------------------------- // Bit [44,41] represent the "T" child. Bit = 44; for ( X = 21; X <= 24; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[25] = '|'; // Bit [40,37] represent the "S" child. Bit = 40; for ( X = 26; X <= 29; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[30] = '|'; // Bit [36,33] represent the "R" child. Bit = 36; for ( X = 31; X <= 34; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[35] = '|'; // Bit [32,29] represent the "P" child. Bit = 32; for ( X = 36; X <= 39; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[40] = '|'; // Bit [28,25] represent the "O" child. Bit = 28; for ( X = 41; X <= 44; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[45] = '|'; // Bit [24,21] represent the "N" child. Bit = 24; for ( X = 46; X <= 49; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[50] = '|'; // Bit [20,17] represent the "M" child. Bit = 20; for ( X = 51; X <= 54; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[55] = '|'; // The four letters that require a 3 bit offset encoding start here --------------------------------------------------- // Bit [16,14] represent the "L" child. Bit = 16; for ( X = 56; X <= 58; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[59] = '|'; // Bit [13,11] represent the "I" child. Bit = 13; for ( X = 60; X <= 62; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[63] = '|'; // Bit [10,8] represent the "G" child. Bit = 10; for ( X = 64; X <= 66; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[67] = '|'; // Bit [7,5] represent the "E" child. Bit = 7; for ( X = 68; X <= 70; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[71] = '|'; // The two letters that require a 2 bit offset encoding start here --------------------------------------------------- // Bit [4,3] represent the "G" child. Bit = 4; for ( X = 72; X <= 73; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[74] = '|'; // Bit [2,1] represent the "E" child. Bit = 2; for ( X = 75; X <= 76; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[77] = '|'; // Bit 0 represent the "A" child. BinaryOffset[78] = (TheOffset & (long int)pow(2,0))?'1':'0'; BinaryOffset[79] = '|'; BinaryOffset[80] = '\0'; } void ExtractChildStringFromLongInt(long int TheOffset, char *TheChildren){ int X; int FillThisSpot = 0; char CurrentChar; for ( X = 0; X < SIZE_OF_CHARACTER_SET; X++ ) { CurrentChar = CharacterSet[X]; if ( TheOffset & ChildLetterBitMasks[CurrentChar - 'A'] ) { TheChildren[FillThisSpot] = CurrentChar; FillThisSpot += 1; } } TheChildren[FillThisSpot] = '\0'; } int TraverseTheDawgArrayRecurse(int *TheDawg, long int *TheOffsets, int *SuckOnIt, int CurrentIndex, char *TheWordSoFar, int FillThisPosition, char CurrentLetter, int *WordCounter, Bool PrintMe){ int X; long int CurrentOffset; int CurrentChild; int WhatsBelowMe = 0; TheWordSoFar[FillThisPosition] = CurrentLetter; if ( CurrentChild = (TheDawg[CurrentIndex] & CHILD_MASK) ) { for ( X = SIZE_OF_CHARACTER_SET - 1; X >= 0 ; X-- ) { // This assignment statement inside of a conditional statement seems very convoluted but all that it does is find a number between 0 and 14 // using two numbers that need bit masking and bit shifting. if ( CurrentOffset = (TheOffsets[(TheDawg[CurrentIndex] & OFFSET_INDEX_MASK) >> OffSET_BIT_SHIFT] & ChildLetterBitMasks[CharacterSet[X] - 'A']) ) { CurrentOffset >>= ChildLetterBitShifts[CharacterSet[X] - 'A']; CurrentOffset -= 1; WhatsBelowMe += TraverseTheDawgArrayRecurse(TheDawg, TheOffsets, SuckOnIt, CurrentChild + (int)CurrentOffset, TheWordSoFar, FillThisPosition + 1, CharacterSet[X], WordCounter, PrintMe); } } } if ( TheDawg[CurrentIndex] & END_OF_WORD_FLAG ) { *WordCounter +=1; TheWordSoFar[FillThisPosition+ 1] = '\0'; if ( PrintMe == TRUE ) printf("#|%d| - |%s|\n", *WordCounter, TheWordSoFar); //if ( PrintMe == TRUE ) fprintf(What, "#|%d| - |%s|\n", *WordCounter, TheWordSoFar); if ( PrintMe == FALSE ) fprintf(WhatR, "#|%d| - |%s|\n", *WordCounter, TheWordSoFar); WhatsBelowMe += 1; } SuckOnIt[CurrentIndex] = WhatsBelowMe; return WhatsBelowMe; } void TraverseTheDawgArray(int *TheDawg, long int *TheOffsets, int *BelowingMe, Bool PrintToScreen){ int X; int TheCounter = 0; char RunningWord[MAX_WORD_LENGTH +1]; for ( X = SIZE_OF_CHARACTER_SET - 1; X >= 0 ; X-- ) { TraverseTheDawgArrayRecurse(TheDawg, TheOffsets, BelowingMe, X + 1, RunningWord, 0, CharacterSet[X], &TheCounter, PrintToScreen); } } // The Governing Equation of The Four Part DTDAWG: NextMarker = CurrentMarker - FCWTEOBL + NCWTEOBL - Flag; WTEOBL means Words to end of branch list. // FC means FirstChild. NC means NextChild, the one we want to go to. void TraverseTheFourPartDawgArrayRecurse(int *One, long int *Two, int *Three, int CurrentIndex, char *TheWordSoFar, int FillThisPosition, char CurrentLetter, int CurrentMarker){ int X; long int CurrentOffset; int CurrentChild; int NextMarker; int CurrentNCWTEOBL; TheWordSoFar[FillThisPosition] = CurrentLetter; if ( One[CurrentIndex] & END_OF_WORD_FLAG ) { TheWordSoFar[FillThisPosition+ 1] = '\0'; printf("#|%d| - |%s|\n", CurrentMarker, TheWordSoFar); } if ( CurrentChild = (One[CurrentIndex] & CHILD_MASK) ) { // Begin the calculation of the NextMarker; the part that is common to all children. NextMarker = CurrentMarker - Three[CurrentChild]; if ( One[CurrentIndex] & END_OF_WORD_FLAG ) NextMarker -= 1; for ( X = 0; X < SIZE_OF_CHARACTER_SET ; X++ ) { // This assignment statement inside of an if statement seems very convoluted but all that it does, with the help of the next line, // is find a number between 0 and 14, using two numbers that need bit masking and bit shifting. if ( CurrentOffset = (Two[(One[CurrentIndex] & OFFSET_INDEX_MASK) >> OffSET_BIT_SHIFT] & ChildLetterBitMasks[CharacterSet[X] - 'A']) ) { CurrentOffset >>= ChildLetterBitShifts[CharacterSet[X] - 'A']; CurrentOffset -= 1; CurrentNCWTEOBL = Three[CurrentChild + (int)CurrentOffset]; NextMarker += CurrentNCWTEOBL; TraverseTheFourPartDawgArrayRecurse(One, Two, Three, CurrentChild + (int)CurrentOffset, TheWordSoFar, FillThisPosition + 1, CharacterSet[X], NextMarker); NextMarker -= CurrentNCWTEOBL; } } } } void TraverseTheFourPartDawgArray(int *First, long int *Second, int *Third){ int X; char RunningWord[MAX_WORD_LENGTH +1]; for ( X = 0; X < SIZE_OF_CHARACTER_SET ; X++ ) { TraverseTheFourPartDawgArrayRecurse(First, Second, Third, (X + 1), RunningWord, 0, CharacterSet[X], Third[X + 1]); } } int main(){ int X; int Y; int NumberOfPartOneNodes; int NumberOfEndOfLists; long int NumberOfPartTwoNodes; int *PartOneArray; long int *PartTwoArray; int CurrentCount; int CurrentEndOfList; int FurthestBigNode = 0; WhatR = fopen(DRUMPSTER, "w"); // We are going to need to know the number of words in and below each node, then we can compute number till the end of the child list which is of paramount importance in word tracking. // When we have calculated the NumberOfWordsToEndOfBranchList, the NumberOfWordsBelowMe can be calculated with max 1 subtraction. int *NumberOfWordsBelowMe; int *NumberOfWordsToEndOfBranchList; int *RearrangedNumberOfWordsToEndOfBranchList; // In order to calculate NumberOfWordsToEndOfBranchList we need to know the location of the closest EndOfList node. int* EndOfListLocations; FILE *PartOne; FILE *PartTwo; // The PartThree data file requires rearranging of the PartOne node data so that the most WordsToEndOfBranchList numbers can be housed in a single unsigned char. // Only lists with nodes containing a number greater then 255 will need more bits. // PartTwo will remain untouched. FILE *ListE; FILE *Text; PartOne = fopen(OPTIMAL_DIRECT_DAWG_DATA_PART_ONE, "rb"); PartTwo = fopen(OPTIMAL_DIRECT_DAWG_DATA_PART_TWO, "rb"); ListE = fopen(END_OF_LIST_DATA, "rb"); Text = fopen(FOUR_PART_DTDAWG_14_TEXT, "w"); fread(&NumberOfPartOneNodes, 4, 1, PartOne); fread(&NumberOfPartTwoNodes, 8, 1, PartTwo); fread(&NumberOfEndOfLists, 4, 1, ListE); PartOneArray = (int *)malloc((NumberOfPartOneNodes + 1) * sizeof(int)); PartTwoArray = (long int *)malloc(NumberOfPartTwoNodes * sizeof(long int)); NumberOfWordsBelowMe = (int *)malloc((NumberOfPartOneNodes + 1) * sizeof(int)); EndOfListLocations = (int *)malloc(NumberOfEndOfLists * sizeof(int)); NumberOfWordsToEndOfBranchList =(int *)malloc((NumberOfPartOneNodes + 1) * sizeof(int)); RearrangedNumberOfWordsToEndOfBranchList =(int *)malloc((NumberOfPartOneNodes + 1) * sizeof(int)); NumberOfWordsBelowMe[0] = 0; NumberOfWordsToEndOfBranchList[0] = 0; RearrangedNumberOfWordsToEndOfBranchList[0] = 0; PartOneArray[0] = 0; for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { fread(&PartOneArray[X], 4, 1, PartOne); } for ( X = 0; X < NumberOfPartTwoNodes; X++ ) { fread(&PartTwoArray[X], 8, 1, PartTwo); } for ( X = 0; X < NumberOfEndOfLists; X++ ) { fread(&EndOfListLocations[X], 4, 1, ListE); } fclose(PartOne); fclose(PartTwo); fclose(ListE); // This function is run at this point in time to fill the NumberOfWordsBelowMe array. TraverseTheDawgArray(PartOneArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); CurrentEndOfList = 0; // This little piece of code compiles the NumberOfWordsToEndOfBranchList array. The requirements are the NumberOfWordsBelowMe array and the EndOfListLocations array. for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { CurrentCount = 0; for ( Y = X; Y <= EndOfListLocations[CurrentEndOfList]; Y++ ) { CurrentCount += NumberOfWordsBelowMe[Y]; } NumberOfWordsToEndOfBranchList[X] = CurrentCount; //printf("Node|%d|, WordsBelowMe|%d|, WordsToEndOfBranch|%d|, EndOfBranch|%d|\n", X, NumberOfWordsBelowMe[X], NumberOfWordsToEndOfBranchList[X], EndOfListLocations[CurrentEndOfList]); if ( X == EndOfListLocations[CurrentEndOfList] )CurrentEndOfList +=1; } CurrentEndOfList = 0; // Now with preliminary analysis complete, it is time to rearrange the PartOne nodes and then set up PartThree. int ListSizeCounter[SIZE_OF_CHARACTER_SET + 1]; int TotalNumberOfLists = 0; Bool AreWeInBigList = FALSE; int TheCurrentChild; int StartOfCurrentList = 1; int SizeOfCurrentList = EndOfListLocations[0]; int EndOfCurrentList = EndOfListLocations[0]; int InsertionPoint = 1; int CurrentlyCopyingThisList = 0; int *CurrentAdjustments; int *PartOneRearrangedArray; PartOneRearrangedArray = (int *)malloc((NumberOfPartOneNodes + 1) * sizeof(int)); CurrentAdjustments = (int *)malloc((NumberOfPartOneNodes + 1) * sizeof(int)); PartOneRearrangedArray[0] = 0; for ( X = 0; X <= NumberOfPartOneNodes; X++ ) CurrentAdjustments[X] = 0; for ( X = 0; X <= SIZE_OF_CHARACTER_SET; X++ ) ListSizeCounter[X] = 0; // This code is responsible for the rearrangement of the node lists inside of the DAWG int array. CurrentEndOfList = 0; for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { if ( NumberOfWordsToEndOfBranchList[X] > 255 ) AreWeInBigList = TRUE; if ( X == EndOfCurrentList ) { ListSizeCounter[SizeOfCurrentList] += 1; // We are now at the end of a big list that must to be moved up to the InsertionPoint. This also implies moving everything between its current location and its new one. if ( AreWeInBigList == TRUE ) { // First step is to copy the CurrentList into the new array at its correct position. for ( Y = 0; Y < SizeOfCurrentList; Y++ ) { PartOneRearrangedArray[InsertionPoint + Y] = PartOneArray[StartOfCurrentList + Y]; RearrangedNumberOfWordsToEndOfBranchList[InsertionPoint + Y] = NumberOfWordsToEndOfBranchList[StartOfCurrentList + Y]; } // The following steps are required when we are actually moving the position of a list. The first set of lists will bypass these steps. if ( InsertionPoint != StartOfCurrentList ) { // Step 2 is to move all of the nodes between the original and final location, "SizeOfCurrentList" number of places back, starting from the end. for ( Y = EndOfCurrentList; Y >= (InsertionPoint + SizeOfCurrentList); Y-- ) { PartOneArray[Y] = PartOneArray[Y - SizeOfCurrentList]; NumberOfWordsToEndOfBranchList[Y] = NumberOfWordsToEndOfBranchList[Y - SizeOfCurrentList]; } // Step 3 is to copy the list we are moving up from the rearranged array back into the original. for ( Y = InsertionPoint; Y < (InsertionPoint + SizeOfCurrentList); Y++ ) { PartOneArray[Y] = PartOneRearrangedArray[Y]; NumberOfWordsToEndOfBranchList[Y] = RearrangedNumberOfWordsToEndOfBranchList[Y]; } // Step 4 is to fill the CurrentAdjustments array with the amount that each child references must be moved. The two arrays are identical now up to the new insertion point. // At this stage, the CurrentAdjustments array is all zeros. for ( Y = 1; Y <= NumberOfPartOneNodes; Y++ ) { TheCurrentChild = (PartOneArray[Y] & CHILD_MASK); if ( (TheCurrentChild >= InsertionPoint) && (TheCurrentChild < StartOfCurrentList) ) CurrentAdjustments[Y] = SizeOfCurrentList; if ( (TheCurrentChild >= StartOfCurrentList) && (TheCurrentChild <= EndOfCurrentList) ) CurrentAdjustments[Y] = InsertionPoint - StartOfCurrentList; } // Step 5 is to fix all of the child reference values in both of the arrays. // Start with the rearranged array. for ( Y = 1; Y < (InsertionPoint + SizeOfCurrentList); Y++ ) { if ( CurrentAdjustments[Y] != 0 ) PartOneRearrangedArray[Y] += CurrentAdjustments[Y]; } // Finish with the original array. Make sure to zero all the values after the adjustments have been made to get ready for the next round. for ( Y = 1; Y <= NumberOfPartOneNodes; Y++ ) { if ( CurrentAdjustments[Y] != 0 ) { PartOneArray[Y] += CurrentAdjustments[Y]; CurrentAdjustments[Y] = 0; } } } // Step 7 is to set the new InsertionPoint and change the EndOfListLocations, so that they reflect the shift. InsertionPoint += SizeOfCurrentList; // Shift all of the end of lists between the currently copying and the CurrentEndOfList. for ( Y = CurrentEndOfList; Y > CurrentlyCopyingThisList; Y-- ) { EndOfListLocations[Y] = EndOfListLocations[Y - 1] + SizeOfCurrentList; } EndOfListLocations[CurrentlyCopyingThisList] = InsertionPoint - 1; CurrentlyCopyingThisList += 1; } // Even when we are not in a big list, we still need to update the current list parameters. CurrentEndOfList +=1; SizeOfCurrentList = EndOfListLocations[CurrentEndOfList] - EndOfCurrentList; EndOfCurrentList = EndOfListLocations[CurrentEndOfList]; StartOfCurrentList = X + 1; AreWeInBigList = FALSE; } } // Step 8 is to copy all of the small lists from the original array to the rearranged array. All of the references should be properly adjusted at this point. for ( X = InsertionPoint; X <= NumberOfPartOneNodes; X++ ) { PartOneRearrangedArray[X] = PartOneArray[X]; RearrangedNumberOfWordsToEndOfBranchList[X] = NumberOfWordsToEndOfBranchList[X]; } // The rearrangement of the DAWG lists to reduce size of the PartThree data file is complete, so check if the new and old lists are identical, because they should be. for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { if ( PartOneArray[X] != PartOneRearrangedArray[X] ) printf("What A Mistake!\n"); assert(PartOneArray[X] == PartOneRearrangedArray[X]); assert(RearrangedNumberOfWordsToEndOfBranchList[X] == NumberOfWordsToEndOfBranchList[X]); } // The two arrays are now identical, so as a final precaution, traverse the rearranged array. TraverseTheDawgArray(PartOneRearrangedArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); // Check for duplicate lists. It is now highly likely that there are duplicates printf("\n"); // Add up the total number of lists. for ( X = 1; X <= SIZE_OF_CHARACTER_SET; X++ ) { TotalNumberOfLists += ListSizeCounter[X]; } int **NodeListsBySize = (int **)malloc((SIZE_OF_CHARACTER_SET + 1) * sizeof(int *)); int WhereWeAt[SIZE_OF_CHARACTER_SET + 1]; for ( X = 0; X <= SIZE_OF_CHARACTER_SET; X++ ) WhereWeAt[X] = 0; for ( X = 1; X <= SIZE_OF_CHARACTER_SET; X++ ) { NodeListsBySize[X] = (int *)malloc(ListSizeCounter[X] * sizeof(int)); } // We are now required to fill the NodeListsBySize array. Simply copy over the correct EndOfList information. // Note that the EndOfList information reflects the readjustment that just took place. CurrentEndOfList = 0; EndOfCurrentList = EndOfListLocations[0]; SizeOfCurrentList = EndOfListLocations[0]; for ( X = 0; X < NumberOfEndOfLists; X++ ) { (NodeListsBySize[SizeOfCurrentList])[WhereWeAt[SizeOfCurrentList]] = EndOfCurrentList; WhereWeAt[SizeOfCurrentList] += 1; CurrentEndOfList += 1; SizeOfCurrentList = EndOfListLocations[CurrentEndOfList] - EndOfCurrentList; EndOfCurrentList = EndOfListLocations[CurrentEndOfList]; } int Z; int V; int W; int TheNewChild; int TotalNumberOfKilledLists = 0; int NewNumberOfKilledLists = -1; int InspectThisEndOfList; int MaybeReplaceWithThisEndOfList; int NumberOfListsKilledThisRound = -1; int CurrentNumberOfPartOneNodes = NumberOfPartOneNodes; Bool EliminateCurrentList = TRUE; // Keep attempting to kill lists until there are no more redundant lists. It turns out that a single run does the trick. //Without an explicit character member in each node, many of the lists are redundant, run a series of tests to exterminate these lists. while ( NumberOfListsKilledThisRound != 0 ) { printf("Run a redundant check:\n"); NumberOfListsKilledThisRound = 0; for ( X = 1; X <= SIZE_OF_CHARACTER_SET; X++ ) { printf("Eliminate Lists of Size |%d|\n", X); while ( NewNumberOfKilledLists != 0 ) { NewNumberOfKilledLists = 0; for ( Y = 1; Y < ListSizeCounter[X]; Y++ ) { InspectThisEndOfList = (NodeListsBySize[X])[Y]; for ( Z = 0; Z < Y; Z++ ) { MaybeReplaceWithThisEndOfList = (NodeListsBySize[X])[Z]; for( W = 0; W < X; W++ ) { if ( PartOneArray[InspectThisEndOfList - W] != PartOneArray[MaybeReplaceWithThisEndOfList - W] ) { EliminateCurrentList = FALSE; break; } } // When eliminating a list, make sure to adjust the WTEOBL data. if ( EliminateCurrentList == TRUE ) { // Step 1 - replace all references to the duplicate list with the earlier equivalent. for( V = 1; V <= CurrentNumberOfPartOneNodes; V++ ) { TheCurrentChild = (PartOneArray[V] & CHILD_MASK); if ( (TheCurrentChild > (InspectThisEndOfList - X)) && (TheCurrentChild <= InspectThisEndOfList) ) { TheNewChild = MaybeReplaceWithThisEndOfList - (InspectThisEndOfList - TheCurrentChild); PartOneArray[V] -= TheCurrentChild; PartOneArray[V] += TheNewChild; } } // Step 2 - eliminate the dupe list by moving the higher lists forward. for ( V = (InspectThisEndOfList - X + 1); V <= (CurrentNumberOfPartOneNodes - X); V++ ) { PartOneArray[V] = PartOneArray[V + X]; RearrangedNumberOfWordsToEndOfBranchList[V] = RearrangedNumberOfWordsToEndOfBranchList[V + X]; } // Step 3 - change CurrentNumberOfPartOneNodes. CurrentNumberOfPartOneNodes -= X; // Step 4 - lower all references to the nodes coming after the dupe list. for ( V = 1; V <= CurrentNumberOfPartOneNodes; V++ ) { TheCurrentChild = (PartOneArray[V] & CHILD_MASK); if ( TheCurrentChild > InspectThisEndOfList ){ PartOneArray[V] -= X; } } // Step 5 - readjust all of the lists after Y forward 1 and down X to the value, and lower ListSizeCounter[X] by 1. for( V = Y; V < (ListSizeCounter[X] - 1); V++ ) { (NodeListsBySize[X])[V] = (NodeListsBySize[X])[V + 1] - X; } ListSizeCounter[X] -= 1; // Step 6 - Lower any list, of any size, greater than (NodeListsBySize[X])[Y], down by X. for ( V = 1; V <= (X - 1); V++ ) { for ( W = 0; W < ListSizeCounter[V]; W++ ) { if ( (NodeListsBySize[V])[W] > InspectThisEndOfList ) (NodeListsBySize[V])[W] -= X; } } for ( V = (X + 1); V <= SIZE_OF_CHARACTER_SET; V++ ) { for ( W = 0; W < ListSizeCounter[V]; W++ ) { if ( (NodeListsBySize[V])[W] > InspectThisEndOfList ) (NodeListsBySize[V])[W] -= X; } } // Step 7 - lower Y by 1 and increase NewNumberOfKilledLists. Y -= 1; NewNumberOfKilledLists += 1; break; } EliminateCurrentList = TRUE; } } printf("NewNumberOfKilledLists |%d|.\n", NewNumberOfKilledLists); TotalNumberOfKilledLists += NewNumberOfKilledLists; NumberOfListsKilledThisRound += NewNumberOfKilledLists; } NewNumberOfKilledLists = -1; } } printf("\nOriginalNumberOfPartOneNodes|%d|\n", NumberOfPartOneNodes); printf("CurrentNumberOfPartOneNodes|%d|\n\n", CurrentNumberOfPartOneNodes); printf("The Total Number Of Killed Lists Is |%d|\n", TotalNumberOfKilledLists); printf("The Total Number Of Killed Nodes Is |%d|\n\n", NumberOfPartOneNodes - CurrentNumberOfPartOneNodes); TraverseTheDawgArray(PartOneArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); // Now that we have proved how a large number of additional nodes become redundant, set up and write the original, // the first, the only, 3 part direct tracking directed acyclic word graph. // Test it, write an anagram function, and finally a boggle scoring parallel method using 14 worker threads and a main thread, using OTS time stamp paradigm. // It is of critical importance that each thread is started only once during program execution. // RearrangedNumberOfWordsToEndOfBranchList has been kept up to date during the reworking of the DAWG // and there are CurrentNumberOfPartOneNodes meaningful value in the array. // EndOfListLocations needs to be recompiled from what is left from the NodeListsBySize arrays. TotalNumberOfLists = 0; for ( X = 1; X <= SIZE_OF_CHARACTER_SET; X++ ) { TotalNumberOfLists += ListSizeCounter[X]; printf("List Size|%d| - Number Of Lists|%d|\n", X, ListSizeCounter[X]); } // Set all of the values that we are going to use to something very high. for ( X = 0; X < TotalNumberOfLists; X++ ) { EndOfListLocations[X] = 100000; } // Filter all of the living EndOfList values into the EndOfListLocations array. int TempValue; for ( X = SIZE_OF_CHARACTER_SET; X >= 1; X-- ) { for ( Y = 0; Y < ListSizeCounter[X]; Y++ ) { EndOfListLocations[TotalNumberOfLists - 1] = (NodeListsBySize[X])[Y]; // The new list has been placed at the end of the list, now filter it up to where it should be. for ( Z = (TotalNumberOfLists - 1); Z > 0; Z-- ) { if ( EndOfListLocations[Z - 1] > EndOfListLocations[Z] ) { TempValue = EndOfListLocations[Z - 1]; EndOfListLocations[Z - 1] = EndOfListLocations[Z]; EndOfListLocations[Z] = TempValue; } else break; } } } // Test for logical errors in the list elimination procedure. for ( X = 0; X < (TotalNumberOfLists - 1); X++ ) { if ( EndOfListLocations[X] == EndOfListLocations[X + 1] ) printf("No Two Lists Can End On The Same Node. |%d|\n", EndOfListLocations[X]); } // Find out the final index number that requires an integer greater in size than a byte to hold it. This will become the pivot point between part 3 and part 4. CurrentEndOfList = 0; FurthestBigNode = 0; int FirstSmallNode; for ( X = 1; X <= CurrentNumberOfPartOneNodes; X++ ) {if ( RearrangedNumberOfWordsToEndOfBranchList[X] > 255 ) FurthestBigNode = X; if ( X == EndOfListLocations[CurrentEndOfList] ) CurrentEndOfList += 1; } FirstSmallNode = FurthestBigNode + 1; printf("\nThe first node that requires only one byte for its WTEOBL is |%d|, it is the transition between PartThree And PartFour.\n\n", FirstSmallNode); //for ( X = 0; X < TotalNumberOfLists; X++ ) { // printf("EOL |%d|- |%d|\n", X, EndOfListLocations[X]); //} TraverseTheFourPartDawgArray(PartOneArray, PartTwoArray, RearrangedNumberOfWordsToEndOfBranchList); // All of the information required to store the DTDAWG into data files has been compiled, so proceed to save the data to disk. FILE *FinalPartOneData = fopen(FOUR_PART_DTDAWG_14_PART_ONE, "wb"); FILE *FinalPartTwoData = fopen(FOUR_PART_DTDAWG_14_PART_TWO, "wb"); FILE *FinalPartThreeData = fopen(FOUR_PART_DTDAWG_14_PART_THREE, "wb"); FILE *FinalPartFourData = fopen(FOUR_PART_DTDAWG_14_PART_FOUR, "wb"); char PartOneNodeInBinary[38]; char PartTwoNodeInBinary[81]; char ChildString[SIZE_OF_CHARACTER_SET + 1]; // The final part one data is located in "PartOneArray" and has size "CurrentNumberOfPartOneNodes". Remember that the 0 node is the NULL node, it will not be written to file. fwrite(&CurrentNumberOfPartOneNodes, 4, 1, FinalPartOneData); fprintf(Text, "Part One\n\n"); for ( X = 1; X <= CurrentNumberOfPartOneNodes; X++ ) { fwrite(&(PartOneArray[X]), 4, 1, FinalPartOneData); ConvertIntNodeToBinaryString(PartOneArray[X], PartOneNodeInBinary); ExtractChildStringFromLongInt(PartTwoArray[(PartOneArray[X] & OFFSET_INDEX_MASK)>>15], ChildString); fprintf(Text, "1Node#|%d|-|%d|-Binary%s-Offsets|%d|-Child|%d|-Children|%s|\n", X, PartOneArray[X], PartOneNodeInBinary, (PartOneArray[X] & OFFSET_INDEX_MASK)>>15, (PartOneArray[X] & CHILD_MASK), ChildString); } // The final part two data has not changed at all, so simply copy the exact data to the new file. This array has no NULL value so start at zero. fwrite(&NumberOfPartTwoNodes, 8, 1, FinalPartTwoData); fprintf(Text, "\nPart Two\n\n"); for ( X = 0; X < NumberOfPartTwoNodes; X++ ) { fwrite(&(PartTwoArray[X]), 8, 1, FinalPartTwoData); ConvertOffsetLongIntToBinaryString(PartTwoArray[X], PartTwoNodeInBinary); ExtractChildStringFromLongInt(PartTwoArray[X], ChildString); fprintf(Text, "2Node#|%d|-|%ld|-Binary%s-ChildString|%s|\n", X, PartTwoArray[X], PartTwoNodeInBinary, ChildString); } // The final part three array consists of the first "FurthestBigNode" number of values in the // "RearrangedNumberOfWordsToEndOfBranchList" array, and note that the zero value is null so do not include it. // Also these values should use 32 bit integers, or 4 bytes. fwrite(&FurthestBigNode, 4, 1, FinalPartThreeData); fprintf(Text, "\nPart Three\n\n"); for ( X = 1; X <= FurthestBigNode; X++ ) { fwrite(&(RearrangedNumberOfWordsToEndOfBranchList[X]), 4, 1, FinalPartThreeData); fprintf(Text, "3Node#|%d| - WTEOBL|%d|\n", X, RearrangedNumberOfWordsToEndOfBranchList[X]); } // The final part four array consists of the [FirstSmallNode, CurrentNumberOfPartOneNodes] values in the "RearrangedNumberOfWordsToEndOfBranchList" array. int TheSizeOfPartFour = CurrentNumberOfPartOneNodes - FurthestBigNode; unsigned char TheCurrentPartFourNode; printf("TheSizeOfPartFour|%d|\n", TheSizeOfPartFour); fprintf(Text, "\nPart Four\n\n"); for ( X = FirstSmallNode; X <= CurrentNumberOfPartOneNodes; X++ ) { assert(RearrangedNumberOfWordsToEndOfBranchList[X] < 255); TheCurrentPartFourNode = RearrangedNumberOfWordsToEndOfBranchList[X]; assert(TheCurrentPartFourNode == RearrangedNumberOfWordsToEndOfBranchList[X]); fwrite(&TheCurrentPartFourNode, 1, 1, FinalPartFourData); fprintf(Text, "4Node#|%d| - For|%d| - WTEOBL|%d|\n", X - FirstSmallNode, X, TheCurrentPartFourNode); } printf("The four data files required for the 4 part DATDAWG have now been written. Inspect the text verification files.\n\n"); return 0; }

ADTDAWG_Word_Search_For_Lexicon_14.c - This Program Is Simple

#include <stdlib.h> #include <math.h> #include <stdio.h> #include <string.h> #define FOUR_PART_DTDAWG_14_PART_ONE "Four_Part_1_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_TWO "Four_Part_2_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_THREE "Four_Part_3_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_FOUR "Four_Part_4_DTDAWG_For_Lexicon_14.dat" #define SIZE_OF_CHARACTER_SET 14 #define MAX_WORD_LENGTH 15 #define MIN_WORD_LENGTH 3 #define END_OF_WORD_FLAG 67108864 #define CHILD_MASK 32767 #define OFFSET_INDEX_MASK 67076096 #define OffSET_BIT_SHIFT 15 #define MAX_INPUT_SIZE 100 #define LOWERIT 32 #define BOGUS -99 #define TOTAL_WORDS_IN_LEXICON 44220 #define NAME_OF_LEXICON "Lexicon_14" typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; char CharacterSet[SIZE_OF_CHARACTER_SET] = {'A', 'C', 'D', 'E', 'G', 'I', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T'}; char CharacterLocations[26] = {0, BOGUS, 1, 2, 3, BOGUS, 4, BOGUS, 5, BOGUS, BOGUS, 6, 7, 8, 9, 10, BOGUS, 11, 12, 13, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS}; long int ChildLetterBitMasks[26] = {1, 0, 6, 24, 224, 0, 1792, 0, 14336, 0, 0, 114688, 1966080, 31457280, 503316480, 8053063680, 0, 128849018880, 2061584302080, 32985348833280, 0, 0, 0, 0, 0, 0}; int ChildLetterBitShifts[26] = {0, BOGUS, 1, 3, 5, BOGUS, 8, BOGUS, 11, BOGUS, BOGUS, 14, 17, 21, 25, 29, BOGUS, 33, 37, 41, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS}; // This function converts any lower case letters in the string "RawWord," into all capitals, so that the whole string is made of capital letters. // It is useful for when a user enters a string to be anagrammed. void MakeMeAllCapital(char* RawWord){ int X; for( X = 0; X < strlen( RawWord ); X++ ) { if( RawWord[X] >= 'a' && RawWord[X] <= 'z' ) RawWord[X] = RawWord[X] - LOWERIT; } } // This function tests for validity of "ThisWord" for potential inclusion in the lexicon. Bool IsThisWordValid(char *ThisWord){ int X; int ThisLength = strlen(ThisWord); if ( ThisLength < MIN_WORD_LENGTH ) return FALSE; if ( ThisLength > MAX_WORD_LENGTH ) return FALSE; for ( X = 0; X < ThisLength; X++ ) { if ( ThisWord[X] < 'A' || ThisWord[X] > 'Z') return FALSE; if ( CharacterLocations[ThisWord[X] - 'A'] == BOGUS) return FALSE; } return TRUE; } inline int WordsToEndOfBranchList(int *ThirdPart, unsigned char *FourthPart, int TheIndexInQuestion, int TheCutoff){ return (TheIndexInQuestion < TheCutoff)? ThirdPart[TheIndexInQuestion]: (int)FourthPart[TheIndexInQuestion - TheCutoff]; } // This function returns the lexicon position of "TheWord" if found, and BOGUS otherwise. int TrackingWordSearchWithFourPartDawgArrayRecurse(const char *TheWord, int *One, long int *Two, int *Three, unsigned char *Four , int BeginFourAt, int CurrentIndex, int TheCurrentPosition, char CurrentLetter, int CurrentMarker, int SizeOfTheWord, int *StampingSet, int CurrentTime){ long int CurrentOffset; int CurrentChild; int NextMarker = 0; if ( One[CurrentIndex] & END_OF_WORD_FLAG ) { if ( (TheCurrentPosition + 1) == SizeOfTheWord ) { printf("Found |%s| At Position |%d|. Old Stamp |%d| - New Stamp |%d|\n", TheWord, CurrentMarker, StampingSet[CurrentMarker], CurrentTime); StampingSet[CurrentMarker] = CurrentTime; return CurrentMarker; } NextMarker -= 1; } if ( (TheCurrentPosition + 1) == SizeOfTheWord ) return BOGUS; // If the current node has a child list, then check for the next letter in TheWord. if ( CurrentChild = (One[CurrentIndex] & CHILD_MASK) ) { if ( CurrentOffset = (Two[(One[CurrentIndex] & OFFSET_INDEX_MASK) >> OffSET_BIT_SHIFT] & ChildLetterBitMasks[TheWord[TheCurrentPosition + 1] - 'A']) ) { CurrentOffset >>= ChildLetterBitShifts[TheWord[TheCurrentPosition + 1] - 'A']; CurrentOffset -= 1; NextMarker += (CurrentMarker - WordsToEndOfBranchList(Three, Four, CurrentChild, BeginFourAt)+ WordsToEndOfBranchList(Three, Four, (CurrentChild + (int)CurrentOffset), BeginFourAt)); return TrackingWordSearchWithFourPartDawgArrayRecurse(TheWord, One, Two, Three, Four, BeginFourAt, (CurrentChild + (int)CurrentOffset) , (TheCurrentPosition + 1), TheWord[TheCurrentPosition + 1], NextMarker, SizeOfTheWord, StampingSet, CurrentTime); } } return BOGUS; } // The characters inside of TheRawWord should all be contained in the specified character set. // TheRawWord should have min length MIN_WORD_LENGTH,and max length MAX_WORD_LENGTH , if these conditions are not met, return BOGUS. // If theRawWord does exist, return the lexicon position integer. int TrackingWordSearchWithFourPartDawgArray(char *TheRawWord, int *First, long int *Second, int *Third, unsigned char *Fourth, int StartOfFourth, int *TheTimeStamps, int TheTime){ if ( IsThisWordValid(TheRawWord) == FALSE ) return BOGUS; TrackingWordSearchWithFourPartDawgArrayRecurse(TheRawWord, First, Second, Third, Fourth, StartOfFourth , (CharacterLocations[TheRawWord[0] - 'A'] + 1), 0, TheRawWord[0], Third[CharacterLocations[TheRawWord[0] - 'A'] + 1], strlen(TheRawWord), TheTimeStamps, TheTime); } int main(){ FILE *PartOne = fopen(FOUR_PART_DTDAWG_14_PART_ONE, "rb"); FILE *PartTwo = fopen(FOUR_PART_DTDAWG_14_PART_TWO, "rb"); FILE *PartThree = fopen(FOUR_PART_DTDAWG_14_PART_THREE, "rb"); FILE *PartFour = fopen(FOUR_PART_DTDAWG_14_PART_FOUR, "rb"); int SizeOfPartOne; long int SizeOfPartTwo; int SizeOfPartThree; int SizeOfPartFour; int *PartOneArray; long int *PartTwoArray; int *PartThreeArray; unsigned char *PartFourArray; int WhatRoundAreWeOn = 1; int X; int PartThreeFourTransition; char TheUserInput[MAX_INPUT_SIZE + 1]; char Question[MAX_INPUT_SIZE + 1]; char DOM = '\0'; Bool ShouldWeContinue = TRUE; int LexiconTimeStamps[TOTAL_WORDS_IN_LEXICON + 1]; int CurrentPosition; int InputLength; for ( X = 0; X <= TOTAL_WORDS_IN_LEXICON; X++ ) LexiconTimeStamps[X] = 0; fread(&SizeOfPartOne, 4, 1, PartOne); fread(&SizeOfPartTwo, 8, 1, PartTwo); fread(&SizeOfPartThree, 4, 1, PartThree); PartThreeFourTransition = SizeOfPartThree + 1; SizeOfPartFour = SizeOfPartOne - SizeOfPartThree; printf("\n"); printf("SizeOfPartOne |%d|\n", SizeOfPartOne); printf("SizeOfPartTwo |%ld|\n", SizeOfPartTwo); printf("SizeOfPartThree |%d|\n", SizeOfPartThree); printf("Transition |%d|\n", PartThreeFourTransition); printf("SizeOfPartFour |%d|\n", SizeOfPartFour); printf("\n"); PartOneArray = (int *)malloc((SizeOfPartOne + 1) * sizeof(int)); PartTwoArray = (long int *)malloc(SizeOfPartTwo * sizeof(long int)); PartThreeArray = (int *)malloc((SizeOfPartThree + 1) * sizeof(int)); PartFourArray = (unsigned char *)malloc(SizeOfPartFour * sizeof(char)); PartOneArray[0] = 0; for ( X = 1; X <= SizeOfPartOne; X++ ) { fread(&(PartOneArray[X]), 4, 1, PartOne); } for ( X = 0; X < SizeOfPartTwo; X++ ) { fread(&(PartTwoArray[X]), 8, 1, PartTwo); } PartThreeArray[0] = 0; for ( X = 1; X <= SizeOfPartThree; X++ ) { fread(&(PartThreeArray[X]), 4, 1, PartThree); } for ( X = 0; X < SizeOfPartFour; X++ ) { fread(&(PartFourArray[X]), 1, 1, PartFour); } fclose(PartOne); fclose(PartTwo); fclose(PartThree); fclose(PartFour); printf("The four data files have been opened and read into memory.\n\n"); while ( ShouldWeContinue == TRUE ) { printf("Type in the word that you want to search for in %s:", NAME_OF_LEXICON); fgets(TheUserInput, MAX_INPUT_SIZE, stdin); // Get rid of the new line char that fgets always reads in. InputLength = strlen(TheUserInput) - 1; TheUserInput[InputLength] = '\0'; MakeMeAllCapital(TheUserInput); printf("We are now going to search for this word |%s|.\n\n", TheUserInput); CurrentPosition = TrackingWordSearchWithFourPartDawgArray(TheUserInput, PartOneArray, PartTwoArray, PartThreeArray, PartFourArray, PartThreeFourTransition, LexiconTimeStamps, WhatRoundAreWeOn); if ( CurrentPosition == BOGUS) printf("This word |%s| simply does not exist in %s\n", TheUserInput, NAME_OF_LEXICON); while ( !(DOM == 'Y' || DOM == 'y' || DOM == 'N' || DOM == 'n') ) { printf("\nDo you want to quit(y/n)?"); fgets(Question, MAX_INPUT_SIZE, stdin); DOM = Question[0]; } if ( DOM == 'Y' || DOM == 'y' ) ShouldWeContinue = FALSE; DOM = '\0'; WhatRoundAreWeOn += 1; } printf("\nYou have chosen to exit the program after |%d| rounds.\n\n", (WhatRoundAreWeOn - 2)); return 0; }

ADTDAWG_Anagram_For_Lexicon_14.c - This Program Is Simple

#include <stdlib.h> #include <math.h> #include <stdio.h> #include <string.h> #define FOUR_PART_DTDAWG_14_PART_ONE "Four_Part_1_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_TWO "Four_Part_2_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_THREE "Four_Part_3_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_FOUR "Four_Part_4_DTDAWG_For_Lexicon_14.dat" #define SIZE_OF_CHARACTER_SET 14 #define MAX_WORD_LENGTH 15 #define MIN_WORD_LENGTH 3 #define END_OF_WORD_FLAG 67108864 #define CHILD_MASK 32767 #define OFFSET_INDEX_MASK 67076096 #define OffSET_BIT_SHIFT 15 #define MAX_INPUT_SIZE 100 #define LOWERIT 32 #define BOGUS -99 #define TOTAL_WORDS_IN_LEXICON 44220 typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; char CharacterSet[SIZE_OF_CHARACTER_SET] = {'A', 'C', 'D', 'E', 'G', 'I', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T'}; char CharacterLocations[26] = {0, BOGUS, 1, 2, 3, BOGUS, 4, BOGUS, 5, BOGUS, BOGUS, 6, 7, 8, 9, 10, BOGUS, 11, 12, 13, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS}; long int ChildLetterBitMasks[26] = {1, 0, 6, 24, 224, 0, 1792, 0, 14336, 0, 0, 114688, 1966080, 31457280, 503316480, 8053063680, 0, 128849018880, 2061584302080, 32985348833280, 0, 0, 0, 0, 0, 0}; int ChildLetterBitShifts[26] = {0, BOGUS, 1, 3, 5, BOGUS, 8, BOGUS, 11, BOGUS, BOGUS, 14, 17, 21, 25, 29, BOGUS, 33, 37, 41, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS}; // qsort char comparison function. int CharCompare(const void *a, const void *b){ const char *A = (const char *)a; const char *B = (const char *)b; return (int)(*A - *B); } // This function converts any lower case letters in the string "RawWord," into all capitals, so that the whole string is made of capital letters. // It is useful for when a user enters a string to be anagrammed. void MakeMeAllCapital(char* RawWord){ int X; for( X = 0; X < strlen( RawWord ); X++ ) { if( RawWord[X] >= 'a' && RawWord[X] <= 'z' ) RawWord[X] = RawWord[X] - LOWERIT; } } // This function will place the letters in the string "TheString," into alphabetical order. void AlphabetizeMe(char *TheString){ qsort(TheString, strlen(TheString), sizeof(char), CharCompare); } // This function will remove any characters in the string "TheString" that are not capital Letters contained in the chosen character set. void RidMeOfBadCharacters(char *TheString){ int X; int Y; for ( X = 0; TheString[X] != '\0'; X++ ) { if ( TheString[X] < 'A' || TheString[X] > 'Z' ) { for ( Y = X; TheString[Y] != '\0'; Y++) TheString[Y] = TheString[Y + 1]; X -= 1; } else if ( ChildLetterBitShifts[TheString[X] - 'A'] == BOGUS ) { for ( Y = X; TheString[Y] != '\0'; Y++) TheString[Y] = TheString[Y + 1]; X -= 1; } } } inline int WordsToEndOfBranchList(int *ThirdPart, unsigned char *FourthPart, int TheIndexInQuestion, int TheCutoff){ return (TheIndexInQuestion < TheCutoff)? ThirdPart[TheIndexInQuestion]: (int)FourthPart[TheIndexInQuestion - TheCutoff]; } // The introduction of "tracking," and "order doesn't matter," has made the basic recursive anagram function look more complex than what is actually go on. // The Governing Equation of The Four Part DTDAWG: NextMarker = CurrentMarker - FCWTEOBL + NCWTEOBL - Flag; WTEOBL means Words to end of branch list. // FC means FirstChild. NC means NextChild, the one we want to go to. int TrackingAnagramWithFourPartDawgArrayRecurse(char *RawLetters, int *One, long int *Two, int *Three, unsigned char *Four, int BeginFourAt , int CurrentIndex, char *TheWordSoFar, int FillThisPosition, char CurrentLetter, int CurrentMarker, int LettersRemaining, int *StampingSet, int CurrentTime){ int X; int Y; char TheChosenLetter; long int CurrentOffset; int CurrentChild; int NextMarker = 0; int CurrentNCWTEOBL; int Count = 0;; Bool LettersUsedSoFar[SIZE_OF_CHARACTER_SET]; //printf("We are in anagram recurse. CurrentLetter |%c|, CurrentIndex |%d|, LettersRemaining |%d|, RawLetters |%s|, CurrentMarker |%d|\n" //, CurrentLetter, CurrentIndex, LettersRemaining, RawLetters, CurrentMarker); for ( X = 0; X < SIZE_OF_CHARACTER_SET; X++ ) LettersUsedSoFar[X] = FALSE; TheWordSoFar[FillThisPosition] = CurrentLetter; if ( One[CurrentIndex] & END_OF_WORD_FLAG ) { TheWordSoFar[FillThisPosition+ 1] = '\0'; printf("#Word|%d| Is |%s| - Old Stamp |%d| - New Stamp |%d|\n", CurrentMarker, TheWordSoFar, StampingSet[CurrentMarker], CurrentTime); StampingSet[CurrentMarker] = CurrentTime; NextMarker -= 1; Count += 1; } // If we have zero RawLetters left, then do not waste processor time trying to move further down the graph. if ( LettersRemaining ) { // If the current node has a child list, then go to the RawLetters nodes. if ( CurrentChild = (One[CurrentIndex] & CHILD_MASK) ) { //printf("CurrentChild |%d|\n", CurrentChild); // Begin the calculation of the NextMarker; the part that is common to all children. NextMarker += (CurrentMarker - WordsToEndOfBranchList(Three, Four, CurrentChild, BeginFourAt)); for ( X = 0; X < LettersRemaining ; X++ ) { TheChosenLetter = RawLetters[X]; if ( LettersUsedSoFar[CharacterLocations[TheChosenLetter - 'A']] == FALSE ) { if ( CurrentOffset = (Two[(One[CurrentIndex] & OFFSET_INDEX_MASK) >> OffSET_BIT_SHIFT] & ChildLetterBitMasks[TheChosenLetter - 'A']) ) { // If we have made it to this point "TheChosenLetter" has not been tried yet, and it does exist in the next child list CurrentOffset >>= ChildLetterBitShifts[RawLetters[X] - 'A']; CurrentOffset -= 1; //printf("We are using an offset of |%d| for |%c|", (int)CurrentOffset, TheChosenLetter); CurrentNCWTEOBL = WordsToEndOfBranchList(Three, Four, (CurrentChild + (int)CurrentOffset), BeginFourAt); NextMarker += CurrentNCWTEOBL; // First remove the letter that we are using for the primary position, then pass the remaining letters to the recursive anagrammer. for ( Y = X; Y < LettersRemaining; Y++ ) RawLetters[Y] = RawLetters[Y + 1]; Count += TrackingAnagramWithFourPartDawgArrayRecurse(RawLetters, One, Two, Three, Four, BeginFourAt, (CurrentChild + (int)CurrentOffset) , TheWordSoFar, (FillThisPosition + 1), TheChosenLetter, NextMarker, (LettersRemaining - 1), StampingSet, CurrentTime); NextMarker -= CurrentNCWTEOBL; // Place the letter that we just used, back into TheRawLetters. for ( Y = LettersRemaining; Y > X; Y-- ) RawLetters[Y] = RawLetters[Y - 1]; RawLetters[X] = TheChosenLetter; } LettersUsedSoFar[CharacterLocations[TheChosenLetter - 'A']] = TRUE; } } } } //printf("Out. CurrentLetter |%c|\n", CurrentLetter); return Count; } // The characters inside of TheRawLetters must all be contained in the specified character set, but the order does not matter. int TrackingAnagramWithFourPartDawgArray(char *TheRawLetters, int *First, long int *Second, int *Third, unsigned char *Fourth , int StartOfFourth, int *TheTimeStamps, int TheTime){ int X; int Y; int Counter = 0; char TheChosenLetter; int NumberOfRawLetters = strlen(TheRawLetters); Bool LettersThatHaveBeenUsed[SIZE_OF_CHARACTER_SET]; char RunningWord[MAX_WORD_LENGTH +1]; for ( X = 0; X < SIZE_OF_CHARACTER_SET; X++ ) LettersThatHaveBeenUsed[X] = FALSE; //printf("We are in the first Anagram function.\n"); for ( X = 0; X < NumberOfRawLetters ; X++ ) { TheChosenLetter = TheRawLetters[X]; if ( LettersThatHaveBeenUsed[CharacterLocations[TheChosenLetter - 'A']] == FALSE ) { // First remove the letter that we are using for the primary position, then pass the remaining letters to the recursive anagrammer. for ( Y = X; Y < NumberOfRawLetters; Y++ ) TheRawLetters[Y] = TheRawLetters[Y + 1]; printf("Run recurison on letter |%c| and pass on |%s|.\n", TheChosenLetter, TheRawLetters); Counter += TrackingAnagramWithFourPartDawgArrayRecurse(TheRawLetters, First, Second, Third, Fourth, StartOfFourth , (CharacterLocations[TheChosenLetter - 'A'] + 1), RunningWord, 0, TheChosenLetter, Third[CharacterLocations[TheChosenLetter - 'A'] + 1] , (NumberOfRawLetters - 1), TheTimeStamps, TheTime); // Place the letter that we just used, back into TheRawLetters. for ( Y = NumberOfRawLetters; Y > X; Y-- ) TheRawLetters[Y] = TheRawLetters[Y - 1]; TheRawLetters[X] = TheChosenLetter; LettersThatHaveBeenUsed[CharacterLocations[TheChosenLetter - 'A']] = TRUE; } } return Counter; } int main(){ FILE *PartOne = fopen(FOUR_PART_DTDAWG_14_PART_ONE, "rb"); FILE *PartTwo = fopen(FOUR_PART_DTDAWG_14_PART_TWO, "rb"); FILE *PartThree = fopen(FOUR_PART_DTDAWG_14_PART_THREE, "rb"); FILE *PartFour = fopen(FOUR_PART_DTDAWG_14_PART_FOUR, "rb"); int SizeOfPartOne; long int SizeOfPartTwo; int SizeOfPartThree; int SizeOfPartFour; int *PartOneArray; long int *PartTwoArray; int *PartThreeArray; unsigned char *PartFourArray; int WhatRoundAreWeOn = 1; int X; int PartThreeFourTransition; char TheUserInput[MAX_INPUT_SIZE + 1]; char Question[MAX_INPUT_SIZE + 1]; char DOM = '\0'; Bool ShouldWeContinue = TRUE; int LexiconTimeStamps[TOTAL_WORDS_IN_LEXICON + 1]; int CurrentCount; for ( X = 0; X <= TOTAL_WORDS_IN_LEXICON; X++ ) LexiconTimeStamps[X] = 0; fread(&SizeOfPartOne, 4, 1, PartOne); fread(&SizeOfPartTwo, 8, 1, PartTwo); fread(&SizeOfPartThree, 4, 1, PartThree); PartThreeFourTransition = SizeOfPartThree + 1; SizeOfPartFour = SizeOfPartOne - SizeOfPartThree; printf("\n"); printf("SizeOfPartOne |%d|\n", SizeOfPartOne); printf("SizeOfPartTwo |%ld|\n", SizeOfPartTwo); printf("SizeOfPartThree |%d|\n", SizeOfPartThree); printf("Transition |%d|\n", PartThreeFourTransition); printf("SizeOfPartFour |%d|\n", SizeOfPartFour); printf("\n"); PartOneArray = (int *)malloc((SizeOfPartOne + 1) * sizeof(int)); PartTwoArray = (long int *)malloc(SizeOfPartTwo * sizeof(long int)); PartThreeArray = (int *)malloc((SizeOfPartThree + 1) * sizeof(int)); PartFourArray = (unsigned char *)malloc(SizeOfPartFour * sizeof(char)); PartOneArray[0] = 0; for ( X = 1; X <= SizeOfPartOne; X++ ) { fread(&(PartOneArray[X]), 4, 1, PartOne); } for ( X = 0; X < SizeOfPartTwo; X++ ) { fread(&(PartTwoArray[X]), 8, 1, PartTwo); } PartThreeArray[0] = 0; for ( X = 1; X <= SizeOfPartThree; X++ ) { fread(&(PartThreeArray[X]), 4, 1, PartThree); } for ( X = 0; X < SizeOfPartFour; X++ ) { fread(&(PartFourArray[X]), 1, 1, PartFour); } fclose(PartOne); fclose(PartTwo); fclose(PartThree); fclose(PartFour); printf("The four data files have been opened and read into memory.\n\n"); while ( ShouldWeContinue == TRUE ) { printf("Type in the letters that you want to anagram:"); fgets(TheUserInput, MAX_INPUT_SIZE, stdin); if ( strlen(TheUserInput) < (MIN_WORD_LENGTH + 1) ) ShouldWeContinue = FALSE; MakeMeAllCapital(TheUserInput); RidMeOfBadCharacters(TheUserInput); if ( strlen(TheUserInput) < MIN_WORD_LENGTH ) ShouldWeContinue = FALSE; if ( ShouldWeContinue == TRUE ) { printf("TheUserInput-|%s|-|%d|.\n", TheUserInput, (int)strlen(TheUserInput)); while ( !(DOM == 'Y' || DOM == 'y' || DOM == 'N' || DOM == 'n') ) { printf("Do you desire alphabetical order(y/n)?"); fgets(Question, MAX_INPUT_SIZE, stdin); DOM = Question[0]; } if ( DOM == 'Y' || DOM == 'y' ) AlphabetizeMe(TheUserInput); DOM = '\0'; printf("We are now going to perform an anagramming on this |%s|.\n\n", TheUserInput); CurrentCount = TrackingAnagramWithFourPartDawgArray(TheUserInput, PartOneArray, PartTwoArray, PartThreeArray , PartFourArray, PartThreeFourTransition, LexiconTimeStamps, WhatRoundAreWeOn); while ( !(DOM == 'Y' || DOM == 'y' || DOM == 'N' || DOM == 'n') ) { printf("\n|%d| Words generated in round %d, do you want to quit(y/n)?", CurrentCount, WhatRoundAreWeOn); fgets(Question, MAX_INPUT_SIZE, stdin); DOM = Question[0]; } if ( DOM == 'Y' || DOM == 'y' ) ShouldWeContinue = FALSE; DOM = '\0'; } WhatRoundAreWeOn += 1; } printf("\nYou have chosen to exit the program after |%d| rounds.\n\n", (WhatRoundAreWeOn - 2)); return 0; }

ADTDAWG_Creator_Part_1.c - This Program Is Non-Trivial
// Simply put, this program needs only 3 things to create an optimal 14 char DAWG for you. // 1) A text file with a lexicon inside of it and the number of words is written on the very first line, in linux format, so no DOS CR. // 2) Set the constant MAX below to the length of the longest word in the lexicon. Modify if your shortest words are not of length 2. Have fun. // 3) Define the 14 characters that you have chosen, and that is it. // The 14 Dawg data structure is important because each node will contain information about all of the child nodes below it. // In this respect, traversal time will be reduced beyond any published DAWG implementation to my knowledge. #include <stdlib.h> #include <math.h> #include <stdio.h> #include <string.h> #include <time.h> #include <assert.h> #define MAX 15 #define NUMBER_OF_REDUCED_LETTERS 14 #define INPUT_LIMIT 30 // C requires a boolean variable type so use C's typedef concept to create one. typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; // we will store the Dawg into a text file to verify the output. Use The Init function this time #define RAW_LEXICON "Lexicon_14.txt" #define OPTIMAL_DAWG_DATA "Traditional_Dawg_For_Lexicon_14.dat" #define OPTIMAL_DIRECT_DAWG_DATA_PART_ONE "Optimal_Part_One_Direct_Dawg_For_Lexicon_14.dat" #define OPTIMAL_DIRECT_DAWG_DATA_PART_TWO "Optimal_Part_Two_Direct_Dawg_For_Lexicon_14.dat" #define OPTIMAL_DAWG_TEXT_DATA "Traditional_Dawg_Text_For_Lexicon_14.txt" #define END_OF_LIST_DATA "End_Of_List_Nodes_Direct_Dawg_For_Lexicon_14.dat" int LetterArrayLocation[26] = {0, -999, 1, 2, 3, -999, 4, -999, 5, -999, -999, 6, 7, 8, 9, 10, -999, 11, 12, 13, -999, -999, -999, -999, -999, -999}; int BitShiftMe[26] = {0, -999, 1, 3, 5, -999, 8, -999, 11, -999, -999, 14, 17, 21, 25, 29, -999, 33, 37, 41, -999, -999, -999, -999, -999, -999}; inline void CutOffNewLine(char String){ String[strlen(String) - 1] = '\0'; } // Returns the positive integer rerpresented by "TheNumberNotYet" string, and if not a number greater then zero, returns 0. int StringToPositiveInt( char TheNumberNotYet ) { int result = 0; int X; int Digits = strlen( TheNumberNotYet ); for ( X = 0; X < Digits; X++ ) { printf("X =\|%d\|, TheNumberNotYet[X]=\|%c\|, result=\|%d\|, add\|%d\|\n", X, TheNumberNotYet[X], result, (int)((TheNumberNotYet[X] - '0')pow( 10, Digits - X - 1 ))); if ( !(TheNumberNotYet[X] >= '0' && TheNumberNotYet[X] <= '9') ) return 0; result += (int)((TheNumberNotYet[X] - '0')pow( 10, Digits - X - 1 )); } printf("The result=\|%d\| is in.\n", result); return result; } // The BinaryNode string must be at least 32 + 5 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertIntNodeToBinaryString( int TheNode, char* BinaryNode){ int X; int Bit; BinaryNode[0] = '\|'; // Bit 31, the leftmost bit is the sign bit. BinaryNode[1] = (TheNode < 0)?'1':'0'; // Bit 30 to bit 27 are unused bits in a 32 bit node. BinaryNode[2] = (TheNode & (int)pow(2,30))?'1':'0'; BinaryNode[3] = (TheNode & (int)pow(2,29))?'1':'0'; BinaryNode[4] = (TheNode & (int)pow(2,28))?'1':'0'; BinaryNode[5] = (TheNode & (int)pow(2,27))?'1':'0'; BinaryNode[6] = '\|'; // Bit 26 is the boolean EOW flag for end of word. BinaryNode[7] = (TheNode & (int)pow(2,26))?'1':'0'; BinaryNode[8] = '\|'; // Bit 25 to bit 15 represent the child offset index. Bit = 25; for ( X = 9; X <= 19; X++ ) { BinaryNode[X] = (TheNode & (int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryNode[20] = '\|'; // The first child index that requires 15 bits will finish off the 32 bit int. Bit = 14; for ( X = 21; X <= 35; X++ ) { BinaryNode[X] = (TheNode & (int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryNode[36] = '\|'; BinaryNode[37] = '\0'; } // The BinaryOffset string must be at least 64 + 16 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertOffsetLongIntToBinaryString( long int TheOffset, char* BinaryOffset){ int X; int Bit; BinaryOffset[0] = '\|'; // Bit 63, the leftmost bit is the sign bit. BinaryOffset[1] = (TheOffset < 0)?'1':'0'; // Bit 62 to bit 45 are not used with the 14 char set chosen for this DAWG. 18 Unused bits. All bits including the sign bit will be used with an 18 character set. Bit = 62; for ( X = 2; X <= 19; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[20] = '\|'; // The seven letters that require a 4 bit offset encoding start here --------------------------------------------------- // Bit [44,41] represent the "T" child. Bit = 44; for ( X = 21; X <= 24; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[25] = '\|'; // Bit [40,37] represent the "S" child. Bit = 40; for ( X = 26; X <= 29; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[30] = '\|'; // Bit [36,33] represent the "R" child. Bit = 36; for ( X = 31; X <= 34; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[35] = '\|'; // Bit [32,29] represent the "P" child. Bit = 32; for ( X = 36; X <= 39; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[40] = '\|'; // Bit [28,25] represent the "O" child. Bit = 28; for ( X = 41; X <= 44; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[45] = '\|'; // Bit [24,21] represent the "N" child. Bit = 24; for ( X = 46; X <= 49; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[50] = '\|'; // Bit [20,17] represent the "M" child. Bit = 20; for ( X = 51; X <= 54; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[55] = '\|'; // The four letters that require a 3 bit offset encoding start here --------------------------------------------------- // Bit [16,14] represent the "L" child. Bit = 16; for ( X = 56; X <= 58; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[59] = '\|'; // Bit [13,11] represent the "I" child. Bit = 13; for ( X = 60; X <= 62; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[63] = '\|'; // Bit [10,8] represent the "G" child. Bit = 10; for ( X = 64; X <= 66; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[67] = '\|'; // Bit [7,5] represent the "E" child. Bit = 7; for ( X = 68; X <= 70; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[71] = '\|'; // The two letters that require a 2 bit offset encoding start here --------------------------------------------------- // Bit [4,3] represent the "G" child. Bit = 4; for ( X = 72; X <= 73; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[74] = '\|'; // Bit [2,1] represent the "E" child. Bit = 2; for ( X = 75; X <= 76; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[77] = '\|'; // Bit 0 represent the "A" child. BinaryOffset[78] = (TheOffset & (long int)pow(2,0))?'1':'0'; BinaryOffset[79] = '\|'; BinaryOffset[80] = '\0'; } /This Function converts any lower case letters in the string "RawWord," into all capitals, so that the whole string is made of capital letters. / void MakeMeAllCapital(char RawWord){ int count = 0; for ( count=0; count < strlen(RawWord); count++ ){ if (RawWord[count] >= 97 && RawWord[count] <= 122 ) RawWord[count] = RawWord[count] - 32; } } /Trie to Dawg TypeDefs/ struct tnode { struct tnode Next; struct tnode* Child; struct tnode* ParentalUnit; // When populating the array, you must know the indices of only the child nodes. Hence we Store ArrayIndex in every node so that we can mine the information from the reduced trie. int ArrayIndex; char DirectChild; char Letter; char MaxChildDepth; char Level; char NumberOfChildren; char Dangling; char EndOfWordFlag; }; typedef struct tnode Tnode; typedef Tnode* TnodePtr; int TnodeArrayIndex( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->ArrayIndex; } char TnodeDirectChild( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->DirectChild; } TnodePtr TnodeNext( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Next; } TnodePtr TnodeChild ( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Child; } TnodePtr TnodeParentalUnit( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->ParentalUnit; } char TnodeLetter( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Letter; } char TnodeMaxChildDepth( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->MaxChildDepth; } char TnodeNumberOfChildren( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->NumberOfChildren; } char TnodeEndOfWordFlag( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->EndOfWordFlag; } char TnodeLevel ( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Level; } char TnodeDangling ( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); return ThisTnode->Dangling; } TnodePtr TnodeInit( char Chap, TnodePtr OverOne, char WordEnding, char Leveler, int StarterDepth, TnodePtr Parent, char IsaChild ) { TnodePtr Result = (TnodePtr)malloc( sizeof(Tnode) ); assert( Result != NULL ); Result->Letter = Chap; Result->ArrayIndex = 0; Result->NumberOfChildren = 0; Result->MaxChildDepth = StarterDepth; Result->Next = OverOne; Result->Child = NULL; Result->ParentalUnit = Parent; Result->Dangling = FALSE; Result->EndOfWordFlag = WordEnding; Result->Level = Leveler; Result->DirectChild = IsaChild; return Result; } void TnodeOutput( TnodePtr ThisTnode ){ assert( ThisTnode != NULL ); printf( "Letter\|%c\|, DirectChild\|%d\|, ArrayIndex\|%d\|, Level\|%d\|, MaxChildDepth\|%d\|, Dangling\|%d\|, NumberOfChildren\|%d\|, EndOfWordFlag\|%d\|\n" , ThisTnode->Letter, ThisTnode->DirectChild, ThisTnode->ArrayIndex, ThisTnode->Level, ThisTnode->MaxChildDepth, ThisTnode->Dangling , ThisTnode->NumberOfChildren, ThisTnode->EndOfWordFlag ); } void TnodeSetArrayIndex( TnodePtr ThisTnode, int TheWhat ){ assert( ThisTnode != NULL ); ThisTnode->ArrayIndex = TheWhat; } void TnodeSetChild( TnodePtr ThisTnode, TnodePtr Nexis ){ assert( ThisTnode != NULL ); ThisTnode->Child = Nexis; } void TnodeSetNext ( TnodePtr ThisTnode, TnodePtr Nexi ){ assert( ThisTnode != NULL ); ThisTnode->Next = Nexi; } void TnodeSetParentalUnit ( TnodePtr ThisTnode, TnodePtr Parent ){ assert( ThisTnode != NULL ); ThisTnode->ParentalUnit = Parent; } void TnodeSetMaxChildDepth( TnodePtr ThisTnode, int NewDepth ){ assert( ThisTnode != NULL ); ThisTnode->MaxChildDepth = NewDepth; } void TnodeSetDirectChild( TnodePtr ThisTnode, char Status ){ assert( ThisTnode != NULL ); ThisTnode->DirectChild = Status; } void TnodeFlyEndOfWordFlag ( TnodePtr ThisTnode ){ assert( ThisTnode != NULL ); ThisTnode->EndOfWordFlag = TRUE; } // This function Dangles a node, but also recursively dangles every node under it as well //, that way nodes that are not direct children do hit the chopping block. The function returns the total number of nodes dangled as a result. int TnodeDangle( TnodePtr ThisTnode ){ assert( ThisTnode != NULL ); if ( ThisTnode->Dangling == TRUE ) return 0; int Result = 0; if ( (ThisTnode->Next) != NULL ) Result += TnodeDangle( ThisTnode->Next ); if ( (ThisTnode->Child) != NULL ) Result += TnodeDangle( ThisTnode->Child); ThisTnode->Dangling = TRUE; Result += 1; return Result; } // This function returns the pointer to the Tnode in a parallel list of nodes with the letter "ThisLetter", // and returns NULL if the Tnode does not exist. In that case an insertion is required. TnodePtr TnodeFindParaNode( TnodePtr ThisTnode, char ThisLetter ) { if ( ThisTnode == NULL ) return NULL; TnodePtr Result = ThisTnode; if ( Result->Letter == ThisLetter ) return Result; while ( Result->Letter < ThisLetter ){ Result = Result->Next; if ( Result == NULL ) return NULL; } if ( Result->Letter == ThisLetter ) return Result; else return NULL; } // This function inserts a new Tnode before a larger letter node or at the end of a para list If the list does not esist then it is put at the beginnung. // The new node has ThisLetter in it. AboveTnode is the node 1 level above where the new node will be placed. // This function should never be passed a NULL pointer. ThisLetter should never exist in the next child para list. void TnodeInsertParaNode( TnodePtr AboveTnode, char ThisLetter, char WordEnder, int StartDepth){ assert( AboveTnode != NULL ); AboveTnode->NumberOfChildren += 1; TnodePtr Holder = NULL; TnodePtr Currently = NULL; // Case 1: ParaList does not exist yet so start it. if ( AboveTnode->Child == NULL ) AboveTnode->Child = TnodeInit( ThisLetter, NULL, WordEnder, AboveTnode->Level + 1, StartDepth, AboveTnode, TRUE ); // Case 2: ThisLetter should be the first in the ParaList. else if ( ((AboveTnode->Child)->Letter) > ThisLetter ) { Holder = AboveTnode->Child; // The holder node is no longer a direct child so set it as such. TnodeSetDirectChild( Holder, FALSE ); AboveTnode->Child = TnodeInit( ThisLetter, Holder, WordEnder, AboveTnode->Level + 1, StartDepth, AboveTnode, TRUE ); // The parent node needs to be changed on what used to be the child. it is the Tnode in "Holder". Holder->ParentalUnit = AboveTnode->Child; } // Case 3: The ParaList exists and ThisLetter is not first in the list. else if ( (AboveTnode->Child)->Letter < ThisLetter ) { Currently = AboveTnode->Child; while ( Currently->Next !=NULL ){ if ( TnodeLetter( Currently->Next ) > ThisLetter ){ break; } Currently = Currently->Next; } Holder = Currently->Next; Currently->Next = TnodeInit( ThisLetter, Holder, WordEnder, AboveTnode->Level + 1, StartDepth, Currently, FALSE ); if ( Holder != NULL ) Holder->ParentalUnit = Currently->Next; } } // The MaxChildDepth of the two nodes can not be assumed equal due to the recursive nature of this function, so we must check for equivalence. char TnodeAreWeTheSame( TnodePtr FirstNode, TnodePtr SecondNode){ if ( FirstNode == SecondNode ) return TRUE; if ( FirstNode == NULL \|\| SecondNode == NULL ) return FALSE; if ( FirstNode->Letter != SecondNode->Letter ) return FALSE; if ( FirstNode->MaxChildDepth != SecondNode->MaxChildDepth ) return FALSE; if ( FirstNode->NumberOfChildren != SecondNode->NumberOfChildren ) return FALSE; if ( FirstNode->EndOfWordFlag != SecondNode->EndOfWordFlag ) return FALSE; if ( TnodeAreWeTheSame( FirstNode->Child, SecondNode->Child ) == FALSE ) return FALSE; if ( TnodeAreWeTheSame( FirstNode->Next, SecondNode->Next ) == FALSE ) return FALSE; else return TRUE; } struct dawg { int NumberOfTotalWords; int NumberOfTotalNodes; TnodePtr First; }; typedef struct dawg Dawg; typedef Dawg* DawgPtr; // Set up the parent nodes in the Dawg. DawgPtr DawgInit (void){ DawgPtr Result; Result = (DawgPtr)malloc( sizeof(Dawg)); assert( Result != NULL ); Result->NumberOfTotalWords = 0; Result->NumberOfTotalNodes = 0; Result->First = TnodeInit( '0', NULL, FALSE, 0, 0, NULL, FALSE ); return Result; } // This function simply makes DawgAddWord look a hell of a lot smaller. It returns the number of new nodes that it inserted. int TnodeDawgAddWord( TnodePtr ParentNode, const char Word ){ assert( ParentNode != NULL ); int Result = 0; int X, Y = 0; int WordLength = strlen( Word ); TnodePtr HangPoint = NULL; TnodePtr Current = ParentNode; for ( X = 0; X < WordLength; X++){ HangPoint = TnodeFindParaNode ( TnodeChild( Current ), Word[X] ); if ( HangPoint == NULL ) { TnodeInsertParaNode( Current, Word[X], (X == WordLength - 1 ? TRUE : FALSE), WordLength - X - 1 ); Result++; Current = TnodeFindParaNode ( TnodeChild( Current ), Word[X] );; for ( Y = X+1; Y < WordLength; Y++ ){ TnodeInsertParaNode( Current, Word[Y], (Y == WordLength - 1 ? TRUE : FALSE), WordLength - Y - 1 ); Result += 1; Current = TnodeChild( Current ); } break; } else { if ( TnodeMaxChildDepth( HangPoint ) < WordLength - X - 1 ) TnodeSetMaxChildDepth( HangPoint, WordLength - X - 1 ); } Current = HangPoint; // The path for the word that we are trying to insert already exists, so just make sure that the end flag is flying on the last node. //This should never happen if we are to add words in increasing word length, but hey, better safe than sorry. if ( X == WordLength - 1 ) TnodeFlyEndOfWordFlag( Current ); } return Result; } void DawgAddWord ( DawgPtr ThisDawg, char NewWord ) { assert( ThisDawg != NULL ); ThisDawg->NumberOfTotalWords += 1; int WordLength = strlen( NewWord ); int X = 0; char Temp; int Inter = 0; Inter = TnodeDawgAddWord( ThisDawg->First, NewWord ); ThisDawg->NumberOfTotalNodes += Inter; } // This is a standard depth first preorder tree traversal, whereby the objective is to count nodes of various MaxChildDepths to allow for an optimal reduction process. void TnodeGraphTabulateRecurse ( TnodePtr ThisTnode, int Level, int* Tabulator ) { assert( ThisTnode != NULL ); if ( Level == 0 ) TnodeGraphTabulateRecurse( TnodeChild( ThisTnode ), Level + 1, Tabulator ); else{ if ( ThisTnode->Dangling == FALSE ) { Tabulator[ThisTnode->MaxChildDepth] += 1; // Go Down if possible. if ( ThisTnode->Child != NULL ) TnodeGraphTabulateRecurse( TnodeChild( ThisTnode ), Level + 1, Tabulator ); // Go Right if possible. if ( ThisTnode->Next != NULL && ThisTnode->Dangling == FALSE ) TnodeGraphTabulateRecurse( TnodeNext( ThisTnode ), Level, Tabulator ); } } } // Count the nodes of each max child depth. void DawgGraphTabulate( DawgPtr ThisDawg, int* Count ) { int Numbers[MAX]; int X = 0; for ( X = 0; X < MAX; X++ ) Numbers[X] = 0; if ( ThisDawg->NumberOfTotalWords > 0){ TnodeGraphTabulateRecurse( ThisDawg->First, 0, Numbers ); for ( X = 0; X < MAX; X++ ) { Count[X] = Numbers[X]; } } } // This function can never be called after a trie has been mowed because this will result in pointers being freed twice resulting in a core dump! void FreeTnodeRecurse( TnodePtr ThisTnode ) { assert( ThisTnode != NULL ); if ( ThisTnode->Child != NULL ) FreeTnodeRecurse( ThisTnode->Child ) ; if ( ThisTnode->Next != NULL ) FreeTnodeRecurse( ThisTnode->Next ); free(ThisTnode); } // This function can never be called after a trie has been mowed because this will result in pointers being freed twice resulting in a core dump! void FreeDawg( DawgPtr ThisDawg ) { if ( ThisDawg->NumberOfTotalWords > 0 ){ FreeTnodeRecurse( ThisDawg->First ); } free( ThisDawg ); } // An important function, this function returns the first node that is identical to ThisTnode. TnodePtr TnodeMexicanEquivalent( TnodePtr ThisTnode, TnodePtr MaxChildDepthWist ) { assert ( ThisTnode != NULL ); int Tall = ThisTnode->MaxChildDepth; int X = 0; while ( TnodeAreWeTheSame( ThisTnode, MaxChildDepthWist[Tall][X] ) == FALSE ) { X += 1; } return MaxChildDepthWist[Tall][X]; } // Recursively replaces all redundant nodes in a trie with their first equivalent. void TnodeLawnMowerRecurse( TnodePtr ThisTnode, TnodePtr HeightWits ) { assert (ThisTnode != NULL); if ( ThisTnode->Level == 0 ) TnodeLawnMowerRecurse( ThisTnode->Child, HeightWits ); else { if ( ThisTnode->Next == NULL && ThisTnode->Child == NULL ) goto AndImDone; if ( ThisTnode->Child != NULL ) { // we have found a node that has been tagged to be mowed, so let us destroy it not literally and replace it with its first equivelant in the "HeightWits" list, and it won't be tagged. if ( (ThisTnode->Child)->Dangling == TRUE ) { ThisTnode->Child = TnodeMexicanEquivalent( ThisTnode->Child, HeightWits ); } else TnodeLawnMowerRecurse( ThisTnode->Child, HeightWits ); } if ( ThisTnode->Next != NULL ){ if ( (ThisTnode->Next)->Dangling == TRUE ) { ThisTnode->Next = TnodeMexicanEquivalent( ThisTnode->Next, HeightWits ); } else TnodeLawnMowerRecurse( ThisTnode->Next, HeightWits ); } } AndImDone:; } // Replaces all redundant nodes in a trie with teir first equivalent. void DawgLawnMower( DawgPtr ThisDawg, TnodePtr ** HeightWise ) { TnodeLawnMowerRecurse( ThisDawg->First, HeightWise ); } ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // A queue is required for breadth first traversal, and the rest is self evident. struct breadthqueuenode { TnodePtr Element; struct breadthqueuenode Next; }; typedef struct breadthqueuenode BreadthQueueNode; typedef BreadthQueueNode BreadthQueueNodePtr; void BreadthQueueNodeSetNext( BreadthQueueNodePtr ThisBreadthQueueNode, BreadthQueueNodePtr Nexit ) { assert ( ThisBreadthQueueNode != NULL ); ThisBreadthQueueNode->Next = Nexit; } BreadthQueueNodePtr BreadthQueueNodeNext( BreadthQueueNodePtr ThisBreadthQueueNode ) { assert ( ThisBreadthQueueNode != NULL ); return ThisBreadthQueueNode->Next; } TnodePtr BreadthQueueNodeElement( BreadthQueueNodePtr ThisBreadthQueueNode ) { assert ( ThisBreadthQueueNode != NULL ); return ThisBreadthQueueNode->Element; } BreadthQueueNodePtr BreadthQueueNodeInit( TnodePtr NewElement ) { BreadthQueueNodePtr Result = (BreadthQueueNodePtr)malloc( sizeof( BreadthQueueNode ) ); assert( Result != NULL ); Result->Element = NewElement; Result->Next = NULL; return Result; } struct breadthqueue { BreadthQueueNodePtr Front; BreadthQueueNodePtr Back; int Size; }; typedef struct breadthqueue BreadthQueue; typedef BreadthQueue* BreadthQueuePtr; BreadthQueuePtr BreadthQueueInit( void ) { BreadthQueuePtr Result = (BreadthQueuePtr)malloc( sizeof( BreadthQueue ) ); assert( Result != NULL ); Result->Front = NULL; Result->Back = NULL; Result->Size = 0; } void BreadthQueuePush( BreadthQueuePtr ThisBreadthQueue, TnodePtr NewElemental ) { assert( ThisBreadthQueue != NULL ); assert( NewElemental != NULL ); BreadthQueueNodePtr Noob = BreadthQueueNodeInit( NewElemental ); assert( Noob != NULL ); if ( (ThisBreadthQueue->Back) != NULL ) BreadthQueueNodeSetNext( ThisBreadthQueue->Back, Noob ); else ThisBreadthQueue->Front = Noob; ThisBreadthQueue->Back = Noob; (ThisBreadthQueue->Size) += 1; } TnodePtr BreadthQueuePop( BreadthQueuePtr ThisBreadthQueue ) { assert( ThisBreadthQueue != NULL ); if ( ThisBreadthQueue->Size == 0 ) return NULL; if ( ThisBreadthQueue->Size == 1 ) { ThisBreadthQueue->Back = NULL; ThisBreadthQueue->Size = 0; TnodePtr Result = (ThisBreadthQueue->Front)->Element; free( ThisBreadthQueue->Front ); ThisBreadthQueue->Front = NULL; return Result; } TnodePtr Result = (ThisBreadthQueue->Front)->Element; BreadthQueueNodePtr Holder = ThisBreadthQueue->Front; ThisBreadthQueue->Front = (ThisBreadthQueue->Front)->Next; free( Holder ); ThisBreadthQueue->Size -= 1; return Result; } void BreadthQueuePopulateReductionArray( BreadthQueuePtr ThisBreadthQueue, TnodePtr Root, TnodePtr *Holder ) { printf( "inside external function.\n" ); int Taker[MAX]; int X = 0; for ( X = 0; X < MAX; X++ ) Taker[X] = 0; int PopCount = 0; int CurrentMaxChildDepth = 0; TnodePtr Current = Root; // Push the first row onto the queue. while ( Current != NULL ) { BreadthQueuePush( ThisBreadthQueue, Current ); Current = Current->Next; } // Initiate the pop followed by push all children loop. while ( (ThisBreadthQueue->Size) != 0 ) { Current = BreadthQueuePop( ThisBreadthQueue ); PopCount += 1; CurrentMaxChildDepth = Current->MaxChildDepth; Holder[CurrentMaxChildDepth][Taker[CurrentMaxChildDepth]] = Current; Taker[CurrentMaxChildDepth] += 1; Current = TnodeChild( Current ); while ( Current != NULL ) { BreadthQueuePush( ThisBreadthQueue, Current ); Current = TnodeNext( Current ); } } printf( "Completed Populating The Reduction Array.\n" ); } int BreadthQueueUseToIndex( BreadthQueuePtr ThisBreadthQueue, TnodePtr Root ) { int IndexNow = 0; TnodePtr Current = Root; // Push the first row onto the queue. while ( Current != NULL ) { BreadthQueuePush( ThisBreadthQueue, Current ); Current = Current->Next; } // Pop each element off of the queue and only push its children if has not been dangled yet. Assign index if one has not been given to it yet. while ( (ThisBreadthQueue->Size) != 0 ) { Current = BreadthQueuePop( ThisBreadthQueue ); if ( TnodeDangling( Current ) == FALSE && TnodeArrayIndex( Current ) == 0 ) { IndexNow += 1; TnodeSetArrayIndex( Current, IndexNow ); Current = TnodeChild( Current ); while ( Current != NULL ) { BreadthQueuePush( ThisBreadthQueue, Current ); Current = Current->Next; } } } printf( "Completed Assigning Indices To Living Nodes.\n" ); return IndexNow; } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // Next and Child become indices. struct arraydnode{ int Next; int Child; char Letter; char EndOfWordFlag; char Level; }; typedef struct arraydnode ArrayDnode; typedef ArrayDnode ArrayDnodePtr; void ArrayDnodeInit( ArrayDnodePtr ThisArrayDnode, char Chap, int Nextt, int Childd, char EndingFlag, char Breadth ){ ThisArrayDnode->Letter = Chap; ThisArrayDnode->EndOfWordFlag = EndingFlag; ThisArrayDnode->Next = Nextt; ThisArrayDnode->Child = Childd; ThisArrayDnode->Level = Breadth; } void ArrayDnodeTnodeTranspose( ArrayDnodePtr ThisArrayDnode, TnodePtr ThisTnode ){ ThisArrayDnode->Letter = ThisTnode->Letter; ThisArrayDnode->EndOfWordFlag = ThisTnode->EndOfWordFlag; ThisArrayDnode->Level = ThisTnode->Level; if ( ThisTnode->Next == NULL ) ThisArrayDnode->Next = 0; else ThisArrayDnode->Next = (ThisTnode->Next)->ArrayIndex; if ( ThisTnode->Child == NULL ) ThisArrayDnode->Child = 0; else ThisArrayDnode->Child = (ThisTnode->Child)->ArrayIndex; } int ArrayDnodeNext( ArrayDnodePtr ThisArrayDnode ){ return ThisArrayDnode->Next; } int ArrayDnodeChild (ArrayDnodePtr ThisArrayDnode){ return ThisArrayDnode->Child; } char ArrayDnodeLetter( ArrayDnodePtr ThisArrayDnode ){ return ThisArrayDnode->Letter; } char ArrayDnodeEndOfWordFlag (ArrayDnodePtr ThisArrayDnode){ return ThisArrayDnode->EndOfWordFlag; } int ArrayDnodeNumberOfChildrenPlusString(ArrayDnodePtr DoggieDog, int Index, char* FillThisString){ int X; int CurrentArrayPosition; if ( (DoggieDog[Index]).Child == 0 ) { FillThisString[0] = '\0'; return 0; } CurrentArrayPosition = (DoggieDog[Index]).Child; for ( X = 0; X < NUMBER_OF_REDUCED_LETTERS; X++ ) { FillThisString[X] = (DoggieDog[CurrentArrayPosition]).Letter; if ( (DoggieDog[CurrentArrayPosition]).Next == 0 ) { FillThisString[X+1] = '\0'; return (X + 1); } CurrentArrayPosition = CurrentArrayPosition + 1; } } struct arraydawg { int NumberOfStrings; ArrayDnodePtr DawgArray; int First; char MinStringLength; char MaxStringLength; }; typedef struct arraydawg ArrayDawg; typedef ArrayDawg* ArrayDawgPtr; //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // This function is the core of the dawg creation procedure. Pay close attention to the order of the steps involved. ArrayDawgPtr ArrayDawgInit( char *Dictionary, int SegmentLenghts, int MaxStringLength ){ int X = 0; int Y = 0; int ChildCount; char ChildStrings; printf("step 0 - Allocate the framework for the array data structure.\n"); /// Dynamically allocate the upper Data Structure. ArrayDawgPtr Result = (ArrayDawgPtr)malloc( sizeof( ArrayDawg ) ); /// set MinStringLength, MaxStringLength, and NumberOfStrings. X = 0; while ( SegmentLenghts[X] == 0 ) X++; Result->MinStringLength = X; Result->MaxStringLength = MaxStringLength; Result->NumberOfStrings = 0; for ( X = Result->MinStringLength; X <= Result->MaxStringLength ; X++ ) Result->NumberOfStrings += SegmentLenghts[X]; printf("step 1 - Create a temporary trie and begin filling it with the provided lexicon.\n"); /// Create a Temp Trie structure and then feed in the given dictionary. DawgPtr TemporaryTrie = DawgInit(); for ( Y = Result->MinStringLength; Y <= Result->MaxStringLength; Y++ ){ for ( X = 0; X < SegmentLenghts[Y]; X++ ){ DawgAddWord ( TemporaryTrie, &(Dictionary[Y][(Y + 1)X]) ); } } printf("step 2 - Finished adding words to the temporary trie.\n"); /// Create the associated pointer array with all the nodes inside it. int NodeNumberCounter = (int)malloc( (Result->MaxStringLength)sizeof(int) ); for ( X = 0; X < Result->MaxStringLength; X++ ) NodeNumberCounter[X] = 0; int NodeNumberCounterInit = (int)malloc( (Result->MaxStringLength)sizeof(int) ); for ( X = 0; X < Result->MaxStringLength; X++ ) NodeNumberCounterInit[X] = 0; // Pass a fake pointer because we haven't counted the nodes yet, after that, set NodeNumberCounterInit. Pass FALSE kuz we are not filling the array yet. printf("step 3 - Count the total number of nodes in the raw trie by height.\n"); DawgGraphTabulate( TemporaryTrie, NodeNumberCounter ); printf("step 4 - Initial counting of trie nodes complete, so display results.\n"); int TotalNodeSum = 0; for ( X = 0; X < Result->MaxStringLength; X++ ){ NodeNumberCounterInit[X] = NodeNumberCounter[X]; TotalNodeSum += NodeNumberCounter[X]; } for ( X = 0; X < Result->MaxStringLength; X++ ){ printf( "Initial Count For MaxChildDepth \|%d\| is \|%d\|\n", X, NodeNumberCounterInit[X] ); } printf("The Total Number Of Nodes In The Trie = \|%d\| \n", TotalNodeSum); // We will have exactly enough space for all of the Tnode pointers. printf("step 5 - Allocate a 2 dimensional array of Trie node pointers to minimize the trie into an optimal DAWG.\n"); TnodePtr * HolderOfAllTnodePointers = (TnodePtr *)malloc( (Result->MaxStringLength)sizeof( TnodePtr * ) ); for ( X = 0; X < MAX; X++ ) HolderOfAllTnodePointers[X] = (TnodePtr )malloc(NodeNumberCounterInit[X]sizeof(TnodePtr)); // When populating the array we are going to need to go about doing so in a breadth first manner to // make sure that the first time that we encounter something it will come first in the array structure. // Or maybe this is unnecessary, because we have literally replaced all references to obsolete nodes. printf("step 6 - Populate the 2 dimensional trie node pointer array.\n"); // Let us use a breadth first traversal to populate the HolderOfAllTnodePointers array so that the first occurance of a group of identical nodes becomes the one that survives. BreadthQueuePtr Populator = BreadthQueueInit(); BreadthQueuePopulateReductionArray( Populator, (TemporaryTrie->First)->Child, HolderOfAllTnodePointers ); printf("step 7 - Population complete.\n"); // Flag all of the reduntant nodes. Flagging requires the node comparison function that will take a very long time // for a big dictionary especially when comparing the nodes with small MaxChildDepths because there are so many, so start at the top. int NumberDangled = 0; int TotalDangled = 0; int W = 0; // keep track of the number of nodes of each MaxChildDepth dangled recursively so we can check how many remaining nodes we need for the optimal array. int DangleCount[Result->MaxStringLength]; for ( X = 0; X < Result->MaxStringLength; X++) DangleCount[X] = 0; printf("step 8 - Begin to tag redundant nodes as dangled - Use recursion because only direct children are considered for elimination to reduce node size.\n"); // Start at the largest MaxChildDepth and work down from there for recursive reduction to take place. for ( Y = (Result->MaxStringLength - 1); Y >= 0 ; Y--){ NumberDangled = 0; // Move through the holder array looking for any nodes that have not been dangled. for ( W = 0; W < (NodeNumberCounterInit[Y] - 1); W++ ){ // The Node is already Dangling. Note the this node need not be the first in a child line. //In order to eliminate the need for the next index all identical reduced nodes only must be direct children only. if ( TnodeDangling( HolderOfAllTnodePointers[Y][W] ) == TRUE ) { continue; } // Traverse the rest of the array looking for equivalent nodes that are both not dangled and are tagged as direct children. Then dangle when found to be identical. for ( X = W + 1; X < NodeNumberCounterInit[Y]; X++ ){ if ( TnodeDangling( HolderOfAllTnodePointers[Y][X] ) == FALSE && TnodeDirectChild( HolderOfAllTnodePointers[Y][X] ) == TRUE ) { if ( TnodeAreWeTheSame( HolderOfAllTnodePointers[Y][W], HolderOfAllTnodePointers[Y][X] ) == TRUE ){ NumberDangled += TnodeDangle( HolderOfAllTnodePointers[Y][X] ); } } } } printf( "Dangled \|%d\| Nodes In all, through recursion, for MaxChildDepth of \|%d\|\n", NumberDangled, Y ); DangleCount[Y] = NumberDangled; TotalDangled += NumberDangled; } printf( "Total Number Of Dangled Nodes \|%d\|\n", TotalDangled ); int NumberOfLivingNodes = TotalNodeSum - TotalDangled; printf( "Total Number Of Living Nodes \|%d\|\n", NumberOfLivingNodes ); printf("step 9 - Count the number of living nodes in the trie before replacement so that we check our numbers.\n"); DawgGraphTabulate( TemporaryTrie, NodeNumberCounter ); for ( X = 0; X < Result->MaxStringLength; X++ ){ printf( "Count for living nodes of MaxChildDepth \|%d\| is \|%d\|. It used to be \|%d\| and so the number dangled is \|%d\| \n" , X, NodeNumberCounter[X], NodeNumberCounterInit[X], NodeNumberCounterInit[X] - NodeNumberCounter[X] ); } int TotalDangledCheck = 0; for ( X = 0; X < MAX; X++ ) { TotalDangledCheck += (NodeNumberCounterInit[X] - NodeNumberCounter[X]); } assert( TotalDangled == TotalDangledCheck ); printf("step 10 - Dangling is complete, so replace all dangled nodes with their mexican equivelants in the Trie to make a distributed DAWG.\n"); // Node replacement has to take place before indices are set up so nothing points to redundant nodes. - This step is absolutely critical. Mow The Lawn so to speak! Then Index. DawgLawnMower( TemporaryTrie, HolderOfAllTnodePointers ); printf("step 11 - Mowing of the lawn is now complete, so start to assign array indices to all living nodes.\n step 11 - The assigning of array indices is accomplished with a breadth first queue so that all nodes with a Next refrence not equal to NULL will point to the next element of the array, this is after all an optimal DAWG.\n"); BreadthQueuePtr OrderMatters = BreadthQueueInit(); // Try to find out if the nodes we are setting are unique before we set them. int IndexCount = BreadthQueueUseToIndex( OrderMatters, HolderOfAllTnodePointers[MAX - 1][0] ); printf( "Finish\n" ); printf( "NumberOfLivingNodes from after the dangling process\|%d\|\n", NumberOfLivingNodes ); printf( "IndexCount from the index handing out breadth first traversal \|%d\|\n", IndexCount ); // We are going to likely need a FIFO queue to do a breadth first traversal but know that this will attempt to add the same nodes multiple times unless we test for it. // Assigning the array indices the way I have been doing to does not allow for the elimination of the next index reference, so a new method, using the trie structure is necessary. // Then assign array indices to the living nodes. Keep track of the total number of living nodes. The living nodes are now supposed to point to only living nodes. /// Allocate the space needed to store the "DawgArray". Result->DawgArray = (ArrayDnodePtr)calloc( (NumberOfLivingNodes + 1), sizeof( ArrayDnode ) ); int IndexFollow = 0; int IndexFollower = 0; /// Roll through the pointer arrays and use the "ArrayDnodeTnodeTranspose" function to populate it. // Set the dummy entry at the beginning of the array. ArrayDnodeInit( &(Result->DawgArray[0]), 0, 0, 0, 0, 0 ); Result->First = 1; printf("step 12 - Populate the new array dawg structure with each node being as small as possible for optimization.\n"); // Traverse the whole 2D pointer array and look for undangled nodes, if so then transpose that node into the optimal DAWG array. for ( X = Result->MaxStringLength - 1; X >= 0; X-- ){ for (W = 0; W < NodeNumberCounterInit[X]; W++ ){ if ( TnodeDangling( HolderOfAllTnodePointers[X][W] ) == FALSE ){ IndexFollow = TnodeArrayIndex( HolderOfAllTnodePointers[X][W] ); ArrayDnodeTnodeTranspose( &(Result->DawgArray[IndexFollow]), HolderOfAllTnodePointers[X][W] ); if ( IndexFollow > IndexFollower ) IndexFollower = IndexFollow; } } } printf( "IndexFollower, which is the largest index assigned in the array = \|%d\|\n", IndexFollower ); printf( "NumberOfLivingNodes\|%d\|, assert that these two are equal because they must be.\n", NumberOfLivingNodes ); assert ( IndexFollower == NumberOfLivingNodes ); // Do Garbage cleanup and free the whole Trie, which is no longer needed. So write a free function. // But never use it here because after mowing it is rendered useless. Free from the holding array. for ( X = 0; X < Result->MaxStringLength; X++ ) for ( W = 0; W < NodeNumberCounterInit[X]; W++ ) free( HolderOfAllTnodePointers[X][W] ); free( TemporaryTrie ); free( NodeNumberCounter ); free( NodeNumberCounterInit ); for ( X = 0; X < Result->MaxStringLength; X++ ) free( HolderOfAllTnodePointers[X] ); free( HolderOfAllTnodePointers ); printf("step 13 - Creation of dawg is complete, so store the optimal DAWG into a text file for verification as well as 32 and 64 bit binary files for use.\n"); FILE Text; Text = fopen( OPTIMAL_DAWG_TEXT_DATA,"w" ); FILE Short; FILE Long; Short = fopen( OPTIMAL_DIRECT_DAWG_DATA_PART_ONE,"wb" ); Long = fopen( OPTIMAL_DIRECT_DAWG_DATA_PART_TWO,"wb" ); // The following variables will be used when setting up the child-offset 64 bit nodes. int CurrentNumberOfChildren = 0; char CurrentChildLetterString[NUMBER_OF_REDUCED_LETTERS + 1]; CurrentChildLetterString[0] = '\0'; char TheNodeInBinary[32+5+1]; char TheOffsetInBinary[64+16+1]; TheNodeInBinary[0] = '\0'; long int ChildOffsetNumber[NUMBER_OF_REDUCED_LETTERS]; for ( X = 0; X < NUMBER_OF_REDUCED_LETTERS; X++ ) ChildOffsetNumber[X] = 0; long int CurrentOffsetNumber; long int TotalOffsetNumber = 0; int CurrentOffsetNumberIndex; int CompleteThirtyTwoBitNode; fwrite( &NumberOfLivingNodes, 4, 1, Short ); FILE ListE; int EndOfListCount = 0; int EOLTracker = 0; int EndOfListIndicies; ListE = fopen ( END_OF_LIST_DATA, "wb" ); // Set up an array to hold all of the unique child strings for the reduced lexicon DAWG. The empty placeholder will be all zeds. long int NumberOfUniqueChildStrings = 0; int InsertionPoint = 0; Bool IsSheUnique = TRUE; char HolderOfUniqueChildStrings = (char)malloc(NumberOfLivingNodes sizeof(char)); for ( X = 0; X < NumberOfLivingNodes; X++ ) { HolderOfUniqueChildStrings[X] = (char)malloc((NUMBER_OF_REDUCED_LETTERS + 1) * sizeof(char)); strcpy(HolderOfUniqueChildStrings[X], "ZZZZZZZZZZZZZZ"); } // Right here we will tabulate the child information so that it can be turned into a long int array and stored in a data file. // Also, 45 bits allows for an optimal child offset format, but tabulate how many unique values these bits hold. For TWL06 and the 14 chars chosen it is 1505, requiring 11 bits. // The idea is that there are a very small number of actual values that these 45 bits will hold due to their format and the prevalent patterns in the English Language. for ( X = 1; X <= NumberOfLivingNodes ; X++ ){ CurrentNumberOfChildren = ArrayDnodeNumberOfChildrenPlusString(Result->DawgArray, X, CurrentChildLetterString); // Insert the CurrentChildLetterString into the HolderOfUniqueChildStrings if, and only if, it is unique. for ( Y = 0; Y < NumberOfUniqueChildStrings; Y++ ) { if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) == 0 ) { IsSheUnique = FALSE; InsertionPoint = 0; break; } if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) < 0 ) { IsSheUnique = TRUE; InsertionPoint = Y; break; } IsSheUnique = TRUE; InsertionPoint = Y + 1; } if ( IsSheUnique == TRUE ) { for ( Y = NumberOfUniqueChildStrings; Y > InsertionPoint; Y-- ) { strcpy(HolderOfUniqueChildStrings[Y], HolderOfUniqueChildStrings[Y-1]); } strcpy(HolderOfUniqueChildStrings[InsertionPoint], CurrentChildLetterString); NumberOfUniqueChildStrings += 1; } } long int ParallelChildOffsetValues = (long int)malloc(NumberOfUniqueChildStringssizeof(long int)); // Convert the unique child strings into the equivalent bitwise offset numbers, and store these values in a parallel array. for ( X = 0; X < NumberOfUniqueChildStrings ; X++ ){ strcpy(CurrentChildLetterString, HolderOfUniqueChildStrings[X]); CurrentNumberOfChildren = strlen(CurrentChildLetterString); for ( Y = 0; Y < CurrentNumberOfChildren; Y++ ) { CurrentOffsetNumber = Y + 1; CurrentOffsetNumber <<= BitShiftMe[(CurrentChildLetterString[Y] - 'A')]; ChildOffsetNumber[LetterArrayLocation[(CurrentChildLetterString[Y] - 'A')]] = CurrentOffsetNumber; } for ( Y = 0; Y < NUMBER_OF_REDUCED_LETTERS; Y++ ) { TotalOffsetNumber += ChildOffsetNumber[Y]; ChildOffsetNumber[Y] = 0; } ParallelChildOffsetValues[X] = TotalOffsetNumber; TotalOffsetNumber = 0; } // We are now ready to output to the 32 bit node data file and the text file. for ( X = 1; X <= NumberOfLivingNodes ; X++ ){ CurrentNumberOfChildren = ArrayDnodeNumberOfChildrenPlusString(Result->DawgArray, X, CurrentChildLetterString); // Get the correct offset index to store into the current node for ( Y = 0; Y < NumberOfUniqueChildStrings; Y++ ) { if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) == 0 ) { CurrentOffsetNumberIndex = Y; break; } } CompleteThirtyTwoBitNode = CurrentOffsetNumberIndex; CompleteThirtyTwoBitNode <<= 15; CompleteThirtyTwoBitNode += (Result->DawgArray)[X].Child; // Bits (0, 14) are for the first child index. Bits (15, 25) contain the offset index. Bit 26 then represents the EOW flag. if ( (Result->DawgArray)[X].EndOfWordFlag == 1 ) CompleteThirtyTwoBitNode += 67108864; fwrite( &CompleteThirtyTwoBitNode, 4, 1, Short ); ConvertIntNodeToBinaryString(CompleteThirtyTwoBitNode, TheNodeInBinary); ConvertOffsetLongIntToBinaryString(ParallelChildOffsetValues[CurrentOffsetNumberIndex], TheOffsetInBinary); fprintf( Text, "N%d-%s,Lev\|%d\|,{'%c',%d,%d,%d},Childs\|%d\|-\|%s\|,Raw\|%d\|,Offset%s-\|%ld\|.\n", X, TheNodeInBinary, (Result->DawgArray)[X].Level, (Result->DawgArray)[X].Letter, (Result->DawgArray)[X].EndOfWordFlag, (Result->DawgArray)[X].Next, (Result->DawgArray)[X].Child, CurrentNumberOfChildren, CurrentChildLetterString, CompleteThirtyTwoBitNode, TheOffsetInBinary, ParallelChildOffsetValues[CurrentOffsetNumberIndex] ); if ( CompleteThirtyTwoBitNode == 0 ) printf("The Fuck You Do\n"); TotalOffsetNumber = 0; } fclose(Short); fwrite( &NumberOfUniqueChildStrings, 8, 1, Long ); for ( X = 0; X < NumberOfUniqueChildStrings; X++ ) { fwrite( &ParallelChildOffsetValues[X], 8, 1, Long ); } fclose(Long); for ( X = 0; X < NumberOfUniqueChildStrings; X++ ) { printf("#%d - \|%s\| - \|%ld\| - \n", X, HolderOfUniqueChildStrings[X], ParallelChildOffsetValues[X] ); } fprintf(Text, "Number Of Living Nodes \|%d\| Plus The NULL Node. Also, there are %ld child offset long ints.\n\n", NumberOfLivingNodes, NumberOfUniqueChildStrings); for ( X = 0; X < NumberOfUniqueChildStrings; X++ ) { fprintf(Text, "#%d - \|%s\| - \|%ld\|\n", X, HolderOfUniqueChildStrings[X], ParallelChildOffsetValues[X] ); } // free all of the memory used to compile the two part DAWG encoding. for ( X = 0; X < NumberOfLivingNodes; X++ ) { free(HolderOfUniqueChildStrings[X]); } free(HolderOfUniqueChildStrings); free(ParallelChildOffsetValues); printf( "\nThe Number Of Unique Child Strings Has Been Found to be \|%ld\|.\n\n", NumberOfUniqueChildStrings ); for ( X = 1; X <= NumberOfLivingNodes; X++ ) { if ( (Result->DawgArray)[X].Next == 0 ){ EndOfListCount += 1; } } EndOfListIndicies = (int)malloc(EndOfListCount * sizeof(int)); fwrite( &EndOfListCount, 4, 1, ListE ); for ( X = 1; X <= NumberOfLivingNodes; X++ ) { if ( (Result->DawgArray)[X].Next == 0 ){ EndOfListIndicies[EOLTracker] = X; EOLTracker += 1; } } printf("EndOfListCount \|%d\|\n", EndOfListCount); for ( X = 0; X < EndOfListCount; X++ ) { fwrite( &EndOfListIndicies[X], 4, 1, ListE ); } fclose(ListE); fprintf(Text, "\nEndOfListCount \|%d\|\n\n", EndOfListCount); for ( X = 0; X < EndOfListCount; X++ ) { fprintf(Text, "#%d - \|%d\|\n", X, EndOfListIndicies[X]); } fclose(Text); printf( "Out of text, 32 bit, and 64 bit output to file clean.\n" ); FILE Data; Data = fopen( OPTIMAL_DAWG_DATA,"wb" ); int CurrentNodeInteger = NumberOfLivingNodes; // It is critical especially in a binary file that the first integer written to the file be the number of nodes stored in the file. fwrite( &CurrentNodeInteger, 4, 1, Data ); // Write the NULL node to the file first. CurrentNodeInteger = 0; fwrite( &CurrentNodeInteger, 4, 1, Data ); for ( X = 1; X <= NumberOfLivingNodes ; X++ ){ CurrentNodeInteger = (Result->DawgArray)[X].Child; CurrentNodeInteger <<= 5; CurrentNodeInteger += ((Result->DawgArray)[X].Letter) - 'A'; if ( (Result->DawgArray)[X].EndOfWordFlag == TRUE ) CurrentNodeInteger += 8388608; if ( (Result->DawgArray)[X].Next == 0 ) CurrentNodeInteger += 4194304; fwrite( &CurrentNodeInteger, 4, 1, Data ); } fclose(Data); printf( "Out of 32 bit traditional data output to file clean.\n" ); return Result; } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// char ArrayDnodeWordSearchRecurse( ArrayDnodePtr DoggieDog, int Index, char PotWord ){ if ( Index == 0 ) return FALSE; if ( (DoggieDog[Index].Letter) == PotWord[0] ){ printf( "Found first letter \|%c\| in \|%s\|\n", PotWord[0], PotWord ); if ( PotWord[1] == '\0' && DoggieDog[Index].EndOfWordFlag == TRUE ) return TRUE; else { return ArrayDnodeWordSearchRecurse( DoggieDog, DoggieDog[Index].Child, &(PotWord[1]) ); printf( "Not quite a match yet, continue.\n" ); } } else { printf( "\|%c\| in dawg does not equal \|%c\| in the word\n", DoggieDog[Index].Letter, PotWord[0] ); if ( (DoggieDog[Index].Letter) > PotWord[0] ){ printf( "\|%c\| is past the letter we need \|%c\|\n", DoggieDog[Index].Letter, PotWord[0] ); return FALSE; } else{ printf("Keep looking down the line\n"); return ArrayDnodeWordSearchRecurse( DoggieDog, DoggieDog[Index].Next, PotWord ); } } } char ArrayDawgWordSearch( ArrayDawgPtr ThisArrayDawg, char PotWord ){ printf("in\n"); return ArrayDnodeWordSearchRecurse( ThisArrayDawg->DawgArray, 1, PotWord ); } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// int main() { printf( "The main function lives\n" ); int X = 0; // All of the words of similar length will be stored sequentially in the same array so that there will be (MAX + 1) arrays in total. //The Smallest length of a string is assumed to be 2. char AllWordsInEnglish[MAX + 1]; for ( X = 0; X < (MAX + 1); X++ ) AllWordsInEnglish[X] = NULL; FILE Input; Input = fopen( RAW_LEXICON,"r" ); char ThisLine[100] = "\0"; int FirstLineIsSize = 0; int LineLengthWithNl = 0; int LineLength = 0; fgets( ThisLine, 100, Input ); // Correction is needed to get rid of the new line character that fgets appends to the word. LineLengthWithNl = strlen( ThisLine ); LineLength = LineLengthWithNl - 1; ThisLine[LineLength] = '\0'; FirstLineIsSize = StringToPositiveInt( ThisLine ); printf( "FirstLineIsSize is \|%d\|\n", FirstLineIsSize ); int KeepTracker[MAX + 1]; for ( X = 0; X <= MAX; X++ ) KeepTracker[X] = 0; char LexiconInRam = (char)malloc( FirstLineIsSizesizeof( char* ) ); // The first line gives us the number of words so read in all of them to ram temporarily. for ( X = 0; X < FirstLineIsSize; X++ ) { fgets( ThisLine, 100, Input ); LineLengthWithNl = strlen( ThisLine ); LineLength = LineLengthWithNl - 1; ThisLine[LineLength] = '\0'; //if ( LineLength <= MAX ) { if ( LineLength <= MAX ) KeepTracker[LineLength] += 1; LexiconInRam[X] = (char)malloc( (LineLength + 1)sizeof( char ) ); strcpy( LexiconInRam[X], ThisLine ); //} } printf( "Ram write complete\n" ); for ( X = 0; X < (MAX + 1); X++ ) printf( "There are \|%d\| words of length \|%d\|\n", KeepTracker[X], X ); // Allocate enough space to hold all of the words in strings so that we can add them to the trie by length. for ( X = 2; X < (MAX + 1); X++ ) AllWordsInEnglish[X] = (char)malloc( (X + 1)KeepTracker[X]sizeof( char ) ); printf( "Initial malloc() complete\n" ); int CurrentTracker[MAX + 1]; for ( X = 0; X < (MAX + 1); X++ ) CurrentTracker[X] = 0; int CurrentLength = 0; // Copy all of the strings into the halfway house 1. for ( X = 0; X < FirstLineIsSize; X++ ) { CurrentLength = strlen( LexiconInRam[X] ); //printf("\|%s\| - \|%d\|\n", LexiconInRam[X], CurrentLength ); // As convoluted as this command might seem, it simply copies a string from its temporary ram location to the array of length equivelant strings for processing in making the DAWG. if ( CurrentLength <= MAX ) strcpy( &((AllWordsInEnglish[CurrentLength])[(CurrentTracker[CurrentLength](CurrentLength + 1))]), LexiconInRam[X] ); CurrentTracker[CurrentLength] += 1; } printf( "halfway house copy complete\n" ); // Make sure that the counting has resulted in all of the strings being placed correctly. for ( X = 0; X < (MAX + 1); X++ ) assert( KeepTracker[X] == CurrentTracker[X] ); // Free the initial ram read space for ( X = 0; X < FirstLineIsSize; X++ ) free( LexiconInRam[X] ); free( LexiconInRam ); clock_t start_clock = clock(); int DictionarySizeIndex[MAX + 1]; for ( X = 0; X <= MAX; X++ ) DictionarySizeIndex[X] = KeepTracker[X]; printf( "I will now start the init function.\n" ); ArrayDawgPtr Adoggy = ArrayDawgInit( AllWordsInEnglish, DictionarySizeIndex, MAX ); printf( "The Init is over, so move on to a traditional DAWG word search. This will clearly demonstrate the list-based DAWG's shortcomings.\n" ); // Use an initial word to get into the search. char Query[INPUT_LIMIT +1] = "AGE"; while ( strlen(Query) > 1 ){ MakeMeAllCapital( Query ); if ( ArrayDawgWordSearch( Adoggy, Query ) == TRUE ) printf( "\nIIIIIIIIII - Found \|%s\|\n\n", Query ); else printf( "\nXXXXXXXXXX - \|%s\| NOT FOUND\n\n", Query ); printf( "Enter the word that you want to look for...(<2 letters to exit):" ); fgets(Query, INPUT_LIMIT, stdin); CutOffNewLine(Query); } char Outer[INPUT_LIMIT +1]; printf( "Processor Time Used: %g sec... --press enter to exit--", (clock() - start_clock) / (double) CLOCKS_PER_SEC ); fgets(Outer, INPUT_LIMIT, stdin); return 0; }

ADTDAWG_Creator_Part_2.c - This Program Is Non-Trivial
// The second part of the creation procedure is responsible for reording the lists, and deleting reduntant ones. #include <stdlib.h> #include <math.h> #include <stdio.h> #include <string.h> #include <assert.h> #define OPTIMAL_DIRECT_DAWG_DATA_PART_ONE "Optimal_Part_One_Direct_Dawg_For_Lexicon_14.dat" #define OPTIMAL_DIRECT_DAWG_DATA_PART_TWO "Optimal_Part_Two_Direct_Dawg_For_Lexicon_14.dat" #define END_OF_LIST_DATA "End_Of_List_Nodes_Direct_Dawg_For_Lexicon_14.dat" #define DRUMPSTER "Numbered_Lexicon_14.txt" #define FOUR_PART_DTDAWG_14_PART_ONE "Four_Part_1_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_TWO "Four_Part_2_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_THREE "Four_Part_3_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_FOUR "Four_Part_4_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_TEXT "Four_Part_DTDAWG_For_Lexicon_14.txt" #define SIZE_OF_CHARACTER_SET 14 #define MAX_WORD_LENGTH 15 #define END_OF_WORD_FLAG 67108864 #define CHILD_MASK 32767 #define OFFSET_INDEX_MASK 67076096 #define OffSET_BIT_SHIFT 15 typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; char CharacterSet[SIZE_OF_CHARACTER_SET] = {'A', 'C', 'D', 'E', 'G', 'I', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T'}; long int ChildLetterBitMasks[26] = {1, 0, 6, 24, 224, 0, 1792, 0, 14336, 0, 0, 114688, 1966080, 31457280, 503316480, 8053063680, 0, 128849018880, 2061584302080, 32985348833280, 0, 0, 0, 0, 0, 0}; int ChildLetterBitShifts[26] = {0, -999, 1, 3, 5, -999, 8, -999, 11, -999, -999, 14, 17, 21, 25, 29, -999, 33, 37, 41, -999, -999, -999, -999, -999, -999}; FILE WhatR; // The BinaryNode string must be at least 32 + 5 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertIntNodeToBinaryString( int TheNode, char BinaryNode){ int X; int Bit; BinaryNode[0] = '\|'; // Bit 31, the leftmost bit is the sign bit. BinaryNode[1] = (TheNode < 0)?'1':'0'; // Bit 30 to bit 27 are unused bits in a 32 bit node. BinaryNode[2] = (TheNode & (int)pow(2,30))?'1':'0'; BinaryNode[3] = (TheNode & (int)pow(2,29))?'1':'0'; BinaryNode[4] = (TheNode & (int)pow(2,28))?'1':'0'; BinaryNode[5] = (TheNode & (int)pow(2,27))?'1':'0'; BinaryNode[6] = '\|'; // Bit 26 is the boolean EOW flag for end of word. BinaryNode[7] = (TheNode & (int)pow(2,26))?'1':'0'; BinaryNode[8] = '\|'; // Bit 25 to bit 15 represent the child offset index. Bit = 25; for ( X = 9; X <= 19; X++ ) { BinaryNode[X] = (TheNode & (int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryNode[20] = '\|'; // The first child index that requires 15 bits will finish off the 32 bit int. Bit = 14; for ( X = 21; X <= 35; X++ ) { BinaryNode[X] = (TheNode & (int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryNode[36] = '\|'; BinaryNode[37] = '\0'; } // This function is used to make a text file for verification purposes. // The BinaryOffset string must be at least 64 + 16 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertOffsetLongIntToBinaryString( long int TheOffset, char* BinaryOffset){ int X; int Bit; BinaryOffset[0] = '\|'; // Bit 63, the leftmost bit is the sign bit. BinaryOffset[1] = (TheOffset < 0)?'1':'0'; // Bit 62 to bit 45 are not used with the 14 char set chosen for this DAWG. 18 Unused bits. All bits including the sign bit will be used with an 18 character set. Bit = 62; for ( X = 2; X <= 19; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[20] = '\|'; // The seven letters that require a 4 bit offset encoding start here --------------------------------------------------- // Bit [44,41] represent the "T" child. Bit = 44; for ( X = 21; X <= 24; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[25] = '\|'; // Bit [40,37] represent the "S" child. Bit = 40; for ( X = 26; X <= 29; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[30] = '\|'; // Bit [36,33] represent the "R" child. Bit = 36; for ( X = 31; X <= 34; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[35] = '\|'; // Bit [32,29] represent the "P" child. Bit = 32; for ( X = 36; X <= 39; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[40] = '\|'; // Bit [28,25] represent the "O" child. Bit = 28; for ( X = 41; X <= 44; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[45] = '\|'; // Bit [24,21] represent the "N" child. Bit = 24; for ( X = 46; X <= 49; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[50] = '\|'; // Bit [20,17] represent the "M" child. Bit = 20; for ( X = 51; X <= 54; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[55] = '\|'; // The four letters that require a 3 bit offset encoding start here --------------------------------------------------- // Bit [16,14] represent the "L" child. Bit = 16; for ( X = 56; X <= 58; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[59] = '\|'; // Bit [13,11] represent the "I" child. Bit = 13; for ( X = 60; X <= 62; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[63] = '\|'; // Bit [10,8] represent the "G" child. Bit = 10; for ( X = 64; X <= 66; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[67] = '\|'; // Bit [7,5] represent the "E" child. Bit = 7; for ( X = 68; X <= 70; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[71] = '\|'; // The two letters that require a 2 bit offset encoding start here --------------------------------------------------- // Bit [4,3] represent the "G" child. Bit = 4; for ( X = 72; X <= 73; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[74] = '\|'; // Bit [2,1] represent the "E" child. Bit = 2; for ( X = 75; X <= 76; X++ ) { BinaryOffset[X] = (TheOffset & (long int)pow(2,Bit))?'1':'0'; Bit -= 1; } BinaryOffset[77] = '\|'; // Bit 0 represent the "A" child. BinaryOffset[78] = (TheOffset & (long int)pow(2,0))?'1':'0'; BinaryOffset[79] = '\|'; BinaryOffset[80] = '\0'; } void ExtractChildStringFromLongInt(long int TheOffset, char TheChildren){ int X; int FillThisSpot = 0; char CurrentChar; for ( X = 0; X < SIZE_OF_CHARACTER_SET; X++ ) { CurrentChar = CharacterSet[X]; if ( TheOffset & ChildLetterBitMasks[CurrentChar - 'A'] ) { TheChildren[FillThisSpot] = CurrentChar; FillThisSpot += 1; } } TheChildren[FillThisSpot] = '\0'; } int TraverseTheDawgArrayRecurse(int TheDawg, long int TheOffsets, int SuckOnIt, int CurrentIndex, char TheWordSoFar, int FillThisPosition, char CurrentLetter, int WordCounter, Bool PrintMe){ int X; long int CurrentOffset; int CurrentChild; int WhatsBelowMe = 0; TheWordSoFar[FillThisPosition] = CurrentLetter; if ( CurrentChild = (TheDawg[CurrentIndex] & CHILD_MASK) ) { for ( X = SIZE_OF_CHARACTER_SET - 1; X >= 0 ; X-- ) { // This assignment statement inside of a conditional statement seems very convoluted but all that it does is find a number between 0 and 14 // using two numbers that need bit masking and bit shifting. if ( CurrentOffset = (TheOffsets[(TheDawg[CurrentIndex] & OFFSET_INDEX_MASK) >> OffSET_BIT_SHIFT] & ChildLetterBitMasks[CharacterSet[X] - 'A']) ) { CurrentOffset >>= ChildLetterBitShifts[CharacterSet[X] - 'A']; CurrentOffset -= 1; WhatsBelowMe += TraverseTheDawgArrayRecurse(TheDawg, TheOffsets, SuckOnIt, CurrentChild + (int)CurrentOffset, TheWordSoFar, FillThisPosition + 1, CharacterSet[X], WordCounter, PrintMe); } } } if ( TheDawg[CurrentIndex] & END_OF_WORD_FLAG ) { WordCounter +=1; TheWordSoFar[FillThisPosition+ 1] = '\0'; if ( PrintMe == TRUE ) printf("#\|%d\| - \|%s\|\n", WordCounter, TheWordSoFar); //if ( PrintMe == TRUE ) fprintf(What, "#\|%d\| - \|%s\|\n", WordCounter, TheWordSoFar); if ( PrintMe == FALSE ) fprintf(WhatR, "#\|%d\| - \|%s\|\n", WordCounter, TheWordSoFar); WhatsBelowMe += 1; } SuckOnIt[CurrentIndex] = WhatsBelowMe; return WhatsBelowMe; } void TraverseTheDawgArray(int TheDawg, long int TheOffsets, int BelowingMe, Bool PrintToScreen){ int X; int TheCounter = 0; char RunningWord[MAX_WORD_LENGTH +1]; for ( X = SIZE_OF_CHARACTER_SET - 1; X >= 0 ; X-- ) { TraverseTheDawgArrayRecurse(TheDawg, TheOffsets, BelowingMe, X + 1, RunningWord, 0, CharacterSet[X], &TheCounter, PrintToScreen); } } // The Governing Equation of The Four Part DTDAWG: NextMarker = CurrentMarker - FCWTEOBL + NCWTEOBL - Flag; WTEOBL means Words to end of branch list. // FC means FirstChild. NC means NextChild, the one we want to go to. void TraverseTheFourPartDawgArrayRecurse(int One, long int Two, int Three, int CurrentIndex, char TheWordSoFar, int FillThisPosition, char CurrentLetter, int CurrentMarker){ int X; long int CurrentOffset; int CurrentChild; int NextMarker; int CurrentNCWTEOBL; TheWordSoFar[FillThisPosition] = CurrentLetter; if ( One[CurrentIndex] & END_OF_WORD_FLAG ) { TheWordSoFar[FillThisPosition+ 1] = '\0'; printf("#\|%d\| - \|%s\|\n", CurrentMarker, TheWordSoFar); } if ( CurrentChild = (One[CurrentIndex] & CHILD_MASK) ) { // Begin the calculation of the NextMarker; the part that is common to all children. NextMarker = CurrentMarker - Three[CurrentChild]; if ( One[CurrentIndex] & END_OF_WORD_FLAG ) NextMarker -= 1; for ( X = 0; X < SIZE_OF_CHARACTER_SET ; X++ ) { // This assignment statement inside of an if statement seems very convoluted but all that it does, with the help of the next line, // is find a number between 0 and 14, using two numbers that need bit masking and bit shifting. if ( CurrentOffset = (Two[(One[CurrentIndex] & OFFSET_INDEX_MASK) >> OffSET_BIT_SHIFT] & ChildLetterBitMasks[CharacterSet[X] - 'A']) ) { CurrentOffset >>= ChildLetterBitShifts[CharacterSet[X] - 'A']; CurrentOffset -= 1; CurrentNCWTEOBL = Three[CurrentChild + (int)CurrentOffset]; NextMarker += CurrentNCWTEOBL; TraverseTheFourPartDawgArrayRecurse(One, Two, Three, CurrentChild + (int)CurrentOffset, TheWordSoFar, FillThisPosition + 1, CharacterSet[X], NextMarker); NextMarker -= CurrentNCWTEOBL; } } } } void TraverseTheFourPartDawgArray(int First, long int Second, int Third){ int X; char RunningWord[MAX_WORD_LENGTH +1]; for ( X = 0; X < SIZE_OF_CHARACTER_SET ; X++ ) { TraverseTheFourPartDawgArrayRecurse(First, Second, Third, (X + 1), RunningWord, 0, CharacterSet[X], Third[X + 1]); } } int main(){ int X; int Y; int NumberOfPartOneNodes; int NumberOfEndOfLists; long int NumberOfPartTwoNodes; int PartOneArray; long int PartTwoArray; int CurrentCount; int CurrentEndOfList; int FurthestBigNode = 0; WhatR = fopen(DRUMPSTER, "w"); // We are going to need to know the number of words in and below each node, then we can compute number till the end of the child list which is of paramount importance in word tracking. // When we have calculated the NumberOfWordsToEndOfBranchList, the NumberOfWordsBelowMe can be calculated with max 1 subtraction. int NumberOfWordsBelowMe; int NumberOfWordsToEndOfBranchList; int RearrangedNumberOfWordsToEndOfBranchList; // In order to calculate NumberOfWordsToEndOfBranchList we need to know the location of the closest EndOfList node. int EndOfListLocations; FILE PartOne; FILE PartTwo; // The PartThree data file requires rearranging of the PartOne node data so that the most WordsToEndOfBranchList numbers can be housed in a single unsigned char. // Only lists with nodes containing a number greater then 255 will need more bits. // PartTwo will remain untouched. FILE ListE; FILE Text; PartOne = fopen(OPTIMAL_DIRECT_DAWG_DATA_PART_ONE, "rb"); PartTwo = fopen(OPTIMAL_DIRECT_DAWG_DATA_PART_TWO, "rb"); ListE = fopen(END_OF_LIST_DATA, "rb"); Text = fopen(FOUR_PART_DTDAWG_14_TEXT, "w"); fread(&NumberOfPartOneNodes, 4, 1, PartOne); fread(&NumberOfPartTwoNodes, 8, 1, PartTwo); fread(&NumberOfEndOfLists, 4, 1, ListE); PartOneArray = (int )malloc((NumberOfPartOneNodes + 1) sizeof(int)); PartTwoArray = (long int )malloc(NumberOfPartTwoNodes sizeof(long int)); NumberOfWordsBelowMe = (int )malloc((NumberOfPartOneNodes + 1) sizeof(int)); EndOfListLocations = (int )malloc(NumberOfEndOfLists sizeof(int)); NumberOfWordsToEndOfBranchList =(int )malloc((NumberOfPartOneNodes + 1) sizeof(int)); RearrangedNumberOfWordsToEndOfBranchList =(int )malloc((NumberOfPartOneNodes + 1) sizeof(int)); NumberOfWordsBelowMe[0] = 0; NumberOfWordsToEndOfBranchList[0] = 0; RearrangedNumberOfWordsToEndOfBranchList[0] = 0; PartOneArray[0] = 0; for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { fread(&PartOneArray[X], 4, 1, PartOne); } for ( X = 0; X < NumberOfPartTwoNodes; X++ ) { fread(&PartTwoArray[X], 8, 1, PartTwo); } for ( X = 0; X < NumberOfEndOfLists; X++ ) { fread(&EndOfListLocations[X], 4, 1, ListE); } fclose(PartOne); fclose(PartTwo); fclose(ListE); // This function is run at this point in time to fill the NumberOfWordsBelowMe array. TraverseTheDawgArray(PartOneArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); CurrentEndOfList = 0; // This little piece of code compiles the NumberOfWordsToEndOfBranchList array. The requirements are the NumberOfWordsBelowMe array and the EndOfListLocations array. for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { CurrentCount = 0; for ( Y = X; Y <= EndOfListLocations[CurrentEndOfList]; Y++ ) { CurrentCount += NumberOfWordsBelowMe[Y]; } NumberOfWordsToEndOfBranchList[X] = CurrentCount; //printf("Node\|%d\|, WordsBelowMe\|%d\|, WordsToEndOfBranch\|%d\|, EndOfBranch\|%d\|\n", X, NumberOfWordsBelowMe[X], NumberOfWordsToEndOfBranchList[X], EndOfListLocations[CurrentEndOfList]); if ( X == EndOfListLocations[CurrentEndOfList] )CurrentEndOfList +=1; } CurrentEndOfList = 0; // Now with preliminary analysis complete, it is time to rearrange the PartOne nodes and then set up PartThree. int ListSizeCounter[SIZE_OF_CHARACTER_SET + 1]; int TotalNumberOfLists = 0; Bool AreWeInBigList = FALSE; int TheCurrentChild; int StartOfCurrentList = 1; int SizeOfCurrentList = EndOfListLocations[0]; int EndOfCurrentList = EndOfListLocations[0]; int InsertionPoint = 1; int CurrentlyCopyingThisList = 0; int CurrentAdjustments; int PartOneRearrangedArray; PartOneRearrangedArray = (int )malloc((NumberOfPartOneNodes + 1) sizeof(int)); CurrentAdjustments = (int )malloc((NumberOfPartOneNodes + 1) sizeof(int)); PartOneRearrangedArray[0] = 0; for ( X = 0; X <= NumberOfPartOneNodes; X++ ) CurrentAdjustments[X] = 0; for ( X = 0; X <= SIZE_OF_CHARACTER_SET; X++ ) ListSizeCounter[X] = 0; // This code is responsible for the rearrangement of the node lists inside of the DAWG int array. CurrentEndOfList = 0; for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { if ( NumberOfWordsToEndOfBranchList[X] > 255 ) AreWeInBigList = TRUE; if ( X == EndOfCurrentList ) { ListSizeCounter[SizeOfCurrentList] += 1; // We are now at the end of a big list that must to be moved up to the InsertionPoint. This also implies moving everything between its current location and its new one. if ( AreWeInBigList == TRUE ) { // First step is to copy the CurrentList into the new array at its correct position. for ( Y = 0; Y < SizeOfCurrentList; Y++ ) { PartOneRearrangedArray[InsertionPoint + Y] = PartOneArray[StartOfCurrentList + Y]; RearrangedNumberOfWordsToEndOfBranchList[InsertionPoint + Y] = NumberOfWordsToEndOfBranchList[StartOfCurrentList + Y]; } // The following steps are required when we are actually moving the position of a list. The first set of lists will bypass these steps. if ( InsertionPoint != StartOfCurrentList ) { // Step 2 is to move all of the nodes between the original and final location, "SizeOfCurrentList" number of places back, starting from the end. for ( Y = EndOfCurrentList; Y >= (InsertionPoint + SizeOfCurrentList); Y-- ) { PartOneArray[Y] = PartOneArray[Y - SizeOfCurrentList]; NumberOfWordsToEndOfBranchList[Y] = NumberOfWordsToEndOfBranchList[Y - SizeOfCurrentList]; } // Step 3 is to copy the list we are moving up from the rearranged array back into the original. for ( Y = InsertionPoint; Y < (InsertionPoint + SizeOfCurrentList); Y++ ) { PartOneArray[Y] = PartOneRearrangedArray[Y]; NumberOfWordsToEndOfBranchList[Y] = RearrangedNumberOfWordsToEndOfBranchList[Y]; } // Step 4 is to fill the CurrentAdjustments array with the amount that each child references must be moved. The two arrays are identical now up to the new insertion point. // At this stage, the CurrentAdjustments array is all zeros. for ( Y = 1; Y <= NumberOfPartOneNodes; Y++ ) { TheCurrentChild = (PartOneArray[Y] & CHILD_MASK); if ( (TheCurrentChild >= InsertionPoint) && (TheCurrentChild < StartOfCurrentList) ) CurrentAdjustments[Y] = SizeOfCurrentList; if ( (TheCurrentChild >= StartOfCurrentList) && (TheCurrentChild <= EndOfCurrentList) ) CurrentAdjustments[Y] = InsertionPoint - StartOfCurrentList; } // Step 5 is to fix all of the child reference values in both of the arrays. // Start with the rearranged array. for ( Y = 1; Y < (InsertionPoint + SizeOfCurrentList); Y++ ) { if ( CurrentAdjustments[Y] != 0 ) PartOneRearrangedArray[Y] += CurrentAdjustments[Y]; } // Finish with the original array. Make sure to zero all the values after the adjustments have been made to get ready for the next round. for ( Y = 1; Y <= NumberOfPartOneNodes; Y++ ) { if ( CurrentAdjustments[Y] != 0 ) { PartOneArray[Y] += CurrentAdjustments[Y]; CurrentAdjustments[Y] = 0; } } } // Step 7 is to set the new InsertionPoint and change the EndOfListLocations, so that they reflect the shift. InsertionPoint += SizeOfCurrentList; // Shift all of the end of lists between the currently copying and the CurrentEndOfList. for ( Y = CurrentEndOfList; Y > CurrentlyCopyingThisList; Y-- ) { EndOfListLocations[Y] = EndOfListLocations[Y - 1] + SizeOfCurrentList; } EndOfListLocations[CurrentlyCopyingThisList] = InsertionPoint - 1; CurrentlyCopyingThisList += 1; } // Even when we are not in a big list, we still need to update the current list parameters. CurrentEndOfList +=1; SizeOfCurrentList = EndOfListLocations[CurrentEndOfList] - EndOfCurrentList; EndOfCurrentList = EndOfListLocations[CurrentEndOfList]; StartOfCurrentList = X + 1; AreWeInBigList = FALSE; } } // Step 8 is to copy all of the small lists from the original array to the rearranged array. All of the references should be properly adjusted at this point. for ( X = InsertionPoint; X <= NumberOfPartOneNodes; X++ ) { PartOneRearrangedArray[X] = PartOneArray[X]; RearrangedNumberOfWordsToEndOfBranchList[X] = NumberOfWordsToEndOfBranchList[X]; } // The rearrangement of the DAWG lists to reduce size of the PartThree data file is complete, so check if the new and old lists are identical, because they should be. for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { if ( PartOneArray[X] != PartOneRearrangedArray[X] ) printf("What A Mistake!\n"); assert(PartOneArray[X] == PartOneRearrangedArray[X]); assert(RearrangedNumberOfWordsToEndOfBranchList[X] == NumberOfWordsToEndOfBranchList[X]); } // The two arrays are now identical, so as a final precaution, traverse the rearranged array. TraverseTheDawgArray(PartOneRearrangedArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); // Check for duplicate lists. It is now highly likely that there are duplicates printf("\n"); // Add up the total number of lists. for ( X = 1; X <= SIZE_OF_CHARACTER_SET; X++ ) { TotalNumberOfLists += ListSizeCounter[X]; } int NodeListsBySize = (int )malloc((SIZE_OF_CHARACTER_SET + 1) * sizeof(int )); int WhereWeAt[SIZE_OF_CHARACTER_SET + 1]; for ( X = 0; X <= SIZE_OF_CHARACTER_SET; X++ ) WhereWeAt[X] = 0; for ( X = 1; X <= SIZE_OF_CHARACTER_SET; X++ ) { NodeListsBySize[X] = (int )malloc(ListSizeCounter[X] * sizeof(int)); } // We are now required to fill the NodeListsBySize array. Simply copy over the correct EndOfList information. // Note that the EndOfList information reflects the readjustment that just took place. CurrentEndOfList = 0; EndOfCurrentList = EndOfListLocations[0]; SizeOfCurrentList = EndOfListLocations[0]; for ( X = 0; X < NumberOfEndOfLists; X++ ) { (NodeListsBySize[SizeOfCurrentList])[WhereWeAt[SizeOfCurrentList]] = EndOfCurrentList; WhereWeAt[SizeOfCurrentList] += 1; CurrentEndOfList += 1; SizeOfCurrentList = EndOfListLocations[CurrentEndOfList] - EndOfCurrentList; EndOfCurrentList = EndOfListLocations[CurrentEndOfList]; } int Z; int V; int W; int TheNewChild; int TotalNumberOfKilledLists = 0; int NewNumberOfKilledLists = -1; int InspectThisEndOfList; int MaybeReplaceWithThisEndOfList; int NumberOfListsKilledThisRound = -1; int CurrentNumberOfPartOneNodes = NumberOfPartOneNodes; Bool EliminateCurrentList = TRUE; // Keep attempting to kill lists until there are no more redundant lists. It turns out that a single run does the trick. //Without an explicit character member in each node, many of the lists are redundant, run a series of tests to exterminate these lists. while ( NumberOfListsKilledThisRound != 0 ) { printf("Run a redundant check:\n"); NumberOfListsKilledThisRound = 0; for ( X = 1; X <= SIZE_OF_CHARACTER_SET; X++ ) { printf("Eliminate Lists of Size \|%d\|\n", X); while ( NewNumberOfKilledLists != 0 ) { NewNumberOfKilledLists = 0; for ( Y = 1; Y < ListSizeCounter[X]; Y++ ) { InspectThisEndOfList = (NodeListsBySize[X])[Y]; for ( Z = 0; Z < Y; Z++ ) { MaybeReplaceWithThisEndOfList = (NodeListsBySize[X])[Z]; for( W = 0; W < X; W++ ) { if ( PartOneArray[InspectThisEndOfList - W] != PartOneArray[MaybeReplaceWithThisEndOfList - W] ) { EliminateCurrentList = FALSE; break; } } // When eliminating a list, make sure to adjust the WTEOBL data. if ( EliminateCurrentList == TRUE ) { // Step 1 - replace all references to the duplicate list with the earlier equivalent. for( V = 1; V <= CurrentNumberOfPartOneNodes; V++ ) { TheCurrentChild = (PartOneArray[V] & CHILD_MASK); if ( (TheCurrentChild > (InspectThisEndOfList - X)) && (TheCurrentChild <= InspectThisEndOfList) ) { TheNewChild = MaybeReplaceWithThisEndOfList - (InspectThisEndOfList - TheCurrentChild); PartOneArray[V] -= TheCurrentChild; PartOneArray[V] += TheNewChild; } } // Step 2 - eliminate the dupe list by moving the higher lists forward. for ( V = (InspectThisEndOfList - X + 1); V <= (CurrentNumberOfPartOneNodes - X); V++ ) { PartOneArray[V] = PartOneArray[V + X]; RearrangedNumberOfWordsToEndOfBranchList[V] = RearrangedNumberOfWordsToEndOfBranchList[V + X]; } // Step 3 - change CurrentNumberOfPartOneNodes. CurrentNumberOfPartOneNodes -= X; // Step 4 - lower all references to the nodes coming after the dupe list. for ( V = 1; V <= CurrentNumberOfPartOneNodes; V++ ) { TheCurrentChild = (PartOneArray[V] & CHILD_MASK); if ( TheCurrentChild > InspectThisEndOfList ){ PartOneArray[V] -= X; } } // Step 5 - readjust all of the lists after Y forward 1 and down X to the value, and lower ListSizeCounter[X] by 1. for( V = Y; V < (ListSizeCounter[X] - 1); V++ ) { (NodeListsBySize[X])[V] = (NodeListsBySize[X])[V + 1] - X; } ListSizeCounter[X] -= 1; // Step 6 - Lower any list, of any size, greater than (NodeListsBySize[X])[Y], down by X. for ( V = 1; V <= (X - 1); V++ ) { for ( W = 0; W < ListSizeCounter[V]; W++ ) { if ( (NodeListsBySize[V])[W] > InspectThisEndOfList ) (NodeListsBySize[V])[W] -= X; } } for ( V = (X + 1); V <= SIZE_OF_CHARACTER_SET; V++ ) { for ( W = 0; W < ListSizeCounter[V]; W++ ) { if ( (NodeListsBySize[V])[W] > InspectThisEndOfList ) (NodeListsBySize[V])[W] -= X; } } // Step 7 - lower Y by 1 and increase NewNumberOfKilledLists. Y -= 1; NewNumberOfKilledLists += 1; break; } EliminateCurrentList = TRUE; } } printf("NewNumberOfKilledLists \|%d\|.\n", NewNumberOfKilledLists); TotalNumberOfKilledLists += NewNumberOfKilledLists; NumberOfListsKilledThisRound += NewNumberOfKilledLists; } NewNumberOfKilledLists = -1; } } printf("\nOriginalNumberOfPartOneNodes\|%d\|\n", NumberOfPartOneNodes); printf("CurrentNumberOfPartOneNodes\|%d\|\n\n", CurrentNumberOfPartOneNodes); printf("The Total Number Of Killed Lists Is \|%d\|\n", TotalNumberOfKilledLists); printf("The Total Number Of Killed Nodes Is \|%d\|\n\n", NumberOfPartOneNodes - CurrentNumberOfPartOneNodes); TraverseTheDawgArray(PartOneArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); // Now that we have proved how a large number of additional nodes become redundant, set up and write the original, // the first, the only, 3 part direct tracking directed acyclic word graph. // Test it, write an anagram function, and finally a boggle scoring parallel method using 14 worker threads and a main thread, using OTS time stamp paradigm. // It is of critical importance that each thread is started only once during program execution. // RearrangedNumberOfWordsToEndOfBranchList has been kept up to date during the reworking of the DAWG // and there are CurrentNumberOfPartOneNodes meaningful value in the array. // EndOfListLocations needs to be recompiled from what is left from the NodeListsBySize arrays. TotalNumberOfLists = 0; for ( X = 1; X <= SIZE_OF_CHARACTER_SET; X++ ) { TotalNumberOfLists += ListSizeCounter[X]; printf("List Size\|%d\| - Number Of Lists\|%d\|\n", X, ListSizeCounter[X]); } // Set all of the values that we are going to use to something very high. for ( X = 0; X < TotalNumberOfLists; X++ ) { EndOfListLocations[X] = 100000; } // Filter all of the living EndOfList values into the EndOfListLocations array. int TempValue; for ( X = SIZE_OF_CHARACTER_SET; X >= 1; X-- ) { for ( Y = 0; Y < ListSizeCounter[X]; Y++ ) { EndOfListLocations[TotalNumberOfLists - 1] = (NodeListsBySize[X])[Y]; // The new list has been placed at the end of the list, now filter it up to where it should be. for ( Z = (TotalNumberOfLists - 1); Z > 0; Z-- ) { if ( EndOfListLocations[Z - 1] > EndOfListLocations[Z] ) { TempValue = EndOfListLocations[Z - 1]; EndOfListLocations[Z - 1] = EndOfListLocations[Z]; EndOfListLocations[Z] = TempValue; } else break; } } } // Test for logical errors in the list elimination procedure. for ( X = 0; X < (TotalNumberOfLists - 1); X++ ) { if ( EndOfListLocations[X] == EndOfListLocations[X + 1] ) printf("No Two Lists Can End On The Same Node. \|%d\|\n", EndOfListLocations[X]); } // Find out the final index number that requires an integer greater in size than a byte to hold it. This will become the pivot point between part 3 and part 4. CurrentEndOfList = 0; FurthestBigNode = 0; int FirstSmallNode; for ( X = 1; X <= CurrentNumberOfPartOneNodes; X++ ) {if ( RearrangedNumberOfWordsToEndOfBranchList[X] > 255 ) FurthestBigNode = X; if ( X == EndOfListLocations[CurrentEndOfList] ) CurrentEndOfList += 1; } FirstSmallNode = FurthestBigNode + 1; printf("\nThe first node that requires only one byte for its WTEOBL is \|%d\|, it is the transition between PartThree And PartFour.\n\n", FirstSmallNode); //for ( X = 0; X < TotalNumberOfLists; X++ ) { // printf("EOL \|%d\|- \|%d\|\n", X, EndOfListLocations[X]); //} TraverseTheFourPartDawgArray(PartOneArray, PartTwoArray, RearrangedNumberOfWordsToEndOfBranchList); // All of the information required to store the DTDAWG into data files has been compiled, so proceed to save the data to disk. FILE FinalPartOneData = fopen(FOUR_PART_DTDAWG_14_PART_ONE, "wb"); FILE FinalPartTwoData = fopen(FOUR_PART_DTDAWG_14_PART_TWO, "wb"); FILE FinalPartThreeData = fopen(FOUR_PART_DTDAWG_14_PART_THREE, "wb"); FILE FinalPartFourData = fopen(FOUR_PART_DTDAWG_14_PART_FOUR, "wb"); char PartOneNodeInBinary[38]; char PartTwoNodeInBinary[81]; char ChildString[SIZE_OF_CHARACTER_SET + 1]; // The final part one data is located in "PartOneArray" and has size "CurrentNumberOfPartOneNodes". Remember that the 0 node is the NULL node, it will not be written to file. fwrite(&CurrentNumberOfPartOneNodes, 4, 1, FinalPartOneData); fprintf(Text, "Part One\n\n"); for ( X = 1; X <= CurrentNumberOfPartOneNodes; X++ ) { fwrite(&(PartOneArray[X]), 4, 1, FinalPartOneData); ConvertIntNodeToBinaryString(PartOneArray[X], PartOneNodeInBinary); ExtractChildStringFromLongInt(PartTwoArray[(PartOneArray[X] & OFFSET_INDEX_MASK)>>15], ChildString); fprintf(Text, "1Node#\|%d\|-\|%d\|-Binary%s-Offsets\|%d\|-Child\|%d\|-Children\|%s\|\n", X, PartOneArray[X], PartOneNodeInBinary, (PartOneArray[X] & OFFSET_INDEX_MASK)>>15, (PartOneArray[X] & CHILD_MASK), ChildString); } // The final part two data has not changed at all, so simply copy the exact data to the new file. This array has no NULL value so start at zero. fwrite(&NumberOfPartTwoNodes, 8, 1, FinalPartTwoData); fprintf(Text, "\nPart Two\n\n"); for ( X = 0; X < NumberOfPartTwoNodes; X++ ) { fwrite(&(PartTwoArray[X]), 8, 1, FinalPartTwoData); ConvertOffsetLongIntToBinaryString(PartTwoArray[X], PartTwoNodeInBinary); ExtractChildStringFromLongInt(PartTwoArray[X], ChildString); fprintf(Text, "2Node#\|%d\|-\|%ld\|-Binary%s-ChildString\|%s\|\n", X, PartTwoArray[X], PartTwoNodeInBinary, ChildString); } // The final part three array consists of the first "FurthestBigNode" number of values in the // "RearrangedNumberOfWordsToEndOfBranchList" array, and note that the zero value is null so do not include it. // Also these values should use 32 bit integers, or 4 bytes. fwrite(&FurthestBigNode, 4, 1, FinalPartThreeData); fprintf(Text, "\nPart Three\n\n"); for ( X = 1; X <= FurthestBigNode; X++ ) { fwrite(&(RearrangedNumberOfWordsToEndOfBranchList[X]), 4, 1, FinalPartThreeData); fprintf(Text, "3Node#\|%d\| - WTEOBL\|%d\|\n", X, RearrangedNumberOfWordsToEndOfBranchList[X]); } // The final part four array consists of the [FirstSmallNode, CurrentNumberOfPartOneNodes] values in the "RearrangedNumberOfWordsToEndOfBranchList" array. int TheSizeOfPartFour = CurrentNumberOfPartOneNodes - FurthestBigNode; unsigned char TheCurrentPartFourNode; printf("TheSizeOfPartFour\|%d\|\n", TheSizeOfPartFour); fprintf(Text, "\nPart Four\n\n"); for ( X = FirstSmallNode; X <= CurrentNumberOfPartOneNodes; X++ ) { assert(RearrangedNumberOfWordsToEndOfBranchList[X] < 255); TheCurrentPartFourNode = RearrangedNumberOfWordsToEndOfBranchList[X]; assert(TheCurrentPartFourNode == RearrangedNumberOfWordsToEndOfBranchList[X]); fwrite(&TheCurrentPartFourNode, 1, 1, FinalPartFourData); fprintf(Text, "4Node#\|%d\| - For\|%d\| - WTEOBL\|%d\|\n", X - FirstSmallNode, X, TheCurrentPartFourNode); } printf("The four data files required for the 4 part DATDAWG have now been written. Inspect the text verification files.\n\n"); return 0; }

ADTDAWG_Word_Search_For_Lexicon_14.c - This Program Is Simple
#include <stdlib.h> #include <math.h> #include <stdio.h> #include <string.h> #define FOUR_PART_DTDAWG_14_PART_ONE "Four_Part_1_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_TWO "Four_Part_2_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_THREE "Four_Part_3_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_FOUR "Four_Part_4_DTDAWG_For_Lexicon_14.dat" #define SIZE_OF_CHARACTER_SET 14 #define MAX_WORD_LENGTH 15 #define MIN_WORD_LENGTH 3 #define END_OF_WORD_FLAG 67108864 #define CHILD_MASK 32767 #define OFFSET_INDEX_MASK 67076096 #define OffSET_BIT_SHIFT 15 #define MAX_INPUT_SIZE 100 #define LOWERIT 32 #define BOGUS -99 #define TOTAL_WORDS_IN_LEXICON 44220 #define NAME_OF_LEXICON "Lexicon_14" typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; char CharacterSet[SIZE_OF_CHARACTER_SET] = {'A', 'C', 'D', 'E', 'G', 'I', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T'}; char CharacterLocations[26] = {0, BOGUS, 1, 2, 3, BOGUS, 4, BOGUS, 5, BOGUS, BOGUS, 6, 7, 8, 9, 10, BOGUS, 11, 12, 13, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS}; long int ChildLetterBitMasks[26] = {1, 0, 6, 24, 224, 0, 1792, 0, 14336, 0, 0, 114688, 1966080, 31457280, 503316480, 8053063680, 0, 128849018880, 2061584302080, 32985348833280, 0, 0, 0, 0, 0, 0}; int ChildLetterBitShifts[26] = {0, BOGUS, 1, 3, 5, BOGUS, 8, BOGUS, 11, BOGUS, BOGUS, 14, 17, 21, 25, 29, BOGUS, 33, 37, 41, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS}; // This function converts any lower case letters in the string "RawWord," into all capitals, so that the whole string is made of capital letters. // It is useful for when a user enters a string to be anagrammed. void MakeMeAllCapital(char* RawWord){ int X; for( X = 0; X < strlen( RawWord ); X++ ) { if( RawWord[X] >= 'a' && RawWord[X] <= 'z' ) RawWord[X] = RawWord[X] - LOWERIT; } } // This function tests for validity of "ThisWord" for potential inclusion in the lexicon. Bool IsThisWordValid(char ThisWord){ int X; int ThisLength = strlen(ThisWord); if ( ThisLength < MIN_WORD_LENGTH ) return FALSE; if ( ThisLength > MAX_WORD_LENGTH ) return FALSE; for ( X = 0; X < ThisLength; X++ ) { if ( ThisWord[X] < 'A' \|\| ThisWord[X] > 'Z') return FALSE; if ( CharacterLocations[ThisWord[X] - 'A'] == BOGUS) return FALSE; } return TRUE; } inline int WordsToEndOfBranchList(int ThirdPart, unsigned char FourthPart, int TheIndexInQuestion, int TheCutoff){ return (TheIndexInQuestion < TheCutoff)? ThirdPart[TheIndexInQuestion]: (int)FourthPart[TheIndexInQuestion - TheCutoff]; } // This function returns the lexicon position of "TheWord" if found, and BOGUS otherwise. int TrackingWordSearchWithFourPartDawgArrayRecurse(const char TheWord, int One, long int Two, int Three, unsigned char Four , int BeginFourAt, int CurrentIndex, int TheCurrentPosition, char CurrentLetter, int CurrentMarker, int SizeOfTheWord, int StampingSet, int CurrentTime){ long int CurrentOffset; int CurrentChild; int NextMarker = 0; if ( One[CurrentIndex] & END_OF_WORD_FLAG ) { if ( (TheCurrentPosition + 1) == SizeOfTheWord ) { printf("Found \|%s\| At Position \|%d\|. Old Stamp \|%d\| - New Stamp \|%d\|\n", TheWord, CurrentMarker, StampingSet[CurrentMarker], CurrentTime); StampingSet[CurrentMarker] = CurrentTime; return CurrentMarker; } NextMarker -= 1; } if ( (TheCurrentPosition + 1) == SizeOfTheWord ) return BOGUS; // If the current node has a child list, then check for the next letter in TheWord. if ( CurrentChild = (One[CurrentIndex] & CHILD_MASK) ) { if ( CurrentOffset = (Two[(One[CurrentIndex] & OFFSET_INDEX_MASK) >> OffSET_BIT_SHIFT] & ChildLetterBitMasks[TheWord[TheCurrentPosition + 1] - 'A']) ) { CurrentOffset >>= ChildLetterBitShifts[TheWord[TheCurrentPosition + 1] - 'A']; CurrentOffset -= 1; NextMarker += (CurrentMarker - WordsToEndOfBranchList(Three, Four, CurrentChild, BeginFourAt)+ WordsToEndOfBranchList(Three, Four, (CurrentChild + (int)CurrentOffset), BeginFourAt)); return TrackingWordSearchWithFourPartDawgArrayRecurse(TheWord, One, Two, Three, Four, BeginFourAt, (CurrentChild + (int)CurrentOffset) , (TheCurrentPosition + 1), TheWord[TheCurrentPosition + 1], NextMarker, SizeOfTheWord, StampingSet, CurrentTime); } } return BOGUS; } // The characters inside of TheRawWord should all be contained in the specified character set. // TheRawWord should have min length MIN_WORD_LENGTH,and max length MAX_WORD_LENGTH , if these conditions are not met, return BOGUS. // If theRawWord does exist, return the lexicon position integer. int TrackingWordSearchWithFourPartDawgArray(char TheRawWord, int First, long int Second, int Third, unsigned char Fourth, int StartOfFourth, int TheTimeStamps, int TheTime){ if ( IsThisWordValid(TheRawWord) == FALSE ) return BOGUS; TrackingWordSearchWithFourPartDawgArrayRecurse(TheRawWord, First, Second, Third, Fourth, StartOfFourth , (CharacterLocations[TheRawWord[0] - 'A'] + 1), 0, TheRawWord[0], Third[CharacterLocations[TheRawWord[0] - 'A'] + 1], strlen(TheRawWord), TheTimeStamps, TheTime); } int main(){ FILE PartOne = fopen(FOUR_PART_DTDAWG_14_PART_ONE, "rb"); FILE PartTwo = fopen(FOUR_PART_DTDAWG_14_PART_TWO, "rb"); FILE PartThree = fopen(FOUR_PART_DTDAWG_14_PART_THREE, "rb"); FILE PartFour = fopen(FOUR_PART_DTDAWG_14_PART_FOUR, "rb"); int SizeOfPartOne; long int SizeOfPartTwo; int SizeOfPartThree; int SizeOfPartFour; int PartOneArray; long int PartTwoArray; int PartThreeArray; unsigned char PartFourArray; int WhatRoundAreWeOn = 1; int X; int PartThreeFourTransition; char TheUserInput[MAX_INPUT_SIZE + 1]; char Question[MAX_INPUT_SIZE + 1]; char DOM = '\0'; Bool ShouldWeContinue = TRUE; int LexiconTimeStamps[TOTAL_WORDS_IN_LEXICON + 1]; int CurrentPosition; int InputLength; for ( X = 0; X <= TOTAL_WORDS_IN_LEXICON; X++ ) LexiconTimeStamps[X] = 0; fread(&SizeOfPartOne, 4, 1, PartOne); fread(&SizeOfPartTwo, 8, 1, PartTwo); fread(&SizeOfPartThree, 4, 1, PartThree); PartThreeFourTransition = SizeOfPartThree + 1; SizeOfPartFour = SizeOfPartOne - SizeOfPartThree; printf("\n"); printf("SizeOfPartOne \|%d\|\n", SizeOfPartOne); printf("SizeOfPartTwo \|%ld\|\n", SizeOfPartTwo); printf("SizeOfPartThree \|%d\|\n", SizeOfPartThree); printf("Transition \|%d\|\n", PartThreeFourTransition); printf("SizeOfPartFour \|%d\|\n", SizeOfPartFour); printf("\n"); PartOneArray = (int )malloc((SizeOfPartOne + 1) * sizeof(int)); PartTwoArray = (long int )malloc(SizeOfPartTwo sizeof(long int)); PartThreeArray = (int )malloc((SizeOfPartThree + 1) sizeof(int)); PartFourArray = (unsigned char )malloc(SizeOfPartFour sizeof(char)); PartOneArray[0] = 0; for ( X = 1; X <= SizeOfPartOne; X++ ) { fread(&(PartOneArray[X]), 4, 1, PartOne); } for ( X = 0; X < SizeOfPartTwo; X++ ) { fread(&(PartTwoArray[X]), 8, 1, PartTwo); } PartThreeArray[0] = 0; for ( X = 1; X <= SizeOfPartThree; X++ ) { fread(&(PartThreeArray[X]), 4, 1, PartThree); } for ( X = 0; X < SizeOfPartFour; X++ ) { fread(&(PartFourArray[X]), 1, 1, PartFour); } fclose(PartOne); fclose(PartTwo); fclose(PartThree); fclose(PartFour); printf("The four data files have been opened and read into memory.\n\n"); while ( ShouldWeContinue == TRUE ) { printf("Type in the word that you want to search for in %s:", NAME_OF_LEXICON); fgets(TheUserInput, MAX_INPUT_SIZE, stdin); // Get rid of the new line char that fgets always reads in. InputLength = strlen(TheUserInput) - 1; TheUserInput[InputLength] = '\0'; MakeMeAllCapital(TheUserInput); printf("We are now going to search for this word \|%s\|.\n\n", TheUserInput); CurrentPosition = TrackingWordSearchWithFourPartDawgArray(TheUserInput, PartOneArray, PartTwoArray, PartThreeArray, PartFourArray, PartThreeFourTransition, LexiconTimeStamps, WhatRoundAreWeOn); if ( CurrentPosition == BOGUS) printf("This word \|%s\| simply does not exist in %s\n", TheUserInput, NAME_OF_LEXICON); while ( !(DOM == 'Y' \|\| DOM == 'y' \|\| DOM == 'N' \|\| DOM == 'n') ) { printf("\nDo you want to quit(y/n)?"); fgets(Question, MAX_INPUT_SIZE, stdin); DOM = Question[0]; } if ( DOM == 'Y' \|\| DOM == 'y' ) ShouldWeContinue = FALSE; DOM = '\0'; WhatRoundAreWeOn += 1; } printf("\nYou have chosen to exit the program after \|%d\| rounds.\n\n", (WhatRoundAreWeOn - 2)); return 0; }

ADTDAWG_Anagram_For_Lexicon_14.c - This Program Is Simple
#include <stdlib.h> #include <math.h> #include <stdio.h> #include <string.h> #define FOUR_PART_DTDAWG_14_PART_ONE "Four_Part_1_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_TWO "Four_Part_2_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_THREE "Four_Part_3_DTDAWG_For_Lexicon_14.dat" #define FOUR_PART_DTDAWG_14_PART_FOUR "Four_Part_4_DTDAWG_For_Lexicon_14.dat" #define SIZE_OF_CHARACTER_SET 14 #define MAX_WORD_LENGTH 15 #define MIN_WORD_LENGTH 3 #define END_OF_WORD_FLAG 67108864 #define CHILD_MASK 32767 #define OFFSET_INDEX_MASK 67076096 #define OffSET_BIT_SHIFT 15 #define MAX_INPUT_SIZE 100 #define LOWERIT 32 #define BOGUS -99 #define TOTAL_WORDS_IN_LEXICON 44220 typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; char CharacterSet[SIZE_OF_CHARACTER_SET] = {'A', 'C', 'D', 'E', 'G', 'I', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T'}; char CharacterLocations[26] = {0, BOGUS, 1, 2, 3, BOGUS, 4, BOGUS, 5, BOGUS, BOGUS, 6, 7, 8, 9, 10, BOGUS, 11, 12, 13, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS}; long int ChildLetterBitMasks[26] = {1, 0, 6, 24, 224, 0, 1792, 0, 14336, 0, 0, 114688, 1966080, 31457280, 503316480, 8053063680, 0, 128849018880, 2061584302080, 32985348833280, 0, 0, 0, 0, 0, 0}; int ChildLetterBitShifts[26] = {0, BOGUS, 1, 3, 5, BOGUS, 8, BOGUS, 11, BOGUS, BOGUS, 14, 17, 21, 25, 29, BOGUS, 33, 37, 41, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS, BOGUS}; // qsort char comparison function. int CharCompare(const void a, const void b){ const char A = (const char )a; const char B = (const char )b; return (int)(A - B); } // This function converts any lower case letters in the string "RawWord," into all capitals, so that the whole string is made of capital letters. // It is useful for when a user enters a string to be anagrammed. void MakeMeAllCapital(char* RawWord){ int X; for( X = 0; X < strlen( RawWord ); X++ ) { if( RawWord[X] >= 'a' && RawWord[X] <= 'z' ) RawWord[X] = RawWord[X] - LOWERIT; } } // This function will place the letters in the string "TheString," into alphabetical order. void AlphabetizeMe(char TheString){ qsort(TheString, strlen(TheString), sizeof(char), CharCompare); } // This function will remove any characters in the string "TheString" that are not capital Letters contained in the chosen character set. void RidMeOfBadCharacters(char TheString){ int X; int Y; for ( X = 0; TheString[X] != '\0'; X++ ) { if ( TheString[X] < 'A' \|\| TheString[X] > 'Z' ) { for ( Y = X; TheString[Y] != '\0'; Y++) TheString[Y] = TheString[Y + 1]; X -= 1; } else if ( ChildLetterBitShifts[TheString[X] - 'A'] == BOGUS ) { for ( Y = X; TheString[Y] != '\0'; Y++) TheString[Y] = TheString[Y + 1]; X -= 1; } } } inline int WordsToEndOfBranchList(int ThirdPart, unsigned char FourthPart, int TheIndexInQuestion, int TheCutoff){ return (TheIndexInQuestion < TheCutoff)? ThirdPart[TheIndexInQuestion]: (int)FourthPart[TheIndexInQuestion - TheCutoff]; } // The introduction of "tracking," and "order doesn't matter," has made the basic recursive anagram function look more complex than what is actually go on. // The Governing Equation of The Four Part DTDAWG: NextMarker = CurrentMarker - FCWTEOBL + NCWTEOBL - Flag; WTEOBL means Words to end of branch list. // FC means FirstChild. NC means NextChild, the one we want to go to. int TrackingAnagramWithFourPartDawgArrayRecurse(char RawLetters, int One, long int Two, int Three, unsigned char Four, int BeginFourAt , int CurrentIndex, char TheWordSoFar, int FillThisPosition, char CurrentLetter, int CurrentMarker, int LettersRemaining, int StampingSet, int CurrentTime){ int X; int Y; char TheChosenLetter; long int CurrentOffset; int CurrentChild; int NextMarker = 0; int CurrentNCWTEOBL; int Count = 0;; Bool LettersUsedSoFar[SIZE_OF_CHARACTER_SET]; //printf("We are in anagram recurse. CurrentLetter \|%c\|, CurrentIndex \|%d\|, LettersRemaining \|%d\|, RawLetters \|%s\|, CurrentMarker \|%d\|\n" //, CurrentLetter, CurrentIndex, LettersRemaining, RawLetters, CurrentMarker); for ( X = 0; X < SIZE_OF_CHARACTER_SET; X++ ) LettersUsedSoFar[X] = FALSE; TheWordSoFar[FillThisPosition] = CurrentLetter; if ( One[CurrentIndex] & END_OF_WORD_FLAG ) { TheWordSoFar[FillThisPosition+ 1] = '\0'; printf("#Word\|%d\| Is \|%s\| - Old Stamp \|%d\| - New Stamp \|%d\|\n", CurrentMarker, TheWordSoFar, StampingSet[CurrentMarker], CurrentTime); StampingSet[CurrentMarker] = CurrentTime; NextMarker -= 1; Count += 1; } // If we have zero RawLetters left, then do not waste processor time trying to move further down the graph. if ( LettersRemaining ) { // If the current node has a child list, then go to the RawLetters nodes. if ( CurrentChild = (One[CurrentIndex] & CHILD_MASK) ) { //printf("CurrentChild \|%d\|\n", CurrentChild); // Begin the calculation of the NextMarker; the part that is common to all children. NextMarker += (CurrentMarker - WordsToEndOfBranchList(Three, Four, CurrentChild, BeginFourAt)); for ( X = 0; X < LettersRemaining ; X++ ) { TheChosenLetter = RawLetters[X]; if ( LettersUsedSoFar[CharacterLocations[TheChosenLetter - 'A']] == FALSE ) { if ( CurrentOffset = (Two[(One[CurrentIndex] & OFFSET_INDEX_MASK) >> OffSET_BIT_SHIFT] & ChildLetterBitMasks[TheChosenLetter - 'A']) ) { // If we have made it to this point "TheChosenLetter" has not been tried yet, and it does exist in the next child list CurrentOffset >>= ChildLetterBitShifts[RawLetters[X] - 'A']; CurrentOffset -= 1; //printf("We are using an offset of \|%d\| for \|%c\|", (int)CurrentOffset, TheChosenLetter); CurrentNCWTEOBL = WordsToEndOfBranchList(Three, Four, (CurrentChild + (int)CurrentOffset), BeginFourAt); NextMarker += CurrentNCWTEOBL; // First remove the letter that we are using for the primary position, then pass the remaining letters to the recursive anagrammer. for ( Y = X; Y < LettersRemaining; Y++ ) RawLetters[Y] = RawLetters[Y + 1]; Count += TrackingAnagramWithFourPartDawgArrayRecurse(RawLetters, One, Two, Three, Four, BeginFourAt, (CurrentChild + (int)CurrentOffset) , TheWordSoFar, (FillThisPosition + 1), TheChosenLetter, NextMarker, (LettersRemaining - 1), StampingSet, CurrentTime); NextMarker -= CurrentNCWTEOBL; // Place the letter that we just used, back into TheRawLetters. for ( Y = LettersRemaining; Y > X; Y-- ) RawLetters[Y] = RawLetters[Y - 1]; RawLetters[X] = TheChosenLetter; } LettersUsedSoFar[CharacterLocations[TheChosenLetter - 'A']] = TRUE; } } } } //printf("Out. CurrentLetter \|%c\|\n", CurrentLetter); return Count; } // The characters inside of TheRawLetters must all be contained in the specified character set, but the order does not matter. int TrackingAnagramWithFourPartDawgArray(char TheRawLetters, int First, long int Second, int Third, unsigned char Fourth , int StartOfFourth, int TheTimeStamps, int TheTime){ int X; int Y; int Counter = 0; char TheChosenLetter; int NumberOfRawLetters = strlen(TheRawLetters); Bool LettersThatHaveBeenUsed[SIZE_OF_CHARACTER_SET]; char RunningWord[MAX_WORD_LENGTH +1]; for ( X = 0; X < SIZE_OF_CHARACTER_SET; X++ ) LettersThatHaveBeenUsed[X] = FALSE; //printf("We are in the first Anagram function.\n"); for ( X = 0; X < NumberOfRawLetters ; X++ ) { TheChosenLetter = TheRawLetters[X]; if ( LettersThatHaveBeenUsed[CharacterLocations[TheChosenLetter - 'A']] == FALSE ) { // First remove the letter that we are using for the primary position, then pass the remaining letters to the recursive anagrammer. for ( Y = X; Y < NumberOfRawLetters; Y++ ) TheRawLetters[Y] = TheRawLetters[Y + 1]; printf("Run recurison on letter \|%c\| and pass on \|%s\|.\n", TheChosenLetter, TheRawLetters); Counter += TrackingAnagramWithFourPartDawgArrayRecurse(TheRawLetters, First, Second, Third, Fourth, StartOfFourth , (CharacterLocations[TheChosenLetter - 'A'] + 1), RunningWord, 0, TheChosenLetter, Third[CharacterLocations[TheChosenLetter - 'A'] + 1] , (NumberOfRawLetters - 1), TheTimeStamps, TheTime); // Place the letter that we just used, back into TheRawLetters. for ( Y = NumberOfRawLetters; Y > X; Y-- ) TheRawLetters[Y] = TheRawLetters[Y - 1]; TheRawLetters[X] = TheChosenLetter; LettersThatHaveBeenUsed[CharacterLocations[TheChosenLetter - 'A']] = TRUE; } } return Counter; } int main(){ FILE PartOne = fopen(FOUR_PART_DTDAWG_14_PART_ONE, "rb"); FILE PartTwo = fopen(FOUR_PART_DTDAWG_14_PART_TWO, "rb"); FILE PartThree = fopen(FOUR_PART_DTDAWG_14_PART_THREE, "rb"); FILE PartFour = fopen(FOUR_PART_DTDAWG_14_PART_FOUR, "rb"); int SizeOfPartOne; long int SizeOfPartTwo; int SizeOfPartThree; int SizeOfPartFour; int PartOneArray; long int PartTwoArray; int PartThreeArray; unsigned char PartFourArray; int WhatRoundAreWeOn = 1; int X; int PartThreeFourTransition; char TheUserInput[MAX_INPUT_SIZE + 1]; char Question[MAX_INPUT_SIZE + 1]; char DOM = '\0'; Bool ShouldWeContinue = TRUE; int LexiconTimeStamps[TOTAL_WORDS_IN_LEXICON + 1]; int CurrentCount; for ( X = 0; X <= TOTAL_WORDS_IN_LEXICON; X++ ) LexiconTimeStamps[X] = 0; fread(&SizeOfPartOne, 4, 1, PartOne); fread(&SizeOfPartTwo, 8, 1, PartTwo); fread(&SizeOfPartThree, 4, 1, PartThree); PartThreeFourTransition = SizeOfPartThree + 1; SizeOfPartFour = SizeOfPartOne - SizeOfPartThree; printf("\n"); printf("SizeOfPartOne \|%d\|\n", SizeOfPartOne); printf("SizeOfPartTwo \|%ld\|\n", SizeOfPartTwo); printf("SizeOfPartThree \|%d\|\n", SizeOfPartThree); printf("Transition \|%d\|\n", PartThreeFourTransition); printf("SizeOfPartFour \|%d\|\n", SizeOfPartFour); printf("\n"); PartOneArray = (int )malloc((SizeOfPartOne + 1) * sizeof(int)); PartTwoArray = (long int )malloc(SizeOfPartTwo sizeof(long int)); PartThreeArray = (int )malloc((SizeOfPartThree + 1) sizeof(int)); PartFourArray = (unsigned char )malloc(SizeOfPartFour sizeof(char)); PartOneArray[0] = 0; for ( X = 1; X <= SizeOfPartOne; X++ ) { fread(&(PartOneArray[X]), 4, 1, PartOne); } for ( X = 0; X < SizeOfPartTwo; X++ ) { fread(&(PartTwoArray[X]), 8, 1, PartTwo); } PartThreeArray[0] = 0; for ( X = 1; X <= SizeOfPartThree; X++ ) { fread(&(PartThreeArray[X]), 4, 1, PartThree); } for ( X = 0; X < SizeOfPartFour; X++ ) { fread(&(PartFourArray[X]), 1, 1, PartFour); } fclose(PartOne); fclose(PartTwo); fclose(PartThree); fclose(PartFour); printf("The four data files have been opened and read into memory.\n\n"); while ( ShouldWeContinue == TRUE ) { printf("Type in the letters that you want to anagram:"); fgets(TheUserInput, MAX_INPUT_SIZE, stdin); if ( strlen(TheUserInput) < (MIN_WORD_LENGTH + 1) ) ShouldWeContinue = FALSE; MakeMeAllCapital(TheUserInput); RidMeOfBadCharacters(TheUserInput); if ( strlen(TheUserInput) < MIN_WORD_LENGTH ) ShouldWeContinue = FALSE; if ( ShouldWeContinue == TRUE ) { printf("TheUserInput-\|%s\|-\|%d\|.\n", TheUserInput, (int)strlen(TheUserInput)); while ( !(DOM == 'Y' \|\| DOM == 'y' \|\| DOM == 'N' \|\| DOM == 'n') ) { printf("Do you desire alphabetical order(y/n)?"); fgets(Question, MAX_INPUT_SIZE, stdin); DOM = Question[0]; } if ( DOM == 'Y' \|\| DOM == 'y' ) AlphabetizeMe(TheUserInput); DOM = '\0'; printf("We are now going to perform an anagramming on this \|%s\|.\n\n", TheUserInput); CurrentCount = TrackingAnagramWithFourPartDawgArray(TheUserInput, PartOneArray, PartTwoArray, PartThreeArray , PartFourArray, PartThreeFourTransition, LexiconTimeStamps, WhatRoundAreWeOn); while ( !(DOM == 'Y' \|\| DOM == 'y' \|\| DOM == 'N' \|\| DOM == 'n') ) { printf("\n\|%d\| Words generated in round %d, do you want to quit(y/n)?", CurrentCount, WhatRoundAreWeOn); fgets(Question, MAX_INPUT_SIZE, stdin); DOM = Question[0]; } if ( DOM == 'Y' \|\| DOM == 'y' ) ShouldWeContinue = FALSE; DOM = '\0'; } WhatRoundAreWeOn += 1; } printf("\nYou have chosen to exit the program after \|%d\| rounds.\n\n", (WhatRoundAreWeOn - 2)); return 0; }

Contact Information

Contact: JohnPaul Adamovsky � ( logarithm69@hotmail.com ) - Please feel free to ask me questions.

Phone: (416) 231-7577

This is the mark of my trade...

All The Very Best,

JohnPaul Adamovsky

| Go Back to the Top |

Adamovsky Direct Tracking Directed Acyclic Word Graph - An Advanced Lexicon Data Structure