Nothing but the very best ever.
Now that I have your attention, this lexicon-data-structure is named after my dear aunt, Caroline Furgiuele, for being one of my strongest supporters. Make no mistake, CWG is The World's Most Advanced Lexicon Data Structure.
On Java: It has a praiseworthy graphical-user-interface framework. I'm not a Java guy, but it provides a convenient way to distribute knowledge.
Caroline Word Graph is the culmination of several years of work in lexicon-data-structure optimization. It fills the void between compressed-postfix-Boolean-graphs, and incremental-Trie-hash-functions. In layman's terms, CWG is an aggressively-pruned-graph, with the ability to keep a running-index value for the words it contains. The CWG's encoding presented here, along with its built-in hash-functionality combine to produce unprecedented performance in a lexicon-data-structure.
A common Trie can operate as an incremental-lexicon-hash-function, but it requires lots of memory-space to do so. A traditional DAWG encoding will reduce the space requirement, but in doing so, it exists as a Boolean-graph, scrolling through node lists, and cannot return index-values for the words contained within. CWG starts its life as a Trie, progresses to a traditional DAWG, then reduces its node-count further using an optimized pop-count, and finally takes on the additional power of a (Perfect & Complete)-Incremental-Hash-Function by including an extra Byte per node. A small number of nodes need two extra Bytes, and the 26 root nodes use four Bytes for TWL06.
I invite you to investigate the World's Most Powerful encoding for the 178,691 words in the TWL06 Tournament Scrabble Word List. The 1800 lines of creator-code, presented below, are justified by the dazzling performance and small-size of CWG. This performance will be the most impressive when CWG is used to solve NP-Complete search problems. As long as, on-CPU 6-MOSFET-per-bit static-RAM is in short-supply, smaller lexicon-data-structures are also going to be faster lexicon-data-structures. CWG is the Ferrari of lexicon-data-structures. And just like a hand-crafted, high-performance, and dazzling super-car, CWG beckons you to take it for a ride.
The CWG encoding consists of several basic integer arrays. All told, there are 3 CWG components, stored in 5 arrays. The first two arrays are sufficient for CWG to function as a Boolean-graph. Keep in mind that as a Boolean-graph, CWG is both smaller and faster than a Traditional-DAWG encoding. The remaining three arrays hold the values which transform the Boolean-graph into a (Perfect & Complete)-Incremental-Hash-Function. Three arrays hold the Words-To-End-Of-Branch-List (WTEOBL) values to reduce the space required for this final CWG component.
Six Four-Byte Integers:
1 - | NumberOfWords |
2 - | NumberOfNodes |
3 - | NumberOfListFormats |
4 - | NumberOfRootNodes |
5 - | NumberOfShortWTEOBLs |
6 - | NumberOfUnsignedCharWTEOBLs |
Five Basic-Type Arrays:
1 - | int Array Of Nodes |
2 - | int Array Of List Formats |
3 - | int Array Of Root WTEOBLs |
4 - | short int Array Of WTEOBLs |
5 - | unsigned char Array Of WTEOBLs |
#1) Node-Array: CWG nodes do not hold letter information inside of them. The letter associated with a node is assigned by the parent node used to reach it. Thus each node must contain full-information about its Child-List. Explicit-list-format information for the 26 English letters requires 26 bits, but there are a limited number of Child-list configurations in TWL06, so that only 13 bits are needed to index the available List-Formats. An index value representing the position of the First-Child node is also required, and TWL06 with its 112,125 CWG-Nodes, requires 17 bits to store this value. A Single bit is needed to store an End-Of-Word flag, and 1 bit remains unused. A 32-bit signed-integer is used in the Node-Array, and its format is presented below:
[__32-Bit-Integer CWG-Node Format___]
----------------------------------------------------
[_|0|0000000000000|11111111111111111] - CHILD_MASK
[_|0|1111111111111|00000000000000000] - LIST_FORMAT_INDEX_MASK
[_|1|0000000000000|00000000000000000] - END_OF_WORD_MASK
#2) List-Format-Array: This array holds explicit Child-List information in the form of 26 toggle-bits held inside of 32-bit integers. The patterns present in the English language make this array quite small. TWL06 contains only 6180 unique Child-List-Formats. Testing for the existence of a particular Child-Letter amounts to a bitwise-AND with 2^n for the n'th letter, where 'A' = '0'th letter. Obtaining the offset from the First-Child is calculated by performing a bitwise-AND with 2^n - 1, and then running a jump-table optimized pop-count. Below is an example of how a Child-List is encoded as a 32-Bit integer:
Letters in Child-List: |CEGLNPRSTWZ|
26-Bit Decimal Encoding: |38709332|
26-Bit Binary Encoding: [______|10010011101010100001010100]
#3) Words-To-End-Of-Branch-List-Arrays: In a word-graph with compressed-postfix-structures, END_OF_WORD nodes are able to terminate a large number of unique words, and so unique index numbers are non-trivial to obtain. The core-insights required to understand how to generate the unique index values are presented below:
- A constant number for each node must exist, which binds each node to the number of words, before or after it, within its list structure.
- Nodes within a list need not share a common beginning, thanks to the pruning algorithm, but they must necessarily share a common end node.
- Number-Of-Words-To-End-Of-Branch-List fits the bill for this constant, and a majority of nodes require only "1" Byte to hold this number.
- A simple equation using WTEOBL values will produce an Incremental-Index-Marker during graph traversal.
- When arriving at a particular node, an Index-Marker can be inherited from the path used to reach it.
- The inherited Index-Marker minus "1" will define a unique Index-Value for each word when an END_OF_WORD node is reached.
- The Index-Markers for the 26 root-nodes are just their WTEOBL values. Only these nodes require 4-Bytes for their WTEOBL values.
- Using this scheme, "AA" will hash to "178,690", and "ZZZ" hashes to "0". Correct by subtracting from TotalNumberOfWords.
- These insights lead us to the Governing CWG Hash Equation:
FC_WTEOBL = First-Child Words-To-End-Of-Branch-List
NC_WTEOBL = Next-Child Words-To-End-Of-Branch-List
EOW_Flag = END_OF_WORD_FLAG
- Next_Index_Marker represents the number of words remaining in the graph, assuming a Depth-First-Preorder traversal.
- For TWL06, the WTEOBL-Values are contained in 27 ints, 4821 shorts, and 107,304 unsigned chars. The shorts hold 27 empty values up front, to eliminate a needless subtraction.
If you really want to understand the 29-step process I've developed for creating a CWG, then you must read through the 1800 lines of code published in the C Implementation section. In the current section, I will isolate the critical steps, and proceed to hand-wave the CWG-Creation process at you.
Each Tnode will contain data to streamline the Tnode-Elimination process:
*Some of these values need to be updated while building the initial-Trie, while others are set during the Tnode-Elimination process.
The algorithms needed to create CWG in a reasonable amount of time are complex.
C code:
----------------------------------------------------------------------------
1) Word-List.txt - The TWL06 Tournament-Scrabble-Word-List in a correct format.
2) CWG-Creator.c - The CWG-Creation C code.
3) Justin-CWG-Search.c - The Boolean-search and hash-function demonstration C code.
Java Web-Start code:
----------------------------------------------------------------------------
1) Cwg.java - DAWG functionality code, slanted for use in Java.
2) Interface.java - GUI container, a product of NetBeans.
3) CWG_Data_For_Word-List.dat - TWL06 pre-compiled CWG data file.
4) CWG-Demo.jar - A runnable Jar file via the Eclipse IDE.
5) CWG-Demo.jnlp - Java-Web-Start launching protocol file.
// This program will first compile a traditional DAWG encoding from the "Word-List.txt" file. // Next, data files are written to assist in the "CWG" creation process. // A "Caroline Word Graph" will then be created using the intermediate data files, and stored in "CWG_Data_For_Word-List.dat". // There is a very good reason for why this program is 1800 lines long. The CWG is a perfect and complete hash function for English-Language in TWL06. // 1) "Word-List.txt" is a text file with the number of words written on the very first line, and 1 word per line after that. The words are case-insensitive. // 2) The "CWG" encoding is very sensitive to the size and content of "Word-List.txt", so only minor alterations are guaranteed to work without code analysis. // Include the big-three header files. #include <stdlib.h> #include <stdio.h> #include <string.h> // General high-level program constants. #define MAX 15 #define NUMBER_OF_ENGLISH_LETTERS 26 #define INPUT_LIMIT 30 #define LOWER_IT 32 #define TEN 10 #define BIT_COUNT_FOR_CHILD_INDEX 17 #define EOW_FLAG 0X40000000 #define INT_BITS 32 #define CHILD_MASK 0X0001FFFF #define LIST_FORMAT_INDEX_MASK 0X3FFE0000 #define LIST_FORMAT_BIT_SHIFT 17 #define TRADITIONAL_CHILD_SHIFT 5 #define TRADITIONAL_EOW_FLAG 0X00800000 #define TRADITIONAL_EOL_FLAG 0X00400000 // C requires a boolean variable type so use C's typedef concept to create one. typedef enum { FALSE = 0, TRUE = 1 } Bool; typedef Bool* BoolPtr; // The lexicon text file. #define RAW_LEXICON "Word-List.txt" // This program will create "5" binary-data files for use, and "1" text-data file for inspection. #define TRADITIONAL_DAWG_DATA "Traditional_Dawg_For_Word-List.dat" #define TRADITIONAL_DAWG_TEXT_DATA "Traditional_Dawg_Text_For_Word-List.txt" #define DIRECT_GRAPH_DATA_PART_ONE "Part_One_Direct_Graph_For_Word-List.dat" #define DIRECT_GRAPH_DATA_PART_TWO "Part_Two_Direct_Graph_For_Word-List.dat" #define FINAL_NODES_DATA "Final_Nodes_For_Word-List.dat" // The complete "CWG" graph is stored here. #define CWG_DATA "CWG_Data_For_Word-List.dat" // Lookup tables used for node encoding and number-string decoding. const unsigned int PowersOfTwo[INT_BITS - 1] = { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824 }; const unsigned int PowersOfTen[TEN] = { 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000 }; // Lookup tables used to extract child-offset values. const unsigned char PopCountTable[256] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 }; const int ChildListMasks[NUMBER_OF_ENGLISH_LETTERS] = { 0X1, 0X3, 0X7, 0XF, 0X1F, 0X3F, 0X7F, 0XFF, 0X1FF, 0X3FF, 0X7FF, 0XFFF, 0X1FFF, 0X3FFF, 0X7FFF, 0XFFFF, 0X1FFFF, 0X3FFFF, 0X7FFFF, 0XFFFFF, 0X1FFFFF, 0X3FFFFF, 0X7FFFFF, 0XFFFFFF, 0X1FFFFFF, 0X3FFFFFF }; // This simple function clips off the extra chars for each "fgets()" line. Works for Linux and Windows text format. void CutOffExtraChars(char *ThisLine){ if ( ThisLine[strlen(ThisLine) - 2] == '\r' ) ThisLine[strlen(ThisLine) - 2] = '\0'; else if ( ThisLine[strlen(ThisLine) - 1] == '\n' ) ThisLine[strlen(ThisLine) - 1] = '\0'; } // This is an important function involved in any movement through the CWG. // The "CompleteChildList" is masked according to "LetterPosition", and then a byte-wise pop-count is carried out. ('A' => 0). // The function retruns the corresponding offset for the "LetterPosition"th letter. int ListFormatPopCount(int CompleteChildList, int LetterPosition){ // This jump-table eliminates needless instructions and cumbersome conditional tests. const static void *PositionJumpTable[NUMBER_OF_ENGLISH_LETTERS] = { &&On, &&On, &&On, &&On, &&On, &&On, &&On, &&On, &&Tw, &&Tw, &&Tw, &&Tw, &&Tw, &&Tw, &&Tw, &&Tw, &&Th, &&Th, &&Th, &&Th, &&Th, &&Th, &&Th, &&Th, &&Fo, &&Fo }; int Result = 0; // By casting the agrument variable "CompleteChildList" as an "unsigned char *" we can access each byte within it using simple arithmetic. // Computer-programs in C use "little-endian" byte-order. The least significant bit comes first. unsigned char *ByteZero = (unsigned char *)&CompleteChildList; // Mask "CompleteChildList" according to "LetterPosition". CompleteChildList &= ChildListMasks[LetterPosition]; // Query the "PopCountTable" a minimum number of times. goto *PositionJumpTable[LetterPosition]; Fo: Result += PopCountTable[*(ByteZero + 3)]; Th: Result += PopCountTable[*(ByteZero + 2)]; Tw: Result += PopCountTable[*(ByteZero + 1)]; On: Result += PopCountTable[*ByteZero]; return Result; } // Returns the positive "int" rerpresented by "TheNumberNotYet" string. An invalid "TheNumberNotYet" returns "0". int StringToPositiveInt(char* TheNumberNotYet){ int Result = 0; int X; int Length = strlen(TheNumberNotYet); if ( Length > TEN ) return 0; for ( X = 0; X < Length; X++ ) { if ( TheNumberNotYet[X] < '0' || TheNumberNotYet[X] > '9' ) return 0; Result += ((TheNumberNotYet[X] - '0')*PowersOfTen[Length - X - 1 ]); } return Result; } // The "BinaryNode" string must be at least 32 + 5 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. // This function is used to fill the text file used to inspect the graph created in the first segment of the program. void ConvertIntNodeToBinaryString(int TheNode, char *BinaryNode){ int X; int Bit; BinaryNode[0] = '['; // Bit 31 is not being used. It will always be '0'. BinaryNode[1] = '_'; BinaryNode[2] = '|'; // Bit 30 holds the End-Of-Word, EOW_FLAG. BinaryNode[3] = (TheNode & PowersOfTwo[30])?'1':'0'; BinaryNode[4] = '|'; // 13 Bits, (29-->17) represent the child-format index. Bit = 29; for ( X = 5; X <= 17; X++, Bit-- ) BinaryNode[X] = (TheNode & PowersOfTwo[Bit])?'1':'0'; BinaryNode[18] = '|'; // The "Child" index requires 17 bits, and it will complete the 32 bit int. Bit = 16; for ( X = 19; X <= 35; X++, Bit-- ) BinaryNode[X] = (TheNode & PowersOfTwo[Bit])?'1':'0'; BinaryNode[36] = ']'; BinaryNode[37] = '\0'; } // The "BinaryChildList" string must be at least 32 + 3 + 1 bytes in length. Space for the bits, the seperation pipes, and the end of string char. void ConvertChildListIntToBinaryString(int TheChildList, char *BinaryChildList){ int X; int Bit; BinaryChildList[0] = '['; // Bit 31 to bit 26 remain unused in a "ChildList". for ( X = 1; X <= 6; X++ ) BinaryChildList[X] = '_'; BinaryChildList[7] = '|'; Bit = 25; for ( X = 8; X <= 33; X++, Bit-- ) BinaryChildList[X] = (TheChildList & PowersOfTwo[Bit])?'1':'0'; BinaryChildList[34] = ']'; BinaryChildList[35] = '\0'; } //This Function converts any lower case letters inside "RawWord" to capitals, so that the whole string is made of capital letters. void MakeMeAllCapital(char *RawWord){ unsigned int Count = 0; unsigned int Length = strlen(RawWord); for ( Count = 0; Count < Length; Count++ ) { if ( RawWord[Count] >= 'a' && RawWord[Count] <= 'z' ) RawWord[Count] -= LOWER_IT; } } /*Trie to Dawg TypeDefs*/ struct tnode { struct tnode* Next; struct tnode* Child; struct tnode* ParentalUnit; struct tnode* ReplaceMeWith; // When populating the DAWG array, you must know the index assigned to each "Child". // To get this information, "ArrayIndex" is stored in every node, so that we can mine the information from the reduced pointer-style-trie. int ArrayIndex; char DirectChild; char Letter; char MaxChildDepth; char Level; char NumberOfChildren; char Dangling; char Protected; char EndOfWordFlag; }; typedef struct tnode Tnode; typedef Tnode* TnodePtr; // Functions to access internal "Tnode" members. int TnodeArrayIndex(TnodePtr ThisTnode){ return ThisTnode->ArrayIndex; } char TnodeDirectChild(TnodePtr ThisTnode){ return ThisTnode->DirectChild; } TnodePtr TnodeNext(TnodePtr ThisTnode){ return ThisTnode->Next; } TnodePtr TnodeChild (TnodePtr ThisTnode){ return ThisTnode->Child; } TnodePtr TnodeParentalUnit(TnodePtr ThisTnode){ return ThisTnode->ParentalUnit; } TnodePtr TnodeReplaceMeWith(TnodePtr ThisTnode){ return ThisTnode->ReplaceMeWith; } char TnodeLetter(TnodePtr ThisTnode){ return ThisTnode->Letter; } char TnodeMaxChildDepth(TnodePtr ThisTnode){ return ThisTnode->MaxChildDepth; } char TnodeNumberOfChildren(TnodePtr ThisTnode){ return ThisTnode->NumberOfChildren; } char TnodeEndOfWordFlag(TnodePtr ThisTnode){ return ThisTnode->EndOfWordFlag; } char TnodeLevel(TnodePtr ThisTnode){ return ThisTnode->Level; } char TnodeDangling(TnodePtr ThisTnode){ return ThisTnode->Dangling; } char TnodeProtected(TnodePtr ThisTnode){ return ThisTnode->Protected; } // Allocate a "Tnode" and fill it with initial values. TnodePtr TnodeInit(char Chap, TnodePtr OverOne, char WordEnding, char Leveler, int StarterDepth, TnodePtr Parent, char IsaChild){ TnodePtr Result = (Tnode *)malloc(sizeof(Tnode)); Result->Letter = Chap; Result->ArrayIndex = 0; Result->NumberOfChildren = 0; Result->MaxChildDepth = StarterDepth; Result->Next = OverOne; Result->Child = NULL; Result->ParentalUnit = Parent; Result->Dangling = FALSE; Result->Protected = FALSE; Result->ReplaceMeWith = NULL; Result->EndOfWordFlag = WordEnding; Result->Level = Leveler; Result->DirectChild = IsaChild; return Result; } // Modify internal "Tnode" member values. void TnodeSetArrayIndex(TnodePtr ThisTnode, int TheWhat){ ThisTnode->ArrayIndex = TheWhat; } void TnodeSetChild(TnodePtr ThisTnode, TnodePtr Chump){ ThisTnode->Child = Chump; } void TnodeSetNext(TnodePtr ThisTnode, TnodePtr Nexis){ ThisTnode->Next = Nexis; } void TnodeSetParentalUnit(TnodePtr ThisTnode, TnodePtr Parent){ ThisTnode->ParentalUnit = Parent; } void TnodeSetReplaceMeWith(TnodePtr ThisTnode, TnodePtr Living){ ThisTnode->ReplaceMeWith = Living; } void TnodeSetMaxChildDepth(TnodePtr ThisTnode, int NewDepth){ ThisTnode->MaxChildDepth = NewDepth; } void TnodeSetDirectChild(TnodePtr ThisTnode, char Status){ ThisTnode->DirectChild = Status; } void TnodeFlyEndOfWordFlag(TnodePtr ThisTnode){ ThisTnode->EndOfWordFlag = TRUE; } // This function Dangles a node, but also recursively dangles every node under it as well. // Dangling a "Tnode" means that it will be exculded from the "DAWG". // By recursion, nodes that are not direct children will get dangled. // The function returns the total number of nodes dangled as a result. int TnodeDangle(TnodePtr ThisTnode){ if ( ThisTnode->Dangling == TRUE ) return 0; int Result = 0; if ( (ThisTnode->Next) != NULL ) Result += TnodeDangle(ThisTnode->Next); if ( (ThisTnode->Child) != NULL ) Result += TnodeDangle(ThisTnode->Child); ThisTnode->Dangling = TRUE; Result += 1; return Result; } // This function "Protects" a node being directly referenced in the elimination process. // "Protected" nodes can be "Dangled", but special precautions need to be taken to ensure graph-integrity. void TnodeProtect(TnodePtr ThisTnode){ ThisTnode->Protected = TRUE; } // This function removes "Protection" status from "ThisNode". void TnodeUnProtect(TnodePtr ThisTnode){ ThisTnode->Protected = FALSE; } // This function returns a Boolean value indicating if a node coming after "ThisTnode" is "Protected". // The Boolean argument "CheckThisNode" determines if "ThisTnode" is included in the search. // A "Tnode" being eliminated with "Protected" nodes beneath it requires special precautions. Bool TnodeProtectionCheck(TnodePtr ThisTnode, Bool CheckThisTnode){ if ( ThisTnode == NULL ) return FALSE; if ( CheckThisTnode == TRUE ) if ( ThisTnode->Protected == TRUE ) return TRUE; if ( TnodeProtectionCheck(ThisTnode->Next, TRUE) == TRUE ) return TRUE; if ( TnodeProtectionCheck(ThisTnode->Child, TRUE) == TRUE ) return TRUE; return FALSE; } // This function returns the pointer to the Tnode in a parallel list of nodes with the letter "ThisLetter", and returns NULL if the Tnode does not exist. // If the function returns NULL, then an insertion is required. TnodePtr TnodeFindParaNode(TnodePtr ThisTnode, char ThisLetter){ if ( ThisTnode == NULL ) return NULL; TnodePtr Result = ThisTnode; if ( Result->Letter == ThisLetter ) return Result; while ( Result->Letter < ThisLetter ) { Result = Result->Next; if ( Result == NULL ) return NULL; } if ( Result->Letter == ThisLetter ) return Result; else return NULL; } // This function inserts a new Tnode before a larger letter node or at the end of a para list If the list does not esist then it is put at the beginnung. // The new node has ThisLetter in it. AboveTnode is the node 1 level above where the new node will be placed. // This function should never be passed a "NULL" pointer. "ThisLetter" should never exist in the "Child" para-list. void TnodeInsertParaNode(TnodePtr AboveTnode, char ThisLetter, char WordEnder, int StartDepth){ AboveTnode->NumberOfChildren += 1; TnodePtr Holder = NULL; TnodePtr Currently = NULL; // Case 1: ParaList does not exist yet so start it. if ( AboveTnode->Child == NULL ) AboveTnode->Child = TnodeInit(ThisLetter, NULL, WordEnder, AboveTnode->Level + 1, StartDepth, AboveTnode, TRUE); // Case 2: ThisLetter should be the first in the ParaList. else if ( ((AboveTnode->Child)->Letter) > ThisLetter ) { Holder = AboveTnode->Child; // The holder node is no longer a direct child so set it as such. TnodeSetDirectChild(Holder, FALSE); AboveTnode->Child = TnodeInit(ThisLetter, Holder, WordEnder, AboveTnode->Level + 1, StartDepth, AboveTnode, TRUE); // The parent node needs to be changed on what used to be the child. it is the Tnode in "Holder". Holder->ParentalUnit = AboveTnode->Child; } // Case 3: The ParaList exists and ThisLetter is not first in the list. else if ( (AboveTnode->Child)->Letter < ThisLetter ) { Currently = AboveTnode->Child; while ( Currently->Next !=NULL ) { if ( TnodeLetter(Currently->Next) > ThisLetter ) break; Currently = Currently->Next; } Holder = Currently->Next; Currently->Next = TnodeInit(ThisLetter, Holder, WordEnder, AboveTnode->Level + 1, StartDepth, Currently, FALSE); if ( Holder != NULL ) Holder->ParentalUnit = Currently->Next; } } // The "MaxChildDepth" of the two nodes cannot be assumed equal due to the recursive nature of this function, so we must check for equivalence. char TnodeAreWeTheSame(TnodePtr FirstNode, TnodePtr SecondNode){ if ( FirstNode == SecondNode ) return TRUE; if ( FirstNode == NULL || SecondNode == NULL ) return FALSE; if ( FirstNode->Letter != SecondNode->Letter ) return FALSE; if ( FirstNode->MaxChildDepth != SecondNode->MaxChildDepth ) return FALSE; if ( FirstNode->NumberOfChildren != SecondNode->NumberOfChildren ) return FALSE; if ( FirstNode->EndOfWordFlag != SecondNode->EndOfWordFlag ) return FALSE; if ( TnodeAreWeTheSame(FirstNode->Child, SecondNode->Child) == FALSE ) return FALSE; if ( TnodeAreWeTheSame(FirstNode->Next, SecondNode->Next) == FALSE ) return FALSE; else return TRUE; } struct dawg { int NumberOfTotalWords; int NumberOfTotalNodes; TnodePtr First; }; typedef struct dawg Dawg; typedef Dawg* DawgPtr; // Set up the parent nodes in the Dawg. DawgPtr DawgInit(void){ DawgPtr Result = (Dawg *)malloc(sizeof(Dawg)); Result->NumberOfTotalWords = 0; Result->NumberOfTotalNodes = 0; Result->First = TnodeInit('0', NULL, FALSE, 0, 0, NULL, FALSE); return Result; } // This function is responsible for adding "Word" to the "Dawg" under its root node. It returns the number of new nodes inserted. int TnodeDawgAddWord(TnodePtr ParentNode, const char *Word){ int Result = 0; int X, Y = 0; int WordLength = strlen(Word); TnodePtr HangPoint = NULL; TnodePtr Current = ParentNode; for ( X = 0; X < WordLength; X++){ HangPoint = TnodeFindParaNode(TnodeChild(Current), Word[X]); if ( HangPoint == NULL ) { TnodeInsertParaNode(Current, Word[X], (X == WordLength - 1 ? TRUE : FALSE), WordLength - X - 1); Result++; Current = TnodeFindParaNode(TnodeChild(Current), Word[X]); for ( Y = X + 1; Y < WordLength; Y++ ) { TnodeInsertParaNode(Current, Word[Y], (Y == WordLength - 1 ? TRUE : FALSE), WordLength - Y - 1); Result += 1; Current = TnodeChild(Current); } break; } else { if ( TnodeMaxChildDepth(HangPoint) < WordLength - X - 1 ) TnodeSetMaxChildDepth(HangPoint, WordLength - X - 1); } Current = HangPoint; // The path for the word that we are trying to insert already exists, so just make sure that the end flag is flying on the last node. // This should never happen if we are to add words in increasing word length. if ( X == WordLength - 1 ) TnodeFlyEndOfWordFlag(Current); } return Result; } // Add "NewWord" to "ThisDawg", which at this point is a "Trie" with a lot of information in each node. // "NewWord" must not exist in "ThisDawg" already. void DawgAddWord(DawgPtr ThisDawg, char * NewWord){ ThisDawg->NumberOfTotalWords += 1; int NodesAdded = TnodeDawgAddWord(ThisDawg->First, NewWord); ThisDawg->NumberOfTotalNodes += NodesAdded; } // This is a standard depth first preorder tree traversal. // The objective is to count "Living" "Tnodes" of various "MaxChildDepths", and store these values into "Tabulator". void TnodeGraphTabulateRecurse(TnodePtr ThisTnode, int Level, int* Tabulator){ if ( Level == 0 ) TnodeGraphTabulateRecurse(TnodeChild(ThisTnode), 1, Tabulator); else { // We will only ever be concerned with "Living" nodes. "Dangling" Nodes will be eliminated, so don't count them. if ( ThisTnode->Dangling == FALSE ) { Tabulator[ThisTnode->MaxChildDepth] += 1; // Go Down if possible. if ( ThisTnode->Child != NULL ) TnodeGraphTabulateRecurse(TnodeChild(ThisTnode), Level + 1, Tabulator); // Go Right if possible. if ( ThisTnode->Next != NULL ) TnodeGraphTabulateRecurse(TnodeNext(ThisTnode), Level, Tabulator); } } } // Count the "Living" "Tnode"s of each "MaxChildDepth" in "ThisDawg", and store the values in "Count". void DawgGraphTabulate(DawgPtr ThisDawg, int* Count){ int Numbers[MAX]; int X = 0; for ( X = 0; X < MAX; X++ ) Numbers[X] = 0; if ( ThisDawg->NumberOfTotalWords > 0 ) { TnodeGraphTabulateRecurse(ThisDawg->First, 0, Numbers); for ( X = 0; X < MAX; X++ ) { Count[X] = Numbers[X]; } } } // This function can never be called after a trie has been "Mowed" because this will result in pointers being freed twice. // A core dump is bad. void FreeTnodeRecurse(TnodePtr ThisTnode){ if ( ThisTnode->Child != NULL ) FreeTnodeRecurse(ThisTnode->Child); if ( ThisTnode->Next != NULL ) FreeTnodeRecurse(ThisTnode->Next); free(ThisTnode); } // This function can never be called after a trie has been "Mowed" because this will result in pointers being freed twice. // A core dump is bad. void FreeDawg(DawgPtr ThisDawg){ if ( ThisDawg->NumberOfTotalWords > 0 ) { FreeTnodeRecurse(ThisDawg->First); } free(ThisDawg); } // Recursively replaces all redundant nodes under "ThisTnode". // "DirectChild" "Tnode" in a "Dangling" state have "ReplaceMeWith" set within them. // Crosslinking of "Tnode"s being eliminated must be taken-care-of before this function gets called. // When "Tnode" branches are crosslinked, the living branch has members being replaced with "Tnode"s in the branch being killed. void TnodeLawnMowerRecurse(TnodePtr ThisTnode){ if ( ThisTnode->Level == 0 ) TnodeLawnMowerRecurse(ThisTnode->Child); else { if ( ThisTnode->Next == NULL && ThisTnode->Child == NULL ) return; // The first "Tnode" being eliminated will always be a "DirectChild". if ( ThisTnode->Child != NULL ) { // The node is tagged to be "Mowed", so replace it with "ReplaceMeWith", keep following "ReplaceMeWith" until an un"Dangling" "Tnode" is found. if ( (ThisTnode->Child)->Dangling == TRUE ) { ThisTnode->Child = TnodeReplaceMeWith(ThisTnode->Child); while ( (ThisTnode->Child)->Dangling == TRUE ) ThisTnode->Child = TnodeReplaceMeWith(ThisTnode->Child); } else { TnodeLawnMowerRecurse(ThisTnode->Child); } } if ( ThisTnode->Next != NULL ){ TnodeLawnMowerRecurse(ThisTnode->Next); } } } // Replaces all pointers to "Dangling" "Tnodes" in the "ThisDawg" Trie with living ones. void DawgLawnMower(DawgPtr ThisDawg){ TnodeLawnMowerRecurse(ThisDawg->First); } // This function accepts two identical node branch structures, "EliminateThis" and "ReplaceWith". It is recursive. // The Boolean "ValidReplacement" determines if the function will check for "Crosslink"s or alter "Protected" and "ReplaceMeWith" states. // When "ValidReplacement" is ON, it must set "ReplaceMeWith" values for "Protected" nodes under "EliminateThis". // It will also assign "Protected" status to the corresponding nodes under "ReplaceWith". // This function returns "FALSE" if a "Crosslink" is found. It returns "TRUE" if replacement is a GO. Bool TnodeProtectAndReplaceBranchStructure(TnodePtr EliminateThis, TnodePtr ReplaceWith, Bool ValidReplacement){ if ( EliminateThis == NULL || ReplaceWith == NULL) return TRUE; if ( TnodeProtected(EliminateThis) == TRUE ) { if ( TnodeDangling(ReplaceWith) == TRUE ) { // Though we verify the "Crosslink" condition for confirmation, the two conditions above guarantee the condition below. if ( EliminateThis == TnodeReplaceMeWith(ReplaceWith) ) { // In the case of a confirmed "Crosslink", "EliminateThis" will be a "DirectChild". // The logic is that "EliminateThis" and "ReplaceWith" have the same structure. // Since "ReplaceWith" is "Dangling" with a valid "ReplaceMeWith", it is a "DirectChild". // Thus "EliminateThis" must also be a "DirectChild". Verify this. printf("\n -***- Crosslink found, so exchange branches first, then Dangle the unProtected Tnodes.\n"); // Resort to drastic measures: Simply flip the actual nodes in the Trie. (ReplaceWith->ParentalUnit)->Child = EliminateThis; (EliminateThis->ParentalUnit)->Child = ReplaceWith; return FALSE; } } // When "ValidReplacement" is set, if "EliminateThis" is a protected "Tnode", "ReplaceWith" isn't "Dangling". // We can "Dangle" "Protected" "Tnodes", as long as we set a proper "ReplaceMeWith" value for them. // Since we are now pointing "Tnode"s at "ReplaceWith", protect it. if ( ValidReplacement ) { TnodeSetReplaceMeWith(EliminateThis, ReplaceWith); TnodeProtect(ReplaceWith); } } if ( TnodeProtectAndReplaceBranchStructure(EliminateThis->Next, ReplaceWith->Next, ValidReplacement) == FALSE ) return FALSE; if ( TnodeProtectAndReplaceBranchStructure(EliminateThis->Child, ReplaceWith->Child, ValidReplacement) == TRUE ) return TRUE; else return FALSE; } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // A queue is required for breadth first traversal, and the rest is self-evident. struct breadthqueuenode { TnodePtr Element; struct breadthqueuenode *Next; }; typedef struct breadthqueuenode BreadthQueueNode; typedef BreadthQueueNode* BreadthQueueNodePtr; void BreadthQueueNodeSetNext(BreadthQueueNodePtr ThisBreadthQueueNode, BreadthQueueNodePtr Nexit){ ThisBreadthQueueNode->Next = Nexit; } BreadthQueueNodePtr BreadthQueueNodeNext(BreadthQueueNodePtr ThisBreadthQueueNode){ return ThisBreadthQueueNode->Next; } TnodePtr BreadthQueueNodeElement(BreadthQueueNodePtr ThisBreadthQueueNode){ return ThisBreadthQueueNode->Element; } BreadthQueueNodePtr BreadthQueueNodeInit(TnodePtr NewElement){ BreadthQueueNodePtr Result = (BreadthQueueNode *)malloc(sizeof(BreadthQueueNode)); Result->Element = NewElement; Result->Next = NULL; return Result; } struct breadthqueue { BreadthQueueNodePtr Front; BreadthQueueNodePtr Back; int Size; }; typedef struct breadthqueue BreadthQueue; typedef BreadthQueue* BreadthQueuePtr; BreadthQueuePtr BreadthQueueInit(void){ BreadthQueuePtr Result = (BreadthQueue *)malloc(sizeof(BreadthQueue)); Result->Front = NULL; Result->Back = NULL; Result->Size = 0; } void BreadthQueuePush(BreadthQueuePtr ThisBreadthQueue, TnodePtr NewElemental){ BreadthQueueNodePtr Noob = BreadthQueueNodeInit(NewElemental); if ( (ThisBreadthQueue->Back) != NULL ) BreadthQueueNodeSetNext(ThisBreadthQueue->Back, Noob); else ThisBreadthQueue->Front = Noob; ThisBreadthQueue->Back = Noob; (ThisBreadthQueue->Size) += 1; } TnodePtr BreadthQueuePop(BreadthQueuePtr ThisBreadthQueue){ if ( ThisBreadthQueue->Size == 0 ) return NULL; if ( ThisBreadthQueue->Size == 1 ) { ThisBreadthQueue->Back = NULL; ThisBreadthQueue->Size = 0; TnodePtr Result = (ThisBreadthQueue->Front)->Element; free(ThisBreadthQueue->Front); ThisBreadthQueue->Front = NULL; return Result; } TnodePtr Result = (ThisBreadthQueue->Front)->Element; BreadthQueueNodePtr Holder = ThisBreadthQueue->Front; ThisBreadthQueue->Front = (ThisBreadthQueue->Front)->Next; free(Holder); ThisBreadthQueue->Size -= 1; return Result; } // For the "Tnode" "Dangling" process, arrange the "Tnodes" in the "Holder" array, with breadth-first traversal order. void BreadthQueuePopulateReductionArray(BreadthQueuePtr ThisBreadthQueue, TnodePtr Root, TnodePtr **Holder){ int Taker[MAX]; int X = 0; memset(Taker, 0, MAX*sizeof(int)); int CurrentMaxChildDepth = 0; TnodePtr Current = Root; // Push the first row onto the queue. while ( Current != NULL ) { BreadthQueuePush(ThisBreadthQueue, Current); Current = Current->Next; } // Initiate the pop followed by push all children loop. while ( (ThisBreadthQueue->Size) != 0 ) { Current = BreadthQueuePop(ThisBreadthQueue); CurrentMaxChildDepth = Current->MaxChildDepth; Holder[CurrentMaxChildDepth][Taker[CurrentMaxChildDepth]] = Current; Taker[CurrentMaxChildDepth] += 1; Current = TnodeChild(Current); while ( Current != NULL ) { BreadthQueuePush(ThisBreadthQueue, Current); Current = TnodeNext(Current); } } } // It is of absolutely critical importance that only "DirectChild" nodes are pushed onto the queue as child nodes. This will not always be the case. // In a DAWG a child pointer may point to an internal node in a longer list. Check for this. int BreadthQueueUseToIndex(BreadthQueuePtr ThisBreadthQueue, TnodePtr Root){ int IndexNow = 0; TnodePtr Current = Root; // Push the first row onto the queue. while ( Current != NULL ) { BreadthQueuePush(ThisBreadthQueue, Current); Current = Current->Next; } // Pop each element off of the queue and only push its children if has not been "Dangled" yet. Assign index if one has not been given to it yet. while ( (ThisBreadthQueue->Size) != 0 ) { Current = BreadthQueuePop(ThisBreadthQueue); // A traversal of the Trie will never land on "Dangling" "Tnodes", but it will try to visit certain "Tnodes" many times. if ( TnodeArrayIndex(Current) == 0 ) { IndexNow += 1; TnodeSetArrayIndex(Current, IndexNow); Current = TnodeChild(Current); if ( Current != NULL ) { // The graph will lead to intermediate positions, but we cannot start numbering "Tnodes" from the middle of a list. if ( TnodeDirectChild(Current) == TRUE && TnodeArrayIndex(Current) == 0 ) { while ( Current != NULL ) { BreadthQueuePush(ThisBreadthQueue, Current); Current = Current->Next; } } } } } return IndexNow; } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // Next and Child become indices. struct arraydnode{ int Next; int Child; char Letter; char EndOfWordFlag; char Level; }; typedef struct arraydnode ArrayDnode; typedef ArrayDnode* ArrayDnodePtr; void ArrayDnodeInit(ArrayDnodePtr ThisArrayDnode, char Chap, int Nextt, int Childd, char EndingFlag, char Breadth){ ThisArrayDnode->Letter = Chap; ThisArrayDnode->EndOfWordFlag = EndingFlag; ThisArrayDnode->Next = Nextt; ThisArrayDnode->Child = Childd; ThisArrayDnode->Level = Breadth; } void ArrayDnodeTnodeTranspose(ArrayDnodePtr ThisArrayDnode, TnodePtr ThisTnode){ ThisArrayDnode->Letter = ThisTnode->Letter; ThisArrayDnode->EndOfWordFlag = ThisTnode->EndOfWordFlag; ThisArrayDnode->Level = ThisTnode->Level; if ( ThisTnode->Next == NULL ) ThisArrayDnode->Next = 0; else ThisArrayDnode->Next = (ThisTnode->Next)->ArrayIndex; if ( ThisTnode->Child == NULL ) ThisArrayDnode->Child = 0; else ThisArrayDnode->Child = (ThisTnode->Child)->ArrayIndex; } int ArrayDnodeNext(ArrayDnodePtr ThisArrayDnode){ return ThisArrayDnode->Next; } int ArrayDnodeChild (ArrayDnodePtr ThisArrayDnode){ return ThisArrayDnode->Child; } char ArrayDnodeLetter(ArrayDnodePtr ThisArrayDnode){ return ThisArrayDnode->Letter; } char ArrayDnodeEndOfWordFlag(ArrayDnodePtr ThisArrayDnode){ return ThisArrayDnode->EndOfWordFlag; } int ArrayDnodeNumberOfChildrenPlusString(ArrayDnodePtr DoggieDog, int Index, char* FillThisString){ int X; int CurrentArrayPosition; if ( (DoggieDog[Index]).Child == 0 ) { FillThisString[0] = '\0'; return 0; } CurrentArrayPosition = (DoggieDog[Index]).Child; for ( X = 0; X < NUMBER_OF_ENGLISH_LETTERS; X++ ) { FillThisString[X] = (DoggieDog[CurrentArrayPosition]).Letter; if ( (DoggieDog[CurrentArrayPosition]).Next == 0 ) { FillThisString[X + 1] = '\0'; return (X + 1); } CurrentArrayPosition += 1; } } struct arraydawg { int NumberOfStrings; ArrayDnodePtr DawgArray; int First; char MinStringLength; char MaxStringLength; }; typedef struct arraydawg ArrayDawg; typedef ArrayDawg* ArrayDawgPtr; ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// int CrossLinkCount = 0; // This function is the core of the dawg creation procedure. Pay close attention to the order of the steps involved. ArrayDawgPtr ArrayDawgInit(char **Dictionary, int *SegmentLenghts, int MaxStringLength){ int X = 0; int Y = 0; int *ChildCount; char *ChildStrings; printf("Step 0 - Allocate the framework for the intermediate Array-Data-Structure.\n"); // Dynamically allocate the upper Data-Structure. ArrayDawgPtr Result = (ArrayDawgPtr)malloc(sizeof(ArrayDawg)); // set MinStringLength, MaxStringLength, and NumberOfStrings. while ( SegmentLenghts[X] == 0 ) X++; Result->MinStringLength = X; Result->MaxStringLength = MaxStringLength; Result->NumberOfStrings = 0; for ( X = Result->MinStringLength; X <= Result->MaxStringLength ; X++ ) Result->NumberOfStrings += SegmentLenghts[X]; printf("\nStep 1 - Create a Temporary-Working-Trie and begin filling it with the |%d| words.\n", Result->NumberOfStrings); /// Create a Temp Trie structure and then feed in the given dictionary. DawgPtr TemporaryTrie = DawgInit(); for ( Y = Result->MinStringLength; Y <= Result->MaxStringLength; Y++ ) { for ( X = 0; X < SegmentLenghts[Y]; X++ ) { DawgAddWord(TemporaryTrie, &(Dictionary[Y][(Y + 1)*X])); } } printf("\nStep 2 - Finished adding words to the Temporary-Working-Trie.\n"); // Allocate two "Tnode" counter arrays. int *NodeNumberCounter = (int*)calloc((Result->MaxStringLength), sizeof(int)); int *NodeNumberCounterInit = (int*)calloc((Result->MaxStringLength), sizeof(int)); // Count up the number of "Tnode"s in the Raw-Trie according to MaxChildDepth. printf("\nStep 3 - Count the total number of Tnodes in the Raw-Trie according to MaxChildDepth.\n"); DawgGraphTabulate(TemporaryTrie, NodeNumberCounter); printf("\nStep 4 - Initial Tnode counting is complete, so display results:\n\n"); int TotalNodeSum = 0; for ( X = 0; X < Result->MaxStringLength; X++ ){ NodeNumberCounterInit[X] = NodeNumberCounter[X]; TotalNodeSum += NodeNumberCounter[X]; } for ( X = 0; X < Result->MaxStringLength; X++ ){ printf(" Initial Tnode Count For MaxChildDepth =|%2d| is |%6d|\n", X, NodeNumberCounterInit[X]); } printf("\n Total Tnode Count For The Raw-Trie = |%d| \n", TotalNodeSum); // We will have exactly enough space for all of the Tnode pointers. printf("\nStep 5 - Allocate a 2 dimensional array of Tnode pointers to search for redundant Tnodes.\n"); TnodePtr ** HolderOfAllTnodePointers = (TnodePtr **)malloc((Result->MaxStringLength)*sizeof(TnodePtr *)); for ( X = 0; X < MAX; X++ ) HolderOfAllTnodePointers[X] = (TnodePtr *)malloc(NodeNumberCounterInit[X]*sizeof(TnodePtr)); // A breadth-first traversal is used when populating the final array. // It is then much more likely for living "Tnode"s to appear first, if we fill "HolderOfAllTnodePointers" breadth first. printf("\nStep 6 - Populate the 2 dimensional Tnode pointer array.\n"); // Use a breadth first traversal to populate the "HolderOfAllTnodePointers" array. BreadthQueuePtr Populator = BreadthQueueInit(); BreadthQueuePopulateReductionArray(Populator, (TemporaryTrie->First)->Child, HolderOfAllTnodePointers); printf("\nStep 7 - Population complete.\n"); // Flag all of the reduntant "Tnode"s, and store a "ReplaceMeWith" "Tnode" reference inside the "Dangling" "Tnode"s. // Flagging requires the "TnodeAreWeTheSame()" function, and beginning with the highest "MaxChildDepth" "Tnode"s will reduce the processing time. int NumberDangled = 0; int DangledNow; int NumberAtHeight; int TotalDangled = 0; int W = 0; // keep track of the number of nodes of each MaxChildDepth dangled recursively so we can check how many remaining nodes we need for the optimal array. int DangleCount[Result->MaxStringLength]; for ( X = 0; X < Result->MaxStringLength; X++) DangleCount[X] = 0; printf("\nStep 8 - Tag redundant Tnodes as Dangling - Use recursion, because only DirectChild Tnodes are considered for elimination:\n"); printf("\n This procedure is at the very heart of the CWG creation alogirthm, and it would be much slower, without heavy optimization.\n"); printf("\n ---------------------------------------------------------------------------------------------------------------------------\n"); // *** Test the other way. Start at the largest "MaxChildDepth" and work down from there for recursive reduction to take place. for ( Y = Result->MaxStringLength - 1; Y >= 0; Y-- ) { NumberDangled = 0; NumberAtHeight = 0; // "X" is the index of the node we are trying to kill. for ( X = NodeNumberCounterInit[Y] - 1; X >= 0; X-- ) { // If the node is "Dangling" already, or it is not a "DirectChild", then "continue". if ( TnodeDangling(HolderOfAllTnodePointers[Y][X]) == TRUE ) continue; if ( TnodeDirectChild(HolderOfAllTnodePointers[Y][X]) == FALSE ) continue; // Make sure that we don't emiminate "Tnodes" being pointed to by other "Tnodes" in the graph. // This is a tricky procedure because node beneath "X" can be "Protected". // "W" will be the index of the first undangled "Tnode" with the same structure, if one exists. for ( W = 0; W < NodeNumberCounterInit[Y]; W++ ) { if ( W == X ) continue; if ( TnodeDangling(HolderOfAllTnodePointers[Y][W]) == FALSE ) { if ( TnodeAreWeTheSame(HolderOfAllTnodePointers[Y][X], HolderOfAllTnodePointers[Y][W]) == TRUE ) { // In the special case where the node being "Dangled" has "Protected" nodes beneath it, more needs to be done. // When we "Dangle" a "Protected" "Tnode", we must set it's "ReplaceMeWith", and a recursive function is needed for this special case. // This construct deals with regular and "Crosslink"ed branch structures. // It happens when "Protected" "Tnodes come beneath the one we want to "Dangle". if ( TnodeProtectionCheck(HolderOfAllTnodePointers[Y][X], FALSE) == TRUE ) { while ( !TnodeProtectAndReplaceBranchStructure(HolderOfAllTnodePointers[Y][X], HolderOfAllTnodePointers[Y][W], FALSE) ) ; TnodeProtectAndReplaceBranchStructure(HolderOfAllTnodePointers[Y][X], HolderOfAllTnodePointers[Y][W], TRUE); } // Set the "Protected" and "ReplaceMeWith" status of the corresponding top-level "Tnode"s. TnodeProtect(HolderOfAllTnodePointers[Y][W]); TnodeSetReplaceMeWith(HolderOfAllTnodePointers[Y][X], HolderOfAllTnodePointers[Y][W]); // "Dangle" all nodes under "HolderOfAllTnodePointers[Y][X]", and update the "Dangle" counters. NumberAtHeight += 1; DangledNow = TnodeDangle(HolderOfAllTnodePointers[Y][X]); NumberDangled += DangledNow; break; } } } } printf(" Dangled |%5d| Tnodes, and |%5d| Tnodes In all, through recursion, for MaxChildDepth of |%2d|\n", NumberAtHeight, NumberDangled, Y); DangleCount[Y] = NumberDangled; TotalDangled += NumberDangled; } printf(" ---------------------------------------------------------------------------------------------------------------------------\n"); int NumberOfLivingNodes; printf("\n |%6d| = Original # of Tnodes.\n", TotalNodeSum); printf(" |%6d| = Dangled # of Tnodes.\n", TotalDangled); printf(" |%6d| = Remaining # of Tnodes.\n", NumberOfLivingNodes = TotalNodeSum - TotalDangled); printf("\nStep 9 - Count the number of living Tnodes by traversing the Raw-Trie to check the Dangling numbers.\n\n"); DawgGraphTabulate(TemporaryTrie, NodeNumberCounter); for ( X = 0; X < Result->MaxStringLength; X++ ) { printf(" New count for MaxChildDepth |%2d| Tnodes is |%5d|. Tnode count was |%6d| in Raw-Trie pre-Dangling. Killed |%6d| Tnodes.\n" , X, NodeNumberCounter[X], NodeNumberCounterInit[X], NodeNumberCounterInit[X] - NodeNumberCounter[X]); } int TotalDangledCheck = 0; for ( X = 0; X < MAX; X++ ) { TotalDangledCheck += (NodeNumberCounterInit[X] - NodeNumberCounter[X]); } if ( TotalDangled == TotalDangledCheck ) printf("\n Tnode Dangling count is consistent.\n"); else printf("\n MISMATCH for Tnode Dangling count.\n"); printf("\nStep 9.5 - Run a final check to verify that all redundant nodes have been Dangled.\n\n"); for ( Y = Result->MaxStringLength - 1; Y >= 0; Y-- ) { NumberAtHeight = 0; // "X" is the index of the node we are trying to kill. for ( X = NodeNumberCounterInit[Y] - 1; X >= 0; X-- ) { // If the node is "Dangling" already, or it is not a "DirectChild", then "continue". if ( TnodeDangling(HolderOfAllTnodePointers[Y][X]) == TRUE ) continue; if ( TnodeDirectChild(HolderOfAllTnodePointers[Y][X]) == FALSE ) continue; // "W" will be the index of the first undangled "Tnode" with the same structure, if one slipped through the cracks. for ( W = 0; W < NodeNumberCounterInit[Y]; W++ ) { if ( W == X ) continue; if ( TnodeDangling(HolderOfAllTnodePointers[Y][W]) == FALSE ) { if ( TnodeAreWeTheSame(HolderOfAllTnodePointers[Y][X], HolderOfAllTnodePointers[Y][W]) == TRUE ) { NumberAtHeight += 1; break; } } } } printf(" MaxChildDepth |%2d| - Identical living nodes found = |%2d|.\n", Y, NumberAtHeight); } printf("\nstep 10 - Kill the Dangling Tnodes using the internal \"ReplaceMeWith\" values.\n"); // Node replacement has to take place before indices are set up so nothing points to redundant nodes. // - This step is absolutely critical. Mow The Lawn so to speak! Then Index. DawgLawnMower(TemporaryTrie); printf("\n Killing complete.\n"); printf("\nStep 11 - Dawg-Lawn-Mowing is now complete, so assign array indexes to all living Tnodes using a Breadth-First-Queue.\n"); BreadthQueuePtr OrderMatters = BreadthQueueInit(); // The Breadth-First-Queue must assign an index value to each living "Tnode" only once. // "HolderOfAllTnodePointers[MAX - 1][0]" becomes the first node in the new DAWG array. int IndexCount = BreadthQueueUseToIndex(OrderMatters, HolderOfAllTnodePointers[MAX - 1][0]); printf("\n Index assignment is now complete.\n"); printf("\n |%d| = NumberOfLivingNodes from after the Dangling process.\n", NumberOfLivingNodes); printf(" |%d| = IndexCount from the breadth-first assignment function.\n", IndexCount); // Allocate the space needed to store the "DawgArray". Result->DawgArray = (ArrayDnodePtr)calloc((NumberOfLivingNodes + 1), sizeof(ArrayDnode)); int IndexFollow = 0; int IndexFollower = 0; int TransposeCount = 0; // Roll through the pointer arrays and use the "ArrayDnodeTnodeTranspose" function to populate it. // Set the dummy entry at the beginning of the array. ArrayDnodeInit(&(Result->DawgArray[0]), 0, 0, 0, 0, 0); Result->First = 1; printf("\nStep 12 - Populate the new Working-Array-Dawg structure, which is used to verify validity and create the final integer-graph-encodings.\n"); // Scroll through "HolderOfAllTnodePointers" and look for un"Dangling" "Tnodes", if so then transpose them into "Result->DawgArray". for ( X = Result->MaxStringLength - 1; X >= 0; X-- ) { for (W = 0; W < NodeNumberCounterInit[X]; W++ ) { if ( TnodeDangling(HolderOfAllTnodePointers[X][W]) == FALSE ) { IndexFollow = TnodeArrayIndex(HolderOfAllTnodePointers[X][W]); ArrayDnodeTnodeTranspose(&(Result->DawgArray[IndexFollow]), HolderOfAllTnodePointers[X][W]); TransposeCount += 1; if ( IndexFollow > IndexFollower ) IndexFollower = IndexFollow; } } } printf("\n |%d| = IndexFollower, which is the largest index assigned in the Working-Array-Dawg.\n", IndexFollower); printf(" |%d| = TransposeCount, holds the number of Tnodes transposed into the Working-Array-Dawg.\n", TransposeCount); printf(" |%d| = NumberOfLivingNodes. Make sure that these three values are equal, because they must be.\n", NumberOfLivingNodes); if ( (IndexFollower == TransposeCount) && (IndexFollower == NumberOfLivingNodes) ) printf("\n Equality assertion passed.\n"); else printf("\n Equality assertion failed.\n"); // Conduct dynamic-memory-cleanup and free the whole Raw-Trie, which is no longer needed. for ( X = 0; X < Result->MaxStringLength; X++ ) for ( W = 0; W < NodeNumberCounterInit[X]; W++ ) free(HolderOfAllTnodePointers[X][W]); free(TemporaryTrie); free(NodeNumberCounter); free(NodeNumberCounterInit); for ( X = 0; X < Result->MaxStringLength; X++ ) free(HolderOfAllTnodePointers[X]); free(HolderOfAllTnodePointers); printf("\nStep 13 - Creation of the traditional-DAWG is complete, so store it in a binary file for use.\n"); FILE *Data; Data = fopen( TRADITIONAL_DAWG_DATA,"wb" ); // The "NULL" node in position "0" must be counted now. int CurrentNodeInteger = NumberOfLivingNodes + 1; // It is critical, especially in a binary file, that the first integer written to the file be the number of nodes stored in the file. fwrite( &CurrentNodeInteger, sizeof(int), 1, Data ); // Write the "NULL" node to the file first. CurrentNodeInteger = 0; fwrite( &CurrentNodeInteger, sizeof(int), 1, Data ); for ( X = 1; X <= NumberOfLivingNodes ; X++ ){ CurrentNodeInteger = (Result->DawgArray)[X].Child; CurrentNodeInteger <<= TRADITIONAL_CHILD_SHIFT; CurrentNodeInteger += ((Result->DawgArray)[X].Letter) - 'A'; if ( (Result->DawgArray)[X].EndOfWordFlag == TRUE ) CurrentNodeInteger += TRADITIONAL_EOW_FLAG; if ( (Result->DawgArray)[X].Next == 0 ) CurrentNodeInteger += TRADITIONAL_EOL_FLAG; fwrite( &CurrentNodeInteger, sizeof(int), 1, Data ); } fclose(Data); printf( "\n The Traditional-DAWG-Encoding data file is now written.\n" ); printf("\nStep 14 - Create a preliminary encoding of the more advanced CWG, and store these intermediate arrays into data files.\n"); FILE *Text; Text = fopen(TRADITIONAL_DAWG_TEXT_DATA,"w"); FILE *Main; FILE *Secondary; Main = fopen(DIRECT_GRAPH_DATA_PART_ONE,"wb" ); Secondary = fopen(DIRECT_GRAPH_DATA_PART_TWO,"wb" ); // The following variables will be used when setting up the child-List-Format integer values. int CurrentNumberOfChildren = 0; char CurrentChildLetterString[NUMBER_OF_ENGLISH_LETTERS + 1]; CurrentChildLetterString[0] = '\0'; char TheNodeInBinary[32+5+1]; char TheChildListInBinary[32+3+1]; TheNodeInBinary[0] = '\0'; int CompleteChildList; int CurrentOffsetNumberIndex; int CompleteThirtyTwoBitNode; fwrite(&NumberOfLivingNodes, 4, 1, Main); FILE *ListE; int EndOfListCount = 0; int EOLTracker = 0; int *EndOfListIndicies; ListE = fopen(FINAL_NODES_DATA, "wb"); // Set up an array to hold all of the unique child strings for the reduced lexicon DAWG. The empty placeholder will be all zeds. int NumberOfUniqueChildStrings = 0; int InsertionPoint = 0; Bool IsSheUnique = TRUE; char *TempHolder; char **HolderOfUniqueChildStrings = (char**)malloc(NumberOfLivingNodes * sizeof(char*)); for ( X = 0; X < NumberOfLivingNodes; X++ ) { HolderOfUniqueChildStrings[X] = (char*)malloc((NUMBER_OF_ENGLISH_LETTERS + 1) * sizeof(char)); strcpy(HolderOfUniqueChildStrings[X], "ZZZZZZZZZZZZZZZZZZZZZZZZZZ"); } // Right here we will tabulate the child information so that it can be turned into an "int" array and stored in a data file. // Also, we need to count the number of unique list structures, and calculate the number of bits required to store index values for them. // The idea is that there are a small number of actual values that these 26 bits will hold due to patterns in the English Language. for ( X = 1; X <= NumberOfLivingNodes ; X++ ) { CurrentNumberOfChildren = ArrayDnodeNumberOfChildrenPlusString(Result->DawgArray, X, CurrentChildLetterString); // Insert the "CurrentChildLetterString" into the "HolderOfUniqueChildStrings" if, and only if, it is unique. for ( Y = 0; Y <= NumberOfUniqueChildStrings; Y++ ) { if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) == 0 ) { IsSheUnique = FALSE; InsertionPoint = 0; break; } if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) < 0 ) { IsSheUnique = TRUE; InsertionPoint = Y; break; } } if ( IsSheUnique == TRUE ) { TempHolder = HolderOfUniqueChildStrings[NumberOfUniqueChildStrings]; strcpy(TempHolder, CurrentChildLetterString); memmove(HolderOfUniqueChildStrings + InsertionPoint + 1, HolderOfUniqueChildStrings + InsertionPoint , (NumberOfUniqueChildStrings - InsertionPoint)*sizeof(char*)); HolderOfUniqueChildStrings[InsertionPoint] = TempHolder; NumberOfUniqueChildStrings += 1; } } printf("\nStep 15 - NumberOfUniqueChildStrings = |%d|.\n", NumberOfUniqueChildStrings); int *ChildListValues = (int*)malloc(NumberOfUniqueChildStrings*sizeof(int)); // Encode the unique child strings as "int"s, so that each corresponding bit is popped. for ( X = 0; X < NumberOfUniqueChildStrings ; X++ ) { strcpy(CurrentChildLetterString, HolderOfUniqueChildStrings[X]); CurrentNumberOfChildren = strlen(CurrentChildLetterString); CompleteChildList = 0; for ( Y = 0; Y < CurrentNumberOfChildren; Y++ ) { CompleteChildList += PowersOfTwo[CurrentChildLetterString[Y] - 'A']; } ChildListValues[X] = CompleteChildList; } fprintf(Text, "Behold, the |%d| graph nodes are decoded below.\r\n\r\n", NumberOfLivingNodes); // We are now ready to output to the text file, and the "Main" intermediate binary data file. for ( X = 1; X <= NumberOfLivingNodes ; X++ ){ CurrentNumberOfChildren = ArrayDnodeNumberOfChildrenPlusString(Result->DawgArray, X, CurrentChildLetterString); // Get the correct offset index to store into the current node for ( Y = 0; Y < NumberOfUniqueChildStrings; Y++ ) { if ( strcmp(CurrentChildLetterString, HolderOfUniqueChildStrings[Y]) == 0 ) { CurrentOffsetNumberIndex = Y; break; } } CompleteThirtyTwoBitNode = CurrentOffsetNumberIndex; CompleteThirtyTwoBitNode <<= BIT_COUNT_FOR_CHILD_INDEX; CompleteThirtyTwoBitNode += (Result->DawgArray)[X].Child; // The first "BIT_COUNT_FOR_GRAPH_INDEX" are for the first child index. The EOW_FLAG is stored in the 2^30 bit. // The remaining bits will hold a reference to the child list configuration. No space is used for a letter. // The 2's complement sign bit is not needed, so a signed integer is acceptable. if ( (Result->DawgArray)[X].EndOfWordFlag == 1 ) CompleteThirtyTwoBitNode += EOW_FLAG; fwrite(&CompleteThirtyTwoBitNode, sizeof(int), 1, Main); ConvertIntNodeToBinaryString(CompleteThirtyTwoBitNode, TheNodeInBinary); ConvertChildListIntToBinaryString(ChildListValues[CurrentOffsetNumberIndex], TheChildListInBinary); fprintf(Text, "N%6d-%s,Raw|%10d|,Lev|%2d|", X, TheNodeInBinary, CompleteThirtyTwoBitNode, (Result->DawgArray)[X].Level); fprintf(Text, ",{'%c',%d,%6d", (Result->DawgArray)[X].Letter, (Result->DawgArray)[X].EndOfWordFlag, (Result->DawgArray)[X].Next); fprintf(Text, ",%6d},Childs|%2d|-|%26s|", (Result->DawgArray)[X].Child , CurrentNumberOfChildren, CurrentChildLetterString); fprintf(Text, ",ChildList%s-|%8d|.\r\n", TheChildListInBinary, ChildListValues[CurrentOffsetNumberIndex] ); if ( CompleteThirtyTwoBitNode == 0 ) printf("\n Error in node encoding process.\n"); } fclose(Main); fwrite(&NumberOfUniqueChildStrings, sizeof(int), 1, Secondary); fwrite(ChildListValues, sizeof(int), NumberOfUniqueChildStrings, Secondary); fclose(Secondary); fprintf(Text, "\r\nNumber Of Living Nodes |%d| Plus The NULL Node. Also, there are %d child list ints.\r\n\r\n" , NumberOfLivingNodes, NumberOfUniqueChildStrings); for ( X = 0; X < NumberOfUniqueChildStrings; X++ ) { fprintf(Text, "#%4d - |%26s| - |%8d|\r\n", X, HolderOfUniqueChildStrings[X], ChildListValues[X] ); } // free all of the memory used to compile the Child-List-Format array. for ( X = 0; X < NumberOfLivingNodes; X++ ) { free(HolderOfUniqueChildStrings[X]); } free(HolderOfUniqueChildStrings); free(ChildListValues); printf("\nStep 16 - Create an array with all End-Of-List index values.\n"); for ( X = 1; X <= NumberOfLivingNodes; X++ ) { if ( (Result->DawgArray)[X].Next == 0 ){ EndOfListCount += 1; } } EndOfListIndicies = (int*)malloc(EndOfListCount*sizeof(int)); fwrite(&EndOfListCount, sizeof(int), 1, ListE); for ( X = 1; X <= NumberOfLivingNodes; X++ ) { if ( (Result->DawgArray)[X].Next == 0 ) { EndOfListIndicies[EOLTracker] = X; EOLTracker += 1; } } printf("\n EndOfListCount = |%d|\n", EndOfListCount); fwrite(EndOfListIndicies, sizeof(int), EndOfListCount, ListE); fclose(ListE); fprintf(Text, "\nEndOfListCount |%d|\r\n\r\n", EndOfListCount); for ( X = 0; X < EndOfListCount; X++ ) { fprintf(Text, "#%5d - |%d|\r\n", X, EndOfListIndicies[X]); } fclose(Text); printf("\nStep 17 - Creation of Traditional-DAWG-Encoding file, Intermediate-Array files, and text-inspection-file complete.\n"); return Result; } //////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// int TraverseTheDawgArrayRecurse(int *TheDawg, int *ListFormats, int *OnIt, int CurrentIndex, char *TheWordSoFar , int FillThisPosition, char CurrentLetter, int *WordCounter, Bool PrintMe){ int X; int ChildListFormat; int CurrentChild; int WhatsBelowMe = 0; int CorrectOffset; TheWordSoFar[FillThisPosition] = CurrentLetter; if ( TheDawg[CurrentIndex] & EOW_FLAG ) { *WordCounter += 1; TheWordSoFar[FillThisPosition+ 1] = '\0'; if ( PrintMe ) printf("#|%d| - |%s|\n", *WordCounter, TheWordSoFar); WhatsBelowMe += 1; } if ( CurrentChild = (TheDawg[CurrentIndex] & CHILD_MASK) ) { ChildListFormat = ListFormats[(TheDawg[CurrentIndex] & LIST_FORMAT_INDEX_MASK) >> LIST_FORMAT_BIT_SHIFT]; for ( X = 0; X < NUMBER_OF_ENGLISH_LETTERS ; X++ ) { // Verify if the X'th letter exists in the Child-List. if ( ChildListFormat & PowersOfTwo[X] ) { // Because the X'th letter exists, run "ListFormatPopCount", to extract the "CorrectOffset". CorrectOffset = ListFormatPopCount(ChildListFormat, X); CorrectOffset -= 1; WhatsBelowMe += TraverseTheDawgArrayRecurse(TheDawg, ListFormats, OnIt, CurrentChild + CorrectOffset, TheWordSoFar , FillThisPosition + 1, X + 'A', WordCounter, PrintMe); } } } // Because CWG is a compressed graph, many values of the "OnIt" array will be updated multiple times with the same values. OnIt[CurrentIndex] = WhatsBelowMe; return WhatsBelowMe; } void TraverseTheDawgArray(int *TheDawg, int *TheListFormats, int *BelowingMe, Bool PrintToScreen){ int X; int TheCounter = 0; char RunningWord[MAX + 1]; for ( X = 0; X < NUMBER_OF_ENGLISH_LETTERS ; X++ ) { TraverseTheDawgArrayRecurse(TheDawg, TheListFormats, BelowingMe, X + 1, RunningWord, 0, 'A' + X, &TheCounter, PrintToScreen); } } int main(){ printf("\n The 29-Step CWG-Creation-Process has commenced: (Hang in there, it will be over soon.)\n"); int X; int Y; // All of the words of similar length will be stored sequentially in the same array so that there will be (MAX + 1) arrays in total. // The Smallest length of a string is assumed to be 2. char *AllWordsInEnglish[MAX + 1]; for ( X = 0; X < (MAX + 1); X++ ) AllWordsInEnglish[X] = NULL; FILE *Input; Input = fopen(RAW_LEXICON,"r"); char ThisLine[100] = "\0"; int FirstLineIsSize; int LineLength; fgets(ThisLine, 100, Input); CutOffExtraChars(ThisLine); FirstLineIsSize = StringToPositiveInt(ThisLine); printf("\n FirstLineIsSize = Number-Of-Words = |%d|\n", FirstLineIsSize); int DictionarySizeIndex[MAX + 1]; for ( X = 0; X <= MAX; X++ ) DictionarySizeIndex[X] = 0; char **LexiconInRam = (char**)malloc(FirstLineIsSize*sizeof(char *)); // The first line is the Number-Of-Words, so read them all into RAM, temporarily. for ( X = 0; X < FirstLineIsSize; X++ ) { fgets(ThisLine, 100, Input); CutOffExtraChars(ThisLine); LineLength = strlen(ThisLine); if ( LineLength <= MAX ) DictionarySizeIndex[LineLength] += 1; LexiconInRam[X] = (char *)malloc((LineLength + 1)*sizeof(char)); strcpy(LexiconInRam[X], ThisLine); } printf("\n Word-List.txt is now in RAM.\n"); // Allocate enough space to hold all of the words in strings so that we can add them to the trie by length. for ( X = 2; X < (MAX + 1); X++ ) AllWordsInEnglish[X] = (char *)malloc((X + 1)*DictionarySizeIndex[X]*sizeof(char)); int CurrentTracker[MAX + 1]; for ( X = 0; X < (MAX + 1); X++ ) CurrentTracker[X] = 0; int CurrentLength; // Copy all of the strings into the halfway house 1. for ( X = 0; X < FirstLineIsSize; X++ ) { CurrentLength = strlen(LexiconInRam[X]); // Simply copy a string from its temporary ram location to the array of length equivelant strings for processing in making the DAWG. if ( CurrentLength <= MAX ) strcpy( &((AllWordsInEnglish[CurrentLength])[(CurrentTracker[CurrentLength]*(CurrentLength + 1))]), LexiconInRam[X] ); CurrentTracker[CurrentLength] += 1; } printf("\n The words are now stored in an array according to length.\n\n"); // Make sure that the counting has resulted in all of the strings being placed correctly. for ( X = 0; X < (MAX + 1); X++ ) { if ( DictionarySizeIndex[X] == CurrentTracker[X] ) printf(" |%2d| Letter word count = |%5d| is verified.\n", X, CurrentTracker[X]); else printf(" Something went wrong with |%2d| letter words.\n", X); } // Free the the initial dynamically allocated memory. for ( X = 0; X < FirstLineIsSize; X++ ) free(LexiconInRam[X]); free(LexiconInRam); printf("\n Begin Creator init function.\n\n"); ArrayDawgPtr Adoggy = ArrayDawgInit(AllWordsInEnglish, DictionarySizeIndex, MAX); //----------------------------------------------------------------------------------- // Begin tabulation of "NumberOfWordsToEndOfBranchList" array. FILE *PartOne = fopen(DIRECT_GRAPH_DATA_PART_ONE, "rb"); FILE *PartTwo = fopen(DIRECT_GRAPH_DATA_PART_TWO, "rb"); FILE *ListE = fopen(FINAL_NODES_DATA, "rb"); int NumberOfPartOneNodes; int NumberOfPartTwoNodes; int NumberOfFinalNodes; int CurrentFinalNodeIndex; int CurrentCount; fread(&NumberOfPartOneNodes, sizeof(int), 1, PartOne); fread(&NumberOfPartTwoNodes, sizeof(int), 1, PartTwo); fread(&NumberOfFinalNodes, sizeof(int), 1, ListE); int *PartOneArray = (int *)malloc((NumberOfPartOneNodes + 1)*sizeof(int)); int *PartTwoArray = (int *)malloc(NumberOfPartTwoNodes*sizeof(int)); int *FinalNodeLocations = (int *)malloc(NumberOfFinalNodes*sizeof(int)); fread(PartOneArray + 1, sizeof(int), NumberOfPartOneNodes, PartOne); fread(PartTwoArray, sizeof(int), NumberOfPartTwoNodes, PartTwo); fread(FinalNodeLocations, sizeof(int), NumberOfFinalNodes, ListE); int *NumberOfWordsBelowMe = (int *)malloc((NumberOfPartOneNodes + 1)*sizeof(int)); int *NumberOfWordsToEndOfBranchList =(int *)malloc((NumberOfPartOneNodes + 1)*sizeof(int)); int *RearrangedNumberOfWordsToEndOfBranchList =(int *)malloc((NumberOfPartOneNodes + 1)*sizeof(int)); NumberOfWordsBelowMe[0] = 0; NumberOfWordsToEndOfBranchList[0] = 0; RearrangedNumberOfWordsToEndOfBranchList[0] = 0; PartOneArray[0] = 0; fclose(PartOne); fclose(PartTwo); fclose(ListE); printf("\nStep 18 - Display the Mask-Format for CWG Main-Nodes:\n\n"); char Something[38]; ConvertIntNodeToBinaryString(CHILD_MASK, Something); printf(" %s - CHILD_MASK\n", Something); ConvertIntNodeToBinaryString(LIST_FORMAT_INDEX_MASK, Something); printf(" %s - LIST_FORMAT_INDEX_MASK\n", Something); printf("\nStep 19 - Traverse the DawgArray to fill the NumberOfWordsBelowMe array.\n"); // This function is run to fill the "NumberOfWordsBelowMe" array. TraverseTheDawgArray(PartOneArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); printf("\nStep 20 - Use FinalNodeLocations and NumberOfWordsBelowMe to fill the NumberOfWordsToEndOfBranchList array.\n"); // This little piece of code compiles the "NumberOfWordsToEndOfBranchList" array. // The requirements are the "NumberOfWordsBelowMe" array and the "FinalNodeLocations" array. CurrentFinalNodeIndex = 0; for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { CurrentCount = 0; for ( Y = X; Y <= FinalNodeLocations[CurrentFinalNodeIndex]; Y++ ) { CurrentCount += NumberOfWordsBelowMe[Y]; } NumberOfWordsToEndOfBranchList[X] = CurrentCount; if ( X == FinalNodeLocations[CurrentFinalNodeIndex] ) CurrentFinalNodeIndex +=1; } ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// // Now with preliminary analysis complete, it is time to rearrange the PartOne nodes and then set up PartThree. int ListSizeCounter[NUMBER_OF_ENGLISH_LETTERS + 1]; int TotalNumberOfLists = 0; Bool AreWeInBigList = FALSE; int TheCurrentChild; int StartOfCurrentList = 1; int SizeOfCurrentList = FinalNodeLocations[0]; int EndOfCurrentList = FinalNodeLocations[0]; int InsertionPoint = 1; int CurrentlyCopyingThisList = 0; int *PartOneRearrangedArray = (int *)malloc((NumberOfPartOneNodes + 1)*sizeof(int)); int *CurrentAdjustments = (int *)malloc((NumberOfPartOneNodes + 1)*sizeof(int)); PartOneRearrangedArray[0] = 0; for ( X = 0; X <= NumberOfPartOneNodes; X++ ) CurrentAdjustments[X] = 0; for ( X = 0; X <= NUMBER_OF_ENGLISH_LETTERS; X++ ) ListSizeCounter[X] = 0; printf("\nStep 21 - Relocate all node-lists with WTEOBL values greater than 255, to the front of the Main CWG array.\n"); printf("\n All corresponding node data and End-Of-List data must also be shifted around.\n"); // This code is responsible for rearranging the node lists inside of the CWG int array so the word-heavy lists filter to the front. CurrentFinalNodeIndex = 0; for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { if ( NumberOfWordsToEndOfBranchList[X] > 255 ) AreWeInBigList = TRUE; if ( X == EndOfCurrentList ) { ListSizeCounter[SizeOfCurrentList] += 1; // We are now at the end of a big list that must to be moved up to the InsertionPoint. // This also implies moving everything between its current location and its new one. if ( AreWeInBigList == TRUE ) { // First step is to copy the CurrentList into the new array at its correct position. for ( Y = 0; Y < SizeOfCurrentList; Y++ ) { PartOneRearrangedArray[InsertionPoint + Y] = PartOneArray[StartOfCurrentList + Y]; RearrangedNumberOfWordsToEndOfBranchList[InsertionPoint + Y] = NumberOfWordsToEndOfBranchList[StartOfCurrentList + Y]; } // The following steps are required when we are actually moving the position of a list. The first set of lists will bypass these steps. if ( InsertionPoint != StartOfCurrentList ) { // Step 2 is to move all of the nodes between the original and final location, "SizeOfCurrentList" number of places back, starting from the end. for ( Y = EndOfCurrentList; Y >= (InsertionPoint + SizeOfCurrentList); Y-- ) { PartOneArray[Y] = PartOneArray[Y - SizeOfCurrentList]; NumberOfWordsToEndOfBranchList[Y] = NumberOfWordsToEndOfBranchList[Y - SizeOfCurrentList]; } // Step 3 is to copy the list we are moving up from the rearranged array back into the original. for ( Y = InsertionPoint; Y < (InsertionPoint + SizeOfCurrentList); Y++ ) { PartOneArray[Y] = PartOneRearrangedArray[Y]; NumberOfWordsToEndOfBranchList[Y] = RearrangedNumberOfWordsToEndOfBranchList[Y]; } // Step 4 is to fill the "CurrentAdjustments" array with the amount that each child references must be moved. // The two arrays are identical now up to the new insertion point. // At this stage, the "CurrentAdjustments" array is all zeros. for ( Y = 1; Y <= NumberOfPartOneNodes; Y++ ) { TheCurrentChild = (PartOneArray[Y] & CHILD_MASK); if ( (TheCurrentChild >= InsertionPoint) && (TheCurrentChild < StartOfCurrentList) ) CurrentAdjustments[Y] = SizeOfCurrentList; if ( (TheCurrentChild >= StartOfCurrentList) && (TheCurrentChild <= EndOfCurrentList) ) CurrentAdjustments[Y] = InsertionPoint - StartOfCurrentList; } // Step 5 is to fix all of the child reference values in both of the arrays. // Start with the rearranged array. for ( Y = 1; Y < (InsertionPoint + SizeOfCurrentList); Y++ ) { if ( CurrentAdjustments[Y] != 0 ) PartOneRearrangedArray[Y] += CurrentAdjustments[Y]; } // Finish with the original array. Make sure to zero all the values after the adjustments have been made to get ready for the next round. for ( Y = 1; Y <= NumberOfPartOneNodes; Y++ ) { if ( CurrentAdjustments[Y] != 0 ) { PartOneArray[Y] += CurrentAdjustments[Y]; CurrentAdjustments[Y] = 0; } } } // Step 7 is to set the new InsertionPoint and change the "FinalNodeLocations", so that they reflect the shift. InsertionPoint += SizeOfCurrentList; // Shift all of the end of lists between the "CurrentlyCopyingThisList" and "CurrentFinalNodeIndex". for ( Y = CurrentFinalNodeIndex; Y > CurrentlyCopyingThisList; Y-- ) { FinalNodeLocations[Y] = FinalNodeLocations[Y - 1] + SizeOfCurrentList; } FinalNodeLocations[CurrentlyCopyingThisList] = InsertionPoint - 1; CurrentlyCopyingThisList += 1; } // Even when we are not in a big list, we still need to update the current list parameters. CurrentFinalNodeIndex += 1; SizeOfCurrentList = FinalNodeLocations[CurrentFinalNodeIndex] - EndOfCurrentList; EndOfCurrentList = FinalNodeLocations[CurrentFinalNodeIndex]; StartOfCurrentList = X + 1; AreWeInBigList = FALSE; } } printf("\n Word-Heavy list-shifting is now complete.\n"); // Step 8 is to copy all of the small lists from the original array to the rearranged array. All of the references should be properly adjusted at this point. for ( X = InsertionPoint; X <= NumberOfPartOneNodes; X++ ) { PartOneRearrangedArray[X] = PartOneArray[X]; RearrangedNumberOfWordsToEndOfBranchList[X] = NumberOfWordsToEndOfBranchList[X]; } // Rearrangement of the DAWG lists to reduce size of the PartThree data file is complete, so check if the new and old lists are identical, because they should be. for ( X = 1; X <= NumberOfPartOneNodes; X++ ) { if ( PartOneArray[X] != PartOneRearrangedArray[X] ) printf(" What A Mistake!\n"); if ( RearrangedNumberOfWordsToEndOfBranchList[X] != NumberOfWordsToEndOfBranchList[X] ) printf(" Mistaken.\n"); } // The two arrays are now identical, so as a final precaution, traverse the rearranged array. TraverseTheDawgArray(PartOneRearrangedArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); // Check for duplicate lists. It is now highly likely that there are some duplicates. // Lists of size X, can be replaced with partial lists of size X+n. Make sure to check for this case. printf("\nStep 22 - Create an array to organize End-Of-List values by size.\n\n"); // Add up the total number of lists. for ( X = 1; X <= NUMBER_OF_ENGLISH_LETTERS; X++ ) { TotalNumberOfLists += ListSizeCounter[X]; printf(" Size|%2d| Lists = |%5d|\n", X, ListSizeCounter[X]); } printf("\n TotalNumberOfLists = |%d|\n", TotalNumberOfLists); int **NodeListsBySize = (int **)malloc((NUMBER_OF_ENGLISH_LETTERS + 1)*sizeof(int *)); int WhereWeAt[NUMBER_OF_ENGLISH_LETTERS + 1]; for ( X = 0; X <= NUMBER_OF_ENGLISH_LETTERS; X++ ) WhereWeAt[X] = 0; for ( X = 1; X <= NUMBER_OF_ENGLISH_LETTERS; X++ ) { NodeListsBySize[X] = (int *)malloc(ListSizeCounter[X]*sizeof(int)); } // We are now required to fill the "NodeListsBySize" array. Simply copy over the correct "FinalNode" information. // Note that the "FinalNode" information reflects the readjustment that just took place. CurrentFinalNodeIndex = 0; EndOfCurrentList = FinalNodeLocations[0]; SizeOfCurrentList = FinalNodeLocations[0]; for ( X = 0; X < NumberOfFinalNodes; X++ ) { (NodeListsBySize[SizeOfCurrentList])[WhereWeAt[SizeOfCurrentList]] = EndOfCurrentList; WhereWeAt[SizeOfCurrentList] += 1; CurrentFinalNodeIndex += 1; SizeOfCurrentList = FinalNodeLocations[CurrentFinalNodeIndex] - EndOfCurrentList; EndOfCurrentList = FinalNodeLocations[CurrentFinalNodeIndex]; } printf("\n End-Of-List values are now organized.\n"); int Z; int V; int W; int U; int TheNewChild; int TotalNumberOfKilledLists = 0; int TotalNumberOfKilledNodes = 0; int NewNumberOfKilledLists = -1; int InspectThisEndOfList; int MaybeReplaceWithThisEndOfList; int NumberOfListsKilledThisRound = -1; int CurrentNumberOfPartOneNodes = NumberOfPartOneNodes; Bool EliminateCurrentList = TRUE; printf("\nStep 23 - Kill more lists by using the ends of longer lists or lists of equal size.\n\n"); // Try to eliminate lists with partial lists. // "X" is the list-length of lists we are trying to kill. //for ( X = 1; X < NUMBER_OF_ENGLISH_LETTERS; X++ ) { for ( X = NUMBER_OF_ENGLISH_LETTERS; X >= 1; X-- ) { printf(" Try To Eliminate Lists of Size |%2d| - ", X); NewNumberOfKilledLists = 0; // Look for partial lists at the end of "Y" sized lists, to replace the "X" sized lists with. for ( Y = NUMBER_OF_ENGLISH_LETTERS; Y >= X; Y-- ) { // Try to kill list # "Z". for ( Z = 0; Z < ListSizeCounter[X]; Z++ ) { InspectThisEndOfList = (NodeListsBySize[X])[Z]; // Try to replace with list # "Z". for ( W = 0; W < ListSizeCounter[Y]; W++ ) { // Never try to replace a list with itself. if ( (X == Y) && (Z == W) ) continue; MaybeReplaceWithThisEndOfList = (NodeListsBySize[Y])[W]; for( V = 0; V < X; V++ ) { if ( PartOneArray[InspectThisEndOfList - V] != PartOneArray[MaybeReplaceWithThisEndOfList - V] ) { EliminateCurrentList = FALSE; break; } } // When eliminating a list, make sure to adjust the WTEOBL data. if ( EliminateCurrentList == TRUE ) { // Step 1 - Replace all references to the duplicate list with the earlier equivalent. for( V = 1; V <= CurrentNumberOfPartOneNodes; V++ ) { TheCurrentChild = (PartOneArray[V] & CHILD_MASK); if ( (TheCurrentChild > (InspectThisEndOfList - X)) && (TheCurrentChild <= InspectThisEndOfList) ) { TheNewChild = MaybeReplaceWithThisEndOfList - (InspectThisEndOfList - TheCurrentChild); PartOneArray[V] -= TheCurrentChild; PartOneArray[V] += TheNewChild; } } // Step 2 - Eliminate the dupe list by moving the higher lists forward. for ( V = (InspectThisEndOfList - X + 1); V <= (CurrentNumberOfPartOneNodes - X); V++ ) { PartOneArray[V] = PartOneArray[V + X]; NumberOfWordsToEndOfBranchList[V] = NumberOfWordsToEndOfBranchList[V + X]; } // Step 3 - Change CurrentNumberOfPartOneNodes. CurrentNumberOfPartOneNodes -= X; // Step 4 - Lower all references to the nodes coming after the dupe list. for ( V = 1; V <= CurrentNumberOfPartOneNodes; V++ ) { TheCurrentChild = (PartOneArray[V] & CHILD_MASK); if ( TheCurrentChild > InspectThisEndOfList ){ PartOneArray[V] -= X; } } // Step 5 - Readjust all of the lists after "Z" forward 1 and down X to the value, and lower ListSizeCounter[X] by 1. for( V = Z; V < (ListSizeCounter[X] - 1); V++ ) { (NodeListsBySize[X])[V] = (NodeListsBySize[X])[V + 1] - X; } ListSizeCounter[X] -= 1; // Step 6 - Lower any list, of any size, greater than (NodeListsBySize[X])[Z], down by X. for ( V = 1; V <= (X - 1); V++ ) { for ( U = 0; U < ListSizeCounter[V]; U++ ) { if ( (NodeListsBySize[V])[U] > InspectThisEndOfList ) (NodeListsBySize[V])[U] -= X; } } for ( V = (X + 1); V <= NUMBER_OF_ENGLISH_LETTERS; V++ ) { for ( U = 0; U < ListSizeCounter[V]; U++ ) { if ( (NodeListsBySize[V])[U] > InspectThisEndOfList ) (NodeListsBySize[V])[U] -= X; } } // Step 7 - Lower "Z" by 1 and increase "TotalNumberOfKilledLists". Z -= 1; TotalNumberOfKilledLists += 1; NewNumberOfKilledLists += 1; TotalNumberOfKilledNodes += X; break; } EliminateCurrentList = TRUE; } } } if ( NewNumberOfKilledLists > 0 ) printf("Killed |%d| lists.\n", NewNumberOfKilledLists); else printf("Empty handed.\n"); NewNumberOfKilledLists = 0; } printf("\n Removal of the new-redundant-lists is now complete:\n"); printf("\n |%5d| = Original # of lists.\n", TotalNumberOfLists); printf(" |%5d| = Killed # of lists.\n", TotalNumberOfKilledLists); printf(" |%5d| = Remaining # of lists.\n", TotalNumberOfLists = TotalNumberOfLists - TotalNumberOfKilledLists); printf("\n |%6d| = Original # of nodes.\n", NumberOfPartOneNodes); printf(" |%6d| = Killed # of nodes.\n", TotalNumberOfKilledNodes); printf(" |%6d| = Remaining # of nodes.\n", CurrentNumberOfPartOneNodes); // Try to eliminate lists with partial lists again to check that we've got em all. printf("\nStep 24 - Run the redundant-list-analysis one more time to test that no-more exist.\n\n"); NumberOfPartOneNodes = CurrentNumberOfPartOneNodes; TotalNumberOfKilledLists = 0; TotalNumberOfKilledNodes = 0; EliminateCurrentList = TRUE; // "X" is the list-length of lists we are trying to kill. for ( X = NUMBER_OF_ENGLISH_LETTERS; X >= 1; X-- ) { printf(" Try To Eliminate Lists of Size |%2d| - ", X); NewNumberOfKilledLists = 0; // Look for partial lists at the end of "Y" sized lists, to replace the "X" sized lists with. for ( Y = NUMBER_OF_ENGLISH_LETTERS; Y >= X; Y-- ) { // Try to kill list # "Z". for ( Z = 0; Z < ListSizeCounter[X]; Z++ ) { InspectThisEndOfList = (NodeListsBySize[X])[Z]; // Try to replace with list # "Z". for ( W = 0; W < ListSizeCounter[Y]; W++ ) { // Never try to replace a list with itself. if ( (X == Y) && (Z == W) ) continue; MaybeReplaceWithThisEndOfList = (NodeListsBySize[Y])[W]; for( V = 0; V < X; V++ ) { if ( PartOneArray[InspectThisEndOfList - V] != PartOneArray[MaybeReplaceWithThisEndOfList - V] ) { EliminateCurrentList = FALSE; break; } } // When eliminating a list, make sure to adjust the WTEOBL data. if ( EliminateCurrentList == TRUE ) { // Step 1 - Replace all references to the duplicate list with the earlier equivalent. for( V = 1; V <= CurrentNumberOfPartOneNodes; V++ ) { TheCurrentChild = (PartOneArray[V] & CHILD_MASK); if ( (TheCurrentChild > (InspectThisEndOfList - X)) && (TheCurrentChild <= InspectThisEndOfList) ) { TheNewChild = MaybeReplaceWithThisEndOfList - (InspectThisEndOfList - TheCurrentChild); PartOneArray[V] -= TheCurrentChild; PartOneArray[V] += TheNewChild; } } // Step 2 - Eliminate the dupe list by moving the higher lists forward. for ( V = (InspectThisEndOfList - X + 1); V <= (CurrentNumberOfPartOneNodes - X); V++ ) { PartOneArray[V] = PartOneArray[V + X]; NumberOfWordsToEndOfBranchList[V] = NumberOfWordsToEndOfBranchList[V + X]; } // Step 3 - Change CurrentNumberOfPartOneNodes. CurrentNumberOfPartOneNodes -= X; // Step 4 - Lower all references to the nodes coming after the dupe list. for ( V = 1; V <= CurrentNumberOfPartOneNodes; V++ ) { TheCurrentChild = (PartOneArray[V] & CHILD_MASK); if ( TheCurrentChild > InspectThisEndOfList ){ PartOneArray[V] -= X; } } // Step 5 - Readjust all of the lists after "Z" forward 1 and down X to the value, and lower ListSizeCounter[X] by 1. for( V = Z; V < (ListSizeCounter[X] - 1); V++ ) { (NodeListsBySize[X])[V] = (NodeListsBySize[X])[V + 1] - X; } ListSizeCounter[X] -= 1; // Step 6 - Lower any list, of any size, greater than (NodeListsBySize[X])[Z], down by X. for ( V = 1; V <= (X - 1); V++ ) { for ( U = 0; U < ListSizeCounter[V]; U++ ) { if ( (NodeListsBySize[V])[U] > InspectThisEndOfList ) (NodeListsBySize[V])[U] -= X; } } for ( V = (X + 1); V <= NUMBER_OF_ENGLISH_LETTERS; V++ ) { for ( U = 0; U < ListSizeCounter[V]; U++ ) { if ( (NodeListsBySize[V])[U] > InspectThisEndOfList ) (NodeListsBySize[V])[U] -= X; } } // Step 7 - Lower "Z" by 1 and increase "TotalNumberOfKilledLists". Z -= 1; TotalNumberOfKilledLists += 1; NewNumberOfKilledLists += 1; TotalNumberOfKilledNodes += X; break; } EliminateCurrentList = TRUE; } } } if ( NewNumberOfKilledLists > 0 ) printf("Killed |%d| lists.\n", NewNumberOfKilledLists); else printf("Empty handed.\n"); NewNumberOfKilledLists = 0; } printf("\n The no-more redundant-list-test is now complete:\n"); printf("\n |%5d| = Original # of lists.\n", TotalNumberOfLists); printf(" |%5d| = Killed # of lists.\n", TotalNumberOfKilledLists); printf(" |%5d| = Remaining # of lists.\n", TotalNumberOfLists - TotalNumberOfKilledLists); printf("\n |%6d| = Original # of nodes.\n", NumberOfPartOneNodes); printf(" |%6d| = Killed # of nodes.\n", TotalNumberOfKilledNodes); printf(" |%6d| = Remaining # of nodes.\n", CurrentNumberOfPartOneNodes); // verify that the reduction procedures have resulted in a valid word graph. // "FinalNodeLocations" needs to be recompiled from what is left in the "NodeListsBySize" arrays. printf("\nStep 25 - Recompile the FinalNodeLocations array and display the distribution.\n\n"); TotalNumberOfLists = 0; for ( X = 1; X <= NUMBER_OF_ENGLISH_LETTERS; X++ ) { TotalNumberOfLists += ListSizeCounter[X]; printf(" List Size|%2d| - Number Of Lists|%5d|\n", X, ListSizeCounter[X]); } printf("\n TotalNumberOfLists|%d|\n", TotalNumberOfLists); // Set all initial values in "FinalNodeLocations" array to BOGUS numbers. for ( X = 0; X < TotalNumberOfLists; X++ ) { FinalNodeLocations[X] = 1000000; } // Filter all of the living "FinalNode" values into the "FinalNodeLocations" array. int TempValue; for ( X = NUMBER_OF_ENGLISH_LETTERS; X >= 1; X-- ) { for ( Y = 0; Y < ListSizeCounter[X]; Y++ ) { FinalNodeLocations[TotalNumberOfLists - 1] = (NodeListsBySize[X])[Y]; // The new list has been placed at the end of the "FinalNodeLocations" array, now filter it up to where it should be. for ( Z = (TotalNumberOfLists - 1); Z > 0; Z-- ) { if ( FinalNodeLocations[Z - 1] > FinalNodeLocations[Z] ) { TempValue = FinalNodeLocations[Z - 1]; FinalNodeLocations[Z - 1] = FinalNodeLocations[Z]; FinalNodeLocations[Z] = TempValue; } else break; } } } // Test for logical errors in the list elimination procedure. for ( X = 0; X < (TotalNumberOfLists - 1); X++ ) { if ( FinalNodeLocations[X] == FinalNodeLocations[X + 1] ) printf("\nNo Two Lists Can End On The Same Node. |%d|%d|\n", X, FinalNodeLocations[X]); } printf("\n The FinalNodeLocations array is now compiled and tested for obvious errors.\n"); printf("\nStep 26 - Recompile WTEOBL array by graph traversal, and test equivalence with the one modified during list-killing.\n"); // Compile "RearrangedNumberOfWordsToEndOfBranchList", and verify that it is the same as "NumberOfWordsToEndOfBranchList". TraverseTheDawgArray(PartOneArray, PartTwoArray, NumberOfWordsBelowMe, FALSE); // This little piece of code compiles the "RearrangedNumberOfWordsToEndOfBranchList" array. // The requirements are the "NumberOfWordsBelowMe" array and the "FinalNodeLocations" array. CurrentFinalNodeIndex = 0; for ( X = 1; X <= CurrentNumberOfPartOneNodes; X++ ) { CurrentCount = 0; for ( Y = X; Y <= FinalNodeLocations[CurrentFinalNodeIndex]; Y++ ) { CurrentCount += NumberOfWordsBelowMe[Y]; } RearrangedNumberOfWordsToEndOfBranchList[X] = CurrentCount; if ( X == FinalNodeLocations[CurrentFinalNodeIndex] ) CurrentFinalNodeIndex +=1; } printf("\n New WTEOBL array is compiled, so test for equality.\n"); for ( X = 1; X <= CurrentNumberOfPartOneNodes; X++ ) { if ( RearrangedNumberOfWordsToEndOfBranchList[X] != NumberOfWordsToEndOfBranchList[X] ) printf("\nMismatch found.\n"); } printf("\n Equality test complete.\n"); printf("\nStep 27 - Determine the final node index that requires a short integer for its WTEOBL value.\n"); // Find out the final index number that requires an integer greater in size than a byte to hold it. Part 3 of the data structure will be held in three arrays. int FurthestBigNode = 0; int FirstSmallNode; for ( X = 1; X <= CurrentNumberOfPartOneNodes; X++ ) { if ( RearrangedNumberOfWordsToEndOfBranchList[X] > 0XFF ) FurthestBigNode = X; } for ( X = 0; X < TotalNumberOfLists; X++ ) { if ( FinalNodeLocations[X] >= FurthestBigNode ) { printf("\n End of final short-integer WTEOBL list = |%d|.\n", FinalNodeLocations[X]); FurthestBigNode = FinalNodeLocations[X]; break; } } FirstSmallNode = FurthestBigNode + 1; printf("\n Index of first node requiring only an unsigned-char for its WTEOBL = |%d|.\n", FirstSmallNode); // The first 26 nodes are the only ones in need of 4-Byte int variables to hold their WTEOBL values. // Being the entry points for the graph, it makes sense to hold these values in a "const int" array, defined in the program code. // The "short int" array holding the medium size WTEOBL values will hold '0's for elements [0, 26], inclusive. // The "unsigned char" array must be unsigned because many of the values require 8-bit representation. // The entire CWG will be stored inside of the one "CWG_Data_For_Word-List.dat" data file. // The first integer will be the total number of words in the graph. // The next five integers will be the array sizes. // After these header values, each array will then be written to the file in order, using the correct integer type. printf("\nStep 28 - Separate the final 3 WTEOBL arrays, and write all 5 arrays to the FinalProduct CWG data file.\n"); int ArrayOneSize = CurrentNumberOfPartOneNodes + 1; int ArrayTwoSize = NumberOfPartTwoNodes; int ArrayThreeSize = NUMBER_OF_ENGLISH_LETTERS + 1; int ArrayFourSize = FurthestBigNode + 1; int ArrayFiveSize = CurrentNumberOfPartOneNodes - FurthestBigNode; // Allocate the final three arrays. int *PartThreeArray = (int *)malloc(ArrayThreeSize*sizeof(int)); short int *PartFourArray = (short int *)malloc(ArrayFourSize*sizeof(short int)); unsigned char *PartFiveArray = (unsigned char *)malloc(ArrayFiveSize*sizeof(unsigned char)); // Fill the final three CWG arrays. for ( X = 0; X < ArrayThreeSize; X++ ) { PartThreeArray[X] = RearrangedNumberOfWordsToEndOfBranchList[X]; PartFourArray[X] = 0; } for ( X = ArrayThreeSize; X < ArrayFourSize; X++ ) PartFourArray[X] = RearrangedNumberOfWordsToEndOfBranchList[X]; for ( X = 0; X < ArrayFiveSize; X++ ) PartFiveArray[X] = RearrangedNumberOfWordsToEndOfBranchList[ArrayFourSize + X]; FILE* FinalProduct = fopen(CWG_DATA, "wb"); fwrite(&FirstLineIsSize, sizeof(int), 1, FinalProduct); fwrite(&ArrayOneSize, sizeof(int), 1, FinalProduct); fwrite(&ArrayTwoSize, sizeof(int), 1, FinalProduct); fwrite(&ArrayThreeSize, sizeof(int), 1, FinalProduct); fwrite(&ArrayFourSize, sizeof(int), 1, FinalProduct); fwrite(&ArrayFiveSize, sizeof(int), 1, FinalProduct); fwrite(PartOneArray, sizeof(int), ArrayOneSize, FinalProduct); fwrite(PartTwoArray, sizeof(int), ArrayTwoSize, FinalProduct); fwrite(PartThreeArray, sizeof(int), ArrayThreeSize, FinalProduct); fwrite(PartFourArray, sizeof(short int), ArrayFourSize, FinalProduct); fwrite(PartFiveArray, sizeof(unsigned char), ArrayFiveSize, FinalProduct); fclose(FinalProduct); printf("\n The new CWG is ready to use. Now, understand it, then go out and use it.\n\n"); return 0; } |
// This is a very simple word-search and word-hash-function program. // In addition to its primary functions, the program writes a "Numbered-Word-List.txt" file, for inspection. // This incremental-hash-function for 178,691 words fits in less than 600K, which is very small. #include <stdlib.h> #include <stdio.h> #include <string.h> #define GRAPH_DATA "CWG_Data_For_Word-List.dat" #define OUT_LIST "Numbered-Word-List.txt" #define NUMBER_OF_ENGLISH_LETTERS 26 #define LOWER_IT 32 #define INPUT_BUFFER_SIZE 100 #define MAX 15 #define THREE 3 #define ESCAPE_SEQUENCE "999" #define TWO_UP_EIGHT 256 #define EOW_FLAG 1073741824 #define LIST_FORMAT_INDEX_MASK 0X3FFE0000 #define LIST_FORMAT_BIT_SHIFT 17 #define CHILD_MASK 0X0001FFFF // Define a Boolean, enumerated data type. typedef enum { FALSE = 0, TRUE = 1 } Bool; const int PowersOfTwo[NUMBER_OF_ENGLISH_LETTERS] = { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432 }; // When using the "ChildListMasks", the letter being investigated is known to exist, so its bit will not be included in the claculated offset. const int ChildListMasks[NUMBER_OF_ENGLISH_LETTERS] = { 0X0, 0X1, 0X3, 0X7, 0XF, 0X1F, 0X3F, 0X7F, 0XFF, 0X1FF, 0X3FF, 0X7FF, 0XFFF, 0X1FFF, 0X3FFF, 0X7FFF, 0XFFFF, 0X1FFFF, 0X3FFFF, 0X7FFFF, 0XFFFFF, 0X1FFFFF, 0X3FFFFF, 0X7FFFFF, 0XFFFFFF, 0X1FFFFFF }; const char PopCountTable[TWO_UP_EIGHT] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 }; // Using an explicit jump table, this is an optimized 4-byte integer pop-count. // The function returns the offset of the "LetterPosition"th letter from the "FirstChild" node in a list. // If "LetterPosition" holds the first letter, then "0" is returned. int ListFormatPopCount(int CompleteChildList, int LetterPosition){ const static void *PositionJumpTable[NUMBER_OF_ENGLISH_LETTERS] = { &&Ze, &&On, &&On, &&On, &&On, &&On, &&On, &&On, &&Tw, &&Tw, &&Tw, &&Tw, &&Tw, &&Tw, &&Tw, &&Tw, &&Th, &&Th, &&Th, &&Th, &&Th, &&Th, &&Th, &&Th, &&Fo, &&Fo }; int Result = 0; // By casting the internal integer variable "CompleteChildList" as an "unsigned char *" we can access each byte within it. // Remember that computer-programs in C use "little-endian" byte-order. unsigned char *ByteZero = (unsigned char *)&CompleteChildList; // Mask "CompleteChildList" so that CompleteChildList &= ChildListMasks[LetterPosition]; goto *PositionJumpTable[LetterPosition]; Fo: Result += PopCountTable[*(ByteZero + 3)]; Th: Result += PopCountTable[*(ByteZero + 2)]; Tw: Result += PopCountTable[*(ByteZero + 1)]; On: Result += PopCountTable[*ByteZero]; Ze: return Result; } // This simple function clips off the extra chars for each "fgets()" line. Works for Linux and Windows text format. void CutOffExtraChars(char *ThisLine){ if ( ThisLine[strlen(ThisLine) - 2] == '\r' ) ThisLine[strlen(ThisLine) - 2] = '\0'; else if ( ThisLine[strlen(ThisLine) - 1] == '\n' ) ThisLine[strlen(ThisLine) - 1] = '\0'; } // This Function converts any lower case letters inside "RawWord" to capitals, so that the whole string is made of capital letters. void MakeMeAllCapital(char *RawWord){ int X; int Length = strlen(RawWord); for ( X = 0; X < Length; X++ ) { if ( RawWord[X] >= 'a' && RawWord[X] <= 'z' ) RawWord[X] = RawWord[X] - LOWER_IT; } } // Tests the capitalized user input string for valid length and valid characters. // Returns "TRUE" for valid, and "FALSE" for invalid. Bool ValidateInput(char *CappedInput){ int X; int Length = strlen(CappedInput); if ( Length > MAX ) return FALSE; for ( X = 0; X < Length; X++ ) { if ( CappedInput[X] < 'A' || CappedInput[X] > 'Z' ) return FALSE; } return TRUE; } // The CWG basic-type arrays will have global scope to reduce function-argument overhead. int *TheNodeArray; int *TheListFormatArray; int *TheRoot_WTEOBL_Array; short *TheShort_WTEOBL_Array; unsigned char *TheUnsignedChar_WTEOBL_Array; // These two values are needed for the CWG Hash-Function. int WTEOBL_Transition; int IndexCorrection; // Use the first two CWG arrays to return a Boolean value indicating if "TheCandidate" word is in the lexicon. Bool SingleWordSearchBoolean(char *TheCandidate, int CandidateLength){ int X; int CurrentLetterPosition = TheCandidate[0] - 'A'; int CurrentNodeIndex = CurrentLetterPosition + 1; int CurrentChildListFormat; for ( X = 1; X < CandidateLength; X++ ) { if ( !(TheNodeArray[CurrentNodeIndex] & CHILD_MASK) ) return FALSE; CurrentChildListFormat = TheListFormatArray[(TheNodeArray[CurrentNodeIndex] & LIST_FORMAT_INDEX_MASK) >> LIST_FORMAT_BIT_SHIFT]; CurrentLetterPosition = TheCandidate[X] - 'A'; if ( !(CurrentChildListFormat & PowersOfTwo[CurrentLetterPosition]) ) return FALSE; else CurrentNodeIndex = (TheNodeArray[CurrentNodeIndex] & CHILD_MASK) + ListFormatPopCount(CurrentChildListFormat, CurrentLetterPosition); } if ( TheNodeArray[CurrentNodeIndex] & EOW_FLAG ) return TRUE; return FALSE; } // Using a novel graph mark-up scheme, this function returns the hash index of "TheCandidate", and "0" if it does not exist. // This function uses the additional 3 WTEOBL arrays. int SingleWordHashFunction(char *TheCandidate, int CandidateLength){ int X; int CurrentLetterPosition = TheCandidate[0] - 'A'; int CurrentNodeIndex = CurrentLetterPosition + 1; int CurrentChildListFormat; int CurrentHashMarker = TheRoot_WTEOBL_Array[CurrentNodeIndex]; for ( X = 1; X < CandidateLength; X++ ) { if ( !(TheNodeArray[CurrentNodeIndex] & CHILD_MASK) ) return 0; CurrentChildListFormat = TheListFormatArray[(TheNodeArray[CurrentNodeIndex] & LIST_FORMAT_INDEX_MASK) >> LIST_FORMAT_BIT_SHIFT]; CurrentLetterPosition = TheCandidate[X] - 'A'; if ( !(CurrentChildListFormat & PowersOfTwo[CurrentLetterPosition]) ) return 0; else { CurrentNodeIndex = TheNodeArray[CurrentNodeIndex] & CHILD_MASK; // Use "TheShort_WTEOBL_Array". if ( CurrentNodeIndex < WTEOBL_Transition ) { CurrentHashMarker -= TheShort_WTEOBL_Array[CurrentNodeIndex]; CurrentNodeIndex += ListFormatPopCount(CurrentChildListFormat, CurrentLetterPosition); CurrentHashMarker += TheShort_WTEOBL_Array[CurrentNodeIndex]; } // Use "TheUnsignedChar_WTEOBL_Array". else { CurrentHashMarker -= TheUnsignedChar_WTEOBL_Array[CurrentNodeIndex - WTEOBL_Transition]; CurrentNodeIndex += ListFormatPopCount(CurrentChildListFormat, CurrentLetterPosition); CurrentHashMarker += TheUnsignedChar_WTEOBL_Array[CurrentNodeIndex - WTEOBL_Transition]; } if ( TheNodeArray[CurrentNodeIndex] & EOW_FLAG ) CurrentHashMarker -= 1; } } if ( TheNodeArray[CurrentNodeIndex] & EOW_FLAG ) return IndexCorrection - CurrentHashMarker; return 0; } // List output variables. FILE *WordDump; int LastPosition = 0; // A recursive function to print the lexicon contained in the CWG to a text file. // The function also tests the hash-tracking logic. If the words are output using successive numbers, the graph works perfectly. void Print_CWG_Word_ListRecurse(int ThisIndex, int FillThisPlace, char ThisLetter, char *WorkingWord, int CurrentHashMarker){ int X; int TheChildIndex = TheNodeArray[ThisIndex] & CHILD_MASK; int TheChildListFormat; int ConstHashChange; int HashCheck; WorkingWord[FillThisPlace] = ThisLetter; if ( TheNodeArray[ThisIndex] & EOW_FLAG ) { WorkingWord[FillThisPlace + 1] = '\0'; CurrentHashMarker -= 1; fprintf(WordDump, "[%d]-|%s|\r\n", HashCheck = (IndexCorrection - CurrentHashMarker), WorkingWord); if ( HashCheck != (LastPosition + 1) ) printf("Major Mistake|%d|!\n", HashCheck); LastPosition = HashCheck; } if ( TheChildIndex ) { TheChildListFormat = TheListFormatArray[(TheNodeArray[ThisIndex] & LIST_FORMAT_INDEX_MASK) >> LIST_FORMAT_BIT_SHIFT]; if ( TheChildIndex < WTEOBL_Transition ) { ConstHashChange = TheShort_WTEOBL_Array[TheChildIndex]; for ( X = 0; X < NUMBER_OF_ENGLISH_LETTERS; X++ ) { if ( TheChildListFormat & PowersOfTwo[X] ) { Print_CWG_Word_ListRecurse(TheChildIndex, FillThisPlace + 1, X + 'A', WorkingWord, CurrentHashMarker - ConstHashChange + TheShort_WTEOBL_Array[TheChildIndex]); TheChildIndex += 1; } } } else { ConstHashChange = TheUnsignedChar_WTEOBL_Array[TheChildIndex - WTEOBL_Transition]; for ( X = 0; X < NUMBER_OF_ENGLISH_LETTERS; X++ ) { if ( TheChildListFormat & PowersOfTwo[X] ) { Print_CWG_Word_ListRecurse(TheChildIndex, FillThisPlace + 1, X + 'A', WorkingWord, CurrentHashMarker - ConstHashChange + TheUnsignedChar_WTEOBL_Array[TheChildIndex - WTEOBL_Transition]); TheChildIndex += 1; } } } } } // This function will print the word list to a numbered text file. void Print_CWG_Word_List(void){ int X; char MessWithMe[MAX + 1]; WordDump = fopen(OUT_LIST, "w"); for ( X = 1; X <= NUMBER_OF_ENGLISH_LETTERS; X ++ ) { Print_CWG_Word_ListRecurse(X, 0, X + '@', MessWithMe, TheRoot_WTEOBL_Array[X]); } fclose(WordDump); } int main(){ int X; int HashReturnValue; // Array size variables. int TotalNumberOfWords; int NodeArraySize; int ListFormatArraySize; int Root_WTEOBL_ArraySize; int Short_WTEOBL_ArraySize; int UnsignedChar_WTEOBL_ArraySize; char RawUserInput[INPUT_BUFFER_SIZE]; // Read the CWG graph, from the "GRAPH_DATA" file, into the global arrays. FILE *Data = fopen(GRAPH_DATA, "rb"); // Read the array sizes. fread(&TotalNumberOfWords, sizeof(int), 1, Data); fread(&NodeArraySize, sizeof(int), 1, Data); fread(&ListFormatArraySize, sizeof(int), 1, Data); fread(&Root_WTEOBL_ArraySize, sizeof(int), 1, Data); fread(&Short_WTEOBL_ArraySize, sizeof(int), 1, Data); fread(&UnsignedChar_WTEOBL_ArraySize, sizeof(int), 1, Data); // Allocate memory to hold the arrays. TheNodeArray = (int *)malloc(NodeArraySize*sizeof(int)); TheListFormatArray = (int *)malloc(ListFormatArraySize*sizeof(int)); TheRoot_WTEOBL_Array = (int *)malloc(Root_WTEOBL_ArraySize*sizeof(int)); TheShort_WTEOBL_Array = (short int *)malloc(Short_WTEOBL_ArraySize*sizeof(short int)); TheUnsignedChar_WTEOBL_Array = (unsigned char *)malloc(UnsignedChar_WTEOBL_ArraySize*sizeof(unsigned char)); // Read the 5 arrays into memory. fread(TheNodeArray, sizeof(int), NodeArraySize, Data); fread(TheListFormatArray, sizeof(int), ListFormatArraySize, Data); fread(TheRoot_WTEOBL_Array, sizeof(int), Root_WTEOBL_ArraySize, Data); fread(TheShort_WTEOBL_Array, sizeof(short int), Short_WTEOBL_ArraySize, Data); fread(TheUnsignedChar_WTEOBL_Array, sizeof(unsigned char), UnsignedChar_WTEOBL_ArraySize, Data); fclose(Data); // Make the proper assignments and adjustments to use the CWG. IndexCorrection = TotalNumberOfWords; WTEOBL_Transition = Short_WTEOBL_ArraySize; printf("The CWG data-structure is in memory and ready to use.\n\n"); printf("CWG Header Values = |%7d |%7d |%5d |%3d |%5d |%7d |\n", TotalNumberOfWords, NodeArraySize, ListFormatArraySize, Root_WTEOBL_ArraySize, Short_WTEOBL_ArraySize, UnsignedChar_WTEOBL_ArraySize); Print_CWG_Word_List(); // Run the demonstration program loop. while ( TRUE ) { printf("\nEnter a valid word to search for in the CWG... Enter \"999\" to exit.\nInput: "); fgets(RawUserInput, INPUT_BUFFER_SIZE, stdin); CutOffExtraChars(RawUserInput); if ( !strncmp(RawUserInput, ESCAPE_SEQUENCE, THREE) ) break; MakeMeAllCapital(RawUserInput); if ( !ValidateInput(RawUserInput) ) { printf("\nInvalid user input... Try again.\n\n"); continue; } if ( SingleWordSearchBoolean(RawUserInput, strlen(RawUserInput)) ) printf("\n|%s|=[FOUND]", RawUserInput); else printf("\n|%s|=[BOGUS]", RawUserInput); HashReturnValue = SingleWordHashFunction(RawUserInput, strlen(RawUserInput)); printf(", Hash Value|%d|\n\n", HashReturnValue); } // Free the dynamically allocated memory. free(TheNodeArray); free(TheListFormatArray); free(TheRoot_WTEOBL_Array); free(TheShort_WTEOBL_Array); free(TheUnsignedChar_WTEOBL_Array); return 0; } |
Contact: JohnPaul Adamovsky – ( logarithm69@hotmail.com ) - Please feel free to ask me questions.
Phone: (416) 231-7577
This is the mark of my trade...
All The Very Best,
JohnPaul Adamovsky