Data Dictionary Generator
Create a descriptive dictionary of the contents of a client’s database
Table of Contents
Overview
The Forms Data Dictionary tool is used to create a descriptive dictionary of the elements, pages and forms in a client’s database. This tool optionally allows for querying the database based on certain keywords to return any related information. There is also optional functionality for adding AI generated descriptions to the forms and elements in the database. The output of this tool is available in two formats: human readable YAML files, and CSV files.
Explanation of Terms
Forms: Data entry forms available in Client Track, used to input data into the system.
Pages: Pages within the form (there are only multiple per form if there are many elements).
Elements: Individual entries on the page which map to a column in the database.
Tables: Data tables within the database.
Columns: Individual columns in a table.
Forms Data Dictionary YAML Output
The output file is in YAML, which is a human readable data serialization language. It is automatically generated every time the pipeline is run. The YAML is formatted as a list of nested dictionaries, with the top-level representing Forms, the next level representing data entry Elements of that Form, and the lowest level giving examples of the data that may be input to these Elements. The data shown in the Elements has been denormalized, meaning that if the Element references a lookup table, this tool will report the human-readable value. The Form and Element levels contain additional information such as database information. We filter out Elements that are not used for data entry into the database.
Forms Data Dictionary CSV Output
One of the optional parameters to the pipeline is for the CSV output. There are two options, ‘column’ and ‘element’. ‘Column’ will produce a CSV file that lists every unique column and table combination present in the database. The rest of the row contains information such as associated Forms, descriptions, values etc. ‘Element’ will produce a CSV file enumerating the Forms, Pages and data entry Elements in the client’s database. If the CSV parameter is omitted, no CSV will be produced.
Descriptions and Form Names
Another optional parameter to the Forms Data Dictionary is ‘descriptions’. If you opt-in to descriptions, OpenAI will be used to generate names of Forms, and descriptions of each Form and Element. It is important to note that OpenAI may make mistakes and the names/descriptions may not be accurate to the use case of all entities.
Output Naming Convention
The name of each output file will be dynamically created based on user inputs, but follow the pattern of {Organization Prefix}_forms_data_dict_{Keyword}.yaml. This will be explained in greater detail in the How To Use section.
How To Use
Requesting Permissions
Before you start, you will need to request access to two things: first, the Azure Repo “FormsDataDict” with viewer permissions. Contact the Data Science Team and we will grant access. Second, you will need a username and password generated for the database that you are trying to create a descriptive dictionary of. Curtis Pistorius is the point of contact for this.
Setup
Before reading: red text means lines of code that should be run in the terminal. <Surrounding carats> mean that you need to substitute with your specific information, do not include the carats in the code.
If you do not already have Ubuntu and Python installed, follow steps 1-9 of these instructions
To clone the Forms Data Dictionary repository:
- Follow link to repository
- If needed, request access to be a viewer from Curtis Pistorius.
- In the upper right corner, click ‘Clone’
- Copy the https URL
- Open an Ubuntu terminal
- Type/paste: git clone <paste URL here>
- This may prompt you for a password. To get this password, go back to Azure repo and right under where you hit ‘Clone’ there will be a button ‘Generate Git Credentials’. Click this, and once it loads, copy the password and paste it into the terminal, and hit enter.
- Check to make sure the repository was cloned by running ls in the terminal. You should see ‘FormsDataDict’ in the list.
- Enter the repository by running cd FormsDataDict
To create .env file:
- In terminal, run touch .env
- To open the .env file, run nano .env
- Copy and paste the following into the text editor, replacing <> with your specific information:
OAI_API_ENDPOINT = 'https://oai-datasherpa-eu2-dev.openai.azure.com/'
OAI_DEPLOYMENT_NAME = 'gpt-4o' or whatever version the client is set up to use
OAI_API_KEY = '<oai key from azure resource>'
ENVIRONMENT = 'prod'
SERVER='<connection string for database server>'
DATABASE='<name of that database>'
USERNAME='<username to database>'
PASSWORD='<password to database>'
- OAI key can be found in the OAI resource here, under the ‘Develop’ tab
- Use the username and password that Curtis set up for you to access the client database you want described.
- Save this file by hitting Ctrl+X, then Y, then Enter.
To activate a python environment:
- The terminal line should start with your_user:~/FormsDataDict$
- To create the virtual environment, run python -m venv .venv
- Then activate your virtual environment by running source .venv/bin/activate
- Your terminal line should now start with (.venv) your_user:~/FormsDataDict$
To install requirements:
- In your terminal, run pip install -r requirements.txt
- You should see a flurry of loading bars that will end with a 'successfully installed…' message when it completes.
- Now you should be set up and ready to execute the form data dictionary generator.
Run
To run the pipeline from the command line:
- First, decide if you will be using any of the five parameters:
- Tag: If you would like to filter the forms in your database by a certain topic, you can use the tag parameter to query forms that have that tag. An example of a tag would be ‘refugee’, which will return all forms that mention ‘refugee’ in their description or explanation. The tag will be appended to the end of the output form name. This tag must be a single word, no spaces.
- Prefix: Prefix will be prepended to the name of the resulting form dictionary that will be saved to your system. It is intended to be used so that users can have unique identifiers for their form dictionaries, and subsequent runs will not overwrite old versions. Examples of a prefix would be 'UT_Outreach” or “OrganizationName_DateTime” etc. There may not be spaces in the prefix, use underscores instead.
- Description: If you would like to OpenAI to generate descriptions and names for the Forms and Elements, add ‘--add-description’ to the command. It is important to note that Eccovia pays for each request to OpenAI, and these calls add significantly to the runtime. We suggest not using this parameter until you are certain you are ready to create the final version of whatever dictionary you need.
- Limit: If you would like to explore what the data dictionary can create, we recommend limiting the number of forms that AI will provide descriptions for. For example, if you would like to see which/how many forms are returned by a tag, using limit = 1 will provide descriptions for just the first file, so you can get a preview of what it will return without overloading OpenAI.
- CSV: If you would like to create a CSV along with the YAML file, choose either ‘element’ or ‘column’. The tag ‘element’ will focus information on Forms, while ‘column’ will focus primarily on database tables and columns.
- You may choose to use all, some or none of the parameters. If tag is not specified, the dictionary will contain all forms in your database server. If prefix is not specified, the name of the file will be generic and can be overwritten. An example of the output form with tag “refugee” and prefix “UT_Outreach_2024” would be “UT_Outreach_forms_data_dict_refugee”. ‘Description’, ‘Limit’ and ‘CSV’ all default to none.
- In the terminal, run: python forms_yaml_generator.py --tag “<your tag>” --prefix “<your prefix>” --add-description --limit <int> --csv “<element or column>”
- If there are >50 forms and you choose --add-description, it will prompt you to confirm that you want to proceed. Again, avoid using --add-descriptions until you are sure you need the full output.
Troubleshooting:
- If you are getting an error related to pyodbc like the one below, or something relating to the server/database engine, try installing the latest ODBC driver here by copying and running the ‘Ubuntu’ terminal instructions. Then run pip install pyodbc before retrying.