Script Descriptions

This wiki page describes all the scripts used in the caas-aspace-repository. Every description should include the following info:

A short description of the script's purpose
A link to any testing script(s) or automated testing for this specific script
A list of requirements to get the script to run. This can include:
- Packages/Libraries (ex. ArchivesSnake, loguru), including links to their documentation
- Folders/Directories where files will be written to/read from (ex. logs, test_data)
- Any special requirements for credentials (ex. secrets.py file, environment variables)

Optional information can include, but is not limited to:

Arguments to run the script (ex. files being passed to the script, --help, -dR)
Additional context for the script's purpose or design
Screenshots of the script

utilities.py

A compilation of useful functions shared across Python scripts. Included are:

ASpaceAPI(aspace_api, aspace_un, aspace_pw) class - handles common functions when working with the API. Connects to the ASnake client upon instantiation.
- get_repo_info(self) - Gets all the repository information for an ArchivesSpace instance in a list and assigns it to self.repo_info
- get_objects(self, repository_uri, record_type, parameters=('all_ids', True)) - Intakes a repository URI and returns all the digital object IDs as a list for that repository
- get_object(self, record_type, object_id, repo_uri='') - Gets and returns a digital object's JSON metadata from its URI
- update_object(self, object_uri, updated_json) - Posts the updated JSON metadata for the given object_uri to ArchivesSpace
ASpaceDatabase(as_db_un, as_db_pw, as_db_host, as_db_name, as_db_port) class - handles the connection to and data retrieval from the ArchivesSpace database
- connect_db(self) - Connects to the ArchivesSpace test database with credentials provided in local secrets.py file
- query_database(self, statement) - Runs a query on the database
- close_connection(self) - Closes the cursor and connection to the ArchivesSpace database
client_login(as_api, as_un, as_pw) function - logs in to the ArchivesSnake client and returns the client
read_csv(csv_file) function - reads a CSV file and returns csv_dict, a list of the values in each row of the CSV file
check_url(url) function - checks a URL and returns True if it responds with a 200 status code; otherwise logs and prints the error status code
record_error(message, status_input) function - Prints and logs an error message and the code/parameters causing the error
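
As a rough illustration of the read_csv(csv_file) helper listed above, the sketch below reads a CSV into a list of rows. It assumes that each row is returned as a dictionary keyed by the header row (via csv.DictReader); the actual function in utilities.py may structure csv_dict differently and may add logging or error handling.

```python
import csv


def read_csv(csv_file):
    """Read a CSV file and return its rows as a list of dicts keyed by header.

    Sketch only: the real utilities.py implementation may differ.
    """
    with open(csv_file, newline="", encoding="utf-8") as csv_handle:
        # DictReader uses the first row as the keys for every following row
        return list(csv.DictReader(csv_handle))
```

Under this assumption, a file with the header uri,title and one data row comes back as a one-item list containing a dictionary with the keys 'uri' and 'title'.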

Tests

Requirements:

Packages:
- ArchivesSnake
- csv
- mysql.connector
- loguru
- requests
ArchivesSpace username, password, API URL in a secrets.py file
logs directory for storing local log files

One-time Scripts

delete_aaadigobjs.py

This script takes a CSV of archival object refIDs and runs each refID through an SQL query, which gets all associated digital objects with that archival object. It then returns the completed digital object URI and adds it to a list. Then it goes through the list of URIs and deletes them from ArchivesSpace using the delete_objects.py script.

Tests:

deleteaaadigobjs_tests.py

Requirements:

utilities.py
Packages:
- argparse
- csv
- dotenv
- loguru
- os
- pathlib
- subprocess
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files
CSV file of archival object refIDs to search for linked digital objects to delete (the path of the file as an argument to the script)
jsonl file for saving data of digital objects before deletion (the path of the file as an argument to the script)

delete_dometadata.py

This script iterates through all the digital objects in every repository in SI's ArchivesSpace instance except Test, Training, and NMAH-AF. It parses each one for data in the following fields: agents, dates, extents, languages, notes, and subjects. It then deletes any data within those fields, except the digitized date, and uploads the updated digital object back to ArchivesSpace.

Tests:

dometadata_tests.py

Requirements:

Packages:
ArchivesSpace username, password, API URL in a secrets.py file
logs directory for storing local log files
test_data/dometadata_testdata.py file, with the following variables:
- test_record_type = string - the object endpoint ArchivesSpace uses; ex. 'digital_objects'
- test_object_id = int - the number of the digital object you want to use for testing (must have metadata in above-mentioned fields)
- test_object_repo_uri = string - the repository URI where the test digital object is; ex. '/repositories/12'
- test_object_user_identifier = string - the identifier that users input in the digital_object_id field for testing; ex. 'NMAI.AC.066.ref21.1'
- test_digital_object_dates = dict - JSON data from a digital object that contains multiple date subrecords
- test_digital_object_dates_deleted = dict - JSON data from the same digital object as above but without any data in the dates field (i.e. dates = [])

eepa_cameroonreport.py

This script takes CSV files listing specific collections from EEPA repository, extracts the resource URIs listed in each CSV, uses the ArchivesSpace API to grab the Abstract or Scope and Contents note from the JSON data, and writes the note to the provided CSV in a new column.
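
The note lookup described above could be sketched roughly as follows. This is not the script's actual code: the function name get_abstract_or_scope is hypothetical, the preference for the abstract over the scope note is an assumption, and so is the JSON layout (single-part notes holding a "content" list, multipart notes holding "subnotes" with a "content" string each).

```python
def get_abstract_or_scope(resource_json):
    """Return the abstract text if present, else the scope and contents text, else None.

    Hypothetical helper; the note layout below is an assumption.
    """
    notes = resource_json.get("notes", [])
    # Check for an abstract first, then fall back to scope and contents
    for wanted_type in ("abstract", "scopecontent"):
        for note in notes:
            if note.get("type") != wanted_type:
                continue
            if "content" in note:
                # Assumed single-part note: content is a list of strings
                return " ".join(note["content"])
            # Assumed multipart note: text lives in each subnote's content
            return " ".join(sub.get("content", "") for sub in note.get("subnotes", []))
    return None
```

The three test variables listed under Requirements (abstract only, scope only, neither) map onto the three return paths of a helper like this one.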

Tests:

eepacameroon_tests.py

Requirements:

CSV input(s) containing the following columns: ead_id,title,dates,publish,level,extents,uri
- Note: This script originally had 3 CSVs to iterate through, but any number of CSVs should work
ArchivesSnake
ArchivesSpace username, password, API URL in a secrets.py file
logs directory for storing local log files
test_data/eepacameroon_testdata.py file, with 3 variables:
- test_abstract_only_json = dict - JSON data from a resource that contains only an abstract note
- test_scope_only_json = dict - JSON data from a resource that contains only a scope note
- test_no_abstract_scope_json = dict - JSON data from a resource that contains no abstract or scope note
- Note for the above variables and values: these are for testing. You can get these from your API by running a client.get request for resources using their URI and the .json() function to return data in JSON format.

identifier_report.py

This script reads a CSV containing all the resource and accession identifiers in ArchivesSpace and prints a dictionary containing all the unique, non-alphanumeric characters in the identifiers and their counts

Requirements:

CSV input containing the following columns: id, repo_id, identifier, title, ead_id, recordType
- identifier should be structured like so: "['id_0','id_1','id_2','id_3']"
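
A minimal sketch of the counting step, assuming each identifier cell is a string in the list format shown above and that empty parts may appear as None. The function name count_special_characters is hypothetical and not taken from the script.

```python
import ast
from collections import Counter


def count_special_characters(identifier_cells):
    """Count every non-alphanumeric character found across identifier cells.

    Each cell is expected to look like "['id_0','id_1','id_2','id_3']".
    """
    counts = Counter()
    for cell in identifier_cells:
        parts = ast.literal_eval(cell)  # turn the string into a real list
        for part in parts:
            if part is None:
                continue  # skip empty identifier parts
            counts.update(char for char in str(part) if not char.isalnum())
    return dict(counts)
```

Note that whitespace counts as non-alphanumeric in this sketch, so stray spaces in identifiers would show up in the result.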

remove_missingtitles.py

This script takes a CSV of resources and archival objects from every repository with "Missing Title" titles in note lists and removes the title from the metadata, then posts the update to ArchivesSpace

Tests:

missingtitles_tests.py

Requirements:

Packages:
- ArchivesSnake
- loguru
ArchivesSpace username, password, API URL in a secrets.py file
logs directory for storing local log files
test_data/missingtitles_testdata.py file, with the following:
- test_object_metadata = {ArchivesSpace resource or archival object metadata} for testing. Can get this from your API by using a client.get request for a resource or archival object that has a "Missing Title" in one of its notes with a list.
- test_notes = [ArchivesSpace resource or archival object notes list] for testing. Can get this from your API using a client.get request for a resource or archival object that has a "Missing Title" in one of its notes with a list and taking all the data found in "notes" = [list of notes]
test_data/MissingTitles_BeGone.csv - a csv file containing the URIs of the objects that have "Missing Title" in their notes. URIs should be in the 4th spot (row[3])

report_sovatreeview.py

This script finds resource records with a finding aid status of "Publish (sync with EDAN/SOVA)" and that have a published archival object on the highest level component (c01), takes the list of EAD IDs from those resources, and tests them against "https://sova.si.edu/fancytree/", seeing if they return an empty treeview in SOVA. If so, the EAD ID and fancytree URL is logged in a CSV output file.

Tests:

deleteaaadigobjs_tests.py //TODO: add later!

Requirements:

utilities.py
Packages:
- argparse
- csv
- dotenv
- loguru
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files

update_accessrestrictnotes.py

This script takes a CSV containing resource info (Id, resource_uri, updated_access_note) and retrieves the JSON data for any resources whose rows in the CSV contain data in the updated_access_note column. Then, it makes a copy of the resource JSON, updating the accessrestrict note's content with the data found in the updated_access_note cell. Finally, it posts the updated JSON to ArchivesSpace.
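
The copy-and-update step could look something like the sketch below. The function name update_access_note is hypothetical, and the note layout (an accessrestrict note whose text lives in the "content" of its subnotes) is an assumption about the resource JSON rather than something confirmed by the script.

```python
import copy


def update_access_note(resource_json, updated_access_note):
    """Return a copy of resource_json with its accessrestrict note text replaced.

    Hypothetical helper; the original JSON is left untouched.
    """
    updated_json = copy.deepcopy(resource_json)
    for note in updated_json.get("notes", []):
        if note.get("type") == "accessrestrict":
            for subnote in note.get("subnotes", []):
                if "content" in subnote:
                    # Swap in the text from the updated_access_note cell
                    subnote["content"] = updated_access_note
    return updated_json
```

Working on a deep copy keeps the original JSON available for logging or comparison before anything is posted.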

Tests:

No tests built, see PR #142

Requirements:

utilities.py
Packages:
- argparse
- copy
- dotenv
- loguru
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files

update_agentids.py

This script takes an Excel file, "combined-aspace-agents-edited.xlsx", and uses the "combined and cleaned" sheet. It takes the agent link from the "Aspace_link" column, grabs the agent JSON from ArchivesSpace using the API, then adds a new Record ID to the agent if one does not already exist. Non-matching record IDs are logged and the one existing in ArchivesSpace remains. Record IDs are located as columns in the sheet, named "Wikidata_id", "SNAC_id", "LCNAF_id", "ULAN_id", "VIAF_id", and "local". After adding the Record ID to the JSON locally, the script sorts the IDs in the above order and posts the updated agent record via the ArchivesSpace API.

Script Arguments

excelPath: path to Excel input file, type=str
objectType: resources/archival_objects/digital_objects, type=str
-dR OR --dry-run: runs the script without posting updates to ArchivesSpace, printing the expected results instead, action='store_true'
-v OR --version: display script version, version='%(prog)s - Version #.#'
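
For reference, the arguments listed above could be declared with argparse roughly as follows. This is a sketch, not the script's actual parser: the build_parser wrapper and the exact help strings are assumptions, and the version string is kept as the placeholder shown in the list.

```python
import argparse


def build_parser():
    """Build a parser matching the documented arguments (hypothetical wrapper)."""
    parser = argparse.ArgumentParser(prog="update_agentids.py")
    parser.add_argument("excelPath", help="path to Excel input file", type=str)
    parser.add_argument("objectType", help="resources/archival_objects/digital_objects", type=str)
    # The destination name is taken from the long option, so this is read as args.dry_run
    parser.add_argument("-dR", "--dry-run", action="store_true",
                        help="run without posting updates to ArchivesSpace")
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s - Version #.#")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.excelPath, args.objectType, args.dry_run)
```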

Tests:

updateagentids_tests.py Uses data from updateagentids_testdata.py

Requirements:

Packages:
- argparse
- collections
- copy
- dotenv
- loguru
- os
- pandas
  - NOTE: A file named secrets.py will not work alongside pandas, because it causes ImportError: cannot import name randbits when pandas loads. Rename your secrets.py file to something else to get it to work. Make sure NOT to commit or git add the renamed file, and don't let your IDE refactor your project for this change (it would result in a lot of unnecessary changes to the repo). See the first answer to What does "ImportError: cannot import name randbits" mean?.
- pathlib
- sys
- time
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files and the jsonlines file for storing the original agent data
test_data directory for accessing the .xlsx file (or anywhere, so long as you copy the path of the file as an argument to the script)

update_coordinates.py

This script finds locations with the label "MapCase" and looks for those that have leading zeros in their indicators such as 01, 02, etc. It then removes the leading zeros for coordinate_1_indicators and searches for and removes leading zeros from coordinate_2_indicators as well. It then posts the updates to those specific locations to ASpace. An SQL query is used to find the appropriate locations.

Script arguments

Call with python update_coordinates.py <jsonl_filepath>.jsonl <log_folder_path>

Adjust the SQL query associated with the find_mapcases variable, currently as:

SELECT location.id FROM location 
WHERE 
  coordinate_1_label = "MapCase" 
AND 
  building = "NMAH" 
AND 
  coordinate_1_indicator LIKE "0_" 
ORDER BY 
  coordinate_1_indicator
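
Once the query has returned the matching locations, the zero-stripping step could be sketched as below. The function name strip_leading_zeros is hypothetical, and the sketch assumes the indicators are plain strings on the location JSON under the coordinate_1_indicator and coordinate_2_indicator keys.

```python
import copy


def strip_leading_zeros(location_json):
    """Return a copy of a location with leading zeros removed from its indicators."""
    updated = copy.deepcopy(location_json)
    for field in ("coordinate_1_indicator", "coordinate_2_indicator"):
        value = updated.get(field)
        if isinstance(value, str) and value.startswith("0"):
            # lstrip removes every leading zero; keep a single "0" if nothing remains
            updated[field] = value.lstrip("0") or "0"
    return updated
```

Indicators such as "10" are left alone because only leading zeros are stripped.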

Tests:

updatecoordinates_tests.py

Requirements:

utilities.py
Packages:
- argparse
- copy
- dotenv
- loguru
- os
- pathlib
- sys
- time
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files and the jsonlines file for storing the original agent data

update_znames.py

This script collects all users from ArchivesSpace, parses their usernames to separate any starting with 'z-' and ending with '-expired-' into just the text in-between, then updates the username in ArchivesSpace with the new username
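
The username parsing could be expressed with a regular expression, as in this sketch. The pattern and the function name parse_zname are assumptions based on the description above, not the script's own code.

```python
import re

# Matches usernames that start with 'z-' and end with '-expired-'
Z_NAME_PATTERN = re.compile(r"^z-(?P<name>.+)-expired-$")


def parse_zname(username):
    """Return the text between 'z-' and '-expired-', or None if the name does not match."""
    match = Z_NAME_PATTERN.match(username)
    return match.group("name") if match else None
```

Returning None for non-matching usernames makes it easy to skip users that need no update.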

Tests:

znames_tests.py

Requirements:

Packages:
- ArchivesSnake
- loguru
ArchivesSpace username, password, API URL in a secrets.py file
logs directory for storing local log files
test_data/znames_testdata.py file, with viewer_user = {ArchivesSpace viewer user metadata} for testing. Can get this from your API by running a client.get request for the 'viewer' user in your ArchivesSpace instance.

Repeatable Scripts

delete_objects.py

This script takes a CSV of URIs and object type as inputs, grabs all the objects' JSON data using the API, saves them to a jsonL file using the jsonl_path input, and then deletes them in ArchivesSpace. Structure the CSV like so:

uri
/repositories/##/object_type/object_id
/repositories/##/object_type/object_id
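
The backup step that runs before deletion could be sketched like this. The function name save_to_jsonl is hypothetical; it only shows the one-JSON-object-per-line format and does not call the ArchivesSpace API.

```python
import json


def save_to_jsonl(records, jsonl_path):
    """Append each record's JSON to jsonl_path, one object per line."""
    with open(jsonl_path, "a", encoding="utf-8") as jsonl_file:
        for record in records:
            jsonl_file.write(json.dumps(record) + "\n")
```

Opening the file in append mode means repeated runs add to the same backup file rather than overwriting it; whether the real script appends or overwrites is not stated here.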

It's recommended to check to see if locations have any top containers associated with them. You can run an SQL query to find any associated top containers such as the following:

SELECT * FROM top_container_housed_at_rlshp
WHERE
location_id in (49324,49338,49323,etc.)

Requirements

utilities.py
Packages:
- argparse
- dotenv
- loguru
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files
CSV file of URIs for objects to delete (or anywhere, so long as you copy the path of the file as an argument to the script)
jsonl file for saving data of objects before deletion

Tests

Tests were added for utilities.py delete_object() function.

merge_subjects.py

This script merges pairs of existing subjects listed in a provided CSV, retaining the merge destination and removing the merge candidate. It is currently customized to support the needs of NMAI, but this hardcoded NMAI metadata can be changed/updated in the future.

Requirements:

ArchivesSnake
Environment-based ArchivesSpace username, password, API URL in a .env.{environment} file:
- On your local:
  1. Create a new .env.dev file containing local credentials
  2. export ENV=dev
  3. Run script
- On test:
  1. Create a new .env.test file containing test credentials
  2. export ENV=test
  3. Run script
- On prod:
  1. Create a new .env.prod file containing prod credentials
  2. export ENV=prod
  3. CAREFULLY run script
logs directory for storing local log files

mergesubjects_tests.py

Unittests for merge_subjects.py

Requirements:

test_data/subjects_testdata.py file, containing the following:
- test_merge_subject_destination = {JSON representation of an existing subject that will survive the merge} for testing.
  If newsubjects_tests.py has been run previously, you can use one of the subjects created by that test.
- test_merge_subject_candidate = {JSON representation of an existing subject that will be removed during the merge} for testing.
  If newsubjects_tests.py has been run previously, you can use one of the subjects created by that test.
test_data/mergesubjects_testdata.csv - a csv file of subjects to be merged, containing:
- aspace_subject_id - id of the merge destination/subject to be retained. If newsubjects_tests.py has been run previously, this can be one of the subjects created by those tests.
- title - title of the merge destination/subject to be retained. The title must match the existing subject with the above id.
- aspace_subject_id2 - id of the merge candidate/subject to be removed. If newsubjects_tests.py has been run previously, this can be one of the subjects created by those tests.
- Merge into - title of the merge candidate/subject to be removed. The title must match the existing subject with the above id.

new_subjects.py

This script creates new subjects from a provided CSV. It is currently customized to support the needs of NMAI, but this hardcoded NMAI metadata can be changed/updated in the future.

Requirements:

ArchivesSnake
Environment-based ArchivesSpace username, password, API URL in a .env.{environment} file:
- On your local:
  1. Create a new .env.dev file containing local credentials
  2. export ENV=dev
  3. Run script
- On test:
  1. Create a new .env.test file containing test credentials
  2. export ENV=test
  3. Run script
- On prod:
  1. Create a new .env.prod file containing prod credentials
  2. export ENV=prod
  3. CAREFULLY run script
logs directory for storing local log files

newsubjects_tests.py

Unittests for new_subjects.py

Requirements:

test_data/subjects_testdata.py file, containing the following:
- test_new_subject_metadata = {JSON representation of a new subject} for testing.
- duplicate_new_subject = test_new_subject_metadata ensures we can count on duplicate_new_subject to produce a not unique error during testing.
test_data/newsubjects_testdata.csv - a csv file of new subjects to be created, containing:
- new_title
- new_scope_note
- new_EMu_ID

grouppermissions_tests.py

Unittests for report_grouppermissions.py

remove_datetimestamps.py

This script takes a CSV of archival object URIs with timestamps in their date begin and/or end fields, retrieves the objects using the ASpace API, removes the timestamps, and re-posts the archival objects to ASpace.

The CSV should have the following columns:

uri, ex: repositories/27/archival_objects/3938698
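
The timestamp removal could be sketched as follows. The function names remove_timestamp and strip_date_timestamps are hypothetical and are not the script's update_date() function. The sketch also assumes that timestamps follow the date after a "T" or a space, and that the values sit in the "begin" and "end" fields of each entry in "dates".

```python
import copy


def remove_timestamp(date_value):
    """Drop the time portion of a date string, e.g. '2001-05-04T00:00:00' -> '2001-05-04'."""
    if not date_value:
        return date_value
    # Cut at the first 'T', then at the first space, whichever is present
    return date_value.split("T", 1)[0].split(" ", 1)[0]


def strip_date_timestamps(archival_object_json):
    """Return a copy of the archival object with timestamps removed from begin/end dates."""
    updated = copy.deepcopy(archival_object_json)
    for date in updated.get("dates", []):
        for field in ("begin", "end"):
            if field in date:
                date[field] = remove_timestamp(date[field])
    return updated
```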

Requirements:

utilities.py
Packages:
- argparse
- copy
- dotenv
- loguru
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files
jsonl file for storing data for backup

removedatetimestamps_tests.py

Unittests for remove_datetimestamps.py, specifically testing the update_date() function.

report_grouppermissions.py

This script generates a spreadsheet report of all the permissions in ArchivesSpace. The first row lists the permissions, one per column, and each subsequent row represents a user group. If a user group has a permission, the cell contains the text of that permission; if not, it contains FALSE. The report is used to check that permissions are the same for each user group across all repositories.

Requirements:

Packages:
- archivessnake
- datetime
- dotenv
- loguru
- mysql-connector
- openpyxl
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files

suppress_edanrecords.py

This script takes a CSV of EAD IDs and suppresses the records in EDAN, removing them from view in SOVA. Optionally, it can also suppress the resources in ASpace if the resource URI is provided.

The CSV should have the following columns:

eadID, ex: NAA.2009-07
resourceURI, ex. /repositories/36/resource/12345 (optional - only needed if suppressing records in ASpace)

Arguments:

csvPath - path to CSV input file, type=str
logFolder - path to the log folder for storing log files, type=str
-sA or --suppress-aspace - suppress the record in ArchivesSpace, action='store_true'
-jA or --jwt-algorithm - algorithm for encoding JWT payload. Ex. HS256, type=str
-dR or --dry-run - dry run?, action='store_true'
--version - version info

Requirements

utilities.py
Packages:
- argparse
- dotenv
- datetime
- jwt
- loguru
- os
- pathlib
- requests
- sys
EDAN API, key, ISS, JWTID in the .env.prod file
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files
test_data directory for accessing the CSV file (or anywhere, so long as you copy the path of the file as an argument to the script)

Tests

suppressedanrecords_tests.py

suppress_objects.py

This script takes a CSV file containing the URIs or URLs of objects to suppress in the ArchivesSpace staff interface, unpublishes and sets the finding aid status to staff only (for resources). The CSV should have a header row that reads "URI", and you can pass the object's repository identifier number and object type (resources, archival_objects, digital_objects) as script arguments. The script takes the CSV, splits the URI into the ArchivesSpace resource ID, repository ID (if not already supplied) and object type (if not already supplied), grabs the resource JSON data, then passes the data to the update_publish_status function, which modifies the JSON to publish=False and finding_aid_status=staff_only (for resources only). Then it posts the updated JSON to ArchivesSpace and suppresses the record.

The arguments passed to the script should look like: python suppress_objects.py <filename>.csv <repo_id> <object_type>

Type python suppress_objects.py --help for more info on commands, including -dR which is a dry run that doesn't execute changes, but spits out what changes will be made once the script runs.
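
A minimal sketch of the JSON change that the update_publish_status function is described as making. The name set_staff_only is hypothetical, and detecting resources through the "jsonmodel_type" key is an assumption about how the script distinguishes them.

```python
import copy


def set_staff_only(object_json):
    """Return a copy of the object with publish=False, plus staff_only status for resources."""
    updated = copy.deepcopy(object_json)
    updated["publish"] = False
    # Only resources carry a finding aid status (assumed check on jsonmodel_type)
    if updated.get("jsonmodel_type") == "resource":
        updated["finding_aid_status"] = "staff_only"
    return updated
```

In a dry run, the original and the returned copy can be compared and printed without posting anything.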

Requirements

utilities.py
Packages:
- argparse
- copy
- dotenv
- loguru
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files
test_data directory for accessing the CSV file (or anywhere, so long as you copy the path of the file as an argument to the script)

Tests

suppressobjects_tests.py
test_data/suppressobjects_testdata.py

touch_resources.py

This script takes a CSV file containing the URIs for resource records to get and post back to ArchivesSpace without updating any data, used to kickstart an update to EDAN/SOVA by updating the system_mtime field.

Structure the CSV like so:

uri
repositories/##/resources/object_id
repositories/##/resources/object_id

Type python touch_resources.py --help for more info on commands, including -dR which is a dry run that doesn't execute changes, but spits out what changes will be made once the script runs.

Requirements

utilities.py
Packages:
- argparse
- dotenv
- loguru
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files
test_data directory for accessing the CSV file (or anywhere, so long as you copy the path of the file as an argument to the script)

Tests

N/A

update_locations.py

This script takes a CSV file containing the URIs of locations to be updated and the repository identifier number as script arguments, structured like so:

python update_locations.py <filename>.csv <repo_id>

The CSV should have at least one column with the header URI. The script adds an 'owner repo' = {'ref': 'repository/<repo_number>'} key-value pair to the location JSON retrieved from the API, then posts the updated JSON to ArchivesSpace.

Requirements

utilities.py
Packages:
- copy
- dotenv
- loguru
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files and storing jsonlines output file
test_data directory for accessing the CSV file

Tests

updatelocations_tests.py

update_refids.py

This script takes a CSV of archival object URIs as inputs, grabs all the archival objects' JSON data using the API, saves them to a jsonL file using the jsonl_path input, and updates the archival objects' update_refid field to True, posting them back to ArchivesSpace which regenerates the refids.

Requirements

utilities.py
Packages:
- argparse
- dotenv
- loguru
- os
- pathlib
- sys
ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
logs directory for storing local log files
CSV file of URIs for objects to update (the path of the file as an argument to the script)
jsonl file for saving data of objects before they are updated (the path of the file as an argument to the script)

Tests

No tests were added, since the only unique thing this script is doing is changing the update_refid field to True in the objects given.

update_subjects.py

This script updates existing ArchivesSpace subjects from a provided CSV. It is currently customized to support the needs of NMAI, but can be changed/updated in the future.

Requirements:

ArchivesSnake
Environment-based ArchivesSpace username, password, API URL in a .env.{environment} file:
- On your local:
  1. Create a new .env.dev file containing local credentials
  2. export ENV=dev
  3. Run script
- On test:
  1. Create a new .env.test file containing test credentials
  2. export ENV=test
  3. Run script
- On prod:
  1. Create a new .env.prod file containing prod credentials
  2. export ENV=prod
  3. CAREFULLY run script
logs directory for storing local log files

updatesubjects_tests.py

Unittests for update_subjects.py

Requirements:

test_data/subjects_testdata.py file, containing the following:
- test_update_subject_metadata = {JSON representation of an existing subject} for testing. If newsubjects_tests.py has been run previously, you can use one of the subjects created by that test.
test_data/newsubjects_testdata.csv - a csv file of changes to be made to an existing subject, containing:
- aspace_subject_id - id of the subject to update, this can match that in test_data/subjects_testdata.py
- new_title
- new_scope_note
- new_EMu_ID

SQL Scripts

delete_dometadata_query_union_example.sql

@fordmadox to fill in later

Requirements:

@fordmadox to fill in later

Tests:

@fordmadox to fill in later

report_cfchlinkedagents.sql

Retrieves all agent_persons that are linked to in the CFCH repository. Updating this script to another repository is possible by changing the ao.repo_id code to the desired repository.

Requirements:

MySQL software or terminal
Credentials to the ASpace Test and Prod database

Tests:

There are no tests with this script. Need to research how to test SQL queries or if it's necessary. This query does not modify any data, so testing may not be necessary. I did run it against test before running against prod.

report_digitalobjecttypes.sql

This query counts the digital objects of each digital object type in each repository. It also includes a count of all digital objects without a digital object type per repository. Zero counts are reported correctly because the query counts digital object IDs rather than rows.

Requirements:

MySQL software or terminal
Credentials to the ASpace Test and Prod database

Tests:

No tests, but the query was run against ASpace test before being run in prod and was reviewed by other team members.
