Script Descriptions
This Wiki page is used to describe all the scripts being used in the caas-aspace-repository. All descriptions should include the following info:
- A short description of the script's purpose
- A link to any testing script(s) or automated testing for this specific script
- A list of requirements to get the script to run. This can include:
- Packages/Libraries (ex. ArchivesSnake, loguru), including links to their documentation
- Folders/Directories where files will be written to/read from (ex. logs, test_data)
- Any special requirements for credentials (ex. secrets.py file, environment variables)
Optional information can include, but is not limited to:
- Arguments to run the script (ex. files being passed to the script, --help, -dR)
- Additional context for the script's purpose or design
- Screenshots of the script
A compilation of useful functions shared across Python scripts. Included are:
- ASpaceAPI(aspace_api, aspace_un, aspace_pw) class - handles common functions when working with the API. Connects to the ASnake client upon instantiation.
- get_repo_info(self) - Gets all the repository information for an ArchivesSpace instance in a list and assigns it to self.repo_info
- get_objects(self, repository_uri, record_type, parameters=('all_ids', True)) - Intakes a repository URI and returns all the digital object IDs as a list for that repository
- get_object(self, record_type, object_id, repo_uri='') - Get and return a digital object JSON metadata from its URI
- update_object(self, object_uri, updated_json) - Posts the updated JSON metadata for the given object_uri to ArchivesSpace
- ASpaceDatabase(as_db_un, as_db_pw, as_db_host, as_db_name, as_db_port) class - Handles the connection to and data retrieval from the ArchivesSpace database
- connect_db(self) - Connects to the ArchivesSpace test database with credentials provided in local secrets.py file
- query_database(self, statement) - Runs a query on the database
- close_connection(self) - Closes the cursor and connection to the ArchivesSpace database
- client_login(as_api, as_un, as_pw) function - Login to the ArchivesSnake client and return client
- read_csv(csv_file) function - Reads a CSV file and returns csv_dict, a list of the values in each row of the CSV file
- check_url(url) function - Checks a URL and returns True if it responds with a 200 status code; otherwise logs and prints the error status code
- record_error(message, status_input) function - Prints and logs an error message and the code/parameters causing the error
- Packages:
- ArchivesSpace username, password, API URL in a secrets.py file
- logs directory for storing local log files
This script takes a CSV of archival object refIDs and runs each refID through an SQL query, which gets all associated digital objects with that archival object. It then returns the completed digital object URI and adds it to a list. Then it goes through the list of URIs and deletes them from ArchivesSpace using the delete_objects.py script.
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
- CSV file of archival object refIDs to search for linked digital objects to delete (the path of the file as an argument to the script)
- jsonl file for saving data of digital objects before deletion (the path of the file as an argument to the script)
This script iterates through all the digital objects in every repository in SI's ArchivesSpace instance except Test, Training, and NMAH-AF. It parses each digital object for data in the following fields: agents, dates, extents, languages, notes, and subjects. It then deletes any data within those fields, except the digitized date, and uploads the updated digital object back to ArchivesSpace.
- Packages:
- ArchivesSpace username, password, API URL in a secrets.py file
- logs directory for storing local log files
- test_data/dometadata_testdata.py file, with the following variables:
  - test_record_type = string - the object endpoint ArchivesSpace uses; ex. 'digital_objects'
  - test_object_id = int - the number of the digital object you want to use for testing (must have metadata in above-mentioned fields)
  - test_object_repo_uri = string - the repository URI where the test digital object is; ex. '/repositories/12'
  - test_object_user_identifier = string - the identifier that users input in the digital_object_id field for testing; ex. 'NMAI.AC.066.ref21.1'
  - test_digital_object_dates = dict - JSON data from a digital object that contains multiple date subrecords
  - test_digital_object_dates_deleted = dict - JSON data from the same digital object as above but without any data in the dates field (i.e. dates = [])
This script takes CSV files listing specific collections from EEPA repository, extracts the resource URIs listed in each CSV, uses the ArchivesSpace API to grab the Abstract or Scope and Contents note from the JSON data, and writes the note to the provided CSV in a new column.
- CSV input(s) containing the following columns: ead_id,title,dates,publish,level,extents,uri
- Note: This script originally had 3 CSVs to iterate through, but any number of CSVs should work
- ArchivesSnake
- ArchivesSpace username, password, API URL in a secrets.py file
- logs directory for storing local log files
- test_data/eepacameroon_testdata.py file, with 3 variables:
  - test_abstract_only_json = dict - JSON data from a resource that contains only an abstract note
  - test_scope_only_json = dict - JSON data from a resource that contains only a scope note
  - test_no_abstract_scope_json = dict - JSON data from a resource that contains no abstract or scope note
  - Note for the above variables and values: these are for testing. You can get these from your API by running a client.get request for resources using their URI and the .json() function to return data in JSON format.
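The note lookup described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `get_note_text` is hypothetical, the choice to prefer the abstract over the scope note is an assumption, and the note layout (single-part notes with a `content` list, multipart notes with `subnotes`) reflects the usual shape of ArchivesSpace resource JSON and should be checked against your own data.

```python
def get_note_text(resource_json):
    """Return the abstract text if present, else the scope and contents text, else None."""
    notes = resource_json.get("notes", [])
    # Assumption: abstract is preferred over scopecontent when both exist
    for wanted_type in ("abstract", "scopecontent"):
        for note in notes:
            if note.get("type") != wanted_type:
                continue
            if note.get("jsonmodel_type") == "note_singlepart":
                # Single-part notes keep their text in a list of strings
                return " ".join(note.get("content", []))
            # Multipart notes keep their text in subnotes
            subnotes = note.get("subnotes", [])
            return " ".join(sub["content"] for sub in subnotes if "content" in sub)
    return None


# Made-up records shaped like the three test variables listed above
abstract_only = {"notes": [{"jsonmodel_type": "note_singlepart", "type": "abstract",
                            "content": ["An abstract."]}]}
scope_only = {"notes": [{"jsonmodel_type": "note_multipart", "type": "scopecontent",
                         "subnotes": [{"jsonmodel_type": "note_text",
                                       "content": "Scope text."}]}]}
no_notes = {"notes": []}

print(get_note_text(abstract_only))
print(get_note_text(scope_only))
print(get_note_text(no_notes))
```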
This script reads a CSV containing all the resource and accession identifiers in ArchivesSpace and prints a dictionary containing all the unique, non-alphanumeric characters in the identifiers and their counts
- CSV input containing the following columns: id, repo_id, identifier, title, ead_id, recordType
- identifier should be structured like so: "['id_0','id_1','id_2','id_3']"
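As a rough sketch of the counting step described above (not the script's actual code), the stringified identifier list can be parsed with `ast.literal_eval` and tallied with a `Counter`. The function name `count_special_characters` and the sample row are made up for illustration.

```python
import ast
import csv
import io
from collections import Counter


def count_special_characters(rows):
    """Count every non-alphanumeric character found in the identifier parts."""
    counts = Counter()
    for row in rows:
        # The identifier cell holds a stringified list, e.g. "['id_0','id_1','id_2','id_3']"
        parts = ast.literal_eval(row["identifier"])
        for part in parts:
            if part is None:
                continue
            counts.update(ch for ch in str(part) if not ch.isalnum())
    return dict(counts)


# An in-memory CSV standing in for the real export
sample = io.StringIO(
    "id,repo_id,identifier,title,ead_id,recordType\n"
    "1,2,\"['NMAI.AC','066',None,None]\",Example,NMAI.AC.066,resource\n"
)
print(count_special_characters(csv.DictReader(sample)))
```

Note that this sketch treats whitespace as a non-alphanumeric character too, which may or may not match what the script reports.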
This script takes a CSV of resources and archival objects from every repository with "Missing Title" titles in note lists and removes the title from the metadata, then posts the update to ArchivesSpace
- Packages:
- ArchivesSpace username, password, API URL in a secrets.py file
- logs directory for storing local log files
- test_data/missingtitles_testdata.py file, with the following:
  - test_object_metadata = {ArchivesSpace resource or archival object metadata} for testing. You can get this from your API by using a client.get request for a resource or archival object that has a "Missing Title" in one of its notes with a list.
  - test_notes = [ArchivesSpace resource or archival object notes list] for testing. You can get this from your API by using a client.get request for a resource or archival object that has a "Missing Title" in one of its notes with a list and taking all the data found in "notes" = [list of notes]
- test_data/MissingTitles_BeGone.csv - a csv file containing the URIs of the objects that have "Missing Title" in their notes. URIs should be in the 4th spot (row[3])
This script finds resource records with a finding aid status of "Publish (sync with EDAN/SOVA)" and that have a published archival object on the highest level component (c01), takes the list of EAD IDs from those resources, and tests them against "https://sova.si.edu/fancytree/", seeing if they return an empty treeview in SOVA. If so, the EAD ID and fancytree URL is logged in a CSV output file.
- deleteaaadigobjs_tests.py //TODO: add later!
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
This script takes a CSV containing resource info (Id, resource_uri, updated_access_note) and retrieves the JSON data for any resources whose rows in the CSV contain data in the updated_access_note column. Then, it makes a copy of the resource JSON, updating the accessrestrict note's content with the data found in the updated_access_note cell. Finally, it posts the updated JSON to ArchivesSpace.
- No tests built, see PR #142
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
This script takes an Excel file, "combined-aspace-agents-edited.xlsx", and uses the "combined and cleaned" sheet. It takes the agent link from the "Aspace_link" column, grabs the agent JSON from ArchivesSpace using the API, then adds a new Record ID to the agent if a corresponding record ID does not already exist. Non-matching record IDs are logged and the one existing in ArchivesSpace remains. Record IDs are located as columns in the sheet, named "Wikidata_id", "SNAC_id", "LCNAF_id", "ULAN_id", "VIAF_id", and "local". After adding the Record ID to the JSON locally, the script sorts the IDs in the order listed above and posts the updated agent record via the ArchivesSpace API.
- excelPath: path to Excel input file, type=str
- objectType: resources/archival_objects/digital_objects, type=str
- -dR OR --dry-run: runs the script without posting updates to ArchivesSpace, printing the expected results instead, action='store_true'
- -v OR --version: display script version, version='%(prog)s - Version #.#'
- updateagentids_tests.py Uses data from updateagentids_testdata.py
- Packages:
- argparse
- collections
- copy
- dotenv
- loguru
- os
- pandas
  - NOTE: A file named secrets.py will not work here, because it shadows Python's standard-library secrets module and causes ImportError: cannot import name randbits when loading pandas. You will need to rename your secrets.py file to something else to get it to work. Make sure NOT to commit or git add the renamed file, and don't let your IDE refactor your project for this change (or it will result in a lot of unnecessary changes to the repo). See the first answer to What does "ImportError: cannot import name randbits" mean?.
- pathlib
- sys
- time
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files and the jsonlines file for storing the original agent data
- test_data directory for accessing the .xlsx file (or anywhere, so long as you copy the path of the file as an argument to the script)
This script finds locations with the label "MapCase" and looks for those that have leading zeros in their indicators such as 01, 02, etc. It then removes the leading zeros for coordinate_1_indicators and searches for and removes leading zeros from coordinate_2_indicators as well. It then posts the updates to those specific locations to ASpace. An SQL query is used to find the appropriate locations.
Call with python update_coordinates.py <jsonl_filepath>.jsonl <log_folder_path>
Adjust the SQL query associated with the find_mapcases variable, currently as:
SELECT location.id FROM location
WHERE
coordinate_1_label = "MapCase"
AND
building = "NMAH"
AND
coordinate_1_indicator LIKE "0_"
ORDER BY
coordinate_1_indicator
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files and the jsonlines file for storing the original agent data
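The leading-zero cleanup described above might look something like the sketch below. It is an illustration only: the function name `strip_leading_zeros` is hypothetical, and keeping a single "0" for an all-zero indicator is an assumption rather than documented behavior.

```python
def strip_leading_zeros(location):
    """Return a copy of a location record with leading zeros removed from its indicators."""
    updated = dict(location)
    for key in ("coordinate_1_indicator", "coordinate_2_indicator"):
        value = updated.get(key)
        if isinstance(value, str) and value.startswith("0"):
            # Assumption: an indicator made up only of zeros is kept as a single "0"
            updated[key] = value.lstrip("0") or "0"
    return updated


# Made-up location record for illustration
example = {"coordinate_1_label": "MapCase",
           "coordinate_1_indicator": "01",
           "coordinate_2_indicator": "007"}
print(strip_leading_zeros(example))
```

Returning a copy rather than editing in place leaves the original JSON available for the jsonlines backup.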
This script collects all users from ArchivesSpace, parses their usernames to separate any starting with 'z-' and ending with '-expired-' into just the text in-between, then updates the username in ArchivesSpace with the new username
- Packages:
- ArchivesSpace username, password, API URL in a secrets.py file
- logs directory for storing local log files
- test_data/znames_testdata.py file, with viewer_user = {ArchivesSpace viewer user metadata} for testing. You can get this from your API with a client.get request for the 'viewer' user in your ArchivesSpace instance.
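One way to picture the username parsing described above is a single regular expression. This is a sketch under the assumption that expired usernames look exactly like `z-<name>-expired-`; the function name `parse_expired_username` and the sample usernames are made up.

```python
import re

# Assumption: usernames start with "z-" and end with "-expired-", with the real name in between
EXPIRED_PATTERN = re.compile(r"^z-(?P<name>.+)-expired-$")


def parse_expired_username(username):
    """Return the text between 'z-' and '-expired-', or None if the pattern does not match."""
    match = EXPIRED_PATTERN.match(username)
    return match.group("name") if match else None


print(parse_expired_username("z-jdoe-expired-"))
print(parse_expired_username("viewer"))
```

Returning None for non-matching usernames makes it easy to skip users that need no update.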
This script takes a CSV of URIs and object type as inputs, grabs all the objects' JSON data using the API, saves them to a jsonL file using the jsonl_path input, and then deletes them in ArchivesSpace. Structure the CSV like so:
| uri |
|---|
| /repositories/##/object_type/object_id |
| /repositories/##/object_type/object_id |
It's recommended to check to see if locations have any top containers associated with them. You can run an SQL query to find any associated top containers such as the following:
SELECT * FROM top_container_housed_at_rlshp
WHERE
location_id in (49324,49338,49323,etc.)
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
- CSV file of URIs for objects to delete (or anywhere, so long as you copy the path of the file as an argument to the script)
- jsonl file for saving data of objects before deletion
- Tests were added for utilities.py delete_object() function.
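The backup step (saving each object's JSON to a jsonL file before deleting it) can be sketched with the standard library alone. The helper names `append_jsonl` and `read_jsonl` are hypothetical and the records are made up; in the real script the JSON would come from the API.

```python
import json
import os
import tempfile


def append_jsonl(jsonl_path, record):
    """Append one record to a JSON Lines file, one JSON document per line."""
    with open(jsonl_path, "a", encoding="utf-8") as jsonl_file:
        jsonl_file.write(json.dumps(record) + "\n")


def read_jsonl(jsonl_path):
    """Read every record back, e.g. to restore an object deleted by mistake."""
    with open(jsonl_path, encoding="utf-8") as jsonl_file:
        return [json.loads(line) for line in jsonl_file if line.strip()]


# Usage with made-up records written to a temporary directory
backup_path = os.path.join(tempfile.mkdtemp(), "deleted_objects.jsonl")
append_jsonl(backup_path, {"uri": "/repositories/2/digital_objects/1", "title": "Example A"})
append_jsonl(backup_path, {"uri": "/repositories/2/digital_objects/2", "title": "Example B"})
restored = read_jsonl(backup_path)
print(len(restored))
```

Appending one line per object means a partial run still leaves a usable backup of everything deleted so far.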
This script creates new subjects from a provided CSV. It is currently customized to support the needs of NMAI, but this hardcoded NMAI metadata can be changed/updated in the future.
- ArchivesSnake
- Environment-based ArchivesSpace username, password, API URL in a .env.{environment} file:
- On your local:
  - Create a new .env.dev file containing local credentials
  - export ENV=dev
  - Run script
- On test:
  - Create a new .env.test file containing test credentials
  - export ENV=test
  - Run script
- On prod:
  - Create a new .env.prod file containing prod credentials
  - export ENV=prod
  - CAREFULLY run script
- logs directory for storing local log files
Unittests for mergesubjects_tests.py
- test_data/subjects_testdata.py file, containing the following:
  - test_merge_subject_destination = {JSON representation of an existing subject that will survive the merge} for testing. If newsubjects_tests.py has been run previously, you can use one of the subjects created by that test.
  - test_merge_subject_candidate = {JSON representation of an existing subject that will be removed during the merge} for testing. If newsubjects_tests.py has been run previously, you can use one of the subjects created by that test.
- test_data/mergesubjects_testdata.csv - a csv file of subjects to be merged, containing:
- aspace_subject_id - id of the merge destination/subject to be retained. If newsubjects_tests.py previously run, this can be one of the subjects created by those tests.
- title - title of the merge destination/subject to be retained. The title must match the existing subject with the above id.
- aspace_subject_id2 - id of the merge candidate/subject to be removed. If newsubjects_tests.py previously run, this can be one of the subjects created by those tests.
- Merge into - title of the merge candidate/subject to be removed. The title must match the existing subject with the above id.
This script creates new subjects from a provided CSV. It is currently customized to support the needs of NMAI, but this hardcoded NMAI metadata can be changed/updated in the future.
- ArchivesSnake
- Environment-based ArchivesSpace username, password, API URL in a .env.{environment} file:
- On your local:
  - Create a new .env.dev file containing local credentials
  - export ENV=dev
  - Run script
- On test:
  - Create a new .env.test file containing test credentials
  - export ENV=test
  - Run script
- On prod:
  - Create a new .env.prod file containing prod credentials
  - export ENV=prod
  - CAREFULLY run script
- logs directory for storing local log files
Unittests for newsubjects_tests.py
- test_data/subjects_testdata.py file, containing the following:
  - test_new_subject_metadata = {JSON representation of a new subject} for testing.
  - duplicate_new_subject = test_new_subject_metadata ensures we can count on duplicate_new_subject to produce a not unique error during testing.
- test_data/newsubjects_testdata.csv - a csv file of new subjects to be created, containing:
- new_title
- new_scope_note
- new_EMu_ID
Unittests for report_grouppermissions.py
This script takes a CSV of archival object URIs with timestamps in their date begin and/or end fields, retrieves the objects using the ASpace API, removes the timestamps, and re-posts the archival objects to ASpace.
The CSV should have the following columns:
- uri, ex: repositories/27/archival_objects/3938698
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
- jsonl file for storing data for backup
Unittests for remove_datetimestamps.py, specifically testing the update_date() function.
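To make the timestamp removal concrete, here is a hedged sketch of what an `update_date()`-style helper could do. The body below is not taken from the repository: it assumes timestamps look like `1950-01-01T00:00:00+00:00` or `1950-01-01 00:00:00`, and that the date subrecord uses `begin` and `end` keys.

```python
import re

# Assumption: a timestamped value is a date followed by "T" or a space and a time portion
TIMESTAMP_PATTERN = re.compile(r"^(\d{4}(?:-\d{2}){0,2})[T ].+$")


def update_date(date_subrecord):
    """Return a copy of a date subrecord with any time portion removed from begin and end."""
    updated = dict(date_subrecord)
    for field in ("begin", "end"):
        value = updated.get(field)
        if isinstance(value, str):
            match = TIMESTAMP_PATTERN.match(value)
            if match:
                # Keep only the date part captured before the time separator
                updated[field] = match.group(1)
    return updated


# Made-up date subrecord for illustration
print(update_date({"begin": "1950-01-01T00:00:00+00:00", "end": "1960"}))
```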
This script generates a spreadsheet report of all the permissions in ArchivesSpace. The first row lists the permissions, so each column represents a permission and each row represents a user group. If a user group has a permission, the cell contains the text of that permission; if not, it contains FALSE. The report is used to check that permissions are the same for each user group across all repositories.
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
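The report layout described above (permissions as columns, user groups as rows, FALSE where a permission is missing) can be sketched like this. The function name `build_permission_rows` is hypothetical and the group and permission names are sample values only; the real script would collect them from ArchivesSpace.

```python
import csv
import io


def build_permission_rows(groups):
    """Build report rows: a header of all permissions, then one row per user group."""
    all_permissions = sorted({perm for perms in groups.values() for perm in perms})
    rows = [["group"] + all_permissions]
    for group_name, perms in groups.items():
        granted = set(perms)
        # A cell holds the permission text when granted, otherwise FALSE
        rows.append([group_name] + [p if p in granted else "FALSE" for p in all_permissions])
    return rows


# Sample input for illustration only
sample_groups = {
    "repository-viewers": ["view_repository"],
    "repository-archivists": ["view_repository", "update_resource_record"],
}
rows = build_permission_rows(sample_groups)

# Write the rows as CSV to an in-memory buffer instead of a real spreadsheet file
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```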
This script takes a CSV of EAD IDs and suppresses the records in EDAN, removing them from view in SOVA. Optionally, it can also suppress the resources in ASpace if the resource URI is provided.
The CSV should have the following columns:
- eadID, ex: NAA.2009-07
- resourceURI, ex: /repositories/36/resources/12345 (optional - only needed if suppressing records in ASpace)
Arguments:
- csvPath - path to CSV input file, type=str
- logFolder - path to the log folder for storing log files, type=str
- -sA or --suppress-aspace - suppress the record in ArchivesSpace, action='store_true'
- -jA or --jwt-algorithm - algorithm for encoding JWT payload. Ex. HS256, type=str
- -dR or --dry-run - dry run?, action='store_true'
- --version - version info
- utilities.py
- Packages:
- EDAN API, key, ISS, JWTID in the .env.prod file
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
- test_data directory for accessing the CSV file (or anywhere, so long as you copy the path of the file as an argument to the script)
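For orientation, the arguments listed above could be wired up with argparse roughly as follows. This is a sketch, not the script's actual parser: the help strings are paraphrased from the list, the `build_parser` name and the sample file names are made up, and the version string is a placeholder.

```python
import argparse


def build_parser():
    """Build a parser matching the documented arguments (help text paraphrased)."""
    parser = argparse.ArgumentParser(
        description="Suppress records in EDAN and, optionally, in ArchivesSpace")
    parser.add_argument("csvPath", type=str, help="path to CSV input file")
    parser.add_argument("logFolder", type=str,
                        help="path to the log folder for storing log files")
    parser.add_argument("-sA", "--suppress-aspace", action="store_true",
                        help="suppress the record in ArchivesSpace")
    parser.add_argument("-jA", "--jwt-algorithm", type=str,
                        help="algorithm for encoding JWT payload. Ex. HS256")
    parser.add_argument("-dR", "--dry-run", action="store_true", help="dry run?")
    # Placeholder version string; the real value is not documented here
    parser.add_argument("--version", action="version", version="%(prog)s - Version #.#")
    return parser


# Parse a made-up command line instead of sys.argv
args = build_parser().parse_args(["eadids.csv", "logs", "-dR"])
print(args.csvPath, args.logFolder, args.dry_run, args.suppress_aspace)
```

With both a short and a long flag given, argparse derives the attribute name from the long form, so `--dry-run` becomes `args.dry_run`.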
This script takes a CSV file containing the URIs or URLs of objects to suppress in the ArchivesSpace staff interface, unpublishes them, and sets the finding aid status to staff only (for resources). The CSV should have a header row that reads "URI", and you can pass the object's repository identifier number and object type (resources, archival_objects, digital_objects) as script arguments. The script takes the CSV, splits the URI into the ArchivesSpace resource ID, repository ID (if not already supplied), and object type (if not already supplied), grabs the resource JSON data, then passes the data to the update_publish_status function, which modifies the JSON to publish=False and finding_aid_status=staff_only (for resources only). Then it posts the updated JSON to ArchivesSpace and suppresses the record.
The arguments passed to the script should look like: python suppress_objects.py <filename>.csv <repo_id> <object_type>
Type python suppress_objects.py --help for more info on commands, including -dR which is a dry run that doesn't execute changes, but spits out what changes will be made once the script runs.
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
- test_data directory for accessing the CSV file (or anywhere, so long as you copy the path of the file as an argument to the script)
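The URI splitting and JSON changes described above can be sketched as two small functions. The bodies are illustrative and not copied from the repository: `parse_uri` is a hypothetical helper that handles only `/repositories/<repo_id>/<object_type>/<object_id>` style URIs (not full URLs), and `update_publish_status` here is a guess at the behavior the description outlines.

```python
def parse_uri(uri):
    """Split '/repositories/12/resources/345' into (repo_id, object_type, object_id)."""
    parts = uri.strip("/").split("/")
    # Expecting exactly: ['repositories', repo_id, object_type, object_id]
    if len(parts) != 4 or parts[0] != "repositories":
        raise ValueError(f"Unexpected URI format: {uri}")
    return parts[1], parts[2], parts[3]


def update_publish_status(object_json, object_type):
    """Return a copy of the record that is unpublished and, for resources, staff only."""
    updated = dict(object_json)
    updated["publish"] = False
    if object_type == "resources":
        # Only resources carry a finding aid status
        updated["finding_aid_status"] = "staff_only"
    return updated


# Made-up record for illustration
repo_id, object_type, object_id = parse_uri("/repositories/12/resources/345")
print(repo_id, object_type, object_id)
print(update_publish_status({"title": "Example", "publish": True}, object_type))
```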
This script takes a CSV file containing the URIs for resource records to get and post back to ArchivesSpace without updating any data, used to kickstart an update to EDAN/SOVA by updating the system_mtime field.
Structure the CSV like so:
| uri |
|---|
| repositories/##/resources/object_id |
| repositories/##/resources/object_id |
Type python touch_resources.py --help for more info on commands, including -dR which is a dry run that doesn't execute changes, but spits out what changes will be made once the script runs.
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
- test_data directory for accessing the CSV file (or anywhere, so long as you copy the path of the file as an argument to the script)
- N/A
This script takes a CSV file containing the URIs of locations to be updated and the repository identifier number as script arguments, structured like so:
python update_locations.py <filename>.csv <repo_id>
The CSV should have at least one header labeled URI. The script adds an 'owner repo' = {'ref': 'repository/<repo_number>'} key-value pair to the location JSON retrieved from the API, then posts the updated JSON to ArchivesSpace.
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files and storing jsonlines output file
- test_data directory for accessing the CSV file
This script takes a CSV of archival object URIs as inputs, grabs all the archival objects' JSON data using the API, saves them to a jsonL file using the jsonl_path input, and updates the archival objects' update_refid field to True, posting them back to ArchivesSpace which regenerates the refids.
- utilities.py
- Packages:
- ArchivesSpace username, password, API URL in a .env.dev, .env.test, and .env.prod file
- logs directory for storing local log files
- CSV file of URIs for objects to update (the path of the file as an argument to the script)
- jsonl file for saving data of objects before they are updated (the path of the file as an argument to the script)
- No tests were added, since the only unique thing this script is doing is changing the update_refid field to True in the objects given.
This script updates existing ArchivesSpace subjects from a provided CSV. It is currently customized to support the needs of NMAI, but can be changed/updated in the future.
- ArchivesSnake
- Environment-based ArchivesSpace username, password, API URL in a .env.{environment} file:
- On your local:
  - Create a new .env.dev file containing local credentials
  - export ENV=dev
  - Run script
- On test:
  - Create a new .env.test file containing test credentials
  - export ENV=test
  - Run script
- On prod:
  - Create a new .env.prod file containing prod credentials
  - export ENV=prod
  - CAREFULLY run script
- logs directory for storing local log files
Unittests for updatesubjects_tests.py
- test_data/subjects_testdata.py file, containing the following:
  - test_update_subject_metadata = {JSON representation of an existing subject} for testing. If newsubjects_tests.py has been run previously, you can use one of the subjects created by that test.
- test_data/newsubjects_testdata.csv - a csv file of changes to be made to an existing subject, containing:
- aspace_subject_id - id of the subject to update, this can match that in test_data/subjects_testdata.py
- new_title
- new_scope_note
- new_EMu_ID
@fordmadox to fill in later
@fordmadox to fill in later
@fordmadox to fill in later
Retrieves all agent_persons that are linked to in the CFCH repository. Updating this script to another repository is possible by changing the ao.repo_id code to the desired repository.
- MySQL software or terminal
- Credentials to the ASpace Test and Prod database
There are no tests with this script. Need to research how to test SQL queries or if it's necessary. This query does not modify any data, so testing may not be necessary. I did run it against test before running against prod.
This query counts all the digital objects with digital object types per each repository. It also includes a count of all digital objects without a digital object type per repository. Now with proper 0s, due to count of digital object ids, rather than rows.
- MySQL software or terminal
- Credentials to the ASpace Test and Prod database
No tests, but the query was run against ASpace test before running it in prod and was reviewed by other team members.