Info: This article will discuss the key issue of keeping your data up-to-date. Once your dataset is registered and the first upload of data is completed, you need to decide whether to schedule data updates. A variety of technical approaches (with comprehensive technical guides and examples included) will be discussed.
Once your dataset application is approved, you can upload data to it from the “Upload” tab of your DUT console.
From then on, any data you upload will go through our data ingest process (which includes a range of data validation and processing checks) and, if successful, will be loaded into your slice of the data store.
Data updates vs data changes
Firstly, it’s useful to clarify the meaning of “data updates” and “keeping data up-to-date”. In this context, both terms refer only to changes to records within a dataset: the addition, removal, or modification of individual records. Each load replaces the existing data in your data store completely; no processes are applied that generate deltas or change files.
In DUT, a data update does not count as a “dataset change”.
You only need to lodge a dataset change when changes need to be made to your dataset’s:
- schema by adding, removing, or changing your attribute names or data types
- application, such as Feature Class name, dataset views, key contacts, etc.
Scheduling data updates
The process for scheduling updates to data may seem daunting for organisations without specialist knowledge in data management. However, experience has shown that the process can be managed by anyone who works with data and information, and doesn’t need to be an IT- and technology-centric process.
This article will step through the process and technical aspects of preparing and publishing your data. You’ll learn why it’s important to schedule updates to data, how to choose a method for performing updates, and how to schedule and test the end-to-end process. If you’re already familiar with using FTP to transfer data, then you’ll find the technical process of supplying data relatively straightforward.
Info: Depending on how you choose to proceed, lead time may need to be factored in to involve your IT or applications teams to set up and configure some of the tools that are required.
Help: The team is here to help. Please get in touch if you need support to decide about scheduling updates, or require technical assistance.
Why this is important
Just as users of data have a responsibility to ensure they are using the most recent and accurate datasets, data custodians have a responsibility to ensure that their data is published and kept up-to-date as it grows and evolves.
“Data should be as up-to-date as possible and made available to users in a timely manner. As data is updated, agencies should aim to make it available as soon as possible, or on a consistent periodic basis.” – Western Australian Government Open Data Policy
Keeping data up-to-date through scheduled updates ensures that:
- Currency is maintained. You can have confidence that users are accessing the most recent and reliable data.
- Time and cost savings in distributing data. The latest data is always available in one place. Avoid the time and cost associated with distributing copies of data to users by hand, and ensure that users of your data can self-serve from one place: data.wa.gov.au.
- Speed and reliability. Over the longer term, scheduled updates are faster and more reliable than dedicating staff to updating data manually.
Before proceeding, make sure you have:
- published a dataset through DUT by completing the steps in the article Registering your geospatial dataset
- had the dataset approved through the Application review process
- successfully loaded the data into DUT at least once by following the Managing data loads article.
Should I schedule updates?
As a general rule, any dataset that will change more than roughly once a month should have scheduled updates put in place.
Relying on manual human intervention to ensure dataset updates are published is prone to failure (e.g. due to staff movements, holidays, competing priorities) and is generally less efficient than investing time to schedule data updates.
With scheduled updates, you can connect directly to your database to extract and upload data in one step. This avoids the need for manual multi-step processes that would be required to do the same thing by hand.
When is it appropriate not to schedule updates?
It’s not always necessary for data to have scheduled updates. The main test to apply in deciding whether to schedule updates or do them by hand is to ask, “Will this save time for me and my team?”
If you only have a couple of datasets that are updated twice a year, then it’s probably not worthwhile to schedule updates. This is true as long as you have good internal processes to ensure that somebody is assigned the task of uploading a new version of your data as it gets updated.
You can skip scheduling data updates if your dataset is:
- part of a series of datasets published at infrequent but regular intervals as separate datasets (e.g. the Current Active Schools series of datasets from the Department of Education),
- a once-off snapshot from a point in time that won’t change (e.g. Perth and Peel Urban Land Development Outlook 2016/17), or
- part of an administrative or reporting function of your agency for which a process for publishing reports and publications is already in place. You can add an “Update the raw data on data.wa.gov.au” action to that process. For example, data that is included in a regular quarterly report produced by your organisation and published on the web.
Help: If you’re unsure whether scheduled updates are applicable, please get in touch. The team is happy to discuss the specifics of your dataset and advise on the best approach.
If you make the decision not to schedule updates, then it’s important to ensure your organisation has processes in place to manage publishing updates to the dataset over its lifetime. At a minimum, this should involve:
- Documenting that your data is published through data.wa.gov.au and ensuring that everyone involved in creating and managing that dataset (e.g. data custodians, data stewards, publications team) is aware of this.
- Identifying the people who are responsible for notifying data users and publishers that the data has been updated.
- Writing a process that outlines who is responsible for performing data updates, and how they will get an updated copy of the data loaded from your systems and into DUT.
If you’ve made the decision not to schedule updates, then please skip the rest of this article and proceed to the next article in the series, Preview services and sign off.
Note: The rest of this article assumes that you will schedule data updates.
Choosing an update frequency
Pick an update frequency that best reflects changes to the content of your data and aligns with the requirements of your business. For example, if your data changes sporadically throughout a given month it is best to set up a daily refresh.
From a technical point of view, there is no practical limit to the frequency of refreshes. For example, if there is a business need to refresh your data every five minutes, the platform can cope with that for most datasets.
Once uploaded, your dataset will be automatically run through a validation process. If the data loaded is valid, it will start to load into the data store straight away. Depending on the size of the dataset, it can be refreshed within minutes of your upload finishing.
A series of guides has been prepared for data publishers, providing a step-by-step technical walkthrough of scheduled data updates through DUT.
Each guide covers:
- Preparing your data for ingest through DUT.
- The software requirements for uploading data.
- Step-by-step instructions detailing how to perform an upload.
- Guidance on scheduling uploads to automate the process of uploading data.
There are three approaches available, each using a different piece of software. Select the one that best suits your team’s skill set and the tools you have available. It’s best to go with what you know. For example, if you already own FME and run FME workbenches, then that will probably be the easiest path to take.
- Using FME. A good choice if you already have FME, or if you need to manipulate data prior to upload.
- Using the AWS CLI. The AWS CLI is an open source tool built by Amazon Web Services on top of the AWS SDK for Python. It provides an easy-to-use command-line interface for uploading files to the AWS S3 storage service used by DUT. A good choice if you are comfortable with command-line tools, or if Linux is in use in your organisation.
- Using Python. A good choice if you are comfortable with Python, or already have FTP-based workflows that work with data.
Help: The guides won’t go into detail about how to get data out of your systems, prepared, and in the right formats for uploading. It is assumed you have approaches already available to do that automatically. If that’s not the case, or you need advice on which data extraction tools to use with each upload tool, please get in touch.
All of the requirements outlined in the Requirements for uploading geospatial data article also apply to scheduled updates. Make sure you meet the data preparation requirements outlined in that article. The data produced by your scheduled processes should be equivalent to the data you manually uploaded to DUT when registering your dataset application.
Zipping your data for uploading
In order to upload your data it needs to be zipped, either using the inbuilt file compression tools on your operating system, or third-party tools you already have installed, such as WinZip.
Tip: Don’t worry too much about the compression side of the zipping process, simply go with whatever the default is for the tool you’re using. Zipping files for uploading is more about packaging them into a single file rather than compressing them down to the smallest possible file size.
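As a rough illustration of the packaging step, the sketch below uses Python’s standard zipfile module to bundle a Shapefile’s component files into a single archive with ordinary compression. The function name, the example paths, and the choice to place the files at the top level of the archive are assumptions made for illustration; the Requirements for uploading geospatial data article remains the authority on the layout DUT expects.

```python
import zipfile
from pathlib import Path


def zip_shapefile(shp_path, zip_path):
    """Package a Shapefile and its sidecar files (.shx, .dbf, .prj, ...) into one zip.

    Assumption: the component files sit side by side, share the same base
    name, and should be written at the top level of the archive.
    """
    shp_path = Path(shp_path)
    # Collect every file that shares the Shapefile's base name, e.g. shipwrecks.*
    # (skipping any zip left over from a previous run).
    parts = sorted(
        p for p in shp_path.parent.iterdir()
        if p.is_file() and p.stem == shp_path.stem and p.suffix.lower() != ".zip"
    )
    if not any(p.suffix.lower() == ".shp" for p in parts):
        raise FileNotFoundError(f"No .shp file found for {shp_path}")
    # Ordinary deflate compression; the goal is packaging, not minimum size.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for part in parts:
            zf.write(part, arcname=part.name)
    return [p.name for p in parts]
```

A call such as `zip_shapefile("extract/shipwrecks.shp", "shipwrecks.zip")` (hypothetical paths) would return the list of file names that were packaged.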
There are a few tricky caveats to be aware of in zipping data for uploading into DUT, particularly if using a File Geodatabase.
Naming your zip file
Name the zip file whatever you like. The only requirement is that the contents of the zip file (i.e. the File Geodatabase Feature Class or the filename given to the Shapefile) match the Feature Class Name provided on the Attributes tab when the dataset application was submitted.
Tip: You might want to use this to embed some metadata in the name of the zip file. e.g. instead of creating a file called shipwrecks.zip, include metadata about the time and source of the data and name it shipwrecks_20171123_postgresql_extract.zip.
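The naming pattern from the tip above could be generated by a scheduled script along these lines. This is only a sketch: the function name and the order of the name parts are assumptions taken from the shipwrecks example, not a DUT requirement.

```python
from datetime import date


def build_zip_name(feature_class, source, extracted_on):
    """Return a zip file name of the form <name>_<YYYYMMDD>_<source>.zip."""
    return f"{feature_class}_{extracted_on:%Y%m%d}_{source}.zip"


print(build_zip_name("shipwrecks", "postgresql_extract", date(2017, 11, 23)))
# prints: shipwrecks_20171123_postgresql_extract.zip
```

Only the zip file name changes between runs; the Shapefile or Feature Class inside it keeps the name registered in DUT.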
In talking about uploading data, it’s worth briefly discussing the technology behind the scenes.
The Data Upload Tool uses Amazon’s S3 (aka Simple Storage Service) as a temporary place to store the data you upload. As part of setting you up as a data publisher, a folder is created that is unique to your organisation. Only authorised data publishers in your organisation can use it.
Tip: Your AWS S3 connection details are the same for every dataset you publish, so you only need to set up a single connection (for example, in Cyberduck) for all of your datasets.
As part of the process of uploading data and scheduling data updates, you’ll work closely with AWS S3. There are a few key concepts and terms to be familiar with to effectively use and understand S3.
Your S3 credentials can be found in DUT by navigating to the Upload tab on any of your datasets. You use the same credentials for all datasets your organisation publishes.
- S3 Bucket. You may hear us use this term when discussing data uploads. It’s just another way of describing the unique “folder” for your organisation.
- S3 Access Key. This is like your username. Each organisation is given a unique key, and a “bucket” is created for it in S3.
- Secret Access Key. This is like your password. Similar to a password, it shouldn’t be stored in plain text, written on a post-it note, or shared with anyone who isn’t authorised to use it. Wherever possible, follow best practice and store your passwords in a secure and encrypted password manager.
- S3 Bucket Folder. This is comparable to a folder path on your computer. It will resemble “lg-slip-selfservice-data/data-load/6”.
Tip: Do not include the word “preview” at the end of the folder path when setting up your automated scripts for data upload, because the preview is dropped once the data has gone live.
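Most S3 tools ask for the bucket name and the path inside the bucket separately, so a script may need to split the S3 Bucket Folder value. The sketch below does that, and also drops a trailing “preview” segment if one was copied by mistake. It assumes that the first path segment is the bucket name and that “preview” appears as its own path segment; neither is confirmed by this article, so check the values shown on your Upload tab.

```python
def split_bucket_folder(bucket_folder):
    """Split an S3 Bucket Folder value into (bucket, key_prefix).

    Assumption: the first path segment is the bucket name and the remainder
    is the key prefix (folder path) inside that bucket.
    """
    segments = [s for s in bucket_folder.strip("/").split("/") if s]
    # Guard against a trailing 'preview' segment being copied into a script.
    if segments and segments[-1].lower() == "preview":
        segments = segments[:-1]
    if len(segments) < 2:
        raise ValueError(f"Expected '<bucket>/<folder>', got {bucket_folder!r}")
    return segments[0], "/".join(segments[1:])


bucket, prefix = split_bucket_folder("lg-slip-selfservice-data/data-load/6")
print(bucket, prefix)
# prints: lg-slip-selfservice-data data-load/6
```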
Tip: If you’re completely unfamiliar with AWS S3, don’t worry too much. Think of it like FTP, or modern file storage systems like Dropbox and OneDrive.
Important: You are responsible for keeping your credentials secure. If you believe they have been compromised, contact the team immediately and they will be reset.
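One common way to keep the Secret Access Key out of scripts is to read both keys from environment variables at run time. The sketch below assumes the keys have been stored under `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, the variable names that the AWS CLI and AWS SDKs recognise; the function name is hypothetical, and how the variables get set (for example, by your scheduler or a secrets manager) is up to your organisation.

```python
import os


def load_s3_credentials(env=None):
    """Read the S3 Access Key and Secret Access Key from environment variables."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    # Fail early with a clear message rather than part-way through an upload.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
```

Because the script never contains the keys themselves, it can be shared or kept in version control without exposing them.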
Tip: If you’d like to use your S3 to distribute other types of data (for example, non-spatial), please get in touch. The S3 bucket created for you is solely for the purposes of loading data into DUT, and should not be used as a way of hosting and distributing other datasets.
Reference: Uploading geospatial data with FME
AWS Command-line Interface
Reference: Uploading geospatial data with the AWS CLI
Reference: Uploading geospatial data with Python
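To give a feel for what the Python approach involves before you open the guide, here is a minimal sketch that uploads a zip file with boto3, the AWS SDK for Python (a third-party package, installed separately). The function names are hypothetical, and placing the file directly under your bucket folder as `<folder>/<zip file name>` is an assumption; the guide above is the authority on the exact destination and any extra settings DUT needs.

```python
from pathlib import Path


def make_s3_client(access_key, secret_key):
    """Create an S3 client with boto3 (third-party: pip install boto3)."""
    import boto3  # imported here so the rest of the sketch loads without it

    return boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def upload_zip(client, zip_path, bucket, key_prefix):
    """Upload one zip file to <bucket>/<key_prefix>/<file name>; return the key.

    Assumption: DUT picks up files placed directly under the bucket folder.
    """
    zip_path = Path(zip_path)
    key = f"{key_prefix.strip('/')}/{zip_path.name}"
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client method.
    client.upload_file(str(zip_path), bucket, key)
    return key
```

A script built around these two functions can then be run on a schedule with whatever scheduler your operating system provides, such as cron or Windows Task Scheduler.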
Where possible, use the same solution to create your initial data load for DUT as for your scheduled updates. This can eliminate many of the issues that stem from extracting your dataset with two different pieces of software (e.g. ArcMap and FME) that have different settings and defaults.
A small number of common errors cause most of the issues with scheduling data updates. To avoid them:
- Don’t introduce changes to your data. Ensure that no changes are being introduced to your data by the process that extracts it from your database. For example, FME will treat date attributes as a special fme_date attribute that causes problems during the data load process.
- Upload the right dataset. Check that the name of the Feature Class in DUT matches the Feature Class in your File Geodatabase or the name of your Shapefile. It needs to match exactly and is case-sensitive.
- Zip the dataset correctly. Follow the instructions above to ensure that the dataset is named and zipped correctly.
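The name check in the list above is easy to automate for Shapefiles before each upload. The sketch below looks inside a zip for a `.shp` file whose base name matches the Feature Class Name exactly, including case. The function name is hypothetical, and the check covers Shapefiles only: reading Feature Class names out of a File Geodatabase needs GIS software or a dedicated library.

```python
import zipfile
from pathlib import PurePosixPath


def zip_has_shapefile_named(zip_path, feature_class_name):
    """Return True if the zip holds a .shp whose base name matches exactly."""
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            path = PurePosixPath(member)
            # The name comparison is deliberately case-sensitive.
            if path.suffix.lower() == ".shp" and path.stem == feature_class_name:
                return True
    return False
```

Running this as the last step before uploading lets a scheduled job stop early, rather than sending a file that will fail during ingest.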
Verifying that your process is working
You can confirm that the scheduled update process has worked (that the data has been uploaded, ingested, and successfully loaded into the data store) in the same way as when you first uploaded data through DUT as part of registering your dataset. Refer to the “Receiving your data upload notification” section in the Managing data loads article.
What happens now?
Once you’ve successfully implemented and tested your scheduled update process, the team will publish your preview service. When that’s ready, you will be notified and can move on to the next article in the series, Preview services and sign off.