What does it do?
Octopub is an experimental tool developed by the Open Data Institute to explore the future of open data publishing. It provides a step-by-step guided process to prepare a dataset and check its quality, before publishing it to the GitHub collaborative platform.
Octopub is free, and open source: anyone can use it (octopub.io is provided as is, with no warranty, and anyone is free to download and set up their own instance), and anyone can contribute to it, too.
Octopub will guide you through the publishing process. We will help you validate your data, choose the correct license and, when you are happy, publish it to GitHub.
GitHub is a well known website for sharing files, which provides you with a web page for your data that you can share with people.
Why should I use it? / Who is it for?
Octopub is for anyone wanting to publish open data: it doesn’t matter if you are a hobbyist, a data professional, or a government department. If this is your first time publishing data, Octopub will guide you through the process to make it as simple as possible. And if this isn’t your first time publishing data, Octopub makes it quick and easy for you to whizz through the process.
Octopub should be easy to use, but please get in touch if there is something you don’t understand or think could be improved.
You will find helpful tips and guidance on the screens and tooltips as you go. If this is your first time using Octopub we recommend that you click on the help icon by each field for more information and examples.
1. You will need a GitHub account
Octopub will publish data to a file sharing website called GitHub.
Before getting started, make sure you've got a GitHub account. If you haven't got one, you can sign up at github.com/join.
2. Go through the “Three step wizard”
The wizard will guide you through the process of creating a data collection, choosing a licence, adding a schema (optional), adding file(s) to your collection, validating your file(s) and publishing them once you are ready.
3. Create data collections and data files
Many organisations publish data on a topic regularly e.g. for statutory reporting reasons. In this context new data files are added to a data collection. A data collection in this sense is a group of related data files on a particular subject.
We found that the term dataset often has an ambiguous meaning. A useful working definition is "a collection of data that is managed using the same set of governance processes, have a shared provenance and share a common schema".
We have used the term collection for both purposes, since they are both collections of related files.
A collection, for either purpose, is made up of one or more data files, just like you might organise files in a folder on your computer.
4. Using Schemas
Before you can apply a schema to your data file, you need to have uploaded one, had one assigned to you, or inferred one from an existing dataset. More on schemas
Using a schema isn’t mandatory, but it is highly recommended, as it will allow other publishers and machines to find your data and use it effectively.
4a. Upload a Schema
You can optionally associate a schema with your file. See the section below, "Ensuring Quality with Schema", for more information on schemas.
4b. Inferring Schemas
If you would like to associate a schema with your file, but do not want to create one yourself from scratch, then Octopub can make a first attempt. Caution: this is an experimental feature, so you will probably want to fine-tune the generated schema before you use it.
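To give a feel for what inferring a schema involves, here is a minimal Python sketch. It is not Octopub's implementation: the function names `guess_type` and `infer_schema`, the type names, and the output layout are all assumptions made for illustration. It simply guesses a type for each column from the values it sees, which is also why a generated schema usually needs fine-tuning afterwards.

```python
import csv
import io


def guess_type(values):
    """Return 'integer', 'number' or 'string' for a list of cell texts."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"
    # Try the strictest type first; fall back when any value fails to parse.
    for name, parse in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                parse(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(text):
    """Guess a type for every column of CSV text that has a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    schema = {}
    for index, name in enumerate(header):
        # Skip rows that are too short to have a value in this column.
        column = [row[index] for row in data if index < len(row)]
        schema[name] = guess_type(column)
    return schema


sample = "name,age,height\nAda,36,1.65\nGrace,45,1.7\n"
print(infer_schema(sample))
```

A guess like this only reflects the rows it was given and Python's own parsing rules, so a column that happens to hold only whole numbers today will be typed as an integer even if decimals are valid.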
5. Submit your collection for validation
When you submit your collection it will be sent to a validation service called CSVLint. See Data Validation section below for more information on CSVLint and validation errors.
To note: Pre-publishing and Publishing
When all the files in your collection have passed the validation stage you are ready to publish it. Until you choose to publish the collection it will stay as “pre-published” data. It is only when you choose to publish it that it will be made available on GitHub. Until then you can continue to review the data, make changes and re-validate the data until you are ready.
GitHub is a website that allows anyone to share files, but allows the author to keep control of changes. It offers features such as:
- A dedicated homepage
- A home for supporting documentation
- Version history of all your files
- Suggestions for corrections with an issue tracker
- Search files and descriptions
Because GitHub is so useful and well known in the open source community, we thought it was the ideal publishing site for open data.
Did you know?
The octopus-cat mascot for GitHub is affectionately known as “Octocat”, which is what inspired the name “Octopub”.
Validation means making sure that the structure of your files is acceptable, regardless of its contents.
At the moment Octopub is able to validate your data in two ways:
- CSV file validation with CSVLint, which is a service that checks for common errors such as:
- Missing header rows, or missing column names
- Rows in the file that don't have the same number of columns as the header
- Blank rows
- Odd characters in a file which could cause errors
- Stray/Unclosed quotes
- Inconsistent values: for example if most values in a column are numeric but there are a few that aren't
- If you have provided a schema for a data file, Octopub can also use CSVLint to validate the data according to any rules defined in the schema.
See csvlint's website for more information about common errors.
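Some of the structural checks listed above can be sketched in a few lines of Python. This is an illustration only and not how CSVLint itself works: the function name `check_csv_structure` and the wording of the messages are assumptions. It looks for unnamed columns, blank rows, and rows whose length differs from the header.

```python
import csv
import io


def check_csv_structure(text):
    """Return a list of (row_number, message) problems found in CSV text."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [(0, "file is empty")]

    header = rows[0]
    # An empty header cell means a column has no name.
    for index, name in enumerate(header, start=1):
        if not name.strip():
            problems.append((1, f"column {index} has no name"))

    # Data rows start at row 2, directly after the header.
    for number, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            problems.append((number, "blank row"))
        elif len(row) != len(header):
            problems.append(
                (number, f"expected {len(header)} columns, found {len(row)}")
            )
    return problems


sample = "name,age\nAda,36\n\nGrace,45,extra\n"
for row_number, message in check_csv_structure(sample):
    print(f"row {row_number}: {message}")
```

Running a check like this on a file before uploading it can catch the simplest problems early, although the validation service covers many more cases.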
In the future, we aim to validate more file types using the Lintol.io application, which would allow you to upload and validate almost any file type, from CSVs to geospatial files.
This is a prototype feature.
A Shapefile is a popular geospatial vector data format for geographic information system (GIS) software. It typically comprises multiple files, with extensions such as .shp, .shx, .dbf, .prj, .sbn, .sbx and .cpg.
Octopub allows the uploading of files that comprise a Shapefile, to facilitate the automatic conversion of the Shapefile to the GeoJSON format.
How to convert a Shapefile to GeoJSON
- Create a new 'Collection'.
- Pick a title, description and licence for your 'Collection'.
- Add a new 'File' for each file that comprises the Shapefile.
- Once all 'Files' have been added, 'Submit collection for validation' (This process will fail if a file that comprises the Shapefile is missing).
- Visit the newly created 'Collection' page.
- An additional GeoJSON file will be present as the last row of the collection files.
- Click on the new row to view the mapped GeoJSON data and download it as a file.
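Because the conversion fails when part of the Shapefile is missing, it can help to check your files before adding them. The Python sketch below is a hypothetical helper, not part of Octopub. The Shapefile format itself treats .shp, .shx and .dbf as mandatory, but whether Octopub checks for exactly this set is an assumption.

```python
from pathlib import PurePath

# The three parts every Shapefile needs; .prj, .sbn, .sbx, .cpg and
# others are optional extras. Assumed to match what Octopub expects.
REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}


def missing_shapefile_parts(filenames):
    """Return the sorted required extensions that are absent from filenames."""
    present = {PurePath(name).suffix.lower() for name in filenames}
    return sorted(REQUIRED_EXTENSIONS - present)


files = ["roads.shp", "roads.prj"]
print(missing_shapefile_parts(files))
```

An empty list means all three mandatory parts are present and the files are ready to be added to the collection.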
What is open data?
Open data is data that’s available to everyone to access, use and share, which should be easy to find. However, it is only useful if shared in ways that people and machines can understand, which is why data needs to be explained, described and often formatted & standardised in reusable, machine-readable ways.
Why do I have to choose a licence?
Licensing is key. It's what makes data open. When data is published with an open licence, it's the licence that defines who can use it and for what purposes. Without a licence people just won't know whether they can use it or how, and the lawyers would say that they mustn't. Without a licence the data isn't explicitly free for everyone to access, use and share.
Read more about licences here
Find out more
Read more about open data here
What is a schema?
The structure (or schema) of a data file is a major factor in how useful your data will be to others. Data can be more easily combined and compared if published in a consistent and predictable format.
A "schema" is a document that describes the format of your data. It typically describes the columns, the type of data in those columns and rules about the data itself. The schema can be used by a computer to automatically validate your data and catch common errors. For example: in my data, there will be a column called "age" which is a number and cannot be more than 150 or less than 0.
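The "age" rule can be written down and checked by a computer in a few lines. The Python sketch below uses a made-up schema layout purely for illustration. It is not the schema format that Octopub or CSVLint reads, and the names `schema` and `validate_row` are assumptions.

```python
# Hypothetical schema layout, for illustration only: one entry per column.
schema = {
    "age": {"type": int, "minimum": 0, "maximum": 150},
}


def validate_row(row, schema):
    """Return a list of error messages for one row (a dict of column -> text)."""
    errors = []
    for column, rules in schema.items():
        if column not in row:
            errors.append(f"missing column '{column}'")
            continue
        try:
            # Convert the cell text to the type the schema asks for.
            value = rules["type"](row[column])
        except ValueError:
            errors.append(f"'{column}' is not a valid {rules['type'].__name__}")
            continue
        if value < rules["minimum"] or value > rules["maximum"]:
            errors.append(
                f"'{column}' must be between {rules['minimum']} and {rules['maximum']}"
            )
    return errors


print(validate_row({"age": "42"}, schema))
print(validate_row({"age": "200"}, schema))
```

The first row passes with no errors, while the second is reported because 200 is above the maximum the schema allows.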
For publishers, making data available in a consistent format is the best way to make open data useful, both internally and for others:
- It helps people combine or compare their data with the same type of data from other publishers, for use in their apps or analytics.
- It saves you time and money when preparing data for publishing, if you can use a schema that already exists.
How do I create a schema?
See here for more information on creating schemas for your data
Do you think others might want to create new datasets using the same schema as you? If so you can consider creating a new standard schema. See Going Further: Should I create a new common/standard schema?
At its simplest, open data requires just two things: data and openness. There are lots of aspects to openness, but at its most fundamental, the key is how the data is licensed. Data that doesn't explicitly have an open licence is not open data.
An open licence is one that places very few restrictions on what anyone can do with the content or data that is being licensed. An open licence allows others to do things like:
- republish the content or data on their own website
- derive new content or data from yours
- make money by selling products that use your content or data
- republish the content or data while charging a fee for access
Read more about licensing here.
What schemas can I use?
For common or collaborative reports (e.g. the census, other statutory reporting, or formats agreed by research partners) there may already be a standard schema available for you to use. We recommend using standard schemas if possible because datasets which use the same schema will be easier to compare and combine. You can find some schemas here
LGA keeps a definitive set of schemas for those datasets that local authorities have to publish under the Local Government Transparency Code, along with other popular open datasets. If you are publishing data in one of these categories, we strongly recommend finding and following the relevant schema.
If the dataset you are publishing is not covered by one of the core schemas referenced by LGA, you may be able to find a common schema from other local authorities in the community. For certain types of data, common schemas are published by other communities, such as the open data community on GitHub, INSPIRE for geospatial environmental data and BS7666 for address and land use. In addition, the LGA schema directory keeps track of over 640 schemas from local authorities across the UK. Simply search the directory for the type of data you plan to publish, and it will show you the common schemas that match.
Should I create a new common/standard schema?
Is there no common schema for the dataset you are publishing? Do you think others might want to create new datasets using the same schema as you? If so you can consider creating a new standard schema.
Before creating a new one, find out whether there is already some consensus among an existing community of users somewhere: you might need to create a new standard schema but there may be an opportunity to get the community to help.
A good place to start is iStandUK, an active community of local government experts who can support the development of a new schema.
Can I use different licences for different files in a collection?
The licence you choose will be applied to all the files you add to the collection you’re creating and therefore has to be chosen carefully. If you are not sure then Octopub will help you choose a licence.
Is there a size limit for data files?
I have a validation error, what does it mean?
To discuss tailored training for your organisation or for further details, contact [email protected]