Today I attended the JISC Research Data Network Workshop hosted this time by Cardiff University. So far we know research data services are desirable (priority): there is a growing demand for managing heterogeneous and growing volumes of data across the entire research lifecycle; we know that research data services are possible (feasibility): there is an increasing array of on-premise and in-cloud platforms to select from; the challenge, certainly for us at least, is making these services sustainable, secure and manageable (viability). The RDN meeting provides an opportunity to find out more about good practice in the sector. Of particular interest today is less the technology and more the work JISC has been supporting around the business case and costing models for research data management.
There are plenty of notes and links to resources below but my key thoughts and takeaways are:
- research data management is a deceptively complex domain;
- there’s not even agreement on what research data management is, how it should be scoped and what ‘fit for purpose’ RDM looks like;
- getting buy-in for ‘feeling our way’ through this uncertainty to a viable business case and workable solution requires skilful advocacy;
- those of us grappling with this challenge see benefits in working together to tackle the problem, facilitated by networks like the RDN, but commercial sensitivity about sharing options analysis, solution choices and financial data constrains this;
- developing a cost-effective service incorporating standard business processes and technology is difficult given the detailed internal and external financial rules and heterogeneous use cases;
- documenting current and forecast capacity requirements is difficult because quality data about research demand and costs is unavailable or hard to obtain and may not even help inform decision making: it’s a very imprecise science;
- because research data management is a radical problem (it’s relatively new and rapidly changing for everyone), the discovery and design phases of projects are resource intensive and often longer than anticipated;
- bridging from why to what to how is hard. The problems and challenges seem well understood and shared across the sector now. There is a real thirst to move from theory to examples, concepts to case studies, in order to figure out and then replicate what works. Early adopters and pilots are invaluable in providing tangible starting points and generating evidence and feedback: try something and iterate;
- tactical approaches and development projects will continue to dominate in the shorter term whilst more strategic and standardised approaches are at least another 2-3 years away yet;
- c.2018 is likely to be an active deployment period as more services move from prototype to production and REF 2020/21 begins to concentrate minds;
- even economists are struggling with the economics of research data management.
- Research Data Network Agenda and Notes
- RDM Blog
- RDM Mailing List
- Directions for Research Data Management
1. Welcome from Jisc
The network, instigated as part of the Research at Risk programme, is a way for people working in this area to connect and communicate. There are quarterly meetings for sharing information, ideas and solutions around research data services. These are supported by a mailing list, a blog and a research.network online space.
2. Welcome from Hosts / Neil Penry, University of Cardiff
After a bit of history, some pointed rugby World Cup jibes, and photos of Sherlock being filmed on campus yesterday, Cardiff outlined their current research programme. Initiated in 2013 it covers research information management to improve 2020 REF submission by implementing a CRIS and research portal (Converis by Thomson Reuters). Functionality includes research datasets and metadata, research pipeline, impact for REF and integration with finance, projects and HR systems.
There is an interim Research Data Storage solution, a research integrity code of practice and supporting web resources.
In development is a publications repository (currently ePrints but moving to Converis), and Converis will also be implemented for award management. The programme will finish in 2018 with additional functionality such as costing and pricing, approval, internal peer review and other administrative functions.
For RDS there will be central data storage for live data, supported by a selection tool for researchers, and automated transfer to an archive system. Looking to Converis to automate much of this.
- shifting landscape
- costs of RDM provisions
- selling to researchers the costs of doing this properly, the benefits of sharing data
- how to fund the service? Funders want us to share but funding rules (both funder and institutional) make it difficult to achieve
As with many institutions, the immediate driver was the EPSRC requirements, but it was also recognised that research was at risk. The discovery of Converis and its ability to automate data flow based on triggers brought it all together.
3. JISC Shared Service / Catherine Grout, Jisc
Why a shared service? Jisc don’t think there is a commercial solution (single product or group of products) meeting this need at present. The need is there (whether mandated or to maintain research integrity) and a shared service offers cost savings and efficiencies, common approach and practices and standardisation and interoperability.
It is a lot easier to go and buy off the shelf storage at the moment than it is to curate data. The service has to be easy for researchers to use and engage with, and has to help institutions deliver their end to end processes, both those that are ongoing and those that are cyclical e.g. REF submission.
Working with pilot institutions on a core set of metadata and processes to develop a minimum viable product. This is based around the JISC Shared Service architecture.
Catherine gave an update on the progress and the current development phase plus who is working on which lots (the architecture was split into 8 lots to make development more manageable).
Around this process there is also consultancy support working with the core Jisc team, the suppliers and pilot institutions.
Dissemination will include adding resources to research.network and assembling them into a Research Data Toolkit.
University of Cambridge talked about their rationale for joining the pilot despite already having many, but disparate, research services. Preservation was an important driver, especially given very heterogeneous research activity across very different disciplines. Joining forces with others facing the same problems and working jointly to address these challenges was an attractive benefit. A second challenge is handling big data. Again researchers aren’t supplying huge volumes of data … yet … but the institution needs to be ready to handle the data when it comes and to scale services. A third challenge is managing sensitive data: having confidence that systems are secure and robust enough and that processes can support access requests.
Rough estimate is a production service will be ready 2 years from signed contracts on the framework (approx late 2018). This is possibly not fast enough for some institutions hence why dialogue across many institutions, not just the pilots, is important. Some institutions are waiting to see what happens with the shared service. Questions raised around risk assessment for institutions considering the shared service.
4. Show me the money – the long path to a sustainable Research Data Facility / Marta Teperek, University of Cambridge
Research data services at Cambridge cover:
- data management plan support
- data repository (free up to 1GB)
- policy development
- advocacy and outreach
So, how much does this cost and how much should be recovered centrally? Three main types of cost:
- Research Data Management Facility. This covers the above services and is mostly person effort. Baseline (should be somehow recovered from most grants).
- Long-term Preservation of Data. Mostly technology and infrastructure (Archive Data). Optional (to be budgeted in grant applications).
- Data Management Dedicated to the Project. This can include storage for active data and person effort such as data managers and data scientists, especially on larger projects. Optional (to be budgeted in grant applications).
Active data storage is mostly provided within departmental units rather than centrally.
The cost of archive storage is £4/GB for data above a 1GB inclusive allocation. This is based on at least 25 years' retention with up front payment. A cost effective way of operating the charging process is still being worked out.
Developing a cost recovery model for the RDM facility started with guidance from RCUK on treating a research data management service as a small research facility, with costs recovered as directly incurred rather than indirect. This was explored in consultation with the finance department. One problem is working out how much use a project would make of the service in order to estimate a baseline service charge. Basing this on data management plans was considered but proved problematic.
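As a rough illustration of the archive charge described above, the sketch below computes an up front charge at £4 per GB for data beyond the 1GB inclusive allocation. The rate and allocation are the figures quoted in the talk; the function name and the choice to bill fractional gigabytes pro rata are assumptions made for illustration, since the notes do not say how partial gigabytes are treated.

```python
RATE_GBP_PER_GB = 4.0      # rate quoted in the talk
FREE_ALLOCATION_GB = 1.0   # inclusive allocation quoted in the talk


def archive_charge_gbp(dataset_gb: float) -> float:
    """Up front archive charge for one dataset (hypothetical helper)."""
    if dataset_gb < 0:
        raise ValueError("dataset size cannot be negative")
    # Only data above the inclusive allocation is chargeable.
    chargeable_gb = max(0.0, dataset_gb - FREE_ALLOCATION_GB)
    # Assumption: fractional gigabytes are billed pro rata.
    return chargeable_gb * RATE_GBP_PER_GB


if __name__ == "__main__":
    for size_gb in (0.5, 1.0, 50.0):
        print(f"{size_gb} GB -> GBP {archive_charge_gbp(size_gb):.2f}")
```

Under these assumptions a 50GB dataset would cost £196 to deposit, while anything up to 1GB is free.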
Then explored possible models that included:
- no charge (if the project won’t use the facility)
- single flat charge
- multiple flat charge
- charge per FTE
- proportion of grant
Started with a single flat charge, with the intention of collecting more evidence and revisiting the model. Advocated for this model by developing 2 different messages for 2 different audiences: researchers and senior managers. Prepared a nice slide showing the funder requirements, the services supplied by funders and the services needed from the institution. Also included how much research income was received from research funding.
The data repository had existed for some time but submissions increased substantially when the facility made it easier to submit data. The facility also provided training and asked for feedback during training sessions in order to improve but also to demonstrate value to senior managers.
This demonstrated a clear justification for providing a central service, but agreeing a business case was still very difficult. Also, ultimately funders said that RDM services should be recovered as indirect not direct costs. This excludes some funders, such as charities, who don’t pay for overheads so a training charge per FTE may have to be added as a direct cost for these grants.
One of the real difficulties in defining a standard, cost-effective business model and business process is the varying and detailed constraints placed on how research income can be spent to comply with the contractual terms of funding. This is a huge overhead for institutions when developing services to support this revenue stream.
5. Business case and the costing model for storage volume requirements – a Royal Holloway case study
Sharing work on costing their RDM service, currently still in project not production stage. Policy in place and now need a supporting service. Had a 4 month investigation project planned starting July 2015. Has actually taken 10 months to prepare a costed project proposal.
- Improve DMP Online
- Active Data Storage
- Collaboration Tools
- Data Catalogue
- Archive Storage (analogue and digital)
- Preservation System
- Staging Storage (technical processing step to reduce costs)
Wanted the service to encompass all research data, not just digital, so have investigated and estimated analogue data volumes. Will not be taking forward active management of these data formats however.
- Funder Mandates
- Improve infrastructure to facilitate research excellence
- Improve research impact
- Partner of choice
Project involved a detailed option appraisal and detailed requirements for all parts of the service: it is a requirements led project. This included business analysis (elicitation via focus groups, interviews, surveys), market investigation, supplier engagement and engaging with existing guidance and consultancy.
- Engaged DCC as consultants
- Conducted funder analysis
- Conducted DAF/CARDIO Lite Survey. Had 100+ responses from academic staff.
- Compared findings with a 2014 survey
Evidence of how much research data was held in unsecured locations helped convince senior managers, aided by a data loss event shortly before the first presentation of the business case.
Supplier and User Engagement
- Assess cost of delivery options for Active and Archive storage
- Investigated volume of existing cloud users
- Developed extensive business requirements for all deliverables
- At this point still didn’t know how much storage was required
- Explored capacity requirements based on estimating projects, considering unfunded research and plugging estimates into an algorithm with assumptions based on margins for error.
- Classified users into types based on attributes such as source, volume (low < 100GB, medium 100GB – 1TB, high 1TB – 10TB and massive > 10TB), sensitivity, location, bandwidth and I/O.
- Calculated proportion with access to available externally funded repositories
- Validate with departments and R&E
- 95% require collaboration
- 95% require offline or remote access
- 40% research unfunded (based on Trac report)
- 10% research data produced directly from equipment
- Investigated and costed storage allocations throughout the lifecycle
- Investigated delivery options for active and archive
- Assess licensing options
- Proposal for appraising data for archive storage – reducing data volumes
- Approx 250 researcher academics, 750 total research staff, 1000 research staff and RCUK students, 2100 research staff and students
- Approx 240 funded, 160 unfunded projects per year with 80 project churn
- Used user numbers for active storage and project numbers for archive storage
- Other costs include integration (to CRIS, DMPOnline, reporting tools, new services), service levels (expanding staffing, application support and advisory service that can’t be absorbed by current capacity), capex vs opex and cost recovery: estimated 35% maximum return over 4-5 year grant lifecycle after initial investment (based on Trac return). This is at an institutional level and doesn’t consider allocation to cost centres e.g. via cross charging.
- Balancing cost and control
- Data security risks of cloud can be mitigated and managed
- Data provision as it stands not adequate
- Cloud = bias towards OpEx rather than CapEx so may depend on institutional financial planning culture/rules
- Very time consuming project (with high project costs)
- Involved chasing down a lot of people and disparate information
- Analysis showed it was better to offer a universal service via overheads rather than direct costs. There will be a rate card and process for managing exceptional requirements.
- Archive storage by August, 2016
- Active storage to roll out Sep-Jan 2016
- Proposed service level is unlimited active storage per user and 2TB archive storage per project, based on 1000 users and a churn of 80 projects p.a., free at the point of use (funded via overheads).
- Educating users that they are the data owners and responsible for using and sharing data with data security in mind.
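The volume tiers and the proposed archive allocation in the Royal Holloway notes above can be sketched as follows. The tier boundaries (100GB, 1TB, 10TB), the 2TB per project archive allocation and the churn of 80 projects a year come from the talk; the function names, the use of 1TB = 1000GB and the handling of values that fall exactly on a boundary are assumptions made for illustration.

```python
GB_PER_TB = 1000  # assumption: decimal units


def volume_tier(volume_gb: float) -> str:
    """Classify a user's data volume into the tiers from the talk."""
    if volume_gb < 0:
        raise ValueError("volume cannot be negative")
    if volume_gb < 100:
        return "low"
    if volume_gb < 1 * GB_PER_TB:
        return "medium"
    # Assumption: exactly 10TB still counts as "high".
    if volume_gb <= 10 * GB_PER_TB:
        return "high"
    return "massive"


def archive_provision_tb(projects_per_year: int = 80,
                         allocation_tb: float = 2.0,
                         years: int = 1) -> float:
    """Upper bound on new archive storage if every project churning out
    each year uses its full allocation (hypothetical helper)."""
    return projects_per_year * allocation_tb * years


if __name__ == "__main__":
    print(volume_tier(250))         # a 250GB user
    print(archive_provision_tb())   # one year of project churn
```

On these figures the proposed service level implies at most 160TB of new archive storage a year, although the proposal to appraise data before archiving is intended to reduce actual volumes.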
6. Grant Funding Programme / Charlie Dormer, BIS and RCUK
Creating digital services to support the entire grant funding process, from idea generation to impact reporting, that enable the best possible funding of research excellence. The aim is a basic product by March 2017 that builds the foundation for what grant funding could/should look like in 5 years’ time.
Development is based on extensive user research, service design and iterative prototyping. Also thinking about information architecture, such as bringing guidance together in one place from four different places. Looks like gov.uk at the moment as it uses many of the GDS design patterns but may be slightly adapted to differentiate ac.uk from gov.uk.
7. Demonstration of 4C Cost Comparison Tool / Paul Stokes, Jisc
4C was a project on collaboration to clarify the costs of curation. It developed a curation costs exchange (CCEx) based on a cost comparison tool (CCT) to calculate and compare (either with peers or with your own figures year on year) the costs of preservation. It is intended to model institutional curation/preservation costs, but Jisc want to explore whether the CCT may be applicable to other parts of the architecture/lifecycle and if not, why not?
CCT allows you to create cost sets covering asset types and activities related to the data management lifecycle. Cost sets can be for different scopes and time periods. It’s an easy to use tool for modelling but the comparison functions will be improved if more organisations contribute meaningful data.
So, is it suitable for the entire RDM lifecycle (if not, why not?) and are attendees using it (if not, why not?).
Leaves open the concept of RDM and what’s included. How can you balance precision with inclusivity? Perhaps break down the activities mapping into smaller tasks. Perhaps provide guidance or prompts within the system for defining cost units. This was not included within the system because agreement couldn’t be reached. Without more common cost units it’s difficult to know if you are comparing like with like, or whether your model is too bespoke to make the comparisons useful.
There is some guidance provided on Understanding Costs that links into some of the 4C project analysis work such as a summary of cost models, an evaluation of these models and a cost concept model and specification.
Is it possible to curate some worked examples? There is variability between costs data sets currently entered into the system.
It comes back to the heterogeneity problem and the difficulties of developing common standards and approaches in this area.
8. Presentation on methods/approaches for measuring the costs and benefits of RDM / Graham Hay, Cambridge Econometrics
The session provided initial findings from the consultants Jisc have engaged to look at the economics of RDM. There is no clear economic evidence to support sustained investment in RDM; there are bespoke approaches to costing with little clarity on how to model the costs and benefits.
The aims of the business case and costing model work are therefore to:
- develop a general framework for understanding RDM and identifying costs and benefits
- review and critique methods that have been used to model cost/benefits
- recommend methods to Jisc and HEIs for modelling costs and benefits
- (if possible) present a quantified baseline for institutions involved in the shared service pilot
The framework is non-specific and indicative but hopefully easily applicable to individual institutions.
RDM is a complex domain so some simplifying assumptions have been used. Also RDM has always been done. So it is not comparing RDM vs no RDM it’s more comparing ad hoc (or passive) RDM vs quality (or purposeful) RDM.
Some things have been omitted such as:
- the disconnect between stakeholders that bear the cost and those that benefit
- funding issues and who pays (or doesn’t pay)
- advocacy issues to communicate and obtain buy-in
- how RDM activities should be distributed/organised across institutional stakeholders.
Project has identified five elements in a logic map:
- Exemplar Activities
- Immediate Outcomes (shorter term)
- Impact (longer term)
Costs are related to Inputs and Activities; benefits are related to Outcome and Impact. Discussion of the presented costs and benefits and some of the challenges like heterogeneity and uncertainty around use cases. Also prompted questions around how research data management is scoped.
Presentation of a possible cost framework that was amended at a recent workshop. The ‘ideal’ framework had institutional costs mapped to project costs and then scaled up to data volumes. There were problems raised with measuring and monitoring this. A more ‘feasible’ framework involves cost data at the service level that is approximated at the institutional level by scaling variable or semi-variable costs and calculating the costs of storing and managing data for a fixed time horizon, based on calculations and assumptions of future RDM costs. This is linked to realised/expected volumes of data at the institutional level and anticipated volumes over a fixed time horizon.
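One way to read the ‘feasible’ framework is as a fixed annual service cost plus a variable cost that scales with the volume of data held, summed over a fixed time horizon. The sketch below only illustrates that reading: the split into a fixed cost and a per terabyte cost, the function name and every figure in the usage example are hypothetical and do not come from the presentation.

```python
from typing import Sequence


def horizon_cost(fixed_cost_per_year: float,
                 variable_cost_per_tb_year: float,
                 volumes_tb: Sequence[float]) -> float:
    """Total cost over a fixed horizon (hypothetical model).

    volumes_tb holds the realised or anticipated institutional data
    volume for each year of the horizon, one entry per year.
    """
    total = 0.0
    for volume_tb in volumes_tb:
        # Fixed costs recur every year; variable costs scale with volume.
        total += fixed_cost_per_year + variable_cost_per_tb_year * volume_tb
    return total


if __name__ == "__main__":
    # Made-up figures: a 3 year horizon with growing data volumes.
    print(horizon_cost(50_000, 100, [100, 150, 200]))
```

A model this simple ignores semi-variable costs that step up at capacity thresholds, which is one of the places where the simplicity versus complexity trade-off raised in the session would bite.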
Trying to develop an activity cost structure that is inclusive enough to be applicable to each institution. This includes basic assumptions/information; breakdown of RDM by costing activity; and dimensions to consider for each activity.
Issues include complexity vs simplicity; the relative importance of each activity; and whether any simplifying assumptions can be used or whether the model should rely only on observed data.