Grid deployment planning

Mirco Mazzucato, Les Robertson

DRAFT 14 July 2004

This is a proposal for the LCG high level milestones associated with deploying the LCG Grid. It has been modified after discussion in the PEB on 9 April 2002, and following a phone call with Ruth Pordes, Matthias Kasemann and Lothar Bauerdick on 11 April. It will be discussed again in the PEB of 23 April, when it is hoped to reach agreement on the milestones and dependencies as they should appear in the High-Level Plan.


1. Milestones

High Level Milestones:

milestone date due name and description
M1.gd-A Apr 03 LCG Global Grid Service (LCG-1) available
    Deploy a reliable Global Grid Service offering 24x7 availability, including around ten Regional Centres in Europe, Asia and North America. The grid provides a batch service for all four experiments for event production and analysis of the simulated data set. The middleware deployed will be the "converged" European-US toolkit emerging from GLUE. This milestone is a functionality and existence test, with a related milestone 6 months later for which throughput and reliability will be the key measures. Target performance levels will be specified 6 months before the milestone date. All of these targets must be sustained during a 7-day period for the milestone to be considered met. Examples could be:
  • all four experiments must be able to use it;
  • capacity: at least twice the capacity then available in the CERN Prototype (i.e. >1600 processors, >1600 disks, >160 TB of disk storage space);
  • throughput (capacity delivered as a percentage of capacity available): 50%;
  • reliability (% of successful jobs): 80%;
  • availability: at least 60% of the service is available 95% of the time.
M1.gd-B Oct 03 LCG-1 meets performance and reliability targets
    This is the performance milestone related to M1.gd-A. The target levels will be specified 6 months before the due date of M1.gd-A, and must all be sustained during a 30-day period for the milestone to be considered met.
  • capacity: more than 3 times the capacity then available in the CERN Prototype (i.e. >2400 processors, >2400 disks, >240 TB of disk storage space);
  • throughput (capacity delivered as a percentage of capacity available): about 90% of that provided at CERN on the LXBATCH service;
  • reliability (% of successful jobs): 95%;
  • availability: at least 90% of the service is available 99% of the time.
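
The example targets for M1.gd-A can be read as a simple acceptance test. The sketch below is one possible reading, not an agreed definition: it assumes one measurement record per day, treats a day as passing only if throughput, reliability and availability all meet their targets on that day, and counts the milestone as met once the required number of consecutive days pass. The names DailySample, Targets, day_meets and milestone_met are hypothetical, as is the choice of sampling the available fraction of the service at intervals during the day. The M1.gd-B throughput target is relative to LXBATCH and is not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DailySample:
    capacity_delivered: float         # e.g. CPU-hours actually used by jobs
    capacity_available: float         # CPU-hours that were on offer
    jobs_succeeded: int
    jobs_total: int
    service_fraction_up: List[float]  # fraction of the service up, sampled through the day


@dataclass
class Targets:
    throughput: float        # capacity delivered / capacity available
    reliability: float       # successful jobs / all jobs
    service_fraction: float  # "at least X of the service ..."
    time_fraction: float     # "... is available Y of the time"
    window_days: int         # length of the sustained period


def day_meets(sample: DailySample, t: Targets) -> bool:
    """True if a single day satisfies all three targets (assumed per-day evaluation)."""
    if sample.capacity_available <= 0 or sample.jobs_total <= 0 or not sample.service_fraction_up:
        return False  # no usable measurement counts as a failed day
    throughput = sample.capacity_delivered / sample.capacity_available
    reliability = sample.jobs_succeeded / sample.jobs_total
    up = [f >= t.service_fraction for f in sample.service_fraction_up]
    availability = sum(up) / len(up)
    return (throughput >= t.throughput
            and reliability >= t.reliability
            and availability >= t.time_fraction)


def milestone_met(samples: List[DailySample], t: Targets) -> bool:
    """True if some run of t.window_days consecutive days all meet the targets."""
    run = 0
    for s in samples:
        run = run + 1 if day_meets(s, t) else 0
        if run >= t.window_days:
            return True
    return False


# Example targets quoted for M1.gd-A: 50% throughput, 80% reliability,
# 60% of the service available 95% of the time, sustained for 7 days.
M1_GD_A = Targets(throughput=0.50, reliability=0.80,
                  service_fraction=0.60, time_fraction=0.95, window_days=7)

good = DailySample(600.0, 1000.0, 90, 100, [0.7] * 24)  # invented figures
bad = DailySample(400.0, 1000.0, 90, 100, [0.7] * 24)   # throughput only 40%

print(milestone_met([good] * 7, M1_GD_A))
print(milestone_met([good] * 6 + [bad] + [good] * 6, M1_GD_A))
```

With these invented figures, seven good days in a row satisfy the test, while a single bad day in the middle of thirteen does not, because neither side of it reaches seven consecutive days.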

This service would be used for the "5% challenges" of the experiments.

LCG-1 will be operated continuously, evolving in terms of capacity, performance and functionality. Additional Regional Centres will be included as they come on-line. A second major release of the service will be made during 2004. However, these developments will not be the subject of a level-1 milestone.

M1.gd-C Dec 04 LCG-3 ready for LHC production service verification
    LCG-3 will include all essential functionality required for the initial LHC production service. This milestone will be met when specified levels of performance and reliability have been sustained for a period of 30 days. These target levels will be defined 6 months before the due date of this milestone. LCG-3 will be used as a proof that the LHC computing model will work, including Tier 0, 1, 2 and 3 Regional Centres, providing practical backup for the computing service TDR. LCG-3 will use the LHC Grid Toolkit, will have 50% of the components required for the 2007 production service for CMS or ATLAS, and will be used for the "20% milestones" of the experiments.
M1.tdr Jun 05 Computing Service TDR available
    The Computing Service TDR will specify the requirements for the Grid that will be used as the first production service for the four LHC experiments. It will include details of the architecture, functionality, capacity, performance, throughput and availability. It will include the Regional Centre plans that will have been developed to meet these requirements, and will provide cost estimates and an overall installation and verification schedule. It is assumed that the TDR will be approved by the LHCC within three months following its availability. The full process from acquisition to service verification is expected to take 12-18 months (according to the administrative procedures of the Regional Centres). The initial service must be in full production by September 2006 (6 months before data taking). The TDR will therefore be approved after the acquisition procedures have started, but before orders are placed.
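
The schedule reasoning in the M1.tdr description can be checked with simple month arithmetic. The sketch below is illustrative only: it assumes every date can be represented as the first day of its month, and the helper add_months is a hypothetical name introduced here. It works backwards from the September 2006 production date using the 12-18 month acquisition-to-verification range, and forwards from the June 2005 TDR date using the three-month approval assumption.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date by a whole number of months (may be negative)."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


tdr_available = date(2005, 6, 1)             # M1.tdr due date
approval_by = add_months(tdr_available, 3)   # approved within three months
production = date(2006, 9, 1)                # initial service in full production
data_taking = add_months(production, 6)      # production is 6 months before data taking

# Latest dates on which acquisition can start, for the two ends of the
# 12-18 month range quoted for acquisition through service verification.
latest_start_12 = add_months(production, -12)
latest_start_18 = add_months(production, -18)

for label, d in [("TDR available", tdr_available),
                 ("TDR approved by", approval_by),
                 ("latest acquisition start (18-month process)", latest_start_18),
                 ("latest acquisition start (12-month process)", latest_start_12),
                 ("full production", production),
                 ("data taking", data_taking)]:
    print(f"{label}: {d:%b %Y}")
```

Under these assumptions acquisition must start between March 2005 and September 2005, while approval is expected by September 2005. For Regional Centres at the 18-month end of the range, acquisition therefore begins before the TDR is even available, which is consistent with the statement that approval comes after acquisition procedures have started.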

Level 2 Milestones, including major external dependencies:

Short-term planning - Milestones associated with the LCG-1 release and M1.gd-A

milestone date due name and description related milestone
EM2.edg-1 Jul 02 Datagrid Testbed 1 used in Data Challenges of LHC Experiments M1.gd-A
    This milestone is a deliverable to LCG by the Datagrid Project.

Datagrid Testbed 1 deployed at a number of Regional Centre sites, and used by the LHC experiments in their 2002 Data Challenge programmes for event production. This is a demonstration of functionality and reliability rather than performance and scaling. Only a small number of sites need be involved. The major reliability metric could be in terms of the number of events generated and data transferred during a specified time period.

The details of this milestone need to be agreed with Datagrid, the experiments, and the Regional Centres concerned. The milestone will be considered to have been met if the reliability/throughput targets are achieved for at least two of the experiments.

 
EM2.usuf-1 ??? Grid Testbed of US CMS and/or ATLAS User Facilities used in Data Challenge(s) M1.gd-A
    This milestone is a deliverable to LCG by either or both of the US User Facilities Projects. The US Grid projects (iVDGL, PPDG, GriPhyN) provide some of the resources for these testbeds, but the ATLAS and CMS User Facility Projects are responsible for deploying and operating the testbeds. The situation of ALICE concerning US sites has still to be clarified.

The milestone is a parallel of the EM2.edg-1 in Europe. A Grid testbed using middleware selected by the US User Facility Projects is deployed at a number of Regional Centre sites, and used by LHC experiments in their 2002 Data Challenge programmes for event production. This is a demonstration of functionality and reliability rather than performance and scaling. Only a small number of sites need be involved. The major reliability metric could be in terms of the number of events generated and data transferred during a specified time period.

The details of this milestone need to be agreed with the ATLAS and/or CMS User Facility Projects.

 
M2.hepcal May 02 HEPCAL RTAG completed EM2.glue
    Final report available from the RTAG on HEP requirements for Grid Middleware  
EM2.glue Sep 02 GLUE Recommendations for converged toolkit available EM2.edg-2, EM2.ivdgl-2
    GLUE is a collaboration between iVDGL and DataTAG. This milestone is a deliverable to the LCG by this collaboration. The timing and details need to be agreed with these projects.  
M2.gd-1 Oct 02 LCG-1 release plan complete M1.gd-A, M1.gd-B
    Definition of the LCG-1 release completed - includes: definition of software components, integration and testing (operating system; middleware selection, expected to be based on the GLUE recommendations, applications environment); capacity, reliability and performance targets for M1.gd-A and M1.gd-B; regional centre planning; network planning; validation process; operations plan; support and deployment plan. This will include details of the commitments by regional centres and other institutes for the provision of resources and infrastructure. It will also include details of the dependencies on other projects (e.g. middleware providers).  
EM2.edg-2 Dec 02 Delivery of LCG-1 middleware for EDG-domain sites M1.gd-A
    This is a deliverable of the EDG project to LCG.  
EM2.usuf-2 Dec 02 Delivery of LCG-1 middleware for US sites M1.gd-A
    It is not at present clear how the integration and support of middleware used by US sites will be done. It is assumed, however, that this is the responsibility of the US User Facility Projects of ATLAS and CMS. Hopefully there is a single set of middleware used by both experiments. The ALICE sites will have to make their own arrangements with one of the projects or with the EDG for the supply of middleware. The current assumption is that, for non-US sites, the LCG will provide the service to build, distribute and support the complete LCG package - applications environment, middleware, etc., including the EDG middleware. The responsibilities for US sites need to be clarified.  
M2.gd-2 Dec 02 LCG-1 infrastructure and support mechanisms in place M1.gd-A
    All of the infrastructure needed to deploy LCG-1 ready and where appropriate in operation - includes CA infrastructure, information services, certification service, distribution service, call centre, documentation (user guide, installation guide, operations manual), equipment installed and operational in participating Regional Centres.  

Level-2 Milestones associated with later high-level milestones

milestone date due name and description related milestone
M2.gd-3 Mar 03 Fully revised LCG computing model published M2.gd-4
    A full review of the LCG computing model will be undertaken starting in the autumn of 2002 (following reviews by the experiments of their models, and availability of the PASTA III technology review).  
M2.gd-4 Oct 03 Definition of LCG-2 M2.gd-5
    LCG-2 is the second major release of the LCG production grid service. This definition will include the middleware selection (note that at this time the final EDG toolkit will be available as one of the choices); the minimum set of Regional Centres to be included (full range of categories to enable testing of the full computing model); throughput, performance and stability targets.  
M2.gd-5 Mar 04 LCG-2 in operation M1.gd-C
M2.gd-6 Mar 04 Definition of LCG-3 M1.gd-C
    LCG-3 will include all essential functionality required for the initial LHC production service. This definition will include the middleware selection; the minimum set of Regional Centres to be included; throughput, performance and stability targets.  
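
The "related milestone" column defines a small dependency graph, and its dates can be checked mechanically. The sketch below is a minimal consistency check, assuming that a milestone should not fall due later than any milestone it is related to. The table contents are transcribed from the milestone lists above as (year, month) pairs; the dictionary layout and the function name check_milestones are hypothetical. A due date given as "???" is recorded as None.

```python
# name -> ((year, month) or None, [related milestones])
MILESTONES = {
    "M1.gd-A":    ((2003, 4),  []),
    "M1.gd-B":    ((2003, 10), []),
    "M1.gd-C":    ((2004, 12), []),
    "M1.tdr":     ((2005, 6),  []),
    "EM2.edg-1":  ((2002, 7),  ["M1.gd-A"]),
    "EM2.usuf-1": (None,       ["M1.gd-A"]),   # due date given as "???"
    "M2.hepcal":  ((2002, 5),  ["EM2.glue"]),
    "EM2.glue":   ((2002, 9),  ["EM2.edg-2", "EM2.ivdgl-2"]),
    "M2.gd-1":    ((2002, 10), ["M1.gd-A", "M1.gd-B"]),
    "EM2.edg-2":  ((2002, 12), ["M1.gd-A"]),
    "EM2.usuf-2": ((2002, 12), ["M1.gd-A"]),
    "M2.gd-2":    ((2002, 12), ["M1.gd-A"]),
    "M2.gd-3":    ((2003, 3),  ["M2.gd-4"]),
    "M2.gd-4":    ((2003, 10), ["M2.gd-5"]),
    "M2.gd-5":    ((2004, 3),  ["M1.gd-C"]),
    "M2.gd-6":    ((2004, 3),  ["M1.gd-C"]),
}


def check_milestones(milestones):
    """Return a list of human-readable problems found in the milestone table."""
    problems = []
    for name, (due, related) in milestones.items():
        if due is None:
            problems.append(f"{name}: no due date")
        for target in related:
            if target not in milestones:
                problems.append(f"{name}: related milestone {target} is not defined")
                continue
            target_due = milestones[target][0]
            # (year, month) tuples compare chronologically
            if due is not None and target_due is not None and due > target_due:
                problems.append(f"{name}: due after related milestone {target}")
    return problems


for problem in check_milestones(MILESTONES):
    print(problem)
```

Run against the dates as transcribed, the check finds no ordering conflicts and reports two open items: EM2.usuf-1 has no due date, and EM2.glue refers to EM2.ivdgl-2, which does not appear as a row in either level-2 list.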

2. Grid Deployment - responsibilities

Basic assumption is that overall responsibilities are shared according to the "Gordon Trident":

The short term strategy (next twelve months) is

The following table suggests areas of work necessary to achieve the LCG-1 service, with proposed responsibilities, i.e. those entities that should take responsibility for the area in 2003.

area of work | examples of activities | suggested responsibility
Grid middleware | integration of middleware into toolkits; maintenance and release process; implementation of GLUE proposals for "converged" middleware; provision of the agreed LCG-1 middleware | EDG, US User Facility Projects
2002 testbed operation | testbeds deployed and operated for use by data challenges in 2002 | EDG, US User Facility Projects
GLUE recommendations | Specification of "converged" middleware | DataTAG, iVDGL
Distribution tools | Common distribution tools that can be used for Grid Middleware, any common Fabric components, LCG common applications material and the experiments' environments. To be investigated if the tools developed in the LCG applications area will be sufficient for the other components. | LCG applications area for requirements and specification; ??? for tool development
Call centre | helpdesk; high-level problem management; training for installation, experiment experts, users | LCG - IN2P3 Lyon (?)
LCG-1 distribution | integration of the LCG-1 package (including middleware, common applications tools, experiment environment); certification; management of change process | LCG - CERN
CA operation | CAs will be operated by national/HEP authorities; LCG Grid Deployment will coordinate this as necessary for LCG-1 | LCG Grid Deployment area
LCG-1 policies | licencing, user rules, access policies, etc. | LCG - agreed in GDMB
LCG-1 infrastructure | information services infrastructure; VO management; user registration | coordinated by LCG - CERN; operated by Regional Centres
LCG-1 documentation | installation guide; operations manual; users guide | LCG - ????
LCG-1 global grid operation | Regional Centres are responsible for operating the fabric at their sites, and interfacing to the operations centre. The Operations Centre (maybe this is distributed?) monitors the operation of the grid; reacts to problem situations; informs the call centre of current status. | Regional Centres; LCG - Global operations centre?

 


3. Further work - actions and agreements needed:

# description target date
1 Agree in PEB on high level milestones 23 Apr 02
2 Agree in PEB on level-2 milestones and dependencies 23 Apr 02
3 Negotiate agreements (synchronisation, "buy-in") with grid projects (EDG, iVDGL, DataTAG, GLUE), US User Facilities 15 May 02
4 Negotiate agreements with Regional Centres 15 May 02
5 Understand human resources - who is responsible for which parts of the deployment and operation 15 May 02