The Regionalised Data Base (CAPREG)

Data requirements and sources at the regional level

CAPRI aims to build a Policy Information System for the EU’s agricultural sector, regionalised at NUTS 2 level or by farm types within NUTS 2 regions, with an emphasis on the impact of the CAP. The core of the system is a regionalised or farm type agricultural sector model using an activity based non-linear programming approach. One feature of such a highly disaggregated, activity based model is the detailed information that ex ante simulations of policy scenarios provide on the outputs and inputs of specific agricultural production activities and on the relationships between them. This information is also a precondition for judging possible impacts of agricultural production on the environment. However, such a system also requires the same kind of information ex post, at least in part. In particular, for each region in the model and at least for the base year, it is necessary to define the matrix of I/O coefficients for the different production activities together with the prices of these outputs and inputs. Moreover, information on land use and livestock numbers is needed for calibration and validation.

From the beginning of the development of the CAPRI model, the regional agricultural statistics (EUROSTAT table group reg_agr) were judged to be the only harmonised source of regionalised agricultural data in the EU. Other regional Eurostat data supplement the regional agricultural statistics, so that the following tables are currently used:

  • Land use from regional landuse statistics [agr_r_landuse, discontinued table]
  • Land cover from LUCAS [lan_lcv_ovw, currently only used in COCO1]
  • Crop production - harvested areas, production and yields [table agr_r_crops]
  • Animal production - livestock numbers [table agr_r_animal]
  • Milk production [agr_r_milkpr]
  • Agricultural accounts on regional level [table agr_r_accts]
  • Structure of agricultural holdings including labour force [ef_ls_ovlsureg, ef_olslsureg, ef_oluaareg, ef_oluaareg, ef_r_nuts]

Although the content of the regional datasets has remained broadly stable over time, their naming and classification within EUROSTAT undergo continuous modification. Tables considered of low interest are discontinued (and may still be used in CAPRI for some time afterwards, as with table agr_r_landuse). New topics are also covered, providing useful data in some areas, for example the agri-environmental indicators (table reg_aei):

  • Estimated soil erosion by water, by NUTS 3 regions (aei_pr_soiler)
  • Manure storage facilities by NUTS 3 regions (aei_fm_ms)

The following table shows the availability of the different regional tables as used in the current database (with series completed up to 2014). However, coverage over time and across sub-regions differs dramatically between the tables and, within the tables, between the Member States. A second problem is the relatively high aggregation level, especially for crop production. Hence, additional sources, assumptions and econometric procedures must be applied to close data gaps and to break down aggregated data.

Table 6: Availability of regional data in the current database after 1983

Table | Official availability
Land use | from 1974, yearly
Crop production (harvested areas, production and yields) | from 1975, yearly
Animal production (livestock numbers) | from 1977, yearly
Agricultural accounts on regional level | from 1980, yearly
Structure of agricultural holdings and labour force | 2000, 2003, 2005, 2007, 2010, 2013

Source: capri\dat\capreg\regio_data_all.gdx

Methodology applied in the regional data consolidation

In the last major update in 2015, the original data were first stored in the TSV format designed by EUROSTAT:

  • In a first step, these files were converted into CSV format by an Excel macro, and an overall set of all items including their long texts was created to prepare further processing.
  • In a second step, these already GAMS-readable files were stored in GDX format in the folder “dat\capreg” and put under version control. Meta data are added in the process as well.

The result of these two steps is a single large table that comprises the time series of all data retrieved from Eurostat: land use, crop production, animal populations, cow’s milk collection and agricultural accounts.

The starting point of the methodological approach is the decision to use the consistent and complete national data base (COCO) as a frame or reference point for any regionalization. In other words, any aggregation of the main data items (areas, herd sizes, gross production and intermediate use, unit value prices and EAA-positions) of the regionalized data over regions must match the national values. This is the general rule with some exceptions.

Given that starting position, the following approaches are generally applied:

  • Data as loaded from the regional statistics are subject to some manual consistency checks (in gams\capreg\check_and_cor_regio.gms) as well as checks for regional consistency. The latter applies mainly to animal herd sizes, where data are available at the same or an even more disaggregated level than in COCO.
  • Gaps in regional data are filled, and data available only at a higher aggregation level than required in CAPRI are broken down using existing national information.
  • Fall back and other rules for assignments are structurally and (often) numerically identical for all regional units and groups of activities and inputs/outputs.
  • Econometric analysis or additional data sources are used to close gaps.

All the approaches described in the following subsections are intended only as a first crude estimate. Wherever additional data sources are available, their content should be checked, and it is often used to replace the ‘easy to use’ estimates presented here. Examples are (some) data for Norway, Sweden or Luxembourg that have been collected from national sources. The procedures described here can be thought of as a ‘safety net’ that ensures regionalized data are technically available, but not as an adequate substitute for collecting these data from additional sources.

Prices


The agricultural domain of REGIO does not cover regionalized prices. For simplicity, the regional prices are therefore assumed to be identical to the sectoral ones 1):

\begin{equation} UVAG_r=UVAG_s \end{equation}

Young animal prices are a special case since they are not included in the COCO data base (the current methodology of the EAA does not value intermediate use of animals) but are necessary to calculate income indicators for intermediate activities (e.g. raising calves). Only exported or imported live animals are implicitly accounted for by valuing the connected meat imports and exports.

Young animals are valued based on the ‘meat value’ and assumed relationships between live and carcass weights. Male calves (ICAM, YCAM) are assumed to have a final weight of 55 kg, of which 60 % is valued at veal prices. Female calves (ICAF, YCAF) are assumed to have a final weight of 60 kg, of which 60 % is valued at veal prices. Young heifers (IHEI, YHEI) are assumed to have a final weight of 300 kg, of which 54 % is valued at beef prices. Young bulls (IBUL, YBUL) are assumed to have a final weight of 335 kg, of which 54 % is valued at beef prices. Young cows (ICOW, YCOW) are assumed to have a final weight of 575 kg, of which 54 % is valued at beef prices. For piglets (IPIG, YPIG), price notations were regressed on pig meat prices; piglets are assumed to have a final weight of 20 kg, of which 78 % is valued at pig meat prices. Lambs (ILAM, YLAM) are assumed to weigh 4 kg and are valued at 80 % of sheep and goat meat prices. Chickens (ICHI, YCHI) are assumed to weigh 0.1 kg and are valued at 80 % of poultry prices.
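
The valuation rule above can be illustrated with a small Python sketch. It assumes that the value per head is simply final weight times valued share times the meat price per kg; this reading, the function name and the meat price keys are illustrative assumptions and are not taken from the CAPRI code.

```python
# Final weights (kg) and valued shares as listed in the text above.
# The meat price keys ("veal", "beef", ...) are hypothetical labels.
YOUNG_ANIMALS = {
    "YCAM": (55.0, 0.60, "veal"),
    "YCAF": (60.0, 0.60, "veal"),
    "YHEI": (300.0, 0.54, "beef"),
    "YBUL": (335.0, 0.54, "beef"),
    "YCOW": (575.0, 0.54, "beef"),
    "YPIG": (20.0, 0.78, "pig_meat"),
    "YLAM": (4.0, 0.80, "sheep_goat_meat"),
    "YCHI": (0.1, 0.80, "poultry"),
}


def young_animal_value(code, meat_prices):
    """Value per head, assuming weight * share * meat price per kg."""
    weight, share, meat = YOUNG_ANIMALS[code]
    return weight * share * meat_prices[meat]


if __name__ == "__main__":
    # Purely illustrative price of 5 currency units per kg of veal.
    print(young_animal_value("YCAM", {"veal": 5.0}))
```

Under these assumptions a male calf is worth 33 kg of veal, so its value moves one to one with the veal price.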

Sugar beet prices are another special case. They are still determined in a program (‘sugar\price_est.gms’) inherited from the 2003 EuroCARE sugar study (Henrichsmeyer et al. 2003). It derives sugar beet prices from sugar prices, levies and partial survey results from the 1990s. The estimation results are then used to determine the beet price differentiation in subsequent years as well. It is noteworthy that the same program is applied in CAPREG (via quotasprices.gms) and in CAPMOD (via data_prep.gms) to determine base year beet prices.

Activity Levels

In cases where data on regional activity levels are missing, a linear trend line is estimated for regional and Member State time series in the definition of the regional database. The gap is then filled with a weighted average of two components: the trend line, with a weight of R², and a weighted average of the available observations around the gap, with a weight of 1-R². This formulation has the following properties. If a time series shows a strong trend, the back-casted and forecasted numbers will be dominated by the trend, as the weight R² will be high. With decreasing R², the estimated values are pulled towards the known values.
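
A minimal Python sketch of this gap-filling rule is given below. The text does not specify how the "weighted average of the available observations around the gap" is weighted, so inverse-distance weights over all observed years are used here purely as an assumption; the function name is hypothetical as well.

```python
def fill_gaps(years, values):
    """Fill None entries with r2 * trend + (1 - r2) * local average.

    Needs at least two observations in different years.
    """
    obs = [(t, v) for t, v in zip(years, values) if v is not None]
    n = len(obs)
    mean_t = sum(t for t, _ in obs) / n
    mean_v = sum(v for _, v in obs) / n
    # Ordinary least squares for the linear trend v = intercept + slope * t.
    sxx = sum((t - mean_t) ** 2 for t, _ in obs)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in obs)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    ss_tot = sum((v - mean_v) ** 2 for _, v in obs)
    ss_res = sum((v - (intercept + slope * t)) ** 2 for t, v in obs)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

    filled = []
    for t, v in zip(years, values):
        if v is not None:
            filled.append(v)
            continue
        trend = intercept + slope * t
        # Assumption: observations closer to the gap count more (1 / distance).
        weights = [(1.0 / abs(t - t_obs), v_obs) for t_obs, v_obs in obs]
        local = sum(w * v_obs for w, v_obs in weights) / sum(w for w, _ in weights)
        filled.append(r2 * trend + (1.0 - r2) * local)
    return filled
```

For a series that lies exactly on a straight line, R² equals one and the gap is filled from the trend alone; for a series without any trend, the filled value is the local average of the known observations.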

Apart from gap filling, another problem is that annual cropland statistics at the regional level cover only a few crop activities (cereals with wheat, barley, grain maize, rice; potatoes, sugar beet, oil seeds with rape and sunflower; tobacco, fodder maize; grassland, permanent crops with vineyards and olive plantations). The COCO data base, however, covers some 30 different crop activities. In order to break these aggregates down to COCO definitions, the national shares within the aggregate are used.

As an example, this approach is explained for cereals. Data on the production activities WHEA (wheat = SWHE+DWHE), BARL (barley), MAIZ (grain maize) and PARI (paddy rice) as found in COCO match directly the level of disaggregation in the regional data. Therefore, the mapped regionalized data are directly set equal to the corresponding values in the regional “raw” data. The difference between the sum of these 4 activities and the aggregate data on cereals in the regional raw data must be equal to the sum of the remaining activities in cereals as shown in COCO, namely RYE (rye and meslin), OATS (oats) and OCER (other cereals). As long as no other regional information is available, this difference from the regional raw data is hence broken down applying national shares.

The approach is shown for OATS in the following equations, where the suffix r stands for regional data:

\begin{align} \begin{split} LEVL_{OATS,r} &= (CEREAL_r\\ &\quad -WHEAT_r-BARLEY_r-MAIZEGR_r-RICE_r)\cdot\\ &\quad\frac{LEVL_{OATS,COCO}}{(LEVL_{OATS,COCO}+LEVL_{RYE,COCO}+LEVL_{OCER,COCO})} \end{split} \end{align}
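
As an illustration, the breakdown of the residual cereal area by national shares can be sketched in Python as follows. The function name and the dictionary layout are assumptions made for this example; the regional keys mirror the symbols in the equation above.

```python
def split_residual_cereals(regional, national_levels):
    """Distribute the regional cereal residual using national COCO shares.

    regional: regional raw data with keys CEREAL, WHEAT, BARLEY, MAIZEGR, RICE
    national_levels: national COCO levels of the residual activities
                     (OATS, RYE, OCER)
    """
    directly_mapped = ("WHEAT", "BARLEY", "MAIZEGR", "RICE")
    residual = regional["CEREAL"] - sum(regional[k] for k in directly_mapped)
    national_total = sum(national_levels.values())
    return {
        activity: residual * level / national_total
        for activity, level in national_levels.items()
    }


if __name__ == "__main__":
    region = {"CEREAL": 100.0, "WHEAT": 40.0, "BARLEY": 20.0,
              "MAIZEGR": 10.0, "RICE": 0.0}
    coco = {"OATS": 50.0, "RYE": 30.0, "OCER": 20.0}
    print(split_residual_cereals(region, coco))
```

In this made-up example the regional residual of 30 is split 15 / 9 / 6 across OATS, RYE and OCER, in line with their national shares of 50 %, 30 % and 20 %.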

Similar equations are used to break down other aggregates and residual areas in the regional data 2). The Farm Structure Survey (FSS) provides crop areas for a larger number of crops, but this survey is usually conducted only every three years. Data from the FSS, when available, are also used to approximate crop areas at the regional level.

One important advantage of the approach is that the resulting areas are automatically consistent with the national data if the incoming information from REGIO was consistent with the national level. Fortunately, the regional information on herd sizes covers most of the data needed to provide good proxies for all animal activities in the COCO definition. The regional breakdown of herd sizes is often more detailed than COCO, at least for the important sectors. Regional estimates of the activity levels are therefore the result of an aggregation approach, in contrast to crop production.

In order to generate good starting points for the following steps of data processing and to avoid systematic deviations between regional and national levels in the following consistency steps, all regional levels in REGIO are first scaled by the ratio of the (national) results in COCO to the regional results aggregated to the national level (key file is gams\capreg\map_from_regio.gms).

Besides technological plausibility and a good match with existing regional statistics, the regionalized data for the CAPRI model must also be consistent with the national level. The minimum requirement for this consistency covers activity levels and gross production. The “initialisation” of the regional database described above is designed to meet this requirement as well as possible but cannot guarantee it. Consistency for activity levels is therefore based on a Highest Posterior Density estimator which ensures (in gams\capreg\cons_levls.gms):

  1. Adding up of activity levels from lower regional level (NUTS II, NUTS I) to higher ones (NUTS I, NUTS 0)
  2. Adding up of crop areas to UAA at regional level.

For animal herds, the objective function minimizes simple squared relative deviations from the given herd sizes. For crops, it combines absolute squared differences of the crop shares in UAA, with a weight of 25%, and relative squared differences, with a weight of 75%. In the crop sector, consistency is also imposed on regional transition matrices for the 6 UNFCCC land use categories relevant for carbon accounting (forest land, cropland, grassland, settlements, wetlands, residual land), which are initialised from the national transition matrix estimated in the COCO1 module.
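
To give an intuition for the herd size case, the following Python sketch solves a strongly simplified version of the problem: it minimizes the sum of squared relative deviations from the initial regional herds subject to a single adding-up constraint to the national herd. This is a stand-in for illustration only and not the estimator in cons_levls.gms; it ignores the nested NUTS levels, bounds and non-negativity, and the function name is hypothetical. For this simplified problem the Lagrangian yields a closed-form solution in which each region is adjusted in proportion to the square of its initial herd.

```python
def reconcile_herds(regional_herds, national_herd):
    """Minimize sum(((x_r - x0_r) / x0_r) ** 2) subject to sum(x_r) = national_herd.

    Closed form: x_r = x0_r + gap * x0_r ** 2 / sum(x0 ** 2),
    where gap = national_herd - sum(x0).
    """
    gap = national_herd - sum(regional_herds.values())
    sum_sq = sum(v * v for v in regional_herds.values())
    if sum_sq == 0:
        # Nothing to distribute the gap on; return the input unchanged.
        return dict(regional_herds)
    return {r: v + gap * v * v / sum_sq for r, v in regional_herds.items()}


if __name__ == "__main__":
    print(reconcile_herds({"A": 100.0, "B": 200.0}, 330.0))
```

With relative deviations in the objective, large regions absorb more than a proportional part of the gap, since a given absolute change is a smaller relative change for them.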

A specific problem is that land use statistics do not report a breakdown of idling land into obligatory set-aside, voluntary set-aside and fallow land 3). Equally, the share of oilseeds grown as energy crops on set-aside needs to be determined. A Highest Posterior Density estimator is used (in gams\capreg\cal_seta.gms) to ‘distribute’ the national information on the different types of idling land to the regional level, subject to the following restrictions:

  • Obligatory set-aside areas must be equal to the set-aside obligations derived from areas and set-aside rates for Grandes Cultures (which may differ at regional level according to the share of small producers). For these crops, activity levels are partially endogenous in the estimation in order to allow a split of oilseeds into those grown under the set-aside obligations and those grown as non-food crops on set-aside.
  • Obligatory and voluntary set-aside cannot exceed certain shares of the crops subject to set-aside (at least before the Agenda 2000 policy).
  • Fallow land must equal the sum of obligatory set-aside, voluntary set-aside and other idling land.
  • Total utilisable area must stay constant.

In some cases, areas reported as fallow land are smaller than set-aside obligations. In these cases, parts of grassland areas and ‘other crops’ are allowed to be reduced.

Production and yields

The procedure for gross output (GROF) is similar to the one for activity levels, as correction factors are applied to line up regional yields with given national production:

\begin{align} \begin{split} CORR_{GROF,o} &= GROF_{o,n}/\sum_{j,r}{Levl_{j,r}O_{j,r}}\\ O_{j,r}^*&=O_{j,r} \cdot CORR_{GROF,o} \end{split} \end{align}
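
The yield correction can be sketched in Python for a single output. The correction factor is taken here as the ratio of national production to the production implied by regional levels and yields, which is the direction needed for the corrected yields to add up to the national figure. The function name and the data layout, with (activity, region) tuples as keys, are assumptions for this example.

```python
def correct_yields(levels, yields, national_production):
    """Scale regional yields so that implied production matches the national total.

    levels, yields: dicts keyed by (activity, region)
    Returns the corrected yields and the correction factor.
    """
    implied = sum(levels[key] * yields[key] for key in levels)
    corr = national_production / implied
    corrected = {key: yields[key] * corr for key in yields}
    return corrected, corr


if __name__ == "__main__":
    levels = {("SWHE", "R1"): 10.0, ("SWHE", "R2"): 20.0}
    yields = {("SWHE", "R1"): 5.0, ("SWHE", "R2"): 4.0}
    print(correct_yields(levels, yields, 143.0))
```

In this made-up example the regional data imply a production of 130 against a national value of 143, so all regional yields are raised by 10 %.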

In case of missing statistical information for regional yields, national yields are used. A special rule is used for fodder maize yields, where regional yields are derived from national fodder maize yields, and the relation between regional and national average cereal yields.
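
The fallback rules for missing yields might look as follows in Python. The sketch assumes that the fodder maize rule is applied only when no regional fodder maize yield is reported; the function name and its arguments are hypothetical.

```python
def regional_yield(reported, national, regional_cereal=None, national_cereal=None):
    """Return a regional yield, falling back on national information.

    reported: regional yield from statistics, or None if missing
    national: national yield of the same crop
    regional_cereal, national_cereal: average cereal yields, passed only
        for fodder maize (assumption: used only when `reported` is missing)
    """
    if reported is not None:
        return reported
    if regional_cereal is not None and national_cereal:
        # Fodder maize: national yield scaled by the relative cereal yield.
        return national * regional_cereal / national_cereal
    return national


if __name__ == "__main__":
    # Region with cereal yields 20 % above the national average.
    print(regional_yield(None, 40.0, regional_cereal=6.0, national_cereal=5.0))
```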

For grassland and fodder from arable land, missing yields are derived from national ones using the relation between regional and national stocking densities of ruminants, in combination with an assumed share of concentrates, expressed as a weighted sum of energy and protein, per ruminant activity in CAPRI. Those shares are then scaled with a uniform factor so that, on average, the available energy and protein from concentrates are exhausted at the national level. Accordingly, higher fodder yields are expected where ruminant stocking densities are high, taking differences in concentrate shares into account. If, for example, the stocking densities stem solely from sheep and goats, the assumed impact on yields is higher. To avoid unrealistically low or high yields, they are bounded to a range of 25% to 400% of the regional aggregate.

The input allocation in any given year should not be linked to realised, but to expected yields. Expected yields are constructed using the following modified Hodrick-Prescott filter:

\begin{equation} \text{min} \quad hp=1000 \sum_{1<t<T-1}({y_{t+1}^*-y_{t-1}^*})^2 + \sum_{t}({y_t^*-y_t})^2 \end{equation}

where y covers all output coefficients in the data base and y* denotes the corresponding expected (smoothed) value. The Hodrick-Prescott filter is applied at both the national and the regional level after any gaps in the time series have been closed.
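
Because the objective is quadratic in y*, its minimum can be found by solving a linear system. The Python sketch below does this for a single time series without external libraries. It follows the objective as printed above, including the summation bounds 1 < t < T-1 and the weight of 1000; the function names are hypothetical and the code is not taken from the CAPRI sources.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def expected_yields(y, weight=1000.0):
    """Minimize weight * sum((y*[t+1] - y*[t-1])**2) + sum((y*[t] - y[t])**2).

    Setting the gradient to zero gives (I + weight * D'D) y* = y, where each
    row of D holds +1 at position t+1 and -1 at position t-1.
    """
    n = len(y)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # 1 < t < T-1 in 1-based years corresponds to t = 1 .. n-3 in 0-based indexing.
    for t in range(1, n - 2):
        lo, hi = t - 1, t + 1
        a[lo][lo] += weight
        a[hi][hi] += weight
        a[lo][hi] -= weight
        a[hi][lo] -= weight
    return solve_linear(a, [float(v) for v in y])
```

A constant series is returned unchanged, since the penalty term is zero for it. For other series, the high weight pulls expected yields that are two years apart close to each other.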

Final steps of regional data completion

The regional database modules also cover some aspects which are discussed in other parts of this documentation.

  • For policy data at the regional level (mostly premium related data) see Section Policy data. These policy related assignments require a good part of the CAPREG module.
  • For the fertiliser and feed allocations and environmental indicators, also important elements of the regional database, see the next Section Input Allocation.
  • Towards the end of the regional data base consolidation, supply side PMP parameters are calibrated as a final test of consistency and sometimes to serve as starting values for the subsequent baseline calibration (in gams\capreg\pmp.gms).

Build and compare time series of GHG inventories

The regionalised data base module CAPREG runs in two steps:

  • The first step prepares regional time series covering activities, production, land use and the fertiliser allocation.
  • The second step involves more time-consuming processing, which is therefore executed only for the selected base year: feed allocation, computation of GHG results, and the final calibration test.

To assess the reliability of the CAPRI database in terms of GHG results against official UNFCCC notifications, results from the first step (time series) were insufficient, as the GHG accounting also requires information on the feed allocation. This problem was addressed within the scope of the IDEAg (Improving the quantification of GHG emissions and flows of reactive nitrogen) project 4), where an option was introduced to allow for a consistent accounting of GHG emissions over time. This option can combine input information from CAPREG time series runs with (short run, nowcasting-style) CAPMOD simulation results. Furthermore, an R-based tool was added to the CAPRI GUI that maps GHG emissions data from CAPRI to the GHG emission balances contained in the National Inventory Reports (NIRs) that countries submit annually in compliance with UNFCCC GHG reporting obligations.

1) There is no easy way to relax this assumption if no further data sources are available.
2) If no data at all are found, the share on the utilisable agricultural area is used.
3) The necessary additional information on non-food production on set-aside, obligatory and voluntary set-aside areas can be found on the DG-AGRI web server.
4) The IDEAg project was commissioned by the JRC-IES in Ispra in 2015 and was carried out by the Thünen Institute in cooperation with the JRC-IES (August 2015 – August 2016). A more detailed explanation of the CAPRI task “Build GHG inventories” and its use has been prepared by the Thünen contributors at the time, Sandra Marquardt and Alexander Gocht, see capri\doc\GHG_inventory_module.docx.
the_regionalised_data_base_capreg.txt · Last modified: 2020/03/31 07:09 (external edit)