China’s little-known efforts to protect its marine ecosystems safeguard some habitats but omit others

Description


Government PA and AGR lists
: Official government sources were always considered to be the most reliable. 2. Oceanol: The news outlet included articles and profiles of specific MPAs, including detailed descriptions, coordinates, and area of the sites. This source was regularly updated and included detailed information on SMPAs and MPAs not included in the government lists. 3. National Aquatic Resources Collection System (http://zyyh.cnfm.com.cn/): While this website had information on relatively few AGRs, it provided an organized system for recording the coordinates and area of these sites. 4. Baidu (baike.baidu.com): This platform, a China-based online encyclopedia, provided highly detailed information on conservation objectives and location for many MPAs, but was not government officiated. 5. Other websites (e.g., China Mangrove, Osgeo, Chinese news sites): While these other websites often contained only very general information on MPAs, they contained missing information for some sites. 6. Zeng 2013 (45): Although one of the first sources used in compiling the list of MPAs, this source dated from 2013 and was compiled independently rather than directly from government sources. Many MPAs had been expanded since its publication.
When a conflict between two sources arose, the data from the higher-priority source were used in the database. All sources of information were recorded for each variable of each MPA and AGR and are listed in the corresponding dataset.

Determining conservation objectives
Conservation objectives were identified based on priorities explicitly stated for MPAs, either in a dedicated "conservation objective" entry in the source material or in the name of the area (e.g., "Wuli Nanshan Mangroves"). The conservation objectives were recorded for each MPA and AGR at the highest level of detail provided in the source material. The raw data is available in a separate tab in the auxiliary dataset.
There were varying degrees of detail and syntax used to communicate objectives. Some listed very specific species and even scientific names in some cases (e.g. Atrina pectinata), others included more general common names (e.g. horseshoe crabs) or groups of species (e.g. marine mammals). Others referred to broader ecosystems (e.g. mangroves), terms of use (e.g., ecotourism), and geological or archaeological features. Terms that were considered practically interchangeable were consolidated, such as 'mangroves', 'mangrove habitat', and 'mangrove environment' which were all grouped under the term 'mangrove'. Terms like 'mammals', 'marine mammals', and 'Dugongs' were kept separate, as were different species or subspecies like 'Southern Horseshoe Crabs' versus 'Chinese Horseshoe Crabs'. This process yielded a total of 142 unique conservation objectives across the MPAs and AGRs in the dataset, which were integrated into a matrix marking the presence 'TRUE' or absence 'FALSE' of each objective for each site.
Objectives only indicate biodiversity and natural features that are of special importance for management and 'absence' of a species or feature from a given MPA or AGR does not mean it is not present nor protected by the site.
We then developed two tiers of aggregation, Aggregate 1 and Aggregate 2, to more succinctly organize, analyze, and present the 142 objective terms in a digestible manner (Tables 4 and 5 in the main body). The first, 'Aggregate 1', was used to represent the general types of environments (e.g., 'mangroves,' 'coral reefs') or types of marine organisms (e.g., 'fin fish,' 'marine mammals') that were of special importance or focus of the site. 11 unique terms were organized into Aggregate 1. Terms that were not specific to marine environments or biodiversity, such as 'Ecotourism', 'Temples', 'Beaches' and others were not included in the aggregation.
The second, 'Aggregate 2', was more detailed and identified common names for specific species (e.g., white dolphin), or groups of species (e.g., all species of clams, horseshoe crabs, and groupers were grouped under terms 'clams', 'horseshoe crabs', and 'grouper' respectively). Species were grouped in ways that would be communicable to a wide audience and did not follow a standard taxonomic approach. 62 unique terms were defined under Aggregate 2.

Determining coordinates for each MPA and AGR
We determined a single coordinate for each conservation area, which served as a centroid from which a radius was drawn to represent the spatial extent of the site. All spatial analyses were conducted in ArcGIS Desktop 10.7.1 software by ESRI (49) using China Geodetic Coordinate System 2000, China's officially recognized system (50). Using the "Buffer tool" from the "Proximity toolset" of the "Analysis toolbox," each conservation area centroid was assigned a circular buffer according to its radius as derived from its known area.
Coordinates for each site were determined based on one of the following 5 methods, in ranked in order of priority: (1) GIVEN: Centroids or coordinates provided. This method was used for 175 of 298 conservation areas included in the analysis, which collectively comprised 77.4% of the total protected area within the study's spatial scope. A handful of sites provided a single coordinate, and those coordinates were applied as the centroids. However, most sources provided a range of coordinates that were used to calculate centroids by taking the means of the longitudinal and latitudinal extremes. Any coordinates provided in degrees, minutes and seconds were converted to decimal degrees before calculating centroids.
(2) NEAR GIVEN: Exact coordinates were not provided, but coordinates were found for a landmark within or near the conservation area's boundaries. For example, coordinates provided for the Lusi Fishing Grounds were used to inform coordinates for the Lusi Fishing Grounds Fishery Conservation Zone (ID #22). This method produced centroids for 2 sites, comprising 12.0% of the total protected area.
(3) MAP: Coordinates were visually estimated using maps of the site that provided coordinates on the x and y scales. 2 MPA centroids, 0.6% of the total protected area, were processed by this method.
(4) GE (Google Earth): Searchable locations or landmarks were indicated by the name of the conservation area or a distinguishing feature nearby. The Google Earth search terms used for each of these locations are provided in the dataset. Centroids for 21 conservation areas, 1.0% of the total area, were determined using this method. Locations for areas outside of the spatial scope of the analysis (n = 4) were identified using this method. While these sites were not retained for final analysis, they are still made available with the supplementary dataset.
(5) GIS: Through the ArcMap data portal, a polygon shapefile was found for "China County City District Boundaries 2016" that contained locations and associated location codes (Table S2). Since location names (e.g. city, county, etc.) could be derived for 100% of conservation areas in the database, the retrieved locations were compared with the locations listed in the shapefile's table. Each prospective location polygon was visually assessed to ensure that it was on or near the coast, and in the correct province and sea. Furthermore, this method was never applied for cases where the conservation objective of a given MPA or AGR indicated that it was located offshore. For the conservation areas that passed the tests, a field was added to the dataset containing the shapefile's corresponding location code. A join was then executed between the set and the polygon shapefile based upon this location code field. Once joined, the centroid of each polygon was calculated and snapped to the nearest coastline. This method provided coordinates for 98 conservation areas, comprising 8.9% of the total protected area in the dataset.
Within the 249 MPAs and 53 AGRs that were included (247 MPAs and 51 AGRs within the study's focal area), there were some data limitations concerning the expanse and locations of some sites. For example, many MPAs have terrestrial areas included in their total size, but only one database (Oceanol) consistently distinguished marine and terrestrial area for MPAs that included dry land. Of the 67 MPAs listed on Oceanol, 27 reported having a terrestrial component. The sum of the total terrestrial area (km 2 ) within those 27 MPAs comprised only 4.9% of total cumulative area. Therefore, marine area (km 2 ) would be overestimated by ~5% had we assumed all 67 of these areas were 100% marine. As a result, considering Oceanol as a representative sample, it is unlikely that assuming the same ratio of land and sea for the rest of the MPAs and AGRs would lead to a significant overestimate of protected area in the analysis. Moreover, it is unlikely that MPAs or AGRs situated far from the coastline would contain any terrestrial component. We therefore proceeded with the assumption that 100% of reported area for all sites in the study as marine.

Assigning habitats
Polygons of each habitat were created using the cluster outputs from the PCA and k-means analyses (supplementary material section 3.2.) using the "Aggregate Points" tool from the "Generalization toolset" of the "Cartography toolbox" in ArcGIS. An aggregation distance was set for 0.015646 decimal degrees to eliminate the possibility of standalone points yet allow diagonals to be drawn. Once the polygons for the 16 classified habitats were developed ( Figure S1), we measured how they were distributed within the estimated boundaries for the MPAs and AGRs.
Habitats within MPAs and AGRs were identified using a two-step process. First, habitats within each site were evaluated using the "Tabulate Intersection" tool in the "Statistics toolset" of the "Analysis toolbox," by calculating area within each MPA and AGR that intersected with an individual classified habitat. This approach assigned 79.8% of the total area within MPAs and AGRs in the study to a specific habitat. It was not able to initially assign 100% of all MPAs and AGRs to a habitat because of limited refinement of available benthic and satellite data, especially along the coastline where data gaps were most prevalent with intrusion of dry land. Some MPAs and AGRs did not overlap any assigned habitats due to such limitations. Furthermore, the radian approach to estimating the extent of each MPA and AGR meant that the estimated range for many areas near the coast overlapped terrestrial zones.
The second step for assigning habitats to the remaining 20.2% of area within MPAs and AGRs incorporated one of two methods depending on whether (1) some, but not all, of the MPA or AGR was already assigned to one or more classified habitats in the first step, or (2) none of the MPA or AGR was initially assigned to one or more habitats in the first step.
For conservation areas falling under the first scenario where part but not all the area was assigned, we extrapolated the existing habitats proportionately to fill in the rest of that area. For example, if the first step only assigned 25% of an MPA to a specific habitat, 10% L1 and 15% to M1, then the rest of the unassigned area would be allocated to each of those habitats proportionately, such that a total of 40% was ultimately allocated to L1 and 60% to M1. This process of multiplying proportionate habitat allocations assigned habitats to 16.2% of area protected.
The remaining 4.0% of area in the dataset was contained within MPAs and AGRs that did not have any assigned habitats within their boundaries. For these cases, the MPAs and AGRs were manually assigned to the habitat that was closest to the centroid of the respective site. Only 1 habitat was assigned to each area through this process.

Data collection and consolidation
Four variables were considered for sea surface conditions that prior studies have incorporated for identifying habitats via remote sensing (Table S1) Cloud cover, seasonal ice (which is known to occur in parts of the Bohai Sea (54)), and other environmental conditions can sometimes prevent the collection of remote sensing data at specific locations. We were not able to collect monthly average measurements for all 120 months in the 10-year dataset for the majority of spatial data points, but more than 90% of data points had measurable data for at least 90 months. All available data within the study scope were used in the interpolation step, even locations where data availability was intermittent. However, to prioritize the locations that were more consistently measured, the interpolation included a weight for frequency of reporting.

Interpolation
A polygon shapefile of China's exclusive economic zone (the "study area") as taken from the Flanders Marine Institute (48), and the different depth zones and non-shelf zone within, were created before beginning the interpolation process. The relevant boundaries of the "World Exclusive Economic Zone Boundaries" feature layer (Table S2) were digitized to form the seaward bound of the polygon. In the nearnegligible areas where that EEZ did not meet the coast, the "Exclusive Economic Zones of Flanders Marine Institute _2018_ v_10" was used (Table S2). Excess area was then clipped to the "China Country Boundary 2018" (Table S2) to form the coastal bound of the polygon for the study area. To discriminate the shelf and non-shelf area portion, the areas beyond the continental shelf of the "World Seafloor Geomorphology" (Table S2) were digitized at a minimum scale of 1:63,360 and clipped the resulting polygon shapefile to the study area. A polygon of the continental shelf was then created using the difference between the nonshelf and study area polygons.
The "Seafloor Bathymetry (meters) -Transparent" raster file (Table S2, (55)) was imported to obtain depth points. The "Raster to Point" tool from the "Conversion toolbox" was used to convert the data into point form, which was then clipped to the shelf. Each cell of the raster produced a datapoint located in the center of the cell. Because of their even distribution, the data points were later used to estimate the proportion of habitats within the study area (see section 5.3.2).
Point shapefiles were generated from each of the 16 remote sensing tables. After clipping to the EEZ, the "Inverse Distance Weighting" (IDW) tool from the "Geostatistical toolbox" was used to interpolate each variable, using the frequency of reporting to weight the results. IDW uses known spatial values to predict unknown spatial values. It assumes that locations closer to the measurement points will have more similar characteristics than those further away. This method was chosen since it is applicable to large datasets and is bound by pre-existing minima and maxima (56). Since the resolution of the SSS remote sensing data was lower than the rest, its interpolated rasters did not cover the Bohai Sea nor northern areas of the Yellow Sea. For this reason, salinity was then removed from further analysis.
For each of the remaining 11 IDW rasters, the "Extract Multi Values to Points" tool from the "Spatial Analyst toolbox" was applied using the bathymetry shapefile as the input point feature. This created a new point shapefile comprised of the IDW raster values that existed at a given bathymetric datapoint. These files were then downloaded using the " Table to Excel" tool for use in the PCA and k-means analyses.
Following interpolation 1,268,528 datapoints remained upon which k-means analyses were performed separately as follows: Shallow n = 112,199, Medium n = 634,186, Deep n = 522,143.

PCA and K-means analyses
PCA and k-means analyses were performed in R version 4.0.0 "Arbor Day" using packages 'FactoExtra' and 'FactoMiner' (57,58). The .csv outputs from the interpolation process in ArcGIS were organized into three separate environmental data sets by their respective depth zones with the boundaries: Shallow (<10 meters), Medium (10-50 meters), and Deep (>50 meters) depth zones as defined by Harris et al. 2014 GSFM (30). The two-step PCA and k-means process (9) was performed separately for each depth zone. Following removal of SSS due to insufficient data resolution, we proceeded with SST, SST Range, and chl.-α concentration in the analysis. With seasonal variables for SST and chl.-α available, there were a combined total of 11 variables at our disposal for the k-means analysis in each depth zone. Variables were standardized and converted to z-scores such that 0 was equivalent to the mean and -1 and 1 were equal to the standard deviation of each sample. The z-scored variables were used for evaluating variables and assigning clusters in the PCA and k-means steps, respectively.
A principal components analysis (PCA) was performed to help ensure that selected variables substantially contributed to variance and would lead to the most robust clustering results. Separate PCAs were run for SST and chl.-α in each depth zone, incorporating their respective seasonal and annual averages. We selected the seasonal or annual measurement that contributed the most to variance within the data. SST range was not eligible for this process because only an annual value was available.
Per the PCAs, annual averages contributed more to variance than did seasonal averages for five of the six scenarios run for SST and chl.-α combined across the three depth zones. The one exception was SST for the low depth zone, for which spring SST contributed slightly more to variance across the dataset than annual SST. However, as there was little difference in that case, the analysis proceeded with average annual SST for the low depth zone to maintain consistency across the study.
Proceeding with annual average SST and chl.-α, an additional PCA was conducted to investigate contributions to variance of these variables combined with SST Range for each depth zone. A very strong negative correlation was observed between average annual SST and SST Range; Pearson's Correlation of -0.82, -0.95, and -0.99 for low, medium, and high depth zones respectively. Due to the redundancy of including both SST and SST Range, SST Range was dropped as a variable and the k-means analyses proceeded with average annual SST and chl.-α concentration.
Separate k-means analyses were conducted for each depth zone. A k-means analysis follows two steps such that (1) data points are distributed in Euclidean space, and a preselected number of points (or centroids/cluster centers, K) are randomly distributed in that space. Each datapoint is then assigned to the nearest centroid by straight-line Euclidean distance or the minimum square distance criterion. Then (2) the centroid's position is recalculated based on the mean of the datapoints assigned to that centroid/cluster. The process repeats until the value per the objective function (equation 1) does not significantly change between re-location of cluster centroids, indicating that the clusters are stable (59). The final centroid and datapoint pairing reflect the final cluster assignments for each datapoint. In this algorithm, refers to the position of a given data point in Euclidean space while refers to the position of the given cluster centroid k, and = 1 when the datapoint 'belongs' to cluster k and = 0 when it does not.
Cluster performance was then measured by observing explained variance for clusters as measured by between the sum of squares (BSS, equation 2) divided by the total sum of squares (TSS, equation 3) where represents the total number of data points in the cluster, the mean for each cluster, and the sample mean.

=1
(2) = ∑( − ) 2 The ratio of BSS:TSS indicates the percent of variance explained by the clusters, and is an established method for tracking cluster performance (60). We selected a preferred number of clusters that balanced complexity and cluster performance by observing the marginal increase in the percent of variance explained by this function. We selected 5 clusters for each depth zone at which point additional variance explained began to plateau. The k-means analyses achieved a BSS:TSS of 79.1% for low, 83.6% for medium, and 91.4% for high depth zones ( Figure S2). The inclusion of waters beyond the continental shelf as a single habitat provided 16 habitats from which to assess the representation of China's MPA and AGR network. Clusters were validated via pairwise t-tests for explanatory variables to ensure statistically distinct differences between them. For some datapoints, one variable could be extreme enough to define its cluster regardless of other characteristics. For example, habitat D2, characterized by high productivity in deeper waters, included extreme points ranging from the cold Bohai Sea to tropical waters off of Hainan if chlorophyll-α concentrations were high enough (about +2 SD) to statistically offset the influence of temperature on its cluster assignation ( Figure S1, S2, and main body Figure 2).  (Table S2).  Visualization 2019

Table S2
Characteristics of the spatial data derived from ArcMap 10.7.1 (49) data portal.