Ensemble flash flood predictions using a high-resolution nationwide distributed rainfall-runoff model: case study of the heavy rain event of July 2018 and Typhoon Hagibis in 2019

The heavy rain event of July 2018 and Typhoon Hagibis in October 2019 caused severe flash flood disasters in numerous parts of western and eastern Japan. Flash floods need to be predicted over a wide range with long forecasting lead time for effective evacuation. The predictability of flash floods caused by the two extreme events is investigated by using a high-resolution (~ 150 m) nationwide distributed rainfall-runoff model forced by ensemble precipitation forecasts with 39 h lead time. Results of the deterministic simulation at nowcasting mode with radar and gauge composite rainfall could reasonably simulate the storm runoff hydrographs at many dam reservoirs over western Japan for the case of heavy rainfall in 2018 (F18) with the default parameter setting. For the case of Typhoon Hagibis in 2019 (T19), a similar performance was obtained by incorporating unsaturated flow effect in the model applied to Kanto Region. The performance of the ensemble forecast was evaluated based on the bias ratios and the relative operating characteristic curves, which suggested the higher predictability in peak runoff for T19. For the F18, the uncertainty arises due to the difficulty in accurately forecasting the storm positions by the frontal zone; as a result, the actual distribution of the peak runoff could not be well forecasted. Overall, this study showed that the predictability of flash floods was different between the two extreme events. The ensemble spreads contain quantitative information of predictive uncertainty, which can be utilized for the decision making of emergency responses against flash floods.


Introduction
In the current decade (2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016), worldwide occurrences of floods and extreme rainfall become four times that in the 1980s (EASAC 2018;UNESCO, UN-Water 2020). In Japan, flood disasters occur every year, causing devastating damages (Udmale et al. 2019). In the last 2 years in particular, the Heavy Rain Event of July 2018 (hereafter F18) caused by an intensified Baiu frontal zone killed or missed 245 people in the western part of Japan (Cabinet Office 2018), and Typhoon Hagibis in October 2019 (hereafter T19) killed or missed 114 people in the eastern part of Japan (Cabinet Office 2019). The two extreme events caused flood damages in 315 rivers including 37 levee breaching sections by F18 (MLIT 2018) and 325 rivers including 142 levee breaching sections by T19 (MLIT 2019), mostly along small-tomedium sized rivers. In countries like Japan, which have a steep topography and upland catchments receiving heavy torrential rain, flash floods occur frequently. Flash floods are characterized by a rapid increase in flood peaks after rainfall onset (Collier 2007;Georgakakos 1986), whose intervals are shorter than 12 to 24 hours (Georgakakos 1986;Raynaud et al. 2015). During such flash floods, safe evacuation is difficult to be performed because floods occur rapidly and roads are submerged (Terti et al. 2019;Vincendon et al. 2016). Thus, sitespecific flash flood mapping and forecasting are required for safe evacuations (Collier 2007;Gourley et al. 2017;Hapuarachchi et al. 2011).
Early warnings for flash flood cannot focus only on well gauged main rivers. Many small-to-medium sized rivers are typically poorly gauged and get damaged more frequently. Therefore, selecting a single river basin for flash flood prediction is not very useful, and instead a large-scale distributed approach forecasting all river sections including ungauged rivers is required (Grimaldi et al. 2013;Javelle et al. 2014;Reed et al. 2007). With the recent advancements in quantitative precipitation estimate (QPE) and quantitative precipitation forecast (QPF) models, large-scale flood forecasting systems have been developed (Emerton et al. 2016). Such systems have been operated at the continental to nationwide scales; for example, the European Flood Awareness System (EFAS) Thielen et al. 2009) in Europe, the Community Hydrologic Prediction System (CHPS) in USA , the Hydrological Forecasting System (HyFS) in Australia (Hapuarachchi et al. 2017), the Grid-to-grid Model (G2G) in England and Wales (Anderson et al. 2019;Price et al. 2012) and Scotland (Cranston et al. 2012), and AIGA (Adaptation d'Information Géographique pour l'Alerte en Crue) in France (Javelle et al. 2016). For site-specific flood predictions at a fine scale, for example, EC-JRC (European Commission -Joint Research Centre) provides a rainfall-driven flash flood indicator within the EFAS framework called the European Precipitation Index based on Climatology (EPIC)  and a runoff-driven indicator called the European Runoff Index based on Climatology (ERID) (Raynaud et al. 2015). The National Weather Service (NWS) has adopted the Flash Flood Guidance (FFG), which is based on the estimated amount of rainfall causing floods at a specific site by running a hydrologic model in the backward mode (Clark et al. 2014;Georgakakos 2006).
Recent advances in high-performance computing and geographic information systems have motivated the application of large-scale distributed hydrologic models to predict flash floods at any river sections including ungauged sites. Gourley et al. (2017) reported the latest research project called the Flooded Locations and Simulated Hydrographs (FLASH) across the Conterminous United States. One of their targets was to apply a distributed hydrologic model with 1 km resolution to estimate flood peak discharge forced by the latest QPE. Such model applications are important from the viewpoint of hydrological science because physically sound parsimonious distributed modes are necessary for the purpose as well as finding the dominant runoff processes (Antonetti et al. 2019), regionalizing parameters (Vergara et al. 2016) and assessing the impact of dataset resolution (Lovat et al. 2019).
As noted by many previous studies, the largest uncertainty is associated with the precipitation forecast (Hapuarachchi et al. 2011). In particular, even with the state-of-art NWPs, accurately predicting severe storms with sufficient prediction lead time is challenging. Instead of deterministic forecasting, probabilistic forecasting with Ensemble Prediction System (EPS) has advanced in the last decade (Cloke and Pappenberger 2009;Wu et al. 2020). When the possibility of a severe storm becomes high because a typhoon is approaching or because a frontal line is stagnant, if we can predict the occurrence probability of flash floods leading to local damage, we could prepare for the extreme weather as a society (Terti et al. 2019). For ensemble flood predictions, global scale EPSs such as European Centre for Medium-Range Weather Forecasts-Ensemble of forecast (ECMWF-ENS) have been widely used (Alfieri et al. 2014). Meanwhile, for ensemble flash flood predictions, some case studies have demonstrate the importance of meso-scale ensemble forecasting (Alfieri et al. 2012;Hsiao et al. 2013;Roux et al. 2020;Ushiyama et al. 2014) and their combinations by the Meteorological Ensemble Forecast Processor (MEFP) approach adopted by Hydrologic Ensemble Forecast Service (HEFS) (Brown et al. 2014a;Brown et al. 2014b).
In Japan, the Japan Meteorological Agency (JMA) recently started operation of the Meso-scale Ensemble Prediction System (MEPS) with a 39 h lead time with 21 ensemble members based on a meso-scale numerical weather prediction model with a 5 km resolution. The test operation began just before July 2018, and official operation started since June 2019. Since heavy storm is spatiotemporally concentrated in mountainous areas and the flood concentration time is much shorter in Japan, we believe it is important to use fine spatial resolution forecasting. For hydrological modeling, we used the Rainfall-Runoff-Inundation (RRI) model (Sayama et al. 2012;Sayama et al. 2015a;Sayama et al. 2015b) applied recently to all over Japan by dividing the whole of Japan into 14 regions with a 5 s (about 150 m) spatial resolution. We suppose that the impact of complex rainfallrunoff phenomena become comparatively less important during such extreme flood events; then the simple model structure and the default RRI model parameter setting, which considers only saturated subsurface flow and overland flow, may be able to reproduce the flash floods over a wide range. Because the two extreme flood events described above affected large areas over Japan, and because the latest ensemble forecasting product is now available, it is important to investigate the performance of the newly developed flash flood forecasting model by taking the two extreme events as a case study.
The objective of this study is to evaluate the predictability of flash floods caused by the two extreme events. In particular, the specific questions addressed in this study are described below.
1 Can the RRI model, representing lateral saturated subsurface flow and surface flow, forced by radar and gauge composite rainfall product reproduce the observed storm runoff hydrographs at many upstream dam reservoirs over wide ranges? 2 Do the spatial distributions of peak runoff (i.e., peak discharge normalized by the upstream contributing area) correspond to the actual damage by the flash floods? 3 How well can we forecast the spatial distributions of peak runoff with a 39 h lead time based on ensemble precipitation forecasting?

Nationwide application of the RRI model over Japan
In this study, the RRI model was applied to the whole of Japan with a spatial resolution of approximately 150 m (5 s). The RRI model is a two-dimensional model, which can simulate both rainfall-runoff and flood inundation simultaneously (see the details in the supplement). For model application, we used the Japan Flow Direction Map (J-FlwDir) developed by Yamazaki et al. (2018). The dataset is based on a digital elevation model and water map including water surface and stream lines provided by the Geospatial Information Authority of Japan and prepared with a 30 m (1 s) spatial resolution. The dataset contains elevation, flow direction and upstream contributing area, similar to HydroSHEDs (Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales) available worldwide based on satellite-based topographic data. To upscale the flow direction of J-FlwDir, we used an upscaling algorithm developed by Masutani et al. (2006), which maintains the location of the main stream regardless of the upscaling. The RRI regards grid-cells with the upstream contributing area greater than 1.1 km 2 (i.e., more than 50 grid-cells) as a river grid cell. Hence, all river channels including small tributaries are explicitly modelled by the RRI. The cross sections of the rivers are assumed to be rectangular, whose widths (W) and depths (D) were estimated using the following formulae.
where A is the upstream contributing area [km 2 ] and the values of C W , S W , C D , and S D were estimated from our previous model application in Japan: C W = 4.73, S W = 0.4, C D = 1.57, and S D = 0.3 (Sayama et al. 2017). The depth parameters (C D and S D ) were intentionally set to be larger than many actual cases, so that the model had enough river cross section capacity. With this setting, the RRI model does not calculate the overtopping inundation effect and focuses primarily on the rainfall-runoff processes (i.e., river discharge). In terms of the model parameter settings, we attempted to prepare a parsimonious model to simulate extreme events. As described in the supplement, by taking d m = 0 in Eq. (S1), the number of parameters becomes only four. The default parameters are shown in Table 1. The model parameters were set to avoid strong non-linearity between storage and discharge relationship to ensure the model represented quick runoff responses to storm events. With the settings (i.e., d m = 0), the model may overestimate the discharge if catchments store large amount of rainfall in soil layers.

Flood events in July 2018 and October 2019
This study focuses on two extreme flood events that occurred in July 2018 (F18) and October 2019 (T19). F18 was caused by long-lasting Baiu frontal rain covering most of the western part of Japan. T19 was due to the Typhoon Hagibis which mostly damaged the eastern part of Japan. Detailed information regarding the meteorological conditions and analysis can be obtained from recent studies (Enomoto 2019;Kotsuki et al. 2019;Takemi and Unuma 2020;Tsuguti et al. 2018). The periods of the two simulations were 0:00 5 July 2018 to 0:  00 9 July 2018 (JST) for F18 and 9:00 11 October 2019 to 9:00 14 October 2019 (JST) for T19. The input rainfall data are JMA's radar and gauged composite products, whose spatial and temporal resolutions are 1 km and 30 min, respectively. In the ensemble flood forecasting experiments, we used the MEPS dataset provided by JMA. The forecasting lead time of this product was 39 h with a spatial resolution of approximately 5 km. The JMA nonhydrostatic model (NHM) was used for forecasting with 21 members by perturbating initial conditions. Although the forecasting is updated every 6 h, we decided to focus only on a single initial time for each event to cover the whole event except for an additional analysis with different initial conditions demonstrated in the discussion. The period of the MEPS forecasting rainfall is 21:00 5 July 2018 to 12:00 7 July 2018 (JST) for F18, and 9:00 12 October 2019 to 0:00 13 October 2019 (JST) for T19. Prior to the period of the MEPS forecasting rainfall, this study uses the JMA's radar and gauged composite products supposing the data is available to set the initial conditions of the RRI, while for posterior to the MEPS forecasting period, it assumes no rainfall up to the end of the simulation periods.
After the model was run with the nowcasting and forecasting modes, model performance was evaluated by comparing the observed and simulated discharge at dam reservoirs. The reasons for evaluating the model performance at dam reservoirs are threefold: (1) the quality of discharge data, especially in terms of total volume, is better than water level gauging stations, which require additional stage-discharge relationship curve, (2) the discharges at the downstream of the reservoirs are affected by upstream dam operations, (3) the official records of the dam inflow and outflow are available online, while those at water level gauging stations for T19 have not been released as of June 2020. Note that in this nationwide RRI model application, so far any dam reservoir operation model has not been incorporated. Furthermore, to visualize the spatial distribution of flood discharge, we compute peak discharge at all river gridcells and normalized it by the upstream contributing area (peak runoff). The evaluations were conducted focusing on all the river grid-cells over the target areas. The simulation of F18 was conducted in Kansai, Chugoku, Shikoku, and Kyushu regions. The total area of these regions is 125,864 km 2 , which is composed of 6, 364,492 grid-cells including 674,291 river grid-cells. The simulation of T19 was conducted in Kanto, Tohoku, and part of Hokuriku Regions. The total area of these regions is 121,129 km 2 , which is composed of 6,435,707 gridcells including 751,102 river grid-cells. For the computation, we used a workstation (Intel Xeon Gold 6134 CPU, 3.20 GHz, 192 GB Memory), which required maximum around 15 h for a single region for the whole simulation period. Multiple ensembles at multiple regions can be computed parallelly.

Verification metrics of simulated hydrographs
The performance of the hydrologic model was evaluated by the following three measures: the Kling-Gupta efficiency (KGE) (Gupta et al. 2009), the Nash-Sutcliffe efficiency (NSE), and the relative peak error (PE) defined as (3), (7), and (8).
where Q t o is the observed discharge at time t; Q t s is the simulated discharge at time t; Q o is the mean observed discharge in an event; Q s is the mean simulated discharge in an event; β is a measure of bias; α is a measure of the variability error; r is the correlation coefficient between Q t o and Q t s ; Q p, s is the simulated peak discharge; and Q p, o is the observed peak discharge. NSE has been widely used for hydrologic modeling, while KGE is being increasingly used as an alternative. KGE was developed to decompose the

Verification metrics of ensemble forecasting
To evaluate ensemble forecasting, this study uses the following the bias ratios (BI) defined as (9) and the relative operating characteristic (ROC) curve.
where Q p, f (k) is the ensemble mean of the forecasted peak discharge at each river grid cell k; and N is the number of river grid cells. The BI is evaluated depending on the ranges of Q p, s (k), which is the simulated peak discharge by the nowcasting mode used as a reference.
The ROC curve is obtained by plotting the false alarm rate F(p t ) on the x-axis and the hit rate H(p t ) on the yaxis. Here, we plot the ROC curves based on the simulated peak runoff at all river grid-cells. We first convert the continuous peak runoff variables into the binary data using a threshold. In this case study, we adopted 20 mm/h as the threshold. If the peak runoff exceeds 20 mm/h, it becomes 1 otherwise 0 for both nowcasting and forecasting modes (Toth et al. 2003). After reorganizing the forecasted peak runoff based on the descending order at each grid cells, we can prepare 21 maps of the binary data from the 21 ensemble forecasting. By comparing them with the simulated peak runoff at the Fig. 4 Peak runoff by ensemble forecasting (lines) and the simulation (dots) at all the dam reservoirs for F18. The x-axis shows the 21 ensemble members in descending order of forecasted peak runoff shown on the y-axis. Note that the flood peaks were evaluated after 16:00 6 July 2018 to exclude the effects of first flood peaks, which occurred at some dam reservoirs nearly at the same time as the MEPS initial time for F18 nowcasting mode, we can evaluate 21 different combinations of F(p t ) and H(p t ). By plotting the 21 combinations, we show the ROC curves. Note that both F(p t ) and H(p t ) depend on the decision probability p t within the range of ensemble members. F(p t ) and H(p t ) decrease from 1 to 0 as the decision probability p t increases from 0 to 1. The ROC curve is close to the perfect forecast H(p t ) = 1 and F(p t ) = 0 for better performance.

Simulation results with radar rainfall
We run the RRI model using the observed radar rainfall data during the F18 event with the default parameter setting (hereafter F18-Def). Figure 1 compares the observed and simulated hydrographs at 10 dam reservoirs. The results suggest that the default RRI model simulates the flood inflows generally well for the F18 event regardless of the geographic locations. Fig. 2, Table 2, and  Table 3 summarize the quantitative evaluations of the simulations. The average and standard deviations of KGE and NSE for F18-Def are 0.78 ± 0.11 and 0.75 ± 0.10, respectively. The boxplot of Fig. 2 further suggests that among the three components of KGE, the correlation coefficients are approximately one (r = 0.94 ± 0.04), suggesting that the shapes of the hydrographs are well represented. In contrast, the standard deviations of α and β are relatively high (α = 1.02 ± 0.17, β = 1.00 ± 0.16). Since the averages of α and β are close to one, there is no consistent overestimation or underestimation. In terms of simulated peak discharge, the result of PE = 0.87 ± 0.15 indicates 87% of underestimations of the peak discharge by the simulation.
The same experiment was conducted for T19 in Kanto, Tohoku, and Hokuriku regions. The default parameter setting (T19-Def) could not represent the observed hydrographs in this case. Figure S1 shows that at the eight catchments, the simulation generally overestimated the observed hydrographs. The values of α, β, and PE were higher than 1 (α = 1.45 ± 0.29, β = 1.36 ± 0.22, PE = 1.31 ± 0.29) indicating hydrograph variance, total volumes, and peak discharges were all overestimated by 30 to 40%. The shapes of the hydrographs represent the observed patterns, which are quantified as r = 0.96 ± 0.02. As a result, both KGE and NSE are much lower (0.42 ± 0.36 and 0.50 ± 0.46) for T19-Def compared with those for F18-Def even for the same model settings. For the case of T19 in the Kanto Region, we changed the model parameter setting by introducing unsaturated flow component (see Table 1). By introducing the unsaturated flow component, the model performance improved ( Fig. 3) with KGE = 0.76 ± 0.16 and NSE = 0.86 ± 0.10, and the performance was almost equivalent to that for the F18-Def case, as shown in Fig. 2. Figs. 1 and 3 show forecasted dam inflows by the RRI model with MEPS rainfall. The two sets of graphs indicate higher convergence among the 21 ensemble members for T19 compared to F18. For the F18 case, different members showed different hydrograph patterns with larger spreads in the forecasted peak runoff. On the other hand, for the T19 case, the flood peaks were well estimated about half day before the peak arrivals. Fig. 7 The same graph as Fig. 6 but for T19. The figures also show the major levee breaching points (red marks) and levee breaching points including small tributaries in Tochigi Prefecture (white marks) Sayama et al. Progress in Earth and Planetary Science (2020)  To analyze the characteristics of the forecasted peak runoff, Figs. 4 and 5 show plots of the descending orders of the forecasted peak runoff by 21 ensemble members and the simulated peak runoff estimated with the observed radar rainfall. The figures can show how the peak runoff is spread (a steeper line shows higher variation compared to a flatter line which represents a smaller variation among the members). Furthermore, if the simulated peak runoff points are within the range and are positioned close to the center, the median of ensemble forecast is suggested to cover what actually happened. The results support the above described point between F18 and T19; the former case has higher ensemble variations than the latter case. The implications of the difference between the events are discussed in the next section. Fig. 8 The mean and standard deviation of bias in the forecasted peak runoff compared to the simulated one for a F18 and b T19. The BI is computed depending on the peak runoff shown in x-axis. The dotted lines without the shading show the relative frequency of the simulated peak runoff Sayama et al. Progress in Earth and Planetary Science (2020)

Evaluations over the regions including tributaries
Figures 6 and 7 compare the forecasted and simulated peak runoff over the entire simulation domains. Figure 7 shows that the simulated peak runoff for T19 shows the areas of flash flood tributaries such as the tributaries in Tochigi Prefecture with white cross marks representing levee breaching points along the tributaries. Furthermore, it shows the higher peak runoff in west Kanto including the upstream of the Arakawa River. Figure 7 shows the forecasted result based on MEPS rainfall at 9: 00 on 12 October. The spatial variability such as high peak runoff in Tochigi Prefecture and the upstream of the Arakawa River is well represented. The performance of the F18 shown in Fig. 6 is not as clear as that of T19. First, the simulated peak runoff and actual flood damages did not agree very well. According to our simulation, the rivers in eastern Kochi and northern Kyushu islands show comparatively higher peak runoff. However, flash flood damages were concentrated mostly in the Chugoku Region and western Shikoku Island during F18.
To quantify the patterns, we used the BI and the ROC curve. The evaluations were conducted focusing on all river grid-cells over the target areas. Note that due to the non-availability of the observation data at all the river grid-cells including tributaries, these evaluations were conducted based on the model results by nowcasting and forecasting modes. In case of the BI shown in Fig. 8, the x-axis shows the peak runoff based on the simulation mode and the y-axis depicts the mean and the standard deviations of BI corresponding to the peak runoff for each region. For F18, the computed biases were 0.5-0.7 at the high peak runoff ranges (10-25 mm/ h), suggesting underestimation by the ensemble forecasting. Smaller biases (i.e., BI closer to 1.0) were confirmed for T19. Especially for the case of Kanto, for example, the biases were nearly one for almost the entire range, which indicates the high predictability of peak runoff. The ROC curves in Fig. 9 show also higher accuracy in T19. Moreover, the ROC curves from the three regions for T19 almost overlap each other. All the above evaluations are normally performed not for a single event but should be performed with many flood events or for long-terms. The present evaluation does not indicate the overall model performance, but the figures are used to quantify the results of the ensemble forecasting of the case studies.

Can the default RRI model reproduce the observed storm events?
For a regional flash flood prediction system, since it is impossible to calibrate hydrologic model parameters at individual river basins, limited sets of parameters should be applied over a wide range and to produce reliable results (Collier 2007). The simulation results for F18 showed that the default parameter representing lateral subsurface flow and surface flow perform fairly well to reproduce the observed hydrographs in many dam reservoirs. In case of T19, the same model with the same default parameter overestimated the flood peaks and the performance of T19-Def was unacceptable. Introducing the unsaturated soil layer into the model was necessary to reproduce the patterns observed hydrographs of the Kanto Region.
Here, we discuss the possible reason of the difference is related to the geologic settings in different regions. Large parts of the Chugoku and Shikoku Fig. 10 Simulated peak runoff at levee breaching points in small to medium sized rivers in Tochigi Prefecture (and Saitama Prefecture for the Oppe River). The red points are rivers that originated from mountains, while the blue points are rivers that originated from plains (< 400 m in this region), whose bank-full discharges are typically smaller Sayama et al. Progress in Earth and Planetary Science (2020)  regions affected by F18 belong to Granite rocks, Mesozoic or Paleozoic formations. The volcanic landscapes formed in Paleogene and Cretaceous are mostly distributed in central and western Japan. In contrast, mountainous area in northeastern Japan, including Kanto and Tohoku regions affected by T19, is dominated mostly by the volcanic rocks formed in the Quaternary and Neogene. The geologically younger catchments typically exhibit stable groundwater while the geologically older catchments show more flashy runoff patterns (Yoshida and Troch 2016;Shimizu 1980;Mushiake et al. 1981). These previous studies follow data driven approach focusing more on flow duration curves and baseflow. This study indicated the effect of geology on storm runoff, previously evaluated at small catchments in comparative studies (e.g., Onda et al. 2001;Onda et al. 2006) or modeling studies focusing on a selected river basin (Sayama et al. 2017).
4.2 Do the spatial distributions of peak runoff correspond to flash flood damages?
The simulated peak runoff distribution in Fig. 7 can help to visualize severely flood areas. The results of T19 showed that Tochigi Prefecture and the upper Arakawa River basin reached about 30 mm/h in the peak runoff. Furthermore, it shows that the peak runoff exceeded 40 mm/h in Marumori Township in Miyagi Prefecture in Tohoku Region, where severe damage was reported due to flash floods with many slope failures and debris flow.
To quantify the peak runoff where actual flood damages occurred, we collected location information of levee breaching points in Tochigi Prefecture by T19. Figure 10 plots the relationship between upstream contributing areas and the corresponding peak runoff simulated by the nowcasting mode at the levee breaching points. Although the plot does not indicate the levee breaches occur when the peak runoff exceeds the regression line, we can roughly understand that the estimated peak runoff exceeded approximately 30 mm/h along the tributaries where the levee breaches occurred with the upstream contributing areas less than 300 km 2 . The figure also indicates that the peak runoff becomes smaller for larger catchments at these points. This pattern is common; the peak runoff tends to be smaller due to the normalization by the upstream contributing area (Amponsah et al. 2018). Meanwhile, the bank-full discharge (i.e., the threshold) also varies depending on climatic and other geographic conditions as well as the status of river management works such as constructions of flood defiance structures. In fact, for the case of T18, the spatial pattern of high peak runoff did not represent the actual distribution of the flash flood damage over western Japan (Fig. 6). To advance the system, evaluating the peak runoff relative to the actual local bank-full capacity or converting the peak runoff to stream water levels are necessary. The other approach is to evaluate the frequency of the peak runoff at each location compared with the climatology as presented by previous studies (Alfieri et al. 2012;Yoshimura et al. 2008).
4.3 How well can we forecast the peak runoff distributions?
Operational flood forecasting in Japan typically focuses on stream water levels at important cross sections, where the water levels are monitored. The lead time of flood forecasting is normally 3 to 6 h. Recently, the JMA released a new type of flood forecasting information based on hydrologic simulation with a spatial resolution of 1 km covering all over Japan. The predicted streamflow is converted to a flood risk index and its lead time is set to be 6 h. Although our approach is similar to the JMA's new product, the demonstrated flash flood predictions for the two extreme cases with total of 21 ensemble members with 39 h lead time could provide different type of information that could be useful for the better preparedness, especially having more direct physical outputs including water levels and discharges. The estimated uncertainty bands by the two events were contrasting: smaller spreads for T19 and larger spreads for F18. The high predictability of the T19 event in particular is discussed from meteorological viewpoint. Takemi and Unuma (2020) reported that the moist absolutely unstable status under very humid conditions and a sufficient precipitable water were responsible for the heavy rainfall. For the F18 case, Kotsuki et al. (2019) indicated high predictability of intense rainfall at the synoptic-scale with long lead time (3 days) using their data assimilation system. However, forecasting accurate locations of the rain band causing heavy rainfall still has a high uncertainty (Matsunobu and Matsueda 2019). For this particular case, in addition to focus on the ensemble mean, we should pay attention to the worse cases (e.g., the runoff map of the fifth order).

What are the effects of different initial times of the forecasting?
Unlike many previous flood forecasting studies with shorter forecasting lead time such as three to 6 h, the demonstrated approach tries to cover the entire flood events prior to the beginning of the flood events. Due to the long-lasting characteristics of F18 event by the stagnant frontal line, it was not possible to select different initial times of MEPS for the evaluation of F18. For the case of T19, we selected the initial time of forecasting as 9:00 12 October 2019. Supposing that the forecasting can be updated every six hours with 39 h lead time, it is still possible to select earlier initial times. To understand the characteristics of the flood forecasting with the different initial times, we performed the same experiment with initial forecasting times including 15:00, 21:00 11 October 2019 and 3:00 12 October 2019 for T19 over the Kanto Region. Figure 11 show the results forecasted at different initial times, suggesting the larger spread of forecasted peak discharges. As the initial time becomes later, the ensemble spread becomes smaller with more distinct overestimation pattern for Ninose dam case. The improvement of the performance with later initial times can be confirmed also by the BI and ROC curve shown in Figs. 12 and 13. Among the investigated four initial times, the larger error is clearer for the earliest initial time (15:00 11 October) while the other three cases show fairly similar model performance, especially when they are evaluated by the ROC curve.

Conclusions
This study examined the predictability of flash floods on a nationwide scale in Japan using the new operational meso scale ensemble precipitation forecast and a highresolution distributed rainfall-runoff model. Two extreme events were selected as a case study, since both of them caused levee breaching and overtopping in various regions with different rainfall mechanisms; i.e., frontal rain in 2018 and the typhoon in 2019. Based on the numerical experiment at the nowcasting and forecasting modes, we addressed three research questions raised in the introduction. For the first question, the nationwide RRI model could reasonably reproduce extreme floods at many dam reservoirs over the wide range in the nowcasting mode without individual parameter tuning. However, in certain areas, such as the Kanto Region, remarkable basin storage effects were observed even under extreme events. The model parameters had to be tuned to reflect these effects in these areas. For the second question, the spatial distributions of peak runoff corresponded generally well the areas of flash flood tributaries especially for T19, while the performance of F18 covering western part of Japan was not as clear as that of T19. In case of F18, our target area contains wider ranges of climatic and geographic conditions in western part of Japan. It is likely the bank-full discharge (i.e., the threshold causing flooding) may vary more significantly within the region. To advance the system, evaluating the actual local bankfull capacity or converting the peak runoff into appropriate index representing the occurrence frequencies is required. For the third question, in terms of the forecasting with lead time, the predictability was different between the two event cases. In case of the T19, the prediction accuracy was high, and the spatial distribution of peak runoff estimated by the ensemble mean corresponded well with the results of the nowcasting mode. For the frontal heavy rain in July 2018, the detailed locations of high peak runoff could not be forecasted by the ensemble mean. Instead, the spread of 21 ensemble members showed the flash flood potential areas and their possible magnitudes. The large spread quantifies the higher uncertainty in the predictions.
The model presented in this study has not been operated on a real-time basis. The real-time system and its continuous verification should be performed in future studies. If such a system can be developed, stochastic, high-resolution, and long lead time flash flood predictions can be realized. Such a system can be useful for pre-imaging the situations of flash flood disasters, especially before a typhoon strikes or when a frontal rain stagnates, to realize early and safe evacuation and other preparations.