The future deployment of the Square Kilometre Array (SKA) will lead to a massive increase in astronomical data volumes, making automatic detection and characterization of sources crucial for utilizing its full potential. We present an end-to-end pipeline for generating a source catalog from simulated 3D spectral line data, specifically tailored to future SKA observations of the 21-cm line emission of neutral hydrogen (HI) from galaxies. The pipeline is mainly based on machine learning techniques built upon a convolutional neural network (CNN) for source finding, combined with the existing source-finding software SoFiA for source characterization. This solution, developed by our team FORSKA-Sweden, provided the second-best submission to SKA Science Data Challenge 2, whose accompanying simulations of HI cubes and source catalogs were used for development and testing.
The pipeline relies on an existing catalog of identified sources for a given portion of the observations, which serves as a training set for supervised learning. The large input data cubes are divided into smaller sub-cubes that can be digested by the computing resources in use. The first step of the pipeline is a CNN with a U-Net architecture. Given an HI sub-cube as input, the CNN outputs a mask of the same shape, in which voxels containing signal are marked and separated from the unmarked background voxels. To train the CNN in a supervised manner while avoiding manual labeling of all voxels, the provided source catalog is used to algorithmically construct a target mask for the training set. For each source in the catalog, its listed properties (coordinates, central velocity, line width, inclination, position angle, etc.) are used to mask the voxels in its neighborhood that capture all plausible signal distributions of the galaxy. Padding is also added to the selected mask to account for noise. To make training more efficient, sub-cubes containing galaxies are oversampled relative to the total volume of background regions during training.
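As an illustration of these preprocessing steps, the sketch below splits a cube into overlapping sub-cubes and builds a binary target mask from catalog entries. It is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the function names are hypothetical, and each source is masked with a simple padded bounding box rather than a region derived from the galaxy's modeled orientation and line width.

```python
import numpy as np

def split_cube(cube, sub_shape, overlap):
    """Yield (origin, sub_cube) pairs covering the full cube.

    Adjacent sub-cubes share `overlap` voxels along each axis;
    sub-cubes touching the far edges may be smaller than sub_shape.
    """
    steps = [max(s - o, 1) for s, o in zip(sub_shape, overlap)]
    for z in range(0, cube.shape[0], steps[0]):
        for y in range(0, cube.shape[1], steps[1]):
            for x in range(0, cube.shape[2], steps[2]):
                origin = (z, y, x)
                sl = tuple(slice(o, o + s) for o, s in zip(origin, sub_shape))
                yield origin, cube[sl]

def catalog_to_mask(shape, catalog, pad=1):
    """Build a binary training target by marking a padded box per source."""
    mask = np.zeros(shape, dtype=np.uint8)
    for src in catalog:
        z, y, x = src["center"]     # voxel coordinates of the source center
        dz, dy, dx = src["extent"]  # half-extents along each axis
        mask[max(z - dz - pad, 0):z + dz + pad + 1,
             max(y - dy - pad, 0):y + dy + pad + 1,
             max(x - dx - pad, 0):x + dx + pad + 1] = 1
    return mask
```

A U-Net trained on such (sub-cube, target mask) pairs then learns to reproduce the mask from the noisy data alone.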
The second step of the pipeline is to find shapes resembling galaxies in the output masks produced by the CNN for all sub-cubes. For this purpose, the merging and dilation modules of SoFiA are used. Finally, to characterize each detected source (e.g., position, central velocity, integrated line flux, line width), the information provided by SoFiA is combined with complementary calculations based on the final output mask to produce a final catalog of galaxies.
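SoFiA's merging step groups detected voxels into source candidates; as a rough stand-in for it, the sketch below labels 6-connected components in a binary mask and derives simple properties (total flux and flux-weighted centroid) for each. The function names are hypothetical, and SoFiA itself additionally dilates the mask and applies further filtering.

```python
import numpy as np
from collections import deque

def label_components(mask):
    """Label 6-connected components in a 3D binary mask via breadth-first
    search (an illustrative stand-in for SoFiA's merging module)."""
    labels = np.zeros(mask.shape, dtype=np.int32)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue  # voxel already assigned to a component
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in offsets:
                n = (z + dz, y + dy, x + dx)
                if all(0 <= c < s for c, s in zip(n, mask.shape)) \
                        and mask[n] and not labels[n]:
                    labels[n] = current
                    queue.append(n)
    return labels, current

def characterize(cube, labels, n):
    """Derive per-source properties: total flux and flux-weighted centroid."""
    sources = []
    for i in range(1, n + 1):
        sel = labels == i
        flux = cube[sel].sum()
        coords = np.argwhere(sel)
        centroid = (coords * cube[sel][:, None]).sum(axis=0) / flux
        sources.append({"centroid": centroid, "flux": flux})
    return sources
```

In the pipeline itself, such components are the detections whose properties feed the final catalog.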
To cope with the size of HI cubes and to allow deployment on various computational resources, the implementation of the pipeline has flexible, configurable memory usage. To decrease memory usage, the catalog for a single HI cube is computed by merging catalogs from multiple sub-cubes, at the cost of increased computation time. A small memory budget has close to no effect on the quality of the mask produced by the CNN, but because galaxies may be cropped across multiple sub-cubes, it can degrade the steps performed by SoFiA and the subsequent calculations that derive source properties. A larger memory budget is therefore preferable when the computational environment allows it.
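The catalog-merging step can be sketched as follows, assuming each sub-cube contributes a small catalog together with its origin in the full cube. Duplicate detections from overlapping sub-cubes are collapsed by position proximity; the names, the distance tolerance, and the keep-the-brighter rule are illustrative assumptions, not the pipeline's actual merging logic.

```python
def merge_catalogs(sub_catalogs, tol=3.0):
    """Merge per-sub-cube catalogs into one global catalog.

    sub_catalogs: iterable of (origin, catalog) pairs, where origin is the
    sub-cube's offset in the full cube and catalog holds sources with a
    "centroid" (in sub-cube coordinates) and a "flux".
    Detections within `tol` voxels of each other are treated as duplicates.
    """
    merged = []
    for origin, catalog in sub_catalogs:
        for src in catalog:
            pos = tuple(o + c for o, c in zip(origin, src["centroid"]))
            dup = next((m for m in merged
                        if sum((a - b) ** 2 for a, b in zip(m["centroid"], pos))
                        <= tol ** 2), None)
            if dup is None:
                merged.append({"centroid": pos, "flux": src["flux"]})
            elif src["flux"] > dup["flux"]:
                # keep the brighter measurement, assuming it saw
                # more of a galaxy that was cropped at a sub-cube edge
                dup.update(centroid=pos, flux=src["flux"])
    return merged
```

This illustrates why cropping matters: a galaxy split across sub-cubes yields two partial measurements, and any merge rule can only approximate the properties a single uncropped measurement would give.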
We show that detection performance can be fine-tuned by adjusting the merging configuration of SoFiA, resulting in a trade-off between completeness and reliability. Predictions of line width, position angle, and inclination angle become more accurate as reliability increases, whereas axis size and integrated line flux are more accurately predicted at higher completeness.
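Completeness and reliability as used here can be computed from a positional match between true and detected sources. The sketch below uses a greedy nearest-match strategy with a fixed tolerance; both choices are illustrative assumptions rather than the challenge's official scoring procedure.

```python
def completeness_reliability(true_positions, detections, tol=3.0):
    """Completeness = matched true sources / all true sources.
    Reliability  = matched detections  / all detections.
    A detection matches a true source if within `tol` voxels;
    each true source can be matched at most once (greedy matching)."""
    unmatched = list(true_positions)
    matched = 0
    for det in detections:
        hit = next((t for t in unmatched
                    if sum((a - b) ** 2 for a, b in zip(t, det)) <= tol ** 2),
                   None)
        if hit is not None:
            unmatched.remove(hit)
            matched += 1
    completeness = matched / len(true_positions) if true_positions else 1.0
    reliability = matched / len(detections) if detections else 1.0
    return completeness, reliability
```

A stricter merging configuration removes marginal detections, which tends to raise reliability while lowering completeness, and vice versa.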