# Synthetic Datasets for Predictive Localization Monitoring

## Overview

This repository contains a synthetic dataset for training and evaluating predictive localization monitoring models for autonomous mobile robots, with a focus on LiDAR-based particle filter localization (Adaptive Monte Carlo Localization, AMCL). The dataset was generated with NVIDIA Isaac Sim and comprises 21 ROS 2 rosbags that capture diverse scenarios with localization estimates, ground-truth poses, sensor data (LiDAR and odometry), and automatically labeled failure cases. It is intended to support research on proactive fault detection for ground robots navigating dynamic and challenging environments, as described in the paper "Synthetic Datasets for Data-Driven Localization Monitoring".

## Dataset Description

The dataset comprises 417,185 labeled samples across 21 experiment runs, with a failure rate of 23.1% (96,315 failure instances and 320,870 nominal instances). Experiments were conducted in seven distinct environments:

- **Warehouse**: A large, prebuilt environment with aisles, storage racks, and handling equipment, mimicking real-world warehouse settings.
- **Symmetric Maps (1–3)**: Three small-scale environments with symmetrical layouts to induce localization confusion due to repetitive structures.
- **Asymmetric Maps (1–3)**: Three small-scale environments with asymmetrical layouts for varied localization challenges.

Each environment was tested under three obstacle configurations, giving 7 × 3 = 21 experiment runs:

- **Dynamic Only**: 25 spherical obstacles (representing humans, robots, or industrial trucks) with randomized trajectories.
- **Static Only**: Manually placed static obstacles (cubes) not included in the navigational map.
- **Combined (Dynamic + Static)**: Both dynamic and static obstacles.

Localization failures were induced through environmental challenges (e.g., dynamic obstacles, featureless zones, map ambiguities) and randomized odometry drift, simulating real-world sensor inaccuracies. The dataset includes ROS topics such as `/amcl_pose`, `/robot/pose`, `/particle_cloud`, `/map`, `/position_error`, `/heading_error`, and `/localization_failures`, stored in rosbag files.

## File Structure

- **Rosbags**: 21 ROS 2 rosbag files, each corresponding to a unique experiment run (e.g., `rec_20250821_104113.bag`). Each rosbag contains the raw experiment data.

- **info.csv**: A summary file detailing the experiment configurations and key statistics for each rosbag.

- **Parquets**: Processed versions of the dataset stored in Apache Parquet format.  
  - For each rosbag, there are **two raw** and **two processed** parquet files.  
  - The files are split in half to avoid memory issues during loading.  
  - Raw parquets contain the unprocessed tabular data extracted from the rosbags.  
  - Processed parquets include additional preprocessing steps for machine learning pipelines.  
  - All parquet files are located in the `parquets/` folder.  
  - A Python script in the `scripts/` folder loads the parquets into a dataframe.
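
As a rough illustration of how the split files can be combined, the sketch below collects parquet files from a folder and concatenates them into one dataframe. It is not the script shipped in `scripts/`, which remains the reference. The helper names `find_parquets` and `load_parquets` are hypothetical, the `name_contains` filter assumes that parquet file names contain the rosbag name, and `load_parquets` assumes that pandas and a parquet engine such as pyarrow are installed.

```python
from pathlib import Path


def find_parquets(folder, name_contains=""):
    """Return the sorted parquet paths in `folder` whose file name contains `name_contains`."""
    return sorted(
        p for p in Path(folder).glob("*.parquet") if name_contains in p.name
    )


def load_parquets(paths):
    """Concatenate several parquet files into a single pandas DataFrame."""
    # Imported lazily; assumes pandas and a parquet engine (e.g. pyarrow) are installed.
    import pandas as pd

    frames = [pd.read_parquet(p) for p in paths]
    # pd.concat raises ValueError if `paths` is empty.
    return pd.concat(frames, ignore_index=True)
```

For example, `load_parquets(find_parquets("parquets", "rec_20250821_104113"))` would load every part belonging to that run, provided the naming assumption holds. Loading one run at a time keeps memory use low, which is the reason the files are split.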


### info.csv Description

The `info.csv` file provides an overview of the 21 experiment runs, with the following columns:

- **Name**: The rosbag filename (e.g., `rec_20250821_104113`).
- **Map**: The environment used (`warehouse`, `symmetric_exp_1`, `symmetric_exp_2`, `symmetric_exp_3`, `unsymmetric_exp_1`, `unsymmetric_exp_2`, `unsymmetric_exp_3`). The `unsymmetric_exp_*` identifiers refer to the asymmetric maps.
- **Dynamic**: Indicates if dynamic obstacles were present (`✓` for yes, `x` for no).
- **Static**: Indicates if static obstacles were present (`✓` for yes, `x` for no).
- **Resets**: Number of localization resets triggered when position error exceeded 0.4 m or orientation error exceeded 0.4 rad.
- **Samples**: Total number of labeled samples in the rosbag.
- **Nominal**: Number of samples labeled as nominal (`y=0`, no failure).
- **Failure**: Number of samples labeled as failure (`y=1`, localization error beyond thresholds).
- **Percentage**: Failure rate as a percentage (`Failure / Samples * 100`).

The file also includes a total row summarizing the dataset: 417,185 samples, 320,870 nominal, 96,315 failures, and a 23.1% failure rate.
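
The columns above are enough to select runs and recompute failure rates with the Python standard library. The following is a minimal sketch under a few assumptions: the header names match the column names listed above exactly, the total row carries no valid map name (so it can be skipped by checking `Map`), and counts may contain thousands separators. The function names `read_runs` and `failure_rate` are hypothetical.

```python
import csv

# Map identifiers listed in the column description above.
MAPS = {
    "warehouse",
    "symmetric_exp_1", "symmetric_exp_2", "symmetric_exp_3",
    "unsymmetric_exp_1", "unsymmetric_exp_2", "unsymmetric_exp_3",
}


def _to_int(value):
    """Parse a count such as '1,000' or ' 42 ' into an integer."""
    return int(value.replace(",", "").strip())


def read_runs(csv_file):
    """Parse the per-run rows of info.csv from an open text file."""
    runs = []
    for row in csv.DictReader(csv_file):
        if (row.get("Map") or "").strip() not in MAPS:
            continue  # assumption: the total row has no valid map name
        runs.append({
            "name": row["Name"].strip(),
            "map": row["Map"].strip(),
            "dynamic": row["Dynamic"].strip() == "✓",
            "static": row["Static"].strip() == "✓",
            "samples": _to_int(row["Samples"]),
            "failure": _to_int(row["Failure"]),
        })
    return runs


def failure_rate(runs):
    """Return Failure / Samples * 100 over the given runs (0.0 if there are no samples)."""
    samples = sum(r["samples"] for r in runs)
    failures = sum(r["failure"] for r in runs)
    return 100.0 * failures / samples if samples else 0.0


# Usage sketch:
# with open("info.csv", newline="", encoding="utf-8") as f:
#     runs = read_runs(f)
# dynamic_only = [r for r in runs if r["dynamic"] and not r["static"]]
# print(failure_rate(dynamic_only))
```

Computing the rate from the summed counts, rather than averaging the per-run `Percentage` values, weights each run by its number of samples.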

## Usage

The dataset is designed for training supervised machine learning models to predict localization failures. The rosbags can be processed using ROS 2 and the Flowcean framework (see Flowcean documentation: https://flowcean.me/examples/robot_localization_failure/) to convert data into tabular formats for machine learning pipelines.

### Loading the Dataset

1. Install ROS 2 and required dependencies.
2. Use the Flowcean framework to load rosbags and extract data into a tabular format.
3. Refer to `info.csv` to select rosbags based on map type, obstacle configuration, or failure rate for specific experiments.
4. Use the labeled samples (`y=0` for nominal, `y=1` for failure) to train classification models.
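
Because only about 23.1% of the samples are failures, a classifier trained in step 4 may benefit from class weighting. The sketch below computes weights with the common "balanced" heuristic, `n_total / (n_classes * n_class)`, from the dataset totals reported above. This weighting is an illustrative choice, not a method prescribed by the dataset authors, and the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Return the weight n_total / (n_classes * n_class) for each class label."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Totals reported for the whole dataset (y=0 nominal, y=1 failure).
weights = balanced_class_weights({0: 320_870, 1: 96_315})
```

With these weights, the failure class counts roughly 3.3 times as much per sample as the nominal class, so both classes contribute equally to a weighted loss.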

## Simulation Setup

The dataset was generated using NVIDIA Isaac Sim with the MoMo omnidirectional robot model, equipped with a Velodyne VLP-16 LiDAR. The simulation includes:

- **Robot Model**: A digital model of the MoMo robot with holonomic kinematics.
- **Environments**: Seven maps (one warehouse, three symmetric, three asymmetric).
- **Challenges**: Dynamic obstacles, static obstacles, and randomized odometry drift to induce failures.
- **Labeling**: Samples are labeled automatically as failures when the position error or the orientation error exceeds its threshold.
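
The labeling rule can be sketched as a small function that compares an estimated pose with the ground-truth pose. The thresholds below reuse the 0.4 m and 0.4 rad values given for the reset criterion in the `info.csv` description; whether the labels use exactly these values, and whether the position error is the planar Euclidean distance, are assumptions. The names `wrap_angle` and `label_sample` are hypothetical and do not come from the simulation code.

```python
import math

# Assumed thresholds, taken from the reset criterion in the info.csv description.
POSITION_THRESHOLD_M = 0.4
HEADING_THRESHOLD_RAD = 0.4


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def label_sample(estimate, truth):
    """Return 1 (failure) or 0 (nominal) for two poses given as (x, y, yaw)."""
    position_error = math.hypot(estimate[0] - truth[0], estimate[1] - truth[1])
    heading_error = abs(wrap_angle(estimate[2] - truth[2]))
    failed = (
        position_error > POSITION_THRESHOLD_M
        or heading_error > HEADING_THRESHOLD_RAD
    )
    return int(failed)
```

Wrapping the heading difference matters near ±π: yaw values of 3.1 rad and -3.1 rad differ by only about 0.08 rad, not 6.2 rad, and should not count as a failure.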

The simulation code, robot model, and environment USD files are provided in this folder. Newer versions may be linked at https://flowcean.me/examples/robot_localization_failure/.

## Citation

If you use this dataset in your research, please cite:

```txt
Markus Knitt, Sean Maroofi, Manav Thakkar, Hendrik Rose, Philipp Braun. "Synthetic Datasets for Data-Driven Localization Monitoring." Logistics Journal: Proceedings, 2025. DOI: 10.2195/lj_proc_knitt_en_202503_01.
```

```bibtex
@inproceedings{knitt2025synthetic,
    title={Synthetic Datasets for Data-Driven Localization Monitoring},
    author={Knitt, Markus and Maroofi, Sean and Thakkar, Manav and Rose, Hendrik and Braun, Philipp},
    booktitle={Logistics Journal: Proceedings},
    year={2025},
    doi={10.2195/lj_proc_knitt_en_202503_01}
}
```

## Contact

For questions or support, contact:

- Markus Knitt (markus.knitt@tuhh.de)
- Institute of Logistics Engineering, Hamburg University of Technology, Theodor-Yorck-Straße 8, 21079 Hamburg, Germany