Big data usually means big problems and a lot of work. One of those problems is loading large volumes of data: there is usually some source of big data, and the task is framed as a legacy system migration or the import of historical data into a FHIR server.
The typical solution for such problems is to build an ETL (extract, transform, load) pipeline:
Key points for this architecture:
How incoming data is transformed and stored depends heavily on the state of the target system, so that part is out of scope for this article. We still needed a data source for our test, so we used the Synthea tool to generate 40 batches, each containing around 25,000 Patient resources and their related resources. The generation script is accessible via the following link: gen_synthea.sh.
Loading large amounts of data is not a new task, and there are a number of well-established practices that have proved efficient.
From the perspective of transmitting data over HTTP, loading records one by one is clearly not the best approach: it incurs significant overhead for connection setup, waiting for each response, and so on. Sending data in batches dramatically reduces this overhead and speeds up the transfer, and streaming goes further still, eliminating most of the per-request cost.
Use the PostgreSQL COPY command to load all the rows in one command, instead of using a series of INSERT commands. The COPY command is optimized for loading large numbers of rows. It is less flexible than INSERT, but incurs significantly less overhead for large data loads.
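As a minimal sketch (not the exact Aidbox schema), the snippet below contrasts row-by-row INSERTs with a single COPY for a simplified patient table holding a JSONB resource column; the table layout and file path are assumptions for illustration only:

-- Row-by-row loading: every INSERT pays per-statement overhead
INSERT INTO patient (id, resource)
VALUES ('pt-1', '{"resourceType": "Patient", "id": "pt-1"}');

-- Bulk loading: a single COPY streams the whole file
-- (use COPY ... FROM STDIN or psql's \copy when the file lives on the client)
COPY patient (id, resource)
FROM '/import/patients.csv'
WITH (FORMAT csv);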
Database indexes can slow down insertion, since every insert into a table also has to update all of the table's indexes to keep them consistent. Building an index over existing data is much faster than updating it incrementally on every insert. For this reason, it is good practice to drop (or disable) all existing table indexes before loading, perform the load via the COPY command, and then recreate (or re-enable) the indexes, as sketched below.
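A sketch of that sequence for the same simplified table; the index name and definition are assumptions for the example:

-- 1. Drop existing indexes before the bulk load
DROP INDEX IF EXISTS patient_resource_gin_idx;

-- 2. Load the data while no indexes need to be maintained
COPY patient (id, resource)
FROM '/import/patients.csv'
WITH (FORMAT csv);

-- 3. Recreate the indexes once, over the already-loaded data
CREATE INDEX patient_resource_gin_idx ON patient USING gin (resource);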
Aidbox stores resources as JSONB, and very large FHIR resources can run into difficulties with TOAST. In essence, the problem arises when a row does not fit into a PostgreSQL page and the engine moves part of the data into a separate TOAST table. Every subsequent read or write of that row then internally touches two tables: the TOAST table and the main one.
There are a number of ways to reduce TOASTing:
-- Keep the resource column inline in the main table (compressed), avoiding out-of-line storage where possible
ALTER TABLE <target_table> ALTER COLUMN resource SET STORAGE MAIN;
-- Raise the row-size threshold at which PostgreSQL starts moving data to TOAST (the default is roughly 2 KB)
ALTER TABLE <target_table> SET (toast_tuple_target = 6000);
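To gauge how much data has actually been moved out of line, you can compare a table's heap size with the size of its TOAST table. The query below is an illustration only and assumes a table named patient:

-- Show the main (heap) size and the TOAST size for the patient table
SELECT c.relname                                         AS table_name,
       pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.relname = 'patient'
  AND c.reltoastrelid <> 0;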
Aidbox has a load-from-bucket operation that permits asynchronous upload of FHIR resources from an S3 bucket. This operation has been specially designed for a parallel batch load of tons of FHIR data.
How it works:
Note: indexes are disabled only when the disable-idx? flag is specified. Running load-from-bucket with disable-idx? is not recommended on a live production system, because it disables all indexes on the target tables.
Infrastructure for this test:
First, let's try loading around 10% of the data and see how Aidbox responds:
Aidbox load-from-bucket RPC operation call example:
POST /rpc
content-type: text/yaml
accept: text/yaml

method: aidbox.bulk/load-from-bucket
params:
  bucket: s3://<fhir-data-bucket>
  prefixes: ["fhir/1/", "fhir/2/", "fhir/3/", "fhir/4/"]
  thread-num: 16
  disable-idx?: true
  account:
    access-key-id: <access-key-id>
    secret-access-key: <secret-access-key>
    region: <region>
Let's run the import with different numbers of threads (1, 2, 4, ..., 16) and see how the execution time and average throughput change.
As the graph shows, overall data loading performance grows almost linearly as the thread count increases from 1 to 4. The gain from 4 to 12 threads is much smaller, and going from 12 to 16 threads brings no further improvement. Such limits can come from network bandwidth, database throughput, or similar factors; in our experiment neither the database nor the network was the constraint, and Java performance was the main bottleneck.
From our observations, data loading performance is strongly affected by how the data volume is distributed across resource types. As the chart above shows, small resources such as Procedure or Observation can be loaded at up to 40,000 resources per second, while huge resources such as ExplanationOfBenefit load at only around 5,300 resources per second despite accounting for 20% of the dataset's total volume. This irregularity greatly affects the overall throughput and total import time, and it can vary significantly between projects.
Based on the experiments above, we chose 12 threads for the final data import. The import completed in 4 hours and 48 minutes, creating about 1 billion resources with a total database size of 1649 GB.
Below is a summary of the import results:
This graph illustrates the overall throughput of the data upload. The speed is not constant: it varies from 8,000 resources per second up to 80,000 at its peak. The average throughput was 57,000 resources per second, 10% higher than in the previous tests.
The most significant drops in speed were observed for the Provenance resource. Its average record size was 37 KB, which is very large; records of this size are processed slowly and run into the TOAST-related issues in Postgres described above. The final throughput figures therefore depend heavily on the structure and volume of the data: small resources load almost instantly, while big ones load very slowly.
The distribution graph is particularly interesting. On average there are about 300 Observations per Patient resource, and the resources that take up the most disk space are Claim, DiagnosticReport, DocumentReference, and ExplanationOfBenefit. The Provenance resource deserves special attention: with about 1 million entries, it occupies 40 GB of disk space at an average resource size of 37 KB.
The Aidbox and Aurora CPU load graph shows that for almost the entire import, all 12 Aidbox cores were fully utilized, while the average CPU consumption of the Aurora database was 54%. The graph also shows a sharp rise in the database's CPU consumption accompanied by a drop for Aidbox. This happened while the large Provenance resources were being transferred: the Aidbox threads were waiting for responses from the database, which was busy compressing the data to store it in TOAST.
To experiment with loading large datasets into Aidbox and experience its scalability firsthand, try the free version of Aidbox. It provides a robust environment to test these capabilities, offering all necessary tools without any feature limitations.
The most obvious solution is not always the optimal one. Sometimes, to find an effective solution, you need to pose the problem differently or make some technical compromises. In this post, we have shared our view of the best practices for uploading huge datasets into Aidbox.
Authors: Marat Surmashev, Georgii Ivannikov
Get in touch with us today!