Google Spreadsheets as synchronizable remote Data Sources for your Java web app.
TL;DR
This library mainly addresses the two missing parts of the Google Sheets API:
- consistent data synchronization between multiple dependent spreadsheets
- retrieving data as business entities instead of just raw cell data
As a byproduct, the first reactive wrapper around the existing Java API began to emerge.
Google Sheets is a great repository for small data. Spreadsheets release business users from the burden of interacting with obscure RDBMS systems while providing them with an even more powerful interface for entering and working with tabular data, allowing for illustrative programming and even giving rise to "code less" movements among IT laypeople (and that's totally fine).
Nowadays it seems fine to pick Google Spreadsheets for prototypes, internal tools, or even real-world apps with short time-to-market terms. Sheets tend to become more and more of a database. The major shortcoming of this approach becomes apparent when your data reaches the boundary of this "small" amount (which is precisely 400,000 cells per sheet for Google Sheets) and then finally exceeds it (thanks to the combinatorial explosion of the denormalized data representation).
In this case, the right thing to do is to normalize your data: breaking up larger sheets into smaller ones and finally ending up with multiple dependent spreadsheets (since things become really slow when each individual cell update causes half of a spreadsheet's formulas to recalculate) with unsynchronized data that requires manual synchronization (that is, copy-pasting). The usual means, e.g. the built-in IMPORTRANGE function, don't solve the problem: dependent tables do not automatically receive notifications about the need to reload data from the updated dependencies. And even if they were receiving such notifications, it is not clear how they would operate in case of cascading dependency graph updates.
Here's where the Data Sources come to the rescue! With this ready-to-go Java library one may keep their tabular data and business logic in a familiar Google Spreadsheets format (with built-in formulas and/or custom AppScript functions) while synchronizing the dependency graph automatically via a minimalistic API.
Moreover, you might want to export the data or run some Java workloads with it; here is where the built-in JSON Schema-based data mapping facilities come in very handy. The ability to receive from your spreadsheets not just raw data (a meaningless `List<List<...>>` in Java) but full-fledged business entities (with clear semantics and a defined structure) is simply magnificent. It takes work with the Google Sheets API to a new level of convenience, starting to resemble an ORM for a traditional database.
The essence of the solution is to divide large spreadsheets into a set of smaller interconnected ones that are equipped with a certain kind of metadata (which may be totally invisible to an end user, more on this below). They can stay denormalized, yet become much less frustrating to work with thanks to narrowing the scope of cascading formula recalculations.
Main concepts in use are:
- Data Source — a source of data for the application
- data models — a set of schemas defining your business entities
- data mapping — a prescription for mapping Data Sources data to business entities
A Data Source description includes its name, data structure, dependencies, etc. Each one is backed by the corresponding Google Spreadsheet, which forms its data structure with the help of named value ranges. One may think of these ranges as columns in a column-oriented database.
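For instance (a purely hypothetical illustration), a named value range `goods_ids` covering `Goods!A2:A` would act as the 'ids' column for the entities stored in that spreadsheet.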
A business entity is composed of a group of similarly-prefixed named value ranges. The prefix is alphanumeric and ends with `_`, e.g. `goods_`. In this case a group may consist of `goods_ids`, `goods_names`, `goods_descriptions_full`, etc.
A single Data Source may serve as a data repository for multiple business entities, i.e. a single Google Spreadsheet may
contain more than one group of similarly-prefixed named value ranges.
Data Sources may (and normally should) depend on each other's data, i.e. a set of shared named value ranges. The dependency may be of two types: regular and historical. The historical data dependency is best captured by the following example: imagine you have to predict future costs of supplies based on the previously made orders of these supplies.
The Data Sources dependency graph may have all possible quirks: detached vertices, loops (when one Data Source depends on another which in turn depends on the historical data from the first one), etc. A cascade update may be started from any particular graph node. Partial synchronization of a pair of Data Sources is supported, though not recommended.
This graph can be built automatically, on the first remote request (for a pre-initialized set), or upon the registration of a new Data Source.
The metadata that one is required to provide (in the form of named value ranges) for a particular Data Source consists of:

- the `spreadsheet_name`, a unique Data Source identifier (required)
- a set of 3 indicators calculated by boolean spreadsheet formulas:
  - `spreadsheet_status_api_ready`, which indicates the absence of running re-calculations
  - `spreadsheet_status_data_errors`, which indicates data inconsistency (optional)
  - `spreadsheet_status_data_warnings`, which indicates data incompleteness (optional)
- a set of up to 3 dependency descriptors, e.g. the first one is defined with:
  - `spreadsheet_dependency_1_source`, the dependency identifier
  - `spreadsheet_dependency_1_ranges`, the named ranges of the dependency
  - `spreadsheet_dependency_1_historical`, whether it is used as historical data only
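For illustration, a hypothetical `tk-price-list` Data Source that depends on the `tk-products` one (the former name is made up here; the latter appears in the startup logs below) might define:

- `spreadsheet_name` = `tk-price-list`
- `spreadsheet_dependency_1_source` = `tk-products`
- `spreadsheet_dependency_1_ranges` = `goods_ids, goods_names`
- `spreadsheet_dependency_1_historical` = `FALSE`

plus the `spreadsheet_status_*` indicators backed by boolean formulas over its own calculated cells.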
Data models are simply a set of JSON Schema type definitions of your business entities. The "code less" JSON Schema approach was chosen as a logical continuation of the idea of using Google Spreadsheets as data repositories with a flexible and dynamic data structure. Ideally, you simply should not need to recompile your client code when you add a new data column to your business entity.
Here's what a `Design` data model might look like:
```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "$id": "http://terakulta.com/schemas/model/Design.json",
  "title": "Design",
  "description": "A particular 'Tera kulta' design that can be ordered yet customized",
  "definitions": {
    "DesignDescriptions": {
      "$id": "designDescriptions",
      "type": "object",
      "properties": {
        "full": {
          "type": "object",
          "$ref": "Text.json"
        },
        "short": {
          "type": "object",
          "$ref": "Text.json",
          "$comment": "Recommended max length is 64 characters"
        }
      }
    }
  },
  "type": "object",
  "properties": {
    "id": {
      "type": "string"
    },
    "name": {
      "type": "string"
    },
    "ver": {
      "type": "integer",
      "exclusiveMinimum": 0
    },
    "code": {
      "type": "string",
      "pattern": "[A-Z0-9]{3}-[A-Z]{1}"
    },
    "category": {
      "type": "object",
      "$ref": "GoodCategory.json"
    },
    "descriptions": {
      "type": "object",
      "$ref": "#designDescriptions"
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "id",
    "name",
    "ver",
    "code",
    "category"
  ]
}
```
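Since Manifold is the data modeling framework under the hood (see the list at the end of this README), such a schema file is projected into a Java type at compile time. Here's a minimal, hedged sketch of what client code could look like, following Manifold's conventional generated API (`create` with the schema's required properties, plus getters/setters); treat the exact method shapes as illustrative, not guaranteed:

```java
// A hedged sketch: 'Design' is projected from Design.json by Manifold at compile time.
// 'create' is assumed to take the schema's required properties (id, name, ver, code, category);
// 'category' would be a schema-projected GoodCategory instance (not shown here).
Design design = Design.create("d-001", "Plain band", 1, "AB1-C", category);
design.setTags(java.util.List.of("ring", "classic"));
String json = design.write().toJson(); // write the entity back out as JSON
```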
Please find all the main JSON data types used in this example, as well as an internal entity (the `DesignDescriptions`).
NB: The supported JSON Schema version is draft-04, which doesn't support some of the latest features, e.g. the handy `const` construct. Still, there's a trivial workaround for this: just use a single-valued `enum` property instead.
```json
...
"type": "string",
"enum": [ "?" ]
...
```
The data mapping JSON file prescribes how to map Data Source data (i.e. the values of the corresponding spreadsheet's named value ranges) onto the data models described above.
Here's what the data mapping for `Design` entities might look like:
```json
{
  "designs": {
    "type": "com.tera.kulta.schemas.model.Design",
    "dataSource": "tk-products",
    "groups": [{
      "key": "goods",
      "skipRows": 1,
      "splitIntoList": [{
        "props": [ "tags" ],
        "regex": ",\\s*"
      }]
    }],
    "mapping": {
      "goods_ids": "id",
      "goods_names": "name",
      "goods_vers": "ver",
      "goods_codes": "code",
      "goods_categories": "category.code",
      "goods_descriptions_full": "descriptions.full.code",
      "goods_descriptions_short": "descriptions.short.code",
      "goods_tags": "tags"
    }
  }
}
```
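Reading this example (our interpretation of the fields, not a formal spec): the `designs` entities of type `Design` are sourced from the `tk-products` Data Source; within the `goods`-prefixed group the first row (presumably a header) is skipped, and the `goods_tags` cell values are split into lists by the `,\s*` regex before being mapped onto the `tags` array property.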
NB: The usual Java Beans property path syntax is used for mapping (including `.` for nested and `[]` for indexed properties). See Apache BeanUtils for more details.
Properties requiring multiple named ranges can be constructed as well, with the following syntax:

```json
...
"parts_gems1_size": [ "gems[0].size.length", "gems[0].size.width", "gems[0].size.height", "gems[0].size.diameter" ],
...
```

```json
...
"assemblages_platings_2": [ "platings[1].quantity", "platings[1].code", "platings[1].valuation" ],
...
```
Please mind that indexing starts from `0`, which is usual for Java.
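For intuition about how such property paths resolve, here's a minimal, self-contained sketch using plain Apache BeanUtils directly (the `Design`/`Gem`/`Size` bean classes below are hypothetical, made up to mirror the `gems[0].size.length` path above):

```java
import org.apache.commons.beanutils.PropertyUtils;

import java.util.ArrayList;
import java.util.List;

public class PropertyPathDemo {

    public static class Size {
        private double length;
        public double getLength() { return length; }
        public void setLength(double length) { this.length = length; }
    }

    public static class Gem {
        private Size size = new Size();
        public Size getSize() { return size; }
        public void setSize(Size size) { this.size = size; }
    }

    public static class Design {
        private List<Gem> gems = new ArrayList<>(List.of(new Gem()));
        public List<Gem> getGems() { return gems; }
        public void setGems(List<Gem> gems) { this.gems = gems; }
    }

    public static void main(String[] args) throws Exception {
        Design design = new Design();
        // '.' navigates nested beans, '[0]' indexes into the 'gems' list,
        // just like in the data mapping rules above
        PropertyUtils.setProperty(design, "gems[0].size.length", 4.2);
        System.out.println(PropertyUtils.getProperty(design, "gems[0].size.length")); // prints 4.2
    }
}
```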
> Talk is cheap. Show me the code.
>
> — Linus Torvalds
First, add Data Sources as a dependency in your build/project-management system. For instance, with Maven, add JitPack as a repository and the latest release as an artifact in the `pom.xml`:
```xml
<repositories>
...
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
...
</repositories>
...
<dependencies>
...
<dependency>
<groupId>com.github.marksto</groupId>
<artifactId>data-sources</artifactId>
<version>RELEASE_VERSION</version>
</dependency>
...
</dependencies>
```
The main API is consolidated within these core services (see their JavaDoc for a detailed description):

- `DataSourcesService`
- `DataRetrievalService`
- `GoogleSheetsService`

Most of the time the first two are enough for the aforementioned data sources synchronization and mapping functionality. But there's always a lower-level `GoogleSheetsService` for abstractions to leak.
There is also the thinnest possible domain model, represented by the `DomainType` interface and its implementations. With these you wrap your actual domain models to pass them between your client code and the core library services. See the data mapping example below.
The basic feature set consists of operations on Data Sources level: registration, retrieval, and synchronization.
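To make these operations concrete before the step-by-step part, below is a minimal sketch of a hypothetical component that triggers retrieval and synchronization. It uses only the `DataSourcesService` methods that appear in the controller example further down; the package name is inferred from the startup logs, and the Data Source names in the pairwise call are made up for illustration:

```java
import marksto.data.service.DataSourcesService; // package inferred from the startup logs below
import org.springframework.stereotype.Component;
import reactor.core.publisher.Mono;

@Component
public class DataSourcesFacade {

    private final DataSourcesService dataSourcesService;

    public DataSourcesFacade(DataSourcesService dataSourcesService) {
        this.dataSourcesService = dataSourcesService;
    }

    // Re-retrieves all Data Sources metadata ('true' forces a remote refresh),
    // then starts a cascade synchronization over the whole dependency graph
    public Mono<Void> refreshAndSyncAll() {
        return dataSourcesService.retrieveDataSources(true)
                .then(dataSourcesService.synchronizeData());
    }

    // Partial synchronization of a single pair (supported, though not recommended);
    // the Data Source names here are illustrative
    public Mono<Void> syncOnePair() {
        return dataSourcesService.synchronizeDataBetween("tk-products", "tk-price-list", false);
    }
}
```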
- Supplement your Spring-based project with the following configuration:

  ```java
  @Import({ marksto.data.config.ServiceConfiguration.class, marksto.data.config.MappingConfiguration.class })
  ```

- Provide the required basic configuration properties (see Configuration below).
If you configured everything right, the startup logs should look similar to these:

```
2019-09-25 03:26:39.604  INFO 43541 --- [           main] m.d.s.impl.RemoteDataSourceInitializer  : Initializing a Data Source for '1ZWM...'
2019-09-25 03:26:39.704  INFO 43541 --- [hot-timeouter-0] m.d.i.s.impl.GoogleSheetsServiceImpl    : Establishing connection to the remote service...
2019-09-25 03:26:42.579  INFO 43541 --- [hot-timeouter-0] m.d.i.s.impl.GoogleSheetsServiceImpl    : Connection established successfully
2019-09-25 03:26:43.090  WARN 43541 --- [ remote-calls-3] m.d.s.impl.DataMappingProviderImpl      : No 'data-mapping.json' file provided, only the default one is used
2019-09-25 03:26:47.032  INFO 43541 --- [       mapper-0] m.d.service.impl.DataSourcesServiceImpl : Registering new DataSource: name='tk-products'
```
Note that `tk-products` here comes from a dedicated named value range (`spreadsheet_name`) in the provided spreadsheet. Find more details on what metadata you need to provide in your Data Source.
- Now you are good to use Data Sources in your client code! An arbitrary Spring WebFlux controller code for Data Sources metadata retrieval (update) and synchronization might look like this:

  ```java
  public static final String SYNC_ALL_ACTION = "syncAll";
  ...
  private final DataSourcesService dataSourcesService;
  ...
  @PostMapping(path = "/update", consumes = APPLICATION_JSON_UTF8_VALUE)
  public Flux<DataSource> updateDataSources(@RequestBody DataSource_Update body) {
      if (StringUtils.isEmpty(body.getDataSource())) {
          return dataSourcesService.retrieveDataSources(body.getForceRemote());
      } else {
          return dataSourcesService.retrieveDataSource(body.getName(), body.getForceRemote()).flux();
      }
  }

  @PostMapping(path = "/sync", consumes = APPLICATION_JSON_UTF8_VALUE)
  public Mono<Void> synchronizeDataSources(@RequestBody DataSource_Sync body) {
      if (SYNC_ALL_ACTION.equals(body.getActionType())) {
          return dataSourcesService.synchronizeData();
      } else {
          return dataSourcesService.synchronizeDataBetween(
              body.getDependency(), body.getDataSource(), body.getAndSubGraph());
      }
  }
  ```
In case you want to use the data mapping capabilities of the Data Sources library:
- Specify the `app.data.mapping.path` configuration property (there has to be a valid JSON file at this path).
- Introduce your data models and the Google Spreadsheets named value ranges mapping rules for them. Place the models anywhere under the `resources` directory and the mapping rules at the specified path.
- An arbitrary service code that retrieves remote data from registered Data Sources may look like this:
  ```java
  private static final DomainType<Design> DESIGNS = new StaticType<>("designs", Design.class);
  private static final DomainType<Assemblage> ASSEMBLAGES = new StaticType<>("assemblages", Assemblage.class);
  private static final DomainType<Position> PRICE_LIST_POSITIONS = new StaticType<>("positions", Position.class);

  private static final DomainType[] PRODUCT_REQUIRED_DOMAIN_TYPES
          = { DESIGNS, ASSEMBLAGES, PRICE_LIST_POSITIONS };
  ...
  private final DataRetrievalService dataRetrievalService;
  ...
  private Mono<List<TypedObjects>> uploadRequiredRemoteData() {
      return dataRetrievalService
              .getDataFrom(PRODUCT_REQUIRED_DOMAIN_TYPES)
              .collectList();
  }
  ```
Note the `StaticType` being used to wrap `designs`, `assemblages` and `positions`, which are the names of the business entities that are used, among other things, as keys in the data mapping ruleset.
In case you want to use the `EventReactivePublisher` features, e.g. for streaming operation status events to clients:
- Supply the following configuration as well:

  ```java
  @Import({ marksto.events.config.EventsConfiguration.class })
  ```
- An arbitrary Spring WebFlux controller code for events streaming might look like this:

  ```java
  private final EventReactivePublisher eventReactivePublisher;
  ...
  @GetMapping(path = "/status/events", produces = TEXT_EVENT_STREAM_VALUE)
  public Flux<DataSourceEvent> streamStatusEvents() {
      return eventReactivePublisher.on(DataSourceEvent.class);
  }
  ```
The standard `application.properties` file is used to configure the core services.
The required basic configuration:

```properties
sheets.client.type=SERVICE_ACCOUNT
sheets.client.name=<name of your application>
sheets.client.secret=<from your Service Account as JSON string>
sheets.test-sheet-id=<any spreadsheet ID to check the established connection>
```
These could also be passed as environment variables, e.g. `SHEETS_CLIENT_SECRET={ ... }`.
NB: If you choose to do so, you might need to escape the `=` characters in JSON values with a backslash.
Advanced properties with their defaults (for fine-tuning):

```properties
sheets.firstConnectionTimeout=90s
sheets.apiRequestsLimitPerSecond=0.75
sheets.expireSpreadsheetsCacheEvery= ##unset##
sheets.copyDataRetriesNum=3
sheets.copyData1stBackoff=5s
sheets.copyDataMaxBackoff=10s
sheets.apiCheckRetriesNum=10
sheets.apiCheck1stBackoff=3s
sheets.apiCheckMaxBackoff=10s
```
Please find the detailed description of these in SheetsProperties.
The required basic configuration:

```properties
app.data.sources.sheets-ids=<comma-separated list of spreadsheets IDs to pre-initialise>
app.data.sources.path=<e.g. '/data/meta/data-sources.json'>
```
These could also be passed as environment variables, e.g. `APP_DATA_SOURCES_SHEETS_IDS=...`.
Advanced properties with their defaults (for fine-tuning):

```properties
app.data.sources.default-name-prefix= ##unset##
app.data.sources.autoReInitIn=3s
app.data.sources.reInitRetriesNum=1
```
Please find the detailed description of these in DataSourcesProperties.
The basic configuration (optional unless you want to use the data mapping feature):

```properties
app.data.mapping.path=<e.g. '/data/meta/data-mapping.json'>
```
Advanced properties with their defaults (for fine-tuning):

```properties
app.data.mapping.thread-pool-size=2
app.data.mapping.expireCacheEvery= ##unset##
```
Please find the detailed description of these in DataMappingProperties.
Advanced properties with their defaults (for fine-tuning):

```properties
app.data.providers.retrieveMetadataRetriesNum=3
app.data.providers.retrieveMetadata1stBackoff=500ms
app.data.providers.retrieveDataStructureRetriesNum=3
app.data.providers.retrieveDataStructure1stBackoff=500ms
app.data.providers.retrieveDataRetriesNum=3
app.data.providers.retrieveData1stBackoff=500ms
```
Please find the detailed description of these in DataProvidersProperties.
TBD
The library was used in developing the solution for the Tera kulta web catalogue. This backend service initializes 8 Data Sources upon the application startup (see the exemplary startup logs).
Here's how it looks in a dedicated Spring WebFlux-based Admin UI.
And here's how slowly they actually get fully synchronized. Of course, this is due to the Google Spreadsheets themselves, which are well-known for their catastrophic slowdown on fairly large data sets. Just imagine how terrible and error-prone it would be to synchronize all these spreadsheets manually! With Data Sources it's just a single method call!
In Tera kulta we use Data Sources to compose a localized price list of more than 3,500 stock items from hundreds of thousands of pre-calculated cell values synchronized between spreadsheets. The result then gets converted into the Tilda CSV format and split into roughly equal parts, which can then be easily uploaded to the website's product catalog.
To build Data Sources locally you'll need Java 12 or later and Maven 3.
While in the project root as a working directory, build and install the project with the following command:

```shell
mvn clean install
```
You will get a fresh `target/data-sources-{version}.jar` file within the project, as well as its copy installed in your local Maven repository under the following path:

```
$M2_REPO/name/marksto/data-sources/{version}/data-sources-{version}.jar
```
Under the hood Data Sources leverage:
- Google Sheets API client with OAuth 2.0 credentials of the service account
- JSON Schema for Data Sources and Data Mapping definition (with support for validation)
- Manifold as the main data modeling framework with support for dynamic type reloading
- Kryo and BeanUtils for response deserialization via the data mapping rules
- Project Reactor as a Reactive Streams API for data flows
Make sure you are familiar with these, especially with Reactor's documentation.
NB: You might want to install the (paid) Manifold IntelliJ IDEA plugin for a better meta-programming experience.
- Add new features such as Dynamic Domain Types support for reloading data models at runtime
- Make Sheets API use an asynchronous non-blocking web client (Spring's `WebClient`)
- Optimize the existing threading model, parallelizing the job where possible
- Upload/download data sources and data mapping to/from Google Drive
- Describe data mapping rules in more detail with examples
- Create an environment for live demonstration
- Add more unit tests ¯\_(ツ)_/¯
Copyright © 2019-2020 Mark Sto. See the LICENSE file for license rights and limitations.