(WIP) Database Summaries

I plan to add to this post - absolutely not comprehensive, rather a few notes on dbs / warehousing solutions I've used. Also, given the pace of development at places like Snowflake, there's a distinct risk this page will be rapidly out of date 😂

Snowflake

  • Snowflake is a data warehousing solution with a very pleasing UI
  • It can sit over different cloud providers
  • Separate storage and compute
  • Warehouse (compute) management can make it hard to understand query load, but warehouse auto-suspend timeouts are effective
  • Rich SQL, though no support (still?) for regex lookarounds or dynamic SQL
  • Can enable row-, table-, schema- and database-level security with hierarchical roles
  • Can declare table constraints, though these are largely informational rather than enforced
  • Can use Snowpipe + streams and tasks for data streams, though there is also the newer Snowpipe Streaming, enabling direct streaming into the DWH from sources such as Kafka
  • Time Travel & Fail-safe allow for querying or restoring historical data as it existed at an earlier point in time (see the sketch after this list)
  • Functions can be written in a few languages (Python, JavaScript) and are easy to utilise, as are stored procedures; unfortunately no triggers (when I last checked)
  • Dynamic data masking to protect PII data is now a feature, but no clear integrations to auto-recognise problematic columns
  • Great python / snowpark features for data science
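To make the Time Travel and masking points a little more concrete, here's a minimal sketch using the snowflake-connector-python package. The account details, the customers table and the PII_READER role are all hypothetical placeholders:

```python
# A minimal sketch - the connection details, the customers table and the
# PII_READER role are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="ANALYTICS_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Time Travel: query the table as it looked one hour ago
cur.execute("SELECT COUNT(*) FROM customers AT(OFFSET => -60*60)")
print(cur.fetchone())

# Dynamic data masking (an Enterprise edition feature): hide the email column
# from everyone except a privileged role
cur.execute("""
    CREATE MASKING POLICY IF NOT EXISTS email_mask AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END
""")
cur.execute("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask")

cur.close()
conn.close()
```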

BigQuery

  • Google Cloud Platform data warehousing solution
  • Well suited to large-scale OLAP workloads
  • Optimised for denormalised or nested datasets
  • Works well on wide data; also handles long data, which can be sharded or partitioned if needed (auto-sharding on date is possible)
  • Slower for joins and queries across multiple tables
  • Inserts / deletes take a similar amount of time for a few rows as for a few thousand
  • Can stream rows in via the streaming insert API, or use batch loading for bulk ingests (see the sketch after this list)
  • Caches query results
  • Fully managed service; storage and compute scale well without interference
  • Great integrations with services like Firebase, Google Analytics, Looker, Dataproc, Tableau
  • Sits within the GCP ecosystem with its numerous integration connectors (Kafka, CouchDB, Redis, Redshift, etc.)
  • Costing can be trickier to reason about than in other systems, as on-demand queries are charged per byte scanned
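As a rough illustration of partitioning and streaming inserts, here's a minimal sketch using the google-cloud-bigquery client library; the project, dataset and events table are made-up placeholders:

```python
# A minimal sketch - "my-project.analytics.events" is a made-up table, and
# credentials are assumed to come from the environment (e.g. gcloud auth).
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.events"

# Define a table partitioned by the event_date column
schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="event_date")
client.create_table(table, exists_ok=True)

# Stream a couple of rows in via the streaming insert API
rows = [
    {"event_date": "2024-01-01", "user_id": "u1", "value": 1.5},
    {"event_date": "2024-01-01", "user_id": "u2", "value": 2.0},
]
errors = client.insert_rows_json(table_id, rows)
print(errors or "rows streamed")

# Filtering on the partitioning column limits the bytes scanned (and so the cost)
query = """
    SELECT user_id, SUM(value) AS total
    FROM `my-project.analytics.events`
    WHERE event_date = '2024-01-01'
    GROUP BY user_id
"""
for row in client.query(query).result():
    print(row.user_id, row.total)
```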

Redshift

  • Amazon Web Services' column-oriented data warehousing solution
  • Can hold up to 16 PB of data on a cluster
  • Redshift Spectrum allows querying structured and semi-structured data in S3 without loading it into the cluster; equally, data can be loaded quickly from DynamoDB or EMR
  • Downside of Spectrum is that it charges per scanned byte
  • Loads over JDBC from other sources are slower
  • Not really optimised for real-time use (it can ingest streaming data via Kinesis and use materialised views that refresh); better suited to batching and hot/cold stores
  • Table sort and distribution keys should be tuned (not automatically managed), which can be both a plus and a minus, as engineers need to know how the data will be used (see the sketch after this list)
  • The ANALYZE and EXPLAIN commands can be extremely helpful for understanding query performance and the query plan, and for planning the query queue; but workload management (WLM) can be difficult
  • Problematic with concurrent usage; multi-tenant access may limit the system, and serialisation errors can be a problem
  • No checks for primary key constraints (when I used it)
  • Automatic re-sorting of table rows and reclaiming of space (vacuuming) is a plus
  • Materialised views appear to refresh on auto-vacuuming, which increases cost - a minus
  • Dynamic data masking to protect PII data
  • Potential to integrate with AWS tooling, e.g. using Amazon Comprehend to recognise PII columns
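A minimal sketch of choosing dist/sort keys and checking the plan with EXPLAIN, using psycopg2 (Redshift speaks the Postgres wire protocol); the cluster endpoint, credentials and sales table are placeholders:

```python
# A minimal sketch - the cluster endpoint, credentials and the sales table
# are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="my_user",
    password="...",
)
cur = conn.cursor()

# Choose dist/sort keys up front: co-locate rows by customer_id for joins,
# and sort by sale_date so range filters scan fewer blocks
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(12, 2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date)
""")
conn.commit()

# EXPLAIN shows the query plan, so you can check the keys are actually used
cur.execute("""
    EXPLAIN
    SELECT customer_id, SUM(amount)
    FROM sales
    WHERE sale_date >= '2024-01-01'
    GROUP BY customer_id
""")
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()
```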

Elasticsearch

  • Distributed near-realtime JSON document search - can act as a NoSQL db but lacks distributed transactions
  • Shards can be replicated, with each node in a cluster able to host multiple shards whilst coordinating operations
  • Automatic rebalancing and rerouting is a plus, as is the effectiveness of their indexes
  • Kibana (often used with) effectively visualises the data
  • Multiple client APIs are available in different languages such as PHP, Ruby and Python (see the sketch after this list)
  • Cluster restructuring or upgrading can be problematic; equally, large data volumes may need to be archived, and restoring such indexes is a tricky procedure
  • Not easy to restructure or change your index
  • Not suited to relational data that will require joins between documents
  • Further info: this is a good (though maybe old) link on it
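A minimal sketch using the official Python client (8.x-style keyword arguments) against a local single-node cluster, indexing a document and running a match query; the app-logs index and its fields are made up:

```python
# A minimal sketch - assumes the elasticsearch 8.x Python client and a local
# cluster; the app-logs index and its fields are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a JSON document - the index is created on the fly if it doesn't exist
es.index(index="app-logs", id="1", document={
    "timestamp": "2024-01-01T12:00:00",
    "level": "ERROR",
    "message": "payment service timed out",
})
es.indices.refresh(index="app-logs")  # make the doc visible to search (near-real-time)

# Full-text search over the message field
resp = es.search(index="app-logs", query={"match": {"message": "timed out"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["message"])
```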