NVMe Storage for Database Optimization: Lessons from Tech Giants https://www.simplyblock.io/blog/nvme-database-optimization/ Thu, 17 Oct 2024 13:27:59 +0000 Leveraging NVMe-based storage for databases brings a whole new set of capabilities and performance optimization opportunities. In this blog, we explore how you can adopt NVMe storage for your database workloads, with case studies from tech giants such as Pinterest and Discord.

The post NVMe Storage for Database Optimization: Lessons from Tech Giants appeared first on simplyblock.

Database Scalability Challenges in the Age of NVMe

In 2024, data-driven organizations increasingly recognize the crucial importance of adopting NVMe storage solutions to stay competitive. With NVMe adoption still below 30%, there’s significant room for growth as companies seek to optimize their database performance and storage efficiency. We’ve looked at how major tech companies have tackled database optimization and scalability challenges, often turning to self-hosted database solutions and NVMe storage.

While it’s interesting to see what Netflix or Pinterest engineers are investing their efforts into, it is also essential to ask yourself how your organization is adopting new technologies. As companies grow and their data needs expand, traditional database setups often struggle to keep up. Let’s look at some examples of how some of the major tech players have addressed these challenges.

Pinterest’s Journey to Horizontal Database Scalability with TiDB

Pinterest, which handles billions of pins and user interactions, faced significant challenges with its HBase setup as it scaled. As their business grew, HBase struggled to keep up with evolving needs, prompting a search for a more scalable database solution. They eventually decided to go with TiDB as it provided the best performance under load.

Selection Process:

  • Evaluated multiple options, including RocksDB, ShardDB, Vitess, VoltDB, Phoenix, Spanner, CosmosDB, Aurora, TiDB, YugabyteDB, and DB-X.
  • Narrowed down to TiDB, YugabyteDB, and DB-X for final testing.

Evaluation:

  • Conducted shadow traffic testing with production workloads.
  • TiDB performed well after tuning, providing sustained performance under load.

TiDB Adoption:

  • Deployed 20+ TiDB clusters in production.
  • Stores over 200 TB of data across 400+ nodes.
  • Primarily uses TiDB 2.1 in production, with plans to migrate to 3.0.

Key Benefits:

  • Improved query performance, with 2-10x improvements in p99 latency.
  • More predictable performance with fewer spikes.
  • Reduced infrastructure costs by about 50%.
  • Enabled new product use cases due to improved database performance.

Challenges and Learnings:

  • Encountered issues like TiCDC throughput limitations and slow data movement during backups.
  • Worked closely with PingCAP to address these issues and improve the product.

Future Plans:

  • Exploring multi-region setups.
  • Considering removing Envoy as a proxy to the SQL layer for better connection control.
  • Exploring migrating to Graviton instance types for a better price-performance ratio and EBS for faster data movement (and, in turn, shorter MTTR on node failures).

Uber’s Approach to Scaling Datastores with NVMe

Uber, facing exponential growth in active users and ride volumes, needed a robust solution for their datastore “Docstore” challenges.

Hosting Environment and Limitations:

  • Initially on AWS, later migrated to hybrid cloud and on-premises infrastructure
  • Uber’s massive scale and need for customization exceeded the capabilities of managed database services

Uber’s Solution: Schemaless and MySQL with NVMe

  • Schemaless: A custom solution built on top of MySQL
  • Sharding: Implemented application-level sharding for horizontal scalability
  • Replication: Used MySQL replication for high availability
  • NVMe storage: Leveraged NVMe disks for improved I/O performance

Results:

  • Able to handle over 100 billion queries per day
  • Significantly reduced latency for read and write operations
  • Improved operational simplicity compared to Cassandra

Discord’s Storage Evolution and NVMe Adoption

Discord, facing rapid growth in user base and message volume, needed a scalable and performant storage solution.

Hosting Environment and Limitations:

  • Google Cloud Platform (GCP)
  • Discord’s specific performance requirements and need for customization led them to self-manage their database infrastructure

Discord’s storage evolution:

  1. MongoDB: Initially used for its flexibility, but faced scalability issues
  2. Cassandra: Adopted for better scalability but encountered performance and maintenance challenges
  3. ScyllaDB: Finally settled on ScyllaDB for its performance and compatibility with Cassandra

Discord also created a solution, “superdisk”, with a RAID0 array on top of the local SSDs and a RAID1 between the Persistent Disk and the RAID0 array. This let them configure the database with a disk drive that offers low-latency reads while still benefiting from the best properties of Persistent Disks. One can think of it as a “simplyblock v0.1”.
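The layering described above can be modeled in a few lines. The following is a hypothetical Python sketch, not Discord's actual implementation: writes land on both legs of the RAID1 mirror, reads prefer the low-latency local RAID0 stripe, and data survives the loss of the ephemeral SSDs because the Persistent Disk leg is durable.

```python
class SuperdiskModel:
    """Toy model of Discord's 'superdisk': a RAID1 mirror whose fast leg
    is a RAID0 stripe of local SSDs and whose durable leg is a network
    Persistent Disk. Block devices are modeled as plain dicts."""

    def __init__(self):
        self.local_stripe = {}      # fast, ephemeral (local NVMe RAID0)
        self.persistent_disk = {}   # slower, durable (network block storage)

    def write(self, block, data):
        # RAID1: every write lands on both legs of the mirror.
        self.local_stripe[block] = data
        self.persistent_disk[block] = data

    def read(self, block):
        # Reads prefer the low-latency local stripe and fall back to the
        # durable leg, e.g. after the local SSDs were lost on a restart.
        if block in self.local_stripe:
            return self.local_stripe[block]
        return self.persistent_disk[block]

disk = SuperdiskModel()
disk.write(0, b"message-log")
disk.local_stripe.clear()          # simulate losing the ephemeral SSDs
print(disk.read(0))                # b'message-log': survives on the durable leg
```

The trade-off the model makes visible: every byte is stored twice (mirror overhead), which is exactly the cost simplyblock's erasure coding avoids.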

Figure 1: Discord’s “superdisk” architecture

Key improvements with ScyllaDB:

  • Reduced P99 latencies from 40-125ms to 15ms for read operations
  • Improved write performance, with P99 latencies dropping from 5-70ms to a consistent 5ms
  • Better resource utilization, allowing Discord to reduce their cluster size from 177 Cassandra nodes to just 72 ScyllaDB nodes

Summary of Case Studies

In the table below, we can see a summary of the key initiatives taken by tech giants and their respective outcomes. Notably, all of these companies self-host their databases (on Kubernetes or on bare-metal servers) and leverage local SSDs (NVMe) for improved read/write performance and lower latency. At the same time, however, they all had to work around the data protection and scalability limitations of local disks. Discord, for example, uses RAID to mirror the disk, which causes significant storage overhead. Such an approach also doesn’t offer a logical management layer (i.e., “storage/disk virtualization”). In the next paragraphs, let’s explore how simplyblock adds even more performance, scalability, and resource efficiency to such setups.

| Company | Database | Hosting environment | Key initiative |
| --- | --- | --- | --- |
| Pinterest | TiDB | AWS EC2 & Kubernetes, local NVMe disk | Improved performance & scalability |
| Uber | MySQL | Bare-metal, NVMe storage | Reduced read/write latency, improved scalability |
| Discord | ScyllaDB | Google Cloud, local NVMe disk with RAID mirroring | Reduced latency, improved performance and resource utilization |

The Role of Intelligent Storage Optimization in NVMe-Based Systems

While these case studies demonstrate the power of NVMe and optimized database solutions, there’s still room for improvement. This is where intelligent storage optimization solutions like simplyblock are spearheading market changes.

Simplyblock vs. Local NVMe SSD: Enhancing Database Scalability

While local NVMe disks offer impressive performance, simplyblock provides several critical advantages for database scalability. Simplyblock builds a persistent layer out of local NVMe disks, which means it is neither just a cache nor just ephemeral storage. Let’s explore the benefits of simplyblock over local NVMe disks:

  1. Scalability: Unlike local NVMe storage, simplyblock offers dynamic scalability, allowing storage to grow or shrink as needed. Simplyblock can scale performance and capacity beyond the local node’s disk size, significantly improving tail latency.
  2. Reliability: Data on local NVMe is lost if an instance is stopped or terminated. Simplyblock provides advanced data protection that survives instance outages.
  3. High Availability: Local NVMe loses data availability during the node outage. Simplyblock ensures storage remains fully available even if a compute instance fails.
  4. Data Protection Efficiency: Simplyblock uses erasure coding (parity information) instead of triple replication, reducing network load and improving effective-to-raw storage ratios by about 150% (for a given amount of NVMe disk, there is 150% more usable storage with simplyblock).
  5. Predictable Performance: As IOPS demand increases, local NVMe access latency rises, often causing a significant increase in tail latencies (p99 latency). Simplyblock maintains constant access latencies at scale, improving both median and p99 access latency. Simplyblock also allows for much faster writes at high IOPS because it does not use the NVMe layer as a write-through cache; its performance therefore isn’t dependent on a backing persistent storage layer (e.g., S3).
  6. Maintainability: Upgrading compute instances impacts local NVMe storage. With simplyblock, compute instances can be maintained without affecting storage.
  7. Data Services: Simplyblock provides advanced data services like snapshots, cloning, resizing, and compression without significant overhead on CPU performance or access latency.
  8. Intelligent Tiering: Simplyblock automatically moves infrequently accessed data to cheaper S3 storage, a feature unavailable with local NVMe.
  9. Thin Provisioning: This allows for more efficient use of storage resources, reducing overprovisioning common in cloud environments.
  10. Multi-attach Capability: Simplyblock enables multiple nodes to access the same volume, which is useful for high-availability setups without data duplication. Additionally, multi-attach can decrease the complexity of volume management and data synchronization.
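Point 4 above, erasure coding instead of replication, can be illustrated with the simplest possible parity scheme: XOR parity over equal-sized blocks. This is a deliberately minimal sketch, not simplyblock's actual coding scheme; it only shows why parity needs far less raw capacity than triple replication while still tolerating a lost block.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three data blocks plus one parity block (a 3+1 scheme): 4 units of raw
# storage hold 3 units of data (~33% overhead). Triple replication would
# need 9 units of raw storage for the same 3 units (200% overhead).
data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(data)

# Simulate losing one data block, e.g. on a node failure:
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[lost_index])  # True: the lost block is reconstructed
```

Real erasure codes (e.g. Reed-Solomon variants) generalize this idea to tolerate multiple simultaneous failures, but the capacity argument is the same.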

Technical Deep Dive: Simplyblock’s Architecture

Simplyblock’s architecture is designed to maximize the benefits of NVMe while addressing common cloud storage challenges:

  1. NVMe-oF (NVMe over Fabrics) Interface: Exposes storage as NVMe volumes, allowing for seamless integration with existing systems while providing the low-latency benefits of NVMe.
  2. Distributed Data Plane: Uses a statistical placement algorithm to distribute data across nodes, balancing performance and reliability.
  3. Logical Volume Management: Supports thin provisioning, instant resizing, and copy-on-write clones, providing flexibility for database operations.
  4. Asynchronous Replication: Utilizes a block-storage-level write-ahead log (WAL) that’s asynchronously replicated to object storage, enabling disaster recovery with near-zero RPO (Recovery Point Objective).
  5. CSI Driver: Provides seamless integration with Kubernetes, allowing for dynamic provisioning and lifecycle management of volumes.
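Copy-on-write cloning (point 3 above) is easy to illustrate: a clone references its parent's blocks and only materializes blocks that are written after the clone was taken, which is what makes clones instant and space-efficient. The sketch below is a hypothetical toy model, not simplyblock's implementation.

```python
class Volume:
    """Minimal copy-on-write volume: a clone shares the parent's blocks
    and stores only the blocks overwritten after the clone was created."""

    def __init__(self, parent=None):
        self.parent = parent
        self.blocks = {}            # only locally-written blocks

    def write(self, block, data):
        self.blocks[block] = data   # copy-on-write: never touches the parent

    def read(self, block):
        if block in self.blocks:
            return self.blocks[block]
        if self.parent is not None:
            return self.parent.read(block)
        return b"\x00"              # unwritten blocks read as zeros

base = Volume()
base.write(0, b"prod-data")
clone = Volume(parent=base)         # instant: no data is copied
clone.write(0, b"test-data")        # diverges without affecting base
print(base.read(0), clone.read(0))  # b'prod-data' b'test-data'
```

The same mechanism underlies thin provisioning: capacity is only consumed for blocks that have actually been written.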

Below is a short overview of simplyblock’s high-level architecture in the context of PostgreSQL, MySQL, or Redis instances hosted in Kubernetes. Simplyblock creates a clustered shared pool out of local NVMe storage attached to Kubernetes compute worker nodes (storage is persistent, protected by erasure coding), serving database instances with the performance of local disk but with an option to scale out into other nodes (which can be either other compute nodes or separate, disaggregated, storage nodes). Further, the “colder” data is tiered into cheaper storage pools, such as HDD pools or object storage.
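The tiering of "colder" data mentioned above boils down to a placement policy over access recency. A minimal, hypothetical sketch follows; the one-day threshold and the hot/cold labels are illustrative assumptions, not simplyblock parameters.

```python
import time

def tier_blocks(last_access, now=None, cold_after=86400):
    """Classify blocks as 'hot' (keep on NVMe) or 'cold' (demote to a
    cheaper tier such as an HDD pool or object storage) based on how
    long ago each block was last accessed."""
    now = time.time() if now is None else now
    return {
        blk: ("cold" if now - ts > cold_after else "hot")
        for blk, ts in last_access.items()
    }

now = 1_000_000
placement = tier_blocks({"b1": now - 10, "b2": now - 200_000}, now=now)
print(placement)  # {'b1': 'hot', 'b2': 'cold'}
```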

Figure 2: Simplified simplyblock architecture

Applying Simplyblock to Real-World Scenarios

Let’s explore how simplyblock could enhance the setups of the companies we’ve discussed:

Pinterest and TiDB with simplyblock

While TiDB solved Pinterest’s scalability issues, and they are exploring Graviton instances and EBS for a better price-performance ratio and faster data movement, simplyblock could potentially offer additional benefits:

  1. Price/Performance Enhancement: Simplyblock’s storage orchestration could complement Pinterest’s move to Graviton instances, potentially amplifying the price-performance benefits. By intelligently managing storage across different tiers (including EBS and local NVMe), simplyblock could help optimize storage costs while maintaining or even improving performance.
  2. MTTR Improvement & Faster Data Movements: In line with Pinterest’s goal of faster data movement and reduced Mean Time To Recovery (MTTR), simplyblock’s advanced data management capabilities could further accelerate these processes. Its efficient data protection with erasure coding and multi-attach capabilities helps with smooth failovers or node failures without performance degradation. If a node fails, simplyblock can quickly and autonomously rebuild the data on another node using parity information provided by erasure coding, eliminating downtime.
  3. Better Scalability through Disaggregation: Simplyblock’s architecture allows for the disaggregation of storage and compute, which aligns well with Pinterest’s exploration of different instance types and storage options. This separation would provide Pinterest with greater flexibility in scaling their storage and compute resources independently, potentially leading to more efficient resource utilization and easier capacity planning.
Figure 3: Simplyblock’s multi-attach functionality visualized

Uber’s Schemaless

While Uber’s custom Schemaless solution on MySQL with NVMe storage is highly optimized, simplyblock could still offer benefits:

  1. Unified Storage Interface: Simplyblock could provide a consistent interface across Uber’s diverse storage needs, simplifying operations.
  2. Intelligent Data Placement: For Uber’s time-series data (like ride information), simplyblock’s tiering could automatically optimize data placement based on age and access patterns.
  3. Enhanced Disaster Recovery: Simplyblock’s asynchronous replication to S3 could complement Uber’s existing replication strategies, potentially improving RPO.

Discord and ScyllaDB

Discord’s move to ScyllaDB already provided significant performance improvements, but simplyblock could further enhance their setup:

  1. NVMe Resource Pooling: By pooling NVMe resources across nodes, simplyblock would allow Discord to further reduce their node count while maintaining performance.
  2. Cost-Efficient Scaling: For Discord’s rapidly growing data needs, simplyblock’s intelligent tiering could help manage costs as data volumes expand.
  3. Simplified Cloning for Testing: Simplyblock’s instant cloning feature could be valuable for Discord’s development and testing processes. It allows for quick replication of production data without additional storage overhead.

What’s next in the NVMe Storage Landscape?

The case studies from Pinterest, Uber, and Discord highlight the importance of continuous innovation in database and storage technologies. These companies have pushed beyond the limitations of managed services like Amazon RDS to create custom, high-performance solutions often built on NVMe storage.

However, the introduction of intelligent storage optimization solutions like simplyblock represents the next frontier in this evolution. By providing an innovative layer of abstraction over diverse storage types, implementing smart data placement strategies, and offering features like thin provisioning and instant cloning alongside tight integration with Kubernetes, simplyblock spearheads market changes in how companies approach storage optimization.

As data continues to grow exponentially and performance demands increase, the ability to intelligently manage and optimize NVMe storage will become ever more critical. Solutions that can seamlessly integrate with existing infrastructure while providing advanced features for performance, cost optimization, and disaster recovery will be key to helping companies navigate the challenges of the data-driven future.

The trend towards NVMe adoption, coupled with intelligent storage solutions like simplyblock is set to reshape the database infrastructure landscape. Companies that embrace these technologies early will be well-positioned to handle the data challenges of tomorrow, gaining a significant competitive advantage in their respective markets.

Unifying customer data | Steven Renwick https://www.simplyblock.io/blog/unifying-customer-data-steven-renwick/ Tue, 06 Aug 2024 01:33:44 +0000

The post Unifying customer data | Steven Renwick appeared first on simplyblock.

Introduction:

This interview is part of the simplyblock Cloud Frontier Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, and our show site.

In this episode of simplyblock’s Cloud Frontier podcast, Rob Pankow speaks with Steven Renwick, co-founder and CEO of Tilores, about the critical need for unifying customer data across multiple platforms and databases. Steven explains how Tilores helps enterprises solve the complex problem of scattered customer data through real-time unification. As companies increasingly rely on data to drive decision-making, unifying this data is key to ensuring a consistent, accurate, and actionable view of customers.

Key Takeaways

What is Data Unification, and why is it Crucial for Enterprises Managing Customer Data?

Data unification is the process of gathering fragmented customer data from multiple databases and systems into a single, coherent profile. For enterprises managing massive amounts of customer information, this unified view is critical for improving customer experience, optimizing decision-making, and ensuring compliance with regulations like GDPR. Without unified data, companies risk making uninformed decisions based on incomplete or inaccurate information. Through technologies like Tilores, enterprises can automate this process, achieving real-time synchronization of data across platforms.

What are Fuzzy Matching Algorithms, and how do they Improve Data Matching?

Fuzzy matching algorithms play a vital role in data unification by identifying records that are similar but not identical, such as when names or addresses are misspelled or slightly altered. This method helps ensure that disparate data points, which might otherwise be missed, are combined under one customer profile. For companies dealing with high volumes of data, fuzzy matching increases the accuracy of entity resolution, allowing them to consolidate customer information more effectively and reduce errors.
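As a concrete illustration, a basic form of fuzzy matching can be built on string-similarity scoring. The sketch below uses Python's standard-library `SequenceMatcher` with an arbitrary 0.85 threshold; production systems such as Tilores combine multiple, more sophisticated signals, so this is a generic example rather than their actual algorithm.

```python
from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.85):
    """Treat two strings as referring to the same entity if their
    similarity ratio clears the threshold, tolerating typos and
    small spelling variations. The 0.85 threshold is illustrative."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

print(fuzzy_match("Steven Renwick", "Stephen Renwick"))  # True
print(fuzzy_match("Steven Renwick", "Maria Lopez"))      # False
```

In practice, the threshold is a precision/recall dial: lower values merge more records (risking false merges), higher values keep more duplicates apart.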

What are the Challenges of Fundraising for Early-stage Startups, and how can they be Overcome?

Raising investment as an early-stage startup can be daunting, especially when it involves speaking to hundreds of potential investors. According to Steven, one of the biggest challenges is keeping up morale through numerous rejections while maintaining the passion and energy to sell the vision to the next investor. Preparation is key—ensuring your pitch deck is polished and practicing your pitch extensively is critical. Additionally, building relationships with investors ahead of time and leveraging introductions can increase your chances of success. It’s a numbers game, and persistence is crucial.

EP1: Unifying customer data | Steven Renwick

In addition to highlighting the key takeaways, it’s essential to provide deeper context and insights that enrich the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach enhances your engagement with the content and helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Rob Pankow. Ultimately, this allows for a more immersive and insightful listening experience.

Key Learnings

How can Enterprises Ensure the Security of Customer Data in Serverless Environments?

Enterprises moving to serverless environments must prioritize security by implementing best practices such as data encryption, access controls, and continuous monitoring. Serverless architecture reduces the need for manual infrastructure management, but it introduces new security considerations, such as the need for secure APIs and regular audits to ensure compliance with data protection laws. Enterprises should also work closely with cloud service providers to ensure that all security protocols are met and that data is stored in compliance with relevant regulations.

Simplyblock Insight:

Ensuring data security is a critical component of modern cloud architectures, particularly in serverless environments where flexibility and scalability are key. Simplyblock’s cloud storage solutions offer end-to-end encryption, ensuring that customer data is securely managed even as businesses scale their serverless infrastructure. By integrating simplyblock’s secure storage platform with cloud services, enterprises can ensure that sensitive data remains protected while benefiting from the agility and cost-efficiency of serverless technology.

How does Data Unification Improve the Accuracy of AI-driven Customer Insights?

Data unification improves the accuracy of AI-driven insights by providing a single source of truth for customer information. Without unified data, AI models risk being trained on incomplete or inconsistent data, leading to inaccurate predictions and recommendations. When data from different sources is unified, AI models can access a comprehensive and accurate dataset, which enhances their ability to generate actionable insights. This is especially important for enterprises leveraging AI for customer personalization, recommendation systems, and behavior analysis.

Simplyblock Insight:

The performance of AI models is highly dependent on the quality and consistency of the data they analyze. Simplyblock’s storage infrastructure ensures that unified customer data is stored securely and accessed with low latency, enabling AI models to perform efficiently. By offering scalable and reliable data storage solutions, simplyblock helps enterprises maintain the high-quality datasets required for accurate and impactful AI insights.

What are the Benefits of using Cloud-based Data Infrastructure for Enterprises?

Cloud-based data infrastructure provides enterprises with the flexibility to scale their operations, reduce costs, and improve accessibility. By leveraging cloud services, companies can avoid the upfront costs of physical hardware and instead pay for resources as needed. This allows for more efficient data management, particularly for large enterprises that need to handle vast amounts of customer data across multiple regions. Additionally, cloud infrastructure offers advanced features like automated backups, disaster recovery, and real-time data processing, which are essential for ensuring business continuity and minimizing downtime.

Simplyblock Insight:

Simplyblock’s cloud platform offers enterprises a high-performance, scalable storage solution that integrates seamlessly with cloud-based data infrastructure. By providing robust data management capabilities, simplyblock enables businesses to scale their operations without sacrificing data security or performance. Whether handling large datasets or integrating with AI systems, simplyblock’s infrastructure ensures reliable, real-time access to data, supporting business growth and innovation.

Additional Nugget of Information

What is Entity Resolution, and how can it help Businesses Manage Customer Data more Effectively?

Entity resolution is the process of identifying and merging different records that refer to the same entity, such as a customer, in a dataset. This is critical for businesses dealing with customer data spread across multiple systems. Entity resolution ensures that a single, unified profile is created for each customer, reducing duplication and enhancing data quality. This improves decision-making, customer service, and marketing efforts by providing a more accurate understanding of each customer.
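A heavily simplified form of entity resolution, merging records on a normalized identifier, can be sketched as follows. Real entity resolution also applies fuzzy matching across several fields; the single-key grouping here is an illustrative assumption.

```python
def resolve_entities(records, key="email"):
    """Group records that share a normalized identifier into one
    entity profile: a deliberately simple stand-in for real entity
    resolution, which would combine several fuzzy signals."""
    entities = {}
    for rec in records:
        normalized = rec[key].strip().lower()
        entities.setdefault(normalized, []).append(rec)
    return entities

records = [
    {"name": "S. Renwick", "email": "steven@example.com"},
    {"name": "Steven Renwick", "email": "Steven@Example.com "},
    {"name": "Maria Lopez", "email": "maria@example.com"},
]
profiles = resolve_entities(records)
print(len(profiles))                        # 2 unified profiles
print(len(profiles["steven@example.com"]))  # 2 records merged into one
```

Even this toy version shows the core payoff: duplicates collapse into one profile, so downstream analytics and AI models see each customer exactly once.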

Conclusion

Unifying customer data is essential for enterprises aiming to deliver seamless, personalized experiences across all touchpoints. As Steven Renwick shared, fragmented customer data can hinder an organization’s ability to make informed decisions and provide accurate insights. By utilizing technologies like fuzzy matching algorithms and APIs, companies can unify customer data in real time, improving the overall customer experience and enabling AI-driven insights.

Simplyblock’s cloud infrastructure supports these efforts by offering secure, scalable storage solutions that integrate seamlessly with data unification tools. Whether you’re managing sensitive customer information or building AI-driven applications, simplyblock provides the reliable, high-performance storage needed to unify data and drive business outcomes.

To learn more about optimizing data infrastructure and improving customer insights, be sure to tune in to future episodes of the Cloud Frontier podcast!

Rockset alternatives: migrate with simplyblock https://www.simplyblock.io/blog/rockset-alternatives-migrate-with-simplyblock/ Wed, 24 Jul 2024 01:52:57 +0000

The post Rockset alternatives: migrate with simplyblock appeared first on simplyblock.

The Rockset Transition: what you need to know

On June 21, 2024, Rockset announced its acquisition by OpenAI, setting off a countdown for many organizations using their database. If you’re a Rockset user, you’re now facing a critical deadline: September 30, 2024, at 17:00 PDT. By this date, all existing Rockset customers must transition off the platform. This sudden shift has left many companies scrambling to find suitable alternatives that can match Rockset’s performance in real-time analytics, business intelligence, and machine learning applications. At simplyblock, we understand the urgency and complexity of this situation, and we’re here to guide you through this transition.

Key Transition Points:

Top Rockset Alternatives: Finding your Perfect Match

As you navigate this transition, it’s crucial to find a solution that not only matches Rockset’s capabilities but also aligns with your specific use cases. Here are some top alternatives, along with how simplyblock can enhance their performance:

Figure: Rockset alternatives for migration

1. ClickHouse: the OLAP Powerhouse

ClickHouse is an open-source, column-oriented DBMS that excels in online analytical processing (OLAP) and real-time analytics.

simplyblock benefit: Our NVMe-based block storage significantly boosts ClickHouse’s already impressive query performance on large datasets, making it even more suitable for high-throughput analytics workloads.

2. StarTree: Real-Time Analytics at Scale

Built on Apache Pinot, StarTree is designed for real-time analytics at scale, making it a strong contender for Rockset users.

simplyblock benefit: StarTree’s distributed architecture pairs perfectly with our block storage, allowing for faster data ingestion and query processing across nodes.

3. Qdrant: Vector Similarity Search Engine

Qdrant is a vector similarity search engine designed for production environments, ideal for machine learning applications.

simplyblock benefit: Our storage solution dramatically reduces I/O wait times for Qdrant, enabling even faster vector searches on large datasets.

4. CrateDB: Distributed SQL with a Twist

CrateDB is an open-source distributed SQL database with built-in support for geospatial and full-text search.

simplyblock benefit: simplyblock enhances CrateDB’s distributed nature, allowing for faster data distribution and replication across nodes.

5. Weaviate: the Versatile Vector Database

Weaviate is an open-source vector database that supports various machine learning models and offers GraphQL-based queries.

simplyblock benefit: Our block storage solution enhances Weaviate’s performance, especially for write-intensive operations and real-time updates to vector indexes.

Additional Alternatives Worth considering

While the above options are our top picks, several other databases deserve mention:

  • MongoDB Atlas: Flexible document database with analytics capabilities
  • Redis Enterprise Cloud: In-memory data structure store, ideal for caching and real-time data processing
  • CockroachDB: Distributed SQL database offering strong consistency
  • Neo4j Graph Database: Specialized for handling complex, interconnected data
  • Databricks Data Intelligence Platform: Comprehensive solution for big data analytics and ML
  • YugabyteDB: Combines SQL and NoSQL capabilities in a distributed database

How Simplyblock Enhances your new Data Setup

Regardless of which alternative you choose, simplyblock’s NVMe-based block storage engine can significantly enhance your new data infrastructure:

  1. Reduced I/O Wait Times: Cut down on data access latency, crucial for real-time analytics.
  2. Optimized for High Concurrency: Handle numerous concurrent queries efficiently, perfect for busy BI dashboards.
  3. AWS Compatibility: Seamlessly integrate with your existing AWS infrastructure.
  4. Scalability: Maintain high performance as your data volumes grow.
  5. Cost-Efficiency: Improve storage performance to potentially reduce overall resource needs and lower cloud costs.

Use Cases we Support

Our solution is versatile and can support a wide range of data-intensive applications, including:

  • Real-time dashboards and analytics
  • Business intelligence (BI)
  • Data warehouse speed layer
  • Logging and metrics analysis
  • Machine learning (ML) and data science applications
  • Vector similarity search

Conclusion: Turning Challenge into Opportunity

While the Rockset transition poses significant challenges, it also presents an opportunity to optimize your data infrastructure. By pairing your chosen alternative with simplyblock’s high-performance block storage, you can create a robust, efficient, and future-proof analytics solution.

As you evaluate these options, our team is ready to provide insights on leveraging simplyblock with your new data platform. We’re committed to helping you not just migrate, but upgrade your data capabilities in the process.

Remember, the September 30th deadline is approaching rapidly. Start your migration journey today, and let simplyblock help you build a data setup that outperforms your expectations.

How can Simplyblock be used on AWS?

simplyblock offers high-performance cloud block storage that not only enhances the performance of your databases and applications but also brings cost efficiency. Most importantly, the simplyblock storage cluster is based on EC2 instances with local NVMe disks, which qualify for AWS Savings Plans and Reservation Discounts. This means you can leverage simplyblock’s technology while also fulfilling your compute commitment to AWS. Such a solution effectively extends the AWS Compute Savings Plan to storage. It’s a win-win situation for AWS users seeking performance, scalability, and cost-effectiveness for fast NVMe-based storage.

Simplyblock uses NVMe over TCP for minimal access latency, high IOPS/GB, and efficient CPU core utilization, surpassing local NVMe disks and Amazon EBS in cost/performance ratio at scale. Moreover, simplyblock can be used alongside AWS EDP or AWS Savings Plans.

Ideal for high-performance Kubernetes environments, simplyblock combines the benefits of local-like latency with the scalability and flexibility necessary for dynamic AWS EKS deployments , ensuring optimal performance for I/O-sensitive workloads like databases. Using erasure coding (a better RAID) instead of replicas helps to minimize storage overhead without sacrificing data safety and fault tolerance.
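The erasure-coding idea mentioned above can be illustrated with a toy example. This is not simplyblock's actual implementation (production systems use Reed-Solomon-style codes across many devices and can tolerate multiple failures); it is a minimal sketch of the underlying principle: a parity chunk lets one lost chunk be rebuilt without keeping a full replica.

```python
# Toy illustration of erasure coding's core idea: XOR parity.
# Storing N data chunks + 1 parity chunk costs far less than 2x replication,
# yet any single lost chunk can be reconstructed.

def xor_parity(chunks):
    """Compute a parity chunk as the byte-wise XOR of all data chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

def rebuild(surviving_chunks, parity):
    """Recover the single missing chunk from the survivors plus parity."""
    return xor_parity(list(surviving_chunks) + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]   # three equally sized data chunks
parity = xor_parity(data)

# Simulate losing the second chunk and rebuilding it from the rest:
recovered = rebuild([data[0], data[2]], parity)
assert recovered == b"BBBB"
```

The storage overhead here is one extra chunk for three data chunks (33%), versus 100% for a full replica, which is the cost advantage the text refers to.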

With additional features such as instant snapshots (full and incremental), copy-on-write clones, thin provisioning, compression, and encryption, simplyblock meets your requirements before you even set them. Get started with simplyblock right now or learn more about our feature set. Simplyblock is available on the AWS Marketplace.

The post Rockset alternatives: migrate with simplyblock appeared first on simplyblock.

Neo4j in Cloud and Kubernetes: Advantages, Cypher Queries, and Use Cases https://www.simplyblock.io/blog/exploring-neo4j-advantages-query-simplification-and-practical-use-cases-in-cloud-and-kubernetes-e/ Mon, 15 Jul 2024 02:09:36 +0000 https://www.simplyblock.io/?p=1782 Introduction: In the era of big data, managing complex relationships is crucial. Neo4j , a leading graph database, excels at handling intricate data connections, making it indispensable for modern applications. This post explores Neo4j’s advantages, Cypher query language, and real-world applications in cloud and Kubernetes environments. Why Choose Neo4j over Traditional Relational Databases? Optimized for […]

Introduction:

In the era of big data, managing complex relationships is crucial. Neo4j, a leading graph database, excels at handling intricate data connections, making it indispensable for modern applications. This post explores Neo4j’s advantages, Cypher query language, and real-world applications in cloud and Kubernetes environments.

Why Choose Neo4j over Traditional Relational Databases?

  - Optimized for complex relationships
  - Greater schema flexibility
  - Intuitive data modeling
  - Efficient query performance
  - Natural data exploration

What are the Main Differences between Graph Databases and Relational Databases?

Data model difference between graph and relational databases

Graph databases and relational databases differ primarily in their structure and data retrieval methods. Graph databases excel at managing and querying relationships between data points, using nodes to represent entities and edges to illustrate connections. This structure is particularly useful for applications like social networks, fraud detection, and recommendation engines where relationships are key.

In contrast, relational databases organize data into tables with rows and columns, focusing on structured data and using SQL (Structured Query Language) for CRUD operations (Create, Read, Update, Delete). Relational databases are ideal for applications requiring complex queries and transactions, such as financial systems and enterprise resource planning (ERP) solutions.

Understanding these differences helps in selecting the appropriate database type based on specific application needs and data complexities.

How does Neo4j Handle Data Relationships Compared to SQL Databases?

Neo4j handles data relationships by using a graph-based model that directly connects data points (nodes) through relationships (edges). This allows for highly efficient querying and traversal of complex relationships without the need for JOIN-like operations. Each relationship in Neo4j is stored as a first-class entity, making it easy to navigate and query intricate connections with minimal latency.

In contrast, SQL databases manage relationships using foreign keys and JOIN operations across tables. While SQL databases are efficient for structured data and predefined queries, handling deeply nested or highly interconnected data often requires complex JOIN statements, which can be resource-intensive and slower. Neo4j’s graph model is specifically optimized for queries involving relationships, providing significant performance advantages in scenarios where understanding and traversing connections between data points is crucial.
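To make the traversal-versus-JOIN point concrete, here is a minimal sketch in plain Python (hypothetical data, no Neo4j involved) of how a graph store answers a multi-hop question: each hop is a direct edge lookup on the stored relationships, rather than a table JOIN per hop.

```python
# Hypothetical social graph stored as an adjacency map:
# node -> set of directly connected nodes (the "edges").
FOLLOWS = {
    "alice": {"bob", "carol"},
    "bob":   {"dave"},
    "carol": {"dave", "erin"},
    "dave":  {"frank"},
    "erin":  set(),
    "frank": set(),
}

def within_hops(graph, start, max_hops):
    """All nodes reachable from `start` in at most `max_hops` edge traversals."""
    frontier, seen = {start}, set()
    for _ in range(max_hops):
        # Expanding the frontier is one dictionary lookup per node --
        # the graph-database equivalent of following stored relationships.
        frontier = {n for cur in frontier
                    for n in graph.get(cur, set())} - seen - {start}
        seen |= frontier
    return seen

print(sorted(within_hops(FOLLOWS, "alice", 2)))  # → ['bob', 'carol', 'dave', 'erin']
```

In SQL, the same two-hop question would require a self-JOIN of the follows table with itself; each additional hop adds another JOIN, which is the cost the paragraph above describes.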

When should Developers Choose Neo4j over a Traditional Database?

Developers should choose Neo4j over a traditional database when their application involves complex and dynamic relationships between data points. Neo4j’s graph-based model excels in scenarios such as social networking, recommendation systems, fraud detection, network and IT operations, and knowledge graphs, where understanding and querying intricate connections is critical. If the use case demands real-time querying and analysis of data relationships, such as finding the shortest path between nodes or traversing multi-level hierarchies efficiently, Neo4j provides superior performance and scalability compared to traditional relational databases. Additionally, Neo4j is advantageous when the data structure is flexible and evolves over time, as its schema-free nature allows for easy adaptation to changing requirements without significant reworking of the database schema. Choosing Neo4j can greatly enhance performance and simplify development in applications heavily reliant on interconnected data.
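The "shortest path between nodes" query mentioned above can be sketched as a plain breadth-first search over an adjacency map. In Cypher this would typically be expressed with a `shortestPath()` pattern; the toy version below (hypothetical data and function names) just shows the underlying traversal idea.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    queue, prev = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor chain back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in prev:          # first visit = shortest hop count
                prev[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical route graph: node -> list of neighbors.
ROADS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(shortest_path(ROADS, "a", "e"))  # → ['a', 'b', 'd', 'e']
```

A graph database runs this kind of traversal natively over stored relationships, which is why multi-level hierarchy queries stay cheap as data grows.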

Simplifying Complex Data with Cypher Query Language

Cypher, Neo4j’s query language, streamlines data relationship management through:

  1. Intuitive syntax
  2. Declarative nature
  3. Powerful pattern matching
  4. Efficient recursion handling
  5. Built-in graph functions
  6. Advanced aggregation and filtering
  7. Seamless integration with graph algorithms

How does Cypher Differ from SQL in Querying Complex Relationships?

Cypher, the query language for Neo4j, differs from SQL in its intuitive approach to querying complex relationships. Cypher uses pattern matching to navigate through nodes and relationships, making it naturally suited for graph traversal and relationship-focused queries. For example, finding connections between nodes in Cypher involves specifying patterns that resemble the graph structure, making the queries concise and easier to understand.

In contrast, SQL relies on JOIN operations to link tables based on foreign keys, which can become cumbersome and less efficient for deeply nested or highly interconnected data. Complex relationships in SQL require multiple JOINs and subqueries, often leading to more verbose and harder-to-maintain queries.

Cypher’s declarative syntax allows developers to describe what they want to retrieve without specifying how to retrieve it, optimizing the underlying traversal and execution. This makes Cypher particularly powerful for applications needing to query and analyze intricate data relationships, such as social networks, recommendation engines, and network analysis, providing a clear advantage over SQL in these scenarios.

What are the Key Features of Cypher that Make it Ideal for Graph Databases?

Cypher is ideal for graph databases due to several key features:

  1. Pattern Matching : Allows intuitive querying by describing graph structures.
  2. Declarative Syntax : Simplifies complex queries, letting the engine optimize execution.
  3. Traversal Efficiency : Excels at navigating and exploring interconnected data.
  4. Flexible Relationships : Easily handles various types and attributes of relationships.
  5. Readability : Shorter, more readable queries compared to SQL’s JOIN operations.
  6. Aggregation and Transformation : Supports advanced data analysis functions.
  7. Schema-Free : Works well with dynamic, evolving data models.
  8. Graph Algorithms : Integrates with Neo4j’s built-in algorithms for advanced analytics.

These features make Cypher a powerful language for managing and querying complex relationships in graph databases.
Key features of Cypher

Neo4j use Cases in Cloud and Kubernetes Environments

  1. Microservices Management : Neo4j helps manage and visualize microservice architectures by storing the relationships between services and tracking their dependencies and interactions, enhancing troubleshooting and system optimization.
  2. Fraud Detection : It identifies patterns and anomalies in transactional data, enabling real-time detection and prevention of fraudulent activities through relationship analysis.
  3. Identity and Access Management : Neo4j efficiently maps user permissions and roles, ensuring secure and scalable identity management and access control.
  4. IT Operations and Network Management : It monitors and optimizes IT infrastructure by mapping and analyzing network topologies, dependencies, and configurations.
  5. Recommendation Engines : Leveraging graph algorithms, Neo4j provides personalized recommendations by analyzing user preferences and relationships between items.
  6. Supply Chain Optimization : Neo4j optimizes supply chain processes by mapping product flows, identifying bottlenecks, and enhancing logistics management through relationship analysis.
  7. Healthcare Data Management : It manages complex healthcare data by integrating patient records, treatments, and outcomes, improving patient care and operational efficiency.
  8. Social Network Analysis : Neo4j uncovers insights into social networks by analyzing connections and interactions, supporting marketing, and user engagement strategies.
  9. Knowledge Graph Construction : It constructs and manages knowledge graphs, linking diverse data sources to provide a unified view and advanced search capabilities.
  10. Compliance and Regulatory Reporting : Neo4j ensures compliance by tracking data lineage, managing regulatory requirements, and generating comprehensive reports for audits and governance.

How does Neo4j Enhance Microservices Management in Kubernetes?

Neo4j enhances microservices management in Kubernetes by providing a clear visualization of service dependencies and interactions. It helps in tracking the relationships between microservices, enabling efficient monitoring and troubleshooting. By mapping the complex network of services, Neo4j allows for a better understanding and management of service communications and dependencies, making it easier to identify issues, optimize performance, and ensure seamless integration within a dynamic Kubernetes environment.
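As a sketch of the dependency question described above, the toy example below (hypothetical service names, not a real deployment) computes the "blast radius" of a failing service by walking call edges in reverse. In practice this would be a Cypher traversal over the stored service graph.

```python
# Hypothetical service dependency graph; edge A -> B means "A calls B".
CALLS = {
    "frontend":  ["orders", "auth"],
    "orders":    ["payments", "inventory"],
    "payments":  ["auth"],
    "inventory": [],
    "auth":      [],
}

def impacted_by(calls, failing):
    """All services that transitively depend on `failing`."""
    # Invert the edges: for each service, who calls it.
    callers = {}
    for svc, deps in calls.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(svc)
    # Reverse transitive closure from the failing service.
    impacted, stack = set(), [failing]
    while stack:
        for caller in callers.get(stack.pop(), ()):
            if caller not in impacted:
                impacted.add(caller)
                stack.append(caller)
    return impacted

print(sorted(impacted_by(CALLS, "auth")))  # → ['frontend', 'orders', 'payments']
```

This is exactly the kind of "what breaks if this service goes down" question that a stored dependency graph answers in one traversal.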

What Advantages does Neo4j Offer for Fraud Detection in Cloud Environments?

In cloud environments, Neo4j offers several advantages for fraud detection:

  1. Real-Time Analysis : Neo4j’s graph model allows for rapid querying and analysis of transactional data, enabling real-time detection of fraudulent activities by identifying unusual patterns and connections.
  2. Pattern Recognition : Its ability to model and analyze complex relationships helps in recognizing sophisticated fraud patterns that might be missed by traditional methods.
  3. Anomaly Detection : By examining relationships and behaviors across multiple dimensions, Neo4j can quickly spot anomalies and irregularities in transaction data.
  4. Scalability : Neo4j scales efficiently in cloud environments, handling large volumes of data and complex queries required for comprehensive fraud detection.
  5. Flexibility : The schema-free nature of Neo4j allows for easy adaptation to evolving fraud strategies and data models, ensuring ongoing effectiveness in detecting new types of fraud.
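A common relationship-based fraud pattern is accounts linked through shared devices. The following is a minimal, hypothetical sketch of that idea in plain Python; a real deployment would express it as a Cypher pattern over a live transaction graph rather than in application code.

```python
# Hypothetical bipartite graph of (account, device) "used" edges.
USED = [
    ("acct1", "devA"), ("acct2", "devA"),
    ("acct2", "devB"), ("acct3", "devB"),
    ("acct4", "devC"),
]

def linked_accounts(edges, flagged):
    """Accounts connected to `flagged` via any chain of shared devices."""
    by_acct, by_dev = {}, {}
    for acct, dev in edges:
        by_acct.setdefault(acct, set()).add(dev)
        by_dev.setdefault(dev, set()).add(acct)
    linked, stack = set(), [flagged]
    while stack:
        acct = stack.pop()
        for dev in by_acct.get(acct, ()):
            for other in by_dev[dev] - {acct}:
                if other not in linked and other != flagged:
                    linked.add(other)
                    stack.append(other)
    return linked

# Accounts in the same device-sharing ring as a flagged account:
print(sorted(linked_accounts(USED, "acct1")))  # → ['acct2', 'acct3']
```

Chains like acct1 → devA → acct2 → devB → acct3 are hard to spot with row-oriented queries but fall out of a single relationship traversal, which is the pattern-recognition advantage listed above.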

How can Neo4j Improve Supply Chain Management and Optimization?

Neo4j improves supply chain management by:

  1. Providing End-to-End Visibility : Maps relationships across the supply chain to identify bottlenecks and inefficiencies.
  2. Optimizing Demand and Inventory : Analyzes patterns to balance stock levels and prevent overstock or stockouts.
  3. Managing Risks : Identifies vulnerabilities and potential risks within the supply chain.
  4. Enhancing Logistics : Optimizes routes and distribution strategies for efficiency.
  5. Facilitating Collaboration : Improves coordination and decision-making among supply chain partners.
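As a toy illustration of the risk analysis above, the sketch below flags supply-chain nodes whose removal disconnects the supplier from the customer. The data and node names are hypothetical; inside a graph database this would typically be done with built-in graph algorithms rather than hand-rolled traversal.

```python
# Hypothetical supply-chain flow graph; edge A -> B means goods move A to B.
FLOWS = {
    "supplier":  ["plant1", "plant2"],
    "plant1":    ["warehouse"],
    "plant2":    ["warehouse"],
    "warehouse": ["customer"],
    "customer":  [],
}

def reachable(graph, start, skip=None):
    """Nodes reachable from `start`, optionally pretending `skip` is removed."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt != skip and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def bottlenecks(graph, src, dst):
    """Intermediate nodes whose removal cuts every src -> dst path."""
    return [n for n in graph
            if n not in (src, dst) and dst not in reachable(graph, src, skip=n)]

print(bottlenecks(FLOWS, "supplier", "customer"))  # → ['warehouse']
```

Here the two plants back each other up, but the single warehouse is a bottleneck: losing it severs every path to the customer.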

How Simplyblock Enhances Neo4j Performance in Kubernetes

Simplyblock optimizes Neo4j in Kubernetes environments through:

  - High-performance block storage
  - Zero-downtime storage scalability
  - High availability and durability
  - Cost-effective solutions
  - Seamless Kubernetes integration
  - Enhanced data mobility
  - Advanced data management features

What Specific Performance Improvements can Neo4j Expect with Simplyblock?

Neo4j can benefit from simplyblock’s high-performance block storage, which enhances data access and processing speeds. This leads to improved query performance and faster response times. Additionally, simplyblock’s scalable storage options ensure that performance remains consistent even as data volumes grow.

How does Simplyblock Ensure Data Integrity for Neo4j in Kubernetes?

Simplyblock ensures data integrity for Neo4j by providing high availability and durability through features like automatic erasure coding, sync and async cluster replication, as well as backups. These capabilities safeguard data against loss and ensure that it remains accessible and intact, which is crucial for maintaining data integrity in Kubernetes environments.

Furthermore, simplyblock provides immediate snapshots and copy-on-write clones, enabling instant database forks (or clones) for development and staging environments straight from production.

Can Simplyblock help Reduce Storage Costs for Neo4j Deployments?

Yes, simplyblock helps reduce storage costs for Neo4j deployments by offering cost-effective storage solutions. Its cost optimization strategies allow organizations to manage their storage expenses efficiently, making it a practical choice for controlling storage costs in cloud environments.

Neo4j, coupled with simplyblock’s advanced storage solutions, offers unparalleled performance, scalability, and reliability for graph databases in cloud and Kubernetes environments. By leveraging these technologies, organizations can unlock the full potential of their complex data relationships and drive innovation across various industries.

Getting Started with Graph Databases with Jennifer Reif from Neo4j https://www.simplyblock.io/blog/jennifer-reif-from-neo4j-graph-database/ Fri, 12 Jul 2024 02:15:12 +0000 https://www.simplyblock.io/?p=1790 Introduction This interview is part of the simplyblock Cloud Frontier Podcast, available on Youtube , Spotify , iTunes/Apple Podcasts , and our show site . In this episode of the Cloud Commute podcast, host Chris Engelbert interviews Jennifer Reif, a Developer Advocate at Neo4j. Jennifer delves into the fundamentals of graph databases, explaining how they […]

Introduction

This interview is part of the simplyblock Cloud Frontier Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, and our show site.

In this episode of the Cloud Commute podcast, host Chris Engelbert interviews Jennifer Reif, a Developer Advocate at Neo4j. Jennifer delves into the fundamentals of graph databases, explaining how they differ from traditional relational databases and why they are uniquely suited for specific use cases. If you’re curious about graph databases and their practical applications, this episode is a must-listen.

Key Takeaways

Why Isn’t SQL the right Fit for Graph Databases?

SQL is designed for querying relational databases, where data is organized in tables. While powerful for certain tasks, SQL struggles with complex queries involving multiple relationships, which are common in graph databases. Graph databases like Neo4j are optimized for handling deeply interconnected data, where relationships are as crucial as the entities themselves. In these scenarios, using a graph query language like Cypher, which visually represents relationships and paths, simplifies the query process and enhances performance.

What are Graph Databases used For?

Graph databases are particularly effective in use cases involving complex relationships, such as social networks, supply chains, and fraud detection. Graph databases excel in scenarios where data is interconnected, allowing users to efficiently navigate and query these relationships. Neo4j, for example, was instrumental in analyzing the Panama Papers, where journalists used it to uncover hidden relationships between entities in a massive dataset.

What is the Difference between a Graph Database and a Data Table?

A graph database stores data as nodes (entities) and edges (relationships), allowing for a more flexible and intuitive representation of complex data structures. In contrast, a data table in a relational database organizes data into rows and columns, which can become cumbersome when dealing with intricate relationships. Graph databases eliminate the need for extensive joins and complex queries, making it easier to explore and extract value from interconnected data.

EP20: Getting Started with Graph Databases with Jennifer Reif from Neo4j

In addition to highlighting the key takeaways, it’s essential to provide deeper context and insights that enrich the listener’s understanding of the episode. By offering this added layer of information, we ensure that when you tune in, you’ll have a clearer grasp of the nuances behind the discussion. This approach enhances your engagement with the content and helps shed light on the reasoning and perspective behind the thoughtful questions posed by our host, Chris Engelbert. Ultimately, this allows for a more immersive and insightful listening experience.

Key Learnings

How do Graph Databases Work?

Graph databases store data in nodes and edges, representing entities and their relationships. This structure allows for efficient querying of complex, interconnected data. Unlike relational databases, which require multiple joins to traverse relationships, graph databases can quickly navigate through connected nodes, making them ideal for applications with deeply nested relationships.

Simplyblock Insight:

While graph databases handle relationships efficiently, simplyblock provides the necessary infrastructure to ensure that these databases perform optimally in cloud environments. By offering reliable storage and high availability, simplyblock supports the scalability and resilience needed for managing large, interconnected datasets.

How are Graph Databases Implemented?

Graph databases are implemented using graph data models, where nodes represent entities, and edges represent the connections between them. Jennifer mentions that Neo4j uses a native graph processing engine, which allows for efficient querying and storage of graph data. This native approach ensures that graph operations are optimized, reducing latency and improving performance compared to non-native graph solutions.

Simplyblock Insight:

Implementing graph databases on platforms like Kubernetes is simplified with simplyblock’s storage solutions, which ensure that data persistence and recovery are handled seamlessly. Whether using Helm charts or Operators, simplyblock’s infrastructure ensures that Neo4j and other graph databases can be deployed and managed with minimal operational overhead.

What are the Practical Applications of Graph Databases?

Graph databases are widely used in areas where understanding relationships is key, such as social networking, fraud detection, recommendation systems, and supply chain management. Jennifer highlights how these databases allow organizations to uncover hidden patterns and insights by exploring the connections between data points, which would be difficult or impossible to achieve with traditional databases.

Simplyblock Insight:

Simplyblock’s platform complements these applications by providing a robust infrastructure that ensures high availability and consistent performance, even under heavy query loads. This makes it possible to apply graph databases in mission-critical applications where downtime or performance degradation is not an option.

Additional Nugget of Information

How do Graph Databases Handle Scalability in Large, Distributed Environments?

As datasets grow and become more interconnected, the ability to scale a graph database efficiently becomes crucial. Graph databases like Neo4j are designed to handle scalability challenges by distributing data across multiple nodes while maintaining the integrity of the relationships between entities. This distributed approach allows graph databases to manage large volumes of data without sacrificing performance, making them well-suited for enterprise-level applications.
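The distribution idea can be sketched abstractly: hash each node id to a shard so lookups stay cheap, while edges whose endpoints land on different shards require a cross-shard hop. This is a generic illustration of hash partitioning, not Neo4j's actual sharding scheme.

```python
import hashlib

def shard_for(node_id, n_shards):
    """Deterministically map a node id to one of n_shards."""
    digest = hashlib.sha256(node_id.encode()).hexdigest()
    return int(digest, 16) % n_shards

# Hypothetical node ids spread across 4 shards.
nodes = ["user:1", "user:2", "order:17", "order:18"]
placement = {n: shard_for(n, 4) for n in nodes}
print(placement)

# An edge whose endpoints hash to different shards needs a cross-shard hop;
# minimizing such edges is the hard part of distributing a graph.
cross_shard = shard_for("user:1", 4) != shard_for("order:17", 4)
```

Because relationships cut across partitions, graph systems invest heavily in placement and routing so that traversals touch as few shards as possible.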

Conclusion

Jennifer Reif offers a comprehensive introduction to graph databases, highlighting their strengths and how they differ from traditional relational databases. She emphasizes that graph databases, like Neo4j, are powerful tools for managing and querying complex relationships in data, making them invaluable in various industries. As the technology landscape continues to evolve, graph databases are poised to play a crucial role in applications where understanding relationships is key.

Whether you’re new to graph databases or looking to deepen your understanding, this conversation provides valuable insights into how they work and why they are increasingly important in today’s data-driven world. Be sure to tune in to future episodes of the Cloud Commute podcast for more expert discussions.

Full Video Transcript

Chris Engelbert: Hello. Well, welcome back everyone. Welcome back to the next episode of simplyblock’s Cloud Commute Podcast. This week I have another incredible guest. I know I say that every single time. It’s just true. They’re all incredible. And you know that, right? So, hello Jennifer. Um, maybe just give us a quick introduction real quick.

Jennifer Reif: Sure. My name is Jennifer Reif. I’m a developer advocate at Neo4j, focusing on Java technologies and its ecosystem. So I cover the gamut on almost anything Java. I’ve worked at Neo4j since 2018, let me put it that way. It’s been a little bit. I show up at some conferences, write blog posts, do videos, and contribute to Neo4j’s podcast, graphstuff.fm. I also work on code demo projects and presentations, the whole nine yards. So I am happy to be here to talk to Christoph and chat a little bit about technology and Neo4j and so on.

Chris Engelbert: Awesome. I think you’re actually the first person who ever said Christoph on the stream. So now people know how I’m really called. Chris is fine. It’s so much easier for the rest of the world. But you said you’re working for Neo4j. Obviously, I know what Neo4j is, but maybe just give the others a quick introduction. What is cool about it? What is it? And you know the spiel.

Jennifer Reif: Sure. Neo4j is a graph database. And I guess to start off, just like any other database, it stores data. A lot of people will say, “oh, is it a layer on top of another type of database?” No, it actually is a storage system. You store the data, write it to disk, the whole gamut there. But it stores data differently than rows, tables, documents, and so on. It stores data as entities and then relationships between them. So you actually write the relationships to the database. That makes it really easy to read those relationships back. So anything where you have a lot of complex relationships or a lot of relationships and a lot of hops through different types of data, a graph database is going to be optimized and more performant for those types of queries.

Chris Engelbert: Right.

Jennifer Reif: So lots of things like networks or social network structures, supply chains, where you have a lot of depth and hopping around, even just fraud detection, and there’s a variety of different use cases, software dependencies, lots of other things. So I’ve seen it used for kind of hit-or-miss, just kind of random things where it’s like, “oh, I would have never thought to use a graph for that,” but it works really, really well for any type of case where you have a lot of relationships and a lot of connections in your data.

Chris Engelbert: So that’s interesting. I think the weirdest thing that I’ve built, and at the same time the most efficient thing, was actually a permission system, with inheritance, and roles and permissions and inheritance between the different roles, because you basically can make a single Cypher request and say, “give me every permission that is somehow in the hierarchy or in the inheritance graph, and remove everything that might be overridden” as, what is the term, denied. That’s it. Yeah. Blocked or denied. I like that. So that was really nice. And it was so much easier than doing a table tree kind of recursive SQL lookup on a relational database. Um, yeah, I think I still have the code somewhere.

Jennifer Reif: That would be really cool. You should publish that somewhere or, you know, highlight it somewhere.

Chris Engelbert: I can try to find it and, um, well, let’s see, maybe I hand it to you.

Jennifer Reif: Yeah. I’ve seen some genealogy or family tree type of scenarios.

Chris Engelbert: In just a couple of lines, it was like, I think, three types and four relationships or something, and you’re done.
It was brilliant. Anyway. So you said it’s a graph database, and you gave a couple of ideas what a graph database could be used for. And well, I hinted on why graph databases might be easier, right? Especially when you do topology or any kind of relation lookups. You said social networks, parent or family trees, anything like that where you have relations, especially when you look at European history, like between the different kings’ families, and there’s a lot of connections and relations between almost all families.

Jennifer Reif: Yeah.

Chris Engelbert: So if you’re trying to understand or to look into those kinds of things, graphs are super helpful and much easier. But what would you say is the biggest difference from a typical database, for example a relational database, except that, as you said, Neo4j or a graph database stores it slightly differently?

Jennifer Reif: I’m slightly biased, so I have a long list of things I love about a graph database over other things. But if I had to narrow it down to just one, the thing that I find most impactful is that you don’t need to have expert knowledge about the data model in order to pull valuable data from a graph database. So you had mentioned, you know, you have a few different types of relationships. You don’t have to know what those relationships are going into the graph database. You say, “hey, look, I know I have these entities, find all the ways they’re connected and remove the connections that are, you know, the denials or the denied or blocked or whatever credentials or access paths,” and you can filter those types of relationships out. With a relational database, sure, that’s probably possible, but the amount of work and the amount of knowledge you have to have upfront, first of the data model and second of SQL, in order to handle those very complex filterings and subqueries and so on is a lot higher. That learning curve is a lot higher. Um, so that’s the thing that I love most about graph databases: the data model itself is not required to be known upfront, and then it’s naturally very visual. So it’s just easier to navigate and easier to just explore without having this massive learning curve upfront to know the data.

Chris Engelbert: I love that. Um, specifically, as far as I remember, Neo4j was involved in a lot of analytical use cases, things like the Panama Papers, right? As far as I remember, with the Panama Papers the whole network was basically put into Neo4j and then the journalists started analyzing this massive graph and how all those companies worked together. And that is exactly what you said, right? You don’t have to understand or have to know yet how those things are connected, or is it people, is it companies that somehow work together that make the relation. You figure that out while you’re looking at the data and while you’re looking at the graph and trying to understand what that means.

Jennifer Reif: Yeah. My favorite thing is to just take a data set that looks interesting to me, dump it into Neo4j, and then just start querying and see what interesting things I find from it. And then that’s what I end up focusing on and playing around with. Where I feel like a relational database is almost the opposite. Um, you have to really kind of figure out and look at the data and the spreadsheets or whatever, you know, data format you have and figure out, “okay, what does the structure look like? How can I make the connections from one hop to the next table and so on?” And a graph is a little bit of the reverse there. Yeah.

Chris Engelbert: Well, I’m not sure, is that a general graph database thing or is that very specific to Neo4j, because you don’t necessarily need a schema?

Jennifer Reif: Yeah. I know there are some other graph databases that kind of have that optional schema, schemaless, schema-free, however you want to term it.
And Neo4j is not the only one in that category. But I feel like, just with the length of time that Neo4j has been around, you know, we kind of have a leg up on a lot of the other graph databases, those that do provide that capability. Um, it’s just a really nice feature.

Chris Engelbert: Right. Yeah. I’m asking because I think for relational databases, one of the critiques, or points that people always talked about, and where the whole NoSQL thing came from, was: you don’t want the schema. You want this kind of schemaless approach, or an optional schema where the schema can evolve over time. But with a SQL database, or at least a relational database, not necessarily SQL, but a relational database, you have to come up with the relational model upfront and define it. Um, and I think that is where a lot of the problems come from when you have an unknown dataset and a very complex dataset. If it evolves over time, it’s probably fine, but when you get something new, it’s probably much more complicated. So as a developer, I mean, I’m coming from a relational world. Um, I’m a Postgres developer, but I understand I may need a graph database like Neo4j. So how would I get started with that?

Jennifer Reif: Well, one of the best ways we have currently is our database as a service, um, called Aura, Neo4j Aura. Um, and we have free instances. So we have, you know, different tiers, of course. Uh, we have a free tier and then kind of your paid tiers above that, depending on your needs there. But the free tier is a really great place to start. Um, there’s lots of tools surrounding that free tier. So they have a data importer tool where you can load up PDFs or CSVs or some other different types of data and it will kind of help you get that data into a graph. So you don’t have to have that knowledge upfront. And then you can kind of query or play around with our visualization tool called Bloom, and it kind of is a natural language query interface. So you don’t have to know a lot of Cypher upfront. Um, even for the Cypher portion of it, there’s guides that kind of walk you through, and so we try our best to have a very low barrier to entry pathway there for people to learn.

Chris Engelbert: I think the… You mentioned Cypher. The thing that makes Cypher, from my perspective, so much better than the other graph languages is that it actually looks like ASCII art. It looks beautiful. You look at the query and, if you go a little bit deeper and use some of the more complex constructs, it’s a little bit more complicated to understand if you don’t know how it works, but with a standard graph query over multiple nodes and relationships, you look at that and it’s an arrow telling you, “oh, here’s a node, here’s the relationship, and that’s what I expect, and that is how many you can have between those.” I just love it. Whoever came up with Cypher, thank you. Thank you, for the love of God.

Jennifer Reif: It’s a super approachable query language. I feel like I had learned several years of SQL before I even knew about Cypher, um, and when I came over to the light side, if you will, at Neo4j, um, and started exploring Cypher, there were several things where it’s like, “why in the world isn’t everybody, you know, using something like this?” Because it’s very easy to read, very easy to construct, at least kind of the general starting structures, right? Um, there’s way more complex things you can do with it. And there’s still lots of things I look at and go, okay, “how do I do this pattern, you know, construction and manipulation?” Um, because patterns are very complex. Um, but yeah, just at the outset, it’s a much more approachable language, I feel like, and it has some really cool, fun things to do with it.
And I always like to give the example that it took me learning Cypher in order to understand what the SQL HAVING and GROUP BY clauses were trying to do. Um, it was just way more apparent in Cypher than in SQL.

Chris Engelbert: I agree. And I think that is where a graph database comes in in general. As I said earlier, in SQL, when you have those like multi-hop relationships, you end up doing something like this weird recursive SQL. It works, but it’s never going to be nice. It’s a recursive common table expression, with a union and a join, and I have to look it up every single time. I’ve used it so many times. I always get like 95% to where I want to be, and then it just doesn’t work the way I expected. And I have to look it up and I probably made some mistake on the join type or on the join clause. And with Neo4j, or in general with a graph database and specifically Cypher, it is so much easier to model that stuff. Even when you use a merge or something, it’s still way easier.

Jennifer Reif: And for those of you who are not familiar with Cypher or thinking that this is a Neo4j thing: um, first of all, we have openCypher, which is completely open source. We open sourced it, I believe, back in 2015. But just this year, Neo4j and several other graph database vendors all got together and came up with the ISO GQL standard, the “Geequel standard”, that was released, I think, like a month, month and a half ago now. And so there is an official graph query language standard now, and Cypher has poured a lot into that as well. Um, there’s a lot of things that have come over from Cypher, as well as some other graph query languages too. So it will be an official, like, unified standard, of course, whenever everybody can kind of get to that.

Chris Engelbert: An ISO standard.

Jennifer Reif: Yep.

Chris Engelbert: Wow. I did not expect to see that in my lifetime. That is incredible.

Jennifer Reif: It’s been several years in the making.
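Chris’s point about multi-hop queries is concrete when the two are placed side by side. Assuming a hypothetical friendships table on the SQL side and a KNOWS relationship on the graph side (both placeholder names), the recursive common table expression he mentions might look like this:

```sql
-- SQL: people reachable within 3 hops, via a recursive CTE
WITH RECURSIVE reachable(person_id, depth) AS (
    SELECT friend_id, 1 FROM friendships WHERE person_id = 42
    UNION
    SELECT f.friend_id, r.depth + 1
    FROM friendships f
    JOIN reachable r ON f.person_id = r.person_id
    WHERE r.depth < 3
)
SELECT DISTINCT person_id FROM reachable;
```

The equivalent Cypher is a single variable-length pattern:

```cypher
// Cypher: the same traversal, 1 to 3 hops
MATCH (p:Person {id: 42})-[:KNOWS*1..3]->(friend)
RETURN DISTINCT friend
```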
And Neo4j and all the other graph database vendors have been hard at work getting that all together, but yeah, it all got approved and everything just recently.

Chris Engelbert: So how does it work from a programming language perspective? Um, I know that Neo4j has a lot of drivers. Obviously it’s not a SQL interface, so you need something different than, for example, JDBC in Java or the scan interface in Go. But I think there’s drivers for almost every language I’ve ever considered.

Jennifer Reif: Yeah. We provide official drivers for like the bulk of your core languages, and then there’s community drivers that are very well supported, very well maintained by partners or communities or so on for several other languages. And then we also do have like a JDBC driver and other things too, as well as integrations to major frameworks. So like our Spring Data Neo4j integration has been around forever. Um, and several others as well. And of course, you know, we have like the big GenAI ones now, your LangChains, your LlamaIndex, and so on too. So, basically anything you want to integrate with or around Neo4j has some kind of connector, integration, or driver or something to do with it.

Chris Engelbert: All right. Cool. You already mentioned Neo4j Aura. And as far as I know, we’re a cloud podcast, but we’re also a Kubernetes podcast. As far as I know, Neo4j Aura internally uses Kubernetes, right?

Jennifer Reif: Yes. As far as I know. Yep. Kubernetes is the thing.

Chris Engelbert: Okay. So we’re probably on the same level of understanding.

Jennifer Reif: Yeah, there may be some other things they do as well, but yes, we run Kubernetes and we have a very good integration and partnership there.

Chris Engelbert: Okay. So that means I can also use Neo4j on Kubernetes outside of Aura.

Jennifer Reif: Yeah.
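For the official drivers Jennifer mentions, a minimal sketch with the Python driver might look like the following. The URI, credentials, and the Person/KNOWS schema are placeholder assumptions; `driver.execute_query` is the convenience helper in the 5.x Python driver:

```python
def friends_query(max_hops: int) -> str:
    """Build a variable-length Cypher pattern (pure string helper)."""
    return (
        f"MATCH (p:Person {{name: $name}})-[:KNOWS*1..{max_hops}]->(friend) "
        "RETURN DISTINCT friend.name AS name"
    )


def fetch_friends(uri: str, user: str, password: str, name: str) -> list:
    # Imported here so the module loads even without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        # execute_query returns (records, summary, keys) in driver 5.x.
        records, _, _ = driver.execute_query(
            friends_query(2), name=name, database_="neo4j"
        )
        return [record["name"] for record in records]


# Usage (placeholder connection details):
# fetch_friends("neo4j://localhost:7687", "neo4j", "secret", "Alice")
```

The same shape applies across the official drivers: open a driver against a URI, run a parameterized Cypher string, consume records.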
The thing that, at least, I didn’t realize until I started digging in just a little bit, is that running a database on Kubernetes is not as simple as “spin up X database.” Um, there’s a lot of, you know, because…

Chris Engelbert: If you don’t care for persistence, yes.

Jennifer Reif: Right. Kubernetes is very customizable, because typically you’re dealing with enterprise systems and you need to mess with or customize individual components or pieces. So running Neo4j requires about four or five different components that technically run, or would run, separately on Kubernetes. And so, if you’ve ever heard of Helm and Helm charts, that’s the easiest way to basically just outline, you know, these are the services, the pieces that I need in order to run Neo4j, spin all these up together and manage them this way and replicate them this way. Um, and so it’s actually pretty easy to get up and running with the Neo4j-provided, managed, supported Helm chart.

Chris Engelbert: Interesting. The reason I’m saying interesting is because everyone these days talks about Kubernetes Operators and “we have the Operator to set it up for you,” and you say “no, use the Helm chart.” It’s so refreshing. I haven’t heard that in a while. I think the reason is that Operators give you a lot more like operational… Well, you can react at runtime to certain situations, where the Helm chart is basically just the installation. I think that is the reason why a lot of people use or move towards the Operator. Um, but that’s just my guess. Um, maybe it’s just like cool to have an Operator these days.

Jennifer Reif: The latest thing.

Chris Engelbert: Yeah. So let me see. We talked about developers, we talked about the programming languages, we know you can run it on Kubernetes. Um, make sure you have a persistent volume if you run a database. We talked about that.

Jennifer Reif: Yep.
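The Helm workflow Jennifer describes boils down to a few commands. This is a sketch, not a definitive install: the release name, namespace, and password are placeholders, and the repository URL and chart values should be checked against the current Neo4j Helm documentation:

```shell
# Add the Neo4j Helm repository and install the chart
helm repo add neo4j https://helm.neo4j.com/neo4j
helm repo update
helm install my-neo4j neo4j/neo4j \
  --namespace neo4j --create-namespace \
  --set neo4j.password=change-me
```

The chart bundles the components Jennifer mentions, such as the database pods, services, and persistent volume claims, so they are spun up and managed together rather than assembled by hand.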
Chris Engelbert: And if you need a persistent volume provider, I heard that simplyblock might have something for you. Um, but there’s a lot of others as well. Actually, just yesterday, or on the weekend, I started a small website where you can look up all the different CSI providers, basically the volume providers that can be plugged into Kubernetes, everything that I know and found, and I split them by features and you can search. So, if you’re in the search for a CSI provider, storageclass.info is probably what you want to look into. If you find something that is wrong, feel free to send a pull request. It’s GitHub Pages. Just as a side note. Okay, because we’re pretty much out of time: what do you think is the next big thing in cloud, in graph databases, in databases in general, in AI? Feel free to name two or three things as well.

Jennifer Reif: Yeah. Well, I think, you know, AI is kind of the big thing right now, but I think we’ll start seeing that not necessarily taper off, but we’ll start seeing that integrate into kind of just our standard day to day, rather than that being the focus for everything. Um, I think we’ll kind of see us not go back to, but kind of modify what was our workflow to integrate LLMs and GenAI stuff into our day to day things. Um, and so it will become just a piece of the deployment puzzle or, you know, building puzzle or application puzzle, whatever it is. And so I think that will kind of get standardized a little bit better. We’ll kind of figure out where the super useful applications are and the highly critical and impactful workflows where we need to use it. Um, and so I think databases are going to be a huge component of that. Whether it’s, you know, graph or something else entirely, we’re seeing this shift from “okay, use the LLM for everything” to realizing that LLMs have some limitations, right?
And some weaknesses. But I think those are weaknesses and limitations that databases can really help mitigate. They’re not going to completely solve them, but they can help mitigate that. Because we have lots of good data in our data structures already. Um, and so pairing the two together, I think, this is where you see that retrieval-augmented generation, or RAG, concept: pairing the database with an LLM is going to continue to improve that story.

Chris Engelbert: True. You said how to use it best or where to use it. I mean, right now there’s this meme going around, like, “I want my LLM to do my dishes and, I don’t know, whatever.” Well, it went differently: “I don’t want my AI to do art and whatever. I want to do it, but the dishes, so I can do the art.”

Jennifer Reif: Yeah. I want to mitigate, or rather delegate, the low impact things to the LLM.

Chris Engelbert: Exactly. I can’t remember exactly what it was right now. Um, but if I find it, I’ll put it in the show notes. Um, I read that and I was like, yes, that is exactly it. Why do we give the complicated tasks, or the stuff that we love to do, to an AI instead of trying to offload the stuff we really don’t like? A good example of that would probably be writing the initial documentation for stuff. Um, looking at the source code, at the comments, and coming up with an initial draft for the documentation of that, whatever. Um, I mean, most of us are engineers, and engineers love one thing, which is writing code, but they hate the other thing, well, love-hate the other thing, which is documentation. So maybe that is something where we should look into and figure out if maybe it helps us that way. All right. Um, cool. Yeah. Um, that was a pleasure. Thank you very much for being here.

Jennifer Reif: Thank you so much for having me.

Chris Engelbert: My pleasure. Yes.
And for the audience, Jennifer prepared a demo which unfortunately doesn’t work for an audio podcast, but we’ll put it in the show notes. It will show you exactly how you can set up Neo4j on Kubernetes yourself. Um, and we may actually do a recording, um, so I can put that in as well. We’ll see. Maybe not yet. Maybe it’s somewhere in the near future. Like a plan. I know, I know. Sometimes I have plans, not a lot of times, but sometimes.

Jennifer Reif: Whether they actually get implemented, you know, who knows.

Chris Engelbert: Exactly. You can always have good ideas, and there’s plenty of those; not all of them are getting implemented. All right. Yeah. As I said, thank you very much. Uh, it was a pleasure. Uh, it was good to talk to you after two years, three years again. Yeah. Time just flies.

Jennifer Reif: Um, hopefully we’ll connect in person at a conference sometime in the future again.

Chris Engelbert: I hope so. I hope so. I mean, there are a lot of database conferences, a lot of Java conferences, so there’s a good chance, I guess. All right. And for the audience, thank you very much for being here again. Uh, see you all next week, with the next episode and the next guest. Thank you very much for being here. Thanks.

The post Getting Started with Graph Databases with Jennifer Reif from Neo4j appeared first on simplyblock.
