Socket.IO 1.0: Socket.IO has hit version 1.0 – the Node.js and browser library which started life as an implementation of the WebSockets interface and has gone on to “become the EventEmitter of the web”. The 1.0 release and changes are broken down in a blog posting, the first on a newly redesigned, and much more useful, Socket.IO website. In brief, modularisation, tighter code, binary support (so you can emit blobs and buffers), automated testing, better scalability using redis, more integration (including PHP support), better debugging support (and silence by default), sleeker APIs and CDN delivery. And the future plans include handling Node.js streams, Socket.IO support in Web Inspector and Firefox Dev Tools and more language and framework support. A splendid tool to have in your arsenal.
Git 2.0: The distributed version control system which made distributed version control systems cool before even version control could be cool, Git, has reached version 2.0. In the announcement of the new release, there’s a long list of all the changes and notes on backward compatibility. The 2.0 release has been anticipated by the developers for a while so a lot of ground work had already been done in previous 1.x versions making the 2.0 release look more like a minor release than a major version bump but there’s still plenty of changes and a foundation prepared for future changes. On that subject, there’s a promise of a shorter release cycle for the next release as delays have meant a number of features ‘cooking’ for longer in the ‘next’ branch.
OrientDB 1.7: Version 1.7 of the Document/Graph/Sql/NoSQL database OrientDB is available. The announcement for 1.7 notes better perforamnce, new clustering options, support for SSL and sharding, simplified configuration, new SQL commands including parallel queries, plugins for Lucene-based full text searching and more. There’s an Apache 2 licensed community edition of the database and commercially sold and supported professional and enterprise editions.
ElasticSearch 1.0 springs out: The search-oriented NoSQL database, built upon Lucene, ElasticSearch has hit version 1.0. It’s a big release with a lot of changes and a lot of new features – an API for selective snapshot/restore, federated search, aggregation, distributed percolation and software “circuit breakers” to stop some more dangerous actions from overwhelming the system. An interesting post from Found.no on ElasticSearch sums up the pros and cons (like no authentication or authorisation) places ElasticSearch in the domain of “secondary store” to be used alongside a primary database.
TokuMX 1.4: Tokutek’s “MongoDB-with-Toku-engine”, TokuMX, has hit version 1.4 and is addressing the performance of sharding and replication. Toku’s engine is reputed to be very good for particular use cases and it’s interesting to see alternative storage engines under the MongoDB infrastructure.
Plan 9 goes GPL2: It’s been a long time under a open source (but unblessed-by-the-FSF) licence, but the venerable and inspiring Plan 9 has not been relicensed (mostly) under the GPLv2. In an announcement. It can be downloaded from the Akaros project (or cloned from the GitHub repo) which seems to be breeding Inferno/Plan 9 with their own many-core large-smp research.
Python 3.4 on final approach: Out of beta and hitting release candidate a few days ago, Python 3.4 is now imminent. It’s expected to land just over a month from now on March 16. Check back to our coverage of the last beta for more details of what’s coming.
Facebook, in their now traditional goal of taking on big data problems, solving them and then open sourcing the result, have open-sourced Presto, a distributed SQL query engine “optimized for ad-hoc analysis at interactive speed”. This type of app is designed for the folks who need to work out what people who like chips and cheese and rock but dont like bagels or opera also have, statistically, in common. Its a simple enough question, but when you get up to Facebook scale, its a hard question to answer. This is the land of Hadoop and Hadoop has its own SQL-like query engine, Hive.
But unlike Hive which converts queries into MapReduce tasks saving intermediate results to disk, Presto has a query and execution engine which runs in memory and is pipelined through the network. Presto is implemented in Java for easy integration with other parts of Facebook that are also built in Java and compiles parts of queries down to bytecode, letting the JVM JIT compile to machine code to get the best out of the Java environment. Although it doesn’t need Hive, Presto does need a datasource for its queries and it includes a plugin for Hive, though it only uses the Hive metastore service, presumably to obtain structural information, and then accesses the data over HDFS.
The Facebook announcement says “Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook” and has been in use internally since Spring of this year with multiple deployments and one cluster scaled to a thousand nodes. A thousand users actively use it with 30000 queries and processing a petabyte a day. Thats a good work out for any big data offering.
There’s plenty missing from Presto; various joins and aggregations are restricted and there’s no way to write results back into tables – they go straight to the client. Those issues, plus improved performance, query accelerators, hot cached data subsets and a high performance HBase connector are all on the roadmap for Presto.
Presto is licensed under the Apache License 2.0 but does not appear to be heading to the foundation with active development taking place around Facebook’s GitHub repository.
Catching up on Codescaling with some of the less mentioned things worth noting…
- FreeBSD 10.0’s latest beta: It’s into the home/RC straight for FreeBSD 10 with the release of the third and hopefully last beta of the development cycle. The original schedule would have seen RC2 available around now, but with a focus on a quality release, there’s been a bit of slippage. Check out this FreeBSD News item from September for a feel of what’s going in. I’m looking forward to the switch to LLVM/Clang and seeing how the tickless kernel works out.
- SQL injection attacks by Google?: Sucuri have come across an odd thing, Google doing SQL Injection attacks. Basically, Google’s bots crawl a site with links which would carry out an SQLi attack if followed… and then follow them like the bots they are which carries out the attack. Google may want to add at least some filtering to their bots in future, but its something to remind any application that ingests URLs from the web to follow them that URLs are not necessarily passive.
- Rust reworks stack plan: For those interested in the implementation of languages, the Rust developers have decided to drop segmented stacks. Segmented stacks were stacks that were allocated small and expanded as needed. This would have allowed threads to have a much smaller footprint, but it didn’t quite work out that way. Followups on the thread discuss the cost of memory, both having it and accessing it, and alternative strategies.
- InfluxDB: Databases for time series data are in and the latest open source addition to the game is InfluxDB which prides itself in no external dependencies. The Go-based MIT-licensed code has a JSONic HTTP API, an SQLish query language and a playground server to get running with. Its early days for InfluxDB, but its off to a good start.
- Mozilla’s Circus Renewed: Mozilla’s Services project has announced a new version of its process/socket manager called Circus. Built using Python and ZeroMQ and recently redeveloped to be Python 3 compatible and fully asynchronous, the software lets an administrator manage processes and sockets on servers through a command line, Python API or web console. You can find the code on mozilla-services github.
At the opening of the conference day at Cassandra Summit Europe 2013, Johnathan Ellis, Datastax CTO, made a point of positioning Apache Cassandra as an enterprise scalable database and one that scales in a linear fashion to massive scales. Datastax is the leading developer of, and commercial vendor of Apache Cassandra in the form of DataStax enterprise.
MongoDB was very much in the company’s sights as it showed benchmarks with Cassandra running 20 times faster than MongoDB – the reason was simple though the dataset for the benchmark was bigger than the available memory on the nodes. While MongoDB performs well with the dataset in memory, Ellis says most customers want their hot-data in memory and their cold-data on disk and thats where Cassandra has the advantage with a balanced approach to memory and disk.
Away from the benchmarking, Ellis described this years focus for Cassandra as having been on was of use. That meant enhanced CQL, the Cassandra Query Language, a new CQL protocol for language drivers, more emphasis on features like tracing, lightweight transactions for the 1% of cases that need it and cursors to reduce query complexity.
Internal enhancements were equally important though. For example, 2.0 took back control of a lot of memory management in Cassandra, from the JVM and over to a more traditionally manually handled memory manager tuned for Cassandra’s needs. This has allowed lots of data structures to reside more efficiently in memory improving performance.
Next week will see the release of Cassandra 2.0.2 which will add what the DataStax people call “rapid read protection”. This means that when a query goes out to a cluster, rather than waiting until a node times out to return an error, the system will look for return times that are out of the ordinary (in the 99th percentile) and return an error on them early. This should make the ability to respond to nodes over-paused in GC or suffering some other performance hit.
Ellis also talked about Cassandra 2.1 which is pencilled in for January 2014. This will see nesting and collection indexing added to the database. The filtering inside the Cassandra software should also be improved with a new combination of pessimistic allocation and smarter estimates of required space using HyperLogLog to work out what data overlaps between sets. Ellis described his slides in this though as “hand wavy” as there was no code written yet and asked “Don’t send me hate mail…” if it didn’t make 2.1.
DataStax’s own certified DataStax Enterprise is set to move to a Cassandra 2.0 base by the end of the year.