Insights from curating NEAR data — Chuxin

On a sunny February morning in Vienna, Austria, I hopped on a Discord voice call with Chuxin, 10 time zones away, to talk about data curation, what makes NEAR special, and her journey to becoming a pod leader and data curator for MetricsDAO.

Q: Hi Chuxin, thanks for taking the time to talk with me. Let me ask you a couple of questions about the ongoing curation of NEAR protocol blockchain data. What makes NEAR different from, say, Ethereum, or from Harmony, where you also led the curation efforts for MetricsDAO?

Chuxin: NEAR is another Layer 1 blockchain protocol that has been growing fast, so that was exciting. It uses a Proof-of-Stake consensus mechanism and a sharding design to improve scalability. In terms of data curation, this design introduces data such as chunks and actions, which we didn’t see in that form in Harmony or Ethereum.

Q: Do you now have to parse multiple shards? Or is there a canonical chain that’s a single source of truth in the end?

C: In the end, the fundamental structure is a single chain of blocks. However, because of the sharding design, consensus nodes only need to validate parts of the history, whereas on non-sharded chains all consensus nodes need to agree on the state of the entire chain. So for NEAR, all transactions are aggregated and split into chunks to be validated. That’s different from what we saw with Harmony.
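
To illustrate the structure described in this answer, here is a minimal Python sketch of a block that holds one chunk per shard, with the chunks flattened back into a single ordered list of transactions. The class names, fields and sample values are assumptions made for illustration; they are not the actual NEAR node schema or the MetricsDAO table design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """Hypothetical chunk: the slice of a block handled by one shard."""
    shard_id: int
    transactions: List[str] = field(default_factory=list)  # transaction hashes


@dataclass
class Block:
    """Hypothetical block: a single entry in the chain, made up of chunks."""
    height: int
    chunks: List[Chunk] = field(default_factory=list)


def transactions_in_block(block: Block) -> List[str]:
    """Flatten all chunks of a block into one list, ordered by shard id."""
    ordered_chunks = sorted(block.chunks, key=lambda chunk: chunk.shard_id)
    return [tx for chunk in ordered_chunks for tx in chunk.transactions]


# Made-up sample data: two shards contributing to the same block.
block = Block(
    height=1,
    chunks=[Chunk(shard_id=1, transactions=["tx_c"]),
            Chunk(shard_id=0, transactions=["tx_a", "tx_b"])],
)
print(transactions_in_block(block))
```

The point of the sketch is only that a curator still ends up with one list of transactions per block, even though the work was split across shards.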

Q: How many people are working on data curation for NEAR at the moment?

C: We have four to five active contributors at this stage, and we’re getting started with the main effort now. Before that, we were doing research to reach a shared understanding of what makes up the NEAR blockchain and how to design the table schema so that it is useful for end users. Currently we have tables for NEAR blocks and actions, plus a staging table for NEAR transactions. You can see where we are here in our GitHub repo:

Q: As far as I know, Ethereum smart contracts can also emit events. How is that different in NEAR?

C: Yeah, that’s kind of similar, but a transaction in NEAR is a list of actions to be performed on the receiver side, plus some additional information such as the block hash and the signer and receiver IDs.
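
As a rough picture of that shape, the following Python sketch models a transaction as an envelope (block hash, signer, receiver) around an ordered list of actions. The field names, account names and action payloads are assumptions chosen for illustration and may not match the raw NEAR JSON exactly.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Action:
    """Hypothetical action: one step to perform on the receiver side."""
    kind: str                 # e.g. "Transfer" or "FunctionCall"
    args: Dict[str, Any]      # action-specific payload


@dataclass
class Transaction:
    """Hypothetical transaction: metadata plus an ordered list of actions."""
    tx_hash: str
    block_hash: str
    signer_id: str
    receiver_id: str
    actions: List[Action]


# Made-up sample: one transaction carrying two actions.
tx = Transaction(
    tx_hash="hypothetical_tx_hash",
    block_hash="hypothetical_block_hash",
    signer_id="alice.near",
    receiver_id="bob.near",
    actions=[
        Action(kind="FunctionCall", args={"method_name": "deposit"}),
        Action(kind="Transfer", args={"deposit": "1000"}),
    ],
)

action_kinds = [action.kind for action in tx.actions]
print(action_kinds)
```

Because one transaction can carry several actions, the relationship between transactions and actions is one-to-many.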

Q: Which database are you using?

C: We pull data from the NEAR blockchain using chainwalker, which ingests all the data in raw JSON format. As curators, we model the data into blocks, transactions and actions and put them into Snowflake tables.

Q: So this brings us to the heart of the process. What does data curation actually entail here?

C: First of all, we decide on the core tables we need to curate into, which are usually quite similar to those for other chains. But in the raw data there are going to be duplicates, which we need to clean up.

Then we expand and decode the JSON keys into columns, or into separate tables if that is merited. That was the case for actions in NEAR.
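
The two steps described here, deduplicating raw records and expanding JSON keys into columns or a separate table, can be sketched in plain Python. This is only an illustration under stated assumptions: the curation itself is done in Snowflake with DBT, and the record layout, key names and the `curate` helper below are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def curate(raw_records: List[str]) -> Tuple[List[Row], List[Row]]:
    """Deduplicate raw JSON records, then split them into two row sets:
    one row per transaction and one row per action."""
    seen_hashes = set()
    tx_rows: List[Row] = []
    action_rows: List[Row] = []
    for raw in raw_records:
        record = json.loads(raw)
        tx_hash = record["tx_hash"]
        if tx_hash in seen_hashes:  # drop duplicates from the raw ingest
            continue
        seen_hashes.add(tx_hash)
        actions = record.get("actions", [])
        # Expand top-level JSON keys into columns of the transactions table.
        tx_rows.append({
            "tx_hash": tx_hash,
            "signer_id": record["signer_id"],
            "receiver_id": record["receiver_id"],
            "action_count": len(actions),
        })
        # Nested actions get their own table, keyed back to the transaction.
        for index, action in enumerate(actions):
            action_rows.append({
                "tx_hash": tx_hash,
                "action_index": index,
                "action_kind": action["kind"],
                "action_args": json.dumps(action.get("args", {}), sort_keys=True),
            })
    return tx_rows, action_rows


# Made-up sample input: two distinct records, the first one ingested twice.
records = [
    {"tx_hash": "tx_1", "signer_id": "alice.near", "receiver_id": "bob.near",
     "actions": [{"kind": "Transfer", "args": {"deposit": "1000"}}]},
    {"tx_hash": "tx_2", "signer_id": "carol.near", "receiver_id": "app.near",
     "actions": [{"kind": "FunctionCall", "args": {"method_name": "deposit"}},
                 {"kind": "Transfer", "args": {"deposit": "5"}}]},
]
raw_records = [json.dumps(record) for record in records + records[:1]]

tx_rows, action_rows = curate(raw_records)
print(len(tx_rows), len(action_rows))
```

Keeping `tx_hash` and `action_index` on every action row is what lets the separate actions table be joined back to its transaction.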

Q: A bit about yourself, do you have a blockchain background? Or are you a data scientist by profession?

C: I’m from the analytics space and work for a Web2 tech company. I have been doing data analysis and data science for the last three and a half years. I got started in the crypto space in the middle of 2021 through OurNetwork Learn, a program by Andrew Hong.

I became involved in MetricsDAO via Drake and was actually one of the first members there. I got to understand blockchain data better and am familiar with Ethereum now. There are so many different chains, and I was really interested in going deeper.

When the Harmony curation project started, I raised my hand and thought to myself: I have experience with Snowflake, DBT and these other tools, but no Harmony experience. MetricsDAO allowed me to really get my hands dirty with multichain data.

This project progressed very well, with five to six people contributing actively. The bounties are almost up, and we’re happy with the output. We will also be able to use the curated data going forward. And from there on I naturally progressed into NEAR curation.

Q: Was it easy to find enough contributors on MetricsDAO?

C: Some people are interested in contributing to the curation itself, others are tangentially interested but have different aims, which is fine too.

We would love to bring in more curators to speed the process up and improve data quality. I’m going to give a workshop on data curation on March 11th, 2022, for anyone interested in getting into blockchain data curation.

And I hope we can make people more comfortable with the tech stack we use and onboard more curators.

Q: Tell me about the tech stack you use, please.

C: We use Snowflake as the database, DBT (data build tool) for orchestration, GitHub for collaboration, and Docker for the environment. And that’s pretty much it.

Q: Any other links you would like to share?

C: There’s a GitBook section for data curators. It’s quite short but should be informative:
Data Curator Onboarding — MetricsDAO (gitbook.io)

Then anyone using Flipside can build queries on the curated Harmony data, which is exciting.

Q: Thanks for the fascinating insights. I highly recommend that anyone interested in data curation join the workshop and, of course, join MetricsDAO on Discord. Thanks Chuxin, have a great evening.

MetricsDAO provides a 6-step process for Organized, On-Demand Analytics Delivery, or as we like to call it, OODAD.

Follow us on Twitter
Join our Discord
