Scaling DNA data storage with nanoscale electrode wells

Description

Figure: Writing data in DNA with electrochemically mediated phosphoramidite synthesis. a) The DNA data storage pipeline. b) Phosphoramidite synthesis steps and reactions: the four-step cycle used in phosphoramidite synthesis. c) Half-reactions occurring at the anode and cathode during the electrochemical deblock step.

Coupling, capping, and oxidation synthesis cycles were performed using the default fluidics protocols for column synthesis on 50 nmol scale. Deblocking cycles were performed using a modified protocol that incorporated a triggering pulse to synchronize exposure to the electrochemical deblock solution with the application of voltage (Figure S2). Fluids were supplied to the nanoelectrode array surface via a two-piece, stainless steel flow cell. PEEK fittings allowed access to an approximately 100 μL cavity bounded by an EPDM gasket. Electronic control of the microelectrode array was achieved via PCIe connectors exposed outside of the flow cell (Figure S3).

S2 DNA data storage throughput calculations
To achieve write speeds of kB/s of data in DNA, an array containing >8.8 million spots for DNA synthesis is required, assuming each unique oligonucleotide encodes 10 bytes of data [1] and is written over 24 hours. Under the same assumptions, write speeds of MB/s would require >8.8 billion unique oligos written every 24 hours. To minimize the array footprint as the number of features grows to match target outputs, the pitch between features becomes critical. An array with a 2 μm pitch can fit 25 million features per cm², write >2.8 kB/s/cm², and would require roughly 360 cm² to write approximately 1 MB/s. Feature densities of the commercial DNA synthesis platforms were estimated from numbers provided directly by each company on its website or in online presentations (31)(32)(33).
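The arithmetic behind these figures can be reproduced with a short sketch. It assumes a square grid of features, one unique 10-byte oligo per feature per 24-hour run, and no overhead for primers or error correction; the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # one 24-hour synthesis run
BYTES_PER_OLIGO = 10             # payload per unique oligo, as assumed in the text


def features_per_cm2(pitch_um: float) -> float:
    """Features on a square grid with the given pitch (1 cm = 1e4 um)."""
    return (1e4 / pitch_um) ** 2


def write_rate_bytes_per_s(n_features: float) -> float:
    """Bytes per second when every feature yields one oligo per run."""
    return n_features * BYTES_PER_OLIGO / SECONDS_PER_DAY


density = features_per_cm2(2.0)                   # 2.5e7 features per cm^2
rate_per_cm2 = write_rate_bytes_per_s(density)    # about 2.9e3 B/s per cm^2
rate_small_array = write_rate_bytes_per_s(8.8e6)  # about 1.0e3 B/s
area_for_mb_cm2 = 8.8e9 / density                 # about 352 cm^2
```

Dividing the 8.8 billion oligos needed per day by the 25 million features per cm² gives about 352 cm², in line with the "roughly 360 cm²" quoted above.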

S3 Acid generation and diffusion modeling
We performed a basic finite element analysis of a 650 nm diameter electrode at a 2 μm pitch to test for sufficient acid confinement. We modeled a single electrode tile area with zero-flux boundary conditions along the non-conducting areas as well as the tessellation boundaries. Electrode surfaces were set to values proportional to their relative area to account for the balanced generation rates on the anodes and cathodes. Since diffusion coefficients for the acid/base species used in this work were not available in the literature, we chose a value close to the middle of the range for other small molecules: 9 × 10⁻⁵ cm²/s.
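For a sense of scale only, the chosen diffusion coefficient can be turned into a characteristic time to diffuse across one electrode pitch. This sketch uses the one-dimensional free-diffusion relation ⟨x²⟩ = 2Dt, which is an assumption made here for illustration. It is not the finite element model described above and ignores geometry, boundaries, and acid/base neutralization.

```python
import math

D_CM2_PER_S = 9e-5   # diffusion coefficient chosen in the text
PITCH_CM = 2e-4      # 2 um electrode pitch expressed in cm


def diffusion_time_s(distance_cm: float, d: float = D_CM2_PER_S) -> float:
    """Time for the 1D RMS displacement to reach distance_cm (x^2 = 2 D t)."""
    return distance_cm ** 2 / (2.0 * d)


def rms_displacement_cm(time_s: float, d: float = D_CM2_PER_S) -> float:
    """1D RMS displacement after time_s."""
    return math.sqrt(2.0 * d * time_s)


t_pitch = diffusion_time_s(PITCH_CM)   # on the order of 0.2 ms
```

Under these simplifications an unquenched species would cross one pitch in roughly 0.2 ms, which is why confinement has to be checked explicitly rather than assumed.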

S4 Chip Fabrication and Electrode Activation
A silicon wafer containing an array of 650 nm diameter electrodes at a 2 μm pitch was manufactured using standard 130 nm nanolithography process technology. The wafer was diced and mounted on an FR4 PCB. The A and B faces of the PCB were mirrored and designed for a high-density card interface. The die pins were wire bonded to their corresponding traces on the PCB and then protected by an epoxy encapsulant to create the slide assembly. A flow cell assembly was designed in two pieces. The bottom piece was fitted with two locating pins to sandwich the chip assembly in place with the top piece, which contained a recessed gasket. The top piece contained two I/O ports to connect inline with an Expedite 8900 oligonucleotide synthesizer. To address the electrodes on the array, a card edge connector was attached to the slide assembly and driven by a National Instruments PXIe-4141 Source Measure Unit.

S5 Imaging
SEM images were taken with an FEI Quanta 600 SEM coupled with an EDS detector. Fluorescent images were taken on either an Olympus BX53MTRF-S equipped with a DP74 color CMOS camera or a Leica DM8 Lightning confocal system. Image processing was performed using ImageJ. Image analysis of fluorescence on 650 nm electrodes showed a toroidal profile, which indicates that oligonucleotide synthesis occurs on the walls of the SiO2 well, leaving the electrode surface free of fluorescence.

Figure: Fluorescence image of a 650 nm electrode array after synthesis, acquired on an Olympus BX53MTRF-S, demonstrating fluorescence uniformity across a large section of the electrode array. a) Pattern generated with 1 anode activated and imaged using a 50x objective. b) Pattern generated with all 4 anodes activated and imaged using a 100x objective. Anode locations that are dark indicate an electrode fabrication failure.

S6 Array Synthesis Experiments
Master Sequence Generation: The master sequence and electrode activation sequencing are constructed by first considering only the payload sections (i.e., the sequences without the common primer pair on the 5' and 3' ends). A periodic master sequence of sufficient length is then generated by repeating the symbols CAGT (or a shuffled version of the same characters). The activation sequence for each electrode is determined by the next base to be generated in that electrode's sequence, read in the 3'-to-5' direction. At cycle N, if the Nth character in the master sequence is the same as the 3'-most unincorporated base of the sequence associated with a particular electrode, then that electrode is activated, causing a base to be incorporated. Next, the 20-base primer sequences are added to the 3' and 5' ends of the master sequence, along with 20 electrode activations at each end, to ensure the primer sequences are incorporated. Finally, any cycles with 0 activations are removed.
This method ensures that no more than 4N + 20×2 cycles are required for any set of sequences with payload length N and primer length 20. While this algorithm is not optimal, we found it to be sufficient, and as the number of unique sequences increases, the number of cycles needed approaches the upper limit above. We conjecture that finding the optimal master sequence is equivalent to finding the global multiple sequence alignment of all sequences to be synthesized with infinite mismatch penalty and gap penalty >0.
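The scheduling procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes every sequence is supplied in synthesis order (3' to 5'), that all electrodes share the same primer pair, and that the period contains all four bases. The names `build_schedule` and `replay` and the short example primers are hypothetical.

```python
def build_schedule(payloads, primer_3p, primer_5p, period="CAGT"):
    """Return (base, active_electrodes) cycles in synthesis order.

    All sequences are given in synthesis order (3' to 5').
    """
    everyone = list(range(len(payloads)))
    longest = max(len(p) for p in payloads)
    master = period * longest        # 4N cycles always suffice for payloads
    next_idx = [0] * len(payloads)   # next unincorporated base per electrode

    payload_cycles = []
    for base in master:
        active = [e for e in everyone
                  if next_idx[e] < len(payloads[e])
                  and payloads[e][next_idx[e]] == base]
        for e in active:
            next_idx[e] += 1
        if active:                   # drop cycles with 0 activations
            payload_cycles.append((base, active))

    head = [(b, everyone) for b in primer_3p]   # 3' primer is made first
    tail = [(b, everyone) for b in primer_5p]   # 5' primer is made last
    return head + payload_cycles + tail


def replay(schedule, n_electrodes):
    """Rebuild what each electrode synthesizes, to check a schedule."""
    grown = [""] * n_electrodes
    for base, active in schedule:
        for e in active:
            grown[e] += base
    return grown


payloads = ["ACGT", "TTAA", "CAGT"]        # toy payloads, N = 4
schedule = build_schedule(payloads, "GA", "TC")
products = replay(schedule, len(payloads))
```

Because each period of four cycles offers every base once, each electrode advances by at least one base per period, which is where the 4N term in the bound comes from. Replaying the schedule is a cheap way to confirm that every electrode ends up with primer, payload, and primer in order.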
Error analysis: A high-level description of the end-to-end math pipeline used in Fig. 3

[Step 1]
Perform an alignment on the sequencing reads. This generates an alignment file in SAM format containing a CIGAR string for each aligned read. Here's an example CIGAR string: "25S18M1D26M3I27M79S".
This CIGAR string can be read like this:
+ 25S = SKIP (a soft clip in SAM terminology): skip over the first 25 bases in the read
+ 18M = MATCH: the next 18 bases in the read match the bases in the reference. n.b. A "Match" can be a sequence match or a sequence mismatch, aka a substitution.
+ 1D = DELETION: the next 1 base in the reference is missing from the read
+ 26M = MATCH: the next 26 bases in the read match bases in the reference
+ 3I = INSERTION: the next 3 bases in the read are not present in the reference
+ 27M = MATCH: the next 27 bases in the read match bases in the reference
+ 79S = SKIP: skip over the remaining 79 bases in the read
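As an illustration, the example string can be split into (length, operation) pairs with a few lines of Python. This is a sketch rather than the pipeline's actual parser; the helper names are made up here. Which operations consume read bases and which consume reference bases follows the SAM format definition.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")   # operations that advance along the read
CONSUMES_REF = set("MDN=X")    # operations that advance along the reference


def parse_cigar(cigar: str):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)]
    # Reject strings containing anything the token pattern did not cover.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def consumed_lengths(ops):
    """Return (read bases, reference bases) spanned by the operations."""
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len


example_ops = parse_cigar("25S18M1D26M3I27M79S")
read_len, ref_len = consumed_lengths(example_ops)
```

For the example, the operations span 178 read bases (25 + 18 + 26 + 3 + 27 + 79) and 72 reference bases (18 + 1 + 26 + 27).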

[Step 2]
Read in the alignment file containing the CIGAR strings and accumulate the CIGAR strings for each reference into three arrays.
For each reference, create:
+ One array capturing the total INS errors for each position in the reference oligo
+ One array capturing the total DEL errors for each position in the reference oligo
+ One array capturing the total SUB errors for each position in the reference oligo
For each reference, this step also accumulates the total count of reads that aligned with the reference, i.e., the total count of CIGAR strings for the reference.

Correlation of errors in multisequence synthesis: There is a statistically significant association between error rates and base locations, excluding the primer regions, χ²(141, N = 25832266) = 444046, p < 2.2e-16; however, the effect size is minimal, Cramér's V = 0.0757. The minimal effect size indicates that the error rates were likely the result of independent random processes, given the tendency of the χ² test to overestimate significance at high sample numbers. No statistically significant correlation was found between cumulative error rates of sequences at positions in which electrodes were activated during the same synthesis cycle, with r = 0.093 and p = 0.125.
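The per-position accumulation described in Step 2 can be sketched as below. This is an illustrative reconstruction, not the pipeline used for Fig. 3. The input layout (reference name, 0-based start, CIGAR string, read sequence) is hypothetical, substitutions are found by comparing read and reference bases inside M blocks, and each inserted base is charged to the reference position that follows it. These conventions are assumptions made for the example.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def accumulate_errors(alignments, references):
    """Tally INS/DEL/SUB counts per reference position, plus read counts.

    alignments: iterable of (ref_name, start, cigar, read_seq), 0-based start
    references: dict mapping ref_name to reference sequence
    """
    counts = {}
    for ref_name, start, cigar, read in alignments:
        ref = references[ref_name]
        c = counts.setdefault(ref_name, {
            "INS": [0] * len(ref), "DEL": [0] * len(ref),
            "SUB": [0] * len(ref), "reads": 0})
        c["reads"] += 1                      # one CIGAR string per aligned read
        r, q = start, 0                      # cursors in reference and read
        for n_str, op in CIGAR_TOKEN.findall(cigar):
            n = int(n_str)
            if op == "S":                    # clipped read bases: skip them
                q += n
            elif op in "M=X":                # aligned block: look for mismatches
                for k in range(n):
                    if read[q + k] != ref[r + k]:
                        c["SUB"][r + k] += 1
                q += n
                r += n
            elif op == "D":                  # reference bases missing from read
                for k in range(n):
                    c["DEL"][r + k] += 1
                r += n
            elif op == "I":                  # extra read bases (assumed position)
                c["INS"][min(r, len(ref) - 1)] += n
                q += n
    return counts


refs = {"oligo1": "ACGTACGT"}
reads = [("oligo1", 0, "2S3M1D2M1I2M", "GGACCACTGT")]
tallies = accumulate_errors(reads, refs)
```

In the toy read, the third aligned base is a substitution (C for G at position 2), the T at position 3 is deleted, and one base is inserted before position 6. Dividing each array by the stored read count would turn the tallies into per-position error rates.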
Single fluorophore experiment: 5'-6AAA-3'
Dual fluorophore experiment: 5'-5AAA-3', 5'-6AAA-3'

The four sequences were converted into a single "master sequence" that defined the fluid to be delivered at each cycle of synthesis. Selective electrochemical deprotection of the growing oligonucleotide over each of the four anodes differentiates this master sequence into the four individual sequences above. In keeping with the standard established above, this master sequence is written 5'-3'.

Fig. S8: Cell resistance as a function of synthesis cycle. The minimum observed cell resistance during a synthesis cycle was found to be correlated with the number of synthesis cycles, correcting for the number of electrodes activated during that cycle, with r = 0.33 nΩ/cycle and p = 5.6e-13.

Maximum length synthesis experiment
Synthesis results: Once the synthesis protocol was complete, DNA was cleaved off the surface of the chip using 32% ammonium hydroxide and deprotected overnight at 65°C. The solution was then concentrated to dryness in a SpeedVac vacuum concentrator, followed by resuspension in 40 μL of H2O. The DNA was amplified using PCR and purified using a Qiagen QIAquick spin column, or gel extracted as needed. The enriched DNA was then amplified a second time with primers containing random 25N overhangs, ligated, and sequenced using an Illumina NextSeq. Sequences were aligned using a modified Bitwise Majority Alignment (BMA) algorithm (29). More details regarding usage of the 25N overhangs, ligation protocol, sequencing preparation, and error analysis are described in Organick et al. [12].