Qubit Matrix Chain Multiplication in O(n)

12 min readOct 28, 2022

Markdown version if you hate Medium’s layout: https://hirasawakinko.github.io/chika_home/toward_science/Qubit%20Matrix%20Chain%20Multiplication%20in%20O(n)/

Benchmark

Basics

Any Matrix can be expressed in sum of Pauli Operator.

Pauli Operator a set of operation in quantum computing. They are denoted as

I, X, Y, Z

In their matrix form

I := [ 1  0 ]
     [ 0  1 ]X := [ 0  1 ]
     [ 1  0 ]Y := [ 0 -i ]
     [ i  0 ]Z := [ 1  0 ]
     [ 0 -1 ]The "i" in Y is the "i" of imaginary number.

Let say there is a matrix A, you are able to carry Pauli decomposition to it.

Pauli decomposition:matrix A = summation(i)[ coefficient i * pauli{I, X, Y, Z} ]for example 4I + 2Y + 7X - 4Z + 3Y ...

Then you can carry out tensor product to make larger matrix.

I tensor X= [ 1  0 ] tensor X
  [ 0  1 ]= [ X  0 ]
  [ 0  X ]= [ 0  1  0  0 ]
  [ 1  0  0  0 ]
  [ 0  0  0  1 ]
  [ 0  0  1  0 ]

And the rules of multiplications are:

I* == + *
*I == + *
XY ==  iZ
YX == -iZ
YZ ==  iX
ZY == -iX
ZX ==  iY
XZ == -iY

You may try multipying those by hand, to confirm that they are valid.

And obviously these symbolic rules are valid, you can avoid matrix multiplication, and merely do the string manipulation.

IX -> X
XY -> Z
YZ -> X
...

For example:

matrix A = 4I + 2Y + 7X - 4Z + 3Ymatrix A * Pauli X
= (4I + 2Y + 7X - 4Z + 3Y) * X
= (4A + 2Z + 7I - 4Y + 3Z)

I omitted all those coefficient changes. Not because you don’t have to, it is just because coefficient is not the main concern in this research.

Current status

There are many Python library can do such symbolic quantum computation, however, not fast enough.

Algorithm they use are naive, causing masive redundent runtime. For example OpenFermion.

Some are not even usnig symbolic computation. Those are just expend string input into a matrix, and then carryout matrix vector multiplication. For example Qiskit, Qulacs. Which makes them absurbly slow in some of the most simple tasks.

For example this.

(*) represent tensor productmatrix A 
= I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*) 
  I (*) I (*) I (*) I (*) I (*) I (*)Now we multiply it by
matrix B 
= X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*) 
  X (*) X (*) X (*) X (*) X (*) X (*)

How long you think this would take? I can do this under a second single-handedly, but them? Probably take more than 1 year to output the answer.

For tensor product, you multiply stuffs bitwise-ly. As I*X -> X, the result is just matrix B itself. Cool?

Naive coding

So, what about OpenFermion’s naive coding? Well, they almost got it but not quite. Say you are multipying two matrix A and B:

matrix A  |  matrix B
= I Y     |  = X Z
+ Y X     |  + Y Y
+ X Y     |  + X Y
+ Z X     |  
+ Y Z     |

Just as what you will do to (x + 3)(x + 2), you do the same thing to these matrix.

first lable them...matrix A  |  matrix B
= I Y     |  = X Z  (b1)
+ Y X     |  + Y Y  (b2)
+ X Y     |  + X Y  (b3)
+ Z X     |  
+ Y Z     |matrix A * matrix B= matrix A * b1 + matrix A * b2 + matrix A * b3 ...

So you would carry the bitwise multiplication like this:

      X Z (b1)    Y Y  (b2)    X Y  (b3)  
    matrix A    matrix A     matrix A     
    = I Y       = I Y        = I Y        
    + Y X    +  + Y X    +   + Y X      
    + X Y       + X Y        + X Y        
    + Z X       + Z X        + Z X        
    + Y Z       + Y Z        + Y Z                     Y Y  (b2)    X Y  (b3)   
                   matrix A     matrix A      
                   = I Y        = I Y         
    = something +  + Y X    +   + Y X   
                   + X Y        + X Y         
                   + Z X        + Z X         
                   + Y Z        + Y Z                     X Y  (b3)  
                   matrix A     
                   = I Y        
    = something +  + Y X    
                   + X Y      
                   + Z X        
                   + Y Z

At each step, you broadcase one row of matrix B to all rows of matrix A.

matrix A
= I Y * X Z (b1)
+ Y X * X Z (b1)
+ X Y * X Z (b1)
+ Z X * X Z (b1)
+ Y Z * X Z (b1)

This approach seems legit, but absolutely we can do better and faster.

What I have done in SymPauli

As you can see in the precious section, we have to call out matrix A multiple times in order to carryout complete process of matrix multiplication. From this, we know that matrix A is basically not going to change anyway.

Also the not-so-obvious one is that, Pauli Operator has its cyclic group property. It means that we can totally predict the outcome.

When you merged like terms i.e. 2y + 3y +7x = 5y + 7x, there is only unique Pauli tensors left in the pool.

Let say you have I X Y Z, still ignoring coefficient.

  (I X Y Z) * X
= (X I Z Y)

(I +X +Y +Z) is the simpliest form of itself, you cannot simplify it anymore. After multiplying something, the outcome is still in its simpliest form. And you should notice that multiplication in here literally means permutation.

  (I X Y Z) * Y
= (Y Z I X)  (I X Y Z) * Z
= (Z Y X I)

Focus back on the matrix A * b1 :

  X Z (b1)
matrix A
= I Y
+ Y X
+ X Y
+ Z X
+ Y Z

Further focus on the left Pauli Operators:

  X (b1)
matrix A
= I
+ Y
+ X
+ Z
+ Y

By the previous rule, we know the answer already.

  (I X Y Z) * X
= (X I Z Y)  X (b1)
matrix A
= I -> X
+ Y -> Z
+ X -> I
+ Z -> Y
+ Y -> Z

Here you see some duplication. i.e. we already know that X*Z -> Y, so why are we still checking it row by row? This is just a waste of time.

Here it comes the relational data structure.

Relational data structure and super parallelism

This is one part of matrix A. And we are going to perform unified permutation operator to them all.

Text Only

matrix A
= I * X
+ Y * X
+ X * X
+ Z * X
+ Y * X

Here I purpose a relational data structure to execute a super parallelism algorithm.

You have a fontend and a backend for this data structure. And you map every value of the fontend to the backend, so that its value refers to what it is at the back.

front and backfontend   backend
   I      0 = I
   Y      1 = X
   X      2 = Y
   Z      3 = Z
   Ymapping front to back
fontend   backend
    0      0 = I
    2      1 = X
    1      2 = Y
    3      3 = Z
    2

When you do multiplication, we do it to the backend, not frontend. After that, you read out values of frontend from the backend.

multiplying X
fontend   backend
    0      0 = I
    2      1 = X     * X
    1      2 = Y
    3      3 = Z
    2multiplied X
fontend   backend
    0      0 = X
    2      1 = I
    1      2 = Z
    3      3 = Y
    2multiplied X
fontend   backend
X < 0       0 = X
Z < 2       1 = I
I < 1       2 = Z
Y < 3       3 = Y
Z < 2

After reading out, you can reset matrix A to its init state solely by reseting the backend.

reset backend
fontend   backend
I < 0       0 = I
Y < 2       1 = X
X < 1       2 = Y
Z < 3       3 = Z
Y < 2

So it is back to original matrix A. So you can do the next multiplication (also because it requires the original one), namely:

  Y Y  (b2)
matrix A
= I Y
+ Y X
+ X Y
+ Z X
+ Y Z

Chain multiplication

Last step in the last section we reset the backend. However it is not required if you are going to do a chain multiplication i.e. A * b * c * d * e * ..., {b c d e} are in small letter and they are one-row thing (not linear combination).

fontend   backend
   I       0 = I
   Y       1 = X   * b * c * d * e * ...
   X       2 = Y
   Z       3 = Z
   Y

By not reseting the backend after every turn of multiplication, you are able to carry out the chain action that use precious state as the starting point of next action. So you can maintain the initial mapping and read the result with that initial mapping.

initial mapping
fontend   backend
    0      0 = I
    2      1 = X  * b * c * d * e * ...
    1      2 = Y
    3      3 = Z
    2after all multiplications
fontend   backend
    0      0 = ?
    2      1 = ?
    1      2 = ?
    3      3 = ?
    2you can read the result by the very same initial mapping
fontend   backend
? < 0      0 = ?
? < 2      1 = ?
? < 1      2 = ?
? < 3      3 = ?
? < 2

Scalablility

This relational data structure and algorithm is parallel native. Even not under parallel programming, still be able to gain a masive speed up.

Due to the fact that all data manipulation is done solely to the backend, and the backend is as small as a constant, i.e. size of backend array = 4 Pauli Operator * N qubits, complexity of O(n). With this size, you can easily pack that into CPU cache and do everything in unprecedented speed.

This is the initial state of a 6 qubits backend, and you have one row of your matrix is (I I I I I I):

qubits  0  1  2  3  4  50 = I   0  0  0  0  0  0
1 = X   1  1  1  1  1  1
2 = Y   2  2  2  2  2  2
3 = Z   3  3  3  3  3  3my row : (I I I I I I) -> (0 0 0 0 0 0)

For example multiplied by (X Y Z Y Z X):

qubits  0  1  2  3  4  50 = I   1  2  3  2  3  1
1 = X   0  3  2  3  2  0
2 = Y   3  0  1  0  1  3
3 = Z   2  1  0  1  0  2

You just have to look at the same position of the table, to find out what it becomes now.

my row : (I I I I I I) -> (1 2 3 2 3 1) -> (X Y Z Y Z X)

Same table applies to all rows. Multiply once, read many. Because Pauli Operator is Cyclic, Just like you rotate a analogue clock (those come with hours minutes pins), the internal relative position of 10 o’clock and 4 o’clock and others are not gonna to change.

Second is that everything is summation, and the ordering of summation does not matter due to addition commutativity. So you can split the whole thing into subprocess and sum them up lately.

Further is that tensor product of Pauli operator is carried out bitwise-ly, so that you can further split it into smaller subtasks.

Also the small size of the backend makes the communicational cost very low.

But comes with great penalty

Read and write performance of this data structure is fatal. Since you have to first map the information of frontend to the backend, and after that you have to do the reverse. This kind of translation cost is super expensive, for matrix that is 30000 rows of Pauli tensor * 40 qubits, it is about 0.1X to 0.01X of the R/W performance.

However, for the chain multiplication situation, R/W is only performed at first and the last, multiplying chain of 30000 rows of Pauli tensor * 40 qubits to the matrix of same size but in a linear combunation, this data structure is giving out 30000X to 40000X performance.

Until now, I have been ignoring the coefficients, mainly focused on Pauli Operators. All comparison stated above is:

conventional with coefficient 
VS 
new without coefficient

So yeah it is just BS benchmark and nothing you can take seriously. But that is because there is no room for improvement when we include coefficients into the calculation. At most 10X speend up at the same scale using Python, And I am already using Numpy for the heavy leafting coefficient work. But not using Numpy is actually not that far away, about 5–6X.

So you can see that no coefficients, 40000X. Yes coefficients, 10X. Damn coefficients sucks.

Conclusion

I indipendently invented a new algorithm using relational data structure and phantomly revolutionized the contentional approach by 40000X under the same scale, but this is just when I ignore all computation work for the coefficients. With coefficients, 10X is the maximun I can get. So, long live the symbolic computation, god damn it numeric computation.

Github

GitHub — hirasawakinko/SymPauli: Symbolic Pauli Operation Multiplication

You can’t perform that action at this time. You signed in with another tab or window. You signed out in another tab or…

github.com

1. Symbolic Quantum Computation is the game changer that turns O(2^n) problem into O(log2 n) problem (https://hirasawakin.medium.com/symbolic-quantum-computation-is-a-game-changer-turn-o-2-n-problem-into-o-log2-n-problem-5654307a5255)

2. Cracking Qubit Operator Representation of Any Matrix (https://hirasawakin.medium.com/cracking-qubit-operator-representation-of-any-matrix-4e0451a4c7d7)

3. Unrespresentativity of Chemistry or Physiscs simulation in the realm of Quantum Computing (https://hirasawakin.medium.com/unrespresentitivity-of-chemistry-or-physiscs-simulation-in-the-realm-of-quantum-computing-51ecd5348d30)

4.Pauli Operator, Commutator, Bloch sphere, Binary Operation, and Cyclic Group
https://hirasawakinko.medium.com/pauli-operator-commutator-bloch-sphere-binary-operation-and-cyclic-group-1343ab18f038